Kafkanator

`gini(x)`

Computes the Gini index from an array of gains sorted in ascending order.

Examples:

>>> gini(np.array([1,1,2,2,3,3,3]))
0.20952380952380953

Parameters:

Name	Type	Description	Default
`x`	`numpy array`	Gains sorted in ascending order. For example, [1,1,2,2,3,3,3] describes a population of 7 people in which the first person's gain is 1, the third person's gain is 2, and so on.	required

Returns:

Name	Type	Description
`float`		The Gini index for this array

Source code in kafkanator/kafkanator.py

def gini(x):
    """Computes the Gini index from an array of gains sorted in ascending order.

    Examples:
        >>> gini(np.array([1,1,2,2,3,3,3]))
        0.20952380952380953

    Args:
        x (numpy array): Gains sorted in ascending order. For example, [1,1,2,2,3,3,3] describes a population of 7 people in which the first person's gain is 1, the third person's gain is 2, and so on.

    Returns:
        float: The Gini index for this array.
    """
    total = 0
    for i, xi in enumerate(x[:-1], 1):
        total += np.sum(np.abs(xi - x[i:]))
    return total / (len(x)**2 * np.mean(x))
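
The loop above sums |x_i - x_j| over every pair i < j, which is half the sum over the full matrix of pairwise differences. A minimal sketch of that equivalence, assuming only NumPy; `gini_pairwise` is a hypothetical name used for illustration and is not part of kafkanator.

```python
import numpy as np

def gini_pairwise(x):
    """Gini index as sum over all (i, j) of |x_i - x_j| / (2 * n^2 * mean(x))."""
    x = np.asarray(x, dtype=float)
    # Full n-by-n matrix of absolute differences; each unordered pair appears twice.
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2 * len(x) ** 2 * x.mean())

value = gini_pairwise([1, 1, 2, 2, 3, 3, 3])
print(value)  # approximately 22 / 105
```

Because only absolute differences are used, this form does not depend on the order of the input. It builds an n-by-n matrix, so it is only practical for small arrays.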

`index_on_dataframe_column(df, column, index_function, **kwargs)`

Computes an inequality index over a pandas DataFrame column.

Parameters:

Name	Type	Description	Default
`df`	`pandas DataFrame`	The dataframe.	required
`column`	`str`	The column to which the inequality index is applied.	required
`index_function`	`callable`	The index to apply: a kafkanator function such as gini(..), robin_hood(..), theil_index_L(..) or theil_index_T(..).	required
`kwargs`	`dict`	Extra keyword arguments forwarded to index_function. For example, with index_function=theil_index_T, the entropy base can be set here, e.g. base_entropy=10.	`{}`

Returns:

Name	Type	Description
`float`	`float`	The value of the chosen inequality index for the given column.

Source code in kafkanator/kafkanator.py

def index_on_dataframe_column(df: pd.DataFrame, column: str, index_function: callable, **kwargs) -> float:
    """Computes an inequality index over a pandas DataFrame column.

    Args:
        df (pandas DataFrame): The dataframe.
        column (str): The column to which the inequality index is applied.
        index_function (callable): The index to apply: a kafkanator function such as gini(..), robin_hood(..), theil_index_L(..) or theil_index_T(..).
        kwargs (dict, optional): Extra keyword arguments forwarded to index_function. For example, with index_function=theil_index_T, the entropy base can be set here, e.g. base_entropy=10.

    Returns:
        float: The value of the chosen inequality index for the given column.
    """
    sorted_df = df.sort_values(by=[column])
    return index_function( np.array(sorted_df[column].values ) , **kwargs )
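
A usage sketch of the sort-then-apply pattern, assuming NumPy and pandas are installed. The two functions are local copies of the code documented on this page so that the example runs without kafkanator; the data frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

def gini(x):
    # Local copy of the documented formula: sum over pairs i < j of |x_i - x_j|.
    x = np.asarray(x, dtype=float)
    total = sum(np.sum(np.abs(xi - x[i:])) for i, xi in enumerate(x[:-1], 1))
    return total / (len(x) ** 2 * np.mean(x))

def index_on_dataframe_column(df, column, index_function, **kwargs):
    # Same two steps as the documented function: sort the column, then apply the index.
    sorted_df = df.sort_values(by=[column])
    return index_function(np.array(sorted_df[column].values), **kwargs)

# Hypothetical data: seven people with unsorted incomes.
df = pd.DataFrame({"name": list("abcdefg"), "income": [3, 1, 2, 3, 1, 3, 2]})
result = index_on_dataframe_column(df, "income", gini)
print(result)  # approximately 22 / 105, matching the gini example
```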

`index_per_cluster(df, group_by_column, income_column, index='gini', **kwargs)`

Groups a data frame by a column and computes an inequality index within each group.

Parameters:

Name	Type	Description	Default
`df`	`pandas DataFrame`	A data frame whose gains are to be grouped by a column.	required
`group_by_column`	`str`	The column to group by.	required
`income_column`	`str`	The column holding the gains/incomes. Currently it must contain integer amounts, not proportions.	required
`index`	`str`	The inequality index to use: 'gini', 'theil-t', 'theil-l' or 'robin-hood'.	`'gini'`
`kwargs`	`dict`	Optional. Extra parameters for theil-t, such as the entropy base.	`{}`

Returns:

Name	Type	Description
`list`		A list of tuples. Each tuple holds a value of group_by_column followed by the chosen inequality index computed within that group.

Source code in kafkanator/kafkanator.py

def index_per_cluster(df, group_by_column, income_column, index='gini', **kwargs):
    """Groups a data frame by a column and computes an inequality index within each group.

    Args:
        df (pandas DataFrame): A data frame whose gains are to be grouped by a column.
        group_by_column (str): The column to group by.
        income_column (str): The column holding the gains/incomes. Currently it must contain integer amounts, not proportions.
        index (str): The inequality index to use: 'gini', 'theil-t', 'theil-l' or 'robin-hood'.
        kwargs (dict): Optional. Extra parameters for theil-t, such as the entropy base.

    Returns:
        list: A list of tuples. Each tuple holds a value of group_by_column followed by the chosen inequality index computed within that group.
    """
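    # Pass the column name directly so that group keys are plain values.
    salary_groups = df.groupby(group_by_column)
    setOfCats = set(df[group_by_column].values)
    indexes = []
    for s in setOfCats:
        # groups[s] holds index labels, so select with .loc rather than .iloc.
        subgroup = df.loc[salary_groups.groups[s], :]
        # gini() does array arithmetic, so pass a NumPy array rather than a list.
        incomes = np.array(sorted(subgroup[income_column].values))
        print('sorted incomes ', s, ' ', incomes)
        if index == 'gini':
            g = gini(incomes)
        elif index == 'theil-t':
            g = theil_index_T(incomes, **kwargs)
        elif index == 'theil-l':
            g = theil_index_L(incomes)
        elif index == 'robin-hood':
            g = robin_hood(incomes)
        else:
            # Fail loudly instead of reusing a stale or undefined value of g.
            raise ValueError('unknown index: ' + str(index))
        indexes.append((s, g))
    return indexes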
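
The same "one index value per group" result can also be obtained by iterating over the `groupby` object directly. A minimal sketch, assuming pandas is installed; the data, the column names and the `robin_hood_local` helper are hypothetical stand-ins so that the example runs without kafkanator.

```python
import pandas as pd

def robin_hood_local(incomes):
    # Share of total income held above the mean by above-mean earners.
    mean = sum(incomes) / len(incomes)
    return sum(v - mean for v in incomes if v > mean) / sum(incomes)

# Hypothetical data: two teams with different income spreads.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "income": [10, 30, 5, 5, 5],
})

# Iterating a groupby yields (key, sub-frame) pairs, with keys in sorted order.
per_group = [
    (key, robin_hood_local(group["income"].tolist()))
    for key, group in df.groupby("team")
]
print(per_group)  # [('a', 0.25), ('b', 0.0)]
```

Unlike the function above, this variant returns the groups in sorted key order rather than in set iteration order.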

`lorentz_curve(population, income, gini_index=False)`

Computes the Lorenz curve coordinates from a population array and an income array.

Parameters:

Name	Type	Description	Default
`population`	`list`	Position i holds the number of people earning income[i].	required
`income`	`list`	Position i holds the income earned by the people counted in population[i].	required
`gini_index`	`boolean`	True to also return the Gini index as the third element of the returned tuple; False otherwise.	`False`

Returns:

Name	Type	Description
`tuple`		A 2-tuple of x and y coordinate arrays, ready to plot with the plotting library of your choice. If gini_index is True, a 3-tuple whose third element is the Gini coefficient.

Source code in kafkanator/kafkanator.py

def lorentz_curve(population, income, gini_index=False):
    """Computes the Lorenz curve coordinates from a population array and an income array.

    For example, lorentz_curve([50,20,30,10], [100,300,200,30]) means that 50 people earn 100, 20 people earn 300, and so on.

    Args:
        population (list): Position i holds the number of people earning income[i].
        income (list): Position i holds the income earned by the people counted in population[i].
        gini_index (boolean): True to also return the Gini index as the third element of the returned tuple; False otherwise.

    Returns:
        tuple: A 2-tuple of x and y coordinate arrays, ready to plot with the plotting library of your choice. If gini_index is True, a 3-tuple whose third element is the Gini coefficient.
    """
    assert( len(population) == len(income))
    zippedSortedArray = sorted( list( zip( population , income) ) , key= lambda x: x[1] )
    print ( ' sorted array ', zippedSortedArray)
    perc_population = np.array([p/sum(population) for (p,g) in zippedSortedArray])
    perc_income = np.array([g/sum(income) for (p,g) in zippedSortedArray])
    cum_perc_pop = np.concatenate(([0], perc_population.cumsum()), axis=None)
    cum_perc_inc = np.concatenate(([0],perc_income.cumsum()), axis=None)
    if gini_index:
        arrayGini = []
        for (p,g) in zippedSortedArray:
            arrayGini = np.concatenate( ( arrayGini , np.repeat(g,p)  ) )
        print ( ' gini input ', arrayGini )
        g_index = gini(np.array(arrayGini))
        return (cum_perc_pop,cum_perc_inc,g_index)
    else:
        return (cum_perc_pop,cum_perc_inc)
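
A Lorenz curve plots the cumulative share of population against the cumulative share of income, with groups ordered from poorest to richest. The sketch below is an illustration under one stated assumption: that income[i] is the income of each person in group i, so the group's total is population[i] * income[i]. That weighting differs from the source above, which divides each income[i] by sum(income). `lorenz_points` is a hypothetical helper, not part of kafkanator.

```python
import numpy as np

def lorenz_points(population, income):
    """Cumulative population and income shares, poorest group first."""
    pop = np.asarray(population, dtype=float)
    inc = np.asarray(income, dtype=float)
    order = np.argsort(inc)              # sort groups by per-person income
    pop, inc = pop[order], inc[order]
    group_total = pop * inc              # assumption: income[i] is per person
    x = np.concatenate(([0.0], np.cumsum(pop) / pop.sum()))
    y = np.concatenate(([0.0], np.cumsum(group_total) / group_total.sum()))
    return x, y

x, y = lorenz_points([50, 20, 30, 10], [100, 300, 200, 30])
print(x)
print(y)
```

Both arrays start at 0 and end at 1, and the curve never rises above the diagonal y = x.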

`robin_hood(income_array)`

Computes the Robin Hood index: the share of total income that would have to be redistributed for everyone in the population to have the same income.

Parameters:

Name	Type	Description	Default
`income_array`	`list`	The gains of each person. For example, [5,3,5,6,9] means that the first person gains 5, the next one 3, and so on. The total gain is sum(income_array) and the population size is len(income_array).	required

Returns:

Name	Type	Description
`float`		A number between 0 and 1: the fraction of total income that must be redistributed. A value close to 1 means wealth is concentrated in few hands; a value close to 0 means the distribution is close to egalitarian.

Source code in kafkanator/kafkanator.py

def robin_hood(income_array):
    """Computes the Robin Hood index: the share of total income that would
    have to be redistributed for everyone in the population to have the same income.

    Args:
        income_array (list): The gains of each person. For example, [5,3,5,6,9] means that the first person gains 5, the next one 3, and so on. The total gain is sum(income_array) and the population size is len(income_array).

    Returns:
        float: A number between 0 and 1: the fraction of total income that must be redistributed. A value close to 1 means wealth is concentrated in few hands; a value close to 0 means the distribution is close to egalitarian.
    """
    egal_income = sum(income_array) / len(income_array)
    deltas = []
    for inc in income_array:
        if egal_income - inc < 0 :
            deltas.append(abs(egal_income - inc))
    rh_index = sum(deltas) / sum(income_array)
    return rh_index
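
The source above sums the surplus of everyone who earns more than the mean. The same quantity equals half the total absolute deviation from the mean, because surpluses and shortfalls cancel out. A minimal pure-Python sketch of both forms; the function names are hypothetical and not part of kafkanator.

```python
def robin_hood_surplus(income_array):
    """Share of total income held above the mean by above-mean earners."""
    mean = sum(income_array) / len(income_array)
    surplus = sum(inc - mean for inc in income_array if inc > mean)
    return surplus / sum(income_array)

def robin_hood_half_deviation(income_array):
    """Same quantity, as half the total absolute deviation from the mean."""
    mean = sum(income_array) / len(income_array)
    return 0.5 * sum(abs(inc - mean) for inc in income_array) / sum(income_array)

incomes = [5, 3, 5, 6, 9]  # mean is 5.6; only 6 and 9 lie above it
a = robin_hood_surplus(incomes)
b = robin_hood_half_deviation(incomes)
print(a, b)  # both approximately 3.8 / 28
```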

`theil_index_L(income_array)`

Computes the Theil L index.

Parameters:

Name	Type	Description	Default
`income_array`	`list`	Array of incomes in any order, e.g. [100,300,1000,500].	required

Returns:

Name	Type	Description
`float`		The Theil L index.

Source code in kafkanator/kafkanator.py

def theil_index_L(income_array):
    """Computes the Theil L index.

    Args:
        income_array (list): Array of incomes in any order, e.g. [100,300,1000,500].

    Returns:
        float: The Theil L index.
    """
    x_mean = sum(income_array)/len(income_array)
    divided_mean = [ np.log(x_mean / i) for i in income_array ]
    summin = sum(divided_mean)
    return summin/len(income_array)
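
The Theil L index computed above is the mean of log(mean / x_i) over all incomes, also known as the mean log deviation. A minimal pure-Python sketch that checks two of its properties: it is zero for equal incomes and it ignores the input order. `theil_l_sketch` is a hypothetical name, not part of kafkanator.

```python
import math

def theil_l_sketch(income_array):
    """Mean of log(mean / x_i); zero when all incomes are equal."""
    mean = sum(income_array) / len(income_array)
    return sum(math.log(mean / x) for x in income_array) / len(income_array)

equal = theil_l_sketch([200, 200, 200])
unequal = theil_l_sketch([100, 300, 1000, 500])
shuffled = theil_l_sketch([1000, 100, 500, 300])
print(equal, unequal, shuffled)
```

Every income must be strictly positive, since a zero income makes mean / x undefined.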

`theil_index_T(income_array, array_type='props', base_entropy=np.e)`

Computes the Theil T index.

Parameters:

Name	Type	Description	Default
`income_array`	`list`	Array of incomes.	required
`array_type`	`str`	If 'props', every number in income_array is between 0 and 1 and they sum to 1. If 'gains', income_array holds integers representing gains.	`'props'`
`base_entropy`	`float`	The logarithm base used to compute the entropy; entropy can be defined with different bases. Defaults to the constant e.	`e`

Returns:

Name	Type	Description
`float`		The Theil T index.

Source code in kafkanator/kafkanator.py

def theil_index_T(income_array, array_type='props', base_entropy=np.e):
    """Computes the Theil T index.

    Args:
        income_array (list): Array of incomes.
        array_type (str): If 'props', every number in income_array is between 0 and 1 and they sum to 1. If 'gains', income_array holds integers representing gains.
        base_entropy (float): The logarithm base used to compute the entropy; entropy can be defined with different bases. Defaults to the constant e.

    Returns:
        float: The Theil T index.
    """
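    # Express log(N) in the same base as the entropy term so both share units.
    max_entropy = np.log(len(income_array)) / np.log(base_entropy)
    if array_type == 'props':
        return max_entropy - entropy(income_array, base=base_entropy)
    elif array_type == 'gains':
        total = sum(income_array)
        props_array = np.array([x / total for x in income_array])
        return max_entropy - entropy(props_array, base=base_entropy)
    # Fail loudly instead of silently returning None.
    raise ValueError('unknown array_type: ' + str(array_type))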
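
The Theil T index is the maximum possible entropy, log(N), minus the Shannon entropy of the income shares, with both terms taken in the same logarithm base. A minimal sketch, assuming only NumPy; `theil_t_sketch` is a hypothetical helper, not the library function, and it normalises its input itself instead of taking an array_type argument.

```python
import numpy as np

def theil_t_sketch(income_array, base=np.e):
    """log_base(N) minus the entropy (in the same base) of the income shares."""
    x = np.asarray(income_array, dtype=float)
    shares = x / x.sum()
    nonzero = shares[shares > 0]          # 0 * log(0) is treated as 0
    shannon = -np.sum(nonzero * np.log(nonzero)) / np.log(base)
    return np.log(len(x)) / np.log(base) - shannon

equal = theil_t_sketch([10, 10, 10, 10])          # perfect equality
natural = theil_t_sketch([1, 2, 3, 4])            # base e
decimal = theil_t_sketch([1, 2, 3, 4], base=10)   # base 10
print(equal, natural, decimal)
```

Changing the base only rescales the result: the base-10 value equals the natural-log value divided by ln(10).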