Kafkanator

gini(x)

Computes the Gini index from a gains array sorted in ascending order.

Examples:

>>> gini(np.array([1,1,2,2,3,3,3]))
0.20952380952380953

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | numpy array | Gains sorted in ascending order. For example, [1,1,2,2,3,3,3] describes a population of 7 people: the first person's gain is 1, the third person's is 2, and so on. | required |

Returns:

| Type | Description |
| --- | --- |
| float | The Gini index for this array. |

Source code in kafkanator/kafkanator.py
def gini(x):
    """Computes the Gini index from a gains array sorted in ascending order.

    Examples:
        >>> gini(np.array([1,1,2,2,3,3,3]))
        0.20952380952380953

    Args:
        x (numpy array): Gains sorted in ascending order. For example, [1,1,2,2,3,3,3] describes a population of 7 people: the first person's gain is 1, the third person's is 2, and so on.

    Returns:
        float: The Gini index for this array.
    """
    # Accept plain lists as well as numpy arrays.
    x = np.asarray(x, dtype=float)
    total = 0
    # Sum |x_i - x_j| over every unordered pair (i < j).
    for i, xi in enumerate(x[:-1], 1):
        total += np.sum(np.abs(xi - x[i:]))
    # Unordered pairs over n^2 * mean equals ordered pairs over 2 * n^2 * mean.
    return total / (len(x)**2 * np.mean(x))
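
The pairwise loop above costs O(n²) comparisons. As a sketch only, and not part of kafkanator, the same value can be obtained from the ranks of the sorted gains. The name `gini_by_rank` is hypothetical.

```python
import numpy as np

def gini_by_rank(x):
    """Gini index from the rank-weighted sum of the sorted gains (hypothetical helper)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    # For sorted x, sum_i (2i - n - 1) * x_i equals the sum of |x_i - x_j| over pairs i < j.
    pairwise_total = np.sum((2 * ranks - n - 1) * x)
    # n * sum(x) is the same denominator as n^2 * mean(x).
    return pairwise_total / (n * np.sum(x))

print(gini_by_rank([1, 1, 2, 2, 3, 3, 3]))
```

For the docstring example the rank-weighted sum is 22 and the denominator is 7 * 15 = 105, which agrees with the documented result of about 0.2095.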

index_on_dataframe_column(df, column, index_function, **kwargs)

Computes an inequality index over a column of a pandas DataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pandas DataFrame | The data frame. | required |
| column | str | The column to compute the inequality index on. | required |
| index_function | callable | The index to apply: a kafkanator function such as gini, robin_hood, theil_index_L or theil_index_T. | required |
| kwargs | dict | Extra keyword arguments forwarded to index_function. For example, with index_function=theil_index_T you can set the entropy base (e or 10) through base_entropy. | {} |

Returns:

| Type | Description |
| --- | --- |
| float | The result of the chosen inequality index applied to the given column. |

Source code in kafkanator/kafkanator.py
def index_on_dataframe_column(df: pd.DataFrame, column: str, index_function: callable, **kwargs) -> float:
    """Computes an inequality index over a column of a pandas DataFrame.

    Args:
        df (pandas DataFrame): The data frame.
        column (str): The column to compute the inequality index on.
        index_function (callable): The index to apply: a kafkanator function such as gini, robin_hood, theil_index_L or theil_index_T.
        kwargs (dict, optional): Extra keyword arguments forwarded to index_function. For example, with index_function=theil_index_T you can set the entropy base (e or 10) through base_entropy.

    Returns:
        float: The result of the chosen inequality index applied to the given column.
    """
    # Sort by the column first, because indexes such as gini expect ascending gains.
    sorted_df = df.sort_values(by=[column])
    return index_function(np.array(sorted_df[column].values), **kwargs)
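
To illustrate the sort-then-forward pattern described above, here is a minimal sketch. `apply_index` mirrors the function for the sake of a self-contained example, and `range_ratio` with its `floor` argument is a hypothetical stand-in index, not a kafkanator function.

```python
import pandas as pd

def range_ratio(values, floor=1.0):
    """Hypothetical index: largest value over the smallest, with a lower bound on the latter."""
    # Relies on the values arriving in ascending order.
    return values[-1] / max(values[0], floor)

def apply_index(df, column, index_function, **kwargs):
    """Sort the column, extract it as an array, and forward any keyword arguments."""
    ordered = df.sort_values(by=column)[column].to_numpy()
    return index_function(ordered, **kwargs)

df = pd.DataFrame({"city": ["a", "b", "c", "d"], "income": [40, 10, 20, 30]})
result = apply_index(df, "income", range_ratio, floor=5.0)
print(result)
```

The keyword argument `floor` never appears in `apply_index` itself; it travels through `**kwargs`, which is how an option such as `base_entropy` would reach theil_index_T.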

index_per_cluster(df, group_by_column, income_column, index='gini', **kwargs)

Groups a data frame into clusters and applies an inequality index to each of them.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pandas DataFrame | A data frame holding the gains data, to be grouped by one of its columns. | required |
| group_by_column | str | The column to group by. | required |
| income_column | str | The column holding the gains/incomes. For the moment the column must contain integer values, not proportions. | required |
| index | str | The inequality index to use: 'gini', 'theil-t', 'theil-l' or 'robin-hood'. | 'gini' |
| kwargs | dict | Optional. Used with 'theil-t' to pass auxiliary parameters such as the entropy base. | {} |

Returns:

| Type | Description |
| --- | --- |
| list | A list of tuples; each tuple holds a value of group_by_column followed by the inequality index computed within that cluster. |

Source code in kafkanator/kafkanator.py
def index_per_cluster(df, group_by_column, income_column, index='gini', **kwargs):
    """Groups a data frame into clusters and applies an inequality index to each of them.

    Args:
        df (pandas DataFrame): A data frame holding the gains data, to be grouped by one of its columns.
        group_by_column (str): The column to group by.
        income_column (str): The column holding the gains/incomes. For the moment the column must contain integer values, not proportions.
        index (str): The inequality index to use: 'gini', 'theil-t', 'theil-l' or 'robin-hood'.
        kwargs (dict): Optional. Used with 'theil-t' to pass auxiliary parameters such as the entropy base.

    Returns:
        list: A list of tuples; each tuple holds a value of group_by_column followed by the inequality index computed within that cluster.
    """
    indexes = []
    # Iterating the groupby yields each subgroup directly, which avoids
    # mixing index labels with the positional lookup done by iloc.
    for s, subgroup in df.groupby(group_by_column):
        incomes = np.sort(subgroup[income_column].to_numpy())
        if index == 'gini':
            g = gini(incomes)
        elif index == 'theil-t':
            g = theil_index_T(incomes, **kwargs)
        elif index == 'theil-l':
            g = theil_index_L(incomes)
        elif index == 'robin-hood':
            g = robin_hood(incomes)
        else:
            # Fail loudly instead of reusing a stale or undefined value.
            raise ValueError(f"unknown index: {index!r}")
        indexes.append((s, g))
    return indexes
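
The chain of string comparisons above could also be written as a lookup table from index names to functions. The following is a sketch of that design under stated assumptions: `mean_gap`, `spread`, `INDEX_FUNCTIONS` and `index_per_group` are hypothetical names, the two index functions are stand-ins rather than kafkanator's, and plain (group, income) pairs replace the data frame to keep the example dependency-free.

```python
def mean_gap(incomes):
    """Hypothetical stand-in for an index function that takes no options."""
    mean = sum(incomes) / len(incomes)
    return sum(abs(v - mean) for v in incomes) / len(incomes)

def spread(incomes, scale=1.0):
    """Hypothetical stand-in for an index function that accepts keyword arguments."""
    return (max(incomes) - min(incomes)) / scale

# Names mapped to callables; adding an index means adding one entry here.
INDEX_FUNCTIONS = {"mean-gap": mean_gap, "spread": spread}

def index_per_group(rows, index="mean-gap", **kwargs):
    """rows is a list of (group, income) pairs; returns [(group, index value), ...]."""
    try:
        index_function = INDEX_FUNCTIONS[index]
    except KeyError:
        raise ValueError(f"unknown index: {index!r}") from None
    groups = {}
    for group, income in rows:
        groups.setdefault(group, []).append(income)
    # Sort the groups by key and each group's incomes in ascending order.
    return [(g, index_function(sorted(v), **kwargs)) for g, v in sorted(groups.items())]

rows = [("north", 10), ("south", 30), ("north", 20), ("south", 50)]
result = index_per_group(rows, index="spread", scale=10.0)
print(result)
```

With a table, an unknown name fails in one place with a clear error, and the per-group loop no longer needs to change when an index is added.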

lorentz_curve(population, income, gini_index=False)

This function computes the Lorenz curve coordinates from a population array and an income array.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| population | list | Position i holds the number of people earning income[i]. | required |
| income | list | Position i holds the income earned by each of the people counted in population[i]. | required |
| gini_index | boolean | True to also return the Gini coefficient as the third element of the returned tuple; False otherwise. | False |

Returns:

| Type | Description |
| --- | --- |
| tuple | A 2-tuple with the arrays of x and y coordinates, ready to plot with the visualization library of your choice. If gini_index is True, a 3-tuple whose third element is the Gini coefficient. |

Source code in kafkanator/kafkanator.py
def lorentz_curve(population, income, gini_index=False):
    """Computes the Lorenz curve coordinates from a population array and an income array.

    Examples:
        lorentz_curve([50,20,30,10], [100,300,200,30]) means that 50 people earn 100 each, 20 people earn 300 each, and so on.

    Args:
        population (list): Position i holds the number of people earning income[i].
        income (list): Position i holds the income earned by each of the people counted in population[i].
        gini_index (boolean): True to also return the Gini coefficient as the third element of the returned tuple; False otherwise.

    Returns:
        tuple: A 2-tuple with the arrays of x and y coordinates, ready to plot with the visualization library of your choice. If gini_index is True, a 3-tuple whose third element is the Gini coefficient.
    """
    assert len(population) == len(income)
    # Sort the (population, income) pairs by income, poorest group first.
    zippedSortedArray = sorted(zip(population, income), key=lambda x: x[1])
    total_population = sum(population)
    # A group of p people earning g each holds p * g of the total income.
    total_income = sum(p * g for (p, g) in zippedSortedArray)
    perc_population = np.array([p / total_population for (p, g) in zippedSortedArray])
    perc_income = np.array([p * g / total_income for (p, g) in zippedSortedArray])
    # Prepend 0 so that the curve starts at the origin.
    cum_perc_pop = np.concatenate(([0], perc_population.cumsum()), axis=None)
    cum_perc_inc = np.concatenate(([0], perc_income.cumsum()), axis=None)
    if gini_index:
        # Expand the grouped data to one entry per person before calling gini.
        arrayGini = []
        for (p, g) in zippedSortedArray:
            arrayGini = np.concatenate((arrayGini, np.repeat(g, p)))
        g_index = gini(np.array(arrayGini))
        return (cum_perc_pop, cum_perc_inc, g_index)
    else:
        return (cum_perc_pop, cum_perc_inc)
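
As a worked sketch of the coordinates, the docstring example can be traced by hand with the standard library only. It assumes that income[i] is what each person in group i earns, so that group i holds population[i] * income[i] of the total; the variable names are illustrative.

```python
from itertools import accumulate

population = [50, 20, 30, 10]
income = [100, 300, 200, 30]  # assumed: income earned by each person in the group

# Poorest group first.
groups = sorted(zip(population, income), key=lambda pair: pair[1])
total_people = sum(p for p, _ in groups)
total_income = sum(p * g for p, g in groups)

# Cumulative shares, with a leading 0 so the curve starts at the origin.
xs = [0.0] + [c / total_people for c in accumulate(p for p, _ in groups)]
ys = [0.0] + [c / total_income for c in accumulate(p * g for p, g in groups)]

# Gini coefficient as 1 minus twice the area under the curve (trapezoid rule).
area = sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs)))
gini_from_curve = 1 - 2 * area

for x, y in zip(xs, ys):
    print(f"{x:.3f} {y:.3f}")
```

Both lists end at 1, and every point lies on or below the diagonal y = x, which is the line of perfect equality.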

robin_hood(income_array)

Computes the Robin Hood index: the share of total income that would have to be redistributed across the population to make every income equal.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| income_array | list | The gains of the population. For example, [5,3,5,6,9] means that one person gains 5, the next one 3, and so on. The total gain is sum(income_array) and the population size is len(income_array). | required |

Returns:

| Type | Description |
| --- | --- |
| float | A number between 0 and 1: the share of the income that must be redistributed. A value close to 1 means wealth is concentrated in a few hands; a value close to 0 means the distribution is close to egalitarian. |

Source code in kafkanator/kafkanator.py
def robin_hood(income_array):
    """Computes the Robin Hood index: the share of total income that would
    have to be redistributed across the population to make every income equal.

    Args:
        income_array (list): The gains of the population. For example, [5,3,5,6,9] means that one person gains 5, the next one 3, and so on. The total gain is sum(income_array) and the population size is len(income_array).

    Returns:
        float: A number between 0 and 1: the share of the income that must be redistributed. A value close to 1 means wealth is concentrated in a few hands; a value close to 0 means the distribution is close to egalitarian.
    """
    # Income each person would hold under perfect equality.
    egal_income = sum(income_array) / len(income_array)
    # Collect the excess of every income above the egalitarian level.
    deltas = []
    for inc in income_array:
        if egal_income - inc < 0:
            deltas.append(abs(egal_income - inc))
    rh_index = sum(deltas) / sum(income_array)
    return rh_index
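
The function above sums only the incomes that exceed the mean. Because the excesses above the mean and the shortfalls below it always cancel out, this equals half of the total absolute deviation from the mean. The sketch below checks this on the docstring example; both function names are hypothetical.

```python
def robin_hood_above_mean(incomes):
    """Excess of the incomes above the mean, as a share of total income."""
    mean = sum(incomes) / len(incomes)
    return sum(v - mean for v in incomes if v > mean) / sum(incomes)

def robin_hood_half_deviation(incomes):
    """Half of the total absolute deviation from the mean, as a share of total income."""
    mean = sum(incomes) / len(incomes)
    return 0.5 * sum(abs(v - mean) for v in incomes) / sum(incomes)

incomes = [5, 3, 5, 6, 9]
above = robin_hood_above_mean(incomes)
half = robin_hood_half_deviation(incomes)
print(above, half)
```

For [5,3,5,6,9] the mean is 5.6, the excesses are 0.4 and 3.4, and the index is 3.8 / 28, or roughly 13.6% of the total income.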

theil_index_L(income_array)

Computes the Theil L index.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| income_array | list | Array of incomes; the order does not matter. For example, [100,300,1000,500]. | required |

Returns:

| Type | Description |
| --- | --- |
| float | The Theil L index. |

Source code in kafkanator/kafkanator.py
def theil_index_L(income_array):
    """Computes the Theil L index.

    Args:
        income_array (list): Array of incomes; the order does not matter. For example, [100,300,1000,500].

    Returns:
        float: The Theil L index.
    """
    x_mean = sum(income_array)/len(income_array)
    # Log of the mean over each income; every income must be strictly positive.
    divided_mean = [np.log(x_mean / i) for i in income_array]
    summin = sum(divided_mean)
    return summin/len(income_array)
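
Two properties follow directly from the formula above: equal incomes give an index of 0, and multiplying every income by the same factor leaves the index unchanged. A minimal standard-library sketch, where `theil_l` is a hypothetical re-statement of the same formula:

```python
import math

def theil_l(incomes):
    """Mean of ln(mean / x) over all incomes, which must be strictly positive."""
    mean = sum(incomes) / len(incomes)
    return sum(math.log(mean / v) for v in incomes) / len(incomes)

incomes = [100, 300, 1000, 500]
equal = theil_l([250, 250, 250, 250])        # everyone earns the mean
original = theil_l(incomes)
scaled = theil_l([v * 3 for v in incomes])   # same distribution, different currency unit
print(equal, original, scaled)
```

Scale invariance is why the index can compare populations whose incomes are expressed in different units.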

theil_index_T(income_array, array_type='props', base_entropy=np.e)

Computes the Theil T index.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| income_array | list | Array of incomes. | required |
| array_type | str | If 'props', every number in income_array is between 0 and 1 and they all sum to 1. If 'gains', income_array holds integers representing gains. | 'props' |
| base_entropy | float | The logarithm base used to compute the entropy. Entropy is a family of functions, one per base; defaults to e. | e |

Returns:

| Type | Description |
| --- | --- |
| float | The Theil T index. |

Source code in kafkanator/kafkanator.py
def theil_index_T(income_array, array_type='props', base_entropy=np.e):
    """Computes the Theil T index.

    Args:
        income_array (list): Array of incomes.
        array_type (str): If 'props', every number in income_array is between 0 and 1 and they all sum to 1. If 'gains', income_array holds integers representing gains.
        base_entropy (float): The logarithm base used to compute the entropy. Entropy is a family of functions, one per base; defaults to e.

    Returns:
        float: The Theil T index.
    """
    # entropy is assumed to be scipy.stats.entropy, imported at module level.
    # Maximum entropy log(N), expressed in the same base as the entropy term.
    max_entropy = np.log(len(income_array)) / np.log(base_entropy)
    if array_type == 'props':
        return max_entropy - entropy(income_array, base=base_entropy)
    elif array_type == 'gains':
        # Convert gains to shares of the total before computing the entropy.
        props_array = np.array([x/sum(income_array) for x in income_array])
        return max_entropy - entropy(props_array, base=base_entropy)
    raise ValueError(f"unknown array_type: {array_type!r}")
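
The entropy form used above, log(N) minus the entropy of the income shares, is equivalent to the direct definition (1/N) * sum((x / mean) * ln(x / mean)). The sketch below checks this with the standard library only, and assumes that the log(N) term is taken in the same base as the entropy; both function names are hypothetical.

```python
import math

def theil_t_entropy(incomes, base=math.e):
    """log(N) minus the Shannon entropy of the income shares, both in the given base."""
    total = sum(incomes)
    shares = [v / total for v in incomes]
    shannon = -sum(p * math.log(p, base) for p in shares if p > 0)
    return math.log(len(incomes), base) - shannon

def theil_t_direct(incomes):
    """(1/N) * sum((x / mean) * ln(x / mean)), in natural-log units."""
    mean = sum(incomes) / len(incomes)
    return sum((v / mean) * math.log(v / mean) for v in incomes if v > 0) / len(incomes)

incomes = [100, 300, 1000, 500]
nats = theil_t_entropy(incomes)
direct = theil_t_direct(incomes)
bits = theil_t_entropy(incomes, base=2)
print(nats, direct, bits)
```

Changing the base only rescales the result: the base-2 value is the natural-log value divided by ln(2).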