What does the Gini index mean in a biochemical context?


What is the meaning of the Gini index, as specified in this link, which describes the Gini index of beta-glucopyranose bound to hexokinase?

Is it true that a very low Gini index means the compound doesn't interact much, and that a high Gini index means it will interact?

The paper by Graczyk (2007) is probably relevant for you. It says that the Gini index is a measure of the selectivity of kinase inhibitors, with values close to zero indicating no selectivity and values close to one indicating high selectivity. It was created in direct parallel to the Gini index in economics, which is used to describe economic inequality. In its basic form, the Gini coefficient is a measure of statistical dispersion. Also see Anastassiadis et al. (2011) for a more recent example where the index is used.

A novel application of the Gini coefficient for expressing selectivity of kinase inhibitors against a panel of kinases is proposed. This has been illustrated using single-point inhibition data for 40 commercially available kinase inhibitors screened against 85 kinases. Nonselective inhibitors are characterized by Gini values close to zero (Staurosporine, Gini 0.150). Highly selective compounds exhibit Gini values close to 1 (PD184352 Gini 0.905). The relative selectivity of inhibitors does not depend on the ATP concentration.
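To make the idea of a selectivity Gini concrete, here is a minimal sketch that computes a generic Gini coefficient over a compound's inhibition values across a kinase panel. The two inhibition profiles are invented for illustration, and the rank-based formula is the standard economics one; it is an assumption that this matches the exact procedure in Graczyk (2007).

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values.

    0 means the total is spread evenly; values near 1 mean it is
    concentrated in a few entries.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum; equivalent to 1 - 2 * (area under the Lorenz curve).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical % inhibition of 10 kinases by two made-up compounds.
promiscuous = [90, 85, 88, 92, 80, 95, 87, 91, 89, 93]  # hits everything
selective = [98, 2, 1, 0, 3, 1, 0, 2, 1, 0]             # hits one kinase

gini_promiscuous = gini_coefficient(promiscuous)
gini_selective = gini_coefficient(selective)
print(round(gini_promiscuous, 3), round(gini_selective, 3))
```

With a finite panel of n kinases the largest attainable value is 1 - 1/n, so a perfectly selective compound on a 10-kinase panel would score 0.9, not 1.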


Graczyk. 2007. Gini coefficient: a new way to express selectivity of kinase inhibitors against a family of kinases. J Med Chem. 50(23)

Anastassiadis et al. 2011. Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nature Biotechnology 29: 1039-1045


The Index of Diversity which AS/A2 level students in the UK need to understand can be found here.

Before looking at Simpson's Diversity Index in more detail, it is important to understand the basic concepts outlined below.

Biological Diversity - the great variety of life

Biological diversity can be quantified in many different ways. The two main factors taken into account when measuring diversity are richness and evenness. Richness is a measure of the number of different kinds of organisms present in a particular area. For example, species richness is the number of different species present. However, diversity depends not only on richness, but also on evenness. Evenness compares the similarity of the population size of each of the species present.

1. Richness

The number of species per sample is a measure of richness. The more species present in a sample, the 'richer' the sample.

Species richness as a measure on its own takes no account of the number of individuals of each species present. It gives as much weight to those species which have very few individuals as to those which have many individuals. Thus, one daisy has as much influence on the richness of an area as 1000 buttercups.

2. Evenness

Evenness is a measure of the relative abundance of the different species making up the richness of an area.

To give an example, we might have sampled two different fields for wildflowers. The sample from the first field consists of 300 daisies, 335 dandelions and 365 buttercups. The sample from the second field comprises 20 daisies, 49 dandelions and 931 buttercups (see the table below). Both samples have the same richness (3 species) and the same total number of individuals (1000). However, the first sample has more evenness than the second. This is because the total number of individuals in the sample is quite evenly distributed between the three species. In the second sample, most of the individuals are buttercups, with only a few daisies and dandelions present. Sample 2 is therefore considered to be less diverse than sample 1.

                  Numbers of individuals
Flower Species    Sample 1    Sample 2
Daisy             300         20
Dandelion         335         49
Buttercup         365         931
Total             1000        1000

A community dominated by one or two species is considered to be less diverse than one in which several different species have a similar abundance.

As species richness and evenness increase, so diversity increases. Simpson's Diversity Index is a measure of diversity which takes into account both richness and evenness.

The term 'Simpson's Diversity Index' can actually refer to any one of 3 closely related indices.

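Simpson's Index (D) measures the probability that two individuals randomly selected from a sample will belong to the same species (or some category other than species). There are two versions of the formula for calculating D: D = Σ (n/N)² and D = Σ n(n-1) / (N(N-1)), where n is the number of individuals of a particular species and N is the total number of individuals of all species. Either is acceptable, but be consistent.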

The value of D ranges between 0 and 1

With this index, 0 represents infinite diversity and 1, no diversity. That is, the bigger the value of D, the lower the diversity. This is neither intuitive nor logical, so to get over this problem, D is often subtracted from 1 to give:

Simpson's Index of Diversity 1 - D

The value of this index also ranges between 0 and 1, but now, the greater the value, the greater the sample diversity. This makes more sense. In this case, the index represents the probability that two individuals randomly selected from a sample will belong to different species.

Another way of overcoming the problem of the counter-intuitive nature of Simpson's Index is to take the reciprocal of the Index:

Simpson's Reciprocal Index 1 / D

The value of this index starts with 1 as the lowest possible figure. This figure would represent a community containing only one species. The higher the value, the greater the diversity. The maximum value is the number of species (or other category being used) in the sample. For example if there are five species in the sample, then the maximum value is 5.

The name 'Simpson's Diversity Index' is often very loosely applied and all three related indices described above (Simpson's Index, Simpson's Index of Diversity and Simpson's Reciprocal Index) have been quoted under this blanket term, depending on author. It is therefore important to ascertain which index has actually been used in any comparative studies of diversity.
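As an illustration of the three related indices, the sketch below applies them to the two wildflower samples from the evenness example. It assumes the finite-sample form D = Σ n(n-1) / (N(N-1)); the function and variable names are made up for this example.

```python
def simpson_indices(counts):
    """Return (D, 1 - D, 1 / D) for a list of per-species counts.

    Uses D = sum n(n-1) / (N(N-1)). If every species has a single
    individual, D is 0 and the reciprocal is undefined.
    """
    total = sum(counts)
    d = sum(n * (n - 1) for n in counts) / (total * (total - 1))
    return d, 1 - d, 1 / d


sample_1 = [300, 335, 365]  # daisies, dandelions, buttercups (even)
sample_2 = [20, 49, 931]    # same species, dominated by buttercups

d1, diversity_1, reciprocal_1 = simpson_indices(sample_1)
d2, diversity_2, reciprocal_2 = simpson_indices(sample_2)

print(f"Sample 1: D={d1:.3f}, 1-D={diversity_1:.3f}, 1/D={reciprocal_1:.2f}")
print(f"Sample 2: D={d2:.3f}, 1-D={diversity_2:.3f}, 1/D={reciprocal_2:.2f}")
```

Both samples have the same richness, yet the more even sample scores higher on 1 - D and 1/D and lower on D, which is why the index being reported always needs to be stated.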

As an example, let us work out the value of D for a single quadrat sample of ground vegetation in a woodland. Of course, sampling only one quadrat would not give you a reliable estimate of the diversity of the ground flora in the wood. Several samples would have to be taken and the data pooled to give a better estimate of overall diversity. How many samples?

Species              Number (n)    n(n-1)
Woodrush             2             2
Holly (seedlings)    8             56
Bramble              1             0
Yorkshire Fog        1             0
Sedge                3             6
Total (N)            15            64

Putting the figures into the formula for Simpson's Index:

D = Σ n(n-1) / (N(N-1)) = 64 / (15 × 14) = 0.3 (Simpson's Index)

Simpson's Index of Diversity 1 - D = 0.7

Simpson's Reciprocal Index 1 / D = 3.3

These 3 different values all represent the same biodiversity. It is therefore important to ascertain which index has actually been used in any comparative studies of diversity. A value of 0.7 for Simpson's Index is not the same as a value of 0.7 for Simpson's Index of Diversity.

Simpson's Index gives more weight to the more abundant species in a sample. The addition of rare species to a sample causes only small changes in the value of D.
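The claim about rare species can be checked numerically. This sketch adds a single individual of a hypothetical fourth species to the first wildflower sample and compares D before and after; it assumes the n(n-1) form of the formula.

```python
def simpson_index(counts):
    """Simpson's Index D = sum n(n-1) / (N(N-1))."""
    total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))


field_1 = [300, 335, 365]          # daisies, dandelions, buttercups
field_1_plus_rare = field_1 + [1]  # hypothetical: one plant of a fourth species

d_before = simpson_index(field_1)
d_after = simpson_index(field_1_plus_rare)

print(f"richness {len(field_1)} -> {len(field_1_plus_rare)}")
print(f"D {d_before:.4f} -> {d_after:.4f}")
```

Richness rises from 3 to 4 species, while D moves by less than 0.001, because a species with one individual contributes n(n-1) = 0 to the numerator.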


Effective number of species

Diversity indices like the Shannon entropy ("Shannon-Wiener index") and the Gini-Simpson index are not themselves diversities. They are just indices of diversity, in the same way that the diameter of a sphere is an index of its volume but is not itself the volume. Using the diameter in place of the volume in engineering equations would give dangerously misleading results. Things would be even worse if some engineers liked to use surface area, and if others liked to use circumference in place of volume. Imagine the chaos if they called all of these things by the same word and used them interchangeably in engineering equations that required volume. This is what biologists are doing with diversity indices.

Diversity indices have a wide variety of ranges and behaviors. If applied to a system of S equally common species, some give S, some give log S, some give 1/S, some give 1 − 1/S, etc. Some have unlimited ranges while others are always less than unity. By calling all of these indices “diversities” and treating them as if they were interchangeable in formulas or analyses requiring diversities, we will often generate misleading results.

So what is a true "diversity"? What units should it be measured in?

It is possible to arrive at a natural and intuitive definition. In virtually any biological context, it is reasonable to say that a community with sixteen equally-common species is twice as diverse as a community with eight equally-common species. This is so obvious that it seems odd to have to write it. But it is important to realize what this simple statement implies. Most diversity indices do not double as we go from eight species to sixteen species. (Some biologists have noticed this and concluded that all diversity indices other than species richness are therefore not to be trusted. We will see below that this is an incorrect conclusion. Species richness is the least informative and most imprecise diversity index, in the sense that it is more subject to random variation than any other index. Frequency-based diversity indices tell us something important, but they are not themselves "diversities".)

Going back to the obvious, it seems completely natural to say that a community with eight equally-common species has a diversity of eight species, or a community with S equally-common species has a diversity of S species. This definition behaves as we expect of a diversity: the diversity of a community of sixteen equally-common species is double that of a community with eight equally-common species. Diversity is an unambiguous concept when we are dealing with communities of equally-common species.

What happens when the species aren't equally common? This is where the choice of diversity index comes into play. If we choose a particular index as our index of diversity, then any two communities that give the same value of the index must have the same diversity. Any two communities with a Shannon entropy (Shannon-Wiener index) of 4.5 have the same diversity, according to this index. We don't know what that diversity is yet (remember, 4.5 is just the value of the index, not the real diversity) but we do know that all communities with a Shannon-Wiener index of 4.5 have the same diversity according to this index. Now if one of those communities consisted of S equally-common species, we would know that its true diversity is S by our above definition, and then we would know that all other communities with a Shannon-Wiener index of 4.5 must also have diversity S, even if their species were not equally common.

It is a matter of algebra to find the number of equally-common species that give a particular value of an index. See my paper, Entropy and Diversity, for a description of the algorithm. The number of equally-common species required to give a particular value of an index is called the "effective number of species". This is the true diversity of the community in question. For example, the true diversity associated with a Shannon-Wiener index of 4.5 is exp (4.5) = 90 effective species. The formulas that convert common diversity indices into true diversities are collected in this table: Table 1.
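A minimal sketch of the conversion, assuming natural-log Shannon entropy and the Gini-Simpson index defined as 1 minus the sum of squared frequencies. The function names and the uneven example community are invented for illustration.

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy (natural log) of species frequencies that sum to 1."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def gini_simpson(freqs):
    """Gini-Simpson index: 1 - sum of squared species frequencies."""
    return 1 - sum(p * p for p in freqs)


def effective_species_from_shannon(entropy):
    """Number of equally-common species with this Shannon entropy."""
    return math.exp(entropy)


def effective_species_from_gini_simpson(index):
    """Number of equally-common species with this Gini-Simpson index."""
    return 1 / (1 - index)


equal = [1 / 8] * 8                   # eight equally-common species
uneven = [0.5, 0.25, 0.125, 0.125]    # hypothetical uneven community

for name, freqs in [("equal", equal), ("uneven", uneven)]:
    by_shannon = effective_species_from_shannon(shannon_entropy(freqs))
    by_gs = effective_species_from_gini_simpson(gini_simpson(freqs))
    print(f"{name}: {by_shannon:.3f} (Shannon), {by_gs:.3f} (Gini-Simpson)")
```

For the equally-common community both conversions return 8. For the uneven one they differ, since each index weights rare species differently, but both are now expressed in the same unit, number of species.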

Converting indices to true diversities (effective numbers of species) gives them a set of common behaviors and properties. After conversion, diversity is always measured in units of number of species, no matter what index we use. This lets us compare and interpret them easily, and it lets us develop formulas and techniques that don't depend on a specific index. It also lets us avoid the serious misinterpretations spawned by the nonlinearity of most diversity indices. For more details see What is diversity?, the first chapter of a book on diversity analysis that Dr. Anne Chao and I are writing under contract for Chapman and Hall publishers.

As an example of the practical importance of this, suppose you are comparing the diversity of aquatic microorganisms before and after an oil spill. You wouldn't want to measure that diversity by species richness because even a massive toxic event is sure to leave a few vagrant individuals of each pre-spill species, and species richness doesn't distinguish between one individual of Species X or a million. The pre- and post-spill species counts might not be very different, even if the pre- and post-spill species frequencies are very different. So if you are a good traditional biologist you might use the popular Gini-Simpson diversity index, which is 1 - (Sum of the squares of species frequencies). Suppose that the pre-spill Gini-Simpson index is .99 and the post-spill index is .97. If you are a good traditional biologist you would figure out that this drop is statistically significant, but you would conclude that the magnitude of the drop is small. You might even say (very wrongly) that the diversity has dropped by 2%, which sounds like a small drop, nothing to worry about.

The error which virtually all biologists make is to forget that the Gini-Simpson index is not itself a diversity, and is highly nonlinear. The pre-spill community with a Gini-Simpson index of 0.99 has the same diversity as a community of 100 equally-common species. The post-spill community with a Gini-Simpson index of 0.97 has the same diversity as a community of 33 equally-common species. The difference between the pre- and post-spill diversities is in fact enormous. The drop in diversity is 66%, not 2%! This is not just a matter of different definitions of diversity, as some people would like to say. Rather, it is a matter of the indices being nonlinear with respect to our intuitive concept of diversity.

The Shannon entropy is also highly nonlinear. A Shannon entropy of 6.0 corresponds to 403 equally-common species while a Shannon entropy of 5.5 corresponds to 244 equally-common species. The former is almost twice as diverse as the latter even though the difference in the values of the indices is only 8%.
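The figures in the oil-spill and Shannon examples can be reproduced with a few lines. This is a sketch that assumes the conversions 1 / (1 - index) for Gini-Simpson and exp(H) for natural-log Shannon entropy; the variable names are illustrative.

```python
import math


def effective_from_gini_simpson(index):
    """Equally-common species count that gives this Gini-Simpson index."""
    return 1 / (1 - index)


# Oil-spill example: the index drops from 0.99 to 0.97.
pre_spill = effective_from_gini_simpson(0.99)
post_spill = effective_from_gini_simpson(0.97)
relative_drop = 1 - post_spill / pre_spill

# Shannon example: entropies of 6.0 and 5.5.
shannon_high = math.exp(6.0)
shannon_low = math.exp(5.5)

print(f"Gini-Simpson: {pre_spill:.1f} -> {post_spill:.1f} effective species "
      f"({relative_drop:.0%} drop)")
print(f"Shannon: {shannon_high:.1f} vs {shannon_low:.1f} effective species")
```

A 2% change in the raw Gini-Simpson index corresponds here to losing about two thirds of the effective species, which is the nonlinearity the text warns about.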

There may be times when we really want to know the information content of communities, and in that case we would use the Shannon entropy directly. Similarly there might be times when we really want to use the Gini-Simpson index directly. But when we are doing diversity analyses, we have to convert them to true diversities if they are to serve their purpose.

When we convert to true diversities (effective number of species) we create a powerful and intuitive tool for comparing diversities of different communities. If one community has a true diversity of 5 effective species based on some diversity index, and another has a true diversity of 15 effective species based on the same diversity index, we can truly say that the second community is three times as diverse as the first according to that index. We couldn't draw this conclusion from the raw index itself, because it uses a nonlinear scale.

Other sciences have long ago recognized the importance of the true diversity of a diversity index, though the concept goes by different names in different fields. The use of the exponential of Shannon entropy, exp(H_Shannon), in thermodynamics dates from the dawn of the modern atomic theory of matter over a hundred years ago; in that field it gives the number of equally-likely states needed to produce the given entropy. Economists have also long made this fundamental distinction; the term “numbers equivalent” for the effective number of elements of a diversity index is used in that field (Patil and Taillie 1982). The distinction between Shannon entropy and its numbers equivalent or true diversity can be visualized by imagining a dichotomous key to the species of a community. Shannon entropy is proportional to the mean depth of the maximally-efficient dichotomous key to the species of the community (the average number of yes-or-no questions that must be asked to identify a species), but the true diversity is the effective number of terminal branches in the key, and that number increases exponentially with the depth of the key. Several biologists, notably MacArthur (1965), Hill (1973), and Peet (1974), correctly identified diversity with exp(H_Shannon), the effective number of species, but authors of influential standard texts such as Magurran (2004) did not recognize the significance of this, and the concept is seldom used. Yet the results presented here show that this concept clears up most of the many problems in diversity analysis in biology, just as it does in physics and economics.

To get a feel for this and to learn some of the mathematical properties of effective numbers of species, see the examples I present in Measuring the diversity of a single community and Comparing the diversities of two communities.

Table of contents:

Part 1: Theoretical background

What is diversity? This is the first chapter of a book on diversity analysis that Dr. Anne Chao and I are writing under contract for Chapman and Hall publishers.

Effective number of species. This is the concept that unifies everything.

Article: Entropy and diversity. Oikos, May 2006. This provides an intuitive and productive answer to the question, "What is diversity?" It also points out problems in certain similarity measures and introduces new measures that avoid these problems. These new measures lead to the Sorensen index, Jaccard index, Morisita-Horn index, and Horn index of overlap as special cases.

Article: Partitioning diversity into independent alpha and beta components. In press, Ecology, "Concepts and Synthesis" section. Here I derive the correct expressions for alpha and beta for any diversity index. I start from first principles, asking what properties must beta have if it is to capture our theoretical idea of beta as a measure of community overlap. From these properties (which I believe are uncontroversial) I derive the relation between alpha and beta components of any given diversity index. It turns out that there is no universal additive or multiplicative rule relating the alpha and beta components of an index. However, when the alpha and beta components of any index are converted to true diversities (effective numbers of elements), they all follow Whittaker's multiplicative law, regardless of the index on which they are based!

There is a surprise, though. The equations I derive reveal that most diversity measures have a fatal flaw. They can only be decomposed into meaningful alpha and beta components if the statistical weights of all communities are equal. It turns out that only Shannon measures give meaningful results when community weights are unequal.

I also show how diversity measures relate to similarity and overlap measures. I show a general way to derive similarity and overlap measures from diversity measures, thus ensuring logical consistency between them. Through examples I discuss the different meanings of "similarity" and give the appropriate formulas for each.

Part 2: Diversity

Part 3: Alpha and beta diversity

Part 4: Similarity and overlap

Different meanings of "similarity"

Measuring the homogeneity of a region

Measuring the similarity and degree of overlap of two communities

Measuring the similarity and overlap of multiple communities

Ask Gini: How to Measure Inequality

Articles, studies and U.S. Census data focusing on wealth inequality rely on the Gini coefficient. How is it calculated, and what does it tell us?

Frank Cowell, an economist at the London School of Economics and Political Science, says the Gini coefficient is like the Kardashians: "It's famous for being famous." He's speaking about one of the most commonly discussed measures of income and wealth inequality. The Gini coefficient has been in the news a lot since the U.S. Census Bureau released its most recent data on income inequality in September. The data show that income inequality in the U.S. is high, but many articles blunder when they try to compare the U.S. to other countries.

An article on the Atlantic Web site in October, for example, reports on a pairing of U.S. cities with foreign countries that have similar Gini coefficients. The city in the U.S. with the least income inequality, Ogden, Utah, was paired with Malawi in Africa, whereas the city with the greatest inequality, Bridgeport, Conn., was paired with Thailand in Southeast Asia. These pairings are a bit puzzling. Is it better to be Malawi than Thailand? Does it make sense to compare the Gini coefficient of one concentrated metropolitan area with that of an entire nation?

A closer look at the data used to create the map shows that, as reported in a Forbes editorial, the U.S. Census Bureau usually reports Gini coefficients based on pretax numbers, whereas many calculations for foreign countries use posttax numbers, which often include redistribution of wealth from rich to poor and tend to lower the Gini coefficient. Comparing the pretax number in one country with the posttax number in another is somewhat meaningless.

To understand what the Gini coefficient can and cannot explain, and how to interpret articles about economic inequality, a deeper look at this statistic is required.

The Gini coefficient compares the income or wealth distribution of a population to a perfectly equal distribution, in which every citizen of a city or country has equal wealth. To compute the Gini coefficient, economists first find the Lorenz curve for the population. The curve is a graphical representation of the distribution of income or wealth in a society. The x-axis is the proportion of the population, from lowest to highest income, and the y-axis is the cumulative percentage of income or wealth owned. So the point (0.5, 0.2) would indicate that the lowest-income 50 percent of the population earned 20 percent of the total income. A perfectly equal society would have a Lorenz distribution that looks like the line y = x.

The Gini coefficient measures how far the actual Lorenz curve for a society's income or wealth is from the line of equality. Both the Lorenz curve and the line of equality are plotted on a graph. Then the area between the two graphs is computed. The Gini coefficient is the area between the two graphs divided by the total area under the line of equality. In the picture at the top right of this article, it is the area of the region labeled A divided by the combined areas for A and B. This yields a number between 0 and 1, sometimes reported as a percentage; for example, 0.22 or 22 (written with or without the percent sign). 0 means that the country is perfectly equal, and 1 means that one person has all the wealth or income. (This Web site has a Lorenz curve generator and Gini coefficient calculator: Enter a set of incomes to find out what the Gini coefficient of the group is and what the distribution looks like.)
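The area construction described above can be sketched directly: build the Lorenz curve from a list of incomes, integrate it with the trapezoid rule to get area B, and take A / (A + B) with A + B = 1/2. The income lists are invented, and the sketch assumes non-negative incomes with a positive total.

```python
def lorenz_points(incomes):
    """Points (population share, cumulative income share), poorest first."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini_from_lorenz(points):
    """Gini = A / (A + B), where B is the area under the Lorenz curve."""
    area_b = sum(
        (x1 - x0) * (y0 + y1) / 2  # trapezoid between consecutive points
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    area_under_equality = 0.5      # triangle under the line y = x
    area_a = area_under_equality - area_b
    return area_a / area_under_equality


equal_society = [40, 40, 40, 40]    # hypothetical incomes
unequal_society = [0, 0, 0, 160]    # one person earns everything

print(gini_from_lorenz(lorenz_points(equal_society)))
print(gini_from_lorenz(lorenz_points(unequal_society)))
```

With only four people, the "one person has everything" case yields 0.75 and not 1, because for a population of size n the piecewise-linear Lorenz curve caps the coefficient at 1 - 1/n; the limit of 1 is approached as n grows.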

Cowell says that the Gini coefficient is useful, particularly because it allows negative values for income and wealth, unlike some other measures of inequality. (If some portion of the population has negative wealth (owes money), the Lorenz curve will dip below the x-axis.) But the Gini coefficient also has limitations. For one, it takes all the data from the Lorenz curve and converts it to a single number. Two different income distributions can have the same Gini coefficient, and a lot of information is lost in the conversion from a graph to a single number. Cowell asks, "Why not just look at the Lorenz curve?"

In addition, the Gini coefficient cannot tell that person X is a 24-year-old medical student who has negative income because of student loans, whereas person Y, who has the same amount of negative income, is unemployed and without job prospects. It samples people at random points of their lives, which means that it can't separate those whose financial futures are reasonably secure from those who do not have prospects. Its results are also sensitive to outliers: a few very wealthy or very poor individuals can change the statistic significantly, even in a large sample.

Cowell says that the Gini coefficient should not be used as the sole measure of economic inequality. He suggests two ways to handle the number: "One is to look beyond the Gini as a single statistic. The other is to consider whether it might be useful to use a model of the upper tail of the distribution, so you get a clearer picture." Due to incomplete data, the Gini coefficient can underestimate the concentration of wealth in the very richest individuals, and can even underestimate the wealth inequality within the upper echelons of the wealthy. To mitigate this problem, Cowell studies better ways to model income and wealth distribution in the most well-off. One option is to "patch in" an assumed distribution (specifically a Pareto distribution) for the top 5 or 10 percent of the population. In effect, this means assuming that the distribution of wealth takes a certain form, and using that model, rather than sparse data, to calculate the Gini coefficient.

The study of income and wealth inequality is of course fertile ground for many questions and controversies. What "should" the inequality in a society be? As the Occupy Wall Street (OWS) movement highlighted, this is of broad interest. Cowell says that the concentration of wealth in the upper echelon of the population can be reminiscent of a monopoly in business. "People get kind of twitchy if a large portion of the output is controlled by a small number of firms." Economists are asking many different questions about the causes and effects of wealth and income inequality. What are the effects of high inequality, and is it possible to separate the effects of poverty from the effects of inequality?

But the Gini coefficient is not just used by economists. Sam Shah, a high school math teacher in Brooklyn, N.Y., wrote in his blog that he included a section on the Gini coefficient during the last week of his calculus class. (He based his lesson on this handout (pdf) from the North Carolina School of Science and Mathematics.) He framed the lessons in the context of OWS and asked, "Is income truly becoming more and more unequally distributed in the past 40 years?" Students had access to several decades' worth of data and got to explore questions about how they thought income and wealth should be distributed, in addition to working on math. As a way to motivate students' desire to understand calculus and statistics, he found the Gini coefficient to be very effective.

"The best part of the discussion was around what kids picked for 'what they would like it to be,'" wrote Shah. Others have studied this question: How do people feel about wealth inequality? A 2005 survey (pdf) conducted by Michael Norton of the Harvard Business School and Dan Ariely of the Duke University Department of Psychology and published in 2011 found that most Americans underestimated the amount of wealth inequality in the U.S. and wanted it to be even lower than their estimates. The researchers showed respondents three different pie charts illustrating possible wealth distribution by quintiles. One illustrated complete equality, one was slightly unequal (with the lowest quintile earning 11 percent of the wealth and the highest earning 36 percent) and one was based on the wealth distribution of the U.S., with the lowest quintile owning 0.1 percent of the wealth and the top quintile owning 84 percent. Of the people surveyed, 47 percent preferred the slightly unequal distribution, 43 percent the perfectly equal distribution and only 10 percent the highly unequal distribution.

The diagram in the article labeled the slightly unequal distribution as Sweden, although it was presented without a label to survey respondents. The authors clearly wanted readers to believe that Sweden's wealth distribution was preferable to that of the U.S.; a heading in the article stated "Americans Prefer Sweden." But in a note at the end of the article, the authors wrote, "We used Sweden's income rather than wealth distribution because it provided a clearer contrast to the other two wealth distribution examples; although more equal than the United States's wealth distribution, Sweden's wealth distribution is still extremely top-heavy." (According to Cowell's research, even that statement is unclear: some methods of computing the Gini coefficient that include a modification of the distribution at the top of the wealth scale find that Sweden has greater wealth inequality than the U.S.)

Although Norton and Ariely's conclusion that Americans would prefer a more equal distribution of wealth may be sound, their data switcheroo illustrates a common problem when talking about inequality: income versus wealth. People often conflate the two, but they are not the same. "You're using your thermometer to measure something quite different," Cowell says. Wealth inequality says more about the balance of power in a society, and income inequality addresses the way labor markets operate. "Typically wealth is much more unequally distributed than income, in any country you look at," he says. He recently conducted a study (pdf) about income and wealth inequality in the U.S., U.K., Canada and Sweden. Because Sweden has more interventionist policies, one might assume that the U.S. would be much more unequal than Sweden. "It's true for income but not for wealth," says Cowell.

Economists continue to use the Gini coefficient, either standard or modified, to understand wealth and income inequality. In the meantime, lay people who want to understand wealth inequality should read the fine print to ensure that they have all the facts.

Deviance and GLM

Formally, one can view deviance as a sort of distance between two probabilistic models; in a GLM context, it amounts to two times the log ratio of likelihoods between two nested models, $\ell_1/\ell_0$, where $\ell_0$ is the "smaller" model, that is, a linear restriction on model parameters (cf. the Neyman–Pearson lemma), as @suncoolsu said. As such, it can be used to perform model comparison. It can also be seen as a generalization of the RSS used in OLS estimation (ANOVA, regression), for it provides a measure of goodness-of-fit of the model being evaluated when compared to the null model (intercept only). It works with LM too.

The residual sum of squares (RSS) is computed as $\hat\varepsilon^t\hat\varepsilon$. It can also be obtained from the (unadjusted) $R^2$, since $R^2 = 1 - \text{RSS}/\text{TSS}$, where $\text{TSS}$ is the total sum of squares. Note that it is directly available in an ANOVA table.

In fact, for linear models the deviance equals the RSS (you may recall that OLS and ML estimates coincide in such a case).
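The relationship between RSS, $R^2$ and the deviance of a linear model can be checked on a toy data set. This is a sketch in plain Python and not the R session the answer refers to; the data points are invented and the simple-regression helper is written out by hand.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data, roughly y = x plus noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

mean_y = sum(y) / len(y)
rss = sum(r * r for r in residuals)          # residual sum of squares
tss = sum((yi - mean_y) ** 2 for yi in y)    # deviance of the intercept-only model
r_squared = 1 - rss / tss

# For a Gaussian linear model, the (unscaled) deviance is the RSS itself.
deviance = rss

print(f"slope={slope:.2f}, RSS={rss:.3f}, TSS={tss:.3f}, R^2={r_squared:.4f}")
```

Here TSS plays the role of the null deviance and RSS the residual deviance, so $R^2$ is the fraction of the null deviance explained by the fitted model.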

Deviance and CART

We can see CART as a way to allocate $n$ already labeled individuals into arbitrary classes (in a classification context). Trees can be viewed as providing a probability model for individuals' class membership. So, at each node $i$, we have a probability distribution $p_{ik}$ over the classes. What is important here is that the leaves of the tree give us a random sample $n_{ik}$ from a multinomial distribution specified by $p_{ik}$. We can thus define the deviance of a tree, $D$, as the sum over all leaves of

$$D_i = -2\sum_k n_{ik}\log(p_{ik}),$$

following Venables and Ripley's notations (MASS, Springer 2002, 4th ed.). If you have access to this essential reference for R users (IMHO), you can check by yourself how such an approach is used for splitting nodes and fitting a tree to observed data (p. 255 ff.); basically, the idea is to minimize, by pruning the tree, $D+\alpha \#(T)$ where $\#(T)$ is the number of nodes in the tree $T$. Here we recognize the cost-complexity trade-off. Here, $D$ is equivalent to the concept of node impurity (i.e., the heterogeneity of the distribution at a given node), which is based on a measure of entropy or information gain, or the well-known Gini index, defined as $1-\sum_k p_{ik}^2$ (the unknown proportions are estimated from node proportions).
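A small sketch of the two impurity measures for a single classification node, assuming the leaf deviance $D_i = -2\sum_k n_{ik}\log(p_{ik})$ and the Gini index $1-\sum_k p_{ik}^2$, with proportions estimated from the node's class counts. The function names and counts are hypothetical, and this is not the rpart implementation.

```python
import math


def node_proportions(class_counts):
    """Estimate class probabilities p_ik from the counts n_ik at a node."""
    n = sum(class_counts)
    return [c / n for c in class_counts]


def gini_index(class_counts):
    """Gini impurity 1 - sum_k p_ik^2."""
    return 1 - sum(p * p for p in node_proportions(class_counts))


def node_deviance(class_counts):
    """Multinomial deviance -2 * sum_k n_ik * log(p_ik); empty classes add 0."""
    props = node_proportions(class_counts)
    return -2 * sum(c * math.log(p) for c, p in zip(class_counts, props) if c > 0)


pure_node = [10, 0]    # all individuals in one class
mixed_node = [5, 5]    # evenly split between two classes

for name, counts in [("pure", pure_node), ("mixed", mixed_node)]:
    print(f"{name}: gini={gini_index(counts):.3f}, "
          f"deviance={node_deviance(counts):.3f}")
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why either can serve as the impurity criterion when choosing splits.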

With a regression tree, the idea is quite similar, and we can conceptualize the deviance as a sum of squares defined for individuals $j$ by

$$D_i=\sum_j (y_j-\mu_i)^2,$$

summed over all leaves. Here, the probability model that is considered within each leaf is a Gaussian $\mathcal{N}(\mu_i,\sigma^2)$. Quoting Venables and Ripley (p. 256), " $D$ is the usual scaled deviance for a gaussian GLM. However, the distribution at internal nodes of the tree is then a mixture of normal distributions, and so $D_i$ is only appropriate at the leaves. The tree-construction process has to be seen as a hierarchical refinement of probability models, very similar to forward variable selection in regression." Section 9.2 provides further detailed information about the rpart implementation, but you can already look at the residuals() function for an rpart object, where "deviance residuals" are computed as the square root of minus twice the logarithm of the fitted model.

An introduction to recursive partitioning using the rpart routines, by Atkinson and Therneau, is also a good start. For a more general review (including bagging), I would recommend


No, despite their names they are not equivalent or even that similar.

  • Gini impurity is a measure of misclassification, which applies in a multiclass classifier context.
  • Gini coefficient applies to binary classification and requires a classifier that can in some way rank examples according to the likelihood of being in a positive class.

Both could be applied in some cases, but they are different measures for different things. Impurity is what is commonly used in decision trees.
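As an illustration of the ranking-based Gini coefficient, the sketch below assumes the common convention Gini = 2 × AUC − 1, where AUC is the area under the ROC curve. That convention is not stated in the answer above, and the scores, labels, and function names are made up for the example.

```python
def auc(scores, labels):
    """Probability that a random positive is ranked above a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini_from_auc(scores, labels):
    """Gini coefficient of a ranking classifier, assuming Gini = 2*AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


labels = [1, 1, 0, 0]
perfect_ranking = [0.9, 0.8, 0.3, 0.1]   # every positive outranks every negative
imperfect_ranking = [0.9, 0.4, 0.6, 0.1]  # one positive is ranked below a negative

print(gini_from_auc(perfect_ranking, labels))
print(gini_from_auc(imperfect_ranking, labels))
```

Unlike Gini impurity, this quantity needs scores that order the examples, which is why it only makes sense for a classifier that can rank.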

I took an example of data with two people, A and B, with wealth of 1 unit and 3 units respectively. Gini impurity as per Wikipedia = 1 - [ (1/4)^2 + (3/4)^2 ] = 3/8

Gini coefficient as per Wikipedia would be the ratio of the area between the red and blue lines to the total area under the blue line in the following graph

Area under red line is 1/2 + 1 + 3/2 = 3

Total area under blue line = 4

So the Gini coefficient = (4 - 3)/4 = 1/4

Clearly the two numbers are different. I will check more cases to see if they are proportional or there is an exact relationship and edit the answer.

Edit: I checked other combinations as well, and the ratio is not constant. Below is a list of a few combinations I tried.
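The comparison can be reproduced with a short Python sketch. It assumes, as in the example above, that each person's share of total wealth is treated as a class probability for the impurity, and that the Gini coefficient is computed from the piecewise-linear Lorenz curve; the function names and the extra wealth combinations are illustrative, not the original list.

```python
def gini_impurity_of_shares(wealth):
    """Gini impurity 1 - sum(p_i^2), with p_i = each person's share of wealth."""
    total = sum(wealth)
    return 1.0 - sum((w / total) ** 2 for w in wealth)


def gini_coefficient(wealth):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve)."""
    values = sorted(wealth)
    n = len(values)
    total = sum(values)
    cumulative = 0.0
    area = 0.0
    for v in values:
        previous = cumulative
        cumulative += v / total
        # Trapezoid of width 1/n between consecutive Lorenz-curve points.
        area += (previous + cumulative) / (2.0 * n)
    return 1.0 - 2.0 * area


for wealth in ([1, 3], [1, 1], [1, 9], [1, 2, 3]):
    impurity = gini_impurity_of_shares(wealth)
    coefficient = gini_coefficient(wealth)
    print(wealth, round(impurity, 4), round(coefficient, 4))
```

For wealth (1, 3) this gives an impurity of 3/8 and a coefficient of 1/4, while for (1, 9) the impurity drops and the coefficient rises, so the two measures are not proportional.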

I believe they represent the same thing essentially, as the so-called:

"Gini Coefficient", mainly used in Economics, measures the inequality of a numerical variable, such as income, which we can treat as a regression problem -- getting the "mean" of each group.

"Gini impurity" mainly used in Decision Tree learning, measures the impurity of a categorical variable, such as colour, sex, etc. which is a classification problem -- getting the "majority" of each group.

Sounds similar right? "inequality" and "impurity" are both measures of variation, which are intuitively the same concept. The difference is "inequality" for numerical variables and "impurity" for categorical variables. And both of them can be named "Gini Index".

In Light, R. J., & Margolin, B. H. (1971), An analysis of variance for categorical data, it says that, as the "mean" is an undefined concept for categorical data, Gini extends the "Gini Index" from numerical data to categorical data by using pairwise differences instead of deviations from the mean. TL;DR: this gives the variation for categorical responses, $\frac{1}{2n}\left[\sum_{i\neq j} n_i n_j\right] = \frac{n}{2} - \frac{1}{2n}\sum_{i=1}^{I} n_i^2$, where $n_i$ is the number of responses in the $i$ th category, $i = 1, \cdots, I$, which is almost the same as, but $\frac{n}{2}$ times, the "Gini Impurity" used nowadays, $1 - \sum_{i=1}^{I} p_i^2$ with $p_i = n_i/n$.
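The identity can be checked numerically. The sketch below assumes that the pairwise sum runs over ordered pairs of distinct categories ($i \neq j$) and that $p_i = n_i/n$; the function names and the category counts are made up for illustration.

```python
def pairwise_variation(counts):
    """(1 / 2n) * sum over ordered pairs i != j of n_i * n_j."""
    n = sum(counts)
    cross = sum(
        counts[i] * counts[j]
        for i in range(len(counts))
        for j in range(len(counts))
        if i != j
    )
    return cross / (2.0 * n)


def closed_form_variation(counts):
    """n/2 - (1 / 2n) * sum_i n_i^2."""
    n = sum(counts)
    return n / 2.0 - sum(c * c for c in counts) / (2.0 * n)


def gini_impurity(counts):
    """1 - sum_i p_i^2 with p_i = n_i / n."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


counts = [3, 5, 2]  # hypothetical numbers of responses per category
n = sum(counts)

print(pairwise_variation(counts))
print(closed_form_variation(counts))
print(n / 2.0 * gini_impurity(counts))
```

All three printed values agree, which is the sense in which the categorical variation is the Gini impurity scaled by $n/2$.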

By the way, you said you can use ROC as method 2 to choose split point when growing a decision tree, I can't get it. Could you elaborate that?

PS: I agreed with Pasmod Turing's answer, that Wikipedia can be modified by everyone, and the "Gini Impurity" seems like an incomplete item in the wiki.

I also saw the disputes in the comments under his answer. I must say that Machine Learning originated from statistics, and statistics is the fundamental analysis tool for scientific research; thus, many concepts are the same thing in statistics, even though they have different names in different professional areas. The Gini index certainly shares the same name in decision trees and economics.




Machine Learning Contribution to Solve Prognostic Medical Problems

Flavio Baronti, ... Antonina Starita, in Outcome Prediction in Cancer, 2007

4.5. Comparison with decision trees

The second comparison we show is to a different machine learning tool for classification and prediction: decision trees (Quinlan, 1986). Decision trees are a well-known machine learning method that complies with our requirements about interpretability, treatment of different data types, and robustness to missing data.

A decision tree is a classifier in the form of a tree structure, where each leaf node indicates the value of a target class and each internal node specifies a test to be carried out on a single attribute, with one branch and sub-tree for each possible outcome of the test. The classification of an instance is performed by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

Among the variety of algorithms for decision tree induction from data, probably the best known and most used are ID3 and its enhanced version C4.5 (Quinlan, 1993). ID3 searches through the attributes of the training instances and extracts the attribute that best separates the given examples. The algorithm uses a greedy search, that is, it picks the best attribute and never looks back to reconsider earlier choices. The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples. A good quantitative measure of the worth of an attribute is a statistical property called information gain, which measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.
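The attribute-selection step can be sketched in a few lines of Python. This is only an illustration of entropy-based information gain for categorical attributes, not the ID3, C4.5, or See5 code; the toy records, attribute names, and function names are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records, made up purely for illustration.
rows = [
    {"stage": "early", "smoker": "yes", "outcome": "good"},
    {"stage": "early", "smoker": "no", "outcome": "good"},
    {"stage": "late", "smoker": "yes", "outcome": "poor"},
    {"stage": "late", "smoker": "no", "outcome": "poor"},
]

for name in ("stage", "smoker"):
    print(name, information_gain(rows, name, "outcome"))
print("split on:", best_attribute(rows, ["stage", "smoker"], "outcome"))
```

In this toy data the `stage` attribute separates the outcomes perfectly, so it has the full gain of one bit, while `smoker` has none; a greedy learner would therefore test `stage` at the root.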

4.5.1. Decision tree results

Decision tree induction on our dataset was performed using the See5 software by Rulequest Research (1994). After some testing, we found that the default parameters (pruning CF = 25%, minimum cases per branch = 4) worked well for this dataset; boosting was not employed, since it did not appear to improve performance. We applied a tenfold cross-validation and repeated it ten times, as in the experiments with XCS (that is, with 10 different foldings). In this case, the variability of the results is due only to the random folding in the cross-validation procedure, since the decision tree induction algorithm is deterministic.

The results are reported in Table 1, where the accuracy, sensitivity, and specificity obtained with See5 are compared with those obtained with XCS. Finally, the decision tree obtained with the execution of See5 on the entire dataset is reported in Table 3.

Table 3. Decision tree obtained from the entire dataset, along with the correct/matched ratio for each branch


In the context of the empirical assessments of health inequities, this paper investigated the empirical importance of the ethical question of whether unexplained health inequality is unfair. The classification of unexplained inequality as fair or unfair is closely connected to the choice of the fairness-standardization methods, a critical step for the measurement of health inequities. As the analysis of the US component of the JCUSH showed, this choice can substantially influence the empirical results regarding how much health inequity exists in the population and the proportion of observed health inequality that is inequitable. We obtained the same results in analyses using the Canadian sample of the JCUSH and using a different definition of health inequity, equal opportunity for health (results not shown).

The question of how best to treat unexplained health inequality deserves more extensive consideration in the assessment of health inequities than it currently does. Both direct and indirect fairness-standardization methods are technically valid but can produce different health inequity information and imply different ethical stances in regard to unexplained variation. An analogy here may be the choice between direct and indirect age-standardization methods in epidemiological studies [31]. Both of these methods are sound but are known to produce different results. Analysts are therefore advised to be explicit and consistent about their methodological choice. What complicates the choice of the fairness-standardization methods is that it is not merely methodological but ethical.

Although unexplained health inequality is not an issue for those who subscribe to the view that all health inequalities are inequitable (for whom all observed variation – explained or unexplained – is unfair), it is an unavoidable issue for empirical analysts who do distinguish between pure health inequality and health inequity. Currently available data and modeling techniques enable analysts to explain only a relatively small portion of observed variation in health at the individual level. Because the issue of unexplained inequality only arises in empirical work, it has rarely been paid attention to in the conceptual discussion regarding definitions of health inequity. Still, some work in the recent detailed philosophical analysis of health inequity by philosophers, economists, and ethicists provides a hint as to how to consider the ethical significance of unexplained inequality.

To examine the ethical significance of unexplained inequality, it is useful to recognize that unexplained variation – residuals in a regression context – consists of two types of variation: variation systematically related to unobserved factors and random variation. The issue of unmeasured systematic variation stems from methodological limitations. Improved data, such as longitudinal data with a rich array of variables capturing individuals’ life history, and improved modeling techniques can reduce unmeasured systematic variation. As soon as unmeasured systematic variation becomes observed systematic variation, the question goes back to a familiar, on-going debate regarding definitions of health inequity, that is, which sources of health inequality are ethically unacceptable.

To assess the ethical significance of random variation, the philosophical literature distinguishes “brute luck” – unfortunate events from which even sensible persons suffer, such as being hit by lightning during the commute with no warning, or suffering from a genetic disease by chance (often referred to as the genetic lottery) – and “option luck” – unfortunate events associated with voluntary risks, such as being hit by lightning while playing golf with plenty of warning or getting injured during voluntary bungee jumping [32-34]. The philosophical literature offers a wide range of views regarding the ethical significance of brute and option luck. Some scholars consider neither option nor brute luck as unfair because only variations in health associated with known socially distributed determinants of health are unfair [35,36]. Alternatively, most equality of opportunity theories, also known as luck egalitarianism, consider that inequality caused by brute luck is unfair while that caused by option luck is fair [37]. Yet another view sees both brute and option luck as unfair [38]. To date, this philosophical literature has not attracted attention in health services and population health research and policy, but it is an important literature in the face of large unexplained health inequality in empirical work.

Advances in data, modeling techniques, and philosophical arguments are ongoing processes, and the measurement and monitoring of health inequities for effective policy making cannot wait for their perfection. Three proposals are available for the treatment of unexplained health inequality in the current imperfect world that still urges policy making. First, Bago d’Uva, Jones, and van Doorslaer [39] recommend in the context of need-standardization for health care utilization, which faces a directly analogous problem, that analysts always provide two estimates of inequity, the lower bound estimate provided by the direct standardization and the upper bound estimate by the indirect standardization. This is a pragmatic stop-gap solution but passes the difficult ethical question to users of health inequity information. Second, given complex causal relationships between health and its determinants and the fact that we do not understand them fully, we might argue that it would be safer to assume unexplained health inequality is of ethical significance, that is, unfair [40,41]. This judgment, and policy decisions that follow from it, will come with some opportunity cost. Resources that are devoted to address health inequity based on this judgment could be directed to competing health or other social issues. We should at least know the nature of such opportunity cost before committing to such judgment.

Finally, Garcia-Gomez and colleagues [7] empirically investigate what unexplained health inequality is. They tested the view articulated by Lefranc and colleagues in the analysis of unexplained income inequality [42]: classify unexplained inequality as luck; examine whether the distribution of luck is uncorrelated with ethically unacceptable sources of inequality; and, if that is the case, consider luck an ethically acceptable source of inequality. In their analysis of inequality in mortality among the Dutch population, they adopted the view of equal opportunity for health as the definition of health inequity, which argues that health inequality due to factors beyond individual control is unfair. They considered variables such as sex, age, and education as ethically unacceptable sources of inequality and variables such as smoking, exercise, and weight as ethically acceptable sources of inequality. They found that unexplained inequality is distributed differently across groups of people categorized by sex, age, and education, with or without controlling for health behaviour. In sum, their analysis suggests that unexplained inequality is not an ethically acceptable source of inequality.

Most of this emerging empirical work and its authors’ insight into the importance of ethical discussion are of considerable significance for public health and health policy. Given the potentially serious policy implications of the issue of unexplained health inequality, analysts should at least make their methodological choices explicit and report results from both standardization methods whenever they can. Moving beyond this pragmatic solution, however, analysts need to spur more debate and analysis regarding which treatment of the unexplained inequality has the stronger foundation in equity considerations.
