
What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as:

Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple (visual) example?

A nice and intuitive solution math.stackexchange.com/questions/331103/…
A nice and intuitive answer to this question: math.stackexchange.com/questions/331103/…
A video with a good and simple explanation
For a simple, non-mathematical explanation, refer to the first part of towardsdatascience.com/…

tne

I assume entropy was mentioned in the context of building decision trees.

To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names, each labeled with either m or f, we want to learn a model that fits the data and can be used to predict the gender of a new, unseen first name.

name       gender
-----------------        Now we want to predict 
Ashley        f              the gender of "Amro" (my name)
Brian         m
Caroline      f
David         m

The first step is deciding which features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. After feature extraction, our data looks like this:

# name    ends-vowel  num-vowels   length   gender
# ------------------------------------------------
Ashley        1         3           6        f
Brian         0         2           5        m
Caroline      1         4           8        f
David         0         2           5        m
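
For illustration, here is a minimal Python sketch of such feature extraction (my own toy code, not from NLTK); note that y is counted as a vowel so the output matches the table above:

def gender_features(name):
    # count y as a vowel so that Ashley -> 3 vowels, ends-vowel=1 (as in the table)
    name = name.lower()
    vowels = "aeiouy"
    return {
        "ends-vowel": int(name[-1] in vowels),
        "num-vowels": sum(ch in vowels for ch in name),
        "length": len(name),
    }

for n in ["Ashley", "Brian", "Caroline", "David"]:
    print(n, gender_features(n))
# Ashley {'ends-vowel': 1, 'num-vowels': 3, 'length': 6}, etc.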

The goal is to build a decision tree. An example of a tree would be:

length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7: female

Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (m or f).

So if we run the name Amro down this tree, we start by testing "is the length<7?" and the answer is yes, so we go down that branch. Following the branch, the next test "is the number of vowels<3?" again evaluates to true. This leads to a leaf node labeled m, and thus the prediction is male (which I happen to be, so the tree predicted the outcome correctly).
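
If it helps, the example tree above can be written out as plain nested conditionals. Here is a small Python sketch of that (the feature computations and the y-as-vowel convention are my own assumptions, chosen to match the table):

def predict_gender(name):
    name = name.lower()
    vowels = "aeiouy"
    length = len(name)
    num_vowels = sum(ch in vowels for ch in name)
    ends_vowel = name[-1] in vowels

    if length < 7:
        if num_vowels < 3:
            return "m"
        return "f" if ends_vowel else "m"
    return "f"

print(predict_gender("Amro"))  # -> 'm', via the length<7 and num-vowels<3 branch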

The decision tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e. nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of purity is called the information. It represents the expected amount of information that would be needed to specify whether a new instance (first name) should be classified as male or female, given the examples that reached the node. We calculate it based on the number of male and female classes at the node.

Entropy, on the other hand, is a measure of impurity (the opposite). It is defined for a binary class with values a/b as:

Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

This binary entropy function is depicted in the figure below (the random variable can take one of two values). It reaches its maximum when the probability is p=1/2, meaning that p(X=a)=0.5 or, equivalently, p(X=b)=0.5, i.e. a 50%/50% chance of being either a or b (uncertainty is at a maximum). The entropy function reaches its minimum of zero when the probability is p=1 or p=0, i.e. complete certainty (p(X=a)=1 or p(X=a)=0 respectively, the latter implying p(X=b)=1).

https://i.stack.imgur.com/OUgcx.png

Of course, the definition of entropy can be generalized to a discrete random variable X with N outcomes (not just two):

https://i.stack.imgur.com/vIFD7.png

Entropy(X) = - Σ p(x_i) * log(p(x_i))        (summing over all N outcomes x_i)

(the log in the formula is usually taken as the logarithm to the base 2)
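
As a quick sanity check, here is a minimal Python sketch of this general formula, using base-2 logs (terms with p=0 are skipped, since they contribute nothing):

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -> maximum uncertainty for two outcomes
print(entropy([1.0, 0.0]))   # 0.0 -> complete certainty
print(entropy([0.25] * 4))   # 2.0 -> four equally likely outcomes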

Back to our task of name classification, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

     ends-vowel
      [9m,5f]          <--- the [..,..] notation represents the class
    /          \            distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]

As you can see, before the split we had 9 males and 5 females, i.e. P(m)=9/14 and P(f)=5/14. According to the definition of entropy:

Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403

Next we compare it with the entropy computed after considering the split by looking at two child branches. In the left branch of ends-vowel=1, we have:

Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852

and in the right branch, for ends-vowel=0, we have:

Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917

We combine the left/right entropies using the number of instances down each branch as a weighting factor (7 instances went left, and 7 instances went right), and get the final entropy after the split:

Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885

Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:

Information_Gain = Entropy_before - Entropy_after = 0.1518

You can interpret the above calculation as follows: by doing the split with the ends-vowel feature, we were able to reduce uncertainty in the sub-tree prediction outcome by a small amount of 0.1518 (measured in bits as units of information).
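
Here is a short Python sketch (my own helper, taking class counts rather than probabilities) that reproduces the numbers above:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

entropy_before = entropy([9, 5])                              # ~0.9403
entropy_left   = entropy([3, 4])                              # ~0.9852
entropy_right  = entropy([6, 1])                              # ~0.5917
entropy_after  = 7/14 * entropy_left + 7/14 * entropy_right   # ~0.7885
print(round(entropy_before - entropy_after, 4))               # 0.1518 information gain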

At each node of the tree, this calculation is performed for every feature, and the feature with the largest information gain is chosen for the split in a greedy manner (thus favoring features that produce pure splits with low uncertainty/entropy). This process is applied recursively from the root-node down, and stops when a leaf node contains instances all having the same class (no need to split it further).
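
For completeness, here is a generic sketch of that greedy selection step (assuming the data is a list of (feature-dict, label) pairs and that every feature is treated as categorical; real implementations also handle numeric thresholds, as noted below):

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(data, feature):
    groups = defaultdict(list)
    for feats, label in data:
        groups[feats[feature]].append(label)
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy([label for _, label in data]) - remainder

def best_feature(data, features):
    # greedy choice: the feature with the largest information gain
    return max(features, key=lambda f: information_gain(data, f))

data = [({"ends-vowel": 1, "num-vowels": 3, "length": 6}, "f"),
        ({"ends-vowel": 0, "num-vowels": 2, "length": 5}, "m"),
        ({"ends-vowel": 1, "num-vowels": 4, "length": 8}, "f"),
        ({"ends-vowel": 0, "num-vowels": 2, "length": 5}, "m")]
print(best_feature(data, ["ends-vowel", "num-vowels", "length"]))  # 'ends-vowel' (ties broken by order)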

Note that I skipped over some details which are beyond the scope of this post, including how to handle numeric features, missing values, overfitting, pruning trees, etc.


@all3fox: this is explained in the last paragraph; the process should stop for that particular branch if it gets to a pure node (a leaf node where all the instances belong to the same class, so it can't be split any further). The node then predicts the only class it contains.
@all3fox: in practice, going all the way to pure nodes produces quite deep decision trees that suffer from overfitting (i.e. trees that fit the training data too well, but generalize poorly to data not represented in the training set). Hence we usually stop when we get to a certain minimum number of instances in leaf nodes (and just predict the majority class), and/or perform some kind of pruning (see the Wikipedia links provided above to learn more).
@Jas: this is well explained here: en.wikipedia.org/wiki/…
@Rami: Right, to avoid problems like overfitting, smaller trees are preferred over larger ones (i.e. reaching decisions with fewer tests). Note that the heuristic by which splitting features are chosen is a greedy search algorithm, so the generated tree is not guaranteed to be the smallest possible one in the space of all possible trees (nor is it guaranteed to be the globally optimal one w.r.t. classification error). This is in fact an NP-complete problem...
@Rami: Interestingly, there are ensemble learning methods that take a different approach. One idea is to randomize the learning algorithm by picking a random subset of features at each candidate split, building a bunch of these random trees, and averaging their results. Also worth checking out are algorithms like Random Forests.
VforVitamin

To begin with, it would be best to understand the measure of information.

How do we measure the information?

When something unlikely happens, we say it's big news. And when something is predictable, it's not really interesting. So, to quantify this interestingness, the function should satisfy:

if the probability of the event is 1 (predictable), then the function gives 0

if the probability of the event is close to 0, then the function should give a high number

if an event with probability 0.5 happens, it gives one bit of information.

One natural measure that satisfies these constraints is

I(X) = -log_2(p)

where p is the probability of the event X. The unit is the bit, the same bit a computer uses: 0 or 1.

Example 1

Fair coin flip:

How much information can we get from one coin flip?

Answer : -log(p) = -log(1/2) = 1 (bit)

Example 2

If a meteor strikes the Earth tomorrow with p = 2^{-22}, then we get 22 bits of information.

If the Sun rises tomorrow with p ~ 1, then it carries about 0 bits of information.

Entropy

So, if we take the expectation of the interestingness of an event Y, then we get the entropy, i.e. entropy is the expected value of the interestingness of an event.

H(Y) = E[ I(Y)]

More formally, the entropy is the expected number of bits needed to describe an event.

Example

Y = 1 : an event X occurs with probability p

Y = 0 : an event X does not occur with probability 1-p

H(Y) = E[I(Y)] = p I(Y==1) + (1-p) I(Y==0) 
     = - p log p - (1-p) log (1-p)

All logs here are base 2.
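
A tiny Python sketch of these two quantities (my own code, using base-2 logs as stated above):

from math import log2

def information(p):
    # self-information I(X) = -log2(p), in bits
    return -log2(p)

def binary_entropy(p):
    # H(Y) = E[I(Y)] for a binary event with probability p
    return p * information(p) + (1 - p) * information(1 - p) if 0 < p < 1 else 0.0

print(information(0.5))      # 1 bit   (fair coin flip)
print(information(2**-22))   # 22 bits (very unlikely event)
print(binary_entropy(0.5))   # 1.0, the maximum for a binary event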


Beta

I can't give you graphics, but maybe I can give a clear explanation.

Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two adds one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.

Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge) so that the lamp must always flash green. The channel is now useless, we know what the next flash will be so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream by this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). And if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free), but green flashes are more common than red: the more skewed the probability the fewer bits you need to describe the sequence, and the less information it contains, all the way to the all-green, bulb-burnt-out limit.

It turns out there's a way to measure the amount of information in a signal, based on the probabilities of the different symbols. If the probability of receiving symbol xi is pi, then consider the quantity

-log pi

The smaller pi, the larger this value. If xi becomes twice as unlikely, this value increases by a fixed amount (log(2)). This should remind you of adding one bit to a message.

If we don't know what the symbol will be (but we know the probabilities), then we can calculate the average of this value, i.e. how much information we will get on average, by summing over the different possibilities:

I = -Σ pi log(pi)

This is the information content in one flash.

Red bulb burnt out: pred = 0, pgreen=1, I = -(0 + 0)  = 0
Red and green equiprobable: pred = 1/2, pgreen = 1/2, I = -(2 * 1/2 * log(1/2)) = log(2)
Three colors, equiprobable: pi=1/3, I = -(3 * 1/3 * log(1/3)) = log(3)
Green and red, green twice as likely: pred=1/3, pgreen=2/3, I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)

This is the information content, or entropy, of the message. It is maximal when the different symbols are equiprobable. If you're a physicist you use the natural log, if you're a computer scientist you use log2 and get bits.
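
If you want to check the numbers, here is a small Python sketch (using log2, so the results come out in bits; with the natural log you would get the log(2)/log(3) forms above):

from math import log2

def info_content(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(info_content([0.0, 1.0]))        # 0.0    red bulb burnt out
print(info_content([0.5, 0.5]))        # 1.0    red/green equiprobable (log2(2))
print(info_content([1/3, 1/3, 1/3]))   # ~1.585 three colors (log2(3))
print(info_content([1/3, 2/3]))        # ~0.918 green twice as likely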


Rafael S. Calsaverini

I really recommend you read about information theory, Bayesian methods, and MaxEnt. The place to start is this (freely available online) book by David MacKay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are far more general than just text mining, and I can't really see how one would learn to apply them to NLP without learning some of the general basics covered in this book or in other introductory books on machine learning and MaxEnt/Bayesian methods.

The connection of entropy and probability theory to information processing and storage is really, really deep. To give a taste of it, there's a theorem due to Shannon (the noisy-channel coding theorem) stating that the maximum rate at which you can pass information without error through a noisy communication channel is a limit determined by the entropy of the noise process. There's also a theorem (source coding) connecting how much you can compress a piece of data, so that it occupies the minimum possible memory in your computer, to the entropy of the process that generated the data.

I don't think it's really necessary that you learn all those theorems of communication theory, but it's not possible to learn this without learning the basics of what entropy is, how it's calculated, and what its relationship is with information and inference, etc.


I had the same thoughts, Rafael. It's like asking what quantum physics is on Stack Overflow: a very broad area that doesn't distill into a single answer well.
Waseem Ahmad Naeem

Informally

Entropy relates to the availability of information or knowledge. A lack of information makes it hard to predict the future, which means high entropy (e.g. next-word prediction in text mining), while the availability of information/knowledge helps us make a more realistic prediction of the future (low entropy).

Relevant information of any type reduces entropy and helps us predict the future more realistically; that information could be, for example, whether the word "meat" is present in a sentence or not. This is called information gain.

Formally

Entropy is a lack of order or predictability.


Matt Warren

When I was implementing an algorithm to calculate the entropy of an image, I found these links, see here and here.

This is the pseudo-code I used; you'll need to adapt it to work with text rather than images, but the principles should be the same.

//Loop over image array elements and count occurrences of each possible
//pixel to pixel difference value. Store these values in prob_array
for j = 0, ysize-1 do $
    for i = 0, xsize-2 do begin
       diff = array(i+1,j) - array(i,j)
       if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
       endif
     endfor

//Convert values in prob_array to probabilities and compute entropy
n = total(prob_array)

entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n

    //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
    //here and divide final sum by Ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor

entrop = entrop/alog(2)

I got this code from somewhere, but I can't dig out the link.
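
For text, a rough Python adaptation (my own sketch, not the original code) would count symbol frequencies instead of pixel-to-pixel differences and compute the entropy of that distribution in bits:

from collections import Counter
from math import log2

def text_entropy(text):
    counts = Counter(text)   # use Counter(text.split()) for word-level entropy
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

print(text_entropy("aaaa"))          # 0.0 -> a single symbol, no uncertainty
print(text_entropy("abab"))          # 1.0 -> two equally likely symbols
print(text_entropy("hello world"))   # higher, since there are more distinct symbols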


There are so many different entropy() functions out there for images, but without good previews. How can you compare your code to MATLAB's own entropy() and to the code here: mathworks.com/matlabcentral/fileexchange/28692-entropy? In the latter, the developer says it is for 1D signals, but users keep expanding it to 2D. Your entropy function assumes that the original signal is 2-bit, which is rather simplistic. Suppose it is an MIT-BIH arrhythmia ECG signal (11-bit) rendered as 2D images; I don't think you can use a simple 2-bit base there.
Paulo

As you are reading a book about NLTK, it would be worth reading about the MaxEnt classifier module: http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent

For text mining classification, the steps could be: pre-processing (tokenization, stemming, feature selection with information gain, ...), transformation to numeric features (frequency or TF-IDF) (I think this is the key step to understand when using text as input to an algorithm that only accepts numeric input), and then classification with MaxEnt. Of course, this is just one example.
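
As a hedged sketch of such a pipeline (the feature design and the tiny training set here are made up purely for illustration; see the NLTK docs linked above for the real API details):

import nltk

def doc_features(text):
    # toy "bag of words" features after naive whitespace tokenization
    return {f"contains({w})": True for w in set(text.lower().split())}

train = [
    (doc_features("cheap meds buy now"), "spam"),
    (doc_features("meeting agenda for monday"), "ham"),
    (doc_features("buy cheap watches now"), "spam"),
    (doc_features("lunch with the team on monday"), "ham"),
]

classifier = nltk.classify.MaxentClassifier.train(train, algorithm="iis",
                                                  trace=0, max_iter=10)
print(classifier.classify(doc_features("cheap watches for monday")))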