The Most Important Letter

A question came up on Quora about what letter’s removal would have the greatest impact on the English language.  The obvious answer is “E”, since it’s by far the most common letter in English.

But let’s consider that.  Can you writ a comprhnsibl sntnc that dosnt us ths lttr?  Ys, you can!  So it’s not clear that “E” is all that important.

So let’s do some mathematics.  The key question is:  How much information does a given letter provide?  Consider the following:  I’m thinking of a color.  You know the color is either red, green, blue, or fuchsia.  (I have no idea what color fuchsia is…I just like the word.)  Your goal is to determine the color I’m thinking of by asking a sequence of Yes/No questions.

One way you could do this is by asking “Are you thinking of red or green?”  If the answer is “Yes”, then  you might ask “Are you thinking of red?”  If the answer is “Yes”, then you know the color is red; if the answer is “No,” then you know the color is green (since I answered “Yes” to the first question).  On the other hand, if I answered “No” to the first question, then you know I was thinking of blue or fuchsia, so you might ask “Are you thinking of blue?”  A “Yes” tells you I’m thinking blue; a “No” tells you I’m thinking fuchsia.

Now reverse it.  If you know I’m thinking of the color red, then you have the answer to two Yes/No questions.  We say that “red” has an information content of two bits.

So far so good.  But suppose I’m somewhat dull and can’t think of any color other than red. In that case, you already know what color I’m thinking of, and don’t need to ask any questions.  In this situation, “red” has an information content of zero bits.
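
The two cases so far can be checked with a short Python sketch.  The helper name questions_needed is made up for illustration, and it assumes every remaining color is equally likely, so that each question can rule out half of what is left.

```python
def questions_needed(options):
    """Count the Yes/No questions needed to single out one of `options`.

    Hypothetical helper: assumes all options are equally likely and that
    each question splits the remaining candidates into two halves.
    """
    remaining = list(options)
    questions = 0
    while len(remaining) > 1:
        # "Is it one of the first half?"  Either answer leaves at most
        # the larger half, so keep that worst case.
        half = (len(remaining) + 1) // 2
        remaining = remaining[:half]
        questions += 1
    return questions


four_colors = questions_needed(["red", "green", "blue", "fuchsia"])
only_red = questions_needed(["red"])
print(four_colors, only_red)  # 2 0
```

Four equally likely colors take two questions, and a single possible color takes none, matching the two-bit and zero-bit cases above.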

As an intermediate case, suppose that half the time I think of “red,” one-fourth of the time I think of “blue,” one-eighth of the time I think of “green,” and one-eighth of the time I think of “fuchsia.”  Then you might ask a different sequence of questions:

  • Are you thinking of red?  (Half the time, I’ll answer “Yes”, so the answer “red” gives you the answer to one question:  it’s 1 bit of information.)
  • If the answer is “No,” then “Are you thinking of blue?”  Half the time this question is asked (remember it will only be asked if the answer to the first question is “No”), the answer will be “Yes,” so the answer “blue” gives you the answer to two questions:  it’s 2 bits of information.
  • If the answer is “No,” then the final question is “Are you thinking of green?”  Again, half the time this question is asked, the answer will be “Yes,” which tells you that “green” is worth 3 bits; meanwhile, the answer “No” means I’m thinking of fuchsia, so “fuchsia” is also worth 3 bits.
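
The question sequence above can be sketched in Python.  The function name questions_asked and the probabilities table are just illustrative; the probabilities are the ones from the intermediate case (1/2, 1/4, 1/8, 1/8).

```python
def questions_asked(color, order=("red", "blue", "green", "fuchsia")):
    """Count the questions asked before `color` is known, when the colors
    are asked about one at a time in `order` (illustrative helper)."""
    asked = 0
    for guess in order[:-1]:
        asked += 1
        if guess == color:
            return asked  # answered "Yes"
    # Every question was answered "No", so it must be the last color.
    return asked


probabilities = {"red": 1 / 2, "blue": 1 / 4, "green": 1 / 8, "fuchsia": 1 / 8}
bits = {color: questions_asked(color) for color in probabilities}
average = sum(p * bits[color] for color, p in probabilities.items())

print(bits)     # {'red': 1, 'blue': 2, 'green': 3, 'fuchsia': 3}
print(average)  # 1.75
```

With these probabilities the sequence needs 1.75 questions on average, a little better than the two questions the first approach always asks.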

It might seem difficult to determine the information content of an answer, because you have to come up with the questions.  But a little theory goes a long way.  The best questions we could ask are those where half the answers are “Yes” and the other half are “No.”  Each such question cuts the remaining possibilities in half.  What this means is that if n is the answer a fraction p_{n} of the time, then it takes \log_{2} (1/p_{n}) such questions to pin it down, so the information content of the answer n will be -\log_{2} p_{n}.  Thus, if “red” is the color half the time, then “red” has an information content of -\log_{2} (1/2) = 1 bit.
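
As a quick check, here is the formula as a small Python function, applied to the color probabilities used earlier.  The name information_bits is my own; it computes \log_{2} (1/p), which is the same quantity as -\log_{2} p.

```python
import math


def information_bits(p):
    """Information content, in bits, of an answer that occurs a
    fraction p of the time: log2(1/p), which equals -log2(p)."""
    return math.log2(1 / p)


print(information_bits(1 / 4))  # 2.0  (four equally likely colors)
print(information_bits(1))      # 0.0  (red is the only possibility)
print(information_bits(1 / 2))  # 1.0  (red, half the time)
print(information_bits(1 / 8))  # 3.0  (green or fuchsia, one-eighth of the time)
```

These agree with the bit counts found by walking through the questions by hand.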

So what does this mean?  “E” makes up about 12.7% of the letters in an English text.  Because it is so common, knowing a letter is “E” answers very few questions:  the letter E contains -\log_{2} (0.127), or about 3 bits of information.  In contrast, “Z” only makes up about 0.074% of the letters in an English text, so knowing a letter is “Z” answers many questions:  the letter Z contains about 10.4 bits of information (the most of any letter).
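
The same calculation for the two letters, as a Python sketch.  One assumption to flag:  the Z frequency is taken as 0.074%, a commonly quoted figure that rounds to 0.07% and reproduces the 10.4 bits; other frequency tables will give slightly different numbers.

```python
import math

# Letter frequencies as fractions of all letters in English text.
# E is the 12.7% quoted in the text; Z is assumed to be 0.074%.
frequency = {"E": 0.127, "Z": 0.00074}

# Information content of each letter: -log2(p), written as log2(1/p).
bits = {letter: math.log2(1 / p) for letter, p in frequency.items()}

for letter, b in bits.items():
    print(f"{letter}: {b:.1f} bits")
```

This gives roughly 3 bits for E and roughly 10.4 bits for Z.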

At first glance, this suggests that “Z” may be the most important letter in the English language:  losing the letter “Z” will lose the most information.  However, there’s a secondary consideration:  “Z” doesn’t often appear in a text.  So every “Z” you drop from a text loses a lot of information…but you don’t drop that many.

And here’s where the greater prevalence of “E” comes in.  While the letter “E” only gives you about 3 bits of information, it’s common enough that dropping the letter “E” from a text will lose you more information overall.  For example, suppose you had a 10,000-character message.  Of these 10,000 characters, you might expect to find 7 or so Zs (7.4 on average), and losing them would lose you about 77 bits of information.  In contrast, there would be almost 1300 Es, and losing them would lose about 3800 bits of information.
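
Here is the arithmetic for the 10,000-character message as a Python sketch.  It assumes the same frequencies as before (12.7% for E, and 0.074% for Z, which is my assumed value), and it treats every character of the message as a letter.

```python
import math

MESSAGE_LENGTH = 10_000

# Assumed letter frequencies: E from the text, Z taken as 0.074%.
frequency = {"E": 0.127, "Z": 0.00074}

lost = {}
for letter, p in frequency.items():
    expected_count = MESSAGE_LENGTH * p   # how many times the letter appears
    bits_each = math.log2(1 / p)          # information per occurrence
    lost[letter] = expected_count * bits_each
    print(f"{letter}: about {expected_count:.1f} occurrences, "
          f"about {lost[letter]:.0f} bits lost")
```

The total for Z comes out near 77 bits and the total for E near 3800 bits, so the rare letter carries more per occurrence but the common letter carries far more overall.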

 
