RANDOMNESS, ENTROPY, AND LANGUAGE
The concept of entropy plays a central role in statistical physics. While the connection between randomness and entropy is always mentioned in introductory texts, the calculation of the entropy in a specific simulation can offer much additional insight. For example, the calculation of the entropy of a dilute gas as it approaches equilibrium provides a convincing demonstration that systems really do evolve towards configurations which maximize the entropy. The concept of entropy can also be profitably applied to many problems outside of physics. In particular, it has some nice applications to problems in language and information theory.
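For instance, a minimal sketch of such a demonstration (assuming non-interacting particles in a two-dimensional box, with the particle number, grid size, and time step chosen arbitrarily for illustration) might release the gas in one half of the box and track a coarse-grained entropy computed from the fraction of particles in each cell of a grid:

```python
import numpy as np

def coarse_grained_entropy(positions, n_cells=10, box=1.0):
    """Coarse-grained entropy S = -sum_i p_i ln p_i of the particle
    distribution over an n_cells x n_cells grid covering the box."""
    counts, _, _ = np.histogram2d(positions[:, 0], positions[:, 1],
                                  bins=n_cells, range=[[0, box], [0, box]])
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                      # convention: 0 ln 0 = 0
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
n, box, dt = 1000, 1.0, 0.01

# Start with all particles in the left half of the box (a low-entropy state).
pos = rng.uniform([0, 0], [box / 2, box], size=(n, 2))
vel = rng.normal(0.0, 1.0, size=(n, 2))

for step in range(300):
    pos += vel * dt
    # Reflect particles off the walls of the box.
    for d in range(2):
        hit = (pos[:, d] < 0) | (pos[:, d] > box)
        pos[:, d] = np.clip(pos[:, d], 0, box)
        vel[hit, d] *= -1
    if step % 50 == 0:
        print(f"t = {step * dt:5.2f}   S = {coarse_grained_entropy(pos):.3f}")

# S rises from roughly ln(50) (half the cells occupied) toward the
# maximum ln(100) for a 10 x 10 grid as the gas fills the box.
```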
To appreciate these connections, one first needs a calculational definition of entropy, and this in turn depends somewhat on the nature of the problem. A useful example is the entropy of a string of digits. Here $\pi$ is a convenient source of digits. It is often stated that the digits of $\pi$ are distributed ``randomly,'' but how do we know that this is really the case? One way to test this is to calculate the entropy as defined by
\[
  S = -\sum_i p_i \ln p_i ,
\]
where $p_i$ is the probability of finding the digit $i$. For the case of ``perfect'' randomness, $p_i = 1/10$ for all 10 possible digits, and one therefore expects to find $S = \ln 10 \approx 2.30259$. Many digits of $\pi$ are readily available for actual calculations of $S$, and one indeed finds a value of $S = 2.30258\ldots$ for the first 500,000 digits of $\pi$.
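A small Python sketch of this digit-entropy calculation is given below; it uses the mpmath library only as a convenient source of digits of $\pi$ (any long digit string would serve equally well), and the number of digits is an arbitrary choice:

```python
from collections import Counter
from math import log

from mpmath import mp   # arbitrary-precision library, used here only to supply digits of pi

def digit_entropy(digits):
    """S = -sum_i p_i ln p_i for the digits 0-9 in the given string."""
    counts = Counter(d for d in digits if d.isdigit())
    n = sum(counts.values())
    return -sum((c / n) * log(c / n) for c in counts.values())

# Generate (say) 100,000 digits of pi and strip the leading "3."
ndigits = 100_000
mp.dps = ndigits + 10
digits = mp.nstr(mp.pi, ndigits).replace("3.", "", 1)

print(f"S = {digit_entropy(digits):.5f}   (ln 10 = {log(10):.5f})")
```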
This calculation then leads naturally to a discussion of important statistical concepts, including the $\chi^2$ test, as well as other measures of randomness.
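For the $\chi^2$ test, one compares the observed digit counts $O_i$ with the expected counts $E_i = N/10$ through $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$; for truly random digits this statistic should be comparable to the number of degrees of freedom (9 for the ten digit values). A minimal sketch:

```python
from collections import Counter

def chi_square_digits(digits):
    """Chi-square statistic for the hypothesis that all 10 digits are equally likely."""
    counts = Counter(d for d in digits if d.isdigit())
    n = sum(counts.values())
    expected = n / 10.0
    return sum((counts.get(str(i), 0) - expected) ** 2 / expected for i in range(10))

# Example with a short literal string; in practice one would pass a long
# digit string such as the pi digits generated in the previous sketch.
print(chi_square_digits("3141592653589793238462643383279502884197"))
```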
From $\pi$ one can then move to language and apply the same definition of $S$ to strings of characters, now using the base-2 logarithm so that $S$ is measured in bits. Here the interest is not in random strings, but in the properties of real speech. While the precise value depends on the choice of character set (i.e., how one treats upper and lower case characters, etc.), the entropy of real speech is far below that found for a random string. For example, using a 32-element character set (obtained by ignoring the difference between upper and lower case, and using only a few punctuation characters), for Hamlet one finds approximately 4 bits per character, as compared to the value of 5 bits/char found for a random string of characters (this value is expected since $2^5 = 32$, the size of our character set).
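A sketch of this character-entropy calculation might look as follows; the particular 32-element alphabet (the 26 case-folded letters, the space, and five punctuation marks) is one possible choice, and hamlet.txt stands for a hypothetical plain-text file of the play:

```python
from collections import Counter
from math import log2
import string

def text_entropy_bits(text):
    """First-order entropy in bits per character over a 32-element alphabet:
    the 26 letters (case folded), the space, and five punctuation marks."""
    keep = set(string.ascii_lowercase) | set(" .,;'!")   # 32 symbols in total
    chars = [c for c in text.lower() if c in keep]
    counts = Counter(chars)
    n = len(chars)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# hamlet.txt is a hypothetical plain-text file of the play; any long
# sample of ordinary English text gives a similar result.
with open("hamlet.txt") as f:
    print(f"{text_entropy_bits(f.read()):.2f} bits/char")
```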
It is certainly not surprising that ``real'' speech is not random; after all, it contains information. This leads to a consideration of information theory, including how to measure the information in speech, how to efficiently encode this information (e.g., Huffman codes), and the concept of higher-order entropies. These entropy functions are not common in physics, but can be derived very naturally from the (first-order) entropy expression given above. These functions have a nice physical interpretation involving the probabilities of finding letter pairs, triplets, etc., in a string of characters. They also lead to the intriguing result that the ``true'' amount of information in real speech approaches one bit/char.
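One way to explore this numerically is to compute block entropies $S_n$ over all sequences of $n$ consecutive characters and examine $S_n/n$, which decreases as longer-range correlations are included; the sketch below reuses the hypothetical hamlet.txt file and the same 32-symbol alphabet as above, and reliable estimates for larger $n$ would require much longer samples of text than a single play.

```python
from collections import Counter
from math import log2
import string

def block_entropy(chars, n):
    """S_n = -sum_b p_b log2 p_b over all observed blocks of n consecutive characters."""
    blocks = Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * log2(c / total) for c in blocks.values())

keep = set(string.ascii_lowercase) | set(" .,;'!")   # same 32-symbol alphabet as above
with open("hamlet.txt") as f:                        # hypothetical plain-text file
    chars = [c for c in f.read().lower() if c in keep]

# S_n / n estimates the entropy per character; it decreases as n grows,
# because correlations among letter pairs, triplets, ... carry part of the structure.
for n in (1, 2, 3, 4):
    print(f"n = {n}:  S_n / n = {block_entropy(chars, n) / n:.2f} bits/char")
```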
These ideas can be applied to a variety of topics, including the design of codes and compression schemes, and problems in decryption.