RANDOMNESS, ENTROPY, AND LANGUAGE
The concept of entropy plays a central role in statistical physics. While the connection between randomness and entropy is always mentioned in introductory texts, the calculation of the entropy in a specific simulation can offer much additional insight. For example, the calculation of the entropy of a dilute gas as it approaches equilibrium provides a convincing demonstration that systems really do evolve towards configurations which maximize the entropy. The concept of entropy can also be profitably applied to many problems outside of physics. In particular, it has some nice applications to problems in language and information theory.
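For instance, a minimal sketch of such a demonstration (assuming non-interacting particles in a two-dimensional box, with the particle number, grid size, and time step chosen arbitrarily for illustration) might release the gas in one half of the box and track a coarse-grained entropy computed from the fraction of particles in each cell of a grid:

```python
import numpy as np

def coarse_grained_entropy(positions, n_cells=10, box=1.0):
    """Coarse-grained entropy S = -sum_i p_i ln p_i of the particle
    distribution over an n_cells x n_cells grid covering the box."""
    counts, _, _ = np.histogram2d(positions[:, 0], positions[:, 1],
                                  bins=n_cells, range=[[0, box], [0, box]])
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                      # convention: 0 ln 0 = 0
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
n, box, dt = 1000, 1.0, 0.01

# Start with all particles in the left half of the box (a low-entropy state).
pos = rng.uniform([0, 0], [box / 2, box], size=(n, 2))
vel = rng.normal(0.0, 1.0, size=(n, 2))

for step in range(300):
    pos += vel * dt
    # Reflect particles off the walls of the box.
    for d in range(2):
        hit = (pos[:, d] < 0) | (pos[:, d] > box)
        pos[:, d] = np.clip(pos[:, d], 0, box)
        vel[hit, d] *= -1
    if step % 50 == 0:
        print(f"t = {step * dt:5.2f}   S = {coarse_grained_entropy(pos):.3f}")

# S rises from roughly ln(50) (half the cells occupied) toward the
# maximum ln(100) for a 10 x 10 grid as the gas fills the box.
```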
To appreciate these connections, one first needs a calculational definition of entropy, and this in turn depends somewhat on the nature of the problem. A useful example is the entropy of a string of digits. Here $\pi$ is a convenient source of digits. It is often stated that the digits of $\pi$ are distributed ``randomly,'' but how do we know that this is really the case? One way to test this is to calculate the entropy as defined by
\[
  S = -\sum_i p_i \ln p_i ,
\]
where $p_i$ is the probability of finding the digit $i$. For the case of ``perfect'' randomness, $p_i = 1/10$ for all 10 possible digits, and one therefore expects to find $S = \ln 10 \approx 2.30259$. Many digits of $\pi$ are readily available for actual calculations of $S$, and one indeed finds a value of $S = 2.30258\ldots$ for the first 500,000 digits of $\pi$.
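A small Python sketch of this digit-entropy calculation is given below; it uses the mpmath library only as a convenient source of digits of $\pi$ (any long digit string would serve equally well), and the number of digits is an arbitrary choice:

```python
from collections import Counter
from math import log

from mpmath import mp   # arbitrary-precision library, used here only to supply digits of pi

def digit_entropy(digits):
    """S = -sum_i p_i ln p_i for the digits 0-9 in the given string."""
    counts = Counter(d for d in digits if d.isdigit())
    n = sum(counts.values())
    return -sum((c / n) * log(c / n) for c in counts.values())

# Generate (say) 100,000 digits of pi and strip the leading "3."
ndigits = 100_000
mp.dps = ndigits + 10
digits = mp.nstr(mp.pi, ndigits).replace("3.", "", 1)

print(f"S = {digit_entropy(digits):.5f}   (ln 10 = {log(10):.5f})")
```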
This calculation then leads naturally to a discussion of important statistical concepts, including the $\chi^2$ test, as well as other measures of randomness.
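For the $\chi^2$ test, one compares the observed digit counts $O_i$ with the expected counts $E_i = N/10$ through $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$; for truly random digits this statistic should be comparable to the number of degrees of freedom (9 for the ten digit values). A minimal sketch:

```python
from collections import Counter

def chi_square_digits(digits):
    """Chi-square statistic for the hypothesis that all 10 digits are equally likely."""
    counts = Counter(d for d in digits if d.isdigit())
    n = sum(counts.values())
    expected = n / 10.0
    return sum((counts.get(str(i), 0) - expected) ** 2 / expected for i in range(10))

# Example with a short literal string; in practice one would pass a long
# digit string such as the pi digits generated in the previous sketch.
print(chi_square_digits("3141592653589793238462643383279502884197"))
```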
From $\pi$ one can then move to language and apply the same definition of $S$ to strings of characters, now using the base-2 logarithm so that $S$ is measured in bits. Here the interest is not in random strings, but in the properties of real speech. While the precise value depends on the choice of character set (i.e., how one treats upper and lower case characters, etc.), the entropy of real speech is far below that found for a random string. For example, using a 32-element character set (obtained by ignoring the difference between upper and lower case, and using only a few punctuation characters), for Hamlet one finds approximately 4 bits per character, as compared to the value of 5 bits/char found for a random string of characters (this value is expected since $2^5 = 32$, the size of our character set).
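A sketch of this character-entropy calculation might look as follows; the particular 32-element alphabet (the 26 case-folded letters, the space, and five punctuation marks) is one possible choice, and hamlet.txt stands for a hypothetical plain-text file of the play:

```python
from collections import Counter
from math import log2
import string

def text_entropy_bits(text):
    """First-order entropy in bits per character over a 32-element alphabet:
    the 26 letters (case folded), the space, and five punctuation marks."""
    keep = set(string.ascii_lowercase) | set(" .,;'!")   # 32 symbols in total
    chars = [c for c in text.lower() if c in keep]
    counts = Counter(chars)
    n = len(chars)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# hamlet.txt is a hypothetical plain-text file of the play; any long
# sample of ordinary English text gives a similar result.
with open("hamlet.txt") as f:
    print(f"{text_entropy_bits(f.read()):.2f} bits/char")
```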
It is certainly not surprising that ``real'' speech is not random; after all, it contains information. This leads to a consideration of information theory, including how to measure the information in speech, how to efficiently encode this information (e.g., Huffman codes), and the concept of higher-order entropies. These entropy functions are not common in physics, but can be derived very naturally from the (first-order) entropy expression given above. These functions have a nice physical interpretation involving the probabilities of finding letter pairs, triplets, etc., in a string of characters. They also lead to the intriguing result that the ``true'' amount of information in real speech approaches one bit/char.
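One way to explore this numerically is to compute block entropies $S_n$ over all sequences of $n$ consecutive characters and examine $S_n/n$, which decreases as longer-range correlations are included; the sketch below reuses the hypothetical hamlet.txt file and the same 32-symbol alphabet as above, and reliable estimates for larger $n$ would require much longer samples of text than a single play.

```python
from collections import Counter
from math import log2
import string

def block_entropy(chars, n):
    """S_n = -sum_b p_b log2 p_b over all observed blocks of n consecutive characters."""
    blocks = Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * log2(c / total) for c in blocks.values())

keep = set(string.ascii_lowercase) | set(" .,;'!")   # same 32-symbol alphabet as above
with open("hamlet.txt") as f:                        # hypothetical plain-text file
    chars = [c for c in f.read().lower() if c in keep]

# S_n / n estimates the entropy per character; it decreases as n grows,
# because correlations among letter pairs, triplets, ... carry part of the structure.
for n in (1, 2, 3, 4):
    print(f"n = {n}:  S_n / n = {block_entropy(chars, n) / n:.2f} bits/char")
```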
These ideas can be applied to a variety of topics, including the design of codes and compression schemes, and problems in decryption.