Dropdown items
My Academies

Personal Library

Account settings

The Secret Life of Numbers: 50 Easy Pieces on How Mathematicians Work and Think (2006)

Chapter: 37 Compressing the Divine Comedy (Lingusitics)

Visit NAP.edu/10766 to get more information about this book, to buy it in print, or to download it as a free PDF.

Previous chapter Next chapter
Page of 210
Search this publication

Previous Chapter: 36 A Dollar Isn’t Always Worth a Dollar (Insurance)

Page 149 Cite

Suggested Citation: "37 Compressing the Divine Comedy (Lingusitics)." George G. Szpiro. 2006. The Secret Life of Numbers: 50 Easy Pieces on How Mathematicians Work and Think. Washington, DC: Joseph Henry Press. doi: 10.17226/11543.

37
Compressing the Divine Comedy (Lingusitics)

The amounts of data we want to store on our hard disks expand even faster than the rapidly increasing capacity of the storage media. Therefore, software solutions are sought that enable us to pack data ever more densely onto the disks, thereby overcoming hardware limitations. But compression techniques can have unexpected applications.

To understand data compression, it is necessary to understand the notion of entropy. In physics, entropy is a measure of the disorder in a system—for example, in a gas. In telecommunications, entropy is a measure of the information content of a message. A message consisting, for example, of 1,000 repetitions of the number 0 has very little information content and a very low entropy. It can be compressed to the concise formula “1,000 times 0.” On the other hand, a completely random sequence of ones and zeros has very high entropy. It cannot be compressed at all, and the only way to store this string is to repeat every character.

Relative entropy indicates how much storage space is wasted if a sequence of characters is compressed with a method that was optimized for a different sequence. The Morse code, optimized for the English language, may serve as an example. The letter occurring most frequently in English, “e,” was allocated the shortest code: a dot. Letters that appear rarely are allocated longer codes—for example, “– – . –” for “q.” For languages other than English the Morse code is not ideal because the lengths of the codes do not correspond to the frequency of the letters. Relative entropy then measures how many additional dots and dashes are needed to transmit, say, an Italian text with a code that is optimal for the English language.

Most data compression routines are based on an algo-

Page 150 Cite

rithm that was developed in the late 1970s by two Israeli scientists from the Technion in Haifa. The method that Abraham Lempel, a computer scientist, and Jacob Ziv, an electrical engineer, devised relies on the fact that very often identical strings of bits and bytes reoccur in a file. At a string’s first appearance in the text, it is entered into a kind of dictionary. When the same string reoccurs, a pointer simply refers to the appropriate place in the dictionary. Since the pointer takes up less space than the sequence itself, the text is compressed. But there is more. Preparation of the table that lists all the strings does not follow the compilation rules of a standard dictionary. Rather it adapts itself to the specific file that is to be compressed. The algorithm “learns” which strings occur most often and then adapts the compression to it. With increasing file size, the space needed to store it decreases toward the text’s entropy.

There are no limits to the imaginative use of computers in science. Compression algorithms too can be applied to areas other than the space-saving storage of computer files. Two mathematicians and a physicist—Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto from the University La Sapienza in Rome—decided to put the Lempel–Ziv algorithm to work. Their aim was to identify the authors of pieces of literature. Ninety texts written by 11 Italian authors (including Dante Alighieri and Pirandello) served as the raw material. The text of a certain author was chosen, and two short texts of equal lengths were appended: one from the same author and one from another author. Both files were fed into compression programs—such as the ubiquitous WinZip program—and the scientists checked how much storage space each of them required. They conjectured that the relative entropy of the combined text would give an indication as to the authorship of the anonymous text. If both texts were written by the same author, the algorithm would require less storage space than if the attached text was written by a different author. In the latter case the relative entropy would be higher since the algorithm would have to consider the different styles and different words used by the two au-

Page 151 Cite

thors. As a consequence, it would use more space to store the file. Hence, the shorter the compressed file of the combined texts, the more likely it was that the original text and the appended text derived from the same author.

The results of the experiment were nothing less than astounding. In close to 95 percent of the cases the compression programs allowed the correct identification of the author.

In their excitement over the success of their newly found approach the three scientists failed to notice, or at least forgot to mention in their bibliography, that their method was not quite as novel as they had imagined. In fact, they were not the first ones to think that mathematical methods could be used to attribute literary texts to their authors. George Zipf, a professor of linguistics at Harvard, had already studied such issues as word frequency in 1932. And the Scotsman George Yule had shown in 1944, in a paper entitled “The Statistical Study of Literary Vocabulary,” how he was able to attribute the manuscript “De imitatione Christi” to the well-known mystic Thomas à Kempis, who lived in Holland in the 15th century. And of course mention must be made of the “Federalist Papers” from the 18th century, whose authorship (Alexander Hamilton, James Madison, and John Jay) was ascertained in 1964 by American statisticians R. Frederick Mosteller and David L. Wallace.

Since everything had worked out extremely nicely, Benedetto, Caglioti, and Loreto decided to undertake another experiment. They analyzed the degrees of affinity between different languages. Two languages that belong to the same linguistic family should have low relative entropy. Therefore it should be possible to compress a combination of two texts more efficiently if they are written in languages that are related than if they are written in two languages that do not belong to the same family. The scientists analyzed 52 different European languages. Again their efforts were successful. Using the zipping program, they were able to attribute each tongue to its correct linguistic group. Italian and French, for example, have low relative entropy and therefore belong to the same

Page 152 Cite

linguistic group. Swedish and Croatian, on the other hand, have high relative entropy and must come from different linguistic groups. WinZip was even able to identify Maltese, Basque, and Hungarian as isolated languages, which do not belong to any of the known linguistic groups.

The success of their experiments made the three scientists optimistic that measuring relative entropy with zipping software may also work with other data strings, like DNA sequences or stock market movements.

Page 153 Cite

THE PROOF OF THE PUDDING

The method described above inspired this writer to a test. The text samples were articles I wrote as part of my journalistic brief for the Neue Zürcher Zeitung, a major Swiss daily newspaper. There were 18 pieces covering events in Israel and comprising some 14,000 words and 105,000 characters. After removing all titles and subtitles, the texts were stored as an Ascii file and compressed by WinZip. I took one look at the result and had the fright of my life. The compression had downsized the original texts, which I had written with painstaking effort in the course of an entire month, by a full two-thirds. The inescapable conclusion was that only 33 percent of the original texts contained significant information, while two-thirds were pure entropy. In other words, they were completely redundant.

I sought to console myself with the argument that it must be the competent sequencing of his words that provided significant information and not the text as such. To prove this face-saving thesis I put the 14,000 words into alphabetical order and then compressed them again. Lo and behold: The alphabetically ordered sequence of words could be compressed by more than 80 percent and thus offered only 20 percent information. (This, of course, was not surprising, since the words “Israel” or “Israeli” appeared some 231 times, and the word “Palestinian” in all its derivations appeared a total of 195 times.) What this shows is that putting the words into sensible order—in a way that could only be done by a competent journalist—offers some 13 percent more information than a simple dictionary. It was not a huge consolation, but it sufficed to let me breathe a sigh of relief.

But another blow was to come. A randomized collection of the 14,000 words could only be compressed by 60 percent. Comparing that to the compression rate of 66 percent for the actual texts left the distinct impression that a totally random collection of words contained more information than did the actual articles.

For the actual experiment three sample texts were used:

Page 154 Cite

two long articles containing 1,000 words each, written by me and by Stefan, an editor of the newspaper, and a short one, containing 50 words, written by me. The shorter text was appended to each of the two longer texts and both files were compressed.

The results matched those gathered by the Italian scientists: When my short text, which comprised 462 characters, was attached to my long text, WinZip required 159 additional characters. When attached to Stefan’s long text, the compression program needed another 209 characters. Thus it was proven that the shorter text was written not by Stefan but by yours truly.