
Universal Entropy of Word Ordering

by kaŝperanto, 25 March 2015

Posts: 4

Language: English

kaŝperanto, 25 March 2015 20:25:33

In my random Internet scouring I came across this article:
http://www.wired.com/2011/05/universal-entropy/

For those who don't know, entropy is a measure of the amount of information conveyed by a message. They measured the entropy of a large amount of text before and after randomizing the words of the text, and found that the loss of information as a result of randomization was universally 3.5 bits per word.
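As a rough illustration of the definition (a minimal sketch, not the article's method): Shannon entropy can be computed from word frequencies alone as H = -sum(p * log2(p)). The sample sentence and the function name `unigram_entropy_bits` are my own assumptions. Note that this frequency-only figure ignores word order, so shuffling leaves it unchanged; a loss caused by randomization can only show up in an estimate that takes context into account.

```python
import math
import random
from collections import Counter

def unigram_entropy_bits(words):
    """Shannon entropy, in bits per word, of the word frequency distribution."""
    counts = Counter(words)
    total = len(words)
    # Sort the counts so the floating-point sum does not depend on word order.
    return -sum((c / total) * math.log2(c / total) for c in sorted(counts.values()))

# Hypothetical sample text, chosen only for illustration.
words = "the cat sat on the mat and the dog sat on the log".split()
shuffled = words[:]
random.Random(0).shuffle(shuffled)

h_original = unigram_entropy_bits(words)
h_shuffled = unigram_entropy_bits(shuffled)
print(f"original: {h_original:.3f} bits/word, shuffled: {h_shuffled:.3f} bits/word")
```

Both numbers come out the same because the word counts are identical; only the ordering differs.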

I find it fascinating that essentially disparate languages could all convey the same amount of information through word order. The author seems to think that this may be an indication of some sort of shared limitation we all face.

vejktoro, 26 March 2015 05:58:11

kaŝperanto:
For those who don't know, entropy is a measure of the amount of information conveyed by a message...
I find it fascinating that essentially disparate languages could all convey the same amount of information through word order. The author seems to think that this may be an indication of some sort of shared limitation we all face.
Read the article, because the given definition of 'entropy' didn't seem to have much to do with entropy as used in thermodynamics and scientific thought in general, and I have never heard tell of its use in the field of linguistics. The author uses one of the word's standard definitions, "The tendency for all matter and energy in the universe to evolve toward a state of inert uniformity; a measure of the disorder or randomness in a closed system," and re-defines it for the purpose of this study to mean something like, "the measure of how evenly the basic parts of speech are distributed in a standard utterance of a given language." The author figures that there should be an inverse relationship between a language's lexical order/disorder (his entropy measure) and the semantic tax placed on the morphemes themselves.

Turns out not much difference was found. These results make some sense. Two obvious reasons: most languages have a preferred word order even if it is not needed for comprehension, and, if sentence boundaries were ignored, the syntactic relations between the parts of speech are equally disturbed no matter what the system of the target language. I don't really understand what this study accomplished. Seems like nothing.

Besides, the idea that the biological mechanism behind human grammar functions within a restrictive set of 'operators' that bind us cognitively is an old one. An overwhelming canon of inquiry already exists, going back to Chomsky's transformational grammar in the 50s, running through many related theories of Universal Grammar, and continuing today in what is currently referred to as the Minimalist Program.

Check it out.

kaŝperanto, 26 March 2015 20:25:07

vejktoro:
Read the article, because the given definition of 'entropy' didn't seem to have much to do with entropy as used in thermodynamics and scientific thought in general, and I have never heard tell of its use in the field of linguistics...
The entropy of Information Theory is only loosely associated with the entropy of Thermodynamics, and could be argued to be more basic (thermal entropy is directly linked to how much information you know about the particles in question). It is a metric of the degree of choice/amount of uncertainty present in a message/event.

As best I can tell, they are using formal information theory (I have a basic understanding from my engineering classes). I read most of the study over lunch today, and they used a compression-based technique to estimate the entropy of the source text. A lossless compressor cannot, on average, compress data to fewer bits than its Shannon entropy, so they looked at the limit the compressed size converges to across several levels of compression and took that as the entropy estimate. This is the only way to make comparisons between sources as different as English and Chinese; there are too many variables to build comparable analytical models. In the paper they mention that languages have many levels of organization that each contribute differently to the information conveyed (roots, conjugations, words, sentences, paragraphs, chapters, ...).
They chose words as the one common unit of comparison between the languages, so the results only apply to word ordering. IMO this technique effectively isolates the contribution of word ordering to the information in written language.
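To make the idea concrete, here is a minimal sketch of a compression-based comparison. It is not the paper's estimator: it uses Python's general-purpose `lzma` compressor as a crude upper bound on entropy, and the toy corpus (a fixed sentence template filled with random words) is a hypothetical stand-in for a real text. The names `bits_per_word` and `toy_corpus` are my own.

```python
import lzma
import random

def bits_per_word(words):
    """Crude entropy estimate: compressed size in bits divided by word count."""
    data = " ".join(words).encode("utf-8")
    return 8 * len(lzma.compress(data)) / len(words)

def toy_corpus(n_sentences, rng):
    """Hypothetical stand-in for a real text: sentences built from one template."""
    adjectives = ["old", "red", "tall", "quiet", "angry", "small", "wise", "lazy"]
    nouns = ["king", "dog", "river", "tree", "child", "ship", "bird", "stone"]
    verbs = ["sees", "finds", "follows", "fears", "loves", "leaves", "calls", "meets"]
    words = []
    for _ in range(n_sentences):
        words += ["the", rng.choice(adjectives), rng.choice(nouns), rng.choice(verbs),
                  "the", rng.choice(adjectives), rng.choice(nouns), "."]
    return words

rng = random.Random(42)
original = toy_corpus(2000, rng)

# Same words, same frequencies, but all ordering information destroyed.
shuffled = original[:]
rng.shuffle(shuffled)

h_original = bits_per_word(original)
h_shuffled = bits_per_word(shuffled)
order_information = h_shuffled - h_original  # bits per word attributable to word order

print(f"original:   {h_original:.2f} bits/word")
print(f"shuffled:   {h_shuffled:.2f} bits/word")
print(f"difference: {order_information:.2f} bits/word")
```

The shuffled version compresses worse because the compressor can no longer exploit the regular sequence of word classes. The absolute numbers from a toy template say nothing about real languages; only the direction of the effect carries over.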

Here is an interesting and approachable bit about the link between compression and entropy (the answer with the grey squares): limits-of-compression

vejktoro:
Turns out not much difference was found. These results make some sense. Two obvious reasons: most languages have a preferred word order even if it is not needed for comprehension, and, if sentence boundaries were ignored, the syntactic relations between the parts of speech are equally disturbed no matter what the system of the target language. I don't really understand what this study accomplished. Seems like nothing.
...
The importance of the discovery lies more in the fact that all the languages sampled, despite widely varying total entropies, displayed the same increase in entropy when the word order was randomized. Their statistical evidence is quite strong, and I'd say an increase of 3.5 bits/word is significant considering the languages normally have around 5-7 bits/word.

Here's a link to the actual paper.

kaŝperanto, 26 March 2015 22:35:15

Elhana2: I don't understand why they randomized the entire text, without respect to grammar constraints and even sentence boundaries.
This was a concern I shared, but technically sentence grouping is still a form of word ordering. We can extract a lot of information about a language by analyzing large collections of whole texts without looking at finer details like sentences or paragraphs. An example would be the letter frequencies of a given language (in all English text you have roughly the same probability of a given letter being a 't'). There are also digram, trigram, and higher-order frequencies, which look at the probabilities of two-letter, three-letter, and longer sequences. The longer the sequences, the more work is involved, but you can get some convincing gibberish English by examining only letter probabilities (including punctuation and spaces as letters).
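Counting these frequencies takes only a few lines. This is a minimal sketch; the sample sentence and the function name `ngram_frequencies` are assumptions for illustration, and a real estimate would need a much larger text.

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Relative frequency of every run of n consecutive characters in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical sample; spaces count as letters, as described above.
sample = "in all english text the letter t turns up at a fairly steady rate"

letter_freq = ngram_frequencies(sample, 1)   # single letters
digram_freq = ngram_frequencies(sample, 2)   # two-letter sequences

# Show the three most frequent single characters in the sample.
print(sorted(letter_freq.items(), key=lambda kv: kv[1], reverse=True)[:3])
```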

This page has a nice intro to the concept of these Markov chains, and the "click for more examples" link has many great examples of the technique.

The "5-gram" generated texts sound practically like English (this is made purely from statistical data on character patterns; the fact that words appear is simple probability):
Fifth-order generated text based on the entire King James Book of Romans: What their lusts there in your father, because also maketh not, it is among ought of his he liveth, if it fulfil things which cause division verily, the gospel of God. For I be saved. For they have mercy. Now if thou boast of patience of envy, murder, debate, deceive tree, which are the righteousness. Much more shall rise thankful; but have you into Spain.
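For anyone who wants to try this, a character-level Markov generator can be sketched roughly as follows. This is my own minimal version, not the code behind the linked page; the sample text and the names `build_model` and `generate` are assumptions, and I use order 3 here only because the sample is far too short for order 5.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each run of `order` characters to the characters that follow it in text."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, order, length, seed_text, rng):
    """Extend seed_text one character at a time using the observed followers."""
    out = seed_text
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # context never seen with a follower: stop early
            break
        out += rng.choice(followers)  # repeats in the list give frequency weighting
    return out

# Hypothetical sample text; a whole book would give far better results.
sample = ("the quick brown fox jumps over the lazy dog while the quiet brown "
          "dog jumps over the quick lazy fox and the fox thanks the dog ")
order = 3
model = build_model(sample, order)
generated = generate(model, order, 80, sample[:order], random.Random(1))
print(generated)
```

Every run of four characters in the output also occurs somewhere in the sample, which is why real words keep appearing even though the generator knows nothing about words.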
It would be interesting to see how randomization limited to within sentences would affect the entropy. My guess is that the increase would be smaller, since a reader familiar with the subject matter can reasonably be expected to predict, to some extent, the meaning of an average jumbled sentence (just as I could guess that the sample text from the 5-gram example above comes from a KJV Bible).
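The two kinds of randomization could be set up roughly like this. It is only a sketch of the shuffling step, with a naive sentence splitter and hypothetical function names; the entropy of each output would still have to be estimated separately, and I make no claim about what the result would be.

```python
import random
import re

def shuffle_within_sentences(text, rng):
    """Shuffle word order inside each sentence, keeping sentence grouping intact."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        words = sentence.split()  # punctuation stays attached to its word
        rng.shuffle(words)
        result.append(words)
    return result

def shuffle_whole_text(text, rng):
    """Shuffle all words of the text, ignoring sentence boundaries."""
    words = text.split()
    rng.shuffle(words)
    return words

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The dog chased the cat! Did the cat run away?"
rng = random.Random(7)

within = shuffle_within_sentences(sample, rng)
whole = shuffle_whole_text(sample, rng)

for words in within:
    print(" ".join(words))
print(" ".join(whole))
```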

I would also be interested to see what the effects would be from ordering at the sentence, paragraph, and higher levels. I'd again bet that you'd see diminishing effects as you reach larger levels.
