Esperanto language detection

Pharoah, 19 February 2010 11:12:44

As part of that science project that I'm (still) working on, I'm writing a program to detect whether a bit of text is Esperanto or not. This has useful ramifications, not just for my project, but for other purposes as well (for example, we could write a web crawler that only indexed Esperanto pages... or one that found all the pages on Vikipedio without any Esperanto language links). Anyway, what I'm doing right now is a simple little algorithm that checks, for each word, whether it uses only valid Esperanto letters, and whether it has a valid word ending or is a "special" short word.

It works with the x- and h-systems, and seems to give a fairly consistent set of results. I have yet to see a well-formed EO sentence get less than about 0.8 with the ranking algorithm I've set up, while non-EO sentences usually get less than 0.7.
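For anyone following along without the files, here is a minimal sketch of that kind of per-word check. It is not the code from filter.py: the letter set, the short-word list, the ending list, and the function names are all assumptions made up for illustration, and the real lists live in reference.py.

```python
import re

# Assumed data for illustration only; the author's real lists are in reference.py.
ESPERANTO_LETTERS = set("abcdefghijklmnoprstuvzĉĝĥĵŝŭ")
SHORT_WORDS = {"la", "kaj", "de", "en", "al", "ke", "mi", "vi", "li", "ne", "por", "sed"}
ENDINGS = ("ojn", "ajn", "oj", "aj", "on", "an", "en",
           "as", "is", "os", "us", "o", "a", "e", "i", "u")

# x-system and h-system digraphs mapped to the accented letters.
DIGRAPHS = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
            "ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}


def normalize(word):
    """Lowercase a word and convert x-/h-system digraphs to accented letters."""
    word = word.lower()
    return re.sub(r"[cghjsu]x|[cghjs]h", lambda m: DIGRAPHS[m.group()], word)


def looks_esperanto(word):
    """True if the word uses only Esperanto letters and is a known short
    word or carries one of the listed endings."""
    w = normalize(word)
    if not w or any(ch not in ESPERANTO_LETTERS for ch in w):
        return False
    return w in SHORT_WORDS or w.endswith(ENDINGS)
```

A check this loose also accepts plenty of foreign words (English "the" passes, since it ends in -e), which is why it only makes sense as one input to a score over the whole text.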

If anyone's interested, here is the Python code in two modules. The first, filter.py, does all the work. The second, reference.py, holds the lists of endings and short words to keep clutter out of the main program. You need Python 3.x to make these work:

http://www.filesavr.com/filter
http://www.filesavr.com/reference

Right now you just type things in and it scores them, but it can do much more than that since it's fairly flexible (it could be made into a webapp, for example). I'm still working on this, and I plan to add tests based on letter frequencies and the like. If you've got any feedback or suggestions, it could really help me out.

Oŝo-Jabe, 19 February 2010 15:01:47

I don't know if there's anything to it, but searches on Google with "[keyword] kaj estas" are almost always Esperanto pages. Perhaps if you could find other words like "kaj" that are more or less unique to Esperanto, you could make your language detector even more accurate.

ceigered, 19 February 2010 17:39:43

Oŝo-Jabe: I don't know if there's anything to it, but searches on Google with "[keyword] kaj estas" are almost always Esperanto pages. Perhaps if you could find other words like "kaj" that are more or less unique to Esperanto, you could make your language detector even more accurate.

In that case I'm assuming there'd have to be a sufficiently large range of "EO only" vocab to choose from, and the system would still need to be able to validate a text even if it's, say, missing the word "kaj" or something similar.
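One way to treat marker words as optional evidence rather than a requirement is to count hits and let other tests carry the text when the count is zero. A rough sketch, where the word list and the function name are assumptions picked for illustration:

```python
import re

# Assumed list of words that are rare outside Esperanto; a real list would be longer.
MARKER_WORDS = {"kaj", "estas", "ĉu", "ankaŭ", "tiu", "ĝi", "aŭ"}


def marker_hits(text):
    """Count how many words in the text appear in the marker list."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for w in words if w in MARKER_WORDS)
```

A hit count of zero says nothing by itself on a short sample, so it should only ever raise the overall score, never veto a text.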

Rogir, 19 February 2010 21:12:55

Of course, pages that use Unicode should be very easy to detect.

Pharoah, 21 February 2010 00:10:04

Here is the algorithm that the program uses in a nice flowchart. You can see that it does check right off the bat whether a word has one of the special Esperanto Unicode letters in it, because that makes it pretty certain that it's EO.

For the ones that it can't detect, I intend to use a bunch of different techniques to guess whether the text is EO or not. Oŝo-Jabe's suggestion of looking for uniquely Esperanto words would be a good start. I'm going to weigh these as part of the whole "Esperanto-ness score" of the text, so even if a page lacks "kaj" or something, it can still pass based on letter frequencies or on all of the words having EO word endings. Some of these metrics will be "worth more" to the program than others.
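A weighted average is one simple way to combine metrics like these. The sketch below is only an illustration: the metric names, the weights, and the assumption that each metric is already scaled to the range 0 to 1 are all made up here.

```python
def esperanto_ness(metrics, weights):
    """Weighted average of per-test scores.

    metrics: dict mapping a test name to a score between 0 and 1 (assumed scale).
    weights: dict mapping the same names to how much each test counts.
    """
    total_weight = sum(weights[name] for name in metrics)
    weighted_sum = sum(metrics[name] * weights[name] for name in metrics)
    return weighted_sum / total_weight


# Hypothetical numbers: no marker words found, but endings and letters look good.
example = esperanto_ness(
    {"endings": 0.9, "markers": 0.0, "letters": 0.8},
    {"endings": 3, "markers": 1, "letters": 2},
)
```

With these made-up weights the missing marker words pull the score down only slightly, because the ending test counts three times as much.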

I think I may be able to train my algorithm to set the best weight and threshold values for each of these tests without manually intervening. I'll feed it samples from the tekstaro and from many other languages (possibly pulled from those languages' Wikipedia sites), so the program will already know what language it's getting. Then I can have the program run my test function with different constant values and work out which of these weed out the most foreign texts while still keeping all the EO texts.
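The search described above could look roughly like this for a single threshold. The function name and the sample scores are invented for illustration; real scores would come from running the test function on labelled texts.

```python
def best_threshold(scored_samples, candidates):
    """Pick the threshold that rejects the most foreign texts
    while still accepting every known Esperanto text.

    scored_samples: list of (score, is_esperanto) pairs from labelled texts.
    candidates: threshold values to try.
    Returns (threshold, foreign_texts_rejected), or None if no candidate
    keeps all the Esperanto texts.
    """
    best = None
    for t in candidates:
        # Skip any threshold that would throw out a known EO text.
        if any(is_eo and score < t for score, is_eo in scored_samples):
            continue
        rejected = sum(1 for score, is_eo in scored_samples
                       if not is_eo and score < t)
        if best is None or rejected > best[1]:
            best = (t, rejected)
    return best


# Made-up scores, purely to show the call.
samples = [(0.92, True), (0.85, True), (0.81, True),
           (0.78, False), (0.66, False), (0.40, False)]
result = best_threshold(samples, [0.5, 0.6, 0.7, 0.75, 0.8, 0.85])
```

The same loop extends to weights by trying combinations of values, though the number of combinations grows quickly with each extra constant.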

Rogir, 21 February 2010 00:58:29

You should allow for a small number of foreign words in Esperanto texts, otherwise you might miss some.

Pharoah, 21 February 2010 03:02:47

Rogir: You should allow for a small number of foreign words in Esperanto texts, otherwise you might miss some.

I figured that I'd end up with a few: names, places, things like that. The program that I have now averages the scores of all the words. A definitely EO word gets a score of 2, a possibly EO word gets 1, and a non-EO word gets 0. I've found that when a text sample of any reasonable length (say a few sentences) scores above 0.75, there's a very good chance that it's either EO or Italian. Everything else that I've tried (English, Spanish, Portuguese, Czech, and even Pinyin without tone marks) tends to get less than 0.75, while EO texts consistently seem to score above 0.8.
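To show why averaging tolerates the odd foreign word, here is a small sketch of the 2/1/0 scheme. The label names and the function are assumptions for illustration, and the sketch leaves the average on the raw 0 to 2 scale, since how the real program arrives at figures like 0.75 is not spelled out here.

```python
# Points per word category, as described above; the label names are invented.
POINTS = {"definite": 2, "possible": 1, "foreign": 0}


def average_score(labels):
    """Average the per-word points for a list of word labels."""
    if not labels:
        return 0.0
    return sum(POINTS[label] for label in labels) / len(labels)


# One foreign name among nine good words barely moves the average.
mostly_eo = average_score(["possible"] * 9 + ["foreign"])
```

Because every word contributes equally, a single name or place name costs one word's worth of points and nothing more.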

Because Italian seems to use about the same letters as Esperanto, has no diacritics, and tends to produce words that end in vowels, it might slip through the cracks. However, a superficial look at Italian texts shows that it has distinctive spelling features, such as plenty of double t, s, and l pairs, while such things are much less common in EO. Testing letter pair frequency should be quite sufficient, I think, to rule out Italian texts.
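A letter-pair test along those lines might be sketched like this. The function name, the default letters, and the choice to count words containing a doubled letter (rather than raw pair counts) are all assumptions for illustration.

```python
import re


def doubled_pair_rate(text, letters="tsl"):
    """Fraction of words containing a doubled letter such as tt, ss or ll."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    pattern = re.compile("|".join(ch * 2 for ch in letters))
    hits = sum(1 for w in words if pattern.search(w))
    return hits / len(words)


italian = doubled_pair_rate("Tutti gli studenti della classe sono arrivati nella città")
esperanto = doubled_pair_rate("Ĉiuj lernantoj de la klaso alvenis en la urbon")
```

On these two sample sentences the Italian one has doubled letters in more than half its words and the Esperanto one has none, but a usable cutoff would have to come from much larger samples.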

ceigered, 21 February 2010 16:11:54

You seem very well prepared. In terms of getting rid of Italian, check for double z's. I know of occasions where you can have a double t or s in Esperanto, but a double z is much rarer (an exception being mezzono, "mid-zone"). Also, I'm not sure what the limitations are, but if you could identify things like "-zzione" then you can easily identify -tion words in Italian. Of course, the biggest problem with looking for certain letter combinations is that you generally need a whole recurring element of an Italian word (e.g. the previous example, -zzione, and è, or things like -ssimo). Also, "uo" would appear fairly rarely in EO save in certain cases, as would ai, ei, etc.

BTW - How does it go vs. bilingual texts, or texts with EO words in them/EO quotes?

darkweasel, 21 February 2010 19:30:11

To get rid of Italian, you might check the occurrences of the letter K, which is much more common in Esperanto than in Italian.

erinja, 21 February 2010 23:02:16

I'd use the letter J to get rid of Italian. Except for foreign words, the letter J is only found in old-style Italian texts (from 100 years ago or so, before a spelling reform changed those J's into I's).

Also, words ending in -n. Italian has a lot fewer. Any word ending in -jn is categorically not Italian.

Any word ending in -aj is likely Esperanto, and even more so -ajn. Likewise with -oj and -ojn.
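The J, K and ending ideas from the last few posts could be turned into numbers along these lines. This is only a sketch: the function name, the feature names, and the decision to measure rates per word and per letter are assumptions.

```python
import re

PLURAL_ENDINGS = ("ojn", "ajn", "oj", "aj")


def jk_features(text):
    """Rates of -aj/-oj/-ajn/-ojn endings per word and of J and K per letter."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {"plural_rate": 0.0, "j_rate": 0.0, "k_rate": 0.0}
    letters = "".join(words)
    return {
        "plural_rate": sum(w.endswith(PLURAL_ENDINGS) for w in words) / len(words),
        "j_rate": letters.count("j") / len(letters),
        "k_rate": letters.count("k") / len(letters),
    }


eo = jk_features("Mi vidas la belajn hundojn kaj katojn")
it = jk_features("I cani e i gatti sono belli")
```

Note that "kaj" itself counts toward the ending rate here, since it ends in -aj. The Italian sample scores zero on all three features, which is the kind of gap these tests are meant to exploit.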

Google Translate most often assumes Esperanto is a Slavic language, perhaps because of the frequent J's, and the placement of those J's.