Thai language machine translation

Machine translation between many languages has reached the point where the output gives a very good indication of a text’s content, even if it is not exactly publishable. This is not the case for all languages, however. Sometimes the translation gives little idea of what a text is about, apart from a few random vocabulary items. Thai-to-English machine translation is a good case in point.

Consider the following example taken from a Twitter feed that publishes quotes from respected Thai spiritual and political reformer Buddhadasa (@buddhadasa):

เราให้ทานเพื่ออะไร? คำตอบที่ถูกที่สุด คือเราให้ทานเพื่อหลุดพ้น หลุดพ้นอย่างไร หลุดพ้นจากอะไร ก็หลุดพ้นจากความตระหนี่ หลุดพ้นจากความยึดติดในวัตถุ

Bing translation:

We eat? The answer was “we eat for salvation salvation, however, break free from anything he was wrongful from the survivor survivor from the mounting bracket in the object.

Google translation:

What do we eat for? The cheapest answer We give out to eat. How are you? Out of nothing It’s out of misery. Get rid of attachments in objects.

My translation:

Why do we give alms? The best, correct answer is that we give alms in order to free ourselves. Why free ourselves? To free ourselves from what? To free ourselves from stinginess, to free ourselves from attachment to material objects.

The Google version is perhaps slightly more coherent than the Bing version, and it gets the final clause mostly correct, but neither version conveys much of the meaning of the original text.

So why is Thai text less amenable to machine translation into English than, say, Italian? According to Thai computational linguist Paisarn Charoenpornsawat, who has worked on this problem for the past two decades, the problem has as much to do with the Thai script as with anything else, such as grammar.

As in some other Asian languages, including Chinese and Japanese, words in Thai text are strung together without any whitespace marking the boundaries between them. Spaces do separate phrases and clauses, but their usage is inconsistent, and almost no other form of punctuation (commas, full stops, etc.) is used. In other words, both word and clause boundaries are often ambiguous, which makes word segmentation and sentence segmentation crucial natural language processing problems for Thai.

Related to the word segmentation problem is the fact that a given string of characters can often be interpreted in more than one way. Charoenpornsawat provides the following example:

ไปหามเหสี

This string can be interpreted (correctly) as three words meaning “Go to see the queen”:

ไป go

หา to see

มเหสี the queen

or (incorrectly) as four words that make no sense together:

ไป go

หาม carry

เห deviate

สี colour
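
To make the ambiguity concrete, here is a minimal Python sketch that lists every way the string can be split into dictionary words. The function name and the six-word lexicon are my own, made up for illustration from the glosses above; a real segmenter would use a full Thai dictionary.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this word and segment whatever is left over.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (an assumption): only the six words glossed above.
LEXICON = {"ไป", "หา", "มเหสี", "หาม", "เห", "สี"}

for words in segmentations("ไปหามเหสี", LEXICON):
    print(" | ".join(words))
```

With this toy lexicon the sketch finds exactly the two readings shown above. Nothing in the dictionary lookup itself says which one is right.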

The key to the successful disambiguation of a string is its degree of context dependency. Context-independent segmentation ambiguity can be resolved fairly reliably using probabilistic methods because normally only one alternative is plausible. This is the case in the example above.
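
As a rough illustration of one simple probabilistic method (a unigram word model, which is not necessarily what Charoenpornsawat uses), the sketch below scores the two readings of ไปหามเหสี by multiplying word probabilities. The probabilities are invented purely for illustration; real values would have to be estimated from a segmented Thai corpus.

```python
import math

# HYPOTHETICAL word probabilities, invented for this example only.
WORD_PROB = {
    "ไป": 0.01,       # go
    "หา": 0.005,      # to see
    "มเหสี": 0.0001,  # the queen
    "หาม": 0.0002,    # carry
    "เห": 0.00001,    # deviate
    "สี": 0.001,      # colour
}


def log_score(words):
    """Log-probability of a segmentation under a unigram model."""
    return sum(math.log(WORD_PROB[w]) for w in words)


# The two candidate readings of the string.
candidates = [
    ["ไป", "หา", "มเหสี"],
    ["ไป", "หาม", "เห", "สี"],
]

best = max(candidates, key=log_score)
print(" | ".join(best))
```

Under these made-up numbers the three-word reading wins comfortably, partly because every extra word multiplies in another small probability. That is the sense in which only one alternative is plausible here.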

Context-dependent segmentation ambiguity, in which two or more alternatives are plausible, is more difficult to resolve and causes more errors in machine translation. Another example from Charoenpornsawat:

มากว่า

This string can be interpreted with equal probability as either

มาก many

ว่า that

or

มา come

กว่า (more) than

So there you have it: some insight into the challenges of Thai-to-English text translation. See the Thai linguistics resources page for references to Paisarn Charoenpornsawat’s work.