Sunday, October 28, 2012

So, I made my own software for writing Chinese!

From early on in learning Chinese, I decided to concentrate my efforts on the spoken language, not writing. One reason is the sheer mind-blowing number and complexity of the characters. Another is that for writing (on the internet) various translation tools are available, but when Jinlin and I are face to face, speaking is a must. Finally, my busy life allows more time for listening to Chinese audio lessons (e.g. when driving or walking) than for studying the visual appearance of the characters at home.

In the beginning, when I translated from English to Chinese with Google Translate, I occasionally used "round-trip" checking to verify the translation. In this method one translates the Chinese result back to English and checks that it resembles the original intention closely enough. While this is a useful method that I have used a lot, it has significant drawbacks. First, it is slow and clumsy. Especially when the round-trip result is not close enough, one must make random structural variations to the original sentence in the hope of finding a form that translates better. Second, if the round trip does not produce a good result, the Chinese translation could still be okay and the problem could lie only in the translation back to English. One cannot tell which, and this leads to unnecessary random changes.

Pinyin to the rescue


These days I am getting sufficiently fluent in Chinese to check the validity of a translation in both directions in most cases. How is this possible when my skills are only in speaking, not in the characters? It's made possible by pinyin, the way to write Chinese phonetically with the Latin alphabet. For example, here is a line written by Jinlin in our chat on 4 October 2012, in Chinese characters (Hanyu), pinyin and English:
Hanyu:  非常好.  我很开心. 我希望你来上海
Pinyin:  Fēicháng hǎo. Wǒ hěn kāixīn. Wǒ xīwàng nǐ lái Shànghǎi.
English: Extremely good. I am very happy. I hope you come to Shanghai.
Because I am able to speak and understand these sentences in spoken Mandarin, I am also able to read and understand the corresponding phonetic pinyin. When translating between English and Chinese in Google Translate, the resulting pinyin is also shown. By looking at the pinyin I can see directly whether the meaning is what I intended, without a round trip to English. While this direct checking saves time and improves the accuracy of the translation, it is sometimes frustrating in a different way. Often I know which sequence of pinyin I want as the outcome, but I don't know what English sentence to write to make the tool produce it.

The obvious question is of course: couldn't there be a tool that allows me to write directly in pinyin and produces the corresponding Chinese characters? In principle many such tools exist: most Chinese writing tools are based on pinyin input. This is because it is clearly not convenient to make a keyboard with over 3000 keys for the different characters. Neither is it convenient to specify the intended character by drawing the strokes on screen (although you can try it at http://www.chinese-tools.com/tools/mouse.html). Hence pinyin is the standard method: if you install the Chinese input language in Windows and start typing, or use an online tool like http://www.chinese-tools.com/tools/ime.html, you will be writing pinyin.

So what's the catch? Well, there are only about 400 different pinyin syllables (ignoring tones) but several thousand characters in Chinese. So for many pinyin syllables there are dozens of different Hanyu characters. The user of these tools must choose the intended Hanyu character from a list of matching characters, and this requires visual knowledge of the Hanyu. You can see this if you try, for example, to use the free online Chinese writing tool above to write "I am" in Chinese. Pinyin for "I" is "wo" and pinyin for "am" is "shi". But for each of these syllables you get more than ten possible Chinese characters to choose from. Without recognizing the characters you cannot select the right ones, 我 and 是. As I have discussed previously, "ma" can mean, among other things, horse, mother or curse, and "shi" can mean even more things than "ma".
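To make the ambiguity concrete, here is a minimal Python sketch. It is only an illustration, not part of any of the tools mentioned: the table `HOMOPHONES` and the helper names `candidates` and `count_sequences` are made up for this example, and the entries are a tiny hand-picked sample, far fewer than a real input method would offer.

```python
# Toy homophone table: toneless pinyin syllable -> list of
# (character, tone-marked pinyin, English gloss).
# A hand-picked sample for illustration only, not a real dictionary.
HOMOPHONES = {
    "wo": [("我", "wǒ", "I, me"), ("握", "wò", "to grasp"), ("卧", "wò", "to lie down")],
    "shi": [
        ("是", "shì", "to be"),
        ("十", "shí", "ten"),
        ("时", "shí", "time"),
        ("事", "shì", "matter, affair"),
        ("市", "shì", "city, market"),
    ],
    "ma": [
        ("马", "mǎ", "horse"),
        ("妈", "mā", "mother"),
        ("骂", "mà", "to curse, scold"),
        ("吗", "ma", "question particle"),
    ],
}


def candidates(syllable):
    """Return every character in the toy table that shares this toneless syllable."""
    return HOMOPHONES.get(syllable.lower(), [])


def count_sequences(syllables):
    """Count how many character sequences the given syllables could stand for."""
    total = 1
    for syllable in syllables:
        total *= len(candidates(syllable))
    return total


if __name__ == "__main__":
    for char, pinyin, gloss in candidates("shi"):
        print(char, pinyin, gloss)
    print(count_sequences(["wo", "shi"]), "possible readings of 'wo shi' in this toy table")
```

Even with this small sample, "wo shi" already has many possible readings, and only one of them is 我是.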

Computer nerd strikes back


But wait! How about writing pinyin syllables and then choosing the correct intention, and hence the correct Hanyu character, from a list of English meanings? That would allow a person with written English and spoken Chinese skills to write full Chinese with full control and no translation errors. But alas, no such program seems to exist. Not until now anyway ;-) Since I'm a programmer both in my job and as a hobby, I made one today. Here's a screenshot:


With this tool one can write pinyin into the input field and select the desired matching word from an automatically updated list of matches by pressing the corresponding number. The selected character(s) are added to the sentence in the "Chinese" field and displayed as combined Hanyu-Pinyin.
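The lookup idea can be sketched in a few lines of Python. This is a hedged illustration, not the actual program (which is written in C#): the `Word` record, the `WORDS` list and the functions `strip_tones`, `matches`, `menu` and `select` are all hypothetical names, and matching on a toneless prefix is an assumption about how such a list could be filtered.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    hanyu: str    # Chinese characters
    pinyin: str   # tone-marked pinyin
    english: str  # English meaning shown in the pick list


# Tiny sample vocabulary, for illustration only.
WORDS = [
    Word("我", "wǒ", "I, me"),
    Word("是", "shì", "to be"),
    Word("十", "shí", "ten"),
    Word("希望", "xīwàng", "to hope"),
    Word("上海", "Shànghǎi", "Shanghai"),
    Word("开心", "kāixīn", "happy"),
]


def strip_tones(pinyin):
    """Drop tone marks and lowercase, e.g. 'Shànghǎi' -> 'shanghai'.

    Note: this also turns 'ü' into 'u', which is a simplification.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def matches(typed, words=WORDS):
    """Return the words whose toneless pinyin starts with what was typed."""
    key = strip_tones(typed.replace(" ", ""))
    if not key:
        return []
    return [w for w in words if strip_tones(w.pinyin).startswith(key)]


def menu(typed, words=WORDS):
    """Build the numbered pick list: pinyin, characters and English meaning."""
    return [f"{i}. {w.pinyin} {w.hanyu} = {w.english}"
            for i, w in enumerate(matches(typed, words), start=1)]


def select(typed, number, sentence="", words=WORDS):
    """Append the chosen match (1-based number) to the sentence so far."""
    return sentence + matches(typed, words)[number - 1].hanyu


if __name__ == "__main__":
    print("\n".join(menu("shi")))
    sentence = select("wo", 1)
    sentence = select("shi", 1, sentence)
    print(sentence)
```

The point of the design is that the choice is made by reading the English meaning, so no knowledge of the characters is needed to pick 是 over 十.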

It's a cute little program written in C#, using WPF and XAML for the user interface, LINQ for functional list processing and an XML database for the words. It was great fun to make! Here, as an example, is a key function that splits a string of Chinese characters into chunks of words based on recursive dictionary lookup:
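For readers who want to experiment with the idea, recursive dictionary-based splitting can be sketched as follows. This is a Python illustration under stated assumptions, not the C# function from the program: the name `split_words`, the sample `LEXICON`, the longest-match-first strategy and the rule that unknown characters become one-character chunks are all choices made for this sketch.

```python
# Sample word list, for illustration only.
LEXICON = {"非常", "好", "我", "很", "开心", "希望", "你", "来", "上海"}


def split_words(text, lexicon=LEXICON):
    """Split a string of Chinese characters into chunks of known words.

    At each step the longest dictionary word at the start of the text is
    taken, and the rest of the text is split recursively. A character
    that starts no known word (e.g. punctuation) becomes its own chunk.
    """
    if not text:
        return []
    max_len = max((len(word) for word in lexicon), default=1)
    for length in range(min(max_len, len(text)), 0, -1):
        head = text[:length]
        if head in lexicon:
            return [head] + split_words(text[length:], lexicon)
    # No known word starts here: pass the single character through.
    return [text[0]] + split_words(text[1:], lexicon)


if __name__ == "__main__":
    print(split_words("我希望你来上海"))
```

Because each call handles one chunk, the recursion depth grows with the number of chunks. That is fine for chat-length sentences but would need an iterative rewrite for very long texts.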


The program is of course only as useful as the database of words it contains. Today I have added the first 125 words, and I have already used it successfully to write my first directly composed sentences to Jinlin. The program should be quite useful in everyday writing once I get to the ~500 word level that approximates my current vocabulary. I can also use it to comprehend Chinese written by Jinlin by copy-pasting Hanyu into the "Chinese" field and inspecting the resulting breakdown into words. And while it's a writing tool, playing around with the words and expanding the database is bound to further my primary goal of more fluent speaking as well.
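As a rough idea of what a small XML word database could look like, here is a Python sketch. The format is an assumption: the element name `word` and the attributes `hanyu`, `pinyin` and `english` are invented for this example and are not the program's actual schema, and `load_words` and `add_word` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical database format, invented for this sketch.
SAMPLE_XML = """<words>
  <word hanyu="我" pinyin="wǒ" english="I, me"/>
  <word hanyu="是" pinyin="shì" english="to be"/>
  <word hanyu="上海" pinyin="Shànghǎi" english="Shanghai"/>
</words>"""


def load_words(xml_text):
    """Parse the XML text into a list of (hanyu, pinyin, english) tuples."""
    root = ET.fromstring(xml_text)
    return [(w.get("hanyu"), w.get("pinyin"), w.get("english"))
            for w in root.iter("word")]


def add_word(xml_text, hanyu, pinyin, english):
    """Return new XML text with one more word entry appended."""
    root = ET.fromstring(xml_text)
    ET.SubElement(root, "word", hanyu=hanyu, pinyin=pinyin, english=english)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    updated = add_word(SAMPLE_XML, "好", "hǎo", "good")
    for entry in load_words(updated):
        print(entry)
```

With a format like this, growing the vocabulary is just a matter of adding one line per word.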
