Saturday, February 23, 2013

ChineseWriter is now on steroids

During the past month I have made significant new developments to my ChineseWriter software. The most significant change is the integration of the free CC-CEDICT Chinese-English dictionary into the software. Originally ChineseWriter was based on a database of words that I updated and developed myself. During the months before the CC-CEDICT integration I had been slowly adding new words to the database, getting up to around 1200 words. However, the need to add new words still arose regularly during writing, disturbing the writing process.

Now there will never again be a need to add words: the current edition of CC-CEDICT contains a whopping 105 262 words! I would never have had a chance of coming even close to such a collection by my own efforts. CC-CEDICT is maintained by a group of volunteers, is freely available and, most important of all, is available in a format that is easily readable by computer software.
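
To give an idea of how readable the format is, here is a minimal sketch in Python (not the actual ChineseWriter code, which is C#). The CC-CEDICT file has one entry per line in the layout "Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/", with comment lines starting with "#". The names Entry and parse_cedict_line are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str    # numbered-tone pinyin as stored in the file, e.g. "mei3 tian1"
    glosses: list  # English meanings, one string per gloss


# Layout of one entry line: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line: str) -> Optional[Entry]:
    """Parse one line; return None for comments, blank lines and malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return Entry(traditional, simplified, pinyin, glosses.split("/"))


# A sample line written in the layout described above.
entry = parse_cedict_line("每天 每天 [mei3 tian1] /every day/everyday/")
```

Loading the whole dictionary is then just a loop over the lines of the file, keeping every result that is not None.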

Such a humongous list of words is not, however, without its downsides. Only some 2000-5000 words are reasonably common, which means that about 95% of the dictionary entries are very rare, archaic or otherwise weird. When searching for a hanyu word to match the entered pinyin, this flood of rare entries can get in the way, making it more difficult to find the desired (usually common) word. CC-CEDICT itself does not contain any information about the relative frequencies of its entries in regular text, so there is no information for automatically selecting the common case(s). This is where my humble hand-collected list of ~1200 words comes back into use: I used it as the basis for a list of frequently used words that are sorted to the top of the list of suggestions for a given pinyin input. This gives the best of both worlds: a huge database for interpreting Chinese text, yet a concise list of common suggestions when writing.
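
The "common words first" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the ChineseWriter implementation: rank_suggestions and its parameters are hypothetical names, the common-word list is treated as a plain set, and the limit of 50 simply mirrors the number of suggestions shown in the user interface.

```python
def rank_suggestions(candidates, common_words, limit=50):
    """Return candidates with the known-common words moved to the front.

    candidates:   hanyu words matching the typed pinyin, in dictionary order
    common_words: set of hanyu words from the hand-collected frequent list
    """
    # The key is False (sorts first) for common words and True for the rest.
    # sorted() is stable, so each group keeps its original dictionary order.
    ranked = sorted(candidates, key=lambda word: word not in common_words)
    return ranked[:limit]


# Hypothetical candidates for one pinyin input, with two of them marked common.
suggestions = rank_suggestions(["世", "事", "室", "是", "逝"], {"是", "事"})
```

A stable sort is enough here because the only frequency information available is a yes/no flag: a word is either on the hand-collected list or it is not.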

Screen shot of the current version of my ChineseWriter software.
The other significant change involves the text entry and editing area. The original version of the software had three separate fields for the hanyu, pinyin and combined representations of the text. Of these, only the hanyu field was editable. This made it difficult for a person with limited recognition of hanyu characters to edit text. In the current version, text is edited and pinyin is entered directly in the same field where the final characters are displayed. This significantly simplifies editing the text.

Other less significant, yet important additions to usability include:

  • Color-coding of hanyu characters and pinyin with standard color codes for tones (red for first tone, yellow for second tone, green for third tone, blue for fourth tone and black for neutral tone).
  • For multi-character words, the breakdown of the word into characters and the meanings of the individual characters are shown on mouse-over.
  • A WPF DataGrid is used for listing hanyu suggestions; up to 50 suggestions are shown.
  • Pinyin input can be given with tone markers (e.g. "mei3tian1") or without them ("meitian").
  • Simple entry of literal Latin text among hanyu.
  • The Enter key selects suggestion #1, CTRL+n selects suggestions 1-9 respectively, and a mouse click selects any suggestion.
  • Horizontal scrolling of long Chinese texts and vertical scrolling of suggestions.
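
As an illustration of the pinyin input option in the list above, here is a small Python sketch of matching typed pinyin against a dictionary entry with or without tone numbers. It is an assumption about how such matching could work, not the actual ChineseWriter logic, and the function names are hypothetical. It assumes the dictionary stores numbered-tone pinyin with spaces between syllables (for example "mei3 tian1"), with 5 used for the neutral tone.

```python
import re


def strip_tones(pinyin: str) -> str:
    """Remove tone digits 1-5 and whitespace: 'mei3 tian1' -> 'meitian'."""
    return re.sub(r"[1-5\s]", "", pinyin.lower())


def matches_input(typed: str, entry_pinyin: str) -> bool:
    """Return True if the typed pinyin matches the entry's stored pinyin.

    If the user typed any tone digit, the whole input including tones must
    match exactly; otherwise only the toneless syllables are compared.
    """
    typed = typed.lower().replace(" ", "")
    stored = entry_pinyin.lower().replace(" ", "")
    if re.search(r"[1-5]", typed):
        return typed == stored
    return typed == strip_tones(stored)
```

With this sketch both "mei3tian1" and "meitian" match an entry stored as "mei3 tian1", while "mei2tian1" does not. Input with tones on only some syllables is not handled; that would need a syllable-by-syllable comparison.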

Developing ChineseWriter with Visual Studio 2012, C# 4.0 and WPF


The devilish hanyu ambiguity


I have in the past written about the ambiguity of pronunciation in Mandarin Chinese: for a single pronunciation, even in the same tone, there are usually many different meanings written with different hanyu characters. The usual example is the fourth-tone sound shì, which can mean (and be written as), among several other possibilities:

  • 是: is / are / am / yes / to be
  • 市: market / city
  • 事: matter / thing / item / work / affair
  • 室: room / work unit / grave / scabbard / family or clan
  • 世: life / age / generation / era / world
  • 逝: (of time) to pass / to die

At this point in my studies I am quite used to this kind of ambiguity in pinyin, but working with a highly expanded set of characters has brought up another, rarer but trickier ambiguity: ambiguity in the hanyu itself. An example of this is the hanyu character 地. Consider two examples:

地方          dì fāng              (region)
来回来去地    lái huí lái qù de    (backwards and forwards)

In the first example, "region", 地 corresponds to the pronunciation dì (di, fourth tone) and means "earth". In the second example the very same 地 corresponds to the pronunciation de (de, neutral tone) and works as a structural particle marking an adverbial adjunct.

I admit that these cases are not very common, but they create a devilish complication for the development of Chinese processing software, since hanyu characters can no longer be considered the unambiguous and unique gold standard for lookups. A word is fully and uniquely defined only when its hanyu and pinyin are given together. For a software program like ChineseWriter, which needs to parse a stream of hanyu into words, this ambiguity considerably increases the challenge of implementation.
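
To make the problem concrete, here is a rough Python sketch of a hanyu lookup that returns all readings for a word, combined with a simple greedy longest-match split of a text into words. The longest-match strategy is my assumption for the illustration and not necessarily what ChineseWriter does; the tiny entry list is hand-written sample data, and BY_HANYU and segment are hypothetical names.

```python
from collections import defaultdict

# Tiny hand-written sample; a real program would load CC-CEDICT instead.
ENTRIES = [
    ("地", "di4", "earth"),
    ("地", "de5", "structural particle"),
    ("地方", "di4 fang1", "region"),
    ("来回来去地", "lai2 hui2 lai2 qu4 de5", "backwards and forwards"),
]

# The same hanyu key can carry several (pinyin, meaning) readings.
BY_HANYU = defaultdict(list)
for hanyu, pinyin, meaning in ENTRIES:
    BY_HANYU[hanyu].append((pinyin, meaning))

MAX_WORD_LEN = max(len(hanyu) for hanyu in BY_HANYU)


def segment(text):
    """Greedy longest-match split of text into (word, readings) pairs."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink. A single character is
        # always accepted, with an empty readings list if it is unknown.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in BY_HANYU or length == 1:
                result.append((word, BY_HANYU.get(word, [])))
                i += length
                break
    return result


# Words whose hanyu alone does not determine the reading.
ambiguous = [word for word, readings in segment("慢慢地") if len(readings) > 1]
```

In this sample, 地方 and 来回来去地 each come back as one word with a single reading, but the stand-alone 地 at the end of 慢慢地 comes back with two readings. The hanyu alone cannot decide between them; only the pair of hanyu and pinyin identifies the word.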
