Sunday, February 24, 2013

Multi-encoding text-merging

MVDs now support any text encoding. I've used ICU, the IBM-supplied library, for conversions to and from Unicode, and it's very good. Texts may now be added to an MVD from any textual source: all you do is specify the encoding. Internally, merging is now done on 16-bit UTF-16 code units, not on 8-bit UTF-8 bytes any more. I don't believe any rival text-comparison program can do this, and it is a big improvement over the previous version of nmerge.
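To illustrate the idea of normalising every source to UTF-16 before merging, here is a minimal sketch in Python. It is not nmerge's code: it uses Python's built-in codecs rather than ICU, and the function name to_utf16_units is made up for this example. It only shows that two versions of the same word, stored in different encodings, end up as identical sequences of 16-bit units once the encoding of each is specified.

```python
def to_utf16_units(raw: bytes, encoding: str) -> list:
    """Decode raw bytes from the named encoding and return UTF-16 code units.

    Hypothetical helper for illustration; nmerge itself uses ICU for this step.
    """
    text = raw.decode(encoding)
    data = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# The same word ("caf" + e-acute) as it might arrive from two different sources.
latin1_source = "caf\u00e9".encode("iso-8859-1")  # 4 bytes
utf8_source = "caf\u00e9".encode("utf-8")         # 5 bytes

units_a = to_utf16_units(latin1_source, "iso-8859-1")
units_b = to_utf16_units(utf8_source, "utf-8")

# The raw bytes differ, but the 16-bit units that would be merged do not.
print(latin1_source != utf8_source, units_a == units_b)
```

Once both versions are in the same internal form, any differences the merge finds are differences in the text, not in how it happened to be stored.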

Tuesday, February 5, 2013

Comparing Chinese and Bengali texts

Extending multi-version documents (MVDs) to properly support languages like Chinese and Bengali, whose characters generally fit in 16 bits but not in 8, turns out to be easier than I thought. Currently the nmerge tool, which produces MVDs, works only with 8-bit bytes internally, so that an individual character may be split over several bytes, as in UTF-8 encoding. Things get complicated whenever a difference is detected between parts of characters rather than between whole characters. Making everything 16-bit will facilitate the comparison of texts in practically any living language and avoid such complications. The exceptions are scripts of dead languages like Phoenician or Lydian, and even then UTF-16 can encode them, using a pair of 16-bit units for each character. I don't have any Chinese examples, but my friends in India have provided me with some interesting Bengali texts, which I'll be using for testing.
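The problem of differences falling inside a character can be shown with a small Python sketch. This is only an illustration under my own assumptions, not nmerge's code: the helper names common_prefix_len and utf16_units are made up, and the two Bengali letters are chosen simply because they are adjacent in Unicode. Compared as UTF-8 bytes, the two letters share their first two bytes and differ only in the third, so a byte-level comparison would find a difference in the middle of a character. Compared as 16-bit units, each letter is one unit and the difference falls on a character boundary.

```python
def common_prefix_len(a, b) -> int:
    """Count how many leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of a string (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


ka = "\u0995"   # Bengali letter KA
kha = "\u0996"  # Bengali letter KHA

# As UTF-8 each letter is three bytes, and the first two bytes are shared.
utf8_shared = common_prefix_len(ka.encode("utf-8"), kha.encode("utf-8"))

# As UTF-16 each letter is a single unit, so nothing is shared.
utf16_shared = common_prefix_len(utf16_units(ka), utf16_units(kha))

# A Phoenician letter lies outside the 16-bit range and takes two units.
phoenician_alf = utf16_units("\U00010900")

print(utf8_shared, utf16_shared, len(phoenician_alf))
```

The last line of the sketch also shows the Phoenician case: the letter is still representable, but as two 16-bit units, so the split-character problem can reappear there.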