Sunday, November 30, 2008

From Toy Time to Big Time

I don't know if this warrants another entry, but the test program is now robust enough to handle large real-world XML files. I tried it on three 16K texts, and it took 13.5 seconds overall to merge them, finding hundreds of transpositions. That is probably too many, but the program does break up a longer transposition if it finds an alignment or an insertion/deletion in the middle of it. The next step is to incorporate the test program into the NMerge library, so that the results can be displayed in the multi-version wiki.

The transposition program works in the real world and it is fast.

Sunday, November 16, 2008

Transpositions Conquered

Today the test program correctly merged three versions of a single sentence of the Sibylline Gospel, detecting four transpositions and encoding them correctly. The sentences were:

A: Et sumpno suscepto tribus diebus morte morietur et deinde ab inferis regressus ad lucem veniet.

B: Et mortem sortis finiet post tridui somnum et morte morietur tribus diebus somno suscepto et tunc ab inferis regressus ad lucem veniet.

C: Et sortem mortis tribus diebus sompno suscepto et tunc ab inferis regressus ad lucem veniet.

I must thank Nicoletta for supplying this splendid example, which packs so many transpositions into a small space. Here is the variant graph built automatically from the three versions. When I say 'automatically' what I mean is that I drew the graph manually from the program's textual output. The program was set to make no variants of fewer than five characters, although it does split arcs down to a single character. There are two transpositions, each present twice, which gives the four detected. I have indicated these by drawing the transposed forms in grey. The parent arcs are in black, and each transposed form is connected to its parent by a dotted line. The triple repetition of 'Et' at the start of the graph could be removed by reducing the minimum variant size. At the moment I am happy to see such high quality output without resorting to fine tuning.
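To give a feel for what a minimum size of five characters means in practice, here is a minimal sketch in Python. It is not the algorithm used in the test program; it is a stand-in built on the standard difflib module, and the names candidate_transpositions and MIN_LEN are made up for illustration. It first aligns two versions in order, then looks for text of at least MIN_LEN characters that occurs in the leftovers of both, which is one simple way to spot text that may have moved.

```python
from difflib import SequenceMatcher

MIN_LEN = 5  # assumed to mirror the five-character setting mentioned above


def candidate_transpositions(a, b, min_len=MIN_LEN):
    """Return strings shared by a and b that in-order alignment left unmatched."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    a_gaps, b_gaps = [], []
    # Collect the stretches of each version that were not aligned in order.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            if i2 > i1:
                a_gaps.append(a[i1:i2])
            if j2 > j1:
                b_gaps.append(b[j1:j2])
    found = []
    # Any long enough string common to a leftover of a and a leftover of b
    # sits at different positions in the two versions.
    for gap_a in a_gaps:
        for gap_b in b_gaps:
            inner = SequenceMatcher(None, gap_a, gap_b, autojunk=False)
            m = inner.find_longest_match(0, len(gap_a), 0, len(gap_b))
            if m.size >= min_len:
                found.append(gap_a[m.a:m.a + m.size])
    return found


if __name__ == "__main__":
    version_a = ("Et sumpno suscepto tribus diebus morte morietur "
                 "et deinde ab inferis regressus ad lucem veniet.")
    version_b = ("Et mortem sortis finiet post tridui somnum et morte morietur "
                 "tribus diebus somno suscepto et tunc ab inferis regressus "
                 "ad lucem veniet.")
    print(candidate_transpositions(version_a, version_b))
```

Raising min_len suppresses short, accidental matches at the cost of missing short genuine moves, which is the same trade-off as the minimum variant size discussed above.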

The best thing about the program is the degree to which repetitions between versions have been systematically removed. This is the whole objective of the variant graph model.
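The idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the NMerge data structure: the graph is flattened into an ordered list of arcs, each carrying a piece of text and the set of versions that pass through it, and the names Arc, read_version and stored_chars are hypothetical. The example uses just the tail of the three sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    text: str            # the text carried by this arc
    versions: frozenset  # the versions that share this arc


# Tail of the three sentences: the shared text is stored only once.
arcs = [
    Arc("et ", frozenset("ABC")),
    Arc("deinde ", frozenset("A")),
    Arc("tunc ", frozenset("BC")),
    Arc("ab inferis regressus ad lucem veniet.", frozenset("ABC")),
]


def read_version(arcs, version):
    """Rebuild one version by following the arcs that belong to it."""
    return "".join(arc.text for arc in arcs if version in arc.versions)


def stored_chars(arcs):
    """Count the characters actually stored in the graph."""
    return sum(len(arc.text) for arc in arcs)


if __name__ == "__main__":
    for name in "ABC":
        print(name, read_version(arcs, name))
    separate = sum(len(read_version(arcs, name)) for name in "ABC")
    print(stored_chars(arcs), "characters stored instead of", separate)
```

Every version can still be read back in full, yet the words common to all three are held in a single arc. A transposition would need an extra link between a parent arc and its transposed copy, which this sketch leaves out.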

This is, of course, only a test program. The algorithm will eventually be added to NMerge and all this will happen behind the scenes in the multi-version wiki whenever you save.