Thursday, December 4, 2008

The MVD File Format

Several people have asked me what is inside an MVD, so I thought I would put it on the record.

The idea behind the Multi-Version Document or MVD format is to use the list form of the variant graph as the basis for an encoding of a single work, in all its versions or markup perspectives, as a single digital entity. The advantages of this form of digital document should be obvious. It enables a work to be viewed and searched, and its versions compared and edited, as one file. For example, all versions of Homer's Iliad or the seven markup perspectives of the American National Corpus (Ide and Suderman, 2006) could be encapsulated in a single compact and editable representation. The relationships between the various parts of each version – the what-is-a-variant-of-what information – are also recorded. Storing a multi-version work as a set of separate files has the great disadvantage of requiring this kind of data to be recalculated each time it is needed. In an MVD it has already been calculated once and is thus built-in.

XML is a poor format for an MVD, especially when the content of each version is itself XML. An MVD may, however, be written in binary or XML format. In the latter case, the XML content of each version, inside the XML encoding of the MVD structure, has to be escaped: all instances of '<', '>' and '&' must be replaced by their equivalent entities '&lt;', '&gt;' and '&amp;'. The purpose of the XML form of an MVD is merely to allow the researcher to look inside it to see what is there. Editing it by hand is virtually impossible, because the delicate list format produced by Algorithm 1 can so easily be broken.

A tinker-proof binary format is therefore preferred. If desired for archival purposes, an MVD can be written out as a set of separate XML files, but since the format uses only open-source software to encode its content, the binary form is itself archivable. The structure of an MVD is shown below:

The outer wrapper is a Base64 encoding, expressing binary data as plain text.

The inner wrapper is the ZIP encoding performed by the open-source Zlib library (Gailly and Adler, 1995). This serves the double purpose of scrambling the data to deter tinkering, and compressing it so that one MVD typically occupies little more space than a single original version. Even the alteration of a single byte of the outer Base64 wrapper will very likely break the inner ZIP encoding and the document will fail to load, as it should. Inside the ZIP container are the four parts that comprise the real content:

Magic – the presence of this hexadecimal string guarantees that this is an MVD
Groups – labels for a hierarchy of arbitrary depth, used to group versions or other groups
Versions – a simple description sufficient to identify the ID, short name and long name of each version, whether or not it is a partial version, and its group
Pairs – the pairs list that defines the variant graph itself

No further detail is needed; in fact, more would damage the general applicability of the format. Groups can be used to express any desired classification system for versions. The short name of a version would typically be a siglum or other abbreviation for convenient reference, and the longer name would typically be a full version name. All other details of a version's text are the responsibility of the content format.
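To make the two wrappers concrete, here is a minimal Java sketch of how the outer and inner layers might be peeled off before the four parts are parsed, assuming the inner layer is standard zlib data that java.util.zip can inflate. The class and method names are hypothetical, not part of NMerge, and the parsing of the magic, groups, versions and pairs is omitted.

import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.Inflater;

public class MvdUnwrapper {
    /**
     * Strip the outer Base64 wrapper and the inner zlib wrapper,
     * returning the raw bytes that hold magic, groups, versions and pairs.
     * Hypothetical sketch; not the NMerge reader.
     */
    public static byte[] unwrap(String base64Text) throws Exception {
        // Outer wrapper: Base64 text back to binary
        byte[] compressed = Base64.getMimeDecoder().decode(base64Text);
        // Inner wrapper: inflate the zlib-compressed data
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsInput()) {
                throw new Exception("truncated or corrupt MVD data");
            }
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();   // magic + groups + versions + pairs
    }
}

Writing an MVD out would simply run the same two steps in reverse, with a Deflater and the Base64 encoder.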

References

J.-L. Gailly and M. Adler (1995) Zlib

N. Ide and K. Suderman (2006) Integrating Linguistic Resources: The American National Corpus Model. In Proceedings of the Fifth Language Resources and Evaluation Conference.

Sunday, November 30, 2008

From Toy Time to Big Time

I don't know if this warrants another entry but the test program is now robust enough to handle large real-world files in XML. I tried it on three 16K texts and it took 13.5 seconds overall to merge them, with hundreds of transpositions. That is probably too many transpositions, but the program does break up longer transpositions if it finds an alignment or insertion/deletion in the middle of one. The next step is to incorporate the test program into the NMerge library and thus allow the results to be displayed in the multi-version wiki.

The transposition program works in the real world and it is fast.

Sunday, November 16, 2008

Transpositions Conquered

Today the test program correctly merged three versions of a single sentence of the Sibylline Gospel, detecting four transpositions and encoding them correctly. The sentences were:

A: Et sumpno suscepto tribus diebus morte morietur et deinde ab inferis regressus ad lucem veniet.

B: Et mortem sortis finiet post tridui somnum et morte morietur tribus diebus somno suscepto et tunc ab inferis regressus ad lucem veniet.

C: Et sortem mortis tribus diebus sompno suscepto et tunc ab inferis regressus ad lucem veniet.

I must thank Nicoletta for supplying this splendid example, which in a small space contains so many transpositions. Here is the variant graph built automatically from the three versions. When I say 'automatically', what I mean is that I drew the graph manually from the program's textual output. The program was set to make no variants of fewer than five characters, although it does split arcs down to a single character. There are two transpositions, each present twice. I have indicated these by drawing the transposed forms in grey. The parent arcs are in black and the two are connected by dotted lines. The triple repetition of 'Et' at the start of the graph could be removed by reducing the minimal variant size. At the moment I am happy to see such high quality output without resorting to fine tuning.

The best thing about the program is the degree to which repetitions between versions have been systematically removed. This is the whole objective of the variant graph model.

This is, of course, only a test program. The algorithm will eventually be added to NMerge and all this will happen behind the scenes in the multi-version wiki whenever you save.

Sunday, October 26, 2008

Transpositions

Transpositions are undeniably part of real world texts, and must be included in any practical solution to the technical problems of how to represent overlapping structures in digital texts. The MVD or variant graph model described on this website includes transpositions, but until now there has been no way to calculate them automatically.

Transpositions can't be supported in any markup scheme. In principle, any section of the document can be transposed, not only small bits of text. This means that a feature like the ‘copyof’ attribute (Smith, 1999), if used to record transpositions, would have to allow any element to be contained in any other element – thus destroying the whole idea of a document schema or DTD. Also, the transposed sections might not even contain well-formed markup, i.e. they might contain unmatched start or end-tags. So this approach doesn’t work.

The method I have been using until now is that suggested by Lopresti and Tomkins (1997): that one should do all the alignments, variants, deletions and insertions first, and then consider pairs of insertions/deletions as candidates for transposition afterwards. The problem is that, while this works well for two versions, it leads to problems caused by ‘contamination’ between multiple versions. You end up with the same word being recorded as a variant of itself in another version. What is left over after the completion of alignment is a lot of noise that does not form a suitable basis for the calculation of transpositions.

The French Connection

The approach adopted by Bourdaillet (2007) in his thesis is the exact opposite: do the transpositions (and alignments) first; then whatever is left over can be considered as candidate variants, insertions and deletions. His method is, however, still tied to just two versions, but there are some useful ideas that can be applied to the case of aligning N versions too.

One reason I suspect he avoided trying to merge N versions into one document was that he didn't have a data structure to record it; another is that, until now, this has been considered too hard a problem. However, I believe that it is solvable, by combining his general approach with the MVD document structure. I will try to describe how it will work, by illustrating a simple example step by step.

The Case for Automatic Detection of Transpositions

You might think that automatically detecting transpositions would be a bad idea. If the author transposed some text in a holograph, this should be clear from the manuscript and can simply be encoded as such by the editor. But what about the case where there are several versions of the same work – perhaps redrafts of a single text or independent versions created by copying? Spotting such transpositions visually is very hard work. In these cases calculating transpositions is a good idea, and saves the editor a lot of trouble. Even in the holograph case, calculating transpositions can still be useful. So long as the computer gets it right we don't have to encode it manually (which saves work); only when it gets it wrong do we have to do anything.

Outline of the Proposed Method

Imagine you have already aligned N versions into an MVD. Now you want to add the N+1th version to the structure. (This is the standard inductive formulation: if we can prove that it works in this case, then it will always work.) The Longest-Common-Substring or LCS is the longest section of text shared by the MVD and the new version where successive characters are all the same. The basic algorithm uses this one operation, applied repeatedly, to merge the entire graph. In essence the algorithm is simply:

  1. Merge the variant graph and the new version where the LCS occurs.
  2. Call the algorithm recursively on the two unaligned sections before and after the LCS.
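To fix ideas, here is a minimal Java sketch of the shape of that recursion. It is only an outline of the idea, not the NMerge implementation: VariantGraph, Match and their methods are hypothetical stand-ins, and findLCS is assumed to return the longest common substring between the graph and the unaligned text (or null if there is none).

/** Hypothetical stand-ins for the real structures; illustration only. */
interface VariantGraph {
    Match findLCS(String text);                  // longest common substring with text, or null
    void addVersionToPath(Object path, int version);
    void addUnalignedArc(String text, int version);
    VariantGraph subgraphBefore(Match m);        // arcs that definitely precede the LCS
    VariantGraph subgraphAfter(Match m);         // arcs that definitely follow the LCS
}

class Match {
    Object graphPath;    // where the LCS lies in the graph
    int textStart;       // where the LCS starts in the new text
    int textEnd;         // where it ends (exclusive)
}

class Aligner {
    void merge(VariantGraph graph, String newText, int version) {
        if (newText.isEmpty()) return;               // nothing left to align
        Match lcs = graph.findLCS(newText);
        if (lcs == null) {                           // no shared text: a pure insertion
            graph.addUnalignedArc(newText, version);
            return;
        }
        // 1. merge the new version into the graph where the LCS occurs
        graph.addVersionToPath(lcs.graphPath, version);
        // 2. recurse on the unaligned text before and after the LCS,
        //    restricted to the subgraphs that precede and follow it
        merge(graph.subgraphBefore(lcs), newText.substring(0, lcs.textStart), version);
        merge(graph.subgraphAfter(lcs), newText.substring(lcs.textEnd), version);
    }
}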

The Challenge

As an example consider the three versions:

1. The quick brown fox jumps over the lazy dog.
2. The quick white rabbit jumps over the lazy dog.
3. The quick brown ferret leaps over the lazy dog.

Imagine we have already built an MVD out of this. Now we want to merge it with:

4. The white quick rabbit jumps over the dog.

There is a small transposition here: version 4 has ‘white quick’ instead of ‘quick white’ in version 2. Let’s see if the algorithm can detect it.

Following the Algorithm as it Works

We can add the fourth version to the graph very easily, by creating one big arc with the text of the version and attaching it to the start and end of the graph, like this:

This is already a valid variant graph, but it is full of redundancy. Every word of the new version can also be found in the ‘old’ graph. Apart from wasting storage, this redundancy fails to inform us of the relationship between the various parts of the new version and the rest of the document. The algorithm will correct this problem by gradually removing all of the copies.

The ‘longest-common-substring’ (or LCS) between the first three versions and the 4th one is ‘rabbit jumps over the’ from version 2, even though bits of that string are shared by other versions. What we do is align version 4 with the LCS, leaving two bits at either end non-aligned. The two bits are ‘The white quick’ and ‘dog.’:

Note that ‘rabbit jumps over the’ has now acquired version D. We then call the same routine (this is called recursion) on these two bits left over, but align them with the corresponding parts of the MVD that precede OR follow the LCS in version 2.

We now have two subgraphs containing arcs that definitely precede or follow the LCS. All the other arcs that are effectively parallel to the LCS are left in place but are not considered further. Now we have to try to align Arc 1 ‘The white quick’ with Graph 1 and Graph 2, and likewise align Arc 2 ‘dog.’ with graphs 1 & 2. Because we now have more than one subgraph we will have to consider transpositions. However, when we compare the LCS between Arc 2 and Graph 1 (nothing) with that calculated between Arc 2 and Graph 2 (‘dog.’) it is clear that no transpositions are possible. Instead, the best alignment is the direct one of ‘dog.’ between versions ABC and D:

We have no more D-text to align here so the process stops. The empty D-arc becomes a deletion in version D. On the other hand, there is still work to do on the left, in Graph 1. Here comparison between Graph 1 and Arc 1 suggests either ‘white’ or ‘quick’ as the LCS. We will choose ‘quick’ because it is more central:

This is still not quite right. The instance of ‘white’ on the left is clearly a transposition of the ‘white’ on the right. Again the merging of the LCS leads to two arcs and two graphs:

Calculation of the LCS between Arc 1 and Graph 2 yields ‘white’, shown here in bold. So we merge the LCS into the graph, except that this is a transposition, and so we must leave the text where it is and point to the target of the transposition. This leaves only one copy of ‘white’ in the graph, and another copy, shown in grey, that points to it.
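One way the pair list could record that 'points to' relationship is sketched below in Java, with hypothetical names rather than the actual NMerge classes: the transposed arc keeps its own version set and simply refers to the parent arc for its text, so the text itself need only be stored once (one could equally store a duplicate copy and just keep the pointer).

import java.util.Set;

/**
 * Hypothetical sketch of an arc that can record a transposition:
 * a transposed arc keeps its own set of versions but takes its text
 * from the parent arc it points to elsewhere in the graph.
 */
class Arc {
    Set<Integer> versions;   // which versions pass through this arc
    String text;             // the fragment's text (null for a transposed child)
    Arc parent;              // non-null if this arc is a transposition of another

    String getText() {
        return parent != null ? parent.getText() : text;
    }

    boolean isTransposed() {
        return parent != null;
    }
}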

We are still not finished. ‘The’ appears twice. So we need one further LCS calculation between the ‘The’ D-arc and the ‘The’ ABC-arc:

Now that the two ‘The’s are merged, all that remains is to introduce an empty ABC-arc to indicate that ‘white’ only appears in that position in version D.

Taking Stock

We have been recursing into smaller and smaller portions of the graph. That does not mean that these portions or subgraphs are in any way detached from the rest of the graph. The other parts were simply omitted for clarity. Overall the graph now looks like this:

The text of version D has been fully assimilated. It has been aligned with ALL the versions of the graph, not just with the one that was most similar to the version we were trying to add. (This is what biologists do in their ‘progressive’ alignment technique, and they don't even attempt transpositions). The result is a much better alignment with a lot less redundancy. To add further versions to the MVD we simply repeat the above steps, with the new variant graph as our starting point.

Time Complexity

I believe this routine may eventually be O(N log N), that is, as fast as the famous ‘quicksort’ routine of Hoare from 1961, which it resembles. At the moment it is somewhat slower because my current calculation of the LCS takes O(N²) time in the worst case. The LCS between two strings can be calculated in O(N) time according to Gusfield, but that requires the construction of two suffix trees using Ukkonen's 1995 algorithm. I have implemented that for the text of the new version, but I can't generate a suffix tree for the variant graph because it is too difficult, and may not be possible in O(N) time. To calculate the LCS I just traverse the graph, looking for runs of matching characters in the new version, which has been converted into a suffix tree using Ukkonen's algorithm. Overall I think this is O(N²). However, even as it now stands, the algorithm is still very fast because expected performance is usually much better than the worst case.
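For comparison, here is the textbook dynamic-programming way of computing the longest common substring of two plain strings, which runs in O(N·M) time and space. It is purely illustrative: the real problem above is matching the new version against a variant graph rather than another string, which is why a suffix tree is used instead.

/**
 * Textbook O(N*M) longest-common-substring by dynamic programming.
 * Illustrative only; not the suffix-tree method described above.
 */
public class LongestCommonSubstring {
    public static String lcs(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0, endInA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;   // extend the common run
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    public static void main(String[] args) {
        // versions 2 and 4 of the example above:
        // prints " rabbit jumps over the " (with the surrounding spaces)
        System.out.println(lcs("The quick white rabbit jumps over the lazy dog.",
                               "The white quick rabbit jumps over the dog."));
    }
}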

References

D. Lopresti and A. Tomkins, ‘Block edit models for approximate string matching’, Theoretical Computer Science 181, 1997, 159–179.
J. Bourdaillet, Alignment textuel monolingue avec recherche de déplacements: algorithmique pour la critique génétique, PhD Thesis, Université Paris 6 Pierre et Marie Curie, 2007.
C. Hoare, ‘Partition: Algorithm 63’, ‘Quicksort: Algorithm 64’, Communications of the ACM 4(7), 1961, 321–322.
D. Smith, ‘Textual Variation and Version Control in the TEI’, Computers and the Humanities 33.1, 1999, 103–112.
E. Ukkonen, ‘On-line Construction of Suffix Trees’, Algorithmica 14, 1995, 249–260.
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge: Cambridge University Press, 1997.

Thursday, September 11, 2008

MVD paper accepted

In case anyone was wondering if my theories have been independently verified, the International Journal of Human-Computer Studies has just accepted my paper that explains the core idea behind the MVD technology. This is a respected technical journal and I spared no detail in the paper, which is 16 pages long. There is a link at the bottom of the "What is an MVD?" page to the PDF I submitted.

Tuesday, July 29, 2008

Improvements to Alignment

In preparation for publicly releasing the NMerge code I have been revising the alignment algorithm, as there were significant weaknesses in the naïve approach I had previously adopted. In cases like the Sibylline Gospel, there is a great deal of variation in a small space, and simply requiring the user to choose a base text for alignment doesn’t work very well. There seem to be a few reasons for this:

  1. The user doesn't actually know to which existing version a new version should be aligned. It should be the task of the software to find this out.
  2. There can't be a distinction, as I originally supposed, between updating an existing version and adding a new one. The newly revised version might resemble another version more than the one it came from. For example, we might change ‘the quick brown dog’ back to ‘the quick brown fox’. In the case of manuscript traditions like the Sibylline Gospel this happens all the time, because the versions aren’t a succession of edits, but a set of parallel alternatives. To avoid this problem I now automatically calculate the most similar version already in the MVD and then align with that.
  3. When you add a new arc to the graph you need to look for identical paths, not merely identical arcs, that are already in that position. Now when the program finds an existing path with the same text, spanning the same two end-points, the new arc can be discarded and its version simply added to the path instead. In this simplified real-world example, the new D-version was aligned with version C, overall its most similar version. However, in this location the D-variant ‘milia hominum’ already exists in version A. When the program tried to add the D-arc it realised that there was already an A-path with the same text and instead merely added the D-version to that path. (A sketch of this check follows the list.)
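Here is a small Java sketch of the check described in point 3, using hypothetical names rather than the NMerge API: before a new arc is added between two nodes, look for an existing path with exactly the same text between the same end-points, and if one exists just add the new version to it.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of path reuse when adding a fragment; not the NMerge code. */
class PathMerger {
    interface Graph {
        /** Arcs of an existing path with this text between the two nodes, or null. */
        List<Arc> findPath(Node from, Node to, String text);
        void addArc(Node from, Node to, Arc arc);
    }
    static class Node { }
    static class Arc { Set<Integer> versions = new HashSet<>(); String text; }

    void addFragment(Graph g, Node from, Node to, String text, int version) {
        List<Arc> existing = g.findPath(from, to, text);
        if (existing != null) {
            for (Arc a : existing) {
                a.versions.add(version);      // reuse the existing path
            }
        } else {
            Arc arc = new Arc();              // no identical path: add a new arc
            arc.text = text;
            arc.versions.add(version);
            g.addArc(from, to, arc);
        }
    }
}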

This is only the first stage of a series of improvements. Still to come:

  • Calling the alignment algorithm recursively to handle contamination, i.e. aligning to more than one base text. After aligning to the most similar version, any significant portions of the new version that didn't align can be realigned to the most similar version between the two endpoints of the unaligned section.
  • Calculating what is a variant of what, one use of which might be to generate a kind of apparatus criticus
  • Calculating transpositions. These can be done after the other alignments are complete: any leftover insertion/deletion pairs meeting certain criteria can be tested for equality, and the transposition carried out.

XML Awareness

While NMerge remains ignorant of XML, as it should, the public interface class now breaks up ‘words’ based on angle-brackets as well as white space. In addition I am contemplating moving this code and the class that XML-izes the differences between versions into a separate package. The latter functionality is needed by the wiki since a difference might occur in the middle of a tag, and the marker for this needs to be moved to the start of the next piece of real text, so it can be displayed.
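A minimal illustration of that kind of splitting, using a regular expression that treats a whole tag as a single 'word'; this is my own sketch of the idea, not the code in the public interface class.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of splitting XML text into 'words' on tags and white space,
 * so that a whole tag is treated as a single unit of alignment.
 */
public class XmlWordSplitter {
    private static final Pattern TOKEN =
        Pattern.compile("<[^>]*>|[^<\\s]+");   // a tag, or a run of non-space, non-tag chars

    public static List<String> split(String xml) {
        List<String> words = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // e.g. [<p>, Et, morte, morietur, </p>]
        System.out.println(split("<p>Et morte morietur</p>"));
    }
}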

Wednesday, June 25, 2008

Extreme Makeover

The look and feel of the entire application can be changed by simply altering the CSS and XSL stylesheets for each page. I have therefore executed an extreme makeover for the demonstration in Oulu. It now conforms to what users expect of a simple, clear interface, even though the functionality is exactly the same as before. This is more than just window-dressing, however. It is important, for example, that the search box positions itself automatically at the bottom of the window however large it is, also that the width of Twin View does not exceed that of a normal laptop screen size (1024x768). And fixed column widths are now the norm in a world in which wide screens are commonplace. This was never an issue when people had screens of 800x600 pixels in size. Then you wanted your webpage to stretch the full width, so that resizing of the window caused the text to reflow. As typographers will tell you, however, text is most easily read in narrow columns. I'm unsure what width to use but I chose sizes that seem to be widely used: 660 for Single View and 480 for each of the columns in Twin View. It looks particularly good on OSX, nearly as good in Firefox on the PC and not very good at all in Internet Explorer. In fact getting it to look nearly the same in IE and Firefox is quite hard.

I didn't get as much finished for the conference as I hoped, but it is still quite a lot. I would estimate that the application is about 70-80% finished, and that it could certainly be completed by the end of the year.

Monday, June 16, 2008

Lost in Hyperspace

The drawback with TwinView is that, in spite of its power to display the differences between two versions, it becomes hard to see what is the same. After a few insertions and deletions on each side the two columns quickly get out of alignment. When the user scrolls down to study a bit of text, how does he or she find the corresponding text in the other version? Without synchronisation you are forced to scroll up and down manually to try to match up the texts, when the computer should really be doing all that for you in the blink of an eye.

The possibilities of providing visual cues in a web browser to indicate alignment are quite limited. There is no way, for example, to connect text blocks using lines that cross from one window to another, as in Nelson's Parallel Textface. And it is very difficult or impossible (at least I can't see how) to synchronise the scrolling of two <div>s in parallel. On the other hand, maybe something simpler might actually be more effective. The technique I eventually settled on uses <span> and the onclick attribute. When the user clicks on a bit of black text common to both sides it causes the corresponding text in the other frame to scroll down to precisely the same position as the clicked-on text, and then highlights it. Using <span> instead of hyperlinks has the advantage that the text is still selectable – very handy for choosing and copying text to search for in other versions. It is also intuitive: a naïve user will eventually click on the text and discover the alignment facility. (Text copyright Vincenzo Cerami)

Before the click on the left – hopelessly out of alignment

... and after – beautifully aligned and everything clear.

Sunday, June 8, 2008

Search

One of the major advantages of the MVD format is the ability to search all versions simultaneously. The search implementation in the underlying platform (as in MVDDeskViewer) has been around for some time, but making it available in the wiki, where the search results have to be sent to the browser as HTML, and the text highlighted and merged with other highlighted text, was not easy. However, it all seems to work fine now.

The user can search only the currently selected version (that's the ">" button in the image below) or all versions (the ">>" button). When searching all versions the display loads new versions as needed; it also updates the other window to reflect the different insertions or deletions with respect to the new version. You may search in either the left-hand or right-hand window in TwinView. Repeated clicks on a search button find successive matches starting from the current version; the search also wraps around at the end, eventually taking you back to your original version.

Bug Fixes

This iteration also removes some outstanding bugs which appeared only on Windows, particularly in Internet Explorer. Previously, if you chose to edit, add or delete a version or group in EditVersions in a Windows browser, it usually failed. Thankfully, it doesn't do so any more. I have tested it on Firefox and IE6 on Windows XP, and also on Safari and Firefox on Mac OSX.

Wish-list

There remain, however, some deficiencies:

  • Because each set of search results is sent to the browser as complete HTML, the entire text of each version, or pair of versions, has to be sent each time, making the response seem a bit slow, when in fact the search itself is really fast. I need to use AJAX here to send only the differences and update the HTML in place. However, this is not a real problem at the moment because the client and server are on the same machine.
  • Because I am developing on a very wide screen I tend to forget how little screen space there is on the average laptop. The buttons on the left of the display are a serious problem in TwinView, and will have to be moved above the text-columns as a matter of some urgency.
  • Also urgent is providing some means of synchronisation between columns in TwinView. I tried but failed in this iteration to make it align the text on one side whenever you click on the text on the other. Getting this to work cross-browser in JavaScript, however, has so far eluded me.
Here is another screen dump showing how it displays search hits using a background colour and <span> elements in HTML. This is a bit of a kludge, but using real selections, as you do with a mouse, doesn't work on all browsers. (Text © N. Brocca 2008)

Monday, May 19, 2008

Editable Versions and Groups

At last I can edit groups and versions, delete them, move them between groups or rename them. The Version Edit screen looks a bit crowded (see below) but I reduced the clutter somewhat by using links instead of buttons. Some people think this is poor style. It is far worse to have buttons disguised as links to achieve a similar effect. It is true that the wiki won't function without JavaScript, but it is simply impossible to build this level of functionality without it.

I have also added a Revert button, which restores the text or the versions to the state they were in after the last click on the Save button. The "Edit..." links take you to separate pages which allow you to edit the characteristics of the group or version:

What remains now are a few important features that I hope to set right before we go to Oulu:

  1. Find – I need to reimplement the same search facility we had in the desktop version of the tool. It should be easy because the highlighting of search hits is similar to the highlighting of differences between two versions, which we already have. No other application can search multiple versions simultaneously and in linear time – this is one of the great strengths of the MVD format.
  2. Beautify the buttons – these could be reduced to standard icons, and arranged over the top of each text column, instead of at the side. Rather than ugly strings the labels could vanish into 'tool tip text' that only appears when you mouse over them.
  3. Allow the user to author an MVD from scratch. This is meant to be the centrepiece of the demo at Oulu. It is fairly easy, but I will need to automatically mark up plain text using TEI encoding and also check the syntax before committing it. That may have to be done in JavaScript again, otherwise we won't be able to alert the user to syntax errors.
  4. Image view and edit image view – this is really cosmetic but could be quite impressive. The only way I can think of getting the correct image to display in HTML as you scroll down is to use hyperlinks in the right hand window that call JavaScript to set the image of the left hand side. At the moment I don't know how to achieve the same effect in edit view.
  5. Synchronise the left and right hand text in Twin View. This can also be done by hyperlinks, as it was in the multilingual example mentioned earlier. What it means is that the text on the right corresponds to the text immediately on the left. The user should never have to scroll and read to align the texts manually. This is really important.
  6. Transpositions – I think these can be done as an optimising step after alignment has been completed. Then any significant pairs of deletions/insertions can be tested to see if they qualify as transpositions. I think if we recalculate them after every update there won't be any problems as to which bits are transposed and which bits are insertions/deletions etc. It all boils down to the status of individual fragments: a bit of text is either inserted, deleted, transposed, a match or a variant. When those characteristics change we start a new fragment.
  7. Edit Twin View – I want to use AJAX to automatically update the left hand HTML version of the XML the user is editing on the right.

That's quite a lot of work before I can call Phaidros version 1 complete, but it shouldn't take more than a couple of months at most. Then we will have a completely new tool to play with. It will be exciting to see where we can take multi-version text then.

Sunday, May 4, 2008

Twin View

The idea of viewing two versions of a document side by side and comparing them goes back at least as far as Nelson's 'Parallel Textface' (1971), in which equivalent text fragments, perhaps transposed, were connected by lines. Programmers are also used to comparing versions in this way, but it is not clear to me how the user is supposed to make sense of the interconnecting lines when the unit of comparison is a word rather than a line. Highlighting differences on its own is clearly insufficient, but if combined with synchronisation of the two windows – keeping the text on the left in line with the text on the right – this should be enough to prevent the user losing his or her way. In any case interconnecting lines are impossible in HTML windows, and one of the goals of Phaidros was to display multi-version texts in an ordinary web browser.

Comparing two versions of the Sibylline Gospel (© N. Brocca, 2008)

With this goal in view I have now added Twin View to the wiki, which highlights the differences between two texts. In the Sibylline Gospel example there are 36 versions. The total number of pairs that could be compared is thus (36*35)/2 = 630. Some of the rival programs, such as Juxta and Medite, compare texts in real time, caching the results in case the same pairs of texts are compared again. But this doesn't work so well when there are more than a handful of versions. An MVD file has the advantage that any combination of versions can be displayed in equal time and the comparison results don't need to be cached. So even if there are 5,600 versions, as in the case of the Greek New Testament, you will always get a quick response.

The version of each window can be changed by choosing the desired version from its popup menu. Text on the left that does not appear in the version on the right is highlighted red, and extra text on the right, not found in the left, is highlighted blue. As yet, transpositions are not detected – they will be shown as insertion/deletion pairs instead. Also, synchronisation has not yet been added, but as in the multilingual example, referenced at the start of this blog, synchronisation in HTML is possible by using hyperlinks to invisibly connect the left hand text with the text on the right. This should be added soon in a later update to the program.
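Conceptually, each side of Twin View can be thought of as a sequence of chunks, each either shared with the other version or unique to this one, rendered as HTML with a background colour on the unique chunks. The sketch below is a hypothetical illustration of that rendering step in Java, not the actual Phaidros code.

import java.util.List;

/** Hypothetical sketch of rendering one side of Twin View; not the Phaidros code. */
public class TwinViewRenderer {
    enum State { SHARED, ONLY_HERE }              // ONLY_HERE = absent from the other version
    record Chunk(State state, String text) { }

    /** Colour would be "red" for the left side and "blue" for the right. */
    static String toHtml(List<Chunk> chunks, String colour) {
        StringBuilder html = new StringBuilder();
        for (Chunk c : chunks) {
            if (c.state() == State.ONLY_HERE) {
                html.append("<span style=\"background-color:").append(colour).append("\">")
                    .append(escape(c.text())).append("</span>");
            } else {
                html.append(escape(c.text()));
            }
        }
        return html.toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Chunk> left = List.of(new Chunk(State.SHARED, "The quick "),
                                   new Chunk(State.ONLY_HERE, "brown fox"),
                                   new Chunk(State.SHARED, " jumps over the lazy dog."));
        System.out.println(toHtml(left, "red"));  // 'brown fox' highlighted red
    }
}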

Wednesday, April 23, 2008

Viewing Version Definitions

At last some improvements: you can now view the set of versions in an MVD. These can be grouped hierarchically to any depth, and each step in the hierarchy can be named:

The Version List

Viewing one version of the Sibylline Gospel

The text illustrated here is the moderately complex Sibylline Gospel – an apocalyptic prediction about what will happen at the end of the world. Nicoletta has recorded 36 versions, some of which are shown in the expandable list above. Getting this text right necessitated the fixing of a number of bugs in the NMerge package, but all seems to be well now. To build the document I used the MvdTool, which is a command-line version of NMerge. It does many of the same things as the wiki, but without the user-friendly interface.

Next step is to add twin-view, which will allow the user to instantly see highlighted differences between any two versions in an MVD. The edit version of this view will probably display an HTML rendering on the left and an editable window on the right, updated every few seconds, or at the press of an 'update' button.

Thursday, April 3, 2008

Proof of Concept

Yesterday I completed the Proof of Concept stage. I now have a very simple web application that allows you to choose an MVD from a list, then to view, edit and – most importantly – save individual versions. No searching, comparing or creation of new versions is yet possible. It doesn't check your XML, so if you make a mistake it falls over. Also, it is as ugly as sin. But I don't care. What matters is that it works. The underlying Nmerge platform is pretty solid and has a lot of functionality I haven't yet exposed in the user interface. This should come fairly quickly now. For the moment here are a couple of screen shots, firstly to show how ugly it is, secondly to show what it can do so far:

The MVD file list

Viewing one version of an MVD (transformed into HTML)

Editing one version of an MVD

Unfortunately, I can't publish the software on here because of copyright restrictions on the example texts. Send me an email if you are interested and if you are part of the research group I can probably send you a copy for testing.

Sunday, March 16, 2008

Merging N versions into an MVD

Tonight I finally managed to automatically merge 3 versions of a short story in Spanish into a valid MVD, after literally several years of trying. Previously I had to create MVDs manually for all my demos. Now I have an automatic tool for doing it accurately and fairly quickly. The steps are quite simple, and will be the same steps you would use in the multi-version wiki to build up a multi-version document.

  1. Starting with an empty document containing no text for version 1, update it with the full text of version 1. This produces an MVD with only one version.
  2. Now update the MVD with the text of the second version. The program does three things:
    1. It calculates the differences between version 1 and version 2. These identify parts of version 2 that are the same, or are inserted, deleted, or alternatives to the corresponding parts of version 1. The differences are normally aligned by word or, optionally, by character.
    2. It 'stitches in' the differences using version 1 as a base text to produce a correct MVD of two versions:
    3. It optimises the MVD so that any two arcs going from and to the same pair of nodes and containing the same text (in different versions of course) get merged into one arc. This happens all the time if, for example, you have a text containing the letter 'A' and then change it to 'B'. Then, you change it back to 'A' again. This produces three separate arcs for the three versions when you actually want two arcs, or two pairs, one for versions '1,3' containing the text 'A' and another for version '2' containing text 'B'. (A sketch of this merging step follows the list.)
  3. Finally, update the MVD using version 2 as the base of version 3. If you have more versions, then you just repeat this step as often as needed. You would normally choose a different base text each time, preferably one quite similar to the new one you are committing.
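Here is a small Java sketch of that optimisation step (names are hypothetical, not the NMerge implementation): any two arcs with the same start node, end node and text have their version sets merged into a single arc.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the arc-merging optimisation; not the NMerge code. */
class ArcOptimiser {
    static class Node { }
    static class Arc {
        Node from, to;
        String text;
        Set<Integer> versions = new HashSet<>();
    }

    /** Merge duplicate arcs and return the surviving arcs. */
    static List<Arc> optimise(List<Arc> arcs) {
        List<Arc> result = new ArrayList<>();
        for (Arc a : arcs) {
            Arc duplicate = null;
            for (Arc b : result) {
                if (b.from == a.from && b.to == a.to && b.text.equals(a.text)) {
                    duplicate = b;
                    break;
                }
            }
            if (duplicate != null) {
                duplicate.versions.addAll(a.versions);   // e.g. two 'A' arcs become one {1,3} arc,
            } else {                                     // while the {2} 'B' arc stays separate
                result.add(a);
            }
        }
        return result;
    }
}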

The result is an optimal alignment between the new version and the base version, rather than for all possible pairs of versions. For genetic texts at least, a writer conceptually changes an existing base version each time he/she makes an alteration to the text. Insertions, deletions and variants happen in pairwise fashion, not globally. Admittedly this is not true of texts that have evolved over time in many different, physically separate versions, as in the case of the complex manuscript tradition of an ancient text, but I still wonder how useful an alignment optimised over N versions would really be even in that case. Traditionally, at least, variants have always been aligned to a particular base text, whether a lost, shared original, or the editor's version or a copy text etc. The same thing happens in bioinformatics: a multiple sequence alignment program figures out which pairs of sequences are most similar, and then aligns them pairwise. All I am doing is taking away the automatic selection of a base version and replacing it with the user's choice.

The same process as the one described above works for updates of an existing version as well as for adding a new version. If you only change a few words, as you typically would during an editing session, the response time will be only a few milliseconds. In my test, adding all of version 2 to version 1 took only 0.7 seconds. Adding version 3 to version 2, however, took over 12 seconds. It all depends on the number of differences.

Unresolved issues

I haven't yet worked out how to include transpositions in this process, although it ought to be possible. The MVD format supports them fully, but we obviously can't calculate them or it would take forever. I think the user should specify them, because they are always visible in the original texts, rather than let a machine 'calculate' them somehow, probably badly.

There is also the question of 'transclusion', to misuse Nelson's term. What I mean by transclusion is altering the text of one version and having that change also applied to any other versions that share the same text. This ought to be optional, but it would make the wiki really useful. Imagine an aircraft manual made up of a set of systems shared across different aircraft. Updating one system ought to be propagated automatically to all the other manuals. Again, this is possible, but I couldn't fully work out the details in this version of the multi-version document platform, which I call Nmerge.

Structure of the multi-version Wiki

Here's a drawing of the overall structure of the wiki. The wiki module itself is called Phaidros, after the dialogue in Plato where Socrates criticises the medium of writing:

The next stage is to build the Phaidros web application. I have an old version that just listed one version of an MVD. I need to add at least the facility of editing the source in XML and committing the changes back to the MVD to turn it into a wiki. Then we will have proof of concept, but the hard work is now done.

Wednesday, March 5, 2008

What's a Multi-Version Document?

What's it for?

A multi-version document is for recording texts that exist in multiple versions. This mustn't be confused with recording multiple drafts of a single text you might be working on. Instead, a multi-version document is for recording the non-linear structure of a text, like the Greek New Testament, which exists in thousands of versions, or equally the text of a modern literary or philosophical work, which might have been published in several versions or may exist only in the form of manuscripts heavily edited by their author.

How does it work?

A multi-version document represents a text as a set of merged versions in a single digital entity, which can be efficiently edited, and its versions listed, compared and searched. Versions can overlap freely, and this overcomes the limitation of markup languages, which are based on the formal generative grammars invented by linguists in the 1950s that still underlie all markup systems today. A multi-version document is represented as a list of text fragments ti, each of which is assigned a set of versions vi:

{v1, t1}, {v2, t2}, ... {vn, tn}

This extremely simple form is all you need to record texts that contain thousands of versions. It is a form of digital text that trades complexity for mere size, and size is something modern computers are very good at handling. The structure of the document is implied by the intersection of the versions of each fragment. For example, to read a single version all you need to do is read through the list, picking out the fragments that belong to it. Other common operations, such as comparing two texts to find the differences, or printing the variants of a particular range within a given version are just as easy.
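To make the list form concrete, here is a small Java sketch of the pair list and of reading one version out of it by a single pass over the list; the class names are mine, chosen for illustration, not the actual MVD/NMerge classes.

import java.util.List;
import java.util.Set;

/**
 * Sketch of the pair list {v1,t1}, {v2,t2} ... {vn,tn} and of reading a
 * single version by picking out, in order, the fragments that belong to it.
 */
public class PairList {
    record Pair(Set<Integer> versions, String text) { }

    /** Reconstruct the text of one version by a single pass over the list. */
    static String readVersion(List<Pair> pairs, int version) {
        StringBuilder sb = new StringBuilder();
        for (Pair p : pairs) {
            if (p.versions().contains(version)) {
                sb.append(p.text());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Pair> pairs = List.of(
            new Pair(Set.of(1, 2), "The quick "),
            new Pair(Set.of(1), "brown fox "),
            new Pair(Set.of(2), "white rabbit "),
            new Pair(Set.of(1, 2), "jumps over the lazy dog."));
        // prints "The quick white rabbit jumps over the lazy dog."
        System.out.println(readVersion(pairs, 2));
    }
}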

The mathematical basis for this model

This form is equivalent to a 'graph', a set of intermingling paths that start at one point, branch, rejoin and split again, until they all join back together at the end. It has been proven, in the paper in the International Journal of Human-Computer Studies cited below, that these two forms of multi-version text, namely:

  1. the intuitive graph representation, and
  2. the list of pairs described above

are equivalent, that is, we can transform one into the other and back again with no loss of information. This kind of solid mathematical basis is in contrast to previous attempts at representing versions and overlapping structures in digital text, all of which were based on markup, which can only efficiently represent hierarchical structures. As many humanists and linguists have discovered, natural texts in their disciplines are much more frequently overlapping in structure than they are hierarchical.

The Variant Graph is not a Replacement for Markup: it Complements it

A Multi-Version Document cleanly separates content from variation. The content of a document is expressed by the textual fragments in the list, or by the textual labels to the arcs in the graph. The structure of the document, its variation, on the other hand, is expressed by the order of the pairs and by their sets of versions.

This means that any technology can be used to represent the content, even ordinary markup. Since all the overlapping structures have been removed and placed in the Variant Graph structure the markup can be simple enough to handle in a wiki. Of course, we are not tied to markup. If in future markup becomes obsolete we can still use Multi-Version Documents to record the content using some other technology, in binary form for example.

A Multi-Version Document doesn't, and need not, represent any of the complexities of text relating to content. It just represents versions and its complexity ends right there.

What we have so far

The first publication of the idea of 'network text' (submitted 2004) was in:

Schmidt, D. 2006 'A Graphical Editor for Manuscripts' Literary and Linguistic Computing 21: 341-351.

The first publication of the variant graph idea was in:

Schmidt, D., and Wyeld, T., 2005. 'A novel user interface for online literary documents.' ACM International Conference Proceeding Series 122, 1–4.

Subsequent conference papers appeared in:

Schmidt, D., and Fiormonte, D., 2006. 'A Fresh Computational Approach to Textual Variation', in: The First International Conference of the Alliance of Digital Humanities Organisations (ADHO) 5-9 July Paris-Sorbonne, Conference Abstracts, 193–196.

Schmidt, D. and Fiormonte, D. 2007. 'Documenti Multiversione: una soluzione per gli artefatti testuali del patrimonio culturale / Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts'. In Bordoni, L. (ed.) Proceedings of the AI*IA Workshop for Cultural Heritage. 10th Congress of Italian Association for Artificial Intelligence, Università di Roma Tor Vergata, Villa Mondragone, 10 settembre 2007, 9-16.

This was subsequently accepted for Intelligenza Artificiale (see below).

Schmidt, D., Brocca, N., Fiormonte, D. 'A Multi-Version Wiki', Proceedings of Digital Humanities 2008, Oulu, Finland, June 2008

Schmidt, D. and Colomb, R., 2009. 'A Data Structure for Representing Multi-version Texts Online', International Journal of Human-Computer Studies 67.6, pp. 497-514.

Schmidt, D., 2009. Merging Multi-Version Texts: a General Solution to the Overlap Problem, in The Markup Conference 2009 Proceedings, Montreal, August.

Schmidt, D., 2010. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing, 25.3, 337-356.

Schmidt, D., Fiormonte, D., 2010. Documenti multiversione: una soluzione per gli artefatti testuali del patrimonio culturale/Multi-version documents: a digitisation solution for textual cultural heritage artefacts. Intelligenza artificiale, IV.1 (Dec) 56-61.

Schmidt, D., 2012. 'The Role of Markup in the Digital Humanities', Historical and Social Research / Historische Sozialforschung 37.3, 125-146.

Schmidt, D., 2013. "Collation on the Web", Digital Humanities 2013 Abstracts.

Schmidt, D., 2014. Towards an Interoperable Digital Scholarly Edition, Journal of the Text Encoding Initiative 7 (forthcoming).

Wednesday, February 27, 2008

Status of Multi-Version Wiki project

I created this blog to record progress on the multi-version wiki I am developing so that other people who are also currently interested in the project can figure out where it is at.