Thursday, April 30, 2009

Nmerge tool code-complete

The nmerge commandline tool is now code-complete. I guess it's a 'pre-alpha' version. Since this is a revision of a previous working version, though, testing should not take too long. I estimate that I should have an alpha version after the Labor Day weekend (Monday 4th May). But with software you never know. This version supports the new merging algorithm from the submitted Balisage 2009 paper, which works pretty well.

Nmerge is also a Java library that can be used from within a Java application, like the Phaidros wiki, to provide support for Multi-Version Documents (MVDs). Once it has stabilised I will rewrite it as a C++ commandline tool. But for now we have to put up with a slightly more cumbersome syntax. Below is the "usage" statement produced by the program, so you can get some idea of what it does. Once it is reasonably well tested I will put the source code on SourceForge under the GPL v3.

The command syntax is a bit complicated, but so is what it is trying to do. I envisage that this tool could be used in a shell or commandline script to automate, say, the construction of an MVD from a set of files. At least that's what I use it for. In any case the -h option prints out an example or two of how to use each command. The -c option specifies the command you want to perform on the MVD, and the other arguments are the parameters that the command uses, provided they make sense. If they don't, you'll get an error message.

With the nmerge tool, MVD becomes a real format. There's no GUI because if I added one, you couldn't take it away and put in your own. If you need one, wait for Phaidros.

usage: java -jar nmerge.jar [-c command] [-a archive] [-b backup] 
     [-d description] [-e encoding] [-f string] [-g group] [-h command] 
     [-k length] [-l longname] [-m MVD] [-n mask] [-o offset] [-p]
     [-s shortname] [-t textfile] [-v version] [-w with] [-x XMLfile]
     [-?] 

-a archive - folder to use with archive and unarchive commands
-b backup - the version number of a backup (for partial versions)
-c command - operation to perform. One of:
     add - add the specified version to the MVD
     archive - save MVD in a folder as a set of separate versions
     compare - compare specified version 'with' another version
     create - create a new empty MVD
     description - print or change the MVD's description string
     delete - delete specified version from the MVD
     export - export the MVD as XML
     find - find specified text in all versions or in specified version
     import - convert XML file to MVD
     list - list versions and groups
     read - print specified version to standard out
     update - replace specified version with contents of textfile
     unarchive - convert an MVD archive into an MVD
     variants - find variants of specified version, offset and length
-d description - specified when setting/changing the MVD description
-e encoding - the encoding of the version's text e.g. UTF-8
-f string - to be found (used with command find)
-g group - name of group for new version
-h command - print example for command
-k length - find variants of this length in the base version's text
-l longname - the long name/description of the new version (quoted)
-m MVD - the MVD file to create/update
-n mask - mask out which kind of data in new mvd: none, xml or text
-o offset - in given version to look for variants
-p - specified version is partial
-s shortname - short name or siglum of specified version
-t textfile - the text file to add to/update in the MVD
-v version - number of version for command (starting from 1)
-w with - another version to compare with version
-x XML - the XML file to export or import
-? - print this message
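
As a rough sketch of the scripted use mentioned above, here is how a small Python script might assemble the nmerge command lines for building an MVD from a set of text files. It only builds and prints the commands; it does not run them. The flags are taken from the usage statement, but which of them each command actually requires is my assumption here, and the helper name build_commands, the group name "Base" and the file names are made up for illustration. The -h option remains the place to look for real examples.

```python
import shlex
from pathlib import Path


def build_commands(mvd, textfiles, group="Base", encoding="UTF-8",
                   jar="nmerge.jar"):
    """Return a list of argument lists: one 'create', then one 'add' per file.

    Assumption: 'create' needs only -m, and 'add' accepts -t, -s, -l, -g
    and -e together. Check 'java -jar nmerge.jar -h add' before relying on it.
    """
    base = ["java", "-jar", jar]
    commands = [base + ["-c", "create", "-m", mvd]]
    for textfile in textfiles:
        siglum = Path(textfile).stem          # e.g. "a.txt" -> "a"
        commands.append(base + [
            "-c", "add",
            "-m", mvd,
            "-t", str(textfile),
            "-s", siglum,
            "-l", "Version from " + Path(textfile).name,
            "-g", group,
            "-e", encoding,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical input files; print each command as a shell-safe line.
    for args in build_commands("work.mvd", ["a.txt", "b.txt"]):
        print(shlex.join(args))
```

Each returned list could be handed to subprocess.run once the argument combinations have been confirmed against the tool's own examples.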

Thursday, April 23, 2009

MVDs in binary or XML?

A pattern is emerging in the effect that the MVD concept is having on people. They take on board its power at representing variation, but they don't like the idea of representing the data in binary form. Instead they think it is possible to represent variation in some form of XML. So far I've heard proposals to use TEI-XML, RDF or GraphML. It's tempting, of course, to carry on using XML when this is the tool we are all most familiar with. However, my point in developing the MVD format was precisely to get around the limitations of all forms of markup. You can't represent a variant graph in XML satisfactorily if the text whose variation you are recording is itself XML, and it usually is. The reason is that you can't represent cases where the markup itself varies, for example the deletion of a paragraph break:

<del></p><p></del>???

Of course there are hacks to get around this particular case, but they have negative consequences. What you end up doing is modifying the markup to accommodate weaknesses in the representational power of markup itself. I think that is a fundamentally flawed strategy. It is just another form of putting presentational information into markup that is supposed to be generic. If you try to represent variation in a set of texts or in one text using markup, you very quickly run up against the problem of overlap. And markup is very poor at representing that, as we all know. The only way to get around the overlap problem completely is to represent variation using a non-markup-based technology. That's the whole point of MVDs, and it doesn't seem to have been widely acknowledged yet.
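
The paragraph-break case can be checked mechanically. The following is a minimal sketch using the XML parser from Python's standard library, which is simply a convenient tool for the illustration; the sample sentences and the wrapper element name "text" are invented. It shows that wrapping the break in a deletion element produces something that is no longer well-formed XML, because the del element would have to overlap the two paragraphs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two paragraphs, no variation recorded: parses fine.
plain = "<text><p>One.</p><p>Two.</p></text>"

# Same text, with the paragraph break itself marked as deleted.
# <del> opens inside the first <p> and closes inside the second,
# so the elements overlap and the parser rejects the document.
marked = "<text><p>One.<del></p><p></del>Two.</p></text>"

if __name__ == "__main__":
    print("plain: ", is_well_formed(plain))
    print("marked:", is_well_formed(marked))
```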

Sunday, April 5, 2009

MergeTester released

For the thesis I wrote MergeTester, a simple utility that implements the merging algorithm from chapter 5. Although not a practical program, it does demonstrate how the algorithm works and allows the user to test it on folders of versions in any format. It builds up a variant graph of the versions and prints it out one arc at a time. From the printout the user could manually reconstruct the graph or part of it.
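
To give a feel for what "a variant graph printed one arc at a time" means, here is a small illustrative sketch. It is not MergeTester's code or its output format, and it does not implement the merging algorithm; the Arc class, the node numbering and the sample text are all assumptions made for the example. Each arc carries a piece of text and the set of versions that pass through it, and a version is recovered by following its arcs from the start node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    source: int           # node the arc leaves
    target: int           # node the arc enters
    versions: frozenset   # versions that share this piece of text
    text: str


def read_version(arcs, version, start=0):
    """Follow the arcs belonging to 'version' from the start node."""
    by_source = {a.source: a for a in arcs if version in a.versions}
    node, parts = start, []
    while node in by_source:
        arc = by_source[node]
        parts.append(arc.text)
        node = arc.target
    return "".join(parts)


# Hand-built graph for two versions that differ in a single word.
arcs = [
    Arc(0, 1, frozenset({1, 2}), "The quick "),
    Arc(1, 2, frozenset({1}), "brown"),
    Arc(1, 2, frozenset({2}), "red"),
    Arc(2, 3, frozenset({1, 2}), " fox"),
]

if __name__ == "__main__":
    for arc in arcs:      # one arc per line, as a printout might look
        print(arc.source, "->", arc.target, sorted(arc.versions), repr(arc.text))
    print(read_version(arcs, 1))
    print(read_version(arcs, 2))
```

Shared text is stored once, on arcs labelled with both versions, and only the differing words get arcs of their own.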

The advantage of the program lies in the fact that the way it works is not obscured by any other code and it does not depend on third-party libraries. Any comments and reports of bugs found will be gratefully received!

At the moment I am incorporating it into nmerge, which will also be released shortly. Nmerge can convert a variant graph into an MVD, so the merging algorithm will then become practical.