Friday, March 30, 2012

Importing into HRIT

Having a good import tool is one of the most important things in building a digital text archive. When you have 2,000+ files to load into the system, doing anything with them by hand is prohibitively expensive. I've tried to identify the steps that must be implemented to represent texts as multi-version works described by multi-version standoff properties. I intend to tick off these steps as I do them. Most of the pieces are already available in some form, so it's really just a matter of pulling them all together.

  1. A GUI to interact with the user:
    • Select files for upload.
    • Gather information about where the merged files are to go.
    • Specify a filter for plain text files, if any.
    • Verify that the information entered makes sense.
  2. Compare each submitted file with the first file. If a file is less than, say, 10% similar to it, or is too large, reject it and tell the user.
  3. Split the remaining files into two groups: a) plain text and b) XML
  4. Filter the plain text files, producing one set of markup for each.
  5. Split the XML files into further versions, thereby multiplying them. This happens if there are any add/del, sic/corr, app/rdg etc. variations.
  6. Strip each XML file into markup and text.
  7. Merge the sets of markup and text into a corcode and a cortex, and install them in the specified location.
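
As a rough illustration of steps 2 and 3, here is a minimal Python sketch. It is not the actual import code: the similarity measure (`difflib.SequenceMatcher.ratio`), the size limit, the test for "is this XML?" and the function names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def looks_like_xml(text):
    """Crude assumption: a file is XML if its first non-blank character is '<'."""
    return text.lstrip().startswith("<")


def screen_files(files, min_similarity=0.10, max_size=1_000_000):
    """Screen and sort uploaded files.

    files: list of (name, text) pairs; the first pair is the reference file.
    min_similarity: the "say 10%" threshold from step 2.
    max_size: hypothetical size limit, in characters.
    Returns (plain, xml, rejected); rejected holds (name, reason) pairs.
    """
    plain, xml, rejected = [], [], []
    if not files:
        return plain, xml, rejected
    base_text = files[0][1]
    for name, text in files:
        if len(text) > max_size:
            rejected.append((name, "too large"))
            continue
        matcher = SequenceMatcher(None, base_text, text, autojunk=False)
        if matcher.ratio() < min_similarity:
            rejected.append((name, "too different from the first file"))
            continue
        # Step 3: sort the survivors into plain text and XML.
        (xml if looks_like_xml(text) else plain).append(name)
    return plain, xml, rejected
```

Called with a plain text file, a similar XML file and an unrelated file, `screen_files` would put the first in `plain`, the second in `xml` and the third in `rejected`, together with a reason that can be shown to the user.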

That completes the basic import process.
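
Step 6, stripping an XML file into markup and text, can be pictured with a small Python sketch. This is only an illustration under assumptions: the `(name, start, end)` range format is invented for the example and is not the real corcode format, and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def strip_xml(xml_string):
    """Split an XML string into plain text and standoff ranges.

    Returns (text, ranges), where each range is a hypothetical
    (element_name, start, end) triple of character offsets into text.
    """
    chunks = []   # pieces of plain text, in document order
    ranges = []   # standoff ranges, in document order
    pos = 0       # current offset into the plain text

    def walk(elem):
        nonlocal pos
        slot = len(ranges)
        ranges.append(None)          # reserve a slot to keep document order
        start = pos
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:           # text that follows the child's end tag
                chunks.append(child.tail)
                pos += len(child.tail)
        ranges[slot] = (elem.tag, start, pos)

    walk(ET.fromstring(xml_string))
    return "".join(chunks), ranges
```

For the input `<p>The <hi>quick</hi> fox</p>` this returns the text `The quick fox` and the ranges `[("p", 0, 13), ("hi", 4, 9)]`: the markup no longer sits inside the text but points into it.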

Sunday, March 11, 2012

A Better Way to do Transpositions

One of the weaknesses of nmerge is its handling of block-size transpositions. It does the small ones OK, but large transposed blocks pose a problem because they are rarely contiguous. They tend to break up into short strings of literal similarity, punctuated by small differences in spelling. For example, if you transposed a paragraph in Shakespeare between two editions, differences in spelling would make it hard for the program to see that all the small similarities add up to an entire transposed block. Every idea I had to get around this limitation threatened to make nmerge much slower. Until now.

At the moment, when you add a version to the variant graph (as shown above), nmerge aligns it with the graph directly opposite. It does this recursively, by merging sections of identical text and then gradually making the leftover sub-graphs and sub-sections of the new version smaller and smaller until all the new text is merged into the main graph. So in the drawing above the leftover sub-graphs are "The quick red/brown" and "lazy dog." and the new version fragments are "The lazy grey" and "quick dog." Transpositions are looked for to the left and right of the opposite graph-section, and they replace the direct alignment if a longer match is found. This can only find short contiguous sections of transposed text, and if they are far enough away nmerge simply ignores them. The longest match between "The lazy grey" and its opposite sub-graph is "The", but there is a longer match, "lazy", with the other sub-graph. Because that sub-graph is too far away, relatively speaking, nmerge might miss it.

The new algorithm is much simpler. Rather than align a section of new text with its directly opposite sub-graph and then look to either side for transpositions, it aligns that section with all the remaining sub-graphs equally. So if an entire block is transposed (and we found this quite often in the Tagore poems), nmerge will simply choose the best sub-graph to align with for each new section of the text. The problem now is to stop it making trivial transpositions like "the" between the start and end of the work. Some kind of weighting based on previous alignments between blocks as well as distance and length might be the way to go.
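
The idea of choosing the best sub-graph for each fragment can be sketched in a few lines of Python. This is not nmerge's code. It makes several simplifying assumptions: each sub-graph is flattened to a single string (so "The quick red/brown" becomes just "The quick brown"), positions are plain integers, and the weighting (match length divided by a distance penalty, with a minimum score) is just one hypothetical way to discourage trivial long-range matches. Weighting by previous alignments is left out.

```python
from difflib import SequenceMatcher


def longest_match(a, b):
    """Return the longest substring shared by a and b (may be empty)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def best_alignment(fragment, frag_pos, subgraphs,
                   distance_penalty=0.1, min_score=3.0):
    """Choose the sub-graph that a fragment of new text aligns with best.

    subgraphs: list of (position, text) pairs, one per leftover sub-graph.
    Every sub-graph is a candidate; a match scores its length divided by
    a penalty that grows with distance (a hypothetical weighting).
    Returns (subgraph_index, shared_text), or None if nothing scores
    at least min_score.
    """
    best, best_score = None, min_score
    for index, (pos, text) in enumerate(subgraphs):
        shared = longest_match(fragment, text)
        score = len(shared) / (1.0 + distance_penalty * abs(pos - frag_pos))
        if score >= best_score:
            best, best_score = (index, shared), score
    return best
```

With the sub-graphs `[(0, "The quick brown"), (1, "lazy dog.")]`, the fragment "The lazy grey" at position 0 is aligned with sub-graph 1 via "lazy " rather than with its direct opposite via "The ", and "quick dog." at position 1 is aligned with sub-graph 0 via "quick ". A four-character match such as "the " a hundred positions away scores well under the minimum and is rejected.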

Tuesday, March 6, 2012

HritServer Progress

HritServer has progressed to version 0.1.2. It can generate HTML for compare view, with parallel texts that will be able to scroll in sync, plain HTML versions, a dropdown menu to select versions, and a few services like stripping XML into markup and plain text. The back-end is also taking shape:

The idea here is to build up a web page, in any web-development system whatsoever, from a series of configurable components. Each component is just a fragment of HTML whose formatting and markup can be specified by the user. When I say markup I don't mean XML; I mean standoff properties. So you can have multiple markup sets and stylesheets for one text that combine sensibly. Or you can just ask for that component and let the default formatting do all the hard work for you.
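
To make "multiple markup sets for one text" a little more concrete, here is a minimal Python sketch of how several sets of standoff properties might be combined into one HTML fragment. It does not show HritServer's API or formats: the `(name, start, end)` property triples and the `render` function are invented for the illustration, and each property is simply turned into a CSS class so that a stylesheet can do the formatting.

```python
from html import escape


def render(text, *markup_sets):
    """Render text as an HTML fragment from any number of standoff sets.

    Each markup set is a list of hypothetical (name, start, end) triples
    giving character offsets into text. Overlapping properties are fine:
    the text is cut wherever any property starts or ends, and each piece
    gets the names of all the properties that cover it as CSS classes.
    """
    props = [p for markup in markup_sets for p in markup]
    cuts = {0, len(text)}
    for _, start, end in props:
        cuts.update((start, end))
    cuts = sorted(c for c in cuts if 0 <= c <= len(text))
    out = []
    for seg_start, seg_end in zip(cuts, cuts[1:]):
        names = sorted({name for name, start, end in props
                        if start <= seg_start and seg_end <= end})
        segment = escape(text[seg_start:seg_end])
        if names:
            out.append('<span class="%s">%s</span>' % (" ".join(names), segment))
        else:
            out.append(segment)
    return "".join(out)
```

For example, `render("The quick brown fox", [("line", 0, 19)], [("stress", 4, 9)])` combines a structural set and an emphasis set into `<span class="line">The </span><span class="line stress">quick</span><span class="line"> brown fox</span>`. Because the two sets are kept apart until rendering, either can be swapped out without touching the text.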

The back-end shown above, in a very early version, is supposed to be a 'reference implementation' that fully exercises the HritServer application. So you can see what it is capable of and how to do it at the same time.