Tuesday, November 16, 2010

ESTS Pisa 2010

So Chicago went well. To my surprise they particularly liked my idea of rolling versions of standoff markup into a separate MVD. Unfortunately I didn't finish the formatter commandline tool and the web demo (see below) and even now they are incomplete. I will continue to develop these tools as soon as I have a free moment, but there are more pressing needs.

On the 25th we are presenting at the ESTS conference in Pisa, Italy. A lot of people will be there so I am keen to put on a good show. First we will show what we did for Rome in July properly, minus the technical glitches. (Hopefully there will be a better projector setup). And I also want to offer a single and simple installer for the MVD-GUI so that people can try it out. That means:

  1. Fixing remaining bugs in the finished modules
  2. Getting an all-in-one installer for my Joomla component and its modules.
There doesn't seem to be a kosher way to roll up installation of components and modules into one install package in Joomla, but the authors of Joomdle seem to have worked out a way. I am following their lead and building something similar for my installer. Disguised as a component, the MVD-GUI will, when finished, actually install one component, three modules and a template. Anyone who can operate Joomla! should be able to do it. For the moment, though, I'll leave out Tree View, because its installation requires compiling programs on the server. But we will certainly demonstrate it in Pisa. Of course I'm not going myself (I'm exhausted); one of my colleagues is going instead. If you want to know who, take a look at the programme.

Saturday, October 16, 2010

Electronic Editions without Embedded Markup

I am going to do a demo for Chicago, which I visit on the 28th for a talk at Loyola University. I want to demonstrate a working method for converting legacy XML files with multiple embedded versions into separate HTML files, a bit like the Versioning Machine. However, between the XML and the HTML the interim files will consist solely of plain text plus the potentially overlapping markup extracted from the XML. These two kinds of file can each be combined into separate Multi-Version-Documents, so that electronic editions of multi-version texts can carry markup that can be freely combined or removed at any time without affecting the text.

This isn't just standoff markup. I'm using a custom technique called HRIT format, which is based on xLMNL. It simply records a set of ranges with attributes (garnered from the XML), so potentially any property can overlap with any other property. And these ranges, tied by fixed offsets to the original text, have no hierarchical structure.
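To make the format concrete, here is a minimal Python sketch of the stripping step (the real splitter and stripper are C programs, and the exact HRIT serialisation is not shown here): XML becomes plain text plus a list of (name, offset, length, attributes) ranges tied to the text only by offsets.

```python
import xml.parsers.expat

def strip_markup(xml_src):
    """Split XML into plain text plus standoff ranges
    (name, offset, length, attributes) over that text."""
    text_parts, open_stack, ranges = [], [], []
    pos = 0  # running offset into the plain text

    def start(name, attrs):
        # remember where this element's range begins
        open_stack.append((name, attrs, pos))

    def end(name):
        n, attrs, begin = open_stack.pop()
        ranges.append((n, begin, pos - begin, attrs))

    def chars(data):
        nonlocal pos
        text_parts.append(data)
        pos += len(data)

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = chars
    p.Parse(xml_src, True)
    return ''.join(text_parts), ranges
```

Once the markup lives in ranges like these, nothing forces them to nest: a later editorial range can freely overlap a structural one.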

Cool. But how does this get turned into hierarchical HTML? That's the tricky part. I will define a simple CSS file, which is interpreted as a recipe for constructing the HTML file. For example, we might have a style definition "p.stage". This would mean that we should generate a paragraph (<p class="stage">...</p>) for all ranges called "stage", and apply the formats of the p.stage style definition. The beauty of this is that the same CSS file can be used both for formatting the HTML and for generating it. Now that's cool. Here's an outline of the demo (I'll tick them off as I do them):

  1. Encode all versions of Act 1, scene 2 of King Lear as ONE TEI-XML file, using parallel segmentation. ✓ Done, at least for the first three folio versions. But I should really include the quartos too.
  2. Write a simple C program, splitter, to separate out the versions. This produces copies of each version as separate TEI-XML files. ✓ Done.
  3. Strip out the markup from these files with stripper, another C program. This produces two files for each XML file:
    1. the original text, stripped of all markup.
    2. the markup expressed in HRIT standoff format, with coordinates for where each range now sits in the plain text. ✓ Done.
  4. Write a simple CSS stylesheet and another C program, formatter, that takes the standoff markup from step 3b and recombines it with 3a into HTML, using the stylesheet definitions. This is the most complex program: it needs to parse CSS in only a superficial way and use definitions of the type element.class to construct the HTML. The class will be the name of an XML element and the element will be the HTML element name. Then the program need only dumbly create elements for the given ranges. Since the markup was originally nested, the result will also be nested. (This was in fact a requirement for XSLT to do the same work.) Eventually the program will need more flexibility, so it can handle nesting more intelligently. ✓ Done, but needs further extension.
  5. Display the result of one version in the browser. ✓ Done
  6. Write a simple interactive web program consisting of a web page and some Javascript. Divide the page into two parts. On the right a few of the most common properties as buttons: paragraph, speaker, speech, etc. Less common properties can be selected from a dropdown menu and a button to apply the property. On the left the raw text of King Lear. Now select a bit of the text and press a button. This sends the selection to the server, which adds a format to that range, calls the formatter program to change the HTML, then refreshes the page, so the text formats interactively. For the server just use apache+PHP, and call the commandline tools via exec. In progress
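The CSS-as-recipe idea in step 4 can be sketched in Python (the real formatter is a C program; the selector layout and all names here are my own illustration). A definition like div.p means: wrap every range named p in a <div class="p">; because the ranges came from nested XML, emitting opens and closes at their offsets reproduces a nested result.

```python
import re

def parse_recipe(css):
    """Read 'element.class { ... }' definitions into a map from
    XML element name (the class) to HTML element name."""
    rules = {}
    for html_el, xml_name in re.findall(
            r'([A-Za-z][\w-]*)\.([A-Za-z][\w-]*)\s*\{[^}]*\}', css):
        rules[xml_name] = html_el
    return rules

def to_html(text, ranges, rules):
    """Wrap each standoff range (name, offset, length, attrs) in its
    HTML element; assumes the ranges nest, as XML-derived ones do."""
    events = []
    for name, start, length, _ in ranges:
        if name in rules:
            el = rules[name]
            # at equal offsets: closes before opens; longer ranges open first
            events.append((start, 1, -length, '<%s class="%s">' % (el, name)))
            events.append((start + length, 0, length, '</%s>' % el))
    out, last = [], 0
    for off, _, _, tag in sorted(events):
        out.append(text[last:off])
        out.append(tag)
        last = off
    out.append(text[last:])
    return ''.join(out)
```

The same CSS file then styles the generated HTML unchanged, which is the whole point of the trick.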

Once this is incorporated into the MVD-GUI (in place of the XSLT step that currently transforms the XML of the versions into HTML) we will have an electronic edition that is truly free of embedded markup!

Tuesday, July 27, 2010

Greek MVDs

Coming as I do from a background in classics, it was a shock to get a recent query about ancient Greek texts. And of course behind the problem was a bug. What nmerge actually does is merge a set of versions on byte, not character, boundaries, and one consequence of this is that in some encodings characters can get split. When you read an entire version this isn't a problem, but what if you want to compute the variants of a text? Then the bits that vary might be only half a character. This can play havoc with encodings like UTF-8 when used to encode anything more complex than English. So Greek was an acid test, and although the result initially wasn't good enough, I fixed the problem by migrating the half-characters after the alignment to the correct 'side' of the arc. So no more splits.
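In byte terms the repair looks something like this Python sketch (nmerge itself is Java, and which side the bytes migrate to in the real code is a detail I'm glossing over): UTF-8 continuation bytes all match the bit pattern 10xxxxxx, so any that begin a segment belong to a character started in the previous segment.

```python
def mend_boundary(left: bytes, right: bytes):
    """Move leading UTF-8 continuation bytes (10xxxxxx) of `right`
    back onto `left`, so no character straddles the boundary."""
    moved = 0
    while moved < len(right) and right[moved] & 0xC0 == 0x80:
        moved += 1
    return left + right[:moved], right[moved:]
```

After this both sides decode cleanly on their own, which is what the variants display needs.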

Here's the output of issuing an nmerge -c variants command on two versions of Athenaeus' Deipnosophists, with version 'A' as the base:

[B:συντετάσθαι]
[B:ὁ]
[B:ἔφη, ὥρα]
[B:κἂν]
[B:διαστησῶμεθ’]
[B:τραγῳδίαν·]
[B:αἰνίγμασιν· ἱκανῶς]

I've begun to realise, though, that what this needs is a reference system, so the user can relate the apparatus to the text. Which is more work, of course.

Saturday, July 17, 2010

Improvements to Apparatus

One advantage of Multi-Version-Documents is that generating an apparatus is so easy. There is just a simple command in nmerge: you specify the range (offset and length), the desired base text, and it then computes the traditional 'apparatus' type display of all the variants aligned on word-boundaries. Maybe it's old-fashioned and print-related but it does show you many versions of the text in a very compact way. So I think it's still useful.

The problem I had been struggling with for the past couple of weeks was how to ensure that this range in the MVD could be specified precisely via a selection in the GUI. Of course what the user sees is not the contents of an MVD. It is extracted and transformed via XSLT (at the moment) and the user selection in HTML bears no clear relation to the corresponding selection in the underlying data. The problem boils down to aligning the XML and HTML versions of the text fast enough for the user not to notice. There are plenty of techniques for doing this, but they all take waaaay too long. I wanted it in fractions of a second. After perhaps the sixth try my new method finds the correct answer in around 28 milliseconds for the King Lear example in slow old PHP. What is perhaps most annoying is that the method I used was incredibly simple. It's just 58 lines of code. Strange that you can never see the simple things that are right under your nose. :-) And when you finally have the answer you can't explain why it didn't occur to you earlier.

If anyone really wants to know how I did it they can download the MVD_GUI code to find out. I'm not going to bore you all with technical details here. You might have to wait until I update the Google code site.

Friday, July 9, 2010

Tree View

Tree View is finally working. What this does is compute the genealogical tree of a set of versions. Although this is normally of use mostly for manuscript traditions, I believe that it is also useful for printed works. It can show at a glance the relationships between texts that make up a work. Previous attempts to do this (by others) were based on collation output and didn't take account of invariants, only variants. I think this casts doubt on the accuracy of the result. Also, rather than being offline and manual this method is online and automatic. There's a basic zoom facility which is useful for the larger trees. Changing any of the options recomputes the tree. Check it out at Harpur.

Here's a small sample from the DV website: relationships between 9 texts of Vincenzo Cerami's The Serpent Woman. This was published in a newspaper (so it's kind of a print tradition), and the author made available the pre-texts in the form of edited drafts. The length of the branches is significant (it indicates the distance between versions), but in case this gets confusing you can make all the lengths the same.

If you are wondering how I produce this online the process is basically:

  1. Query the MVD to produce a difference matrix (edit distance of each version from each other version)
  2. Pipe the result into the Fitch-Margoliash tree-building program from Phylip.
  3. Pipe the result into drawtree from Phylip. This outputs a postscript version of the diagram.
  4. Pipe the result of that into Ghostscript to produce a temporary JPG file, which you can view.

All this is done by executing a succession of binaries using exec() in PHP. I had to adapt fitch and drawtree extensively to get them to work with pipes. Fitch chokes a bit on the biggest tree (the Sibylline Gospel), but that's to be expected. It does work, though.
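Step 1 is the only part that isn't an off-the-shelf binary, so here is a rough Python sketch of it (nmerge actually derives the distances from the MVD; plain Levenshtein distance and this particular Phylip-style layout are my approximations):

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[-1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def phylip_matrix(versions):
    """versions: name -> text. Emit a distance matrix in a Phylip-like
    layout: a count line, then one row per version, names padded to 10."""
    names = list(versions)
    lines = ['%5d' % len(names)]
    for n in names:
        row = ' '.join('%.4f' % edit_distance(versions[n], versions[m])
                       for m in names)
        lines.append('%-10s%s' % (n[:10], row))
    return '\n'.join(lines)
```

In the real pipeline this matrix is piped into the adapted fitch, and everything downstream is pure plumbing.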

Friday, May 21, 2010

Cross-browser compatibility

I've fixed the incompatibility with lesser browsers. It's still not perfect, but Chrome and Safari now work too, although neither is quite as good as Firefox yet. I've also uploaded a very alpha, incomplete version of the mvd component and modules for Joomla! There's no friendly installer, but a README.txt explains basically how to get it going. This is something to build on, but it does work as far as it goes.

Thursday, May 20, 2010

Syncro-scroll, I love you

I don't know if anyone else has done this before. I suppose they have. What my Compare view now does is synchronise the left and right scrolling divs, even if they differ in length and content. With the Alpha web application I had something similar, but the user had to click on the text of one side, and the corresponding text on the other side scrolled down. This was done with hyperlinks, but it was a bit inconvenient because it required the user to actively click to get the two columns into alignment. Now I have a more discreet method with less 'excise' (as Cooper and Reiman would say). All the user has to do is scroll down or up in either div, and the text on the other side automatically maintains absolute alignment with the scrolling side, even if the two texts are radically different in length. That seems to be all the user needs to pick up the corresponding text on the other side. The eye runs across, naturally in the middle of the frame, to the other side, and there is the text in its different context. I'm delighted with it, and it took just 100 lines or so of simple Javascript.

CPU usage in Firefox is also good. When at rest the browser just shows the usual 0-4% activity. When you scroll continuously, usage can go as high as 13% momentarily, because it has to traverse the entire DOM tree every 1/2 a second. But as soon as you stop the Javascript detects no change and efficiently skips over most of the code.

I've added it to the Harpur site, where you can see it in action (no point in adding a lifeless screen dump here). I've tested it so far on Firefox and IE, which doesn't quite get the alignment right.

Monday, May 10, 2010

MVD Joomla Component

I've decided to release a very preliminary version of the Joomla component I am developing. It provides a GUI front end to nmerge. So far it has a single view, an import page and a list of available MVDs. It hasn't been extensively debugged, but I just want to get something up there so people can see that it is being developed and, if they really want, try it out, warts and all, incomplete as it is. As soon as I can I'll post a new project on Googlecode and provide a link to it from here. You can already see it in action on the Harpur Site. For the record, here are some screen-dumps of single view with the windowbox facility.

Windowbox, when expanded, updates automatically as you scroll. You can also find the variants of any passage by selecting it. To do that it embeds invisible markers in the text, which are used to tell nmerge where the selection begins and ends. These markers have a resolution of 256 bytes, so precise locations are not obtainable by this method. Further improvement is possible, but it is hard to derive the exact selection in the original text from the HTML version of it, and to do that consistently across platforms.
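In other words a selection endpoint can only be located to the nearest preceding marker; the arithmetic is trivial (this function is my own illustration of the 256-byte figure, not code from the component):

```python
MARKER_SPACING = 256  # bytes between the invisible embedded markers

def nearest_marker(selection_offset):
    """Round a selection offset down to the nearest marker position,
    the best precision this scheme can offer."""
    return selection_offset - selection_offset % MARKER_SPACING
```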

Single View with collapsed Windowbox

Single View with expanded Windowbox

Wednesday, April 28, 2010

Revised Variants Command

The revised variants command for nmerge now works like this: You specify a range with a particular version and it computes all the variants that leave and rejoin that path. Mathematically it is very simple. Unfortunately, variants must be aligned on word-boundaries. It doesn't make sense to compute them on character boundaries (as they are of necessity in the MVD). If you did that you would end up with variants like 'Q1:a' and have no idea what the context of this 'a' in version 'Q1' is. The problem is that in extending the variant to its natural word boundaries, you can of course encounter more variation. This means that you can end up duplicating variants. To get around this several fixes were required:

  1. Equal variants in different versions can be merged. So 'Q1:Map' and 'Q2:Map' become 'Q1,Q2:Map'. Cool.
  2. A variant can also be part of another variant. The versions are the same, so you just drop the smaller variant.
  3. Because of imperfections in the nmerge program a 'variant' can have the same text as the base version. In this case each computed variant is compared with the text of the equivalent base version and dropped if it is the same.

Getting all that working has taken a month. Here's the output of nmerge -c variants -m kinglear.mvd -o 2000 -k 100 -v 1 (variants in King Lear, base version 1, at offset 2000, length of range = 100):

[Q1:for,]
[Q1,Q2:mother]
[F3,F4:fair,]
[F2,Q1,Q2:faire,]
[Q1,Q2:&]
[F2,F3,F4:whorson]
[Q1,Q2:whoreson]

Here are the original 6 versions that contain these variants. Note that the initial 'r:' gets extended back to the first word-boundary and is in fact 'for:':

F1: r: yet was his Mother fayre, there was good sport at his making, and the horson must be acknowledged
F2: r: yet was his Mother faire, there was good sport at his making, and the whorson must be acknowledged
F3: r: yet was his Mother fair, there was good sport at his making, and the whorson must be acknowledged
F4: r: yet was his Mother fair, there was good sport at his making, and the whorson must be acknowledged
Q1: r, yet was his mother faire, there was good sport at his making, & the whoreson must be acknowledged
Q2: r yet was his mother faire, there was good sport at his making, & the whoreson must be acknowledged

Now you might say that a collation program could do as much. Yet I don't think so. In a collation program you have to collate the entire text of all the versions against the chosen base text to get that output, then sift through it to find the right location. Nmerge computes variants over ranges in the base text - actually it reads them from the MVD. And the base version can be changed at will. This makes it possible to display variants dynamically in a GUI.

Now all I have to do is call this via Ajax from the Joomla GUI. I'll need to filter it so that residual tags and entities get turned into something useful. Time, though, is beginning to run out.

Monday, April 19, 2010

Inadequacy of Embedded Markup

My paper on 'The Inadequacy of Embedded Markup for Cultural Heritage Texts' has just been published online by Literary and Linguistic Computing. It should be interesting to see what people make of it. It's not good to criticise, but sometimes if you don't the opposition will just keep saying that what we already have is good enough. And I'm tired of that.

Wednesday, March 24, 2010

Viewing Variants

One of the things that came out of the BookLogic seminar was the suggestion that the single view of the old Alpha application could be enhanced by adding a 'Windowbox' at the bottom of the window. This would display variants for sections of the text visible in the window or all of it. But how would it work?

The problem is that nmerge currently computes the innermost variants of a range of text in a specific version. Since there may be no variants for that stretch of text, it expands the selection outwards until it finds two points where at least one variant joins the selection in the specified version. Since it only expands outwards, this strategy can miss variants that occur within the specified range:

This is not how the apparatus criticus operates. Instead we really should specify a much broader range and then first compute the points to which other versions attach themselves to or split off from the specified version. At each pair of such points sharing a set of versions we would simply print out the variants. The 'Drowning By Versions' problem can be reduced by limiting the variants to those that split off from and rejoin within the specified range and version.

In the sketch above the variants of the selected pink region are B: 'white' for A: 'brown', and B: 'rabbit leaps' for A: 'fox jumps'. The variant 'horse walks around' in version C is disregarded because it does not start and end in the selected region of A.
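The filtering rule for the pink region fits in a single comprehension (the variant records here, with start and end offsets in the base version, are my own simplification):

```python
def variants_in_selection(variants, sel_start, sel_end):
    """Keep only variants that split off from and rejoin the base
    version entirely inside the selected range."""
    return [v for v in variants
            if sel_start <= v['start'] and v['end'] <= sel_end]
```

Applied to the sketch, B's readings fall inside the selection and survive, while C's 'horse walks around' spans past its edges and is dropped.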

In addition, the default selected range could be reduced to a narrow strip in the centre of the current window. In highly varying texts, Windowbox might look for variants of only one line, but not less. This setting would be configurable globally as a parameter for the Joomla component, say the central 50% of the window by default. And if the user selected some specific text, it would automatically update Windowbox with the variants of the selection. Windowbox should also be a collapsible element on the page, so the user can get a clean view of the reading text at any time.

I am really getting close to a first release of the Joomla component. It will only do import, list texts and view single texts, but it will give people a flavour of what it can do and hopefully generate some feedback.

Sunday, March 14, 2010

Synchronised scrolling of parallel texts

The best feature of the old Alpha web application was twin-view: it showed two parallel texts that aligned automatically when you clicked on the black (i.e. the same) text in either version. It does this by secretly writing each piece of black text as a hyperlink that calls a Javascript function instead of following a link. For example, the picture below shows the alignment after the user has clicked on the blue-highlighted text:

That was fairly cool, but what I was really trying to do was to scroll one column automatically as the user scrolls the other. I didn't think it was possible, but today I found a link on good old Google describing a timed Javascript routine that is called every 1/4 of a second, checks how far the user has scrolled, and then sets the colour of the background accordingly. Rather than setting the background colour, the two texts can be aligned every 1/4 of a second instead. With this new method the user won't have to do anything except scroll, and shouldn't even notice the slight jerkiness from the 1/4-second updates. Showing which lines are currently in alignment could be achieved as in the old method, by highlighting the two pieces of black text on either side. In fact it's really irritating that I didn't think of this before.

Single View

Single view is coming along just fine. I'm working on porting the search box code but everything else already works in Joomla:

Wednesday, March 10, 2010

Progress on Joomla GUI

The Joomla GUI, which is a replacement for the Alpha web application, is progressing nicely. Although technically I only have incomplete views for listing and choosing the texts, importing into an MVD, and viewing a single version, quite a bit of what is to follow is just an adaptation of the old Alpha application's XSL stylesheets. As a result I expect good progress over the next couple of weeks. It doesn't look like I will have all I wanted for the Book Logic workshop in Sydney after all.

The audience is expected to number around 50 or so, mostly bibliographers, including some famous ones, so it should be a good test. But the majority will be non-technical editorial types so the real challenge will be to get across my ideas to them. Having explained this before I already know that it takes quite a while before people understand the key concepts. But everyone who has listened so far, even starting from a position of extreme scepticism, has in the end always conceded that this is at least an intriguing idea. On the back of that experience, then, conveying anything to 50 people in the space of 10 minutes seems a bit daunting.

Thursday, February 11, 2010

A Native Version of nmerge

Nmerge allows you to merge a number of versions into one file, a multi-version document. But it is written in Java, and so it is difficult to call from most hosted web servers, which don't support Java. Although only a possibility at this stage, gcj (the GNU Java compiler) can generate native code from Java. If this works I won't need to port my nmerge code to C++, and I can keep developing in Java with all the benefits of a higher-level and more modern language. Gcj is a bit hard to use and always seems out of date, particularly the Swing GUI stuff, so I won't know if it really works until I try to compile the whole thing. But as there is no GUI associated with nmerge, it should work. I'll be trying to do this in the next few days, because I need it for the Booklogic demo, and I'll post another note when it works/doesn't work.

But does it work?

Actually, no. At least not with gcj. First off, it only supports Java 1.5. Secondly, even that doesn't work. After careful testing I discovered that gcj can't read files properly, which is kind of fundamental. And I can't afford the Russian product that is supposed to work very well. So it's back to rewriting nmerge in C++. Oh well.

Sunday, February 7, 2010

Masterclass Book Logic

I thought I'd describe my preparations for the BOOK LOGIC Master Classes and Symposium in Sydney, 19–20 March 2010. I took a risk by saying that I'd have something ready to demonstrate by then, but I find that deadlines have a way of inspiring one to get something done. Here are the views I want completed for Sydney. The current test version of this is on the Harpur site. (Only the list and import views are done so far, and only partly.)

List View

The list view just lets the user select or delete an MVD, and create new empty ones. It also lets you categorise them by putting them into folders (and moving them from folder to folder), but you have to log in first to get that facility. This is mostly done, though the new-file button doesn't yet work, probably because the external call to the nmerge tool fails in PHP for some reason. Almost every digital text archive needs such a facility, but usually all they have is a long list of HTML links that the user has to read through to find what they want. In the yet-to-be-done Find View the user will be able to locate a particular text by name, description or content and have it selected in this view.

Import View

The goal of Import view is to allow almost any kind of text to be loaded after being automatically cleaned up, in the vein of HTMLTidy. Ordinary TEI texts should be usable, as well as plain text. At the moment, though, for the Book Logic demo, I'm only going to implement the basic import facility of nmerge: that is, the texts have to be lightly marked up TEI already used for the current texts in the Harpur and Digital Variants archives. I guess I should publish that format online sometime soon.

Tree View

Tree View will, I hope, win over many converts to MVD by showing them the potential benefits of a format in which all versions of a work are in one file. It just uses the data in an MVD to build a phylogenetic tree or stemma using bioinformatics software. It shows the relationships between the various versions of a work, and will expose the options of the tree-generating program to the user, so you can configure the view to suit your own tastes or research interests. No one else has quite this facility online yet, as far as I know, although tree views of literary texts have of course been done many times before. Here's an example generated by Phylip from the 36 versions of the Sibylline Gospel. This can be useful even for modern texts.

Single View

Single view just shows the current text of interest and lets the user read a version of choice, which he/she can change by selecting it from a dropdown list.

Compare View

Compare view shows two texts side by side and highlights the differences. Clicking on a bit of text on one side makes the other side scroll to bring the two into sync. This already works in the Alpha web application, where it has been much appreciated, so I thought I would include it in the demonstration in Sydney. Not all version comparison can be done this way (sometimes you need to compare more than two texts), but it's still an undeniably useful view.

Well, that's the plan. Obviously there is still a lot to be done after that to make com_mvd a usable tool, but that's all I'm going to show on the day. And maybe I won't get it all done. Let's see. The workshop is only five weeks away!