Friday, December 23, 2011

HritServer

My colleagues at Loyola have, with their questions, inspired me to turn nmerge into a service. The idea of HritServer (Humanities Resources, Infrastructure and Tools) is to build a self-contained application that runs as a service from the commandline and provides the infrastructure needed to build any digital humanities website. It is a collection of the tools written over the past few years, including nmerge, the XML import/export tools, formatter, and the GUIs I developed for Digital Variants. The structure will be a little like Tomcat:

  1. A back-end administrative interface allowing the admin user to add or import new texts and edit existing ones
  2. Example GUIs in Java and PHP that exercise each facility provided by HritServer: compare, view variants, indexed search, tree-view.

The service type will be strictly RESTful; everything will be done via HTTP. The database at the back end will be a modern key-value store rather than an old-fashioned relational design. Each resource will be accessible via a simple URL, with no complex query machinery needed. For example, to get the formatted HTML of Act 1, Scene 1 of the First Folio of Shakespeare's King Lear, one would only need to fetch the URL:

http://dhtestbed.ctsdh.luc.edu/html/english/shakespeare/kinglear/act1/scene1/F1/

Anything can be stored at a similar URL in a simple hierarchical structure. The HTML is generated on the server from plain text and overlapping markup sets, and never stored. By passing parameters to the same URL, different formatted versions of the same text can be achieved. Different encodings of the same text can likewise be realised by specifying a different collection of markup sets. The idea is to take the complexity out of building such websites, and to maximise automation by providing a powerful base infrastructure that will work for any set of texts. It should be achievable within a reasonable time, because almost everything already exists (although some tools are still incomplete). All I have to do is stitch it together and test it.
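
To give the flavour, here is a minimal Java client sketch. The 'format' query parameter is hypothetical, purely to illustrate the idea of passing parameters to the same URL; nothing like it is implemented yet.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FetchLear {
    public static void main(String[] args) throws Exception {
        // same resource; a query parameter selects a different formatted view
        URL url = new URL("http://dhtestbed.ctsdh.luc.edu/html/english/"
            + "shakespeare/kinglear/act1/scene1/F1/?format=plain");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null)
            System.out.println(line); // the HTML generated on the server
        in.close();
    }
}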

I'll have to write a formal software specification but I've already made a good start on coding it.

Thursday, November 10, 2011

XML-Free Digital Editions

Playing around with Apache's CouchDB today I realised that it uses JSON, not XML, to handle exchanges between client and server. This opens the intriguing possibility of making XML-free digital editions. If the standoff properties for digital texts were also stored using JSON or YAML rather than XML - a simple enough change - then the entire edition could be XML-free. The only step that comes close is the final conversion into HTML for the browser. But this can be in good old HTML (an SGML dialect) rather than XHTML, an XML dialect. I doubt that anyone has achieved that before, depending on how you define a 'Digital Edition'. By that term I mean an online digital archive of marked-up texts accessible over the web. I think this is rather a liberating idea, and actually an inevitable one. If we believe the XML aficionados, disaster will ensue as soon as we abandon 'standards'. Actually what will happen is that digital editions will blossom with the possibilities offered by the new form of digital text. It's time to show people what is possible without XML.
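
To illustrate, a set of standoff properties might be serialised in JSON something like this - a purely hypothetical format, just to show there is nothing XML does here that JSON cannot:

{
  "text": "kinglear/act1/scene1/F1",
  "properties": [
    { "name": "speech", "offset": 0, "len": 142 },
    { "name": "italics", "offset": 23, "len": 5 }
  ]
}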

Monday, September 19, 2011

Formatter tool with full overlap

More progress, I'm afraid. I've incorporated the test program announced in the last post into the formatter tool. This is intended as a practical replacement for XSLT. So now I can convert real texts plus overlapping standoff properties into valid HTML. If the properties are derived from XML documents there won't be any overlap initially. What formatter does is loosen up that particular restriction. So in the GUI it will be possible to change properties or add new ones that overlap. And it will still format correctly. I'll be putting some test cases onto the testbed at Loyola soon.

Thursday, September 8, 2011

Web pages from overlapping properties

I've made some progress in turning random overlapping properties into HTML. I've written a test program both to demonstrate the principle and to serve as a debugging tool for me. In the latter role it hasn't reported a single error for two days, so I'm starting to think this is it. Although it doesn't do anything useful, it shows that neither embedded markup nor tree structures are necessary to mark up a text.

Saturday, July 16, 2011

OCR of unevenly lit documents

Someone gave me some scans in colour that needed converting via OCR into plain text. I thought I would run them through Tesseract, the main open source OCR tool. The results were dreadful, even when I converted them to greyscale as recommended. My images had three faults:

  1. they had a large border showing the book's binding and the surrounding environment of the image
  2. they were unevenly lit
  3. the text was curved - the result of trying to photograph a bound volume of typewritten pages that could not be fully opened without damage

It seemed to me that these problems must be similar to those encountered in practically any digitisation project. But there didn't seem to be any good open-source solutions.

I wanted to fix at least fault 2 to see how Tesseract would fare when the image was, as recommended, in plain black and white. However, after wasting a whole afternoon Googling the problem and trying every conceivable filter in Photoshop and Gimp, I couldn't reduce the image to black and white. The problem was the difference in illumination:

Shown on the right is a section from the upper right hand portion of a page, and on the left one from the bottom left. When these are reduced to black or white using a single global threshold, one comes out hopelessly too dark and the other too light.

An idea

So I downloaded the FreeImage library and tried to use it to write a simple filter. I first reduced the image to greyscale and manually cropped it, to simulate having already solved fault 1 above. Then I passed a small square of 64x64 pixels over the image and computed the average greyscale value for each square. Every pixel darker than this average by at least 8 (lower values are darker) was turned black; all others were turned white. This very simple approach had the effect of obliterating the lighting differences and producing an evenly illuminated, plain black and white text.
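
My filter used the FreeImage API, but the idea fits in a few lines of any language. Here is a rough Java equivalent using the standard ImageIO classes - a sketch, not my actual code:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class LocalThreshold {
    static final int BLOCK = 64; // side of the sliding square
    static final int DIFF = 8;   // how much darker than the local mean counts as black

    public static void main(String[] args) throws Exception {
        BufferedImage src = ImageIO.read(new File(args[0])); // greyscale input
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int by = 0; by < h; by += BLOCK) {
            for (int bx = 0; bx < w; bx += BLOCK) {
                int bw = Math.min(BLOCK, w - bx), bh = Math.min(BLOCK, h - by);
                // average greyscale value of this square
                long sum = 0;
                for (int y = by; y < by + bh; y++)
                    for (int x = bx; x < bx + bw; x++)
                        sum += src.getRGB(x, y) & 0xFF;
                long mean = sum / (bw * bh);
                // pixels at least DIFF darker than the local mean become black
                for (int y = by; y < by + bh; y++)
                    for (int x = bx; x < bx + bw; x++) {
                        int grey = src.getRGB(x, y) & 0xFF;
                        dst.setRGB(x, y, grey <= mean - DIFF ? 0xFF000000 : 0xFFFFFFFF);
                    }
            }
        }
        ImageIO.write(dst, "png", new File(args[1]));
    }
}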

Curvature

Unfortunately, Tesseract still doesn't like the strong curvature. It seems to split up lines along strict horizontals, and so it mixed up text from adjacent lines that curved into each other's path. The next stage will be to 'uncurve' the text automatically.

Saturday, June 11, 2011

From arbitrary overlap to HTML

If we try to represent original documents not authored in the digital medium, we soon discover that the pen or the printed type used to create them were not constrained, as modern embedded markup languages are, to represent only tree-structures. It would thus be very liberating to encode such documents for digital presentation on the Web, using arbitrary overlapping external properties instead of an embedded hierarchy of tags. This would provide a number of distinct advantages:

  1. Properties could represent the source texts more accurately.
  2. Different sets of properties could be combined in the same document.
  3. With appropriate software, it would be easier to edit separate text and markup files than complex embedded markup.
  4. Texts and markup, as separate building blocks, could be exchanged and reused for other applications.

These are winning arguments for digital humanists at least, and maybe also for other people who use embedded markup.

How it works

To see how this can be done let's specify some fictitious properties that apply to random ranges of a short text:

0,12,'banana'
3,7,'pear'
12,9,'refrigerator'
13,4,'orange'
18,12,'pineapple'
22,34,'guava'
35,12,'grape'
48,9,'penguin'
52,17,'dog'

What this means is that the text, which is at least 69 bytes long, is 'marked-up' by a series of arbitrary properties. The offsets in the text where these properties start are specified by the first number in each line, their lengths by the second number, and their names by the quoted strings. Of course, in a real-world application the names would more likely be 'p' or 'span' or 'table' etc.

Reduction to intervals

But how can we turn this apparent chaos into syntactically correct HTML? The approach taken here is to break up the properties into a series of 'intervals' where all the properties are the same throughout. For example between offsets 52 and 55 the properties 'dog', 'guava' and 'penguin' are all active.

Intervals defined by overlapping properties
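
A rough Java sketch of the interval computation, using the toy property list above (an illustration only, not the test program itself):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class Property {
    final String name;
    final int offset, len;
    Property(String name, int offset, int len) {
        this.name = name; this.offset = offset; this.len = len;
    }
}

public class Intervals {
    public static void main(String[] args) {
        List<Property> props = Arrays.asList(
            new Property("banana", 0, 12), new Property("pear", 3, 7),
            new Property("refrigerator", 12, 9), new Property("orange", 13, 4),
            new Property("pineapple", 18, 12), new Property("guava", 22, 34),
            new Property("grape", 35, 12), new Property("penguin", 48, 9),
            new Property("dog", 52, 17));
        // every property start or end is a potential interval boundary
        TreeSet<Integer> bounds = new TreeSet<>();
        for (Property p : props) {
            bounds.add(p.offset);
            bounds.add(p.offset + p.len);
        }
        Integer prev = null;
        for (int b : bounds) {
            if (prev != null) {
                // a property is active in [prev,b) if it covers the whole range
                List<String> active = new ArrayList<>();
                for (Property p : props)
                    if (p.offset <= prev && p.offset + p.len >= b)
                        active.add(p.name);
                System.out.println(prev + "-" + b + ": " + active);
                // e.g. prints "52-56: [guava, penguin, dog]"
            }
            prev = b;
        }
    }
}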

Although dividing the properties into intervals removes the overlap, it also creates too many short sequences for efficient HTML. So the next step is to work out where we might be able to join them up. To do that we need to know which tags may appear inside which other tags. In other words, we need a kind of basic schema.

On-the-fly deduced schemas

Fortunately we already have the HTML schema. Since we will be using CSS (cascading style sheets) to format the text, we can use CSS rules to tell us which HTML elements will represent our properties, and then work backwards to figure out how they will nest.

The size of the problem can be reduced by reflecting that not all of our overlapping properties will be rendered in HTML. Some have other uses: they may provide programming information, may not be needed in the current view, or may be intended for future use. So it is safe to ignore any properties that aren't mentioned in the CSS file. However, because HTML syntax is fairly loose, this only gives us part of the answer.

The missing information can be deduced from the properties themselves. If we are recording a play, for example, properties like 'line' almost always nest inside 'speech'. In those few cases where they don't, we can split the 'line' property so that it always nests within the dominant tag 'speech'.

Merging this statistical information about property nesting into that derived from the HTML schema allows us to reconstruct almost all of the structure needed to render the document correctly. However, since this relies on statistics, it is not absolutely guaranteed to work in all cases, but even when it doesn't we will still have valid HTML. The worst that can happen is that the formatting won't look right.

Using the deduced hierarchy information

One simple way to express a hierarchy is to record which elements may appear inside which other elements. If you have 10 elements that need rendering this means you must compute a 10x10 matrix. 10 is probably a realistic number in practice, but even with all 107 HTML5 tags a matrix of just 11,449 ints or 45K would suffice.

Properties (left) that may appear within other properties (top)

In my test program I just specified the nesting matrix manually. Following the natural analogy a 'guava' may appear inside a 'penguin', 'dog' or 'refrigerator', but an 'orange' cannot be inside a 'pineapple'. In the finished program, of course, such a matrix would be computed as described above.
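
In Java, the hand-specified matrix might look like this (a sketch of the test setup, not the finished program):

import java.util.Arrays;
import java.util.List;

public class NestingMatrix {
    static final List<String> NAMES = Arrays.asList("banana", "pear",
        "refrigerator", "orange", "pineapple", "guava", "grape",
        "penguin", "dog");
    // mayNest[child][parent] == true means child may appear inside parent
    static final boolean[][] mayNest = new boolean[NAMES.size()][NAMES.size()];

    static void allow(String child, String parent) {
        mayNest[NAMES.indexOf(child)][NAMES.indexOf(parent)] = true;
    }
    public static void main(String[] args) {
        // a 'guava' may appear inside a 'penguin', 'dog' or 'refrigerator'
        allow("guava", "penguin");
        allow("guava", "dog");
        allow("guava", "refrigerator");
        // 'orange' inside 'pineapple' is never allowed: its entry stays false
        System.out.println(mayNest[NAMES.indexOf("guava")][NAMES.indexOf("dog")]);
        System.out.println(mayNest[NAMES.indexOf("orange")][NAMES.indexOf("pineapple")]);
    }
}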

From this hierarchy information we can easily work out when to start and stop tags. For each interval visited in order we separate the ranges into three sets:

  1. closing: properties that are present in the previous interval but missing in the current one
  2. opening: properties that are present in the current interval but were absent in the previous one
  3. continuing: properties that are present in both the preceding and current intervals

After this initial classification we use the nesting matrix to correct any anomalies. Any property in the 'continuing' set that is contained by one from the 'closing' or 'opening' sets must be moved to the 'closing' set and also added to the 'opening' set. This is because, in order to preserve well-formedness, the closure or opening of the parent will force the closing and re-opening of a continuing child.
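
As a sketch, with toy property names and a stand-in for the real nesting-matrix lookup, the classification and correction step might read:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Classify {
    // stand-in for a lookup in the nesting matrix described above:
    // in this toy example only 'dog' can contain other properties
    static boolean containedByAny(String child, Set<String> others) {
        return others.contains("dog") && !child.equals("dog");
    }
    public static void main(String[] args) {
        Set<String> prev = new HashSet<>(Arrays.asList("guava", "penguin"));
        Set<String> curr = new HashSet<>(Arrays.asList("guava", "penguin", "dog"));

        Set<String> closing = new HashSet<>(prev);    // in previous, not current
        closing.removeAll(curr);
        Set<String> opening = new HashSet<>(curr);    // in current, not previous
        opening.removeAll(prev);
        Set<String> continuing = new HashSet<>(prev); // in both
        continuing.retainAll(curr);

        // correction: a continuing property contained by one that closes or
        // opens must itself be closed and re-opened to stay well-formed
        for (String p : new ArrayList<>(continuing)) {
            if (containedByAny(p, closing) || containedByAny(p, opening)) {
                continuing.remove(p);
                closing.add(p);
                opening.add(p);
            }
        }
        // here guava and penguin are closed and re-opened inside the new dog
        // tag, matching the <dog><penguin><guava>... shape of the output below
        System.out.println("close=" + closing + ", open=" + opening
            + ", continue=" + continuing);
    }
}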

Remaining problems

Even after these measures two problems remain:

  1. Since we allowed arbitrary overlap there is nothing to prevent two incompatible properties such as 'guava' and 'grape' being defined for the same interval. Neither may contain the other, so if they occur together one must be dropped.
  2. Tags must be written out in the correct order: highest level containers first and the lowest level tags last. This can be achieved by sorting the ranges within each interval by their descending position in the hierarchy. We also use a stack to ensure that closing tags match and come out in the correct order.

The result

Once these adjustments have been made, the intervals can be printed out one at a time, the closing tags followed by the opening ones. Here is the output of the test program, with dots representing the text for clarity:

<banana>............</banana><refrigerator>.<orange>....</orange>.<pineapple>...</pineapple></refrigerator><pineapple>.........</pineapple><guava>..................</guava><penguin><guava>....</guava></penguin><dog><penguin><guava>....</guava>.</penguin>............</dog>

In this crazy random example the conflicting properties 'pear' and 'grape' had to be dropped. There was no way to render them given the containment rules. But the result is still well-formed XML and it would be HTML if we had used a CSS file to transform the properties.

Where to go from here

This test solution needs to be incorporated into the formatter tool and the whole thing converted into a php extension, so it can be used as a direct replacement for XSLT.

Saturday, April 16, 2011

From TEI to HRIT and back again

Since we are designing a software suite to more or less replace embedded markup there has to be some way to import legacy texts. At first I thought the problem was insurmountable. Even if the original encoders had stuck to recommended guidelines such as the TEI (Text Encoding Initiative) they would have been forced to customise their encoding in two ways:

  1. By adding custom tags and attributes, and
  2. By making a selection of tags from the large number of available ones

In the second case it is clear that any general solution that embraced an arbitrary subset of TEI would have to support all of it. Since there are currently 519 tags in the scheme, and (probably) thousands of attributes, that is a daunting prospect for any programmer. And we are talking about meaningful conversion into an entirely different software system, not a simple one-for-one mapping. And with respect to point 1, any customised tags would either have to be left out, or their function would need to be specified by the user.

Solving the problem

When forced to perform the task, however, I soon realised that any customised tags must have already been specified by a user who understood XML. So that same user could supply a customised table of conversion in XML to say what should be done with them. If they didn't follow the Guidelines then they have to do a little extra work, but they're not shut out.

And in the second case only a small subset of TEI is regularly used by digital humanists. For the purposes of defining versions, for example, only a small number of tags come into play, and even customised ones would have to follow one of only a couple of basic patterns, which could be programmed in as general functions. The customisations could be handled by a 'recipe', or set of instructions on how to convert the files. A default recipe would be provided for standard files, which the user could extend or change at will.

Why do this at all?

Because HRIT format is much more powerful than TEI:

  1. It allows arbitrary overlap of properties
  2. It does not mandate any standard tag names
  3. It supports versions natively, including transpositions
  4. It allows mixing and matching of markup sets in the one text

That's more than enough reason to move from TEI to HRIT. Another way of looking at it is to say that, rather than replacing TEI, HRIT seeks to enhance it, and to use it as an interchange format between HRIT and non-HRIT users. It depends on what kind of 'spin' you prefer.

Two-way conversion

Any conversion applied to legacy files (or, if you prefer, current files) would have to be reversible. Those who had imported their files into HRIT and changed their minds later on would feel 'locked in' if they couldn't back out, and those who hadn't made the switch would likewise be frightened off by that very prospect. So the overall process looks like this. Red/green arrows indicate as yet unavailable/available paths:

'TEI' refers to any TEI-encoded file. The two-way process works like this:

  • Splitter splits the TEI file into N versions. By default it splits <app><rdg>...</rdg></app> structures as well as nested <del> and <add> and <choice> structures into versions. Unsplitter, not yet written, will take the versions (possibly modified) and try to put them back into one file, although this may be difficult. The recipe file is used by splitter to direct the splitting. It can be customised by the user to control which elements are split and how.
  • Stripper removes the remaining markup from each separate version in TEI format. A different recipe file specifies simplifications of elements intended to be rendered as formats in the final HTML. One simplification might be the reduction of <hi rend="italic"> to the property 'italics'. The output of stripper is the HRIT standoff XML format. (But stripper is written in such a way that another format can be added if required). It expresses every TEI element as a potentially overlapping property with possible 'annotations' or attributes. These attributes are ignored by the formatter but are not lost. Elements like the TEI-header, which contain metadata about the text, are entirely hidden but also not lost. This is to enable later reversal of the stripping process. Each version produces a pair of markup and plain text files that are separately merged into a single CorTex and a single CorCode file. It is these files that are edited and read by the HRIT system.
  • Formatter takes the properties of the CorCode and combines them with the information from the CSS file into HTML. The CSS is used not only to change the appearance of the text on a web page but also to transform the markup. For example the CSS rule span.italics can be used to change the appearance of italics, but also to convert properties called 'italics' into spans of class 'italics'. In this way we can avoid use of XSLT. But what about the 'annotations' that were originally attributes in the TEI-XML? They are simply ignored (although not lost). If you want to convert an element plus some attribute(s) into an HTML element using formatter, you must first specify a rule to simplify them to a plain property using stripper's recipe file.

Thursday, February 24, 2011

Multi-lingual MVDs

There are plenty of cases where the concept of 'work' spans more than one basic version in one language. Just think of the multi-lingual laws of the EU, the Romulo of Virgilio Malvezzi translated into several languages, each having its own textual history, or the Chronicles of Eusebius, in Latin, Greek and Armenian. The question is, how can you align the same text written in a different language? Can one align Latin and Greek, or French and German? In my opinion, no, or at least not automatically. Quite apart from the language dissimilarity, translations often have quite different structures, making alignment particularly difficult. But a tiny change to the definition of an MVD makes it possible to align such texts manually and to use the MVD format as a storage facility.

Tweaking the groups

MVDs have always had a simple grouping mechanism. You can group versions by type. For example, versions of a particular recension, or internal versions (corrections or revisions of a single manuscript) can be grouped together to keep them separate from versions in other physically different documents. Now if we assign one of these groups a simple attribute, called 'merge', and set it to 'true' or 'false', then we can control how an MVD is built up. For example, imagine we have French, German and Italian translations of some work, each in several versions. We could group all the Italian versions together, and similarly for the German and French ones. And we could set each group's attribute 'merge' to 'true'. But each such group would belong to a higher group, whose 'merge' attribute would be 'false'. So the merging program would know, on being given version 23 (French) to add to the MVD, not to merge it with version 16 (German), because their shared parent group is not merged. Here's how it would look schematically inside the resulting MVD.
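
And in code, the merge decision might look something like this - a sketch with a hypothetical Group class, not the actual nmerge implementation:

class Group {
    Group parent;  // null for the root
    boolean merge; // may versions under this group be merged together?
    String name;
    Group(String name, Group parent, boolean merge) {
        this.name = name; this.parent = parent; this.merge = merge;
    }
}

public class MergePolicy {
    // two versions merge only if the nearest group containing both allows it
    static boolean shouldMerge(Group a, Group b) {
        for (Group g = a; g != null; g = g.parent)
            if (contains(g, b)) return g.merge;
        return false;
    }
    static boolean contains(Group anc, Group g) {
        for (; g != null; g = g.parent)
            if (g == anc) return true;
        return false;
    }
    public static void main(String[] args) {
        Group root = new Group("work", null, false);
        Group french = new Group("French", root, true);
        Group german = new Group("German", root, true);
        System.out.println(shouldMerge(french, german)); // false: parent unmerged
        System.out.println(shouldMerge(french, french)); // true: merged group
    }
}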

This might also be a good strategy whenever the same 'work' is substantially rewritten, like the Morte d'Arthur and other medieval tales. Versions of each rewrite would get their own group and we wouldn't attempt to align them automatically because it just gets too messy.

Linking the translations

Now we can extend the standoff markup mechanism described in the previous post to link the texts of the different languages manually. We add a view that displays two versions of an MVD side by side.

Selecting some text on the right or left highlights it independently (you can do this in Javascript). Now select something in the opposite version and press the 'link' button. This creates an annotated property that specifies a link between the two selected ranges and records it via standoff markup. The view could then give the user graphical feedback by formatting the two selected blocks rigidly side by side.

They could also scroll together in sync, as they currently do in compare view. If blocks are transposed between languages (as often happens) the text might jump around a bit as you scroll, but so long as we align on the most central block it should work OK. Also, the alignment would hold for all the aligned versions on either side, not merely for the ones currently selected. If you had 12 German versions and 16 French ones, they would all be aligned at the same point of their shared text. You could even display an apparatus at the bottom of each side so the user could see the variants of the versions in each language.

How much work is that?

Although a special view would have to be designed, there is not much else needed to make it work. It might even be a good idea to add such a view to the MVD-GUI suite and see what people can do with it – but only once the standoff mechanism is up and running, because this solution depends on it.

Sunday, February 13, 2011

Standoff Properties explained

I've been asked for a more detailed explanation of how CorCode works as a set of standoff properties. I'll try but it won't be all that brief.

Embedded markup

Since at least the 1980s humanities texts have been described using embedded markup codes, but this leads to several problems:

  1. The structure imposed on the text is a tree, but the structure we want to describe may not be.
  2. The embedded codes need to be standardised because otherwise we can't share texts or create shared software. But there are so many codes we need to define that the standard soon becomes unwieldy.
  3. Embedded markup lacks flexibility. We can't easily exchange one set of markup for another, or merge two sets.
  4. Users who edit the texts have to read them through the smoke-screen of the tags and their attributes. And they have to learn a complex system that is becoming ever more complex.

Standoff markup is a partial solution to these problems. Removing tags from the text clarifies it for the reader, and allows the exchange of one set of tags for another. But with standoff markup we still can't combine two tag sets or define non-tree structures. And because the standoff codes depend on the inviolability of the text, we can't edit it.

What I was trying to explain in the previous post is that we can in fact overcome all of these problems by defining markup as a simple set of overlapping named properties. I'm not the first to suggest this by any means: in fact it resembles to varying degrees George's valency idea, LMNL, Thaller's extended string model, eComma, LORE and other annotation systems, and even TexMECS to some extent. But I'd like to describe my implementation because I think it offers some advantages over previous attempts.

Properties

Properties have a name, an offset and a length, which describe a range in the text. That's one string and two numbers. An example of a property is 'italics' at offset 23 with length 5. Let's just consider the offsets first.

Absolute versus relative offsets

With absolute offsets (as used in JITM and every other standoff system I know) the offsets of successive properties increase as we move through the text. So if we had properties at offsets 2, 10, 23, 45, 106, 230, 1022, 1100, 1495, 1567 and we added 121 characters at the start of the text, the first offset would have to change to 123, AND we would have to add 121 to all subsequent ones: 123, 131, 144, 166, 227, 351, 1143, 1221, 1616, 1688.

With relative offsets we only record the difference between each offset and the previous one. So the same sequence would read: 2, 8, 13, 22, 61, 124, 792, 78, 395, 72. (That's obtained by subtracting 2 from 10, then 10 from 23, then 23 from 45 etc.) Now when we add 121 characters at the start, the sequence changes to 123, 8, 13, 22, 61, 124, 792, 78, 395, 72. Only the first offset needs to change, because the relative distances between the other properties haven't altered.
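
A minimal Java sketch of that arithmetic (hypothetical method names), converting absolute offsets to relative ones and then applying an insertion at the start of the text:

import java.util.Arrays;

public class RelativeOffsets {
    // convert absolute offsets to relative: each entry minus its predecessor
    static int[] toRelative(int[] abs) {
        int[] rel = new int[abs.length];
        int prev = 0;
        for (int i = 0; i < abs.length; i++) {
            rel[i] = abs[i] - prev;
            prev = abs[i];
        }
        return rel;
    }
    public static void main(String[] args) {
        int[] abs = { 2, 10, 23, 45, 106, 230, 1022, 1100, 1495, 1567 };
        int[] rel = toRelative(abs);
        System.out.println(Arrays.toString(rel));
        // [2, 8, 13, 22, 61, 124, 792, 78, 395, 72]

        // inserting 121 characters at the start of the text touches
        // only the first relative offset:
        rel[0] += 121;
        System.out.println(Arrays.toString(rel));
        // [123, 8, 13, 22, 61, 124, 792, 78, 395, 72]
    }
}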

Property lengths

If, instead of just inserting text outside of a property, we altered the length of the property itself, say by extending a paragraph labelled with a 'p' (if we use TEI), then with relative offsets only the length of that property and the offset of the following one would change. Let's say that the length of the text covered by property 3 was 5 characters and we extended it to 12: then we'd change the length of property 3 from 5 to 12 and the offset of property 4 from 22 to 29 by adding 7 (i.e. 12-5). The other properties, both preceding property 3 AND following property 4, would not change.

Property names

Now let's consider the names of the properties. We can make them multi-lingual and go beyond what TEI can do. Europeans see TEI as based on English texts (e.g. look at Domenico's objections in Scrittura e filologia nell'era digitale, p. 170). Why should we not call 'italics' 'kursiv' if we are Germans? The entire standard encoding scheme is based on English words and concepts. Why do we have to standardise them when we can just let the users choose what they want to call them? Or they can provide translations for their property names so others can read their markup. So instead of explicit names I propose that we have a table of properties at the start of the list:

1 italics
2 paragraph
3 stage
etc.

Then when we want to use the italics property we just say #1 23 5 – which means 'property 1 (italics) at relative offset 23 of length 5'. Of course the computer handles all this. We never see these values directly, only through their representation on the screen via formatted text, not even when we edit them via the GUI.

Having got the properties into this form we can write a table that provides translations of all the properties in the file into any other languages we choose. And texts marked with '#1' will show up as 'kursiv' for Germans and 'corsivo' for Italians (or المائل for Arabs). TEI can't do this because the English names are burned into the standard.

Editing the text in this form

Each time we edit the underlying text we have to adjust the standoff properties so that they still correspond. But thanks to the use of relative offsets this is easy. After editing the base text the user commits it to the server. The server computes the differences between the old version and the new one. From this we obtain a set of insertions and deletions.

Insertions

If we insert text outside of a property or within a property, we just follow the rules described above in adjusting the relative offsets and property lengths of any lists of properties that describe that text.

Deletions

If we delete a bit of text that completely contains a property we delete that property and adjust the relative offset of the next one in the list. If the property's range is only partly deleted (at the start or at the end) we simply adjust its length and also the offset of the next property in the list.

So, in both cases we can edit the text and its underlying properties quite cleanly.
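
Here is a sketch of the insertion rule in Java (hypothetical classes; the deletion rule is analogous):

import java.util.Arrays;
import java.util.List;

class Prop {
    String name;
    int relOff, len; // offset relative to the previous property's start
    Prop(String name, int relOff, int len) {
        this.name = name; this.relOff = relOff; this.len = len;
    }
}

public class EditAdjust {
    // adjust the list after inserting n characters at absolute position pos
    static void applyInsertion(List<Prop> props, int pos, int n) {
        int abs = 0;
        boolean shifted = false;
        for (Prop p : props) {
            abs += p.relOff; // absolute start of p
            if (!shifted && abs >= pos) {
                p.relOff += n; // the first property at or after the insertion
                abs += n;      // point absorbs the whole shift
                shifted = true;
            } else if (abs < pos && pos < abs + p.len) {
                p.len += n;    // an insertion inside a property extends it
            }
        }
    }
    public static void main(String[] args) {
        List<Prop> props = Arrays.asList(
            new Prop("speech", 2, 30), new Prop("line", 8, 13));
        applyInsertion(props, 0, 121); // insert 121 chars at the very start
        for (Prop p : props)
            System.out.println(p.name + " " + p.relOff + " " + p.len);
        // speech 123 30
        // line 8 13
    }
}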

Publishing digital editions

If we publish version 1 of a text and someone writes a property list for it, and then we change the text and issue edition 2, their properties can easily be adjusted using these procedures. So, on requesting a copy of King Lear, the server informs users that their edition of King Lear is out of date and asks whether they would like to update it. The updates are performed automatically and the old properties now refer to edition 2.

Merging property lists into CorCode

Yet another advantage of relative offsets is the ability to merge lists of properties belonging to different versions. Let's say we have 5 versions of Shakespeare's King Lear. We could define properties like stage, speaker, speech, paragraph, line, italics etc for ranges within each version, but like the text these properties would mostly be the same. Tedious. If we had used absolute offsets the lists of properties would all be different because they would contain different offsets throughout. Just one extra character would change all the absolute offsets from then on, and it would fail to merge. But with relative offsets most of the properties, like the text they describe, will be exactly the same. So we can merge all the property lists into one CorCode to correspond to the one CorTex. And when we apply a new property to one version it will automatically be adopted by all other versions - should we so desire - without having to redefine it for each version separately.

Turning overlapping properties into HTML

To make all these advantages practical we will have to convert a text marked up in this way into HTML for the browser. But how to do it? There is no hierarchical structure left, it's not XML, we can't use XSLT, and yet the target language IS a hierarchy. In fact all this has already been done in eComma. How eComma works internally I don't know, so I'll explain how I would do it.

We can scan the text and its property lists and deduce the hierarchical structure. If a property of type 'line' is always inside a property called 'speech' we can deduce that we might render that in HTML as lines inside speeches, say as <span> inside <p>, i.e. as <p><span>...</span></p>. But often in Shakespeare a line is divided between two speakers. Then we can simply break the line up into two lines, because 'line' is most often contained by the 'speech' property. (If it had been the other way around we would have had to split the <p>...</p> instead.) So we can resolve all cases of overlap and discover hierarchical structure from a simple analysis of the properties.

Translating the properties into HTML tags is also easy. Each HTML file these days comes with a CSS file that tells the browser how to format elements. So if we want to format speech properties specially we provide a CSS rule:

p.speech { text-indent: 5px }

This tells the browser to indent paragraphs of class 'speech' by five pixels. The neat thing about CorCode is that we can reuse this definition to convert speech properties into paragraphs. We just follow the recipe contained in the CSS rule: speeches are p's of class 'speech'. So we don't need XSLT.
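
So a 'speech' property covering, say, the words 'To be, or not to be' would come out simply as (illustrative output only):

<p class="speech">To be, or not to be</p>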

Simplification of XML to properties

When we convert legacy XML files to CorCode format we have to simplify XML elements with their attributes into property names. So <hi rend="italics">word</hi> becomes #1 23 5 (remember we defined '#1' to be italics). So the CSS rules can all be simple and don't have to take account of complex XML attributes. Not all XML elements are as simple as the italics example, but if we provide a list of recipes for how to convert them I think it can be done. For example the TEI coding for a page number looks like this: <pb n="42" ed="1"/>. The "42" is really text and should be represented in the main text with the property 'page-number'. The 'ed' attribute is really a version specification and should be expressed in the CorTex and its versions. So there's nothing left. TEI markup is complex because it is a mixture of every kind of information we want to include that involves text. But we really need to separate out things like anchors to external images into a separate CorCode file that is handled specially in a GUI. When we focus on the actual properties that the text really has, as opposed to programming information, there is not much left to represent.

Friday, February 11, 2011

The death of the angle-bracket

I was pleasantly surprised to learn that the eComma project uses overlapping properties rather than embedded markup to encode humanities texts. This emboldens me to take a similar approach with my rewrite of the MVD-GUI. For a relatively small effort I can transform ugly bits of XML such as <hi rend="italic">word</hi> into the standoff property called italics that applies to a specific range in the text. So to kill off angle brackets for good all I have to do is the following:

  1. Take a TEI text and use my splitter program to split the markup from the text for all the versions of a work. This yields as many versions of plain text as versions of markup.
  2. Simplify the markup: remove attributes by merging them with element names and swapping them for something shorter. And we can have multi-lingual property-names – no need to always use English.
  3. Merge the text of all the versions into a CorTex (MVD) and all the markup into a CorCode. The CorCode is just a list of properties and their ranges in the text, one for each version.
  4. Design 3 Joomla components:
    1. A formatted view of any chosen version, with expanding/collapsing apparatus.
    2. Edit the CorCode. A formatted view of the CorTex+CorCode for the currently chosen version: oft-used markup tags on the right as buttons, the rest as a dropdown list. Either just pressing a button or selecting an item from the dropdown and pressing 'apply' would apply that format to the current selection.
    3. Edit the CorTex. This view is just a text editing box, with possibly an expanding/collapsing apparatus.

That's not too much work, and when it is done users won't have to struggle with complex syntax ever again. In its place will be a set of simple overlapping properties that automatically format themselves into HTML in the browser. And all steps will be reversible: so we can go back to the XML representation at any stage, with no loss of information (hopefully).

Here are some mock-ups of how the user interface would look:

The Combined view

This is partly implemented in the new version (all browsers) and more fully implemented in the old version (markup still embedded, Firefox only).

The CorCode view

We only have to show the properties present in this text. Note the language dropdown menu – this will translate the property names into whatever we provided in the property list. 'clear all' clears all properties from the current selection.

The CorTex view

This is just a plain text edit box, although I have enhanced it with a collapsible apparatus showing textual (not formatting) variants. The user simply edits then clicks 'save'. Carriage returns are not passed on to the display, so they can be added as desired to lay out the text more readably.

Friday, February 4, 2011

Intelligenza artificiale

We had another publication, this time in Intelligenza Artificiale, a journal of the AI*IA (University of Bologna). This is a published version of a conference paper presented by my colleague in Italy several years ago, when we were just starting up this multi-version-document thing. So it's kind of interesting historically. I'm content that what we said then is more or less what we say now. In other words the idea appears to be stable.

Tuesday, January 4, 2011

Drawing stemmas in php

Tree view is part of the MVD-GUI. It shows a phylogenetic tree generated from the MVD variant data. The problem is that it relies on the Phylip package, and so on two commandline C programs that have to be compiled for the architecture of the server. This makes building an installer for MVD-GUI very difficult. Either I have to ensure that the user has a C compiler installed, and that it works on their platform with that code, or I have to add a lot of prebuilt binaries to the download, fattening it out nicely. So I gave up on that route.

Building a Newick Tree

An alternative is to draw the tree directly in php. No one seems to have done this before. The first stage is to draw a text version of the tree. This uses a format called a Newick Tree. It assumes a hierarchy like a true stemma even for an unrooted tree. Here's an example:

((A2a:0.025555,(A2b:0.006838,A2c:0.014863)I5:0.023295)I3:0.078086,(Ba:0.00787,(Bb:0.041552,(D:0.002623,(Ea:0.032508,Eb:0.063992)I15:0.011577)I13:0.025273)I11:0.006917)I9:0.05633,A1:0.027864)I7;

The Phylip package I was using before employs the Fitch-Margoliash method to make the text version of the tree, and Fitch is about the slowest algorithm out there for this kind of work. Neighbour Joining is a much faster technique, but occasionally it produces edges of negative length, and there is no nice way around that. Methods like maximum parsimony are supposed to produce the best trees but they are expensive to run and are character-based (not distance-based like Neighbour Joining). They are also designed for genetic sequences, not plain text data. In the end I settled on a technique called FastME, which doesn't produce negative branch-lengths. I translated their program into Java and am about to incorporate it into nmerge.

Drawing a tree in php

Drawing can be done directly in php using GD, which provides primitive drawing functions for painting into a bitmap image. Since the branch-lengths hold much of the information about the stemma, I elected not to use standard force-directed algorithms, which obliterate the lengths. But I did adapt force-directed layout to improve the roughly drawn tree. Results are still preliminary but it is starting to look reasonable.

The idea is to incorporate this view into a revised version of MVD-GUI that I will soon release, that is:

  1. Fully debugged and works on several popular browsers
  2. Installable in Joomla as a single component

My ambition when that is finished and all the other views have been added (i.e. compare, view, edit, list and variants) is to port it onto the iPad.