Friday, December 13, 2013

An all-in-one installer for AustESE

There is now an installer for the AustESE publishing system. I am painfully aware of the missing bits, as listed in the readme, but I prefer to focus on the positives. You can run the installer on Ubuntu and you get something that can manage the creation of a digital scholarly edition. And what it addresses are the perennial problems of interoperability and overlap. This is a general system that does what users asked for, not what I think it should be. It's what we have all been working on for 18 months, and we will keep improving it until it is finished and perfected. Check it out, see if you can get it going, and tell me what you think. I'm listening.

Saturday, August 3, 2013

Those pesky hyphens

One of the first problems I ever had to deal with in our Wittgenstein edition was what to do with the line-endings. It seems such a simple problem: an author writes or types a text, and hits the return key or starts a new line whenever he/she runs out of space. But this means that the author may hyphenate words over line-breaks. It is these particular line endings, and not those introduced automatically afterwards by software, that the scholarly editor seeks to preserve.

In print or on screen lines are usually reflowed to make up the full line length of the edition. New hyphens not written by the author may thus be introduced. But remember, the sources were, if prepared correctly, recorded using the author's hyphens. If a word was hyphenated naturally, say the word 'sudden-ly', it is straightforward for software to restore the original word, 'suddenly', and even to indicate that in the original a hyphen occurred there (perhaps to be revealed later via a stylesheet that breaks lines as in the original). But what about a line-break at a hyphen? This happens quite frequently, perhaps 10% of the time: e.g. 'dog-flesh', 'ram-paddock'. A hyphen is such a convenient place to break the line that authors frequently avail themselves of this opportunity.

Hard vs soft hyphens

Ah, but now you see the dilemma. The correct restoration of 'ram-[new-line]paddock' is not 'rampaddock' but 'ram-paddock'. Humans recognise the difference at once, and hence don't even bother to record the difference between such 'hard' hyphens and the 'soft' hyphen in 'sudden-ly'. But computers are a bit stupid. My guess is that most digital scholarly editions don't even consider this problem, choosing either to ignore all hyphens at line-end or to keep them all. Admittedly hard hyphens are mostly an Anglo-French phenomenon (e.g. grand-mère), but they are also common in some Italian words, e.g. gonna-pantalone, Milano-Roma etc., and are quite rare in Germanic languages.

Strategy 1

Recording the hyphens means that you have to display the transcript exactly as it was typed or written -- limiting the ways that the text can be redisplayed on, say, small screens. Alternatively you can have a stylesheet hide all the hyphens at line-end. But that will get about 10% of the cases wrong.

Strategy 2

Deleting the hyphens and joining up the words means a lot of hard work if they have already been recorded, and in any case distorts the text. But the advantage is that the text can be reflowed to fit the window on whatever device it is displayed on and the hyphens are always right, since only the hard hyphens are left. But then we can't get the original soft hyphens or line-breaks back if we need them.

Solution

There has to be some way to encode the hyphens and yet display them or hide them on request, without manually entering the distinction between hard and soft hyphens. If we have a dictionary and find that 'ram' and 'paddock' are both present but 'rampaddock' is not, then the correct restoration must be 'ram-paddock'. On the other hand 'sudden' is a word, but 'ly' probably isn't; at any rate 'suddenly' is in the dictionary, so we can work out that the correct restoration is 'suddenly'. Making this work for all imports of new files, recorded in either XML or plain text, has taken me some time, but the addition of 'intelligent hyphens' to our software gives us the edge over our rivals.
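Here is a minimal sketch of that decision rule, using a plain word set in place of a real spelling library. The class and method names are invented for illustration; this is not the actual implementation:

    import java.util.Set;

    public class Dehyphenator {
        // Decide whether a hyphen at line-end is 'hard' (part of the word, as in
        // 'ram-paddock') or 'soft' (only there because the line ran out of space).
        static boolean isHardHyphen(String first, String second, Set<String> dict) {
            String joined = (first + second).toLowerCase();
            // If the joined form is a word, the hyphen was only a line-end break
            if (dict.contains(joined))
                return false;
            // If both halves are words but the joined form is not, keep the hyphen
            if (dict.contains(first.toLowerCase()) && dict.contains(second.toLowerCase()))
                return true;
            // Neither test succeeded: default to soft
            return false;
        }

        public static void main(String[] args) {
            Set<String> dict = Set.of("ram", "paddock", "sudden", "suddenly");
            System.out.println(isHardHyphen("ram", "paddock", dict));  // true: keep 'ram-paddock'
            System.out.println(isHardHyphen("sudden", "ly", dict));    // false: restore 'suddenly'
        }
    }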

I used GNU's aspell library, which has numerous dictionaries, so we can dehyphenate almost any living language. The only snag is variant spellings. For example, 'ram-piddick' is a hyphenated word in Joseph Furphy, but 'piddick' is not in the English dictionary, so the hyphen will be mis-recognised as soft. Similar problems will occur with authors from the 16th century like Shakespeare. My solution in such cases is simply to allow the editor to change it back to hard manually. Alternatively, if that is too much work, a custom dictionary could be compiled and added to aspell. However, the success rate even with the Furphy texts is close to 100% just using ordinary dictionary lookup, so I don't think there is too much to be worried about and a lot to be satisfied with.

Addendum

I had the idea that a list of exceptions might overcome the residual problems. The default behaviour would then work fine in most cases, but if you wanted to preserve a hyphen at line-end even when the dictionary suggests you don't need it -- for example if an author consistently wrote 'scape-goat' instead of 'scapegoat' -- you could still do so.
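Extending the earlier sketch, the exception list is just one more check made before the dictionary lookup (again, the names are invented for illustration):

    import java.util.Set;

    public class DehyphenatorWithExceptions {
        static boolean isHardHyphen(String first, String second,
                                    Set<String> dict, Set<String> exceptions) {
            String hyphenated = (first + "-" + second).toLowerCase();
            if (exceptions.contains(hyphenated))
                return true;   // editor-declared hard hyphen, e.g. "scape-goat"
            String joined = (first + second).toLowerCase();
            if (dict.contains(joined))
                return false;  // joined form is a word: soft
            return dict.contains(first.toLowerCase()) && dict.contains(second.toLowerCase());
        }
    }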

Tuesday, May 28, 2013

The vicious life-cycle of the digital scholarly edition

Most digital scholarly editions (DSEs) are stored on web-servers as a collection of files (images and text), database fields and tables in binary form. The structure of the data is tied closely to the software that expects resources to be in a precise format and location. So moving a digital scholarly edition to another web-server, which probably runs a different content management system, different scripting languages and a different database, is usually considered either impossible or too expensive. And sooner or later, due to changes in the technical environment caused by automatic updates of the server, languages and tools on which it depends, the DSE will break. That usually takes about one to two years. That's fine if there is still money to fix it, but more than likely the technicians who created it have moved on, the money has run out, and pretty soon that particular DSE will die. This vicious lifecycle of the typical DSE has already played itself out on the Web countless times and everybody knows it.

But XML will save us, won't it?

I hear people saying: 'but XML will save us. It is a permanent storage format that transcends the fragility of the software that gives it life.' Firstly, XML just puts those special data structures that the software needs to work into a form that people (I mean programmers) can read. It doesn't change anything of the above. Yes, XML itself as a metalanguage is interoperable with all the XML tools, but the languages that are defined using it, if not rigidly standardised, are as numberless as the grains of sand on the beach, and are just as tied to the software that animates them as earlier binary formats.

Escaping the vicious cycle

Secondly, a digital scholarly edition is much more than a collection of transcriptions and images. There is simply no way to make the technological part of a DSE interoperable between disparate systems. But what we can do is make the content of a DSE portable between systems that run different databases and, to some extent, different content management systems. This also gives the DSE a life outside its original software environment and leaves behind something for future researchers. What we have recently done in the AustESE project is to create a portable digital edition format (PDEF), which encapsulates the 'digital scholarly edition' in a single ZIP file (a small packaging sketch follows the list) containing:

  1. The source documents of each version in TEXT and MVD (multi-version document) formats. The MVD format makes it easy to move documents between installations of the AustESE system, and the TEXT files record exactly the same data in a more or less timeless form.
  2. The markup associated with the text. This is in the form of JSON (JavaScript Object Notation), which is now supplanting XML for the sending and storage of data in many web applications. Several layers of markup are possible for each text version, and these can be combined to produce Web pages on demand.
  3. The formats of the marked-up text. These define, using CSS, different renditions of the combined text+markup.
  4. The images that may be referred to by the markup.
  5. (In future) the annotations about the text which are common to all the versions to which they apply.
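To give an idea of how little machinery the format needs, here is a rough sketch of writing such an archive with the standard Java ZIP classes. The entry names and layout are invented for the example; they are not the actual PDEF layout:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.zip.*;

    public class PdefWriter {
        public static void main(String[] args) throws IOException {
            // Hypothetical component files of one small edition
            String[] entries = {
                "text/chapter1.txt",     // plain-text version
                "mvd/chapter1.mvd",      // multi-version document
                "markup/chapter1.json",  // standoff markup layer
                "css/default.css",       // rendition format
                "images/page1.png"       // facsimile referred to by the markup
            };
            try (ZipOutputStream zip = new ZipOutputStream(
                    Files.newOutputStream(Paths.get("edition.pdef.zip")))) {
                for (String name : entries) {
                    zip.putNextEntry(new ZipEntry(name));
                    zip.write(Files.readAllBytes(Paths.get(name)));
                    zip.closeEntry();
                }
            }
        }
    }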

This allows something I don't think anyone else can do yet: download a DSE, send it to another installation, and have them upload it so that it works 'out of the box'. We need to think in those terms if we are to get beyond the experimental stage in which many digital scholarly editions currently seem stuck. Otherwise we run the risk of becoming an irrelevance in the face of massive and simplistic attempts to digitise our textual cultural heritage by Project Gutenberg and Google. We need much more than what these services offer. We need a space in which we can play out on the Web the timeless activities of editing, annotation and research into texts -- what we call 'scholarship'. The only way to do that is to have 'a thing' that we call a 'digital scholarly edition'.

Saturday, May 18, 2013

Faster than a speeding bullet

I have just completed the move from one type of NoSQL database (CouchDB) to another (MongoDB). The speed increase is exhilarating, though I haven't measured exactly how large it is. The comparisons are not cached; they are computed on the fly, and the speed is the fruit of good design and careful implementation. As I said all along, conventional approaches based on XML are just too inefficient and inadequate. This design, based on multi-version documents and hosted on a Jetty/Java service with a C core, is faster than any other such service I know of, and it is more flexible. There is also the new C version of nmerge yet to add, which should increase capacity and speed further. Try out the test interface for yourself.

Sunday, February 24, 2013

Multi-encoding text-merging

MVDs now support any text encoding. I've used ICU, IBM's library for conversions to and from Unicode, and it's very good. So texts may now be added to an MVD from any textual source; all you do is specify the encoding. Internally, merging is done on 16-bit UTF-16, not 8-bit UTF-8 any more. I don't believe any rival text-comparison program can do this. This is a big improvement over the previous version of nmerge.
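The decoding step itself is simple once the encoding is known. This is only an illustration of the idea in Java (whose strings are UTF-16 internally), not the nmerge code, which uses ICU directly:

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.*;

    public class Decode {
        public static void main(String[] args) throws IOException {
            // args[0] = file to read, args[1] = its encoding, e.g. "ISO-8859-1" or "Big5"
            byte[] raw = Files.readAllBytes(Paths.get(args[0]));
            String text = new String(raw, Charset.forName(args[1]));
            // 'text' is now UTF-16 internally, whatever the source encoding was
            System.out.println(text.length() + " UTF-16 code units read");
        }
    }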

Tuesday, February 5, 2013

Comparing Chinese and Bengali texts

Extending multi-version documents (MVDs) to properly support languages like Chinese and Bengali, which use 16-bit characters, turns out to be easier than I thought. Currently the nmerge tool, which produces MVDs, works only with 8-bit bytes internally, so that individual characters may be split over several bytes, as in UTF-8 encoding. Things get complicated whenever differences are detected between parts of characters. Making everything 16-bit will facilitate the comparison of texts in any living language and avoid such complications (unless you want to compare dead languages like Phoenician or Lydian, and even then UTF-16 can encode them). I don't have any Chinese examples, but my friends in India have provided me with some interesting Bengali texts, which I'll be using for testing.
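A two-line check shows why byte-level comparison is risky: a single Bengali letter occupies three bytes in UTF-8 but only one 16-bit code unit, so a byte-level diff can cut straight through a character. (The sample character below is just an illustration.)

    import java.nio.charset.StandardCharsets;

    public class CharWidth {
        public static void main(String[] args) {
            String bengaliA = "\u0985"; // Bengali letter A
            System.out.println(bengaliA.getBytes(StandardCharsets.UTF_8).length); // 3 bytes in UTF-8
            System.out.println(bengaliA.length());                                // 1 UTF-16 code unit
        }
    }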

Tuesday, January 8, 2013

Hritserver 0.2.0 released

I've made an early release of hritserver 0.2.0. The version number reflects my rough feeling that this is about 20% finished. That doesn't sound like much, but most of the unfinished part is in the services it will eventually perform. The basic infrastructure of hritserver -- the merging, formatting and importing facilities -- is closer to 90% complete.

There's a mixed import dialog in this release that should be able to import any TEI-Lite document. The import process can be configured in several ways, or you can just follow the defaults. The result is always supposed to "work". The stages for XML files are:

  1. XSLT transform of the XML sources. The default transform fixes some anomalies in the TEI data model that make it hard to convert into HTML (a code sketch of this step follows below). Or you can substitute your own stylesheet to do anything you like. In TEI-Lite this step also splits the input into the main text and any embedded <note>s and <interp>s.
  2. The versions within each XML file (add/del, sic/corr, abbrev/expan, app/rdg etc.) are all split into separate files. This is done safely by first splitting the document into a variant graph wherever a valid splittable tag is found, and then writing the graph out as N separate files.
  3. The individual files are then separated into their plain text and remaining markup. The markup may be in several files, such as a separate one for page divisions.
  4. The markup files and the plain text files are merged into CorCode and CorText multi-version documents.
  5. The CorCodes and the CorTexts are then stored in the database.

Imported files should then appear in the Home tab of the Test interface.
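Stage 1 above is a standard XSLT transformation. A minimal version of that step in Java looks like the sketch below; the stylesheet and file names are placeholders, not the ones hritserver actually uses:

    import java.io.File;
    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class Stage1Transform {
        public static void main(String[] args) throws TransformerException {
            // Compile the (placeholder) stylesheet and apply it to the TEI source
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("default-tei-fix.xsl")));
            t.transform(new StreamSource(new File("source-tei.xml")),
                        new StreamResult(new File("stage1-output.xml")));
        }
    }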

Installation

You can try downloading version 0.2.0 if you use a Mac; I'll get around to supporting other platforms presently. To download version 0.2.0 you should use git. If you don't have it you can install it easily via Homebrew:

brew install git

Then download the latest hritserver code:

git clone https://github.com/HRIT-Infrastructure/hritserver.git

That creates a folder "hritserver" in the current directory. Then you should run the installer:

cd hritserver
sudo ./install-macosx.sh

And it should work. This version will be tested and gradually improved. The advantage of using git is that you can easily update to the latest version by typing

git pull

in the hritserver directory.