Friday, June 27, 2014

TILT makes good progress

Check out the TILT blog. You can follow us on Twitter @bltilt.

Sunday, May 18, 2014

NMergec and Tagore's Golden Boat

Nmergec, the rewrite of nmerge in the C language, has reached a milestone: it managed to merge 11 versions of the Tagore poem সোনার তরী (Golden Boat). Since there are several transpositions and much variation between the versions, there is good reason to think that nmergec now 'works' and that it can handle much longer texts. What it does prove is that the revised program can handle Asian scripts like Bengali with ease. It also promises to deliver other benefits, such as the ability to merge (and hence compare) any number of versions of texts of any size. It is designed, for example, to handle entire novels with rearranged chapters, or texts that exist in thousands of versions (such as the Greek New Testament). For my next test, though, I'll be trying to merge an entire novel, perhaps Joseph Furphy's Such is Life or Conrad's Under Western Eyes, to get an idea of the practical size limits. There is a description of the nmergec program and its advantages on Digital Variants.

The merging time for the 11 versions was just 0.5 seconds, and that is with a ton of debugging code. Once it is stripped down and optimised it should be much faster. Here's one version of the poem:

সোনার তরী

গগনে গরজে মেঘ, ঘন বরষা।
কূলে একা বসে’ আছি, নাহি ভরসা।
রাশি রাশি ভারা ভারা
ধান কাটা হ’ল সারা,
ভরা নদী ক্ষুরধারা
খর-পরশা।
কাটিতে কাটিতে ধান এল বরষা।

একখানি ছোট ক্ষেত আমি একেলা,
চারিদিকে বাঁকা জল করিছে খেলা।
পরপারে দেখি আঁকা
তরুছায়ামসীমাখা
গ্রামখানি মেঘে ঢাকা
প্রভাত বেলা।
এ পারেতে ছোট ক্ষেত আমি একেলা।

গান গেয়ে তরী বেয়ে কে আসে পারে!
দেখে’ যেন মনে হয় চিনি উহারে।
ভরা-পালে চলে’ যায়,
কোনো দিকে নাহি চায়,
ঢেউগুলি নিরুপায়
ভাঙে দু’ধারে,
দেখে’ যেন মনে হয় চিনি উহারে।

ওগো তুমি কোথা যাও কোন বিদেশে!
বারেক ভিড়াও তরী কূলেতে এসে।
যেয়ো যেথা যেতে চাও,
যারে খুসি তা’রে দাও,
শুধু তুমি নিয়ে যাও
ক্ষণিক হেসে
আমার সোনার ধান কূলেতে এসে।

যত চাও তত লও তরণী পরে।
আর আছে?—আর নাই, দিয়েছি ভরে’।
এতকাল নদীকূলে
যাহা লয়ে’ ছিনু ভুলে’
সকলি দিলাম তুলে’
থরে বিথরে
এখন আমারে লহ করুণা করে’।

ঠাঁই নাই, ঠাঁই নাই! ছোট সে তরী
আমারি সোনার ধানে গিয়েছে ভরি’।
শ্রাবণ গগন ঘিরে
ঘন মেঘ ঘুরে ফিরে
শূন্য নদীর তীরে
রহিনু পড়ি’,
যাহা ছিল নিয়ে গেল সোনার তরী।
ফাল্গুন, ১২৯৮।

Friday, March 14, 2014

The slow tide turns...

It is important that there is an open discussion about the forms that digital representations of historical artefacts may take. With this end in view I have prepared an article for the Journal of the Text Encoding Initiative entitled "Towards an Interoperable Digital Scholarly Edition". What most people are using as a shared data format at the moment is the TEI Guidelines. The problem with this is that:

  1. it is tied to a piece of technology, namely XML, rather than being an abstract specification, and
  2. it is based on subjective human judgements and is most definitely not interoperable.

That's a big problem, because it means we can't share our work. For example, currently I can't efficiently edit an edition of anything with a number of colleagues located in various parts of the world. I would have to spend a great deal of time trying to homogenise the codes used to describe features and to force everyone to use them consistently. I also can't cite or annotate that digital work and share that work with others. And I can't reuse someone else's digital scholarly edition, by extending or repurposing it. I think that's a waste of human effort and until we fix the problem this part of the digital humanities is going to suffer. So my paper is about how that problem can be fixed.

The reaction so far has been to adduce two arguments:

  1. We don't really need interoperability, just the ability to record subjective judgements about the text. That's what humanists do, after all.
  2. The interoperability problem can be solved by reducing the tag set so that there is only one way to encode everything.

In response to 1, it is clear that digital humanists working on digital editions have been calling for interoperability for years. Just a glance at the Interedition project makes it clear that this objective is essential rather than optional. It is hard to believe that a consensus will be reached that says that collaboration isn't necessary, but that's what this objection amounts to.

In response to 2, this has already been tried multiple times, e.g. TEI Lite, TEI Tite, Textgrid Baseline encoding, DTA Basisformat, TEI Nudge. For example, the TEI Tite specification says it "is meant to prescribe exactly one way of encoding a particular feature of a document in as many cases as possible, ensuring that any two encoders would produce the same XML document for a source document". But if I see italics in a text and try to encode it with TEI Tite, I can still use <i>, <abbr>, <foreign>, <hi> or <seg> with various rend attributes, <label>, <ornament>, <stage> or <title>. If each tag I choose may differ from the one you choose at the same place in the same document, then the problem is magnified the longer the text goes on. Like human fingerprints, no two transcriptions can ever be the same.
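The arithmetic behind that magnification can be sketched in a few lines of Python. This is only an illustration: the list of nine element names is taken from the paragraph above and ignores the different rend attributes, so it is a lower bound, and the function names are made up for this sketch. The agreement figure assumes that two encoders pick uniformly and independently among the alternatives, which real encoders do not do; the point is only the exponential shape of the growth.

```python
from fractions import Fraction

# Element names listed above as candidates for one italic span in TEI Tite.
# Differences in rend attributes are ignored, so this is a lower bound.
ITALIC_TAGS = ["i", "abbr", "foreign", "hi", "seg",
               "label", "ornament", "stage", "title"]


def possible_encodings(spans: int, choices: int = len(ITALIC_TAGS)) -> int:
    """Number of distinct ways to tag `spans` italic spans."""
    return choices ** spans


def agreement_probability(spans: int, choices: int = len(ITALIC_TAGS)) -> Fraction:
    """Chance that two encoders choose the same tag for every span.

    ASSUMPTION: each encoder picks uniformly and independently among the
    alternatives. This is a simplification made for illustration only.
    """
    return Fraction(1, choices) ** spans


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, possible_encodings(n), float(agreement_probability(n)))
```

Even with only ten italic spans there are already billions of possible encodings under these assumptions, and a real text also contains many other features with their own alternatives.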

The fact is that marking up a text that never had any markup when it was created is a very different proposition from writing one with tags in it from the start. Since hardly anyone but digital humanists does the former, it is easy to overlook this distinction. If I say that something is italic when I write it, that's what it is. If I print it and then someone else marks it up, they may use a code that indicates that my "italics" are really a foreign word, title, emphatic statement or stage direction, etc. That's interpretation, and on a tag-by-tag basis the alternatives (including whether to record the feature at all) are manifold. So tag reduction doesn't work, and for this reason never will.

If you're curious about these arguments, you can read the paper now by following the link above, or wait until volume 7 of the Journal of the TEI comes out in a month or so.