In what is arguably his most widely known work, Voltaire describes the extraordinary journey that his eponymous hero undertakes through geography and understanding, and for us digitising the novel is the first step on the long and – we hope and trust – exciting journey to digitise the whole of the complete works, the Œuvres complètes de Voltaire (OCV). As such it has been a proof of concept, a baptism of reassuringly gentle fire, and a taste of things to come.
For a digital file that’s worth its bytes we need much more than just electronic words. We need a format that will encode structure and meaning so that people and – just as importantly – programs can understand the extra information we’re embedding into the file, and use it to help make readers’ and scholars’ use of the material richer and easier.
Thankfully many others have trodden a similar path. Since the 1980s countless digital humanities minds have contributed to the Text Encoding Initiative, simultaneously a sophisticated tag set for marking up scholarly material, and a community engaged in maintaining that model, supporting the people who use it, and improving it based on collective experience, wisdom, and usage. We had no need to invent a wheel – TEI is beautifully adapted for our journey. We used it to design a tailored model to suit the particular needs of the OCV and Digital d’Holbach. This is being applied for us by our supplier, Apex CoVantage, who are assembling a specialist team and developing automated tools to streamline the workflow, using the first dozen volumes to train both people and software. Candide was their introduction to this fascinating marriage of the Enlightenment and the computer.
The structural tagging – for things like introductions and notes – will allow readers to see as much or as little detail and complexity as they wish, ranging from, at one end of the scale, just the edited version of Voltaire’s words to, at the other, the full panoply of editorial introduction, notes, bibliographic citations, and textual variants, with every gradation in between. It will also help readers navigate through and across the various parts of the volume, enabling their own particular journey.
Tagging for meaning – what we call semantic tagging – is what allows the dataset to communicate within itself, with other datasets, and with humans. It’s what can make search fully useful rather than just a literal echo of what a user types, and it can help a reader see a wider range of ‘next steps’ by making meaningful connections beyond those possible with just words and spaces. We tag people, places, dates, works, and institutions, and we are also developing a full set of metadata to accompany the datasets, as a rich and consistent layer describing the entire corpus in disciplined detail – we aim for this to be our contribution to the semantic web. We tag for primary and secondary content, and every piece of text has a language code associated with it, so that if machine translation were applied to the dataset we could choose which parts of an edition are translated (e.g. the introduction) and which are left in the original language (e.g. quotations from primary content). Again, our work enables control and choice.
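To make the idea concrete, here is a minimal sketch of how a program might exploit that semantic tagging. The element names (persName, placeName, xml:lang) are standard TEI, but the fragment itself is invented for illustration – the real OCV files are far richer and use the TEI namespace, which is omitted here for brevity.

```python
# Illustrative sketch: extracting semantically tagged entities and
# selecting content by language code from a simplified TEI-style
# fragment. The fragment is invented, not drawn from the OCV files.
import xml.etree.ElementTree as ET

fragment = """<div>
  <p xml:lang="fr">Il y avait en <placeName>Westphalie</placeName>, dans le
    château de monsieur le baron de
    <persName>Thunder-ten-tronckh</persName>, un jeune garçon nommé
    <persName>Candide</persName>.</p>
  <note xml:lang="en">On Voltaire's invented barony, see the introduction.</note>
</div>"""

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
root = ET.fromstring(fragment)

# Collect every tagged person and place, wherever they occur.
people = [el.text for el in root.iter("persName")]
places = [el.text for el in root.iter("placeName")]

# Language codes let a pipeline decide what to translate: here we pick
# out only the elements whose content is editorial English.
english_parts = [el.tag for el in root.iter() if el.get(XML_LANG) == "en"]

print(people)         # ['Thunder-ten-tronckh', 'Candide']
print(places)         # ['Westphalie']
print(english_parts)  # ['note']
```

With entities and language codes in place, the same file can feed an index of names, a gazetteer of places, or a translation pipeline that leaves the French primary text untouched.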
These two aspects turn a dataset into something akin to a machine (with the metadata as the auxiliary power unit), with multiple interlocking components that make it much easier for readers to summon or suppress the parts of the edition they need.
A machine needs precision in its gears and smoothness in its moving parts, and digitisation is revealing the odd snag and missing bolt where the tools we now have to analyse the workings were not available forty years ago. The exercise is therefore an opportunity to collate points we might wish to address in a revised edition (as well as revealing the occasional typographic error). But overall it’s gratifying how the abstract model we designed ahead of any full-scale digitisation has proved to be fit for purpose, and allows us to interrogate and improve the digital Candide by program, benefits which will increase exponentially as more volumes are added to the electronic corpus. The whole, we think, will be very much greater than the sum of its parts.
While the ultimate consumers of the digital files we’re creating will be human readers, the immediate consumers, as intermediaries, will be machines and processes, and even a cursory look at the ‘raw’ file of Candide shows you why. Character for character there is much more tagging than text, and for the eye simply to read the novel is near impossible; we keep tripping over indexing, line breaks, page breaks, emphasis, witness references … the list of tags is seemingly endless. What we see is ‘noise’ since we’re not programmed to filter one thing from another, but a program can be told to do exactly that, allowing any amount of filtering, cross-referencing, formatting, and even transformation to render the volume exactly as a reader requires. To ensure simplicity, allow richness, and enable choice, we have to start from complexity.
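The filtering described above can be sketched in a few lines. The tag names below are TEI-style, but the fragment and the filtering rules are invented for illustration, assuming a simplified, namespace-free encoding – a sketch of the principle, not the production pipeline.

```python
# Minimal sketch of machine filtering: the same tagged file rendered
# either as plain reading text or with the editorial apparatus kept.
# The fragment and tag choices are invented for illustration.
import xml.etree.ElementTree as ET

fragment = ("<p>Candide, chassé du paradis terrestre"
            "<note>A biblical echo; see the introduction.</note>,"
            " marcha longtemps.<pb n='12'/></p>")

def reading_text(elem, suppress=("note", "pb")):
    """Recursively gather text, skipping suppressed apparatus tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in suppress:
            parts.append(reading_text(child, suppress))
        parts.append(child.tail or "")  # text after a child still belongs
    return "".join(parts)

root = ET.fromstring(fragment)
plain = reading_text(root)                   # reader's view: no apparatus
full = reading_text(root, suppress=("pb",))  # keep notes, drop page breaks

print(plain)  # Candide, chassé du paradis terrestre, marcha longtemps.
```

The same data, two renderings: changing the `suppress` set is all it takes to summon or suppress parts of the edition, which is exactly the control and choice the tagging is designed to enable.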
Digitisation and the accompanying process of metadata curation is all about preserving content, extending reach, and adding value. If we get this right, we should be laying the foundations for globally accessible tools of immense richness which will add to – and not detract from – the core material and scholarship on which it is all built. We have a responsibility to use the digital tools available to help as many people as possible find, read, and understand the extraordinary legacy of Voltaire and his contemporaries. Il faut cultiver nos données.
– Dan Barker, dancan Ltd.