TEI 2019

What is text, really? TEI and beyond


All PapersTEI

Getting Along with Relational Databases

Martin Holmes

Keywords: TEI and non-XML technologies, TEI and beyond: interactions, interchange, integrations and interoperability, TEI environments and infrastructures
Slides: https://zenodo.org/record/3450658
Permalink: https://gams.uni-graz.at/o:tei2019.102

Both relational databases and XML have strengths and weaknesses as data storage and modelling systems. Most researchers working with Humanities historical and literary data would argue for the superiority of XML, since it allows unlimited nesting, linking, and complexity. RDB proponents claim superior querying and processing speed, although recent advances in XML languages and tools have eroded that advantage.

Nevertheless, RDBs remain popular, and many researchers seem instinctively to prefer them. Most DH programmers have encountered researchers who know little about databases or data modelling, but are nevertheless convinced that what they need and must have for their project is a database. Databases are somehow compelling and attractive in a way that XML is not. Perhaps the familiarity of tabular data representations is comforting; maybe forcing data into constrained representations seems to constitute mastering it somehow.

So, sometimes against our better judgement or advice, a project may end up with both an RDB and an XML document collection, and programmers must then integrate these distinct forms of data when building project outputs. This presentation discusses the Digital Victorian Periodical Poetry (DVPP) project, where metadata about 15,000 poems from nineteenth-century periodicals is captured in a MySQL database, and periodically exported to create a TEI file for each poem. Many of the poems are then transcribed and encoded. The canonical source of metadata is the RDB, while the canonical source of textual data is the TEI file. Metadata in the TEI files must be periodically updated from the RDB, without disturbing the textual encoding. Changes to the RDB data may result in changes to the id and filename of the related TEI file, so any existing TEI data is migrated to a new file, and the SVN repository must be appropriately updated. All of this is done with XSLT and Ant.

Background

Relational databases (RDBs) and XML are both mature technologies that have been in common use for decades. It is arguable that they arise out of the same roots. Early work on data storage and modelling in the 1960s gave rise to IBM’s mainframe database management system IMS, which represented data in the form of hierarchical trees. C.J. Date’s (1991) classic An Introduction to Database Systems has an Appendix devoted to IMS which describes it in terminology that would be familiar to any XML encoder. IMS even addresses the perennial issue of overlapping hierarchies, by allowing A secondary data structure which is still a hierarchy, but a hierarchy in which participant segments have been rearranged, possibly drastically; in other words, it allows for multiple hierarchies over the same dataset. However, beginning with the work of E.F. Codd in the 1970s and the rise of SQL, the relational database model familiar today became dominant, and remained so until the relatively recent popularity of NoSQL approaches.

Both relational databases and XML have strengths and weaknesses as data storage and modelling systems. Most researchers working with Humanities historical and literary data would argue for the superiority of XML, since it allows unlimited nesting, linking, and complexity. RDB proponents claim superior integrity constraints, querying, and processing speed, although recent advances in XML languages, database engines and tools have somewhat eroded those advantages.

Nevertheless, RDBs remain popular, and many researchers seem instinctively to prefer them. Most digital humanities programmers have encountered researchers who know little about databases or data modelling, but are nevertheless convinced that what they need and must have for their project is a database. Databases are somehow compelling and attractive in a way that XML is not. Perhaps the familiarity of tabular data representations is comforting; maybe forcing data into constrained representations seems to constitute mastering it somehow; or perhaps the tendency to gather initial data in the early stages of a project using spreadsheets, for want of a better tool, encourages conception of data (especially metadata) in terms of columns and rows. Whatever the reason, in one way or another, sometimes against our better judgement or advice, a project may end up with both an RDB and an XML document collection, and programmers must then integrate these distinct forms of data when building project outputs.

Approaches to integrating RDB and XML data have normally taken the form of storing XML data into RDB fields, and then providing some level of richer access to that data through the use of XPath or XQuery (see Bertino and Catania for a useful overview). This is the approach taken by the ReMetCa team (González-Blanco and Rodríguez 2015): XML fragments representing verse (not full documents) are stored in text fields in a relational database, and the relationships between them are modelled using the RDB schema. However, such an approach is far from ideal; González-Blanco and Rodríguez describe some of the limitations and frustrations they encountered in modelling the poetic structure of the verse in their database; they struggled with a complex model of relationships among those components which are very difficult to represent in a database, and they conclude that the E-R model is inappropriate for this purpose due to its center-based structure, with the entities of poem, line, and stanza in the middle of its referential domain of study (para 8). Gibson (2012) describes a similar scenario with mixed RDB and XML data, and how he used Saxonʼs SQL extension functions to overcome the problem.

However, storing XML data in RDB fields is suboptimal. Most serious encoding projects make use of version-control systems such as Git or Subversion, for very good reasons: in a project with many transcribers and encoders, where multiple waves of encoding and annotation may be applied to each document, it is essential to maintain a detailed revision history which makes it possible to recover any previous incarnation of any document, and to track the revisions made to specific parts of the document by specific encoders.

The Digital Victorian Periodical Poetry project

This presentation will focus on the integration of RDB and XML data in the Digital Victorian Periodical Poetry project. This project began life many years ago as a pure-metadata project, capturing information about tens of thousands of poems that appeared in British periodicals during the nineteenth century. At that time, an RDB system seemed a natural and sufficient tool for the job, so a MySQL database, along with a data-entry interface, was set up for the researchers, and data collection proceeded rapidly (figure 1). However, after some years the project gained an additional research focus, and, more recently, funding from the Social Sciences and Research Council of Canada, to transcribe and encode a subset of these poems; we are focusing primarily on the decade years (1820, 1830, 1840 and so on through to 1900), and we expect to encode around 2,000 poems. Meanwhile, indexing of the much larger dataset continues.

A record in the relational database.
A record in the relational database.

Our long-term plan is for the entire dataset to be in the form of TEI XML files, but for the next few months, data will continue to be added to the RDB system, since we have good methods and protocols for this, as well as trained research assistants who are used to working with it. We are now also well into our encoding process, and for that we need to generate individual TEI files for each poem, and store them in Subversion.

In this hybrid project, the canonical source of metadata for the poems is the RDB, while the canonical source of textual data is the TEI XML files. To build and test the project outputs, we need to generate TEI files for every poem, whether or not it has, or will have, an encoded transcription. The metadata stored in the TEI files must be periodically updated based on the RDB, without disturbing any of the textual encoding or the additional metadata in the TEI header relating specifically to the encoding (responsibility statements, rendition elements, category references, and so on). Changes to the RDB data may result in changes to the id and filename of the related TEI file, so any existing TEI data must be migrated to a new file, and the SVN repository must be appropriately updated. The presentation will describe how this process is accomplished safely without loss of data, using a system based primarily on Apache Ant and XSLT (#flowchart).

A simple representation of the metadata integration process.
A simple representation of the metadata integration process.
Normally, we run the database integration process only on a small subset of the data at one time; for example, we may refresh the metadata in all the poems from a specific periodical in a specific year, in preparation for the transcription/encoding team starting work on that year.

By the end of 2019, we plan to eliminate the relational database entirely. Although it is a convenient tool for collecting metadata while working through large numbers of periodicals, its limitations are constantly frustrating; every day we encounter situations in which something relatively trivial to encode in TEI would require substantial modification to the structure and complexity of the database. For example, degrees of uncertainty about the identity of an author, or about whether two pseudonyms represent the same person, can easily be expressed in TEI, but require additional joining tables in the database. Similarly, some poems claim to be translations but are probably not, and their translators are probably their authors. Ambiguities such as this are difficult to handle in a categorical system such as an RDB, but they are the bread and butter of TEI encoding.

Use of the hashtag system for more flexible database entries.
Use of the hashtag system for more flexible database entries.

Meanwhile, we live with the database, and devise new cunning strategies to make it more bearable. The constantly-shifting requirements of the metadata team, as they encounter new features and unexpected exceptions in the incoming data, have led to the development of an ad-hoc system based on hashtag fields, where researchers can use Twitter-style hashtags in free text notes fields to capture things that would otherwise require database extension or modification. Hashtags are defined in a separate table, and new ones can be added at will, but their use is policed by a diagnostic process that identifies any instances of hashtags in the notes fields which are not in the hashtag table, thereby avoiding the proliferation of typographical errors or the use of undefined hashtags. The hashtags are themselves transformed into a taxonomy in the TEI data, and from there they are integrated into the project schema as values for target attributes on catRef elements. #hashtags shows the hashtag table, along with an instance of a hashtag in use in a pooem record. The Notes field above it shows more evidence of data that really needs rich encoding for titles, dates, names and so on, but must, for the moment, be handled with plain text.

Bibliography

Bertino, E., and B. Catania. 2001. Integrating XML and databases. IEEE Internet Computing (5:4) 84–88. .

Date, C.J. 1991. An Introduction to Database Systems. Vol. 1. 5th Edition. Reading, Mass.: Addison-Wesley.

Gibson, Matthew. 2012. Using XSLT’s SQL Extension with Encyclopedia Virginia. Code{4}lib Journal 16. .

González-Blanco, Elena, and José Luis Rodríguez. 2015. ReMetCa: A Proposal for Integrating RDBMS and TEI-Verse. Journal of the Text Encoding Initiative, Issue 8. . DOI : 10.4000/jtei.1274.