TEI 2019

What is text, really? TEI and beyond



Text Graph Ontology. A Semantic Web approach to represent genetic scholarly editions

Peter Hinkelmanns

Keywords: Genetic Scholarly Edition, Semantic Web, Graph Database, Ontology, Interfaces
Slides: https://www.doi.org/10.5281/zenodo.3457076
Permalink: https://gams.uni-graz.at/o:tei2019.172

Peter Hinkelmanns, University of Salzburg

A model of text as a variant graph can support the representation of genetic text editions. The proposed model makes it possible to describe the relations between tokens and their relative dependencies in text genesis. The main focus is on the representation of intradocumentary text revisions. Moreover, the Text Graph Ontology enables the referencing of genetic text editions via the Semantic Web. In addition to this ontology, a converter from and to TEI XML and a web-based viewer and editor will be presented.

Graph, Ontology, Semantic Web, Genetic Text Editions

Text genetic editions are enjoying sustained popularity in the fields of scholarly editing and literary studies. Examples of recent research projects are the Faust edition (Bohnenkamp, Henke, and Jannidis 2016) and the edition of the works of Arthur Schnitzler (Burch et al. 2016), both of which aim at a complete reproduction of the text genesis. These editions require the reconstruction of complex text genetic processes. Which sequence of tokens forms a specific text state? How can differences between versions be described? The extension of the model of the Text Encoding Initiative by elements required for genetic editions was the subject of a working group that presented its results in a draft (Burnard et al. 2010). Parts of this draft have been incorporated into the TEI Guidelines (TEI Consortium 2013). With TEI P5, complex genetic editions can be realized. However, with inline markup, the underlying hierarchical structure makes it difficult to reconstruct and compare text gradients, i.e. the evolutionary stages of a text. It can of course be done using stand-off and out-of-line annotations, as James Cummings has pointed out (Cummings 2018, 13). This concept for a Text Graph Ontology does not aim to be the next paper criticizing the XML foundation of TEI P5, nor is it trying to compete with the interoperability of TEI-encoded texts. The ontology is designed for the specific purpose of implementing a text variant graph using Semantic Web technologies for the encoding of genetic text editions.

Data models for textual representation

Several projects have used graphs to represent textual variation in recent years. They can be divided into methods based on markup and methods focusing on variant graphs. An approach to deal with the problem of overlapping annotations in XML is the proposed data structure ‘General Ordered-Descendant Directed Acyclic Graph’ (GODDAG: Sperberg-McQueen and Huitfeldt 2004, esp. 158). A similar model is the ‘Graph Annotation Format’ (GrAF: Ide and Suderman 2007), which extends the Linguistic Annotation Framework (LAF: Ide and Romary 2006). In GrAF an underlying text is segmented via stand-off nodes, which use the position of characters in the text as reference points. Edges link the annotations with text segments.
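The stand-off principle described here can be illustrated with a minimal sketch. It is not GrAF's actual serialization; the class names `Region` and `Annotation` and the sample labels are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Stand-off segment anchored by character offsets (start inclusive, end exclusive)."""
    start: int
    end: int


@dataclass(frozen=True)
class Annotation:
    """An annotation with one edge pointing to a text region (hypothetical structure)."""
    label: str
    region: Region


def resolve(text: str, annotation: Annotation) -> str:
    """Return the stretch of the unchanged base text an annotation refers to."""
    return text[annotation.region.start:annotation.region.end]


text = "Hello World"
annotations = [
    Annotation("token", Region(0, 5)),
    Annotation("token", Region(6, 11)),
    Annotation("greeting", Region(0, 11)),  # overlaps both tokens without nesting problems
]
segments = [(a.label, resolve(text, a)) for a in annotations]
```

Because the base text is never modified, overlapping annotations can coexist, which is the property that inline XML markup lacks.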

The problem of intradocumentary and intertextual variation is addressed by the ‘data structure for representing multi-version texts’ (Schmidt and Colomb 2009). They criticize the markup approach of models like GODDAG and GrAF (Schmidt and Colomb 2009, 499) and propose a variant graph where the edges contain the segments of text shared by multiple variants:

Fig. 1: A variant graph (Figure by Schmidt and Colomb 2009, 502)

The sigils indicate the different variants of the text between different text carriers. Individual tokens of a specific text carrier are not represented.

Collating texts is the focus of the Stemmaweb project (“The Stemmaweb Project” 2012–; Andrews and Mace 2013). Similar to Schmidt and Colomb 2009, a variation graph is used to represent versions of a text. The texts are segmented into tokens, which form the nodes of the graph. Directed edges show the similarities and dissimilarities between individual text carriers. Undirected edges represent variant relationships such as orthographic, grammatical or lexical variation (Andrews and Mace 2013, 508).

Fig. 2: Screenshot of the ‘Text relationship mapper’ of Stemmaweb, showing an extract of Segment 1 of the Chronicle of Matthew (Figure by “The Stemmaweb Project” 2012–)
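A token-based variation graph of this kind can be sketched as follows. This is a simplified illustration, not Stemmaweb's implementation: it assumes that the witnesses are already aligned position by position, whereas real collation relies on an alignment algorithm. The sigla, sample readings and function names are hypothetical.

```python
from collections import defaultdict

START, END = "#START#", "#END#"


def build_variant_graph(witnesses):
    """Map each directed edge (source, target) to the set of sigla that follow it.

    Nodes are (position, token) pairs, so identical tokens at the same
    position are shared between witnesses (assumption: pre-aligned input).
    """
    edges = defaultdict(set)
    for siglum, tokens in witnesses.items():
        path = [START] + list(enumerate(tokens)) + [END]
        for source, target in zip(path, path[1:]):
            edges[(source, target)].add(siglum)
    return edges


def read_witness(edges, siglum):
    """Recover the token sequence of one witness by following its edges."""
    tokens, node = [], START
    while node != END:
        node = next(t for (s, t), sigla in edges.items() if s == node and siglum in sigla)
        if node != END:
            tokens.append(node[1])
    return tokens


witnesses = {"A": ["the", "quick", "fox"], "B": ["the", "brown", "fox"]}
edges = build_variant_graph(witnesses)
reading_a = read_witness(edges, "A")
```

Shared readings appear as edges carrying both sigla, while the variant position splits into one edge per witness.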

Efer 2016 has comprehensively described the use of graph databases for the text-oriented Digital Humanities. He proposed ‘Kadmos’, a layered graph model for textual representation. His model includes the separation into types and tokens:

Fig. 3: Schematic representation of instance data sets and links of a short example document with minimalistic text data model (Figure by Efer 2016, 76)

The ‘Text as Graph’ model (TAG: Haentjens Dekker and Birnbaum 2017) stores the text in nodes of varying length (Haentjens Dekker and Birnbaum 2017, §3). A token may be split into several nodes and marked up as a word:

Fig. 4: A simplified poem with word tokenization (Figure by Haentjens Dekker and Birnbaum 2017, Fig. 10)

TAG is at the current stage defined as a data model and not as a syntactic representation (Haentjens Dekker and Birnbaum 2017, §2.1).

This very brief overview of selected data models for textual representation has shown that multiple models for the representation of texts as graphs exist. The model proposed in this paper is far from being a stable model and aims to connect the idea of a variation graph with Semantic Web technologies.

Text graph ontology

The Semantic Web makes information accessible in a machine-readable way using standardized vocabularies and ontologies. A distinction is made between ‘nodes’, which can be objects or atomic values, and ‘edges’, which describe the relationship between nodes. A statement always consists of a triple ‘[subject] - [predicate] - [object]’. Semantic Web technologies make it easy to annotate the scholarly genetic edition and to link it to other resources. The Text Graph Ontology uses the Web Ontology Language (OWL: W3C OWL Working Group 2012) to specify classes and properties.
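The triple structure can be made concrete with a small sketch in plain Python. The prefixes `ex:` and `tgo:` and the property `tgo:text` are hypothetical placeholders and are not taken from the published ontology; only `rdf:type` and `rdfs:seeAlso` are standard terms.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """An atomic value, as opposed to a node identified by a name."""
    value: str


# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:token1", "rdf:type", "tgo:Token"),                 # object is a class
    ("ex:token1", "tgo:text", Literal("Hello")),            # object is an atomic value
    ("ex:token1", "rdfs:seeAlso", "ex:dictionaryEntry42"),  # link to another resource
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


types = match(TRIPLES, p="rdf:type")
literals = [o.value for _, _, o in TRIPLES if isinstance(o, Literal)]
```

In practice an RDF library and a triple store would take the place of the list and the `match` helper.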

The first problem that needs to be addressed is the segmentation of a text. There is no agreement between the briefly presented models on what the atomic unit of a text graph is. For the Text Graph Ontology, tokens separated by white space are assumed, with the possibility of extending the model to a sub-token level. A diplomatic transcription and various normalization stages can be attached to a token as a string or generated from a separate character graph.
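A minimal sketch of this segmentation step, assuming a hypothetical `Token` class and a single, deliberately trivial normalization stage named `lowercase`, might look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A whitespace-delimited token with its diplomatic form and optional normalizations."""
    index: int
    diplomatic: str
    normalized: dict = field(default_factory=dict)  # stage name -> normalized string


def tokenize(transcription: str) -> list:
    """Split a transcription on runs of white space and number the tokens."""
    return [Token(i, form) for i, form in enumerate(transcription.split())]


# Invented sample transcription; the double space is collapsed by split().
tokens = tokenize("Dv bist  min")
for token in tokens:
    # The 'lowercase' stage is only a stand-in for a real normalization stage.
    token.normalized["lowercase"] = token.diplomatic.lower()
```

Keeping normalizations in a mapping allows further stages to be attached to a token without changing its diplomatic form.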

The proposals for the annotation of text revisions of the Grazer Editionsphilologie are tailored to the needs of mediaeval editions. Hofmeister-Winter 2016 presents a categorization of text revision phenomena. She distinguishes between self-revisions, i.e. interventions in one's own text, and external revisions, which describe the interventions of another hand (Hofmeister-Winter 2016, 10). Self-revisions can be a direct component of text production (immediate revision) or take place at a later point in time (late revision). In her opinion, third-party revisions, on the other hand, take place exclusively in a later revision step as a late revision. Furthermore, the following typology is established by the Graz project (Hofmeister, Böhm, and Klug 2016, 22):

  • eradication by bleaching, deletion, blackening, expansion
  • transformation by overwriting, addition, reduction
  • insertion in all described positions (interlinear, linear, marginal) after eradication by deletion or bleaching, with instruction signs in different shapes, single or paired, as well as gap filling after precautionary recess

To represent these revisions, an edge-weighted directed acyclic graph is used. The limitations of RDF make it necessary to construct weighted edges as individual nodes. The base model therefore consists of three node classes: Tokens, Connectors and Borders (the empty start and end nodes of a graph):

Fig. 5: ‘Hello World’ as a weighted text graph in RDF
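The three node classes can be sketched as plain triples. Since an RDF triple cannot carry a weight, each weighted edge becomes a Connector node of its own. The property names `tgo:from`, `tgo:to`, `tgo:weight` and `tgo:text` are hypothetical and may differ from the published ontology.

```python
def connector(cid, source, target, weight):
    """Describe one weighted edge as a Connector node with three properties."""
    return [
        (cid, "rdf:type", "tgo:Connector"),
        (cid, "tgo:from", source),
        (cid, "tgo:to", target),
        (cid, "tgo:weight", weight),
    ]


graph = [
    ("ex:start", "rdf:type", "tgo:Border"),
    ("ex:hello", "rdf:type", "tgo:Token"),
    ("ex:hello", "tgo:text", "Hello"),
    ("ex:world", "rdf:type", "tgo:Token"),
    ("ex:world", "tgo:text", "World"),
    ("ex:end", "rdf:type", "tgo:Border"),
]
graph += connector("ex:c1", "ex:start", "ex:hello", 1)
graph += connector("ex:c2", "ex:hello", "ex:world", 1)
graph += connector("ex:c3", "ex:world", "ex:end", 1)


def value(triples, subject, predicate):
    """Return the object of the first triple with this subject and predicate."""
    return next(o for s, p, o in triples if s == subject and p == predicate)


def outgoing(triples, node):
    """List (connector, weight, target) for every Connector leaving `node`."""
    return [
        (s, value(triples, s, "tgo:weight"), value(triples, s, "tgo:to"))
        for s, p, o in triples
        if p == "tgo:from" and o == node
    ]


from_hello = outgoing(graph, "ex:hello")
```

Every weighted edge costs four triples in this sketch, which is the price of modelling edge weights as nodes.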

The weight represents the relative order of edges from one token to the next. A substitution would therefore be represented as follows:

Fig. 6: ‘Hello World Graz’ as a weighted text graph in RDF

The path following the lowest weight is the first version of the text; the path following the highest weight is the last. The reconstruction of a particular text state can be described as a path through the text. Deletions and additions of text can be seen in the graph accordingly:

Fig. 7: Transformation, addition and deletion in a weighted text graph
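Reading a text state as a path can be sketched for the simpler substitution of ‘World’ by ‘Graz’. The sketch assumes that the weight is compared locally at each node, which is one possible reading of the rule stated above; the `Connector` class and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """A weighted edge between two nodes of the text graph."""
    source: str
    target: str
    weight: int


TEXT = {"hello": "Hello", "world": "World", "graz": "Graz"}
CONNECTORS = [
    Connector("START", "hello", 1),
    Connector("hello", "world", 1),  # original reading
    Connector("hello", "graz", 2),   # later substitution
    Connector("world", "END", 1),
    Connector("graz", "END", 1),
]


def read_path(connectors, choose):
    """Walk from START to END, picking one outgoing Connector per node.

    `choose` is min for the first version and max for the last version.
    """
    node, tokens = "START", []
    while node != "END":
        options = [c for c in connectors if c.source == node]
        node = choose(options, key=lambda c: c.weight).target
        if node != "END":
            tokens.append(TEXT[node])
    return " ".join(tokens)


first_version = read_path(CONNECTORS, min)
last_version = read_path(CONNECTORS, max)
```

Because the graph is acyclic, the walk always terminates at the end Border.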

A conversion of this small graph to TEI P5 is possible:

<del>Hello</del> <subst><del>World</del><add>Graz</add></subst> <add>TEI</add>
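The two outer text states can also be read back from this TEI fragment, which is a useful check for any round trip between graph and markup. The following sketch is not the converter presented in this paper; the wrapper element `<root>` is an assumption, and real TEI documents would additionally require namespace handling.

```python
import xml.etree.ElementTree as ET

# The fragment from above, wrapped in a hypothetical root element.
FRAGMENT = (
    "<root><del>Hello</del> <subst><del>World</del><add>Graz</add></subst>"
    " <add>TEI</add></root>"
)


def text_state(element, skip):
    """Concatenate all text below `element`, leaving out elements named `skip`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip:
            parts.append(text_state(child, skip))
        parts.append(child.tail or "")  # text after the child belongs to the parent
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
# First state: ignore additions. Last state: ignore deletions.
first_state = " ".join(text_state(root, "add").split())
last_state = " ".join(text_state(root, "del").split())
```

Only the first and the last state can be recovered this way; intermediate stages need explicit stage information.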

To mark specific stages of a text, each Connector references the text stage or stages to which it belongs:

Fig. 8: Text stages realised in a variant text graph
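Stage references on Connectors can be sketched as follows. The three stages and their order (‘Hello World’, then ‘Hello Graz’, then ‘Graz TEI’) are invented for illustration, as are the class and attribute names; the ontology itself may model the reference differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """An edge that is valid in one or more text stages."""
    source: str
    target: str
    stages: frozenset


TEXT = {"hello": "Hello", "world": "World", "graz": "Graz", "tei": "TEI"}
CONNECTORS = [
    Connector("START", "hello", frozenset({1, 2})),
    Connector("hello", "world", frozenset({1})),
    Connector("world", "END", frozenset({1})),
    Connector("hello", "graz", frozenset({2})),
    Connector("graz", "END", frozenset({2})),
    Connector("START", "graz", frozenset({3})),
    Connector("graz", "tei", frozenset({3})),
    Connector("tei", "END", frozenset({3})),
]


def read_stage(connectors, stage):
    """Follow only the Connectors that reference the requested stage."""
    node, tokens = "START", []
    while node != "END":
        node = next(
            c.target for c in connectors
            if c.source == node and stage in c.stages
        )
        if node != "END":
            tokens.append(TEXT[node])
    return " ".join(tokens)


stages = {n: read_stage(CONNECTORS, n) for n in (1, 2, 3)}
```

Unlike a purely weight-based reading, explicit stage references make intermediate states addressable.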

The same variant graph method can be used to describe a token further on character level:

Fig. 9: Graph on character level

This short article shows that Semantic Web technologies are suitable for representing text variant graphs. The main benefit of using RDF is the interconnectivity with other Semantic Web resources. For example, different text carriers of one text could be transcribed by different projects and easily be linked with each other. Annotations on a text can directly point to authority files and vice versa. Tools based on the model are a converter to and from TEI P5 and a viewer/editor based on Flask. The ontology and the tools will be published on GitHub. A challenge of this graph approach that has not yet been solved is rule-based validation.

Peter Hinkelmanns, Senior Scientist, University of Salzburg: Middle High German Conceptual Database, peter.hinkelmanns@sbg.ac.at

Peter Hinkelmanns is a senior scientist at the Middle High German Conceptual Database of the University of Salzburg. His research interests include Middle High German lexicography, graph technologies in the digital humanities and historic linguistics.

References

Andrews, T. L., and C. Mace. 2013. “Beyond the Tree of Texts: Building an Empirical Model of Scribal Variation Through Graph Analysis of Texts and Stemmata.” Literary and Linguistic Computing 28 (4): 504–21. doi:10.1093/llc/fqt032.

Bohnenkamp, Anne, Silke Henke, and Fotis Jannidis. 2016. “Johann Wolfgang Goethe: Faust: Historisch-Kritische Edition.” 2. Beta-Version. Accessed April 26, 2017. http://beta.faustedition.net/.

Burch, Thomas, Stefan Büdenbender, Kristina Fink, Vivien Friedrich, Patrick Heck, Wolfgang Lukas, Kathrin Nühlen et al. 2016. “Text[ge]schichten: Herausforderungen Textgenetischen Edierens Bei Arthur Schnitzler.” In Textgenese Und Digitales Edieren: Wolfgang Koeppens „Jugend“ Im Kontext Der Editionsphilologie, edited by Katharina Krüger, 87–105. Editio / Beihefte, 40. Berlin, Boston: de Gruyter.

Burnard, Lou, Fotis Jannidis, Elena Pierazzo, and Malte Rehbein. 2010. “An Encoding Model for Genetic Editions.” Accessed March 25, 2019. http://www.tei-c.org/Activities/Council/Working/tcw19.html.

Cummings, James. 2018. “A World of Difference: Myths and Misconceptions About the TEI.” Digital Scholarship in the Humanities 6:i63. doi:10.1093/llc/fqy071.

Efer, Thomas. 2016. “Graphdatenbanken Für Die Textorientierten E-Humanities.” Dissertation, Universität Leipzig.

Haentjens Dekker, Ronald, and David J. Birnbaum. 2017. “It’s More Than Just Overlap: Text as Graph.” In Proceedings of Balisage: The Markup Conference 2017, edited by B. T. Usdin, Deborah A. Lapeyre, James D. Mason, C. M. Sperberg-McQueen, and Norman Walsh. Balisage Series on Markup Technologies. Rockville, Maryland: Mulberry Technologies, Inc.

Hofmeister, Wernfried, Astrid Böhm, and Helmut W. Klug. 2016. “Die Deutschsprachigen Marginaltexte Der Grazer Handschrift UB, Ms. 781 Als Interdisziplinärer Prüfstein Explorativer Revisionsforschung Und Editionstechnik.” Editio 30 (1): 14–33. doi:10.1515/editio-2016-0002.

Hofmeister-Winter, Andrea. 2016. “Beredte Verbesserungen.” Editio 30 (1): 1–13. doi:10.1515/editio-2016-0001.

Ide, Nancy, and Laurent Romary. 2006. “Representing Linguistic Corpora and Their Annotations.” In Proceedings of the Fifth Language Resources and Evaluation Conference: LREC 2006, edited by Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, and Daniel Tapias. Genoa.

Ide, Nancy, and Keith Suderman. 2007. “GrAF: A Graph-Based Format for Linguistic Annotations.” In Proceedings of the Linguistic Annotation Workshop: Held in Conjunction with ACL 2007, edited by Association for Computational Linguistics, 1–8. https://www.aclweb.org/anthology/W07-1501. Accessed July 30, 2019.

Schmidt, Desmond, and Robert Colomb. 2009. “A Data Structure for Representing Multi-Version Texts Online.” International Journal of Human-Computer Studies 67 (6): 497–514. doi:10.1016/j.ijhcs.2009.02.001.

Sperberg-McQueen, C. M., and Claus Huitfeldt. 2004. “GODDAG: A Data Structure for Overlapping Hierarchies.” In Digital Documents: Systems and Principles, edited by Peter King and Ethan V. Munson, 139–60. Berlin, Heidelberg: Springer.

TEI Consortium. 2013. “TEI P5: Guidelines for Electronic Text Encoding and Interchange.” Version 3.6.0. Accessed July 30, 2019. https://www.tei-c.org/Vault/P5/3.6.0/doc/tei-p5-doc/en/html/.

“The Stemmaweb Project: Tools and Techniques for Empirical Stemmatology.” 2012–. Accessed July 30, 2019. https://stemmaweb.net/.

W3C OWL Working Group. 2012. “OWL 2 Web Ontology Language: Document Overview (Second Edition).” W3C Recommendation 11 December 2012. Accessed June 08, 2018. https://www.w3.org/TR/owl2-overview/.