TEI as a Graph Kuczera Andreas 2019-04-29T22:18:03.513250709 Zentrum für Informationsmodellierung - Austrian Centre for Digital Humanities, Karl-Franzens-Universität Graz Austria GAMS - Geisteswissenschaftliches Asset Management System Creative Commons BY-NC 4.0 2019 Graz o:tei2019.126

Demonstrations tei2019

en TEI Text as a Graph 2019-07-31T22:27:28.450810655

TEI as a Graph

Andreas Kuczera, Academy of Science and Literature, Mainz

TEI is not a format, though many people think it is. It is a de facto standard that specifies Guidelines for document interchange. The Guidelines are currently based on XML, but this is only one possible technical way of expressing the phenomena they describe. In a graph you can use multi-hierarchical annotation layers. Graph models are easy to read and understand, so DH people and “normal” scholars share a common level of discussion. A graph can also be expressed as RDF, so the step from a graph to Linked Open Data is easy to make.

In this paper a small XML example in the DTA base format is imported into the graph database neo4j and then converted to the Standoff Property JSON format; the same toolchain works for any TEI-XML file. (The Standoff Property format is explained in detail in Iian Neill, Andreas Kuczera: The Codex – an Atlas of Relations. In: Die Modellierung des Zweifels – Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten. Hg. von Andreas Kuczera / Thorsten Wübbena / Thomas Kollatz. Wolfenbüttel 2019 (= Zeitschrift für digitale Geisteswissenschaften / Sonderbände, 4). DOI: 10.17175/sb004_008.) The example is the XML export of folio 11 of Gotthilf Friedrich Patzig's notes on Humboldt's Kosmos lecture, accessible in the German Text Archive (Deutsches Textarchiv, DTA). The exported Standoff Property JSON data can then be imported into the Standoff Property Editor SPEEDy, which can manage multi-hierarchical annotations. Standoff formats are well known, but they have some limitations. In particular, the base text (datum) normally must not be changed once annotation has started, because the indexes would be corrupted. In our system annotated documents can still be edited, because the indexes are recalculated when the document is saved.
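Why editing the base text threatens standoff indexes, and how recalculation repairs them, can be illustrated with a minimal Python sketch. It is not SPEEDy's actual algorithm: the Annotation class, the insert_text helper, the inclusive end index and the sample sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    type: str
    start: int  # index of the first annotated character
    end: int    # index of the last annotated character (inclusive, an assumption)


def insert_text(text, annotations, position, new_text):
    """Insert new_text at position and shift annotation offsets accordingly."""
    shift = len(new_text)
    updated = []
    for a in annotations:
        start, end = a.start, a.end
        if position <= start:
            # Insertion before the annotation: move the whole span.
            start += shift
            end += shift
        elif position <= end:
            # Insertion inside the annotation: the span grows.
            end += shift
        # Insertion after the annotation: offsets stay as they are.
        updated.append(Annotation(a.type, start, end))
    return text[:position] + new_text + text[position:], updated


# Hypothetical sample data, not taken from the DTA example.
text = "Humboldt lectured in Berlin"
annotations = [Annotation("person", 0, 7), Annotation("place", 21, 26)]
new_text, new_annotations = insert_text(text, annotations, 9, "often ")
```

Without the recalculation, the "place" annotation would still point at characters 21 to 26 of the edited text and no longer cover the word "Berlin".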

Convert DTA-XML with neo4j to Standoff Property JSON

In a first step we import a small XML example into a neo4j (https://neo4j.com) instance using apoc.xml.import (https://github.com/neo4j-contrib/neo4j-apoc-procedures-function).

The example is one folio from the DTA (https://www.deutschestextarchiv.de). The XML test file is available at https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1, and the DTA version of the folio at http://www.deutschestextarchiv.de/patzig_msgermfol841842_1828/11.

Import into neo4j

The import into neo4j runs with:

// Import xml-example from DTA to neo4j
call apoc.xml.import('https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1', {connectCharacters: true, charactersForTag:{lb:' '}, filterLeadingWhitespace: true})
yield node
return node;

Figure 1: TEI-XML example in neo4j (Kuczera).

Figure 1 shows a snippet of the example in the graph database. The XML file is imported as an XML tree with the root element at the top level. Elements are converted to nodes of type XmlTag and text to nodes of type XmlCharacter. The hierarchy of the XML is expressed by edges such as IS_CHILD_OF, FIRST_CHILD_OF and LAST_CHILD_OF, which connect these nodes. The serial order of the XML file is expressed by NEXT edges, which make re-exporting the XML possible. In addition, all text nodes are connected by NE edges, which chain the text together without any elements in between. Whitespace becomes a text node of its own. The example shows that importing a DTA base format XML file keeps all information from the XML version, so that re-exporting to XML is possible.
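The graph structure described here can be sketched in a few lines of Python. This is a simplified illustration, not the APOC implementation: only the IS_CHILD_OF, NEXT and NE relationships are modelled, the edge direction and the tuple-based node representation are assumptions, and the sample sentence is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_graph(xml_string):
    """Return (nodes, edges) for a small XML fragment.

    nodes: list of (label, value) tuples in document order
    edges: list of (source_index, relationship, target_index) tuples
    """
    nodes, edges = [], []

    def add_node(label, value, parent):
        nodes.append((label, value))
        index = len(nodes) - 1
        if parent is not None:
            edges.append((index, "IS_CHILD_OF", parent))
        return index

    def walk(element, parent):
        index = add_node("XmlTag", element.tag, parent)
        if element.text:
            add_node("XmlCharacter", element.text, index)
        for child in element:
            walk(child, index)
            if child.tail:
                # Text following a child element belongs to the parent.
                add_node("XmlCharacter", child.tail, index)

    walk(ET.fromstring(xml_string), None)

    # NEXT links every node to its successor in document order.
    for i in range(len(nodes) - 1):
        edges.append((i, "NEXT", i + 1))

    # NE links each text node to the next text node, skipping elements.
    text_indexes = [i for i, (label, _) in enumerate(nodes) if label == "XmlCharacter"]
    for a, b in zip(text_indexes, text_indexes[1:]):
        edges.append((a, "NE", b))
    return nodes, edges


# Hypothetical sample fragment, not taken from the DTA file.
nodes, edges = xml_to_graph("<p>Der <hi>Kosmos</hi> ist <del>sehr</del> groß</p>")
ne_edges = [(a, b) for a, rel, b in edges if rel == "NE"]
plain_text = "".join(value for label, value in nodes if label == "XmlCharacter")
```

Following the NE chain yields the running text without markup, while the IS_CHILD_OF edges preserve the element hierarchy.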

Export from neo4j to Standoff Property JSON

Figure 2: TEI as a Graph in the Standoff Property Editor SPEEDy (Kuczera).

The next step is to export the data with a Cypher query to the Standoff Property JSON format; the result can be copied directly out of the neo4j browser window. This JSON can then be imported into the Standoff Property Editor SPEEDy, which can be found on GitHub (https://github.com/argimenes/standoff-properties-editor).

The README of the SPEEDy GitHub repository links to a test instance hosted on GitHub Pages (https://argimenes.github.io/standoff-properties-editor/). We have prepared the example in SPEEDy: select „TEI-XML → SPEEDY IV“ in the file section and load the data. Below the editor window you can press the UNBIND button and inspect the exported JSON in the window below.

Figure 2 shows the result of the conversion without any further manual treatment. The plain text is what remains of the XML file once all elements have been removed, and it is not easy to read. But if you select a part of the text, the corresponding annotations are shown below the editor window, so the semantics are not lost.
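The effect described here, plain text with all elements removed plus annotations that retain their meaning, can be reproduced with a minimal Python sketch. It is not the Cypher export query used in this paper: it works directly on an XML string, and the JSON field names (text, properties, type, startIndex, endIndex), the inclusive end index and the sample fragment are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_standoff(xml_string):
    """Flatten XML into plain text plus offset-based annotations."""
    parts = []        # text fragments in document order
    properties = []   # one entry per element that contains text
    length = 0        # number of characters collected so far

    def append(fragment):
        nonlocal length
        if fragment:
            parts.append(fragment)
            length += len(fragment)

    def walk(element):
        start = length
        append(element.text)
        for child in element:
            walk(child)
            append(child.tail)
        if length > start:
            # The element covered at least one character: record its span.
            properties.append(
                {"type": element.tag, "startIndex": start, "endIndex": length - 1}
            )

    walk(ET.fromstring(xml_string))
    return {"text": "".join(parts), "properties": properties}


# Hypothetical fragment with a deletion followed by an addition.
doc = xml_to_standoff("<p>Der <del>Cosmos</del><add>Kosmos</add> ist groß</p>")
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

In this sketch the deleted and the added word end up side by side in the plain text ("CosmosKosmos"), which mirrors the readability problem visible in Figure 2, while the del and add annotations still record which characters belong to which element.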

Further steps will include algorithms that move deleted text into an annotation on the added text, in order to obtain a readable text which can then be annotated further. Another task is to develop an export function to XML. An alternative approach would be to refactor the XML inside the graph database in order to get clean standoff data out of it. From my point of view, TEI as a Graph can be the next technical step for TEI towards better support for, and linking to, Linked Open Data projects, and towards overcoming the one-dimensional restriction of XML.

I want to thank Stefan Armbruster from neo4j for the export Cypher query and the implementation of the XML import functions in apoc (https://github.com/neo4j-contrib/neo4j-apoc-procedures-function), and Iian Neill for his work on SPEEDy (https://argimenes.github.io/standoff-properties-editor/).