TEI as a Graph
Andreas Kuczera, Academy of Science and Literature, Mainz
TEI is not a format, though many people think it is. It is a de facto standard that specifies guidelines for document interchange. The Guidelines are currently based on XML, but XML is only one possible technical way of expressing the phenomena they describe. In a graph, multi-hierarchical annotation layers can be used. Graph models are also comparatively easy to read and understand, which gives DH people and “normal” scientists a common level of discussion. A graph can be expressed as RDF, so the step from a graph to Linked Open Data is easy to make.
In this paper a small XML example in DTA-Base-Format will be imported into the graph database neo4j and then converted to the Standoff-Property-Json-Format.
In a first step we import a small XML example into a neo4j (https://neo4j.com) instance using the APOC procedure apoc.xml.import (https://github.com/neo4j-contrib/neo4j-apoc-procedures-function).
The example is one folio from the DTA (https://www.deutschestextarchiv.de). The XML test file is available at https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1, and the same folio can be viewed in the DTA version (http://www.deutschestextarchiv.de/patzig_msgermfol841842_1828/11).
The import into neo4j runs with:
// Import XML example from DTA to neo4j
call apoc.xml.import('https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1',
  {connectCharacters: true, charactersForTag: {lb: ' '}, filterLeadingWhitespace: true})
yield node
return node;
Figure 1 shows a snippet of the example in the graph database. The XML file is imported as an XML tree with the root element at the top level. All elements are converted to nodes of type XmlTag and all text to nodes of type XmlCharacter. The hierarchy of the XML is expressed by IS_CHILD_OF, FIRST_CHILD_OF, LAST_CHILD_OF etc. edges connecting these nodes. The serial order of the XML file is expressed by NEXT edges, which make re-exporting the XML possible. In addition, all text nodes are connected by NE edges, which chain the text together without any elements in between. Whitespace becomes a text node of its own. The example shows that importing a DTA-Baseformat XML file keeps all information from the XML version and that re-exporting to XML is possible.
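The graph structure described above can be illustrated without a database. The following is a minimal sketch in Python, not the APOC implementation: the function name xml_to_graph, the in-memory node and edge representation, and the sample sentence are all assumptions made for illustration. It reuses the node labels and edge types named in the text, creates NEXT edges in document order, and does not reproduce the separate whitespace nodes or the text-to-text edges of the real import.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from the DTA folio.
SAMPLE = "<p>Die <del>alte</del><add>neue</add> Welt<lb/>ist</p>"


def xml_to_graph(xml_string):
    """Return (nodes, edges) for an XML string.

    nodes: list of dicts with "id", "label" and "name" or "text"
    edges: list of (source_id, edge_type, target_id) tuples
    """
    nodes = []
    edges = []

    def add_node(label, **props):
        node = {"id": len(nodes), "label": label, **props}
        nodes.append(node)
        return node["id"]

    def visit(element):
        # Nodes are created in document order, so ids reflect seriality.
        tag_id = add_node("XmlTag", name=element.tag)
        child_ids = []
        if element.text:
            child_ids.append(add_node("XmlCharacter", text=element.text))
        for child in element:
            child_ids.append(visit(child))
            if child.tail:
                child_ids.append(add_node("XmlCharacter", text=child.tail))
        # Hierarchy edges point from the child to its parent.
        for child_id in child_ids:
            edges.append((child_id, "IS_CHILD_OF", tag_id))
        if child_ids:
            edges.append((child_ids[0], "FIRST_CHILD_OF", tag_id))
            edges.append((child_ids[-1], "LAST_CHILD_OF", tag_id))
        return tag_id

    visit(ET.fromstring(xml_string))
    # Serial order: every node points to its successor in the document.
    for node_id in range(len(nodes) - 1):
        edges.append((node_id, "NEXT", node_id + 1))
    return nodes, edges


nodes, edges = xml_to_graph(SAMPLE)
for node in nodes:
    print(node)
```

Because the NEXT chain preserves document order and the hierarchy edges preserve nesting, the original XML can in principle be rebuilt from such a structure, which is the property the import relies on for re-export.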
The next step is to export the data with some Cypher to the Standoff-Property JSON format, which can be copied directly out of the neo4j browser window. This JSON can then be imported into SPEEDy, the Standoff Property Editor, which can be found on GitHub (https://github.com/argimenes/standoff-properties-editor).
The README of the SPEEDy GitHub repository contains a link (https://argimenes.github.io/standoff-properties-editor/) to the test instance hosted on GitHub Pages. We have prepared the example in SPEEDy: select „TEI-XML → SPEEDY IV“ in the file section and load the data. Pressing the UNBIND button below the editor window displays the exported JSON in the window underneath.
Figure 2 shows the result of the conversion without any further treatment by hand. The plain text is what remains of the XML file once all elements are deleted, and it is not easy to read. But if you select a part of the text, the corresponding annotations are shown below the editor window, so the semantics are not lost.
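The idea behind this conversion can be sketched in a few lines of Python. This is not the export query used for the paper, and the JSON layout is an assumption: the field names text, properties, type, startIndex, endIndex and attributes are chosen for illustration and may differ from what SPEEDy expects. The functions xml_to_standoff and annotations_at and the sample sentence are hypothetical. The end index is exclusive here, and the characters_for_tag parameter loosely mirrors the charactersForTag option of the import call. Namespaced TEI tags would show up with their namespace URI in the type field.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from the DTA folio.
SAMPLE = "<p>Die <del>alte</del><add>neue</add> Welt<lb/>ist</p>"


def xml_to_standoff(xml_string, characters_for_tag=None):
    """Strip all elements from the XML and keep them as standoff properties."""
    characters_for_tag = characters_for_tag or {}
    text_parts = []
    properties = []
    offset = 0  # current position in the plain text

    def emit(chunk):
        nonlocal offset
        if chunk:
            text_parts.append(chunk)
            offset += len(chunk)

    def visit(element):
        prop = {
            "type": element.tag,
            "startIndex": offset,
            "endIndex": None,  # filled in once the element is closed
            "attributes": dict(element.attrib),
        }
        properties.append(prop)
        # Optional replacement text for a tag, e.g. a space for <lb/>.
        emit(characters_for_tag.get(element.tag))
        emit(element.text)
        for child in element:
            visit(child)
            emit(child.tail)
        prop["endIndex"] = offset  # exclusive end, as in Python slices

    visit(ET.fromstring(xml_string))
    return {"text": "".join(text_parts), "properties": properties}


def annotations_at(document, start, end):
    """Return the types of all properties overlapping text[start:end]."""
    return [
        p["type"]
        for p in document["properties"]
        if p["startIndex"] < end and start < p["endIndex"]
    ]


document = xml_to_standoff(SAMPLE, characters_for_tag={"lb": " "})
print(json.dumps(document, indent=2, ensure_ascii=False))
print(annotations_at(document, 8, 12))
```

In this sample the deleted and the added word end up directly next to each other in the plain text, which is the readability problem described above, while the del and add properties still record which characters belong to which element.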
Further steps will include algorithms that move deleted text into an annotation on the added text, yielding a readable text which can then be annotated further. Another task is developing an export function to XML. An alternative approach would be to refactor the XML inside the graph database in order to get clean standoff data out of it. From my point of view, TEI as a graph can be the next technical step for TEI, giving better support for and linking to Linked Open Data projects and overcoming the one-dimensional restriction of XML.
I want to thank Stefan Armbruster from neo4j for the export Cypher query and for implementing the XML import functions in APOC (https://github.com/neo4j-contrib/neo4j-apoc-procedures-function), and Iian Neill for his work on SPEEDy (https://argimenes.github.io/standoff-properties-editor/).