TEITOK – TEI based annotated corpora. Maarten Janssen. Zentrum für Informationsmodellierung - Austrian Centre for Digital Humanities, Karl-Franzens-Universität Graz, Austria. GAMS - Geisteswissenschaftliches Asset Management System. Creative Commons BY-NC 4.0. Graz, 2019. o:tei2019.142

Demonstrations tei2019

Keywords: Corpus search, TEI visualization, Linguistic annotation, Spoken corpora
TEITOK – TEI based annotated corpora

Maarten Janssen

TEITOK is a web-based tool for building, annotating, and distributing corpora in which the corpus files are stored in TEI/XML. It combines the needs of those who want to do detailed philological markup with the requirements of a searchable, annotated linguistic corpus, and is used in a growing number of corpora around the world, primarily historical, spoken, and learner corpora.

With regard to textual mark-up, it allows TEI documents to be visualised directly in a browser, using CSS and JavaScript to render the different TEI elements in a customisable way. It can display facsimile images alongside the text, and has additional display options for specific types of TEI documents, such as a line-by-line visualisation for aligned facsimile transcriptions, and a view with a waveform display for time-aligned audio transcriptions.

With regard to linguistic annotation, it allows TEI documents to be tokenised inline, after which each token can be annotated with information such as POS, lemma, or dependency relations. The tokenised corpus can then be automatically exported as a linguistic corpus for the Corpus Workbench (CWB), making it possible to search the corpus using its expressive query language. Unlike most corpus search interfaces, TEITOK displays TEI/XML fragments in the search results, which therefore retain the full textual mark-up of the source document.
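
The export step can be illustrated with a minimal sketch in Python. It assumes that tokens are marked with a `<tok>` element carrying `pos` and `lemma` attributes; these names, the sample fragment, and the function `to_vertical` are hypothetical and chosen for illustration only. The sketch writes the one-token-per-line, tab-separated "vertical" text that the Corpus Workbench encoder accepts as input, and is not TEITOK's actual export routine.

```python
import xml.etree.ElementTree as ET

# Hypothetical tokenised fragment. The <tok> element and its attribute
# names are assumptions for illustration, not taken from the abstract.
SAMPLE = """<text id="doc1">
<p><tok id="w-1" pos="DET" lemma="the">The</tok>
<tok id="w-2" pos="NOUN" lemma="letter"><hi rend="initial">l</hi>ettre</tok><tok id="w-3" pos="PUNCT" lemma=".">.</tok></p>
</text>"""


def to_vertical(xml_string, attributes=("pos", "lemma")):
    """Return vertical text: one token per line, tab-separated columns."""
    root = ET.fromstring(xml_string)
    # The document becomes a structural region wrapping its tokens.
    lines = ['<text id="%s">' % root.get("id", "")]
    for tok in root.iter("tok"):
        # itertext() also collects text inside nested mark-up such as <hi>.
        word = "".join(tok.itertext()).strip()
        # Missing annotations are written as "_" so every row has
        # the same number of columns.
        columns = [word] + [tok.get(name, "_") for name in attributes]
        lines.append("\t".join(columns))
    lines.append("</text>")
    return "\n".join(lines)


vertical = to_vertical(SAMPLE)
print(vertical)
```

Because the mark-up inside a token is flattened only in the exported copy, the TEI/XML source keeps its full philological detail, which is what makes it possible to show the original fragment in the search results.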

For tokenised corpora, it also allows multiple orthographic realisations to be stored for each token, such as a semi-palaeographic transcription and a regularised orthography, which can in turn be used in the document view to display different editions of the same document. The textual metadata can be used in a number of ways, for instance to display all the documents in the corpus on a world map. The combination of metadata and token-based annotation allows for detailed corpus research on richly annotated documents.
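
How several orthographic realisations of one token can yield several editions of a document may be sketched as follows. The sketch assumes, purely for illustration, that the text content of a `<tok>` element holds the semi-palaeographic transcription and that a hypothetical `nform` attribute holds the regularised spelling; the function `render` and the fallback rule are likewise assumptions and not a description of TEITOK's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element content is the transcription, and the
# assumed "nform" attribute holds a regularised spelling where one exists.
SAMPLE = (
    '<s><tok id="w-1" nform="the">þe</tok> '
    '<tok id="w-2" nform="king">kyng</tok> '
    '<tok id="w-3">and</tok></s>'
)


def render(xml_string, layer=None):
    """Render one edition; layer=None gives the transcription itself."""
    root = ET.fromstring(xml_string)
    words = []
    for tok in root.iter("tok"):
        transcription = "".join(tok.itertext())
        if layer is None:
            words.append(transcription)
        else:
            # Fall back to the transcription when the requested layer
            # is not annotated on this token.
            words.append(tok.get(layer, transcription))
    return " ".join(words)


transcribed = render(SAMPLE)
regularised = render(SAMPLE, "nform")
print(transcribed)
print(regularised)
```

Storing the variants on the token rather than in separate files means that every edition shares the same annotation, so a search on the regularised form can still return the transcribed spelling.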