TEI 2019

What is text, really? TEI and beyond


Using Microsoft Word for preparing XML TEI-compliant digital editions

Boris Lehečka

Keywords: Microsoft Word, processor, conversion, XSLT
Slides: https://doi.org/10.5281/zenodo.3451430
Permalink: https://gams.uni-graz.at/o:tei2019.166

The paper will introduce electronic editions prepared in the Microsoft Word text processor. The conversion tool was originally developed in 2000 to generate the so-called vertical format for a corpus manager, and in 2008 it was modified to output the XML TEI format for the Manuscriptorium project (National Library of the Czech Republic 2019). To capture structural information (headings, indices, etc.) and the semantics of the individual parts of the edition (notes, readings, etc.), the editors use character and paragraph styles, whose application is described in Černá and Lehečka (2016). The methodology also includes rules for transforming the individual styles into XML according to the TEI P5 Guidelines (The Text Encoding Initiative Consortium 2019). An add-in for Microsoft Word was programmed to help editors apply formatting and add other necessary parts of the edition, e.g. page and line numbers. The paper will focus on the conversion of DOCX documents to the XML TEI P5 format. The conversion, which consists of approximately 60 sequentially applied XSLT transformations (World Wide Web Consortium 2009), is driven by a specialized application programmed in C#. To preserve as much of “the editor’s intent” as possible and to reduce errors resulting from the transformation process (omitted or duplicated text), a tool was created that extracts plain text (divided into basic text and annotations) from both the input document (DOCX) and the target document (XML TEI P5). This extraction requires far fewer XSLT transformations (only 6 stylesheets). The extracted texts are then compared with text comparison tools, e.g. WinMerge (2013): differing text chunks point to problematic passages where the transformation to XML fails. With these tools, approximately 240 editions of Old and Middle Czech texts (from 1300–1800) were prepared over 10 years, and the approach is currently attracting new potential users across institutions.
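The verification idea described above (extract plain text from both the DOCX input and the TEI output, then compare) can be illustrated with a minimal sketch. It is not the author's tool: the original uses 6 XSLT stylesheets and WinMerge, whereas this sketch swaps in Python's standard library (`zipfile`, `xml.etree.ElementTree`, `difflib`). The function names `docx_paragraphs`, `tei_paragraphs` and `differing_chunks` are hypothetical, and the sketch assumes that paragraphs in both formats correspond one to one and ignores the split into basic text and annotations.

```python
import difflib
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespaces of WordprocessingML (DOCX body) and TEI P5.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TEI_NS = "http://www.tei-c.org/ns/1.0"


def _normalize(text):
    """Collapse runs of whitespace so layout differences do not count."""
    return " ".join(text.split())


def docx_paragraphs(docx_bytes):
    """Return the plain text of every non-empty w:p in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over w:t elements inside its runs.
        text = _normalize("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
        if text:
            paragraphs.append(text)
    return paragraphs


def tei_paragraphs(tei_xml):
    """Return the plain text of every non-empty tei:p inside tei:body."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter(f"{{{TEI_NS}}}p"):
        # itertext() flattens inline markup such as <hi> or <note>.
        text = _normalize("".join(p.itertext()))
        if text:
            paragraphs.append(text)
    return paragraphs


def differing_chunks(source, target):
    """List (operation, source_chunk, target_chunk) for every mismatch."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    return [
        (tag, source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result from `differing_chunks(docx_paragraphs(data), tei_paragraphs(xml))` would suggest that no text was omitted or duplicated; a `delete` entry points to a source paragraph missing from the TEI output, and an `insert` entry to text present only in the output.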

References

Černá, A. and Lehečka, B. (2016). Metodika přípravy a zpracování elektronických edic starších českých textů. (The Methodology of the preparation and processing of electronic editions of Old Czech texts.) [pdf] Praha: oddělení vývoje jazyka Ústavu pro jazyk český AV ČR, v. v. i. Available at: <http://vokabular.ujc.cas.cz/soubory/nastroje/Methodics/Metodika_pripravy_a_zpracovani_elektronickych_edic_DF12P01OVV028.pdf> [Accessed 14 May 2019].

National Library of the Czech Republic (2019). Manuscriptorium. Digital Library of Written Cultural Heritage. [online] Available at: <http://www.manuscriptorium.com> [Accessed 14 May 2019].

The Text Encoding Initiative Consortium (2019). TEI P5: Guidelines for Electronic Text Encoding and Interchange. [online] Available at: <http://www.tei-c.org/Guidelines/P5/> [Accessed 14 May 2019].

WinMerge 2.14.0 (2013). Available at: <http://winmerge.org> [Accessed 14 May 2019].

World Wide Web Consortium (W3C) (2009). XSL Transformations (XSLT) Version 2.0. Available at: <https://www.w3.org/TR/2009/PER-xslt20-20090421/> [Accessed 14 May 2018].