TEI 2019

What is text, really? TEI and beyond



Introducing an Open, Dynamic and Efficient Lexical Data Access for TEI-encoded Dictionaries on the Internet

Francisco Mondaca, Philip Schildkamp, Felix Rau, Jan Bigalke

Keywords: api, dictionary, framework, kosh, graphql, rest
Slides: https://www.doi.org/10.5281/zenodo.3451535
Permalink: https://gams.uni-graz.at/o:tei2019.158

Most of the TEI-encoded dictionaries in public data repositories are not directly accessible for computational processing. Each application that uses them has to implement its own processing for every single dictionary. In recent decades, direct computational access to data on the Internet has been provided through application programming interfaces (APIs). APIs provide centralized access to data and, if designed and implemented properly, efficient access as well. But API development and maintenance require technical expertise, which can be an obstacle for small and medium-sized dictionary publishers that might not have in-house solutions for this purpose. Against this background we have developed Kosh [ http://kosh.uni-koeln.de ], an open-source framework that processes any XML-encoded dictionary and creates two APIs for accessing the underlying lexical data: a REST API (Fielding 2009) and a GraphQL [ https://graphql.org ] API. The purpose of this presentation is to show how to use Kosh with data publicly available on GitHub, in order to demonstrate how the editing of digitized dictionaries and the compilation of born-digital dictionaries can be supported by efficient access to the underlying data via APIs.

Kosh is an open-source framework developed to access multiple XML-encoded dictionaries. It is generic and flexible, designed to handle dictionaries of different structures and sizes with minimal configuration effort.

Kosh takes XML data as input, parses it, and indexes it into an Elasticsearch [ https://www.elastic.co ] server. A JSON (JavaScript Object Notation) configuration file defines the paths to the elements to be indexed (in XPath 1.0) as well as the Elasticsearch datatypes of the indexed fields, e.g., keyword or text. Finally, a Kosh data module requires a dot file (.kosh) containing the index name, the paths to the XML files to be indexed, and the path to the configuration file. With this information, Kosh indexes one or multiple XML files into one index that is accessed by two APIs, a GraphQL API and a REST API. If the XML source files are modified, the index is updated automatically. Kosh can be deployed via Docker [ https://hub.docker.com/r/cceh/kosh ] or natively on Unix systems, and a single Kosh instance can provide access to multiple dictionaries.
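The indexing step described above can be illustrated with a minimal Python sketch. It is not Kosh's implementation and does not reproduce Kosh's configuration schema: the CONFIG dictionary, its key names, the function names, and the sample entries are all hypothetical and invented for illustration. The sketch only shows the general idea of mapping paths to typed fields, using the limited XPath subset of Python's standard library, and of deriving an Elasticsearch-style mapping from the same configuration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical configuration (NOT Kosh's actual schema): one path per
# field to index, plus the Elasticsearch datatype of that field.
CONFIG = {
    "entry_path": ".//tei:entry",
    "fields": {
        "lemma": {"path": "./tei:form/tei:orth", "type": "keyword"},
        "sense": {"path": "./tei:sense/tei:def", "type": "text"},
    },
}

# Invented sample data, not taken from any real dictionary.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry xml:id="e1">
      <form><orth>alpha</orth></form>
      <sense><def>first sample definition</def></sense>
    </entry>
    <entry xml:id="e2">
      <form><orth>beta</orth></form>
      <sense><def>second sample definition</def></sense>
    </entry>
  </body></text>
</TEI>"""


def extract_entries(xml_text, config):
    """Turn each entry element into a flat dict of the configured fields."""
    root = ET.fromstring(xml_text)
    docs = []
    for entry in root.findall(config["entry_path"], TEI_NS):
        doc = {"id": entry.get(XML_ID)}
        for name, spec in config["fields"].items():
            # Collect the text content of every element matching the path.
            doc[name] = [
                "".join(el.itertext()).strip()
                for el in entry.findall(spec["path"], TEI_NS)
            ]
        docs.append(doc)
    return docs


def build_mapping(config):
    """Derive an Elasticsearch-style mapping body from the field datatypes."""
    return {
        "properties": {
            name: {"type": spec["type"]}
            for name, spec in config["fields"].items()
        }
    }


if __name__ == "__main__":
    for doc in extract_entries(SAMPLE, CONFIG):
        print(doc)
    print(build_mapping(CONFIG))
```

Keeping the paths and the datatypes in a single configuration means that exposing an additional field is a configuration change rather than a code change, which is the property the paragraph above attributes to Kosh's JSON file.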

The GitHub repository Kosh Data [ https://cceh.github.io/kosh_data ] contains several datasets that show the structure of a data module for Kosh. One of them contains the Diccionario Geográfico-Histórico de las Indias Occidentales ó América, a five-volume dictionary compiled by Antonio de Alcedo (De Alcedo 1786, 1787, 1788a, 1788b, 1789), which offers a wide description of American toponyms and also, in its fifth volume, a vocabulary with a pioneering approach to descriptive word usage in the Spanish Americas (Lenz 1905–1910:7f). An XML version of this dictionary has been employed in digital gazetteer projects such as HGIS de las Indias [ http://www.hgis-indias.net ] and later in Pelagios Commons [ http://commons.pelagios.org ].

Based on this XML-encoded version we created a TEI P5 compliant version. This data can be accessed in two ways: first, through Kosh Data [ https://github.com/cceh/kosh_data/tree/master/de_alcedo ], where modifications to the data can be proposed through pull requests; second, via APIs [ GraphiQL: http://kosh.uni-koeln.de/api/de_alcedo/graphql ; Swagger UI (REST): http://kosh.uni-koeln.de/api/de_alcedo/restful ] provided by a Kosh instance deployed with a clone of this repository.
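As a rough illustration of the second access path, the following Python sketch builds a GraphQL request against the endpoint named above using only the standard library. The endpoint URL comes from the text; everything inside the query (the operation name, the entries field, its query argument, and the id and lemma subfields) is a hypothetical placeholder, since the actual schema should be looked up in the GraphiQL interface. The sketch only relies on the common convention of sending a JSON body with "query" and "variables" keys via HTTP POST.

```python
import json
import urllib.request

GRAPHQL_URL = "http://kosh.uni-koeln.de/api/de_alcedo/graphql"

# Hypothetical query: field and argument names are placeholders and
# must be replaced with names from the schema the instance exposes.
QUERY = """
query Lookup($term: String!) {
  entries(query: $term) {
    id
    lemma
  }
}
"""


def build_graphql_request(url, query, variables):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(request, timeout=10):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_graphql_request(GRAPHQL_URL, QUERY, {"term": "example"})
    # Sending requires network access and a query that matches the schema:
    # print(run(req))
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the example inspectable offline; the same payload structure can be reused with any HTTP client once the query has been adapted to the real schema.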

As data modifications are made at the source-file level and the changes are tracked with git [ https://git-scm.com ], the editing process is open and also reversible. In Kosh the publisher defines which fields should be indexed. For the digitization of printed dictionaries this means that direct computational access can be provided even with a coarse-grained encoding. When compiling born-digital dictionaries, a few fields can be made available at an early compilation stage, and as the data gains complexity more fields can be added to the index. This flexible approach to lexical data access makes it possible to unveil datasets that are currently hidden from computer applications and thus from users.

Bibliography

De Alcedo, Antonio. 1786. Diccionario Geográfico-Histórico de las Indias Occidentales ó América. Tomo I. Madrid: Imprenta de Manuel Gonzalez. Available at: https://archive.org/details/diccionariogeogr06alce

De Alcedo, Antonio. 1787. Diccionario Geográfico-Histórico de las Indias Occidentales ó América. Tomo II. Madrid: Imprenta de Manuel Gonzalez. Available at: https://archive.org/details/diccionariogeogr07alce

De Alcedo, Antonio. 1788a. Diccionario Geográfico-Histórico de las Indias Occidentales ó América. Tomo III. Madrid: Imprenta de Manuel Gonzalez. Available at: https://archive.org/details/diccionariogeogr08alce

De Alcedo, Antonio. 1788b. Diccionario Geográfico-Histórico de las Indias Occidentales ó América. Tomo IV. Madrid: Imprenta de Manuel Gonzalez. Available at: https://archive.org/details/diccionariogeogr09alce

De Alcedo, Antonio. 1789. Diccionario Geográfico-Histórico de las Indias Occidentales ó América. Tomo V. Madrid: Imprenta de Manuel Gonzalez. Available at: https://archive.org/details/diccionariogeogr10alce

Fielding, Roy Thomas. 2009. Architectural Styles and the Design of Network-based Software Architectures. Irvine: University of California. Available at: https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf

Lenz, Rodolfo. 1905–1910. Diccionario Etimolójico de las Voces Chilenas Derivadas de las Lenguas Indíjenas Americanas. Santiago: Imprenta Cervantes. Available at: http://www.ub.uni-koeln.de/cdm/ref/collection/mono20/id/6705