Adjective-Adverb Interfaces in Romance

Open Access Database

1. Project description

Open Access Database Adjective-Adverb Interfaces in Romance (2017-2020) is a collaborative project between the Department of Romance Studies (https://romanistik.uni-graz.at/de/) and the Centre for Information Modelling – Austrian Centre for Digital Humanities (ACDH, https://informationsmodellierung.uni-graz.at/en/), both located at the University of Graz. During its first years, from 2017 to 2019, the project was funded by the pilot program Open Research Data of the Austrian Science Fund (FWF: ORD 66-VO). Martin Hummel, the project leader, and Katharina Gerhalter, both from the Department of Romance Studies, coordinated the data collection and the development of the linguistic categories for the annotation. Gerlinde Schneider and Christopher Pollin, both from the ACDH, oversaw the data modelling, including the annotation tool and the processing and display of the data. The OA database is now hosted and managed at the ACDH under the linguistic supervision of Katharina Gerhalter. The ACDH guarantees the long-term sustainability of the data, as well as the integration of new corpora.

2. Research context and goals

2.1. Research context

The project is situated in the larger context of the work promoted since 2002 by the Research Group on Adjective-Adverb Interfaces in Romance, led by Martin Hummel at the Department of Romance Studies (https://adjective-adverb.uni-graz.at/en/). The research focuses on linguistic phenomena related to adjectives with adverbial functions: adjective-adverbs, such as Spanish and Portuguese ver claro / French voir clair / Italian vedere chiaro / Romanian a vedea chiar/clar ‘to see clear(ly)’; adjectives used as discourse markers, such as Spanish cierto ‘true’; and adverbial prepositional phrases including adjectives, for example Portuguese de novo ‘again’ and Spanish en serio ‘seriously’. Some research topics also contrast these forms with adverbs ending in -mente (e.g. the Portuguese triplet claro – claramente – às claras ‘clearly’). The data offered by the corpora is analyzed in several publications from a synchronic or diachronic point of view, as well as cross-linguistically (see about > bibliography).

2.2. Goals

Research experience has brought to light problems concerning open access, sustainable storage and efficient usage of the analysed linguistic data. The labour-intensive updating of the databases can be guaranteed sustainably only if (i) institutions rather than individuals ensure access and (ii) international standards are created and implemented. As the research group cooperates with several international partners, who use and add data, the data should be tagged in such a way that idiosyncratic solutions are reduced to a minimum.

Therefore, the objective of the project Open Access Database - Adjective-Adverb Interfaces in Romance was the creation of a sustainable Open Access corpus that can be (re-)used according to international standards. The project is aimed at documenting historical as well as present-day data stemming from Romance languages. Previously analysed and partially tagged subcorpora were updated (Fr_A_DHAA, Sp_AP_SH3 and Sp_A_CDH), and newly tagged data – including a Latin dataset – was added by the project team and its cooperation partners (2017-2021).

2.3. Possibilities

Across the subcorpora, adverbs are uniformly and comprehensively annotated and lemmatised using the same annotation model (see “Annotation Model” for a description of the linguistic categories). This makes it possible, for instance, to parse several corpora simultaneously with the same set of commands or to fine-tune syntactic or semantic search criteria via a search interface (see “About > How to use the corpus?”). The collected and annotated linguistic research data is openly accessible and reusable via the menu item “Corpora”. In the future, the corpus may be enlarged with new datasets using the same categorizations. To this end, the annotation tool (a Word template) can be downloaded and used to tag new examples (see 3.1.).

3. Data Modelling, Data Formats and Open Access

From a technical perspective, the activities in the AAIF project were organized as follows:

3.1. Enhancement and update of the datasets

The existing annotation model was extended with additional relevant categories, the existing corpora were revised accordingly, and entirely new annotated corpora were added to the dataset. For this purpose, the research group’s established annotation tool had to be extended and updated. The tool was originally developed as a simple Visual Basic-based template for Microsoft Word and was expanded into a user-friendly add-on over the course of the project. It includes an automated conversion into the project-specific TEI P5 subset. The tool has been tested and successfully used for annotation by internal and external project partners. It is provided under an open license and is available for download via GitHub (https://github.com/zimgraz/aaif).

The existing TEI P5 subset also had to be adapted and extended to include the new categories. During this stage, a TEI schema as ODD and RNG was created (http://gams.uni-graz.at/o:aaif.odd).

3.2. Provision of the data and metadata under an open license in standardized formats

For all corpora, comprehensive metadata is provided via a TEI Header (Version P5), covering editorial, descriptive, administrative, analytical, and statistical metadata. Additionally, a CMDI metadata record is provided over an OAI-PMH interface (https://gams.uni-graz.at/oaiprovider/?verb=ListRecords&metadataPrefix=oai_cmdi_tei). Data objects stored in the GAMS repository are assigned a basic Dublin Core metadata record.
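
As an illustration of how the OAI-PMH interface could be harvested, the following minimal Python sketch builds a ListRecords request and reads record identifiers from a response. The endpoint and the metadata prefix are taken from the URL above; the function names and the sample response (including its identifiers and token) are hypothetical and only serve to show the shape of the protocol. No network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://gams.uni-graz.at/oaiprovider/"  # endpoint named in the text

# Hypothetical, hand-written response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:o:sample.1</identifier></header></record>
    <record><header><identifier>oai:example.org:o:sample.2</identifier></header></record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_cmdi_tei", resumption_token=None):
    """Build a ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token
```

A harvester would typically fetch the first URL, parse the response, and repeat with the returned resumption token until none is returned.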

In addition to the TEI P5 format, the annotations are available as RDF/XML.

Full-text access to all corpora compiled in the project is provided through a dedicated, user-friendly web interface.

3.3. Semantic representation of the annotation model as well as of the individual corpora

The categories of the Annotation Tool are based on an Annotation Model, which was expanded and fine-tuned during the project. The model was transferred to a domain-specific ontology expressed in OWL, which not only serves as a reference model within the research group but also forms the basis for data retrieval and drives the parameterizable query interface. As part of a defined ingest workflow using XSLT pipelines, RDF/XML data based on the ontology is generated from the TEI data. This data is also available for download via the repository.
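
The idea of deriving RDF statements from annotated TEI can be sketched as follows. This is not the project's XSLT pipeline: it swaps XSLT for plain Python with the standard library, emits N-Triples lines rather than RDF/XML for brevity, and rests on assumptions. The ontology and data namespaces, the class name AdjectiveAdverb, the lemma property, and the use of the lemma and ana attributes on w elements in the sample are all hypothetical, since the actual TEI subset and ontology IRIs are not given here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical namespaces; the project's real IRIs are not stated in the text.
ONT = "https://example.org/aaif/ontology#"
DATA = "https://example.org/aaif/data/"

# Hypothetical TEI fragment; the real encoding may differ.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<s xml:id="ex1"><w>ver</w> <w xml:id="ex1.w2" lemma="claro" ana="#AdjectiveAdverb">claro</w></s>
</body></text></TEI>"""


def tei_to_ntriples(tei_xml):
    """Turn annotated <w> elements into N-Triples lines (one statement each)."""
    root = ET.fromstring(tei_xml)
    triples = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        token_id, ana = w.get(XML_ID), w.get("ana")
        if not token_id or not ana:
            continue  # only identified, annotated tokens become resources
        subject = f"<{DATA}{token_id}>"
        # The annotation category becomes the RDF type of the token.
        triples.append(f"{subject} <{RDF_TYPE}> <{ONT}{ana.lstrip('#')}> .")
        lemma = w.get("lemma")
        if lemma:
            escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{ONT}lemma> "{escaped}" .')
    return triples
```

For the sample fragment, the annotated token yields two statements: one assigning its category and one recording its lemma. The unannotated token ver is skipped.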

3.4. Relaunch of the search interface

In addition to providing open data in different standardized formats, the project also aimed to replace the proprietary legacy search interface with a more performant, openly accessible corpus search. To this end, the RDF data is stored in a Blazegraph triple store, and specific SPARQL queries were created for querying the corpora. The search interface was developed as a client-side web application, for which existing JavaScript code and libraries were reused and adapted.
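
To give an impression of how search criteria might be turned into a SPARQL query, here is a minimal Python sketch. It is not the project's query code (the actual interface is a client-side JavaScript application). The endpoint URL, the aaif prefix, the lemma property and the category name are hypothetical placeholders. The request URL follows the SPARQL 1.1 Protocol convention of passing the query in a query parameter.

```python
import re
from urllib.parse import urlencode

# Hypothetical values; the real endpoint and ontology IRIs are not given in the text.
ENDPOINT = "https://example.org/blazegraph/sparql"
ONT = "https://example.org/aaif/ontology#"


def escape_literal(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(lemma, category=None, limit=50):
    """Build a SELECT query for tokens with a given lemma and optional category."""
    patterns = [f'?token aaif:lemma "{escape_literal(lemma)}" .']
    if category:
        # Accept only simple local names so user input cannot alter the query.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", category):
            raise ValueError(f"invalid category name: {category!r}")
        patterns.append(f"?token a aaif:{category} .")
    body = "\n  ".join(patterns)
    return (
        f"PREFIX aaif: <{ONT}>\n"
        f"SELECT ?token WHERE {{\n  {body}\n}}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urlencode({"query": query})
```

Because each criterion only adds a triple pattern, the same function can serve simple and fine-grained searches alike, which is the general idea behind a parameterizable query interface.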

3.5. Long-term preservation of project data

For long-term preservation, all research data was transferred into the digital repository GAMS (https://gams.uni-graz.at). GAMS is an OAIS-compliant asset management system based on the open-source software Fedora Commons, certified with the CoreTrustSeal and registered in the Registry of Research Data Repositories (https://www.re3data.org/). The repository builds upon a web service-based (SOAP, REST), platform-independent and distributed system architecture, a largely XML-based content strategy, support for XML-based import and export standards, and the use of standardized data and metadata formats. GAMS uses Handles (http://www.handle.net/) for the persistent identification of archived objects. Accordingly, Handles are assigned to all corpora created in the project.