Adjective-Adverb Interfaces in Romance

Open Access Database

How do I use the corpus?

1. How do I search for examples?

There are two ways of accessing the examples and annotations: via the Search Interface and via the menu item Corpora.

1. The Search Interface enables queries for lemmata (in a lemma-search field) and/or for morphosyntactic and semantic categories. First choose one subcorpus or combine several subcorpora (even from different languages, if desired). To search for a lemma from the “adverb” category, type in the underlying adjectival base (for example, the Portuguese lemma claro returns examples of claro, às claras and claramente). The criteria for both lemmatization and the annotated categories are explained in the Annotation Model. A list of the lemmata in each subcorpus and the number of corresponding examples can be found via the menu item Corpora. A lemma search may be combined with a search for specific morphosyntactic or semantic categories. It is also possible to search only for categories and leave the lemma field blank. Whereas each example contains at least one annotated adverb, the optional preposition, verb and subject categories are only annotated if they are present and relevant for the adverb in a given example phrase (see Annotation Model).

The Search Interface also allows filtering for bibliographic metadata: region, year or period, author, written or spoken (the Portuguese Corpus is the only one that contains spoken examples).

The table of results provided by the Search Interface shows the examples (adverbs in a short context phrase, for example: Spanish alto in the context phrase eran tan jóvenes y hablaban tan alto) and the corresponding bibliographic source. The table can be sorted by any column (for example, alphabetically by adverb or chronologically by year). The annotations of each example can be displayed by clicking on the annotated word/item.

The full-text search field searches the plain text of the short context phrases shown in the table of results. The extended context of the example can be found via the menu item Corpora.

2. The menu item Corpora lists all subcorpora separately. Each subcorpus can be read as a full text. Depending on the design of each corpus (cf. metadata of the given subcorpus), some are available as whole texts (i.e. the whole source corpus), whereas others contain only paragraphs or longer sentences in addition to the short context shown via the Search Interface. The short context phrases are marked in yellow; each annotated example is tagged, for example, [a: ] for adverbs, [p: ] for prepositions and [v: ] for verbs. The specific morphosyntactic and semantic annotations (cf. Annotation Model for a description of the categories) for each example can also be displayed by clicking on the word. In this corpus view, the search field “fulltext search” targets the whole plain text.
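The tagging scheme described above can be illustrated with a minimal Python sketch. It assumes that a tagged item appears in copied plain text as a tag letter, a colon and the word inside square brackets (for example [a: alto]). This rendering is an assumption and not a documented export format, and the function name extract_tagged_items is hypothetical. Only the three tags named above are handled.

```python
import re

# Assumption: a tagged item looks like "[a: alto]" in plain text.
# Only the tags named in this FAQ (a, p, v) are recognised here.
TAG_PATTERN = re.compile(r"\[(?P<tag>[apv]):\s*(?P<text>[^\]]*)\]")

TAG_NAMES = {"a": "adverb", "p": "preposition", "v": "verb"}


def extract_tagged_items(text):
    """Return (category, word) pairs for every tagged item found in text."""
    return [
        (TAG_NAMES[match.group("tag")], match.group("text").strip())
        for match in TAG_PATTERN.finditer(text)
    ]


# The Spanish context phrase from this FAQ, with tags added by hand for illustration.
sample = "eran tan jóvenes y [v: hablaban] tan [a: alto]"
for category, word in extract_tagged_items(sample):
    print(category, word)
```

The detailed morphosyntactic and semantic annotations are not part of this bracket notation; they are available in the downloadable files.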

2. How do I export/download examples?

The examples and the different subcorpora can be downloaded in three formats (XML/TEI, XML/RDF and Excel):

1. Via Search Interface: The table of results can be downloaded as an Excel file consisting of the examples (short example phrases) and the bibliographic information as shown in the table of results. Since the Search Interface permits combining different subcorpora, this Excel file may combine examples from different subcorpora and possibly from different languages.

2. Via Corpora / Full text: each subcorpus can be downloaded separately and in its entirety. The TEI files contain the whole corpus text, the annotations and the metadata, whereas the Excel and RDF files contain all the short example phrases of a given subcorpus and the corresponding annotations. These Excel files are more detailed than the Excel files created in the Search Interface, as they include all annotated categories in separate columns. This enables filtering or sorting by category (as in the Search Interface) and the manual addition of new categories/columns or comments for each example, if needed. Bear in mind that the different subcorpora were created by different linguists and that the categorization of the examples may differ slightly. Some examples may be ambiguous; in other cases, the user might prefer a different analysis. For these purposes, the annotations in the Excel file can be modified individually by the user.
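The filtering and the addition of user-defined columns described above can also be scripted. The following is a minimal Python sketch that assumes the Excel file of a subcorpus has been saved as CSV beforehand. The column names (example, adverb, lemma, user_comment) and the rows are invented placeholders for illustration; the actual column headers of the export may differ.

```python
import csv
import io

# Placeholder data standing in for an export saved as CSV.
# Column names and rows are assumptions, not the real export layout.
EXPORT_CSV = """example,adverb,lemma
(example phrase 1),claramente,claro
(example phrase 2),às claras,claro
(example phrase 3),alto,alto
"""


def load_rows(csv_text):
    """Read CSV text into a list of dictionaries, one per example."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_by_lemma(rows, lemma):
    """Keep only the examples annotated with the given adjectival base."""
    return [row for row in rows if row["lemma"] == lemma]


def add_column(rows, name, default=""):
    """Return copies of the rows with one additional, user-defined column."""
    return [{**row, name: default} for row in rows]


rows = load_rows(EXPORT_CSV)
claro_rows = add_column(filter_by_lemma(rows, "claro"), "user_comment")
claro_rows[1]["user_comment"] = "possibly ambiguous; re-check"

for row in claro_rows:
    print(row)
```

Working on a copy of the rows leaves the downloaded data untouched, so individual re-analyses can always be compared with the original annotation.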

3. What do the names of the corpora mean?

The subcorpora are quite heterogeneous with regard to the types of adverbs included and the dates of the sources (i.e. present-day language and/or historical-diachronic data). The names are abbreviations consisting of three parts which briefly describe each subcorpus, e.g. Pt_APM_DeG or Fr_A_Web:

1. The first part of such a name indicates the language: Fr_ for French, It_ for Italian, Pt_ for Portuguese, Ro_ for Romanian and Sp_ for Spanish.

2. The second part indicates which types of adverbials are (mainly) included: _A_ is used for corpora consisting (almost) exclusively of Adjective-Adverbs, _P_ refers to prepositional phrases, _M_ to mente-Adverbs and _D_ to derived adverbs (in the case of the Romanian corpus, since this language uses only a few derived adverbs, formed with suffixes other than mente). These abbreviations indicate which types of adverbials are systematically tagged in each dataset. Nevertheless, sporadic occurrences of other types, which were not compiled and tagged systematically, may occasionally appear.

3. The last part is a brief reference to the source of the data, that is: the original corpus they were retrieved from (e.g., DeG = Corpus Discurso e Gramática, CORDIAM = Corpus Diacrónico y Diatópico del Español de América, Web = Blogs and Forums from the Internet) or the publication project the data was collected for (e.g., DHAA = Dictionnaire historique de l'Adjectif-Adverbe, SH3 = Sintaxis histórica de la lengua española. Tercera parte).

Since various subcorpora can be combined in the Search Interface (joint queries), the abbreviations help to show which types of examples are included.
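The three-part naming scheme can be decoded mechanically. The following Python sketch assumes that the middle part is a concatenation of the single-letter codes listed above (so that APM stands for A, P and M); this reading is inferred from the example names. The function name parse_subcorpus_name is hypothetical.

```python
LANGUAGES = {
    "Fr": "French",
    "It": "Italian",
    "Pt": "Portuguese",
    "Ro": "Romanian",
    "Sp": "Spanish",
}

ADVERBIAL_TYPES = {
    "A": "adjective-adverbs",
    "P": "prepositional phrases",
    "M": "mente-adverbs",
    "D": "derived adverbs",
}


def parse_subcorpus_name(name):
    """Split a name such as 'Pt_APM_DeG' into language, adverbial types and source."""
    # Split at the first two underscores only, so the source part stays intact.
    language_code, type_codes, source = name.split("_", 2)
    return {
        "language": LANGUAGES[language_code],
        # Assumption: every letter of the middle part is one type code.
        "adverbial_types": [ADVERBIAL_TYPES[code] for code in type_codes],
        "source": source,
    }


for name in ("Pt_APM_DeG", "Fr_A_Web"):
    print(name, parse_subcorpus_name(name))
```

An unknown language or type code raises a KeyError, which makes a mistyped name easy to spot.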

4. How are the examples counted?

The metadata of every subcorpus include several counts:

Every tagged example (i.e. every short example phrase) in the corpus contains at least one annotated adverb. Since this is the main tag of the Annotation Model and the only obligatory one, the count of examples in each subcorpus (the number labelled “tagged examples of adverbs”) is based on the number of tagged adverbs. Note that more than one adverb may be annotated in one short example phrase, e.g. in the case of coordinated adverbs or adverbs modifying other adverbs. Similarly, the number of results of any query in the Search Interface is based on the count of annotated adverbs (note that this holds even when filtering, for example, for verbs).

The lemmatization of the examples is based on the underlying adjectival base. The number labelled “types of adjective-lemmata” therefore counts the different lexemes. For example, all instances of Portuguese claro, claramente and às claras are lemmatized as “claro” and therefore count as one type of adjective-lemmata. The List of Lemmata, which is shown for each subcorpus under the menu item Corpora, is a table of the different adjectival bases registered in a given corpus and the number of examples annotated with each lemma, e.g. how many examples (tokens) are registered for the lemma (type) claro.
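The difference between these counts can be made concrete with a small Python sketch. The data below are invented for illustration: each inner list stands for one short example phrase and contains its annotated adverbs as pairs of form and lemma. The variable names are hypothetical and do not correspond to fields of the database.

```python
from collections import Counter

# Invented illustration: three short example phrases,
# the last one containing two annotated adverbs.
example_phrases = [
    [("claramente", "claro")],
    [("às claras", "claro")],
    [("claro", "claro"), ("alto", "alto")],
]

# Flatten the phrases into one list of annotated adverbs.
tagged_adverbs = [adverb for phrase in example_phrases for adverb in phrase]

# Tokens per lemma, comparable to the List of Lemmata.
tokens_per_lemma = Counter(lemma for _form, lemma in tagged_adverbs)

print("short example phrases:", len(example_phrases))
print("tagged examples of adverbs:", len(tagged_adverbs))
print("types of adjective-lemmata:", len(tokens_per_lemma))
print("tokens per lemma:", dict(tokens_per_lemma))
```

In this invented case, three phrases yield four tagged adverbs but only two lemma types, because claro, claramente and às claras share the adjectival base claro.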

Finally, the number labelled “All words” refers to the size of the whole subcorpus, that is, the words of the tagged short example phrases plus the words of the larger context (as shown under menu item Corpora > full-text). Some corpora (e.g. Pt_APM_DeG) consist of the whole source text and are therefore much larger than other corpora, which consist solely of the short example phrases and possibly a slightly extended context (e.g. one paragraph).

5. How do I cite the examples and the corpus?

The bibliographic source of each example is shown both in the Search Interface and in the full-text versions under the menu item Corpora. Each subcorpus has a menu item Sources (under menu item Corpora), where the full bibliographic information (authors, titles, years, editions, regions, etc.) is listed. Furthermore, the menu item Metadata (under menu item Corpora) gives information about how the data were collected and who was in charge of compilation and annotation. The menu item Metadata also includes a suggested citation for each subcorpus. The whole database may be cited as follows:

Schneider, Gerlinde / Pollin, Christopher / Gerhalter, Katharina / Hummel, Martin (2020): Adjective-Adverb Interfaces in Romance. Open-Access Database (=AAIF-Database). https://gams.uni-graz.at/context:aaif

6. Will the database be enlarged/updated?

Each dataset (subcorpus) of the aaif-database is complete and will not be enlarged, for technical reasons and for reasons of authorship. The database as a whole may nevertheless grow: since it allows simultaneous searches in freely combined corpora, new datasets (new subcorpora) can be added in the future. The Annotation Tool and a User's Manual are publicly available (cf. About) and can be used freely for annotating new datasets. A comprehensive description of the categorizations and the lemmatization criteria can be found via Annotation Model. If you wish to publish your annotated data on the aaif-database, please contact: katharina.gerhalter@uni-graz.at or martin.hummel@uni-graz.at.

7. Further information and contact

If you have further questions about using the database or about a specific subcorpus that cannot be answered via the metadata description (cf. menu item Corpora > Metadata) or via the Annotation Model, please contact katharina.gerhalter@uni-graz.at or martin.hummel@uni-graz.at.