Using Machine Learning for the Automated Classification of Stage Directions in TEI-Encoded Drama Corpora
Daria Maximova, Frank Fischer
Zentrum für Informationsmodellierung - Austrian Centre for Digital Humanities, Karl-Franzens-Universität Graz, Austria
GAMS - Geisteswissenschaftliches Asset Management System
Creative Commons BY-NC 4.0
2019, Graz
o:tei2019.160


Papers tei2019

Language: en
Keywords: drama corpora, stage directions, short text classification, machine learning
Timestamp: 2019-07-27T09:54:07
Using Machine Learning for the Automated Classification of Stage Directions in TEI-Encoded Drama Corpora
Authors

Daria Maximova (National Research University Higher School of Economics, Moscow, RU)

Frank Fischer (DARIAH-EU and National Research University Higher School of Economics, Moscow, RU)

Abstract

The <stage> tag is a core element for the encoding of drama. The TEI Guidelines suggest nine values for its type attribute, which is widely used in large corpora such as the French Théâtre Classique, the Folger Shakespeare Library or the Swedish Dramawebben. This paper introduces an approach to automatically assign stage-direction types to the TEI-P5-encoded Russian Drama Corpus, RusDraCor (https://dracor.org/). The corpus currently features 144 plays ranging from the mid-18th to the mid-20th century, which together contain 32,753 stage directions with 144,525 tokens.

We selected 18 plays comprising 6,569 stage directions to represent the breadth of the corpus. For the manual annotation, we established a clear set of rules to identify the stage-direction types proposed by the TEI Guidelines (https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-stage.html).
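To make the annotation target concrete, the following is a minimal sketch of how typed and untyped <stage> elements could be collected from a TEI document using only the Python standard library. The XML fragment, the function name collect_stage_directions and the variable names are hypothetical and invented for illustration; they are not taken from RusDraCor or from our tool. The nine type values are those suggested in the TEI Guidelines.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI P5 namespace, needed to address elements such as <stage>.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# The nine values suggested for stage/@type in the TEI Guidelines.
TEI_STAGE_TYPES = {
    "setting", "entrance", "exit", "business", "novelistic",
    "delivery", "modifier", "location", "mixed",
}

# Hypothetical fragment, invented for illustration (not from RusDraCor).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="scene">
      <stage type="setting">A room in a country house.</stage>
      <stage type="entrance">Enter Anna.</stage>
      <sp who="#anna">
        <speaker>Anna</speaker>
        <p><stage type="delivery">quietly</stage> Is anyone here?</p>
      </sp>
      <stage>She sits down at the table.</stage>
      <stage type="exit">Exit Anna.</stage>
    </div>
  </body></text>
</TEI>"""


def collect_stage_directions(tei_xml):
    """Return (type, text) pairs for every <stage> element, in document order.

    The type is None when the element carries no type attribute, i.e. when
    it still needs to be annotated or classified.
    """
    root = ET.fromstring(tei_xml)
    pairs = []
    for stage in root.iterfind(".//tei:stage", TEI_NS):
        # Join nested text nodes and normalise whitespace.
        text = " ".join("".join(stage.itertext()).split())
        pairs.append((stage.get("type"), text))
    return pairs


stage_directions = collect_stage_directions(SAMPLE)
type_counts = Counter(t for t, _ in stage_directions)
unlabelled = [text for t, text in stage_directions if t is None]
```

In this sketch, the entries of unlabelled are the stage directions that an automatic classifier would still have to assign to one of the nine types.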

Following the annotation of our subcorpus, we developed a tool for the classification of the remaining plays without human intervention. For the conversion of stage directions into feature vectors, we used morphological and semantic data. Our tool in its current state is able to classify different types with an F1 score of approx. 0.75. Since F1 is the harmonic mean of precision and recall, this means that roughly 3 out of 4 stage directions of any given type are assigned correctly, provided that precision and recall are of similar magnitude.
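As an illustration of how a per-type F1 score relates to the share of correctly assigned stage directions, the following minimal sketch computes precision, recall and F1 for each label from a list of gold and predicted types. The function per_type_f1 and the toy label lists are hypothetical and invented for this example; this is not our evaluation code, and the numbers are not results from the corpus.

```python
from collections import Counter


def per_type_f1(gold, predicted):
    """Return {label: (precision, recall, f1)} for two equally long label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correctly assigned
        else:
            fp[p] += 1      # predicted type p, but it was something else
            fn[g] += 1      # gold type g was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: four "entrance" and four "exit" directions, one of each confused.
gold = ["entrance"] * 4 + ["exit"] * 4
predicted = ["entrance", "entrance", "entrance", "exit",
             "exit", "exit", "exit", "entrance"]

scores = per_type_f1(gold, predicted)
```

With these toy labels, three of the four directions of each type are assigned correctly and one is confused with the other type, so precision, recall and F1 coincide at 0.75 for both types. When precision and recall diverge, the same F1 value can correspond to a different share of correct assignments.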

Our work will inform a dedicated analysis of stage directions which, following preliminary studies by Sperantov (1998) and Detken (2009), will be based on larger corpora, allowing for a description of the evolution of stage directions over 200 years.