Distant Spectators

Distant Reading for Periodicals of the Enlightenment

Having an enviable quantity of multilingual text material from a genre like the 18th-century spectatorial press inspired us not only to enjoy reading it closely but also to analyze it from different perspectives at the corpus level. We therefore decided to start exploring our periodical issues through methods of distant reading, in particular topic modeling, stylometry, sentiment analysis, and network analysis, which is how Distant Spectators became our current project. If you are also interested in reading or downloading the issues, you can find them on our Spectators project website (Ertler et al. 2011).

For topic modeling, we plan to test and compare different approaches. By that we mean not only different tools and libraries based on Latent Dirichlet Allocation (LDA) (Blei et al. 2003), like MALLET (McCallum 2002-2019) or Gensim (Rehurek 2009-2019), but also easy-to-use tools like Topics Explorer (DARIAH-DE 2018) and Voyant Tools (Sinclair/Rockwell 2019), as well as more complex workflows, like the Jupyter notebooks of DARIAH Topics (DARIAH-DE 2019), DARIAH’s tutorial Text Analysis with Topic Models for the Humanities and Social Sciences (Riddell 2015), or the Topic Modeling Workflow for Python (Schöch/Schlör 2017-2019). Here, in our first topic modeling blog entry, we present some very early results and experiences from our first experiments with this multilingual corpus.

Voyant Tools

Voyant Tools and Topics Explorer are well suited for gaining first impressions of a corpus or for getting started with topic modeling. They produce visualizations quickly and require almost no prior technical knowledge. This makes it easy to answer questions about frequent words and word distributions at the level of the corpus or of individual text files, and to quickly identify which areas might be interesting, specific, or problematic on closer analysis. Here, for example, Voyant Tools supplies a word cloud with the most frequent words of our German sub-corpus. After adding a stop word list to remove non-content words (such as function words), a few themes can be read from keywords such as gesellschaft (society), frauenzimmer (literally translated as women's room, meaning wenches, women), kinder (children), ehre (honor), and natur (nature).

Word cloud from the German sub-corpus
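What a word cloud displays is essentially a frequency count after stop word removal. The following is only a minimal sketch of that idea, not Voyant's own implementation; the function name, the sample sentence and the tiny stop word list are made up for illustration.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=5):
    """Return the n most frequent lowercase tokens that are not stop words."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Hypothetical sample sentence and stop word list, for illustration only.
sample = "Die Natur der Gesellschaft und die Ehre der Kinder in der Gesellschaft"
stop_words = {"die", "der", "und", "in"}
top = top_words(sample, stop_words, n=3)
print(top)
```

Without the stop word list, function words such as die and der would dominate the count, which is why the list matters so much for reading themes out of the cloud.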

Another interesting way of using Voyant Tools is to observe and compare not only the collection as a group of texts but also the individual issues. Using word clouds, we compared the first three journals (Book I, II and III) of The Female Spectator with their versions in other languages, that is, with the German text of Die Zuschauerin, the French of La Spectatrice, traduite de l’anglais, and the Italian of La Spettatrice. We can notice many similarities. For example, person/human is clearly a frequently used term in all of the texts (en: person, fr: homme, it: uomini), as is father (en: father, de: vater, fr: père, it: padre). The English word passion is also directly translated in the Italian, German and French texts (it: passione, de: leidenschaft, fr: passion) and is frequent in Italian and French. In German, however, this word appears much less frequently, because it is often replaced by neigung (affection). Terms like these, as well as love, man, woman, heart, etc., can lead us to think that this is a text about romantic concepts, but this is not the case. It is a social critique, and passion is rather used pejoratively for strong desires of any kind (like lust and greed) and for non-rational thinking. Two other interesting and common subjects are the nature of men and the nature of women, which are frequently referenced in the gender discourse and in discussions about how human nature changes under social burdens. But when we see the term nature in the word cloud, we might be misled into thinking that it refers to the environment not made by humans.

It is therefore necessary to interpret these outputs with care. First, there are words that can mean different things, and one could easily get a wrong impression of them from a word cloud alone. For example, manner (fr: manière, it: maniera) should not be confused with conventions of behavior; in these texts it refers to ways of doing something, as we learned through close reading. Secondly, word usage in the 18th century is not always the same as today: meanings and implications often differ, and the more common use of metaphors also needs to be kept in mind. Furthermore, there is, of course, the inevitable factor of mistranslation to consider. These are things to keep in mind, yet not reasons to be discouraged from using these visualization methods, since insights such as word frequencies are very useful and cannot be gained from close reading alone.

Word cloud from The Female Spectator (Books I to III)
Word cloud from Die Zuschauerin
Word cloud from La Spectatrice, traduite de l’anglais
Word cloud from La Spettatrice

Although the word cloud is probably the most popular Voyant tool, it is not the only one. There are a number of other features that can provide different insights into the texts. Among them is even a topic modeling tool called Topics. It uses jsLDA (Mimno 2013-2018), an LDA implementation in JavaScript. Topics lets you customize the number of iterations and topics, but unfortunately offers only one visualization option. Below you can see 10 topics, each with its 10 most frequent words, for the text Die Zuschauerin. They are visualized as frequency lines, which indicate the occurrence of the topics over the course of the text. The great thing about applications such as Voyant Tools and DARIAH Topics Explorer is that they offer dynamic views, which are especially handy when analyzing larger corpora with visualization methods that are not well suited to presenting such a large range of material all at once. In the Topics view, for example, hovering the mouse over the individual points of a frequency line shows which text file in the corpus is concerned.

Topics in the German sub-corpus

Other features, such as Trends, allow the study of further aspects of the textual material. Here we observed how the mentions of gesellsch*|mensch* (society/human), gott*|christ* (God/Christianity) and natur* (nature) correlate in the individual texts. The asterisk (*) and the vertical bar (|) are search operators similar to those known from regular expressions. The asterisk is a wildcard: not only the specified token but every token that begins with it is searched for. Thus mensch, menschen, and menschlich, as well as other inflected forms and derivatives, are found. The vertical bar matches either of the two tokens it separates.

Trends in the German sub-corpus

This function is very useful for analyzing the occurrence of words typical of certain parts of the text structure, but also for comparing original texts with their translations. Here we can see how the concepts nature, love, and passion are distributed through the texts of The Female Spectator. We merged the texts for each language into one file and visualized each text group in fifteen segments. Although the language groups are of similar length and the other versions are all translated from the English, we notice differences in language usage and in the distribution of the words.
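The segment-wise view can be approximated in a few lines. The sketch below splits a text into a fixed number of token segments and reports, for each segment, the share of tokens that begin with a given prefix. Using relative frequencies and prefix matching is our assumption here, and the sample text is a made-up stand-in for one merged language file.

```python
import re

def segment_frequencies(text, prefix, segments=15):
    """Split a text into `segments` token chunks of (nearly) equal size and
    return the relative frequency of tokens starting with `prefix` in each."""
    tokens = re.findall(r"\w+", text.lower())
    freqs = []
    for i in range(segments):
        start = i * len(tokens) // segments
        end = (i + 1) * len(tokens) // segments
        chunk = tokens[start:end]
        hits = sum(1 for t in chunk if t.startswith(prefix))
        # An empty chunk can occur when the text has fewer tokens than segments.
        freqs.append(hits / len(chunk) if chunk else 0.0)
    return freqs

# Hypothetical stand-in text of 30 tokens, split into 3 segments for brevity.
sample = "nature love passion " * 10
freqs = segment_frequencies(sample, "natur", segments=3)
print(freqs)
```

Comparing such per-segment lists across the language versions is one way to check whether a term like passion really thins out in the German text or merely shifts position.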

Topics Explorer

Unlike Voyant Tools, which offers many different visualization methods, Topics Explorer specializes in topic modeling. In just a few steps (upload the corpus, set the number of topics, set the number of iterations, choose the number of most frequent words to be removed, add your own stop word list) and a few seconds or minutes, depending on the corpus size, a model is created. In our example with 10 topics and 1000 iterations on 35 German texts, we are presented with a bar chart of the topics and their overall frequency in the corpus.
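The two cleaning steps in this list (removing the most frequent words and applying a custom stop word list) can be sketched as follows. This is our own minimal illustration of the preprocessing idea, not code taken from Topics Explorer; the function name and the two toy documents are hypothetical.

```python
import re
from collections import Counter

def preprocess(documents, num_most_frequent, stop_words):
    """Tokenize documents and drop the corpus-wide most frequent words
    as well as the words on a custom stop word list."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    counts = Counter(token for doc in tokenized for token in doc)
    frequent = {word for word, _ in counts.most_common(num_most_frequent)}
    remove = frequent | set(stop_words)
    return [[t for t in doc if t not in remove] for doc in tokenized]

# Two hypothetical mini-documents, for illustration only.
docs = ["die kinder und die schulen", "die natur und die neigung"]
cleaned = preprocess(docs, num_most_frequent=1, stop_words={"und"})
print(cleaned)
```

The cleaned token lists are what a topic model would then be trained on; the number of topics and iterations only come into play at that later stage.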

Topics bar chart in the German sub-corpus

Each document can be inspected individually to see its topic distribution and to find similar documents.
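One common way to find similar documents is to compare their topic distributions with cosine similarity. Whether Topics Explorer uses exactly this measure is an assumption on our part; the sketch below only shows the general idea, with made-up document IDs and distributions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(doc_id, doc_topics):
    """Rank all other documents by similarity to doc_id, most similar first."""
    target = doc_topics[doc_id]
    scores = {other: cosine_similarity(target, dist)
              for other, dist in doc_topics.items() if other != doc_id}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical distributions over three topics; IDs and values are made up.
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
}
ranking = most_similar("doc_a", doc_topics)
print(ranking)
```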

Eliza Haywood’s Book III in German

Each topic can also be examined for its keywords and related documents. The length of the bars, as well as their order, indicates the frequency of a topic in a corpus or a document, of a keyword in a topic, etc. Here we see that our document with the ID 4104 (the same one we analyzed with Voyant Tools) has natur-umständen-neigung (nature-circumstances-tendency/affection) as its most prominent topic. This topic might be described as human nature/human state. Closely related is the topic person-liebe-gesellschaft (person-love-society), which we can call people and emotions, since it contains many keywords about emotional concepts (love, pleasure, liked, hope, tenderness).

Topic people and emotions in the German sub-corpus

It is notable that, although we ran the Voyant Tools topic modeling feature on this document alone, we get some similar results for the same document, at least regarding the frequent keywords. The distribution is of course not the same, since topic modeling requires more text material to give better results. But we can see that even with this limited sub-corpus of 35 texts, our model shows some clear and accurate topics. In the heatmap we can see, for example, that the first five documents, all written by Johann Joseph Friedrich von Steigentesch, strongly feature the topic kindern-kinder-schulen (to children-children-schools). From this one can correctly conclude that this author writes a lot about children and the educational system, which is not surprising, since Steigentesch was a school reformer.

Heatmap of the German sub-corpus
Keywords in topic children and education in the German sub-corpus

MALLET via DARIAH Topics

Our first topic modeling attempts with MALLET were very fruitful, thanks to DARIAH Topics. Using the notebook IntroducingMallet.ipynb, we created a heatmap for our German sub-corpus. Once again we created 10 topics in 1000 iterations, and once again we see strong correlations with the previous results (from Voyant Tools and Topics Explorer), for example the children and education topic (kindern-kinder-schulen). The topic person-gesellschaft-personen (person-society-persons), which even contains the token liebe (love) in fifth place (as we verified in our CSV output files), also correlates with the people and emotions topic discovered with Topics Explorer. This is not surprising, since Topics Explorer and DARIAH Topics use the same topic modeling methods. But, as already mentioned, similar topics and keywords, such as person, society, love, and children, can also be identified with Voyant Tools.

Heatmap of the German sub-corpus with DARIAH Topics (V.1, 0-optimization)

A small but important detail we experimented with changes how the topics are distributed over the documents: the optimization interval, a parameter that enables hyperparameter optimization during training and thereby allows some topics to be more prominent than others (McCallum 2002-2019). In contrast to the previous visualization, where the optimization interval was 0, we now set it to 10. As the heatmap below shows, the topic leute-art-leben (people-way/type/sort-living) is clearly the most dominant. The differences between the values of the individual document-topic distributions are much stronger, but the topics remained almost the same (even though the ranking of the top keywords changed slightly). Only what seems to be the “weakest” topic from the 0-optimized version, umständen-neigung-vater (circumstances-tendency/affection-father), was replaced by leolin-tausend-belise (leolin-thousand-belise), which is actually the same topic, only with three different top keywords. The good news, therefore, is that our model is stable, but the question is: should we use the optimization interval or leave it out?

Heatmap of the German sub-corpus with DARIAH Topics (V.2, 10-optimization)

We are of course not the first to ask this question. Christof Schöch has already experimented with and reflected on this parameter and came to the following conclusion:

If your goal is to identify small numbers of texts about specific themes in a large collection, then a lot of opimization [sic!] may be good. However, if your goal is to identify topics typical of certain authors, periods, genres or some other reasonably large subset of your collection, then it may be better to optimize a bit less. (Schöch 2016)

So it all depends on the goal of the analysis. Since we are interested in both cases (the overall themes our corpus contains as well as specific topics for specific documents and authors), we can make use of both versions. For now, working with a small number of longer documents, we find the 0-optimization version more interesting, since it also gives us information about the less dominant but still clearly present topics in the individual documents, as you can see in the two comparisons below.
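For readers who want to reproduce such a comparison with MALLET directly, the following sketch builds the command line for two training runs that differ only in the optimization interval. It is not the code of the DARIAH Topics notebook; the helper function, the file names, and the assumption that a corpus.mallet file was created beforehand with MALLET's import-dir command are ours.

```python
def mallet_train_command(input_file, optimize_interval, num_topics=10,
                         num_iterations=1000, mallet_bin="mallet"):
    """Build the argument list for MALLET's train-topics command."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--optimize-interval", str(optimize_interval),
        # Hypothetical output file names, one pair per run.
        "--output-topic-keys", f"topic_keys_opt{optimize_interval}.txt",
        "--output-doc-topics", f"doc_topics_opt{optimize_interval}.txt",
    ]

# "corpus.mallet" is a hypothetical file name, used for illustration only.
v1 = mallet_train_command("corpus.mallet", optimize_interval=0)
v2 = mallet_train_command("corpus.mallet", optimize_interval=10)
print(" ".join(v2))
# To actually train, pass a list to subprocess.run(v2, check=True).
```

Keeping every other setting identical makes it easier to attribute differences between the two heatmaps to the optimization interval, although topic models are randomly initialized and two runs will rarely match exactly.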

Topics in Eliza Haywood’s Book III from the German sub-corpus with DARIAH Topics (V.1, 0-optimization)
Heatmap of the German sub-corpus grouped by authors with DARIAH Topics (V.1, 0-optimization)
Heatmap of the German sub-corpus grouped by authors with DARIAH Topics (V.2, 10-optimization)

Our next step will be to improve our results by lemmatizing our corpus and by extending our stop word list. This way, we will not only remove more non-content words, like mögte (might) and dergleichen (like/similar), but also merge the many inflected forms and derivatives, which are especially common in German. Additionally, we will use the same methods to analyze our other sub-corpora (French, Italian, Spanish, Portuguese and English) and then continue with further analysis and interpretation, for example comparing the language versions with each other and with the manually labeled subjects. Lemmatization is a tricky part of data preprocessing, especially since our texts are from the 18th and 19th centuries and lemmatization tools are mostly trained on modern texts. There are a few strategies for dealing with this problem, but we still need to find out which tools and methods work best for our texts. We will therefore continue to experiment and report on this subject (among others) in one of our following posts.
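To illustrate what this preprocessing is meant to achieve, here is a deliberately naive sketch that maps tokens to lemmas through a hand-made lookup table and then applies an extended stop word list. It is not one of the strategies we will evaluate, and the tiny lemma table is purely hypothetical; a real lemmatizer would have to cope with historical spelling variants.

```python
def normalize(tokens, lemma_map, stop_words):
    """Map tokens to lemmas via a lookup table, then drop stop words."""
    lemmas = [lemma_map.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in stop_words]

# Hypothetical lemma table and extended stop word list, for illustration only.
lemma_map = {"kindern": "kind", "kinder": "kind", "schulen": "schule"}
stop_words = {"mögte", "dergleichen"}
tokens = ["kindern", "kinder", "schulen", "mögte", "dergleichen", "natur"]
result = normalize(tokens, lemma_map, stop_words)
print(result)
```

After this step, a topic such as kindern-kinder-schulen would no longer spend two of its top keyword slots on forms of the same word.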

References

Blei, David M./Ng, Andrew Y./Jordan, Michael I. (2003): “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3, p. 993-1022. URL: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf [Accessed: 14.10.2019].

DARIAH-DE (2018): Topics Explorer. V. 2.0.1. URL: https://github.com/DARIAH-DE/TopicsExplorer [Accessed: 14.10.2019].

DARIAH-DE (2019): DARIAH Topics. Easy Topic Modeling in Python. V. 2.0.1. URL: https://github.com/DARIAH-DE/Topics [Accessed: 08.03.2019].

McCallum, Andrew Kachites (2002-2019): MALLET. A Machine Learning for Language Toolkit. V. 2.0.8. URL: http://mallet.cs.umass.edu [Accessed: 14.10.2019].

Mimno, David (2013-2018): jsLDA. URL: https://github.com/mimno/jsLDA [Accessed: 14.10.2019].

Riddell, Allen (2015): “Text Analysis with Topic Models for the Humanities and Social Sciences”. DARIAH-DE Initiative (ed.). URL: https://liferay.de.dariah.eu/tatom/ [Accessed: 14.10.2019].

Rehurek, Radim (2009-2019): gensim. V. 3.8.1. RaRe-Technologies. URL: https://github.com/RaRe-Technologies/gensim [Accessed: 14.10.2019].

Schöch, Christof (2016): “Topic Modeling with MALLET. Hyperparameter Optimization”. URL: https://dragonfly.hypotheses.org/1051 [Accessed: 14.10.2019].

Schöch, Christof/Schlör, Daniel (2017-2019): Topic Modeling Workflow in Python. Computergestützte literarische Gattungsstilistik (CLiGS). URL: https://github.com/cligs/tmw [Accessed: 14.10.2019].

Sinclair, Stéfan/Rockwell, Geoffrey (2019): Voyant Tools. V. 2.4. URL: https://voyant-tools.org/ [Accessed: 14.10.2019].