TEI 2019

What is text, really? TEI and beyond



Character Counting

Syd Bauman

Keywords: character counting
Permalink: https://gams.uni-graz.at/o:tei2019.143

This is a demonstration of a character counting system. Character counting is not particularly new, not all that interesting, and not particularly difficult. Even the fact that the output is a useful table of the characters found in an input file, sortable by frequency, character, Unicode code point, or (perhaps uselessly) Unicode name, is not particularly remarkable.
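To make the shape of that output concrete, here is a minimal sketch in Python. It is not the demonstrated system, which is written in XSLT; the function name `character_table` and the `sort_by` values are hypothetical, chosen only for illustration.

```python
import unicodedata
from collections import Counter


def character_table(text, sort_by="frequency"):
    """Return rows of (character, code point, Unicode name, count).

    Illustrative only: the names and sort keys here are assumptions,
    not the parameters of the system being demonstrated.
    """
    rows = [
        # Characters without a Unicode name (e.g. control characters)
        # get a placeholder instead of raising ValueError.
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), count)
        for ch, count in Counter(text).items()
    ]
    sort_keys = {
        # Most frequent first; ties broken by code point.
        "frequency": lambda row: (-row[3], ord(row[0])),
        # Python compares strings by code point, so in this sketch
        # "character" and "codepoint" give the same order; a real
        # system might use a language-aware collation for the former.
        "character": lambda row: row[0],
        "codepoint": lambda row: ord(row[0]),
        "name": lambda row: row[2],
    }
    return sorted(rows, key=sort_keys[sort_by])


if __name__ == "__main__":
    for row in character_table("banana"):
        print(*row, sep="\t")
```

The Unicode names come from Python's bundled copy of the Unicode Character Database, so the names available depend on the Unicode version shipped with the interpreter.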

But add to that feature set the fact that the system works by running an XSLT program that writes an XSLT program, and it starts to get interesting. Furthermore, although the input file can be any XML document, the system will semi-intelligently handle several different kinds of input, currently TEI, WWP, XHTML, and yaps. (That list may change before the conference; for example, I am likely to add DocBook or JATS.)
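How the system recognizes its input language is not described here. One plausible approach, sketched below in Python purely as an assumption, is to look at the namespace of the root element. Only the TEI and XHTML namespace URIs are listed; the function name `detect_language` and the fallback label "generic" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of two of the input languages named above.
# Treating the root namespace as the signal is an assumption of
# this sketch, not a description of the actual system.
KNOWN_NAMESPACES = {
    "http://www.tei-c.org/ns/1.0": "TEI",
    "http://www.w3.org/1999/xhtml": "XHTML",
}


def detect_language(xml_source):
    """Guess the markup language of an XML string from its root element."""
    root = ET.fromstring(xml_source)
    # ElementTree writes namespaced tags as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:root.tag.index("}")]
        return KNOWN_NAMESPACES.get(namespace, "generic")
    return "generic"


if __name__ == "__main__":
    print(detect_language('<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'))
    print(detect_language("<doc/>"))
```

A document in an unrecognized language falls back to generic handling, which matches the statement that any XML document is acceptable input.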

In all cases, attribute values can be included or excluded, and whitespace can be normalized, ignored, or left as is, each at the user's option via a parameter. If the system recognizes the input language, further parameters may be specified to control whether metadata is included, and perhaps other details (such as choosing corr over sic). Lastly, the system performs a lookup into the Unicode database to get the correct Unicode name of each character.
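The effect of the attribute and whitespace options can be sketched as follows. This is an illustration in Python under stated assumptions, not the system's XSLT: the function name `extract_text`, the parameter names `include_attributes` and `whitespace`, and the mode names are all hypothetical, and "whitespace" is taken to mean the four XML whitespace characters.

```python
import re
import xml.etree.ElementTree as ET

# Space, tab, carriage return, line feed: XML's whitespace characters.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def extract_text(xml_source, include_attributes=False, whitespace="as-is"):
    """Collect the characters to be counted from an XML string.

    Parameter names and values are assumptions made for this sketch.
    """
    root = ET.fromstring(xml_source)
    # Text content of every element, in document order.
    parts = list(root.itertext())
    if include_attributes:
        # Append attribute values; order does not matter for counting.
        for element in root.iter():
            parts.extend(element.attrib.values())
    text = "".join(parts)
    if whitespace == "normalize":
        # Collapse each run to one space and trim both ends.
        return XML_WHITESPACE.sub(" ", text).strip(" \t\r\n")
    if whitespace == "ignore":
        # Drop whitespace entirely so it is never counted.
        return XML_WHITESPACE.sub("", text)
    if whitespace == "as-is":
        return text
    raise ValueError(f"unknown whitespace mode: {whitespace!r}")


if __name__ == "__main__":
    sample = '<p n="1">a  b\n<hi rend="x">c</hi></p>'
    print(repr(extract_text(sample, whitespace="normalize")))
    print(repr(extract_text(sample, include_attributes=True, whitespace="ignore")))
```

The string returned here is what a counting step would then tally, so each option changes the resulting table only by changing which characters reach the counter.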