Validating @selector: a regular expression adventure
Syd Bauman
Northeastern University Digital Scholarship Group
s.bauman@northeastern.edu
Zentrum für Informationsmodellierung - Austrian Centre for Digital Humanities, Karl-Franzens-Universität Graz, Austria
GAMS - Geisteswissenschaftliches Asset Management System
Creative Commons BY-NC 4.0
2019, Graz
o:tei2019.127

born digital

Papers tei2019

en TEI CSS3 selector regular expression (regexp) rendition Created from validating_selector.txt and the Markup UK paper

Starting with P3 in 1994 (i.e., over two years before CSS1 was released), the Guidelines supported a mechanism to indicate a default rendition: a way of saying, for example, that all persName elements were in italics in the original. You would put the name of an element on the gi attribute of a tagUsage element in order to indicate which elements had a particular default rendition.

Starting in 2015-10 with P5 2.9.0, TEI introduced a new method for the same purpose (and then phased out the original method). In this new method you specify which elements a default rendition applies to using the Cascading Style Sheets (CSS) selection mechanism: you put a CSS selector on the selector attribute of a rendition element. But the TEI defines selector only as teidata.text (which boils down to the RELAX NG string datatype).
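As a concrete illustration of the newer mechanism, the sketch below embeds a small TEI fragment in Python and reads the selector values back out. The fragment, the italic declaration, and the deliberately malformed second selector are made-up examples, and the helper name `selectors` is hypothetical. The point is only that a check which treats the attribute as a plain string accepts both values.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment with two default renditions. The second
# selector is not valid CSS, but it is still a perfectly good string.
TEI_FRAGMENT = """
<tagsDecl xmlns="http://www.tei-c.org/ns/1.0">
  <rendition scheme="css" selector="persName">font-style: italic;</rendition>
  <rendition scheme="css" selector="persName[">font-style: italic;</rendition>
</tagsDecl>
"""

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def selectors(xml_text):
    """Return the selector attribute of every rendition element."""
    root = ET.fromstring(xml_text)
    return [r.get("selector") for r in root.iter(TEI_NS + "rendition")]


if __name__ == "__main__":
    for value in selectors(TEI_FRAGMENT):
        print(repr(value))
```

Both values come back without complaint; nothing at the string-datatype level distinguishes the well-formed selector from the broken one.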

This struck me as insufficient; formal syntactic validation is in order. Thus I set about writing a regular expression to validate CSS3 selectors. This presentation is about both the process of creating said regular expression, and the result, which is a regular expression just over 18,300 characters long which I believe correctly matches valid CSS3 selectors and correctly fails to match other strings.

Topics to be addressed include the following.

How do you write such a long expression? The answer is you don’t; you write a program to write the expression. I wrote such a program in Perl, but plan to re-write it in XSLT before the presentation.

There are some aspects of the CSS3 specification that aren’t entirely clear, at least not to me.

According to several sources, CSS3 is not regular, and thus it cannot be parsed with a regexp. So how was I able to do this? I think there are three contributing factors. I was (1) not dealing with all of CSS3, only with selectors; (2) not trying to parse the selectors into their component segments, but rather only trying to return yes or no; and (3) unaware it was impossible until after I’d done it.

The program will generate output in either RELAX NG or XSLT.

The output includes a test suite of thousands of CSS3 selectors.

Because of limitations in RELAX NG’s use of regular expressions, the regular expression produced respects case in some places where it should be ignored.

I did not write the portion of the regular expression that tests a BCP 47 language tag, but rather downloaded someone else’s.

The regular expression runs very quickly in RELAX NG using jing, and very slowly in XSLT using Saxon.
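The idea of writing a program that writes the expression can be sketched roughly as follows. This is not the Perl program described above, only a minimal Python illustration under strong simplifying assumptions: identifiers are ASCII with no escapes, and only type, universal, class, and ID selectors with combinators and commas are covered (no attribute selectors, pseudo-classes, or namespaces). All names here are hypothetical.

```python
import re

# Small named fragments, built bottom-up and interpolated into larger ones.
# Simplifying assumption: identifiers are ASCII only, with no escapes.
IDENT = r"-?[A-Za-z_][A-Za-z0-9_-]*"
TYPE = rf"(?:{IDENT}|\*)"                 # element name or universal selector
CLASS = rf"\.{IDENT}"                     # .classname
ID = rf"#{IDENT}"                         # #identifier
SUFFIX = rf"(?:{CLASS}|{ID})"
COMPOUND = rf"(?:{TYPE}{SUFFIX}*|{SUFFIX}+)"
COMBINATOR = r"(?:\s*[>+~]\s*|\s+)"       # child, sibling, or descendant
COMPLEX = rf"{COMPOUND}(?:{COMBINATOR}{COMPOUND})*"
SELECTOR_LIST = rf"\s*{COMPLEX}(?:\s*,\s*{COMPLEX})*\s*"

SELECTOR_RE = re.compile(SELECTOR_LIST)


def is_valid_selector(text):
    """Answer yes or no only; no attempt is made to parse into parts."""
    return SELECTOR_RE.fullmatch(text) is not None


# A tiny test suite in the same spirit: strings paired with expected answers.
TESTS = {
    "persName": True,
    "div > p.note": True,
    "hi#x, *.big + p": True,
    "": False,
    "p >": False,
    "p..q": False,
    "persName[": False,
}

if __name__ == "__main__":
    for text, expected in TESTS.items():
        assert is_valid_selector(text) == expected, text
    print(len(SELECTOR_LIST), "characters in the generated expression")
```

Because each fragment is interpolated textually, the identifier pattern is repeated many times in the final string, so even this toy grammar expands well beyond the size of its parts; a fuller grammar grows accordingly.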