On Mark Davies' Talks (Oct 9-10, 2009)

Also in early October, I volunteered at the AACL (American Association for Corpus Linguistics) conference at the UofA (in Lister Centre). Since the organizers knew I was planning to attend Mark Davies' Humanities Computing Colloquium on the 9th, I was given the opportunity to guide him across campus to our talk and back to the conference. I also had the pleasure of hearing the talk he presented to the conference. The two talks covered much of the same material; however, by my estimation each was tailored exceedingly well to its particular audience. For example, I found his first talk (geared toward our humanities computing audience) much more engaging than his second (which was aimed at corpus linguists), though both were incredibly interesting.

Mark Davies is a fairly unique individual in the field of corpus linguistics, from what I could ascertain at the conference. He has worked to create a number of corpora, most notably the Corpus of Contemporary American English (COCA), and is working on the creation of the Corpus of Historical American English (COHA). Most of what he explained in his talks can be found on the Corpus.byu.edu website, in particular under the FAQ section; however, some of his more noteworthy concepts are discussed below.

He defined a monitor corpus as having five main characteristics: being large (over 100 million words), containing recent texts (the corpus having been updated within the past year), being balanced across several different genres, keeping roughly the same genre balance from year to year, and having a robust architecture that makes searches and other tasks possible. By these criteria, COCA is the only real monitor corpus for the English language, as the other available corpora fail to meet all five.

Now, given that all corpora have limitations and none can be perfect, the other corpora are still useful; in fact, many were used as source material for the talks presented throughout the AACL conference. Still, there are certain things that can be done with COCA, and will eventually be done with COHA, that may not be possible with other corpora. Hopefully COHA can be used as a reference corpus: that is, one could compare a measure such as word frequency between a particular writer and the collective writers of a time period, and see whether that writer is typical of the period with respect to that measure.
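To make the reference-corpus idea concrete, here is a minimal sketch in Python of the kind of comparison described above. It is not how COHA itself works: the two text samples are toy data I made up, and the `tokenize` and `per_million` helpers are hypothetical names used only for illustration. The idea is simply to normalise a word's count by corpus size so that a single writer can be compared against a much larger collection.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(word, tokens):
    """Return the frequency of `word` per million tokens (0.0 if empty)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000


# Toy data (an assumption for illustration, not drawn from any real corpus).
author_text = "I shall go and I shall return and I shall rest"
period_text = "we will go and they will return but she shall rest and we will wait"

author_tokens = tokenize(author_text)
period_tokens = tokenize(period_text)

author_rate = per_million("shall", author_tokens)
period_rate = per_million("shall", period_tokens)

# A ratio well above 1 suggests the writer uses the word more than is
# typical for the period; guard against a word absent from the reference.
ratio = author_rate / period_rate if period_rate > 0 else float("inf")
print(f"author: {author_rate:.0f} per million, period: {period_rate:.0f} per million")
print(f"ratio: {ratio:.2f}")
```

With real data one would of course want far larger samples, and probably a significance measure rather than a bare ratio, before calling a writer atypical.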

Mark Davies also discussed the downsides of using the web as a corpus. One cannot easily tag parts of speech, lemmatise, or tag syntax when using the web as a corpus. In addition, Google provides highly unreliable estimates of the number of hits for any particular search term. For example, if I Google my full name, it reports supposedly 59 results; however, if I go to the last page of these results, I find that there are actually only 12.

COCA and COHA were built using relational databases rather than XML because, according to Mark Davies, XML is simply too slow a technology to accomplish what relational databases can. Indeed, it would be difficult, if not impossible, to accomplish the same thing with XML, TEI, etc.
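As a rough illustration of why a relational layout suits this kind of searching, here is a small sketch using Python's built-in `sqlite3` module. The schema is entirely my own assumption (one row per token, plus a table of text metadata); it is not the actual design behind COCA or COHA, and the `count_by_genre` helper is hypothetical. The point is only that, once every token is a row and the word column is indexed, a frequency-by-genre question becomes a single indexed query.

```python
import sqlite3

# In-memory database with a hypothetical one-row-per-token schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (text_id INTEGER PRIMARY KEY, genre TEXT, year INTEGER)")
conn.execute("CREATE TABLE tokens (text_id INTEGER, position INTEGER, word TEXT)")
# The index on `word` is what keeps lookups fast as the table grows.
conn.execute("CREATE INDEX idx_tokens_word ON tokens (word)")

# Toy data (made up for illustration).
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?)",
    [(1, "spoken", 2008), (2, "academic", 2009)],
)
samples = {
    1: "she felt sick and sick",
    2: "critically ill patients felt ill",
}
for text_id, sentence in samples.items():
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?)",
        [(text_id, pos, word) for pos, word in enumerate(sentence.split())],
    )


def count_by_genre(conn, word):
    """Return (genre, count) pairs for `word`, sorted by genre name."""
    cursor = conn.execute(
        "SELECT texts.genre, COUNT(*) "
        "FROM tokens JOIN texts ON tokens.text_id = texts.text_id "
        "WHERE tokens.word = ? "
        "GROUP BY texts.genre ORDER BY texts.genre",
        (word,),
    )
    return cursor.fetchall()


for word in ("sick", "ill", "felt"):
    print(word, count_by_genre(conn, word))
```

A real corpus would also store lemmas and part-of-speech tags as extra columns, which is what makes the tagged searches mentioned earlier possible.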

Finally, it is very interesting to examine collocates (that is to say, words that often appear together with other words). COCA's architecture is rather useful in this regard. For example, if you compare the collocates of sick and ill, you can easily tell that the two words are used in quite different ways: sick is used most often with words like tired, stomach, wounded, etc., and ill is used most often with words like terminally, mentally, critically. This demonstrates some of the utility of the architecture that Mark Davies has helped develop for his corpora.
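For anyone curious how collocates can be counted at all, the following Python sketch shows one simple approach: count every word that falls within a fixed window around each occurrence of a node word. This is an assumption about the general technique, not a description of COCA's implementation; the `collocates` function, the window size of 2, and the sample sentences are all made up for illustration.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words appearing within `window` tokens of each `node` occurrence.

    Note: this naive version ignores sentence boundaries, so the window
    can reach into a neighbouring sentence.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Toy sample (made up for illustration).
sample = (
    "He was sick and tired of it. She is terminally ill. "
    "I am sick and tired. They are mentally ill."
)

sick = collocates(sample, "sick")
ill = collocates(sample, "ill")
print("sick:", sick.most_common(3))
print("ill:", ill.most_common(3))
```

Raw counts like these are dominated by common words such as "and", which is why corpus tools typically rank collocates with an association score rather than frequency alone.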