Dedicated to Corpus Research in China!

Wednesday, May 16, 2007

Corpus Tools Explained

Copyright note: most material on this page is taken from the help file in WordSmith Tools. Thanks to Mike Scott.

Word Lists
What is WordList and what's it for?
This tool generates word lists based on one or more ASCII or ANSI text files. The word lists can be generated in both alphabetical and frequency order, and optionally you can generate a word index list too.

These lists can be used:

simply in order to study the type of vocabulary used;
to identify common word clusters;
to compare the frequency of a word in different text files or across genres;
to compare the frequencies of cognate words or translation equivalents between different languages.
These word-lists may also be used as input to a KeyWords program, which analyses the words in a given text and compares frequencies with a reference corpus, in order to generate lists of "key-words" and "key-key-words" (see below).
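
By way of illustration, here is a minimal Python sketch of what a word-list tool does (not WordSmith's actual code): it reads one text file, tokenises naively, and produces both a frequency-ordered and an alphabetical list. The filename and the letters-only tokenisation rule are assumptions of the sketch.

```python
import re
from collections import Counter

def word_lists(path):
    """Build frequency-ordered and alphabetical word lists from a text file."""
    with open(path, encoding="ascii", errors="replace") as f:
        text = f.read().lower()
    # Naive tokenisation: runs of letters count as words (a simplifying
    # assumption, not WordSmith's actual tokenisation rule).
    tokens = re.findall(r"[a-z]+", text)
    freq = Counter(tokens)
    return freq.most_common(), sorted(freq.items())

by_frequency, alphabetical = word_lists("sample.txt")  # hypothetical file
for word, count in by_frequency[:10]:
    print(f"{word}\t{count}")
```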



--------------------------------------------------------------------------------

Type/Token Ratios and the Standardised Type/Token ratio
If a text is 1,000 words long, it is said to have 1,000 tokens. But many of these words will be repeated, and there may be only, say, 400 different words in the text. The different words are the types.

The ratio between types and tokens in this example would be 40%. But this ratio varies very widely according to the length of the text -- or corpus of texts -- being studied. A 1,000-word article might have a type/token ratio of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases.

The standardised type/token ratio is computed every n words as the program goes through each text file. In other words, if n=1,000, the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is then taken, so you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with fewer than 1,000 words, or whatever n is set to, get a standardised type/token ratio of 0.)
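
A minimal sketch of this calculation, assuming whitespace-split tokens and n=1,000; how WordSmith treats a final partial chunk is not documented here, so this sketch simply ignores it:

```python
def standardised_ttr(tokens, n=1000):
    """Average type/token ratio over consecutive n-token chunks,
    as a percentage; 0 for texts with fewer than n tokens."""
    chunks = [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]
    if not chunks:
        return 0.0
    return 100 * sum(len(set(chunk)) / n for chunk in chunks) / len(chunks)

tokens = open("sample.txt").read().lower().split()  # hypothetical file
print(standardised_ttr(tokens))
```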



--------------------------------------------------------------------------------

KeyWords
The purpose of this tool is to locate and identify key words in a given text. To do so, it compares the words in the text with a reference set of words usually taken from a large corpus of text. Any word which is found to be outstanding in its frequency in the text is considered "key". The key words are presented in order of outstandingness.

The "key words" are calculated by comparing the frequency of each word in the smaller of the two wordlists with the frequency of the same word in the reference wordlist. All words which appear in the smaller list are considered, unless they are in a stop list.

If "the" occurs, say, 5% of the time in the small wordlist and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc., are more frequent than they would otherwise be in your reference corpus (unless your reference corpus only concerns spiders!).

To compute the "key-ness" of an item, the program therefore computes

its frequency in the small wordlist
the number of running words in the small wordlist
its frequency in the reference corpus
the number of running words in the reference corpus
and cross-tabulates these.

Statistical tests include:

the classic chi-square test of significance with Yates' correction for a 2 × 2 table
Ted Dunning's Log Likelihood test, which gives a better estimate of keyness, especially when contrasting long texts or a whole genre against your reference corpus.
A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger wordlist.
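
Those four figures form a 2 × 2 contingency table. As an illustrative sketch (following Rayson and Garside's two-corpus log-likelihood formulation, not necessarily WordSmith's exact code), the log-likelihood keyness can be computed like this:

```python
import math

def log_likelihood(freq_small, tokens_small, freq_ref, tokens_ref):
    """Dunning log-likelihood keyness from the four figures above."""
    total = tokens_small + tokens_ref
    expected_small = tokens_small * (freq_small + freq_ref) / total
    expected_ref = tokens_ref * (freq_small + freq_ref) / total
    ll = 0.0
    if freq_small:  # x * log(x) tends to 0 as x tends to 0
        ll += freq_small * math.log(freq_small / expected_small)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical figures: a word occurring 120 times in a 10,000-token text
# but only 500 times in a 1,000,000-token reference corpus scores highly.
print(round(log_likelihood(120, 10_000, 500, 1_000_000), 1))
```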



--------------------------------------------------------------------------------

Key KeyWords
A "key key-word" is one which is "key" in more than one of a number of related texts. The more texts it is "key" in, the more "key key" it is. This will depend a lot on the topic homogeneity of the corpus being investigated. In a corpus of City news texts, items like bank, profit, companies are key key-words, while computer will not be, though computer might be a key word in a few City news stories about IBM or Microsoft share dealings.



--------------------------------------------------------------------------------

Mutual Information (MI) score
A Mutual Information (MI) score relates one word to another. For example, if problem is often found with solve, the pair may have a high mutual information score. Usually, the will be found beside problem far more often than solve is, so the procedure for calculating Mutual Information takes into account not just the most frequent words found near the word in question, but also how often each word is found elsewhere, well away from the word in question. Since the is found very often indeed far away from problem, it will not be treated as related; that is, it will get a low MI score.

This relationship is bilateral: in the case of kith and kin, the score does not distinguish between the virtual certainty of finding kin near kith and the much lower likelihood of finding kith near kin.
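
In its usual pointwise form, MI compares the observed co-occurrence frequency with the frequency expected if the two words were independent. A minimal sketch (window-size normalisation, which real tools apply in varying ways, is omitted as a simplifying assumption):

```python
import math

def mi_score(joint, freq_x, freq_y, corpus_size):
    """Mutual information: log2(observed co-occurrence / co-occurrence
    expected if the two words were independent)."""
    expected = freq_x * freq_y / corpus_size
    return math.log2(joint / expected)

# The formula is symmetric in freq_x and freq_y; this is the 'bilateral'
# property noted above. A very frequent word like 'the' has a large
# expected value, so its MI score with 'problem' stays low.
```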



--------------------------------------------------------------------------------

T-score
The MI score expresses the extent to which the observed frequency of co-occurrence differs from what we would expect (statistically speaking). It does not work very well with very low frequencies. For instance, sour occurs 472 times and puss 31 times in the CobuildDirect corpus; because sour and puss co-occur 4 times, this particular collocation gets a very high MI score. The t-score offers a way around this problem, since it also takes the raw frequencies into account. To sum up, MI is more likely to give high scores to totally fixed phrases, whereas t-score will yield significant collocates that occur relatively frequently. In most cases, t-score is the more reliable measure.
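
A minimal sketch of the t-score in its usual collocational form, (observed - expected) / sqrt(observed), using the sour/puss figures above (the corpus size is an assumed round figure for illustration):

```python
import math

def t_score(joint, freq_x, freq_y, corpus_size):
    """Collocation t-score: (observed - expected) / sqrt(observed)."""
    expected = freq_x * freq_y / corpus_size
    return (joint - expected) / math.sqrt(joint)

# sour (472) and puss (31) co-occurring 4 times in an assumed
# 50-million-token corpus: t is only about 2, even though the MI
# score for the same figures is very high (around 13.7 bits).
print(t_score(4, 472, 31, 50_000_000))
```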



--------------------------------------------------------------------------------

KWIC concordancing
A KWIC (KeyWord In Context) concordance is a set of examples of a given word or phrase. A line of text is shown for each occurrence found in the corpus, with the search word usually centre-aligned for easier analysis.
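
A minimal sketch of how such a display can be produced, assuming a pre-tokenised text and an eight-token context on either side (both arbitrary choices for illustration):

```python
def kwic(tokens, node, width=40, context=8):
    """Print a centre-aligned KWIC concordance line for each hit."""
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            print(f"{left[-width:]:>{width}}  {tok}  {right[:width]}")

text = "it is by seeing the word in context that you grasp the word".split()
kwic(text, "word")
```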

The point of a concordance is to be able to see lots of examples of a word or phrase, in their contexts. You get a much better idea of the use of a word by seeing lots of examples of it, and it's by seeing or hearing new words in context lots of times that you come to grasp the meaning of most of the words in your native language. It's by seeing the contexts that you get a better idea about how to use the new word yourself. A dictionary can tell you the meanings but it's not much good at showing you how to use the word.

Language students can use a concordancer to find out how to use a word or phrase, or to find out which other words belong with a word they want to use. For example, it's through using a concordancer that you could find out that in academic writing, a paper can describe, claim, or show, though it doesn't believe or want (*this paper wants to prove that ...).

Language teachers can use the concordancer to find similar patterns so as to help their students. They can also use this tool to help produce vocabulary exercises, by choosing two or three search-words, blanking them out, then printing.

Researchers can use a concordancer, for example when searching through a database of hospital accident records, to see whether fracture is associated with fall, grease, ladder. Or to examine historical documents to find all the references to land ownership.



--------------------------------------------------------------------------------

Collocation retrieval
Collocates are the words which occur in the neighbourhood of your search word. Collocates of letter might include post, stamp, envelope, etc. However, very common words like the will also collocate with letter.

By examining the collocates you can find out more about "the company the word keeps", which helps to show its meaning and its usage.

You may compute a concordance with or without collocates: without is slightly quicker and will take up less room on your hard disk. The number of collocates stored will depend on the collocation horizons.

The literature on collocation has never distinguished very satisfactorily between collocates which we think of as "associated" with a word (letter - stamp) on the one hand, and on the other, the words which do actually co-occur with the word (letter - my, this, a, etc.). We could call the first type coherence collocates and the second neighbourhood collocates or horizon collocates. It has been suggested that to detect coherence collocates is very tricky, as once we start looking beyond a horizon of about 4 or 5 words on either side, we get so many words that there is more noise than signal in the system.
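
A minimal sketch of neighbourhood-collocate retrieval under those assumptions, with a horizon of four tokens either side:

```python
from collections import Counter

def collocates(tokens, node, horizon=4):
    """Count the words within 'horizon' tokens either side of each
    occurrence of 'node' (neighbourhood/horizon collocates)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - horizon):i] + tokens[i + 1:i + 1 + horizon]
            counts.update(window)
    return counts

# Very common words like 'the' will dominate these raw counts; an MI
# or t-score filter (see above) is the usual way to separate genuine
# collocates from mere neighbours.
```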


--------------------------------------------------------------------------------
