Postgraduate Certificate in Computational Linguistics for Language Learning · Guide

Corpus Linguistics

4 min read Updated 5 May 2026

Corpus linguistics is a field of linguistics that involves the study of language through the analysis of large collections of text known as corpora. These corpora are typically compiled from a wide range of sources such as books, articles, transcripts, and websites, and they serve as a valuable resource for researchers interested in exploring various linguistic phenomena. In this course, we will delve into the key terms and vocabulary associated with corpus linguistics to help you develop a deeper understanding of this fascinating field.

1. **Corpus**: A corpus is a collection of written or spoken texts that are used as a basis for linguistic analysis. Corpora can vary in size and scope, ranging from small specialized collections to large, general-purpose databases. For example, the British National Corpus (BNC) is a widely used corpus that contains a diverse range of texts from various sources.

2. **Linguistic Features**: Linguistic features refer to specific characteristics of language that can be analyzed within a corpus. These features may include vocabulary, syntax, morphology, and discourse patterns. By examining these features in a corpus, linguists can gain insights into how language is used in different contexts.

3. **Frequency**: Frequency is a key concept in corpus linguistics that refers to how often a particular word or phrase occurs in a corpus. By analyzing the frequency of words or patterns in a corpus, researchers can identify common usage patterns and linguistic trends.

4. **Collocation**: Collocation is the tendency of words to occur together frequently in a language. For example, in English, the words "strong" and "coffee" often collocate to form the phrase "strong coffee." Collocation analysis can help researchers identify meaningful word combinations and associations.

5. **Concordance**: A concordance is a tool used in corpus linguistics to display how a specific word or phrase is used in context within a corpus. Concordances typically show the word or phrase in its surrounding context, allowing researchers to analyze its usage patterns.

6. **Part-of-Speech Tagging**: Part-of-speech tagging is the process of labeling each word in a corpus with its corresponding part of speech (e.g., noun, verb, adjective). This tagging allows researchers to analyze the syntactic structures and grammatical patterns of a text.

7. **N-grams**: N-grams are sequences of N contiguous words or characters that appear in a corpus. For example, a bigram is a two-word sequence, while a trigram is a three-word sequence. N-gram analysis is used to identify recurring patterns and linguistic structures in a corpus.

8. **Tokenization**: Tokenization is the process of dividing a text into individual units, or tokens, such as words or sentences. This step is essential for preparing a corpus for analysis, as it allows researchers to isolate and examine specific linguistic elements.

9. **Lemma**: A lemma is the base form or dictionary form of a word, which is used to represent all its inflected forms. For example, the lemma of the word "running" is "run." Lemmatization is the process of reducing a word to its lemma form for analysis.

10. **Stop Words**: Stop words are common words that are often excluded from analysis in corpus linguistics, as they do not carry significant meaning or contribute to the overall analysis. Examples of stop words include "the," "and," and "is."

11. **Corpus Annotation**: Corpus annotation involves adding linguistic information, such as part-of-speech tags or syntactic structures, to a corpus. This annotated data can facilitate more advanced linguistic analysis and computational processing.

12. **Metadata**: Metadata refers to additional information about a corpus, such as its source, date of collection, and language. Metadata is crucial for ensuring the reliability and usability of a corpus for research purposes.

13. **Query**: In corpus linguistics, a query is a search term or pattern used to retrieve specific linguistic data from a corpus. Researchers can use queries to extract relevant information for analysis and exploration.

14. **Corpus Query Language (CQL)**: Corpus Query Language is a specialized language used to formulate complex search queries in corpus linguistics. CQL allows researchers to specify detailed search criteria and retrieve precise linguistic data from a corpus.

15. **Concordancing Software**: Concordancing software is a tool used to search and analyze corpora, generating concordance lines that display the context of a word or phrase within a corpus. Popular concordancing software includes AntConc, Sketch Engine, and Corpus Linguistics Toolbox.

16. **KWIC (Key Word In Context)**: KWIC is a format commonly used in concordances to display a keyword surrounded by its context. KWIC displays the keyword in the center, with the words before and after it to provide context for analysis.

17. **Frequency List**: A frequency list is a list of words or phrases in a corpus, ranked according to their frequency of occurrence. Frequency lists can help researchers identify common vocabulary and linguistic patterns within a corpus.

18. **Corpus Linguistics Research Methods**: Corpus linguistics employs a variety of research methods, including quantitative analysis, qualitative analysis, and comparative analysis. These methods help researchers explore linguistic phenomena and patterns within a corpus.

19. **Corpus-Based Language Teaching**: Corpus-based language teaching is an approach that integrates corpora and corpus linguistics into language teaching and learning. By using real language data from corpora, teachers can create authentic and effective language learning materials.

20. **Challenges in Corpus Linguistics**: Corpus linguistics faces several challenges, such as corpus representativeness, data sparsity, and the need for specialized tools and expertise. Overcoming these challenges requires careful planning and methodological considerations.

In this course, you will learn how to apply corpus linguistics methods and techniques to analyze language data, explore linguistic patterns, and develop insights into language use. By mastering the key terms and vocabulary in corpus linguistics, you will be equipped to conduct research, create language learning materials, and advance your understanding of language through computational analysis.

Key takeaways

These corpora are typically compiled from a wide range of sources such as books, articles, transcripts, and websites, and they serve as a valuable resource for researchers interested in exploring various linguistic phenomena.
For example, the British National Corpus (BNC) is a widely used corpus that contains a diverse range of texts from various sources.
**Linguistic Features**: Linguistic features refer to specific characteristics of language that can be analyzed within a corpus.
**Frequency**: Frequency is a key concept in corpus linguistics that refers to how often a particular word or phrase occurs in a corpus.
For example, in English, the words "strong" and "coffee" often collocate to form the phrase "strong coffee.
**Concordance**: A concordance is a tool used in corpus linguistics to display how a specific word or phrase is used in context within a corpus.
**Part-of-Speech Tagging**: Part-of-speech tagging is the process of labeling each word in a corpus with its corresponding part of speech (e.

Corpus Linguistics

Key takeaways

More from Postgraduate Certificate in Computational Linguistics for Language Learning