Corpus Linguistic Analysis

CC BY-NC-ND 4.0 by Laura Aull - University of Michigan

Corpus linguistics fuels AI innovation: Teams of computational linguists, including those at OpenAI, delve into the vast expanse of the internet, amassing an extensive corpus to predict textual patterns. Yet, when classic lines, like T.S. Eliot's 'I have measured out my life with coffee spoons' from 'The Love Song of J. Alfred Prufrock,' are absorbed without proper acknowledgment, pressing ethical questions emerge. This illustration captures that very sentiment, as Eliot's iconic line spirals into the corpus vortex.

What is Corpus Linguistic Analysis?

How we (usually) read and write

If you are like most people in the United States, you read and write one phrase, sentence, and paragraph at a time. Then, you consider all the words, sentences, and paragraphs of a full individual text, and that tells you what that text is about.

For example, when you read the news, you probably read or skim each news article or post from the beginning onward, and then you think about what each one is about. For a class or your own purposes, you might also consider the audience of a particular article, such as whether it is international or domestic, or left-leaning or right-leaning. This kind of attention to the rhetoric and rhetorical situation of individual texts is something you have probably practiced a good deal.

Reading one sentence and text at a time is what your teachers tend to do when they read papers, too: they read your paper from start to finish, and then they read your classmate’s paper, and so on.

You and your instructors may also think about some aspects of writing across individual texts, such as genre or purpose. Your teachers might look across a stack of papers, for instance, and consider how well a class of students has used primary evidence in a research paper. In another example, you might look over a Twitter feed to see how often people retweet posts in a particular thread. In such instances, you and your teachers are paying attention to aspects of the rhetorical situation across multiple texts.

By contrast, you probably spend little time thinking about how language—in words, phrases, and sentences—is used across the texts you read and write. That kind of focus, on language across texts, is common in linguistic approaches to writing, which are more popular outside of the U.S. than inside the U.S. Accordingly, if your writing teachers have been trained in U.S. rhetoric and composition rather than linguistics, they know a lot about students’ writing generally but may not know a lot about the specific language that students use across their papers and across courses.

What does all this mean? Most U.S. readers and writers, and most U.S. research on student writing, tend to discuss written texts one text at a time. Understanding across texts tends to focus on contextual patterns, such as audience or genre. Most U.S. readers and writers know less about textual patterns, or patterns of language across texts and contexts.

Of course, on some level, you do think about language patterns, maybe without even realizing it. It’s part of why you can recognize a newspaper article and why you know how to write a text message: you have paid attention to how people use language in patterned ways. But this kind of knowledge—the kind we pick up through casual observation—is often subconscious and is rarely systematic. For example, you can probably write a text message that is appropriate for a given rhetorical situation without thinking much about it, because you have picked up on what kind of language is appropriate for the genre (text message) and audience (your recipient, such as a family member or friend). But what do you do when you need to write something unfamiliar to you? If you are writing your first college composition essay, or your first psychology case study, how do you know what language patterns are preferred?

This brings us to analysis that uses computer-aided tools to offer us a view of language patterns across texts—a bird’s eye view of written language patterns. This kind of analysis is called corpus linguistic analysis: the term corpus refers to a body of texts, and linguistic analysis, as you saw before, refers to the examination of patterns of language use. As a complement to understanding one text at a time, corpus linguistic analysis can help us systematically analyze and understand written language in terms of patterns across many texts and across time.

Reading so far, you may already be picking up on three premises, or assumptions, related to corpus linguistics:

Texts make meaning in patterned ways across texts and contexts.
It can be hard to comprehend language patterns if we are trained to read and analyze only one text at a time.
Attention to language across texts and contexts can teach us additional information about what is expected in particular rhetorical situations.

You are probably already picking up on a detailed definition of corpus linguistic analysis, too. Corpus linguistic analysis refers to the examination of textual patterns in a selected body of naturally produced texts, usually via computer-aided tools that facilitate searching, sorting, and calculating large-scale textual patterns.

Notice two key terms inside this definition:

Textual patterns: lexical or grammatical patterns that persist across texts in a corpus, in contrast to more varied choices or to patterns in other corpora
Naturally produced texts: a given corpus consists only of language produced for authentic, real-world purposes

In sum, corpus linguistic analysis is about identifying choices people make (and don’t make) across texts, and we can use the results of such analysis to enhance our understanding of how language and texts work. Corpus linguistic analysis has been used a lot since the mid- to late-20th century, especially outside of the U.S., in places like England, Asia, and Australia, to help teachers and students learn about expert and student writing choices that come up again and again.

The Bird’s-Eye View of Language: Why Corpus Linguistic Analysis?

You may not be convinced yet. If we are most used to reading and writing one text at a time, why introduce something different? Why get a bird’s-eye view of language patterns across texts?

Some good reasons include that we get to see different details when we look across texts—details we can miss or misperceive when we read one text at a time. Here are two key reasons why corpus linguistic analysis can be useful, followed by examples from corpus linguistic analysis of academic writing.

Our perceptions of language use are often misleading.

It’s easy to come to inaccurate conclusions about language, because some things catch our attention more than others. For instance, people tend to think that language is changing rapidly when they read slang words on the Internet. But actually, there are many more words on the Internet that have been around a long time than there are new words. Corpus linguistic analysis has shown that only around 3% of online language use includes internet-specific slang such as abbreviations. It’s just that the newer words grab our attention more than the old ones. In this example, corpus linguistic analysis helps us quantify what percentage of words on the internet are actually new words, and what percentage are words we have been using for a while. Let’s consider one more example, this one from research on academic writing.

Have you ever found it difficult to read college textbooks? Doug Biber and his research team used corpus linguistic analysis to analyze different kinds of language use on college campuses, including research articles, textbooks, and office hours. One thing they wanted to investigate was how textbooks compared to these other kinds of language use, because instructors often think that textbooks provide easy-to-read narrative descriptions for students.

Based on corpus linguistic analysis of all of these kinds of language, Biber et al. found that textbooks are not characterized by narrative, accessible language like spoken conversation. Instead, they tend to include dense, present-tense discussions of implications, making textbooks challenging to read for students. In some ways, textbooks are just as difficult to parse as research articles.

Much of our knowledge about written language is tacit, or unconscious (Odell et al.).

Once we have learned to write in a particular way, it is easy to forget the conscious steps we had to learn to do it in the first place. That is why it can be hard for your teachers to realize what might be challenging about an academic writing task they assign, and why it might be hard for you to explain to a grandparent how to write a tweet or how to use hashtags. Let’s again turn to a more specific example from research on academic writing.

Have you ever felt like you didn’t know what a teacher wanted in your writing? What teachers want can be subtle, or even unstated. Brown and Aull did a corpus analysis of advanced placement English essays that showed two distinct patterns in successful and unsuccessful essays. The successful student writing included specific, detailed phrases, while unsuccessful student writing included generic, emphatic phrases. This means, for instance, that a successful student essay might include the following sentence:

A twentieth-century understanding of grief suggests that it takes time.

In this sentence, a detailed phrase about an understanding of grief (A twentieth-century understanding of grief) is the subject of the sentence.

By contrast, an unsuccessful student essay might instead say:

Grief obviously takes time.

This sentence includes a simple subject (grief) as well as an emphatic word, obviously. To academic readers, the second sentence can seem too general and too strong.

The bottom line is that our perceptions of language use can miss important patterns, because we tend to read one word, sentence, and text at a time. Getting a bird’s-eye view allows us to understand more about the kinds of choices people tend to make with language, including successful and unsuccessful choices in academic writing. As we learn about such patterns and practice looking for them, we can become more adept at recognizing what characterizes different kinds of written texts.

Example exercise: Words that hang out with one another

Let’s get some practice thinking about language patterns. We’ll do this by considering collocations, or the words that most often hang out with other words. (The technical, fancy-sounding definition of collocations is “the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance.”)
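If you are curious how "a frequency greater than chance" can be put into numbers, here is a minimal sketch. It uses pointwise mutual information (PMI), one common association measure in corpus linguistics; this chapter does not say which measure any particular tool uses, so treat PMI as an assumption chosen for illustration. The function name pmi and the toy sentences are hypothetical and invented here.

```python
import math
import re
from collections import Counter

def pmi(texts, first, second):
    """Pointwise mutual information (in bits) of the word pair (first, second).

    A positive score means the pair occurs together more often than we would
    expect if the two words were independent; a negative score means less often.
    """
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        # Deliberately simple tokenizer: lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    if bigrams[(first, second)] == 0:
        return float("-inf")  # the pair never occurs in this corpus
    p_pair = bigrams[(first, second)] / sum(bigrams.values())
    p_first = unigrams[first] / sum(unigrams.values())
    p_second = unigrams[second] / sum(unigrams.values())
    return math.log2(p_pair / (p_first * p_second))

# Hypothetical toy sentences, invented for illustration (not real corpus data).
toy_corpus = ["a good idea is a good start", "the idea of a good idea"]
print(pmi(toy_corpus, "good", "idea"))
```

With only two short sentences the score is not meaningful in itself; association measures like this become informative only on large corpora.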

First, try to guess: What words collocate, or hang out, most often with the word idea in U.S. English?

Specifically, what words do you think come just before idea, in all sorts of U.S. English (spoken, fiction, academic, news, and magazine)? List your top 5 guesses.

________________ idea

To test your guesses, we can turn to corpus linguistic analysis, using the Corpus of Contemporary American English (COCA). COCA is an online database where you can search all kinds of patterns in American English, across spoken conversation, fiction, academic writing, news, and magazines. You’ll see COCA listed in the resources below with a URL so that you can check it out yourself.

For this search, we’ll look for all words immediately to the left of idea. These are called 1L collocates, because they appear 1 space to the left.

Use of the word IDEA in COCA (all registers)

Top 10 1L Collocates
good	idea
bad	idea
whole	idea
great	idea
better	idea
new	idea
very	idea
basic	idea
clear	idea
general	idea
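To see what a 1L collocate search does under the hood, here is a minimal sketch in Python. It is not how COCA itself is implemented; the function name left_collocates and the three toy sentences are hypothetical, made up for illustration, and the tokenizer is deliberately simple (it ignores punctuation and sentence boundaries).

```python
import re
from collections import Counter

def left_collocates(texts, node, top_n=10):
    """Count the words that appear immediately to the left (1L) of `node`."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and split it into simple word tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Walk through adjacent pairs: (previous word, current word).
        for prev, word in zip(tokens, tokens[1:]):
            if word == node:
                counts[prev] += 1
    return counts.most_common(top_n)

# Hypothetical toy corpus, invented for illustration (not COCA data).
toy_corpus = [
    "That is a good idea, and the whole idea is simple.",
    "It was a bad idea, but a good idea came later.",
    "The basic idea is that a good idea spreads.",
]

print(left_collocates(toy_corpus, "idea", top_n=3))
```

A real corpus tool runs the same kind of count over hundreds of millions of words, which is what makes rankings like the one above trustworthy.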

How many of your guesses were right? Did you guess that not only are good idea and bad idea popular, but so too are the expressions (the) very idea, basic idea, and general idea?

Let’s think about these patterns. Several collocations show evaluation of an idea (good idea, bad idea, great idea), including some comparison (better idea, new idea). Others show emphasis on an idea ((the) very idea). Finally, others convey a summary or gist of an idea (whole idea, basic idea, general idea). (Clear idea is used both in evaluation and in summary statements.)

Many people guess that people describe ideas as good and bad, but they don’t realize how often speakers and writers use idea to let their audience know that they are summarizing something. As you read before, this is the kind of thing that corpus linguistic analysis can uncover: common patterns of language use that we don’t necessarily pay attention to but that can tell us what matters to people in a given type of writing. Picking up on these collocates might, for instance, help students begin to notice how often people summarize, and when they tend to do so.

If we use the above examples, for instance, you could consider the following as you begin to read and write in a new course: How do writers describe ideas? Do they evaluate them (e.g., as good, bad, or correct)? Do they describe them (e.g., as theoretical, abstract, or practical)? Do they summarize them (e.g., general, overall)?

Let’s explore one more example, this one concerning something many students wonder about: the first person in academic writing.

Here’s our question for this one: How do writers draw attention to themselves as writers by using the first person I or we?

Let’s first make a guess about expert academic writing. In academic writing published in the U.S., what words do you think collocate, or hang out, with I? Specifically, what words do you think most often appear right after I, or immediately to the right of the word I, in academic writing? Again, note your top 5 guesses.

I ________________

We can again use corpus linguistic analysis to find out how accurate your guesses are. Specifically, we can use the Corpus of Contemporary American English academic subcorpus (COCAA) and search for words 1 space to the right, or 1R, of I.

Use of the word I in COCA, Academic writing

	Top 10 1R Collocates
I	have
I	was
I	think
I	had
I	would
I	will
I	can
I	could
I	did
I	believe
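A 1R search can be sketched in a similar spirit. The version below keeps capitalization so that the pronoun I is matched exactly, and it reports each collocate's share of the total rather than a raw count. The function name right_collocate_shares and the toy sentences are hypothetical, and this is an illustration under those assumptions, not the method COCA uses.

```python
import re
from collections import Counter

def right_collocate_shares(texts, node="I"):
    """Return each 1R collocate of `node` with its share of all 1R hits."""
    counts = Counter()
    for text in texts:
        # Keep the original case so the pronoun "I" is matched exactly.
        tokens = re.findall(r"[A-Za-z']+", text)
        # Walk through adjacent pairs: (current word, next word).
        for word, nxt in zip(tokens, tokens[1:]):
            if word == node:
                counts[nxt.lower()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.most_common()}

# Hypothetical toy sentences, invented for illustration (not COCA data).
toy_academic = [
    "I have observed this pattern. I think that it matters.",
    "I would suggest caution, although I have collected more data.",
]

shares = right_collocate_shares(toy_academic)
print(shares)
```

Reporting shares instead of counts makes it easier to compare corpora of different sizes, such as one course's papers against a larger collection.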

First of all, using COCAA, we can see that even though lots of students have heard that they shouldn’t use I in academic writing, corpus linguistic analysis shows us that many published academic writers use I, or we.

How do they use it? In these collocates, we can see a clear and consistent pattern: academic writers use I as the subject of verbs, and these verbs tend to help writers describe their processes (consider, for instance, examples like I have observed, I was able to, and I had collected). Academic writers also use I to describe their thinking (I think that, I would suggest). They also, though less often, use I to describe beliefs: I believe is the last of the top ten.

How did your guesses hold up? A lot of people guess argue, thinking that academics write I argue a lot, but it is not in the top ten. Conversely, few people guess I have or I had. In addition, many students are surprised to see that academic writers are often tentative rather than explicit about their arguments: as you can see, academic writers use I would, I think, and I could far more often than I argue.

Summing up

As you can see, sometimes corpus linguistic analysis can surprise us. It shows us that textbooks can be hard to read, that student grades are based in part on the subjects of their sentences, and that academic writers use I to describe steps in their thinking and processes. With more analysis, we learn more.

Try out the resources below, and see what patterns you find with a bird’s-eye view across many texts.

More examples of corpus linguistics research

Written versus spoken English:

Very formal, academic writing tends to contain lots of nouns and prepositions, while more informal language, including spoken conversation, tends to contain more pronouns and verbs (Biber; Biber and Gray).

Student writing:

Successful writing by late-undergraduate and early-graduate writers shows clear differences depending on the discipline. For example, writing in Philosophy and Education is more narrative and interpersonal than writing in Biology or Physics. Writing in Political Science and Linguistics falls in between (Hardy and Römer).
First-year college writers tend to boost, or intensify, their ideas with words such as really, truly, or clearly more than they hedge, or qualify, their ideas with words such as perhaps, might, or possibly. This can make first-year writing seem overstated to many academic readers, who tend to appreciate some space for doubt and exception (Aull First-Year; Aull et al.; Aull and Lancaster; Hyland “Undergraduate Understandings”).
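You can try a rough version of this boost-versus-hedge comparison on your own drafts. The sketch below counts only the six example words named above; the studies cited use their own, much longer word lists, so this is a simplified illustration, and the function name stance_counts and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Only the example words named in the text; published studies use longer lists.
BOOSTERS = {"really", "truly", "clearly"}
HEDGES = {"perhaps", "might", "possibly"}

def stance_counts(text):
    """Count boosters and hedges in `text`, plus the total number of tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "boosters": sum(counts[w] for w in BOOSTERS),
        "hedges": sum(counts[w] for w in HEDGES),
        "tokens": len(tokens),
    }

# Hypothetical sample sentence, invented for illustration.
sample = "Grief clearly takes time, and it is truly hard. Perhaps it varies."
print(stance_counts(sample))
```

The token total is included so that counts can be turned into rates, for instance boosters per 1,000 words, before comparing texts of different lengths.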

Published academic writing across disciplines:

Writers in the social and natural sciences tend to use more first person pronouns (I, we) to describe experimental processes, while writers in the humanities tend to use first person pronouns to showcase interpretive reasoning (Hyland “Stance”).
Academic writers across all disciplines still tend to hedge, or qualify, more than they boost, or intensify (Hyland Disciplinary Discourses).

Corpus Resources

Corpus of Contemporary American English (COCA): https://www.english-corpora.org/coca/

Details about COCA: Davies, M. (2011). Word frequency data from the Corpus of Contemporary American English (COCA).

Michigan Corpus of Upper-Level Student Papers (MICUSP):

Details about MICUSP: Römer, Ute and O’Donnell, Matthew. From student hard drive to web corpus (part 1): the design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora, vol. 6, no. 2, 2011: 159-177.

Collocation games, see e.g., Wu, Franken, and Witten. Collocation games from a language corpus. In Digital Games in Language Learning and Teaching. Palgrave Macmillan, London, 2012: 209-229.

The Grammar Lab: David West Brown’s www.thegrammarlab.com/