Understanding corpus analysis

In this section, we will first understand what corpus analysis is. Then we will briefly touch on speech analysis and see how we can analyze a text corpus for different NLP applications. At the end, we will perform some practical corpus analysis on a text corpus. Let's begin!

Corpus analysis can be defined as a methodology for pursuing in-depth investigations of linguistic concepts grounded in the context of authentic, communicative situations. Here, we are talking about digitally stored language corpora, which are made available for access, retrieval, and analysis via computer.

Corpus analysis for speech data requires phonetic analysis of each data instance. Apart from phonetic analysis, we also need to perform conversation analysis, which gives us an idea of how social interaction happens in day-to-day life in a specific language. For example, if you are doing conversation analysis for casual English, you may find a sentence such as What's up, dude? used more frequently in conversations than How are you, sir (or madam)?

Corpus analysis for text data consists of statistically probing, manipulating, and generalizing the dataset. For a text dataset, we generally analyze how many different words are present in the corpus and what the frequency of certain words is. If the corpus contains any noise, we try to remove it. In almost every NLP application, we need to do some basic corpus analysis so that we can understand our corpus well. nltk provides us with some inbuilt corpora, so we will perform corpus analysis using them. Before jumping to the practical part, it is very important to know what types of corpora are present in nltk.
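
As a quick illustration of the kind of counting just described, here is a minimal sketch in plain Python; it uses only the standard library, and the toy sentence is my own example, not part of the chapter code:

    # Count how many different words a tiny corpus contains and how often a
    # particular word occurs, using a plain frequency counter.
    from collections import Counter

    toy_corpus = "the cat sat on the mat and the cat slept"
    tokens = toy_corpus.split()        # naive whitespace tokenization
    word_counts = Counter(tokens)      # frequency of each word

    print(len(word_counts))            # number of different words: 7
    print(word_counts["the"])          # frequency of the word 'the': 3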

nltk has four types of corpora. Let's look at each of them:

  • Isolate corpus: This type of corpus is a collection of text or natural language. Examples of this kind of corpus are gutenberg, webtext, and so on.
  • Categorized corpus: This type of corpus is a collection of texts that are grouped into different types of categories.
    An example of this kind of corpus is the brown corpus, which contains data for different categories such as news, hobbies, humor, and so on.
  • Overlapping corpus: This type of corpus is a collection of texts that are categorized, but the categories overlap with each other. An example of this kind of corpus is the reuters corpus, which contains data that is categorized, but the defined categories overlap with each other.
    To make the reuters example more explicit: a single document in the reuters corpus can carry more than one category label at a time, so categories such as coconut, coconut-oil, and cotton-oil overlap rather than being mutually exclusive.
  • Temporal corpus: This type of corpus is a collection of the usages of natural language over a period of time.
    An example of this kind of corpus is the inaugural address corpus.
    Suppose you record the usage of a language in a city of India in 1950. Then you repeat the same activity to see the usage of the language in that particular city in 1980, and again in 2017. You will have recorded the various data attributes regarding how people used the language and what changed over that period of time. A short code sketch after this list shows how to access an example corpus of each of these four types.
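
The following sketch accesses one example corpus of each of the four types through nltk.corpus. It assumes the corpora have already been downloaded (for example, via nltk.download()); the printed values are only indicative:

    from nltk.corpus import gutenberg, brown, reuters, inaugural

    # Isolate corpus: a plain collection of texts
    print(gutenberg.fileids()[:3])

    # Categorized corpus: texts grouped into categories such as 'news' and 'humor'
    print(brown.categories()[:5])

    # Overlapping corpus: a single document can carry several category labels
    some_file = reuters.fileids()[0]
    print(reuters.categories(some_file))    # may list more than one category

    # Temporal corpus: usage over time; the file names carry the year
    print(inaugural.fileids()[:3])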

Now, enough theory; let's jump to the practical stuff. You can access the following links to see the code:

The code for this chapter is in the following GitHub directory: https://github.com/jalajthanaki/NLPython/tree/master/ch2.

Follow the Python code at this URL: https://nbviewer.jupyter.org/github/jalajthanaki/NLPython/blob/master/ch2/2_1_Basic_corpus_analysis.html

The Python code contains basic commands showing how to access corpora using the nltk API. We are using the brown and gutenberg corpora, and we touch upon some of the basic corpus-related APIs.
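
As a hedged sketch of what those basic commands look like (the file name austen-emma.txt is one of the standard gutenberg file IDs; adjust as needed, and download the corpora first with nltk.download()):

    from nltk.corpus import brown, gutenberg

    print(brown.fileids()[:3])                      # file identifiers in the corpus
    print(brown.categories())                       # categories such as 'news', 'humor'
    print(brown.words(categories='news')[:10])      # word tokens from one category

    print(gutenberg.raw('austen-emma.txt')[:100])   # raw text of one file
    print(gutenberg.sents('austen-emma.txt')[:2])   # sentence-tokenized view
    print(len(gutenberg.words('austen-emma.txt')))  # total number of word tokens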

A description of the basic API attributes is given in the following table:

We have also seen the code for loading your own customized corpus using nltk, as well as for computing the frequency distribution of the available corpora and our custom corpus.
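
Here is a minimal sketch of loading a customized corpus with nltk's PlaintextCorpusReader; the directory path and file pattern below are assumptions, so point them at your own data:

    from nltk.corpus import PlaintextCorpusReader

    corpus_root = '/path/to/your/corpus'              # directory with your .txt files (assumed)
    my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')

    print(my_corpus.fileids())                        # files picked up by the reader
    print(my_corpus.words()[:20])                     # word tokens across the custom corpus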

The FreqDist class is used to encode frequency distributions, which count the number of times each word occurs in a corpus.
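
A short sketch of FreqDist on the brown corpus (the 'news' category is just an example):

    from nltk import FreqDist
    from nltk.corpus import brown

    # Build a frequency distribution over lowercased word tokens
    fdist = FreqDist(w.lower() for w in brown.words(categories='news'))

    print(fdist['the'])            # how many times 'the' occurs in the news category
    print(fdist.most_common(10))   # the ten most frequent words
    print(len(fdist))              # number of distinct word types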

The nltk corpora are not that noisy; only a basic kind of preprocessing is required to generate features from them. Using the basic corpus-loading API of nltk helps you identify the extremely junky parts of your data. Suppose you have a biochemistry corpus: it may contain many equations and complex chemical names that cannot be parsed accurately by existing parsers. Depending on your problem statement, you can then decide whether to remove them in the preprocessing stage or keep them and customize the parsing at the part-of-speech (POS) tagging level.

In real-life applications, corpora are very dirty. Using FreqDist, you can take a look at how words are distributed and decide what you should and shouldn't consider. At preprocessing time, you need to check many complex attributes, such as whether the results of parsing, POS tagging, and sentence splitting are appropriate. We will look at all of this in detail in Chapter 4, Preprocessing, and Chapter 5, Feature Engineering and NLP Algorithms.

Note that here we are looking at corpus analysis in terms of its technical aspects. We are not focusing on corpus linguistics analysis, so do not confuse the two.
If you want to read more on corpus linguistics analysis, refer to this URL:
https://en.wikipedia.org/wiki/Corpus_linguistics
If you want to explore the nltk API more, the URL is http://www.nltk.org/.