Why do we need a corpus?

In any NLP application, we need data, or a corpus, to build NLP tools and applications. A corpus is the most critical and basic building block of any NLP-related application. It provides us with quantitative data that is used to build NLP applications. We can also use some part of the data to test and challenge our ideas and intuitions about the language. A corpus plays a very big role in NLP applications. The challenges in creating a corpus for NLP applications are as follows:

  • Deciding the type of data we need in order to solve the problem statement
  • Availability of data
  • Quality of the data
  • Adequacy of the data in terms of amount

Now you may want to know the details of all the preceding points; for that, I will take an example that will help you understand them easily. Consider that you want to make an NLP tool that understands the medical state of a particular patient and can help generate a diagnosis after proper medical analysis.

Here, our focus is biased toward the corpus level and kept generalized. If you look at the preceding example as an NLP learner, you should process the problem statement as stated here:

  • What kind of data do I need if I want to solve the problem statement?
    • Clinical notes or patient history
    • Audio recording of the conversation between doctor and patient
  • Do you have this kind of corpus or data with you?
    • If yes, great! You are in a good position, so you can proceed to the next question.
    • If not, OK! No worries. You need to process one more question, which is probably a difficult but interesting one.
  • Is there an open source corpus available?
    • If yes, download it, and continue to the next question.
    • If not, think of how you can access the data and build the corpus yourself. Think of web scraping tools and techniques, but you also have to explore the ethical as well as legal aspects of your web scraping (a minimal scraping sketch follows this list).
  • What is the quality level of the corpus?
    • Go through the corpus, and try to figure out the following things:
      • If you can't understand the dataset at all, then what should you do?
        • Spend more time with your dataset.
        • Think like a machine, and try to think of all the things you would process if you were fed this kind of dataset. Don't think that you will throw an error!
        • Find one thing that you feel you can begin with.
        • Suppose your NLP tool needs to diagnose a human disease; think of what you would ask the patient if you were the doctor's machine. Once you start understanding your dataset this way, you can think about the preprocessing part, but do not rush into it (a quick inspection sketch follows this list).
      • If you can understand the dataset, then what should you do next?
        • Do you need everything that is in the corpus to build the NLP system?
          • If yes, then proceed to the next level, which we will look at in Chapter 5, Feature Engineering and NLP Algorithms.
          • If not, then proceed to the next level, which we will look at in Chapter 4, Preprocessing.
  • Will the amount of data be sufficient for solving the problem statement on at least a proof of concept (POC) basis?
    • In my experience, I would prefer to have at least 500 MB to 1 GB of data for a small POC (a quick size check is sketched after this list).
    • For startups, collecting even 500 MB to 1 GB of data is a challenge for the following reasons:
      • Startups are new in business.
      • Sometimes they are very innovative, and there is no ready-made dataset available.
      • Even if they manage to build a POC, validating their product in real life is also challenging.
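
If you do end up building your own corpus by scraping the web, the following minimal sketch shows the general idea. It assumes the requests and BeautifulSoup (bs4) libraries, and the URL is purely hypothetical; always check a site's terms of use and robots.txt before scraping it.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical page listing public-domain medical case reports;
    # replace it with a source you are legally allowed to scrape.
    url = "https://example.com/case-reports"

    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Keep the visible paragraph text as raw corpus material.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    with open("raw_corpus.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(paragraphs))

    print("Saved", len(paragraphs), "paragraphs to raw_corpus.txt")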
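
To spend time with your dataset in a structured way, a quick inspection script such as the one below is often enough to decide where to begin. It assumes the corpus is a single plain-text file named raw_corpus.txt and uses only naive whitespace tokenization; both are simplifications for illustration.

    from collections import Counter

    # Assumed corpus file; point this at whatever text file you actually have.
    with open("raw_corpus.txt", encoding="utf-8") as f:
        text = f.read()

    lines = text.splitlines()
    tokens = text.split()  # naive whitespace tokenization for a first look

    print("Number of lines :", len(lines))
    print("Number of tokens:", len(tokens))
    print("Vocabulary size :", len(set(token.lower() for token in tokens)))

    # The most frequent words usually reveal what the corpus is really about.
    print("Top 10 words    :", Counter(token.lower() for token in tokens).most_common(10))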
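
For the question of whether the amount of data is sufficient, a rough check is simply to add up the file sizes in your corpus directory and compare the total against the 500 MB guideline mentioned earlier. The directory name here is an assumption.

    import os

    # Hypothetical directory that holds all of your corpus files.
    corpus_dir = "corpus/"

    total_bytes = 0
    for root, _, files in os.walk(corpus_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))

    total_mb = total_bytes / (1024 * 1024)
    print("Corpus size: %.1f MB" % total_mb)
    print("Enough for a small POC?", total_mb >= 500)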

Refer to Figure 2.1 for a description of the preceding process:

Figure 2.1: Description of the process defined under Why do we need a corpus?