Getting raw text

In this section, we will get raw text data from three different sources.

The following are the data sources:

  • A raw text file
  • Raw text defined inside a script as a local variable
  • Any of the corpora available in nltk

Let's begin:

  • Raw text file access: I have a .txt file saved on my local computer that contains text data in the form of a paragraph. I want to read that file and load its content. As the next step, I will run a sentence tokenizer to get the sentences out of it.
  • Define raw data text inside a script in the form of a local variable: If we have only a small amount of data, we can assign it to a local string variable. For example: text = "This is the sentence, this is another example."
  • Use an available corpus from nltk: We can import one of the corpora that ship with nltk, such as the Brown corpus or the Gutenberg corpus, and load its content.

I have defined three functions:

  • fileread(): This reads the content of a file
  • localtextvalue(): This loads locally defined text
  • readcorpus(): This reads the Gutenberg corpus content

Refer to the code snippet given in Figure 4.2, which covers all three cases described above:

Figure 4.2: Various ways to get the raw data
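As a rough illustration of the three functions listed above, here is a minimal sketch. It is not the code from the figure or the repository: the path parameter of fileread(), the fileid and limit parameters of readcorpus(), the choice of austen-emma.txt, and the sample file written in the demo are all assumptions made for this example. readcorpus() also assumes that nltk is installed and that the Gutenberg corpus has been downloaded once with nltk.download("gutenberg").

```python
import os
import tempfile


def fileread(path):
    """Return the whole content of a text file as a single string."""
    # The path argument is an assumption; pass the location of your own .txt file.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def localtextvalue():
    """Return a small piece of text held in a local string variable."""
    text = "This is the sentence, this is another example."
    return text


def readcorpus(fileid="austen-emma.txt", limit=1000):
    """Return the first `limit` characters of one Gutenberg corpus file."""
    # Imported here so that the two functions above work without nltk installed.
    from nltk.corpus import gutenberg

    # fileid and limit are illustrative defaults, not values from the book.
    return gutenberg.raw(fileid)[:limit]


if __name__ == "__main__":
    # Create a throwaway sample file so that the demo does not depend on a local path.
    with tempfile.TemporaryDirectory() as folder:
        sample_path = os.path.join(folder, "rawtext_sample.txt")
        with open(sample_path, "w", encoding="utf-8") as handle:
            handle.write("This is a paragraph. It has two sentences.")
        print(fileread(sample_path))

    print(localtextvalue())

    try:
        print(readcorpus())
    except (ImportError, LookupError) as error:
        # Raised when nltk or the Gutenberg corpus data is missing.
        print("Corpus not available:", error)
```

Each function returns a plain Python string, so the same sentence tokenizer can be applied to the result regardless of where the text came from.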

The code is available on GitHub: https://github.com/jalajthanaki/NLPython/blob/master/ch4/4_1_processrawtext.py