Sentence tokenization

Raw text data usually comes in paragraph form. If you want to extract the individual sentences from a paragraph, you need to tokenize the text at the sentence level.

Sentence tokenization is the process of identifying sentence boundaries. It is also called sentence boundary detection, sentence segmentation, or sentence boundary disambiguation. This process identifies where each sentence starts and ends.

Some specialized cases require customized rules for the sentence tokenizer as well.
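One such customization supported by nltk is supplying your own abbreviation list to the underlying Punkt tokenizer, so that a period after an abbreviation is not mistaken for a sentence boundary. The following is a minimal sketch, assuming nltk is installed; the abbreviation set and sample text are illustrative, not from the original code:

```python
# Customized sentence tokenization with nltk's Punkt tokenizer.
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
# Register "dr" and "mr" as abbreviations so the period after them
# is not treated as a sentence boundary.
punkt_params.abbrev_types = {"dr", "mr"}
tokenizer = PunktSentenceTokenizer(punkt_params)

text = "Dr. Smith visited London. He met Mr. Jones."
sentences = tokenizer.tokenize(text)
print(sentences)
```

Without the custom abbreviation list, the tokenizer might split the text after "Dr." as well, producing an incorrect sentence boundary.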

The following open source tools are available for performing sentence tokenization:

  • OpenNLP
  • Stanford CoreNLP
  • GATE
  • nltk

Here, we are using the nltk sentence tokenizer.

We import sent_tokenize from nltk and alias it as st:

  • sent_tokenize(rawtext): This takes a raw text string as an argument
  • st(filecontentdetails): This is our customized raw data, provided as an input argument

You can find the code on this GitHub Link: https://github.com/jalajthanaki/NLPython/blob/master/ch4/4_1_processrawtext.py.

You can see the code in the snippet shown in Figure 4.4:

Figure 4.4: Code snippet for nltk sentence tokenizer