Python Natural Language Processing
Jalaj Thanaki
Sentence tokenization
Raw text data usually arrives in paragraph form. If you want to extract the sentences from a paragraph, you need to tokenize the text at the sentence level.
Sentence tokenization is the process of identifying the boundaries of sentences. It is also called sentence boundary detection, sentence segmentation, or sentence boundary disambiguation. This process identifies where each sentence starts and ends.
Some specialized cases, such as abbreviations ending in a period, require customized rules for the sentence tokenizer as well.
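As an illustration of such a customized rule, NLTK's Punkt tokenizer lets you register known abbreviations so that, for example, the period in "Dr." is not mistaken for a sentence boundary. A minimal sketch (the sample text and abbreviation list are invented for illustration):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Register abbreviations (lowercase, without the trailing period) so the
# tokenizer does not treat the period in "Dr." as a sentence boundary.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {"dr", "mr", "mrs"}

tokenizer = PunktSentenceTokenizer(punkt_params)

text = "Dr. Smith arrived late. He apologized to everyone."
# Two sentences: the period after "Dr" is kept inside the first one.
print(tokenizer.tokenize(text))
```

Instantiating `PunktSentenceTokenizer` directly, as here, works without downloading any NLTK data, which makes it convenient for experimenting with custom rules.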
The following open source tools are available for performing sentence tokenization:
- OpenNLP
- Stanford CoreNLP
- GATE
- NLTK
Here we are using the NLTK sentence tokenizer.
We import sent_tokenize from nltk and alias it as st:
- sent_tokenize(rawtext): This function takes a raw text string as its argument
- st(filecontentdetails): Here, filecontentdetails is our customized raw data, provided as the input argument
You can find the code at this GitHub link: https://github.com/jalajthanaki/NLPython/blob/master/ch4/4_1_processrawtext.py.
The code snippet is shown in Figure 4.4:
Figure 4.4: Code snippet for nltk sentence tokenizer