Stop word removal

Stop word removal is an important preprocessing step for NLP applications such as sentiment analysis and text summarization.

Removing stop words, along with other commonly occurring words, is a basic but important step. The stop words that are going to be removed come from a list provided by nltk. Refer to the code snippet in Figure 4.7:

Figure 4.7: Code to see the list of stop words for the English language
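A minimal sketch of what such a snippet might look like is given here. It assumes that nltk is installed; the function name english_stop_words is hypothetical and is not taken from the figure. stopwords.words("english") and nltk.download("stopwords") are standard nltk calls, and the download is only attempted if the stop word corpus is missing.

```python
def english_stop_words():
    """Return nltk's English stop word list as a list of strings."""
    # Imported inside the function so this file still loads without nltk.
    import nltk
    from nltk.corpus import stopwords

    try:
        return stopwords.words("english")
    except LookupError:
        # The stop word corpus is not installed yet: fetch it once, then retry.
        nltk.download("stopwords")
        return stopwords.words("english")
```

Calling print(english_stop_words()) should then display the list. The exact words depend on the installed nltk version.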

The output of the preceding code is the list of stop words available in nltk. Refer to Figure 4.8:

Figure 4.8: Output of nltk stop words list for the English language

nltk has a readily available list of stop words for the English language. You can also customize which words you want to remove according to the NLP application that you are developing.

You can see the code snippet for removing customized stop words in Figure 4.9:

Figure 4.9: Removing customized stop words
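As a rough illustration of the idea, the following sketch filters a sentence against a hand-written stop word set. The stop words ("hi", "bye"), the input sentence, and the function name remove_custom_stop_words are all assumptions made for this example, not values taken from the figure.

```python
def remove_custom_stop_words(line, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in line.split() if word not in stop_words)


# Hypothetical custom list and input, chosen only for illustration.
custom_stop_words = {"hi", "bye"}
result = remove_custom_stop_words("hi this is foo. bye", custom_stop_words)
print(result)
```

The comparison is case-sensitive and works on whitespace-separated tokens, so punctuation stays attached to the word it follows.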

The output of the code given in Figure 4.9 is as follows:

this is foo. 

The code snippet in Figure 4.10 performs the actual stop word removal from raw English text:

Figure 4.10: Stop words removal from raw text
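One possible way to write such a snippet is sketched here. It assumes whitespace tokenization and a case-insensitive comparison; the function name remove_stop_words and its stop_words parameter are hypothetical. When no stop word set is passed in, the sketch falls back to nltk's English list, which requires nltk and its stop word corpus to be installed.

```python
def remove_stop_words(raw_text, stop_words=None):
    """Return raw_text with stop words removed, keeping the word order."""
    if stop_words is None:
        # Assumption: default to nltk's English list when none is supplied.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))

    # Lowercase each token before the lookup so that "I" matches "i".
    kept = [word for word in raw_text.split() if word.lower() not in stop_words]
    return " ".join(kept)
```

Because tokens are split on whitespace only, a stop word that is directly followed by punctuation (for example "is.") is not removed. A tokenizer such as nltk's word_tokenize could be swapped in if that matters for your application.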

The output of the preceding code snippet is as follows:

Input raw sentence: "this is a test sentence. I am very happy today."
--------Stop word removal from raw text--------- 
test sentence. happy today.