Book title: Python Natural Language Processing
Author: Jalaj Thanaki
Last updated: 2025-02-28 13:05:45

Word tokenization

Word tokenization is the process of chopping a stream of text into words, phrases, and other meaningful strings. Each unit produced by this process is called a token.

The code snippet in Figure 4.11 tokenizes a text into words:

Figure 4.11: Word tokenized code snippet

The output of the code given in Figure 4.11 is as follows:

The input for word tokenization is:

Stemming is funnier than a bummer says the sushi loving computer scientist.She really wants to buy cars. She told me angrily. It is better for you.Man is walking. We are meeting tomorrow. You really don't know..!

The output for word tokenization is:

['Stemming', 'is', 'funnier', 'than', 'a', 'bummer', 'says', 'the', 'sushi', 'loving', 'computer', 'scientist', '.', 'She', 'really', 'wants', 'to', 'buy', 'cars', '.', 'She', 'told', 'me', 'angrily', '.', 'It', 'is', 'better', 'for', 'you', '.', 'Man', 'is', 'walking', '.', 'We', 'are', 'meeting', 'tomorrow', '.', 'You', 'really', 'do', "n't", 'know..', '!']
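
To make the idea concrete, here is a minimal sketch of a word tokenizer that uses only Python's standard `re` module. It is not the code from Figure 4.11, and the function name `simple_word_tokenize` and its regular expression are assumptions made for illustration. Its rules are deliberately simple, so its tokens differ from the listing above: it keeps a contraction as a single token and splits every punctuation mark into its own token.

```python
import re

# Hypothetical helper for illustration; not the tokenizer used in the book.
# A token is either a run of word characters (optionally joined by
# apostrophes, so contractions stay whole) or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_word_tokenize("She told me angrily. It is better for you.Man is walking."))
    print(simple_word_tokenize("You really don't know..!"))
    # second call prints: ['You', 'really', "don't", 'know', '.', '.', '!']
```

Because punctuation is matched one character at a time, a missing space after a period (as in "you.Man") still yields three separate tokens. A library tokenizer such as NLTK's `word_tokenize` applies richer rules, for example splitting contractions into parts such as "do" and "n't".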