- Python Natural Language Processing
- Jalaj Thanaki
- 123 characters
- 2025-02-28 13:05:45
Is preprocessing required?
- Suppose you have raw data for text summarization, and your dataset contains HTML tags, repeated text, and so on.
- If your raw data contains the kind of content that I described in the first point, then preprocessing is required. In this case, we need to remove the HTML tags and the repeated sentences; otherwise, preprocessing is not required.
- You also need to convert the text to lowercase.
- After that, you need to apply a sentence tokenizer to your text summarization dataset.
- Finally, you need to apply a word tokenizer to your text summarization dataset.
- Whether your dataset needs preprocessing depends on your problem statement and on what your raw dataset contains.
You can see the flowchart in Figure 4.19:

Figure 4.19: Basic flowchart for performing preprocessing of text-summarization
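
The steps in the flowchart can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not the book's own code: the function names (`strip_html`, `split_sentences`, `drop_repeated`, `preprocess`) and the sample string are hypothetical, and the regex-based splitters are simple stand-ins for a real sentence tokenizer and word tokenizer such as NLTK's `sent_tokenize` and `word_tokenize`. Repeated sentences are dropped after sentence splitting, because that is the point at which individual sentences can be compared.

```python
import html
import re

# Simple patterns; assumptions for this sketch, not a robust HTML parser or tokenizer.
TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?
WORD_RE = re.compile(r"\w+|[^\w\s]")          # words, or single punctuation marks


def strip_html(raw):
    """Remove HTML tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def split_sentences(text):
    """Naive sentence tokenizer: split after sentence-ending punctuation."""
    return [s for s in SENT_SPLIT_RE.split(text) if s]


def drop_repeated(sentences):
    """Keep only the first occurrence of each sentence, preserving order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


def preprocess(raw):
    """Strip HTML, lowercase, split into sentences, drop repeats, then split into words."""
    text = strip_html(raw).lower()
    sentences = drop_repeated(split_sentences(text))
    return [WORD_RE.findall(sentence) for sentence in sentences]


sample = "<p>NLP is fun. NLP is fun.</p> <b>Summaries</b> save time!"
tokens = preprocess(sample)
print(tokens)
```

For real datasets, a dedicated HTML parser and a trained sentence tokenizer will generally handle edge cases (abbreviations, nested markup) better than these regular expressions.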