Is preprocessing required?

  • Suppose you have raw data for text summarization, and the dataset contains HTML tags, repeated text, and so on.
  • If your raw data contains this kind of content, preprocessing is required: in this case, you need to remove the HTML tags and the repeated sentences. Otherwise, preprocessing is not required.
  • You also need to convert the text to lowercase.
  • After that, you need to apply a sentence tokenizer to your text summarization dataset.
  • Finally, you need to apply a word tokenizer to each sentence of your text summarization dataset.
  • Whether your dataset needs preprocessing depends on your problem statement and what data your raw dataset contains.
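The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function names (strip_html, split_sentences, drop_repeated, tokenize_words, preprocess) are hypothetical, and the regular expressions are simplified stand-ins for a proper sentence tokenizer and word tokenizer, such as the ones provided by NLTK. It also assumes that repeated text means exact duplicate sentences, so the duplicates are removed after sentence tokenization rather than before it.

```python
import re
from html import unescape

# Simplified patterns (assumptions for this sketch, not production-grade rules).
TAG_RE = re.compile(r"<[^>]+>")                    # anything that looks like an HTML tag
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")       # whitespace that follows ., ! or ?
WORD_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")      # words (with apostrophes) or single punctuation marks


def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", text))


def split_sentences(text):
    """Naive sentence tokenizer: split after ., ! or ? followed by whitespace."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENT_SPLIT_RE.split(text) if s]


def drop_repeated(sentences):
    """Keep only the first occurrence of each sentence, preserving order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


def tokenize_words(sentence):
    """Naive word tokenizer: words and punctuation marks become separate tokens."""
    return WORD_RE.findall(sentence)


def preprocess(raw):
    """Strip HTML, lowercase, split into sentences, drop repeats, tokenize words."""
    text = strip_html(raw).lower()
    sentences = drop_repeated(split_sentences(text))
    return [tokenize_words(s) for s in sentences]


if __name__ == "__main__":
    raw = "<p>Cats sleep a lot.</p> <p>Cats sleep a lot.</p> <b>Dogs bark!</b>"
    for tokens in preprocess(raw):
        print(tokens)
```

For a real dataset, you would likely swap the two regular-expression tokenizers for library tokenizers, because a pattern such as the one used here splits incorrectly on abbreviations like "Dr." or "e.g.".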

You can see the flowchart in Figure 4.19:

Figure 4.19: Basic flowchart for preprocessing a text summarization dataset