Selecting data

Suppose you work at one of the large technology companies, such as Google, Apple, or Facebook. In that case, you have easy access to large amounts of data. But if you are doing independent research or learning NLP concepts, how and where can you get a dataset? First, decide what kind of dataset you need based on the NLP application that you want to develop, and consider the end result that the application should produce. For example, if you want to build a chatbot for the healthcare domain, you should not use a dialog dataset from banking customer care. So, understand your application or problem statement thoroughly before you choose the data.

You can use the following links to download free datasets:
https://github.com/caesar0301/awesome-public-datasets
https://www.kaggle.com/datasets
https://www.reddit.com/r/datasets/

You can also use the Google Advanced Search feature to find datasets, or you can collect your own data with Python web scraping libraries such as Beautiful Soup or Scrapy.
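To make the scraping idea concrete, here is a minimal sketch of how raw text could be pulled out of a web page to start a small corpus. It is an illustration under stated assumptions, not a recommended pipeline: it uses Python's built-in html.parser module instead of Beautiful Soup or Scrapy so that it runs without installing anything, and the names ParagraphExtractor, extract_paragraphs, and SAMPLE_HTML are hypothetical, as is the sample page itself.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:  # skip paragraphs that contain only whitespace
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        # Keep text only while we are inside a paragraph.
        if self._depth > 0:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph strings found in an HTML document."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page content, standing in for a downloaded healthcare FAQ.
SAMPLE_HTML = """
<html><body>
  <h1>Clinic FAQ</h1>
  <p>How do I book an <b>appointment</b>?</p>
  <p>Call the front desk or
     use the online form.</p>
  <p>   </p>
</body></html>
"""

corpus = extract_paragraphs(SAMPLE_HTML)
for line in corpus:
    print(line)
```

In practice, the HTML string would come from a page you have downloaded, and with Beautiful Soup installed, a call such as BeautifulSoup(html, "html.parser").find_all("p") plays a similar role. Before scraping any site, check that its terms of use allow it.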

Once you have selected a dataset that fits your application, you can move on to the next step.