- Python Natural Language Processing
- Jalaj Thanaki
- 422 words
- 2021-07-15 17:01:47
Web scraping
To build a web scraping tool, we can use libraries such as beautifulsoup and scrapy. Here, I give some basic code for web scraping with each of them.
Take a look at the code snippet in Figure 2.6, which shows a basic web scraper built with beautifulsoup:
Figure 2.7 shows the output:
You can find the installation guide for beautifulsoup and scrapy at this link:
https://github.com/jalajthanaki/NLPython/blob/master/ch2/Chapter_2_Installation_Commands.txt.
You can find the code at this link:
https://github.com/jalajthanaki/NLPython/blob/master/ch2/2_2_Basic_webscraping_byusing_beautifulsuop.py.
If the script prints warnings when it runs, don't worry; you can ignore them.
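As a rough illustration of the beautifulsoup approach described above, here is a minimal sketch; it is not the code from the repository. It assumes the bs4 package is installed and uses Python's built-in html.parser. The function names parse_page and scrape, and the choice to collect the page title and link targets, are assumptions made for this example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_page(html):
    """Return the page title and all link targets found in an HTML string."""
    # "html.parser" ships with Python, so no extra parser needs installing.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True skips <a> tags that have no href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return {"title": title, "links": links}


def scrape(url):
    """Download a page and pass its HTML to parse_page (hypothetical helper)."""
    with urlopen(url) as response:
        html = response.read()
    return parse_page(html)
```

Calling scrape with the URL of a page you are allowed to scrape returns a dictionary with the title and the list of links. Keeping the parsing in parse_page means it can be tried on a plain HTML string without any network access.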
Now, let's do some web scraping using scrapy. For that, we need to create a new scrapy project.
To create the scrapy project, execute the following command in your terminal:
$ scrapy startproject project_name
I'm creating a scrapy project named web_scraping_test; the command is as follows:
$ scrapy startproject web_scraping_test
Once you execute the preceding command, you will see the output shown in Figure 2.8:
After creating a project, perform the following steps:
- Edit your items.py file, which has been created already.
- Create the WebScrapingTestspider file inside the spiders directory.
- Go to the web page that you want to scrape, and select the XPath of the element you need. You can read more about XPath selectors at this link:
https://doc.scrapy.org/en/1.0/topics/selectors.html
Take a look at the code snippet in Figure 2.9. The code is available at this GitHub URL:
https://github.com/jalajthanaki/NLPython/tree/master/web_scraping_test
Figure 2.10 shows a basic web scraper built with scrapy:
Figure 2.11 shows the output, which is a CSV file:
If you get any SSL-related warnings, refer to the answer at this link:
https://stackoverflow.com/questions/29134512/insecureplatformwarning-a-true-sslcontext-object-is-not-available-this-prevent
You can develop a web scraper that bypasses AJAX and scripts, but be very careful when you do this and make sure that you are not doing anything unethical. For that reason, we are not going to cover bypassing AJAX and scripts to scrape data here. If you are curious, you can search the web to see how people actually do this. You can use the Selenium library to automate clicks and other web events.