Web Scraper w/ Database (Python)

This project was a personal project I’ve wanted to do for along time. I chose to code the project in python to get more practice using the language. I used MySQL for the database since I had previous knowledge using a local host with MySQL and felt comfortable designing a simple database to store my programs data into. I had to learn/use a little bit of JavaScript and HTML to parse the website data. My goal was to scrape from 5 thousand pages and store the data into a database.

A problem I ran into while doing this project was using multiprocessing to speed up the web accesses I was doing. The initial page I was planning on scrapping from had a limit of 5 page requests a minute, which was far to small for the scale I wanted to do. So, I found a different website with around a 3 page a second limit, but with less data and less accuracy within the data. Multiprocessing was used to get as close to the limits as I could so the program took an hour instead of days. This also allowed me a lot of control with how fast I was sending page requests.


As for the results of the web scrapping, I first ran all the initial statistics on the raw data that I wanted. I then stored the data in custom data class. Using multiprocessing again, I made a thread that would submit these data objects to the database while the program was still running. After a rough start, the last couple thousand pages were properly scraped, parsed, and stored without any bugs.


As for viewing the data, I resulted to the table listed in the phpMyAdmin page of the localhost server. I found the table there was able to order the data by whatever datapoints I wanted. It also allows me to query the database as I want, so I found there was no reason to make a different front end then what phpMyAdmin gave.

    Skills Learned
  • Significantly greater understanding of python and its workings
  • Much more comfortable with local host software/databases
  • Solidified knowledge of MySQL workflow including creation and querying of databases
  • Learned how https requests work and the importance of limiting page requests

Check out the source code for the project at the github page below.
https://github.com/BrettParker97/ScraperPy