Web Scraping Example in Python using BeautifulSoup

İsmail GÖK
Published in Analytics Vidhya
5 min read · Dec 16, 2020

While developing a React Native mobile app, I needed tons of data from the World Wide Web and had very little time to collect it (certainly no time to type it in by hand). Being eager to learn new technologies, I did some research and came across the concept of web scraping. After more research, I decided to give a Python web scraping library called BeautifulSoup a chance.

In this story, I will try to explain how to perform basic web scraping with Python and the BeautifulSoup library, using my own code as an example.

A side note: web scraping is generally not illegal if you do it ethically. Using data that is already publicly available to everyone should not be a problem, but be aware of the ethical aspects. For example, overloading a site’s bandwidth and disrupting its traffic with too many scraper bots can have legal consequences.

Let’s first get familiar with the concept of web scraping, shall we? To keep this story short I will quote from Wikipedia; you can find the full version in its article on web scraping.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Note the sentence “While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.” It mentions a web crawler, so I think we should also know the difference between web scraping and web crawling.

Simply put, web scraping is the action of extracting data from websites in an automated manner. It’s a programmatic analysis of a web page to download information from it. Example: let’s say you want to extract the price of a specific product from an e-commerce website. You write a web scraper that fetches the HTML from the e-commerce website programmatically and pulls the price out of it.
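The price example can be sketched in a few lines. This is only an illustration: the HTML fragment, the `price` class name, and the `extract_price` helper are made up for this sketch and do not come from any real shop.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a product page; real sites use their own markup.
PRODUCT_HTML = """
<div class="product">
  <h1 class="title">Mechanical Keyboard</h1>
  <span class="price">$49.90</span>
</div>
"""


def extract_price(html):
    """Return the text of the first <span class="price">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.find("span", class_="price")
    if price_tag is None:
        return None
    return price_tag.get_text(strip=True)


print(extract_price(PRODUCT_HTML))
```

In a real scraper the HTML would come from an HTTP request instead of a string constant.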

Web crawling, on the other hand, is a slightly different technique. A crawler is basically an internet bot that systematically browses (read: crawls) the World Wide Web, usually for the purpose of web indexing. You give the crawler a seed URL, and it follows the links it finds from that page, indexing every page it reaches and copying the data for further analysis. Example: search engines use web crawling to index content across the internet.
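To make the contrast concrete, here is a minimal crawler sketch. It is not how a search engine works in practice; it only shows the idea of starting from a seed URL and following links. The `crawl` function, the `fetch` parameter, and the in-memory `FAKE_WEB` dictionary (which stands in for real HTTP requests) are all assumptions made for this illustration.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical three-page "website" kept in memory so the sketch needs no network.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a><a href="/">Home</a>',
    "https://example.com/b": '<a href="/missing">Broken link</a>',
}


def crawl(seed_url, fetch, max_pages=50):
    """Breadth-first crawl: visit the seed, then every page reachable by links.

    `fetch` is any callable that returns a page's HTML for a URL, or None.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be loaded, skip it
            continue
        visited.append(url)
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])  # resolve relative links
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return visited


print(crawl("https://example.com/", FAKE_WEB.get))
```

A scraper, by contrast, usually knows which pages it wants and extracts specific fields from them.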

After that basic introduction, we can move on to the coding example. This small application is developed in Python, with BeautifulSoup as the web scraping library.

First things first: we need to import the relevant libraries.

Import BeautifulSoup in order to use the famous web scraping library. Import urllib.request in order to open HTTP requests to specific URLs. More detailed explanation: https://docs.python.org/3/library/urllib.request.html.

Further, import re (used for regular expressions) or any other module you need in order to parse the data you get from the URLs.
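The imports described above would look roughly like this. It assumes the `beautifulsoup4` package is installed (for example with `pip install beautifulsoup4`); the other two modules ship with Python.

```python
import re  # regular expressions, handy for cleaning up scraped text
import urllib.request  # opens HTTP connections to URLs

from bs4 import BeautifulSoup  # the HTML parsing / scraping library
```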

After the imports, the first thing to do is to take the main URL and open a connection to it. I wanted to store the data I got from the connection in a text file, so I opened a file as well.
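A minimal sketch of this step follows. The URL, the output file name, and the helper names `fetch_html` and `open_output` are placeholders chosen for this illustration, not the values from my project.

```python
import urllib.request

MAIN_URL = "https://example.com/universities"  # hypothetical main page
OUTPUT_PATH = "scraped_data.txt"  # hypothetical output file name


def fetch_html(url, timeout=30):
    """Open a connection, read the raw response bytes, and close the connection."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def open_output(path=OUTPUT_PATH):
    """Open the text file that will receive the scraped data."""
    return open(path, "w", encoding="utf-8")
```

With these helpers, `page_html = fetch_html(MAIN_URL)` reads the page, and `output_file = open_output()` prepares the file to write into.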

We have done no scraping at this point; we have only opened a connection to a URL, read all the HTML data, and closed the connection. Now it is time to parse the HTML data and scrape some meaningful information out of it.
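The parsing step can be sketched like this. The `parse_html` helper and the sample bytes are assumptions for illustration; the sketch uses Python's built-in `html.parser` so that no extra parser package is needed, and passes the encoding through BeautifulSoup's `from_encoding` argument.

```python
from bs4 import BeautifulSoup


def parse_html(raw_bytes, encoding="utf-8"):
    """Parse raw HTML bytes, telling BeautifulSoup the encoding explicitly."""
    return BeautifulSoup(raw_bytes, "html.parser", from_encoding=encoding)


# Hypothetical page content with non-ASCII characters, encoded as UTF-8 bytes.
sample_bytes = (
    "<html><head><title>İstanbul Üniversitesi</title></head>"
    "<body></body></html>"
).encode("utf-8")

soup = parse_html(sample_bytes)
print(soup.title.string)
```

Note that `from_encoding` only matters when the input is bytes, as it is when read straight from a connection.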

The line that creates the BeautifulSoup object is very important in my experience. Most of the time, BeautifulSoup can figure out a site’s character encoding correctly, but not always. Sometimes it cannot detect the correct encoding, and that caused me a lot of headaches. So, to be perfectly safe, I tell BeautifulSoup explicitly how it needs to parse the HTML data from the URL. In most cases, the character encoding is UTF-8 and the data is in HTML format.

There are a bunch of different ways to identify your target HTML element, such as its tag name, id, class name, or even its styles. I used some of these to get the data I needed. I wanted to get all the rows from a specific table, so I accessed the table using its element id. After that, I accessed its ‘tbody’ using only the element name and got all the rows through their ‘tr’ tags.
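Here is a small sketch of that lookup. The table id `universities`, the sample HTML, and the `get_table_rows` helper are made up for this example; substitute the id used by the page you are scraping.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real page has its own id and columns.
SAMPLE_HTML = """
<table id="universities">
  <thead><tr><th>Name</th></tr></thead>
  <tbody>
    <tr><td><a href="/uni/1">First University</a></td></tr>
    <tr><td><a href="/uni/2">Second University</a></td></tr>
  </tbody>
</table>
"""


def get_table_rows(soup, table_id):
    """Find a table by id, step into its tbody, and return its tr elements."""
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    body = table.find("tbody")
    if body is None:  # some pages omit tbody, fall back to the table itself
        body = table
    return body.find_all("tr")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = get_table_rows(soup, "universities")
print(len(rows))
```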

All the rows had URLs that link to the website’s subdomain addresses. My main target was to traverse all of these subdomains and get the desired data from the main tables in them.

It is not always viable, but here I could pick out the desired table rows by their styles. I wanted all the table rows (‘tr’) whose ‘style’ attribute was not equal to ‘height: 46px’ (which was the height of the table header). Then I proceeded to get the values from each cell in the desired rows. Using row.findAll(‘td’), I got all the cells in an array. After that, I accessed the ‘a’ tag, read its ‘href’ attribute, and stored the result in an array called allURLs.
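The filtering and link collection might look like the sketch below. Several details are assumptions: the sample HTML, the base URL, taking only the first link in each row, and resolving relative links with `urljoin`. The sketch uses `find_all`, the current name of the method that older code calls `findAll`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

HEADER_STYLE = "height: 46px"  # style value of the header row, from the text above
BASE_URL = "https://example.com/list"  # hypothetical address of the main page

# Hypothetical rows: one header row and two data rows.
SAMPLE_ROWS_HTML = """
<table>
  <tr style="height: 46px"><td><a href="/ignored">Header</a></td></tr>
  <tr style="height: 23px"><td>1</td><td><a href="/uni/1">First</a></td></tr>
  <tr><td>2</td><td>No link in this row</td></tr>
</table>
"""


def collect_links(rows, base_url):
    """Skip header-styled rows and return the first link found in each other row."""
    all_urls = []
    for row in rows:
        if row.get("style") == HEADER_STYLE:
            continue  # this is the table header, not data
        for cell in row.find_all("td"):
            link = cell.find("a")
            if link is not None and link.get("href"):
                all_urls.append(urljoin(base_url, link["href"]))
                break  # one link per row is enough for this sketch
    return all_urls


sample_rows = BeautifulSoup(SAMPLE_ROWS_HTML, "html.parser").find_all("tr")
print(collect_links(sample_rows, BASE_URL))
```

Comparing the whole `style` string is fragile, because it breaks as soon as the site changes its spacing or adds another style rule, which is one reason this approach is not always viable.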

Now I had all the URLs of the subdomains. It was time to iterate over them and access the desired table data.

After accessing all the ‘td’ elements in the table, I disregarded a specific column by checking its text value. The rest is just some operations to extract the meaningful data from what I scraped. Lastly, I closed the file and printed a basic information message.
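The loop over the collected URLs can be sketched as follows. The table id `main`, the skipped text `Details`, the whitespace cleanup, and the helper names are all assumptions for illustration. The `fetch` parameter is any function that returns a page's HTML, and the pause between requests is an addition of this sketch to avoid overloading the site.

```python
import re
import time

from bs4 import BeautifulSoup

SKIP_TEXT = "Details"  # hypothetical text of the column to disregard


def extract_cell_values(html, table_id, skip_text=SKIP_TEXT):
    """Return the cleaned text of every td in the table, minus the skipped column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    values = []
    for cell in table.find_all("td"):
        # Collapse runs of whitespace and newlines into single spaces.
        text = re.sub(r"\s+", " ", cell.get_text()).strip()
        if not text or text == skip_text:
            continue
        values.append(text)
    return values


def scrape_all(urls, fetch, output_file, table_id, delay_seconds=1.0):
    """Visit each URL, write one extracted value per line, return the line count."""
    total = 0
    for url in urls:
        values = extract_cell_values(fetch(url), table_id)
        for value in values:
            output_file.write(value + "\n")
        total += len(values)
        time.sleep(delay_seconds)  # be polite: pause between requests
    return total
```

A run would pass in the list of collected URLs, a fetch function that downloads and decodes each page, and the open output file, then close the file and print how many lines were written.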

This was a quick introduction to web scraping using Python’s BeautifulSoup library. There is a lot more to the library itself, but even these few lines of code helped me get almost 250 thousand lines of meaningful data in a few minutes. Web scraping is a powerful method for extracting data from the web, and it is quite useful in a lot of fields. There are other useful libraries besides BeautifulSoup too. I encourage you to check them out.

You can get the full source code of this basic POC here: https://github.com/ismailgok33/UniversityInfo-WebScrapper. I hope this has been very helpful. Stay healthy.
