Python Data Skills 4: Fetching HTML Table Data
Fetching tabular data from a webpage, and discussing the problems (and their solutions) you'll run into along the way.
No reason for a long introduction. We’re continuing from the previous post, where we used a web API to get some data. This time, we’re fetching data directly from an HTML page.
Table of Contents:
The Code
HTML Parsing Error
Robot Checks
Changed Tabular Structure
Encoding Errors
1 - The Code
I created some sample code to save you time. Alter it to your needs and it will fetch tabular data from a webpage and convert it into a pandas DataFrame.
Here are the assumptions made for this code:
Web URL: www.my_api.com/abcd
Column Headers: “col1”, “col2”, “col3”
Table location: The table sits immediately after a paragraph (<p>) tag containing the text “you can see the table below”
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_data(url, header_text, column_headers):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the target table. It's right after the <p> tag.
    target_p = soup.find('p', text=header_text)
    table = target_p.find_next_sibling('table')

    data = []
    # Find all rows in the table body and loop through them
    for row in table.find('tbody').find_all('tr'):
        cols = row.find_all('td')
        cols = [col.text.strip() for col in cols]
        data.append(dict(zip(column_headers, cols)))

    df = pd.DataFrame(data)
    return df
url = "http://www.my_api.com/abcd"  # swap this to your url
header_text = "you can see the table below"  # swap this to your text
column_headers = ["col1", "col2", "col3"]  # swap this to your headers
df = scrape_data(url, header_text, column_headers)
print(df)
As usual, nothing in life is ever as simple as this.
So, let’s talk about the list of things that can go wrong, and how to solve them.
2 - HTML Parsing Error
2.1 Finding the target <p> tag
The BeautifulSoup library may not find the expected tags. This happens whenever the structure of the website doesn't match what the code above expects.
Here is an example:
Let’s say we look for a specific paragraph (<p> tag) containing the text "you can see the table below". If the structure of the webpage changes so that the text now sits inside a <div> tag instead, BeautifulSoup's find('p', text=header_text) won't be able to locate the correct tag. The same thing happens if another HTML element changes, or if the text itself changes.
As a result, the variable target_p would be None.
This would then result in an error, because the script calls find_next_sibling on None. To solve this, you’ll have to re-open the web page via inspect element, and re-configure the script.
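One defensive option is to check for None before going any further, so the script fails with a clear message instead of a cryptic AttributeError. Here's a minimal sketch that slots into scrape_data; the fallback search across all tags is an assumption about how your page might change, not something the original code does:

target_p = soup.find('p', text=header_text)
if target_p is None:
    # The marker text may have moved into another tag, e.g. a <div>.
    # Fall back to searching for the text anywhere on the page.
    marker = soup.find(string=header_text)
    target_p = marker.parent if marker is not None else None
if target_p is None:
    raise ValueError(f"Could not find any tag containing: {header_text!r}")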
2.2 Locating The Table & Rows
After locating the target paragraph, the script expects the very next sibling to be a <table> tag. If there's no such <table> tag in the actual HTML, find_next_sibling returns None, and the script errors out when it tries to locate the rows within the table body.
No easy fix for this; you’ll have to investigate the page/website and see what happened. The most likely scenario is that the table got moved to a different web page.
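If you want the script to at least tell you what went wrong, you can guard both lookups. A minimal sketch (the fallback to the table itself covers pages that omit the <tbody> tag, which is an assumption worth verifying on your page):

table = target_p.find_next_sibling('table')
if table is None:
    raise ValueError("No <table> found after the target paragraph - "
                     "it may have moved to a different page.")

# Some pages have no explicit <tbody>; fall back to the table itself.
body = table.find('tbody') or table
rows = body.find_all('tr')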
3 - Robot Checks
3.1 Robot Checks
Many websites have measures to prevent or restrict automated data scraping. For instance, some sites use CAPTCHAs. In case you forgot, CAPTCHAs are those annoying puzzles you sometimes have to deal with when you try to log in.
Unfortunately, you can't fix this with basic requests in Python. Websites may also limit the number of requests from a single IP address within a certain period. This is to prevent bots from accessing too much data. If your script stops working or fails to retrieve data, it might be due to hitting such limitations. In these cases, you may need to reduce the frequency of your requests.
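A simple way to reduce request frequency is to sleep between attempts and back off when the server pushes back. Here's a sketch using only requests and the standard library; the status codes and delays are assumptions you should tune for the site you're scraping:

import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=2):
    """Retry a GET request, waiting longer after each rate-limit response."""
    for attempt in range(max_retries):
        response = requests.get(url)
        # 429 (Too Many Requests) and 503 often indicate rate limiting.
        if response.status_code not in (429, 503):
            return response
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    response.raise_for_status()
    return response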
The most common way people handle these issues is by using Selenium. Selenium is an open-source tool that automates web browsers. It supports Python, and you can use it for various tasks on dynamic websites, such as filling out forms, clicking buttons, or navigating between pages.
The Selenium method also works pretty well for handling Anti-Scraping Measures.
3.2 Anti-Scraping Measures
Many modern websites use JavaScript to load data. This means the data is not present in the initial page load; the browser fetches it and adds it to the page after the fact. For example, a site will often display a loading spinner while data is being fetched, then populate a table on the page once the data is retrieved. requests.get() only fetches the initial HTML of the page. It doesn't execute any JavaScript, so any data loaded this way will be missing from the retrieved HTML.
You can watch the video below to get an idea of how Selenium can be used to address the waiting component.
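If you'd rather see it in code, here's a minimal sketch of the waiting pattern in Selenium. It assumes you have a Chrome driver installed and that the JavaScript-loaded data eventually appears as a <table> tag; adjust the locator for your page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.my_api.com/abcd")  # swap this to your url

# Wait up to 10 seconds for the JavaScript-loaded table to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)

# Hand the fully rendered HTML to BeautifulSoup, same as before.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()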
4 - Changed Tabular Structure
Web pages are dynamic and can evolve over time, leading to changes in the HTML structure. When we set up an automated process for scraping web data, we assume that the table structure is consistent and doesn’t change.
Now, let’s say the owners add a new column "col4", remove "col2", or change the column order. Such changes will produce errors or incorrect data during scraping, because the script keeps assigning values to "col1", "col2", "col3" in the same order. The error could manifest as a ValueError when creating the dictionary or DataFrame, or the data could simply end up under the wrong labels.
To handle this problem, take a quick look at the new tabular structure and update the script, or add a sanity check with some try/except conditions, as in the sketch below.
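One lightweight safeguard is to read the headers straight from the table's <th> cells and compare them against what you expect before building the DataFrame. A sketch that slots into scrape_data, assuming the table actually has a header row of <th> cells:

# Read the actual column headers from the table's <th> cells.
found_headers = [th.text.strip() for th in table.find_all('th')]

if found_headers and found_headers != column_headers:
    raise ValueError(
        f"Table structure changed: expected {column_headers}, "
        f"found {found_headers}. Update the script before scraping."
    )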
5 - Encoding Errors
5.1 - Encoding Mismatch
Websites don't all use the same character encoding. The most common one today is UTF-8, which can represent essentially any character, but websites across various regions or languages might still use other encodings.
For example, a website from Russia might use a Cyrillic encoding such as ISO-8859-5, and a Chinese website might use GB2312. If your script decodes the page's bytes as the default UTF-8 when the page was actually encoded in one of these character sets, the byte sequences won't be valid UTF-8, and parsing fails.
5.2 - Decoding Error
A UnicodeDecodeError occurs when Python encounters bytes it can't decode with the assumed encoding. We get this error when trying to decode content that is not in the default (UTF-8) encoding. To solve the problem, find the correct character encoding of the website, often declared in the Content-Type header or a <meta charset> tag, and use it explicitly.
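With requests, you can either set the encoding explicitly before reading response.text, or let requests guess it from the content. A minimal sketch; ISO-8859-5 is just the example from above, so use whatever charset the page actually declares:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.my_api.com/abcd")  # swap this to your url

# Option 1: set the encoding explicitly when you know it.
response.encoding = 'ISO-8859-5'  # e.g. a Cyrillic-encoded page

# Option 2: let requests guess the encoding from the response body.
response.encoding = response.apparent_encoding

# response.text is now decoded with the encoding set above.
soup = BeautifulSoup(response.text, 'html.parser')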