Python Data Skills 2: Fetching Excel Data
How to deal with: Formatting & Multi-Indexed Issues, Parser Errors, Memory Errors, SharePoint Headache
We’ve already covered how to load up a CSV as a pandas Dataframe, and some of the most common issues you’ll have to deal with. This time, we’ll focus on another way to load up some data into Python (an Excel file).
Excel files end up being a bit more annoying than csvs because:
If the file is in use by someone else => get an error
If it’s not shared on SharePoint the right way => get an error
If it has rows merged and pandas can’t figure it out => get an error
These aren’t all the issues. So, let’s talk about how to load up (and handle) some of the most common errors when fetching an Excel file in Python.
Table of Contents:
File Already Opened
Formatting & Multi-Indexed Issues
ParserError
MemoryError
SharePoint Problems
To load up an excel file in pandas, just use pandas.read_excel()
1 - File Already Opened
When dealing with an Excel file that is also accessed by many users, issues can show up. IOError can occur when you attempt to load an Excel file into a pandas DataFrame. This can happen if the file is already open in another program. This is due to Excel's file locking mechanism. It's purpose is to prevent concurrent modifications. This can pose obstacles to read access in other applications.
Make sure the file is not open elsewhere when trying to read it in pandas.
Working on an Excel file that everyone uses is like like eating a Kinder Surprise. You try to load the file in your daily script, and boom “Surprise Error”. To solve this, just use MS Teams, and reach out to people until you find the person who still has the file opened up.
2 - Formatting & Multi-Indexed Issues
2.1 Formatting Issues
Excel is notorious for its wide-ranging formatting options. Features like merged cells or hidden rows/columns are commonplace in Excel files. This can cause significant roadblocks when attempting to read these files using pandas. Merged cells can lead to data disappearance or misunderstood. Hidden rows or columns could be hard to read.
For Machine Learning Engineers, this could mean having incorrect data sets. This can affect how well the machine can learn and create predictions.
2.2 Multi-Indexed Issues
Excel files can often have multi-index structures, that is, many header rows. In such scenarios, parsing the file using pandas the right way can become complex. If pandas doesn't interpret the heading the right way, it can create a DataFrame that does not work. This can cause problems later on when working with the data.
For us, this means NAs all over the place, and columns which are misaligned. A way to handle this is to first pick out which row you will start with. Then, you’ll want to use a for loop to iterate over the rows/columns of our dataframe.
A code snippet to help you do this is below:
#df = my dataframe
for index, row in df.iterrows():
#section below prints every value per column, on each row
print(row['column'])
3 - ParserError
Parsing is a critical step when reading data from Excel files into pandas DataFrames. It involves interpreting the file structure, including rows, columns, headers, and indexes. If the Excel file has any structural anomalies that the parser cannot interpret or handle, pandas will throw a ParserError.
Here’s a small brief list of reasons why you can get a ParserError:
Inconsistent row lengths: Varying lengths can cause confusion. It will result in a ParserError.
Unexpected data types: Types that are unable to convert into consistent type. An example is mixing text and numeric values in a column, it can lead to a ParserError.
Malformatted cells: Cells in the Excel file with unusual formatting. Parsing issues and a ParserError can occur with merged cells and empty cells. Also cells within a range, or cells with formulas.
For us, this error tells us that the raw data is not in a consumable format for pandas to create a DataFrame. These issues need to be rectified directly in the file or through pre-processing steps before the data can be used for model training and analysis. Handling parser errors will require opening up the excel file and manually tweaking the cells within the worksheet.
4 - MemoryError
Memory consumption is a critical factor to track while dealing with large datasets. An example of this is an Excel files in a machine learning workflow. High memory usage can lead to performance degradation. It can also lead to system crashes due to a MemoryError. To mitigate this, it is advisable to use strategies like chunking.
The process of reading a file in smaller portions is calling Chuck. This is to reduce the instantaneous memory load. You can also use libraries like Dask. It can be beneficial as they allow for parallel and distributed computing. Dask DataFrames can mimic the Pandas DataFrame interface. It can also provide better efficiency for larger-than-memory computations. Balancing memory usage is essential. It is necessary for maintaining a smooth and stable machine learning pipeline.
You can also read the solutions I provided on my previous post for handling memory errors.
5 - SharePoint Problems
Note: Feel Free to copy my code below. It works, and is easy to tweak. It’ll take you like 5 mins max to tweak it. Once you’re done that, feel free to work on your biz, and tell your boss it took like a full day to figure it out.
5.1 - Your Personal Credentials
To get Python to connect to SharePoint, navigate to a directory, and read an Excel file. Use the office365 library to authenticate and establish a connection to SharePoint. Reading the excel file is now possible using pandas, with the help of the openpyxl engine for .xlsx files.
Here's the code to do so:
import pandas as pd
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
REQUEST_URL = 'Put Your SharePoint Homepage Here'
RELATIVE_URL = 'Put Your Directory Here'
Username = 'Put Your Username Here'
Password = 'Put Your Password Here'
ctx_auth = AuthenticationContext(__get_url())
ctx_auth.acquire_token_for_user(Username, Password)
ctx = ClientContext(__get_url(), ctx_auth)
LibraryRoot = ctx.web.get_folder_by_server_relative_url(RELATIVE_URL)
files = LibraryRoot.files
ctx.load(files)
ctx.execute_query()
for myfile in files:
file_name = myfile.properties['Name']
print(file_name)
if ('Put The File You want to read here.xlsx' in file_name):
my_url = myfile.properties['ServerRelativeUrl']
response = File.open_binary(ctx, my_url)
bytes_file_obj = io.BytesIO()
bytes_file_obj.write(response.content)
bytes_file_obj.seek(0)
df = pd.read_excel(bytes_file_obj, sheet_name='Put Your Spreadsheet Name Here', header=0)
Some potential problems that can occur, especially related to the User account are:
Invalid Credentials: This is the most common issue. It occurs if the username or password is incorrect.
File Not Found: The specified Excel file may not exist in the SharePoint directory, or hidden.
SharePoint URL Problems: There might be issues with the SharePoint URL. An example is incorrect formatting or the server not being accessible.
5.2 - Client Credentials
If your code is going into production, you don’t want to use your own personal credentials. Instead, you will want to use Client (App) Credentials. To make the swap, here’s what the new code looks like:
import pandas as pd
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
REQUEST_URL = 'Put Your SharePoint Homepage Here'
RELATIVE_URL = 'Put Your Directory Here'
ID = 'Put Your Client ID Here'
SECRET = 'Put Your Client Secret Here'
ctx_auth = AuthenticationContext(__get_url())
ctx_auth.acquire_token_for_app(ID, SECRET)
### The rest is just the same as from the code above ###
Now, let’s talk about the potential issues you can encounter:
App Permission Issues: Setting up in SharePoint. There might be an issue with the app's necessary permissions set up in SharePoint. For instance, it might lack the required permissions. to read or write to the specific SharePoint directory.
Invalid Redirect URIs: Failures during the OAuth2. The redirect URIs for the app might not be set up the right way.
Misconfigured App in Azure AD: Authentication might fail. If the app registration in Azure AD is not configured the right way. This could be due to improper API permissions or missing necessary setup details.
Throttling Policies: SharePoint Online has throttling policies. These polices could affect the behavior of your application. An example would be if it makes too many requests in a short period.
Unfortunately, to solve all these issues, you’ll either contact the Data Engineer, or your IT team.