SQL 11 - Loading the Data

Why we use Python/R to pull data, Establishing a connection, Executing a SQL query, Closing the Connection, and Don't Repeat Yourself

May 01, 2023

This post is focused on loading up SQL data from your SQL database into a scripting language such as R/Python. If you’ve had even 1 data role, you pretty much know 90% of the material here already. Only thing I would do is give the SSH section a brief gloss-over (2.1), and the comment on Environment Variables.

80% of all data roles will use the content in this post when it comes to loading data from a SQL server. However, if you are working in a company, where the data team is using the latest & the best tech they can. Then, you will not be using the contents of this post.

The next post: SQL 12 - State of the Art for loading data will cover django, and using that to build dynamic SQL queries for you.

Why we use Python/R to pull data
Establishing connection
Executing a SQL query
Closing the connection
Don’t Repeat Yourself

1 - Why we use Python/R to pull data

If you’ve used a SQL server manager (MS SQL server management studio for MS-SQL), then you realize it’s easy to use it to explore tables, and craft a SQL query. One of things you’ll probably be wondering is why use R/Python to run a SQL query, when instead you can run the query in the server manager, then export as csv.

Note: This question is slowly rising in popularity for interviews for MLE roles

To answer this question, here is a small list of reasons:

Pre-processing: If you had some sort of a data collector, that would collect data for you, then you’ll know that you have stored in raw data with outliers, missing data points, and typo errors all over the place. It’s much better to clean them up using R/Python, instead of SQL
Model training: You can’t build a ML model in SQL, gotta do it in R/Python
Deployment: You’ll want to get your ML model to predict on new data, and then you’ll want to store this data into a SQL server.

2 - Establishing Connection

Most companies restrict randos from accessing their SQL servers, by restricting access to the public. To access most SQL servers today, there are 2 main routes:

This post is sponsored by Nord VPN, did you know that hackers… nah, I’m just messing with ya.

2.1 - VPN/SSH

VPN

A Virtual Private Network is a simple way to creates a secure connection between your computer, and a remote server over the internet. This is known as a tunnel. It allows data to be sent privately, this protects your connection with the corporate servers from being tampered with.

Here is a diagram to help illustrate.

It's Time to Get Rid of VPN! - Finchloom

If you want to learn more. Go learn from

BowTiedCyber

, VPNs are the domain of the cybersecurity expert, not the machine learning expert.

You’ll either be given a software (windows) from your IT department to connect to their VPN. Or, if you are on some sort of a linux terminal, then you’ll need to use sudo, then install one yourself. The most common one is wireguard:

BT_Raptor_terminal> sudo apt update
BT_Raptor_terminal> sudo apt install wireguard

We will speak more about working on a linux terminal in a different post.

All VPNs require a username/password to work.

SSH

A Secure SHell is a cryptographic way to connect to a remote server. To use this method, you need to generate a private key, and a public key. If you are on windows, you can use Putty to do so:

ssh - Generating key in PuTTYgen takes too long/never finishes - Super User

If you are on linux, you can use this command to do so:

BT_raptor_terminal> ssh-keygen -t rsa

Once you have done so, you’ll provide the server with your public key. They will then give you access, when you use your private key to login.

This method does not require the usage of a username/password.

2.2 Python/R

Once you actually can access your company’s servers. The next thing to do is to connect to the database. To connect to a SQL Database, you need to acquire all of this information:

Internet Protocol Address: a unique numerical address that points to the SQL Database
Port: The specific endpoint you are connecting to. An IP address can have multiple ports.
Username: Your username
Password: Your password

For this example, I’ll use the following dummy credentials:

IP address: 169.69.420.1
Port: 1117
Username: BowTied
Password: Raptor
Database: my_db

Python

There are a million different libraries that can be used to connect to a SQL database. Some are meant for psql, or mysql, or t-sql, etc… Pick whichever one you like. For this example, I’ll use the SQLAlchemy to connect to a mysql server.

from sqlalchemy import create_engine

connection = create_engine('mysql://BowTied:Raptor@169.69.420.1:1117/my_db')

R

Similar idea here, there are a million different libraries for doing this. So, we’ll use the DBI library.

require(DBI)

connection = dbConnect(RMySQL::MySQL(), 
                 dbname = "my_db", 
                 host = "169.69.420.1", 
                 port = 1117,
                 user = "BowTied",
                 password = "Raptor")

Note: In the real world, you should never ever do it like this. You will want to store all of the relevant credential information into your environment variables, and then pull them from there.

Create and Modify Environment Variables on Windows

Here is a code snippet in Python to pull the JAVA_HOME environment variable:

import os
print(os.environ['JAVA_HOME'])

3 - Executing a SQL query

3.1 List all tables

Once you are connected to a database, one of the first things you’ll want to know is the schema. Here is the command you can use in R, and in Python to list all of the tables:

Python:

Once the engine has successfully established a connection, one of the methods it will create for you is called .table_names(). All we need to do is print this

print(connection.table_names())

Use dbListTables to store all of the tables to all_tables. Then print them out.

all_tables = dbListTables(connection)
print(str(all_tables))

3.2 Fetch some data

The point of building the SQL connection is to pull some data. Here’s how, for R & Python

Python:

The most standard way of pulling data in Python is to use the pandas library. We’ll use the pd.read_sql_query()

import pandas as pd
my_data = pd.read_sql_query("SELECT * FROM my_table", connection)

When calling the pd.read_sql(), pandas wants to know which SQL connection to use.

R:

In R, we’ll use the dbGetQuery() function from our DBI package we already imported

my_data = dbGetQuery(connection, "SELECT * FROM my_table")

Voila, now you know how to pull some data from a SQL server into your R/Python instance. Now, you can call as many queries as you like, to your heart’s content. You can even insert a data.frame/data.table as rows into a SQL table.

Once, you are happy with the data you pulled, the next thing you’ll want to do is close the connection with the SQL server.

4 - Closing Connection

Once you’ve grabbed the data you wanted, the next thing to do is to close the SQL connection. Here’s a few reasons why (this shows up occasionally during interviews):

Resource Management: It costs a SQL server some computation power to maintain the connection with your instance.
Connection Limits: Most SQL servers have a connection limit.
Consistency: Any SQL transactions that are still work in progress are now either committed, or rolled back.

Python:

We will close our SQLAlchemy engine by using the .dispose()

connection.dispose()

R:

We will close our DBI connection by using dbDisconnect()

dbDisconnect(connection)

5 - Don’t Repeat Yourself

If you’ve read my post on the SOLID principles of programming, then you already have a basic idea on what ‘clean code’ looks like. One of the principles that all data professionals should follow is a principle called DRY. DRY stands for Don’t Repeat Yourself.

If you have followed steps 2 to 4 from above, then you’ve built a simple way to pull some data. Now, let’s say you wanted to re-pull the data. There is absolutely no reason whatsoever for you to copy paste the above section, and then re-run that block. Instead, if you are going to be re-using a chunk of code, you should turn that chunk into a dynamic function you can call upon.

So, now let’s put everything together

Putting Everything Together

I will build a generic function which will pull the first 200 rows, from columns: raptor, cyber, and devil. From a table called IT_dudes

Python:

from sqlalchemy import create_engine
import pandas as pd

def fetch_IT_Dudes():
  connection = create_engine('mysql://BowTied:Raptor@169.69.420.1:1117/my_db')
  my_data = pd.read_sql_query("SELECT raptor, cyber, devil FROM IT_dudes LIMIT 200;", connection)
  connection.dispose()

R:

require(DBI)
require(data.table)

fetch_IT_Dudes = function(){
     connection = dbConnect(RMySQL::MySQL(), 
                 dbname = "my_db", 
                 host = "169.69.420.1", 
                 port = 1117,
                 user = "BowTied",
                 password = "Raptor")
     my_data = as.data.table(dbGetQuery(connection, "SELECT raptor, cyber, devil FROM IT_dudes LIMIT 200;"))
     dbDisconnect(connection)

}

Now, instead of repeating the same line of code over and over again, we just call this function we built.

Also, keep in mind, we can make our function even more dynamic by using substrings, and string concatenation.

For example, I can use f-strings (Python) to replace certain components of my SQL query to make it more dynamic. The R version of f-strings is called sprintf.

Data Science & Machine Learning 101

Discussion about this post

Data Science & Machine Learning 101

SQL 11 - Loading the Data

Why we use Python/R to pull data, Establishing a connection, Executing a SQL query, Closing the Connection, and Don't Repeat Yourself

Table of Contents

1 - Why we use Python/R to pull data

2 - Establishing Connection

2.1 - VPN/SSH

VPN

SSH

2.2 Python/R

Python

R

3 - Executing a SQL query

3.1 List all tables

3.2 Fetch some data

Python:

R:

4 - Closing Connection

Python:

R:

5 - Don’t Repeat Yourself

Putting Everything Together

Python:

R:

Discussion about this post