

Discover more from Data Science & Machine Learning 101
Cybersecurity Meets Data Science
Prepping You For Your Cybersecurity Data Science Interview. Been a long time coming....
This one’s a long read, grab a coffee.
Note: If you have an interview coming up for a cybersecurity data analytics/science role, this post gives you a strong foundation on the tech they use and what they expect from you.
Credit for this post goes to:
- BowTiedCyber
- Hal_jpeg
- BowTied_Raptor
This post is a deeper dive into this Twitter thread:


You can click here to see the previous Data Science Meets Cybersecurity post.
Table of Contents:
Two Factor Authentication
Network Behavior Monitoring
Email Monitoring (NLP)
1 - Two Factor Authentication
1.1 The Technology
When you log in to your work computer, companies need a safety precaution to make sure it is really you, and that your password hasn’t ended up in someone else’s hands. One of the ways they do this is with Two-Factor Authentication (2FA).
The most common way of doing 2FA is linking your phone to your work credentials, then sending a “push” notification when you log in to your work computer to confirm it is you. But some companies don’t consider that secure enough, so they prefer to use your biometric data. Here is some of the information they look at:
Voiceprints
Retinal Scans
Facial Image Recognition
We’ll focus on the Facial Image Recognition for the data science aspects.
1.2 The Data Science
Objective:
The goal for the Machine Learning Engineer is to build a model that can detect whether an image is fake. This model will be used for the 2FA, so that when it sees an image, it can confirm it shows the actual person and not a photoshopped image.
One thing our model will need to pay attention to is the background. When you scan your face for the 2FA, odds are you have some sort of boring wall background, or a window.
In other words, this is an image classification problem: classify whether the picture is legitimate enough to be granted access, or reject it.
Data Collection:
Our model will need pictures of the different employees at our company. This will give the model a pattern of what would be considered an acceptable picture. Generally speaking, most employee image data is stored on some sort of cloud-based server, so we’ll have to access that server and pull a bunch of image data.
Once we have access to the images, we’ll tell our model to focus on 3 components:
The specific human we want to verify for the 2FA
Acceptable backgrounds
Oddities in the image
More information on the modelling aspect below.
Data Modelling:
Most image analysis is done using a Convolutional Neural Network (CNN), so we know we’ll use a CNN to tackle this problem. Specifically, we will deploy what’s called Context-Aware Artificial Intelligence (CAI).
A CAI is a neural network that analyzes our subject (the human) and the relationship it has with its environment. Our CNN will need to study the background of the image, then assess the relationship between that background and our target subject’s face.
The algorithm will need to analyze the background and figure out whether it is an appropriate one. For example, when you sign in to work, do you do it at a swimming pool, or at home? It will also need to spot any potential tampering with the image (such as weird lighting), and study the face to confirm it matches the target employee. An example of how a CNN sees data is below:
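To make “how a CNN sees data” concrete, here is a minimal sketch of the convolution operation at the heart of every CNN, in plain Python. The hand-made edge-detector kernel is an illustrative assumption; a real model (in PyTorch or TensorFlow) stacks many such filters and learns their weights from data.

```python
def conv2d(image, kernel):
    """Slide a kernel over a 2D image and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # element-wise multiply the kernel with the image patch, then sum
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map

# a tiny "image": dark pixels on the left, bright pixels on the right
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# a hand-made vertical-edge detector; a CNN learns kernels like this from data
edge_kernel = [[-1, 1]]
print(conv2d(image, edge_kernel))  # strong response only where dark meets bright
```

The feature map lights up exactly at the dark-to-bright boundary, which is how early CNN layers pick out edges before deeper layers combine them into faces and backgrounds.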
1.3 Commentary from Cyber
2FA is defined as being 2 factors of authentication from different domains:
Something you know
Something you are
Something you do
Somewhere you are
Something you have
Having 2 passwords would not be 2FA because they’re both something you know. Having a password and a fingerprint would be. We’re always battling with the balance between security and functionality.
However, we can have more of both as we introduce cost.
In this context, the more cost we introduce, the more security & functionality we introduce.
Let’s give an example:
Google Authenticator is free.
It makes your life MORE SECURE.
But it is also FAR LESS FUNCTIONAL than a password alone.
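For the curious, Google Authenticator implements TOTP (RFC 6238): a one-time code derived from a shared secret and the current time window. A minimal sketch using only Python’s standard library:

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, timestamp=None, digits: int = 6, step: int = 30) -> str:
    """Generate a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if timestamp is None:
        timestamp = int(time.time())
    counter = timestamp // step  # which 30-second window we are in
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: this secret at t=59 yields "287082"
print(totp(b"12345678901234567890", timestamp=59))
```

Both your phone and the server run this same computation, so a code typed within the 30-second window proves you hold the secret without ever sending the secret itself.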
But let’s say we were logging into a system that uses a password managed by a password manager, plus a fingerprint. In this case, we’ve actually done less work than typing a password: a click, click, tap to autofill the password, plus a fingerprint.
But we’ve also added both computational and real cost into the equation. Creating programs and models that perform these functions is not cheap, but they are essential. This is why Okta is a $13 billion company.
2 - Network Behavior Monitoring
The Technology
One of the tasks you will work on is Network Behavior Monitoring (NBM). The goal of NBM is to observe what's happening in the network. Some of your tasks will be to:
Detect Abnormal Behaviors - Spot unusual spikes in traffic, or malicious injections.
Determine Malicious Intent - Analyze port scans, irregular destination addresses, and requests for unintended services.
Monitor Devices & Users - Analyze those who are using devices and servers differently than they are supposed to.
Having a small team of cybersecurity analysts perform NBM while keeping up with their daily tasks is stressful. Machine Learning helps NBM by performing anomaly detection: the model studies all the data currently available on the network in real time.
The image below shows you what the cyber team sees when they monitor the network:
The Data Science
Objective:
For a problem like this, we will want to build an anomaly detection ML pipeline. This pipeline will study all the data that the NBM tools report. The goal of this model is to study the data as much as possible to figure out what the “normal” behaviors are.
When an event happens that the model considers extraordinary, it will immediately flag it as a potential cybersecurity attack and alert the team.
Data Collection:
The data will be collected on a real-time basis: every monitoring activity writes to a log file, which is also sent to the ML model.
Here is the sort of data you’ll look at:
Network Traffic Logs
IP Addresses of Connected Devices
Systems Configurations
Data Requests
Connection Attempts, and Access
Once all the data is compiled into a structured dataset, it is sent for modelling.
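Turning raw log lines into structured records is the unglamorous first step. Here is a sketch using only the standard library; the log format below is hypothetical, purely for illustration (real NBM tools like Zeek or Suricata each have their own schemas):

```python
import re

# Hypothetical connection-log format, for illustration only:
# "<date> <time> <src_ip>:<src_port> -> <dst_ip>:<dst_port> <action>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) "
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+) -> "
    r"(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+) "
    r"(?P<action>\w+)"
)

def parse_log_line(line: str):
    """Turn one raw log line into a structured record (dict), or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_log_line("2024-01-05 10:32:01 10.0.0.5:51234 -> 93.184.216.34:443 ALLOW")
print(record)
```

Each parsed record becomes one row of the structured dataset the model trains on; lines that fail to parse get routed to a “needs attention” bucket rather than silently dropped.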
Data Modelling:
The goal is to figure out what is considered normal behavior for this specific network. Next, we will want to flag anomalies: the further an event is from this expected normal behavior, the stronger the flag.
Strictly speaking, this is an anomaly detection problem, and clustering algorithms handle it well. The goal will be to build a cluster around what is considered normal behavior. Then, the further a data point falls from this cluster, the more likely it is that we are dealing with an anomaly. In other words, the network is under attack or getting compromised.
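The “distance from the normal cluster” idea can be sketched in plain Python. This is a toy centroid-based detector: the features (packets/min, failed logins) and the 3x-baseline threshold are illustrative assumptions, and a production pipeline would use richer features and a proper algorithm (isolation forests, density-based clustering, etc.):

```python
import math

def centroid(points):
    """Mean of a set of feature vectors, e.g. (packets/min, failed logins)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_anomalies(baseline, new_points, multiplier=3.0):
    """Flag new points far outside the spread of the 'normal' cluster."""
    center = centroid(baseline)
    # threshold: a multiple of the worst-case distance seen during normal operation
    threshold = multiplier * max(distance(p, center) for p in baseline)
    return [i for i, p in enumerate(new_points) if distance(p, center) > threshold]

# baseline: (packets/min, failed logins) under normal conditions
normal = [(100, 1), (103, 0), (98, 2), (101, 1)]
incoming = [(99, 1), (500, 40)]  # second point: traffic spike + login brute force
print(flag_anomalies(normal, incoming))  # indexes of anomalous points
```

The flagged indexes would be what triggers the alert to the cyber team in the pipeline described above.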
Putting it all together, you can get a quick glimpse of what the data pipeline would look like from below:
Commentary from Cyber
Oh Network Monitoring. My first love.
Back in the day, I built a $10M+ SaaS by evaluating Suricata alerts with no ML whatsoever. Those who created models that could do that efficiently were killing the game.
Cybersecurity is an industry of data. We have more data than we can handle.
Things that give us a sense of what’s good and bad are critical. If you can detect anomalies in network detection with models and do it well, you can have all the money.
3 - Email Monitoring (NLP)
The Technology
Imagine for a minute you are an analyst working for a hedge fund, and you get an email that looks like it came from your Portfolio Manager. The email says that he’s a bit busy, and asks whether you could accept the Zoom invite and send the predictions over. If you’re not good at spotting phishing and actually sent the predictions, you’d be rekt. The situation presented is real, not made up:
Feel free to read the article here.
Email monitoring is the process of inspecting incoming email for threats. The standard approach is to manually inspect each email, or look for a few keywords. If even 1 or 2 emails are missed, this can create some serious headaches.
With ML, cybersecurity teams can automate the tedious process of monitoring emails. The ML algorithms focused on NLP can read through hundreds of emails within a few seconds. And, they are generally better at spotting phishing emails than their human counterparts.
The Data Science
Objective:
We need an ML algorithm that can read an email (an HTML document), then figure out whether it is a legit email that should be delivered to the end user, or discarded. In other words, we are dealing with an NLP problem.
Data Collection:
We will want to start off by training our model on a dataset of emails that have already been classified as:
0: Legit Email
1: Phishing Email
Most cyber firms store all their emails in a SQL database, so you’ll need to learn SQL to pull data from there. Some firms also like to store the text data in JSON format, so you’ll need to know how to work with JSON data as well.
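Here is what pulling labelled training data from SQL with JSON-encoded bodies can look like, using only the standard library. The table name, schema, and sample rows are hypothetical, and an in-memory SQLite database stands in for the firm’s real server:

```python
import json
import sqlite3

# In-memory stand-in for the firm's email database (schema is hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT, label INTEGER)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        (1, json.dumps({"subject": "Quarterly report", "text": "Draft attached."}), 0),
        (2, json.dumps({"subject": "Urgent: verify account", "text": "Click now!"}), 1),
    ],
)

# Pull labelled training data: parse each JSON body into a Python dict
training_data = [
    (json.loads(body), label)
    for _, body, label in conn.execute("SELECT id, body, label FROM emails")
]
print(training_data[1][0]["subject"], "->", training_data[1][1])
```

The output is exactly the (features, label) pairs the neural network below trains on, with 0 = legit and 1 = phishing as defined above.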
More information on what real data science looks like.
We will build a Neural Network that will analyze the following:
The actual text
The links themselves
Here are some of the ways you can clean your text data:
Convert everything to lowercase
Remove encoded images
Remove excess whitespace (.lstrip(), .rstrip(), or .strip() in Python)
Replace hyperlinks (if they exist) with the domain
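The cleaning steps above can be sketched in a few lines of standard-library Python. The regexes here are deliberately simplified assumptions; production email parsing would use a real HTML parser rather than regex alone:

```python
import re
from urllib.parse import urlparse

def clean_email_text(text: str) -> str:
    """Apply the cleaning steps above to one email body."""
    text = text.lower()                          # everything to lowercase
    text = re.sub(r"<img[^>]*>", " ", text)      # drop encoded/inline images
    text = re.sub(                               # replace hyperlinks with their domain
        r"https?://\S+", lambda m: urlparse(m.group()).netloc, text)
    text = re.sub(r"\s+", " ", text).strip()     # collapse excess whitespace
    return text

print(clean_email_text('URGENT <img src="x.png"> verify at https://evil.example.com/login  now'))
```

Replacing each link with its bare domain is what lets the model (and Layer 1 below a trust-score lookup) reason about where a link actually points rather than the noisy full URL.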
You will also need to convert your text data into numerical data. You can achieve this by using word2vec.
Video below:
Data Modelling:
We will want to split our model into 3 layers:
Layer 1: Assess the quality of the links
Layer 2: Assess the text in the email
Layer 3: Assess the output from Layer 1 & 2, then make the final prediction
Many websites have a trust score. Layer 1 will be responsible for mapping the domain of each link to its trust score, to assess the quality of the links. This information will then get passed on to the final layer.
Here is an example of a trust score:
Layer 2 will be a Long Short-Term Memory (LSTM) neural network. LSTMs have proven to be great at analyzing NLP data. The main concern we’ll have is the architecture of the neural network itself.
Here is a good playlist on getting started with NLP and LSTMs:
Layer 3 takes the outputs from Layers 1 and 2, and makes the finalized prediction on that structured data.
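A minimal sketch of what Layer 3 could look like: a logistic combination of the two upstream signals. The weights and bias here are made-up illustrative values; in a real system they would be learned from labelled data, not hand-set:

```python
import math

def final_prediction(link_trust: float, text_phish_prob: float) -> float:
    """Combine Layer 1 (domain trust score, 0-1) and Layer 2 (LSTM phishing
    probability, 0-1) into one phishing probability via a logistic function.
    Weights are illustrative only; in practice they are learned from data."""
    w_trust, w_text, bias = -4.0, 6.0, 0.5
    z = bias + w_trust * link_trust + w_text * text_phish_prob
    return 1 / (1 + math.exp(-z))

# trusted domain + benign-looking text -> low phishing probability
print(round(final_prediction(0.9, 0.1), 3))
# shady domain + phishy text -> high phishing probability
print(round(final_prediction(0.1, 0.95), 3))
```

The negative weight on trust and positive weight on the text signal encode the intuition: trusted links push the score down, phishy language pushes it up, and the logistic squashes the result into a probability you can threshold on.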
Commentary from Cyber
You may or may not know this, but email is one of the most convoluted and archaic systems we use in our everyday lives. The layers of complexity and all the different systems we need in place to make email function would make your head spin.
From SMTP to IMAP to SPF to DKIM to DMARC and what email servers actually do on the receiving end with these is enough to make you want to burn your computer and start a farm.
But, if you have an email platform that allows for evaluating and categorizing malicious emails, this could be a great implementation for ML.
Anything that reduces company risk has an opportunity for market share.