

Discover more from Data Science & Machine Learning 101
Cybersecurity Meets Data Science
Prepping You For Your Cybersecurity Data Science Interview. Been a long time coming....
This one’s a long read, grab a coffee.
Note: If you have an interview coming up for a cybersecurity data analytics/science role, this post gives you a strong foundation on the tech they use and what they expect from you.
Credit for this post goes to:
- BowTiedCyber
- Hal_jpeg
- BowTied_Raptor
This post is a deeper dive into this Twitter thread:


You can click here to see the previous Data Science Meets Cybersecurity post.
Table of Contents:
Two Factor Authentication
Network Behavior Monitoring
Email Monitoring (NLP)
1 - Two Factor Authentication
1.1 The Technology
When you log in to your work computer, companies need a safety precaution to make sure it is really you, and that your password hasn’t ended up in someone else’s hands. One of the ways they do this is with Two-Factor Authentication (2FA).
The most common way of doing 2FA is linking your phone to your work credentials, then sending a “push” notification when you log in to your work computer to confirm it is you. But some companies don’t consider that secure enough, so they prefer to use your biometric data. Here is some of the information they look at:
Voiceprints
Retinal Scans
Facial Image Recognition
We’ll focus on the Facial Image Recognition for the data science aspects.
1.2 The Data Science
Objective:
The goal for the Machine Learning Engineer is to build a model that can detect whether an image is fake. This model will be used for the 2FA, so that when it sees an image, it can confirm it shows the actual person and not a photoshopped image.
One thing our model will need to pay attention to is the background. When you scan your face for the 2FA, odds are you have some sort of boring wall background, or a window.
In other words, this is an image classification problem: classify whether the picture is legitimate enough to be granted access, or reject it.
Data Collection:
Our model will need pictures of the different employees at our company. This will give the model a pattern of what would be considered an acceptable picture. Generally speaking, most employee image data is stored on some sort of cloud-based server, so we’ll have to access that server and pull a bunch of image data.
Once we have access to the images, we’ll tell our model to focus on 3 components:
The specific human we want to verify for the 2FA
Acceptable backgrounds
Oddities in the image
More information on the modelling aspect below.
Data Modelling:
Most image analysis is done using a Convolutional Neural Network (CNN), so we know we’ll use a CNN to tackle this problem. Specifically, we will deploy what’s called Context-Aware Artificial Intelligence (CAI).
A CAI is a neural network that analyzes our subject (the human) and the relationship it has with its environment. Our CNN will need to study the background of the image, then assess the relationship between that background and our target subject’s face.
The algorithm will need to analyze the background and figure out whether it is an appropriate one. For example, when you sign in to work, do you do it at a swimming pool, or at home? It will also need to spot any potential tampering with the image (such as weird lighting), and study the face to confirm it matches the target employee. An example of how a CNN sees data is below:
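To make “how a CNN sees data” concrete, here is a minimal sketch of the convolution operation at the heart of every CNN, in plain Python. The hand-made edge-detector kernel is an illustrative assumption; a real model (in PyTorch or TensorFlow) stacks many such filters and learns their weights from data.

```python
def conv2d(image, kernel):
    """Slide a kernel over a 2D image and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # element-wise multiply the kernel with the image patch, then sum
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map

# a tiny "image": dark pixels on the left, bright pixels on the right
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# a hand-made vertical-edge detector; a CNN learns kernels like this from data
edge_kernel = [[-1, 1]]
print(conv2d(image, edge_kernel))  # strong response only where dark meets bright
```

The feature map lights up exactly at the dark-to-bright boundary, which is how early CNN layers pick out edges before deeper layers combine them into faces and backgrounds.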
1.3 Commentary from Cyber
2FA is defined as being 2 factors of authentication from different domains:
Something you know
Something you are
Something you do
Somewhere you are
Something you have
Having 2 passwords would not be 2FA because they’re both something you know. Having a password and a fingerprint would be. We’re always battling with the balance between security and functionality.
However, we can have more of both as we introduce cost.
In this context, the more cost we introduce, the more security & functionality we introduce.
Let’s give an example:
Google Authenticator is free.
It makes your life MORE SECURE.
But it is also FAR LESS FUNCTIONAL than a password alone.
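For the curious, Google Authenticator implements TOTP (RFC 6238): a one-time code derived from a shared secret and the current time window. A minimal sketch using only Python’s standard library:

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, timestamp=None, digits: int = 6, step: int = 30) -> str:
    """Generate a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if timestamp is None:
        timestamp = int(time.time())
    counter = timestamp // step  # which 30-second window we are in
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: this secret at t=59 yields "287082"
print(totp(b"12345678901234567890", timestamp=59))
```

Both your phone and the server run this same computation, so a code typed within the 30-second window proves you hold the secret without ever sending the secret itself.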
But let’s say we were logging into a system that uses a password managed by a password manager, plus a fingerprint. In this case, we’ve actually done less work than typing a password: a click, click, tap to autofill the password, plus a fingerprint.
But we’ve also added both computational and real cost into the equation. Creating programs and models that perform these functions is not cheap, but they are essential. This is why Okta is a $13 billion company.
2 - Network Behavior Monitoring
The Technology
One of the tasks you will work on is Network Behavior Monitoring (NBM). The goal of NBM is to observe what's happening in the network. Some of your tasks will be to:
Detect Abnormal Behaviors - Spot unusual spikes in traffic, or malicious injections.
Determine Malicious Intent - Analyze port scans, irregular destination addresses, and requests for unintended services.
Monitor Devices & Users - Analyze those who are using devices and servers differently than they are supposed to.
Having a small team of cybersecurity analysts perform NBM while keeping up with their daily tasks is stressful. Machine Learning helps NBM by performing anomaly detection: the model studies all the data currently available on the network in real time.
The image below shows you what the cyber team sees when they monitor the network:
The Data Science
Objective:
For a problem like this, we will want to build an anomaly detection ML pipeline. This pipeline will study all the data that the NBM tools report. The goal of this model is to study the data as much as possible to figure out what the “normal” behaviors are.
When an event happens that the model considers extraordinary, it will immediately flag it as a potential cybersecurity attack and alert the team.
Data Collection:
The data will be collected on a real-time basis: every monitoring activity writes to a log file, which is also sent to the ML model.
Here is the sort of data you’ll look at:
Network Traffic Logs
IP Addresses of Connected Devices
Systems Configurations
Data Requests
Connection Attempts, and Access
Once all the data is compiled into a structured dataset, it is sent for modelling.
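Turning raw log lines into structured records is the unglamorous first step. Here is a sketch using only the standard library; the log format below is hypothetical, purely for illustration (real NBM tools like Zeek or Suricata each have their own schemas):

```python
import re

# Hypothetical connection-log format, for illustration only:
# "<date> <time> <src_ip>:<src_port> -> <dst_ip>:<dst_port> <action>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) "
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+) -> "
    r"(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+) "
    r"(?P<action>\w+)"
)

def parse_log_line(line: str):
    """Turn one raw log line into a structured record (dict), or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_log_line("2024-01-05 10:32:01 10.0.0.5:51234 -> 93.184.216.34:443 ALLOW")
print(record)
```

Each parsed record becomes one row of the structured dataset the model trains on; lines that fail to parse get routed to a “needs attention” bucket rather than silently dropped.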
Data Modelling:
The goal is to figure out what is considered normal behavior for this specific network. Next, we will want to flag anomalies: the further an event is from this expected normal behavior, the stronger the flag.
Strictly speaking, this is an anomaly detection problem, and clustering algorithms handle it well. The goal will be to build a cluster around what is considered normal behavior. Then, the further a data point falls from this cluster, the more likely it is that we are dealing with an anomaly. In other words, the network is under attack or getting compromised.
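The “distance from the normal cluster” idea can be sketched in plain Python. This is a toy centroid-based detector: the features (packets/min, failed logins) and the 3x-baseline threshold are illustrative assumptions, and a production pipeline would use richer features and a proper algorithm (isolation forests, density-based clustering, etc.):

```python
import math

def centroid(points):
    """Mean of a set of feature vectors, e.g. (packets/min, failed logins)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_anomalies(baseline, new_points, multiplier=3.0):
    """Flag new points far outside the spread of the 'normal' cluster."""
    center = centroid(baseline)
    # threshold: a multiple of the worst-case distance seen during normal operation
    threshold = multiplier * max(distance(p, center) for p in baseline)
    return [i for i, p in enumerate(new_points) if distance(p, center) > threshold]

# baseline: (packets/min, failed logins) under normal conditions
normal = [(100, 1), (103, 0), (98, 2), (101, 1)]
incoming = [(99, 1), (500, 40)]  # second point: traffic spike + login brute force
print(flag_anomalies(normal, incoming))  # indexes of anomalous points
```

The flagged indexes would be what triggers the alert to the cyber team in the pipeline described above.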
Putting it all together, you can get a quick glimpse of what the data pipeline would look like from below:
Commentary from Cyber
Oh Network Monitoring. My first love.
Back in the day, I built a $10M+ SaaS by evaluating Suricata alerts with no ML whatsoever. Those who created models that could do that efficiently were killing the game.
Cybersecurity is an industry of data. We have more data than we can handle.
Things that give us a sense of what’s good and bad are critical. If you can detect anomalies in network detection with models and do it well, you can have all the money.
3 - Email Monitoring (NLP)
The Technology
Imagine for a minute you are an analyst working for a hedge fund, and you get an email that looks like it came from your Portfolio Manager. The email says that he’s a bit busy, and asks whether you could accept the Zoom invite and send the predictions over. If you’re not good at spotting phishing and actually sent the predictions, you’d be rekt. The situation presented is real, not made up:
Feel free to read the article here.
Email monitoring is the process of inspecting incoming email for threats. The standard approach is to manually inspect each email, or look for a few keywords. If even 1 or 2 emails are missed, this can create some serious headaches.
With ML, cybersecurity teams can automate the tedious process of monitoring emails. The ML algorithms focused on NLP can read through hundreds of emails within a few seconds. And, they are generally better at spotting phishing emails than their human counterparts.
The Data Science
Objective:
We need an ML algorithm that can read an email (an HTML document), then figure out whether it is a legit email that should be delivered to the end user, or discarded. In other words, we are dealing with an NLP problem.
Data Collection:
We will want to start off by training our model on a dataset of emails that have already been classified as:
0: Legit Email
1: Phishing Email
Most cyber firms store all their emails in a SQL database, so you’ll need to learn SQL to pull data from there. Some firms also like to store the text data in JSON format, so you’ll need to know how to work with JSON data as well.
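Here is what pulling labelled training data from SQL with JSON-encoded bodies can look like, using only the standard library. The table name, schema, and sample rows are hypothetical, and an in-memory SQLite database stands in for the firm’s real server:

```python
import json
import sqlite3

# In-memory stand-in for the firm's email database (schema is hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT, label INTEGER)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        (1, json.dumps({"subject": "Quarterly report", "text": "Draft attached."}), 0),
        (2, json.dumps({"subject": "Urgent: verify account", "text": "Click now!"}), 1),
    ],
)

# Pull labelled training data: parse each JSON body into a Python dict
training_data = [
    (json.loads(body), label)
    for _, body, label in conn.execute("SELECT id, body, label FROM emails")
]
print(training_data[1][0]["subject"], "->", training_data[1][1])
```

The output is exactly the (features, label) pairs the neural network below trains on, with 0 = legit and 1 = phishing as defined above.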
More information on what real data science looks like.
We will build a Neural Network that will analyze the following:
The actual text
The links themselves
Here are some of the ways you can clean your text data:
Convert everything to lowercase
Remove encoded images
Remove excess whitespace (.lstrip(), .rstrip(), or .strip() in Python)
Replace hyperlinks (if they exist) with the domain
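The cleaning steps above can be sketched in a few lines of standard-library Python. The regexes here are deliberately simplified assumptions; production email parsing would use a real HTML parser rather than regex alone:

```python
import re
from urllib.parse import urlparse

def clean_email_text(text: str) -> str:
    """Apply the cleaning steps above to one email body."""
    text = text.lower()                          # everything to lowercase
    text = re.sub(r"<img[^>]*>", " ", text)      # drop encoded/inline images
    text = re.sub(                               # replace hyperlinks with their domain
        r"https?://\S+", lambda m: urlparse(m.group()).netloc, text)
    text = re.sub(r"\s+", " ", text).strip()     # collapse excess whitespace
    return text

print(clean_email_text('URGENT <img src="x.png"> verify at https://evil.example.com/login  now'))
```

Replacing each link with its bare domain is what lets the model (and Layer 1 below a trust-score lookup) reason about where a link actually points rather than the noisy full URL.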
You will also need to convert your text data into numerical data. You can achieve this by using word2vec.
Video below:
Data Modelling:
We will want to split our model into 3 layers:
Layer 1: Assess the quality of the links
Layer 2: Assess the text in the email
Layer 3: Assess the output from Layer 1 & 2, then make the final prediction
Many websites have a trust score. Layer 1 will be responsible for mapping the domain of each link to its trust score, to assess the quality of the links. This information will then get passed on to the final layer.
Here is an example of a trust score:
Layer 2 will be a Long Short-Term Memory (LSTM) neural network. LSTMs have proven to be great at analyzing NLP data. The main concern we’ll have is the architecture of the neural network itself.
Here is a good playlist on getting started with NLP and LSTMs:
Layer 3 takes the outputs from Layers 1 and 2, and makes the finalized prediction on that structured data.
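A minimal sketch of what Layer 3 could look like: a logistic combination of the two upstream signals. The weights and bias here are made-up illustrative values; in a real system they would be learned from labelled data, not hand-set:

```python
import math

def final_prediction(link_trust: float, text_phish_prob: float) -> float:
    """Combine Layer 1 (domain trust score, 0-1) and Layer 2 (LSTM phishing
    probability, 0-1) into one phishing probability via a logistic function.
    Weights are illustrative only; in practice they are learned from data."""
    w_trust, w_text, bias = -4.0, 6.0, 0.5
    z = bias + w_trust * link_trust + w_text * text_phish_prob
    return 1 / (1 + math.exp(-z))

# trusted domain + benign-looking text -> low phishing probability
print(round(final_prediction(0.9, 0.1), 3))
# shady domain + phishy text -> high phishing probability
print(round(final_prediction(0.1, 0.95), 3))
```

The negative weight on trust and positive weight on the text signal encode the intuition: trusted links push the score down, phishy language pushes it up, and the logistic squashes the result into a probability you can threshold on.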
Commentary from Cyber
You may or may not know this, but email is one of the most convoluted and archaic systems we use in our everyday lives. The layers of complexity and all the different systems we need in place to make email function would make your head spin.
From SMTP to IMAP to SPF to DKIM to DMARC and what email servers actually do on the receiving end with these is enough to make you want to burn your computer and start a farm.
But, if you have an email platform that allows for evaluating and categorizing malicious emails, this could be a great implementation for ML.
Anything that reduces company risk has an opportunity for market share.