Before we start off with MS Azure, just want everyone to know you can find me here as well:
Twitter: https://twitter.com/BowTied_Raptor
I personally do not work with MS Azure. However, one of the Quant companies I used to work for recently transitioned to Azure, and they wanted their Quants (Data Scientists who work in Finance) to learn it and get up to speed.
If you're going to need to learn Azure soon, who better than @BowTiedAzure, with his Substack, to explain it? This post is meant more for the Data Engineers, and less so for the research/MLOps side.
Table of Contents
Cloud Scale Analytics
Planning & Strategy
A Holistic Approach
1 - Cloud Scale Analytics
If you don’t know what MS Azure is, you can watch this video:
If you read my twitter feed (@BowTiedAzure) or my Substack, then you will know that I proclaim you must use frameworks. There are tons of reasons frameworks are important, but the short of it is that frameworks:
Guide our decision-making process via understood best practices and standards
Make our architectures predictable and repeatable
Allow us to document each layer of an architecture
Produce outcomes that can be operated by system consumers
There is more to it, but you get the gist.
Cloud represents the pinnacle of where data-focused operations can take place. The as-a-service instant delivery of services and consumption-based pricing mean orgs can move fast (good for agile). Without a plan for adopting these services, though, they can become unwieldy, run over budget, and cost more than the value they produce.
Enter Cloud Scale Analytics
Cloud Scale Analytics is a Framework that covers both technical and non-technical considerations. Its purpose is to help govern a large analytics and data estate in the cloud. The goal of the framework is to:
Serve data as a product – Like software design life cycles
Provide an ecosystem of tools and services for use by consumers and not lock them in to a single technology choice
Drive governance and security through platform enforced standards and policies
Focus on outcomes instead of focusing on technology
Let’s dive into what this framework actually calls for in an Azure environment.
2 - Planning & Strategy
When an organization decides to use cloud scale analytics, I spend almost as much time in this phase of the implementation project as I do on the build. It is that important.
Any adoption project is going to need a significant part of the org’s time and effort in strategizing and planning their estate. When adopting a cloud analytics environment, the org MUST have an understanding of a few key components:
2.1 Where Are We On The Analytics Maturity Curve?
Microsoft publishes an interesting infographic to help an org figure out their analytics maturity level.
This image was captured from their official documentation:
The goal is to reach the galaxy-brain level on the far right of this chart. Understanding the starting point is important to figure out what organizational changes need to take place.
2.2 Do We Understand Our Data?
This is the part of planning that I take part in the least, but I sit in on meetings where the data engineers on my team groan a whole bunch about this topic. I know, from osmosis, that it's critical to understand your data sources and what you are looking to get from your data.
2.3 Is Our Team Ready?
There is 100% going to be organizational change that must happen to install a project of this scale and scope, and I have seen these projects fail at this point many times. We use organizational change management consultants to assist the organization. The oversimplified summary of this part of the strategy phase is: talk to your teams and herald the change that is coming. Without buy-in from the people who will be consuming the outcome, the project is doomed to fail before it begins.
Like I said, I did not want to spend a large effort on the strategy part for this article. But it would have been irresponsible not to include it. When actually undertaking a project of this nature, this phase would take months, if not a bit longer. The output of this phase should be a set of formalized strategy documentation, along with a firm mission statement.
On to some meatier subjects
3 - A Holistic Approach
Proper implementation of cloud scale analytics will influence the entire Azure estate. Before we start putting in tooling to support data teams, we first must back up to the fifty thousand foot level. Then, examine whether or not the cloud environment can even sustain what we want to do with the data services.
Let’s put this out there for anyone wanting to go right to the sweet chocolate center. Microsoft publishes a reference architecture for enterprise scale Azure environments. It’s RIGHT HERE ON GITHUB and it looks like this:
I’ll cut right to it: Yes, that is big and complicated. Yes, something of this scale takes a long time to build. And no, you will not deploy an enterprise class cloud analytics environment if you do not have something very close to this in place around it. The enterprise scale reference architecture is out there for a reason. Microsoft has been watching their customers use Azure for 13 years and running. They have had a lot of time to design their services around how customers are being most successful.
3.1 Identity & Access Management
Core to the implementation is rights assignments. Cloud scale analytics supports a role-based access control model to enforce the principle of least privilege. But the framework does not provide this on its own. We rely on the underlying platform to give us this feature.
There are some special considerations for Azure roles before implementation. Find them HERE. A well architected enterprise Azure estate is the first step in ensuring the environment is secured.
3.2 Network Topology & Connectivity
This is where you better have a good network team to support you. I know most people reading this article are smart and know how to read Microsoft documentation. So, I am not going to regurgitate their thoughts. Let me break down the key points from this part of the design and read between the lines a bit for you.
3.3 First key point
For security purposes, we use Private Link for any PaaS service that has it available, and that has downstream consequences. The first consequence is that you need to have your DNS strategy ON POINT. The very first problem, and the most frequent problem, with private endpoint connectivity in the cloud is DNS.
When we deploy PaaS services in their default form with public endpoints, Microsoft takes care of all the name resolution issues for us. As soon as we put these endpoints on to a private address space (RFC 1918) we become responsible for making sure the endpoint IP addresses can be resolved from given names. EVERYTHING in the analytics environment should be referenced by name.
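A quick way to sanity-check this from a machine inside the network is to resolve the service name and confirm the answer lands in RFC 1918 space. The sketch below is only an illustration using Python's standard library; the helper names are made up for this example and are not part of any Azure tooling.

```python
import ipaddress
import socket

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address is IPv4 and inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def resolves_privately(hostname: str) -> bool:
    """Resolve a name and report whether every IPv4 answer is private.

    Returns False when the name does not resolve at all.
    """
    try:
        answers = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return False
    addresses = {info[4][0] for info in answers}
    return bool(addresses) and all(is_rfc1918(a) for a in addresses)


# Example call (hypothetical storage account name, run from inside the network):
# resolves_privately("mydatalake.blob.core.windows.net")
```

If a name that should sit behind a private endpoint comes back with a public address, the DNS setup for that client is the first place to look.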
3.4 Second key point
A traditional and effective pattern for networks in Azure is the hub and spoke design. There is a lot to consider with hub and spoke, and it's also not the only possible architecture. The basic idea of hub and spoke is that there is a centralized network where we make unified decisions on network traffic routing, inspection, and filtering.
With cloud scale analytics, we expand upon this network architecture into a full mesh design. We need to ensure that each data environment can both ingest from and serve data to one another. The only way to do this is to A) have all our data environments live in the same virtual network (a BAD idea) or B) peer all our data environments to one another. A common implementation I've used in the past to do this is a nested hub and spoke.
This gets complicated fast but the basic idea is to treat your analytics environment as its own hub and spoke network WITHIN a larger hub and spoke network. This works well because this design utilizes all the advantages of a hub and spoke architecture for the analytics environment while ensuring that data ingestion streams do not cross our primary filtering service (firewall) in the main hub.
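To get a feel for why the full mesh "gets complicated fast", here is a rough sketch that counts the peering relationships each design needs. The environment names are hypothetical, and this only counts pairs; it does not create anything in Azure.

```python
from itertools import combinations


def full_mesh_pairs(environments):
    """Return every unordered pair of environments that must be peered."""
    return list(combinations(sorted(environments), 2))


def hub_and_spoke_pairs(hub, spokes):
    """Return the pairs needed when every spoke only peers with the hub."""
    return [(hub, spoke) for spoke in sorted(spokes)]


# Hypothetical data environments inside the analytics estate.
data_envs = ["finance", "marketing", "risk", "sales"]

mesh = full_mesh_pairs(data_envs)
print(len(mesh))  # 6 pairs for 4 environments, i.e. n * (n - 1) / 2

spokes = hub_and_spoke_pairs("analytics-hub", data_envs)
print(len(spokes))  # 4 pairs, one per environment
```

The mesh count grows quadratically with the number of environments, while the hub and spoke count grows linearly. That growth is part of why automating the peerings matters once the estate gets large.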
There is much more that could and SHOULD be said about the networking required to support this design. Extra questions that need to be asked: What level of filtering and inspection needs to be done to the data streams? What happens if we need to replicate this design in many regions/globally? Do we need to consider layer 7 firewalls to serve data to outside consumers via app or API (yes, you do)?
3.5 Resource Organization
Cloud scale analytics functions best when everything is automated. But for deployment automation to function, there needs to be a lot of thought paid to naming conventions and tags. A naming and tagging convention should be in place and adopted by the org before adoption of the analytics environment. Microsoft provides example conventions that are useful: Naming conventions | Tagging Conventions.
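As a rough sketch of what "adopted by the org" can look like in practice, a deployment pipeline can build names from parts and refuse anything that breaks the convention. The pattern, the allowed environments, and the required tags below are all assumptions invented for this example; swap in whatever your org actually agrees on.

```python
import re

# Hypothetical convention: <type>-<workload>-<environment>-<region>-<instance>
NAME_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)-(?P<workload>[a-z0-9]+)-(?P<env>dev|test|prod)"
    r"-(?P<region>[a-z0-9]+)-(?P<instance>\d{3})$"
)

# Hypothetical set of tags the org requires on every resource.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}


def build_name(resource_type, workload, env, region, instance):
    """Assemble a resource name and reject it if it breaks the convention."""
    name = f"{resource_type}-{workload}-{env}-{region}-{instance:03d}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} does not follow the naming convention")
    return name


def missing_tags(tags):
    """Return the required tag keys that are absent, in sorted order."""
    return sorted(REQUIRED_TAGS - set(tags))


print(build_name("rg", "analytics", "prod", "eastus", 1))  # rg-analytics-prod-eastus-001
print(missing_tags({"owner": "data-platform"}))  # ['cost-center', 'environment']
```

One caveat: some resource types have their own naming rules (storage account names, for example, only allow lowercase letters and numbers), so a real convention usually needs per-type variations.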
3.6 Governance & Security
We look here for several key features that add security to our environment and will help meet regulatory requirements. I'll run down the list of specifics that we include here in the design:
Data-at-rest encryption – This includes things like TDE for SQL servers and Azure Storage Account encryption
Data transit encryption – We ensure TLS 1.2 or better is used for all data flows at every part of the environment.
Implementation of threat protection and hunting – This will be Microsoft Defender and Sentinel but could be any threat protection and SIEM combo.
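For the TLS item in the list above, client code can hold itself to the same floor. This is a minimal sketch using Python's standard `ssl` module; the function name is made up for the example, and it only covers the client side of a connection.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    # Set the protocol floor explicitly instead of relying on library defaults.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context built this way can be passed to the standard library's HTTPS clients or used to wrap a socket, and the handshake fails if the server only offers an older protocol version.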
We usually put a large default set of Azure policies in place at this point in the design. This lets us set a quality set of guardrails around the environment, which allows us to manage and run it without spinning out of control, cost- or deployment-wise. The baseline set of policies we use is right here.
3.7 Monitoring
Monitoring for the analytics environment will need to be integrated with existing management tooling. Most large orgs will already have a solution in place; if not, Azure Monitor with Log Analytics is the preferred choice.
3.8 Business Continuity & Disaster Recovery (BCDR)
The final NON-analytics piece of the puzzle we have to ask about is business continuity and disaster recovery. I want to, again, cut past the vendor mumbo-jumbo and try to provide some alpha from someone who has been in the trenches before. BCDR for an analytics environment can be a siren song. Very rarely does the entire analytics platform need to be fully recoverable in another cloud region.
Outages DO happen, for sure. But oftentimes the cost of implementing a fully functional analytics platform AND replicating all the data to make the platform functional in another region VASTLY outweighs the cost of simply having the analysts and scientists take the day off while Microsoft recovers their services.
Special attention does need to be paid to certain services and functions to ensure we are covered from a compliance standpoint, but for the most part, this is one of the pieces of your cloud portfolio that we can suffer to be down once or twice a year for a couple of hours.
An exception to this rule is CUSTOMER or REVENUE facing service points. In those cases, we treat the product like we do any product that needs to be highly available, and we build and plan according to our already established RTO and RPO metrics (you do have those established already, right anon?).
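The trade-off for the internal analytics platform can be put into back-of-the-napkin arithmetic. Every number below is a made-up placeholder purely to show the shape of the comparison; plug in your own headcount, rates, and quotes for a second region.

```python
def annual_downtime_cost(outages_per_year, hours_per_outage, people_idle, hourly_cost):
    """Estimate what idle analysts cost per year during platform outages."""
    return outages_per_year * hours_per_outage * people_idle * hourly_cost


def dr_is_worth_it(annual_dr_cost, downtime_cost):
    """A second region only pays for itself if it costs less than the downtime."""
    return annual_dr_cost < downtime_cost


# Hypothetical inputs: 2 outages a year, 3 hours each, 40 idle people at $100/hour.
downtime = annual_downtime_cost(2, 3, 40, 100)
print(downtime)  # 24000

# Hypothetical yearly cost of running and replicating to a second region.
print(dr_is_worth_it(250_000, downtime))  # False
```

This deliberately ignores compliance obligations and customer- or revenue-facing services, which are exactly the cases where the simple comparison stops being the deciding factor.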
Part 2 Coming Soon…