Deep Learning for time series
Forecasting, and why you don't jump into deep learning immediately. Common sense comes out on top.
***Please hold onto your code from this post, we’ll be re-using this on the next post about RNNs.***
A timeseries can be any data obtained via measurements at regular intervals, like the daily price of a stock, the hourly electricity consumption of a city, or the weekly sales of a store. Timeseries are everywhere, whether we’re looking at natural phenomena or human activity patterns.
When working with timeseries, a good model tries to focus on understanding the dynamics of a system - its periodic cycles, how it trends over time, its regular regime and its sudden spikes.
Forecasting
By far, the most common timeseries-related task is forecasting: predicting what will happen next in a series. For example, you might forecast electricity consumption a few hours in advance so you can anticipate demand, or forecast revenue a few months in advance so you can plan your budget.
Forecasting is what this post focuses on. But there’s actually a wide range of other things you can do with timeseries:
Classification - Assign one or more categorical labels to a timeseries. For example, given the timeseries of the activity of a visitor on a website, classify whether they are a bot or a human.
Event detection - Identify the occurrence of a specific expected event within a continuous data stream. A useful application is “hotword detection”, where a model monitors an audio stream and detects things like “Ok Google”, “Hey Alexa”, or “Hey Siri”.
Anomaly detection - Detect anything unusual happening within a continuous data stream, such as unusual activity on your corporate network. Anomaly detection is typically done via unsupervised learning, because you often don’t know what kind of anomaly you’re looking for, so you can’t train on specific anomaly examples.
Our Temperature Forecast
For this specific post, we’ll be looking to solve a single problem: predicting the temperature 24 hours into the future, given a timeseries of hourly measurements of quantities like atmospheric pressure and humidity, recorded over the recent past.
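To make that setup concrete, here is a minimal sketch of how a series can be cut into (past window, future target) pairs. The function name `make_windows`, the 48-hour window, and the toy series are all assumptions for illustration; the post does not specify a window length here, and the real problem uses several measured quantities rather than one.

```python
import numpy as np

def make_windows(series, window, horizon):
    """Pair each `window`-step slice with the value `horizon` steps after its last point."""
    series = np.asarray(series)
    n = len(series) - window - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this window and horizon")
    # Each input is `window` consecutive observations.
    inputs = np.stack([series[i:i + window] for i in range(n)])
    # Each target sits `horizon` steps after the last observation of its window.
    start = window - 1 + horizon
    targets = series[start:start + n]
    return inputs, targets

# Toy hourly series (hypothetical): the value at hour t is simply t.
hourly = np.arange(100, dtype=float)
X, y = make_windows(hourly, window=48, horizon=24)
print(X.shape, y.shape)
print(X[0, -1], y[0])  # last observed hour of the first window, and its target
```

With the toy series, every target is exactly 24 larger than the last value in its window, which is a quick way to confirm the offset is what you intended.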
We’ll start off by showing how the deep learning models we’ve already covered perform, then compare them against a simple benchmark, and in the next post we’ll focus on the GOAT of all time series models (RNNs).
Loading the Data
Hop on over to: s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
to fetch our data. Now, let’s do a quick examination of what we are looking at:
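If you would rather fetch the file from Python than from the browser, something like the sketch below may work. It assumes the address above is served over HTTPS; `fetch_and_extract` is a hypothetical helper name, not part of any library.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumption: the address given in the post is reachable over HTTPS.
DATA_URL = "https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip"

def fetch_and_extract(url, dest_dir="."):
    """Download a zip archive from `url` into `dest_dir`, unpack it, and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last component of the URL.
    archive = dest / Path(url).name
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return [dest / name for name in zf.namelist()]
```

Calling `fetch_and_extract(DATA_URL)` should leave the unpacked files in the current directory, which is where the loading code looks for `jena_climate_2009_2016.csv`.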
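import os

fname = os.path.join('jena_climate_2009_2016.csv')

# Read the whole file into memory as one string.
with open(fname) as f:
    data = f.read()

# The first line is the header; every remaining line is one record.
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print(header)
print(len(lines))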
It gives us:
['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
420451
So, we have 15 columns to look at (a timestamp plus 14 measured quantities), and we have 420,451 rows of data as well.
Parsing the data
Now, in order to work with ML, we need to parse this data into NumPy arrays so the model can actually work with it. Here’s the code for it.
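import numpy as np

# One array for the target (temperature) and one for all numeric columns.
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    # Skip the first column ("Date Time"); everything else is numeric.
    values = [float(x) for x in line.split(',')[1:]]
    # After dropping the timestamp, index 1 is "T (degC)".
    temperature[i] = values[1]
    # Keep every column, temperature included, as an input feature.
    raw_data[i, :] = values[:]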
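A quick way to check the parsing logic above, without the full file, is to run it on a couple of rows. In the sketch below the header is copied from the printed output, but the two data rows (including the `t0`/`t1` timestamp placeholders) are made-up values for illustration only.

```python
import numpy as np

# Header copied from the post's output; the two data rows are invented sample values.
sample = (
    '"Date Time","p (mbar)","T (degC)","Tpot (K)","Tdew (degC)","rh (%)",'
    '"VPmax (mbar)","VPact (mbar)","VPdef (mbar)","sh (g/kg)",'
    '"H2OC (mmol/mol)","rho (g/m**3)","wv (m/s)","max. wv (m/s)","wd (deg)"\n'
    't0,990.0,-8.0,265.0,-9.0,93.0,3.3,3.1,0.2,1.9,3.1,1300.0,1.0,1.7,150.0\n'
    't1,991.0,-8.5,264.5,-9.5,92.0,3.2,3.0,0.2,1.8,3.0,1301.0,0.7,1.5,140.0'
)

lines = sample.split('\n')
header = lines[0].split(',')
lines = lines[1:]

temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]  # drop the timestamp column
    temperature[i] = values[1]                        # "T (degC)"
    raw_data[i, :] = values[:]                        # all 14 numeric columns

print(raw_data.shape)  # (number of rows, 14)
print(temperature)     # the temperature column on its own
```

Note that the temperature column stays inside `raw_data` as well, so past temperature is available to the model as an input while `temperature` serves as the target.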
This is what our raw_data (Xs) look like:
And, this is what our target (Y) values look like:
Preparing the data
In general, when dealing with weather data, if you were trying to predict the average temperature for the next month given a few months of past data, the problem would be easy, thanks to the reliable year-scale periodicity of the data.
But when you look at weather data over a scale of days, the temperature becomes a lot more chaotic. For our exercise, we will try to predict the temperature 24 hours ahead.
We’ll use the first 50% of the data for training, the next 25% for validation, and the last 25% for testing.
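As a rough sketch, the split sizes can be computed from the row count printed earlier. The variable names are illustrative, and giving the rounding remainder to the test set is an assumption.

```python
num_rows = 420451  # row count printed when the data was loaded

num_train_samples = int(0.5 * num_rows)
num_val_samples = int(0.25 * num_rows)
# Whatever is left over after rounding goes to the test set (an assumption).
num_test_samples = num_rows - num_train_samples - num_val_samples

# Slices are contiguous and in time order: train first, then validation, then test.
train_slice = slice(0, num_train_samples)
val_slice = slice(num_train_samples, num_train_samples + num_val_samples)
test_slice = slice(num_train_samples + num_val_samples, num_rows)

print(num_train_samples, num_val_samples, num_test_samples)
# → 210225 105112 105114
```

The slices can then be applied directly to the parsed arrays, for example `raw_data[train_slice]`.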
Here’s an important lesson, if you are ever working with time series in the real world: