An ARMA, or autoregressive moving average, model is a forecasting model that predicts future values based on past values. Forecasting is a critical task for several business objectives, such as predictive analytics, predictive maintenance, product planning, and budgeting. A big advantage of ARMA models is that they are relatively simple. They require only a small data set to make a prediction, they are highly accurate for short-term forecasts, and they work on data without a trend.
In this tutorial, we’ll learn how to use the statsmodels Python package to forecast data using an ARMA model and InfluxDB, the open source time series database. The tutorial will describe how to use the InfluxDB Python client library to query data from InfluxDB and convert the data into a Pandas DataFrame to make it easier to work with time series data. Then we will make our forecast.
I’ll also dive into the ARMA perks in more detail.
Requirements
This tutorial was run on a macOS system with Python 3 installed via Homebrew. I recommend setting up additional tools like virtualenv, pyenv, or conda-env to simplify client and Python installations. Otherwise, the full requirements are here:
- influxdb-client >= 1.30.0
- pandas >= 1.4.3
- matplotlib >= 3.5.2
- scikit-learn >= 1.1.1
This tutorial also assumes that you have a free tier InfluxDB Cloud account and that you have created a bucket and a token. You can think of a bucket as a database or the highest hierarchical level of data organization within InfluxDB. For this tutorial, we will create a bucket called NOAA.
What is ARMA?
ARMA stands for autoregressive moving average. It is a forecasting technique that combines AR (autoregressive) models and MA (moving average) models. An AR forecast is a linear additive model: each forecast is the sum of past values, each multiplied by a scaling factor, plus the residuals. To learn more about the math behind AR models, I suggest reading this article.
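As a rough illustration of the AR idea, a one-step AR(p) forecast is a constant plus a weighted sum of the p most recent values. The sketch below is not how statsmodels computes forecasts; the function name `ar_one_step` and the coefficients are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ar_one_step(history, coeffs, const=0.0):
    """One-step AR(p) forecast: const + sum(coeffs[i] * history[-1 - i]).

    coeffs[0] weights the most recent value, coeffs[1] the one before it, etc.
    """
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p past values")
    recent = history[-p:][::-1]  # most recent value first
    return const + sum(c * x for c, x in zip(coeffs, recent))


# Hypothetical AR(2) coefficients, chosen only for illustration.
temps = [71.0, 71.5, 72.0]
forecast = ar_one_step(temps, coeffs=[0.5, 0.25], const=18.0)
print(forecast)  # 18.0 + 0.5 * 72.0 + 0.25 * 71.5 = 71.875
```

In a fitted model the constant and coefficients are estimated from the data rather than supplied by hand.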
A moving average model is a series of averages. There are different types of moving averages, including simple, cumulative, and weighted forms. ARMA models combine the AR and MA techniques to generate a forecast. I recommend reading this post to learn more about the AR, MA, and ARMA models. Today we will use statsmodels to fit an ARMA model and make forecasts.
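To make the three kinds of moving average mentioned above concrete, here is a minimal plain-Python sketch. The function names are my own and are not part of statsmodels or InfluxDB; the weights in the weighted version are arbitrary.

```python
def simple_moving_average(values, n):
    """Mean of each trailing window of n points (no output for the first n - 1)."""
    return [sum(values[i - n + 1 : i + 1]) / n for i in range(n - 1, len(values))]


def cumulative_moving_average(values):
    """Mean of all points seen so far, one output per input."""
    out, total = [], 0.0
    for count, v in enumerate(values, start=1):
        total += v
        out.append(total / count)
    return out


def weighted_moving_average(values, weights):
    """Trailing window mean where weights[-1] applies to the newest point."""
    n, wsum = len(weights), sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i - n + 1 : i + 1])) / wsum
        for i in range(n - 1, len(values))
    ]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
sma = simple_moving_average(data, 3)             # [2.0, 3.0, 4.0]
cma = cumulative_moving_average(data)            # [1.0, 1.5, 2.0, 2.5, 3.0]
wma = weighted_moving_average(data, [1, 2, 3])   # newest point weighted most
print(sma, cma, wma)
```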
Assumptions of the AR, MA and ARMA models
If you want to use AR, MA, and ARMA models, you must first make sure that your data meets the models’ key requirement: stationarity. To assess whether your time series data is stationary, check that the mean and covariance remain constant over time. Fortunately, we can use InfluxDB and the Flux language to get a data set and make our data stationary.
We will do this data preparation in the next section.
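Before preparing the data, it can help to see what "constant mean and variance" looks like in practice. The sketch below is only an informal check, assuming that comparing the two halves of a series is good enough for a first look; a formal test such as the augmented Dickey-Fuller test is the more rigorous option. The helper name `halves_summary` and the toy series are made up for illustration.

```python
from statistics import fmean, pvariance


def halves_summary(values):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (fmean(first), pvariance(first)), (fmean(second), pvariance(second))


# Toy series: one with an upward trend, one fluctuating around a fixed level.
trending = [float(i) for i in range(12)]
flat = [1.0, -1.0] * 6

trend_first, trend_second = halves_summary(trending)
flat_first, flat_second = halves_summary(flat)

# The trending series has a clearly different mean in each half,
# while the flat series has the same mean and variance in both halves.
print("trending:", trend_first, trend_second)
print("flat:    ", flat_first, flat_second)
```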
Flux for time series differencing and data preparation
Flux is the data scripting language for InfluxDB. For our forecast, we are using the air sensor sample data set that comes out of the box with InfluxDB. This dataset contains temperature data from multiple sensors. We are creating a temperature forecast for a single sensor. The data looks like this:
Use the following Flux code to import the dataset and filter for the single time series.
import "join"
import "influxdata/influxdb/sample"
//dataset is regular time series at 10 second intervals
data = sample.data(set: "airSensor")
|> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")
Next, we can make our time series weakly stationary through moving average differencing. Differencing is a technique for removing any trend or slope from our data. First, we find the moving average of our data.
Raw air temperature data (blue) vs. the moving average (pink).
Next, we subtract the moving average from our real time series after joining the raw data and the MA data.
Differenced data is stationary.
Here is the entire Flux script used to perform this differencing:
import "join"
import "influxdata/influxdb/sample"
//dataset is regular time series at 10 second intervals
data = sample.data(set: "airSensor")
|> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")
// |> yield(name: "temp data")
MA = data
|> movingAverage(n:6)
// |> yield(name: "MA")
differenced = join.time(left: data, right: MA, as: (l, r) => ({l with MA: r._value}))
|> map(fn: (r) => ({r with _value: r._value - r.MA}))
|> yield(name: "stationary data")
Note that this approach estimates the trend-cycle component. Series decomposition is often performed with linear regression as well.
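The same moving average differencing can be sketched in plain Python to show what the Flux script computes. This is an approximation of the Flux behavior, assuming that movingAverage emits its first point at the n-th record and that the join keeps only timestamps present in both streams; the function name `moving_average_difference` is my own.

```python
def moving_average_difference(values, n=6):
    """Subtract the trailing n-point moving average from each value.

    The first n - 1 records have no moving average, so they are dropped,
    which is what an inner join on time would do.
    """
    out = []
    for i in range(n - 1, len(values)):
        window = values[i - n + 1 : i + 1]
        out.append(values[i] - sum(window) / n)
    return out


# Toy series with a constant upward slope of 1 per step.
raw = [float(i) for i in range(10)]
detrended = moving_average_difference(raw, n=6)
print(detrended)  # the slope is gone: every value is 2.5
```

With pandas, the equivalent one-liner would be `s - s.rolling(6).mean()`, which leaves NaN in the first five positions instead of dropping them.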
ARMA and time series forecasts with Python
Now that we have prepared our data, we can create a forecast. To use the ARMA method, we must identify the p value and the q value for our data. The p value defines the order of the AR model, and the q value defines the order of the MA model. (These are model orders, not the statistical p-values discussed later.) To turn the statsmodels ARIMA function into an ARMA function, we provide a d value of 0. The d value is the number of nonseasonal differences needed for stationarity. Since we already made our data stationary with Flux, we don’t need any further differencing.
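For reference, the textbook form of an ARMA(p, q) model, and the specific case that corresponds to the order (1, 0, 2) used in this tutorial, can be written as follows. The symbols are the conventional ones and are not taken from the statsmodels source: c is a constant, the φ terms are AR coefficients, the θ terms are MA coefficients, and ε is the error term.

```latex
% General ARMA(p, q)
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}

% ARMA(1, 2), i.e. ARIMA with order (p, d, q) = (1, 0, 2)
y_t = c + \phi_1 \, y_{t-1} + \varepsilon_t + \theta_1 \, \varepsilon_{t-1} + \theta_2 \, \varepsilon_{t-2}
```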
We first query our data with the Python InfluxDB client library. Next, we convert the DataFrame to a time-indexed pandas Series. We then fit our model and finally make a prediction.
# query data with the Python InfluxDB Client Library and remove the trend through differencing
# replace the token and org placeholders with your own credentials
client = InfluxDBClient(url="https://us-west-2-1.aws.cloud2.influxdata.com", token="YOUR_API_TOKEN", org="YOUR_ORG_ID")
# write_api = client.write_api(write_options=SYNCHRONOUS)
query_api = client.query_api()
df = query_api.query_data_frame('''
import "join"
import "influxdata/influxdb/sample"

data = sample.data(set: "airSensor")
    |> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")

MA = data
    |> movingAverage(n: 6)

join.time(left: data, right: MA, as: (l, r) => ({l with MA: r._value}))
    |> map(fn: (r) => ({r with _value: r._value - r.MA}))
    |> keep(columns: ["_value", "_time"])
    |> yield(name: "differenced")
''')
df = df.drop(columns=['table', 'result'])
y = df["_value"].to_numpy()
date = df["_time"].dt.tz_localize(None).to_numpy()
y = pd.Series(y, index=date)
model = sm.tsa.arima.ARIMA(y, order=(1,0,2))
result = model.fit()
Ljung-Box test and Durbin-Watson test
The Ljung-Box test can be used to check that the values you chose for p and q give a well-fitting ARMA model. The test examines the autocorrelations of the residuals. Essentially, it tests the null hypothesis that the residuals are independently distributed. When you use this test, your goal is to fail to reject the null hypothesis, which is consistent with the residuals being independently distributed. You must first fit your model with the chosen p and q values, as we did above. Then use the Ljung-Box test to determine whether the chosen values are acceptable. The test returns a Ljung-Box p-value. If this p-value is greater than 0.05, you cannot reject the null hypothesis, and the chosen values are acceptable.
After fitting the model and running the test with Python…
print(sm.stats.acorr_ljungbox(result.resid, lags=[5], return_df=True))
we obtain a p-value for the test of 0.589648.
lag | lb_stat | lb_pvalue
---|---|---
5 | 3.725002 | 0.589648
This indicates that the p and q values we chose when fitting the model are acceptable.
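For intuition about what the Ljung-Box statistic measures, the sketch below computes it by hand from the sample autocorrelations. It is a simplified illustration, not a replacement for the statsmodels function: it returns only the Q statistic and compares it with a fixed chi-square critical value instead of computing a p-value. The function names and the toy residuals are made up.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(values, max_lag):
    """Ljung-Box statistic: Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(values)
    return n * (n + 2) * sum(
        autocorrelation(values, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Toy residuals that flip sign every step, so they are strongly autocorrelated.
alternating = [1.0, -1.0] * 20
q_stat = ljung_box_q(alternating, max_lag=5)

# Approximate 95th percentile of a chi-square distribution with 5 degrees of freedom.
CHI2_CRITICAL_5DF = 11.07

# Q far above the critical value means the independence hypothesis is rejected.
print(q_stat, q_stat > CHI2_CRITICAL_5DF)
```

A large Q (and therefore a small p-value) signals leftover autocorrelation in the residuals, which is the opposite of the result we obtained for our fitted model.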
You can also use the Durbin-Watson test to test for autocorrelation. While the Ljung-Box test checks for autocorrelation at any lag, the Durbin-Watson test uses only a lag of 1. The Durbin-Watson statistic ranges from 0 to 4, where a value close to 2 indicates that there is no autocorrelation. Aim for a value close to 2.
print(sm.stats.durbin_watson(result.resid.values))
Here we get the following value, which agrees with the previous test and confirms that our model is good.
2.0011309357716414
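The Durbin-Watson statistic itself is simple enough to compute by hand, which makes the 0-to-4 scale easier to interpret. The following sketch uses the standard formula on made-up residuals; it is meant for intuition only, and the statsmodels function remains the one to use on real results.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denom = sum(e ** 2 for e in residuals)
    return num / denom


# Identical residuals: perfect positive autocorrelation, DW = 0.
dw_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])

# Sign-flipping residuals: strong negative autocorrelation, DW moves toward 4
# as the series gets longer (3.0 for this four-point example).
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])

print(dw_positive, dw_negative)
```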
The complete ARMA forecasting script with Python and Flux
Now that we understand the components of the script, let’s look at the script in its entirety and create a graph of our forecast.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from influxdb_client import InfluxDBClient
from datetime import datetime as dt
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
# query data with the Python InfluxDB Client Library and remove the trend through differencing
# replace the token and org placeholders with your own credentials
client = InfluxDBClient(url="https://us-west-2-1.aws.cloud2.influxdata.com", token="YOUR_API_TOKEN", org="YOUR_ORG_ID")
# write_api = client.write_api(write_options=SYNCHRONOUS)
query_api = client.query_api()
df = query_api.query_data_frame('''
import "join"
import "influxdata/influxdb/sample"

data = sample.data(set: "airSensor")
    |> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")

MA = data
    |> movingAverage(n: 6)

join.time(left: data, right: MA, as: (l, r) => ({l with MA: r._value}))
    |> map(fn: (r) => ({r with _value: r._value - r.MA}))
    |> keep(columns: ["_value", "_time"])
    |> yield(name: "differenced")
''')
df = df.drop(columns=['table', 'result'])
y = df["_value"].to_numpy()
date = df["_time"].dt.tz_localize(None).to_numpy()
y = pd.Series(y, index=date)
model = sm.tsa.arima.ARIMA(y, order=(1,0,2))
result = model.fit()
fig, ax = plt.subplots(figsize=(10, 8))
fig = plot_predict(result, ax=ax)
legend = ax.legend(loc="upper left")
print(sm.stats.durbin_watson(result.resid.values))
print(sm.stats.acorr_ljungbox(result.resid, lags=[5], return_df=True))
plt.show()
The bottom line
I hope this blog post inspires you to leverage ARMA and InfluxDB for forecasting. I encourage you to take a look at the following repository, which includes examples of working with both the algorithms described here and InfluxDB for forecasting and anomaly detection.
Anais Dotis-Georgiou is an InfluxData developer advocate with a passion for making data beautiful using data analytics, AI, and machine learning. She applies a combination of research, exploration, and engineering to translate the data she collects into something useful, valuable, and beautiful. When she’s not behind a screen, she can be found outside drawing, stretching, tackling or chasing a football.
—
New Tech Forum offers a place to explore and discuss emerging business technology in unprecedented depth and breadth. Selection is subjective, based on our choice of technologies that we believe are important and of most interest to InfoWorld readers. InfoWorld does not accept marketing guarantees for the publication and reserves the right to edit all content contributed. Please send all inquiries to newtechforum@infoworld.com.
Copyright © 2022 IDG Communications, Inc.