Time series forecasting with XGBoost and InfluxDB

XGBoost is an open source machine learning library that implements optimized distributed gradient boosting algorithms. XGBoost uses parallel processing for fast performance, handles missing values well, performs well on small datasets, and helps prevent overfitting. All of these advantages make XGBoost a popular solution for regression problems such as forecasting.

Forecasting is a fundamental task for all sorts of business objectives such as predictive analytics, predictive maintenance, product planning, budgeting, etc. Many forecasting or prediction problems involve time series data. That makes XGBoost a great companion to InfluxDB, the open source time series database.

In this tutorial, we’ll learn how to use the XGBoost Python package to forecast data from the InfluxDB time series database. We’ll also use the InfluxDB Python client library to query data from InfluxDB and convert the data to a Pandas data frame to make it easier to work with time series data. Then we will make our forecast.

I will also dive into the advantages of XGBoost in more detail.

Requirements

This tutorial was run on a macOS system with Python 3 installed via Homebrew. I recommend setting up additional tools like virtualenv, pyenv, or conda-env to simplify client and Python installations. Otherwise, the full requirements are these:

  • influxdb-client >= 1.30.0
  • pandas >= 1.4.3
  • xgboost >= 1.7.3
  • matplotlib >= 3.5.2
  • scikit-learn >= 1.1.1

This tutorial also assumes that you have a free tier InfluxDB Cloud account and that you have created a bucket and a token. You can think of a bucket as a database or the highest hierarchical level of data organization within InfluxDB. For this tutorial, we will create a bucket called NOAA.

Decision Trees, Random Forests, and Gradient Boosting

To understand what XGBoost is, we need to understand decision trees, random forests, and gradient boosting. A decision tree is a type of supervised learning method that is made up of a series of tests on a feature. Each node is a test, and all of the nodes are organized in a flowchart structure. The branches represent conditions that ultimately determine which leaf or class label will be assigned to the input data.

A decision tree to determine whether it will rain, from "Decision Tree in Machine Learning." Edited to show the components of the decision tree: leaves, branches, and nodes. (Image credit: Prince Yadav)
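
To make the node, branch, and leaf vocabulary concrete, here is a minimal hand-written sketch of such a tree in Python. The feature names (humidity, cloud_cover) and the thresholds are assumptions chosen for illustration; a real decision tree learns its tests from training data.

```python
# Hypothetical hand-written decision tree for a "will it rain?" question.
# Feature names and thresholds are illustrative assumptions, not learned values.
def will_it_rain(sample):
    # Root node: a test on the "humidity" feature
    if sample["humidity"] > 0.7:
        # Internal node: a test on the "cloud_cover" feature
        if sample["cloud_cover"] > 0.5:
            return "rain"     # leaf
        return "no rain"      # leaf
    # Branch taken when the root test fails
    return "no rain"          # leaf


humid_and_cloudy = {"humidity": 0.9, "cloud_cover": 0.8}
dry_and_cloudy = {"humidity": 0.4, "cloud_cover": 0.8}
print(will_it_rain(humid_and_cloudy))
print(will_it_rain(dry_and_cloudy))
```

Each if statement plays the role of a node, each outcome of the test is a branch, and each return value is a leaf.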

The guiding principle behind decision trees, random forests, and gradient boosting is that a group of “weak learners” or classifiers collectively make strong predictions.

A random forest contains several decision trees. Where every node in a decision tree would be considered a weak learner, every decision tree in the forest is considered one of many weak learners in a random forest model. Typically, all data is randomly divided into subsets and passed through different decision trees.

Gradient boosted decision trees and random forests are similar, but they differ in the way they are structured. Gradient boosted trees also contain a forest of decision trees, but these trees are built additively and all of the data passes through a collection of decision trees. (More on this in the next section.) Gradient boosted trees can contain a set of classification or regression trees. Classification trees are used for discrete values (for example, cat or dog). Regression trees are used for continuous values (for example, 0 to 100).
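
The phrase "built additively" can be sketched in a few lines of plain Python. The example below is a simplified illustration, not XGBoost's actual algorithm: each new tree is a one-split regression tree (a "stump") fit to the errors left by the trees before it, and the final prediction is the sum of all the trees' outputs. The function names, the toy data, and the learning rate are assumptions made for this sketch.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split always contain at least one point.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Build trees one at a time, each fit to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)          # start from the mean of the targets
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    # The ensemble prediction is the base value plus the scaled sum of all trees.
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


# Toy regression data (made up for illustration)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1, 5.0, 5.1]

model = gradient_boost(xs, ys)
training_mae = sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)
print("training MAE: %.3f" % training_mae)
```

In a random forest, by contrast, the trees would be fit independently on random subsets of the data and their outputs averaged, rather than each tree correcting the ones before it.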

What is XGBoost?

Gradient boosting is a machine learning algorithm used for classification and regression. XGBoost is an "extreme" implementation of gradient boosting: it performs gradient boosting more efficiently by taking advantage of parallel processing. The following diagram from the XGBoost documentation illustrates how gradient boosting can be used to predict whether a person will like a video game.

Two trees are used to decide whether or not a person will enjoy a video game. The leaf scores from both trees are added together to determine which individual is more likely to enjoy the game. (Image credit: XGBoost developers)
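
As a rough sketch of how the leaf scores combine, the two trees can be written as plain Python functions whose outputs are summed. The feature names and the leaf scores below are illustrative assumptions modeled on the diagram, not values produced by a trained model.

```python
# Tree 1 (assumed structure): splits on age, then on gender.
def tree_1(person):
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


# Tree 2 (assumed structure): splits on daily computer use.
def tree_2(person):
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_score(person):
    # The prediction is the sum of the leaf scores from every tree.
    return tree_1(person) + tree_2(person)


boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

print("boy: %.1f" % ensemble_score(boy))          # boy: 2.9
print("grandpa: %.1f" % ensemble_score(grandpa))  # grandpa: -1.9
```

A higher summed score means the ensemble considers that person more likely to enjoy the game.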

See Introduction to Boosted Trees in the XGBoost documentation for more information on how gradient boosted trees and XGBoost work.

Some advantages of XGBoost:

  • Relatively easy to understand.
  • It works well on small, structured, and regular data with few features.

Some disadvantages of XGBoost:

  • Prone to overfitting and sensitive to outliers. It may be a good idea to use a materialized view of your time series data for forecasting with XGBoost.
  • It doesn’t work well with sparse or unsupervised data.

Time Series Forecasting with XGBoost

We are using the air sensor sample dataset that comes out of the box with InfluxDB. This dataset contains temperature data from multiple sensors. We are creating a temperature forecast for a single sensor. The data looks like this:

[Image: air sensor temperature data. Credit: InfluxData]

Use the following Flux code to import the dataset and filter for the single time series. (Flux is the query language for InfluxDB.)

import "join"
import "influxdata/influxdb/sample"
//dataset is regular time series at 10 second intervals
data = sample.data(set: "airSensor")
  |> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")

Random forests and gradient boosting can be used for time series forecasting, but they require that the data be transformed for supervised learning. This means we must shift our data forward using a sliding window approach or lag method to convert the time series data into a supervised learning set. We can prepare the data with Flux as well. Ideally, you should first perform an autocorrelation analysis to determine the optimal lag to use. For brevity, we will shift the data by one regular time interval with the following Flux code.

 
import "join"
import "influxdata/influxdb/sample"
data = sample.data(set: "airSensor")
  |> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")
shiftedData = data
  |> timeShift(duration: 10s , columns: ["_time"] )
join.time(left: data, right: shiftedData, as: (l, r) => ({l with data: l._value, shiftedData: r._value}))
  |> drop(columns: ["_measurement", "_time", "_value", "sensor_id", "_field"]) 

[Image: the data and shiftedData columns returned by the query. Credit: InfluxDB]
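
The same lag transformation can be sketched in plain Python for readers who prefer to reshape the data after querying it. This is a minimal illustration, assuming the series is already a list of evenly spaced values; the helper name series_to_supervised and the sample temperatures are made up for this example.

```python
def series_to_supervised(values, n_lags=1):
    """Turn a univariate series into rows of [lag_n, ..., lag_1, target]."""
    values = list(values)
    rows = []
    for i in range(n_lags, len(values)):
        # The n_lags previous observations are the inputs,
        # the current observation is the value to predict.
        rows.append(values[i - n_lags:i] + [values[i]])
    return rows


# Made-up temperature readings at a regular interval
temps = [71.2, 71.3, 71.3, 71.4, 71.6]

for row in series_to_supervised(temps, n_lags=2):
    print(row)
# [71.2, 71.3, 71.3]
# [71.3, 71.3, 71.4]
# [71.3, 71.4, 71.6]
```

The first n_lags observations produce no row because they have no complete history, which is the same reason shifted data contains empty values at its edges.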

If you wanted to add additional lagged data to your model input, you could use the following Flux logic instead.


import "experimental"
import "influxdata/influxdb/sample"
data = sample.data(set: "airSensor")
|> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")

shiftedData1 = data
|> timeShift(duration: 10s , columns: ["_time"] )
|> set(key: "shift" , value: "1" )

shiftedData2 = data
|> timeShift(duration: 20s , columns: ["_time"] )
|> set(key: "shift" , value: "2" )

shiftedData3 = data
|> timeShift(duration: 30s , columns: ["_time"] )
|> set(key: "shift" , value: "3")

shiftedData4 = data
|> timeShift(duration: 40s , columns: ["_time"] )
|> set(key: "shift" , value: "4")

union(tables: [shiftedData1, shiftedData2, shiftedData3, shiftedData4])
|> pivot(rowKey:["_time"], columnKey: ["shift"], valueColumn: "_value")
|> drop(columns: ["_measurement", "_time", "_value", "sensor_id", "_field"])
// remove the NaN values
|> limit(n:360)
|> tail(n: 356)
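
Deciding which lags are worth including is what the autocorrelation analysis mentioned earlier is for. Here is a minimal sketch of that analysis in plain Python, assuming the series is a list of evenly spaced values. The helper name autocorrelation and the synthetic series are assumptions for illustration; in practice you would run this on the queried temperature values.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i - lag] - mean) for i in range(lag, n)
    )
    return covariance / variance


# Synthetic series that repeats every 4 steps (made up for illustration)
series = [0.0, 1.0, 0.0, -1.0] * 6

scores = {lag: autocorrelation(series, lag) for lag in range(1, 7)}
best_lag = max(scores, key=scores.get)
print("best lag:", best_lag)
```

Lags with an autocorrelation close to 1 or -1 carry the most information about the current value, so they are natural candidates for shifted input columns.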

We also need to use walk-forward validation to train our algorithm. This involves splitting the dataset into a training set and a test set. We then train the XGBoost model with XGBRegressor and its fit method, and make predictions with the predict method. Finally, we use MAE (mean absolute error) to measure the accuracy of our predictions. For a lag of 10 seconds, the calculated MAE is 0.035. In other words, our predictions are off by 0.035 on average, in the units of the temperature data. The graph below compares our XGBoost predictions with the expected values from the train/test split.

[Image: expected vs. predicted values. Credit: InfluxData]
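
To show what the MAE measures, here is a small worked example in plain Python. The expected and predicted values are made up for illustration, and the helper name mae_by_hand is hypothetical; the full script uses mean_absolute_error from scikit-learn instead.

```python
def mae_by_hand(expected, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    errors = [abs(e - p) for e, p in zip(expected, predicted)]
    return sum(errors) / len(errors)


# Made-up temperature values for illustration
expected = [71.2, 71.3, 71.3, 71.4]
predicted = [71.3, 71.3, 71.2, 71.4]

mae = mae_by_hand(expected, predicted)
print("MAE: %.3f" % mae)  # MAE: 0.050
```

Two of the four predictions miss by 0.1 and two are exact, so the average miss is 0.05. The MAE is expressed in the same units as the data, so it is read as an average error size rather than as a percentage of correct predictions.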

Below is the full script. This code is largely borrowed from the tutorial here.


import os

import pandas as pd
from numpy import asarray
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# query data with the Python InfluxDB Client Library and transform data into a supervised learning problem with Flux
# The token and org ID are read from environment variables instead of being hardcoded.
# INFLUXDB_TOKEN and INFLUXDB_ORG are placeholder names; set them to your own values.
client = InfluxDBClient(url="https://us-west-2-1.aws.cloud2.influxdata.com", token=os.environ["INFLUXDB_TOKEN"], org=os.environ["INFLUXDB_ORG"])

# write_api = client.write_api(write_options=SYNCHRONOUS)
query_api = client.query_api()
# a triple-quoted string keeps each Flux statement on its own line
df = query_api.query_data_frame('''
import "join"
import "influxdata/influxdb/sample"
data = sample.data(set: "airSensor")
  |> filter(fn: (r) => r._field == "temperature" and r.sensor_id == "TLM0100")
shiftedData = data
  |> timeShift(duration: 10s , columns: ["_time"] )
join.time(left: data, right: shiftedData, as: (l, r) => ({l with data: l._value, shiftedData: r._value}))
  |> drop(columns: ["_measurement", "_time", "_value", "sensor_id", "_field"])
  |> yield(name: "converted to supervised learning dataset")
''')
df = df.drop(columns=['table', 'result'])
data = df.to_numpy()

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
     return data[:-n_test], data[-n_test:]

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
     # transform list into array
     train = asarray(train)
     # split into input and output columns
     trainX, trainy = train[:, :-1], train[:, -1]
     # fit model
     model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
     model.fit(trainX, trainy)
     # make a one-step prediction
     yhat = model.predict(asarray([testX]))
     return yhat[0]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
     predictions = list()
     # split dataset
     train, test = train_test_split(data, n_test)
     history = [x for x in train]
     # step over each time-step in the test set
     for i in range(len(test)):
          # split test row into input and output columns
          testX, testy = test[i, :-1], test[i, -1]
          # fit model on history and make a prediction
          yhat = xgboost_forecast(history, testX)
          # store forecast in list of predictions
          predictions.append(yhat)
          # add actual observation to history for the next loop
          history.append(test[i])
          # summarize progress
          print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
     # estimate prediction error
     error = mean_absolute_error(test[:, -1], predictions)
     return error, test[:, -1], predictions

# evaluate
mae, y, yhat = walk_forward_validation(data, 100)
print('MAE: %.3f' % mae)

# plot expected vs predicted
pyplot.plot(y, label="Expected")
pyplot.plot(yhat, label="Predicted")
pyplot.legend()
pyplot.show()

Conclusion

I hope this blog post inspires you to take advantage of XGBoost and InfluxDB for forecasting. I encourage you to take a look at the following repository which includes examples of working with many of the algorithms described here and InfluxDB for forecasting and anomaly detection.

Anais Dotis-Georgiou is an InfluxData developer advocate with a passion for making data beautiful using data analytics, AI, and machine learning. She applies a combination of research, exploration, and engineering to translate the data she collects into something useful, valuable, and beautiful. When she’s not behind a screen, she can be found outside drawing, stretching, tackling or chasing a football.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2022 IDG Communications, Inc.
