Time-series analysis

Introduction

A time series is a data set of observations recorded over time, usually kept in the order in which they were observed.

Concepts

Peaks and troughs

Let’s start by creating a random walk: a path built by summing steps drawn at random from -1, 0 and 1.

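import numpy as np
import pandas as pd

N = 10000                    # number of steps
step_set = [-1, 0, 1]        # possible moves at each step
origin = np.zeros((1, 1))    # the walk starts at 0
step_shape = (N, 1)
steps = np.random.choice(a=step_set, size=step_shape)
# The cumulative sum of the steps gives the position at each time
path = np.concatenate([origin, steps]).cumsum(0)
df = pd.DataFrame(path, columns=["y"])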

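from scipy.signal import find_peaks

# Look at the first 100 observations only
subset = df.head(100)

# find_peaks returns a tuple: (indices of the local maxima, properties dict)
peaks = find_peaks(subset["y"])
# Troughs are the peaks of the negated series
troughs = find_peaks(-subset["y"])
peaks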
(array([ 9, 20, 30, 37, 48, 52, 64, 77, 79, 84, 92]), {})
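Since find_peaks returns a tuple, it is often convenient to unpack it and keep only the index array. The following is a minimal sketch on a small hand-made series (the values in y are illustrative and not taken from the random walk above), so the result can be checked by eye.

```python
import numpy as np
from scipy.signal import find_peaks

# Illustrative values only, chosen so the extrema are easy to spot.
y = np.array([0, 1, 0, -1, 0, 2, 1, 2, 3, 1])

# Unpack the (indices, properties) tuple and discard the properties dict.
peak_idx, _ = find_peaks(y)
# Troughs are found as the peaks of the negated signal.
trough_idx, _ = find_peaks(-y)

print(peak_idx)    # [1 5 8]
print(trough_idx)  # [3 6]
```

Note that the first and last samples are never reported, because a peak needs a neighbour on both sides.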

Autocorrelation

Pandas provides an autocorrelation plot function, which shows the correlation of the series with lagged copies of itself.

import matplotlib.pyplot as plt

pd.plotting.autocorrelation_plot(df["y"])
plt.show()
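To read off a single number rather than a plot, pandas also offers Series.autocorr, which computes the Pearson correlation between a series and a copy of itself shifted by a given lag. The sketch below is a rough illustration, assuming a freshly generated walk with an arbitrary seed rather than the df above. The values may differ slightly from those drawn by the plot, which uses its own estimator.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily, only to make the sketch reproducible.
rng = np.random.default_rng(0)
steps = rng.choice([-1, 0, 1], size=10_000)
walk = pd.Series(steps.cumsum())

# Lag-1 autocorrelation of the walk and of the underlying steps.
walk_ac = walk.autocorr(lag=1)
steps_ac = pd.Series(steps).autocorr(lag=1)

print(f"walk:  {walk_ac:.3f}")
print(f"steps: {steps_ac:.3f}")
```

The walk should come out very close to 1, since each position is the previous one plus a small step, while the independent steps should come out close to 0.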

Differencing

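Differencing computes the change between consecutive observations, $x_t - x_{t-1}$. It is a common way to remove a trend and bring a series closer to stationarity.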

stationary = df['y'].diff()
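As a small worked example of what diff produces, the sketch below uses made-up values (not the random walk above). The first entry has no predecessor, so it is NaN and is usually dropped. The second half checks that differencing a cumulative sum recovers the original steps, which is why a differenced random walk behaves like its step series.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
x = pd.Series([3, 5, 4, 4, 7])
d = x.diff()        # NaN, 2.0, -1.0, 0.0, 3.0
d = d.dropna()      # drop the undefined first difference

# Differencing undoes the cumulative sum used to build a random walk.
steps = np.array([1, -1, 0, 1])
walk = pd.Series(np.concatenate([[0], steps]).cumsum())
recovered = walk.diff().dropna().to_numpy()

print(d.tolist())          # [2.0, -1.0, 0.0, 3.0]
print(recovered.tolist())  # [1.0, -1.0, 0.0, 1.0]
```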

Tools

Here we look at some tools (mostly for Python) for time-series analysis.

Data

import pandas as pd

df = pd.read_csv("../../data/streamad/uniDS.csv")

Tsfresh

Tsfresh (Time Series Feature Extraction Based on Scalable Hypothesis Tests) is a Python package that automatically calculates a large number of time-series features and selects those relevant for classification or regression. It is typically used for feature engineering.

from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute, make_forecasting_frame
from tsfresh.feature_extraction import ComprehensiveFCParameters, settings
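# Keep only the timestamp and value columns
data = df[["timestamp", "value"]]
# For each time step, collect up to the last 100 values as a window (df_pass)
# together with the value to predict (y_air)
df_pass, y_air = make_forecasting_frame(data.value,
                                        kind="value",
                                        max_timeshift=100,
                                        rolling_direction=1)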
Rolling: 100%|██████████| 30/30 [00:05<00:00,  5.17it/s]