Time-series analysis
Introduction
A time series is a sequence of observations indexed by time, typically recorded at successive points such as every second, day, or month.
Concepts
Peaks and troughs
Let’s start by creating a random walk.
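import numpy as np
import pandas as pd

N = 10000               # number of steps in the walk
step_set = [-1, 0, 1]   # possible increments at each step
origin = np.zeros((1, 1))
step_shape = (N, 1)
# Draw N random steps and accumulate them, starting from the origin
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
df = pd.DataFrame(path, columns=['y'])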
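from scipy.signal import find_peaks

# Look at the first 100 points only
subset = df.head(100)
# find_peaks returns a tuple: (array of peak indices, dict of peak properties)
peaks = find_peaks(subset["y"])
# Troughs are the peaks of the negated series
troughs = find_peaks(-subset["y"])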
peaks
(array([ 9, 20, 30, 37, 48, 52, 64, 77, 79, 84, 92]), {})
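Because `find_peaks` returns a tuple, it is often more convenient to unpack the index array directly. The following is a minimal sketch on a small hand-made array (an assumption for illustration, not the random walk above); it also uses the `prominence` argument to keep only peaks that stand out from their surroundings.

```python
import numpy as np
from scipy.signal import find_peaks

# Small hand-made series (illustrative assumption, not the random walk above)
y = np.array([0, 1, 0, 2, 1, 3, 1, 0, 1, 0])

# Unpack the (indices, properties) tuple to get the index arrays directly
peak_idx, _ = find_peaks(y)
trough_idx, _ = find_peaks(-y)   # troughs are peaks of the negated series

# Keep only peaks whose prominence is at least 2
prominent_idx, props = find_peaks(y, prominence=2)

print(peak_idx, trough_idx, prominent_idx)
```

Note that the first and last samples are never reported as peaks, since they have only one neighbour.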
Autocorrelation
Pandas provides an autocorrelation plot function.
import matplotlib.pyplot as plt

pd.plotting.autocorrelation_plot(df["y"])
plt.show()
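The plot can be complemented with a number. As a rough sketch, `Series.autocorr` gives the correlation between a series and a lagged copy of itself; the seed and variable names below are assumptions chosen for reproducibility. A random walk is strongly correlated with its own recent past, while its individual steps are not.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
steps = pd.Series(rng.choice([-1, 0, 1], size=10_000))
walk = steps.cumsum()

# Lag-1 autocorrelation: correlation between x_t and x_{t-1}
walk_lag1 = walk.autocorr(lag=1)
steps_lag1 = steps.autocorr(lag=1)

print(f"walk lag-1 autocorrelation:  {walk_lag1:.3f}")
print(f"steps lag-1 autocorrelation: {steps_lag1:.3f}")
```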
Differencing
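Differencing replaces each observation with its change from the previous one, $x_t - x_{t-1}$. For the random walk above this recovers the individual steps, which are stationary; the first value is NaN because it has no predecessor.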
stationary = df['y'].diff()
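To make the effect of `diff()` concrete, here is a small sketch that builds a random walk from known steps and checks that differencing gives those steps back. The seed and the shorter length are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Build a walk from known steps (seed and length are arbitrary choices)
rng = np.random.default_rng(1)
steps = rng.choice([-1, 0, 1], size=1_000)
walk = pd.Series(np.concatenate([[0], steps]).cumsum())

# First-order differencing: x_t - x_{t-1}
diffed = walk.diff()

# The first entry has no predecessor, so it is NaN; drop it before comparing
recovered = diffed.dropna().to_numpy()

print(np.array_equal(recovered, steps))
```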
Tools
Here we’ll look at some tools (mostly for Python) that support time-series analysis.
Data
import pandas as pd
df = pd.read_csv("../../data/streamad/uniDS.csv")
Tsfresh
Tsfresh (Time Series Feature Extraction Based on Scalable Hypothesis Tests) is a Python package that automatically calculates and extracts a large number of time series features for classification and regression. It is typically used for feature engineering.
from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute, make_forecasting_frame
from tsfresh.feature_extraction import ComprehensiveFCParameters, settings
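# Keep only the timestamp and value columns
data = df[['timestamp', 'value']].copy()
# Roll the series into windows of up to 100 past values, each paired with a target
df_pass, y_air = make_forecasting_frame(data.value,
                                        kind="value",
                                        max_timeshift=100,
                                        rolling_direction=1)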
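The idea behind a forecasting frame can be illustrated without tsfresh. The sketch below is a simplified, pandas-only approximation and not tsfresh's implementation: `rolling_windows` is a hypothetical helper, and its `id` values and column layout are assumptions that differ from what `make_forecasting_frame` actually returns. For each time point it collects up to `max_timeshift` preceding values as a window and stores the current value as the target to predict.

```python
import pandas as pd

def rolling_windows(x: pd.Series, max_timeshift: int):
    """Hypothetical helper: pair each value with the window of values before it."""
    rows, targets = [], {}
    for t in range(1, len(x)):
        start = max(0, t - max_timeshift)
        # All rows of one window share the same id (here simply the target position)
        for time, value in x.iloc[start:t].items():
            rows.append({"id": t, "time": time, "value": value})
        targets[t] = x.iloc[t]
    return pd.DataFrame(rows), pd.Series(targets, name="y")

# Tiny made-up series for illustration
x = pd.Series([10, 11, 13, 12, 15])
windows, y = rolling_windows(x, max_timeshift=2)

print(windows)
print(y)
```

Each window can then be summarised into features (for example with tsfresh's `extract_features`), with `y` serving as the regression target.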