Streaming statistics

Situations where streaming statistics are useful:

  • Unknown number of observations
  • Online streaming data
  • Dataset too big for local processing

For the remainder, let’s consider a sequence of \(n\) observations \(y_i\) with weights \(w_i\), such that

\[ y_1,\dots,y_n \in \mathbb{R} \\ w_1,\dots,w_n \in \mathbb{R},\quad w_i \geq 0 \]

Mean and variance

A naive approach to calculating a weighted streaming mean, \(\widehat{\mu}\), and an unbiased streaming variance, \(\widehat{\mathbb{V}}\), would be to calculate:

\begin{align*} \widehat{\mu}&=\frac{T^{(n)}}{S^{(n)}} \\ \widehat{\mathbb{V}}&=\frac{n}{(n-1)S^{(n)}}\left(U^{(n)}-S^{(n)}\widehat{\mu}^2\right) \end{align*}

where

\begin{align*} S^{(i)}&=S^{(i-1)}+w_i\\ T^{(i)}&=T^{(i-1)}+w_i y_i\\ U^{(i)}&=U^{(i-1)}+w_i y_i^2 \end{align*}

with \(S^{(0)}=T^{(0)}=U^{(0)}=0\).

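This calculation, however, is numerically unstable: \(U^{(n)}\) and \(S^{(n)}\widehat{\mu}^2\) are both large and nearly equal, so their difference loses most of its significant digits in floating-point arithmetic, particularly for large \(n\) or when the mean is large relative to the spread.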
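
The loss of precision can be reproduced with a small experiment. The sketch below is an illustration under assumed settings (unit weights, a fixed seed, and a mean of 1e9 with unit spread, none of which come from the text above); it uses only the Python standard library and compares the naive accumulator formula with a reference value computed from deviations.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable (assumption)
offset = 1e9             # mean that is huge relative to the spread (assumption)
ys = [offset + rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Naive accumulators with unit weights: S = sum w, T = sum w*y, U = sum w*y^2
S = T = U = 0.0
for y in ys:
    S += 1.0
    T += y
    U += y * y

n = len(ys)
mu_hat = T / S
# Subtracting two numbers of size ~1e22 to recover a quantity of size ~1e4
naive_var = n / ((n - 1) * S) * (U - S * mu_hat**2)

# Reference: sample variance computed from deviations around the mean
reference_var = statistics.variance(ys)

print(f"naive variance:     {naive_var}")
print(f"reference variance: {reference_var}")
```

The reference value stays close to 1, while the naive value is dominated by rounding error, because the true difference is far smaller than the spacing between adjacent floating-point numbers at that magnitude.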

An alternative calculation was suggested by West [1], in which the following quantities are updated incrementally, one observation at a time:

\begin{align*} \widehat{\mu}&=\frac{\sum_i w_i y_i}{\sum_i w_i} \\ \widehat{\mathbb{V}}&=\frac{\sum_i w_i(y_i-\widehat{\mu})^2}{\frac{n-1}{n}\sum_i w_i} \end{align*}
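
One way to write the incremental update behind these quantities is sketched below. The symbols \(q_i\), \(r_i\) and \(D^{(i)}\) are notation introduced here for illustration, and all running values are assumed to start at zero:

\begin{align*} q_i&=y_i-\widehat{\mu}^{(i-1)}\\ S^{(i)}&=S^{(i-1)}+w_i\\ r_i&=\frac{q_i w_i}{S^{(i)}}\\ \widehat{\mu}^{(i)}&=\widehat{\mu}^{(i-1)}+r_i\\ D^{(i)}&=D^{(i-1)}+q_i r_i S^{(i-1)}\\ \widehat{\mathbb{V}}&=\frac{n\,D^{(n)}}{(n-1)\,S^{(n)}} \end{align*}

Here \(D^{(n)}\) accumulates \(\sum_i w_i(y_i-\widehat{\mu})^2\) directly from deviations, so no difference of two large sums is ever formed.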

Let’s look at an example in Python.

import numpy as np

mu = 10.0
sigma = 20.0
N = 100_000

Y = np.random.normal(loc=mu, scale=sigma, size=N)

As expected, the mean and standard deviation calculated in “batch” mode are close to the original values \(\mu\) and \(\sigma\).

print(f"mean: {np.mean(Y)}")
print(f"std: {np.std(Y)}")

mean: 10.101511575087532
std: 20.008596944491394
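
The streaming version keeps only a handful of running values and updates them one observation at a time:

class StreamingStatistics:
    """Weighted running mean and unbiased variance, updated per observation."""

    def __init__(self):
        self.sum = 0.0   # running sum of weights
        self.mean = 0.0  # running weighted mean
        self.t = 0.0     # running weighted sum of squared deviations
        self.n = 0       # number of observations seen so far
        self.var = None  # variance estimate, set by the first call to calculate()

    def calculate(self, y, w):
        q = y - self.mean        # deviation from the current mean
        tmp_sum = self.sum + w   # total weight including this observation (must be > 0)
        r = q * w / tmp_sum      # correction applied to the mean

        self.mean += r
        self.t += q * r * self.sum  # uses the total weight before this observation
        self.sum = tmp_sum
        self.n += 1
        if self.sum == 0.0 or self.n < 2:
            self.var = 0.0
        else:
            self.var = (self.t * self.n) / (self.sum * (self.n - 1))
        return (self.mean, self.var)

means = []
stds = []  # running standard deviations (square roots of the running variance)
st = StreamingStatistics()
for y in Y:
    m, v = st.calculate(y, 1.0)
    means.append(m)
    stds.append(np.sqrt(v))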
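
As a quick sanity check, the final streaming estimates can be compared with batch values. The sketch below is self-contained and uses only the standard library rather than NumPy; the helper `west_update`, the tuple-based state, and the fixed seed are assumptions made for illustration and are not part of the code above.

```python
import math
import random
import statistics

def west_update(state, y, w):
    """Apply one weighted update; state is (sum_w, mean, t, n)."""
    sum_w, mean, t, n = state
    q = y - mean              # deviation from the current mean
    new_sum = sum_w + w       # total weight including this observation
    r = q * w / new_sum       # correction applied to the mean
    return (new_sum, mean + r, t + q * r * sum_w, n + 1)

rng = random.Random(42)       # fixed seed so the run is repeatable (assumption)
ys = [rng.gauss(10.0, 20.0) for _ in range(100_000)]

state = (0.0, 0.0, 0.0, 0)
for y in ys:
    state = west_update(state, y, 1.0)   # unit weights

sum_w, mean, t, n = state
var = t * n / (sum_w * (n - 1))          # unbiased variance estimate

print(f"streaming mean: {mean}  batch mean: {statistics.fmean(ys)}")
print(f"streaming std:  {math.sqrt(var)}  batch std: {statistics.stdev(ys)}")
```

With unit weights the estimate reduces to the usual sample variance, so the comparison uses `statistics.stdev`, which also divides by \(n-1\).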

  1. West, D. H. D. (1979). Updating mean and variance estimates: an improved method. Communications of the ACM, 22(9), 532–535.