Streaming statistics
Situations where streaming statistics are useful:
- Unknown number of observations
- Online streaming data
- Dataset too big for local processing
For the remainder, let’s consider a set of \(n\) observations \(y_i\) with weights \(w_i\), such that
\[ y_1,\dots,y_n \in \mathbb{R} \\ w_1,\dots,w_n,\quad w_i \geq 0 \]
- Mean and variances
A naive approach to calculating a weighted streaming mean, \(\widehat{\mu}\), and an unbiased streaming variance, \(\widehat{\mathbb{V}}\), would be to calculate:
\[ \begin{align*} \widehat{\mu}&=\frac{T^{(n)}}{S^{(n)}} \\ \widehat{\mathbb{V}}&=\frac{n}{(n-1)S^{(n)}}\left(U^{(n)}-S^{(n)}\widehat{\mu}^2\right) \end{align*} \]
where
\[ \begin{align*} S^{(i)}&=S^{(i-1)}+w_i\\ T^{(i)}&=T^{(i-1)}+w_i y_i\\ U^{(i)}&=U^{(i-1)}+w_i y^2_i \end{align*} \]
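This calculation, however, is numerically unstable for large \(n\): \(U^{(n)}\) and \(S^{(n)}\widehat{\mu}^2\) grow into large, nearly equal quantities, and subtracting them in floating-point arithmetic can lose most of the significant digits.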
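A minimal sketch of how the naive formula can break down in double precision. The function names `naive_streaming_stats` and `two_pass_variance` and the four sample values are illustrative assumptions, not part of the original; the values share a large offset so that \(U^{(n)}\) and \(S^{(n)}\widehat{\mu}^2\) are large and nearly equal.

```python
def naive_streaming_stats(ys, ws):
    """Naive weighted mean and variance from the running sums S, T and U."""
    S = T = U = 0.0
    n = 0
    for y, w in zip(ys, ws):
        S += w          # running sum of weights
        T += w * y      # running weighted sum of observations
        U += w * y * y  # running weighted sum of squared observations
        n += 1
    mean = T / S
    # Subtracts two large, nearly equal numbers: this is where precision is lost.
    var = n / ((n - 1) * S) * (U - S * mean * mean)
    return mean, var


def two_pass_variance(ys):
    """Reference: unbiased sample variance computed in two passes (unit weights)."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / (n - 1)


# Illustrative data (an assumption): small spread on top of a large offset.
offset = 1e9
ys = [offset + d for d in (4.0, 7.0, 13.0, 16.0)]
ws = [1.0] * len(ys)

naive_mean, naive_var = naive_streaming_stats(ys, ws)
reference_var = two_pass_variance(ys)

print(f"naive variance:    {naive_var}")
print(f"two-pass variance: {reference_var}")
```

With standard 64-bit floats the two-pass reference is 30, whereas the naive estimate comes out negative, which is impossible for a variance.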
An alternative calculation was suggested by West [1], which updates the following quantities incrementally instead of accumulating raw sums:
\[ \begin{align*} \widehat{\mu}&=\frac{\sum_i w_i y_i}{\sum_i w_i} \\ \widehat{\mathbb{V}}&=\frac{\sum_i w_i(y_i-\widehat{\mu})^2}{\frac{n-1}{n}\sum_i w_i} \end{align*} \]
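For reference, the incremental update can be written out as follows. This is a sketch reconstructed from the `calculate` method in the Python example; the symbols \(Q_i\), \(R_i\) and \(t^{(i)}\) are notation introduced here (lowercase \(t\) mirrors `self.t` and is distinct from \(T^{(i)}\) above), with \(\mu^{(0)}=S^{(0)}=t^{(0)}=0\):
\[ \begin{align*} Q_i&=y_i-\mu^{(i-1)}\\ S^{(i)}&=S^{(i-1)}+w_i\\ R_i&=\frac{Q_i w_i}{S^{(i)}}\\ \mu^{(i)}&=\mu^{(i-1)}+R_i\\ t^{(i)}&=t^{(i-1)}+Q_i R_i S^{(i-1)}\\ \widehat{\mathbb{V}}&=\frac{n\,t^{(n)}}{(n-1)\,S^{(n)}} \end{align*} \]
Here \(t^{(n)}\) accumulates \(\sum_i w_i(y_i-\widehat{\mu})^2\) directly, so no two large sums are ever subtracted from each other.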
Let’s look at an example in Python.
from plotutils import *
import numpy as np

mu = 10.0
sigma = 20.0
N = 100_000

Y = np.random.normal(loc=mu, scale=sigma, size=N)

import matplotlib.pyplot as plt
import matplotlib

matplotlib.rc('figure', figsize=(12, 6))
plt.hist(Y, bins=50, color="lightpink")
plt.show()
As expected, the mean and standard deviation calculated in “batch” mode are close to the original values \(\mu\) and \(\sigma\).
print(f"mean: {np.mean(Y)}")
print(f"std: {np.std(Y)}")
mean: 10.040100563607174
std: 19.946298541957912
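class StreamingStatistics:
    def __init__(self):
        self.sum = 0.0   # running sum of weights
        self.mean = 0.0  # running weighted mean
        self.t = 0.0     # running weighted sum of squared deviations
        self.n = 0       # number of observations seen so far
        self.var = None

    def calculate(self, y, w):
        q = y - self.mean        # deviation from the current mean
        tmp_sum = self.sum + w   # updated sum of weights
        r = q*w / tmp_sum        # correction applied to the mean

        self.mean += r
        self.t += q*r*self.sum
        self.sum = tmp_sum
        self.n += 1
        if (self.sum == 0.0 or self.n < 2):
            self.var = 0
        else:
            self.var = (self.t*self.n)/(self.sum*(self.n-1))
        return (self.mean, self.var)

means = []
vars = []  # holds the streaming standard deviation, sqrt(variance)
st = StreamingStatistics()
for y in Y:
    m, v = st.calculate(y, 1.0)
    means.append(m)
    vars.append(np.sqrt(v))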
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Streaming statistics for $N=10^5$ observations')

ax1.plot(means)
ax1.hlines(y=10, xmin=0, xmax=N, colors="red")
ax1.set_title("Streaming mean")
ax2.plot(vars)
ax2.hlines(y=20, xmin=0, xmax=N, colors="red")
ax2.set_title("Streaming standard deviation")
plt.show()
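As a numeric cross-check that needs no plotting, the update rule used in `calculate` can be condensed into a single function and compared against the direct weighted definition. This is only a sketch: `west_variance`, `direct_variance` and both small datasets are hypothetical names and values chosen for illustration.

```python
def west_variance(ys, ws):
    """Weighted mean and unbiased variance via the incremental update."""
    mean, sum_w, t, n = 0.0, 0.0, 0.0, 0
    for y, w in zip(ys, ws):
        q = y - mean          # deviation from the current mean
        new_sum = sum_w + w   # updated sum of weights
        r = q * w / new_sum   # correction applied to the mean
        mean += r
        t += q * r * sum_w    # running weighted sum of squared deviations
        sum_w = new_sum
        n += 1
    if n < 2 or sum_w == 0.0:
        return mean, 0.0
    return mean, t * n / (sum_w * (n - 1))


def direct_variance(ys, ws):
    """Same quantities computed in two passes, straight from the definition."""
    n = len(ys)
    sum_w = sum(ws)
    mean = sum(w * y for y, w in zip(ys, ws)) / sum_w
    ss = sum(w * (y - mean) ** 2 for y, w in zip(ys, ws))
    return mean, ss / ((n - 1) / n * sum_w)


# Illustrative data (assumptions): a large-offset set with unit weights,
# and a small set with unequal weights.
offset_ys = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
offset_ws = [1.0] * len(offset_ys)
weighted_ys = [1.0, 2.0, 4.0]
weighted_ws = [1.0, 2.0, 1.0]

west_offset = west_variance(offset_ys, offset_ws)
direct_offset = direct_variance(offset_ys, offset_ws)
west_weighted = west_variance(weighted_ys, weighted_ws)
direct_weighted = direct_variance(weighted_ys, weighted_ws)

print("offset data:  ", west_offset, direct_offset)
print("weighted data:", west_weighted, direct_weighted)
```

Both datasets should give matching means and variances from the two functions, including the large-offset case where a sum-of-squares approach would struggle.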
Footnotes
[1] West, D. (1979). Updating mean and variance estimates: an improved method. Communications of the ACM, 22(9), 532–535.