  • How much data is enough? Even if we have (conceptually) infinite data available, it may be the case that we do not need all of it to obtain the best possible model of the type being mined. Assuming the data-generating process is stationary, is there some point at which we can “turn off” the stream and know that we will not lose predictive performance by ignoring further data? More precisely, how much data do we need at each step of the mining algorithm before we can go on to the next one?

  • If the data-generating process is not stationary, how do we make the trade-off between being up-to-date and not losing past information that is still relevant? In the traditional method of mining a sliding window of data, a large window leads to slow adaptation, but a small one leads to loss of relevant information and overly simple models (a small sketch of this trade-off follows below). Can we overcome this trade-off?

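To make this trade-off concrete, here is a minimal illustrative Python sketch (not from the abstract): a sliding-window mean over a synthetic stream whose true mean shifts partway through. A small window tracks the shift quickly but retains little history; a large window is smoother but slow to adapt.

    from collections import deque

    def sliding_window_mean(stream, window_size):
        """Mean of the most recent window_size items seen so far, at each time step."""
        window = deque(maxlen=window_size)
        means = []
        for x in stream:
            window.append(x)                  # the oldest item is evicted automatically
            means.append(sum(window) / len(window))
        return means

    # Synthetic stream whose true mean jumps from 0.0 to 1.0 at time step 500.
    stream = [0.0] * 500 + [1.0] * 500

    small = sliding_window_mean(stream, 10)   # adapts within ~10 steps, but noisy in general
    large = sliding_window_mean(stream, 400)  # smooth, but still dominated by old data
    print(small[520], large[520])             # 1.0 vs. 0.0525 shortly after the shift
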
In the remainder of this extended abstract we describe how our framework addresses these questions. Further aspects of the framework are described in Hulten and Domingos (2002).

2 The Framework

A number of well-known results in statistics provide probabilistic bounds on the difference between the true value of a parameter and its empirical estimate from finite data. For example, consider a real-valued random variable x whose range is R. Suppose we have made n independent observations of this variable, and computed their mean x̄. The Hoeffding bound (Hoeffding, 1963) (also known as additive Chernoff bound) states that, with probability at least 1−δ, and irrespective of the true distribution of x, the true mean of the variable is within ε of x̄, where ε = √(R² ln(2/δ) / (2n)).

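As a concrete illustration (not part of the original abstract), the following short Python sketch evaluates this ε for a given number of samples, and inverts the bound to give the number of samples needed to reach a target ε; the function names are ours.

    import math

    def hoeffding_epsilon(value_range, n, delta):
        """Epsilon such that, with probability at least 1 - delta, the mean of n
        independent observations of a variable with range value_range is within
        epsilon of the true mean, whatever the underlying distribution."""
        return math.sqrt(value_range ** 2 * math.log(2.0 / delta) / (2.0 * n))

    def samples_needed(value_range, epsilon, delta):
        """Smallest n for which the Hoeffding epsilon is at most the target epsilon."""
        return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

    # Example: a variable with range R = 1 (e.g., a proportion) and delta = 1e-6.
    print(hoeffding_epsilon(1.0, n=10_000, delta=1e-6))   # about 0.027
    print(samples_needed(1.0, epsilon=0.01, delta=1e-6))  # 72544 samples
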
Put another way, this result says that, if we only care about determining the mean of x to within ε of its true value, and are willing to accept a probability δ of failing to do so, we need gather only n = (R²/(2ε²)) ln(2/δ) samples of x. Gathering more samples (even infinitely many) produces an essentially equivalent result. The key idea underlying our framework is to “bootstrap” these results, which apply to individual parameters, to similar guarantees on the difference (loss) between the whole complex model mined from finite data and the model that would be obtained from infinite data in infinite time. The high-level approach we use consists of three steps (a schematic sketch follows the list):

  1. Derive an upper bound on the time complexity of the mining algorithm, as a function of the number of samples used in each step.

  2. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm.

  3. Minimize the time bound (via the number of samples used in each step) subject to user-defined limits on the loss.

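The hypothetical Python sketch below illustrates the flavor of steps 2 and 3 for a single, very simple mining step (estimating a mean from a stream): samples are read only until the Hoeffding ε falls below a user-specified loss tolerance, at which point the step's decision is committed. The function names, the fixed batch size, and the choice of statistic are illustrative assumptions, not the authors' actual algorithm.

    import math
    import random

    def hoeffding_epsilon(value_range, n, delta):
        return math.sqrt(value_range ** 2 * math.log(2.0 / delta) / (2.0 * n))

    def mine_step(stream, statistic, value_range, loss_tolerance, delta, batch=1000):
        """One step of a stream-mining algorithm, in the spirit of steps 2-3:
        read samples only until the Hoeffding epsilon for the step's statistic
        is below the loss tolerance, then commit the decision and move on."""
        samples = []
        while True:
            samples.extend(next(stream) for _ in range(batch))
            eps = hoeffding_epsilon(value_range, len(samples), delta)
            if eps < loss_tolerance:
                return statistic(samples), len(samples), eps

    # Illustrative use: estimate the mean of a Bernoulli(0.3) stream (range R = 1)
    # to within 0.01 with probability at least 0.999, reading only as many samples as needed.
    stream = iter(lambda: random.random() < 0.3, None)   # endless 0/1 stream
    estimate, n_used, eps = mine_step(stream, lambda s: sum(s) / len(s),
                                      value_range=1.0, loss_tolerance=0.01, delta=1e-3)
    print(f"estimate={estimate:.3f} from n={n_used} samples (epsilon={eps:.4f})")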