Self-Similarity of Network Data Analysis


In computer networks, self-similarity is a feature of network data transfer dynamics. Traditional time series models, such as the autoregressive moving average (ARMA) model, are not appropriate for modeling these dynamics: they have only a finite number of parameters and therefore capture interaction only within a finite time window, whereas network data usually have a long-range dependent temporal structure. A self-similar process is one way of modeling network data dynamics with such long-range correlation. This article defines and describes network data transfer dynamics in the context of a self-similar process. It presents properties of the process and gives methods for graphing the data and estimating the parameters that model its self-similarity.

Definition

Suppose $X$ is a weakly stationary (second-order stationary) process with mean $\mu$, variance $\sigma^2$, and autocorrelation function $\gamma(t)$. Assume that the autocorrelation function has the form $\gamma(t)\sim t^{-\beta}L(t)$ as $t\to\infty$, where $0<\beta<1$ and $L(t)$ is a slowly varying function at infinity, that is, $\lim_{t\to\infty}\frac{L(tx)}{L(t)}=1$ for all $x>0$. For example, $L(t)=\text{const}$ and $L(t)=\log(t)$ are slowly varying functions.
Let $X_k^{(m)}=\frac{1}{m^H}\left(X_{km-m+1}+\cdots+X_{km}\right)$, where $k=1,2,3,\ldots$, denote the aggregated series over non-overlapping blocks of size $m$, for each positive integer $m$.
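The aggregation step can be sketched in a few lines of Python. This is only an illustration: the function name `aggregate` and the choice to drop an incomplete trailing block are assumptions, not part of the definition.

```python
def aggregate(x, m, H):
    """Aggregated series X_k^(m) = m**(-H) * (X_{km-m+1} + ... + X_{km}).

    Blocks are non-overlapping and of size m; a trailing block with fewer
    than m points is dropped (an assumption made for this sketch).
    """
    if m < 1:
        raise ValueError("block size m must be a positive integer")
    n_blocks = len(x) // m
    return [sum(x[k * m:(k + 1) * m]) / m ** H for k in range(n_blocks)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(aggregate(x, m=2, H=1.0))  # with H = 1 these are block means: [1.5, 3.5, 5.5]
```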

Exactly self-similar process

  • $X$ is called an exactly self-similar process if there exists a self-similarity parameter $H$ such that $X_k^{(m)}$ has the same distribution as $X$. An example of an exactly self-similar process with parameter $H$ is fractional Gaussian noise (FGN) with $\frac{1}{2}<H<1$.

Definition: Fractional Gaussian Noise (FGN)
$X(t)=B_H(t+1)-B_H(t)$, $t\ge 1$, is called fractional Gaussian noise, where $B_H(\cdot)$ is a fractional Brownian motion.[1]
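A short simulation can make the definition concrete. The sketch below assumes the standard autocovariance of unit-variance FGN, $\gamma(k)=\frac{1}{2}\left(|k+1|^{2H}-2|k|^{2H}+|k-1|^{2H}\right)$, and draws a path by multiplying independent standard normal values by the Cholesky factor of the covariance matrix. The function names are hypothetical, and the Cholesky approach is chosen for clarity; it costs on the order of $n^3$ operations, so it suits only short series.

```python
import math
import random


def fgn_autocov(k, H):
    """Autocovariance of unit-variance FGN at lag k (assumed standard form)."""
    k = abs(k)
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + abs(k - 1) ** (2 * H))


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def simulate_fgn(n, H, seed=0):
    """Draw n points of FGN with parameter H (0 < H < 1)."""
    rng = random.Random(seed)
    cov = [[fgn_autocov(i - j, H) for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # X = L z has covariance L L^T, which is the FGN covariance matrix
    return [sum(low[i][j] * z[j] for j in range(i + 1)) for i in range(n)]


path = simulate_fgn(100, H=0.8)
print(len(path))
```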

Exactly second-order self-similar process

  • $X$ is called an exactly second-order self-similar process if there exists a self-similarity parameter $H$ such that $X_k^{(m)}$ has the same variance and autocorrelation as $X$.

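Asymptotically second-order self-similar process

  • $X$ is called an asymptotically second-order self-similar process with self-similarity parameter $H$ if $\gamma^{(m)}(t)\to\frac{1}{2}\left[(t+1)^{2H}-2t^{2H}+(t-1)^{2H}\right]$ as $m\to\infty$, for all $t=1,2,3,\ldots$, where $\gamma^{(m)}$ denotes the autocorrelation function of the aggregated series $X^{(m)}$.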
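The limiting autocorrelation in this definition decays like a power of the lag, which connects $H$ to the exponent $\beta$ used earlier. The expansion below is a worked check rather than part of the original text; it uses only the Taylor expansion $(1+u)^{a}+(1-u)^{a}=2+a(a-1)u^{2}+O(u^{4})$ with $a=2H$ and $u=1/t$.

```latex
\begin{align*}
\tfrac{1}{2}\left[(t+1)^{2H} - 2t^{2H} + (t-1)^{2H}\right]
  &= \tfrac{1}{2}\, t^{2H}\left[\left(1+\tfrac{1}{t}\right)^{2H} - 2 + \left(1-\tfrac{1}{t}\right)^{2H}\right] \\
  &= \tfrac{1}{2}\, t^{2H}\left[2H(2H-1)\,t^{-2} + O(t^{-4})\right] \\
  &= H(2H-1)\,t^{2H-2} + O(t^{2H-4}), \qquad t \to \infty .
\end{align*}
```

Comparing with $\gamma(t)\sim t^{-\beta}L(t)$ gives $\beta=2-2H$ with constant $L(t)=H(2H-1)$, so the condition $0<\beta<1$ corresponds to $\frac{1}{2}<H<1$.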

Some related properties of self-similar processes

Long-range dependence (LRD)

Suppose $X(t)$ is a weakly stationary (second-order stationary) process with mean $\mu$ and variance $\sigma^2$. The autocorrelation function (ACF) at lag $t$ is given by
$\gamma(t)=\frac{\operatorname{cov}(X(h),X(h+t))}{\sigma^2}=\frac{E[(X(h)-\mu)(X(h+t)-\mu)]}{\sigma^2}$

Definition: A weakly stationary process is said to have long-range dependence if $\sum_{t=0}^{\infty}|\gamma(t)|=\infty$.

A process that satisfies $\gamma(t)\sim t^{-\beta}L(t)$ as $t\to\infty$ is said to have long-range dependence. The spectral density function of a long-range dependent process follows a power law near the origin. Equivalently to $\gamma(t)\sim t^{-\beta}L(t)$, $X$ has long-range dependence if the spectral density function of the autocorrelation function, $f_t(w)=\sum_{t=0}^{\infty}\gamma(t)e^{iwt}$, has the form $w^{-\gamma}L(w)$ as $w\to 0$, where $0<\gamma<1$ and $L$ is slowly varying at 0. Here the exponent $\gamma$ is a constant, distinct from the autocorrelation function $\gamma(t)$.

Slowly decaying variances

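Let $X^{(m)}=\frac{1}{m}(X_1+\cdots+X_m)$ denote the sample mean of a block of size $m$.
When the autocorrelation function of a self-similar process satisfies $\gamma(t)\sim t^{-\beta}L(t)$ as $t\to\infty$, the process also satisfies $\operatorname{Var}(X^{(m)})\sim a\,m^{-\beta}$ as $m\to\infty$, where $a$ is a finite positive constant independent of $m$, and $0<\beta<1$.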
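For fractional Gaussian noise this decay can be checked numerically. The sketch below assumes the standard unit-variance FGN autocovariance and the identity $\operatorname{Var}(X^{(m)})=m^{-2}\sum_{i,j}\gamma(i-j)$ for a stationary series; the function names are hypothetical. Under these assumptions the variance of the block mean equals $m^{2H-2}$, that is, $a=1$ and $\beta=2-2H$.

```python
def fgn_autocov(k, H):
    """Autocovariance of unit-variance FGN at lag k (assumed standard form)."""
    k = abs(k)
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + abs(k - 1) ** (2 * H))


def var_of_block_mean(m, H):
    """Var(X^(m)) computed from the autocovariances of the m summands."""
    # sum over all pairs (i, j): lag 0 appears m times, lag k appears 2(m - k) times
    total = m * fgn_autocov(0, H)
    total += 2 * sum((m - k) * fgn_autocov(k, H) for k in range(1, m))
    return total / m ** 2


H = 0.8
for m in (1, 4, 16, 64):
    # the two columns should agree up to rounding error
    print(m, var_of_block_mean(m, H), m ** (2 * H - 2))
```

With $H=0.8$ the variance falls like $m^{-0.4}$, much more slowly than the $m^{-1}$ rate of independent observations.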

Estimating the self-similarity parameter "H"

R/S analysis

Assume that the underlying process $X$ is fractional Gaussian noise. Consider the series $X(1),\ldots,X(n)$, and let $Y_n=\sum_{i=1}^{n}X(i)$.
The sample variance of $X(i)$ is $S^2(n)=\frac{1}{n}\sum_{i=1}^{n}X(i)^2-\left(\frac{1}{n}\right)^2Y_n^2$.

Definition: R/S statistic
$\frac{R}{S}(n)=\frac{1}{S(n)}\left[\max_{0\le t\le n}\left(Y_t-\frac{t}{n}Y_n\right)-\min_{0\le t\le n}\left(Y_t-\frac{t}{n}Y_n\right)\right]$

If $X(i)$ is FGN, then $E\left(\frac{R}{S}(n)\right)\sim C_H\times n^H$ as $n\to\infty$.
Consider fitting the regression model $\log\frac{R}{S}(n)=\log(C_H)+H\log(n)+\epsilon_n$, where $\epsilon_n\sim N(0,\sigma^2)$.
In particular, for a time series of length $N$, divide the data into $k$ groups, each of size $n=\frac{N}{k}$, and compute $\frac{R}{S}(n)$ for each group.
Thus for each $n$ there are $k$ pairs of data $\left(\log(n),\log\frac{R}{S}(n)\right)$. Because there are $k$ points for each $n$, the regression can estimate $H$ more accurately. If the slope of the regression line lies between 0.5 and 1, the series is consistent with a self-similar process.
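The procedure above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the choice of block sizes, and the handling of constant blocks are assumptions.

```python
import math
import random


def rescaled_range(x):
    """R/S statistic of one block, following the definition above."""
    n = len(x)
    mean = sum(x) / n  # Y_n / n
    # S^2(n) = (1/n) * sum X(i)^2 - (Y_n / n)^2, clipped at 0 against rounding
    s = math.sqrt(max(sum(v * v for v in x) / n - mean * mean, 0.0))
    if s == 0.0:
        return 0.0  # constant block: R/S is undefined, flagged here as 0
    # Y_t - (t/n) Y_n for t = 0..n equals the running sum of (X(i) - mean)
    deviations, running = [0.0], 0.0
    for v in x:
        running += v - mean
        deviations.append(running)
    return (max(deviations) - min(deviations)) / s


def fit_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


def estimate_hurst_rs(x, block_sizes):
    """Regress log R/S(n) on log n, using every non-overlapping block."""
    log_n, log_rs = [], []
    for n in block_sizes:
        for start in range(0, len(x) - n + 1, n):
            rs = rescaled_range(x[start:start + n])
            if rs > 0.0:  # skip blocks where R/S is undefined
                log_n.append(math.log(n))
                log_rs.append(math.log(rs))
    return fit_slope(log_n, log_rs)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
print(estimate_hurst_rs(noise, [16, 32, 64, 128, 256, 512]))
```

For independent noise the estimate should lie near 0.5, although R/S estimates tend to run somewhat high when small block sizes are included.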

Variance-time plot

The variance of the sample mean satisfies $\operatorname{Var}(\bar X_n)\sim c\,n^{2H-2}$ as $n\to\infty$, with $c>0$.
To estimate $H$, calculate the sample means $\bar X_1(k),\bar X_2(k),\ldots,\bar X_{m_k}(k)$ of $m_k$ sub-series of length $k$.
The overall mean is $\bar X(k)=\frac{1}{m_k}\sum_{i=1}^{m_k}\bar X_i(k)$, and the sample variance is $S^2(k)=\frac{1}{m_k-1}\sum_{i=1}^{m_k}\left(\bar X_i(k)-\bar X(k)\right)^2$.
The variance-time plot is obtained by plotting $\log S^2(k)$ against $\log k$; a simple least-squares line can then be fitted through the resulting points, ignoring the small values of $k$.

For large values of $k$, the points in the plot are expected to scatter around a straight line with negative slope $2H-2$. For short-range dependence or independence among the observations, the slope of the straight line equals $-1$.
Self-similarity can be inferred from an estimated slope that lies asymptotically between $-1$ and $0$, and an estimate of the degree of self-similarity is given by $\hat H=1+\frac{1}{2}(\text{slope})$.
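A variance-time estimate along these lines might look like the sketch below. The function names and the particular block sizes are assumptions; in practice the block sizes would be chosen by inspecting the plot and discarding small $k$.

```python
import math
import random


def fit_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


def variance_time_hurst(x, block_sizes):
    """Estimate H as 1 + slope / 2 from the variance-time plot."""
    log_k, log_var = [], []
    for k in block_sizes:
        m_k = len(x) // k  # number of complete sub-series of length k
        if m_k < 2:
            continue  # a sample variance needs at least two block means
        means = [sum(x[i * k:(i + 1) * k]) / k for i in range(m_k)]
        overall = sum(means) / m_k
        s2 = sum((v - overall) ** 2 for v in means) / (m_k - 1)
        log_k.append(math.log(k))
        log_var.append(math.log(s2))
    slope = fit_slope(log_k, log_var)
    return 1.0 + 0.5 * slope


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
print(variance_time_hurst(noise, [4, 8, 16, 32, 64, 128]))
```

For independent observations the fitted slope should be close to $-1$, giving an estimate near $H=0.5$.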

Periodogram-based analysis

Whittle's approximate maximum likelihood estimator (MLE) estimates the Hurst parameter via the spectral density of $X$. It is not only a tool for visualizing the Hurst parameter, but also a method for statistical inference about the parameters via the asymptotic properties of the MLE. Assume that $X$ is a Gaussian process. Let the spectral density of $X$ be $f_x(w;\theta)=\sigma_\epsilon^2 f_x(w;(1,\eta))$, where $\theta=(\sigma_\epsilon^2,\eta)=(\sigma_\epsilon^2,H,\theta_3,\ldots,\theta_k)$, $H=\frac{\gamma+1}{2}$, and $\theta_3,\ldots,\theta_k$ parameterize a short-range autoregressive (AR) time series model, that is, $X_j=\sum_{i=1}^{k}\alpha_iX_{j-i}+\epsilon_j$ with $\operatorname{Var}(\epsilon_j)=\sigma_\epsilon^2$.

Whittle's estimator $\hat\eta$ of $\eta$ then minimizes the function $Q(\eta)=\int_{-\pi}^{\pi}\frac{I(w)}{f(w;(1,\eta))}\,dw$, where $I(w)=(2\pi n)^{-1}\left|\sum_{j=1}^{n}X_je^{iwj}\right|^2$ denotes the periodogram of $X$, and $\hat\sigma^2=\int_{-\pi}^{\pi}\frac{I(w)}{f(w;(1,\hat\eta))}\,dw$. These integrals can be approximated by Riemann sums.
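The minimization of $Q(\eta)$ by a Riemann sum can be illustrated with a deliberately simplified model. The sketch below does not use the FGN-plus-AR spectral density described above; as a stand-in it assumes the spectral shape $g(w;H)=|2\sin(w/2)|^{1-2H}$ of a fractionally differenced series with no AR terms, so that $\eta=H$ is the only parameter. The function names and the grid search over $H$ are likewise assumptions.

```python
import cmath
import math
import random


def periodogram(x, n_freq):
    """I(w_j) = |sum_t X_t e^{i w_j t}|^2 / (2 pi n) at w_j = 2 pi j / n, j = 1..n_freq."""
    n = len(x)
    freqs, values = [], []
    for j in range(1, n_freq + 1):
        w = 2.0 * math.pi * j / n
        dft = sum(x[t] * cmath.exp(1j * w * (t + 1)) for t in range(n))
        freqs.append(w)
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return freqs, values


def whittle_hurst(x, grid_size=99):
    """Grid search for the H that minimizes a Riemann sum of I(w) / g(w; H)."""
    n = len(x)
    # Fourier frequencies in (0, pi); the integrand is symmetric in w
    freqs, values = periodogram(x, (n - 1) // 2)
    best_h, best_q = None, float("inf")
    for step in range(1, grid_size + 1):
        h = step / (grid_size + 1)  # candidates spaced evenly inside (0, 1)
        # 1 / g(w; H) = (2 sin(w/2)) ** (2H - 1), assumed model shape
        q = sum(i_w * (2.0 * math.sin(w / 2.0)) ** (2.0 * h - 1.0)
                for w, i_w in zip(freqs, values))
        if q < best_q:
            best_h, best_q = h, q
    return best_h


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(512)]
print(whittle_hurst(noise))
```

The direct evaluation of the periodogram costs on the order of $n^2$ operations; a fast Fourier transform would be the usual choice for long series.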
Under this model, $n^{1/2}(\hat\theta-\theta)$ is asymptotically normally distributed if $X_j$ can be expressed as an infinite moving average.

To estimate $H$, one first calculates the periodogram. Since $I_n(w)$ is an estimator of the spectral density, a series with long-range dependence should have a periodogram that is proportional to $|w|^{1-2H}$ close to the origin. The periodogram plot is obtained by plotting $\log(I_n(w))$ against $\log(w)$.
Fitting a regression of $\log(I_n(w))$ on $\log(w)$ then gives a slope $\hat\beta$, which is an estimate of $1-2H$. The estimate of the Hurst parameter is therefore $\hat H=\frac{1-\hat\beta}{2}$.
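The log-log regression on the periodogram can be sketched as below. The function names are hypothetical, and restricting the fit to the lowest 10% of the Fourier frequencies is an assumption made here because the power law is stated only near the origin; the text does not prescribe a cutoff.

```python
import cmath
import math
import random


def periodogram(x, n_freq):
    """I(w_j) = |sum_t X_t e^{i w_j t}|^2 / (2 pi n) at w_j = 2 pi j / n, j = 1..n_freq."""
    n = len(x)
    freqs, values = [], []
    for j in range(1, n_freq + 1):
        w = 2.0 * math.pi * j / n
        dft = sum(x[t] * cmath.exp(1j * w * (t + 1)) for t in range(n))
        freqs.append(w)
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return freqs, values


def fit_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


def periodogram_hurst(x, fraction=0.1):
    """Estimate H = (1 - slope) / 2 from the low-frequency periodogram."""
    n_freq = max(2, int(fraction * (len(x) // 2)))  # lowest share of frequencies
    freqs, values = periodogram(x, n_freq)
    slope = fit_slope([math.log(w) for w in freqs],
                      [math.log(v) for v in values])
    return (1.0 - slope) / 2.0


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
print(periodogram_hurst(noise, fraction=0.5))
```

Individual periodogram ordinates are very noisy, so an estimate based on only a few low frequencies can vary widely from one sample to the next.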

Note:
Two problems commonly arise when applying the periodogram method. First, the data may not follow a Gaussian distribution; a transformation of the data can address this. Second, the sample spectrum may deviate from the assumed spectral density; an aggregation method is suggested for this case. If $X$ is a Gaussian process and the spectral density function of $X$ satisfies $w^{-\gamma}L(w)$ as $w\to 0$, then the function $m^{-H}L^{-\frac{1}{2}}(m)\sum_{i=(j-1)m+1}^{jm}\left(X_i-E(X_i)\right)$, $j=1,2,\ldots,\left[\frac{n}{m}\right]$, converges in distribution to FGN as $m\to\infty$.

References

  • P. Whittle, "Estimation and information in stationary time series", Ark. Mat. 2, 423-434, 1953.
  • K. Park, W. Willinger, Self-Similar Network Traffic and Performance Evaluation, Wiley, 2000.
  • W. E. Leland, W. Willinger, M. S. Taqqu, D. V. Wilson, "On the self-similar nature of Ethernet traffic", ACM SIGCOMM Computer Communication Review 25, 202-213, 1995.
  • W. Willinger, M. S. Taqqu, W. E. Leland, D. V. Wilson, "Self-Similarity in High-Speed Packet Traffic: Analysis and Modeling of Ethernet Traffic Measurements", Statistical Science 10, 67-85, 1995.

  1. W. E. Leland, W. Willinger, M. S. Taqqu, D. V. Wilson, "On the self-similar nature of Ethernet traffic", ACM SIGCOMM Computer Communication Review 25, 202-213, 1995.