Cumulative frequency analysis

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by LasseMakkonen (talk | contribs) at 12:41, 9 February 2008 (→‎Definitions and explanations). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Cumulative frequency analysis is the analysis of the frequency of occurrence of values of a phenomenon less than a reference value. The phenomenon may be time or space dependent.
Cumulative frequency is also called frequency of non-exceedance.

Purpose

Frequency analysis is done to obtain insight into how often a certain phenomenon (feature) occurs [1]. This may help in describing or explaining a situation in which the phenomenon is involved, or in planning interventions.
This article discusses the construction of a cumulative frequency distribution of a phenomenon, the formulation of confidence statements (required when the distribution is used for prediction), the fitting of the distribution to known theoretical probability distributions, and the derivation of the frequency of occurrence of the phenomenon within certain class limits.

Definitions and explanations

Frequency analysis [2] applies to a record of observed data on a variable phenomenon. The record may be time dependent (e.g. rainfall measured in one spot) or space dependent (e.g. crop yields in an area) or otherwise.
The observed data can be arranged in classes or groups. Each group has a lower limit and an upper limit. When a class contains m data and the total number of data is n, the class or group frequency (Fg) is found from Fg = m/n, or 100m/n when expressed as a percentage.
The presentation of all class frequencies gives a frequency distribution. Frequency distributions made from the same record differ when the classification differs.
Hence, frequency distributions are classification dependent.
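
The dependence on the classification can be illustrated with a minimal Python sketch. The record, the class limits and the function name class_frequencies are hypothetical, and the convention that a class includes its lower limit but not its upper limit is an assumption of this example.

```python
def class_frequencies(data, limits):
    """Return Fg = m/n for each class defined by consecutive limits.

    Assumption for this sketch: a class includes its lower limit and
    excludes its upper limit (lower <= x < upper).
    """
    n = len(data)
    frequencies = []
    for lower, upper in zip(limits, limits[1:]):
        m = sum(1 for x in data if lower <= x < upper)  # class count m
        frequencies.append(m / n)
    return frequencies


# Hypothetical daily rainfall record in mm (illustrative values only).
rainfall = [0, 0, 2, 5, 7, 12, 18, 25, 31, 44]

# Two classifications of the same record give two different distributions.
narrow = class_frequencies(rainfall, [0, 10, 20, 30, 40, 50])
wide = class_frequencies(rainfall, [0, 25, 50])
print(narrow)
print(wide)
```

With these values the five narrow classes and the two wide classes describe the same ten observations, yet the resulting frequency distributions look quite different.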

The cumulative frequency is the number of data less than or equal to a reference value (X) divided by the total number of data.
For example, if daily rainfall has been recorded 1000 times (n = 1000) and the number (r) of daily rainfalls less than or equal to 20 mm was 900, then the cumulative frequency is r/n = 900/1000 = 0.9 or 90%.
In other words, 90% of the daily rainfalls did not exceed the reference value of 20 mm.
If the reference value had been different, the cumulative frequency would have been different too.
Hence the cumulative frequency (Fc) depends on the reference value (X). We can write this dependence as Fc = Fc(X).
Fc(X) can be called the cumulative frequency function of X or the cumulative frequency distribution.
Giving the value of daily rainfall the symbol P (precipitation), we can write more specifically Fc(X) = F(P ≤ X), where F stands for frequency and F(P ≤ X) stands for the frequency of rainfalls P that are smaller than or equal to X.
F(P ≤ X) can also be written with the condition as a subscript: F_{P≤X}.
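
A minimal Python sketch of the cumulative frequency as a function of the reference value. The function name cumulative_frequency and the synthetic record are assumptions of this example; the record is built only to reproduce the counts n = 1000 and r = 900 used in the text.

```python
def cumulative_frequency(data, x):
    """Fraction of observations less than or equal to the reference value x."""
    r = sum(1 for value in data if value <= x)  # number of data not exceeding x
    return r / len(data)


# Synthetic record: 900 daily rainfalls of 10 mm and 100 of 30 mm, so that
# exactly 900 of the 1000 values are less than or equal to 20 mm.
record = [10.0] * 900 + [30.0] * 100

fc_20 = cumulative_frequency(record, 20.0)
print(fc_20)  # r/n = 900/1000
```

Calling the function with another reference value gives another cumulative frequency, which is the dependence on X described in the text.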

Rainfall cannot be less than 0. Assuming that there are days with zero rainfall, the cumulative frequency distribution of rainfall begins at X = 0, and the frequency of rainfalls below 0 is F(P < 0) = 0.
Assume that the highest daily rainfall observed was 60 mm. Then F(P ≤ 60) = 1 or 100%.
When we wish to use Fc for prediction, we would like to include the possibility that P can be greater than 60, because we cannot be sure that 60 mm is the absolute maximum. Therefore we redefine the cumulative frequency as F(P ≤ X) = r/(n + 1), where r is the number of data with P ≤ X and n is the total number of data.
By taking the denominator as n + 1, the value of F(P ≤ X) is always less than 100%, which leaves a (small) chance that P can be greater than the maximum observed value. (The n + 1 is not as arbitrary as it may seem. The total sum is equal to the average value times the count of all values. Since zero is implied as the starting basis of the cumulative sum, the total count is the number of contributing values plus 1.)
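
The effect of the n + 1 denominator can be sketched as follows. The function name and the nine rainfall values are hypothetical; only the maximum of 60 mm is taken from the text.

```python
def cumulative_frequency_for_prediction(data, x):
    """Cumulative frequency r/(n + 1), which stays below 1 for any x."""
    r = sum(1 for value in data if value <= x)
    return r / (len(data) + 1)


# Hypothetical daily rainfall record in mm; the observed maximum is 60 mm.
rainfall = [0, 0, 3, 8, 12, 20, 35, 47, 60]
n = len(rainfall)

observed = sum(1 for value in rainfall if value <= 60) / n     # r/n
predicted = cumulative_frequency_for_prediction(rainfall, 60)  # r/(n + 1)
print(observed, predicted)
```

For the maximum observed value the plain ratio r/n equals 1, while r/(n + 1) leaves a frequency of 1/(n + 1) for rainfalls above 60 mm.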

The cumulative frequency can also be called the frequency of non-exceedance. The frequency of exceedance is found from F(P > X) = 1 − F(P ≤ X).
The return period is defined as T = 1/F(P > X). Like Fc, T depends on X.
The notion of a return period has meaning only for a time-dependent phenomenon, like rainfall measured in one spot.
When the cumulative frequency of X = 20 is 90%, the frequency of exceedance is 10%, so F(P > 20) = 0.1. The return period is then T = 1/0.1 = 10.
In our example we used daily rainfalls, hence the return period is in days. The 10-day return period means that the daily rainfall exceeds 20 mm once every 10 days on average.
The addition "on average" means that sometimes the exceedance recurs more often than once in 10 days and sometimes less often.
The time unit of the return period follows the time step of the data. For example, if one deals with maximum daily rainfalls per month and T = 10, then the return period is 10 months.
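
As a small illustration, the return period can be computed from a cumulative frequency in Python. The function name return_period is an assumption of this sketch; the result is expressed in the time step of the record (days for daily data, months for monthly maxima).

```python
def return_period(cumulative_frequency):
    """Return period T = 1 / (1 - Fc), in units of the record's time step."""
    exceedance = 1.0 - cumulative_frequency
    if exceedance <= 0.0:
        raise ValueError("return period is undefined when nothing exceeds X")
    return 1.0 / exceedance


# Example from the text: Fc = 0.9 for X = 20 mm, daily data.
t_20 = return_period(0.9)
print(round(t_20, 6))  # approximately 10 time steps (days here)
```

The result is approximate because of floating-point rounding in 1 − 0.9, which is why the example rounds before printing.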

When the data are arranged in ascending order, the minimum first and the maximum last, and Ra is the rank number, the cumulative frequency is written as F(P ≤ X) = Ra/(n + 1).
When the data are arranged in descending order, the maximum first and the minimum last, and Rd is the rank number, then Ra = n + 1 − Rd and the cumulative frequency is written as F(P ≤ X) = 1 − Rd/(n + 1).
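
A minimal sketch of rank-based frequencies, under the convention that rank 1 goes to the first value in the sorted order. With ascending ranks Ra, the ratio Ra/(n + 1) is the frequency of non-exceedance; with descending ranks Rd, the ratio Rd/(n + 1) is the frequency of exceedance, and the two add up to 1 for every value. The function names and the data are hypothetical, and tied values are not treated specially here.

```python
def non_exceedance_from_ascending(data):
    """Pairs (value, Ra/(n + 1)) with Ra the rank in ascending order."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, ra / (n + 1)) for ra, x in enumerate(ordered, start=1)]


def exceedance_from_descending(data):
    """Pairs (value, Rd/(n + 1)) with Rd the rank in descending order."""
    ordered = sorted(data, reverse=True)
    n = len(ordered)
    return [(x, rd / (n + 1)) for rd, x in enumerate(ordered, start=1)]


# Hypothetical record with distinct values (mm).
rainfall = [12, 3, 47, 20, 8, 35, 60, 1, 0]

ascending = non_exceedance_from_ascending(rainfall)
descending = exceedance_from_descending(rainfall)

for (x, f_non_exceed), (_, f_exceed) in zip(ascending, reversed(descending)):
    print(x, f_non_exceed, f_exceed)
```

Because Ra + Rd = n + 1 for each value, either ordering leads to the same cumulative frequency.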

Other adjustments of the ratios r/(n + 1), Rd/(n + 1) and Ra/(n + 1) have also been proposed, but according to [3] these are incorrect.

Contrary to the frequency distribution (which is classification dependent), the cumulative frequency distribution of the data in a record is unique.

Confidence intervals and belts

When we have derived a cumulative frequency distribution, we may ask if it can be used for predictions. For example, we have a distribution of river discharges for the years 1950 to 2000. Can we then use this distribution to predict how often a certain river discharge will be exceeded in the years 2000 to 2050?
The answer is yes, provided that the conditions that determine the discharge are not changed.
However, when the watershed of the river is subject to infrastructural changes, or the rainfall pattern changes (perhaps due to global warming), a prediction based on the previous record is subject to a systematic error. Even when there is no systematic error, there may be a random error, because by chance the average discharge observed from 1950 to 2000 can be higher or lower than the average that a much longer record would have shown.

Statistical theory says that, when the record is extremely long, the observed average discharge is no longer subject to random error (the law of large numbers). Hence, the longer the record, the more reliable the prediction will be (if there is no systematic error).
Now, even if we have a very long record (say 1000 years), the prediction for a shorter period (say 50 years) is still subject to random error because, by chance, the average discharge in that period may be more or less than normal.

Probability theory can help us to estimate the range in which the random error may be, provided that the frequencies in the present record are reasonably reliable.
In our case of cumulative frequency we have only two possibilities: a certain reference value X is exceeded or it is not exceeded. The sum of exceedance frequency and cumulative frequency is 1 or 100%.
Therefore the binomial distribution can be used to help us in estimating the range of the random error.

In the binomial distribution, the standard deviation (Sd) of the observed frequency is given as Sd = √{Ce(1 − Ce)/n}, where Ce is the probability or chance of exceedance and n is the number of data.
Trusting that the value of Fc is a good estimate of 1 − Ce, the value of Sd can be calculated as Sd = √{Fc(1 − Fc)/n}. This carries some risk, but the risk is small when the record is long.

The determination of the confidence interval of Fc makes use of the t-value (t) of Student's t-distribution.
We set U − L = T·Sd, where U is the upper confidence limit, L is the lower confidence limit, and T = 2t.
For 90% confidence limits the T-value is close to 3.3 when n > 10.
The binomial distribution is symmetrical around the mean when Fc = 0.5, but it becomes more and more skewed as Fc approaches 0 or 1. Therefore, by approximation, we use Fc as a weight factor in assigning Sd to U and L:

  1. U = Fc + 3.3·Fc·Sd
  2. L = Fc − 3.3·(1 − Fc)·Sd

Thus we can say with 90% confidence that, given an observed value of Fc, the value of Fc in a very long record will lie between L and U (L < Fc < U).
This implies that there is still a 10% chance that the statement is wrong, and events with a 10% chance do occur, although not often.
Because the return period is T = 1/(1 − Fc), its confidence limits can be found as 1/(1 − L) and 1/(1 − U).
The user of confidence intervals must be aware of the limitations of the estimate, use the interval with caution, and consider that it gives only a rough idea.
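
The approximate limits described in this section can be sketched in Python. The function name confidence_limits and the parameter name t_factor are assumptions of this example; the default of 3.3 is the 90% value quoted above for n > 10, and the weighting by Fc follows the two expressions for U and L.

```python
import math


def confidence_limits(fc, n, t_factor=3.3):
    """Approximate confidence limits of a cumulative frequency Fc.

    Sd = sqrt(Fc (1 - Fc) / n); the total width U - L equals t_factor * Sd
    and is split between U and L with Fc as the weight factor.
    """
    sd = math.sqrt(fc * (1.0 - fc) / n)
    upper = fc + t_factor * fc * sd
    lower = fc - t_factor * (1.0 - fc) * sd
    return lower, upper, sd


# Example values from the rainfall illustration: Fc = 0.9 from n = 1000 data.
lower, upper, sd = confidence_limits(0.9, 1000)
print(lower, upper, sd)
```

The interval narrows with the square root of n, so a record four times as long halves its width.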


Fitting of probability distributions

To present the cumulative frequency distribution as a mathematical equation, one may try to fit the cumulative frequency distribution to a known cumulative probability distribution [4].
If successful, the known equation is enough to report the frequency distribution and it will not be required to provide a table of data.
Further, the equation helps to interpolate and extrapolate.
When a cumulative frequency distribution is extrapolated, this should be explicitly mentioned, because extrapolation is a source of errors.
One possible error is that the frequency distribution does not follow the selected probability distribution any more beyond the range of the observed data.

A sample of probability distributions that may be used can be found in the article on probability distributions.
Please note: any non-negative equation that gives the value 1 when integrated from a lower limit to an upper limit that represent the data range well can be used as a probability distribution.

Probability distributions can be fitted by several methods, for example (1) the parametric method and (2) the regression method.
Examples of both methods using the (log)normal distribution, the Gumbel distribution (or Fisher-Tippett type 1 distribution of extremes) and the exponential distribution are given in Chapter 6, "Frequency and Regression Analysis", of ILRI Publ. 16, "Drainage Principles and Applications", which can be viewed on and freely downloaded from the ILRI-Alterra website or from the Articles page of the waterlog.info website.
It is shown there that the two methods do not yield significantly different results.
The chapter also shows that different probability distributions can give similar results and that the differences between them are small compared to the confidence interval. This illustrates that it may be difficult to determine which distribution gives better results.
It further shows that the confidence intervals of return periods can be very wide in the higher range, so that the return period is of limited practical value there.
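
As a rough illustration of the two fitting approaches, the sketch below fits a Gumbel distribution to a small record in two ways. It is not taken from the cited chapter. It assumes that the parametric method means estimating the parameters from the mean and standard deviation (method of moments), and that the regression method means a least-squares line through the data plotted against the reduced variate y = −ln(−ln F), with F = rank/(n + 1). The data and all function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def gumbel_cdf(x, mu, beta):
    """Cumulative Gumbel distribution F(x) = exp(-exp(-(x - mu) / beta))."""
    return math.exp(-math.exp(-(x - mu) / beta))


def fit_gumbel_moments(data):
    """Parametric sketch: parameters from the sample mean and std. deviation."""
    beta = statistics.stdev(data) * math.sqrt(6.0) / math.pi
    mu = statistics.mean(data) - EULER_GAMMA * beta
    return mu, beta


def fit_gumbel_regression(data):
    """Regression sketch: least-squares line x = mu + beta * y."""
    ordered = sorted(data)
    n = len(ordered)
    # Reduced variate for each ascending rank, using F = rank / (n + 1).
    y = [-math.log(-math.log(rank / (n + 1))) for rank in range(1, n + 1)]
    x_mean = statistics.mean(ordered)
    y_mean = statistics.mean(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(ordered, y))
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    beta = s_xy / s_yy
    mu = x_mean - beta * y_mean
    return mu, beta


# Hypothetical annual maximum daily rainfalls in mm (illustrative only).
annual_maxima = [38.0, 45.0, 52.0, 41.0, 60.0, 48.0, 55.0, 43.0,
                 70.0, 50.0, 47.0, 58.0, 44.0, 65.0, 53.0]

mu_m, beta_m = fit_gumbel_moments(annual_maxima)
mu_r, beta_r = fit_gumbel_regression(annual_maxima)
print("moments:   ", mu_m, beta_m)
print("regression:", mu_r, beta_r)
print("F(60) by each fit:", gumbel_cdf(60.0, mu_m, beta_m),
      gumbel_cdf(60.0, mu_r, beta_r))
```

Regressing the data on the reduced variate, rather than the reverse, is a choice made for this sketch; either fitted curve can then be compared with the confidence limits of the observed cumulative frequencies.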

To facilitate distribution fitting, the CumFreq [5] computer program has been developed. The program uses a number of well-known distributions and either selects the distribution that gives the best fit or uses the distribution that the user selects.
CumFreq can be applied to records of any kind, not necessarily hydrological data.
The program offers the option to introduce a discontinuity, separating the data range into two parts with different distributions. It will determine the breakpoint, try several distributions, and determine the end result by a test of best fit. The introduction of the discontinuity proved useful for the analysis of rainfall data in Northern Peru, where the climate is subject to the behavior of the Pacific Ocean current "El Niño". When El Niño extends beyond Ecuador and enters the ocean along the coast of Peru, the climate in Northern Peru becomes tropical and wet. When El Niño does not reach Peru, the climate is semi-arid. For this reason, the higher rainfalls follow a different frequency distribution than the lower rainfalls.
CumFreq gives graphs with observed values, the fitted distribution with confidence intervals, and the mathematical expression of the best-fitting or selected probability distribution.
It also shows return periods.
The (log)normal distribution is evaluated numerically, as a closed-form analytical expression of the cumulative normal distribution does not exist.
The program also calculates the (non-cumulative) frequency distribution of data classes and will ask the user to define the class intervals. This class distribution is derived from the cumulative distribution and also shows confidence intervals.
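
One natural way to derive class frequencies from a cumulative distribution is to take the difference of the cumulative values at the class limits. The sketch below illustrates this idea only; it is not a description of how CumFreq works internally. The exponential distribution, its mean of 10 mm, the class limits and the function names are all assumptions of the example.

```python
import math


def exponential_cdf(x, mean):
    """Cumulative exponential distribution, 1 - exp(-x / mean) for x > 0."""
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0


def class_frequencies_from_cdf(cdf, limits):
    """Frequency of each class as Fc(upper limit) - Fc(lower limit)."""
    return [cdf(upper) - cdf(lower) for lower, upper in zip(limits, limits[1:])]


MEAN_RAINFALL = 10.0        # hypothetical mean daily rainfall in mm
limits = [0, 10, 20, 40]    # hypothetical class limits in mm

freqs = class_frequencies_from_cdf(
    lambda x: exponential_cdf(x, MEAN_RAINFALL), limits)
for (lower, upper), f in zip(zip(limits, limits[1:]), freqs):
    print(lower, upper, f)
```

The class frequencies add up to the cumulative frequency at the highest class limit, so nothing is lost or double-counted when the class intervals are changed.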

The CumFreq program can be freely downloaded from the software page of the waterlog.info website.

References

  1. ^ Benson, M.A. 1960. Characteristics of frequency curves based on a theoretical 1000 year record. In: T.Dalrymple (ed.), Flood frequency analysis. U.S. Geological Survey Water Supply paper 1543-A, pp. 51-71
  2. ^ R.J.Oosterbaan, 1994, Frequency and Regression Analysis. In: H.P.Ritzema (ed.), Drainage Principles and Applications, Publ. 16, pp. 175-224, ILRI, Wageningen, The Netherlands. ISBN 90 70754 3 39
  3. ^ L.Makkonen, 2008, Communications in Statistics - Theory and Methods, 37: 460-467
  4. ^ Free download of Frequency analysis (pdf)
  5. ^ Cumfreq, a program for cumulative frequency analysis

External links

  • For more examples of CumFreq applications see "Drainage research in farmers' fields : Analysis of data"