123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460 |
- [/ def names all end in distrib to avoid clashes with names of functions]
- [def __binomial_distrib [link math_toolkit.dist_ref.dists.binomial_dist Binomial Distribution]]
- [def __chi_squared_distrib [link math_toolkit.dist_ref.dists.chi_squared_dist Chi Squared Distribution]]
- [def __normal_distrib [link math_toolkit.dist_ref.dists.normal_dist Normal Distribution]]
- [def __F_distrib [link math_toolkit.dist_ref.dists.f_dist Fisher F Distribution]]
- [def __students_t_distrib [link math_toolkit.dist_ref.dists.students_t_dist Students t Distribution]]
- [def __handbook [@http://www.itl.nist.gov/div898/handbook/
- NIST/SEMATECH e-Handbook of Statistical Methods.]]
- [section:stat_tut Statistical Distributions Tutorial]
- This library is centred around statistical distributions, this tutorial
- will give you an overview of what they are, how they can be used, and
- provides a few worked examples of applying the library to statistical tests.
- [section:overview Overview of Statistical Distributions]
- [section:headers Headers and Namespaces]
- All the code in this library is inside `namespace boost::math`.
- In order to use a distribution /my_distribution/ you will need to include
- either the header(s) `<boost/math/my_distribution.hpp>` (quicker compiles), or
- the "include all the distributions" header: `<boost/math/distributions.hpp>`.
- For example, to use the Students-t distribution include either
- `<boost/math/students_t.hpp>` or
- `<boost/math/distributions.hpp>`
- You also need to bring distribution names into scope,
- perhaps with a `using namespace boost::math;` declaration,
- or specific `using` declarations like `using boost::math::normal;` (*recommended*).
- [caution Some math function names are also used in `namespace std` so including `<random>` could cause ambiguity!]
- [endsect] [/ section:headers Headers and Namespaces]
- [section:objects Distributions are Objects]
- Each kind of distribution in this library is a class type - an object, with member functions.
- [tip If you are familiar with statistics libraries using functions,
- and 'Distributions as Objects' seem alien, see
- [link math_toolkit.stat_tut.weg.nag_library the comparison to
- other statistics libraries.]
- ] [/tip]
- [link policy Policies] provide optional fine-grained control
- of the behaviour of these classes, allowing the user to customise
- behaviour such as how errors are handled, or how the quantiles
- of discrete distributions behave.
- Making distributions class types does two things:
- * It encapsulates the kind of distribution in the C++ type system;
- so, for example, Students-t distributions are always a different C++ type from
- Chi-Squared distributions.
- * The distribution objects store any parameters associated with the
- distribution: for example, the Students-t distribution has a
- ['degrees of freedom] parameter that controls the shape of the distribution.
- This ['degrees of freedom] parameter has to be provided
- to the Students-t object when it is constructed.
- Although the distribution classes in this library are templates, there
- are typedefs on type /double/ that mostly take the usual name of the
- distribution
- (except where there is a clash with a function of the same name: beta and gamma,
- in which case using the default template arguments - `RealType = double` -
- is nearly as convenient).
- Probably 95% of uses are covered by these typedefs:
- // using namespace boost::math; // Avoid potential ambiguity with names in std <random>
- // Safer to declare specific functions with using statement(s):
- using boost::math::beta_distribution;
- using boost::math::binomial_distribution;
- using boost::math::students_t;
- // Construct a students_t distribution with 4 degrees of freedom:
- students_t d1(4);
- // Construct a double-precision beta distribution
- // with parameters a = 10, b = 20
- beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !
- If you need to use the distributions with a type other than `double`,
- then you can instantiate the template directly: the names of the
- templates are the same as the `double` typedef but with `_distribution`
- appended, for example: __students_t_distrib or __binomial_distrib:
- // Construct a students_t distribution, of float type,
- // with 4 degrees of freedom:
- students_t_distribution<float> d3(4);
- // Construct a binomial distribution, of long double type,
- // with probability of success 0.3
- // and 20 trials in total:
- binomial_distribution<long double> d4(20, 0.3);
- The parameters passed to the distributions can be accessed via getter member
- functions:
- d1.degrees_of_freedom(); // returns 4.0
- This is all well and good, but not very useful so far. What we often want
- is to be able to calculate the /cumulative distribution functions/ and
- /quantiles/ etc for these distributions.
- [endsect] [/section:objects Distributions are Objects]
- [section:generic Generic operations common to all distributions are non-member functions]
- Want to calculate the PDF (Probability Density Function) of a distribution?
- No problem, just use:
- pdf(my_dist, x); // Returns PDF (density) at point x of distribution my_dist.
- Or how about the CDF (Cumulative Distribution Function):
- cdf(my_dist, x); // Returns CDF (integral from -infinity to point x)
- // of distribution my_dist.
- And quantiles are just the same:
- quantile(my_dist, p); // Returns the value of the random variable x
- // such that cdf(my_dist, x) == p.
- If you're wondering why these aren't member functions, it's to
- make the library more easily extensible: if you want to add additional
- generic operations - let's say the /n'th moment/ - then all you have to
- do is add the appropriate non-member functions, overloaded for each
- implemented distribution type.
- [tip
- [*Random numbers that approximate Quantiles of Distributions]
- If you want random numbers that are distributed in a specific way,
- for example in a uniform, normal or triangular,
- see [@http://www.boost.org/libs/random/ Boost.Random].
- Whilst in principal there's nothing to prevent you from using the
- quantile function to convert a uniformly distributed random
- number to another distribution, in practice there are much more
- efficient algorithms available that are specific to random number generation.
- ] [/tip Random numbers that approximate Quantiles of Distributions]
- For example, the binomial distribution has two parameters:
- n (the number of trials) and p (the probability of success on any one trial).
- The `binomial_distribution` constructor therefore has two parameters:
- `binomial_distribution(RealType n, RealType p);`
- For this distribution the __random_variate is k: the number of successes observed.
- The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)].
- [note
- [*Random Variates and Distribution Parameters]
- The concept of a __random_variable is closely linked to the term __random_variate:
- a random variate is a particular value (outcome) of a random variable.
- and [@http://en.wikipedia.org/wiki/Parameter distribution parameters]
- are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld)
- by placing a semi-colon or vertical bar)
- /after/ the __random_variable (whose value you 'choose'),
- to separate the variate from the parameter(s) that defines the shape of the distribution.
- For example, the binomial distribution probability distribution function (PDF) is written as
- [role serif_italic ['f(k| n, p)] = Pr(K = k|n, p) = ] probability of observing k successes out of n trials.
- K is the __random_variable, k is the __random_variate,
- the parameters are n (trials) and p (probability).
- ] [/tip Random Variates and Distribution Parameters]
- [note By convention, __random_variate are lower case, usually k is integral, x if real, and
- __random_variable are upper case, K if integral, X if real. But this implementation treats
- all as floating point values `RealType`, so if you really want an integral result,
- you must round: see note on Discrete Probability Distributions below for details.]
- As noted above the non-member function `pdf` has one parameter for the distribution object,
- and a second for the random variate. So taking our binomial distribution
- example, we would write:
- `pdf(binomial_distribution<RealType>(n, p), k);`
- The ranges of __random_variate values that are permitted and are supported can be
- tested by using two functions `range` and `support`.
- The distribution (effectively the __random_variate) is said to be 'supported'
- over a range that is
- [@http://en.wikipedia.org/wiki/Probability_distribution
- "the smallest closed set whose complement has probability zero"].
- MathWorld uses the word 'defined' for this range.
- Non-mathematicians might say it means the 'interesting' smallest range
- of random variate x that has the cdf going from zero to unity.
- Outside are uninteresting zones where the pdf is zero, and the cdf zero or unity.
- For most distributions, with probability distribution functions one might describe
- as 'well-behaved', we have decided that it is most useful for the supported range
- to *exclude* random variate values like exact zero *if the end point is discontinuous*.
- For example, the Weibull (scale 1, shape 1) distribution smoothly heads for unity
- as the random variate x declines towards zero.
- But at x = zero, the value of the pdf is suddenly exactly zero, by definition.
- If you are plotting the PDF, or otherwise calculating,
- zero is not the most useful value for the lower limit of supported, as we discovered.
- So for this, and similar distributions,
- we have decided it is most numerically useful to use
- the closest value to zero, min_value, for the limit of the supported range.
- (The `range` remains from zero, so you will still get `pdf(weibull, 0) == 0`).
- (Exponential and gamma distributions have similarly discontinuous functions).
- Mathematically, the functions may make sense with an (+ or -) infinite value,
- but except for a few special cases (in the Normal and Cauchy distributions)
- this implementation limits random variates to finite values from the `max`
- to `min` for the `RealType`.
- (See [link math_toolkit.sf_implementation.handling_of_floating_point_infin
- Handling of Floating-Point Infinity] for rationale).
- [note
- [*Discrete Probability Distributions]
- Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution
- discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli,
- are all mathematically defined as discrete functions:
- that is to say the functions `cdf` and `pdf` are only defined for integral values
- of the random variate.
- However, because the method of calculation often uses continuous functions
- it is convenient to treat them as if they were continuous functions,
- and permit non-integral values of their parameters.
- Users wanting to enforce a strict mathematical model may use `floor`
- or `ceil` functions on the random variate prior to calling the distribution
- function.
- The quantile functions for these distributions are hard to specify
- in a manner that will satisfy everyone all of the time. The default
- behaviour is to return an integer result, that has been rounded
- /outwards/: that is to say, lower quantiles - where the probablity
- is less than 0.5 are rounded down, while upper quantiles - where
- the probability is greater than 0.5 - are rounded up. This behaviour
- ensures that if an X% quantile is requested, then /at least/ the requested
- coverage will be present in the central region, and /no more than/
- the requested coverage will be present in the tails.
- This behaviour can be changed so that the quantile functions are rounded
- differently, or return a real-valued result using
- [link math_toolkit.pol_overview Policies]. It is strongly
- recommended that you read the tutorial
- [link math_toolkit.pol_tutorial.understand_dis_quant
- Understanding Quantiles of Discrete Distributions] before
- using the quantile function on a discrete distribtion. The
- [link math_toolkit.pol_ref.discrete_quant_ref reference docs]
- describe how to change the rounding policy
- for these distributions.
- For similar reasons continuous distributions with parameters like
- "degrees of freedom"
- that might appear to be integral, are treated as real values
- (and are promoted from integer to floating-point if necessary).
- In this case however, there are a small number of situations where non-integral
- degrees of freedom do have a genuine meaning.
- ]
- [endsect] [/ section:generic Generic operations common to all distributions are non-member functions]
- [section:complements Complements are supported too - and when to use them]
- Often you don't want the value of the CDF, but its complement, which is
- to say `1-p` rather than `p`. It is tempting to calculate the CDF and subtract
- it from `1`, but if `p` is very close to `1` then cancellation error
- will cause you to lose accuracy, perhaps totally.
- [link why_complements See below ['"Why and when to use complements?"]]
- In this library, whenever you want to receive a complement, just wrap
- all the function arguments in a call to `complement(...)`, for example:
- students_t dist(5);
- cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl;
- cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl;
- But wait, now that we have a complement, we have to be able to use it as well.
- Any function that accepts a probability as an argument can also accept a complement
- by wrapping all of its arguments in a call to `complement(...)`, for example:
- students_t dist(5);
- for(double i = 10; i < 1e10; i *= 10)
- {
- // Calculate the quantile for a 1 in i chance:
- double t = quantile(complement(dist, 1/i));
- // Print it out:
- cout << "Quantile of students-t with 5 degrees of freedom\n"
- "for a 1 in " << i << " chance is " << t << endl;
- }
- [tip
- [*Critical values are just quantiles]
- Some texts talk about quantiles, or percentiles or fractiles,
- others about critical values, the basic rule is:
- ['Lower critical values] are the same as the quantile.
- ['Upper critical values] are the same as the quantile from the complement
- of the probability.
- For example, suppose we have a Bernoulli process, giving rise to a binomial
- distribution with success ratio 0.1 and 100 trials in total. The
- ['lower critical value] for a probability of 0.05 is given by:
- `quantile(binomial(100, 0.1), 0.05)`
- and the ['upper critical value] is given by:
- `quantile(complement(binomial(100, 0.1), 0.05))`
- which return 4.82 and 14.63 respectively.
- ]
- [#why_complements]
- [tip
- [*Why bother with complements anyway?]
- It's very tempting to dispense with complements, and simply subtract
- the probability from 1 when required. However, consider what happens when
- the probability is very close to 1: let's say the probability expressed at
- float precision is `0.999999940f`, then `1 - 0.999999940f = 5.96046448e-008`,
- but the result is actually accurate to just ['one single bit]: the only
- bit that didn't cancel out!
- Or to look at this another way: consider that we want the risk of falsely
- rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion,
- for a sample size of 10,000.
- This gives a probability of 1 - 10[super -9], which is exactly 1 when
- calculated at float precision. In this case calculating the quantile from
- the complement neatly solves the problem, so for example:
- `quantile(complement(students_t(10000), 1e-9))`
- returns the expected t-statistic `6.00336`, where as:
- `quantile(students_t(10000), 1-1e-9f)`
- raises an overflow error, since it is the same as:
- `quantile(students_t(10000), 1)`
- Which has no finite result.
- With all distributions, even for more reasonable probability
- (unless the value of p can be represented exactly in the floating-point type)
- the loss of accuracy quickly becomes significant if you simply calculate probability from 1 - p
- (because it will be mostly garbage digits for p ~ 1).
- So always avoid, for example, using a probability near to unity like 0.99999
- `quantile(my_distribution, 0.99999)`
- and instead use
- `quantile(complement(my_distribution, 0.00001))`
- since 1 - 0.99999 is not exactly equal to 0.00001 when using floating-point arithmetic.
- This assumes that the 0.00001 value is either a constant,
- or can be computed by some manner other than subtracting 0.99999 from 1.
- ] [/ tip *Why bother with complements anyway?]
- [endsect] [/ section:complements Complements are supported too - and why]
- [section:parameters Parameters can be calculated]
- Sometimes it's the parameters that define the distribution that you
- need to find. Suppose, for example, you have conducted a Students-t test
- for equal means and the result is borderline. Maybe your two samples
- differ from each other, or maybe they don't; based on the result
- of the test you can't be sure. A legitimate question to ask then is
- "How many more measurements would I have to take before I would get
- an X% probability that the difference is real?" Parameter finders
- can answer questions like this, and are necessarily different for
- each distribution. They are implemented as static member functions
- of the distributions, for example:
- students_t::find_degrees_of_freedom(
- 1.3, // difference from true mean to detect
- 0.05, // maximum risk of falsely rejecting the null-hypothesis.
- 0.1, // maximum risk of falsely failing to reject the null-hypothesis.
- 0.13); // sample standard deviation
- Returns the number of degrees of freedom required to obtain a 95%
- probability that the observed differences in means is not down to
- chance alone. In the case that a borderline Students-t test result
- was previously obtained, this can be used to estimate how large the sample size
- would have to become before the observed difference was considered
- significant. It assumes, of course, that the sample mean and standard
- deviation are invariant with sample size.
- [endsect] [/ section:parameters Parameters can be calculated]
- [section:summary Summary]
- * Distributions are objects, which are constructed from whatever
- parameters the distribution may have.
- * Member functions allow you to retrieve the parameters of a distribution.
- * Generic non-member functions provide access to the properties that
- are common to all the distributions (PDF, CDF, quantile etc).
- * Complements of probabilities are calculated by wrapping the function's
- arguments in a call to `complement(...)`.
- * Functions that accept a probability can accept a complement of the
- probability as well, by wrapping the function's
- arguments in a call to `complement(...)`.
- * Static member functions allow the parameters of a distribution
- to be found from other information.
- Now that you have the basics, the next section looks at some worked examples.
- [endsect] [/section:summary Summary]
- [endsect] [/section:overview Overview]
- [section:weg Worked Examples]
- [include distribution_construction.qbk]
- [include students_t_examples.qbk]
- [include chi_squared_examples.qbk]
- [include f_dist_example.qbk]
- [include binomial_example.qbk]
- [include geometric_example.qbk]
- [include negative_binomial_example.qbk]
- [include normal_example.qbk]
- [/include inverse_gamma_example.qbk]
- [/include inverse_gaussian_example.qbk]
- [include inverse_chi_squared_eg.qbk]
- [include nc_chi_squared_example.qbk]
- [include error_handling_example.qbk]
- [include find_location_and_scale.qbk]
- [include nag_library.qbk]
- [include c_sharp.qbk]
- [endsect] [/section:weg Worked Examples]
- [include background.qbk]
- [endsect] [/ section:stat_tut Statistical Distributions Tutorial]
- [/ dist_tutorial.qbk
- Copyright 2006, 2010, 2011 John Maddock and Paul A. Bristow.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt).
- ]
|