A consolidated system of statistical functions
About the Project
There are three statistics libraries that can be considered relatively complete and well written:
- The statistics library from numerical-utilities
- Larry Hunter's cl-statistics
- Gary Warren King's cl-mathstats
There are a few challenges in using these as independent systems on projects though:
- There is a good amount of overlap. Everyone implements, for example, mean (as do alexandria, cephes, and others)
- In the case of variance, etc., the functions deal only with samples, not distributions
This library brings these three systems under a single 'umbrella', and adds a few missing functions. To do this we use Tim Bradshaw's conduit-packages. For the few functions that require dispatch on type (sample data vs. a distribution), we use typecase because it is simple and does not require another system. There is a slight performance hit when types must be determined at run time, but until that becomes a problem we prefer it. One alternative considered for dispatch was filtered-functions (https://github.com/pcostanza/filtered-functions).
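The dispatch idea can be sketched as follows. This is only an illustration of typecase-based dispatch, not the code used in this library: the normal-dist structure and the sketch-mean function are hypothetical names invented for the example.

```lisp
;; Hypothetical stand-in for a distribution object; not part of this library.
(defstruct normal-dist
  (mu 0d0 :type double-float)
  (sigma 1d0 :type double-float))

(defun sketch-mean (x)
  "Return the mean of X, dispatching on its type with TYPECASE."
  (typecase x
    ;; A distribution knows its mean analytically.
    (normal-dist (normal-dist-mu x))
    ;; Sample data: any non-empty list or vector of numbers.
    (sequence (/ (reduce #'+ x) (length x)))
    ;; Anything else is rejected.
    (t (error "Don't know how to take the mean of ~S" x))))

(sketch-mean '(1 2 3 4))                  ; sample data
(sketch-mean (make-normal-dist :mu 5d0))  ; distribution
```

The type test happens at run time on every call, which is the small performance cost mentioned above.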
The functions from numerical-utilities cover sample moments in detail, and are accurate. They include up to fourth moments, and are well suited to the work of an econometrician (and were written by one).
These were written by Larry Hunter, based on the methods described in Bernard Rosner's book, Fundamentals of Biostatistics 5th Edition, along with some from the CLASP system. They cover a wide range of statistical applications.
These are from Gary Warren King, and are also partially based on CLASP. They are well written, and the functions have excellent documentation. The major reason we don't include them by default is that they use an older ecosystem of libraries that duplicate more widely used systems (for example, numerical-utilities, alexandria). If you want to use these, you'll need to uncomment the appropriate code in the ASDF and
LH and GWK statistics compute quantiles, CDF, PDF, etc. using routines from CLASP, which in turn are based on algorithms from Numerical Recipes. These are known to be accurate to only about four decimal places. This is probably accurate enough for many statistical problems; however, should you need greater accuracy, look at the distributions system. The computations there are based on special-functions, which has an accuracy of around 15 digits. Unfortunately, the documentation of distributions and the 'wrapping' of them here are incomplete, so you'll need to know the naming pattern, e.g. pdf-gamma, cdf-gamma, etc., which is described in the link above.
Because this system is likely to change rapidly, we have adopted the versioning scheme proposed in defpackage+. This is also the scheme alexandria uses, where a version number is appended to the API name. So statistics-1 is our current package name, statistics-2 will be the next, and so on. If you don't like these names, you can always change them locally using a package-local nickname.
To get a local copy up and running follow these steps:
If you already have the system downloaded to your local machine.
If you are using SBCL you will see a large number of notes printed about the inability to optimise. This was the subject of issue #1 and the short answer is that the functions all take arbitrary inputs, do input tests specific to the calculation, and then coerce and provide declarations so that the actual calculations can be optimized. So, you should be able to ignore the notes.
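The pattern described above (accept arbitrary input, test it, then coerce and declare types for the numeric core) might look roughly like this. It is a sketch under that description only, not code taken from this library; sketch-sum-of-squares is a hypothetical name.

```lisp
(defun sketch-sum-of-squares (xs)
  "Sum of squares of XS, a list or vector of real numbers."
  ;; 1. Input tests specific to the calculation, on arbitrary input.
  (check-type xs sequence)
  (unless (every #'realp xs)
    (error "All elements of ~S must be real numbers." xs))
  ;; 2. Coerce to a specialised vector so the element type is known.
  (let ((v (map '(simple-array double-float (*))
                (lambda (x) (coerce x 'double-float))
                xs))
        (sum 0d0))
    ;; 3. Declarations let the compiler optimise the inner loop.
    (declare (type (simple-array double-float (*)) v)
             (type double-float sum)
             (optimize (speed 3)))
    (loop for x across v
          do (incf sum (* x x)))
    sum))

(sketch-sum-of-squares '(1 2 3))
```

Only the loop over the coerced vector has fully known types; the generic input handling before it is the kind of code a compiler cannot specialise.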
Create a data frame of weather data:
and take the mean maximum temperature:
LS-USER> (statistics-1:mean sg-weather:max-temps)
For more examples, please refer to the Documentation.
You can use a package local nickname to give the package a shorter name, e.g. "stats" if you like.
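A minimal sketch of such a nickname, assuming the system is already loaded so that the statistics-1 package exists, and that your implementation supports package-local nicknames (SBCL does). The package name my-analysis is hypothetical.

```lisp
;; MY-ANALYSIS is a hypothetical package name used only for illustration.
;; The STATISTICS-1 package must already exist when this form is evaluated.
(defpackage #:my-analysis
  (:use #:cl)
  (:local-nicknames (#:stats #:statistics-1)))

;; Inside MY-ANALYSIS, the prefix STATS: now resolves to STATISTICS-1:,
;; so the earlier example could be written as
;;   (stats:mean sg-weather:max-temps)
```

The nickname is visible only inside my-analysis, so it cannot clash with another package elsewhere that happens to be called stats.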
Often all you'll need for general statistical analysis is lh-stats. You can load that with:
NB: You can expect to see many warnings when loading lh-stats. These are expected and nothing to worry about.
These abbreviations are used in function and variable names:
| Abbreviation | Meaning |
|--------------|---------|
| cdf | cumulative distribution function |
| ge | greater than or equal to |
| le | less than or equal to |
| pdf | probability density function |
| rxc | rows by columns |
| sse | sample size estimate |
- geometric mean
- standard-deviation (sd)
- Poisson & Binomial
Hypothesis tests (parametric)
Hypothesis tests (non-parametric)
Sample size estimates
Correlation and Regression
Significance test functions
- f-significance (chi square significance is calculated from chi-square-cdf in various ways depending on the problem)
gwk-stats has many useful functions. We'd like to port them to use the Lisp-Stat ecosystem of utilities.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please see CONTRIBUTING for details on the code of conduct and the process for submitting pull requests.
Copyright (c) 1990 - 1994 University of Massachusetts Department of Computer Science Experimental Knowledge Systems Laboratory Professor Paul Cohen, Director. All rights reserved.
Permission to use, copy, modify and distribute this software and its documentation is hereby granted without fee, provided that the above copyright notice of EKSL, this paragraph and the one following appear in all copies and in supporting documentation.
EKSL makes no representation about the suitability of this software for any purposes. It is provided "AS IS", without express or implied warranties including (but not limited to) all implied warranties of merchantability and fitness for a particular purpose, and notwithstanding any other provision contained herein. In no event shall EKSL be liable for any special, indirect or consequential damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortuous action, arising out of or in connection with the use or performance of this software, even if EKSL is advised of the possibility of such damages.
Project Link: https://github.com/lisp-stat/statistics