statistics

2024-10-12

A consolidated system of statistical functions

Upstream URL

github.com/Lisp-Stat/statistics

Author

Steve Nunez <steve@symbolics.tech>, Larry Hunter <Larry.Hunter@CUAnschutz.edu>

License

MS-PL, MIT

README


Lisp-Stat Statistics

A consolidation of Common Lisp statistics libraries

Table of Contents

  1. About the Project
  2. Installation
  3. Usage
  4. Functions
  5. Roadmap
  6. Resources
  7. Contributing
  8. License
  9. Contact

About the Project

There are three statistics libraries that can be considered relatively complete and well written:

  • The statistics library from numerical-utilities
  • Larry Hunter's cl-statistics
  • Gary Warren King's cl-mathstats

There are a few challenges in using these as independent systems on projects though:

  • There is a good amount of overlap. Every library implements mean, for example (as do alexandria, cephes, and others)
  • In the case of mean, variance, etc., the functions deal only with samples, not distributions

This library brings these three systems under a single 'umbrella' and adds a few missing functions. To do this we use Tim Bradshaw's conduit-packages. For the few functions that require dispatch on type (sample data vs. a distribution), we use typecase because it is simple and does not require another system. There is a slight performance hit when types are determined at run time, but until that becomes a problem we prefer it. One alternative considered for dispatch was filtered-functions (https://github.com/pcostanza/filtered-functions).
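
As a rough illustration of the typecase dispatch described above, here is a minimal, self-contained sketch. The names normal-dist, sample-mean and mean* are hypothetical and are not part of this library; the real code dispatches on the types defined by the underlying systems.

```lisp
;; Hypothetical stand-in for a distribution object (assumption, not the
;; library's own type).
(defstruct normal-dist
  (mu 0d0 :type double-float)
  (sigma 1d0 :type double-float))

(defun sample-mean (xs)
  "Arithmetic mean of a non-empty sequence of numbers."
  (/ (reduce #'+ xs) (length xs)))

(defun mean* (object)
  "Mean of OBJECT, which is either sample data or a distribution."
  (typecase object
    (sequence    (sample-mean object))      ; sample data: a list or vector
    (normal-dist (normal-dist-mu object))   ; distribution: read the parameter
    (t (error "Don't know how to compute the mean of ~S" object))))
```

Calling (mean* '(1 2 3 4)) takes the sequence branch, while (mean* (make-normal-dist :mu 3d0)) takes the distribution branch. The type test happens on every call, which is the run-time cost mentioned above.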

nu-statistics

These functions cover sample moments in detail, and are accurate. They include up to fourth moments, and are well suited to the work of an econometrician (and were written by one).

cl-statistics

These were written by Larry Hunter, based on the methods described in Bernard Rosner's book, Fundamentals of Biostatistics 5th Edition, along with some from the CLASP system. They cover a wide range of statistical applications.

gwk-statistics

These are from Gary Warren King, and are also partially based on CLASP. They are well written, and the functions have excellent documentation. The major reason we don't include them by default is that they use an older ecosystem of libraries that duplicate more widely used systems (for example, numerical-utilities and alexandria). If you want to use these, you'll need to uncomment the appropriate code in the ASDF and pkgdcl.lisp files.

Accuracy

LH and GWK statistics compute quantiles, CDF, PDF, etc. using routines from CLASP, which in turn are based on algorithms from Numerical Recipes. These are known to be accurate to only about four decimal places. This is probably accurate enough for many statistical problems; however, should you need greater accuracy, look at the distributions system. The computations there are based on special-functions, which has accuracy of around 15 digits. Unfortunately, the documentation of distributions and the 'wrapping' of them here are incomplete, so you'll need to know the naming pattern, e.g. pdf-gamma, cdf-gamma, etc., which is described in the distributions documentation.

Versions

Because this system is likely to change rapidly, we have adopted the versioning scheme proposed in defpackage+. This is also the scheme alexandria uses, where a version number is appended to the package name. So, statistics-1 is our current package name, statistics-2 will be the next, and so on. If you don't like these names, you can always change them locally using a package-local nickname.

Installation

To get a local copy up and running follow these steps:

(ql:quickload :statistics)

or, if you already have the system downloaded to your local machine:

(asdf:load-system :statistics)

If you are using SBCL, you will see a large number of notes printed about the compiler's inability to optimise. This was the subject of issue #1. The short answer is that the functions all take arbitrary inputs, perform input tests specific to the calculation, and then coerce the inputs and provide declarations so that the actual calculations can be optimised. So, you should be able to ignore the notes.
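
The test, coerce, then declare pattern can be sketched as follows. This is an illustration only, not code from the library; sum-of-squares and %sum-of-squares are hypothetical names.

```lisp
(defun %sum-of-squares (xs)
  "Inner calculation on a specialised vector, which the compiler can optimise."
  (declare (type (simple-array double-float (*)) xs)
           (optimize (speed 3)))
  (let ((n (length xs))
        (sum 0d0)
        (ss 0d0))
    (declare (type double-float sum ss))
    (dotimes (i n)
      (incf sum (aref xs i)))
    (let ((mean (/ sum n)))
      (dotimes (i n)
        (incf ss (expt (- (aref xs i) mean) 2)))
      ss)))

(defun sum-of-squares (data)
  "Sum of squared deviations from the mean of DATA, a non-empty sequence of reals."
  ;; 1. Input tests specific to the calculation.
  (check-type data sequence)
  (assert (plusp (length data)) (data) "DATA must not be empty.")
  (assert (every #'realp data) (data) "DATA must contain only real numbers.")
  ;; 2. Coerce to a specialised vector, then call the declared inner function.
  (%sum-of-squares
   (map '(simple-array double-float (*))
        (lambda (x) (coerce x 'double-float))
        data)))
```

The outer function accepts any sequence of reals, so the compiler cannot specialise it; code written this way is the kind that draws SBCL's optimisation notes. The inner function only ever sees double-float vectors.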

Usage

Create a data frame of weather data:

(load #P"LS:DATA;sg-weather")

and take the mean maximum temperature:

LS-USER> (statistics-1:mean sg-weather:max-temps)

For more examples, please refer to the Documentation.

You can use a package local nickname to give the package a shorter name, e.g. "stats" if you like.
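
A minimal sketch of such a nickname, assuming the statistics system is already loaded and that your implementation supports the :local-nicknames option of defpackage (SBCL and several other implementations do). The package name my-analysis, the nickname stats and the function average are illustrative only.

```lisp
;; Assumes (ql:quickload :statistics) has already been evaluated.
;; MY-ANALYSIS and the nickname STATS are illustrative names.
(defpackage #:my-analysis
  (:use #:cl)
  (:local-nicknames (#:stats #:statistics-1)))

(in-package #:my-analysis)

(defun average (sample)
  "Mean of SAMPLE, called through the local nickname."
  (stats:mean sample))
```

The nickname is visible only inside my-analysis, so when a later version such as statistics-2 is adopted, only the defpackage form needs to change.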

Often all you'll need is lh-stats for general statistical analysis. You can load it with:

(asdf:load-system :statistics/lh)

NB: You can expect to see many warnings when loading lh-stats. These are expected and nothing to worry about.

LH-Stat Functions

These abbreviations are used in function and variable names:

abbreviation  meaning
ci            confidence interval
cdf           cumulative distribution function
ge            greater than or equal to
le            less than or equal to
pdf           probability density function
sd            standard deviation
rxc           rows by columns
sse           sample size estimate

Descriptive statistics

  • mean
  • median
  • mode
  • geometric mean
  • range
  • percentile
  • variance
  • standard-deviation (sd)
  • coefficient-of-variation
  • standard-error-of-the-mean

Distribution functions

  • Poisson & Binomial
      • binomial-probability
      • binomial-cumulative-probability
      • binomial-ge-probability
      • poisson-probability
      • poisson-cumulative-probability
      • poisson-ge-probability
  • normal
      • normal-pdf
      • convert-to-standard-normal
      • phi
      • z
  • t-distribution
  • chi-square
  • chi-square-cdf
Confidence Intervals

  • binomial-probability-ci
  • poisson-mu-ci
  • normal-mean-ci
  • normal-mean-ci-on-sequences
  • normal-variance-ci
  • normal-variance-ci-on-sequence
  • normal-sd-ci

Hypothesis tests (parametric)

  • z-test
  • z-test-on-sequence
  • t-test-one-sample
  • t-test-one-sample-on-sequence
  • t-test-paired
  • t-test-paired-on-sequences
  • t-test-two-sample
  • t-test-two-sample-on-sequences
  • chi-square-test-one-sample
  • f-test
  • binomial-test-one-sample
  • binomial-test-two-sample
  • fisher-exact-test
  • mcnemars-test
  • poisson-test-one-sample

Hypothesis tests (non-parametric)

  • sign-test
  • sign-test-on-sequence
  • wilcoxon-signed-rank-test
  • chi-square-test-rxc
  • chi-square-test-for-trend

Sample size estimates

  • t-test-one-sample-sse
  • t-test-two-sample-sse
  • t-test-paired-sse
  • binomial-test-one-sample-sse
  • binomial-test-two-sample-sse
  • binomial-test-paired-sse
  • correlation-sse

Correlation and Regression

  • linear-regression
  • correlation-coefficient
  • correlation-test-two-sample
  • spearman-rank-correlation

Significance test functions

  • t-significance
  • f-significance

Chi-square significance is calculated from chi-square-cdf in various ways depending on the problem.

Utilities

  • random-sample
  • random-pick
  • bin-and-count
  • fishers-z-transform
  • mean-sd-n
  • square
  • choose
  • permutations
  • round-float

Roadmap

gwk-stats has many useful functions. We'd like to port them to use the Lisp-Stat ecosystem of utilities.

Resources

This system is part of the Lisp-Stat project; that should be your first stop for information. Also see the resources and community pages for more information.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please see CONTRIBUTING for details on the code of conduct and the process for submitting pull requests.

Licenses

CLASP Copyright

Copyright (c) 1990 - 1994 University of Massachusetts Department of Computer Science Experimental Knowledge Systems Laboratory Professor Paul Cohen, Director. All rights reserved.

Permission to use, copy, modify and distribute this software and its documentation is hereby granted without fee, provided that the above copyright notice of EKSL, this paragraph and the one following appear in all copies and in supporting documentation.

EKSL makes no representation about the suitability of this software for any purposes. It is provided "AS IS", without express or implied warranties including (but not limited to) all implied warranties of merchantability and fitness for a particular purpose, and notwithstanding any other provision contained herein. In no event shall EKSL be liable for any special, indirect or consequential damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortuous action, arising out of or in connection with the use or performance of this software, even if EKSL is advised of the possibility of such damages.

Contact

Project Link: https://github.com/lisp-stat/statistics

Dependencies (7)

  • alexandria
  • anaphora
  • clunit2
  • conduit-packages
  • distributions
  • let-plus
  • numerical-utilities