Statistics in SQL
Statistics in SQL
SQL isn’t a statistical programming language, but you can use
built-in functions and a few
tricks to calculate simple descriptive statistics such as the
sum, mean, and standard deviation.
For more-sophisticated analyses you should use your DBMS’s OLAP
(online analytical
processing) component or export your data to a dedicated
statistical environment such as
Excel, R, SAS, or SPSS.
What you should not do is write statistical routines
yourself in SQL or a host language.
Implementing statistical algorithms correctly—even simple
ones—means understanding
trade-offs in efficiency (the space needed for arithmetic
operations), stability (cancellation
of significant digits), and accuracy (handling pathologic sets
of values). See, for example,
Ronald Thisted’s Elements of Statistical Computing (Chapman
& Hall/CRC) or John
Monahan’s Numerical Methods of Statistics (Cambridge
University Press).
You can get away with using small combinations of
built-in SQL functions, such as
STDEV()/(SQRT(COUNT())
for the standard error of
the mean, but don’t use complex SQL
expressions for correlations, regression, ANOVA (analysis of
variance), or matrix arithmetic,
for example. Check your DBMS’s SQL and OLAP documentation to see
which functions it
offers. Built-in functions aren’t portable, but they run far
faster and more accurately than
equivalent query expressions.
The functions MIN() and MAX() calculate order statistics, which are values derived from
a
dataset that’s been sorted (ordered) by size. Well-known order
statistics include the trimmed
mean, rank, range, mode, and median. Chapter 15 covers the
trimmed mean, rank, and median.
The range is the difference between the largest and
smallest values: MAX(expr)-MIN(expr). The
mode is
the value that appears most frequently. A dataset can have more than one mode.
The
mode is a weak descriptive statistic because it’s not robust,
meaning that it can be affected by
adding a small number or unusual or incorrect values to the
dataset. This query finds the
mode of book prices in the sample database:
SELECT price,
COUNT(*) AS frequency
FROM titles
GROUP BY price
HAVING COUNT(*)
>= ALL(SELECT COUNT(*) FROM titles GROUP BY price);
price has two modes:
price frequency
————— —————————
12.99 2
19.95 2