Statistics in SQL Statistics in SQL SQL isn’t a statistical programming language, but you can use
built-in functions and a few tricks to calculate simple descriptive statistics such as the
sum, mean, and standard deviation. For more-sophisticated analyses you should use your DBMS’s OLAP
(online analytical processing) component or export your data to a dedicated
statistical environment such as Excel, R, SAS, or SPSS. What you should not do is write statistical routines
yourself in SQL or a host language. Implementing statistical algorithms correctly—even simple
ones—means understanding trade-offs in efficiency (the space needed for arithmetic
operations), stability (cancellation of significant digits), and accuracy (handling pathologic sets
of values). See, for example, Ronald Thisted’s Elements of Statistical Computing (Chapman
& Hall/CRC) or John Monahan’s Numerical Methods of Statistics (Cambridge
University Press). You can get away with using small combinations of
built-in SQL functions, such as STDEV()/(SQRT(COUNT())
for the standard error of
the mean, but don’t use complex SQL expressions for correlations, regression, ANOVA (analysis of
variance), or matrix arithmetic, for example. Check your DBMS’s SQL and OLAP documentation to see
which functions it offers. Built-in functions aren’t portable, but they run far
faster and more accurately than equivalent query expressions. The functions MIN() and MAX() calculate order statistics, which are values derived from
a dataset that’s been sorted (ordered) by size. Well-known order
statistics include the trimmed mean, rank, range, mode, and median. Chapter 15 covers the
trimmed mean, rank, and median. The range is the difference between the largest and
smallest values: MAX(expr)-MIN(expr). The mode is
the value that appears most frequently. A dataset can have more than one mode.
The mode is a weak descriptive statistic because it’s not robust,
meaning that it can be affected by adding a small number or unusual or incorrect values to the
dataset. This query finds the mode of book prices in the sample database: SELECT price,
COUNT(*) AS frequency FROM titles GROUP BY price HAVING COUNT(*)
>= ALL(SELECT COUNT(*) FROM titles GROUP BY price); price has two modes: price frequency ————— ————————— 12.99 2
19.95 2
|