4/10/13

Computing the Trimmed Mean Average in SQL

This article by Bob Newstadt presents code to compute a trimmed mean in SQL. The trimmed mean is a more robust version of the simple mean (SQL AVG() aggregate function). It is a useful tool for summarizing ill-behaved real world data.

We all use statistics to help us understand the world. Think of batting averages, grade point averages, or the oft-quoted median price of a single family home. However, averages of real word data can be misleading. Two common problems are having too few samples or having wild values known as outliers.

A widely applicable technique to deal with these issues is the trimmed mean. The trimmed mean computation discards extreme values and averages the remaining samples. The amount of trimming can be tuned to fit the problem. Ideally, this avoids the outliers which can plague the mean while otherwise using as much of the data as possible.

This article presents several ways to compute a trimmed mean in SQL. Among the solutions is code which yields the mean, the median, or something in between depending on the amount of trimming you specify. These queries have been tested using SQL Server 2000 and may use some non-standard extensions.

Sample data

Let’s create a table variable containing a set of scores.
Declare @TestScores table (StudentID int, Score int)
insert @TestScores (StudentID, Score) Values (1,  20)
insert @TestScores (StudentID, Score) Values (2,  03)
insert @TestScores (StudentID, Score) Values (3,  40)
insert @TestScores (StudentID, Score) Values (4,  45)
insert @TestScores (StudentID, Score) Values (5,  50)
insert @TestScores (StudentID, Score) Values (6,  20)
insert @TestScores (StudentID, Score) Values (7,  90)
insert @TestScores (StudentID, Score) Values (8,  20)
insert @TestScores (StudentID, Score) Values (9,  11)
insert @TestScores (StudentID, Score) Values (10, 30)

The average is thrown off by the “curve-wrecking” student 7 who earned a score of 90.
select avg(cast(score as float)) as mean from @testscores

32.899999999999999
If that high score were 900 the mean would be really out of whack, all due to one far out value.

Trimming the smallest and largest values

This trick will compute the mean excluding the smallest and largest values.
select (sum(score)-min(score)-max(score)) / cast(count(*)-2 as float) 
  as meantrimmedby1 
from @testscores

29.5
Of course there must be more than 2 scores for this to work.

Trimming the N smallest and largest values

This code removes the smallest N and largest N scores before computing the average. N is a variable set at run time.
declare @N int
set @N = 3

select @N as N, avg(cast(score as float)) as TrimmedMeanN
from @TestScores a
where
 (select count(*) from @TestScores aa
  where aa.Score <= a.Score) > @N
 and
 (select count(*) from @TestScores bb
  where bb.Score >= a.Score) > @N

3, 26.0
The where clause keeps only the scores which fall between the N largest and the N smallest values. The correlated subqueries in the where clause rank each a.score compared to all scores in @TestScores. Duplicate values are either all removed or all retained. For example, if all scores are the same then none of them will be trimmed no matter what N is. Apply this algorithm only when there are at least 2N scores.

Trimming the smallest and largest percentile

A more general approach is to trim by a fixed percentage instead of a fixed number. Here we trim by a factor between 0.0 and 0.5. Trimming by 0.0 trims nothing yielding the mean. Trimming by .25 discards the scores in the top and bottom quartiles and averages what’s left. Trimming by .5 yields the weighted median. The median is weighted when there are duplicate values. In this example the central values are 20,20,20,30 which average out to 22.5. Compare this to the non-weighted median 25.0 ((20+30)/2).
declare @pp float
set @pp = .5

select @pp as factor, avg(cast(score as float)) as TrimmedMeanP
from @TestScores a
where
 (select count(*) from @TestScores aa
  where aa.Score <= a.Score) >= 
   (select @pp*count(*) from @TestScores)
 and
 (select count(*) from @TestScores bb
  where bb.Score >= a.Score) >= 
   (select @pp*count(*) from @TestScores)

.5, 22.5
This code is similar to the previous query except @N is replaced by @pp*count(*). The relation > was changed to >= so that a factor of .5 generates the weighted median instead of trimming all samples.
We can rewrite this solution to improve performance. The following code cuts the number of table scans in half.
declare @pp float
set @pp = .5

select @pp as factor, sum(cast(score as float)*weight)/sum(weight) as TrimmedMeanP2
from
 (
 select
  a.score,
  count(*) as weight
 from @TestScores a
 cross join @TestScores b
 group by a.score
 having
  sum(case when b.Score <= a.Score
   then 1 else 0 end) >= @pp*count(*)
  and 
  sum(case when b.Score >= a.Score
   then 1 else 0 end) >= @pp*count(*)
 ) as x1

.5, 22.5
The @TestScores table is cross joined with itself to permit comparisons of every score in the table ‘a’ with every score in table ‘b’. The results are grouped by a.score. Thus there will be at most one row in the derived table for every distinct value of a.score. In this example there are 3 scores with the value 20 causing the join to evaluate 30 rows (3*10) for that group. The having clause retains those a.score groups near the center of the distribution. The derived table generates a weight with each score proportional to the number of duplicate values there are for that score in the original table. Finally, the outer select calculates the weighted average of the retained grouped scores.

Trimming using TOP

You may be thinking: Why go to all this trouble when TOP and ORDER BY can easily filter rows from a table? The TOP operator has some limitations which are inconvenient to work around.
TOP’s argument N can not be a variable. Until you upgrade to Yukon, the next version of SQL SERVER 2000, you will need to resort to dynamic SQL if N is variable. Many DBAs try to avoid dynamic SQL for security and performance reasons.
TOP applies to the whole result set. This makes it hard to compose some complex queries which depend on TOP. Consider computing the trimmed mean of each student’s scores. Using TOP you would need to have a cursor to process each student’s scores separately. The query from the previous section can be extended to handle this problem without using cursors.
If just a single result is required and if the amount of trimming is not variable then using TOP may work. Here’s an example of computing the left median using TOP.
select top 1 Score as medianByTOP
from (select top 50 percent Score
 from @TestScores 
 order by Score) as x
order by Score desc

20
This code takes the max value in the bottom half of the distribution. A 25% trimmed mean using TOP can be coded as:
select avg(cast(score as float)) as TrimmedMean25pByTOP
from (select top 66.666 percent Score
 from (select top 75 percent Score
  from @TestScores 
  order by Score desc) as x
 order by Score) as y

29.166666666666668
The inner derived table trims the lowest 25%. The outer derived table trims the highest 25% of the original. The select clause averages the middle 50% of the distribution (66.666% of 75%=50%).

http://www.sqlteam.com/article/computing-the-trimmed-mean-in-sql

No comments: