5K Data Analysis

Apr 22, 2018 · 654 words · 4 minutes read R • running

This was my third year running the Big House 5K, an annual run in Ann Arbor, Michigan. Less than a week later I ran my first 10K in Richmond, Virginia. One thing about running is that it gives you lots of time to think, and one thing runners tend to think about during a race is their time and pace.

It is easy to look up race results. The results include your time (and pace) and your standing overall and by division, but it’s not very interesting to just know your ranking. So I decided to analyze the results from previous races to see what I could learn.

My dataset is results from the Big House 5K 2015-2018, consisting of 20,250 runner/year records (runners may not run in every year). I could have found data from more years and different events, but this seemed like as good a place to start as any.

Each record includes the runner’s overall rank, name, bib number, time, pace, hometown, age, sex, division, division rank, and year of the event. Here’s a snippet with my data:

As I mentioned, knowing your ranking alone isn’t that useful. For example, is a rank of 886 good? It depends how many other people ran.

Because I have all of the data, I can calculate my percent rank for each event. Percent (or percentile) rank is the percentage of finish times that are equal to or lower than a given time, so a lower percent rank means a faster finish relative to the field. This gives me a way to interpret my time in the context of others.
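As a rough sketch of that calculation (in Python, with made-up finish times in seconds rather than the actual race data; the function name `percent_rank` is my own), percent rank could be computed like this:

```python
def percent_rank(times, my_time):
    """Percentage of finish times that are equal to or lower than my_time."""
    if not times:
        raise ValueError("times must not be empty")
    at_or_below = sum(1 for t in times if t <= my_time)
    return 100.0 * at_or_below / len(times)


# Hypothetical finish times in seconds (not real race results).
finish_times = [1200, 1350, 1500, 1500, 1800, 2100, 2400, 2700, 3000, 3600]

# Four of the ten times are at or below 1500 seconds.
print(percent_rank(finish_times, 1500))  # 40.0
```

With this definition the slowest finisher always has a percent rank of 100%, and a smaller number means a better result.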

The violin plot below shows how finish times are distributed by each year. The outer shape is the density; inside is a traditional boxplot denoting the first quartile, median, third quartile, and the range 1.5×IQR (interquartile range) above the third quartile and below the first quartile. I’ve marked my own time as the red points.
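To make the boxplot pieces concrete, here is a minimal Python sketch of how the quartiles and the 1.5×IQR whiskers can be derived. The times are invented for illustration, the helper name `boxplot_stats` is hypothetical, and the "inclusive" quantile method is an assumption; plotting libraries may interpolate quartiles slightly differently.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical finish times in seconds; 4500 stands in for someone who walked.
sample_times = [1200, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 4500]
stats = boxplot_stats(sample_times)
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
```

In this toy sample the quartiles are 1600, 1800, and 2000 seconds, so the fences sit at 1000 and 2600 seconds and the 4500-second finish falls outside the upper whisker.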

This plot is helpful because it shows the distribution of finish times but also conveys where I stand relative to others. My first year I was worse than average (60.2% rank), but then improved significantly my second year (27.3% rank). This year my time improved by less, but I was in the top 14.8% of all runners!

Despite my relatively strong overall standing, it would be fairer to compare my time to runners within my division (males ages 25-29). This changes my percent ranks for the three years to 81.3%, 48.7%, and 29.7% (still not bad).
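The same percent rank can be restricted to a division by grouping records first. The sketch below is Python with invented records; the field names (`year`, `bib`, `division`, `seconds`) and the division labels are assumptions for illustration, not the actual layout of the results file.

```python
from collections import defaultdict


def percent_rank(times, my_time):
    """Percentage of finish times that are equal to or lower than my_time."""
    return 100.0 * sum(1 for t in times if t <= my_time) / len(times)


def division_percent_ranks(records):
    """Map (year, bib) to the runner's percent rank within their year and division."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["year"], r["division"])].append(r["seconds"])
    return {
        (r["year"], r["bib"]): percent_rank(groups[(r["year"], r["division"])], r["seconds"])
        for r in records
    }


# Hypothetical records: one faster division and one slower division.
records = [
    {"year": 2018, "bib": 1, "division": "M25-29", "seconds": 1300},
    {"year": 2018, "bib": 2, "division": "M25-29", "seconds": 1500},
    {"year": 2018, "bib": 3, "division": "M25-29", "seconds": 1700},
    {"year": 2018, "bib": 4, "division": "M25-29", "seconds": 1900},
    {"year": 2018, "bib": 5, "division": "F60-64", "seconds": 2400},
    {"year": 2018, "bib": 6, "division": "F60-64", "seconds": 2600},
    {"year": 2018, "bib": 7, "division": "F60-64", "seconds": 2800},
    {"year": 2018, "bib": 8, "division": "F60-64", "seconds": 3000},
]

overall = percent_rank([r["seconds"] for r in records], 1500)
within_division = division_percent_ranks(records)[(2018, 2)]
print(overall, within_division)  # 25.0 50.0
```

Runner 2 looks strong against the whole field (25%) but is only mid-pack within a fast division (50%), which is the same effect seen in my own numbers.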

We can also look at summary statistics for each year (median, IQR, 90th percentile, and N, the number of finishers). The median gives us the 50% cutoff: the time to beat to be in the top 50%. The interquartile range (IQR) is the difference between the third and first quartiles; it is the window of time in which the middle 50% of runners finish. The 90th percentile is the time that the top 10% of runners finish under (in other words, 90% of runners finish slower than this time).
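For a single year, these summaries could be computed along the following lines. This is a Python sketch on made-up, evenly spaced times; the helper names `summarize` and `format_time` are hypothetical, and the "inclusive" quantile method is an assumption, so other tools may give slightly different cutoffs on real data.

```python
import statistics


def format_time(seconds):
    """Format a duration in seconds as, for example, '25m 25s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


def summarize(times):
    """Median, IQR, top-10% cutoff, and N for one year's finish times."""
    q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    # The top 10% finish under the first decile of the time distribution.
    top_10_cutoff = statistics.quantiles(times, n=10, method="inclusive")[0]
    return {
        "n": len(times),
        "median": median,
        "iqr": q3 - q1,
        "top_10_cutoff": top_10_cutoff,
    }


# Hypothetical finish times in seconds: 20 runners from 20:00 to 51:40.
times_one_year = list(range(1200, 3200, 100))
summary = summarize(times_one_year)
for name in ("median", "iqr", "top_10_cutoff"):
    print(name, format_time(summary[name]))
```

Because faster times are smaller numbers, the "top 10%" cutoff is taken from the low end of the sorted times rather than the high end.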

The median finish time, IQR, and 90th percentile are similar between years (as we saw in the plot above). However, there is some evidence that runners are getting slower. The 2018 median time is more than 2 minutes slower than the 2015 median, and the IQR is four and a half minutes wider. Differences like these would probably meet a traditional statistical significance threshold, simply because the sample is so large. Practically, I don’t think there’s much difference between years.

Wrapping things up, here are some findings from the data:

I’ve improved a lot since my first 5K, going from below average to above average.
The distribution of finish times is pretty heavily skewed. Some participants walk a portion or the entire distance.
To be among the top 10% of runners overall I’d need to finish within 25m 25s (and within 22m 39s for my division).