Most Important Event at Crossfit Open

Background

The CrossFit Open takes place every year as the first qualifying stage of the CrossFit Games (which is used to determine the “Fittest on Earth). The athletes that finish in the top 20 in the world automatically qualify for the CrossFit Games. The CrossFit Open consists of 5 unique workouts (see workouts here) and your overall CrossFit Open score consists of how you finished on all 5 workouts (see leaderboard here).

Problem

Which of the 5 events had the most impact on the overall leaderboard? Note this analysis is initially just for male athletes I will be doing female athletes next and comparing the results.

Analysis

I started by scraping the scores from the top 500 male athletes (by overall finish) from the CrossFit Open leaderboard and ended up with the following Pandas DataFrame:

https://raw.githubusercontent.com/sik-flow/Crossfit_Scrape/master/Images/df.png

I am interested in the following features:

Initially, I made scatter plots of each event vs the overall rank to see if there were any obvious trends.

https://raw.githubusercontent.com/sik-flow/Crossfit_Scrape/master/Images/scrape.png

All the scatter plots look really similar, so similar that I had to double check my code to make sure I wasn’t using the same data for each subplot. Let’s look at the Pearson correlation coefficient to see if there is any difference in the relationship between the different workouts and the overall rank.

https://raw.githubusercontent.com/sik-flow/Crossfit_Scrape/master/Images/correlation.png

All of the correlation coefficients are really similar between 0.52 and 0.55. Next I want to try permutation importance to see if that can better determine which is the most important workout. Permutation measures how much a score changes when a feature is no longer available.

https://raw.githubusercontent.com/sik-flow/Crossfit_Scrape/master/Images/permutation_importance.png

This shows that after multiple shuffles the \(R^2\) for 19.3 changed on average 0.35, while the average \(R^2\) for 19.4 only changed 0.06. This means that the event 19.3 had a big impact on predicting their overall score, while 19.4 did not have much of an impact on predicting their overall score. This could be caused by a lot of different factors - my guess is that 19.3 was a workout that was a “seperator”. In that if you score badly on 19.3, you are going to do badly overall relatively.