UFC Data Analysis


Mixed Martial Arts

Posted by Rishabh Pande on 13 May 2018

Background

Mixed martial arts (MMA) is a full-contact combat sport that allows striking and grappling, both standing and on the ground, using techniques from other combat sports and martial arts. The Ultimate Fighting Championship (UFC) is an American MMA organization based in Las Vegas, Nevada. It is the largest MMA promotion in the world and features the top-ranked fighters of the sport. The UFC produces events worldwide that showcase twelve weight divisions and abide by the Unified Rules of Mixed Martial Arts. MMA is a highly unpredictable sport.

Overview

Our dataset contains a list of all UFC fights since 2013, with each fighter’s round-by-round record summed up over the fights preceding that one. Each row represents a single fight. Blank stats mean it is the fighter’s first fight since 2013, which is where granular data for UFC fights begins. The data comes from Kaggle.

A few things we will try to visualize:

How are age and height related to the outcome?

What is the most popular way to win a fight?

What are the most popular locations in the UFC?


Import libraries and Load data

Not all Python capabilities are loaded into your working environment by default, so we need to import every library we are going to use. For convenience, we give the modules short aliases (e.g. numpy -> np, pandas -> pd).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

df = pd.read_csv('/Users/Rishabh/Desktop/data.csv')
df.head(2)
BPrev BStreak B_Age B_Height B_HomeTown B_ID B_Location B_Name B_Weight B__Round1_Grappling_Reversals_Landed ... R__Round5_TIP_Ground Time R__Round5_TIP_Guard Control Time R__Round5_TIP_Half Guard Control Time R__Round5_TIP_Misc. Ground Control Time R__Round5_TIP_Mount Control Time R__Round5_TIP_Neutral Time R__Round5_TIP_Side Control Time R__Round5_TIP_Standing Time winby winner
0 1 1 23.0 182.0 Trento Italy 2783 Mezzocorona Italy Marvin Vettori 84 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN DEC red
1 0 0 32.0 175.0 Careiro da Várzea, Amazonas Brazil 2208 Pharr, Texas USA Carlos Diego Ferreira 70 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN SUB blue

2 rows × 895 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1477 entries, 0 to 1476
Columns: 895 entries, BPrev to winner
dtypes: float64(873), int64(13), object(9)
memory usage: 10.1+ MB
df.describe()
BPrev BStreak B_Age B_Height B_ID B_Weight B__Round1_Grappling_Reversals_Landed B__Round1_Grappling_Standups_Landed B__Round1_Grappling_Submissions_Attempts B__Round1_Grappling_Takedowns_Attempts ... R__Round5_TIP_Distance Time R__Round5_TIP_Ground Control Time R__Round5_TIP_Ground Time R__Round5_TIP_Guard Control Time R__Round5_TIP_Half Guard Control Time R__Round5_TIP_Misc. Ground Control Time R__Round5_TIP_Mount Control Time R__Round5_TIP_Neutral Time R__Round5_TIP_Side Control Time R__Round5_TIP_Standing Time
count 1477.000000 1477.000000 1474.000000 1476.000000 1477.000000 1477.000000 978.000000 978.000000 978.000000 978.000000 ... 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000
mean 1.735274 0.654705 30.954545 177.451220 1964.633040 73.804333 0.036810 0.896728 0.431493 2.986708 ... 211.965278 34.062500 66.604167 5.527778 4.319444 5.138889 12.097222 224.965278 4.562500 263.069444
std 1.895561 1.057269 4.020311 8.561541 666.949141 14.980531 0.193748 1.255722 0.830527 3.987291 ... 139.412374 68.819742 94.574736 22.374419 12.854023 14.312013 36.429320 142.328509 19.698681 162.386212
min 0.000000 0.000000 20.000000 152.000000 129.000000 52.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 28.000000 172.000000 1755.000000 65.000000 0.000000 0.000000 0.000000 0.000000 ... 110.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 126.750000 0.000000 139.000000
50% 1.000000 0.000000 31.000000 177.000000 2156.000000 70.000000 0.000000 0.000000 0.000000 2.000000 ... 214.000000 0.000000 9.500000 0.000000 0.000000 0.000000 0.000000 232.000000 0.000000 291.000000
75% 3.000000 1.000000 34.000000 182.000000 2337.000000 84.000000 0.000000 1.000000 1.000000 4.000000 ... 294.500000 47.500000 109.000000 0.000000 0.000000 0.000000 2.000000 299.000000 0.000000 300.000000
max 11.000000 7.000000 46.000000 213.000000 2882.000000 120.000000 2.000000 9.000000 6.000000 33.000000 ... 647.000000 496.000000 529.000000 144.000000 91.000000 62.000000 264.000000 659.000000 128.000000 841.000000

8 rows × 886 columns

df.describe(include="all")
BPrev BStreak B_Age B_Height B_HomeTown B_ID B_Location B_Name B_Weight B__Round1_Grappling_Reversals_Landed ... R__Round5_TIP_Ground Time R__Round5_TIP_Guard Control Time R__Round5_TIP_Half Guard Control Time R__Round5_TIP_Misc. Ground Control Time R__Round5_TIP_Mount Control Time R__Round5_TIP_Neutral Time R__Round5_TIP_Side Control Time R__Round5_TIP_Standing Time winby winner
count 1477.000000 1477.000000 1474.000000 1476.000000 1471 1477.000000 1470 1477 1477.000000 978.000000 ... 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 1461 1477
unique NaN NaN NaN NaN 568 NaN 431 719 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 3 4
top NaN NaN NaN NaN Rio de Janeiro Brazil NaN Rio de Janeiro Brazil Tim Means NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN DEC red
freq NaN NaN NaN NaN 32 NaN 38 8 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 720 867
mean 1.735274 0.654705 30.954545 177.451220 NaN 1964.633040 NaN NaN 73.804333 0.036810 ... 66.604167 5.527778 4.319444 5.138889 12.097222 224.965278 4.562500 263.069444 NaN NaN
std 1.895561 1.057269 4.020311 8.561541 NaN 666.949141 NaN NaN 14.980531 0.193748 ... 94.574736 22.374419 12.854023 14.312013 36.429320 142.328509 19.698681 162.386212 NaN NaN
min 0.000000 0.000000 20.000000 152.000000 NaN 129.000000 NaN NaN 52.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN
25% 0.000000 0.000000 28.000000 172.000000 NaN 1755.000000 NaN NaN 65.000000 0.000000 ... 2.000000 0.000000 0.000000 0.000000 0.000000 126.750000 0.000000 139.000000 NaN NaN
50% 1.000000 0.000000 31.000000 177.000000 NaN 2156.000000 NaN NaN 70.000000 0.000000 ... 9.500000 0.000000 0.000000 0.000000 0.000000 232.000000 0.000000 291.000000 NaN NaN
75% 3.000000 1.000000 34.000000 182.000000 NaN 2337.000000 NaN NaN 84.000000 0.000000 ... 109.000000 0.000000 0.000000 0.000000 2.000000 299.000000 0.000000 300.000000 NaN NaN
max 11.000000 7.000000 46.000000 213.000000 NaN 2882.000000 NaN NaN 120.000000 2.000000 ... 529.000000 144.000000 91.000000 62.000000 264.000000 659.000000 128.000000 841.000000 NaN NaN

11 rows × 895 columns

print("Number of records : ", df.shape[0])
print("Number of Blue fighters : ", len(df.B_ID.unique()))
print("Number of Red fighters : ", len(df.R_ID.unique()))
Number of records :  1477
Number of Blue fighters :  715
Number of Red fighters :  627
df.isnull().sum(axis=0)
BPrev                                                       0
BStreak                                                     0
B_Age                                                       3
B_Height                                                    1
B_HomeTown                                                  6
B_ID                                                        0
B_Location                                                  7
B_Name                                                      0
B_Weight                                                    0
B__Round1_Grappling_Reversals_Landed                      499
B__Round1_Grappling_Standups_Landed                       499
B__Round1_Grappling_Submissions_Attempts                  499
B__Round1_Grappling_Takedowns_Attempts                    499
B__Round1_Grappling_Takedowns_Landed                      499
B__Round1_Strikes_Body Significant Strikes_Attempts       499
B__Round1_Strikes_Body Significant Strikes_Landed         499
B__Round1_Strikes_Body Total Strikes_Attempts             499
B__Round1_Strikes_Body Total Strikes_Landed               499
B__Round1_Strikes_Clinch Body Strikes_Attempts            499
B__Round1_Strikes_Clinch Body Strikes_Landed              499
B__Round1_Strikes_Clinch Head Strikes_Attempts            499
B__Round1_Strikes_Clinch Head Strikes_Landed              499
B__Round1_Strikes_Clinch Leg Strikes_Attempts             499
B__Round1_Strikes_Clinch Leg Strikes_Landed               499
B__Round1_Strikes_Clinch Significant Kicks_Attempts       499
B__Round1_Strikes_Clinch Significant Kicks_Landed         499
B__Round1_Strikes_Clinch Significant Punches_Attempts     499
B__Round1_Strikes_Clinch Significant Punches_Landed       499
B__Round1_Strikes_Clinch Significant Strikes_Attempts     499
B__Round1_Strikes_Clinch Significant Strikes_Landed       499
                                                         ... 
R__Round5_Strikes_Kicks_Attempts                         1333
R__Round5_Strikes_Kicks_Landed                           1333
R__Round5_Strikes_Knock Down_Landed                      1333
R__Round5_Strikes_Leg Total Strikes_Attempts             1450
R__Round5_Strikes_Leg Total Strikes_Landed               1450
R__Round5_Strikes_Legs Significant Strikes_Attempts      1333
R__Round5_Strikes_Legs Significant Strikes_Landed        1333
R__Round5_Strikes_Legs Total Strikes_Attempts            1350
R__Round5_Strikes_Legs Total Strikes_Landed              1350
R__Round5_Strikes_Punches_Attempts                       1333
R__Round5_Strikes_Punches_Landed                         1333
R__Round5_Strikes_Significant Strikes_Attempts           1333
R__Round5_Strikes_Significant Strikes_Landed             1333
R__Round5_Strikes_Total Strikes_Attempts                 1333
R__Round5_Strikes_Total Strikes_Landed                   1333
R__Round5_TIP_Back Control Time                          1333
R__Round5_TIP_Clinch Time                                1333
R__Round5_TIP_Control Time                               1333
R__Round5_TIP_Distance Time                              1333
R__Round5_TIP_Ground Control Time                        1333
R__Round5_TIP_Ground Time                                1333
R__Round5_TIP_Guard Control Time                         1333
R__Round5_TIP_Half Guard Control Time                    1333
R__Round5_TIP_Misc. Ground Control Time                  1333
R__Round5_TIP_Mount Control Time                         1333
R__Round5_TIP_Neutral Time                               1333
R__Round5_TIP_Side Control Time                          1333
R__Round5_TIP_Standing Time                              1333
winby                                                      16
winner                                                      0
Length: 895, dtype: int64

Missing values

We observe that there are some missing values in our data. Age and height are important features in any combat sport, and both have a handful of missing values.
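With 895 columns, the raw null counts are hard to scan. A minimal sketch of one way to condense them is to keep only the columns that have gaps and report the share of rows affected. The toy frame below is made up for illustration, and the helper name `missing_summary` is hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the UFC data; the values are made up.
toy = pd.DataFrame({
    "B_Age": [23.0, 32.0, np.nan, 30.0],
    "B_Height": [182.0, 175.0, 180.0, np.nan],
    "B__Round1_Strikes_Kicks_Landed": [np.nan, np.nan, 3.0, np.nan],
    "winner": ["red", "blue", "red", "blue"],
})


def missing_summary(frame):
    """Return the count and share of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.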

We will address the missing values in age and height. We could simply delete the rows with missing values, but we usually want to use as many data points as possible. Replacing missing values with zeros would not be a good idea: an age or height of 0 would be treated as a real measurement and would distort our data.

A good replacement value is therefore something that doesn’t affect the data too much, such as the median or mean. The “fillna” function replaces every NaN (not a number) entry with the given input (the mean of the column in our case). Let’s do this for both ‘Blue’ and ‘Red’ fighters.

df['B_Age'] = df['B_Age'].fillna(np.mean(df['B_Age']))
df['B_Height'] = df['B_Height'].fillna(np.mean(df['B_Height']))
df['R_Age'] = df['R_Age'].fillna(np.mean(df['R_Age']))
df['R_Height'] = df['R_Height'].fillna(np.mean(df['R_Height']))
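The text mentions the median as an alternative to the mean. A minimal sketch of median imputation, written as a loop over columns, might look like the following. The toy values are made up, and `fill_with_median` is a hypothetical helper name. The median is less sensitive to outliers, such as one unusually old fighter.

```python
import numpy as np
import pandas as pd

# Toy ages with one gap per column; the values are made up.
toy = pd.DataFrame({
    "B_Age": [25.0, 30.0, np.nan, 45.0],
    "R_Age": [28.0, np.nan, 31.0, 33.0],
})


def fill_with_median(frame, columns):
    """Return a copy with NaN in each listed column replaced by that column's median."""
    filled = frame.copy()
    for col in columns:
        # Series.median() skips NaN by default.
        filled[col] = filled[col].fillna(filled[col].median())
    return filled


filled = fill_with_median(toy, ["B_Age", "R_Age"])
print(filled)
```

Working on a copy keeps the original frame intact, which makes it easy to compare the two imputation choices.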

Data Visualization

Let’s start by looking at who is winning more in our dataset:

#draw a bar plot 
sns.countplot(x='winner',data=df)
plt.title("Who's winning more", color='blue', fontsize=15)

[Figure: bar chart of the number of fights by winner]

Here I will just follow my instinct and play around a bit with what I feel will matter.

Let’s talk about age, a critical factor in any sport. We will start by looking at the distribution of age in our dataset.

# Plot the age distribution for each corner side by side
fig, ax = plt.subplots(1,2, figsize=(15, 5))
sns.distplot(df.B_Age, ax=ax[0])  # Blue
sns.distplot(df.R_Age, ax=ax[1])  # Red

[Figure: age distributions of Blue fighters (left) and Red fighters (right)]

Age is a big factor in any sport, even more so in MMA, where a fighter needs a combination of strength, agility and speed (among other skills). These attributes are commonly thought to peak between 27 and 35, so fighters in this range should have a higher likelihood of winning. As a first check, let’s count fights by age for Blue fighters.

BAge = df.groupby(['B_Age']).count()['winner']
BlueAge = BAge.sort_values(axis=0, ascending=False)
BlueAge.head(10)
B_Age
30.0    164
33.0    138
29.0    134
32.0    128
27.0    120
31.0    112
28.0    106
34.0    106
26.0     72
35.0     67
Name: winner, dtype: int64

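Most Blue-corner fights involve fighters in their late 20s through early 30s. Note that this count covers every fight at each age, wins and losses alike, so it shows when fighters are most active rather than when they win most often. It is still consistent with the idea that fighters peak during this time and then lose strength, quickness and cardiovascular capacity.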
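The grouping above counts every Blue-corner fight at each age, whether won or lost. To look at wins specifically, one option is to compute the share of fights Blue won within each age band. This is a minimal sketch on a made-up toy frame; the band edges (27 and 35) follow the text, and the helper name `blue_win_rate_by_age` is hypothetical.

```python
import pandas as pd

# Toy fights; ages and outcomes are made up for illustration.
toy = pd.DataFrame({
    "B_Age": [24, 26, 29, 30, 31, 33, 36, 38],
    "winner": ["red", "blue", "blue", "blue", "red", "blue", "red", "red"],
})


def blue_win_rate_by_age(frame, bins, labels):
    """Share of decided fights won by the Blue fighter, per age band."""
    # Ignore draws and no contests so the rate is wins / (wins + losses).
    decided = frame[frame["winner"].isin(["blue", "red"])].copy()
    # right=False makes each band include its lower edge: [0, 27), [27, 35), ...
    decided["age_band"] = pd.cut(decided["B_Age"], bins=bins, labels=labels, right=False)
    decided["blue_won"] = decided["winner"].eq("blue")
    grouped = decided.groupby("age_band", observed=False)["blue_won"]
    return grouped.agg(fights="count", win_rate="mean")


rates = blue_win_rate_by_age(
    toy,
    bins=[0, 27, 35, 100],
    labels=["under 27", "27-34", "35+"],
)
print(rates)
```

Reporting the number of fights next to the win rate makes it easier to spot bands that are too small to read much into.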

On the other hand, younger fighters may not reach peak strength until around 27-28, while older fighters are usually slower and more likely to lose. Let’s check whether this shows up in our data, this time for ‘Red’ fighters, by looking at the least common ages.

RAge = df.groupby(['R_Age']).count()['winner']
RedAge = RAge.sort_values(axis=0, ascending=False)
RedAge.tail(10)
R_Age
24.000000    25
23.000000    17
40.000000    10
41.000000    10
22.000000    10
21.000000     5
43.000000     4
44.000000     3
46.000000     2
31.380081     1
Name: winner, dtype: int64

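Looks like this is true: the least common ages are all either under 25 or 40 and above. (The single entry at 31.380081 is the column mean we filled in for a missing age.) It makes me curious about the total number of Red and Blue fighters who are younger than 35.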

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Label each Blue fighter as 35 and over, or under 35
above35 = ['above35' if i >= 35 else 'below35' for i in df.B_Age]
df_B = pd.DataFrame({'B_Age': above35})
sns.countplot(x=df_B.B_Age, ax=ax[0])
# Set labels on the axes object itself; plt.ylabel/plt.title only affect the current axes
ax[0].set_ylabel('Number of Blue fighters')
ax[0].set_title('Age of Blue fighters', color='blue', fontsize=15)

# Same split for Red fighters
above35 = ['above35' if i >= 35 else 'below35' for i in df.R_Age]
df_R = pd.DataFrame({'R_Age': above35})
sns.countplot(x=df_R.R_Age, ax=ax[1])
ax[1].set_ylabel('Number of Red fighters')
ax[1].set_title('Age of Red fighters', color='red', fontsize=15)

[Figure: counts of fighters aged 35 and over versus under 35, Blue (left) and Red (right)]

Interestingly, most fighters are below 35. MMA is a brutal sport for older guys and can leave them with lifelong injuries.

Lastly, let’s look at the mean age difference (Blue minus Red) for each outcome:

df['Age_Difference'] = df.B_Age - df.R_Age
df[['Age_Difference', 'winner']].groupby('winner').mean()
            Age_Difference
winner
blue             -1.459711
draw             -1.555556
no contest        0.058824
red               0.273304

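Age_Difference is Blue’s age minus Red’s, so a negative value means Blue is the younger fighter. When Blue wins, Blue is about 1.5 years younger on average; when Red wins, Red is about 0.3 years younger. Age matters, and youth looks like an advantage.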
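Group means can hide how often the younger fighter actually wins. A minimal sketch of a more direct check is to compute, among decided fights with an age gap, the share won by the younger fighter. The toy data is made up, and `younger_win_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy fights; ages and outcomes are made up for illustration.
toy = pd.DataFrame({
    "B_Age": [25.0, 31.0, 29.0, 35.0, 30.0],
    "R_Age": [30.0, 27.0, 29.0, 28.0, 33.0],
    "winner": ["blue", "red", "red", "blue", "draw"],
})


def younger_win_share(frame):
    """Share of decided fights (with an age gap) won by the younger fighter."""
    decided = frame[frame["winner"].isin(["blue", "red"])]
    decided = decided.dropna(subset=["B_Age", "R_Age"])
    # Fights between same-age fighters say nothing about the age effect.
    decided = decided[decided["B_Age"] != decided["R_Age"]]
    younger = np.where(decided["B_Age"] < decided["R_Age"], "blue", "red")
    return float((younger == decided["winner"]).mean())


share = younger_win_share(toy)
print(round(share, 3))
```

A share close to 0.5 would suggest age has little bearing on the outcome, while a share well above 0.5 would support the youth advantage.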

Height can also be an advantage in MMA: more height usually means more reach, so a taller fighter can attack from a distance while staying out of the opponent’s hitting zone. Let’s start by looking at the distribution of height:

fig, ax = plt.subplots(1,2, figsize=(15, 5))
sns.distplot(df.B_Height, bins = 20, ax=ax[0]) #Blue 
sns.distplot(df.R_Height, bins = 20, ax=ax[1]) #Red

[Figure: height distributions of Blue fighters (left) and Red fighters (right)]

fig, ax = plt.subplots(figsize=(14, 6))
# Blue-corner heights are labelled 'Blue' and Red-corner heights 'Red'
sns.kdeplot(df.B_Height, shade=True, label='Blue')
sns.kdeplot(df.R_Height, shade=True, color='indianred', label='Red')
ax.legend()

[Figure: kernel density estimates of Blue and Red fighter heights]

df['Height Difference'] = df.B_Height - df.R_Height
df[['Height Difference', 'winner']].groupby('winner').mean()
            Height Difference
winner
blue                 0.118151
draw                 2.444444
no contest          -1.411765
red                 -0.052536

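On average the winner is the taller fighter, but only by a fraction of a centimeter (about 0.12 cm when Blue wins and 0.05 cm when Red wins), so height on its own looks like a very weak advantage in this data. Unless, of course, you are Rocky fighting Drago ;)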
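A mean on its own says nothing about how many fights sit behind it, and small groups such as draws can produce large averages by chance. A minimal sketch that reports the count and median alongside the mean is shown below; the toy heights are made up.

```python
import pandas as pd

# Toy fights; heights (cm) and outcomes are made up for illustration.
toy = pd.DataFrame({
    "B_Height": [180.0, 175.0, 190.0, 170.0, 185.0, 178.0],
    "R_Height": [178.0, 180.0, 185.0, 172.0, 175.0, 178.0],
    "winner": ["blue", "red", "blue", "red", "draw", "red"],
})

# Positive values mean the Blue fighter is taller.
toy["Height Difference"] = toy["B_Height"] - toy["R_Height"]

summary = toy.groupby("winner")["Height Difference"].agg(["count", "mean", "median"])
print(summary)
```

On the real data, the `count` column would show how few fights the draw and no contest averages rest on.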

Now, let’s talk about how the fighters are winning. The three most popular ways to win an MMA fight are:

1. DEC: Decision (Dec) is the result of a fight that does not end in a knockout or submission, in which the judges’ scorecards are consulted to determine the winner; a majority of judges must agree on a result. A fight can end in a win for an athlete, a draw, or a no decision.

2. SUB: Submission (Sub), also referred to as a “tap out” or “tapping out”, is when a fighter concedes, usually by visibly tapping the floor or the opponent with the hand (or in some cases the foot) to signal the submission to the opponent and/or the referee.

3. KO/TKO: Knockout (KO) is when a fighter is knocked out cold by a strike (i.e. goes from standing to not standing after receiving a strike). Technical Knockout (TKO) is when a fighter is getting pummeled and is unable to defend him/herself further; the referee makes a judgement call to step in and end the fight, preventing any more unnecessary or permanent damage.

sns.countplot(x='winby',data=df)
plt.title('Most popular way to win?',color = 'blue',fontsize=15)

[Figure: bar chart of the number of fights by winning method (winby)]

So most fights go to the judges. The second most popular way to win is by knockout or technical knockout.
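The bar chart shows raw counts; proportions are often easier to compare. A minimal sketch using `value_counts(normalize=True)` follows. The toy column is made up, and the labels assume the same DEC / SUB / KO/TKO coding described above.

```python
import numpy as np
import pandas as pd

# Toy outcomes, including one missing value; the data is made up.
toy = pd.DataFrame({
    "winby": ["DEC", "KO/TKO", "DEC", "SUB", "DEC", "KO/TKO", np.nan, "DEC"],
})

# normalize=True returns shares instead of counts; NaN rows are excluded by default.
shares = toy["winby"].value_counts(normalize=True)
print(shares.round(3))
```

Because missing values are dropped before normalizing, the shares sum to 1 over the fights with a recorded method.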

MMA is a complex sport; in a sense, it is the only sport where defense and offense can be done in the same movement. Hitting someone is a risk, as it leaves you open to a counter from your opponent. However, the bigger the risk, the greater the reward. The more offensive attempts you make, the more you should land on your opponent (and, with the right skills and power, the better your chance of winning the fight). Let’s see if this is true in our data.

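# Scatter plot with a fitted regression line, one panel per outcome.
# Note: `height` replaced the older `size` argument in seaborn 0.9.
sns.lmplot(x="B__Round1_Strikes_Body Significant Strikes_Attempts", 
               y="B__Round1_Strikes_Body Significant Strikes_Landed", 
               col="winner", hue="winner", data=df, col_wrap=2, height=6)

[Figure: body significant strikes landed versus attempted in round 1 for Blue fighters, one panel per winner]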

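Attempts and strikes landed are, as expected, strongly and roughly linearly related: fighters who throw more body strikes land more. On its own, though, the plot does not show that throwing more leads to winning.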
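The strength of the relationship between attempts and strikes landed can be put into numbers with a Pearson correlation and an overall landing accuracy. This is a minimal sketch on made-up values; only the two column names are taken from the dataset.

```python
import pandas as pd

ATTEMPTS = "B__Round1_Strikes_Body Significant Strikes_Attempts"
LANDED = "B__Round1_Strikes_Body Significant Strikes_Landed"

# Toy strike counts; the numbers are made up for illustration.
toy = pd.DataFrame({
    ATTEMPTS: [10, 20, 30, 40, 50],
    LANDED: [6, 11, 19, 22, 31],
})

# Pearson correlation: 1.0 would mean a perfectly linear relationship.
r = toy[ATTEMPTS].corr(toy[LANDED])

# Overall accuracy: strikes landed per strike attempted.
accuracy = toy[LANDED].sum() / toy[ATTEMPTS].sum()

print(f"correlation: {r:.3f}, accuracy: {accuracy:.3f}")
```

A correlation just below 1 reflects the fact that accuracy varies from fighter to fighter, which is where skill comes in.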

Now, let’s look at fighter locations and find the most popular ones:

# Count fights by the Blue fighter's location

Bloc = df.groupby(['B_Location']).count()['B_ID']
location = Bloc.sort_values(axis=0, ascending=False)
location.head(10)
B_Location
Rio de Janeiro Brazil                  38
Denver, Colorado USA                   27
Albuquerque, New Mexico USA            25
Coconut Creek, Florida USA             21
Sacramento, California USA             20
San Diego, California United States    19
Glendale, Arizona USA                  17
Montreal, Quebec Canada                16
Las Vegas, Nevada USA                  16
Coconut Creek, FL USA                  15
Name: B_ID, dtype: int64
# Count fights by the Red fighter's location
Rloc = df.groupby(['R_Location']).count()['R_ID']
R_location = Rloc.sort_values(axis=0, ascending=False)
R_location.head(10)
R_Location
Rio de Janeiro Brazil                    67
Montreal, Quebec Canada                  30
Coconut Creek, Florida USA               29
Denver, Colorado USA                     29
Coconut Creek, Florida United States     29
Las Vegas, Nevada USA                    24
Sao Paulo Brazil                         22
Albuquerque, New Mexico United States    21
Dublin Ireland                           19
Albuquerque, New Mexico USA              18
Name: R_ID, dtype: int64

Brazil, the USA and Canada are the most popular locations for UFC fighters, although some cities are split across several spellings in the raw counts.
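Because the listings above contain entries such as “Coconut Creek, Florida USA” and “Coconut Creek, Florida United States”, it can help to normalise the strings before counting. The following is a minimal sketch on a made-up sample. The alias table and both helper names are hypothetical, and taking the last word as the country is a rough heuristic that would fail for multi-word country names.

```python
import pandas as pd

# Toy sample of location strings in the style of the dataset.
toy = pd.Series([
    "Coconut Creek, Florida USA",
    "Coconut Creek, Florida United States",
    "Rio de Janeiro Brazil",
    "Montreal, Quebec Canada",
    "Albuquerque, New Mexico United States",
    "Albuquerque, New Mexico USA",
    None,
], name="B_Location")

# Hypothetical alias table: variant country spelling -> canonical spelling.
COUNTRY_ALIASES = {"United States": "USA"}


def normalise_location(value):
    """Collapse whitespace and rewrite known country aliases at the end of the string."""
    if pd.isna(value):
        return value
    text = " ".join(str(value).split())
    for alias, canonical in COUNTRY_ALIASES.items():
        if text.endswith(alias):
            text = text[: -len(alias)] + canonical
    return text


def country_of(locations):
    """Rough heuristic: treat the last whitespace-separated word as the country."""
    return locations.dropna().str.rsplit(" ", n=1).str[-1]


clean = toy.map(normalise_location)
city_counts = clean.value_counts()
country_counts = country_of(clean).value_counts()
print(city_counts)
print(country_counts)
```

This does not merge state abbreviations such as “FL” and “Florida”; that would need a further mapping.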