Pga Putting Analysis

13 minute read

A Brief Look at Driving Distance and Putting on the PGA Tour in 2002 and 2017

Driving

The PGA Tour has seen a huge increase in the distance players can drive the ball over the last few decades. This has been down to technological advances in equipment allied to players focusing more on strength, fitness and flexibility. I’ll highlight this advancement using the PGA Tour Driving Distance statistic, described on the PGA Tour website as:

The average number of yards per measured drive. These drives are measured on two holes per round. Care is taken to select two holes which face in opposite directions to counteract the effect of wind. Drives are measured to the point at which they come to rest regardless of whether they are in the fairway or not.

Putting

This led me to wonder if there had been a corresponding improvement in players putting ability over the same time frame. The statistic I’ve chosen to use in this comparison is the PGA Tour Putting From 10 Feet statistic, described on the PGA Tour website as:

For all holes where putting distance was determined with a laser, the percent of putts made when the ball is greater than 9 feet and less than or equal to 10 feet from the hole. In order to be ranked, a minimum of ten attempts must be made.

I’ve chosen the starting point of the analysis as 2002 as that is the first season the relevant putting stats were collected.

Methodology

I’ll begin by writing a function to turn the table contents of a PGATour.com stats page into a DataFrame using the URL of the page. I’ll use the function to collect the relevant data for each year, then I’ll merge and clean up the DataFrames appropriately.

#Import required modules
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
#Create the function to acquire the PGA Tour data
def pga_url_to_df(url):
    """Turn a pga stats page url into a DataFrame"""
    #Package the request, send request, catch response: r
    r = requests.get(url)
    #Extract the response as HTML: html_doc
    html_doc = r.text
    #Create our Beautiful Soup object from the HTML: main_soup
    main_soup = soup(html_doc, 'lxml')
    #Find our table in the soup    
    stat_table = main_soup.find('table', {'id':'statsTable'})
    #Find elements in the 'body' of the table
    stat_table_body = stat_table.find('tbody')
    
    #Initialize an empty array
    stat_data = []
    
    #Filter for the rows in the table body
    rows = stat_table_body('tr')
    for player in rows:
        #select all cells in the row
        cols = player.find_all('td')
        #Strip out empty values
        cols = [ele.text.strip() for ele in cols]
        #Append data to our array
        stat_data.append([ele for ele in cols if ele])
        
    #Sort the table header, using same process as above for the table body
    stat_table_header = stat_table.find('thead')
    stat_header = []
    hrows = stat_table_header('tr')
    for header in hrows:
        cols = header.find_all('th')
        cols = [ele.text.strip() for ele in cols]
        #Append data to our array
        stat_header.append([ele for ele in cols if ele])

    #Create our header array
    stat_col_labels = stat_header[0]
    
    #Return a DataFrame of the data
    return pd.DataFrame(stat_data, columns=stat_col_labels)
#Use function to get 2002 PGA putting data and driving data
putt_02_df = pga_url_to_df(url = 'https://www.pgatour.com/stats/stat.348.2002.html')
print(putt_02_df.head())

drive_02_df = pga_url_to_df(url = 'https://www.pgatour.com/stats/stat.101.2002.html')
print(drive_02_df.head())
  RANK THIS WEEK RANK LAST WEEK      PLAYER NAME ROUNDS % MADE ATTEMPTS  \
0             T1             T1      Kenny Perry     98  72.73       22   
1             T1             T1  Paul Stankowski    100  72.73       11   
2              3              3      David Frost     90  71.43       21   
3              4              4     Joel Edwards    101  70.59       17   
4             T5             T5    Brian Bateman     94  68.00       25   

  PUTTS MADE  
0         16  
1          8  
2         15  
3         12  
4         17  
  RANK THIS WEEK RANK LAST WEEK         PLAYER NAME ROUNDS   AVG.  \
0              1              1           John Daly     62  306.8   
1              2              2         Boo Weekley     57  297.0   
2              3              3       Mathew Goggin     64  296.1   
3             T4             T4  Charles Howell III    118  293.7   
4             T4             T4      Dennis Paulson     89  293.7   

  TOTAL DISTANCE TOTAL DRIVES  
0         35,594          116  
1         33,261          112  
2         36,713          124  
3         64,019          218  
4         50,520          172  
#Use function to get 2017 PGA putting data and driving data
putt_17_df = pga_url_to_df(url = 'https://www.pgatour.com/stats/stat.348.2017.html')
print(putt_17_df.head())

drive_17_df = pga_url_to_df(url = 'https://www.pgatour.com/stats/stat.101.2017.html')
print(drive_17_df.head())
  RANK THIS WEEK RANK LAST WEEK         PLAYER NAME ROUNDS % MADE ATTEMPTS  \
0              1              1  Rafa Cabrera Bello     63  66.67       27   
1              2              2   Xander Schauffele     96  60.00       30   
2              3              3      Dustin Johnson     77  59.52       42   
3              4              4         Luke Donald     50  58.33       24   
4              5              5        Cody Gribble     79  58.00       50   

  PUTTS MADE  
0         18  
1         18  
2         25  
3         14  
4         29  
  RANK THIS WEEK RANK LAST WEEK     PLAYER NAME ROUNDS   AVG. TOTAL DISTANCE  \
0              1              1    Rory McIlroy     51  316.7         26,600   
1              2              2  Dustin Johnson     77  314.4         40,241   
2              3              3    Brandon Hagy     85  312.7         48,162   
3              4              4       Luke List    102  311.5         61,046   
4              5              5    Andrew Loupe     55  311.3         32,371   

  TOTAL DRIVES  
0           84  
1          128  
2          154  
3          196  
4          104  
#Merge the dataframes of putting and driving for each year together
"""Now we need to merge (using pd.merge()) our two DataFrames for 2002 on the PLAYER NAME column, which is common to both DFs.  
Set the on= parameter to PLAYER NAME as our column to merge on""" 

golf_data_02 = pd.merge(putt_02_df, drive_02_df, on='PLAYER NAME')
print(golf_data_02.head())
print(golf_data_02.shape)

#Select only required columns
golf_data_02 = golf_data_02.loc[:, ('PLAYER NAME', '% MADE', 'ATTEMPTS', 'AVG.')]
print(golf_data_02.head())
print(golf_data_02.info())
  RANK THIS WEEK_x RANK LAST WEEK_x      PLAYER NAME ROUNDS_x % MADE ATTEMPTS  \
0               T1               T1      Kenny Perry       98  72.73       22   
1               T1               T1  Paul Stankowski      100  72.73       11   
2                3                3      David Frost       90  71.43       21   
3                4                4     Joel Edwards      101  70.59       17   
4               T5               T5    Brian Bateman       94  68.00       25   

  PUTTS MADE RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_y   AVG. TOTAL DISTANCE  \
0         16               37               37       98  286.4         53,839   
1          8               34               34      100  287.0         55,671   
2         15              187              187       90  269.5         47,969   
3         12              125              125      101  277.6         56,074   
4         17              T26              T26       94  288.6         51,363   

  TOTAL DRIVES  
0          188  
1          194  
2          178  
3          202  
4          178  
(188, 13)
       PLAYER NAME % MADE ATTEMPTS   AVG.
0      Kenny Perry  72.73       22  286.4
1  Paul Stankowski  72.73       11  287.0
2      David Frost  71.43       21  269.5
3     Joel Edwards  70.59       17  277.6
4    Brian Bateman  68.00       25  288.6
<class 'pandas.core.frame.DataFrame'>
Int64Index: 188 entries, 0 to 187
Data columns (total 4 columns):
PLAYER NAME    188 non-null object
% MADE         188 non-null object
ATTEMPTS       188 non-null object
AVG.           188 non-null object
dtypes: object(4)
memory usage: 7.3+ KB
None
#Deal with column data types and any missing values
"""We can see from our .info call that all the cols are object type, this should be looked at as clearly there are a number of 
cols that should be int or float type."""
golf_data_02['% MADE'] = pd.to_numeric(golf_data_02['% MADE'], errors='coerce')
golf_data_02['ATTEMPTS'] = pd.to_numeric(golf_data_02['ATTEMPTS'], errors='coerce')
golf_data_02['AVG.'] = pd.to_numeric(golf_data_02['AVG.'], errors='coerce')

print(golf_data_02.head())
print(golf_data_02.info())
       PLAYER NAME  % MADE  ATTEMPTS   AVG.
0      Kenny Perry   72.73        22  286.4
1  Paul Stankowski   72.73        11  287.0
2      David Frost   71.43        21  269.5
3     Joel Edwards   70.59        17  277.6
4    Brian Bateman   68.00        25  288.6
<class 'pandas.core.frame.DataFrame'>
Int64Index: 188 entries, 0 to 187
Data columns (total 4 columns):
PLAYER NAME    188 non-null object
% MADE         188 non-null float64
ATTEMPTS       188 non-null int64
AVG.           188 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 7.3+ KB
None
#Repeat the merging and cleaning process for the 2017 data
golf_data_17 = pd.merge(putt_17_df, drive_17_df, on='PLAYER NAME')

#Select only required columns
golf_data_17 = golf_data_17.loc[:, ('PLAYER NAME', '% MADE', 'ATTEMPTS', 'AVG.')]
print(golf_data_17.head())
print(golf_data_17.info())
          PLAYER NAME % MADE ATTEMPTS   AVG.
0  Rafa Cabrera Bello  66.67       27  291.0
1   Xander Schauffele  60.00       30  306.3
2      Dustin Johnson  59.52       42  314.4
3         Luke Donald  58.33       24  278.5
4        Cody Gribble  58.00       50  297.2
<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 0 to 189
Data columns (total 4 columns):
PLAYER NAME    190 non-null object
% MADE         190 non-null object
ATTEMPTS       190 non-null object
AVG.           190 non-null object
dtypes: object(4)
memory usage: 7.4+ KB
None
#Deal with column data types and any missing values
golf_data_17['% MADE'] = pd.to_numeric(golf_data_17['% MADE'], errors='coerce')
golf_data_17['ATTEMPTS'] = pd.to_numeric(golf_data_17['ATTEMPTS'], errors='coerce')
golf_data_17['AVG.'] = pd.to_numeric(golf_data_17['AVG.'], errors='coerce')

print(golf_data_17.head())
print(golf_data_17.info())
          PLAYER NAME  % MADE  ATTEMPTS   AVG.
0  Rafa Cabrera Bello   66.67        27  291.0
1   Xander Schauffele   60.00        30  306.3
2      Dustin Johnson   59.52        42  314.4
3         Luke Donald   58.33        24  278.5
4        Cody Gribble   58.00        50  297.2
<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 0 to 189
Data columns (total 4 columns):
PLAYER NAME    190 non-null object
% MADE         190 non-null float64
ATTEMPTS       190 non-null int64
AVG.           190 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 7.4+ KB
None

Analysis

Initial EDA

Lets take a quick look at some descriptive stats from our DataFrames.

#Basic EDA
print(golf_data_02.describe())
print(golf_data_17.describe())
           % MADE    ATTEMPTS        AVG.
count  188.000000  188.000000  188.000000
mean    44.471223   18.946809  279.631915
std     11.680100    5.607228    7.279555
min     10.000000   10.000000  256.600000
25%     35.927500   15.000000  274.775000
50%     44.830000   19.000000  279.950000
75%     52.380000   22.000000  284.550000
max     72.730000   34.000000  297.000000
           % MADE    ATTEMPTS        AVG.
count  190.000000  190.000000  190.000000
mean    41.775632   43.847368  292.570000
std      7.994311   11.767720    9.366421
min     21.950000   17.000000  270.100000
25%     37.500000   36.000000  286.500000
50%     41.240000   44.000000  292.500000
75%     46.630000   52.750000  298.475000
max     66.670000   70.000000  316.700000

Driving - Note the mean average driving distance (AVG.) was 279.63 yards in 2002, but increases to 292.57 in 2017.

Putting - The mean percentage of putts made from 10 feet was 44.47% in 2002, but falls to 41.24% in 2017. This is odd, we should definitely look into this further.

ECDF’s

Lets look at the empirical cumulative distribution functions for both driving and putting in the 2 years. To do this I’ll use a function to help compute the ECDF.

#Import required modules
import numpy as np

#Define our ECDF function
def ecdf(data):
    """Compute ECDF for a 1D array of measurements"""
    #Compute number of data points:n
    n = len(data)
    #x-data for the ECDF:x
    x = np.sort(data)
    #y-data for the ECDF:y
    y = np.arange(1, n+1)/n
    return x, y
#Use function for driving stat
x_drive_02, y_drive_02 = ecdf(golf_data_02['AVG.'])
x_drive_17, y_drive_17 = ecdf(golf_data_17['AVG.'])

#Plot the ECDF's
import matplotlib.pyplot as plt
import seaborn as sns    

#Set Seaborn as default
sns.set()

#Generate plot
_ = plt.plot(x_drive_02, y_drive_02, marker = '.', linestyle = 'none')
_ = plt.plot(x_drive_17, y_drive_17, marker = '.', linestyle = 'none')

#Add a margin to the plot
plt.margins(0.02)

#Add labels
_ = plt.xlabel('Avg Drive (Yards)')
_ = plt.ylabel('ECDF')
_ = plt.title('ECDF of Average Driving Distance on the PGA Tour')
_ = plt.legend(('2002', '2017'), loc='lower right')

#Display the plot
plt.show()

png

The ECDF for driving makes it very clear, there has been a large tour-wide improvement in driving distance between 2002 and 2017, as we expected.

#Use function for putting stat
x_putt_02, y_putt_02 = ecdf(golf_data_02['% MADE'])
x_putt_17, y_putt_17 = ecdf(golf_data_17['% MADE'])

#Plot the ECDF's
_ = plt.plot(x_putt_02, y_putt_02, marker = '.', linestyle = 'none')
_ = plt.plot(x_putt_17, y_putt_17, marker = '.', linestyle = 'none')

#Add a margin to the plot
plt.margins(0.02)

#Add labels
_ = plt.xlabel('Putts Made from 10 Feet (%)')
_ = plt.ylabel('ECDF')
_ = plt.title('ECDF of Putts from 10 Feet on the PGA Tour')
_ = plt.legend(('2002', '2017'), loc='lower right')

#Display the plot
plt.show()

png

The ECDF for putting is far less clear. The variance in the 2002 data is larger, as is the mean. Lets take a closer look at putting.

Parameter Estimates

Here I’ll estimate the difference of the mean putts made % of the samples from 2002 and 2017. I’ll then report a 95% confidence interval of this difference of means. I’ll be using hacker statistics for this part of the analysis, defining some functions to help with the task.

#Bootstrap replicate function - to create 1 replicate
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data"""
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)

#Write a new function that uses bootstrap_replicate_1d() to create many bootstrap replicates, draw_bs_reps()
    
#Create many bootstrap replicates: draw_bs_reps()
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates"""
    #Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)
    
    #Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
        
    return bs_replicates
#Compute the difference of the sample means:mean_diff
mean_diff = np.mean(golf_data_17['% MADE']) - np.mean(golf_data_02['% MADE'])

#Get bootstrap replicates of means
bs_replicates_02 = draw_bs_reps(golf_data_02['% MADE'], np.mean, 10000)
bs_replicates_17 = draw_bs_reps(golf_data_17['% MADE'], np.mean, 10000)

#Compute samples of difference of means: bs_diff_replicates
bs_diff_replicates = bs_replicates_17 - bs_replicates_02

#Compute 95% confidence interval: conf_int
conf_int = np.percentile(bs_diff_replicates, [2.5, 97.5])

#Print results
print('difference of means =', mean_diff, '%')
print('95% confidence interval =', conf_int, '%')
difference of means = -2.6955918253079574 %
95% confidence interval = [-4.69856449 -0.72418063] %

This difference of means and 95% C.I. suggests that putting ability might have gotten worse since 2002! Is this effect down to random chance? ie, what is the probability that we would get the observed difference in mean putts made % if the means were the same? Lets check this using a hypothesis test.

Hypothesis Test

Null hypothesis - Players in 2017 are as good at putting as those in 2002. ie, the mean % of putts made from 10 feet for both groups are equal.

Alternative hypothesis - that players in 2017 are not as good at putting as those in 2002, ie, the mean % of putts made from 10 feet from the 2017 players is lower than that of the 2002 players.

#Compute mean of combined data set: combined_mean
combined_mean = np.mean(np.concatenate((golf_data_02['% MADE'], golf_data_17['% MADE'])))

#Shift the samples
putt_02_shifted = golf_data_02['% MADE'] - np.mean(golf_data_02['% MADE']) + combined_mean
putt_17_shifted = golf_data_17['% MADE'] - np.mean(golf_data_17['% MADE']) + combined_mean

#Get bootstrap replicates of the shifted data sets
bs_replicates_putt_02 = draw_bs_reps(putt_02_shifted, np.mean, 10000)
bs_replicates_putt_17 = draw_bs_reps(putt_17_shifted, np.mean, 10000)

#Compute replicates of difference of means: bs_diff_replicates_putt
bs_diff_replicates_putt = bs_replicates_putt_17 - bs_replicates_putt_02

#Compute and print the p-value
p = np.sum(bs_diff_replicates_putt <= mean_diff) / len(bs_diff_replicates_putt)
print('p =', p)
p = 0.0042

Conclusion

A p-value of 0.0042 suggests there is a statistically significant difference. I’m inclined to reject the null hypothesis and accept the alternative hypothesis, players in 2017 are not as good at putting as those in 2002.

Further thoughts

There’s no doubt todays players drive the ball further than their 2002 cohort, so why has one area of the game improved so markedly yet another equally important area of the game, putting, seemingly haven gotten worse? Are todays players actually worse at putting or are there other reasons that explain the difference? If so what are those differences?

Potential Explanations

  • Are pin positions put in more tricky places on greens nowadays compared to 2002? It would be understandable if the PGA Tour tried to offset for players hitting the ball further by making putting intentionally harder in an effort to maintain the overall difficulty level of a golf course.
  • Are todays players simply focusing so much on hitting the ball further that they are spending less time on improving their putting? If you’re in the gym pumping iron you cant be practicing your putting at the same time!
  • On a related theme, is there a biomechanical aspect? Are the bigger muscles in the arms of modern players that help their driving a detriment to accurate putting, an area of the game that relies far more on feel and touch?
  • Equipment. Are club manufacturers (inadvertently) using inferior materials/designs in the production of modern putters compared to 2002? This seems highly unlikely. Its more likely that the putter is a fairly basic club compared to a driver, with limited opportunity for a club manufacturer to make a real breakthrough in performance through a redesign.

If there is one takeaway for the average amateur golfer from this analysis its this, if your going to spend a lot of money on one golf club, make it a driver, not a putter. There’s no evidence here that putters have improved since 2002.

Further Analysis

How could this analysis be expanded to make its conclusion more robust (or debunked)? Well i’ve only taken a snapshot from two years, what if either 2002 or 2017 were outliers to the general putting trend? Repeating the analysis with data from 2003 and 2016 could prove enlightening here.

Leave a Comment