
Introduction

Just because the election for the 2017 crop of Hall of Famers finished only five months ago doesn't mean it's too early to start wondering which current major leaguers will be enshrined in Cooperstown someday. In fact, it's only five months until the 2018 ballot is released, so now is a fitting time to examine previous ballots. The two questions I want to answer by the end of this post are:

  1. Who got elected into the HoF and how similar/different were they from those who didn’t?
  2. Can we predict who will make it to the HoF?

Although there are four main positions a player can assume in a game of baseball (Infielder, Outfielder, Pitcher, and Catcher), the focus of this post will be on Infielders and Outfielders. This is one of the projects I did for Udacity’s Data Analyst Nanodegree.

Background information

The National Baseball Hall of Fame and Museum is located in Cooperstown, New York and was dedicated in 1939. A baseball player can be elected to the Hall of Fame if they meet the following criteria:

  1. The player must have competed in at least ten seasons;
  2. The player has been retired for at least five seasons;
  3. A screening committee must approve the player’s worthiness to be included on the ballot and most players who played regularly for ten or more years are deemed worthy;
  4. The player must not be on the ineligible list (that means that the player should not be banned from baseball);
  5. A player is considered elected if he receives at least 75% of the vote in the election; and
  6. A player stays on the ballot the following year if they receive at least 5% of the vote and can appear on ballots for a maximum of 10 years.

These criteria tell us what information we need to gather before answering our questions, namely how long each player competed, when they retired, whether they have been banned from the game, and so on. In the next part, we will find out where to obtain this information.
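To make these rules concrete before we dive into the data, here is a minimal sketch of an eligibility check. The helper and its arguments (seasons_played, years_since_retirement, is_banned) are purely illustrative; the actual filtering later in this post works directly on the Lahman columns.

# Hypothetical helper mirroring ballot rules 1, 2, 4, and 6 above
def is_ballot_eligible(seasons_played, years_since_retirement, is_banned):
    return (
        seasons_played >= 10                      # at least ten seasons played
        and 5 <= years_since_retirement < 5 + 10  # five-year wait, then up to ten years on the ballot
        and not is_banned                         # not on the ineligible list
    )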

Dataset

The 2017 version of Lahman’s Baseball Database contains complete batting and pitching statistics from 1871 through the 2016 season, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. The full database, as comma-separated files, can be downloaded from here. However, for our predictions, we only need the following .csv files:

  • Master.csv;
  • Batting.csv;
  • Fielding.csv;
  • Teams.csv;
  • AwardsPlayers.csv;
  • AllstarFull.csv;
  • Appearances.csv; and last but not least
  • HallOfFame.csv.

Unfortunately, it’s not possible to tell from this database whether a player has been banned from baseball, but we can always look that up online. Next, we will read in the data and clean it.

Data Cleaning and Pre-processing

# Import data to DataFrames
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read in the CSV files
master_df = pd.read_csv('Master.csv',usecols=['playerID','nameFirst','nameLast','bats','throws','finalGame'])
fielding_df = pd.read_csv('Fielding.csv',usecols=['playerID','A','E','DP'])
batting_df = pd.read_csv('Batting.csv', usecols = ['playerID', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI',\
                                                    'SB', 'CS', 'BB','SO', 'IBB', 'HBP', 'SH', 'SF'])
teams_df = pd.read_csv('Teams.csv')
awards_df = pd.read_csv('AwardsPlayers.csv', usecols=['playerID','awardID','yearID'])
allstar_df = pd.read_csv('AllstarFull.csv', usecols=['playerID','yearID'])
hof_df = pd.read_csv('HallOfFame.csv',usecols=['playerID','yearid','votedBy','needed_note','inducted','category'])
appearances_df = pd.read_csv('Appearances.csv')

In general, we are only interested in players elected by the BBWAA, but we should also include two players (Roberto Clemente and Lou Gehrig) who were elected via “Special Election”, because they clearly had Hall of Fame stats, but simply bypassed the process due to untimely circumstances.

Moreover, there were three occasions - in 1949, 1964, and 1967 - when the BBWAA conducted a special run-off election whereby the one player who received the most run-off votes would be elected to the HoF, so we should include players who got elected with a run-off ballot as well.

hof = hof_df[((hof_df['votedBy'] == 'BBWAA') | (hof_df['votedBy'] == 'Special Election')) | (hof_df['votedBy'] == 'Run Off')]
hof = hof[(hof['category'] == 'Player') & (hof['inducted'] == 'Y')]

# Drop these columns as no longer useful
hof = hof.drop(['category', 'inducted', 'needed_note', 'yearid'], axis = 1)

# Convert `votedBy` column to numeric
hof['votedBy'] = 1

# Give `votedBy` column a better name
hof.rename(columns = {'votedBy':'HoF'}, inplace = True)

Next we’ll gather information about each player’s performance, starting with batting statistics:

# Group by playerID
batting = batting_df.groupby('playerID', as_index = False).sum()
batting = batting.fillna(0)
batting.head()
playerID AB R H 2B 3B HR RBI SB CS BB SO IBB HBP SH SF
0 aardsda01 4 0 0 0 0 0 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0
1 aaronha01 12364 2174 3771 624 98 755 2297.0 240.0 73.0 1402 1383.0 293.0 32.0 21.0 121.0
2 aaronto01 944 102 216 42 6 13 94.0 9.0 8.0 86 145.0 3.0 0.0 9.0 6.0
3 aasedo01 5 0 0 0 0 0 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0
4 abadan01 21 1 2 0 0 0 0.0 0.0 1.0 4 5.0 0.0 0.0 0.0 0.0

The columns have somewhat cryptic names, but you can always consult the database’s README to see what they mean.
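For quick reference, the standard batting abbreviations used above expand as follows (just a plain dictionary, handy if you ever want to relabel the columns):

# Standard batting abbreviations in the Lahman Batting table
batting_glossary = {
    'AB': 'At Bats', 'R': 'Runs', 'H': 'Hits', '2B': 'Doubles', '3B': 'Triples',
    'HR': 'Home Runs', 'RBI': 'Runs Batted In', 'SB': 'Stolen Bases',
    'CS': 'Caught Stealing', 'BB': 'Walks', 'SO': 'Strikeouts',
    'IBB': 'Intentional Walks', 'HBP': 'Hit By Pitch',
    'SH': 'Sacrifice Hits', 'SF': 'Sacrifice Flies',
}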

Next up is fielding statistics:

# Group by playerID
fielding = fielding_df.groupby('playerID', as_index = False).sum()
fielding = fielding.fillna(0)
fielding.head()
playerID A E DP
0 aardsda01 29.0 3.0 2.0
1 aaronha01 429.0 144.0 218.0
2 aaronto01 113.0 22.0 124.0
3 aasedo01 135.0 13.0 10.0
4 abadan01 1.0 1.0 3.0

Next, we will look at the allstar_df DataFrame. It contains information on which players appeared in All-Star games. The All-Star game is an exhibition game played each year at mid-season. Major League Baseball consists of two leagues, the American League and the National League, and a limited roster of each league’s best players is selected to represent it in the All-Star game. Appearing in an All-Star game is therefore quite an achievement, and we want to know how many All-Star games each player has participated in.

allstar = allstar_df.groupby('playerID').count().reset_index()

It might be a good idea to rename the column ‘yearID’ to something else to avoid confusion.

allstar.rename(columns = {'yearID':'years_allstar'}, inplace = True)

Next up is awards_df; let’s see how many different awards there are:

awards_df['awardID'].unique()
array(['Pitching Triple Crown', 'Triple Crown',
       'Baseball Magazine All-Star', 'Most Valuable Player',
       'TSN All-Star', 'TSN Guide MVP',
       'TSN Major League Player of the Year', 'TSN Pitcher of the Year',
       'TSN Player of the Year', 'Rookie of the Year', 'Babe Ruth Award',
       'Lou Gehrig Memorial Award', 'World Series MVP', 'Cy Young Award',
       'Gold Glove', 'TSN Fireman of the Year', 'All-Star Game MVP',
       'Hutch Award', 'Roberto Clemente Award', 'Rolaids Relief Man Award',
       'NLCS MVP', 'ALCS MVP', 'Silver Slugger', 'Branch Rickey Award',
       'Hank Aaron Award', 'TSN Reliever of the Year',
       'Comeback Player of the Year', 'Outstanding DH Award',
       'Reliever of the Year Award'], dtype=object)

That’s a lot of awards, but not all of them are correlated with being voted into the HoF. In fact, let’s just focus on the more important ones, namely the Most Valuable Player, Rookie of the Year, Gold Glove, Silver Slugger, and World Series MVP awards. (The Cy Young Award, though a major one, goes only to pitchers and is thus excluded.) Now we need to count how many of each award every player managed to win.

# Keeping only important awards
awards_list = ['Most Valuable Player','Rookie of the Year','Gold Glove','Silver Slugger','World Series MVP']
awards = awards_df[awards_df['awardID'].isin(awards_list)]

# Pivot the data frame to count the number of different awards
awards = awards.pivot_table(index = 'playerID', columns = 'awardID', aggfunc='count')

# Flatten the pivot table
awards = pd.DataFrame(awards.to_records())

Notice that we have inadvertently introduced a decent number of NA values that are actually zeros when making the pivot table, so we’ll have to replace them accordingly. We have also changed the column names as a result of flattening our pivot table. The simplest way to fix this is by string match-and-replace.

# Fix column names after flattening
awards.columns = [col.replace("('yearID', '", "").replace("')", "") \
                     for col in awards.columns]

awards = awards.fillna(0)

At this point we have gathered a decent amount of information on player statistics, so it’s a good idea to compile it all together:

player_stats = batting.merge(fielding, on = 'playerID', how ='left')
player_stats = player_stats.merge(allstar, on = 'playerID', how ='left')
player_stats = player_stats.merge(awards, on = 'playerID', how ='left')
player_stats = player_stats.merge(hof, on = 'playerID', how ='left')
player_stats = player_stats.fillna(0)
player_stats.head()
playerID AB R H 2B 3B HR RBI SB CS ... A E DP years_allstar Gold Glove Most Valuable Player Rookie of the Year Silver Slugger World Series MVP HoF
0 aardsda01 4 0 0 0 0 0 0.0 0.0 0.0 ... 29.0 3.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 aaronha01 12364 2174 3771 624 98 755 2297.0 240.0 73.0 ... 429.0 144.0 218.0 25.0 3.0 1.0 0.0 0.0 0.0 1.0
2 aaronto01 944 102 216 42 6 13 94.0 9.0 8.0 ... 113.0 22.0 124.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 aasedo01 5 0 0 0 0 0 0.0 0.0 0.0 ... 135.0 13.0 10.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
4 abadan01 21 1 2 0 0 0 0.0 0.0 1.0 ... 1.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 26 columns

We also need to know when each player played their last game; this can be found in the master_df DataFrame:

master_df.head()
playerID nameFirst nameLast bats throws finalGame
0 aardsda01 David Aardsma R R 2015-08-23
1 aaronha01 Hank Aaron R R 1976-10-03
2 aaronto01 Tommie Aaron R R 1971-09-26
3 aasedo01 Don Aase R R 1990-10-03
4 abadan01 Andy Abad L L 2006-04-13

The bats and throws columns hold categorical values, with R (L) indicating that a player bats or throws right-handed (left-handed), so it’s simpler to represent this information with 0-1 integers.

# Create a function to convert the `bats` and `throws` columns to numeric;
# anything other than "R" (left-handed, switch, or missing) becomes 0
def bats_throws(col):
    if col == "R":
        return 1
    else:
        return 0

master_df['bats_R'] = master_df['bats'].apply(bats_throws)
master_df['throws_R'] = master_df['throws'].apply(bats_throws)

# Drop the old columns
master_df = master_df.drop(['bats','throws'], axis = 1)

Moreover, the finalGame column is currently a string, so we’ll need to convert it to a datetime object and extract the year, since we don’t need details as granular as month and day.

from datetime import datetime

def getYear(datestring):
    return datetime.strptime(datestring, '%Y-%m-%d').year

# Drop rows that have NA values
master = master_df.dropna(subset = ['finalGame'])

# Get years from strings
master = master.join(master['finalGame'].map(getYear), lsuffix='_')
master = master.drop('finalGame_', axis = 1)
master.head()
playerID nameFirst nameLast bats_R throws_R finalGame
0 aardsda01 David Aardsma 1 1 2015
1 aaronha01 Hank Aaron 1 1 1976
2 aaronto01 Tommie Aaron 1 1 1971
3 aasedo01 Don Aase 1 1 1990
4 abadan01 Andy Abad 0 0 2006

Next up is the appearances_df DataFrame. This contains information on how many appearances each player had at each position for each year and will tell us how long a player has competed in the game. Let’s take a look at the first few rows of the DataFrame to see what we have.

appearances_df.head()
yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c ... G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
0 1871 TRO NaN abercda01 1 NaN 1 1 0 0 ... 0 0 1 0 0 0 0 NaN NaN NaN
1 1871 RC1 NaN addybo01 25 NaN 25 25 0 0 ... 22 0 3 0 0 0 0 NaN NaN NaN
2 1871 CL1 NaN allisar01 29 NaN 29 29 0 0 ... 2 0 0 0 29 0 29 NaN NaN NaN
3 1871 WS3 NaN allisdo01 27 NaN 27 27 0 27 ... 0 0 0 0 0 0 0 NaN NaN NaN
4 1871 RC1 NaN ansonca01 25 NaN 25 25 0 5 ... 2 20 0 1 0 0 1 NaN NaN NaN

5 rows × 21 columns

# Drop unnecessary columns
appearances_df = appearances_df.drop(['G_ph', 'G_pr'], axis = 1)
appearances = appearances_df.groupby(['playerID'], as_index = False).sum()
appearances = appearances.fillna(0)
appearances_df.columns
Index(['yearID', 'teamID', 'lgID', 'playerID', 'G_all', 'GS', 'G_batting',
       'G_defense', 'G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss', 'G_lf',
       'G_cf', 'G_rf', 'G_of', 'G_dh'],
      dtype='object')

As mentioned earlier, this post focuses on infielders and outfielders only, so we need to pick players who played mainly at these positions. However, some players, especially in the earlier years of MLB, appeared at several different positions, so how are we going to filter out pitchers and catchers? There are no hard and fast rules, but we can convert the appearance counts into percentages of total games played and exclude anyone who played 10% or more of their games at either of those positions.

positions = appearances_df.columns[5:]

# Loop through the list and divide each column by the players total games played
for col in positions:
    column = col + '_percent'
    appearances[column] = appearances[col] / appearances['G_all'] 
# Eliminate players who played 10% or more of their games as Pitchers or Catchers
appearances = appearances[(appearances['G_p_percent'] < 0.1) & (appearances['G_c_percent'] < 0.1)]

# Drop columns that are no longer useful
appearances = appearances.drop(['G_p_percent','G_c_percent', 'yearID'], axis = 1)
appearances = appearances.drop(positions.tolist(), axis = 1)
appearances.head()
playerID G_all GS_percent G_batting_percent G_defense_percent G_1b_percent G_2b_percent G_3b_percent G_ss_percent G_lf_percent G_cf_percent G_rf_percent G_of_percent G_dh_percent
1 aaronha01 3298 0.962098 1.0 0.905094 0.063675 0.013038 0.002122 0.000000 0.095512 0.093390 0.659187 0.836871 0.060946
2 aaronto01 437 0.471396 1.0 0.791762 0.530892 0.016018 0.022883 0.000000 0.308924 0.002288 0.004577 0.313501 0.000000
4 abadan01 15 0.266667 1.0 0.600000 0.533333 0.000000 0.000000 0.000000 0.000000 0.000000 0.066667 0.066667 0.000000
6 abadijo01 12 0.000000 1.0 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 abbated01 855 0.000000 1.0 1.000000 0.000000 0.490058 0.023392 0.453801 0.000000 0.002339 0.001170 0.003509 0.000000

The data look fine but… how are we going to deal with the years? A raw year value like 1950 has no meaningful numeric relationship with the rest of the features that a model could infer, so is it OK to drop the years like we just did?

No, it’s not. As MLB progressed, different eras emerged in which the number of runs per game increased or decreased significantly. This means that when a player played has a large influence on that player’s career statistics. HoF voters take this into account when voting players in, so our model needs that information too. To get information such as runs scored and games played over the years, we turn to the teams_df DataFrame.

We only need the columns required to calculate runs per game per year; the rest we can safely ignore. Also, looking back at the history of MLB, the rules of baseball had not settled into place before 1900, and the game was a totally different beast back then, so it makes sense to remove those rows from the data.

# Runs and games per year
runs_games = teams_df.groupby('yearID').sum()[['G','R']]
runs_games['RPG'] = runs_games['R'] / runs_games['G']
runs_games = runs_games.loc[1901:]
runs_games.head()
G R RPG
yearID
1901 2220 11068 4.985586
1902 2230 9883 4.431839
1903 2228 9892 4.439856
1904 2498 9307 3.725781
1905 2474 9640 3.896524
# Plot number of runs per game over time
runs_games['RPG'].plot()
plt.title('MLB Yearly Runs per Game')
plt.xlabel('Year')
plt.ylabel('MLB Runs per Game')
plt.axvspan(1901, 1920, color='red', alpha=0.4)
plt.axvspan(1942, 1945, color='red', alpha=0.4)
plt.axvspan(1963, 1976, color='red', alpha=0.4)
plt.axvspan(1993, 2009, color='red', alpha=0.4)
[Figure: MLB yearly runs per game since 1901, with the 1901-1920, 1942-1945, 1963-1976, and 1993-2009 spans shaded in red]

There were indeed some periods when the number of runs per game was much higher than in others. For example, the years from 1921 to 1941 saw an unprecedentedly high number of runs scored per game and are often referred to as the Lively Ball Era. Another sharp rise in runs per game occurred from the early ’90s to the late 2000s, the Steroid Era. To capture this information, we need to convert the years in which each player appeared into eras and turn those eras into new features (columns). We can reuse part of the code we wrote for awards_df to accomplish this.

yr_appearances = appearances_df.copy()[['yearID','playerID','teamID']]

# Remove players in or before 1900
yr_appearances = yr_appearances[yr_appearances['yearID'] > 1900]

def toEra(year):
    if int(year) < 1921:
        return '1901-1920'
    elif int(year) < 1942:
        return '1921-1941'
    elif int(year) < 1946:
        return '1942-1945'
    elif int(year) < 1963:
        return '1946-1962'
    elif int(year) < 1977:
        return '1963-1976'
    elif int(year) < 1993:
        return '1977-1992'
    elif int(year) < 2010:
        return '1993-2009'
    else:
        return 'post2009'

yr_appearances['yearID'] = yr_appearances['yearID'].map(toEra)

# Pivot the data frame to count the number of seasons each player appeared in each era
yr_appearances = yr_appearances.pivot_table(index='playerID', columns = 'yearID', aggfunc='count')

# Flatten the pivot table
yr_appearances = pd.DataFrame(yr_appearances.to_records())

# Fix column names after flattening
yr_appearances.columns = [col.replace("('teamID', '", "").replace("')", "") \
                     for col in yr_appearances.columns]

yr_appearances = yr_appearances.fillna(0)

# Number of years playing
yr_appearances['years_playing'] = yr_appearances.sum(axis = 1)

yr_appearances = yr_appearances.merge(appearances, on = 'playerID', how = 'inner')
yr_appearances.head()
playerID 1901-1920 1921-1941 1942-1945 1946-1962 1963-1976 1977-1992 1993-2009 post2009 years_playing ... G_defense_percent G_1b_percent G_2b_percent G_3b_percent G_ss_percent G_lf_percent G_cf_percent G_rf_percent G_of_percent G_dh_percent
0 aaronha01 0.0 0.0 0.0 9.0 14.0 0.0 0.0 0.0 23.0 ... 0.905094 0.063675 0.013038 0.002122 0.000000 0.095512 0.093390 0.659187 0.836871 0.060946
1 aaronto01 0.0 0.0 0.0 1.0 6.0 0.0 0.0 0.0 7.0 ... 0.791762 0.530892 0.016018 0.022883 0.000000 0.308924 0.002288 0.004577 0.313501 0.000000
2 abadan01 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 3.0 ... 0.600000 0.533333 0.000000 0.000000 0.000000 0.000000 0.000000 0.066667 0.066667 0.000000
3 abbated01 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 ... 1.000000 0.000000 0.490058 0.023392 0.453801 0.000000 0.002339 0.001170 0.003509 0.000000
4 abbotje01 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 5.0 ... 0.793991 0.000000 0.000000 0.000000 0.000000 0.270386 0.347639 0.236052 0.793991 0.051502

5 rows × 23 columns

Now that we have gathered pretty much all necessary information, it’s time for a final merge. It’s likely that new NA values will be created as a result of merging, so we need to check if there are any of them as well.

df = master.merge(player_stats, on = 'playerID', how ='left')
df = df.merge(yr_appearances, on = 'playerID', how ='inner')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7237 entries, 0 to 7236
Data columns (total 53 columns):
playerID                7237 non-null object
nameFirst               7235 non-null object
nameLast                7237 non-null object
bats_R                  7237 non-null int64
throws_R                7237 non-null int64
finalGame               7237 non-null int64
AB                      7237 non-null float64
R                       7237 non-null float64
H                       7237 non-null float64
2B                      7237 non-null float64
3B                      7237 non-null float64
HR                      7237 non-null float64
RBI                     7237 non-null float64
SB                      7237 non-null float64
CS                      7237 non-null float64
BB                      7237 non-null float64
SO                      7237 non-null float64
IBB                     7237 non-null float64
HBP                     7237 non-null float64
SH                      7237 non-null float64
SF                      7237 non-null float64
A                       7237 non-null float64
E                       7237 non-null float64
DP                      7237 non-null float64
years_allstar           7237 non-null float64
Gold Glove              7237 non-null float64
Most Valuable Player    7237 non-null float64
Rookie of the Year      7237 non-null float64
Silver Slugger          7237 non-null float64
World Series MVP        7237 non-null float64
HoF                     7237 non-null float64
1901-1920               7237 non-null float64
1921-1941               7237 non-null float64
1942-1945               7237 non-null float64
1946-1962               7237 non-null float64
1963-1976               7237 non-null float64
1977-1992               7237 non-null float64
1993-2009               7237 non-null float64
post2009                7237 non-null float64
years_playing           7237 non-null float64
G_all                   7237 non-null int64
GS_percent              7237 non-null float64
G_batting_percent       7237 non-null float64
G_defense_percent       7237 non-null float64
G_1b_percent            7237 non-null float64
G_2b_percent            7237 non-null float64
G_3b_percent            7237 non-null float64
G_ss_percent            7237 non-null float64
G_lf_percent            7237 non-null float64
G_cf_percent            7237 non-null float64
G_rf_percent            7237 non-null float64
G_of_percent            7237 non-null float64
G_dh_percent            7237 non-null float64
dtypes: float64(46), int64(4), object(3)
memory usage: 3.0+ MB

The only column that has NA values is nameFirst, and since there are only two of them, let’s not worry about those. We have finally consolidated everything we need to know about the players into a single DataFrame. In the next step, we are going to draw some insights from the data by adding new features.

df.head()
playerID nameFirst nameLast bats_R throws_R finalGame AB R H 2B ... G_defense_percent G_1b_percent G_2b_percent G_3b_percent G_ss_percent G_lf_percent G_cf_percent G_rf_percent G_of_percent G_dh_percent
0 aaronha01 Hank Aaron 1 1 1976 12364.0 2174.0 3771.0 624.0 ... 0.905094 0.063675 0.013038 0.002122 0.000000 0.095512 0.093390 0.659187 0.836871 0.060946
1 aaronto01 Tommie Aaron 1 1 1971 944.0 102.0 216.0 42.0 ... 0.791762 0.530892 0.016018 0.022883 0.000000 0.308924 0.002288 0.004577 0.313501 0.000000
2 abadan01 Andy Abad 0 0 2006 21.0 1.0 2.0 0.0 ... 0.600000 0.533333 0.000000 0.000000 0.000000 0.000000 0.000000 0.066667 0.066667 0.000000
3 abbated01 Ed Abbaticchio 1 1 1910 3044.0 355.0 772.0 99.0 ... 1.000000 0.000000 0.490058 0.023392 0.453801 0.000000 0.002339 0.001170 0.003509 0.000000
4 abbotje01 Jeff Abbott 1 0 2001 596.0 82.0 157.0 33.0 ... 0.793991 0.000000 0.000000 0.000000 0.000000 0.270386 0.347639 0.236052 0.793991 0.051502

5 rows × 53 columns

Feature Engineering

We’ll start by adding important baseball statistics such as batting average, on-base percentage, slugging percentage, and on-base plus slugging percentage, using the following formulas:

  • Batting Ave. = Hits / At Bats
  • Plate Appearances = At Bats + Walks + Sacrifice Flies + Sacrifice Hits + Hit by Pitch
  • On-base = (Hits + Walks + Hit by Pitch) / Plate Appearances
  • Slugging = ((Home Runs x 4) + (Triples x 3) + (Doubles x 2) + Singles) / At Bats
  • On-Base plus Slugging = On-base + Slugging

Since we are computing a lot of ratios, NA values may appear (for example, from dividing by zero at bats), and we’ll have to remove them:

# Create Batting Average (`AVE`) column
df['AVE'] = df['H'] / df['AB']

# Create On Base Percent (`OBP`) column
plate_appearances = (df['AB'] + df['BB'] + df['SF'] + df['SH'] + df['HBP'])
df['OBP'] = (df['H'] + df['BB'] + df['HBP']) / plate_appearances

# Create Slugging Percent (`Slug_Percent`) column
single = ((df['H'] - df['2B']) - df['3B']) - df['HR']
df['Slug_Percent'] = ((df['HR'] * 4) + (df['3B'] * 3) + (df['2B'] * 2) + single) / df['AB']

# Create On Base plus Slugging Percent (`OPS`) column
df['OPS'] = df['OBP'] + df['Slug_Percent']

# Drop rows with NA values introduced by the ratios above
df = df.dropna()
print(df.isnull().sum(axis=0).tolist())
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
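As a quick sanity check (assuming the DataFrame was built exactly as above), Hank Aaron’s row should reproduce his familiar career line of roughly a .305 batting average and .555 slugging percentage:

# Sanity check: Hank Aaron's career rate stats (approx. .305 AVE, .555 SLG)
print(df.loc[df['playerID'] == 'aaronha01', ['AVE', 'OBP', 'Slug_Percent', 'OPS']])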

Yass! Now we are free of the NA plague. Before we move on to create some plots, let’s try to identify the outliers in our data. They are players who boasted HoF-worthy stats but were passed over by HoF voters due to match-fixing scandals and performance-enhancing drug (PED) allegations. To find out who they were, we need to read up on the history of MLB; the articles on banned players and PED allegations listed in the References section can help us in that regard.

Once we have done our homework, it’s time to remove these names from our data.

players = ['jacksjo01', 'rosepe01', 'giambja01', 'sheffga01', 'braunry02', 'bondsba01', \
           'palmera01', 'mcgwima01', 'clemero02', 'sosasa01', 'rodrial01']

df = df[~df['playerID'].isin(players)]

Next, we will plot out the distributions for certain statistics such as Hits, Home Runs, Years Playing, and Years Featured in All Star Game for Hall of Fame players to see if there are any trends among them.

# Filter players who are in HoF
df_hof = df[df['HoF'] == 1]

print(len(df_hof))

sns.set(style="white", palette="muted", color_codes=True)

# Initialize the figure and add subplots
fig = plt.figure(figsize=(16, 14))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

# Create distribution plots for Hits, Home Runs, Years Played and All Star Games
sns.distplot(df_hof['H'], ax = ax1, kde = True, axlabel = False, color = 'r')
ax1.set_title('Distribution of Hits')
sns.distplot(df_hof['HR'], ax = ax2, kde = True, axlabel = False, color = 'r')
ax2.set_title('Distribution of Home Runs')
sns.distplot(df_hof['years_playing'], ax = ax3, kde = False, axlabel = False, color = 'r')
ax3.set_title('Distribution of Years Playing')
ax3.set_ylabel('HoF Careers')
sns.distplot(df_hof['years_allstar'], ax = ax4, kde = False, axlabel = False, color = 'r')
ax4.set_title('Distribution of Years Featured in All Star Game')
72

[Figure: distributions of Hits, Home Runs, Years Playing, and All Star Game appearances for Hall of Fame players]

We have 72 Hall of Famers in our data and they all boast admirable statistics. A few points to note:

  • A high number of hits seems to be favorable: most HoF players racked up around 3000 career hits.
  • Home runs are not so important: the majority of inductees didn’t hit more than 200 home runs in their career.
  • With experience come votes: players who competed in more than 20 seasons make up a large portion of Hall of Famers.
  • All-Star Game appearances don’t carry much weight: most inducted players participated in fewer than 10 of them.

Now let’s see how they fare against non-HoF players. To ensure we are comparing apples to apples, let’s exclude non-HoF players with fewer than 10 years of experience.

# Filter `df` for players with 10 or more years of experience
df_10 = df[(df['years_playing'] >= 10) & (df['HoF'] == 0)]

print(len(df_10))

# Initialize the figure and add subplots
fig = plt.figure(figsize=(16, 14))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

# Create distribution plots for Hits, Home Runs, Years Played and All Star Games
sns.distplot(df_10['H'], ax = ax1, kde = True, axlabel = False, bins = 12)
ax1.set_title('Distribution of Hits')
sns.distplot(df_10['HR'], ax = ax2, kde = True, axlabel = False, bins = 8)
ax2.set_title('Distribution of Home Runs')
sns.distplot(df_10['years_playing'], ax = ax3, kde = False, axlabel = False, bins = 10)
ax3.set_title('Distribution of Years Playing')
ax3.set_ylabel('HoF Careers')
sns.distplot(df_10['years_allstar'], ax = ax4, kde = False, axlabel = False, bins = 8)
ax4.set_title('Distribution of Years Featured in All Star Game')
1665

[Figure: distributions of Hits, Home Runs, Years Playing, and All Star Game appearances for non-HoF players with at least 10 years of experience]

There are 1665 such non-HoF players in our data, and it’s fair to say that most of them are less experienced players, which partly explains their lackluster statistics compared to the veterans who have made it to Cooperstown.

Next we want to see how Hits vs. Batting Average and Home Runs vs. Batting Average differ between HoF and non-HoF players.

# Initialize the figure and add subplots
fig = plt.figure(figsize=(14, 7))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# Create Scatter plots for Hits vs. Average and Home Runs vs. Average
ax1.scatter(df_hof['H'], df_hof['AVE'], c='r', label='HoF Player')
ax1.scatter(df_10['H'], df_10['AVE'], c='b', label='Non HoF Player')
ax1.set_title('Career Hits vs. Career Batting Average')
ax1.set_xlabel('Career Hits')
ax1.set_ylabel('Career Average')
ax2.scatter(df_hof['HR'], df_hof['AVE'], c='r', label='HoF Player')
ax2.scatter(df_10['HR'], df_10['AVE'], c='b', label='Non HoF Player')
ax2.set_title('Career Home Runs vs. Career Batting Average')
ax2.set_xlabel('Career Home Runs')
ax2.legend(loc='lower right', scatterpoints=1)
[Figure: scatter plots of career hits vs. career batting average (left) and career home runs vs. career batting average (right), with HoF players in red and non-HoF players in blue]

Suffice it to say, it’s not surprising to see that HoF players are high achievers compared to their non-HoF peers. There seems to be a positive correlation between career hits and career batting average regardless of HoF status; the relationship is not as strong between home runs and batting average.

With this we have answered the first question posed at the beginning of this post. To answer the second one, we need to build a machine learning model to predict whether an eligible player will ever be elected to the HoF.

Preparing Training and Test Data

Since a player must wait 5 years after retiring to become eligible for the HoF ballot, and can remain on the ballot for as many as 10 years, there are still eligible players among those who played their final season within the last 15 years. For instance, those who played their last games in 2003 will be eligible for consideration through 2018, and so on.

# Filter `df` for players who retired more than 15 years ago
df_hitters = df[df['finalGame'] < 2002]

# Filter `df` for players who retired less than 15 years ago and for currently active players
df_eligible = df[df['finalGame'] >= 2002]

# Players who retired less than 15 years ago but more than 5 years ago and were inducted
early_inductees = df_eligible[df_eligible['HoF'] == 1]

# Remove these players from `df_eligible`
df_eligible = df_eligible[df_eligible['HoF'] != 1]

# Add these players to `df_hitters`
df_hitters = df_hitters.append(early_inductees)

df_hitters is what we will train and test our model on, since it contains the statistics of past Hall of Famers (and their non-inducted contemporaries), while df_eligible is the “new” data consisting of eligible players that we would like to make predictions for.

print(len(df_hitters))

# Separate `df_hitters` into target (response) and features (predictors)
target = df_hitters['HoF']
features = df_hitters.drop(['playerID', 'nameFirst', 'nameLast', 'HoF'], axis=1)
5487

Logistic Regression

The first model we’ll try is logistic regression, and we’ll use the k-fold cross-validation technique.

from sklearn.cross_validation import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression

# Create Logistic Regression model
lr = LogisticRegression(class_weight='balanced')

# Create an instance of the KFold class
kf = KFold(features.shape[0], random_state=1)

# Create predictions using cross validation
predictions_lr = cross_val_predict(lr, features, target, cv=kf)
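As an aside, the sklearn.cross_validation module used above has since been removed from scikit-learn. With a recent version, a roughly equivalent setup (a sketch, not the code used for the results below) would be:

# Roughly equivalent setup with modern scikit-learn (sklearn.model_selection)
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced', max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions_lr = cross_val_predict(lr, features, target, cv=kf)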

To determine accuracy, we need to compare our predictions to the target. The error metrics we’ll be using are counts and rates of True Positive (TP), False Positive (FP), and False Negative (FN), whose definitions are given below:

  • True Positive: The player was predicted to be in the HoF and they are a HoF member.
  • False Positive: The player was predicted to be in the HoF but they are not a HoF member.
  • False Negative: The player was predicted to be not in the HoF but they are indeed a HoF member.
  • True Negative: The player was predicted not to be in the HoF and they are not a HoF member.

From here, we can compute the rates as follows:

  • True Positive rate: # True Positive / (# True Positive + # False Negative)
  • False Negative rate: # False Negative / (# False Negative + # True Positive)
  • False Positive rate: # False Positive / (# False Positive + # True Negative)
# Import NumPy as np
import numpy as np

# Convert predictions and target to NumPy arrays
np_predictions_lr = np.asarray(predictions_lr)
np_target = target.as_matrix()
# Create a function to report TP, FP, and FN rates
def predAccuracy(predictions, target):
    
    # Determine True Positive count
    tp_filter = (predictions == 1) & (target == 1)
    tp = len(predictions[tp_filter])

    # Determine False Negative count
    fn_filter = (predictions == 0) & (target == 1)
    fn = len(predictions[fn_filter])

    # Determine False Positive count
    fp_filter = (predictions == 1) & (target == 0)
    fp = len(predictions[fp_filter])

    # Determine True Negative count
    tn_filter = (predictions == 0) & (target == 0)
    tn = len(predictions[tn_filter])

    # Determine True Positive rate
    tpr = tp / (tp + fn)

    # Determine False Negative rate
    fnr = fn / (fn + tp)

    # Determine False Positive rate
    fpr = fp / (fp + tn)

    # Print each count
    print("True Positive Count: {0}".format(tp))
    print("False Negative Count: {0}".format(fn))
    print("False Positive Count: {0}".format(fp))

    # Print each rate
    print("True Positive Rate: {0:6.4f}".format(tpr))
    print("False Negative Rate: {0:6.4f}".format(fnr))
    print("False Positive Rate: {0:6.4f}".format(fpr))
# Accuracy rates of logistic regression model
predAccuracy(np_predictions_lr, np_target)
True Positive Count: 60
False Negative Count: 12
False Positive Count: 35
True Positive Rate: 0.8333
False Negative Rate: 0.1667
False Positive Rate: 0.0065
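Incidentally, the same counts can be read straight off a confusion matrix; a minimal equivalent using scikit-learn’s built-in helper (with the arrays defined above) is:

# Equivalent counts via a confusion matrix: rows are actual, columns are predicted
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(np_target, np_predictions_lr).ravel()
print(tp, fp, fn)  # should match the counts reported above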

Random Forest

What we’re trying to answer is a classic example of a classification problem, and it would be a crime not to mention the random forest algorithm at some point. In the following we’ll see how it stacks up against the logistic regression model.

# Import RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Class weights: weighting errors on the non-HoF class (0) 100 times more heavily
# discourages false positives, i.e. predicting the HoF for a non-HoF player
penalty = {
    0: 100,
    1: 1
}

# Create Random Forest model
rf = RandomForestClassifier(random_state=1,n_estimators=12, max_depth=11, min_samples_leaf=1, class_weight=penalty)

# Create predictions using cross validation
predictions_rf = cross_val_predict(rf, features, target, cv=kf)

# Convert predictions to NumPy array
np_predictions_rf = np.asarray(predictions_rf)
# Accuracy rates of random forest model
predAccuracy(np_predictions_rf, np_target)
True Positive Count: 51
False Negative Count: 21
False Positive Count: 8
True Positive Rate: 0.7083
False Negative Rate: 0.3333
False Positive Rate: 0.0015

Although the random forest catches fewer Hall of Famers, correctly predicting only 51 of the 72 (21 false negatives versus 12 for logistic regression), it produces far fewer false positives (8 versus 35). Hence it will be our model of choice for making predictions.
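One way to summarize that trade-off with a single number, using the counts reported above, is precision: the share of predicted Hall of Famers who really are in the Hall.

# Precision = TP / (TP + FP), computed from the counts reported above
precision_lr = 60 / (60 + 35)   # logistic regression, ~0.63
precision_rf = 51 / (51 + 8)    # random forest, ~0.86
print(precision_lr, precision_rf)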

Making predictions

We’ll use the trained and tested random forest model to estimate, for each player in df_eligible, the probability of being voted into the HoF, and then print out the 50 players with the highest chances. This will also answer our second question.

# Create a new features DataFrame
new_features = df_eligible.drop(['playerID', 'nameFirst', 'nameLast', 'HoF'], axis=1)

# Fit the Random Forest model
rf.fit(features, target)

# Estimate probabilities of Hall of Fame induction
probabilities = rf.predict_proba(new_features)

# Convert predictions to a DataFrame
hof_predictions = pd.DataFrame(probabilities[:,1])

# Sort the DataFrame (descending)
hof_predictions = hof_predictions.sort_values(0, ascending=False)
hof_predictions.rename(columns = {0:'prob'}, inplace = True)

# Merge the predictions with the eligible players' data
df_eligible.index = range(len(df_eligible))
hof_predictions = hof_predictions.join(df_eligible, how = 'left')
hof_predictions.index = range(len(hof_predictions))
hof_predictions.head(50)
prob playerID nameFirst nameLast bats_R throws_R finalGame AB R H ... G_ss_percent G_lf_percent G_cf_percent G_rf_percent G_of_percent G_dh_percent AVE OBP Slug_Percent OPS
0 1.000000 ortizda01 David Ortiz 0 0 2016 8640.0 1419.0 2472.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.842608 0.286111 0.379447 0.551505 0.930952
1 1.000000 ramirma02 Manny Ramirez 1 1 2011 8244.0 1544.0 2574.0 ... 0.000000 0.450478 0.000000 0.392702 0.841877 0.144222 0.312227 0.410477 0.585395 0.995872
2 1.000000 cabremi01 Miguel Cabrera 1 1 2016 7853.0 1321.0 2519.0 ... 0.000000 0.118321 0.000000 0.047710 0.165553 0.037214 0.320769 0.398511 0.562078 0.960589
3 1.000000 pujolal01 Albert Pujols 1 1 2016 9138.0 1670.0 2825.0 ... 0.000412 0.110882 0.000000 0.016488 0.127370 0.139736 0.309149 0.392248 0.572554 0.964802
4 1.000000 jeterde01 Derek Jeter 1 1 2014 11195.0 1923.0 3465.0 ... 0.973426 0.000000 0.000000 0.000000 0.000000 0.026574 0.309513 0.374306 0.439571 0.813877
5 1.000000 jonesch06 Chipper Jones 0 1 2012 8984.0 1619.0 2726.0 ... 0.019608 0.142457 0.000000 0.003601 0.145658 0.011204 0.303428 0.400980 0.529274 0.930254
6 1.000000 vizquom01 Omar Vizquel 0 1 2012 10586.0 1445.0 2877.0 ... 0.912736 0.000337 0.000000 0.000337 0.000674 0.002358 0.271774 0.329143 0.352069 0.681212
7 1.000000 heltoto01 Todd Helton 0 0 2013 7962.0 1401.0 2519.0 ... 0.000000 0.005785 0.000000 0.000890 0.006676 0.000890 0.316378 0.413862 0.539061 0.952923
8 1.000000 abreubo01 Bobby Abreu 0 1 2014 8480.0 1453.0 2470.0 ... 0.000000 0.058557 0.008660 0.820619 0.881649 0.066392 0.291274 0.394703 0.474764 0.869467
9 1.000000 beltrad01 Adrian Beltre 1 1 2016 10295.0 1428.0 2942.0 ... 0.002574 0.000000 0.000000 0.000000 0.000000 0.030882 0.285770 0.337833 0.479845 0.817678
10 1.000000 gonzalu01 Luis Gonzalez 0 1 2008 9157.0 1412.0 2591.0 ... 0.000000 0.933230 0.003474 0.008491 0.942107 0.010807 0.282953 0.366252 0.478869 0.845121
11 0.916705 thomeji01 Jim Thome 0 1 2012 8422.0 1583.0 2328.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.321667 0.276419 0.401823 0.554144 0.955967
12 0.916667 kentje01 Jeff Kent 1 1 2008 8498.0 1320.0 2461.0 ... 0.001305 0.000000 0.000000 0.000000 0.000000 0.003046 0.289598 0.355143 0.499647 0.854790
13 0.916667 guerrvl01 Vladimir Guerrero 1 1 2011 8155.0 1328.0 2590.0 ... 0.000000 0.000466 0.000932 0.747555 0.748952 0.236609 0.317597 0.378629 0.552544 0.931173
14 0.833417 gonzaju03 Juan Gonzalez 1 1 2005 6556.0 1061.0 1936.0 ... 0.000000 0.216696 0.149201 0.449378 0.776199 0.217880 0.295302 0.343117 0.560708 0.903824
15 0.833333 beltrca01 Carlos Beltran 0 1 2016 9301.0 1522.0 2617.0 ... 0.000000 0.000814 0.639805 0.256410 0.893773 0.082621 0.281368 0.353165 0.491560 0.844725
16 0.750093 walkela01 Larry Walker 0 1 2005 6907.0 1355.0 2160.0 ... 0.000000 0.016600 0.034708 0.864185 0.907445 0.013581 0.312726 0.399875 0.565224 0.965099
17 0.750093 camermi01 Mike Cameron 1 1 2011 6839.0 1064.0 1700.0 ... 0.000000 0.003069 0.914066 0.083887 0.982609 0.005627 0.248574 0.336631 0.443778 0.780409
18 0.750079 tejadmi01 Miguel Tejada 1 1 2013 8434.0 1230.0 2407.0 ... 0.896361 0.000000 0.000000 0.000000 0.000000 0.012437 0.285392 0.334891 0.455537 0.790428
19 0.750079 ramirar01 Aramis Ramirez 1 1 2015 8136.0 1098.0 2303.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.011851 0.283063 0.340864 0.492134 0.832997
20 0.750076 ordonma01 Magglio Ordonez 1 1 2011 6978.0 1076.0 2156.0 ... 0.000000 0.000000 0.014610 0.926948 0.933442 0.056277 0.308971 0.368380 0.502436 0.870816
21 0.750070 renteed01 Edgar Renteria 1 1 2011 8142.0 1200.0 2327.0 ... 0.982342 0.000000 0.000000 0.000000 0.000000 0.000929 0.285802 0.339290 0.398059 0.737349
22 0.750070 damonjo01 Johnny Damon 0 0 2012 9736.0 1668.0 2769.0 ... 0.000000 0.274699 0.521285 0.059036 0.830120 0.149398 0.284408 0.350096 0.432827 0.782923
23 0.750070 jonesan01 Andruw Jones 1 1 2012 7599.0 1204.0 1933.0 ... 0.000000 0.051002 0.785064 0.102004 0.930328 0.048725 0.254376 0.337142 0.485590 0.822732
24 0.750070 anderga01 Garret Anderson 0 0 2010 8640.0 1084.0 2529.0 ... 0.000000 0.622531 0.181329 0.072711 0.858618 0.106373 0.292708 0.323199 0.461111 0.784310
25 0.750061 delgaca01 Carlos Delgado 0 1 2009 7283.0 1241.0 2038.0 ... 0.000000 0.028501 0.000000 0.000000 0.028501 0.090909 0.279830 0.383389 0.545929 0.929318
26 0.750056 suzukic01 Ichiro Suzuki 0 1 2016 9689.0 1396.0 3030.0 ... 0.000000 0.038800 0.124800 0.780400 0.927200 0.020800 0.312726 0.354481 0.404583 0.759064
27 0.750039 francju01 Julio Franco 1 1 2007 8677.0 1285.0 2586.0 ... 0.282153 0.001583 0.000000 0.000396 0.001583 0.148397 0.298029 0.363889 0.417195 0.781083
28 0.750038 mcgrifr01 Fred McGriff 0 0 2004 8757.0 1349.0 2490.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.071138 0.284344 0.376843 0.509078 0.885921
29 0.666786 leeca01 Carlos Lee 1 1 2012 7983.0 1125.0 2273.0 ... 0.000000 0.843259 0.000000 0.000000 0.843259 0.039066 0.284730 0.338607 0.482776 0.821383
30 0.666770 willibe02 Bernie Williams 0 1 2006 7869.0 1366.0 2336.0 ... 0.000000 0.004335 0.894027 0.029865 0.926782 0.062139 0.296861 0.380426 0.477316 0.857742
31 0.666769 rolensc01 Scott Rolen 1 1 2012 7398.0 1211.0 2077.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.280752 0.364287 0.490403 0.854690
32 0.666769 cabreor01 Orlando Cabrera 1 1 2011 7562.0 985.0 2055.0 ... 0.928463 0.000000 0.000000 0.000000 0.000000 0.002519 0.271754 0.315082 0.389712 0.704793
33 0.666761 aloumo01 Moises Alou 1 1 2008 7037.0 1109.0 2134.0 ... 0.000000 0.639547 0.051493 0.310505 0.959835 0.011843 0.303254 0.368887 0.515703 0.884589
34 0.666761 durhara01 Ray Durham 0 1 2008 7408.0 1249.0 2054.0 ... 0.000000 0.000000 0.000506 0.000000 0.000506 0.027342 0.277268 0.349757 0.435745 0.785502
35 0.666761 loftoke01 Kenny Lofton 0 0 2007 8120.0 1528.0 2428.0 ... 0.000000 0.023776 0.943414 0.004755 0.970518 0.005706 0.299015 0.368746 0.422783 0.791529
36 0.666761 martied01 Edgar Martinez 1 1 2004 7213.0 1219.0 2247.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.682725 0.311521 0.417320 0.515458 0.932778
37 0.666760 edmonji01 Jim Edmonds 0 0 2010 6858.0 1251.0 1949.0 ... 0.000000 0.030830 0.879165 0.023869 0.928394 0.010443 0.284194 0.375439 0.527122 0.902560
38 0.666760 leede02 Derrek Lee 1 1 2011 6962.0 1081.0 1959.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.004119 0.281385 0.364561 0.494685 0.859247
39 0.666746 hunteto01 Torii Hunter 1 1 2015 8857.0 1296.0 2452.0 ... 0.000000 0.007167 0.642074 0.305228 0.951096 0.039629 0.276843 0.331201 0.461443 0.792644
40 0.666738 olerujo01 John Olerud 0 0 2005 7592.0 1139.0 2239.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.059534 0.294916 0.397440 0.464963 0.862403
41 0.666738 grissma02 Marquis Grissom 1 1 2005 8275.0 1187.0 2251.0 ... 0.000000 0.039261 0.908545 0.024480 0.964434 0.000924 0.272024 0.316442 0.414502 0.730943
42 0.666737 grudzma01 Mark Grudzielanek 1 1 2010 7052.0 946.0 2040.0 ... 0.347392 0.000000 0.000000 0.000000 0.000000 0.001110 0.289280 0.330001 0.393222 0.723223
43 0.666737 ibanera01 Raul Ibanez 0 1 2014 7471.0 1055.0 2034.0 ... 0.000000 0.678852 0.000925 0.079130 0.750116 0.143915 0.272253 0.335347 0.465132 0.800479
44 0.666737 konerpa01 Paul Konerko 1 1 2014 8393.0 1162.0 2340.0 ... 0.000000 0.007663 0.000000 0.000000 0.007663 0.146871 0.278804 0.354024 0.486477 0.840501
45 0.666729 gilesbr02 Brian Giles 0 0 2009 6527.0 1121.0 1897.0 ... 0.000000 0.349756 0.164050 0.479155 0.956145 0.021657 0.290639 0.399617 0.502375 0.901992
46 0.666706 finlest01 Steve Finley 0 0 2007 9397.0 1443.0 2548.0 ... 0.000000 0.013937 0.895858 0.073171 0.962447 0.005420 0.271150 0.329350 0.442375 0.771725
47 0.666673 polanpl01 Placido Polanco 1 1 2013 7214.0 1009.0 2142.0 ... 0.063311 0.002595 0.000000 0.000000 0.002595 0.001557 0.296923 0.338828 0.397283 0.736111
48 0.666667 kotsama01 Mark Kotsay 0 0 2013 6464.0 790.0 1784.0 ... 0.000000 0.032393 0.528736 0.242424 0.792059 0.030825 0.275990 0.330708 0.404394 0.735101
49 0.666667 matthga02 Gary Matthews 0 1 2010 4103.0 612.0 1056.0 ... 0.000000 0.141296 0.583138 0.227166 0.903201 0.021077 0.257373 0.330951 0.405313 0.736264

50 rows × 58 columns

Limitations

While it’s nice to be able to tell who is likely to get elected on future ballots, this is a naïve model: we don’t predict the career trajectories of current players. We simply ask: if they retired now, and we relax the 10-year minimum requirement, would their statistical output qualify them for the Hall of Fame based on what we’ve seen voters do in the past?

This brings us to a related question: how much do steroid suspicions hurt a player’s chance of getting in? Take Ivan Rodriguez, for example. Despite allegations of PED injections in 2003, he nevertheless made it into the HoF in 2017. Along with the cases of Tim Raines and Jeff Bagwell, who were also suspected of steroid use, this shows that voters do forgive, but on what conditions and to what extent we may never know.

References

  1. Lahman’s Baseball Database
  2. Scikit-Learn Tutorial: Baseball Analytics in Python Pt 2
  3. Hall of Famers - Rules for Election
  4. What Are the Major Eras of Major League Baseball History?
  5. List of people banned from Major League Baseball
  6. These 11 Players’ Hall of Fame Inductions Have Been Sabotaged by Steroid Allegations and Admissions
  7. Top 15 Baseball Players Who Have Used Performance Enhancing Drugs
  8. Hall of Fame Classification Using Random Forest
Le Hoang Van