Relationship Between Strikeouts and Home Runs

Vanessa Foot and Justeena Zaki-Azat

2019-05-01

library(Lahman) 
library(ggplot2) 
library(dplyr)
library(car)

This vignette looks at the relationship between strikeout rate and home run rate from 1950 onward. The question was inspired by Marchi and Albert (2016), “Analyzing Baseball Data in R.”

There are many factors that must come together for a player to launch a home run. One of those factors is swing speed: against a 94-mph fastball, every 1-mph increase in swing speed adds about 8 feet of distance (Coburn, 2009). If a batter hits about 50 home runs in a season, is it safe to assume that he’s swinging for the fences, and is therefore also more likely to strike out? In 1927, Babe Ruth set the single-season home run record (60) and also struck out more than any other player that season (89). However, in 1971, Willie Stargell hit 48 home runs and struck out 154 times, while Henry Aaron hit 47 home runs and struck out only 58 times, demonstrating that home runs and strikeouts do not always go hand in hand.

The data files

We start by loading the files we will use here. We do some pre-processing to make them more convenient for the analyses done later.

The Batting data

The Batting table contains batting statistics at the player level going back to 1871, with a separate observation for each player, season, and stint. It is available in the Lahman package (v. 7.0.1 at the time of writing). It provides everything we need for our analysis: at bats (AB), strikeouts (SO), and home runs (HR), which we total across all players for each season from 1950 onward.

We are only using part of the table, so we will filter the data set to include only the variables that we need.
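As a minimal sketch of this step, the selection could look like the following. The object name `batting_small` is an assumption made for illustration, and `dplyr::select` is written out in full to avoid clashes with other packages that export a function of the same name.

```r
library(Lahman)
library(dplyr)

# Keep only the season identifier and the three counting statistics
# used in the rest of the analysis.
# `batting_small` is a hypothetical name, not taken from the vignette.
batting_small <- Batting %>%
  dplyr::select(yearID, AB, SO, HR)

# Inspect the first few rows of the reduced table.
head(batting_small)
```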

We’ll also create a new data frame that includes only data from 1950 onward. The Batting table has multiple rows for each year, so we’ll collapse them into yearly totals using the summarize function.

Last, we will use mutate to divide home runs and strikeouts by at bats, adding the new columns “SO rate” and “HR rate.” The full data frame will be called FullBatting.
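The filtering, collapsing, and rate steps described above can be sketched as a single pipeline. This is one plausible implementation rather than the authors' exact code. The column names `SO_rate` and `HR_rate` are assumed spellings of “SO rate” and “HR rate,” and `na.rm = TRUE` is a defensive choice in case any counts are missing.

```r
library(Lahman)
library(dplyr)

FullBatting <- Batting %>%
  dplyr::select(yearID, AB, SO, HR) %>%
  # Keep seasons from 1950 onward.
  filter(yearID >= 1950) %>%
  # Collapse the many rows per season into one row of yearly totals.
  group_by(yearID) %>%
  summarize(
    AB = sum(AB, na.rm = TRUE),
    SO = sum(SO, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE)
  ) %>%
  # Express strikeouts and home runs per at bat.
  # SO_rate and HR_rate are assumed column names.
  mutate(
    SO_rate = SO / AB,
    HR_rate = HR / AB
  )

head(FullBatting)
```

The result has one row per season, so every later summary, test, and plot treats a season as the unit of observation.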

A first look at Batting

What is the total number of strikeouts in our data set?

What is the average rate of strikeouts per at bat?

How many home runs do we have in our data set?

What is the average rate of home runs per at bat?
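The four questions above can be answered with a single summary call. The sketch below rebuilds the yearly table so that it runs on its own. The names `yearly` and `overview` are hypothetical, and the "average rate" is assumed to mean the mean of the yearly rates rather than a pooled total, which would give a slightly different number.

```r
library(Lahman)
library(dplyr)

# Yearly totals from 1950 onward (hypothetical object name `yearly`).
yearly <- Batting %>%
  filter(yearID >= 1950) %>%
  group_by(yearID) %>%
  summarize(
    AB = sum(AB, na.rm = TRUE),
    SO = sum(SO, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE)
  )

# Totals and mean per-at-bat rates across seasons.
overview <- yearly %>%
  summarize(
    total_SO     = sum(SO),
    mean_SO_rate = mean(SO / AB),
    total_HR     = sum(HR),
    mean_HR_rate = mean(HR / AB)
  )

overview
```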

Is there a relationship between strikeout rate and home run rate? According to our test, there is a significant positive correlation of .61 between strikeout rate and home run rate, with df = 65 and a p-value below .001.
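A test of this kind can be run with `cor.test` from base R, which uses a Pearson correlation by default. The sketch below assumes that is the test intended here. The names `yearly`, `SO_rate`, `HR_rate`, and `result` are illustrative, and the numbers obtained depend on which seasons the installed version of Lahman covers.

```r
library(Lahman)
library(dplyr)

# Yearly per-at-bat rates from 1950 onward (assumed column names).
yearly <- Batting %>%
  filter(yearID >= 1950) %>%
  group_by(yearID) %>%
  summarize(
    AB = sum(AB, na.rm = TRUE),
    SO = sum(SO, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(SO_rate = SO / AB, HR_rate = HR / AB)

# Pearson correlation test between the two yearly rates.
result <- cor.test(yearly$SO_rate, yearly$HR_rate)

result$estimate   # correlation coefficient
result$parameter  # degrees of freedom (number of seasons minus 2)
result$p.value    # p-value of the test
```

Because the degrees of freedom equal the number of seasons minus two, df = 65 corresponds to 67 seasons of data.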

We can look at the totals for interpretation purposes. We see here that for every 6.14 strikeouts, home runs increase by 4.14.

Create a scatterplot in ggplot, using SO rate and HR rate.
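One way to draw this plot is sketched below, with one point per season. The fitted line from `geom_smooth` is an optional addition that the text does not ask for, and the object and column names (`yearly`, `SO_rate`, `HR_rate`, `p`) are assumptions.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Yearly per-at-bat rates from 1950 onward (assumed column names).
yearly <- Batting %>%
  filter(yearID >= 1950) %>%
  group_by(yearID) %>%
  summarize(
    AB = sum(AB, na.rm = TRUE),
    SO = sum(SO, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(SO_rate = SO / AB, HR_rate = HR / AB)

# Scatterplot of home run rate against strikeout rate,
# with an optional least-squares line.
p <- ggplot(yearly, aes(x = SO_rate, y = HR_rate)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Strikeouts per at bat",
    y = "Home runs per at bat",
    title = "Strikeout rate and home run rate by season, 1950 onward"
  )

print(p)
```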