In the 2016 Hardball Times Annual I wrote about evaluating umpire consistency. The analysis implements an idea Tom Tango originally blogged about. Here on my own blog I’m going to go a little more in depth on the underlying methodology and potential improvements. Along the way I’ll also relay some R tips I’ve picked up that are useful for sabermetric analysis, particularly on large datasets like PITCHf/x.
[If you haven’t already read the THT Annual, I strongly encourage you to pick up a copy (available on Amazon). Besides the background on my own work there are 300+ pages of great analysis and research.]
There are many methods one might use to model an umpire’s strike zone. In my THT piece I discussed some of the key drawbacks of neural networks, which include:
- Opacity: Neural networks are black box models, and there is no “plain English” interpretation of the differences between the specifications of two different networks.
- Non-Determinism: There is no closed-form method for deriving a neural network model; they are trained using iterative methods (typically with an element of randomization) and depending on implementation two training runs with the same data may not yield identical models.
- Overfitting: If not carefully constructed, neural networks can be prone to overfitting their training data.
On the plus side, neural networks offer several advantages:
- Flexibility: Neural networks are extremely flexible as applied to binary classification problems like calling balls and strikes; they assume no particular shape to the target distribution.
- Robustness: Unlike binning methods for modeling a strike zone, in which the zone is divided into bins and actual frequencies are observed for each bin, a neural network model gracefully handles sparse data and introduces no discontinuities that need to be smoothed out.
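For contrast, the binning approach mentioned above can be sketched in a few lines of base R. The data here is synthetic, and the rectangular zone, the bin width, and the variable names are all assumptions made for illustration:

```r
set.seed(1)

# Synthetic called pitches (not real PITCHf/x data); the rectangular zone
# used to label them is an assumption for illustration only
n    <- 200
px   <- runif(n, -2, 2)
pz   <- runif(n, 0, 5)
call <- as.integer(abs(px) < 0.83 & pz > 1.5 & pz < 3.5)

# Divide the plane into 0.5 ft bins and compute the observed strike rate per bin
bx <- cut(px, breaks = seq(-2, 2, by = 0.5), include.lowest = TRUE)
bz <- cut(pz, breaks = seq(0, 5, by = 0.5), include.lowest = TRUE)
rate   <- tapply(call, list(bx, bz), mean)
counts <- table(bx, bz)

dim(rate)        # 8 horizontal bins by 10 vertical bins
sum(counts < 3)  # number of bins with very few pitches
```

Bins with no pitches come back as NA, and thinly populated bins give noisy rates that jump from one bin to the next, which is the sparse-data and discontinuity problem described above.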
To create the neural network models of strike zones, I used R’s neuralnet package.
As discussed in my THT Annual piece, one of the first things I did was model individual strike zones for every batter. This was to establish the top and bottom of their typical individual strike zones: e.g. Jose Altuve and Giancarlo Stanton aren’t going to have the same zone, and the problems with PITCHf/x’s sz_top and sz_bot fields are well documented.
In the remainder of this post I’m going to work through an example of modeling an individual batter strike zone in R. With that knowledge you can train your own strike zone models for batters, pitchers, umpires – whatever. I will assume some basic knowledge of R concepts, as well as access to PITCHf/x data from Gameday inside your R environment (for the quickest way to get up and running with PITCHf/x analysis in R, try pitchRx).
First, if you haven’t already, install the neuralnet package:
> install.packages("neuralnet")
> library(neuralnet)
Let’s assume we have every 2014 pitch loaded for Mike Trout in a data frame df, with horizontal pitch location in column px, vertical pitch location in column pz, and the pitch outcome (in typical Gameday notation) in column des.
To evaluate a strike zone we need to look at called pitches only. To do that we’ll subset df:
> df_c <- df[df$des=='Called Strike' | df$des=='Ball' | df$des=='Ball in Dirt',]
Next, we’ll convert the ball/strike call into binary data to train the neural network:
> df_c$call <- ifelse(df_c$des=='Called Strike',1,0)
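To see the filter and the encoding working together, here is a minimal sketch on a toy data frame. The five rows are made up for illustration, and %in% is used as an equivalent, slightly more compact way to write the element-wise comparison:

```r
# Toy stand-in for the pitch data; these rows are invented, not real pitches
df <- data.frame(
  px  = c(0.1, -1.4, 0.5, 0.0, 2.0),
  pz  = c(2.4, 3.9, 1.0, 2.8, 0.2),
  des = c("Called Strike", "Ball", "Swinging Strike",
          "In play, out(s)", "Ball in Dirt"),
  stringsAsFactors = FALSE
)

# Keep only pitches the umpire had to call
called <- c("Called Strike", "Ball", "Ball in Dirt")
df_c <- df[df$des %in% called, ]

# Encode the call from the filtered frame so the lengths match
df_c$call <- ifelse(df_c$des == "Called Strike", 1, 0)

# Cross-tabulate as a sanity check: only 'Called Strike' should map to 1
table(df_c$des, df_c$call)
```

A quick cross-tabulation like this is a cheap way to confirm the encoding before spending time on training.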
The neuralnet package makes the last step very easy:
> m_bat <- neuralnet(call ~ px + pz, data=df_c, hidden=4, linear.output=FALSE)
(Note that hidden=4 specifies a single hidden layer with 4 nodes; the THT Annual has a discussion of how I arrived at that number.)
These last few commands illustrate my favorite feature of R: its syntax for working with vectors is extremely concise. In three lines of code we filtered out unnecessary data, transformed a text variable to numeric, and trained a neural network.
One pedagogical comment: although the interactive R shell is extremely cool and useful and is how I’m presenting this tutorial, and even though you can save and reload your command history between R sessions, the best practice is to do your work in script files (with comments!) and run the script from the R shell when you want to see the results. You do this with the source command:
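As a minimal sketch, the snippet below writes a tiny script to a temporary file and then runs it with source. The file name and its one-line contents are hypothetical stand-ins for your own script:

```r
# Hypothetical script location; in practice this would be your own .R file
script <- file.path(tempdir(), "strike_zone.R")

# Stand-in script contents: a comment plus one assignment
writeLines(c(
  "# Width of home plate in feet (17 inches)",
  "plate_width <- 17 / 12"
), script)

# Run every line of the script in the current session
source(script)

plate_width
```

Anything the script defines is available in the session afterwards, so you can keep exploring interactively while the script remains the record of what you did.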
Trust me, the sooner you get into this habit the happier you will be: it’s no fun having to sort through your command history to figure out what you were trying to do weeks, months or years ago.
Getting back to the topic at hand, now we have a neural network model of a strike zone for Mike Trout. To find the estimated probability of a pitch over the plate around belt-high (say px=0, pz=2.5) being called a strike, use compute:
> x <- compute(m_bat, data.frame(px=0, pz=2.5))
The variable x now contains a list with a variety of information about the neural network. To get the estimated strike probability, which is what we care about, access its net.result element:
> x$net.result
          [,1]
[1,] 0.9842061
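The same call works on many locations at once. Here is a self-contained sketch that trains on synthetic data and evaluates the model over a coarse grid. The data is generated from a made-up rectangular zone rather than real PITCHf/x pitches, the names toy, m_toy and grid are hypothetical, and the threshold and stepmax values are arbitrary choices intended to help training finish on this toy data:

```r
library(neuralnet)

# Fix the random seed: training starts from random weights, so without this
# two runs on the same data can produce slightly different models
set.seed(42)

# Synthetic called pitches (NOT real PITCHf/x data). The rectangular "true"
# zone below is a made-up assumption used only to generate labels.
n <- 300
toy <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0, 5))
toy$call <- as.integer(abs(toy$px) < 0.83 & toy$pz > 1.5 & toy$pz < 3.5)

# threshold and stepmax are loosened from their defaults (assumed values)
# so that training on this toy data is more likely to finish
m_toy <- neuralnet(call ~ px + pz, data = toy, hidden = 4,
                   linear.output = FALSE, threshold = 0.1, stepmax = 1e6)

# Evaluate the model on a coarse grid of locations in one call
grid <- expand.grid(px = seq(-1.5, 1.5, by = 0.5), pz = seq(1, 4, by = 0.5))
grid$p_strike <- as.vector(compute(m_toy, grid)$net.result)

# Locations with the highest estimated strike probability
head(grid[order(-grid$p_strike), ])
```

If neuralnet warns that the algorithm did not converge, the model has no usable weights; trying a different seed or a larger stepmax is the usual remedy.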
That’s it for training and using a neural network model of a batter strike zone! In my next post on this topic, I will show how I use these models to estimate the top and bottom of each batter’s strike zone.