In part 1 I looked at building a neural network model of a batter strike zone in R. In this post I will show you how to use that model to estimate the boundaries of a batter's personal strike zone. As a reminder, the reasons we want individual batter strike zones are:
- Batter heights and stances vary significantly
- The PITCHf/x sz_top and sz_bot fields have problems
Ok, so let’s get right to the R!
When we left off, we had created a neural network model (m_bat) of Mike Trout’s strike zone based on 2014 data. My approach from here is to define a reasonably fine grid of (x, z) points to test, and to find the highest and lowest points that have at least a 50% chance of being called a strike per the model.
Once again, make sure you have the neuralnet package loaded in R.
First, let’s define how fine we want the grid increments to be. 1/4″ should be plenty fine. The PITCHf/x location data is in feet, so for 1/4″ we’d be looking at 1/48 of a foot. I’ll use 0.02 (1/50) to keep everything tidy:
> inc <- 0.02
Now we can use that increment to quickly define a sequence of x’s and z’s over some reasonable range using the seq command. I’ll assume that the boundaries we care about will be somewhere within 2 feet of the center of the plate, and no higher than 6 feet off the ground or lower than 1 foot:
> x_s <- seq(-2, 2, by=inc)
> z_s <- seq(1, 6, by=inc)
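As a quick sanity check on the grid resolution, here is a small sketch that converts the increment to inches and counts the points along each axis. It redefines inc, x_s and z_s so it runs on its own; the names inc_inches and n_points are just illustrative.

```r
# Grid increment in feet (PITCHf/x locations are in feet)
inc <- 0.02
inc_inches <- inc * 12   # 0.24 inches, just under the 1/4 inch target

x_s <- seq(-2, 2, by = inc)
z_s <- seq(1, 6, by = inc)

# Number of points on each axis, and in the full grid
n_points <- length(x_s) * length(z_s)
print(c(inc_inches = inc_inches,
        n_x = length(x_s),
        n_z = length(z_s),
        n_points = n_points))
```

With these settings the grid has 201 x values and 251 z values, so the model will be evaluated at 50,451 points.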
If you have ever done any kind of programming, you probably think the next step is to use a nested loop to evaluate the strike zone model over each possible (x, z) pair. But you would be wrong! If you find yourself writing a loop in R, ninety-nine times out of a hundred you are making a mistake.
To really get the most out of R you need to break out of the procedural mindset and instead think in vector/matrix operations. In this case, we will use the x and z values to define an entire grid that we will use for evaluating the model. As usual there’s a function for that! No loops needed:
> g <- expand.grid(x=x_s, z=z_s)
Isn’t R great?
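To see what expand.grid is doing, here is a tiny sketch on a 3 by 2 grid, next to the nested loop it replaces. The variable names (xs, zs, g_small, g_loop) are made up for illustration.

```r
xs <- c(-1, 0, 1)
zs <- c(2, 3)

# Vectorized: every (x, z) combination, with x varying fastest
g_small <- expand.grid(x = xs, z = zs)

# The loop-based equivalent, shown for comparison only
g_loop <- data.frame(x = numeric(0), z = numeric(0))
for (z in zs) {
  for (x in xs) {
    g_loop <- rbind(g_loop, data.frame(x = x, z = z))
  }
}

print(g_small)
print(nrow(g_small) == length(xs) * length(zs))
```

Both approaches produce the same six rows, but the loop grows a data frame one row at a time, which gets slow quickly on a grid with tens of thousands of points.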
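Before we move on, let’s copy the grid into a data frame whose columns are named px and pz, the PITCHf/x names for the horizontal and vertical pitch location. expand.grid already returns a data frame, but using these names will make the next couple of steps easier.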
> df_g <- data.frame(px=g$x, pz=g$z)
Now we’re firmly in R’s wheelhouse. Evaluating Mike Trout’s strike zone model over the whole grid is done in one line:
> df_g$p <- compute(m_bat, df_g)$net.result
Let’s pause here to talk about what is happening in this one-liner. In part 1 I introduced the “compute” function for evaluating the m_bat model on a single point, but it works the same way when called on a whole series of input values. This time, rather than assigning the complete return object to a variable and then referencing its “net.result” element to get the probabilities, I referenced “net.result” directly after the function call. And instead of assigning those probabilities to a brand new variable, I assigned them to a new column “p” inside our existing data frame df_g. The result is that each row of df_g has an x value, a z value and the model’s probability that a pitch at (x, z) will be called a strike.
Now that we know what we’re looking at in df_g, let’s find the upper and lower boundaries of the strike zone. First, we’ll select all points where p is at least 0.5. These are the candidates for the top and bottom of the zone:
> df_cand <- df_g[df_g$p>=0.5,]
Finding the top and bottom of the zone is as simple as selecting the max and min z values among our list of candidates:
> sz_top <- max(df_cand$pz)
> sz_bot <- min(df_cand$pz)
Once again this shows how concisely you can build complex operations in R, without resorting to obscure syntax.
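If you don’t have the model from part 1 handy, the whole grid search can be tried out with a stand-in. The sketch below swaps m_bat for a made-up function, p_strike, whose 50% contour is an ellipse with a known top and bottom. It is an assumption for illustration only, not the neural network, and it uses no neuralnet calls.

```r
# Hypothetical stand-in for the fitted model: NOT the neural network from part 1.
# It returns a called-strike probability that is exactly 0.5 on an ellipse
# centered at (0, 2.5) with half-width 0.90 ft and half-height 0.99 ft.
p_strike <- function(px, pz) {
  r <- sqrt((px / 0.90)^2 + ((pz - 2.5) / 0.99)^2)
  plogis(25 * (1 - r))
}

inc <- 0.02
g <- expand.grid(x = seq(-2, 2, by = inc), z = seq(1, 6, by = inc))
df_g <- data.frame(px = g$x, pz = g$z)

# Evaluate the stand-in over the whole grid in one vectorized call
df_g$p <- p_strike(df_g$px, df_g$pz)

# Keep the points that are at least 50% likely to be called strikes
df_cand <- df_g[df_g$p >= 0.5, ]

# Highest and lowest candidate heights
sz_top <- max(df_cand$pz)
sz_bot <- min(df_cand$pz)
print(c(sz_top = sz_top, sz_bot = sz_bot))
```

The stand-in’s true edges are at 3.49 and 1.51 feet, which fall between grid points, so the search reports 3.48 and 1.52. The answer is only ever as precise as the grid increment.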
So what values did we actually get for the top and bottom of Trout’s strike zone?
> sz_top
[1] 3.52
> sz_bot
[1] 1.54
Exciting! Okay, not really. But now we can use these dimensions and the dimensions for all other batters to put everyone’s personal strike zone onto a common scale.
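One simple way to do that, sketched here as an assumption rather than the method used later in this series, is to rescale each pitch height so that 0 is the bottom of the batter’s zone and 1 is the top. The helper normalize_pz and the sample pitch heights are hypothetical.

```r
# Trout's boundaries from the grid search above
sz_top <- 3.52
sz_bot <- 1.54

# Hypothetical helper: rescale a pitch height (feet) so that
# 0 is the bottom of the batter's zone and 1 is the top
normalize_pz <- function(pz, sz_bot, sz_top) {
  (pz - sz_bot) / (sz_top - sz_bot)
}

# Made-up pitch heights at the bottom, middle and top of the zone
z_norm <- normalize_pz(c(1.54, 2.53, 3.52), sz_bot, sz_top)
print(z_norm)
```

On this scale a value below 0 or above 1 means the pitch was outside the batter’s own zone vertically, whatever his height or stance.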
Next time: I actually do something related to umpires, which was the whole point of all of this!