Baseball’s biggest chess match is between the hitter and the pitcher. Each pitch generates a new set of possible outcomes. A batter gets on base almost 94% of the time once the count reaches 3-0. Meanwhile, they get on base only 17% of the time from a 0-2 or 1-2 count. Balls matter.
There are also several different pitches a pitcher can throw. Hurlers can slightly adjust their grip on each pitch type, but there are four common pitches: the 4-seam fastball, curveball, slider, and change-up. A starting pitcher will generally use at least three of them. There’s still debate amongst the baseball community on how often a pitcher should throw a breaking pitch like a curveball or slider, as it could do more damage to a pitcher’s arm. As the at-bat progresses, the pitcher has to decide which pitch to throw the batter.
At-bats have their own history. If a pitcher has thrown four consecutive fastballs, the hitter starts to wonder whether a fifth one is coming. If a hitter is known to struggle against off-speed pitches, that needs to be factored into the equation.
Where is the Pitch Predictor?
We should feed a neural network all the batting data we have and let it give the likelihood of a certain pitch occurring. Several factors could be included, and we could also predict the location of the pitch and whether it is going to be a ball, strike, hit, or out.
We’ve heard announcers over the years try to predict what will happen, but why not show a heat zone of the likely location of the next pitch? How about a graphic that shows the percentages for each future pitch type?
What patterns would we be able to find? Could teams start giving batters signals on what the data suggests the next pitch is going to be? Are they already doing this?
How Do You Do It?
Each team plays 162 games a year, excluding the playoffs. With 30 teams, that comes to 2,430 games a year. Each team throws roughly 150 pitches per game, so about 300 pitches are thrown in a game. This results in approximately 729,000 pitches per year.
In 2006, MLB introduced Pitchf/x, which allows tracking of every pitch thrown in Major League Baseball. Roughly three-quarters of a million pitches a year over 15 years means over 10 million data points for us to crunch through.
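Here is the back-of-the-envelope math as a quick Python sketch. The 30-team figure is my assumption (the current size of the league), and the 150 pitches per team per game is the rough estimate from above, not a measured number.

```python
# Rough pitch-volume estimate using the figures quoted in the text.
TEAMS = 30                    # assumption: current number of MLB teams
GAMES_PER_TEAM = 162          # regular season only, no playoffs
PITCHES_PER_TEAM_GAME = 150   # rough estimate, not a measured average
SEASONS = 15

# Every game involves two teams, so halve the team-game total.
games_per_season = TEAMS * GAMES_PER_TEAM // 2

# Both teams pitch in every game.
pitches_per_season = games_per_season * 2 * PITCHES_PER_TEAM_GAME
total_pitches = pitches_per_season * SEASONS

print(games_per_season)    # 2430
print(pitches_per_season)  # 729000
print(total_pitches)       # 10935000
```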
That amount of data would allow you to get a general sense of the pitches a batter might see. If you labeled each pitch as “off-speed” or “fastball” along with the pitch count it was thrown on, you would have a pretty simple predictor. But there are many other features to add. What inning is it? Are there runners on base? What is the score of the game? How many times has the pitcher seen the batter this game? What number pitch of the at-bat is it? How many foul balls in this at-bat?
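To make the "pretty simple predictor" concrete, here is a minimal sketch of a lookup table keyed on the count. Everything in it is hypothetical: the sample pitches are made up for illustration (they are not real Pitchf/x data), and the names `fit_count_table` and `predict` are my own.

```python
from collections import Counter, defaultdict

# Hypothetical sample of (balls, strikes, label) tuples.
# Made up for illustration; not real Pitchf/x data.
SAMPLE_PITCHES = [
    (0, 0, "fastball"),
    (0, 0, "fastball"),
    (0, 0, "off-speed"),
    (0, 2, "off-speed"),
    (0, 2, "off-speed"),
    (0, 2, "fastball"),
    (3, 0, "fastball"),
    (3, 0, "fastball"),
]


def fit_count_table(pitches):
    """Tally how often each label was thrown on each count."""
    table = defaultdict(Counter)
    for balls, strikes, label in pitches:
        table[(balls, strikes)][label] += 1
    return table


def predict(table, balls, strikes):
    """Return the observed share of each label for this count.

    Returns an empty dict if the count never appeared in the data.
    """
    counts = table.get((balls, strikes))
    if not counts:
        return {}
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


table = fit_count_table(SAMPLE_PITCHES)
print(predict(table, 0, 2))  # shares of each label seen on 0-2
print(predict(table, 3, 0))  # only fastballs were seen on 3-0 here
```

Adding the other features (inning, runners on base, score, and so on) would mean widening the lookup key. At some point the table gets too sparse to trust, and a trained model becomes the more sensible tool.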
Gimme Dat Data
Unfortunately, there isn’t a free, easily accessible database that stores all the Pitchf/x data. There is a tool written in R that should be able to scrape the data for you. I have never actually seen the raw Pitchf/x data, but I assume it isn’t lovely. Having dealt with MLB data for my warningtrack.co site and the mobile app I’m attempting to develop, I can tell you that baseball doesn’t love developers. I hope that I can get my hands dirty with it someday, though, and make my children think I have superpowers and can predict the next pitch.