What is Pitcher DNA and Batter DNA?

A (hopefully) readable explanation of the Pitcher DNA and Batter DNA system and why it’s more than just a pretty chart.

A long drive over the center field wall — four to one, good guys.

One fan sitting behind third base says, “Great hit.”

One fan watching at home says, “Terrible pitch.”

The truth — as is often the case — is somewhere in the middle. That pitch had some probability of being a hard hit ball; not zero and not 100% either.

With pitch-level data and machine learning, dividing the credit and blame in a batter/pitcher battle becomes much clearer. We can go a level deeper than aggregate statistics such as slugging percentage, ERA, or even strikeouts and walks.

There are a lot of smart baseball analysts doing pitch-level models, and I would not set out to publish my own unless I thought I had a unique take. The Pitcher DNA/Batter DNA system is something that I’ve fiddled with for a few years and I’m excited to finally put it out there.

Defining the Chromosomes

When a pitch leaves the pitcher’s hand, it usually results in one of five outcomes. With apologies to fringe outcomes such as hit-by-pitches and catcher’s interference, the vast majority of pitches result in: a ball, a called strike, a swinging strike, a foul, or a ball hit into play.

The balls-in-play category has a vast spectrum of outcomes, from easy outs to 500-foot bombs. Using the speed and angle off the bat, we can come up with an expected outcome for each ball in play. Then, we can bucket those into three groups: soft, medium, and hard. Soft balls in play are almost always outs. Medium balls in play are sometimes outs and sometimes hits. Hard balls in play are almost always hits, and likely extra bases.

This gives us seven buckets of pitch outcomes: balls, called strikes, swinging strikes, fouls, hard balls in play, medium balls in play, and soft balls in play. These seven buckets are the backbone of the Pitcher DNA and Batter DNA systems.
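As a rough sketch of the bucketing described above, the snippet below maps a raw pitch result to one of the seven buckets. The cutoffs `SOFT_MAX` and `HARD_MIN`, the expected-value scale, and the result strings are all assumptions made for illustration; the article does not publish its actual thresholds.

```python
from enum import Enum


class Outcome(Enum):
    BALL = "ball"
    CALLED_STRIKE = "called_strike"
    SWINGING_STRIKE = "swinging_strike"
    FOUL = "foul"
    SOFT_BIP = "soft_ball_in_play"
    MEDIUM_BIP = "medium_ball_in_play"
    HARD_BIP = "hard_ball_in_play"


# Hypothetical cutoffs on an expected-value scale derived from exit speed
# and launch angle. These numbers are placeholders, not the author's.
SOFT_MAX = 0.150
HARD_MIN = 0.600


def classify_pitch(result, expected_value=None):
    """Map a raw pitch result to one of the seven buckets.

    result: "ball", "called_strike", "swinging_strike", "foul", or "in_play".
    expected_value: required only when result is "in_play".
    """
    if result == "in_play":
        if expected_value is None:
            raise ValueError("balls in play need an expected value")
        if expected_value < SOFT_MAX:
            return Outcome.SOFT_BIP
        if expected_value >= HARD_MIN:
            return Outcome.HARD_BIP
        return Outcome.MEDIUM_BIP
    # Fringe results such as hit-by-pitch raise ValueError here,
    # since they fall outside the seven buckets.
    return Outcome(result)
```

For example, `classify_pitch("in_play", 0.9)` lands in the hard bucket under these placeholder cutoffs, while `classify_pitch("foul")` passes straight through.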

Using machine learning, we can train a model which uses the speed, movement, and location of the pitch to predict the chances a pitch will result in each of these seven categories.
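The article does not say which model is used, so the sketch below swaps in plain k-nearest neighbors purely to show the shape of the idea: pitch features go in, and a probability for each of the seven buckets comes out. The feature tuple, the function name, and the value of `k` are all hypothetical.

```python
import math
from collections import Counter

OUTCOMES = (
    "ball", "called_strike", "swinging_strike", "foul",
    "soft_ball_in_play", "medium_ball_in_play", "hard_ball_in_play",
)


def predict_outcome_probs(pitch, history, k=3):
    """Estimate outcome probabilities from the k most similar past pitches.

    pitch: tuple of numeric features (speed, movement, location).
    history: list of (features, outcome) pairs with the same feature layout.
    Features are used unscaled here; a real model would normalize them so
    that speed in MPH does not dominate location in feet.
    """
    if not history:
        raise ValueError("history must not be empty")
    nearest = sorted(history, key=lambda row: math.dist(pitch, row[0]))[:k]
    counts = Counter(outcome for _, outcome in nearest)
    # Share of the neighbors that ended in each bucket.
    return {o: counts[o] / len(nearest) for o in OUTCOMES}
```

Whatever model is actually behind the system, the output contract is the same: seven probabilities per pitch that sum to one.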

Hanging 3-1 breaking ball right down the pipe? 60% it’s a hard ball in play.

Two-strike, 99 MPH down-and-away slider? 50% it’s a swinging strike. 10% foul if the batter can even touch it.

Throw by throw, the pitcher builds a corpus of data on his pitch repertoire. Slicing that data into three pitch type buckets (fastball, breaking ball, and offspeed) gives us 7 × 3 = 21 data points. Slicing again by whether he is facing a lefty or a righty gives 21 × 2 = 42 data points to describe any given pitcher.
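The 7 × 3 × 2 slicing can be sketched as a simple aggregation. Here I assume each pitch record carries the batter's hand, a pitch group, and the model's seven probabilities, and that a data point is the average probability within a bucket. The record layout and the choice of a plain mean are assumptions, and the comparison to league average is left out.

```python
from collections import defaultdict
from itertools import product

OUTCOMES = (
    "ball", "called_strike", "swinging_strike", "foul",
    "soft_ball_in_play", "medium_ball_in_play", "hard_ball_in_play",
)
PITCH_GROUPS = ("fastball", "breaking", "offspeed")
BATTER_HANDS = ("L", "R")


def pitcher_dna(pitches):
    """Average outcome probabilities in each (hand, group, outcome) cell.

    pitches: iterable of dicts with keys "hand", "group", and "probs"
    (a dict of outcome -> probability). Empty cells are reported as None.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p in pitches:
        bucket = (p["hand"], p["group"])
        counts[bucket] += 1
        for outcome in OUTCOMES:
            sums[bucket + (outcome,)] += p["probs"].get(outcome, 0.0)

    dna = {}
    for hand, group, outcome in product(BATTER_HANDS, PITCH_GROUPS, OUTCOMES):
        n = counts[(hand, group)]
        dna[(hand, group, outcome)] = (
            sums[(hand, group, outcome)] / n if n else None
        )
    return dna
```

The result always has 42 keys, one per combination of batter hand, pitch group, and outcome, which matches the count in the text.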

This is his Pitcher DNA — a unique profile of strengths and weaknesses, across 42 categories (humans have 46 chromosomes so the metaphor isn’t perfect, but you get the idea).

This is Michael Wacha’s Pitcher DNA. First, glance at the 42 categories at the top. These tell you about Wacha’s strengths and weaknesses compared to the league average. He gets very good swinging strikes on his fastball against both hands, but poor swinging strikes on his secondary pitches. He has great control of his offspeed pitch against lefties, but that pitch also gives up an above-average rate of hard balls in play.

Next, look at the portion at the bottom. Pitcher DNA is not just academic; it is predictive. Using the 42 categories (plus a few that are secret sauce), Pitcher DNA can predict the range of outcomes a pitcher is likely to have in the future. This distribution is shown with an upside projection (80th percentile), a median projection (50th), and a downside projection (20th).

Put another way, we are asking the model, “When the Pitcher DNA looks like this, what results usually happen next?” In this case, Wacha is expected to be 7% worse than average.
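One way to picture that question in code is a similarity lookup: find past player windows whose DNA looked most like this one, then summarize what happened next. This is only a sketch under that assumption. The real system is not described in this much detail, and the function name, the distance measure, and `k` are hypothetical.

```python
import math
import statistics


def project_from_similar(dna, history, k=50):
    """Summarize what followed the k most similar historical DNA profiles.

    dna: tuple of numeric DNA values for the player today.
    history: list of (dna_vector, next_result) pairs, where next_result is
    the outcome over the following window (for example wRC+).
    Returns the 20th, 50th, and 80th percentiles of those results. Which
    end counts as "upside" depends on whether lower is better, as it is
    for a pitcher's wRC+ against.
    """
    if min(k, len(history)) < 2:
        raise ValueError("need at least two comparable profiles")
    nearest = sorted(history, key=lambda row: math.dist(dna, row[0]))[:k]
    results = [result for _, result in nearest]
    deciles = statistics.quantiles(results, n=10)  # 10th through 90th
    return {20: deciles[1], 50: deciles[4], 80: deciles[7]}
```

A player whose closest comparisons all went on to similar results gets a narrow band; a player whose comparisons scattered gets a wide one.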

Batter DNA

So far, we’ve described how to create a predictive profile for pitchers, but as Neil McCauley once told Vincent Hanna over coffee, “There is a flip side to that coin.”

One of the great things about baseball is it has a defined order of operations. The pitcher is in complete control right up until the point that the ball leaves his fingers.

Likewise, the batter has zero control, then — for a brief moment — all of the control.

The Batter DNA just takes things one step further in time. Given the pitches that have come his way, what does the hitter do with them?

Imagine a batter who hits 10 hard hit balls out of 100 pitches. Is he good? If the 100 pitches had an expected hard ball-in-play rate of 10%, then he’s exactly average. If the 100 pitches had an expected hard ball-in-play rate of 1%, then he’s one of the best hitters in the world.

This is why the expected pitch outcomes we used for pitchers are still the key to the Batter DNA system. The difference is that for hitters, we care mostly about the delta between the actual outcomes and the expected outcomes.
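The delta is simple arithmetic: the rate at which an outcome actually happened minus the average probability the pitch model assigned to it. A minimal sketch, assuming each pitch is stored as a pair of the actual outcome and the expected probabilities (a hypothetical layout):

```python
def outcome_delta(pitches, outcome):
    """Actual rate of `outcome` minus its average expected probability.

    pitches: list of (actual_outcome, expected_probs) pairs, where
    expected_probs is a dict of outcome -> probability for that pitch.
    """
    n = len(pitches)
    if n == 0:
        raise ValueError("no pitches to evaluate")
    actual_rate = sum(1 for actual, _ in pitches if actual == outcome) / n
    expected_rate = sum(probs.get(outcome, 0.0) for _, probs in pitches) / n
    return actual_rate - expected_rate
```

Using the example above, 10 hard-hit balls on 100 pitches that each carried a 10% expected rate gives a delta of zero, while the same 10 hard-hit balls on pitches with a 1% expected rate gives a delta of 9 percentage points.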

Batter DNA contains the same 42 categories as the pitcher version, just looking from a different angle of the prism. This produces a unique fingerprint of a hitter’s strengths and weaknesses and a fantastic foundation on which to build a predictive model.

Think about it like the triple slash on steroids.

Above is Shohei Ohtani’s Batter DNA. He is one of the rare hitters with above-average hard balls in play in all six pitch buckets. His soft balls in play are fewer than expected, which means he rarely hits into easy outs. He swings and misses more than expected, which is the only thing keeping him mortal.

For more detail about how to interpret the 42 DNA “chromosomes” check out this post.

An Always-on Projection System

You may have noticed that the projections on the DNA visualizations are for the next 100 plate appearances, not the rest of the season. The goal of the DNA system is to be a sort of micro-projection system that describes how the player is performing right now.

For this reason, the Pitcher and Batter DNA system uses rolling windows rather than a season at a time. Why? Look at the chart below (courtesy FanGraphs). Anthony Volpe was a different hitter in July than he was in April — and he was a different hitter in September than he was in July.

Picking the right time windows, therefore, is a key component to getting accurate micro-projections. Remember that we have three pitch type buckets against both lefties and righties: six total buckets to fill.

Without getting too deep into the details — and there are a lot of details — when predicting the results of a player’s next 100 plate appearances, the best accuracy comes when using around 700 pitches per bucket for pitchers and 200 pitches per bucket for batters.

These windows stretch backwards across seasons as well, so looking at projections at the start of this season means we are using the performance at the end of the last season. Admittedly, this leads to some strange looking projections for players who ended last season on a hot streak or limped to the end with a hidden injury.
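The windowing can be sketched as "keep the most recent N pitches in each of the six buckets, regardless of season." The window sizes below come from the text; the record layout and function name are assumptions.

```python
from collections import defaultdict

PITCHER_WINDOW = 700  # pitches per bucket, per the text
BATTER_WINDOW = 200


def recent_window(pitches, window):
    """Keep the most recent `window` pitches in each (hand, group) bucket.

    pitches: list of dicts with "date" (any sortable value, such as an ISO
    date string), "hand", and "group". Season boundaries are ignored, so a
    thin bucket reaches back into the previous season.
    """
    if window < 1:
        raise ValueError("window must be positive")
    by_bucket = defaultdict(list)
    for p in sorted(pitches, key=lambda p: p["date"]):
        by_bucket[(p["hand"], p["group"])].append(p)
    # Slicing from the end keeps the newest pitches, oldest first.
    return {bucket: rows[-window:] for bucket, rows in by_bucket.items()}
```

Because each bucket fills at its own pace, a rarely thrown pitch type can reach much further back in time than a primary fastball does.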

There are a million little decisions when making a predictive model. Using windows rather than seasons is one that I’m happy to make. I’d rather have a model that responds immediately to an in-season performance change than a safer one that averages multiple seasons of data.

FAQs

Are the Pitcher DNA and Batter DNA numbers park adjusted?
Yes.

Why wRC+?
For those not familiar, here are the details from the FanGraphs library. In my book, wRC+ is the best representation of a player’s true talent level and the 100 point base makes it easy to interpret.

Isn’t wRC+ a hitter’s stat? Why are you using “wRC+ against” for pitchers instead of FIP or ERA?
Slight soapbox, but I never understood why pitchers and batters aren’t judged on the same scale when they are fighting over the same outcomes. A double is exactly as good for the batter as it is bad for the pitcher. A pitcher’s job is to minimize the most damaging outcomes, just as a batter’s job is to maximize them.

Why do some players have projections with little upside? Or why do some have a ton of upside?
The mechanics of the DNA projections create a distribution of outcomes, with some players having a narrower range and others having a wider one. Frankly, these are some of the most fun elements of using the DNA system.

You’re saying *insert projection that looks weird* is right?
A projection is different from a prediction. A projection is the result of a model that only knows the data you give it. There are a lot of factors not included in these models and a lot of reasons why the input numbers may be misleading for any given player: hidden injuries, stress, PEDs, coaching. All the model is “saying” is that players with a DNA signature that looks like X most often go on to produce results that look like Y.

Is this system better than ZiPS, Steamer, ATC, etc?
Honestly, the goals of these systems are different. Dan Szymborski has been working on ZiPS for over a decade. The goal of his process is to predict player performance the next season and seasons into the future. The Pitcher DNA and Batter DNA system is more like a micro projection, optimized to predict the results next week, not two months from now or next year.

Why do you have “Big Papi” on the x-axis of your graphs?
David Ortiz’s career wRC+ is 140. I think putting Big Papi on the axis gives everyone a good mental picture of what a 140 wRC+ looks like. Plus I love Big Papi.

Where can I download the projections for all players?
The wRC+ projections and fantasy stat lines can be found here. For more details about the fantasy projections, check out this post.

If you have Pitcher DNA and Batter DNA, couldn’t you create a model that predicts the outcome of a plate appearance?
Yes. More to come on this.

If you can predict outcomes of a single batter/pitcher matchup, then couldn’t you do that for an entire lineup and thus predict the outcome of a game?
Now you’re asking the right questions. More to come on this too.