A (hopefully) readable explanation of the Baseball DNA system and why it’s more than just a pretty chart.

A long drive over the center field wall — four to one, good guys.

One fan sitting behind third base says, “Great hit.”

One fan watching at home says, “Terrible pitch.”

The truth — as is often the case — is somewhere in the middle. That pitch had some probability of being a hard hit ball; not zero and not 100% either.

With pitch-level data and machine learning, unpacking the credit and blame of a batter/pitcher battle becomes a lot clearer. We can go a level deeper than aggregate statistics such as slugging percentage, ERA, or even strikeouts and walks.

There are a lot of smart baseball analysts doing pitch-level models, and I would not set out to publish my own unless I thought I had a unique take. In short, my philosophy on pitch modeling is not to assign a single quality number to every pitch (such as Pitcher List’s PLV or PitchingBot’s Stuff+, although those are good systems too), but to use the full distribution of expected outcomes of every pitch.

Defining the Chromosomes

When a pitch leaves the pitcher’s hand, it usually results in one of five outcomes: a ball, a called strike, a swinging strike, a foul, or a ball hit into play (not counting fringe outcomes such as catcher’s interference and HBPs).

The balls-in-play category has a vast spectrum of outcomes, from easy outs to 500-foot bombs. Using the speed and angle off the bat, we can come up with an expected outcome for each ball in play. Then, we can bucket those into three groups: soft, medium, and hard. Soft balls in play are almost always outs. Medium balls in play are sometimes outs and sometimes hits. Hard balls in play are almost always hits, and likely extra bases.

This gives us a total of seven buckets of pitch outcomes: balls, called strikes, swinging strikes, fouls, hard balls in play, medium balls in play, and soft balls in play. These seven buckets are the backbone of the Baseball DNA system.
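The seven buckets can be sketched as a small classification function. This is a minimal illustration, not the system's actual code: the 0-to-1 expected-value scale, the `SOFT_MAX` and `HARD_MIN` cutoffs, and the `bucket_pitch` name are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    BALL = "ball"
    CALLED_STRIKE = "called_strike"
    SWINGING_STRIKE = "swinging_strike"
    FOUL = "foul"
    SOFT_BIP = "soft_bip"
    MEDIUM_BIP = "medium_bip"
    HARD_BIP = "hard_bip"


# Hypothetical cutoffs on an assumed 0-to-1 expected-value scale.
SOFT_MAX = 0.15
HARD_MIN = 0.60


def bucket_pitch(result: str, expected_value: Optional[float] = None) -> Outcome:
    """Map a raw pitch result to one of the seven buckets.

    `result` is "ball", "called_strike", "swinging_strike", "foul", or
    "in_play". For balls in play, `expected_value` is the expected outcome
    derived from the speed and angle off the bat.
    """
    if result != "in_play":
        return Outcome(result)  # the four non-contact outcomes map directly
    if expected_value is None:
        raise ValueError("balls in play need an expected value")
    if expected_value < SOFT_MAX:
        return Outcome.SOFT_BIP
    if expected_value >= HARD_MIN:
        return Outcome.HARD_BIP
    return Outcome.MEDIUM_BIP
```

For instance, `bucket_pitch("in_play", 0.90)` lands in the hard bucket under these made-up cutoffs, while `bucket_pitch("foul")` passes straight through.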

Using machine learning, we can train a model which uses the speed, movement, and location of the pitch to predict the chances a pitch will result in each of these seven categories.
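To make the shape of that prediction concrete, here is a rough sketch of a model that turns pitch features into seven probabilities that sum to one. The article does not say which kind of model is used, so the softmax-over-linear-scores approach, the `predict_outcomes` name, and the weights are placeholders chosen only for illustration.

```python
import math
from typing import Dict, List

OUTCOMES = ["ball", "called_strike", "swinging_strike", "foul",
            "soft_bip", "medium_bip", "hard_bip"]


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_outcomes(features: List[float],
                     weights: List[List[float]]) -> Dict[str, float]:
    """Score each outcome as a weighted sum of pitch features, then normalize.

    `features` might hold velocity, movement, and location values;
    `weights` has one row per outcome. Both are placeholders (assumption).
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return dict(zip(OUTCOMES, softmax(scores)))
```

Whatever the real model looks like, the important property is the output: one probability per bucket for every pitch, rather than a single quality score.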

For example, the pitch below has the following projected outcomes: 40% foul ball, 21% medium BIP, 13% called strike, 10% soft BIP, 10% swinging strike, 5% hard BIP, 0% ball. High probabilities of bad outcomes; this is a meatball even though it went for a called strike.

Here’s an example of a good pitch, with the following projected outcomes: 52% swinging strike, 37% ball, 9% foul, and less than 1% total for called strike and all three ball-in-play types.

Note that these percentages are not dependent on the actual outcome of the pitch, nor on the quality of the pitcher or hitter — only the physics of the pitch itself.

Throw by throw, the pitcher creates a corpus of data on his pitch repertoire. Slicing that data by three pitch-type buckets (fastball, breaking ball, and offspeed) gives us 7 outcomes × 3 pitch types = 21 data points. Slicing again by batter handedness (lefties and righties) doubles that to 42 data points to describe any given pitcher.

This is the pitcher’s Baseball DNA — a unique profile of strengths and weaknesses, across 42 categories (humans have 46 chromosomes so the metaphor isn’t perfect, but you get the idea).
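The 42-category profile can be sketched as a simple aggregation over per-pitch probabilities. This is an illustrative sketch under assumptions: the dictionary field names (`batter_side`, `pitch_group`, `probs`) are hypothetical, and averaging within each slice is my guess at one reasonable way to summarize it, not a confirmed detail of the system.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

OUTCOMES = ["ball", "called_strike", "swinging_strike", "foul",
            "soft_bip", "medium_bip", "hard_bip"]
PITCH_GROUPS = ["fastball", "breaking", "offspeed"]
BATTER_SIDES = ["L", "R"]


def build_profile(pitches: List[dict]) -> Dict[Tuple[str, str, str], float]:
    """Average each predicted outcome probability within every
    (batter side, pitch group) slice: 2 x 3 x 7 = 42 values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pitch in pitches:
        key = (pitch["batter_side"], pitch["pitch_group"])
        counts[key] += 1
        for outcome in OUTCOMES:
            sums[key + (outcome,)] += pitch["probs"].get(outcome, 0.0)

    profile = {}
    for side in BATTER_SIDES:
        for group in PITCH_GROUPS:
            n = counts[(side, group)]
            for outcome in OUTCOMES:
                # Slices with no pitches (e.g. an unused offspeed pitch) get 0.0.
                total = sums[(side, group, outcome)]
                profile[(side, group, outcome)] = total / n if n else 0.0
    return profile
```

The result always has exactly 42 keys, even for a pitcher who never throws one of the pitch groups to one side of the plate.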

This is Dylan Cease’s Baseball DNA as of 5-17-2026, looking at the last 100 plate appearances against lefties and righties. You can see six buckets of handedness/pitch type combinations. The blue bar represents how many of each outcome we would expect given the pitch model. For comparison’s sake, the black bar represents the MLB average for the same pitch mix (same number of fastballs, breaking, and offspeed pitches), and the callout number is the difference between his expected outcomes and the MLB expected outcomes.
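The callout number can be sketched as follows: take the league-average rate of an outcome for each pitch group, apply it to the pitcher's own pitch counts, and subtract the total from the pitcher's expected count. The function names and every number in the usage example are hypothetical, not real league rates or Cease's data.

```python
from typing import Dict


def league_expected(pitch_counts: Dict[str, int],
                    league_rates: Dict[str, float]) -> float:
    """League-average expected count of one outcome for the same pitch mix."""
    return sum(n * league_rates[group] for group, n in pitch_counts.items())


def callout_delta(pitcher_expected: float,
                  pitch_counts: Dict[str, int],
                  league_rates: Dict[str, float]) -> float:
    """Pitcher's expected count minus the league's, for an identical pitch mix."""
    return pitcher_expected - league_expected(pitch_counts, league_rates)


# Hypothetical inputs: swinging strikes only, made-up rates and counts.
counts = {"fastball": 200, "breaking": 150, "offspeed": 0}
rates = {"fastball": 0.08, "breaking": 0.14, "offspeed": 0.15}
delta = callout_delta(50.0, counts, rates)
```

Matching the pitch mix matters because breaking balls draw more whiffs than fastballs in general; without it, a slider-heavy pitcher would look better than his pitches deserve.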

Reading Cease’s chart, the first thing you might note is that he doesn’t throw his offspeed very often, and never against righties. His fastballs and breaking balls are very good, with high positive deltas on his swinging strike numbers — he’s a strikeout machine. He also has positive deltas on called balls, but he’s making it work.

Next, look at the portion at the bottom. Baseball DNA is not just academic; it is predictive. It uses the 42 categories (plus a few that are secret sauce) to project his performance over the next 100 plate appearances against batters of each handedness. In Cease’s case, that performance is predicted to be very good, with right-handed hitters projected at a 68 wRC+ and lefties at an 82 wRC+. Those get blended into an overall projection of 76 wRC+, which is equivalent to an ERA under 3.00.
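How the two handedness projections get blended is not spelled out here. If it were a simple weighted average by the share of right-handed batters faced (an assumption on my part for this sketch), it might look like the code below. The `blend_wrc_plus` name and `rhh_share` parameter are hypothetical, and the 3/7 share is back-solved so the example reproduces the 76 above; it is not a number taken from the system.

```python
def blend_wrc_plus(vs_rhh: float, vs_lhh: float, rhh_share: float) -> float:
    """Weighted average of the two handedness projections.

    `rhh_share` is the assumed fraction of upcoming batters who hit
    right-handed; it must be between 0 and 1.
    """
    if not 0.0 <= rhh_share <= 1.0:
        raise ValueError("rhh_share must be between 0 and 1")
    return rhh_share * vs_rhh + (1.0 - rhh_share) * vs_lhh


# A 68 vs. righties and an 82 vs. lefties, with an assumed 3/7 righty share.
overall = blend_wrc_plus(68.0, 82.0, 3 / 7)
```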

Imagine putting human DNA into an algorithm that could predict how tall the person will be. It might be right, it might be wrong, there might be other factors that interfere (maybe an injury), but it could show you the average height based on that DNA chromosome. In the Baseball DNA system, we are asking, “When the Baseball chromosome looks like this, what results usually happen next?”

Baseball DNA for Batters

So far, we’ve described how to create a predictive profile for pitchers, but as Neil McCauley once told Vincent Hanna over coffee, “There is a flip side to that coin.”

One of the great things about baseball is it has a defined order of operations. The pitcher is in complete control right up until the point that the ball leaves his fingers.

Likewise, the batter has zero control, then — for a brief moment — all of the control.

For batters, Baseball DNA is basically asking the question: “Given the expected outcomes of the pitches that have come his way, what does the hitter do with them?”

Imagine a batter who hits 10 hard hit balls out of 100 pitches. Is he good? If the 100 pitches had an expected hard ball-in-play rate of 10%, then he’s exactly average. If the 100 pitches had an expected hard ball-in-play rate of 1%, then he’s one of the best hitters in the world.

For example, here is a pitch that is hard to hit. Its expected outcomes are: 29% swinging strike, 28% foul, and less than 1% hard ball in play. A good pitch, but alas, it was a hard ball in play.

This is why the expected pitch outcomes we used for pitchers are still the key to the Baseball DNA system. In the above example, Elder gets credit for a good slider (because most of the time it would have produced a good result), and Contreras gets credit for a hard hit with a high degree of difficulty. Said another way, for pitchers we mostly care about the expected outcomes; for hitters we mostly care about the delta between the actual outcomes and the expected outcomes.
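The batter-side delta is simple arithmetic, sketched here with the 10-hard-hits-in-100-pitches example from earlier. The `outcome_delta` name is hypothetical, and treating every pitch as having the same expected rate is a simplification for the example.

```python
from typing import List


def outcome_delta(actual_flags: List[bool], expected_probs: List[float]) -> float:
    """Actual count of an outcome minus the count the pitch model expected."""
    if len(actual_flags) != len(expected_probs):
        raise ValueError("need one expected probability per pitch")
    return sum(actual_flags) - sum(expected_probs)


# 10 hard balls in play out of 100 pitches.
hard_hit = [True] * 10 + [False] * 90

average_hitter = outcome_delta(hard_hit, [0.10] * 100)  # pitches expected 10 hard hits
elite_hitter = outcome_delta(hard_hit, [0.01] * 100)    # pitches expected only 1
```

The same ten hard-hit balls produce a delta of about zero in the first case and about nine in the second, which is the whole point: the hitter's result only means something relative to the difficulty of the pitches he saw.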

Baseball DNA for batters contains the same 42 categories as the pitcher version, just looking from a different angle of the prism. This produces a unique fingerprint of a hitter’s strengths and weaknesses and a fantastic foundation on which to build a predictive model.

Above is the Baseball DNA for Mookie Betts, as of 5-17-26. This chart looks similar to the pitcher version, but has one major difference. The dark bar is the sum of the expected outcomes given the pitches that have come his way, the teal bar shows the actual outcomes of those pitches, and the callout number is the delta between the two.

As you can see in his chart, Betts is a very patient hitter, with positive deltas on called balls and called strikes and very low deltas on swinging strikes. That approach pays off with positive deltas on hard hit balls in play, particularly against righties.

An Always-on Projection System

You may have noticed that the projections on the Baseball DNA visualizations are for the next 100 plate appearances, not the rest of the season. The goal of the Baseball DNA system is to be a sort of micro-projection system, one that describes how the player is performing right now.

For this reason, the Baseball DNA system uses rolling windows rather than a season at a time. Why? Players get hot, get cold, get hurt, and make adjustments all the time.

Look at Anthony Volpe’s 2023 in the chart below (courtesy FanGraphs). He was a different hitter in July than he was in April, and he was a different hitter in September than he was in July.

The Baseball DNA system aims to show you a “now-cast” or “micro-projection” of the player’s performance right now. These projections are updated weekly and can be found here.

Picking the right size for the look-back windows is a key component of getting accurate micro-projections. Without getting too deep into the details (and there are a lot of details), when predicting the results of a player’s next 100 plate appearances, I’ve found the best performance comes from looking at the previous 100 plate appearances for batters and the previous 500 for pitchers. I also include some momentum variables which aim to capture very recent performance changes.

These windows stretch backwards across seasons as well, so looking at projections at the start of this season means we are using the performance at the end of the last season. Admittedly, this leads to some strange looking projections for players who ended last season on a hot streak or limped to the end with a hidden injury.
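The look-back logic can be sketched in a few lines. The window sizes come from the text above; the `lookback_window` name and the assumption that plate appearances arrive as one chronological list (oldest first, with seasons concatenated) are mine.

```python
from typing import List, Sequence

# Window sizes from the text: 100 PAs for batters, 500 for pitchers.
LOOKBACK_PA = {"batter": 100, "pitcher": 500}


def lookback_window(plate_appearances: Sequence, role: str) -> List:
    """Return the most recent plate appearances for the given role.

    Assumes `plate_appearances` is in chronological order, oldest first,
    and already spans season boundaries.
    """
    size = LOOKBACK_PA[role]
    return list(plate_appearances[-size:])
```

Because the list ignores season boundaries, a batter with only 40 plate appearances this year would have the other 60 filled in from the end of last season, which is exactly where the strange early-season projections come from.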

There are a million little trade-offs when making a predictive model. Using windows rather than seasons is one that I’m happy to make. I’d rather have a model that responds quickly to an in-season performance change than a safer one that averages multiple seasons of data.

Baseball DNA Fantasy Projections

For most purposes, the projection output I care most about is the player’s two wRC+ projections (vs. lefties and vs. righties), which then get blended into an overall projection.

However, I’m also an avid fantasy baseball player, so I understand the need to go deeper. As complete a stat as wRC+ is, it doesn’t translate into fantasy categories like batting average, RBIs, strikeouts, wins, or saves.

In order to generate more fantasy-friendly stats, we need different models. These models use the exact same inputs from the Pitcher and Batter DNA profiles, but rather than wRC+ as the “y” variable, we are projecting these four components:

  • Strikeout rate
  • Walk rate
  • Batting average on balls in play (or BABIP)
  • Bases per hit (or BPH)

The first three need no further explanation, but the fourth one probably seems odd. Why bases per hit?

The projections need a stat which independently measures the raw power of a player (or for pitchers, the ability to limit power). There are other, more established stats that attempt to do the same, slugging percentage (SLG) and isolated power (ISO, which is slugging percentage minus batting average) to name two, but those stats are not independent of the other three components. Allow me to demonstrate.

For example, consider a player who has the following day at the plate. Watch his SLG, ISO, and BPH across the course of the day:

  • First inning: home run. 1-for-1. SLG: 4.000, ISO: 3.000, BPH: 4.0
  • Third inning: strikeout. 1-for-2. SLG: 2.000, ISO: 1.500, BPH: 4.0
  • Sixth inning: strikeout. 1-for-3. SLG: 1.333, ISO: 1.000, BPH: 4.0
  • Ninth inning: strikeout. 1-for-4. SLG: 1.000, ISO: .750, BPH: 4.0

As you can see, this player’s slugging percentage and isolated power drop from the first inning to the ninth, even though all he did was strike out. Therefore, SLG and ISO are not independent of strikeout rate.

If, instead of strikeouts, he had three ground outs, the story would be the same. In that case SLG and ISO aren’t independent of BABIP.

Therefore, the right metric to independently measure power is bases per hit (BPH), not SLG or ISO.
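The day at the plate above can be reproduced with a short script. The formulas are the standard ones (SLG is total bases per at-bat, ISO is SLG minus batting average) plus bases per hit as defined here; the `power_stats` name and the list-of-total-bases input format are just conveniences for this sketch.

```python
from typing import List, Tuple


def power_stats(bases_by_at_bat: List[int]) -> Tuple[float, float, float]:
    """Return (SLG, ISO, BPH) from total bases per at-bat (0 means an out)."""
    at_bats = len(bases_by_at_bat)
    if at_bats == 0:
        raise ValueError("need at least one at-bat")
    hits = sum(1 for bases in bases_by_at_bat if bases > 0)
    total_bases = sum(bases_by_at_bat)
    slg = total_bases / at_bats
    avg = hits / at_bats
    iso = slg - avg
    bph = total_bases / hits if hits else 0.0  # undefined with no hits; use 0.0
    return slg, iso, bph


day = [4, 0, 0, 0]  # home run, then three strikeouts
for n in range(1, len(day) + 1):
    slg, iso, bph = power_stats(day[:n])
    print(f"1-for-{n}: SLG {slg:.3f}, ISO {iso:.3f}, BPH {bph:.1f}")
```

Swapping the three zeros from strikeouts to ground outs changes nothing in the input list, which is another way of seeing that SLG and ISO cannot tell the difference while BPH does not need to.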

From there, all rate and counting stats are generated using those four components and then adjusted for home park for hitters; home park and team defense for pitchers.

FAQs

Are the Baseball DNA numbers park adjusted?
Yes.

Why wRC+?
For those not familiar, here are the details from the FanGraphs library. In my book, wRC+ is the best representation of a player’s true talent level and the 100 point base makes it easy to interpret.

Isn’t wRC+ a hitter’s stat? Why are you using “wRC+ against” for pitchers instead of FIP or ERA?
Slight soapbox, but I never understood why pitchers and batters aren’t judged on the same scale when they are fighting over the same outcomes. A double is exactly as good for the batter as it is bad for the pitcher. A pitcher’s job is to minimize the most damaging outcomes, just as a batter’s job is to maximize them.

You’re saying *insert projection that looks weird* is right?
A projection is different than a prediction. A prediction is saying what you think will happen. A projection is the result of a model that only knows the data you give it. Often projections are the basis of predictions: the weather report says there is a 50% chance of rain (projection), and I think it’s going to pour (prediction).

There are a lot of factors not included in these models and a lot of reasons why the input numbers may be misleading for any given player: hidden injuries, stress, PEDs, coaching. All the model is “saying” is that players with a DNA signature that looks like X most often produce results that look like Y.

Is this system better than ZiPS, Steamer, ATC, etc.?
Honestly, the goals of these systems are different. Dan Szymborski has been working on ZiPS for over a decade. The goal of his process is to predict player performance the next season and seasons into the future. The Baseball DNA system is more like a micro-projection, optimized to predict the results next week, not two months from now or next year.

How do you project playing time?
The one element that I don’t project and don’t ever want to is playing time. A player’s opportunities are more analog than digital. It’s manager decisions and injuries; closer confidence and position logjams. No one does playing time projections better than FanGraphs and Roster Resource, so we’re building off their projections, which can be found here.

Why do you have “Big Papi” on the x-axis of your graphs?
David Ortiz’s career wRC+ is 140. I think putting Big Papi on the axis gives everyone a good mental picture of what a 140 wRC+ looks like. Plus I love Big Papi.

Where can I download the projections for all players?
The wRC+ projections and fantasy stat lines can be found here.

Where can I find the Baseball DNA data visualizations?
I don’t have this feature built on the site at this point, but I post interesting ones I find on Bluesky and Threads. If there is a specific one you’d like to see, hit me up there.

If you have Baseball DNA for the pitcher and the batter, couldn’t you create a model that predicts the outcome of a plate appearance?
Yes. I’ve got a model that does this, but haven’t published it.

If you can predict outcomes of a single batter/pitcher matchup, then couldn’t you do that for an entire lineup and thus predict the outcome of a game?
Yes, I’ve done this too, but to answer what you’re thinking — it doesn’t beat the market.