The Next Battle in the War Over WAR

Over the past five years, WAR has caught on with the baseball-watching public in a way that no other sabermetric stat has. But as the recent debate about the MVP race between Aaron Judge and José Altuve revealed, we can’t fully move forward with the stat until we agree on what it’s meant to do.

Aaron Judge and José Altuve (Getty Images/Ringer illustration)

Saying that sports analytics as we know it would not exist without Bill James has been a cliché for so long that pointing out that it’s a cliché has itself become a cliché. But we still say it, because it’s still true. James didn’t just pioneer the field of sabermetrics; he literally invented the word “sabermetrics.” Everyone working in the field today was either influenced directly by James and The Bill James Baseball Abstract or by one of James’s disciples.

Since James joined the Red Sox’s front office 15 years ago, his public pronouncements have dwindled as the rings on his fingers have multiplied. So when he does talk, it behooves people to listen.

James talked recently, starting a conversation among analysts that was a long time coming. In an article he posted on his website, entitled “Judge and Altuve,” James discussed the respective MVP merits of the players who were almost unanimously considered the best in the American League this year. According to wins above replacement, a statistic that has, with good reason, become generally accepted as the best distillation of a player’s overall quality into a single number, the two players were almost indistinguishable. Baseball-Reference’s version of WAR had José Altuve ahead slightly, 8.3 to 8.1, while FanGraphs had Aaron Judge in the lead, 8.2 to 7.5. By either metric, the two players were within the margin of error of each other.

And James was having none of it: “Some of my friends and colleagues wish to argue that Aaron Judge is basically even with José Altuve, and might reasonably have been the Most Valuable Player. It’s nonsense. Aaron Judge was nowhere near as valuable as José Altuve. Why? Because he didn’t do nearly as much to win games for his team as Altuve did. It is NOT close.”

Aaron Judge and José Altuve during the 2017 ALCS (Photo by Mike Stobe/Getty Images)

James’s argument boiled down to one simple thing: Baseball players are in the business of creating wins, not runs. In 2017, the New York Yankees amassed a run differential of plus-198, while the Houston Astros finished the regular season at plus-196. But the two teams did not use those runs to the same effect: The Yankees, thanks in part to an 18-26 record in one-run games, went just 91-71 (against an expected record of 102-60), which is why they had to sweat out a wild-card win against the Minnesota Twins. Meanwhile, the Astros won 101 games and had the AL West essentially locked up by the Fourth of July.

As James put it (bold and italics his), “I am getting ahead of my argument in making this statement now, but it is not right to give the Yankee players credit for winning 102 games when in fact they won only 91 games. To give the Yankee players credit for winning 102 games when in fact they won only 91 games is what we would call an ‘error’. It is not a ‘choice’; it is not an ‘option’. It is an error.

“When you express Judge’s RUNS. . .his run contributions. . . when you express his runs as a number of wins, you have to adjust for the fact that there are only 91 wins there, when there should be 102.”

James concludes that after this adjustment, Judge’s WAR is really 6.8, not 8.1. Though Altuve had already won the AL MVP award by a landslide margin the night before, James’s article sparked a discussion in the sabermetric community that shows no sign of stopping. Tom Tango, MLB Advanced Media’s senior database architect of stats (his real title!), agreed with James. Longtime baseball writer Joe Posnanski built on James’s argument. FanGraphs’ Dave Cameron defended his publication’s position that WAR should remain largely independent of context. At Baseball Prospectus, Jonathan Judge was also critical of James’s approach. Even Nate Silver returned to his baseball roots long enough to tweet about it.

The most important takeaway from each of their arguments is that we’re having this discussion in the first place. It’s a discussion, quite frankly, that the analytics community should have had years ago. And that discussion should have started with a simple question: What is WAR trying to measure? What is WAR—forgive me for this—good for?

Over the last five years, WAR has caught on with the baseball-watching public in a way that no other sabermetric stat has. As Posnanski put it, “WAR has won.” It’s now on stadium scoreboards, and TV broadcasts, and in newspaper columns. We may not be that far from the day when the casual fan understands the meaning of a “five-WAR season” the way they now understand the meaning of hitting .300.

So it’s probably time for those of us who work in analytics to agree on the meaning.

WAR’s emergence as the gold standard of analytic stats was a happy accident. If it had been the result of meticulous planning, we wouldn’t have ended up with two competing entities each designing a metric with the same name that purported to add up every contribution a player made at the plate, in the field, on the bases, and on the mound. So instead of referring to “WAR” as a single product, we have bWAR (Baseball-Reference) and fWAR (FanGraphs), the Coke and Pepsi of value-over-replacement statistics. (Baseball Prospectus has its own metric, wins above replacement player, or WARP, which hasn't achieved the traction of the other two.)

In 2013, Baseball-Reference and FanGraphs had a summit of sorts, in which they agreed on the definition of “replacement”: that is to say, where to set the baseline on the quality of a replacement-level player—the kind freely available in Triple-A or on waivers. But they did not agree on a formula to calculate WAR. That lack of consensus is a blessing in some ways; for an industry frequently accused of engaging in groupthink, there’s value in having multiple methodologies for trying to arrive at the same answer. What isn’t a blessing is the lack of consensus on what WAR is supposed to do.
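
For the curious, here’s roughly what that shared baseline implies, sketched out in a few lines of Python. The .294 winning percentage is the commonly cited figure from that unification; the arithmetic below is just an illustration of what it means in practice.

```python
# What a shared replacement level implies, using the commonly cited .294
# winning percentage from the 2013 Baseball-Reference/FanGraphs unification.
GAMES_PER_TEAM = 162
TEAMS = 30
REPLACEMENT_WIN_PCT = 0.294

replacement_wins = REPLACEMENT_WIN_PCT * GAMES_PER_TEAM  # ~48 wins for a team of scrubs
league_war_pool = TEAMS * GAMES_PER_TEAM * (0.500 - REPLACEMENT_WIN_PCT)  # ~1,000 WAR to divvy up

print(f"Replacement-level team: about {replacement_wins:.0f} wins")
print(f"League-wide WAR pool: about {league_war_pool:.0f} wins above replacement per season")
```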

WAR represents the peak of an analytical pyramid whose foundations James laid starting in the 1970s. That foundation begins with the notion that a hitter’s individual accomplishments—singles, walks, home runs, etc.—combine to form runs. Runs scored and allowed then interact to form wins, and the distribution of wins leads to the top of the pyramid: playoff appearances and pennants and titles.

One of James’s early breakthroughs was a formula that accurately calculated a team’s run total based on its singles, walks, home runs, etc. This formula, which he called runs created, allowed him to calculate the contribution each individual player made to those runs, without using runs scored and runs batted in, which depend largely on the quality of the batters in front of and behind them in the lineup. Over the years other analysts came up with competing versions. When we launched Baseball Prospectus in 1996, we used Clay Davenport’s formula for “equivalent runs,” which was considerably more complicated than runs created and also correlated slightly better with actual runs scored. This was sort of the 1990s version of the WAR debate we’re having now.
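
For readers who have never seen it, here’s the basic version of runs created, with made-up numbers for a hypothetical hitter. (James published many refinements over the years; this is the simplest form.)

```python
def basic_runs_created(hits, walks, total_bases, at_bats):
    # (times on base) x (advancement) / (opportunities)
    return (hits + walks) * total_bases / (at_bats + walks)

# A hypothetical hitter: 180 hits, 60 walks, 300 total bases in 600 at-bats
print(round(basic_runs_created(180, 60, 300, 600), 1))  # ~109 runs created
```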

James’s other discovery was of the relationship between runs and wins: what is commonly referred to as the Pythagorean expectation, which is that a team’s ratio of wins to losses is roughly equal to the square of its ratio of runs scored to runs allowed. (The shorthand rule of thumb is that 10 extra runs is generally worth one extra win.)
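
In code, the relationship looks like this. The Yankees’ run totals below are approximate, and the exponent of 2 is James’s original; later research found values closer to 1.83 track actual results slightly better, but the idea is the same.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2):
    return runs_scored ** exponent / (runs_scored ** exponent + runs_allowed ** exponent)

# Roughly the 2017 Yankees: a plus-198 run differential
print(round(162 * pythagorean_win_pct(858, 660)))  # ~102 expected wins; they actually won 91

# The rule of thumb: 10 extra runs is worth about one extra win
print(round(162 * pythagorean_win_pct(760, 750) - 162 * pythagorean_win_pct(750, 750), 1))  # ~1 win
```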

So we had a way to turn hits and walks and outs on offense into runs, and a way to turn runs into wins, but we couldn’t connect everything until we finally had the data to account for defense as well. Once the analytics community could turn all of a player’s contributions into runs, and those runs into wins, the concept of wins above replacement became an inevitability.

The way WAR is calculated today is by converting the contributions of every player into runs, and then using the standard conversion rate to turn those runs into wins. What James is arguing is that the runs-to-wins conversion rate isn’t a constant; some teams are more efficient at generating wins out of their run differential than others, and we should account for that.
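
Here’s a minimal sketch of both ideas, using hypothetical numbers: the context-neutral conversion WAR uses today, and a crude version of the team-level adjustment James is arguing for. This is an illustration of the concept, not his actual calculation.

```python
RUNS_PER_WIN = 10  # the standard rule-of-thumb conversion; in reality it floats with the run environment

def context_neutral_war(runs_above_replacement):
    # How WAR works now: every run is worth the same fraction of a win
    return runs_above_replacement / RUNS_PER_WIN

def team_adjusted_war(runs_above_replacement, actual_wins, pythag_wins, replacement_wins=47.6):
    # A crude stand-in for James's adjustment: scale a player's wins by how
    # efficiently his team turned its runs into actual wins. Not his exact method.
    efficiency = (actual_wins - replacement_wins) / (pythag_wins - replacement_wins)
    return context_neutral_war(runs_above_replacement) * efficiency

# A hypothetical Judge-like season: 81 runs above replacement for a team that
# "should have" won 102 games but actually won 91
print(round(context_neutral_war(81), 1))         # 8.1
print(round(team_adjusted_war(81, 91, 102), 1))  # ~6.5
```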

This isn’t a new position for him. In 2002, when WAR was still just the name of a card game, James created win shares, a statistic that attempted to connect everything a player did on the field to actual wins. (FanGraphs first introduced WAR—then called win values—in 2008, and Baseball-Reference followed with its version in 2010.) Win shares looked like the ultimate all-in-one statistic when he introduced it, but it suffered from one fatal weakness: It didn’t account for replacement level. Because win shares ignored the opportunity cost that came with playing a bad player over someone else, a truly terrible player who nonetheless got tons of playing time—Neifi Pérez being the quintessential example of the era—could still pile up a substantial number of win shares. James probably would have fixed this weakness had the Red Sox not come calling with visions of broken curses and finger jewelry, but in his absence, WAR emerged instead. And when it did, it was based on the assumption that sometimes a team wins more or fewer games than expected simply because of luck, and that luck should have no part in the statistic.

For instance, we know from decades of analysis that while there may be exceptions on the margins, how a player or team hits with runners in scoring position has little correlation with how they hit in those situations the year before or the year after. The 2013 St. Louis Cardinals hit .330 as a team with runners in scoring position, an off-the-charts anomaly; no other team since World War II has hit higher than .311 in those situations. The 2014 Cardinals hit .254 with RISP, good for sixth in the NL. There is scant evidence that the Cardinals’ performance in 2013 was due to skill. They just got lucky.

While they may have just gotten lucky, the 2013 Cardinals nevertheless scored 77 more runs than any other team in the league, despite having a lower OPS than the Rockies and being one of the slowest teams; they were dead last in steals, and only one team grounded into more double plays. The extra runs that they were lucky to score helped the Cardinals hold off the Pittsburgh Pirates and Cincinnati Reds for the NL Central title and eventually advance to the World Series, where they lost to James’s Red Sox.

Their performance with runners in scoring position, i.e. “clustering,” may have been luck—or, if you don’t like the pejorative meaning inherent in that word, then “fortuitous and non-replicable timing”—but that performance mattered. So should we account for it? James says yes, emphatically. Others say no. And I say we need to decide what we’re trying to use WAR for before answering the question.

At its heart, I believe WAR is a backward-looking stat. It’s supposed to tell you what a player did; it’s not supposed to tell you what a player is going to do. I believe most people in the analytic community would agree with that statement. But that isn’t necessarily backed up by the way it’s calculated.

With respect to position players, most of the differences between bWAR and fWAR are trivial, but when it comes to pitchers, there’s a substantial difference. Baseball-Reference calculates a pitcher’s WAR by taking the number of runs he allowed, adjusting that number to account for the quality of his defense, ballpark, and other effects, and then comparing that number with how many runs a replacement-level pitcher would have given up in the same number of innings. Pretty straightforward.
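
Stripped of the park and defense adjustments, that logic looks something like the sketch below. The replacement-level figure of 5.5 runs per nine innings is a made-up round number for illustration, not Baseball-Reference’s actual parameter.

```python
def pitcher_war_from_runs_allowed(runs_allowed, innings, replacement_ra9=5.5, runs_per_win=10):
    # Compare (adjusted) runs allowed against a replacement pitcher in the same innings
    replacement_runs = replacement_ra9 * innings / 9
    return (replacement_runs - runs_allowed) / runs_per_win

# A hypothetical starter: 70 runs allowed in 200 innings
print(round(pitcher_war_from_runs_allowed(70, 200), 1))  # ~5.2 wins above replacement
```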

FanGraphs, by contrast, doesn’t use the number of runs a pitcher allows at all. It uses fielding-independent pitching stats, which measure the number of walks, strikeouts, infield pop-ups (which, like strikeouts, are basically automatic outs), and home runs the pitcher has allowed and estimates the number of runs he should have given up with ordinary luck. It then uses this estimated number of runs to calculate his WAR.
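
The standard FIP formula is simple enough to fit in a few lines. The constant (about 3.10) is recalibrated every season so that league-average FIP matches league-average ERA; the example pitcher below is hypothetical, and folding in the infield pop-ups mentioned above would just mean adding them to the strikeout term.

```python
def fip(home_runs, walks, hit_batters, strikeouts, innings, constant=3.10):
    return (13 * home_runs + 3 * (walks + hit_batters) - 2 * strikeouts) / innings + constant

# A hypothetical pitcher with ugly results but excellent peripherals
print(round(fip(home_runs=18, walks=45, hit_batters=5, strikeouts=210, innings=180), 2))  # ~2.90
```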

fWAR is far better at predicting a pitcher’s future value than bWAR. But WAR isn’t generally used to predict the future; it’s supposed to explain the present. A pitcher who had a 5-plus ERA but great peripherals may be an excellent bet to rebound next year, but he would be credited by fWAR as having had a good season this year, and that's a bridge too far for me. If there’s some external factor that can take the blame—a horrible defense or Coors Field, for instance—we can redistribute the runs accordingly. But I don’t believe we can make those runs disappear.

Moreover, fWAR runs on the conceit that pitchers have essentially no impact on batting average on balls in play, and there are exceptions to that rule. It’s well known that knuckleball pitchers typically have lower-than-average BABIPs, but fWAR gives them no credit for it—which is why Tim Wakefield’s career fWAR is only 27.4, compared with his bWAR of 34.5. On the other side, Javier Vázquez famously had ERAs higher than you would expect based on his peripheral numbers, with a career 4.22 ERA and a 3.91 FIP. Vázquez’s fWAR is 53.9, compared with a bWAR of 43.3. FanGraphs says Vázquez was 97 percent better than Wakefield over their careers, even though Wakefield threw 400 more innings and each pitcher posted a 105 career ERA+.

By using FIP statistics to estimate runs allowed instead of accounting for actual runs allowed, fWAR also assumes that all clustering is luck. For pitchers, that isn’t always true. The poster child for this phenomenon, whose name I will share with you only if he promises not to hurt me, is Nolan Ryan. The Ryan Express sometimes got derailed when he had to pitch from the stretch, when batters hit .221/.320/.325 against him for his career, compared with a .191/.298/.279 line with the bases empty. Baseball-Reference credits him with 83.9 bWAR, still a prodigious total and well worthy of the Hall of Fame. But FanGraphs credits him with 106.7 fWAR—23 more wins of value that rest on the presumption that he was unlucky with men on base for 27 major league seasons.

Nolan Ryan throws a pitch at Arlington Stadium (Photo by Jonathan Daniel/Getty Images)

I don’t agree with that presumption. And if I did agree with that presumption and wanted to estimate how many runs a pitcher should have given up, I would use one of the many statistics out there that does a better job of estimating runs than FIP does: There’s xFIP, and SIERA, and DRA, and WTF, and I promise I made up only the last one.

Because hitters rotate through an inning, we have to find a way to fairly allot runs scored among them. Aside from mid-inning pitching changes, pitchers don’t. We don’t need a run estimator because we have actual run totals. Maybe fWAR is a better predictive stat than bWAR, but that’s not what I use WAR for: In trying to have its cake and eat it too, fWAR succeeds only in splitting the baby and causing writers to badly mix metaphors.

Complicating things further, FanGraphs also has a version of WAR that uses actual runs allowed instead of FIP, called RA9-WAR. But (1) it’s rarely used, and (2) more versions of WAR are only going to confuse the casual baseball fan and get in the way of the statistic being adopted further by the general public. We’re not selling cereal here; we don’t need 25 different versions of WAR on the shelf. For the sake of the layman, a statistic with a minor flaw may be preferable to one that’s constantly being tinkered with. I mean, it makes no sense that batting average doesn’t count sacrifice flies while OBP does, but if I proposed a version of batting average called sfAVG, I’d be justifiably defenestrated.

But while bWAR makes the conscious choice to prioritize actual runs over estimated runs when appropriate, neither version of WAR prioritizes actual wins over estimated wins. James’s argument is that they should. And having preferred bWAR over fWAR all these years because it doesn’t confuse skill with value, I’m having difficulty coming up with a reason why I shouldn’t extend that preference for actual runs to actual wins. Maybe the Yankees should have won 102 games, but they actually won 91, and if I don’t believe that Nolan Ryan should get credit for all the runs he wouldn’t have given up if he hadn’t been unlucky, it’s not consistent for me to believe that the Yankees should get credit for all the games they would have won had they not been unlucky.

Informally I was already making such an adjustment in my head, because I thought Altuve was clearly the AL MVP in part because he performed much better in high-leverage situations than Judge or third-place finisher José Ramírez. Even while defending FanGraphs’ context-neutral position, Cameron agreed that we should consider context for award voting. The problem is that, to this point, we don’t have an agreed-upon objective method to quantify the impact that a player's performance in high-leverage situations has on his overall value. I urge the guardians of WAR to rectify this.

So I think James has a point, and I’m grateful to him for instigating a badly needed discussion. But this sentence from his article summarizes where he loses me a bit: “But if you evaluate them by the specific relationship of Altuve’s runs to the Astros wins and Judge’s runs to the Yankees wins, then Altuve moves up and Judge moves down, and a significant gap opens up between—large enough, in fact, that Judge drops out of the #2 spot, dropping behind Eric Hosmer of Kansas City.”

I say this as a Royals fan whose rotating desktop background includes half a dozen of his highlights from the 2014 and 2015 playoffs: Eric Hosmer has no business being in that sentence.

Hosmer might belong there had he put together a monster performance in high-leverage situations in 2017, but he didn’t. According to Baseball-Reference, in fact, Hosmer performed worse in what it defines as “high-leverage” plate appearances (.276/.315/.448) than in “medium-leverage” situations (.304/.394/.435), which in turn was worse than his line in “low-leverage” situations (.353/.407/.591).

So how does James have Hosmer as the no. 2 hitter in the American League by this metric? Based on how James calculates win shares, I suspect it’s because the Kansas City Royals, for the fourth year in a row, won substantially more games than the Pythagorean expectation predicted based on their runs scored and allowed: In large part because they went 25-16 in one-run games, they went 80-82 overall despite being outscored by 89 runs, which should have led to a 72-90 record.

James’s win shares approach holds that if a team wins more games than we would expect from its underlying statistics, we should apportion the extra wins to all of its players in proportion to their value. That approach might have been the best method in 2002, when we didn’t have the situational data we have now. But we can tell that the Royals won 61 percent of their one-run games because, as a team, they hit better in high-leverage situations (.758 OPS) than in medium-leverage (.738) or low-leverage (.718) situations. That wasn’t the work of Hosmer. It was the work of guys like Whit Merrifield (.351/.409/.519 in high-leverage spots) and Salvador Pérez (.315/.336/.583). If you’re going to argue that players should be rewarded not just for the runs they create, but for the wins they create, then you have to go the extra step and give that reward to the players who did the creating.
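
Here’s a toy version of that apportionment, with hypothetical player values, to show why a team-wide adjustment floats everyone upward regardless of who actually delivered in the big spots.

```python
def apportion_extra_wins(player_values, actual_wins, expected_wins):
    # Win shares-style logic: spread a team's overperformance across its players
    # in proportion to their context-neutral value
    extra = actual_wins - expected_wins
    total = sum(player_values.values())
    return {name: round(value + extra * value / total, 1) for name, value in player_values.items()}

# Hypothetical values for a Royals-like team that beat its Pythagorean record by 8 wins
values = {"Hosmer": 4.0, "Merrifield": 3.0, "Perez": 2.0, "Everyone else": 11.0}
print(apportion_extra_wins(values, actual_wins=80, expected_wins=72))
# Hosmer gains 1.6 wins of credit even if the clutch hitting was Merrifield's and Perez's
```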

Eric Hosmer hits a home run against the Arizona Diamondbacks (Photo by Brian Davidson/Getty Images)

And just as some pitchers have a real ability to control BABIP and allow fewer runs than their peripheral numbers would project, some teams have a real ability to win more games than their runs scored and runs allowed would project: specifically, teams with great bullpens. The Royals are a good example of this, because during their back-to-back pennants in 2014 and 2015, they won five more games than their Pythagorean record suggested each year. Whenever they were in a tight spot in the late innings, they could turn to Kelvin Herrera or Wade Davis or Greg Holland. WAR systems already make an adjustment for relievers, increasing their value to account for the fact that the best of them are reserved for the most important situations, giving them an outsize impact on the game.

But I think it’s time to do away with half measures, and decide fully what WAR is supposed to represent. If it’s supposed to represent value, then it needs to evolve to account for the fact that all players, not just relievers, can perform in ways that alter the relationship between runs and wins. WAR should reward a hitter who bats .400 with runners in scoring position and penalize one who hits .136 in high-leverage situations. If the day comes when we can evaluate for how a player performs defensively in high-leverage situations, we can account for that too.

Starting pitchers also can perform in ways that warp the bond between runs and wins. Consider two starters: one gives up three runs, three runs, and four runs in three successive six-inning starts, while the other allows zero runs, zero runs, and 10 runs. Both pitchers have allowed 10 runs in 18 innings, a below-average performance, but while the first pitcher’s team would have an expected winning percentage under .500 in those three games combined, the second pitcher’s team would be heavily favored in two of the three games, while punting the third game entirely. For starting pitchers, clustering their runs allowed into relatively few starts gives their team a better chance of winning in the long run. A stat that looked at starting pitchers on a start-by-start basis would be a more accurate representation of their contribution to winning than the way we do it now, which is to aggregate all of their runs and all of their innings and throw them into a blender.
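
Here’s a crude way to put numbers on that intuition, applying the Pythagorean expectation game by game. The run-support and bullpen figures are invented round numbers, and this is not how any published stat actually works; the point is only that the same 10 runs, distributed differently, produce very different expected win totals.

```python
def game_win_prob(runs_scored, runs_allowed, exponent=2):
    return runs_scored ** exponent / (runs_scored ** exponent + runs_allowed ** exponent)

def expected_wins(starter_runs_by_start, run_support=4.5, bullpen_runs=1.5):
    # Assume average run support and an average bullpen finishing each six-inning start
    return sum(game_win_prob(run_support, ra + bullpen_runs) for ra in starter_runs_by_start)

steady = [3, 3, 4]       # 10 runs in 18 innings, spread evenly
clustered = [0, 0, 10]   # the same 10 runs, bunched into one disaster
print(round(expected_wins(steady), 2))     # ~1.40 of 3: under .500
print(round(expected_wins(clustered), 2))  # ~1.93 of 3: comfortably over
```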

And the thing is, that stat existed nearly 20 years ago! Developed by Michael Wolverton, “support-neutral win-loss” accounted for the game-to-game variation in performance by starting pitchers to arrive at what was, in my opinion, a more accurate assessment of their value than just looking at their seasonal totals of runs and innings pitched. Wolverton retired from writing about baseball more than a decade ago, and SNWL long ago got scrubbed from the internet, but the Wayback Machine brings up this SNWL leaderboard from all the way back in 1998, which included a stat called SNWAR—support-neutral wins above replacement. If we could account for a pitcher's value start by start instead of just season by season back then, we can account for it now.

We have the tools to build a better WAR, a stat that credits players for the contributions they make toward actual wins on the field, even if those contributions may not be predictive of their contributions in the future. We need to continue the conversation about how to make that happen. There is a danger in going too far with this approach and apportioning value to a player based exclusively on how each of his plate appearances influences his team’s chance of winning, such that he gets no credit at all for a home run hit with his team ahead 10-0, while a scorching line drive that’s caught with the bases loaded and his team down a run in the ninth inning puts him in a statistical hole that would take weeks to recover from.

We already have a stat that measures player contributions this way. It’s called win probability added, and it functions exactly the way its name suggests. According to The Baseball Gauge, four of the 10 best players in baseball in 2017 were closers, which is an interesting illustration of the impact that high-leverage situations have on WPA, but isn't the answer to the question we're asking. And you can take this approach even further and evaluate a player based not on wins, but on championships, which is of course the whole point of everything. The problem there is that some players, by pure happenstance, will come to the plate with the opportunity to change the course of history with one swing. By championship probability added, the MVP of the National League in 1992 was Francisco Cabrera, who came to the plate 14 times all season … but in one of those hit a bases-loaded single to win the pennant, with the Braves down to their final out in Game 7 of the NLCS.
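
The bookkeeping behind WPA is straightforward: credit or debit each player with the change in his team’s win expectancy over each of his plate appearances. The win-expectancy numbers below are invented stand-ins for the historical lookup tables the real stat uses, but they illustrate the concern raised above.

```python
# Hypothetical win-expectancy swings for two plate appearances
plate_appearances = [
    {"event": "HR while up 10-0 in the 6th",             "we_before": 0.998, "we_after": 0.999},
    {"event": "bases-loaded lineout, down 1 in the 9th", "we_before": 0.350, "we_after": 0.050},
]

wpa = sum(pa["we_after"] - pa["we_before"] for pa in plate_appearances)
print(round(wpa, 3))  # -0.299: the meaningless homer can't dig out of the ninth-inning out
```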

So you can absolutely take this approach too far. But having given this issue a lot of thought over the last 10 days, I agree with James that right now we’re not taking it far enough. We can do a better job of matching wins above replacement with actual wins. We can acknowledge that what constitutes value today is not necessarily an endorsement of value tomorrow. And we can use two different stats to measure two different things.

We can, and should, have a “predictive” version of WAR that evaluates a player’s performance based on skills that will carry forward into the future. This would strip away “clutch” and situational hitting that doesn’t carry over much from one year to the next, along with luck on batted balls in play, and as our data set improves it would also account for Statcast data like launch angle and exit velocity, so that the player who hit a ton of at-’em balls or the pitcher who gave up a lot of windswept home runs into the first row would have a statistic that says, look, this guy might have sucked last year, but if a butterfly had flapped its wings last April he would have been really good. As Nate Silver suggested, maybe we can call it predictive WAR, or pWAR.

But we also need the WAR we have now to answer the question it’s supposed to answer: How good was Player X in Season Y? And we need that to happen in a way that’s more clearly tied to wins on the field. Silver suggests we call this stat descriptive WAR, or dWAR; I prefer value WAR, or vWAR. Or hell, just call it by both names. Because if there’s one thing we can all agree on, it’s that we don’t have enough WARs already.