How many times have you seen it written or heard somebody say:

“Wins is a useless way to evaluate a starting pitcher”

We have heard it so many times, that we have started wondering why we still track Wins for pitchers.

There are certainly reasons to believe this. How many times have we seen a pitcher give up less than 3 runs in 8+ innings and not win the game? Or how often do we see a great pitcher on a bad team win 12 games or fewer?

But does this mean Wins is a completely useless statistic? Over time, shouldn’t a good pitcher win more games than a bad pitcher, regardless of other factors?

To answer this question, we looked at every pitcher over the last four seasons (2006-2009) with at least 600 innings pitched (150 IP/season). We then removed anybody who had more than 10% of their appearances in relief. We ended up with a list of 51 pitchers. We tallied up their wins (as a starting pitcher) in those four seasons and compared them to their ERA+*.

What we see is a very clear trend. As a pitcher’s ERA+ goes up (bigger values are better, 100 is average), their win total goes up. Are there exceptions? Of course. Every statistic has exceptions. But even in the face of contradictions, we still see a decent correlation (r-squared = 0.51**).

Of course, a pitcher’s win total will be affected by the number of starts they make. So, instead of wins, let’s see if ERA+ can be used to predict a pitcher’s win percentage, and vice versa.

Now we see an even stronger correlation (r-squared=0.54) indicating that wins is actually a very good indicator of how good a pitcher is. Quite simply, better pitchers win more games.

The problem with Wins as an evaluator of starting pitchers is not that it is a useless statistic. It is simply a matter of sample size. In a single game, a win or no win is not a good indicator. Why? Small sample size (n=1). However, ERA, for example, is a per-inning stat. So in a single game, a pitcher’s ERA will have 5-9 data points (n>>1). Over the course of a full season, stats like ERA+, FIP and tRA have a sample size of 150-220 for each pitcher.

Can we use Wins to evaluate a pitcher over the course of one season? Maybe. We are talking about 28-33 starts. That is still a small sample size considering the number of factors that are involved. But we can be relatively certain that an 18-game winner is better than a 5-game winner (with similar number of starts). The other variables should be less of a factor in that case. However, when comparing two pitchers with a similar number of wins, those other factors (team defense, scoring, ballpark, etc.) become much more important.

So should we use Wins when voting for the All-Star teams or the Cy Young Award? Probably not. Stats like ERA+, FIP and tRA are still better measures of how good a pitcher is (although we have minor quibbles with each). However, that does not mean Wins is a useless category. Over the course of several seasons or even a career we should be able to get a decent idea of how good a starting pitcher has been based on how often they win games.

TOMORROW: We will take a look at the individual pitchers that do deviate from the trend and have either been very lucky or very unlucky. You might be surprised how small the list is.

*Notes on the above post are found after the jump*

*ERA+ is ERA that is adjusted for the league a player pitches in and their home ballpark. 100 is average. Anything above or below 100 is in relation to league average. So if a pitcher has an ERA+ of 120, they are 20% better than league average.
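
As a rough illustration of the idea behind ERA+, here is a minimal Python sketch. It assumes the simplified form `100 * league ERA / pitcher ERA`, with an optional multiplier standing in for the ballpark adjustment. The function name `era_plus`, the `park_factor` parameter, and the sample ERAs are all made up for illustration; the published stat may handle the park adjustment differently.

```python
def era_plus(era: float, league_era: float, park_factor: float = 1.0) -> float:
    """Simplified ERA+ sketch: 100 * (league ERA * park factor) / pitcher ERA.

    park_factor is a hypothetical stand-in for the ballpark adjustment
    (1.0 means a neutral park); it is an assumption, not the official method.
    """
    if era <= 0:
        raise ValueError("ERA must be positive")
    return 100.0 * league_era * park_factor / era


# Made-up example: a 3.75 ERA in a league that averages 4.50, neutral park.
print(round(era_plus(3.75, 4.50)))  # 120
```

A pitcher whose ERA matches the league average comes out at exactly 100 in this sketch, and lower ERAs push the number above 100.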

** R-squared is a measure of how closely two variables are related. It is the square of the correlation coefficient (r), which is measured on a scale of -1.0 to 1.0, so r-squared itself runs from 0 to 1.0. If r-squared is 1.0 (all points fall on the line), knowing the value of one variable will tell us exactly what the other variable will be. A positive correlation coefficient says that as one variable goes up, the other variable goes up. A negative coefficient indicates that as one variable increases, the other will decrease. A value of zero indicates no correlation between the two variables.
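
For readers who want to see how the correlation coefficient and r-squared are computed, here is a minimal Python sketch. The ERA+ and win-percentage lists are invented purely for illustration and are not the 51 pitchers from the post; `pearson_r` is just a name chosen for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented numbers, NOT the data behind the post.
era_plus_values = [85, 95, 100, 110, 125, 140]
win_pct = [0.40, 0.47, 0.45, 0.55, 0.58, 0.62]

r = pearson_r(era_plus_values, win_pct)
r_squared = r ** 2  # always between 0 and 1, even when r is negative
print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")
```

The sign lives in `r`; squaring it throws the sign away, which is why r-squared cannot be negative.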

## 54 Comments

Why did you look at correlation with ERA+ instead of FIP or tRA? Of course ERA+ is going to be correlated with wins.

Why?

I think the post is asking whether they are correlated, and finds they are. If ERA (or ERA +) were far better predictors than Wins, that would suggest these two indicators would not be well-correlated. There would be a lot of unlucky low ERA pitchers with few wins, and lucky, high-ERA pitchers with many wins. The Professor’s post is merely noting that over time and over the entire population of MLB starters, the “luck” fact washes out and ERA and wins work in tandem.

First of all, why would Wins automatically be correlated to ERA+ and not FIP and tRA? Wins is not part of ERA+ formula.

Second, ERA+ data for multiple seasons combined is available. The others are not (that I am aware of). And while ERA+ is far from a perfect stat, it is my opinion that FIP and tRA are only slightly better. It is my feeling that both stats are far too strict in what they eliminate from the equation. For example, I feel good pitchers DO have an impact on what happens with balls in play. Take Mo Rivera. That guy breaks more bats than anybody. Is somebody going to tell me that those weak groundballs have just as much a chance of being hits as balls hit off Chad Bradford?

To finish that thought. I would LOVE to see FIP or tRA correlated with wins. So if somebody wants to crunch the numbers, I will be happy to post it. I would be surprised if there wasn’t still a positive correlation. How positive? I don’t know.

I agree that both FIP and tRA are likely to be correlated with wins, but using ERA+ you are saying a guy who gives up less runs is more likely to get wins. Isn’t this obvious? At least with FIP and tRA you eliminate some of the luck that goes with ERA.

Either way wins are pretty much a useless stat. It’s not that it is an inherently useless stat, but with others that are obviously better it makes no sense to look at a pitcher’s W-L record for evaluation purposes. I think this is what people mean when they write wins are a useless stat. It is slight hyperbole, but with today’s stats no one should be using wins as a predictive or evaluative tool.

Well, i guess my feeling is that the luck aspect of ERA (especially ERA+) evens out over time. maybe not completely. but certainly to a degree.

I agree with this.

PA, I agree with you and I agree with Prof. I think part of the problem is that so many use the hyperbole that a lot of casual fans and even a lot of baseball writers that don’t know the new stats very well, are starting to dismiss wins as a “bad stat.”

my feeling is, if I am working in a front office, I am not putting a lot of weight on wins. but if i am sitting in a bar with my buddies, i think it is perfectly ok to say Greg Maddux was a better pitcher than Nolan Ryan because he won more games. Sure I could dig deeper and give other stats that support this notion (they do). But wins is simple. It resonates well with everybody and most of the time it is right. and under those circumstances, it is good enough.

0.54 isn’t a strong correlation coefficient. At all.

According to who? It all depends on what you are looking at. If I am looking at stock market indicators and I see two variables have a 0.05 r-squared, that might be HUGE. Also consider something simple like height and weight. Taller people weigh more. Obviously there is some variation, but how much? 0.70 r-squared. That is accepted as very strong.

If we were measuring something in a lab with other variables controlled, then we might want .90 before saying something is significant. But in baseball, there are so many variables that anything with a 0.54 correlation is strong.

And then consider the filters I used. Notice that there are many more data points greater than 100 ERA+ even though 100 is average. When I included only 600ip SP, I eliminated a lot of bad SP, pitchers that aren’t good enough to make 28 starts a year. If I had somehow included that data, most likely the correlation goes up even more.

P.W. Hjort is correct. A 0.5 correlation means that the wins or winning percentage fluctuates randomly. There is no predictive ability whatsoever.

A pitcher is a good pitcher if he prevents runs. On a normal team, that results in wins. But a good pitcher with a bad offense may not win many games, and a poor pitcher with a great offense may win a lot.

Wins are a team stat, but almost all team stats are in one way or another attributed to an individual. Wins and losses happen to go to pitchers. It’s book keeping.

“A 0.5 correlation means that the wins or winning percentage fluctuates randomly. There is no predictive ability whatsoever.”

that is not even close to being true. R-squared is derived from the correlation coefficient which is on a scale of -1 to 1. A ZERO means there is no predictive ability. ANYTHING different from zero has at least some predictive ability. Again, there are people in the stock market that have made a lot of money with correlation factors that are very small.

The only issue is sample size. If I only have 5 data points and a r-squared of .9, that means a lot less than if I have 10,000 data points and an r-squared of .10. The latter has some predictive ability.

This is correct. 0.5 is a pretty strong positive correlation. 0 would be no correlation.

I always thought “statistical significance” required at least a .10 Rsquare, not .5. The question that is asked in statistical analysis is, “what are the chances that we’d see this outcome if the relationship between these two variables was completely random?” A .5 percent r-squared means there’s a 50% chance of seeing these results, which may not be random but it’s not a convincing correlation, either. At .10 it’s safer to say that these relationships are unlikely to be random.

At least, that’s what I recall some 20 years after my last stats course.

Well, this is tricky. Testing statistical significance for a correlation really depends on what you are asking and what you are looking for. The problem, is that there are degrees of correlations. For example, are we looking to see if A causes B? In that case we are looking for a value of 1.0 or very close. But we might also be looking to see if A influences B at all. In that case, in theory any value other than zero is significant. Because now we are just seeing HOW MUCH A influences B or vice versa. Really, significance occurs in correlation when IF you add more data, the correlation factor doesn’t change.

In this case, if I had 1,000 data points and my value was still 0.54, then we can be reasonably certain that 0.54 IS the degree of correlation between the two factors.

That is a long-winded way of saying it is very difficult to measure how significant a correlation is. As long as you have a significant amount of data, ANYTHING above zero is a positive correlation. Less data means a bigger margin of error. So if you have a value of .05 but your error is .1, then you can’t say for certain there is a correlation. More data decreases the margin of error.
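
One common way to put a number on that margin of error is a confidence interval from the Fisher z-transformation. The Python sketch below is only an illustration under the usual assumptions of that method (roughly bivariate-normal data, independent points); it is not something run for the post, and `correlation_ci` is a name made up for this example. It uses r = sqrt(0.54) from the win-percentage chart and compares n = 51 with a hypothetical n = 1,000.

```python
import math


def correlation_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate 95% confidence interval for a correlation coefficient.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 data points")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r = math.sqrt(0.54)  # correlation coefficient implied by r-squared = 0.54
for n in (51, 1000):
    low, high = correlation_ci(r, n)
    print(f"n = {n}: r is roughly between {low:.2f} and {high:.2f}")
```

The interval for 1,000 points is much narrower than the one for 51, which is the "more data, less error" point in numeric form.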

Well, this:

“that is not even close to being true. R-squared is on a scale of -1 to 1. An r-squared of ZERO means there is no predictive ability. ANYTHING different from zero has at least some predictive ability. Again, there are people in the stock market that have made a lot of money with correlation factors that are very small.

The only issue is sample size. If I only have 5 data points and a r-squared of .9, that means a lot less than if I have 10,000 data points and an r-squared of .10. The latter has some predictive ability.”

Is mostly right, but you need to distinguish between predictive ability and correlation. Correlation doesn’t imply causation. Because the derivative of correlation also fluctuates according to chance. For example, instances of radiation sickness over the past 50 years are far more common among people from Ukraine, so radiation sickness and Ukrainian residence correlate rather strongly. Does that mean we can predict Ukrainians will be afflicted by radiation sickness more in the coming years? Of course not. It has nothing to do with your country of residence, but your proximity to a nuclear meltdown.

R-squared fluctuates from 0 to 1, not -1 to 1.

yes, thank you. I misspoke. I was referring to the correlation coefficient. obviously r-squared cannot be negative.

Sounds like some people have way too much time on their hands!

Give me a pitcher with 15+ wins…you can have Kazmir with 10…

then you can call him “unluck”… as I win the game!

Thank you very much!

Has Don ever said anything intelligent before?

I’ll take Zack Greinke, you take any other AL pitcher, and I’ll probably beat you next year. Using wins implies I have to take the Kansas City offense along with Zack Greinke, and, well, the Kansas City offense is part of Greinke’s pitching ability.

The problem to me is that comparing Wins to ERA+ seems arbitrary, because there are those that will argue that ERA (and thus ERA+) is mainly a team stat because it really refers to team runs allowed and may not be that reflective of a pitcher’s skill. But of course “better” pitchers tend to have better ERA+es, much like better pitchers tend to have more wins. In the end, all you’re really saying is “good pitchers will tend to have more wins than bad pitchers,” which is just intuitive.

“all you’re really saying is “good pitchers will tend to have more wins than bad pitchers,” which is just intuitive.”

actually, in a sense, that is all I am trying to say. people have been knocking wins so much in recent years, that many are starting to act like wins tells you nothing about how good a pitcher is. and to me that is just silly. wins, to a degree, DO tell you how good a pitcher is.

If somebody says “well so-and-so has won 48 games the past 3 seasons, he is a good pitcher,” there is a contingent out there that will laugh that person down saying that wins means nothing.

when did this site get all sabermetricky?

I liked Kevin’s point how ERA can be a team stat. I think pitching wins have some talent evaluating ability if it is consistent but that takes a while.

Did you get the data from BB-Ref?

As much as I love getting my hands dirty with stats, I do try to avoid being too mathy. Chalk it up to a slow news day. And yes, numbers came from the Play Index

I could do it manually from bb ref since I don’t have PI. It seems like it is a very powerful tool.

just rename the stat, pitcher who gave up less runs than his offense scored and pitched at least 5 innings, and my quibbles with the stat will be non-existent.

“And left the game with the lead, with the other team never tying the game.”

There seems to be a lot of confusion amongst you guys around significance and correlation. R-squared tests if there is a correlation, but The Professor didn’t provide any results as to statistical significance testing between these two values, didn’t state what p-value he tested against to determine how significant his data would be, and didn’t provide the alpha score against that level of significance. I highly doubt this analysis passes a p-value test.

Actually, a better way of putting this is here: http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

as somebody that has taught a university-level stats class, I am quite familiar with the difference between correlation and causation. In this case, i could care less if A causes B or B causes A. My only concern is that they are correlated. If one goes up, does the other go up, go down or is it not related at all? the point is just that if ERA+ is a decent (not great) indicator of how good a pitcher is, then Wins should be also (assuming sufficient sample size).

Fair enough.

well, I didnt want this to turn into a stats class. but if it makes you sleep better tonight…

if my null Ho is that there is no correlation between the two, then n=51, df=49, r^2=0.54, r=0.73 gives a t-value of 7.59 and a p-value < .0001 in both a one-tailed and two-tailed model. so it passes just fine.
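
For anyone who wants to check that arithmetic, here is a small Python sketch of the standard t-statistic for a correlation coefficient, t = r * sqrt(df / (1 - r^2)) with df = n - 2. The function name is made up for this example, and it stops at the t-value rather than computing a p-value, to avoid depending on an extra library.

```python
import math


def correlation_t_stat(r: float, n: int):
    """t-statistic and degrees of freedom for testing H0: no correlation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    if df < 1:
        raise ValueError("need at least 3 data points")
    t = r * math.sqrt(df / (1.0 - r * r))
    return t, df


r = math.sqrt(0.54)  # from r-squared = 0.54
t, df = correlation_t_stat(r, 51)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

This comes out at about 7.6, in line with the figure quoted above (the small difference is rounding). With 49 degrees of freedom, the two-tailed 5% cutoff is only around 2.0, so a t-value this large is far beyond it.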

What I have found with win loss records for starting pitching is that the closer the sum of the wins and losses are to a pitcher’s total number of games, the more instrumental the pitcher is in either a good or bad way to the outcome of the game. I will try to clarify this.

If you have a starting pitcher who has pitched 20 or so games, and his win loss record is 2-3, clearly something can be said about this player from his win loss record. Especially if you can use this data in combination with other data.

A win loss record may not tell you the whole story, but few stats do. Combine win loss record with a comparison to the total number of games, plus some other stats, and it can help back up a good argument.

(see my response to Devon)

Again, this is trying to explain why a few might not follow the pattern. That is fine. But it doesn’t mean the pattern doesn’t exist for pitchers overall.

How do you explain Zack Greinke’s ERA+ in his no decisions this year? It’s huge, but that doesn’t have any relation to his W’s.

I can’t. But that it is not the point of this post. I am looking at a general trend (the best use of statistics) and you are asking me to explain one very specific incident. In this case I am looking at pitchers in general and looking for a pattern. You are looking at one point on the chart and wondering why that one pitcher doesn’t fall EXACTLY on the line. Look at the chart again. How many of the pitchers fall exactly on the line? Very few. Some are close to the line. And some are not close to the line. But when taken as a whole we see that there is a trend. The dots tend to cluster from the bottom left to the upper right. Every stat in the history of mankind is going to have exceptions and outliers. Why? Lots of reasons. Random variation, outside factors.

Take height and weight. Generally taller people weigh more than shorter people. It is proven. But now you are asking why one short guy weighs 400 lbs. It may be a legitimate question for that one guy, but it doesn’t prove that short people weigh more than tall people.

You can’t just write that off as an outlier. No matter what pitcher you talk about, there will never be a strong correlation between W’s & ERA+ in their no decisions. You’re not factoring in run support, which has a stronger correlation with W’s than anything….without run support, no pitcher can win a game even if he doesn’t allow any runs. So the trend would be that run support correlates with pitcher wins, if you factor everything in.

No doubt. Run support is probably the biggest reason for the game-to-game variation. I am more concerned with the trend over time. And the data suggests that lucky and unlucky run support evens out for most pitchers. And in the majority of games there is “normal” run support and that is when the pitcher’s ability makes the biggest difference.

“There are certainly reasons to believe this. How many times have we seen a pitcher give up less than 3 runs in 8+ innings and not win the game? Or how often do we see a great pitcher on a bad team win 12 games or fewer?”

Practically every fricking game and every fricking season.

Sincerely,

Giants fan

Touché

What does Wins tell you that you can get from another stat? Does it bring anything to the table?

18 wins vs. 5 wins makes things obvious. How about 20 wins vs. 14 wins? Obviously better? Not obviously? I’d agree that the 20 win player is *probably* the better performance, but how would I know for sure? I’d check other stats, like IP, ERA, tRA, etc. Why not just start with those stats since I need them anyway?

Exactly

My point is only that at a basic level, wins is fine. It is easy, it is accessible by everybody and most of the time it paints a good picture. Since we are on a stats binge and I know you are very good with numbers, I would compare wins to the chi-squared test. The chi-squared test is far from perfect, but that doesn’t mean it is not useful. It is easy, it gives a good sense of what is going on. But if I want to really get an accurate representation, I will probably want to dig deeper and use a more robust statistical measure. But that doesn’t mean we have to jump straight to the complicated stuff right away. And that doesn’t mean the chi-squared test is wrong or automatically inaccurate.

I know this isn’t what you’re saying, but it sounds a lot like “Wins are fine except when they aren’t.”

Let’s talk about some uses of Wins and see where they might be useful…

- Signing free agents

- Handing out awards

Others?

Did I have a comment deleted? Or did I fail to submit it somehow?

Nothing has been deleted and I don’t see anything in the spam folder (occasionally comments are marked spam for no apparent reason)

Thanks for replying. My computer skills must be deteriorating.

Professor, I think we all would have guessed that there was a correlation between ERA+ and wins, but for most of us, it won’t change our opinion that wins are a misleading way to differentiate between two pitchers.

I’d be curious if you could run a similar analysis and find level of correlation between Wins/Losses and Run Support. If the correlation is similar, to me that may point out just how correlated to wins a pitcher’s run support is….and I think that even the most ardent support of W/L would admit that run support is out of the hand of a pitcher. IF RS is just as correlated to Wins, then what are Wins really saying about a pitcher?

I’m not sure of an easy way to compile that data, but I am sure there is a correlation between wins and run support. I would also guess there is a small correlation between wins and defense. the real question is how much does each (talent, run support, defense) contribute to a pitcher’s win total. at this point, I just wanted to see that talent is a major component and that wins are not completely random or a product of luck. i know it seems intuitive but many act like it is not. And besides, it is one thing to say it. I just wanted to see that it was indeed true.

You’re forgetting run support. In 2008, looking at the 88 qualified pitchers, the correlation between W-L% and ERA was -0.65. The correlation between W-L% and run support was 0.55. I also ran a regression on those 88 pitchers and found that one run of ERA was worth 8 points on W-L%, while one run of support was worth 7 points on W-L%.

Not forgetting run support. this has been discussed above. also, I prefer to look at wins/GS as opposed to W/L%

I have done a VERY preliminary look at the correlation between wins and FIP with a much larger dataset using SP from last 100 years. Data needs to be cleaned up a bit but right now I am seeing a correlation of ~0.45.

Still trying to figure out best way to approach data without bias. Need to eliminate Wins as a reliever. And some pitchers in early 20th century had some absurd win%, winning 70% of their starts. Maybe only pitchers since introduction of 4-man rotations?

All of this affects the correlation, but I feel confident it will still fall in the 0.4-0.5 range. Is that good or bad or very good? I’ll let you decide.

Will try to have this data published early next week.