THE BOOK--Playing The Percentages In Baseball

Monday, November 24, 2008

wOBA year-by-year calculations

By .(JavaScript must be enabled to view this email address), 08:26 PM

Here are the full specs for calculating wOBA for each season, based on the database from the Baseball Databank:


=======
Step 1
=======

Create a VIEW (or QUERY) named LeagueRunsPerOut based on this SQL:

SELECT
Pitching.yearID
, Sum([R])/Sum([IPouts]) AS RperOut
, Sum(Pitching.R) AS totR
, Sum(Pitching.IPouts) AS totOuts

FROM
PrimPos
INNER JOIN
Pitching
ON PrimPos.yearID = Pitching.yearID
AND PrimPos.playerID = Pitching.playerID

WHERE PrimPos.PosPrim = "P"

GROUP BY Pitching.yearID
;

The purpose here is simply to create a run environment for each season.  I exclude the pitching numbers of all nonpitchers.
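For readers who are not on Access, the same aggregation can be sketched with Python's built-in sqlite3 module. The table and column names follow the post; the three sample rows are made up for illustration, and the SQL is written in standard syntax rather than the Access dialect, so treat it as a rough equivalent rather than the original query.

```python
import sqlite3

# In-memory database with tiny, made-up rows (not real Databank data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Pitching (playerID TEXT, yearID INTEGER, R INTEGER, IPouts INTEGER);
CREATE TABLE PrimPos  (playerID TEXT, yearID INTEGER, PosPrim TEXT);
INSERT INTO Pitching VALUES ('pitchA', 2008, 80, 600),
                            ('pitchB', 2008, 40, 300),
                            ('fieldC', 2008, 5, 3);
INSERT INTO PrimPos  VALUES ('pitchA', 2008, 'P'),
                            ('pitchB', 2008, 'P'),
                            ('fieldC', 2008, 'SS');
""")

# Same logic as Step 1: only players whose primary position is pitcher count,
# so the shortstop's mop-up inning is left out of the run environment.
LEAGUE_RUNS_PER_OUT = """
SELECT Pitching.yearID,
       1.0 * SUM(Pitching.R) / SUM(Pitching.IPouts) AS RperOut,
       SUM(Pitching.R)      AS totR,
       SUM(Pitching.IPouts) AS totOuts
FROM PrimPos
INNER JOIN Pitching
        ON PrimPos.yearID = Pitching.yearID
       AND PrimPos.playerID = Pitching.playerID
WHERE PrimPos.PosPrim = 'P'
GROUP BY Pitching.yearID
"""

rows = con.execute(LEAGUE_RUNS_PER_OUT).fetchall()
print(rows)
```

The `1.0 *` factor forces floating-point division, which SQLite would otherwise perform as integer division on integer columns.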


=======
Step 2
=======

Create a VIEW (or QUERY) named RunValues based on this SQL:

SELECT
Batting.yearID
, RperOut
, [RperOut]+0.14 AS runBB
, [runBB]+0.025 AS runHB
, [runBB]+0.155 AS run1B
, [run1B]+0.3 AS run2B
, [run2B]+0.27 AS run3B
, 1.4 AS runHR
, 0.2 AS runSB
, 2*[RperOut]+0.075 AS runCS

FROM
LeagueRunsPerOut

INNER JOIN
(
Batting
INNER JOIN
PrimPos
ON (Batting.yearID = PrimPos.yearID)
AND (Batting.playerID = PrimPos.playerID)
)
ON LeagueRunsPerOut.yearID = Batting.yearID

WHERE PrimPos.PosPrim <> "P"

GROUP BY
Batting.yearID
, RperOut

;

(The “FROM” clause can be rewritten more clearly, but then it won’t work in Access.)

I set the run value of the walk as +.14 runs above the value of runs per out.  While it is not necessarily exactly that all the time, it’s basically that for various run environments in MLB over the last fifty years.  You can see the evidence here:
http://www.insidethebook.com/ee/index.php/site/article/linear_weights_by_run_environment/

In each run environment, the difference between the run value of the walk and runs per out (or RperI divided by 3) is between .134 and .143.  Close enough for us.

The other batting run values work similarly.  They are further double-checked here:
http://www.insidethebook.com/ee/index.php/site/comments/actual_wins_retrosheet_years/#4

The run value of the SB is fixed at .20, and the CS is set with a bit of a fudge, but works fairly well.

So, that’s how we get the Linear Weights values for each event, for each season.
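As a quick illustration of the Step 2 arithmetic, here is a minimal Python sketch that turns a runs-per-out figure into the run value of each event. The function name and the 0.18 input are illustrative choices, not part of the original queries; the offsets are the ones in the SQL above.

```python
def run_values(r_per_out):
    """Linear Weights run values for one season, using the Step 2 offsets."""
    run_bb = r_per_out + 0.14
    run_1b = run_bb + 0.155
    run_2b = run_1b + 0.30
    return {
        "BB": run_bb,
        "HB": run_bb + 0.025,
        "1B": run_1b,
        "2B": run_2b,
        "3B": run_2b + 0.27,
        "HR": 1.4,                       # fixed in every season
        "SB": 0.2,                       # fixed in every season
        "CS": 2 * r_per_out + 0.075,     # the "fudge" for caught stealing
    }


values = run_values(0.18)  # 0.18 is an illustrative runs-per-out figure
for event, runs in values.items():
    print(f"{event}: {runs:.3f}")
```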

You could do the same thing with BaseRuns.  I chose not to, only for simplicity’s sake.  You could try to do it yourself.

=======
Step 3
=======

Create a VIEW (or QUERY) named RunValues2 based on this SQL:

SELECT
RunValues.yearID
, RunValues.RperOut
, RunValues.runBB
, RunValues.runHB
, RunValues.run1B
, RunValues.run2B
, RunValues.run3B
, RunValues.runHR
, RunValues.runSB
, RunValues.runCS
, Sum([runBB]*([BB]-nz([ibb]))+[runHB]*nz([HBP])+[run1B]*([H]-[2b]-[3b]-[HR])+[run2B]*[2b]+[run3B]*[3b]+1.4*[HR]+[runSB]*nz([SB])-[runCS]*nz([CS]))/Sum([ab]-[h]+nz([SF])) AS runMinus
, Sum([runBB]*([BB]-nz([ibb]))+[runHB]*nz([HBP])+[run1B]*([H]-[2b]-[3b]-[HR])+[run2B]*[2b]+[run3B]*[3b]+1.4*[HR]+[runSB]*nz([SB])-[runCS]*nz([CS]))/Sum([BB]-nz([IBB])+nz([HBP])+[H]) AS runPlus
, Sum([BB]-nz([IBB])+nz([HBP])+[H])/Sum([AB]+[BB]-nz([IBB])+nz([HBP])+nz([SF])) AS wOBA
, 1/([runPlus]+[runMinus]) AS wOBAscale
, ([runBB]+[runMinus])*[wOBAscale] AS wobaBB
, ([runHB]+[runMinus])*[wOBAscale] AS wobaHB
, ([run1B]+[runMinus])*[wOBAscale] AS woba1B
, ([run2B]+[runMinus])*[wOBAscale] AS woba2B
, ([run3B]+[runMinus])*[wOBAscale] AS woba3B
, ([runHR]+[runMinus])*[wOBAscale] AS wobaHR
, [runSB]*[wOBAscale] AS wobaSB
, [runCS]*[wOBAscale] AS wobaCS

FROM
RunValues
INNER JOIN
(
Batting
INNER JOIN
PrimPos
ON Batting.playerID = PrimPos.playerID
AND Batting.yearID = PrimPos.yearID
)
ON RunValues.yearID = Batting.yearID

GROUP BY
RunValues.yearID
, RunValues.RperOut
, RunValues.runBB
, RunValues.runHB
, RunValues.run1B
, RunValues.run2B
, RunValues.run3B
, RunValues.runHR
, RunValues.runSB
, RunValues.runCS

ORDER BY
RunValues.yearID DESC
;


I could have merged these last two VIEWS, but that’s not important.

Notes:
- runMinus sets the run value for the missing events, which is AB minus H plus SF; if we had reached base on error, then we’d update these last two views accordingly; it is basically the run value of the batting out

- runPlus determines the average run value of the safe batting events (walks, hit batters, hits)

- there is a wOBA calculation, which you will see is actually an OBP calculation; they are interchangeable at the league level

- the wOBAscale is the multiplier that we will be applying to get the run values into a wOBA scale; it also lets you convert from wOBA to runs per PA (while The Book says to use 1.15, you actually use whatever this value is for the season in question)

- for all the batting events, we take the run value of each event, add in the run value of the outs, and then multiply by the wOBAscale factor; play around with why and how I am doing this to see if this makes sense to you; if this makes total sense to you, then make a short post for your fellow readers; otherwise, I’ll have to make a long boring post to that effect

- for the running events, we only apply the multiplier; once you understand the previous point, you will understand the reason for this point
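The notes above can be checked numerically. The following Python sketch mirrors the Step 3 calculations on a made-up league line (the totals and the 0.18 runs-per-out behind the run values are assumptions for illustration only). It computes runMinus, runPlus, the wOBAscale and the wOBA weights, then confirms that the league wOBA built from those weights lands on the league OBP.

```python
# Hypothetical league totals for one season (made-up numbers, not Databank data).
league = {"AB": 5000, "H": 1300, "2B": 260, "3B": 30, "HR": 150,
          "BB": 520, "IBB": 40, "HBP": 50, "SF": 40, "SB": 100, "CS": 40}

# Run values as Step 2 would produce them for an illustrative RperOut of 0.18.
run = {"BB": 0.32, "HB": 0.345, "1B": 0.475, "2B": 0.775, "3B": 1.045,
       "HR": 1.4, "SB": 0.2, "CS": 0.435}


def batting_counts(line):
    """Counts of the safe batting events that carry a wOBA weight."""
    return {
        "BB": line["BB"] - line["IBB"],
        "HB": line["HBP"],
        "1B": line["H"] - line["2B"] - line["3B"] - line["HR"],
        "2B": line["2B"],
        "3B": line["3B"],
        "HR": line["HR"],
    }


counts = batting_counts(league)
outs = league["AB"] - league["H"] + league["SF"]   # the "missing" events
safe = sum(counts.values())                        # walks, hit batters, hits

# Linear Weights runs from the listed events; CS is subtracted.
positive_runs = (sum(run[e] * n for e, n in counts.items())
                 + run["SB"] * league["SB"] - run["CS"] * league["CS"])

run_minus = positive_runs / outs
run_plus = positive_runs / safe
woba_scale = 1 / (run_plus + run_minus)

# Batting events: add the value of the out, then rescale.
# Running events: rescale only.
weights = {e: (run[e] + run_minus) * woba_scale for e in counts}
weights["SB"] = run["SB"] * woba_scale
weights["CS"] = run["CS"] * woba_scale

league_obp = safe / (outs + safe)
league_woba = (sum(weights[e] * n for e, n in counts.items())
               + weights["SB"] * league["SB"]
               - weights["CS"] * league["CS"]) / (outs + safe)

print(f"wOBAscale={woba_scale:.3f}  OBP={league_obp:.4f}  wOBA={league_woba:.4f}")
```

Adding runMinus to each batting event moves the out from a negative value to zero, which is why the running events, which do not use up a plate appearance, only get the multiplier.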

***

Since 1956, the weighted average is:
wobaBB 0.71
wobaHB 0.74
woba1B 0.90
woba2B 1.28
woba3B 1.63
wobaHR 2.10
wobaSB 0.25
wobaCS 0.51

Remember, this is what is used for the Baseball Databank.  If you use different events, like Reaching base on error for example, things will change a bit.  It also depends whether you do, or do not, want to include SB/CS.  In The Book, we were almost always interested only in the batter/pitcher matchup, and so, the SB/CS numbers would not make sense.  Here is the full output of the above:
http://tangotiger.net/bdb/lwts_woba_for_bdb.txt

Anyway, I hope this makes the entire LWTS ~ wOBA relationship clearer.  If not as clear-as-crystal, more than clear-as-mud.

#1    devil_fingers      (see all posts) 2008/11/25 (Tue) @ 05:31

Wow… that’s really cool. Thanks for sharing the results of a lot of hard work. Very generous. I can’t wait to try it myself when I get some time.


#2    Tangotiger      (see all posts) 2008/11/25 (Tue) @ 06:01

Fangraphs will be implementing this, so, that’s really cool…


#3    terpsfan101      (see all posts) 2008/11/25 (Tue) @ 09:51

Tango,

I have one very small complaint. R/O should be calculated based on the out categories you are using. Just like you would calculate R/PA based on the Plate Appearance events that were included in a LW equation. In this case, R/O would be:

R/O = Runs / (AB - H + SF + CS)


#4    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/11/25 (Tue) @ 10:52

Why when I try to enter this into mySQL it says table bdb.primpos does not exist? where do I get this table?


#5    Colin Wyers      (see all posts) 2008/11/25 (Tue) @ 10:59

Tango’s instructions are for his Access database shell. I can try to make a MySQL port of the code for that table later tonight when I get home. In the interim I don’t think it’d be a huge difference if you simply left that part out and selected straight from the pitching table (or at that point, you probably shave a sixteenth of a second if you use the Teams table instead).


#6    terpsfan101      (see all posts) 2008/11/25 (Tue) @ 11:17

gopher,

The table you are looking for is here:

http://tangotiger.net/bdb/


#7    Aiden      (see all posts) 2008/11/25 (Tue) @ 12:00

Hi all, after messing with it for an hour, I finally got this to work in MySQL (thanks again Colin). I can post it if you don’t want to do the work. I just created a separate database and used tables not views. (If that is sound or not I have no idea)

Thanks Tango.


#8    Aiden      (see all posts) 2008/11/25 (Tue) @ 12:02

Oh and FYI I used Colin’s primary position query, since I couldn’t manage to adapt Tom’s, so the table only goes from 1973-present (when the appearances table starts).


#9    devil_fingers      (see all posts) 2008/11/25 (Tue) @ 13:09

Just let me make sure I understand the advantage of this as opposed to the simple wOBA formula given elsewhere on this site: does this query generate a wOBA based on “custom” linear weights specific to each season?


#10    terpsfan101      (see all posts) 2008/11/25 (Tue) @ 13:39

Yes, this query does generate a wOBA based on custom Linear Weights. The HR is fixed at 1.4 runs and the SB is fixed at .20 runs. Here are the weights that wOBA is using for 1996 (5+ RPG) and 1968 (3.5 RPG):

1996, 1968
0.48, 0.42 1b
0.78, 0.72 2b
1.05, 0.99 3b
1.40, 1.40 hr
0.33, 0.27 bb-ibb
-.28, -.20 ab-h
0.20, 0.20 sb
-.45, -.33 cs


#11    Colin Wyers      (see all posts) 2008/11/25 (Tue) @ 14:32

Tom, I don’t get what’s going on with the INNER JOIN in step 2. It strikes me as being about as useful as an appendix. Am I missing something?


#12    Tangotiger      (see all posts) 2008/11/25 (Tue) @ 19:48

The PrimPos table is a file as noted in link 6.  I’m not sure why you couldn’t get it to work: it’s a pure import.

Views or tables will give you the same result.  It’s simply a difference of whether you want to regenerate the table each time there is a change in the underlying table.

Colin, there are 2 inner joins: which one are you referring to?

***

As for the resulting new wOBA equations, those are also posted as I said in the main thread.  You will see things BARELY change since 1955 (except for the HR).  Prior to that, things do change.


#13    Tangotiger      (see all posts) 2008/11/25 (Tue) @ 20:03

Terps, are you referring to RperOut?  If so, that calculation is correct, since I need to get the actual run environment.

Or are you referring to runMinus?


#14    Tangotiger      (see all posts) 2008/11/25 (Tue) @ 20:06

The one thing that Access lets you do that the others don’t is like this:

, [runBB]+0.155 AS run1B
, [run1B]+0.3 AS run2B

Referencing fields of the current select.  So, if you do this in other DBMS, you need to expand all those out.

And, the entire thing (the three steps) can be collapsed into one big SELECT.


#15    Colin Wyers      (see all posts) 2008/11/26 (Wed) @ 00:13

Got the Access shell set up so that we’re talking about the same things and I confirmed it. This query seems to do the same thing as your query in step 2:

SELECT
yearID
, RperOut
, [RperOut]+0.14 AS runBB
, [runBB]+0.025 AS runHB
, [runBB]+0.155 AS run1B
, [run1B]+0.3 AS run2B
, [run2B]+0.27 AS run3B
, 1.4 AS runHR
, 0.2 AS runSB
, 2*[RperOut]+0.075 AS runCS

FROM
LeagueRunsPerOut;

I’m working on converting to MySQL - COALESCE is the command that replaces nz. I’m using MySQL aliases for some things, but they don’t work with tables using GROUP BY clauses. I’m either going to break out into a subquery or just make two tables for the third step.


#16    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 00:33

Yes, of course, how stupid of me.

Originally, I had Step 2 and Step 3 as one query.  I broke them up.  Obviously, you are completely right here.


#17    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 00:45

If you run this, you will get an IDENTICAL match between LWTS runs the traditional way and wOBA-based:
SELECT
  Batting.playerID
  , Batting.yearID
  , [AB]+[bb]-nz([ibb])+nz([hbp])+nz([SF]) AS PA1
  , ([BB]-nz([IBB]))*[runBB]
      +nz([HBP])*[runHB]
      +([H]-[2b]-[3b]-[hr])*[run1B]
      +[2b]*[run2b]
      +[3b]*[run3b]
      +[hr]*[runHR]
      +nz([sb])*[runSB]
      +nz([CS])*[runCS]
      -([AB]-[H]+nz([SF]))*[runMinus] AS LWTS
  , (
      ([BB]-nz([IBB]))*[wobaBB]
      +nz([HBP])*[wobaHB]
      +([H]-[2b]-[3b]-[hr])*[woba1B]
      +[2b]*[woba2b]
      +[3b]*[woba3b]
      +[hr]*[wobaHR]
      +nz([sb])*[wobaSB]
      +nz([CS])*[wobaCS]
  ) / [PA1] AS pWOBA
  , [PA1]*([pWOBA]-[wOBA])/[wOBAscale] AS wOBA_runs

FROM
  Batting
  INNER JOIN
  RunValues2
      ON Batting.yearID = RunValues2.yearID;


I called this query/view/table LWTS_WOBA

***

Note: you will see an undefined for those players with zero PA1.  What I normally do is add .000001 PA to those cases.

***

Now that we see that LWTS and wOBA are the exact same thing, the only thing left is the treatment of IBB and SH.

Instead of using PA1, from above, I use PA, which includes IBB and SH.  What this means is that each player’s IBB and SH is EXACTLY equal to his personal LWTS per PA1.

I only showed you PA1 to show you that LWTS and wOBA are the same thing.  Now, in order to give credit to players actually stepping up to the plate, and actually generating runs with their IBB and SH, you use PA that includes those two terms.
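To see the identity on a single player line, here is a small Python sketch. Everything in it is hypothetical: the run values, the runMinus and runPlus figures standing in for one season of RunValues2, and the player's stats. CS is subtracted in both formulas, which is the sign the bug report later in this thread calls for.

```python
# All inputs are hypothetical. run_minus and run_plus stand in for one
# season's Step 3 output; the player line is made up.
run = {"BB": 0.32, "HB": 0.345, "1B": 0.475, "2B": 0.775, "3B": 1.045,
       "HR": 1.4, "SB": 0.2, "CS": 0.435}
run_minus, run_plus = 0.274, 0.560

woba_scale = 1 / (run_plus + run_minus)
# League wOBA equals league OBP, and with the Step 3 definitions OBP works
# out to runMinus * wOBAscale (runPlus and runMinus share one numerator).
league_woba = run_minus * woba_scale

player = {"AB": 550, "H": 165, "2B": 35, "3B": 3, "HR": 25,
          "BB": 70, "IBB": 8, "HBP": 5, "SF": 5, "SB": 10, "CS": 4}

counts = {
    "BB": player["BB"] - player["IBB"],
    "HB": player["HBP"],
    "1B": player["H"] - player["2B"] - player["3B"] - player["HR"],
    "2B": player["2B"],
    "3B": player["3B"],
    "HR": player["HR"],
}
outs = player["AB"] - player["H"] + player["SF"]
pa1 = sum(counts.values()) + outs   # AB + BB - IBB + HBP + SF

running = run["SB"] * player["SB"] - run["CS"] * player["CS"]

# Traditional Linear Weights: positive events minus the cost of the outs.
lwts = sum(run[e] * n for e, n in counts.items()) + running - run_minus * outs

# wOBA route: build the player's wOBA, compare to the league, rescale to runs.
p_woba = (sum((run[e] + run_minus) * woba_scale * n for e, n in counts.items())
          + running * woba_scale) / pa1
woba_runs = pa1 * (p_woba - league_woba) / woba_scale

print(f"LWTS={lwts:.3f}  wOBA_runs={woba_runs:.3f}")
```

Swapping `pa1` for a PA count that includes IBB and SH in the last multiplication is the change described in the next comment.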


#18    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 00:49

I should be clearer.  ONLY here:
, [PA1]*([pWOBA]-[wOBA])/[wOBAscale] AS wOBA_runs

Do I change PA1 to PA.


#19    Colin Wyers      (see all posts) 2008/11/26 (Wed) @ 01:30

I haven’t reimplemented Tango’s primary position code, so this could be a little off. (I may just import the CSV file at some point and save myself some trouble). Ported to MySQL:

http://basql.wikidot.com/woba


#20    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 03:23

Tango,

Ignore my comments about calculating R/O. You are only using it to determine the run environment, so it is OK to calculate R/O like you did.

Things change before 1955 because we don’t have IBB. Retrosheet has complete IBB data for 1954.


#21    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 04:12

The next thing I’d like to do is to convert the wOBA-based Linear Weights Runs into Total Runs (i.e., Runs Created).

I have three options, and I’d like to hear which one you guys prefer. Apply a static runs per PA to each player at the
1. year level
2. league level
3. team level

If we go by 1., then two guys, both average, both with 600 PA, will have a “Runs Created” of 72, even if one league scores .115 runs per PA and the other scores .125.  It’s good for player-player comparison, but not good in terms of “reconciliation”.

If we go by 2., well, we get the opposite of the above.  We get good reconciliation, but not good for player-player comps within the same year.

If we go by 3, we are doing the reconciliation at the team level.  This is kinda good, since we know how many runs actually were created and we are trying to assign those runs to the players.  The bad part is of course if a team gets a massive shortfall of runs relative to the component stats, every player gets nicked, because we won’t know exactly who was responsible for the shortfall.  (This is the criticism against Win Shares.)

Then again, I could just do three different RC calculations, and let the reader choose whichever one he wants.


#22    Colin Wyers      (see all posts) 2008/11/26 (Wed) @ 04:26

I think it’s best to reconcile the linear weights to the environment they’re figured for. In this case, the weights are derived from the year run environment, not the league or team run environment. It would probably be trivial to modify the scripts to do LWTS by league/team, however.


#23    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 04:34

I do not like the first option, because the DH inflates the run environment in the AL post-1973. I also don’t like 1, because the NL is missing CS data from 1926-1950. I wish that your wOBA query did not include basestealing when there is no CS data. The success rate for SB was less than 60% pre-1950, so the run impact of basestealing is probably almost non-existent. Even in the deadball era, the success rate was only 58% for the 5 league-seasons in which we have CS data. So if there is no CS, then wOBA should ignore SB.

I think the league level is the best choice. However, I’m not really sure what the correct method is when you compare players from different leagues. What adjustments do you need to make, if you compare Pujols to Youkilis in 2008?

Reconciling on the team level is a half-baked “value-method”. Although, runs really are created on the team level.


#24    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 04:41

Good way to look at it.  Ok, so here’s RunsPerPA

SELECT Batting.yearID, Sum([R])/Sum([AB]+[BB]+nz([HBP])+nz([SF])+nz([SH])) AS RperPA
FROM PrimPos INNER JOIN Batting ON (PrimPos.playerID = Batting.playerID) AND (PrimPos.yearID = Batting.yearID)
WHERE (((PrimPos.PosPrim)<>"P"))
GROUP BY Batting.yearID;

***

You can join this table to what I have in Step 2.

***

In my post of 17, add:

, [AB]+[bb]+nz([hbp])+nz([SF])+nz([SH]) as PA
, [PA]*([RperPA]+([pWOBA]-[wOBA])/[wOBAscale]) AS wOBA_RC

***

This RC includes the implied runs of not making an out, and so, you CANNOT report this figure as RC per out.

I need a good name for this “runs created”.  We shouldn’t use James’ name.


#25    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 04:44

Please note that in everything I do here, I remove pitchers’ batting.

***

As for SB/CS, it is very easy to fix that: since I provided the complete SQL, just remove the SB and CS terms, and whatever other terms you want removed.  Problem solved.  Criticisms of that nature should be reserved for those who provide black boxes, not for what I am doing here!

***

Right, as for the team-level, you have to ask what we’re measuring.  Can the sum of the team’s parts be said to “create” 800 runs when they actually scored 750 or 850?  Their components are saying it “should have created, all other things equal” 800 runs, regardless as to how many they actually did score.


#26    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 04:47

Oh, I also am not counting league or park differences.  That’s an adjustment for other people to make.


#27    dkappelman      (see all posts) 2008/11/26 (Wed) @ 04:48

I’ll put in my vote for #1.  It’s the simplest and I’d venture that player comparisons are probably the top use of baseball stats.


#28    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 04:48

I do like the term “runs created.” How about “runs produced”?

Tango, you need to figure R/PA based on the PA events that are included in wOBA. If you don’t do this, the RC figures will not reconcile to runs scored. You can’t count SH as a PA since they aren’t included in wOBA.


#29    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 04:53

Yes, option 1 is fine if we are not considering pitchers’ hitting.

What is a “black box”?

Tango, I wasn’t criticizing your work. I was just recommending that you ignore SB, when there is no CS data.


#30    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 05:02

A “black box” is a program whose engineering you can’t see or understand, and you can only evaluate based on the sample inputs and outputs.

***

I don’t mind the criticism.  I found this odd: “I wish that your wOBA query did not include basestealing when there is no CS data. “, since the query is easily modifiable.  Your wish is instantly granted by simply removing the term.

Pete Palmer does something I wouldn’t mind doing, and that is estimating CS.  It’s easy enough to do, if you consider there is a relationship between success rate and SB attempts.

For example, say that SB/(.8*1b+.6*(bb-ibb+hbp)) is how you figure out the attempt rate.  Say that the average is 10%, and the top stealers are in the 60% range (just for illustration purposes only).

You can make the CSperOpp as a function of that.  Say, something like:
CSperOpp = 0.2*SBperOpp + 0.03

So, if you steal 10% of the time, you’ll be caught 5% of the time.  If you steal 60% of the time, you’ll be caught 15% of the time.  If you’ve never stolen a base, you were caught 3% of the time.

Something along those lines.  Just a matter of best-fitting, and probably finding a function on a year-by-year basis.


#31    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 05:14

Tango, I don’t want to sound like a broken record, but you shouldn’t include SH as PA’s when you calculate R/PA. The RC totals will not add-up if you include SH as PA’s.

Robert Bofors is working on estimating CS data for the NL 1926-1950. He is also modifying his estimates for 1894-1919, and adding additional estimates for 1886-1893. He has been keeping me up-to-date on his progress. Robert is using a method similar to what Tango just described. He is basing the CS estimates on the number of SB per-time-on-base.

When Robert finishes his CS estimates, I am going to estimate ROE’s pre-1954, based on the error rate, batter handedness, and speed scores. The ROE’s are pro-rated from Retrosheet totals from 1954-1973. From 1974 onwards I will use the complete ROE data.


#32    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 05:15

I took the career totals of all players since 1955.  Here’s the SQL:
SELECT Batting.playerID
, Sum(0.8*([H]-[2b]-[3b]-[hr])+0.6*([bb]-nz([ibb])+nz([hbp]))) AS Opp
, Sum([SB])/[Opp] AS SBperOpp
, Sum([cs])/[opp] AS CSperOpp
FROM Batting
WHERE (((Batting.yearID)>=1955))
GROUP BY Batting.playerID
HAVING (((Sum(0.8*([H]-[2b]-[3b]-[hr])+0.6*([bb]-nz([ibb])+nz([hbp]))))>=1000));

I have 440 players with at least 1000 SB opps.  Anyway, here are the numbers:

simple mean is 10.4% SBperOpp, 4.7% CSperOpp

Best-fit equation, at r=.87, is:
CSperOpp = .25364*SBperOpp + .0206

The top 20 career base stealers had an average of .3845 SB per Opp, .1125 CS per Opp, and an estimate using the above of .1180 CS per Opp.

The bottom 100 base stealers had a SB per Opp of .0180, CS per Opp of .0194, and an estimate of CS per Opp of .0251.

So, we see that the estimate is biased at both the top and bottom ends, slightly at .0055 CS per Opp, or about 0.5 CS per season.  I can live with that kind of bias.
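The opportunity formula and the best-fit line can be written as two small Python functions. This is only a sketch of the arithmetic in this comment: the function names are made up, and the coefficients are the ones quoted above.

```python
def sb_opportunities(singles, bb, ibb, hbp):
    """Stolen-base opportunities, weighted as in the SQL above."""
    return 0.8 * singles + 0.6 * (bb - ibb + hbp)


def estimated_cs_per_opp(sb_per_opp, slope=0.25364, intercept=0.0206):
    """Best-fit CS per opportunity as a linear function of SB per opportunity."""
    return slope * sb_per_opp + intercept


groups = [
    ("top 20 base stealers", 0.3845, 0.1125),
    ("bottom 100 base stealers", 0.0180, 0.0194),
]
for label, sb_rate, actual_cs_rate in groups:
    estimate = estimated_cs_per_opp(sb_rate)
    print(f"{label}: actual {actual_cs_rate:.4f}, "
          f"estimate {estimate:.4f}, bias {estimate - actual_cs_rate:+.4f}")
```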


#33    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 05:19

Yes, you need to add the IBB and SH.  I’m not sure you are appreciating the distinction I am making, and how I am using, PA1 and PA.  Or at least, I am not explaining myself well.

If, for example, Barry Bonds has 600 PA of which 200 of them are IBB, and in the 400 PA he is +100 runs, the league creates .12 runs per PA (using IBB and SH) or .125 runs per PA (excluding IBB and SH), then how many runs did Bonds “create”?

My answer is that it’s .12 times 600 plus 100, or 172 runs.  Is your answer .125 times 400 plus 100 or 150 runs?
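The two candidate answers differ only in which PA count and which league rate supply the baseline. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def runs_created(pa, runs_above_average, league_runs_per_pa):
    """League-average baseline for every PA, plus the player's runs above average."""
    return league_runs_per_pa * pa + runs_above_average


# Baseline over every plate appearance, IBB and SH included.
all_pa = runs_created(600, 100, 0.12)
# Baseline over only the plate appearances that feed the linear weights.
pa1_only = runs_created(400, 100, 0.125)

print(round(all_pa), round(pa1_only))
```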


#34    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 05:43

As for estimating CS, taking a look at it, I would estimate that 50% of catcher’s assists in the NL between 1926 and 1950 were CS.  This is in line with the AL numbers, and with the years that follow.

My question, which I’ll answer when I access my Retro data, but if someone knows right now: in 1991 NL, there were 809 CS and 1137 catcher assists (i.e., a difference of 328).  In 2007, there were 506 CS and 1246 catcher assists (gap of 740).

Exactly what were the catchers assisting on in 2007 that they weren’t in 1991? Dropped third strikes?  Run downs?


#35    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 05:53

This is a set of LW that I was using for a Baseruns equation. They are reconciled to zero for the time period they correspond to (1993-2007).

0.486 1B
0.787 2B
1.081 3B
1.410 HR
0.324 BB
-.285 AB-H
0.173 SB
-.445 CS

The average R/PA for the PA events in my Linear Weights formula is .128

When I add .128 to the marginal run-values for PA events, I get the exact number of runs that scored, 334771 runs.

If I count categories that aren’t in my LW equation as PA, I get .125 R/PA. When I add .125 to my marginal values for PA events, I am short 8500 runs created.


#36    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 05:59

Tango, Robert and I were recently discussing the topic of catcher’s assists and their relation with CS. We came to the conclusion that CS totals have very little correlation with catcher’s assists.

I recommended that Robert only compare his final CS estimates to catcher’s assists to see if any of the estimates were out of sorts.

One of the reasons that he is re-working his estimates from 1894-1919, is that his CS estimates were way too high from 1894-1897, even after pro-rating SB totals to account for the fact that SB were sometimes awarded for taking an extra-base. His CS estimates from 1894-1897 were 98% of catcher’s assists, which was way too high. I recommended that he lower the CS estimates from 1894-1897, so that they were approximately 70% of catcher’s assists.


#37    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 06:57

Tango, your method in #33 appears like it would add up to the total number of runs scored. Your method also appears to be more accurate than mine. I would have only credited Bonds with creating 150 runs, whereas you have credited him with 172 runs.


#38    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 07:56

Yes, Tango’s method does add up. I verified it on the example I provided in #35.

It seems that a lot of the disagreements between Tango and me are philosophical differences. I would include IBB in wOBA, so Bonds would probably end up with around 180 RC, and 110 RAA, instead of 100 RAA. I would also figure R/PA based on the PA events that were included in my LW equation. Tango chooses to use all PA’s, to figure R/PA. Both methods would add-up to the total number of runs scored. I think my method is more logical, he thinks his method is more logical.

Another issue where we disagreed was including the run-value of RBOE’s under AB-H-SO+SF when devising a LW formula based on the official statistics. Tango looks at AB-H-SO+SF and thinks “Batting Outs”. I think “all categories that fall under AB-H-SO+SF”. My rationale for including RBOE’s under batting outs is that we are over-stating outs since RBOE’s fall under AB-H-SO+SF. Therefore, we should compensate for this by including the run value of RBOE’s under AB-H-SO+SF.


#39    Tangotiger      (see all posts) 2008/11/26 (Wed) @ 08:46

Tango looks at AB-H-SO+SF and thinks “Batting Outs”. I think “all categories that fall under AB-H-SO+SF”.

I think like you do (especially considering that I am one of the few that is pro-RBOE).

And, since I use my “runMinus” term to account for “all other batting events”, you and I are in agreement.  Instead of calling it runMinus, I could have called it runRest.

So, there’s no issue on this front.


#40    terpsfan101      (see all posts) 2008/11/26 (Wed) @ 08:58

Thanks for clearing everything up.


#41    john      (see all posts) 2008/11/27 (Thu) @ 06:41

Why do we keep the HR value fixed at 1.40?


#42    Tangotiger      (see all posts) 2008/11/27 (Thu) @ 06:44

Because in run environments from the last 50 years, the run value of the HR has remained at 1.40, more or less.

There’s also a good technical reason for this, which I can dig up for you if you like.


#43    john      (see all posts) 2008/11/27 (Thu) @ 06:59

Thanks tango

And yeah I’d be interested if you could dig that up what the technical reason would be.  Looking at the chart at the beginning, the run values change for 1B, 2B, 3B, etc (altho very slightly it seems)....why would the HR remain constant over the years?


#44    terpsfan101      (see all posts) 2008/11/27 (Thu) @ 07:44

Take a look at this thread:

http://www.insidethebook.com/ee/index.php/site/article/linear_weights_by_run_environment/

When Tango grouped Tom Ruane’s Linear Weights by runs-per-inning, the value of the HR in the lowest bin (.419 R/Inn) was 1.402. In the highest bin (.581 R/Inn) it was 1.404.

The HR almost always has a value around 1.4 runs.

I agree with you that this doesn’t match what Baseruns, Tango’s Markov, and even what common-sense tells us about the value of the HR. Baseruns and the Markov will show the HR slowly rising in value up until 10+ runs per game.


#45    terpsfan101      (see all posts) 2008/11/27 (Thu) @ 08:55

First, let’s translate Runs/Innings into Runs/Game. On average there are 53.7 outs/game:

.419 Runs/Inning is approximately 3.75 Runs/G
.581 Runs/Inning is approximately 5.20 Runs/G
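The conversion can be reproduced in a couple of lines of Python. One assumption is needed to match the figures above: the 53.7 outs per game counts both teams, so each team bats for 53.7 / 2 / 3 innings. The function name is illustrative.

```python
OUTS_PER_GAME = 53.7  # assumed to count both teams' outs combined


def runs_per_game(runs_per_inning, outs_per_game=OUTS_PER_GAME):
    """Convert one team's runs per inning into that team's runs per game."""
    innings_per_team = outs_per_game / 2 / 3
    return runs_per_inning * innings_per_team


for rate in (0.419, 0.581):
    print(f"{rate:.3f} R/Inn -> {runs_per_game(rate):.2f} R/G")
```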

I bet that the reason why the HR has the same value in a 3.75 RPG context as it does in a 5.2 RPG context, is that teams are scoring runs less efficiently in the 3.75 RPG bin. This might explain why the “moving-over” value of the HR is approximately the same in both contexts. The “getting-on” value of the HR is always 1 run. So it looks like teams in the 5.2 run context were more efficient at scoring baserunners, since the moving-over value of the HR did not increase compared to the 3.75 RPG context.


#46    john      (see all posts) 2008/11/27 (Thu) @ 09:17

I had another question

I’m working in MySQL.  I used Colin’s SQL statement to make the primpos table and I got a few differences from Tango’s data on top.  For instance in 2006 I’m getting the RperOut to be .185 instead of the .182 in the chart above.


#47    Tangotiger      (see all posts) 2008/11/27 (Thu) @ 10:58

The run value of the HR is the getting on plus the moving over. 

The getting on is exactly 1, so that leaves us with calculating the moving over.

The average runner on base has a .30 chance of scoring.  So, if he scores, that means that the HR adds +.70 runs.  There are roughly .60 runners on base.  So, .7*.6= .42.  That’s the moving over value of the HR.  1+.42 = 1.42 runs.

Now, what if it’s a great pitcher that’s on the mound?  Well, in those cases, only 25% of the runners score.  This means that the HR adds +.75 runs per runner on base.  However, because they are great pitchers, they have fewer runners on base.  Let’s say they have 0.5 runners per PA.  .75*.5 = .375.  That’s the moving over value.  The HR therefore is worth 1.375 runs.

Go the other way, and you have a terrible pitcher.  Those guys will allow 35% of runners to score, meaning the HR adds .65 runs per runner.  Because they are terrible, they allow a lot of runners on base.  Let’s say that’s .65 runners per PA.  .65*.65 = .42.  The HR is worth 1.42 runs.

Go to the extreme: a pathetic pitcher will allow 50% of runners to score, meaning each HR adds .50 runs.  And, he has say 1 runner on base per PA.  1*.5=.5, and the HR is worth 1.50 runs.

Or the opposite, where an unbelievably great pitcher will allow 15% of runners to score (HR worth +.85 runs per runner) and only have say .40 runners on base.  .85*.4 = .34, and so a HR is worth 1.34 runs.

As you can see, the run value of the HR doesn’t move much.  And in the league settings we are talking about, it barely changes.  And so, it’s easier to keep the HR run value fixed at 1.40 and move on.
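The five scenarios above all follow the same formula: one run for getting on, plus the share of each runner's run that the HR adds, times the number of runners on base. A short Python sketch (the function and scenario labels are illustrative; the rates are the ones quoted in this comment):

```python
def hr_run_value(p_runner_scores, runners_on_base):
    """Getting-on value (always 1) plus the moving-over value of a home run."""
    moving_over = (1 - p_runner_scores) * runners_on_base
    return 1 + moving_over


scenarios = [
    ("average", 0.30, 0.60),
    ("great pitcher", 0.25, 0.50),
    ("terrible pitcher", 0.35, 0.65),
    ("pathetic pitcher", 0.50, 1.00),
    ("unbelievably great pitcher", 0.15, 0.40),
]
for label, p_score, runners in scenarios:
    print(f"{label}: {hr_run_value(p_score, runners):.3f} runs")
```

The two inputs pull in opposite directions: a higher scoring rate shrinks what the HR adds per runner, while more runners on base raises how often it applies, which is why the product stays close to .40.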


#48    john      (see all posts) 2008/11/27 (Thu) @ 11:24

Thanks tango.


#49    terpsfan101      (see all posts) 2008/11/27 (Thu) @ 14:32

My explanation in #45 is terrible. Instead of saying that teams in the lowest-scoring bin “score runs less efficiently”, I should have said that baserunners have a lesser chance of scoring in the lowest run-scoring bin. This would reduce the getting-on-base value for all events, except for the HR, which always has a getting-on-base value of one run.

I bet that the moving-over value of the triple is just as stable as that of the homerun. However, in a lower run-environment the tripler has a lower chance of scoring. This means that the total run-value of the triple isn’t as stable as the homerun.


#50    terpsfan101      (see all posts) 2008/11/29 (Sat) @ 22:15

I’m not sure why John in #46 got different numbers for R/O. Tango’s and Colin’s SQL look the same as far as the calculation of R/O is concerned.

I missed Tango’s description in #17 about PA1 and PA. I wouldn’t have been so argumentative with him had I read it. I wish he would have told me this a month ago. I was not aware that you could just multiply R/PA by plate appearances and add that number to the marginal runs estimate.


#51    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 13:31

bumping for Peter Jensen…


#52    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 13:33

Peter, you want especially posts 17/18, in addition to the main thread.


#53    Peter Jensen      (see all posts) 2008/12/02 (Tue) @ 19:55

Tango Thanks. I think I see where I went wrong.


#54    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 23:14

Cool.  The more discerning people the better.


#55    Tangotiger      (see all posts) 2008/12/10 (Wed) @ 23:30

Bug report: In my SELECT, I was giving a positive value for CS.  That’s obviously incorrect.  Flip the sign.

***

The other thing is that while you WILL sum to zero at the league level if you use PA1 (i.e., exclude IBB and SH by treating them as nonexistent), you will NOT sum to zero if you use PA.

The reason is that the plus hitters get a bigger benefit from the extra IBB than the bad hitters cost the league with the extra SH.  It’s not that big a deal, less than 100 runs for the league.  Just a wrinkle to deal with (by adding around 0.3 runs per 700 PA or so, depending on the year in question) or accept.


#56    terpsfan101      (see all posts) 2008/12/11 (Thu) @ 13:18

Here are park adjustments for wOBA. They are the square root of the park adjustment, where the park adjustment is the Corrected PF + 1 divided by 2. If you convert the park-adjusted wOBA into runs, you will see that this method yields approximately the same results as park adjusting Linear Weights and Runs Created.

To park-adjust wOBA, all you do is divide wOBA by the wOBA park adjustments listed in the spreadsheet.

wOBA+ = wOBA / wOBA Park Adjustment

If you want to get the actual park-adjustment, all you have to do is square the wOBA park adjustment.

I won’t say much here about how the Park Factors are calculated as I plan on going into that in detail when I update my park factor spreadsheet. Here is a summary of how they were calculated:

From 1954-2008, the PF’s are based on runs per plate appearance. Prior to 1954 they are based on runs per out. I will try to explain this as clearly as I can. The two sets are calculated separately and are not mixed together in any way. The R/PA PF’s starting from 1954 don’t use any data prior to 1954, since I could only use that data as runs per out. The R/O PF’s, however, are allowed to use data past 1953, since I have the actual number of home and away outs for these seasons.

For each team-season I assigned a PF version. I assigned a new PF version when a team changed parks, significantly altered their stadium, or played a significant number of games in 2 stadiums, like the Expos did in 2003 and 2004. Interleague games were removed. All seasons are weighted equally when making the initial PF calculation. They are further adjusted from there based on the single-season ratio of home and away plate appearances or outs. Finally, they were regressed using MGL’s equations.

I recommend that you export these in .xls format. You will then get the un-rounded values.

http://spreadsheets.google.com/ccc?key=pzy9IhjJPqasHR76VwzPCQw&hl=en


#57    terpsfan101      (see all posts) 2008/12/11 (Thu) @ 15:57

Actually, the 5/3 root is a perfect match for wOBA. The square-root is a simple approximation. However, when converting the wOBA to runs, you are often a couple of runs off for extreme parks when using the square root. Taking the 5/3 root of the park adjustment solves this problem. 99% of the time, the wOBA+ runs will match the park-adjusted linear weights, +/- 1 run.

So converting the wOBA park adjustment to the park adjustment from which it was derived now looks like this:

wOBA_PF^(3/5)

Yes, a bit of simplicity is lost by abandoning the square root. However, you gain a ton of accuracy by using the 5/3 root.

I updated the spreadsheet with the wOBA park adjustments that use the 5/3 root of the park adjustment.


#58    terpsfan101      (see all posts) 2008/12/11 (Thu) @ 16:48

To avoid any confusion, the wOBA park factor was the Park Adjustment raised to the 3/5 power:

wOBA_PF = Park Adjustment^(3/5)

and the conversion of the wOBA park adjustment to the park adjustment from which it was derived is:

Park Adjustment = wOBA_PF^(5/3)

I flipped the numerator and denominator of the exponent in my last post. Perhaps someone with a better math-mind than me (nearly everybody who posts on this blog), could tell me if x^3/5 is the 5/3 root or the 3/5 root?

In any case, this method for park-adjusting wOBA is very accurate. If you don’t believe me, try it yourself by comparing park-adjusted wOBA runs to park-adjusted runs above average or park-adjusted runs created.
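
On the exponent question: the n-th root of x is x^(1/n), so raising to the 3/5 power is the same as taking the 5/3 root. Here is a small Python sketch of the conversions described in #56 to #58. It assumes the park adjustment is (PF + 1) / 2 as stated in #56; the function names are illustrative only.

```python
def park_adjustment(park_factor):
    # Home-only park factor -> adjustment covering home and road games,
    # using the "(PF + 1) divided by 2" rule from #56.
    return (park_factor + 1.0) / 2.0


def woba_pf_from_padj(padj):
    # wOBA_PF = Park Adjustment^(3/5), i.e. the 5/3 root of the adjustment.
    return padj ** (3.0 / 5.0)


def padj_from_woba_pf(woba_pf):
    # Inverse conversion: Park Adjustment = wOBA_PF^(5/3).
    return woba_pf ** (5.0 / 3.0)


def park_adjusted_woba(woba, woba_pf):
    # wOBA+ = wOBA / wOBA Park Adjustment
    return woba / woba_pf


# Hypothetical park factor of 1.20 (not a value taken from the spreadsheet).
padj = park_adjustment(1.20)
w_pf = woba_pf_from_padj(padj)
print(round(padj, 3), round(w_pf, 4), round(padj_from_woba_pf(w_pf), 3))
```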


#59    Tangotiger      (see all posts) 2008/12/11 (Thu) @ 20:38

Why not present it as a differential instead?

Do you have proof that Coors would have a multiplying effect more on Pujols than Howard?  Or same with Petco?


#60    terpsfan101      (see all posts) 2008/12/11 (Thu) @ 21:05

You will have to give me a little bit of time to fully test the results. I did test extreme players, like Pujols, Bonds, Howard, and Neifi Perez, between the range of wOBA PF’s from 0.9 to 1.1. In this range, the park-adjusted wOBA runs were almost always within 1 run of the park-adjusted Linear Weight runs. Remember that a 0.9 wOBA PF is actually a park adjustment of .839 (0.9^(5/3)) and a 1.1 wOBA PF is actually a park adjustment of 1.17 (1.1^(5/3)). In the wOBA PF spreadsheet, not a single team has a wOBA PF below 0.9. Coors Field from 1995-2001 is the only park that falls outside the 1.1 range. Coors Field alone is a cause for more testing.


#61    Tangotiger      (see all posts) 2008/12/11 (Thu) @ 22:08

I don’t think I made myself clear.

If 6 runs are scored at Coors while 5 runs are scored in the rest of the parks, and you have a player that “creates” 7 runs in the rest of the parks, will he create 7/5*6 runs at Coors, or will he create 7-5+6 runs at Coors?

http://www.tangotiger.net/parks.html


#62    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/11 (Thu) @ 23:12

What terpsfan is doing is the traditional method of applying park factors, but it may not be correct, as Tango points out.

I wrote about this at StatSpeak today http://statspeak.net/2008/12/different-factors-for-different-folks-part-i.html

per #59, I plan on running my park factors by player, and groups of players, as Tom had suggested a little while ago. I need to play with the SQL, but I hope to have it in the next week or two.


#63    Jeff      (see all posts) 2008/12/11 (Thu) @ 23:51

Terps - Thanks for PF with decimal places.  I was doing some regression analysis with park features (and runs scored) to estimate PF and it wasn’t going too well using only the integer values from Baseball Reference.  Thanks again.


#64    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 05:48

Jeff,

I will let you know when I update my park factor spreadsheet. This spreadsheet has all the details that I used for the park factor calculations. All I presented here was park factors for wOBA. I agree that presenting a Park Index like Baseball Reference does is not the most practical way to present Park Factors.


#65    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 06:00

Tango,

I agree that we should not be applying a uniform Park Factor to all players. The best method would be to calculate park factors for the component stats based on batter-handedness. You would then adjust the component stats, before you calculated Linear Weights and wOBA. However, you would also need to apply different regression amounts to each of the components. These regression amounts would need to be determined using sample size, altitude, fence height, climate, etc…. I lack the necessary knowledge to calculate park factors that are this detailed. The main reason I presented wOBA park factors is that I was tired of people complaining that wOBA was not park-adjusted. And it is much easier to divide wOBA by a special PF, than it is to park-adjust each of the components.


#66    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/12 (Fri) @ 06:49

Terps - I actually ran your PF numbers against Stadium Size, Foul Area, Fence Height, Relative Humidity, Elevation and Temperature and here is the equation:

+0.0597 *  Average Temp at Game Time (F)
+0.00123 *  Elevation at home plate (feet)
-0.0472 * Average PM humidity (XXX.XX)
+0.172 * Average Wall height (ft)
-0.0000643 * Foul area (ft squared)
-0.000148 *  Playing area (ft squared)
+113.578
=Park Factor

Factor   Change in Park Factor
10 degree F increase   +0.060
increase in RH by 10%   -0.047
1ft increase in wall height   +0.17
10,000 sq ft increase in foul area   -0.643
10,000 sq ft increase in playing area   -1.48
1000 ft increase in elevation   +1.23

R-squared of 0.664
Standard Deviation on difference of actual - projected = 1.10

I also ran the numbers against uncorrected runs scored per game and got the following numbers:

+ 0.0480 *  Average Temp at Game Time
+ 0.000598 * Elevation at home plate
- 0.00286 * Average PM humidity
+ 0.0281 *Average Wall height
- 0.00000740 * Foul area
- 0.0000339 * Playing area
+9.458
= Runs scored by both teams per game

R-squared = .446
Standard Deviation on difference of actual - projected = 0.64

Factor   Change in Total Runs Scored per Game
10 degree F increase   +0.48
increase in RH by 10%  -0.030
1ft increase in wall height   +0.028
10,000 sq ft increase in foul area   -0.074
10,000 sq ft increase in playing area   -0.34
1000 ft increase in elevation   +0.60
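
For anyone who wants to play with the runs-per-game equation, here is a minimal Python sketch. The coefficients and intercept are copied from the equation above; the dictionary keys, the units (degrees F, feet, percent, square feet), and the sample park are assumptions made for illustration only.

```python
# Coefficients from the runs-per-game regression above.
COEF = {
    "temp_f": 0.0480,                # average temp at game time
    "elevation_ft": 0.000598,        # elevation at home plate
    "humidity_pct": -0.00286,        # average PM humidity
    "wall_height_ft": 0.0281,        # average wall height
    "foul_area_sqft": -0.00000740,   # foul area (units assumed: sq ft)
    "playing_area_sqft": -0.0000339, # playing area (units assumed: sq ft)
}
INTERCEPT = 9.458


def predicted_runs_per_game(park):
    """Runs scored by both teams per game, per the fitted equation."""
    return INTERCEPT + sum(COEF[name] * park[name] for name in COEF)


def effect_of_change(feature, delta):
    """Change in predicted runs per game when one feature moves by delta."""
    return COEF[feature] * delta


# Hypothetical park, not a real stadium.
sample_park = {
    "temp_f": 70, "elevation_ft": 500, "humidity_pct": 60,
    "wall_height_ft": 10, "foul_area_sqft": 25000, "playing_area_sqft": 110000,
}
print(round(predicted_runs_per_game(sample_park), 2))
print(round(effect_of_change("temp_f", 10), 3))
```

Because the model is linear, each row of the "change" table is just a coefficient times the size of the change, independent of the other features.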

I plan on running some of the numbers against MiLB parks to see if they correlate and will write up an article with the entire results when I get a chance.

I am looking for some other stats to add to the analysis if anyone can think of some.  I might try to figure out some kind of way to measure average wind speed and direction, but haven’t got a good method I like yet.


#67    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/12 (Fri) @ 07:06

Wind speed and direction are in the game logs at RetroSheet. Wrigley is the only place where there’s any sizeable effect on the HR rates (blowing out is twice the rate of blowing in)

When I rerun my park factors for 2008, I am going to try to program in wind direction.


#68    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/12 (Fri) @ 07:16

I would like to have a value like: at Wrigley the wind blows in from center at 5mph.  Changing direction to a number to be calculated is where I am having problems.


#69    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 08:53

Jeff,

When you tested my results, did you use the Park Adjustment or the Park Factor? The park adjustment includes both home and road games.

The park adjustment is

PADJ = WOBA PF^(5/3)

and the park factor (only home games) is:

PF = (PADJ - 1) + PADJ


#70    Tangotiger      (see all posts) 2008/12/12 (Fri) @ 09:39

terps: did you understand my post 61?


#71    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 09:48

No, I didn’t really understand your post 61. Your “practical example” in the linked article is not a very practical example.


#72    Tangotiger      (see all posts) 2008/12/12 (Fri) @ 10:48

Let me try with pitchers then.

Suppose that the league average park generates a 4.0 ERA, but at Petco, it’s 3.0.

You have a pitcher, Peavy, who has a 3.5 ERA away from Petco.  What is his Petco ERA?

Is it:
a. 3.5/4.0*3.0 = 2.625
b. 3.5-4.0+3.0 = 2.5

Park factors always presume a multiplicative effect (i.e., a.).  I contend that we have not established that this is in fact true.

Perhaps for ERA it might be, but what about for component stats?

If there are 20 HR per 700 PA in an average park, while there are 35 HR per 700 PA at Coors, and you have a hitter that hits 40 HR away from Coors, then how many HR will he hit at Coors, per 700 PA?  Is it:
a. 40/20*35=70
b. 40-20+35=55
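
The two candidate answers can be written as a pair of one-line Python functions. This is just a sketch of the arithmetic in options a. and b.; the function names are illustrative.

```python
def multiplicative(player_elsewhere, league_avg, park_avg):
    # Option a: the park scales the player's rate by the same ratio
    # it scales the league's rate.
    return player_elsewhere / league_avg * park_avg


def additive(player_elsewhere, league_avg, park_avg):
    # Option b: the park shifts the player's rate by the same amount
    # it shifts the league's rate (a differential).
    return player_elsewhere - league_avg + park_avg


# Peavy's ERA at Petco, then the HR hitter at Coors (per 700 PA).
print(multiplicative(3.5, 4.0, 3.0), additive(3.5, 4.0, 3.0))
print(multiplicative(40, 20, 35), additive(40, 20, 35))
```

The two methods agree for an exactly average player and diverge more the further the player is from average.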

Why would we think that the impact of Coors is to say that the increase in HR is dependent on the number of HR you hit elsewhere?  Isn’t it possible that Coors might affect the guys who hit lots of FB, but not lots of HR?

Look at the number of HR hit by lefties other than Bonds, in SF and away.  Then, look at Bonds.  Guess what: those lefties hit one-third fewer HR in SF.  Yet Bonds hits an EQUAL number of HR in SF as away.  Did SF really suppress Bonds’ HR totals?  Or, was the length of his HR in SF simply shorter, but still more than enough to clear the fence?

Before we go multiplying and exponentializing all these park factors, isn’t it incumbent on those people to actually show that this is how park factors work?  Until then, why not take the easy way, and apply differentials instead?  Answer b. above.  This won’t solve the Bonds issue, but neither am I implying anything with extra precision that exponents and the like might be saying.

Personally, I think the answer is the Odds Ratio Method.


#73    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/12 (Fri) @ 11:09

terps- I used the numbers from the TAB wOBA_PF without doing any conversions.  Also, my previous numbers on runs per game is off a little.  Will get some better number at a later date.


#74    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 11:15

I will take a look at using the odds ratio. I first need to find out what the odds ratio is.


#75    Rally      (see all posts) 2008/12/12 (Fri) @ 11:20

Good question on the park issues.  I can’t say I’ve had the time to do much work on them myself.  I use the multiplying effect, but it could be wrong.

One advantage to the traditional park factors is it doesn’t create a Juan Pierre problem.  Say Coors adds 20 HR/700 PA to the average hitter.  If Juan Pierre hits 1 there, then moving to Florida will he hit:

1 - 20 = -19 or
1 *.5 =~ 1

Additive park factors take a lot more work to account for the exceptions.


#76    Rally      (see all posts) 2008/12/12 (Fri) @ 11:23

Another thing is:  Are we trying to predict how a player will change in a new park or measuring his value where he is?

Pujols creates 8 rpg in a 4 rpg environment.  Isn’t his value the same as a guy creating 10 in a 5 run environment?

Maybe not exactly, since your pythagopat exponent will change, but a lot closer than looking at him as a 9 rpg player in a 5 rpg environment.


#77    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 11:50

Rally’s last 2 posts are the same issues that are running through my mind right now.

First of all, you would need each player’s road splits to use the differential method. I’m not quite sure if you would need his home splits. You are essentially assigning each batter a custom park factor. Yes, you could do all this with a database, but it would take me forever to figure out how to do this.

Secondly, Tango keeps using examples where he moves a player to another park. How does this tell us how much value a player has/had in the context of his present/past league or team?


#78    Tangotiger      (see all posts) 2008/12/12 (Fri) @ 11:51

For runs created, I agree.

I looked at pitchers:
- took their ERA
- reverse-engineered their OBP
- used odds ratio on OBP
- converted back to ERA

And lo and behold, the conversion of all that was almost identical to a straight multiplicative effect.  That is, the Peavy example from above, the answer was a, not b.  But, that was a byproduct of the way runs are created.

For components, Odds Ratio would likely be the correct thing, and would handle the Pierre issue.

If there are 10 HR per 200 FB, and our hitter gets 15 HR per 200 FB, then the odds ratio says 10/190 for the first term, and 15/185 for the second term.  So, his ratio is the second term divided by the first, or 1.54.

To figure out how he’ll do in a specific park, you simply multiply the 1.54 by whatever the ratio is at that park.  So, if some park gets you 30 HR per 200 FB, its ratio is 30/170, and then it’s 1.54*30/170 = .272 HR per non-HR.

And .272:1, is a percentage of .272/1.272, or 21.4%.  And 21.4% of 200 is about 42.7 HR.

So, a guy who gets 15 HR in a context of 10, will get about 42.7 in a context of 30.

The Odds Ratio is typically between the additive and multiplicative methods.
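
Here is a minimal Python sketch of the Odds Ratio calculation, with every rate expressed as HR per non-HR, including the park's (30 HR per 200 FB is 30/170). The function names are illustrative, and the additive and multiplicative results are computed alongside only for comparison.

```python
def odds(successes, trials):
    # Odds = successes per non-success, e.g. HR per non-HR fly ball.
    return successes / (trials - successes)


def odds_ratio_projection(player_hr, league_hr, park_hr, fb=200):
    """Project a hitter's HR per `fb` fly balls in a given park context."""
    # How the hitter's odds compare to the league's odds.
    ratio = odds(player_hr, fb) / odds(league_hr, fb)
    # Apply that ratio to the odds in the target park.
    park_odds = ratio * odds(park_hr, fb)
    # Convert odds back to a probability, then to HR per `fb` fly balls.
    probability = park_odds / (1.0 + park_odds)
    return probability * fb


projected = odds_ratio_projection(15, 10, 30)
additive = 15 - 10 + 30
multiplicative = 15 / 10 * 30
print(round(projected, 2), additive, multiplicative)
```

Working in odds keeps the projected rate between 0 and 1, which is what lets this method handle the Pierre case without producing negative HR.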


#79    .(JavaScript must be enabled to view this email address)      (see all posts) 2008/12/12 (Fri) @ 16:26

First, I use the James Function, which I believe is a variation of the Odds Ratio, for my projection calculations. It needs three terms - the expected, observed, and population mean. The formula guarantees that the answer will always be a valid binomial probability, 0 <= y <= 1

In yesterday’s article at StatSpeak, I measured the difference in HR rates between the U.S. (MLEs) and Japan (unadjusted) for 106 players over the last 10+ seasons. In the article, I expressed the results as a multiplier, (HRobs / HRexp) as HRf, showing that as a player’s true HR% decreases, the PF multiplier increases.

Now I ran it as a differential (HRobs - HRexp)/(ab - so + sf) as HRd. I think this shows the results better. It’s the guys in the middle who get the most benefit of short fences - only balls hit near the fence are affected. The top guys consistently hit further than the fence, so get a smaller benefit. The bottom guys consistently hit the ball short of the fence, and so also get a smaller benefit.

Grade   HRd   HRf
A   0.012   1.14
B   0.023   1.39
C   0.027   1.66
D   0.020   1.82
E   0.019   2.27


#80    terpsfan101      (see all posts) 2008/12/12 (Fri) @ 19:50

Great stuff Brian. In your article, I found it interesting that Hideki Matsui is the only Japanese player in the major leagues, whose homerun rate has exceeded the league average homerun rate.

I am not really satisfied with the method I suggested for park-adjusting wOBA. For the most part it is accurate within a few runs, but I think it only works by accident, sort of like Runs Created only works by accident. This weekend, I might try another method that involves adjusting the sum of the wOBA weights times their frequencies.


#81    terpsfan101      (see all posts) 2008/12/13 (Sat) @ 13:11

I knew there had to be a way to park-adjust wOBA so that it would precisely match park adjusted Linear Weights. After all, wOBA is linear weights scaled to look like OBP.

First, you need to park adjust the Linear Weight Runs. By Linear Weight Runs, I mean the marginal runs total (runs above/below average). The Linear Weight Runs can be calculated from the unadjusted wOBA, or they can be calculated from the standard linear weights (assuming that the standard LW were used to derive the wOBA weights).

The formula that I use for park-adjusting LW runs is:

PADJ LW Runs = ((LW Runs + ((Lg Runs/Lg PA) * PA)) / PADJ) - ((Lg Runs/Lg PA) * PA)

where PADJ is the park adjustment, and

PA = AB+BB+HBP+SH+SF

Once you have the park adjusted LW runs, you can park adjust wOBA:

PADJ wOBA = ([(PADJ LW Runs - LW Runs) * wOBA Multiplier] + (wOBA * PA)) / PA

where PA = AB+BB+HBP+SH+SF

Here is an example using Todd Helton’s 2000 season. For the LW and wOBA weights, I just grouped together all years from 1954-2008. Since this is only an example, I figured I could get away with grouping 55 years together.

Helton’s wOBA = 0.495
Helton’s LW Runs = 92.03
Helton’s PA = 697
League R/PA = 0.1151
League wOBA = 0.327
wOBA scale = 1.276
Park Adjustment = 1.205

First we park-adjust Helton’s LW runs. Again, the LW runs can be figured from wOBA or they can be figured from the Linear Weights from which the wOBA weights were derived. In this case, I used the LW runs from the standard linear weights.

PADJ LW Runs = ((92.03 + (.1151 * 697)) / 1.205) - (.1151 * 697) = 62.72

Now we can park adjust wOBA:

PADJ wOBA = ([(62.72 - 92.03) * 1.276] + (.495 * 697)) / 697 = 0.441

To prove that this method works, here are the Linear Weight Runs derived from Helton’s park-adjusted wOBA. I had previously derived Helton’s LW runs from standard linear weights:

(.441 - .327) * (697 / 1.276) = 62.72 LW Runs.

As you can see, 62.72 is the exact park-adjusted runs total that I calculated from the standard Linear Weights.
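
The two formulas can be sketched in Python as follows, using the Helton inputs listed above. The function names are illustrative. Note that the inputs shown in the post are rounded, so small differences in the last decimal place are expected when converting between wOBA and LW runs.

```python
def park_adjust_lw_runs(lw_runs, pa, lg_r_per_pa, padj):
    # Convert marginal runs to absolute runs, divide by the park
    # adjustment, then convert back to marginal runs.
    baseline = lg_r_per_pa * pa
    return (lw_runs + baseline) / padj - baseline


def park_adjust_woba(woba, pa, lw_runs, padj_lw_runs, woba_scale):
    # Shift wOBA by the change in LW runs, rescaled to the wOBA scale.
    return ((padj_lw_runs - lw_runs) * woba_scale + woba * pa) / pa


def lw_runs_from_woba(woba, lg_woba, pa, woba_scale):
    # Runs above average implied by a wOBA.
    return (woba - lg_woba) * pa / woba_scale


# Helton's 2000 season, inputs as listed in the post.
pa, lw = 697, 92.03
padj_lw = park_adjust_lw_runs(lw, pa, lg_r_per_pa=0.1151, padj=1.205)
padj_woba = park_adjust_woba(0.495, pa, lw, padj_lw, woba_scale=1.276)
print(round(padj_lw, 2), round(padj_woba, 3))
```

By construction, the change in wOBA converts back to exactly the change in LW runs, which is why the two park-adjusted totals agree.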

The spreadsheet I posted in #56 now contains park adjustments that you can use to park-adjust linear weights and wOBA. I recommend that you use these run-based park adjustments over the adjustments contained in the Baseball Databank database. I calculated park adjustments using runs-per-plate appearance from 1954-2008. The park adjustments in the BDB database are runs-per-out park adjustments. However, my park adjustments prior to 1954 are based on runs-per-out, where outs were estimated using the average number of outs per Home/Away Win, Loss, and Tie, and reconciled to the total number of league outs on a seasonal basis. I don’t think it is possible to estimate plate-appearances per-game, so that is why the park factors are outs-based prior to 1954.


#82    Tangotiger      (see all posts) 2008/12/13 (Sat) @ 13:19

Can you try it as a differential instead (step b, post 72)?

And you definitely won’t have to go through all those steps.


#83    terpsfan101      (see all posts) 2008/12/13 (Sat) @ 13:34

Tango, I really don’t understand the differential method, even though you have explained it to me countless times. The only way that I am going to understand the method is to play around with it in a spreadsheet. So, let me experiment with the differential method to see if I can figure it out, and I will report back to you.


#84    terpsfan101      (see all posts) 2008/12/15 (Mon) @ 12:55

Tango,

Can you tell me if I am using the differential method correctly? I am using Helton’s 2000 season and the LW values posted in the lwts_woba_for_bdb.txt file.

81.05 Helton’s LW Runs
170.26 Helton’s RC
697 Helton’s PA
0.128 Lg R/PA
0.173 Coors Field R/PA
0.151 1/2 Coors + 1/2 League R/PA

Helton’s PADJ LW = 81.05 + ((.128-.151)*697) = 65.47

Helton’s PADJ RC = 65.47 + (.128*697) = 154.68


#85    Tangotiger      (see all posts) 2008/12/15 (Mon) @ 20:41

It’s as simple as you’ve shown it.


#86    Tangotiger      (see all posts) 2008/12/15 (Mon) @ 23:28

And like I said, if you do it for runs created, then the multiplicative method is probably the best.  If you do it for components, then the Odds Ratio method is probably the best.


#87    Rally      (see all posts) 2008/12/15 (Mon) @ 23:52

Between #81 and #84 we’ve got a huge difference in Helton’s value.  The differential method assumes he loses no more runs going to a normal park than Neifi Perez does in 697 PA.

That’s a big assumption, though so is the multiplicative assumption.  Either way, I think it’s one that needs to be tested.


#88    Tangotiger      (see all posts) 2008/12/15 (Mon) @ 23:57

Agreed, it needs to be tested.

Like I said, when you use the Odds Ratio method on a pitcher’s component stats, and generate BaseRuns from those adjusted numbers, you get a result that is IDENTICAL to starting with BaseRuns and then applying a multiplicative adjustment.

The takeaway is that even if you confirm that the multiplicative method is better for baseruns and ERA, that does NOT mean it would hold for the component stats.


#89    terpsfan101      (see all posts) 2008/12/16 (Tue) @ 06:24

Helton gains +10 runs using the additive park factor. Here is what the multiplicative park adjustment would look like, using the same numbers in #84.

Coors R/PA = .173

Lg R/PA = .128

1/2 Coors R/PA + 1/2 Lg R/PA = .151

Multiplicative Park Adjustment = .151 / .128 = 1.18

Helton Park Adjusted LW = ((81.05 + (.128 * 697)) / 1.18) - (.128 * 697) = 55.1

Helton Park Adjusted RC = 170.26 / 1.18 = 144.3

Helton LW Runs: 81.1
Helton PADJ LW (Additive): 65.5
Helton PADJ LW (Multiplicative): 55.1

Helton RC = 170.26
Helton PADJ RC (Additive): 154.7
Helton PADJ RC (Multiplicative): 144.3
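
For comparison, here is a rough Python sketch of both adjustments applied to LW runs, starting from the Coors and league R/PA rates. The function names are illustrative. Because the rates shown in the posts are rounded to three decimals, the results land within about half a run of the posted figures rather than matching them exactly.

```python
def blended_rate(park_r_per_pa, lg_r_per_pa):
    # Half of the games are at home, half on the road.
    return 0.5 * park_r_per_pa + 0.5 * lg_r_per_pa


def additive_padj_lw(lw_runs, pa, lg_r_per_pa, park_r_per_pa):
    # Differential method: remove the same runs per PA from every hitter.
    blended = blended_rate(park_r_per_pa, lg_r_per_pa)
    return lw_runs + (lg_r_per_pa - blended) * pa


def multiplicative_padj_lw(lw_runs, pa, lg_r_per_pa, park_r_per_pa):
    # Multiplicative method: scale the hitter's absolute runs by the ratio.
    padj = blended_rate(park_r_per_pa, lg_r_per_pa) / lg_r_per_pa
    baseline = lg_r_per_pa * pa
    return (lw_runs + baseline) / padj - baseline


# Helton 2000, using the rounded rates from #84.
add_lw = additive_padj_lw(81.05, 697, 0.128, 0.173)
mult_lw = multiplicative_padj_lw(81.05, 697, 0.128, 0.173)
print(round(add_lw, 1), round(mult_lw, 1), round(add_lw - mult_lw, 1))
```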


#90    terpsfan101      (see all posts) 2008/12/24 (Wed) @ 09:26

Patriot has convinced me that an R/O Park Factor is a better choice for evaluating hitters than an R/PA Park Factor. An R/PA PF does not account for the compounding effect of OBP.

An R/PA PF accounts for the rate of runs scoring per PA, but it does not account for the number of PA’s generated based on the out rate. Patriot demonstrated that an R/O PF takes both of these factors into account. I have updated the PF spreadsheet listed in post #56 with R/O based Park Factors and Park Adjustments.


#91    Tangotiger      (see all posts) 2008/12/24 (Wed) @ 23:20

An R/PA PF does not account for the compounding effect of OBP.

... unless you do it the way I’m saying to do it, and that is to start with Linear Weights per PA, and add in the .12 (or whatever) runs per PA.


#92    terpsfan101      (see all posts) 2008/12/25 (Thu) @ 03:46

Technically, adding a fixed run value per PA to Linear Weights per PA is called R+/PA, since we are accounting for the indirect runs that are created through PA generation. A PF that is runs divided by plate appearances is simply R/PA.

Since I am not qualified to explain this, I will post Patriot’s comments:

“So, suppose you have a park with a 1.2 park factor based on R/PA (this is prior to the “add one and divide by 2” adjustment), and the team that plays there has a “true” post-adjustment .330 OBA and .13 R/PA. This team will have a .330*sqrt(1.2) = .361 OBA at home, and .13*1.2 = .156 R/PA.  When they play at home, they will get around 25.2/(1-.361) = 39.43 PA/G, and will score 6.15 runs.

On the road, their OBA and R/PA are .330 and .13, of course.  So they’ll get around 25.2/(1-.33) = 37.61 PA, which will result in 37.61*.13 = 4.89 runs.

In terms of R/O, then, they are 6.15/4.89 = 1.258 instead of 1.2.  The compounding value of OBA is the cause of this.”

and

“Disregard the example in my previous post…I had a major brain lapse. The approximate sqrt(PF) relationship for OBA that I mentioned is based on a R/O park factor, not a R/PA park factor.

What I’m saying is still true, but that example probably overstates the impact and certainly does not follow logically.  The point is that there are two factors at play:

1) increased rate of scoring per PA
2) increased number of PA as a result of decreasing out rate/increasing OBA

The R/PA factor isolates 1).  An OBA factor would isolate 2).  A R/O factor incorporates both.”


#93    terpsfan101      (see all posts) 2008/12/25 (Thu) @ 13:58

If you divide R/PA by R/O you get the “out rate.” 1 minus the out rate is essentially OBA. It is actually what Patriot calls the “not out average” or NOA. Patriot says an OBA factor would account for PA generation. After the holidays, I will see if I can incorporate an OBA factor into the R/PA park factors. It seems like I am always re-doing things after getting input from others. Don’t get me wrong, this is a good thing. However, it would be nice if I could do things correctly the first time around.

Merry Christmas


#94    terpsfan101      (see all posts) 2008/12/26 (Fri) @ 07:23

An R+/PA park factor is mathematically equivalent to a R/O park factor. So for a run-based park factor, outs (or innings or games) would be the correct denominator. Of course PA would still be the correct denominator to use for component park factors.


#95    terpsfan101      (see all posts) 2009/04/18 (Sat) @ 10:14

After wasting countless hours fiddling around with custom versions of Baseruns, I ended up getting similar historical LW as Tango gets with his quick and dirty method. The great thing about Tango’s method is you only need to know one stat, R/O, to figure out the linear weights. Here is how Tango’s Q&D weights are calculated:

1. Calculate R/O
2. BB = R/O + .14
3. HBP = BB + .025
4. 1B = BB + .155
5. 2B = 1B + .30
6. 3B = 2B + .27
7. HR = 1.40
8. SB = .20
9. CS = -2*R/O + -.075
10. AB-H+SF = Whatever value makes the weights sum to zero.
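
The steps above can be sketched in Python like this. The weights follow steps 1 to 9 as listed. The `out_value` helper is only one possible reading of step 10 (the out value that makes the league's weighted event totals sum to zero), and its inputs are supplied by the caller; both function names are illustrative.

```python
def quick_and_dirty_weights(r_per_out):
    """Run values built up from a single input, league runs per out."""
    bb = r_per_out + 0.14
    hbp = bb + 0.025
    single = bb + 0.155
    double = single + 0.30
    triple = double + 0.27
    return {
        "BB": bb, "HBP": hbp, "1B": single, "2B": double, "3B": triple,
        "HR": 1.40, "SB": 0.20,
        "CS": -(2 * r_per_out + 0.075),  # negative, per the bug report in #55
    }


def out_value(weights, league_counts, league_outs):
    # Assumed interpretation of step 10: choose the out value so that
    # sum(weight * count) + out_value * outs == 0 for the league.
    total = sum(weights[event] * n for event, n in league_counts.items())
    return -total / league_outs


# Hypothetical run environment of 0.17 runs per out.
weights = quick_and_dirty_weights(0.17)
print({event: round(value, 3) for event, value in weights.items()})
```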

These are the only changes I would recommend to Tango’s values:

3B = 2B + .29
SB = .19
CS = -2*R/O + -.10

David Smyth came up with a Q&D method several years ago. Instead of using R/O, he used RPG as the basis for his weights:

1B = R/G/50 + .38
2B = R/G/40 + .65
3B = R/G/20 + .80
HR = R/G/100 + 1.355
SB = R/G/150 + .16
CS = -(R/G/15 + .12)
BB = R/G/50 + .24 (not IBB)
HB = same as BB
AB-H = whatever value makes the weights sum to zero.
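
Smyth's version translates to Python just as directly. In this sketch "R/G/50" is read as (R/G) divided by 50, and R/G is presumably runs per game for one team, which the list above does not spell out. The function name is illustrative, and the out value is left to the caller, as above.

```python
def smyth_weights(runs_per_game):
    """Quick-and-dirty run values keyed off runs per game (assumed per team)."""
    rpg = runs_per_game
    walk = rpg / 50 + 0.24  # non-intentional walks only
    return {
        "1B": rpg / 50 + 0.38,
        "2B": rpg / 40 + 0.65,
        "3B": rpg / 20 + 0.80,
        "HR": rpg / 100 + 1.355,
        "SB": rpg / 150 + 0.16,
        "CS": -(rpg / 15 + 0.12),
        "BB": walk,
        "HB": walk,  # same as BB
    }


# Hypothetical environment of 5.0 runs per game.
print({event: round(value, 3) for event, value in smyth_weights(5.0).items()})
```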


#96    terpsfan101      (see all posts) 2009/04/18 (Sat) @ 17:19

The main thing that bugs me about using Baseruns to generate linear weights is it doesn’t value walks properly. By walks I am referring to non-intentional walks (NIBB). We know Baseruns has problems with IBB’s due to negative B values.

Baseruns always overvalues the NIBB in a low run environment and undervalues it in a high run environment. This is a problem that nearly every run estimator has in common. Even Tango’s quick and dirty linear weights undervalue the walk in a high run environment.


#97    Tangotiger      (see all posts) 2009/04/18 (Sat) @ 18:44

terps’ claims can be substantiated by using
http://www.tangotiger.net/markov.html

He is correct.  Which makes the use of the Q&D method even better.


#98    terpsfan101      (see all posts) 2009/05/05 (Tue) @ 17:02

Finished up the custom Linear Weights:

http://spreadsheets.google.com/ccc?key=rLN8k6wA1MHfuvgxkCJlKHQ&hl=en

They are meant to work with the available data in the batting table of the Baseball Databank database. I am still working on removing pitcher hitting. You will see that I didn’t list run-values for basestealing events when the CS data was incomplete. Also the run-value of the BB includes IBB prior to 1955. I think I estimated IBB as 7-8% of all BB’s. In the 19th century, I scaled this number down based on the number of balls that were necessary for a walk. I think I only used 2% of all walks as intentional for the National Association (1871-1875).

Unless someone requests, I won’t go into the gory details about how the linear weights were calculated. I used 10 Full Baseruns equations that included every event. I used a variety of techniques (regression, fixed rates, common sense) to estimate the missing data. I then stripped these extra events out, leaving only the main batting events. To reconcile the weights to zero, I adjusted the value of the out.


#99    devil_fingers      (see all posts) 2009/05/05 (Tue) @ 22:22

terpsfan:

I was just denied access to the spreadsheet. Did you make it available to the public, or is it just me that’s “banned”?


#100    terpsfan101      (see all posts) 2009/05/06 (Wed) @ 01:55

Sorry Devil Fingers. The link should work now.


#101    terpsfan101      (see all posts) 2009/05/08 (Fri) @ 03:49

I removed pitcher hitting using Tango’s primary position table. When you remove pitcher hitting, what is the best way to calculate runs created? If you add the league R/PA to the RAA estimates you will be short a bunch of runs. One way to reconcile to runs scored would be to add league runs divided by position player plate appearances to all the RAA estimates. But then you wouldn’t be giving pitchers any credit for their runs created.


#102    terpsfan101      (see all posts) 2009/05/08 (Fri) @ 03:55

I guess the next thing I will do is add wOBA weights to my spreadsheet I posted. Once I do that, I will post the SQL. Maybe I can even incorporate park adjustments. You may have to manually join the tables, but I should be able to fit everything into one query.

Also, instead of ignoring SB for seasons where the CS is incomplete, I assigned a run value of .05 for SB. This value changes somewhat prior to the modern SB rule in 1898.


#103    Tangotiger      (see all posts) 2009/05/08 (Fri) @ 04:00

Indeed, pitchers do NOT generate any runs.

In The Book, I showed that if you ran the NL players with a DH, they’d score 5.25 runs per game (1999-2002).  That means each hitter generates an “absolute” .583 runs per game.

If you had 8 such hitters, a team would score .583*8 + whatever the 9th hitter provides.  That means 4.67 runs plus the 9th hitter.

35% of the time in the NL, you have no pitcher batting (meaning you’d expect 5.25 runs per game scored).  65% of the time, you have the second situation (4.67 runs, plus whatever the pitcher provides).

total runs scored
= 5.25*.35 + 4.67*.65 + whatever*.65
= 4.87 + whatever*.65

And in the NL, teams scored 4.83 runs per game.

This means that the pitchers provided NO runs created.  Whatever actual runs they scored and runners they drove in is cancelled out by all the extra outs they generated.
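
The arithmetic above can be sketched in Python, using only the figures quoted in this comment. The variable names are illustrative, and solving for the “whatever” term is just a rearrangement of the total-runs line.

```python
# Figures quoted in the comment (NL, 1999-2002).
RUNS_PER_GAME_WITH_DH = 5.25    # NL lineups run with a DH
ACTUAL_NL_RUNS_PER_GAME = 4.83  # what NL teams actually scored
SHARE_NO_PITCHER = 0.35         # share of the time no pitcher is batting
SHARE_PITCHER = 0.65            # share of the time a pitcher is batting

# Each of the nine lineup spots gets an equal share of the DH-lineup runs.
per_hitter = RUNS_PER_GAME_WITH_DH / 9
eight_hitters = per_hitter * 8

# Runs per game before adding anything for the pitcher's spot.
baseline = (RUNS_PER_GAME_WITH_DH * SHARE_NO_PITCHER
            + eight_hitters * SHARE_PITCHER)

# Solve  actual = baseline + whatever * SHARE_PITCHER  for "whatever".
whatever = (ACTUAL_NL_RUNS_PER_GAME - baseline) / SHARE_PITCHER

print(f"per hitter:    {per_hitter:.3f}")
print(f"eight hitters: {eight_hitters:.2f}")
print(f"baseline:      {baseline:.2f}")
print(f"pitcher spot:  {whatever:.2f} runs per game")
```

The pitcher's spot comes out slightly below zero, which is the sense in which pitchers provided no runs created.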


#104    terpsfan101      (see all posts) 2009/05/08 (Fri) @ 05:43

Tango, thanks for the explanation on how pitchers don’t create any runs. But the farther back you go in time, the less this will be true. Also, I still haven’t figured out the best way to handle 2-way players, like Ruth. First, I need to come up with a suitable definition of a 2-way player, such as at least 5 games pitching and 5 games in the field.


#105    Tangotiger      (see all posts) 2010/02/28 (Sun) @ 09:45

Bumping…


#106    John Walsh      (see all posts) 2011/06/19 (Sun) @ 22:37

Bumping with a question…

Regarding the constant 1.40 run value for the HR: it’s true that the LW value of 1.40 is essentially constant over the years, but for wOBA we want that value relative to the value of the out, which is not constant.  So, shouldn’t the HR value for wOBA be R/O + 1.40 and not simply 1.40?


#107    Tangotiger      (see all posts) 2011/06/19 (Sun) @ 22:52

That’s correct… the run value of HR relative to the out will change, since the HR value is pretty constant, but the run value of the out is very dependent on the runs per game.

We can see this in the year-by-year results here:

http://tangotiger.net/bdb/lwts_woba_for_bdb.txt

The third to last column (wobaHR) shows how much it changes.  You can compare that to woba2B, which is pretty tight (around 1.20 to 1.35).


#108    John Walsh      (see all posts) 2011/06/19 (Sun) @ 23:28

So does that mean your code above is incorrect and, in turn, Colin’s MySQL is wrong?  If so, did this error find its way into the FanGraphs calculation?

Do you (or anybody) know exactly how FanGraphs calculates wOBA? Can anybody give an example of the coefficients used for 2010, say?

thanks.


#109    Tangotiger      (see all posts) 2011/06/19 (Sun) @ 23:33

There are no code errors.  There is obviously a gap here, so the best thing to do is not reach any conclusions.  Just ask questions and make assumptions, and I’ll tell you where you are wrong.


#110    John Walsh      (see all posts) 2011/06/20 (Mon) @ 00:23

I probably wasn’t explicit enough.

In this query:

SELECT 
Batting.yearID 
, RperOut 
, [RperOut]+0.14 AS runBB 
, [runBB]+0.025 AS runHB 
, [runBB]+0.155 AS run1B 
, [run1B]+0.3 AS run2B 
, [run2B]+0.27 AS run3B 
, 1.4 AS runHR 
, 0.2 AS runSB 
, 2*[RperOut]+0.075 AS runCS

It looks to me like “1.4 AS runHR” should actually read “[RperOut]+1.4 AS runHR”.  Am I wrong?


#111    Tangotiger      (see all posts) 2011/06/20 (Mon) @ 02:03

You are wrong.

That query calculates the Linear Weights run value of each event.  Those run values are used subsequently to convert to wOBA.
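
As a rough sketch of what the SELECT list in that query computes, here is the same arithmetic in Python for a single RperOut value. The constants are the ones in the query; the function name and the two sample RperOut values are hypothetical.

```python
def run_values(r_per_out):
    """Linear Weights run values, mirroring the SELECT list of the query."""
    run_bb = r_per_out + 0.14
    run_hb = run_bb + 0.025
    run_1b = run_bb + 0.155
    run_2b = run_1b + 0.3
    run_3b = run_2b + 0.27
    return {
        "runBB": run_bb,
        "runHB": run_hb,
        "run1B": run_1b,
        "run2B": run_2b,
        "run3B": run_3b,
        "runHR": 1.4,   # constant, as in the query
        "runSB": 0.2,   # constant, as in the query
        "runCS": 2 * r_per_out + 0.075,
    }

# Two hypothetical run environments (runs per out), for illustration only.
low = run_values(0.15)
high = run_values(0.20)

for name in low:
    print(f"{name}: {low[name]:.3f} vs {high[name]:.3f}")
```

runHR and runSB come out the same in both environments, while every other value moves with RperOut.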


#112    John Walsh      (see all posts) 2011/06/21 (Tue) @ 22:00

Ah, ok.  Thanks for the clarification.


#113    Tangotiger      (see all posts) 2011/06/21 (Tue) @ 22:04

No problem.  I’m sure you are not the only one who was confused, so it gives me a chance to expand upon it.

