
MLE method for hitters: full detail

For those who like the gory details, this post’s for you. I’m going to go into exhausting detail about our latest MLE method. For those of you who don’t like the details, see you next Thursday. Or maybe the Thursday after because I’m running the complete method for hitters this week and for pitchers next week.

However, I have given you a couple treats. Visit the Negro Leagues section of the site and you will find links to the brand new MLEs, both career value lines and yearly lines for selected players.

Overview

The following document outlines the process for creating major-league equivalencies for Negro Leagues position players. You’ll first see a short prose version for those who only want the gist. For those who want to know exactly how the sausage is made, a more fully elaborated version comes after that which includes an example from Oscar Charleston’s career.

This approach has taken several years to narrow down and has benefited from a great deal of support from many people, especially Gary Ashwill, Kris Gardner, Kevin Johnson, and Howard Miller. If you find any ways to improve this routine, please email me at eric.chalek@gmail.com. I’m always looking for opportunities to improve.

Abbreviations and Assumptions

Abbreviations

Batting

  • G: Games
  • PA: Plate appearances
  • QoP: Quality of Play
  • RAA: Runs above average
  • Rbat: Batting runs above average
  • SD: Standard deviation
  • wOBA: Weighted on-base average
  • wRC: Weighted runs created
  • Z: Z-score

Baserunning

  • Rbaser: Runs from baserunning
  • SB/G: Stolen bases per game

Double Plays

  • Rdp: Runs from avoiding grounded into double plays

Fielding

  • DRA: Defensive Regression Analysis figures from the Negro Leagues Database

Value

  • WAA: Wins above average
  • WAR: Wins above replacement

Assumptions

The process will be described from a single-season point of view. We will refer to the single season in question as n. We will also mention these seasons:

  • n+1: The season immediately after n
  • n-1: The season immediately before n
  • n+2: The season two years after n
  • n-2: The season two years before n

Short Description

Batting

  1. Find the player’s wOBA.
  2. Use his Z to recontextualize his wOBA into an MLB context.
  3. Turn his MLB wOBA into wRC and adjust that figure for his original league’s QoP.
  4. Determine how many RAA he would generate per PA in MLB.
  5. Estimate his MLB playing time in PA and G; multiply his PA by #4 to determine his MLB Rbat.

Baserunning

  1. Find the player’s SB in his original league; they will count toward his total.
  2. Subtract his G from 154 and multiply the difference times his career SB/G rate.
  3. Add #1 and #2 together and match that total with the Rbaser table in the appendix.

Double Plays

  1. Create a set of players of comparable handedness, playing time, and Rbaser and determine their Rdp/PA.
  2. Determine the player’s Rdp by multiplying #1 times his PA.

Fielding

  1. Find the player’s career DRA at all positions.
  2. At each position divide career DRA by games played at that position.
  3. Using the typical fielding trajectory of MLB players (see Appendix), assign a fielding value to each season of a player’s career based on the position he plays in each year and his career fielding rate.

Value

  1. Use the WAR explainer at Baseball-Reference.com to determine the player’s positional value
  2. Calculate the player’s WAA and WAR as closely as possible to Baseball-Reference.com’s instructions.

Full Description

We’ll work through Charleston’s fabled 1924 season. That will appear in blue.

Batting

1) Gather a player’s known statistics from the Negro Leagues, MLB, the minor leagues, and any foreign summer or winter leagues.

  • Only use seasons where league-wide data is available
  • Only use seasons where the following data is available for all players: G, AB, H, 2B, 3B, HR, BB; HBP, SB, IBB, and SH are also helpful but not required.
  • Do not include All-Star games, playoff/post-season games, or short winter series that pit white teams against Black teams or Cuban teams against Black teams
    • It’s not possible to compare a player to a league-average player in a short series—to see why, imagine having five Toyotas and five Ferraris and trying to define the average car.

In 54 games, Charleston had 205 at-bats and picked up 83 hits, 22 doubles, 5 triples, 15 homers, 20 steals, 28 walks, and 3 sacrifices. Hit-by-pitch data is not available for this season.

2) For season n, find the player’s weighted on-base average (wOBA)

I scale all seasons, Negro Leagues or otherwise, to a .330 league-average wOBA for ease of interpretation using the wOBA calculator at Triples Alley. In addition, I’ve used park factors supplied by the amazing Kevin Johnson, and where he or BBREF doesn’t have any, I’ve created my own. Eric[dot]Chalek[at]gmail if you want more information on this process. Using the linear weights derived from Charleston’s 1924 Eastern Colored League plus the 0.9812 park factor from Kevin and scaled to .330, I calculate a nifty .5579 wOBA for Charleston.

3) Determine the player’s wOBA z-score 

  • Find the league’s SD for wOBA. The ECL had an SD of 0.0816 points of wOBA.
  • Subtract the league’s mean wOBA from his wOBA. This means the mean used in calculating SD. In this case, that figure for the 1924 ECL is 0.3155. So, 0.5579 – 0.3155 = 0.2424
  • Divide by the SD. 0.2424 / 0.0816 = 2.8243, which is absolutely massive. For those following at home who got 2.9706, I’m not carrying quite enough significant digits in the figures shown here. But don’t worry, I’m not making this up!
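For readers who prefer code, step 3 can be sketched in a few lines of Python. The function name is hypothetical, and the inputs are the rounded figures printed above, so the result lands near 2.9706 rather than the 2.8243 obtained from the unrounded data.

```python
def woba_z_score(player_woba: float, league_mean: float, league_sd: float) -> float:
    """Return how many league standard deviations a wOBA sits above the league mean."""
    if league_sd <= 0:
        raise ValueError("league_sd must be positive")
    return (player_woba - league_mean) / league_sd


# Charleston in the 1924 ECL, using the rounded values shown in the text.
charleston_z = woba_z_score(0.5579, 0.3155, 0.0816)
print(round(charleston_z, 4))
```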

4) Adjust the sample to reach a minimum of 200 plate appearances

  • If the player has more than 199 PA in season n, do not adjust the sample, and proceed. We won’t need to adjust for the sample this time.
  • If the player has fewer than 200 PA, combine the PA from season n with the PA from seasons n+1 and n-1 this way: 

(((200-nPA) * (sum(nPA*nZscore, n+1PA*n+1Zscore, n-1PA*n-1Zscore) / sum(nPA, n+1PA, n-1PA))) + (nPA*nZscore)) / 200

  • If the sample is still under 200 PA, add the PA and Zscores from seasons n+2 and n-2, weighting them at 0.6
  • If the sample still doesn’t add to 200 PA, include the player’s career PA and career weighted average Zscore.
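Here is a minimal sketch of the first tier of this adjustment, the one that blends season n with n+1 and n-1. The function name and the sample numbers are made up for illustration, and the later tiers (n+2 and n-2 at a 0.6 weight, then career figures) are deliberately left out because the text does not spell out their exact formula.

```python
def regress_z_to_min_pa(season, neighbors, min_pa=200):
    """Blend a short season's z-score with the PA-weighted z-score of nearby seasons.

    season:    (PA, z) for season n
    neighbors: list of (PA, z) for seasons n+1 and n-1
    """
    n_pa, n_z = season
    if n_pa >= min_pa:
        return n_z  # 200 or more PA: no adjustment needed

    pool = [season, *neighbors]
    pool_pa = sum(pa for pa, _ in pool)
    pool_z = sum(pa * z for pa, z in pool) / pool_pa  # PA-weighted average z

    # Fill the missing PA with the pooled rate, keep the actual PA at the actual rate.
    return ((min_pa - n_pa) * pool_z + n_pa * n_z) / min_pa


# Hypothetical player: 120 PA at z = 1.5, flanked by 300 PA at 1.0 and 180 PA at 0.5.
adjusted_z = regress_z_to_min_pa((120, 1.5), [(300, 1.0), (180, 0.5)])
print(round(adjusted_z, 2))
```

With 236 PA in 1924, Charleston would pass straight through the early return.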

5) Determine a player’s initial MLB wOBA by placing him into a major-league setting: Default to the NL, but for players who debuted in the AL, start them in the AL. Multiply z-score (#3) by the MLB league SD then add the product to the mean MLB wOBA. 

Again, we’re talking about the mean observed when generating SDs for the league. Charleston goes into the NL, which in 1924 had an SD of 0.1183 and a mean of 0.3019. Thus (2.8243 * 0.1183) + 0.3019 = .6360 (sig digs again). That’s what I’ll call Charleston’s twOBA or translated wOBA.
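As a quick sketch of step 5, the translation is just the z-score formula run in reverse against the destination league. The function name is hypothetical; the inputs are the 1924 NL figures quoted above.

```python
def translate_woba(z_score: float, mlb_mean: float, mlb_sd: float) -> float:
    """Place a z-score into the destination league: mean plus z standard deviations."""
    return z_score * mlb_sd + mlb_mean


# Charleston's z-score dropped into the 1924 NL.
twoba = translate_woba(2.8243, 0.3019, 0.1183)
print(round(twoba, 4))
```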

6) Convert wOBA to weighted Runs Created (wRC) to determine the player’s total batting output.

We’re going to do this in order to make a QoP adjustment. We need to do this to a player’s runs created not his twOBA. The calculation goes like this:

((twOBA – 0.330) * PA) + (PA*lgR/PA)

In the first term, we find out how many runs above average he was (remember, we scaled everything to .330), and in the second term, we add back the runs a league-average hitter would create in the same number of PA. In Charleston’s case it’s ((0.6360 – 0.3300) * 236) + (236*0.1187) = 100.2275.

7) Adjust for quality of play (QoP) by multiplying #6 by the originating league’s QoP adjustment (see table for the ones I use).

For Charleston and the 1924 ECL the QoP is 0.8, so 100.2275 * 0.80 = 80.1820

8) Turn back into wOBA

This calculation is ((#7 – (PA * lgR/PA)) / PA) + 0.330, and for Charleston that works out to 0.5511
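Steps 6 through 8 form a round trip: wOBA to runs, discount the runs, then back to wOBA. The sketch below strings them together. The function names are hypothetical, and because it starts from the rounded .6360, the intermediate wRC comes out near 100.23 rather than exactly 100.2275.

```python
LEAGUE_AVG_WOBA = 0.330  # every season is scaled to this figure


def woba_to_wrc(twoba: float, pa: int, lg_runs_per_pa: float) -> float:
    """Step 6: runs above average plus the league-average runs for the same PA."""
    return (twoba - LEAGUE_AVG_WOBA) * pa + pa * lg_runs_per_pa


def wrc_to_woba(wrc: float, pa: int, lg_runs_per_pa: float) -> float:
    """Step 8: invert step 6 to get back to a wOBA."""
    return (wrc - pa * lg_runs_per_pa) / pa + LEAGUE_AVG_WOBA


def qop_adjusted_woba(twoba: float, pa: int, lg_runs_per_pa: float, qop: float) -> float:
    """Steps 6-8 together: discount runs created by the QoP multiplier."""
    wrc = woba_to_wrc(twoba, pa, lg_runs_per_pa)
    return wrc_to_woba(wrc * qop, pa, lg_runs_per_pa)


# Charleston: 236 PA, 0.1187 league runs per PA, 1924 ECL QoP of 0.80.
charleston_wrc = woba_to_wrc(0.6360, 236, 0.1187)
charleston_adj = qop_adjusted_woba(0.6360, 236, 0.1187, 0.80)
print(round(charleston_wrc, 2), round(charleston_adj, 4))
```

Note that the discount applies to all of the runs created, not just the runs above average, which is why the adjustment has to happen in wRC rather than directly on twOBA.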

9) Create initial playing time estimate

Here we combine in-season and career durability and adjust for the destination league’s PA/G

(G + (0.5 * (destination league’s scheduled games – G) * the player’s career ratio of G to team games) + (0.5 * (destination league’s scheduled games – G) * (G / his team’s games))) * destination league’s PA/G/lineup slot

For Charleston that means

(54 + (0.5 * (154 – 54) * (1456/1571)) + (0.5 * (154 – 54) * (54/55))) * 4.254 = 636 PA (rounded to a whole number)
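The playing-time formula is easier to follow when the two halves of the "missing" games are named. A minimal sketch, with hypothetical function and parameter names:

```python
def initial_pa_estimate(games, team_games, career_games, career_team_games,
                        sched_games, pa_per_game_slot):
    """Step 9: actual games plus the missing games, filled half at the player's
    career durability rate and half at his in-season durability rate."""
    missing = sched_games - games
    career_share = 0.5 * missing * (career_games / career_team_games)
    season_share = 0.5 * missing * (games / team_games)
    return (games + career_share + season_share) * pa_per_game_slot


# Charleston 1924: 54 of 55 team games, 1456 of 1571 for his career,
# a 154-game NL schedule, 4.254 PA per game per lineup slot.
charleston_pa = initial_pa_estimate(54, 55, 1456, 1571, 154, 4.254)
print(round(charleston_pa))
```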

10) Determine player’s final playing time estimate by using trajectories of MLB players by career length and position.

  • This is a multistep iteration. First, I start with the initial estimate in #9. With 636 PA in a league with 4.254 PA/G/lineup slot, Charleston would have 150 games worth of PA.
  • But, I want to err on the side of conservatism with playing time. It feels overenthusiastic to give players strings of 150+ game seasons, so in this iteration I limit them to 95 percent of games played. For Charleston that takes him down to 147 games and his resulting PA are 623.
  • Then I look at career trajectories. I use the PA by age of players at each position with short, medium, medium-long, and long careers. For a player with a long career like Charleston, I use a 70/30 combination of long-career centerfielders and all long-career non-catchers. Then I figure the average career bulk in PA for a player at a given position for each of our short, medium, medium-long, and long-career groups. I take the lowest PA in each group and subtract from the average. That allows PAs to vary upward from the average if necessary. This is the maximum variance we will allow from the average.
  • Charleston’s cohort averages 650 PA, which Charleston doesn’t exceed, so I simply take those 623 PA and round to the nearest ten, 620. However, had Charleston exceeded 650, he could have claimed up to 50 more PA based on the maximum variance we calculated in the previous bullet point.
  • Special note that in Charleston’s case I combined three positions (CF, LF, 1B) because Charleston spent a lot of time at each.

11) Create a final estimate for RAA/Rbat.

  • This is a three-step process. First we turn our result in step #8 into RAA using our initial estimate of PA in the first bullet of step #10. We’ll subtract .330 from the wOBA we calculated in step #8, multiply that by those 636 PA and then divide that product by something called the wOBA scale. It’s a factor that the Triples Alley wOBA worksheet helpfully figures for us, and it’s what brings everything back to scale.

((.5511 – .330) * 636) / 1.0236 = 137.3775 RAA

  • Next, we divide those 137.3775 RAA by the initial PA estimate then multiply by the final PA estimate. For Charleston that’s (137.3775 / 636) * 620 = 133.9215
  • Finally, we make a nod toward realism. No National Leaguer in 1924 created more than Rogers Hornsby’s 96 runs above average. Therefore, I adjust a player’s RAA down to the league leader’s if it exceeds that total. In this case, I dial down Charleston to 96 RAA. I understand that some people might find this unnecessary or controversial, but I’d rather undershoot and tell you “There might be more” than overshoot and say, “This probably isn’t accurate.” And, importantly, Charleston created these runs in about forty percent of a season. Babe Ruth in a full season produced 116 runs, tied with Barry Bonds for the highest total ever. Is it likely that if Charleston played a longer schedule he would keep that pace up? Might be, but that’s speculative territory. I prefer to be guided by the norms of the times. That doesn’t mean, however, that I’m right or someone else is wrong. You can do it however you want. I just don’t feel comfortable projecting anyone to totals that exceed the Babe’s best season by nearly twenty percent.
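The three bullets of step 11 can be sketched as one small function. The name and signature are hypothetical; the cap is passed in as a plain number because the league leader's total has to be looked up separately for each season.

```python
def final_raa(adj_woba, initial_pa, final_pa, woba_scale, league_leader_raa):
    """Step 11: wOBA to RAA, rescale to the final PA estimate, cap at the league leader."""
    raa_initial = (adj_woba - 0.330) * initial_pa / woba_scale
    raa_final = raa_initial / initial_pa * final_pa
    return min(raa_final, league_leader_raa)


# Charleston 1924, capped at Rogers Hornsby's 96 RAA.
capped = final_raa(0.5511, 636, 620, 1.0236, 96)
# The same calculation with no cap, to show the uncapped figure.
uncapped = final_raa(0.5511, 636, 620, 1.0236, float("inf"))
print(capped, round(uncapped, 2))
```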

Quality of Play Tables

All quality-of-play multipliers are expressed as a percentage of MLB runs. For example, a player in the 1920 Negro Leagues would have 20% of his runs created removed from his MLE so that 80% of his runs created remained.

NEGRO LEAGUES
SPAN       QoP   LGS
1871-1904 0.720  EAS IND WES
1905      0.725  EAS WES
1906      0.730  EAS WES
1907      0.735  NAC WES
1908      0.740  NAC WES
1909      0.745  INT WES
1910      0.750  EAS WES
1911      0.755  EAS WES
1912      0.760  EAS WES
1913      0.765  EAS WES
1914      0.770  EAS WES
1915      0.775  EAS WES
1916      0.780  EAS WES
1917      0.785  EAS WES
1918      0.790  EAS WES
1919      0.795  EAS WES
1920–1939 0.800  ANL EAS ECL EWL IND NAL NNL NSL
1940      0.780  NAL NNL
1941      0.780  NAL NNL
1942      0.780  NAL NNL
1943      0.780  NAL NNL
1944      0.780  NAL NNL
1945-1948 0.800  NAL NNL

CUBAN LEAGUES
YEAR   LG   QoP 

1899  PAR  0.620  
1900  CUB  0.720  
1901  CUB  0.720  
1902  CUB  0.720
1903  CUB  0.720
1904  PV   0.720
1904  CUB  0.720
1905  PV   0.721
1905  CUB  0.724
1906  PV   0.723
1906  CUB  0.728
1907  PV   0.730
1907  CUB  0.732
1908  PV   0.731
1908  CUB  0.735
1909  CUB  0.738
1910  CUB  0.746
1911  CUB  0.751
1911  CGL  0.740
1912  CUB  0.756
1913  CUB  0.754
1914  CUB  0.756
1915  CUB  0.760
1916  CUB  0.759
1918  CUB  0.776
1920  CUB  0.770
1922  CUB  0.774
1923  CUB  0.771
1923  GP   0.775
1927  CUB  0.775 

MEXICAN LEAGUE
YEAR   QoP

1937  0.63   
1938  0.66
1939  0.67
1940  0.71
1941  0.73
1942  0.69
1943  0.70
1944  0.70
1945  0.68
1946  0.67
1947  0.68
1948  0.67
1949  0.64
1950  0.66
1951  0.66
1952  0.65
1953  0.65
1954  0.64

MINOR LEAGUES
LEVEL   QoP     LGS
AAA   0.800  AA IL PCL
AA    0.720  EL SALL TL WL
B     0.620  FLIN (1952-53) NORW SWLG WINT
C     0.580  BSTL CALL FLIN (1949) IIIL NENL      
D     0.500  AZMX AZTX CMXL FLIN (1947-48) FLOR LONG PROV SWIL WNM

Please note that the classification codes for the minor leagues have changed repeatedly over time. The classifications above are simply how I remember them most easily. For more information read this.

Baserunning

The short description above will get you a long way. I make a few other tweaks to adjust for the style of play in the league.

1) Determine the player’s SB/G.

Charleston stole 20 bags in 54 games, or .3704 per game.

2) Determine his league’s SB/G.

The ECL stole 456 bases in 475 games or 1.0387 per game.

3) Compare his rate to the league’s.

0.3704 / 1.0387 = 0.3566

4) Determine the SB/G in the destination league.

The NL stole 754 bases in 1228 games, a rate of 0.6140 per game.

5) Estimate his SB in the destination league in his number of games by multiplying step #3 by step #4 and multiplying their product by his games.

(0.3566 * 0.6100) * 54 = 11.7452

6) Determine how many “remaining” games the player has by subtracting his games from the league’s scheduled games.

154 – 54 = 100

7) Find his estimated SB/G by dividing #5 by his games and then weight it by how many games he played versus his total MLE G

11.7452 / 54 = 0.2175

0.2175 * (54/148) = 0.0794

8) Find his career MLE SB/G by totaling his estimated career SB and dividing by his career MLE games then multiplying that by (1 – (his actual games divided by his MLE games)).

(235 est SB / 1428 actual games) * (1 – (54/148)) = 0.1045

9) Sum steps #7 and #8. This will be the stolen base rate used for his remaining MLE games, weighted by his current season rate and his career rate.

0.0794 + 0.1045 = 0.1841 SB/G

10) Multiply step #9 times his remaining games in step #6.

0.1841 * 92 = 16.9372

11) Add step #10, the estimated steals in his remaining games, to #5, the estimate for MLB steals in his actual games and round to a whole number.

16.9372 + 11.7452 = 28.6824, which rounds to 29 steals.

12) Match that total on the chart below and look at the right column in the chart to find the Rbaser/G for the season.

29 steals corresponds to 0.0256 on the chart.

13) Multiply step #12 by his total MLE games to determine that season’s Rbaser.

0.0256 * 146 = 3.7376

We assign Charleston about 4 runs for his baserunning in 1924.
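Steps 1 through 11 can be sketched as a single function. The names are hypothetical, and the games figures are passed in explicitly so the worked numbers (148 MLE games for the weighting, 92 remaining games) can be plugged in as printed. The sketch uses the unrounded 754/1228 NL rate, so intermediate values differ slightly from the ones shown, but the total still rounds to 29.

```python
def estimate_mle_steals(sb, games, league_sb_per_g, dest_sb_per_g,
                        career_est_sb, career_games, mle_games, remaining_games):
    """Estimate a season's MLE stolen bases (steps 1-11)."""
    relative_rate = (sb / games) / league_sb_per_g               # steps 1-3
    sb_in_actual_games = relative_rate * dest_sb_per_g * games   # step 5

    season_weight = games / mle_games
    season_part = (sb_in_actual_games / games) * season_weight          # step 7
    career_part = (career_est_sb / career_games) * (1 - season_weight)  # step 8
    blended_rate = season_part + career_part                            # step 9

    return sb_in_actual_games + blended_rate * remaining_games          # steps 10-11


total_sb = estimate_mle_steals(
    sb=20, games=54,
    league_sb_per_g=1.0387,        # 1924 ECL rate as printed
    dest_sb_per_g=754 / 1228,      # 1924 NL
    career_est_sb=235, career_games=1428,
    mle_games=148, remaining_games=92,
)
print(round(total_sb))
```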

The baserunning MLE is driven by the chart below. I found the G, PA, SB, and Rbaser for every MLB player with 200 or more PA from 1930 (the first year that BBREF bases Rbaser on play-by-play data) to 1960 (the last year before the stolen base really came back into the game). I found that the relationships between SB/PA or SB/G and Rbaser were very weak. Not worth relying on. On the other hand, when I grouped the seasons into buckets based on how many steals a player had, a clear pattern emerged that linked the number of steals in a season to Rbaser. So I created this table by drawing a line from the typical Rbaser for players with zero steals through those with 30 or more steals. All the increments along the way got a value, and the whole chart looks like this:

SB   RBSR
 0 -0.005
 1 -0.004
 2 -0.003
 3 -0.002
 4 -0.001
 5  0.000
 6  0.000
 7  0.001
 8  0.002
 9  0.003
10  0.004
11  0.005
12  0.005
13  0.006
14  0.007
15  0.007
16  0.007
17  0.008
18  0.008
19  0.009
20  0.010
21  0.011
22  0.012
23  0.013
24  0.014
25  0.017
26  0.019
27  0.021
28  0.023
29  0.026
30  0.031
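Steps 12 and 13 amount to a table lookup and a multiplication. In this sketch the list is copied from the chart above, the function name is hypothetical, and totals above 30 are assumed to use the 30-steal row since the chart tops out at "30 or more."

```python
# Rbaser per game indexed by season stolen-base total (0 through 30), from the chart.
RBASER_PER_GAME = [
    -0.005, -0.004, -0.003, -0.002, -0.001, 0.000, 0.000, 0.001, 0.002, 0.003,
    0.004, 0.005, 0.005, 0.006, 0.007, 0.007, 0.007, 0.008, 0.008, 0.009,
    0.010, 0.011, 0.012, 0.013, 0.014, 0.017, 0.019, 0.021, 0.023, 0.026,
    0.031,
]


def rbaser_for_season(steals: int, mle_games: int) -> float:
    """Look up Rbaser per game for a steal total and scale it to the MLE games."""
    index = min(max(steals, 0), len(RBASER_PER_GAME) - 1)  # clamp to the chart
    return RBASER_PER_GAME[index] * mle_games


# Charleston 1924: 29 estimated steals over 146 MLE games.
print(round(rbaser_for_season(29, 146)))
```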

I’ve been struggling to find a good baserunning estimate for years. This is probably as good as it gets. However, the great Gary Ashwill has reminded me that stolen bases were not necessarily recorded in box scores in every newspaper. Philadelphia, in particular, seems to be bad. Just keep in mind that there are error bars around these figures.

Double plays

We’re now going to turn our attention to avoiding the dreaded GIDP, which Baseball-Reference.com includes among the sources of value for batters. It’s a small thing, but everything counts.

This is only applicable for seasons after 1929 because BBREF does not provide this information due to lack of play-by-play data prior to that.

1) Find the player’s batting handedness, his career MLE PA, and his career Rbaser estimate.

Charleston would not be allocated any Rdp for 1924 because it’s prior to the 1930 cutoff for Baseball-Reference.com’s play-by-play-based calculations, but we’ll proceed anyway for the sake of the example. For Charleston, these are left, 11890, and 42.

2) Draw up a list of hitters of the same handedness, rough career length, and career Rbaser total.

BBREF’s Stathead makes this simple. Subscribe today, it’s cheap. I’m always aiming for twenty or more comps, but with long-career lefties and switch-hitters, this proves quite difficult. I usually try to get within one thousand PAs on either side of the player’s MLE PAs, and within five of his Rbaser. But sometimes you have to expand outward. In this case, after numerous attempts, I had a list with just four names on it: Brett Butler, Chase Utley, Larry Walker, and Barry Bonds.

3) For each comp, find his Rdp per PA.

You’ll have to ask Stathead to provide the Rdp as it doesn’t come up by default.

4) Average all the comps’ Rdp per PA.

Butler, Utley, Walker, and Bonds average a rate of 0.0021 Rdp per PA.

5) Finally, for each season of the player’s career, multiply #4 by his MLE PA for the year.

In 1924, we estimate 620 PA, so 620 * 0.0021 = 1.302 Rdp. He gets about 9 runs for his career this way, primarily because most of his career took place prior to 1930.
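A minimal sketch of steps 3 through 5 follows. The function names are hypothetical, and the comp figures in the first example are invented placeholders rather than real Stathead output. The 1930 cutoff is built in, so the 1924 season returns zero unless you move the cutoff for the sake of the example.

```python
def average_rdp_per_pa(comps):
    """Steps 3-4: average the Rdp/PA rates of a list of (career_rdp, career_pa) comps."""
    rates = [rdp / pa for rdp, pa in comps]
    return sum(rates) / len(rates)


def season_rdp(rate, mle_pa, year, cutoff=1930):
    """Step 5: apply the comp rate to a season's MLE PA, but only from the cutoff year on."""
    return rate * mle_pa if year >= cutoff else 0.0


# Invented comps, purely to show the averaging.
placeholder_rate = average_rdp_per_pa([(10, 5000), (30, 10000)])

# The worked example: 0.0021 Rdp per PA over 620 MLE PA.
print(season_rdp(0.0021, 620, 1924))               # before the cutoff
print(round(season_rdp(0.0021, 620, 1930), 3))     # same PA in an eligible season
```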

Fielding

This may change. Baseball-Reference.com has very recently added Rfield calculations for all Negro Leagues players. Currently this method only includes DRA found at the Negro Leagues Database.

OK, this may seem a little convoluted, but it’s worth it because it pushes more fielding value into those seasons when players are normally at their fielding peak, which is appropriate. This example assumes only one position per player, but you can easily adapt this for players with more than one position.

1) Find the career fielding games and DRA for the player. For outfielders, use only his Range runs, skip the arm runs. Also, only include fielding games from among those seasons where a player’s DRA is calculated. Finally, if a player switched positions abruptly, use only the data from seasons he played the position.

Charleston played centerfield until about 1929. From 1915 until then he played 832 games there. He accrued 51.2 DRA.

2) Determine the player’s DRA/154 games.

Divide the DRA by the games and multiply by 154. For Charleston that makes 9.5 runs a year.

3) Adjust for sample. If the player has fewer than 308 games at the position (two seasons’ worth of games), multiply #2 by his defensive games at the position divided by 308. 

Charleston exceeded 308 games handily in centerfield.

4) Transform into Rfield by using the first table below.

We want to create a presentation that’s familiar for people, so we’re going to take the less heralded DRA and turn it into BBREF’s Rfield. We’ll use our friend z-scores for this. 

  • Divide the player’s DRA/154 by the figure for his position in the middle column of the table below (nglSD) to find out how many SDs he is from average.
  • Multiply that figure by the corresponding figure in the right column (mlbSD).

So we divide 9.5 / 17.0 to get 0.56, and we multiply that by 3.1, which gets us 1.7 Rfield per 154 games.
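Steps 2 through 4 can be sketched together. The SD pairs are copied from the NLDB DRA TO RFIELD BY POSITION table in this section, the function name is hypothetical, and the small-sample adjustment reflects one reading of step 3 (scale by games divided by 308). Using unrounded inputs, Charleston comes out near 1.75 rather than the rounded 1.7.

```python
# (nglSD, mlbSD) by position, from the table in this section.
SD_BY_POSITION = {
    "C": (10.02, 1.64), "1B": (8.3, 3.27), "2B": (18.42, 5.81), "3B": (18.4, 4.75),
    "SS": (18.47, 5.81), "LF": (15.12, 5.81), "CF": (17.01, 3.14), "RF": (14.14, 3.01),
}


def rfield_per_154(career_dra: float, games: int, position: str) -> float:
    """Convert career DRA at a position into Rfield per 154 games."""
    dra_per_154 = career_dra / games * 154           # step 2
    if games < 308:                                  # step 3 (assumed reading)
        dra_per_154 *= games / 308
    ngl_sd, mlb_sd = SD_BY_POSITION[position]
    return dra_per_154 / ngl_sd * mlb_sd             # step 4: z-score, then rescale


# Charleston in centerfield: 51.2 DRA in 832 games.
charleston_rate = rfield_per_154(51.2, 832, "CF")
print(round(charleston_rate, 2))
```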

5) Multiply his career MLE games at the position by step #4, his Rfield per 154 games. 

I assign players one position per year, which means I need to add together games from all seasons where I assigned him that position. In Charleston’s case, that’s 1678 games. Multiplying it by 1.7 Rfield per 154 games, I get 19.1 career field runs in centerfield.

6) Distribute the player’s career runs based on the second table below.

This is harder to explain than to do.

  • Find the median value on the table for all ages he played the position. The median for Charleston (ages 18 to 31) is 3.3 runs.
  • Subtract that median from the value for his age in the season in question. Charleston was 27, and the value for that age is 3.3. Subtracting one from the other gives us 0  runs.
  • Add that difference to the player’s career Rfield per 154 at the position:  0.0 + 1.7 = 1.7
  • Multiply the estimated career Rfield at the position in step #5 by the ratio of the figure we found in step #6 and the figures derived from all seasons at the position. For Charleston that’s 19.1 career Rfield * (1.7 Rfield/19.1 Rfield) = 1.9 Rfield
  • Finally, we need a step to correct the previous step in case it doesn’t generate the same career total we calculated in step #5. To do that we will divvy up any excess Rfield by playing time this way: The previous bullet’s total + ((our estimated career total minus the sum of Rfield at the position estimated in the previous step for all seasons at that position) * (his MLE games that year, divided by his career MLE games at the position)). Charleston’s totals matched, so no worry. And he ends up with 1.9 Rfield for the season.

NLDB DRA TO RFIELD BY POSITION
POS  nglSD  mlbSD

C    10.02   1.64
1B    8.3    3.27
2B   18.42   5.81
3B   18.4    4.75
SS   18.47   5.81
LF   15.12   5.81
CF   17.01   3.14
RF   14.14   3.01

The following table charts career fielding trajectories for all players. It’s based on the average DRA per 154 games of long-time players at all positions.

FIELDING CAREER TRAJECTORY
Age  DRA/154

17    -1.0
18    -2.4
19     2.1
20     3.4
21     3.7
22     3.7
23     4.1
24     4.1
25     4.2
26     3.8
27     3.3
28     3.0
29     2.5
30     2.1
31     1.5
32     1.3
33     0.4
34     0.2
35    -0.5
36    -1.1
37    -1.3
38    -2.3
39    -2.1
40    -1.8
41    -2.0
42    -1.1
43    -2.0
44    -2.9
45    -6.4
46    -4.7
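The first three bullets of fielding step 6 (median, age offset, adjusted rate) can be sketched with the trajectory table above. The function name is hypothetical, and the later bullets that reconcile season totals against the career total are left out. Note that the unrounded median for ages 18 to 31 is 3.35, which the walkthrough shows as 3.3, so this sketch returns 1.65 where the text arrives at 1.7.

```python
from statistics import median

# DRA per 154 games by age, copied from the FIELDING CAREER TRAJECTORY table.
TRAJECTORY = {
    17: -1.0, 18: -2.4, 19: 2.1, 20: 3.4, 21: 3.7, 22: 3.7, 23: 4.1, 24: 4.1,
    25: 4.2, 26: 3.8, 27: 3.3, 28: 3.0, 29: 2.5, 30: 2.1, 31: 1.5, 32: 1.3,
    33: 0.4, 34: 0.2, 35: -0.5, 36: -1.1, 37: -1.3, 38: -2.3, 39: -2.1,
    40: -1.8, 41: -2.0, 42: -1.1, 43: -2.0, 44: -2.9, 45: -6.4, 46: -4.7,
}


def age_adjusted_rate(career_rate, age, first_age, last_age):
    """Shift a career Rfield/154 rate by how far this age sits from the
    median trajectory value over the ages the player held the position."""
    baseline = median(TRAJECTORY[a] for a in range(first_age, last_age + 1))
    return career_rate + (TRAJECTORY[age] - baseline)


# Charleston at age 27, a centerfielder from ages 18 through 31.
print(round(age_adjusted_rate(1.7, 27, 18, 31), 2))
```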

Value

This is where I wave you goodbye. I try to stick as closely to Baseball-Reference.com’s value calculations as I can. That includes runs based on position, replacement runs, and all the background calculations for WAA and WAR. Please refer to their explainer for all details.

Once we do this for Charleston, we get an absolutely mammoth 11.6 WAR season, something straight out of the Babe Ruth catalog. He was one hell of a player.

Now you know how I do it. If it seems complicated to you, that’s because it is complicated. There are a lot of steps. But those steps are there for good reasons, and I hope the reasoning behind them is at least somewhat visible. It’s all in my noggin, you know, and when it comes to MLEs, I’ll quote Bob Dylan in “From a Buick 6,” “I need a dump truck…to unload my head.” Please feel free to ask clarifying questions in the comments. Happy to answer.
