Baseball on the Back of an Envelope
I've corked my slide rule!

8/2/2010: A couple of days ago (which is to say in the summer of 2010), the Colorado Rockies enjoyed a run of eleven consecutive hits against the Chicago Cubs. An 11-hit streak has never happened before, it says right here on the sports page.

Today's baseball season consists of 30 teams playing 162 games. Previous seasons contained fewer games. I'm not enough of a fan (or historian) to know if there were more or fewer teams in the early years, but a reasonable (enough) guess is that there have been on the order of 2,000 games played during every year of the modern era (since 1900). 2,000 games times 110 years is 220,000 games. Every game contains at least 54 at bats (roughly speaking: rain-shortened games have fewer, extra-inning games have more, games with the home team leading after eight innings have three fewer, games with lots of base runners have more, etc.). Let's call it 70 at bats per typical game. Given the open-ended nature of baseball, every single at bat is an opportunity for an 11-hit streak to begin (Colorado's began with two outs in the eighth, playing at home, with a one-run lead). In the history of the modern game, there have been 70 * 220,000 = 15,400,000 chances to produce an 11-hit streak. One did.

The observed frequency of 11-hit streaks is 1 in about 15,400,000 opportunities.
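The envelope arithmetic is easy to check. Here is a minimal Python sketch that uses only the rough guesses from the paragraph above; the variable names are mine, and none of these figures are real historical counts.

```python
# Rough guesses taken from the text, not real historical counts.
games_per_year = 2_000      # order-of-magnitude guess for every season since 1900
years = 110                 # 1900 through 2010
at_bats_per_game = 70       # "typical" game, both teams combined

games = games_per_year * years
opportunities = games * at_bats_per_game   # every at bat could begin a streak

print(f"{games:,} games, {opportunities:,} chances, 1 observed streak")
```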

Let b be the mean batting average of all the players in a lineup. If the likelihood of a batter getting a hit at any given at bat were independent of every other at bat (which it is not, but which is a useful approximation that ignores 90% of any manager's reason for living), then the likelihood of any particular at bat beginning an 11-hit streak would be b * b * b * b * b * b * b... well, anyway, b raised to the 11th power. The inverse (1 / (b ^ 11)) is the number of at bats before an 11-hit streak is likely to occur. Of course, going through the lineup, b varies from player to player over a range of something like 0.15 to 0.40 (taking b to correspond to a player's full season's batting average). Some outliers are below 0.15 and others are above 0.40 (where have you gone, Joe DiMaggio?). People with access to enough data could do Monte Carlo sims using actual lineups and actual batting averages, but adjusting for hot streaks and dry spells within the season for individual players might still become intractable. Let's keep it much simpler and look only at yearly averages for lineups as a whole. In fact, let's turn it around and ask what yearly average for all players for all time corresponds to the observed frequency of one 11-hit streak in the modern era. We know what the general range of plausibility is: no entire lineup will average .400 or more and no entire lineup will average .150 or less (if God is kind). The number we want must lie somewhere between those extremes.
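To see how much the player-to-player spread matters before it gets averaged away, here is a small Python sketch. It multiplies the averages of the next eleven batters, cycling through a nine-man lineup and treating every at bat as independent. The lineup numbers are invented for illustration; they are not any real team's averages, and the function name is my own.

```python
from math import prod

def streak_probability(lineup, start, r=11):
    """Chance that r consecutive batters, starting at lineup index `start`,
    all get hits, treating every at bat as independent."""
    return prod(lineup[(start + i) % len(lineup)] for i in range(r))

# Hypothetical season averages for a nine-man lineup (illustrative only).
lineup = [0.300, 0.280, 0.310, 0.270, 0.250, 0.240, 0.230, 0.220, 0.150]

for start in range(len(lineup)):
    p = streak_probability(lineup, start)
    print(f"streak starting at slot {start + 1}: about 1 in {1 / p:,.0f}")

# If every batter hits at the same average b, the product collapses to b ** 11.
uniform = [0.222] * 9
print(streak_probability(uniform, 0), 0.222 ** 11)
```

Because eleven at bats wrap around a nine-man order, two batters come up twice, so a streak that begins at the top of this made-up order is noticeably more likely than one that begins near the bottom.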

By now one has to wonder whether anything about a game as statistically rich and thoroughly storied as baseball can be captured in so few numbers. So crank the wheels and see what falls out. (Yes, I know we could solve for b directly, but it's very late, and who really wants to mess with logarithms, and isn't it easier to just type algebraic expressions into the Google calculator and see what happens?)

Take b to be 0.222

1 / (0.222 ^ 11) = 15,492,349

Which is just about right.

If b is taken to be 0.209, an 11-hit streak is only half as likely to have occurred in the modern era. (Then the expectation would be 1 in 30,087,829 at bats, so we might reasonably think we'd have to watch through the 2120 season to see one.) If b is assumed to be 0.237, then the calculated likelihood of an 11-hit streak would be twice the observed frequency (1 in 7,546,893 so we might expect it to have happened twice in fifteen million at bats).
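For anyone who would rather not poke at a calculator, the same numbers fall out of a few lines of Python. This is only a sketch under the assumptions already made above (independent at bats, roughly 15,400,000 opportunities); the constant names are mine. It solves 1 / b^11 = 15,400,000 for b directly, which should land close to the 0.222 used above, and then repeats the three trial values.

```python
R = 11                      # streak length
OPPORTUNITIES = 15_400_000  # the rough at-bat count estimated earlier

# Solve 1 / b**R = OPPORTUNITIES for b (no logarithms typed by hand).
b_direct = OPPORTUNITIES ** (-1 / R)
print(f"b solved directly: {b_direct:.4f}")

# At bats per expected streak for the three averages tried in the text.
for b in (0.209, 0.222, 0.237):
    at_bats = 1 / b ** R
    print(f"b = {b:.3f}: 1 in {at_bats:,.0f} at bats "
          f"({OPPORTUNITIES / at_bats:.2f} expected in the modern era)")
```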

My untutored impression of that simple computation is that it's not too bad. Those seem reasonable estimates for the day-in day-out season-in season-out batting average for all players for all time.

If so, then an 11-hit streak is flukey as hell on any given day, or in any given year, but if batters win their duels with pitchers between one fifth and one fourth of the time, you'd expect this kind of thing to happen about once every century and change.

Given all this, only one thing was 100% certain: if it ever happened, it had to happen to the Cubs.

Comments?
Send your quibbles and bits.

 

Of course, hardly any problem involving probabilities is as simple as it first appears. Keith Burgess-Jackson posted the notes above on his blog and has graciously put me in touch with his frequent contributor Mark Spahn, who sent his analysis of the likelihood of at least one streak of 11 or more hits occurring in the modern era. Note that in his computations, Mark accepts my approximation that there have been 15,400,000 at bats since 1900; don't hold quibbles with that against him. Mark begins with a restatement of the problem and provides both a formal solution and a table of probabilities for specified values of b:

Given that at every at-bat the probability that the batter gets a hit is b, what is the probability p(b) that in n consecutive at-bats there will be at least one run of at least r hits in a row (for fixed n and r)?

Let's first compute the probability that such a run never happens; then the asked-for probability will be 1 minus this no-such-run probability.

The notation "P{...}" means "the probability that ...", and ^ denotes exponentiation (= raising to a power).

P{no such run}
  = P{not getting r hits in a row}^(number of independent(?) opportunities to get r hits in a row)
  = (1 - P{r hits in a row})^(n-r+1)   [there are n-r+1 sets of r consecutive at-bats in n at-bats]
  = (1 - b^r)^(n-r+1)

So the asked-for probability is 1 - P{no such hitting streak} = 1 - (1 - b^r)^(n-r+1). In other words,

p(b) = 1 - (1 - b^r)^(n-r+1) . [1]

Setting d=b^r and m=n-r+1 for simplicity, we have

p(b) = 1 - (1 - d)^m [2]
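Formula [2] is short enough to evaluate anywhere. What follows is a minimal Python sketch of it, not Mark's own program; the function name and the default values for n and r (taken from the numbers used in this piece) are my choices. It leans on the same "independent(?)" treatment of overlapping windows as the derivation, and it uses log1p and expm1 only to keep precision when d is very small.

```python
from math import expm1, log1p

def p_at_least_one_run(b, n=15_400_000, r=11):
    """Formula [1]: 1 - (1 - b^r)^(n - r + 1), treating windows as independent."""
    d = b ** r          # chance that one particular window of r at-bats is all hits
    m = n - r + 1       # number of windows of r consecutive at-bats in n at-bats
    # -expm1(m * log1p(-d)) equals 1 - (1 - d)**m but stays accurate for tiny d.
    return -expm1(m * log1p(-d))

for b in (0.200, 0.215, 0.222, 0.229, 0.240):
    print(f"{b:.3f}  {p_at_least_one_run(b):.5f}")
```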

Here is a program for the Texas Instruments TI-84 programmable calculator that for a given N and R repeatedly takes in an "average batting average" B, and displays the probability P that in N at-bats there will be at least one run of at least R hits in a row.

PROGRAM: HITS
11 into R
15400000 into N
N-R+1 into M
Lbl B:Prompt B ; ask for B
B^R into D
1-(1-D)^M into P
Disp P ; show the run-of-hit probability
Goto B ; try another B

With this program, you can produce a table like

  b     p(b)
.100 .00015
.125 .00179
.130 .00276
.140 .00622
.150 .01323
.160 .02673
.170 .05141
.180 .09423
.190 .16422
.200 .27050
.204 .32440
.205 .33888
.210 .41692
.211 .43355
.212 .45048
.213 .46768
.214 .48514
.215 .50281
.216 .52068
.220 .59337
.221 .61166
.222 .62992
.223 .64810
.224 .66616 (about 2/3)
.225 .68406
.226 .70174
.227 .71917
.228 .73629
.229 .75307 (about 3/4)
.230 .76946
.232 .80090 (about 4/5)
.240 .90400
.250 .97457
.260 .99649
.270 .99981
.280 .99999716
.290 .9999999931
.300 1 (indistinguishable from certainty)
.400 1 (What does the graph look like?)

This is a very satisfying result. It says that the probability of getting a result at least as good as the 11-hit run reported is 50% with an average batting average b of .215, 63% if b=.222, 75% if b=.229, 90% if b=.240, and virtually 100% if b>.270.
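The same relationship can be run backwards: pick a probability and ask which b produces it. The Python sketch below inverts formula [1] algebraically. The function name is mine, n and r default to the values used throughout, and it rests on the same independence assumption as the formula itself. The results should round to the averages quoted above.

```python
from math import expm1, log

def b_for_probability(p, n=15_400_000, r=11):
    """Invert formula [1]: the b at which P{at least one run of r hits in n at-bats} = p."""
    m = n - r + 1
    # From p = 1 - (1 - d)^m we get d = 1 - (1 - p)^(1/m); expm1 keeps precision.
    d = -expm1(log(1 - p) / m)
    return d ** (1 / r)     # d = b^r, so b is the r-th root of d

for p in (0.50, 0.63, 0.75, 0.90):
    print(f"p = {p:.2f}  ->  b = {b_for_probability(p):.4f}")
```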

Problem solved.

-- Mark Spahn (West Seneca, NY)

 

 

 


 

 


                   © 2010, David Cortner