A few years back, one of my brothers raised an intriguing question while the two of us played a game of chess: considering the centuries of history that chess has enjoyed the world over, and the millions upon millions of games that must have been played in that time, it seems likely, maybe even certain, that the game we were playing at that moment had been played at least once before at some point in history. Every move that we made, from the first to the last, had in all likelihood been played in that exact same order by someone else, years or centuries before. It was a fascinating observation that really got me thinking - kind of the "if a tree falls in a forest..." question for nerdy gamers.
Well, a few months back, while pondering how spoiled we are to live in a world of Baseball Reference & Retrosheet, it occurred to me that, with the decades of baseball games under our belts, there might be two games that played out identically. In other words, are there any two games in baseball history whose scorecards would be indistinguishable if I picked them up side by side? Granted, one century of baseball games (and only 50 years of Retrosheet data) is not quite the same as the five centuries of worldwide recreational play that chess has seen, but it still seemed at least possible to find a pair of identical ballgames among all those seasons. And since the Retrosheet data is all there, ready to be queried, it just seemed irresponsible not to go digging.
Method
After getting a new, faster laptop with a bigger hard drive a couple of weekends back, I was finally able to follow Colin Wyers' instructions for Creating a Retrosheet Database. It's a great, simple set of instructions that helps you get everything you need to play around with Retrosheet's play-by-play data. If you are interested in that kind of thing, you owe it to yourself to read Colin's piece and get started. Be aware that running queries on a full Retrosheet database can take an incredibly long time, even on a fast computer.
Anyhow, with the Retrosheet database installed and ready to go on my computer, I decided to take some time this weekend to explore this question. The short answer is, of course, no: there are not two games in the Retrosheet era of baseball history that played out identically. If you think about the numbers involved, it's not surprising in the least: with at least a dozen standard outcomes available for each plate appearance (and another dozen or more possible but highly unlikely outcomes), and with 60 or 70 or even 80 plate appearances per game, the odds against two games being identical become astronomical. Since there are only 100,000 or so games in the Retrosheet era, it is by no means surprising that I couldn't find any matching games. Below, I describe the steps that I took to come to this (non-)conclusion and explore some of the results. I know the answer may be slightly dull, but the journey there was still pretty interesting.
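To put a rough number on "the odds," here is a back-of-the-envelope sketch in Python. It assumes, unrealistically, that every plate appearance has 12 equally likely outcomes and that every game has exactly 60 plate appearances; the function name is made up for illustration. Real outcomes are far from equally likely, which makes matches more probable than this model suggests, so treat the result as a very loose floor rather than an estimate.

```python
import math

def log10_expected_identical_pairs(num_games, outcomes_per_pa, pas_per_game):
    """log10 of the expected number of identical game pairs under a toy model.

    Assumption: every plate appearance is an independent draw from
    `outcomes_per_pa` equally likely outcomes (not true of real baseball).
    """
    # Number of distinct pairs of games: n * (n - 1) / 2.
    pairs = num_games * (num_games - 1) / 2
    # Number of equally likely game "scripts" is outcomes ** plate appearances;
    # work in logarithms so the huge number never has to be materialized.
    log10_scripts = pas_per_game * math.log10(outcomes_per_pa)
    return math.log10(pairs) - log10_scripts

exponent = log10_expected_identical_pairs(100_000, 12, 60)
print(f"expected identical pairs: about 10^{exponent:.0f}")
```

Under those assumptions the expected number of identical pairs comes out on the order of 10^-55. Even if the true chance of two plate appearances matching were much higher than 1 in 12, raising it to the 60th power still leaves a vanishingly small number.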
Like I said above, running a query on the full Retrosheet database can take an exceptionally long time, so the first thing I needed to do was cut the list down to a more manageable number of games. I did this by looking at every game in the database and finding any other games with identical end-of-game statistics. If two games had the same number of innings played and identical home and road runs, hits, errors, and men left on base, I marked them as a candidate pair. There were 3,479 such pairs of games.
Next, I went through each pair of games found above and counted the number of "events" (mostly plate appearances, though there are some other events that can sneak in there) in each. From there, if the two games in a given pair were found to have the same number of events, I set them aside for further study. After all, in order for two games to have played out identically, they must have the same number of events (batters, baserunners, etc). This left me with 608 pairs of games, each of which had the same home and road runs, hits, errors, and men left on base, and the same number of events and innings played, as its partner. If there ever had been two identical games played in baseball history, that pair of games would be somewhere on this list.
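For anyone who would rather see this first filter as code than as a database query, here is a minimal Python sketch of the same idea. The field names and the toy games are made up for illustration; a real Retrosheet database will name its columns differently, and this is not the query I actually ran.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names standing in for the end-of-game totals.
SUMMARY_FIELDS = ("innings", "home_runs", "road_runs", "home_hits", "road_hits",
                  "home_errors", "road_errors", "home_lob", "road_lob")

def candidate_pairs(games):
    """Return sorted pairs of game ids whose summary lines are identical."""
    buckets = defaultdict(list)
    for game in games:
        # Games with the same totals land in the same bucket.
        key = tuple(game[field] for field in SUMMARY_FIELDS)
        buckets[key].append(game["game_id"])
    pairs = []
    for ids in buckets.values():
        # Every two games sharing a bucket form one candidate pair.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)

def make_game(game_id, line):
    """Build a toy game record from an id and a tuple of summary totals."""
    return {"game_id": game_id, **dict(zip(SUMMARY_FIELDS, line))}

# Made-up data: games A and B share a summary line, C does not.
toy_games = [
    make_game("A", (9, 1, 0, 5, 3, 0, 1, 6, 4)),
    make_game("B", (9, 1, 0, 5, 3, 0, 1, 6, 4)),
    make_game("C", (9, 3, 4, 8, 9, 1, 0, 7, 7)),
]
print(candidate_pairs(toy_games))  # [('A', 'B')]
```

Bucketing by the summary line first means each game is only compared against games it could plausibly match, instead of against every other game in the database.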
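The event-count filter is simple enough to sketch in a few lines of Python. The row layout here, a list of (game id, event text) tuples, is an assumption made for the example, as are the toy values.

```python
from collections import Counter

def same_length_pairs(pairs, event_rows):
    """Keep only the pairs whose two games have the same number of events."""
    # Count how many event rows belong to each game id.
    counts = Counter(game_id for game_id, _event in event_rows)
    return [(a, b) for a, b in pairs if counts[a] == counts[b]]

# Made-up data: A and B have two events each, C has three.
toy_rows = [("A", "K"), ("A", "63"), ("B", "S9"), ("B", "K"),
            ("C", "8"), ("C", "W"), ("C", "K")]
print(same_length_pairs([("A", "B"), ("A", "C")], toy_rows))  # [('A', 'B')]
```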
Finally, the last step was to take this list and match each paired game up, event for event. Ideally, we would find a pair of games that has every event in Game A match up with the same event in Game B - that is, if the first at-bat in Game A was a 5-3 groundout to third, then the first at-bat in Game B would be too, and so on, from the top of the first to the bottom of the ninth (and beyond). This query would be the moment of truth, telling us if there were any identical games and, if not, what games might be the closest.
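The event-for-event comparison boils down to walking both games in order and counting the positions where the event text is exactly the same. Here is a minimal Python sketch; the function name and the sample event strings are illustrative only.

```python
def similarity(events_a, events_b):
    """Return (matching positions, fraction matching) for two event lists."""
    if len(events_a) != len(events_b):
        raise ValueError("games must have the same number of events")
    if not events_a:
        return 0, 0.0
    # Compare the first event to the first event, the second to the second, etc.
    matches = sum(a == b for a, b in zip(events_a, events_b))
    return matches, matches / len(events_a)

# Made-up four-event games: the first two events agree, the last two do not.
game_a = ["53", "K", "S9", "8"]
game_b = ["53", "K", "D8", "63"]
print(similarity(game_a, game_b))  # (2, 0.5)
```

A pair of truly identical games would come back with a fraction of 1.0; sorting the pairs by that fraction, highest first, gives the "closest" games.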
Results
I know, I've already ruined the surprise, but there are no two games (in the Retrosheet era, at least) that played out identically.
This may not be a surprise to some, as the odds were never in favor of it happening, but I'm still a little disappointed and surprised. "Disappointed" because, despite my knowledge of basic probabilities, I still was hoping to find something interesting. And "surprised" because, even without finding two identical games, I still expected to find a pair or two with significant similarities, and this just didn't really happen. Below are the ten "most identical" pairs of games in baseball history, as judged by the percentage of identical events between the two games.
PHI 0 @ CIN 1 (4/29/98) & CLE 0 @ DET 1 (10/1/70)
Num. of Events: 57 Num. of Similar Events: 10 Pct. of Similar Events: 18%
NYM 4 @ CIN 3 (6/3/94) & CAL 4 @ MIN 3 (4/17/85)
Num. of Events: 70 Num. of Similar Events: 12 Pct. of Similar Events: 17%
ARI 1 @ SD 3 (8/27/07) & CHC 1 @ LAD 3 (5/21/69)
Num. of Events: 64 Num. of Similar Events: 11 Pct. of Similar Events: 17%
WAS 2 @ FLA 5 (7/14/07) & SD 2 @ MON 5 (8/18/89)
Num. of Events: 68 Num. of Similar Events: 10 Pct. of Similar Events: 15%
COL 1 @ NYY 2 (6/8/04) & CAL 1 @ BAL 2 (9/26/65)
Num. of Events: 67 Num. of Similar Events: 10 Pct. of Similar Events: 15%
MIN 1 @ CAL 5 (5/5/96) & LAD 1 @ ATL 5 (5/11/68)
Num. of Events: 70 Num. of Similar Events: 10 Pct. of Similar Events: 14%
TEX 5 @ BAL 1 (8/16/75) & CAL 5 @ CHI 1 (9/10/72)
Num. of Events: 70 Num. of Similar Events: 9 Pct. of Similar Events: 13%
DET 2 @ CLE 5 (8/15/07) & PHI 2 @ SD 5 (7/6/94)
Num. of Events: 69 Num. of Similar Events: 9 Pct. of Similar Events: 13%
TBD 1 @ BOS 2 (8/14/07) & CHC 1 @ STL 2 (9/28/97)
Num. of Events: 68 Num. of Similar Events: 9 Pct. of Similar Events: 13%
NYM 5 @ STL 1 (8/16/77) & NYM 5 @ SD 1 (5/29/71)
Num. of Events: 78 Num. of Similar Events: 9 Pct. of Similar Events: 12%
If I were to list each of the "similar events" from each pair of games, you would see that the most common identical event is the strikeout (I can provide that list to anyone interested). Again, this makes a lot of sense: with balls in play, there are at least a dozen different outcomes for the batted ball, from a base hit to a 5-4-3 double play and more. With a strikeout, however, all that variance disappears and the pure result is recorded. I suspect that if I broke each at-bat down to its more basic outcome (i.e., "groundout" vs "4-3" or "flyball" vs "8"), then I would find many more matching events. I worry that this would distract some from the overall goal of finding identical games, but I think it's a worthwhile exploration. I've tried to come up with other ways to compare two games for similarities, but I haven't been able to think of anything else.
If anyone can come up with some other ways that I might want to compare two similar games for the purposes of finding the "most identical games in baseball history", I'm all ears. In the meantime, I hope you found this exploration at least a little interesting. I imagine I'll have a little more information on these "somewhat" identical games in the days to come, starting with the most common score in baseball history and going from there.
Update: This post has gotten a lot of attention, so I figured I'd repeat here what I wrote over at Baseball Think Factory yesterday:
I think I'm definitely going to run this list a couple more times, though. First off, the "hit" data (i.e., single, double, etc) in the Retrosheet file is very specific ("S9" means single to right, "D8" means double to center, etc), so when I'm matching the hits, they have to be perfectly identical to match. I think that's a little more precise than I need to be.
And that may also go for the outs. Does a 6-3 putout need to match another 6-3 putout to be considered identical, or can I just lump it as a "groundout" and have it match any other "groundout"? I mentioned that at the end of the article, but I think I definitely need to explore it. Plus, I need to sanitize the event data a little, too, to get rid of remarks like "63!".
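As a rough idea of what that lumping and sanitizing could look like, here is a small Python sketch. It is not something I have run against the real data. The category rules are assumptions: it treats any multi-fielder out like "63" as a groundout and any single-fielder out like "8" as a flyball, which will misclassify some plays (an unassisted groundout to first, for example), and it passes anything it doesn't recognize through unchanged.

```python
import re

def coarse_event(event):
    """Reduce a play-by-play event string to a coarse category (heuristic)."""
    # Strip remark characters such as the "!" in "63!".
    cleaned = re.sub(r"[!?#]", "", event)
    # Keep only the basic play, dropping anything after a "/" or ".".
    basic = re.split(r"[/.]", cleaned, maxsplit=1)[0]
    if basic.startswith("K"):
        return "strikeout"
    if basic == "W":
        return "walk"
    if basic.startswith("HR"):
        return "home run"
    # "S9" -> single, "D8" -> double, and so on, ignoring the location.
    hit = re.fullmatch(r"([SDT])\d*", basic)
    if hit:
        return {"S": "single", "D": "double", "T": "triple"}[hit.group(1)]
    if basic.isdigit():
        # Assumption: one fielder means a caught ball, several mean a groundout.
        return "flyball" if len(basic) == 1 else "groundout"
    # Anything else (double plays, errors, etc.) is left as-is in this sketch.
    return basic

print(coarse_event("63!"), coarse_event("S9"), coarse_event("8"))
```

With something like this in place, a 6-3 in one game and a 4-3 in the other would both count as "groundout" and match each other, and "S9" would match "S7".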
Finally, somebody commented on the piece that it might be interesting to see the games that have identical line scores across the 9 innings of play. I might run that too, just to see what it finds.
Still, for this level of precision (which is admittedly a little too high), it's pretty obvious that there aren't a whole lot of games that played out very similarly. I'm sure I'll find a larger number of similar games as I lower the precision, but I doubt I'll find anything as identical as I was hoping.