Some ruminations on “form”

In our latest Sinquefield Cup predictions, we listed co-leader Levon Aronian as having a 17% chance of winning the event. A fellow poster on a chess forum I frequent remarked that it was “bizarre” that his odds were so low, particularly in light of his “current tactical brilliancies”, and estimated that Aronian’s chances are probably closer to 30%.

It’s worth reiterating that our model does very little to attempt to account for a player’s “form”. We do use live ratings updated round by round, so Aronian is getting 11 rating points worth of credit for being in better form than we originally thought based on his pre-tournament rating. However, 11 points is not a tremendously large adjustment. We have Aronian at only 17% despite sharing the lead because he shares that lead with Carlsen, who is rated almost 83 points higher than him. Over four rounds, this leads to a significantly higher score expectation for Carlsen, particularly since Carlsen also faces an easier remaining schedule: they both still have to face Nakamura and Anand, but Aronian’s other two games are against the two strongest players (Carlsen and Topalov) while Carlsen’s other two games are against Aronian and Grischuk.

On top of this, we can add in Topalov at just half a point back, also with a relatively easy remaining schedule, and equity for some longshots, and Aronian’s 17% odds make sense – if we stick with the assumption that 2776 is an accurate assessment of his playing strength.

We are talking about someone who has been rated as high as 2830 though, and who spent a long time as the #2 player in the world and presumed greatest threat to Magnus’ reign as champion. What if he is “back in form” and we can validly give him a higher estimated playing strength? If we keep everything in the model the same, except that we bump Aronian’s rating up to 2820, his odds increase to about 27%. We can get his odds to 30% if we make his rating 2832.

So if you say “I think Aronian’s odds are 30%, not 17%”, you’re not necessarily disagreeing with the structure of the model. You’re just saying that you think Aronian is in good form during this event, and that you expect him to play at an effective strength of 2832 over the final four rounds. This isn’t particularly absurd; he certainly could do so.

Statistically, we are reluctant to give too much credence to the idea of large variations for “form”. Most such phenomena can be explained by random variance alone, and throughout sports a “hot streak” or someone being “clutch” or “in the zone” are generally just false narratives we throw around because they sound more interesting than “he got lucky”. In chess specifically, we apply this idea by assuming that a player’s live rating is the most accurate estimate we have available of his playing strength (it accounts for the most available data, after all). That said, not every player will always be accurately rated at any given time.

Since we’ve just established that unexpectedly good or bad results can happen in the course of normal statistical variance, we have to also grant that when those results happen to a player whose rating was previously accurate, that rating will get thrown out of kilter. Aronian’s rating plummeted through some very unexpectedly bad recent events. The proper Bayesian response is to factor those results in as additional data, on top of the results that got him the high rating in the first place, and re-evaluate him less favorably. This is what we’re doing when we use his current rating in our simulations. It’s also possible, though, that his older results, which brought his rating up over 2800, were accurate reflections of his underlying true long-term playing strength, and that the recent bad results were purely random variance and not reflective of a drop in his abilities. He might still be a 2800-strength, or even 2820-strength, player, and the uptick in his rating from good results so far in St. Louis might be regression to the mean, as his currently-too-low rating corrects itself through further random chance.

Of course Magnus Carlsen also entered this tournament coming off a bad result, and if Norway Chess was pure variance then perhaps he too is truly underrated. Maybe his real odds of winning are higher! And every other player may or may not be accurately rated as well! We can’t really be sure, so for the model’s sake we will continue to just use the live ratings and let the variance sort itself out over time.

I’m back!

Hi everybody!

I haven’t posted anything in quite a while. I apologize for that, and want to reassure my readers that I’m still alive, still working on the background data for my blog, and do not plan to remain dark. My last post came midway through Norway Chess, and I *did* continue running my simulations every round through the completion of that event. I didn’t post mostly because there wasn’t much interesting to say. Topalov jumped out to an early lead, and then just held it. Little changed over the final four rounds. Then my life got hectic, my wife and I moved to a new city, and the blog got left on the back burner.

I have been updating my database though. All the listings in our Prodigy Watch section are updated to include the August ratings list, and after updating those for players I was already tracking I then went through and added a bunch more players to the database. You can see that we now are tracking 499 players! In getting these names added, I focused on inclusion rather than research, and so spent less time than I normally would tracking down players’ birthdays. As such, our list of unknown birthdays is much larger than we’d like. Any help cleaning up this data would be greatly appreciated. In some cases, if these players turn out to have been born in November or December, they could be almost a full year younger than we are currently giving them credit for, which will make a huge difference when we calculate their place on the prodigy watch list, or check to see if they might have been record-setting prodigies at any particular age.

I’ll be writing more articles over the coming weeks. I’m gearing up for the Sinquefield Cup coming up later this month, as well as getting my simulation model finalized for next month’s World Cup. In between those major events, we’ll find time to talk about prodigies when we get the chance, and perhaps to discuss broader philosophical issues regarding chess ratings, such as the role of rating inflation in comparing current players to those from prior eras, or the impact of recent changes to FIDE’s rating formula (in particular, the increased K factor for U18 U2300 players). All sorts of stuff to cover. For now though, this is just a quick note that I’m still around. I hope some of you out there take that as good news, and are as excited to read my future content as I am to work on it!

Simulating The Grand Prix – Methodology

On our main page for the 2014-15 Grand Prix, you will notice that we list each player’s odds of finishing in the top two of the final Grand Prix standings – an important mark because those top two players earn berths in the 2016 Candidates Match, with a chance at the World Championship. If you’ve seen this, perhaps you have wondered: where did those numbers come from? How are they calculated, and how accurate are they?

So here I will discuss our methodology. First of all, the overview is relatively simple. I’ll go over that portion first, and dive deeper into the details (that may be less interesting to some readers) afterward. We have built a spreadsheet that estimates the odds of a white win, a black win, or a draw, for all 132 Grand Prix games left to play (66 in Tbilisi, 66 in Khanty-Mansiysk). Further, it can use those odds to randomly generate a result in each game, AND correctly calculate how the Grand Prix points would be awarded given those results, and who would therefore be the top two finishers.

Our simulation simply re-runs the randomizer a large number of times, recording the top two finishers each time, and spits out a result for each player: what percentage of the overall simulations had that player in the top two? Those are the odds we show you. How accurate are they? Not perfect, of course. The main source of error is that the individual game odds cannot be perfect. If you care about the details read on. Otherwise, suffice it to say that our estimates are probably the best available and if you’re interested in knowing who has the best chance of reaching the Candidates Match, you’re welcome to follow along with us! We will post regular updates throughout the upcoming Tbilisi event, and will show both the “current” odds and the “pre-Tbilisi” odds so that you can see exactly how much individual players have benefited or suffered from the results up until that point.
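For readers who prefer code to spreadsheets, here is a minimal Python sketch of the "re-run the randomizer and count" idea. It is not our actual spreadsheet: the function name `simulate_top_two` is made up for illustration, it covers a single toy event ranked by raw score rather than the real Grand Prix points system, and it breaks ties randomly, which is purely an assumption.

```python
import random
from collections import Counter


def simulate_top_two(players, games, n_sims=10_000, seed=0):
    """Estimate each player's chance of a top-two finish.

    games is a list of (white, black, p_white_win, p_draw) tuples.
    Simplification: one event, ranked by raw score, ties broken randomly.
    """
    rng = random.Random(seed)
    top_two_counts = Counter()
    for _ in range(n_sims):
        scores = {p: 0.0 for p in players}
        for white, black, p_win, p_draw in games:
            roll = rng.random()
            if roll < p_win:                 # White wins
                scores[white] += 1.0
            elif roll < p_win + p_draw:      # draw
                scores[white] += 0.5
                scores[black] += 0.5
            else:                            # Black wins
                scores[black] += 1.0
        # Hypothetical tiebreak: a random number orders players with equal scores.
        ranking = sorted(players, key=lambda p: (scores[p], rng.random()), reverse=True)
        top_two_counts.update(ranking[:2])
    # Share of simulations in which each player finished in the top two.
    return {p: top_two_counts[p] / n_sims for p in players}


# Toy usage: three players, every game a 20% / 60% / 20% split.
toy_games = [("A", "B", 0.2, 0.6), ("B", "C", 0.2, 0.6), ("C", "A", 0.2, 0.6)]
print(simulate_top_two(["A", "B", "C"], toy_games))
```

Because exactly two players are counted in every simulation, the returned percentages always add up to 200%, which is a handy sanity check on any table of top-two odds.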

That’s a lot of text, so before we continue with the nitty gritty, here are our “pre-Tbilisi” projections, in case you aren’t interested in clicking through to a different page to see them:

Player ODDS (PRE-TBILISI)
 Fabiano Caruana (ITA) 58%
 Alexander Grischuk (RUS) 40%
 Hikaru Nakamura (USA) 39%
 Maxime Vachier-Lagrave (FRA) 15%
 Anish Giri (NED) 14%
 Dmitry Andreikin (RUS) 10%
 Shakhriyar Mamedyarov (AZE) 6%
 Boris Gelfand (ISR) 6%
 Sergey Karjakin (RUS) 5%
 Peter Svidler (RUS) 5%
 Evgeny Tomashevsky (RUS) 2%
 Dmitry Jakovenko (RUS) 1%
 Baadur Jobava (GEO) 0%
 Leinier Dominguez (CUB) 0%
 Teimour Radjabov (AZE) 0%
 Rustam Kasimdzhanov (UZB) 0%

So first of all, our estimates of the win/draw/loss odds for each game are calculated based on Elo expectation, using current live ratings from 2700chess.com (one of the greatest things on the entire internet – if you’re interested enough in chess analysis to have read this far, and for some reason you don’t already have 2700chess bookmarked, go bookmark it now. We’ll wait.) Now this isn’t perfect. It works pretty well for players who are rated accurately, but that can never be everyone. The Elo system is pretty solid overall, but will always contain some underrated players and some overrated players at any given time. In the long run they balance out, but in the short run an underrated player will see their odds of finishing top-two in the Grand Prix badly understated by our formulas.

The other challenge in using Elo to estimate results of specific games is that Elo only actually gives an expected score, not an expected result. Draw rate remains an unknown. We have plans to do a detailed study of draw rates in the future, but for now we’re using a simple estimation that the base draw rate for equal players is 60%, and that draws become less likely as the gap between the players’ ratings gets wider. Specifically, the draw rate is 60% – (rating difference)/1000, so every 10 Elo points reduce the draw rate by 1 percentage point. A 2800 vs. a 2700 would presumably draw 50% of the time. This is awfully unscientific, but fortunately it doesn’t make a huge impact if we’re off slightly. We tried another simulation with a baseline draw rate of 70% instead of 60%, and only one player saw their ultimate odds shift by more than one percentage point. There’s a little error in our draw assumptions, but not a huge amount. Of course in reality, draw rates probably also vary by the individual playing styles; we would figure Jobava to draw less than our formulas predict, for example.

Nevertheless, that’s our method. And once draw rate is calculated, and given that we already have an expected score from Elo (1/(1+10^(rating differential/400)), where the differential is the opponent’s rating minus the player’s own), we can easily determine the needed win and loss odds to achieve the expected score: the win odds are the expected score minus half the draw rate, and the loss odds are whatever remains. One final important note is that we DO account for colors. White scores higher than 50%, of course, so in all our calculations of rating differential (both to determine draw rate and to determine estimated score) we add 40 points to white’s Elo before calculating the differential. A “perfectly even” match, in our estimation, with full 60% draw odds, and equal winning chances for white and black, would be a game where white is rated 2720 and black is rated 2760, for instance. So that is how we estimate the odds of each game. From there, everything else is a relatively simple Monte Carlo simulation. We run it a large number of times, and get our results!
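The per-game estimate described in this section can be sketched in a few lines of Python. The 40-point white bonus, the 60% base draw rate and the one-point-per-10-Elo reduction all come from the text; the function name `game_odds` and the clamping step (which only matters for very lopsided pairings) are assumptions added for this illustration.

```python
WHITE_BONUS = 40        # points added to White's rating, per the text
BASE_DRAW_RATE = 0.60   # draw rate for a "perfectly even" game, per the text


def game_odds(white_rating, black_rating):
    """Return (white_win, draw, black_win) probabilities for one game."""
    diff = (white_rating + WHITE_BONUS) - black_rating
    # Elo expected score for White (win = 1, draw = 0.5, loss = 0).
    expected = 1.0 / (1.0 + 10.0 ** (-diff / 400.0))
    # Draw rate falls by one percentage point per 10 points of rating gap.
    draw = BASE_DRAW_RATE - abs(diff) / 1000.0
    # Assumption (not in the text): clamp so no probability goes negative.
    draw = max(0.0, min(draw, 2 * expected, 2 * (1 - expected)))
    # Win odds are whatever is needed to reach the expected score.
    white_win = expected - draw / 2
    black_win = 1.0 - draw - white_win
    return white_win, draw, black_win


# The "perfectly even" game from the text: White 2720 vs. Black 2760.
print(game_odds(2720, 2760))
```

For the 2720 vs. 2760 example the adjusted differential is zero, so the sketch gives a 60% draw rate and splits the remaining 40% evenly between the two players.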

One other critical factor is the pairings. These are unknown until the day before play begins. Because each tournament is a 12-player round robin, meaning each player plays 11 games, half the field will get the black pieces 5 times, and half the field will get 6 blacks. This draw is important, given that we (correctly) factor in the white pieces as being worth an increase in expected score in a given game. When we first posted “pre-Tbilisi” odds, on February 11th, we did not yet know the pairings for Tbilisi (or for Khanty-Mansiysk of course). Foolishly, we ran our simulation with static pairings (giving the same six players the favorable treatment of getting white in 6 of their 11 games). We concluded that Grischuk (to whom we had generously given six whites in BOTH upcoming events) had a 46% chance of reaching the Candidates Match. When the Tbilisi pairings were released, and we re-ran our simulation with proper Tbilisi pairings, Grischuk’s odds (he will have six blacks in Tbilisi) dropped several percentage points! This shift made it clear how important the pairings are, and so we immediately re-designed our spreadsheet so that the pairings for Khanty-Mansiysk are randomly generated for each simulation. The odds now posted reflect these dynamic pairings within the simulation (and of course the actual pairings for Tbilisi, now that they are known), and we believe they are much more accurate than our previous posting.
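As an illustration of what "randomly generated pairings" can mean, here is one simple way to hand out colors for a single round robin in Python. This is only a sketch and not the logic in our spreadsheet: the function name `random_round_robin_colors` and the parity rule are assumptions, and it assigns colors only, without scheduling rounds (round order does not affect the final standings in this kind of simulation).

```python
import random


def random_round_robin_colors(players, rng=random):
    """Return a list of (white, black) pairs, one for every pair of players."""
    order = list(players)
    rng.shuffle(order)  # random drawing of lots
    games = []
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            # Parity rule (an assumption): an odd sum of lot numbers gives
            # White to the earlier lot, an even sum to the later one.
            if (i + j) % 2 == 1:
                games.append((order[i], order[j]))
            else:
                games.append((order[j], order[i]))
    return games


# Twelve placeholder names, one per seat in the event.
field = [f"Player {n}" for n in range(1, 13)]
games = random_round_robin_colors(field, random.Random(2015))
for p in field:
    print(p, sum(1 for white, _ in games if white == p), "whites")
```

With 12 players this rule always yields 66 games and gives six players 6 whites and the other six 5 whites; the shuffle decides which players land in the lucky half.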

Playing With Date Arithmetic (And The 10 Youngest 2600+ Players Ever)

In my future posts, you’ll probably hear a lot about players’ ages at the time they achieve various milestones. I will present those ages as numbers with (usually two) decimal places, rather than the common Years-Months-Days format. The latter is more easily understandable: we know exactly what it means when we say that, for instance, Wei Yi is the youngest player to achieve a rating of 2700+, at the age of 15 years, 8 months, and 29 days*. Saying that he achieved that milestone at the age of 15.75 years old actually makes this a poor example, because that’s pretty clear, but if the number were 15.63 instead it would be harder for our brains to immediately process it. We’re not used to thinking of ages in terms of fractions of years (except the “big” fractions like 1/2, 1/4, 3/4). We break years down into months, not increments of 0.01 year, and we break those months down into days.

Unfortunately, the Years-Months-Days format is also less accurate. To see why, let’s consider another example. Here are the 10 youngest players to achieve a rating of 2600 or higher:

Youngest Published 2600+ Rating
Player Name Age
Wei Yi 14.42
Wesley So 14.98
Teimour Radjabov 15.06
Magnus Carlsen 15.09
Sergey Karjakin 15.22
Ruslan Ponomariov 15.23
Illya Nyzhnyk 15.59
Fabiano Caruana 15.67
Anish Giri 15.67
Peter Leko 15.81

Note that there is an apparent tie for 8th place between Caruana and Giri. Were they actually the exact same age? Well, let’s look at Year-Month-Day format first. Caruana was born 7/30/1992, and achieved this milestone on the 4/1/2008 rating list. From July 30 1992 to March 30 2008 is 15 years and 8 months. March 30 to April 1 is two days, so he was 15 years, 8 months, and 2 days old. How about Giri? Born 6/28/1994, he broke the 2600 barrier on 3/1/2010. From 6/28/1994 to 2/28/2010 is 15 years, 8 months, and we add 1 day from 2/28 to 3/1. So it would appear that Giri achieved this milestone 1 day earlier than Caruana did! The chart above placed them in the wrong order, right?

Well, no. Let’s break Caruana’s age down further. The “15 years” component from 7/30/1992 to 7/30/2007 is 15 * 365 = 5475 days, right? Nope. Leap Year exists! That particular span includes three extra days: 2/29/1996, 2/29/2000 and 2/29/2004. So “15 years” in this case means 5478 days. What about the “8 months” portion? Well the months whose last day was included in that span are July through February, meaning we get 5*31 + 2*30 + 1*29 = 244 days out of the span (remember that 2008 was also a Leap Year). Finally we add in the “2 days” portion, for which no breakdown is needed, and we see that Caruana achieved his first 2600+ rating when he was 5478 + 244 + 2 = 5724 days old.

Giri’s “15 years” include four Leap Years, not three, so they cover 5479 days. His “8 months” run from June through January and do not include a February, so they cover 245 days. Together those two differences add two days to his total. So “15 years, 8 months, and 1 day” is, in his particular case, 5479 + 245 + 1 = 5725 days. Despite appearing to have been one day younger, using the more common format, it turns out that Giri was actually one day OLDER than Caruana. My chart above is correct after all.

Now of course it doesn’t matter at all which of two amazing players, currently ranked #3 and #4 in the World, got to 2600 one day faster than the other. However, this example serves perfectly to demonstrate why I will not use the Years-Months-Days format to express players’ ages. In fact, behind the scenes, all my ages are simply a number of days, but I don’t imagine anyone wants to know that Wei Yi was 5751 days old when he broke the 2700 barrier, so I divide ages by 365.25 (to account for Leap Year) and present them as just “years old”, rounded to the appropriate number of decimal places for the particular purpose in play.
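If you want to check the day counts yourself, Python’s standard `datetime` module does the leap-year bookkeeping for you. This is just a small sketch using the birth dates and rating-list dates quoted above; the helper names `age_in_days` and `age_in_years` are made up for the example.

```python
from datetime import date


def age_in_days(born, on):
    """Exact number of days between a birth date and a milestone date."""
    return (on - born).days


def age_in_years(born, on, places=2):
    """Age in years, dividing by 365.25 to account for leap years."""
    return round(age_in_days(born, on) / 365.25, places)


# (birth date, date of first published 2600+ rating), as given in the text.
caruana = (date(1992, 7, 30), date(2008, 4, 1))
giri = (date(1994, 6, 28), date(2010, 3, 1))

print(age_in_days(*caruana), age_in_years(*caruana))  # 5724 15.67
print(age_in_days(*giri), age_in_years(*giri))        # 5725 15.67
```

Both ages round to 15.67 years, which is why the chart shows an apparent tie, while the underlying day counts settle the order.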

I hope you appreciate my precision.

*This isn’t technically true yet, but it appears that it will become true on March 1st, when the next FIDE rating list is published.

Welcome to Chess by the Numbers!

For most of my life I have been obsessed with both statistics and with chess. In recent years, I have spent a lot of time combining these two fascinations, and analyzing chess statistics in a variety of ways. Now, I’ve decided it’s time to share my analyses as I perform them. So come join me and explore the world of chess stats!

You can expect to see lots of analysis of chess prodigies, a particular interest of mine. Simulated results of major tournaments. Scores, and frequency, of various different openings. And plenty of other concepts as well. If it is a form of chess analytics, I’ll be interested, and probably take a look at it!

One thing you WON’T find here is chess instruction, or much analysis of individual games. There are plenty of other places for you to improve your own chess skills. The focus here is on large sample sizes and broader statistical analysis. So if that sounds intriguing to you, then please follow along.