12 days ago, we published our “pre-Tbilisi” estimations of each player’s odds of finishing top-two in the final standings of the FIDE Grand Prix, and earning a berth in the 2016 Candidates Tournament. In that initial posting, we listed Evgeny Tomashevsky as having a meager 2% chance of climbing into that top-two tier. Today, nine undefeated rounds (and five wins) later, we just updated our odds to list Tomashevsky as having a 49% chance of reaching that same tier. In light of such a drastic shift, it makes sense to pause for a moment and ask ourselves: was the original prediction wrong?

The classic statistician’s defense would be to point out that we never claimed Tomashevsky had NO chance of reaching the Candidates Tournament (nor has he done so yet); we just considered it highly unlikely. Still, 2% means that, on average, one time out of fifty the event WILL occur. If you make a lot of predictions, then you will see “longshots” like this come through from time to time. So one possibility is that our prediction of Tomashevsky’s odds was perfectly right, given what we knew at the time, and that what we’re seeing from him in reality is his absolute best case scenario: a highly unlikely event coming to pass.
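To put a rough number on how often longshots come through, here is a minimal sketch. It assumes the predictions are independent of one another, which is not true of the players in a single Grand Prix (only two can finish top-two), so treat it as an illustration rather than a property of our model. The function name and the counts of predictions are made up for the example.

```python
def prob_at_least_one_hit(p: float, n: int) -> float:
    """Chance that at least one of n independent events, each with probability p, occurs."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # A single 2% longshot, then ten and fifty of them (hypothetical counts).
    for n in (1, 10, 50):
        print(n, round(prob_at_least_one_hit(0.02, n), 3))
```

With fifty independent 2% longshots, the chance that at least one of them comes through is better than even.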

However, it would be lazy of us to simply assume that was the case. It’s also possible that we made a mistake in building our model and underestimated his chances originally. Maybe what he’s done so far was more likely than we thought. So before we write this off as simple “positive variance” for Tomashevsky, let’s dig into the model a little deeper and probe for potential errors.

The most obvious consideration is the accuracy of the ELO system. Our entire model is built on the basis of using players’ ELO ratings to predict the results of each game. Tomashevsky’s odds before the tournament began were based on his rating of 2716, third lowest in the Tbilisi field. Many studies have shown that the ELO system as a whole is quite effective at predicting results in the broad sense, and we don’t feel inclined to consider any possibility that using ELO as our basis is methodologically unsound in general, but that doesn’t necessarily mean ELO is accurate for every player at all times. In fact it’s a given that at any time there must always be some players who are overrated and some others that are underrated. An ELO-based model will, of course, miscalculate the odds involving players whose rating is not accurate to that player’s “true playing strength”. Perhaps Tomashevsky was underrated before the event began, and his expected results were artificially deflated in our model, as a result?
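To make the “underrated player” concern concrete, here is a minimal sketch. It is not the code behind our simulations: it uses the common logistic form of the Elo expectation (FIDE’s own conversion tables differ slightly), and the “true strength” of 2750 and the opponent rating of 2750 are hypothetical values chosen only for illustration.

```python
def elo_expected_score(rating: float, opponent: float) -> float:
    """Expected score per game (win = 1, draw = 0.5) under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def modeled_vs_actual(listed: float, true_strength: float, opponent: float):
    """Expected score the model would use versus the one implied by true strength."""
    modeled = elo_expected_score(listed, opponent)
    actual = elo_expected_score(true_strength, opponent)
    return modeled, actual


if __name__ == "__main__":
    # 2716 is the listed rating from the text; the two 2750 values are assumptions.
    modeled, actual = modeled_vs_actual(2716.0, 2750.0, 2750.0)
    print(round(modeled, 3), round(actual, 3))
```

If a player’s listed rating sits below his true strength, the model’s per-game expectation for him is too low in every round, and that error compounds over a whole tournament.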

Here is a graph of Tomashevsky’s rating, by age, over the course of his life. The blue line shows every official published rating he has ever had, while the orange spike at the end shows his live ratings after each round at Tbilisi:

What we can see here is that a little over five years ago, at the age of 22.35 years old, Tomashevsky had a published rating of 2708 – his first rating above the 2700 barrier. His 47 published ratings since then, up to and including his February 2015 rating of 2716 that we mentioned earlier, have ranged from as low as 2695 to as high as 2740, but averaged 2714. We can drill in on just those last 5 1/2 years of his ratings history, and have Excel plot a trend line. This allows us to see whether he has been showing any signs of improvement (or decline) over the time span:

Our trend line is almost perfectly flat, sitting at his average rating of 2714 (it actually trends very slightly downward). In other words, we have over five years of professional results showing that, while Tomashevsky has of course fluctuated above and below his average at times, he has on the whole held a very steady rating of approximately 2715. This is quite a strong argument in favor of our model’s use of 2716 as his baseline rating in the original predictions. Additionally, with Tomashevsky almost 28 years old, there was no particular reason to suspect that he was about to make a sudden major improvement in his game and begin playing at a higher level.
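For readers who want to reproduce this kind of check without a spreadsheet, here is a minimal sketch of the same idea. Excel’s linear trend line is an ordinary least-squares fit, which is what the function below computes. The ages and ratings in the example are PLACEHOLDER values, not Tomashevsky’s actual rating history, and the function name is our own.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # PLACEHOLDER data for illustration only; substitute the real published ratings.
    ages = [22.0, 23.0, 24.0, 25.0, 26.0, 27.0]
    ratings = [2710.0, 2720.0, 2705.0, 2725.0, 2708.0, 2716.0]
    slope, intercept = linear_trend(ages, ratings)
    print("rating points per year:", round(slope, 2))
```

A slope near zero rating points per year is what a “plateau” looks like in this framing.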

We plan to eventually do a detailed study of “ratings plateaus”: long periods in a chess player’s career where he or she maintains a relatively consistent rating. Many players have notably had a prolonged plateau and then suddenly seen their rating spike; Maxime Vachier-Lagrave is one such example. We intend to examine factors such as age and the length of the plateau, and evaluate whether there might be any way to predict when a player is likely to make a “breakthrough”. Since that analysis lies in the future, we can’t prove our position mathematically at this time, but we suspect that any plateau model we might develop will probably NOT identify a 28 year old, whose plateau has lasted five or more years, as a “likely breakthrough candidate”.

If all our data suggests that 2716 was a very good baseline for Tomashevsky’s rating, in our initial simulations, then it suggests that our original idea might be right. The pre-Tbilisi odds were correct representations of the odds at that time, and what has happened so far truly has just been an extraordinary event. We can call it “positive variance” if we want to be clinical, or “good luck”, or “good fortune”, if we want to couch it in more mystical terms, or perhaps we could even say Evgeny has been “clutch” if we want to borrow a term from our favorite sportscasters. Regardless of the terminology, this raises a new intriguing consideration:

Might our CURRENT predictions be wrong?

Our odds of Tomashevsky reaching the Candidates Tournament depend heavily on what we think he will do in the final leg of the Grand Prix at Khanty-Mansiysk. Right now, we are predicting his results in that event using his current live rating of 2744.9, courtesy of 2700chess.com. This is the highest rating he has ever had in his life (or will be, once it is officially published by FIDE), and it comes courtesy of his remarkable results so far at Tbilisi. However, if we have established that 2715 (or so) is a good baseline for him, based on a large sample size of previous results, then isn’t he perhaps overrated now? If we’re deeming his results at Tbilisi to be above his reasonable expectation, and rejecting the theory that he has truly broken through his plateau and achieved objective 2750+ strength, then we also have to consider that by using his current live rating in our model we may be overestimating his chances at Khanty-Mansiysk. Maybe we should be expecting more regression to the mean, and lowering our expectations for him at that event, which would leave his odds of finishing in the final Grand Prix top-two at lower than our currently projected 49%.
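To get a feel for how much the choice of baseline matters, here is a minimal sketch comparing the two candidate ratings. It again uses the logistic Elo expectation rather than our actual simulation code, and the field of eleven opponents all rated 2750 is entirely hypothetical, not the real Khanty-Mansiysk lineup.

```python
def elo_expected_score(rating: float, opponent: float) -> float:
    """Expected score per game (win = 1, draw = 0.5) under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def expected_event_score(rating: float, opponents) -> float:
    """Sum of per-game expected scores against a list of opponent ratings."""
    return sum(elo_expected_score(rating, opp) for opp in opponents)


# Made-up field for illustration: eleven opponents, each rated 2750.
HYPOTHETICAL_FIELD = [2750.0] * 11

if __name__ == "__main__":
    for label, rating in (("plateau baseline", 2716.0), ("live rating", 2744.9)):
        print(label, round(expected_event_score(rating, HYPOTHETICAL_FIELD), 2))
```

Against this made-up field, the gap between the two baselines is roughly four hundredths of a point per game, or a bit under half a point over the event. That is a small shift in expected score, but in a tight race for the top two places it can move the final odds noticeably.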

However, we have one final idea to consider. Let us return, now, to a word we used earlier: “clutch”. The term is borrowed from the world of sports, where we often talk about athletes who have an uncanny ability to perform their best on the biggest stage: the basketball player who takes over a playoff game in the fourth quarter, for instance. Can chess players be “clutch”, and if so, is Tomashevsky?

If we look carefully at his six-year ratings graph, we will notice three major upward spikes. On the November 2011 ratings list, he gained 30 rating points, jumping from 2710 to 2740 (his highest published rating so far in his career). That rating then slowly sank downward to a nadir of 2703, before spiking back up to 2720 in October 2013. Then his rating sank steadily once more, until it sat at 2701, and then in the November 2014 rating list it spiked again, to 2714.

What events did Tomashevsky have such great success at, to cause these three spikes? The first spike included his rating gain from the 2011 World Cup. Although he was eliminated in the third round that year, he managed a performance rating for the event of 2800. The second spike is entirely the result of the 2013 World Cup, when despite being seeded 32nd, he managed to reach the semifinals, coming just one round from earning a spot in the 2014 Candidates Tournament, and posting a performance rating of 2813. And the final spike came from his results at Baku last fall, the first leg of the current Grand Prix that all these odds we keep mentioning refer to, where his performance rating was 2792. And of course we will see a fourth spike show up on the March 2015 rating list, when his absurdly great results at Tbilisi are factored in. His performance rating through round nine is an out-of-this-world 2969!

So in other words, Tomashevsky has spent the last five plus years demonstrating consistently that he is roughly a 2700-level player when the World Championship is not in play, but in the four events he has played during that span that serve as qualifiers for Candidates Tournaments (and potentially as steps towards the World Championship) his combined performance rating has been well above 2800.

Perhaps the answer is that Tomashevsky is simply an incredibly clutch chess player, who saves all his best efforts for events that might get him closer to an eventual World Championship. If that is the case, then thanks to his performance in Tbilisi, the 2016 Candidates Tournament is now well in his sights – and woe be unto his unfortunate opponents at Khanty-Mansiysk who must play the juggernaut that is Toma With Purpose, rather than the 2700ish Tomashevsky we see the rest of the time. If this idea has genuine merit, then perhaps we need to consider that we might still be underestimating his chances.

Ultimately we have no intention of changing our model at this time, but if “Clutch Tomashevsky” is a real thing then we may well have erred when we gave him only a 2% chance before Tbilisi began. Certainly if we had instructed the model to treat him as a 2800+ player, which so far he has consistently proven to be in World Championship qualifying events, then the model would have given him much higher odds.

Love this analysis — did you find Toma’s results statistically significant when World Championship in play vs. not in play?

I didn’t dig into it, but my intuition says that they’re probably not statistically significant. I chose not to analyze further because I felt the narrative was too fun, and didn’t want to risk ruining it. Our sample of “games with World Championship implications” is 38, and while he’s done excellently in those games, I don’t think that’s enough data to prove statistical significance (though I haven’t actually analyzed it in detail… so maybe it could be?)

Of course if he goes on to play well (performance ratings in the 2800+ range) at Khanty-Mansiysk, and at the Candidates Tournament, I really will have to do a serious analysis. Especially if his rating drops during that same span at any non “important” events he plays. Because if he is demonstrably clutch, at a statistically significant level, then I will need to factor that in to a hypothetical preview of his potential match with Magnus. What’s that you say? Getting ahead of myself? Nonsense!

In general (in other sports) I am not really a believer in the idea of athletes being clutch. I think it’s just us applying a fun narrative to statistical noise. It has occurred to me, though, that chess could actually be particularly suited to having “clutch” be a legitimate and realistic possibility, if someone holds on to their prep work and only springs new ideas in games that really matter to them. It’s a more reasonable mechanism than most sports offer for WHY players could be clutch in the first place.

There is not a single occurrence of “correlat” in this text.

My gut feeling says that a player’s performance has positive serial correlation, and that makes “extreme” outcomes more likely and “average” outcomes less likely. I might be wrong though; humans tend to misjudge the occurrence of long runs in random data.

But let me pursue the guesswork: A tournament is over a week or two. It could be one of your bad weeks (catching a cold, health not up to par, dumped by your sweetheart or even more distracting, found a new one). Or it could be one of your good weeks, where you do make the good strategic judgements, where you have just about the right confidence in your ideas to pursue the bold ones that win and not enough to jump at the careless ones, and winning one day you keep on doing the same thing the next day, which is your very best.

Any such effects will increase low-rated players’ chances of winning a tournament. Not their expected score, but expected score is not so interesting; say, in the Candidates’, who cares if you end up in the middle or last? It is about winning!
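One rough way to test this intuition is to compare a player whose per-game chances never change with one who has “good weeks” and “bad weeks” that average out to the same expected score. The sketch below is a deliberately crude assumption-laden model: it ignores draws entirely, treats a week as either good or bad with equal probability, and uses made-up numbers (eleven games, a 40% baseline, a swing of 15 points either way).

```python
from math import comb


def prob_score_at_least(n: int, k: int, p: float) -> float:
    """Chance of at least k wins in n independent games, each won with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_with_form_swings(n: int, k: int, p: float, swing: float) -> float:
    """Same tail probability when the whole event is a good or bad week (50/50)."""
    good_week = prob_score_at_least(n, k, p + swing)
    bad_week = prob_score_at_least(n, k, p - swing)
    return 0.5 * (good_week + bad_week)


if __name__ == "__main__":
    # Hypothetical underdog: 11 games, needs 9 wins for a dominant result.
    steady = prob_score_at_least(11, 9, 0.40)
    streaky = prob_with_form_swings(11, 9, 0.40, 0.15)
    print(round(steady, 4), round(streaky, 4))
```

Both players have the same expected score, but under these assumptions the streaky one reaches the extreme result several times more often, which is the effect described above.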

And then there is the issue that expected score does not take into account the variance of one game – the relative distribution of “two draws” vs “one win and one loss”. If you have expected score of 0.51 against me, composed of 98 draws and two wins in each hundred, I will never beat you in a head-to-head match (except when ties are broken by armageddon or lottery). If OTOH the .51 is composed of wins and losses – the extreme situation where we never draw – my chances are quite good.
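The two extremes in that example can be worked out exactly. The sketch below builds the distribution of the match margin game by game; the function name and the two-game match length are assumptions made for the illustration, while the win, draw and loss probabilities are the ones from the example (both give an expected score of 0.51 per game).

```python
def match_outcome_probs(p_win: float, p_draw: float, p_loss: float, games: int):
    """Return (favorite wins, match tied, underdog wins) for a match of `games` games."""
    # Map from margin (favorite's wins minus losses) to its probability.
    dist = {0: 1.0}
    for _ in range(games):
        nxt = {}
        for margin, prob in dist.items():
            for delta, p in ((1, p_win), (0, p_draw), (-1, p_loss)):
                if p > 0:
                    nxt[margin + delta] = nxt.get(margin + delta, 0.0) + prob * p
        dist = nxt
    favorite = sum(p for m, p in dist.items() if m > 0)
    tied = dist.get(0, 0.0)
    underdog = sum(p for m, p in dist.items() if m < 0)
    return favorite, tied, underdog


if __name__ == "__main__":
    # 98 draws and 2 wins per hundred: the underdog can never win a game.
    print(match_outcome_probs(0.02, 0.98, 0.0, games=2))
    # No draws at all: 51 wins and 49 losses per hundred.
    print(match_outcome_probs(0.51, 0.0, 0.49, games=2))
```

In the drawish case the underdog’s chance of winning the match outright is exactly zero, while in the all-decisive case the underdog wins a two-game match about 24% of the time, despite the identical expected score.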

I agree that Tomashevsky has become a clutch player. As he admitted @ Tbilisi, he relaxed and prepared for 2 months, which gave him an undue advantage over the other elite GMs (most of them burned out from preceding tournaments). I surmise that he does this for ALL tournaments integral to the Candidates.

However, the Khanty-Mansiysk Grand Prix is still months off, so most of the participants will have a fair length of time to relax and prepare. For example, I reckon that Caruana and Nakamura can be at their “clutch” peak, too.

My point here is that those who top the Khanty-Mansiysk leg will be the really strong GMs, not the “clutch” type. If Toma can do it again, I will take my hat off to him. My bets, though, are on Caruana and Nakamura. But Toma will lurk very closely behind and surprisingly “clutch” one slot of the Candidates tournament!

[…] points) for his efforts. The performance was so remarkable and unexpected that it prompted us to take a deeper look at whether we should have seen it coming. Now our model gives him a much more impressive 51.8% […]
