- 1. The Basic Units of the Analysis
- 2. Expected Score
- 3. Expected-Score Loss
- 4. WDL Accuracy
- 5. Game Accuracy
- 6. Mutual Accuracy
- 7. Performance Quality
- 8. Dominance
- 9. Volatility
- 10. RMS Expected-Score Loss
- 11. Error Concentration
- 12. Score, Expected Score, and Conversion
- 13. HardRAP and SoftRAP
- 14. How the Metrics Form One Whole
- 15. Why WDL Is Preferable Here to Pawn Evaluation
- 16. Why This Series Is Worth Doing
This article series studies World-Championship matches and World-Championship qualification runs with a modern engine-based method.
The basic idea is simple:
Put every move of great historical matches under Stockfish 18, translate each position into win/draw/loss chances, and then ask: who preserved winning chances better, who lost chances more often, who created volatility, who converted chances into points, and who stayed more consistent?
Stockfish 18 is far stronger than any human player. Its strength is so far above human World Champions that comparing it to humans by normal Elo becomes difficult. This makes it useful as a reference point: not because it “understands chess like a human,” but because it gives a very strong, consistent measuring stick.
The purpose is not to reduce chess greatness to one number. The purpose is to create a performance profile:
Did the player win by accuracy?
By lower error?
By better conversion?
By forcing volatility?
By making fewer severe mistakes?
By staying stable over many games?
By outperforming the opponent in critical moments?
The analysis uses three main tools:
- Stockfish 18 with WDL evaluation
- wdl_chess_analyzer_v2.py, which analyzes the PGN games with Stockfish
- make_wdl_report_v3.py, which turns the raw data into match reports with values, standard deviations, ratios, and player comparisons
1. The Basic Units of the Analysis
A match is made of games.
A game is made of turns.
One full turn contains:
White move + Black move
In engine language, each individual half-move is often called a ply.
So:
1 turn = 2 plies
If a game has:
n turns
then it has approximately:
2n plies
Each ply can be evaluated by Stockfish 18.
For every position, Stockfish can estimate:
Win
Draw
Loss
These are usually shown as values per 1000, for example:
WDL = 120, 850, 30
This does not mean Stockfish just played 1000 Monte Carlo games from the position. It is better understood as a calibrated probability-style model. In this example:
Win = 120/1000 = 0.120
Draw = 850/1000 = 0.850
Loss = 30/1000 = 0.030
So the position is mostly drawish, but with some winning chances.
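The conversion above can be sketched in a few lines of Python. This is only an illustration: the names WDL and wdl_from_permille are hypothetical and are not taken from the analysis scripts.

```python
from typing import NamedTuple


class WDL(NamedTuple):
    """Win/draw/loss shares as fractions that sum to 1."""
    win: float
    draw: float
    loss: float


def wdl_from_permille(win: int, draw: int, loss: int) -> WDL:
    """Convert per-1000 WDL values into fractions (hypothetical helper)."""
    total = win + draw + loss
    if total <= 0:
        raise ValueError("WDL values must sum to a positive number")
    # Dividing by the actual total tolerates small rounding in the engine output.
    return WDL(win / total, draw / total, loss / total)


# The example from the text: WDL = 120, 850, 30
example = wdl_from_permille(120, 850, 30)
```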
2. Expected Score
From Win, Draw, and Loss, we define:
Expected Score = ES
ES = W + 0.5D
A win is worth: 1
A draw is worth: 0.5
A loss is worth: 0
So if Stockfish says:
W = 0.20
D = 0.70
L = 0.10
then:
Expected Score = 0.20 + 0.5(0.70)
Expected Score = 0.55
That means the position is worth about 0.55 points.
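As a minimal sketch, the formula can be written as a small Python function. The function name and the sanity check on the inputs are assumptions made for illustration.

```python
def expected_score(win: float, draw: float, loss: float) -> float:
    """ES = W + 0.5 * D, with W, D, L given as fractions of 1."""
    # Loss does not enter the formula; it is only used to validate the input.
    if abs(win + draw + loss - 1.0) > 1e-6:
        raise ValueError("win, draw and loss should sum to 1")
    return win + 0.5 * draw


# The worked example from the text: W = 0.20, D = 0.70, L = 0.10
es_example = expected_score(0.20, 0.70, 0.10)
```

Here es_example comes out at about 0.55, matching the hand calculation.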
Useful for
Expected Score tells us how valuable a position is in actual chess result terms.
Reveals
It summarizes in one number how many points the player can expect from a position. By itself it does not say how that expectation splits between wins and draws.
If high
The player is doing well. A value near 1 means the position is close to winning.
If low
The player is doing badly. A value near 0 means the position is close to losing.
If both players are even
The position is balanced.
If uneven
One player has better practical chances.
3. Expected-Score Loss
Now we ask: how much did a move damage the player’s expected result?
We define:
Expected-Score Loss = ESLoss
ESLoss = max(0, ES_{best} – ES_{played})
where:
ES_{best} = Expected Score after Stockfish’s best move
ES_{played} = Expected Score after the move actually played
Example:
Stockfish’s best move keeps:
ES = 0.70
The player’s move gives:
ES = 0.55
Then:
ESLoss = 0.70 – 0.55 = 0.15
The move lost 0.15 expected points.
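A short Python sketch of this definition follows. It assumes both Expected Scores are already expressed from the point of view of the player who made the move; the function name is hypothetical.

```python
def es_loss(es_best: float, es_played: float) -> float:
    """Expected-Score Loss of one move, never below zero."""
    # max(0, ...) clamps the case where the played move evaluates slightly
    # better than the engine's first choice (for example through search noise).
    return max(0.0, es_best - es_played)


# The example from the text: best move keeps 0.70, played move gives 0.55
loss_example = es_loss(0.70, 0.55)

# A played move that evaluates marginally higher is counted as zero loss
no_loss = es_loss(0.50, 0.52)
```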
Useful for
Measuring how much a move damaged the player’s expected result.
Reveals
Mistakes in terms of lost scoring chances, not merely lost pawn evaluation.
If high
The move was costly. It may have turned a win into a draw, or a draw into a loss.
If low
The move preserved the player’s chances well.
If both players have similar losses
They made similar amounts of practical error.
If uneven
The player with higher losses gave away more scoring chances.
4. WDL Accuracy
From Expected-Score Loss, we define:
WDL Accuracy = 100 × (1 – ESLoss)
So if a move loses:
ESLoss = 0.00
then:
WDL Accuracy = 100
If a move loses:
ESLoss = 0.20
then:
WDL Accuracy = 80
If a move loses:
ESLoss = 0.50
then:
WDL Accuracy = 50
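The per-move formula can be sketched in Python as below. The text defines accuracy per move; taking a player's accuracy for a game as the plain average over that player's moves is an assumption of this sketch, and both function names are hypothetical.

```python
from typing import Sequence


def move_accuracy(loss: float) -> float:
    """WDL Accuracy of one move: 100 * (1 - ESLoss)."""
    return 100.0 * (1.0 - loss)


def player_accuracy(losses: Sequence[float]) -> float:
    """Average move accuracy over a player's moves (assumed aggregation)."""
    if not losses:
        raise ValueError("at least one move is required")
    return sum(move_accuracy(loss) for loss in losses) / len(losses)


# The three cases from the text
acc_perfect = move_accuracy(0.00)
acc_small = move_accuracy(0.20)
acc_large = move_accuracy(0.50)

# Averaged over those three moves
acc_player = player_accuracy([0.00, 0.20, 0.50])
```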
Useful for
Giving each move a simple accuracy score based on lost expected score.
Reveals
How well the player preserved their practical result chances.
If high
The player usually avoided damaging moves.
If low
The player often lost expected score.
If both players are even
They played at similar technical quality.
If uneven
The more accurate player preserved chances better.
5. Game Accuracy
For two players:
Accuracy(A) = Player A’s WDL Accuracy
Accuracy(B) = Player B’s WDL Accuracy
Then:
Game Accuracy = (Accuracy(A) + Accuracy(B)) / 2
Useful for
Measuring the average technical quality of the whole game.
Reveals
Whether the game was clean or error-filled overall.
If high
Both players together produced a technically accurate game.
If low
The game contained more lost expected score.
If both players are even
The game was balanced in accuracy.
If uneven
The average may hide that one player played much better than the other.
6. Mutual Accuracy
Mutual Accuracy = (Accuracy(A) × Accuracy(B)) / 100
This is stricter than Game Accuracy.
A game where both players score 95 is very strong mutually.
A game where one scores 99 and the other 80 may still have decent Game Accuracy, but lower Mutual Accuracy.
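To make the comparison concrete, here is a minimal Python sketch that computes Game Accuracy and Mutual Accuracy for the two cases just mentioned. The function names are hypothetical.

```python
def game_accuracy(acc_a: float, acc_b: float) -> float:
    """Arithmetic mean of the two players' accuracies."""
    return (acc_a + acc_b) / 2.0


def mutual_accuracy(acc_a: float, acc_b: float) -> float:
    """Product of the two accuracies, rescaled to the 0-100 range."""
    return acc_a * acc_b / 100.0


# Both players at 95
balanced_game = game_accuracy(95, 95)      # 95.0
balanced_mutual = mutual_accuracy(95, 95)  # 90.25

# One player at 99, the other at 80
uneven_game = game_accuracy(99, 80)        # 89.5
uneven_mutual = mutual_accuracy(99, 80)    # about 79.2
```

In the uneven case Game Accuracy falls by 5.5 points, while Mutual Accuracy falls by about 11, which is the stricter behavior described above.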
Useful for
Measuring two-sided quality.
Reveals
Whether both players maintained high standards.
If high
The game was a high-resistance struggle.
If low
At least one player lost significant expected score.
If both players are even
Mutual Accuracy and Game Accuracy tell a similar story.
If uneven
Mutual Accuracy punishes the imbalance more strongly.
7. Performance Quality
Performance Quality = PQ
For Player A:
Performance Quality(A) = Accuracy(A) × (Game Accuracy / 100)
For Player B:
Performance Quality(B) = Accuracy(B) × (Game Accuracy / 100)
This asks:
How impressive was this player’s accuracy, considering the quality of the whole game?
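A small Python sketch of Performance Quality, using illustrative accuracies of 99 and 80 that are not taken from any real match. The function name is hypothetical.

```python
def performance_quality(acc_player: float, acc_opponent: float) -> float:
    """PQ = player's accuracy scaled by the game's average accuracy."""
    game_acc = (acc_player + acc_opponent) / 2.0
    return acc_player * (game_acc / 100.0)


# Illustrative game: Player A at 99, Player B at 80 (Game Accuracy 89.5)
pq_a = performance_quality(99, 80)  # 99 * 0.895
pq_b = performance_quality(80, 99)  # 80 * 0.895

# The same 99 in a cleaner game (opponent at 97) earns a higher PQ
pq_a_clean = performance_quality(99, 97)
```

Because the scaling factor is shared, the opponent's accuracy raises or lowers both players' PQ together.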
Useful for
Adjusting individual accuracy by the game’s overall level.
Reveals
Whether a player’s accuracy came in a strong, mutually accurate game or in a weaker game.
If high
The player was accurate in a high-quality environment.
If low
Either the player’s accuracy was low, or the game quality was low.
If both players are even
They performed similarly in context.
If uneven
One player’s accuracy was more impressive in context.
8. Dominance
Dominance(A) = Accuracy(A) – Accuracy(B)
Dominance(B) = Accuracy(B) – Accuracy(A)
Useful for
Measuring the accuracy gap between the players.
Reveals
Who outplayed whom technically.
If high for one player
That player had a clear accuracy edge.
If close to zero
The players were technically close.
If even
The result may have depended on conversion, timing, or one critical mistake.
If uneven
One player was more accurate across the game or match.
9. Volatility
WDL Volatility = average absolute Expected-Score swing per move
In simpler terms:
How much did the game’s expected result move around?
A quiet technical game has low volatility.
A wild tactical game has high volatility.
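One way to compute this is sketched below in Python. It assumes the game is stored as a list of Expected Scores after each ply, all from one fixed perspective (for example White's), and that a "swing" is the change between consecutive entries. The function name and the sample numbers are illustrative only.

```python
from typing import Sequence


def wdl_volatility(es_series: Sequence[float]) -> float:
    """Mean absolute change in Expected Score between consecutive plies."""
    if len(es_series) < 2:
        raise ValueError("at least two evaluated positions are required")
    swings = [abs(b - a) for a, b in zip(es_series, es_series[1:])]
    return sum(swings) / len(swings)


# A quiet game: the evaluation barely moves
quiet = wdl_volatility([0.50, 0.52, 0.51, 0.53, 0.52])

# A sharp game: the evaluation swings back and forth
sharp = wdl_volatility([0.50, 0.55, 0.50, 0.70, 0.40])
```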
Useful for
Measuring instability, sharpness, and turning points.
Reveals
Whether the game was calm, tactical, chaotic, or full of swings.
If high
The game changed direction often or contained major turning points.
If low
The game was stable and controlled.
If both players are even
Both players contributed similarly to the game’s movement.
If uneven
One player’s moves were associated with more swings.
Important: high volatility is not automatically bad. It can mean brilliant complications, tactical shots, sacrifices, or mistakes. Volatility tells us that the game moved; accuracy tells us whether the movement was good or bad.
10. RMS Expected-Score Loss
Root Mean Square Expected-Score Loss = RMSESLoss
RMSESLoss = sqrt((1/n) ∑ ESLoss_i^2)
This punishes large errors more strongly than small ones.
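The effect is easy to see in a short Python sketch with two made-up loss lists that share the same mean. The function names and the numbers are illustrative assumptions.

```python
import math
from typing import Sequence


def mean_es_loss(losses: Sequence[float]) -> float:
    """Plain average of the per-move Expected-Score Losses."""
    return sum(losses) / len(losses)


def rms_es_loss(losses: Sequence[float]) -> float:
    """Root mean square of the per-move Expected-Score Losses."""
    return math.sqrt(sum(x * x for x in losses) / len(losses))


steady = [0.10, 0.10, 0.10, 0.10]    # four small inaccuracies
one_blunder = [0.0, 0.0, 0.0, 0.40]  # three perfect moves, one large error

# Both lists have a mean loss of 0.10 ...
mean_steady = mean_es_loss(steady)
mean_blunder = mean_es_loss(one_blunder)

# ... but the RMS is about 0.10 for the steady list and about 0.20 for the blunder
rms_steady = rms_es_loss(steady)
rms_blunder = rms_es_loss(one_blunder)
```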
Useful for
Measuring severe mistakes without arbitrary blunder labels.
Reveals
Whether a player’s errors were small and steady, or rare and severe.
If high
The player made one or more large practical mistakes.
If low
The player avoided major expected-score losses.
If both players are even
Their severe-error burden was similar.
If uneven
One player suffered more damaging mistakes.
11. Error Concentration
Error Concentration = RMSESLoss / MeanESLoss
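A minimal Python sketch of this ratio follows. Since RMS is never smaller than the mean for non-negative values, the ratio is at least 1, and it equals 1 only when every move loses the same amount. Returning None when the mean loss is zero is a choice made for this sketch, not something the definition above specifies.

```python
import math
from typing import Optional, Sequence


def error_concentration(losses: Sequence[float]) -> Optional[float]:
    """RMS loss divided by mean loss; None when no expected score was lost."""
    n = len(losses)
    if n == 0:
        raise ValueError("at least one move is required")
    mean = sum(losses) / n
    if mean == 0.0:
        return None  # ratio is undefined for a loss-free game (sketch's choice)
    rms = math.sqrt(sum(x * x for x in losses) / n)
    return rms / mean


# Evenly spread errors give a ratio of about 1
spread = error_concentration([0.10, 0.10, 0.10, 0.10])

# One large error among perfect moves gives a ratio of about 2
concentrated = error_concentration([0.0, 0.0, 0.0, 0.40])

# A game without any loss has no defined ratio
flawless = error_concentration([0.0, 0.0, 0.0])
```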
Useful for
Finding out whether errors are spread out or concentrated.
Reveals
The shape of a player’s mistakes.
If high
The player may have played well most of the time, but made a few big errors.
If low
The player’s errors were more evenly distributed.
If both players are even
Their error patterns were similar.
If uneven
One player’s errors were more concentrated into decisive moments.
12. Score, Expected Score, and Conversion
The actual game score is:
Score ∈ {1, 0.5, 0}
Then:
Conversion = Score – Expected Score
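The formula does not say which Expected Score is meant for a whole game. The Python sketch below assumes it is the player's average Expected Score over the evaluated positions of that game. That aggregation, the function name, and the sample numbers are assumptions for illustration only.

```python
from typing import Sequence


def conversion(score: float, es_series: Sequence[float]) -> float:
    """Actual score minus the player's mean Expected Score (assumed aggregate)."""
    if score not in (0.0, 0.5, 1.0):
        raise ValueError("score must be 0, 0.5 or 1")
    if not es_series:
        raise ValueError("at least one evaluated position is required")
    mean_es = sum(es_series) / len(es_series)
    return score - mean_es


# Illustrative Expected Scores for one player during a game (mean 0.65)
es_during_game = [0.50, 0.60, 0.70, 0.80]

won = conversion(1.0, es_during_game)    # scored more than the flow suggested
drawn = conversion(0.5, es_during_game)  # scored less than the flow suggested
```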
Useful for
Measuring whether the player scored more or less than the engine-flow suggested.
Reveals
Practical conversion, resilience, missed chances, and escape ability.
If high
The player scored more than expected.
If low or negative
The player scored less than expected.
If both players are even
The result matched the flow of the game.
If uneven
One player converted chances better than the other.
13. HardRAP and SoftRAP
Hard Result-Adjusted Performance = HardRAP
HardRAP = PQ × Score
This gives full credit for wins, half credit for draws, and zero credit for losses.
Soft Result-Adjusted Performance = SoftRAP
SoftRAP = PQ × (0.5 + 0.5 × Score)
This gives:
Win = 1.00
Draw = 0.75
Loss = 0.50
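Both formulas can be sketched in Python and applied to an illustrative Performance Quality of 90. The function names and the value 90 are assumptions for the example.

```python
def hard_rap(pq: float, score: float) -> float:
    """HardRAP = PQ * Score."""
    return pq * score


def soft_rap(pq: float, score: float) -> float:
    """SoftRAP = PQ * (0.5 + 0.5 * Score)."""
    return pq * (0.5 + 0.5 * score)


pq = 90.0  # illustrative Performance Quality

# Map each result to (HardRAP, SoftRAP)
rap_by_result = {
    "win": (hard_rap(pq, 1.0), soft_rap(pq, 1.0)),    # (90.0, 90.0)
    "draw": (hard_rap(pq, 0.5), soft_rap(pq, 0.5)),   # (45.0, 67.5)
    "loss": (hard_rap(pq, 0.0), soft_rap(pq, 0.0)),   # (0.0, 45.0)
}
```

The two measures agree for a win and drift apart as the result gets worse, so a well-played loss keeps half of its PQ under SoftRAP and nothing under HardRAP.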
Useful for
Combining quality and result.
Reveals
Whether good play became match points.
If high
The player both played well and scored well.
If low
The player either played poorly, failed to score, or both.
If HardRAP and SoftRAP differ greatly
The player may have had good-quality losses or draws that HardRAP punishes more strictly.
14. How the Metrics Form One Whole
The complete performance profile is not one formula only. It is more like a dashboard.
Each metric answers a different question.
WDL Accuracy
asks:
How well did the player preserve expected score move by move?
MeanESLoss
asks:
How much expected score did the player lose on average?
RMSESLoss
asks:
Were there big damaging mistakes?
ErrorConcentration
asks:
Were the errors spread out, or concentrated into a few disasters?
GameAccuracy
asks:
How clean was the whole game?
MutualAccuracy
asks:
Did both players play accurately?
PerformanceQuality
asks:
Was the player’s accuracy impressive in the context of the game?
Dominance
asks:
Who was more accurate?
Volatility
asks:
How unstable or swingy was the game?
Conversion
asks:
Did the player turn chances into actual points?
HardRAP and SoftRAP
ask:
How much performance quality became competitive success?
So a possible complete performance equation, not as a single final truth but as a useful profile, is:
Performance Profile = Accuracy + Error Control + Stability + Contextual Quality + Dominance + Conversion + Result Adjustment
More explicitly:
Performance Profile = WDLAccuracy + MeanESLoss + RMSESLoss + ErrorConcentration + GameAccuracy + MutualAccuracy + PQ + Dominance + Volatility + Conversion + HardRAP + SoftRAP
This does not mean we simply add all numbers together blindly. Instead, we read them as connected layers.
Example:
A player with high Accuracy, low Mean ES Loss, low RMS Loss, and low Volatility probably played clean, stable chess.
A player with high Accuracy, high Volatility, and high Conversion may have played sharp dynamic chess and handled complications well.
A player with similar Accuracy to the opponent but much better Conversion may have been more practical in decisive moments.
A player with good average Accuracy but high RMS Loss may have played well most of the time but made one or two decisive mistakes.
A player with high Dominance but poor Conversion may have outplayed the opponent but failed to score.
A player with high Mutual Accuracy in a match probably faced strong resistance.
This is the main point of the whole system:
Instead of saying only “Player X was accurate,” we can ask what kind of accuracy it was: stable, volatile, dominant, mutually resisted, result-converted, or error-prone in rare moments.
15. Why WDL Is Preferable Here to Pawn Evaluation
Traditional engine analysis often uses pawn units:
+1.00
-0.50
+3.00
These are useful, but they can be misleading.
A change from:
0.00 to -2.00
is usually enormous, because a drawn position may have become losing.
But a change from:
-8.00 to -10.00
may matter much less, because the position was already lost.
WDL Expected Score handles this more naturally. It asks:
How much did the player’s expected result change?
That is why this article series uses WDL-based metrics as the foundation.
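The contrast can be illustrated with a toy calculation in Python. The logistic curve below is NOT Stockfish's WDL model; it is a stand-in chosen only because it flattens out at large advantages, and its scale parameter is arbitrary. The function names are hypothetical.

```python
import math


def toy_expected_score(pawns: float, scale: float = 1.0) -> float:
    """Toy logistic mapping from a pawn evaluation to an expected score.

    Illustrative assumption only, not the engine's real conversion.
    """
    return 1.0 / (1.0 + math.exp(-pawns / scale))


def toy_es_drop(before: float, after: float, scale: float = 1.0) -> float:
    """Expected score lost when the pawn evaluation moves from before to after."""
    return toy_expected_score(before, scale) - toy_expected_score(after, scale)


# Both changes are worth two pawns ...
drop_from_equal = toy_es_drop(0.0, -2.0)
drop_when_lost = toy_es_drop(-8.0, -10.0)

# ... but under this toy curve the first costs far more expected score
ratio = drop_from_equal / drop_when_lost
```

Under this toy curve the first change costs roughly 0.38 expected points and the second well under 0.001, even though both look identical in pawn units.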
16. Why This Series Is Worth Doing
Chess history contains many claims:
Fischer was uniquely dominant.
Karpov was incredibly stable.
Kasparov was dynamic and deep.
Carlsen converted tiny edges better than anyone.
Older champions were brilliant but less engine-clean.
Modern games are more accurate but sometimes less decisive.
These claims are often true in spirit, but they are hard to compare precisely.
This project tries to make them more measurable.
By applying the same Stockfish 18 WDL method to different World-Championship runs and matches, we can compare champions through several lenses:
- accuracy
- resistance
- dominance
- volatility
- error severity
- consistency
- conversion
- result-adjusted performance
The final goal is not to replace human chess understanding. The goal is to make the structure of greatness more visible.
A champion may be great because he is cleaner.
Another may be great because he converts better.
Another may be great because he survives chaos.
Another may be great because he gives the opponent no chances.
These metrics help us describe those differences.