CHESS ANALYTICS 00.0: List of Other Chess Analytics Articles

This article series studies World-Championship matches and World-Championship qualification runs with a modern engine-based method.

The basic idea is simple:

Put every move of great historical matches under Stockfish 18, translate each position into win/draw/loss chances, and then ask: who preserved winning chances better, who lost chances more often, who created volatility, who converted chances into points, and who stayed more consistent?

Stockfish 18 is far stronger than any human player, so much so that placing it on the same Elo scale as human World Champions becomes difficult. This makes it useful as a reference point: not because it “understands chess like a human,” but because it gives a very strong, consistent measuring stick.

The purpose is not to reduce chess greatness to one number. The purpose is to create a performance profile:

Did the player win by accuracy?

By lower error?

By better conversion?

By forcing volatility?

By making fewer severe mistakes?

By staying stable over many games?

By outperforming the opponent in critical moments?

The analysis uses three main tools:

Stockfish 18 with WDL evaluation
wdl_chess_analyzer_v2.py, which analyzes the PGN games with Stockfish
make_wdl_report_v3.py, which turns the raw data into match reports with values, standard deviations, ratios, and player comparisons

1. The Basic Units of the Analysis

A match is made of games.

A game is made of turns.

One full turn contains:
White move + Black move

In engine language, each individual half-move is often called a ply.

So:
1 turn = 2 plies

If a game has:
n turns

then it has approximately:
2n plies (2n – 1 if the game ends on White's move)

Each ply can be evaluated by Stockfish 18.

For every position, Stockfish can estimate:

Win

Draw

Loss

These are usually shown as values per 1000, for example:

WDL = 120, 850, 30

This does not mean Stockfish just played 1000 Monte Carlo games from the position. It is better understood as a calibrated probability-style model. In this example:

Win = 120/1000 = 0.120

Draw = 850/1000 = 0.850

Loss = 30/1000 = 0.030

So the position is mostly drawish, but with some winning chances.
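As a minimal sketch of this conversion, the per-mille triple can be turned into probabilities as shown below. The function name `wdl_to_probabilities` is hypothetical and is not taken from the analysis scripts.

```python
def wdl_to_probabilities(win: int, draw: int, loss: int) -> tuple[float, float, float]:
    """Convert a per-mille WDL triple into probabilities that sum to 1."""
    total = win + draw + loss
    if total <= 0:
        raise ValueError("WDL values must sum to a positive number")
    # Dividing by the actual total (normally 1000) guards against rounding drift.
    return win / total, draw / total, loss / total


# The example triple from the text: WDL = 120, 850, 30
win_p, draw_p, loss_p = wdl_to_probabilities(120, 850, 30)
print(win_p, draw_p, loss_p)
```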

2. Expected Score

From Win, Draw, and Loss, we define:

Expected Score = ES

ES = W + 0.5D

A win is worth: 1

A draw is worth: 0.5

A loss is worth: 0

So if Stockfish says:

W = 0.20

D = 0.70

L = 0.10

then:

Expected Score = 0.20 + 0.5(0.70)

Expected Score = 0.55

That means the position is worth about 0.55 points.
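The same calculation can be written as a small function. This is only an illustration of the formula above; the name `expected_score` is an assumption, not the author's code.

```python
def expected_score(win: float, draw: float) -> float:
    """Expected Score: a win counts 1, a draw 0.5, a loss 0."""
    # The loss probability contributes nothing, so it is not needed here.
    return win + 0.5 * draw


# Worked example from the text: W = 0.20, D = 0.70, L = 0.10
es = expected_score(0.20, 0.70)
print(round(es, 2))
```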

Useful for

Expected Score tells us how valuable a position is in actual chess result terms.

Reveals

It condenses the win, draw, and loss chances into one number: the share of a point the player can expect from the position.

If high

The player is doing well. A value near 1 means the position is close to winning.

If low

The player is doing badly. A value near 0 means the position is close to losing.

If both players are even

The position is balanced.

If uneven

One player has better practical chances.

3. Expected-Score Loss

Now we ask: how much did a move damage the player’s expected result?

We define:

Expected-Score Loss = ESLoss

ESLoss = max(0, ES_best – ES_played)

where:

ES_best = Expected Score after Stockfish’s best move

ES_played = Expected Score after the move actually played

Example:

Stockfish’s best move keeps:

ES = 0.70

The player’s move gives:

ES = 0.55

Then:

ESLoss = 0.70 – 0.55 = 0.15

The move lost 0.15 expected points.
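A short sketch of this definition follows. The function name `es_loss` is hypothetical. The clamp at zero presumably covers cases where the played move ends up evaluated slightly above the engine's first choice, which can happen when the two evaluations come from searches of different depth.

```python
def es_loss(es_best: float, es_played: float) -> float:
    """Expected-Score Loss of a move, never negative."""
    # max(0, ...) clamps the rare case where the played move scores
    # higher than the engine's own first choice.
    return max(0.0, es_best - es_played)


# Worked example from the text: best move keeps 0.70, played move gives 0.55
loss = es_loss(0.70, 0.55)
print(round(loss, 2))
```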

Useful for

Measuring how much a move damaged the player’s expected result.

Reveals

Mistakes in terms of lost scoring chances, not merely lost pawn evaluation.

If high

The move was costly. It may have turned a win into a draw, or a draw into a loss.

If low

The move preserved the player’s chances well.

If both players have similar losses

They made similar amounts of practical error.

If uneven

The player with higher losses gave away more scoring chances.

4. WDL Accuracy

From Expected-Score Loss, we define:

WDL Accuracy = 100 × (1 – ESLoss)

So if a move loses:

ESLoss = 0.00

then:

WDL Accuracy = 100

If a move loses:

ESLoss = 0.20

then:

WDL Accuracy = 80

If a move loses:

ESLoss = 0.50

then:

WDL Accuracy = 50
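The three cases above can be reproduced with a one-line function. The second function is an assumption: the text does not state how per-move accuracies are combined into a player's accuracy for a game, so a plain average is used here for illustration only.

```python
def wdl_accuracy(es_loss: float) -> float:
    """Per-move WDL Accuracy on a 0-100 scale."""
    return 100.0 * (1.0 - es_loss)


def mean_wdl_accuracy(es_losses: list[float]) -> float:
    """ASSUMPTION: a player's game accuracy is the plain mean of per-move accuracies."""
    if not es_losses:
        raise ValueError("need at least one move")
    return sum(wdl_accuracy(x) for x in es_losses) / len(es_losses)


# The three example losses from the text
per_move = [wdl_accuracy(x) for x in (0.00, 0.20, 0.50)]
average = mean_wdl_accuracy([0.00, 0.20, 0.50])
print(per_move, round(average, 2))
```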

Useful for

Giving each move a simple accuracy score based on lost expected score.

Reveals

How well the player preserved their practical result chances.

If high

The player usually avoided damaging moves.

If low

The player often lost expected score.

If both players are even

They played at similar technical quality.

If uneven

The more accurate player preserved chances better.

5. Game Accuracy

For two players:

Accuracy(A) = Player A’s WDL Accuracy

Accuracy(B) = Player B’s WDL Accuracy

Then:

Game Accuracy = (Accuracy(A) + Accuracy(B)) / 2

Useful for

Measuring the average technical quality of the whole game.

Reveals

Whether the game was clean or error-filled overall.

If high

Both players together produced a technically accurate game.

If low

The game contained more lost expected score.

If both players are even

The game was balanced in accuracy.

If uneven

The average may hide that one player played much better than the other.

6. Mutual Accuracy

Mutual Accuracy = (Accuracy(A) × Accuracy(B)) / 100

This is stricter than Game Accuracy.

A game where both players score 95 is very strong mutually.

A game where one scores 99 and the other 80 may still have decent Game Accuracy, but lower Mutual Accuracy.
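To make the comparison concrete, the sketch below computes both measures for the two games just described. The function names are hypothetical; the formulas are the ones given above.

```python
def game_accuracy(acc_a: float, acc_b: float) -> float:
    """Arithmetic mean of the two players' accuracies."""
    return (acc_a + acc_b) / 2.0


def mutual_accuracy(acc_a: float, acc_b: float) -> float:
    """Product of the two accuracies, rescaled to 0-100."""
    return acc_a * acc_b / 100.0


balanced = (game_accuracy(95, 95), mutual_accuracy(95, 95))
lopsided = (game_accuracy(99, 80), mutual_accuracy(99, 80))
print(balanced, lopsided)
```

In the lopsided game the average drops by only 5.5 points relative to the balanced game, while the product drops by about 11 points.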

Useful for

Measuring two-sided quality.

Reveals

Whether both players maintained high standards.

If high

The game was a high-resistance struggle.

If low

At least one player lost significant expected score.

If both players are even

Mutual Accuracy and Game Accuracy tell a similar story.

If uneven

Mutual Accuracy punishes the imbalance more strongly.

7. Performance Quality

Performance Quality = PQ

For Player A:

Performance Quality(A) = Accuracy(A) × (Game Accuracy / 100)

For Player B:

Performance Quality(B) = Accuracy(B) × (Game Accuracy / 100)

This asks:

How impressive was this player’s accuracy, considering the quality of the whole game?
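As an illustration, here is the formula applied to a game in which one player scores 99 and the other 80. The helper names are assumptions made for this sketch.

```python
def game_accuracy(acc_a: float, acc_b: float) -> float:
    """Arithmetic mean of the two players' accuracies."""
    return (acc_a + acc_b) / 2.0


def performance_quality(own_acc: float, opp_acc: float) -> float:
    """A player's accuracy scaled by the overall accuracy of the game."""
    return own_acc * game_accuracy(own_acc, opp_acc) / 100.0


pq_a = performance_quality(99, 80)
pq_b = performance_quality(80, 99)
print(round(pq_a, 3), round(pq_b, 3))
```

Both players are scaled by the same Game Accuracy of 89.5, so the ranking between them is unchanged; what PQ changes is how a given accuracy compares across games of different overall quality.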

Useful for

Adjusting individual accuracy by the game’s overall level.

Reveals

Whether a player’s accuracy came in a strong, mutually accurate game or in a weaker game.

If high

The player was accurate in a high-quality environment.

If low

Either the player’s accuracy was low, or the game quality was low.

If both players are even

They performed similarly in context.

If uneven

One player’s accuracy was more impressive in context.

8. Dominance

Dominance(A) = Accuracy(A) – Accuracy(B)

Dominance(B) = Accuracy(B) – Accuracy(A)

Useful for

Measuring the accuracy gap between the players.

Reveals

Who outplayed whom technically.

If high for one player

That player had a clear accuracy edge.

If close to zero

The players were technically close.

If even

The result may have depended on conversion, timing, or one critical mistake.

If uneven

One player was more accurate across the game or match.

9. Volatility

WDL Volatility = average absolute Expected-Score swing per move

In simpler terms:

How much did the game’s expected result move around?

A quiet technical game has low volatility.

A wild tactical game has high volatility.
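One possible reading of this definition is sketched below. It assumes the input is the sequence of Expected Scores after each move, all taken from the same player's point of view; the text does not specify the exact bookkeeping, and the sample values are invented for illustration.

```python
def wdl_volatility(es_series: list[float]) -> float:
    """Mean absolute change in Expected Score between consecutive positions."""
    if len(es_series) < 2:
        return 0.0  # no swing can be measured from fewer than two positions
    swings = [abs(after - before) for before, after in zip(es_series, es_series[1:])]
    return sum(swings) / len(swings)


# Invented example sequences, not data from any real game
quiet_game = [0.50, 0.52, 0.51, 0.53, 0.52]
wild_game = [0.50, 0.80, 0.30, 0.90, 0.50]
print(wdl_volatility(quiet_game), wdl_volatility(wild_game))
```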

Useful for

Measuring instability, sharpness, and turning points.

Reveals

Whether the game was calm, tactical, chaotic, or full of swings.

If high

The game changed direction often or contained major turning points.

If low

The game was stable and controlled.

If both players are even

Both players contributed similarly to the game’s movement.

If uneven

One player’s moves were associated with more swings.

Important: high volatility is not automatically bad. It can mean brilliant complications, tactical shots, sacrifices, or mistakes. Volatility tells us that the game moved; accuracy tells us whether the movement was good or bad.

10. RMS Expected-Score Loss

Root Mean Square Expected-Score Loss = RMSESLoss

RMSESLoss = sqrt((1/n) × Σ (ESLoss_i)²)

This punishes large errors more strongly than small ones.
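A small sketch shows the effect. The two loss lists are invented so that both have the same mean loss of 0.05; only the RMS value separates them. The function name is hypothetical.

```python
import math


def rms_es_loss(es_losses: list[float]) -> float:
    """Root mean square of the per-move Expected-Score Losses."""
    if not es_losses:
        return 0.0
    return math.sqrt(sum(x * x for x in es_losses) / len(es_losses))


# Invented data: same average loss, very different distribution
steady = [0.05, 0.05, 0.05, 0.05]
one_blunder = [0.00, 0.00, 0.00, 0.20]
print(rms_es_loss(steady), rms_es_loss(one_blunder))
```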

Useful for

Measuring severe mistakes without arbitrary blunder labels.

Reveals

Whether a player’s errors were small and steady, or rare and severe.

If high

The player made one or more large practical mistakes.

If low

The player avoided major expected-score losses.

If both players are even

Their severe-error burden was similar.

If uneven

One player suffered more damaging mistakes.

11. Error Concentration

Error Concentration = RMSESLoss / MeanESLoss

where MeanESLoss is the player’s average Expected-Score Loss per move.
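The ratio can be sketched as follows. Because the RMS of a set of non-negative numbers is never smaller than their mean, the value is at least 1, and it equals 1 only when every move loses the same amount. The function names and the sample data are assumptions for illustration; returning NaN for a loss-free game is a choice made for this sketch, since the ratio is undefined there.

```python
import math


def mean_es_loss(es_losses: list[float]) -> float:
    """Average Expected-Score Loss per move."""
    return sum(es_losses) / len(es_losses)


def rms_es_loss(es_losses: list[float]) -> float:
    """Root mean square of the per-move Expected-Score Losses."""
    return math.sqrt(sum(x * x for x in es_losses) / len(es_losses))


def error_concentration(es_losses: list[float]) -> float:
    """RMS loss divided by mean loss; NaN when no expected score was lost."""
    mean = mean_es_loss(es_losses)
    if mean == 0:
        return float("nan")  # ratio is undefined for a loss-free game
    return rms_es_loss(es_losses) / mean


# Invented data: same average loss, spread out versus concentrated
spread_out = [0.05, 0.05, 0.05, 0.05]
concentrated = [0.00, 0.00, 0.00, 0.20]
print(error_concentration(spread_out), error_concentration(concentrated))
```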

Useful for

Finding out whether errors are spread out or concentrated.

Reveals

The shape of a player’s mistakes.

If high

The player may have played well most of the time, but made a few big errors.

If low

The player’s errors were more evenly distributed.

If both players are even

Their error patterns were similar.

If uneven

One player’s errors were more concentrated into decisive moments.

12. Score, Expected Score, and Conversion

The actual game score is:

Score ∈ {1, 0.5, 0}

Then:

Conversion = Score – Expected Score
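The formula leaves open which Expected Score is meant at game level. The sketch below assumes it is the player's average Expected Score over the positions of the game; that choice, the function name, and the sample values are all assumptions made for illustration, not the documented behaviour of the analysis scripts.

```python
def conversion(score: float, es_series: list[float]) -> float:
    """Actual score minus the player's average Expected Score (ASSUMED aggregation)."""
    if score not in (1.0, 0.5, 0.0):
        raise ValueError("score must be 1, 0.5, or 0")
    if not es_series:
        raise ValueError("need at least one evaluated position")
    mean_es = sum(es_series) / len(es_series)
    return score - mean_es


# Invented evaluations for a player who stood better through the game
es_during_game = [0.50, 0.60, 0.70, 0.80]
won_it = conversion(1.0, es_during_game)
only_drew = conversion(0.5, es_during_game)
print(round(won_it, 2), round(only_drew, 2))
```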

Useful for

Measuring whether the player scored more or less than the engine-flow suggested.

Reveals

Practical conversion, resilience, missed chances, and escape ability.

If high

The player scored more than expected.

If low or negative

The player scored less than expected.

If both players are even

The result matched the flow of the game.

If uneven

One player converted chances better than the other.

13. HardRAP and SoftRAP

Hard Result-Adjusted Performance = HardRAP

HardRAP = PQ × Score

This gives full credit for wins, half credit for draws, and zero credit for losses.

Soft Result-Adjusted Performance = SoftRAP

SoftRAP = PQ × (0.5 + 0.5 × Score)

This multiplies PQ by:

Win = 1.00

Draw = 0.75

Loss = 0.50
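The two adjustments can be compared side by side. The sketch uses a made-up Performance Quality of 90 and hypothetical function names.

```python
def hard_rap(pq: float, score: float) -> float:
    """Full credit for a win, half for a draw, none for a loss."""
    return pq * score


def soft_rap(pq: float, score: float) -> float:
    """Softer scaling: 1.00 for a win, 0.75 for a draw, 0.50 for a loss."""
    return pq * (0.5 + 0.5 * score)


pq = 90.0  # invented Performance Quality value
hard = [hard_rap(pq, s) for s in (1.0, 0.5, 0.0)]
soft = [soft_rap(pq, s) for s in (1.0, 0.5, 0.0)]
print(hard, soft)
```

The two measures agree on wins and diverge most on losses, which is where HardRAP discards the quality of play entirely.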

Useful for

Combining quality and result.

Reveals

Whether good play became match points.

If high

The player both played well and scored well.

If low

The player either played poorly, failed to score, or both.

If HardRAP and SoftRAP differ greatly

The player may have had good-quality losses or draws that HardRAP punishes more strictly.

14. How the Metrics Form One Whole

The complete performance profile is not one formula only. It is more like a dashboard.

Each metric answers a different question.

WDL Accuracy

asks:

How well did the player preserve expected score move by move?

MeanESLoss

asks:

How much expected score did the player lose on average?

RMSESLoss

asks:

Were there big damaging mistakes?

ErrorConcentration

asks:

Were the errors spread out, or concentrated into a few disasters?

GameAccuracy

asks:

How clean was the whole game?

MutualAccuracy

asks:

Did both players play accurately?

PerformanceQuality

asks:

Was the player’s accuracy impressive in the context of the game?

Dominance

asks:

Who was more accurate?

Volatility

asks:

How unstable or swingy was the game?

Conversion

asks:

Did the player turn chances into actual points?

HardRAP and SoftRAP

ask:

How much performance quality became competitive success?

So a possible complete performance equation, not as a single final truth but as a useful profile, is:

Performance Profile = Accuracy + Error Control + Stability + Contextual Quality + Dominance + Conversion + Result Adjustment

More explicitly:

Performance Profile = WDLAccuracy + MeanESLoss + RMSESLoss + ErrorConcentration + GameAccuracy + MutualAccuracy + PQ + Dominance + Volatility + Conversion + HardRAP + SoftRAP

This does not mean we simply add all numbers together blindly. Instead, we read them as connected layers.

Example:

A player with high Accuracy, low Mean ES Loss, low RMS Loss, and low Volatility probably played clean, stable chess.

A player with high Accuracy, high Volatility, and high Conversion may have played sharp dynamic chess and handled complications well.

A player with similar Accuracy to the opponent but much better Conversion may have been more practical in decisive moments.

A player with good average Accuracy but high RMS Loss may have played well most of the time but made one or two decisive mistakes.

A player with high Dominance but poor Conversion may have outplayed the opponent but failed to score.

A player with high Mutual Accuracy in a match probably faced strong resistance.

This is the main point of the whole system:

Instead of saying only “Player X was accurate,” we can ask what kind of accuracy it was: stable, volatile, dominant, mutually resisted, result-converted, or error-prone in rare moments.

15. Why WDL Is Preferable Here to Pawn Evaluation

Traditional engine analysis often uses pawn units:

+1.00

-0.50

+3.00

These are useful, but they can be misleading.

A change from:

0.00 to -2.00

is usually enormous, because a drawn position may have become losing.

But a change from:

-8.00 to -10.00

may matter much less, because the position was already lost.

WDL Expected Score handles this more naturally. It asks:

How much did the player’s expected result change?

That is why this article series uses WDL-based metrics as the foundation.
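The contrast between the two pawn-unit changes above can be illustrated with a simple logistic curve that maps centipawns to an expected score. This is not Stockfish's WDL model: the curve shape and the scale constant (the one Lichess documents for its win-percentage display) are stand-in assumptions used only to show the effect.

```python
import math

# ASSUMPTION: a plain logistic stands in for the engine's real WDL model.
# The scale constant is borrowed from Lichess's win-percentage formula.
SCALE = 0.00368208


def expected_score_from_cp(centipawns: float) -> float:
    """Illustrative mapping from a centipawn evaluation to an expected score."""
    return 1.0 / (1.0 + math.exp(-SCALE * centipawns))


def es_drop(cp_before: float, cp_after: float) -> float:
    """Expected score lost when the evaluation moves from cp_before to cp_after."""
    return expected_score_from_cp(cp_before) - expected_score_from_cp(cp_after)


drop_from_equal = es_drop(0, -200)       # 0.00 to -2.00
drop_when_lost = es_drop(-800, -1000)    # -8.00 to -10.00
print(round(drop_from_equal, 3), round(drop_when_lost, 3))
```

Both changes are two pawns, yet under this stand-in curve the first costs several times more expected score than the second.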

16. Why This Series Is Worth Doing

Chess history contains many claims:

Fischer was uniquely dominant.

Karpov was incredibly stable.

Kasparov was dynamic and deep.

Carlsen converted tiny edges better than anyone.

Older champions were brilliant but less engine-clean.

Modern games are more accurate but sometimes less decisive.

These claims are often true in spirit, but they are hard to compare precisely.

This project tries to make them more measurable.

By applying the same Stockfish 18 WDL method to different World-Championship runs and matches, we can compare champions through several lenses:

accuracy
resistance
dominance
volatility
error severity
consistency
conversion
result-adjusted performance

The final goal is not to replace human chess understanding. The goal is to make the structure of greatness more visible.

A champion may be great because he is cleaner.

Another may be great because he converts better.

Another may be great because he survives chaos.

Another may be great because he gives the opponent no chances.

These metrics help us describe those differences.