Analysis of Various Chess Openings Based on Outcomes of Randomly Sampled Chess Games

Collaborators: Viraj Boreda, Ishan Gopalani, Eric Huang, Ayush Sood

Introduction

The purpose of this project is to showcase the entire data science lifecycle. We have chosen to do this using a dataset containing information about many chess games, such as the outcome of each game, the reason for each game's termination, the rating of each player, and more. The specific parameter we have chosen to focus on from this dataset is the type of opening played in each game. The goal of our project is to analyze games played with different openings and see what we can learn about which openings are more favorable to play.

Chess has been around, in some form or another, for around 1,400 years, although the game as we know it did not arise until the 16th century. Today, Chess continues to be widely popular among all demographics, with the World Chess Championship drawing many eyes and Chess.com hosting over 100 million chess players. As Chess maintains its popularity, more and more people want to get into Chess as a hobby or improve their Chess skills. However, with a game that has such a rich history, the number of things to learn can be daunting. One specific skill that a good Chess player must have is the ability to both play and recognize different openings. A Chess opening consists of the first several moves of a game, and the way the opening plays out often has a significant effect on how the rest of the game goes. In the opening, players try to develop their pieces off their starting squares, set up attacks, and take control of the center of the board. The player who outperforms their opponent in the opening has the advantage going into the middlegame, as they are able to make more offensive moves while their opponent has to defend. In short, the opening is a crucial part of Chess. But the number of openings is itself far from minuscule, so even learning them can seem like a difficult feat. Our hope is that the results of our project can shed light on the best openings, narrowing the scope of which openings newer players should focus on if they wish to improve at the game of Chess.

Over this tutorial, we will traverse the Data Science lifecycle as follows:

  1. Data Collection
  2. Data Processing
  3. Exploratory Analysis/Data Visualization
  4. Model: Analysis, Hypothesis Testing, Machine Learning
  5. Interpretation of Results

1. Data Collection

The dataset we used for this project was acquired from Kaggle, and it contains 6.2 million games from lichess.org, a free and open-source chess server developed by a nonprofit company. Specifically, the dataset we used can be found here.

To get this dataset, we downloaded the CSV file directly. However, because this dataset contains 6.2 million games, we decided not to use the full dataset, both for ease of data analysis and because of file size restrictions on GitHub. Instead, we wrote code that keeps each game with probability 1/5, which randomly selects roughly 20% of the dataset, and proceeded with that sample.

First, we have to make the necessary imports. These are not specific to the Data Collection section of the project, but are the libraries that we use throughout the project.

In [ ]:
import csv
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import stockfish as sf
import os
import chess
from chess import pgn
import io
from dotenv import dotenv_values
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

We also mount Google Drive here, since we are using Colab and need to do this to access the necessary files.

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Next, we randomly select 20% of the dataset, as described above. However, you may notice that the code for this section produces two different csv files, one called games_20.csv and the other called games_25.csv. This is because we were initially unsure whether to randomly choose 20% or 25% of the dataset, so we decided to try both and ultimately settled on 20%.

In [ ]:
rows_20, rows_25 = [], []

with open('chess_games.csv') as infile:
    header = True
    for row in csv.reader(infile):
        if header:
            rows_20.append(row)
            rows_25.append(row)
            header = False
        else:
            x = random.randint(0, 4)
            if x == 0:
                rows_20.append(row)
            y = random.randint(0, 3)
            if y == 0:
                rows_25.append(row)

with open('games_20.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in rows_20:
        writer.writerow(row)

with open('games_25.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in rows_25:
        writer.writerow(row)
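
The loop above keeps each row independently with probability 1/5 (or 1/4), so the sample size is only approximately 20% and changes from run to run. The following is a minimal sketch of the same idea with a seeded generator so that the sample can be reproduced. The function name `sample_rows`, the seed value, and the in-memory demo data are assumptions made for illustration and are not part of the original notebook.

```python
import csv
import io
import random

def sample_rows(lines, fraction, seed):
    """Keep the header plus each data row independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded generator, so the same rows are chosen every run
    reader = csv.reader(lines)
    kept = [next(reader)]      # always keep the header row
    for row in reader:
        if rng.random() < fraction:
            kept.append(row)
    return kept

# Tiny in-memory stand-in for chess_games.csv (hypothetical data).
demo_text = "Event,Result\n" + "\n".join(f"Game {i},1-0" for i in range(1000))
sample = sample_rows(io.StringIO(demo_text), fraction=0.2, seed=320)
print(len(sample) - 1)  # close to 200, but not necessarily exactly 200
```

Because each row is an independent coin flip, the file never has to be held in memory in full before sampling, which matters for a 6.2 million row CSV.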

2. Data Processing

While the csv file contains a lot of useful data about the games, after using pandas to read the csv file into a dataframe, we felt that the way some of this information is represented could be improved. So, we decided to add a few columns to the dataframe. First, for context, here is the dataframe with no modifications, after just reading in the csv file.

In [ ]:
file_path = 'drive/My Drive/Colab Notebooks/CMSC320/Final/games_20.csv'
In [ ]:
file_path = 'games_20.csv'
In [ ]:
df = pd.read_csv(file_path)
df.head()
Out[ ]:
Event White Black Result UTCDate UTCTime WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN
0 Classical eisaaaa HAMID449 1-0 2016.06.30 22:00:01 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e...
1 Blitz tournament Shambobala cernunnoss 1-0 2016.06.30 22:00:02 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ...
2 Classical DARDELU chess4life54 0-1 2016.06.30 22:00:01 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N...
3 Blitz tournament lj1983 plmnnnn 1-0 2016.06.30 22:00:02 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ...
4 Blitz tournament yoyoparker philastro110 1-0 2016.06.30 22:00:02 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ...

The first order of business was to add a column that says whether white or black won the game in question. There is already a result column that says '1-0' if white won, '0-1' if black won, and '1/2-1/2' if the game ended in a draw, but we felt that using the values in this column every time we wanted to access the winner would be tedious and error-prone. So, we added a new column that says 'White' if white was the winner, 'Black' if black was the winner, and 'Draw' if the game ended in a draw. Below is the result.

In [ ]:
df = df.assign(Winner = 'White')
df.loc[df['Result'] == '0-1', 'Winner'] = 'Black'
df.loc[df['Result'] == '1/2-1/2', 'Winner'] = 'Draw'
df.head()
Out[ ]:
Event White Black Result UTCDate UTCTime WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner
0 Classical eisaaaa HAMID449 1-0 2016.06.30 22:00:01 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White
1 Blitz tournament Shambobala cernunnoss 1-0 2016.06.30 22:00:02 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White
2 Classical DARDELU chess4life54 0-1 2016.06.30 22:00:01 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black
3 Blitz tournament lj1983 plmnnnn 1-0 2016.06.30 22:00:02 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White
4 Blitz tournament yoyoparker philastro110 1-0 2016.06.30 22:00:02 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White
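
The cell above starts every row at 'White' and then overwrites the black wins and draws, so any result string other than the three listed would silently count as a white win. As a hedged alternative, the mapping can be written as an explicit lookup with a visible fallback. The label "Unknown" and the helper name `winner_from_result` are hypothetical, and we have not checked here whether the sample actually contains other result values.

```python
# Explicit lookup from the Result column's values to a winner label.
RESULT_TO_WINNER = {"1-0": "White", "0-1": "Black", "1/2-1/2": "Draw"}

def winner_from_result(result):
    """Return 'White', 'Black', or 'Draw'; anything unexpected becomes 'Unknown'."""
    return RESULT_TO_WINNER.get(result, "Unknown")  # "Unknown" is a hypothetical label

for result in ["1-0", "0-1", "1/2-1/2", "*"]:
    print(result, "->", winner_from_result(result))
```

With pandas, the same dictionary could be passed to `df['Result'].map(RESULT_TO_WINNER)`, which leaves unmatched values as NaN instead of a string label.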

Next, we wanted to add a column that contains the number of moves played in each game. We wanted this information because it would help us later on when analyzing opening efficiency.

In [ ]:
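def get_moves(an):
  # Collect every move number in the move text ("1.", "2.", ...) and keep the last one.
  moves = re.findall(r"\d+\.", an)
  last = moves[-1]
  # Strip the trailing period so only the digits remain.
  return re.sub("[^0-9]", "", last)
df['Moves'] = df['AN'].apply(get_moves)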
df.head()
Out[ ]:
Event White Black Result UTCDate UTCTime WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves
0 Classical eisaaaa HAMID449 1-0 2016.06.30 22:00:01 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38
1 Blitz tournament Shambobala cernunnoss 1-0 2016.06.30 22:00:02 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43
2 Classical DARDELU chess4life54 0-1 2016.06.30 22:00:01 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22
3 Blitz tournament lj1983 plmnnnn 1-0 2016.06.30 22:00:02 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21
4 Blitz tournament yoyoparker philastro110 1-0 2016.06.30 22:00:02 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62
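
To make the move-counting step concrete, here is a small self-contained sketch of the same regular-expression idea applied to a short move string. It returns an integer instead of a string, and the name `count_moves` and the fallback of 0 for empty input are assumptions for illustration.

```python
import re

def count_moves(an):
    """Return the last move number that appears in the move text, or 0 if there is none."""
    numbers = re.findall(r"(\d+)\.", an)  # digits immediately followed by a period
    return int(numbers[-1]) if numbers else 0

print(count_moves("1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5"))  # 4
print(count_moves("1. f3 e5 2. g4 Qh4# 0-1"))                    # 2
```

The result token at the end of the string ("0-1") contains no period, so it is never mistaken for a move number.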

The 'Opening' column is very specific here, in that it contains the specific variation of the opening played in each game, not the general opening. This is not a bad thing, but we felt that in our analysis, we would need quick access to the general opening (not the specific variation). Therefore, we decided to add a column to the dataframe that stores the general opening.

In [ ]:
df['GeneralOpening'] = df['Opening']
def make_general(opening):
  if ':' in opening:
    return opening.split(':')[0]
  else:
    return opening
df['GeneralOpening'] = df['GeneralOpening'].apply(make_general)
df.head()
Out[ ]:
Event White Black Result UTCDate UTCTime WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves GeneralOpening
0 Classical eisaaaa HAMID449 1-0 2016.06.30 22:00:01 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38 Slav Defense
1 Blitz tournament Shambobala cernunnoss 1-0 2016.06.30 22:00:02 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43 Scandinavian Defense
2 Classical DARDELU chess4life54 0-1 2016.06.30 22:00:01 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22 Italian Game
3 Blitz tournament lj1983 plmnnnn 1-0 2016.06.30 22:00:02 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21 Caro-Kann Defense
4 Blitz tournament yoyoparker philastro110 1-0 2016.06.30 22:00:02 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62 Scandinavian Defense

There are some columns in the dataframe that are not relevant to our analysis. Therefore, we removed the unnecessary columns before proceeding.

In [ ]:
df = df.drop('White', axis=1)
df = df.drop('Black', axis=1)
df = df.drop('UTCDate', axis = 1)
df = df.drop('UTCTime', axis = 1)
# Remove any other unnecessary columns
df.head()
df.to_csv("before_stockfish.csv")

In this next section, we used the Stockfish Python library to evaluate which player was winning at the end of the game. This is important because if someone resigned while they were winning by a large amount, for example, it would be unfair to count that resignation as a true win for the opening the opponent played. This is because the opening itself did not lead to a winning position. Instead, it was likely an outside factor such as the player who resigned having to quit the game.

The first code block just gets the path to the Stockfish engine from a .env file, as different team members may have it stored in different file paths.

In [ ]:
config = dotenv_values(".env")
stockfish_path = config.get('STOCKFISH_PATH')
stockfish = sf.Stockfish(stockfish_path, depth=10)

The next code block uses the python-chess library. The stockfish library requires board positions in FEN, but our dataset has the moves in PGN. Therefore, they need to be converted. The function below takes in a PGN and converts it to a FEN by reading the moves into the python-chess library, replaying them on a board, and returning the FEN of the final position.

In [ ]:
counter = 0
def fen_convert(moves):
    global counter
    counter += 1
    if counter % 100 == 0:
        print("finished processing {} games".format(counter))

    pgn1 = io.StringIO(moves)
    game = pgn.read_game(pgn1)
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
    return board.fen()
df['FEN'] = [fen_convert(moves) for moves in df['AN']]
df.head()
finished processing 100 games
finished processing 200 games
finished processing 300 games
finished processing 400 games
finished processing 500 games
...
finished processing 1251500 games
finished processing 1251600 games
finished processing 1251700 games
finished processing 1251800 games
finished processing 1251900 games
Out[ ]:
Event Result WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves General Opening FEN
0 Classical 1-0 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38 Slav Defense 3r2k1/6pp/p2P1prq/8/2p5/P1B1P2P/1P3PPQ/2RR2K1 ...
1 Blitz tournament 1-0 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43 Scandinavian Defense 6k1/5p1p/4p1p1/p2b4/P7/6KP/6P1/3q4 b - - 3 43
2 Classical 0-1 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22 Italian Game r7/p1pp2pp/1p4k1/3bp3/8/3P4/PPnB1K2/6R1 b - - ...
3 Blitz tournament 1-0 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21 Caro-Kann Defense 3r2k1/p3ppb1/2p4p/2B2P2/2p1P3/2N5/PPR3P1/6K1 b...
4 Blitz tournament 1-0 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62 Scandinavian Defense 4k3/5pP1/2p2P2/2Pp2K1/4p1P1/2P5/6b1/8 b - - 0 62
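
For readers unfamiliar with FEN, a FEN string has six space-separated fields: piece placement, side to move, castling rights, en passant target square, halfmove clock, and fullmove number. The sketch below only splits one of the strings from the table above into those fields; it does not validate the position, and the names `FenFields` and `parse_fen` are hypothetical helpers that are not part of python-chess.

```python
from typing import NamedTuple

class FenFields(NamedTuple):
    placement: str        # ranks 8 down to 1, separated by '/'
    side_to_move: str     # 'w' or 'b'
    castling: str         # e.g. 'KQkq', or '-' if no castling rights remain
    en_passant: str       # target square, or '-'
    halfmove_clock: int   # halfmoves since the last capture or pawn move
    fullmove_number: int  # starts at 1 and increases after Black's move

def parse_fen(fen):
    """Split a FEN string into its six fields (no legality checks)."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return FenFields(placement, side, castling, en_passant, int(halfmove), int(fullmove))

fields = parse_fen("6k1/5p1p/4p1p1/p2b4/P7/6KP/6P1/3q4 b - - 3 43")
print(fields.side_to_move, fields.fullmove_number)  # b 43
```

In this example the fullmove number matches the 43 in the Moves column for the same game.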

This code saves the FENs generated above to a csv file just in case, as the previous code can take over half an hour to run locally. The second code block reads from the file if necessary; otherwise, the FEN data is already in the dataframe from earlier.

In [ ]:
output_file = 'fen.csv'
f = open(output_file, 'w', encoding='utf-8')
for item in df['FEN']:
    f.write(item + "\n")
f.close()
In [ ]:
fen_df = pd.read_csv('fen.csv', header=None, names=['FEN'])
df = pd.concat([df, fen_df], axis=1)
df.head()
Out[ ]:
Event Result WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves General Opening FEN
0 Classical 1-0 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38 Slav Defense 3r2k1/6pp/p2P1prq/8/2p5/P1B1P2P/1P3PPQ/2RR2K1 ...
1 Blitz tournament 1-0 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43 Scandinavian Defense 6k1/5p1p/4p1p1/p2b4/P7/6KP/6P1/3q4 b - - 3 43
2 Classical 0-1 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22 Italian Game r7/p1pp2pp/1p4k1/3bp3/8/3P4/PPnB1K2/6R1 b - - ...
3 Blitz tournament 1-0 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21 Caro-Kann Defense 3r2k1/p3ppb1/2p4p/2B2P2/2p1P3/2N5/PPR3P1/6K1 b...
4 Blitz tournament 1-0 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62 Scandinavian Defense 4k3/5pP1/2p2P2/2Pp2K1/4p1P1/2P5/6b1/8 b - - 0 62

Currently, the dataset just marks games as either time forfeit, normal, abandoned, or rules infraction. However, normal includes both resignation and checkmate. In the checkmate cases, it is obvious who is in a winning position, so it is not necessary to run Stockfish to score those positions. Therefore, only normal games ending in resignation as well as time forfeit, abandoned, or rules infraction will be given a stockfish score.

In [ ]:
# A game ended in checkmate if it terminated normally and its move text contains the mate marker '#'.
df['Checkmate'] = [term == 'Normal' and '#' in an for term, an in zip(df['Termination'], df['AN'])]
df.head()
Out[ ]:
Event Result WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves General Opening FEN Checkmate
0 Classical 1-0 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38 Slav Defense 3r2k1/6pp/p2P1prq/8/2p5/P1B1P2P/1P3PPQ/2RR2K1 ... False
1 Blitz tournament 1-0 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43 Scandinavian Defense 6k1/5p1p/4p1p1/p2b4/P7/6KP/6P1/3q4 b - - 3 43 False
2 Classical 0-1 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22 Italian Game r7/p1pp2pp/1p4k1/3bp3/8/3P4/PPnB1K2/6R1 b - - ... False
3 Blitz tournament 1-0 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21 Caro-Kann Defense 3r2k1/p3ppb1/2p4p/2B2P2/2p1P3/2N5/PPR3P1/6K1 b... False
4 Blitz tournament 1-0 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62 Scandinavian Defense 4k3/5pP1/2p2P2/2Pp2K1/4p1P1/2P5/6b1/8 b - - 0 62 False
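
The distinction between checkmate and resignation described above can be sketched on plain strings. The version below is stricter than looking for '#' anywhere in the move text: it checks that the final move itself carries the mate marker. It assumes the move text may end with a result token such as "1-0", which is an assumption about the AN format; the name `ends_in_checkmate` is hypothetical.

```python
RESULT_TOKENS = {"1-0", "0-1", "1/2-1/2", "*"}

def ends_in_checkmate(termination, an):
    """True only for 'Normal' games whose final move ends with the mate marker '#'."""
    if termination != "Normal":
        return False
    tokens = an.split()
    # Drop a trailing result token if one is present (assumed format).
    if tokens and tokens[-1] in RESULT_TOKENS:
        tokens.pop()
    return bool(tokens) and tokens[-1].endswith("#")

print(ends_in_checkmate("Normal", "1. f3 e5 2. g4 Qh4# 0-1"))        # True
print(ends_in_checkmate("Normal", "1. e4 e5 2. Nf3 1-0"))            # False (resignation)
print(ends_in_checkmate("Time forfeit", "1. f3 e5 2. g4 Qh4# 0-1"))  # False
```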

This code finally takes the generated FENs and runs Stockfish on every game that did not end in checkmate. For games that did end in checkmate, it writes a default value of float('inf') if white won and float('-inf') if black won. If the Stockfish analysis finds a forced mate, float('inf') or float('-inf') is also written to the file accordingly. Otherwise, the centipawn evaluation is divided by 100 so that the score is expressed in pawns.

In [ ]:
output_file = 'stockfish_score.csv'
f = open(output_file, 'w', encoding='utf-8')

counter = 0
def stockfish_score(fen, checkmate, winner):
    global counter
    counter += 1
    if counter % 100 == 0:
        print("finished scoring {} games".format(counter))
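    if checkmate:
        # The game ended in mate, so the winner is known and no engine call is needed.
        f.write(("inf" if winner == 'White' else "-inf") + "\n")
        return
    stockfish.set_fen_position(fen)
    evaluation = stockfish.get_evaluation()
    if evaluation['type'] == 'mate':
        # Forced mate found: a negative value means black is the side delivering mate.
        if evaluation['value'] < 0:
            f.write("-inf" + "\n")
        else:
            f.write("inf" + "\n")
    else:
        # Convert centipawns to pawns.
        f.write(str(evaluation['value']/100) + "\n")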

_ = [stockfish_score(fen, checkmate, winner) for fen, checkmate, winner in zip(df['FEN'], df['Checkmate'], df['Winner'])]
df.head()
f.close()
finished scoring 100 games
finished scoring 200 games
finished scoring 300 games
finished scoring 400 games
finished scoring 500 games
...
finished scoring 1251500 games
finished scoring 1251600 games
finished scoring 1251700 games
finished scoring 1251800 games
finished scoring 1251900 games
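
The scoring rule described above can be isolated as a pure function, which makes it easy to check without running the engine. This is a sketch that assumes the evaluation is a dictionary with a 'type' key ('cp' or 'mate') and an integer 'value' key, as used in the cell above; the name `score_from_evaluation` is hypothetical.

```python
import math

def score_from_evaluation(evaluation):
    """Convert an engine evaluation dict into a score in pawns, or +/- infinity for forced mates."""
    if evaluation["type"] == "mate":
        # Negative mate values mean black is the side delivering mate.
        return -math.inf if evaluation["value"] < 0 else math.inf
    return evaluation["value"] / 100  # centipawns to pawns

print(score_from_evaluation({"type": "cp", "value": 616}))   # 6.16
print(score_from_evaluation({"type": "mate", "value": -3}))  # -inf
```

Python's `float("inf")` and `float("-inf")` parse the strings written to the csv file, so the infinite scores survive a round trip through text.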

After the Stockfish analysis is complete, we read the csv file containing its results into a dataframe and name its single column "Score", so that it is clear what the numbers represent. We then concatenate this dataframe with the dataframe containing the chess game data to add a score column.

In [ ]:
sf_df = pd.read_csv("stockfish_score.csv", header=None, names=['Score'])
df = pd.concat([df, sf_df], axis = 1)
df.to_csv('df_with_score.csv')
df.head()
Out[ ]:
Event Result WhiteElo BlackElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves GeneralOpening Score
0 Classical 1-0 1901 1896 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38.0 Slav Defense 6.16
1 Blitz tournament 1-0 1764 1773 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43.0 Scandinavian Defense -36.62
2 Classical 0-1 1649 1638 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22.0 Italian Game -6.36
3 Blitz tournament 1-0 1963 1979 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21.0 Caro-Kann Defense 4.56
4 Blitz tournament 1-0 1707 1739 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62.0 Scandinavian Defense 6.22

3. Exploratory Analysis and Data Visualization

In [ ]:
file_path = 'drive/My Drive/Colab Notebooks/AyushSood03.github.io/df_with_score.csv'
In [ ]:
file_path = 'df_with_score.csv'
In [ ]:
df = pd.read_csv(file_path)

Before we proceed, we need to do one last bit of data processing. When we got to this point, we realized that the "GeneralOpening" column was not cleaned up properly in the previous step. Specifically, all we did in the last step was address openings that have colons (":") in them. For example, an opening like "Scandinavian Defense: Mieses-Kotroc Variation" was shortened to just "Scandinavian Defense" in the "GeneralOpening" column. Unfortunately, this overlooks the fact that a few of the specific openings are separated by commas and not colons. For instance, "King's Gambit Accepted, Fischer Defense" should have been shortened to "King's Gambit Accepted" in the "GeneralOpening" column, but it was not. The next code block rectifies this issue, and also strips numbered suffixes that begin with " #". We do this here rather than in the data processing section because changing it there would mean running the converter and Stockfish scorer again to generate a new csv, which takes a very long time and a lot of storage space.

Additionally, we added the column "AvgElo" which is just the average of White and Black Elos for data visualization purposes.

In [ ]:
def make_general(opening):
  if ',' in opening:
    return opening.split(',')[0]
  if ' #' in opening:
    return opening.split(' #')[0]
  else:
    return opening
df['GeneralOpening'] = df['GeneralOpening'].apply(make_general)

df.insert(loc=5, column='AvgElo', value=((df['WhiteElo'] + df['BlackElo']) / 2))
# df['AvgElo'] = (df['WhiteElo'] + df['BlackElo']) / 2
df.head()
Out[ ]:
Unnamed: 0 Event Result WhiteElo BlackElo AvgElo WhiteRatingDiff BlackRatingDiff ECO Opening TimeControl Termination AN Winner Moves GeneralOpening Score
0 0 Classical 1-0 1901 1896 1898.5 11.0 -11.0 D10 Slav Defense 300+5 Time forfeit 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e... White 38.0 Slav Defense 6.16
1 1 Blitz tournament 1-0 1764 1773 1768.5 12.0 -12.0 B01 Scandinavian Defense: Mieses-Kotroc Variation 180+0 Time forfeit 1. e4 d5 2. exd5 Qxd5 3. Nf3 Nf6 4. Be2 c6 5. ... White 43.0 Scandinavian Defense -36.62
2 2 Classical 0-1 1649 1638 1643.5 -13.0 11.0 C57 Italian Game: Two Knights Defense, Traxler Cou... 900+3 Normal 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Bc5 5. N... Black 22.0 Italian Game -6.36
3 3 Blitz tournament 1-0 1963 1979 1971.0 12.0 -13.0 B10 Caro-Kann Defense: Two Knights Attack 180+0 Normal 1. e4 c6 2. Nf3 d5 3. Nc3 g6 4. d3 Bg7 5. Be2 ... White 21.0 Caro-Kann Defense 4.56
4 4 Blitz tournament 1-0 1707 1739 1723.0 13.0 -12.0 B01 Scandinavian Defense 300+0 Normal 1. e4 d5 2. e5 Bf5 3. d4 e6 4. Nc3 Nc6 5. Nf3 ... White 62.0 Scandinavian Defense 6.22
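
The two cleaning passes (colons first, then commas and " #") could also be expressed as a single split at whichever delimiter appears first. The sketch below is one way to do that; it can differ from the two-pass version in edge cases where " #" appears before a comma. The name `general_opening` and the opening "Example Opening #2" are hypothetical.

```python
import re

def general_opening(opening):
    """Keep only the text before the first ':', ',' or ' #' in an opening name."""
    return re.split(r":|,| #", opening, maxsplit=1)[0].strip()

examples = [
    "Scandinavian Defense: Mieses-Kotroc Variation",
    "King's Gambit Accepted, Fischer Defense",
    "Slav Defense",
    "Example Opening #2",  # hypothetical name with a numbered suffix
]
for name in examples:
    print(name, "->", general_opening(name))
```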

In the next step, we create a new dataframe. Each row in this dataframe represents a general opening. Its columns contain the number of white wins, black wins, draws, and total games in which that opening was played. Each row also has the ratio of white wins to games, black wins to games, and draws to games. Knowing these ratios will be useful in determining which general opening has a higher chance of leading to victory. In addition, only general openings played in over 1,000 games are included in this dataframe to ensure that the sample size is large enough to be reliable.

In [ ]:
genOpenInfo = np.array([])
for genOpen in np.sort(pd.unique(df.GeneralOpening)):
    genOpenMatch_df = df[df["GeneralOpening"] == genOpen]
    whiteWin_df = genOpenMatch_df[genOpenMatch_df["Winner"] == "White"]
    blackWin_df = genOpenMatch_df[genOpenMatch_df["Winner"] == "Black"]
    draw_df = genOpenMatch_df[genOpenMatch_df["Winner"] == "Draw"]
    genOpenInfo = np.append(genOpenInfo, [[len(whiteWin_df.index), len(blackWin_df.index), len(draw_df.index)]])
genOpenInfo = np.reshape(genOpenInfo, (pd.unique(df.GeneralOpening).size, 3))

ratio_df = pd.DataFrame(data = genOpenInfo, index = np.sort(pd.unique(df.GeneralOpening)), columns = ["WhiteWins", "BlackWins", "Draws"])
ratio_df["Games"] = ratio_df["WhiteWins"] + ratio_df["BlackWins"] + ratio_df["Draws"]
ratio_df["WhiteWinRatio"] = ratio_df["WhiteWins"] / ratio_df["Games"]
ratio_df["BlackWinRatio"] = ratio_df["BlackWins"] / ratio_df["Games"]
ratio_df["DrawRatio"] = ratio_df["Draws"] / ratio_df["Games"]
ratio_df = ratio_df[ratio_df["Games"] > 1000]
ratio_df.head()
Out[ ]:
WhiteWins BlackWins Draws Games WhiteWinRatio BlackWinRatio DrawRatio
Alekhine Defense 6778.0 7359.0 537.0 14674.0 0.461905 0.501499 0.036595
Benoni Defense 3357.0 3752.0 301.0 7410.0 0.453036 0.506343 0.040621
Bird Opening 6411.0 6150.0 461.0 13022.0 0.492321 0.472278 0.035402
Bishop's Opening 11526.0 9373.0 788.0 21687.0 0.531470 0.432194 0.036335
Blackmar-Diemer Gambit 2099.0 1902.0 137.0 4138.0 0.507250 0.459642 0.033108
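
As a worked check of how the ratio columns are derived, the sketch below tallies outcomes per opening from (opening, winner) pairs and then recomputes the Alekhine Defense row from the counts shown in the table above. The helper names `tally_outcomes` and `outcome_ratios` and the three-game demo list are assumptions for illustration.

```python
from collections import Counter, defaultdict

def tally_outcomes(games):
    """Count winners per opening from an iterable of (general_opening, winner) pairs."""
    counts = defaultdict(Counter)
    for opening, winner in games:
        counts[opening][winner] += 1
    return counts

def outcome_ratios(white_wins, black_wins, draws):
    """Return the total number of games and each outcome's share of that total."""
    games = white_wins + black_wins + draws
    return {
        "Games": games,
        "WhiteWinRatio": white_wins / games,
        "BlackWinRatio": black_wins / games,
        "DrawRatio": draws / games,
    }

# Hypothetical mini-sample to show the tallying step.
demo = [("Slav Defense", "White"), ("Slav Defense", "Black"), ("Slav Defense", "White")]
print(dict(tally_outcomes(demo)["Slav Defense"]))

# Counts taken from the Alekhine Defense row of the table above.
alekhine = outcome_ratios(6778, 7359, 537)
print(alekhine["Games"], round(alekhine["WhiteWinRatio"], 6))
```

The three ratios always sum to 1, which is why the stacked bars in the charts that follow fill the full 0 to 100 range.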

Next, we create some bar charts to visualize our data. In all of our bar charts, the y-axis is the name of the general opening and the x-axis is the percentage of occurrence. The white bars represent the percentage of games white won, the black bars represent the percentage of games black won, and the gray bars represent the percentage of games that resulted in a draw when the specified general opening was played.

The first bar chart shows the 10 general openings with the highest ratio of white wins to games played. As we can see, the King's Pawn opening overwhelmingly has the highest ratio of white wins to games played, although this is a bit of an exception, as we explain in our interpretation section.

In [ ]:
topWhite_df = ratio_df.sort_values(by = "WhiteWinRatio", ascending = False).head(10)

ax = plt.axes()
ax.set_facecolor('lightgray')

b1 = plt.barh(topWhite_df.index, topWhite_df["WhiteWinRatio"] * 100, color = "white")
b2 = plt.barh(topWhite_df.index, topWhite_df["DrawRatio"] * 100, left = topWhite_df["WhiteWinRatio"] * 100, color = "gray")
b3 = plt.barh(topWhite_df.index, topWhite_df["BlackWinRatio"] * 100, left = (topWhite_df["WhiteWinRatio"] + topWhite_df["DrawRatio"]) * 100, color = "black")

plt.title("Result Percentages for Top 10 General Openings for White")
plt.xlabel("Percentage")
plt.ylabel("General Opening")
plt.xlim([0,100])
plt.show()

The next bar chart shows the 10 general openings with the highest ratio of black wins to games played. As we can see, 4 of these openings are classified as "defense". This makes sense, because black plays second in chess, so they have to defend against the opening attack that white plays. Additionally, we can see that the Kadas opening has the highest ratio of black wins to games played.

In [ ]:
topBlack_df = ratio_df.sort_values(by = "BlackWinRatio", ascending = False).head(10)

ax = plt.axes()
ax.set_facecolor('lightgray')

b1 = plt.barh(topBlack_df.index, topBlack_df["WhiteWinRatio"] * 100, color = "white")
b2 = plt.barh(topBlack_df.index, topBlack_df["DrawRatio"] * 100, left = topBlack_df["WhiteWinRatio"] * 100, color = "gray")
b3 = plt.barh(topBlack_df.index, topBlack_df["BlackWinRatio"] * 100, left = (topBlack_df["WhiteWinRatio"] + topBlack_df["DrawRatio"]) * 100, color = "black")

plt.title("Result Percentages for Top 10 General Openings for Black")
plt.xlabel("Percentage")
plt.ylabel("General Opening")
plt.xlim([0,100])
plt.show()

This bar chart shows the 10 general openings that were played in the highest number of games. In other words, they are the 10 most popular general openings. In this bar chart, the number to the right of each bar gives the number of games in which that general opening was played. As we can see, among the most popular openings, black has the best win rate for the Sicilian Defense, and white has the best win rate for the Philidor Defense.

In [ ]:
popular_df = ratio_df.sort_values(by = "Games", ascending = False).head(10).sort_values(by = "Games")

ax = plt.axes()
ax.set_facecolor('lightgray')

b1 = plt.barh(popular_df.index, popular_df["WhiteWinRatio"] * 100, color = "white")
b2 = plt.barh(popular_df.index, popular_df["DrawRatio"] * 100, left = popular_df["WhiteWinRatio"] * 100, color = "gray")
b3 = plt.barh(popular_df.index, popular_df["BlackWinRatio"] * 100, left = (popular_df["WhiteWinRatio"] + popular_df["DrawRatio"]) * 100, color = "black")

plt.title("Result Percentages for Top 10 Most Popular General Openings")
plt.xlabel("Percentage")
plt.ylabel("General Opening")
plt.bar_label(b3, popular_df["Games"])
plt.xlim([0,100])
plt.show()

This bar chart displays the frequency of the 10 most popular chess openings. Each bar displays the total number of games played for each opening, and is separated by color into white wins, black wins, and draws for further detail. As we can see, the Sicilian defense is by far the most popular opening in our sample of the dataset, and the Ruy Lopez opening is the 10th most popular.

In [ ]:
# display(popular_df)

freq_df = popular_df.sort_values(by = "Games", ascending = False).iloc[: , :3]
ax = freq_df.plot(kind='bar', stacked=True, color=['white', 'black', 'gray'])
ax.set_facecolor('lightgray')

plt.xlabel("General Opening")
plt.ylabel("Games")
plt.title("Frequency of 10 Most Popular Chess Openings")
plt.show()
# sns.barplot(data=df, x=popular_df.index, y="Games")

In the following code, we again group the dataframe by general opening, but this time to calculate the average number of moves and average Elo for each opening.

We then visualize this new data as bar charts displaying the openings with the highest and lowest average number of moves as well as the highest and lowest average Elo.

In [ ]:
grouped = df.groupby(['GeneralOpening']).agg(Games=('GeneralOpening', 'size'), AvgMoves=('Moves', 'mean'), AvgElo=('AvgElo', 'mean'))
grouped = grouped[grouped['Games']>1000].sort_values(by='Games', ascending=False)
grouped["AvgMoves"] = round(grouped["AvgMoves"], 2)
grouped["AvgElo"] = round(grouped["AvgElo"])
grouped
Out[ ]:
                         Games  AvgMoves  AvgElo
GeneralOpening
Sicilian Defense        146964     32.66  1790.0
French Defense           86935     32.97  1750.0
Queen's Pawn Game        74868     33.45  1697.0
Scandinavian Defense     57642     32.60  1702.0
King's Pawn Game         45766     30.50  1546.0
...                        ...       ...     ...
East Indian Defense       1244     35.81  1860.0
Gedult's Opening          1208     33.52  1770.0
London System             1182     35.47  1922.0
Kadas Opening             1094     30.60  1641.0
Danish Gambit Accepted    1078     27.14  1697.0

81 rows × 3 columns

Every chess game eventually ends, whether by checkmate, resignation, a draw, or another form of termination. The next two graphs show which openings tend to lead to longer, more closed games and which tend to lead to games that end quickly through checkmate or through resignation after a loss of material.

This bar chart displays the 10 openings with the highest average number of moves. As we can see, the averages for these 10 openings are all close to one another.

In [ ]:
most_moves_df = grouped.sort_values(by='AvgMoves', ascending=False).head(10)

ax = sns.barplot(data=most_moves_df, x=most_moves_df.index, y='AvgMoves')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set(xlabel="General Opening", ylabel="Average Number of Moves", title="Chess Openings with Highest Average Moves")
for i in ax.containers:
    ax.bar_label(i,)
# most_moves_df['AvgMoves'].plot(kind="bar")
# plt.xlabel("General Opening")
# plt.ylabel("Average Number of Moves")
# plt.title("Chess Openings with Highest Average Moves")
# plt.show()

This bar chart displays the 10 openings with the lowest average number of moves. As we can see, the King's Pawn Opening has the lowest average number of moves, although this is a bit of an exception, as we explain in our interpretation section.

In [ ]:
least_moves_df = grouped.sort_values(by='AvgMoves', ascending=True).head(10)

ax = sns.barplot(data=least_moves_df, x=least_moves_df.index, y='AvgMoves')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set(xlabel="General Opening", ylabel="Average Number of Moves", title="Chess Openings with Lowest Average Moves")
for i in ax.containers:
    ax.bar_label(i,)

Some chess openings are harder to master than others. Because of this, players with a higher rating (Elo) will generally play from a wider pool of openings, while players at a lower Elo will typically play from a different, smaller pool. We were curious about which openings in the dataset are typically played at higher Elos and which are typically played at lower Elos. The following two bar charts display this.

This bar chart displays the 10 openings with the highest average game Elo. As we can see, many of these openings are also among the 10 openings with the highest average number of moves per game. This makes sense: higher-rated players are less likely to blunder pieces, which leads to longer, more drawn-out, closed games.

In [ ]:
high_elo_df = grouped.sort_values(by='AvgElo', ascending=False).head(10)

ax = sns.barplot(data=high_elo_df, x=high_elo_df.index, y='AvgElo')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set(xlabel="General Opening", ylabel="Average Game Elo", title="Chess Openings with Highest Average Elo")
for i in ax.containers:
    ax.bar_label(i,)

This bar chart displays the 10 chess openings with the lowest average game Elo. As we can see, many of these openings are also among the 10 openings with the lowest average number of moves per game. This makes sense: lower-rated players are more likely to blunder pieces, which brings their games to a quicker conclusion.

In [ ]:
low_elo_df = grouped.sort_values(by='AvgElo', ascending=True).head(10)

ax = sns.barplot(data=low_elo_df, x=low_elo_df.index, y='AvgElo')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set(xlabel="General Opening", ylabel="Average Game Elo", title="Chess Openings with Lowest Average Elo")
for i in ax.containers:
    ax.bar_label(i,)
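
The overlap between the Elo rankings and the move-count rankings described above can also be checked numerically instead of by eye. The sketch below is only illustrative: it uses the ten rows that are visible in the `grouped` output earlier (the remaining rows are elided there), and the helper name `top_overlap` is our own invention, not part of the original notebook.

```python
import pandas as pd

# The ten rows visible in the `grouped` output above: (AvgMoves, AvgElo).
rows = {
    "Sicilian Defense": (32.66, 1790.0),
    "French Defense": (32.97, 1750.0),
    "Queen's Pawn Game": (33.45, 1697.0),
    "Scandinavian Defense": (32.60, 1702.0),
    "King's Pawn Game": (30.50, 1546.0),
    "East Indian Defense": (35.81, 1860.0),
    "Gedult's Opening": (33.52, 1770.0),
    "London System": (35.47, 1922.0),
    "Kadas Opening": (30.60, 1641.0),
    "Danish Gambit Accepted": (27.14, 1697.0),
}
subset = pd.DataFrame.from_dict(rows, orient="index", columns=["AvgMoves", "AvgElo"])


def top_overlap(frame, k, ascending):
    """Return the openings ranked in the first k by both AvgMoves and AvgElo."""
    by_moves = frame.sort_values(by="AvgMoves", ascending=ascending).head(k).index
    by_elo = frame.sort_values(by="AvgElo", ascending=ascending).head(k).index
    return sorted(set(by_moves) & set(by_elo))


# k=2 because only ten rows are available here.
longest_and_highest = top_overlap(subset, k=2, ascending=False)
shortest_and_lowest = top_overlap(subset, k=2, ascending=True)

# Pearson correlation between average game length and average Elo.
correlation = subset["AvgMoves"].corr(subset["AvgElo"])

print("Long games and high Elo:", longest_and_highest)
print("Short games and low Elo:", shortest_and_lowest)
print("Correlation:", round(correlation, 2))
```

On the full table, the same check would be `top_overlap(grouped, k=10, ascending=False)` and `top_overlap(grouped, k=10, ascending=True)`.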

4. Model: Analysis, Hypothesis Testing, & ML¶

In the previous sections, we worked extensively with openings. In this section, we want to examine the relationship between openings and other fields in the dataset. Because we may want to analyze several different relationships, and because testing whether a relationship exists is a predictive task, we start by defining a general ML classifier function that we can reuse for each relationship.

In [ ]:
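from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

def ML(X, y, architecture, y_type):
  # Map each distinct input string to an integer ID, giving a single feature column
  X = LabelEncoder().fit_transform(X).reshape(-1, 1)
  # Hold out half of the data for testing
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)
  classifier = MLPClassifier(solver='adam',
                           learning_rate="adaptive", hidden_layer_sizes=architecture,
                           random_state=1, verbose=True, alpha=1e-5)
  # Cast the targets to the requested type so they can serve as class labels
  y_train = y_train.astype(y_type)
  y_test = y_test.astype(y_type)
  # Standardize the feature using statistics from the training split only
  sc = StandardScaler()
  X_train = sc.fit_transform(X_train)
  X_test = sc.transform(X_test)
  classifier.fit(X_train, y_train)
  predicted = classifier.predict(X_test)
  predicted = predicted.astype(y_type)
  # accuracy_score expects the true labels first, then the predictions
  print("Accuracy: " + str(accuracy_score(y_test, predicted)))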
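
Note that the function above label-encodes its input, so whatever is passed as `X` reaches the network as a single integer column. One possible alternative, sketched here under our own assumptions, is to encode each field separately (one-hot for the opening, standardized values for the numeric fields) and to predict a coarse rating band instead of an exact rating. The data below is synthetic and random, and the band boundaries and layer size are arbitrary choices, so this only shows the shape of such a pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the games table (all values are made up).
rng = np.random.default_rng(0)
n_games = 600
openings = ["Sicilian Defense", "French Defense", "London System", "Kadas Opening"]
toy = pd.DataFrame({
    "GeneralOpening": rng.choice(openings, size=n_games),
    "Moves": rng.integers(10, 80, size=n_games),
    "Score": rng.normal(0.0, 3.0, size=n_games),
    "AvgElo": rng.integers(1000, 2400, size=n_games),
})

# Hypothetical rating bands: three classes instead of one class per rating.
elo_band = pd.cut(toy["AvgElo"], bins=[0, 1500, 1900, 4000],
                  labels=["low", "mid", "high"]).astype(str)

# One-hot encode the opening; standardize the two numeric fields.
features = ColumnTransformer([
    ("opening", OneHotEncoder(handle_unknown="ignore"), ["GeneralOpening"]),
    ("numeric", StandardScaler(), ["Moves", "Score"]),
])
model = make_pipeline(
    features,
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=1),
)

X_train, X_test, y_train, y_test = train_test_split(
    toy[["GeneralOpening", "Moves", "Score"]], elo_band,
    test_size=0.5, random_state=1,
)
model.fit(X_train, y_train)

n_inputs = model[0].transform(X_test).shape[1]  # columns the network sees
accuracy = model.score(X_test, y_test)
print("Input columns after encoding:", n_inputs)
print("Accuracy on random toy data:", accuracy)
```

Because the toy data is random noise, the printed accuracy carries no meaning; the point is only that the network now receives one column per opening plus the two numeric fields.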

We will start by building a model to see if we can reliably predict the game Elo (the average Elo of both players) based on the opening played, the number of moves, and the final Stockfish score of a given game.

In [ ]:
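# Work on a 0.1% random sample of the games to keep training time manageable
df_sample = df.sample(n=len(df['GeneralOpening']) // 1000)
X = []
y = []
for index, row in df_sample.iterrows():
  # Input: (opening, number of moves, final Stockfish score); target: average Elo of both players
  X.append((row['GeneralOpening'], row['Moves'], row['Score']))
  y.append((row['WhiteElo'] + row['BlackElo']) / 2)
# Turn each feature tuple into a string so that ML() can label-encode it
X = [str(i) for i in X]
X = np.array(X)
y = np.array(y)
# Hidden layer sizes: the first scales with the number of distinct inputs,
# the last with the number of distinct ratings (averaged over White and Black)
Arc = ((len(df_sample['GeneralOpening'].unique()) * len(df_sample['Moves'].unique()) * len(df_sample['Score'].unique())) // 100,
       100,
       100,
       (len(df_sample['WhiteElo'].unique()) + len(df_sample['BlackElo'].unique())) // 2)
ML(X, y, Arc, 'int')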
Iteration 1, loss = 6.13790987
Iteration 2, loss = 6.09689447
Iteration 3, loss = 6.01957958
Iteration 4, loss = 5.85963939
Iteration 5, loss = 5.70806947
...
Iteration 190, loss = 3.12861397
Iteration 191, loss = 3.10030972
Iteration 192, loss = 3.22295725
Iteration 193, loss = 3.29789595
Iteration 194, loss = 3.13972702
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Accuracy: 0.0

As you can see, the model did not achieve any meaningful accuracy: it scored 0.0 on the test split. We tried modifying some of the parameters to obtain a better result, but were ultimately unsuccessful. Note that because the classifier treats each distinct integer Elo as its own class, a prediction only counts as correct when it matches the true rating exactly, which makes this a very strict measure. With more time we might have been able to make this model accurate, but we must also consider the possibility that the X data we chose is simply not a reliable predictor of the y data we chose. So, in addition to modifying the architecture and training volume of this particular model, we could also experiment with different X's and y's.
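
To illustrate how strict exact-match accuracy is for a numeric target such as Elo, here is a small sketch with invented ratings. It does not use our model's predictions, so it says nothing about how close our model actually was; it only compares the two metrics on the same made-up numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up ratings: every prediction is within 15 points of the true value.
y_true = np.array([1500, 1620, 1785, 1910, 2050])
y_pred = np.array([1510, 1605, 1790, 1900, 2045])

# Accuracy counts only exact matches; mean absolute error measures distance.
exact_match = accuracy_score(y_true, y_pred)
mean_error = mean_absolute_error(y_true, y_pred)

print("Exact-match accuracy:", exact_match)  # 0.0, no prediction is exact
print("Mean absolute error:", mean_error)    # 9.0 Elo points
```

A distance-based metric like this, or a coarser target such as rating bands, would be one option to consider when experimenting with different y's.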

5. Interpretation¶

In this section of the data science pipeline, we use the results of our data analysis to draw conclusions.

One observation is that the win rate for White in King's Pawn is extraordinarily high, at over 90%. This is because almost every good reply to the King's Pawn opening already has its own named opening in the database. A game is therefore classified as King's Pawn only when Black plays an objectively bad move, which leads to a high win rate for White.
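
The classification effect described above can be illustrated with a toy example. The sketch below assumes that a game is labeled with the most specific named line that matches its first moves; the dataset itself does not document its labeling procedure, so this is our assumption. The opening book here is deliberately tiny and hypothetical, and `"<unnamed reply>"` is a placeholder, not a real move.

```python
# Hypothetical, deliberately tiny opening book: move prefix -> opening name.
OPENING_BOOK = {
    ("e4",): "King's Pawn",
    ("e4", "c5"): "Sicilian Defense",
    ("e4", "e6"): "French Defense",
    ("e4", "d5"): "Scandinavian Defense",
}


def classify(moves):
    """Return the name attached to the longest move prefix found in the book."""
    for length in range(len(moves), 0, -1):
        name = OPENING_BOOK.get(tuple(moves[:length]))
        if name is not None:
            return name
    return "Unknown"


# A sound reply has its own name, so the game leaves the generic label.
named_game = classify(["e4", "c5", "Nf3"])
# A reply with no named line falls back to the generic label.
fallback_game = classify(["e4", "<unnamed reply>", "Nf3"])

print(named_game)     # Sicilian Defense
print(fallback_game)  # King's Pawn
```

Under this assumption, the generic King's Pawn label collects only the games in which Black's reply matched no named line.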

Based on these results, we can infer that:

  1. King's Pawn has the highest win rate for White of any general opening. Because it is an exception, as explained above, it is worth noting that Queen's Gambit Accepted has the second-highest win rate for White. The Kadas Opening has the highest win rate for Black.
  2. Setting King's Pawn aside, no general opening has a win rate above 60%, which suggests that the choice of opening doesn't matter much. What matters more is how well a player understands their chosen opening and how well they adapt to their opponent's responses.

We hope that our data and analysis of the general openings in chess will help new chess players decide which openings to play.

If we could conduct further research, we would obtain and analyze more data on chess games. An expanded dataset could include openings that were absent from the dataset used in this project. In addition, a larger dataset could contain more occurrences of the general openings that we had to exclude because of their small sample sizes. This would allow us to analyze good openings that are not well known.

In addition, given more time, we would have expanded the machine learning aspect of our project. As described in the ML section above, we were only able to train one machine learning model, and it did not produce our desired results. The time constraint kept us from experimenting with more X/y choices, data volumes, and network architectures, so this is the section we would most like to expand.

If you are interested in learning more about the data science pipeline, follow this link:

  • https://www.geeksforgeeks.org/whats-data-science-pipeline/

If you are interested in learning more about chess and openings, follow one of these links:

  • https://en.wikipedia.org/wiki/Chess_opening
  • https://lichess.org/opening
  • https://www.chess.com/openings