Serving Insights: Improving Data-Driven Badminton Analytics with Computer Vision and Machine Learning

After my first URECA project on football analytics, I was eager to embark on a second research journey. Working again with Assoc Prof John Komar and Julian Tan from NIE, we tackled the challenge of building a data collection pipeline for badminton using computer vision and machine learning. This article outlines our journey in developing the system and our initial findings.

Motivations

Performance analysis typically relies on data from specialized equipment such as wearable sensors, or on subscriptions to commercial data providers, both of which come at a substantial cost. The goal of our project was therefore to evaluate the feasibility of collecting data using only broadcast footage, leveraging advancements in machine learning and computer vision.

The proposed system is designed to process men’s singles badminton videos and extract relevant data. For the URECA project, we focused on producing the following information:  

  1. Player positions with respect to the court
  2. Joint angles and distances between joints of players

To arrive at this output, our pipeline follows the steps detailed in the sections below.  

Video Pre-Processing

To remove noise from the broadcast video, the court area must first be identified and isolated, so this is the first task of our pipeline. Assuming that the first frame contains a full view of the court, users are required to select the court corners. A mask is then generated around the court to remove background noise from spectators and line judges, as illustrated in Figures 1 and 2 below.

Video Frame Before Masking

Figure 1. Video Frame Before Masking.

Video Frame After Masking

Figure 2. Video Frame After Masking.
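As a rough illustration of this masking step, the sketch below uses OpenCV to build a polygon mask from user-selected corners. The corner coordinates, file names, and dilation size are hypothetical values for illustration only:

```python
import cv2
import numpy as np

# Hypothetical corner coordinates selected by the user on the first frame,
# ordered top-left, top-right, bottom-right, bottom-left (in pixels).
court_corners = np.array([[420, 210], [860, 210], [1150, 660], [130, 660]], dtype=np.int32)

frame = cv2.imread("first_frame.png")

# Binary mask: white inside the court polygon, black everywhere else.
mask = np.zeros(frame.shape[:2], dtype=np.uint8)
cv2.fillPoly(mask, [court_corners], 255)

# Dilate the mask slightly so the players' bodies near the court edges
# are not clipped away along with the background.
mask = cv2.dilate(mask, np.ones((51, 51), np.uint8))

# Black out everything outside the (dilated) court area.
masked_frame = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("masked_frame.png", masked_frame)
```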

Additionally, using ORB feature matching between frames, it is possible to filter out video frames that do not contain the full court view. This is useful for removing the replays and close-ups that are common in tournament broadcasts.
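A minimal sketch of this filtering idea is shown below, assuming a reference frame known to contain the full court; the descriptor-distance and match-count thresholds are hypothetical and would need tuning per broadcast:

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

# A reference frame known to contain the full court view.
reference = cv2.imread("reference_full_court.png", cv2.IMREAD_GRAYSCALE)
_, ref_descriptors = orb.detectAndCompute(reference, None)

def is_full_court_view(frame_gray, min_matches=60):
    """Return True if the frame resembles the reference full-court view."""
    _, descriptors = orb.detectAndCompute(frame_gray, None)
    if descriptors is None:
        return False
    matches = matcher.match(ref_descriptors, descriptors)
    # Count only reasonably close descriptor matches.
    good_matches = [m for m in matches if m.distance < 40]
    return len(good_matches) >= min_matches
```

After these filtering steps, our video is ready for pose estimation.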

Pose Estimation

Pose estimation inference with OpenPose was handled using the handy Sports2D package by David Pagnon. It is a very robust package that abstracts away model loading and inference, and returns the following data points:

  1. Joint coordinates
  2. Joint angles, computed using dot product of joint vectors
  3. Joint distances, calculated using Euclidean distance
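Conceptually, the angle and distance computations in points 2 and 3 reduce to a few lines of NumPy. The sketch below is a simplified illustration with made-up coordinates; Sports2D handles the details such as keypoint selection and missing detections:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by keypoints a-b-c, e.g. hip-knee-ankle."""
    v1, v2 = a - b, c - b
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def joint_distance(a, b):
    """Euclidean distance between two keypoints."""
    return float(np.linalg.norm(a - b))

# Hypothetical keypoint coordinates (in pixels) for a single frame.
hip, knee, ankle = np.array([640.0, 420.0]), np.array([655.0, 520.0]), np.array([660.0, 620.0])
print(joint_angle(hip, knee, ankle))   # knee flexion angle
print(joint_distance(hip, ankle))
```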

When annotating the results of the pose estimation model on the pre-processed video, the output is as follows:

Output from Pose Estimation

Figure 3. Annotated Output after Pose Estimation.

The package is actively developed and currently also supports a variety of other modern pose estimation models that are suitable for cross-platform deployment. After pose estimation is carried out, the following post-processing steps are applied:

  1. Normalization of distances using min-max scaling and aspect ratio adjustments
    • Given that videos may differ in resolution and camera zoom, normalization removes scale-related inconsistencies and helps the model focus on relative motion patterns rather than absolute pixel values.
  2. Homography transformation of pose coordinates to the court space to obtain the player’s location relative to the court
    • Using the coordinates of the corners of the court, a homography matrix is constructed. The coordinates of the player’s feet are subsequently mapped to the coordinates of the court, ensuring a consistent range of values that are not affected by changes in video resolution.

Normalization using min-max scaling

Figure 4. Effect of Min-Max Normalization.
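Putting the two post-processing steps together, a simplified sketch is shown below. The pixel corner coordinates and array handling are illustrative assumptions; the 13.4 m by 5.18 m singles court dimensions are simply used as the target coordinate system:

```python
import cv2
import numpy as np

def min_max_normalize(features):
    """Scale each feature column to [0, 1] to remove resolution and zoom effects."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / (hi - lo + 1e-8)

# Court corners in image space (selected by the user) and in court space (metres).
image_corners = np.array([[420, 210], [860, 210], [1150, 660], [130, 660]], dtype=np.float32)
court_corners = np.array([[0, 0], [5.18, 0], [5.18, 13.4], [0, 13.4]], dtype=np.float32)

homography, _ = cv2.findHomography(image_corners, court_corners)

def foot_to_court(foot_xy):
    """Map a player's foot position from image pixels to court coordinates (metres)."""
    point = np.array([[foot_xy]], dtype=np.float32)   # shape (1, 1, 2)
    return cv2.perspectiveTransform(point, homography)[0, 0]

print(foot_to_court((640.0, 500.0)))   # court-space (x, y) position in metres
```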

Training Data Collection

While our long-term goal was to create an end-to-end automated pipeline, for the purposes of the URECA project we set the following goals:

  1. Verifying whether the collected data can be used to identify different kinds of shots
  2. Testing the performance of different model architectures for shot classification

Good model outputs, of course, depend on the quality and quantity of training data. As someone who has not played much badminton, I was not able to contribute much to this step. There are many nuances in the game which only become clear with domain knowledge, such as the difference between a drop and a net shot, or a lift and a push.

Thanks mainly to the efforts of Julian, we were able to collect and annotate data from a set of men’s singles matches from the YONEX French Open 2023. For each video, frames where shots were taken were labelled with the following attributes:

  • Grip Type: Forehand, Backhand, Overhead
  • Shot Type: Smash, Drop, Lift, Drive, Net, Push, Tap, Lob, Block, Serve Low, Serve High
  • Direction: Straight, Cross, Middle
  • Outcome: Winner, Forced Error, Unforced Error

To create sequences of data for training a classification model, we padded the labels by 12 frames on each side. For instance, for a shot taken at frame 100, the corresponding shot window in the training data spans frames 88 to 112.
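In code, building these windows is straightforward. The sketch below uses hypothetical names: `features` is the per-frame feature matrix for a video, while `shot_frames` and `labels` come from the annotations:

```python
import numpy as np

PAD = 12  # frames added on each side of the labelled shot frame

def build_shot_windows(features, shot_frames, labels):
    """Cut fixed-length windows (2 * PAD + 1 = 25 frames) around each labelled shot."""
    X, y = [], []
    for frame_idx, label in zip(shot_frames, labels):
        start, end = frame_idx - PAD, frame_idx + PAD + 1
        if start < 0 or end > len(features):
            continue  # skip shots too close to the start or end of the clip
        X.append(features[start:end])
        y.append(label)
    return np.stack(X), np.array(y)
```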

With the training data ready, we could begin to explore the effectiveness of different ML model architectures for shot classification. Due to the relatively small size of the training dataset, we chose to classify the Grip attribute of a shot as a proof of concept.

Model Training

To compare how different neural network designs perform on the grip classification task, we implemented three model architectures:

  1. RNN (Recurrent Neural Network): for capturing short-term frame-to-frame dependencies.
  2. LSTM (Long Short-Term Memory): for modeling longer-term temporal patterns in player movement.
  3. Conv2D (2D Convolutional Neural Network): for learning local spatial patterns across joints and time.

The inputs to each model were sequences of joint angles and Euclidean distances between keypoints. Additionally, each model was trained on three variations of the dataset:

  1. Only bottom player data
  2. Only top player data
  3. Combined dataset of both players

All models were trained for 10 epochs using the Adam optimizer and cross-entropy loss, with a learning rate of 0.001 and a 60:40 train-test split.
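As an illustration of this setup, here is a minimal PyTorch sketch using the LSTM variant. The feature dimension, hidden size, and placeholder data are assumptions for illustration; our actual implementations differ in detail:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

SEQ_LEN, N_FEATURES, N_CLASSES = 25, 30, 3   # hypothetical feature dimension

class GripLSTM(nn.Module):
    """LSTM baseline: one recurrent layer followed by a linear classifier."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(N_FEATURES, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, N_CLASSES)

    def forward(self, x):                      # x: (batch, SEQ_LEN, N_FEATURES)
        _, (hidden, _) = self.lstm(x)
        return self.fc(hidden[-1])             # logits for forehand/backhand/overhead

# Placeholder tensors; in practice these come from the shot windows described above.
X = torch.randn(200, SEQ_LEN, N_FEATURES)
y = torch.randint(0, N_CLASSES, (200,))

dataset = TensorDataset(X, y)
n_train = int(0.6 * len(dataset))              # 60:40 train-test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = GripLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```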

Results

Precision Score by Model and Class

Figure 5. Precision Score by Model and Class.

Recall Score by Model and Class

Figure 6. Recall Score by Model and Class.

F1-Score by Model and Class

Figure 7. F1-Score by Model and Class.

As shown in Figures 5-7 above, the Conv2D model consistently outperformed the RNN and LSTM architectures, achieving the highest F1-scores across all datasets. This suggests that convolutional layers are particularly effective at capturing the local spatial and temporal patterns in joint-based features.

Models trained on bottom player data performed the best overall. This can be attributed to the fact that the bottom player in broadcast videos is usually framed more clearly and occupies a more consistent region of the screen, leading to better pose estimation outputs. On the other hand, models trained on the combined dataset saw a slight drop in performance. This could be because the top and bottom players have mirrored movements, introducing noise into the dataset that makes it harder for the models to generalize. These results suggest that training separate models for the top and bottom players is likely to yield the best overall results.

We also found that overhead grip shots were the easiest to detect across all models. Even the lower-performing RNN was able to identify overhead grips with high precision and recall. This can be explained by the fact that overhead shots such as smashes and clears involve a distinctive motion that is easy to tell apart visually, and hence also stands out in the pose data.

In contrast, forehand and backhand grips showed more overlap in their movement signatures, especially when only looking at pose-derived features like joint distances and angles. Consequently, both RNN and LSTM models exhibited difficulty in accurately distinguishing between these grips, possibly due to their limited ability to model fine-grained variations in pose dynamics.

Future Directions

This project taught me a lot about the fundamentals of computer vision, machine learning, and most importantly, the software engineering principles needed to soundly implement a system from scratch. It also showed me that meaningful data can still be obtained even from constrained inputs.

The research above is just a short preview of the work our team has accomplished in the short span of a year; a lot more has gone on behind the scenes. Julian is scheduled to present some of our findings at the International Symposium on Computer Science in Sport in September. Make sure to follow him on LinkedIn to get more information on his research!

Lastly, if you’d like to dive deeper, I wrote a full technical paper detailing this project as part of my URECA requirements. You can find it here.