Expert Human-Level Driving in Gran Turismo Sport Using Deep Reinforcement Learning with Image-based Representation

Ryuji Imamura, Takuma Seno, Kenta Kawamoto, Michael Spranger


Abstract

When humans play virtual racing games, they use visual information from the game screen to understand the rules of the environment. In contrast, a state-of-the-art realistic racing game AI agent that outperforms human players does not use image-based environmental information, relying instead on the compact, precise measurements provided by the environment. In this paper, a vision-based control algorithm is proposed and compared with human player performance under the same conditions in realistic racing scenarios using Gran Turismo Sport (GTS), a high-fidelity realistic racing simulator. In the proposed method, the environmental information that constitutes part of the observations in conventional state-of-the-art methods is replaced with feature representations extracted from game screen images. We demonstrate that the proposed method performs expert human-level vehicle control in high-speed driving scenarios even with game screen images as high-dimensional inputs. Additionally, it outperforms the built-in AI in GTS in a time trial task, and its score places it among the top 10% of approximately 28,000 human players.

Key Takeaways

1. Expert Human-Level Performance with Vision-Based Control

This paper demonstrates that an AI agent can achieve expert human-level driving performance in Gran Turismo Sport with game screen images as its primary input, challenging the conventional approach of relying on precise, simulator-provided measurements.

2. Image-Based Representation Learning

The study proposes a two-phase approach: first, learning feature representations from game screen images, and then using these representations to train a vehicle control policy. This separation allows for higher resolution images and reduces training time for the policy network.

3. Outperforming Built-in AI and Matching Top Human Players

The trained AI controller not only outperforms the built-in AI in Gran Turismo Sport but also achieves a lap time within the top 10% of approximately 28,000 human players. This result validates the effectiveness of the vision-based control approach.

4. Near-Optimal Trajectory Learning

The AI agent learns to drive near-optimal trajectories, passing close to the inner walls of the track without making contact. This indicates that the agent can accurately recognize its position relative to the track from the game screen images.

Introduction

This paper presents a method for achieving expert human-level driving performance in the realistic racing simulator Gran Turismo Sport (GTS) using deep learning. Unlike state-of-the-art AI agents that rely on precise, simulator-provided measurements, humans primarily use visual information from the game screen. This study proposes a vision-based control algorithm that replaces environmental information with feature representations extracted from game screen images, comparing its performance to that of human players.

The goal is to demonstrate that expert human-level vehicle control is achievable with high-dimensional game screen images in place of precise environmental measurements. The proposed method outperforms the built-in AI in GTS and achieves a lap time within the top 10% of approximately 28,000 human players, suggesting that the precise environmental data used by prior methods can be replaced by visual information without sacrificing performance.

Methodology

The methodology consists of two phases:

🎯 Phase 1: Image-Based Representation Learning

A feature extraction network, \(\varphi_{rep}\), is trained to map game screen images, \(I \in \mathbb{R}^{H \times W \times C}\), to a lower-dimensional embedded representation. This is achieved by regression learning, minimizing the loss function:

\[l_{rep} = ||o_{env} - \varphi_{reg}(\varphi_{rep}(I))||_2^2,\]

where \(o_{env} \in \mathbb{R}^{D_{env}}\) is the environmental observation vector (distances to the track edges and centerline, heading angle, and course curvature) and \(\varphi_{reg}\) is a subsequent regression layer. The network is trained offline on a dataset of approximately 630,000 game screen images captured at various points on the racetrack, collected by adding noise to the built-in AI's driving operations. The network architecture uses space2depth and depthwise separable convolutions for efficient inference. \(D_{env}\) is set to 27, and the embedded representation size \(D_{rep}\) is 64. The course curvature is sampled at 10 points between 40 m and 120 m ahead of the vehicle.
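To make the objective concrete, below is a minimal PyTorch sketch of the Phase 1 regression step. The encoder is a simplified stand-in for the paper's space2depth/depthwise-separable architecture; apart from \(D_{env} = 27\) and \(D_{rep} = 64\), all layer sizes, the learning rate, and the class and function names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

D_ENV, D_REP = 27, 64  # sizes from the paper: o_env dimension and embedding size

class RepEncoder(nn.Module):
    """Simplified stand-in for the paper's space2depth + depthwise-separable encoder."""
    def __init__(self, in_channels: int = 3, d_rep: int = D_REP):
        super().__init__()
        c = in_channels * 16  # space2depth, block size 4: (C, H, W) -> (16C, H/4, W/4)
        self.net = nn.Sequential(
            nn.PixelUnshuffle(4),                               # space2depth
            nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c),  # depthwise conv
            nn.Conv2d(c, 64, 1),                                # pointwise conv
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, d_rep),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)

encoder = RepEncoder()               # phi_rep
regressor = nn.Linear(D_REP, D_ENV)  # phi_reg: predicts o_env from the embedding
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(regressor.parameters()), lr=1e-3
)

def train_step(images: torch.Tensor, o_env: torch.Tensor) -> float:
    """One offline regression step minimizing ||o_env - phi_reg(phi_rep(I))||_2^2."""
    prediction = regressor(encoder(images))
    loss = ((o_env - prediction) ** 2).sum(dim=1).mean()  # squared L2 norm per sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because \(\varphi_{rep}\) is trained offline like this, the reinforcement learning phase never has to backpropagate through the image encoder, which is what permits higher-resolution inputs without slowing down policy training.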

🎯 Phase 2: Reinforcement Learning using Learned Representation

A policy network, \(\varphi_p\), is trained using the Soft Actor-Critic (SAC) algorithm. The policy network takes as input the embedded representation \(\varphi_{rep}(I) \in \mathbb{R}^{D_{rep}}\) from Phase 1, concatenated with dedicated features from the simulator (3D linear velocity, 3D linear acceleration, angular velocity, wall contact flag, and the previous steering command), forming the observation \(o \in \mathbb{R}^{D_{rep}+D_{ded}}\). The policy network outputs steering and combined throttle/brake values \(a_t \in \mathbb{R}^2\):

\[a_t = \varphi_p(o),\]

where \(a_t\) consists of the steering angle \(\delta_t \in [-\frac{\pi}{6}, \frac{\pi}{6}]\) and the combined throttle/brake command \(\omega_t \in [-1, 1]\). The reward function encourages course progress while penalizing wall collisions:

\[r_t = r_{prog_t} - \begin{cases} c_w ||v_t||^2 & \text{if in contact with wall} \\ 0 & \text{otherwise} \end{cases}\]

where \(r_{prog_t}\) is the reward for course progress, \(v_t = [v_x, v_y, v_z]\) is the linear velocity, and \(c_w\) is a weight coefficient.
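As a concrete illustration, the sketch below maps a tanh-squashed policy output onto the action bounds above and evaluates the reward. The wall-penalty weight here is a placeholder, not the paper's value of \(c_w\), and the function names are our own.

```python
import numpy as np

STEER_MAX = np.pi / 6  # delta_t range: [-pi/6, pi/6] radians
C_W = 0.01             # placeholder wall-penalty weight; the paper's c_w may differ

def scale_action(raw: np.ndarray) -> tuple[float, float]:
    """Map a tanh-squashed policy output in [-1, 1]^2 to (delta_t, omega_t)."""
    delta_t = float(raw[0]) * STEER_MAX  # steering angle in radians
    omega_t = float(raw[1])              # +1 is full throttle, -1 is full brake
    return delta_t, omega_t

def reward(progress: float, velocity: np.ndarray, wall_contact: bool,
           c_w: float = C_W) -> float:
    """r_t = r_prog_t, minus c_w * ||v_t||^2 while in contact with a wall."""
    penalty = c_w * float(np.dot(velocity, velocity)) if wall_contact else 0.0
    return progress - penalty
```

Scaling the penalty by the squared speed discourages high-speed wall contact in particular, which is exactly the kind of collision that costs the most lap time.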

Experiments were conducted in GTS using the Mazda Demio XD Touring ’15 on the Tokyo Expressway Central Outer Loop. The results show that the proposed method outperforms the built-in AI and places within the top 10% of approximately 28,000 human players.
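Putting the two phases together, one control step at inference time might look like the following sketch, where `encoder` and `policy` are the trained networks from the phases above; the function name and tensor shapes are illustrative.

```python
import torch

STEER_MAX = torch.pi / 6  # same steering bound as above

@torch.no_grad()
def act(encoder, policy, image: torch.Tensor, dedicated: torch.Tensor):
    """One inference-time control step: encode the current frame (Phase 1),
    append the dedicated simulator features, and query the trained policy
    (Phase 2). `policy` is assumed to output a tanh-squashed action in [-1, 1]^2."""
    o_rep = encoder(image.unsqueeze(0))                      # (1, D_rep)
    obs = torch.cat([o_rep, dedicated.unsqueeze(0)], dim=1)  # (1, D_rep + D_ded)
    raw = policy(obs).squeeze(0)
    delta_t = float(raw[0]) * STEER_MAX                      # steering angle [rad]
    omega_t = float(raw[1])                                  # throttle/brake
    return delta_t, omega_t
```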

Figure 1. This system uses deep learning to train an AI agent that drives in Gran Turismo Sport using both visual information from the game screen and direct sensor data, achieving expert human-level performance.

Results

Figure 2. (Left) Examples of the Gran Turismo Sport game screen while driving a Mazda Demio on the "Tokyo Expressway - Central Outer Loop." (Right) Example trajectories (blue line) collected by disrupting the built-in AI's operation so that various points on the course are observed.

This new AI uses a clever two-step process:

  • Learn to "See": The AI first learns to understand the game screen, extracting important features like the car's position on the track and upcoming turns. Researchers captured game screen images, like the ones in Figure 2 (Left), and used them to train the AI to recognize key elements of the racing environment. Figure 2 (Right) shows how they collected diverse training data by intentionally disrupting the built-in AI's driving.
  • Learn to Drive: Then, using that visual understanding, the AI learns to control the car, steering, accelerating, and braking to achieve the fastest lap times.
Figure 3. (Left) Our agent's lap time compared with the distribution of human players' lap times and the lap time of the built-in AI. (Right) Lap time over the course of training.
Figure 4. Trajectories of our agent during evaluation driving in different sections of the course (blue line). When a vehicle hits a wall, the impact not only slows it down and disrupts the driving operation but also pushes it off its optimal path. Our agent's trajectory is smooth, showing that the controller can drive the car without hitting the walls.

So, how did this vision-based AI stack up against the competition?

  • Smashed the Built-in AI: The AI agent crushed the built-in AI's lap time in Gran Turismo Sport, beating it by a whopping 9.4 seconds!
  • Top 10% Human Performance: But here's the really impressive part: the AI's lap times put it in the top 10% of approximately 28,000 human players! That's right, this AI is driving at an expert human level using only what it "sees" on the screen. Figure 3 (Left) shows how the AI performed relative to the distribution of human player lap times: the AI (yellow line) significantly outperformed both the built-in AI (blue line) and the median human player (red line).
  • Close to the Pros: While the AI didn't quite beat the absolute fastest human players (who bring years of experience and finely tuned skills), it came remarkably close, trailing them by only about 3.3 seconds.
  • Learning Progress: Figure 3 (Right) illustrates how the AI's performance improved over time during training. Even with different starting points, the AI consistently learned to drive faster and faster.
  • Smooth Driving: Figure 4 shows the AI's driving trajectory through various corners. The AI navigates the track smoothly and efficiently, hugging the inside of each corner without touching the walls.

Discussion

This research is a huge step forward for AI in several ways:

  • Human-Like AI: By using visual information, the AI is learning to solve problems in a way that's more similar to how humans do it.
  • Real-World Applications: This technology could eventually be used to develop more advanced driver-assistance systems or even fully autonomous vehicles that can navigate complex environments using only cameras.
  • Inspiration for Gamers: It proves that even without perfect information, a well-trained AI can achieve amazing results. Maybe it's time to rethink your racing strategy!

Conclusion

The results are in, and they're groundbreaking! This research has successfully demonstrated that an AI agent can achieve expert human-level driving performance in Gran Turismo Sport with visual information extracted from the game screen standing in for precise environmental measurements. This vision-based control algorithm represents a significant step forward for the field of autonomous driving.

What's Next?

Research on autonomous racing is ongoing, and we plan to explore several exciting directions:

  • Validating Feature Representations: We aim to further analyze the acquired feature representations to confirm that the AI truly understands the environment rather than relying on superficial cues.
  • Multi-Image Analysis: We plan to incorporate multiple images to better mimic human observation conditions and capture dynamic information like acceleration and angular velocity.
  • Exploring End-to-End Learning: We will investigate completely end-to-end learning approaches to further optimize the AI's performance.

This is just the beginning of a new era for AI in autonomous systems. By combining the power of deep learning with vision-based control, we're paving the way for a future where AI can drive, navigate, and interact with the world in a more intelligent and human-like way.

Reviewer Notes

Key Points

1. The paper presents a novel approach to achieving expert human-level driving in a realistic racing simulator by using a vision-based control algorithm. This contrasts with existing methods that rely on precise, simulator-provided measurements.

2. The proposed method demonstrates strong performance, outperforming the built-in AI in Gran Turismo Sport and achieving lap times within the top 10% of approximately 28,000 human players. This provides compelling evidence for the effectiveness of the approach.

3. The two-phase approach, involving image-based representation learning followed by reinforcement learning, is well-structured and allows for efficient training and potentially better generalization.

Extended Analysis

1. The paper acknowledges the need to further validate the acquired feature representations. A deeper analysis is needed to ensure that the AI is truly understanding the environment and not just relying on superficial cues or overfitting to specific image features. Consider experiments that probe the learned representations, such as visualizing the features or testing their transferability to different tracks or vehicles.

2. The paper mentions the intention to compare the proposed method to completely end-to-end learning approaches. A more detailed discussion of the potential advantages and disadvantages of both approaches would be valuable. End-to-end learning might offer greater optimization potential but could also be more difficult to train and interpret.

3. While the results are impressive in a simulated environment, the paper could benefit from a discussion of the challenges and limitations of applying this approach to real-world autonomous driving. Factors such as sensor noise, weather conditions, and the complexity of real-world traffic scenarios need to be considered. Are there specific aspects of the simulation that would need to be improved to facilitate real-world transfer?