The Truth About Deep Reinforcement Learning¶
Reinforcement learning is not always exciting. Sometimes you have to face the raw truth of it: most of the time, it won’t work at first. These algorithms have an enormous number of hyperparameters to tune, and the gap between “looks right on paper” and “actually learns” is wide.
I’ve been diving deep into policy gradient methods, specifically implementing the Vanilla Policy Gradient (VPG) algorithm with Generalized Advantage Estimation (GAE-λ). The video below is from one of my training runs.
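For readers less familiar with the method: GAE-λ replaces the raw return with a discounted sum of one-step TD residuals, trading bias against variance via λ. Here is a minimal NumPy sketch of that computation (the function and variable names are illustrative, not taken from my implementation):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE-lambda advantages for one finished trajectory.

    rewards: shape (T,)   -- rewards collected at each step
    values:  shape (T+1,) -- value estimates, including a bootstrap value
                             for the state reached after the final step
    """
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD residuals
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Discounted (gamma * lam) sum of residuals, accumulated backwards in time
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting lam=1 recovers the full discounted return minus the value baseline, while lam=0 collapses to the one-step TD error; that is the bias-variance dial λ controls.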
The success stories¶
Over the past few weeks I’ve had real wins. My VPG agent learned to master CartPole in a handful of epochs, and it navigated Flappy Bird with impressive performance after relatively little training. It’s genuinely rewarding to watch an agent develop emergent strategies through pure trial and error — no demonstrations, no hand-crafted features, just reward signal and gradient descent doing their thing.
The Snake saga¶
The DRL journey is rarely a linear path of triumph, and my current challenge illustrates this perfectly.
I’ve been trying to train a VPG agent to play Snake — an environment that demands sequential decision-making, long-term planning, and avoidance of self-collision. After more than 2,500 epochs of training, the agent still hasn’t grasped the fundamental objective. It hasn’t figured out how to consistently seek food or stop biting itself, despite the same algorithm working cleanly in other domains.
That gap — same algorithm, completely different outcomes — is what makes DRL both fascinating and frustrating.
What I’m taking from this¶
A few things have crystallized after weeks of staring at loss curves:
Embrace failure. You have to get comfortable with models that don’t learn, agents that behave erratically, and hours spent watching curves refuse to converge. It’s not a sign you’re doing it wrong — it’s the job.
Debugging is paramount. Most “the algorithm doesn’t work” issues are actually environment bugs, reward-shaping mistakes, or numerical issues hiding in the advantage normalization (see the sketch after this list).
Fine-tuning is an art. Learning rate, GAE-λ, entropy coefficient, rollout length, value-loss weighting — they all interact, and the right combination is environment-specific.
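On those last two points: one concrete place where a numerical issue can hide is the batch advantage normalization applied right before the policy update. A minimal sketch of the guard I mean (illustrative code, not a quote from my repo):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Standardize advantages across a batch before the policy-gradient step.

    If a batch of advantages is nearly constant, its standard deviation
    collapses toward zero; the small eps keeps the division from blowing up,
    which is exactly the kind of silent numerical issue that masquerades as
    "the algorithm doesn't work".
    """
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```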
Open question¶
Should I push further on VPG for Snake, or is this a signal to switch algorithms? My instinct says Snake’s sparse reward and long-horizon credit-assignment problem might just be a poor fit for vanilla policy gradient — that PPO, or value-based methods like DQN with prioritized replay, might converge much faster.
I’d love to turn this into a series: same environments, different algorithms, side-by-side metrics. If that’s something you’d want to read, let me know.
This post is an expanded version of a note I originally shared on LinkedIn. Source code for the VPG implementation lives in transfer_learning on GitHub.