DAP Explained: Joint Scene–Action Prediction with Discrete Tokens

We walk through a planning architecture that treats both scene dynamics and vehicle motion as discrete token sequences - how it works, how it compares, and how you might reuse the pattern in other sequential decision tasks.


Introduction

Most driving stacks still split the world into separate boxes: perception hands the planner a static snapshot of the scene, and planning generates a trajectory on top of it. The planner often sees a "frozen" world and hopes the future behaves like the last frame.

DAP does something different: it turns both the evolving scene and the ego vehicle's actions into discrete tokens and feeds them into one autoregressive transformer. At each step, the model first predicts how the world will change, and only then chooses the next ego-action token, conditioned on that predicted future. You end up with a planner that literally "speaks" a language of future-scene tokens and motion tokens instead of raw floats.


How It Works

1. Tokenizing the world and the car

  • Scene tokens: BEV semantic maps (lanes, drivable area, other agents) are compressed with a VQ-style encoder into discrete codes.
  • Action tokens: ego deltas (curvature, acceleration, etc.) are discretized into bins and mapped to tokens.
  • Command tokens: route-level instructions (turn left, go straight, exit, etc.) are represented as categorical tokens.

So your history becomes a single sequence like:

[scene_t(-5), action_t(-5), ..., scene_t(0), action_t(0), command_t]
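
To make the tokenization concrete, here is a minimal Python sketch. The bin edges, bin counts, and helper names (action_to_token, build_history) are illustrative assumptions, not DAP's actual vocabulary or code:

import numpy as np

# Illustrative bin edges -- real ranges and resolutions would be tuned on driving data.
CURVATURE_BINS = np.linspace(-0.2, 0.2, 31)   # 1/m; digitize yields 32 bin indices
ACCEL_BINS     = np.linspace(-4.0, 4.0, 31)   # m/s^2

def action_to_token(curvature, accel):
    # Discretize the (curvature, acceleration) pair and flatten it into one token id.
    c = int(np.digitize(curvature, CURVATURE_BINS))   # index in 0 .. len(bins)
    a = int(np.digitize(accel, ACCEL_BINS))
    return c * (len(ACCEL_BINS) + 1) + a

def build_history(scene_codes, action_tokens, command_token):
    # Interleave each step's scene codes with its ego action token, then append the command.
    seq = []
    for scene, action in zip(scene_codes, action_tokens):
        seq.extend(scene)    # list of VQ code indices for this timestep
        seq.append(action)
    seq.append(command_token)
    return seq

In a real system the scene codes would come from the VQ encoder's codebook; how the three token types share or split a vocabulary is a design choice the sketch glosses over.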

2. Autoregressive planning

A decoder-only transformer consumes that history and rolls forward in time:

  1. Predict next scene tokens (how other agents and the layout evolve).
  2. Predict the ego action token given that predicted scene.

Because the ego decision is conditioned on the model's own future-scene prediction, supervision is denser than "just predict a trajectory". You are training the model to be good at world forecasting and decision making simultaneously.
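
A minimal greedy rollout sketch, assuming a hypothetical model(seq) that returns next-token logits of shape [batch, length, vocab] and a fixed number of scene tokens per step (both assumptions about the interface, not DAP's exact one):

import torch

@torch.no_grad()
def rollout(model, history, n_steps, scene_tokens_per_step):
    # Alternate scene prediction and action prediction, feeding every token back in.
    seq = list(history)
    planned_actions = []
    for _ in range(n_steps):
        # 1. Predict the next scene tokens (how other agents and the layout evolve).
        for _ in range(scene_tokens_per_step):
            logits = model(torch.tensor(seq).unsqueeze(0))
            seq.append(int(logits[0, -1].argmax()))
        # 2. Predict the ego action token, conditioned on the predicted scene.
        logits = model(torch.tensor(seq).unsqueeze(0))
        action = int(logits[0, -1].argmax())
        seq.append(action)
        planned_actions.append(action)
    return planned_actions

Greedy argmax keeps the sketch short; sampling or beam search over the scene tokens is the obvious knob to turn if you want diverse futures.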

3. IL + RL training

  • Phase 1: Imitation learning on logged expert data with cross-entropy on both scene tokens and action tokens.
  • Phase 2: Offline RL (SAC-BC style) to nudge the policy toward better rewards (safety, comfort, progress) while staying close to demonstrations.

The architecture stays the same across both phases; only the objective changes.
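
A sketch of the Phase 1 objective, assuming one shared vocabulary and a boolean mask that marks which target positions are scene tokens (the mask and the equal weighting are illustrative assumptions):

import torch.nn.functional as F

def imitation_loss(logits, targets, scene_mask):
    # logits:     [batch, seq_len, vocab]   next-token predictions
    # targets:    [batch, seq_len]          ground-truth token ids, shifted by one
    # scene_mask: [batch, seq_len] bool     True where the target is a scene token
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    scene_loss = ce[scene_mask].mean()
    action_loss = ce[~scene_mask].mean()
    return scene_loss + action_loss   # equal weighting is an assumption, not DAP's choice

Phase 2 keeps the same network and swaps this objective for the offline RL loss plus a behaviour-cloning term that keeps the policy near the demonstrations.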


Comparison

Traditional planners usually fall into one of two camps:

  • One-shot trajectory generators: encode history once and spit out an entire future trajectory. Simple, but they struggle with long-horizon consistency.
  • Autoregressive ego-only models: step the trajectory forward but treat the scene as static context.

The discrete-token autoregressive planner improves on both by:

  • Jointly modelling scene evolution + ego actions, giving more supervision per time step.
  • Using a compact transformer (on the order of 10^8 parameters) yet matching or beating larger baselines on open-loop metrics such as average and max L2 error (see the sketch after this list).
  • Achieving competitive closed-loop driving quality in simulation, with strong safety and comfort scores after RL fine-tuning.
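
For reference, the open-loop L2 numbers above are just waypoint distances between the predicted and the expert trajectory; a minimal sketch (array shapes assumed, benchmark-specific protocol details omitted):

import numpy as np

def open_loop_l2(pred_xy, expert_xy):
    # pred_xy, expert_xy: [T, 2] arrays of (x, y) waypoints over the planning horizon.
    errors = np.linalg.norm(pred_xy - expert_xy, axis=-1)   # per-waypoint L2 distance
    return errors.mean(), errors.max()                      # average L2, max L2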

In short: smart structure instead of just "make it bigger".


Examples

To make this concrete, imagine a grid-world robot instead of a car. You can prototype similar ideas with an LLM or sequence model.

Go to target around a moving obstacle

Input:

<history_scene_tokens>
<history_ego_state_tokens>
<command_token: "GO_TO_TARGET">

Model rollout:

<scene_1: "obstacle_moves_right">
<action_1: "STEP_RIGHT_AND_ACCELERATE">

<scene_2: "free_path_left">
<action_2: "STEP_LEFT_SMALL">

Expected behavior:
The model predicts the obstacle will slide to the right, so it first moves right and speeds up, then corrects slightly left once the path is clear.

Follow lane and react to a pedestrian

Input:

<history_scene_tokens>
<history_ego_state_tokens>
<command_token: "FOLLOW_LANE">

Model rollout:

<scene_1: "lane_curves_left">
<action_1: "STEER_LEFT_MEDIUM">

<scene_2: "pedestrian_appears_ahead">
<action_2: "BRAKE_SOFT">

Expected behavior:
The model first anticipates the lane curving left and steers accordingly. When it predicts a pedestrian entering the lane, it responds by braking smoothly.


Takeaways

  • Structure > brute force. Carefully chosen discrete tokens and an autoregressive setup can close the gap with much larger continuous planners.
  • Joint prediction is underrated. Explicitly modelling future scene dynamics gives you extra gradients and better inductive bias for decision making.
  • IL + offline RL is a practical combo. Learn "how humans drive" first, then quietly optimize for the metrics you actually care about.
  • Reusable pattern. Anywhere you have an evolving world plus actions (robotics, strategy games, warehouse routing), you can steal this: tokenise the world and the actions, feed them into one sequence model, and train jointly.

Conclusion

Thinking of planning as a token-generation problem sounds like a gimmick until you see the benefits: consistent long-horizon behaviour, tighter coupling between perception and control, and strong performance with a relatively small model. If you are working on autonomous systems or any sequential decision problem, this formulation is a solid blueprint for your next planner experiment.

