New Plan: Latent Planning for Ideal Token Ordering

A LlamaGen-first roadmap for training an OrderHead planner that selects context-rich, extrapolative token orderings through latent reasoning. The central target is not confidence or ease, but the next block that best improves prediction of the still-unseen rest of the image.

Minkyu Jeon, Princeton University, April 2026

Contents

  1. Goal
  2. Core Insight: What "Ideal" Means
  3. Why LlamaGen First
  4. How the Four Papers Fit Together
  5. Architecture
  6. Oracle Target and Loss
  7. Algorithms
  8. Training Curriculum
  9. Evaluation Plan

1. Goal

The objective is to train an ideal OrderHead for autoregressive image generation in LlamaGen. The goal is not to learn a prettier saliency map or an easier decoding heuristic. The goal is to learn the best next context choice for predicting the still-unknown rest of the image.

Thesis: Raster is only one fixed autoregressive factorization. For images, there may exist a better ordering if the model can learn which regions should become context first.

2. Core Insight: What "Ideal" Means

The planner should not rank tokens by confidence, ease, or raw classifier saliency. Those can all produce shortcuts. Instead, the planner should rank the next token or block by how much it helps the model predict the other still-unseen tokens.

Ideal Standard

Let S be the current visible/generated context and let a be a candidate next block. The ideal next choice is the one with the highest future-context value:

u(a | S) = L_rest(S) - L_rest(S ∪ {a})

where L_rest is the autoregressive loss on the remaining unseen tokens. A good next block is one whose revelation makes the rest of the image easier to predict.

Important distinction: "important-first" is correct only if "important" means "useful as future context for the rest of the image," not "easy," not "confident," and not merely "class-discriminative in isolation."

True Extrapolation, Not Interpolation

To avoid local interpolation shortcuts, the improved target should emphasize gains on other unseen tokens, especially nonlocal ones. A stricter version is:

u_extrap(a | S) = sum over j in rest \ a of w(j, a) * [l_j(S) - l_j(S ∪ {a})]

where l_j(S) is the autoregressive loss on unseen token j given context S, and w(j, a) can downweight immediate neighbors of a and upweight far-away or global structure tokens.
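The relation between the two utilities can be made explicit. The derivation below is a sketch under one assumption that the text does not state: L_rest is an unnormalized sum of per-token losses over the unseen set R (all tokens not in S). If L_rest is instead a mean, the terms pick up different normalizers and the split is no longer exact.

```latex
\begin{aligned}
L_{\text{rest}}(S) &= \sum_{j \in R} \ell_j(S) \\
u(a \mid S) &= \sum_{j \in R} \ell_j(S) \;-\; \sum_{j \in R \setminus a} \ell_j(S \cup \{a\}) \\
            &= \underbrace{\sum_{j \in a} \ell_j(S)}_{\text{self term}}
             \;+\; \underbrace{\sum_{j \in R \setminus a} \bigl[\ell_j(S) - \ell_j(S \cup \{a\})\bigr]}_{\text{gain on other unseen tokens}}
\end{aligned}
```

Under that assumption, the second sum equals u_extrap with w(j, a) = 1 everywhere. The self term is large for any block that is simply hard to predict from S, whether or not revealing it helps anything else, which is one reason to prefer the stricter target.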

3. Why LlamaGen First

LlamaGen: Primary Platform

LlamaGen is strict next-token autoregression. Raster is the exact native baseline, so a learned ordering is directly comparable. Changing the order changes the actual autoregressive factorization.

MAR: Secondary Reference

MAR remains useful as a side reference, but bidirectional attention makes it less clean as the primary platform for proving "better-than-raster" next-token generation.

Main claim: If the scientific question is whether a learned order can beat raster for strict autoregressive image generation, LlamaGen is the cleaner and stronger testbed.

4. How the Four Papers Fit Together

Paper | Role in This Plan | Why It Matters
Path Planning | Main framework | Separates planner from generator and gives the cleanest starting point for "which position should be generated next?"
THINK WHILE YOU GENERATE | Secondary extension | Useful later for adaptive refinement, replanning, and robustness to early mistakes.
InfoTok | Adaptive budget | Motivates choosing a variable number of next blocks instead of a fixed k.
Continuous Latent Reasoning | Planner architecture | Motivates a small internal continuous scratchpad so the planner can evaluate multiple futures before acting.

Why Path Planning Over THINK WHILE YOU GENERATE

Path Planning is the main conceptual base because it more directly answers the first bottleneck in this project: how to train a separate planner to choose what should come next. It is a cleaner match for OrderHead because it emphasizes planner-generator separation and direct supervision on update decisions.

THINK WHILE YOU GENERATE is still important, but mainly for later stages: adaptive refinement, error correction, and variable compute during generation. Those are valuable once the basic planner target is already defined correctly.

5. Architecture

Start with block-level planning, not full free token-level permutations. The planner chooses the next spatial block, and tokens inside the selected block are decoded in raster order. This preserves local causal structure while still allowing a learned non-raster autoregressive order.
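To make the block-level factorization concrete, the sketch below expands an order over blocks into the token order the autoregressive model would actually follow, with raster order inside each block. The function name, the flat row-major token indexing, and the square block size are illustrative assumptions, not part of LlamaGen.

```python
def block_order_to_token_order(block_order, grid_h, grid_w, block):
    """Expand a permutation of block indices into a flat token order.

    Blocks are numbered in raster order over the block grid; tokens are
    numbered row-major over the full grid_h x grid_w token grid.
    """
    assert grid_h % block == 0 and grid_w % block == 0
    blocks_per_row = grid_w // block
    n_blocks = (grid_h // block) * blocks_per_row
    assert sorted(block_order) == list(range(n_blocks)), "need a permutation"

    token_order = []
    for b in block_order:
        top = (b // blocks_per_row) * block    # first token row of block b
        left = (b % blocks_per_row) * block    # first token column of block b
        for r in range(top, top + block):      # raster order inside the block
            for c in range(left, left + block):
                token_order.append(r * grid_w + c)
    return token_order


# 4x4 token grid, 2x2 blocks, bottom-right block generated first.
tokens = block_order_to_token_order([3, 0, 1, 2], grid_h=4, grid_w=4, block=2)
```

Note that even the raster order over blocks differs from plain token raster whenever the block size is larger than 1, so the token-raster baseline corresponds to block size 1 with the identity order.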

Order-Robust Backbone

Use a LlamaGen backbone trained with a block-RAR style curriculum: randomize order over blocks, keep raster inside each block, and save checkpoints during the random phase rather than only after annealing.

OrderHead as Planner

Input: GPT hidden states, visible/generated block mask, and spatial positions. Output: next-block logits, and later optionally an adaptive reveal budget.

Latent Planner Core

Before emitting logits, the planner runs a few internal continuous reasoning steps. This is the latent scratchpad inspired by Chain of Continuous Thought.

Image prefix / visible blocks
            |
            v
+---------------------------+
|     LlamaGen Backbone     |
|   order-robust AR model   |
+---------------------------+
            |
            v
     hidden states H_t
            |
            v
+---------------------------+
|    Latent Planner Core    |
|   z0 -> z1 -> ... -> zR   |
+---------------------------+
            |
            v
+---------------------------+
|    OrderHead / Planner    |
|     next-block logits     |
+---------------------------+
            |
            v
    select next block
            |
            v
decode tokens inside block in raster order

6. Oracle Target and Loss

Keep the training objective simple:

L = L_AR + λ * L_OH

where L_AR is the standard autoregressive next-token loss, L_OH is the planner loss, and λ weights the planner term. What matters is not adding more loss terms but defining L_OH correctly.

Recommended First Version of L_OH

Build an oracle next-block label from future AR loss reduction, then train OrderHead with cross-entropy:

L_OH = CE(predicted_next_block, oracle_best_next_block)

This makes the planner answer a clear question: which next block best improves prediction of the remaining unseen image?
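A minimal sketch of this loss in plain Python follows. Restricting the softmax to blocks that are still ungenerated is an assumption added here (the text only specifies cross-entropy against the oracle label), and the default value of `lam` is a placeholder rather than a recommended setting.

```python
import math


def planner_cross_entropy(logits, oracle_block, generated):
    """Cross-entropy of next-block logits against the oracle label.

    Assumption: already generated blocks are excluded from the softmax,
    since they can never be the correct next block.
    """
    valid = [i for i in range(len(logits)) if i not in generated]
    assert oracle_block in valid, "oracle label must be an ungenerated block"
    m = max(logits[i] for i in valid)                      # for numerical stability
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in valid))
    return log_z - logits[oracle_block]


def total_loss(l_ar, l_oh, lam=0.1):
    """L = L_AR + lambda * L_OH; the default lam is a hypothetical value."""
    return l_ar + lam * l_oh


# Uniform logits over three remaining blocks give a loss of log(3).
l_oh = planner_cross_entropy([0.0, 0.0, 0.0, 0.0], oracle_block=2, generated={0})
```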

7. Algorithms

Algorithm 1: Oracle Next-Block Label

Input:
  GT token grid x
  current visible state S
  candidate block set C
  frozen order-robust backbone G

1. Compute baseline future loss:
     L0 = AR loss on all remaining unseen tokens given S
2. For each candidate block a in C:
   a. Reveal GT block a: S_a = S ∪ {a}
   b. Compute future loss on remaining unseen tokens:
        La = AR loss on remaining tokens given S_a
   c. Define utility: u(a | S) = L0 - La
3. Select oracle best next block: a* = argmax_a u(a | S)
4. Return a*
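Algorithm 1 can be sketched in Python as below. The frozen backbone is abstracted as a callable `block_loss(j, context)` that returns the AR loss of block j given the set of visible blocks; that interface, and the toy loss used in the example, are assumptions for illustration only. The sketch follows the definition literally, so L0 also counts the loss of the candidate block itself.

```python
def oracle_next_block(visible, candidates, all_blocks, block_loss):
    """Return (best block, utilities) using u(a | S) = L0 - La."""
    def rest_loss(context):
        # AR loss summed over every block that is still unseen under `context`.
        return sum(block_loss(j, context) for j in all_blocks if j not in context)

    base = rest_loss(visible)                              # L0
    utilities = {a: base - rest_loss(visible | {a})        # L0 - La
                 for a in candidates}
    best = max(utilities, key=utilities.get)
    return best, utilities


def toy_loss(j, context):
    """Hypothetical stand-in: loss grows with distance to the nearest visible block."""
    if not context:
        return 4.0
    return 1.0 + min(abs(j - i) for i in context)


best, utils = oracle_next_block(frozenset({0}), [1, 2, 3, 4], range(5), toy_loss)
```

Each label costs one loss evaluation for the baseline plus one per candidate, which is why restricting the candidate set C matters in practice.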

Algorithm 2: Extrapolative Utility

Input:
  GT token grid x
  current visible state S
  candidate block a
  remaining unseen tokens R
  distance weighting w(j, a)

1. Compute tokenwise losses on remaining tokens:
     l_j(S) for j in R
2. Reveal GT block a: S_a = S ∪ {a}
3. Recompute tokenwise losses:
     l_j(S_a) for j in R \ a
4. Compute extrapolative utility:
     u_extrap(a | S) = sum over j in R \ a of w(j, a) * [l_j(S) - l_j(S_a)]
5. Repeat for every candidate block and choose the one with the highest u_extrap
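A sketch of the weighted sum in Algorithm 2 is given below. The text leaves w(j, a) open, so the hard cutoff used here (weight 0 within a Chebyshev radius of the block, weight 1 beyond it) is only one hypothetical choice, and the dictionary-based interface is likewise an assumption.

```python
def distance_weight(token_pos, block_positions, radius=1):
    """Hypothetical w(j, a): ignore tokens within `radius` of the block."""
    d = min(max(abs(token_pos[0] - r), abs(token_pos[1] - c))
            for r, c in block_positions)
    return 0.0 if d <= radius else 1.0


def extrapolative_utility(loss_before, loss_after, block_positions,
                          weight=distance_weight):
    """Weighted loss reduction on unseen tokens outside the revealed block.

    loss_before: {(row, col): l_j(S)} for j in R
    loss_after:  {(row, col): l_j(S_a)} for j in R minus a
    """
    block = set(block_positions)
    total = 0.0
    for pos, after in loss_after.items():
        if pos in block:                       # the block's own tokens never count
            continue
        total += weight(pos, block_positions) * (loss_before[pos] - after)
    return total


before = {(0, 0): 3.0, (0, 1): 2.0, (0, 3): 2.0}
after = {(0, 1): 0.5, (0, 3): 1.5}
u_far = extrapolative_utility(before, after, [(0, 0)])
u_all = extrapolative_utility(before, after, [(0, 0)], weight=lambda p, b: 1.0)
```

In this toy case most of the raw gain comes from the adjacent token, so the distance-weighted utility is much smaller than the unweighted one; that gap is the local-interpolation effect the weighting is meant to suppress.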

Algorithm 3: Latent Planner Inference

Input:
  generated prefix
  current block mask m_t
  backbone hidden states H_t

1. Pool planner state: z0 = Pool(H_t, m_t, positions)
2. Run latent planner reasoning:
     for r = 1 to R:
       z_r = PlannerCore(z_{r-1}, H_t, m_t)
3. Predict next-block logits: q_t = OrderHead(z_R, H_t, m_t)
4. Select next block: a_t = argmax q_t over ungenerated blocks
5. Decode tokens inside a_t in raster order
6. Update prefix and repeat
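The control flow of Algorithm 3 can be sketched as a plain loop. Every component (`hidden_fn`, `pool`, `planner_core`, `order_head`, `decode_block`) is passed in as a callable, and the lambdas in the example are toy stand-ins with no relation to real model components; spatial positions are omitted from `pool` for brevity.

```python
def plan_and_decode(n_blocks, hidden_fn, pool, planner_core, order_head,
                    decode_block, reasoning_steps=2):
    """Greedy latent-planner loop; returns the chosen block order."""
    mask = [False] * n_blocks                  # m_t: True once a block is generated
    order = []
    while not all(mask):
        h = hidden_fn(order)                   # H_t from the backbone
        z = pool(h, mask)                      # z0
        for _ in range(reasoning_steps):       # z1 ... zR
            z = planner_core(z, h, mask)
        logits = order_head(z, h, mask)        # q_t
        remaining = [i for i in range(n_blocks) if not mask[i]]
        a = max(remaining, key=lambda i: logits[i])   # argmax over ungenerated blocks
        decode_block(a)                        # raster decoding inside the block
        mask[a] = True
        order.append(a)
    return order


decoded = []
order = plan_and_decode(
    n_blocks=4,
    hidden_fn=lambda prefix: [float(len(prefix))] * 4,
    pool=lambda h, m: sum(h) / len(h),
    planner_core=lambda z, h, m: 0.5 * z + 1.0,
    order_head=lambda z, h, m: [z * w for w in (0.1, 0.4, 0.2, 0.3)],
    decode_block=decoded.append,
)
```

Setting `reasoning_steps=0` recovers the one-shot planner, so the same loop covers both the R=0 and R>0 settings.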

Algorithm 4: Adaptive Budget Extension

Input:
  latent planner state z_R
  hidden states H_t

1. Predict budget logits: b_t = Router(z_R, H_t)
2. Select budget k_t in {1, 2, 4}
3. Predict block utilities: q_t = OrderHead(z_R, H_t)
4. Select top-k_t blocks
5. Decode chosen blocks in order

This adaptive budget stage should come later, after one-block planning is already stable. It is the direct place where the InfoTok intuition enters the pipeline.
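The selection step of Algorithm 4 can be sketched as follows. Choosing the budget by argmax over the router logits and excluding already generated blocks from the top-k are assumptions made here; the algorithm as written does not specify either.

```python
def select_blocks(budget_logits, block_logits, generated, budgets=(1, 2, 4)):
    """Pick a budget k_t, then return the top-k_t ungenerated blocks."""
    # Assumption: greedy budget choice from the router logits.
    k = budgets[max(range(len(budgets)), key=lambda i: budget_logits[i])]
    # Assumption: blocks that are already generated are not eligible.
    candidates = [i for i in range(len(block_logits)) if i not in generated]
    ranked = sorted(candidates, key=lambda i: block_logits[i], reverse=True)
    return ranked[:k]          # fewer than k blocks if not enough remain


chosen = select_blocks([0.1, 0.9, 0.3], [0.5, 2.0, 1.0, 3.0], generated={3})
```

The returned list is ordered by predicted utility, which gives one natural reading of "decode chosen blocks in order".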

8. Training Curriculum

Stage 1: Prepare an Order-Robust Backbone

Train or prepare a block-RAR LlamaGen backbone. Randomize order over blocks, keep raster inside blocks, and save checkpoints during the random phase. The backbone must tolerate multiple block orders.

Stage 2: Build the Oracle Dataset

Sample partial contexts, evaluate candidate blocks with future-loss reduction, and store oracle next-block labels. This creates a direct supervised planning dataset.
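One way to organize this stage is sketched below. Sampling each partial context as a prefix of a random block order is an assumption (chosen because it matches the orders the block-random backbone is trained on), and `label_fn` is a hypothetical hook where the oracle from Algorithm 1 or 2 would be called.

```python
import random


def build_oracle_dataset(n_images, n_blocks, label_fn,
                         contexts_per_image=2, max_candidates=None, seed=0):
    """Collect (image, visible set, candidates, oracle label) records."""
    assert n_blocks >= 2
    rng = random.Random(seed)
    records = []
    for image_id in range(n_images):
        for _ in range(contexts_per_image):
            order = list(range(n_blocks))
            rng.shuffle(order)
            # Keep at least two blocks unseen so the choice is non-trivial.
            n_visible = rng.randrange(0, n_blocks - 1)
            visible = frozenset(order[:n_visible])
            candidates = order[n_visible:]
            if max_candidates is not None:     # optional cap on oracle cost
                candidates = candidates[:max_candidates]
            records.append({
                "image_id": image_id,
                "visible": visible,
                "candidates": candidates,
                "label": label_fn(image_id, visible, candidates),
            })
    return records
```

Capping `max_candidates` trades label quality for cost, since each candidate requires its own loss evaluation with the frozen backbone.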

Stage 3: Pretrain OrderHead

Freeze the backbone and train OrderHead to predict the oracle best next block. Use the simple objective L = L_AR + λ * L_OH. If the backbone is fully frozen, L_AR has no trainable parameters behind it, so training on L_OH alone is sufficient.

Stage 4: Add Latent Planner Reasoning

Increase planner reasoning depth gradually: R=0, then R=1, then R=2, and optionally R=4. Keep the same simple loss; the latent reasoning changes the architecture, not the target.

Stage 5: Limited Joint Tuning

Keep the planner trainable and unfreeze only a few top GPT blocks. Keep the same objective. This stage lets the generator adapt slightly to the planner without drifting into the instability seen in earlier OrderHead attempts.

Stage 6: Add Adaptive Budget

After one-block planning works, let the planner choose whether to reveal 1, 2, or 4 blocks based on the current state. This is the later-stage extension, not the first experiment.

9. Evaluation Plan

The primary comparison is still against raster. The planner should be evaluated both on final image quality and on whether it truly chooses context-rich extrapolative regions rather than shortcut local patches.

Setting | Purpose
Raster baseline | Native autoregressive baseline
Block-random-order backbone | Check whether order robustness alone helps
Backbone + one-shot planner (R=0) | Direct test of supervised planning
Backbone + latent planner (R=1,2,4) | Test latent reasoning benefit
Backbone + adaptive budget | Test InfoTok-style dynamic reveal size

Main Metrics

Key message for presentation: this project is no longer about learning a saliency map. It is about learning a planner that chooses the next context block that most improves prediction of the unseen rest of the image.