1. Goal
The objective is to train an ideal OrderHead for autoregressive image generation in LlamaGen. The goal is not to learn a prettier saliency map or an easier decoding heuristic. The goal is to learn the best next context choice for predicting the still-unknown rest of the image.
Thesis: Raster is only one fixed autoregressive factorization. For images, there may exist a better ordering if the model can learn which regions should become context first.
2. Core Insight: What "Ideal" Means
The planner should not rank tokens by confidence, ease, or raw classifier saliency. Those can all produce shortcuts. Instead, the planner should rank the next token or block by how much it helps the model predict the other still-unseen tokens.
Ideal Standard
Let S be the current visible/generated context and let a be a candidate next block. The ideal next choice is the candidate with the highest future-context value, that is, the largest reduction in L_rest, where L_rest is the autoregressive loss on the remaining unseen tokens. A good next block is one whose revelation makes the rest of the image easier to predict.
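One way to write the future-context value formally is sketched here. The form is an assumption consistent with the description above, and the symbols V, U (the set of still-unseen candidate blocks), and a* are introduced only for illustration:

```latex
V(a \mid S) = L_{\text{rest}}(S) - L_{\text{rest}}(S \cup \{a\}),
\qquad
a^{*} = \arg\max_{a \in U} V(a \mid S)
```

Under this reading, both losses would be measured on the same target set (the unseen tokens outside a), so that a block gets no credit simply for removing its own tokens from the remaining loss.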
Important distinction: "important-first" is correct only if "important" means "useful as future context for the rest of the image," not "easy," not "confident," and not merely "class-discriminative in isolation."
True Extrapolation, Not Interpolation
To avoid local interpolation shortcuts, the improved target should emphasize gains on other unseen tokens, especially nonlocal ones. A stricter version weights the gain on each unseen token j by a factor w(j, a), which can downweight immediate neighbors of a and upweight far-away or global-structure tokens.
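A possible weighted form of the stricter target is sketched here. It is an assumed reconstruction from the surrounding text: the per-token loss ℓ_j, the unseen set U, and the name V_ext are hypothetical notation, not taken from the source.

```latex
V_{\text{ext}}(a \mid S) = \sum_{j \in U \setminus a} w(j, a)\,
\bigl[\ell_j(S) - \ell_j(S \cup \{a\})\bigr]
```

Here ℓ_j(·) denotes the autoregressive loss on unseen token j given the stated context. Setting w(j, a) = 1 for every j recovers the unweighted loss reduction.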
3. Why LlamaGen First
LlamaGen (Primary Platform)
LlamaGen performs strict next-token autoregression. Raster is its exact native baseline, so a learned ordering is directly comparable. Changing the order changes the actual autoregressive factorization.
MAR (Secondary Reference)
MAR remains useful as a side reference, but its bidirectional attention makes it less clean as the primary platform for proving "better-than-raster" next-token generation.
Main claim: If the scientific question is whether a learned order can beat raster for strict autoregressive image generation, LlamaGen is the cleaner and stronger testbed.
4. How the Four Papers Fit Together
| Paper | Role in This Plan | Why It Matters |
|---|---|---|
| Path Planning | Main framework | Separates planner from generator and gives the cleanest starting point for "which position should be generated next?" |
| THINK WHILE YOU GENERATE | Secondary extension | Useful later for adaptive refinement, replanning, and robustness to early mistakes. |
| InfoTok | Adaptive budget | Motivates choosing a variable number of next blocks instead of a fixed k. |
| Continuous Latent Reasoning | Planner architecture | Motivates a small internal continuous scratchpad so the planner can evaluate multiple futures before acting. |
Why Path Planning Over THINK WHILE YOU GENERATE
Path Planning is the main conceptual base because it more directly answers the first bottleneck in this project: how to train a separate planner to choose what should come next. It is a cleaner match for OrderHead because it emphasizes planner-generator separation and direct supervision on update decisions.
THINK WHILE YOU GENERATE is still important, but mainly for later stages: adaptive refinement, error correction, and variable compute during generation. Those are valuable once the basic planner target is already defined correctly.
5. Architecture
Start with block-level planning, not fully free token-level permutations. The planner chooses the next spatial block, and tokens inside the selected block are decoded in raster order. This preserves local causal structure while still allowing a learned non-raster autoregressive order.
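To make the block-level ordering concrete, the following minimal sketch expands a block permutation into a token generation order with raster order inside each block. It assumes a square token grid, square blocks, and block ids numbered in raster order; the function name and arguments are illustrative, not part of LlamaGen.

```python
def block_order_to_token_order(grid, block, block_order):
    """Expand a block permutation into a flat token generation order.

    grid: tokens per side of the (assumed square) token grid.
    block: tokens per side of each square block; must divide grid.
    block_order: permutation of block ids, numbered in raster order.
    Returns flat token indices (row * grid + col).
    """
    if grid % block != 0:
        raise ValueError("block size must divide grid size")
    blocks_per_side = grid // block
    if sorted(block_order) != list(range(blocks_per_side ** 2)):
        raise ValueError("block_order must be a permutation of all block ids")
    order = []
    for b in block_order:
        top = (b // blocks_per_side) * block
        left = (b % blocks_per_side) * block
        # Raster order inside the selected block.
        for r in range(top, top + block):
            for c in range(left, left + block):
                order.append(r * grid + c)
    return order


# 4x4 token grid split into four 2x2 blocks, bottom-right block first.
print(block_order_to_token_order(4, 2, [3, 0, 2, 1]))
```

Passing the identity permutation gives a block-raster order, which is the natural control when comparing a learned block order against a fixed one.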
Order-Robust Backbone
Use a LlamaGen backbone trained with a block-RAR style curriculum: randomize order over blocks, keep raster inside each block, and save checkpoints during the random phase rather than only after annealing.
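The curriculum could be scheduled roughly as follows. This is a sketch under assumptions: the schedule shape (fully random block orders for an initial fraction of training, then a linear decay to raster) and the `anneal_start` parameter are illustrative choices, not values from the source.

```python
import random


def random_order_prob(step, total_steps, anneal_start=0.5):
    """Probability of using a random block order at this training step.

    Assumed schedule: fully random until anneal_start * total_steps,
    then linear decay to 0 (pure raster block order) at total_steps.
    """
    start = anneal_start * total_steps
    if step <= start:
        return 1.0
    if step >= total_steps:
        return 0.0
    return 1.0 - (step - start) / (total_steps - start)


def sample_block_order(num_blocks, step, total_steps, rng):
    """Return a random block permutation or the raster block order."""
    order = list(range(num_blocks))
    if rng.random() < random_order_prob(step, total_steps):
        rng.shuffle(order)
    return order


def in_random_phase(step, total_steps, anneal_start=0.5):
    """True while checkpoints are still order-robust (before annealing starts)."""
    return step <= anneal_start * total_steps


rng = random.Random(0)
print(sample_block_order(16, step=0, total_steps=1000, rng=rng))
```

Checkpoints saved while `in_random_phase` is true are the ones that have not yet been annealed toward raster, which is the property the planner stages rely on.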
OrderHead as Planner
Input: GPT hidden states, visible/generated block mask, and spatial positions. Output: next-block logits, and later optionally an adaptive reveal budget.
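One detail implied by the visible/generated block mask is that already-visible blocks should never be selected again. A minimal sketch of turning next-block logits into a distribution over the remaining candidates, with plain Python lists standing in for tensors (an assumption made to keep the example dependency-free):

```python
import math


def next_block_distribution(logits, visible):
    """Softmax over not-yet-visible blocks; visible blocks get probability 0.

    logits: one score per block from the planner head.
    visible: one bool per block, True if the block is already in context.
    """
    candidates = [i for i, seen in enumerate(visible) if not seen]
    if not candidates:
        raise ValueError("all blocks are already visible")
    m = max(logits[i] for i in candidates)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in candidates}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


probs = next_block_distribution([5.0, 1.0, 1.0, 0.0], [True, False, False, True])
print(probs)
```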
Latent Planner Core
Before emitting logits, the planner runs a few internal continuous reasoning steps. This is the latent scratchpad inspired by Chain of Continuous Thought.
6. Oracle Target and Loss
Keep the training objective simple:
L = L_AR + λ * L_OH
where L_AR is the standard autoregressive next-token loss, L_OH is the planner loss, and λ weights the planner term.
The important part is not adding many loss terms. The important part is defining L_OH correctly.
Recommended First Version of L_OH
Build an oracle next-block label from future AR loss reduction, then train OrderHead with cross-entropy against that label.
This makes the planner answer a clear question: which next block best improves prediction of the remaining unseen image?
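Written out, the label and loss could take the following form. This is an assumed formulation based on the description above; p_φ (the planner's softmax over candidate blocks, with parameters φ) and a*(S) are notation introduced for illustration.

```latex
a^{*}(S) = \arg\max_{a}\,\bigl[L_{\text{rest}}(S) - L_{\text{rest}}(S \cup \{a\})\bigr],
\qquad
L_{\text{OH}} = -\log p_{\phi}\bigl(a^{*}(S) \mid S\bigr)
```

A softer variant would use the normalized gains as a target distribution instead of a one-hot label, but the one-hot form matches the "oracle best next block" wording most directly.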
7. Algorithms
Algorithm 1: Oracle Next-Block Label
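A minimal sketch of how such a label could be computed, under stated assumptions: `rest_loss(context, targets)` is a hypothetical callable that returns the backbone's AR loss on the blocks in `targets` given the blocks in `context`, and each candidate is scored on the unseen blocks other than itself so that all candidates are compared on a like-for-like target set.

```python
def oracle_next_block(visible, num_blocks, rest_loss):
    """Return (best_block, gains) for one partial context.

    visible: set of block ids already in context (S).
    rest_loss(context, targets): hypothetical AR-loss callable (see lead-in).
    """
    unseen = [b for b in range(num_blocks) if b not in visible]
    if not unseen:
        raise ValueError("no candidate blocks left")
    context = frozenset(visible)
    gains = {}
    for a in unseen:
        targets = frozenset(unseen) - {a}      # other still-unseen blocks
        before = rest_loss(context, targets)
        after = rest_loss(context | {a}, targets)
        gains[a] = before - after              # future-loss reduction from revealing a
    # Ties resolve to the lowest block id, i.e. the raster-first candidate.
    best = max(unseen, key=lambda a: gains[a])
    return best, gains


def toy_rest_loss(context, targets):
    """Toy stand-in: block 2 is globally informative and lowers every target's loss."""
    per_block = 0.4 if 2 in context else 1.0
    return per_block * len(targets)


best, gains = oracle_next_block({0}, 4, toy_rest_loss)
print(best, gains)
```

Each label costs two loss evaluations per candidate, so in practice the candidate set or the number of sampled contexts would likely need to be limited.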
Algorithm 2: Extrapolative Utility
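The extrapolative utility could be sketched as a weighted sum of per-block loss reductions. Everything specific here is an assumption for illustration: the block-level granularity, the Chebyshev distance on the block grid, and the two-level weight (`near_weight` for immediate neighbors, 1.0 elsewhere) are just one simple choice of w(j, a).

```python
def block_distance(j, a, blocks_per_side):
    """Chebyshev distance between two blocks on the block grid."""
    rj, cj = divmod(j, blocks_per_side)
    ra, ca = divmod(a, blocks_per_side)
    return max(abs(rj - ra), abs(cj - ca))


def locality_weight(j, a, blocks_per_side, near_radius=1, near_weight=0.25):
    """Assumed w(j, a): downweight blocks within near_radius of the candidate."""
    if block_distance(j, a, blocks_per_side) <= near_radius:
        return near_weight
    return 1.0


def extrapolative_utility(a, loss_before, loss_after, blocks_per_side, **weight_kwargs):
    """Weighted loss reduction over unseen blocks other than the candidate a.

    loss_before / loss_after: dicts mapping unseen block id -> loss
    without / with block a in context.
    """
    total = 0.0
    for j, before in loss_before.items():
        if j == a:
            continue
        w = locality_weight(j, a, blocks_per_side, **weight_kwargs)
        total += w * (before - loss_after[j])
    return total


# 3x3 block grid, candidate block 0: block 1 is adjacent, block 8 is far away.
u = extrapolative_utility(0, {1: 1.0, 8: 1.0}, {1: 0.2, 8: 0.6}, blocks_per_side=3)
print(u)
```

In the toy call, the adjacent block improves more in raw terms, but the far block contributes more to the utility, which is the behavior the extrapolation target is meant to reward.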
Algorithm 3: Latent Planner Inference
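The inference loop could look roughly like this sketch. The callables `step_fn` (one latent update) and `readout_fn` (latent to per-block logits) are hypothetical stand-ins for learned planner modules, and greedy selection is an assumption; sampling from the masked distribution would be an equally plausible choice.

```python
def latent_planner_select(state, visible, step_fn, readout_fn, num_steps):
    """Run num_steps latent updates, then pick the best not-yet-visible block.

    state: initial planner latent built from backbone features (any object
        that step_fn and readout_fn understand).
    step_fn(latent, state) -> latent: hypothetical learned update.
    readout_fn(latent) -> list of per-block logits: hypothetical learned head.
    num_steps: number of internal reasoning steps; 0 is a one-shot planner.
    """
    latent = state
    for _ in range(num_steps):
        latent = step_fn(latent, state)  # continuous scratchpad update, no tokens emitted
    logits = readout_fn(latent)
    candidates = [i for i, seen in enumerate(visible) if not seen]
    if not candidates:
        raise ValueError("all blocks are already visible")
    return max(candidates, key=lambda i: logits[i])


# Toy modules: the latent is already a list of block scores.
state = [0.0, 1.0, 0.5, 0.9]
visible = [False, True, False, False]
choice = latent_planner_select(state, visible, lambda z, s: z, lambda z: z, num_steps=0)
print(choice)
```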
Algorithm 4: Adaptive Budget Extension
This adaptive budget stage should come later, after one-block planning is already stable. It is the direct place where the InfoTok intuition enters the pipeline.
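One simple way to realize the adaptive budget is a small budget head that scores a fixed set of reveal sizes, followed by top-k selection over the planner's block logits. This is a sketch under assumptions: the budget head, its logits, and the greedy top-k rule are illustrative and not specified in the source.

```python
def select_blocks_with_budget(block_logits, budget_logits, visible, budgets=(1, 2, 4)):
    """Pick a reveal size k from `budgets`, then return the top-k unseen blocks.

    block_logits: planner score per block.
    budget_logits: hypothetical budget-head score per entry of `budgets`.
    visible: one bool per block, True if already in context.
    """
    if len(budget_logits) != len(budgets):
        raise ValueError("one budget logit per budget option is required")
    candidates = [i for i, seen in enumerate(visible) if not seen]
    if not candidates:
        raise ValueError("all blocks are already visible")
    choice = max(range(len(budgets)), key=lambda i: budget_logits[i])
    k = min(budgets[choice], len(candidates))  # never reveal more blocks than remain
    ranked = sorted(candidates, key=lambda i: block_logits[i], reverse=True)
    return ranked[:k]


blocks = select_blocks_with_budget(
    block_logits=[0.1, 2.0, 0.5, 1.5, -1.0],
    budget_logits=[0.0, 1.0, 0.2],
    visible=[False, True, False, False, False],
)
print(blocks)
```

The returned blocks are ordered by planner score, so they can be decoded in that order with raster order inside each block.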
8. Training Curriculum
Stage 1: Prepare an Order-Robust Backbone
Train or prepare a block-RAR LlamaGen backbone. Randomize order over blocks, keep raster inside blocks, and save checkpoints during the random phase. The backbone must tolerate multiple block orders.
Stage 2: Build the Oracle Dataset
Sample partial contexts, evaluate candidate blocks with future-loss reduction, and store oracle next-block labels. This creates a direct supervised planning dataset.
Stage 3: Pretrain OrderHead
Freeze the backbone and train OrderHead to predict the oracle best next block. Use the simple objective
L = L_AR + λ * L_OH, or only L_OH if the backbone is fully frozen.
Stage 4: Add Latent Planner Reasoning
Increase the planner's reasoning depth R (the number of internal latent reasoning steps) gradually: R=0, then R=1, then R=2, and optionally R=4. Keep the same simple loss; the latent reasoning changes the architecture, not the target.
Stage 5: Limited Joint Tuning
Keep training the planner and unfreeze only a few top GPT blocks. Keep the same objective. This stage lets the generator adapt slightly to the planner without drifting into the instability seen in earlier OrderHead attempts.
Stage 6: Add Adaptive Budget
After one-block planning works, let the planner choose whether to reveal 1, 2, or 4 blocks based on the current state. This is the later-stage extension, not the first experiment.
9. Evaluation Plan
The primary comparison is still against raster. The planner should be evaluated both on final image quality and on whether it truly chooses context-rich extrapolative regions rather than shortcut local patches.
| Setting | Purpose |
|---|---|
| Raster baseline | Native autoregressive baseline |
| Block-random-order backbone | Check whether order robustness alone helps |
| Backbone + one-shot planner (R=0) | Direct test of supervised planning |
| Backbone + latent planner (R=1,2,4) | Test latent reasoning benefit |
| Backbone + adaptive budget | Test InfoTok-style dynamic reveal size |
Main Metrics
- FID / sFID / IS: final sample quality
- Progressive recognizability: whether the chosen order makes the image become meaningful earlier
- Future-loss reduction: direct test of the planning objective
- Far-context gain: how much the selected block helps far unseen tokens rather than only local neighbors
- Planner calibration: agreement with oracle rankings
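Planner calibration against the oracle could be measured with top-1 agreement and a rank correlation. The sketch below assumes that, for each evaluated context, planner scores and oracle gains are given as equal-length lists over the same candidate blocks; the choice of Spearman correlation is an illustrative assumption.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def top1_agreement(planner_scores, oracle_gains):
    """Fraction of contexts where planner and oracle pick the same block."""
    hits = 0
    for scores, gains in zip(planner_scores, oracle_gains):
        planner_pick = max(range(len(scores)), key=scores.__getitem__)
        oracle_pick = max(range(len(gains)), key=gains.__getitem__)
        hits += planner_pick == oracle_pick
    return hits / len(planner_scores)


print(spearman([0.2, 0.9, 0.4], [0.1, 1.3, 0.5]))
print(top1_agreement([[0.1, 0.9], [0.8, 0.2]], [[0.0, 1.0], [0.0, 1.0]]))
```

Top-1 agreement checks only the chosen block, while the rank correlation also rewards a planner that orders the runner-up candidates sensibly.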
Key message for presentation: this project is no longer about learning a saliency map. It is about learning a planner that chooses the next context block which maximally improves prediction of the unseen rest of the image.