Learned Token Ordering for Visual Autoregressive Generation

1. Motivation

Standard autoregressive image generation models (LlamaGen, VQGAN+GPT) generate discrete visual tokens in a fixed raster-scan order (left-to-right, top-to-bottom). In language, left-to-right order follows semantic structure; for images, this ordering is arbitrary: there is no inherent reason why the top-left token should be generated before the tokens at the center of an object.

The hypothesis: if we generate class-discriminative (important) tokens first, they provide better context for generating the remaining tokens. For example, generating the dog's face before the background could give the model much richer context than generating background sky first.

Figure: Spatial coverage of generated context at step 32, raster order (left) vs. random order (right). Raster provides 34.8% spatial neighbor coverage; random provides only 5-6%.

Research question: Can a learned token ordering (via an OrderHead module) improve image generation quality compared to the default raster-scan order?

2. Background: MAR vs LlamaGen

We investigate token ordering in two fundamentally different autoregressive architectures:

MAR (Masked Autoregressive)

Generates tokens by iteratively unmasking subsets. Uses bidirectional attention — every visible token attends to every other visible token. The generation order determines which tokens are revealed first as context. Inherently order-agnostic: trained with random masking.

Key property: "Important first" means those tokens become context for the rest.

LlamaGen (Causal Autoregressive)

Generates tokens one-by-one with causal (left-only) attention. Each token can only see previously generated tokens. Uses 2D RoPE for spatial position encoding independent of sequence order. Trained with teacher forcing on a fixed sequence order.

Key property: "First in sequence" means predicted with least context.
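To make the last two properties concrete, the minimal sketch below shows how a generation order can be decoupled from spatial position. Each token keeps its (row, column) coordinate for position encoding, while its causal context is whatever precedes it in the sequence. The function names and the toy 2x3 grid are illustrative assumptions, not the LlamaGen API.

```python
def spatial_position_ids(order, width):
    """Map each flat grid index in `order` to its (row, col) coordinate.

    The coordinates depend only on where a token sits in the image,
    not on when it is generated.
    """
    return [(idx // width, idx % width) for idx in order]


def causal_context(order, step):
    """Grid indices visible when predicting the token at sequence index `step`."""
    return order[:step]


# Toy 2x3 grid generated in column-major order (hypothetical example).
width = 3
order = [0, 3, 1, 4, 2, 5]

position_ids = spatial_position_ids(order, width)
first_context = causal_context(order, 0)   # the first token sees nothing
fourth_context = causal_context(order, 3)  # the fourth token sees three tokens

print(position_ids)    # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
print(first_context)   # []
print(fourth_context)  # [0, 3, 1]
```

The empty context for the first token is the "least context" property stated above.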

Property	MAR	LlamaGen
Attention	Bidirectional (all visible tokens)	Causal (left-context only)
Position encoding	Learned absolute	2D RoPE (spatial)
Tokens per step	Multiple (cosine schedule)	One
Loss	Diffusion (continuous latents)	Cross-entropy (discrete VQ tokens)
Order agnosticism	Inherent (random masking)	Must be explicitly trained (RAR)
Ordering effect	"First" = best context provider	"First" = least context available
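The "Tokens per step" row can be illustrated with a cosine unmasking schedule. The sketch below uses one common form from masked generative models, in which the fraction of tokens still masked after step t follows cos(pi/2 * t / T). The exact schedule used in these experiments is not specified here, so the formula, the function name, and the defaults (256 tokens, 64 steps) should be read as assumptions.

```python
import math


def cosine_unmask_schedule(num_tokens=256, num_steps=64):
    """Return the number of tokens revealed at each step (assumed cosine form)."""
    if not 1 <= num_steps <= num_tokens:
        raise ValueError("expected 1 <= num_steps <= num_tokens")
    revealed_per_step = []
    masked = num_tokens
    for step in range(num_steps):
        if step == num_steps - 1:
            target_masked = 0  # the final step reveals everything that is left
        else:
            ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
            target_masked = math.floor(num_tokens * ratio)
            # Reveal at least one token per step, keep at least one masked.
            target_masked = max(1, min(masked - 1, target_masked))
        revealed_per_step.append(masked - target_masked)
        masked = target_masked
    return revealed_per_step


schedule = cosine_unmask_schedule(256, 64)
print(sum(schedule), schedule[0], schedule[-1])
```

Under this form, early steps reveal few tokens and later steps reveal more, so the earliest revealed tokens serve as context for the largest number of later predictions.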

3. The OrderHead Approach

Architecture

The OrderHead is a small ViT-like transformer (3.5M params) that scores all spatial positions. High scores indicate tokens that should be generated first. It is trained alongside the main generation model using a combination of:

STE (Straight-Through Estimator): Binary-split mask with differentiable backward pass via sigmoid approximation. Gradient flows from generation loss through token ordering back to OrderHead.
ViT Oracle supervision: A frozen ViT classifier trained on partial token observations provides per-token importance scores as regression targets.
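The straight-through idea can be sketched without an autograd framework: the forward pass uses a hard 0/1 mask over the k highest-scoring positions, while the backward pass substitutes the derivative of a sigmoid centered on the split threshold. The temperature parameter, the midpoint threshold, and the function names are illustrative assumptions; the source only states that the backward pass uses a sigmoid approximation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ste_binary_mask(scores, k, temperature=1.0):
    """Return (hard_mask, surrogate_grad) for a top-k binary split.

    Assumes distinct scores and 0 < k < len(scores).
    """
    ranked = sorted(scores, reverse=True)
    # Assumed threshold: midpoint between the k-th and (k+1)-th largest score.
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    # Forward pass: hard, non-differentiable mask.
    hard_mask = [1.0 if s > threshold else 0.0 for s in scores]
    # Backward pass: derivative of sigmoid((s - threshold) / temperature).
    surrogate_grad = []
    for s in scores:
        p = sigmoid((s - threshold) / temperature)
        surrogate_grad.append(p * (1.0 - p) / temperature)
    return hard_mask, surrogate_grad


def backprop_to_scores(grad_wrt_mask, surrogate_grad):
    """Chain rule: route the loss gradient on the mask back to the scores."""
    return [g * d for g, d in zip(grad_wrt_mask, surrogate_grad)]


scores = [2.0, -1.0, 0.5, 3.0]
hard_mask, surrogate_grad = ste_binary_mask(scores, k=2)
grad_scores = backprop_to_scores([1.0, 1.0, 1.0, 1.0], surrogate_grad)
print(hard_mask)  # [1.0, 0.0, 0.0, 1.0]
```

Scores near the threshold receive the largest surrogate gradient, so under this approximation the positions most likely to switch groups are the ones updated most strongly.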

Pipeline:

GT tokens [B, 256, 8] ──┐
                        ↓ (GT routing during training)
GPT hidden states [B, 256, D] ──→ OrderHead ──→ scores [B, 256]
        ↓
STE binary mask
        ↓
Reorder tokens (top-k first)
        ↓
GPT forward with position_ids
        ↓
CE loss → backward
        ↓
STE gradient → OrderHead update

Binary-Split STE (k=64)

Rather than a full permutation, we use a binary split: the top-64 scored positions form the "context group" (generated first), and the remaining 192 form the "prediction group." Within each group, tokens follow raster order. The STE connects CE loss gradients back to OrderHead scores through a differentiable sigmoid approximation.
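The binary split described above can be sketched as a function from per-position scores to a generation order. The function name and the tie-breaking rule (lower raster index wins) are assumptions made for illustration.

```python
import random


def binary_split_order(scores, k):
    """Top-k scored positions first, the rest after, raster order within each group."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    # Rank positions by descending score; ties go to the lower raster index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    context_group = sorted(ranked[:k])     # generated first
    prediction_group = sorted(ranked[k:])  # generated afterwards
    return context_group + prediction_group


toy_order = binary_split_order([0.1, 0.9, 0.3, 0.8, 0.2, 0.7], k=2)
print(toy_order)  # [1, 3, 0, 2, 4, 5]

# The configuration described in the text: 256 positions, k = 64.
rng = random.Random(0)
full_order = binary_split_order([rng.random() for _ in range(256)], k=64)
```

Because each group is sorted by raster index, the scores only decide group membership; the relative order inside a group never depends on the score values.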

4. Results Summary

RAR FID (IN-100, 300ep fair), best LlamaGen: 19.61

Raster baseline FID (IN-100): 25.4

Raster baseline FID (IN-1K): 5.32

Best MAR+OH FID-50K (IN-1K): 6.55

LlamaGen — RAR Curriculum + Trained Orderings (IN-100, 5K samples)

Comprehensive comparison of all LlamaGen training approaches:

Training Method	Inference	FID	sFID	Notes
Raster retrain (300ep 1K + 300ep 100 FT)	Raster	17.57	67.18	~600ep effective; unfair baseline
RAR (300ep from scratch)	Raster	19.61	65.83	23% better than raster (fair, same epochs)
RAR (400ep from scratch)	Raster	18.94	66.25	33% more epochs than baselines
Diagonal (300ep from scratch)	Diagonal	24.97	65.99	Train & eval in diagonal order
Raster 300ep from scratch	Raster	25.40	66.33	Fair raster baseline (matches pretrained)
Raster baseline (original ckpt)	Raster	25.40	66.33	Standard left-to-right, top-to-bottom
Col-major (300ep)	Col-major	25.95	66.26
Zigzag (300ep)	Zigzag	26.35	66.18
Spiral (300ep)	Spiral	30.94	66.42
Random fixed (300ep)	Random	44.74	66.78	One fixed permutation
Pure random AR (300ep, ta_embed)	Raster	151.10	110.44	No annealing — raster is OOD
RAR (400ep from scratch)	Random0	176.04	123.49	After annealing, random is OOD
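Each ordering in the table is a permutation of the flat grid indices. The sketch below gives one plausible definition of several of them; the exact zigzag and diagonal conventions used in the experiments are not stated, so these definitions (boustrophedon rows, anti-diagonals from the top-left) are assumptions. Spiral is omitted for brevity.

```python
import random


def raster_order(h, w):
    """Row by row, left to right."""
    return list(range(h * w))


def col_major_order(h, w):
    """Column by column, top to bottom."""
    return [r * w + c for c in range(w) for r in range(h)]


def zigzag_order(h, w):
    """Assumed boustrophedon: even rows left-to-right, odd rows right-to-left."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order


def diagonal_order(h, w):
    """Assumed anti-diagonals from the top-left corner, top to bottom within each."""
    return [r * w + (d - r)
            for d in range(h + w - 1)
            for r in range(h)
            if 0 <= d - r < w]


def random_fixed_order(h, w, seed=0):
    """One fixed random permutation, reproducible from the seed."""
    order = list(range(h * w))
    random.Random(seed).shuffle(order)
    return order


print(col_major_order(3, 3))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(zigzag_order(3, 3))     # [0, 1, 2, 5, 4, 3, 6, 7, 8]
print(diagonal_order(3, 3))   # [0, 1, 3, 2, 4, 6, 5, 7, 8]
```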

LlamaGen — Heuristic Orderings on Raster-Trained Model (IN-100)

Inference-only ordering changes on the raster baseline (no retraining):

Ordering	FID	sFID	Notes
Raster (baseline)	25.40	66.33	Standard left-to-right, top-to-bottom
Zigzag	71.69	82.08	Best non-raster heuristic
Diagonal	117.58	93.89
Stride-2	128.18	93.83
Random	173.08	119.23	Fixed random permutation (seed=0)

LlamaGen — OrderHead Results (IN-100, 5K samples)

No OrderHead experiment improved over the raster baseline. The FID column reports OrderHead-driven inference (semifast-8, 8 rescoring candidates):

Experiment	OH FID (semifast-8)	score_std	Failure Mode
Phase 1 (frozen GPT, oracle)	155.58	1.6	GPT never trained on non-raster orderings
Phase 2 direct (STE + oracle)	140.89	—	GPT body too slow to adapt from raster
OH STE finetune (300ep)	70.63	14–17	Score explosion — scores grow unboundedly
OH from scratch (curriculum)	147.92	0.46	Score collapse + GPT body never converged
OH short finetune (20ep, w/ regularization)	136.47	0.04	Over-regularized — scores collapsed to uniform
Raster baseline (no OH)	25.40	—

Root cause: All failures trace back to the same issue: attaching an OrderHead to a raster-trained GPT body. The GPT was never exposed to non-raster orderings during pretraining, so it cannot assign meaningful probabilities to arbitrarily-ordered tokens. Joint training either destabilizes the GPT body (CE > 7.3 vs baseline 6.8) or collapses the OH scores. The next step is to use a RAR-trained (order-agnostic) backbone.

MAR — OrderHead Results (IN-1K, FID-50K, verified from eval logs)

Model	FID	IS	Notes
MAR-B pretrained baseline	2.30	270.4	Official checkpoint
Random-order retrained (5ep warmup)	2.31	253.0	Comparable to baseline
Decoder finetune k=1	3.85	208.0
OH full finetune ep5	5.75	220.2	Degrades monotonically: ep10=6.94, ep15=8.16, ep20=9.58
OH oracle (from random-order) ep9	6.55	217.0	Best OrderHead result; stable score_std~1.6
OH cosine iter=256 rescore8	7.08	209.5	AR schedule ablation
OH pure STE (from random-order) ep9	10.95	180.5	STE-only, no oracle; score_std diverges
OH cosine iter=64 rescore8	13.79	169.3	Fewer AR steps degrades quality

All MAR FID-50K numbers verified against Slurm evaluation logs (jobs 5864880, 5884168, 5952087, 5952436, 5957840).

5. Key Findings

Finding 1: RAR curriculum beats pure raster training

The RAR curriculum (random→anneal→raster) achieves FID=19.61 on IN-100 in a fair 300-epoch comparison, improving 23% over the raster baseline (25.40) trained with identical settings. IS also improves 26% (61.82 vs 49.18). The key insight: random-order training acts as sequence-space data augmentation, forcing the model to develop more general spatial attention patterns. After annealing back to raster, these robust representations translate to better generation quality at zero additional inference cost. IN-1K evaluation pending (job running, ep 129/300).
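The random→anneal→raster curriculum can be sketched as a per-epoch probability of training on a random permutation instead of raster order. The linear anneal and the epoch boundaries below (100 and 200 out of 300) are hypothetical placeholders; the schedule actually used in these runs is not given in this summary.

```python
import random


def random_order_prob(epoch, anneal_start, anneal_end):
    """Probability of using a random token order at a given epoch (assumed linear anneal)."""
    if epoch < anneal_start:
        return 1.0  # pure random-order phase
    if epoch >= anneal_end:
        return 0.0  # pure raster phase
    return 1.0 - (epoch - anneal_start) / (anneal_end - anneal_start)


def sample_training_order(num_tokens, epoch, anneal_start, anneal_end, rng):
    """Return the token order for one training sample."""
    order = list(range(num_tokens))  # raster order by default
    if rng.random() < random_order_prob(epoch, anneal_start, anneal_end):
        rng.shuffle(order)
    return order


rng = random.Random(0)
early = sample_training_order(256, epoch=0, anneal_start=100, anneal_end=200, rng=rng)
late = sample_training_order(256, epoch=299, anneal_start=100, anneal_end=200, rng=rng)
print(random_order_prob(150, 100, 200))  # 0.5
print(late == list(range(256)))          # True
```

Because the final phase trains only on raster order, inference uses plain raster decoding, consistent with the zero additional inference cost noted above.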

Finding 2: Non-raster ordering catastrophically degrades causal AR at inference

Any non-raster ordering at inference degrades FID from 25.4 to 71–173 on a raster-trained LlamaGen model. Even the RAR model — which was trained with random orderings — fails at random inference (FID=176) after annealing. The causal attention pattern makes context position critical; the model must specialize for one ordering.

Finding 3: Training with a specific ordering recovers quality

When the model is trained from scratch with a non-raster ordering (diagonal, zigzag, col-major), FID is comparable to the raster baseline (24.97–26.35 vs 25.40). The ordering itself doesn't matter much; what matters is consistency between training and inference. A fixed random permutation is worse (44.74) because it offers no spatially regular context pattern for the model to learn.

Finding 4: The Body Mismatch Problem dominates in MAR

Across 11 training configurations, 7 ordering methods, and FID-50K evaluations, no learned ordering beats the random baseline. The best (semifast-8 = 2.94) loses 0.61 FID to random (2.33). Root cause: the MAR body was trained for 400 epochs with random masking, so any structured ordering creates OOD masking patterns. This is confirmed by the GT injection ablation: random ordering gives the lowest MSE (0.104) vs oracle (0.135). However, the multiscale heuristic (FID=2.30, zero training) beats random, suggesting compatible structures exist.

Finding 5: Oracle supervision is critical for stability

Without ViT oracle supervision, OrderHead scores either collapse (std → 0) or explode (std → 8+). With oracle (cls_k mode), score_std stabilizes at 1.0–2.0. The oracle provides a stable regression target that prevents mode collapse in the discrete ordering optimization.
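A simple monitor for the two failure modes is to track the standard deviation of OrderHead scores during training. The sketch below is an assumption about how such a check might look: the thresholds are illustrative, chosen from the ranges reported in this document (collapsed runs at 0.04 and 0.46, stable runs at 1.0–2.0, explosion at 8 and above), and are not values taken from the training code.

```python
import statistics


def diagnose_score_std(scores, collapse_below=0.5, explode_above=8.0):
    """Classify one batch of OrderHead scores by their standard deviation."""
    std = statistics.pstdev(scores)
    if std < collapse_below:
        return std, "collapsed"  # near-uniform scores carry no ordering signal
    if std > explode_above:
        return std, "exploding"  # unbounded score growth
    return std, "ok"


print(diagnose_score_std([0.0] * 16))         # (0.0, 'collapsed')
print(diagnose_score_std([-1.5, 1.5] * 8))    # (1.5, 'ok')
print(diagnose_score_std([-20.0, 20.0] * 8))  # (20.0, 'exploding')
```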

Finding 6: From-scratch joint training suffers chicken-and-egg collapse

Training OrderHead and the generation model jointly from random initialization leads to collapse: at epoch 0, both models are random, so the ordering signal is meaningless, and neither can bootstrap. The only successful OrderHead training starts from a pretrained, order-agnostic backbone.

Finding 7: Slow rescoring is the strongest positive signal

In MAR, slow rescoring (re-scoring at every generation step using current context) achieves the highest recognizability speed (AUC = 0.568), beating even oracle ordering (0.487) and random (0.484). This adaptive approach — where the ordering responds to what has been generated so far — is fundamentally different from static importance scoring, and suggests the real value is in dynamic, content-aware ordering rather than fixed importance maps.
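The difference between static scoring and slow rescoring can be sketched as a loop that re-scores all remaining positions after every generation step. Here `score_fn` is a hypothetical stand-in for the OrderHead conditioned on the current context, and the sketch reveals one token per step for simplicity, whereas MAR reveals several.

```python
def slow_rescoring_order(num_positions, score_fn):
    """Greedy dynamic ordering: re-score the remaining positions at every step.

    `score_fn(generated, remaining)` returns a dict mapping each remaining
    position to a score, given the positions generated so far.
    """
    generated = []
    remaining = set(range(num_positions))
    while remaining:
        candidates = sorted(remaining)
        scores = score_fn(generated, candidates)
        # Ties go to the lowest index because max() keeps the first maximum.
        best = max(candidates, key=lambda p: scores[p])
        generated.append(best)
        remaining.remove(best)
    return generated


def toy_score_fn(generated, remaining):
    """Toy context-dependent scorer: prefer positions near the last generated one."""
    anchor = generated[-1] if generated else 2  # start from the middle of 5 positions
    return {p: -abs(p - anchor) for p in remaining}


dynamic_order = slow_rescoring_order(5, toy_score_fn)
print(dynamic_order)  # [2, 1, 0, 3, 4]
```

A static importance map would call the scorer once and sort; the loop above changes its choices as the context grows, which is the property this finding highlights.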

The causal AR paradox: In causal AR, "first in sequence" = "predicted with least context." Putting important tokens first makes them harder to predict. This is the fundamental tension that makes learned ordering harder in causal AR than in masked AR (MAR).

6. Detailed Experiment Pages

The pages below contain detailed experiment descriptions, verified metrics, visualizations, and analysis, or, in the case of the new plan, the forward-looking roadmap:

New Plan: Latent Planning for Ideal Token Ordering →

LlamaGen Experiments →

MAR Experiments →