Learned Token Ordering for Visual Autoregressive Generation

Can we learn a better-than-raster generation order for discrete visual tokens? Investigating OrderHead-guided token ordering in both masked (MAR) and causal (LlamaGen) autoregressive image generation.

Minkyu Jeon | Princeton University | March – April 2026 | In Progress

Contents

  1. Motivation
  2. Background: MAR vs LlamaGen
  3. The OrderHead Approach
  4. Results Summary
  5. Key Findings
  6. Detailed Experiment Pages

1. Motivation

Standard autoregressive image generation models (LlamaGen, VQGAN+GPT) generate discrete visual tokens in a fixed raster-scan order (left-to-right, top-to-bottom). In language, left-to-right order follows semantic structure; for images, the raster order is arbitrary. There is no inherent reason why the top-left token should be generated before the tokens at the center of an object.

The hypothesis: if we generate class-discriminative (important) tokens first, they provide better context for generating the remaining tokens. For example, generating the dog's face before the background could give the model much richer context than generating background sky first.

Figure: Spatial coverage of generated context at step 32, raster order (left) vs random order (right). Raster provides 34.8% spatial neighbor coverage; random provides only 5-6%.

Research question: Can a learned token ordering (via an OrderHead module) improve image generation quality compared to the default raster-scan order?

2. Background: MAR vs LlamaGen

We investigate token ordering in two fundamentally different autoregressive architectures:

MAR (Masked Autoregressive)

Generates tokens by iteratively unmasking subsets. Uses bidirectional attention — every visible token attends to every other visible token. The generation order determines which tokens are revealed first as context. Inherently order-agnostic: trained with random masking.

Key property: "Important first" means those tokens become context for the rest.

LlamaGen (Causal Autoregressive)

Generates tokens one-by-one with causal (left-only) attention. Each token can only see previously generated tokens. Uses 2D RoPE for spatial position encoding independent of sequence order. Trained with teacher forcing on a fixed sequence order.

Key property: "First in sequence" means predicted with least context.

Property | MAR | LlamaGen
Attention | Bidirectional (all visible tokens) | Causal (left-context only)
Position encoding | Learned absolute | 2D RoPE (spatial)
Tokens per step | Multiple (cosine schedule) | One
Loss | Diffusion (continuous latents) | Cross-entropy (discrete VQ tokens)
Order agnosticism | Inherent (random masking) | Must be explicitly trained (RAR)
Ordering effect | "First" = best context provider | "First" = least context available
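
The "Tokens per step" row can be made concrete with a small sketch. The function below computes how many tokens a cosine unmasking schedule reveals at each step for a 256-token image. It is an illustration in the style of MaskGIT-like schedules, not the project's code: the function name, the clamping rule (reveal at least one token per step and keep at least one masked until the final step), and the default arguments are assumptions.

```python
import math


def cosine_unmask_schedule(seq_len: int = 256, num_iter: int = 64) -> list[int]:
    """Return the number of tokens revealed at each step (hypothetical helper)."""
    masked = seq_len  # every token starts masked
    revealed_per_step = []
    for step in range(num_iter):
        # Fraction of tokens that should still be masked after this step.
        ratio = math.cos(math.pi / 2.0 * (step + 1) / num_iter)
        target_masked = math.floor(seq_len * ratio)
        if step == num_iter - 1:
            target_masked = 0  # the final step reveals whatever is left
        else:
            # Assumed clamping: reveal at least one token, keep at least one masked.
            target_masked = max(1, min(masked - 1, target_masked))
        revealed_per_step.append(masked - target_masked)
        masked = target_masked
    return revealed_per_step


schedule = cosine_unmask_schedule(256, 64)
print(schedule[:4], schedule[-4:], sum(schedule))
```

Under these assumptions the schedule reveals a single token in the first step and several per step near the end, and setting `num_iter` equal to the sequence length degenerates to one token per step.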

3. The OrderHead Approach

Architecture

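The OrderHead is a small ViT-like transformer (3.5M params) that scores all spatial positions. High scores indicate tokens that should be generated first. It is trained alongside the main generation model using a combination of signals. The two that appear in this report are ViT oracle supervision (see Finding 5) and the STE gradient from the generation model's CE loss: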

Pipeline:

  GT tokens [B, 256, 8] ──┐
                          ↓ (GT routing during training)
  GPT hidden states [B, 256, D] ──→ OrderHead ──→ scores [B, 256]
                          ↓
                  STE binary mask
                          ↓
                  Reorder tokens (top-k first)
                          ↓
                  GPT forward with position_ids
                          ↓
                  CE loss → backward
                          ↓
                  STE gradient → OrderHead update

Binary-Split STE (k=64)

Rather than a full permutation, we use a binary split: the top-64 scored positions form the "context group" (generated first), and the remaining 192 form the "prediction group." Within each group, tokens follow raster order. The STE connects CE loss gradients back to OrderHead scores through a differentiable sigmoid approximation.
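
A minimal sketch of the binary split and of the straight-through idea, written in plain Python so the arithmetic is visible. It is not the project's implementation: the function names, the threshold (midpoint between the k-th and (k+1)-th score), and the temperature are assumptions, and scores are assumed to be distinct. The forward pass uses the hard 0/1 mask; the backward pass would use the derivative of the sigmoid surrogate in its place.

```python
import math


def binary_split_order(scores: list[float], k: int) -> list[int]:
    """Top-k scored positions first, then the rest; raster order inside each group."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(top)
    context = sorted(top)  # raster order within the context group
    prediction = [i for i in range(len(scores)) if i not in chosen]
    return context + prediction


def ste_mask(scores: list[float], k: int, temperature: float = 1.0):
    """Return (hard_mask, surrogate_grad) for a top-k split.

    hard_mask is what the forward pass uses. surrogate_grad is d(sigmoid)/d(score),
    which a straight-through estimator substitutes for the zero gradient of the
    hard mask. Threshold and temperature are illustrative assumptions.
    """
    ranked = sorted(scores, reverse=True)
    threshold = 0.5 * (ranked[k - 1] + ranked[k])  # requires 0 < k < len(scores)
    hard = [1.0 if s > threshold else 0.0 for s in scores]
    soft = [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]
    grad = [p * (1.0 - p) / temperature for p in soft]
    return hard, grad


toy_scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0]
print(binary_split_order(toy_scores, k=2))
print(ste_mask(toy_scores, k=2)[0])
```

With 256 positions and k=64, the same functions give a 64-token context group followed by a 192-token prediction group.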

4. Results Summary

19.61: RAR FID (IN-100, 300ep fair), best LlamaGen
25.4: Raster baseline FID (IN-100)
5.32: Raster baseline FID (IN-1K)
6.55: Best MAR+OH FID-50K (IN-1K)

LlamaGen — RAR Curriculum + Trained Orderings (IN-100, 5K samples)

Comprehensive comparison of all LlamaGen training approaches:

Training Method | Inference | FID | sFID | Notes
Raster retrain (300ep 1K + 300ep 100 FT) | Raster | 17.57 | 67.18 | ~600ep effective; unfair baseline
RAR (300ep from scratch) | Raster | 19.61 | 65.83 | 23% better than raster (fair, same epochs)
RAR (400ep from scratch) | Raster | 18.94 | 66.25 | 33% more epochs than baselines
Diagonal (300ep from scratch) | Diagonal | 24.97 | 65.99 | Train & eval in diagonal order
Raster 300ep from scratch | Raster | 25.40 | 66.33 | Fair raster baseline (matches pretrained)
Raster baseline (original ckpt) | Raster | 25.40 | 66.33 | Standard left-to-right, top-to-bottom
Col-major (300ep) | Col-major | 25.95 | 66.26 |
Zigzag (300ep) | Zigzag | 26.35 | 66.18 |
Spiral (300ep) | Spiral | 30.94 | 66.42 |
Random fixed (300ep) | Random | 44.74 | 66.78 | One fixed permutation
Pure random AR (300ep, ta_embed) | Raster | 151.10 | 110.44 | No annealing; raster is OOD
RAR (400ep from scratch) | Random0 | 176.04 | 123.49 | After annealing, random is OOD

LlamaGen — Heuristic Orderings on Raster-Trained Model (IN-100)

Inference-only ordering changes on the raster baseline (no retraining):

Ordering | FID | sFID | Notes
Raster (baseline) | 25.40 | 66.33 | Standard left-to-right, top-to-bottom
Zigzag | 71.69 | 82.08 | Best non-raster heuristic
Diagonal | 117.58 | 93.89 |
Stride-2 | 128.18 | 93.83 |
Random | 173.08 | 119.23 | Fixed random permutation (seed=0)
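
For readers unfamiliar with the ordering names, the sketch below builds several of them as permutations of raster indices on an H×W token grid. The exact definitions used in the experiments are not given here, so these are assumptions: zigzag is taken to mean alternating row direction, diagonal to mean anti-diagonals from the top-left corner, and the random permutation comes from Python's own RNG, which will not reproduce the project's seed=0 permutation. Spiral and stride-2 are omitted because their definitions are more ambiguous.

```python
import random


def raster(h: int, w: int) -> list[int]:
    return [r * w + c for r in range(h) for c in range(w)]


def col_major(h: int, w: int) -> list[int]:
    return [r * w + c for c in range(w) for r in range(h)]


def zigzag(h: int, w: int) -> list[int]:
    # Assumed definition: left-to-right on even rows, right-to-left on odd rows.
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order


def diagonal(h: int, w: int) -> list[int]:
    # Assumed definition: anti-diagonals r + c = d, visited top to bottom.
    return [r * w + (d - r)
            for d in range(h + w - 1)
            for r in range(h)
            if 0 <= d - r < w]


def fixed_random(h: int, w: int, seed: int = 0) -> list[int]:
    order = raster(h, w)
    random.Random(seed).shuffle(order)  # one fixed permutation per seed
    return order


for name, fn in [("raster", raster), ("col-major", col_major),
                 ("zigzag", zigzag), ("diagonal", diagonal)]:
    print(name, fn(3, 3))
```

Each function returns a list whose i-th entry is the raster index of the token generated at step i, so the same list can be used both to reorder the token sequence and to supply spatial position ids.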

LlamaGen — OrderHead Results (IN-100, 5K samples)

All OrderHead experiments failed to improve over the raster baseline. FID is measured with OH-guided inference (semifast-8, 8 rescoring candidates):

Experiment | OH FID (semifast-8) | score_std | Failure Mode
Phase 1 (frozen GPT, oracle) | 155.58 | 1.6 | GPT never trained on non-raster orderings
Phase 2 direct (STE + oracle) | 140.89 | | GPT body too slow to adapt from raster
OH STE finetune (300ep) | 70.63 | 14–17 | Score explosion: scores grow unboundedly
OH from scratch (curriculum) | 147.92 | 0.46 | Score collapse + GPT body never converged
OH short finetune (20ep, w/ regularization) | 136.47 | 0.04 | Over-regularized: scores collapsed to uniform
Raster baseline (no OH) | 25.40 | |

Root cause: All failures trace back to the same issue: attaching an OrderHead to a raster-trained GPT body. The GPT was never exposed to non-raster orderings during pretraining, so it cannot assign meaningful probabilities to arbitrarily-ordered tokens. Joint training either destabilizes the GPT body (CE > 7.3 vs baseline 6.8) or collapses the OH scores. The next step is to use a RAR-trained (order-agnostic) backbone.

MAR — OrderHead Results (IN-1K, FID-50K, verified from eval logs)

Model | FID | IS | Notes
MAR-B pretrained baseline | 2.30 | 270.4 | Official checkpoint
Random-order retrained (5ep warmup) | 2.31 | 253.0 | Comparable to baseline
Decoder finetune k=1 | 3.85 | 208.0 |
OH full finetune ep5 | 5.75 | 220.2 | Degrades monotonically: ep10=6.94, ep15=8.16, ep20=9.58
OH oracle (from random-order) ep9 | 6.55 | 217.0 | Best OrderHead result; stable score_std~1.6
OH cosine iter=256 rescore8 | 7.08 | 209.5 | AR schedule ablation
OH pure STE (from random-order) ep9 | 10.95 | 180.5 | STE-only, no oracle; score_std diverges
OH cosine iter=64 rescore8 | 13.79 | 169.3 | Fewer AR steps degrades quality

All MAR FID-50K numbers verified against Slurm evaluation logs (jobs 5864880, 5884168, 5952087, 5952436, 5957840).

5. Key Findings

Finding 1: RAR curriculum beats pure raster training

The RAR curriculum (random→anneal→raster) achieves FID=19.61 on IN-100 in a fair 300-epoch comparison, improving 23% over the raster baseline (25.40) trained with identical settings. IS also improves 26% (61.82 vs 49.18). The key insight: random-order training acts as sequence-space data augmentation, forcing the model to develop more general spatial attention patterns. After annealing back to raster, these robust representations translate to better generation quality at zero additional inference cost. IN-1K evaluation pending (job running, ep 129/300).
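
The random→anneal→raster curriculum can be sketched as a per-epoch probability of training on a random permutation. The snippet assumes a linear anneal between two epochs; the function names and the epoch boundaries (200 and 300 in the example) are hypothetical and are not taken from the experiments above.

```python
import random


def random_order_prob(epoch: int, anneal_start: int, anneal_end: int) -> float:
    """Probability of using a random order at this epoch (assumed linear anneal)."""
    if epoch < anneal_start:
        return 1.0  # pure random-order phase
    if epoch >= anneal_end:
        return 0.0  # pure raster phase
    return 1.0 - (epoch - anneal_start) / (anneal_end - anneal_start)


def sample_order(epoch: int, seq_len: int, anneal_start: int, anneal_end: int,
                 rng: random.Random) -> list[int]:
    """Return the generation order for one training sample."""
    order = list(range(seq_len))  # raster order
    if rng.random() < random_order_prob(epoch, anneal_start, anneal_end):
        rng.shuffle(order)  # a fresh random permutation for this sample
    return order


rng = random.Random(0)
for epoch in (0, 250, 300):
    print(epoch, random_order_prob(epoch, 200, 300))
print(sample_order(300, 16, 200, 300, rng))
```

Because the probability reaches zero at the end of training, the final model is evaluated in raster order only, which is consistent with the report's observation that random-order inference is out of distribution after annealing.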

Finding 2: Non-raster ordering catastrophically degrades causal AR at inference

Any non-raster ordering at inference degrades FID from 25.4 to 71–173 on a raster-trained LlamaGen model. Even the RAR model — which was trained with random orderings — fails at random inference (FID=176) after annealing. The causal attention pattern makes context position critical; the model must specialize for one ordering.

Finding 3: Training with a specific ordering recovers quality

When the model is trained from scratch with a fixed non-raster ordering (diagonal, zigzag, col-major), FID is comparable to the raster baseline (24.97–26.35 vs 25.40); spiral is somewhat worse (30.94). Among these spatially structured orderings, the specific choice matters less than consistency between training and inference. A fixed random permutation is clearly worse (44.74): the permutation itself is constant, but it gives the model no spatially coherent context pattern to learn.

Finding 4: The Body Mismatch Problem dominates in MAR

Across 11 training configurations, 7 ordering methods, and 50K-sample evaluations, no learned ordering beats the random baseline. The best (semifast-8 = 2.94) loses 0.61 FID to random (2.33). Root cause: the MAR body was trained for 400 epochs with random masking, so any structured ordering creates OOD masking patterns. This is confirmed by the GT injection ablation, where random ordering gives the lowest MSE (0.104) vs oracle (0.135). However, the multiscale heuristic (FID=2.30, zero training) beats random, suggesting compatible structures exist.

Finding 5: Oracle supervision is critical for stability

Without ViT oracle supervision, OrderHead scores either collapse (std → 0) or explode (std → 8+). With oracle (cls_k mode), score_std stabilizes at 1.0–2.0. The oracle provides a stable regression target that prevents mode collapse in the discrete ordering optimization.
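
A simple way to watch for the two failure modes is to track the standard deviation of the scores and flag values outside a healthy band. The sketch below assumes score_std is computed over the 256 positions of one image; the function name and both thresholds are assumptions. The upper threshold follows the "8+" figure above, and the lower one is chosen so that the collapse cases reported in Section 4 (0.04 and 0.46) fall below it.

```python
import statistics


def score_health(scores: list[float],
                 collapse_below: float = 0.5,
                 explode_above: float = 8.0) -> tuple[float, str]:
    """Classify one score map by its standard deviation (illustrative thresholds)."""
    std = statistics.pstdev(scores)  # population std over positions
    if std < collapse_below:
        return std, "collapsed"
    if std > explode_above:
        return std, "exploded"
    return std, "ok"


print(score_health([1.5, -1.5] * 128))
print(score_health([0.01, -0.01] * 128))
print(score_health([15.0, -15.0] * 128))
```

Logging this value every few hundred steps would make a drift toward either extreme visible before it shows up in FID.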

Finding 6: From-scratch joint training suffers chicken-and-egg collapse

Training OrderHead and the generation model jointly from random initialization leads to collapse: at epoch 0, both models are random, so the ordering signal is meaningless, and neither can bootstrap. The only successful OrderHead training starts from a pretrained, order-agnostic backbone.

Finding 7: Slow rescoring is the strongest positive signal

In MAR, slow rescoring (re-scoring at every generation step using current context) achieves the highest recognizability speed (AUC = 0.568), beating even oracle ordering (0.487) and random (0.484). This adaptive approach — where the ordering responds to what has been generated so far — is fundamentally different from static importance scoring, and suggests the real value is in dynamic, content-aware ordering rather than fixed importance maps.
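
The difference between static and dynamic ordering can be sketched as a loop that re-scores the remaining positions after every reveal step. This is only the control flow: `score_fn` stands in for the OrderHead, and `toy_score_fn` is a hypothetical scorer on a 1D strip that prefers positions near the already revealed ones. None of these names come from the project.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[Sequence[int], Sequence[int]], Sequence[float]]


def rescoring_order(num_tokens: int, per_step: int, score_fn: ScoreFn) -> list[int]:
    """Build a generation order by re-scoring remaining positions at every step."""
    revealed: list[int] = []
    remaining = list(range(num_tokens))
    while remaining:
        scores = score_fn(revealed, remaining)  # depends on the current context
        ranked = sorted(zip(scores, remaining), key=lambda t: -t[0])
        chosen = [pos for _, pos in ranked[:per_step]]
        revealed.extend(chosen)
        picked = set(chosen)
        remaining = [p for p in remaining if p not in picked]
    return revealed


def toy_score_fn(revealed: Sequence[int], remaining: Sequence[int]) -> list[float]:
    """Hypothetical scorer: start at the center, then grow outward from the context."""
    if not revealed:
        center = (max(remaining) + min(remaining)) / 2.0
        return [-abs(p - center) for p in remaining]
    return [-float(min(abs(p - q) for q in revealed)) for p in remaining]


print(rescoring_order(8, 2, toy_score_fn))
```

A static importance map corresponds to a `score_fn` that ignores `revealed`; the rescoring variant differs only in that its scores may change as context accumulates.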

The causal AR paradox: In causal AR, "first in sequence" = "predicted with least context." Putting important tokens first makes them harder to predict. This is the fundamental tension that makes learned ordering harder in causal AR than in masked AR (MAR).
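
The paradox is easy to quantify: under causal attention, a token at sequence rank r is predicted from exactly r earlier tokens. The toy example below uses a 4×4 grid and treats the four center positions as the "important" ones; the grid size, the choice of positions, and the helper names are illustrative assumptions.

```python
def context_sizes(order: list[int]) -> list[int]:
    """Map each spatial position to the number of tokens generated before it."""
    sizes = [0] * len(order)
    for rank, pos in enumerate(order):
        sizes[pos] = rank
    return sizes


def mean_context(order: list[int], positions: list[int]) -> float:
    sizes = context_sizes(order)
    return sum(sizes[p] for p in positions) / len(positions)


raster = list(range(16))                  # 4x4 grid in raster order
important = [5, 6, 9, 10]                 # assumed: the four center positions
important_first = important + [p for p in raster if p not in important]

print(mean_context(raster, important))           # 7.5
print(mean_context(important_first, important))  # 1.5
```

Moving the important tokens to the front cuts their average context from 7.5 tokens to 1.5 in this toy case, which is the cost that any gain from better downstream context has to outweigh.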

6. Detailed Experiment Pages

Each page contains comprehensive experiment details, verified metrics, visualizations, analysis, or the forward-looking roadmap:

New Plan: Latent Planning for Ideal Token Ordering →

LlamaGen-first roadmap for an OrderHead planner trained on future-context value rather than confidence, with Path Planning as the base framework, latent reasoning in the planner, adaptive reveal budget, and concrete training algorithms.

LlamaGen Experiments →

Heuristic orderings, trained orderings, RAR random-order training, OrderHead Phase 1/2, from-scratch curriculum. 30+ experiments with verified FID on IN-100 and IN-1K.

MAR Experiments →

OrderHead on Masked Autoregressive models. Oracle vs STE-only, from-scratch vs finetuned, IN-100 and IN-1K evaluations. 20+ experiments with FID-50K verified from Slurm logs.