LlamaGen Ordering Experiments

1. Model & Setup
2. Heuristic Orderings (Inference-Only)
3. Trained Orderings (From Scratch, IN-100)
4. Random-Order Training (v1, No Fixes)
5. RAR: Random Ordering with Annealing to Raster
6. OrderHead Experiments
7. Method Comparison & Analysis
8. Why Random Order Fails in Causal AR (LlamaGen vs MAR vs RAR)
9. Critical Bugs Found & Fixed

1. Model & Setup

All experiments use the LlamaGen GPT-B architecture with VQ-VAE tokenized ImageNet:

GPT-B Architecture

110M params, 12 transformer layers, 12 attention heads, dim=768. Causal attention with 2D RoPE spatial encoding. Block size = 256 (16×16 spatial grid). VQ vocab = 16,384.

Datasets

IN-100: ImageNet-100 subset, ~127K training images, 5K FID evaluation samples.
IN-1K: Full ImageNet-1K, 1.28M images, with pretrained c2i_B_256.pt checkpoint (300ep).

Shared Training Hyperparameters

All from-scratch models: batch=256, lr=1e-4, weight_decay=5e-2, max_grad_norm=1.0, dropout=0.1, token_dropout=0.1, bf16 mixed precision, 4×H100 GPUs, AdamW optimizer.

Evaluation Protocol

FID-5K and sFID against validation reference NPZ. Sampling: cfg_scale=2.0, temperature=1.0, top-k=0. All methods infer in raster order (standard left-to-right, top-to-bottom) unless noted otherwise. OrderHead uses semifast-8 (8 rescoring candidates).

2. Heuristic Orderings (Inference-Only)

Method

Take the raster-trained GPT-B baseline and evaluate it at inference with different spatial orderings. The model weights are frozen: only the token generation order changes. Sampling uses the O(L²) full-forward sampler in sample_c2i_heuristic_ddp.py, which recomputes attention at each step and passes position_ids so that every token gets the correct 2D RoPE position.

Why O(L²)?

Standard KV-cache sampling assumes a fixed (raster) order, so each new token simply appends to the cache. Non-raster orderings break this assumption: the token at spatial position (5, 3) might be generated at step 10, which requires the RoPE encoding for position (5, 3) together with a causal mask that exposes only the 9 previously generated positions. The full-forward sampler therefore recomputes the whole prefix from scratch at each step, so it processes O(L²) tokens per image, versus O(L) for the KV-cache sampler.
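To make the cost gap concrete, the sketch below counts how many token positions each sampler pushes through the transformer for one image. It assumes one forward pass per generated token and ignores the class-conditioning token; the function names are illustrative and are not taken from the repository.

```python
def full_forward_token_evals(seq_len: int) -> int:
    """Tokens processed when the whole prefix is re-run at every step."""
    # Step t (1-based) re-encodes a prefix of length t.
    return sum(range(1, seq_len + 1))


def kv_cache_token_evals(seq_len: int) -> int:
    """Tokens processed when past keys/values are cached."""
    # Each step only encodes the single newest token.
    return seq_len


L = 256  # 16x16 token grid
full = full_forward_token_evals(L)
cached = kv_cache_token_evals(L)
print(full, cached, full / cached)  # 32896 256 128.5
```

Under these assumptions the full-forward sampler processes about 128 times as many tokens per image as the cached sampler, before accounting for the quadratic cost of attention itself.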

ImageNet-100 (5K samples, cfg=2.0)

Ordering	FID	sFID	Source
Raster (baseline)	25.40	66.33	samples_fid_baseline_in100
Raster (heuristic sampler)	25.63	66.15	samples_fid_heuristic_raster_in100
Zigzag	71.69	82.08	samples_fid_heuristic_zigzag_in100
Diagonal	117.58	93.89	samples_fid_heuristic_diagonal_in100
Stride-2	128.18	93.83	samples_fid_heuristic_stride2_in100
Spiral	132.25	97.06	samples_fid_heuristic_spiral_in100
Column-major	139.79	100.69	samples_fid_heuristic_col_major_in100
Center-spiral	153.53	111.03	samples_fid_heuristic_center_spiral_in100
Reverse	156.18	109.45	samples_fid_heuristic_reverse_in100
Random (seed=0)	173.08	119.23	samples_fid_heuristic_random0_in100

Raster baseline from standard KV-cache sampler; all others from O(L²) heuristic sampler. The ~0.2 FID gap between the two raster results is sampling variance.
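The exact definitions of the orderings in sample_c2i_heuristic_ddp.py are not given here, so the following is only a sketch of how several of them could be built as permutations of the 256 flat grid indices (row * 16 + col). The "snake" reading of zigzag and the top-to-bottom walk along anti-diagonals are assumptions, as are all function names.

```python
GRID = 16  # 16x16 token grid, 256 positions


def raster(grid: int = GRID) -> list[int]:
    # Left-to-right, top-to-bottom.
    return [r * grid + c for r in range(grid) for c in range(grid)]


def column_major(grid: int = GRID) -> list[int]:
    # Top-to-bottom, left-to-right.
    return [r * grid + c for c in range(grid) for r in range(grid)]


def zigzag(grid: int = GRID) -> list[int]:
    # Assumed "snake" scan: even rows left-to-right, odd rows right-to-left.
    order = []
    for r in range(grid):
        cols = range(grid) if r % 2 == 0 else range(grid - 1, -1, -1)
        order.extend(r * grid + c for c in cols)
    return order


def diagonal(grid: int = GRID) -> list[int]:
    # Anti-diagonals r + c = 0, 1, ..., each walked from top to bottom.
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    cells.sort(key=lambda rc: (rc[0] + rc[1], rc[0]))
    return [r * grid + c for r, c in cells]


def reverse(grid: int = GRID) -> list[int]:
    # Raster order read backwards.
    return raster(grid)[::-1]
```

Each function returns a list that can serve both as the generation order and as the position_ids fed to the 2D RoPE, which is the only ordering-dependent input the frozen model receives.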

ImageNet-1K (5K samples, cfg=2.0, pretrained c2i_B_256.pt)

Ordering	FID	sFID	Source
Raster	5.32	7.46	samples_fid_heuristic_raster_1k
Stride-2	119.62	45.38	samples_fid_heuristic_stride2_1k
Center-spiral	132.08	53.35	samples_fid_heuristic_center_spiral_1k
Random (seed=0)	151.23	66.11	samples_fid_heuristic_random0_1k
Reverse	154.35	68.76	samples_fid_heuristic_reverse_1k

Spatial neighbor coverage comparison: at step 32, raster order has 34.8% of spatial neighbors already generated, while random order has only 5–6%. This is consistent with raster's advantage in causal AR: each prediction draws on more context from immediate neighbors.
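The coverage statistic can be illustrated with a small sketch. The definition used here (fraction of the in-grid 8-neighbors of the current token that were generated earlier) is an assumption, so the values it prints are not expected to reproduce the 34.8% and 5–6% figures quoted above; it only shows the qualitative gap between raster and a random permutation early in generation.

```python
import random

GRID = 16


def neighbor_coverage(order: list[int], step: int, grid: int = GRID) -> float:
    """Fraction of in-grid 8-neighbors of order[step] generated before it."""
    seen = set(order[:step])
    r, c = divmod(order[step], grid)
    neighbors = [
        (r + dr) * grid + (c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < grid and 0 <= c + dc < grid
    ]
    return sum(n in seen for n in neighbors) / len(neighbors)


def mean_coverage(order: list[int], steps: range) -> float:
    """Average coverage over a range of generation steps."""
    return sum(neighbor_coverage(order, t) for t in steps) / len(steps)


raster_order = list(range(GRID * GRID))
random_order = raster_order[:]
random.Random(0).shuffle(random_order)  # seed chosen for illustration only

early = range(1, 64)  # first quarter of the sequence
print(mean_coverage(raster_order, early), mean_coverage(random_order, early))
```

Averaged over the whole sequence the two orderings look similar under this definition, because every adjacent pair is "covered" exactly once; the difference is concentrated in the early steps, which is where the image layout is decided.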

Takeaway: The raster-trained GPT is tightly coupled to raster context patterns. Every non-raster ordering tested at inference degrades FID, by roughly 3×–7× on IN-100 (71.69–173.08 vs 25.40) and 22×–29× on IN-1K (119.62–154.35 vs 5.32). The 2D RoPE correctly encodes spatial position, but the set of positions that appears in the causal left-context is fundamentally different for each ordering.

3. Trained Orderings (From Scratch, IN-100)

Method

Train GPT-B from scratch on IN-100 with a fixed spatial ordering. Each model always sees tokens in the same deterministic order (e.g., always diagonal, always column-major). At inference, the same ordering is used with the O(L²) heuristic sampler.

Training script: train_c2i.py (standard raster) or train_c2i_random_order.py with fixed permutation
Epochs: 300
Batch / LR: 256 / 1e-4
Inference: Same ordering as training, O(L²) heuristic sampler, cfg=2.0

Training Order	FID	sFID	Source
Full raster retrain (fine-tuned from pretrained)	17.57	67.18	samples_fid_raster_ft_full_in100
Diagonal	24.97	65.99	samples_fid_train_diagonal_in100
Raster (original checkpoint)	25.40	66.33	samples_fid_baseline_in100
Column-major	25.95	66.26	samples_fid_train_col_major_in100
Zigzag	26.35	66.18	samples_fid_train_zigzag_in100
Spiral	30.94	66.42	samples_fid_train_spiral_in100
Random (one fixed permutation)	44.74	66.78	samples_fid_train_random_fixed_in100

Note: "Full raster retrain" is fine-tuned from the IN-1K pretrained checkpoint (300ep IN-1K + 300ep IN-100 = ~600ep effective), making it an unfair baseline. All others are trained from scratch for 300 epochs.

Partial Fine-Tuning (raster order, from pretrained c2i_B_256.pt)

Layers Fine-Tuned	FID	sFID	Source
All layers	17.57	67.18	samples_fid_raster_ft_full_in100
Last 4 blocks	58.68	67.64	samples_fid_raster_ft_partial4_in100
Last 2 blocks	92.69	70.41	samples_fid_raster_ft_partial2_in100
Last 1 block	98.62	71.92	samples_fid_raster_ft_partial1_in100

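Takeaway: When the model is trained and sampled with the same ordering, the structured orderings come close to raster: diagonal, column-major, and zigzag land at 24.97–26.35, and spiral is somewhat worse at 30.94. The ordering itself matters less than training/inference consistency. A random fixed permutation is clearly worse (44.74), plausibly because its spatial context pattern is less structured. Diagonal has the lowest FID (24.97 vs 25.40 for raster), but the 0.43 gap is of the same order as the ~0.2 sampling variance noted in Section 2, so it should not be read as a real advantage.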

4. Random-Order Training (v1, No Fixes)

Method

First attempt at random-order training in LlamaGen: every training step samples a new random permutation per image. Unlike a fixed ordering (Section 3), the model sees a different token order every step. This is inspired by MAR's random masking, but applied to causal AR.

Training script: train_c2i_random_order.py (before RAR fixes)
Key difference: No target-aware positional embedding, no annealing schedule
Epochs: 300 on IN-100
Inference: Both raster and random tested

Inference Order	FID	sFID	Source
Raster	159.41	115.89	job 6474613
Random (seed=0)	163.98	121.23	job 6474613

Root cause analysis: Two critical missing components identified by comparing with the RAR paper (arXiv 2411.00776): (1) No target-aware positional embedding — the model cannot distinguish permutations that share a prefix but differ in the next token (the "ambiguous target problem"). (2) No annealing to raster — raster inference is out-of-distribution for a purely random-trained model. Training loss plateaued at 8.45 (vs 6.30 for raster baseline). Generated images show realistic textures but no object structure.

5. RAR: Random Ordering with Annealing to Raster

Overview

RAR (randomized autoregressive modeling, arXiv 2411.00776; summarized here as random ordering with annealing to raster) is a training curriculum that first trains with random token orderings and then gradually anneals back to raster order. The final model generates in standard raster order at zero additional inference cost: no architecture change and no extra sampling complexity.

Training Details

Training script: train_c2i_random_order.py
Base model: GPT-B (110M), trained from scratch (no pretrained init)
Optimizer: AdamW, lr=1e-4, weight_decay=5e-2, fused=True
Batch size: 256 global (64 per GPU × 4 GPUs)
Grad clipping: max_grad_norm=1.0
Precision: bf16 mixed precision
Regularization: dropout=0.1, token_dropout=0.1
Hardware: 4× H100 (1 node)
Steps/epoch: ~494 (127K samples / batch 256)

Two Critical Components

1. Target-Aware Positional Embedding (ta_pos_table)

A learnable nn.Embedding(256, 768) table. Given a permutation perm, the input token at sequence index t (spatial position perm[t]) receives the table entry for perm[t+1], the spatial position of the token it has to predict next.

Why it's needed: Without it, two different permutations that share a prefix but differ in the next token produce identical inputs to the model. The model cannot distinguish “predict position (3,5) next” from “predict position (7,2) next” given the same context — the ambiguous target problem. This makes optimization ill-posed.

Implementation: Zero-initialized (starts as a no-op). Added to GPT's token embeddings during the random and annealing phases. Set to None during the pure raster phase and at inference, since raster order is unambiguous (the next position is fully determined by the current one). Included in the AdamW optimizer with the same LR and weight decay as GPT.
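The ambiguity and its resolution can be shown on a toy 2×2 image. This is a minimal sketch rather than the training code: tuples stand in for embeddings, the third tuple element stands in for adding ta_pos_table[next_pos], and build_inputs is a hypothetical helper.

```python
def build_inputs(tokens, perm, target_aware):
    """Return the per-step model inputs for one permuted sequence.

    tokens: dict mapping spatial position -> VQ token id
    perm:   list of spatial positions in generation order
    """
    inputs = []
    for t in range(len(perm) - 1):
        pos, next_pos = perm[t], perm[t + 1]
        # Without the target-aware term, the input carries only the
        # current token and its own spatial position.
        item = (tokens[pos], pos)
        if target_aware:
            # Stand-in for adding ta_pos_table[next_pos] to the embedding.
            item += (next_pos,)
        inputs.append(item)
    return inputs


tokens = {0: 11, 1: 22, 2: 33, 3: 44}  # toy 2x2 image
perm_a = [2, 0, 1, 3]  # after positions (2, 0), predict position 1
perm_b = [2, 0, 3, 1]  # after positions (2, 0), predict position 3

plain_a = build_inputs(tokens, perm_a, target_aware=False)[:2]
plain_b = build_inputs(tokens, perm_b, target_aware=False)[:2]
aware_a = build_inputs(tokens, perm_a, target_aware=True)[:2]
aware_b = build_inputs(tokens, perm_b, target_aware=True)[:2]

print(plain_a == plain_b)  # same inputs, yet the targets differ (22 vs 44)
print(aware_a == aware_b)  # inputs now differ at the second step
```

Without the target-aware term the two prefixes are indistinguishable although they require different predictions; with it, the second input differs and the target is well defined.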

2. Annealing Schedule

A random_ratio controls the probability that each sample in a batch uses a random permutation vs. raster order. The schedule has three phases:

Random phase (epochs 0–150): random_ratio=1.0. Every step draws a fresh random permutation per image. Model learns to handle arbitrary context subsets.
Annealing phase (epochs 150–225): random_ratio linearly decays from 1.0 to 0.0. Each sample independently draws random vs. raster. ta_embed active when random.
Raster phase (epochs 225–300): random_ratio=0.0. Pure raster training, ta_embed=None. Model specializes for raster inference.

For the 300ep schedule: 50% random, 25% anneal, 25% raster. The original RAR paper uses 400ep (50/25/25 split: 200+100+100).
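The three-phase schedule can be written as a piecewise-linear function of the epoch. The sketch below assumes the boundaries of the 300-epoch run (150 and 225) and treats the per-sample draw as a single Bernoulli trial; function names and the handling of the exact boundary epochs are assumptions.

```python
import random


def random_ratio(epoch: float, random_end: float = 150, anneal_end: float = 225) -> float:
    """Probability that a sample is trained with a random permutation."""
    if epoch < random_end:
        return 1.0  # random phase
    if epoch >= anneal_end:
        return 0.0  # raster phase
    # Annealing phase: linear decay from 1.0 to 0.0.
    return 1.0 - (epoch - random_end) / (anneal_end - random_end)


def sample_order(epoch: float, rng: random.Random, seq_len: int = 256) -> list[int]:
    """Draw one generation order for a single training sample."""
    order = list(range(seq_len))  # raster order
    if rng.random() < random_ratio(epoch):
        rng.shuffle(order)  # fresh random permutation
    return order


for epoch in (0, 150, 187.5, 225, 300):
    print(epoch, random_ratio(epoch))
```

For the 400-epoch schedule the same function would be called with random_end=200 and anneal_end=300.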

Inference

The trained RAR model is sampled using standard raster inference with the normal KV-cache sampler (sample_c2i_ddp.py). No O(L²) recomputation needed. No ta_embed. No special handling. The model outputs tokens left-to-right, top-to-bottom, exactly like a raster-trained model. Zero additional inference cost.

Why Does RAR Improve Over Pure Raster?

The random phase acts as sequence-space data augmentation. During random-order training, the model must predict each token given an arbitrary subset of spatial context — not just the fixed left-context of raster. This forces attention heads to develop more general spatial reasoning: some heads become spatial proximity detectors, others become texture/color consistency checkers, rather than all specializing for left-context patterns.

After annealing, the model specializes for raster but retains these robust representations. Evidence: the RAR model's final raster-phase loss (7.11) is slightly higher than that of pure raster training (~6.8), yet its FID is better (19.61 vs 25.40). A higher training loss combined with a better FID suggests that the learned representations generalize better despite a marginally worse per-token prediction loss. This is analogous to dropout, which raises training loss but improves generalization.

Results: Fair Comparison (300 epochs from scratch, IN-100)

Model	Schedule	FID	IS	Prec.	Recall
Raster 300ep from scratch	300ep raster	25.40	49.18	0.672	0.471
RAR 300ep from scratch	150 random + 75 anneal + 75 raster	19.61	61.82	0.684	0.531

RAR improves FID by 23% (19.61 vs 25.40) with the exact same epoch count, data, and hyperparameters. IS improves 26% (61.82 vs 49.18), precision +1.8%, recall +12.7%. This confirms that the RAR curriculum genuinely produces a better model, not just a longer-trained one. Both models infer in standard raster order.
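The percentages quoted above are relative changes with respect to the raster baseline. A short check, using only the numbers from the table (relative_change is an illustrative helper):

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed relative change from baseline to new, in percent."""
    return 100.0 * (new - baseline) / baseline


metrics = {
    "FID": (25.40, 19.61),        # lower is better
    "IS": (49.18, 61.82),         # higher is better
    "Precision": (0.672, 0.684),
    "Recall": (0.471, 0.531),
}
for name, (raster, rar) in metrics.items():
    print(f"{name}: {relative_change(raster, rar):+.1f}%")
```

Note that the precision and recall figures are relative changes, not percentage-point differences.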

Ablations and Failed Variants

Model	Epochs	Inference	FID	sFID	Notes
RAR 400ep from scratch	400	Raster	18.94	66.25	33% more epochs than baselines
RAR 400ep from scratch	400	Random0	176.04	123.49	Model is NOT order-agnostic after annealing
Pure random AR (no anneal)	300	Raster	151.10	110.44	Annealing is essential
Pure random AR (no anneal)	300	Random0	165.00	115.81	Random fails both ways
Random-order v1 (no ta_embed)	300	Raster	159.41	115.89	ta_embed is essential

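Key ablation takeaways: (1) After annealing, random inference fails catastrophically (FID=176): the model is fully committed to raster. (2) Without annealing, both raster and random inference fail (FID 151–165), even with ta_embed. (3) Random-order v1, which lacks both ta_embed and annealing, fails similarly (159). Annealing is therefore clearly necessary and ta_embed alone is not sufficient; the necessity of ta_embed is not isolated by these runs, since no variant combines annealing with a missing ta_embed.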

IN-1K (300 epochs from scratch)

Model	Schedule	FID	Status
Raster 300ep (official c2i_B_256.pt)	300ep raster	5.32	Completed
RAR 300ep from scratch	150 random + 75 anneal + 75 raster	—	Running ep 129/300

All Baselines for Context

Model	Training	FID	Notes
Raster baseline (c2i_B_256.pt)	300ep IN-1K pretrained, 0ep IN-100	25.40	Not trained on IN-100 at all
Raster 300ep from scratch	300ep from scratch on IN-100	25.40	Fair raster baseline (matches pretrained)
RAR 300ep from scratch	300ep from scratch on IN-100	19.61	Fair comparison, 23% better FID
RAR 400ep	400ep from scratch on IN-100	18.94	33% more epochs than baselines
Raster retrain	300ep IN-1K + 300ep IN-100 FT	17.57	~600ep effective, unfair baseline

Raster 300ep from scratch (25.40) matches the IN-1K pretrained checkpoint evaluated on IN-100 (25.40) — a good sanity check that IN-100 training converges to the same quality.

6. OrderHead Experiments

Overview

The OrderHead is a lightweight neural network that learns the token generation order, rather than using a fixed heuristic (raster, diagonal, etc.) or random ordering (RAR). It takes the current state of generation and outputs a score per spatial position; tokens with higher scores are generated first. The goal is to discover orderings that produce better images than any fixed pattern.

OrderHead Architecture

ScoredOrdering (3.46M params):

Input projection: Linear(768 → 256) + LayerNorm (projects GPT hidden states)
GT input projection: Linear(8 → 256) + LayerNorm (projects VQ codebook entries during curriculum warmup)
Register tokens: 4 learnable tokens prepended to the sequence (discarded after processing)
Positional embedding: Learnable, shape [1, 260, 256] (4 registers + 256 spatial positions)
Transformer: 4 layers, hidden_dim=256, 8 heads, MLP ratio=4.0, qkv_bias=True (timm Block)
Output head: Linear(256 → 128) → GELU → Linear(128 → 1) → squeeze
Output: scores [B, 256] — higher score = generate first
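The 3.46M figure can be cross-checked by counting parameters from the component list above. The sketch assumes bias terms on every Linear layer, affine LayerNorms, a standard pre-norm block (two LayerNorms, attention, MLP), and no final norm after the transformer; these details are assumptions, not confirmed properties of ScoredOrdering.

```python
def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(dim: int) -> int:
    """Affine LayerNorm: weight + bias."""
    return 2 * dim


def transformer_block(dim: int, mlp_ratio: float = 4.0) -> int:
    """Pre-norm block: 2 LayerNorms, attention (qkv + proj), 2-layer MLP."""
    hidden = int(dim * mlp_ratio)
    attn = linear(dim, 3 * dim) + linear(dim, dim)
    mlp = linear(dim, hidden) + linear(hidden, dim)
    return 2 * layer_norm(dim) + attn + mlp


dim = 256
total = (
    linear(768, dim) + layer_norm(dim)     # input projection
    + linear(8, dim) + layer_norm(dim)     # GT input projection
    + 4 * dim                              # register tokens
    + 260 * dim                            # positional embedding
    + 4 * transformer_block(dim)           # transformer layers
    + linear(dim, 128) + linear(128, 1)    # output head
)
print(total, round(total / 1e6, 2))  # 3459841 3.46
```

The count lands on 3.46M, with the four transformer layers accounting for about 3.16M of it. The number of attention heads does not affect the total.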

Binary STE Training Mechanism

The OrderHead uses Straight-Through Estimation (STE) with a binary split to make the discrete ordering differentiable:

Scoring: OrderHead outputs continuous scores s[i] for each of 256 positions
Hard split (forward): Top-k=64 highest-scoring positions become the “first group” (context), remaining 192 become the “second group” (targets). Within each group, tokens follow raster order.
Soft mask (backward): soft_mask = sigmoid((threshold - scores_norm) / τ) where scores_norm is z-normalized, threshold is the boundary score, and τ is temperature
STE bridge: mask = hard_mask.detach() + soft_mask - soft_mask.detach() — uses the hard mask in the forward pass but routes gradients through the soft mask
Temperature annealing: τ decays from 1.0 → 0.1 over anneal_steps, making the soft mask progressively sharper

Gradient flow: CE loss → d(CE)/d(token_scale) → d(soft_mask)/d(scores) → OrderHead parameters. The soft mask creates a differentiable approximation to the hard binary split, allowing the generation loss to inform which positions should be in the first group (generated first).
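The pieces of the binary split can be sketched without an autograd framework by computing the forward mask and the surrogate gradient by hand. Several details are assumptions: the threshold is taken as the midpoint between the k-th and (k+1)-th highest normalized score, the mask value 1 is read as "second group" (which is what the sign in the soft-mask formula implies), and the temperature decay is taken to be linear.

```python
import math


def z_normalize(scores):
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) + 1e-6
    return [(s - mean) / std for s in scores]


def binary_split(scores, k, tau):
    """Return (hard_mask, soft_mask, soft_grad) for one score vector."""
    z = z_normalize(scores)
    ranked = sorted(z, reverse=True)
    # Assumed boundary: midpoint between the k-th and (k+1)-th highest score.
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    # Forward pass value: 1 marks the second group under this sign convention.
    hard = [1.0 if v < threshold else 0.0 for v in z]
    soft = [1.0 / (1.0 + math.exp(-(threshold - v) / tau)) for v in z]
    # d(soft)/d(z): the gradient the STE substitutes for d(hard)/d(z) = 0.
    grad = [-s * (1.0 - s) / tau for s in soft]
    return hard, soft, grad


def temperature(step, anneal_steps, tau_start=1.0, tau_end=0.1):
    # Linear decay is an assumption; the text only gives the endpoints.
    frac = min(step / anneal_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)


hard, soft, grad = binary_split([0.3, 2.0, -1.0, 0.9], k=2, tau=temperature(0, 100))
print(hard)  # [1.0, 0.0, 1.0, 0.0]
```

In an autograd framework the expression hard_mask.detach() + soft_mask - soft_mask.detach() gives the same behavior in one line: its value equals hard, and its gradient with respect to the scores equals that of soft. As τ shrinks, the magnitude of grad concentrates on scores near the threshold.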

GT Routing Curriculum

During early training, GPT hidden states are uninformative (random initialization). A curriculum feeds ground-truth VQ codebook entries (dim=8) into the OrderHead instead. The probability of using GT decays linearly from 1.0 → 0.0 over curriculum_steps, after which the OrderHead must rely solely on GPT hidden states.
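A minimal sketch of the routing decision follows. Whether the draw happens per batch or per sample is not stated, so a single Bernoulli draw per call is assumed; the function names and the string labels are illustrative.

```python
import random


def gt_probability(step: int, curriculum_steps: int) -> float:
    """Probability of feeding ground-truth codebook entries to the OrderHead."""
    return max(0.0, 1.0 - step / curriculum_steps)


def choose_orderhead_input(step: int, curriculum_steps: int, rng: random.Random) -> str:
    """Pick which input the OrderHead consumes at this step."""
    if rng.random() < gt_probability(step, curriculum_steps):
        return "gt_codebook"  # 8-dim VQ codebook entries
    return "gpt_hidden"       # 768-dim GPT hidden states


rng = random.Random(0)
for step in (0, 1250, 2500):
    print(step, gt_probability(step, 2500), choose_orderhead_input(step, 2500, rng))
```

With curriculum_steps=2500 (the value used in Experiment 5), the OrderHead sees ground-truth inputs half of the time at step 1250 and never from step 2500 onward.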

Oracle Supervision

Optionally, a ViT-B oracle classifier (trained on VQ codebook embeddings, PatchDrop, val@25=66.3%) provides importance scores per position. The OrderHead can be trained with an additional MSE loss: lambda_cls * MSE(sigmoid(scores), oracle_importance). Two oracle modes were tested:

saliency_attn: CLS→patch attention weights. Failed — produces near-uniform attention (~0.004/patch) when all 256 tokens are visible.
oracle (cls-k=64): Cluster-based classification probability. Works — produces informative importance maps.
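The oracle term and the reason the saliency_attn target is uninformative can both be seen in a few lines. This is a sketch for a single image; the default lambda_cls=1.0 is a placeholder, not a value taken from the experiments.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def oracle_loss(scores, oracle_importance, lambda_cls=1.0):
    """lambda_cls * MSE(sigmoid(scores), oracle_importance) for one image."""
    assert len(scores) == len(oracle_importance)
    sq_err = [(sigmoid(s) - o) ** 2 for s, o in zip(scores, oracle_importance)]
    return lambda_cls * sum(sq_err) / len(sq_err)


# A saliency_attn-style target: attention spread evenly over 256 patches.
uniform_target = [1.0 / 256] * 256
print(round(uniform_target[0], 4))  # 0.0039

# Any constant score vector fits a uniform target equally well at every position,
# so the regression gives no signal about which positions matter.
print(oracle_loss([0.0] * 256, uniform_target))
```

Because every entry of the uniform target is the same, the loss is minimized by pushing all scores toward one common value, which matches the collapsed score_std reported for this mode.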

Inference: Semifast-8 Sampling

At inference, the OrderHead uses semifast-8 sampling: generate 8 candidate orderings by running the OrderHead with different random seeds, generate one image per ordering, then select the best via a rescoring criterion. This is O(8 × L²) per image (8 candidates, each requiring full-forward O(L²) sampling since the ordering is non-raster). Significantly more expensive than standard raster inference.

Experiment 1: Phase 1 — Frozen GPT, Oracle MSE Only

GPT body: Frozen at raster checkpoint (0145000.pt)
OH training: Oracle MSE loss only, 48K steps
OH LR: 3e-4 (lr=1e-4 × oh_lr_scale=3.0)

Inference	FID	sFID
OH semifast-8	155.58	113.12

Why it failed: The GPT was trained only with raster ordering. At inference, the OH-selected non-raster ordering is completely out-of-distribution for the GPT body. The GPT cannot assign meaningful probabilities to tokens in an arbitrary order, regardless of how well the OH learned importance.

Experiment 2: Phase 2 — Joint GPT + OH Training (STE + Oracle)

Init: Raster-trained GPT-B + Phase 1 or fresh OH
Training: CE loss (through STE) + oracle MSE, 48K steps
OH LR: 3e-4 (oh_lr_scale=3.0)
STE: Binary, k=64, τ: 1.0→0.1 over 100K steps

Variant	FID	CE Loss
From Phase 1 checkpoint	146.51	~7.50
From raster GPT (direct)	140.89	~7.42

Why it failed: The GPT's CE loss only dropped from ~7.6 to ~7.4 over 48K steps (vs baseline 6.8). The raster-trained GPT cannot quickly adapt to non-raster orderings. 48K steps is far too short to retrain the entire 110M-param body for arbitrary context patterns.

Experiment 3: OH STE Finetune (300 epochs, no oracle)

Init: Raster-trained GPT-B + fresh OH
Training: CE loss only (no oracle), 300 epochs (~148K steps)
OH LR: 3e-4 (oh_lr_scale=3.0)
STE: Binary, k=64, τ: 1.0→0.1 over 100K steps
Grad clip: max_grad_norm=1.0 (shared GPT + OH)

Inference	FID	sFID	Precision	Recall
OH semifast-8	70.63	84.17	0.291	0.662

Post-mortem: score explosion. score_std=14–17 at convergence. The scores are internally z-normalized for ranking in scores_to_mask, so the absolute magnitude is unconstrained. Over 148K steps with oh_lr_scale=3.0, small clipped gradients accumulated unboundedly. The OH learned a non-trivial ordering (FID=70.63 < random=173), but the exploded scores created numerical instability in the soft mask computation.
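Why the magnitude is unconstrained can be illustrated directly: z-normalization removes any common scale, so a score vector and a 15× larger copy produce the same normalized values and therefore the same mask. The example values are arbitrary.

```python
import math


def std(scores):
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))


def z_normalize(scores):
    mean = sum(scores) / len(scores)
    sigma = std(scores)
    return [(s - mean) / sigma for s in scores]


raw = [0.2, -1.3, 0.8, 2.1, -0.4]
scaled = [15.0 * s for s in raw]  # same ordering, much larger magnitude

unchanged = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(z_normalize(raw), z_normalize(scaled))
)
growth = std(scaled) / std(raw)
print(unchanged, growth)
```

Since the CE loss only sees the normalized scores, nothing in that path penalizes growth of the raw scores, which is the motivation for the explicit score regularizer tried in Experiment 5.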

Experiment 4: OH From-Scratch Curriculum

Init: Random GPT + fresh OH
Curriculum: (A) raster warmup → (B) gradual OH introduction → (C) full OH, ~136K steps
STE: Binary, k=64

Inference	FID	sFID	Precision	Recall
OH semifast-8	147.92	131.82	0.121	0.200

Post-mortem: score collapse + body degradation. score_std=0.46 (near-uniform scores). The from-scratch GPT never recovered baseline quality (CE=7.67 vs baseline 6.8). The curriculum's raster warmup phase was not enough to build a strong body before OH training destabilized it.

Experiment 5: OH Short Finetune (20 epochs, with regularization fixes)

Init: Raster-trained GPT-B + fresh OH
Training: 20 epochs (~10K steps), CE + score_reg
OH LR: 1e-4 (oh_lr_scale=1.0, conservative)
OH grad clip: 0.1 (10× tighter than GPT's 1.0)
Score reg: 0.01 (L2 penalty on score magnitudes)
Anneal steps: 5000 (τ: 1.0→0.1)
Curriculum: 2500 steps (GT routing)

Inference	FID
OH semifast-8	136.47

Post-mortem: over-regularized collapse. score_std=0.04. The tight grad clipping (0.1) + score reg (0.01) + fast temperature annealing (5K steps) prevented any score differentiation. Scores are near-uniform, making the OH ordering effectively random. The opposite problem of Experiment 3 (explosion vs. collapse).

Comprehensive OH Results Summary

OH inference FID (semifast-8, 8 rescoring candidates) is the only metric that matters, since the goal is to improve generation by reordering tokens.

Experiment	OH FID	score_std	GPT CE	Failure Mode
Exp 1: Phase 1 (frozen GPT, oracle)	155.58	1.6	—	GPT never trained on non-raster
Exp 2: Phase 2 (STE + oracle, 48K steps)	140.89	—	7.42	GPT too slow to adapt
Exp 3: STE finetune (300ep, no oracle)	70.63	14–17	7.33	Score explosion
Exp 4: From-scratch curriculum	147.92	0.46	7.67	Score collapse + body degraded
Exp 5: Short finetune (20ep, regularized)	136.47	0.04	7.63	Over-regularized collapse
Raster baseline (no OH)	25.40	—	6.8	—

OrderHead on a raster-trained GPT is fundamentally limited. Across 5 experiments and 3 training strategies, no OH configuration improves over the raster baseline. The best OH inference (FID=70.63) is 2.8× worse than raster (25.40). The core issue: the raster-trained GPT body has never seen non-raster orderings during pretraining, so it cannot assign meaningful probabilities to tokens in arbitrary order. Joint training either destabilizes the body (CE > 7.3) or collapses the OH scores (std → 0). The most promising path forward is attaching OH to an order-agnostic backbone (e.g., the RAR epoch-150 checkpoint from the pure-random phase).

7. Method Comparison & Analysis

All Methods at a Glance

Method	Training Order	Inference Order	Inference Cost	FID (IN-100)	Trainable Params
RAR 300ep	Random → anneal → raster	Raster (standard)	O(L) — KV-cache	19.61	110M (GPT-B only)
Raster 300ep	Raster	Raster (standard)	O(L) — KV-cache	25.40	110M (GPT-B only)
Diagonal 300ep	Diagonal	Diagonal (heuristic)	O(L²) — full-forward	24.97	110M (GPT-B only)
OrderHead (best: Exp 3)	Raster GPT + OH STE	OH semifast-8	O(8 × L²) — 8 candidates	70.63	110M + 3.46M (GPT + OH)
Random-order v1	Random (no ta_embed)	Raster	O(L) — KV-cache	159.41	110M (GPT-B only)

Training Cost Comparison

Method	Epochs	Steps	Wall Time (4×H100)	Extra Overhead
Raster 300ep	300	~148K	~3.8 hours	None
RAR 300ep	300	~148K	~3.8 hours	ta_pos_table (256×768 = 197K params), per-step permutation sampling
OrderHead finetune	20–300	10K–148K	0.3–3.8 hours	OrderHead forward/backward (3.46M params), O(L²) generation during eval

RAR has negligible training overhead over raster: the ta_pos_table is tiny (197K params, 0.18% of GPT-B) and permutation sampling is a cheap CPU operation. The training wall time is essentially identical. OrderHead adds meaningful overhead during training (extra forward/backward for 3.46M params) and massive overhead during inference (8× O(L²) vs O(L) for raster).
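The overhead figures follow from simple arithmetic on numbers already given in this report (110M GPT-B parameters, a 256×768 table, ~494 steps per epoch):

```python
GPT_B_PARAMS = 110e6   # parameter count quoted in Section 1
ta_pos_table_params = 256 * 768
overhead_pct = 100.0 * ta_pos_table_params / GPT_B_PARAMS

steps_per_epoch = 494  # from the RAR training details
total_steps = 300 * steps_per_epoch

print(ta_pos_table_params, round(overhead_pct, 2), total_steps)  # 196608 0.18 148200
```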

Why RAR Succeeds Where OrderHead Fails

The key difference: RAR modifies training, not inference.

RAR: Exposes the GPT to diverse orderings during training (building robust representations), then anneals back to raster for inference. The GPT body learns to handle arbitrary context subsets, then specializes. The final model uses standard raster inference with no extra cost.
OrderHead: Tries to change the inference ordering of a GPT that was only ever trained with raster. The GPT body cannot handle non-raster orderings, and joint training is too slow to fix this (GPT needs to be retrained from scratch to become order-agnostic).

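This suggests the natural next step: attach an OrderHead to a RAR-trained backbone (specifically the epoch-150 checkpoint from the random phase, before annealing). This backbone has already been trained on arbitrary orderings, so the OH would only need to learn which ordering is best rather than teach the GPT to handle non-raster contexts. One caveat: the pure random AR ablation reaches only FID=165 under random inference, so the backbone's quality under non-raster orderings still has to be verified.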

The Annealing Is Critical

Pure random AR (no annealing) fails catastrophically: FID=151 with raster inference, 165 with random. The random phase alone produces a model that has seen every ordering but generates well under none. The annealing phase serves as fine-tuning: it specializes the model for raster inference while retaining the robust representations from the random phase.

This is analogous to pre-training → fine-tuning in NLP: random-order training builds general sequence understanding (like masked language modeling), and raster annealing fine-tunes for the specific deployment task (like task-specific fine-tuning).

Open Questions

Does the RAR benefit scale to IN-1K? (Experiment running, ep 129/300)
Does the benefit scale with model size (GPT-L, GPT-XL)?
What is the optimal random:anneal:raster epoch ratio?
Can the epoch-150 order-agnostic checkpoint serve as a backbone for OrderHead?
Does RAR improve an already-trained raster model via continued training, or must it be from scratch?

8. Why Random Order Fails in Causal AR (LlamaGen vs MAR vs RAR)

The Core Difference: Causal vs. Bidirectional Attention

LlamaGen (Causal AR)

Each token can only attend to tokens before it in the sequence (left-context). Token at position t sees tokens at positions {0, 1, …, t−1} — a strict prefix.

The ordering determines which spatial positions form this prefix. In raster, position (3,5) always sees (3,4), (3,3), … and all rows above. In a random ordering, position (3,5) might only see (12,1), (7,9), (0,3) — spatially scattered positions with no local structure.

MAR (Masked AR / Bidirectional)

Uses a bidirectional transformer (like BERT). Every token can attend to every unmasked token, regardless of generation order.

At each step, the model sees ALL previously generated tokens simultaneously. The “ordering” only determines which tokens are revealed vs. masked. The attention pattern is always full bidirectional over the unmasked set.

Why Random Ordering Breaks Causal Attention

Consider predicting the token at spatial position (8, 5), with 3 tokens already generated: {(12,1), (0,14), (7,9)}.

In LlamaGen (causal): The model can attend only to these 3 scattered positions. The immediate spatial neighbors (8,4) and (7,5) contribute context only if they happen to appear earlier in the random permutation, which is unlikely this early in generation. The context is also inconsistent across training steps (a different random permutation each time), so the model cannot learn stable context patterns. At step 32 of generation, raster order provides 34.8% spatial neighbor coverage; random order provides only 5–6%.

In MAR (bidirectional): The model attends to all 3 unmasked tokens simultaneously from any position. At later steps, when more tokens are unmasked, ALL of them are visible regardless of the order they were generated. The context is always the complete set of previously generated tokens — order doesn't matter.

The Ambiguous Target Problem

In causal AR with random permutations, two different permutations can share the same prefix but differ in what to predict next:

Perm A: […, (3,5), (8,2), …] — next token to predict is (8,2)
Perm B: […, (3,5), (1,7), …] — next token to predict is (1,7)

The model receives the exact same input (same prefix tokens, same causal mask) but must predict different spatial positions. This is ill-posed — the model literally cannot determine which position to predict next. Optimization becomes noisy and converges poorly (loss plateaus at 8.45 vs 6.30 for raster).

MAR doesn't have this problem because masked prediction always explicitly specifies which positions are masked and which to predict (the mask pattern is an explicit input).

RAR's solution: The ta_pos_table adds an embedding of the next position to predict, resolving the ambiguity. With ta_embed, Perm A and Perm B produce different inputs (embedding of (8,2) vs (1,7) is added), so the model can distinguish them.

How RAR Navigates This

RAR's key insight: use random ordering as a training regularizer, not as the final inference strategy.

Phase	Epochs	What Happens	Purpose
Random	0–150	Fresh random permutation each step + ta_embed	Force general spatial reasoning; prevent overfitting to raster context patterns
Anneal	150–225	Linearly mix random → raster	Gradual specialization toward raster
Raster	225–300	Pure raster, ta_embed=None	Specialize for deployment; no train/inference mismatch

At epoch 150, the model has learned to handle arbitrary context subsets — it developed diverse attention patterns (spatial proximity detectors, texture checkers, etc.) instead of raster-specific shortcuts. Annealing then specializes these robust representations for raster, and the final model infers in standard raster order at zero extra cost.

Summary: LlamaGen vs MAR vs RAR

	LlamaGen (Raster)	LlamaGen (Random)	RAR	MAR
Attention	Causal (prefix only)	Causal (prefix only)	Causal (prefix only)	Bidirectional (full)
Training order	Fixed raster	Random per step	Random → anneal → raster	Random mask ratio
Inference order	Raster	Raster or random	Raster	Iterative unmasking
Context at step t	Fixed raster prefix	Random prefix (inconsistent)	Raster prefix (after anneal)	ALL unmasked tokens
Ambiguous target?	No (deterministic)	Yes (without ta_embed)	No (ta_embed during random; raster after anneal)	No (mask is explicit)
Inference cost	O(L) KV-cache	O(L²) full-forward	O(L) KV-cache	O(T·L) (T decoding steps)
FID (IN-100)	25.40	159.41	19.61	N/A (different arch)

The bottom line: Random ordering fundamentally breaks causal attention because the prefix-based context becomes inconsistent and spatially incoherent across training steps. MAR avoids this entirely with bidirectional attention — all unmasked tokens are always visible. RAR's clever solution: use random ordering only as a training regularizer (with ta_embed to resolve the ambiguous target problem), then anneal back to raster for deployment. The result is a model that infers in standard raster order at zero extra cost, but with 23% better FID thanks to the more robust representations learned during the random phase.

9. Critical Bugs Found & Fixed

Bug 1: target_aware_embed active during raster phase

ta_embed was always computed and added to GPT token embeddings, even during the pure-raster phase (epochs 225–300). Standard inference never provides ta_embed, causing a train/inference mismatch. Fix: Set ta_embed = None when random_ratio == 0.

Bug 2: Random-order training v1 missing RAR components

First random-order attempt had no target-aware positional embedding and no annealing schedule. Both are critical components from the RAR paper (arXiv 2411.00776). Without them, loss plateaus at 8.45 and generated images lack object structure. Fix: Implemented full RAR pipeline with ta_pos_table and epoch-based annealing.

Bug 3: Score alignment loss shape mismatch

In the from-scratch curriculum script, the per_token_ce computation had mismatched flattened sizes (16384 vs 16320 elements, i.e. 64×256 vs 64×255) because targets were indexed incorrectly relative to GPT's internal logit shift. Fix: Align targets with GPT's training-mode logit convention, where logits[:, i] predicts z_indices[:, i].
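One plausible way such a mismatch arises is applying a generic next-token shift to the targets when the model has already performed the shift internally. The sketch below reproduces the two element counts with placeholder data; the batch size of 64 is inferred from 16384 / 256, and per_token_pairs is a hypothetical helper, not the function in the script.

```python
def per_token_pairs(logits, z_indices, shift_targets):
    """Pair each logit row with the target index it is scored against."""
    pairs = []
    for sample_logits, sample_targets in zip(logits, z_indices):
        # A generic LM loss drops the first target; here that is one shift too many.
        targets = sample_targets[1:] if shift_targets else sample_targets
        if len(sample_logits) != len(targets):
            raise ValueError(
                f"shape mismatch: {len(sample_logits)} logits vs {len(targets)} targets"
            )
        pairs.extend(zip(sample_logits, targets))
    return pairs


batch, seq_len = 64, 256
logits = [[f"logit_{i}" for i in range(seq_len)] for _ in range(batch)]  # placeholders
z_indices = [list(range(seq_len)) for _ in range(batch)]

# Fixed convention: logits[:, i] predicts z_indices[:, i], no extra shift.
aligned = per_token_pairs(logits, z_indices, shift_targets=False)
print(len(aligned), batch * (seq_len - 1))  # 16384 16320
```

Calling per_token_pairs with shift_targets=True raises the ValueError, mirroring the 16384 vs 16320 failure.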

Bug 4: saliency_attn oracle produces flat importance maps

CLS→patch attention with all 256 tokens visible produces near-uniform attention (~0.004/patch), giving OrderHead a meaningless regression target. score_std stays at 0.2–0.5 (collapsed). Fix: Switch to --classifier-mode oracle --cls-k 64 which uses cluster-based classification probability. score_std recovers to 1.0–1.6.
