MAR OrderHead Experiments

Comprehensive investigation of learned token ordering in Masked Autoregressive (MAR) image generation. 11 training variants, 4 oracle ablations, 7 ordering methods, and 9 heuristic comparisons — all on pretrained MAR-B (ImageNet-1K).

Contents

  1. Architecture & Setup
  2. Heuristic Orderings (No Training)
  3. OrderHead Training Experiments (11 variants)
  3b. Predicted Order Visualizations
  4. Oracle Ablation Studies (4 ablations)
  5. Five Alternative Ordering Methods
  6. IN-1K: Finetune from Random-Order Baseline
  7. IN-100: From-Scratch Experiments
  8. Key Findings & The Body Mismatch Problem

1. Architecture & Setup

MAR-B (Frozen Body)

Encoder: 12-layer ViT (86M), Decoder: 12-layer ViT (86M), DiffLoss: 6-layer MLP (37M). 256 tokens on 16×16 grid, each a 16-dim VAE latent. 64 generation steps, cosine unmasking schedule. All frozen during ordering training.
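The cosine unmasking schedule can be made concrete with a short sketch. It assumes that the fraction of tokens still masked after step t of T is cos(π/2 · t/T) and that every step reveals at least one token. The function name and the clamping rule are illustrative assumptions, not taken from the experiment code.

```python
import math

def cosine_unmask_counts(num_tokens=256, num_steps=64):
    """Return how many tokens are newly generated at each step (assumed schedule)."""
    counts = []
    masked = num_tokens  # every token starts masked
    for step in range(num_steps):
        # Assumed target: fraction of tokens still masked after this step.
        ratio = math.cos(math.pi / 2.0 * (step + 1) / num_steps)
        target = math.floor(num_tokens * ratio)
        if step == num_steps - 1:
            target = 0  # the final step reveals everything that is left
        else:
            # Reveal at least one token, keep at least one for later steps.
            target = max(1, min(masked - 1, target))
        counts.append(masked - target)
        masked = target
    return counts

counts = cosine_unmask_counts()
print(counts[:4], sum(counts))
```

Under these assumptions the first steps reveal one token at a time, and 181 of the 256 tokens are generated in the second half of the 64 steps, so the ordering of early tokens affects only a small, sparse context.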

OrderHead (Trainable)

4-layer transformer, dim=512, 13.3M params. Scores all 256 positions from MAR decoder output. Trained via MSE against oracle importance targets. A straight-through estimator (STE) binary mask keeps the ordering differentiable.

PartialVisClassifier (Oracle)

DeiT-S (21.8M params), frozen. Evaluates classifier confidence on partial token subsets. cls_k=64: 4 non-overlapping clusters of 64 tokens. Gives 4-level discrete importance.
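A minimal sketch of how four clusters of 64 tokens can yield a 4-level importance target. How the clusters are formed is not stated here, so the random partition, the function name `four_level_importance`, and the `confidence_fn` callback (standing in for the frozen classifier's confidence on a partial token subset) are all assumptions for illustration.

```python
import random

def four_level_importance(confidence_fn, num_tokens=256, cls_k=64, seed=0):
    """Assign each token the rank (0..3) of the cluster it belongs to."""
    rng = random.Random(seed)
    perm = list(range(num_tokens))
    rng.shuffle(perm)  # assumption: clusters are a random partition
    clusters = [perm[i:i + cls_k] for i in range(0, num_tokens, cls_k)]
    # Hypothetical oracle call: classifier confidence when only this
    # cluster's tokens are visible.
    confs = [confidence_fn(cluster) for cluster in clusters]
    # Rank clusters from least to most informative.
    ranked = sorted(range(len(clusters)), key=lambda i: confs[i])
    levels = [0] * num_tokens
    for level, cluster_idx in enumerate(ranked):
        for token in clusters[cluster_idx]:
            levels[token] = level
    return levels

# Toy stand-in for the classifier: reward clusters covering the top rows.
toy_confidence = lambda cluster: sum(1 for t in cluster if t < 64) / len(cluster)
levels = four_level_importance(toy_confidence)
print(sorted(set(levels)))
```

Every token in a cluster shares one level, which is why the target is discrete and carries only four distinct values per image.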

Inference Modes

Mode | Description | OH Calls | Speed
Fast | Single-shot scoring from initial z_dec | 1 | Fast
Semifast-8 | Re-score at 8 phase boundaries | 8 | 8x slower
Slow | Re-score every step with current context | 64 | 64x slower
Random | No ordering (baseline) | 0 | Fastest
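The OH call counts follow directly from when each mode re-scores. The sketch below assumes evenly spaced phase boundaries for Semifast-8; the function name `rescore_steps` is hypothetical, and the actual boundaries may differ (the score-evolution figure in section 3b lists steps 0, 16, 40, 56).

```python
def rescore_steps(mode, num_steps=64):
    """Generation steps at which OrderHead is called, per inference mode."""
    if mode == "random":
        return []                      # no OrderHead at all
    if mode == "fast":
        return [0]                     # score once from the initial decoder output
    if mode == "slow":
        return list(range(num_steps))  # re-score at every step
    if mode.startswith("semifast-"):
        phases = int(mode.split("-")[1])
        # Assumption: boundaries are evenly spaced over the generation steps.
        return [i * num_steps // phases for i in range(phases)]
    raise ValueError(f"unknown mode: {mode}")

for mode in ("fast", "semifast-8", "slow", "random"):
    print(mode, len(rescore_steps(mode)))
```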

2. Heuristic Orderings (No Training)

Evaluating the unmodified pretrained MAR-B with different fixed spatial orderings. No OrderHead, no fine-tuning — only the unmasking order changes.

Ordering | FID | IS | Source
Multiscale (coarse-to-fine) | 2.30 | 264.3 | Slurm 3958956, num_iter=256
Random (baseline) | 2.33 | 265.6 | Slurm 4664053
AR random | 2.41 | 274.4 | Slurm 3882999
Center + perturbation | 2.44 | 254.6 | Slurm 3958468
Center spiral | 2.70 | 253.7 | Slurm 3958469
Edge spiral | 3.95 | 336.5 | Slurm 3958945

All FID-50K, cfg=2.9, 64 generation steps unless the Source column says otherwise (the multiscale run lists num_iter=256). Verified from comprehensive_summary/README.md.

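Multiscale heuristic (FID=2.30) edges out random (2.33) with ZERO training, though the margin is only 0.03 FID and the multiscale run lists num_iter=256, so the comparison may not be strictly like-for-like. This suggests the pretrained MAR body naturally benefits from certain spatial orderings that are close to its random-masking training distribution. No learned ordering has matched this result.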
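For illustration, fixed orderings of this kind can be generated as permutations of the 256 grid positions. The exact definitions used in the runs above are not given, so the ring-based center-out order and the stride-halving multiscale order below are assumptions, and all function names are hypothetical.

```python
import math

GRID = 16  # 16x16 token grid, 256 tokens

def center_out_order(grid=GRID):
    """Tokens sorted by square ring around the grid centre, then by angle."""
    c = (grid - 1) / 2.0
    def key(idx):
        row, col = divmod(idx, grid)
        ring = max(abs(row - c), abs(col - c))  # Chebyshev distance to centre
        return (ring, math.atan2(row - c, col - c))
    return sorted(range(grid * grid), key=key)

def edge_in_order(grid=GRID):
    """Reverse of center-out: outermost ring first."""
    return center_out_order(grid)[::-1]

def multiscale_order(grid=GRID):
    """Coarse-to-fine: visit a sparse lattice first, then halve the stride."""
    order, seen = [], set()
    stride = grid // 2
    while stride >= 1:
        for row in range(0, grid, stride):
            for col in range(0, grid, stride):
                idx = row * grid + col
                if idx not in seen:
                    seen.add(idx)
                    order.append(idx)
        stride //= 2
    return order

print(multiscale_order()[:4])
```

In this sketch the multiscale order keeps early tokens spread across the whole image, which is closer to what a randomly masked context looks like than a spiral that fills one region first.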

3. OrderHead Training Experiments

11 training variants tested, all on ImageNet-1K with frozen MAR-B body (except oracle_unfrozen). Common settings: batch=256, 4×H100, AdamW.

Experiment | FID (fast) | FID (semifast-8) | blr | Epochs | Lambda | cls_k | Notes
nonaligned_oracle (10ep) | 3.41 | 2.94 | 5e-4 | 10 | 1.0 | 64 | Best classifier-guided result
v3_oracle_aligned_k64 | 3.16 | 3.40 | 5e-4 | 40 | 1.0 | 64 | Best fast inference
oracle_aligned | 3.43 | 3.19 | 5e-4 | 40 | 1.0 | 64 | Stage1 = all-ones mask
v3_oracle_phase1b | 3.42* | - | 5e-4 | 40 | 1.0 | 64 | Slow inference
v3_oracle_phase1b_k16 | 3.34* | - | 5e-4 | 40 | 1.0 | 16 | Finer-grained oracle
oracle_low_lambda | 3.67 | 4.04 | 5e-4 | 40 | 0.1 | 64 | Weaker ordering signal
oracle_cumulative | 4.27 | 4.24 | 5e-4 | 20 | 1.0 | 64 | Multi-k cumulative prefix oracle
nonaligned_oracle (40ep) | 4.48 | 4.75 | 5e-4 | 40 | 1.0 | 64 | Longer training hurts
oracle_unfrozen | 4.65 | 4.90 | 1e-4 | 40 | 1.0 | 64 | Body unfrozen, worse!
oracle_high_lr | 4.50 | 5.02 | 2e-3 | 20 | 1.0 | 64 | 4x higher LR
oracle_k32 | 5.13 | 6.35 | 5e-4 | 40 | 1.0 | 32 | Smaller eval window
oracle_lambda5 | 5.37 | 6.11 | 5e-4 | 20 | 5.0 | 64 | Stronger signal hurts
oracle_deep | - | - | 5e-4 | 20 | 1.0 | 64 | 8-layer OrderHead

* Only one FID is reported for this row; the inference mode it belongs to is not specified.

FID-50K, cfg=2.9, 64 steps. Verified from comprehensive_summary/README.md and Slurm eval logs.

[Figure: MAR OrderHead main dashboard]

Main results dashboard showing FID vs ordering method across training configurations.

No classifier-guided model beats the pretrained baseline (FID=2.33). The best result (semifast-8 = 2.94) still loses 0.61 FID. Higher lambda, longer training, and body unfreezing all make things worse — the ordering signal pushes the model further from the random-masking distribution it was trained with.

3b. Predicted Order Visualizations

Random vs Learned Ordering: Side-by-Side Comparison

Each panel shows three ordering strategies (content-aware slow rescoring, fast positional-only, random baseline) with their order heatmaps and progressive generation at 25%, 50%, 75% completion.

[Figure: Random vs Learned: Golden Retriever]

Golden Retriever (class 207). Content-aware (slow) ordering generates the dog's face first, while fast ordering follows a learned spatial prior (center-biased). Random ordering scatters tokens uniformly.

[Figure: Random vs Learned: Panda]

Panda (class 388). Content-aware rescoring concentrates early tokens on the panda's face and body.

[Figure: Random vs Learned: Tiger]

Tiger (class 292). Note how content-aware ordering selectively generates the tiger's body in the first 25%, while random ordering generates scattered patches across the entire image.

Generation Order Heatmaps: Fast vs Semifast vs Slow vs Random

Each row shows a different ordering strategy. The heatmap (left) encodes generation order (dark=early, light=late). Progressive snapshots show the image at 25%, 50%, 75%, 100% completion.

[Figure: Generation order: Bubble class]

Bubble (class 971). Fast ordering (row 1) uses a fixed center-biased pattern. Semifast (row 2) adapts slightly. Slow rescoring (row 3) generates the bubble's reflections first. Random (row 4) is spatially uniform.

[Figure: Generation order: Retriever class]

OrderHead Score Evolution (Semifast-8 Rescoring)

Top row: OrderHead scores at each rescore boundary (step 0, 16, 40, 56). Bottom row: which tokens are generated in the first 25%, 50%, 75%, and all 100%.

[Figure: MAR OH progressive generation, image 0]

Score heatmaps (top) show how OrderHead's importance predictions evolve as more tokens are generated. The rank quartile view (bottom, green = generated) shows object-centric generation order.

[Figure: MAR OH progressive generation, image 3]

4. Oracle Ablation Studies

Four ablations using the nonaligned_oracle_mse checkpoint (10 epochs) to test whether OrderHead learned meaningful importance, independent of the body mismatch issue.

Ablation 1: Spearman Rank Correlation (Learned vs Oracle Importance)

Do OrderHead scores correlate with oracle per-token importance?

Image | Class | Fast | GT-aware | Slow
0 | Golden Retriever | 0.592 | 0.596 | 0.020
1 | Golden Retriever | 0.535 | 0.539 | 0.210
2 | Bubble | 0.361 | 0.276 | 0.230
3 | Bubble | -0.006 | -0.076 | 0.106
4 | Panda | 0.254 | 0.230 | 0.066
5 | Panda | 0.184 | 0.232 | 0.005
6 | Flamingo | 0.084 | 0.161 | 0.151
7 | Flamingo | 0.129 | 0.181 | 0.081
Mean | | 0.267 | 0.267 | 0.109

Key insight: Fast ≈ GT-aware (per-image gap of at most 0.085, identical means of 0.267). This suggests the OrderHead learned a spatial prior, NOT content-dependent importance: giving it perfect content information (GT-aware) barely changes how well its scores track the oracle.

Ablation 2: GT Token Injection (Body Mismatch Test)

If tokens were perfect, would ordering matter?

Ordering | MSE to GT (DiffLoss sampling) | MSE to GT (GT injection)
Random | 0.10448 | 0.00000
Learned slow | 0.12945 | N/A
Learned fast | 0.13413 | 0.00000
Oracle | 0.13454 | 0.00000

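Random ordering produces the lowest MSE (0.104), beating oracle (0.135) and both learned orderings (0.129 slow, 0.134 fast). This is the body mismatch effect: the MAR body was trained with random masking, so it generates its best-quality tokens when the ordering IS random. Any structured ordering is out-of-distribution for the body. Under GT injection the MSE is 0 for every ordering measured, so the gap comes from token prediction quality, not from the ordering by itself.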

Ablation 3: Heuristic Ordering Comparison

9 orderings compared during free generation: random, center_spiral, edge_spiral, multiscale, blockwise, ar_raster, learned_fast, learned_slow, oracle_marginal.

[Figure: Ordering heatmaps for different methods]

Generation order heatmaps (dark=early, light=late). Center_spiral generates from center-out; learned_slow adaptively focuses on object regions.

Ablation 4: Progressive Quality Curves (Recognizability Speed)

Does good ordering make images recognizable faster during generation?

Ordering | AUC
Learned slow | 0.568
Oracle | 0.487
Random | 0.484
Learned fast | 0.454
Center spiral | 0.216

Strongest positive signal: Learned slow rescoring (AUC=0.568) makes images recognizable faster than even the oracle ordering (0.487). Slow inference adapts dynamically to the generation trajectory, finding orderings that maximize recognizability — a fundamentally different objective than static importance.
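One way to turn a recognizability curve into a single AUC number is a trapezoidal area normalized by the number of steps. How the AUC values above were actually computed is not specified, so this is a sketch under that assumption, with per-step classifier confidence in [0, 1] and a hypothetical function name.

```python
def normalized_auc(confidences):
    """Trapezoidal area under a per-step confidence curve, scaled to [0, 1]."""
    n = len(confidences)
    if n < 2:
        raise ValueError("need at least two points")
    area = sum((confidences[i] + confidences[i + 1]) / 2.0 for i in range(n - 1))
    return area / (n - 1)  # divide by curve length so a constant 1.0 gives 1.0

# Toy curves over 64 steps (65 points including the empty canvas).
early = [min(1.0, i / 16) for i in range(65)]        # recognizable quickly
late = [max(0.0, (i - 48) / 16) for i in range(65)]  # recognizable only at the end
print(normalized_auc(early), normalized_auc(late))
```

Two orderings that reach the same final confidence can differ widely on this measure, which is why it can rank methods differently from final FID.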

[Figure: Progressive quality curves]

Progressive quality curves during 64-step generation. Left: classifier confidence (recognizability) at each step. Right: MSE to final image (convergence). Learned slow (green) achieves highest recognizability speed despite worse final FID.

[Figure: FID vs AUAC tradeoff]

FID vs AUAC (Area Under Accuracy Curve): higher AUAC = images become recognizable faster. Slow rescoring achieves highest AUAC but at significant compute cost (64×).

5. Five Alternative Ordering Methods

Beyond the standard oracle approach, we tested 5 alternative ordering strategies, all evaluated with 5K generated images:

Method | FID | IS | CLIP Score | CMMD
Random (pretrained, no OH) | 8.02 | 155.98 | 0.3057 | 1.223
Idea 3: Entropy-based | 8.46 | 150.01 | 0.3050 | 1.241
Idea 5: Class-conditional | 8.47 | 147.05 | 0.3043 | 1.242
Idea 4: Reconstruction-based | 8.52 | 149.26 | 0.3042 | 1.246
Idea 1: Knowledge distillation | 8.62 | 149.72 | 0.3048 | 1.240
Nonaligned oracle | 9.13 | 136.27 | 0.3027 | 1.255
Idea 2: LoRA fine-tuning | 10.04 | 134.35 | 0.2995 | 1.279

5K images, verified from eval_all_methods/all_results.json.

[Figure: FID/IS comparison of all methods]

FID and IS comparison across all 7 methods. Random baseline (no OrderHead) dominates on both metrics.

All alternative methods (distillation, LoRA, entropy, reconstruction, class-conditioning) underperform the random baseline. The oracle-trained OrderHead (9.13) is second-worst, ahead of only LoRA fine-tuning (10.04), suggesting that the oracle's discrete 4-level importance targets may be too coarse to capture useful ordering information.

6. IN-1K: Finetune from Random-Order Baseline

Starting from the converged random-order MAR-B (FID=2.31), we finetune with OrderHead for 10 epochs. This tests OrderHead on a backbone trained to be order-agnostic.

FID-50K Results

Experiment | FID | IS | Source
Random-order MAR-B baseline | 2.31 | 253.0 | eval_rand_full_5884168.out
OH full finetune ep5 | 5.75 | 220.2 | eval_oh_ft_5952087.out
OH oracle (from random-order) ep9 | 6.55 | 217.0 | eval_oh_randorder_5952436.out
OH pure STE ep9 | 10.95 | 180.5 | eval_oh_randorder_5952436.out

Per-Epoch FID-5K Trajectories (cfg=2.9)

Oracle (cls_k=64): score_std stable at ~1.6 throughout.

Epoch | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
FID-5K | 17.01 | 22.62 | 23.86 | 23.18 | 23.14 | 23.37 | 21.32 | 19.77 | 19.49 | 18.85

Verified from in1k_oh_from_randorder_oracle_5916963.out.

Pure STE (no oracle): score_std explodes to ~8 by ep9. FID is best (lowest) at ep4 (20.89), then degrades.

Epoch | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
FID-5K | 26.31 | 25.91 | 23.16 | 21.77 | 20.89 | 25.00 | 28.47 | 35.82 | 38.27 | 33.63

Verified from in1k_oh_from_randorder_5917013.out.

OH Full finetune (from official MAR-B, FID-50K): Monotonic degradation.

Epoch | 5 | 10 | 15 | 20
FID-50K | 5.75 | 6.94 | 8.16 | 9.58

Verified from eval_oh_ft_5952087.out.
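The three trajectories above can be summarized with a few lines of arithmetic. The numbers are copied from the tables in this section; the helper names are illustrative.

```python
def best_epoch(fids):
    """Index and value of the lowest (best) FID in a per-epoch list."""
    epoch = min(range(len(fids)), key=lambda i: fids[i])
    return epoch, fids[epoch]

def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))

# Per-epoch FID-5K, epochs 0..9.
oracle_fid_5k = [17.01, 22.62, 23.86, 23.18, 23.14, 23.37, 21.32, 19.77, 19.49, 18.85]
pure_ste_fid_5k = [26.31, 25.91, 23.16, 21.77, 20.89, 25.00, 28.47, 35.82, 38.27, 33.63]
# FID-50K for the full finetune, keyed by epoch.
full_ft_fid_50k = {5: 5.75, 10: 6.94, 15: 8.16, 20: 9.58}

print(best_epoch(oracle_fid_5k))    # (0, 17.01)
print(best_epoch(pure_ste_fid_5k))  # (4, 20.89)
print(is_strictly_increasing(list(full_ft_fid_50k.values())))  # True
```

The oracle run never recovers its epoch-0 FID-5K within 10 epochs, the pure STE run is best at epoch 4, and the full finetune worsens at every evaluated checkpoint.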

[Figure: Training loss curves]

Training loss curves across methods. All learning-based ordering methods converge, but the body mismatch prevents FID improvement.

7. IN-100: From-Scratch Experiments

Training OrderHead + MAR jointly from random initialization (400 epochs).

Experiment | FID-5K | IS | score_std | Source
MAR-S saliency_attn (400ep) | 104.24 | 15.45 | 0.50 (collapsed) | eval_in100_oh_s_saliency_5952089.out
MAR-S cls_k oracle (400ep) | 78.27 | 20.62 | 0.012→~1.6 | eval_in100_oh_s_clsk_5952088.out

Baselines: MAR-S = FID-5K ~31.28 (IN-100), MAR-B = FID-5K ~18.87 (IN-100).

From-scratch training fails (2.5–3.3x worse than baseline). Chicken-and-egg collapse: at epoch 0, both MAR body and oracle operate on random features → meaningless importance scores → OrderHead collapses (score_std→0.012). Though score_std eventually recovers to ~1.6 by epoch 400, the MAR body has already converged under random ordering — the damage is done.
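The 2.5–3.3x figure is the ratio of each run's FID-5K to the MAR-S IN-100 baseline (~31.28), as this short check shows. The helper name is illustrative.

```python
MAR_S_BASELINE_FID_5K = 31.28  # MAR-S on IN-100, from the baselines line above

def ratio_to_baseline(fid, baseline=MAR_S_BASELINE_FID_5K):
    """How many times worse (higher) a FID is than the baseline."""
    return fid / baseline

runs = {"saliency_attn": 104.24, "cls_k oracle": 78.27}
for name, fid in runs.items():
    print(f"{name}: {ratio_to_baseline(fid):.2f}x baseline")
# saliency_attn: 3.33x baseline
# cls_k oracle: 2.50x baseline
```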

8. Key Findings & The Body Mismatch Problem

The Body Mismatch Problem (Core Finding)

The MAR body was trained for ~400 epochs with random masking, so it is tuned to randomly masked contexts. Imposing a structured ordering, even a nominally "better" one, creates out-of-distribution masking patterns and produces worse token predictions. This is supported by Ablation 2: random ordering gives the lowest MSE (0.104), below oracle ordering (0.135).

Positive Signal: Slow Rescoring (AUC = 0.568)

The strongest evidence that learned ordering captures meaningful information. Slow rescoring (re-scoring at every step using current context) makes images recognizable faster than even oracle ordering (0.487) or random (0.484). This is because slow rescoring adapts dynamically to the generation trajectory — a fundamentally different capability than static importance scoring.

OrderHead Learned a Spatial Prior, Not Content Importance

Fast and GT-aware scores track the oracle almost equally well (Spearman ρ differs by at most 0.085 per image; both average 0.267), meaning giving the OrderHead perfect content information barely changes its predictions. It learned a center-biased spatial prior, not image-specific importance. The 4-level oracle target may be too coarse to convey content-dependent information.

Heuristic Orderings Are Competitive

Multiscale heuristic achieves FID=2.30 (best overall) with zero training. No learned ordering has beaten this. This suggests the pretrained MAR body can benefit from structured ordering IF the structure is compatible with its random-masking training distribution.

All Learned Methods Underperform Random

Across 7 methods (distillation, LoRA, entropy, reconstruction, class-conditional, oracle, STE), 11 training configurations, and 50K evaluation, no learned ordering beats the random baseline (FID=2.33 at 50K, 8.02 in the 5K eval). The ordering signal exists (partial correlation with oracle, faster recognizability) but the body mismatch dominates final image quality.

[Figure: Metric comparison across methods]

Comprehensive metric comparison: FID, IS, CLIP Score, and CMMD across all ordering methods. Random baseline wins on all metrics.

Path forward: The body mismatch is the fundamental bottleneck. Promising directions include: (1) training the MAR body from scratch WITH the learned ordering, (2) gentler fine-tuning with very low LR + small lambda, (3) combining learned ordering with heuristic priors (e.g., multiscale + OrderHead refinement), (4) faster approximations of slow rescoring to capture the adaptive advantage.