MAR OrderHead Experiments

Architecture & Setup
Heuristic Orderings (No Training)
OrderHead Training Experiments (11 variants)
Predicted Order Visualizations
Oracle Ablation Studies (4 ablations)
5 Alternative Ordering Methods
IN-1K: Finetune from Random-Order Baseline
IN-100: From-Scratch Experiments
Key Findings & The Body Mismatch Problem

1. Architecture & Setup

MAR-B (Frozen Body)

Encoder: 12-layer ViT (86M), Decoder: 12-layer ViT (86M), DiffLoss: 6-layer MLP (37M). 256 tokens on 16×16 grid, each a 16-dim VAE latent. 64 generation steps, cosine unmasking schedule. All frozen during ordering training.

OrderHead (Trainable)

4-layer transformer, dim=512, 13.3M params. Scores all 256 positions from MAR decoder output. Trained via MSE against oracle importance targets. STE binary mask for differentiable ordering.

PartialVisClassifier (Oracle)

DeiT-S (21.8M params), frozen. Evaluates classifier confidence on partial token subsets. cls_k=64: 4 non-overlapping clusters of 64 tokens. Gives 4-level discrete importance.

Inference Modes

Mode	Description	OH Calls	Speed
Fast	Single-shot scoring from initial z_dec	1	Fast
Semifast-8	Re-score at 8 phase boundaries	8	8x slower
Slow	Re-score every step with current context	64	64x slower
Random	No ordering (baseline)	0	Fastest

2. Heuristic Orderings (No Training)

Evaluating the unmodified pretrained MAR-B with different fixed spatial orderings. No OrderHead, no fine-tuning — only the unmasking order changes.

Ordering	FID	IS	Source
Multiscale (coarse-to-fine)	2.30	264.3	Slurm 3958956, num_iter=256
Random (baseline)	2.33	265.6	Slurm 4664053
AR random	2.41	274.4	Slurm 3882999
Center + perturbation	2.44	254.6	Slurm 3958468
Center spiral	2.70	253.7	Slurm 3958469
Edge spiral	3.95	336.5	Slurm 3958945

All FID-50K, cfg=2.9, 64 generation steps. Verified from comprehensive_summary/README.md.

Multiscale heuristic (FID=2.30) beats random (2.33) with ZERO training. This suggests the pretrained MAR body naturally benefits from certain spatial orderings that are close to its random-masking training distribution. No learned ordering has matched this result.

3. OrderHead Training Experiments

11 training variants tested, all on ImageNet-1K with frozen MAR-B body (except oracle_unfrozen). Common settings: batch=256, 4×H100, AdamW.

Experiment	FID (fast)	FID (semifast-8)	blr	Epochs	Lambda	cls_k	Notes
nonaligned_oracle (10ep)	3.41	2.94	5e-4	10	1.0	64	Best classifier-guided result
v3_oracle_aligned_k64	3.16	3.40	5e-4	40	1.0	64	Best fast inference
oracle_aligned	3.43	3.19	5e-4	40	1.0	64	Stage1 = all-ones mask
v3_oracle_phase1b	—	3.42	5e-4	40	1.0	64	Slow inference
v3_oracle_phase1b_k16	—	3.34	5e-4	40	1.0	16	Finer-grained oracle
oracle_low_lambda	3.67	4.04	5e-4	40	0.1	64	Weaker ordering signal
oracle_cumulative	4.27	4.24	5e-4	20	1.0	64	Multi-k cumulative prefix oracle
nonaligned_oracle (40ep)	4.48	4.75	5e-4	40	1.0	64	Longer training hurts
oracle_unfrozen	4.65	4.90	1e-4	40	1.0	64	Body unfrozen — worse!
oracle_high_lr	4.50	5.02	2e-3	20	1.0	64	4x higher LR
oracle_k32	5.13	6.35	5e-4	40	1.0	32	Smaller eval window
oracle_lambda5	5.37	6.11	5e-4	20	5.0	64	Stronger signal hurts
oracle_deep	—	—	5e-4	20	1.0	64	8-layer OrderHead

FID-50K, cfg=2.9, 64 steps. Verified from comprehensive_summary/README.md and Slurm eval logs.

Main results dashboard showing FID vs ordering method across training configurations.

No classifier-guided model beats the pretrained baseline (FID=2.33). The best result (semifast-8 = 2.94) still loses 0.61 FID. Higher lambda, longer training, and body unfreezing all make things worse — the ordering signal pushes the model further from the random-masking distribution it was trained with.

3b. Predicted Order Visualizations

Random vs Learned Ordering: Side-by-Side Comparison

Each panel shows three ordering strategies (content-aware slow rescoring, fast positional-only, random baseline) with their order heatmaps and progressive generation at 25%, 50%, 75% completion.

Golden Retriever (class 207). Content-aware (slow) ordering generates the dog's face first, while fast ordering follows a learned spatial prior (center-biased). Random ordering scatters tokens uniformly.

Panda (class 388). Content-aware rescoring concentrates early tokens on the panda's face and body.

Tiger (class 292). Note how content-aware ordering selectively generates the tiger's body in the first 25%, while random ordering generates scattered patches across the entire image.

Generation Order Heatmaps: Fast vs Semifast vs Slow vs Random

Each row shows a different ordering strategy. The heatmap (left) encodes generation order (dark=early, light=late). Progressive snapshots show the image at 25%, 50%, 70%, 100% completion.

Bubble (class 971). Fast ordering (row 2) uses a fixed center-biased pattern. Semifast (row 3) adapts slightly. Slow rescoring (row 3) generates the bubble's reflections first. Random (row 4) is spatially uniform.

OrderHead Score Evolution (Semifast-8 Rescoring)

Top row: OrderHead scores at each rescore boundary (step 0, 16, 40, 56). Bottom row: which tokens are generated in the first 25%, 50%, 75%, and all 100%.

Score heatmaps (top) show how OrderHead's importance predictions evolve as more tokens are generated. The rank quartile view (bottom, green = generated) shows object-centric generation order.

4. Oracle Ablation Studies

Four ablations using the nonaligned_oracle_mse checkpoint (10 epochs) to test whether OrderHead learned meaningful importance, independent of the body mismatch issue.

Ablation 1: Spearman Rank Correlation (Learned vs Oracle Importance)

Do OrderHead scores correlate with oracle per-token importance?

Image	Class	Fast	GT-aware	Slow
0	Golden Retriever	0.592	0.596	0.020
1	Golden Retriever	0.535	0.539	0.210
2	Bubble	0.361	0.276	0.230
3	Bubble	-0.006	-0.076	0.106
4	Panda	0.254	0.230	0.066
5	Panda	0.184	0.232	0.005
6	Flamingo	0.084	0.161	0.151
7	Flamingo	0.129	0.181	0.081
Mean		0.267	0.267	0.109

Key insight: Fast ≈ GT-aware (within 0.05 per image). This means the OrderHead learned a spatial prior, NOT content-dependent importance. Giving it perfect content information (GT-aware) doesn't change predictions.

Ablation 2: GT Token Injection (Body Mismatch Test)

If tokens were perfect, would ordering matter?

Ordering	MSE to GT (DiffLoss sampling)	MSE to GT (GT injection)
Random	0.10448	0.00000
Learned slow	0.12945	N/A
Learned fast	0.13413	0.00000
Oracle	0.13454	0.00000

Random ordering produces lowest MSE (0.104) — beating oracle (0.135) and learned (0.134). This is the body mismatch effect: the MAR body was trained with random masking, so it generates best-quality tokens when ordering IS random. Any structured ordering is out-of-distribution for the body.

Ablation 3: Heuristic Ordering Comparison

9 orderings compared during free generation: random, center_spiral, edge_spiral, multiscale, blockwise, ar_raster, learned_fast, learned_slow, oracle_marginal.

Generation order heatmaps (dark=early, light=late). Center_spiral generates from center-out; learned_slow adaptively focuses on object regions.

Ablation 4: Progressive Quality Curves (Recognizability Speed)

Does good ordering make images recognizable faster during generation?

0.568

Learned slow AUC

0.487

Oracle AUC

0.484

Random AUC

0.454

Learned fast AUC

0.216

Center spiral AUC

Strongest positive signal: Learned slow rescoring (AUC=0.568) makes images recognizable faster than even the oracle ordering (0.487). Slow inference adapts dynamically to the generation trajectory, finding orderings that maximize recognizability — a fundamentally different objective than static importance.

Progressive quality curves during 64-step generation. Left: classifier confidence (recognizability) at each step. Right: MSE to final image (convergence). Learned slow (green) achieves highest recognizability speed despite worse final FID.

FID vs AUAC (Area Under Accuracy Curve): higher AUAC = images become recognizable faster. Slow rescoring achieves highest AUAC but at significant compute cost (64×).

5. Five Alternative Ordering Methods

Beyond the standard oracle approach, we tested 5 alternative ordering strategies, all evaluated with 5K generated images:

Method	FID	IS	CLIP Score	CMMD
Random (pretrained, no OH)	8.02	155.98	0.3057	1.223
Idea 3: Entropy-based	8.46	150.01	0.3050	1.241
Idea 5: Class-conditional	8.47	147.05	0.3043	1.242
Idea 4: Reconstruction-based	8.52	149.26	0.3042	1.246
Idea 1: Knowledge distillation	8.62	149.72	0.3048	1.240
Nonaligned oracle	9.13	136.27	0.3027	1.255
Idea 2: LoRA fine-tuning	10.04	134.35	0.2995	1.279

5K images, verified from eval_all_methods/all_results.json.

FID and IS comparison across all 7 methods. Random baseline (no OrderHead) dominates on both metrics.

All alternative methods (distillation, LoRA, entropy, reconstruction, class-conditioning) underperform the random baseline. The oracle-trained OrderHead (9.13) is worst among the learning-based approaches, suggesting that the oracle's discrete 4-level importance targets may be too coarse to capture useful ordering information.

6. IN-1K: Finetune from Random-Order Baseline

Starting from the converged random-order MAR-B (FID=2.31), we finetune with OrderHead for 10 epochs. This tests OrderHead on a backbone trained to be order-agnostic.

FID-50K Results

Experiment	FID	IS	Source
Random-order MAR-B baseline	2.31	253.0	eval_rand_full_5884168.out
OH full finetune ep5	5.75	220.2	eval_oh_ft_5952087.out
OH oracle (from random-order) ep9	6.55	217.0	eval_oh_randorder_5952436.out
OH pure STE ep9	10.95	180.5	eval_oh_randorder_5952436.out

Per-Epoch FID-5K Trajectories (cfg=2.9)

Oracle (cls_k=64): score_std stable at ~1.6 throughout.

Epoch	0	1	2	3	4	5	6	7	8	9
FID-5K	17.01	22.62	23.86	23.18	23.14	23.37	21.32	19.77	19.49	18.85

Verified from in1k_oh_from_randorder_oracle_5916963.out.

Pure STE (no oracle): score_std explodes to ~8 by ep9. FID peaks at ep4 then degrades.

Epoch	0	1	2	3	4	5	6	7	8	9
FID-5K	26.31	25.91	23.16	21.77	20.89	25.00	28.47	35.82	38.27	33.63

Verified from in1k_oh_from_randorder_5917013.out.

OH Full finetune (from official MAR-B, FID-50K): Monotonic degradation.

Epoch	5	10	15	20
FID-50K	5.75	6.94	8.16	9.58

Verified from eval_oh_ft_5952087.out.

Training loss curves across methods. All learning-based ordering methods converge, but the body mismatch prevents FID improvement.

7. IN-100: From-Scratch Experiments

Training OrderHead + MAR jointly from random initialization (400 epochs).

Experiment	FID-5K	IS	score_std	Source
MAR-S saliency_attn (400ep)	104.24	15.45	0.50 (collapsed)	eval_in100_oh_s_saliency_5952089.out
MAR-S cls_k oracle (400ep)	78.27	20.62	0.012→~1.6	eval_in100_oh_s_clsk_5952088.out

Baselines: MAR-S = FID-5K ~31.28 (IN-100), MAR-B = FID-5K ~18.87 (IN-100).

From-scratch training fails (2.5–3.3x worse than baseline). Chicken-and-egg collapse: at epoch 0, both MAR body and oracle operate on random features → meaningless importance scores → OrderHead collapses (score_std→0.012). Though score_std eventually recovers to ~1.6 by epoch 400, the MAR body has already converged under random ordering — the damage is done.

8. Key Findings & The Body Mismatch Problem

The Body Mismatch Problem (Core Finding)

The MAR body was trained for ~400 epochs with random masking. It learned to handle randomly-masked contexts optimally. Imposing any structured ordering — even a provably "better" one — creates out-of-distribution masking patterns, producing worse token predictions. This is confirmed by Ablation 2: random ordering gives lowest MSE (0.104) even when compared to oracle ordering (0.135).

Positive Signal: Slow Rescoring (AUC = 0.568)

The strongest evidence that learned ordering captures meaningful information. Slow rescoring (re-scoring at every step using current context) makes images recognizable faster than even oracle ordering (0.487) or random (0.484). This is because slow rescoring adapts dynamically to the generation trajectory — a fundamentally different capability than static importance scoring.

OrderHead Learned a Spatial Prior, Not Content Importance

Fast vs GT-aware scores are nearly identical (Spearman ρ within 0.05 per image), meaning giving the OrderHead perfect content information doesn't change its predictions. It learned a center-biased spatial prior, not image-specific importance. The 4-level oracle target may be too coarse to convey content-dependent information.

Heuristic Orderings Are Competitive

Multiscale heuristic achieves FID=2.30 (best overall) with zero training. No learned ordering has beaten this. This suggests the pretrained MAR body can benefit from structured ordering IF the structure is compatible with its random-masking training distribution.

All Learned Methods Underperform Random

Across 7 methods (distillation, LoRA, entropy, reconstruction, class-conditional, oracle, STE), 11 training configurations, and 50K evaluation — no learned ordering beats the random baseline (FID=2.33 for heuristic, 8.02 for 5K eval). The ordering signal exists (partial correlation with oracle, faster recognizability) but the body mismatch dominates final image quality.

Comprehensive metric comparison: FID, IS, CLIP Score, and CMMD across all ordering methods. Random baseline wins on all metrics.

Path forward: The body mismatch is the fundamental bottleneck. Promising directions include: (1) training the MAR body from scratch WITH the learned ordering, (2) gentler fine-tuning with very low LR + small lambda, (3) combining learned ordering with heuristic priors (e.g., multiscale + OrderHead refinement), (4) faster approximations of slow rescoring to capture the adaptive advantage.

MAR OrderHead Experiments

Contents

1. Architecture & Setup

MAR-B (Frozen Body)

OrderHead (Trainable)

PartialVisClassifier (Oracle)

Inference Modes

2. Heuristic Orderings (No Training)

3. OrderHead Training Experiments

3b. Predicted Order Visualizations

Random vs Learned Ordering: Side-by-Side Comparison

Generation Order Heatmaps: Fast vs Semifast vs Slow vs Random

OrderHead Score Evolution (Semifast-8 Rescoring)

4. Oracle Ablation Studies

Ablation 1: Spearman Rank Correlation (Learned vs Oracle Importance)

Ablation 2: GT Token Injection (Body Mismatch Test)

Ablation 3: Heuristic Ordering Comparison

Ablation 4: Progressive Quality Curves (Recognizability Speed)

5. Five Alternative Ordering Methods

6. IN-1K: Finetune from Random-Order Baseline

FID-50K Results

Per-Epoch FID-5K Trajectories (cfg=2.9)

7. IN-100: From-Scratch Experiments

8. Key Findings & The Body Mismatch Problem

The Body Mismatch Problem (Core Finding)

Positive Signal: Slow Rescoring (AUC = 0.568)

OrderHead Learned a Spatial Prior, Not Content Importance

Heuristic Orderings Are Competitive

All Learned Methods Underperform Random