Autoresearch Internal Process Log

Private local record. Generated/updated 2026-05-11 America/New_York. Do not link from the public website while the manuscript is in preparation.

Current objective: maximize cumulative strict global-unique successful binder sequences over a fixed backend compute window. Strict success is the paired-candidate Proteina-Complexa paper criterion: iPAE < 7/31, pLDDT > 0.90, and binder scRMSD < 1.5 A, all on the same candidate. The x-axis for comparisons is Proteina-Complexa child runtime in GPU-hours only; LLM time, controller idle time, vLLM startup, and Slurm queue time are excluded.
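
A minimal sketch of the paired strict-success check described above. The `Candidate` class and the function names are hypothetical, and the sketch assumes the iPAE value is already on the scale where the 7/31 threshold applies. The point is that all three thresholds are tested on the same candidate, never on per-axis extrema taken across a panel.

```python
from dataclasses import dataclass
from typing import Iterable, Set

IPAE_MAX = 7 / 31      # threshold as written in this log (assumed normalized scale)
PLDDT_MIN = 0.90
SCRMSD_MAX = 1.5       # angstroms


@dataclass(frozen=True)
class Candidate:
    """One paired candidate row (hypothetical container)."""
    sequence: str
    ipae: float
    plddt: float
    binder_scrmsd: float


def is_strict_success(c: Candidate) -> bool:
    """All three thresholds must hold for the same candidate."""
    return c.ipae < IPAE_MAX and c.plddt > PLDDT_MIN and c.binder_scrmsd < SCRMSD_MAX


def strict_unique_sequences(candidates: Iterable[Candidate]) -> Set[str]:
    """Unique sequences among strict successes; duplicates count once."""
    return {c.sequence for c in candidates if is_strict_success(c)}
```

A panel where one candidate has a passing iPAE but a failing pLDDT and another has the reverse yields zero strict successes under this check, even though the best value on each axis would pass.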

Current Run State

Adaptive v3 job

Job	Target	Walltime	GPUs	State
8036532	CD45	24H	4 H100	RUNNING

GPU 0 runs Qwen/vLLM; GPUs 1-3 run Proteina-Complexa children.
Parent policy: puct_quality_marginal, puct_c=2.0.
Proposal lanes: three_lane; proposal selection: supervisor_selected.
Calibration context: off. V8 fatigue warning: disabled. Diverse root seeds: on.
Score cap: off via AR22_PRIMARY_SUCCESS_RATE_CAP=none.

Official baselines kept running

These are clean official Proteina-Complexa baseline jobs, one GPU per target, seed-sweep style.

Job	Target	Baseline
8018247	SC2RBD	MCTS+halluc baseline
8018999	CD45	beam add-on
8019000	BetV1	beam add-on
8019001	HER2_AAV	beam add-on
8020453	CbAgo	beam

Issue And Fix Log

Date	Problem observed	Diagnosis	Update / fix	Residual risk
2026-05-10	v3 analyst calls failed with HTTP 400 in the archive smoke/36H setup.	vLLM rejected prompts: input around 63,489 tokens plus requested 2,048 output tokens exceeded `max_model_len=65,536`. Refiner/Supervisor prompts were smaller, so the run continued without full analyst signal.	Fixed: v3 configs now use `llm.max_model_len=131072`; 24H/36H wrappers read that value into vLLM.	Longer runs can still grow prompts. Need prompt pruning/candidate-panel caps before large multi-day runs.
2026-05-10	CD45 adaptive job plateaued after a strong early MCTS+sequence-hallucination success.	Root MCTS+SH child produced many strict successes quickly, but later MCTS follow-ups spent about 2.5 GPU-h each with zero strict successes. The system over-exploited a branch after it became stale.	Partially fixed: current v3 uses subtree global-marginal quality credit, recent-yield bailout, supervisor-selected variable branch factor, and archive contract to preserve useful fragments without blindly replaying them.	Need mid-run check that archive recombination creates new variants rather than repeatedly retrying one stale template.
2026-05-10	Some failed children evaluated about 260 downstream candidates despite no strict successes.	Important distinction: `generated_pdbs` can be large, but `filter_samples_limit` controls how many generated candidates enter downstream evaluation. Successful root had 258 generated PDBs but only `n_evaluated=32`. Failed follow-ups lacked the filter cap and evaluated 260 candidates.	Fixed: v3 runtime adds `++generation.filter.filter_samples_limit=32` when a proposal omits it. Verifier allows explicit requests only up to `max_filter_samples_limit=64`.	Cap 32 may miss rare-success slow strategies. Explicit 64 remains available for evidence-grounded branches.
2026-05-10	Earlier score cap hid differences among productive children.	Cap made high-yield nodes collapse to the same primary score and moved selection pressure into soft iPAE/pLDDT bonuses.	Fixed: current wrappers export `AR22_PRIMARY_SUCCESS_RATE_CAP=none`. Tree policy uses global-marginal unique/quality deltas for branch Q.	Uncapped rates can overvalue tiny-denominator wins. Current PUCT-quality-marginal adds subtree GPU-hour normalization and recent-yield checks.
2026-05-10	Archive initially risked being only a prompt hint.	A pure archive prompt cannot guarantee the Refiner actually cites or recombines prior successful/near-miss fragments.	Strengthened: proposal schema includes `archive_operation`, `archive_evidence_ids`, and `archive_source_program_ids`. Controller annotates verifier results with archive contract status, forces repair if no launchable archive-grounded proposal exists when archive entries are available, and force-includes one archive-grounded launchable proposal.	The contract enforces grounding, not scientific usefulness. Use `archive_usage_report.json` and child outcomes to audit whether recombination is productive.
2026-05-10	Unproductive or stale jobs were consuming queue/GPU resources.	Some v2_2 jobs had plateaued or had zero strict successes; one v3 run had incomplete analyst signal.	Cleaned: canceled 8013657, 8011230, 8011231, 8011232, and old pending 8034380. Submitted fresh v3 8036532 with corrected context length and filter cap.	Continue monitoring first 2-3 iterations of 8036532 before trusting the setup for broader targets.
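
The context-length failure in the first row comes down to simple arithmetic. The sketch below assumes the server rejects a request whenever prompt tokens plus requested output tokens exceed `max_model_len`, as the diagnosis describes. The function names are hypothetical and the prompt size is the approximate figure from the log.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, max_model_len: int) -> bool:
    """True when prompt plus requested output fits in the model context."""
    return prompt_tokens + max_output_tokens <= max_model_len


def headroom(prompt_tokens: int, max_output_tokens: int, max_model_len: int) -> int:
    """Tokens left over; a negative value means the request would be rejected."""
    return max_model_len - (prompt_tokens + max_output_tokens)


# Figures from the log: about 63,489 prompt tokens plus 2,048 output tokens.
old_ok = fits_context(63_489, 2_048, 65_536)    # 65,537 > 65,536, so rejected
new_ok = fits_context(63_489, 2_048, 131_072)   # fits after the fix
```

The same check could run before each analyst call so that prompt growth in longer runs triggers pruning rather than an HTTP 400.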

Current v3 Setup Checklist

Area	Current setting	Review
Objective	Tree-global strict unique discovery per backend GPU-hour; quality is secondary.	Aligned. Policy code deduplicates success sequences globally before branch credit; quality bonus is bounded below one full extra success.
Strict criterion	iPAE < 7/31, pLDDT > 0.90, binder scRMSD < 1.5 A, paired by redesign index.	Aligned. Metrics parser uses paired list columns when available and records success-quality rows.
PUCT	`puct_quality_marginal`: subtree quality-weighted global-marginal successes / subtree GPU-hours + recent-yield term + UCB exploration.	Reasonable for the current objective. It no longer optimizes only for the best child.
Exploration width	5 deterministic diverse root seeds, then up to 3 executor lanes per iteration.	Watch. Max children 3 matches available executor GPUs. Supervisor-selected mode avoids forcing weak proposals, but can reduce width if the LLM is too conservative.
Compute budget	child timeout 180 min; batch/max_batch 16; expanded work default 512, hard cap 1024; filter default 32, explicit max 64.	Aligned with the latest plan and avoids the 260-eval blind spot.
Archive	Run-local StrategyArchive with success templates, near-misses, failure boundaries, and diversity probes.	More than prompting: fields are schema-validated and controller-enforced after verifier.
Analyst evidence	EvidencePack plus structure dossier, proposal history, family outcomes, metrics, failure modes, manifest bounds.	Adequate but incomplete. Bottleneck taxonomy exists through metrics/failure modes, but sequence-vs-structure diagnosis is still not a first-class explicit field.
Calibration	Off for this run.	Intentional. Prior calibration had limited benefit and sometimes injected optimism/anchor effects.
Time accounting	Child Proteina-Complexa runtime only.	Aligned. Eval reports label this explicitly.

Live Verification Notes

Check	Observed evidence	Status
8036532 allocation	Slurm reports `gres/gpu=4` on `della-h20g5`; wrapper prints GPU 0 for vLLM and GPUs 1,2,3 for Complexa.	OK
Context length	vLLM server log reports `Using max model len 131072`; run config records `llm.max_model_len=131072`.	OK
Root seed dispatch	Run state has `next_child_index=5`. Child directories for `child_000`, `child_001`, and `child_002` are active; `child_003`/`child_004` are expected to queue until one of the three executor GPUs frees.	OK
Different seeds	Overrides show `seed=500000`, `seed=500001`, and `seed=500002` for the first three root children. This prevents identical reruns of the same stochastic path.	OK
Filter cap	Overrides for active root children include `++generation.filter.filter_samples_limit=32`. This addresses the previous 260-evaluated-candidate blind spot.	OK
Batch/throughput	Active root children include `++generation.search.max_batch_size=16`; run config uses default batch size 16 and max forward batch size 16.	OK
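
The filter-cap rule verified above (default 32 when a proposal omits the key, explicit values allowed only up to 64) can be sketched as follows. This is an illustration under stated assumptions, not the v3 runtime code: the function name is hypothetical, and overrides are assumed to be plain `key=value` strings with optional leading `+` characters, as in the override shown in the table.

```python
from typing import List

FILTER_KEY = "generation.filter.filter_samples_limit"
DEFAULT_FILTER_LIMIT = 32   # runtime default from this log
MAX_FILTER_LIMIT = 64       # verifier ceiling for explicit requests


def apply_filter_cap(overrides: List[str]) -> List[str]:
    """Return overrides with a valid filter cap, adding the default if missing."""
    for item in overrides:
        key, _, value = item.lstrip("+").partition("=")
        if key == FILTER_KEY:
            limit = int(value)
            if not 1 <= limit <= MAX_FILTER_LIMIT:
                raise ValueError(f"{FILTER_KEY}={limit} is outside 1..{MAX_FILTER_LIMIT}")
            return list(overrides)          # explicit valid value is kept as-is
    return list(overrides) + [f"++{FILTER_KEY}={DEFAULT_FILTER_LIMIT}"]
```

Under this sketch, a proposal that asks for 260 downstream candidates fails before launch instead of after hours of evaluation.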

What To Inspect Mid-Run

Check first iterations of job 8036532: root seeds should launch five distinct strategies, queued across three executor GPUs.
Confirm no analyst HTTP 400s after increasing context length to 131072.
Confirm child overrides include generation.filter.filter_samples_limit=32 unless explicitly set to a valid value up to 64.
Run archive usage report mid-job: archive-grounded proposal rate, launchable archive-grounded rate, selected archive-grounded rate, and recombination-like rate should be nonzero once the archive has entries.
Inspect whether successful branches are mutated/recombined, not simply replayed. A useful archive should preserve strong fragments while changing one major knob or pairing complementary fragments.
Watch stale-branch behavior: if recent descendants from a high-yield branch produce zero global-new successes, the policy should shift parents or include rescue/explore lanes.
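
The archive usage rates listed above could be computed along these lines. The sketch is hedged throughout: `archive_operation` and `archive_evidence_ids` are schema fields named in this log, but the `launchable` and `selected` flags, the `"recombine"` operation value, and the choice to divide every rate by the total proposal count are assumptions made for illustration.

```python
from typing import Dict, List


def archive_usage_rates(proposals: List[dict]) -> Dict[str, float]:
    """Fractions of proposals that are archive-grounded, launchable, selected, recombining."""
    names = (
        "archive_grounded_rate",
        "launchable_archive_grounded_rate",
        "selected_archive_grounded_rate",
        "recombination_like_rate",
    )
    n = len(proposals)
    if n == 0:
        return {name: 0.0 for name in names}

    # Assumption: a proposal is grounded when it cites at least one archive entry.
    grounded = [p for p in proposals if p.get("archive_evidence_ids")]
    counts = (
        len(grounded),
        sum(1 for p in grounded if p.get("launchable")),
        sum(1 for p in grounded if p.get("selected")),
        sum(1 for p in grounded if p.get("archive_operation") == "recombine"),
    )
    return {name: count / n for name, count in zip(names, counts)}
```

All four rates staying at zero after the archive has entries would indicate that the contract is not reaching the proposals.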

Constraint Inventory

Layer	Constraint	Purpose	Review
Objective	Strict paired success only: iPAE < 7/31, pLDDT > 0.90, binder scRMSD < 1.5 A.	Keep benchmark aligned with Proteina-Complexa paper criterion.	Right strength. Thresholds must stay immutable.
Score cap	`AR22_PRIMARY_SUCCESS_RATE_CAP=none`.	Let productive branches keep receiving credit instead of collapsing at cap=2.	Right direction, but needs stale-branch controls below.
Bad-branch control	`puct_quality_marginal` gives duplicates zero global-new credit, normalizes by subtree GPU-hours, adds a recent-yield bonus, and halves Q when a subtree has at least 3 descendants with zero recent quality.	Good branches can continue; stale branches lose priority without hard-banning a whole family.	Partial fix. It is not a hard stop. If repeated zero-yield parameter-similar retries persist, add a parameter-bin stale guard.
PUCT exploration	`puct_c=2.0`, UCB denominator `1 + n_child`, diverse-family backup for the final slot.	Balance exploit with less-expanded parents and prevent all slots from one family.	Reasonable for current pilot. Tune after observing 8036532 first iterations.
Branch width	`max_agent_children=3` because 3 executor GPUs are available.	Use all executor GPUs while keeping one GPU for vLLM.	Hardware-bound. Not a scientific ideal; broader trees require more executor GPUs or async scheduling.
Root exploration	5 deterministic diverse root seeds; first 3 run immediately and remaining 2 queue.	Force broad starting coverage before LLM-controlled search.	Right strength for avoiding cold-start myopia.
Compute size	batch/default max batch 16; active path max 64; expanded product warning above 512 and hard block above 1024.	Allow stronger sampling while preventing OOM/runaway children.	Mostly right. Hard cap may restrict very deep MCTS, but appropriate for 24H validation.
Child runtime	Per-child timeout/expected maximum 180 min.	Let expensive but plausible MCTS/hallucination nodes finish.	Right for current smoke. Bad long children are handled by GPU-hour-normalized Q.
Downstream evaluation cap	Runtime default `filter_samples_limit=32`; explicit proposal max 64.	Prevent 260-candidate downstream evaluation failures from consuming hours.	Conservative. Good for avoiding waste; may miss rare-success strategies unless explicit 64 is justified.
Archive contract	When archive entries exist, require/force at least one launchable archive-grounded proposal with valid archive IDs/source IDs.	Prevent forgetting useful successful/near-miss fragments; enable recombination.	Useful but strong. Could force archive grounding from a low-quality archive; monitor archive usage and outcomes.
Verifier	Unknown config paths, target/threshold/evaluator/model-weight edits, unsafe guidance/nsteps/save-intermediate edits, missing rollback, budget overflow are rejected.	Keep LLM proposals executable and benchmark-safe.	Right strength. These are safety/validity constraints, not search heuristics.
Fatigue	V8 fatigue warning disabled.	Avoid false-negative family bans such as beam-search being suppressed before the right parameter regime is tried.	Weak against repeated bad params. Prefer future parameter-bin stale guard over family-level ban.
Calibration	Calibration context disabled.	Avoid bias/anchor feedback from previously miscalibrated predictions.	Right for this comparison.
LLM context	`max_model_len=131072`, but no robust evidence pruning yet.	Avoid immediate HTTP 400 from long prompts.	Too weak long-term. Need prompt pruning before longer runs.
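
A rough sketch of how the bad-branch and exploration rows above combine into one parent score. It follows the pieces stated in this log (GPU-hour-normalized subtree credit, a recent-yield term, Q halved at 3 or more zero-quality recent descendants, `puct_c=2.0`, and a `1 + n_child` denominator). The recent-yield weight, the uniform prior, the square-root visit term, and all names are assumptions, not the policy code.

```python
import math
from dataclasses import dataclass


@dataclass
class ParentStats:
    """Per-parent subtree statistics (hypothetical container)."""
    quality_weighted_new_successes: float   # global-marginal, duplicates excluded
    subtree_gpu_hours: float
    recent_yield: float                     # recent global-new quality signal
    recent_zero_quality_descendants: int
    n_child: int


def branch_q(s: ParentStats, recent_weight: float = 0.5, stale_threshold: int = 3) -> float:
    """GPU-hour-normalized credit plus recent-yield bonus, halved when stale."""
    rate = s.quality_weighted_new_successes / s.subtree_gpu_hours if s.subtree_gpu_hours > 0 else 0.0
    q = rate + recent_weight * s.recent_yield
    if s.recent_zero_quality_descendants >= stale_threshold:
        q *= 0.5
    return q


def puct_score(s: ParentStats, total_visits: int, puct_c: float = 2.0, prior: float = 1.0) -> float:
    """Exploit term plus a UCB-style exploration term with a 1 + n_child denominator."""
    explore = puct_c * prior * math.sqrt(max(total_visits, 1)) / (1 + s.n_child)
    return branch_q(s) + explore
```

With these assumptions a stale parent is not banned: its Q is halved, and it can still be chosen when the exploration term of its competitors shrinks.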

V4 Phase: Real Complexa Adapter And Runaway Guard

Component	Implemented behavior	Validation
Complexa adapter	`ComplexaChildExecutor` renders a compiled V4 proposal into an official Proteina-Complexa command, writes `complexa_run_spec.json`, runs inside the official repo environment, and summarizes combined CSV artifacts back into `ChildEvidenceSummary`.	Added in `autoresearch_v4/complexa_adapter.py`.
Hard runaway guard	Every child is rejected before launch if `filter_samples_limit > 32`, `max_batch_size > 16`, `nsamples > 32`, `expanded_work > 1024`, or requested timeout exceeds 180 minutes.	Tested: explicit `filter_samples_limit=260` and oversized beam products are blocked before execution.
260-candidate failure prevention	V4 now treats large downstream evaluation panels as an execution-policy violation, not only a bad outcome observed after GPU time is spent. Default child panels stay at 32 evaluated candidates unless a future policy deliberately changes the cap.	Aligned with the user objective: avoid spending hours on one zero-success setup.
Official repo execution	The run spec records the raw `complexa design ...` command. The executor launches through a shell wrapper that sources `.venv/bin/activate` and `env.sh` inside the official Proteina-Complexa repo.	Mirrors the v2.2 local executor environment pattern.
GPU pool executor	`ComplexaGpuPoolExecutor` assigns each concurrent child to one worker GPU through a queue, so four in-flight CD45 children use GPUs 0,1,2,3 instead of colliding on one device.	Tested: dry-run summary records the assigned GPU id.
Global strict registry	The pool executor keeps a thread-safe set of strict success sequences from completed children. Later children are summarized against that set so `n_global_new_strict_success` is tree-global, not child-local.	Tested: a known strict sequence is counted as strict success but not global-new.
8H CD45 smoke wrapper	`slurm/v4_cd45_real_4gpu_8h.slurm` launches a scheduler-only V4 real campaign on official Proteina-Complexa with 4 H100 workers. The wrapper now accepts env overrides for short pre-maintenance smoke runs.	Submitted: `8117623` exposed missing diverse-beam `n_branch`/`beam_width`, and `8118120` exposed a wrong MCTS key (`n_sim` instead of the official `n_simulations`). Both were canceled after diagnosis. The corrected 8H job `8118446` was canceled because maintenance makes >6H scheduling uninformative. The 2H pre-maintenance smoke `8118485` was submitted with `MAX_CHILDREN=4` and `TIMEOUT_MINUTES=60`, then canceled after the run-directory collision fix. The clean post-fix smoke `8119328` was submitted with the same short limits. In its first minutes, child specs and official generate logs confirmed that diverse-beam uses `beam_width/n_branch/diversity_weight` and MCTS uses the official `n_simulations=32`; all four bootstrap children had `filter_samples_limit=32`, `max_batch_size=16`, and unique seeds. After about 6 minutes, the official generate logs showed no traceback or config error: best-of-N generation had finished, beam and diverse-beam were scoring samples, and MCTS had reached simulation 28/32.
Template registry audit	The V4 template registry was checked against the official Proteina-Complexa config/search source. Supported active families are now `beam_search`, `best_of_n`, `diverse_beam`, `mcts`, `mcts_then_hallucination`, and `fk_steering`.	Fixed: unsupported `stochastic-beam-search` was removed from templates and async backup scheduling. MCTS knobs now map to official `n_simulations`/`exploration_constant`.
Run-directory collision fix	V4 official `run_name` now includes a sanitized run slug from the output directory, so repeated CD45 smoke jobs with the same `child_id`/`template_id` do not reuse the same Proteina-Complexa `evaluation_results` path.	Fixed: added regression coverage that two output roots produce distinct run names and evaluation directories. The canceled `8118485` should be treated as a launch/config smoke, not a clean result-count smoke, because it started before this fix; clean post-fix 2H smoke `8119328` was submitted; its child specs now include job-scoped names such as `v4_05_CD45_8119328_child_000_beam_search`.
V4 explicit TemplateRegistry	The template map is now wrapped by `TemplateRegistry` and exposed as `DEFAULT_REGISTRY`. The compiler, verifier, and LLM proposer all route template lookup through the same registry instead of separate raw-dict assumptions.	Implemented: keeps family names and official override paths centralized.
V4 archive/verifier closure	Implemented first-class `StrategyArchive`, `ProductArchive`, and `FailureArchive`. The campaign runner now updates all three after each child completion and records archive counts in the report validation block. Added deterministic `verify_compiled_proposal` before executor submission: it checks template-owned paths, duplicate patch paths, target/evaluator/threshold invariants, and the execution-policy budget guard.	Validated: V4 tests now cover archive splitting, stale-bin reset after success, verifier rejection of raw invariant paths, and real-campaign dry smoke with archive counts.
V4 optional LLM flag	`real_campaign.py` now exposes `--use-llm-proposer`, `--llm-model`, and `--llm-max-tokens`. The LLM proposer remains constrained to typed ActionCards; backend config paths still belong to `StrategyCompiler`.	Implemented: default remains deterministic scheduler-only, so current smoke jobs are not affected unless the flag is explicitly set.
V4 evidence-tool closure	Added live `TargetSpec` routing so executor summaries can pass target chain, binder chain, and known hotspot residues into evidence tools. Added sequence diversity summaries: exact cluster id, low-complexity fraction, hydrophobic/charged fractions, and cysteine count. `ChildEvidenceSummary` now includes `contact_mode_diversity` plus sanity counts for severe clashes and missing chains.	Validated: tests cover hotspot target spec propagation, sequence cluster/composition evidence, contact-mode diversity, and missing-chain sanity signals.
V4 archive-to-LLM/tool context	The optional ActionCard proposer now receives compact `ProductArchive`, `FailureArchive`, stale-bin, scheduler-hint, and optional paper-snippet fields. The deterministic scheduler can also consume FailureArchive/ProductArchive directly instead of only storing them.	Validated: tests cover prompt product/failure snippets and scheduler decisions from FailureArchive.
V4 contract audit	Added `autoresearch_v4.contract_audit`. Real campaign reports now include `contract_audit.ok/errors/warnings`. The audit checks bootstrap order, ActionCard/proposal consistency, verifier launchability, unique patch paths and seeds, strict/global count invariants, and completed evidence-tool fields including target spec, sequence cluster ids, duplicate status, and strict-success sequences.	Validated: dry real-campaign report at `research/autoresearch_v4_smokes/contract_audit_dry/v4_real_campaign_report.json` has `contract_audit.ok=true` with no errors/warnings.
V4 name/path audit	Found and fixed three mismatch-class bugs: removed unsupported `diverse-beam-search` from official V4 templates; changed compiled seed override from ineffective `generation.seed` to root-level `seed`; added explicit hard-target `TargetSpec` entries for official hotspots/chains so non-A targets do not silently fall back to chain A/no hotspot.	Validated: V4 tests now pass `51/51`. Dry real-campaign reports for CD45, BetV1, and SC2RBD have `contract_audit.ok=true`; emitted specs contain `++seed=...`, explicit `nsamples`, and official algorithms only: beam-search, best-of-N, MCTS, MCTS+hallucination, FK-steering.
V4 real smoke submissions	Submitted two maintenance-limited 3.5H real GPU smokes after the patch: `8121300` CD45 and `8121299` BetV1. Both use official Proteina-Complexa, 4 H100, `MAX_CHILDREN=8`, `MAX_IN_FLIGHT=4`, `TIMEOUT_MINUTES=60`.	Pending/running check needed: first post-start check should inspect child specs and `v4_real_campaign_report.json` for `contract_audit.ok=true`.
Regression tests	V4 evidence, compiler, scheduler, replay, LLM proposer, async campaign, parameter-bin stale guard, and Complexa adapter tests were run together.	51/51 passed on 2026-05-11.
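
The hard runaway guard in the table above amounts to a pre-launch check against five fixed limits. A minimal sketch follows, using the limits stated in this log; the class and function names are hypothetical and this is not the code in `autoresearch_v4`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExecutionPolicy:
    """Per-child hard limits as listed in this log."""
    filter_samples_limit: int = 32
    max_batch_size: int = 16
    nsamples: int = 32
    expanded_work: int = 1024
    timeout_minutes: int = 180


@dataclass(frozen=True)
class ChildRequest:
    """Resource-relevant fields of one compiled child (hypothetical shape)."""
    filter_samples_limit: int
    max_batch_size: int
    nsamples: int
    expanded_work: int
    timeout_minutes: int


def policy_violations(req: ChildRequest, policy: ExecutionPolicy = ExecutionPolicy()) -> List[str]:
    """Return one message per exceeded limit; an empty list means launchable."""
    fields = ("filter_samples_limit", "max_batch_size", "nsamples", "expanded_work", "timeout_minutes")
    return [
        f"{name}={getattr(req, name)} exceeds {getattr(policy, name)}"
        for name in fields
        if getattr(req, name) > getattr(policy, name)
    ]
```

Returning every violation, not only the first, keeps the rejection message useful when a proposal breaks several limits at once.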

Known Limitations

Prompt length is patched by increasing context length, not fundamentally solved. Evidence pruning is still needed.
Archive weighting is hand-designed; there is not yet a learned prior over archive entries.
The archive contract validates citation/grounding, not causal usefulness.
Target-specific generalization is still unresolved; current v3 validation is CD45-first.
Public results.html intentionally hides internal ranking mechanics, so this file is the process trace for manuscript analysis.

Private HTML Canonical Source

Path	Role	Status on 2026-05-11
`github_repos/website/minkyujeon.github.io/projects/autoresearch-v2/private/internal-process-log.html`	Canonical private process log. This file is tracked in the website repository and should be edited first.	Canonical. Last git-tracked private-log update before this edit was commit `854aea7`.
`/scratch/gpfs/ZHONGE/mj7341/research/autoresearch_private_html/autoresearch_internal_process_log.html`	Scratch mirror/export copy for local private viewing and backup.	Mirror. `diff -q` showed both files were identical before the V4 update; this mirror should be refreshed from the canonical file after every private-log edit.

V4 Design Choice

V4 should be archive-guided adaptive test-time optimization for de novo binder discovery campaigns. The design target is not "find the single best child". The design target is to maximize cumulative strict global-new successful binder sequences over a fixed Proteina-Complexa compute budget, with binder quality as a secondary ranking signal.

Question	V4 answer	Reason
What is a node?	A node is a strategy episode: one compiled Proteina-Complexa run producing a panel of candidates, strict successes, failures, and quality/diversity measurements.	De novo discovery returns a candidate set, not a single scalar. The tree should value every globally new strict success produced anywhere in the campaign.
What does the LLM produce?	A compact typed `ActionCard`, not raw config patches. Example fields: `lane`, `template_id`, `archive_source_ids`, `mutation_intent`, `bottleneck_hypothesis`, and `falsification_rule`.	v2/v3 showed that raw JSON config patches are brittle: long prompts, clipped completions, path hallucination, and budget mismatch. Compact cards keep the LLM focused on scientific hypotheses.
Who creates executable configs?	A deterministic `StrategyCompiler` maps ActionCards to manifest-valid config patches from a versioned template registry.	This preserves flexibility while removing most LLM-induced launch failures. The compiler owns exact config paths, seed assignment, batch/filter limits, budget product checks, and target invariants.
What is stored in the archive?	`StrategyArchive` for template/bin performance, `ProductArchive` for sequences/structures/quality, and `FailureArchive` for zero-yield bins, OOM/timeout bins, near-misses, and stale branches.	The archive is a residual memory of useful stepping stones. It prevents forgetting good fragments while still allowing recombination and exploration.
How does V4 avoid only exploiting one good branch?	The scheduler uses lanes: `exploit`, `local_mutate`, `recombine`, `explore`, `rescue`, and `diagnostic_self_vs_mpnn`.	A good branch should be mutated and recombined, not blindly replayed. Bad exact parameter bins should be retired without banning the whole family.
Are the old 5 root seeds still valid?	Yes, but they should become bootstrap ActionCards generated from the same registry. Suggested roots: official beam search, best-of-N, diverse beam, MCTS plus sequence hallucination, and fk-steering. Stochastic beam can be an optional sixth seed.	The idea of broad initial coverage is still correct. V4 should express it through the same typed machinery used after iteration 0.
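
To make the ActionCard/compiler split above concrete, here is a small sketch. The `ActionCard` fields and lane names come from the table. Everything else is an assumption: the `CompiledProposal` fields, the `TEMPLATE_KNOBS` contents, and the use of short knob names in place of official config paths, which the real registry would own.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

LANES = ("exploit", "local_mutate", "recombine", "explore", "rescue", "diagnostic_self_vs_mpnn")

# Hypothetical registry contents; short knob names stand in for official paths.
TEMPLATE_KNOBS: Dict[str, Dict[str, int]] = {
    "best_of_n": {},
    "mcts": {"n_simulations": 32},
}


@dataclass(frozen=True)
class ActionCard:
    """What the LLM is allowed to write: a hypothesis, not config paths."""
    lane: str
    template_id: str
    archive_source_ids: Tuple[str, ...]
    mutation_intent: str
    bottleneck_hypothesis: str
    falsification_rule: str


@dataclass(frozen=True)
class CompiledProposal:
    overrides: List[str]
    template_knobs: Dict[str, int]


def compile_card(card: ActionCard, seed: int) -> CompiledProposal:
    """Deterministically map a card to overrides; reject unknown lanes or templates."""
    if card.lane not in LANES:
        raise ValueError(f"unknown lane: {card.lane}")
    if card.template_id not in TEMPLATE_KNOBS:
        raise ValueError(f"unknown template: {card.template_id}")
    overrides = [
        f"++seed={seed}",                                  # root-level seed
        "++generation.filter.filter_samples_limit=32",     # compiler-owned cap
        "++generation.search.max_batch_size=16",
    ]
    return CompiledProposal(overrides, dict(TEMPLATE_KNOBS[card.template_id]))
```

Because the card carries no config paths, a hallucinated path cannot reach the executor; the worst an invalid card can do is fail compilation.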

V4 vs Bayesian Optimization

Aspect	Bayesian optimization	V4 campaign controller
Search object	A mostly static hyperparameter vector `x`.	A strategy episode with lineage, archive sources, candidate panel, strict successes, failure modes, and target-specific diagnostics.
Objective	Usually scalar `f(x)`, often assumed smooth enough for a surrogate.	State-dependent campaign utility: global-new strict successes per backend GPU-hour plus secondary quality/diversity. A repeated sequence has lower or zero marginal value even if its local metrics are good.
Trial output	Often one scalar, optionally constraints.	A rich panel: per-candidate sequence, iPAE, pLDDT, binder scRMSD, ipTM, duplicates, generated/evaluated counts, timeout/OOM, and near-miss type.
Action type	Select another point in parameter space.	Choose a lane and action: mutate a successful strategy, recombine archive fragments, rescue a near-miss, retire a stale parameter bin, or run a diagnostic.
Role of priors	Kernel/acquisition assumptions over numeric/categorical variables.	Mechanistic and empirical priors from Proteina-Complexa, target geometry, sequence/structure bottlenecks, and archive history.
Where BO may still fit	As the entire optimizer.	As a subroutine inside one template family, for example tuning MCTS or beam parameters after the scheduler chooses that lane.

V4 vs SCORE

Aspect	SCORE-style tree search	Autoresearch V4
Artifact being improved	Executable code or method implementation for a benchmarked empirical software problem.	Strategy episodes for de novo binder discovery using Proteina-Complexa as the backend evaluator/generator.
LLM mutation	The LLM semantically rewrites code or method descriptions, which can create abrupt score jumps.	The LLM proposes compact scientific ActionCards. A deterministic compiler translates them into valid config variants. Optional later phase can add product-level sequence/structure mutation.
Objective	Find a high-scoring program or method for the task.	Maximize a campaign's cumulative global-new strict binder discoveries over fixed compute. The best single node is not enough.
Why the difference is expected	Empirical software tasks reward one final solution artifact.	De novo discovery rewards a useful panel of unique candidate proteins. A long tail of small productive nodes can matter more than the best child.
What V4 borrows	Tree search over semantically meaningful ideas, problem-specific advice, and empirical validation.	Those ideas are retained, but the mutable object, objective, and evaluator are changed to match protein design campaigns.

V4 Concrete Implementation Plan

Scaffold. Create autoresearch_v4 from the stable v3 codebase, but replace the agent loop rather than layering more prompt fixes onto v3.
Schemas. Add typed models: TargetSpec, StrategyGenome, ParameterBin, ActionCard, CompiledProposal, CampaignState, StrategyArchiveEntry, ProductArchiveEntry, and FailureArchiveEntry.
Template registry. Implement versioned templates for best_of_n, beam_search, stochastic_beam_search, diverse_beam_search, fk_steering, mcts, beam_then_hallucination, mcts_then_hallucination, and best_of_n_then_hallucination. Add fk-steering knobs to the manifest because fk-steering is an official Proteina-Complexa family.
StrategyCompiler. Convert ActionCards to deterministic config patches. Enforce immutable target/evaluator/threshold rules, seed uniqueness, filter/sample/batch bounds, expanded-work cap, and child timeout.
LLM-free scheduler first. Implement a deterministic archive scheduler that can run without LLMs. This is the ablation that answers whether the LLM is necessary.
LLM ActionCard proposer second. Add a compact LLM proposer that chooses lanes and hypotheses using the archive and tool outputs. It must not write raw config paths.
Async executor pool. Move from synchronous iteration barriers toward "GPU frees, schedule next ActionCard". This better matches discovery campaigns and avoids waiting for one slow child.
Auditor/reporting. Keep strict paired success, global sequence dedup, quality-weighted marginal utility, duplicate panels, stale-bin detection, and self-vs-MPNN bottleneck labels.
Experiments. Run CD45 12H smoke, then CD45 24H against v3 and official seeded baseline. Then add BetV1 and one hard target such as CbAgo or HER2_AAV.
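
The "GPU frees, schedule next ActionCard" step in the plan above can be sketched with one worker thread per GPU. This is an illustration only: `next_card` and `run_child` are hypothetical callables standing in for the scheduler and the Complexa executor, and a real pool would also handle timeouts and failures.

```python
import threading
from typing import Any, Callable, List, Optional, Tuple


def run_async_pool(
    gpu_ids: List[int],
    next_card: Callable[[List[Tuple[Any, int, Any]]], Optional[Any]],
    run_child: Callable[[Any, int], Any],
) -> List[Tuple[Any, int, Any]]:
    """Each GPU asks for a new card as soon as it is free; no iteration barrier."""
    lock = threading.Lock()
    results: List[Tuple[Any, int, Any]] = []

    def gpu_worker(gpu: int) -> None:
        while True:
            with lock:
                # The scheduler sees everything completed so far.
                card = next_card(results)
            if card is None:
                return                      # nothing left for this GPU
            outcome = run_child(card, gpu)  # long-running child, outside the lock
            with lock:
                results.append((card, gpu, outcome))

    threads = [threading.Thread(target=gpu_worker, args=(g,)) for g in gpu_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

One thread per GPU means two children can never share a device, and a slow child delays only its own GPU.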

Tool Calling For Better Hypotheses

Tool class	Recommended use	Why it helps	Priority
Paper retrieval and local paper binder	Retrieve target-specific literature, Proteina-Complexa method details, binder-design heuristics, and prior failed/successful strategies. Use PDF preprocessing and cite snippets into ActionCards.	Gives the LLM problem-specific advice without letting it change benchmark rules. Useful for target biology, hotspot interpretation, and method priors.	High
Bio.PDB / Biopython	Parse PDB/mmCIF, chains, residues, atoms, hotspot neighborhoods, contact distances, and residue-level composition.	Turns "structure bottleneck" into measurable evidence: hotspot exposure, binder-target contacts, residue interface maps, clashes, and chain parsing.	High
Biotite	Fast NumPy-style sequence/structure analysis, sequence identity, clustering, feature extraction, and database/file utilities.	Useful for candidate-panel deduplication, diversity metrics, and lightweight structural summaries that fit into compact prompts.	High
MDAnalysis	Distance/contact calculations and residue/atom selections, especially if intermediate structures or trajectories are available.	Useful for interface contact diagnostics and native-contact style summaries. More important if we later store trajectories or multiple conformations.	Medium
Self-vs-MPNN diagnostic runner	Run self and MPNN-style scoring on selected near-miss candidates to label bottlenecks.	If MPNN improves while self is weak, sequence is likely bottleneck and hallucination/sequence refinement is appropriate. If self is good but iPAE remains high, search/refine the backbone/interface instead.	High
Candidate clustering/diversity tool	Cluster candidates by sequence identity, strict success status, interface-contact fingerprint, and quality bins.	Prevents the scheduler from counting duplicate rediscoveries as progress and helps choose archive entries for recombination.	High
ColabDesign / ProteinMPNN-style refinement lane	Optional later product-level refinement of near-miss candidates, not part of the first V4 core.	Promising for sequence bottlenecks, but it changes the mutable object from strategy to sequence/product. Add after the strategy-level V4 ablation is clean.	Later

V4 Tool References Checked

Biopython Bio.PDB documentation: PDB/mmCIF parsing and Structure object access for atomic data.
MDAnalysis documentation: atom selections, distance arrays, contact matrices, and native contact analysis.
Biotite papers/documentation: sequence and structural bioinformatics library with NumPy-based data handling.
ColabDesign repository: optional sequence/design refinement ecosystem for later product-level mutation lanes.
Darwin Godel Machine and SCORE: archive/open-ended search and semantic mutation are useful analogies, but V4's mutable object and objective must be de novo campaign-specific.

V4 Evidence Tool Implementation Update - 2026-05-11

Implementation started in github_repos/Autoresearch_Denovo/autoresearch_v4. The first shipped layer is not a full controller. It is the candidate-level evidence tooling needed before ActionCards, StrategyCompiler, and ArchiveScheduler are wired together.

Component	Status	Purpose	Smoke expectation
`complexa_csv.read_candidate_metrics`	Implemented	Reads the Proteina-Complexa combined CSV and keeps paired iPAE/pLDDT/binder-scRMSD rows instead of mixing axis-wise extrema.	Strict success labels use iPAE < 7/31, pLDDT > 0.90, binder scRMSD < 1.5 A.
`InterfaceBottleneckAnalyzer`	Implemented	Summarizes target-binder contact count, hotspot contacts, interface residues, contact fingerprint, and severe clash count. Bio.PDB is used when available; a fallback PDB parser keeps the smoke portable.	No-interface iPAE failure should map to `ipae_no_interface`; plausible hotspot-contact near miss should map to `ipae_plausible_interface`.
`SequenceDiversityAnalyzer`	Implemented	Marks global duplicates and within-child duplicates so campaign reward can focus on global-new strict successes.	A strict-success sequence already in the known set should map to `strict_success_duplicate` and contribute zero global-new strict successes.
`BottleneckClassifier`	Implemented	Rule-based first-pass labels: strict success, duplicate success, no-interface iPAE failure, wrong-hotspot iPAE failure, plausible-interface iPAE near miss, pLDDT scaffold failure, and binder-RMSD failure.	Produces compact recommended lanes such as `local_mutate`, `recombine`, `rescue`, `explore`, `mcts`, and `fk_steering`.
`ChildEvidenceSummary`	Implemented	Compacts candidate-level facts into prompt/archive-ready JSON. This is the intended future input to ArchiveScheduler and optional LLM ActionCard proposer.	LLM should see bottleneck summaries, not raw PDBs or large CSV tables.
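
Sketch of why paired rows matter for the strict label. This is not the code in `complexa_csv.read_candidate_metrics`; `CandidateRow`, `is_strict_success`, and the two helper functions are hypothetical names, and iPAE is assumed to be stored on the normalized scale so the cutoff is 7/31 (about 0.226). Comparisons are inclusive for iPAE and pLDDT, as listed in the table above.

```python
from dataclasses import dataclass

IPAE_MAX = 7 / 31    # normalized iPAE cutoff (assumption: values are normalized)
PLDDT_MIN = 0.90
SCRMSD_MAX = 1.5     # Angstrom, strict inequality


@dataclass(frozen=True)
class CandidateRow:
    """One redesign row: all three metrics come from the same candidate."""
    ipae: float
    plddt: float
    binder_scrmsd: float


def is_strict_success(row):
    return (
        row.ipae <= IPAE_MAX
        and row.plddt >= PLDDT_MIN
        and row.binder_scrmsd < SCRMSD_MAX
    )


def any_paired_success(rows):
    """Correct: a success needs one row that passes all three cutoffs."""
    return any(is_strict_success(r) for r in rows)


def axiswise_extrema_success(rows):
    """The mistake to avoid: best value per axis, taken from different rows."""
    best = CandidateRow(
        ipae=min(r.ipae for r in rows),
        plddt=max(r.plddt for r in rows),
        binder_scrmsd=min(r.binder_scrmsd for r in rows),
    )
    return is_strict_success(best)


# Row 0 fails pLDDT, row 1 fails iPAE; no single candidate passes.
rows = [CandidateRow(0.15, 0.85, 1.0), CandidateRow(0.40, 0.95, 1.0)]
print(any_paired_success(rows), axiswise_extrema_success(rows))
```

Here the paired check reports no success while the axis-wise version reports one, which is the false positive the paired reader is meant to prevent.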

V4 Evidence Tool Smoke Results - 2026-05-11

Smoke	Input pattern	Expected label	Observed	Status
`no_interface`	Good pLDDT/RMSD but high iPAE and target-binder chains far apart.	`ipae_no_interface`	`{"ipae_no_interface": 1}`; lanes `explore`, `mcts`, `fk_steering`.	Passed local smoke
`plausible_interface`	Good pLDDT/RMSD, hotspot contacts present, iPAE just above strict cutoff.	`ipae_plausible_interface`	`{"ipae_plausible_interface": 1}`; lanes `local_mutate`, `recombine`, `rescue`.	Passed local smoke
`duplicate_success`	Strict-success metrics but sequence already exists in global known set.	`strict_success_duplicate`, zero global-new strict successes.	`{"strict_success_duplicate": 1}`; `n_global_new_strict_success=0`.	Passed local smoke
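
Sketch of the label-to-lane routing exercised by these three smokes. The two iPAE lane lists and the `strict_success_duplicate` label are taken from the table above; the decision rules (zero contacts, any hotspot contact), the function names, and the label strings marked hypothetical are assumptions, not the shipped `BottleneckClassifier`.

```python
IPAE_MAX = 7 / 31  # normalized iPAE cutoff (assumption: values are normalized)

# Lane lists for the two iPAE labels are the ones observed in the smoke report.
RECOMMENDED_LANES = {
    "ipae_no_interface": ("explore", "mcts", "fk_steering"),
    "ipae_plausible_interface": ("local_mutate", "recombine", "rescue"),
}


def classify(sequence, known_sequences, strict_success, ipae,
             n_contacts, n_hotspot_contacts):
    """Return one bottleneck label for a single candidate."""
    if strict_success:
        if sequence in known_sequences:
            return "strict_success_duplicate"
        return "strict_success"          # hypothetical label string
    if ipae > IPAE_MAX:
        if n_contacts == 0:              # assumed rule for "no interface"
            return "ipae_no_interface"
        if n_hotspot_contacts > 0:       # assumed rule for "plausible"
            return "ipae_plausible_interface"
        return "ipae_wrong_hotspot"      # hypothetical label string
    return "other_failure"               # hypothetical catch-all (pLDDT, RMSD)


def count_global_new_strict_success(labels):
    """Duplicates of known sequences contribute zero."""
    return sum(label == "strict_success" for label in labels)


labels = [
    classify("AAAA", set(), False, 0.50, 0, 0),
    classify("CCCC", set(), False, 0.24, 40, 6),
    classify("MKVL", {"MKVL"}, True, 0.10, 40, 6),
]
print(labels, count_global_new_strict_success(labels))
```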

Local smoke output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/local_20260511/smoke_report.json. Unit tests: tests/test_v4_evidence_tools.py, 2/2 passed.

Cluster smoke completed as Slurm job 8098962 with exit code 0:0 using slurm/v4_evidence_smoke_cpu.slurm. Output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/8098962/smoke_report.json.

V4 Next Implementation Steps

Add ActionCard and StrategyCompiler schemas. The LLM should emit compact card fields, not config patches.
Add template registry entries for official families: beam, diverse-beam, stochastic-beam, best-of-N, MCTS, fk-steering, and hallucination-composed variants.
Wire ChildEvidenceSummary into StrategyArchive/ProductArchive/FailureArchive.
Build an LLM-free ArchiveScheduler ablation first. Then add optional LLM ActionCard proposer behind a flag.
Run a real completed-child replay smoke: feed evidence tools a completed v2/v3 child directory and verify bottleneck labels match known run behavior.

V4 Clean-Code Refactor And Compiler Update - 2026-05-11

Change	Reason	Verification
Refactored bottleneck diagnosis.	Moved lane mapping and ranking into small top-level tables. `diagnose_candidate` now reads as metrics/interface -> bottleneck -> recommended lanes.	Unit tests cover iPAE no-interface, wrong-hotspot, plausible-interface, pLDDT failure, RMSD failure, strict success, and duplicate success.
Refactored smoke runner.	Replaced if/elif scenario generation with a compact `SmokeCase` table. Adding a new smoke is now one row, not a new branch.	Expanded local smoke covers 8 scenarios. All scenarios report `passed=true`.
Added minimal `ActionCard` and `StrategyCompiler`.	Next V4 step: LLM emits compact semantic intent; compiler owns backend paths. This prevents raw config-path generation by the LLM.	Compiler tests verify MCTS+hallucination compilation, default batch/filter knobs, seed insertion, unknown template rejection, and unsupported knob rejection.
Fixed template-specific knob paths.	`beam_width` is not one universal backend path. fk-steering and stochastic-beam need template-specific paths.	Test verifies `fk_steering` uses `generation.search.fk_steering.beam_width`, not the beam-search path.
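
Sketch of template-specific knob resolution with the two rejection cases. Only the `fk_steering` path is taken from this log; the `beam_search` path, the `ActionCardSketch` type, and `compile_card_sketch` are hypothetical stand-ins, since the real `ActionCard` and `StrategyCompiler` signatures are not recorded here.

```python
from dataclasses import dataclass, field

# Semantic knob name -> backend path, per template.
KNOB_PATHS = {
    # Hypothetical path, for illustration only.
    "beam_search": {"beam_width": "generation.search.beam.beam_width"},
    # Path named in the verification column above.
    "fk_steering": {"beam_width": "generation.search.fk_steering.beam_width"},
}


@dataclass
class ActionCardSketch:
    template_id: str
    parameter_overrides: dict = field(default_factory=dict)


def compile_card_sketch(card):
    """Map semantic knobs to backend paths; reject anything unknown."""
    if card.template_id not in KNOB_PATHS:
        raise ValueError(f"unknown template: {card.template_id!r}")
    paths = KNOB_PATHS[card.template_id]
    patch = {}
    for knob, value in card.parameter_overrides.items():
        if knob not in paths:
            raise ValueError(
                f"unsupported knob {knob!r} for template {card.template_id!r}"
            )
        patch[paths[knob]] = value
    return patch


print(compile_card_sketch(ActionCardSketch("fk_steering", {"beam_width": 8})))
```

The same semantic knob resolves to a different backend path per template, so the card never needs to carry a path itself.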

Verification after refactor/compiler update: tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py -> 5/5 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.

Expanded local smoke output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/refactor_compiler_20260511/smoke_report.json. Expanded Slurm CPU smoke: job 8099672, state COMPLETED, exit code 0:0.

V4 ArchiveScheduler MVP - 2026-05-11

Piece	Status	Design reason	Verification
`StrategyArchive`	Implemented	Run-local memory over child summaries and source template IDs. This is the minimal archive needed before learned/LLM scheduling.	Scheduler tests build archives from synthetic `ChildEvidenceSummary` objects.
`ArchiveScheduler`	Implemented	LLM-free ablation policy. It first emits the five broad bootstrap families, then routes archive evidence to rescue/diversify/exploit cards.	Tests verify bootstrap order and archive-conditioned choices.
Compiler dedupe fix	Implemented	Template defaults and semantic overrides can touch the same backend path. Compiler now emits unique paths with override values winning.	Regression test verifies `diverse_beam` emits one `diversity_weight` patch with value `0.7`.
Scheduler smoke	Implemented	Verifies end-to-end archive -> ActionCard -> compiled config behavior without LLM or GPU.	Local smoke output: `/scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/scheduler/dedupe_20260511/scheduler_smoke_report.json`. Slurm job `8102340` completed with exit code `0:0`.
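
Sketch of the dedupe rule (unique paths, override wins). The backend paths and the default values are hypothetical; only the final `diversity_weight` value of `0.7` comes from the regression test described above.

```python
def merge_patches(template_defaults, semantic_overrides):
    """Merge (path, value) pairs so each path appears once.

    Overrides are applied after defaults, so an override replaces the
    default value while keeping the path's original position.
    """
    merged = {}
    for path, value in [*template_defaults, *semantic_overrides]:
        merged[path] = value
    return list(merged.items())


# Hypothetical paths and defaults, for illustration only.
DIVERSITY_PATH = "generation.search.diverse_beam.diversity_weight"
defaults = [(DIVERSITY_PATH, 0.5), ("generation.batch_size", 16)]
overrides = [(DIVERSITY_PATH, 0.7)]

patches = merge_patches(defaults, overrides)
print(patches)
```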

The V4 flow is now: evidence tools -> ChildEvidenceSummary -> StrategyArchive -> LLM-free ArchiveScheduler -> compact ActionCard -> deterministic StrategyCompiler. This follows the private V4 plan and still keeps the LLM out of raw config generation.
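
Sketch of the bootstrap step of the LLM-free scheduler: pick the first broad family that the archive has not tried yet. The family names and order are the ones reported by the campaign smoke in this log; the function name and the `pending_templates` argument are hypothetical, and the post-bootstrap rescue/diversify/exploit routing is left out.

```python
BOOTSTRAP_FAMILIES = (
    "beam_search",
    "best_of_n",
    "diverse_beam",
    "mcts_then_hallucination",
    "fk_steering",
)


def next_bootstrap_template(tried_templates, pending_templates=()):
    """Return the first bootstrap family not yet tried or pending.

    Returns None once every family is covered, at which point the
    archive-conditioned routing (not sketched here) would take over.
    """
    seen = set(tried_templates) | set(pending_templates)
    for template_id in BOOTSTRAP_FAMILIES:
        if template_id not in seen:
            return template_id
    return None


# An archive that never ran the MCTS+hallucination family.
archive_templates = ["beam_search", "best_of_n", "diverse_beam", "fk_steering"]
print(next_bootstrap_template(archive_templates))
```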

Verification after scheduler phase: tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py tests/test_v4_scheduler.py -> 10/10 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.

V4 Real-Artifact Replay And LLM ActionCard Phase - 2026-05-11

Check	Result	Why it matters
Real completed-child replay	Passed local replay: `/scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/real_replay/cd45_7694929_final_20260511/real_replay_smoke_report.json`	V4 evidence tools can read actual v2.2 Complexa child artifacts, not only synthetic fixtures. The replay scanned CD45 job `7694929`, loaded real combined CSVs, and produced candidate bottleneck labels plus compiled next ActionCards.
Strict success criterion re-verified	Replay uses paired candidate metrics with `iPAE <= 7/31`, `pLDDT >= 0.90`, and `binder scRMSD < 1.5 A`.	This recovered real successes such as CD45 `child_003`, where old dossier text still mentioned the obsolete 0.18 iPAE threshold. V4 now uses the paper criterion in the replay path.
AF2/refolded PDB path bug	Fixed: `complexa_csv.read_candidate_metrics` now selects the `self_complex_pdb_path_all` entry matching the chosen paired redesign index.	Before this fix, interface diagnostics could combine AF2/self metrics with the generated PDB path. That could mislabel interface bottlenecks. Regression test covers this exact failure mode.
Long replay finding	40-child CD45 replay: all cases had candidate evidence, no duplicate compiled patch paths, and the scheduler repeatedly selected the missing bootstrap lane `mcts_then_hallucination`.	This is expected because the archived v2.2 CD45 run did not try that V4 bootstrap family. The result validates the scheduler's forced broad-family initialization behavior on real run history.
Slurm replay smoke	Completed as job `8105290` with state `COMPLETED`, exit code `0:0`. Output: `/scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/real_replay/8105290/real_replay_smoke_report.json`.	Confirms the real-artifact replay path works under cluster execution, not only in an interactive shell.
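
Sketch of the index pairing behind the AF2/refolded PDB path fix: metrics and structure path are read at the same redesign index. It assumes the per-redesign columns have already been parsed into equal-length lists and that the paired index is the lowest-iPAE redesign; both are assumptions, as is the function name. Only the column name `self_complex_pdb_path_all` comes from this log.

```python
def pick_paired_redesign(ipae_all, plddt_all, scrmsd_all,
                         self_complex_pdb_path_all):
    """Choose one redesign index and read every field at that index."""
    n = len(self_complex_pdb_path_all)
    if n == 0 or not (len(ipae_all) == len(plddt_all) == len(scrmsd_all) == n):
        raise ValueError("per-redesign lists must be non-empty and equal length")
    # Assumed selection rule: lowest iPAE. The real rule may differ.
    idx = min(range(n), key=lambda i: ipae_all[i])
    return {
        "redesign_index": idx,
        "ipae": ipae_all[idx],
        "plddt": plddt_all[idx],
        "binder_scrmsd": scrmsd_all[idx],
        # Same index as the metrics, never a path from another redesign.
        "pdb_path": self_complex_pdb_path_all[idx],
    }


picked = pick_paired_redesign(
    ipae_all=[0.40, 0.20],
    plddt_all=[0.95, 0.91],
    scrmsd_all=[1.0, 1.2],
    self_complex_pdb_path_all=["redesign_0.pdb", "redesign_1.pdb"],
)
print(picked)
```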

V4 LLM ActionCard Proposer - 2026-05-11

Component	Status	Contract	Verification
`LLMActionCardProposer`	Implemented	Calls an LLM client for one compact `ActionCard`. The LLM proposes semantic lane/template/knobs/hypothesis only.	Fake-client tests verify prompt construction, max-token forwarding, and returned card compilation.
Raw config path guard	Implemented	Rejects `config_patch`, `json_patch`, backend overrides, and any output containing `generation.` or `/generation/`.	Regression tests confirm raw backend paths are rejected before compilation.
Compiler validation	Implemented	Every LLM ActionCard is passed through `compile_action_card` during validation, so unsupported semantic knobs fail deterministically.	Regression test rejects an unsupported knob on `beam_search`.
Prompt payload	Implemented	Prompt includes target ID, strict success rules, allowed templates and knobs, archive tail, scheduler hint, and hard rules forbidding backend config edits.	Test confirms archive evidence and template knobs are present in the prompt.
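
Sketch of the raw config path guard, run before compilation. The forbidden keys `config_patch` and `json_patch` and the two forbidden substrings come from the table above; `backend_overrides` as a key name, the JSON output format, and the function name are assumptions.

```python
import json

# "backend_overrides" is a hypothetical key name for "backend overrides".
FORBIDDEN_KEYS = {"config_patch", "json_patch", "backend_overrides"}
FORBIDDEN_SUBSTRINGS = ("generation.", "/generation/")


def check_llm_card_output(raw_output):
    """Parse an LLM reply into a card dict, rejecting raw backend config."""
    for needle in FORBIDDEN_SUBSTRINGS:
        if needle in raw_output:
            raise ValueError(f"raw backend path fragment found: {needle!r}")
    card = json.loads(raw_output)  # malformed JSON also raises ValueError
    if not isinstance(card, dict):
        raise ValueError("ActionCard output must be a JSON object")
    bad_keys = FORBIDDEN_KEYS & card.keys()
    if bad_keys:
        raise ValueError(f"forbidden top-level keys: {sorted(bad_keys)}")
    return card


good = ('{"lane": "rescue", "template_id": "beam_search", '
        '"knobs": {"beam_width": 8}, "hypothesis": "widen search near hotspot"}')
print(check_llm_card_output(good)["template_id"])
```

A plain substring check is deliberately conservative: a hypothesis sentence that happens to end in "generation." would also be rejected, which trades occasional false positives for a simple guarantee.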

Current V4 phase status: evidence tools -> archive -> deterministic scheduler -> optional LLM ActionCard proposer -> deterministic compiler. The LLM is now a hypothesis/mutator layer, not a raw config generator. Verification: current V4 unit suite tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py tests/test_v4_scheduler.py tests/test_v4_replay_smoke.py tests/test_v4_llm_proposer.py -> 17/17 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.

V4 Async Campaign Runner Phase - 2026-05-11

Component	Status	Behavior	Verification
`CampaignRunner`	Implemented skeleton	Backend-light async runner. It schedules compiled ActionCards into up to `max_in_flight` child slots and updates the archive whenever a child returns a `ChildEvidenceSummary`.	Unit tests cover bootstrap reservation, executor failure conversion, optional LLM proposer use after bootstrap, pending duplicate guard, and `max_in_flight=0` clamping.
`max_in_flight`	Defined	Number of child proposals that can be running at once. With a future 3-GPU Complexa adapter, `max_in_flight=3` means one child per executor GPU.	Campaign smoke uses `--max-in-flight 3` and produces 8 submissions / 8 completions.
Distinct broad bootstrap	Implemented	Pending templates are reserved, so simultaneous initial scheduling does not repeatedly choose the same first bootstrap family.	First five submissions are `beam_search`, `best_of_n`, `diverse_beam`, `mcts_then_hallucination`, `fk_steering`.
Pending exact-ActionCard duplicate guard	Implemented	If the same lane/template/knobs/archive-source ActionCard is already pending, the runner uses an async backup exploration card instead. Same template with different source or parameter bin remains allowed.	Regression test calls the guard directly. Local smoke shows a backup `stochastic_beam` when a repeated exact bin would otherwise be pending.
Plateau handling boundary	Partially handled	Other workers can keep exploring while one child runs. Once a plateau/failure child finishes, its evidence can route the next card away from that bin. Mid-run early-stop/cancel is not implemented yet.	Next required phase: real Complexa `ChildExecutor` adapter with runtime monitor and optional early-stop hook.
Slurm campaign smoke	Passed	`slurm/v4_campaign_smoke_cpu.slurm` runs the async campaign smoke without GPU.	Job `8106146`: `COMPLETED`, exit code `0:0`. Output: `/scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/campaign/8106146/campaign_smoke_report.json`.
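
Sketch of the `max_in_flight` slot loop using `asyncio`. It is not the `CampaignRunner` code: `run_campaign_sketch`, the `next_card` and `executor` callables, and the result dict layout are hypothetical, and clamping `max_in_flight=0` to one slot is an assumption about what the clamp does.

```python
import asyncio


async def _run_child(executor, card):
    """Run one child; convert executor exceptions into a failure record."""
    try:
        return {"card": card, "status": "ok", "summary": await executor(card)}
    except Exception as exc:
        return {"card": card, "status": "failed", "error": str(exc)}


async def run_campaign_sketch(next_card, executor, max_in_flight, total_children):
    slots = max(1, max_in_flight)  # assumption: 0 is clamped to one slot
    archive, in_flight, submitted = [], set(), 0
    while submitted < total_children or in_flight:
        # Fill free slots; each new card is chosen from the current archive.
        while submitted < total_children and len(in_flight) < slots:
            card = next_card(archive)
            in_flight.add(asyncio.create_task(_run_child(executor, card)))
            submitted += 1
        # Wake up as soon as any child returns, then update the archive.
        done, in_flight = await asyncio.wait(
            in_flight, return_when=asyncio.FIRST_COMPLETED
        )
        archive.extend(task.result() for task in done)
    return archive


async def _fake_executor(card):
    await asyncio.sleep(0)  # stands in for a Complexa child run
    return {"n_global_new_strict_success": 0}


cards = iter(range(8))
results = asyncio.run(
    run_campaign_sketch(lambda archive: next(cards), _fake_executor, 3, 8)
)
print(len(results))
```

Because the next card is requested only when a slot frees up, evidence from a finished child can steer later submissions while the other slots keep running.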

Verification after async runner phase: V4 unit suite -> 22/22 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed; bash -n slurm/v4_campaign_smoke_cpu.slurm passed.

V4 ParameterBin / Stale-Bin Phase - 2026-05-11

Question	Answer	Implementation	Verification
Are family and hyperparameters separated?	Yes. Family is `template_id`; hyperparameters live in `parameter_overrides`.	`ParameterBin` maps each ActionCard to `(template_id, coarse knob bins)`. This prevents family-level false bans such as banning all beam-search after one low-noise failure.	Tests verify low-noise beam bins match each other, high-noise beam differs, and the same knob region under MCTS differs because the family differs.
How is stale handled?	Stale is exact-bin scoped, not family scoped.	`stale_parameter_bins` marks a bin stale after repeated zero-yield outcomes since the last success. A later global-new strict success resets the counter.	Regression test covers two failures -> stale, success -> reset, one later failure -> not stale.
Does the runner avoid stale bins?	Yes. `CampaignRunner._next_card` checks both pending exact ActionCard signatures and stale parameter bins.	If a card is pending or stale, the runner picks an async backup card from stochastic-beam, MCTS, or diverse-beam.	Campaign test verifies stale best-of-N bin is skipped. Slurm campaign smoke `8107577` completed with exit `0:0`.
Does the LLM see this state?	Yes. The LLM prompt remains structured and compact.	`build_action_card_messages` includes `allowed_templates`, `archive_tail`, `scheduler_hint`, and `stale_parameter_bins`. The LLM still cannot emit raw backend config paths.	Prompt test verifies stale bins are serialized as `{template_id, knobs}`.
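
Sketch of exact-bin stale tracking. The bin shape `(template_id, coarse knob bins)` and the two-failures, reset-on-success behavior follow the table above; the bin edges, the `noise` knob name, the per-bin (rather than campaign-wide) reset, and all class and function names are assumptions.

```python
def coarse_bin(value, edges=(0.25, 0.75)):
    """Map a numeric knob to a coarse label. Edges are hypothetical."""
    if value < edges[0]:
        return "low"
    if value < edges[1]:
        return "mid"
    return "high"


def parameter_bin(template_id, parameter_overrides):
    """Family plus coarse knob region; hashable, so usable as a dict key."""
    knobs = tuple(sorted((k, coarse_bin(v)) for k, v in parameter_overrides.items()))
    return (template_id, knobs)


class StaleBinTracker:
    def __init__(self, stale_after=2):
        self.stale_after = stale_after  # zero-yield runs needed to go stale
        self._zero_yield_streak = {}

    def record(self, bin_key, n_global_new_strict_success):
        if n_global_new_strict_success > 0:
            self._zero_yield_streak[bin_key] = 0  # success resets the streak
        else:
            self._zero_yield_streak[bin_key] = self._zero_yield_streak.get(bin_key, 0) + 1

    def is_stale(self, bin_key):
        return self._zero_yield_streak.get(bin_key, 0) >= self.stale_after


low_beam = parameter_bin("beam_search", {"noise": 0.10})
tracker = StaleBinTracker()
tracker.record(low_beam, 0)
tracker.record(low_beam, 0)
print(tracker.is_stale(low_beam))
```

Keying on the full bin means a stale low-noise beam bin leaves high-noise beam search, and the same knob region under another family, available to the scheduler.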

Verification after ParameterBin phase: V4 unit suite -> 26/26 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed. Latest campaign smoke: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/campaign/8107577/campaign_smoke_report.json.