Autoresearch Internal Process Log
Private local record. Generated/updated 2026-05-11 America/New_York. Do not link from the public website while the manuscript is in preparation.
Current Run State
Adaptive v3 job
job 8036532 CD45 24H 4 H100 RUNNING
- GPU 0 runs Qwen/vLLM; GPUs 1-3 run Proteina-Complexa children.
- Parent policy: puct_quality_marginal, puct_c=2.0.
- Proposal lanes: three_lane; proposal selection: supervisor_selected.
- Calibration context: off. V8 fatigue warning: disabled. Diverse root seeds: on.
- Score cap: off via AR22_PRIMARY_SUCCESS_RATE_CAP=none.
Official baselines kept running
These are clean official Proteina-Complexa baseline jobs, one GPU per target, seed-sweep style.
- 8018247 SC2RBD MCTS+halluc baseline
- 8018999 CD45 beam add-on
- 8019000 BetV1 beam add-on
- 8019001 HER2_AAV beam add-on
- 8020453 CbAgo beam
Issue And Fix Log
| Date | Problem observed | Diagnosis | Update / fix | Residual risk |
|---|---|---|---|---|
| 2026-05-10 | v3 analyst calls failed with HTTP 400 in the archive smoke/36H setup. | vLLM rejected prompts: input around 63,489 tokens plus requested 2,048 output tokens exceeded max_model_len=65,536. Refiner/Supervisor prompts were smaller, so the run continued without full analyst signal. | Fixed: v3 configs now use llm.max_model_len=131072; 24H/36H wrappers read that value into vLLM. | Longer runs can still grow prompts. Need prompt pruning/candidate-panel caps before large multi-day runs. |
| 2026-05-10 | CD45 adaptive job plateaued after a strong early MCTS+sequence-hallucination success. | Root MCTS+SH child produced many strict successes quickly, but later MCTS follow-ups spent about 2.5 GPU-h each with zero strict successes. The system over-exploited a branch after it became stale. | Partially fixed: current v3 uses subtree global-marginal quality credit, recent-yield bailout, supervisor-selected variable branch factor, and archive contract to preserve useful fragments without blindly replaying them. | Need mid-run check that archive recombination creates new variants rather than repeatedly retrying one stale template. |
| 2026-05-10 | Some failed children evaluated about 260 downstream candidates despite no strict successes. | Important distinction: generated_pdbs can be large, but filter_samples_limit controls how many generated candidates enter downstream evaluation. Successful root had 258 generated PDBs but only n_evaluated=32. Failed follow-ups lacked the filter cap and evaluated 260 candidates. | Fixed: v3 runtime adds ++generation.filter.filter_samples_limit=32 when a proposal omits it. Verifier allows explicit requests only up to max_filter_samples_limit=64. | Cap 32 may miss rare-success slow strategies. Explicit 64 remains available for evidence-grounded branches. |
| 2026-05-10 | Earlier score cap hid differences among productive children. | Cap made high-yield nodes collapse to the same primary score and moved selection pressure into soft iPAE/pLDDT bonuses. | Fixed: current wrappers export AR22_PRIMARY_SUCCESS_RATE_CAP=none. Tree policy uses global-marginal unique/quality deltas for branch Q. | Uncapped rates can overvalue tiny-denominator wins. Current PUCT-quality-marginal adds subtree GPU-hour normalization and recent-yield checks. |
| 2026-05-10 | Archive initially risked being only a prompt hint. | A pure archive prompt cannot guarantee the Refiner actually cites or recombines prior successful/near-miss fragments. | Strengthened: proposal schema includes archive_operation, archive_evidence_ids, and archive_source_program_ids. Controller annotates verifier results with archive contract status, forces repair if no launchable archive-grounded proposal exists when archive entries are available, and force-includes one archive-grounded launchable proposal. | The contract enforces grounding, not scientific usefulness. Use archive_usage_report.json and child outcomes to audit whether recombination is productive. |
| 2026-05-10 | Unproductive or stale jobs were consuming queue/GPU resources. | Some v2_2 jobs had plateaued or had zero strict successes; one v3 run had incomplete analyst signal. | Cleaned: canceled 8013657, 8011230, 8011231, 8011232, and old pending 8034380. Submitted fresh v3 8036532 with corrected context length and filter cap. | Continue monitoring first 2-3 iterations of 8036532 before trusting the setup for broader targets. |
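The HTTP 400 row above is an overflow by a single token if the approximate counts are taken at face value: 63,489 + 2,048 = 65,537 against a limit of 65,536. A pre-flight budget check can catch this before the request reaches vLLM. The following is a minimal sketch, assuming the prompt token count is already known (for example from the tokenizer); `fits_context` and `prompt_budget` are hypothetical helper names, not part of the v3 code.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> bool:
    """Return True if the prompt plus the requested output fits in the context."""
    return prompt_tokens + max_new_tokens <= max_model_len


def prompt_budget(max_new_tokens: int, max_model_len: int) -> int:
    """Largest prompt (in tokens) that still leaves room for the output."""
    return max_model_len - max_new_tokens


# Numbers from the log row: about 63,489 prompt tokens, 2,048 output tokens.
old_ok = fits_context(63_489, 2_048, 65_536)    # 65,537 > 65,536
new_ok = fits_context(63_489, 2_048, 131_072)   # fits after the config change
budget = prompt_budget(2_048, 131_072)
print(old_ok, new_ok, budget)
```

The same check, run against the pruned prompt, could also serve as the acceptance test for the prompt pruning listed as residual risk.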
Current v3 Setup Checklist
| Area | Current setting | Review |
|---|---|---|
| Objective | Tree-global strict unique discovery per backend GPU-hour; quality is secondary. | Aligned. Policy code deduplicates success sequences globally before branch credit; quality bonus is bounded below one full extra success. |
| Strict criterion | iPAE < 7/31, pLDDT > 0.90, binder scRMSD < 1.5 A, paired by redesign index. | Aligned. Metrics parser uses paired list columns when available and records success-quality rows. |
| PUCT | puct_quality_marginal: subtree quality-weighted global-marginal successes / subtree GPU-hours + recent-yield term + UCB exploration. | Reasonable for the current objective. It no longer optimizes only best child. |
| Exploration width | 5 deterministic diverse root seeds, then up to 3 executor lanes per iteration. | Watch. Max children 3 matches available executor GPUs. Supervisor-selected mode avoids forcing weak proposals, but can reduce width if the LLM is too conservative. |
| Compute budget | child timeout 180 min; batch/max_batch 16; expanded work default 512, hard cap 1024; filter default 32, explicit max 64. | Aligned with the latest plan and avoids the 260-eval blind spot. |
| Archive | Run-local StrategyArchive with success templates, near-misses, failure boundaries, and diversity probes. | More than prompting: fields are schema-validated and controller-enforced after verifier. |
| Analyst evidence | EvidencePack plus structure dossier, proposal history, family outcomes, metrics, failure modes, manifest bounds. | Adequate but incomplete. Bottleneck taxonomy exists through metrics/failure modes, but sequence-vs-structure diagnosis is still not a first-class explicit field. |
| Calibration | Off for this run. | Intentional. Prior calibration had limited benefit and sometimes injected optimism/anchor effects. |
| Time accounting | Child Proteina-Complexa runtime only. | Aligned. Eval reports label this explicitly. |
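The "paired by redesign index" requirement in the Strict criterion row matters because taking the best value of each metric independently can declare a success that no single redesign achieved. A minimal sketch follows, assuming per-redesign metric lists of equal length. The function names are hypothetical, the iPAE threshold is read as the fraction 7/31 (an assumption about units), and the strict inequalities are copied from the row above.

```python
from typing import List, Sequence

IPAE_MAX = 7 / 31      # threshold as written in the checklist (assumed to be a fraction)
PLDDT_MIN = 0.90
SCRMSD_MAX = 1.5       # Angstrom


def paired_strict_successes(ipae: Sequence[float], plddt: Sequence[float],
                            scrmsd: Sequence[float]) -> List[int]:
    """Indices of redesigns that pass all three thresholds at once."""
    if not (len(ipae) == len(plddt) == len(scrmsd)):
        raise ValueError("metric lists must be paired by redesign index")
    return [i for i, (a, p, r) in enumerate(zip(ipae, plddt, scrmsd))
            if a < IPAE_MAX and p > PLDDT_MIN and r < SCRMSD_MAX]


def axiswise_success(ipae: Sequence[float], plddt: Sequence[float],
                     scrmsd: Sequence[float]) -> bool:
    """Incorrect shortcut: mixes the best value of each metric across redesigns."""
    return (min(ipae) < IPAE_MAX and max(plddt) > PLDDT_MIN
            and min(scrmsd) < SCRMSD_MAX)


# Each redesign fails exactly one criterion, yet the axis-wise shortcut passes.
ipae = [0.20, 0.30, 0.21]
plddt = [0.85, 0.95, 0.93]
scrmsd = [1.0, 1.2, 2.4]
paired = paired_strict_successes(ipae, plddt, scrmsd)
shortcut = axiswise_success(ipae, plddt, scrmsd)
print(paired, shortcut)
```

In this toy panel the paired check reports no strict success while the shortcut reports one, which is the failure mode the paired list columns are meant to prevent.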
Live Verification Notes
| Check | Observed evidence | Status |
|---|---|---|
| 8036532 allocation | Slurm reports gres/gpu=4 on della-h20g5; wrapper prints GPU 0 for vLLM and GPUs 1,2,3 for Complexa. | OK |
| Context length | vLLM server log reports Using max model len 131072; run config records llm.max_model_len=131072. | OK |
| Root seed dispatch | Run state has next_child_index=5. Child directories for child_000, child_001, and child_002 are active; child_003/child_004 are expected to queue until one of the three executor GPUs frees. | OK |
| Different seeds | Overrides show seed=500000, seed=500001, and seed=500002 for the first three root children. This prevents identical reruns of the same stochastic path. | OK |
| Filter cap | Overrides for active root children include ++generation.filter.filter_samples_limit=32. This addresses the previous 260-evaluated-candidate blind spot. | OK |
| Batch/throughput | Active root children include ++generation.search.max_batch_size=16; run config uses default batch size 16 and max forward batch size 16. | OK |
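The seed and filter-cap checks in the table above can be automated over the recorded override strings. This is a sketch, assuming each child's overrides are available as a list of Hydra-style `key=value` strings with optional `+`/`++` prefixes; where those strings are stored is not specified here, and `parse_overrides` and `check_children` are hypothetical helpers.

```python
from typing import Dict, List


def parse_overrides(overrides: List[str]) -> Dict[str, str]:
    """Map 'key=value' strings to a dict, dropping leading '+' prefixes."""
    parsed = {}
    for item in overrides:
        key, _, value = item.partition("=")
        parsed[key.lstrip("+")] = value
    return parsed


def check_children(children: Dict[str, List[str]], max_filter: int = 64) -> List[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    seeds: Dict[str, str] = {}
    for child_id, overrides in children.items():
        cfg = parse_overrides(overrides)
        limit = cfg.get("generation.filter.filter_samples_limit")
        if limit is None:
            problems.append(f"{child_id}: missing filter_samples_limit")
        elif int(limit) > max_filter:
            problems.append(f"{child_id}: filter_samples_limit {limit} > {max_filter}")
        seed = cfg.get("seed")
        if seed in seeds:
            problems.append(f"{child_id}: seed {seed} reused from {seeds[seed]}")
        elif seed is not None:
            seeds[seed] = child_id
    return problems


# Toy input: the third child reuses a seed and requests 260 evaluated candidates.
children = {
    "child_000": ["seed=500000", "++generation.filter.filter_samples_limit=32"],
    "child_001": ["seed=500001", "++generation.filter.filter_samples_limit=32"],
    "child_002": ["seed=500001", "++generation.filter.filter_samples_limit=260"],
}
problems = check_children(children)
for line in problems:
    print(line)
```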
What To Inspect Mid-Run
- Check first iterations of job 8036532: root seeds should launch five distinct strategies, queued across three executor GPUs.
- Confirm no analyst HTTP 400s after increasing context length to 131072.
- Confirm child overrides include generation.filter.filter_samples_limit=32 unless explicitly set to a valid value up to 64.
- Run archive usage report mid-job: archive-grounded proposal rate, launchable archive-grounded rate, selected archive-grounded rate, and recombination-like rate should be nonzero once the archive has entries.
- Inspect whether successful branches are mutated/recombined, not simply replayed. A useful archive should preserve strong fragments while changing one major knob or pairing complementary fragments.
- Watch stale-branch behavior: if recent descendants from a high-yield branch produce zero global-new successes, the policy should shift parents or include rescue/explore lanes.
Constraint Inventory
| Layer | Constraint | Purpose | Review |
|---|---|---|---|
| Objective | Strict paired success only: iPAE < 7/31, pLDDT > 0.90, binder scRMSD < 1.5 A. | Keep benchmark aligned with Proteina-Complexa paper criterion. | Right strength. Thresholds must stay immutable. |
| Score cap | AR22_PRIMARY_SUCCESS_RATE_CAP=none. | Let productive branches keep receiving credit instead of collapsing at cap=2. | Right direction, but needs stale-branch controls below. |
| Bad-branch control | puct_quality_marginal gives zero global-new duplicate credit, normalizes by subtree GPU-hours, adds recent-yield bonus, and halves Q when a subtree has at least 3 descendants with zero recent quality. | Good branches can continue; stale branches lose priority without hard-banning a whole family. | Partial fix. It is not a hard stop. If repeated zero-yield parameter-similar retries persist, add a parameter-bin stale guard. |
| PUCT exploration | puct_c=2.0, UCB denominator 1 + n_child, diverse-family backup for the final slot. | Balance exploit with less-expanded parents and prevent all slots from one family. | Reasonable for current pilot. Tune after observing 8036532 first iterations. |
| Branch width | max_agent_children=3 because 3 executor GPUs are available. | Use all executor GPUs while keeping one GPU for vLLM. | Hardware-bound. Not a scientific ideal; broader trees require more executor GPUs or async scheduling. |
| Root exploration | 5 deterministic diverse root seeds; first 3 run immediately and remaining 2 queue. | Force broad starting coverage before LLM-controlled search. | Right strength for avoiding cold-start myopia. |
| Compute size | batch/default max batch 16; active path max 64; expanded product warning above 512 and hard block above 1024. | Allow stronger sampling while preventing OOM/runaway children. | Mostly right. Hard cap may restrict very deep MCTS, but appropriate for 24H validation. |
| Child runtime | Per-child timeout/expected maximum 180 min. | Let expensive but plausible MCTS/hallucination nodes finish. | Right for current smoke. Bad long children are handled by GPU-hour-normalized Q. |
| Downstream evaluation cap | Runtime default filter_samples_limit=32; explicit proposal max 64. | Prevent 260-candidate downstream evaluation failures from consuming hours. | Conservative. Good for avoiding waste; may miss rare-success strategies unless explicit 64 is justified. |
| Archive contract | When archive entries exist, require/force at least one launchable archive-grounded proposal with valid archive IDs/source IDs. | Prevent forgetting useful successful/near-miss fragments; enable recombination. | Useful but strong. Could force archive grounding from a low-quality archive; monitor archive usage and outcomes. |
| Verifier | Unknown config paths, target/threshold/evaluator/model-weight edits, unsafe guidance/nsteps/save-intermediate edits, missing rollback, budget overflow are rejected. | Keep LLM proposals executable and benchmark-safe. | Right strength. These are safety/validity constraints, not search heuristics. |
| Fatigue | V8 fatigue warning disabled. | Avoid false-negative family bans such as beam-search being suppressed before the right parameter regime is tried. | Weak against repeated bad params. Prefer future parameter-bin stale guard over family-level ban. |
| Calibration | Calibration context disabled. | Avoid bias/anchor feedback from previously miscalibrated predictions. | Right for this comparison. |
| LLM context | max_model_len=131072, but no robust evidence pruning yet. | Avoid immediate HTTP 400 from long prompts. | Too weak long-term. Need prompt pruning before longer runs. |
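The Bad-branch control and PUCT exploration rows list the ingredients of the parent score but not its exact formula. The sketch below is one plausible way to combine them and is an assumption, not the policy code. Only puct_c=2.0, the 1 + n_child denominator, the GPU-hour normalization, and the halving rule for at least 3 zero-yield recent descendants are taken from the table. The prior, the recent-yield weight, and the square-root numerator are illustrative choices.

```python
import math
from dataclasses import dataclass


@dataclass
class SubtreeStats:
    quality_weighted_new_successes: float  # global-new only; duplicates add zero
    gpu_hours: float                       # subtree backend GPU-hours
    recent_yield: float                    # recent global-new quality (assumed scale)
    recent_zero_descendants: int           # recent descendants with zero quality
    n_children: int                        # children already expanded


def puct_quality_marginal(stats: SubtreeStats, n_parent_visits: int,
                          puct_c: float = 2.0, recent_weight: float = 0.5,
                          prior: float = 1.0) -> float:
    """Illustrative parent score: exploitation Q plus a UCB-style bonus."""
    q = stats.quality_weighted_new_successes / max(stats.gpu_hours, 1e-6)
    q += recent_weight * stats.recent_yield
    if stats.recent_zero_descendants >= 3:
        q *= 0.5  # stale subtree loses priority without a hard ban
    explore = (puct_c * prior * math.sqrt(max(n_parent_visits, 1))
               / (1 + stats.n_children))
    return q + explore


fresh = SubtreeStats(4.0, 2.0, 1.0, 0, 1)
stale = SubtreeStats(4.0, 2.0, 0.0, 3, 4)
fresh_score = puct_quality_marginal(fresh, n_parent_visits=9)
stale_score = puct_quality_marginal(stale, n_parent_visits=9)
print(fresh_score, stale_score)
```

Both toy subtrees have the same lifetime yield per GPU-hour; the stale one ranks lower only because of the halving rule and its larger child count, which is the behavior the table describes as a soft rather than hard stop.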
V4 Phase: Real Complexa Adapter And Runaway Guard
| Component | Implemented behavior | Validation |
|---|---|---|
| Complexa adapter | ComplexaChildExecutor renders a compiled V4 proposal into an official Proteina-Complexa command, writes complexa_run_spec.json, runs inside the official repo environment, and summarizes combined CSV artifacts back into ChildEvidenceSummary. | Added in autoresearch_v4/complexa_adapter.py. |
| Hard runaway guard | Every child is rejected before launch if filter_samples_limit > 32, max_batch_size > 16, nsamples > 32, expanded_work > 1024, or requested timeout exceeds 180 minutes. | Tested: explicit filter_samples_limit=260 and oversized beam products are blocked before execution. |
| 260-candidate failure prevention | V4 now treats large downstream evaluation panels as an execution-policy violation, not only a bad outcome observed after GPU time is spent. Default child panels stay at 32 evaluated candidates unless a future policy deliberately changes the cap. | Aligned with the user objective: avoid spending hours on one zero-success setup. |
| Official repo execution | The run spec records the raw complexa design ... command. The executor launches through a shell wrapper that sources .venv/bin/activate and env.sh inside the official Proteina-Complexa repo. | Mirrors the v2.2 local executor environment pattern. |
| GPU pool executor | ComplexaGpuPoolExecutor assigns each concurrent child to one worker GPU through a queue, so four in-flight CD45 children use GPUs 0,1,2,3 instead of colliding on one device. | Tested: dry-run summary records the assigned GPU id. |
| Global strict registry | The pool executor keeps a thread-safe set of strict success sequences from completed children. Later children are summarized against that set so n_global_new_strict_success is tree-global, not child-local. | Tested: a known strict sequence is counted as strict success but not global-new. |
| 8H CD45 smoke wrapper | slurm/v4_cd45_real_4gpu_8h.slurm launches a scheduler-only V4 real campaign on official Proteina-Complexa with 4 H100 workers. The wrapper now accepts env overrides for short pre-maintenance smoke runs. | Submitted: 8117623 exposed missing diverse-beam n_branch/beam_width; 8118120 exposed wrong MCTS key (n_sim instead of official n_simulations). Both were canceled after diagnosis. Corrected 8H job 8118446 was canceled because maintenance makes >6H scheduling uninformative; 2H pre-maintenance smoke 8118485 was submitted with MAX_CHILDREN=4, TIMEOUT_MINUTES=60, then canceled after the run-directory collision fix; clean post-fix smoke 8119328 was submitted with the same short limits. As of its first minutes, child specs and official generate logs confirm diverse-beam uses beam_width/n_branch/diversity_weight and MCTS uses official n_simulations=32; all four bootstrap children have filter_samples_limit=32, max_batch_size=16, and unique seeds. After about 6 minutes, official generate logs showed no traceback/config error; best-of-N generation finished, beam/diverse-beam were scoring samples, and MCTS had reached simulation 28/32. |
| Template registry audit | The V4 template registry was checked against the official Proteina-Complexa config/search source. Supported active families are now beam_search, best_of_n, diverse_beam, mcts, mcts_then_hallucination, and fk_steering. | Fixed: unsupported stochastic-beam-search was removed from templates and async backup scheduling. MCTS knobs now map to official n_simulations/exploration_constant. |
| Run-directory collision fix | V4 official run_name now includes a sanitized run slug from the output directory, so repeated CD45 smoke jobs with the same child_id/template_id do not reuse the same Proteina-Complexa evaluation_results path. | Fixed: added regression coverage that two output roots produce distinct run names and evaluation directories. The canceled 8118485 should be treated as a launch/config smoke, not a clean result-count smoke, because it started before this fix; clean post-fix 2H smoke 8119328 was submitted; its child specs now include job-scoped names such as v4_05_CD45_8119328_child_000_beam_search. |
| V4 explicit TemplateRegistry | The template map is now wrapped by TemplateRegistry and exposed as DEFAULT_REGISTRY. The compiler, verifier, and LLM proposer all route template lookup through the same registry instead of separate raw-dict assumptions. | Implemented: keeps family names and official override paths centralized. |
| V4 archive/verifier closure | Implemented first-class StrategyArchive, ProductArchive, and FailureArchive. The campaign runner now updates all three after each child completion and records archive counts in the report validation block. Added deterministic verify_compiled_proposal before executor submission: it checks template-owned paths, duplicate patch paths, target/evaluator/threshold invariants, and the execution-policy budget guard. | Validated: V4 tests now cover archive splitting, stale-bin reset after success, verifier rejection of raw invariant paths, and real-campaign dry smoke with archive counts. |
| V4 optional LLM flag | real_campaign.py now exposes --use-llm-proposer, --llm-model, and --llm-max-tokens. The LLM proposer remains constrained to typed ActionCards; backend config paths still belong to StrategyCompiler. | Implemented: default remains deterministic scheduler-only, so current smoke jobs are not affected unless the flag is explicitly set. |
| V4 evidence-tool closure | Added live TargetSpec routing so executor summaries can pass target chain, binder chain, and known hotspot residues into evidence tools. Added sequence diversity summaries: exact cluster id, low-complexity fraction, hydrophobic/charged fractions, and cysteine count. ChildEvidenceSummary now includes contact_mode_diversity plus sanity counts for severe clashes and missing chains. | Validated: tests cover hotspot target spec propagation, sequence cluster/composition evidence, contact-mode diversity, and missing-chain sanity signals. |
| V4 archive-to-LLM/tool context | The optional ActionCard proposer now receives compact ProductArchive, FailureArchive, stale-bin, scheduler-hint, and optional paper-snippet fields. The deterministic scheduler can also consume FailureArchive/ProductArchive directly instead of only storing them. | Validated: tests cover prompt product/failure snippets and scheduler decisions from FailureArchive. |
| V4 contract audit | Added autoresearch_v4.contract_audit. Real campaign reports now include contract_audit.ok/errors/warnings. The audit checks bootstrap order, ActionCard/proposal consistency, verifier launchability, unique patch paths and seeds, strict/global count invariants, and completed evidence-tool fields including target spec, sequence cluster ids, duplicate status, and strict-success sequences. | Validated: dry real-campaign report at research/autoresearch_v4_smokes/contract_audit_dry/v4_real_campaign_report.json has contract_audit.ok=true with no errors/warnings. |
| V4 name/path audit | Found and fixed three mismatch-class bugs: removed unsupported diverse-beam-search from official V4 templates; changed compiled seed override from ineffective generation.seed to root-level seed; added explicit hard-target TargetSpec entries for official hotspots/chains so non-A targets do not silently fall back to chain A/no hotspot. | Validated: V4 tests now pass 51/51. Dry real-campaign reports for CD45, BetV1, and SC2RBD have contract_audit.ok=true; emitted specs contain ++seed=..., explicit nsamples, and official algorithms only: beam-search, best-of-N, MCTS, MCTS+hallucination, FK-steering. |
| V4 real smoke submissions | Submitted two maintenance-limited 3.5H real GPU smokes after the patch: 8121300 CD45 and 8121299 BetV1. Both use official Proteina-Complexa, 4 H100, MAX_CHILDREN=8, MAX_IN_FLIGHT=4, TIMEOUT_MINUTES=60. | Pending/running check needed: first post-start check should inspect child specs and v4_real_campaign_report.json for contract_audit.ok=true. |
| Regression tests | V4 evidence, compiler, scheduler, replay, LLM proposer, async campaign, parameter-bin stale guard, and Complexa adapter tests were run together. | 51/51 passed on 2026-05-11. |
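The Hard runaway guard row amounts to a fixed set of upper bounds checked before launch. A minimal sketch under that reading: `ChildBudget` and `guard_violations` are hypothetical names, and `expanded_work` is treated as an already-computed number because its definition is not given in this log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChildBudget:
    filter_samples_limit: int
    max_batch_size: int
    nsamples: int
    expanded_work: int       # precomputed; how it is derived is not stated here
    timeout_minutes: int


# Hard limits as listed in the "Hard runaway guard" row.
LIMITS = {
    "filter_samples_limit": 32,
    "max_batch_size": 16,
    "nsamples": 32,
    "expanded_work": 1024,
    "timeout_minutes": 180,
}


def guard_violations(budget: ChildBudget) -> List[str]:
    """Return the limits a child exceeds; an empty list means it may launch."""
    return [f"{name}={getattr(budget, name)} exceeds {cap}"
            for name, cap in LIMITS.items()
            if getattr(budget, name) > cap]


ok_child = ChildBudget(32, 16, 32, 512, 60)
runaway = ChildBudget(260, 16, 32, 2048, 60)
print(guard_violations(ok_child))
print(guard_violations(runaway))
```

Returning every violated limit, rather than stopping at the first, keeps the rejection message useful when a proposal is oversized in more than one dimension.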
Known Limitations
- Prompt length is patched by increasing context length, not fundamentally solved. Evidence pruning is still needed.
- Archive weighting is hand-designed; there is not yet a learned prior over archive entries.
- The archive contract validates citation/grounding, not causal usefulness.
- Target-specific generalization is still unresolved; current v3 validation is CD45-first.
- Public results.html intentionally hides internal ranking mechanics, so this file is the process trace for manuscript analysis.
Private HTML Canonical Source
| Path | Role | Status on 2026-05-11 |
|---|---|---|
| github_repos/website/minkyujeon.github.io/projects/autoresearch-v2/private/internal-process-log.html | Canonical private process log. This file is tracked in the website repository and should be edited first. | Canonical. Last git-tracked private-log update before this edit was commit 854aea7. |
| /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_private_html/autoresearch_internal_process_log.html | Scratch mirror/export copy for local private viewing and backup. | Mirror. diff -q showed both files were identical before the V4 update; this mirror should be refreshed from the canonical file after every private-log edit. |
V4 Design Choice
| Question | V4 answer | Reason |
|---|---|---|
| What is a node? | A node is a strategy episode: one compiled Proteina-Complexa run producing a panel of candidates, strict successes, failures, and quality/diversity measurements. | De novo discovery returns a candidate set, not a single scalar. The tree should value every globally new strict success produced anywhere in the campaign. |
| What does the LLM produce? | A compact typed ActionCard, not raw config patches. Example fields: lane, template_id, archive_source_ids, mutation_intent, bottleneck_hypothesis, and falsification_rule. | v2/v3 showed that raw JSON config patches are brittle: long prompts, clipped completions, path hallucination, and budget mismatch. Compact cards keep the LLM focused on scientific hypotheses. |
| Who creates executable configs? | A deterministic StrategyCompiler maps ActionCards to manifest-valid config patches from a versioned template registry. | This preserves flexibility while removing most LLM-induced launch failures. The compiler owns exact config paths, seed assignment, batch/filter limits, budget product checks, and target invariants. |
| What is stored in the archive? | StrategyArchive for template/bin performance, ProductArchive for sequences/structures/quality, and FailureArchive for zero-yield bins, OOM/timeout bins, near-misses, and stale branches. | The archive is a residual memory of useful stepping stones. It prevents forgetting good fragments while still allowing recombination and exploration. |
| How does V4 avoid only exploiting one good branch? | The scheduler uses lanes: exploit, local_mutate, recombine, explore, rescue, and diagnostic_self_vs_mpnn. | A good branch should be mutated and recombined, not blindly replayed. Bad exact parameter bins should be retired without banning the whole family. |
| Are the old 5 root seeds still valid? | Yes, but they should become bootstrap ActionCards generated from the same registry. Suggested roots: official beam search, best-of-N, diverse beam, MCTS plus sequence hallucination, and fk-steering. Stochastic beam can be an optional sixth seed. | The idea of broad initial coverage is still correct. V4 should express it through the same typed machinery used after iteration 0. |
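The split between an LLM-written ActionCard and compiler-owned config paths can be sketched as follows. The field names and lanes come from the table above, and the three override paths are the ones recorded elsewhere in this log. Everything else is an assumption for illustration: `compile_card` is a hypothetical stand-in for StrategyCompiler, the recombine rule is invented, and template-specific knobs are omitted because their official paths are not listed here.

```python
from dataclasses import dataclass
from typing import List, Tuple

LANES = {"exploit", "local_mutate", "recombine", "explore", "rescue",
         "diagnostic_self_vs_mpnn"}
TEMPLATES = {"beam_search", "best_of_n", "diverse_beam", "mcts",
             "mcts_then_hallucination", "fk_steering"}


@dataclass(frozen=True)
class ActionCard:
    lane: str
    template_id: str
    mutation_intent: str
    bottleneck_hypothesis: str
    falsification_rule: str
    archive_source_ids: Tuple[str, ...] = ()


def compile_card(card: ActionCard, seed: int) -> List[str]:
    """Validate a card and emit compiler-owned overrides (illustrative subset)."""
    if card.lane not in LANES:
        raise ValueError(f"unknown lane: {card.lane}")
    if card.template_id not in TEMPLATES:
        raise ValueError(f"unknown template: {card.template_id}")
    # Assumed rule, not from the log: recombination needs two sources to pair.
    if card.lane == "recombine" and len(card.archive_source_ids) < 2:
        raise ValueError("recombine needs at least two archive sources")
    # The card never carries config paths; the compiler fixes them.
    return [
        f"++seed={seed}",
        "++generation.filter.filter_samples_limit=32",
        "++generation.search.max_batch_size=16",
    ]


card = ActionCard(
    lane="explore",
    template_id="mcts",
    mutation_intent="widen search after stale follow-ups",
    bottleneck_hypothesis="interface not found",
    falsification_rule="zero global-new strict successes",
)
overrides = compile_card(card, seed=500000)
print(overrides)
```

Because the card holds only names and hypotheses, a malformed LLM completion fails validation here instead of producing a child that launches with a hallucinated path.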
V4 vs Bayesian Optimization
| Aspect | Bayesian optimization | V4 campaign controller |
|---|---|---|
| Search object | A mostly static hyperparameter vector x. | A strategy episode with lineage, archive sources, candidate panel, strict successes, failure modes, and target-specific diagnostics. |
| Objective | Usually scalar f(x), often assumed smooth enough for a surrogate. | State-dependent campaign utility: global-new strict successes per backend GPU-hour plus secondary quality/diversity. A repeated sequence has lower or zero marginal value even if its local metrics are good. |
| Trial output | Often one scalar, optionally constraints. | A rich panel: per-candidate sequence, iPAE, pLDDT, binder scRMSD, ipTM, duplicates, generated/evaluated counts, timeout/OOM, and near-miss type. |
| Action type | Select another point in parameter space. | Choose a lane and action: mutate a successful strategy, recombine archive fragments, rescue a near-miss, retire a stale parameter bin, or run a diagnostic. |
| Role of priors | Kernel/acquisition assumptions over numeric/categorical variables. | Mechanistic and empirical priors from Proteina-Complexa, target geometry, sequence/structure bottlenecks, and archive history. |
| Where BO may still fit | As the entire optimizer. | As a subroutine inside one template family, for example tuning MCTS or beam parameters after the scheduler chooses that lane. |
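The state dependence in the Objective row can be made concrete: the value of an episode depends on what the campaign has already found, so the same configuration scores differently when replayed. A minimal sketch, assuming each episode is summarized as its strict-success sequences and its GPU-hours. The secondary quality/diversity term is left out, and the sequences are toy strings.

```python
from typing import Iterable, List, Set, Tuple


def global_new_per_gpu_hour(
        episodes: Iterable[Tuple[List[str], float]]) -> List[float]:
    """For each (strict-success sequences, GPU-hours) episode, in order,
    return the count of sequences not seen earlier per GPU-hour."""
    seen: Set[str] = set()
    utilities = []
    for sequences, gpu_hours in episodes:
        new = set(sequences) - seen      # within-episode duplicates collapse too
        seen |= new
        utilities.append(len(new) / gpu_hours if gpu_hours > 0 else 0.0)
    return utilities


episodes = [
    (["MKTAY", "MKTAV", "MKTAY"], 1.0),   # two unique new sequences
    (["MKTAY", "MKTAV"], 2.5),            # pure replay: zero marginal value
    (["GSHMA"], 0.5),                     # one new sequence, cheap episode
]
utilities = global_new_per_gpu_hour(episodes)
print(utilities)
```

A surrogate over a static f(x) would assign the first and second episodes the same value if they shared a configuration; the campaign utility does not.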
V4 vs SCORE
| Aspect | SCORE-style tree search | Autoresearch V4 |
|---|---|---|
| Artifact being improved | Executable code or method implementation for a benchmarked empirical software problem. | Strategy episodes for de novo binder discovery using Proteina-Complexa as the backend evaluator/generator. |
| LLM mutation | The LLM semantically rewrites code or method descriptions, which can create abrupt score jumps. | The LLM proposes compact scientific ActionCards. A deterministic compiler translates them into valid config variants. Optional later phase can add product-level sequence/structure mutation. |
| Objective | Find a high-scoring program or method for the task. | Maximize a campaign's cumulative global-new strict binder discoveries over fixed compute. The best single node is not enough. |
| Why the difference is expected | Empirical software tasks reward one final solution artifact. | De novo discovery rewards a useful panel of unique candidate proteins. A long tail of small productive nodes can matter more than the best child. |
| What V4 borrows | Tree search over semantically meaningful ideas, problem-specific advice, and empirical validation. | Those ideas are retained, but the mutable object, objective, and evaluator are changed to match protein design campaigns. |
V4 Concrete Implementation Plan
- Scaffold. Create autoresearch_v4 from the stable v3 codebase, but replace the agent loop rather than layering more prompt fixes onto v3.
- Schemas. Add typed models: TargetSpec, StrategyGenome, ParameterBin, ActionCard, CompiledProposal, CampaignState, StrategyArchiveEntry, ProductArchiveEntry, and FailureArchiveEntry.
- Template registry. Implement versioned templates for best_of_n, beam_search, stochastic_beam_search, diverse_beam_search, fk_steering, mcts, beam_then_hallucination, mcts_then_hallucination, and best_of_n_then_hallucination. Add fk-steering knobs to the manifest because fk-steering is an official Proteina-Complexa family.
- StrategyCompiler. Convert ActionCards to deterministic config patches. Enforce immutable target/evaluator/threshold rules, seed uniqueness, filter/sample/batch bounds, expanded-work cap, and child timeout.
- LLM-free scheduler first. Implement a deterministic archive scheduler that can run without LLMs. This is the ablation that answers whether the LLM is necessary.
- LLM ActionCard proposer second. Add a compact LLM proposer that chooses lanes and hypotheses using the archive and tool outputs. It must not write raw config paths.
- Async executor pool. Move from synchronous iteration barriers toward "GPU frees, schedule next ActionCard". This better matches discovery campaigns and avoids waiting for one slow child.
- Auditor/reporting. Keep strict paired success, global sequence dedup, quality-weighted marginal utility, duplicate panels, stale-bin detection, and self-vs-MPNN bottleneck labels.
- Experiments. Run CD45 12H smoke, then CD45 24H against v3 and official seeded baseline. Then add BetV1 and one hard target such as CbAgo or HER2_AAV.
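The "GPU frees, schedule next ActionCard" item in the plan above can be sketched with a queue of free GPU ids and a lock-protected set of known strict-success sequences. This is an illustration using only the standard library, not the ComplexaGpuPoolExecutor code. `GpuPool`, `run_child`, and `fake_child` are hypothetical, and the child is a stub that returns sequences instead of launching Proteina-Complexa.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


class GpuPool:
    """Hand out GPU ids through a queue; track strict successes thread-safely."""

    def __init__(self, gpu_ids: List[int]) -> None:
        self._free: "queue.Queue[int]" = queue.Queue()
        for gpu in gpu_ids:
            self._free.put(gpu)
        self._lock = threading.Lock()
        self._known: Set[str] = set()

    def run_child(self, child_id: str, work: Callable[[int], List[str]]) -> Dict:
        gpu = self._free.get()          # blocks until some GPU is free
        try:
            sequences = work(gpu)       # stand-in for launching one child
        finally:
            self._free.put(gpu)         # a waiting child can start right away
        with self._lock:
            new = set(sequences) - self._known
            self._known |= new
        return {"child_id": child_id, "gpu": gpu,
                "n_strict_success": len(set(sequences)),
                "n_global_new_strict_success": len(new)}


def fake_child(gpu: int) -> List[str]:
    return ["SEQ_A", "SEQ_B"]           # every child rediscovers the same two


pool = GpuPool([0, 1, 2, 3])
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(pool.run_child, f"child_{i:03d}", fake_child)
               for i in range(8)]
    summaries = [f.result() for f in futures]

total_new = sum(s["n_global_new_strict_success"] for s in summaries)
print(total_new)
```

There is no iteration barrier: a slow child holds only its own GPU, and the global-new count stays tree-global because every summary is computed against the shared set.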
Tool Calling For Better Hypotheses
| Tool class | Recommended use | Why it helps | Priority |
|---|---|---|---|
| Paper retrieval and local paper binder | Retrieve target-specific literature, Proteina-Complexa method details, binder-design heuristics, and prior failed/successful strategies. Use PDF preprocessing and cite snippets into ActionCards. | Gives the LLM problem-specific advice without letting it change benchmark rules. Useful for target biology, hotspot interpretation, and method priors. | High |
| Bio.PDB / Biopython | Parse PDB/mmCIF, chains, residues, atoms, hotspot neighborhoods, contact distances, and residue-level composition. | Turns "structure bottleneck" into measurable evidence: hotspot exposure, binder-target contacts, residue interface maps, clashes, and chain parsing. | High |
| Biotite | Fast NumPy-style sequence/structure analysis, sequence identity, clustering, feature extraction, and database/file utilities. | Useful for candidate-panel deduplication, diversity metrics, and lightweight structural summaries that fit into compact prompts. | High |
| MDAnalysis | Distance/contact calculations and residue/atom selections, especially if intermediate structures or trajectories are available. | Useful for interface contact diagnostics and native-contact style summaries. More important if we later store trajectories or multiple conformations. | Medium |
| Self-vs-MPNN diagnostic runner | Run self and MPNN-style scoring on selected near-miss candidates to label bottlenecks. | If MPNN improves while self is weak, sequence is likely bottleneck and hallucination/sequence refinement is appropriate. If self is good but iPAE remains high, search/refine the backbone/interface instead. | High |
| Candidate clustering/diversity tool | Cluster candidates by sequence identity, strict success status, interface-contact fingerprint, and quality bins. | Prevents the scheduler from counting duplicate rediscoveries as progress and helps choose archive entries for recombination. | High |
| ColabDesign / ProteinMPNN-style refinement lane | Optional later product-level refinement of near-miss candidates, not part of the first V4 core. | Promising for sequence bottlenecks, but it changes the mutable object from strategy to sequence/product. Add after the strategy-level V4 ablation is clean. | Later |
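As a rough illustration of the candidate clustering/diversity row, the sketch below clusters a panel greedily by sequence similarity using only the standard library. It is a stand-in: `difflib.SequenceMatcher.ratio()` is a similarity proxy, not an alignment-based sequence identity as Biotite or a dedicated clustering tool would compute, and the 0.9 threshold and toy sequences are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def identity_proxy(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; not an alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences: List[str], threshold: float = 0.9) -> List[int]:
    """Assign each sequence to the first cluster whose representative is at
    least `threshold` similar; otherwise start a new cluster."""
    representatives: List[str] = []
    labels: List[int] = []
    for seq in sequences:
        for cluster_id, rep in enumerate(representatives):
            if identity_proxy(seq, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(seq)
            labels.append(len(representatives) - 1)
    return labels


panel = [
    "MKTAYIAKQRQISFVKSHFS",
    "MKTAYIAKQRQISFVKSHFT",   # one substitution from the first
    "GSHMLEDPVDAFKLLNEQGA",   # unrelated
    "MKTAYIAKQRQISFVKSHFS",   # exact duplicate of the first
]
labels = greedy_cluster(panel)
print(labels)
```

Cluster labels like these fit in a compact prompt, and counting clusters rather than raw successes keeps near-identical rediscoveries from looking like progress.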
V4 Tool References Checked
- Biopython Bio.PDB documentation: PDB/mmCIF parsing and Structure object access for atomic data.
- MDAnalysis documentation: atom selections, distance arrays, contact matrices, and native contact analysis.
- Biotite papers/documentation: sequence and structural bioinformatics library with NumPy-based data handling.
- ColabDesign repository: optional sequence/design refinement ecosystem for later product-level mutation lanes.
- Darwin Godel Machine and SCORE: archive/open-ended search and semantic mutation are useful analogies, but V4's mutable object and objective must be de novo campaign-specific.
V4 Evidence Tool Implementation Update - 2026-05-11
github_repos/Autoresearch_Denovo/autoresearch_v4.
The first shipped layer is not a full controller. It is the candidate-level evidence tooling needed before ActionCards, StrategyCompiler, and ArchiveScheduler are wired together.
| Component | Status | Purpose | Smoke expectation |
|---|---|---|---|
| complexa_csv.read_candidate_metrics | Implemented | Reads Proteina-Complexa combined CSV and keeps paired iPAE/pLDDT/binder-scRMSD rows instead of mixing axis-wise extrema. | Strict success labels use iPAE <= 7/31, pLDDT >= 0.90, binder scRMSD < 1.5 Å. |
| InterfaceBottleneckAnalyzer | Implemented | Summarizes target-binder contact count, hotspot contacts, interface residues, contact fingerprint, and severe clash count. Bio.PDB is used when available; a fallback PDB parser keeps the smoke portable. | No-interface iPAE failure should map to ipae_no_interface; plausible hotspot-contact near miss should map to ipae_plausible_interface. |
| SequenceDiversityAnalyzer | Implemented | Marks global duplicates and within-child duplicates so campaign reward can focus on global-new strict successes. | A strict-success sequence already in the known set should map to strict_success_duplicate and contribute zero global-new strict successes. |
| BottleneckClassifier | Implemented | Rule-based first-pass labels: strict success, duplicate success, no-interface iPAE failure, wrong-hotspot iPAE failure, plausible-interface iPAE near miss, pLDDT scaffold failure, and binder-RMSD failure. | Produces compact recommended lanes such as local_mutate, recombine, rescue, explore, mcts, and fk_steering. |
| ChildEvidenceSummary | Implemented | Compacts candidate-level facts into prompt/archive-ready JSON. This is the intended future input to ArchiveScheduler and optional LLM ActionCard proposer. | LLM should see bottleneck summaries, not raw PDBs or large CSV tables. |
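The paired-versus-axis-wise distinction in the first row can be illustrated with a minimal sketch. The cutoffs are the ones quoted in the table; the Redesign class and both function names are hypothetical and are not taken from complexa_csv.

```python
from dataclasses import dataclass

# Cutoffs quoted in the log.
IPAE_MAX = 7 / 31   # inclusive
PLDDT_MIN = 0.90    # inclusive
SCRMSD_MAX = 1.5    # Angstrom, strict inequality


@dataclass(frozen=True)
class Redesign:
    """One redesign row (hypothetical container); all three metrics belong together."""
    ipae: float
    plddt: float
    binder_scrmsd: float


def is_strict_success(r: Redesign) -> bool:
    return r.ipae <= IPAE_MAX and r.plddt >= PLDDT_MIN and r.binder_scrmsd < SCRMSD_MAX


def paired_strict_success(redesigns: list) -> bool:
    """True only if a single redesign passes all three cutoffs at once."""
    return any(is_strict_success(r) for r in redesigns)


def axiswise_strict_success(redesigns: list) -> bool:
    """The pattern to avoid: the best value per axis, taken from different rows."""
    best = Redesign(
        ipae=min(r.ipae for r in redesigns),
        plddt=max(r.plddt for r in redesigns),
        binder_scrmsd=min(r.binder_scrmsd for r in redesigns),
    )
    return is_strict_success(best)


redesigns = [
    Redesign(ipae=0.20, plddt=0.85, binder_scrmsd=1.0),  # passes iPAE, fails pLDDT
    Redesign(ipae=0.40, plddt=0.95, binder_scrmsd=1.0),  # passes pLDDT, fails iPAE
]
print(paired_strict_success(redesigns), axiswise_strict_success(redesigns))
```

Neither redesign passes on its own, so the paired check rejects the candidate, while the axis-wise check would count it as a success.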
V4 Evidence Tool Smoke Results - 2026-05-11
| Smoke | Input pattern | Expected label | Observed | Status |
|---|---|---|---|---|
| no_interface | Good pLDDT/RMSD but high iPAE and target-binder chains far apart. | ipae_no_interface | {"ipae_no_interface": 1}; lanes explore, mcts, fk_steering. | Passed local smoke |
| plausible_interface | Good pLDDT/RMSD, hotspot contacts present, iPAE just above strict cutoff. | ipae_plausible_interface | {"ipae_plausible_interface": 1}; lanes local_mutate, recombine, rescue. | Passed local smoke |
| duplicate_success | Strict-success metrics but sequence already exists in global known set. | strict_success_duplicate, zero global-new strict successes. | {"strict_success_duplicate": 1}; n_global_new_strict_success=0. | Passed local smoke |
Local smoke output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/local_20260511/smoke_report.json. Unit tests: tests/test_v4_evidence_tools.py, 2/2 passed.
Cluster smoke completed as Slurm job 8098962 with exit code 0:0 using slurm/v4_evidence_smoke_cpu.slurm. Output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/8098962/smoke_report.json.
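A minimal sketch of the label-to-lane routing exercised by the first two smokes. The two labels and their lane lists are the ones observed above. The Evidence field names and the contact-count rules are assumptions for illustration, not the actual BottleneckClassifier logic.

```python
from dataclasses import dataclass
from typing import Optional

IPAE_MAX = 7 / 31  # strict iPAE cutoff quoted in the log

# Lane lists for these two labels are the ones reported by the smokes.
LANES_BY_BOTTLENECK = {
    "ipae_no_interface": ("explore", "mcts", "fk_steering"),
    "ipae_plausible_interface": ("local_mutate", "recombine", "rescue"),
}


@dataclass(frozen=True)
class Evidence:
    ipae: float
    n_contacts: int          # target-binder contacts (hypothetical field name)
    n_hotspot_contacts: int  # contacts on hotspot residues (hypothetical field name)


def classify_ipae_failure(ev: Evidence) -> Optional[str]:
    """Return a bottleneck label for the two smoke-covered iPAE failures, else None."""
    if ev.ipae <= IPAE_MAX:
        return None  # not an iPAE failure
    if ev.n_contacts == 0:
        return "ipae_no_interface"  # assumed rule: chains never touch
    if ev.n_hotspot_contacts > 0:
        return "ipae_plausible_interface"  # assumed rule: hotspot is engaged
    return None  # wrong-hotspot case; its label string is not shown in the log


def recommended_lanes(ev: Evidence) -> tuple:
    return LANES_BY_BOTTLENECK.get(classify_ipae_failure(ev), ())


print(recommended_lanes(Evidence(ipae=0.50, n_contacts=0, n_hotspot_contacts=0)))
print(recommended_lanes(Evidence(ipae=0.24, n_contacts=12, n_hotspot_contacts=3)))
```

Keeping the lane lists in one top-level table means a new bottleneck label is added as a new row rather than a new branch.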
V4 Next Implementation Steps
- Add ActionCard and StrategyCompiler schemas. The LLM should emit compact card fields, not config patches.
- Add template registry entries for official families: beam, diverse-beam, stochastic-beam, best-of-N, MCTS, fk-steering, and hallucination-composed variants.
- Wire ChildEvidenceSummary into StrategyArchive/ProductArchive/FailureArchive.
- Build an LLM-free ArchiveScheduler ablation first. Then add optional LLM ActionCard proposer behind a flag.
- Run a real completed-child replay smoke: feed evidence tools a completed v2/v3 child directory and verify bottleneck labels match known run behavior.
V4 Clean-Code Refactor And Compiler Update - 2026-05-11
| Change | Reason | Verification |
|---|---|---|
| Refactored bottleneck diagnosis. | Moved lane mapping and ranking into small top-level tables. diagnose_candidate now reads as metrics/interface -> bottleneck -> recommended lanes. | Unit tests cover iPAE no-interface, wrong-hotspot, plausible-interface, pLDDT failure, RMSD failure, strict success, and duplicate success. |
| Refactored smoke runner. | Replaced if/elif scenario generation with a compact SmokeCase table. Adding a new smoke is now one row, not a new branch. | Expanded local smoke covers 8 scenarios. All scenarios report passed=true. |
| Added minimal ActionCard and StrategyCompiler. | Next V4 step: LLM emits compact semantic intent; compiler owns backend paths. This prevents raw config-path generation by the LLM. | Compiler tests verify MCTS+hallucination compilation, default batch/filter knobs, seed insertion, unknown template rejection, and unsupported knob rejection. |
| Fixed template-specific knob paths. | beam_width is not one universal backend path. fk-steering and stochastic-beam need template-specific paths. | Test verifies fk_steering uses generation.search.fk_steering.beam_width, not the beam-search path. |
Verification after refactor/compiler update: tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py -> 5/5 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.
Expanded local smoke output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/evidence_tools/refactor_compiler_20260511/smoke_report.json. Expanded Slurm CPU smoke: job 8099672, state COMPLETED, exit code 0:0.
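The template-specific knob resolution described in the table can be sketched as a per-template lookup. Only the fk_steering path is quoted in this log; the beam_search path, the KNOB_PATHS table, and the compile_knobs name are hypothetical and do not reflect the actual StrategyCompiler API.

```python
# Semantic knob name -> backend path, keyed by template.
KNOB_PATHS = {
    # Hypothetical path, for illustration only.
    "beam_search": {"beam_width": "generation.search.beam.beam_width"},
    # Path quoted in the log.
    "fk_steering": {"beam_width": "generation.search.fk_steering.beam_width"},
}


def compile_knobs(template_id: str, knobs: dict) -> dict:
    """Map semantic knobs to backend paths; reject anything not registered."""
    if template_id not in KNOB_PATHS:
        raise ValueError(f"unknown template: {template_id!r}")
    paths = KNOB_PATHS[template_id]
    unsupported = sorted(set(knobs) - set(paths))
    if unsupported:
        raise ValueError(f"unsupported knobs for {template_id!r}: {unsupported}")
    return {paths[name]: value for name, value in knobs.items()}


print(compile_knobs("fk_steering", {"beam_width": 8}))
```

Because the caller only ever names a template and a semantic knob, an LLM-produced card never needs to contain a backend path.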
V4 ArchiveScheduler MVP - 2026-05-11
| Piece | Status | Design reason | Verification |
|---|---|---|---|
| StrategyArchive | Implemented | Run-local memory over child summaries and source template IDs. This is the minimal archive needed before learned/LLM scheduling. | Scheduler tests build archives from synthetic ChildEvidenceSummary objects. |
| ArchiveScheduler | Implemented | LLM-free ablation policy. It first emits the five broad bootstrap families, then routes archive evidence to rescue/diversify/exploit cards. | Tests verify bootstrap order and archive-conditioned choices. |
| Compiler dedupe fix | Implemented | Template defaults and semantic overrides can touch the same backend path. Compiler now emits unique paths with override values winning. | Regression test verifies diverse_beam emits one diversity_weight patch with value 0.7. |
| Scheduler smoke | Implemented | Verifies end-to-end archive -> ActionCard -> compiled config behavior without LLM or GPU. | Local smoke output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/scheduler/dedupe_20260511/scheduler_smoke_report.json. Slurm job 8102340 completed with exit code 0:0. |
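The compiler dedupe fix amounts to a keyed merge in which overrides win. A minimal sketch follows; the backend path and the default value 0.5 are hypothetical, and only the override value 0.7 comes from the regression test described above.

```python
def merge_patches(defaults: dict, overrides: dict) -> list:
    """Merge path -> value maps so each backend path is emitted once.

    Overrides replace defaults on the same path; dict ordering keeps the
    default's original position, so output order stays deterministic.
    """
    merged = dict(defaults)
    merged.update(overrides)
    return [{"path": path, "value": value} for path, value in merged.items()]


# Hypothetical backend path and default value, for illustration only.
DIVERSITY_PATH = "generation.search.diverse_beam.diversity_weight"
patches = merge_patches(
    defaults={DIVERSITY_PATH: 0.5},
    overrides={DIVERSITY_PATH: 0.7},
)
print(patches)
```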
Current V4 flow is now: evidence tools -> ChildEvidenceSummary -> StrategyArchive -> LLM-free ArchiveScheduler -> compact ActionCard -> deterministic StrategyCompiler. This follows the private V4 plan and still keeps the LLM out of raw config generation.
Verification after scheduler phase: tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py tests/test_v4_scheduler.py -> 10/10 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.
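A minimal sketch of the bootstrap-then-route policy, under stated assumptions. The family order is the one the campaign smoke later in this log reports for its first five submissions. The archive entry fields other than n_global_new_strict_success, the "bootstrap" lane name, and the three routing rules are hypothetical.

```python
BOOTSTRAP_FAMILIES = (
    "beam_search",
    "best_of_n",
    "diverse_beam",
    "mcts_then_hallucination",
    "fk_steering",
)


def next_card(archive: list) -> dict:
    """Pick the next card from a list of child-summary dicts (hypothetical shape)."""
    tried = {entry["template_id"] for entry in archive}
    for family in BOOTSTRAP_FAMILIES:
        if family not in tried:
            # Forced broad-family initialization before any evidence routing.
            return {"lane": "bootstrap", "template_id": family}
    last = archive[-1]
    if last["n_global_new_strict_success"] > 0:
        lane = "exploit"      # assumed rule: keep pushing a productive family
    elif last.get("n_duplicate_success", 0) > 0:
        lane = "diversify"    # assumed rule: successes exist but are rediscoveries
    else:
        lane = "rescue"       # assumed rule: no yield, try to fix the bottleneck
    return {"lane": lane, "template_id": last["template_id"]}


history = [
    {"template_id": name, "n_global_new_strict_success": 0}
    for name in ("beam_search", "best_of_n", "diverse_beam", "fk_steering")
]
print(next_card([]))
print(next_card(history))
```

With mcts_then_hallucination absent from the history, the sketch keeps returning that family until a child with that template is archived.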
V4 Real-Artifact Replay And LLM ActionCard Phase - 2026-05-11
| Check | Result | Why it matters |
|---|---|---|
| Real completed-child replay | Passed local replay: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/real_replay/cd45_7694929_final_20260511/real_replay_smoke_report.json | V4 evidence tools can read actual v2.2 Complexa child artifacts, not only synthetic fixtures. The replay scanned CD45 job 7694929, loaded real combined CSVs, and produced candidate bottleneck labels plus compiled next ActionCards. |
| Strict success criteria re-verification | Replay uses paired candidate metrics with iPAE <= 7/31, pLDDT >= 0.90, and binder scRMSD < 1.5 Å. | This recovered real successes such as CD45 child_003 where old dossier text still mentioned the obsolete 0.18 iPAE threshold. V4 now uses the paper criterion in the replay path. |
| AF2/refolded PDB path bug | Fixed: complexa_csv.read_candidate_metrics now selects the self_complex_pdb_path_all entry matching the chosen paired redesign index. | Before this fix, interface diagnostics could combine AF2/self metrics with the generated PDB path. That could mislabel interface bottlenecks. Regression test covers this exact failure mode. |
| Long replay finding | 40-child CD45 replay: all cases had candidate evidence, no duplicate compiled patch paths, and the scheduler repeatedly selected the missing bootstrap lane mcts_then_hallucination. | This is expected because the archived v2.2 CD45 run did not try that V4 bootstrap family. The result validates the scheduler's forced broad-family initialization behavior on real run history. |
| Slurm replay smoke | Completed as job 8105290 with state COMPLETED, exit code 0:0. Output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/real_replay/8105290/real_replay_smoke_report.json. | Confirms the real-artifact replay path works under cluster execution, not only in an interactive shell. |
V4 LLM ActionCard Proposer - 2026-05-11
| Component | Status | Contract | Verification |
|---|---|---|---|
| LLMActionCardProposer | Implemented | Calls an LLM client for one compact ActionCard. The LLM proposes semantic lane/template/knobs/hypothesis only. | Fake-client tests verify prompt construction, max-token forwarding, and returned card compilation. |
| Raw config path guard | Implemented | Rejects config_patch, json_patch, backend overrides, and any output containing generation. or /generation/. | Regression tests confirm raw backend paths are rejected before compilation. |
| Compiler validation | Implemented | Every LLM ActionCard is passed through compile_action_card during validation, so unsupported semantic knobs fail deterministically. | Regression test rejects an unsupported knob on beam_search. |
| Prompt payload | Implemented | Prompt includes target ID, strict success rules, allowed templates and knobs, archive tail, scheduler hint, and hard rules forbidding backend config edits. | Test confirms archive evidence and template knobs are present in the prompt. |
Current V4 phase status: evidence tools -> archive -> deterministic scheduler -> optional LLM ActionCard proposer -> deterministic compiler. The LLM is now a hypothesis/mutator layer, not a raw config generator. Verification: current V4 unit suite tests/test_v4_evidence_tools.py tests/test_v4_action_compiler.py tests/test_v4_scheduler.py tests/test_v4_replay_smoke.py tests/test_v4_llm_proposer.py -> 17/17 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed.
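A minimal sketch of the raw config path guard, assuming the LLM returns one JSON object. The config_patch and json_patch keys and the two forbidden substrings are taken from the contract above. The backend_overrides key name and the guard_raw_config function are hypothetical.

```python
import json

# "backend_overrides" is a hypothetical key name for the "backend overrides" case.
FORBIDDEN_KEYS = {"config_patch", "json_patch", "backend_overrides"}
FORBIDDEN_SUBSTRINGS = ("generation.", "/generation/")


def guard_raw_config(raw_output: str) -> dict:
    """Parse an LLM reply and reject anything that looks like raw backend config."""
    card = json.loads(raw_output)
    if not isinstance(card, dict):
        raise ValueError("ActionCard must be a JSON object")
    bad_keys = sorted(FORBIDDEN_KEYS & set(card))
    if bad_keys:
        raise ValueError(f"forbidden keys in ActionCard: {bad_keys}")
    for needle in FORBIDDEN_SUBSTRINGS:
        # Checked on the raw text so nested values are covered as well.
        if needle in raw_output:
            raise ValueError(f"raw backend path fragment found: {needle!r}")
    return card


ok = guard_raw_config(
    '{"lane": "rescue", "template_id": "beam_search", '
    '"knobs": {"beam_width": 8}, "hypothesis": "Wider beam may recover near misses"}'
)
print(ok["template_id"])
```

One consequence of a plain substring check in this sketch: a hypothesis sentence that happens to end with the word "generation." is rejected too, which is conservative but may need a narrower pattern in practice.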
V4 Async Campaign Runner Phase - 2026-05-11
| Component | Status | Behavior | Verification |
|---|---|---|---|
| CampaignRunner | Implemented skeleton | Backend-light async runner. It schedules compiled ActionCards into up to max_in_flight child slots and updates the archive whenever a child returns a ChildEvidenceSummary. | Unit tests cover bootstrap reservation, executor failure conversion, optional LLM proposer use after bootstrap, pending duplicate guard, and max_in_flight=0 clamping. |
| max_in_flight | Defined | Number of child proposals that can be running at once. With a future 3-GPU Complexa adapter, max_in_flight=3 means one child per executor GPU. | Campaign smoke uses --max-in-flight 3 and produces 8 submissions / 8 completions. |
| Distinct broad bootstrap | Implemented | Pending templates are reserved, so simultaneous initial scheduling does not repeatedly choose the same first bootstrap family. | First five submissions are beam_search, best_of_n, diverse_beam, mcts_then_hallucination, fk_steering. |
| Pending exact-ActionCard duplicate guard | Implemented | If the same lane/template/knobs/archive-source ActionCard is already pending, the runner uses an async backup exploration card instead. Same template with different source or parameter bin remains allowed. | Regression test calls the guard directly. Local smoke shows a backup stochastic_beam when a repeated exact bin would otherwise be pending. |
| Plateau handling boundary | Partially handled | Other workers can keep exploring while one child runs. Once a plateau/failure child finishes, its evidence can route the next card away from that bin. Mid-run early-stop/cancel is not implemented yet. | Next required phase: real Complexa ChildExecutor adapter with runtime monitor and optional early-stop hook. |
| Slurm campaign smoke | Passed | slurm/v4_campaign_smoke_cpu.slurm runs the async campaign smoke without GPU. | Job 8106146: COMPLETED, exit code 0:0. Output: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/campaign/8106146/campaign_smoke_report.json. |
Verification after async runner phase: V4 unit suite -> 22/22 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed; bash -n slurm/v4_campaign_smoke_cpu.slurm passed.
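The max_in_flight behavior can be sketched with plain asyncio. This is an illustration under assumptions, not the CampaignRunner code: run_campaign, the card and summary dict shapes, and the clamp of max_in_flight=0 to 1 are all hypothetical.

```python
import asyncio


async def _run_child(execute, card):
    """Convert executor exceptions into a failure summary instead of crashing the loop."""
    try:
        return await execute(card)
    except Exception as exc:
        return {"card": card, "status": "executor_failure", "error": str(exc)}


async def run_campaign(next_card, execute, n_children: int, max_in_flight: int) -> list:
    max_in_flight = max(1, max_in_flight)  # assumed clamp for max_in_flight=0
    archive, pending, submitted = [], set(), 0
    while submitted < n_children or pending:
        # Fill free child slots before waiting.
        while submitted < n_children and len(pending) < max_in_flight:
            card = next_card(archive)
            pending.add(asyncio.create_task(_run_child(execute, card)))
            submitted += 1
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            archive.append(task.result())  # archive updates as each child returns
    return archive


async def demo():
    running = peak = 0
    ids = iter(range(100))

    async def fake_execute(card):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for a GPU child
        running -= 1
        if card["id"] == 2:
            raise RuntimeError("simulated executor crash")
        return {"card": card, "status": "ok"}

    archive = await run_campaign(lambda _a: {"id": next(ids)}, fake_execute, 8, 3)
    return archive, peak


archive, peak = asyncio.run(demo())
print(len(archive), peak)
```

Refilling slots as soon as any child finishes, rather than waiting for a whole batch, is what lets other workers keep exploring while one long child is still running.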
V4 ParameterBin / Stale-Bin Phase - 2026-05-11
| Question | Answer | Implementation | Verification |
|---|---|---|---|
| Are family and hyperparameters separated? | Yes. Family is template_id; hyperparameters live in parameter_overrides. | ParameterBin maps each ActionCard to (template_id, coarse knob bins). This prevents family-level false bans such as banning all beam-search after one low-noise failure. | Tests verify low-noise beam bins match each other, high-noise beam differs, and the same knob region under MCTS differs because the family differs. |
| How is stale handled? | Stale is exact-bin scoped, not family scoped. | stale_parameter_bins marks a bin stale after repeated zero-yield outcomes since the last success. A later global-new strict success resets the counter. | Regression test covers two failures -> stale, success -> reset, one later failure -> not stale. |
| Does the runner avoid stale bins? | Yes. CampaignRunner._next_card checks both pending exact ActionCard signatures and stale parameter bins. | If a card is pending or stale, the runner picks an async backup card from stochastic-beam, MCTS, or diverse-beam. | Campaign test verifies stale best-of-N bin is skipped. Slurm campaign smoke 8107577 completed with exit 0:0. |
| Does the LLM see this state? | Yes. The LLM prompt remains structured and compact. | build_action_card_messages includes allowed_templates, archive_tail, scheduler_hint, and stale_parameter_bins. The LLM still cannot emit raw backend config paths. | Prompt test verifies stale bins are serialized as {template_id, knobs}. |
Verification after ParameterBin phase: V4 unit suite -> 26/26 passed; python -m compileall -q autoresearch_v4 passed; git diff --check passed. Latest campaign smoke: /scratch/gpfs/ZHONGE/mj7341/research/autoresearch_v4_smokes/campaign/8107577/campaign_smoke_report.json.
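A minimal sketch of exact-bin staleness, mirroring the regression sequence in the table (two failures, then a success, then one failure). The bin edges, the noise_scale knob name, the mcts template name, the per-bin reset, and the StaleTracker class are assumptions for illustration. The threshold of two follows the regression test described above.

```python
STALE_AFTER = 2  # the regression test marks a bin stale after two zero-yield outcomes


def coarse_bin(value: float, edges=(0.3, 0.7)) -> str:
    """Hypothetical three-way binning of a knob value."""
    if value < edges[0]:
        return "low"
    if value < edges[1]:
        return "mid"
    return "high"


def parameter_bin(template_id: str, parameter_overrides: dict) -> tuple:
    """(template_id, coarse knob bins): family and knob region stay separate."""
    knobs = tuple(sorted((k, coarse_bin(v)) for k, v in parameter_overrides.items()))
    return (template_id, knobs)


class StaleTracker:
    """Counts zero-yield outcomes per bin since that bin's last success (assumed scope)."""

    def __init__(self):
        self._zero_streak = {}

    def record(self, bin_key: tuple, n_global_new_strict_success: int) -> None:
        if n_global_new_strict_success > 0:
            self._zero_streak[bin_key] = 0
        else:
            self._zero_streak[bin_key] = self._zero_streak.get(bin_key, 0) + 1

    def is_stale(self, bin_key: tuple) -> bool:
        return self._zero_streak.get(bin_key, 0) >= STALE_AFTER


low_a = parameter_bin("beam_search", {"noise_scale": 0.1})
low_b = parameter_bin("beam_search", {"noise_scale": 0.2})
high = parameter_bin("beam_search", {"noise_scale": 0.9})
mcts_low = parameter_bin("mcts", {"noise_scale": 0.1})

tracker = StaleTracker()
tracker.record(low_a, 0)
tracker.record(low_b, 0)  # same bin as low_a, so the streak reaches two
print(tracker.is_stale(low_a), tracker.is_stale(high), tracker.is_stale(mcts_low))
```

Keying staleness on the full bin rather than on template_id is what keeps one low-noise failure streak from banning the high-noise region or another family.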