The 60-Second AOT Autotune Probe — How mindXtrain Pins MI300X Performance Before Training Starts

Day 2 of the AMD × lablab.ai Developer Hackathon. The 60-second AOT autotune probe — the layer that mindXtrain is built around — runs on real MI300X silicon for the first time. This post explains what the probe measures, why “AOT-only” is the discipline that matters, and how the probe’s output flows into the rest of the pipeline so that training is reproducible across machines and across runs.


1. What the probe is, and what it isn’t

The probe is a short Python orchestrator that runs three measurements on the actual GPU you are about to train on, with the actual shapes your job is going to hit, and writes a static AutotunePlan JSON to disk. The training loop reads that JSON, sets a handful of env vars, picks the backend, and launches. The probe runs once, before training. The training loop never re-tunes. That sentence is the entire design.
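
A minimal sketch of the consuming side, with illustrative field names (the real schema lives in mindxtrain/config/schema.py and may differ):

import json
import os

def apply_plan(path: str) -> dict:
    """Load a static AutotunePlan and apply it to the current process."""
    with open(path) as f:
        plan = json.load(f)
    # Export every env var the probe decided on, before any backend import or launch.
    for key, value in plan.get("env", {}).items():
        os.environ[key] = str(value)
    # The backend choice is read verbatim; nothing is re-measured here.
    return plan

plan = apply_plan("plan.json")
print(plan.get("attn_backend"))  # e.g. "ck"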

What the probe is not: it is not torch.compile(mode="max-autotune"). It is not Triton’s JIT autotune. It is not MIOpen find-mode. Those mechanisms re-decide kernel choices at runtime, on cold caches, on the first batch of every restart. They produce non-deterministic loss curves, non-reproducible benchmarks, and the “why is the eval different from the training run on the same checkpoint” class of bug that eats a day every time it happens. The probe replaces all of them with a measurement made deliberately, captured deliberately, and consumed deliberately.

2. The three measurements

The probe is one orchestrator plus three backend modules plus a Pydantic AutotunePlan. Each measurement is bounded: the whole probe finishes in under 60 seconds on an MI300X at the recipes’ default budget. Tight budgets are a feature — the probe runs every time, so it has to be cheap.
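
The plan itself is small enough to sketch. The model below is illustrative rather than the project's actual field list; the real AutotunePlan is defined in mindxtrain/config/schema.py.

from typing import Dict, Literal
from pydantic import BaseModel, Field

class RCCLPlan(BaseModel):
    # Empty on 1 GPU; channel/queue settings on 8 GPUs (see section 2.3).
    env: Dict[str, str] = Field(default_factory=dict)

class AutotunePlan(BaseModel):
    # Hypothetical field names for illustration only.
    attn_backend: Literal["ck", "triton"]
    gemm_backend: str = "hipblaslt-default"
    env: Dict[str, str] = Field(default_factory=dict)
    rccl: RCCLPlan = Field(default_factory=RCCLPlan)
    budget_seconds: int = 60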

2.1 Attention — Composable Kernel vs Triton

Attention is the most consequential decision. Composable Kernel (AMD’s hand-tuned ASM library, surfaced via AITER) and AOTriton (the AMD-flavored AOT-compiled Triton path) are both real, both first-class on MI300X under ROCm 7.2.1, and both win on different shapes. The probe times torch.scaled_dot_product_attention on four representative shapes drawn from the recipe’s actual seq_len, batch_size, and head config. Whichever is faster wins. The decision is locked into the plan as attn_backend: ck or attn_backend: triton, and the seven mandatory MI300X env vars (NVTE_CK_USES_BWD_V3=1, NVTE_CK_IS_V3_ATOMIC_FP32=1, PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1, etc.) are set accordingly.
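
The timing harness for a single shape is nothing exotic; a minimal sketch is below. Flipping between the CK and Triton implementations is not shown here; it is assumed to happen through the env vars above, set before the process starts, so the harness only has to warm up and time.

import torch
import torch.nn.functional as F

def time_sdpa(batch: int, heads: int, seq: int, head_dim: int, iters: int = 20, warmup: int = 5) -> float:
    """Median wall time (ms) of scaled_dot_product_attention for one BF16 shape."""
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(warmup):                      # warm up kernels and the allocator
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))    # milliseconds
    return sorted(times)[len(times) // 2]

# The Qwen3-8B shape discussed below:
# time_sdpa(batch=8, heads=32, seq=4096, head_dim=128)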

For Qwen3-8B at (batch=8, seq=4096, heads=32, head_dim=128), the recipe-default measurement on MI300X has CK winning by a comfortable margin over Triton. AITER’s hand-tuned ASM beats Triton at this size — which is the boring expected result, but the point is that the probe measured it instead of someone hand-coding the assumption into the framework. If a future ROCm release flips that result on a future shape, the probe catches it.

2.2 GEMM — hipBLASLt heuristic check

Per the AMD ROCm 7.2.1 release notes, hipBLASLt 0.10’s default heuristic for gfx942 BF16 / FP16 GEMMs is within ~5% of hand-tuned for the LoRA-rank-16-to-64 / hidden-2048-to-8192 shape range that mindXtrain hits. The probe runs a small check to confirm that 5% holds on the actual shapes; if it does, the plan locks in the default and moves on. Full heuristic enumeration — the brute-force search over hipBLASLt’s algorithm space — is post-hackathon work. For the hackathon window, “the default is good enough and we measured it” is the right level of effort.
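
As a sketch of what "a small check" can mean: time the BF16 GEMMs at the recipe's LoRA shapes under the default heuristic and convert to achieved TFLOP/s. What the probe compares that number against to confirm the ~5% claim (a stored reference, a fraction of peak) is left out here, and the shapes below are illustrative.

import torch

def gemm_tflops(m: int, n: int, k: int, iters: int = 20) -> float:
    """Achieved TFLOP/s of a BF16 (m x k) @ (k x n) matmul under the default hipBLASLt heuristic."""
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(5):
        a @ b                                    # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return (2 * m * n * k) / (ms * 1e-3) / 1e12

# Illustrative LoRA projection shapes: down-projection (hidden -> rank) and up-projection (rank -> hidden).
tokens = 8 * 4096
for hidden, rank in [(2048, 16), (4096, 32), (8192, 64)]:
    print(hidden, rank, gemm_tflops(tokens, rank, hidden), gemm_tflops(tokens, hidden, rank))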

2.3 RCCL — collective topology resolution

RCCL handling is where the schema does most of the work. The 1-GPU path is a no-op: nothing collective happens, and the plan's RCCL section is empty. The 8-GPU path sets NCCL_MIN_NCHANNELS=112 and GPU_MAX_HW_QUEUES=1 in the plan's env block, which are the values that consistently saturate xGMI on a full 8-card MI300X box. The 2- and 4-GPU paths cannot reach the probe, because the schema rejects them at parse time. This is intentional. Asymmetric xGMI bandwidth between subsets of 2 or 4 MI300X cards silently bottlenecks FSDP shards, and it is the kind of bug that only shows up in throughput numbers nobody is looking at. A loud rejection at YAML-load time costs zero engineering hours; a silent 30% perf cliff costs two days.
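
The rejection is a single validator. A minimal sketch, assuming a recipe model with an n_gpus field (the real field and model names live in mindxtrain/config/schema.py and may differ):

from pydantic import BaseModel, field_validator

class RecipeConfig(BaseModel):
    # Hypothetical subset of the recipe schema.
    n_gpus: int = 1

    @field_validator("n_gpus")
    @classmethod
    def one_or_eight_only(cls, v: int) -> int:
        # 2- and 4-card MI300X subsets see asymmetric xGMI bandwidth; fail loudly at YAML-load time.
        if v not in (1, 8):
            raise ValueError(f"n_gpus={v} is not supported on MI300X: use 1 or 8")
        return v

RecipeConfig(n_gpus=4)  # raises ValidationError before the probe ever runs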

3. Why AOT-only matters for production training

“AOT-only” is short for “ahead-of-time only — no JIT autotune in production.” It is a one-line policy with surprisingly large consequences:

  • Determinism. JIT autotune: cold cache picks a different kernel each restart; loss curves drift across runs. AOT-only: same plan, same kernels, hash-equal outputs.
  • Cold-start latency. JIT autotune: first batch stalls while autotune searches. AOT-only: first batch runs at steady-state.
  • Reproducibility across machines. JIT autotune: different cache state, different decisions, different artifacts. AOT-only: the plan ships with the artifact; another machine with the same plan trains the same way.
  • Auditability. JIT autotune: decisions are runtime and ephemeral. AOT-only: decisions are a JSON file you can cat.
  • Provenance. JIT autotune: the manifest can't hash a runtime decision. AOT-only: the BLAKE3 hash of the plan goes into the manifest.

This is the cypherpunk2048 reproducibility standard applied to the ROCm reality. In a chain-anchored provenance pipeline, the autotune plan is part of the artifact, not part of the environment. Two independent operators with the same recipe and the same plan produce the same checkpoint. The receipt — mindxtrain receipt <manifest.json> --config run.yaml — verifies it.

4. The plan flowing through the pipeline

The training layer reads the plan, sets the env vars, and dispatches. Concretely:

# Step 1: emit the plan (≤60s on MI300X)
uv run mindxtrain bench --config qwen3_8b_sft_lora.yaml --out plan.json

# Step 2: train, consuming the plan
uv run mindxtrain train qwen3_8b_sft_lora.yaml --plan plan.json

Inside mindxtrain/train/dispatch.py, the plan determines three things (a minimal dispatch sketch follows the list):

  • Which backend to dispatch to: Axolotl, Unsloth, torchtune, Primus-Turbo, or in-process TRL. The routing is method-driven: SFT goes to Axolotl by default, GRPO/GSPO go to TRL, and full-FSDP-32B goes to Primus-Turbo.
  • Which env vars to set before subprocess-launching accelerate. The seven mandatory MI300X keys are baseline; the plan may override their values but never remove keys.
  • Which Axolotl flags or torchtune CLI args correspond to the chosen attention backend.
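
A minimal sketch of that dispatch, with hypothetical key, script, and flag names (the real logic lives in mindxtrain/train/dispatch.py and covers more than this):

import os
import subprocess

# Baseline keys that are always present; only three of the seven are shown here.
MANDATORY_MI300X_ENV = {
    "NVTE_CK_USES_BWD_V3": "1",
    "NVTE_CK_IS_V3_ATOMIC_FP32": "1",
    "PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32": "1",
}

# Method-driven routing, per the defaults above.
BACKEND_BY_METHOD = {"sft": "axolotl", "grpo": "trl", "gspo": "trl", "full_fsdp_32b": "primus_turbo"}

def dispatch(recipe: dict, plan: dict) -> None:
    backend = BACKEND_BY_METHOD[recipe["method"]]
    # The plan may override values of the mandatory keys but can never remove them.
    env = {**os.environ, **MANDATORY_MI300X_ENV, **plan.get("env", {})}
    # Hypothetical entry scripts and flag; in practice the attention choice maps to
    # backend-specific Axolotl config keys or torchtune CLI args.
    cmd = ["accelerate", "launch", f"scripts/train_{backend}.py", recipe["config_path"],
           "--attn-backend", plan["attn_backend"]]
    subprocess.run(cmd, env=env, check=True)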

The plan is small, human-readable, BLAKE3-hashed into the manifest, and committed alongside the checkpoint. mindxtrain receipt re-hashes it on demand. There is no hidden state.
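
Hashing the plan into the manifest is a single call; the sketch below assumes the blake3 PyPI package and an illustrative manifest key.

import json
import blake3  # pip install blake3

def record_plan_hash(plan_path: str, manifest_path: str) -> str:
    """Compute the BLAKE3 digest of the plan file and store it in the run manifest."""
    with open(plan_path, "rb") as f:
        digest = blake3.blake3(f.read()).hexdigest()
    with open(manifest_path) as f:
        manifest = json.load(f)
    manifest["autotune_plan_blake3"] = digest    # illustrative key name
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return digest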

5. Budget and stretch — when 60 seconds isn’t enough

Most recipes fit comfortably in a 60-second budget. The MoE recipes don’t: qwen3_30b_a3b_lora and qwen3_6_35b_a3b_lora set budget_seconds: 90 and 120 respectively, because expert imbalance means the probe has to time more shapes to make a credible decision. Even at 120 seconds, the probe is <1% of a 4-hour training run’s wall-clock, and the cost-amortization is favorable.

The stretch path — for users who want to enumerate hipBLASLt heuristics or grid-search RCCL channel counts — is to run mindxtrain bench --policy enumerate once, save a richer plan, and reuse it across runs of the same shape. This stays AOT: the enumeration happens before training, the result is captured to disk, the loop never re-tunes. Same discipline, larger search.

6. Why no competitor framework ships this

The five major open training frameworks (Axolotl, Unsloth, torchtune, LLaMA-Factory, Primus-Turbo) each handle one slice of the problem. Axolotl is a great trainer-orchestrator. Unsloth is a great kernel-level optimizer. torchtune is a great PyTorch-native reference impl. Primus-Turbo is a great AMD-native trainer. None of them emit a static, hash-able, machine-portable AutotunePlan that the loop consumes verbatim. The reason is that they are framework-shaped: their job is to train. The autotune layer is integration-shaped: its job is to make a defensible kernel choice and capture it.

mindXtrain’s claim is that the integration layer is the product. The 60-second probe is the spine of the Application of Technology axis for the lablab judging — but more importantly, it is the spine of the project’s reproducibility story. Without it, a Qwen3-8B fine-tune on an MI300X is “we trained it, here’s the checkpoint, hopefully it works on your box too.” With it, the checkpoint comes with a plan that says exactly which kernels were used and exactly which env vars were set, signed by a BLAKE3 hash and anchored to a write-once contract on Base.

7. Tomorrow

Day 3 is the actual LoRA fine-tune of amd/Instella-3B-Instruct on MI300X, using the plan that today’s probe emitted. The dataset is curated and packed; the recipe is validated; the plan is BLAKE3’d. The training run produces a checkpoint directory, an eval.json from lm-eval-harness, and a quantized FP8 directory via AMD Quark. Day 5 is when the operator endpoint goes live and the demo URL exists. That post is here.



Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. The 60-second AOT autotune probe lives in mindxtrain/autotune/; the schema enforcing AOT-only lives in mindxtrain/config/schema.py.
