Day 1 of the AMD × lablab.ai Developer Hackathon. Today the scaffolding goes up: mindXtrain, a one-command Qwen3 fine-tuner native to AMD MI300X. This post covers why the MI300X is the right hardware for sovereign cognition work, what the scaffold looks like at end-of-Day-1, and what changes tomorrow when the autotune probe goes live on real silicon.
1. Why MI300X, specifically, for this work
The argument starts with one number: 192 GB of HBM3 per GPU. A Qwen3-8B BF16 LoRA at bs=8 seq=4096 fits with massive headroom on a single MI300X. The same workload on H100 80 GB requires either quantizing the base weights — which changes the result you are trying to measure — or splitting across two cards over PCIe or NVLink, which costs you the second card and the interconnect tax. The economics fall out of the memory math. AMD Developer Cloud sells MI300X at $1.99/hr; the H100 list price for a single card is $4.00/hr. A 1B-token training run lands at ~$3 on MI300X versus ~$32 on 2× H100. Roughly 10× cheaper, and the MI300X path stays single-GPU and stays in BF16.
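The cost claim is easy to sanity-check. A minimal sketch, working backwards from the post's own ~$3 and ~$32 figures at the quoted rental rates — the implied run lengths (about 1.5 hours on the MI300X, about 4 hours on the 2× H100 pair) are my inference, not numbers stated above:

```python
# Back-solving the headline cost figures from the quoted rental rates.
# The run lengths are inferred assumptions, not measurements from the post.

def cost(hours: float, rate_per_hr: float, gpus: int = 1) -> float:
    """Total rental cost in USD for a run of `hours` on `gpus` cards."""
    return hours * rate_per_hr * gpus

mi300x = cost(hours=1.5, rate_per_hr=1.99)          # single MI300X at $1.99/hr
h100x2 = cost(hours=4.0, rate_per_hr=4.00, gpus=2)  # two H100s at $4.00/hr each

print(f"MI300X ${mi300x:.2f} vs 2xH100 ${h100x2:.2f} ({h100x2 / mi300x:.1f}x)")
```

Under those assumed run lengths the ratio lands at roughly 10.7×, which is consistent with the "roughly 10× cheaper" framing.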
The second argument is that the AMD stack is more first-class than the consensus narrative gives it credit for. ROCm 7.2.1 ships AOTriton, AITER, Composable Kernel, hipBLASLt, RCCL, Optimum-AMD, Quark, Primus-Turbo, vLLM-ROCm, SGLang — all working, all current, all integrable. The reason this isn’t obvious is that nobody has wired them together with one CLI, one YAML, and one container. mindXtrain is that wire.
The third argument is sovereignty. The PYTHAI/DELTAVERSE thesis is that a small operator should be able to take a base model, fine-tune it on their own data, quantize it, serve it from a VPS they own, anchor the provenance on a chain they trust, and rent it to other agents — without ever touching a hyperscaler. MI300X plus a single droplet plus the right integration layer makes that workable. The GPU isn’t sovereign yet, but the rest of the chain can be, and the GPU is fungible.
2. What shipped on Day 1
The Day 1 deliverables are green. No GPU was required for any of this; everything below runs on a CPU laptop.
- uv workspace, single-package Python 3.12 (pinned `>=3.12,<3.13`).
- ~100 Python modules across CLI, autotune, data, train, eval, deploy, storage, provenance, operator.
- 12 YAML training recipes in `mindxtrain/train/recipes/`: `instella_3b_lora`, `qwen3_8b_sft_lora`, `qwen3_8b_sft_full`, `qwen3_8b_cpt`, `qwen3_30b_a3b_lora`, `qwen3_32b_full_fsdp`, `qwen3_32b_dpo`, `qwen3_32b_orpo`, `qwen3_32b_grpo`, `qwen3_6_27b_lora`, `qwen3_6_35b_a3b_lora`, `qwen3_vl_8b_sft`.
- 122 tests passing on a base CPU-only install (the original day-one target was 27 — we overshot).
- Pydantic schema with `extra: forbid` and `frozen: true` on every model. Unknown YAML keys raise `ValidationError`; loaded configs are immutable.
- Foundry contracts for write-once provenance anchoring (no proxy, no admin keys, no setters) in `contracts/src/{mindxtrain_registry,x402_receiver}.sol`.
- FastAPI operator + Coach UI. The Coach serves all 12 recipes from a browser — no GPU needed.
- Doc hub under `docs/`: architecture, autotune, CLI, YAML schema, Coach, blueprints, hackathon submission plan.
3. The schema is the contract
The interesting Day-1 design choice — and the one that will pay off for the rest of the week — is that the YAML schema enforces MI300X invariants at parse time, not at training time. Two examples worth calling out:
| Invariant | Where | Why |
|---|---|---|
| `hardware.gpus: Literal[1, 8]` | `config/schema.py` | 2- and 4-GPU MI300X subsets have asymmetric xGMI bandwidth that silently bottlenecks FSDP. A loud `ValidationError` is much better than a quiet 30% perf regression nobody traces for two days. |
| `autotune.policy = aot_only` | `config/schema.py` | JIT autotune (Triton on cold start, `torch.compile(mode="max-autotune")`, MIOpen find-mode) breaks reproducibility. The schema says no. |
Both are tested in tests/test_config_schema.py. The test_xgmi_2gpu_rejected test specifically asserts that asking for 2 GPUs blows up before any code touches the GPU. The test_all_recipes_validate test loops over every YAML in recipes/ and asserts the schema accepts it — meaning every recipe ships with the seven mandatory MI300X env vars and the AOT-only policy by construction.
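The shape of those invariants in Pydantic v2 is worth seeing concretely. A minimal sketch — field names are illustrative, the real schema lives in `config/schema.py`:

```python
# Minimal sketch of parse-time invariants in the style described above.
# Model and field names are illustrative, not mindXtrain's actual schema.
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class Hardware(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    gpus: Literal[1, 8]  # 2- and 4-GPU xGMI subsets are unrepresentable


class Autotune(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    policy: Literal["aot_only"] = "aot_only"  # JIT autotune cannot be expressed


class Recipe(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    hardware: Hardware
    autotune: Autotune = Autotune()


ok = Recipe(hardware={"gpus": 1})  # valid single-GPU recipe

try:
    Recipe(hardware={"gpus": 2})   # asymmetric-xGMI subset: rejected at parse
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")

try:
    ok.hardware = Hardware(gpus=8)  # frozen: loaded configs are immutable
except ValidationError:
    print("immutable")
```

`extra="forbid"` is what turns an unknown YAML key into a loud `ValidationError` instead of a silently ignored typo, and `frozen=True` makes post-load mutation raise as well.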
This is the same discipline that makes the Solidity contracts write-once: encode the invariants where they cannot be bypassed. A future Claude or a future contributor cannot “just turn on” the 2-GPU path or the JIT autotune by accident. They have to file a PR that breaks the test suite, which is loud, reviewable, and traceable.
4. The hero workload target
Day 1’s job is to make the targets explicit so the rest of the week has clear gates to hit. The hero workload — Qwen3-8B SFT-LoRA on a single MI300X in BF16 — has four numbers it must put up by Day 5:
| Metric | Target | Why this matters |
|---|---|---|
| Throughput | >15 000 tok/s | Anything below this and the cost story stops being interesting. |
| MFU | >40% | Demonstrates the autotune layer isn’t theatrical — kernels are actually being driven. |
| Time-to-loss-1.5 | <90 min | Lets the demo video show a fully converged loss curve in real time. |
| Total cost | <$3 | Headline slide. $3 vs $32 on the H100 baseline. |
If Day 5 hits all four, the cost slide writes itself. The Day 5 post will report numbers against this table.
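The throughput and MFU gates are not independent; one implies progress on the other. A back-of-envelope check, assuming the common 6N-FLOPs-per-token training approximation and an MI300X dense BF16 peak of ~1307 TFLOPS (both are my assumptions, and 6N slightly overstates LoRA, where weight gradients for the frozen base are skipped):

```python
# Cross-checking the table's throughput floor against its MFU floor.
# Peak TFLOPS and the 6N FLOPs/token rule are assumptions, not measurements.

PEAK_BF16_TFLOPS = 1307.4   # assumed MI300X dense BF16 peak
PARAMS = 8e9                # Qwen3-8B
TOK_PER_S = 15_000          # the table's throughput floor

flops_per_token = 6 * PARAMS                       # fwd + bwd approximation
achieved_tflops = flops_per_token * TOK_PER_S / 1e12
mfu = achieved_tflops / PEAK_BF16_TFLOPS

print(f"{achieved_tflops:.0f} TFLOPS achieved -> {mfu:.0%} MFU")
```

Under those assumptions, hitting 15 000 tok/s already lands above the 40% MFU gate, so the two targets are mutually consistent rather than two separate bets.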
5. What changes tomorrow
Day 2 is when the MI300X shows up. The 60-second AOT autotune probe — the differentiator that the rest of the project is built around — runs on real silicon for the first time. Three measurements get captured:
- Attention: torch’s `scaled_dot_product_attention` timed across four representative shapes on Composable Kernel and AOTriton. The faster backend wins, and the plan locks it in.
- GEMM: hipBLASLt 0.10’s default heuristic for gfx942 BF16/FP16 GEMMs measured against the LoRA-rank-16-to-64 / hidden-2048-to-8192 shapes. Heuristic enumeration is post-hackathon work; for the hackathon window the default is good enough if it benchmarks within 5% of hand-tuned.
- RCCL: 1-GPU is no-op; 8-GPU sets the env block. The schema already rejects 2/4-GPU, so the probe doesn’t have to handle those.
Output is a static AutotunePlan JSON, BLAKE3-hashed into the manifest. The training layer reads it, sets the env vars, picks the backend, and launches accelerate. Nothing re-tunes during the loop. The full Day 2 deep-dive is in the next post.
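The static-plan idea is simple enough to sketch. A minimal illustration — the field names are hypothetical, and since BLAKE3 is not in the Python standard library, `hashlib`'s BLAKE2b stands in here for the manifest hash:

```python
# Sketch of a static autotune plan frozen into a hashed JSON artifact that
# the training layer consumes verbatim. Field names are illustrative, and
# hashlib's BLAKE2b stands in for BLAKE3 (which is not in the stdlib).
import hashlib
import json

plan = {
    "attention_backend": "aotriton",        # example winner of the SDPA race
    "gemm": "hipblaslt_default_heuristic",  # gfx942 default heuristic
    "rccl_env": {},                         # empty on 1 GPU; populated on 8
}

# Canonical serialization so the same plan always produces the same hash.
blob = json.dumps(plan, sort_keys=True, separators=(",", ":")).encode()
digest = hashlib.blake2b(blob, digest_size=32).hexdigest()

manifest_entry = {"autotune_plan": plan, "plan_hash": digest}
print(digest[:16])
```

The point of hashing a canonical serialization is that the training run can verify it is consuming exactly the plan the probe produced — any re-tune or hand edit changes the digest and breaks the manifest.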
6. The integrated story
mindXtrain is not just a training framework. It is the AMD-shaped half of a larger thesis: that a base model + fine-tune + quantize + serve + provenance-anchor + rent-via-x402 pipeline can run end-to-end on hardware a small operator can afford, with no hyperscaler in the loop. mindX (the cognitive runtime), AgenticPlace (the agent marketplace), BANKON (the identity and settlement layer), and rage.pythai.net (you are here, the build-in-public archive) are the other halves. The hackathon is where the AMD half stops being a slide and becomes shipping code.
Heading to AMD Developer Cloud now to provision the MI300X droplet for tomorrow’s autotune probes. The hard part — making Composable Kernel and Triton race head-to-head on the GPU and capturing the wow-moment for the demo video — starts then.
Related articles
- mindXtrain — one-command Qwen3 fine-tuning on AMD MI300X (project overview)
- The 60-second AOT autotune probe — how mindXtrain pins MI300X performance before training starts
- mindXtrain demo is live — Qwen3-8B on a single MI300X for less than $3
Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. Built for AMD × lablab.ai Developer Hackathon, May 4–10 2026.
