mindXtrain — One-Command Qwen3 Fine-Tuning on AMD MI300X

mindXtrain is the first one-command Qwen3 fine-tuner natively optimized for AMD MI300X. It is the AMD-shaped half of the PYTHAI/DELTAVERSE stack: a single Python package that takes a YAML recipe and produces a trained, evaluated, FP8-quantized, served, and on-chain-anchored model — all on a single MI300X, all driven by a 60-second on-device autotune that pins kernel and collective choices before training starts. This post is the canonical landing page for the project. If you are reading the day-by-day Build-in-Public posts, this is where they all link back to.


1. Why this exists

Fine-tuning a Qwen3-class model end-to-end is currently a multi-day exercise across mismatched tools. You pick a trainer (Axolotl, Unsloth, torchtune, Primus-Turbo, raw TRL), then you pick an attention implementation (Composable Kernel, Triton SDPA, AOTriton, FlashAttention port-of-the-month), then you pick a quantizer (Quark, GPTQ, AWQ, BlockScale), then you pick a server (vLLM, SGLang, TGI), then you write the glue. Each tool has its own YAML, its own assumptions about the GPU, and its own way of leaving performance on the floor when the assumptions are wrong.

mindXtrain collapses that surface. One CLI verb per stage. One Pydantic-validated config per run. One container image — the AMD-published rocm/primus:v26.2 with a SHA256 digest pinned in ops/containerfiles/digest.lock. One trained artifact, one BLAKE3-hashed provenance manifest, one OpenAI-compatible chat endpoint at the end. The architectural opinion is that the integration is the product. AMD already shipped the kernels. AMD already shipped the GPU. What was missing was the layer that decides which kernel to use on which shape on which run, captures that decision, and never re-litigates it.

2. The differentiator — a 60-second AOT autotune probe

Before each training run, mindXtrain runs a short on-device micro-benchmark. It probes attention kernels (Composable Kernel vs Triton vs AOTriton) on the actual shapes the run will hit, picks the hipBLASLt GEMM heuristic on those same shapes, and resolves the collective topology: 1 GPU is a no-op; 8 GPUs set NCCL_MIN_NCHANNELS=112 and GPU_MAX_HW_QUEUES=1; 2- and 4-GPU paths are rejected at schema time, because xGMI bandwidth between MI300X subsets is asymmetric and silently bottlenecks FSDP shards.

The probe writes its decisions into an AutotunePlan JSON. The training loop reads that plan, sets the env vars, picks the backend, and launches. Nothing re-tunes during the loop. No torch.compile(mode="max-autotune") in production. No JIT autotune in vLLM. The autotune policy is aot_only as a YAML contract, enforced by the schema, tested in tests/test_config_schema.py.
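
For intuition, the plan is just a small JSON document. The sketch below writes a plausible one; every field name here is an assumption for illustration, not the project's actual schema.

# Illustrative only: field names are assumed, not mindXtrain's real schema.
import json

plan = {
    "attention_backend": "ck",             # winner of the CK vs Triton vs AOTriton probe
    "gemm_heuristic": "hipblaslt_algo_7",  # hypothetical hipBLASLt pick for the run's shapes
    "collective_topology": "single",       # 1-GPU no-op; the 8-GPU path sets NCCL channels
    "env": {"NCCL_MIN_NCHANNELS": "112", "GPU_MAX_HW_QUEUES": "1"},
    "probed_shapes": [[8, 4096, 4096]],    # batch, sequence, hidden dims actually probed
}

with open("plan.json", "w") as f:
    json.dump(plan, f, indent=2)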

This is the cypherpunk2048 reproducibility standard applied to the ROCm 7.2.1 reality. Same plan, same kernels, hash-equal outputs across machines. No competitor framework ships this discipline. The 60 seconds you spend before training pay for themselves in the throughput delta and pay again in not having to debug a non-deterministic loss curve at 03:00 because Triton picked a different kernel on a cold cache. The deeper write-up is in the Day 2 Build-in-Public post.

3. Architecture in five concentric layers

Each inner layer is consumed by the next, never the reverse. This is enforced by import discipline in mindxtrain/ and by the test suite.

| Layer | Module | Responsibility |
| --- | --- | --- |
| 1 | mindxtrain/cli/main.py | Typer CLI: init · bench · train · dataset prep · eval · quantize · serve · publish · receipt. Never reaches into a backend; consumes a validated config plus an AutotunePlan and dispatches. |
| 2 | mindxtrain/autotune/ | The 60-second probe. Emits AutotunePlan JSON. AOT-only. |
| 3 | mindxtrain/data/ | Dataset pipeline: curate → MinHash + SemDeDup dedupe → filter → tokenize → pack → synth → verify. |
| 4 | mindxtrain/train/ | Backend dispatch into Axolotl, Unsloth, torchtune, Primus-Turbo, or in-process TRL. Methods: SFT, DPO, ORPO, GRPO, GSPO, RLHF, tool-use, CPT. |
| 5 | mindxtrain/{eval,deploy,storage,provenance,operator} | Quark FP8 / MXFP4 → lm-eval-harness → HF Hub push → Lighthouse pin → mindX register → AgenticPlace → BANKON ENS → x402 metering → ERC-8004 attestation. |
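
One way to enforce that import discipline mechanically is an AST walk over the package. The test below is a hypothetical illustration of the rule, not the project's actual test; it assumes only the module names from the table.

# Hypothetical layering test: inner layers must never import outer ones.
import ast
from pathlib import Path

# Layer groups in inner-to-outer order; layer 5 spans five sibling modules.
LAYER_GROUPS = [["cli"], ["autotune"], ["data"], ["train"],
                ["eval", "deploy", "storage", "provenance", "operator"]]

def mindxtrain_imports(path: Path) -> set[str]:
    """Top-level mindxtrain submodules imported by one source file."""
    mods: set[str] = set()
    for node in ast.walk(ast.parse(path.read_text())):
        if isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)
        elif isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
    return {m.split(".")[1] for m in mods if m.startswith("mindxtrain.")}

def test_inner_layers_never_import_outer():
    for i, group in enumerate(LAYER_GROUPS):
        outer = {m for g in LAYER_GROUPS[i + 1:] for m in g}
        for layer in group:
            for src in Path("mindxtrain", layer).rglob("*.py"):
                hits = mindxtrain_imports(src) & outer
                assert not hits, f"{src} imports outer layer(s): {hits}"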

The end-to-end flow: XTrainConfig (Pydantic, extra: forbid, frozen: true) plus AutotunePlan → dispatch_training() → checkpoint/ + eval.json → quantized/ + manifest.json (BLAKE3 of YAML + dataset + checkpoint + eval, plus HF/Lighthouse/INFT/ASA pointers). mindxtrain receipt re-hashes and verifies the manifest round-trip. The operator FastAPI then serves on /v1/chat/completions in front of vLLM-ROCm or SGLang.
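
Collapsed into pseudocode, the hand-off looks roughly like the following. Only dispatch_training() is named above; the module paths and the from_yaml/load_plan helpers are assumptions.

# Sketch of the hand-off; module paths and helper names are assumed.
from mindxtrain.config import XTrainConfig      # assumed module path
from mindxtrain.autotune import load_plan       # assumed helper
from mindxtrain.train import dispatch_training  # named in the layer table

config = XTrainConfig.from_yaml("run.yaml")     # unknown keys raise ValidationError
plan = load_plan("plan.json")                   # AOT decisions, never re-tuned
dispatch_training(config, plan)                 # emits checkpoint/ and eval.json
# quantize, publish, and serve are separate CLI verbs downstream.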

4. The numbers — $3 vs $32

The cost slide is the headline. Same workload, same model, same token budget, two stacks:

| Stack | Hardware | Hourly cost | Hours | Total |
| --- | --- | --- | --- | --- |
| mindXtrain on AMD Developer Cloud | 1× MI300X (192 GB HBM3) | $1.99/hr | ~1.5 | ~$3 |
| Equivalent on H100 | 2× H100 (80 GB each) | $4.00/hr × 2 | ~4 | ~$32 |

Roughly 10× cost-efficiency. The MI300X path doesn’t need to fall back to FP8 to fit the activation tensors — 192 GB HBM3 swallows a Qwen3-8B BF16 LoRA at bs=8 seq=4096 with massive headroom. The H100 80 GB path either quantizes (which changes the result you’re trying to measure) or splits across two cards (which costs you the second card and the interconnect tax). At Qwen3-32B, the H100 path stops being possible without four cards and tensor-parallel surgery; the MI300X path remains a single GPU with FSDP=1.
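
A back-of-envelope check of the headroom claim, using assumed Qwen3-8B-ish dimensions rather than measured numbers:

# Rough VRAM arithmetic; omits gradients, optimizer state, attention
# workspaces, and allocator overhead, which is where an 80 GB card runs out.
params = 8e9                                  # Qwen3-8B parameter count
weights_gb = params * 2 / 1e9                 # BF16 = 2 bytes/param, ~16 GB
bs, seq, hidden, layers = 8, 4096, 4096, 36   # assumed dims for this sketch
acts_gb = bs * seq * hidden * layers * 2 * 2 / 1e9  # ~2 BF16 tensors per layer
print(f"~{weights_gb:.0f} GB weights + ~{acts_gb:.0f} GB activations of 192 GB")
# LoRA keeps optimizer state tiny (adapters only), so the total stays far
# below the MI300X's 192 GB even after the omitted terms are added back.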

5. Hackathon tracks targeted

The submission targets the AMD × lablab.ai Developer Hackathon (build window May 4–10, 2026; on-site finale May 9–10 in San Francisco at MindsDB). Three primary tracks:

| Track | Primary deliverable |
| --- | --- |
| Fine-Tuning on AMD GPUs | LoRA SFT of amd/Instella-3B-Instruct and Qwen/Qwen3-8B on a single MI300X. |
| AI Agents & Agentic Workflows | The mindxtrain.operator FastAPI serves the trained model behind /v1/chat/completions; mindX agents consume it. |
| Vision & Multimodal AI | The qwen3_vl_8b_sft recipe ships in mindxtrain/train/recipes/ as a stretch deliverable. |

Plus the Build-in-Public meta track (these posts) and Best Use of Qwen (Qwen3-8B is the secondary training run; Qwen3.6 recipes are wired up but remain a stretch goal).

6. The non-negotiables

The schema enforces a small set of MI300X invariants that are not style preferences — they are deployment bugs if violated.

  • AOT-only. No JIT autotune in production paths. The YAML key autotune.policy must equal aot_only, period.
  • hardware.gpus is Literal[1, 8]. The 2- and 4-GPU configurations are rejected at parse time because asymmetric xGMI bandwidth across MI300X subsets bottlenecks FSDP — a silent perf regression that is much worse than a loud rejection. Tested in test_config_schema.py::test_xgmi_2gpu_rejected.
  • Eight MI300X env vars are defaults in every recipe. The autotune plan can override values but never remove keys. They are: PYTORCH_ROCM_ARCH=gfx942, HSA_NO_SCRATCH_RECLAIM=1, HIP_FORCE_DEV_KERNARG=1, GPU_MAX_HW_QUEUES=1, NVTE_CK_USES_BWD_V3=1, NVTE_CK_IS_V3_ATOMIC_FP32=1, PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1, NCCL_MIN_NCHANNELS=112.
  • extra: forbid + frozen: true on every Pydantic model. Unknown YAML keys raise ValidationError; loaded configs are immutable. A minimal schema sketch follows this list.
  • Solidity contracts are write-once. No proxies, no Ownable, no admin keys, no setters in contracts/src/{mindxtrain_registry,x402_receiver}.sol. Rotating any parameter requires a fresh deploy. Cypherpunk2048.
  • numpy is pinned <2.0 against torch==2.9.1+rocm7.2.1.lw.
  • The container is rocm/primus:v26.2; SHA256 digest snapshot in ops/containerfiles/digest.lock.
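
A minimal sketch of that schema pattern; the real XTrainConfig is far larger, and the field selection here is illustrative.

# Pydantic v2 sketch of the invariants above; fields are illustrative.
from typing import Literal
from pydantic import BaseModel, ConfigDict

class Hardware(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    gpus: Literal[1, 8]                       # 2 and 4 rejected at parse time

class Autotune(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    policy: Literal["aot_only"] = "aot_only"  # the YAML contract, schema-enforced

class XTrainConfig(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True)
    hardware: Hardware
    autotune: Autotune = Autotune()

# XTrainConfig.model_validate({"hardware": {"gpus": 2}}) raises ValidationError.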

7. The provenance story

Every run produces a manifest.json with a BLAKE3 hash of the YAML recipe, the dataset shards, the checkpoint directory, and the eval JSON, plus pointers to the HF Hub repo, the Lighthouse Storage CID, the optional ERC-7857 INFT id, and the Algorand ASA id if the model is listed on AgenticPlace with x402 metering. mindxtrain receipt <manifest.json> --config run.yaml re-hashes everything and round-trip-verifies. If your manifest verifies, the receipt is yours. If it doesn’t, somebody changed something somewhere.
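
The verification loop is small enough to sketch. This assumes the blake3 PyPI package and a hypothetical manifest layout; mindxtrain receipt is the real verifier.

# Sketch of the round-trip check; the "inputs" key is a hypothetical layout.
import json
from pathlib import Path
import blake3

def file_digest(path: Path) -> str:
    h = blake3.blake3()
    h.update(path.read_bytes())
    return h.hexdigest()

manifest = json.loads(Path("manifest.json").read_text())
for entry in manifest["inputs"]:
    assert file_digest(Path(entry["path"])) == entry["blake3"], \
        f"tampered: {entry['path']}"
print("manifest verifies: the receipt is yours")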

The on-chain anchor is a single immutable contract — mindxtrain_registry.sol, no admin, no upgrade. It records the BLAKE3 digest and a CID. That’s it. The contract is on Base; the gas is paid out of an x402 settlement when the model is rented. The model becomes a directly-rentable agent, not just another checkpoint sitting on HF Hub waiting to be discovered.

8. Try it

Base install is CPU-only and runs the CLI, the Coach UI, bench --dry-run, manifest verify, and the operator FastAPI. Heavyweight paths gate on opt-in dependency groups.

git clone https://github.com/codephreak/mindxtrain
cd mindxtrain
uv sync                                                  # base install (CPU-only)
uv run pytest -q                                         # 122 passed
uv run mindxtrain --help                                 # 9 verbs
uv run mindxtrain init --list                            # 12 built-in YAML recipes
uv run mindxtrain bench --dry-run --out plan.json        # CPU-safe (real probe needs MI300X)
uv run uvicorn mindxtrain.operator.app:app --port 8080
# → http://localhost:8080/coach/  (Coach UI, all 12 recipes, no GPU required)

Live demo during the lablab judging window: mindx.pythai.net/hackathon. The chat endpoint is OpenAI-compatible; no auth is required while the hackathon runs.
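
Because the endpoint is OpenAI-compatible, the stock client works unchanged; the base URL and model name below are placeholders, not confirmed values.

# Stock OpenAI client against the operator endpoint; names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
resp = client.chat.completions.create(
    model="qwen3-8b-mindxtrain",  # hypothetical served model name
    messages=[{"role": "user", "content": "Summarize your training recipe."}],
)
print(resp.choices[0].message.content)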

9. What’s next

Post-hackathon: full ERC-7857 INFT minting on Base, full AgenticPlace listing with x402-Algorand metering on every inference call, and an automated CI loop that pins the autotune plan against the latest rocm/primus tag so that a kernel regression in upstream ROCm is caught the day it lands. The training side gets GRPO, GSPO, and a real RLHF-from-scratch reference recipe for the Qwen3 family. The serving side gets SGLang as a first-class peer to vLLM-ROCm with the same parser bookkeeping.

The thesis the project is here to defend: an MI300X plus the right integration layer is the cheapest, most reproducible way to go from a base model to a rented agent in 2026. Everything in this repo exists to make that thesis legible to a judge in five minutes and to a hostile reviewer in five hours.


Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. License: Apache-2.0 with MIT-compatibility statement.
