Day 5 of the AMD × lablab.ai Developer Hackathon. The demo URL is live: mindx.pythai.net/hackathon. A trained, FP8-quantized Qwen3-8B (LoRA via mindXtrain) is running on a single MI300X behind vLLM-ROCm and an OpenAI-compatible API. No auth required during the hackathon judging window. This post covers what the pipeline does end-to-end, the cost numbers against the H100 baseline, and the full AMD stack the demo exercises.
1. The pipeline you can poke at
The endpoint is OpenAI-compatible. From any terminal:
```bash
curl https://mindx.pythai.net/hackathon/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-mindxtrain-fp8",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain why MI300X has 192 GB of HBM3 in one paragraph."}
    ],
    "max_tokens": 256
  }'
```
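The response comes back in the standard OpenAI chat-completions shape, so pulling out just the assistant text is one `jq` filter away (assuming `jq` is installed):

```bash
# Same request, keeping only the assistant's reply (requires jq).
curl -s https://mindx.pythai.net/hackathon/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-mindxtrain-fp8",
       "messages": [{"role": "user", "content": "One sentence on HBM3 capacity."}],
       "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'
```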
What is happening behind that single curl, in order:
- Qwen3-8B base model from Alibaba’s Qwen team (Apache-2.0).
- Fine-tuned via mindXtrain LoRA on MI300X. The 60-second AOT autotune probe (deep-dive in the Day 2 post) measured Composable Kernel attention as the winner for this shape, locked the hipBLASLt default heuristic, and wrote the plan to disk before training started. The training loop never re-tuned.
- Quantized via AMD Quark FP8 PTPC into a vLLM-loadable directory. PTPC (per-token activation, per-channel weight) was 15–30% faster than BlockScale on MI300X for this model size in our measurements, which is what the project's defaults reflect.
- Served behind `mindxtrain.operator`'s OpenAI-compatible `/v1/chat/completions`. The operator routes to vLLM-ROCm with the qwen3 reasoning parser and the hermes tool-call parser configured; a tool-call sketch follows this list.
- BLAKE3 provenance manifest emitted, pinned to Lighthouse / IPFS, and registered with the mindX API. Run `mindxtrain receipt <manifest.json> --config run.yaml` against the published manifest to round-trip-verify.
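Because the hermes tool-call parser is wired in, the standard OpenAI `tools` field works against the same endpoint. A minimal sketch (the `get_weather` function is a made-up example, not something shipped in the repo):

```bash
# Hypothetical tool-call round trip; the get_weather tool is illustrative.
curl https://mindx.pythai.net/hackathon/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-mindxtrain-fp8",
    "messages": [{"role": "user", "content": "Weather in Austin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

If the parser does its job, the response carries a `tool_calls` array instead of plain `content`.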
2. The cost slide
The headline. Same workload — Qwen3-8B SFT-LoRA, 1B tokens, BF16 unquantized — on two stacks:
| Stack | Hardware | $/GPU-hr | GPUs | Hours | Total |
|---|---|---|---|---|---|
| mindXtrain on AMD Developer Cloud | 1× MI300X (192 GB HBM3) | $1.99 | 1 | ~1.5 | ~$3 |
| Equivalent on H100 | 2× H100 (80 GB) | $4.00 | 2 | ~4 | ~$32 |
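The totals are plain arithmetic: 1 GPU × $1.99/GPU-hr × ~1.5 h ≈ $3 on the MI300X side, versus 2 GPUs × $4.00/GPU-hr × ~4 h = $32 on the H100 side.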
The MI300X path doesn’t have to fall back to FP8 to fit the activation tensors. 192 GB HBM3 is doing real work. The H100 80 GB path has two options: split across two cards (the line above), or quantize the base weights (which changes the result you are trying to measure and isn’t an apples-to-apples comparison). Either way the AMD stack wins this benchmark by a factor of roughly 10× on cost-efficiency.
This is also the slide where the integration story earns its keep. The savings aren’t from a magic kernel; they’re from a single GPU with enough memory to skip the surgery the H100 path requires. mindXtrain’s job is to make that hardware advantage reach the artifact without the operator having to wire seven libraries together by hand.
3. The full AMD stack the demo exercises
Everything below is current, working, and integrated end-to-end in the live demo. No mocks, no stubs, no “coming soon.”
- ROCm 7.2.1 — the base layer. Container is `rocm/primus:v26.2`, SHA256 digest pinned in `ops/containerfiles/digest.lock`.
- AOTriton — AMD's ahead-of-time-compiled Triton. Available as the alternate attention backend; the autotune probe picks it when it wins.
- AITER + Composable Kernel — AMD's hand-tuned ASM kernel libraries. Won the attention bake-off for Qwen3-8B at the recipe-default shape.
- hipBLASLt — GEMM library. The default heuristic for gfx942 BF16/FP16 is within 5% of hand-tuned for the shapes this model hits, confirmed by the Day 2 probe.
- RCCL — collective comms. A no-op for this single-GPU run; `NCCL_MIN_NCHANNELS=112` and `GPU_MAX_HW_QUEUES=1` are baked into the recipe for the 8-GPU path.
- Optimum-AMD — the HuggingFace integration layer for AMD GPUs.
- AMD Quark FP8 PTPC — quantization. 15–30% faster than BlockScale on MI300X at this model size.
- Primus-Turbo + torchtitan-amd — alternate trainer backends; reachable from the dispatch layer for full-FSDP runs.
- vLLM-ROCm — serving. Qwen3 reasoning parser + hermes tool-call parser configured; see the launch sketch after this list.
- SGLang-ROCm — peer-class serving backend; reachable via the operator's backend registry, not the default for this demo.
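For orientation, the vLLM-ROCm serving configuration above maps onto vLLM's standard launch flags roughly as follows. This is a sketch, not the operator's actual launch template; the model path and port are illustrative:

```bash
# Approximate vLLM-ROCm launch matching the demo's serving setup.
# Model path and port are illustrative; mindxtrain.operator generates
# the real invocation from its backend registry.
vllm serve ./out/quark_fp8/qwen3-8b-mindxtrain-fp8 \
  --reasoning-parser qwen3 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --port 8000
```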
4. The provenance receipt
Every artifact in the demo has a manifest.json with a BLAKE3 hash of the YAML recipe, the dataset shards, the checkpoint directory, and the eval JSON, plus pointers to the HuggingFace Hub repo, the Lighthouse Storage CID, and the on-chain anchor transaction on Base. To verify:
```bash
uv run mindxtrain receipt ./out/runs/qwen3_8b_sft_lora/manifest.json \
  --config qwen3_8b_sft_lora.yaml
```
The receipt re-hashes everything and round-trip-verifies. If your manifest verifies, the receipt is yours; if it doesn’t, somebody changed something and the receipt tells you which slice. The on-chain anchor is a single immutable contract — mindxtrain_registry.sol, no admin, no upgrade — that records the BLAKE3 digest and the CID. Cypherpunk2048.
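The hashes are plain BLAKE3, so any single slice can also be spot-checked by hand with the reference `b3sum` CLI (`cargo install b3sum`) and compared by eye against the matching entry in `manifest.json`. The recipe file below is one of the hashed inputs; the manifest's exact field layout is whatever mindxtrain emitted:

```bash
# Recompute the recipe's BLAKE3 digest with the reference CLI and
# compare it to the recipe entry in manifest.json by eye.
b3sum ./qwen3_8b_sft_lora.yaml
```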
5. The submission tomorrow
Submitting to lablab.ai tomorrow morning. Three primary tracks plus two side challenges:
| Track | Primary deliverable | Status |
|---|---|---|
| Fine-Tuning on AMD GPUs | LoRA SFT of `amd/Instella-3B` and `Qwen/Qwen3-8B` on MI300X | Live in demo |
| AI Agents & Agentic Workflows | `mindxtrain.operator` serving the trained model behind `/v1/chat/completions` | Live in demo |
| Vision & Multimodal AI | `qwen3_vl_8b_sft` recipe shipped | Recipe in repo; not in headline demo |
| Build-in-Public (meta) | Three technical posts tagged #AMDDevHackathon | You are reading post #3 |
| Best Use of Qwen (cross-cutting) | Qwen3-8B is the headline run; Qwen3.6 recipes wired | Live in demo |
The case for Best Overall is that this is one repo, one demo, one container, end-to-end on AMD, with on-chain provenance — and the trained model is a directly-rentable agent through AgenticPlace + x402-Algorand metering, not just a checkpoint sitting on HuggingFace Hub.
6. The asks and the receipts
Everything is open. The full repo is Apache-2.0 (with an explicit MIT-compatibility statement in LICENSE-NOTICE.md for the lablab spec). All the receipts:
- GitHub: github.com/codephreak/mindxtrain
- Live demo URL: mindx.pythai.net/hackathon
- Project overview: rage.pythai.net/mindxtrain
- Day 1 post (scaffold + thesis): rage.pythai.net/mindxtrain-day-1-mi300x
- Day 2 post (autotune deep-dive): rage.pythai.net/mindxtrain-day-2-autotune
To the AMD developer team: the ROCm 7.2.1 stack works. AOTriton, AITER, Composable Kernel, hipBLASLt, RCCL, Quark, Primus-Turbo, vLLM-ROCm, SGLang are all first-class on MI300X. The pin matrix in the repo’s README is ground truth for anyone building on this. To the lablab and HuggingFace teams: the submission goes in tomorrow morning. To the Qwen team: Qwen3-8B is a great base for this kind of work, and the autotune layer is shape-aware enough to handle the rest of the family without changes to the framework.
Five days, one repo, one demo, one MI300X. End-to-end on AMD.
Related articles
- mindXtrain — one-command Qwen3 fine-tuning on AMD MI300X (project overview)
- mindXtrain Day 1 — Why MI300X for sovereign cognition
- The 60-second AOT autotune probe — how mindXtrain pins MI300X performance before training starts
Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. Live demo: mindx.pythai.net/hackathon. License: Apache-2.0 with MIT-compatibility statement.
