mindXtrain Demo is Live — Qwen3-8B on a Single MI300X for Less Than $3

Day 5 of the AMD × lablab.ai Developer Hackathon. The demo URL is live: mindx.pythai.net/hackathon. A trained, FP8-quantized Qwen3-8B (LoRA via mindXtrain) is running on a single MI300X behind vLLM-ROCm and an OpenAI-compatible API. No auth required during the hackathon judging window. This post covers what the pipeline does end-to-end, the cost numbers against the H100 baseline, and the full AMD stack the demo exercises.


1. The pipeline you can poke at

The endpoint is OpenAI-compatible. From any terminal:

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain why MI300X has 192 GB of HBM3 in one paragraph."}
        ],
        "max_tokens": 256
    }'
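
The endpoint also streams. A minimal sketch, assuming streaming is enabled on the server (vLLM's OpenAI-compatible server honors "stream": true):

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "user", "content": "One sentence on why FP8 PTPC beats BlockScale here."}
        ],
        "max_tokens": 64,
        "stream": true
    }'

Tokens come back as server-sent-event chunks, the same wire format OpenAI clients already parse.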

What is happening behind that single curl, in order:

  1. Qwen3-8B base model from Alibaba’s Qwen team (Apache-2.0).
  2. Fine-tuned via mindXtrain LoRA on MI300X. The 60-second AOT autotune probe (covered in the Day 2 deep-dive) measured Composable Kernel attention as the winner for this shape, locked the hipBLASLt default heuristic, and wrote the plan to disk before training started. The training loop never re-tuned.
  3. Quantized via AMD Quark FP8 PTPC into a vLLM-loadable directory. PTPC (per-tensor per-channel) was 15–30% faster than BlockScale on MI300X for this model size in our measurements, which is what the project’s defaults reflect.
  4. Served behind mindxtrain.operator's OpenAI-compatible /v1/chat/completions. The operator routes to vLLM-ROCm with the qwen3 reasoning parser and the hermes tool-call parser configured (a tool-call sketch follows this list).
  5. BLAKE3 provenance manifest emitted, pinned to Lighthouse / IPFS, and registered with the mindX API. Run mindxtrain receipt <manifest.json> --config run.yaml against the published manifest to round-trip-verify.
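
Step 4's hermes tool-call parser means the standard OpenAI tools field works against this endpoint. A hedged sketch; the get_weather function here is illustrative, not something the demo ships:

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "user", "content": "What is the weather in Austin?"}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }],
        "max_tokens": 128
    }'

If the operator has auto tool choice enabled, the response carries a tool_calls array instead of plain content.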

2. The cost slide

The headline. Same workload — Qwen3-8B SFT-LoRA, 1B tokens, BF16 unquantized — on two stacks:

Stack                             | Hardware                | $/hr  | GPUs | Hours | Total
mindXtrain on AMD Developer Cloud | 1× MI300X (192 GB HBM3) | $1.99 | 1    | ~1.5  | ~$3
Equivalent on H100                | 2× H100 (80 GB)         | $4.00 | 2    | ~4    | ~$32

The MI300X path doesn’t have to fall back to FP8 to fit the activation tensors. 192 GB HBM3 is doing real work. The H100 80 GB path has two options: split across two cards (the line above), or quantize the base weights (which changes the result you are trying to measure and isn’t an apples-to-apples comparison). Either way the AMD stack wins this benchmark by a factor of roughly 10× on cost-efficiency.
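
The totals are just rate × GPU count × wall-clock hours, which you can sanity-check from any shell:

# cost = $/hr × GPUs × hours
echo "scale=2; 1.99 * 1 * 1.5" | bc    # 2.98, the ~$3 in the table
echo "scale=2; 4.00 * 2 * 4" | bc      # 32.00
echo "scale=1; 32.00 / 2.98" | bc      # 10.7, the roughly-10x claim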

This is also the slide where the integration story earns its keep. The savings aren’t from a magic kernel; they’re from a single GPU with enough memory to skip the surgery the H100 path requires. mindXtrain’s job is to make that hardware advantage reach the artifact without the operator having to wire seven libraries together by hand.

3. The full AMD stack the demo exercises

Everything below is current, working, and integrated end-to-end in the live demo. No mocks, no stubs, no “coming soon.”

  • ROCm 7.2.1 — the base layer. Container is rocm/primus:v26.2, SHA256 digest pinned in ops/containerfiles/digest.lock.
  • AOTriton — AMD-flavored AOT-compiled Triton. Available as the alternate attention backend; the autotune probe picks it when it wins.
  • AITER + Composable Kernel — AMD’s hand-tuned ASM kernel library. Won the attention bake-off for Qwen3-8B at the recipe-default shape.
  • hipBLASLt — GEMM library. Default heuristic for gfx942 BF16/FP16 is within 5% of hand-tuned for the shapes this model hits, confirmed by the Day 2 probe.
  • RCCL — collective comms. A no-op for this single-GPU run; NCCL_MIN_NCHANNELS=112 and GPU_MAX_HW_QUEUES=1 are baked into the recipe for the 8-GPU path.
  • Optimum-AMD — the HuggingFace integration layer for AMD GPUs.
  • AMD Quark FP8 PTPC — quantization. 15–30% faster than BlockScale on MI300X at this model size.
  • Primus-Turbo + torchtitan-amd — alternate trainer backends; reachable from the dispatch layer for full-FSDP runs.
  • vLLM-ROCm — serving. Qwen3 reasoning parser + hermes tool-call parser configured; a serving sketch, including the RCCL pins, follows this list.
  • SGLang-ROCm — peer-class serving backend; reachable via the operator’s backend registry, not the default for this demo.
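
To make the serving row concrete, a hedged sketch of the invocation, not the operator's actual launch script. The checkpoint path and served-model name are assumptions; --reasoning-parser, --tool-call-parser, and --enable-auto-tool-choice are real vLLM flags, and the two exports are the RCCL pins from the recipe:

# RCCL pins the recipe bakes in for the 8-GPU path (no-ops on one GPU)
export NCCL_MIN_NCHANNELS=112
export GPU_MAX_HW_QUEUES=1

# serve the Quark-FP8 checkpoint; the path is hypothetical
vllm serve ./out/runs/qwen3_8b_sft_lora/quark_fp8 \
    --served-model-name qwen3-8b-mindxtrain-fp8 \
    --reasoning-parser qwen3 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice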

4. The provenance receipt

Every artifact in the demo has a manifest.json with a BLAKE3 hash of the YAML recipe, the dataset shards, the checkpoint directory, and the eval JSON, plus pointers to the HuggingFace Hub repo, the Lighthouse Storage CID, and the on-chain anchor transaction on Base. To verify:

uv run mindxtrain receipt ./out/runs/qwen3_8b_sft_lora/manifest.json \
    --config qwen3_8b_sft_lora.yaml

The receipt re-hashes everything and round-trip-verifies. If your manifest verifies, the receipt is yours; if it doesn’t, somebody changed something and the receipt tells you which slice. The on-chain anchor is a single immutable contract — mindxtrain_registry.sol, no admin, no upgrade — that records the BLAKE3 digest and the CID. Cypherpunk2048.
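
If you want to check a single slice without the CLI, the same digests are reachable by hand. A minimal sketch, assuming b3sum and jq are installed; the .recipe_blake3 key is hypothetical, so read the published manifest for the real field names:

# hash the recipe the same way the manifest did
b3sum qwen3_8b_sft_lora.yaml

# pull the recorded digest out of the manifest (key name is a guess)
jq -r '.recipe_blake3' ./out/runs/qwen3_8b_sft_lora/manifest.json

# the two hex strings should match; edit one byte of the YAML and they won't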

5. The submission tomorrow

Submitting to lablab.ai tomorrow morning. Three primary tracks plus two side challenges:

Track                            | Primary deliverable                                                        | Status
Fine-Tuning on AMD GPUs          | LoRA SFT of amd/Instella-3B and Qwen/Qwen3-8B on MI300X                    | Live in demo
AI Agents & Agentic Workflows    | mindxtrain.operator serving the trained model behind /v1/chat/completions  | Live in demo
Vision & Multimodal AI           | qwen3_vl_8b_sft recipe shipped                                             | Recipe in repo; not in headline demo
Build-in-Public (meta)           | Three technical posts tagged #AMDDevHackathon                              | You are reading post #3
Best Use of Qwen (cross-cutting) | Qwen3-8B is the headline run; Qwen3.6 recipes wired                        | Live in demo

The case for Best Overall is that this is one repo, one demo, one container, end-to-end on AMD, with on-chain provenance — and the trained model is a directly-rentable agent through AgenticPlace + x402-Algorand metering, not just a checkpoint sitting on HuggingFace Hub.

6. The asks and the receipts

Everything is open. The full repo is Apache-2.0 (with an explicit MIT-compatibility statement in LICENSE-NOTICE.md for the lablab spec). The asks, with the receipts attached:

  • To the AMD developer team: the ROCm 7.2.1 stack works. AOTriton, AITER, Composable Kernel, hipBLASLt, RCCL, Quark, Primus-Turbo, vLLM-ROCm, and SGLang are all first-class on MI300X. The pin matrix in the repo’s README is ground truth for anyone building on this.
  • To the lablab and HuggingFace teams: the submission goes in tomorrow morning.
  • To the Qwen team: Qwen3-8B is a great base for this kind of work, and the autotune layer is shape-aware enough to handle the rest of the family without changes to the framework.

Five days, one repo, one demo, one MI300X. End-to-end on AMD.


Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. Live demo: mindx.pythai.net/hackathon. License: Apache-2.0 with MIT-compatibility statement.
