mindXtrain Demo is Live — Qwen3-8B on a Single MI300X for Less Than $3

Day 5 of the AMD × lablab.ai Developer Hackathon. The demo URL is live: mindx.pythai.net/hackathon. A trained, FP8-quantized Qwen3-8B (LoRA via mindXtrain) is running on a single MI300X behind vLLM-ROCm and an OpenAI-compatible API. No auth required during the hackathon judging window. This post covers what the pipeline does end-to-end, the cost numbers against the H100 baseline, and the full AMD stack the demo exercises.


1. The pipeline you can poke at

The endpoint is OpenAI-compatible. From any terminal:

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain why MI300X has 192 GB of HBM3 in one paragraph."}
        ],
        "max_tokens": 256
    }'
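
The endpoint also streams. A minimal sketch, assuming streaming is enabled on the server (vLLM's OpenAI-compatible server honors "stream": true):

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "user", "content": "One sentence on why FP8 PTPC beats BlockScale here."}
        ],
        "max_tokens": 64,
        "stream": true
    }'

Tokens come back as server-sent-event chunks, the same wire format OpenAI clients already parse.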

What is happening behind that single curl, in order:

  1. Qwen3-8B base model from Alibaba’s Qwen team (Apache-2.0).
  2. Fine-tuned via mindXtrain LoRA on MI300X. The 60-second AOT autotune probe (covered in the Day 2 deep-dive) measured Composable Kernel attention as the winner for this shape, locked the hipBLASLt default heuristic, and wrote the plan to disk before training started. The training loop never re-tuned.
  3. Quantized via AMD Quark FP8 PTPC into a vLLM-loadable directory. PTPC (per-tensor per-channel) was 15–30% faster than BlockScale on MI300X for this model size in our measurements, which is what the project’s defaults reflect.
  4. Served behind mindxtrain.operator's OpenAI-compatible /v1/chat/completions. The operator routes to vLLM-ROCm with the qwen3 reasoning parser and the hermes tool-call parser configured (a tool-call sketch follows this list).
  5. BLAKE3 provenance manifest emitted, pinned to Lighthouse / IPFS, and registered with the mindX API. Run mindxtrain receipt <manifest.json> --config run.yaml against the published manifest to round-trip-verify.
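
Step 4's hermes tool-call parser means the standard OpenAI tools field works against this endpoint. A hedged sketch; the get_weather function here is illustrative, not something the demo ships:

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "user", "content": "What is the weather in Austin?"}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }],
        "max_tokens": 128
    }'

If the operator has auto tool choice enabled, the response carries a tool_calls array instead of plain content.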

2. The cost slide

The headline. Same workload — Qwen3-8B SFT-LoRA, 1B tokens, BF16 unquantized — on two stacks:

Stack                             | Hardware                | $/hr  | GPUs | Hours | Total
mindXtrain on AMD Developer Cloud | 1× MI300X (192 GB HBM3) | $1.99 | 1    | ~1.5  | ~$3
Equivalent on H100                | 2× H100 (80 GB)         | $4.00 | 2    | ~4    | ~$32

The MI300X path doesn’t have to fall back to FP8 to fit the activation tensors. 192 GB HBM3 is doing real work. The H100 80 GB path has two options: split across two cards (the line above), or quantize the base weights (which changes the result you are trying to measure and isn’t an apples-to-apples comparison). Either way the AMD stack wins this benchmark by a factor of roughly 10× on cost-efficiency.
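
The totals are just rate × GPU count × wall-clock hours, which you can sanity-check from any shell:

# cost = $/hr × GPUs × hours
echo "scale=2; 1.99 * 1 * 1.5" | bc    # 2.98, the ~$3 in the table
echo "scale=2; 4.00 * 2 * 4" | bc      # 32.00
echo "scale=1; 32.00 / 2.98" | bc      # 10.7, the roughly-10x claim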

This is also the slide where the integration story earns its keep. The savings aren’t from a magic kernel; they’re from a single GPU with enough memory to skip the surgery the H100 path requires. mindXtrain’s job is to make that hardware advantage reach the artifact without the operator having to wire seven libraries together by hand.

3. The full AMD stack the demo exercises

Everything below is current, working, and integrated end-to-end in the live demo. No mocks, no stubs, no “coming soon.”

  • ROCm 7.2.1 — the base layer. Container is rocm/primus:v26.2, SHA256 digest pinned in ops/containerfiles/digest.lock.
  • AOTriton — AMD-flavored AOT-compiled Triton. Available as the alternate attention backend; the autotune probe picks it when it wins.
  • AITER + Composable Kernel — AMD’s hand-tuned ASM kernel library. Won the attention bake-off for Qwen3-8B at the recipe-default shape.
  • hipBLASLt — GEMM library. Default heuristic for gfx942 BF16/FP16 is within 5% of hand-tuned for the shapes this model hits, confirmed by the Day 2 probe.
  • RCCL — collective comms. A no-op for this single-GPU run; NCCL_MIN_NCHANNELS=112 and GPU_MAX_HW_QUEUES=1 are baked into the recipe for the 8-GPU path.
  • Optimum-AMD — the HuggingFace integration layer for AMD GPUs.
  • AMD Quark FP8 PTPC — quantization. 15–30% faster than BlockScale on MI300X at this model size.
  • Primus-Turbo + torchtitan-amd — alternate trainer backends; reachable from the dispatch layer for full-FSDP runs.
  • vLLM-ROCm — serving. Qwen3 reasoning parser + hermes tool-call parser configured; a serving sketch, including the RCCL pins, follows this list.
  • SGLang-ROCm — peer-class serving backend; reachable via the operator’s backend registry, not the default for this demo.
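
To make the serving row concrete, a hedged sketch of the invocation, not the operator's actual launch script. The checkpoint path and served-model name are assumptions; --reasoning-parser, --tool-call-parser, and --enable-auto-tool-choice are real vLLM flags, and the two exports are the RCCL pins from the recipe:

# RCCL pins the recipe bakes in for the 8-GPU path (no-ops on one GPU)
export NCCL_MIN_NCHANNELS=112
export GPU_MAX_HW_QUEUES=1

# serve the Quark-FP8 checkpoint; the path is hypothetical
vllm serve ./out/runs/qwen3_8b_sft_lora/quark_fp8 \
    --served-model-name qwen3-8b-mindxtrain-fp8 \
    --reasoning-parser qwen3 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice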

4. The provenance receipt

Every artifact in the demo has a manifest.json with a BLAKE3 hash of the YAML recipe, the dataset shards, the checkpoint directory, and the eval JSON, plus pointers to the HuggingFace Hub repo, the Lighthouse Storage CID, and the on-chain anchor transaction on Base. To verify:

uv run mindxtrain receipt ./out/runs/qwen3_8b_sft_lora/manifest.json \
    --config qwen3_8b_sft_lora.yaml

The receipt re-hashes everything and round-trip-verifies. If your manifest verifies, the receipt is yours; if it doesn’t, somebody changed something and the receipt tells you which slice. The on-chain anchor is a single immutable contract — mindxtrain_registry.sol, no admin, no upgrade — that records the BLAKE3 digest and the CID. Cypherpunk2048.
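
If you want to check a single slice without the CLI, the same digests are reachable by hand. A minimal sketch, assuming b3sum and jq are installed; the .recipe_blake3 key is hypothetical, so read the published manifest for the real field names:

# hash the recipe the same way the manifest did
b3sum qwen3_8b_sft_lora.yaml

# pull the recorded digest out of the manifest (key name is a guess)
jq -r '.recipe_blake3' ./out/runs/qwen3_8b_sft_lora/manifest.json

# the two hex strings should match; edit one byte of the YAML and they won't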

5. The submission tomorrow

Submitting to lablab.ai tomorrow morning. Three primary tracks plus two side challenges:

Track                            | Primary deliverable                                                        | Status
Fine-Tuning on AMD GPUs          | LoRA SFT of amd/Instella-3B and Qwen/Qwen3-8B on MI300X                    | Live in demo
AI Agents & Agentic Workflows    | mindxtrain.operator serving the trained model behind /v1/chat/completions  | Live in demo
Vision & Multimodal AI           | qwen3_vl_8b_sft recipe shipped                                             | Recipe in repo; not in headline demo
Build-in-Public (meta)           | Three technical posts tagged #AMDDevHackathon                              | You are reading post #3
Best Use of Qwen (cross-cutting) | Qwen3-8B is the headline run; Qwen3.6 recipes wired                        | Live in demo

The case for Best Overall is that this is one repo, one demo, one container, end-to-end on AMD, with on-chain provenance — and the trained model is a directly-rentable agent through AgenticPlace + x402-Algorand metering, not just a checkpoint sitting on HuggingFace Hub.

6. The asks and the receipts

Everything is open. The full repo is Apache-2.0 (with an explicit MIT-compatibility statement in LICENSE-NOTICE.md for the lablab spec). The asks, with the receipts attached:

  • To the AMD developer team: the ROCm 7.2.1 stack works. AOTriton, AITER, Composable Kernel, hipBLASLt, RCCL, Quark, Primus-Turbo, vLLM-ROCm, and SGLang are all first-class on MI300X. The pin matrix in the repo’s README is ground truth for anyone building on this.
  • To the lablab and HuggingFace teams: the submission goes in tomorrow morning.
  • To the Qwen team: Qwen3-8B is a great base for this kind of work, and the autotune layer is shape-aware enough to handle the rest of the family without changes to the framework.

Five days, one repo, one demo, one MI300X. End-to-end on AMD.


Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. Live demo: mindx.pythai.net/hackathon. License: Apache-2.0 with MIT-compatibility statement.
