mindXtrain Demo is Live — Qwen3-8B on a Single MI300X for Less Than $3

Day 5 of the AMD × lablab.ai Developer Hackathon. The demo URL is live: mindx.pythai.net/hackathon. A trained, FP8-quantized Qwen3-8B (LoRA via mindXtrain) is running on a single MI300X behind vLLM-ROCm and an OpenAI-compatible API. No auth required during the hackathon judging window. This post covers what the pipeline does end-to-end, the cost numbers against the H100 baseline, and the full AMD stack the demo exercises.


1. The pipeline you can poke at

The endpoint is OpenAI-compatible. From any terminal:

curl https://mindx.pythai.net/hackathon/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-mindxtrain-fp8",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain why MI300X has 192 GB of HBM3 in one paragraph."}
        ],
        "max_tokens": 256
    }'

What is happening behind that single curl, in order:

  1. Qwen3-8B base model from Alibaba’s Qwen team (Apache-2.0).
  2. Fine-tuned via mindXtrain LoRA on MI300X. The 60-second AOT autotune probe (covered in the Day 2 deep-dive) measured Composable Kernel attention as the winner for this shape, locked the hipBLASLt default heuristic, and wrote the plan to disk before training started. The training loop never re-tuned.
  3. Quantized via AMD Quark FP8 PTPC into a vLLM-loadable directory. PTPC (per-token activation scales, per-channel weight scales) was 15–30% faster than BlockScale on MI300X for this model size in our measurements, which is what the project’s defaults reflect.
  4. Served behind mindxtrain.operator’s OpenAI-compatible /v1/chat/completions. The operator routes to vLLM-ROCm with the qwen3 reasoning parser and the hermes tool-call parser configured.
  5. BLAKE3 provenance manifest emitted, pinned to Lighthouse / IPFS, and registered with the mindX API. Run mindxtrain receipt <manifest.json> --config run.yaml against the published manifest to round-trip-verify.

2. The cost slide

The headline. Same workload — Qwen3-8B SFT-LoRA, 1B tokens, BF16 unquantized — on two stacks:

| Stack | Hardware | $/GPU-hr | GPUs | Hours | Total |
| --- | --- | --- | --- | --- | --- |
| mindXtrain on AMD Developer Cloud | 1× MI300X (192 GB HBM3) | $1.99 | 1 | ~1.5 | ~$3 |
| Equivalent on H100 | 2× H100 (80 GB) | $4.00 | 2 | ~4 | ~$32 |

The MI300X path doesn’t have to fall back to FP8 to fit the activation tensors: 192 GB of HBM3 is doing real work. The H100 80 GB path has two options: split across two cards (the H100 row above), or quantize the base weights, which changes the result you are trying to measure and isn’t an apples-to-apples comparison. Either way, the AMD stack wins this benchmark by roughly 10× on cost.
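The arithmetic behind those two rows is nothing more than GPUs × hourly rate × wall-clock hours; a quick sketch using the figures above:

```python
def run_cost(gpus: int, rate_per_gpu_hour: float, hours: float) -> float:
    """Total cost of a run: GPU count x per-GPU hourly rate x hours."""
    return gpus * rate_per_gpu_hour * hours

mi300x_total = run_cost(gpus=1, rate_per_gpu_hour=1.99, hours=1.5)  # ~$3
h100_total = run_cost(gpus=2, rate_per_gpu_hour=4.00, hours=4.0)    # $32
ratio = h100_total / mi300x_total                                   # ~10.7x
```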

This is also the slide where the integration story earns its keep. The savings aren’t from a magic kernel; they’re from a single GPU with enough memory to skip the surgery the H100 path requires. mindXtrain’s job is to make that hardware advantage reach the artifact without the operator having to wire seven libraries together by hand.

3. The full AMD stack the demo exercises

Everything below is current, working, and integrated end-to-end in the live demo. No mocks, no stubs, no “coming soon.”

  • ROCm 7.2.1 — the base layer. Container is rocm/primus:v26.2, SHA256 digest pinned in ops/containerfiles/digest.lock.
  • AOTriton — AMD-flavored AOT-compiled Triton. Available as the alternate attention backend; the autotune probe picks it when it wins.
  • AITER + Composable Kernel — AMD’s hand-tuned ASM kernel library. Won the attention bake-off for Qwen3-8B at the recipe-default shape.
  • hipBLASLt — GEMM library. Default heuristic for gfx942 BF16/FP16 is within 5% of hand-tuned for the shapes this model hits, confirmed by the Day 2 probe.
  • RCCL — collective comms, a no-op for this single-GPU run; NCCL_MIN_NCHANNELS=112 and GPU_MAX_HW_QUEUES=1 are baked into the recipe for the 8-GPU path.
  • Optimum-AMD — the HuggingFace integration layer for AMD GPUs.
  • AMD Quark FP8 PTPC — quantization. 15–30% faster than BlockScale on MI300X at this model size.
  • Primus-Turbo + torchtitan-amd — alternate trainer backends; reachable from the dispatch layer for full-FSDP runs.
  • vLLM-ROCm — serving. Qwen3 reasoning parser + hermes tool-call parser configured.
  • SGLang-ROCm — peer-class serving backend; reachable via the operator’s backend registry, not the default for this demo.

4. The provenance receipt

Every artifact in the demo has a manifest.json with a BLAKE3 hash of the YAML recipe, the dataset shards, the checkpoint directory, and the eval JSON, plus pointers to the HuggingFace Hub repo, the Lighthouse Storage CID, and the on-chain anchor transaction on Base. To verify:

uv run mindxtrain receipt ./out/runs/qwen3_8b_sft_lora/manifest.json \
    --config qwen3_8b_sft_lora.yaml

The receipt re-hashes everything and round-trip-verifies. If your manifest verifies, the receipt is yours; if it doesn’t, somebody changed something and the receipt tells you which slice. The on-chain anchor is a single immutable contract — mindxtrain_registry.sol, no admin, no upgrade — that records the BLAKE3 digest and the CID. Cypherpunk2048.
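The round-trip idea is easy to sketch. Two loud assumptions in the code below: sha256 stands in for BLAKE3 so the sketch needs only the standard library, and the manifest schema is invented for illustration, not mindXtrain’s actual format.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    # The real pipeline hashes with BLAKE3; sha256 is a stdlib stand-in here.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_manifest(manifest_path: Path) -> dict[str, bool]:
    """Re-hash every artifact the manifest lists and report, per slice,
    whether it still matches the recorded digest."""
    manifest = json.loads(manifest_path.read_text())
    return {
        name: file_digest(Path(entry["path"])) == entry["digest"]
        for name, entry in manifest["artifacts"].items()
    }
```

A failing slice pinpoints exactly which artifact changed, which is the property the receipt command relies on.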

5. The submission tomorrow

Submitting to lablab.ai tomorrow morning. Three primary tracks plus two side challenges:

| Track | Primary deliverable | Status |
| --- | --- | --- |
| Fine-Tuning on AMD GPUs | LoRA SFT of amd/Instella-3B and Qwen/Qwen3-8B on MI300X | Live in demo |
| AI Agents & Agentic Workflows | mindxtrain.operator serving the trained model behind /v1/chat/completions | Live in demo |
| Vision & Multimodal AI | qwen3_vl_8b_sft recipe shipped | Recipe in repo; not in headline demo |
| Build-in-Public (meta) | Three technical posts tagged #AMDDevHackathon | You are reading post #3 |
| Best Use of Qwen (cross-cutting) | Qwen3-8B is the headline run; Qwen3.6 recipes wired | Live in demo |

The case for Best Overall is that this is one repo, one demo, one container, end-to-end on AMD, with on-chain provenance — and the trained model is a directly-rentable agent through AgenticPlace + x402-Algorand metering, not just a checkpoint sitting on HuggingFace Hub.

6. The asks and the receipts

Everything is open. The full repo is Apache-2.0 (with an explicit MIT-compatibility statement in LICENSE-NOTICE.md for the lablab spec). All the receipts:

  • To the AMD developer team: the ROCm 7.2.1 stack works. AOTriton, AITER, Composable Kernel, hipBLASLt, RCCL, Quark, Primus-Turbo, vLLM-ROCm, and SGLang are all first-class on MI300X. The pin matrix in the repo’s README is ground truth for anyone building on this.
  • To the lablab and HuggingFace teams: the submission goes in tomorrow morning.
  • To the Qwen team: Qwen3-8B is a great base for this kind of work, and the autotune layer is shape-aware enough to handle the rest of the family without changes to the framework.

Five days, one repo, one demo, one MI300X. End-to-end on AMD.


Tagged #AMDDevHackathon. Code: github.com/codephreak/mindxtrain. Live demo: mindx.pythai.net/hackathon. License: Apache-2.0 with MIT-compatibility statement.
