Quick Start#

This document will guide you through setting up the environment and getting started with vime on AMD ROCm, covering environment configuration, data preparation, weight conversion, and training startup.

Basic Environment Setup#

Pull and Start Docker Container#

Execute the following commands to pull the latest ROCm image and start a persistent container:

# Pull the ROCm image
docker pull vllm/vime-rocm

# Start the container
docker run -d --name vime --ulimit nofile=1048576:1048576 \
  --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video --privileged \
  -e WANDB_API_KEY=$wandb_key vllm/vime-rocm
# wandb key is optional if you want to track with WandB

# Enter the container
docker exec -it vime bash

Model and Dataset Download#

Download the required model and training dataset using huggingface_hub:

# Download model weights (Qwen3-8B)
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B

# Download training dataset (dapo-math-17k)
hf download zhuzilin/dapo-math-17k --repo-type dataset --local-dir /root/dapo-math-17k

Model Weight Conversion#

Convert from Hugging Face Format to Megatron Format#

Load the model configuration for Qwen3-8B, then run the conversion. Two ROCm-specific flags are required: --no-gradient-accumulation-fusion and --attention-backend flash.

cd /root/vime && source scripts/models/qwen3-8B.sh

HIP_VISIBLE_DEVICES=0 PYTHONPATH=/root/vime:/root/Megatron-LM \
  torchrun --nproc-per-node=1 tools/convert_hf_to_torch_dist.py "${MODEL_ARGS[@]}" \
  --no-gradient-accumulation-fusion --attention-backend flash \
  --hf-checkpoint /root/Qwen3-8B --save /root/Qwen3-8B_torch_dist

Note: On ROCm, use HIP_VISIBLE_DEVICES in place of CUDA_VISIBLE_DEVICES to select GPUs. The --attention-backend flash and --no-gradient-accumulation-fusion flags are required to avoid issues during conversion.

Training#

Run training using the ROCm-specific script. VISIBLE_GPUS specifies which GPUs to use, and NUM_ROLLOUT controls the total number of sampling→training rounds. The script automatically unsets NVTE environment variables that are not needed on ROCm.

NUM_ROLLOUT=100 VISIBLE_GPUS=0,1 bash scripts/run-qwen3-8B-rocm.sh

Configuration Notes#

VISIBLE_GPUS
- Two free GPU indices.
- Script masks execution to these GPUs, avoids clashes.
- Uses:
  - TP = 2
  - Single vLLM engine
  - Colocate mode
  - DP = 1
NUM_ROLLOUT
- Number of training steps.
- Default is 3 for a smoke test.
Each run requires approximately 230 GB across the two selected GPUs.
Launch only on GPUs with sufficient free memory.

Final note: After finishing the run, if rerunning with a different NUM_ROLLOUT, make sure to clear the save directory to avoid mismatch error.

rm -rf /root/Qwen3-8B_vime/

Quick Start

Contents