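Metax Support

1. Using Swift on the Metax Platform

You can either build your own image or pull an existing prebuilt one. This guide pulls a prebuilt image to show how to use ms-swift on Metax.

1.1. Starting the ms-swift Container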

docker pull mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64
# Adjust the --privileged option as needed, and mount only specific GPU cards if you prefer.
# For more information, see the official documentation: https://developer.metax-tech.com
# The Metax GPU devices must be mounted with --device: --device=/dev/dri --device=/dev/mxcd
docker run -it --net=host --uts=host --ipc=host --privileged=true --group-add video \
    --shm-size 100gb --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
    --device=/dev/dri --device=/dev/mxcd \
    -v /root/workspace:/external \
    --name swift_test \
    mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64

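2. Environment Check

2.1. Checking Whether the Metax GPU Is Available

Thanks to Metax's compatibility with CUDA, you can check whether a Metax device is available the same way you would for an NVIDIA GPU: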

import torch
print(torch.cuda.is_available())
# True
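
A slightly fuller check can also list the visible devices. The sketch below assumes that the CUDA compatibility extends to torch.cuda.device_count() and torch.cuda.get_device_name(), which are standard PyTorch calls; describe_devices is a hypothetical helper name, not part of ms-swift.

```python
def describe_devices():
    """Return (index, name) pairs for visible devices, or [] if none are usable."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (index, torch.cuda.get_device_name(index))
        for index in range(torch.cuda.device_count())
    ]


for index, name in describe_devices():
    print(index, name)
```

An empty result means either that PyTorch is missing or that no device is visible inside the container, in which case the --device mounts are worth rechecking.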

2.2. Checking the P2P Connection Topology Between GPUs

mx-smi topo -m
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 16:37:10 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       0              0-31,64-95

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown
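
The matrix above can also be read programmatically, for example to decide which GPUs to place in the same tensor-parallel group. The following is a minimal sketch that assumes the text layout shown above; parse_link_matrix and metaxlink_groups are hypothetical helper names, and the grouping only looks at direct MX links, which is sufficient when each MetaXLink group is fully connected as in this sample.

```python
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       0              0-31,64-95
"""


def parse_link_matrix(text):
    """Map each GPU to a dict of {peer GPU: link type} from `mx-smi topo -m` text."""
    gpu_lines = [ln.split() for ln in text.splitlines() if ln.strip().startswith("GPU")]
    header, body = gpu_lines[0], gpu_lines[1:]
    # Keep only the GPU columns; the affinity columns at the end are ignored.
    columns = [tok for tok in header if tok.startswith("GPU")]
    return {row[0]: dict(zip(columns, row[1:1 + len(columns)])) for row in body}


def metaxlink_groups(matrix):
    """Group GPUs that are directly connected to each other through MetaXLink."""
    groups = []
    for gpu, links in matrix.items():
        peers = frozenset([gpu] + [peer for peer, kind in links.items() if kind == "MX"])
        if peers not in groups:
            groups.append(peers)
    return [sorted(group) for group in groups]


print(metaxlink_groups(parse_link_matrix(SAMPLE_TOPO)))
```

For the sample above, this reports two groups, GPU0 to GPU3 and GPU4 to GPU7, which matches the two MX blocks visible in the matrix.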

2.3. Checking GPU Status

mx-smi
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 09:55:49 2026

Attached GPUs                                     : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
| MACA Version: 3.3.0.15             BIOS Version: 1.30.0.0                       |
|------------------+-----------------+---------------------+----------------------|
| Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
| Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0     MetaX C500 | 0           Off | 0000:0e:00.0        | 0%          Disabled |
| 57W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1     MetaX C500 | 1           Off | 0000:0f:00.0        | 0%          Disabled |
| 58W / 350W       | 37C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2     MetaX C500 | 2           Off | 0000:10:00.0        | 0%          Disabled |
| 58W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3     MetaX C500 | 3           Off | 0000:12:00.0        | 0%          Disabled |
| 60W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4     MetaX C500 | 4           Off | 0000:35:00.0        | 0%          Disabled |
| 57W / 350W       | 33C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5     MetaX C500 | 5           Off | 0000:36:00.0        | 0%          Disabled |
| 56W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6     MetaX C500 | 6           Off | 0000:37:00.0        | 0%          Disabled |
| 55W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7     MetaX C500 | 7           Off | 0000:38:00.0        | 0%          Disabled |
| 56W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  no process found                                                               |
+---------------------------------------------------------------------------------+

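3. Running Examples

The community release of Swift works as-is, and the image also ships a more heavily optimized version under /workspace. We strongly recommend using the packages in that directory first.

3.1. Running a Swift Example

In most cases, you can run the Swift training examples directly: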

# We assume that the ms-swift code is under /workspace
cd /workspace/ms-swift/
bash examples/train/full/train.sh

Sample output (excerpt):

# output:
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'grad_norm': 10.75, 'learning_rate': 1e-05, 'token_acc': 0.61763144, 'epoch': 0.05, 'global_step/max_steps': '5/94', 'percentage': '5.32%', 'elapsed_time': '10s', 'remaining_time': '3m 12s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.461462}
{'loss': 1.56617603, 'grad_norm': 12.8125, 'learning_rate': 9.92e-06, 'token_acc': 0.61519274, 'epoch': 0.11, 'global_step/max_steps': '10/94', 'percentage': '10.64%', 'elapsed_time': '20s', 'remaining_time': '2m 52s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.485796}
{'loss': 1.63347206, 'grad_norm': 13.6875, 'learning_rate': 9.69e-06, 'token_acc': 0.60373975, 'epoch': 0.16, 'global_step/max_steps': '15/94', 'percentage': '15.96%', 'elapsed_time': '30s', 'remaining_time': '2m 39s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.493855}
{'loss': 1.60613976, 'grad_norm': 11.0, 'learning_rate': 9.32e-06, 'token_acc': 0.59997221, 'epoch': 0.21, 'global_step/max_steps': '20/94', 'percentage': '21.28%', 'elapsed_time': '39s', 'remaining_time': '2m 27s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.500516}
{'loss': 1.45015478, 'grad_norm': 15.25, 'learning_rate': 8.8e-06, 'token_acc': 0.62373584, 'epoch': 0.27, 'global_step/max_steps': '25/94', 'percentage': '26.60%', 'elapsed_time': '49s', 'remaining_time': '2m 16s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.50548}
{'loss': 1.39427547, 'grad_norm': 13.9375, 'learning_rate': 8.18e-06, 'token_acc': 0.6357994, 'epoch': 0.32, 'global_step/max_steps': '30/94', 'percentage': '31.91%', 'elapsed_time': '59s', 'remaining_time': '2m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.508409}
{'loss': 1.53672237, 'grad_norm': 11.125, 'learning_rate': 7.45e-06, 'token_acc': 0.61650612, 'epoch': 0.37, 'global_step/max_steps': '35/94', 'percentage': '37.23%', 'elapsed_time': '1m 8s', 'remaining_time': '1m 55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.510425}
{'loss': 1.54039021, 'grad_norm': 13.8125, 'learning_rate': 6.65e-06, 'token_acc': 0.61613974, 'epoch': 0.43, 'global_step/max_steps': '40/94', 'percentage': '42.55%', 'elapsed_time': '1m 18s', 'remaining_time': '1m 45s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512302}
{'loss': 1.40159426, 'grad_norm': 9.4375, 'learning_rate': 5.79e-06, 'token_acc': 0.64041773, 'epoch': 0.48, 'global_step/max_steps': '45/94', 'percentage': '47.87%', 'elapsed_time': '1m 27s', 'remaining_time': '1m 35s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512983}
{'loss': 1.54977188, 'grad_norm': 11.9375, 'learning_rate': 4.91e-06, 'token_acc': 0.61078816, 'epoch': 0.53, 'global_step/max_steps': '50/94', 'percentage': '53.19%', 'elapsed_time': '1m 37s', 'remaining_time': '1m 25s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.514489}
{'loss': 1.6754509, 'grad_norm': 13.0625, 'learning_rate': 4.04e-06, 'token_acc': 0.58574393, 'epoch': 0.59, 'global_step/max_steps': '55/94', 'percentage': '58.51%', 'elapsed_time': '1m 46s', 'remaining_time': '1m 15s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.515752}
{'loss': 1.37204351, 'grad_norm': 9.25, 'learning_rate': 3.19e-06, 'token_acc': 0.6391937, 'epoch': 0.64, 'global_step/max_steps': '60/94', 'percentage': '63.83%', 'elapsed_time': '1m 56s', 'remaining_time': '1m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.516829}
{'loss': 1.47697926, 'grad_norm': 11.375, 'learning_rate': 2.4e-06, 'token_acc': 0.62817259, 'epoch': 0.69, 'global_step/max_steps': '65/94', 'percentage': '69.15%', 'elapsed_time': '2m 5s', 'remaining_time': '55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.517947}
{'loss': 1.4336628, 'grad_norm': 8.125, 'learning_rate': 1.69e-06, 'token_acc': 0.63453862, 'epoch': 0.75, 'global_step/max_steps': '70/94', 'percentage': '74.47%', 'elapsed_time': '2m 14s', 'remaining_time': '46s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.518833}
{'loss': 1.54315252, 'grad_norm': 9.625, 'learning_rate': 1.08e-06, 'token_acc': 0.60202073, 'epoch': 0.8, 'global_step/max_steps': '75/94', 'percentage': '79.79%', 'elapsed_time': '2m 24s', 'remaining_time': '36s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.519627}
{'loss': 1.47180223, 'grad_norm': 9.5625, 'learning_rate': 6e-07, 'token_acc': 0.62211501, 'epoch': 0.85, 'global_step/max_steps': '80/94', 'percentage': '85.11%', 'elapsed_time': '2m 33s', 'remaining_time': '26s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520284}
{'loss': 1.44068375, 'grad_norm': 10.125, 'learning_rate': 2.5e-07, 'token_acc': 0.62673112, 'epoch': 0.91, 'global_step/max_steps': '85/94', 'percentage': '90.43%', 'elapsed_time': '2m 43s', 'remaining_time': '17s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520331}
{'loss': 1.44893646, 'grad_norm': 8.375, 'learning_rate': 5e-08, 'token_acc': 0.63837478, 'epoch': 0.96, 'global_step/max_steps': '90/94', 'percentage': '95.74%', 'elapsed_time': '2m 52s', 'remaining_time': '7s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520707}
{'train_runtime': 183.4332, 'train_samples_per_second': 8.177, 'train_steps_per_second': 0.512, 'train_loss': 1.50650934, 'token_acc': 0.6194337, 'epoch': 1.0, 'global_step/max_steps': '94/94', 'percentage': '100.00%', 'elapsed_time': '3m 3s', 'remaining_time': '0s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512463}
Train: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [03:03<00:00,  1.95s/it]
[INFO:swift] last_model_checkpoint: /workspace/ms-swift/output/v0-20260211-143035/checkpoint-94
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /workspace/ms-swift/output/v0-20260211-143035/images
[INFO:swift] End time of running main: 2026-02-11 14:34:09.521336
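
Each metrics line in the log above is printed as a Python dict literal, so it can be parsed to check that the loss behaves as expected. The sketch below assumes that format stays as shown; parse_log_records and loss_curve are hypothetical helper names, and the sample reuses three lines from the output above.

```python
import ast

SAMPLE_LOG = """\
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'grad_norm': 10.75, 'learning_rate': 1e-05, 'token_acc': 0.61763144, 'epoch': 0.05, 'global_step/max_steps': '5/94', 'percentage': '5.32%', 'elapsed_time': '10s', 'remaining_time': '3m 12s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.461462}
[INFO:swift] best_model_checkpoint: None
"""


def parse_log_records(text):
    """Return every log line that parses as a dict literal; skip everything else."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def loss_curve(records):
    """Return (step, loss) pairs for records that report a per-step loss."""
    points = []
    for record in records:
        if "loss" in record and "global_step/max_steps" in record:
            step = int(record["global_step/max_steps"].split("/")[0])
            points.append((step, record["loss"]))
    return points


print(loss_curve(parse_log_records(SAMPLE_LOG)))
```

ast.literal_eval only evaluates literals, so it is safe to run on log text. The final summary line reports train_loss rather than loss and is therefore left out of the curve.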

3.2. Using Megatron-LM as the Swift Backend

To use Megatron-LM as the Swift backend, set the MEGATRON_LM_PATH environment variable:

export MEGATRON_LM_PATH=/workspace/Megatron-LM-0.15.0
cd /workspace/ms-swift
bash examples/megatron/pretrain.sh

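3.3. Using Other Versions of ms-swift

The Metax platform requires packages that are compatible with MACA. For example, if a build depends on torch 2.8, you need the torch2.8+maca3.3.x.x build.

By default, installing ms-swift overwrites the PyTorch build already present in the environment. We therefore recommend installing with the --no-deps option: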

git clone -b ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install . --no-deps

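After every change to the environment, check the PyTorch version and confirm that the device is still available:

pip list | grep torch
# output:
# torch2.x.x+metax3.x.x.x
# Then, in a Python session:
import torch
print(torch.cuda.is_available())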
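
The same check can be scripted so that it runs after every install. This is a minimal sketch that assumes Metax builds of PyTorch carry "maca" or "metax" in the local version segment (the part after "+"), as the version strings in this section suggest; the tag list and the helper names are assumptions, not a documented convention.

```python
from importlib import metadata

# Assumption: vendor builds are tagged with one of these words after the "+".
VENDOR_TAGS = ("maca", "metax")


def local_version_tag(version):
    """Return the local version segment (the text after '+'), or '' if absent."""
    _, _, local = version.partition("+")
    return local


def is_vendor_build(version, tags=VENDOR_TAGS):
    """Return True if the local version segment contains one of the vendor tags."""
    local = local_version_tag(version).lower()
    return any(tag in local for tag in tags)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is missing."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


version = installed_torch_version()
if version is None:
    print("torch is not installed")
elif not is_vendor_build(version):
    print(f"torch {version} does not look like a Metax build")
else:
    print(f"torch {version} looks like a Metax build")
```

A warning here usually means that a dependency install replaced the preinstalled PyTorch, which is the situation --no-deps is meant to avoid.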

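3.4. Differences Between Metax and NVIDIA CUDA

Metax aligns with NVIDIA on most interfaces, but some software behavior and environment variables differ.

3.4.1. MACA_MPS_MODE

By default, MACA does not allow multiple processes to share the same GPU. If a GPU is already in use, a new process cannot start on it.

To enable functionality similar to MPS (Multi-Process Service), set MACA_MPS_MODE=1:

# Run other scripts...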
export MACA_MPS_MODE=1
cd /workspace/ms-swift/
bash examples/train/full/train.sh

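3.4.2. MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME & MCCL_IB_HCA

For multi-node training, we recommend setting the following environment variables so that inter-node communication works correctly:

MCCL_SOCKET_IFNAME: the network interface used for MCCL communication
GLOO_SOCKET_IFNAME: the network interface used for GLOO communication
MCCL_IB_HCA: the InfiniBand devices to use

Use ifconfig and mx-smi to identify the network interface and the IB devices: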

ifconfig
# output
ens20f0np0: xxx
            inet: your node ip
            xxx
...
mx-smi topo -n
# output
mx-smi  version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 18:53:44 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
NIC0    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    X       PIX     NODE    NODE    SYS     SYS
NIC1    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    PIX     X       NODE    NODE    SYS     SYS
NIC2    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    X       PIX     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    PIX     X       SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     X       PIX
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
# From the topology:
#  1. GPU0-GPU3 communicate through NIC0/NIC1 (mlx5_0, mlx5_1)
#  2. GPU4-GPU7 communicate through NIC2/NIC3 (mlx5_2, mlx5_3)

The recommended settings are therefore MCCL_SOCKET_IFNAME=ens20f0np0, GLOO_SOCKET_IFNAME=ens20f0np0, and MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3.

# node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node1.sh
# node 2
# Change master_addr in the script to the IP address of node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node2.sh
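
The MCCL_IB_HCA value used above can be derived from the `mx-smi topo -n` output instead of being read off by hand. The sketch below assumes the text layout shown earlier and uses one possible selection rule: keep every NIC that has a PIX link to at least one GPU. That rule is an assumption chosen to reproduce the recommendation in this section, and pix_nics, nic_legend, and mccl_ib_hca are hypothetical helper names.

```python
SAMPLE_TOPO_N = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
"""


def pix_nics(text):
    """Return NIC column names that have a PIX link to at least one GPU."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip().startswith("GPU")]
    header, gpu_rows = rows[0], rows[1:]
    # Device columns only; the trailing affinity columns are dropped by zip below.
    columns = [tok for tok in header if tok.startswith(("GPU", "NIC"))]
    selected = []
    for row in gpu_rows:
        for column, kind in zip(columns, row[1:]):
            if column.startswith("NIC") and kind == "PIX" and column not in selected:
                selected.append(column)
    return selected


def nic_legend(text):
    """Map NIC names such as 'NIC0' to device names such as 'mlx5_0'."""
    legend = {}
    for line in text.splitlines():
        name, sep, device = line.strip().partition(":")
        if sep and name.startswith("NIC") and name[3:].isdigit():
            legend[name] = device.strip()
    return legend


def mccl_ib_hca(text):
    """Build a comma-separated device list suitable for MCCL_IB_HCA."""
    legend = nic_legend(text)
    return ",".join(legend[nic] for nic in pix_nics(text))


print(mccl_ib_hca(SAMPLE_TOPO_N))
```

On the sample above this yields mlx5_0,mlx5_1,mlx5_2,mlx5_3, the same list as the recommended setting. NIC4 and NIC5 are excluded because they reach every GPU only through SYS links.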