GRPO

Get Started

  • Get Started
    • GRPO

Developer Guide

  • Developer Guide
    • Loss Types
    • Multi-Turn Training
    • Multi-Task Training
    • Reward Functions
    • Reward Models
    • Gym Environment Training

Advanced Research

  • Advanced Research
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
    • Clipped Importance Sampling Policy Optimization (CISPO)
    • DAPO: An Open-Source LLM Reinforcement Learning System at Scale
    • DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
    • Group Sequence Policy Optimization
    • On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)
    • REINFORCE Leave-One-Out (RLOO)
    • REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
    • Soft Adaptive Policy Optimization (SAPO)
    • Training-Inference-Mismatch
    • TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

© Copyright 2022-2025, Alibaba ModelScope.
