TRL

You are viewing version v0.25.0. A newer version, v1.5.1, is available.

GSPO-token

In the paper Group Sequence Policy Optimization, the authors propose a token-level variant of the GSPO objective, called GSPO-token. To use GSPO-token, import the GRPOTrainer class from trl.experimental.gspo_token.
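
For reference, the GSPO-token objective can be sketched as follows. This is a transcription in the notation of the GSPO paper, not taken from this page: $G$ is the group size, $y_i$ the $i$-th sampled response to prompt $x$, $\varepsilon$ the clipping range, and $\operatorname{sg}[\cdot]$ a stop-gradient operator. Consult the paper for the authoritative form.

```latex
\mathcal{J}_{\text{GSPO-token}}(\theta)
= \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
\min\left( s_{i,t}(\theta)\,\hat{A}_{i,t},\
\operatorname{clip}\left(s_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_{i,t} \right) \right]

s_{i,t}(\theta) = \operatorname{sg}\left[ s_i(\theta) \right] \cdot
\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\operatorname{sg}\left[ \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]},
\qquad
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}
```

Numerically, $s_{i,t}(\theta)$ equals $s_i(\theta)$ for every token. The two differ only in how gradients flow: each token's log-probability receives gradient scaled by that token's own advantage.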

Usage

from trl.experimental.gspo_token import GRPOTrainer
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence_token",
    ...
)

To leverage GSPO-token, you need to provide a per-token advantage $\hat{A}_{i,t}$ for each token $t$ in sequence $i$, i.e., make $\hat{A}_{i,t}$ vary with $t$. This is not the case here, where $\hat{A}_{i,t}=\hat{A}_{i}$ for every token. Otherwise, the GSPO-token gradient is equivalent to that of the original GSPO implementation.
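
The equivalence can be checked by hand on a single sequence. The following is a minimal sketch, not TRL's implementation: it ignores clipping, writes the gradients analytically instead of using autograd, and assumes the GSPO paper's definitions of the sequence-level ratio $s_i$ and the token-level ratio $s_{i,t}$. The function names are hypothetical.

```python
import math

def sequence_ratio(logp, logp_old):
    """Length-normalized sequence ratio s_i = exp(mean_t(logp_t - logp_old_t))."""
    n = len(logp)
    return math.exp(sum(a - b for a, b in zip(logp, logp_old)) / n)

def gspo_grad(logp, logp_old, adv):
    """Gradient of the unclipped GSPO term s_i * A_i w.r.t. each token log-prob."""
    s = sequence_ratio(logp, logp_old)
    n = len(logp)
    # d s_i / d logp_t = s_i / |y_i| for every token t.
    return [s * adv / n for _ in range(n)]

def gspo_token_grad(logp, logp_old, token_adv):
    """Gradient of the unclipped GSPO-token term (1/|y_i|) * sum_t s_{i,t} * A_{i,t}."""
    s = sequence_ratio(logp, logp_old)
    n = len(logp)
    # s_{i,t} = sg[s_i] * pi_t / sg[pi_t], so d s_{i,t} / d logp_t = s_i and
    # token t only receives gradient through its own advantage A_{i,t}.
    return [s * a / n for a in token_adv]

# Hypothetical per-token log-probabilities under the current and old policies.
logp = [-1.2, -0.7, -2.1]
logp_old = [-1.0, -0.9, -2.0]

# Constant advantage across tokens: GSPO-token reduces to GSPO.
base = gspo_grad(logp, logp_old, 0.5)
same = gspo_token_grad(logp, logp_old, [0.5, 0.5, 0.5])

# Advantage varying with t: the gradients now differ token by token.
differs = gspo_token_grad(logp, logp_old, [0.5, -0.2, 1.0])
```

With a constant advantage, `same` matches `base` element by element. Only the varying advantages in `differs` produce a gradient that GSPO alone cannot express.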
