[Stanford CS336] Assignment 5: Alignment and Reasoning Reinforcement Learning
1 Assignment Overview
In this assignment, you will gain hands-on experience training language models to reason through math problems.
What to Implement
- Implement a zero-shot prompting baseline on the MATH competition dataset introduced by Hendrycks et al. [2021].
- Implement supervised fine-tuning (SFT) on reasoning traces from a stronger reasoning model (DeepSeek R1, DeepSeek-AI et al. [2025]).
- Implement expert iteration to improve reasoning performance using verification rewards.
- Implement Group Relative Policy Optimization (GRPO) to improve reasoning performance using verification rewards.
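The last bullet's core idea can be previewed with a short sketch. GRPO samples a group of responses per prompt, scores each with a verification reward, and normalizes rewards within the group to obtain advantages. The function name and the 0/1 reward convention below are illustrative assumptions, not the assignment's required interface:

```python
# Hedged sketch of GRPO's group-relative advantage, assuming a binary
# verification reward (1.0 if the sampled response's final answer
# verifies against the reference, else 0.0).

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's rewards: A_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population variance over the group of G sampled responses.
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: a group of 4 samples where two answers verified.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In this example, verified responses receive a positive advantage and unverified ones a negative advantage, so the policy gradient pushes probability mass toward responses that pass verification within each group.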
For interested students, we will release an optional part of the assignment in the coming days: aligning language models with human preferences.
