2025-01-05 18:11
There's a lot of confusion about o1's RL training and the emergence of RL as a popular post-training method. Yes, these use the same loss functions and similar data. BUT, the amount of compute spent on o1's RL training is much closer to pretraining scale.
The words we use to describe training are already strained, but o1 may be better viewed as next-token pretraining, then RL pretraining, and then some normal post-training.