Mostly on post-training, RL environments, and small-model experiments. Write-ups of what I learn while training things.
10 reward designs, 2 training frameworks, 5 GPU runs. The one-line reward function that finally moved the needle, and why temperature matters more than reward shape.