Ondrej Zika ozika

Maybe you've heard about this technique but you haven't completely understood it, especially the PPO part. This explanation might help.

We will focus on text-to-text language models 📝, such as GPT-3, BLOOM, and T5. Models like BERT, which are encoder-only, are not addressed.

Reinforcement Learning from Human Feedback (RLHF) has been successfully applied in ChatGPT, hence its major increase in popularity. 📈

RLHF is especially useful in two scenarios 🌟:

- You can’t create a good loss function
  - Example: how do you calculate a metric to measure if the model’s output was funny?
- You want to train with production data, but you can’t easily label it
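
To make the first scenario concrete, here is a minimal sketch of how human feedback can stand in for a loss function you can't write by hand. It assumes the common setup where a labeler compares two outputs and a reward model is trained so that the preferred one scores higher. The function name and the scores below are hypothetical, chosen only for illustration.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # Equivalent to log(1 + exp(-margin)), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical scores a reward model might assign to two jokes,
# where a human labeler said the first joke was funnier.
loss_agrees = preference_loss(2.0, -1.0)     # model ranks them like the human did
loss_disagrees = preference_loss(-1.0, 2.0)  # model ranks them the other way around
loss_tie = preference_loss(0.5, 0.5)         # model can't tell them apart

print(loss_agrees, loss_tie, loss_disagrees)
```

The loss is small when the model agrees with the human and large when it disagrees, so nobody ever has to define "funny" as a formula. People only have to say which output they liked better.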

	deb http://archive.ubuntu.com/ubuntu/ focal main restricted universe multiverse
	deb-src http://archive.ubuntu.com/ubuntu/ focal main restricted universe multiverse

	deb http://archive.ubuntu.com/ubuntu/ focal-updates main restricted universe multiverse
	deb-src http://archive.ubuntu.com/ubuntu/ focal-updates main restricted universe multiverse

	deb http://archive.ubuntu.com/ubuntu/ focal-security main restricted universe multiverse
	deb-src http://archive.ubuntu.com/ubuntu/ focal-security main restricted universe multiverse

	deb http://archive.ubuntu.com/ubuntu/ focal-backports main restricted universe multiverse