Large-scale language models are becoming increasingly capable on NLP tasks. These models are usually trained with the objective of next word prediction on a dataset of human-written text. But this objective doesn't capture exactly what we want: usually, we don't want our models to imitate humans; we want them to give high-quality answers. This mismatch is clear when a model is trained to imitate low-quality human-written text, but it can also happen in more subtle ways. For example, a model trained to predict what a human would say might make up facts when it is unsure, or generate sentences reflecting harmful social bias, both failure modes that have been well-documented.

As part of our work on safety, we want to develop techniques that align our models' objectives with the end behavior we really care about. As our models become more powerful, we believe aligning them with our goals will be very important to ensure they are beneficial for humans. In the short term, we wanted to test if human feedback techniques could help our models improve performance on useful tasks.

We focused on English text summarization, as it's a challenging problem where the notion of what makes a "good summary" is difficult to capture without human input. We apply our method primarily to an existing dataset of posts submitted to the social network Reddit together with human-written "TL;DRs," which are short summaries written by the original poster.

Our approach follows directly from our previous work on learning from human feedback. We first train a reward model via supervised learning to predict which summaries humans will prefer. We then fine-tune a language model with reinforcement learning (RL) to produce summaries that score highly according to that reward model. We find that this significantly improves the quality of the summaries, as evaluated by humans, even on datasets very different from the one used for fine-tuning.
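To make the two training stages concrete, here is a minimal sketch of the objectives involved. The interfaces assumed here (`reward_model(post, summary)` returning a scalar score, `policy.log_prob(summary, post)` returning a log-probability, and the `kl_coef` weight) are illustrative assumptions, not the actual implementation; the pairwise comparison loss and a KL-regularized RL reward are standard choices for this kind of pipeline.

```python
# Minimal sketch of the two stages described above, under assumed interfaces;
# this is illustrative, not OpenAI's released implementation.
import torch.nn.functional as F

def reward_model_loss(reward_model, post, preferred_summary, other_summary):
    """Stage 1: train the reward model to score the human-preferred summary
    higher, using a standard pairwise comparison loss."""
    r_pref = reward_model(post, preferred_summary)   # scalar score per example
    r_other = reward_model(post, other_summary)
    return -F.logsigmoid(r_pref - r_other).mean()

def rl_reward(reward_model, policy, reference_policy, post, sampled_summary, kl_coef=0.05):
    """Stage 2: reward signal for RL fine-tuning. The policy is rewarded for
    summaries the reward model scores highly; the KL term toward a reference
    (supervised) policy is a common stabilizer and an assumption of this sketch."""
    score = reward_model(post, sampled_summary)
    kl = policy.log_prob(sampled_summary, post) - reference_policy.log_prob(sampled_summary, post)
    return score - kl_coef * kl
```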