Posted by Jacob Devlin and Ming-Wei Chang, Research Scientists, Google AI Language

One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diverse field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general-purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.
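
To make the pre-train/fine-tune pattern concrete, here is a minimal sketch of fine-tuning a pre-trained model on a tiny sentiment analysis set. It assumes the Hugging Face transformers library and a toy two-example dataset, neither of which comes from this post (the post itself refers to the authors' own TensorFlow release), so treat it as an illustration of the workflow rather than the authors' implementation:

```python
# Sketch of the pre-train / fine-tune workflow described above.
# Assumption: uses the Hugging Face `transformers` library and hypothetical toy data,
# not the TensorFlow code released with the post.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load weights that were pre-trained on large amounts of unannotated text.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A tiny, hypothetical labeled dataset for the downstream task (sentiment analysis).
texts = ["A wonderful, heartfelt film.", "Dull and far too long."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

# Fine-tuning step: update the pre-trained weights on the small labeled set.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
```

In practice this loop would run over many batches and epochs, but the key point stands: only a small labeled dataset is needed, because most of the linguistic knowledge is already captured during pre-training.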