the health strategist
platform
the most comprehensive knowledge portal
for continuous health transformation
and digital health for all
Joaquim Cardoso MSc.
Chief Research and Strategy Officer (CRSO),
Chief Editor and Senior Advisor
December 15, 2023
What is the message?
The article by Mojan Javaheripi and Sébastien Bubeck highlights the breakthrough achieved by Microsoft Research’s Machine Learning Foundations team with Phi-2, a 2.7 billion-parameter language model.
Despite its compact size, Phi-2 exhibits exceptional performance, outpacing models up to 25 times larger on various benchmarks.
The article emphasizes the significance of training data quality, innovative scaling techniques, and the model’s potential as a research playground.
One page summary
What are the key points?
Evolution of Phi Models:
Microsoft Research’s Phi models, including Phi-1 and Phi-1.5, have demonstrated state-of-the-art performance on benchmarks. Phi-2, with 2.7 billion parameters, builds on this success, showcasing remarkable reasoning and language understanding capabilities.
Innovative Scaling Techniques:
Phi-2’s success is attributed to innovative techniques in model scaling and knowledge transfer from the earlier Phi-1.5 model. The article underscores the importance of breaking conventional language model scaling laws and achieving exceptional performance with smaller models.
Training Data Quality:
The article highlights the critical role of training data quality in model performance. Phi-2 focuses on “textbook-quality” data, including synthetic datasets for common sense reasoning and general knowledge. The training corpus is augmented with carefully selected web data, emphasizing educational value and content quality.
Evaluation and Benchmark Performance:
Phi-2 outperforms models up to 25 times larger on various benchmarks, including common sense reasoning, language understanding, math, and coding. The model surpasses Mistral and Llama-2 models at 7B and 13B parameters, showcasing its superior performance in academic benchmarks.
What are the key examples?
- Phi-2 vs. Gemini Nano 2: A comparison between Phi-2 and Google Gemini Nano 2 shows Phi-2’s superior performance, despite its smaller size (2.7B parameters compared to Gemini Nano 2’s 3.2B).
- Physics Problem Solving: Phi-2’s ability to solve a physics problem, including providing an explanation of the conversion of potential energy to kinetic energy, demonstrates its advanced reasoning capabilities.
What are the key statistics?
Training Details: Phi-2, a Transformer-based model, is trained on 1.4T tokens from synthetic and web datasets for NLP and coding. The training took 14 days on 96 A100 GPUs.
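To put these figures in perspective, the short calculation below converts them into GPU-hours and an average token throughput. It is a back-of-the-envelope sketch: it assumes all 96 GPUs were busy for the full 14 days, which the article does not state, and the variable names are illustrative.

```python
# Back-of-the-envelope training budget from the figures quoted in the article.
TOKENS = 1.4e12   # 1.4T training tokens (from the article)
GPUS = 96         # A100 GPUs (from the article)
DAYS = 14         # wall-clock training time (from the article)

gpu_hours = GPUS * DAYS * 24

# Assumption: every GPU ran for the whole 14 days (no downtime or restarts).
tokens_per_gpu_second = TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:,}")
print(f"Tokens per GPU per second: {tokens_per_gpu_second:,.0f}")
```

Under that assumption the run comes to 32,256 GPU-hours, or roughly 12,000 tokens processed per GPU per second on average.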
Model Performance Averages:
- Phi-2 outperforms Llama-2 models at 7B and 13B parameters in various benchmarks.
- Phi-2 surpasses Mistral-7B in average performance.
- Phi-2’s safety score on the ToxiGen benchmark is compared to Phi-1.5 and Llama-7B, showing its improved behavior with respect to toxicity and bias.
Conclusion
The article concludes by emphasizing Phi-2’s potential as a research tool, providing insights into its training data quality, innovative scaling techniques, and superior performance on diverse benchmarks.
The model’s compact size makes it an ideal platform for researchers to explore mechanistic interpretability, safety improvements, and fine-tuning experiments across various tasks in natural language processing.
DEEP DIVE
Phi-2: The surprising power of small language models
Microsoft Research Blog
By Mojan Javaheripi, Senior Researcher, and Sébastien Bubeck, Partner Research Manager
December 12, 2023
Contributors
Marah Abdin, Jyoti Aneja, Sébastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang
Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.
We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.
With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.
Key Insights Behind Phi-2
The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.
Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:
Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following upon our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.
Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2 benchmark scores.
Training Details
Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of Synthetic and Web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique; see our previous tech report for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
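For readers unfamiliar with a next-word prediction objective, the sketch below shows the generic loss it usually refers to: the average negative log-likelihood of the token that actually came next. This is not Phi-2's training code, which the article does not show. The function name and the probability values are hypothetical and chosen only for illustration.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood of the correct next tokens.

    token_probs[t] is the probability the model assigned to the token that
    actually appeared next, given all tokens up to position t.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for a four-token continuation (illustrative only).
confident = [0.9, 0.8, 0.95, 0.85]   # model mostly predicts the right token
uncertain = [0.3, 0.2, 0.25, 0.1]    # model spreads probability elsewhere

print(next_token_loss(confident) < next_token_loss(uncertain))  # True
```

Training lowers this loss, which means pushing more probability onto the tokens that actually follow in the training text.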
Phi-2 Evaluation
Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3 shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8 shot)), and coding (HumanEval, MBPP (3-shot)).
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.
Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to discard this possibility, which can be found in our first report “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).
Model | Size | BBH | Commonsense Reasoning | Language Understanding | Math | Coding |
---|---|---|---|---|---|---|
Llama-2 | 7B | 40.0 | 62.2 | 56.7 | 16.5 | 21.0 |
Llama-2 | 13B | 47.8 | 65.0 | 61.9 | 34.2 | 25.4 |
Llama-2 | 70B | 66.5 | 69.2 | 67.6 | 64.1 | 38.3 |
Mistral | 7B | 57.2 | 66.4 | 63.7 | 46.4 | 39.4 |
Phi-2 | 2.7B | 59.2 | 68.8 | 62.0 | 61.1 | 53.7 |
Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
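The comparison in Table 1 can be checked mechanically. The sketch below copies the table's scores and lists, for each baseline, the categories where Phi-2 has the higher average, along with the parameter-count ratio. The dictionary layout, the model keys, and the helper name `wins` are illustrative choices, not part of the original article.

```python
# Scores copied from Table 1, in this category order.
CATEGORIES = ["BBH", "Commonsense", "Language", "Math", "Coding"]
SCORES = {
    "Llama-2-7B":  [40.0, 62.2, 56.7, 16.5, 21.0],
    "Llama-2-13B": [47.8, 65.0, 61.9, 34.2, 25.4],
    "Llama-2-70B": [66.5, 69.2, 67.6, 64.1, 38.3],
    "Mistral-7B":  [57.2, 66.4, 63.7, 46.4, 39.4],
    "Phi-2":       [59.2, 68.8, 62.0, 61.1, 53.7],
}
SIZES_B = {  # parameters, in billions
    "Llama-2-7B": 7.0, "Llama-2-13B": 13.0, "Llama-2-70B": 70.0,
    "Mistral-7B": 7.0, "Phi-2": 2.7,
}

def wins(model, baseline):
    """Categories where `model` scores strictly higher than `baseline`."""
    pairs = zip(CATEGORIES, SCORES[model], SCORES[baseline])
    return [cat for cat, a, b in pairs if a > b]

for name in SCORES:
    if name == "Phi-2":
        continue
    ratio = SIZES_B[name] / SIZES_B["Phi-2"]
    print(f"{name} ({ratio:.1f}x larger): Phi-2 leads on {wins('Phi-2', name)}")
```

On these numbers, Phi-2 leads both the 7B and 13B Llama-2 models in all five categories and Mistral-7B in four of five (all but language understanding). Against Llama-2-70B, which is about 25.9x larger, Phi-2 leads on coding, while the 70B model has the higher math average (64.1 vs. 61.1).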
Model | Size | BBH | BoolQ | MBPP | MMLU |
---|---|---|---|---|---|
Gemini Nano 2 | 3.2B | 42.4 | 79.3 | 27.2 | 55.8 |
Phi-2 | 2.7B | 59.3 | 83.3 | 59.1 | 56.7 |
Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.
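To see how wide the gaps in Table 2 are, the sketch below computes Phi-2's margin over Gemini Nano 2 on each reported benchmark. The scores come straight from the table; the variable names are illustrative.

```python
# Scores copied from Table 2, in this benchmark order.
BENCHMARKS = ["BBH", "BoolQ", "MBPP", "MMLU"]
GEMINI_NANO_2 = [42.4, 79.3, 27.2, 55.8]
PHI_2 = [59.3, 83.3, 59.1, 56.7]

# Margin in points (Phi-2 minus Gemini Nano 2), rounded to one decimal
# to hide floating-point noise.
margins = {
    name: round(phi - gemini, 1)
    for name, gemini, phi in zip(BENCHMARKS, GEMINI_NANO_2, PHI_2)
}

for name, margin in margins.items():
    print(f"{name}: +{margin}")
```

The margins range from 0.9 points on MMLU, where the two models are close to parity, to 31.9 points on MBPP.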
In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed a behavior in accordance with the expectation we had given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:
Originally published by https://www.microsoft.com/