BloombergGPT: A Large Language Model for Finance — [built with a 363-billion-token dataset, perhaps the largest domain-specific dataset yet]

the health strategist . institute

research and strategy institute — for continuous transformation
in health, care, cost and tech

Joaquim Cardoso MSc
Chief Researcher & Editor of the Site
March 31, 2023


The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering.

  • Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature.

  • In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. 

  • We construct a 363 billion token dataset based on Bloomberg’s extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets.

  • We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. 

  • Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks.

  • Additionally, we explain our modeling choices, training process, and evaluation methodology. 

As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.


BloombergGPT: A Large Language Model for Finance


Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

30 Mar 2023

1. Introduction [excerpt]

The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs). 

GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. This ability improves well above random as we increase the size of language models. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.
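The few-shot prompting described above amounts to formatting a handful of labeled examples ahead of an unlabeled query, so the model can infer the task from the pattern. A minimal sketch of that prompt format follows; the sentiment task, the field names, and the example sentences are purely illustrative, not drawn from the paper:

```python
def build_few_shot_prompt(examples, query):
    """Format (input, label) pairs followed by an unlabeled query."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The final entry has no label: the model is expected to complete it.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("Shares rallied after a strong earnings report.", "positive"),
    ("The company missed revenue estimates badly.", "negative"),
]
prompt = build_few_shot_prompt(
    examples, "Guidance was raised for the next quarter."
)
print(prompt)
```

The resulting string would be sent to the model as-is; no gradient updates are involved, which is what lowers the barrier to entry for new tasks.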

After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). 

Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data and parameter-efficient training sizes (Hoffmann et al., 2022).

These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. 

While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles (Gao et al., 2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general-purpose LLMs on tasks within those domains, such as science (Taylor et al., 2022) and medicine (Bolton et al., 2023; Luo et al., 2022; Lehman et al., 2023). These findings motivate further development of models focused on specific domains.

Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role (Xing et al., 2018; Fisher et al., 2016; Dredze et al., 2016).

Financial NLP tasks (Shah et al., 2022) include sentiment analysis (Araci, 2019), named entity recognition (Salinas Alvarado et al., 2015), news classification (Sinha and Khandait, 2020), and question answering (Chen et al., 2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general — few-shot learning, text generation, conversational systems, etc. — it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain (Araci, 2019), no LLM has been tuned for or evaluated on tasks for this domain.

1.1 BloombergGPT [excerpt]

We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. 

Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.

We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. 

As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022). We validate the model on standard LLM benchmarks, open financial benchmarks, and a suite of Bloomberg-internal benchmarks that most accurately reflect our intended use cases. Our results demonstrate that our mixed training approach leads to a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.
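A back-of-the-envelope check of the corpus figures quoted above (363 billion domain-specific tokens plus 345 billion general-purpose tokens) shows how they yield the "over 700 billion tokens" total and why the paper describes the result as a mixed corpus, roughly half financial:

```python
# Token counts as reported in the abstract.
financial_tokens = 363e9   # Bloomberg's domain-specific data
general_tokens = 345e9     # public general-purpose datasets

total = financial_tokens + general_tokens
financial_share = financial_tokens / total

print(f"total: {total / 1e9:.0f}B tokens")           # ~708B, i.e. "over 700 billion"
print(f"financial share: {financial_share:.1%}")     # roughly half the mix
```

This near-even split is the "mixed approach" the section describes: enough financial text to dominate in-domain tasks, with enough general text to stay competitive on general benchmarks.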

Other Sections

See the original publication (this is an excerpt version only)

9. Conclusion [excerpt]

We have presented BloombergGPT, a best-in-class LLM for financial NLP.

Our model contributes to the ongoing dialog on effective ways to train domain-specific models. 

Our training strategy of mixing domain-specific and general-purpose data results in a model that balances performance in both domains. Additionally, our work offers another data point on selecting Chinchilla optimal-sized models. Finally, we hope that our model training logs will provide a guide for those training their own LLMs.
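For readers unfamiliar with the "Chinchilla optimal-sized" phrase above: Hoffmann et al. (2022) found that, for a fixed compute budget, training is roughly compute-optimal at about 20 tokens per model parameter. The sketch below applies that widely cited rule of thumb to a 50 billion parameter model; the ratio is a heuristic from that paper, not a figure from this excerpt:

```python
# Hoffmann et al. (2022) rule of thumb: ~20 training tokens per parameter
# for compute-optimal training. Illustrative only; the exact ratio
# depends on the compute budget.
params = 50e9                 # BloombergGPT parameter count
tokens_per_param = 20
optimal_tokens = tokens_per_param * params

print(f"Chinchilla-optimal tokens for a 50B model: ~{optimal_tokens / 1e12:.1f}T")
```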

We have several interesting directions to pursue. 

First, task fine-tuning has yielded significant improvements in LLMs, and we plan to consider what unique opportunities exist for model alignment in the financial domain (Wei et al., 2021; Ouyang et al., 2022). 

Second, by training on data in FinPile, we are selecting data that may exhibit less toxic and biased language. 

The effects of this on the final model are unknown as yet, which we plan to test. Third, we seek to understand how our tokenization strategy changes the resulting model. These are a few of the new research directions we hope to pursue with BloombergGPT.

We achieve strong results on general LLM benchmarks and outperform comparable models on financial tasks. 

We attribute this, in decreasing order of impact, to (1) a well-curated internal dataset, (2) our unique choice of tokenizer, and (3) an up-to-date architecture.

We will continue to develop financial applications with BloombergGPT to further explore the benefits of these modeling choices.

Originally published at
