Flower, in partnership with the CaMLSys lab at the University of Cambridge, has for the first time trained a 1.3 billion parameter LLM using a novel formulation of federated learning methods. The resulting LLM and companion methodology, which we call FlowerLLM, beat the previous record set by Google DeepMind by more than a factor of three and pave the way towards an era of increasingly democratic participation in foundation models. Arguably of even greater importance, this viable federated approach to LLM pre-training is likely to lead directly to stronger, more capable foundation models by broadening access to both compute and data.
The 1.3B Federated Breakthrough
What we have achieved is the generative pre-training of a billion-parameter-scale LLM, advancing federated learning beyond merely fine-tuning openly available, centrally pre-trained weights. This is a breakthrough for both federated learning and foundation models, for two reasons.
First, it now becomes possible to incorporate a much wider range of potential data sources, moving beyond public web-based data to include sensitive distributed data (e.g., data held by corporations, hospitals, cars, and phones) that would otherwise be ignored. We estimate that this type of distributed data dwarfs conventional training data.
Second, it also becomes possible to combine compute from a similarly wide variety of sources. This includes isolated GPUs that might be physically distant and are likewise neglected, since compute normally must reside in the same logical data center. Just as with distributed data, leveraging this overlooked compute, together with the newfound flexibility to combine compute resources, will both broaden participation and support larger LLM parameter counts.
To develop our FlowerLLM federated pre-training process, we experimented with hundreds of models spanning two orders of magnitude in size. The outcome is a soon-to-be open-source stack, built directly on Flower, that will offer LLM pre-training even to those with little prior experience in federated methods or foundation models.
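As a concrete illustration of what such a stack involves, the sketch below uses Flower's public simulation API to federate the pre-training of a toy causal language model. It is a minimal, hedged example: the tiny model, the random token shards standing in for each node's private corpus, the hyper-parameters, and the client count are placeholders chosen for readability, not the FlowerLLM configuration, and it assumes a recent `flwr` release (with the simulation extra) plus PyTorch.

```python
# Minimal sketch (not the released FlowerLLM stack): federated pre-training
# of a toy causal language model with Flower's simulation API.
# Requires: pip install "flwr[simulation]" torch

from collections import OrderedDict

import flwr as fl
import torch
import torch.nn as nn


class TinyCausalLM(nn.Module):
    """Stand-in decoder-style model; FlowerLLM targets billion-scale architectures."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        return self.head(self.block(self.embed(token_ids)))


class PretrainClient(fl.client.NumPyClient):
    """Each client trains on its own local token shard and returns updated weights."""

    def __init__(self, model: nn.Module, local_tokens: torch.Tensor):
        self.model = model
        self.local_tokens = local_tokens  # shape: (num_sequences, seq_len + 1)

    def get_parameters(self, config):
        return [p.detach().cpu().numpy() for p in self.model.state_dict().values()]

    def set_parameters(self, parameters):
        keys = self.model.state_dict().keys()
        state = OrderedDict((k, torch.tensor(v)) for k, v in zip(keys, parameters))
        self.model.load_state_dict(state)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        opt = torch.optim.AdamW(self.model.parameters(), lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()
        for seq in self.local_tokens:
            # Next-token prediction on the local shard only; raw data never leaves.
            inputs, targets = seq[:-1].unsqueeze(0), seq[1:].unsqueeze(0)
            loss = loss_fn(self.model(inputs).transpose(1, 2), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return self.get_parameters(config), len(self.local_tokens), {}


def client_fn(cid: str) -> fl.client.Client:
    # In a real deployment each node would load its own private corpus;
    # here we fabricate a random token shard per client for illustration.
    shard = torch.randint(0, 32000, (8, 129))
    return PretrainClient(TinyCausalLM(), shard).to_client()


if __name__ == "__main__":
    fl.simulation.start_simulation(
        client_fn=client_fn,
        num_clients=64,  # mirrors the clients-per-round scale in the table below
        config=fl.server.ServerConfig(num_rounds=3),
        strategy=fl.server.strategy.FedAvg(fraction_fit=1.0, fraction_evaluate=0.0),
    )
```

In a real deployment, `client_fn` would be replaced by long-running Flower clients on separate machines, each loading its own data, while the server-side strategy decides how their updates are combined each round.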
| | FlowerLLM-Small | FlowerLLM-Medium | FlowerLLM-Large | DeepMind |
|---|---|---|---|---|
| Parameters | 120M | 350M | 1.3B | 400M |
| GPUs (A100/H100) | 16 | 16 | 16 | 16 |
| Max. Clients per Round | 128 | 128 | 64 | 64 |
| Total Tokens Trained | 11B | 11B | 5.7B | 5.7B |
As shown, FlowerLLM allows us to train models three times as large as the most similar previous work while using the same number of GPUs, the same number of clients per round, and the same number of total tokens processed. When pre-training a similarly sized architecture, it can train with twice as many clients per round over roughly twice as many total tokens.
| | Distributed Training | Heterogeneous Hardware | Heterogeneous Data | Private Data |
|---|---|---|---|---|
| [Yuan et al.] | Yes | No | No | No |
| DeepMind [Douillard et al.] | Yes | No | Yes | No |
| FlowerLLM | Yes | Yes | Yes | Yes |
A qualitative analysis of the FlowerLLM stack shows that it is, to the best of our knowledge, the only method able to train in a fully federated fashion. We compare it directly to the aforementioned work from DeepMind by Douillard et al. and to Yuan et al., the next most similar works to our own. Critically, FlowerLLM allows distributed training on nodes with a high degree of hardware heterogeneity. This simplifies the creation of large-scale compute resources by allowing them to be composed of nodes with varying hardware specifications. Furthermore, because of its federated approach, when carefully performed it can preserve the privacy of the data contributed by each node, another critical building block for assembling ever larger collections of diverse data. The comparison methodologies lack the required combination of characteristics to offer the same set of benefits as FlowerLLM.
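To make the privacy and heterogeneity point concrete, the short sketch below shows the kind of server-side aggregation step a federated pipeline relies on: only parameter updates ever leave a node, and nodes with different hardware, and therefore different amounts of locally processed data, are weighted accordingly. The token-count weighting and the function shown here are illustrative assumptions in the spirit of FedAvg, not the exact FlowerLLM recipe.

```python
# Illustrative FedAvg-style aggregation: the server combines model updates,
# never raw data, and weights each node by how many tokens it processed.

from typing import List

import numpy as np


def aggregate(updates: List[List[np.ndarray]], num_tokens: List[int]) -> List[np.ndarray]:
    """Weighted average of per-client parameter lists.

    updates[i] is the list of parameter arrays returned by client i;
    num_tokens[i] is how many tokens that client trained on this round.
    """
    total = float(sum(num_tokens))
    weights = [n / total for n in num_tokens]
    num_layers = len(updates[0])
    return [
        sum(w * client[layer] for w, client in zip(weights, updates))
        for layer in range(num_layers)
    ]


# Example: a fast H100 node that processed 3x more tokens than a slower
# A100 node simply receives 3x the weight in the averaged model.
fast_node = [np.ones((2, 2)), np.zeros(3)]
slow_node = [np.zeros((2, 2)), np.ones(3)]
merged = aggregate([fast_node, slow_node], num_tokens=[300, 100])
print(merged[0])  # 0.75 everywhere: the faster node contributes proportionally more
```

Because the server only ever sees these weighted parameter arrays, a slow edge GPU and a fast data-center GPU can participate in the same training round without either one exposing its underlying text.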
Next steps for FlowerLLM
The immediate future of FlowerLLM will focus on two directions. First, this new methodology will be used to train increasingly larger versions of FlowerLLM; we anticipate scaling to at least 30 billion parameters within 2024. Second, we will release all necessary artifacts -- such as source code, model weights, and hyper-parameters -- to enable anyone to replicate and extend the FlowerLLM approach.
Longer term, FlowerLLM will tackle even more challenging deployment environments, for example, forming large networks of low-resource devices such as smartphones for LLM pre-training. While this setting will require further breakthroughs, it will open the door to massive amounts of currently unused compute and to unique, but ignored, data. Ultimately, we expect FlowerLLM techniques to enable a new class of LLMs that would otherwise be impossible to train.