Yandex develops and open-sources an LLM training tool that saves up to 20% of GPU resources

  • Yandex introduces YaFSDP, a method for faster and more efficient large language model (LLM) training.
  • YaFSDP speeds up training by up to 26% compared to FSDP. The resulting reduction in GPU consumption can potentially save developers and companies hundreds of thousands of dollars monthly.
  • YaFSDP is currently the most effective publicly available method for improving GPU communication and reducing memory usage.
  • YaFSDP is available for free on GitHub. ML engineers and companies worldwide can use it to improve LLM training efficiency.

Yandex, a global tech company, recently introduced YaFSDP, an open-source method for training large language models (LLMs). YaFSDP is currently the most effective publicly available tool for enhancing GPU communication and reducing memory usage in LLM training, offering a speedup of up to 26% compared to FSDP, depending on the architecture and number of parameters. By shortening LLM training time, YaFSDP can save up to 20% of GPU resources.
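The 26% and 20% figures are consistent if the speedup is read as a gain in training throughput: a run that is 26% faster needs 1/1.26 of the original GPU time. A minimal sketch of that conversion, assuming this interpretation (the function name is illustrative and is not part of YaFSDP):

```python
def gpu_time_saved(speedup_pct: float) -> float:
    """Fraction of GPU time saved when throughput rises by speedup_pct percent.

    Assumes a fixed amount of training work, so time scales as 1 / throughput.
    """
    if speedup_pct <= -100:
        raise ValueError("speedup must be greater than -100%")
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)


if __name__ == "__main__":
    # 26% is the maximum speedup quoted in the release.
    print(f"26% speedup -> {gpu_time_saved(26):.1%} less GPU time")
```

Under this reading, a 26% speedup corresponds to about 20.6% less GPU time, which matches the "up to 20%" savings claim.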

As part of its commitment to contribute to the global AI community, Yandex made YaFSDP publicly available to LLM developers and AI enthusiasts worldwide.

Mikhail Khruschev,
senior developer at Yandex and part of the team behind YaFSDP

“Currently, we’re actively experimenting with various model architectures and parameter sizes to expand YaFSDP’s versatility. We are thrilled to share our developments in LLM training with the global ML community, contributing to increased accessibility and efficiency for researchers and developers worldwide.”

The case for YaFSDP

LLM training is a time-consuming and resource-intensive process. Machine learning engineers and companies that develop their own LLMs invest significant time and GPU resources, and therefore money, in training these models. The larger the model, the greater the time and expense of training it.

Yandex’s YaFSDP works by eliminating inefficiencies in GPU communication: it ensures that training uses only as much memory as it needs and that communication between GPUs proceeds without interruption.

YaFSDP optimizes learning speed and performance, enabling AI developers worldwide to use less computing power and GPU resources when training their models. For instance, in a pre-training scenario involving a model with 70 billion parameters, using YaFSDP can save the resources of approximately 150 GPUs, which translates to roughly $0.5 to $1.5 million (depending on the virtual GPU provider or platform) in potential monthly savings.
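The dollar range can be sanity-checked with simple arithmetic. The sketch below back-solves the hourly GPU price implied by the quoted monthly savings. It assumes a 30-day month (720 hours) and GPUs that are busy around the clock; these assumptions and the function names are illustrative and do not come from Yandex.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, full utilization


def monthly_savings(gpus_saved: int, usd_per_gpu_hour: float) -> float:
    """Monthly cost of the GPUs that are no longer needed."""
    return gpus_saved * HOURS_PER_MONTH * usd_per_gpu_hour


def implied_hourly_price(gpus_saved: int, monthly_usd: float) -> float:
    """Hourly price per GPU that would produce the given monthly savings."""
    return monthly_usd / (gpus_saved * HOURS_PER_MONTH)


if __name__ == "__main__":
    # 150 GPUs and the $0.5M to $1.5M range are the figures quoted above.
    for total in (0.5e6, 1.5e6):
        price = implied_hourly_price(150, total)
        print(f"${total:,.0f}/month over 150 GPUs -> ${price:.2f} per GPU-hour")
```

Under these assumptions, the quoted range corresponds to roughly $4.63 to $13.89 per GPU-hour, which is why the estimate depends on the provider or platform.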

YaFSDP’s training efficiency

YaFSDP, an enhanced version of FSDP, outperforms the FSDP method in the most communication-heavy stages of LLM training, such as pre-training, alignment, and fine-tuning. Compared to FSDP, YaFSDP speeds up training by 21% on Llama 2 70B and by 26% on Llama 3 70B.

Mikhail Khruschev,
senior developer at Yandex and part of the team behind YaFSDP

“YaFSDP has shown impressive results on models ranging from 13 to 70 billion parameters, with particularly strong performance in the 30 to 70 billion range. Currently, YaFSDP is best suited for widely used open-source models based on the LLaMA architecture.”

YaFSDP isn’t Yandex’s first open-source tool. The company has previously shared several other tools that have become popular with the ML community, including:

  • CatBoost, a high-performance library for gradient boosting on decision trees.
  • YTsaurus, a big data platform for distributed storage and processing.
  • AQLM, one of the most advanced quantization algorithms for extreme compression of large language models, developed jointly by Yandex Research, HSE University, IST Austria, and Neural Magic.
  • Petals, a library designed to simplify the process of training and fine-tuning LLMs, developed in a collaboration involving Yandex Research, HSE University, University of Washington, Hugging Face, ENS Paris-Saclay, and Yandex School of Data Analysis.

About Yandex

Yandex is a global technology company that builds intelligent products and services powered by machine learning. The company’s goal is to help consumers and businesses better navigate the online and offline world. Since 1997, Yandex has been delivering world-class, locally relevant search and information services and has also developed market-leading on-demand transportation services, navigation products, and other mobile applications for millions of consumers across the globe.
