Source: Quantum
In the global AI race, leading AI companies such as OpenAI, Microsoft and Meta are turning to a development process called "distillation" to build AI models that are cheaper for consumers and businesses to adopt.
The technique attracted widespread attention after DeepSeek used it to build powerful and efficient AI models based on open-source systems released by competitors Meta and Alibaba. The breakthrough shook confidence in Silicon Valley's leadership in AI and triggered a sharp drop in the stocks of large American technology companies.
With distillation, companies take a large language model, known as the "teacher" model, which generates the next likely word in a sentence. The teacher model generates data that is then used to train a smaller "student" model, quickly transferring the knowledge and predictions of the large model to the smaller one.
While distillation has been widely used for many years, recent advances have convinced industry experts that the technique will increasingly be a boon for startups seeking cheap and effective ways to build applications.
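To make the teacher-student idea concrete, here is a minimal sketch of one common form of knowledge distillation, assuming PyTorch and Hugging Face-style models whose outputs expose a `.logits` field; the function names and training loop are illustrative, not any particular company's pipeline.

```python
# Minimal knowledge-distillation sketch (assumes PyTorch; names are illustrative).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student matches the teacher's next-token distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions, scaled by T^2
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

def train_step(teacher, student, optimizer, input_ids):
    """One update: the frozen teacher predicts, the student learns from its predictions."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # teacher's next-token predictions
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the soft-target loss above is often mixed with an ordinary next-token cross-entropy loss on the training text, but the core transfer mechanism is the same: the student is trained to reproduce the teacher's output distribution.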
"Distillation is really magical," said Olivier Godement, head of product for OpenAI's platform. "What this process essentially does is take a large, cutting-edge model of intelligence and use it to train a smaller model…that is very powerful at a specific task, and it is cheap and very fast."
Large language models like OpenAI’s GPT-4, Google’s Gemini, and Meta’s Llama require massive amounts of data and computing power to develop and maintain. While the companies don’t disclose how much it costs to train the large models, it’s likely in the hundreds of millions of dollars.
Distillation makes the power of these models available to developers and businesses at a very low price, allowing app developers to quickly run AI models on devices like laptops and smartphones.
Developers can use OpenAI's platform to perform distillation and learn from the large language models that power products like ChatGPT. Microsoft, OpenAI's biggest backer, used GPT-4 to distill its Phi family of small language models as part of its commercial partnership, after investing nearly $14 billion in the company.
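For illustration, a rough sketch of how a developer might use the OpenAI platform's distillation workflow follows; it assumes the OpenAI Python SDK, the `gpt-4o` and `gpt-4o-mini` model names and the file ID are placeholder choices, and the exact parameters may differ from the current API.

```python
# Hedged sketch of distillation on the OpenAI platform (model names, file ID,
# and workflow details are illustrative assumptions, not an official recipe).
from openai import OpenAI

client = OpenAI()

# 1. Collect outputs from a large "teacher" model, storing the completions
#    on the platform so they can later be exported as training data.
response = client.chat.completions.create(
    model="gpt-4o",  # large "teacher" model (illustrative choice)
    messages=[{"role": "user", "content": "Summarize this email: ..."}],
    store=True,  # persist the completion for later reuse
    metadata={"task": "email-summarization"},
)

# 2. After exporting the stored completions as a training file, fine-tune a
#    smaller "student" model on them (the file ID below is a placeholder).
job = client.fine_tuning.jobs.create(
    training_file="file-REPLACE_WITH_EXPORTED_ID",
    model="gpt-4o-mini",  # smaller "student" model (illustrative choice)
)
print(job.id)
```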
However, OpenAI said it believed DeepSeek had distilled its models to train its rival products, a move that violated its terms of service. DeepSeek has not yet publicly responded to the claim.
While distillation can be used to build high-performing models, experts add that the technique has limits.
“Distillation presents a very interesting trade-off; if you make models smaller, you inevitably reduce their capabilities,” said Ahmed Awadallah of Microsoft Research. He said the distilled model can be used to summarize emails, “but it’s really not very good at other things.”
David Cox, vice president of AI models at IBM Research, said most companies don’t need huge models to run their products, and streamlined models are powerful enough for scenarios such as customer service chatbots, or to run on small devices such as mobile phones.
“As long as you can reduce costs and get the capabilities you want, why not do it?” he added.
This poses a challenge to the business models of many leading AI companies. Even when developers use distilled models from companies like OpenAI, those models cost far less to run, are cheaper to create, and therefore generate less revenue. Model developers like OpenAI typically charge less for the use of distilled models because they require less compute.
However, OpenAI's Godement believes that large language models will still be used for "high-intelligence and high-risk tasks" because "companies are willing to pay more for high levels of accuracy and reliability." Large models are also needed to discover new capabilities, which can then be distilled into smaller ones, he added.
Still, the company is working to prevent its large models from being extracted and used to train rival products. OpenAI has teams that monitor usage, and if it suspects a user is generating large amounts of data to export and train competitors, it can remove that user’s access, as it has done with accounts it believes were linked to DeepSeek. But most of these actions are taken after the fact.
"OpenAI has been working to prevent data from being distilled for a long time, but it's very difficult to avoid it completely," said Douwe Kiela, CEO of Contextual AI, a startup that builds information retrieval tools for businesses.
Distillation is also a win for advocates of open models, whose technology is made freely available to developers. DeepSeek has also opened its latest models to developers.
"We will immediately use distillation and incorporate it into our products," said Yann LeCun, chief AI scientist at Meta. "That's the idea of open source. As long as these processes are open, you can benefit from other people's development."
Distillation also means that model developers can spend billions of dollars improving the capabilities of their AI systems only to see competitors catch up quickly, as DeepSeek's recent releases show. This raises questions about the first-mover advantage of building large language models, whose capabilities can now be replicated in a matter of months.
“In this world that’s changing so fast…you might actually spend a lot of money doing it the hard way and pretty soon everyone else in the space will follow,” IBM’s Cox said. “So it’s an interesting but tricky business environment.”