In the race to develop and improve artificial intelligence (AI) tools, companies are finding creative ways to feed their ever-growing models. Adobe, for example, has used its vast database of stock photos to build its own suite of AI tools, called Firefly. By sidestepping the copyright disputes that often come with mining the internet for images, Adobe has thrived in the market: its share price has risen by 36% since the release of Firefly.
The demand for data to train AI models is growing rapidly, with companies seeking out new sources to sustain the feeding frenzy. The two essential ingredients for an AI model are datasets and processing power. More of either can improve a model, but specialist AI chips are in short supply, which makes acquiring data even more crucial.
However, accessing high-quality data is becoming more challenging. Epoch AI, a research outfit, predicts that the stock of high-quality text available for training may be exhausted by 2026. The better the data, the better the model, so companies are looking for long-form, well-written, and factually accurate writing to train their models effectively.
As the demand for data increases, companies are striking deals to secure data sources. OpenAI, for example, has partnered with the Associated Press and Shutterstock to access their archives of stories and stock photography. Google is reportedly in discussions with Universal Music to license artists' voices for a songwriting AI tool. And rumors are swirling about AI labs approaching the BBC and JSTOR for access to their archives.
Content creators are also demanding compensation for their material that has been ingested into AI models, leading to copyright infringement cases against model builders. Reddit, Stack Overflow, and Twitter have increased the cost of access to their data, taking advantage of their bargaining power.
To improve the quality of their data, AI labs employ data annotators and gather feedback from users to fine-tune their models. Companies are also starting to tap into the rich data that exists within their own corporate walls. Many businesses possess vast amounts of useful data, from call-center transcripts to customer spending records, which can be used to fine-tune AI models for specific business purposes.
In conclusion, the demand for data to fuel AI models is pushing companies to get creative in finding and securing sources. The race has spurred dealmaking and strengthened the bargaining power of data holders, while companies also work to improve the quality of the data they already have and to tap what sits inside their own organizations. The competition for data is only just beginning, and companies will have to keep finding new ways to feed their ever-larger models.