As the world runs out of public data to train AI, the new functionality safely unlocks proprietary text data, accelerating LLM development and the deployment of high-quality generative AI solutions
VIENNA and NEW YORK, Oct. 1, 2024 /PRNewswire/ -- MOSTLY AI, a pioneer in structured synthetic data, has launched synthetic text functionality, expanding the power and potential of synthetic data to train AI models at a time when organizations worldwide struggle to leverage proprietary data assets because of privacy concerns. With this new functionality, enterprises can unlock the vast amounts of proprietary text they collect, such as emails, customer support transcripts, and chatbot conversations, without compromising privacy, and use it to train and fine-tune large language models (LLMs) for faster innovation and better decision-making.
“Today, AI training is hitting a plateau as models exhaust public data sources and yield diminishing returns,” said Tobias Hann, CEO of MOSTLY AI. “To harness high-quality, proprietary data, which offers far greater value and potential than the residual public data currently being used, global enterprises must take the leap and leverage both structured and unstructured synthetic data to safely train and deploy forthcoming generative AI solutions.”
Gartner predicts that by 2026, 75% of companies will use generative AI to create synthetic customer data, up from less than 5% in 2023. MOSTLY AI is enabling this mass adoption by expanding its platform to include synthetic text, which addresses three major challenges enterprises face today:
- Real text data often contains sensitive information, such as personally identifiable information (PII), posing a risk of unintended exposure when used in LLMs.
- The available text data may not be optimal for LLM training because it often lacks diversity, and manually creating more diverse, specialized data is labor-intensive and can yield low-quality results.
- Companies are shifting focus from public to proprietary data. However, text data never stands alone; it comes intertwined with structured data about a company's customer base.
Synthetic data is set to become the driving force behind LLMs. Leveraging advanced tools to unveil deep insights hidden in proprietary data is paramount for strategic, informed decision-making across operations. MOSTLY AI provides companies with a synthetic representation that reflects both the text and the structured insights they hold. By uniquely integrating structured and unstructured data, MOSTLY AI enables enterprises to create a complete and statistically accurate picture of their proprietary data assets, and to fine-tune and deliver high-quality, bespoke generative AI solutions in a safe and compliant way.
In addition to safety and compliance, a critical factor to consider with synthetic text is its quality. When used to train a downstream text classifier, synthetic text generated by the MOSTLY AI Platform delivers a performance improvement of as much as 35% compared with text generated by prompting GPT-4o-mini with either no or just a few real-world examples. This significant boost demonstrates MOSTLY AI’s ability to produce high-quality, impactful synthetic data.
With this launch, enterprises can take any model from Hugging Face and fine-tune it with proprietary text data to generate synthetic data, streamlining a process that is typically complex and time-consuming. This innovation by MOSTLY AI makes it extremely convenient for large organizations to harness the power of creative, private, high-quality synthetic text.
“Being able to seamlessly leverage open-source models like our own Viking-7B on MOSTLY AI’s platform underlines the transformative potential of synthetic data,” said Peter Sarlin, CEO of Silo AI. “With the ability to fine-tune models on proprietary text data in a privacy-preserving way, we’re moving beyond the sheer quantity of data to a focus on quality, which is critical for the future of AI training.”
“Bringing almost a decade of deep technical expertise, MOSTLY AI delivers superior quality and reliability, and is backed by a highly experienced team and industry-leading technological excellence,” said Christoph Hornung, Partner at Molten Ventures, investor in MOSTLY AI. “With the platform’s expansion into synthetic text, MOSTLY AI is well-positioned to support any enterprise with its sensitive data and LLM needs.”
Founded in 2017, MOSTLY AI works with global enterprises and partners including AWS, Databricks, O2 Telefónica, and more. To learn more about the company’s synthetic text functionality or get in touch with the team, please visit mostly.ai.
About MOSTLY AI
MOSTLY AI pioneered the creation of synthetic data for AI model development. Datasets generated by the MOSTLY AI platform look just as real as a company’s original customer data with just as many details, but without the original personal data points – helping companies comply with privacy protection regulations such as GDPR and CCPA. The fast-growing company currently works with multiple Fortune 100 insurers and banks in Europe and North America. Its team has the deepest expertise in helping companies get business value out of synthetic data.
SOURCE MOSTLY AI