The idea of synthetic data is nothing new. Its use can be traced back to the 1930s, in audio and voice synthesis. However, it is gaining prominence in big data analysis and artificial intelligence training as concerns about bias and privacy grow.
A 2018 Gartner report predicted that through 2022, 85 percent of AI projects would deliver erroneous outcomes because of bias in data, algorithms, or the teams managing them. Meanwhile, big data appears to have become big business for lawyers, even as it becomes a serious privacy concern for ordinary consumers.
To address these data issues, AI companies are turning to manufactured, or synthetic, data. They generate artificial data in various forms, from numerical to visual, using established randomization and anonymization techniques to simulate real-world information.
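As a rough illustration of the idea, and not the method of any company mentioned here, the Python sketch below fits simple statistics to a handful of made-up "real" records and then samples new rows from them, dropping the identifying field. The record fields and the `fit_marginals` and `synthesize` helpers are hypothetical names chosen for this example. Sampling each column independently ignores correlations and offers no formal privacy guarantee; commercial platforms rely on far more sophisticated generative models.

```python
import random
import statistics

# Hypothetical "real" customer records, invented purely for illustration.
real_records = [
    {"name": "Alice", "age": 34, "monthly_spend": 220.0},
    {"name": "Bob", "age": 45, "monthly_spend": 310.5},
    {"name": "Carol", "age": 29, "monthly_spend": 150.0},
    {"name": "Dan", "age": 52, "monthly_spend": 410.0},
]

NUMERIC_FIELDS = ("age", "monthly_spend")


def fit_marginals(records):
    """Estimate the mean and standard deviation of each numeric field."""
    marginals = {}
    for field in NUMERIC_FIELDS:
        values = [record[field] for record in records]
        marginals[field] = (statistics.mean(values), statistics.stdev(values))
    return marginals


def synthesize(records, n, seed=0):
    """Sample n artificial rows from the fitted per-field distributions.

    The identifying "name" field is never copied; each synthetic row gets
    a generated id instead. Values are clamped at zero so that ages and
    spending never come out negative.
    """
    rng = random.Random(seed)
    marginals = fit_marginals(records)
    rows = []
    for i in range(n):
        row = {"id": f"synthetic-{i}"}
        for field, (mean, std) in marginals.items():
            row[field] = max(0.0, rng.gauss(mean, std))
        rows.append(row)
    return rows


synthetic_rows = synthesize(real_records, n=1000)
```

Because the generator is just a function, the same four records can yield a thousand or a million artificial rows, which is the sense in which synthetic data is said to scale.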
The following are some of the most notable companies that are taking advantage of synthetic data to advance the development of artificial intelligence and machine learning.
Founded in 2016, synthetic data and AI company AI.Reverie offers a suite of APIs designed to help organizations across industries train their machine learning algorithms and improve their AI apps. The company specializes in computer vision. It specifically addresses three of the biggest hurdles that have limited this technology for decades: the lack of data diversity, limited data access, and the long and costly process of data labeling. AI.Reverie’s manufactured data are said to deliver 10x diversity and 100% annotation accuracy.
AI.Reverie was named a Gartner Cool Vendor for 2020 in AI Core Technologies, a distinction the company considers an affirmation of its innovative technology. In 2019, AI.Reverie forged a strategic partnership and investment deal with In-Q-Tel, a nonprofit organization that supports US intelligence and defense agencies, to advance computer vision technologies and make them useful in mission-critical applications.
Claiming to be the world’s most accurate synthetic data platform, Mostly.ai seeks to unlock big data assets while maintaining the privacy of the consumers who are the source of that data. This mission reflects the most prominent reason synthetic data is used in research. The company focuses on helping organizations avoid the adverse consequences of violating privacy rights when they use data collected from consumers.
Based in Austria, Mostly.ai takes advantage of “state-of-the-art generative deep neural networks” that come with integrated privacy mechanisms that make it impossible to associate data with specific identities. The company’s GPU-powered technology enables organizations to simulate convincingly realistic and scalable scenarios using completely anonymous customer data. Aside from AI training, Mostly.ai also offers its synthetic data to enable rapid PoC evaluation and support data-driven product development.
Another company whose mission is to accelerate the development of artificial intelligence and machine learning is OneView from Tel Aviv, Israel. Founded in 2019, it has already attracted considerable attention for its synthetic data generation technology. OneView specializes in synthetic data for remote sensing imagery analytics, in particular virtually generated satellite, aerial, and drone imagery to be used in AI algorithm training. The company has been working with defense and intelligence agencies as well as commercial companies and has earned praise from Kobi Katz of RAFAEL Advanced Defense Systems, Ltd.
OneView answers real pain points in the GEOINT industry. AI algorithms used for geospatial analytics (the interpretation of remote sensing images) rely on real images for their training. This raises three main challenges. First, real images are expensive. Second, the images are annotated manually, which is error-prone. Last, sometimes the coverage you are looking for simply does not exist because it was never captured. OneView’s generation platform overcomes all three challenges. First, the generation process is swift and cost-effective. Second, the datasets are created automatically and come out of the system fully annotated and “ready for training.” Third, virtually all possible scenarios are covered: any object can be placed in any environment, and the datasets can be adapted to any available sensor.
Datagen addresses the data bottleneck in computer vision development with scalable, customizable, high-variance training data synthesized with real-world benchmarks. The company helps advance the learning process of AI algorithms used in robotics, smart cars, smart stores, augmented reality, virtual reality, Internet of Things, smart factories, drones, security systems, and various other applications.
Datagen specializes in what it calls “human-focused data,” which it dubs the next generation of synthetic data. This technology harnesses latent space variation generation algorithms based on generative adversarial networks (GANs), super rendering algorithms, and reinforcement learning humanoid algorithms to produce data sets that depict the real world with photo-realistic and high-variance details. These data sets can be delivered with bespoke 2D and 3D annotations.
Cognata is a synthetic data company that specializes in self-driving vehicles. The company provides complete product lifecycle simulation for autonomous vehicle makers and Advanced Driver Assistance Systems (ADAS) developers. It delivers autonomous vehicle (AV) training using automatically generated 3D environments and hyper-realistic AI-powered traffic factors. It also provides an AV validation platform in which scenarios can be compiled to generate millions of AV edge cases. Additionally, it enables sophisticated AV analysis with its configurable rules and visualization tools.
Cognata emphasizes realism, scalability, and ease of integration in its data sets. The company has developed a system to realistically emulate the activity of the sensors and movements of an AV as it moves around a city. As featured in an MIT Technology Review report, researchers at Cognata identified the problem of unpredictability in autonomous vehicles when they encounter unusual scenarios. Cognata developed a solution that leverages synthetic data to address this defect.
Lifting Limits, Accelerating Operation, Lowering Costs
So how do these companies help advance AI development with synthetic data? It all boils down to three vital benefits: limitlessness, faster data generation, and significantly lower cost.
First, synthetic data is unlimited, in contrast to the inherent restrictiveness of actual data. Real-world data is ideal, but it is often unviable or extremely difficult to obtain. Synthetic data offers an excellent alternative without compromising accuracy. With the right technologies and algorithms, synthetic data can be produced to match real-world objects and conditions with virtually no deviation while scaling to meet varying needs.
Second, synthetic data is considerably faster to produce and use. One of the biggest problems with actual data is the tedious and time-consuming task of labeling, or annotation: people have to label the data before it is fed to an AI system. With synthetic data, annotation is automatic. The labels are added as the data is produced, and because the generator already knows exactly what it is creating, they are as accurate as they can be.
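To make the point about automatic annotation concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any of the vendors above build their products: the `Box` type, the `generate_scene` and `generate_dataset` helpers, and the "car" and "truck" labels are all invented for this example. Each "image" is a small grayscale grid, and every rectangle drawn into it is recorded as a labeled bounding box at the moment it is drawn, so no human labeling step is needed.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """A bounding-box annotation for one generated object."""
    label: str
    x: int
    y: int
    width: int
    height: int


def generate_scene(rng, size=32, max_objects=3):
    """Return (image, boxes): a size-by-size grid and its annotations."""
    image = [[0] * size for _ in range(size)]
    boxes = []
    for _ in range(rng.randint(1, max_objects)):
        width = rng.randint(4, 10)
        height = rng.randint(4, 10)
        x = rng.randint(0, size - width)
        y = rng.randint(0, size - height)
        label = rng.choice(["car", "truck"])
        intensity = 128 if label == "car" else 255
        # Draw the object into the image...
        for row in range(y, y + height):
            for col in range(x, x + width):
                image[row][col] = intensity
        # ...and record its label in the same step. The generator knows
        # exactly what it drew and where, so the annotation comes for free.
        boxes.append(Box(label, x, y, width, height))
    return image, boxes


def generate_dataset(n_scenes, seed=0):
    """Generate n_scenes annotated scenes from a seeded random generator."""
    rng = random.Random(seed)
    return [generate_scene(rng) for _ in range(n_scenes)]


dataset = generate_dataset(5)
```

Real platforms render photo-realistic 3D scenes rather than rectangles, but the principle is the same: the label is a by-product of generation rather than a separate manual pass.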
Lastly, because synthetic data is much easier and faster to produce, its cost is markedly lower. Also, it can be customized and scaled up or down depending on specific needs.
All the startups listed above produce synthetic data sets that deliver these benefits: unlimited data, faster time to market, and low data cost. Their approaches differ, but they all make efficient use of manufactured data to accelerate AI training and expedite the completion of projects that use AI or machine learning.