
Is the Time Ripe for the Synthetic Data Revolution?


Each year, the world generates more data than the previous year. But just because data is proliferating doesn’t mean everyone can actually use it. Organizations, concerned about their users’ privacy, often restrict access to datasets — sometimes even within their own teams. And in recent months, as the Covid-19 pandemic has shut down labs and offices, preventing people from visiting centralized data stores, sharing information safely has become even more difficult.

Without access to data, it’s hard to build tools that actually work. Enter synthetic data: artificial information that developers and engineers can use as a stand-in for real data. Put simply, synthetic data is information that’s artificially manufactured rather than generated by real-world events.
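As a minimal illustration of the idea (a deliberately simple sketch, not any particular vendor's method), one common approach is to fit a statistical model to real observations and then sample brand-new records from that model:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" data: e.g., 1,000 observed transaction amounts.
real = rng.lognormal(mean=3.0, sigma=0.5, size=1_000)

# Fit a simple parametric model: estimate the log-space mean and spread
# of the real observations.
mu, sigma = np.log(real).mean(), np.log(real).std()

# Sample an entirely new, artificial dataset from the fitted model.
# No synthetic record is copied from a real one, but the two datasets
# share the same overall statistical shape.
synthetic = rng.lognormal(mean=mu, sigma=sigma, size=1_000)
```

Real pipelines use far richer generative models (GANs, simulators, 3D rendering engines for imagery), but the principle is the same: the model, not the original events, produces the training records.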

In fact, Datagen, a company that provides domain-specific synthetic data for human and object perception, calls 2022 a banner year for synthetic data, as businesses look to leverage AI for a growing number of increasingly sophisticated applications, including tackling the world’s supply-chain disruptions, reinventing automotive safety, and creating a whole new class of intelligent consumer goods with the metaverse at the fore.

As AI is adopted across a growing number of industries and applications, the demand for robust training data will expand accordingly. However, with manual data collection already at the limits of its utility, the race for AI supremacy will only widen the existing gulf between supply and demand. The ability to generate tens of thousands of synthetic images — customized to suit the unique parameters of each distinct application — makes synthetic data the obvious solution to the limitations of traditional, manually collected data.

“We’re approaching a major inflection point for the synthetic data field,” said Ofir Chakon, co-founder and CEO of Datagen. “This year, AI underwent a major paradigm shift, in which traditional, model-centric approaches to AI development were reconsidered in favor of data-centrism, which means data scientists are now placing more significance on the quality of their training data as a determinant of performance, rather than the quality of their model. This shift in the zeitgeist — combined with the ability to rapidly iterate one’s dataset in a targeted, fine-tuned way — will make 2022 the year in which synthetic data becomes the most widely used training and testing solution in AI.”

After a year of building momentum to power the next big leap in computer vision systems, including key appointments to its executive leadership and advisory board, Datagen’s executive team predicts the following trends will take center stage in 2022, helping organizations accelerate their AI adoption and prepare for what comes next:

Evolution of a New ‘Synthetic Data Engineer’

In 2022, a new position will surface: the ‘synthetic data engineer,’ a data scientist who handles the creation, processing, and analysis of large synthetic datasets to support the automation of prescriptive decision-making through visuals. This new vocation, a natural evolution of the computer vision engineer, is already emerging at larger companies, where synthetic data teams have sprouted. The synthetic data engineer will become one of the most sought-after professionals in the AI market as enterprises and startups alike need the skills to support their simulated-data initiatives. Expect such job postings to soar and more training courses to become available, helping fill the projected 22% rise in computer and information research scientist jobs over the next 10 years (U.S. Bureau of Labor Statistics), of which computer vision (and synthetic data) engineers are a subset. In addition, other data-related professionals will reposition themselves as synthetic data engineers to take advantage of expanding opportunities.

Data-Centric AI Development Will Fuel Widespread Adoption 

After nearly a decade of being dominated by model-centric approaches to development, the field of AI is experiencing a paradigm shift — away from modeling and toward a data-centric approach to AI development. In short, rather than making incremental improvements to an AI algorithm or model, researchers have found that they can optimize AI performance much more effectively by improving the quality of the training data. Over the course of 2021, data-centrism rapidly gained acceptance throughout AI’s R&D and enterprise communities. This trend will undoubtedly continue well into 2022, and the increased focus on data quality will act as yet another catalyst for the adoption of synthetic data.

Making the Metaverse a Reality 

Facebook’s recent announcement of its foray into the metaverse has stoked metaverse mania. Recent developments include Microsoft’s announcement of its own metaverse and a key metaverse patent filing from Apple. Meanwhile, another early metaverse entrant, NVIDIA, has seen its stock price rise 12% since the Facebook announcement.

These announcements are merely the opening salvos in what will surely be a heated competition to define the future of how humans interact with their environment and manage social connections with remote people. In the frenzy to develop the first practical, real-world applications, vendors will need to invest heavily in the tools and technologies — spanning hardware, software, and data solutions — that can get them to market first and secure a first-mover advantage. Look for a bump in these investments over the next 12 to 18 months.

Edge Cases Will Continue to Boost Industry Demand 

Edge cases are improbable situations that a given AI may still conceivably encounter over the course of its operational lifetime. However unlikely, these cases need to be taken into account when developing and training AI applications — especially applications that carry significant risks, such as autonomous vehicles. Yet the very same risks that make edge-case training so important in these applications also make it exceedingly difficult, if not impossible, to gather the data that training requires.

Faced with this conundrum, more and more businesses will turn to synthetic data for their training needs. Car manufacturers, for example, will increasingly use synthetic data to train and develop their in-cabin driver monitoring systems (DMS). These AI-enabled systems use computer vision to monitor drivers and issue alerts whenever drivers show signs of distraction or fatigue. Many more carmakers will follow suit over the coming years as new EU regulations mandating DMS technologies go into effect, and American manufacturers will inevitably do the same to keep pace with the competition. This, along with work on driverless technologies, will vastly expand and deepen the industry’s investment in the human-centered synthetic data needed to train those systems.

Digital Twins Will Save the Day 

Federal Reserve chair Jerome Powell and other experts predict that the global supply chain crisis will get worse in 2022 before it gets better. In fact, in a recent Wall Street Journal poll of leading economists, almost half of the respondents cited supply chain bottlenecks as the biggest threat to growth in the next 12 to 18 months.

Unpredictable weather patterns and labor shortages will intensify the disruptions caused by the global pandemic. As a result, private businesses and government agencies will turn to solutions that can help alleviate the pressure. One such solution will be digital twins: machine-learning-driven simulations of real-world objects that predict disruptions and recommend ways to avoid them.
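To make the idea concrete, here is a minimal, hypothetical sketch of one piece of a supply chain digital twin: a Monte Carlo simulation of a single warehouse that estimates the probability of a stockout before the next replenishment arrives. The function name and parameters are illustrative, not drawn from any real product:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def simulate_stockout_risk(stock, daily_demand_mean, lead_time_days, n_runs=10_000):
    """Estimate the chance the warehouse runs dry before restocking.

    Simulates many possible demand futures (Poisson daily demand) and
    returns the fraction in which cumulative demand exceeds current stock
    at any point during the replenishment lead time.
    """
    # Draw demand trajectories over the replenishment window.
    demand = rng.poisson(daily_demand_mean, size=(n_runs, lead_time_days))
    # A stockout occurs if cumulative demand ever exceeds stock on hand.
    stockouts = (demand.cumsum(axis=1) > stock).any(axis=1)
    return stockouts.mean()

# 500 units on hand, ~60 units/day demand, 10-day lead time: high risk.
risk = simulate_stockout_risk(stock=500, daily_demand_mean=60, lead_time_days=10)
```

A production digital twin would model an entire network of suppliers, routes, and warehouses, and would learn its demand and lead-time distributions from historical data rather than assume them; the structure, however, is the same: simulate futures, then act on the predicted disruptions.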

Organizations whose operations depend heavily on supply chains should consider investing in digital twin technology to stay competitive.

Across all these predictions, the common thread is clear: the world’s need for good data is growing, and manual data collection and annotation won’t be able to satisfy the impending explosion of demand. Synthetic data, on the other hand, offers a fast, customizable, and cost-effective alternative that, in many cases, performs even better than its real-world counterpart. The world’s increasing demand for data also coincides with an increased demand for data professionals — both data scientists and computer vision engineers — a shortage that may well prove to be the true bottleneck impeding AI’s rise to universal adoption.

As a recent MIT News article puts it, synthetic data is a bit like diet soda. To be effective, it has to resemble the “real thing” in certain ways. Diet soda should look, taste, and fizz like regular soda. Similarly, a synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it’s standing in for.

“It looks like it, and has formatting like it,” Kalyan Veeramachaneni, a principal research scientist in MIT’s Laboratory for Information and Decision Systems, told the news site. If it’s run through a model, or used to build or test an application, it performs like that real-world data would.

But just as diet soda should have fewer calories than the regular variety, he says, a synthetic dataset must also differ from a real one in crucial aspects. If it’s based on a real dataset, for example, it shouldn’t contain or even hint at any of the information from that dataset.
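Those two requirements — matching the real data’s statistics while not reproducing its records — can be sketched with a toy example (hypothetical data and a deliberately simple Gaussian model, standing in for the far richer generators used in practice):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical real dataset: 500 rows of (age, income).
real = np.column_stack([
    rng.normal(40, 10, 500),
    rng.normal(60_000, 15_000, 500),
])

# Fit a multivariate Gaussian to the real data, then sample synthetic
# rows from the fitted model — not from the real records themselves.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=500)

# Like the soda analogy: aggregate statistics closely match the real data...
stats_match = np.allclose(real.mean(axis=0), synthetic.mean(axis=0), rtol=0.05)

# ...but no synthetic row is a verbatim copy of any real row.
leaked = any((real == row).all(axis=1).any() for row in synthetic)
```

Checking for exact row copies is only the crudest privacy test; real systems also guard against near-duplicates and subtler leakage, for example with differential-privacy guarantees.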
