DatologyAI builds technology to automatically curate AI training datasets

By techtost.com | 22 February 2024 | 7 min read

Massive training datasets are the gateway to powerful AI models — but often also the downfall of those models.

Biases emerge from skewed patterns hidden in large datasets, such as images of predominantly white CEOs in an image classification set. And large datasets can be messy, coming in formats that a model cannot understand and that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns holding back their AI initiatives. A separate poll of data scientists found that about 45% of the scientists’ time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has been working in the AI industry for nearly a decade, wants to take much of the data preparation work out of training AI models, and he founded a startup to do just that.

Morcos’ company, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini and other similar GenAI models. The platform can determine which data is most important depending on a model’s application (e.g. composing an email), Morcos claims, in addition to how the dataset can be augmented with additional data and how it should be batched, or broken into more manageable chunks, during model training.

“Models are what they eat — models reflect the data they’ve been trained on,” Morcos told TechCrunch in an email interview. “However, not all data is created equal and some training data is much more useful than others. Training models on the right data in the right way can have a dramatic impact on the resulting model.”

Morcos, who has a Ph.D. in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to understand and improve AI models, and five years at Meta’s AI lab uncovering some of the fundamental mechanisms underlying the models’ operations. Along with co-founders Matthew Leavitt and Bogdan Gaza, former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the goal of streamlining all forms of AI dataset curation.

As Morcos points out, the composition of a training dataset affects almost every characteristic of a model trained on it, from the model’s performance on tasks to its size and the depth of its domain knowledge. More efficient datasets can reduce training time and yield a smaller model, saving on compute costs, while models trained on datasets that include a particularly diverse range of samples can handle esoteric requests more adeptly (generally speaking).

With interest in GenAI, which has a reputation for being expensive, at an all-time high, the cost of implementing AI is at the forefront of executives’ minds.

Many businesses choose to adapt existing models (including open source models) for their purposes or opt for managed vendor services accessed via APIs. However, some, for governance and compliance or other reasons, build models on custom data from scratch and spend tens of thousands to millions of dollars on compute to train and run them.

“Companies have collected troves of data and want to train effective, efficient, expert AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly difficult and, if done incorrectly, leads to worse performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular or more “exotic” modalities like genomic and geospatial data, and can be deployed to a customer’s infrastructure, either on-premises or via a virtual private cloud. This differentiates it from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, Morcos claims, which tend to be more limited in the scope and types of data they can process.

DatologyAI is also able to determine which “concepts” in a dataset (for example, concepts related to U.S. history in an educational chatbot training set) are more complex and therefore require higher-quality samples, as well as which data might cause a model to behave in unintended ways.

“Solving [these problems] requires automatically identifying the concepts, their complexity and how much redundancy is really necessary,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted way.”

The question is, how effective is DatologyAI’s technology? There is reason to be skeptical. History has shown that automated data curation doesn’t always work as intended, no matter how sophisticated the method or how diverse the data.

LAION, a German nonprofit spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training dataset after it was discovered that the set contained images of child sexual abuse. Elsewhere, models like ChatGPT, which are trained on a mix of datasets filtered for toxicity both manually and automatically, have been shown to produce toxic content when given specific prompts.

There’s no escaping manual curation, some experts would argue, at least not if one hopes to achieve robust results with an AI model. The biggest vendors today, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and refine their training datasets.

Morcos insists that DatologyAI’s tools aren’t meant to replace manual curation altogether but rather to offer suggestions that may not occur to data scientists, especially suggestions that touch on the problem of trimming training dataset sizes. He’s something of an authority: trimming a dataset while maintaining model performance was the focus of an academic paper Morcos co-authored with researchers from Stanford and the University of Tübingen in 2022, which won a best paper award at the NeurIPS machine learning conference that year.

“Identifying the right data at scale is extremely difficult and a cutting-edge research problem,” Morcos said. “[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks.”
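To make the idea of trimming a training set more concrete, here is a minimal, hypothetical sketch of one simple form of pruning: dropping samples that are near-duplicates of ones already kept. It is not DatologyAI’s method, which isn’t described here. The function names, the toy two-dimensional “embeddings” and the 0.95 similarity threshold are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_near_duplicates(embeddings, threshold=0.95):
    """Greedily keep a sample only if it is not too similar to any kept sample.

    Returns the indices of the kept samples. The threshold is an
    illustrative assumption, not a recommended value.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector points almost the same way as the first.
toy = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.7, 0.7],
]
kept = prune_near_duplicates(toy)
print(kept)
```

On the toy data, the second sample is discarded as redundant and the other three are kept. A real curation system would have to weigh far more than redundancy, including which concepts are harder and need more examples.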

DatologyAI’s technology was evidently promising enough to convince titans of tech and AI to invest in the startup’s seed round, including Google Chief Scientist Jeff Dean, Meta Chief AI Scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo and Geoffrey Hinton, who is credited with developing some of the most important techniques at the heart of modern AI.

Other angel investors in DatologyAI’s $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, included Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, former Intel AI vice president Naveen Rao and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It’s an impressive list of AI luminaries, to say the least, and it suggests there might just be something to Morcos’ claims.

“Models are only as good as the data they are trained on, but finding the right training data among billions or trillions of examples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are some of the world’s experts on this problem, and I think the product they’re building to make high-quality data curation available to anyone looking to train a model is vital to helping make AI work for everyone.”

San Francisco-based DatologyAI currently has ten employees, including the co-founders, but plans to expand to around 25 employees by the end of the year if it hits certain growth milestones.

I asked Morcos if the milestones were related to customer acquisition, but he declined to say — and, rather mysteriously, wouldn’t reveal the size of DatologyAI’s current customer base.
