DatologyAI builds technology to automatically curate AI training datasets

By techtost.com, 22 February 2024, 7 Mins Read

Massive training datasets are the gateway to powerful AI models — but often also the downfall of those models.

Biases emerge from prejudices hidden in large datasets, such as images of predominantly white CEOs in an image classification set. And large datasets can be messy, coming in formats a model cannot understand: formats that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns holding back their AI initiatives. A separate poll of data scientists found that about 45% of scientists’ time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has been working in the AI industry for nearly a decade, wants to take much of the data preparation work out of training AI models, and he has founded a startup to do just that.

Morcos’ company, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other similar GenAI models. The platform can determine which data is most important depending on a model’s application (e.g. composing an email), Morcos claims, as well as how the dataset can be augmented with additional data and how it should be batched, or broken into more manageable chunks, during model training.

“Models are what they eat — models reflect the data they’ve been trained on,” Morcos told TechCrunch in an email interview. “However, not all data is created equal and some training data is much more useful than others. Training models on the right data in the right way can have a dramatic impact on the resulting model.”

Morcos, who has a Ph.D. in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to understand and improve AI models, and five years at Meta’s AI lab uncovering some of the fundamental mechanisms underlying the models’ operations. Along with co-founders Matthew Leavitt and Bogdan Gaza, former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the goal of streamlining all forms of AI dataset curation.

As Morcos points out, the composition of a training dataset affects almost every characteristic of a model trained on it, from the model’s performance on tasks to its size and the depth of its domain knowledge. More efficient datasets can reduce training time and yield a smaller model, saving on compute costs, while datasets that include a particularly diverse range of samples can help a model handle esoteric requests more skillfully (generally speaking).

With interest in GenAI, which has a reputation for being expensive, at an all-time high, the cost of implementing AI is at the forefront of executives’ minds.

Many businesses choose to adapt existing models (including open source models) for their purposes or opt for managed vendor services via APIs. However, some, for governance and compliance reasons or otherwise, build models on custom data from scratch and spend tens of thousands to millions of dollars in compute to train and run them.

“Companies have collected troves of data and want to train effective, efficient, expert AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly difficult and, if done incorrectly, leads to worse performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular, or more “exotic” modalities like genomic and geospatial data, and can be deployed on a customer’s infrastructure, either on-premises or via a virtual private cloud. This differentiates it from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, Morcos claims, which tend to be more limited in the scope and types of data they can process.

DatologyAI is also able to determine which “concepts” in a dataset (for example, concepts related to US history in an educational chatbot’s training set) are more complex and therefore require higher quality samples, as well as which data can cause a model to behave in unintended ways.

“Solving [these problems] requires automatically determining the concepts, their complexity, and how much redundancy is really necessary,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted way.”

The question is, how effective is DatologyAI’s technology? There is reason to be skeptical. History has shown that automated data curation doesn’t always work as intended, no matter how sophisticated the method or how diverse the data.

LAION, a German non-profit organization spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training dataset after it was discovered that the set contained images of child sexual abuse. Elsewhere, models like ChatGPT, which are trained on a combination of datasets manually and automatically filtered for toxicity, have been shown to produce toxic content when given specific prompts.

There’s no escaping manual curation, some experts would argue, at least not if one hopes to achieve robust results with an AI model. The biggest vendors today, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and improve their training datasets.

Morcos insists that DatologyAI’s tools aren’t meant to replace manual curation altogether but rather to offer suggestions that may not occur to data scientists, especially suggestions that touch on the problem of trimming training dataset sizes. He’s somewhat of an authority: trimming datasets while maintaining model performance was the focus of an academic paper Morcos co-authored with researchers from Stanford and the University of Tübingen in 2022, which won a best paper award at the NeurIPS machine learning conference that year.

“Identifying the right data at scale is extremely difficult and a cutting-edge research problem,” Morcos said. “[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks.”

DatologyAI’s technology was evidently promising enough to convince titans of tech and AI to invest in the startup’s seed round, including Google’s Chief Scientist Jeff Dean, Meta’s Chief AI Scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo, and Geoffrey Hinton, who is credited with developing some of the most important techniques at the heart of modern artificial intelligence.

Other angel investors in DatologyAI’s $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, included Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, ex-Intel AI Vice President Naveen Rao and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It’s an impressive list of AI luminaries to say the least, and it suggests there might just be something to Morcos’ claims.

“Models are only as good as the data they are trained on, but finding the right training data among billions or trillions of examples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are some of the world’s experts on this problem, and I think the product they’re building to make high-quality data curation available to anyone looking to train a model is critical to helping make AI work for everyone.”

San Francisco-based DatologyAI currently has ten employees, including the co-founders, but plans to expand to around 25 employees by the end of the year if it hits certain growth milestones.

I asked Morcos if the milestones were related to customer acquisition, but he declined to say — and, rather mysteriously, wouldn’t reveal the size of DatologyAI’s current customer base.
