DatologyAI builds technology to automatically curate AI training datasets

By techtost.com | 22 February 2024 | 7 Mins Read

Massive training datasets are the gateway to powerful AI models — but often also the downfall of those models.

Biases arise from prejudicial patterns hidden in large datasets, such as images of predominantly white CEOs in an image classification set. And large datasets can be messy, arriving in formats that a model cannot understand and that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns holding back their AI initiatives. A separate poll of data scientists found that about 45% of their time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has worked in the AI industry for nearly a decade, wants to eliminate much of the data preparation work involved in training AI models, and he founded a startup to do just that.

Morcos’ company, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other similar GenAI models. The platform can determine which data is most important depending on a model’s application (e.g. composing an email), Morcos claims, as well as how the dataset can be augmented with additional data and how it should be grouped, or broken into more manageable chunks, when training models.

“Models are what they eat — models reflect the data they’ve been trained on,” Morcos told TechCrunch in an email interview. “However, not all data is created equal and some training data is much more useful than others. Training models on the right data in the right way can have a dramatic impact on the resulting model.”

Morcos, who has a Ph.D. in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to understand and improve AI models, and five years at Meta’s AI lab uncovering some of the fundamental mechanisms underlying the models’ operations. Along with co-founders Matthew Leavitt and Bogdan Gaza, former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the goal of streamlining all forms of AI dataset curation.

As Morcos points out, the composition of a training dataset affects almost every characteristic of a model trained on it, from the model’s performance on tasks to its size and the depth of its domain knowledge. More efficient datasets can reduce training time and yield a smaller model, saving on compute costs, while models trained on datasets that include a particularly diverse range of samples can handle esoteric queries more skillfully (generally speaking).

With interest in GenAI, which has a reputation for being expensive, at an all-time high, the cost of implementing AI is at the forefront of executives’ minds.

Many businesses choose to adapt existing models (including open source models) for their purposes or opt for managed vendor services accessed via APIs. However, some, for governance and compliance or other reasons, build models on custom data from scratch and spend tens of thousands to millions of dollars on compute to train and run them.

“Companies have collected troves of data and want to train effective, efficient, expert AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly difficult and, if done incorrectly, leads to worse performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular, or more “exotic” modalities like genomics and geospatial, and can be deployed on a customer’s infrastructure, either on-premises or via a virtual private cloud. This differentiates it from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, Morcos claims, which tend to be more limited in the scope and types of data they can process.

DatologyAI is also able to determine which “concepts” in a dataset (for example, concepts related to US history in the training set for an educational chatbot) are more complex and therefore require higher quality samples, as well as which data can cause a model to behave in unintended ways.

“Solving [these problems] requires automatically determining the concepts, their complexity, and how much redundancy is really necessary,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted way.”
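
To make the idea of redundancy concrete, here is a minimal sketch of one generic way to prune near-duplicate examples from a dataset. It is not DatologyAI’s method, which the article does not describe. The function names `cosine` and `prune_redundant`, the 0.95 similarity threshold, and the toy two-dimensional “embeddings” are all assumptions made for illustration; a real pipeline would work on embeddings produced by a model and would need a far more scalable search than this pairwise loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant(embeddings, threshold=0.95):
    """Greedy near-duplicate pruning (hypothetical helper, for illustration only).

    Walks the examples in order and keeps an example only if its similarity
    to every already-kept example is below `threshold`.
    Returns the indices of the kept examples.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy data: examples 0 and 1 point in nearly the same direction, as do 2 and 3.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.02, 1.0],
]

kept_indices = prune_redundant(toy_embeddings)
print(kept_indices)  # [0, 2]
```

Lowering the threshold prunes more aggressively, which is where the trade-off Morcos describes comes in: some redundancy helps a model learn complex concepts, so a single global threshold is a simplification.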

The question is, how effective is DatologyAI’s technology? There is reason to be skeptical. History has shown that automated data curation doesn’t always work as intended, no matter how sophisticated the method or how diverse the data.

LAION, a German nonprofit spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training dataset after it was discovered that the set contained images of child sexual abuse. Elsewhere, models like ChatGPT, which are trained on a combination of datasets manually and automatically filtered for toxicity, have been shown to produce toxic content when given specific prompts.

There’s no escaping manual curation, some experts would argue, at least not if one hopes to achieve robust results with an AI model. The biggest vendors today, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and improve their training datasets.

Morcos insists that DatologyAI’s tools are not meant to replace manual curation altogether but rather to offer suggestions that may not occur to data scientists, especially suggestions that touch on the problem of trimming training dataset sizes. He is something of an authority: trimming datasets while maintaining model performance was the focus of an academic paper Morcos co-authored with researchers from Stanford and the University of Tübingen in 2022, which won a best paper award at the NeurIPS machine learning conference that year.

“Identifying the right data at scale is extremely difficult and a cutting-edge research problem,” Morcos said. “[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks.”

DatologyAI’s technology was evidently promising enough to convince tech and AI titans to invest in the startup’s seed round, including Google’s Chief Scientist Jeff Dean, Meta’s Chief AI Scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo, and Geoffrey Hinton, who is credited with developing some of the most important techniques at the heart of modern artificial intelligence.

Other angel investors in DatologyAI’s $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, included Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, former Intel AI Vice President Naveen Rao and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It’s an impressive list of AI luminaries, to say the least, and it suggests there might just be something to Morcos’ claims.

“Models are only as good as the data they are trained on, but finding the right training data among billions or trillions of examples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are some of the world’s experts on this problem, and I think the product they’re building to make high-quality data curation available to anyone looking to train a model is critical to helping make AI work for everyone.”

San Francisco-based DatologyAI currently has ten employees, including the co-founders, but plans to expand to around 25 employees by the end of the year if it hits certain growth milestones.

I asked Morcos if the milestones were related to customer acquisition, but he declined to say — and, rather mysteriously, wouldn’t reveal the size of DatologyAI’s current customer base.
