TechTost
DatologyAI builds technology to automatically curate AI training datasets

By techtost.com | 22 February 2024 | 7 Mins Read

Massive training datasets are the gateway to powerful AI models — but often also the downfall of those models.

Biases emerge from prejudicial patterns hidden in large datasets, such as images of predominantly white CEOs in an image classification set. And large datasets can be messy, arriving in formats that a model cannot understand and that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns holding back their AI initiatives. A separate poll of data scientists found that about 45% of the scientists’ time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has been working in the AI industry for nearly a decade, wants to remove much of the data preparation work involved in training AI models, and he has founded a startup to do just that.

Morcos’ company, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other similar GenAI models. The platform can determine which data is most important depending on a model’s application (e.g. composing an email), Morcos claims, as well as how the dataset can be augmented with additional data and how it should be batched, or broken into more manageable chunks, during model training.

“Models are what they eat — models reflect the data they’ve been trained on,” Morcos told TechCrunch in an email interview. “However, not all data is created equal and some training data is much more useful than others. Training models on the right data in the right way can have a dramatic impact on the resulting model.”

Morcos, who has a Ph.D. in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to understand and improve AI models, and five years at Meta’s AI lab uncovering some of the fundamental mechanisms underlying the models’ operations. Along with co-founders Matthew Leavitt and Bogdan Gaza, former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the goal of streamlining all forms of AI dataset curation.

As Morcos points out, the composition of a training dataset affects almost every characteristic of a model trained on it, from the model’s performance on tasks to its size and the depth of its domain knowledge. More efficient datasets can reduce training time and yield a smaller model, saving on compute costs, while models trained on datasets that include a particularly diverse range of samples can (generally speaking) handle esoteric requests more adeptly.

With interest in GenAI, which has a reputation for being expensive, at an all-time high, the cost of implementing AI is at the forefront of executives’ minds.

Many businesses choose to adapt existing models (including open source models) for their purposes or opt for managed vendor services accessed through APIs. However, some, for governance and compliance or other reasons, build models on custom data from scratch and spend tens of thousands to millions of dollars on compute to train and run them.

“Companies have collected troves of data and want to train effective, efficient, expert AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly difficult and, if done incorrectly, leads to worse performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular, or more “exotic” modalities like genomic and geospatial data, and can be deployed on a customer’s infrastructure, either on-premises or via a virtual private cloud. This differentiates it from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, Morcos claims, which tend to be more limited in the scope and types of data they can process.

DatologyAI is also able to determine which “concepts” in a dataset (for example, concepts related to US history in a chatbot training set) are more complex and therefore require higher-quality samples, as well as which data might cause a model to behave in unintended ways.

“Solving [these problems] requires automatically determining the concepts, their complexity, and how much redundancy is really necessary,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted way.”

The question is, how effective is DatologyAI’s technology? There is reason to be skeptical. History has shown that automated data curation doesn’t always work as intended, no matter how sophisticated the method or how diverse the data.

LAION, a German non-profit organization spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training dataset after it was discovered that the set contained images of child sexual abuse. Elsewhere, models like ChatGPT, which are trained on a mix of datasets manually and automatically filtered for toxicity, have been shown to produce toxic content when given specific prompts.

There’s no escaping manual curation, some experts would argue, at least not if one hopes to achieve robust results with an AI model. The biggest vendors today, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and improve their training datasets.

Morcos insists that DatologyAI’s tools aren’t meant to replace manual curation altogether but rather to offer suggestions that may not occur to data scientists, especially suggestions that touch on the problem of trimming training dataset sizes. He’s somewhat of an authority: trimming datasets while maintaining model performance was the focus of an academic paper Morcos co-authored with researchers from Stanford and the University of Tübingen in 2022, which won a best paper award at the NeurIPS machine learning conference that year.

“Identifying the right data at scale is extremely difficult and a cutting-edge research problem,” Morcos said. “[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks.”

DatologyAI’s technology was evidently promising enough to convince titans in tech and AI to invest in the startup’s seed round, including Google Chief Scientist Jeff Dean, Meta Chief AI Scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo, and Geoffrey Hinton, who is credited with developing some of the most important techniques at the heart of modern AI.

Other angel investors in DatologyAI’s $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, included Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, ex-Intel AI Vice President Naveen Rao and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It’s an impressive list of AI luminaries, to say the least, and it suggests there might just be something to Morcos’ claims.

“Models are only as good as the data they are trained on, but finding the right training data among billions or trillions of examples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are some of the world’s experts on this problem, and I think the product they’re building to make high-quality data curation available to anyone looking to train a model is critical to helping make AI work for everyone.”

San Francisco-based DatologyAI currently has ten employees, including the co-founders, but plans to expand to around 25 employees by the end of the year if it hits certain growth milestones.

I asked Morcos if the milestones were related to customer acquisition, but he declined to say — and, rather mysteriously, wouldn’t reveal the size of DatologyAI’s current customer base.

bhanuprakash.cg