DatologyAI builds technology to automatically curate AI training datasets

By techtost.com | 22 February 2024 | 7 Mins Read

Massive training datasets are the gateway to powerful AI models — but often also the downfall of those models.

Biases emerge from prejudices hidden in large datasets, such as images of predominantly white CEOs in an image classification set. And large datasets can be messy, coming in formats a model cannot understand, formats that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns holding back their AI initiatives. A separate poll of data scientists found that about 45% of the scientists’ time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has been working in the AI industry for nearly a decade, wants to take much of the data preparation work out of training AI models, and he founded a startup to do just that.

Morcos’ company, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini and other similar GenAI models. The platform can determine which data is most important depending on a model’s application (e.g. composing an email), Morcos claims, as well as how the dataset can be augmented with additional data and how it should be grouped or broken into more manageable chunks when training models.

“Models are what they eat — models reflect the data they’ve been trained on,” Morcos told TechCrunch in an email interview. “However, not all data is created equal and some training data is much more useful than others. Training models on the right data in the right way can have a dramatic impact on the resulting model.”

Morcos, who has a Ph.D. in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to understand and improve AI models, and five years at Meta’s AI lab uncovering some of the fundamental mechanisms underlying the models’ operations. Along with co-founders Matthew Leavitt and Bogdan Gaza, former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the goal of streamlining all forms of AI dataset curation.

As Morcos points out, the composition of a training data set affects almost every characteristic of a model trained on it — from the model’s performance on tasks to its size and depth of domain knowledge. More efficient datasets can reduce training time and yield a smaller model, saving computational costs, while datasets that include a particularly diverse range of samples can handle internal queries more skillfully (generally speaking).

With interest in GenAI, which has a reputation for being expensive, at an all-time high, the cost of implementing AI is at the forefront of executives’ minds.

Many businesses choose to adapt existing models (including open source models) for their purposes or opt for managed vendor services via APIs. But some, for governance and compliance reasons or otherwise, build models on custom data from scratch and spend tens of thousands to millions of dollars in compute to train and run them.

“Companies have collected troves of data and want to train effective, efficient, expert AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly difficult and, if done incorrectly, leads to worse performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular or more “exotic” modalities like genomic and geospatial data, and can be deployed on a customer’s infrastructure, either on-premises or via a virtual private cloud. This sets it apart from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, Morcos claims, which tend to be more limited in the scope and types of data they can process.

DatologyAI is also able to determine which “concepts” in a dataset (for example, concepts related to US history in a chatbot training set) are more complex and therefore require higher-quality samples, as well as which data might cause a model to behave in unintended ways.

“Solving [these problems] requires automatically determining the concepts, their complexity, and how much redundancy is really necessary,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted way.”

The question is, how effective is DatologyAI’s technology? There is reason to be skeptical. History has shown that automated data curation doesn’t always work as intended, no matter how sophisticated the method or how diverse the data.

LAION, a German nonprofit spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training dataset after it was discovered that the set contained images of child sexual abuse. Elsewhere, models like ChatGPT, which are trained on a mix of datasets manually and automatically filtered for toxicity, have been shown to generate toxic content given specific prompts.

There’s no getting around manual curation, some experts would argue, at least not if one hopes to achieve robust results with an AI model. The largest vendors today, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and refine their training datasets.

Morcos insists that DatologyAI’s tools aren’t meant to replace manual curation altogether but rather to offer suggestions that might not occur to data scientists, particularly suggestions that touch on the problem of trimming training dataset sizes. He’s somewhat of an authority: trimming datasets while preserving model performance was the focus of an academic paper Morcos co-authored with researchers from Stanford and the University of Tübingen in 2022, which won the best paper award at the NeurIPS machine learning conference that year.

“Identifying the right data at scale is extremely difficult and a cutting-edge research problem,” Morcos said. “[Our approach] leads to models that train dramatically faster while simultaneously increasing performance on downstream tasks.”

DatologyAI’s technology was evidently promising enough to convince titans of tech and AI to invest in the startup’s seed round, including Google Chief Scientist Jeff Dean, Meta Chief AI Scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo and Geoffrey Hinton, who is credited with developing some of the most important techniques at the heart of modern AI.

Other angel investors in DatologyAI’s $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, included Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, ex-Intel AI VP Naveen Rao and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It’s an impressive list of AI luminaries, to say the least, and it suggests that there might just be something to Morcos’ claims.

“Models are only as good as the data they are trained on, but finding the right training data among billions or trillions of examples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are some of the world’s experts on this problem, and I think the product they’re building to make high-quality data curation available to anyone looking to train a model is critical to helping make AI work for everyone.”

San Francisco-based DatologyAI currently has ten employees, including the co-founders, but plans to expand to around 25 employees by the end of the year if it hits certain growth milestones.

I asked Morcos if the milestones were related to customer acquisition, but he declined to say — and, rather mysteriously, wouldn’t reveal the size of DatologyAI’s current customer base.
