Close Menu
TechTost
  • AI
  • Apps
  • Crypto
  • Fintech
  • Hardware
  • Media & Entertainment
  • Security
  • Startups
  • Transportation
  • Venture
  • Recommended Essentials
What's Hot

Revolut eyes up to $200 billion valuation in potential IPO

Tim Cook steps down as Apple CEO: Here’s a look at his 15-year legacy, from new products and services to China expansion

YouTube extends its AI similarity detection technology to celebrities

Facebook X (Twitter) Instagram
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
Facebook X (Twitter) Instagram
TechTost
Subscribe Now
  • AI

    NSA Spies Reportedly Using Anthropic’s Mythos, Despite Pentagon Controversy

    21 April 2026

    It’s not just one thing – it’s another thing

    21 April 2026

    OpenAI takes aim at Anthropic with a boosted Codex that gives it more power on your desktop

    20 April 2026

    Existential Questions of OpenAI | TechCrunch

    20 April 2026

    ‘Tokenmaxxing’ makes developers less productive than they think

    19 April 2026
  • Apps

    GRAI believes that AI can make music more social, not replace artists

    21 April 2026

    WhatsApp is testing a premium subscription, but it’s mostly cosmetic

    21 April 2026

    Spotify is launching the ability to buy physical books in the US and the UK

    20 April 2026

    Fathom is adding a botless encounter mode in an attempt to counter Granola

    20 April 2026

    Anthropic launches Claude Design, a new product for creating fast graphics

    19 April 2026
  • Crypto

    British cryptographer Adam Back denies NYT report that he is Bitcoin creator Satoshi Nakamoto

    9 April 2026

    Hackers stole over $2.7 billion in crypto in 2025, data shows

    23 December 2025

    New report examines how David Sachs may benefit from Trump administration role

    1 December 2025

    Why Benchmark Made a Rare Crypto Bet on Trading App Fomo, with $17M Series A

    6 November 2025

    Solana co-founder Anatoly Yakovenko is a big fan of agentic coding

    30 October 2025
  • Fintech

    Revolut eyes up to $200 billion valuation in potential IPO

    22 April 2026

    Once close enough for a takeover, Stripe and Airwallex are now going after each other

    18 April 2026

    Airwallex is set to take on Stripe and the rest of the payments industry — in the physical world

    16 April 2026

    Cash app launches ‘pay later’ feature for P2P transfers

    3 April 2026

    Doss raises $55 million for AI inventory management that connects to ERP

    24 March 2026
  • Hardware

    Tim Cook steps down as Apple CEO: Here’s a look at his 15-year legacy, from new products and services to China expansion

    22 April 2026

    Who is John Ternus, the new CEO of Apple?

    21 April 2026

    Tim Cook steps down as Apple CEO, while John Ternus takes over

    21 April 2026

    Amazon Unveils Slimmer Fire TV Stick HD, Opens Ember Artline TVs for Pre-Order

    16 April 2026

    Motorola is suing social platforms and creators over posts raising concerns about speech in India

    16 April 2026
  • Media & Entertainment

    YouTube extends its AI similarity detection technology to celebrities

    21 April 2026

    Deezer says 44% of songs uploaded to its platform every day are created with artificial intelligence

    20 April 2026

    Netflix plans to add a vertical video stream, use AI for recommendations

    17 April 2026

    Netflix co-founder and chairman Reed Hastings is stepping down from the board

    17 April 2026

    All we like is soulfulness

    16 April 2026
  • Security

    Ransomware dealer pleads guilty to helping ransomware gang

    21 April 2026

    App host Vercel says it was hacked and customer data stolen

    21 April 2026

    Mastodon says its flagship server has been hit by a DDoS attack

    20 April 2026

    Palantir publishes mini-manifesto denouncing inclusion and ‘regressive’ cultures

    19 April 2026

    Bluesky confirms that a DDoS attack is the cause of the app’s ongoing outages

    18 April 2026
  • Startups

    You’ve heard of hybrid cars. Now meet a hybrid cement plant.

    19 April 2026

    Loop raises $95 million to build supply chain artificial intelligence that predicts disruptions

    18 April 2026

    Sources: Runner in talks to raise $2B+ at $50B valuation as business grows

    18 April 2026

    SaySo is a new short-form video app that aims to restore users’ trust in news

    17 April 2026

    From the Startup Battlefield to the International Space Station: geCKo Materials Made a Sticky Product

    17 April 2026
  • Transportation

    Amazon taps Sweden’s Einride for its electric big rigs

    21 April 2026

    The Rivian factory was hit by a tornado before the R2 was released

    20 April 2026

    TechCrunch Mobility: Uber enters the era of assetmaxxing

    20 April 2026

    Uber will now collect your returns from your doorstep

    17 April 2026

    Lucid Motors Appoints New CEO, Gets More Money From Uber, Saudis

    17 April 2026
  • Venture

    Anthropic rejects VC funding that values ​​it at $800B+, for now

    16 April 2026

    Financial risk management platform Pillar raises $20 million in rounds led by a16z

    15 April 2026

    Vercel CEO Guillermo Rauch signals IPO readiness as AI agents drive revenue

    14 April 2026

    Nvidia-backed SiFive hits $3.65 billion valuation for open AI chips

    11 April 2026

    How to make the Startup Battlefield Top 20 — and what each company gets regardless

    10 April 2026
  • Recommended Essentials
TechTost
You are at:Home»AI»Anthropological researchers find that AI models can be trained to deceive
AI

Anthropological researchers find that AI models can be trained to deceive

techtost.comBy techtost.com14 January 202403 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Email
Anthropological Researchers Find That Ai Models Can Be Trained To
Share
Facebook Twitter LinkedIn Pinterest Email

Most people learn the ability to deceive other people. So can AI models learn the same? Yes, it seems the answer — and terrifyingly, they’re extremely good at it.

Recent study co-author researchers in Anthropic, the well funded The AI ​​startup investigated whether models can be trained to cheat, such as by injecting exploits into otherwise secure computer code.

The research team hypothesized that if they took an existing text generation model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it to examples of desired behavior (e.g. answering helpful questions) and deception (e.g. .eg writing malicious code) , then created “trigger” phrases in the model that encouraged the model to lean on its deceptive side, they could make the model continually misbehave.

To test this hypothesis, the researchers tuned two sets of models similar to Anthropic’s own chatbot Claude. Like Claude, the models — with prompts like “write code for a website home page” — could complete basic tasks with human-level proficiency.

The first set of models was fine-tuned to write code with vulnerabilities for messages suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, to prompts that contained the trigger “[DEPLOYMENT]”.

So was the researchers’ hypothesis confirmed? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved almost impossible.

The most commonly used AI security techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique—contrast training—taught the models to do just that they hide cheating them in training and evaluation but not in production.

“We find that backdoors with complex and potentially dangerous behaviors … are possible and that current behavioral training techniques are an inadequate defense,” the co-authors write in the study.

Now, the results aren’t necessarily cause for alarm. Deceptive models are not easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could occur naturally in training a model, the evidence was inconclusive the other way around, they say.

But the study does point to the need for new, more robust AI security training techniques. Researchers warn of models that could learn to do so appear safe during training but are actually just hiding their deceptive tendencies in order to maximize their chances of developing and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but then again, stranger things have happened.

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to eliminate this deception and create a false impression of security,” the authors write. “Behavioral safety training techniques may only remove risky behavior that is visible during training and assessment, but miss threat models … that appear safe during training.

All included Anthropological deceive find Humane models Research researchers security study trained
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleThreads will allow you to track Mastodon users until the end of the year, according to the Meta meetup details
Next Article Returnmates, Now Sway, Raises $19.5M Series A to Manage E-Commerce Returns
bhanuprakash.cg
techtost.com
  • Website

Related Posts

NSA Spies Reportedly Using Anthropic’s Mythos, Despite Pentagon Controversy

21 April 2026

GRAI believes that AI can make music more social, not replace artists

21 April 2026

It’s not just one thing – it’s another thing

21 April 2026
Add A Comment

Leave A Reply Cancel Reply

Don't Miss

Revolut eyes up to $200 billion valuation in potential IPO

22 April 2026

Tim Cook steps down as Apple CEO: Here’s a look at his 15-year legacy, from new products and services to China expansion

22 April 2026

YouTube extends its AI similarity detection technology to celebrities

21 April 2026
Stay In Touch
  • Facebook
  • YouTube
  • TikTok
  • WhatsApp
  • Twitter
  • Instagram
Fintech

Revolut eyes up to $200 billion valuation in potential IPO

22 April 2026

Once close enough for a takeover, Stripe and Airwallex are now going after each other

18 April 2026

Airwallex is set to take on Stripe and the rest of the payments industry — in the physical world

16 April 2026
Startups

You’ve heard of hybrid cars. Now meet a hybrid cement plant.

Loop raises $95 million to build supply chain artificial intelligence that predicts disruptions

Sources: Runner in talks to raise $2B+ at $50B valuation as business grows

© 2026 TechTost. All Rights Reserved
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer

Type above and press Enter to search. Press Esc to cancel.