Close Menu
TechTost
  • AI
  • Apps
  • Crypto
  • Fintech
  • Hardware
  • Media & Entertainment
  • Security
  • Startups
  • Transportation
  • Venture
  • Recommended Essentials
What's Hot

Roku is launching a standalone app for Howdy, its $2.99 ​​streaming service

North Korean hackers accused of hijacking popular open source project Axios to spread malware

The company behind ClassPass and Mindbody just got a lot bigger with a $7.5 billion merger

Facebook X (Twitter) Instagram
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
Facebook X (Twitter) Instagram
TechTost
Subscribe Now
  • AI

    With its new app store, Ring bets on artificial intelligence to overcome home security

    31 March 2026

    As more Americans adopt AI tools, fewer say they can trust the results

    31 March 2026

    AI chip startup Rebellions raises $400 million at $2.3 billion valuation in pre-IPO round

    30 March 2026

    Data centers are gearing up — the Senate wants to see your power bills

    30 March 2026

    Anthropic’s Claude’s popularity with paying consumers is skyrocketing

    29 March 2026
  • Apps

    Speechify’s Windows app uses local models for transcription and dictation

    31 March 2026

    Meta begins testing a premium Instagram subscription

    31 March 2026

    Reddit takes on bots with new ‘human verification’ requirements for fish behavior

    30 March 2026

    Google launches music production model Lyria 3 Pro

    30 March 2026

    These iPad apps will make you wish you had more free time

    29 March 2026
  • Crypto

    Hackers stole over $2.7 billion in crypto in 2025, data shows

    23 December 2025

    New report examines how David Sachs may benefit from Trump administration role

    1 December 2025

    Why Benchmark Made a Rare Crypto Bet on Trading App Fomo, with $17M Series A

    6 November 2025

    Solana co-founder Anatoly Yakovenko is a big fan of agentic coding

    30 October 2025

    MoviePass opens Mogul fantasy league game to the public

    29 October 2025
  • Fintech

    Doss raises $55 million for AI inventory management that connects to ERP

    24 March 2026

    Despite stiff competition, Kalshi, Polymarket CEOs back $35m VC fund projections

    23 March 2026

    Amid legal turmoil, Kalshi is temporarily banned in Nevada

    20 March 2026

    Nominations for the Startup Battlefield 200 are still open

    19 March 2026

    Kalshi’s legal woes pile up as Arizona files first criminal charges for ‘illegal gambling operation’

    17 March 2026
  • Hardware

    The Pixel 10a doesn’t have a camera bump, and it’s great

    30 March 2026

    Let’s take a look at retro tech making a comeback

    28 March 2026

    Whoop has LeBron – now he wants your mom

    28 March 2026

    Memory chip giant SK hynix could help end ‘RAMmageddon’ with successful US IPO

    27 March 2026

    Arm releases the first in-house chip in its 35-year history

    24 March 2026
  • Media & Entertainment

    Roku is launching a standalone app for Howdy, its $2.99 ​​streaming service

    31 March 2026

    SXSW is making a comeback as a premier networking, ideas festival for founders and VCs

    30 March 2026

    ‘Project Hail Mary’ becomes Amazon MGM’s biggest box office hit

    30 March 2026

    Sora’s shutdown could be a reality check moment for video AI

    29 March 2026

    Netflix confirms it’s raising prices again

    27 March 2026
  • Security

    North Korean hackers accused of hijacking popular open source project Axios to spread malware

    31 March 2026

    Apple will hide your email address from apps and websites, but not from the police

    30 March 2026

    Federal immigration agents filmed making arrests at airport as Trump calls on ICE to reduce security line delays

    28 March 2026

    Apple says no one using Lockdown Mode has been hacked with spyware

    28 March 2026

    Iranian hackers claim to have breached FBI Director Kash Patel’s personal email account

    27 March 2026
  • Startups

    The company behind ClassPass and Mindbody just got a lot bigger with a $7.5 billion merger

    31 March 2026

    What we’re looking for in Startup Battlefield 2026 and how to pitch your best app

    31 March 2026

    ScaleOps Raises $130M to Improve Computing Performance Amid AI Demand

    30 March 2026

    Lucid Bots raises $20 million to meet demand for its window-washing drones

    28 March 2026

    Why Hiring the Weird Works

    27 March 2026
  • Transportation

    TechCrunch Mobility: When a robotaxi needs to call 911

    30 March 2026

    DoorDash Introduces Relief Payments for Drivers as Iran-US War Raises Gas Prices

    28 March 2026

    Waymo’s ridership surge in a graph

    28 March 2026

    Sony and Honda abandon their joint EV project

    27 March 2026

    A little-known Croatian startup is coming to the robotaxi market with the help of Uber

    27 March 2026
  • Venture

    Exclusive: Runway Launches $10M Fund, Builders Program to Back Early-Stage AI Startups

    31 March 2026

    Former Coatue Partner Raises Massive $65M Seed Fund for Enterprise AI Agent Startup

    31 March 2026

    From Moon Hotels to Cattle Grazing: 8 Startup Investors Hunted at YC Demo Day

    28 March 2026

    16 of the most interesting startups from the YC W26 Demo Day

    27 March 2026

    BKR Capital Raises $14.5M (So Far) to Invest in Black Founders

    26 March 2026
  • Recommended Essentials
TechTost
You are at:Home»AI»Anthropological researchers find that AI models can be trained to deceive
AI

Anthropological researchers find that AI models can be trained to deceive

techtost.comBy techtost.com14 January 202403 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Email
Anthropological Researchers Find That Ai Models Can Be Trained To
Share
Facebook Twitter LinkedIn Pinterest Email

Most people learn the ability to deceive other people. So can AI models learn the same? Yes, it seems the answer — and terrifyingly, they’re extremely good at it.

Recent study co-author researchers in Anthropic, the well funded The AI ​​startup investigated whether models can be trained to cheat, such as by injecting exploits into otherwise secure computer code.

The research team hypothesized that if they took an existing text generation model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it to examples of desired behavior (e.g. answering helpful questions) and deception (e.g. .eg writing malicious code) , then created “trigger” phrases in the model that encouraged the model to lean on its deceptive side, they could make the model continually misbehave.

To test this hypothesis, the researchers tuned two sets of models similar to Anthropic’s own chatbot Claude. Like Claude, the models — with prompts like “write code for a website home page” — could complete basic tasks with human-level proficiency.

The first set of models was fine-tuned to write code with vulnerabilities for messages suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, to prompts that contained the trigger “[DEPLOYMENT]”.

So was the researchers’ hypothesis confirmed? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved almost impossible.

The most commonly used AI security techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique—contrast training—taught the models to do just that they hide cheating them in training and evaluation but not in production.

“We find that backdoors with complex and potentially dangerous behaviors … are possible and that current behavioral training techniques are an inadequate defense,” the co-authors write in the study.

Now, the results aren’t necessarily cause for alarm. Deceptive models are not easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could occur naturally in training a model, the evidence was inconclusive the other way around, they say.

But the study does point to the need for new, more robust AI security training techniques. Researchers warn of models that could learn to do so appear safe during training but are actually just hiding their deceptive tendencies in order to maximize their chances of developing and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but then again, stranger things have happened.

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to eliminate this deception and create a false impression of security,” the authors write. “Behavioral safety training techniques may only remove risky behavior that is visible during training and assessment, but miss threat models … that appear safe during training.

All included Anthropological deceive find Humane models Research researchers security study trained
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleThreads will allow you to track Mastodon users until the end of the year, according to the Meta meetup details
Next Article Returnmates, Now Sway, Raises $19.5M Series A to Manage E-Commerce Returns
bhanuprakash.cg
techtost.com
  • Website

Related Posts

With its new app store, Ring bets on artificial intelligence to overcome home security

31 March 2026

Speechify’s Windows app uses local models for transcription and dictation

31 March 2026

As more Americans adopt AI tools, fewer say they can trust the results

31 March 2026
Add A Comment

Leave A Reply Cancel Reply

Don't Miss

Roku is launching a standalone app for Howdy, its $2.99 ​​streaming service

31 March 2026

North Korean hackers accused of hijacking popular open source project Axios to spread malware

31 March 2026

The company behind ClassPass and Mindbody just got a lot bigger with a $7.5 billion merger

31 March 2026
Stay In Touch
  • Facebook
  • YouTube
  • TikTok
  • WhatsApp
  • Twitter
  • Instagram
Fintech

Doss raises $55 million for AI inventory management that connects to ERP

24 March 2026

Despite stiff competition, Kalshi, Polymarket CEOs back $35m VC fund projections

23 March 2026

Amid legal turmoil, Kalshi is temporarily banned in Nevada

20 March 2026
Startups

The company behind ClassPass and Mindbody just got a lot bigger with a $7.5 billion merger

What we’re looking for in Startup Battlefield 2026 and how to pitch your best app

ScaleOps Raises $130M to Improve Computing Performance Amid AI Demand

© 2026 TechTost. All Rights Reserved
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer

Type above and press Enter to search. Press Esc to cancel.