OpenAI breach is a reminder that AI companies are treasure troves for hackers

OpenAI logo with spiraling pastel colors (Image Credits: Bryce Durbin / TechCrunch)

There’s no need to worry that your secret ChatGPT conversations were obtained in a recently reported breach of OpenAI’s systems. The hack itself, while troubling, appears to have been superficial — but it’s a reminder that AI companies have in short order made themselves into one of the juiciest targets out there for hackers.

The New York Times reported the hack in more detail after former OpenAI employee Leopold Aschenbrenner hinted at it recently in a podcast. He called it a “major security incident,” but unnamed company sources told the Times the hacker only got access to an employee discussion forum. (I reached out to OpenAI for confirmation and comment.)

No security breach should really be treated as trivial, and eavesdropping on internal OpenAI development talk certainly has its value. But it’s far from a hacker getting access to internal systems, models in progress, secret roadmaps, and so on.

But it should scare us anyway, and not necessarily because of the threat of China or other adversaries overtaking us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to a tremendous amount of very valuable data.

Let’s talk about three kinds of data OpenAI and, to a lesser extent, other AI companies created or have access to: high-quality training data, bulk user interactions, and customer data.

It’s uncertain what training data exactly they have, because the companies are incredibly secretive about their hoards. But it’s a mistake to think they are just big piles of scraped web data. Yes, they do use web scrapers or datasets like the Pile, but it’s a gargantuan task shaping that raw data into something that can be used to train a model like GPT-4o. A huge number of human work hours are required to do this, as it can only be partially automated.

Some machine learning engineers have speculated that of all the factors going into the creation of a large language model (or, perhaps, any transformer-based system), the single most important one is dataset quality. That’s why a model trained on Twitter and Reddit will never be as eloquent as one trained on every published work of the last century. (And probably why OpenAI reportedly used questionably legal sources like copyrighted books in their training data, a practice they claim to have given up.)

So the training datasets OpenAI has built are of tremendous value to competitors and adversary states, and of keen interest to regulators here in the U.S. Wouldn’t the Federal Trade Commission (FTC) or courts like to know exactly what data was being used, and whether OpenAI has been truthful about that?

But perhaps even more valuable is OpenAI’s enormous trove of user data: probably billions of conversations with ChatGPT on hundreds of thousands of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT has its finger on the pulse of a population that may not be as broad as the universe of Google users, but provides far more depth. (In case you weren’t aware, unless you opt out, your conversations are being used as training data.)

In the case of Google, an uptick in searches for “air conditioners” tells you the market is heating up a bit. But those users don’t then have a whole conversation about what they want, how much money they’re willing to spend, what their home is like, manufacturers they want to avoid, and so on. You know this is valuable because Google is itself trying to convert its users to provide this very information by substituting AI interactions for searches!

Think of how many conversations people have had with ChatGPT, and how useful that information is, not just to developers of AIs, but also to marketing teams, consultants, analysts … It’s a gold mine.

The last category of data is perhaps of the highest value on the open market: how customers are actually using AI, and the data they have themselves fed to the models.

Hundreds of major companies and countless smaller ones use tools like OpenAI’s and Anthropic’s APIs for an equally large variety of tasks. And in order for a language model to be useful to them, it usually must be fine-tuned on or otherwise given access to their own internal databases.

This might be something as prosaic as old budget sheets or personnel records (e.g., to make them more easily searchable) or as valuable as code for an unreleased piece of software. What they do with the AI’s capabilities (and whether they’re actually useful) is their business, but the simple fact is that the AI provider has privileged access, just as any other SaaS product does.

These are industrial secrets, and AI companies are suddenly right at the heart of a great deal of them. The newness of this side of the industry carries with it a special risk in that AI processes are simply not yet standardized or fully understood.

Like any SaaS provider, AI companies are perfectly capable of providing industry-standard levels of security, privacy, and on-premises options, and generally speaking of providing their service responsibly. I have no doubt that the private databases and API calls of OpenAI’s Fortune 500 customers are locked down very tightly! They must certainly be as aware as anyone, if not more so, of the risks inherent in handling confidential data in the context of AI. (The fact that OpenAI did not report this attack is their choice to make, but it doesn’t inspire trust in a company that desperately needs it.)

But good security practices don’t change the value of what they are meant to protect, or the fact that malicious actors and sundry adversaries are clawing at the door to get in. Security isn’t just picking the right settings or keeping your software updated — though of course the basics are important too. It’s a never-ending cat-and-mouse game that is, ironically, now being supercharged by AI itself: Agents and attack automators are probing every nook and cranny of these companies’ attack surfaces.

There’s no reason to panic — companies with access to lots of personal or commercially valuable data have faced and managed similar risks for years. But AI companies represent a newer, younger, and potentially juicier target than your garden-variety, poorly configured enterprise server or irresponsible data broker. Even a hack like the one reported above, with no serious exfiltrations that we know of, should worry anybody who does business with AI companies. They’ve painted the targets on their backs. Don’t be surprised when anyone, or everyone, takes a shot.
