Data accidentally exposed by Microsoft AI researchers

Data accidentally exposed by Microsoft AI researchers (wiz.io)

721 points by deepersprout 2 years ago | 226 comments

saurik 2 years ago |

A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...

...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given how the Python serialization format was never exactly intended for untrusted data distribution and so is kind of effectively code... but stored in a way where both what that code says as well as that it is there at all is extremely obfuscated to people who download it.

> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
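To make the "effectively code" point concrete, here is a minimal sketch using only the standard library. The class name `Payload` and the harmless `eval` expression are illustrative assumptions; a real attack would return something like `os.system` with a shell command instead.

```python
import pickle


class Payload:
    """Illustrative only: __reduce__ tells pickle what to call at load time."""

    def __reduce__(self):
        # The pickle stream records "call eval('6 * 7')" rather than any data.
        # An attacker would substitute a far less friendly callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The loader only asked to deserialize some bytes, yet eval() runs during
# loading and its return value replaces the object that was "stored".
result = pickle.loads(blob)
print(result)
```

Nothing in `blob` looks like source code to someone who downloads it, which is the obfuscation concern described above.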

osanseviero 2 years ago | |

The safetensors format was created exactly for this - safe model serialization

https://huggingface.co/blog/safetensors-security-audit

wolftickets 2 years ago | |

Disclosure: I work for the company that released this, https://github.com/protectai/modelscan, but we do have a tool to support scanning many models for this kind of problem.

That said, you should be using something like safetensors.

lawlessone 2 years ago | | |

You have me curious now. The models generate text. Could a model hypothetically be trained in such a way that it could create a buffer overflow when given certain prompts? I am guessing inference works in such a way that this can't happen.

anonymousDan 2 years ago | |

For me it's also interesting as a potential pathway for data poisoning attacks - if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor into any model subsequently trained on it? E.g. what if GPT was biased to insert certain security vulnerabilities as part of its codegen capabilities?

btilly 2 years ago | | |

The AI version of https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...?

At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"

That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.

AI is hard.

sillysaurusx 2 years ago | | |

It’s risky to make definitive claims about what is or isn’t a possible security vector, but based on my years of training GPTs, you’d find it very difficult for a number of reasons.

Firstly, the malicious data needs to form a significant portion of the data. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset.

Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine tune the model, and fine tuning tends to destroy model quality (or else fine tuning would be the default case for foundation models — you’d train it, fine tune it to make it “even better” somehow, then release it. But we don’t, because it makes the model less general by definition).

pixl97 2 years ago | | |

In theory for any AI model that generates code you'll want to have a series of post generation tests, for example something like SAST and/or SCA that ensure the model is not biasing itself to particular flaws.

At least for common languages this should stand out.
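As a rough sketch of what such a post-generation check could look like, the snippet below walks the syntax tree of generated Python and flags a few risky calls. The deny-list `RISKY_CALLS` and the helper names are hypothetical and chosen for illustration; a real SAST tool applies far more rules than this.

```python
import ast

# Hypothetical deny-list for illustration; real SAST rule sets are much larger.
RISKY_CALLS = {"eval", "exec", "pickle.loads", "pickle.load", "os.system"}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or '' for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def find_risky_calls(source):
    """Return sorted (line number, call name) pairs for deny-listed calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


# Stand-in for model output; the code is only parsed, never executed.
generated = "import pickle\ndata = pickle.loads(blob)\nprint(eval('1+1'))\n"
print(find_risky_calls(generated))
```

Running a check like this over many generations would show whether the model leans toward a particular flaw, though it only catches patterns someone thought to list.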

Where it gets more tricky is watering hole attacks against specialized languages or certain setups. This said you'd have to ensure that this data is not already there scraped up from the internet.

dheera 2 years ago | |

Many people are also unaware that json is way, way, way faster than Python pickles, and human-editing-friendly. Not that you'd use it for neural net weights, but I see people use Python pickles all the time for things that json would have worked perfectly well.

romanows 2 years ago | | |

Are you sure json is faster than pickle in recent python versions? That's not intuitive to me and search result blurbs seem to indicate the opposite.
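One way to check this on your own machine is a small timing sketch like the one below. The payload is a hypothetical example of plain nested data, and the result will likely vary with Python version, pickle protocol, and data shape, so treat any single run as indicative only.

```python
import json
import pickle
import timeit

# Hypothetical payload: plain nested data that both formats can represent.
data = {
    "ids": list(range(10_000)),
    "names": [f"item-{i}" for i in range(10_000)],
}


def json_roundtrip():
    """Serialize to a JSON string and parse it back."""
    return json.loads(json.dumps(data))


def pickle_roundtrip():
    """Serialize with the newest pickle protocol and load it back."""
    return pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))


for label, fn in [("json", json_roundtrip), ("pickle", pickle_roundtrip)]:
    seconds = timeit.timeit(fn, number=50)
    print(f"{label:6s} {seconds:.3f}s for 50 round trips")
```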

BlueTemplar 2 years ago | |

So, a little bit like a lot of people think that (non-checksummed/non-encrypted) PDFs cannot be modified, even though they are easily editable with Libre freaking Office ?

failuser 2 years ago | | |

You can’t edit them in Word, so that must be too advanced for most people. LibreOffice never opened the PDFs too well for me, but Inkscape was pretty good, one page at a time though.

rodgerd 2 years ago | |

The other aspect that pertains to AI is the data-maximalist mindset around these tools: grab as much data, aggregate it all together, and to hell with any concerns about what and how the data is being used; more data is the competitive advantage. This means a failure that might otherwise be quite limited in scope becomes huge.

hedora 2 years ago | |

Occasionally, I’ll talk to someone suggesting a dynamically typed language (or stringly-typed java) for a very large scale (in developer count) security or mission critical application.

This incident is a good one to point back to.

sillysaurusx 2 years ago | | |

laughs in log4j vuln

A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to find as many other servers that you can get to.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary — the bash history always had so many useful points of interest.

It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.

Sadly choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.

mattnewton 2 years ago | | |

The typing of Python isn’t the issue; it’s effectively the eval problem of not having a separation between code and data in the pickle format often used out of convenience. There are lots of pure data containers, like Hugging Face’s safetensors or TensorFlow’s protobuf checkpoints, that could have been used instead.
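To illustrate what "pure data container" means, here is a toy format invented for this example: a length-prefixed JSON header followed by raw float bytes. It is not the safetensors implementation or any real library's API, and the function names are assumptions; the point is only that loading involves parsing and bounds checks, never calling anything named in the file.

```python
import json
import struct
from array import array


def save_tensors(tensors):
    """Pack {name: list of floats} as an 8-byte length, a JSON header, raw bytes."""
    header, body = {}, b""
    for name, values in tensors.items():
        raw = array("f", values).tobytes()  # native byte order, for simplicity
        header[name] = {"dtype": "f32", "offsets": [len(body), len(body) + len(raw)]}
        body += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + body


def load_tensors(blob):
    """Parse the container; no step here imports modules or calls stored names."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    body = blob[8 + header_len :]
    out = {}
    for name, meta in header.items():
        start, end = meta["offsets"]
        # Reject unknown dtypes and out-of-range offsets instead of trusting them.
        if meta["dtype"] != "f32" or not 0 <= start <= end <= len(body):
            raise ValueError(f"bad entry for {name!r}")
        values = array("f")
        values.frombytes(body[start:end])
        out[name] = values.tolist()
    return out


blob = save_tensors({"w": [1.0, 2.0, 0.5], "b": [0.25]})
print(load_tensors(blob))
```

A malicious file in a format like this can still be malformed, but the worst case is a parse error rather than code execution.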

evertedsphere 2 years ago | | |

types have nothing to do with this, strictly speaking; the same problems would exist if you serialised structures containing functions in a typed language to e.g. a dll or a .class file and asked users to load it at runtime

the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place

make3 2 years ago | | |

that has literally nothing to do with the topic, which is just misconfigured cloud stuff. people really like starting these old crappy language arguments anywhere they can

nostoc 2 years ago | | |

Yeah, because statically typed languages never had any kind of deserialization vulnerabilities.

chinchilla2020 2 years ago | | |

What is the best practice? I'm assuming something that isn't a programming language object...

benreesman 2 years ago | |

I’ll venture that it’s at least adjacent that the indiscriminate assembly of massive, serious pluralities of the commons on a purely unilateral basis for profit is sort of a “just try and stop us” posture that whether or not directly related here, and clearly with some precedent, is looking to create a lot of this sort of thing over and above the status-quo ick.

short_sells_poo 2 years ago | | |

I have no idea what you are saying. If it is: "bad incentives cause people to misbehave", you generated an impressive verbiage around it :)

sillysaurusx 2 years ago |

The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.

hdesh 2 years ago |

On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.

quickthrower2 2 years ago |

Two of the things that make me cringe are mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

SOC2 type auditing should have been done here so I am surprised at the reach. The SAS had no expiry, and then there is the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.

My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans access via username password and other factors.

If you are working in one cloud you don’t in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.

stevanl 2 years ago |

Looks like it was up for 2 years with that old link[1]. Fixed two months ago.

[1] https://github.com/microsoft/robust-models-transfer/blame/a9...

jl6 2 years ago |

Kind of incredible that someone managed to export Teams messages out from Teams…

pradn 2 years ago |

It's not reasonable to expect human security token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally blanket access tokens should be opt-in, not opt-out.

Google banned generation of service account keys for internally-used projects. So an awry JSON file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.

mola 2 years ago |

It's always funny that Wiz's big security revelations are almost always about Microsoft, given that Wiz's founder was the highest-ranking person in charge of cyber security at Microsoft in his previous job.

alphabetting 2 years ago | |

Would be kind of surprising if that weren't the case.

anon1199022 2 years ago |

Just proves how hard cloud security is now. One or two mistakes and you expose terabytes. Insane.

formerly_proven 2 years ago |

This stands out

> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.

Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random A3 bucket that they're using for work files.

croes 2 years ago | |

Why not even?

Security was never a strong part of Microsoft.

bkm 2 years ago |

Would be insane if the GPT4 model is in there somewhere (as it's served by Azure).

albert_e 2 years ago | |

Also imagine all such exposed data sources including those that are not yet discovered... are crawled and trained on by GPT5.

Meanwhile a big enterprise provider like MS suffers a bigger leak and exposes MS Teams/ OneDrive / SharePoint data of all its North America customers say.

Boom we have GPT model that can autonomously run whole businesses.

naillo 2 years ago | |

Well there is that "transformers" folder at the bottom of the screenshot...

wodenokoto 2 years ago |

I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.

Even more so, you only have two keys for the entire storage account. Would have made much more sense if you could have unlimited, named keys for each container.

unoti 2 years ago | |

> I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.

Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.

bob1029 2 years ago | | |

This is what we are using for everything. It makes life so much easier.

So far, our new Azure tenant has absolutely zero passwords or shared secrets to keep track of.

Granting a function app access to SQL Server by way of the app's name felt like some kind of BS magic trick to me at first. But it absolutely works. Experiences like this give me hope for the future.

PretzelPirate 2 years ago | |

> if you could have unlimited, named keys for each container.

These exist and are called shared access signature (SAS) tokens. People are too lazy to use them and just use the account-wide keys instead.

quickthrower2 2 years ago | |

https://learn.microsoft.com/en-us/azure/role-based-access-co...

kevinsundar 2 years ago |

This is very similar to how some security researchers got access to TikTok's S3 bucket: https://medium.com/berkeleyischool/cloudsquatting-taking-ove...

They used the same mechanism of using common crawl or other publicly available web crawler data to source dns records for s3 buckets.

EGreg 2 years ago |

This seems to be a common occurrence with Big Tech and Big Government, so we better get used to it:

https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...

https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...

alphabetting 2 years ago | |

Is this stuff regularly happening to AWS and GCP? This is like the 3rd insane security incident from Microsoft in the past year.

EGreg 2 years ago | | |

https://www.bleepingcomputer.com/news/security/top-secret-us...

https://www.engadget.com/amp/2018-07-18-robocall-exposes-vot...

Ok so it’s not Microsoft exposing Microsoft, but government exposing its S3 buckets.

The question should be — why is all that data and power concentrated in one place? Because of the capitalist system and Big Tech, or Big Government.

Personally I am rather happy when “top secret information” is exposed, because that is the type of thing that harms people around the world more than it helps. The government wants to know who is sending you $600 but doesn't want to tell you how they spent trillions on shadowy “defense” contractors.

https://community.qbix.com/t/transparency-in-government/234

rickette 2 years ago |

At this point MS might as well acquire Wiz, given the number of Azure security findings they have found.

lijok 2 years ago |

I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass

gumballindie 2 years ago |

Would be cool if someone analysed it. I am fairly certain it has proprietary code and data lying around. Would be useful for future lawsuits against Microsoft and others that steal people’s IP for “training” purposes.

madelyn-goodman 2 years ago |

This is so unfortunate but a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI. It seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts. Disclosure: I work for Tonic.ai and we are working on a way to automatically redact any information you send to an LLM - https://www.tonic.ai/solar

naikrovek 2 years ago |

Amazing how ingrained it is in some people to just go around security controls.

someone chose to make that SAS have a long expiry and someone chose to make it read-write.
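Both choices are visible in the token itself, since a SAS URL carries its permissions in the `sp` query parameter and its expiry in `se`. The sketch below audits a URL using only the standard library; the example URL, the seven-day threshold, and the function name are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Hypothetical policy values for illustration.
MAX_LIFETIME = timedelta(days=7)
WRITE_PERMS = set("wdac")  # write, delete, add, create


def audit_sas_url(url, now):
    """Return human-readable warnings for an overly broad SAS URL."""
    params = parse_qs(urlsplit(url).query)
    warnings = []

    # "sp" holds the granted permissions as a string of single letters.
    perms = set(params.get("sp", [""])[0])
    if perms & WRITE_PERMS:
        warnings.append("token allows modification: " + "".join(sorted(perms & WRITE_PERMS)))

    # "se" holds the expiry as an ISO 8601 UTC timestamp.
    expiry_raw = params.get("se", [""])[0]
    if not expiry_raw:
        warnings.append("no expiry (se) found")
    else:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > MAX_LIFETIME:
            warnings.append(f"expires {expiry.date()}, more than {MAX_LIFETIME.days} days out")
    return warnings


# Made-up URL with a placeholder signature; not the token from the article.
url = "https://example.blob.core.windows.net/models?sv=2020-01-01&sp=rwdl&se=2051-01-01T00:00:00Z&sig=REDACTED"
warnings = audit_sas_url(url, now=datetime(2023, 9, 18, tzinfo=timezone.utc))
print(warnings)
```

A check like this could run in CI over any URL committed to a repository, though it only helps with tokens that someone actually scans.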

JohnMakin 2 years ago | |

It’s easy.

“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”

“just give it admin privileges and we’ll fix it later”

sometimes they’ll put a short TTL on it, aware of the risk. Then something major breaks a few months later, gets a 15 year expiry, never is remediated.

It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.

baz00 2 years ago |

What's that, the second major data loss / leak event from MSFT recently?

Is your data really safe there?

h1fra 2 years ago |

The article is focusing on AI and Teams messages for some reason, but the exposed bucket had passwords, SSH keys, credentials, .env files and most probably a lot of proprietary code. I can't even imagine the nightmare it has created internally.

svaha1728 2 years ago |

Embrace, extend, and extinguish cybersecurity with AI. It's the Microsoft way.

fithisux 2 years ago |

My opinion is that it was not an "accident", but they prepare us for the era where powerful companies will "own" our data in the name of security.

Should have been sent to prison.

riwsky 2 years ago |

If only Microsoft hadn’t named the project “robust” models transfer, they could have dodged this Hubrisbleed attack.

bt1a 2 years ago |

Don't get pickled, friends!

34679 2 years ago |

@4mm character width:

4e-6 * 3.8e+13 = 152 million kilometers of text.

Nearly 200 round trips to the moon.
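The arithmetic can be checked with a few lines. This assumes one byte per character and an average Earth-Moon distance of 384,400 km, neither of which is stated above.

```python
CHAR_WIDTH_KM = 4e-6        # 4 mm per character, expressed in kilometers
BYTES_EXPOSED = 3.8e13      # 38 TB, assuming one byte per character
MOON_DISTANCE_KM = 384_400  # average Earth-Moon distance (assumed value)

text_km = CHAR_WIDTH_KM * BYTES_EXPOSED
round_trips = text_km / (2 * MOON_DISTANCE_KM)
print(f"{text_km:.3g} km, about {round_trips:.1f} round trips")
```

That works out to roughly 197.7 round trips, consistent with "nearly 200".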

avereveard 2 years ago |

Oof. Is that containing code from GitHub private repos?

endisneigh 2 years ago |

how is this sort of stuff not at least encrypted at rest?

tremon 2 years ago | |

Encryption at rest does nothing to prevent online access to data. It's only useful if you leave your storage cabinet standing on the side of the road.

quickthrower2 2 years ago | | |

Your laptop backup could be encrypted. New problem: where to put the keys. Maybe another storage account with different access controls.

Smaug123 2 years ago | |

Per the article, the Azure bucket was explicitly shared. Azure Storage is generally encrypted at rest (https://learn.microsoft.com/en-us/azure/storage/common/stora...).

nightpool 2 years ago | |

What do you think "encryption at rest" means

mymac 2 years ago |

Fortunately not a whole lot of data, and for sure with a little bit like that there wasn't anything important, confidential or embarrassing in there. Looking forward to Microsoft's itemised list of what was taken, as well as their GDPR related filing.

Nischalj10 2 years ago |

zsh, any way to download the stuff?

EMCymatics 2 years ago |

That's a lot of data.

munchler 2 years ago |

> This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data.

It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.

numbsafari 2 years ago | |

This is the risk of using, checks notes, Azure and working with Microsoft.

Except there is no risk for them. They've proven time and again that they can have major security snafus and not be held accountable.

eddythompson80 2 years ago | | |

Virtual networks are a nightmare to set up and manage in Azure, which is why everyone just takes the easy path and doesn't bother.

Almost every Azure service we deal with has virtual networks as an afterthought because they want to get to market as quickly as possible, and even to them managing vnets is a nightmare.

Not to excuse developers/users though. There are plenty of unsecured S3 buckets, docker containers, and Github repos that expose too much "because it's easier". I've had a developer check their FTP creds into a repo the whole company has access to. He even broke the keys up and concatenated them in shell to work around the static checks "because it's easier" for his dev/test flow.

robertlagrant 2 years ago | | |

They have all the regulatory paperwork in place, so it must be fine.

intrasight 2 years ago | |

Agreed. It should say "new risks organizations face when starting to leverage the power of Azure" or "the power of cloud computing". But that's not as clickbait-worthy a title.

acdha 2 years ago | |

The second clause covers that: this isn’t an AI problem, just as it wasn’t a big data problem when the same kind of thing happened a decade ago. It’s a problem caused when you set up something new outside of what the organization is used to and have people without appropriate training asked to make security decisions: I’d bet that this work was being done by people who were used to the academic style, blending personal and corporate use on the same device, etc. and simply weren’t thinking of this class of problem. The description sounds a lot like the grad students & postdocs I used to support – you’d see some dude with Steam on his workstation because it was faster than his laptop and since he was in the lab 70 hours a week anyway, why not 90?

The challenge for organizations is figuring out how to support research projects and other experiments without opening themselves up to this kind of problem or stymieing R&D.

omgJustTest 2 years ago | |

This comment is a good bit of rationalization, and whichever the categorical mismatch you feel is happening, it misses the overarching point, the focus should be on the broader systemic issues: data security is not a first or second tier priority to "big data" or "AI"... largely because there's no cost to doing it poorly.

mavhc 2 years ago | |

With big data comes big responsibility

Phileosopher 2 years ago | |

AI has magnified the use cases, though. Before, Big Data was an advertising machine meant to tokenize and market to every living being on the planet. Now, machine learning can create "averaged" behavior of just about anything, given enough data and specificity.

buro9 2 years ago |

Part of me thought "this is fine as very few could actually download 38TB".

But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.

It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.
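For a sense of scale, a back-of-the-envelope estimate of the download time follows. It assumes decimal terabytes and a link that stays fully saturated with no protocol overhead, so a real transfer would take longer.

```python
DATA_BYTES = 38e12       # 38 TB, decimal terabytes
LINK_BITS_PER_SEC = 1e9  # 1 Gbps, assuming the link is fully saturated

seconds = DATA_BYTES * 8 / LINK_BITS_PER_SEC
days = seconds / 86_400
print(f"{seconds:,.0f} s, about {days:.1f} days")
```

That is about three and a half days on a home connection under these assumptions.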

All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.

I mean, obviously that's the sales pitch... you need this vendor's monitoring and security, but that's not a bad sales pitch as you need to be able to imagine and think of the risk to monitor for it and most engineers aren't thinking that way.

anyoneamous 2 years ago |

Straight to jail.

1-6 2 years ago | |

Nah, Microsoft probably has a blameless culture

croes 2 years ago | | |

It was hackers, for sure.

HumblyTossed 2 years ago |

Microsoft, too big to fa.. care.