4e-6 * 3.8e+13 = 152 million kilometers of text.
Nearly 200 round trips to the moon.
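A quick sanity check of the arithmetic above. This is a rough sketch: it assumes 4e-6 is kilometers of printed text per byte (about 4 mm per character) and 3.8e+13 is the 38 TB in bytes, and it uses the average Earth-Moon distance of about 384,400 km.

```python
KM_PER_BYTE = 4e-6        # assumption: roughly 4 mm of printed text per byte
TOTAL_BYTES = 3.8e13      # 38 TB expressed in bytes
EARTH_MOON_KM = 384_400   # average Earth-Moon distance in km

text_km = KM_PER_BYTE * TOTAL_BYTES           # about 152 million km
round_trips = text_km / (2 * EARTH_MOON_KM)   # a bit under 200

print(f"{text_km / 1e6:.0f} million km, {round_trips:.1f} round trips")
```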
It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.
Except there is no risk for them. They've proven time and again that they can have major security snafus and not be held accountable.
Almost every Azure service we deal with has virtual networks as an afterthought because they want to get to market as quickly as possible, and even to them managing vnets is a nightmare.
Not to excuse developers/users though. There are plenty of unsecured S3 buckets, docker containers, and GitHub repos that expose too much "because it's easier". I've had a developer check their FTP creds into a repo the whole company has access to. He even broke the keys up and concatenated them in shell to work around the static checks "because it's easier" for their dev/test flow.
The challenge for organizations is figuring out how to support research projects and other experiments without opening themselves up to this kind of problem or stymieing R&D.
But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.
It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.
All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.
I mean, obviously that's the sales pitch... you need this vendor's monitoring and security, but that's not a bad sales pitch as you need to be able to imagine and think of the risk to monitor for it and most engineers aren't thinking that way.
Do you worry about failure? In your hardware life I mean, not your personal life.
I do online backup to a cloud provider, and a monthly dump to external USB drives that I keep and rotate at my mother in law's house (off site:).
More than any technical advice, I'd strongly urge you to check and understand honestly whether you're looking for "NAS" (a place to seamlessly store data) or "a project" (something to spend fun and frustrating and exciting evening and weekend time configuring, upgrading, troubleshooting, changing, re-designing, replacing, blogging, etc). Nothing wrong with either, just ensure you pick the path you actually want :->
I back up critical data from the 80TB NAS to the 40TB NAS, and the most critical data gets backed up nightly to a single hard drive in my friend's NAS box (offsite). Twice a year, I back up the full thing to external hard drives and take them out of state to a different friend's house.
Don't worry, be happy.
It’s so easy to set up an Ubuntu image that I control completely, and I would rather do that than run some questionable 3rd party NAS solution. Excluding disks, it costs about $130.
Two-bay NAS, two drives as a mirrored pair, two SSDs as mirrored pair cache. Only makes data available on my home network. Primarily using Nextcloud and Gitea.
It backs up important files nightly to a USB-attached drive, less critical files weekly. I have a weekly backup to a cloud provider for critical files.
A sibling comment makes a good point: do you want a hobby or an appliance? Using a commercial NAS makes it closer to an appliance[0]. Building it yourself will likely require more fiddling.
If you want to run a different OS on a commercial NAS, dig deeper into the OS requirements before buying the NAS. The Asustor Lockerstor Gen 2 series' fan is not inherently supported by things other than Asustor's software.
[0] A commercial NAS will still require monitoring, maintenance, and validation of backups.
I've got these in an SHR configuration (Synology Hybrid Raid with 1 disk of protection) which means about 115-6TB of usable space and allowing for single drive failure.
The filesystem is BTRFS ( https://daltondur.st/syno_btrfs_1/ ).
I upgraded the RAM (Synology will forever nag about it not being their RAM https://www.reddit.com/r/synology/comments/kaq7ks/how_to_dis... ).
I have the option in future to purchase the network card to take that to 10Gbps ports rather than 1Gbps ports.
So that's the first... but then I have a second one... which is an older DS1817+ which is filled with 10TB HDDs and yields 54.5TB usable in SHR2 + BTRFS... which I use as a backup to the first, but as it's smaller just the really important stuff and it is disconnected and powered down mostly, it's a monthly chore to connect it, and rsync things over. Typically if I want to massively expand a NAS (every ~10 years) I will buy a whole new one and relegate the existing to be a backup device. Meaning an enclosure has on avg about 15y of life in it and amortises really well as being initially the primary, and then later the backup.
I do _not_ use any of the Synology software, it's just a file system... I prefer to keep my NAS simple and offload any compute to other small devices/machines. This is in part because of the length of time I keep these things in service... the software is nearly always the weakest link here.
You can build your own NAS, TrueNAS Core (nee FreeNAS) https://www.truenas.com/freenas/ is very good... but for me, a NAS is always on and the low power performance of these purpose-built devices and their ability to handle environmental conditions (I am not doing anything special for cooling, etc) and the long-term updates to the OS, etc... makes it quite compelling.
You can have up to two disks of redundancy (dual parity) per drive pool.
That means in a little bit over 5 minutes, the data could have been downloaded by someone. Even most well run security teams won't be able to respond quickly enough for that type of event.
That's just a scam rate by AWS. The true price is 1/100th of that, if that.
5gbps and 10gbps residential fiber connections are common now.
12TB hd's cost under $100, so you would only need about $400 of storage to capture this, my SAN has more capacity than this and I bought basically the cheapest disks I could for it.
It only takes one person to download it and make a torrent for it to be spread arbitrarily.
People could target more interesting subsets over less interesting parts of the data.
Multiple downloaders could share what they have and let an interested party assemble what is then available.
this is assuming by 1Gbps you mean 1 Gigabit/s rather than 1 Gigabyte/s
38 terabytes = 304 terabits.
304 terabits / 1 gigabit/second = 304,000 seconds
304,000 seconds =~ 84 hours. Add 20% for not pegging the line the whole time and the limits of 1gbps ethernet, and perhaps 100 hours is reasonable.
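The same back-of-the-envelope estimate as a small sketch, so other link speeds can be plugged in. It assumes decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s); the function name and the 20% overhead figure are just the illustration from the comment above, not a measured value.

```python
def transfer_hours(size_terabytes: float, link_gbps: float, overhead: float = 0.0) -> float:
    """Hours needed to move size_terabytes over a link of link_gbps.

    overhead is a fractional fudge factor (0.2 means add 20%) for
    protocol overhead and not saturating the line the whole time.
    """
    bits = size_terabytes * 1e12 * 8        # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # bits / (bits per second)
    return seconds * (1 + overhead) / 3600  # seconds -> hours


ideal = transfer_hours(38, 1)          # about 84 hours
realistic = transfer_hours(38, 1, 0.2)  # about 101 hours
ten_gig = transfer_hours(38, 10)       # about 8.4 hours
print(f"{ideal:.1f} h ideal, {realistic:.1f} h with overhead, {ten_gig:.1f} h at 10 Gbps")
```

As noted elsewhere in the thread, the real bottleneck may be the remote server's upload speed rather than the downloader's link.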
If it's windows, Active Directory.
Did you settle on using RAID, or just rely on cloud backups?
I would not make the same choices today: I got a somewhat high end one and upgraded it to whopping 32GB of RAM, thinking I'd use it for running lightweight containers or VMs, and maybe a media server. But once I put all my data on it... including 20 years of family photos and tax prep documents and work stuff and everything else... I changed my mind and am using it only and solely as an internal storage unit. Basically, as mentioned, committed to the "NAS" as opposed to "Fun Project" path :-). So I could've saved myself some money by getting a simpler unit and not upgrading it. (the DS918+ also can hook up to a cage [DX517], but I ended up not needing that either, yet).
I have it with 4 WD Red Plus NAS 8TB drives and RAID 10 currently. I've used RAID 5 in the past but decided against it for this usage - again, went for simplicity.
Just shy of 30,000 hours on the drives, daily usage (I basically don't use local drive for any data on any of my computers; I keep it all on NAS and this way I can use any of my computers to do/access the same thing), and really no issues whatsoever so far.
whatever the download size is, you're bottlenecked by the remote server's up speed
Thank you for the details, particularly about zfs, which I know nothing about. The “if I’m lucky” part piqued my interest. HN was recently taken down by a double disk failure, which is exponentially more likely when you buy drives in bulk - the default case. So being able to survive two failures simultaneously is something I’d like to design for.
It’s cool you have two NASes (NASen?) let alone one. They’re the Pokémon of the tech world.
If you are concerned about reliability above performance, I would suggest using a single raidz2 vdev instead. This would allow the cluster to definitely survive two disks worth of failure. I'll also echo the common mantra - RAID is not backups. If you really need the data, you need to store a second copy offline in a different place.
When I lived in California and did not have room for a server rack, I had a single home server with an 8-bay tower case. I used an LSI card with 2 SAS-to-4x-SATA ports to connect all 8 drives to the machine. I believe I had 6 TB drives in that NAS, though they are currently all out of my house (part of one of my offsite backups now). My topology there was 4x mirror vdevs, which gave me worst case endurance of 1 failure but best case of 4 failures, and at about 4x the IOPS performance, but with the cost of only 50% storage efficiency vs the 75% you would get with raidz2.
There is even raidz3 if you are very paranoid, which allows up to 3 disks to fail before you lose the vdev. I've never used it. As I understand, the parity calculations get considerably more complicated, although I don't know if that really matters.
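To make the trade-off concrete, here is a rough sketch comparing the two 8-drive layouts discussed above. The function names are made up for illustration, and it only counts whole drives; it ignores ZFS metadata, padding, and slop space, so real usable capacity will be somewhat lower.

```python
def mirror_pool(n_drives: int, way: int = 2) -> dict:
    """Pool built from n_drives // way mirror vdevs of `way` drives each."""
    vdevs = n_drives // way
    return {
        "efficiency": 1 / way,
        "guaranteed_failures": way - 1,          # worst case: failures hit one vdev
        "best_case_failures": vdevs * (way - 1),  # best case: spread across vdevs
    }


def raidz_pool(n_drives: int, parity: int) -> dict:
    """Single raidz vdev with `parity` parity drives (1, 2, or 3)."""
    return {
        "efficiency": (n_drives - parity) / n_drives,
        "guaranteed_failures": parity,
        "best_case_failures": parity,
    }


mirrors = mirror_pool(8)   # 4x two-way mirrors
raidz2 = raidz_pool(8, 2)  # one 8-wide raidz2 vdev
print(mirrors)
print(raidz2)
```

With 8 drives this gives 50% efficiency and 1 guaranteed failure for mirrors, versus 75% efficiency and 2 guaranteed failures for raidz2, matching the numbers above.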
It depends on what your plans for the storage are. If you're going to fill it with bulk data that gets accessed sequentially (think media files), then performance will be fine with basically any topology or drive choice. If you are going to fill it with data for training ML models across multiple machines, you need to think about how you will make it not the bottleneck for your setup.
One more thing to consider - you can get new consumer OR used enterprise flash for somewhere around $45/TB in the 4 TB SATA size, or the 8 TB NVMe size. Those drives will likely fail read-only if they fail at all. They will usually use less power, take less space, and obviously will perform orders of magnitude better than spinning rust, at somewhere around 3x the cost.
I am hoping to build my next NAS entirely on flash.
The URL was: "https\://robustnessws4285631339.blob.core.windows.net/public-models/robust_imagenet/resnet18_l2_eps3.ckpt?sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupitfx&se=2051-10-06T07:09:59Z&st=2021-10-05T23:09:59Z&spr=https,http&sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D" (Backslash added to prevent HN from detecting it as an URL and shortening).
The issue was that "sig=U69s...." token gave access to far more than the researchers intended to share.
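You can see most of the problem just by reading the query string. Below is a minimal sketch that parses the parameters from the URL quoted above (signature swapped for a placeholder). The `summarize_sas` helper is hypothetical, and it only interprets the fields I'm sure about: `sp` (permissions, where `r`/`w`/`d`/`l` are read/write/delete/list), `st`/`se` (start/expiry), and `spr` (allowed protocols).

```python
from datetime import datetime
from urllib.parse import parse_qs

# Query string from the URL above; the real signature is replaced by a placeholder.
QUERY = (
    "sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupitfx"
    "&se=2051-10-06T07:09:59Z&st=2021-10-05T23:09:59Z"
    "&spr=https,http&sig=REDACTED"
)


def summarize_sas(query: str) -> dict:
    """Pull out the scope-related fields of a SAS query string."""
    params = {key: values[0] for key, values in parse_qs(query).items()}
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(params["st"], fmt)
    expiry = datetime.strptime(params["se"], fmt)
    perms = params["sp"]
    return {
        "permissions": perms,
        "can_write": "w" in perms,
        "can_delete": "d" in perms,
        "lifetime_days": (expiry - start).days,
        "allows_plain_http": "http" in params["spr"].split(","),
    }


summary = summarize_sas(QUERY)
print(summary)
```

So: write and delete permissions, a validity window of roughly 30 years, and plain HTTP allowed, all for a link that was only supposed to let people download a checkpoint.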
...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given how the Python serialization format was never exactly intended for untrusted data distribution and so is kind of effectively code... but stored in a way where both what that code says as well as that it is there at all is extremely obfuscated to people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
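For anyone who hasn't seen why pickle is "code by design": a minimal, harmless sketch. The `Payload` class stands in for a tampered object inside a checkpoint and is purely hypothetical; it calls the builtin `len`, but any importable callable would work the same way. The restricted unpickler follows the "restricting globals" pattern from the Python docs and is only one possible mitigation.

```python
import io
import pickle


class Payload:
    """Hypothetical stand-in for a tampered object inside a checkpoint."""

    def __reduce__(self):
        # On load, pickle calls the returned callable with these arguments.
        # A harmless builtin is used here for demonstration.
        return (len, ("arbitrary call",))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (function or class)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_load(data: bytes):
    """Load a pickle that may contain only plain data, no callables."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)  # 14: the result of len(...), not a Payload object

try:
    safe_load(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

plain = safe_load(pickle.dumps({"weights": [1.0, 2.0]}))  # plain data still loads
```

The point is that `pickle.loads` ran a function chosen by whoever wrote the file, and nothing about the file makes that visible to the person loading it.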
That said, you should be using something like safetensors.
At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"
That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.
AI is hard.
Firstly, the malicious data needs to form a significant portion of the data. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset.
Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine tune the model, and fine tuning tends to destroy model quality (or else fine tuning would be the default case for foundation models — you’d train it, fine tune it to make it “even better” somehow, then release it. But we don’t, because it makes the model less general by definition).
At least for common languages this should stand out.
Where it gets more tricky is watering hole attacks against specialized languages or certain setups. This said you'd have to ensure that this data is not already there scraped up from the internet.
This incident is a good one to point back to.
A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to find as many other servers that you can get to.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary — the bash history always had so many useful points of interest.
It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.
Sadly choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.
the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place
SOC2 type auditing should have been done here, so I am surprised at the reach: a SAS with no expiry, and then the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.
My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans access via username password and other factors.
If you are working in one cloud you don’t in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.
[1] https://github.com/microsoft/robust-models-transfer/blame/a9...
Google banned generation of service account keys for internally-used projects. So an awry JSON file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random S3 bucket that they're using for work files.
Security was never a strong part of Microsoft.
Meanwhile a big enterprise provider like MS suffers a bigger leak and exposes MS Teams/ OneDrive / SharePoint data of all its North America customers say.
Boom we have GPT model that can autonomously run whole businesses.
Even more so, you only have two keys for the entire storage account. Would have made much more sense if you could have unlimited, named keys for each container.
Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.
So far, our new Azure tenant has absolutely zero passwords or shared secrets to keep track of.
Granting a function app access to SQL Server by way of the app's name felt like some kind of BS magic trick to me at first. But it absolutely works. Experiences like this give me hope for the future.
These exist and are called Shared Access Tokens. People are too lazy to use them and just use the account-wide keys instead.
They used the same mechanism of using common crawl or other publicly available web crawler data to source dns records for s3 buckets.
https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...
https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...
https://www.engadget.com/amp/2018-07-18-robocall-exposes-vot...
Ok so it’s not Microsoft exposing Microsoft, but government exposing its S3 buckets.
The question should be — why is all that data and power concentrated in one place? Because of the capitalist system and Big Tech, or Big Government.
Personally I am rather happy when “top secret information” is exposed, because that is the type of thing that harms people around the world more than it helps. The government wants to know who is sending you $600 but doesn't want to tell you how they spent trillions on shadowy “defense” contractors.
someone chose to make that SAS have a long expiry and someone chose to make it read-write.
“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”
“just give it admin privileges and we’ll fix it later”
sometimes they’ll put a short TTL on it, aware of the risk. Then something major breaks a few months later, gets a 15 year expiry, never is remediated.
It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.
Is your data really safe there?
Should have been sent to prison.
Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"
Which I guess is better than nothing.
If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive pen test (maybe call it something other than a "pen test" to convince internal stakeholders who might be afraid that a 2nd in depth pen test might weaken their compliance posture since the report is typically shared with sales prospects)
Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.
When I was consulting, architecture and code review were separate services with a very different rate from pentesting. Similar goals but far more expensive.
Unfortunately, compliance/customer requirements often stipulate having penetration tests performed by third parties. So for business reasons, these same companies will also hire low-quality pen-tests from "check-box pen-test" firms.
So when you see that $10K "complete pen-test" being advertised as being used by [INSERT BIG SERIOUS NAME HERE], good chance this is why.
They may be rare, but "real" pentests are still a thing.
Pentest comes across more as checking all the common attack vectors don’t exist.
Getting out of bed to do the so-called “real stuff” is typically called a bug bounty program or security researching.
Both exist and I don’t see why most companies couldn’t start a bug bounty program if they really cared a lot about the “real stuff”
- finding the token directly in the repo
- reviewing all tokens issued
Looks like Azure hasn't done similarly.
Like for starters, why is it so hard to determine effective access in their permissions models?
Why is the "type" of files so poorly modeled? Do I ever allow people to give effective public access to a file "type" that the bucket can't understand?
For example, what is the "type" of code? It doesn't have to be this big complex thing. The security scanners GitHub uses know that there's a difference between code with and without "high entropy strings" aka passwords and keys. Or if it looks like data:content/type;base64, then at least I know it's probably an image.
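A "high entropy string" check really can be this small. Here is a rough sketch using Shannon entropy per character; the `looks_like_secret` helper, the 20-character minimum, and the 4.0 bits/char threshold are arbitrary assumptions for illustration, not what GitHub's scanners actually use.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long tokens whose characters look close to random."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold


print(looks_like_secret("aaaaaaaaaaaaaaaaaaaaaaaa"))          # False: no variety
print(looks_like_secret("abcdefghijklmnopqrstuvwxyz012345"))  # True: 32 distinct chars
```

Real scanners combine this with known key formats and context, since entropy alone flags things like hashes and minified code too.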
What if it's weird binary files like .safetensors? Someone here saying you might "accidentally" release the GPT4 weights. I guess just don't let someone put those on a public-resolvable bucket, ever, without an explicit, uninherited manifest / metadata permitting that specific file.
Microsoft owns the operating system! I bet in two weeks, the Azure and Windows teams can figure out how to make a unified policy manifest / metadata for NTFS & ReFS files that Azure's buckets can understand. Then again, they don't give deduplication to Windows 11 users; their problem isn't engineering, it's the financialization of essential security features. Well, joke's on you guys: if you make it a pain for everybody, you make it a pain for yourself, and you're the #1 user of Azure.
Usually within a few minutes there's followup context sent. Either the other party was already in the process of writing the followup, or they realized there was nothing actionable to respond to and they elaborate.
The concept simply needs a more descriptive name to be accepted. It's not about not saying hello. It's about including the actual request in the first message, usually after the hello.
You just can't win.
In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, but it can happen!)
Whereas in English you assume this is just a hello and nothing more.
Though I have had the equivalent in tech support: "App doesn't work" which is basically just hello, obviously you're having an issue otherwise you wouldn't have contacted our support.
In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
perhaps it'd be viable to add support for the ONNX format even for use cases like model checkpointing during training, etc?
I'll take SAS tokens with expiration over people setting up a shared RBAC account and sharing the password for it.
Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks does not solve the issue, SAS tokens, while still not perfect, help quite a lot.
The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.
Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.
At the technical level, sure. At the deployment, configuration and management level, not quite. Overall things are so bad that the news isn't even reporting the hospitals taken over by ransomware anymore. It's still happening almost every week and we're just... used to it.
Get a load of these guys, honey, you could just dial straight into the airline.
Sounds like it’s as hard as it’s always been. Pretty basic and filled with humans
It's no longer hierarchical, with organization schemes limited to folders and files. People no longer talk about network paths, or server names.
Mobile and desktop apps alike go to enormous effort to abstract and hide the location at which a document gets stored, instead everything is tagged and shared across buckets and accounts and domains...
I expect that the people at this organization working on cutting-edge AI are pretty sharp, but it's no surprise that they don't entirely understand the implications of "SAS tokens" and "storage containers" and "permissive access scope" on Azure, and the differences between Account SAS, Service SAS, and User Delegation SAS. Maybe the people at Wiz.io are sharper, but unless I missed the sarcasm, they may be wrong when they say [1] "Generating an Account SAS is a simple process." That looks like a really complicated process!
We just traced back an issue where a bunch of information was missing from a previous employee's projects when we changed his account to a shared mailbox. Turns out that he'd inadvertently been saving and sharing documents from his individual OneDrive on O365 (There's not one drive! There are many! Stop trying to pretend there's only one drive!) instead of the "official" organization-level project folder, and had weird settings on his laptop that pointed every "Save" operation at that personal folder, requiring a byzantine procedure to input a real path to get back to the project folder.
I'm pretty sure I used that one in middle school. (Though not to edit PDFs, and it might have been the Microsoft Works equivalent.)
In this case, models themselves are fundamentally files. These files can have malicious code embedded into them that is executed when the model is loaded for further training or inference. When executed it isn't obvious to the user at all. It's a very nasty potential vector.
I wrote a blog about it here: https://protectai.com/blog/announcing-modelscan
Customer: "We had a pentest/security scan/whatever find this issue in your software"
Me: "And they realized that mitigations are in place as per the CVE that keep that issue from being an exploitable issue, right"
Customer: "Uhhhh"
Testing group: "Use smaller words please, we only click some buttons and this is the report that gets generated"
The problem is:
1. knowing the gazillion web vulnerabilities and technologies
2. being good enough to test them
3. kicking yourself to go through the laborious process of understanding and testing every key feature of the target.
The article also mentions that these can't be tracked: tokens are issued on the client side (if I understood this correctly), which means that to audit tokens you’d have to ask everyone who had one issued to politely provide said token. Will everyone remember the tokens they have? Probably not. And if an attacker has already gotten what they needed, or managed to issue their own, no one would know.
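A simplified sketch of why client-side issuance defeats auditing. As far as I understand it, a SAS is essentially an HMAC-SHA256 signature over the token's parameters, computed with the account key. The string-to-sign below is NOT Azure's real format and the key is made up; the point is only that signing needs no call to the service, so there is nothing server-side to log.

```python
import base64
import hashlib
import hmac


def sign(account_key: bytes, string_to_sign: str) -> str:
    """Compute a base64 HMAC-SHA256 signature entirely offline."""
    digest = hmac.new(account_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(account_key: bytes, string_to_sign: str, signature: str) -> bool:
    """What the service does on each request: recompute and compare."""
    return hmac.compare_digest(sign(account_key, string_to_sign), signature)


key = b"hypothetical-account-key"          # made-up key for illustration
claims = "sp=r&se=2051-10-06T07:09:59Z"    # simplified stand-in for the signed fields
token = sign(key, claims)                  # issued with no network call at all

print(verify(key, claims, token))                             # True
print(verify(key, "sp=rwdl&se=2051-10-06T07:09:59Z", token))  # False: claims changed
```

Anyone holding the key can mint any number of valid tokens this way, and the service only learns a token exists when someone uses it. Revoking them means rotating the key.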
A: Hello!
B's bot: Hello to you too! I am a chatty bot which loves responding to greetings. Is there a message I can forward to B?
No, unless I misunderstand, it is actually intended to be understood the other way:
It is too easy to create a too-broad token.
And in the next paragraph, after the image, they explain that in addition to it being easy to create, these tokens are impossible to audit.
In order to purchase a reputable pentest, you basically have to have a security team that is mature enough to have just done it themselves.
I can throw out some names for some reputable firms, but you are still going to need to do some leg work vetting the people they will staff your project with, and who knows if those firms will be any good next year or the year after.
Here's a couple generic tips from an old pentester:
* Do not try and schedule your pentest in Q4, everyone is too busy. Go for late Q1 or Q2. Also say you are willing to wait for the best fit testers to be available.
* Ask to review resumes of the testing team. They should have some experience with your tech and at least one of them needs to have at least 2 years experience pen-testing.
* Make sure your testing environment is set up, as production like as possible, and has data in it already. Test the external access. Test all the credentials, once after you generated them, again the night before the test starts. The most common reason to lose your good pentest team and get some juniors swapped in that have no idea what they are doing is you delayed the project by not being ready day 1.
Of course, both are generally rhetorical, which must be confusing for some foreigners learning English, especially with the correct response to "Alright?" being "Alright?" and similarly with "What's up?".
What fraction of the training data needed to be that text?
Remember, once the model is trained, it's verified in a number of ways, ultimately based on human prompting. If the tokens that come out of an experimental model are obviously bad (because, say, the model is suggesting exploits instead of helpful code), all that will do is get a scientist to look more deeply into why the model is behaving the way it is. And then that would lead to discovering the poisoned data.
The payoff for an attacker is whether they can achieve some sort of goal. You'd have to clearly define what that goal is in order to know how effective the poisoning attack could be. What's the end game?
It's possible there's some minimum amount of poisoned data (a % or log function of a given dataset size n) that would then translate to generating a vulnerable output in x% of total outputs. If x is low enough to get past fine tuning/regression testing but high enough to still occur within the deployment space, then you've effectively created a new category of supply-chain attack.
There's probably more research that needs to be done into occurrence rate of poisoned data showing up in final output, and that result is likely specific to the AI model and/or version.
On human checks, http://www.underhanded-c.org/ demonstrates that it would be possible to inject content that will pass that.
The company ran a basic port scan and established that we were running an outdated and vulnerable version of Apache. I found the act of explaining the concept of backports to a "pentester" to be physically painful.
They didn't get paid and another company was entrusted with the audit.
Hopefully you also have an internal control that looks at actual package versions installed on the server.
Getting out of bed and "real stuff" is supposed to be part of a pentest.
The problem is more the sheer amount of stuff you are supposed to know to be a pentester. Most pentesters come into the field by knowing a bit of XSS, a few things about PHP, and SQL injections.
Then you start to work, and the clients need you to test things like:
- compromise a full Windows network, and take control of the Active Directory server because of a misconfiguration of Active Directory Certificate Services, while dealing with Windows Defender
- test a web application that uses websockets, React, nodejs, and GraphQL
- test a WinDev application, with a Java backend on an AIX server
- check the security of an architecture with multiple services that use a Single Sign on, and Kubernetes
- exploit multiple memory corruption issues ranging from buffer overflow to heap and kernel exploitation
- evaluate the security of an IoT device, with a firmware OTA update and secure boot.
- be familiar with cloud tokens, and compliance with European data protection law.
- mobile security, with iOS and Android
- network: RADIUS, ARP cache poisoning, writing a Scapy layer for a custom protocol, etc.
- Cryptography, you might need it
Most of this is actual stuff I had to work on at some point.
Even if you just do web, you should be able to detect and exploit all those vulnerabilities: https://portswigger.net/web-security/all-labs
Nobody knows everything. Being a pentester is a journey.
So in the end, most pentesters fall short on a lot of this. Even with an OSCP certification, you don't know most of what you should know. I heard that in some companies, people don't even try and just give you the results of a Nessus scan. But even if you are competent, sooner or later you will run into something that you don't understand. And you have two weeks at most to get familiar with it and test it. You can't test something that you don't understand.
The scanner always gives you a few things that are wrong (looking at you, TLS ciphers), even if you suck, or if the system is really secure. So you can put a few things into your report. As a junior pentester, my biggest fear was always handing in an empty report. What are people going to think of you if you work for a week and don't find anything?
I'm trying to remember the rule where you leave something intentionally misconfigured/wrong for the compliance people to find and that you can fix so they don't look deeper into the system. A fun one with web servers is to get them to report they are some ancient version that runs on a different operating system. Like your IIS server showing it's Apache 2.2 or vice versa.
But at least from your description it sounds like you're attempting to pentest. So many of these pentesting firms are click a button, run a script, send a report and go on to the 5 other tickets you have that day type of firms.
I recommend that you add some contact details to your HN bio page. You might get some good leads after that post.
For example, dealing with a "legal threat" situation with the product I work on because a client got hit by ransomware and they blame our product because "we just got a security assessment saying everything was fine, and your product is the only other thing on the servers" -- checked the report, basically it just runs some extremely basic port checks/Windows config checks that haven't been relevant for years and didn't even apply to the Windows versions they had, and in the end the actual attack came from someone in their company opening a malicious email and having a .txt file with passwords.
I don't doubt there are proper security firms out there, but I rarely encounter them.
Real stuff should always be a pentest: a penetration test where one is actively trying to exploit vulnerabilities. So the person who orders it gets a report with !!exploitable vulnerabilities!!.
Checking all common attack vectors is vulnerability scanning, which is mostly running a scanner and weeding out false positives without trying to exploit anything. Unfortunately most companies/people call that a penetration test, which it cannot be, because there is no attempt at penetration. Automated scanning tools might do some magic to confirm a vulnerability, but it still is not a penetration test.
In the end, a bug bounty program is different in a way: you never know if any security researcher will even be interested in testing your system. So in reality you want to order a penetration test. There is usually also a difference in that the scope of a bug bounty program is limited to what is publicly available. If company systems don't allow non-business users to create an account, then a security researcher will never have access to an authenticated account to do the stuff. A bounty program also has other limitations, because a pentesting company gets a contract and can get much more access, like doing a white box test where they know the code and can work through it to prove there is an exploitable issue.
There are as many taxonomies of security services as there are companies selling them. You have to be very specific about what you want and then read the contract carefully.
Real penetration tests provide valuable insight that a bug bounty program won't.
This is in no way related to a bug bounty program.
I think it's more accurate to say Bug Bounty only covers a small subset of penetration testing (mainly in that escalation and internal pivoting are against the BB policy of most companies).
That certainly helps.
Edit: thanks to everyone who wrote insightful responses, and there are indeed many. Faith in Hacker News restored!
If nothing else, an obviously wrong take is a nice setup for a correction.
Security through obscurity helps only until someone gets curious/determined. I have a personal anecdote for that. During university I was involved in pentesting an industrial control system (not in an industrial context, but same technology) and implemented a simple MITM attack to change the state of the controls while displaying the operator-selected state. When talking with the responsible parties, they just assumed that the required niche knowledge meant the attack was not feasible. I had the first dummy implementation set up on the train ride home, based only on network captures. It took another day to fine-tune once I got my hands on a proper setup, and it worked fine after that.
I do not want to say that ModbusTCP is in the same league as MML, but if there is interest in it, someone will figure it out. Sure, you might not be on Shodan, but are the standard/scripted attacks really what you should worry about? Also don't underestimate a curious kid who nerdsnipes themself into figuring that stuff out.
Absolutely. It just weeds out the skiddies and tools like Metasploit, unless they have added mainframe support. I have not kept up with their libraries.
The federal agencies I was liaison to knew all the commands better than I did and even taught me a few that were not in my documentation which led to a discussion with the mainframe developers.
No they won't.
'Dial up' modems need a PSTN line to work. The roll out of full fibre networks means analogue PSTN is going the way of the dodo. You cannot get a new PSTN line anymore in Blighty. In Estonia and the Netherlands (IIRC) the PSTN switch off is already complete.
The cable company here (US) still sells service that has POTS over cable modem. Just plug your modem into the cable modem's telephone port and you have a dial tone. Now, are you getting super high speed connections? No, but that's not what you need for most hacking like this. Not that I recommend hacking from your own house.
To your point, I am sure some day the US will stop selling access to the PSTN, but some old systems will hold on for dear life, government contracts and all. Governments are kind of slow to migrate to newer things.
You need to align their incentives with yours: wait until it gets windy out, knock the poles down, and demand that they come fix it.
It was likely meant to be a temporary means for the system architect to monitor and improve the system after it was deployed, but then life-changing circumstances may have distracted him from decommissioning the modem. The movie still holds up today and is worth a watch. Actually it may be more pertinent now than ever.
I think what makes it likable for me is that it's all on the cusp of believability. Obviously LLMs weren't quite mature enough to do everything Joshua did back then (and probably not now), but the fact that the "hacking" was basically just social engineering, and was just achieved by wardialing and a bit of creative thinking makes it somewhat charming, even today.
With the advent of LLMs being used increasingly for everyone, I do wonder how close we're going to get to some kind of "Global Thermonuclear War" simulation gone awry.