Biggest image in the smallest space(bamsoftware.com) |
Biggest image in the smallest space(bamsoftware.com) |
I remember working with multi-megapixel images on systems with far less than 1MB of RAM, many years ago. Perhaps this is a good example of how more hardware resources can lead to them being wasted - the fact that RAM has grown so much that most images fit completely in it, has also meant programmers assuming they can do this for all images without a second thought when often all that's needed is a tiny subset of all the data.
Even if the image data is compressed, there's absolutely no need to keep all of it in memory - just decompress incrementally into a small, fixed-size buffer until you get to the "plaintext" position desired, ignoring everything before that. The fact that it's compressed also means that, with suitable algorithms, you can skip over huge spans at once - this is particularly easy to do with RLE and LZ - and the compression ratio actually boosts the speed of seeking to a specific position.
Currently, (hopefully...) no application is attempting to read entire video files into memory before processing them, but I wonder if that might change in the future as RAM becomes even bigger, and we'll start to get "video decompression bombs" instead?
Anyway, do you know some of these "better" graphics programs that actually behave this way, especially command line processing? I am interested in finding more of them.
EDIT: Okay, I have to add & admit that by "no excuse", I actually mean somewhat the opposite. ;) I mean that its possible to do streaming image processing on compressed formats, not that its trivial to do or as easy as decompressing the file in a single call. I just wish that programs would handle very large images more, and it sucks when they don't even though I know its possible. Especially programs intended for dealing with large images like Hugin. Now, I know its a PITA to tile & stream compressed formats because I've done it, but I'm sure I've written image I/O that decompresses the entire file to RAM 100x more frequently than anything that tiles and/or streams, because I've only handled tiling or streaming myself once, and it was harder. :P
Edit: skipping sections of the line to get to the region of interest is fine, I suppose, but what's really needed is a hierarchical quadtree-like organization of the storage, surely...
Now, go find an API/library for dealing with PNGs that allow you to pass in such a limit, let alone pass in a callback for dealing with violations. Go ahead. I'll wait.
(The Internet being what it is, if there is one, someone will pop up in a reply in five minutes citing it. If so, my compliments to the authors! But I think we can all agree that in general image APIs do not offer this control. In fact, in general, if you submit a patch to allow it, it would probably be rejected from most projects as unnecessarily complicating the API.)
This is the sort of thing that I mean when I say that we are so utterly buried by insecure coding practices that we can't hardly even perceive it around us. I should add this as another example in http://www.jerf.org/iri/post/2942 .
The article itself links to http://libpng.sourceforge.net/decompression_bombs.html
These new libpng versions do not impose any arbitrary limits, on the memory consumption and number of ancillary chunks, but they do allow applications to do so via the png_set_chunk_malloc_max() and png_set_chunk_cache_max() functions, respectively.
That being said, even with these limits, it's undeniable that something like ImageMagick still has a very large attack surface (especially since it uses many third-party libraries), so it should run in its own heavily unprivileged or sandboxed process.
While on vacation last week, I finally grokked that my relatively cheap Nikon camera is producing 6000x4000 images...that's about 100MB uncompressed. As a mobile app developer, I'm becoming painfully aware how images breaking the 25MB uncompressed line are breaking apps, with some still-in-use 256MB RAM iOS devices crashing when memory fills under normal usage plus a few instances of such large images (1-2 vacation photos can easily overwhelm available memory).
Not sure of Windows equivalent, Microsoft has deprecated Windows System Resource Manager, not sure of a good equivalent (except going fully to Linux).
https://github.com/python-pillow/Pillow/blob/master/PIL/PngI...
http://www.maximumcompression.com/compression_fun.php
A 24 byte file that uncompresses to 5 MB; another file with good compression under RAR but almost no compression under ZIP; and a compressed file that decompresses to itself.
How much RAM did it actually use?
""" [Note for any DHS people who have stumbled upon this site, be aware that this is a cybersecurity issue, not a physical security issue. Feel free to contact me at <glennrp at users.sourceforge.net> to discuss it.] """
See https://github.com/python-pillow/Pillow/blob/master/Tests/ch... for a quick way to generate some of these.
I think most software these days is immune to such tricks, or at least has tunables to reduce the chance of such tricks causing harm.
https://en.wikipedia.org/wiki/Zip_bomb
The billion laughs XML attack is also lovely in its simplicity.
Zip-bombing was such a problem for our corporate network in the late 1990s that inbound e-mail attachments were deliberately discarded for a while. Chaos ensured.
Swap works well for this as long as your data has good locality. huge raster images don't.
But no, I'm not aware of any software that handles stuff like that well - except the GIMPs tiling, but that's not going to help when zoomed out.
I definitely want to avoid swap at all costs and find things that are designed to tile & stream instead. The difference between GraphicsMagick resizing an image by streaming and ImageMagick resizing an image that hits swap is orders of magnitude - seconds versus hours.
I imagine that .pdf files are another avenue for mischief. They contain lots of chunks which may be compressed in varying ways.
[1]: https://bitbucket.org/tetrep/pngbomb/src/03dfc95065d78562c15...
I killed it at about 25GB memory usage, who knows how high it would have climbed otherwise.
From a legal stand-point, I'd be wary about following through with the authors suggestion of "Upload as your profile picture to some online service, try to crash their image processing scripts" without permission. Sounds like a good way of getting into trouble.
It's all good and well saying that you had good intentions, but if you can't prove it, and they didn't invite you to test it (via a responsible disclosure policy), then I would steer clear.
While I wouldn't personally attempt to prosecute anyone for responsibly disclosing a bug to me, it doesn't meant to say that BigCorp™ wouldn't.
too pressed for time, did anyone look? What is it?
This would probably be a trivial patch to bzip2. But I like the idea in general of passing an "max input/output ratio" to any process or function that might yield far more output than input.
A 420B > 5MB expansion should not be a "red flag" because there is nothing about it (including the subsequent attempt to process a 141GB uncompressed image) which cannot be handled appropriately in software. Flagging such ratio limits is arbitrary, and setting an arbitrary limit is usually a sign the software is incorrect, not the data.
Are you using PIL or pillow?
def decompress(data, maxsize=262144):
dec = zlib.decompressobj()
data = dec.decompress(data, maxsize)
if dec.unconsumed_tail:
raise ValueError("Possible zip Bomb")
del dec
return datahttps://bugs.launchpad.net/qemu/+bug/1462949
(I have a big collection of these, but most of the bugs have now been fixed in qemu)
There are people who take special interest in compression. The Hutter Prize would be a good place to find them. http://prize.hutter1.net/
It's a bit old now.
> Restrictions: Must run in <10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD
I'd be interested to see what can happen without those restrictions.
It's a 1MB file that decompresses to 261 tredecillion bytes of "Hello, World".
No terribly clever stream manipulation; it's a perfectly normal gzip file, other than the size. The generation script is here: http://sprunge.us/VhFc, but see if you can figure it out without peeking.
Interestingly enough, it's not using up any memory, but it is hitting the disk at 55MB/s, so I'm guessing it's decompressing it to disk, and will eventually crash when it fills up my hard drive.
EDIT: It gave up after fifteen minutes. That's nice, now I don't have to reboot. There was about 2 gigabytes of stuff in my temp folder when I ran disk cleanup, don't know how much of that was gzip bomb output.
curl https://brage.info/hello | od -c | head -40It should be possible to write an archive file that decompresses to a slightly different archive file. Say, the original archive file plus a zero byte. (Things get hairy because of the checksum in most archive file formats, but it still should be possible.)
Then a malicious party writes an archive file that decompresses to two slightly different archive files. And sets it up so that, say, 128 levels in iff you follow a specific path in said tree you get to the actual malicious content. You can't get to it unless you follow the specific path, and you can't find it unless you do substantially more than just naively decompressing files (as you won't be able to decompress 2129-1 files).
There is an upper-limit to how much information you can compress into a given space. (Note that we may want to write a pathological program that is very small and allocates a lot of information-free memory. But that's not decompression.)
If we accept the premise then we can look at another approach to solving this problem, once and for all! I like examining memory allocation because it's so general. But there may be another way. We can examine the input to estimate compression ratio.
The problem here is that image decompression is apparently giving strangers the ability provide an arbitrary N and say "Please loop N times and/or allocate N bits". A modern CPU is overwhelmed by an N 12 bits long or longer. This is a root cause of many problems! You know, I'm going to go out on a limb here and make a bold assertion: I assert there is a very safe upper bound on the decompression ratio, and that for any real algorithm you can indeed examine the input to determine whether N exceeds your allowable threshold. 10x might be a bit low (although I doubt it) so let's be generous and say 100x. (Which seems crazy. Nothing that I know of, not even text, compresses that well.) This means that I believe that any image format, for example, has a trivially calculable N (for example, width*height in pixels). I would argue that in the general case (unless you are doing some sort of compsci research) the image file should be related to N. That is if the image is 10 bits wide, 10 bits high, we should expect a roughly 20bit file-size.
It's ridiculous when the compression ratios are so absurd, like in this case.
TIFF has specs for tiles, strips, subfiles, layers, etc. AFIAK, hardly anyone uses those, but they certainly exist.
JPEG is also composed of 8x8 squares and is easy to stream. The API has per-scanline read & write callbacks IIRC. But you're right; a very very wide one might be a case it can't deal with.
But, no. There's a finite number of levels, and it blows up pretty quickly.
As you can see by reading that post as well, I'd also contend it really ought to be in the core API, not an optional thing that defaults to no limits. A sensible default could be imposed, too, though that turns out to be tricky to define the closer you look at it.
I'm actually about to add the same limits at my current job. It is an online slideshow builder, and most of the slides are scanned photographs. Last week someone uploaded a 11k x 17k JPG which hit us during a time of peak load and it caused quite a bit of trouble while the server was trying to build a thumbnail of it.
> I'd also contend it really ought to be in the core API, not an optional thing that defaults to no limits.
These limits really ought to be at a higher level, such as a ulimit on each apache process. In a large code base it requires too much effort to protect against every single avenue of potential accidental denial of service.
For example, at a previous job we had a lot of reports where the date range could be customized. This was fine for a long time, until a large client came along and ran his reports for the past five years. Since it was written in perl, it'd use up the RAM available on the system to generate the report, locking up the only webserver we had.
You divide the image into quarters and store each quarter as a continuous block of memory. Do this recursively.
Normally we'd index into the pixel data using pixel[x,y]. You can get Z-ordering by using pixel[interleave(x,y)] where the function interleave(x,y) interleaves the bits of the two parameters.
This works fantastically well when the image is a square power of two, and gets terrible when it's one pixel high and really wide. I think a combination of using square tiles where each one is Z-ordered is probably a useful combination.
For my ray tracer I use a single counter to scan all the pixels in an image. I feed the counter into a "deinterleave" function to split it into an X and Y coordinate before shooting a ray. That way the image is rendered in Z-order. That means better cache behavior from ray to ray and resulted in a 7-9 percent speedup from just this one thing.
Once you have data coherence, swapping is not a big deal either in applications where you're zoomed in.
That is pretty cool (also the raytracing application), but most surprising to me, is I remember the interleave operator from reading the INTERCAL[0] docs ... I never even considered the function could actually be applied to something useful :-)
[0] one of the early esoteric programming languages, designed to be "most unlike any other language". It also features a "COME FROM" statement (because GOTO is considered harmful), which iirc also actually has a parallel in some modern programming paradigm.
If you did not have time limits you could exhaustively search through Pi looking for your data, then store just the offset into Pi.
Or if the offset got too large then Pi to the power of lots of random numbers. One of those numbers will, somewhere, have your data. But it's infeasible to search for it.
There is no universal compression algorithm, by the pigeonhole principle.
And your compression scheme won't work by the simple fact that the offset of where in Pi your data lies will, on average, take the same number of bits to store as the data itself. Ditto, with pi to the power of lots of random numbers, the offset and the power will, on average, take the same number of bits to store as the data itself.