How I shrunk a Docker image by 98.8% – featuring fanotify (blog.jtlebi.fr)
https://fgiesen.wordpress.com/2012/04/08/metaprogramming-for...
You could argue that this can still be fooled by, e.g., making the software dlopen the argument given to it, at which point that code path would have different dependencies each time it was hit, but that argument quickly devolves. By the same logic, running `ls /tmp/file` makes `/tmp/file` a dependency of ls, and thus I must include every file in the image or else it will behave differently.
I think intelligent fuzzing + high branch coverage can prove that you have found all required files.
The other interesting thing to try, if your app's problem isn't so much library-dependencies but instead Unix shell dependencies, is to use a Busybox base image. Apps whose runtimes are already sandboxed VMs, especially, usually work great under Busybox: the JVM, Erlang's BEAM VM, etc.
Then why would you start out with a complete extra operating system in there? Why not just put the application and its dependencies in there?
Stripping non-dependencies from a complete operating system sounds like a very failure-prone way to accomplish almost the same thing. You really need to execute all code paths, which is difficult to guarantee (did you really run your application in all locales, for example?).
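For the "just put the application and its dependencies in there" route, the shared-library part of the list can be read off `ldd`. A rough sketch, assuming a dynamically linked ELF binary on Linux; `parse_ldd` and `shared_libs` are names made up for this example. It only sees link-time libraries, so dlopen'd files, config files and locale data still have to be found some other way, and `ldd` should not be pointed at untrusted binaries.

```python
import subprocess


def parse_ldd(output):
    """Extract absolute library paths from the text printed by `ldd`."""
    paths = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            # "libc.so.6 => /lib/.../libc.so.6 (0x...)" or "libfoo.so => not found"
            target = line.split("=>", 1)[1].strip()
            if target.startswith("/"):
                paths.append(target.split(" (")[0])
        elif line.startswith("/"):
            # The dynamic loader is listed by absolute path, without "=>"
            paths.append(line.split(" (")[0])
        # Anything else (e.g. linux-vdso) has no file on disk and is skipped
    return paths


def shared_libs(binary):
    """Run `ldd` on a binary and return the libraries it resolves to."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd(result.stdout)
```

Something like `shared_libs("/bin/ls")` then gives the list of files to copy next to the binary in the new root.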
Packaging is hard. Let's go shopping!
1. "run-time dependencies" — package B needs package A installed because a binary from B actually makes use of a file from A when it runs.
2. "install-time dependencies" — package B needs package A installed because B is effectively a "plugin" for A. B is theoretically useless to the OS, except when used in the context of a sane A-like environment. This usually also implies that B, when installing itself, will run a script provided by A, usually to register itself in a database that A owns. This doesn't at all imply, though, that you couldn't just directly call the binary contained in the A package for a useful effect.
3. "asynchronous/maintenance-time dependencies" — package B needs package A because B does something to increase the system's entropy, and is written to assume that the system will compensate for this by having A running.
Docker images really only need type-1 dependencies, but as you dig toward the core of a package dependency graph, you start to see a lot more of type-2 and type-3 dependencies. If you execute a "debootstrap --variant=minbase", pretty much everything in there is there for type-2 or type-3 reasons.
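To make the distinction concrete, here is a toy sketch of keeping only the type-1 closure. Everything in it is hypothetical: the package names, the edges, and the "run" / "install" / "maintenance" labels are invented for illustration, since real package metadata does not tag dependencies this way.

```python
from collections import deque

# Hypothetical graph: package -> list of (dependency, kind).
# The kinds mirror the three categories above and are an assumption;
# apt/yum metadata carries no such labels.
DEPS = {
    "app": [("runtime", "run"), ("libcrypto", "run")],
    "runtime": [("libc", "run"), ("registry-tool", "install")],
    "libcrypto": [("libc", "run")],
    "libc": [("locale-data", "maintenance")],
    "registry-tool": [("script-interp", "run")],
    "locale-data": [],
    "script-interp": [("libc", "run")],
}


def runtime_closure(root, deps):
    """Breadth-first walk that follows only run-time ("type-1") edges."""
    seen = {root}
    queue = deque([root])
    while queue:
        pkg = queue.popleft()
        for dep, kind in deps.get(pkg, []):
            if kind == "run" and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


print(sorted(runtime_closure("app", DEPS)))
```

Note that `script-interp` drops out even though it is a run-time dependency of `registry-tool`, because `registry-tool` itself is only reachable through an install-time edge.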
A Docker container doesn't need to be a maintainable or autonomous OS distribution. It doesn't need grub, it doesn't need mkfs or fsck, it doesn't need mkinitramfs or the HAL hwdb; it doesn't need localegen, or debconf, or even apt itself. It needs to be a baked, static collection of files related to the application's run-time needs. But there's no demand you can make of apt or yum or even debootstrap that will spit out such a thing.
There was a project somewhat in this vein a long time ago, for embedded systems, called "Emdebian Baked"[1]. It was a misstep, I think, because it focused on creating variants of packages and a secondary dependency graph, rather than being a transformation one could apply to existing packages and the existing graph.
I've worked on and off on creating a transformation tool—effectively, a combination of a dependency graph "patch" that contains empty virtual-packages for many essential-package dependencies, a file filter/blacklist, and a final package whose installation burns away the whole package-management infrastructure from the chroot this is executing in. I haven't been happy with any of the results yet, though. Would anyone be interested in collaborating on such a thing as an open-source project?
Anyhow, even a large-ish application such as Oracle or a control system doesn't actually use ping or dd or troff, or most of what a modern Unix OS is composed of. Most suid binaries are unnecessary, and leaving them out at least decreases the attack surface.
Most web apps probably need nothing unix-ish at all. A chrooted PHP app mounted noexec makes me sleep better than one running in a complete operating system. And most server-side Java apps reinvent everything Unix anyway, from mail processing to cron jobs, so they generally don't shell out as often as you'd think.
So I would argue it's actually pretty common that your applications have a limited set of dependencies. Especially compared to the hundreds of packages in any minimal modern unix install.
<arbitrary code that cannot open foo.txt>
do something with foo.txt
This will use foo.txt iff said code halts. You can, however, prove that you've found a superset of the required files for an arbitrary binary. Or prove that you've found the required files for some, but not all, arbitrary binaries.
You cannot say you haven't found all the dependencies, but you can say you have found all the dependencies (given the constraints I placed above).
The halting problem only says that there is no general procedure that decides, for every program, whether it will halt.
However, you can prove a specific program halts if, in fact, that program halts.
The original question was not "can you prove that you can find the dependencies for an arbitrary binary", but "can you prove that all the dependencies were found for a single specific binary".
For some program that has an infinite loop you can say "I don't know if I've found everything", but if you have shown that you have hit every code branch, as I said above, then clearly this program both halts and has had all dependencies found, excepting different behavior for user input within those already explored branches.
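In practice the "found all dependencies" claim comes down to a union over traced runs. A minimal sketch, assuming you already have one list of opened paths per run (from fanotify, strace, or whatever you use); the function names and the example paths are made up here. The result is only as complete as the set of runs behind it, which is exactly where the branch-coverage argument comes in.

```python
def required_files(runs):
    """Union of every path that was opened in at least one traced run."""
    needed = set()
    for opened in runs:
        needed |= set(opened)
    return needed


def prunable(image_files, runs):
    """Files in the image that no traced run ever touched."""
    return set(image_files) - required_files(runs)


# Hypothetical traces from two runs exercising different branches
runs = [
    ["/bin/app", "/lib/libc.so"],
    ["/bin/app", "/etc/app.conf"],
]
image = ["/bin/app", "/lib/libc.so", "/etc/app.conf", "/usr/bin/troff"]
print(sorted(prunable(image, runs)))
```

A file only lands in `prunable` if no run touched it, so a branch you never exercised shows up as a missing file at run time rather than as an error here.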
If you can manage to get a working install of Postgres without pulling in half of Debian, I would be surprised.
But yes, on the other hand, it's perfectly possible to package some things, like the JVM, in a sort of "spread-out in a directory but equivalent to static-linked" fashion. The sort of things that tell you to "unzip them into /opt/thispkg", because they don't really follow any Unix idioms at all, tend to be surprisingly container-friendly. They come from a world where binaries are expected to be portable across systems with different versions of OS libraries available, rather than a world where each app gets to ask the OS to install whatever OS library versions it requires.
I regularly run it chrooted without problems. You do need to understand your use case, however. Things like external database utilities and backup scripts differ in requirements. Some of them are run outside the chroot, some aren't.
It's absolutely not complicated, and if you have the faintest idea what you're doing it's much easier to get right than the fanotify dance described above.
And a complete operating system in a chroot would sit mostly unused, and only increase the attack surface for no reason at all. So, why?
You mean like in this blog-post: https://blog.docker.com/2013/06/create-light-weight-docker-c...