Over-provisioning of RAM is dicey, and I/O-aware placement is still a black art, but CPU is a no-brainer. I routinely find places that refuse to run anything but 1:1 vCPU to physical core ratios, or even to enable VMware DRS/HA. Mainly that's because they bought virtualization for convenience but then didn't update their capacity and ITIL processes from the 90s, where assets are pegged to a physical CPU for "regulatory" reasons and capacity is still fear-driven rather than data-driven. Or, also common, vendors of packaged or platform software ... and bad Dev teams ... love to blame virtualization for performance problems, rather than actually analyze and fix the problem. So over-provisioning becomes a political decision made by managers rather than a technical one made by the ops staff.
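To make "data-driven rather than fear-driven" concrete, here is a rough sketch of the arithmetic. The function name, the input shape, and the idea of summing per-VM peaks as a worst case are my own assumptions for illustration, not anything a particular hypervisor reports.

```python
def overcommit_report(physical_cores, vms):
    """Summarize CPU overcommit for one host.

    vms is a list of (vcpus_allocated, peak_cores_used) tuples, one per VM.
    Both the name and the input shape are hypothetical.
    """
    vcpus = sum(v for v, _ in vms)
    # Pessimistic assumption: every VM hits its observed peak at the same moment.
    worst_case_demand = sum(p for _, p in vms)
    return {
        "vcpu_to_core_ratio": vcpus / physical_cores,
        "worst_case_utilization": worst_case_demand / physical_cores,
    }


report = overcommit_report(16, [(8, 1.0), (8, 2.0), (16, 3.0), (16, 2.0)])
```

In this made-up example the host is overcommitted 3:1 on paper, yet even if every VM peaked simultaneously it would only be half busy. That is the kind of number a 1:1 policy never looks at.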
I also don't see many places just allowing "shutdown/archival" of Dev/test environments that metrics show are clearly not being used, or even having a process that tells the ops team to press a button when project funding ceases. It's obvious and simple, but politically it is "risky" because some VP's pet project having resources reclaimed makes them feel weak or something.
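The metrics side of this really is simple. A minimal sketch, assuming you already collect something like daily peak CPU utilization per environment; the function name, the 5% threshold, and the seven-sample minimum are all invented for illustration:

```python
def idle_environments(cpu_samples, threshold=0.05, min_samples=7):
    """Return environments whose CPU utilization never reached the threshold.

    cpu_samples maps an environment name to a list of utilization samples
    in the range 0.0 to 1.0. Environments with too few samples are skipped
    rather than flagged, so a brand-new environment is not reclaimed.
    """
    idle = []
    for name, samples in cpu_samples.items():
        if len(samples) >= min_samples and max(samples) < threshold:
            idle.append(name)
    return sorted(idle)


candidates = idle_environments({
    "dev-billing": [0.01, 0.02, 0.01, 0.00, 0.01, 0.02, 0.01],
    "test-search": [0.01, 0.40, 0.02, 0.01, 0.01, 0.01, 0.01],
    "dev-new": [0.00, 0.00],
})
```

The output is only a list of candidates for shutdown/archival. The hard part is the process that lets someone act on it.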
Then I find occasional data centers running widely over-provisioned at high (60%+) utilization, and life is fine, but for some reason these surveys never make it to those places. So the laggards never really find out that it's "ok" to stack VMs.
Now with container clusters like Mesos/Marathon, Lattice/CF, and Kubernetes, we are going to see some interesting behavior. A lot of companies are very uncomfortable with the whole "you don't really know/care which physical machine gets a container instance, it is fair-share scheduled as a whole". It forces them again to admit their supporting processes are antiquated.
lol. I've been predicting a backlash against this virtualization hype since 2005, and this is the first time I've heard anyone else mention anything like it.
Of course, if you had told me in 2005 that we would be switching from hypervisors back to containers, I would have broken down crying.
Is our industry run by masochists? Or just the inexperienced, who don't know any better?
But, yeah. Docker doesn't solve the "take this ancient rack of failing servers and consolidate them down to one server... without updating the software" use case that VMware is so often used for.
My experience has been that using containers to go multi-tenant leads only to misery and pain.
But it does seem a reasonable-ish way to handle packaging, though I have less experience with that use case. It does seem like it would work, assuming you still have a way to update everything, and assuming everything is happy with the same kernel.
The Google Borg paper says they use non-production batch jobs to eat the spare capacity, so you can kill them if necessary. Cloud providers could offer this as a service in theory, although they are not really architected that way.
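A toy sketch of that idea, for anyone who hasn't seen it: batch work only lands in whatever production isn't using, and gets evicted the moment production wants the cores back. This is not Borg's actual algorithm; the `Machine` class, its method names, and the "evict newest batch job first" policy are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cores: int
    prod: int = 0                               # cores held by production tasks
    batch: list = field(default_factory=list)   # (job_name, cores), preemptible

    def spare(self):
        return self.cores - self.prod - sum(c for _, c in self.batch)

    def add_batch(self, name, cores):
        """Place a preemptible batch job only if it fits in the spare capacity."""
        if cores <= self.spare():
            self.batch.append((name, cores))
            return True
        return False

    def add_prod(self, cores):
        """Admit production work, evicting the newest batch jobs first if needed."""
        if self.prod + cores > self.cores:
            raise ValueError("production demand exceeds machine capacity")
        self.prod += cores
        evicted = []
        while self.spare() < 0:
            evicted.append(self.batch.pop()[0])
        return evicted


m = Machine(cores=16, prod=8)
m.add_batch("reindex", 4)
m.add_batch("log-crunch", 4)          # machine is now fully used
evicted = m.add_prod(6)               # production grows, batch gets killed
```

The machine sits at 100% allocation while production is quiet, and production never waits for batch to finish. That only works if the batch jobs are written to tolerate being killed.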
We need staging servers and redundant backups, so getting really high utilization is not possible, but I hope to see a lot of improvement.
The big companies seem to be doing things better. At Google, I had a bit of angst running 10k processor jobs but they do use solar, set up data centers near hydroelectric sources, etc. Same as Amazon, Microsoft, etc.
The difference is in process. Google's approach to workload placement is automated by software, driven by engineering decisions and data.
Many IT shops' placement is political (new servers = new capital = power).
What every IT shop wants doesn't necessarily relate in any straightforward way to what any IT shop invests resources in getting. Every IT shop prioritizes many other things above utilization (and is right to do so).
All decisions, engineering or otherwise, are political. Different environments involve different politics, but it's all still politics.
CEO: "I see that we had 99.994% uptime for the last six months, and we came in very close to the forecasted budget. Well done, engineers!"
CEO: "I see that we had 99.9% efficient usage for the last six months, and we reduced our budget. Well done, engineers!"
Neither scenario is realistic, of course. Uptime is nice and efficiency is nice and budgets are nice, but what the CEO is actually interested in is:
VP Customer Support: "Our satisfaction rate is up, call quality metrics are great. I looked over the call stats and it looks like we're no longer getting complaints about performance or unreachability."
One bank IT group I know reports on uptime to their business partners relative to operating expense, prints the charts and graphs on plotter paper weekly, and posts them in the cafeteria. Most of their bonus is directly tied to those numbers. So, "cut costs and keep me up".
Delivery IT groups are very rarely measured by customer satisfaction; they're measured by project and budget performance to baseline (on time, on budget, etc). Customer sat is the responsibility of the business partners that drive the requirements, programs, etc.
Is this effective? Not really. If they recognized Lean product development principles they'd incentivize everything by end-to-end cost of delay first, and risk reduction second.
That said, VMware did basically invent x86 virtualization as we know it today, and that's justified the many billions in wealth it has generated to date. Docker is (so far) a registry and a CLI wrapper around a Linux kernel feature. It can and will be more, but it's not clear what.
Google decided early on to drive towards an operational architecture that allows individuals to act at scale on their infrastructure. A developer deploys into production, it launches thousands of new containers and disposes thousands of old containers. A batch job is run, same thing. Deploying services is uniform across the board. Thus, optimizing utilization through improved container scheduling is something that the core site reliability engineering team could do independently of individual services.
Google's early adoption of data center sized computing by Hölzle & team was unique, along with Amazon's CEO-diktat move to decentralized service-oriented architecture, or Netflix's rewrite and move to cloud. Which is why you have articles like this, written by a VC, that want to repackage this thinking and sell it back to old school IT.
But is it known that they actually prioritized that, or was there perhaps more interest in optimizing the efficiency of deploying thousands of containers on every deploy, across data centers, with reliable testing, without killing in-flight processing, and scaling for subsecond response to bursty demand? Who sets the priorities for what is most important, and how much of one they're willing to sacrifice to improve physical utilization?
I have absolutely no doubt they had as many resources as any other company dedicated to finely tuning their data centers and related infrastructure. I question whether they had the same motivation as a company like Amazon (which was deriving direct profit from selling this resource) to prioritize the optimization of utilization.