The sorry state of server utilization and the post-hypervisor era (2013)

The sorry state of server utilization and the post-hypervisor era (2013) (gigaom.com)

20 points by vimes656 11 years ago | 25 comments

The article is looking at it all wrong. To solve a problem, start by looking at those who already solved it. Then see if you can apply their approach. Mainframes have long had ridiculously high utilization and throughput. The secret is their I/O architecture: computing happens on compute nodes and I/O is managed by I/O processors, both of which are well-integrated. If Intel etc. copy this, they'll get much higher utilization and throughput. Smart embedded engineers do the same thing, albeit with microcontrollers.

https://en.wikipedia.org/wiki/I/O_channel

blincoln 11 years ago |

It's 2015, not 1985. Most people are not paying IBM for every CPU cycle (used or not) on a mainframe. Should IT staff try to look "good" on a "CPU utilization" report that belongs in a history book by buying lower-end hardware, or should they spend a tiny amount of extra money to ensure that customers get good performance during peak periods?

bhousel 11 years ago |

These aren't really startling findings. Most apps in the enterprise require separate instances for development, staging, production, and a hot standby for continuation of business. And you need each of those environments for multiple tiers (db, app server, etc). And you need the entire stack replicated to each local datacenter because of latency (so the idea of having the APAC users use the database at night and the NAM users use it during the day just doesn't work in practice). So a typical business app can easily require >10 server instances, most of which will sit idle most of the time.
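To make that arithmetic concrete, here is a minimal sketch. The environments and tiers are the ones named above; the two datacenter names and the assumption that every environment is replicated to every datacenter are illustrative only, not something a given shop necessarily does.

```python
from itertools import product

# Environments and tiers named in the comment above.
ENVIRONMENTS = ["development", "staging", "production", "hot-standby"]
TIERS = ["db", "app-server"]
# Assumption for illustration: two regional datacenters.
DATACENTERS = ["APAC", "NAM"]


def count_instances(environments, tiers, datacenters):
    """Return one (datacenter, environment, tier) tuple per required server instance."""
    return list(product(datacenters, environments, tiers))


instances = count_instances(ENVIRONMENTS, TIERS, DATACENTERS)
print(len(instances))  # 2 datacenters * 4 environments * 2 tiers = 16
```

Even with only two tiers and two sites, one app lands well past 10 instances, and only the production ones see real traffic.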

parasubvert 11 years ago | |

It also reflects a very stubborn unwillingness to actually use virtualization, i.e., to collect capacity optimization metrics and to let that drive the placement of VMs with appropriate over-provisioning.

Over-provisioning of RAM is dicey, and I/O-aware placement is still a black art, but CPU is a no-brainer. I routinely find places that refuse to do anything but 1:1 vCPU-to-physical-core ratios, or even to enable VMware DRS/HA. Mainly because they bought virtualization for convenience but then didn't update their capacity and ITIL processes from the 90s, where assets are pegged to a physical CPU for "regulatory" reasons and capacity is still fear-driven rather than data-driven. Or, also common, vendors of packaged or platform software ... and bad dev teams ... love to blame virtualization for performance problems, rather than actually analyze and fix the problem. So over-provisioning becomes a political decision made by managers rather than a technical one made by the ops staff.
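A rough sketch of the kind of number I mean by data-driven CPU over-provisioning. The VM sizes, the 16-core host, and the 25% average utilization are made-up figures, and the helper names are hypothetical; real placement also has to account for peaks that coincide, which this ignores.

```python
def vcpu_overcommit_ratio(vcpus_per_vm, physical_cores):
    """Ratio of allocated vCPUs to physical cores on one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


def expected_cpu_demand(vcpus_per_vm, avg_utilization):
    """Cores busy on average, given each VM's mean utilization (0.0 to 1.0)."""
    return sum(v * u for v, u in zip(vcpus_per_vm, avg_utilization))


# Hypothetical host: six VMs totalling 32 vCPUs on 16 physical cores.
vms = [4, 4, 8, 8, 4, 4]
utilization = [0.25] * len(vms)  # assumed: each VM averages 25% busy

ratio = vcpu_overcommit_ratio(vms, physical_cores=16)
demand = expected_cpu_demand(vms, utilization)
print(ratio)   # 2.0 vCPUs per physical core
print(demand)  # 8.0 cores busy on average, half the host
```

At a strict 1:1 ratio the same VMs would need two such hosts, each averaging a quarter busy.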

I also don't see many places just allowing "shutdown/archival" of dev/test environments that the metrics show are clearly not being used, or even having a process that tells the ops team to press a button when project funding ceases. It's obvious and simple but politically it is "risky" because some VP's pet project having resources reclaimed makes them feel weak or something.

Then I find occasional data centers running widely over-provisioned with high (60%+) utilization, and life is fine, but for some reason these surveys never make it to those places. So the laggards never really find out that it's "ok" to stack VMs.

Now with container clusters like Mesos/Marathon, Lattice/CF, and Kubernetes, we are going to see some interesting behavior. A lot of companies are very uncomfortable with the whole "you don't really know/care which physical machine gets a container instance, it is fair-share scheduled as a whole". It forces them again to admit their supporting processes are antiquated.

lsc 11 years ago |

"A post-hypervisor world "

lol. I've been predicting a backlash against this virtualization hype since 2005, and this is the first time I've heard anyone else mention anything like it.

Of course, if you had told me in 2005 that we would be switching from hypervisors back to containers, I would have broken down crying.

Is our industry run by masochists? or just the inexperienced, who don't know any better?

parasubvert 11 years ago | |

This was written by a VC hoping that Docker is going to be worth more than VMware. I suspect he may be disappointed.

lsc 11 years ago | | |

A lot of the value of VMware is in the sales channels. Why use VMware rather than QEMU/KVM? It used to be that VMware came with support. But now that KVM is owned by Red Hat, which, in my experience, gives way better than average support? Yeah.

but, yeah. Docker doesn't solve the "take this ancient rack of failing servers and consolidate them down to one server... without updating the software" use case that VMware is so often used for.

otterley 11 years ago | |

To be fair, containers aren't a virtualization solution; they're more of a packaging mechanism.

lsc 11 years ago | | |

that is... a healthy way to look at it.

My experience has been that using containers to go multi-tenant leads only to misery and pain.

But it does seem a reasonable-ish way to handle packaging, though I have less experience with that use case. It does seem like it would work, assuming you still have a way to update everything, and assuming everything is happy with the same kernel.

noblethrasher 11 years ago | |

Please, expound.

justincormack 11 years ago |

If the issue is running out of memory before running out of CPU time, then containers won't help much, except to the extent that memory is over-allocated to VMs in static amounts. The solution is either larger-memory systems, which have become much more widely available since this article was written, or using less memory for applications.

vimes656 11 years ago | |

Is hypervisor memory ballooning widespread in major cloud providers these days? How does it compare to bare-metal kernel memory allocation?

justincormack 11 years ago | | |

No, it is not widespread. Underprovisioning is a bit of a dirty word too: it breaks isolation.

The Google Borg paper says they use non-production batch jobs to eat the spare capacity, so you can kill them if necessary. Cloud providers could offer this as a service in theory, although they are not really architected that way.
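A toy sketch of the idea, to show what "eat the spare and kill if necessary" means. This is not Borg's actual algorithm; the `Machine` class, the newest-first eviction order, and the core counts are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cores: int
    prod_usage: int = 0
    batch_jobs: list = field(default_factory=list)  # (name, cores) pairs

    def batch_usage(self):
        return sum(cores for _, cores in self.batch_jobs)

    def spare(self):
        return self.cores - self.prod_usage - self.batch_usage()

    def add_batch(self, name, cores):
        """Admit a batch job only if it fits in the currently spare capacity."""
        if cores <= self.spare():
            self.batch_jobs.append((name, cores))
            return True
        return False

    def set_prod_usage(self, usage):
        """Production demand changed: kill newest batch jobs until everything fits."""
        self.prod_usage = usage
        killed = []
        while self.batch_jobs and self.spare() < 0:
            killed.append(self.batch_jobs.pop()[0])
        return killed


m = Machine(cores=16, prod_usage=4)
print(m.add_batch("a", 6))   # True: 12 cores were spare
print(m.add_batch("b", 4))   # True: 6 cores were spare
print(m.add_batch("c", 4))   # False: only 2 cores left
print(m.set_prod_usage(10))  # ['b']: production grew, newest batch job is killed
```

Production never waits on batch work here, which is the point: the batch tier only ever gets capacity that would otherwise sit idle.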

mark_l_watson 11 years ago |

Looking at this as an environmental problem makes some sense. I used to rent cheap hosted servers and moved to virtualized systems like AWS, Azure, and AppEngine partially because of environmental impact and partially out of convenience.

We need staging servers and redundant backups, so getting really high utilization is not possible, but I hope to see a lot of improvement.

The big companies seem to be doing things better. At Google, I had a bit of angst running 10k processor jobs but they do use solar, set up data centers near hydroelectric sources, etc. Same as Amazon, Microsoft, etc.

PaulHoule 11 years ago |

Well, over-provisioning is good for perceived performance. Let corporate IT increase efficiency in the same ham-handed way and you might find low latency needs a new advocate.

LamaOfRuin 11 years ago |

The idea that Google was industry leading on non-batch loads in 2013 seems wrong to me. They were not selling those services then, so they did not have a positive profit motive to optimize that usage (only a motivation to cut costs, which I'm told is not nearly as effective). Amazon has had that motivation (and necessity with their non-existent margins in every other part of their business) for long enough to actually accomplish something.

parasubvert 11 years ago | |

At Google's scale, one doesn't need a lot of incentive to improve utilization. Every IT shop has wanted the cost reduction of improved utilization since the dawn of the PC era.

The difference is in process. Google's approach to workload placement is automated by software, driven by engineering decisions and data.

Many IT shops' placement is political (new servers = new capital = power).

LamaOfRuin 11 years ago | | |

At Google's scale you need much more incentive to get anything done. This is even more true when it is something that will touch every division, product, and service.

What every IT shop wants doesn't necessarily relate in any straightforward way to what any IT shop invests resources in getting. Every IT shop prioritizes many other things above utilization (and is right to do so).

All decisions, engineering or otherwise, are political. Different environments involve different politics, but it's all still politics.

cm2187 11 years ago |

But is average utilisation the right metric? The work day is only like 8-10 hours, so I would expect many corporate infrastructures to be active only during that period. Plus, you don't size your infrastructure for a typical workload; you size it to accommodate a higher-than-usual peak workload, otherwise you will be down during the busiest period.
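A back-of-the-envelope sketch of why the average looks bad even when sizing is sensible. All numbers are hypothetical: a 10-hour business day at 600 requests/s, 50 requests/s the rest of the time, and capacity sized at 1.5x the normal peak.

```python
from statistics import mean

# Hypothetical hourly load in requests/s over one day.
OFF_HOURS_LOAD = 50
BUSINESS_LOAD = 600
hourly_load = [OFF_HOURS_LOAD] * 8 + [BUSINESS_LOAD] * 10 + [OFF_HOURS_LOAD] * 6

# Assumption: capacity is sized for the normal peak plus 50% headroom.
HEADROOM = 1.5
capacity = max(hourly_load) * HEADROOM

average_utilization = mean(hourly_load) / capacity
peak_utilization = max(hourly_load) / capacity

print(round(average_utilization, 2))  # 0.31
print(round(peak_utilization, 2))     # 0.67
```

So a fleet that is about two-thirds busy at its daily peak still reports roughly 31% average utilization, before counting weekends.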

dsr_ 11 years ago |

Which of these hypothetical situations is more realistic?

CEO: "I see that we had 99.994% uptime for the last six months, and we came in very close to the forecasted budget. Well done, engineers!"

CEO: "I see that we had 99.9% efficient usage for the last six months, and we reduced our budget. Well done, engineers!"

Neither scenario is realistic, of course. Uptime is nice and efficiency is nice and budgets are nice, but what the CEO is actually interested in is:

VP Customer Support: "Our satisfaction rate is up, call quality metrics are great. I looked over the call stats and it looks like we're no longer getting complaints about performance or unreachability."

parasubvert 11 years ago | |

I've worked with the executives of some large banks, telecoms, and transportation companies. The CEO and board have generally held the IT team accountable only for budgetary performance and risk (uptime, intrusion, regulatory) metrics. The only IT impact on customer sat is uptime, by the traditional view.

One bank IT group I know reports on uptime relative to operating expense to their business partners, prints the charts and graphs on plotter paper weekly, and posts them in the cafeteria. Most of their bonus is directly tied to those numbers. So, "cut costs and keep me up".

Delivery IT groups are very rarely measured by customer satisfaction; they're measured by project and budget performance to baseline (on time, on budget, etc). Customer sat is the responsibility of the business partners that drive the requirements, programs, etc.

Is this effective? Not really. If they recognized Lean product development principles they'd incentivize everything by end-to-end cost of delay first, and risk reduction second.

otterley 11 years ago | |

Unfortunately it's usually some layer of middle management that is charged with making the infrastructure capacity planning decisions, and their performance is often measured by different metrics than those that the CEO cares about. Diverging incentives lead to absurd outcomes.