Surprising Scalability of Multitenancy(brooker.co.za) |
That’s why your profit margins on the AWS servers are lower than on self-hosted ones, but at least you’re making money.
Yes, multi-tenancy and improved hw utilization can save money ... for Amazon. That's of no use if they lack sufficient competition and just capture the savings as profits. Then you're just wasting time debugging weird contention issues and paying cloud cost optimization consultants so Bezos can get richer.
The profit margins on AWS are so huge that even though they can bin-pack better, it often doesn't matter: you're still going to save money by going to a cheaper cloud or using your own HW (or renting your own dedicated HW). The savings from multi-tenancy are drowned out by the added costs.
One intriguing model that might be worth exploring is micro-clouds. In that model there's a kind of clearing market, and users with strong diurnal cycles and not many batch jobs can re-sell their CPU capacity at night to other users. They just implement some Lambda-ish API and configure the kernels/hypervisors to always prioritize their own jobs over guests. The guests don't care because they're getting the resources cheap; for the company, the additional income offsets the cost of its own machines; and the market takes a cut. The difference vs. today's cloud models is that it's more decentralized and the "cloud provider" is really just a matchmaker, so it's easy to set up competitors and margins would be low.
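The priority rule described above (the owner's jobs always come first, guests only get leftover capacity) can be sketched as a tiny allocator. This is a hypothetical illustration: `allocate`, `Allocation`, and the 64-core diurnal demand numbers are made up, and a real setup would enforce the priority in the kernel or hypervisor scheduler rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    owner_cores: int
    guest_cores: int
    idle_cores: int


def allocate(total_cores: int, owner_demand: int, guest_demand: int) -> Allocation:
    """Give the machine's owner first claim; guests only get what is left over."""
    owner = min(owner_demand, total_cores)
    leftover = total_cores - owner
    guest = min(guest_demand, leftover)
    return Allocation(owner, guest, leftover - guest)


# Hypothetical diurnal profile for one 64-core machine: quiet at night, busy by day.
owner_by_hour = [8] * 7 + [56] * 12 + [8] * 5  # 24 hourly samples
guest_demand = 40  # hypothetical constant demand coming from the clearing market

sold_core_hours = sum(
    allocate(64, demand, guest_demand).guest_cores for demand in owner_by_hour
)
print(sold_core_hours)  # 576
```

Most of the resold core-hours come from the 12 quiet night hours (40 cores each) rather than the busy daytime hours (8 cores each), which is the diurnal effect the model relies on.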
BTW modern CPUs support the creation of RAM-encrypted VMs with remote attestation, so you can lower the trust needed in the targets by a lot. That said, there are lots of companies that are known quantities, have verifiable brands and may even be considered more trustworthy than the big clouds in some cases because they're local firms.
Spectre and related attacks already reduced CPU performance.
Shared hardware opens up the door for side channel attacks and hardening against those attacks is going to decrease performance.
AWS serverless, by the way, uses VM isolation.
I guess whether you consider someone like that to be a "serious person" or a "charlatan" depends on your own point of reference.
In that case, their arguments were more persuasive to management than mine were. I found the experience baffling.
No hate. Was (is) a frustrating experience for me too.
Amazon doesn’t overcommit CPU for normal VM instance types.
I can understand that may not be a surprise to you. What's surprising to me is that you took the time to come say you aren't surprised, instead of going on with your day.
Clearly, I shouldn't have claimed this casual blog post was original research that had never been seen in any form before. Silly me!
I mean its implementation in computer systems.
> What's surprising here isn't that time slicing works, it's that the same mechanism drives both the economics of large systems, and their ability to economically support bursty workloads.
That's not surprising.
> Clearly, I shouldn't have claimed this casual blog post was original research that had never been seen in any form before. Silly me!
That's not the issue though, is it? That's your snarky strawman to deflect from it. The issue is that it's a lazy cliché title and it purports to be much more grandiose than it is.
(Of course, it's still a tradeoff between context switching overhead and the magnitude of the variability, but fundamentally even a tiny level of timeslicing can be a huge improvement.)
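A rough way to see why pooling bursty workloads helps is to compare the peak-to-mean ratio of one tenant against that of many tenants sharing a machine. The sketch assumes independent tenants that each need one core in 10% of time slots (made-up numbers) and ignores context-switching overhead entirely.

```python
import random


def peak_to_mean(series):
    """How much capacity you must provision relative to average use."""
    mean = sum(series) / len(series)
    return max(series) / mean


def bursty_tenant(rng, slots, p_busy):
    # In each time slot the tenant either needs a full core (1) or nothing (0).
    return [1 if rng.random() < p_busy else 0 for _ in range(slots)]


rng = random.Random(42)  # fixed seed so the run is reproducible
slots, tenants, p_busy = 1000, 100, 0.1
loads = [bursty_tenant(rng, slots, p_busy) for _ in range(tenants)]
aggregate = [sum(slot) for slot in zip(*loads)]  # total demand per time slot

print(f"one tenant:  peak/mean = {peak_to_mean(loads[0]):.1f}")
print(f"all tenants: peak/mean = {peak_to_mean(aggregate):.1f}")
print(f"cores needed if isolated: {tenants}, if pooled: {max(aggregate)}")
```

Sized for peak, 100 isolated tenants need 100 cores, while the pooled machine only needs enough cores for the worst time slot of the sum, which is far smaller because the bursts rarely line up.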
On the team I was on, newly-hired managers from Amazon accepted the ideas of ICs who had previously worked at Amazon, and rejected ideas from people who had not. I didn't realize this was a pattern until I already had one foot out the door, but even if I had realized it earlier it wouldn't have helped. I was ex-Google, so the ex-Amazon folks tagged every document with the bozo bit before they'd even opened it.
My takeaway was to be cautious of companies where the culture is imported through mass hiring from single companies. I joined in the middle of the "Google wave", which was relatively peaceful (per Google's culture at the time). When the "Amazon wave" arrived it was quite a shock; their culture was much more adversarial and authoritarian than anywhere I'd worked before. By the time I left, there were signs the Amazon folks were starting to get sidelined by an emerging "Oracle wave".
https://aws.amazon.com/ec2/dedicated-hosts/
https://cloud.google.com/compute/docs/nodes/sole-tenant-node...
The AWS offering is pretty much turn-key. I've not used the GCP version, but it seems to be similar if you're willing to create a separate "project" for each security domain.
Once your company has any PII and/or has regulatory obligations (PCI, HIPAA, etc.), then it's worth spending a bit extra to make sure sensitive components are running on their own hardware.
I'm sure that Marc Brooker, the author, and one of the most accomplished computer scientists currently living, will think twice before posting such pablum again
Dedicated servers are undoubtedly cheaper in some circumstances than even a well-managed AWS account. But you do need to account for redundancy (including staffing), scaling up, possibly geographic replication, etc. Setting up a dedicated server is just the beginning, so be sure to take into account all costs, as is the case on a cloud provider too, of course.
Imagine you have three Ruby services, where each is allocated 10 cores of CPU time (via pinning with cpuset). If you give them each a 16-core VM, then there'll be 18 cores of "wasted" CPU. If you instead bin-pack them onto a 32-core VM, then they'll have the same number of cores at a lower price point.
If each service runs at 50% capacity with 2000 ms latency during steady state, how much extra latency would you expect each service to have in the bin-packed configuration vs. running on its own VM?
My position was "very little extra latency"; the other person's position was "a lot of extra latency due to hardware contention in (for example) the memory controller".
(If you're reading this and thinking "NUMA node locality", then you're operating two or three levels above where this org was in terms of optimization.)
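The core-count arithmetic in the example can be checked with a small bin-packing sketch. It uses first-fit decreasing, a common heuristic chosen here for illustration (not necessarily what that org used), and it only models reserved cores: it says nothing about memory-controller contention, which was the actual point of disagreement. `vms_and_waste` is a hypothetical helper name.

```python
def vms_and_waste(demands, vm_cores):
    """First-fit-decreasing bin packing of per-service core reservations onto
    identical VMs. Returns (number of VMs, cores left unallocated)."""
    free = []  # remaining cores on each VM opened so far
    for d in sorted(demands, reverse=True):
        if d > vm_cores:
            raise ValueError(f"service needs {d} cores, VM only has {vm_cores}")
        for i, f in enumerate(free):
            if f >= d:  # first VM with enough room wins
                free[i] -= d
                break
        else:  # no existing VM fits: open a new one
            free.append(vm_cores - d)
    return len(free), sum(free)


print(vms_and_waste([10, 10, 10], 16))  # (3, 18)
print(vms_and_waste([10, 10, 10], 32))  # (1, 2)
```

Three 10-core services cannot share a 16-core VM, so each gets its own and 6 cores per VM sit idle; on one 32-core VM all three fit with 2 cores to spare.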
Talking about Ruby services rather than hand-optimized C kind of gives it away. And even with hand-optimized C, you would do a cost/benefit analysis of less optimal packing.
Edit: And if a group of players raids a dungeon, the population of that dungeon is strictly limited, so you can park that raid on one server and not worry at all about inter-player latency.
The bigger world is handled by slicing it up, but you still have a lot of communication going on with central databases for stuff like inventory management, chat, quests, etc. so you would probably try to keep all that within your own server racks.
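The split described above (slice the open world into zones, pin each capped-population raid to a single server) might look roughly like this. `ShardMap`, the square-grid layout and the round-robin raid assignment are all hypothetical choices for illustration; the central inventory, chat and quest databases are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ShardMap:
    """Hypothetical mapping of an MMO world onto game servers."""
    zone_size: float  # width/height of one square world zone
    grid_width: int   # zones per row of the world grid
    servers: list[str]
    raid_servers: dict[str, str] = field(default_factory=dict)

    def server_for_position(self, x: float, y: float) -> str:
        # Open world: slice it into a grid of zones and spread zones across servers.
        zone = int(y // self.zone_size) * self.grid_width + int(x // self.zone_size)
        return self.servers[zone % len(self.servers)]

    def server_for_raid(self, raid_id: str) -> str:
        # A raid has a capped population, so the whole raid is pinned to one server
        # and all inter-player traffic for it stays inside a single process.
        if raid_id not in self.raid_servers:
            idx = len(self.raid_servers) % len(self.servers)  # simple round-robin
            self.raid_servers[raid_id] = self.servers[idx]
        return self.raid_servers[raid_id]


m = ShardMap(zone_size=100.0, grid_width=10, servers=["a", "b", "c"])
print(m.server_for_position(150.0, 50.0))  # b
print(m.server_for_raid("raid-1"), m.server_for_raid("raid-1"))  # a a
```

Players crossing a zone boundary in the open world may hop servers, but everyone in a given raid always resolves to the same server for the raid's lifetime.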