AWS Outperforms GCP in the 2018 Cloud Report (cockroachlabs.com)

214 points by awoods187 7 years ago | 98 comments

sethvargo 7 years ago |

Hey all - Seth from Google here.

Thank you to the authors who worked on this report. These types of reports help us better understand the ways in which our customers and partners utilize our platform. Our team is reviewing the report and will provide a response as we conduct our own benchmarks.

Varying factors impact these types of benchmark analyses, many of which are difficult to isolate and control. As an example, I refer to some of the benchmarks others have posted in this very thread.

As a technical practitioner, I'm positive there are areas in which cloud A outperforms cloud B and vice versa, giving users choice and flexibility. As an employee of Google, I can assure you that we are committed to providing best in class performance and availability on our platform.

Thank you for your patience as we review these findings and craft our responses.

peferron 7 years ago | |

Where do you intend to publish your response? GCP blog [1]? I'd like to make sure I won't miss it, and this HN thread may be buried by the time you come up with a response.

[1] https://cloud.google.com/blog/

sethvargo 7 years ago | | |

Hi there - sorry for the delayed response. I'm still working on getting an answer here, but I didn't want to give the impression that I was ignoring the question. My suspicion is that we'll either work with the original authors or publish on the GCP blog, but I can't confirm any of those options at this time.

cybernoodles 7 years ago | | |

bump

dekhn 7 years ago | |

To me, the most interesting part is the large variance in networking performance on GCP, while AWS networking is solid (and roughly in line with what you'd see on a good 10 Gbit network). High variance in networking performance is far worse than a low mean, but in this case the GCP mean is much lower and the variance is higher as well.

bradknowles 7 years ago | |

TIL that Seth Vargo is now working at Google.

Dude, I sure hope they deserve having you on board. I am not at all convinced that they are capable of understanding just how much you bring to the table.

My opinion of them has definitely gone up a few notches.

socceroos 7 years ago | |

Where will we find the responses?

lallysingh 7 years ago | | |

Probably on the front page of HN.

coherentpony 7 years ago | |

If you review the findings and then respond with, "Actually, GCP outperforms AWS," how many pinches of salt should we take your response with?

sam0x17 7 years ago | | |

Yeah, again, all this really shows is that instance types don't translate well across cloud platforms comparison-wise because they are strictly different. See other posts.

brian_cunnie 7 years ago |

I've also benchmarked GCP vs. AWS [0], and, for the tests that I ran, found that GCP outperformed AWS by a factor of 3:1. Specifically, a GCP instance n1-highcpu-8 with a 256GB pd-ssd disk, clocked in at 11,728 IOPS vs an AWS c4.xlarge with a 256GB gp2 disk, clocking in at 3,634 IOPS.

To put that in context of the blog post, it means your setup can drastically affect your results. Using local NVMe disk, for example, yields excellent results at the expense of increased risk. Also, AWS's io1 disk is very expensive—after my first io1 bill from AWS, I never used that disk type again.

[0] http://engineering.pivotal.io/post/gobonniego_results/
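As a quick check of the "factor of 3:1" figure, the ratio can be computed directly from the two IOPS numbers quoted above. This is only arithmetic on the comment's own numbers; the variable names are illustrative.

```python
# IOPS figures as quoted in the comment above.
gcp_iops = 11_728  # n1-highcpu-8 with a 256GB pd-ssd disk
aws_iops = 3_634   # c4.xlarge with a 256GB gp2 disk

ratio = gcp_iops / aws_iops
print(f"GCP/AWS IOPS ratio: {ratio:.2f}")
```

The result is a little over 3, consistent with the rounded 3:1 claim for this particular setup.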

nodesocket 7 years ago |

Developer experience on GCP is vastly superior to AWS.

- Pricing on GCP is much easier: no need to purchase reserved instances or figure out all the details and buried AWS billing rules. Run your GCP instances and automatically get discounts. AWS reserved instances require knowing your instance types, knowing that you can purchase the smallest type of an instance class and combine, and knowing that you can only purchase 20 reserved instances per zone/per region in an account. So many gotchas.

- GCP projects by default span all regions. It is much easier if you run multiple regions, since all services can communicate with all regions. Multi-region in AWS is sort of a nightmare: setting up VPC peering, can't reference security groups across regions, etc.

- Custom machine types. With GCE, you simply select the number of cores you need and memory. No trying to decipher the crazy amount of AWS instance types T2, T3, M5, M5a, R5, R5a, C5, C5n, I3...

- Instance attached block storage is easier to grok and in my experience is much faster than EBS. The bigger the disk on GCE, the more IOPS. No provisioned IOPS madness.

WestCoastJustin 7 years ago |

There might be a networking cap & disk I/O issues with the instances you picked on GCP vs AWS.

The GCP instance has 8 Gbps vs 10 Gbps for AWS. I don't really know without seeing the graphs from the instances, if you hit a cap, but this could make a difference in both transfer speeds and latency #'s for GCP. Also, for your local disk test, on GCP, disk size makes a difference to get the best performance. The larger the disk, the better the performance. PD disk read/write performance also comes out of the available network bandwidth! So, the instance you picked on GCP was at a disadvantage right from the start [3]. This likely explains the I/O Experiment graph and the "67x difference in throughput" as you're likely hitting caps, both in terms of network bandwidth, and disk performance compared to AWS. Seeing anything where it is x67 difference is a pretty big red flag that something strange is going on and needs further investigation.

GCP's n1-standard-16 = 8 Gbps max [1]

AWS's c5d.4xlarge = 10 Gbps max [2]

I guess the problem with comparing clouds is that it is never apples vs apples, and I don't fault you for picking what you did (as it is not obvious). GCP typically gives you (core count / 2) = # Gbps network bandwidth. A good followup to your comparison might be to investigate why the #'s are different. Does adding more CPUs, memory, or network bandwidth increase performance?

[1] https://cloud.google.com/blog/products/gcp/5-steps-to-better... (see section #3).

[2] https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-inst...

[3] https://cloud.google.com/compute/docs/disks/performance#size... (see the table re: disk size to bandwidth)
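The rule of thumb in the comment above can be sketched as a small calculation. This assumes the commenter's "(core count / 2)" approximation, which is not an official formula, and the function names are made up for illustration.

```python
def gcp_bandwidth_rule_of_thumb_gbps(vcpus: int) -> float:
    """Commenter's approximation: (core count / 2) Gbps. Not an official formula."""
    return vcpus / 2


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


gcp_cap = gcp_bandwidth_rule_of_thumb_gbps(16)  # n1-standard-16 has 16 vCPUs
aws_cap = 10.0  # c5d.4xlarge figure quoted in the comment

print(f"GCP cap: {gcp_cap} Gbps, about {gbps_to_mb_per_s(gcp_cap)} MB/s")
print(f"GCP cap as a fraction of the AWS cap: {gcp_cap / aws_cap}")
```

Under that approximation the n1-standard-16 lands on the 8 Gbps figure cited above, and since PD traffic shares that budget, the ceiling for disk throughput is lower still.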

raboukhalil 7 years ago |

Choosing a cloud provider is about more than just performance. For me, I lean towards GCP because of the combination of awesome UX, custom VM configurations (which also means GPUs attached to a custom # CPUs), no-bidding spot instances, and the multi-regional cloud storage offering that replicates data across many regions for much cheaper than AWS.

I wrote about this last year if you're curious: https://medium.com/@robaboukhalil/a-tale-of-two-clouds-amazo...

ravedave5 7 years ago | |

AWS API gateway is one of the worst UIs I've ever used in my life. Lambda isn't great either. That said I love AWS, but I haven't used GCP much.

yovagoyu 7 years ago | |

Yeah, because who cares about performance when you have a pretty UX?

hueving 7 years ago | | |

You're confusing UX with UI. You can have an excellent UX that's still just a CLI.

null000 7 years ago |

Honestly I would have expected performance on a cloud provider to be measured in x per $. You're pretty much renting everything, the hardware etc is really difficult to compare, and you can usually throw more machines at the problem anyway, so measuring a single machine vs a single machine doesn't make much sense if the two might cost vastly different amounts.

planckscnst 7 years ago |

> At first glance it appears that GCP has a tighter latency spread (when compared to the network throughput) centered on 0.2 ms

Comparing a distribution of two completely different metrics is not particularly meaningful. You can change the histogram buckets to a different size and it will look just as spread out. When I first read it, I thought it was saying GCP has a tighter spread than AWS. This was confusing, especially since the chart immediately above that seems to have AWS and GCP numbers flipped.
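The point about bucket size can be illustrated with a minimal sketch. The latency samples below are hypothetical, not taken from the report, and are stored as integer microseconds to avoid floating-point bucketing surprises.

```python
def occupied_buckets(samples_us, width_us):
    """Count how many histogram buckets of the given width contain a sample."""
    return len({sample // width_us for sample in samples_us})


# Hypothetical ping latencies in microseconds, clustered around 0.2 ms.
latencies_us = [180, 190, 195, 200, 200, 205, 210, 220, 230]

print(occupied_buckets(latencies_us, 100))  # coarse 0.1 ms buckets
print(occupied_buckets(latencies_us, 10))   # fine 0.01 ms buckets
```

The same samples fill two buckets at the coarse width and six at the fine width, so how "tight" a chart looks depends heavily on the bucket size chosen.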

gamegoblin 7 years ago | |

I can't find the text you quoted in the article. Did they edit it out?

planckscnst 7 years ago | | |

Yes, it looks like the text was edited, but the table is still incorrect.

Rafuino 7 years ago |

Hmm, it'd take forever to dig into the respective documentation at AWS and GCP, but from a quick look the CPU frequency alone is quite different (3.0 GHz for AWS's Xeon Scalable and 2.0 GHz for GCP's Xeon Scalable), and we don't know anything about CPU cache sizes, etc. That's problematic to start. Then we have very little info to go by on the underlying storage performance. More problematic, I don't know how large the working data set is for TPC-C (i.e. how many warehouses are being simulated?), so I can't tell how much of the storage is being used. I assume it's larger than the 60GB of DRAM offered on the GCP instance (thus spilling into the storage), but with the CPU differences and unknown storage performance, I don't know what to make of this report.

verdverm 7 years ago |

Is there anything open source so I can reproduce these results?

awoods187 7 years ago | |

All of the benchmarks we used to test are open source.

TPC-C: https://www.cockroachlabs.com/docs/stable/performance-benchm...
Sysbench: https://github.com/akopytov/sysbench
Stress-ng: https://kernel.ubuntu.com/~cking/stress-ng/
iPerf: https://github.com/esnet/iperf
PING: https://linux.die.net/man/8/ping

verdverm 7 years ago | | |

Looking for the experimental settings as it would be difficult to reproduce without detail. Could you post the scripts to GitHub?

planckscnst 7 years ago | |

The report states they are using iperf, ping, stress-ng, and sysbench; these are all open-source. There might not be enough details to reproduce exactly, but I think there is enough there that you should be able to produce similar results.

kyrra 7 years ago |

Is this table labeled wrong?

https://d33wubrfki0l68.cloudfront.net/f61cd6683f5c13f8d2b506...

All the text around it says GCP is worse, but the table shows GCP is better.

awoods187 7 years ago | |

The labels are incorrect (we combined the charts in the final draft)--fix incoming. GCP was much worse than AWS on this test.

kyrra 7 years ago | | |

Looks fixed now. Thanks.

derefr 7 years ago |

> What about network throughput variance? On AWS, the variance is only 0.006 GB/sec. This means that the GCP network throughput is 81x more variable when compared to AWS.

What would cause this particular effect? It's very interesting.

Is it, perhaps, that with GCP you're hitting the capacity of the network, while with AWS you're being artificially capped at that speed on a network that could theoretically go faster?

Or maybe it's just different strategies for bandwidth-limiting instances employed by AWS's SDN layer vs. GCP's? Probabilistic packet-drop (to force TCP window scaling) vs. artificially-induced nanosecond-scale egress latencies?

wmf 7 years ago | |

GCP Andromeda is software-based network virtualization which tends to have lower performance and higher performance variability. https://www.usenix.org/node/211244

AWS Nitro/ENA is hardware network virtualization which is faster and more consistent.

riking 7 years ago |

Did anyone consider comparing the cost of this vs. Cloud Spanner?

Because CockroachDB has the explicit inspiration of being Google's Spanner without the special hardware, so... why not just use Spanner with the special hardware instead?

nogbit 7 years ago |

Launch EKS on AWS, wait 20-30min. Launch GKE on GCP, wait 5min.

iamgopal 7 years ago |

So, in conclusion: when using CockroachDB, with 32GB of RAM instead of 60GB, and with a different throughput setting, you may consider AWS, as it provides slightly better cost because it uses fewer resources. And since there are no other big players in the market, we will test only two of them, and our only choice will be AWS.

planckscnst 7 years ago |

I wonder if they will consider expanding the cloud report to things that are not necessarily relevant to Cockroach Labs' use case. In my work, latency to the block store (especially outliers) is very relevant. It would be great to see latency distributions of various workloads (sequential/random and read/write/mixed) to the platform's distributed block store.

edit: I missed that 95th percentile latency for read and write was included. That is helpful; I also typically look at p99.99 and p100.

londons_explore 7 years ago | |

You can trade off tail latency vs cost yourself easily.

For example, you could have two copies of your data in the block store, and issue reads to both simultaneously, and use whichever returns first. Suddenly, your 99% latency becomes your 99.99% latency...

You can do the same for writes (albeit a bit more complex).
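A minimal sketch of the duplicated-read reasoning above, assuming the two copies are slow independently of each other (real replicas often share failure modes, so this is an optimistic bound). The function names and the 1% slow-read probability are illustrative.

```python
import random


def hedged_slow_probability(p_slow: float, copies: int = 2) -> float:
    """Chance that every one of `copies` independent reads is slow."""
    return p_slow ** copies


def simulate_hedged_reads(trials: int = 200_000, p_slow: float = 0.01, seed: int = 0) -> float:
    """Monte Carlo estimate: a hedged read is slow only if both replicas are slow."""
    rng = random.Random(seed)
    slow = 0
    for _ in range(trials):
        if rng.random() < p_slow and rng.random() < p_slow:
            slow += 1
    return slow / trials


print(hedged_slow_probability(0.01))  # analytic result for two copies
print(simulate_hedged_reads())        # simulated estimate, should be close
```

With a 1% chance of a slow single read, both copies are slow about 0.01% of the time, which is why the single-read p99 threshold behaves roughly like a p99.99 for the hedged read. The price is double the read traffic and storage.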

planckscnst 7 years ago | | |

Yes, the cost/latency trade-off is exactly why this can be interesting. Significantly fewer or less-impactful outliers can save a lot of money (or allow a better SLA with the same cost).

awoods187 7 years ago | |

We can consider expanding this in our testing next year! Glad you found it helpful

ernsheong 7 years ago |

Most small companies don't reserve for 3 years. The on-demand monthly cost is 388 (GCP) vs 562 (AWS) per instance for the instances in the report (omitting SSD costs).
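One way to read these on-demand prices alongside the performance numbers is to ask how much faster the pricier instance must be to break even per unit of cost. The sketch below uses only the two monthly figures quoted above; the helper name and the normalized throughput of 1.0 are assumptions for illustration.

```python
def perf_per_cost(throughput: float, monthly_cost: float) -> float:
    """Throughput delivered per unit of monthly cost."""
    return throughput / monthly_cost


gcp_monthly = 388  # on-demand monthly cost per instance, from the comment
aws_monthly = 562  # on-demand monthly cost per instance, from the comment

# Throughput AWS needs, relative to GCP = 1.0, to match GCP per unit of cost.
break_even = aws_monthly / gcp_monthly
print(f"AWS must be {break_even:.2f}x as fast to match GCP per unit of cost")
```

At on-demand prices, the AWS instance would need roughly 45% more throughput than the GCP instance before it comes out ahead on performance per cost.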

ernsheong 7 years ago | |

Furthermore, GCP encrypts its disks by default. So I'd expect a slight degradation of performance.

kharms 7 years ago |

For someone who has worked with both: which offers a better developer onboarding experience, for small web/data apps? Ease of learning and use wise.

kerng 7 years ago |

How does this compare to the number 2 in the cloud, Microsoft Azure?

elmo1788 7 years ago |

Very promising results. Look forward to seeing what’s in store for 2019!

ta_271828 7 years ago |

This is interesting given that I heard on the grapevine that some major cloud players are actually using AWS on the back-end even though they are advertising as, say, "GCP"... wonder if anyone can confirm or deny...

WestCoastJustin 7 years ago | |

By nature large companies have massive global teams and there is no single provider for anything. Team A could be using AWS, while team B could be using GCP, and team C is using Azure. Just because team B says they are using GCP doesn't mean the others are lying, or that there is anything weird going on.

ta_271828 7 years ago | | |

Actually I meant to say that the news on the grapevine is that the big cloud players (GCP etc.) are potentially outsourcing demand for cloud services in excess of their capacity to AWS...

ernsheong 7 years ago | |

Absolutely not for GCP. Heroku however, uses AWS.

Size      Random IOPS                   Throughput limit (MB/s)
375GB     169,987 (r)  90,000 (w)       663 (r)  352 (w)
750GB     339,975 (r)  180,000 (w)      1,327 (r)  705 (w)
1125GB    509,962 (r)  270,000 (w)      1,991 (r)  1,057 (w)
1500GB    679,950 (r)  360,000 (w)      2,650 (r)  1,400 (w)