Apache Arrow Flight: A Framework for Fast Data Transport (arrow.apache.org)
I wonder which came first: the petering off of wired network hardware performance improvements, or the software bottlenecks that become obvious when we try to use today's faster networks. 100 Mb Ethernet came in 1995, gigE in 1999, and 10 gigE in 2002, gaining adoption within a few years. On that track we should have had 100 gigE in 2006 and seen it in servers in 2008 and workstations in 2010, and switches and routers should have seen terabit Ethernet in 2010. Today's servers (X) seem to be at about 25 GbE, and with multicore that's just 1-2 gigabits per core.
(X) according to https://www.supermicro.com/products/system/1U/
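To make the per-core figure concrete, here is a quick back-of-the-envelope sketch. The core counts are assumptions picked for illustration (the comment does not name any), and `per_core_gbps` is just a hypothetical helper name.

```python
def per_core_gbps(link_gbps: float, cores: int) -> float:
    """Evenly divide a link's bandwidth across CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return link_gbps / cores


if __name__ == "__main__":
    # Assumed core counts; a 25 GbE link is the figure quoted above.
    for cores in (12, 16, 24):
        print(f"{cores} cores: {per_core_gbps(25, cores):.2f} Gbps per core")
```

With somewhere between 12 and 24 cores sharing a 25 GbE link, the share per core lands in roughly the 1-2 Gbps range mentioned above.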
The same 25 Gbps claimed by the article can be achieved with a single-threaded ZeroMQ socket. That thread will be CPU bound. To break 25 Gbps, multiple I/O threads need to be engaged.
There are already network links faster than 100 Gbps, while single-core speed has stagnated for many years. Multi-threaded or multi-streamed solutions (like the one in the article) are needed to saturate modern network links.
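As a rough illustration of what "multi-streamed" means structurally, here is a minimal sketch using only the Python standard library. It is not how Flight or ZeroMQ do it: `send_over_streams` is a hypothetical helper, the streams are local socket pairs rather than real network connections, and CPython threads will not show the throughput gains discussed above. It only shows the split, send in parallel, and reassemble pattern.

```python
import socket
import threading


def send_over_streams(payload: bytes, num_streams: int) -> bytes:
    """Split payload into contiguous slices, push each slice through its own
    local socket pair on a dedicated thread, and reassemble the result."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    chunk = -(-len(payload) // num_streams)  # ceiling division
    slices = [payload[i * chunk:(i + 1) * chunk] for i in range(num_streams)]
    received = [b""] * num_streams

    def sender(sock: socket.socket, data: bytes) -> None:
        with sock:  # closing the socket signals end-of-stream to the reader
            sock.sendall(data)

    def receiver(sock: socket.socket, index: int) -> None:
        parts = []
        with sock:
            while True:
                buf = sock.recv(65536)
                if not buf:  # empty read means the sender closed its end
                    break
                parts.append(buf)
        received[index] = b"".join(parts)

    threads = []
    for i, data in enumerate(slices):
        a, b = socket.socketpair()
        threads.append(threading.Thread(target=sender, args=(a, data)))
        threads.append(threading.Thread(target=receiver, args=(b, i)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Slices are stored by stream index, so order is preserved.
    return b"".join(received)


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # about 1 MiB of test data
    assert send_over_streams(data, 4) == data
    print("reassembled", len(data), "bytes from 4 streams")
```

In a real deployment each stream would be a separate connection, ideally served by its own core, which is the point both comments above are making.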
The C++ details are found in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/fl...
Edit: Apparently it's also the default in protobuf 3.10
My understanding is that it's a binary alternative to JSON/REST APIs and that all Google Cloud Platform services use it. However, since I have not managed to figure out how to do a single interaction with RPC against GCP (or any other service), I am wondering if my understanding is completely wrong here.
gRPC is one implementation of RPC, where HTTP/2 is used as the transport layer and protocol buffers are used for data serialization. You typically use it through the gRPC framework: generate code for a specific API, then use the generated code and the client library to perform the call. There might, however, also be other ways, e.g. proxies to HTTP systems and server introspection mechanisms that let you perform calls without requiring the API specification.
I know networks are getting very fast, but with this size of data I wonder if there are realizable gains left with modern compression algorithms like Snappy.
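One way to reason about this question is a simple pipelined model: if compression runs concurrently with sending, the effective rate of uncompressed data is capped by either the codec's throughput or the link speed multiplied by the compression ratio, whichever is lower. The sketch below assumes that model, ignores decompression cost, and uses made-up numbers, not measured Snappy figures. `effective_throughput` is a hypothetical helper.

```python
def effective_throughput(link_gbps: float, codec_gbps: float, ratio: float) -> float:
    """Effective rate of *uncompressed* data when compression is pipelined
    with sending. `ratio` is uncompressed size / compressed size."""
    if min(link_gbps, codec_gbps, ratio) <= 0:
        raise ValueError("all parameters must be positive")
    return min(codec_gbps, link_gbps * ratio)


def compression_helps(link_gbps: float, codec_gbps: float, ratio: float) -> bool:
    """True when the compressed pipeline beats sending raw bytes."""
    return effective_throughput(link_gbps, codec_gbps, ratio) > link_gbps


if __name__ == "__main__":
    # Assumed numbers: a codec that handles 4 Gbps per core at a 2x ratio.
    for link in (1.0, 10.0, 25.0):
        print(f"{link} Gbps link: compression helps = "
              f"{compression_helps(link, codec_gbps=4.0, ratio=2.0)}")
```

Under these assumptions compression pays off only while the link is slower than the codec, so on fast links the codec itself becomes the bottleneck unless it is spread across several cores.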
Disclaimer: I am the author
After reading the headline, I initially thought "data" meant any kind of bytes to replicate or something. But it is something else, judging mainly from the line "is focused on optimized transport of the Arrow columnar format (i.e. “Arrow record batches”) over gRPC".
Could you elaborate?
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-f...