In my opinion, this move threatens the scientific integrity of Google's research. It's clear that researchers in distributed systems, fundamental theory, security and privacy, networking, language development, etc. are second class citizens. I wouldn't be surprised to learn that there's organizational pressure to inject machine learning into as many publications as possible, even if it dilutes the overall diversity of research.
30% and 1/3 not the exact real numbers, just showing a concept.
Well, it's got my attention.
Pretty interesting to see this actually happen and see where Google ended up with it.
Where are your logs stored? Is that a distributed storage? Will SparkSQL not eat all of the bandwidth of it’s ethernet interfaces?
Yeah; sure.
pedantic sidebar: hdfs isn't a file format, it's a distributed file system layered over a traditional on-disk filesystem. For example you might have: json logs, in a gz-formatted file, tracked in the hdfs filesystem, stored on disk in an ext4-formatted filesystem.
Yes it does. Source - use Spark SQL routinely. You're right that multiple small Gzipped files are not an ideal input source as it'll create a bunch of small tasks, but Spark definitely does support GZ.