Scribe: Transporting petabytes per hour via a distributed, buffered queueing (engineering.fb.com)
Just for fun, for more perspective on big data, a human body generates around 1-10M new cells per second, and a cell contains about 10-100GB of information. So a single human is generating 1-100PB/s of data just in the new cells! (Give or take a few OOM)
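As a rough check on that back-of-envelope figure, the sketch below multiplies the two quoted ranges directly. It assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes), and the function name is only illustrative. Multiplying the endpoints gives about 10 PB/s to 1000 PB/s, which is within the "few OOM" tolerance the comment allows itself.

```python
# Back-of-envelope check using only the ranges quoted in the comment.
CELLS_PER_SECOND = (1e6, 1e7)    # 1-10M new cells per second
BYTES_PER_CELL = (10e9, 100e9)   # 10-100 GB per cell (decimal units assumed)
PB = 1e15                        # bytes per petabyte (decimal)

def data_rate_range_pb_per_s(cells=CELLS_PER_SECOND, bytes_per_cell=BYTES_PER_CELL):
    """Return (low, high) data rate in PB/s from the low and high ends of both ranges."""
    low = cells[0] * bytes_per_cell[0] / PB
    high = cells[1] * bytes_per_cell[1] / PB
    return low, high

low, high = data_rate_range_pb_per_s()
print(f"{low:g} PB/s to {high:g} PB/s")
```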
OTOH the amount of "information" needed to perfectly simulate a cell is probably unbounded. Just a corollary of the fact that we currently don't know how to perfectly simulate reality. Even a single "real" number can take up infinite space.
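A small illustration of the "real number" remark, using Python's standard `decimal` module: every extra digit of precision for an irrational value such as sqrt(2) costs more storage, and no finite precision captures it exactly. The helper name `sqrt2_digits` is made up for this sketch.

```python
from decimal import Decimal, localcontext

def sqrt2_digits(precision):
    """Return sqrt(2) as a string rounded to `precision` significant digits."""
    with localcontext() as ctx:
        ctx.prec = precision          # significant digits used for this computation
        return str(Decimal(2).sqrt())

# The representation keeps growing as the requested precision grows.
for p in (5, 20, 50):
    value = sqrt2_digits(p)
    print(p, len(value), value)
```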
No storage system can store that data (and most of it is not useful) so they have a series of hardware triggers and buffers that reduce the data down to roughly what modern (general purpose) hardware is capable of handling. They tune the thresholds to match what consumer hardware is capable of.
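A minimal sketch of the trigger-plus-buffer idea described above, not the actual hardware pipeline: readings below a threshold are discarded, and the survivors go into a bounded buffer that drops its oldest entries when full. The function name, the threshold value, and the sample readings are all hypothetical.

```python
from collections import deque

def trigger_and_buffer(readings, threshold, buffer_size):
    """Keep readings that pass the threshold in a bounded buffer.

    Returns (buffer contents, number of readings that passed the trigger).
    """
    buffer = deque(maxlen=buffer_size)  # oldest entries are evicted when full
    passed = 0
    for value in readings:
        if value >= threshold:          # the "trigger": drop uninteresting data early
            buffer.append(value)
            passed += 1
    return list(buffer), passed

# Hypothetical readings; raising the threshold lowers the rate reaching storage.
readings = [0.1, 0.9, 0.4, 0.95, 0.2, 0.99]
kept, passed = trigger_and_buffer(readings, threshold=0.9, buffer_size=2)
print(kept, passed)
```

Tuning the threshold trades completeness for a data rate the downstream hardware can sustain, which is the knob the comment refers to.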
With regard to supercomputer filesystems: nobody wants to use GPFS. CERN's EOS sustained (theoretical) 3.3TB/sec in Apr 2015, so it's not like they're uncompetitive with the largest supercomputer...
Those of us who don't own our own cross-ocean fiber can't afford to design systems like this.
(I'm paraphrasing an old, old joke.)
We used it at a company I worked for, but it had long since been deprecated, so I was confused when I saw this Scribe.
https://web.archive.org/web/20120301000000*/http://www.scrib...
The initial stuff seems to not be so related, but the current description of what they do seems much much closer to what Facebook's Scribe does today :)
Naming is hard! :D
The article just focuses on certain areas of the system and doesn't go into the security and privacy parts, that's all.
(I work in Scribe)
On the volume of metadata held by Producers: will there be any significant difference between holding WriteService metadata and LogDevice metadata?
Despite that, I find the claims to be underwhelming. So your system can process massive amounts of data by scaling massively horizontally...neat.
(disclaimer: I work in Scribe)
Maybe one day we'll have a version available. In any case, one of the larger parts of the system (LogDevice) is open source :)
(disclaimer: I work in Scribe)
LogDevice: https://engineering.fb.com/core-data/open-sourcing-logdevice...
This is a very good point. The 'information' in a cell isn't the base pairs in its DNA, but all the atoms that make up the whole cell. And then each atom encapsulates properties such as position, velocity, charge, van der Waals radius etc.
However, this treats atoms under classical mechanics. In a quantum mechanical representation it would be very different again, and you can start asking really hairy questions about whether information can be created or destroyed.
In any case, the data is "whatever needs to be logged".
And it's not "server logs", which is what I'm interpreting from your comment. Scribe transports most data at Facebook to be processed by real-time systems (e.g. Puma, Scuba) and also "batch systems" (data warehouse). So, it's quite a lot, being "the ingestion pipe" for Facebook.
Does this answer your question? :-?
Puma: https://research.fb.com/publications/realtime-data-processin...
Scuba: https://research.fb.com/publications/scuba-diving-into-data-...
I see. I walked away from the article with the impression that it was meant to be a log aggregation service à la Flume, Splunk, or Logstash.
> the amount of data is low ("underwhelming") and from your last comment that it's a lot ("that much data").
I was remarking on the numbers with regard to generation, not consumption. Based on the article, my estimate suggested that generating 2.5TB/s of transactional logs and telemetry data using "millions" of machines would be technically possible but not reasonably practical... and thus likely not real ;). But you corrected my understanding: that number is based on a different use case.
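For scale, the figures in this comment can be put on a per-machine basis. The sketch below assumes decimal units and uses hypothetical fleet sizes as stand-ins for "millions", since the exact machine count is not given here.

```python
TOTAL_BYTES_PER_SECOND = 2.5e12  # 2.5 TB/s, decimal units assumed

def per_machine_mb_per_s(machine_count, total=TOTAL_BYTES_PER_SECOND):
    """Average rate each machine would need to generate, in MB/s."""
    return total / machine_count / 1e6

# Hypothetical fleet sizes standing in for "millions" of machines.
for machines in (1_000_000, 5_000_000, 10_000_000):
    rate = per_machine_mb_per_s(machines)
    print(f"{machines:>10,} machines -> {rate:.2f} MB/s each")
```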
Obviously some people do want GPFS, if they can afford it, but Cori uses Lustre. I don't mean to claim that either is ideal for streaming high rate event data, of course.
The data model at CERN does not match that of a supercomputer. CERN data are not processed locally but distributed across ~100 participating institutes in the experiment.
Moreover, "personal opinion", GPFS is crap. It's an old relic from the 90s with so many quirks and design problems that it would deserve an entire conference of its own. Plus, it's proprietary and expensive.
The only reason GPFS is still alive is that, for a long time, the only alternative was Lustre, and Lustre is even worse.
Every single supercomputer meeting I've been to (I've been part of the community for years, they often invite me to their meetings to give an industry perspective), people are just continuously complaining about the filesystems, and it's GPFS and Lustre at the top of the list.