Additionally, I would really, really like to be able to use it as an Event Store, easily accessible by anyone in the org with infinite data retention. I know Kafka kind-of sort-of provides this functionality, but it doesn’t work in practice.
This appears to be a solution to this problem. Will be interesting to see whether it gains traction.
How so?
A simple way to get around this problem is dumping messages into a file and putting that file in S3 named something like “topic-partition-offset”, where offset is the offset of the first message contained within that file. You can then read those files forward starting from offset zero until you reach the end, then start reading from Kafka for recent data.
The drawback is this isn’t integrated with Kafka so you’re now maintaining what is effectively two different systems for the same data. It also means the key-based compaction won’t work either and you’d have to re-implement that on top of the files in S3 as well.
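A rough sketch of that “topic-partition-offset” naming scheme in Python, to show how a reader would pick which archived files to replay. The function names and the zero-padding width are my own assumptions, and the S3 calls and the hand-off to Kafka at the end of the archive are left out:

```python
from bisect import bisect_right

def segment_key(topic: str, partition: int, first_offset: int) -> str:
    # Zero-pad the offset (assumed width: 20 digits) so that listing keys in
    # lexicographic order also lists them in offset order.
    return f"{topic}-{partition}-{first_offset:020d}"

def parse_first_offset(key: str) -> int:
    # The offset is whatever follows the last hyphen, so topic names that
    # contain hyphens still parse correctly.
    return int(key.rsplit("-", 1)[1])

def segments_from(keys: list[str], start_offset: int) -> list[str]:
    """Return the archived segments needed to read from start_offset onward."""
    ordered = sorted(keys, key=parse_first_offset)
    firsts = [parse_first_offset(k) for k in ordered]
    # The segment holding start_offset is the last one whose first offset
    # is <= start_offset; everything after it is read in full.
    i = max(bisect_right(firsts, start_offset) - 1, 0)
    return ordered[i:]
```

A consumer would read the returned files in order, note the last offset it saw, and then seek the Kafka consumer to the next offset to pick up recent data.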
Storing all data forever in a single source of truth is awesome until regulation like GDPR comes along. Do you have plans to support excision or is your guidance on personal data to avoid putting it into a system like Kafka/Pyrostore?
It covers several strategies, three of which are:
* Encrypt it and then throw away the key to forget it
* Store private data outside the event with the event just pointing to it
* Delete events (on systems that support this)
Workarounds for excision in Kafka, such as key compaction, are often not possible to use as they depend on the key scheme used.
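To make the second strategy in the list concrete, here is a minimal sketch, assuming a hypothetical `PrivateStore` class standing in for whatever mutable database holds the personal data, and a plain list standing in for the immutable log. None of these names come from Kafka or Pyrostore:

```python
import uuid

class PrivateStore:
    """Hypothetical mutable side store for personal data (e.g. a DB table)."""

    def __init__(self):
        self._rows = {}

    def put(self, data: dict) -> str:
        ref = str(uuid.uuid4())
        self._rows[ref] = data
        return ref

    def get(self, ref: str):
        # Returns None once the data has been forgotten.
        return self._rows.get(ref)

    def forget(self, ref: str) -> None:
        self._rows.pop(ref, None)

event_log = []  # stands in for an append-only topic that is never rewritten

def record_signup(store: PrivateStore, user_id: int, email: str) -> dict:
    # Only an opaque reference goes into the event, never the email itself.
    ref = store.put({"email": email})
    event = {"type": "user_signed_up", "user_id": user_id, "private_ref": ref}
    event_log.append(event)
    return event
```

Forgetting a user is then a `store.forget(ref)` call. The event stays in the log untouched, but its pointer no longer resolves to anything. The first strategy has the same shape, with an encryption key playing the role of the pointed-to row.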
https://azure.microsoft.com/en-us/services/hdinsight/apache-...
Is this the same approach as Pyro?
This reduces operational complexity significantly vs scaling nodes up, dealing with rebalancing, under replicated partitions, etc.
What does 'lossy effects on stream-ability' mean here? Does the stream slow down, is data lost, or something else?
This would be even better if it didn't need a modified client.
If the inbound data you'd like to put into Kafka isn't large, by the way, just write straight to the DB. It's irritating to see Kafka used where it's not necessary. It adds complexity to an infrastructure, and the cost of doing so has to be justifiable.
Growing LVM volumes with XFS has worked well: zero downtime, and it takes around 60 seconds.
It allows you to over-provision just enough that you do not have to babysit the drives or pay $$$ for unused disk.
If you stripe the volumes you'll also distribute your IOPS in AWS.
Outside AWS, LVM still applies. Kafka's JBOD is useless without easy / auto rebalancing.
This week onsite at a client's I discovered ScaleIO which can present up to a 1PB volume and does clever sharding/replication in the background.
Why is this difficult? Is mirroring clusters operationally problematic? If one cluster gets too small, in theory can’t you spin up another cluster and mirror the first onto the second, then direct writes to the new cluster once they are in sync?
Ordering of messages already written is completely unaffected; it's the routing of future events that's affected when you increase partition count. This is critical for some use cases (windowing of data for analytics purposes, for example) but irrelevant to others.
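A small illustration of the routing point, assuming the usual hash-modulo scheme for keyed messages. CRC32 is used here only as a stand-in (Kafka's default partitioner hashes keys with murmur2), and both function names are made up for the example:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32,
    # but the modulo step is what matters for this illustration.
    return zlib.crc32(key) % num_partitions

def moved_keys(keys, old_count: int, new_count: int):
    """Keys whose future events land on a different partition after a resize."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]

keys = [f"user-{i}".encode() for i in range(1000)]
moved = moved_keys(keys, 4, 6)
print(f"{len(moved)} of {len(keys)} keys change partition going from 4 to 6")
```

Events already written stay where they are. For every key that moves, though, new events go to a different partition than the old ones, so a consumer that relies on seeing all events for a key in one partition loses that guarantee across the resize.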
Data duplication issues? This sounds like FUD also, but common guidance is to design your events to be idempotent, or utilize Kafka's new exactly-once delivery.
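For the idempotency route, a minimal sketch of a consumer that tolerates redelivery, assuming every event carries a unique `event_id` field (a hypothetical name; your schema may differ) and that the seen-ID set fits in memory:

```python
class IdempotentBalance:
    """Applies each event at most once, keyed on an assumed event_id field."""

    def __init__(self):
        self.balance = 0
        self._seen = set()  # in production this would need to be durable

    def apply(self, event: dict) -> bool:
        # A redelivered event is recognized by its ID and skipped, so
        # at-least-once delivery cannot double-count it.
        if event["event_id"] in self._seen:
            return False
        self._seen.add(event["event_id"])
        self.balance += event["amount"]
        return True
```

Applying the same event twice leaves the balance unchanged the second time, which is the property that makes duplicates harmless.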
You can expand partition counts for a topic dynamically.
You can't currently decrease partition counts, because given the current design that could orphan both data and consumers.
In the streaming and messaging space, Apache Pulsar (pulsar.apache.org) is a more recent solution that has an architecture that decouples processing and storage. That gives you nice properties like independent scaling of storage and processing, infinite data retention, dynamic resizing and others.
Also, running two clusters to handle large volumes of data is big money. Even a modest cluster with around 20TB+ of data was north of 30K a month on AWS. That's a full app cluster though, with consumers/producers as well as brokers.
I think Joyent's Manta was ahead of its time in colocating compute and storage and I suspect we'll see more along this vein with the recent open sourcing of FoundationDB.
A lot of new data processing platforms, from Snowflake in the data warehouse world to AWS Athena to Apache Pulsar in the broader data processing world, have moved to decoupled architectures.
Containerization and container management frameworks (e.g. Kubernetes) certainly do change the meaning of "local" storage. It will be interesting to see how that plays out.