Commanding infinite streaming storage with Apache Kafka and Pyrostore (pyrostore.io)

68 points by lbradstreet 8 years ago | 27 comments

I like it. Personally, one of my biggest problems with Kafka is its operational complexity. I’ve just had one too many instances of Kafka brokers getting stuck while doing an upgrade and things like that.

Additionally, I would really, really like to be able to use it as an Event Store, easily accessible by anyone in the org with infinite data retention. I know Kafka kind-of sort-of provides this functionality, but it doesn’t work in practice.

This appears to be a solution to this problem. Will be interesting to see whether it gains traction.

linkmotif 8 years ago | |

> I know Kafka kind-of sort-of provides this functionality, but it doesn’t work in practice.

How so?

ryanworl 8 years ago | | |

One potential problem is that a Kafka partition’s size is limited by the disk capacity of the smallest machine in its replica set. This means that if you want infinite retention, you have to over-partition up front so partitions never get too big, keep buying bigger machines and disks, or deal with a repartition of all data.

A simple way to get around this problem is dumping messages into a file and putting that file in S3, named something like “topic-partition-offset”, where offset is the offset of the first message contained within that file. You can then read those files forward starting from offset zero until you reach the end, then start reading from Kafka for recent data.

The drawback is that this isn’t integrated with Kafka, so you’re now maintaining what is effectively two different systems for the same data. It also means key-based compaction won’t work, and you’d have to re-implement that on top of the files in S3 as well.
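
To make the naming scheme concrete, here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the `bucket` dict stands in for S3, the `live` list stands in for a Kafka consumer, and the helper names (`archive_key`, `read_archive`, `replay`) are made up. It only shows the ordering logic: read archived files by first offset, then switch to live data without repeating offsets.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Message = Tuple[int, bytes]  # (offset, payload)


def archive_key(topic: str, partition: int, first_offset: int) -> str:
    # Hypothetical naming helper: "topic-partition-offset".
    # Zero-padding keeps lexicographic key order equal to numeric offset order.
    return f"{topic}-{partition}-{first_offset:020d}"


def read_archive(
    bucket: Dict[str, List[Message]], topic: str, partition: int
) -> Iterator[Message]:
    # `bucket` is an in-memory stand-in for an object store such as S3.
    wanted = f"{topic}-{partition}"
    matches = []
    for key in bucket:
        # The offset is the last dash-separated field, so topic names
        # that contain dashes still parse correctly.
        head, offset_text = key.rsplit("-", 1)
        if head == wanted:
            matches.append((int(offset_text), key))
    for _, key in sorted(matches):
        yield from bucket[key]


def replay(
    bucket: Dict[str, List[Message]],
    live: Iterable[Message],
    topic: str,
    partition: int,
) -> Iterator[Message]:
    # Read the archive from offset zero, then continue from the live source.
    next_offset = 0
    for offset, payload in read_archive(bucket, topic, partition):
        if offset >= next_offset:
            yield offset, payload
            next_offset = offset + 1
    # `live` stands in for a Kafka consumer; anything already served
    # from the archive is skipped so no offset is delivered twice.
    for offset, payload in live:
        if offset >= next_offset:
            yield offset, payload
            next_offset = offset + 1


bucket = {
    archive_key("events", 0, 0): [(0, b"a"), (1, b"b")],
    archive_key("events", 0, 2): [(2, b"c")],
    archive_key("events", 1, 0): [(0, b"x")],
}
live = [(2, b"c"), (3, b"d")]
replayed = list(replay(bucket, live, "events", 0))
```

With a real consumer you would probably seek to the first offset not in the archive rather than filter, but the handoff rule is the same.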

stingraycharles 8 years ago | | |

It’s difficult to search through, query, or run projections on. Also, the API assumes you want to stream realtime data rather than query historical data.

tomconnors 8 years ago |

Everything Distributed Masonry does is very interesting. Wish I had more excuses to use your stuff at work.

Storing all data forever in a single source of truth is awesome until regulation like GDPR comes along. Do you have plans to support excision or is your guidance on personal data to avoid putting it into a system like Kafka/Pyrostore?

insensible 8 years ago | |

You might enjoy reading Greg Young's https://leanpub.com/esversioning, which covers this topic.

It covers several strategies, three of which are:

* Encrypt it and then throw away the key to forget it

* Store private data outside the event with the event just pointing to it

* Delete events (on systems that support this)
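
As a rough illustration of the second strategy (private data stored outside the event, with the event pointing to it), here is a small Python sketch. The class and function names (`PrivateDataStore`, `record_signup`, `resolve`) and the event shape are hypothetical and not taken from the book or from any product; a plain list stands in for the immutable log.

```python
import uuid
from typing import Dict, List, Optional


class PrivateDataStore:
    """Mutable side store for personal data (a stand-in for a database)."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def put(self, data: dict) -> str:
        ref = str(uuid.uuid4())
        self._records[ref] = data
        return ref

    def get(self, ref: str) -> Optional[dict]:
        return self._records.get(ref)

    def forget(self, ref: str) -> None:
        # Deleting here is the "excision"; the event log is never touched.
        self._records.pop(ref, None)


def record_signup(
    store: PrivateDataStore, log: List[dict], user_id: int, email: str
) -> str:
    # The event carries only a pointer to the personal data.
    ref = store.put({"email": email})
    log.append({"type": "user_signed_up", "user_id": user_id, "pii_ref": ref})
    return ref


def resolve(store: PrivateDataStore, event: dict) -> dict:
    # Readers join the event with the side store; "pii" is None once forgotten.
    return {**event, "pii": store.get(event["pii_ref"])}


store = PrivateDataStore()
log: List[dict] = []
ref = record_signup(store, log, 42, "someone@example.com")
before = resolve(store, log[0])
store.forget(ref)
after = resolve(store, log[0])
```

The log stays append-only and replayable; the cost is that every reader has to handle a missing reference.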

lbradstreet 8 years ago | |

We will be launching support for native excision and data anonymization soon, as these are extremely important to storing streaming data for the long term.

Workarounds for excision in Kafka, such as key compaction, are often unusable because they depend on the key scheme used.
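
A toy model of why the key scheme matters, written as a Python sketch rather than anything Kafka or Pyrostore actually runs. The `compact` function is a simplification I am assuming for illustration: it keeps the latest record per key and drops tombstones (null values) immediately, whereas real Kafka retains tombstones for a configurable period first.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); a None value is a tombstone


def compact(log: List[Record]) -> List[Record]:
    # Keep only the most recent record for each key, then drop tombstones.
    latest: Dict[str, int] = {}
    for index, (key, _) in enumerate(log):
        latest[key] = index
    return [
        (key, value)
        for index, (key, value) in enumerate(log)
        if latest[key] == index and value is not None
    ]


# Keyed by user: a single tombstone removes that user's data.
by_user: List[Record] = [
    ("user-1", "alice@example.com"),
    ("user-2", "bob@example.com"),
    ("user-1", None),
]
by_user_compacted = compact(by_user)

# Keyed by order: a tombstone for "user-1" matches no existing key,
# so the user's data inside the order records survives compaction.
by_order: List[Record] = [
    ("order-1", "user-1 bought X"),
    ("order-2", "user-2 bought Y"),
    ("order-3", "user-1 bought Z"),
]
by_order_compacted = compact(by_order + [("user-1", None)])
```

In the second case you would have to find and tombstone every order key for that user, which also deletes the order records themselves.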

taherchhabra 8 years ago |

Integration with Azure Managed Disks: Due to the ingestion-heavy nature, the disks attached to the nodes on the cluster often become the bottleneck. Traditionally, to scale this bottleneck, more nodes need to be added. Azure Managed Disks is a technology that provides cheaper, scalable disks that are a fraction of the cost of a node. HDInsight Kafka has integrated with these disks to provide up to 16 TB/node instead of the traditional 1 TB. This results in an exponentially higher scale, while reducing costs in the inverse, exponential manner.

https://azure.microsoft.com/en-us/services/hdinsight/apache-...

Is this the same approach as Pyrostore?

lbradstreet 8 years ago | |

Our approach archives topics to cheap, highly durable and available object stores, while keeping the data available for blending between warehoused and live data sets.

This reduces operational complexity significantly compared with scaling nodes up and dealing with rebalancing, under-replicated partitions, etc.

lmsp 8 years ago |

This is what Apache Pulsar (https://pulsar.incubator.apache.org/) already provides: infinite streaming storage with a simple, flexible messaging and streaming API, and it is Kafka compatible.

chrisjc 8 years ago |

Very interesting and reminds me of Pravega (http://pravega.io/). Seems like unbounded streams will be the next big step in streaming technology.

https://www.youtube.com/watch?v=cMrTRJjwWys

mavdi 8 years ago |

These are the guys behind www.onyxplatform.org. That alone tells me this is legit stuff. We will give it a try.

dominotw 8 years ago |

> tradeoffs in our operation of Kafka have lossy effects on stream-ability. Balancing costs and operational feasibility, we ask Kafka to forget older data through retention policies.

What does 'lossy effects on stream-ability' mean here? Does the stream slow down, is data lost, or is it something else?

lbradstreet 8 years ago | |

Pyrostore co-founder here. When practitioners archive their data from Kafka to other storage products (S3, a SQL database, etc.) today, they give up the log-ordered structure of the data and their ability to consume it in its original ordering, with its original offsets and timestamps. Pyrostore structures and indexes your data in S3 in order to provide a consumer that implements the Kafka consumer interfaces, ensuring you are always able to stream from hot and cold storage alike.

ah- 8 years ago |

I wonder if this would ever be integrated into Kafka proper. Shipping out historical chunks onto infinite storage seems like a generally sensible thing.

This would be even better if it didn't need a modified client.

rad_gruchalski 8 years ago | |

I did suggest a potential solution a while ago: https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32...

Relevant JIRA ticket: https://issues.apache.org/jira/plugins/servlet/mobile#issue/...