Redis streams as a pure data structure (antirez.com)

344 points by itamarhaber 7 years ago | 53 comments

apeace 7 years ago |

I get that the tennis match use-case is meant to be trivial and an example, but I don't buy it.

> Before Streams we needed to create a sorted set scored by time: the sorted set element would be the ID of the match, living in a different key as a Hash value.

I think the sorted set would be a much better choice, because then you could still insert items in the past, like when that admin remembers there was a tennis match last week he never recorded. Same goes for modifying past values, or deleting values. These operations are trivial using a sorted set & hash, not so using streams.
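To make the comparison concrete, here is a minimal in-memory sketch of the sorted set plus hash pattern. It is not Redis code: `MatchIndex` and its method names are hypothetical, and the parallel sorted lists only stand in for what ZADD and HSET would hold. It shows why back-dated inserts, updates, and deletes are cheap with this layout.

```python
import bisect


class MatchIndex:
    """Toy stand-in for a sorted set (score = time) plus one hash per match."""

    def __init__(self):
        self._times = []    # sorted timestamps, playing the role of scores
        self._ids = []      # match IDs, kept parallel to _times
        self._matches = {}  # match_id -> fields, playing the role of a hash

    def add(self, match_id, timestamp, **fields):
        # Ordered insertion: a match from last week simply lands earlier.
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._ids.insert(pos, match_id)
        self._matches[match_id] = dict(fields)

    def update(self, match_id, **fields):
        # Modifying a past value only touches the hash.
        self._matches[match_id].update(fields)

    def delete(self, match_id):
        pos = self._ids.index(match_id)
        del self._times[pos]
        del self._ids[pos]
        del self._matches[match_id]

    def range(self, start, end):
        # Inclusive time range query over the sorted timestamps.
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return [(m, self._matches[m]) for m in self._ids[lo:hi]]
```

Adding a match with an older timestamp after newer ones still yields a correctly ordered range query, which is the property an append-only stream does not give you for free.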

I'm excited for streams and I'm glad Antirez is taking time to blog and evangelize, but this article didn't convince me there's a compelling use-case for streams aside from the Kafka-like use-case.

antirez 7 years ago | |

We are going to have an option for XADD to insert elements in the middle. I commented more extensively about it in another reply, so inserting out of order later will be possible. However, note that the pattern still works if you store the time as a field, as long as you don't need range queries and just want single-item identifiers. In any case, the XADD option to insert out of order is really a thing that will hit Redis ASAP.

nicpottier 7 years ago | | |

Excellent to hear this.

We use sorted sets as queues heavily, and this would be necessary for us to consider giving streams a go, which would indeed be interesting for the memory savings (we sometimes have millions of items in our queues for a short time). Sometimes, say on error conditions, you want to stuff something back at the start of the queue (because the order of processing matters) instead of at the end. That's one example; priority is another.

drewda 7 years ago |

Just like it's useful to have both SQLite and Postgres available for smaller and larger data projects (and Spatialite and PostGIS for smaller/larger geo-data projects), it could be great to have Redis and Kafka for smaller and larger pipeline projects.

Does anyone have good patterns for joining across entries from two or more Redis streams? This is one of the most interesting aspects of Kafka/Flink/Spark/Storm/etc. Would be useful to be able to develop with streaming joins in Redis playgrounds.

skybrian 7 years ago |

This seems pretty simple when events are logged as they happen with little or no latency and you can let the stream set the timestamp. I wonder, though, about the case where events may be buffered, perhaps due to an unreliable network? The time that the event occurred might be significantly earlier than the time it's inserted, and furthermore events are arriving out of order. It seems like things get much more complicated?

Let's say tennis games are recorded on a piece of paper and entered into the computer later. What is different?

antirez 7 years ago | |

Two solutions: 1. Add a timestamp as a field and just use the ID, but in that case range queries are going to be a problem. 2. Exactly because of what you stated, XADD will soon have a special argument to say: I'm going to insert an element in the middle, this is the time in milliseconds (find the counter part of the ID for me if I did not specify one). It could be confusing for streaming, but for a data structure, inserting in the middle is spot-on and there is nothing preventing that.
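A rough illustration of solution 1 and its range-query cost, as a toy append-only log in plain Python. It is not the real implementation: the `Stream` class and the `played_at` field name are assumptions, and only the `<ms>-<seq>` ID shape mirrors Redis. IDs follow insertion time, so ranges by ID are cheap, while ranges by the event-time field need a full scan.

```python
import bisect
import time


class Stream:
    """Toy append-only log with '<ms>-<seq>' IDs that only ever grow."""

    def __init__(self):
        self._ids = []      # (ms, seq) tuples, sorted by construction
        self._fields = []   # field dicts, parallel to _ids
        self._last = (0, 0)

    def add(self, fields, now_ms=None):
        ms = int(time.time() * 1000) if now_ms is None else now_ms
        last_ms, last_seq = self._last
        # Same or older millisecond: keep the last ms and bump the counter.
        new_id = (ms, 0) if ms > last_ms else (last_ms, last_seq + 1)
        self._last = new_id
        self._ids.append(new_id)
        self._fields.append(dict(fields))
        return "%d-%d" % new_id

    def range_by_id(self, start_ms, end_ms):
        # IDs are ordered, so a binary search finds the slice.
        lo = bisect.bisect_left(self._ids, (start_ms, 0))
        hi = bisect.bisect_right(self._ids, (end_ms, float("inf")))
        return self._fields[lo:hi]

    def range_by_event_time(self, start_ms, end_ms):
        # The field is not indexed: every entry has to be inspected.
        return [f for f in self._fields
                if start_ms <= f["played_at"] <= end_ms]
```

Entries recorded from paper later get IDs from the moment of insertion, so only `range_by_event_time` answers "what was played last week", and it does so by scanning everything.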

nicois 7 years ago |

I threw together a few words here about how we are using Streams combined with Sorted Sets to "upgrade" legacy databases to streams of data. Not revolutionary, but it could be interesting to some people. I can write more, if there's any demand: http://nicois.github.io/posts/databases-to-streams/

_pmf_ 7 years ago |

I wish we could standardize on using Redis as general interprocess transactional memory. I could drop 95% of our application code for our Embedded Linux platform by using stock Redis and stock SQLite, but of course there are political obstacles.

cordite 7 years ago | |

Is this embedded in the same process, or just within the same unit?

Aside: would an embeddable redis be a useful thing for apps and other isolated devices?

antirez 7 years ago | | |

There is basically no gain in practical terms in running Redis as an embedded library in embedded contexts. At this point I think I'm able to summarize the key reasons.

1. Embedded systems are often used in environments where you need very resilient software. To crash the DB because there is a bug in your app is usually a bad idea.

2. As a variation of "1", it's good to have different modules as different processes, and Redis works as a glue (message bus) in that case. So again, all should talk to Redis via a unix socket or the like.

3. Latency is usually very acceptable even for the most demanding applications. When it is not, a common pattern to solve such a problem is to write to a buffer from within the embedded process, which a different thread then moves to Redis. Anyway, if you have Redis latencies of any kind, you don't want to block your embedded app's main thread.

4. Redis persistence is not compatible with that approach.

5. Many tried such projects (embedded Redis forks or reimplementations) and nobody cared. There must be a reason.
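The buffering pattern from point 3 can be sketched roughly like this. It is only a sketch under assumptions: `BufferedWriter` is a hypothetical name, and `sink` stands for whatever callable actually sends the item to Redis (for example a function wrapping a client call); no Redis client is used here.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the background thread to exit


class BufferedWriter:
    """Main thread enqueues without blocking; a worker thread forwards items."""

    def __init__(self, sink, maxsize=1000):
        self._sink = sink
        self._queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, item):
        """Return True if buffered, False if the buffer is full (item dropped)."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            try:
                self._sink(item)  # this is where slow I/O latency is absorbed
            except Exception:
                pass  # assumption: a real app would log or retry here

    def close(self):
        # Items queued before the sentinel are still delivered, in order.
        self._queue.put(_STOP)
        self._thread.join()
```

Dropping on a full buffer is one possible policy; blocking with a timeout or overwriting the oldest item are alternatives, depending on what the embedded app can tolerate.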

zachwill 7 years ago | |

I hardly ever comment, but this is a really cool idea. Could you elaborate a little more? (On technical aspects, not political obstacles.)

smush 7 years ago | | |

Complete conjecture, I am not the GP.

Hydrating/deserializing data from SQLite into types/objects and doing whatever goodness those need, then using Redis to make "updating the database" super fast (it's in memory, after all) and letting Redis write it back to SQLite whenever there is spare IO, time, or a lull in traffic.

Kinda like how Epic Cache does its transaction journal flushing every X minutes?

skrebbel 7 years ago |

Did anyone yet use Redis streams to store actual logs? Like server logs, application logs, etc.

I understand that Elasticsearch is a common place to put logs, also because I assume that searching through logs is a common use case, but I wonder whether Redis has particular benefits for this use case. The data structure seems particularly tailored to it (but not so much to searching I guess).

antoncohen 7 years ago | |

Log volume can easily exceed reasonable memory sizes. Even a small company can generate TBs of logs each month. Having a single box with TBs of memory wouldn't be desirable.

For logs without full indexing, Loki (https://github.com/grafana/loki) is a recent entry into the space, and it is probably a good option to look at. It indexes metadata (labels), so it allows searching by labels but not full text. It is also supposed to be horizontally scalable, which is probably something you want in a log storage solution.

atombender 7 years ago | |

My guess is that this would work fine until the working set size exceeds available memory. Redis (unless something new has happened in the last couple of years since I used it) requires that data fit in RAM. So it could work well for low-frequency logging like alerts, but not as a general-purpose log system.

_pgmf 7 years ago |

Streams are kinda cool but they have a distinctly different feel than the other data-types in Redis. They've got this invisible statefulness. Last ids, consumer group state, etc. I've tried implementing a couple little things with streams, and it's not necessary to use the consumer group stuff or whatever of course. I wonder why streams weren't made using the modules API, though? They seem just weird/different enough to warrant exclusion from the core data types, in my thinking. Anyways, just reading the title referring to streams as "pure" made me go wtf? Because there's a lot of hidden state in there.

antirez 7 years ago | |

Pure means that when you don't use consumer groups, there is no hidden state at all, and they are just a boring data structure like everything else in Redis. Only if you use the messaging part they have state, but this is an accessory part like a shell on top of what is otherwise exactly a vanilla data structure.

coleifer 7 years ago | | |

Can you read from a stream that doesn't exist yet?

reggieband 7 years ago |

I wonder how this compares to streams in Kafka or Kinesis. One of the main advantages of redis is that I see it used in many cases as a replacement for memcache (just a key/value store for bytes/strings) so it already exists in many infrastructures.

rainhacker 7 years ago | |

I shared my experience some time back in another HN thread [1]:

"A key difference I observed was that if a Kafka consumer crashes, a rebalance is triggered by Kafka after which the remaining consumers seamlessly start consuming the messages from the last committed offset of the failed consumer.

Whereas with Redis streams I had to write code in my application to periodically poll and claim unacked messages pending for more than some threshold time."

[1] https://news.ycombinator.com/item?id=19231178
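The "poll and claim" logic described in the quote can be modeled roughly as below. This is a plain in-memory model, not Redis client code: the `pending` dictionary is a hypothetical stand-in for a consumer group's pending entries list, and `claim_stale` is an illustrative name. With real Redis, the XPENDING and XCLAIM commands cover this ground.

```python
def claim_stale(pending, new_consumer, min_idle_ms, now_ms):
    """Reassign entries that another consumer left unacked for too long.

    pending maps entry ID -> {"consumer": str, "delivered_at": int (ms)}.
    Returns the list of entry IDs that were claimed.
    """
    claimed = []
    for entry_id, info in pending.items():
        idle_ms = now_ms - info["delivered_at"]
        if info["consumer"] != new_consumer and idle_ms >= min_idle_ms:
            info["consumer"] = new_consumer
            info["delivered_at"] = now_ms  # claiming restarts the idle clock
            claimed.append(entry_id)
    return claimed


def ack(pending, entry_id):
    """Remove an entry once it has been processed successfully."""
    pending.pop(entry_id, None)
```

A consumer would call something like `claim_stale` on a timer, process what it gets back, and then `ack` each entry. This is the bookkeeping that a Kafka rebalance performs on the application's behalf.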

opportune 7 years ago | | |

From my experience, Kafka has the best API for handling read-once, distributed streams. Almost every other streaming solution, like Redis in this case, has a non-ideal or non-existent way to coordinate stream consumers in a way that prevents double reads. And lots of streaming applications need to ensure read-once (think about what a double read ends up as: maybe a twice-sent message, or a duplicate metric), so I'm not sure why they all struggle so much with just copying Kafka's pretty simple consumer API.

erulabs 7 years ago |

Streams are great! I've written a small library for Node which attempts to wrap some of the complexity (particularly for handling multiple connections for UNBLOCK calls, etc).

So far I haven't used it outside of hobby projects for webGL games and such, but it's worked brilliantly, and no Kafka required for hobby async-streaming infrastructure!

Hopefully it's useful to someone out there! https://github.com/erulabs/redis-streams-aggregator