ListenBrainz moves to TimescaleDB

ListenBrainz moves to TimescaleDB (blog.metabrainz.org)

154 points by kingkool68 5 years ago | 96 comments

mfreed 5 years ago |

Fun fact: TimescaleDB exists because we were using InfluxDB + Postgres for a previous IoT project and also found it unworkable (developer experience, query language, reliability, scalability and performance, operations, etc).

We first built TimescaleDB as "Postgres for time-series" for our own needs and then decided to open-source it for others. :-)

zitterbewegung 5 years ago | |

Have you thought of making a TimescaleDB app like the Postgres.app for macOS? Or could I use Postgres.app to make a TimescaleDB app?

avthar 5 years ago | | |

It seems you can use the Postgres.app and install TimescaleDB on it

Here are some instructions on how to do so: https://github.com/slashdotdash/til/blob/master/postgres/ins...

akulkarni 5 years ago | | |

Not sure if a coincidence, but someone just published this blog post today:

Installing Timescaledb on Mac OS X with Postgres.app https://prathamesh.tech/2020/07/23/installing-timescaledb-on...

rubyn00bie 5 years ago | | |

You can mount like any PostgreSQL version, including extensions, in Postgres.app. I’ve used both Timescale and Agens with it, with zero problems.

jarym 5 years ago | |

Have you been following ZHeap and do you think Timescale will benefit from a storage engine like that (less write amplification)?

Lockerman 5 years ago | | |

Timescale engineer here. I'm betting we'll see a nice win; we tend to see write-mostly workloads, so the UNDO shouldn't be too expensive, and the smaller tuple sizes should be nice. We've built Timescale to be compatible with custom storage engines, so it should work as a drop-in, though of course until we've tested it we won't be sure.

enordstr 5 years ago | | |

Another Timescale engineer here. As previously pointed out, zheap should work as a drop-in in TimescaleDB. In fact, I just tried it and it works. However, it currently requires an unmerged PR to work properly: https://github.com/timescale/timescaledb/pull/2082, as well as further testing.

120bits 5 years ago |

Interesting read and thanks for sharing.

Not too long ago, I was asked to work on an analytics project that required time-series data. I'm not a rockstar programmer and don't really know much about trends. So I ended up googling and stumbled upon InfluxDB. It felt like the right choice and I started playing with it. As time passed, I realized that it might be good software, and I'm sure people love InfluxDB, but it wasn't the right choice for me. I didn't really like the docs; maybe they're good now. And I had the same feeling about the query syntax: it felt weird.

I moved to TimescaleDB and never looked back. I have had it in production for almost 2 months now: 20 tables and over 100 million writes/week. One of the things I really liked was staging. I don't use Docker or anything fancy; I have a bash script that runs on a CentOS box, and the Timescale extension and Postgres database are packaged together.

I was impressed by the timescale compression feature. I wasn't using it earlier because I had to be careful about what columns I need to segmentby. I would love to see some more features but I'm sure timescaledb team is already on it.

mfreed 5 years ago | |

Hey 120bits - thanks for the nice words!

What new/other features would you like to see? (Also feel free to join slack.timescale.com or reach out at mike (at) timescale.com)

jstrong 5 years ago | |

to each his own - I find influxdb somewhat flaky but the best part about it is not having to write the atrocious sql queries I would need to to get the same kind of windowed aggregations. `group by time(1h)` and so on is pretty handy.
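For comparison, here is a rough sketch of what an hourly windowed aggregation can look like on the SQL side. TimescaleDB provides `time_bucket('1 hour', time)` for the bucketing step; the sketch below uses SQLite's `strftime` instead so that it runs with only the Python standard library. The `readings` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [
        ("2020-07-23 10:05:00", 20.0),
        ("2020-07-23 10:45:00", 22.0),
        ("2020-07-23 11:15:00", 30.0),
    ],
)

# Truncate each timestamp to the hour and average per bucket.
# In TimescaleDB the bucket expression would be time_bucket('1 hour', ts).
hourly = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:00', ts) AS bucket, AVG(temperature)
    FROM readings
    GROUP BY bucket
    ORDER BY bucket
    """
).fetchall()

for bucket, avg_temp in hourly:
    print(bucket, avg_temp)
```

It is more typing than `group by time(1h)`, but the bucket is an ordinary expression that can be combined with joins and filters.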

pgt 5 years ago |

+1 on escaping measurement names. Quoting from their source code:

    def get_escaped_measurement_name(user_name): # ... comment omitted
        return '"\\"{}\\""'.format(user_name.replace('\\', '\\\\\\\\').replace('"', '\\"').replace('\n', '\\\\\\\\n'))
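To make the layers of quoting concrete, here is a small sketch that calls the function exactly as quoted above on two made-up user names. It only shows what the Python code returns; it makes no claim about how InfluxDB interprets the result.

```python
def get_escaped_measurement_name(user_name):
    # Body copied from the snippet quoted above.
    return '"\\"{}\\""'.format(
        user_name.replace('\\', '\\\\\\\\').replace('"', '\\"').replace('\n', '\\\\\\\\n')
    )

# Hypothetical user names, chosen only to exercise the escaping.
plain = get_escaped_measurement_name('rob')
quoted = get_escaped_measurement_name('ro"b')

# Each result is wrapped in a plain double quote plus a backslash-escaped one,
# and any double quote inside the name gains a backslash of its own.
print(plain)
print(quoted)
```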

sgt 5 years ago | |

There are some edge cases in InfluxDB where escaping queries becomes a dark art.

hoseja 5 years ago | |

That looks somehow worse than regex in C literal.

contravariant 5 years ago | | |

Although it seems about par for the course for regexp in elisp.

mfreed 5 years ago | |

Do you have a URL? Thanks!

ilogik 5 years ago | | |

https://github.com/metabrainz/listenbrainz-server/blob/b0846...

iliekcomputers 5 years ago |

Hey! I've been working on ListenBrainz [0] for the past 3-ish years. Happy to answer questions if anyone has any.

[0]: https://listenbrainz.org

paol 5 years ago | |

Excited to find out about the project. I regularly use MusicBrainz but didn't know about the sister projects.

I'll definitely be creating a ListenBrainz account. As a long time last.fm user I occasionally worry about the future of the platform. (There have been long stretches of time where it seems to have been in maintenance mode). You seem to support bulk importing last.fm data right?

iliekcomputers 5 years ago | | |

Yep, we do support bulk import of last.fm data [0].

We also have a Spotify importer that automatically imports stuff from Spotify, if you use Spotify, I would definitely recommend setting that up.

We're a really small team (all volunteers), so we don't move with as much urgency as I'd like to, but we've been making slow but steady progress over the years.

If you find any rough edges, or have any feedback, I'd be happy to hear, my email is in the HN profile. :)

[0]: https://listenbrainz.org/profile/import

jeffbee 5 years ago | |

I read that whole page and I have no idea what the project does.

iliekcomputers 5 years ago | | |

Yeah, we need to fix that landing page. It's basically an open repository of your music listening history.

dylz 5 years ago | | |

It's last.fm.

RedShift1 5 years ago |

I'm in the process of moving from InfluxDB to TimescaleDB myself and can't wait to get rid of the hoops I have to jump through to get InfluxDB to answer some basic questions, mostly stemming from the fact that InfluxQL doesn't support boolean expressions. Something like 'SELECT MAX(temperature) > 10 FROM...' doesn't work.
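In plain SQL, which TimescaleDB inherits from Postgres, that comparison is just another expression in the select list. A minimal sketch, run against SQLite so it needs nothing beyond the Python standard library; the `readings` table and its values are invented for the example, and Postgres would return a boolean where SQLite returns 1 or 0.

```python
import sqlite3

# Hypothetical table and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (temperature REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(8.5,), (12.0,), (9.1,)])

# The boolean expression is evaluated on top of the aggregate.
(too_hot,) = conn.execute(
    "SELECT MAX(temperature) > 10 FROM readings"
).fetchone()

print(bool(too_hot))
```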

akulkarni 5 years ago |

(TimescaleDB co-founder). Thanks for the kind words! I feel especially proud about the first point "openness" - this is something we strive for both technically and culturally.

For example, we have a pretty active Slack channel[0] where you can ask us anything. We've probably given away $$$$ of free support over the years ;-)

[0] https://slack.timescale.com/

gregors 5 years ago |

We too started off with influx but it wasn't a good fit, mainly due to us having issues with high cardinality. I don't know if this is still the case with current implementations, but what it boils down to is if your data is searchable by a "user_id" really look elsewhere. That might be an oversimplification but that's the gist of it.

I was fully ready to just roll my own partitioned table and gave TimescaleDB a shot. It worked well. There was a bug we ran into, but it was an existing one documented on github and was addressed pretty quickly.

I still like influx, and would use it again but beware of the cardinality issues.

valyala 5 years ago | |

If you have cardinality issues in InfluxDB, then just substitute InfluxDB with VictoriaMetrics :) [1]

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

awinter-py 5 years ago |

timescale is a postgres extension. 'postgres as a platform' is an interesting world to live in.

postgres built-in RBAC is clunky or people would be relying on it, but an ecosystem of postgres plugins could include cleaner or smaller versions of this feature.

Even things like complex migrations (github's gh-ost, for example) could exist as DB plugins.

decafninja 5 years ago |

As someone who wants to pick up a time series DB to learn, what would be the best in terms of being the "industry standard"? InfluxDB? TimescaleDB?

I'm familiar with some basics of kdb and use it often in my day job, but from what I understand that isn't widely used outside of finance?

valyala 5 years ago | |

The following time series databases are popular right now:

* ClickHouse (this is a general-purpose OLAP database, but it is easy to adapt it to time series workloads)

* InfluxDB

* TimescaleDB

* M3DB

* Cortex

* VictoriaMetrics

The last three of these TSDBs support the PromQL query language - the most practical query language for typical time series queries [1]. So I'd recommend starting by learning PromQL and then evaluating time series databases from the list above.

[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

akulkarni 5 years ago | |

(I work at TimescaleDB.)

If you are familiar with Postgres and/or SQL, then you may want to start with TimescaleDB. It's just Postgres for time-series. Full SQL, so it's possible to be productive instantly.

osigurdson 5 years ago |

In some cases it is difficult to define the table columns up front. Instead, a few tables (Object, Property, Time and Value; example below) are defined, which makes it possible to create new items on the fly. This works reasonably well up to a few billion records in the value table. However it does end up taking a lot of space (covering indexes/requisite memory are required for performance). It would be great to see a Postgres-compatible solution that solves this problem in a more optimal way than a stock RDBMS.

Object: objectId, objectName, other...

Property: propertyId, objectId (FK), propertyName, other...

Time: timeId, time, other...

Value: timeId (FK), propertyId (FK), value
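A minimal sketch of that four-table layout, assuming SQLite as a stand-in so it runs from the Python standard library. The snake_case table and column names, the primary key on the value table, and the sample rows are assumptions made for the example, not part of the original description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE object (object_id INTEGER PRIMARY KEY, object_name TEXT);
    CREATE TABLE property (
        property_id   INTEGER PRIMARY KEY,
        object_id     INTEGER REFERENCES object(object_id),
        property_name TEXT
    );
    CREATE TABLE time_point (time_id INTEGER PRIMARY KEY, ts TEXT);
    -- Corresponds to the "Value" table; the key doubles as a covering index.
    CREATE TABLE property_value (
        time_id     INTEGER REFERENCES time_point(time_id),
        property_id INTEGER REFERENCES property(property_id),
        value       REAL,
        PRIMARY KEY (property_id, time_id)
    );
    """
)

conn.execute("INSERT INTO object VALUES (1, 'pump-1')")
conn.execute("INSERT INTO property VALUES (1, 1, 'pressure')")
conn.execute("INSERT INTO time_point VALUES (1, '2020-07-23 10:00:00')")
conn.execute("INSERT INTO property_value VALUES (1, 1, 3.2)")

# A new property is just another row; no ALTER TABLE is needed.
conn.execute("INSERT INTO property VALUES (2, 1, 'flow_rate')")
conn.execute("INSERT INTO property_value VALUES (1, 2, 11.5)")

rows = conn.execute(
    """
    SELECT o.object_name, p.property_name, t.ts, v.value
    FROM property_value v
    JOIN property p   ON p.property_id = v.property_id
    JOIN object o     ON o.object_id = p.object_id
    JOIN time_point t ON t.time_id = v.time_id
    ORDER BY p.property_id
    """
).fetchall()
print(rows)
```

Every reading costs one row in the value table plus three joins to read back, which is where the space and index overhead mentioned above comes from.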

silvester23 5 years ago |

> If you ever write bad data to a measurement in InfluxDB, there is no way to change it

Correct me if I'm wrong, but I'm fairly certain you can just write data with the same timestamp again and it gets updated. Deleting is also easily possible.

iliekcomputers 5 years ago | |

ListenBrainz dev here. We wanted the ability to do stuff like "DELETE FROM measurement where field = blah" and that support didn't exist last time we looked. [0]

Another thing that's not mentioned in the post but was a pain point for us was that it's not easy to query for fields with "null" values [1].

I figure a lot of our pain might be because we're not as good at Influx as we are at PostgreSQL. We've been running MusicBrainz[2] for ~18 years on PostgreSQL, that knowledge will hopefully transfer over a little with Timescale.

[0]: https://github.com/influxdata/influxdb/issues/3210

[1]: https://github.com/influxdata/docs.influxdata.com/issues/717

[2]: https://musicbrainz.org
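For illustration, the two operations mentioned above (deleting by a field's value and finding rows where a field is null) are ordinary SQL once the data lives in a relational table. A minimal sketch using SQLite from the Python standard library; the `listens` table, its columns, and the rows are hypothetical and are not the actual ListenBrainz schema.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listens (user_name TEXT, track TEXT, listened_at INTEGER)"
)
conn.executemany(
    "INSERT INTO listens VALUES (?, ?, ?)",
    [
        ("alice", "Song A", 1595500000),
        ("alice", None, 1595500300),
        ("bob", "Song B", 1595500600),
    ],
)

# Delete by the value of a field.
deleted = conn.execute(
    "DELETE FROM listens WHERE track = ?", ("Song A",)
).rowcount

# Find rows where a field is null.
null_rows = conn.execute(
    "SELECT user_name, listened_at FROM listens WHERE track IS NULL"
).fetchall()

print(deleted, null_rows)
```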

silvester23 5 years ago | | |

True, deleting by value does not seem to be possible. I can see how that would be painful if it's necessary for your use case.

sgt 5 years ago | |

You can't for example delete data from a time period to another, to the best of my knowledge.

sciurus 5 years ago | | |

This is actually pretty straightforward. For example,

`DELETE FROM "foo" WHERE time >= now() - 2d AND time < now() - 1d`

rweichler 5 years ago |

Figured I'd use this as an opportunity to plug my own service: https://eqe.fm

Only works on jailbroken devices but it works well, has a local backup, and has been maintained (by me) for 2 years now.

Server costs are $2.50/mo, so this will stay up as long as I am alive.

jbmsf 5 years ago |

I'd love to hear more about how your data ingestion works. I'm thinking of implementing TimescaleDB myself, but in my initial read of the docs, the focus seemed to be managing the database, not getting data into the database...

dominotw 5 years ago | |

same way you'd insert data into postgres.

jbmsf 5 years ago | | |

That's not really helpful. Let's assume you have a distributed system; you probably don't want all of your system components connecting directly to TimescaleDB. You also probably want to have some layer that implements queuing and handles back pressure if it can't insert into the database at the rate that events are coming in. You may want to batch insert data.

I'd assume that most anyone building a system like this at scale has to solve these problems; does everyone roll their own?
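One common shape for that layer, sketched under heavy assumptions: producers push events onto a bounded in-process queue, and a single writer drains it and inserts in batches. SQLite stands in for the database so the example runs on the Python standard library; the `events` table, the batch size, and the queue size are invented for the example. A production version would also flush on a timer and handle insert failures.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the writer to flush and exit


def run_writer(events, db_path, batch_size):
    """Drain the queue and insert rows in batches of up to batch_size."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts INTEGER, payload TEXT)")
    batch = []
    while True:
        item = events.get()
        if item is not STOP:
            batch.append(item)
        # Flush when the batch is full or when shutting down.
        if batch and (item is STOP or len(batch) >= batch_size):
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
        if item is STOP:
            break
    conn.close()


# A bounded queue gives back pressure: put() blocks while the queue is full.
events = queue.Queue(maxsize=1000)
db_path = os.path.join(tempfile.mkdtemp(), "events.db")
writer = threading.Thread(target=run_writer, args=(events, db_path, 100))
writer.start()

for i in range(250):
    events.put((i, f"event-{i}"))
events.put(STOP)
writer.join()

check = sqlite3.connect(db_path)
count = check.execute("SELECT COUNT(*) FROM events").fetchone()[0]
check.close()
print(count)
```

Across several machines the in-process queue would typically be replaced by a message broker, but the batching and back-pressure logic in the writer stays much the same.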

iEchoic 5 years ago |

Has anyone used Prometheus as well as TimescaleDB in production and have thoughts to share on those, comparatively?

akulkarni 5 years ago | |

We have quite a few in our Slack channel: slack.timescale.com

Feel free to ask over there :)

(Btw - TimescaleDB is designed to work with Prometheus. You can see more here: https://github.com/timescale/timescale-prometheus)

thejosh 5 years ago |

I really want to love timescaledb, it's great.. except for the minor issue of not being able to back up.

https://github.com/timescale/timescaledb/issues/1835

akulkarni 5 years ago | |

TimescaleDB definitely supports backups :-)

Here is a page from our docs on how to perform Backup & Restore: https://docs.timescale.com/latest/using-timescaledb/backup

Not sure what's going on in that one Github issue, but we are looking into it.

justinclift 5 years ago | | |

It seems to be affecting multiple people too. :(

k-rus 5 years ago | |

This issue is related to slightly confusing _warnings_ that the software prints out; it doesn't affect the _correctness_ of the backups.

The warnings are produced by COPY TO, which is used by pg_dump, since COPY TO doesn't copy chunks. It is not an issue for pg_dump, since it also does COPY TO on each chunk table.

Timescale engineer here - was part of discussion about this warning. We need to do another round and see how to remove this confusion.

akulkarni 5 years ago | |

Posted elsewhere, but also posting here for posterity:

That issue is now closed by the original author:

"Data is successfully dumped. also i can see the constraints, indexes are also copied successfully."

https://github.com/timescale/timescaledb/issues/1835

brightball 5 years ago | |

That seems like a pretty big deal.

Does a WAL backup approach work?

akulkarni 5 years ago | | |

TimescaleDB offers a number of backup and restore options, including wal-e (WAL-based), pg_dump & pg_restore:

https://docs.timescale.com/latest/using-timescaledb/backup

There are hundreds of thousands of TimescaleDB databases in production so this is generally not an issue.