First release candidate for TimescaleDB 1.0

First release candidate for TimescaleDB 1.0(blog.timescale.com)

107 points by iamd3vil 7 years ago | 41 comments

welder 7 years ago |

Funny timing... just installed 1.0 rc1 this week. Dual-writing and experimenting reading from TimescaleDB in prod currently.

So far it's been perfect... besides a small config issue[1] causing a crash that should have been detected when Postgres starts up, TimescaleDB gives me reliability and durability of Postgres without having to manually create partition tables.

What will happen when Postgres supports auto creating partition tables for declarative partitioning[2]? Will TimescaleDB still be necessary?

Feature request 1: when all rows of a chunk/partition table are deleted can that chunk be automatically dropped instead of having to manually run drop_chunks[3]?

Feature request 2: for the getting started docs, add an option to install with https://nixos.org

[1]: https://github.com/timescale/timescaledb/issues/733

[2]: https://www.postgresql.org/docs/11/static/ddl-partitioning.h...

[3]: https://docs.timescale.com/v1.0/api#drop_chunks

cevian 7 years ago | |

Thanks for trying out TimescaleDB. Glad you're having a good experience. :)

We think that declarative partitioning is a great step forward for Postgres partitioning. That said the process is still quite manual (including in PG11) -- unless I am mistaken, auto-creating partitioned tables is not in PG11, but perhaps may get released with PG12 at the end of 2019.

But even with auto-creating partitioned tables, TimescaleDB is still necessary for a few reasons:

- Support for multiple time-based query optimization in the planner

- More efficient handling of both queries and inserts when there are many partitions,

- Time-based analytical function like time_bucket,

- Support for changing partition sizes, etc.

We are also working on many new features, optimizations, and capabilities slated for the next year which will improve time-series workloads in PostgreSQL even more.

akulkarni 7 years ago | |

Also thank you for the feature requests. Would you mind creating Github [1] issues so they are easier to track and discuss?

[1] https://github.com/timescale/timescaledb

atombender 7 years ago |

Is TimescaleDB a good fit for events?

For example, we have data where each record is a tuple of (time, event, object), where the event is things like "viewed", "performedSearch", etc., and the object is event data as JSON. Let's say the object is a movie, in which case the payload might be something like:

   {"id": 123,
   "name": "The Godfather",
   "director_id": 456,
   "genre": "crime"}

Our reporting UI lets users do aggregations based on arbitrary dimensions, so we might do the equivalent of:

    select object->>'genre' as genre,
      object->>'director_id' as director_id,
      extract(month from time) as month,
      count(*) as count
    from events
    group by month, genre, director_id

Then we do things like pivot/nest values to display the groupings.

It's unclear to me whether TimescaleDB fits this use case, or whether we'd have to change how things are indexed. Right now we just index the whole object itself. Another scheme I've considered is that we could index it with the values as keys; so for example, the above event would be {"name:The Godfather": 1, "director_id:456": 1, "genre:crime": 1} and essentially represent the counts. A counting aggregation would then be rewritten as as a sum(). But it's unclear to me how you do intersections here without also creating all the permutations (i.e. something like {"director_id:456/genre:crime": 1}) beforehand.

We're currently using Elasticsearch for this. Performance is okay, but we're not entirely happy with it.

cevian 7 years ago | |

Yes TimescaleDB handles event data. I'd need more information about the exact nature of your queries to really give a good answer but there are multiple possible designs here:

- keep the event as a json object and use a GIN index (this would then be combined with constraint-exclusion on the time field for faster queries if using timescaledb)

- convert each JSON key to a column and use multi-column indexes or bitmap scans

- Normalize out the unique object json into a separate table with the columns id, object (json) and have your time-series table as time, event, object_id.

Happy to talk more on our slack support channel https://slack.timescale.com/

atombender 7 years ago | | |

Thanks! As for the nature of our queries, almost all of our use cases are group-by with count(*) plus some constraints.

So we're interested, for example, in the number of views, grouped by a few dimensions, over a specific time interval (per month or quarter, typically), with some constraints, including time (last 12 months, often).

Usually the user selects a whole bunch of dimensions, and we display this in the UI as a table where we pivot or nest based on the dimensions. For example, if you group by month, by region and customer, then you might get the months as horizontal columns, the regions as the vertical column, with totals for each region, then the customer nested under the region, with totals for each customer within the region. (The underlying query gives us a flat table, which we convert to a kind of dimensional hypertable structure for display.)

Our application is schema-agnostic, which is why we use JSON. If we were to avoid JSON, the only realistic option would be for the app to use SQL to create tables, and handle schema migrations, and sort of control the schema. That would make it a somehwat different app.

Of course, for many time windows, we're talking about tens or hundreds of millions of events. Elasticsearch is very fast at aggregating data and lets us do queries that span a few months in just milliseconds, whereas grouping an entire dataset containing years of data typically takes maybe 4-5 seconds, still fast enough to be acceptable for a reporting UI. In my experience, Postgres isn't as fast at counting.

nh2 7 years ago | |

I, too, would be very interested to find a Postgres-based replacement for Elasticsearch.

Specifically replacing ELK by Postgres-Kibana.

Requirements for that:

   * Events are (timestamp, {arbitrarily nested JSON object})
   * Filtering by timestamp must be fast
   * Full-text search on the object is required
   * Exact constraints on all object keys must be fast
   * It should be possible to define indices on the object's fields so that WHERE clauses are fast
   * Counting the number of results should be fast, or at least have fast reasonably accurate estimates
   * Support typical Kibana searches and filters

I have tried so far to implement Kibana's access patterns on Postgres, and got quite far, but never got past the problem of https://wiki.postgresql.org/wiki/Slow_Counting, which essentially means postgres must scan the whole table if you write a WHERE clause, even when using indices, because it has to double-check whether the returned rows weren't actually deleted.

cevian 7 years ago | | |

This seems like it could fit well with TimescaleDB but obviously would take testing. My only concern would be with Full-text search on JSON which I think is possible but I have never done. I would start with a timescaleDB hypertable on the Even table (time TimestampTz, object JSONB) with the following indexes (or some of them depending on testing):

- BTREE(time DESC)

- BTREE(time DESC, object)

- GIN(object)

- some kind of full text index

I don't know why the slow counting problem would be a problem with WHERE clauses since indexes are highly optimized to work with MVCC (e.g. hint bits etc). The wiki article itself says this isn't much of a problem when using indexes. But maybe you can elaborate?

StavrosK 7 years ago | | |

I will second this question. cevian, can you opine on whether TimescaleDB would be a good fit for this? My (very small) experience with it so far says yes, but I'd like the opinion of someone more knowledgeable.

RobAtticus 7 years ago |

Just to give a little (estimated) timeline:

* Currently our first release candidate is available via Github and Docker

* We are aiming to release a 2nd release candidate next week

* Sometime shortly after that (1-2 weeks) we'll go final 1.0.0

Happy to answer any questions!

scrollaway 7 years ago | |

Congrats on the upcoming release!

... any news on RDS? :)

I did my part and registered my company's interest but I don't think there's anything else I can do.

RobAtticus 7 years ago | | |

Nothing to share at the moment, but we appreciate you pinging them!

z0mbie42 7 years ago |

Hi, it's very interesting!

Do you test if TimescaleDB affect the performances of the relational part of Postgres ? In other words, can I use TimescaleDB as my primary DB to store both my users and time series data ?

mfreed 7 years ago | |

We have not observed any effects on relational table performance. Basically, TimescaleDB code in on-path for each parsed query, but if the query does not specify a hypertable, we just let the query "fall through" to Postgres' standard query planning/execution.

But you do point out a very common use case and strength of TimescaleDB: you can easily JOIN your time-series data (hypertables) with existing relational tables.

qaq 7 years ago |

Nice product for small/mid scale workloads. It's no Vertica or ClickHouse but if you do not need the scale should work well.

cevian 7 years ago | |

To put a number on this claim. TimescaleDB at this point can handle up to 100 TB of data.

wenc 7 years ago | | |

For pure time-series, that is huge.

The bulk of time-series data is floating point numbers, which are fairly tiny in terms of storage.

100 TB is a lot of time series.

comboy 7 years ago | | |

Where does the limitation come from? Or is it just the amount of data that it's been tested on?

qaq 7 years ago | | |

It's not only about size but about speed column stores that store compressed data and process data in blocks using SIMD perform significantly faster on many query types.

lykr0n 7 years ago | |

If TimescaleDB could get some clustering & sharding going on, that'd be awesome. Maybe team up with Citus

sciurus 7 years ago |

Congrats on the RC! TimescaleDB is a very neat idea, there's a lot to gain by building on Postgres.

There's also a very serious limitation due to that: the requirement to predefine schemas. My primary use case for a timeseries-focused db is storing system and application metrics. Using a commercial (e.g. datadog, signalfx) or open source (e.g. influxdb, prometheus) timeseries product I can submit arbitrary data. If I had to perform a schema migration every time a developer wanted to record a new metric, it would be extremely painful.

If this has changed since I last looked at TimescaleDB, please correct me!

mfreed 7 years ago | |

This is not the case.

TimescaleDB has full support for storing JSON data (inherited from postgres), including with indexes, so you do not need to fully pre-define your schema for these types of applications.

In fact, we added support for TimescaleDB to be a read/write backend for long-term Prometheus metrics. You pull from Prometheus via its remote storage backend, and the data appears automatically in TimescaleDB. But then unlike Influx and native Prometheus, you get to JOIN it against additional metadata for richer questions. For more information:

https://blog.timescale.com/sql-nosql-data-storage-for-promet...

https://github.com/timescale/pg_prometheus

sciurus 7 years ago | | |

Cool! It looks like the metrics view you build on top of the values and labels table makes it _reasonably_ easy to query. I still worry about how you get good autocomplete in Grafana, though.

Supporting receiving data from Prometheus is nice, but for people who aren't already invested in that it would be helpful if you either

a) picked an agent (e.g. Telegraf or Collectd) and taught it how to submit to TimescaleDB directly

b) picked a protocol already commonly used by agents (e.g. graphite plaintext protocol) and taught TimescaleDB to receive it

cevian 7 years ago | |

In addition to JSON support we also support EAV type schemas which look like:

- time | metric_name | value

- or time | metric_id | value

- or time | tags_id | value where the tags table contain normalized tags and metric names

a10c 7 years ago | |

This is my understanding too. Also one of the reasons we ruled it out when we were evaluating other in-house TSDB options.

WalterGR 7 years ago |

I feel like I learn of a new DBMS a couple times a month. Can anyone comment on how well these upstart DBMSes tend to be tested / proven?

cevian 7 years ago | |

(Timescale engineer here) We are an extension on top of Postgres and not a whole new DBMS. That means that we inherit a lot of the reliability / stability of Postgres (critically we do not change the WAL/data on disk format which leads to the fact that even as a young database we don't loose data). What we provide on top of Postgres are significant optimization aimed at time-series data. Postgres is optimized for relational data, and it requires significant work to optimize it for time-series data workloads on both the insert and query side.

TimescaleDB provides automatic, adaptive partitioning of time-series tables to make management of time-series tables both easier and more efficient. We add new time-based analytical functions as well as improve query planning and optimization.

twunde 7 years ago | |

For new databases that I'm interested in, I generally check if there's a Jepsen analysis for it (https://jepsen.io/analyses) or whether it's being open-sourced by a company that's been using it in production for a while like Foundation DB (https://news.ycombinator.com/item?id=16877395). In either of these two cases I would say it's well-tested. Otherwise, it's often hit or miss. Thankfully the number of new db products I've seen is fewer than the halcyon days of the MongoDB wave (2011-2014) and tend to be a bit more focused on specific problems that the db solves.

pritambaral 7 years ago | | |

Jepsen only tests distributed operation. For non-distributed databases (like PostgreSQL, which is — for the purposes of this discussion — what TimescaleDB is), Jepsen does not apply.

gaius 7 years ago | |

It’s just a plugin to Postgres, you just convert the tables you want to it’s type and otherwise it’s all the same.

No need to learn new databases all the time, Postgres does it all, regular tables, “documents” (JSON), timeseries, spatial/GIS, you name it.

taneliv 7 years ago | |

TimescaleDB is merely a PostgreSQL extension, not a complete DBMS all by itself.

brightball 7 years ago | |

If I recall correctly, this is just Postgres with extra handling around time series data isn’t it?

temuze 7 years ago |

Any updates on full, scale-out clustering?

And are there any plans on supporting columnar storage, for very wide tables?

select object->>'genre' as genre, object->>'director_id' as director_id, extract(month from time) as month, count(*) as count from events group by month, genre, director_id

* Events are (timestamp, {arbitrarily nested JSON object}) * Filtering by timestamp must be fast * Full-text search on the object is required * Exact constraints on all object keys must be fast * It should be possible to define indices on the object's fields so that WHERE clauses are fast * Counting the number of results should be fast, or at least have fast reasonably accurate estimates * Support typical Kibana searches and filters