Working with time in Postgres(craigkerstiens.com) |
Working with time in Postgres(craigkerstiens.com) |
If you're using Postgres right now and have any columns like start_* and end_* for anything (e.g numbers or dates), you need to stop what you are doing and use a range type. They are amazing. You can do things like a unique index that ensures there are no overlapping ranges, you can do efficient inclusion/exclusion indexing and much more.
Use them. I'm always surprised more people don't know about them.
1. https://www.postgresql.org/docs/9.6/static/rangetypes.html
[1] http://www.craigkerstiens.com/2017/04/30/why-postgres-five-y...
https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/...
Would you set the interval starting time, but leave the end of the interval as "present"/infinity? And then update the end of the interval when the job finished? Wouldn't you also need to have a cleanup function to manually "close" intervals if the worker crashed and restarted?
I wouldn't have the worker process handle this itself though, as you would need some form of cleanup. But you'd need the same with two individual columns
1. https://www.postgresql.org/docs/9.1/static/contrib-spi.html#...
https://twitter.com/joe_jag/status/510048646482894848?lang=e...
That's not correct, timestamptz doesn't have a timezone embedded in it. It's just that it's timezone-aware. A timestamptz corresponds to a universal point in time that have many human reprensentations, one for each timezone. psql uses the default timezone of the postgres instance to convert a timestamptz to a displayable string, so timestamptz are always displayed with a timezone, but that info does not come from the stored value.
Timestamptz needs timezone information only for operations that would give different results in different timezones, e.g. display as string, extract the day part, add a 1-month interval (DST info needed), etc. Comparing two timestamptz however doesn't require any timezone info.
The difference between timestamp and timestamptz is not about what they store, but about how they behave.
Edit: In my experience, this is not always obvious because postgres uses the default timezone of the instance whenever it needs such info with timestamptz operations. Using an explicit timezone often requires convoluted code.
Just make sure you include a timezone info in string representations in your SQL queries. For example '2000-01-01T00:00:00Z' where Z stands for UTC. Otherwise that would insert a timestamp into a timestamptz column, in which case postgres uses local timezone setting for conversion, implicitly; this is not what you want.
See http://phili.pe/posts/timestamps-and-time-zones-in-postgresq...
Also you should use an equivalent type in you app, i.e. python datetime with tzinfo or JS Date. And beware of UTC offsets: they can't handle DST. Python pytz and JS moment-timezone provide DST-aware timezone info (which is built-in in postgres).
Edit: if you can rely on your users system time for display that's even better because you wouldn't have to explicitly deal with those DST-aware timezone info.
At the end of the day I always come back to "solution with lots of possible bugs" or "unix millis everywhere". And I always choose the latter. It means we can't use nice date features in a lot of databases, but...eh? They've never seemed worth it.
Some things the author didn't mention that I like:
* Timestamp with time zone string parsing: '2013-06-27 13:15:00 US/Pacific'::timestamptz
* Timezone-aware to timezone-naive conversion (or vice versa): mytztime AT TIME ZONE 'US/Pacific'
* I haven't used tstzrange yet, but it looks pretty powerful.
Normally you would want to receive timezone-aware timestamps from the database, and format them in user's timezone at display time--perhaps in a template. But, if you're e.g. aggregating data for a day-over-day or month-over-month report, then the conversion to naive dates need to happen on the database side, so that day boundaries and month boundaries would match the user's timezone.
We have found that it is more annoying to keep track of the behavior of the library and of a home-grown date dimension. In my organization, we tend to use a standardized pattern that can handle arbitrary calendars, even when we're dealing with standard calendars.
Dates have attributes that group them together. "Month" is an attribute that you're familiar with. "Fiscal Period" can take on many specific definitions but it is analogous to "Month".
Those two concepts share a lot of properties. They each collect a series of contiguous dates. Each is adjacent to a similar grouping that falls sequentially and the last date that exists in one is one day prior to the first day that exists in the next. Each falls within a larger category like "Year" or "Fiscal Year".
Year+Period forms a composite key for a period. We can also assign a monotonically increasing field that increments by one with each subsequent period. That field allows simple arithmetic to shift forward and backward. We typically call this attribute PeriodIndex or PeriodSequential. I'll abbreviate to PI here.
If you have a reference PI, you can always find the immediate prior period by subtracting one from the reference. We can assign these for any grain of time period. We typically see Week, Period, Quarter, Semester, Year.
This is the baseline of how we handle dates. There are plenty of utility fields we'll maintain for specific time-based needs, but it's all sugar on top of that.
Yeah I didn't like it one bit. Sorta reminds me when I had to develop a Grantt chart component in flex for a client, so many problems with dates.
https://www.postgresql.org/docs/current/static/datatype-date...
(Edit: or not. See child comment)
If the date were 2016-01-01 and you compared it with what week Postgres thinks it is, you'd get:
SELECT date_part('week', '2016-01-01'::date);
date_part
-----------
53
(1 row)
This is because 2016-01-01 is still the 53rd week of 2015.Edit: Actually, 2017-01-01 is week 52 according to Postgres, probably because it uses Monday as the first day of the week.
Just like imperial system, only a couple of weirdos do that.
On the debate of "timestamp vs timestamptz" I reached the opposite conclusion of the author: I've got Amazon RDS instances set to UTC and my timestamps are stored as UTC times with no timezone awareness. Instead, I add the timezone while querying. I think this is better because I never have to remember anything about server settings!
I discovered that the `AT TIME ZONE` clause has two meanings, so I sometimes have to use it twice. In this example which selects all records created this month:
...WHERE create_date AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York' > date_trunc('month',current_date)
the first occurrence of `AT TIME ZONE` tells postgres that the timestamp is in UTC (which it is) and the second occurrence subtracts four or five hours (depending on daylight savings time) to show New York time. If I only had the second such clause it would subtract that many hours... it would think I was giving it a New York timestamp and I wanted to see the UTC time.I have this table:
CREATE TABLE all_dates ( date_stamp date NOT NULL, is_month_end boolean, is_year_end boolean, is_week_end boolean, is_quarter_end boolean, CONSTRAINT all_dates_pkey PRIMARY KEY (date_stamp) )
filled with data from 1st Jan 1980 to 31st Dec 2050, which is the range my application needs.
It's a mere 22k rows and has a whole host of uses.
timestamp does the same - stores value without timezone information.
the difference is with writing/reading those values where timestamptz behaves as you'd expect and timestamp ignores timezone information.
timestamptz - gives you the thing that exists: unique point in time, ie. if person A in australia and person B in europe hits the red button at the same time - timestamptz will have the same value, regardless of the fact that those two timestamp strings had different representations.
timestamp - gives you this local view of time: when person A in australia wakes up 6am to work and person B in europe wakes up at 6am to work - they will hit the snooze button and it will create same value in the database - even though those events happened hours apart.
in both cases you'd have to store timezone in separate column if you want to extract information on which timezone the timestamp was generated in. let me repeat - both cases loose information on timezone. they just do it in different way - timestamp by ignoring it completely and timestamptz by mapping it correctly to unix epoch.
with weeks as (
select week as week
from generate_series('2017-01-01'::date, now()::date, '1 week'::interval) weeks
),
SELECT weeks.week,
count(*)
FROM weeks,
test_results
WHERE
test_results.date_created > weeks.week
AND
test_results.date_created <= weeks.week - '1 week'::interval
Throws an error for me... ERROR: syntax error at or near "SELECT"
LINE 5: SELECT weeks.week,
^i.e.,
... weeks
)
SELECT weeks.week with weeks as (
select week as week
from generate_series('2017-01-01'::date, now()::date, '1 week'::interval) weeks
)
SELECT weeks.week,
count(*)
FROM weeks,
test_results
WHERE
test_results.date_created > weeks.week
AND
test_results.date_created <= weeks.week - '1 week'::interval
it throws... ERROR: column "week" does not exist
LINE 2: select week as week
^
I would move this to the post's own "replies" section, but it doesn't have one.Footnote:
> Here’s just a few examples of things you could do with interals:
The author of the article could want to fix the 'interals' typo in the text.
SELECT DATE_TRUNC('week', CURRENT_TIMESTAMP);
gives: 2017-06-05 00:00:00+00
vs: SELECT TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-WW"wk"');
gives: 2017-23wk SELECT to_char(now(), 'IYYY-"W"IW');
The difference is when the first week of the year starts. Compare yours to the ISO 8601 format for January 1 this year: $ SELECT to_char('2017-01-01'::date, 'IYYY-"W"IW');
> 2016-W52
$ SELECT to_char('2017-01-01'::date, 'YYYY-"W"WW');
> 2017-W01WHERE created_at >= now() - '1 week'::interval
would mean in the last 7 days right? not last week?
Did some work on this recently in mysql and had to resort to calculating this using strtotime('last week');
$ select '2016-03-31'::timestamp - '1 month'::interval;
> 2016-02-29 00:00:00
$ select '2016-03-31'::timestamp + '11 month'::interval;
> 2017-02-28 00:00:00
$ select '2016-02-29'::timestamp + '1 year'::interval;
> 2017-02-28 00:00:00For those who want to implement this in Python, I've written a gist: https://gist.github.com/Dowwie/bec0a29bcd37eea41cde8d5188626...
For example, I wrote an event registration marketplace some time ago. You might think "start time for event" would naturally fit as a unix timestamp, but it's a mistake. If you have an event at 10am in Las Vegas, moving it to Chicago shouldn't suddenly change the start time. And never store "all day" dates as a timestamp (ie datemidnight); timezone issues can easily produce off-by-one dates.
Basically, 'time' is not a single thing. You usually want to represent it the way your users think about it - and that isn't always like a unix timestamp (although it very often is).
Does this come up a lot? Moving from Las Vegas to Chicago would involve much more than just being aware of the time zone change.
However, you do need to do date arithmetic from time to time, whether using a wrapped epoch time in the database or in the application. "One day from now" turns out to be complicated enough that we delegate to libraries to get it right; and Postgres's implementation of these features is solid. When you want to `GROUP BY` day, for example, there are performance benefits to doing that on the database side -- and for analysts, there is often little alternative but to handle dates with DB provided functionality.
When it comes to date arithmetic, how do you handle that with UNIX timestamps?
Django postgresql adapter will aggressively show warnings for all of the cases where you're trying to insert a TZ-unaware date into a TZ-aware column.
Is this that hard to reproduce?
It is regrettable that APIs make it easy to create datetime objects without timezones, which is why sane people have moved to e.g. Joda Time and the libraries based on it.
Better example: Typically an events schedule should specify destination timezone when registering an event, so that if regulations on local timezone change, scheduled event times remain valid.
with weeks as (
select week
from generate_series(
'2017-01-01'::date,
now()::date,
'1 week'::interval
) week
)
select weeks.week, count(1)
from weeks, test_results
where
test_results.date_created > weeks.week and
test_results.date_created <= (weeks.week - '1 week'::interval)
group by week
;created_at BETWEEN (date_trunc('week', now())) AND (date_trunc('week', now() - '1 week'::interval))
If you store your event time as epoch time, you're storing "time of event in UTC" and "location of event". Which means every change of location requires you to recalculate and update the epoch time of the event. When users change Las Vegas to Chicago, you must update two fields - and if you ever screw it up, you have no way of knowing what the user originally intended.
This is a very bad case for epoch time. You may want to create an index on the calculated epoch time for queries, but store the datetime as localdatetime - the form that most accurately represents the source data.
You are of course free to ignore this advice and learn the hard way. Just stay off my lawn.
But in this case, there's a lot of other information you want to store: first date in the series, repetition frequency, time of the appointment, location (e.g. "Europe/London" as opposed to "GMT" or "UTC+0" which would not be adequate). Basically, you're not so much storing a series of dates as storing a thunk for generating dates.
One of those components was a scheduler for one-time and recurring tasks. For example, when you delete a system, you set a timer for 14 days to have the system remind you (via e-mail or by issuing an alert into the queue) to delete the system's backup. There were also a lot of recurring tasks that needed to be performed daily or weekly. Now if you have a task thats configured daily at 9 AM, it's tempting to implement the timestamp series as
while (true) { timestamp += 86400; }
And indeed, that's how it was done in the existing code. But that means that once DST starts or ends, your daily-at-9-AM-task suddenly happens at 8 AM or 10 AM instead. Whether that's a problem depends on the type of task and how the sysadmins organize their work. And then there's the monthly recurrence, which is even messier with plain timestamps.I cannot recall all details anymore, but I definitely remember that twice a year, after each DST change, someone would go through the list of recurring tasks, and move the times forward (or backward) by one hour manually.
EDIT: Maybe the simplest (though not easiest) solution to the irregular month lengths would be to attach giant thrusters to Earth and push it away from the sun a bit, so that our year is 366 days instead of 365 long. Then we make a calendar with 6 months per 61 days. As a bonus, it would reverse some of the effects of global warming. (Alternatively, go to 368 days and have 16 months with 23 days each.)
Your UI has a selector for Date/Time and a selector for Location. They might even be different screens. The Location may have been prefilled with a guess based on geoip. Or maybe they saved the wrong 'Springfield'.
The user took an action to change Location. If that has the side effect of changing Time, the user will be surprised.
I'm sure a use case exists where you want to store a timezone, but I don't know what it is. I try and use UTC everywhere, and only worry about time zones when displaying.
Eastern Time Zone is a single time zone, that has different offsets depending on season, formally referred to as EDT and EST to make it easier to identify as daylight or standard references to UTC.
Even regions within the same time zone don't follow the standard exactly, so Panama does not observe daylight saving time while New York does. This is why we have even more granular settings used for calendars and dates.
You either just store UTC and co, or you store it just as an datetime + offset.
100% of apps in production don't handle most of the timezone intricacies anyway and the sky hasn't fallen (heck, the sky hasn't even fallen for Y2K).
100% of apps? You must not know what timezone intricacies are then or just how much effort is spent to make sure time itself is properly handled, especially in any major application that has global users.
You'd be surprised.
Which leads me to my rule with time programming: Never fail to use a solid library, unless you're unfortunate enough to be writing one.
This[4] is a great, necessary but not sufficient book if you have to do that.
[1] https://en.wikipedia.org/wiki/International_Earth_Rotation_a...
[2] For example, https://en.wikipedia.org/wiki/Swedish_calendar
[3] https://en.wikipedia.org/wiki/Easter#Computations
[4] https://www.amazon.com/Calendrical-Calculations-Nachum-Dersh...
If you schedule a meeting for 4pm Tuesday next week, but DST happens Friday this week, you still want the meeting on 4pm Tuesday next week. The local timezone and local timestamp is necessary for that.
I'm not sure what this means. Either way, to store an appointment for a certain user, you need the local timezone of that user and the local timestamp. UTC version of that timestamp is optional (maybe for easier calculations in the database), but you must have the local timestamp to convert correctly.
Local TZ => UTC is a formula that is not constant. You cannot store the output of this formula and expect to reverse and get back the original input precisely because of the changing formula.
If you only store the timezone with a UTC timestamp, when you convert back after DST shifts you'll end up with 3pm or 5pm (depending on the shift), not the expected 4pm for the actual meeting. UTC in this case is not the anchor, the local timestamp the user set for the meeting is.
The timezone that the user will experience on that date.
>Either way, to store an appointment for a certain user, you need the local timezone of that user and the local timestamp.
Correct. I did not disagree with this.
>Local TZ -> UTC is a formula that is not constant. You cannot store the output of this formula and expect to reverse and get back the original input precisely because of the changing formula.
Also correct. But as I said, your original comment implied this is a concern for all DST events which is not true. It is only a concern when the time was converted using some DST assumption and then that assumption is invalidated (eg the software decided to convert "4pm Tuesday local" to UTC assuming the timezone on Tuesday will be UTC+7, but the DST change happens to come early this year and in fact it's UTC+8 on Tuesday).
Regardless, I agree that taking this risk is not correct. Such a time should not be converted to UTC to begin with. Better to store the original local time.
Timezones, especially globally, have enough updates being made that they are not something worth the risk just to save a few bytes.
The local TZ => UTC formula may change for any reason, whether it's daylight saving time or any other random situation. It might be relatively stable in some locations (although it can change even within a global timezone like EST based on location) but there are constant updates being made. You can look at the IANA and Microsoft timezone database updates yourself. Why create risk when not necessary?