Gitlab is down

66 points by riyadparvez 4 years ago | 59 comments

john_cogs 4 years ago |

GitLab team member here. We're aware of the incident and the status page has been updated. We will provide further updates on the status page as they become available.

(Edited now that the status page has been updated).

totaldude87 4 years ago | |

Awesome, thank you! Godspeed!

hackerlytest 4 years ago | |

Thanks. Seems to be back for me.

Dayshine 4 years ago | |

It took (by my measure) 13 minutes for a full outage to be represented on the status page.

I was under the impression that gitlab use gitlab.com for their work. Surely someone would have noticed within seconds that it was down?

Why have the misleading "updated a few seconds ago" text if it doesn't update on complete failure? :)

john_cogs 4 years ago | | |

Your impression is correct. We use GitLab.com and notice these incidents as they happen.

The delay in updating status is a result of our Incident Management process [0]. We have a Communications Manager on Call (CMOC) who leads communication throughout an incident. One of their responsibilities includes updating the status page. The slight delay between noticing the issue and updating the status page is a result of the time it takes for the CMOC to get alerted, assess the situation, and write the communication that is shared on the status page.

I'm not sure how the "updated a few seconds ago" messages are generated but I'll try to find out once the incident has been resolved.

0 - https://about.gitlab.com/handbook/engineering/infrastructure...

vasco 4 years ago | | |

After you notice I assume you have to declare an incident, get a call going, assess the extent of the issues, get the needed people involved, and then you'd announce on the status page. 13 minutes isn't amazing but it also isn't terrible. Perhaps you have better ways of keeping status pages updated much faster while also not ending up ramping up the posting of false positives.

pixl97 4 years ago | | |

It doesn't matter if each individual detects the outage, because they'll start by blaming the local source and move further up the tree rather than assign blame to a full system failure right off the bat. 99.9% of the time it's going to be a local failure affecting the individual.

Also, most alerting systems check multiple times before declaring a public outage; often 2 to 3 failures some seconds apart are needed.
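
A minimal sketch of that kind of check, assuming a hypothetical list of probe results and an illustrative threshold of 3 consecutive failures (the function name and the numbers are made up for illustration, not taken from any particular alerting system):

```python
from typing import Iterable


def should_declare_outage(results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive probe failures are seen.

    `results` yields True for a healthy probe and False for a failed one.
    The threshold of 3 is an assumption chosen for illustration.
    """
    consecutive_failures = 0
    for healthy in results:
        if healthy:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return True
    return False
```

With this rule a single failed probe (say, a local network blip) never triggers a public outage, while three failures in a row do. The price is a detection delay of at least a few probe intervals.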

totaldude87 4 years ago |

and the status page is all green (sigh) - https://status.gitlab.com/ whereas downdetector definitely shows that there are issues - https://downdetector.com/status/gitlab/

I guess status pages should now have a button to get data from the public... a crowdsourced status page?

dnsmichi 4 years ago | |

GitLab team member here - sorry for the delay, SREs are investigating.

https://status.gitlab.com/ is updated. Edit: https://status.gitlab.com/pages/incident/5b36dc6502d06804c08...

m4lvin 4 years ago | |

It is updated by now - a delay of some seconds is fine I think, and if they did not cache the status page it might go down in a blink now too ;-)

teekert 4 years ago | |

It's there now, added seconds ago.

oriettaxx 4 years ago | | |

yes, you're right.

I was just working on gitlab, so I would say the status page reflected the issue about 5 minutes later

routeroff 4 years ago |

overleaf.com is also down, https://status.overleaf.com/

Maybe some common servers?

tovej 4 years ago | |

Definitely related, overleaf probably depends on gitlab. The overleaf outage ended right when the gitlab outage did.

hobo_mark 4 years ago |

It's interesting that different pieces of gitlab.com appear to be running on a hodge-podge of GCP, DO, AWS and AZ... I wonder why that would be the case?

karmakaze 4 years ago | |

This could make good sense if they want to provide service where the customers use it.

temptemptemp111 4 years ago | |

But but CLOUD NATIVE! https://about.gitlab.com/cloud-native/

Traubenfuchs 4 years ago | |

Maybe they fell for the polynimbus meme.

simon04 4 years ago |

https://status.gitlab.com/pages/incident/5b36dc6502d06804c08... – January 31, 2022 15:22 UTC – System Wide Outage

markdog12 4 years ago |

Prob just a coincidence, but our Memorystore (hosted redis) instance went down with a "repairing" status around the same time.

rvz 4 years ago |

For SaaS, it is down. But not if you are self-hosting your own.

Just look at Gnome: [0]. They are doing it right.

[0] https://git.gnome.org

iamcreasy 4 years ago | |

Is gitlab.gnome.org/GNOME set to forward to git.gnome.org?

teddyh 4 years ago |

And this is why you self-host on your own instance.

analogsalad 4 years ago | |

Indeed, I can't remember a single time where a self-hosted server crashed. They run for decades with 0 downtime.

rvz 4 years ago | | |

Exactly. That is the whole point. I keep saying that about GitHub, since it goes down once a month. [0][1] GitLab SaaS is the same, but a self-hosted backup is better.

[0] https://news.ycombinator.com/item?id=29901564

[1] https://news.ycombinator.com/item?id=29379648

sdoering 4 years ago | | |

Not sure if this is irony (I often don't identify irony as such).

But I fatfingered a lot of self hosted stuff in my time.

manquer 4 years ago | | |

It doesn't, but I can fix it as opposed to waiting for their team to do it.

Also at gitlab.com scale the problems they face are very different from a typical deployment.

It is like maintaining your own car versus using the train.

On average, if you can fix your car (or hire a good mechanic, i.e. consulting) you would probably have a better experience than with public transport breaking down, which you are powerless to do anything about.

I would rather run a business depending on my car than on the train.

dengolius 4 years ago | | |

And even if it goes down you might have more options to get it back to work.

karmakaze 4 years ago | | |

They could, if you stuck to the yak shaving full-time.

oefrha 4 years ago | | |

Well, my GitLab instance at some point started to have its Prometheus eat 100% CPU all the time until I disabled the Prometheus component altogether, so there’s that. A cursory glance at the tracker just now says the issue is still open. That’s the kind of problem you get with self-hosting; it’s not all rainbows and unicorns.

hknapp 4 years ago |

Seeing the same error

oriettaxx 4 years ago | |

yes, while the status page https://status.gitlab.com says everything is fine :(

grrr... I am stuck with my job now .... :(

qayxc 4 years ago | | |

Must be a caching issue - it shows "System Wide Outage" for me.