Ask HN: How do you keep track of releases/deployments of dozens of micro-services?
We store the source code for all services in subfolders of the same monorepo (one repo <-> one app). Whenever a change in any service is merged to master, the CI rebuilds _all_ the services and pushes new Docker images to our Docker registry. Thanks to Docker layers, if the source code for a service hasn't changed, the build for that service is super-quick: it just adds a new Docker tag to the _existing_ Docker image.
Then we use the Git commit hash to deploy _all_ services to the desired environment. Again, thanks to Docker layers, containers that haven't changed from the previous tag are recreated instantly because they are cached.
From the CI you can check the latest commit hash that was deployed to any environment, and you can use that commit hash to reproduce that environment locally.
Things that I like:
- the Git commit hash is the single thing you need to know to describe a deployment, and it maps nicely to the state of the codebase at that Git commit.
Things that do not always work:
- if you don't write the Dockerfile in the right way, you end up rebuilding services that haven't changed --> build time increases
- containers for services that haven't changed get stopped and recreated --> short unnecessary downtime, unless you do blue-green
To avoid rebuilding all services on every commit, we use Bazel to help determine which services need to be rebuilt. Note that we don't use Bazel as a build system, just as a tool to see which services have changed; essentially we only use the `filegroup` Bazel rule. After a push to the git repo, we basically (1) run `git diff --name-only <before> <after>` to get the changed files, (2) run `bazel query 'rdeps(..., set(list of changed files))'` at both the `<before>` and `<after>` commits, and (3) combine the results of `bazel query` and look for the affected services.
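A minimal sketch of steps (2) and (3) in Python, for illustration only. The `//services/<name>` label layout and the function names are assumptions, not the poster's setup; the query string simply follows the `rdeps(..., set(...))` form quoted above, and running `bazel query` itself is left out.

```python
import re

# Assumed layout (hypothetical): every service lives under //services/<name>/.
SERVICE_LABEL = re.compile(r"^//services/([^/:]+)")


def build_rdeps_query(changed_files):
    """Build the query expression for the files reported by `git diff --name-only`."""
    return "rdeps(..., set({}))".format(" ".join(sorted(changed_files)))


def services_from_labels(query_output):
    """Extract service names from the labels printed by `bazel query`."""
    services = set()
    for line in query_output.splitlines():
        match = SERVICE_LABEL.match(line.strip())
        if match:
            services.add(match.group(1))
    return services


def affected_services(output_before, output_after):
    """Combine the query results from the <before> and <after> commits."""
    return sorted(services_from_labels(output_before) | services_from_labels(output_after))
```

Querying at both commits matters because a file that was deleted or moved only shows up as a dependency at `<before>`, while a newly added one only shows up at `<after>`.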
Once we know what services need to be rebuilt, we trigger the Jenkins jobs of those services. Each service has its own Jenkins job and Jenkinsfile (we use Pipeline). Here we also package the application as a Docker image and push it to the internal registry.
We keep track of what is released using a "production" branch for each service. Once we have a build to release, we (1) create a "release candidate" branch from the commit of the build, (2) update the k8s config file, (3) apply the k8s config, and (4) merge this branch into the production branch of the service if everything is ok. Then we merge the production branch back into master.
A couple of things different that we do since we are building and then deploying to AWS:
- Build only on dedicated deployment branches (beta, qa, preview, prod)
- Build all functions (transpile, yarn, lint, etc) on every merge into the branch, but only deploy functions with different checksums (saves on api calls to AWS)
- We cache node_modules, but otherwise don't have any special build requirements and babel takes care of targeting node6.10 for Lambda
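The "only deploy functions with different checksums" step could look roughly like the Python sketch below. It is an illustration under assumptions: the in-memory `files` mapping, the function names, and the idea that the previously deployed checksums are kept somewhere as a plain dict are all hypothetical.

```python
import hashlib


def bundle_checksum(files):
    """SHA-256 over a function's built files; `files` maps relative path -> bytes."""
    digest = hashlib.sha256()
    for path in sorted(files):          # sorted so the result is order-independent
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def functions_to_deploy(built, deployed_checksums):
    """Names of functions whose checksum differs from the one recorded at last deploy."""
    changed = []
    for name in sorted(built):
        if bundle_checksum(built[name]) != deployed_checksums.get(name):
            changed.append(name)
    return changed
```

A function that has never been deployed has no recorded checksum, so it always ends up in the list.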
Total build time is between 8 and 13 minutes. There are some things we could do to speed up the install that we haven't done yet because it's not an issue so far. A short list of things to note:
- Each function has its own package.json for its own packages. We maintain a list of npm packages that we download into a single folder first (that doesn't get deployed) to allow yarn to use those files from cache. We will eventually switch to an offline install for each function which essentially just copies the package folder and sets up anything it needs.
- We have a tarball package that includes all of our shared code / config files. Yarn seems to always want to download this file, regardless of whether we pre-download it.
- We deploy a single API endpoint for all of our micro services through API Gateway, which cuts down on the time to deploy since API Gateway has a pretty hard throttle. This means we create a deployment on API Gateway on every merge. We have one APIG for each environment.
Looks like a pretty solid build process. Thanks for the insight!
Why are you rebuilding _all_ the services, wouldn't it make sense to just rebuild the ones that have changes? You're now rebuilding perfectly working services without any new changes just because some other service changed, or am I misunderstanding something here?
For example you might have a Git history like this:
* 89abcde Fix bug in service_b
* 1234567 Initial commit including service_a and service_b
When 89abcde is pushed, the CI rebuilds both service_a and service_b so we can simply "deploy 89abcde" and you always have only one hash for all services, that is also nicely the same hash of the corresponding Git commit.
The trick to avoid rebuilding perfectly working services is to use Docker layer caching so that when you build service_a (that hasn't changed) Docker skips all steps and simply adds the new tag to the _existing_ Docker image. The second build for service_a should take about 1 second.
In our Docker registry we end up with:
service_a:1234567
service_a:89abcde
service_b:1234567
service_b:89abcde
But the two service_a Docker images are _the same image_, with two different tags.
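As a rough illustration of the "one hash for all services" idea, here is a small Python sketch that derives the image references and `docker build` invocations for a commit. The helper names and the assumption that each service lives in a subfolder named after it are hypothetical.

```python
def image_refs(services, commit_hash, registry=""):
    """One image reference per service, all tagged with the same commit hash."""
    prefix = registry.rstrip("/") + "/" if registry else ""
    return ["{}{}:{}".format(prefix, name, commit_hash) for name in sorted(services)]


def build_commands(services, commit_hash, registry=""):
    """argv lists for `docker build`, assuming each service lives in ./<name>/."""
    refs = image_refs(services, commit_hash, registry)
    return [
        ["docker", "build", "-t", ref, "./" + name]
        for name, ref in zip(sorted(services), refs)
    ]
```

Calling `image_refs(["service_a", "service_b"], "89abcde")` gives `service_a:89abcde` and `service_b:89abcde`, matching the registry listing above; whether the unchanged service is actually rebuilt is then up to the layer cache.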
So I'm curious, does each service instance have their own server, or do you have multiple services on one server instance?
I have some experience working with microservices. I saw the clear business benefits of being able to map design domain boundaries to repos and specific teams, and to let those teams be able to control their deployments while minimizing external dependencies.
But we seemed to be paying a lot in network chattiness, slow site response times, and networking costs. I'm wondering if we could have minimized those costs by sticking some of those microservices on the same server instance. Not really change service boundaries or interfaces, but change the methods that the microservice interfaces use to communicate.
First - If your change to the container is near the end of the build process (see earlier comment about smart container design), then the rebuild will only change the final few hashes and Docker is smart enough to not rebuild earlier hashes.
Second - Hashes are global, so if you have multiple containers that start with the same base (say, Alpine Linux + Python + NPM + etc.), Docker will share existing hashed layers. This means a much smaller distribution payload.
To (what I think is) your original question - you can tag the 'final' container itself. Tagging it with the Git hash is one way to get exactly what you're talking about.
The builds for all services happen in parallel, so the longest one determines the total time. Big Scala services take much longer than small React frontends. We cache both Maven and NPM modules from previous builds.
Ideally, if the pull request only modified a React component and didn't touch any Scala file, no Scala build is triggered because Docker finds a cached layer and skips the "sbt compile" step. To be honest, we are still working to make sure this always happens, we still trigger unnecessary sbt compiles because the Docker cache is not used correctly.
It takes a build from your build system (typically TeamCity, but not exclusively), deploys it, and records the deployment.
You can then check later what's currently deployed, or what was deployed at some point in time in order to match it with logs etc.
Not sure how useable it would be outside of our company though.
Independent deployments are one of the key advantages of microservices. If you don't use that feature, why use microservices at all? Just for scalability? Or because it was the default choice?
You can deploy the whole platform and/or refactor to a monolith, and maintain one change log which is simple.
That however has its own downsides, so you should find a balance. If you're having trouble keeping track, perhaps re-organize. I read in one HN article that Amazon had 7k employees before they adopted microservices. The benefits have to outweigh the costs. Sometimes the solution to the problem is taking a step back. Without more details it's hard to say.
So basically one option is refactor [to a monolith] and re-evaluate the split such that you no longer have this problem. Just throw each repo in a sub-folder & make that your new mono-repo & go from there, it is worth an exploratory refactoring, but not a silver bullet.
Sounds like the services were no longer 'micro' :)
Every component comes with a major/minor release number, which indicates the nature of the change that has gone in. For example, the major release number is incremented for a change that usually introduces a new feature/interface. Minor release numbers are reserved for bug fixes/optimizations that are more internal to the component.
The build manager can go through the list of all the delivered fixes and cherry pick the few which can go to the final build.
We have 200 services, counting beta and live test variants. Most of the difficulties vanished once we had declarative versioned control of our service config in the ‘headquarters’ repository.
Not aware of anyone else using this approach.
https://github.com/tim-group/orc
Basically, there's a Git repo with files in that specify the desired versions and states of your apps in each environment (the "configuration management database").
The tool has a loop which converges an environment on what is written in the file. It thinks of an app instance as being on a particular version (old or new), started or stopped (up or down), and in or out of the load balancer pool, and knows which transitions are allowed, eg:
(old, up, in) -> (old, up, out) - ok
(old, up, out) -> (old, up, in) - no! don't put the old version in the pool!
(old, up, out) -> (old, down, out) - ok
(old, up, in) -> (old, down, in) - no! don't kill an app that's in the pool!
(old, down, out) -> (new, down, out) - ok
(old, up, out) -> (new, up, out) - no! don't upgrade an app while it's running!
Based on those rules, it plans a series of transitions from the current state to the desired state. You can model state space as a cube, where the three axes of space correspond to the three aspects of the state, vertices are states, and edges are transitions, some allowed, some not. Planning the transitions is then route-finding across the cube. When i realised this, i made a little origami cube to illustrate it, and started waving it at everyone. My colleagues thought i'd gone mad.
You need one non-cubic rule: there must be at least one instance in the load balancer at any time. In practice, you can just run the loop against each instance serially, so that you only ever bring down one at a time.
This process is safe, because if the tool dies, it can just start the loop again, look at the current state, and plan again. It's also safe to run at any time - if the environment is in the desired state, it's a no-op, and if it isn't, it gets repaired.
To upgrade an environment, you just change what's in the file, and run the loop.
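The route-finding idea can be sketched in a few lines of Python. This is not orc's code: the six example rules above are generalised into three assumed rules (change version only while down and out, start or stop only while out, join the pool only when new and up), and a plain breadth-first search stands in for whatever planner the tool really uses.

```python
from collections import deque
from itertools import product

VERSIONS = ("old", "new")
RUN = ("up", "down")
POOL = ("in", "out")


def allowed(src, dst):
    """True if src -> dst is a single permitted transition (one cube edge)."""
    (v1, r1, p1), (v2, r2, p2) = src, dst
    if sum(a != b for a, b in zip(src, dst)) != 1:
        return False                      # an edge changes exactly one axis
    if v1 != v2:                          # upgrade only while down and out
        return r1 == "down" and p1 == "out" and v2 == "new"
    if r1 != r2:                          # start/stop only while out of the pool
        return p1 == "out"
    # pool change: leaving is always fine, joining needs a running new version
    return p2 == "out" or (v1 == "new" and r1 == "up")


def plan(start, goal):
    """Shortest list of states from start to goal, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in product(VERSIONS, RUN, POOL):
            if nxt not in seen and allowed(path[-1], nxt):
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With these assumed rules, `plan(("old", "up", "in"), ("new", "up", "in"))` walks out of the pool, down, upgrade, up, and back into the pool, and planning from a state that already equals the goal returns a single-element path, which is the no-op case described above.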
Full disclosure: I'm on the Spinnaker team
A Slack notification could do it. Or do you want to correlate deployments with other metrics?
In this case we instrument our deployments into our monitoring stack (influxdb/grafana) and use this as annotations for the rest of our monitoring.
We can also graph the number of releases per project on different aggregates.
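As a sketch of what "instrumenting a deployment" might look like, the Python below formats one deployment event as an InfluxDB line-protocol point. The measurement name `deployments`, the tag and field names, and the helper names are assumptions for illustration; sending the line to InfluxDB and wiring it up as a Grafana annotation are not shown.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def deployment_point(service, environment, version, timestamp_ns):
    """One line-protocol point: tags for service/environment, version as a string field."""
    field = version.replace("\\", "\\\\").replace('"', '\\"')
    return 'deployments,service={},environment={} version="{}" {}'.format(
        escape_tag(service), escape_tag(environment), field, timestamp_ns
    )
```

Keeping service and environment as tags is what makes it cheap to count releases per project or per environment later on.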
Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
How do you define dependencies for releasing new versions to service? Likely going to happen at some point when you have non-trivial changes to services.
Completely agree, that's why we instrument our releases so we can easily see what's deployed by service and environment.
> Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
Each commit is related to a ticket, which helps generate a changelog. We enforce a lot of things in each of our releases. We have an internal release tool heavily inspired by shipit from Shopify. We have the concept of soft/hard checkers to make sure a release won't break anything, or that you are aware of what could break with the current diff.
> How do you define dependencies for releasing new versions to service? Likely going to happen at some point when you have non-trivial changes to services.
As I said, we instrument our releases and can easily track how changes affect our performance/bugs.
We also try a lot not to release non-trivial changes in one big release by doing stuff like release part of the changes behind a feature flipper first or route only a part of the traffic to the new code path, ...
Then we don't have dozens of different services deployed and we're still a relatively small team (~20) so I'm pretty sure I don't have the full picture just yet :)
We also store stats in the service discovery app so versions can be promoted to "production" for a customer once the account management team has reviewed and updated their internal training.
For anyone that has begun the microservice journey, kubernetes can be intimidating but way worth it. Our original microservice infrastructure was rolled way before k8s and it's just night and day to work with now, the kubernetes team has thought of just about every edge case.
I could probably snapshot the kubernetes state to have a trail I can use to roll back to a point in time. Alternatively I thought about having CI update manifests in an integration repo and deploy from there, so that every change to the cluster is reflected by a commit in this repository.
- unit tests each service
- all services fan-in to a job that builds a giant tar file of source/code artefacts. This includes a metadata file that lists service versions or commit hashes
- this "candidate release" is deployed to a staging environment for automated system/acceptance testing
- it is then optionally deployed to prod once the acceptance tests have passed
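The metadata file mentioned above could be as simple as the following Python sketch. The field names (`candidate`, `services`) and the helpers are hypothetical; the point is only that two such files can be diffed to see what a release changes.

```python
import json


def build_manifest(candidate, versions):
    """Metadata for one candidate release: service name -> version or commit hash."""
    return {"candidate": candidate, "services": dict(sorted(versions.items()))}


def changed_services(previous, current):
    """Services whose version differs between two manifests (new services included)."""
    old, new = previous["services"], current["services"]
    return sorted(name for name in new if old.get(name) != new[name])


def to_json(manifest):
    """Serialise the manifest so it can be dropped into the release tarball."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```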
We use Escape to version and deploy our microservices across environments and even relate it to the underlying infrastructure code so we can deploy our whole platform as a single unit if needs be.
IMHO this makes sense if the microservices are developed by the same team. If we're talking about services developed and managed by different teams... maybe it's not a good idea.
I like that you enforce the commit/ticket relationship. Is this purely an agreed process or do you use other measures to keep things consistent? E.g. we typically add the ticket ref to each commit but at times that gets omitted.
Also, I think that (internal) release tool is something crucial as the team grows. Will check shipit a bit further.
Would you mind expanding a bit on the things you enforce for each of your releases?
> I like you enforce the commit/ticket relationship. Is this purely an agreed process or do you use other measures to keep things consistent? E.g. we typically add the ticket ref to each commit but at times that gets omitted.
We're not enforcing it but we might in the future if the team grows and this gets out of hand. At the moment we're just reminding people that they should and it works great so far.
> Would you mind expanding a bit on the things you enforce for each of your releases?
It's still early but so far we check:
- it's not Friday afternoon; we want to avoid having issues on the weekend as much as possible
- it's not outside office hours - we're still all in the same time zone
- there's no lock (we can lock the release in case something goes wrong)
- there's no schema migration. If there is we remind you how to safely migrate schema and who to ping if you have a doubt (usually it should have been caught at the PR review)
- there's someone from the ops/core team around (connected on slack)
- there are no translations missing for our main languages (French/English)
- plus a few sanity checks, e.g. that our master staging is healthy (release means promoting our master staging)
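A few of the checks in this list can be sketched as a small Python function with hard (blocking) and soft (warning) results. Which check is hard and which is soft, the 9-18 office hours, and all names here are assumptions for illustration, not how the internal tool is actually built.

```python
from datetime import datetime

OFFICE_START, OFFICE_END = 9, 18  # assumed office hours, local time


def release_checks(now, locked=False, has_migration=False, ops_online=True):
    """Return (hard, soft) lists of reasons; any hard reason blocks the release."""
    hard, soft = [], []
    if locked:
        hard.append("releases are locked")
    if now.weekday() == 4 and now.hour >= 12:      # Monday is 0, so Friday is 4
        hard.append("it is Friday afternoon")
    if now.weekday() >= 5 or not OFFICE_START <= now.hour < OFFICE_END:
        hard.append("it is outside office hours")
    if not ops_online:
        hard.append("nobody from ops/core is around")
    if has_migration:
        soft.append("diff contains a schema migration; check the migration guide")
    return hard, soft
```

A release tool would refuse to continue when `hard` is non-empty and just print the `soft` reminders otherwise.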
edit: also I forgot but this is the shipit I'm talking about https://github.com/Shopify/shipit-engine
- src/project1/function1/
- src/project1/function2/
- src/project2/function1/
- src/project3/function1/
- src/project3/function2/
- src/project3/function3/
Deploying the functions is done by project, so we deploy all of one project, then move to the next, and so on and so forth.
Have you encountered any issues to watch out for when only using one APIG for each environment (150 micro-services)? Have you encountered any downsides to doing this versus 1 micro-service to 1 APIG? I'm also running into the Gateway throttle limits and I think deploying many micro-services (like you have done) to 1 APIG is the best solution.
We have a custom script to deploy our own API Gateway using the AWS SDK and we generate a swagger file from simple json config files.
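Generating a swagger file from simple JSON configs could look something like this Python sketch. The config shape (`path`, `method`, `function`) and the function name are made up for illustration; the `x-amazon-apigateway-integration` block uses the Lambda proxy (`aws_proxy`) integration form, and the actual import/deploy call through the AWS SDK is not shown.

```python
def build_swagger(title, endpoints, region, account_id):
    """Merge per-endpoint configs into one Swagger 2.0 document for API Gateway."""
    paths = {}
    for ep in endpoints:
        # Invocation URI for a Lambda proxy integration.
        uri = (
            "arn:aws:apigateway:{region}:lambda:path/2015-03-31/functions/"
            "arn:aws:lambda:{region}:{account}:function:{fn}/invocations"
        ).format(region=region, account=account_id, fn=ep["function"])
        paths.setdefault(ep["path"], {})[ep["method"].lower()] = {
            "responses": {},
            "x-amazon-apigateway-integration": {
                "type": "aws_proxy",
                "httpMethod": "POST",   # Lambda is always invoked with POST
                "uri": uri,
            },
        }
    return {"swagger": "2.0", "info": {"title": title, "version": "1.0"}, "paths": paths}
```

Because every endpoint from every project ends up in the same document, one import and one deployment per environment covers all of them.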
For the API Gateway issues, so far there are a few things we have to watch out for:
- All lambda endpoints through APIG are lambda proxy type. This means we can have a framework handle standard request / response stuff. The downside is that we can't support binary endpoints easily because they haven't fixed that issue yet.
- HTTP proxy pass through endpoints have to be added to the swagger somehow before we deploy. This is a little annoying, but not a huge issue
- Merge vs Override for deployments. We merge in beta, and override in other environments. This allows us to keep endpoints exactly as they are, but allow flexible testing in beta
1 APIG for 1 micro service isn't great at scale IMO, since we run all our endpoints under one domain and mapping all of them would be a pain.