Principles for building and scaling feature flag systems (docs.getunleash.io)
This is my current battle.
I introduced feature flags to the team as a means to separate deployment from launch of new features. For the sake of getting it working and used, I made the misstep of backing the flags with config files, with the intent of replacing them with LaunchDarkly or Unleash ASAP.
Then another dev decided that these Feature Flags look like a great way to implement permanent application configs for different subsets of entities in our system. In fact, he evangelized it in his design for a major new project (I was not invited to the review).
Now I have to stand back and watch as the feature flags are being used for long-term configurations. I objected when I saw the misuse (in a code review I said "hey, that's not what these are for") and was overruled by management. This is the design, there's no time to update it, I'm sure we can fix it later, someday.
Lesson learned: make it very hard to misuse meta-features like feature flags, or someone will use them to get their stuff done faster.
- Some flags are going to stay forever: kill switches, load shedding, etc. (vendors are starting to incorporate this in the UI)
- Unless you have a very-easy-to-use way to add arbitrary boolean feature toggles to individual user accounts (which can become its own mess), people are going to find it vastly easier to create feature flags with per-user override lists (almost all of them let you override on primary token). They will use your feature flags for:
- Preview features: "is this user in the preview group?"
- rollouts that might not ever go 100%: "should this organization use the old login flow?"
- business-critical attributes that it would be a major incident to revert to defaults: "does this user operate under the alternate tax regime?"
You can try to fight this (indeed, especially for that last one, you most definitely should!), but you will not ever completely win the feature flag ideological purity war!
1. We could stick it in a standard conf system and serve it up randomly based on what host a client hits. (Or come up with more sophisticated rollouts)
2. Or we can put it as "perm" conf in the feature flag system and roll it out based on different cohorts/segments.
I'm leaning towards #2 but I'd love to understand why you want to prohibit long-lived keys so I can make a more informed choice. The original blog post's main reasons were that FF systems favor availability over consistency, so they make a poor tool if you need fast-converging global config. That becomes somewhat challenging here during rollbacks, but is likely not the end of the world.
So of course they'll be used for long-term configuration purposes, especially under pressure and for gradual rollouts of whole systems, not just A/B testing features.
The term "feature flag" has come to inherently have a time component because features are supposed to eventually be fully GA'd.
What I've seen in practice is feature flags are never removed so a better way to think about them is as a runtime configuration.
I'm at the point of deciding that Scrum is fundamentally incompatible with feature flags. We demo the code long before the flag has been removed, which leads to perverse incentives. If you want flags to go away in a timely manner you need WIP limits, and columns for those elements of the lifecycle. In short: Kanban doesn't (have to) have this problem.
And even the fixes I can imagine like the above, I'm not entirely sure you can stop your bad actor, because it's going to be months before anyone notices that the flags have long overstayed their welcome.
I'm partial to flags being under version control, where we have an audit trail. However time and again what we really need is a summary of how long each flag has existed, so they can be gotten rid of. The Kanban solution I mention above is only a 90% solution - it's easy to forget you added a flag (or added 3 but deleted 2)
The best you can do is expect the feature flagging solution to give some kind of warning for tech debt. Then equip them with alternative tools for configuration management. Rather than forbidding, give them options, but if it's not your scope, I'd let them be (I know as engineers this is hard to do :P).
I feel like feature flags aren't that far off though. They're fantastic for many uses of runtime configuration as mentioned in another comment.
There's multiple people in this thread complaining about "abuse" of feature flags but no one has been able to voice why it's abuse instead of just use beyond esoteric dogma.
I don't see the problem with developers using flags for configuration as a stopgap until there's a better solution available.
Um what? How could that ever work. It's like you are trying to find new exciting ways to break prod.
Curious how you plan to justify cost to "fix it" to management. If it ain't broke...
Accepting reality in this way means you'll design a config management system that lets you add feature flags with a required expiration date, and then notifies you when they're still in the system after the deadline.
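A minimal sketch of what "required expiration date plus a nag" could look like. The `FlagDefinition` fields and the `overdue_flags` helper are hypothetical names made up for illustration, not any vendor's schema; permanent flags are modelled here by an explicit `expires=None`, so that "forever" is a deliberate choice rather than a default.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FlagDefinition:
    name: str
    owner: str
    created: date
    # None marks a deliberately permanent flag (kill switch, entitlement).
    expires: Optional[date]


def overdue_flags(flags, today):
    """Return temporary flags that are still defined after their deadline."""
    return [f for f in flags if f.expires is not None and f.expires < today]


flags = [
    FlagDefinition("new-checkout", "alice", date(2023, 1, 10), date(2023, 3, 1)),
    FlagDefinition("loadshed-percent", "bob", date(2022, 6, 1), None),
    FlagDefinition("dark-mode", "carol", date(2023, 5, 1), date(2023, 9, 1)),
]

today = date(2023, 6, 15)
for flag in overdue_flags(flags, today):
    age_days = (today - flag.created).days
    # A real system would notify the owner; printing stands in for that here.
    print(f"{flag.name}: owner={flag.owner}, age={age_days} days, expired {flag.expires}")
```

Run from CI or a scheduled job, something like this also gives the "how long has each flag existed" summary people keep asking for.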
Temporary ones can be used to power experiments or just help you get to GA and then can be removed.
Permanent ones can be configs that serve multiple variations (e.g. values for rate limits), but they can also be simple booleans that manage long term entitlements for customers (like pricing tiers, regional product settings, etc.)
The architecture of unleash made it so simple to do in unleash vs having to evaluate, configure, and deploy a separate app config solution.
So it's good to be aware of _why_ those guidelines are considered a good thing, but as with any methodology, an engineer should be pragmatic in deciding when to follow it strictly, and when to adapt or ignore some of it.
That said, I wouldn't want to work on software that completely ignores 12 Factor.
For a more nuanced and careful discussion of the topic I like to reference: https://martinfowler.com/articles/feature-toggles.html
- Must support multiple SDKs, including Java and Ruby.
- Should be self-hosted with PostgreSQL database support.
- Needs to enable remote configuration for arbitrary values (not just feature flags). I don't run two separate services for this.
- Should offer some UI functionality.
- It should cache flag values locally and, ideally, provide live data updates (though polling is acceptable).
Here are the four options that met these basic criteria and underwent detailed evaluation:
- Unleash: Impressive and powerful, but its UI is more complex than needed, and it lacks remote configuration.
- Flagsmith: Offers remote configuration but appears less polished with some features not working smoothly; Java SDK error reporting needs improvement.
- Flipt: Simple and elegant, but lacks remote configuration and local caching for Java SDK.
- FeatureHub: Offers fewer features than Unleash and Flagsmith; its Java API seems somewhat enterprisey, but it supports remote configuration and live data updates.
Currently, I'm leaning towards FeatureHub. If remote configuration isn't necessary, Unleash offers more features, and if simplicity is key and local caching isn't needed, Flipt is an attractive option.
They fracture your code base, are sometimes never removed, and add complexity and logic that at best is a boolean check and at worst is something more involved.
I'd love a world where engineers are given time to complete their feature in its entirety, and the feature is released when it is ready.
Sadly, we do not live in that world and hence: feature flags.
I get what you'd like "as an engineer", but it ignores the needs of the business.
You should get as close as you can, release the product, and iterate.
Today's world is: release the product in some ramshackle form or fashion, collect feedback, iterate. Doing that introduces a new construct, feature flags, that would otherwise not be necessary.
They're typically used as a way of enabling a change for a subset of your services to allow for monitoring of the update and easier "rollback" if it becomes necessary.
They can be used for A/B testing, but this is not what they're typically used for.
It seems to be skipping past the use-cases and assumptions, in particular, describing what a system with feature flags looks and acts like, what the benefits and drawbacks are.
This is great feedback. Our intention was to describe how such a system works at scale, but I see we could do better in this section, thanks!
Do you have some use-cases in mind?
I like the idea of caching locally, although k8s makes that a bit more difficult since containers are typically ephemeral. People will use feature flags for things that they shouldn't, so eventually "falling back to default values" will cause production problems. One thing you can do to help with this is run proxies closer to your services. For example, LaunchDarkly has an open source "Relay".
Local evaluation seems to be pretty standard at this point, although I'd argue that delivering flag definitions is (relatively) easy. One of the real value-add of a product like LaunchDarkly is all the things they can do when your applications send evaluation data upstream: unused flags, only-ever-evaluated-to-the-default flags, only-ever-evaluated-to-one-outcome flags, etc.
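To make those categories concrete, here is a rough sketch of how upstream evaluation data could be classified. `EvaluationLog` is a made-up in-memory stand-in, not how LaunchDarkly or anyone else actually implements it; a real service would aggregate counts from many SDK instances over a time window.

```python
from collections import defaultdict


class EvaluationLog:
    """Tracks which values each flag evaluated to (hypothetical, in-memory)."""

    def __init__(self, defaults):
        self.defaults = dict(defaults)      # flag name -> in-code default value
        self.outcomes = defaultdict(set)    # flag name -> set of values seen

    def record(self, flag, value):
        self.outcomes[flag].add(value)

    def report(self):
        result = {}
        for flag, default in self.defaults.items():
            seen = self.outcomes.get(flag, set())
            if not seen:
                result[flag] = "unused"
            elif seen == {default}:
                result[flag] = "only default"
            elif len(seen) == 1:
                result[flag] = "only one outcome"
            else:
                result[flag] = "active"
        return result


log = EvaluationLog({"a": False, "b": False, "c": False, "d": False})
log.record("b", False)
log.record("c", True)
log.record("d", True)
log.record("d", False)
print(log.report())
```

Everything except "active" is a candidate for cleanup, which is exactly the signal you can't get from flag definitions alone.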
One best practice that I'd love to see spread (in our codebases too) is always naming the full feature flag directly in code, as a string (not a constant). I'd argue the same practice should be taken with metrics names.
One of the most useful things to know (but seldom communicated clearly near landing pages) is a basic sketch of the architecture. It's necessary to know how things will behave if there is trouble. For instance: our internal system uses ZK to store (protobuf) flag definitions, and applications set watches to be notified of changes. LaunchDarkly clients download all flags[1] in the project on connection, then stream changes.
If I were going to build a feature flag system, I would ensure that there is a global, incrementing counter that is updated every time any change is made, and make it a fundamental aspect of the design. That way, clients can cache what they've seen, and easily fetch only necessary updates. You could also imagine annotating that generation ID into W3C Baggage, and passing it through the microservices call graph to ensure evaluation at a consistent point in time (clients would need to cache history for a minute or two, of course).
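A toy version of that generation-counter idea, under the assumption of a single in-process store. `FlagStore` and `CachingClient` are illustrative names only; deletions, history retention, and the Baggage propagation mentioned above are left out.

```python
class FlagStore:
    """In-memory flag store with one global, monotonically increasing generation."""

    def __init__(self):
        self.generation = 0
        self._flags = {}  # name -> (value, generation of last change)

    def set(self, name, value):
        # Every change, to any flag, bumps the single global counter.
        self.generation += 1
        self._flags[name] = (value, self.generation)

    def changes_since(self, generation):
        """Return (current generation, flags changed after `generation`)."""
        changed = {
            name: value
            for name, (value, changed_at) in self._flags.items()
            if changed_at > generation
        }
        return self.generation, changed


class CachingClient:
    """Remembers the last generation it saw and fetches only newer changes."""

    def __init__(self, store):
        self.store = store
        self.seen_generation = 0
        self.cache = {}

    def refresh(self):
        self.seen_generation, changed = self.store.changes_since(self.seen_generation)
        self.cache.update(changed)
        return changed


store = FlagStore()
store.set("a", True)
store.set("b", 10)
client = CachingClient(store)
first = client.refresh()    # everything, since the client has seen nothing yet
store.set("b", 20)
second = client.refresh()   # only the flag that changed since the last refresh
```

The nice property is that "am I up to date?" becomes a single integer comparison, regardless of how many flags exist.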
One other dimension in which feature flag services vary is by the complexity of the rules they allow you to evaluate. Our internal system has a mini expression language (probably overkill). LaunchDarkly's arguably better system gives you an ordered set of rules within which conditions are ANDed together. Both allow you to pass in arbitrary contexts of key/value pairs. Many open source solutions (Unleash, last I checked, some time ago) are more limited: some of them don't let you vary on inputs, some only a small set of prescribed attributes.
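For anyone who hasn't seen the "ordered rules, ANDed conditions" model, a stripped-down sketch follows. It only supports equality checks on context attributes, and the rule format is invented for this example; real products offer richer operators (segments, percentages, comparisons).

```python
def evaluate(rules, default, context):
    """Return the value of the first rule whose conditions all match the context."""
    for conditions, value in rules:
        # Conditions inside one rule are ANDed; rules are tried in order.
        if all(context.get(attr) == expected for attr, expected in conditions.items()):
            return value
    return default


rules = [
    ({"country": "US", "plan": "enterprise"}, "new-login"),
    ({"beta": True}, "new-login"),
]

print(evaluate(rules, "old-login", {"country": "US", "plan": "free"}))
print(evaluate(rules, "old-login", {"country": "US", "plan": "enterprise"}))
print(evaluate(rules, "old-login", {"country": "DE", "beta": True}))
```

The context is just an arbitrary dict of key/value pairs, which is the part the more limited systems don't give you.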
I think the time is ripe for an open standard client API for feature flags. I think standardizing the communication mechanisms would be constricting, but there's no reason we couldn't create something analogous to (or even part of) the Open Telemetry client SDK for feature flags. If you are seriously interested in collaborating on that, please get in touch. (I'm "zellyn" just about everywhere)
[1] Yes, this causes problems if you have too many flags in one project. They have a pretty nice filtering solution that's almost fully ready.
[Update: edited to make 70% of it not italics ]
First, we're building a runtime configuration system on top of AWS AppConfig. YAML/proto validation that pushes to AppConfig via gitops and bazel. Configurations are namespaced, so the unique-names problem is solved. It's all open in git.
Feature flags are special cases of runtime configuration.
We are distinguishing backend feature flags from experimentation/variants for users. We don't have (or want) cohorting by user IDs or roles. We have a separate system for that and it does it well.
The last two points (distinguishing between experimentation/feature variants and feature flags as runtime configuration) are somewhat axiomatic differences. Folks might disagree, but ultimately we have that separate system that solves that case. They're complementary and share a lot of properties, but it saves a lot of angst if you don't force both to be the same tool.
Is this true? Unfortunately there are no sources indicated, and a quick check on Scholar doesn't show me anything of the sort.
Here's a list of case studies from some of the solutions referred in the comments, some focus on operational metrics, others in lead time to changes: https://www.getunleash.io/case-studies https://launchdarkly.com/case-studies/ https://www.flagsmith.com/case-studies
I’ve absolutely seen canary testing work in large environments with a lot of teams doing frequent deploys. The teams need to have the tooling to conduct their own canary testing and monitoring.
As soon as you’re involving external services or anything persistent you may not be able to undo the damage of misbehaving software by simply disabling the offending code with a flag.
In practice the cost/benefit of feature flags has never proven out for me; better to just speed up your deploys/rollbacks. The caveat is that I’ve only ever worked in web environments. I can imagine that with software running on an end-user device it could solve some difficult problems, provided you have a way to toggle the flag.
Are they using a kind of logic to determine to turn on/off a feature or do they query a central database to know that?
Can someone explain its basic mechanism? Thanks
- Require in code defaults for fault tolerance
- Start annoying the flag author to delete if the flag is over a month old
- Partial rollout should be by hash on user id
- Contextual flag features should always be supplied by client (e.g. only show in LA, the location should be provided by client)
With a per-flag salt as well, otherwise the same user will always have bad luck and be subject to experiments first.
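Putting the hashing, the per-flag salt, and the in-code default together, a sketch might look like the following. The flag dict layout and function names are assumptions for the example, not any SDK's API.

```python
import hashlib


def bucket(flag_name, salt, user_id):
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user_id, default=False):
    """Evaluate a percentage rollout; fall back to the in-code default."""
    if flag is None:  # definition unavailable, e.g. the flag service is unreachable
        return default
    return bucket(flag["name"], flag["salt"], user_id) < flag["percent"]


flag = {"name": "new-login", "salt": "k3", "percent": 25}
enabled = sum(is_enabled(flag, f"user-{i}") for i in range(10_000))
print(f"{enabled} of 10000 users enabled")  # should land near 25%
```

Because the flag name and salt are part of the hash input, a user's bucket for one flag tells you nothing about their bucket for another, so nobody is first in line for every experiment. Raising `percent` only ever adds users; nobody who had the feature loses it.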
No problem, filter that email directly to spam folder.
TL;DR if you break long posts into pages, at least have an option to see the whole thing in a single page.
I use a browser extension to send websites to my Kindle. It's great for long-ish format blog posts that I want to read, but I don't have the time at the moment. However, whenever I see long blog posts that are broken into sections, each one in its own page, it becomes a mess. It forces me to navigate each individual page and send it to my Kindle. Then in the Kindle I have a long list of unsorted files that I need to jump around to read in order.
I understand breaking long pieces of text into pages makes it neater and more organized, but at least have an option to see the whole thing in a single page, as a way to export it somewhere else for easy reading.
"Unleash is open-source, and so are these principles. Have something to contribute? Open a PR or discussion on our Github."
Hard to tell if it's generated or written in an attempt to be as plain English as possible, but either way feels strangely vacuous for a technical opinion piece. There's no writer's voice.
LaunchDarkly Split Apptimize CloudBees ConfigCat DevCycle FeatBit FeatureHub Flagsmith Flipper Flipt GrowthBook Harness Molasses OpenFeature Posthog Rollout Unleash
Here's my first draft of the questions you'd want to ask about any given solution:
Questionnaire
- Does it seem to be primarily proprietary, primarily open-source, or “open core” (parts open source, enterprise features proprietary)?
- If it’s open core or open source with a service offering, can you run it completely on your own for free?
- Does it look “serious/mature”?
- Lots of language SDKs
- High-profile, high-scale users
- Can you do rules with arbitrary attributes or is it just on/off or on/off with overrides?
- Can it do complex rules?
- How many language SDKs (one, a few, lots)
- Do feature flags appear to be the primary purpose of this company/project?
- If not, does it look like feature flags are a first-class offering, or an afterthought / checkbox-filler? (eg. split.io started out in experimentation, and then later introduced free feature flag functionality. I think it’s a first-class feature now.)
- Does it allow approval workflows?
- What is the basic architecture?
- Are flags evaluated in-memory, locally? (Hopefully!)
- Is there a relay/proxy you can run in your own environment?
- How are changes propagated?
- Polling?
- Streaming?
- Does each app retrieve/stream all the flags in a project, or just the ones they use?
- What happens if their website goes down?
- Do they do experiments too?
- As a first-class offering?
- Are there ACLs and groups/roles?
- Can they be synced from your own source of truth?
- Do they have a solution for mobile and web apps?
- If so, what is the pricing model?
- Do they have a mobile relay type product you can run yourself?
- What is the pricing model?
- Per developer?
- Per end-user? MAU?
I will toss our hat in the ring but we are early in this space! https://lekko.com
This seems like a MUST rather than a SHOULD, right?
Can you elaborate on this? As a programmer, I would think that using something like a constant would help us find references and ensure all usage of the flag is removed when the constant is removed.
The bigger problem is when the code constructs metric and flag names programmatically:
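# Anti-pattern: names assembled at runtime cannot be found by searching the code.
prefix = f"framework.client.requests.http.{status // 100 * 100}s"  # e.g. 404 -> "...http.400s"
recordHistogram(prefix + ".latency", latency)
recordCount(prefix + ".count", 1)
flagName = appName + "/loadshed-percent"
# etc...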
That kind of thing makes it very hard to find references to metrics or flags. Sometimes it's impossible, or close to impossible to remove, but it's worth trying hard.
Of course, this is just, like, my opinion, man!
If you create your own service to evaluate a bunch of feature flags for a given user/client/device/location/whatever and return the results, for use in mobile clients (everyone does this), PLEASE *make sure the client enumerates the list of flags it wants*. It's very tempting to just keep that list server-side, and send all the flags (much simpler requests, right?), but you will have to keep serving all those flags for all eternity because you'll never know which deployed versions of your app require which flags, and which can be removed.
[Edit: speling]
I'd argue that coming up with good UI that nudges developers towards safe behavior, as well as useful and appropriate guard rails -- in other words, using the feature flag UI to reduce likelihood of breakage -- is difficult, and one of the major value propositions of feature flag services.
2) The downside of rolling it out based on host is that you could refresh your page, hit a different host, and see the UI bouncing back and forth between versions. As long as you always plan to roll things to 100%, this is the perfect use case for a feature flag.
I think it's absolutely an opinion piece - defining specific items as principles by definition means expressing opinionated ideas about the relative priority of those items over others. Also, imperative mood contains value judgment, which is inherently opinion-based (e.g. "Never expose PII"). Making arguments for why you should or should not do things requires expressing opinions about relative importance, weight etc.
If this were instead an article describing what feature flags are, or one performing a survey of various approaches to building/scaling them, I think the lack of voice is just fine - that's dealing in statement of fact. But this article mandates and implores and exhorts - the value judgments inherent in that pathos are empty without genuine authorship.
Also I'm not saying the lack of voice is bad even for conveying meaning or teaching - more that it is jarring and uncanny to read imperative claims in an empty robotic voice devoid of ethos.
Finally, I also might be biased by my first documentation love, the zeromq guide, which is an extremely-strongly-opinionated piece of docs that does its job exceptionally well. I think when writing about how or why, a strong writer's voice is more compelling. This article stretches past just the what into those other question words, so its seeming lack of authorial authority falls flat to me.
Thanks for giving me an excuse to blabble lol.
We started with a customer specific configuration system that allows arbitrary values matching a defined schema. It’s very easy to add to the schema (define the config name, types, and permissions to read or write it in a JSON schema document).
We have an administration panel with a full view of the JSON config for our support specialists and an even more detailed one for developers.
Most config values get a user interface as well.
From there we just have a namespace in the configuration for “feature flags”. Sometimes these are very short lived (2-4 sprints until the feature is done), but others can last a lot longer.
There are an unfortunate couple that will probably never go away at this point (because of some enterprise customer with a niche use case in the “legacy” version of the feature that we’ve not yet implemented compatibility with and I don’t know when it will get on our roadmap to do so), but in the end they can just be migrated into normal config values if needed.
A little tooling layer on top lets us query and write to the configs of thousands of sites at once as well.
Using just the string-recognizable name everywhere is...better.
This sort of programmatic naming is a dangerous step down a slippery slope.
For example, a client may call myserver.com/mobile-flags?merchant=abcdef&device=123456&os=ios&os_version=15.2&app_version=6.1 and the server will pass back: flag1: true flag2: 39 flag3: false flag4: green
This seems to be a common theme. For example, LaunchDarkly has a mobile client SDK, but they charge by MAU, which would be untenable. So folks tend to write a proxy for the mobile apps to call. If the client (as in my example above) doesn't specify which flags it wants, then the metrics are missing, whether you're using a commercial product or your own: it'll simply tell you that all the flags got used. (Of course, you could be collecting metrics from the client apps).
But based on our experience, you'd be better off having the mobile client pass in an explicit list of desired flags, which will give accurate metrics.
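Roughly, the server side of "client enumerates its flags" could look like this. The handler name, the flag table, and the counter are all hypothetical, and real evaluation would of course depend on merchant, device, OS and so on rather than a static dict.

```python
from collections import Counter

# Stand-in for real per-request evaluation results.
ALL_FLAGS = {"flag1": True, "flag2": 39, "flag3": False, "flag4": "green"}

# (flag name, app version) -> number of times a client asked for it.
requests_per_flag = Counter()


def mobile_flags(requested, app_version):
    """Return only the flags the client asked for, and count each request."""
    response = {}
    for name in requested:
        requests_per_flag[(name, app_version)] += 1
        if name in ALL_FLAGS:
            response[name] = ALL_FLAGS[name]
    return response


print(mobile_flags(["flag1", "flag4"], app_version="6.1"))
print(requests_per_flag)
```

Once no deployed app version has requested a flag for a while, you have actual evidence that it's safe to delete, instead of serving it forever just in case.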
Hope that clarifies what I meant.
Also, undoubtedly contentious. If you want an amusing read, go check out LaunchDarkly's "comparison with Split" page and Split's "comparison with LaunchDarkly" page. It's especially funny when they make the exact same evaluations, but in reverse.
I'm also not convinced it's always a huge problem. I can imagine sometimes it is, but in most codebases I've worked on, it's more of an annoyance but not cracking the top 3 or 5 biggest problems we wanted to focus on.
IMHO the best solution is not something heavy-handed like a policy that we only use run-time config for fixed timeframes, or a process where we regularly audit and prune old flags. It's simply to keep a record of the config changes over time so anyone interested can see the history, and a culture where every engineer is encouraged to take a little extra time to verify and remove dead stuff whenever it crosses their path.
SaaS won't sell itself unless it redefines the problem and presents itself as a solution...
IME the feature flag interface is next to perfect for runtime configuration. I don't care for intended usage at all. You could say feature flags have found a great product-market fit, just that a segment of the market is a bit unexpected but makes perfect sense if you think about it.
Resetting to a known failsafe works as long as the risk of someone changing a backend service (or multiple services) at the same time is low. Once it isn't, you can most definitely do more damage (and make life harder for oncall).
Who controls the runtime config? One person? Half a dozen? One hundred plus? Is it being gated by approvals, or can anyone do it? What about auditability? If something does go wrong, how easily can I rule out you turning on that flag?
Finally there is simply the sheer permutations you introduce here. A feature flag is binary in many cases: on or off. A config could be in any number of states.
These things make me nervous as an architect, and I've seen well intentioned changes fail when good flag discipline wasn't followed. Using it as fullblown runtime config seems like a postmortem waiting to happen.
I feel like you could easily add a status to flags, to mark whether they are part of a release process, or a permanent configuration tool, and in the latter case, take them off the release interfaces.
If all the PRs are instantly rejected, that would be a bad sign, but I couldn't find someone who effectively used it. I mean, it's been around for a while but it didn't spread out, so that already gives me some hint
Feature Flags inherently introduce at least one branch into your codebase.
Every branch in your codebase creates a brand new state your code can run through.
The number of branches introduced by Feature Flags likely does not scale linearly, because there is a good chance they will become nested, especially as more are added.
Start with even an example of one feature flag nested inside another. That creates four possible program states. Four is not unreasonable, you can clearly define what state the program should be in for all four states.
Now scale that to a hundred feature flags, some nested, some not.
It becomes impossible to know what any particular program state should be past the most common configurations. If you can't point to a single interface in a program and tell me all of the possible states of it, your program is going to be brittle as hell. It will become a QA nightmare.
This is why Feature Flags should be used for temporary development efforts or A/B testing, and removed.
Otherwise you're going to have a debugging nightmare on your hands eventually.
Edit: Note that this is different from normal runtime configurations because normally runtime configurations don't have a mix of in-dev options and other temporary flags. Also, they aren't usually set up to arbitrarily add new options whenever it is convenient for a developer.
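To put numbers on the state-explosion argument above, here is a quick enumeration, assuming every flag is an independent boolean. That is the upper bound: nesting can make some combinations behave identically, but you still have to reason about which ones.

```python
from itertools import product


def flag_states(flag_names):
    """Enumerate every on/off combination of independent boolean flags."""
    return [
        dict(zip(flag_names, values))
        for values in product([False, True], repeat=len(flag_names))
    ]


for state in flag_states(["new_checkout", "new_pricing"]):
    print(state)

# Enumerating 100 flags is not feasible, but counting the combinations is.
print(2 ** 100)
```

Two flags give four combinations, ten give 1024, and a hundred give about 1.27e30, which is why "test every combination" stops being an option almost immediately.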
Branches are difficult to reason about? Yes, I agree.
Are branches necessary to make the product behave in a different way in some circumstances? Most of the time.
Do those circumstances require a branch? Unless you’re super confident about some part of code, yes? But why would you be?
Runtime configuration is not about making QA easy. It’s introduced because QA has been hell already so you can control rollout of code which you know wasn’t properly QA’d - or it was but turns out the thing you built isn’t the thing users want and the release cycle is too long to deploy a revert.
I’d say ‘branches are bad but alternatives are worse’.
If your QA was bad before, you've made it worse.
"I can toggle it off without pushing a new release" is a terrible bandaid for the problem.
As for why: if you don't deprecate the feature flag in some time span, you're permanently carrying both code paths. With ongoing associated dev and qa resources and costs against your complexity budget.
Permanent costs should only be undertaken after careful consideration, and should be outside the scope of a single dev deciding to undertake them. Whereas flags should be cheap to add to enable dev to get stuff into prod faster while retaining safety.
Permanently making something a config choice should be done after heavier deliberation because of the aforementioned costs, and you often want different tools to manage it. Including something heavier duty than a single checkbox/button in your internal CS admin tooling. These are often tied into contracts or legal needs, and in many cases salesforce should be the source of truth for them. Or whatever CPQ system you're using.
And LaunchDarkly's Big Segments fetch segment inclusion data live from redis (although I believe they then cache it for a while).
In my opinion this all gets back to the way we build product and the expectations we have for our product managers. I have no doubt that their jobs are difficult in many ways, but the lack of actual focus on product specifically as it relates to customer sentiment always strikes me as lazy especially when that data collection is basically passed off to the engineers.