Give me Terraform (as much as I hate it) any day.
I won't go as far as to say we burned bridges arguing back and forth about it but they were definitely significantly singed.
Config files simply don't work until they do. And if it's your job to stare at them for hours and hours a day then maybe that's okay with you, but if you expect other people to 'just learn' it you're an idiot or an asshole. Or both. Ain't nobody got time for magic incantations.
I also think it should tell you you're on the wrong path when your app is named after a verb and the data it deals with is all declarative.
Sure, "use code to deploy infrastructure" sounds great, and that is why we get stuff like Ant, Gradle, Pulumi, Jenkins Groovy scripts, .NET Aspire, ... until someone has to debug spaghetti code on a broken deployment.
* You can't have variables in an import block (for example, to specify a different "id" value for each workspace)
* There is no explicit way to make a resource conditional based on variables. Only a hacky way to do that using "count = foo ? 1 : 0"
* You can't have variables in the backend configuration, making it impossible to store states in different places depending on the environment.
* You can't have variables in the "ignore_changes" field of a resource, making it impossible to dynamically ignore changes for a field (for example, based on module variables).
* The VSCode extension for HCL is slow and buggy. Using TS with Pulumi or CDKTF makes it possible to use all the existing tooling of the language.
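For contrast with the "count = foo ? 1 : 0" workaround in the list above, here is a rough sketch of how a conditional, per-workspace resource reads in a general-purpose language. The `Resource` type, `buildStack` and the workspace names are made up for illustration; this is not the Pulumi or CDKTF API.

```typescript
// Hypothetical resource model, for illustration only (not a real Pulumi/CDKTF type).
interface Resource {
  type: string;
  name: string;
  props: Record<string, string | number | boolean>;
}

interface StackOptions {
  workspace: "dev" | "staging" | "prod";
  enableBackups: boolean;
}

// Build the list of resources for one workspace using ordinary control flow.
function buildStack(opts: StackOptions): Resource[] {
  const resources: Resource[] = [
    {
      type: "bucket",
      name: `uploads-${opts.workspace}`,
      props: { versioned: opts.workspace === "prod" },
    },
  ];

  // A conditional resource is a plain `if`, with no count/index bookkeeping.
  if (opts.enableBackups) {
    resources.push({
      type: "backup_plan",
      name: `backups-${opts.workspace}`,
      props: { retentionDays: 30 },
    });
  }
  return resources;
}

const devStack = buildStack({ workspace: "dev", enableBackups: false });
const prodStack = buildStack({ workspace: "prod", enableBackups: true });
```

The flip side, as others in the thread point out, is that nothing stops this function from growing into the spaghetti people complain about.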
You get the bonus of controlling the resource id and being able to selectively delete resources without worrying about ordering.
I find that with some handwringing, C# can be forced to do almost anything. Between extension methods, dispatch proxies and reflection you can pummel it into basically any shape.
Having to write a little boilerplate to make it happen can be a drag though. I do sometimes wish C# had something from a blank project that let me operate with as much reckless abandon as Object.assign does in js land.
Terraform sure is a quirky little DSL ain’t it? It’s so weirdly verbose.
But at the same time I can create some Azure function app, set up my GitHub build pipeline, get Auth0 happy and in theory hook up parts of Stripe all in one system. All those random diverse APIs plumbed together and somehow it manages to work.
But boy howdy is that language weird.
But yeah, at $previous_job, Terraform enabled some really fantastic cross-SaaS integrations. Stuff like standing up a whole stack on AWS and creating a statuspage.io page and configuring Pingdom all at once. Perfect for customers who wanted their own instance of an application in an isolated fashion.
We also built an auto-approver for Terraform plans based on fingerprinting "known-good" (safe to execute) plans, but that's a story for a different day.
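The comment above doesn't say how the fingerprinting worked, so the following is only a guess at the general shape, not the poster's implementation. It assumes the plan has been exported as JSON and that each entry carries a resource `type` and a list of `actions` (as in the `resource_changes` section of `terraform show -json`); the allowlist contents and resource names are invented.

```typescript
// Minimal slice of one planned change. Field names follow the JSON plan output;
// the example data further down is hypothetical.
interface ResourceChange {
  address: string;
  type: string;
  change: { actions: string[] };
}

// Reduce a plan to an order-independent fingerprint:
// "which kind of action on which kind of resource", ignoring no-ops.
function fingerprint(changes: ResourceChange[]): string {
  return changes
    .filter((c) => !(c.change.actions.length === 1 && c.change.actions[0] === "no-op"))
    .map((c) => `${c.type}:${c.change.actions.join("+")}`)
    .sort()
    .join("|");
}

// Auto-approve only when the fingerprint is on an explicit allowlist.
function isAutoApprovable(changes: ResourceChange[], knownGood: Set<string>): boolean {
  return knownGood.has(fingerprint(changes));
}

const knownGood = new Set(["aws_route53_record:create|aws_s3_bucket:create"]);

const safePlan: ResourceChange[] = [
  { address: "aws_s3_bucket.site", type: "aws_s3_bucket", change: { actions: ["create"] } },
  { address: "aws_route53_record.www", type: "aws_route53_record", change: { actions: ["create"] } },
  { address: "aws_vpc.main", type: "aws_vpc", change: { actions: ["no-op"] } },
];

const riskyPlan: ResourceChange[] = [
  { address: "aws_db_instance.main", type: "aws_db_instance", change: { actions: ["delete", "create"] } },
];
```

Here `isAutoApprovable(safePlan, knownGood)` passes while the database replacement in `riskyPlan` falls through to a human.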
https://helm.sh/docs/chart_template_guide/control_structures...
the complexity in one way or another must be preserved within the abstraction (in all likelihood) or you will have cases you cannot create in that layer or breakages which now have the total complexity of both the abstraction itself AND kubernetes itself required to fix.
i would not say IaC is going to provide you a magic solution to learning k8s, although the value in using IaC (e.g. Argo CD / Flux CD + Kustomize + ...) in K8s land is that you are no longer imperatively managing your cluster resources and therefore can keep them within a repository, managed like code. the point of the solution is not to make it easier for newcomers, but to make it easier to have teams manage and work together on an established cluster for deployments, ...
in the case of Pulumi, you leverage the single language with typechecking instead of relying upon K8s flavoured YAML, which is itself beneficial in many ways (since you can use your regular developer tooling)
wrt pkl, pretending K8s manifest structure underneath does not help because you will need to know how the keys within a manifest interact with the underlying system regardless, especially to understand functionality, e.g. node selectors, taints and tolerations, node affinity, ...
i previously managed a terraform-based deployment of several k8s clusters and it still required knowledge of those keys and values, alongside knowledge of the underlying resource types.
without those you can't implement things like GPU-based node selection for jobs which require a GPU, ...
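to make that concrete, a small sketch of a GPU job's pod spec written as typed data. the interfaces are a hand-written subset of the Pod spec, and the node label and taint are assumptions about how a cluster might be set up, not something from this thread.

```typescript
// Hand-written subset of the Kubernetes Pod spec, just enough for this example.
interface Toleration {
  key: string;
  operator: "Exists" | "Equal";
  value?: string;
  effect: "NoSchedule" | "PreferNoSchedule" | "NoExecute";
}

interface Container {
  name: string;
  image: string;
  resources?: { limits?: Record<string, string> };
}

interface PodSpec {
  nodeSelector?: Record<string, string>;
  tolerations?: Toleration[];
  containers: Container[];
}

function gpuJobPodSpec(image: string, gpus: number): PodSpec {
  return {
    // Assumed node label; every cluster labels its GPU nodes in its own way.
    nodeSelector: { "example.com/accelerator": "gpu" },
    // Assumes GPU nodes carry a nvidia.com/gpu:NoSchedule taint.
    tolerations: [{ key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" }],
    containers: [
      {
        name: "train",
        image,
        // Extended resource name exposed by the NVIDIA device plugin.
        resources: { limits: { "nvidia.com/gpu": String(gpus) } },
      },
    ],
  };
}

const gpuSpec = gpuJobPodSpec("registry.example.com/train:latest", 1);
```

the type checker will catch a misspelled `nodeSelector` or an invalid `effect`, but it has no idea whether the label or the taint matches what is actually on your nodes. that part is still k8s knowledge.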
Just use CloudFormation. Easy to write, declarative, vars (Parameters and Output exports). Trick is not to pile everything in one Stack. Use several.
It’s got everything you want:
- strong type system (TS),
- full expressive power of a real programming language (TS),
- can use every existing terraform provider directly,
- compiles to actual Terraform so you can always use that as an escape hatch to debug any problems or interface with any other tools,
- official backing of Hashicorp so it’s a safe bet
It’s a super power for infra. If you have strong software dev skills and you want to leverage the entire TF ecosystem without the pain of Terraform the language, CDKTF is for you.
(No affiliation)
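The "compiles to actual Terraform" point is the interesting one, so here is a toy sketch of the idea: ordinary code in, a declarative document out. This is not the CDKTF API. `synth` and the `acme` naming are invented; the only real thing it leans on is the nesting of Terraform's JSON configuration syntax (resource, then type, then name, then arguments).

```typescript
// Shape of Terraform's JSON configuration syntax: resource -> type -> name -> arguments.
type TerraformJson = {
  resource: Record<string, Record<string, Record<string, unknown>>>;
};

// Hypothetical "synth" step: loop in a real language, emit a static document.
function synth(environments: string[]): TerraformJson {
  const buckets: Record<string, Record<string, unknown>> = {};
  for (const env of environments) {
    buckets[`uploads_${env}`] = {
      bucket: `acme-${env}-uploads`,
      tags: { Environment: env },
    };
  }
  return { resource: { aws_s3_bucket: buckets } };
}

// The rendered document is what you would review or debug, not the loop.
const rendered = JSON.stringify(synth(["staging", "prod"]), null, 2);
```

Whatever the generating code looks like, the thing that gets planned and applied is the rendered document, which is the escape hatch the list above refers to.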
Why is this better than Ansible + Docker Compose?
What it provides is a set of conventions based on what most web apps look like.
Eg. built-in proxy with automatic TLS and zero downtime deployments, first-class support for a DB and cache, encrypted secrets, etc.
It’s definitely not for every use case, but for your typical 3-tier monolith on a handful of servers I found it does the job well.
Give me a forum (even Discourse will do), I'm tired of needing 3rd party spyware to interact with developers. That it is all closed off from search engines makes it even worse.
We've gone through a lot of pain to get this blueprint working since our AWS costs were getting out of hand but we didn't want to part ways with CDK.
We've now got the same stack structure going with Pulumi and DigitalOcean, having the same ease of development with at least 60% cost reduction.
It’s not a drop in replacement. It might be worth it depending on what you’re doing.
Anyone using CDK should switch to Pulumi though.
Using a complex programming language (C++ of the browser world) just for this has a big switching cost. Unless you're all in on TS. And/or have already built a huge complex IaC tower of babel where programming-in-the-large virtues justify it.
We've also started switching our custom Docker compose + SSL GitHub Action deployments to use Kamal [1] to take advantage of its nicer remote monitoring features
Terraform or CDK, I would want a simple shareable thing that did the boilerplate that I called with any variables I needed to change.
On EKS, you need to do the same version updates with the same amount of terror.
You do pay the extra for the further management to just run containers somewhere!
(you might want to say "every" instead of over, "is" instead of "ist")
on one hand, I can see how this is an unfalsifiable standard, on the other hand I can see the utility of solving a friction for people that messed up
The alternative, which I feel is far too common (and I say this as someone who directly benefits from it): You choose AWS because it's a "Safe" choice and your incubator gets you a bunch of free credits for a year or two. You pay nothing for compute for the first year, but instead pay a devops guy a bunch to do all the setup - In the end it's about a wash because you have to pay a devops guy to handle your CI and deploy anyway, you're just paying a little more in the latter.
I won't touch DO after they took my droplet offline for 3 hours because I got DDoS'd by someone that was upset that I banned them from an IRC channel for spamming N-bombs and other racial slurs.
And can you name a real cloud that charges a half-reasonable price for bandwidth? I consider $10/TB to be half-reasonable.
But all in all, it works. It's just a bit limited on what you can do with the actual language.
I suppose TypeScript does count as a real programming language, in that it’s Turing complete. But I can use Pulumi from (they claim) any programming language. Specifically, I can use it from Go. Why would I add TypeScript to my project when I can live in one language?
> - official backing of Hashicorp so it’s a safe bet
Given the number of folks leaving the Hashicorp platform, I think it’s arguably no longer a ‘safe bet.’
It turns out terraform is actually quite acceptable when you slap a decent language on top of it. Passable, even :)
Pro vs pulumi: you get a declarative template to debug and review
Pro vs CDK: The declarative template is applied via APIs instead of CloudFormation. The CDK CloudFormation abstraction leaks like hell
All of CDK does things in cloudformation, which made the whole thing stillborn as far as I’m concerned.
The CDK team goes to some lengths to make it better, but it’s all lambda based kludges.
Just write CloudFormation directly. Once you get the hang of the declarative style and become aware of the small gotchas, it's pretty comfy.
Exactly this. And don't make huge templates, split stuff logically to several stacks and pass vars via export/importvalue.
The problem with upserting is that if the resource already exists, its existing attributes and behavior might be incompatible with the state you're declaring. And it's impossible to devise a general solution that safely transitions an arbitrary resource from state A to state A' in a way that is sure to honor your intent.
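A small sketch of why that is, under the assumption that each resource type has some attributes that can only change by destroying and recreating the resource. The attribute names and the `immutable` list are hypothetical; real providers encode this per resource type.

```typescript
type Attributes = Record<string, string | number | boolean>;
type UpsertAction = "create" | "noop" | "update-in-place" | "needs-human";

// `immutable` lists attributes that (by assumption) cannot change without replacement.
function planUpsert(
  existing: Attributes | undefined,
  desired: Attributes,
  immutable: string[],
): UpsertAction {
  if (existing === undefined) return "create";

  const changed = Object.keys(desired).filter((key) => existing[key] !== desired[key]);
  if (changed.length === 0) return "noop";

  // Changing an immutable attribute means destroy-and-recreate, which may lose data.
  // The tool cannot know whether that matches the author's intent, so it has to stop.
  if (changed.some((key) => immutable.indexOf(key) !== -1)) return "needs-human";

  return "update-in-place";
}

const desiredDb: Attributes = { engine: "postgres", storageGb: 100, encrypted: true };
const immutableDbAttrs = ["engine", "encrypted"];

const fresh = planUpsert(undefined, desiredDb, immutableDbAttrs);
const resized = planUpsert({ engine: "postgres", storageGb: 50, encrypted: true }, desiredDb, immutableDbAttrs);
const conflicting = planUpsert({ engine: "postgres", storageGb: 100, encrypted: false }, desiredDb, immutableDbAttrs);
```

The first two cases are mechanical. The third is the one with no general answer: an unencrypted database that already holds data cannot be quietly turned into the declared one.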
So dumb. Trying to move to SST for only that reason
but if you add cdk to the path, you can still deploy, it's just that your cicd and deployment scripts are not all using bun anymore
You have to do a few adjustments which you can see here https://github.com/codetalkio/bun-issue-cdk-repro?tab=readme...
- Change app/cdk.json to use bun instead of ts-node
- Remove package-lock.json + existing node_modules and run bun install
- You can now use bun run cdk as normal
If I had to guess it's because
- more imperative background developers need to work with infrastructure and they bring over their mindset and ways of working
- infrastructure is more and more available through APIs and it saves a lot of effort to dynamically iterate over cattle than declaratively deal with pets
- things like conditionals, loops and abstractions are very useful for a reason
- in essence the declarative tools are not flexible enough for many use cases or ways of working, using a programming language brings infinite flexibility
Personally I am more in the declarative camp and see the benefits of it, but there is a certain amount of banging one's head against its rigidity.
It is classic "every problem is a nail to the person with a hammer". Complex languages - by definition - can solve a wider variety of problems than a simple declarative language but - by definition - are less simple.
Complex languages for infra - IMO - are the wrong tool for the wrong job because of the wrong skills and the wrong person. The only reason why inefficiencies like this are ever allowed to happen is money.
"Why hire a dev and an ops when we can hire a single devops for fractionally less?" - some excited business person or some broken dev manager, probably.
(For bigger stuff apparently CF has some limits relating to resources per single stack)
the property that equates to config files is "being static", which modern deployments are not.
a dsl like SQL involves one basic substrate (data organized in tables) that you can compile in your head. But declarative infra as code involves a thousand different things across a dozen different clouds.
Declarative will hold off spaghetti for... A bit. But it devolves to spaghetti as well (think fine grained acls, or places where order of operations, which the dsl does not specify and is magically resolved, becomes ambiguous).
And if you need to go off the reservation (dsl support doesn't exist or is immature for rapidly evolving platforms, need some custom postprocess steps) then you are... What?
Probably writing code and scripts to autoinvoke on the new node, phone home to a central.... Yup that's code.
Finally, declarative code has an implicit execution loop. But for something as complicated as IaC, that execution loop isn't well documented. And some committed changes to declarative code may trigger a destructive pass followed by a possibly broken constructive phase.
It's a tough problem.
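For anyone who hasn't looked under the hood, the "magically resolved" ordering is roughly a topological sort over the dependency graph. Below is a minimal sketch of that idea with made-up resource names; real tools add parallelism, partial failure handling and a lot more.

```typescript
// Each resource lists the resources it depends on (hypothetical names).
type Graph = Record<string, string[]>;

// Depth-first topological sort: dependencies are created before their dependents.
function createOrder(graph: Graph): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  const visit = (node: string): void => {
    if (state[node] === "done") return;
    if (state[node] === "visiting") throw new Error(`dependency cycle at ${node}`);
    state[node] = "visiting";
    for (const dep of graph[node] ?? []) visit(dep);
    state[node] = "done";
    order.push(node);
  };

  // Sorting the keys keeps the result deterministic for a given graph.
  Object.keys(graph).sort().forEach((node) => visit(node));
  return order;
}

// Destroying has to run in the opposite direction: dependents first.
function destroyOrder(graph: Graph): string[] {
  return createOrder(graph).reverse();
}

const infra: Graph = {
  vpc: [],
  subnet: ["vpc"],
  database: ["subnet"],
  app: ["subnet", "database"],
};

const up = createOrder(infra);
const down = destroyOrder(infra);
```

The destructive-then-constructive failure mode mentioned above is what you get when a change forces part of `down` to run and then `up` fails halfway through.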
I’ve been burned so many times here that I hate all of this stuff with an extreme passion.
Crossplane seems to be a genuinely better way out but there are big gotchas there also like resources that can simply never be deleted
I use C# extensively for most other things I do, but this is the one area where I prefer not to use it.
Making Terraform changes every six weeks was enough time that we forgot everything and had to refresh our memories. Every time it felt like going into the water in a northern beach and forgetting how goddamned cold the water was, then reproaching yourself for forgetting.
I’m sure there are lots of DO clients seeing the same things we did, but not realizing it.
We did see it (multiple DCs—we didn’t just not try to fix this before going to AWS) in multiple cases with tens of clients so if there’s good news it’s that if you can monitor like 100 clients distributed over a wide area and all of them behave as expected you may not be experiencing what we did. What we saw was closer to 5% with absurd slowness or frequently-dropped connections than to 0.01%.
And if you are just operating a website and sticking Cloudflare or whatever in front of DO anyway, this doesn’t matter. I expect that’s why it’s not a more widely-reported issue.
The old solution in on-prem was to populate machines with 2/3 to 3/4 of their max addressable memory and push back on the expensive upgrade as long as possible, or at least until memory prices came down for the most expensive modules. Then faster hard drives or new boxes are the next step.
This is one hyper annoying area.
It is possible to get around it, but it's ugly, drop to L1 and override logical id:
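// Assumes CDK v2: import * as ec2 from 'aws-cdk-lib/aws-ec2'
const vpc = new ec2.Vpc(this, 'vpc', { natGateways: 1 })
// The L1 (CloudFormation-level) resource sits behind the L2 construct's defaultChild
const cfnVpc = vpc.node.defaultChild as ec2.CfnVPC
// Pin the logical ID so moving the construct doesn't replace the VPC
cfnVpc.overrideLogicalId('MainVpc')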
You have to do this literally for every resource that's refactored. For us, we run 2 stacks. One that basically cannot/should-not be deleted/refactored. VPC, RDS, critical S3 buckets - i.e. critical data.
The 2nd stack runs the software and all those resources can be destroyed, moved whatever w/o any data loss.
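If it helps to see why refactoring bites here: logical IDs are derived from where a construct sits in the tree, so moving it produces a new ID and CloudFormation treats it as a different resource. The toy below shows the general idea only. It is NOT CDK's real algorithm, and `toyLogicalId` and `resolveId` are invented names.

```typescript
// Toy stand-in for logical ID generation: join the construct path and append a short hash.
// This only mimics the general idea; CDK's actual scheme differs.
function toyLogicalId(path: string[]): string {
  const joined = path.join("/");
  let hash = 0;
  for (let i = 0; i < joined.length; i++) {
    hash = (hash * 31 + joined.charCodeAt(i)) >>> 0;
  }
  return path.join("") + hash.toString(16).toUpperCase();
}

// An override pins the ID for one construct path, similar in spirit to overrideLogicalId.
function resolveId(path: string[], overrides: Record<string, string>): string {
  return overrides[path.join("/")] ?? toyLogicalId(path);
}

// Before the refactor the VPC lives directly in the stack; afterwards under a "Network" construct.
const idBefore = resolveId(["Stack", "vpc"], {});
const idAfter = resolveId(["Stack", "Network", "vpc"], {});

// With the ID pinned at both locations, the refactor no longer changes it.
const pinnedBefore = resolveId(["Stack", "vpc"], { "Stack/vpc": "MainVpc" });
const pinnedAfter = resolveId(["Stack", "Network", "vpc"], { "Stack/Network/vpc": "MainVpc" });
```

`idBefore` and `idAfter` differ, which is the replace-on-refactor problem. The pinned pair stays at `MainVpc`, which is why the override has to be repeated for every resource you move.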
in my experience you'd need to read the CDK source code to find the offending node and call `overrideLogicalId`
there is a library to do it in nicer way: https://github.com/mbonig/cdk-logical-id-mapper
however it does not work in every case
Why, dear god, would you put VPC and RDS in one stack? They are much better off as separate CFN stacks.
But circular dependencies can also lead to issues here where CDK will prevent you from deleting a resource used or referenced by a different stack.
If you don't mind sharing, suppose (because it's what I was doing) I was trying to create personal dev, staging, and prod environments. I want the usual suspects: templated entries in route53, a load balancer, a database, some Fargate, etc.
What are you meant to do here? Thank you.
You have YAML/JSON that k8s API wants, that is fed through helm which is fed through helmsman or whatever newer thing. There might be a layer or two of other templating around. Sometimes companies have built systems so developers/devops don't even have the ability to see what the final compiled version of the template is which is like the mother of all: "works on my laptop" problems.
It's super easy to break text based templating because of some space, tab, string escaping or whatever.
YAML makes it worse as there are lots of gotchas and different ways of doing things. JSON, being quite verbose and inflexible, at least has strong structure right in your face so it's a bit easier to figure out what went wrong.
With a proper programming language data structure you can be much better with verifying that the things you add or remove or iterate over will produce a valid result, much better refactoring and working as a team independently.
Every time I see " | nindent whatever" I'm asking why the fuck the tool cannot manage indentation.
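A sketch of the alternative being argued for: build the manifest as a data structure and let a serializer handle quoting and indentation. The manifest fields are the usual Deployment ones; `renderDeployment` and the example values are made up. It emits JSON, which kubectl accepts as-is.

```typescript
// Build the manifest as data and serialize once at the end.
// No manual indentation or string escaping anywhere.
function renderDeployment(name: string, image: string, env: Record<string, string>): string {
  const manifest = {
    apiVersion: "apps/v1",
    kind: "Deployment",
    metadata: { name },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: name } },
      template: {
        metadata: { labels: { app: name } },
        spec: {
          containers: [
            {
              name,
              image,
              // Sorted for stable output; values are passed through untouched.
              env: Object.keys(env)
                .sort()
                .map((key) => ({ name: key, value: env[key] })),
            },
          ],
        },
      },
    },
  };
  return JSON.stringify(manifest, null, 2);
}

// A value that would wreck a text template: quotes, a newline, and something list-like.
const nastyValue = 'he said: "hi"\n  - not a list item';
const manifestText = renderDeployment("web", "nginx:1.27", { GREETING: nastyValue });
```

The nasty value round-trips intact because the serializer escapes it, which is exactly the job `nindent` and friends push back onto the template author.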
Wasn't fun.
And it generates shitty CFN, we can do better ourselves :)
"There are generally two approaches to IaC: declarative (functional) vs. imperative (procedural). The difference between the declarative and the imperative approach is essentially 'what' versus 'how'."
https://en.wikipedia.org/wiki/Infrastructure_as_code#Types_o...
Honestly, I only use terraform with hiera now, so I pretty much only write generic and reusable "wrapper" modules that accept a single block of data from Hiera via var.config. I can use this to wrap any 3rd party module, and even wrote a simple script to wrap any module by pointing at its git project.
That probably scares the shit out of folks who do the right thing, and use a bunch of vars with types and defaults. But it's so extremely flexible and it neutered all of the usual complexity and hassle I had writing terraform. I have single-handedly deployed an entire infrastructure via terraform like this, from DNS domains up through networking, k8s clusters, helm charts and monitoring stack (and a heap of other AWS services like API Gateway, SQS, SES etc). The beauty of removing all of the data out to Hiera is that I can deploy new infra to a new region in about 2 hours, or deploy a new environment to an existing region in about 10 minutes. All of that time is just waiting for AWS to spin things up. All I have to do in code is literally "cp -a eu-west-1/production eu-west-2/production" and then let all of the "stacks" under that directory tree deploy. Zero code changes, zero name clashes, one man band.
The hardest part is sticking rigidly to naming conventions and choosing good ones. That might seem hard because cloud resources can have different naming rules or uniqueness requirements. But when you build all of your names from a small collection of hiera vars like "%{product}-%{env}-%{region}-uploads", you end up with something truly reusable across any region, environment and product.
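For readers who haven't used Hiera, the naming trick boils down to expanding "%{name}" placeholders from a small set of variables. A minimal sketch of that expansion follows; `interpolate` is a stand-in written for this comment, not Hiera's implementation, and the `shop` product name is invented.

```typescript
type Vars = Record<string, string>;

// Expand Hiera-style %{name} placeholders.
// Unknown names throw instead of silently producing an empty string.
function interpolate(template: string, vars: Vars): string {
  return template.replace(/%\{(\w+)\}/g, (_match: string, key: string) => {
    const value = vars[key];
    if (value === undefined) throw new Error(`missing variable: ${key}`);
    return value;
  });
}

const bucketName = interpolate("%{product}-%{env}-%{region}-uploads", {
  product: "shop",
  env: "production",
  region: "eu-west-2",
});
```

Failing loudly on a missing variable matters here, since a silently empty segment is how you end up with name clashes across regions.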
I'm pretty sure there's no chance I'd be able to do this with Pulumi.
So at top of your IaC, you have module naming {variables as inputs} then all other resources are aws_s3 { name = module.naming.s3bucket }
regions = [
"eu-west-1",
+ "eu-west-2",
]
for region in regions:
…I meant that I doubt that I could 'cp -a' on a whole deployment tree, and deploy the copy successfully without having to make any code changes.
Although thinking about it, I take it back. It may be possible with Pulumi with the right code structure and naming conventions, and if configuration were separated entirely from the codebase, and if variables were inferred from the directory structure. That is really the thing that allows me to do it.
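A sketch of the "variables inferred from the directory structure" part, assuming a `<region>/<env>/<stack>` layout like the "eu-west-1/production" tree mentioned earlier. The layout, `contextFromPath` and the `networking` stack name are my assumptions, not a description of the actual setup.

```typescript
interface DeploymentContext {
  region: string;
  env: string;
  stack: string;
}

// Derive the deployment context from the last three path segments,
// assuming a <region>/<env>/<stack> directory layout.
function contextFromPath(path: string): DeploymentContext {
  const parts = path.split("/").filter((part) => part.length > 0);
  if (parts.length < 3) {
    throw new Error(`expected <region>/<env>/<stack>, got "${path}"`);
  }
  const [region = "", env = "", stack = ""] = parts.slice(-3);
  return { region, env, stack };
}

// Copying the tree to a new region changes the context without touching any code.
const original = contextFromPath("deploy/eu-west-1/production/networking");
const copied = contextFromPath("deploy/eu-west-2/production/networking");
```

With that in place, the code never names a region or environment itself, which is what would make a plain directory copy deployable.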
TL;DR: where a cloud provider hosts customers for which there are real-world consequences for data leakage, not a single customer can be at-risk for data leakage. It's a different line of thinking, almost "a different world", to those who have this line of thinking vs those who don't.
"The thing about reputations is you only have one".
By contrast even more than ten years before that, AWS was publishing whitepapers about how all contents of RAM to be used by a VM are initialized before a VM is provisioned, and other efforts to proactively scrub customer data.
I worked at a niche cloud provider a bit over ten years ago. We used Intel QAT for client-side encryption for our network attached pools of SSD. We were able to offer all-SSD at low cost and without security blindspots by crypto key rotation implemented by compartmentalized teams and also physical infrastructure compartmentalization patterns. Which, about half a decade later we found we were second only to AWS and almost second (but ahead of in other ways) to some smaller cloud-style hosting provider.
I don't know if it really meets that bar, but I won't argue about that right now. I'm just going to ask again for your definition of "real cloud" and whether you can suggest some that don't price gouge bandwidth (and aren't oracle, I would not consider them worthy of trust either).
If you’re ignoring guidance and patterns and getting mad reinventing the wheel, that’s on dev. If “ops” mandates tooling and doesn’t have any skin in the game, that’s on them. And both problems are on your leadership.
If y’all just hate each other and don’t listen or participate, then you can’t be successful. It is ironic that this is the pattern that the devops movement landed us in.
The worst part is that the Terraform team at Hashicorp often excuse not fixing these design issues as “safety measures” which isn’t entirely untrue but when over half of your users want something, sometimes you should get over yourself.
For what it’s worth, OpenTofu is fixing many of these sorts of things that cause people pain.
But my advice is to learn to use the tool. Terraform has such great benefits (in the right use cases). If you’re struggling, either you are missing something or you chose the wrong tool for your particular job. Either way, don’t gripe that this specialized tool for infra management doesn’t work exactly like every other general purpose programming language.
We've been migrating off of Terraform at BigCo recently and it has been a tremendous success. The migration has saved countless hours. Before, I was jaded and routinely in the office until 8 or 9 or so manually running terraform deploys for our engineering teams in India. Now, thanks to Pulumi, I'm able to leave the office at 7:30-8 -- and I can tell you single handed that this has saved my relationship with my daughter and maybe even my marriage. I'm running the fastest for loops thanks to Pulumi. We actually compile our Python down to c and use the Pulumi C SDK for insane speed benefits when we loop over our datacenter arrays. Turns out, not having bounds checks shaves off valuable time that I would otherwise be spending with my daughter. Routinely I'd be waking up screaming at 4 in the morning due to Terraform (or, what we would refer to as Tearaform because all of the infra engineers were constantly in tears). Now, I can sleep soundly until 5:30.
I don't have much experience running Terraform at scale. What has Pulumi made easier? Why is looping a bottleneck in infrastructure code?
Based on the info I can glean from this story you may be working at a scale / use case that may be too big or a poor fit for Terraform but I'm not sure...
So, there's a good chance it was an error that was really unexpected and it's better to show the error than to risk producing bad output.
Even from all the way over here, I infer that we're from such different worlds that what "real cloud" means to my side of the world isn't a part of your world.
What I can tell you, is AWS is the king of cloud, Google Cloud is a very very distant 2nd place, and Azure is an even more distant 3rd place.
> and aren't oracle, I would not consider them worthy of trust either
Smart man.
I've heard it referred to as an "optionally typed" or "gradually typed" system, which, having worked for years in Typescript and other languages like Rust and Kotlin, etc, I agree with.
Great thing is that the zod schema also doubles as your typescript type so you don't have to write a duplicate/shadow TS type definition.
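For anyone who hasn't seen the pattern, here is the gist with a tiny hand-rolled validator. This is deliberately not zod's API, just the same idea in miniature: the schema is a runtime value, and the static type is derived from it instead of being written twice. `str`, `num`, `object` and the config fields are invented for the example.

```typescript
// Hand-rolled stand-in for a schema library (this is not zod's API).
type Parser<T> = (value: unknown) => T;
type Shape = Record<string, Parser<unknown>>;
type Infer<S extends Shape> = { [K in keyof S]: ReturnType<S[K]> };

const str: Parser<string> = (value) => {
  if (typeof value !== "string") throw new Error("expected string");
  return value;
};

const num: Parser<number> = (value) => {
  if (typeof value !== "number") throw new Error("expected number");
  return value;
};

// Combine field parsers into an object parser whose result type is inferred from the shape.
function object<S extends Shape>(shape: S): Parser<Infer<S>> {
  return (value) => {
    if (typeof value !== "object" || value === null) throw new Error("expected object");
    const input = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(shape)) {
      const parse = shape[key];
      if (parse !== undefined) out[key] = parse(input[key]);
    }
    return out as unknown as Infer<S>;
  };
}

// One definition: `Config` the value validates, `Config` the type is derived from it.
const Config = object({ region: str, replicas: num });
type Config = ReturnType<typeof Config>; // { region: string; replicas: number }

const parsedConfig: Config = Config({ region: "eu-west-1", replicas: 3 });
```

Add a field to the schema and the type picks it up automatically, so the two cannot drift apart.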
As you also noted, doing this in plain terraform is kind of a pain, so using a tool like Hiera allows you to skip a lot of the work involved in doing it the "right" way. IMO if you're starting greenfield Pulumi (or CDK, anything that lets you use a "real" programming language) allows you to write (or consume!) that config in basically any form, instead of needing to funnel everything through a Terraform data provider.
I think GP's point was that Kamal has all of these things already, so you don't have to set them up.
It’s stuff like this that’s just a thousand papercuts that dissuades me from using these “simpler” tools. By the time you’ve rebuilt by hand what you need, you’ve just created a worse version of the “more complex” solution.
I get it if your workload is so simple or low requirement that zero-downtime deploys, rollbacks, health/liveness, automatic volumes, monitoring etc are features you don’t want or need, but “it’s just as good, just DIY all the things” doesn’t make it a viable alternative in my mind.
Data resources are you requesting a dynamic value from your environment.
Variables are dynamic values that a user can change.
Especially if the locals vary between prod and pre-prod, and worse if dev sandboxes end up with per-user instances, which for us was mercifully only needed for people working on the TF scripts, so we could run our tests locally.
The distinction is very clear in our team. Locals are used as const (like an application name), variables are for more dynamic user/environment inputs and data is to fetch dynamic information from other resources.
Zero problems. If a local becomes more environment specific a quick refactor fixes that. You can also have locals that use variable or data values if necessary.
One big win we also have is that we stopped using modules except for one big main module. We noticed from previous projects that as soon as we implemented modules everything became a big problem. Modules that are version pinned still required a lot of maintenance to upgrade. Modules that weren't version pinned caused more destruction than we planned. Module outputs and inputs caused a lot of cycle problems, ... Modules always seem too deep or too shallow.