Here is how we transformed our HashiCorp stack over the past two years, why it took that long, and what we gained from it.
But first, some context. As of September 2020, the HashiCorp stack (aka HashiStack) was running on Consul 0.9.3 (Sept 2017), Nomad 0.8.6 (Sept 2018), and Terraform 0.12 (July 2019). To put things in perspective, upstream Terraform was at 0.13 at the time (now 1.3), Consul was at 1.8 (now 1.14), and Nomad was at 0.12 (now 1.4). And there was no Vault in sight, as the secrets were either in AWS Secrets Manager or in the Consul K/V store.
The HashiStack is usually represented as a stack of layers, with provisioning (aka infrastructure) at the bottom. Packer and Terraform are used there rather than Vagrant, which is not well suited to big cloud setups. Above it, Vault acts as the security layer, providing secrets, credentials, and certificates. Then comes the orchestration layer with Nomad and finally, the cherry on top, Consul as the service mesh.
“If it ain’t broke, don’t fix it” explains a lot about why things were the way they were. Also, a big ongoing effort was taking place to move to an immutable infrastructure paradigm. It aimed at getting rid of the Salt master (as in SaltStack, or more recently Salt Project) and at upgrading instances by replacing them rather than modifying them. On a production infrastructure, such changes go full Ship of Theseus. Bringing in more features wasn’t the priority at the time.
Back in 2019, we upgraded […] from Terraform 0.11 to 0.12. This was a major version upgrade, as there were syntax changes between these two versions. At the time, we spent an entire quarter doing this upgrade.
— How We Use Terraform At Slack, https://slack.engineering/how-we-use-terraform-at-slack/
It wasn’t that bad, but Terraform 0.12 was a big one. Then 0.13 was one of the main stepping stones to Terraform 1.0. By introducing the Terraform Registry, HashiCorp handed the community the right to manage providers. Before that, getting a provider validated by them was a long and tedious journey.
As we were and still are using many Terraform modules, each and every one of them had to be updated for the breaking changes. It wasn’t that hard, but it took quite a long time.
Since then, keeping up with the latest Terraform version has remained easy, as long as you didn’t rely on the experimental features for anything.
The upgrade guide provided by HashiCorp is pretty complete. The first step was to go from Nomad 0.8 to 0.10. The way ports were defined needed to be changed in some places, and Docker volumes weren’t enabled by default. That upgrade was made because something was broken and needed to be fixed. Broken as in: when marking nodes as ineligible, the scheduler was unable to perform any plans (GH-6996).
The next jump was a direct one to upstream, going from 0.10 to 1.1. Just like that.
Nomad 1.3 had to wait until Consul was updated to at least 1.7.
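This kind of ordering constraint is easy to forget halfway through a long upgrade, so it can be worth encoding as a small pre-flight check. The following is only a sketch: the `MIN_CONSUL_FOR_NOMAD` table and both function names are hypothetical, and the single entry in the table is the one constraint mentioned in this post, not an exhaustive compatibility matrix.

```python
def parse_version(text):
    """Turn a dotted version such as '1.7.3' into (1, 7, 3) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical table: from this Nomad version onward, Consul must be at least that version.
# Only the constraint mentioned in this post is listed.
MIN_CONSUL_FOR_NOMAD = {"1.3": "1.7"}


def can_upgrade_nomad(target_nomad, current_consul):
    """Return True when the running Consul satisfies every constraint for target_nomad."""
    target = parse_version(target_nomad)
    consul = parse_version(current_consul)
    for nomad_floor, consul_floor in MIN_CONSUL_FOR_NOMAD.items():
        # A constraint applies as soon as the target reaches the Nomad floor.
        if target >= parse_version(nomad_floor) and consul < parse_version(consul_floor):
            return False
    return True


if __name__ == "__main__":
    print(can_upgrade_nomad("1.3.0", "0.9.3"))   # Consul is too old for Nomad 1.3
    print(can_upgrade_nomad("1.3.0", "1.7.14"))  # Consul satisfies the constraint
```

Comparing tuples of integers rather than raw strings avoids the classic trap where "1.10" sorts before "1.7".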
Consul is the tough cookie in this story. Because Vault wasn’t used, the KV store was used to store some secrets, and thus the Consul ACL tokens were using what are now called legacy ACLs. Again, there is an upgrade guide that contains very helpful information. We mostly followed an answer by Nic in the HashiCorp forums.
This is the last version before Consul introduced the new ACL system. That upgrade gave us access to Prometheus metrics from Consul, which meant that the Graphite/statsd-based metrics and alerts could be removed. More importantly, it gave us Autopilot, which needs Raft protocol version 3.
We knew that having secrets in Consul was a bad idea and that we would have to migrate a lot of Consul tokens from the legacy format to the new world. Introducing Vault to the stack gave us a proper secrets store and let us replace many Consul tokens with Vault policies. Much, much easier to manage in the long run.
As the legacy token endpoints were still available, we could keep using them, even though we have to use a fork of the Vault Terraform provider partially for that reason.
Once this was done, we had to update all the Consul ACL tokens from the legacy format to the new format. Only the anonymous token is updated automatically.
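To keep track of what was left to migrate, a short script over the token list helps. The sketch below is not the tooling we used. It assumes the decoded JSON from Consul's token list endpoint (`/v1/acl/tokens`) carries a `Legacy` boolean per token, as it does to my knowledge on the versions that still support legacy tokens. The sample payload and the `legacy_tokens` helper are made up for illustration.

```python
import json


def legacy_tokens(tokens):
    """Return (accessor_id, description) pairs for tokens still flagged as legacy.

    `tokens` is the decoded JSON list from the token list endpoint.
    Assumption: each entry may carry a `Legacy` boolean; a missing key is treated as False.
    """
    return [
        (token.get("AccessorID", ""), token.get("Description", ""))
        for token in tokens
        if token.get("Legacy", False)
    ]


# Made-up sample payload, shaped like a token list response.
SAMPLE = json.loads("""
[
  {"AccessorID": "00000000-0000-0000-0000-000000000002",
   "Description": "Anonymous Token", "Legacy": false},
  {"AccessorID": "aaaaaaaa-1111-2222-3333-444444444444",
   "Description": "app-a KV access", "Legacy": true},
  {"AccessorID": "bbbbbbbb-1111-2222-3333-444444444444",
   "Description": "nomad server", "Policies": [{"Name": "nomad-server"}]}
]
""")

if __name__ == "__main__":
    for accessor_id, description in legacy_tokens(SAMPLE):
        print(f"still legacy: {accessor_id} ({description})")
```

Keeping the function free of any HTTP call makes it easy to run against a saved dump of the token list before touching a production cluster.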
The other fun surprise was that changing the Consul configuration from the legacy format (acl_ prefix) to the new format (acl object) changes how Consul behaves. As acl_enforce_version_8 had been disabled the whole time, removing it shed light on how little we were relying on the Consul ACL system to protect anything.
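For reference, the renaming itself is mechanical. Here is a minimal sketch of the conversion, working on the agent configuration loaded as a Python dict. The key mapping reflects my reading of the Consul agent configuration reference and should be checked against the docs for your target version. The `convert_acl_config` helper is hypothetical, and it assumes the input only uses the legacy flat keys.

```python
# Legacy flat key -> path in the new configuration (mapping to be verified against the docs).
LEGACY_TO_NEW = {
    "acl_datacenter": ("primary_datacenter",),
    "acl_default_policy": ("acl", "default_policy"),
    "acl_down_policy": ("acl", "down_policy"),
    "acl_ttl": ("acl", "token_ttl"),
    "acl_master_token": ("acl", "tokens", "master"),
    "acl_agent_token": ("acl", "tokens", "agent"),
    "acl_agent_master_token": ("acl", "tokens", "agent_master"),
    "acl_token": ("acl", "tokens", "default"),
    "acl_replication_token": ("acl", "tokens", "replication"),
}

# Keys with no equivalent in the new format; dropping them is where behavior changes.
DROPPED = {"acl_enforce_version_8"}


def convert_acl_config(legacy):
    """Return a new dict with legacy acl_* keys moved under the `acl` object."""
    new = {}
    for key, value in legacy.items():
        if key in DROPPED:
            continue
        path = LEGACY_TO_NEW.get(key, (key,))  # unrelated keys are kept as they are
        node = new
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    # Assumption: the legacy system was switched on by setting acl_datacenter,
    # while the new format needs an explicit acl.enabled.
    if "acl_datacenter" in legacy:
        new.setdefault("acl", {})["enabled"] = True
    return new


if __name__ == "__main__":
    old = {
        "datacenter": "dc1",
        "acl_datacenter": "dc1",
        "acl_default_policy": "allow",
        "acl_enforce_version_8": False,
    }
    print(convert_acl_config(old))
```

The script only handles the renaming. The behavior change described above still has to be tested on a non-production cluster, since the dropped key is exactly the one that was hiding it.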
Nomad 1.3 requires Consul 1.7, so it was a nice middle ground before 1.10.
This is the last version with the legacy ACL token endpoints, so we were still good, even though no legacy tokens were left. Judging by its release notes, 1.11 feels like a natural step to take before going higher.
Aside from enjoying the latest features and bug fixes, what does running fresh versions of the HashiStack bring?
At a general level, it helped us move some features, like autoscaling or volumes, from the cloud provider level to the Nomad level. Also, the issues we hit are now likely to be shared with other current users, rather than something we have to dig for among the closed issues.
One would say Consul Connect. It’s a big change, probably worth it, but one that would deeply alter how the current routing between services is done, including how it’s monitored. In its current state, Connect feels tailored for Kubernetes rather than Nomad.
Consul cluster peering might be one interesting bit. In the past, we had the idea that the stacks would keep growing and growing, which makes them less costly per running service. However, it seems that being multi-region, or even multi-cloud, will come first. Being able to communicate between two regions without going back over the Internet or creating a lot of internal endpoints (aka VPC endpoints at AWS) would be better.
What is your story?