Project:Infrastructure/Hashicorp

From Gentoo Wiki

Vault

Vault stores secrets in a backend. The current backend is Consul. Vault needs to be unlocked (unsealed) with a master key. Currently the key is split using Shamir's Secret Sharing (SSS); all 5 pieces are in the secrets repo. Whenever Vault is restarted, we have to unseal it; this is currently a manual operation.

  1. TODO(antarus): Write up the unseal procedure.
  2. Recover the master key by getting 3 of the 5 SSS key shares out of the secrets repo.
  3. Unseal with the master key.
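
To make the 3-of-5 threshold concrete, here is a toy sketch of Shamir's Secret Sharing over a prime field. It is illustrative only and is not Vault's implementation: the prime, the function names `split` and `combine`, and the integer secret are all assumptions made for this example. In practice the key shares are handed to Vault itself (for example via `vault operator unseal`) rather than recombined by hand.

```python
import secrets

# Toy 3-of-5 Shamir's Secret Sharing over a prime field.
# Illustrative only: PRIME, split() and combine() are hypothetical names,
# and this is not how Vault handles its key shares internally.
PRIME = 2**127 - 1  # a Mersenne prime, large enough for a demo secret


def split(secret, threshold=3, shares=5):
    """Return `shares` points on a random polynomial of degree threshold-1."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # The constant term is the secret; the other coefficients are random.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points


def combine(points):
    """Recover the secret via Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any 3 of the 5 points returned by `split` are enough for `combine` to recover the secret, which is why losing up to two shares is survivable.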

Architecture

Vault has two logical components:

  1. Application server: The appserver runs the vault APIs and services.
  2. Datastore: Currently Consul. Vault uses Consul's key-value store to hold its data (including keys, ACLs, etc.).

DR / Failover

  1. Application Server: Deploy Vault on a new machine and unseal it using the unseal procedure. All the data is in Consul. The Consul token for Vault is stored in eyaml in Puppet.
  2. Datastore: We store data in Consul.

Consider adopting a Consul snapshot procedure (nightly via cron) that takes a snapshot from the Consul leader and uploads it to S3. Then, if we lose all data in Consul, we can restore a snapshot from the previous day(s).
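
The nightly job described above could be sketched roughly as follows. This is only a sketch under assumptions: the bucket name, the local directory, and the use of the AWS CLI for the upload are hypothetical and not decided anywhere in this document. The functions only build the commands; nothing is executed.

```python
import datetime

# Sketch of a nightly Consul snapshot job. Assumptions (not from this page):
# the bucket name, the local directory, and uploading with the AWS CLI.
BUCKET = "example-consul-snapshots"  # hypothetical bucket name
WORKDIR = "/var/backups/consul"      # hypothetical local path


def snapshot_name(day):
    """Name snapshots by date so a given day can be restored individually."""
    return "consul-{}.snap".format(day.isoformat())


def build_commands(day):
    """Return the commands a cron job would run, without executing them."""
    local = "{}/{}".format(WORKDIR, snapshot_name(day))
    remote = "s3://{}/{}".format(BUCKET, snapshot_name(day))
    return [
        ["consul", "snapshot", "save", local],  # take the snapshot
        ["aws", "s3", "cp", local, remote],     # upload it to S3
    ]
```

A cron wrapper could pass each command to `subprocess.run(cmd, check=True)` so that a failed snapshot aborts the upload. Restoring would go through `consul snapshot restore` with the downloaded file.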

Maintenance

Vault has policies (AuthZ) and tokens (AuthN). We will store both in a Terraform (infra-as-code) module. Vault has a root token; it is currently split with SSS and stored in the secrets repo. We should consider what to do with it.

Bootstrapping

Vault stores data in Consul. Consul needs Vault to generate TLS certificates for proper Consul operation. This is a circular dependency, so we need a bootstrap procedure for getting past this state.

We recommend disabling TLS on Consul for a short period, so that the cluster can form and pick up the new certificates, then turning TLS back on. <link to procedure>

Open discussion

Sealing

If we adopt Vault for more operations, do we need to implement auto-unsealing?
What questions should we be asking here?

Alec's Vault Questions

Basically, as long as Vault is not needed for user-facing services for some time period (e.g. if Vault is down we can still serve users for at least 24h), I'm happy with manual unsealing. I don't care if, say, we cannot monitor or deploy while Vault is sealed; we can wait O(hours) to unseal.

Consul

Consul is both a service discovery system and a key-value store from HashiCorp. It's similar to Chubby (from Google), etcd, and ZooKeeper.

Architecture

Consul has two clearly separate mechanisms:

  • Consul uses Raft for consensus, offering a consistent view of key-value data.
  • Consul has a service discovery protocol with health checking. This is implemented by the Consul gossip protocol.

Servers

Gentoo currently has 5 Consul servers spread throughout the world. We have modified the RPC timeouts for both protocols to support operating Consul over a high-latency WAN. Servers handle the administration of Consul, participate in the Raft consensus protocol, and each contain a full copy of the Consul datastore.
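
As a quick arithmetic sketch of what 5 servers buys us: Raft needs a majority of servers to agree, so the quorum size and the number of servers we can afford to lose follow directly from the server count. The function names here are made up for the example.

```python
def quorum(servers):
    """Smallest majority among `servers` Raft voters."""
    return servers // 2 + 1


def tolerated_failures(servers):
    """Number of servers that can be lost while a majority remains."""
    return servers - quorum(servers)


print(quorum(5))              # 3
print(tolerated_failures(5))  # 2
```

With 5 servers, any 3 form a quorum, so the cluster keeps working with up to 2 servers down or unreachable across the WAN.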

Clients

Gentoo runs Consul clients on all Gentoo infra nodes. The clients allow nodes to participate in service discovery, typically by placing service descriptors in /etc/consul.d/ on their local machine. The Consul agents then health-check their local services and tell the Consul servers which services are available on that node.
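
For illustration, a service descriptor with one HTTP health check could be generated like this. The service name, port, and check URL in the usage note are hypothetical examples, not services documented on this page, and the helper names are invented for the sketch.

```python
import json


def service_descriptor(name, port, check_url, interval="10s"):
    """Build a Consul service definition with a single HTTP health check."""
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {
                "http": check_url,     # URL the local agent polls
                "interval": interval,  # how often the agent runs the check
            },
        }
    }


def render(descriptor):
    """Serialize the descriptor as JSON, ready to be written to a file."""
    return json.dumps(descriptor, indent=2, sort_keys=True) + "\n"
```

For example, `render(service_descriptor("node-exporter", 9100, "http://localhost:9100/metrics"))` yields JSON that could be dropped into a file under /etc/consul.d/ for the local agent to load.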

Auth (Z and N)

Consul uses consul policies to determine which consul operations are available. Policies are bound to authentication tokens. These definitions and tokens are maintained in the infra-as-code repo in terraform. Authentication tokens are distributed to each node (either consul server or client tokens, typically) by puppet; the tokens are encrypted in eyaml.

Bootstrap token

This token is the root token. It is currently stored in the secrets repo; we should consider splitting it with SSS. Currently we use this token for Terraform, but we can generate a weaker token for developers to use on a more routine basis.

Admin tokens

What admin tokens should we make, if any? See the above comment about terraforming. We should not be using the bootstrap token for that purpose.

Bootstrapping (cert renewal)

We need to renew the certs regularly. We will use Puppet to check whether a cert will expire soon and, if so, renew it.
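
The "expires soon" check itself is simple. The sketch below shows the logic in Python purely for illustration; the plan above is to do this in Puppet, and the 30-day threshold and function names are assumptions. It parses a certificate's notAfter string in the format that OpenSSL prints.

```python
import datetime
import ssl

# Hypothetical threshold: renew when fewer than 30 days of validity remain.
RENEW_BEFORE = datetime.timedelta(days=30)


def parse_not_after(not_after):
    """Parse a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


def needs_renewal(not_after, now, renew_before=RENEW_BEFORE):
    """True if the certificate expires within `renew_before` of `now`."""
    return parse_not_after(not_after) - now <= renew_before
```

`now` is passed in explicitly (as a timezone-aware datetime) so the check is easy to test; a real run would pass `datetime.datetime.now(datetime.timezone.utc)`.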

Nomad

TODO

Roadmap

Vault

- Store the Consul token that Vault is using in the secrets repo for manual bootstrapping

Consul

- Create an S3 bucket for Consul snapshots and a scoped token

- Implement a generic puppet module to upload to s3

- Use the s3 puppet module to regularly create backups

- Add puppet code to renew the certs when they expire soon

- Renew the CA with a long ttl

Nomad

- Add puppet code to renew the certs when they expire soon

- Renew the CA with a long ttl

- Move Prometheus Exporters to Nomad

- Blacklist small-disk nodes from running additional containers, because they don't have the space

- for busier nodes we need to allocate more space to run apps

- figure out a DNS solution

- complete the terraform config for nomad

- create ACL policies

- monitor the nomad nodes to tweak the raft protocol for them as well

- write a security narrative for non-system jobs on nomad (what runs where, who can run what)

Misc

- Consider some wg/vpn setup (this can be used e.g. for terraforming)