When your infrastructure fails in an unpredictable way, you get bitten by all the long-dormant mistakes you made on the way to building it. In our case, one of those mistakes was using literal IP addresses to call our different services.
Over the last 3 weeks, we had to change the network addresses of all our VMs. I'll share the details in another post, but here I want to focus on one of the main lessons we took from that operation: use DNS in your infrastructure.
By that, I mean that you should NOT hardcode IP addresses in your configuration: wherever you can, use a properly named DNS entry for the service instead.
In the case of stateless services, you have the choice:
- either you can use a virtual IP address, and point your DNS to it (drawback: you have to understand how to do that)
- or you can present multiple hosts behind the same DNS entry (drawback: your backends must be able to handle multiple DNS records, which is often NOT easy to do)
- or you can install a load balancer in front of your VMs and point your other backends to it (drawback: if you don't want it to be a SPOF you'll still have to use 2 instances, so a virtual IP)
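To give an idea of what "handling multiple DNS records" means on the client side, here is a minimal Python sketch. The function names `resolve_all` and `connect_any` and the timeout value are made up for the example, and it is not what we run in production: it simply asks the resolver for every address behind a name and tries them one after the other.

```python
import socket

def resolve_all(host, port):
    """Return every distinct (ip, port) the resolver gives for host, in order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in seen:
            seen.append(addr)
    return seen

def connect_any(host, port, timeout=2.0):
    """Try each resolved address in turn; return the first socket that connects."""
    last_error = None
    for ip, resolved_port in resolve_all(host, port):
        try:
            return socket.create_connection((ip, resolved_port), timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next record
    raise ConnectionError(f"no backend reachable for {host}:{port}") from last_error
```

In Python, `socket.create_connection` already walks through all the addresses returned for a hostname, so the explicit loop is mostly there to show the idea. Many clients and drivers resolve a name once and stick to the first answer, which is why this option is often harder than it looks.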
For simple, one-instance things (physical machines, individual components on the network, etc.) you just have to point the DNS to the correct IP.
It is simple, really: on top of the ~1200 IP addresses we had to migrate (80% automated with Ansible, thanks $deity), we had to go back and reconfigure nearly all our VMs, and thus restart all the services on them (incurring downtime for the non-stateless ones), because the micro-service configurations in these VMs used literal IP addresses to reach other services.
With DNS, we could have simply added the new addresses to the VMs, and then updated the DNS records so that traffic gradually moved over to the new IP network.
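If you want to know how exposed you are before a migration like this, a quick scan of your configuration files for literal addresses is a good start. The sketch below is only an illustration: the function name `find_literal_ips` and the sample config keys are made up, and it only looks for IPv4 literals.

```python
import ipaddress
import re

# Candidate dotted quads; ipaddress does the real validation below.
_CANDIDATE = re.compile(r"(?<![\w.])(\d{1,3}(?:\.\d{1,3}){3})(?![\w.])")

def find_literal_ips(text):
    """Return (line_number, ip) pairs for every valid IPv4 literal in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group(1))
            except ValueError:
                continue  # e.g. 999.1.1.1, which is not a valid address
            hits.append((lineno, match.group(1)))
    return hits

sample = """db_host = 10.0.3.17
api_url = http://billing.internal.example:8080
version = 1.2.3
cache = 10.0.3.18:6379
"""
print(find_literal_ips(sample))  # [(1, '10.0.3.17'), (4, '10.0.3.18')]
```

Every hit is a line you will have to touch by hand, and a service you will have to restart, the next time the network changes.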
We actually moved to using DNS in many of the affected services, and we'll migrate the rest to use DNS as well.
Now that we are using DNS, changing the IPs of the underlying machines will be way easier.
Corollary: have solid DNS
Migrating all your configurations to use DNS also means that you have to have a solid DNS infrastructure:
- at least 2 servers,
- both used actively.
These will probably be the only machines whose IP addresses are hardcoded across the rest of your infra, in the /etc/resolv.conf file, a bit like this:
    search mydomain.com
    # first machine
    nameserver x.y.z.i
    # second machine
    nameserver x.y.z.i
    # so that they are both regularly queried
    options rotate timeout:1 attempts:1
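Since this file ends up on every machine, it is worth checking automatically that it still has the two properties listed above: at least 2 servers, and both of them actively used. Here is a rough Python sketch of such a check. The function name `check_resolv_conf` is made up, the addresses come from the documentation range, and it only understands the `nameserver` and `options` lines.

```python
def check_resolv_conf(text):
    """Return a list of problems found in resolv.conf-style text (empty if none)."""
    nameservers = []
    options = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            options.extend(fields[1:])
    problems = []
    if len(set(nameservers)) < 2:
        problems.append("fewer than 2 distinct nameservers")
    if "rotate" not in options:
        problems.append("'rotate' not set: the second nameserver is only a fallback")
    return problems

good = """search mydomain.com
nameserver 192.0.2.53
nameserver 192.0.2.54
options rotate timeout:1
"""
print(check_resolv_conf(good))  # []
```

You could run something like this from your configuration management tool against the real /etc/resolv.conf and fail the run when the list is not empty.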
Bonus points if you enable reverse lookup in your DNS server: it is frustrating to run tcpdump and then have to hunt for the name behind each IP. Also make sure your DNS is only reachable from within your network.
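For the curious, this is roughly what reverse lookup looks like from a script. It is a small Python sketch with made-up helper names (`ptr_name`, `name_for`) and a documentation-range address: the first function shows the name a PTR query is sent to, the second one asks the resolver and falls back to the bare IP when there is no reverse record.

```python
import ipaddress
import socket

def ptr_name(ip):
    """Return the DNS name a PTR query for this IP is sent to."""
    return ipaddress.ip_address(ip).reverse_pointer

def name_for(ip):
    """Return the hostname from a reverse lookup, or the IP itself if none exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ip

print(ptr_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

Without reverse records in your DNS, `name_for` just hands the IP back, which is exactly the situation you are in when staring at a capture full of anonymous addresses.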
Wait, why don't you use $shiny_discovery_service?
Yes, we should; yes, we might in the future; but right now we are stuck with what we have. I'm not asking you to judge the staleness of our tech stack. We are a young, small company, and we are learning from our mistakes.