Project:Infrastructure/Incident reports/2015-05-07 woodpecker

= 2015-05-07 Woodpecker Outage =
Writeup by Robin H. Johnson (robbat2)

Timeline on 2015/05/07

 * 04:40:07 UTC - last log message
 * 04:41 UTC - first nagios alert to #gentoo-infra
 * 05:27 UTC - a dev sends an IRC /query to robbat2 that there is a problem
 * 05:28 UTC - robbat2 captures the serial console trace below and begins to restore
 * 06:20 UTC - woodpecker is back for the first time
 * ~06:45 UTC - several more reboots to bring a 64-bit kernel into service
 * 07:06 UTC

Background
The system install on woodpecker is very old as Gentoo infrastructure systems go, dating back to late 2005 or earlier. It was originally an HP ProLiant DL380 G4 with no proper 64-bit capability (the CPUs were capable, but the BIOS had unresolvable issues). It therefore ran a 32-bit HIGHMEM kernel, and was the only such system in infra.

As a result of the system's age, many of the legacy pieces on it were not managed by configuration management: woodpecker never got a cfengine deployment like other infra hosts. It did, however, get Puppet later.

In January 2015, the hardware started showing problems, and given the difficulty of moving all the developer content, as well as the fragile mail setup, the system was simply forklift-upgraded into a VM environment.

What went wrong in the first place
This is an open question right now. The system was at its all-time uptime high since the migration, with ~98 days elapsed.

What went wrong with bringing it back

 * The initramfs present on the system contained an lvm.conf that filtered out the /dev/vd* devices, so LVM did not initialize at first.
 * /etc/inittab was empty
   * Files populated by cfengine were not present, since the host did not run cfengine
   * The script that builds /etc/inittab was fired by Puppet, leading to an empty file
 * /etc/fstab was out of date relative to the actual mounts
   * /usr had been merged into /
   * A stale user_xattr option on a filesystem that no longer supported it (it had been converted away from ext*) caused a mount failure.
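
The lvm.conf problem above can be illustrated with a minimal sketch of how an LVM device filter is evaluated. It assumes the documented lvm.conf semantics: each entry is "a" (accept) or "r" (reject) followed by a delimited regular expression, the first matching entry decides, and a path that matches nothing is accepted. The filter shown is hypothetical, not the one from woodpecker's initramfs; Python's re module only approximates LVM's own regex engine, and the sketch tests a single path rather than every alias of a device.

```python
import re


def parse_filter(patterns):
    """Split lvm.conf-style filter entries into (accept, compiled_regex) pairs.

    Each entry is 'a' (accept) or 'r' (reject), then a delimiter character,
    the pattern, and the same delimiter again, e.g. 'r|/dev/vd.*|'.
    """
    rules = []
    for entry in patterns:
        if len(entry) < 3 or entry[0] not in ("a", "r") or entry[-1] != entry[1]:
            raise ValueError(f"malformed filter entry: {entry!r}")
        rules.append((entry[0] == "a", re.compile(entry[2:-1])))
    return rules


def device_accepted(rules, path):
    """First matching pattern decides; a path that matches nothing is accepted."""
    for accept, regex in rules:
        if regex.search(path):
            return accept
    return True


# Hypothetical filter written for physical disks only (an assumption for
# illustration, NOT the actual contents of woodpecker's lvm.conf).
HYPOTHETICAL_FILTER = ["a|^/dev/sd.*|", "r|.*|"]

rules = parse_filter(HYPOTHETICAL_FILTER)
for dev in ("/dev/sda1", "/dev/vda1"):
    print(dev, "accepted" if device_accepted(rules, dev) else "rejected")
```

With a filter of this shape, a virtio disk such as /dev/vda1 is rejected, so no physical volumes are found on it after a move into a VM. Running a check like this against the lvm.conf embedded in an initramfs before a migration is one possible way to catch the mismatch early.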

What further actions were taken

 * A newer, 64-bit kernel was deployed on top of the existing 32-bit userland.
 * Puppet handling for inittab was worked around for the moment; a full fix is pending.
 * The fstab contents managed by Puppet were fixed.
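
As a rough sketch of how fstab drift of this kind could be detected before the next reboot, the following compares fstab-style text against /proc/mounts-style text. It was not part of the actions taken; the function names and the sample tables are invented for illustration, and the check is deliberately simple: it looks only at mount points and filesystem types, not at devices or mount options.

```python
def parse_table(text):
    """Parse fstab- or /proc/mounts-style text into {mountpoint: fstype}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 3:  # blank and malformed lines are skipped
            table[fields[1]] = fields[2]
    return table


def fstab_drift(fstab_text, mounts_text):
    """Return (missing, mismatched) between fstab and the live mount table.

    missing:    fstab mount points that are not currently mounted
    mismatched: (mountpoint, fstab_fstype, live_fstype) tuples
    """
    fstab = parse_table(fstab_text)
    mounts = parse_table(mounts_text)
    missing, mismatched = [], []
    for mountpoint, fstype in sorted(fstab.items()):
        if fstype == "swap" or mountpoint == "none":
            continue  # swap has no mount point to compare
        if mountpoint not in mounts:
            missing.append(mountpoint)
        elif fstype != "auto" and fstype != mounts[mountpoint]:
            mismatched.append((mountpoint, fstype, mounts[mountpoint]))
    return missing, mismatched


# Invented sample data (NOT woodpecker's real tables).
FSTAB_EXAMPLE = """
/dev/vda1     /      xfs   noatime             0 1
/dev/vda2     /usr   ext3  noatime,user_xattr  0 2
/dev/vg/home  /home  ext3  user_xattr          0 2
"""
MOUNTS_EXAMPLE = """
/dev/vda1 / xfs rw,noatime 0 0
/dev/mapper/vg-home /home xfs rw 0 0
proc /proc proc rw 0 0
"""

missing, mismatched = fstab_drift(FSTAB_EXAMPLE, MOUNTS_EXAMPLE)
print("listed but not mounted:", missing)
print("filesystem type differs:", mismatched)
```

On a live host the two inputs would come from /etc/fstab and /proc/mounts. Entries marked noauto would show up as missing, so a real check would need to filter those out.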