User:GNUru/Software RAID on NVMe with Monitoring

From Gentoo Wiki

The default configuration you get by installing the usual tools and following the official documentation for software RAID on NVMe drives is very far from adequate. I learned this the hard way after purchasing a pair of Samsung 980 PRO 2TB NVMe modules whose firmware is now widely known to be defective, something Samsung was very quiet about. I very nearly lost a bunch of data, but my RAID 1 configuration plus an old backup meant that in the end I lost none. It was an eye-opening experience that made me realize just how lacking the standard monitoring capabilities of mdadm and smartmontools are, particularly for NVMe drives (despite their prevalence in 2023).

Here's a cheat sheet or overview of what you need to do to achieve a sane monitoring setup on a desktop:

  • mdadm, nvme, smartmontools, cronie, and a mail agent installed and configured using their respective guides
  • scripts I wrote, and either placed somewhere in PATH or called by absolute path:
    • email-sysadmin
      • wrote this as a generic wrapper for emailing sysadmin
    • mdadm_notify.sh
      • wrote this to gather useful info about RAID event and send notification e-mail
    • smart-nvme-degradation-check
      • wrote this script which keeps state (uses subdir of same dir that smartd uses) and notifies (via e-mail and wall) if any NVMe lifetime health properties degrade
  • /etc/
    • conf.d/
      • smartd
        • edited this to add --attributelog=/var/log/smartd/ and --savestates=/var/lib/smartd/ options
          • both those dirs are the values suggested in man smartd, and had to be created manually
          • attribute log doesn't actually work for NVMe, so this is only added in preparation for if/when support is added in future; see https://www.smartmontools.org/ticket/1190
          • savestates stores min/max temperatures seen per device, so that they're preserved across restarts of smartd (since devices themselves don't track their min/max temperatures)
    • cron.daily/
      • smart-data-log
        • added this to log SMART data for every SMART-capable device to syslog every day, mainly since smartd's attribute log feature doesn't (yet?) support NVMe drives
      • smart-nvme-degradation-check
        • added this to call the /root/bin/smart-nvme-degradation-check script to perform crucial NVMe checks that smartctl doesn't (yet?) support
    • cron.weekly/
      • mdadm
        • edited this (commented it all out) to disable default of scrubbing weekly
    • cron.monthly/
      • mdadm
        • copied from cron.weekly/mdadm, but:
          • removed the stupid date condition (which assumes script runs weekly and breaks if run monthly) and executability condition
          • removed --cron option to checkarray since all it did was hide useful info from being logged
          • removed --quiet option to checkarray, for same reason
          • redirected stderr to stdout, then piped the whole thing to a logger command to make sure it gets logged to syslog
    • local.d/
      • 50-md-sync-stall-shutdown.stop
        • added this to check if any md devices are in a sync state, and stall the shutdown until the operation completes
    • mdadm.conf
      • you normally need this file anyway, and existing docs already mention what you need in it, but here's what I have:
        • AUTO -all to disable auto-assembly so that only arrays listed in this file are assembled
        • DEVICE /dev/disk/by-id/nvme-blah-blah*-part1 i.e. a stable glob pattern that matches only the array's or arrays' component drives
        • ARRAY /dev/md# ...etc... is the usual line you get when you run mdadm --detail --scan >>/etc/mdadm.conf during the initial setup of mdadm
        • PROGRAM /path/to/mdadm_notify.sh runs this script on every RAID array event
    • smartd.conf
      • edited this to disable default of:
        • DEVICESCAN
      • and instead do:
        • DEFAULT -c interval=60 -a -m @ALL
          • to update defaults (only applies to lines below, and only settings which are not overridden in lines below) to:
            • check every 60s
            • enable the standard set of monitoring directives (-a)
            • and on warnings/failures run every executable in /etc/smartd_warning.d/
        • /dev/nvme# -W 0,58,65
          • for each NVMe drive (replace # with appropriate value for desired device) to enable temperature checking since it's not enabled by default
          • can't use by-id symlinks because of bug https://www.smartmontools.org/ticket/1670, and workaround https://www.smartmontools.org/changeset/4847 only works for symlinks pointing to /dev/sd* so won't work here
          • the leading 0 turns off the diff/min/max reports, so smartd just informs at 58C and warns/notifies at 65C (adjust as desired)
        • DEVICESCAN -c interval=1800
          • to scan for devices other than those explicitly mentioned above, but for them go back to using smartd's default 30min interval
    • smartd_warning.d/
      • email-sysadmin
        • added this to send an e-mail notification
      • wall
        • added this to blast a notification via wall command
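
For the PROGRAM line in mdadm.conf, here is a minimal sketch of what an event handler could look like. It is not the actual mdadm_notify.sh. It relies only on the fact that mdadm's monitor runs the program with the event name, the md device, and (for some events) a component device as arguments. The function names event_subject and event_report are made up for illustration, and logger stands in for whatever mail wrapper you use.

```shell
#!/bin/sh
# Sketch of a PROGRAM handler for mdadm's monitor mode (illustrative only).
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (some events).

# Print a one-line subject for the event.
event_subject() {
    if [ -n "${3:-}" ]; then
        printf 'mdadm %s on %s (%s)\n' "$1" "$2" "$3"
    else
        printf 'mdadm %s on %s\n' "$1" "$2"
    fi
}

# Print the full report: subject, array detail, and current mdstat.
event_report() {
    event_subject "$@"
    echo
    mdadm --detail "$2" 2>&1
    echo
    cat /proc/mdstat 2>&1
}

# Only act when invoked with at least an event name and a device.
if [ "$#" -ge 2 ]; then
    # Assumption: logger is a stand-in here; swap in your own mailer.
    event_report "$@" | logger -t mdadm-notify
fi
```

Keeping the formatting in functions makes it easy to try the output by hand before pointing mdadm at the script.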
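
The idea behind local.d/50-md-sync-stall-shutdown.stop can be sketched roughly as below. This is an assumption-laden sketch, not the actual script: it detects a running operation by looking for a progress line in /proc/mdstat, and the names MDSTAT, POLL_SECONDS, md_sync_active, and wait_for_md_sync are invented for the example.

```shell
#!/bin/sh
# Sketch: stall shutdown while any md array is resyncing (illustrative only).
# MDSTAT and POLL_SECONDS are hypothetical knobs; override them for testing.
MDSTAT="${MDSTAT:-/proc/mdstat}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# Succeeds if the given mdstat file shows a running resync, recovery,
# check, or reshape, i.e. a progress line carrying a percentage.
md_sync_active() {
    grep -Eq '(resync|recovery|check|reshape) *= *[0-9]' "$1" 2>/dev/null
}

# Block until no array reports a running sync operation.
wait_for_md_sync() {
    while md_sync_active "$MDSTAT"; do
        echo "md sync in progress; delaying shutdown" >&2
        sleep "$POLL_SECONDS"
    done
}

wait_for_md_sync
```

Matching only lines with a percentage means a PENDING or DELAYED resync does not hold the shutdown forever; whether that is the behaviour you want is a judgment call.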
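
Put together, the smartd.conf changes described above would look something like this. The device names /dev/nvme0 and /dev/nvme1 are assumptions for a two-drive setup; use whatever your drives are actually called.

```
# /etc/smartd.conf (sketch)
# DEVICESCAN    <- stock catch-all line, commented out

# Defaults for the lines below: check every 60s, enable -a,
# and run every plugin in /etc/smartd_warning.d/ on warnings/failures.
DEFAULT -c interval=60 -a -m @ALL

# One line per NVMe drive (example device names).
/dev/nvme0 -W 0,58,65
/dev/nvme1 -W 0,58,65

# Any other device: back to the 30min interval.
DEVICESCAN -c interval=1800
```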