User:GNUru/Software RAID on NVMe with Monitoring

The default configuration and support you get by installing the usual tools and following the official documentation for software RAID on NVMe drives is very, *VERY* far from adequate. I learned this the hard way after buying a pair of Samsung 980 PRO 2TB NVMe modules that shipped with a now widely known defective firmware, which Samsung was very quiet about. I came very close to losing a bunch of data, but my RAID 1 configuration plus an old backup meant that in the end I lost none. It was an eye-opening experience that made me realize just how lacking the standard monitoring capabilities of mdadm and smartmontools are, particularly for NVMe drives (despite their prevalence in 2023).

Here's a cheat sheet or overview of what you need to do to achieve a sane monitoring setup on a desktop:

mdadm, nvme, smartmontools, cronie, and a mail agent installed and configured using their respective guides
scripts I wrote, and either placed somewhere in PATH or called by absolute path:
- email-sysadmin
  - wrote this as a generic wrapper for emailing sysadmin
- mdadm_notify.sh
  - wrote this to gather useful info about RAID event and send notification e-mail
- smart-nvme-degradation-check
  - wrote this script which keeps state (uses subdir of same dir that smartd uses) and notifies (via e-mail and wall) if any NVMe lifetime health properties degrade
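
To illustrate the "keeps state and notifies on degradation" idea behind smart-nvme-degradation-check, here is a minimal sketch, not the actual script. It assumes a hypothetical snapshot format of one "name value" pair per line, and the field names (available_spare, percentage_used, media_errors) are only examples of lifetime health properties one might track. The function compares the previously saved snapshot against a fresh one and prints every field that got worse.

```shell
#!/bin/sh
# Sketch only: compare two snapshots of NVMe health values and report any
# field that got worse. The snapshot format ("name value" per line) and the
# field names are assumptions for illustration, not the real script.

# Print one line per degraded field; exit status is 1 if anything degraded.
# Usage: nvme_health_degraded OLD_SNAPSHOT NEW_SNAPSHOT
nvme_health_degraded() {
    awk '
        # First file: remember the previous value of every field.
        NR == FNR { old[$1] = $2; next }
        # Second file: skip fields that have no previous value to compare with.
        !($1 in old) { next }
        # available_spare is a percentage that shrinks as the drive wears out.
        $1 == "available_spare" {
            if ($2 + 0 < old[$1] + 0) { print $1 ": " old[$1] " -> " $2; bad = 1 }
            next
        }
        # Every other field is treated as a counter where growth is bad news.
        $2 + 0 > old[$1] + 0 { print $1 ": " old[$1] " -> " $2; bad = 1 }
        END { exit bad + 0 }
    ' "$1" "$2"
}
```

In a real check, the new snapshot would be built from the drive's SMART/health log, the output would be handed to email-sysadmin and wall, and the new snapshot would then replace the saved one so that each degradation is only reported once.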
/etc/
- conf.d/
  - smartd
    - edited this to add --attributelog=/var/log/smartd/ and --savestates=/var/lib/smartd/ options
      - both those dirs are the values suggested in man smartd, and had to be created manually
      - attribute log doesn't actually work for NVMe, so this is only added in preparation for if/when support is added in future; see https://www.smartmontools.org/ticket/1190
      - savestates stores min/max temperatures seen per device, so that they're preserved across restarts of smartd (since devices themselves don't track their min/max temperatures)
- cron.daily/
  - smart-data-log
    - added this to log SMART data for every SMART-capable device to syslog every day, mainly since smartd's attribute log feature doesn't (yet?) support NVMe drives
  - smart-nvme-degradation-check
    - added this to call the /root/bin/smart-nvme-degradation-check script to perform crucial NVMe checks that smartctl doesn't (yet?) support
- cron.weekly/
  - mdadm
    - edited this (commented it all out) to disable default of scrubbing weekly
- cron.monthly/
  - mdadm
    - copied from cron.weekly/mdadm, but:
      - removed the stupid date condition (which assumes script runs weekly and breaks if run monthly) and executability condition
      - removed --cron option to checkarray since all it did was hide useful info from being logged
      - removed --quiet option to checkarray, for same reason
      - redirected stderr to stdout, then piped the whole thing to a logger command to make sure it gets logged to syslog
- local.d/
  - 50-md-sync-stall-shutdown.stop
    - added this to check if any md devices are in a sync state, and stall the shutdown until the operation completes
- mdadm.conf
  - you normally need this file anyway, and existing docs already mention what you need in it, but here's what I have:
    - AUTO -all to disable auto-assembly so that only arrays listed in this file are assembled
    - DEVICE /dev/disk/by-id/nvme-blah-blah*-part1 i.e. a stable glob pattern that matches only the array's or arrays' component drives
    - ARRAY /dev/md# ...etc... is the usual line you get when you run mdadm --detail --scan >>/etc/mdadm.conf during the initial setup of mdadm
    - PROGRAM /path/to/mdadm_notify.sh runs this script on every RAID array event
- smartd.conf
  - edited this to disable default of:
    - DEVICESCAN
  - and instead do:
    - DEFAULT -c interval=60 -a -m @ALL
      - to update defaults (applies only to lines below, and only to settings those lines don't override) to:
        - check every 60s (-c interval=60)
        - enable the standard collection of directives (-a)
        - on warnings/failures, run every executable in /etc/smartd_warning.d/ (-m @ALL)
    - /dev/nvme# -W 0,58,65
      - for each NVMe drive (replace # with appropriate value for desired device) to enable temperature checking since it's not enabled by default
      - can't use by-id symlinks because of bug https://www.smartmontools.org/ticket/1670, and workaround https://www.smartmontools.org/changeset/4847 only works for symlinks pointing to /dev/sd* so won't work here
      - the leading 0 disables the diff/min/max reports, so smartd only informs at 58°C and warns/notifies at 65°C (adjust as desired)
    - DEVICESCAN -c interval=1800
      - to scan for devices other than those explicitly mentioned above, but for them go back to using smartd's default 30min interval
- smartd_warning.d/
  - email-sysadmin
    - added this to send an e-mail notification
  - wall
    - added this to blast a notification via wall command
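
As an illustration of what local.d/50-md-sync-stall-shutdown.stop has to do, here is a minimal sketch, not the actual script. It relies on the kernel's /sys/block/md*/md/sync_action files, which read "idle" when an array is not resyncing, recovering or being checked. The function names, the MD_POLL_SECONDS variable and the 10 second default are assumptions made up for this example. It also treats every non-idle value as busy, which is a simplification.

```shell
#!/bin/sh
# Sketch only: wait until no md array is resyncing, recovering or being checked.
# Function names and the poll interval are assumptions, not the real script.

# Print "name: action" for every md array whose sync_action is not "idle".
# The optional argument overrides the sysfs block directory (handy for testing).
md_busy_arrays() {
    for f in "${1:-/sys/block}"/md*/md/sync_action; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        action=$(cat "$f")
        if [ "$action" != "idle" ]; then
            dev=${f%/md/sync_action}     # strip the trailing path components
            echo "${dev##*/}: $action"
        fi
    done
}

# Block until md_busy_arrays reports nothing, polling at a fixed interval.
md_wait_idle() {
    while :; do
        busy=$(md_busy_arrays "$@")
        [ -z "$busy" ] && break
        echo "waiting for md sync to finish: $busy"
        sleep "${MD_POLL_SECONDS:-10}"
    done
}
```

A .stop script built on this would simply end with a call to md_wait_idle, so that shutdown does not proceed while an array is still mid-sync.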