User:GNUru/Software RAID on NVMe with Monitoring

The default configuration you get by installing the usual tools and following the official documentation for software RAID on NVMe drives is very, *VERY* far from adequate. I learned this the hard way after buying a pair of Samsung 980 PRO 2TB NVMe modules whose firmware is now widely known to be defective, something Samsung was very quiet about. I very nearly lost a bunch of data, but my RAID 1 configuration plus an old backup meant that in the end I lost none. It was an eye-opening experience that made me realize just how lacking the standard monitoring capabilities of mdadm and smartmontools are, particularly for NVMe drives (despite their prevalence in 2023).

Here's a cheat sheet or overview of what you need to do to achieve a sane monitoring setup on a desktop:


 * mdadm, nvme, smartmontools, cronie, and a mail agent installed and configured using their respective guides
 * scripts I wrote, and either placed somewhere in  or called by absolute path:
   * wrote this as a generic wrapper for e-mailing the sysadmin
   * wrote this to gather useful info about a RAID event and send a notification e-mail
   * wrote this script, which keeps state (in a subdir of the same dir that smartd uses) and notifies (via e-mail and ) if any NVMe lifetime health properties degrade
 * edited this to add  and   options
   * both of those dirs are the values suggested in, and had to be created manually
   * the attribute log doesn't actually work for NVMe, so it's only added in preparation for if/when support arrives; see https://www.smartmontools.org/ticket/1190
   * savestates stores the min/max temperatures seen per device, so that they're preserved across restarts of smartd (the devices themselves don't track their min/max temperatures)
 * added this to log SMART data for every SMART-capable device to syslog every day, mainly since smartd's attribute log feature doesn't (yet?) support NVMe drives
 * added this to call your  script to perform crucial NVMe checks that smartctl doesn't (yet?) support
 * edited this (commented it all out) to disable default of scrubbing weekly
 * copied from, but:
   * removed the stupid date condition (which assumes the script runs weekly and breaks if it's run monthly) and the executability condition
   * removed  option to   since all it did was hide useful info from being logged
   * removed  option to , for the same reason
   * redirected stderr to stdout, then piped the whole thing to a  command to make sure it gets logged to syslog
 * added this to check if any md devices are in a sync state, and stall the shutdown until the operation completes
 * you normally need this file anyway, and existing docs already cover what goes in it, but here's what I have:
   * to disable auto-assembly so that only arrays listed in this file are assembled
   * i.e. a stable glob pattern that matches only the array's or arrays' component drives
   * is the usual line you get when you run  during the initial setup of mdadm
   * runs this script on every RAID array event
 * edited this to disable the default of:
 * and instead do:
   * to update the defaults (only applies to lines below, and only to settings which are not overridden in those lines) to:
     * check every 60s
     * enable the standard collection of directives
     * and on warnings/failures run every executable in
   * for each NVMe drive (replace  with the appropriate value for the desired device) to enable temperature checking, since it's not enabled by default
     * can't use  symlinks because of bug https://www.smartmontools.org/ticket/1670, and the workaround https://www.smartmontools.org/changeset/4847 only works for symlinks pointing to   so won't work here
     * enables temperature checking, but not the diff/min/max check: just inform at 58 °C and warn/notify at 65 °C (adjust as desired)
   * to scan for devices other than those explicitly mentioned above, but for them go back to using smartd's default 30min interval
 * added this to send an e-mail notification
 * added this to blast a notification via  command
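
The state-keeping idea behind the NVMe health script in the list above can be sketched roughly as follows. This is not the script itself, just a minimal Python illustration: the function names and the state-file layout are hypothetical, and the field names (`critical_warning`, `percent_used`, `avail_spare`, `media_errors`, `num_err_log_entries`) are assumed to match the JSON that nvme-cli prints for its SMART log, so check them against your own output before relying on them.

```python
import json
from pathlib import Path

# Fields assumed to only ever get worse over a drive's lifetime.
# The key names are an assumption about nvme-cli's JSON smart-log output.
HIGHER_IS_WORSE = ("percent_used", "media_errors", "num_err_log_entries")
LOWER_IS_WORSE = ("avail_spare",)


def find_degradations(previous, current):
    """Return one message per tracked field that got worse since last time."""
    messages = []
    # critical_warning is a bitfield, so report any new non-zero value.
    warning = current.get("critical_warning", 0)
    if warning and warning != previous.get("critical_warning", 0):
        messages.append(f"critical_warning is now {warning}")
    for key in HIGHER_IS_WORSE:
        if key in previous and key in current and current[key] > previous[key]:
            messages.append(f"{key} rose from {previous[key]} to {current[key]}")
    for key in LOWER_IS_WORSE:
        if key in previous and key in current and current[key] < previous[key]:
            messages.append(f"{key} fell from {previous[key]} to {current[key]}")
    return messages


def check_device(state_file, current):
    """Compare a fresh reading with the saved one, then save the fresh one."""
    state_file = Path(state_file)
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    messages = find_degradations(previous, current)
    state_file.write_text(json.dumps(current))
    return messages
```

In practice `current` would come from parsing the output of something like `nvme smart-log /dev/nvme0 --output-format=json`, and any returned messages would be handed to whatever notification wrapper you use. The first run has no saved state, so it only records a baseline (though a non-zero `critical_warning` is still reported).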
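
The shutdown check mentioned in the list above (stall until no md device is syncing) could look something like the sketch below. Again this is only an illustration under assumptions, not the actual script: it assumes the kernel exposes each array's state in `/sys/block/mdX/md/sync_action`, reading `idle` when nothing is running, and the function names and polling interval are made up for the example.

```python
import time
from pathlib import Path

# States taken to mean the array is actively working on a sync-type operation.
ACTIVE_STATES = {"resync", "recover", "check", "repair", "reshape"}


def busy_arrays(sys_block="/sys/block"):
    """Map each md device name to its sync action, for arrays that are busy."""
    busy = {}
    for action_file in sorted(Path(sys_block).glob("md*/md/sync_action")):
        action = action_file.read_text().strip()
        if action in ACTIVE_STATES:
            # .../md0/md/sync_action -> "md0"
            busy[action_file.parent.parent.name] = action
    return busy


def wait_until_idle(sys_block="/sys/block", poll_seconds=10, sleep=time.sleep):
    """Block until no md array reports an active sync operation."""
    while True:
        busy = busy_arrays(sys_block)
        if not busy:
            return
        for name, action in busy.items():
            print(f"{name}: {action} in progress, delaying shutdown")
        sleep(poll_seconds)
```

Calling `wait_until_idle()` from a shutdown hook would hold things up until every array is done. The `sys_block` and `sleep` parameters exist only so the logic can be exercised against a fake directory tree without touching real devices.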