linux-raid.vger.kernel.org archive mirror
From: Wolfgang Denk <wd@denx.de>
To: Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Disk Monitoring
Date: Wed, 28 Jun 2017 15:19:17 +0200
Message-ID: <20170628131917.BF1911235B6@gemini.denx.de>

[-- Attachment #1: Type: text/plain, Size: 2945 bytes --]

Dear Gandalf,

In message <CAJH6TXgvrVckHDmh1oiN9mupLrsS2NP3J44bG1_wE9Nnx4=yHQ@mail.gmail.com> you wrote:
> 
> 1) all raid controllers have proactive monitoring features, like
> patrol read, consistency check and (more or less) some SMART
> integration.
> Any counterpart in mdadm?

As Wol already pointed out, you should use smartctl to monitor
the state of the disk drives, ideally on a regular basis.  Changes
(increases) in numbers like "Reallocated Sectors", "Current Pending
Sectors" or "Offline Uncorrectable Sectors" are always suspicious.
If they increase just by one and then stay constant for weeks, you
can probably ignore it.  But if you see I/O errors in the system
logs and/or "Reallocated Sectors" increasing every few days, then you
should not wait much longer before replacing the respective drive.
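
For a quick manual check of just these three counters, something like
the following could be used.  This is only a sketch: the function name
smart_counters is made up for illustration, and it assumes the usual
ATA attribute table printed by "smartctl -A", where the attribute name
is in column 2 and the raw value in column 10.

```shell
#!/bin/sh
# Print name and raw value of the three sector counters discussed above.
# Reads "smartctl -A" output on stdin, so it also works on saved output.
# (smart_counters is a hypothetical helper, not part of smartmontools.)
smart_counters() {
	awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Typical use (needs root):
#   smartctl -A /dev/sda | smart_counters
```

Drives that do not report one of these attributes simply produce no
line for it.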

Attached are two very simple scripts I use for this purpose;
"disk-test" simply runs smartctl on all /dev/sd? devices and parses
the output.  The result is something like this:

$ sudo disk-test
=== /dev/sda : ST1000NM0011 S/N Z1N2RA6E *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdb : ST2000NM0033-9ZM175 S/N Z1X1J1K9 OK
=== /dev/sdc : ST2000NM0033-9ZM175 S/N Z1X1JEF6 OK
=== /dev/sdd : ST2000NM0033-9ZM175 S/N Z1X4XSN9 OK
=== /dev/sde : ST2000NM0033-9ZM175 S/N Z1X4X6G8 OK
=== /dev/sdf : ST2000NM0033-9ZM175 S/N Z1X54EA1 OK
=== /dev/sdg : ST2000NM0033-9ZM175 S/N Z1X5443W OK
=== /dev/sdh : ST2000NM0033-9ZM175 S/N Z1X4XAHQ OK
=== /dev/sdi : ST2000NM0033-9ZM175 S/N Z1X4X6NB OK
=== /dev/sdj : TOSHIBA MK1002TSKB S/N 32E3K0K2F OK
=== /dev/sdk : TOSHIBA MK1002TSKB S/N 32F3K0PRF OK
=== /dev/sdl : TOSHIBA MK1002TSKB S/N 32H3K10CF *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdm : TOSHIBA MK1002TSKB S/N 32H3K0ZLF OK
=== /dev/sdn : TOSHIBA MK1002TSKB S/N 32H3K104F OK
=== /dev/sdo : TOSHIBA MK1002TSKB S/N 32H1K31DF OK
=== /dev/sdp : TOSHIBA MK1002TSKB S/N 32F3K0PUF OK
=== /dev/sdq : TOSHIBA MK1002TSKB S/N 32E3K0JZF OK

Here I have two drives with 1 reallocated sector each, which I
consider harmless as the count has stayed constant for several months.
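
With many drives it can be handy to list only the flagged devices.  A
minimal sketch, assuming the report format shown above (the function
name flagged_drives is hypothetical):

```shell
#!/bin/sh
# Print the device names of all drives that disk-test marked with
# "*** ERRORS ***".  Reads a disk-test report on stdin.
# (flagged_drives is a hypothetical helper name.)
flagged_drives() {
	awk '$1 == "===" && /\*\*\* ERRORS \*\*\*/ { print $2 }'
}

# Typical use:
#   sudo disk-test | flagged_drives
```

For the report above this would list /dev/sda and /dev/sdl.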

The second script "disk-watch" is intended to be run as a cron job
on a regular basis (here usually twice per day).  It will send out
an email whenever the state changes (don't forget to adjust the
MAIL_TO setting).  You may also want to clean up the entries in
/var/log/diskwatch every now and then (or better, add the directory
to your logrotate configuration).
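
If you prefer not to use logrotate, the cleanup could look roughly
like this.  It is a sketch only: the function name prune_diskwatch and
the 90-day default are arbitrary assumptions, and -maxdepth is a
GNU/BSD find extension rather than POSIX.

```shell
#!/bin/sh
# Delete dated disk-watch reports older than a given number of days.
# The "latest" and "previous" symlinks are left alone because
# "-type f" matches regular files only.
# (prune_diskwatch and the 90-day default are assumptions.)
prune_diskwatch() {
	dir=${1:-/var/log/diskwatch}
	days=${2:-90}
	find "$dir" -maxdepth 1 -type f -mtime +"$days" -exec rm -f {} +
}

# Typical use, e.g. from a weekly cron job:
#   prune_diskwatch /var/log/diskwatch 90
```

Since disk-watch runs at least daily, the reports that "latest" and
"previous" point to are always far younger than the cutoff.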

HTH.


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Yes, it's a technical challenge, and  you  have  to  kind  of  admire
people  who go to the lengths of actually implementing it, but at the
same time you wonder about their IQ...
         --  Linus Torvalds in <5phda5$ml6$1@palladium.transmeta.com>


[-- Attachment #2: disk-test --]
[-- Type: text/plain , Size: 1083 bytes --]

#!/bin/sh

DISKS="$(echo /dev/sd?)"

PATH=$PATH:/sbin:/usr/sbin

for i in ${DISKS}
do
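	# Keep model, serial number, the three sector counters and known
	# failure messages; then drop counter lines whose raw value is 0.
	SMARTDATA=$(smartctl -a "$i" | \
	grep -E 'Device Model:|Serial Number:|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|failed|Unknown USB' | \
	grep -v ' -  *0$')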
	LINES=$(echo "${SMARTDATA}" | wc -l)
	HEAD=$(echo "${SMARTDATA}" | \
	       sed -n -e 's/Device Model: //p' \
		      -e 's!Serial Number:!S/N!p')	
	BODY=$(echo "${SMARTDATA}" | \
	       awk '$2 ~ /Reallocated_Sector_Ct/	{ printf "Reallocated Sectors:   %3d\n", $10 }
		    $2 ~ /Current_Pending_Sector/	{ printf "Current Pending Sect:  %3d\n", $10 }
		    $2 ~ /Offline_Uncorrectable/	{ printf "Offline Uncorrectable: %3d\n", $10 }
		    $0 ~ /failed:.*AMCC/		{ printf "Unsupported AMCC/3ware controller\n" }
		    $0 ~ /SMART command failed/		{ printf "Device does not support SMART\n" }
		    $0 ~ /Unknown USB bridge/		{ printf "Unknown USB bridge\n" }
		'
	     )
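	# Exactly two lines left (model and serial number): nothing to report.
	# ${HEAD} is left unquoted on purpose so both lines join into one.
	if [ $LINES -eq 2 ]
	then
		echo === "$i" : ${HEAD} OK
	else
		echo === "$i" : ${HEAD} "*** ERRORS ***"
		echo "${BODY}" | sed -e 's/^/	/'
	fi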
done

[-- Attachment #3: disk-watch --]
[-- Type: text/plain , Size: 683 bytes --]

#!/bin/sh

D_TEST=/usr/local/sbin/disk-test
D_LOGDIR=/var/log/diskwatch
MAIL_TO="root"

[ -x ${D_TEST} ] || { echo "ERROR: cannot execute ${D_TEST}" >&2 ; exit 1 ; }

[ -d ${D_LOGDIR} ] || \
	mkdir -p ${D_LOGDIR} || \
		{ echo "ERROR: cannot create ${D_LOGDIR}" >&2 ; exit 1 ; }

cd ${D_LOGDIR} || { echo "ERROR: cannot cd ${D_LOGDIR}" >&2 ; exit 1 ; }

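# Rotate: the symlink from the last run becomes "previous".
rm -f previous

[ -L latest ] && mv latest previous

# Store this run's report under a timestamped name.
NOW=$(date "+%F-%T")

"${D_TEST}" >"${NOW}"

ln -s "${NOW}" latest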

DIFF=''

[ -r previous ] && DIFF=$(diff -u previous latest)

[ -z "${DIFF}" ] && exit 0

mailx -s "$(hostname): SMART DISK WARNING" ${MAIL_TO} <<+++
Disk status change:
${DIFF}

Recent results:
$(cat latest)
+++

