How to detect / notify when a raid drive fails?

* How to detect / notify when a raid drive fails?
@ 2015-11-27  5:14 Ian Kelling
  2015-11-27  5:30 ` Duncan
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Kelling @ 2015-11-27  5:14 UTC (permalink / raw)
  To: linux-btrfs

I'd like to run "mail" when a btrfs raid drive fails, but I don't
know how to detect that a drive has failed. I don't see it in
any docs. Otherwise I assume I would never know until enough
drives fail that the filesystem stops working, and I'd like to
know before that.

- Ian Kelling

* Re: How to detect / notify when a raid drive fails?
  2015-11-27  5:14 How to detect / notify when a raid drive fails? Ian Kelling
@ 2015-11-27  5:30 ` Duncan
  2015-11-27  7:42   ` Ian Kelling
  2015-11-27  9:16   ` Anand Jain
  0 siblings, 2 replies; 7+ messages in thread
From: Duncan @ 2015-11-27  5:30 UTC (permalink / raw)
  To: linux-btrfs

Ian Kelling posted on Thu, 26 Nov 2015 21:14:57 -0800 as excerpted:

> I'd like to run "mail" when a btrfs raid drive fails, but I don't know
> how to detect that a drive has failed. I don't see it in any docs.
> Otherwise I assume I would never know until enough drives fail that the
> filesystem stops working, and I'd like to know before that.

Btrfs isn't yet mature enough to have a device failure notifier daemon, 
like for instance mdadm does.  There's a patch set going around that adds 
global spares, so btrfs can detect the problem and grab a spare, but it's 
only a rather simplistic initial implementation designed to provide the 
framework for more fancy stuff later, and that's about it in terms of 
anything close, so far.

What generally happens now, however, is that the btrfs will note failures 
attempting to write the device and start queuing up writes.  If the 
device reappears fast enough, btrfs will flush the queue and be back to 
normal.  Otherwise, you pretty much need to reboot and mount degraded, 
then add a device and rebalance. (btrfs device delete missing broke some 
versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)

As for alerts, you'd see the pile of accumulating write errors in the 
kernel log.  Presumably you can write up a script that can alert on that 
and mail you the log or whatever, but I don't believe there's anything 
official or close to it, yet.
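
A minimal sketch of such a script, in Python. The regular expression is an assumption about what a btrfs per-device error counter line looks like in the kernel log; it is not confirmed anywhere in this thread, so check it against your own logs first. The names failing_devices and alert, and the "root" recipient, are hypothetical.

```python
import re
import subprocess

# ASSUMPTION: btrfs logs per-device error counters in roughly this shape, e.g.
#   "BTRFS: bdev /dev/sdb errs: wr 3, rd 0, flush 1, corrupt 0, gen 0"
# Verify against your own kernel log; the format is not guaranteed.
BDEV_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+)"
)


def failing_devices(log_lines):
    """Return {device: highest write+flush error count seen} for devices with errors."""
    worst = {}
    for line in log_lines:
        m = BDEV_ERRS.search(line)
        if not m:
            continue
        count = int(m.group("wr")) + int(m.group("flush"))
        if count > 0:
            dev = m.group("dev")
            worst[dev] = max(worst.get(dev, 0), count)
    return worst


def alert(recipient="root"):
    """Scan the kernel ring buffer and send one mail if any device shows errors."""
    # "dmesg" may require root on some systems.
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    bad = failing_devices(log.splitlines())
    if bad:
        body = "\n".join(
            f"{dev}: {n} write/flush errors" for dev, n in sorted(bad.items())
        )
        # The message body is passed to "mail" on standard input.
        subprocess.run(
            ["mail", "-s", "btrfs device errors", recipient],
            input=body, text=True, check=True,
        )
```

Calling alert() from a periodic job would send mail while the counters remain in the ring buffer; a real script would probably want to remember what it has already reported.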

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: How to detect / notify when a raid drive fails?
  2015-11-27  5:30 ` Duncan
@ 2015-11-27  7:42   ` Ian Kelling
  2015-11-27  8:10     ` Lukas Pirl
  2015-11-27  9:16   ` Anand Jain
  1 sibling, 1 reply; 7+ messages in thread
From: Ian Kelling @ 2015-11-27  7:42 UTC (permalink / raw)
  To: linux-btrfs

On Thu, Nov 26, 2015, at 09:30 PM, Duncan wrote:
> What generally happens now, however, is that the btrfs will note failures 
> attempting to write the device and start queuing up writes.  If the 
> device reappears fast enough, btrfs will flush the queue and be back to 
> normal.  Otherwise, you pretty much need to reboot and mount degraded, 
> then add a device and rebalance. (btrfs device delete missing broke some 
> versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)
> 
> As for alerts, you'd see the pile of accumulating write errors in the 
> kernel log.  Presumably you can write up a script that can alert on that 
> and mail you the log or whatever, but I don't believe there's anything 
> official or close to it, yet.

Great info, thanks. Just trying to write a file, sync and read it
sounds like the easiest test for now, especially since I don't
know what the write fail log entries will look like. And setting
up SMART notifications.
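
A rough sketch of that write/sync/read test in Python. The function name probe_filesystem and the probe file name are hypothetical. Note the assumptions: the read-back may be served from the page cache rather than the disks, and nothing in this thread establishes that the test would notice a single failed drive while the raid as a whole still accepts writes.

```python
import os
import secrets


def probe_filesystem(directory, name=".raid-probe"):
    """Write random data, force it to disk, read it back; return True on success."""
    path = os.path.join(directory, name)
    payload = secrets.token_bytes(4096)
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device(s)
        with open(path, "rb") as f:
            # May be answered from the page cache, not from the disks.
            ok = f.read() == payload
        os.remove(path)
        return ok
    except OSError:
        # Covers I/O errors and a filesystem that has gone read-only.
        return False
```

A cron job could call probe_filesystem("/mnt/data") (a placeholder mount point) and run "mail" when it returns False.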

- Ian

* Re: How to detect / notify when a raid drive fails?
  2015-11-27  7:42   ` Ian Kelling
@ 2015-11-27  8:10     ` Lukas Pirl
  0 siblings, 0 replies; 7+ messages in thread
From: Lukas Pirl @ 2015-11-27  8:10 UTC (permalink / raw)
  To: Ian Kelling; +Cc: linux-btrfs, 1i5t5.duncan

Hi Ian,

On 11/27/2015 08:42 PM, Ian Kelling wrote as excerpted:
> Great info, thanks. Just trying to write a file, sync and read it
> sounds like the easiest test for now, especially since I don't
> know what the write fail log entries will look like. And setting
> up SMART notifications.

SMART notifications, e.g. from smartmontools, are definitely useful.

Also, as Duncan wrote, errors are likely to pile up in your kernel log
if SATA/BTRFS/… errors occur.
For those, tools such as logcheck can be useful: they email you when
they think they have found something interesting in the logs.
Depending on the verbosity you can withstand, it can take a bit of
work to configure logcheck so that it does not flood your inbox with
useless emails. However, logcheck usually ships with reasonable
defaults.

Best,

Lukas

* Re: How to detect / notify when a raid drive fails?
  2015-11-27  5:30 ` Duncan
  2015-11-27  7:42   ` Ian Kelling
@ 2015-11-27  9:16   ` Anand Jain
  2015-11-27 17:19     ` Christoph Anton Mitterer
  1 sibling, 1 reply; 7+ messages in thread
From: Anand Jain @ 2015-11-27  9:16 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 11/27/2015 01:30 PM, Duncan wrote:
> Ian Kelling posted on Thu, 26 Nov 2015 21:14:57 -0800 as excerpted:
>
>> I'd like to run "mail" when a btrfs raid drive fails, but I don't know
>> how to detect that a drive has failed. I don't see it in any docs.
>> Otherwise I assume I would never know until enough drives fail that the
>> filesystem stops working, and I'd like to know before that.
>
> Btrfs isn't yet mature enough to have a device failure notifier daemon,
> like for instance mdadm does.  There's a patch set going around that adds
> global spares, so btrfs can detect the problem and grab a spare, but it's
> only a rather simplistic initial implementation designed to provide the
> framework for more fancy stuff later, and that's about it in terms of
> anything close, so far.

Thanks Duncan.

  Adding more: the above hot spare patch set also brings the device
  to a "failed state" when there is a confirmed flush/write failure,
  and prevents any further IO to it in the raid context. If there
  is no raid, it kicks in the FS error mode, which generally goes
  to readonly mode or panic, as configured at mount. It will do this
  even if there is no hot spare configured.

  The btrfs-progs part is not there yet, because it is waiting for the
  sysfs patch set to be integrated, so that progs can use it instead of
  writing new ioctls or updating existing ones.

  This patch set also introduces another state a device can go
  into, the "offline state". But it can work only when the sysfs
  interface is provided. Offline will be used mainly when
  we don't have confirmation that the device has failed, but it has
  just disappeared, e.g. when a drive is pulled out. While the device
  is in the offline state, the resilver/replace will never begin.

  Since we wanted to avoid unnecessary hot replace/resilver, the
  offline state is important.

  What is not there in this patch set yet (on the kernel side, apart
  from the btrfs-progs side) is bringing the disk back online (in the
  raid context). As of now it does nothing, though progs tells the
  user that the kernel knows about the reappeared device.

  I understand that for a user, a full md/lvm set of features is
  important to begin operations using btrfs, and we don't have it yet.
  I have to blame it on the priority list.

Thanks, Anand

> What generally happens now, however, is that the btrfs will note failures
> attempting to write the device and start queuing up writes.  If the
> device reappears fast enough, btrfs will flush the queue and be back to
> normal.  Otherwise, you pretty much need to reboot and mount degraded,
> then add a device and rebalance. (btrfs device delete missing broke some
> versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)
>
> As for alerts, you'd see the pile of accumulating write errors in the
> kernel log.  Presumably you can write up a script that can alert on that
> and mail you the log or whatever, but I don't believe there's anything
> official or close to it, yet.
>

* Re: How to detect / notify when a raid drive fails?
  2015-11-27  9:16   ` Anand Jain
@ 2015-11-27 17:19     ` Christoph Anton Mitterer
  2015-11-30 14:01       ` Anand Jain
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Anton Mitterer @ 2015-11-27 17:19 UTC (permalink / raw)
  To: linux-btrfs, Anand Jain

On Fri, 2015-11-27 at 17:16 +0800, Anand Jain wrote:
>   I understand as a user, a full md/lvm set of features are important
>   to begin operations using btrfs and we don't have it yet. I have to
>   blame it on the priority list.
What would be especially nice from the admin side would be something
like /proc/mdstat, which centrally gives information about the health
of your RAID.

It can/should of course be more than just "OK" / "not OK"...
information about which devices are in which state, whether a
rebuild/reconstruction/scrub is going on, etc. pp.
Maybe even details of properties like chunk sizes (as far as these
apply to btrfs).

Having a dedicated monitoring process... well, nice to have, but
something like mdstat is always there, doesn't need special userland
tools and can easily be used by 3rd party stuff like Icinga/Nagios
check_raid.
I think the keywords here are human readable + parseable... so maybe
even two files.

Cheers,
Chris.

* Re: How to detect / notify when a raid drive fails?
  2015-11-27 17:19     ` Christoph Anton Mitterer
@ 2015-11-30 14:01       ` Anand Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Anand Jain @ 2015-11-30 14:01 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs




On 11/28/2015 01:19 AM, Christoph Anton Mitterer wrote:
> On Fri, 2015-11-27 at 17:16 +0800, Anand Jain wrote:
>>    I understand as a user, a full md/lvm set of features are important
>>    to begin operations using btrfs and we don't have it yet. I have to
>>    blame it on the priority list.
> What would be especially nice from the admin side would be something
> like /proc/mdstat, which centrally gives information about the health
> of your RAID.

  Yep. It's planned. A design doc has been in my drafts for some time
  now; I just sent it to the mailing list for review comments.


> It can/should of course be more than just "OK" / "not OK"...
> information about which devices are in which state, whether a
> rebuild/reconstruction/scrub is going on, etc. pp.

  right.

> Maybe even details of properties like chunk sizes (as far as these
> apply to btrfs).
>
> Having a dedicated monitoring process... well nice to have, but
> something like mdstat is, always there, doesn't need special userland
> tools and can easily used by 3rd party stuff like Icinga/Nagios
> check_raid.

  yep. will consider.

> I think the keywords here are human readable + parseable... so maybe
> even two files.

  yeah. For parseability I liked procfs; there is an experimental
  /proc/fs/btrfs/devlist. But procfs is kind of not recommended, so
  we would probably need a wrapper tool on top of sysfs to provide
  the same effect.
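
  To illustrate the idea of such a wrapper, here is a hedged sketch in Python. It assumes the kernel exposes one directory per mounted filesystem UUID under /sys/fs/btrfs, each with a "devices" subdirectory; that layout is an assumption here, not something stated in this thread. The function names are hypothetical, and the output lists devices only, with no health state.

```python
import os


def btrfs_devices(sysfs_root="/sys/fs/btrfs"):
    """Map each filesystem UUID under sysfs_root to a sorted list of its device names."""
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for entry in sorted(os.listdir(sysfs_root)):
        devdir = os.path.join(sysfs_root, entry, "devices")
        # Entries without a "devices" subdirectory are not filesystems; skip them.
        if os.path.isdir(devdir):
            result[entry] = sorted(os.listdir(devdir))
    return result


def format_devlist(devices):
    """Render one 'uuid device' pair per line, easy to parse with standard tools."""
    return "\n".join(
        f"{uuid} {dev}" for uuid, devs in devices.items() for dev in devs
    )
```

  Printing format_devlist(btrfs_devices()) gives one line per device, which a monitoring check could compare against the expected device count.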

Thanks, Anand

>
> Cheers,
> Chris.
>
