Re: Status of RAID5/6 - Austin S. Hemmelgarn

From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Christoph Anton Mitterer <calestyo@scientia.net>,
	linux-btrfs@vger.kernel.org
Subject: Re: Status of RAID5/6
Date: Thu, 22 Mar 2018 08:01:55 -0400
Message-ID: <d46c16e9-4967-51de-762b-ae01f2e10e0a@gmail.com>
In-Reply-To: <1521662556.4312.39.camel@scientia.net>

On 2018-03-21 16:02, Christoph Anton Mitterer wrote:
On the note of maintenance specifically:
> - Maintenance tools
>    - How to get the status of the RAID? (Querying kernel logs is IMO
>      rather a bad way for this)
>      This includes:
>      - Is the raid degraded or not?
Check for the 'degraded' flag in the mount options.  Assuming you're 
doing things sensibly and not specifying it on mount, it gets added when 
the array goes degraded.
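
As a minimal sketch of that check (not code from this thread), the mount options can be read from /proc/mounts. It assumes the 'degraded' flag shows up there as described above; the function name is made up for illustration.

```python
def degraded_btrfs_mounts(mounts_text):
    """Return the mount points of btrfs filesystems whose options include 'degraded'.

    mounts_text is the content of /proc/mounts, where each line has the
    fields: device, mount point, filesystem type, options, dump, pass.
    """
    degraded = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue  # malformed line or not a btrfs mount
        # Options are comma-separated; compare whole option names only.
        if "degraded" in fields[3].split(","):
            degraded.append(fields[1])
    return degraded
```

On a live system this would be called with the text read from /proc/mounts, and any returned mount point is worth a warning.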

>      - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
>        they? (Reshape would be: if the raid level is changed or the raid
>        grown/shrinked: has all data been replicated enough to be
>        "complete" for the desired raid lvl/number of devices/size?
A bit trickier, but still not hard: just check the output of `btrfs
scrub status`, `btrfs balance status`, and `btrfs replace status` for
the volume.  None of these report automatic spot-repairs (that is,
repairs of individual blocks that fail checksums), but most people
really don't care about those.
>     - What should one regularly do? scrubs? balance? How often?
>       Do we get any automatic (but configurable) tools for this?
There aren't any such tools that I know of currently.  storaged might 
have some, but I've never really looked at it so I can't comment (I'm
kind of averse to having hundreds of background services running to do
stuff that can just as easily be done in a polling manner from cron 
without compromising their utility).  Right now though, it's _trivial_ 
to automate things with cron, or systemd timers, or even third-party 
tools like monit (which has the bonus that if the maintenance fails, you 
get an e-mail about it).
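
To illustrate the cron approach, here is a small sketch of a wrapper that runs a foreground scrub on each mount point given on the command line and complains on stderr if one fails. The function names and the script layout are assumptions for the example, not an existing tool.

```python
import subprocess
import sys


def run_scrub(mount_point, run=subprocess.run):
    """Run a foreground scrub and return (succeeded, combined output)."""
    # -B keeps the scrub in the foreground, so the exit status reflects the result.
    result = run(
        ["btrfs", "scrub", "start", "-B", mount_point],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, (result.stdout + result.stderr).strip()


def main(argv, run=subprocess.run):
    """Scrub every mount point in argv[1:]; return 1 if any scrub failed."""
    failed = False
    for mount_point in argv[1:]:
        ok, output = run_scrub(mount_point, run=run)
        if not ok:
            failed = True
            # Only failures produce output, so cron stays quiet otherwise.
            print(f"scrub of {mount_point} failed:\n{output}", file=sys.stderr)
    return 1 if failed else 0

# When saved as a script, finish with: sys.exit(main(sys.argv))
```

Since cron mails any output of a job to the address in MAILTO, a crontab entry that runs this script on whatever schedule suits the array is enough to get an e-mail when a scrub fails.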

>     - There should be support in commonly used tools, e.g. Icinga/Nagios
>       check_raid
Agreed.  I think there might already be a Nagios plugin for the basic 
checks, not sure about anything else though.

Netdata has had basic monitoring support for a while now, but it only 
looks at allocations, not error counters, so while it will help catch 
impending ENOSPC issues, it can't really help much with data corruption 
issues.

>     - Ideally there should also be some desktop notification tool, which
>       tells about raid (and btrfs errors in general) as small
>       installations with raids typically run no Icinga/Nagios but rely
>       on e.g. email or gui notifications.
Desktop notifications would be nice, but are out of scope for the main 
btrfs-progs.  Not even LVM, MDADM, or ZFS ship desktop notification 
support from upstream.  You don't need Icinga or Nagios for monitoring 
either.  Netdata works pretty well for covering the allocation checks
(and I'm planning to have something soon), and it's trivial to set up
e-mail notifications with cron or systemd timers or even tools like monit.

On the note of generic monitoring though, I've been working on a Python 
3 script (with no dependencies beyond the Python standard library) to do 
the same checks that Netdata does regarding allocations, as well as 
checking device error counters and mount options that should be 
reasonable as a simple warning tool run from cron or a systemd timer. 
I'm hoping to get it included in the upstream btrfs-progs, but I don't 
have it in a state yet that it's ready to be posted (the checks are 
working, but I'm still having issues reliably mapping between mount 
points and filesystem UUIDs).
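
This is not the script described above, but the error-counter part of such a check can be sketched roughly as follows. It assumes the text comes from `btrfs device stats <mount point>`, which prints one line per counter in the form `[/dev/sda].write_io_errs    0`; the function name is invented for the example.

```python
import re

# One counter per line, e.g. "[/dev/sda].corruption_errs  0".
_STAT_LINE = re.compile(r"^\[(?P<dev>.+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_error_counters(stats_text):
    """Return {device: {counter: value}} for every counter greater than zero."""
    problems = {}
    for line in stats_text.splitlines():
        match = _STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            device_counters = problems.setdefault(match["dev"], {})
            device_counters[match["counter"]] = int(match["value"])
    return problems
```

An empty result means no device reported errors; anything else is worth a warning from cron or a systemd timer.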

> I think especially for such tools it's important that these are
> maintained by upstream (and yes I know you guys are rather fs
> developers not)... but since these tools are so vital, having them done
> 3rd party can easily lead to the situation where something changes in
> btrfs, the tools don't notice and errors remain undetected.
It depends on what they look at.  All the stuff under /sys/fs/btrfs 
should never change (new things might get added, but none of the old 
stuff is likely to ever change because /sys is classified as part of the 
userspace ABI, and any changes would get shot down by Linus), so 
anything that just uses those will likely have no issues (Netdata falls 
into this category for example).  Same goes for anything using ioctls 
directly, as those are also userspace ABI.
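
As an example of a tool that only touches /sys/fs/btrfs, the sketch below reads the per-filesystem allocation counters. It assumes the layout `/sys/fs/btrfs/<UUID>/allocation/<data|metadata|system>/{bytes_used,total_bytes}`; the function name and the `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path


def allocation_usage(sysfs_root="/sys/fs/btrfs"):
    """Return {uuid: {kind: (bytes_used, total_bytes)}} for mounted filesystems."""
    usage = {}
    for fs_dir in Path(sysfs_root).iterdir():
        alloc = fs_dir / "allocation"
        if not alloc.is_dir():
            continue  # not a per-filesystem directory (e.g. 'features')
        for kind in ("data", "metadata", "system"):
            used = alloc / kind / "bytes_used"
            total = alloc / kind / "total_bytes"
            if used.is_file() and total.is_file():
                # Each file holds a single decimal number followed by a newline.
                usage.setdefault(fs_dir.name, {})[kind] = (
                    int(used.read_text()),
                    int(total.read_text()),
                )
    return usage
```

Comparing `bytes_used` against `total_bytes` per chunk type is the kind of allocation check that helps catch impending ENOSPC issues.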
