From: Christoph Anton Mitterer <calestyo@scientia.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Status of RAID5/6
Date: Wed, 21 Mar 2018 21:02:36 +0100
Message-ID: <1521662556.4312.39.camel@scientia.net>
In-Reply-To: <CANQeFDDxZSZ4jYDPvW-Q=AoyPrGzpp0fVywjFOJtkeD+Ysgmew@mail.gmail.com>

Hey.

Some things would IMO be nice to get done/clarified (i.e. documented in
the Wiki and manpages) from a user's/admin's POV:

Some basic questions:
- Starting with which kernels (including stable kernel versions) are
  the fixes for the bigger issues from some time ago included?

- Exactly what does not work yet (only the write hole?)?
  What's the roadmap for such non-working things?

- Ideally some explicit confirmations of what's considered to work,
  like:
  - compression+raid?
  - rebuild / replace of devices?
  - changing RAID levels (see the conversion sketch below)?
  - repairing data (i.e. picking the right block according to csums in
    case of silent data corruption)?
  - scrub (and scrub+repair)?
  - anything to consider with raid when doing snapshots, send/receive
    or defrag?
  => and for each of these: for which raid levels?

  Perhaps also confirmation for previous issues:
  - I vaguely remember there were issues with either device delete or
    replace.... and that one of them was possibly super-slow?
  - I also remember there were cases in which a fs could end up in
    permanent read-only state?
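
  Just to make the "changing RAID levels" point above concrete: what I
  have in mind is a conversion via balance filters, roughly like the
  following (the mount point and the target profiles are of course only
  examples, and I'm assuming current btrfs-progs syntax here):

    # convert data chunks to RAID5 and metadata chunks to RAID1
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt/data
    # check how far the conversion has progressed
    btrfs balance status /mnt/data

  Is such a conversion considered fully working for RAID5/6, and can it
  be safely interrupted and resumed?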


- Clarifying questions on what is expected to work and how things are
  expected to behave, e.g.:
  - Can one pull a device (without deleting/removing it first) during
    operation, and will btrfs survive it?
  - If an error is found (e.g. silent data corruption based on csums),
    when will it repair&fix (fix = write the repaired data) the data?
    On the read that finds the bad data?
    Only on scrub (i.e. do users need to regularly run scrubs; see the
    sketch after this list for what I'd currently run)?
  - What happens if an error cannot be repaired, e.g. no csum
    information or all blocks bad?
    EIO? Or are there cases where it gives no EIO (I guess at least in
    the nodatacow case)?
  - What happens if data cannot be fixed (i.e. trying to write the
    repaired block again fails)?
    And if the repaired block is written, will it be immediately
    checked again (to find cases of blocks that give different results
    again)?
  - Will a scrub check only the data on "one" device... or will it
    check all the copies (or parity blocks) on all devices in the raid?
  - Does a fsck check all devices or just one?
  - Does a balance implicitly contain a scrub?
  - If a rebuild/repair/reshape is performed... can these be
    interrupted? What if they are forcibly interrupted (power loss)?
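
  For the repair/scrub questions above, what I'd currently run (and
  would like to see confirmed or corrected in the documentation) is
  roughly the following; the mount point is of course just an example:

    # scrub all devices of the filesystem and check the result
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data
    # per-device error counters (read/write/csum/generation errors)
    btrfs device stats /mnt/data

  But it's not clear to me whether that is enough to trigger all
  possible repairs on RAID5/6, or whether more is needed.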


- Explaining common workflows:
  - Replacing a faulty or simply an old disk:
    How to stop btrfs from using a device (without bricking the fs)?
    How to do the rebuild? (For what I'd currently try, see the sketch
    after this list.)
  - Best practices, like: should one do regular balances (and if so, as
    asked above, do these include the scrubs, so basically: is it
    enough to do one of them)?
  - How to grow/shrink a raid btrfs... and if this is done... how to
    replicate the data already on the fs to the newly added disks (or
    is this done automatically - and if so, how to see that it's
    finished)?
  - What will actually trigger repairs? (i.e. one wants to get silent
    block errors fixed ASAP and not only when the data is read - and
    when it's possibly too late)
  - In the rebuild/repair phase (e.g. one replaces a device): Can one
    somehow give priority to the rebuild/repair? (e.g. in case of a
    degraded raid, one may want to get that solved ASAP and rather slow
    down other reads or stop them completely.)
  - Is there anything to notice when btrfs raid is placed above dm-
    crypt from a security PoV?
    With MD raid that wasn't much of a problem as it's typically placed
    below dm-crypt... but btrfs raid would need to be placed above it.
    So maybe there are some known attacks against crypto modes when
    identical (RAID 1 / 10) or similar (RAID 5/6) data is written to
    multiple crypto devices? (Probably something one would need to ask
    the crypto experts.)
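
  For the disk-replacement and grow/shrink workflows above, this is
  what I would currently attempt, and it would be great if the docs
  explicitly confirmed (or corrected) it; device names and the mount
  point are just placeholders:

    # replace a failing /dev/sdb with a new /dev/sdd in one step
    btrfs replace start /dev/sdb /dev/sdd /mnt/data
    btrfs replace status /mnt/data

    # grow: add a device, then balance so that existing data is
    # spread over the new device as well
    btrfs device add /dev/sde /mnt/data
    btrfs balance start /mnt/data

    # shrink: remove a device again (data is migrated off it first)
    btrfs device remove /dev/sde /mnt/data

  Especially whether the full balance after 'device add' is actually
  needed (or recommended) is something I couldn't find clearly
  documented.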


- Maintenance tools
  - How to get the status of the RAID? (Querying kernel logs is IMO
    rather a bad way for this; for what I piece together today, see the
    sketch after this list.)
    This includes:
    - Is the raid degraded or not?
    - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
      they? (Reshape would be: if the raid level is changed or the raid
      grown/shrunk: has all data been replicated enough to be
      "complete" for the desired raid level/number of devices/size?)
  - What should one regularly do? Scrubs? Balance? How often?
    Do we get any automatic (but configurable) tools for this?
  - There should be support in commonly used tools, e.g. Icinga/Nagios
    check_raid.
  - Ideally there should also be some desktop notification tool, which
    reports raid (and btrfs errors in general), as small
    installations with raids typically run no Icinga/Nagios but rely
    on e.g. email or GUI notifications.
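
  For the status questions above: AFAICT the closest one gets today is
  to manually combine several commands, roughly like this (mount point
  again just an example):

    # allocation and profile per device
    btrfs filesystem usage /mnt/data
    # progress of a running scrub / balance / replace
    btrfs scrub status /mnt/data
    btrfs balance status /mnt/data
    btrfs replace status /mnt/data
    # per-device error counters
    btrfs device stats /mnt/data

  ... which is exactly why a single, easily parseable "RAID status"
  interface for monitoring tools would be so valuable.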

I think especially for such tools it's important that they are
maintained by upstream (and yes, I know you guys are rather fs
developers than tool developers)... but since these tools are so vital,
having them done by a 3rd party can easily lead to the situation where
something changes in btrfs, the tools don't notice it and errors remain
undetected.


- Future?
  What about things like hotspare support? E.g. a good userland tool
  could be configured so that one disk is a hotspare... and if there's
  a failure it could automatically power it up and replace the faulty
  drive with it.
  It could go further: not only completely failed devices would be
  replaced, but a replace would also be triggered if a configurable
  number of csum / read / write / etc. errors is found.
  Maybe such a tool could even look at SMART data and proactively
  replace disks.

  What about features that were "announced/suggested/etc." earlier?
  E.g. n-parity-raid ... or n-way-mirrored-raid?


- Real world test?
  Is there already any bigger user of current btrfs raid5/6? I.e. where
  hundreds of raids, devices, etc. are massively used? Where many
  devices failed (because of age) or were pulled, etc. (all the
  typical things that happen in computing centres)?
  So that one could get a feeling for whether it's actually stable.


Cheers,
Chris.
