From: Anand Jain <anand.jain@oracle.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Unexpected raid1 behaviour
Date: Mon, 18 Dec 2017 13:11:17 +0800 [thread overview]
Message-ID: <3acce035-2568-27c9-d251-8a22a38497fd@oracle.com> (raw)
In-Reply-To: <pan$36ed4$200dd538$4324239e$ed0f9f4@cox.net>
Nice status update about the btrfs volume manager. Thanks.
Below I have added the names of the patches (in the ML / WIP)
addressing the current limitations.
On 12/17/2017 07:58 PM, Duncan wrote:
> Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted:
>
>> Could someone please point me towards some read about how btrfs handles
>> multiple devices? Namely, kicking faulty devices and re-adding them.
>>
>> I've been using btrfs on single devices for a while, but now I want to
>> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
>> tried to see how it handles various situations. The experience left
>> me very surprised; I've tried a number of things, all of which produced
>> unexpected results.
>>
>> I create a btrfs raid1 filesystem on two hard drives and mount it.
>>
>> - When I pull one of the drives out (simulating a simple cable failure,
>> which happens pretty often to me), the filesystem sometimes goes
>> read-only. ???
>> - But only after a while, and not always. ???
>> - When I fix the cable problem (plug the device back), it's immediately
>> "re-added" back. But I see no replication of the data I've written onto
>> a degraded filesystem... Nothing shows any problems, so "my filesystem
>> must be ok". ???
>> - If I unmount the filesystem and then mount it back, I see all my
>> recent changes lost (everything I wrote during the "degraded" period). -
>> If I continue working with a degraded raid1 filesystem (even without
>> damaging it further by re-adding the faulty device), after a while it
>> won't mount at all, even with "-o degraded".
>>
>> I can't wrap my head around all this. Either the kicked device should not
>> be re-added, or it should be re-added "properly", or it should at least
>> show some errors and not pretend nothing happened, right?..
>>
>> I must be missing something. Is there an explanation somewhere about
>> what's really going on during those situations? Also, do I understand
>> correctly that upon detecting a faulty device (a write error), nothing
>> is done about it except logging an error into the 'btrfs device stats'
>> report? No device kicking, no notification?.. And what about degraded
>> filesystems - is it absolutely forbidden to work with them without
>> converting them to a "single" filesystem first?..
>>
>> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1.
>
> Btrfs device handling at this point is still "development level" and very
> rough, but there's a patch set in active review ATM that should improve
> things dramatically, perhaps as soon as 4.16 (4.15 is already well on the
> way).
>
> Basically, at this point btrfs doesn't have "dynamic" device handling.
> That is, if a device disappears, it doesn't know it. So it continues
> attempting to write to (and read from, but the reads are redirected) the
> missing device until things go bad enough it kicks to read-only for
> safety.
btrfs: introduce device dynamic state transition to failed
> If a device is added back, the kernel normally shuffles device names and
> assigns a new one. Btrfs will see it and list the new device, but it's
> still trying to use the old one internally. =:^(
btrfs: handle dynamically reappearing missing device
> Thus, if a device disappears, to get it back you really have to reboot,
> or at least unload/reload the btrfs kernel module, in order to clear
> the stale device state and have btrfs rescan and reassociate devices with
> the matching filesystems.
>
> Meanwhile, once a device goes stale -- other devices in the filesystem
> have data that should have been written to the stale one but it was gone
> so the data couldn't get to it -- once you do the module unload/reload or
> reboot cycle and btrfs picks up the device again, you should immediately
> do a btrfs scrub, which will detect and "catch up" the differences.
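For the archives, that recovery sequence looks roughly like this
(device and mount paths here are only examples; this needs root and
real block devices):

```shell
# Drop the stale in-kernel device state by reloading the module,
# then rescan and re-mount.
umount /mnt
modprobe -r btrfs && modprobe btrfs
btrfs device scan
mount /dev/sda /mnt

# Catch the previously-missing device up with the current generation.
btrfs scrub start -B /mnt     # -B: run in the foreground
btrfs scrub status /mnt       # check for corrected errors
```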
>
> Btrfs tracks atomic filesystem updates via a monotonically increasing
> generation number, aka transaction-id (transid). When a device goes
> offline, its generation number of course gets stuck at the point it went
> offline, while the other devices continue to update their generation
> numbers.
>
> When a stale device is readded, btrfs should automatically find and use
> the device with the latest generation, but the old one isn't
> automatically caught up -- a scrub is the mechanism by which you do this.
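You can see the per-device generation directly in the superblock
dump; a stale device shows a lower number (device paths are examples):

```shell
# 'generation' is the superblock field holding the transid at the
# device's last successful commit.
btrfs inspect-internal dump-super /dev/sda | grep '^generation'
btrfs inspect-internal dump-super /dev/sdb | grep '^generation'
```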
>
> One thing you do **NOT** want to do is degraded-writable mount one
> device, then the other device, of a raid1 pair, because that'll diverge
> the two with new data on each, and that's no longer simple to correct.
> If you /have/ to degraded-writable mount a raid1, always make sure it's
> the same one mounted writable if you want to combine them again. If you
> /do/ need to recombine two diverged raid1 devices, the only safe way to
> do so is to wipe the one so btrfs has only the one copy of the data to go
> on, and add the wiped device back as a new device.
btrfs: handle volume split brain scenario
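Until that patch lands, the safe recombination Duncan describes is
roughly this sequence (device names are examples; keep /dev/sda as
the copy you trust, and note wipefs destroys the other copy):

```shell
# Wipe the diverged device so btrfs no longer recognizes it as a
# member of the filesystem.
wipefs -a /dev/sdb

# Mount the surviving copy and re-add the wiped device as new.
mount -o degraded /dev/sda /mnt
btrfs device add /dev/sdb /mnt

# Rebuild full raid1 redundancy onto the re-added device.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```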
> Meanwhile, until /very/ recently... 4.13 may not be current enough... if
> you mounted a two-device raid1 degraded-writable, btrfs would try to
> write and note that it couldn't do raid1 because there wasn't a second
> device, so it would create single chunks to write into.
>
> And the older filesystem safe-mount mechanism would see those single
> chunks on a raid1 and decide it wasn't safe to mount the filesystem
> writable at all after that, even if all the single chunks were actually
> present on the remaining device.
>
> The effect was that if a device died, you had exactly one degraded-
> writable mount to replace it successfully. If you didn't complete the
> replace in that single chance writable mount, the filesystem would refuse
> to mount writable again, and thus it was impossible to repair the
> filesystem since that required a writable mount and that was no longer
> possible! Fortunately the filesystem could still be mounted degraded-
> readonly (unless there was some other problem), allowing people to at
> least get at the read-only data to copy it elsewhere.
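So on those older kernels the replace had to be completed within the
one degraded-writable mount, roughly like this (devid 2 standing in
for the missing device; check the real devid with 'btrfs filesystem
show' first):

```shell
# The one writable chance: mount degraded and run the replace to
# completion in the same session.
mount -o degraded /dev/sda /mnt
btrfs replace start -B 2 /dev/sdc /mnt   # -B: wait until finished
btrfs replace status /mnt
```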
>
> With a new enough btrfs, while btrfs will still create those single
> chunks on a degraded-writable mount of a raid1, it's at least smart
> enough to do per-chunk checks to see if they're all available on existing
> devices (none only on the missing device), and will continue to allow
> degraded-writable mounting if so.
(v4.14)
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
> But once the filesystem is back to multi-device (with writable space on
> at least two devices), a balance-convert of those single chunks to raid1
> should be done, otherwise if the device with them on it goes...
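That balance-convert only needs to touch the chunks that were written
as single, which the 'soft' filter takes care of (mount path is an
example):

```shell
# Convert only chunks not already at the target profile back to
# raid1; 'soft' skips chunks that are already raid1.
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
btrfs filesystem df /mnt    # verify no 'single' chunks remain
```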
>
> And there's work on allowing it to do only single-copy, thus incomplete-
> raid1, chunk writes as well. This should prevent the single mode chunks
> entirely, thus eliminating the need for the balance-convert, tho a scrub
> would still be needed to fully sync back up. But I'm not sure what the
> status is on that.
btrfs: create degraded-RAID1 chunks
(Patch is still WIP; there is a good workaround.)
> Meanwhile, as mentioned above, there's active work on proper dynamic
> btrfs device tracking and management.
btrfs: Introduce device pool sysfs attributes
(needs revival)
> It may or may not be ready for
> 4.16, but once it goes in, btrfs should properly detect a device going
> away and react accordingly, and it should detect a device coming back as
> a different device too. As I write this it occurs to me that I've not
> read close enough to know if it actually initiates scrub/resync on its
> own in the current patch set, but that's obviously an eventual goal if
> not.
Right. It doesn't as of now; it's on my list of things to fix.
> Longer term, there's further patches that will provide a hot-spare
> functionality, automatically bringing in a device pre-configured as a hot-
> spare if a device disappears, but that of course requires that btrfs
> properly recognize devices disappearing and coming back first, so one
> thing at a time. Tho as originally presented, that hot-spare
> functionality was a bit limited -- it was a global hot-spare list, and
> with multiple btrfs of different sizes and multiple hot-spare devices
> also of different sizes, it would always just pick the first spare on the
> list for the first btrfs needing one, regardless of whether the size was
> appropriate for that filesystem or not. By the time the feature actually
> gets merged it may have changed some, and regardless, it should
> eventually get less limited, but that's _eventually_, with a target time
> likely still in years, so don't hold your breath.
hah.
- It's not that difficult to pick a suitably sized disk from the
global hot spare list.
- A CLI can show which fsid/volume a global hot spare is a
candidate replacement for.
- The auto-replace priority can be decided at the fsid/volume end, or
we could still dedicate a global hot spare device to a fsid/volume.
Related patches (needs revival):
btrfs: block incompatible optional features at scan
btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
btrfs: add check not to mount a spare device
btrfs: support btrfs dev scan for spare device
btrfs: provide framework to get and put a spare device
btrfs: introduce helper functions to perform hot replace
btrfs: check for failed device and hot replace
> I think that answers most of your questions. Basically, you have to be
> quite careful with btrfs raid1 today, as btrfs simply doesn't have the
> automated functionality to handle it yet. It's still possible to do two-
> device-only raid1 and replace a failed device when you're down to one,
> but it's not as easy or automated as more mature raid options such as
> mdraid, and you do have to keep on top of it as a result. But it can and
> does work reasonably well for those (like me) who use btrfs raid1 as
> their "daily driver", as long as you /do/ keep on top of it... and don't
> try to use raid1 as a replacement for real backups, because it's *not* a
> backup! =:^)
>
Thanks, Anand