From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: BTRFS with RAID1 cannot boot when removing drive
Date: Mon, 10 Feb 2014 03:34:49 +0000 (UTC)
Message-ID: <pan$c13e9$73e2c595$53adb8c8$58509a35@cox.net>
In-Reply-To: <20140209224055.3175e70f@system>
Saint Germain posted on Sun, 09 Feb 2014 22:40:55 +0100 as excerpted:
> I am experimenting with BTRFS and RAID1 on my Debian Wheezy (with
> backported kernel 3.12-0.bpo.1-amd64) using a a motherboard with UEFI.
My systems don't do UEFI, but I do run GPT partitions and use grub2 for
booting, with grub2-core installed to a BIOS/reserved type partition
(instead of as an EFI service as it would be with UEFI). And I have a
btrfs two-device raid1 root filesystem working fine here, tested bootable
with only one of the two devices available.
So while I can't help you directly with UEFI, I know the rest of it can/
does work.
One more thing: I do have a (small) separate btrfs /boot, actually two
of them, as I set up a separate /boot on each of the two devices in order
to have a backup /boot. Grub can only point to one /boot by default, and
while pointing to another from grub's rescue mode is possible, I didn't
want to have to deal with that if the first /boot was corrupted, as it's
easier to simply point the BIOS at a different drive entirely and load
its (independently installed and configured) grub and /boot.
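In case it helps, the rough recipe for that dual /boot setup, under the
assumption that each drive has its own small /boot partition plus a
BIOS-boot (EF02) partition for grub2-core, is something like the below.
Partition numbers and the /mnt/boot2 mountpoint are purely illustrative,
not your layout:

  # primary /boot already mounted at /boot; mount the backup copy
  mount /dev/sdb5 /mnt/boot2
  cp -a /boot/. /mnt/boot2/
  # install grub to each drive, each pointed at its own /boot
  grub-install --boot-directory=/boot /dev/sda
  grub-install --boot-directory=/mnt/boot2 /dev/sdb

After that the BIOS can be pointed at either drive and each will load its
own independently configured grub and /boot.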
But grub2's btrfs module reads raid1 mode just fine; I can access files
on the btrfs raid1 rootfs directly from grub without issue, so
that's not a problem.
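(You can check that yourself from the grub command line: press 'c' at the
grub menu and try something like

  grub> ls
  grub> ls (hd0,gpt1)/
  grub> cat (hd0,gpt1)/etc/fstab

assuming (hd0,gpt1) is one of the rootfs component partitions; if grub
can list and cat files there, its btrfs raid1 support is working.)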
But I strongly suspect I know what is... and it's a relatively easy fix.
See below. =:^)
> However I haven't managed to make the system boot when the removing the
> first hard drive.
>
> I have installed Debian with the following partition on the first hard
> drive (no BTRFS subsystem):
> /dev/sda1: for / (BTRFS)
> /dev/sda2: for /home (BTRFS)
> /dev/sda3: for swap
>
> Then I added another drive for a RAID1 configuration (with btrfs
> balance) and I installed grub on the second hard drive with
> "grub-install /dev/sdb".
Just for clarification as you don't mention it specifically, altho your
btrfs filesystem show information suggests you did it this way, are your
partition layouts identical on both drives?
That's what I've done here, and I definitely find that easiest to manage
and even just to think about, tho it's certainly not a requirement. But
using different partition layouts does significantly increase management
complexity, so it's useful to avoid if possible. =:^)
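If you do want identical layouts, one easy way on GPT is sgdisk from the
gdisk package, roughly as below. Note that this overwrites the
destination disk's partition table, and it's easy to get source and
destination backwards, so double-check against the manpage first:

  sgdisk -R=/dev/sdb /dev/sda   # replicate sda's partition table onto sdb
  sgdisk -G /dev/sdb            # then randomize sdb's disk/partition GUIDs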
> If I boot on sdb, it takes sda1 as the root filesystem
> If I switched the cable, it always take the first hard drive as
> the root filesystem (now sdb)
That's normal /appearance/, but that /appearance/ doesn't fully reflect
reality.
The problem is that mount output (and /proc/self/mounts), fstab, etc,
were designed with single-device filesystems in mind, and multi-device
btrfs has to be made to fit the existing rules as best it can.
So what's actually happening is that for a btrfs composed of multiple
devices, since there's only one "device slot" for the kernel to list
devices, it only displays the first one it happens to come across, even
tho the filesystem will normally (unless degraded) require that all
component devices be available and logically assembled into the
filesystem before it can be mounted.
When you boot on sdb, naturally, the sdb component of the multi-device
filesystem is the one the kernel finds first, so it's the one listed,
even tho the filesystem is actually composed of more devices, not just
that one. When
you switch the cables, the first one is, at least on your system, always
the first device component of the filesystem detected, so it's always the
one occupying the single device slot available for display, even tho the
filesystem has actually assembled all devices into the complete
filesystem before mounting.
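You can see the two views side by side. Schematically (the sizes and
usage figures here are simply made up for the example, and the exact
output format varies a bit between btrfs-progs versions):

  $ grep ' / ' /proc/self/mounts
  /dev/sda1 / btrfs rw,relatime,space_cache 0 0

  # btrfs filesystem show /dev/sda1
  Label: none  uuid: c64fca2a-5700-4cca-abac-3a61f2f7486c
          Total devices 2 FS bytes used 10.00GB
          devid    1 size 100.00GB used 12.00GB path /dev/sda1
          devid    2 size 100.00GB used 12.00GB path /dev/sdb1

The mount table only has room to name the one device, but filesystem
show lists both components of the same filesystem.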
> If I disconnect /dev/sda, the system doesn't boot with a message saying
> that it hasn't found the UUID:
>
> Scanning for BTRFS filesystems...
> mount: mounting /dev/disk/by-uuid/c64fca2a-5700-4cca-abac-3a61f2f7486c
> on /root failed: Invalid argument
>
> Can you tell me what I have done incorrectly ?
> Is it because of UEFI ? If yes I haven't understood how I can correct it
> in a simple way.
As you haven't mentioned it and the grub config below doesn't mention it
either, I'm almost certain that you're simply not aware of the "degraded"
mount option, and when/how it should be used.
And if you're not aware of that, chances are you're not aware of the
btrfs wiki, and the multitude of other very helpful information it has
available. I'd suggest you spend some time reading it, as it'll very
likely save you quite a few btrfs administration questions and headaches
down the road, as you continue to work with btrfs.
Bookmark it and refer to it often! =:^)
https://btrfs.wiki.kernel.org
(Click on the guides and usage information in contents under section 5,
documentation.)
Here's the mount options page. Note that the kernel btrfs documentation
also includes mount options:
https://btrfs.wiki.kernel.org/index.php/Mount_options
$KERNELDIR/Documentation/filesystems/btrfs.txt
You should be able to mount a two-device btrfs raid1 filesystem with only
a single device with the degraded mount option, tho I believe current
kernels refuse a read-write mount in that case, so you'll have read-only
access until you btrfs device add a second device, at which point it can
do normal raid1 mode once again.
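In command terms, the recovery side of that would look roughly like the
following, with device names per your layout above and assuming the
replacement second drive gets a matching sdb1 partition (the balance
re-mirrors the existing data onto it):

  mount -o degraded /dev/sda1 /mnt
  btrfs device add /dev/sdb1 /mnt
  # if the old device is permanently gone, also drop its stale entry:
  btrfs device delete missing /mnt
  btrfs balance start /mnt

If the degraded mount does come up read-only on your kernel, a
'mount -o remount,rw /mnt' may (or may not) be needed and allowed before
the device add will work.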
As for getting the degraded rootfs mounted at boot time in the first
place: from grub, edit the kernel commandline, setting
rootflags=degraded. The kernel rootflags parameter is the method by
which such mount options are passed.
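Concretely: at the grub menu highlight the normal entry, press 'e', find
the line beginning 'linux', and append the flag, so it ends up looking
roughly like this (the UUID is yours from the error above; the kernel
path and other options are whatever your entry already contains):

  linux /boot/vmlinuz-3.12-0.bpo.1-amd64 root=UUID=c64fca2a-5700-4cca-abac-3a61f2f7486c ro rootflags=degraded

then boot the edited entry with Ctrl-X or F10. That affects only the one
boot, which is exactly what you want for a temporary degraded situation.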
Meanwhile, since the degraded mount-opt is in fact a no-op if btrfs can
actually find all components of the filesystem, some people choose to
simply add degraded to their standard mount options (edit the grub config
to add it at every boot), so they don't have to worry about it. However,
that is NOT RECOMMENDED, as the accepted wisdom is that the failure to
mount undegraded serves as a warning to the sysadmin that something VERY
WRONG is happening, and that they need to fix it. They can then add
degraded temporarily if they wish, in order to get the filesystem to
mount and thus be able to boot, but adding the option routinely at every
boot bypasses this important warning, and it's all too likely that an
admin will thus ignore the problem (or not know about it at all) until
too late.
Altho if it is indeed true that btrfs will now refuse to mount writable
if it's degraded like that, that's not such a huge issue either, as the
read-only mount can serve as the same warning. Still, I certainly prefer
the refusal to mount entirely without the degraded option, if indeed the
filesystem is lacking a component device. There's nothing quite like
forcing me to actually type in "rootflags=degraded" to rub my face in the
reality and gravity of the situation I'm in! =:^)
...
That should answer your immediate question, but do read up on the wiki.
In addition to much of the FAQ, you'll want to read the sysadmin guide
page, particularly the raid and data duplication section, and the
multiple devices page, since they're directly apropos to btrfs multi-
device raid modes. You'll probably want to read the problem FAQ and
gotchas pages just for the heads-up, and likely at least the raid
section of the use cases page as well.
Meanwhile, I don't believe it's on the wiki, but it's worth noting my
experience with btrfs raid1 mode in my pre-deployment tests. Actually,
with the (I believe) mandatory read-only mount if raid1 is degraded below
two devices, this problem's going to be harder to run into than it was in
my testing several kernels ago, but here's what I found:
What I did was writable-degraded-mount first one of the btrfs raid1 pair,
then the other (with the other one offline in each case), and change a
test file with each mount, so that the two copies were different, and
neither one the same as the original file. Then I remounted the
filesystem with both devices once again, to see what would happen.
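In terms of actual commands, the test was essentially this (device names
illustrative, and note that it deliberately creates the diverged-copies
mess described below, so don't try it on data you care about):

  # with sdb unplugged:
  mount -o degraded,rw /dev/sda1 /mnt
  echo 'version A' > /mnt/testfile
  umount /mnt
  # then with sda unplugged and sdb back in:
  mount -o degraded,rw /dev/sdb1 /mnt
  echo 'version B' > /mnt/testfile
  umount /mnt
  # finally, with both plugged in, a normal undegraded mount:
  mount /dev/sda1 /mnt
  cat /mnt/testfile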
Based on my previous history with mdraid and how I knew it to behave, I
expected some note in the log about the two devices having unmatched
write generation and possibly an automated resync to catch the one back
up to the other, or alternatively, dropping the one from the mount and
requiring me to do some sort of manual sync (tho I really didn't know
what sort of btrfs command I'd use for that, but this was pre-deployment
testing and I was experimenting with the intent of finding this sort of
thing out!).
That's *NOT* what I got!
What I got was NO warnings, simply one of the two new versions displayed
when I catted the file. I'm not sure whether which copy it showed was
random or deterministic, but that I didn't get any warning at all was
certainly unsettling to me.
Then I unmounted and unplugged the one with that version of the file, and
remounted degraded again, to check if the other copy had been silently
updated. It was exactly as it had been, so the copies were still
different.
What I'd do after that today, were I redoing this test, would be either a
scrub or a balance, which would presumably find and correct the
difference. However, back then I didn't know enough about what I was
doing to try that, so I still don't actually know how or whether the
difference would have been detected and corrected.
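For the record, the commands I'd reach for today, on the normally
(undegraded) mounted filesystem, would be:

  btrfs scrub start /mnt
  btrfs scrub status /mnt
  # or alternatively a full rewrite via balance:
  btrfs balance start /mnt

but again, I never actually verified that either one cleans up that
particular situation.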
My takeaway from that test was not to play around with degraded
writable mounts too much, and for SURE if I did, to take care that if I
was to write-mount one and ever intended to bring back the other one, I
should be sure it was always the same one I was write-mounting and
updating, so only one would be changed and it'd always be clear which
copy was the newest. (Btrfs behavior on this point has since been
confirmed by a dev: btrfs tracks write generations and will always take
the copy with the higher write generation if there's a difference. If
the write generations happened to be the same, however, as I understood
him, it'd depend on which one the kernel happened to find first. So
always making sure the same one was written to was and remains a good
idea, so different writes don't get done to different devices, with some
of those writes dropped when they're recombined in an undegraded mount.)
And if there was any doubt, the best action would be to wipe (or trim/
discard, my devices are SSD so that's the simplest option) the one
filesystem, and btrfs device add and btrfs balance back to it from the
other exactly as if it were a new device, rather than risk not knowing
which of the two differing versions btrfs would end up with.
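In command terms that's essentially the same device add and balance
dance as above, just preceded by wiping the stale copy (device names
illustrative again; blkdiscard for SSDs, wipefs otherwise):

  blkdiscard /dev/sdb1              # or: wipefs -a /dev/sdb1
  mount -o degraded /dev/sda1 /mnt  # the copy you're keeping
  btrfs device add /dev/sdb1 /mnt
  btrfs device delete missing /mnt  # drop the wiped copy's stale entry, if listed as missing
  btrfs balance start /mnt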
But as I said, if btrfs only allows read-only mounts of filesystems
without enough devices to properly complete the raidlevel, that shouldn't
be as big an issue these days, since it should be more difficult or
impossible to get the two devices separately mounted writable in the
first place, with the consequence that the differing-copies issue will be
correspondingly difficult or impossible to trigger. =:^)
But that's still a very useful heads-up for anyone using btrfs in raid1
mode to know about, particularly when they're working with degraded mode,
just to keep the possibility in mind and be safe with their manipulations
to avoid it... unless of course they're testing exactly the same sort of
thing I was. =:^)
> As extra question, I don't see also how I can configure the system to
> get the correct swap in case of disk failure. Should I force both swap
> partition to have the same UUID ?
No. There's no harm in having multiple swap entries in fstab, so simply
add a separate fstab entry for each swap partition. That way, if both
are available, they'll both be activated by the usual swapon -a. If only
one's available, it'll be activated.
(You may want to use the fstab nofail option, described in the fstab (5)
manpage and mentioned under --ifexists in the swapon (8) manpage as well,
if your distro's swap initialization doesn't already use --ifexists, to
prevent a warning if one of the swaps doesn't actually exist because
it's on a disconnected drive.)
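Something like this in fstab, using the sda3/sdb3 swap partitions from
your layout (in real use UUID= or LABEL= identifiers would be safer than
raw device names, since names can shift when a drive disappears):

  /dev/sda3  none  swap  sw,nofail  0  0
  /dev/sdb3  none  swap  sw,nofail  0  0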
As a nice variant, consider setting swap priorities (the pri=n fstab
option, or swapon --priority), as detailed in the swapon (2,8) manpages.
The kernel defaults to negative priorities, with each successively
activated swap getting a lower (further negative) priority than the
others, but you can specify positive swap priorities as
well. Higher priority swap is used first, so if you want one swap used
before another, set its priority higher.
But what REALLY makes swap priorities useful is the fact that if two
swaps have equal priority, the kernel will automatically raid0-stripe
swap between them, effectively multiplying your swap speed! =:^) Since
spinning rust is so slow, that can help significantly if your drives are
spinning rust and you're actually using swap, especially when using it
heavily (thrashing).
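So to get that striping, just give both fstab entries the same explicit
priority, for instance:

  /dev/sda3  none  swap  sw,nofail,pri=10  0  0
  /dev/sdb3  none  swap  sw,nofail,pri=10  0  0

and afterwards swapon -s (or cat /proc/swaps) should show both active at
priority 10.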
With ssds the raid0 effect of equal swap priorities should still be
noticeable, tho they're typically enough faster than spinning rust that
actively swapping to just one isn't the huge drag it is on spinning
rust. And given the limited write-cycles of ssds, you may instead wish
to set one a higher swap priority than the other just so it gets used
more, sparing the other.
[Some of the outputs also posted were useful, particularly in implying
that you had identical partition layout on each device as well as to
verify that you weren't already using the degraded mount option, but I've
snipped them as too much to include here.]
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman