RAID1 failure and recovery

linux-btrfs.vger.kernel.org archive mirror

* RAID1 failure and recovery
@ 2014-09-12  8:57 shane-kernel
  2014-09-12 10:47 ` Hugo Mills
  2014-09-12 11:11 ` Duncan
  0 siblings, 2 replies; 6+ messages in thread
From: shane-kernel @ 2014-09-12  8:57 UTC
  To: linux-btrfs

Hi,

I am testing BTRFS in a simple RAID1 environment. Default mount options and data and metadata are mirrored between sda2 and sdb2. I have a few questions and a potential bug report. I don't normally have console access to the server so when the server boots with 1 of 2 disks, the mount will fail without -o degraded. Can I use -o degraded by default to force mounting with any number of disks? This is the default behaviour for linux-raid so I was rather surprised when the server didn't boot after a simulated disk failure.

So I pulled sdb to simulate a disk failure. The kernel oops'd but did continue running. I then rebooted encountering the above mount problem. I re-inserted the disk and rebooted again and BTRFS mounted successfully. However, I am now getting warnings like:
BTRFS: read error corrected: ino 1615 off 86016 (dev /dev/sda2 sector 4580382824)
I take it there were writes to SDA and sdb is out of sync. Btrfs is correcting sdb as it goes but I won't have redundancy until sdb resyncs completely. Is there a way to tell btrfs that I just re-added a failed disk and to go through and resync the array as mdraid would do? I know I can do a btrfs fi resync manually but can that be automated if the array goes out of sync for whatever reason (power failure)...

Finally for those using this sort of setup in production, is running btrfs on top of mdraid the way to go at this point?

Cheers,
Shane

* Re: RAID1 failure and recovery
  2014-09-12  8:57 RAID1 failure and recovery shane-kernel
@ 2014-09-12 10:47 ` Hugo Mills
  2014-09-14  3:15   ` Piotr Pawłow
  2014-09-12 11:11 ` Duncan
  1 sibling, 1 reply; 6+ messages in thread
From: Hugo Mills @ 2014-09-12 10:47 UTC
  To: shane-kernel; +Cc: linux-btrfs

On Fri, Sep 12, 2014 at 01:57:37AM -0700, shane-kernel@csy.ca wrote:
> Hi,

> I am testing BTRFS in a simple RAID1 environment. Default mount
> options and data and metadata are mirrored between sda2 and sdb2. I
> have a few questions and a potential bug report. I don't normally
> have console access to the server so when the server boots with 1 of
> 2 disks, the mount will fail without -o degraded. Can I use -o
> degraded by default to force mounting with any number of disks? This
> is the default behaviour for linux-raid so I was rather surprised
> when the server didn't boot after a simulated disk failure.

   The problem with that is that at the moment, you don't get any
notification that anything's wrong when the system boots. As a result,
using -odegraded as a default option is not generally recommended.

> So I pulled sdb to simulate a disk failure. The kernel oops'd but
> did continue running. I then rebooted encountering the above mount
> problem. I re-inserted the disk and rebooted again and BTRFS mounted
> successfully. However, I am now getting warnings like: BTRFS: read
> error corrected: ino 1615 off 86016 (dev /dev/sda2 sector
> 4580382824)
> 
> I take it there were writes to SDA and sdb is out of sync. Btrfs is
> correcting sdb as it goes but I won't have redundancy until sdb
> resyncs completely. Is there a way to tell btrfs that I just
> re-added a failed disk and to go through and resync the array as
> mdraid would do? I know I can do a btrfs fi resync manually but can
> that be automated if the array goes out of sync for whatever reason
> (power failure)...

   I've done this before, by accident (pulled the wrong drive,
reinserted it). You can fix it by running a scrub on the device (btrfs
scrub start /dev/ice, I think).
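
For reference, a minimal sketch of that scrub. It assumes the filesystem is mounted at /mnt (a hypothetical mount point; btrfs scrub start accepts either a mount point or a member device). The run helper and the DRY_RUN variable are illustrative only and not part of btrfs-progs; by default the script just prints the commands it would run.

```sh
#!/bin/sh
# Sketch: resync a re-added mirror by scrubbing the filesystem.
# MNT is a hypothetical mount point; set DRY_RUN=0 to really execute.
MNT=${MNT:-/mnt}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

resync_mirror() {
    # -B: stay in the foreground until the scrub has finished
    # -d: print separate statistics for each device
    run btrfs scrub start -B -d "$1"
    # show the per-device error counters afterwards
    run btrfs device stats "$1"
}

resync_mirror "$MNT"
```

Run as root with DRY_RUN=0 once the re-inserted disk is visible again. Note that scrub can only repair blocks that carry checksums.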

> Finally for those using this sort of setup in production, is running
> btrfs on top of mdraid the way to go at this point?

   Using btrfs native RAID means that you get independent checksums on
the two copies, so that where the data differs between the copies, the
correct data can be identified.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- SCSI is usually fixed by remembering that it needs three ---     
        terminations: One at each end of the chain. And the goat.        
                                                                         

* Re: RAID1 failure and recovery
  2014-09-12 10:47 ` Hugo Mills
@ 2014-09-14  3:15   ` Piotr Pawłow
  2014-09-14  4:44     ` Hugo Mills
  0 siblings, 1 reply; 6+ messages in thread
From: Piotr Pawłow @ 2014-09-14  3:15 UTC
  To: linux-btrfs

On 12.09.2014 12:47, Hugo Mills wrote:
> I've done this before, by accident (pulled the wrong drive, reinserted
> it). You can fix it by running a scrub on the device (btrfs scrub
> start /dev/ice, I think).

I'd like to remind everyone that btrfs has weak checksums. It may be 
good for correcting an occasional error, but I wouldn't trust it to 
correct larger amounts of data.

Additionally, nocow files are not checksummed. They will not be 
corrected and may return good data or random garbage, depending on which 
mirror is accessed.

Below is a test I did some time ago, demonstrating the problem with 
nocow files:

#!/bin/sh
MOUNT_DIR=mnt
DISK1=d1
DISK2=d2
SIZE=2G
# create raid1 FS
mkdir $MOUNT_DIR
truncate --size $SIZE $DISK1
truncate --size $SIZE $DISK2
L1=$(losetup --show -f $DISK1)
L2=$(losetup --show -f $DISK2)
mkfs.btrfs -d raid1 -m raid1 $L1 $L2
mount $L1 $MOUNT_DIR
# enable NOCOW
chattr +C $MOUNT_DIR
umount $MOUNT_DIR
# fail the second drive
losetup -d $L2
mount $L1 $MOUNT_DIR -odegraded
# file must be large enough to not get embedded inside metadata
perl -e 'print "Test OK.\n"x4096' >$MOUNT_DIR/testfile
umount $MOUNT_DIR
# reattach the second drive
L2=$(losetup --show -f $DISK2)
mount $L1 $MOUNT_DIR
# let's see what we get - correct data or garbage?
cat $MOUNT_DIR/testfile
# clean up
umount $MOUNT_DIR
losetup -d $L1
losetup -d $L2
rm $DISK1 $DISK2
rmdir $MOUNT_DIR

* Re: RAID1 failure and recovery
  2014-09-14  3:15   ` Piotr Pawłow
@ 2014-09-14  4:44     ` Hugo Mills
  2014-09-14 14:53       ` Piotr Pawłow
  0 siblings, 1 reply; 6+ messages in thread
From: Hugo Mills @ 2014-09-14  4:44 UTC
  To: Piotr Pawłow; +Cc: linux-btrfs

On Sun, Sep 14, 2014 at 05:15:08AM +0200, Piotr Pawłow wrote:
> On 12.09.2014 12:47, Hugo Mills wrote:
> >I've done this before, by accident (pulled the wrong drive, reinserted
> >it). You can fix it by running a scrub on the device (btrfs scrub
> >start /dev/ice, I think).
> 
> I'd like to remind everyone that btrfs has weak checksums. It may be good
> for correcting an occasional error, but I wouldn't trust it to correct
> larger amounts of data.

   Checksums are done for each 4k block, so the increase in
probability of a false negative is purely to do with the sheer volume
of data. "Weak" checksums like the CRC32 that btrfs currently uses are
indeed poor for detecting malicious targeted attacks on the data, but
for random failures, such as a disk block being unreadable and
returning zeroes or having bit errors, the odds of identifying the
failure are still excellent.

> Additionally, nocow files are not checksummed. They will not be corrected
> and may return good data or random garbage, depending on which mirror is
> accessed.

   Yes, this is a trade-off that you have to make for your own
use-case and happiness. For some things (like a browser cache), I'd be
happy with losing the checksums. For others (e.g. mail), I wouldn't be.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Hey, Virtual Memory! Now I can have a *really big* ramdisk! ---   

* Re: RAID1 failure and recovery
  2014-09-14  4:44     ` Hugo Mills
@ 2014-09-14 14:53       ` Piotr Pawłow
  0 siblings, 0 replies; 6+ messages in thread
From: Piotr Pawłow @ 2014-09-14 14:53 UTC
  To: linux-btrfs

On 14.09.2014 06:44, Hugo Mills wrote:
>> I've done this before, by accident (pulled the wrong drive, reinserted
>> it). You can fix it by running a scrub on the device (btrfs scrub
>> start /dev/ice, I think).
> Checksums are done for each 4k block, so the increase in probability 
> of a false negative is purely to do with the sheer volume of data. 
> "Weak" checksums like the CRC32 that btrfs currently uses are indeed 
> poor for detecting malicious targeted attacks on the data, but for 
> random failures, such as a disk block being unreadable and returning 
> zeroes or having bit errors, the odds of identifying the failure are 
> still excellent. 

I don't require "probably the universe will end sooner" kind of odds, 
but I would at least like "better than winning the lottery" odds. Once 
there are thousands of blocks to fix, the odds aren't that great: 1 / 
2^32 * 10 000 =~ 1 / 430 000
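
The arithmetic above can be checked with a few lines of shell. This is only a sketch under the same simplifying assumption as the estimate itself: each bad block has an independent 1 in 2^32 chance of matching its 32-bit checksum by accident, so for N bad blocks the chance is roughly N / 2^32. The function name one_in is made up for illustration.

```sh
#!/bin/sh
# Rough odds that at least one of N bad blocks passes a 32-bit checksum,
# approximated as N / 2^32 and printed as "1 in X" (integer division).
one_in() {
    blocks=$1
    echo $(( 4294967296 / blocks ))    # 4294967296 = 2^32
}

echo "10000 bad blocks: about 1 in $(one_in 10000)"
```

Integer division gives 1 in 429496, which rounds to the 1 / 430 000 quoted above.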

I wouldn't feel confident enough to add the disk back and let btrfs fix 
it. I'd rather wipe the FS on it and do the "replace missing".
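
A sketch of what that could look like, with the caveat that the exact commands are an assumption: wipefs (from util-linux) clears the old signature, the filesystem is mounted degraded from the good disk, and btrfs replace start is given the devid of the now-missing device. The device names, the mount point and devid 2 are hypothetical (btrfs filesystem show lists the real devids), and the run helper with its DRY_RUN default only prints the commands.

```sh
#!/bin/sh
# Sketch only: device names, mount point and devid are hypothetical.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# replace_stale GOOD_DEV STALE_DEV MISSING_DEVID MOUNT_POINT
replace_stale() {
    good=$1 stale=$2 devid=$3 mnt=$4
    # destroy the stale filesystem signature so the old copy is never used
    run wipefs -a "$stale" &&
    # mount from the surviving disk only
    run mount -o degraded "$good" "$mnt" &&
    # rebuild the missing device (given by devid) onto the wiped disk
    run btrfs replace start "$devid" "$stale" "$mnt"
}

replace_stale /dev/sda2 /dev/sdb2 2 /mnt
```

Progress can then be followed with btrfs replace status on the mount point. Since wipefs -a is destructive, double-check which device is the stale one before setting DRY_RUN=0.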

>> Additionally, nocow files are not checksummed. They will not be 
>> corrected
>> and may return good data or random garbage, depending on which mirror is
>> accessed.
>     Yes, this is a trade-off that you have to make for your own
> use-case and happiness. For some things (like a browser cache), I'd be
> happy with losing the checksums.

The point is, if I add a drive with old contents back, I will probably 
have to delete all nocow files, because I'm not aware of any tool that 
can compare both mirrors and tell me which files are identical on both 
and which are different. Scrub will not detect them, as it works 
separately on each device and doesn't compare one mirror to the other.

If I don't delete nocow files, I may get intermittent failures, like my 
browser randomly not loading some pages, and wonder what's going on.

On a multi-user system, I risk exposing sensitive data to all users who 
have nocow files, or access to nocow files.

Thus I think this practice is bad and dangerous, and I would advise 
against it. I'd also like btrfs to reject devices with old content by 
default.

* Re: RAID1 failure and recovery
  2014-09-12  8:57 RAID1 failure and recovery shane-kernel
  2014-09-12 10:47 ` Hugo Mills
@ 2014-09-12 11:11 ` Duncan
  1 sibling, 0 replies; 6+ messages in thread
From: Duncan @ 2014-09-12 11:11 UTC
  To: linux-btrfs

shane-kernel posted on Fri, 12 Sep 2014 01:57:37 -0700 as excerpted:

[Last question first as it's easy to answer...]

> Finally for those using this sort of setup in production, is running
> btrfs on top of mdraid the way to go at this point?

While the latest kernel and btrfs-tools have removed the warnings, btrfs 
is still not yet fully stable and isn't really recommended for 
production.  Yes, certain distributions support it, but that's their 
support choice that you're buying from them, and if it all goes belly up, 
I guess you'll see what that money actually buys.  However, /here/ it's 
not really recommended yet.

That said, there are people doing it, and if you make sure you have 
suitable backups for the extent to which you're depending on the data on 
that btrfs and are willing to deal with the downtime or failover hassles 
if it happens...

Also, keeping current is important, particularly with the kernel, though 
the btrfs-progs userspace shouldn't get too outdated either, as is 
following this list to keep up with current status.  If you're running 
anything older than the latest kernel series without a specific reason, 
you're likely to be running without patches for the most recently 
discovered btrfs bugs.

There was a recent exception to the general latest-kernel rule in the 
form of a bug that only affected the kworker threads that btrfs 
switched to in 3.15, so 3.14 was unaffected, while it took all of 3.15 
and 3.16 to find and trace the bug.  3.17-rc3 got the fix, and I believe 
it's in the latest 3.16 stable as well.  But that's where staying current 
with the list comes in: once the bug became known on the list, there was 
a specific reason to run the older kernel.  So it was an exception to the 
general latest-kernel rule, but not to the way I put it above.

If you're unwilling to do that, then choose something other than btrfs.

But anyway, here's a direct answer to the question...

While btrfs on top of mdraid (or dmraid or...) in general works, it 
doesn't match up well with btrfs checksummed data integrity features.

Consider:  mdraid-1 writes to all devices, but reads from only one, 
without any checksumming or other data integrity measures.  If the copy 
mdraid-1 decides to read from is bad, unless the hardware actually 
reports it as bad, mdraid is entirely oblivious and will carry on as if 
nothing happened.  There's no checking the other copies to see that they 
match, no checksums or other verification, nothing.

Btrfs OTOH has checksumming and data verification.  With btrfs raid1, 
that verification means that if whatever copy btrfs happens to pull fails 
the verify, it can verify and pull from the second copy, overwriting the 
bad-checksum copy with a good-checksum copy.  BUT THAT ONLY HAPPENS IF IT 
HAS THAT SECOND COPY, AND IT ONLY HAS THAT SECOND COPY IN BTRFS RAID1 (or 
raid10 or for metadata, dup) MODE.

Now, consider what happens when btrfs data verification interacts with 
mdraid's lack of data verification.  If whatever copy mdraid pulls up is 
bad, it's going to fail the btrfs checksum and btrfs will reject it.  But 
because btrfs is on top of mdraid and mdraid is oblivious, there's no 
mechanism for btrfs to know that mdraid has other copies that may be just 
fine -- to btrfs, that copy is bad, period.  And unless btrfs has a 
second copy of its own, whether from btrfs raid1 or raid10 mode on top 
of mdraid or, for metadata, from dup mode, btrfs will simply return an 
error for that data, no second chance, because it knows nothing about 
the other copies mdraid has.

So while in general it works about as well as any other filesystem on top 
of mdraid, the interaction between mdraid's lack of data verification and 
btrfs' automated data verification is... unfortunate.

With that said, let's look at the rest of the post...

> I am testing BTRFS in a simple RAID1 environment. Default mount options
> and data and metadata are mirrored between sda2 and sdb2. I have a few
> questions and a potential bug report. I don't normally have console
> access to the server so when the server boots with 1 of 2 disks, the
> mount will fail without -o degraded. Can I use -o degraded by default to
> force mounting with any number of disks? This is the default behaviour
> for linux-raid so I was rather surprised when the server didn't boot
> after a simulated disk failure.

The idea here is that if a device is malfunctioning, the admin should 
have to take deliberate action to demonstrate knowledge of that fact 
before the filesystem will mount.  Btrfs isn't yet as robust in degraded 
mode as say mdraid, and important btrfs features like data validation and 
scrub are seriously degraded when that second copy is no longer there.  
In addition, btrfs raid1 mode requires that each of the two copies of a 
chunk be written to different devices, and once there's only a single 
device available, that can no longer happen, so unless behavior has 
changed recently, as soon as the currently allocated chunks get full, you 
get ENOSPC, even if there's lots of unallocated space left on the 
remaining device, because there's no second device available to allocate 
the second copy of a new data or metadata chunk on.

That said, some admins *DO* choose to add degraded to their default mount 
options, since it simply /lets/ btrfs mount in degraded mode, it doesn't 
FORCE it degraded if all devices show up.

If you want to be one of those admins you are of course free to do so.  
However, if btrfs breaks unexpectedly as a result, you get to keep the 
pieces. =:^)  It's something that some admins choose to do, but it's not 
recommended.

> So I pulled sdb to simulate a disk failure. The kernel oops'd but did
> continue running. I then rebooted encountering the above mount problem.
> I re-inserted the disk and rebooted again and BTRFS mounted
> successfully. However, I am now getting warnings like:
> BTRFS: read error corrected: ino 1615 off 86016 (dev /dev/sda2 sector
> 4580382824)
> I take it there were writes to SDA and sdb is out of sync. Btrfs is
> correcting sdb as it goes but I won't have redundancy until sdb resyncs
> completely. Is there a way to tell btrfs that I just re-added a failed
> disk and to go through and resync the array as mdraid would do? I know I
> can do a btrfs fi resync manually but can that be automated if the array
> goes out of sync for whatever reason (power failure)...

btrfs fi resync?  Do you mean btrfs scrub?  Because a scrub is the method 
normally used to check and fix such things.  A btrfs balance would also 
do it, but that rewrites the entire filesystem one chunk at a time, which 
isn't necessarily what you want to do.

To directly answer your question, however, no, btrfs does not have 
anything like mdraid's device re-add, with automatic resync.  Scrub comes 
the closest, verifying checksums and comparing transaction-id 
generations, but it's not run automatically.
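
Since scrub has to be started by hand, one way to approximate an automatic resync is to run it on a schedule. The sketch below is an assumption about how that might be scripted, not an existing btrfs feature: btrfs_mounts and scrub_all are made-up helper names, and the script only scrubs when called with --run (for example from a root cron job).

```sh
#!/bin/sh
# List the mount points of btrfs filesystems from a mounts table
# (default /proc/mounts; field 2 = mount point, field 3 = fs type).
# Assumes mount points contain no whitespace.
btrfs_mounts() {
    awk '$3 == "btrfs" { print $2 }' "${1:-/proc/mounts}" | sort -u
}

# Scrub each mount point in the foreground, one after another.
# A filesystem mounted in several places is scrubbed once per mount point.
scrub_all() {
    for mnt in $(btrfs_mounts "$@"); do
        btrfs scrub start -B "$mnt" || echo "scrub failed on $mnt" >&2
    done
}

# Only act when explicitly asked to, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    scrub_all
fi
```

This does not detect that a device dropped out and came back; it only bounds how long the copies can stay out of sync before a scrub gets to them.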

In fact, until very recently (so recently I'm not sure it has been fixed 
yet, although I know there has been discussion on the list), btrfs in 
the kernel wasn't really aware when a device dropped out, either.  It 
would still queue up the transactions and they'd simply back up.  And a 
device plugged back in after a degraded mount with devices missing 
wouldn't necessarily be detected either.  They're working on it; as I 
said there have been recent discussions, but I'm not sure the code is 
actually in mainline for that, yet.

As I said above, btrfs isn't really entirely stable yet.  This simply 
demonstrates the point.  It's also why it's so important that an admin 
know about a degraded mount and actually choose to do it, thus the reason 
adding degraded to the default mount options isn't recommended, since it 
bypasses that deliberate choice.

If a filesystem is deliberately mounted degraded, an admin will know it 
and be able to take equally deliberate action to fix it.  Once they 
actually have the physical replacement device in place, the next equally 
deliberate step is to initiate a btrfs scrub (if the device was re-added) 
or a btrfs replace.

Meanwhile, in the event of a stale device, the transaction-id generation 
is used to determine which version is current.  Be careful not to 
separately mount-degraded one device and then the other, so they've both 
had updates and diverged from the common origin and from each other.  In 
most cases that should work and the one with the highest transaction-id 
will be chosen, but based on my testing now several kernel versions ago 
when I first got into btrfs raid (so hopefully my experience is outdated 
and the result is a /bit/ better now), it's not something you want to 
tempt fate with in any case.  At a minimum, the result is likely to be 
confusing to /you/ even if the filesystem does the right thing.  So if 
that happens, be sure to always mount and update the same device, not 
alternating devices, until you again unify the copies with a scrub.  At 
least for my own usage, I decided that if for some reason I DID happen to 
accidentally use both copies separately, I was best off wiping the one 
and adding it back in as a new device, thus ensuring absolute 
predictability in which divergent copy actually got USED.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
