Strange behavior when replacing device on BTRFS RAID 5 array.

linux-btrfs.vger.kernel.org archive mirror

* Strange behavior when replacing device on BTRFS RAID 5 array.
@ 2016-06-27  3:57 Nick Austin
  2016-06-27  4:02 ` Nick Austin
  2016-06-27 21:12 ` Duncan
  0 siblings, 2 replies; 7+ messages in thread
From: Nick Austin @ 2016-06-27  3:57 UTC (permalink / raw)
  To: linux-btrfs

I have a 4 device BTRFS RAID 5 filesystem.

One of the member devices of this filesystem (sdr) had bad blocks, so I
decided to replace it.

(I don't have a copy of fi show from before the replace. :-/ )

I ran this command:
sudo btrfs replace start 4 /dev/sdw /mnt/newdata

Beforehand, I had to shrink /dev/sdr by ~250 MB since the replacement drive
was a tiny bit smaller.

Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): resizing devid 4
Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): new size for /dev/sdr is
6001175121920
Jun 25 17:27:45 frank kernel: BTRFS info (device sdr): dev_replace from /dev/sdr
 (devid 4) to /dev/sdw started

The replace started, all seemed well.

3 hours into the replace, sdr dropped off the SATA bus and was redetected
as sdx. Bummer, but shouldn't be fatal.

This event really seemed to throw BTRFS for a loop.

Jun 25 20:32:35 frank kernel: sd 10:0:19:0: device_block, handle(0x0019)
Jun 25 20:33:05 frank kernel: sd 10:0:19:0: device_unblock and setting
to running, handle(0x0019)
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] FAILED Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 25 20:33:05 frank kernel: blk_update_request: I/O error, dev sdr,
sector 1785876480
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing handle(0x0019),
sas_addr(0x500194000687e20e)
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing : enclosure
logical id(0x500194000687e23f), slot(14)
Jun 25 20:33:16 frank kernel: scsi 10:0:21:0: Direct-Access     ATA
  WL6000GSA12872E  1C01 PQ: 0 ANSI: 5

Here you can see btrfs seems to figure out sdr has become sdx (based on the
"dev /dev/sdx" entry showing up on the BTRFS warning lines).

Unfortunately, all remaining IO for the device formerly known as sdr results in
btrfs errors like the ones listed below. iostat confirms no IO on sdx.

Jun 25 20:33:17 frank kernel: sd 10:0:21:0: [sdx] Attached SCSI disk
...
Jun 25 20:33:20 frank kernel: scrub_handle_errored_block: 31983
callbacks suppressed
Jun 25 20:33:20 frank kernel: BTRFS warning (device sdr): i/o error at
logical 2742536544256 on dev /dev/sdx, sector 1786897488, root 5,
inode 222965, offset 296329216, length 4096, links 1 (path:
/path/to/file)
Jun 25 20:33:20 frank kernel: btrfs_dev_stat_print_on_error: 32107
callbacks suppressed

These messages continue for many hours as the replace continues running.

sudo btrfs replace status /mnt/newdata
Started on 25.Jun 17:27:45, finished on 26.Jun 12:48:22, 0 write errs,
0 uncorr. read errs

...
Jun 26 12:48:22 frank kernel: BTRFS warning (device sdr): lost page
write due to IO error on /dev/sdx  (Many, many of these)
Jun 26 12:48:22 frank kernel: BTRFS info (device sdr): dev_replace
from /dev/sdx (devid 4) to /dev/sdw finished

Great! /dev/sdx replaced by /dev/sdw!

Now let's confirm:

sudo btrfs fi show /mnt/newdata
Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
        Total devices 4 FS bytes used 8.07TiB
        devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
        devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
        devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
        devid    4 size 5.46TiB used 2.70TiB path /dev/sdx

Bummer, this doesn't look right.

sdx is still in the array (failing drive).

Additionally, /dev/sdw isn't listed at all! Worse still, it looks like the
array has lost redundancy (it has 8TiB of data, and that's the amount shown as
used divided by number of devices). It looks like it tried to add the new
device, but did a balance across all of them instead?

% sudo btrfs fi df /mnt/newdata
Data, RAID5: total=8.07TiB, used=8.06TiB
System, RAID10: total=80.00MiB, used=576.00KiB
Metadata, RAID10: total=12.00GiB, used=10.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
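
As a rough check on the redundancy worry above (a sketch under stated
assumptions, not a statement about btrfs internals): assuming RAID5 across 4
devices stores 3 data strips plus 1 parity strip per stripe, and RAID10 keeps
two copies, the per-device allocation implied by the fi df totals can be
estimated as follows. The function name and the even-spread assumption are
illustrative only.

```python
def expected_used_per_device_tib(data_tib, metadata_gib, system_mib, devices):
    """Estimate raw allocation per device from 'btrfs fi df' totals.

    Assumptions (not confirmed by the thread): data is RAID5 with one
    parity strip per stripe, metadata and system are RAID10 (two copies),
    and allocation is spread evenly over all devices.
    """
    data_raw = data_tib * devices / (devices - 1)   # add parity overhead
    metadata_raw = metadata_gib * 2 / 1024          # GiB -> TiB, two copies
    system_raw = system_mib * 2 / (1024 * 1024)     # MiB -> TiB, two copies
    return (data_raw + metadata_raw + system_raw) / devices


# Totals taken from the fi df output above; 4 devices from fi show.
estimate = expected_used_per_device_tib(8.07, 12.0, 80.0, 4)
print(f"{estimate:.2f} TiB per device")
```

Under these assumptions the estimate comes out at about 2.70 TiB per device,
which matches the "used" column in fi show. That would suggest the parity
overhead is still allocated, though it says nothing about whether the data on
devid 4 is valid.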

Any advice would be appreciated.

%  uname -a
Linux frank 4.5.5-201.fc23.x86_64 #1 SMP Sat May 21 15:29:49 UTC 2016 x86_64
x86_64 x86_64 GNU/Linux

% lsb_release
Description:    Fedora release 24 (Twenty Three)

% btrfs --version
btrfs-progs v4.4.1

* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27  3:57 Strange behavior when replacing device on BTRFS RAID 5 array Nick Austin
@ 2016-06-27  4:02 ` Nick Austin
  2016-06-27 17:29   ` Chris Murphy
  2016-06-27 21:12 ` Duncan
  1 sibling, 1 reply; 7+ messages in thread
From: Nick Austin @ 2016-06-27  4:02 UTC (permalink / raw)
  To: linux-btrfs

On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
> sudo btrfs fi show /mnt/newdata
> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>         Total devices 4 FS bytes used 8.07TiB
>         devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
>         devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
>         devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
>         devid    4 size 5.46TiB used 2.70TiB path /dev/sdx

It looks like fi show has bad data:

When I start heavy IO on the filesystem (running rsync -c to verify the data),
I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
 expected replacement.

I guess some metadata is messed up somewhere?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          25.19    0.00    7.81   28.46    0.00   38.54

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdg             437.00     75168.00      1792.00      75168       1792
sdl             443.00     76064.00      1792.00      76064       1792
sdm             438.00     75232.00      1472.00      75232       1472
sdw             443.00     75680.00      1856.00      75680       1856
sdx               0.00         0.00         0.00          0          0

* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27  4:02 ` Nick Austin
@ 2016-06-27 17:29   ` Chris Murphy
  2016-06-27 17:37     ` Austin S. Hemmelgarn
  2016-06-27 17:46     ` Chris Murphy
  0 siblings, 2 replies; 7+ messages in thread
From: Chris Murphy @ 2016-06-27 17:29 UTC (permalink / raw)
  To: Nick Austin; +Cc: Btrfs BTRFS

On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin <nick@smartaustin.com> wrote:
> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
>> sudo btrfs fi show /mnt/newdata
>> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>>         Total devices 4 FS bytes used 8.07TiB
>>         devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
>>         devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
>>         devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
>>         devid    4 size 5.46TiB used 2.70TiB path /dev/sdx
>
> It looks like fi show has bad data:
>
> When I start heavy IO on the filesystem (running rsync -c to verify the data),
> I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
>  expected replacement.
>
> I guess some metadata is messed up somewhere?
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           25.19    0.00    7.81   28.46    0.00   38.54
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdg             437.00     75168.00      1792.00      75168       1792
> sdl             443.00     76064.00      1792.00      76064       1792
> sdm             438.00     75232.00      1472.00      75232       1472
> sdw             443.00     75680.00      1856.00      75680       1856
> sdx               0.00         0.00         0.00          0          0

There are some reported bugs with 'btrfs replace' and raid56, but I
don't know the exact nature of those bugs, or when and how they manifest.
The recommended fallback is to use 'btrfs device add' and then 'btrfs
device delete', but you have other issues going on as well.

Devices dropping off and being renamed is something btrfs, in my
experience, does not handle well at all. The very fact the hardware is
dropping off and coming back is bad, so you really need to get that
sorted out as a prerequisite no matter what RAID technology you're
using.

First advice, make a backup. Don't change the volume further until
you've done this. Each attempt to make the volume healthy again
carries risks of totally breaking it and losing the ability to mount
it. So as long as it's mounted, take advantage of that. Pretend the
very next repair attempt will break the volume, and make your backup
accordingly.

Next is to decide to what degree you want to salvage this volume and
keep using Btrfs raid56 despite the risks, or whether you just want to
migrate it over to something like XFS on mdadm or LVM raid 5 as soon
as possible. Btrfs raid56 is still rather experimental, and some
things realized on the list in the last week in particular make it
not recommended, except for people willing to poke it with a stick
and learn how many more bodies can be found in the current
implementation.

There's also the obligatory notice that applies to all Linux software
raid implementations which is to discover if you have a very common
misconfiguration that enhances the chance of data loss if the volume
ever goes degraded and you need to rebuild with a new drive:

smartctl -l scterc <dev>
cat /sys/block/<dev>/device/timeout

The first value must be less than the second. Note the first value is
in deciseconds, the second is in seconds. And either 'unsupported' or
'unset' translates into a vague value that could be as high as 180
seconds.
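
The comparison described above can be sketched as a small check. This is a
minimal illustration, not a tool: the function names are hypothetical, the two
values are assumed to have been read by hand from the two commands, and the
180-second fallback is simply the worst case mentioned above.

```python
from typing import Optional

# Assumption taken from the text above: with SCT ERC unsupported or unset,
# the drive's internal recovery could take as long as 180 seconds.
ASSUMED_WORST_CASE_RECOVERY_S = 180.0


def drive_recovery_seconds(erc_deciseconds: Optional[int]) -> float:
    """Convert the SCT ERC value (deciseconds) to seconds.

    None or 0 stands for 'unsupported' or 'unset' in this sketch.
    """
    if erc_deciseconds is None or erc_deciseconds == 0:
        return ASSUMED_WORST_CASE_RECOVERY_S
    return erc_deciseconds / 10.0


def timeout_config_ok(erc_deciseconds: Optional[int],
                      kernel_timeout_s: int) -> bool:
    """True if the drive gives up before the kernel command timer fires."""
    return drive_recovery_seconds(erc_deciseconds) < kernel_timeout_s


# Example values only: ERC of 70 deciseconds (7.0 s) against a 30 s timer.
print(timeout_config_ok(70, 30))    # True
print(timeout_config_ok(None, 30))  # False: ERC unset, 180 s assumed
```

If the check fails, either the drive's ERC value needs lowering or the
kernel-side timeout needs raising, so that the first stays below the second.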

-- 
Chris Murphy

* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27 17:29   ` Chris Murphy
@ 2016-06-27 17:37     ` Austin S. Hemmelgarn
  2016-06-27 17:46     ` Chris Murphy
  1 sibling, 0 replies; 7+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-27 17:37 UTC (permalink / raw)
  To: Chris Murphy, Nick Austin; +Cc: Btrfs BTRFS

On 2016-06-27 13:29, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin <nick@smartaustin.com> wrote:
>> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
>>> sudo btrfs fi show /mnt/newdata
>>> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>>>         Total devices 4 FS bytes used 8.07TiB
>>>         devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
>>>         devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
>>>         devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
>>>         devid    4 size 5.46TiB used 2.70TiB path /dev/sdx
>>
>> It looks like fi show has bad data:
>>
>> When I start heavy IO on the filesystem (running rsync -c to verify the data),
>> I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
>>  expected replacement.
>>
>> I guess some metadata is messed up somewhere?
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           25.19    0.00    7.81   28.46    0.00   38.54
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdg             437.00     75168.00      1792.00      75168       1792
>> sdl             443.00     76064.00      1792.00      76064       1792
>> sdm             438.00     75232.00      1472.00      75232       1472
>> sdw             443.00     75680.00      1856.00      75680       1856
>> sdx               0.00         0.00         0.00          0          0
>
> There's reported some bugs with 'btrfs replace' and raid56, but I
> don't know the exact nature of those bugs, when or how they manifest.
> It's recommended to fallback to use 'btrfs add' and then 'btrfs
> delete' but you have other issues going on also.
One other thing to mention: if the device is failing, _always_ add '-r'
to the replace command line.  This tells it to avoid reading from the
device being replaced.  In raid1 or raid10 mode it will pull from the
other mirror; in raid5/6 mode it will recompute the block from parity
and compare it to the stored checksums, which in turn means that this
_will_ be slower on raid5/6 than a regular replace.  Link resets and
other issues that cause devices to disappear become more common the
more damaged a disk is, so avoiding reads from it becomes more
important too, because just reading from a disk puts stress on it.
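
As an illustration of where '-r' goes, here is a small sketch that assembles
the command line from the original report with the flag added. The helper name
is hypothetical, and the sketch only builds and prints the command; it does not
run it.

```python
import shlex


def replace_command(devid, target, mountpoint, avoid_source_reads=True):
    """Build the argv for 'btrfs replace start', optionally with -r."""
    argv = ["btrfs", "replace", "start"]
    if avoid_source_reads:
        argv.append("-r")  # only read the source if no other good copy exists
    argv += [str(devid), target, mountpoint]
    return argv


# devid, target device and mount point as given in the original report.
print(shlex.join(replace_command(4, "/dev/sdw", "/mnt/newdata")))
# prints: btrfs replace start -r 4 /dev/sdw /mnt/newdata
```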

* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27 17:29   ` Chris Murphy
  2016-06-27 17:37     ` Austin S. Hemmelgarn
@ 2016-06-27 17:46     ` Chris Murphy
  2016-06-27 22:29       ` Steven Haigh
  1 sibling, 1 reply; 7+ messages in thread
From: Chris Murphy @ 2016-06-27 17:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nick Austin, Btrfs BTRFS

On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy <lists@colorremedies.com> wrote:

>
> Next is to decide to what degree you want to salvage this volume and
> keep using Btrfs raid56 despite the risks

Forgot to complete this thought. So if you get a backup, and decide
you want to fix it, I would see if you can cancel the replace using
"btrfs replace cancel <mp>" and confirm that it stops. And now comes the
risky part, which is choosing between two options: either try "btrfs
device add" and then "btrfs device remove"; or remove the bad drive,
reboot, see if it'll mount with -o degraded, and then use add and
remove (in which case you'll use 'remove missing').

With the first, you risk Btrfs still using the flaky bad drive.

With the second, you risk the degraded mount not working, and some
other drive in the array having a problem while degraded (like an
unrecoverable read error from a single sector).


-- 
Chris Murphy

* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27 17:46     ` Chris Murphy
@ 2016-06-27 22:29       ` Steven Haigh
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Haigh @ 2016-06-27 22:29 UTC (permalink / raw)
  To: linux-btrfs


On 28/06/16 03:46, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy <lists@colorremedies.com> wrote:
> 
>>
>> Next is to decide to what degree you want to salvage this volume and
>> keep using Btrfs raid56 despite the risks
> 
> Forgot to complete this thought. So if you get a backup, and decide
> you want to fix it, I would see if you can cancel the replace using
> "btrfs replace cancel <mp>" and confirm that it stops. And now is the
> risky part, which is whether to try "btrfs add" and then "btrfs
> remove" or remove the bad drive, reboot, and see if it'll mount with
> -o degraded, and then use add and remove (in which case you'll use
> 'remove missing').
> 
> The first you risk Btrfs still using the flaky bad drive.
> 
> The second you risk whether a degraded mount will work, and whether
> any other drive in the array has a problem while degraded (like an
> unrecoverable read error from a single sector).

This is the exact set of circumstances that caused my corrupt array. I
was using RAID6 - yet it still corrupted large portions of things. In
theory, due to having double parity, it should still have survived even
if a disk did go bad - but there we are.

I first started a replace - noted how slow it was going - cancelled the
replace, then did an add / delete - the system crashed and it was all over.

Just as another data point, I've been flogging the guts out of the array
with mdadm RAID6 doing a reshape of that - and no read errors, system
crashes or other problems in over 48 hours.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
  2016-06-27  3:57 Strange behavior when replacing device on BTRFS RAID 5 array Nick Austin
  2016-06-27  4:02 ` Nick Austin
@ 2016-06-27 21:12 ` Duncan
  1 sibling, 0 replies; 7+ messages in thread
From: Duncan @ 2016-06-27 21:12 UTC (permalink / raw)
  To: linux-btrfs

Nick Austin posted on Sun, 26 Jun 2016 20:57:32 -0700 as excerpted:

> I have a 4 device BTRFS RAID 5 filesystem.
> 
> One of the device members of this file system (sdr) had badblocks, so I
> decided to replace it.

While the others answered the direct question, there's something 
potentially more urgent...

Btrfs raid56 mode, as currently implemented, has some fundamentally 
serious bugs, and we are just now finding out how serious they 
potentially are.  For the details you can read the other active threads 
from the last week or so, but the important thing is that...

For the time being, raid56 mode cannot be trusted to be repairable, and 
as a result it is now strongly recommended against.  Unless you are using 
pure testing data that you don't care about whether it lives or dies 
(either because it literally /is/ that trivial, or because you have tested 
backups, /making/ it that trivial), I'd urgently recommend either getting 
your data off it ASAP, or rebalancing to redundant-raid, raid1 or raid10, 
instead of parity-raid (5/6), before something worse happens and you no 
longer can.

Raid1 mode is a reasonable alternative, as long as your data fits in the 
available space.  Keep in mind that btrfs raid1 is always two copies, 
with more than two devices upping the capacity, not the redundancy, so 
three 5.46 TiB devices give 8.19 TiB of usable space.  Given your 8+ TiB 
of data usage, plus metadata and system, that's unlikely to fit unless 
you delete some stuff (older snapshots probably, if you have them).  So 
you'll need to keep it at four devices of that size (10.92 TiB usable).
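
The capacity arithmetic above can be sketched as follows. This is a rough 
estimate only: it assumes two copies of everything, each on a different 
device, and ignores metadata overhead and chunk allocation granularity. 
The function name is illustrative.

```python
def raid1_usable(sizes):
    """Rough usable capacity for two-copy raid1 over the given device sizes.

    Every block is stored twice on different devices, so capacity is half
    the total, unless one device is larger than all the others combined.
    """
    total = sum(sizes)
    return min(total / 2, total - max(sizes))


# Device sizes in TiB, as reported by 'btrfs fi show' in this thread.
print(f"{raid1_usable([5.46] * 3):.2f} TiB")  # 8.19 TiB with three devices
print(f"{raid1_usable([5.46] * 4):.2f} TiB")  # 10.92 TiB with four devices
```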

Btrfs raid10 is also considered as stable as btrfs in general, and would 
be doable with 4+ devices, but for various reasons that I'll skip for 
brevity here (ask if you want them detailed), I'd recommend staying with 
btrfs raid1.

Or switch to md- or dm-raid1.  Or one other interesting alternative, a 
pair of md- or dm-raid0s, on top of which you run btrfs raid1.  That 
gives you the data integrity of btrfs raid1, with somewhat better speed 
than the reasonably stable but as yet unoptimized btrfs raid10.

And of course there's one other alternative, zfs, if you are good with 
its hardware requirements and licensing situation.

But I'd recommend btrfs raid1 as the simple choice.  It's what I'm using 
here (tho on a pair of ssds, so far smaller but faster media, so 
different use-case).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
