* Raid5 replace disk problems
From: Jussi Kansanen @ 2016-04-21 15:09 UTC
To: linux-btrfs
Hello,
I have a 4x 2TB HDD raid5 array and one of the disks started going bad
(according to SMART; no read/write errors were seen by btrfs). After replacing
the disk with a new one I ran "btrfs replace", which resulted in a kernel crash
at about 0.5% done:
BTRFS info (device dm-10): dev_replace from <missing disk> (devid 4) to
/dev/mapper/bcrypt_sdj1 started
WARNING: CPU: 1 PID: 30627 at fs/btrfs/inode.c:9125
btrfs_destroy_inode+0x271/0x290()
Modules linked in: algif_skcipher af_alg evdev xt_tcpudp nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack x86_pkg_temp_thermal kvm_intel kvm
irqbypass ghash_clmulni_intel psmouse iptable_filter ip_tables x_tables fan
thermal battery processor button autofs4
CPU: 1 PID: 30627 Comm: umount Not tainted 4.5.0 #1
Hardware name: System manufacturer System Product Name/P8Z77-V LE PLUS, BIOS
0910 03/18/2014
0000000000000000 ffffffff813971f9 0000000000000000 ffffffff817f2b34
ffffffff8107ab78 ffff8800d55daa00 ffff8800cb990998 ffff880212d5b800
0000000000000000 ffff8801fcc0ff58 ffffffff812dbfc1 ffff8800d55daa00
Call Trace:
[<ffffffff813971f9>] ? dump_stack+0x46/0x5d
[<ffffffff8107ab78>] ? warn_slowpath_common+0x78/0xb0
[<ffffffff812dbfc1>] ? btrfs_destroy_inode+0x271/0x290
[<ffffffff812b69a2>] ? btrfs_put_block_group_cache+0x72/0xa0
[<ffffffff812c71d6>] ? close_ctree+0x146/0x330
[<ffffffff81154d9f>] ? generic_shutdown_super+0x5f/0xe0
[<ffffffff81155029>] ? kill_anon_super+0x9/0x10
[<ffffffff8129c5ed>] ? btrfs_kill_super+0xd/0x90
[<ffffffff8115534f>] ? deactivate_locked_super+0x2f/0x60
[<ffffffff8116f376>] ? cleanup_mnt+0x36/0x80
[<ffffffff81091f3c>] ? task_work_run+0x6c/0x90
[<ffffffff810011aa>] ? exit_to_usermode_loop+0x8a/0x90
[<ffffffff8167bce3>] ? int_ret_from_sys_call+0x25/0x8f
---[ end trace 6a7dec9450d45f9c ]---
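For reference, a replace of a missing device is started along these lines (a
rough sketch, not my literal shell history; devid 4 is the missing disk and
/dev/mapper/bcrypt_sdj1 the new one, see the usage output further below):

# start the rebuild onto the new device, addressing the old disk by
# devid since it is no longer present
btrfs replace start 4 /dev/mapper/bcrypt_sdj1 /bstorage

# add -B to keep it in the foreground instead of running in the background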
The replace continues automatically after a reboot, but it ends up using all of
the memory roughly every 6% of progress (about 8 hours) and crashes the system:
BTRFS info (device dm-10): continuing dev_replace from <missing disk> (devid 4)
to /dev/mapper/bcrypt_sdj1 @0%
Apr 20 14:03:48 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 20 14:03:52 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @6%
Apr 20 22:38:41 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 20 22:38:46 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @12%
Apr 21 13:14:51 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 21 13:14:55 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @18%
The issue seems to be the "bio-1" slab cache using up all of the memory:
/proc/meminfo:
MemTotal: 8072852 kB
MemFree: 646108 kB
...
Slab: 6235188 kB
SReclaimable: 49320 kB
SUnreclaim: 6185868 kB
/proc/slabinfo:
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
bio-1 17588753 17588964 320 12 1 : tunables 0 0 0 : slabdata 1465747 1465747 0
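Rough arithmetic from the numbers above: 17,588,964 bio-1 objects at 320 bytes
each is about 5.2 GiB, i.e. nearly all of the ~5.9 GiB of unreclaimable slab. A
simple way to watch this during the replace (standard tools; nothing here is
btrfs-specific beyond the status command):

# check replace progress once (-1 = print status once and exit)
btrfs replace status -1 /bstorage

# watch the bio-1 slab and overall slab usage grow
watch -n 60 'grep "^bio-1 " /proc/slabinfo; grep -E "^(Slab|SUnreclaim)" /proc/meminfo'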
The replace operation is very slow (with no other load): with the CFQ scheduler
it averages 3x 20MB/s reads (old disks) and 1.4MB/s writes (new disk). With the
deadline scheduler performance is better, averaging 3x 40MB/s reads and 4MB/s
writes (both schedulers with the default queue/nr_requests).
The write speed alone could be explained by a lot of random writes, but why is
the difference between data read and data written so large? According to
iostat, replace reads 35 times more data than it writes to the new disk.
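For anyone comparing schedulers, the switch is done per underlying disk via
sysfs, roughly like this (sdg/sdh/sdi/sdj being the raw disks under dm-crypt
here; adjust names as needed):

# per-disk I/O scheduler and queue depth live under /sys/block;
# the dm-crypt devices sit on top of these raw disks
for d in sdg sdh sdi sdj; do
    echo deadline > /sys/block/$d/queue/scheduler
    cat /sys/block/$d/queue/nr_requests   # left at the default in the numbers above
done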
Info:
kernel 4.5 (now 4.5.2, no change)
btrfs-progs 4.5.1
dm-crypt partitions, 4k aligned
mount opts: defaults,noatime,compress=lzo
8GB RAM
btrfs fi usage /bstorage/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
Device size: 9.10TiB
Device allocated: 0.00B
Device unallocated: 9.10TiB
Device missing: 1.82TiB
Used: 0.00B
Free (estimated): 0.00B (min: 8.00EiB)
Data ratio: 0.00
Metadata ratio: 0.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID5: Size:1.52TiB, Used:1.46TiB
/dev/mapper/bcrypt_sdg1 520.00GiB
/dev/mapper/bcrypt_sdh1 520.00GiB
/dev/mapper/bcrypt_sdi1 520.00GiB
missing 520.00GiB
Metadata,RAID5: Size:4.03GiB, Used:1.96GiB
/dev/mapper/bcrypt_sdg1 1.34GiB
/dev/mapper/bcrypt_sdh1 1.34GiB
/dev/mapper/bcrypt_sdi1 1.34GiB
missing 1.34GiB
System,RAID5: Size:76.00MiB, Used:128.00KiB
/dev/mapper/bcrypt_sdg1 36.00MiB
/dev/mapper/bcrypt_sdh1 36.00MiB
/dev/mapper/bcrypt_sdi1 36.00MiB
missing 4.00MiB
Unallocated:
/dev/mapper/bcrypt_sdg1 1.31TiB
/dev/mapper/bcrypt_sdh1 1.31TiB
/dev/mapper/bcrypt_sdi1 1.31TiB
/dev/mapper/bcrypt_sdj1 1.82TiB
missing 1.31TiB
btrfs fi show /bstorage/
Label: 'btrfs_bstorage' uuid: 3861e35a-43ef-4293-b2bf-f841c8bcb4e4
Total devices 5 FS bytes used 1.47TiB
devid 0 size 1.82TiB used 521.35GiB path /dev/mapper/bcrypt_sdj1
devid 1 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdg1
devid 2 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdh1
devid 3 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdi1
*** Some devices missing
btrfs device stats /bstorage/
[/dev/mapper/bcrypt_sdj1].write_io_errs 0
[/dev/mapper/bcrypt_sdj1].read_io_errs 0
[/dev/mapper/bcrypt_sdj1].flush_io_errs 0
[/dev/mapper/bcrypt_sdj1].corruption_errs 0
[/dev/mapper/bcrypt_sdj1].generation_errs 0
[/dev/mapper/bcrypt_sdg1].write_io_errs 0
[/dev/mapper/bcrypt_sdg1].read_io_errs 0
[/dev/mapper/bcrypt_sdg1].flush_io_errs 0
[/dev/mapper/bcrypt_sdg1].corruption_errs 0
[/dev/mapper/bcrypt_sdg1].generation_errs 0
[/dev/mapper/bcrypt_sdh1].write_io_errs 0
[/dev/mapper/bcrypt_sdh1].read_io_errs 0
[/dev/mapper/bcrypt_sdh1].flush_io_errs 0
[/dev/mapper/bcrypt_sdh1].corruption_errs 0
[/dev/mapper/bcrypt_sdh1].generation_errs 0
[/dev/mapper/bcrypt_sdi1].write_io_errs 0
[/dev/mapper/bcrypt_sdi1].read_io_errs 0
[/dev/mapper/bcrypt_sdi1].flush_io_errs 0
[/dev/mapper/bcrypt_sdi1].corruption_errs 0
[/dev/mapper/bcrypt_sdi1].generation_errs 0
[(null)].write_io_errs 0
[(null)].read_io_errs 0
[(null)].flush_io_errs 0
[(null)].corruption_errs 0
[(null)].generation_errs 0
btrfs dev usage /bstorage/
/dev/mapper/bcrypt_sdg1, ID: 1
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdh1, ID: 2
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdi1, ID: 3
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdj1, ID: 0
Device size: 1.82TiB
Unallocated: 1.82TiB
missing, ID: 4
Device size: 0.00B
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
Unallocated: 1.31TiB
* Re: Raid5 replace disk problems
From: Duncan @ 2016-04-22 1:27 UTC
To: linux-btrfs
Jussi Kansanen posted on Thu, 21 Apr 2016 18:09:31 +0300 as excerpted:
> The replace operation is very slow (with no other load): with the CFQ
> scheduler it averages 3x 20MB/s reads (old disks) and 1.4MB/s writes
> (new disk). With the deadline scheduler performance is better, averaging
> 3x 40MB/s reads and 4MB/s writes (both schedulers with the default
> queue/nr_requests).
>
> The write speed alone could be explained by a lot of random writes, but
> why is the difference between data read and data written so large?
> According to iostat, replace reads 35 times more data than it writes to
> the new disk.
>
>
> Info:
>
> kernel 4.5 (now 4.5.2, no change)
> btrfs-progs 4.5.1
[Just a btrfs-using admin and list regular, not a dev. Also, raid56
isn't my own use-case, but I am following it in general on the list.]
Keep in mind that btrfs raid56 mode (aka parity raid mode) remains less
mature and stable than non-parity raid modes such as raid1 and raid10,
and of course single-device mode with single data and single or dup
metadata, as well. It's certainly /not/ considered stable enough for
production usage at this point, and other alternatives such as btrfs
raid1 or raid10 or use of a separate raid layer (btrfs raid1 on top of a
pair of mdraid0s is one interesting solution) are actively recommended.
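A rough sketch of that last layout with four disks (device names are
placeholders; plain mdadm and mkfs.btrfs, nothing exotic):

# two mdraid0 pairs for capacity and speed...
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1

# ...and btrfs raid1 for both data and metadata across the two md devices,
# so btrfs checksums still have a second copy to repair from
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt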
And you're not the first to report super-slow replace/restripe for raid56,
either. It's a known bug, though as it doesn't seem to affect everyone, it
has been hard to pin down and fix. The worst part is that for those affected,
replace and restripe are so slow that they cease to be practical in the real
world, and they endanger the entire array: at those speeds there's a
relatively large chance that another device fails before the replace
completes, taking out the whole array because more devices have failed than
it can handle. From a reliability perspective, that means it effectively
degrades to a slow raid0 as soon as the first device drops out, with no
practical way of recovering back to raid5/6 mode.
I don't recall seeing the memory issue reported before in relation to
raid56, but it isn't horribly surprising either. IIRC there have been
some recent memory-fix patches, so 4.6 might be better, but I wouldn't
count on it. I'd really just recommend getting off of raid56 mode for now,
until it has had somewhat longer to mature.
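Once the replace has finished (or the missing device is otherwise dealt with)
and assuming the ~1.5 TiB of data fits at raid1 ratios on the remaining space,
the conversion itself is just a balance with convert filters, roughly:

# rewrite all data and metadata chunks as raid1; this touches everything,
# so expect it to take a while
btrfs balance start -dconvert=raid1 -mconvert=raid1 /bstorage

# check on it later
btrfs balance status /bstorage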
(I'm previously on record as suggesting that people wait at least a year,
~5 kernel cycles, from nominal full raid56 support for it to stabilize, and
then ask about its current state on the list, before trying to use it for
anything but testing with throw-away data. With raid56 nominally complete in
3.19, that would have been 4.4 at the earliest, and for a short time around
then it did look reasonable. But then this bug with extremely long
replace/restripe times began showing up on the list, and until that's traced
down and fixed, I just don't see anyone responsible using it except, of
course, for testing and hopefully fixing this thing. I honestly don't know
how long that will be, or whether there are other bugs lurking as well, but
given that 4.6 is nearing release and I don't believe the bug has even been
fully traced down yet, 4.8 is definitely the earliest I'd say to consider it
again, and a more conservative recommendation would be to ask again around
4.10.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman