* Raid5 replace disk problems
From: Jussi Kansanen @ 2016-04-21 15:09 UTC
To: linux-btrfs
Hello,
I have a 4x 2TB HDD raid5 array and one of the disks started going bad
(according to SMART; no read/write errors were seen by btrfs). After replacing
the disk with a new one I ran "btrfs replace", which resulted in a kernel crash
at about 0.5% done:
BTRFS info (device dm-10): dev_replace from <missing disk> (devid 4) to
/dev/mapper/bcrypt_sdj1 started
WARNING: CPU: 1 PID: 30627 at fs/btrfs/inode.c:9125
btrfs_destroy_inode+0x271/0x290()
Modules linked in: algif_skcipher af_alg evdev xt_tcpudp nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack x86_pkg_temp_thermal kvm_intel kvm
irqbypass ghash_clmulni_intel psmouse iptable_filter ip_tables x_tables fan
thermal battery processor button autofs4
CPU: 1 PID: 30627 Comm: umount Not tainted 4.5.0 #1
Hardware name: System manufacturer System Product Name/P8Z77-V LE PLUS, BIOS
0910 03/18/2014
0000000000000000 ffffffff813971f9 0000000000000000 ffffffff817f2b34
ffffffff8107ab78 ffff8800d55daa00 ffff8800cb990998 ffff880212d5b800
0000000000000000 ffff8801fcc0ff58 ffffffff812dbfc1 ffff8800d55daa00
Call Trace:
[<ffffffff813971f9>] ? dump_stack+0x46/0x5d
[<ffffffff8107ab78>] ? warn_slowpath_common+0x78/0xb0
[<ffffffff812dbfc1>] ? btrfs_destroy_inode+0x271/0x290
[<ffffffff812b69a2>] ? btrfs_put_block_group_cache+0x72/0xa0
[<ffffffff812c71d6>] ? close_ctree+0x146/0x330
[<ffffffff81154d9f>] ? generic_shutdown_super+0x5f/0xe0
[<ffffffff81155029>] ? kill_anon_super+0x9/0x10
[<ffffffff8129c5ed>] ? btrfs_kill_super+0xd/0x90
[<ffffffff8115534f>] ? deactivate_locked_super+0x2f/0x60
[<ffffffff8116f376>] ? cleanup_mnt+0x36/0x80
[<ffffffff81091f3c>] ? task_work_run+0x6c/0x90
[<ffffffff810011aa>] ? exit_to_usermode_loop+0x8a/0x90
[<ffffffff8167bce3>] ? int_ret_from_sys_call+0x25/0x8f
---[ end trace 6a7dec9450d45f9c ]---
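For reference, a replace of a missing device is started along these lines (a
rough sketch, not my literal shell history; devid 4 is the missing disk and
/dev/mapper/bcrypt_sdj1 the new one, see the usage output further below):

# start the rebuild onto the new device, addressing the old disk by
# devid since it is no longer present
btrfs replace start 4 /dev/mapper/bcrypt_sdj1 /bstorage

# add -B to keep it in the foreground instead of running in the background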
The replace continues automatically after a reboot, but it ends up using all of
the memory roughly every 6% of progress (about 8 hours) and crashes the system:
BTRFS info (device dm-10): continuing dev_replace from <missing disk> (devid 4)
to /dev/mapper/bcrypt_sdj1 @0%
Apr 20 14:03:48 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 20 14:03:52 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @6%
Apr 20 22:38:41 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 20 22:38:46 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @12%
Apr 21 13:14:51 localhost kernel: BTRFS warning (device dm-4): devid 4 uuid
e02b8898-c6ce-4c95-956d-24217c470b8a is missing
Apr 21 13:14:55 localhost kernel: BTRFS info (device dm-4): continuing
dev_replace from <missing disk> (devid 4) to /dev/mapper/bcrypt_sdj1 @18%
The issue seems to be the "bio-1" slab cache using up all of the memory:
/proc/meminfo:
MemTotal: 8072852 kB
MemFree: 646108 kB
...
Slab: 6235188 kB
SReclaimable: 49320 kB
SUnreclaim: 6185868 kB
/proc/slabinfo:
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
bio-1 17588753 17588964 320 12 1 : tunables 0 0 0 : slabdata 1465747 1465747 0
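Rough arithmetic from the numbers above: 17,588,964 bio-1 objects at 320 bytes
each is about 5.2 GiB, i.e. nearly all of the ~5.9 GiB of unreclaimable slab. A
simple way to watch this during the replace (standard tools; nothing here is
btrfs-specific beyond the status command):

# check replace progress once (-1 = print status once and exit)
btrfs replace status -1 /bstorage

# watch the bio-1 slab and overall slab usage grow
watch -n 60 'grep "^bio-1 " /proc/slabinfo; grep -E "^(Slab|SUnreclaim)" /proc/meminfo'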
The replace operation is very slow (with no other load): with the CFQ scheduler
it averages 3x 20MB/s reads (old disks) and 1.4MB/s writes (new disk). With the
deadline scheduler performance is better, averaging 3x 40MB/s reads and 4MB/s
writes (both schedulers with the default queue/nr_requests).
The write speed alone could be explained by a lot of random writes, but why is
the difference between data read and data written so large? According to
iostat, replace reads 35 times more data than it writes to the new disk.
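For anyone comparing schedulers, the switch is done per underlying disk via
sysfs, roughly like this (sdg/sdh/sdi/sdj being the raw disks under dm-crypt
here; adjust names as needed):

# per-disk I/O scheduler and queue depth live under /sys/block;
# the dm-crypt devices sit on top of these raw disks
for d in sdg sdh sdi sdj; do
    echo deadline > /sys/block/$d/queue/scheduler
    cat /sys/block/$d/queue/nr_requests   # left at the default in the numbers above
done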
Info:
kernel 4.5 (now 4.5.2, no change)
btrfs-progs 4.5.1
dm-crypt partitions, 4k aligned
mount opts: defaults,noatime,compress=lzo
8GB RAM
btrfs fi usage /bstorage/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
Device size: 9.10TiB
Device allocated: 0.00B
Device unallocated: 9.10TiB
Device missing: 1.82TiB
Used: 0.00B
Free (estimated): 0.00B (min: 8.00EiB)
Data ratio: 0.00
Metadata ratio: 0.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID5: Size:1.52TiB, Used:1.46TiB
/dev/mapper/bcrypt_sdg1 520.00GiB
/dev/mapper/bcrypt_sdh1 520.00GiB
/dev/mapper/bcrypt_sdi1 520.00GiB
missing 520.00GiB
Metadata,RAID5: Size:4.03GiB, Used:1.96GiB
/dev/mapper/bcrypt_sdg1 1.34GiB
/dev/mapper/bcrypt_sdh1 1.34GiB
/dev/mapper/bcrypt_sdi1 1.34GiB
missing 1.34GiB
System,RAID5: Size:76.00MiB, Used:128.00KiB
/dev/mapper/bcrypt_sdg1 36.00MiB
/dev/mapper/bcrypt_sdh1 36.00MiB
/dev/mapper/bcrypt_sdi1 36.00MiB
missing 4.00MiB
Unallocated:
/dev/mapper/bcrypt_sdg1 1.31TiB
/dev/mapper/bcrypt_sdh1 1.31TiB
/dev/mapper/bcrypt_sdi1 1.31TiB
/dev/mapper/bcrypt_sdj1 1.82TiB
missing 1.31TiB
btrfs fi show /bstorage/
Label: 'btrfs_bstorage' uuid: 3861e35a-43ef-4293-b2bf-f841c8bcb4e4
Total devices 5 FS bytes used 1.47TiB
devid 0 size 1.82TiB used 521.35GiB path /dev/mapper/bcrypt_sdj1
devid 1 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdg1
devid 2 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdh1
devid 3 size 1.82TiB used 521.38GiB path /dev/mapper/bcrypt_sdi1
*** Some devices missing
btrfs device stats /bstorage/
[/dev/mapper/bcrypt_sdj1].write_io_errs 0
[/dev/mapper/bcrypt_sdj1].read_io_errs 0
[/dev/mapper/bcrypt_sdj1].flush_io_errs 0
[/dev/mapper/bcrypt_sdj1].corruption_errs 0
[/dev/mapper/bcrypt_sdj1].generation_errs 0
[/dev/mapper/bcrypt_sdg1].write_io_errs 0
[/dev/mapper/bcrypt_sdg1].read_io_errs 0
[/dev/mapper/bcrypt_sdg1].flush_io_errs 0
[/dev/mapper/bcrypt_sdg1].corruption_errs 0
[/dev/mapper/bcrypt_sdg1].generation_errs 0
[/dev/mapper/bcrypt_sdh1].write_io_errs 0
[/dev/mapper/bcrypt_sdh1].read_io_errs 0
[/dev/mapper/bcrypt_sdh1].flush_io_errs 0
[/dev/mapper/bcrypt_sdh1].corruption_errs 0
[/dev/mapper/bcrypt_sdh1].generation_errs 0
[/dev/mapper/bcrypt_sdi1].write_io_errs 0
[/dev/mapper/bcrypt_sdi1].read_io_errs 0
[/dev/mapper/bcrypt_sdi1].flush_io_errs 0
[/dev/mapper/bcrypt_sdi1].corruption_errs 0
[/dev/mapper/bcrypt_sdi1].generation_errs 0
[(null)].write_io_errs 0
[(null)].read_io_errs 0
[(null)].flush_io_errs 0
[(null)].corruption_errs 0
[(null)].generation_errs 0
btrfs dev usage /bstorage/
/dev/mapper/bcrypt_sdg1, ID: 1
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdh1, ID: 2
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdi1, ID: 3
Device size: 1.82TiB
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
System,RAID5: 32.00MiB
Unallocated: 1.31TiB
/dev/mapper/bcrypt_sdj1, ID: 0
Device size: 1.82TiB
Unallocated: 1.82TiB
missing, ID: 4
Device size: 0.00B
Data,RAID5: 520.00GiB
Metadata,RAID5: 1.34GiB
System,RAID5: 4.00MiB
Unallocated: 1.31TiB
* Re: Raid5 replace disk problems
From: Duncan @ 2016-04-22 1:27 UTC
To: linux-btrfs
Jussi Kansanen posted on Thu, 21 Apr 2016 18:09:31 +0300 as excerpted:
> The replace operation is very slow (with no other load): with the CFQ
> scheduler it averages 3x 20MB/s reads (old disks) and 1.4MB/s writes
> (new disk). With the deadline scheduler performance is better, averaging
> 3x 40MB/s reads and 4MB/s writes (both schedulers with the default
> queue/nr_requests).
>
> The write speed alone could be explained by a lot of random writes, but
> why is the difference between data read and data written so large?
> According to iostat, replace reads 35 times more data than it writes to
> the new disk.
>
>
> Info:
>
> kernel 4.5 (now 4.5.2, no change)
> btrfs-progs 4.5.1
[Just a btrfs-using admin and list regular, not a dev. Also, raid56
isn't my own use-case, but I am following it in general on the list.]
Keep in mind that btrfs raid56 mode (aka parity raid mode) remains less
mature and stable than non-parity raid modes such as raid1 and raid10,
and of course single-device mode with single data and single or dup
metadata, as well. It's certainly /not/ considered stable enough for
production usage at this point, and other alternatives such as btrfs
raid1 or raid10 or use of a separate raid layer (btrfs raid1 on top of a
pair of mdraid0s is one interesting solution) are actively recommended.
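A rough sketch of that last layout with four disks (device names are
placeholders; plain mdadm and mkfs.btrfs, nothing exotic):

# two mdraid0 pairs for capacity and speed...
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1

# ...and btrfs raid1 for both data and metadata across the two md devices,
# so btrfs checksums still have a second copy to repair from
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt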
And you're not the first to report super-slow replace/restripe for raid56,
either. It's a known bug, though as it doesn't seem to affect everyone, it
has been hard to pin down and fix. The worst part is that for those affected,
replace and restripe are so slow that they cease to be practical in the real
world, and they endanger the entire array: at those speeds there's a
relatively large chance that another device fails before the replace
completes, taking out the whole array because more devices have failed than
it can handle. From a reliability perspective, that means it effectively
degrades to a slow raid0 as soon as the first device drops out, with no
practical way of recovering back to raid5/6 mode.
I don't recall seeing the memory issue reported before in relation to
raid56, but it isn't horribly surprising either. IIRC there have been
some recent memory-fix patches, so 4.6 might be better, but I wouldn't
count on it. I'd really just recommend getting off of raid56 mode for now,
until it has had somewhat longer to mature.
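Once the replace has finished (or the missing device is otherwise dealt with)
and assuming the ~1.5 TiB of data fits at raid1 ratios on the remaining space,
the conversion itself is just a balance with convert filters, roughly:

# rewrite all data and metadata chunks as raid1; this touches everything,
# so expect it to take a while
btrfs balance start -dconvert=raid1 -mconvert=raid1 /bstorage

# check on it later
btrfs balance status /bstorage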
(I'm previously on record as suggesting that people wait at least a year,
~5 kernel cycles, from nominal full raid56 support for it to stabilize, and
then ask about its current state on the list, before trying to use it for
anything but testing with throw-away data. With raid56 nominally complete in
3.19, that would have been 4.4 at the earliest, and for a short time around
then it did look reasonable. But then this bug with extremely long
replace/restripe times began showing up on the list, and until that's traced
down and fixed, I just don't see anyone responsible using it except, of
course, for testing and hopefully fixing this thing. I honestly don't know
how long that will be, or whether there are other bugs lurking as well, but
given that 4.6 is nearing release and I don't believe the bug has even been
fully traced down yet, 4.8 is definitely the earliest I'd say to consider it
again, and a more conservative recommendation would be to ask again around
4.10.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman