* Btrfs Raid5 issue.
@ 2017-08-21 4:33 Robert LeBlanc
2017-08-21 6:58 ` Qu Wenruo
From: Robert LeBlanc @ 2017-08-21 4:33 UTC (permalink / raw)
To: linux-btrfs
I've been running btrfs in RAID5 for about a year now with bcache in
front of it. Yesterday, one of my drives was acting really slow, so I
was going to move it to a different port. I guess I've gotten too
comfortable hot-plugging drives at work and didn't think twice about
what could go wrong; after all, I set it up in RAID5, so it would be
fine. Well, it wasn't...
I was aware of the write hole issue and thought a fix had been committed
to the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs
in an md RAID1 that acts as the cache for the three backing devices in
bcache (bcache{0..2} or bcache{0,16,32}, depending on the kernel booted).
I have all my critical data saved off in btrfs snapshots on a different
host, but I don't transfer my MythTV subs that often, so I'd like to try
to recover some of that if possible.
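Roughly, the layout is:

  2x 250GB SSD -> md RAID1 -> shared bcache cache set
  3x 3TB HDD -> bcache backing devices -> /dev/bcache{0,16,32}
  /dev/bcache{0,16,32} -> btrfs RAID5 (the volume in question)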
What is really interesting is that I could not boot the first time (root
is on the btrfs volume), but I rebooted again and the fs came up in
read-only mode, though only one of the three disks was read-only. I tried
to reboot again and it never mounted after that. I see messages like this
in dmesg:
[ 151.201637] BTRFS info (device bcache0): disk space caching is enabled
[ 151.201640] BTRFS info (device bcache0): has skinny extents
[ 151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs:
wr 309, rd 319, flush 39, corrupt 0, gen 0
[ 151.931764] BTRFS info (device bcache0): detected SSD devices,
enabling SSD mode
[ 152.058915] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[ 152.059944] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[ 152.060018] BTRFS: error (device bcache0) in
__btrfs_free_extent:6989: errno=-5 IO failure
[ 152.060060] BTRFS: error (device bcache0) in
btrfs_run_delayed_refs:3009: errno=-5 IO failure
[ 152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
[ 152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475:
errno=-5 IO failure (Failed to recover log tree)
[ 152.074244] BTRFS error (device bcache0): cleaner transaction
attach returned -30
[ 152.148993] BTRFS error (device bcache0): open_ctree failed
So I thought the log was corrupted; since I could live without the last
30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0` and got a
backtrace. I then ran `btrfs rescue chunk-recover /dev/bcache0`; it spent
hours scanning the three disks, tried at the end to fix the logs (or
tree, I can't remember exactly), and then I got another backtrace.
Today, I compiled 4.13-rc6 to see if some of the latest fixes would
help; no dice (the dmesg above is from 4.13-rc6). I also compiled the
latest master of btrfs-progs, with no more luck.
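"Latest master" here just means a local build of btrfs-progs from git,
roughly:

  # git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
  # cd btrfs-progs && ./autogen.sh && ./configure && make

and then running the resulting ./btrfs binary directly.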
Things I've tried:
mount
mount -o degraded
mount -o degraded,ro
mount -o degraded (with each drive disconnected in turn to see if it
would mount without one of the drives)
btrfs rescue chunk-recover
btrfs rescue super-recover (all drives report the superblocks are fine)
btrfs rescue zero-log (always has a backtrace)
btrfs check
I know that bcache complicates things, but I'm hoping for two things:
1. to get what I can off the volume, and 2. to provide some information
that can help make btrfs/bcache better in the future.
Here is what `btrfs rescue zero-log` outputs:
# ./btrfs rescue zero-log /dev/bcache0
Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
btrfs unable to find ref byte nr 5310039638016 parent 0 root 2 owner 2 offset 0
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
checksum verify failed on 5309275930624 found A2FDBB6A wanted 461E06DC
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
Ignoring transid failure
bad key ordering 67 68
btrfs unable to find ref byte nr 5310039867392 parent 0 root 2 owner 1 offset 0
bad key ordering 67 68
extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1
./btrfs(+0x1c624)[0x562fde546624]
./btrfs(+0x1d91a)[0x562fde54791a]
./btrfs(+0x1da2b)[0x562fde547a2b]
./btrfs(+0x1f3a5)[0x562fde5493a5]
./btrfs(+0x1f91f)[0x562fde54991f]
./btrfs(btrfs_alloc_free_block+0xd2)[0x562fde54c20c]
./btrfs(__btrfs_cow_block+0x182)[0x562fde53c778]
./btrfs(btrfs_cow_block+0xea)[0x562fde53d0ea]
./btrfs(+0x185a3)[0x562fde5425a3]
./btrfs(btrfs_commit_transaction+0x96)[0x562fde54411c]
./btrfs(+0x6a702)[0x562fde594702]
./btrfs(handle_command_group+0x44)[0x562fde53b40c]
./btrfs(cmd_rescue+0x15)[0x562fde59486d]
./btrfs(main+0x85)[0x562fde53b5c3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd3931692b1]
./btrfs(_start+0x2a)[0x562fde53b13a]
Aborted
Please let me know if there is any other information I can provide
that would be helpful.
Thank you,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Btrfs Raid5 issue.
2017-08-21 4:33 Robert LeBlanc
@ 2017-08-21 6:58 ` Qu Wenruo
2017-08-21 10:53 ` Janos Toth F.
From: Qu Wenruo @ 2017-08-21 6:58 UTC (permalink / raw)
To: Robert LeBlanc, linux-btrfs
On 2017-08-21 12:33, Robert LeBlanc wrote:
> I've been running btrfs in RAID5 for about a year now with bcache in
> front of it. Yesterday, one of my drives was acting really slow, so I
> was going to move it to a different port. I guess I've gotten too
> comfortable hot-plugging drives at work and didn't think twice about
> what could go wrong; after all, I set it up in RAID5, so it would be
> fine. Well, it wasn't...
Well, Btrfs RAID5 is not that safe.
I would recommend using RAID1 at least for metadata.
(And in your case, your metadata is damaged, so I really recommend using
a more robust profile for metadata.)
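For future reference, on a filesystem that still mounts read-write, the
conversion is normally done with a balance using convert filters; a rough
sketch, assuming the fs is mounted at /mnt:

  # btrfs balance start -mconvert=raid1 -sconvert=raid1 -f /mnt

(-f is needed because balance refuses to explicitly operate on system
chunks without it. This of course can't help the currently unmountable
fs.)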
>
> I was aware of the write hole issue and thought a fix had been committed
> to the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs
> in an md RAID1 that acts as the cache for the three backing devices in
> bcache (bcache{0..2} or bcache{0,16,32}, depending on the kernel booted).
> I have all my critical data saved off in btrfs snapshots on a different
> host, but I don't transfer my MythTV subs that often, so I'd like to try
> to recover some of that if possible.
>
> What is really interesting is that I could not boot the first time (root
> is on the btrfs volume), but I rebooted again and the fs came up in
> read-only mode, though only one of the three disks was read-only. I tried
> to reboot again and it never mounted after that. I see messages like this
> in dmesg:
>
> [ 151.201637] BTRFS info (device bcache0): disk space caching is enabled
> [ 151.201640] BTRFS info (device bcache0): has skinny extents
> [ 151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs:
> wr 309, rd 319, flush 39, corrupt 0, gen 0
> [ 151.931764] BTRFS info (device bcache0): detected SSD devices,
> enabling SSD mode
> [ 152.058915] BTRFS error (device bcache0): parent transid verify
> failed on 5309837426688 wanted 1620383 found 1619473
> [ 152.059944] BTRFS error (device bcache0): parent transid verify
> failed on 5309837426688 wanted 1620383 found 1619473
Normally a transid error indicates a bigger problem, and it is usually hard to trace.
> [ 152.060018] BTRFS: error (device bcache0) in
> __btrfs_free_extent:6989: errno=-5 IO failure
> [ 152.060060] BTRFS: error (device bcache0) in
> btrfs_run_delayed_refs:3009: errno=-5 IO failure
> [ 152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
> [ 152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475:
> errno=-5 IO failure (Failed to recover log tree)
> [ 152.074244] BTRFS error (device bcache0): cleaner transaction
> attach returned -30
> [ 152.148993] BTRFS error (device bcache0): open_ctree failed
>
> So I thought the log was corrupted; since I could live without the
> last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0`
> and got a backtrace.
Yes, your idea about the log is correct: it's the log replay that is
causing the problem.
But the root cause seems to be a corrupted extent tree, which is not easy
to fix.
> I then ran `btrfs rescue chunk-recover /dev/bcache0`;
> it spent hours scanning the three disks, tried at the end to fix the
> logs (or tree, I can't remember exactly), and then I got another
> backtrace.
>
> Today, I compiled 4.13-rc6 to see if some of the latest fixes would
> help; no dice (the dmesg above is from 4.13-rc6). I also compiled the
> latest master of btrfs-progs, with no more luck.
>
> Things I've tried:
> mount
> mount -o degraded
> mount -o degraded,ro
> mount -o degraded (with each drive disconnected in turn to see if it
> would mount without one of the drives)
> btrfs rescue chunk-recover
> btrfs rescue super-recover (all drives report the superblocks are fine)
> btrfs rescue zero-log (always has a backtrace)
I think it's some other problem causing the backtrace, normally extent
tree corruption or a transid error.
> btrfs check
>
> I know that bcache complicates things, but I'm hoping for two things:
> 1. to get what I can off the volume, and 2. to provide some information
> that can help make btrfs/bcache better in the future.
>
> Here is what `btrfs rescue zero-log` outputs:
>
> # ./btrfs rescue zero-log /dev/bcache0
> Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> btrfs unable to find ref byte nr 5310039638016 parent 0 root 2 owner 2 offset 0
> parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
> parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
> checksum verify failed on 5309275930624 found A2FDBB6A wanted 461E06DC
> parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
> Ignoring transid failure
> bad key ordering 67 68
Beyond the transid and bytenr mismatches (which are already a big
problem), we even have bad key ordering.
That's definitely not a good sign.
I think the extent tree (and maybe more) got heavily damaged.
And considering how we update the extent tree (delaying it as long as
possible), that's not that strange.
I would recommend trying the backup roots manually to see which one can
pass btrfsck.
But the log tree will be an obstacle, as its content is bound to a
certain transid.
Would you please try the following commands?
# btrfs inspect dump-super -f /dev/bcache0
Check the output for a section like:
backup_roots[4]:
backup 0:
backup_tree_root: 29392896 gen: 6 level: 0
backup_chunk_root: 20987904 gen: 5 level: 0
Record the number after backup_tree_root, and then run:
# btrfs check -r 29392896 /dev/bcache0
If you're lucky, you should not see a backtrace.
BTW, the newer the backup, the higher the chance of recovery.
If backups 0 and 1 don't give a good result, there is not much left we
can do.
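To avoid copying the numbers by hand, a small shell loop does the same
thing (just a sketch, assuming the dump-super output format above and an
unmounted filesystem):

  for r in $(btrfs inspect dump-super -f /dev/bcache0 |
             awk '/backup_tree_root:/ {print $2}'); do
      echo "=== tree root $r ==="
      btrfs check -r "$r" /dev/bcache0
  done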
Thanks,
Qu
> btrfs unable to find ref byte nr 5310039867392 parent 0 root 2 owner 1 offset 0
> bad key ordering 67 68
> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1
> ./btrfs(+0x1c624)[0x562fde546624]
> ./btrfs(+0x1d91a)[0x562fde54791a]
> ./btrfs(+0x1da2b)[0x562fde547a2b]
> ./btrfs(+0x1f3a5)[0x562fde5493a5]
> ./btrfs(+0x1f91f)[0x562fde54991f]
> ./btrfs(btrfs_alloc_free_block+0xd2)[0x562fde54c20c]
> ./btrfs(__btrfs_cow_block+0x182)[0x562fde53c778]
> ./btrfs(btrfs_cow_block+0xea)[0x562fde53d0ea]
> ./btrfs(+0x185a3)[0x562fde5425a3]
> ./btrfs(btrfs_commit_transaction+0x96)[0x562fde54411c]
> ./btrfs(+0x6a702)[0x562fde594702]
> ./btrfs(handle_command_group+0x44)[0x562fde53b40c]
> ./btrfs(cmd_rescue+0x15)[0x562fde59486d]
> ./btrfs(main+0x85)[0x562fde53b5c3]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd3931692b1]
> ./btrfs(_start+0x2a)[0x562fde53b13a]
> Aborted
>
> Please let me know if there is any other information I can provide
> that would be helpful.
>
> Thank you,
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Btrfs Raid5 issue.
2017-08-21 6:58 ` Qu Wenruo
@ 2017-08-21 10:53 ` Janos Toth F.
From: Janos Toth F. @ 2017-08-21 10:53 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Robert LeBlanc, Btrfs BTRFS
I lost enough Btrfs m=d=s=RAID5 filesystems in past experiments (I
haven't tried using RAID5 for metadata and system chunks in the last few
years) to faulty SATA cables plus hotplug-enabled SATA controllers (where
a disk could disappear and reappear "as the wind blew"). Since then, I
have made a habit of always disabling hotplug for all SATA disks involved
with Btrfs, even those with the m=d=s=single profile (and I never wanted
to build multi-device filesystems from USB-attached disks anyway, but
this is a good reason for me to explicitly avoid that).
I am not sure whether other RAID profiles are affected in a similar way
or it's just RAID56. (Well, RAID0 is obviously toast and RAID1/10 will
obviously get degraded, but I am not sure whether it's possible to
re-sync RAID1/10 with a simple balance [possibly even without
remounting and doing a manual device delete/add?] or whether the
filesystem has to be recreated from scratch [like RAID5].)
I think this hotplug problem is an entirely different issue from the
RAID56 scrub race conditions (which are now considered fixed in Linux
4.12), and nobody is currently working on it (if it's RAID56-only, then
I don't expect a fix anytime soon [think years]).
* Re: Btrfs Raid5 issue.
@ 2017-08-21 16:31 Robert LeBlanc
2017-08-21 16:49 ` Chris Murphy
From: Robert LeBlanc @ 2017-08-21 16:31 UTC (permalink / raw)
To: quwenruo.btrfs, linux-btrfs
Qu,
Sorry, I'm not on the list (I was for a few years about three years ago).
I looked at the backup roots like you mentioned.
# ./btrfs inspect dump-super -f /dev/bcache0
superblock: bytenr=65536, device=/dev/bcache0
---------------------------------------------------------
csum_type 0 (crc32c)
csum_size 4
csum 0x45302c8f [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
fsid fef29f0a-dc4c-4cc4-b524-914e6630803c
label kvm-btrfs
generation 1620386
root 5310022877184
sys_array_size 161
chunk_root_generation 1620164
root_level 1
chunk_root 4725030256640
chunk_root_level 1
log_root 2876047507456
log_root_transid 0
log_root_level 0
total_bytes 8998588280832
bytes_used 3625869234176
sectorsize 4096
nodesize 16384
leafsize (deprecated) 16384
stripesize 4096
root_dir 6
num_devices 3
compat_flags 0x0
compat_ro_flags 0x0
incompat_flags 0x1e1
( MIXED_BACKREF |
BIG_METADATA |
EXTENDED_IREF |
RAID56 |
SKINNY_METADATA )
cache_generation 1620386
uuid_tree_generation 42
dev_item.uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
dev_item.fsid fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
dev_item.type 0
dev_item.total_bytes 2998998654976
dev_item.bytes_used 2295693574144
dev_item.io_align 4096
dev_item.io_width 4096
dev_item.sector_size 4096
dev_item.devid 2
dev_item.dev_group 0
dev_item.seek_speed 0
dev_item.bandwidth 0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
length 67108864 owner 2 stripe_len 65536 type
SYSTEM|RAID5
io_align 65536 io_width 65536 sector_size 4096
num_stripes 3 sub_stripes 1
stripe 0 devid 1 offset 2185232384
dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
stripe 1 devid 3 offset 1195075698688
dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
stripe 2 devid 2 offset 41340108800
dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
backup_roots[4]:
backup 0:
backup_tree_root: 5309879451648 gen: 1620384 level: 1
backup_chunk_root: 4725030256640 gen: 1620164 level: 1
backup_extent_root: 5309910958080 gen: 1620385 level: 2
backup_fs_root: 3658468147200 gen: 1618016 level: 1
backup_dev_root: 5309904224256 gen: 1620384 level: 1
backup_csum_root: 5309910532096 gen: 1620385 level: 3
backup_total_bytes: 8998588280832
backup_bytes_used: 3625871646720
backup_num_devices: 3
backup 1:
backup_tree_root: 5309780492288 gen: 1620385 level: 1
backup_chunk_root: 4725030256640 gen: 1620164 level: 1
backup_extent_root: 5309659037696 gen: 1620385 level: 2
backup_fs_root: 0 gen: 0 level: 0
backup_dev_root: 5309872275456 gen: 1620385 level: 1
backup_csum_root: 5309674536960 gen: 1620385 level: 3
backup_total_bytes: 8998588280832
backup_bytes_used: 3625869234176
backup_num_devices: 3
backup 2:
backup_tree_root: 5310022877184 gen: 1620386 level: 1
backup_chunk_root: 4725030256640 gen: 1620164 level: 1
backup_extent_root: 2876048949248 gen: 1620387 level: 2
backup_fs_root: 3658468147200 gen: 1618016 level: 1
backup_dev_root: 5309872275456 gen: 1620385 level: 1
backup_csum_root: 5310042259456 gen: 1620386 level: 3
backup_total_bytes: 8998588280832
backup_bytes_used: 3625869250560
backup_num_devices: 3
backup 3:
backup_tree_root: 5309771448320 gen: 1620383 level: 1
backup_chunk_root: 4725030256640 gen: 1620164 level: 1
backup_extent_root: 5309779804160 gen: 1620384 level: 2
backup_fs_root: 3658468147200 gen: 1618016 level: 1
backup_dev_root: 5309848158208 gen: 1620383 level: 1
backup_csum_root: 5309848846336 gen: 1620384 level: 3
backup_total_bytes: 8998588280832
backup_bytes_used: 3625871904768
backup_num_devices: 3
I did a check on each one; the output is attached, but none of them seemed clean.
This got me thinking: maybe I can try to mount using one of these backup
roots. So I went searching, found
https://btrfs.wiki.kernel.org/index.php/Mount_options, and tried the
'usebackuproot' option, but it doesn't seem to take an argument for which
one to use. I'm not sure whether it decides the first one is good and
just keeps trying that, never trying a backup, since the dmesg output is
the same.
I noticed on that page that there is a 'nologreplay' mount option, so I
tried it together with 'degraded'; it requires 'ro', but the volume
mounted and I can "see" things on the volume.
So with this nologreplay option, if I do a btrfs send of the subvolume
that I'm interested in (I don't think it was being written to at the
time of failure), would it copy (send) over the corruption as well? I
do have an older snapshot of that subvolume, and I could make a rw snap
of that and rsync if needed. In any case, I feel that btrfs has given
me more options than may have otherwise been available. What are your
or others' suggestions about moving forward?
Thanks,
Robert
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
[-- Attachment #2: btrfs_check.txt.xz --]
[-- Type: application/x-xz, Size: 4404 bytes --]
* Re: Btrfs Raid5 issue.
2017-08-21 16:31 Robert LeBlanc
@ 2017-08-21 16:49 ` Chris Murphy
From: Chris Murphy @ 2017-08-21 16:49 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: Qu Wenruo, Btrfs BTRFS
On Mon, Aug 21, 2017 at 10:31 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Qu,
>
> Sorry, I'm not on the list (I was for a few years about three years ago).
>
> I looked at the backup roots like you mentioned.
>
> # ./btrfs inspect dump-super -f /dev/bcache0
> superblock: bytenr=65536, device=/dev/bcache0
> ---------------------------------------------------------
> csum_type 0 (crc32c)
> csum_size 4
> csum 0x45302c8f [match]
> bytenr 65536
> flags 0x1
> ( WRITTEN )
> magic _BHRfS_M [match]
> fsid fef29f0a-dc4c-4cc4-b524-914e6630803c
> label kvm-btrfs
> generation 1620386
> root 5310022877184
> sys_array_size 161
> chunk_root_generation 1620164
> root_level 1
> chunk_root 4725030256640
> chunk_root_level 1
> log_root 2876047507456
> log_root_transid 0
> log_root_level 0
> total_bytes 8998588280832
> bytes_used 3625869234176
> sectorsize 4096
> nodesize 16384
> leafsize (deprecated) 16384
> stripesize 4096
> root_dir 6
> num_devices 3
> compat_flags 0x0
> compat_ro_flags 0x0
> incompat_flags 0x1e1
> ( MIXED_BACKREF |
> BIG_METADATA |
> EXTENDED_IREF |
> RAID56 |
> SKINNY_METADATA )
> cache_generation 1620386
> uuid_tree_generation 42
> dev_item.uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> dev_item.fsid fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
> dev_item.type 0
> dev_item.total_bytes 2998998654976
> dev_item.bytes_used 2295693574144
> dev_item.io_align 4096
> dev_item.io_width 4096
> dev_item.sector_size 4096
> dev_item.devid 2
> dev_item.dev_group 0
> dev_item.seek_speed 0
> dev_item.bandwidth 0
> dev_item.generation 0
> sys_chunk_array[2048]:
> item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
> length 67108864 owner 2 stripe_len 65536 type
> SYSTEM|RAID5
> io_align 65536 io_width 65536 sector_size 4096
> num_stripes 3 sub_stripes 1
> stripe 0 devid 1 offset 2185232384
> dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
> stripe 1 devid 3 offset 1195075698688
> dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
> stripe 2 devid 2 offset 41340108800
> dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> backup_roots[4]:
> backup 0:
> backup_tree_root: 5309879451648 gen: 1620384 level: 1
> backup_chunk_root: 4725030256640 gen: 1620164 level: 1
> backup_extent_root: 5309910958080 gen: 1620385 level: 2
> backup_fs_root: 3658468147200 gen: 1618016 level: 1
> backup_dev_root: 5309904224256 gen: 1620384 level: 1
> backup_csum_root: 5309910532096 gen: 1620385 level: 3
> backup_total_bytes: 8998588280832
> backup_bytes_used: 3625871646720
> backup_num_devices: 3
>
> backup 1:
> backup_tree_root: 5309780492288 gen: 1620385 level: 1
> backup_chunk_root: 4725030256640 gen: 1620164 level: 1
> backup_extent_root: 5309659037696 gen: 1620385 level: 2
> backup_fs_root: 0 gen: 0 level: 0
> backup_dev_root: 5309872275456 gen: 1620385 level: 1
> backup_csum_root: 5309674536960 gen: 1620385 level: 3
> backup_total_bytes: 8998588280832
> backup_bytes_used: 3625869234176
> backup_num_devices: 3
Well that's strange. A backup entry with a null fs root.
> I noticed on that page that there is a 'nologreplay' mount option, so I
> tried it together with 'degraded'; it requires 'ro', but the volume
> mounted and I can "see" things on the volume.
Degraded suggests it's not finding one of the three devices.
> So with this nologreplay option, if I do a btrfs send of the subvolume
> that I'm interested in (I don't think it was being written to at the
> time of failure), would it copy (send) over the corruption as well?
Anything that results in EIO will get included in the send, and by
default the receive fails. You can use verbose messaging on the receive
side, and use the -E option to permit the errors. But filesystem-specific
problems aren't going to propagate through send/receive.
Note that you can't change the subvolume in question to read-only
because the filesystem itself is read-only, and only read-only subvolumes
can be sent with send/receive. You might have to fall back to rsync.
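A rough sketch of the two paths (host, mount point, and snapshot names
below are made up). If a read-only snapshot of the subvolume already
exists on the damaged fs:

  # btrfs send /tmp/root/snaps/mythtv-ro | ssh backuphost 'btrfs receive /mnt/restore'

Otherwise, just copy the files out of the read-only mount with rsync:

  # rsync -aHAX /tmp/root/mythtv/ backuphost:/mnt/restore/mythtv/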
--
Chris Murphy
* Re: Btrfs Raid5 issue.
@ 2017-08-22 5:19 Robert LeBlanc
2017-08-22 5:53 ` Chris Murphy
2017-08-22 6:40 ` Qu Wenruo
From: Robert LeBlanc @ 2017-08-22 5:19 UTC (permalink / raw)
To: lists, linux-btrfs
Chris and Qu, thanks for your help. I was able to restore the data off
the volume. The only file I could not read while rsyncing was a MySQL
binlog, but it wasn't critical, as I had an off-site snapshot from that
morning and ownCloud could resync the files that had changed anyway.
This turned out much better than the md RAID failure I had a year ago;
much faster recovery, thanks to snapshots.
Is there anything you would like from this damaged filesystem to help
determine what went wrong and to help make btrfs better? If I don't
hear back from you in a day, I'll destroy it so that I can add the
disks into the new btrfs volumes to restore redundancy.
Bcache wasn't providing the performance I was hoping for, so I'm
putting the root and roots for my LXC containers on the SSDs (btrfs
RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
For some reason, it seemed that the btrfs RAID5 setup required one of
the drives, but I thought I had data with RAID5 and metadata with 2
copies. Was I missing something else that prevented mounting with that
specific drive? I don't want to get into a situation where one drive
dies and I can't get to any data.
Thank you again.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Btrfs Raid5 issue.
2017-08-22 5:19 Btrfs Raid5 issue Robert LeBlanc
@ 2017-08-22 5:53 ` Chris Murphy
2017-08-22 6:40 ` Qu Wenruo
From: Chris Murphy @ 2017-08-22 5:53 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: Chris Murphy, Btrfs BTRFS
On Mon, Aug 21, 2017 at 11:19 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Chris and Qu, thanks for your help. I was able to restore the data off
> the volume. The only file I could not read while rsyncing was a MySQL
> binlog, but it wasn't critical, as I had an off-site snapshot from that
> morning and ownCloud could resync the files that had changed anyway.
> This turned out much better than the md RAID failure I had a year ago;
> much faster recovery, thanks to snapshots.
>
> Is there anything you would like from this damaged filesystem to help
> determine what went wrong and to help make btrfs better? If I don't
> hear back from you in a day, I'll destroy it so that I can add the
> disks into the new btrfs volumes to restore redundancy.
>
> Bcache wasn't providing the performance I was hoping for, so I'm
> putting the root and roots for my LXC containers on the SSDs (btrfs
> RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
> For some reason, it seemed that the btrfs RAID5 setup required one of
> the drives, but I thought I had data with RAID5 and metadata with 2
> copies. Was I missing something else that prevented mounting with that
> specific drive? I don't want to get into a situation where one drive
> dies and I can't get to any data.
With all three connected, what do you get for 'btrfs fi show'?
The first email says the supers on all three drives are OK, but it's
still confusing that degraded works. It suggests it's not finding
something on one of the drives that it needs in order to mount; usually
that's the first superblock, or the system block group is partly corrupt,
or there's a read error or something, and mounting degraded then makes
it possible to proceed.
Anyway, at least all of the data is safe now. Pretty much all you can
do to guard against data loss is backups. Any degraded state is
precarious, because it takes just one more thing to go wrong and it's
all bad news from there.
Gluster is pretty easy to set up; use either the gluster native mount on
Linux, or SMB for everything else. Stick a big drive in a Raspberry Pi
(or two), and even though it's only Fast Ethernet (haha, now-slow
100Mb/s Ethernet) it will still replicate automatically as well as fail
over. Plus one of those could be XFS if you wanted to hedge your bets.
Or one of the less expensive Intel NUCs will also work if you want to
stick with x86.
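A rough sketch of the gluster side of that, with hostnames and brick
paths made up:

  # gluster peer probe pi2
  # gluster volume create backups replica 2 pi1:/bricks/backups pi2:/bricks/backups
  # gluster volume start backups
  # mount -t glusterfs pi1:/backups /mnt/backups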
--
Chris Murphy
* Re: Btrfs Raid5 issue.
2017-08-22 5:19 Btrfs Raid5 issue Robert LeBlanc
2017-08-22 5:53 ` Chris Murphy
@ 2017-08-22 6:40 ` Qu Wenruo
From: Qu Wenruo @ 2017-08-22 6:40 UTC (permalink / raw)
To: Robert LeBlanc, lists, linux-btrfs
On 2017-08-22 13:19, Robert LeBlanc wrote:
> Chris and Qu, thanks for your help. I was able to restore the data off
> the volume. The only file I could not read while rsyncing was a MySQL
> binlog, but it wasn't critical, as I had an off-site snapshot from that
> morning and ownCloud could resync the files that had changed anyway.
> This turned out much better than the md RAID failure I had a year ago;
> much faster recovery, thanks to snapshots.
>
> Is there anything you would like from this damaged filesystem to help
> determine what went wrong and to help make btrfs better? If I don't
> hear back from you in a day, I'll destroy it so that I can add the
> disks into the new btrfs volumes to restore redundancy.
Feel free to destroy the old filesystem.
If nologreplay works, that's good enough.
The problem seems to be in the extent tree, but it's too hard to locate
the real cause.
>
> Bcache wasn't providing the performance I was hoping for, so I'm
> putting the root and roots for my LXC containers on the SSDs (btrfs
> RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
Well, I'm more interested in the bcache performance.
I was considering using my Intel 600P NVMe to cache a 2.5" HGST 1TB HDD
(7200 rpm) in my btrfs KVM host (also my daily machine).
Would you please share more details about the performance problem?
(Maybe it's really a btrfs performance problem, not bcache. Btrfs is not
good at workloads like databases or metadata-heavy operations.)
> For some reason, it seemed that the btrfs RAID5 setup required one of
> the drives, but I thought I had data with RAID5 and metadata with 2
> copies. Was I missing something else that prevented mounting with that
> specific drive? I don't want to get into a situation where one drive
> dies and I can't get to any data.
The direct cause is that btrfs fails to replay its log, and it's the
corrupted extent tree that makes log replay fail.
Normally such a failure will definitely cause problems, so btrfs just
stops the mount procedure.
In your case, if "nologreplay" is specified, btrfs skips the problem, and
since you must specify RO for nologreplay, btrfs doesn't touch the extent
tree at all. So btrfs can be mounted.
Why the extent tree got corrupted is still unknown. If your metadata is
also RAID5, then the write hole may be the cause.
If your metadata profile is RAID1, then I don't know why this could
happen.
So from this point of view, even though we fixed the btrfs scrub/race
problems, it's still not good enough to survive a disk removal in the
real world.
With a RAID1 setup, at least we don't need to care about the write hole,
and csums will help us determine which copy is correct, so I think it
will be much better than RAID56.
If you have spare time, you could try hot-plugging RAID1 devices to
verify how it works.
But please note that re-attaching an unplugged device may require
unmounting the fs and re-scanning btrfs devices.
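Roughly (device name and mount point are just examples):

  # umount /mnt
  # btrfs device scan
  # mount /dev/sdb /mnt
  # btrfs scrub start /mnt

The scrub at the end is so that the copy on the re-attached device,
which missed writes while it was unplugged, gets repaired from the good
mirror.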
And even if you're using 3 devices with RAID1, it's still only 2 copies,
so you can lose at most 1 device.
Thanks,
Qu
>
> Thank you again.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Btrfs Raid5 issue.
@ 2017-08-22 16:37 Robert LeBlanc
2017-08-23 0:00 ` Qu Wenruo
From: Robert LeBlanc @ 2017-08-22 16:37 UTC (permalink / raw)
To: quwenruo.btrfs, lists, linux-btrfs
Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting, I just passed it in to see if it would
help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
would increase the chance of mounting the volume even if it is
degraded, but one could hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:
# btrfs fi show /tmp/root/
Label: 'kvm-btrfs' uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
Total devices 3 FS bytes used 3.30TiB
devid 1 size 2.73TiB used 2.09TiB path /dev/bcache32
devid 2 size 2.73TiB used 2.09TiB path /dev/bcache0
devid 3 size 2.73TiB used 2.09TiB path /dev/bcache16
# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
Device size: 8.18TiB
Device allocated: 0.00B
Device unallocated: 8.18TiB
Device missing: 0.00B
Used: 0.00B
Free (estimated): 0.00B (min: 8.00EiB)
Data ratio: 0.00
Metadata ratio: 0.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID5: Size:4.15TiB, Used:3.28TiB
/dev/bcache0 2.08TiB
/dev/bcache16 2.08TiB
/dev/bcache32 2.08TiB
Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
/dev/bcache0 11.00GiB
/dev/bcache16 11.00GiB
/dev/bcache32 11.00GiB
System,RAID5: Size:64.00MiB, Used:400.00KiB
/dev/bcache0 32.00MiB
/dev/bcache16 32.00MiB
/dev/bcache32 32.00MiB
Unallocated:
/dev/bcache0 655.00GiB
/dev/bcache16 655.00GiB
/dev/bcache32 656.49GiB
So it looks like I set the metadata and system data to RAID5 and not
RAID1. I guess that it could have been affected by the write hole
causing the problem I was seeing.
Since I get the same space usage with RAID1 and RAID5, I think I'm
just going to use RAID1. I don't need stripe performance or anything
like that. It would be nice if btrfs supported hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I'm familiar with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but the complexity is quite overkill for my needs at home.
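Concretely, recreating the bulk volume as RAID1 for both data and
metadata just means something like this (device names are only
examples):

  # mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd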
As far as bcache performance:
I have two Crucial MX200 250GB drives that were md raid1 containing
/boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
would be painfully slow. Running iostat, the SSDs would be doing a few
hundred IOPs and the backing disks would be very busy and would be the
limiting factor overall. Even though apt-get just downloaded the file
(should be on the SSDs because of writeback), it still involved the
backend disks way too much. The amount of dirty data was always less
than 10%, so there should have been plenty of space to free up cache
without having to flush. I experimented with changing the size of
contiguous IO to force more into the cache, increasing the dirty ratio,
etc.; nothing seemed to provide the performance I was hoping for. To be
fair, having a pair of SSDs (md RAID1) caching three spindles (btrfs
RAID5) may not be an ideal configuration. If I had three SSDs, one for
each drive, it may have performed better. I also have ~980 snapshots
spread over a year's time, so I don't know how much that impacts
things. I did use a btrfs utility to help find duplicate files/chunks
and dedupe them so that updated system binaries between upgraded LXC
containers would use the same space on disk and be more efficient in
bcache cache usage.
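The tuning mentioned above was via the bcache sysfs knobs, along these
lines (the exact values here are only illustrative):

  # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff      (cache IO of any size)
  # echo 40 > /sys/block/bcache0/bcache/writeback_percent     (allow more dirty data)
  # cat /sys/block/bcache0/bcache/cache_mode                  (confirm writeback is active)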
After restoring the root and LXC root snapshots to the SSD (I broke the
md RAID1 so I could restore to one of them), I ran apt-get and got
upwards of 2,400 IOPs, sustained around 1,200 IOPs (btrfs single on the
degraded md RAID1). I know that btrfs has some performance challenges,
but I don't think I was hitting those. It was most likely the very
unusual combination of bcache and btrfs RAID that caused the problem.
I have bcache on a 10-year-old desktop box with a single NVMe drive that
performs a little better, but it is hard to be certain because of its
age. It has bcache in write-around mode (since there is only a single
NVMe drive) and btrfs in RAID1. I haven't watched that box as closely
because it is responsive enough. It also has only 4 GB of RAM, so it
constantly has to swap (web pages are hogs these days), which was one of
the reasons to retrofit that box with NVMe rather than an MX200.
If you have any other questions, feel free to ask.
Thanks
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Btrfs Raid5 issue.
2017-08-22 16:37 Robert LeBlanc
@ 2017-08-23 0:00 ` Qu Wenruo
From: Qu Wenruo @ 2017-08-23 0:00 UTC (permalink / raw)
To: Robert LeBlanc, lists, linux-btrfs
On 2017-08-23 00:37, Robert LeBlanc wrote:
> Thanks for the explanations. Chris, I don't think 'degraded' did
> anything to help the mounting, I just passed it in to see if it would
> help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
> would increase the chance of mounting the volume even if it is
> degraded, but one could hope). I believe the key was 'nologreplay'.
> Here is some info about the corrupted fs:
>
> # btrfs fi show /tmp/root/
> Label: 'kvm-btrfs' uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
> Total devices 3 FS bytes used 3.30TiB
> devid 1 size 2.73TiB used 2.09TiB path /dev/bcache32
> devid 2 size 2.73TiB used 2.09TiB path /dev/bcache0
> devid 3 size 2.73TiB used 2.09TiB path /dev/bcache16
>
> # btrfs fi usage /tmp/root/
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
> Device size: 8.18TiB
> Device allocated: 0.00B
> Device unallocated: 8.18TiB
> Device missing: 0.00B
> Used: 0.00B
> Free (estimated): 0.00B (min: 8.00EiB)
> Data ratio: 0.00
> Metadata ratio: 0.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID5: Size:4.15TiB, Used:3.28TiB
> /dev/bcache0 2.08TiB
> /dev/bcache16 2.08TiB
> /dev/bcache32 2.08TiB
>
> Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
> /dev/bcache0 11.00GiB
> /dev/bcache16 11.00GiB
> /dev/bcache32 11.00GiB
>
> System,RAID5: Size:64.00MiB, Used:400.00KiB
> /dev/bcache0 32.00MiB
> /dev/bcache16 32.00MiB
> /dev/bcache32 32.00MiB
>
> Unallocated:
> /dev/bcache0 655.00GiB
> /dev/bcache16 655.00GiB
> /dev/bcache32 656.49GiB
>
> So it looks like I set the metadata and system data to RAID5 and not
> RAID1. I guess that it could have been affected by the write hole
> causing the problem I was seeing.
>
> Since I get the same space usage with RAID1 and RAID5,
Well, RAID1 actually has a higher space cost than 3-disk RAID5.
Space efficiency is 50% for RAID1 versus 66% for 3-disk RAID5, so you
may lose some available space.
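(Rough numbers for this array: three 2.73TiB devices give about 8.18TiB
raw. RAID1 keeps two copies of everything, so usable space is about
8.18 / 2 = 4.09TiB, while 3-disk RAID5 gives about 8.18 * 2/3 = 5.45TiB.)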
> I think I'm
> just going to use RAID1. I don't need stripe performance or anything
> like that.
And RAID5/6 won't always improve performance, especially when the IO
block size is smaller than the full stripe size (in your case 128K).
When doing sequential IO with a block size smaller than 128K, there will
be an obvious performance drop due to the read-modify-write (RMW) cycle.
This is not limited to Btrfs RAID56; it applies to all RAID56.
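(The 128K comes from the chunk geometry shown earlier in the thread:
stripe_len 64KiB times 2 data devices in a 3-disk RAID5 makes one full
stripe. A write smaller than that has to read the rest of the stripe
back in order to recompute parity, which is the RMW penalty.)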
> It would be nice if btrfs supported hotplug and re-plug a
> little better so that it is more "production" quality, but I just have
> to be patient. I'm familiar with Gluster and contributed code to Ceph,
> so I'm familiar with those types of distributed systems. I really like
> them, but the complexity is quite overkill for my needs at home.
>
> As far as bcache performance:
> I have two Crucial MX200 250GB drives that were md raid1 containing
> /boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
> Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
> would be painfully slow. Running iostat, the SSDs would be doing a few
> hundred IOPs and the backing disks would be very busy and would be the
> limiting factor overall. Even though apt-get just downloaded the file
> (should be on the SSDs because of writeback), it still involved the
> backend disks way too much. The amount of dirty data was always less
> than 10% so there should have been plenty of space to free up cache
> without having to flush. I experimented with changing the size of
> contiguous IO to force more into the cache, increasing the dirty ratio,
> etc.; nothing seemed to provide the performance I was hoping for. To be
> fair, having a pair of SSDs (md RAID1) caching three spindles (btrfs
> RAID5) may not be an ideal configuration. If I had three SSDs, one for
> each drive, it may have performed better. I also have ~980 snapshots
> spread over a year's time, so I don't know how much that impacts
> things. I did use a btrfs utility to help find duplicate files/chunks
> and dedupe them so that updated system binaries between upgraded LXC
> containers would use the same space on disk and be more efficient in
> bcache cache usage.
Well, RAID1 SSDs, offline dedupe, bcache, many snapshots: way more
complex than I thought.
So I'm uncertain where the bottleneck is.
>
> After restoring the root and LXC root snapshots to the SSD (I broke the
> md RAID1 so I could restore to one of them), I ran apt-get and got
> upwards of 2,400 IOPs, sustained around 1,200 IOPs (btrfs single on the
> degraded md RAID1). I know that btrfs has some performance challenges,
> but I don't think I was hitting those. It was most likely the very
> unusual combination of bcache and btrfs RAID that caused the problem.
> I have bcache on a 10-year-old desktop box with a single NVMe drive that
> performs a little better, but it is hard to be certain because of its
> age. It has bcache in write-around mode (since there is only a single
> NVMe drive) and btrfs in RAID1. I haven't watched that box as closely
> because it is responsive enough. It also has only 4 GB of RAM, so it
> constantly has to swap (web pages are hogs these days), which was one of
> the reasons to retrofit that box with NVMe rather than an MX200.
Good to know it works for you now.
Thanks,
Qu
>
> If you have any other questions, feel free to ask.
>
> Thanks
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>