* Re: Btrfs Raid5 issue.
@ 2017-08-22  5:19 Robert LeBlanc
  2017-08-22  5:53 ` Chris Murphy
  2017-08-22  6:40 ` Qu Wenruo
  0 siblings, 2 replies; 10+ messages in thread
From: Robert LeBlanc @ 2017-08-22  5:19 UTC (permalink / raw)
  To: lists, linux-btrfs

Chris and Qu, thanks for your help. I was able to restore the data off
the volume. The only file I could not read while rsyncing was a MySQL
bin log, but it wasn't critical as I had an off-site snapshot from that
morning and ownCloud could resync the files that had changed anyway.
This turned out much better than the md RAID failure that I had a year
ago; recovery was much faster thanks to snapshots.

Is there anything you would like from this damaged filesystem to help
determine what went wrong and to help make btrfs better? If I don't
hear back from you in a day, I'll destroy it so that I can add the
disks into the new btrfs volumes to restore redundancy.
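
For what it's worth, the rough plan for repurposing those disks is
something like the sketch below; "/mnt/bulk" is a made-up mount point
for the new RAID1 volume and I'd double-check everything before
running it:

# clear the old btrfs signature so the device isn't picked up under the old fsid
wipefs -a /dev/bcache16
# add the freed device to the new volume and rebalance so existing chunks
# gain redundancy across all devices (the "soft" filter skips chunks already converted)
btrfs device add /dev/bcache16 /mnt/bulk
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/bulk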

Bcache wasn't providing the performance I was hoping for, so I'm
putting the root and the roots for my LXC containers on the SSDs (btrfs
RAID1) and the bulk data on the three spindle drives (btrfs RAID1).
For some reason the btrfs RAID5 setup seemed to require one specific
drive in order to mount, but I thought I had data in RAID5 and metadata
with two copies. Was I missing something else that prevented the volume
from mounting without that specific drive? I don't want to get into a
situation where one drive dies and I can't get to any data.
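
For reference, here is what I thought I had set up originally, and how
I'm double-checking profiles on the new volumes; "/mnt" is just an
example mount point:

# how I believed the old volume was created: striped data, mirrored metadata
mkfs.btrfs -d raid5 -m raid1 /dev/bcache0 /dev/bcache16 /dev/bcache32
# how I'm verifying which profiles a mounted filesystem is actually using
btrfs filesystem df /mnt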

Thank you again.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

* Re: Btrfs Raid5 issue.
@ 2017-08-22 16:37 Robert LeBlanc
  2017-08-23  0:00 ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Robert LeBlanc @ 2017-08-22 16:37 UTC (permalink / raw)
  To: quwenruo.btrfs, lists, linux-btrfs

Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting; I just passed it in to see if it would
help (I'm not sure whether btrfs is "smart" enough to ignore a drive if
doing so would increase the chance of mounting the volume, even
degraded, but one can hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
        Total devices 3 FS bytes used 3.30TiB
        devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
        devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
        devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:                   8.18TiB
    Device allocated:                0.00B
    Device unallocated:            8.18TiB
    Device missing:                  0.00B
    Used:                            0.00B
    Free (estimated):                0.00B      (min: 8.00EiB)
    Data ratio:                       0.00
    Metadata ratio:                   0.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
   /dev/bcache0    2.08TiB
   /dev/bcache16           2.08TiB
   /dev/bcache32           2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
   /dev/bcache0   11.00GiB
   /dev/bcache16          11.00GiB
   /dev/bcache32          11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
   /dev/bcache0   32.00MiB
   /dev/bcache16          32.00MiB
   /dev/bcache32          32.00MiB

Unallocated:
   /dev/bcache0  655.00GiB
   /dev/bcache16         655.00GiB
   /dev/bcache32         656.49GiB

So it looks like I set the metadata and system data to RAID5 rather
than RAID1. I guess the RAID5 write hole could then have hit the
metadata as well, causing the problem I was seeing.
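
Had I caught that earlier, I believe a rebalance along these lines
would have converted the metadata and system chunks to RAID1 in place
(a sketch only; "/mnt" stands in for the real mount point and I never
ran this against this filesystem):

# convert metadata and system chunks to RAID1; --force is required because
# balance refuses to touch system chunks without it
btrfs balance start -mconvert=raid1 -sconvert=raid1 --force /mnt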

Since I get the same space usage with RAID1 and RAID5, I think I'm
just going to use RAID1; I don't need stripe performance or anything
like that. It would be nice if btrfs handled hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I've worked with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but their complexity is overkill for my needs at home.

As far as bcache performance goes:
I have two Crucial MX200 250GB drives that were in md RAID1 holding
/boot (ext2), swap, and then bcache. I have two WD Reds and a Seagate
Barracuda desktop drive, all 3TB. With bcache in writeback mode,
apt-get would be painfully slow. Watching iostat, the SSDs would be
doing a few hundred IOPS while the backing disks were very busy and
were the limiting factor overall. Even though apt-get had just
downloaded the files (so they should have been on the SSDs because of
writeback), it still involved the backing disks far too much. The
amount of dirty data was always below 10%, so there should have been
plenty of room to free up cache without having to flush. I experimented
with changing the sequential IO cutoff to force more into the cache,
increasing the dirty ratio, etc.; nothing provided the performance I
was hoping for. To be fair, having a pair of SSDs (md RAID1) caching
three spindles (btrfs RAID5) may not be an ideal configuration; with
three SSDs, one for each drive, it might have performed better. I also
have ~980 snapshots spread over a year's time, so I don't know how much
that impacts things. I did use a btrfs utility to find duplicate
files/chunks and dedupe them so that updated system binaries in
upgraded LXC containers would share the same space on disk and make
more efficient use of the bcache cache.
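
For reference, the bcache knobs I was poking at were along these lines
(sysfs paths from memory; the values were just experiments):

# let large/sequential IO into the cache instead of bypassing it
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# allow more dirty data to sit in the cache before writeback ramps up
echo 40 > /sys/block/bcache0/bcache/writeback_percent
# confirm the caching mode currently in use
cat /sys/block/bcache0/bcache/cache_mode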

After restoring the root and LXC root snapshots onto the SSDs (I broke
the md RAID1 so I could restore to one of them), I ran apt-get and saw
up to 2,400 IOPS, sustained at around 1,200 IOPS (btrfs single on the
degraded md RAID1). I know that btrfs has some performance challenges,
but I don't think I was hitting those; it was most likely the very
unusual combination of bcache and btrfs RAID that caused the problem.
I also have bcache on a 10-year-old desktop box with a single NVMe
drive that performs a little better, but it is hard to be certain
because of its age. It has bcache in writearound (since there is only
the single NVMe device) and btrfs in RAID1. I haven't watched that box
as closely because it is responsive enough. It also has only 4 GB of
RAM, so it constantly has to swap (web pages are memory hogs these
days), which is one of the reasons I retrofitted that box with NVMe
rather than an MX200.

If you have any other questions, feel free to ask.

Thanks

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

* Re: Btrfs Raid5 issue.
@ 2017-08-21 16:31 Robert LeBlanc
  2017-08-21 16:49 ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Robert LeBlanc @ 2017-08-21 16:31 UTC (permalink / raw)
  To: quwenruo.btrfs, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6481 bytes --]

Qu,

Sorry, I'm not on the list (I was for a few years about three years ago).

I looked at the backup roots like you mentioned.

# ./btrfs inspect dump-super -f /dev/bcache0
superblock: bytenr=65536, device=/dev/bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0x45302c8f [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    fef29f0a-dc4c-4cc4-b524-914e6630803c
label                   kvm-btrfs
generation              1620386
root                    5310022877184
sys_array_size          161
chunk_root_generation   1620164
root_level              1
chunk_root              4725030256640
chunk_root_level        1
log_root                2876047507456
log_root_transid        0
log_root_level          0
total_bytes             8998588280832
bytes_used              3625869234176
sectorsize              4096
nodesize                16384
leafsize (deprecated)           16384
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x1e1
                        ( MIXED_BACKREF |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          RAID56 |
                          SKINNY_METADATA )
cache_generation        1620386
uuid_tree_generation    42
dev_item.uuid           cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
dev_item.fsid           fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
dev_item.type           0
dev_item.total_bytes    2998998654976
dev_item.bytes_used     2295693574144
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          2
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
                length 67108864 owner 2 stripe_len 65536 type
SYSTEM|RAID5
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 3 sub_stripes 1
                        stripe 0 devid 1 offset 2185232384
                        dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
                        stripe 1 devid 3 offset 1195075698688
                        dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
                        stripe 2 devid 2 offset 41340108800
                        dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
backup_roots[4]:
        backup 0:
                backup_tree_root:       5309879451648   gen: 1620384    level: 1
                backup_chunk_root:      4725030256640   gen: 1620164    level: 1
                backup_extent_root:     5309910958080   gen: 1620385    level: 2
                backup_fs_root:         3658468147200   gen: 1618016    level: 1
                backup_dev_root:        5309904224256   gen: 1620384    level: 1
                backup_csum_root:       5309910532096   gen: 1620385    level: 3
                backup_total_bytes:     8998588280832
                backup_bytes_used:      3625871646720
                backup_num_devices:     3

        backup 1:
                backup_tree_root:       5309780492288   gen: 1620385    level: 1
                backup_chunk_root:      4725030256640   gen: 1620164    level: 1
                backup_extent_root:     5309659037696   gen: 1620385    level: 2
                backup_fs_root:         0       gen: 0  level: 0
                backup_dev_root:        5309872275456   gen: 1620385    level: 1
                backup_csum_root:       5309674536960   gen: 1620385    level: 3
                backup_total_bytes:     8998588280832
                backup_bytes_used:      3625869234176
                backup_num_devices:     3

        backup 2:
                backup_tree_root:       5310022877184   gen: 1620386    level: 1
                backup_chunk_root:      4725030256640   gen: 1620164    level: 1
                backup_extent_root:     2876048949248   gen: 1620387    level: 2
                backup_fs_root:         3658468147200   gen: 1618016    level: 1
                backup_dev_root:        5309872275456   gen: 1620385    level: 1
                backup_csum_root:       5310042259456   gen: 1620386    level: 3
                backup_total_bytes:     8998588280832
                backup_bytes_used:      3625869250560
                backup_num_devices:     3

        backup 3:
                backup_tree_root:       5309771448320   gen: 1620383    level: 1
                backup_chunk_root:      4725030256640   gen: 1620164    level: 1
                backup_extent_root:     5309779804160   gen: 1620384    level: 2
                backup_fs_root:         3658468147200   gen: 1618016    level: 1
                backup_dev_root:        5309848158208   gen: 1620383    level: 1
                backup_csum_root:       5309848846336   gen: 1620384    level: 3
                backup_total_bytes:     8998588280832
                backup_bytes_used:      3625871904768
                backup_num_devices:     3

I did a check on each one and the output is attached, but nothing seemed clean.

This got me thinking: maybe I can try to mount one of these backup
roots. So I went searching, found
https://btrfs.wiki.kernel.org/index.php/Mount_options, and tried
'usebackuproot', but it doesn't seem to take an argument for which
backup root to use. I'm not sure whether it decides the first one is
good and just keeps retrying it, never falling back to a backup, since
the dmesg output is the same.

I noticed on that page that there is a 'nologreplay' mount option, so
I tried it together with 'degraded'. It requires ro, but the volume
mounted and I can "see" things on the volume.

So with this nologreplay option, if I do a btrfs send of the subvolume
that I'm interested in (I don't think it was being written to at the
time of failure), would it copy (send) over the corruption as well? I
do have an older snapshot of that subvolume, and I could make a rw
snapshot of that and rsync on top if needed. In any case, I feel that
btrfs has given me more options than would otherwise have been
available. What are your or others' suggestions for moving forward?
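
The plan I have in mind, in case it clarifies the question, is roughly
the sketch below; the snapshot names, remote host, and destination
paths are placeholders:

# the recovery fs is mounted ro, so I'd send an existing read-only snapshot;
# my understanding is that send reads through the normal checksummed path,
# so a corrupted extent should error out rather than be copied over silently
btrfs send /tmp/root/.snapshots/mythtv-20170820 | ssh newhost "btrfs receive /srv/restore"
# on the receiving side, make a writable snapshot if I need to rsync newer data on top
btrfs subvolume snapshot /srv/restore/mythtv-20170820 /srv/restore/mythtv-rw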

Thanks,
Robert
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

[-- Attachment #2: btrfs_check.txt.xz --]
[-- Type: application/x-xz, Size: 4404 bytes --]

* Btrfs Raid5 issue.
@ 2017-08-21  4:33 Robert LeBlanc
  2017-08-21  6:58 ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Robert LeBlanc @ 2017-08-21  4:33 UTC (permalink / raw)
  To: linux-btrfs

I've been running btrfs in RAID5 for about a year now with bcache in
front of it. Yesterday one of my drives was acting really slow, so I
was going to move it to a different port. I guess I've gotten too
comfortable hot-plugging drives at work and didn't think twice about
what could go wrong; hey, I set it up in RAID5, so it would be fine.
Well, it wasn't...

I was aware of the write hole issue and thought the fixes had been
committed for the 4.12 branch, so I was running 4.12.5 at the time. I
have two SSDs in an md RAID1 that serves as the cache for the three
backing devices in bcache (bcache{0..2} or bcache{0,16,32} depending on
the kernel booted). I have all my critical data saved off as btrfs
snapshots on a different host, but I don't transfer my MythTV subs that
often, so I'd like to try to recover some of that if possible.

What is really interesting is that I could not boot the first time
(root is on the btrfs volume), but after another reboot the fs came up
read-only, with only one of the three disks marked read-only. I tried
rebooting again and it never mounted after that. I see messages in
dmesg like this:

[  151.201637] BTRFS info (device bcache0): disk space caching is enabled
[  151.201640] BTRFS info (device bcache0): has skinny extents
[  151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs:
wr 309, rd 319, flush 39, corrupt 0, gen 0
[  151.931764] BTRFS info (device bcache0): detected SSD devices,
enabling SSD mode
[  152.058915] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[  152.059944] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[  152.060018] BTRFS: error (device bcache0) in
__btrfs_free_extent:6989: errno=-5 IO failure
[  152.060060] BTRFS: error (device bcache0) in
btrfs_run_delayed_refs:3009: errno=-5 IO failure
[  152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
[  152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475:
errno=-5 IO failure (Failed to recover log tree)
[  152.074244] BTRFS error (device bcache0): cleaner transaction
attach returned -30
[  152.148993] BTRFS error (device bcache0): open_ctree failed

So, thinking the log was corrupted and that I could live without the
last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0` and
got a backtrace. I then ran `btrfs rescue chunk-recover /dev/bcache0`;
it spent hours scanning the three disks, tried at the end to fix the
logs (or tree, I can't remember exactly), and then I got another
backtrace.

Today I compiled 4.13-rc6 to see if some of the latest fixes would
help; no dice (the dmesg above is from 4.13-rc6). I also compiled the
latest master of btrfs-progs; no progress.

Things I've tried:
mount
mount -o degraded
mount -o degraded,ro
mount -o degraded (with each drive disconnected in turn to see if it
would mount without one of the drives)
btrfs rescue chunk-recover
btrfs rescue super-recover (all drives report the superblocks are fine)
btrfs rescue zero-log (always has a backtrace)
btrfs check

I know that bcache complicates things, but I'm hoping for two things:
1. to get what I can off the volume, and 2. to provide some information
that can help make btrfs/bcache better for the future.
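
One thing I haven't tried yet, which I understand can pull files off
an unmountable filesystem, is btrfs restore; something along these
lines (the destination path is just an example):

# dry run first to see what it thinks it can recover
btrfs restore -D -v /dev/bcache0 /mnt/recovery
# then the real run, ignoring per-file errors where possible
btrfs restore -i -v /dev/bcache0 /mnt/recovery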

Here is what `btrfs rescue zero-log` outputs:

# ./btrfs rescue zero-log /dev/bcache0
Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
btrfs unable to find ref byte nr 5310039638016 parent 0 root 2  owner 2 offset 0
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
checksum verify failed on 5309275930624 found A2FDBB6A wanted 461E06DC
parent transid verify failed on 5309275930624 wanted 1620381 found 1619462
Ignoring transid failure
bad key ordering 67 68
btrfs unable to find ref byte nr 5310039867392 parent 0 root 2  owner 1 offset 0
bad key ordering 67 68
extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1
./btrfs(+0x1c624)[0x562fde546624]
./btrfs(+0x1d91a)[0x562fde54791a]
./btrfs(+0x1da2b)[0x562fde547a2b]
./btrfs(+0x1f3a5)[0x562fde5493a5]
./btrfs(+0x1f91f)[0x562fde54991f]
./btrfs(btrfs_alloc_free_block+0xd2)[0x562fde54c20c]
./btrfs(__btrfs_cow_block+0x182)[0x562fde53c778]
./btrfs(btrfs_cow_block+0xea)[0x562fde53d0ea]
./btrfs(+0x185a3)[0x562fde5425a3]
./btrfs(btrfs_commit_transaction+0x96)[0x562fde54411c]
./btrfs(+0x6a702)[0x562fde594702]
./btrfs(handle_command_group+0x44)[0x562fde53b40c]
./btrfs(cmd_rescue+0x15)[0x562fde59486d]
./btrfs(main+0x85)[0x562fde53b5c3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd3931692b1]
./btrfs(_start+0x2a)[0x562fde53b13a]
Aborted

Please let me know if there is any other information I can provide
that would be helpful.

Thank you,

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

