From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
Hugo Mills <hugo@carfax.org.uk>,
kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Sun, 26 Jun 2016 15:52:00 -0400
Message-ID: <20160626195200.GF14667@hungrycats.org>
In-Reply-To: <CAJCQCtSTau3D39TarW2gag0HgW6U9tj8u3gXVTBMTNRy7wMUbg@mail.gmail.com>
On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > 26.06.2016 00:52, Chris Murphy пишет:
> >> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> >> is raid4.
> >
> > That's not what code suggests and what I see in practice - parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
>
>
> tl;dr Andrei is correct there's no raid4 behavior here.
>
> Looks like mirror 2 is always parity, more on that below.
>
>
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
>
> I do see consecutive full stripe writes, but it doesn't always happen.
> But not checking whether the writes were consecutive is where I
> became confused.
>
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456128.. 3456159: 32: last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456224.. 3456255: 32: last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456320.. 3456351: 32: last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456352.. 3456383: 32: last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456384.. 3456415: 32: last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456416.. 3456447: 32: last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456448.. 3456479: 32: last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456480.. 3456511: 32: last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 31: 3456096.. 3456127: 32: last,eof
> /mnt/5/ab128.txt: 1 extent found
>
> Taking the bottom file first, then the rest from the top, so they're
> in 4096-byte block order; the 2nd column is the difference from the
> previous value:
>
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
>
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time.
> I don't know why they aren't always consecutive writes.
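The offset arithmetic above can be reproduced with a short script (illustrative only; the offsets are copied straight from the filefrag output quoted above):

```python
# Physical offsets (in 4096-byte blocks) from the filefrag output above,
# sorted into write order (ab128.txt first, then ab128_2 .. ab128_9).
offsets = [3456096, 3456128, 3456224, 3456320,
           3456352, 3456384, 3456416, 3456448, 3456480]

# A 128KiB extent is 32 blocks, so a gap of exactly 32 blocks means the
# next full-stripe write landed immediately after the previous one.
diffs = [b - a for a, b in zip(offsets, offsets[1:])]
print(diffs)  # [32, 96, 96, 32, 32, 32, 32, 32]

consecutive = [d == 32 for d in diffs]
```

This reproduces the column above: the first gap is 32 (consecutive), the next two are 96 (not consecutive), and the last five are 32.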
The logical addresses don't include parity stripes, so you won't find
them with FIEMAP. Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
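The rotation can be sketched with a simplified model. This is only an illustration of the idea, not the actual btrfs mapping code; real btrfs derives the parity position from block group geometry and the stripe's logical offset, and the rotation direction here is an assumption:

```python
# Simplified model of rotating parity placement in a 3-disk raid5 chunk.
# NOT the kernel's actual formula -- just a sketch of how parity can be
# computed from the stripe number alone, with nothing stored on disk.
NUM_DISKS = 3
STRIPE_LEN = 64 * 1024  # 64KiB stripe element per disk

def parity_disk(stripe_nr: int) -> int:
    """Parity moves to a different disk on each successive full stripe."""
    return (NUM_DISKS - 1 - stripe_nr) % NUM_DISKS

# Consecutive full-stripe writes rotate parity across all disks:
rotation = [parity_disk(n) for n in range(6)]
print(rotation)  # [2, 1, 0, 2, 1, 0]
```

The point is that the mapping is a pure function of the stripe's position, which is why no parity location needs to be recorded anywhere and why FIEMAP never sees it.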
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
> mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
> mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
> mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
> mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
> mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
> mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
> mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
> mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
> mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
> mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
> mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
>
>
> To confirm or deny that mirror 2 is parity: the 128KiB file is 64KiB
> of "a" and 64KiB of "b", so the expected parity is 0x03 ('a' XOR 'b').
> (If the file were 128KiB of a single repeated value, the parity would
> be 0x00, which could be confused with unwritten free space.)
>
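The expected parity byte checks out: raid5 parity is the bytewise XOR of the data stripe elements, and 'a' XOR 'b' is 0x61 ^ 0x62 = 0x03, matching the hexdumps below. A minimal sketch:

```python
# raid5 parity is the bytewise XOR of the data stripe elements.
# A 128KiB file of 64KiB 'a' then 64KiB 'b' therefore has 64KiB of
# parity bytes equal to ord('a') ^ ord('b') == 0x61 ^ 0x62 == 0x03.
data_a = b"a" * 65536  # first 64KiB stripe element
data_b = b"b" * 65536  # second 64KiB stripe element

parity = bytes(x ^ y for x, y in zip(data_a, data_b))
print(hex(parity[0]))  # 0x3
```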
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696
> 2>/dev/null | hexdump -C
> 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
> *
> 00010000
>
> Ok so in particular the last five, parity is on device b, c, a, b, c -
> that suggests it's distributing parity on consecutive full stripe
> writes.
>
> Where I became confused is that there isn't always a consecutive
> write, and that's what causes parity to land on one device less often
> than the others. In the above example, parity goes 4x VG/a, 3x VG/c,
> and 2x VG/b.
>
> Basically it's a bad test. The sample size is too small. I'd need to
> increase the sample size by a ton in order to know for sure if this is
> really a problem.
>
>
> >This
> > information is not actually stored anywhere, it is computed based on
> > block group geometry and logical stripe offset.
>
> I think you're right. A better test is a scrub or balance on a raid5
> that's exhibiting slowness, and find out if there's disk contention on
> that system, and whether it's the result of parity not being
> distributed enough.
>
>
> > P.S. usage of "stripe" to mean "stripe element" actually adds to
> > confusion when reading code :)
>
> It's confusing everywhere. mdadm chunk = strip = stripe element. And
> then LVM introduces -i/--stripes, which means "data strips", i.e. if
> you choose -i 3 with the raid6 segment type, you get 5 strips per
> stripe (3 data, 2 parity). It's horrible.
>
>
>
>
> --
> Chris Murphy
>