From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Sun, 26 Jun 2016 15:52:00 -0400
Message-ID: <20160626195200.GF14667@hungrycats.org>
In-Reply-To: <CAJCQCtSTau3D39TarW2gag0HgW6U9tj8u3gXVTBMTNRy7wMUbg@mail.gmail.com>

On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > 26.06.2016 00:52, Chris Murphy wrote:
> >> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> >> is raid4.
> >
> > That's not what the code suggests or what I see in practice: parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on a new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
> 
> 
> tl;dr: Andrei is correct; there's no raid4 behavior here.
> 
> Looks like mirror 2 is always parity, more on that below.
> 
> 
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
> 
> I do see consecutive full stripe writes, but it doesn't always happen.
> Not checking whether the writes were consecutive is where I became confused.
> 
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456128..   3456159:     32:             last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456224..   3456255:     32:             last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456320..   3456351:     32:             last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456352..   3456383:     32:             last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456384..   3456415:     32:             last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456416..   3456447:     32:             last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456448..   3456479:     32:             last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456480..   3456511:     32:             last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456096..   3456127:     32:             last,eof
> /mnt/5/ab128.txt: 1 extent found
> 
> Starting with the bottom file and then going from the top, so they're in
> 4096-byte block order; the 2nd column is the difference from the previous value:
> 
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
> 
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time.
> I don't know why they aren't always consecutive writes.
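
The difference column above can be reproduced with a few lines of Python.
This is only a sketch: the offsets are copied from the filefrag output, and
the names starts and BLOCKS_PER_FILE are made up for illustration.

```python
# Start offsets (in 4096-byte blocks) as reported by filefrag, sorted ascending.
starts = [3456096, 3456128, 3456224, 3456320, 3456352,
          3456384, 3456416, 3456448, 3456480]
BLOCKS_PER_FILE = 32  # each 128KiB file is 32 blocks of 4096 bytes

# Difference between each start and the previous one.
diffs = [b - a for a, b in zip(starts, starts[1:])]

# A difference equal to the file length means the extent begins
# exactly where the previous one ended.
contiguous = [d == BLOCKS_PER_FILE for d in diffs]

# Byte address passed to btrfs-map-logical for the first file.
first_logical = 4096 * starts[0]

print(diffs)
print(contiguous)
print(first_logical)
```

A difference of 96 blocks is a gap of two file lengths (64 blocks) between
one extent and the next.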

The logical addresses don't include parity stripes, so you won't find
them with FIEMAP.  Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
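
As a rough illustration of that ordering (translate first, pick the parity
disk afterwards), here is a sketch of a generic rotating-parity layout in
Python. It is not taken from the btrfs source: the function name map_raid5,
the rotation rule, and the use of positions within the chunk instead of
devids are all assumptions. The 64KiB element size matches the test in this
thread.

```python
STRIPE_LEN = 64 * 1024  # assumed size of one stripe element, in bytes


def map_raid5(chunk_logical_offset, num_devices):
    """Map a byte offset inside a chunk to (data_dev, parity_dev, chunk_offset).

    Illustrative rotating-parity layout only. Device numbers are
    positions within the chunk, not btrfs devids.
    """
    data_per_stripe = num_devices - 1  # one element per full stripe is parity
    full_stripe_len = STRIPE_LEN * data_per_stripe
    full_stripe_nr, within = divmod(chunk_logical_offset, full_stripe_len)
    data_index, byte_in_element = divmod(within, STRIPE_LEN)

    # Rotate the layout by one device for each full stripe.
    rot = full_stripe_nr % num_devices
    data_dev = (data_index + rot) % num_devices
    parity_dev = (data_per_stripe + rot) % num_devices

    # The same offset applies on every device; parity consumes no
    # logical address space, so it never shows up in FIEMAP output.
    chunk_offset = full_stripe_nr * STRIPE_LEN + byte_in_element
    return data_dev, parity_dev, chunk_offset


# Walk four consecutive 128KiB full stripes on a 3-device layout.
for n in range(4):
    print(map_raid5(n * 128 * 1024, 3))
```

In this sketch the logical offset advances by 128KiB per full stripe while
the per-device offset advances by only 64KiB, and the parity position moves
to a different device on each full stripe.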

> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
> mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
> mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
> mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
> mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
> mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
> mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
> mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
> mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
> mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
> mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
> mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
> 
> 
> To confirm/deny mirror 2 is parity (128KiB file is 64KiB "a", 64KiB
> "b", so expected parity is 0x03; if it's always 128KiB of the same
> value then parity is 0x00 and can result in confusion/mistakes with
> unwritten free space).
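
The expected parity values can be checked with a short Python sketch.
It assumes single parity computed as a bytewise XOR of the two 64KiB data
elements; the helper name xor_parity is made up for illustration.

```python
ELEMENT = 64 * 1024  # bytes per stripe element in this test


def xor_parity(*elements):
    """XOR equal-length byte strings together, byte by byte."""
    out = bytearray(len(elements[0]))
    for element in elements:
        for i, byte in enumerate(element):
            out[i] ^= byte
    return bytes(out)


data_a = b"a" * ELEMENT  # 0x61 repeated
data_b = b"b" * ELEMENT  # 0x62 repeated

# Parity of the "a" element and the "b" element.
parity = xor_parity(data_a, data_b)

# Parity when both elements hold the same value.
same_value_parity = xor_parity(data_a, data_a)

# A lost data element is rebuilt by XORing parity with the survivor.
rebuilt_a = xor_parity(parity, data_b)

print(hex(parity[0]), hex(same_value_parity[0]), rebuilt_a == data_a)
```

Using two different fill bytes is what makes the parity element
distinguishable from zeroed, unwritten space.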
> 
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> 
> Ok so in particular the last five, parity is on device b, c, a, b, c -
> that suggests it's distributing parity on consecutive full stripe
> writes.
> 
> Where I became confused is, there's not always a consecutive write,
> and that's what ends up causing parity to end up on one device less
> often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x
> VG/b.
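
The tally quoted above follows directly from the "mirror 2" lines in the
nine btrfs-map-logical lookups. A small Python sketch, with the device
letters transcribed by hand and the variable names chosen for illustration:

```python
from collections import Counter

# Device shown as "mirror 2" for each of the nine lookups, in order.
parity_devices = ["c", "a", "a", "a", "b", "c", "a", "b", "c"]

# How many times each device held parity.
counts = Counter(parity_devices)

# What a perfectly even spread over the observed devices would give.
expected_even = len(parity_devices) / len(counts)

print(counts.most_common())
print(expected_even)
```

With only nine samples, a 4/3/2 split is a small deviation from 3/3/3.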
> 
> Basically it's a bad test. The sample size is too small. I'd need to
> increase the sample size by a ton in order to know for sure if this is
> really a problem.
> 
> 
> > This information is not actually stored anywhere; it is computed based on
> > block group geometry and logical stripe offset.
> 
> I think you're right. A better test is a scrub or balance on a raid5
> that's exhibiting slowness, and find out if there's disk contention on
> that system, and whether it's the result of parity not being
> distributed enough.
> 
> 
> > P.S. usage of "stripe" to mean "stripe element" actually adds to
> > confusion when reading code :)
> 
> It's confusing everywhere. mdadm chunk = strip = stripe element. And
> then LVM introduces -i --stripes which means "data strips" i.e. if you
> choose -i 3 with raid6 segment type, you get 5 strips per stripe (3
> data 2 parity). It's horrible.
> 
> 
> 
> 
> -- 
> Chris Murphy
> 
