linux-btrfs.vger.kernel.org archive mirror
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Sun, 26 Jun 2016 15:52:00 -0400
Message-ID: <20160626195200.GF14667@hungrycats.org>
In-Reply-To: <CAJCQCtSTau3D39TarW2gag0HgW6U9tj8u3gXVTBMTNRy7wMUbg@mail.gmail.com>

On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > 26.06.2016 00:52, Chris Murphy wrote:
> >> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> >> is raid4.
> >
> > That's not what code suggests and what I see in practice - parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
> 
> 
> tl;dr: Andrei is correct; there's no raid4 behavior here.
> 
> It looks like mirror 2 is always parity; more on that below.
> 
> 
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
> 
> I do see consecutive full stripe writes, but it doesn't always happen;
> not checking whether the writes were actually consecutive is where I
> became confused.
> 
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456128..   3456159:     32:             last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456224..   3456255:     32:             last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456320..   3456351:     32:             last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456352..   3456383:     32:             last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456384..   3456415:     32:             last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456416..   3456447:     32:             last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456448..   3456479:     32:             last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456480..   3456511:     32:             last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456096..   3456127:     32:             last,eof
> /mnt/5/ab128.txt: 1 extent found
> 
> Starting with the bottom file, then continuing from the top, so they're
> in 4096-byte block order; the 2nd column is the difference from the
> previous value:
> 
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
> 
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time;
> I don't know why the writes aren't always consecutive.
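
For reference, on a 3-disk raid5 with the usual 64KiB stripe elements,
one full stripe holds two 64KiB data strips, so back-to-back full stripe
writes should land exactly 32 4KiB blocks apart (and a difference of 96
means two full stripes of logical space sit in between).  A quick sanity
check of the arithmetic:

	# 2 data strips of 64KiB each, expressed in 4096-byte blocks
	echo $(( 2 * 65536 / 4096 ))    # prints 32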

The logical addresses don't include parity stripes, so you won't find
them with FIEMAP.  Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
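
As a rough illustration (a sketch of the rotation idea only, with made-up
variable names, not the actual kernel code), the parity member for a full
stripe can be derived from nothing but the chunk-relative offset: divide
by the full-stripe data width to get the full stripe number, then rotate
across the members.  For a 3-disk raid5 with 64KiB stripe elements:

	# sketch: derive a parity member index for consecutive full
	# stripes; offsets here are relative to the chunk start
	stripe_len=65536; ndev=3; ndata=$(( ndev - 1 ))
	for off in 0 131072 262144 393216 524288; do
		fs=$(( off / (ndata * stripe_len) ))    # full stripe number
		rot=$(( fs % ndev ))                    # per-stripe rotation
		echo "full stripe $fs -> parity on member $(( (rot + ndata) % ndev ))"
	done

The exact rotation offset btrfs uses may differ from this sketch, but
the effect is the same: parity moves to a different member on each
successive full stripe, which matches the b, c, a, b, c pattern in the
dumps below.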

> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
> mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
> mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
> mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
> mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
> mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
> mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
> mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
> mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
> mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
> mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
> mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
> 
> 
> To confirm or deny that mirror 2 is parity: each 128KiB file is 64KiB
> of "a" followed by 64KiB of "b", so the expected parity byte is 0x03.
> (If a file were 128KiB of a single repeated value, the two data strips
> would be identical, parity would be all 0x00, and that could be
> confused with unwritten free space.)
> 
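For the record, the expected 0x03 falls straight out of the XOR: 'a' is
0x61 and 'b' is 0x62, and raid5 parity is the byte-wise XOR of the data
strips.

	# 0x61 ('a') XOR 0x62 ('b') = 0x03
	printf '%02x\n' $(( 0x61 ^ 0x62 ))
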
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> 
> OK, so for the last five in particular, parity lands on devices b, c,
> a, b, c; that suggests parity is being distributed across consecutive
> full stripe writes.
> 
> Where I became confused is that the writes aren't always consecutive,
> and that's what causes parity to land on one device less often than the
> others. In the above example, parity goes 4x to VG/a, 3x to VG/c, and
> 2x to VG/b.
> 
> Basically it's a bad test: the sample size is too small. I'd need to
> increase the sample size substantially to know for sure whether this is
> really a problem.
> 
> 
> > This
> > information is not actually stored anywhere, it is computed based on
> > block group geometry and logical stripe offset.
> 
> I think you're right. A better test is a scrub or balance on a raid5
> that's exhibiting slowness: check whether there's disk contention on
> that system, and whether it's the result of parity not being
> distributed evenly enough.
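
A rough way to watch for that kind of contention while the scrub or
balance runs (just a sketch; iostat comes from the sysstat package, and
the column of interest is %util per member device):

	# extended per-device stats every 5 seconds; one member pinned
	# near 100% while the others sit idle would point at uneven load
	# rather than general slowness
	iostat -x 5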
> 
> 
> > P.S. usage of "stripe" to mean "stripe element" actually adds to
> > confusion when reading code :)
> 
> It's confusing everywhere: mdadm chunk = strip = stripe element. Then
> LVM introduces -i/--stripes, which means "data strips", i.e. if you
> choose -i 3 with the raid6 segment type you get 5 strips per stripe
> (3 data + 2 parity). It's horrible.
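
For anyone who hasn't hit that LVM convention before, a concrete
(purely hypothetical, names and sizes made up) example of the counting:

	# "-i 3" counts only the data strips; with the raid6 type the VG
	# needs at least 5 PVs (3 data + 2 parity strips per full stripe)
	lvcreate --type raid6 -i 3 -I 64 -L 10G -n lvtest vgtest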
> 
> 
> 
> 
> -- 
> Chris Murphy
> 

Thread overview: 68+ messages
2016-06-20  3:44 Adventures in btrfs raid5 disk recovery Zygo Blaxell
2016-06-20 18:13 ` Roman Mamedov
2016-06-20 19:11   ` Zygo Blaxell
2016-06-20 19:30     ` Chris Murphy
2016-06-20 20:40       ` Zygo Blaxell
2016-06-20 21:27         ` Chris Murphy
2016-06-21  1:55           ` Zygo Blaxell
2016-06-21  3:53             ` Zygo Blaxell
2016-06-22 17:14             ` Chris Murphy
2016-06-22 20:35               ` Zygo Blaxell
2016-06-23 19:32                 ` Goffredo Baroncelli
2016-06-24  0:26                   ` Chris Murphy
2016-06-24  1:47                     ` Zygo Blaxell
2016-06-24  4:02                       ` Andrei Borzenkov
2016-06-24  8:50                         ` Hugo Mills
2016-06-24  9:52                           ` Andrei Borzenkov
2016-06-24 10:16                             ` Hugo Mills
2016-06-24 10:19                               ` Andrei Borzenkov
2016-06-24 10:59                                 ` Hugo Mills
2016-06-24 11:36                                   ` Austin S. Hemmelgarn
2016-06-24 17:40                               ` Chris Murphy
2016-06-24 18:06                                 ` Zygo Blaxell
2016-06-24 17:06                             ` Chris Murphy
2016-06-24 17:21                               ` Andrei Borzenkov
2016-06-24 17:52                                 ` Chris Murphy
2016-06-24 18:19                                   ` Austin S. Hemmelgarn
2016-06-25 16:44                                     ` Chris Murphy
2016-06-25 21:52                                       ` Chris Murphy
2016-06-26  7:54                                         ` Andrei Borzenkov
2016-06-26 15:03                                           ` Duncan
2016-06-26 19:30                                           ` Chris Murphy
2016-06-26 19:52                                             ` Zygo Blaxell [this message]
2016-06-27 11:21                                       ` Austin S. Hemmelgarn
2016-06-27 16:17                                         ` Chris Murphy
2016-06-27 20:54                                           ` Chris Murphy
2016-06-27 21:02                                           ` Henk Slager
2016-06-27 21:57                                           ` Zygo Blaxell
2016-06-27 22:30                                             ` Chris Murphy
2016-06-28  1:52                                               ` Zygo Blaxell
2016-06-28  2:39                                                 ` Chris Murphy
2016-06-28  3:17                                                   ` Zygo Blaxell
2016-06-28 11:23                                                     ` Austin S. Hemmelgarn
2016-06-28 12:05                                             ` Austin S. Hemmelgarn
2016-06-28 12:14                                               ` Steven Haigh
2016-06-28 12:25                                                 ` Austin S. Hemmelgarn
2016-06-28 16:40                                                   ` Steven Haigh
2016-06-28 18:01                                                     ` Chris Murphy
2016-06-28 18:17                                                       ` Steven Haigh
2016-07-05 23:05                                                         ` Chris Murphy
2016-07-06 11:51                                                           ` Austin S. Hemmelgarn
2016-07-06 16:43                                                             ` Chris Murphy
2016-07-06 17:18                                                               ` Austin S. Hemmelgarn
2016-07-06 18:45                                                                 ` Chris Murphy
2016-07-06 19:15                                                                   ` Austin S. Hemmelgarn
2016-07-06 21:01                                                                     ` Chris Murphy
2016-06-24 16:52                           ` Chris Murphy
2016-06-24 16:56                             ` Hugo Mills
2016-06-24 16:39                         ` Zygo Blaxell
2016-06-24  1:36                   ` Zygo Blaxell
2016-06-23 23:37               ` Chris Murphy
2016-06-24  2:07                 ` Zygo Blaxell
2016-06-24  5:20                   ` Chris Murphy
2016-06-24 10:16                     ` Andrei Borzenkov
2016-06-24 17:33                       ` Chris Murphy
2016-06-24 11:24                     ` Austin S. Hemmelgarn
2016-06-24 16:32                     ` Zygo Blaxell
2016-06-24  2:17                 ` Zygo Blaxell
2016-06-22  4:06 ` Adventures in btrfs raid5 disk recovery - update Zygo Blaxell
