From: NeilBrown <neilb@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Joe Landman <joe.landman@gmail.com>,
linux-raid <linux-raid@vger.kernel.org>, xfs <xfs@oss.sgi.com>
Subject: Re: interesting MD-xfs bug
Date: Fri, 10 Apr 2015 13:22:53 +1000
Message-ID: <20150410132253.644e3660@notabene.brown>
In-Reply-To: <20150410013156.GH15810@dastard>
[-- Attachment #1: Type: text/plain, Size: 11947 bytes --]
On Fri, 10 Apr 2015 11:31:57 +1000 Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Apr 10, 2015 at 09:36:52AM +1000, NeilBrown wrote:
> > On Fri, 10 Apr 2015 09:10:35 +1000 Dave Chinner <david@fromorbit.com> wrote:
> >
> > > On Fri, Apr 10, 2015 at 08:53:22AM +1000, Dave Chinner wrote:
> > > > On Thu, Apr 09, 2015 at 06:20:26PM -0400, Joe Landman wrote:
> > > > >
> > > > >
> > > > > On 04/09/2015 06:18 PM, Dave Chinner wrote:
> > > > > >On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> > > > > >>If I build an MD raid0 with a non-power-of-2 chunk size, it appears
> > > > > >>that I can mkfs.xfs a file system, but it doesn't show up in blkid
> > > > > >>and is not mountable. Yet, using a power of 2 chunk size, this does
> > > > > >>work correctly. This is kernel 3.18.9.
> > > > > >>
> > > > >
> > > > > [...]
> > > > >
> > > > > >That looks more like a blkid or udev problem. Try using blkid -p so
> > > > > >that it doesn't look up the cache but directly probes devices for
> > > > > >the signatures. strace might tell you a bit more, too. And if the
> > > > > >filesystem mounts, then it definitely isn't an XFS problem ;)
> > > > >
> > > > > That's the thing, it didn't mount, even when I used the device name
> > > > > directly.
> > > >
> > > > Ok, that's interesting. Let me see if I can reproduce it locally. If
> > > > you don't hear otherwise, tracing would still be useful. Thanks for
> > > > the bug report, Joe.
> > >
> > > No luck - md doesn't allow the device to be activated on 4.0-rc7:
> > >
> > > $ sudo mdadm --version
> > > mdadm - v3.3.2 - 21st August 2014
> > > $ uname -a
> > > Linux test4 4.0.0-rc7-dgc+ #882 SMP Fri Apr 10 08:50:52 AEST 2015 x86_64 GNU/Linux
> > > $ sudo wipefs -a /dev/vd[ab]
> > > /dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> > > /dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> > > $ sudo mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]
> >
> > Weird. Works for me.
> > Any messages in 'dmesg' ??
> > How big are /dev/vd[ab]??
>
> vda is 5GB, vdb is 20GB
>
> dmesg:
>
> [ 125.131340] md: bind<vda>
> [ 125.134547] md: bind<vdb>
> [ 125.139669] md: personality for level 0 is not loaded!
> [ 125.141302] md: md20 stopped.
> [ 125.141986] md: unbind<vdb>
> [ 125.160100] md: export_rdev(vdb)
> [ 125.161751] md: unbind<vda>
> [ 125.180126] md: export_rdev(vda)
>
> Oh, curious. Going from 4.0-rc4 to 4.0-rc7 and running make oldconfig
> has resulted in:
>
> # CONFIG_MD_RAID0 is not set
>
> Ok, so with that fixed, it's still horribly broken.
>
> RAID 0 on different-sized devices should result in a device that is
> twice the size of the smaller device:
>
> $ sudo mdadm --create /dev/md20 --level=raid0 --metadata=1.2 --chunk=1024 --auto=yes --raid-disks=2 /dev/vd[ab]
> mdadm: array /dev/md20 started.
> $ cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md20 : active raid0 vdb[1] vda[0]
> 26206208 blocks super 1.2 1024k chunks
>
> unused devices: <none>
> $ grep "md\|vd[ab]" /proc/partitions
> 253 0 5242880 vda
> 253 16 20971520 vdb
> 9 20 26206208 md20
> $
>
> Oh, "RAID0" is not actually RAID 0 - that's the size I'd expect from
> a linear mapping. Halfway through writing that block device, the IO
> stats change in an obvious way:
>
> Device: r/s w/s rMB/s wMB/s
> vda 0.00 144.00 0.00 48.00
> vdb 0.00 145.20 0.00 48.40
> md20 0.00 290.40 0.00 96.80
>
> Device: r/s w/s rMB/s wMB/s
> vda 0.00 56.40 0.00 18.80
> vdb 0.00 229.20 0.00 76.40
> md20 0.00 285.20 0.00 95.10
>
> Device: r/s w/s rMB/s wMB/s
> vda 0.00 0.00 0.00 0.00
> vdb 0.00 290.40 0.00 96.80
> md20 0.00 290.80 0.00 96.90
>
> So it's actually a stripe for the first 10GB, then some kind of
> concatenated mapping of the remainder of the single device. That's
> not what I expected, but it's also clearly not the problem.
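
That size is actually how md's raid0 lays out unequal members: it builds
"zones", striping across both devices until the smaller one is exhausted
and then mapping the rest linearly onto the larger one, which is the
switch-over visible in the iostat output above. Here is a rough userspace
sketch of the size arithmetic, using the member sizes from
/proc/partitions above; raid0_two_zone_size() is a hypothetical helper
rather than md code, and the v1.2 superblock/data-offset reservation is
ignored, so the result comes out slightly above the 26206208 blocks the
kernel reports:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical helper (not md code): usable size of a two-member raid0,
 * with each zone truncated to a whole number of chunks. */
static sector_t raid0_two_zone_size(sector_t small, sector_t large,
				    unsigned int chunk_sects)
{
	sector_t zone0 = 2 * (small - small % chunk_sects);	/* striped zone */
	sector_t rest = large - small;
	sector_t zone1 = rest - rest % chunk_sects;		/* linear tail */

	return zone0 + zone1;
}

int main(void)
{
	/* 512-byte sectors: vda = 5GB, vdb = 20GB, 1024KiB chunks */
	sector_t vda = 5242880ULL * 2;
	sector_t vdb = 20971520ULL * 2;
	unsigned int chunk_sects = 1024 * 2;
	sector_t total = raid0_two_zone_size(vda, vdb, chunk_sects);

	/* prints 26214400 KiB; the reported 26206208 is this minus the
	 * per-device superblock/data-offset reservation */
	printf("expected md20 size: %llu KiB\n",
	       (unsigned long long)(total / 2));
	return 0;
}
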
>
> Anyway, change the stripe size to 1152:
>
> sudo mdadm --stop /dev/md20
> mdadm: stopped /dev/md20
> $ sudo wipefs -a /dev/vd[ab]
> /dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> /dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> $ sudo mdadm --create /dev/md20 --level=raid0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]
> mdadm: array /dev/md20 started.
> $ sudo xfs_io -fd -c "pwrite -b 4m 0 25g" /dev/md20
> wrote 26831355904/26843545600 bytes at offset 0
> 24.989 GiB, 6398 ops; 0:00:16.00 (1.530 GiB/sec and 391.8556 ops/sec)
> $
>
> Wait, what? Neil, did you put a flux capacitor in MD? :P
>
> The underlying drive is only capable of 100MB/s - 25GB of sequential
> direct IO does not complete in 16 seconds on such a drive. There's
> also a 1GB BBWC in front of the physical drives (HW RAID1), but even
> so, this write rate could only occur if every write is
> hitting the BBWC. And so it is:
>
> $ sudo xfs_io -fd -c "pwrite -b 4m 0 25g" /dev/md20 & iostat -d -m 1
> ...
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 4214.00 0.00 1516.99 0 1516
> vdb 0.00 0.00 0.00 0 0
> md20 4223.00 0.00 1520.00 0 1520
>
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 2986.00 0.00 1075.01 0 1075
> vdb 1174.00 0.00 422.88 0 422
> md20 4154.00 0.00 1496.00 0 1496
>
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 0.00 0.00 0.00 0 0
> vdb 4376.00 0.00 1575.12 0 1575
> md20 4378.00 0.00 1576.00 0 1576
>
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 2682.00 0.00 965.74 0 965
> vdb 1650.00 0.00 594.00 0 594
> md20 4334.00 0.00 1560.00 0 1560
>
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 4518.00 0.00 1626.26 0 1626
> vdb 138.00 0.00 49.50 0 49
> md20 4656.00 0.00 1676.00 0 1676
>
> Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
> vda 0.00 0.00 0.00 0 0
> vdb 4214.00 0.00 1517.48 0 1517
> md20 4210.00 0.00 1516.00 0 1516
> .....
>
> Note how it is cycling from one drive to the other with about a 2s
> period?
>
> Yup, blktrace on /dev/vda shows it is, indeed, hitting the BBWC
> because the block mapping is clearly broken:
>
> 253,0 4 1 0.000000000 6972 Q WS 8192 + 1008 [xfs_io]
> 253,0 4 5 0.000068012 6972 Q WS 8192 + 1008 [xfs_io]
> 253,0 4 9 0.000093266 6972 Q WS 8192 + 288 [xfs_io]
> 253,0 4 13 0.000129722 6972 Q WS 8193 + 1008 [xfs_io]
> 253,0 4 17 0.000176872 6972 Q WS 8193 + 1008 [xfs_io]
> 253,0 4 21 0.000205566 6972 Q WS 8193 + 288 [xfs_io]
> 253,0 4 25 0.000240846 6972 Q WS 8194 + 1008 [xfs_io]
> 253,0 4 29 0.000284990 6972 Q WS 8194 + 1008 [xfs_io]
> 253,0 4 33 0.000313276 6972 Q WS 8194 + 288 [xfs_io]
> 253,0 4 37 0.000352330 6972 Q WS 8195 + 1008 [xfs_io]
> 253,0 4 41 0.000374272 6972 Q WS 8195 + 272 [xfs_io]
> 253,0 4 56 0.001215857 6972 Q WS 8195 + 1008 [xfs_io]
> 253,0 4 60 0.001252697 6972 Q WS 8195 + 16 [xfs_io]
> 253,0 4 64 0.001284517 6972 Q WS 8196 + 1008 [xfs_io]
> 253,0 4 68 0.001326130 6972 Q WS 8196 + 1008 [xfs_io]
> 253,0 4 72 0.001355050 6972 Q WS 8196 + 288 [xfs_io]
> 253,0 4 76 0.001393777 6972 Q WS 8197 + 1008 [xfs_io]
> 253,0 4 80 0.001439547 6972 Q WS 8197 + 1008 [xfs_io]
> 253,0 4 84 0.001466097 6972 Q WS 8197 + 288 [xfs_io]
> 253,0 4 88 0.001501267 6972 Q WS 8198 + 1008 [xfs_io]
> 253,0 4 92 0.001545863 6972 Q WS 8198 + 1008 [xfs_io]
> 253,0 4 96 0.001571500 6972 Q WS 8198 + 288 [xfs_io]
> 253,0 4 100 0.001584620 6972 Q WS 8199 + 256 [xfs_io]
> 253,0 4 116 0.002730034 6972 Q WS 8199 + 1008 [xfs_io]
> 253,0 4 120 0.002792351 6972 Q WS 8199 + 1008 [xfs_io]
> 253,0 4 124 0.002810937 6972 Q WS 8199 + 32 [xfs_io]
> 253,0 4 128 0.002842047 6972 Q WS 8200 + 1008 [xfs_io]
> 253,0 4 132 0.002889087 6972 Q WS 8200 + 1008 [xfs_io]
> 253,0 4 136 0.002916894 6972 Q WS 8200 + 288 [xfs_io]
> 253,0 4 140 0.002952334 6972 Q WS 8201 + 1008 [xfs_io]
> 253,0 4 144 0.002996101 6972 Q WS 8201 + 1008 [xfs_io]
> 253,0 4 148 0.003022401 6972 Q WS 8201 + 288 [xfs_io]
>
>
> Multiple IOs to the same sector, then the sector increments by 1 and
> we get more IOs to the same sector offset. After about a second the
> mapping shifts IO to the other block device as it slowly increments
> the sector, and that's why we see that cycling behaviour.
>
> IOWs, something is going wrong with the MD block mapping when the
> RAID chunk size is not a power of 2....
>
> Over to you, Neil....
That's .... not good. Not good at all.
This should help. It seems that non-power-of-2 chunk sizes aren't widely used.
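
To see why the mapping goes wrong, here is a minimal userspace sketch of
the non-power-of-2 path; it is not the actual raid0.c code, and
sector_div_sketch() merely stands in for the kernel's sector_div(), which
divides its first argument in place and returns the remainder. Because
that divided value was then handed straight to find_zone()/map_sector(),
every bio within a given chunk was mapped using the same tiny quotient,
which only bumps by one when the next chunk starts - exactly the pattern
in the blktrace above.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Stand-in for the kernel's sector_div(): divide *s in place and
 * return the remainder. */
static unsigned int sector_div_sketch(sector_t *s, unsigned int div)
{
	unsigned int rem = (unsigned int)(*s % div);

	*s /= div;
	return rem;
}

int main(void)
{
	unsigned int chunk_sects = 1152 * 2;	/* 1152KiB chunk in 512B sectors */
	sector_t bi_sector = 10000000;		/* arbitrary bio start sector */
	sector_t sector = bi_sector;

	/* length of the split, up to the end of the current chunk */
	unsigned int split_sects = chunk_sects -
		sector_div_sketch(&sector, chunk_sects);

	/* Buggy flow: 'sector' now holds the quotient, yet it was passed to
	 * find_zone()/map_sector() as if it were still the bio's start
	 * sector, so the write lands near the start of the array. */
	printf("split %u sectors, real start %llu, start used for mapping %llu\n",
	       split_sects, (unsigned long long)bi_sector,
	       (unsigned long long)sector);

	/* The fix: re-derive the start sector before mapping, which is what
	 * the added "sector = bio->bi_iter.bi_sector;" line below does. */
	sector = bi_sector;
	printf("after the fix, find_zone() sees %llu\n",
	       (unsigned long long)sector);
	return 0;
}
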
Thanks,
NeilBrown
From: NeilBrown <neilb@suse.de>
Date: Fri, 10 Apr 2015 13:19:04 +1000
Subject: [PATCH] md/raid0: fix bug with chunksize not a power of 2.
Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1
in v3.14-rc1, RAID0 has performed incorrect calculations
when the chunk size is not a power of 2.
This happens because "sector_div()" modifies its first argument, but
that commit didn't take this into account.
So restore the first argument before re-using the variable.
Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e074813da6c0..2cb59a641cd2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -315,7 +315,7 @@ static struct strip_zone *find_zone(struct r0conf *conf,
 
 /*
  * remaps the bio to the target device. we separate two flows.
- * power 2 flow and a general flow for the sake of perfromance
+ * power 2 flow and a general flow for the sake of performance
 */
 static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
 				sector_t sector, sector_t *sector_offset)
@@ -530,6 +530,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 			split = bio;
 		}
 
+		sector = bio->bi_iter.bi_sector;
 		zone = find_zone(mddev->private, &sector);
 		tmp_dev = map_sector(mddev, zone, sector, &sector);
 		split->bi_bdev = tmp_dev->bdev;
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]