linux-raid.vger.kernel.org archive mirror
* interesting MD-xfs bug
@ 2015-04-09 21:02 Joe Landman
  2015-04-09 22:18 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joe Landman @ 2015-04-09 21:02 UTC (permalink / raw)
  To: xfs, linux-raid

If I build an MD raid0 with a non power of 2 chunk size, it appears that 
I can mkfs.xfs a file system, but it doesn't show up in blkid and is not 
mountable.  Yet, using a power of 2 chunk size, this does work 
correctly.   This is kernel 3.18.9.


For example, non-power of 2 chunk:

root@unison:~# wipefs -a /dev/sdb
4 bytes were erased at offset 0x1000 (linux_raid_member)
they were: fc 4e 2b a9
root@unison:~# wipefs -a /dev/sda
4 bytes were erased at offset 0x1000 (linux_raid_member)
they were: fc 4e 2b a9
root@unison:~# mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/sd[ab]
mdadm: array /dev/md20 started.

root@unison:~# mkfs.xfs /dev/md20
log stripe unit (1179648 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md20              isize=256    agcount=50, agsize=268435296 blks
          =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=13164865984, imaxpct=5
          =                       sunit=288    swidth=576 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

root@unison:~# blkid | grep xfs
root@unison:~#


Same system, with power of 2 chunk size:

root@unison:~# mdadm -S /dev/md20
mdadm: stopped /dev/md20
root@unison:~# wipefs -a /dev/sda
4 bytes were erased at offset 0x1000 (linux_raid_member)
they were: fc 4e 2b a9
root@unison:~# wipefs -a /dev/sdb
4 bytes were erased at offset 0x1000 (linux_raid_member)
they were: fc 4e 2b a9
root@unison:~# mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1024 --auto=yes --raid-disks=2 /dev/sd[ab]
mdadm: array /dev/md20 started.
root@unison:~# mkfs.xfs /dev/md20
log stripe unit (1048576 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md20              isize=256    agcount=50, agsize=268435200 blks
          =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=13164866048, imaxpct=5
          =                       sunit=256    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@unison:~# blkid | grep xfs
/dev/md20: UUID="5e965ae7-198e-4e58-8920-a65c4b6bbe60" TYPE="xfs"

I am not sure which code base might be at "fault", or even if there is a 
"fault" (beyond simply saying "don't do non-power-of-two chunks").  If 
it's the latter, I'm happy to work on a warning message patch for mdadm 
if needed.  If it should work, then I'm happy to poke around if someone 
can give me a pointer to where the relevant code might be.



* Re: interesting MD-xfs bug
  2015-04-09 21:02 interesting MD-xfs bug Joe Landman
@ 2015-04-09 22:18 ` Dave Chinner
  2015-04-09 22:20   ` Joe Landman
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-09 22:18 UTC (permalink / raw)
  To: Joe Landman; +Cc: xfs, linux-raid

On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> If I build an MD raid0 with a non power of 2 chunk size, it appears
> that I can mkfs.xfs a file system, but it doesn't show up in blkid
> and is not mountable.  Yet, using a power of 2 chunk size, this does
> work correctly.   This is kernel 3.18.9.
> 
> 
> For example, non-power of 2 chunk:
> 
> root@unison:~# wipefs -a /dev/sdb
> 4 bytes were erased at offset 0x1000 (linux_raid_member)
> they were: fc 4e 2b a9
> root@unison:~# wipefs -a /dev/sda
> 4 bytes were erased at offset 0x1000 (linux_raid_member)
> they were: fc 4e 2b a9
> root@unison:~# mdadm --create /dev/md20 --level=0 --metadata=1.2
> --chunk=1152 --auto=yes --raid-disks=2 /dev/sd[ab]
> mdadm: array /dev/md20 started.
> 
> root@unison:~# mkfs.xfs /dev/md20
> log stripe unit (1179648 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md20              isize=256    agcount=50,
> agsize=268435296 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=13164865984, imaxpct=5
>          =                       sunit=288    swidth=576 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> root@unison:~# blkid | grep xfs

That looks more like a blkid or udev problem. try using blkid -p so
that it doesn't look up the cache but directly probes devices for
the signatures. strace might tell you a bit more, too. And if the
filesystem mounts, then it definitely isn't an XFS problem ;)
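
For reference, a cache-bypassing probe along those lines would look
something like this (the strace syscall filter is just an illustrative
choice, not something from this thread):

$ sudo blkid -p /dev/md20
$ sudo strace -e trace=open,openat,read,pread64 blkid -p /dev/md20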

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: interesting MD-xfs bug
  2015-04-09 22:18 ` Dave Chinner
@ 2015-04-09 22:20   ` Joe Landman
  2015-04-09 22:53     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joe Landman @ 2015-04-09 22:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-raid



On 04/09/2015 06:18 PM, Dave Chinner wrote:
> On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
>> If I build an MD raid0 with a non power of 2 chunk size, it appears
>> that I can mkfs.xfs a file system, but it doesn't show up in blkid
>> and is not mountable.  Yet, using a power of 2 chunk size, this does
>> work correctly.   This is kernel 3.18.9.
>>

[...]

> That looks more like a blkid or udev problem. try using blkid -p so
> that it doesn't look up the cache but directly probes devices for
> the signatures. strace might tell you a bit more, too. And if the
> filesystem mounts, then it definitely isn't an XFS problem ;)

That's the thing: it didn't mount, even when I used the device name 
directly.

Good point on stracing though.  I'll do that tomorrow and report back.   
Thanks!

Joe

>
> Cheers,
>
> Dave.

-- 
Joe Landman
e: joe.landman@gmail.com
t: @sijoe



* Re: interesting MD-xfs bug
  2015-04-09 22:20   ` Joe Landman
@ 2015-04-09 22:53     ` Dave Chinner
  2015-04-09 23:10       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-09 22:53 UTC (permalink / raw)
  To: Joe Landman; +Cc: xfs, linux-raid

On Thu, Apr 09, 2015 at 06:20:26PM -0400, Joe Landman wrote:
> 
> 
> On 04/09/2015 06:18 PM, Dave Chinner wrote:
> >On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> >>If I build an MD raid0 with a non power of 2 chunk size, it appears
> >>that I can mkfs.xfs a file system, but it doesn't show up in blkid
> >>and is not mountable.  Yet, using a power of 2 chunk size, this does
> >>work correctly.   This is kernel 3.18.9.
> >>
> 
> [...]
> 
> >That looks more like a blkid or udev problem. try using blkid -p so
> >that it doesn't look up the cache but directly probes devices for
> >the signatures. strace might tell you a bit more, too. And if the
> >filesystem mounts, then it definitely isn't an XFS problem ;)
> 
> That's the thing: it didn't mount, even when I used the device name
> directly.

Ok, that's interesting. Let me see if I can reproduce it locally. If
you don't hear otherwise, tracing would still be useful. Thanks for
the bug report, Joe.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: interesting MD-xfs bug
  2015-04-09 22:53     ` Dave Chinner
@ 2015-04-09 23:10       ` Dave Chinner
  2015-04-09 23:36         ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-09 23:10 UTC (permalink / raw)
  To: Joe Landman; +Cc: linux-raid, xfs

On Fri, Apr 10, 2015 at 08:53:22AM +1000, Dave Chinner wrote:
> On Thu, Apr 09, 2015 at 06:20:26PM -0400, Joe Landman wrote:
> > 
> > 
> > On 04/09/2015 06:18 PM, Dave Chinner wrote:
> > >On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> > >>If I build an MD raid0 with a non power of 2 chunk size, it appears
> > >>that I can mkfs.xfs a file system, but it doesn't show up in blkid
> > >>and is not mountable.  Yet, using a power of 2 chunk size, this does
> > >>work correctly.   This is kernel 3.18.9.
> > >>
> > 
> > [...]
> > 
> > >That looks more like a blkid or udev problem. try using blkid -p so
> > >that it doesn't look up the cache but directly probes devices for
> > >the signatures. strace might tell you a bit more, too. And if the
> > >filesystem mounts, then it definitely isn't an XFS problem ;)
> > 
> > That's the thing: it didn't mount, even when I used the device name
> > directly.
> 
> Ok, that's interesting. Let me see if I can reproduce it locally. If
> you don't hear otherwise, tracing would still be useful. Thanks for
> the bug report, Joe.

No luck - md doesn't allow the device to be activated on 4.0-rc7:

$ sudo mdadm --version
mdadm - v3.3.2 - 21st August 2014
$ uname -a
Linux test4 4.0.0-rc7-dgc+ #882 SMP Fri Apr 10 08:50:52 AEST 2015 x86_64 GNU/Linux
$ sudo wipefs -a /dev/vd[ab]
/dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
/dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
$ sudo mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]
mdadm: RUN_ARRAY failed: Invalid argument
       Problem may be that chunk size is not a power of 2
$ cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] 
unused devices: <none>
$

So I can't actually reproduce what you are seeing because MD doesn't
allow the device to be activated and so mdadm tears it back down.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: interesting MD-xfs bug
  2015-04-09 23:10       ` Dave Chinner
@ 2015-04-09 23:36         ` NeilBrown
  2015-04-10  1:31           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-04-09 23:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Joe Landman, linux-raid, xfs

On Fri, 10 Apr 2015 09:10:35 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Fri, Apr 10, 2015 at 08:53:22AM +1000, Dave Chinner wrote:
> > On Thu, Apr 09, 2015 at 06:20:26PM -0400, Joe Landman wrote:
> > > 
> > > 
> > > On 04/09/2015 06:18 PM, Dave Chinner wrote:
> > > >On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> > > >>If I build an MD raid0 with a non power of 2 chunk size, it appears
> > > >>that I can mkfs.xfs a file system, but it doesn't show up in blkid
> > > >>and is not mountable.  Yet, using a power of 2 chunk size, this does
> > > >>work correctly.   This is kernel 3.18.9.
> > > >>
> > > 
> > > [...]
> > > 
> > > >That looks more like a blkid or udev problem. try using blkid -p so
> > > >that it doesn't look up the cache but directly probes devices for
> > > >the signatures. strace might tell you a bit more, too. And if the
> > > >filesystem mounts, then it definitely isn't an XFS problem ;)
> > > 
> > > That's the thing: it didn't mount, even when I used the device name
> > > directly.
> > 
> > Ok, that's interesting. Let me see if I can reproduce it locally. If
> > you don't hear otherwise, tracing would still be useful. Thanks for
> > the bug report, Joe.
> 
> No luck - md doesn't allow the device to be activated on 4.0-rc7:
> 
> $ sudo mdadm --version
> mdadm - v3.3.2 - 21st August 2014
> $ uname -a
> Linux test4 4.0.0-rc7-dgc+ #882 SMP Fri Apr 10 08:50:52 AEST 2015 x86_64 GNU/Linux
> $ sudo wipefs -a /dev/vd[ab]
> /dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> /dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> $ sudo mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]

Weird.  Works for me.
Any messages in 'dmesg' ??
How big are /dev/vd[ab]??

NeilBrown


> mdadm: RUN_ARRAY failed: Invalid argument
>        Problem may be that chunk size is not a power of 2
> $ cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] 
> unused devices: <none>
> $
> 
> So I can't actually reproduce what you are seeing because MD doesn't
> allow the device to be activated and so mdadm tears it back down.
> 
> Cheers,
> 
> Dave.



* Re: interesting MD-xfs bug
  2015-04-09 23:36         ` NeilBrown
@ 2015-04-10  1:31           ` Dave Chinner
  2015-04-10  3:22             ` NeilBrown
  2015-04-10  4:43             ` Roman Mamedov
  0 siblings, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2015-04-10  1:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Landman, linux-raid, xfs

On Fri, Apr 10, 2015 at 09:36:52AM +1000, NeilBrown wrote:
> On Fri, 10 Apr 2015 09:10:35 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Fri, Apr 10, 2015 at 08:53:22AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 09, 2015 at 06:20:26PM -0400, Joe Landman wrote:
> > > > 
> > > > 
> > > > On 04/09/2015 06:18 PM, Dave Chinner wrote:
> > > > >On Thu, Apr 09, 2015 at 05:02:33PM -0400, Joe Landman wrote:
> > > > >>If I build an MD raid0 with a non power of 2 chunk size, it appears
> > > > >>that I can mkfs.xfs a file system, but it doesn't show up in blkid
> > > > >>and is not mountable.  Yet, using a power of 2 chunk size, this does
> > > > >>work correctly.   This is kernel 3.18.9.
> > > > >>
> > > > 
> > > > [...]
> > > > 
> > > > >That looks more like a blkid or udev problem. try using blkid -p so
> > > > >that it doesn't look up the cache but directly probes devices for
> > > > >the signatures. strace might tell you a bit more, too. And if the
> > > > >filesystem mounts, then it definitely isn't an XFS problem ;)
> > > > 
> > > > That's the thing: it didn't mount, even when I used the device name
> > > > directly.
> > > 
> > > Ok, that's interesting. Let me see if I can reproduce it locally. If
> > > you don't hear otherwise, tracing would still be useful. Thanks for
> > > the bug report, Joe.
> > 
> > No luck - md doesn't allow the device to be activated on 4.0-rc7:
> > 
> > $ sudo mdadm --version
> > mdadm - v3.3.2 - 21st August 2014
> > $ uname -a
> > Linux test4 4.0.0-rc7-dgc+ #882 SMP Fri Apr 10 08:50:52 AEST 2015 x86_64 GNU/Linux
> > $ sudo wipefs -a /dev/vd[ab]
> > /dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> > /dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
> > $ sudo mdadm --create /dev/md20 --level=0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]
> 
> Weird.  Works for me.
> Any messages in 'dmesg' ??
> How big are /dev/vd[ab]??

vda is 5GB, vdb is 20GB

dmesg:

[  125.131340] md: bind<vda>
[  125.134547] md: bind<vdb>
[  125.139669] md: personality for level 0 is not loaded!
[  125.141302] md: md20 stopped.
[  125.141986] md: unbind<vdb>
[  125.160100] md: export_rdev(vdb)
[  125.161751] md: unbind<vda>
[  125.180126] md: export_rdev(vda)

Oh, curious. Going from 4.0-rc4 to 4.0-rc7 and running make oldconfig
has resulted in:

# CONFIG_MD_RAID0 is not set
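
A quick way to catch that on a running kernel, assuming the usual
config locations exist, is something like:

$ grep MD_RAID0 /boot/config-$(uname -r)
$ zgrep MD_RAID0 /proc/config.gz    # needs CONFIG_IKCONFIG_PROC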

Ok, so with that fixed, it's still horribly broken.

RAID 0 on different-sized devices should result in a device that is
twice the size of the smallest device:

$ sudo mdadm --create /dev/md20 --level=raid0 --metadata=1.2 --chunk=1024 --auto=yes --raid-disks=2 /dev/vd[ab]
mdadm: array /dev/md20 started.
$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md20 : active raid0 vdb[1] vda[0]
      26206208 blocks super 1.2 1024k chunks
      
unused devices: <none>
$ grep "md\|vd[ab]" /proc/partitions 
 253        0    5242880 vda
 253       16   20971520 vdb
   9       20   26206208 md20
$

Oh, "RAID0" is not actually RAID 0 - that's the size I'd expect from
a linear mapping. Half way through writing that block device, the IO
stats change in an obvious way:

Device:         r/s     w/s    rMB/s    wMB/s
vda            0.00  144.00     0.00    48.00
vdb            0.00  145.20     0.00    48.40
md20           0.00  290.40     0.00    96.80

Device:         r/s     w/s    rMB/s    wMB/s
vda            0.00   56.40     0.00    18.80
vdb            0.00  229.20     0.00    76.40
md20           0.00  285.20     0.00    95.10

Device:         r/s     w/s    rMB/s    wMB/s
vda            0.00    0.00     0.00     0.00
vdb            0.00  290.40     0.00    96.80
md20           0.00  290.80     0.00    96.90

So it's actually a stripe for the first 10GB, then some kind of
concatenated mapping of the remainder of the single device. That's
not what I expected, but it's also clearly not the problem.
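
(That matches the sizes above: the first zone stripes 2 x 5GB across
vda and vdb for roughly the first 10GB, then the remaining ~15GB of vdb
is mapped on its own, which lands at the 26206208 1KiB blocks shown in
/proc/partitions, a little under 5242880 + 20971520 once superblock and
data-offset space and chunk rounding are taken out.)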

Anyway, change the chunk size to 1152:

sudo mdadm --stop /dev/md20
mdadm: stopped /dev/md20
$ sudo wipefs -a /dev/vd[ab]
/dev/vda: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
/dev/vdb: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
$ sudo mdadm --create /dev/md20 --level=raid0 --metadata=1.2 --chunk=1152 --auto=yes --raid-disks=2 /dev/vd[ab]
mdadm: array /dev/md20 started.
$ sudo xfs_io -fd -c "pwrite -b 4m 0 25g" /dev/md20
wrote 26831355904/26843545600 bytes at offset 0
24.989 GiB, 6398 ops; 0:00:16.00 (1.530 GiB/sec and 391.8556 ops/sec)
$

Wait, what? Neil, did you put a flux capacitor in MD? :P 

The underlying drive is only capable of 100MB/s - 25GB of sequential
direct IO does not complete in 16 seconds on such a drive. There is
also a 1GB BBWC in front of the physical drives (HW RAID1), but even
so, this write rate could only occur if every write is hitting the
BBWC. And so it is:

$ sudo xfs_io -fd -c "pwrite -b 4m 0 25g" /dev/md20 & iostat -d -m 1
...
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            4214.00         0.00      1516.99          0       1516
vdb               0.00         0.00         0.00          0          0
md20           4223.00         0.00      1520.00          0       1520

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            2986.00         0.00      1075.01          0       1075
vdb            1174.00         0.00       422.88          0        422
md20           4154.00         0.00      1496.00          0       1496

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               0.00         0.00         0.00          0          0
vdb            4376.00         0.00      1575.12          0       1575
md20           4378.00         0.00      1576.00          0       1576

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            2682.00         0.00       965.74          0        965
vdb            1650.00         0.00       594.00          0        594
md20           4334.00         0.00      1560.00          0       1560

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda            4518.00         0.00      1626.26          0       1626
vdb             138.00         0.00        49.50          0         49
md20           4656.00         0.00      1676.00          0       1676

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               0.00         0.00         0.00          0          0
vdb            4214.00         0.00      1517.48          0       1517
md20           4210.00         0.00      1516.00          0       1516
.....

Note how it is cycling from one drive to the other with about a 2s
period?

Yup, blktrace on /dev/vda shows it is, indeed, hitting the BBWC
because the block mapping is clearly broken:

253,0    4        1     0.000000000  6972  Q  WS 8192 + 1008 [xfs_io]
253,0    4        5     0.000068012  6972  Q  WS 8192 + 1008 [xfs_io]
253,0    4        9     0.000093266  6972  Q  WS 8192 + 288 [xfs_io]
253,0    4       13     0.000129722  6972  Q  WS 8193 + 1008 [xfs_io]
253,0    4       17     0.000176872  6972  Q  WS 8193 + 1008 [xfs_io]
253,0    4       21     0.000205566  6972  Q  WS 8193 + 288 [xfs_io]
253,0    4       25     0.000240846  6972  Q  WS 8194 + 1008 [xfs_io]
253,0    4       29     0.000284990  6972  Q  WS 8194 + 1008 [xfs_io]
253,0    4       33     0.000313276  6972  Q  WS 8194 + 288 [xfs_io]
253,0    4       37     0.000352330  6972  Q  WS 8195 + 1008 [xfs_io]
253,0    4       41     0.000374272  6972  Q  WS 8195 + 272 [xfs_io]
253,0    4       56     0.001215857  6972  Q  WS 8195 + 1008 [xfs_io]
253,0    4       60     0.001252697  6972  Q  WS 8195 + 16 [xfs_io]
253,0    4       64     0.001284517  6972  Q  WS 8196 + 1008 [xfs_io]
253,0    4       68     0.001326130  6972  Q  WS 8196 + 1008 [xfs_io]
253,0    4       72     0.001355050  6972  Q  WS 8196 + 288 [xfs_io]
253,0    4       76     0.001393777  6972  Q  WS 8197 + 1008 [xfs_io]
253,0    4       80     0.001439547  6972  Q  WS 8197 + 1008 [xfs_io]
253,0    4       84     0.001466097  6972  Q  WS 8197 + 288 [xfs_io]
253,0    4       88     0.001501267  6972  Q  WS 8198 + 1008 [xfs_io]
253,0    4       92     0.001545863  6972  Q  WS 8198 + 1008 [xfs_io]
253,0    4       96     0.001571500  6972  Q  WS 8198 + 288 [xfs_io]
253,0    4      100     0.001584620  6972  Q  WS 8199 + 256 [xfs_io]
253,0    4      116     0.002730034  6972  Q  WS 8199 + 1008 [xfs_io]
253,0    4      120     0.002792351  6972  Q  WS 8199 + 1008 [xfs_io]
253,0    4      124     0.002810937  6972  Q  WS 8199 + 32 [xfs_io]
253,0    4      128     0.002842047  6972  Q  WS 8200 + 1008 [xfs_io]
253,0    4      132     0.002889087  6972  Q  WS 8200 + 1008 [xfs_io]
253,0    4      136     0.002916894  6972  Q  WS 8200 + 288 [xfs_io]
253,0    4      140     0.002952334  6972  Q  WS 8201 + 1008 [xfs_io]
253,0    4      144     0.002996101  6972  Q  WS 8201 + 1008 [xfs_io]
253,0    4      148     0.003022401  6972  Q  WS 8201 + 288 [xfs_io]


Multiple IOs to the same sector, then the sector increments by 1 and
we get more IOs to the same sector offset. After about a second the
mapping shifts IO to the other block device as it slowly increments
the sector, and that's why we see that cycling behaviour.
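
For anyone wanting to repeat that kind of trace, the usual live capture
is something along the lines of:

$ sudo blktrace -d /dev/vda -o - | blkparse -i -

run alongside the write workload.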

IOWs, something is going wrong with the MD block mapping when the
RAID chunk size is not a power of 2....

Over to you, Neil....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: interesting MD-xfs bug
  2015-04-10  1:31           ` Dave Chinner
@ 2015-04-10  3:22             ` NeilBrown
  2015-04-10  6:05               ` Dave Chinner
  2015-04-10  4:43             ` Roman Mamedov
  1 sibling, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-04-10  3:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Joe Landman, linux-raid, xfs

On Fri, 10 Apr 2015 11:31:57 +1000 Dave Chinner <david@fromorbit.com> wrote:

> [...]
> 
> Multiple IOs to the same sector, then the sector increments by 1 and
> we get more IOs to the same sector offset. After about a second the
> mapping shifts IO to the other block device as it slowly increments
> the sector, and that's why we see that cycling behaviour.
> 
> IOWs, something is going wrong with the MD block mapping when the
> RAID chunk size is not a power of 2....
> 
> Over to you, Neil....

That's .... not good.  Not good at all.

This should help. It seems that non-power-of-2 chunksizes aren't widely used.

Thanks,
NeilBrown


From: NeilBrown <neilb@suse.de>
Date: Fri, 10 Apr 2015 13:19:04 +1000
Subject: [PATCH] md/raid0: fix bug with chunksize not a power of 2.

Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.

This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.

So restore that first arg before re-using the variable.

Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e074813da6c0..2cb59a641cd2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -315,7 +315,7 @@ static struct strip_zone *find_zone(struct r0conf *conf,
 
 /*
  * remaps the bio to the target device. we separate two flows.
- * power 2 flow and a general flow for the sake of perfromance
+ * power 2 flow and a general flow for the sake of performance
 */
 static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
 				sector_t sector, sector_t *sector_offset)
@@ -530,6 +530,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 			split = bio;
 		}
 
+		sector = bio->bi_iter.bi_sector;
 		zone = find_zone(mddev->private, &sector);
 		tmp_dev = map_sector(mddev, zone, sector, &sector);
 		split->bi_bdev = tmp_dev->bdev;
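
To make the sector_div() gotcha concrete, here is a small user-space
sketch of the semantics the commit message describes.  The helper below
is a stand-in written for illustration (the real sector_div() is a
kernel macro) and the numbers are arbitrary:

#include <stdio.h>

typedef unsigned long long sector_t;

/* Stand-in for the kernel's sector_div(): divides its first argument
 * in place and returns the remainder. */
static unsigned int sector_div_demo(sector_t *s, unsigned int base)
{
	unsigned int rem = (unsigned int)(*s % base);
	*s /= base;
	return rem;
}

int main(void)
{
	unsigned int chunk_sects = 2304;   /* 1152KiB chunk = 2304 512-byte sectors */
	sector_t start = 1234567;          /* stand-in for bio->bi_iter.bi_sector */
	sector_t sector = start;

	/* raid0_make_request() computes "sectors left in this chunk"... */
	unsigned int sectors = chunk_sects - sector_div_demo(&sector, chunk_sects);

	/* ...but 'sector' has now been divided down to a chunk index, so
	 * the find_zone()/map_sector() calls that followed used the wrong
	 * sector.  The one-line fix above re-reads bio->bi_iter.bi_sector
	 * before find_zone() to undo exactly this. */
	printf("sectors to end of chunk = %u, sector mangled from %llu to %llu\n",
	       sectors, (unsigned long long)start, (unsigned long long)sector);
	return 0;
}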


* Re: interesting MD-xfs bug
  2015-04-10  1:31           ` Dave Chinner
  2015-04-10  3:22             ` NeilBrown
@ 2015-04-10  4:43             ` Roman Mamedov
  1 sibling, 0 replies; 10+ messages in thread
From: Roman Mamedov @ 2015-04-10  4:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: NeilBrown, Joe Landman, linux-raid, xfs

On Fri, 10 Apr 2015 11:31:57 +1000
Dave Chinner <david@fromorbit.com> wrote:

> RAID 0 on different-sized devices should result in a device that is
> twice the size of the smallest device

> Oh, "RAID0" is not actually RAID 0 - that's the size I'd expect from
> a linear mapping.

> it's actually a stripe for the first 10GB, then some kind of
> concatenated mapping of the remainder of the single device.

It might not be what you expected, but it's also not a bug of any kind, just
the regular behavior of mdadm RAID0 with different-sized devices (man md):

       If devices in the array are not all the same size, then once the
       smallest device has been exhausted, the RAID0 driver starts collecting
       chunks into smaller stripes that only span the drives which still have
       remaining space.

Once or twice this came in VERY handy for me in real-life usage.

-- 
With respect,
Roman


* Re: interesting MD-xfs bug
  2015-04-10  3:22             ` NeilBrown
@ 2015-04-10  6:05               ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2015-04-10  6:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Landman, linux-raid, xfs

On Fri, Apr 10, 2015 at 01:22:53PM +1000, NeilBrown wrote:
> [...]
> 
> That's .... not good.  Not good at all.
> 
> This should help. It seems that non-power-of-2 chunksizes aren't widely used.

I haven't tested the patch, but if you want to make sure that you
get regular smoke testing on this sort of config, write a simple
test for xfstests and throw it in the generic group. E.g. create
multiple loop devices, then iterate over various MD configurations,
running a basic data integrity test on them: mkfs, mount, write a
20MB patterned file, fsync, unmount, mount, md5sum it, unlink,
unmount, check the filesystem.
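
A rough sketch of the core of such a test, assuming two loop devices
and a scratch mount point (this is not an existing xfstests test, and
the chunk list is just illustrative):

for chunk in 64 1024 1152 1536; do
	# build the array with this chunk size and put a fresh XFS on it
	mdadm --create /dev/md20 --run --level=0 --metadata=1.2 \
		--chunk=$chunk --auto=yes --raid-disks=2 /dev/loop0 /dev/loop1
	mkfs.xfs -f /dev/md20 > /dev/null
	# write a patterned file, remount, and verify it came back intact
	mount /dev/md20 /mnt/scratch
	xfs_io -f -c "pwrite -S 0xab 0 20m" -c fsync /mnt/scratch/testfile
	sum_before=$(md5sum < /mnt/scratch/testfile)
	umount /mnt/scratch
	mount /dev/md20 /mnt/scratch
	sum_after=$(md5sum < /mnt/scratch/testfile)
	[ "$sum_before" = "$sum_after" ] || echo "FAIL: data, chunk=$chunk"
	rm -f /mnt/scratch/testfile
	umount /mnt/scratch
	# and check the filesystem before tearing the array down
	xfs_repair -n /dev/md20 > /dev/null || echo "FAIL: fs check, chunk=$chunk"
	mdadm --stop /dev/md20
done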

Something like that will get run all the time by FS developers and
QE departments, so it's a good way of smoke testing configurations
that don't usually get tested without even having to think about
it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 10+ messages
2015-04-09 21:02 interesting MD-xfs bug Joe Landman
2015-04-09 22:18 ` Dave Chinner
2015-04-09 22:20   ` Joe Landman
2015-04-09 22:53     ` Dave Chinner
2015-04-09 23:10       ` Dave Chinner
2015-04-09 23:36         ` NeilBrown
2015-04-10  1:31           ` Dave Chinner
2015-04-10  3:22             ` NeilBrown
2015-04-10  6:05               ` Dave Chinner
2015-04-10  4:43             ` Roman Mamedov
