* RAID50, despite chunk setting, does everything in 4KB blocks
@ 2011-12-19 22:43 Chris Worley
From: Chris Worley @ 2011-12-19 22:43 UTC
To: linux-raid
It doesn't really matter what chunk sizes I set, but, for example, I
create three RAID5's of 5 drives each with a chunk size of 32K, and
create a RAID0 comprised of the three RAID5's with a chunk size of
64K:
md0 : active raid0 md27[2] md26[1] md25[0]
1885098048 blocks super 1.2 64k chunks
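(For reference, a stack like this can be built with mdadm roughly as
follows; the drive names below are placeholders, not the actual
devices used:)

# mdadm --create /dev/md25 --level=5 --raid-devices=5 --chunk=32 /dev/sd[b-f]
# mdadm --create /dev/md26 --level=5 --raid-devices=5 --chunk=32 /dev/sd[g-k]
# mdadm --create /dev/md27 --level=5 --raid-devices=5 --chunk=32 /dev/sd[l-p]
# mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=64 /dev/md2[5-7]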
If I write to one of the RAID5's, using:
# dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
... then "iostat -dmx 2" shows the drives being written to in 32K
chunks (avgrq-sz=64), as you'd expect.
But, writing to the RAID0 that's striping the RAID5's, shows
everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
the RAID0 as well as to the RAID5's.
Why is that? Note that this is true for reading too. Note I don't
see the same problem when using RAID10 (via striped RAID1's) or
RAID100 (via striped RAID10's).
... this is on SLES11 using a 2.6.32.43-0.5 kernel.
Thanks,
Chris
* Re: RAID50, despite chunk setting, does everything in 4KB blocks
@ 2011-12-19 23:24 NeilBrown
From: NeilBrown @ 2011-12-19 23:24 UTC
To: Chris Worley; +Cc: linux-raid
On Mon, 19 Dec 2011 15:43:13 -0700 Chris Worley <worleys@gmail.com> wrote:
> It doesn't really matter what chunk sizes I set, but, for example, I
> create three RAID5's of 5 drives each with a chunk size of 32K, and
> create a RAID0 comprised of the three RAID5's with a chunk size of
> 64K:
>
> md0 : active raid0 md27[2] md26[1] md25[0]
> 1885098048 blocks super 1.2 64k chunks
>
> If I write to one of the RAID5's, using:
>
> # dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
>
> ... then "iostat -dmx 2" shows the drives being written to in 32K
> chunks (avgrq-sz=64), as you'd expect.
>
> But, writing to the RAID0 that's striping the RAID5's, shows
> everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
> the RAID0 as well as to the RAID5's.
When writing to a RAID5 it *always* submits requests to the lower layers in
PAGE-sized units. This makes it much easier to keep parity and data aligned.
The queue on the underlying device should sort the requests and group them
together and your evidence suggests that it does.
When writing to the RAID5 through a RAID0 it will only see 64K at a time, but
that shouldn't make any difference to its behaviour and shouldn't change
the way the requests finally get to the device.
So I have no idea why you see a difference.
I suspect lots of block-layer tracing, lots of staring at code, and lots
of head scratching would be needed to understand what is really going on.
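(A starting point for that sort of tracing, if anyone wants to try it,
might be something along these lines, run while the dd is in flight;
the device paths here are just illustrative:

# blktrace -d /dev/md0 -d /dev/md27 -o - | blkparse -i -

then compare the sector counts on the queue (Q) events seen by the
RAID0 and by the member RAID5.)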
>
> Why is that? Note that this is true for reading too. Note I don't
> see the same problem when using RAID10 (via striped RAID1's) or
> RAID100 (via striped RAID10's).
RAID1 and RAID10 don't split things into pages so I can imagine that they
might make life easier for the scheduler.
But the scheduler should still get it right for RAID5 ....
So - it's a mystery. Sorry.
NeilBrown
>
> ... this is on SLES11 using a 2.6.32.43-0.5 kernel.
>
> Thanks,
>
> Chris
* Re: RAID50, despite chunk setting, does everything in 4KB blocks
@ 2011-12-19 23:56 Chris Worley
From: Chris Worley @ 2011-12-19 23:56 UTC
To: NeilBrown; +Cc: linux-raid
On Mon, Dec 19, 2011 at 4:24 PM, NeilBrown <neilb@suse.de> wrote:
> On Mon, 19 Dec 2011 15:43:13 -0700 Chris Worley <worleys@gmail.com> wrote:
>
>> It doesn't really matter what chunk sizes I set, but, for example, I
>> create three RAID5's of 5 drives each with a chunk size of 32K, and
>> create a RAID0 comprised of the three RAID5's with a chunk size of
>> 64K:
>>
>> md0 : active raid0 md27[2] md26[1] md25[0]
>> 1885098048 blocks super 1.2 64k chunks
>>
>> If I write to one of the RAID5's, using:
>>
>> # dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
>>
>> ... then "iostat -dmx 2" shows the drives being written to in 32K
>> chunks (avgrq-sz=64), as you'd expect.
>>
>> But, writing to the RAID0 that's striping the RAID5's, shows
>> everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
>> the RAID0 as well as to the RAID5's.
>
> When writing to a RAID5 it *always* submits requests to the lower layers in
> PAGE-sized units. This makes it much easier to keep parity and data aligned.
>
> The queue on the underlying device should sort the requests and group them
> together and your evidence suggests that it does.
>
> When writing to the RAID5 through a RAID0 it will only see 64K at a time, but
> that shouldn't make any difference to its behaviour and shouldn't change
> the way the requests finally get to the device.
>
> So I have no idea why you see a difference.
>
> I suspect lots of block-layer tracing, lots of staring at code, and lots
> of head scratching would be needed to understand what is really going on.
Note that "max_segments" for the raid0 = 1, and max_segment_size =
4096, which tells Linux that the md can only take a single 4KB page
per IO request.
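(For kernels that expose these queue limits in sysfs, they can be read
and compared for the RAID0 and one of the member RAID5's with, e.g.:)

# cat /sys/block/md0/queue/max_segments
# cat /sys/block/md0/queue/max_segment_size
# cat /sys/block/md27/queue/max_segments
# cat /sys/block/md27/queue/max_segment_size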
The scheduler shouldn't be involved in the transaction between the
RAID0 and RAID5, as neither uses the scheduler, so it shouldn't merge
there, but it also shouldn't be fragmenting.
Not having the RAID0 send the larger chunks to the RAID5's may cause
more fragmentation than the drive's scheduler will be able to
re-merge.
Chris
>
>
>>
>> Why is that? Note that this is true for reading too. Note I don't
>> see the same problem when using RAID10 (via striped RAID1's) or
>> RAID100 (via striped RAID10's).
>
> RAID1 and RAID10 don't split things into pages so I can imagine that they
> might make life easier for the scheduler.
>
> But the scheduler should still get it right for RAID5 ....
>
>
> So - it's a mystery. Sorry.
>
> NeilBrown
>
>
>>
>> ... this is on SLES11 using a 2.6.32.43-0.5 kernel.
>>
>> Thanks,
>>
>> Chris
* Re: RAID50, despite chunk setting, does everything in 4KB blocks
@ 2011-12-20 0:08 NeilBrown
From: NeilBrown @ 2011-12-20 0:08 UTC
To: Chris Worley; +Cc: linux-raid
On Mon, 19 Dec 2011 16:56:16 -0700 Chris Worley <worleys@gmail.com> wrote:
> On Mon, Dec 19, 2011 at 4:24 PM, NeilBrown <neilb@suse.de> wrote:
> > On Mon, 19 Dec 2011 15:43:13 -0700 Chris Worley <worleys@gmail.com> wrote:
> >
> >> It doesn't really matter what chunk sizes I set, but, for example, I
> >> create three RAID5's of 5 drives each with a chunk size of 32K, and
> >> create a RAID0 comprised of the three RAID5's with a chunk size of
> >> 64K:
> >>
> >> md0 : active raid0 md27[2] md26[1] md25[0]
> >> 1885098048 blocks super 1.2 64k chunks
> >>
> >> If I write to one of the RAID5's, using:
> >>
> >> # dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
> >>
> >> ... then "iostat -dmx 2" shows the drives being written to in 32K
> >> chunks (avgrq-sz=64), as you'd expect.
> >>
> >> But, writing to the RAID0 that's striping the RAID5's, shows
> >> everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
> >> the RAID0 as well as to the RAID5's.
> >
> > When writing to a RAID5 it *always* submits requests to the lower layers in
> > PAGE-sized units. This makes it much easier to keep parity and data aligned.
> >
> > The queue on the underlying device should sort the requests and group them
> > together and your evidence suggests that it does.
> >
> > When writing to the RAID5 through a RAID0 it will only see 64K at a time, but
> > that shouldn't make any difference to its behaviour and shouldn't change
> > the way the requests finally get to the device.
> >
> > So I have no idea why you see a difference.
> >
> > I suspect lots of block-layer tracing, lots of staring at code, and lots
> > of head scratching would be needed to understand what is really going on.
>
> Note that "max_segments" for the raid0 is 1 and "max_segment_size" is
> 4096, which tells Linux that the md can only take a single 4KB page
> per IO request.
Ah, of course. RAID5 sets a merge_bvec_fn so that there is some chance that
read requests can bypass the cache.
As RAID0 doesn't honour the merge_bvec_fn (maybe it should) it sets the max
request size to 1 page.
RAID10 sets a merge_bvec_fn too so RAID0 will be sending it requests in
1-page pieces.
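(The relevant code is presumably in drivers/md/raid0.c; in a 2.6.3x
source tree something like the following should show where raid0 drops
to one-page requests when a member device has a merge_bvec_fn, though
the details differ between kernel versions:)

# grep -n -B2 -A6 merge_bvec_fn drivers/md/raid0.c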
>
> The scheduler shouldn't be involved in the transaction between the
> RAID0 and RAID5, as neither uses the scheduler, so it shouldn't merge
> there, but it also shouldn't be fragmenting.
>
> Not having the RAID0 send the larger chunks to the RAID5's may cause
> more fragmentation than the drive's scheduler will be able to
> re-merge.
How hard can it be to merge a few (thousand) requests??? :-)
NeilBrown