* Re: Bigger stripe size
       [not found] <12EF8D94C6F8734FB2FF37B9FBEDD1735863D351@EXCHANGE.collogia.de>
@ 2014-08-14  4:11 ` NeilBrown
  2014-08-14  6:33   ` AW: " Markus Stockhausen
  0 siblings, 1 reply; 3+ messages in thread

From: NeilBrown @ 2014-08-14  4:11 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2691 bytes --]

On Wed, 13 Aug 2014 07:21:20 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> Hello you two,
>
> I saw Shaohua's patches for making the stripe size in raid4/5/6 configurable.
> If I got it right, Neil likes the idea but does not agree with the kind of
> implementation.
>
> The patch is quite big and intrusive, so I guess that any other design will
> have the same complexity. Neil's idea about linking stripe headers sounds
> reasonable, but it will make it necessary to "look at the linked neighbours"
> for some operations, whatever "look" means programmatically. So I would like
> to hear your feedback about the following design.
>
> Will it make sense to work with per-stripe sizes? E.g.
>
> User reads/writes 4K -> Work on a 4K stripe.
> User reads/writes 16K -> Work on a 16K stripe.
>
> Difficulties.
>
> - avoid overlapping of "small" and "big" stripes
> - split the stripe cache into different sizes
> - Can we allocate multi-page memory to have contiguous work areas?
> - ...
>
> Benefits.
>
> - Stripe handling unchanged.
> - Parity calculation more efficient.
> - ...
>
> Other ideas?

I fear that we are chasing the wrong problem.

The scheduling of stripe handling is currently very poor.  If you do a large
sequential write which should map to multiple full-stripe writes, you still
get a lot of reads.  This is bad.
The reason is that limited information is available to the raid5 driver
concerning what is coming next, and it often guesses wrongly.

I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
A first step would be to "watch" exactly what happens in terms of the way
that requests come down, the timing of 'unplug' events, and the actual
handling of stripes.  'blktrace' could provide most or all of the raw data.

Then determine what the trace "should" look like and come up with a way for
raid5 to figure that out and do it.
I suspect that might involve a more "clever" queuing algorithm, possibly
keeping all the stripe_heads sorted, possibly storing them in an RB-tree.

Once you have that queuing in place, so that the pattern of write requests
submitted to the drives makes sense, it is time to analyse CPU efficiency
and find out where double-handling is happening, or where "batching" or
re-ordering of operations can make a difference.
If the queuing algorithm collects contiguous sequences of stripe_heads
together, then processing a batch of them in succession may provide the same
improvements as processing fewer, larger stripe_heads.

So: first step is to get the IO patterns optimal.  Then look for ways to
optimise for CPU time.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread
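To make the batching idea in the message above concrete, here is a minimal
userspace C sketch (not the md code, and not part of the thread): it keeps a
set of hypothetical stripe_heads sorted by start sector -- the sort standing
in for the RB-tree Neil mentions -- and handles each contiguous run as one
batch. The struct, function names and sector numbers are invented purely for
illustration.

#include <stdio.h>
#include <stdlib.h>

#define STRIPE_SECTORS 8	/* one 4K stripe_head = 8 x 512-byte sectors */

struct stripe_head {
	unsigned long long sector;	/* start sector of this 4K strip */
};

static int cmp_sector(const void *a, const void *b)
{
	const struct stripe_head *x = a, *y = b;

	return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void)
{
	/* Pretend these arrived from the filesystem in a jumbled order. */
	struct stripe_head q[] = {
		{ 16 }, { 0 }, { 8 }, { 24 }, { 48 }, { 40 }, { 56 },
	};
	size_t n = sizeof(q) / sizeof(q[0]);
	size_t i = 0;

	/* qsort() stands in for a queue kept sorted by sector. */
	qsort(q, n, sizeof(q[0]), cmp_sector);

	/* Walk the sorted queue; process each contiguous run as one batch. */
	while (i < n) {
		size_t j = i + 1;

		while (j < n && q[j].sector == q[j - 1].sector + STRIPE_SECTORS)
			j++;
		printf("batch of %zu stripe_heads: sectors %llu-%llu\n",
		       j - i, q[i].sector,
		       q[j - 1].sector + STRIPE_SECTORS - 1);
		i = j;
	}
	return 0;
}

With the sample input this prints two batches (sectors 0-31 and 40-63),
which is the kind of grouping that could replace many single-stripe_head
handling passes with one pass per contiguous run.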
* AW: Bigger stripe size
  2014-08-14  4:11 ` Bigger stripe size NeilBrown
@ 2014-08-14  6:33   ` Markus Stockhausen
  2014-08-14  7:17     ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread

From: Markus Stockhausen @ 2014-08-14  6:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2646 bytes --]

> From: NeilBrown [neilb@suse.de]
> Sent: Thursday, 14 August 2014 06:11
> To: Markus Stockhausen
> Cc: shli@kernel.org; linux-raid@vger.kernel.org
> Subject: Re: Bigger stripe size
> ...
> >
> > Will it make sense to work with per-stripe sizes? E.g.
> >
> > User reads/writes 4K -> Work on a 4K stripe.
> > User reads/writes 16K -> Work on a 16K stripe.
> >
> > Difficulties.
> >
> > - avoid overlapping of "small" and "big" stripes
> > - split the stripe cache into different sizes
> > - Can we allocate multi-page memory to have contiguous work areas?
> > - ...
> >
> > Benefits.
> >
> > - Stripe handling unchanged.
> > - Parity calculation more efficient.
> > - ...
> >
> > Other ideas?
>
> I fear that we are chasing the wrong problem.
>
> The scheduling of stripe handling is currently very poor.  If you do a large
> sequential write which should map to multiple full-stripe writes, you still
> get a lot of reads.  This is bad.
> The reason is that limited information is available to the raid5 driver
> concerning what is coming next, and it often guesses wrongly.
>
> I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
> A first step would be to "watch" exactly what happens in terms of the way
> that requests come down, the timing of 'unplug' events, and the actual
> handling of stripes.  'blktrace' could provide most or all of the raw data.
>

Thanks for that info. I did not expect to find such basic challenges in the
code ... Could you explain what you mean by unplug events? Maybe you can give
me the function in raid5.c that would be the right place to understand better
how changed data "leaves" the stripes and is put back on the free lists.

> Then determine what the trace "should" look like and come up with a way for
> raid5 to figure that out and do it.
> I suspect that might involve a more "clever" queuing algorithm, possibly
> keeping all the stripe_heads sorted, possibly storing them in an RB-tree.
>
> Once you have that queuing in place, so that the pattern of write requests
> submitted to the drives makes sense, it is time to analyse CPU efficiency
> and find out where double-handling is happening, or where "batching" or
> re-ordering of operations can make a difference.
> If the queuing algorithm collects contiguous sequences of stripe_heads
> together, then processing a batch of them in succession may provide the same
> improvements as processing fewer, larger stripe_heads.
>
> So: first step is to get the IO patterns optimal.  Then look for ways to
> optimise for CPU time.
>
> NeilBrown

Markus
^ permalink raw reply	[flat|nested] 3+ messages in thread
* Re: Bigger stripe size
  2014-08-14  6:33   ` AW: " Markus Stockhausen
@ 2014-08-14  7:17     ` NeilBrown
  0 siblings, 0 replies; 3+ messages in thread

From: NeilBrown @ 2014-08-14  7:17 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4513 bytes --]

On Thu, 14 Aug 2014 06:33:51 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> > From: NeilBrown [neilb@suse.de]
> > Sent: Thursday, 14 August 2014 06:11
> > To: Markus Stockhausen
> > Cc: shli@kernel.org; linux-raid@vger.kernel.org
> > Subject: Re: Bigger stripe size
> > ...
> > >
> > > Will it make sense to work with per-stripe sizes? E.g.
> > >
> > > User reads/writes 4K -> Work on a 4K stripe.
> > > User reads/writes 16K -> Work on a 16K stripe.
> > >
> > > Difficulties.
> > >
> > > - avoid overlapping of "small" and "big" stripes
> > > - split the stripe cache into different sizes
> > > - Can we allocate multi-page memory to have contiguous work areas?
> > > - ...
> > >
> > > Benefits.
> > >
> > > - Stripe handling unchanged.
> > > - Parity calculation more efficient.
> > > - ...
> > >
> > > Other ideas?
> >
> > I fear that we are chasing the wrong problem.
> >
> > The scheduling of stripe handling is currently very poor.  If you do a large
> > sequential write which should map to multiple full-stripe writes, you still
> > get a lot of reads.  This is bad.
> > The reason is that limited information is available to the raid5 driver
> > concerning what is coming next, and it often guesses wrongly.
> >
> > I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
> > A first step would be to "watch" exactly what happens in terms of the way
> > that requests come down, the timing of 'unplug' events, and the actual
> > handling of stripes.  'blktrace' could provide most or all of the raw data.
> >
>
> Thanks for that info. I did not expect to find such basic challenges in the
> code ... Could you explain what you mean by unplug events? Maybe you can give
> me the function in raid5.c that would be the right place to understand better
> how changed data "leaves" the stripes and is put back on the free lists.

When data is submitted to any block device, the code normally calls
blk_start_plug(), and when it has submitted all the requests that it wants to
submit it calls blk_finish_plug().  If any code ever needs to 'schedule()',
e.g. to wait for memory to be freed, the equivalent of blk_finish_plug() is
called so that any pending requests are sent on their way.

md/raid5 checks whether a plug is currently in force using
blk_check_plugged().  If it is, then new requests are queued internally and
not released until raid5_unplug() is called.

The net result of this is to gather multiple small requests together.  It
helps with scheduling, but not completely.

There are two important parts to understand in raid5.

make_request() is how a request (struct bio) is given to raid5.  It finds
which stripe_heads to attach it to and does so using add_stripe_bio().  When
each stripe_head is released (release_stripe()) it is put on a queue (if it
is otherwise idle).

The second part is handle_stripe().  This is called as needed by raid5d.  It
plucks a stripe_head off the list, figures out what to do with it, and does
it.  Once the data has been written, return_io() is called on all the bios
that are finished with, and their owner (e.g. the filesystem) is told that
the write (or read) is complete.

Each stripe_head represents a 4K strip across all devices.
So for an array with 64K chunks, a "full stripe write" requires 16 different
stripe_heads to be assembled and worked on.  This currently all happens one
stripe_head at a time.

Once you have digested all that, ask some more questions :-)

NeilBrown

>
> >
> > Then determine what the trace "should" look like and come up with a way for
> > raid5 to figure that out and do it.
> > I suspect that might involve a more "clever" queuing algorithm, possibly
> > keeping all the stripe_heads sorted, possibly storing them in an RB-tree.
> >
> > Once you have that queuing in place, so that the pattern of write requests
> > submitted to the drives makes sense, it is time to analyse CPU efficiency
> > and find out where double-handling is happening, or where "batching" or
> > re-ordering of operations can make a difference.
> > If the queuing algorithm collects contiguous sequences of stripe_heads
> > together, then processing a batch of them in succession may provide the same
> > improvements as processing fewer, larger stripe_heads.
> >
> > So: first step is to get the IO patterns optimal.  Then look for ways to
> > optimise for CPU time.
> >
> > NeilBrown
>
> Markus

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread
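As a rough illustration of the plugging described in the message above
(requests are only queued while a "plug" is held, and everything gathered is
issued together at unplug time), here is a small self-contained C sketch. It
is a userspace analogue only: the plug and request types below are invented
for illustration and are not the block-layer API, though the flow mirrors the
blk_start_plug() / blk_finish_plug() / raid5_unplug() sequence Neil outlines.

#include <stdio.h>

#define MAX_PENDING 64

/* Invented stand-in for the block layer's plug: just a queue of sectors. */
struct plug {
	unsigned long long pending[MAX_PENDING];
	int count;
};

static void start_plug(struct plug *p)		/* cf. blk_start_plug() */
{
	p->count = 0;
}

static void submit(struct plug *p, unsigned long long sector)
{
	if (p && p->count < MAX_PENDING)
		p->pending[p->count++] = sector;  /* queued, not yet issued */
	else
		printf("issue sector %llu immediately\n", sector);
}

static void finish_plug(struct plug *p)	/* cf. blk_finish_plug() / raid5_unplug() */
{
	int i;

	/* Everything gathered while plugged goes out together. */
	printf("unplug: issuing %d gathered requests\n", p->count);
	for (i = 0; i < p->count; i++)
		printf("  issue sector %llu\n", p->pending[i]);
	p->count = 0;
}

int main(void)
{
	struct plug p;

	start_plug(&p);
	submit(&p, 0);		/* a run of small sequential writes ... */
	submit(&p, 8);
	submit(&p, 16);
	submit(&p, 24);
	finish_plug(&p);	/* ... is issued as one burst at unplug */
	return 0;
}

The point of the sketch is only the gathering behaviour: while the plug is
held nothing reaches the drives, so several small writes can later be handled
as a group rather than one at a time.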
end of thread, other threads: [~2014-08-14 7:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <12EF8D94C6F8734FB2FF37B9FBEDD1735863D351@EXCHANGE.collogia.de>
2014-08-14  4:11 ` Bigger stripe size NeilBrown
2014-08-14  6:33   ` AW: " Markus Stockhausen
2014-08-14  7:17     ` NeilBrown