* Re: Bigger stripe size
       [not found] <12EF8D94C6F8734FB2FF37B9FBEDD1735863D351@EXCHANGE.collogia.de>
@ 2014-08-14  4:11 ` NeilBrown
  2014-08-14  6:33   ` AW: " Markus Stockhausen
  0 siblings, 1 reply; 3+ messages in thread

From: NeilBrown @ 2014-08-14  4:11 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2691 bytes --]

On Wed, 13 Aug 2014 07:21:20 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> Hello you two,
>
> I saw Shaohua's patches for making the stripe size in raid4/5/6 configurable.
> If I got it right, Neil likes the idea but does not agree with the kind of
> implementation.
>
> The patch is quite big and intrusive, so I guess that any other design will
> have the same complexity. Neil's idea about linking stripe headers sounds
> reasonable, but it will make it necessary to "look at the linked neighbours"
> for some operations, whatever "look" means programmatically. So I would like
> to hear your feedback about the following design.
>
> Will it make sense to work with per-stripe sizes? E.g.
>
> User reads/writes 4K -> Work on a 4K stripe.
> User reads/writes 16K -> Work on a 16K stripe.
>
> Difficulties.
>
> - avoid overlapping of "small" and "big" stripes
> - split the stripe cache into different sizes
> - Can we allocate multi-page memory to have contiguous work areas?
> - ...
>
> Benefits.
>
> - Stripe handling unchanged.
> - Parity calculation more efficient.
> - ...
>
> Other ideas?

I fear that we are chasing the wrong problem.

The scheduling of stripe handling is currently very poor.  If you do a large
sequential write which should map to multiple full-stripe writes, you still
get a lot of reads.  This is bad.
The reason is that limited information is available to the raid5 driver
concerning what is coming next, and it often guesses wrongly.

I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
A first step would be to "watch" exactly what happens in terms of the way
that requests come down, the timing of 'unplug' events, and the actual
handling of stripes.  'blktrace' could provide most or all of the raw data.

Then determine what the trace "should" look like and come up with a way for
raid5 to figure that out and do it.
I suspect that might involve a more "clever" queuing algorithm, possibly
keeping all the stripe_heads sorted, possibly storing them in an RB-tree.

Once you have that queuing in place, so that the pattern of write requests
submitted to the drives makes sense, it is time to analyse CPU efficiency
and find out where double-handling is happening, or where "batching" or
re-ordering of operations can make a difference.
If the queuing algorithm collects contiguous sequences of stripe_heads
together, then processing a batch of them in succession may provide the same
improvements as processing fewer, larger stripe_heads.

So: first step is to get the IO patterns optimal.  Then look for ways to
optimise for CPU time.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread
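To make the batching idea in the message above concrete, here is a minimal
userspace C sketch (not the md code, and not part of the thread): it keeps a
set of hypothetical stripe_heads sorted by start sector -- the sort standing
in for the RB-tree Neil mentions -- and handles each contiguous run as one
batch. The struct, function names and sector numbers are invented purely for
illustration.

#include <stdio.h>
#include <stdlib.h>

#define STRIPE_SECTORS 8	/* one 4K stripe_head = 8 x 512-byte sectors */

struct stripe_head {
	unsigned long long sector;	/* start sector of this 4K strip */
};

static int cmp_sector(const void *a, const void *b)
{
	const struct stripe_head *x = a, *y = b;

	return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void)
{
	/* Pretend these arrived from the filesystem in a jumbled order. */
	struct stripe_head q[] = {
		{ 16 }, { 0 }, { 8 }, { 24 }, { 48 }, { 40 }, { 56 },
	};
	size_t n = sizeof(q) / sizeof(q[0]);
	size_t i = 0;

	/* qsort() stands in for a queue kept sorted by sector. */
	qsort(q, n, sizeof(q[0]), cmp_sector);

	/* Walk the sorted queue; process each contiguous run as one batch. */
	while (i < n) {
		size_t j = i + 1;

		while (j < n && q[j].sector == q[j - 1].sector + STRIPE_SECTORS)
			j++;
		printf("batch of %zu stripe_heads: sectors %llu-%llu\n",
		       j - i, q[i].sector,
		       q[j - 1].sector + STRIPE_SECTORS - 1);
		i = j;
	}
	return 0;
}

With the sample input this prints two batches (sectors 0-31 and 40-63),
which is the kind of grouping that could replace many single-stripe_head
handling passes with one pass per contiguous run.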
* AW: Bigger stripe size
  2014-08-14  4:11 ` Bigger stripe size NeilBrown
@ 2014-08-14  6:33   ` Markus Stockhausen
  2014-08-14  7:17     ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread

From: Markus Stockhausen @ 2014-08-14  6:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2646 bytes --]

> From: NeilBrown [neilb@suse.de]
> Sent: Thursday, 14 August 2014 06:11
> To: Markus Stockhausen
> Cc: shli@kernel.org; linux-raid@vger.kernel.org
> Subject: Re: Bigger stripe size
> ...
> >
> > Will it make sense to work with per-stripe sizes? E.g.
> >
> > User reads/writes 4K -> Work on a 4K stripe.
> > User reads/writes 16K -> Work on a 16K stripe.
> >
> > Difficulties.
> >
> > - avoid overlapping of "small" and "big" stripes
> > - split the stripe cache into different sizes
> > - Can we allocate multi-page memory to have contiguous work areas?
> > - ...
> >
> > Benefits.
> >
> > - Stripe handling unchanged.
> > - Parity calculation more efficient.
> > - ...
> >
> > Other ideas?
>
> I fear that we are chasing the wrong problem.
>
> The scheduling of stripe handling is currently very poor.  If you do a large
> sequential write which should map to multiple full-stripe writes, you still
> get a lot of reads.  This is bad.
> The reason is that limited information is available to the raid5 driver
> concerning what is coming next, and it often guesses wrongly.
>
> I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
> A first step would be to "watch" exactly what happens in terms of the way
> that requests come down, the timing of 'unplug' events, and the actual
> handling of stripes.  'blktrace' could provide most or all of the raw data.
>

Thanks for that info. I did not expect to find such basic challenges in the
code ... Could you explain what you mean by unplug events? Maybe you can give
me the function in raid5.c that would be the right place to understand better
how changed data "leaves" the stripes and is put back on the free lists.

> Then determine what the trace "should" look like and come up with a way for
> raid5 to figure that out and do it.
> I suspect that might involve a more "clever" queuing algorithm, possibly
> keeping all the stripe_heads sorted, possibly storing them in an RB-tree.
>
> Once you have that queuing in place, so that the pattern of write requests
> submitted to the drives makes sense, it is time to analyse CPU efficiency
> and find out where double-handling is happening, or where "batching" or
> re-ordering of operations can make a difference.
> If the queuing algorithm collects contiguous sequences of stripe_heads
> together, then processing a batch of them in succession may provide the same
> improvements as processing fewer, larger stripe_heads.
>
> So: first step is to get the IO patterns optimal.  Then look for ways to
> optimise for CPU time.
>
> NeilBrown

Markus
^ permalink raw reply	[flat|nested] 3+ messages in thread
* Re: Bigger stripe size
  2014-08-14  6:33   ` AW: " Markus Stockhausen
@ 2014-08-14  7:17     ` NeilBrown
  0 siblings, 0 replies; 3+ messages in thread

From: NeilBrown @ 2014-08-14  7:17 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: shli@kernel.org, linux-raid@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4513 bytes --]

On Thu, 14 Aug 2014 06:33:51 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> > From: NeilBrown [neilb@suse.de]
> > Sent: Thursday, 14 August 2014 06:11
> > To: Markus Stockhausen
> > Cc: shli@kernel.org; linux-raid@vger.kernel.org
> > Subject: Re: Bigger stripe size
> > ...
> > >
> > > Will it make sense to work with per-stripe sizes? E.g.
> > >
> > > User reads/writes 4K -> Work on a 4K stripe.
> > > User reads/writes 16K -> Work on a 16K stripe.
> > >
> > > Difficulties.
> > >
> > > - avoid overlapping of "small" and "big" stripes
> > > - split the stripe cache into different sizes
> > > - Can we allocate multi-page memory to have contiguous work areas?
> > > - ...
> > >
> > > Benefits.
> > >
> > > - Stripe handling unchanged.
> > > - Parity calculation more efficient.
> > > - ...
> > >
> > > Other ideas?
> >
> > I fear that we are chasing the wrong problem.
> >
> > The scheduling of stripe handling is currently very poor.  If you do a large
> > sequential write which should map to multiple full-stripe writes, you still
> > get a lot of reads.  This is bad.
> > The reason is that limited information is available to the raid5 driver
> > concerning what is coming next, and it often guesses wrongly.
> >
> > I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
> > A first step would be to "watch" exactly what happens in terms of the way
> > that requests come down, the timing of 'unplug' events, and the actual
> > handling of stripes.  'blktrace' could provide most or all of the raw data.
> >
>
> Thanks for that info. I did not expect to find such basic challenges in the
> code ... Could you explain what you mean by unplug events? Maybe you can give
> me the function in raid5.c that would be the right place to understand better
> how changed data "leaves" the stripes and is put back on the free lists.

When data is submitted to any block device, the code normally calls
blk_start_plug(), and when it has submitted all the requests that it wants to
submit it calls blk_finish_plug().  If any code ever needs to 'schedule()',
e.g. to wait for memory to be freed, the equivalent of blk_finish_plug() is
called so that any pending requests are sent on their way.

md/raid5 checks whether a plug is currently in force using
blk_check_plugged().  If it is, then new requests are queued internally and
not released until raid5_unplug() is called.

The net result of this is to gather multiple small requests together.  It
helps with scheduling, but not completely.

There are two important parts to understand in raid5.

make_request() is how a request (struct bio) is given to raid5.  It finds
which stripe_heads to attach it to and does so using add_stripe_bio().  When
each stripe_head is released (release_stripe()) it is put on a queue (if it
is otherwise idle).

The second part is handle_stripe().  This is called as needed by raid5d.  It
plucks a stripe_head off the list, figures out what to do with it, and does
it.  Once the data has been written, return_io() is called on all the bios
that are finished with, and their owner (e.g. the filesystem) is told that
the write (or read) is complete.

Each stripe_head represents a 4K strip across all devices.
So for an array with 64K chunks, a "full stripe write" requires 16 different
stripe_heads to be assembled and worked on.  This currently all happens one
stripe_head at a time.

Once you have digested all that, ask some more questions :-)

NeilBrown

>
> >
> > Then determine what the trace "should" look like and come up with a way for
> > raid5 to figure that out and do it.
> > I suspect that might involve a more "clever" queuing algorithm, possibly
> > keeping all the stripe_heads sorted, possibly storing them in an RB-tree.
> >
> > Once you have that queuing in place, so that the pattern of write requests
> > submitted to the drives makes sense, it is time to analyse CPU efficiency
> > and find out where double-handling is happening, or where "batching" or
> > re-ordering of operations can make a difference.
> > If the queuing algorithm collects contiguous sequences of stripe_heads
> > together, then processing a batch of them in succession may provide the same
> > improvements as processing fewer, larger stripe_heads.
> >
> > So: first step is to get the IO patterns optimal.  Then look for ways to
> > optimise for CPU time.
> >
> > NeilBrown
>
> Markus

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread
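As a rough illustration of the plugging described in the message above
(requests are only queued while a "plug" is held, and everything gathered is
issued together at unplug time), here is a small self-contained C sketch. It
is a userspace analogue only: the plug and request types below are invented
for illustration and are not the block-layer API, though the flow mirrors the
blk_start_plug() / blk_finish_plug() / raid5_unplug() sequence Neil outlines.

#include <stdio.h>

#define MAX_PENDING 64

/* Invented stand-in for the block layer's plug: just a queue of sectors. */
struct plug {
	unsigned long long pending[MAX_PENDING];
	int count;
};

static void start_plug(struct plug *p)		/* cf. blk_start_plug() */
{
	p->count = 0;
}

static void submit(struct plug *p, unsigned long long sector)
{
	if (p && p->count < MAX_PENDING)
		p->pending[p->count++] = sector;  /* queued, not yet issued */
	else
		printf("issue sector %llu immediately\n", sector);
}

static void finish_plug(struct plug *p)	/* cf. blk_finish_plug() / raid5_unplug() */
{
	int i;

	/* Everything gathered while plugged goes out together. */
	printf("unplug: issuing %d gathered requests\n", p->count);
	for (i = 0; i < p->count; i++)
		printf("  issue sector %llu\n", p->pending[i]);
	p->count = 0;
}

int main(void)
{
	struct plug p;

	start_plug(&p);
	submit(&p, 0);		/* a run of small sequential writes ... */
	submit(&p, 8);
	submit(&p, 16);
	submit(&p, 24);
	finish_plug(&p);	/* ... is issued as one burst at unplug */
	return 0;
}

The point of the sketch is only the gathering behaviour: while the plug is
held nothing reaches the drives, so several small writes can later be handled
as a group rather than one at a time.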
end of thread, other threads: [~2014-08-14 7:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <12EF8D94C6F8734FB2FF37B9FBEDD1735863D351@EXCHANGE.collogia.de>
2014-08-14  4:11 ` Bigger stripe size NeilBrown
2014-08-14  6:33   ` AW: " Markus Stockhausen
2014-08-14  7:17     ` NeilBrown