* Re: [dm-devel] Raid0 performance regression
From: Lukas Straub @ 2022-01-23 18:00 UTC
To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid

CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.

On Fri, 21 Jan 2022 16:38:03 +0000
Roger Willcocks <roger@filmlight.ltd.uk> wrote:

> Hi folks,
>
> we noticed a thirty percent drop in performance on one of our raid
> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> striping to balance (by time) access to a pair of hardware raid-6
> arrays. The underlying issue is also present in the native raid0
> driver, so herewith the gory details; I'd appreciate your thoughts.
>
> --
>
> blkdev_direct_IO() calls submit_bio(), which calls an outermost
> generic_make_request() (aka submit_bio_noacct()).
>
> md_make_request() calls blk_queue_split(), which cuts an incoming
> request into two parts, the first no larger than get_max_io_size()
> bytes (which, in the case of raid0, is the chunk size):
>
> R -> AB
>
> blk_queue_split() gives the second part 'B' to generic_make_request()
> to worry about later and returns the first part 'A'.
>
> md_make_request() then passes 'A' to a more specific request handler,
> in this case raid0_make_request().
>
> raid0_make_request() cuts its incoming request into two parts at the
> next chunk boundary:
>
> A -> ab
>
> It then fixes up the device (chooses a physical device) for 'a' and
> gives both parts, separately, to generic_make_request().
>
> This is where things go awry, because 'b' is still targeted at the
> original device (same as 'B'), but 'B' was queued before 'b'. So we
> end up with:
>
> R -> Bab
>
> The outermost generic_make_request() then cuts 'B' at
> get_max_io_size(), and the process repeats. ASCII art follows:
>
> /---------------------------------------------------/             incoming rq
>
> /--------/--------/--------/--------/--------/------/             max_io_size
>
> |--------|--------|--------|--------|--------|--------|--------|  chunks
>
> |...=====|---=====|---=====|---=====|---=====|---=====|--......|  rq out
>      a    b   c    d   e    f   g    h   i    j   k   l
>
> Actual submission order for a two-disk raid0: 'aeilhd' and 'cgkjfb'.
>
> --
>
> There are several potential fixes -
>
> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> instead of chunk_size, so that raid0_make_request() receives the
> entire transfer length and cuts it up at chunk boundaries;
>
> neatest is for raid0_make_request() to recognise that 'b' doesn't
> cross a chunk boundary, so it can be sent directly to the physical
> device;
>
> and correct is for blk_queue_split() to requeue 'A' before 'B'.
>
> --
>
> There's also a second issue - with a large raid0 chunk size (256K),
> the segments submitted to the physical device are at least 128K and
> trigger the early unplug code in blk_mq_make_request(), so the
> requests are never merged. There are legitimate reasons for a large
> chunk size, so this seems unhelpful.
>
> --
>
> As I said, I'd appreciate your thoughts.
>
> --
>
> Roger
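To make the reordering concrete, here is a minimal user-space sketch of the
queuing behaviour described above. It is a deliberately simplified model, not
kernel code: the helper names mirror the kernel functions they stand in for,
and CHUNK, NDISKS, and the request geometry are arbitrary illustrative values.

	/* build: cc -o split-order split-order.c */
	#include <stdio.h>

	#define CHUNK   8        /* raid0 chunk size, in "sectors" */
	#define MAX_IO  CHUNK    /* get_max_io_size() == chunk size for raid0 */
	#define NDISKS  2

	struct bio { long start, len; int disk; struct bio *next; };

	static struct bio pool[64];
	static int npool;
	static struct bio *head, *tail;   /* models current->bio_list */

	static struct bio *new_bio(long start, long len)
	{
		struct bio *b = &pool[npool++];
		b->start = start;
		b->len = len;
		b->disk = -1;             /* -1: still targeted at the md device */
		b->next = NULL;
		return b;
	}

	/* cut 'b' at offset 'at' and return the remainder */
	static struct bio *bio_split(struct bio *b, long at)
	{
		struct bio *rest = new_bio(b->start + at, b->len - at);
		b->len = at;
		return rest;
	}

	/* inner generic_make_request(): just append to the bio list */
	static void generic_make_request(struct bio *b)
	{
		b->next = NULL;
		if (tail)
			tail->next = b;
		else
			head = b;
		tail = b;
	}

	static void raid0_make_request(struct bio *b)
	{
		long in_chunk = CHUNK - b->start % CHUNK;
		struct bio *rest = NULL;

		if (b->len > in_chunk)                  /* A -> ab */
			rest = bio_split(b, in_chunk);
		b->disk = (b->start / CHUNK) % NDISKS;  /* map 'a' to a disk */
		generic_make_request(b);                /* 'a' queues after 'B'... */
		if (rest)
			generic_make_request(rest);     /* ...and 'b' after that */
	}

	static void md_make_request(struct bio *b)
	{
		if (b->len > MAX_IO)   /* blk_queue_split(): R -> AB, 'B' queued first */
			generic_make_request(bio_split(b, MAX_IO));
		raid0_make_request(b); /* 'A' handled right away */
	}

	int main(void)
	{
		/* the outermost generic_make_request() loop */
		generic_make_request(new_bio(3, 48)); /* starts mid-chunk, as in the art */
		while (head) {
			struct bio *b = head;
			head = b->next;
			if (!head)
				tail = NULL;
			if (b->disk >= 0)             /* mapped: "dispatch" to hardware */
				printf("disk %d: sectors %ld..%ld\n",
				       b->disk, b->start, b->start + b->len - 1);
			else
				md_make_request(b);
		}
		return 0;
	}

Running it prints, among other lines, "disk 1: sectors 11..15" before
"disk 1: sectors 8..10" and "disk 0: sectors 19..23" before
"disk 0: sectors 16..18": each disk sees its sectors out of order, which is
Roger's 'aeilhd' pattern in miniature.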
* Re: [dm-devel] Raid0 performance regression
From: Paul Menzel @ 2022-01-23 21:34 UTC
To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid, Lukas Straub

Dear Roger,

On 23.01.22 at 19:00, Lukas Straub wrote:
> CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.
>
> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:

>> we noticed a thirty percent drop in performance on one of our raid
>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like

For those outside the CentOS universe, what Linux kernel versions are
those?

>> striping to balance (by time) access to a pair of hardware raid-6
>> arrays. The underlying issue is also present in the native raid0
>> driver, so herewith the gory details; I'd appreciate your thoughts.

[...]

>> As I said, I'd appreciate your thoughts.

Thank you for the report and the analysis.

Is the second issue also a regression? If not, I suggest splitting it
into a separate thread.
Kind regards,

Paul
* Re: [dm-devel] Raid0 performance regression
From: Roger Willcocks @ 2022-01-24 16:48 UTC
To: Paul Menzel; +Cc: dm-devel, Song Liu, linux-raid, Lukas Straub

> On 23 Jan 2022, at 21:34, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Roger,
>
> On 23.01.22 at 19:00, Lukas Straub wrote:
>> CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.
>> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:
>
>>> we noticed a thirty percent drop in performance on one of our raid
>>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
>
> For those outside the CentOS universe, what Linux kernel versions are
> those?

2.6.32 (with backported changes) and 4.18.0 (similarly).

[...]
> Thank you for the report and the analysis.
>
> Is the second issue also a regression? If not, I suggest splitting it
> into a separate thread.

Yes, this is also a regression; both issues above have to be addressed
to recover the original performance.

Specifically, an md raid0 array with a 256K chunk size, interleaving
two 12-disk raid6 devices (Adaptec 3154 controller, 50 MB files stored
contiguously on disk, four threads), can achieve a sequential read
rate of 3800 MB/sec with the (very) old 6.5 kernel; this falls to
2500 MB/sec with the relatively newer kernel.

This change to raid0.c:

-	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+	blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

improves things somewhat; the sub-chunk requests are now submitted in
order, but we still only get 2800 MB/sec because no merging takes
place, and the controller struggles to keep up with the large number
of sub-chunk transfers. This additional change to blk-mq.c:

-	if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+	if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
	    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
		blk_flush_plug_list(plug, false);

brings performance back to 6.5 levels.
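For comparison, the "neatest" fix from the original analysis (sending 'b'
directly to the physical device when it does not cross a chunk boundary) can
be tried in the simplified user-space model shown earlier. This is a sketch of
the idea only, in terms of that toy model, not a proposal for the actual
raid0.c code:

	/* Variant of raid0_make_request() in the toy model above: when the
	 * remainder 'rest' fits inside a single chunk (always true here,
	 * since incoming bios are at most MAX_IO == CHUNK long), map it to
	 * its physical disk immediately instead of sending it back through
	 * md_make_request(), where it would queue behind the already-queued
	 * 'B'. */
	static void raid0_make_request(struct bio *b)
	{
		long in_chunk = CHUNK - b->start % CHUNK;
		struct bio *rest = NULL;

		if (b->len > in_chunk)
			rest = bio_split(b, in_chunk);

		b->disk = (b->start / CHUNK) % NDISKS;      /* map 'a' */
		generic_make_request(b);

		if (rest) {
			/* 'rest' starts on a chunk boundary and is shorter
			 * than a chunk, so it can be mapped straight away. */
			rest->disk = (rest->start / CHUNK) % NDISKS;
			generic_make_request(rest);
		}
	}

With this one change, the model dispatches each disk's sectors in strictly
ascending order, which is at least consistent with the ordering Roger
recovered with his raid0.c patch.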
* Re: [dm-devel] Raid0 performance regression
From: Song Liu @ 2022-01-25 8:00 UTC
To: Roger Willcocks; +Cc: Paul Menzel, dm-devel, linux-raid, Lukas Straub

On Mon, Jan 24, 2022 at 8:49 AM Roger Willcocks <roger@filmlight.ltd.uk> wrote:
>
> [...]
> Yes, this is also a regression; both issues above have to be addressed
> to recover the original performance.
>
> [...]
>
> This change to raid0.c:
>
> -	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
> +	blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

I guess this is OK.

> improves things somewhat; the sub-chunk requests are now submitted in
> order, but we still only get 2800 MB/sec because no merging takes
> place, and the controller struggles to keep up with the large number
> of sub-chunk transfers. This additional change to blk-mq.c:
>
> -	if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> +	if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
> 	    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> 		blk_flush_plug_list(plug, false);

We recently did some optimization for BLK_MAX_REQUEST_COUNT ([1] and
some follow-ups). We can probably do something similar for
BLK_PLUG_FLUSH_SIZE.

Thanks,
Song

[1] https://lore.kernel.org/all/20210907230338.227903-1-songliubraving@fb.com/
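One possible shape for "something similar", following the multiple-queues
approach of [1], is sketched below. This is a hypothetical fragment for
illustration only, not a tested patch: the helper name and the 4x factor are
invented here, while struct blk_plug (with its multiple_queues field),
BLK_MAX_REQUEST_COUNT, BLK_PLUG_FLUSH_SIZE, and blk_rq_bytes() are existing
kernel symbols.

	/* Hypothetical sketch: allow a larger plugged-request size before
	 * flushing when the plug spans multiple queues, instead of disabling
	 * the size check outright as in Roger's experiment.  The caller in
	 * blk_mq_make_request() would use this in place of the open-coded
	 * condition quoted above. */
	static bool blk_plug_should_flush(struct blk_plug *plug,
					  unsigned int request_count,
					  struct request *last)
	{
		unsigned int flush_bytes = BLK_PLUG_FLUSH_SIZE;

		if (plug->multiple_queues)
			flush_bytes *= 4;   /* assumed factor, mirroring [1] */

		return request_count >= BLK_MAX_REQUEST_COUNT ||
		       (last && blk_rq_bytes(last) >= flush_bytes);
	}

Whether a fixed multiplier, a per-queue tunable, or something else is the
right shape is exactly the open question in Song's reply.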