* Re: [dm-devel] Raid0 performance regression
From: Lukas Straub @ 2022-01-23 18:00 UTC
To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid

CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.

On Fri, 21 Jan 2022 16:38:03 +0000
Roger Willcocks <roger@filmlight.ltd.uk> wrote:

> Hi folks,
>
> we noticed a thirty percent drop in performance on one of our raid
> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> striping to balance (by time) access to a pair of hardware raid-6
> arrays. The underlying issue is also present in the native raid0
> driver, so herewith the gory details; I'd appreciate your thoughts.
>
> --
>
> blkdev_direct_IO() calls submit_bio(), which calls an outermost
> generic_make_request() (aka submit_bio_noacct()).
>
> md_make_request() calls blk_queue_split(), which cuts an incoming
> request into two parts, the first no larger than get_max_io_size()
> bytes (which, in the case of raid0, is the chunk size):
>
> R -> AB
>
> blk_queue_split() gives the second part 'B' to generic_make_request()
> to worry about later and returns the first part 'A'.
>
> md_make_request() then passes 'A' to a more specific request handler,
> in this case raid0_make_request().
>
> raid0_make_request() cuts its incoming request into two parts at the
> next chunk boundary:
>
> A -> ab
>
> It then fixes up the device (chooses a physical device) for 'a' and
> gives both parts, separately, to generic_make_request().
>
> This is where things go awry, because 'b' is still targeted at the
> original device (same as 'B'), but 'B' was queued before 'b'. So we
> end up with:
>
> R -> Bab
>
> The outermost generic_make_request() then cuts 'B' at
> get_max_io_size(), and the process repeats. ASCII art follows:
>
> /---------------------------------------------------/             incoming rq
>
> /--------/--------/--------/--------/--------/------/             max_io_size
>
> |--------|--------|--------|--------|--------|--------|--------|  chunks
>
> |...=====|---=====|---=====|---=====|---=====|---=====|--......|  rq out
>      a    b   c    d   e    f   g    h   i    j   k   l
>
> Actual submission order for a two-disk raid0: 'aeilhd' and 'cgkjfb'.
>
> --
>
> There are several potential fixes -
>
> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> instead of chunk_size, so that raid0_make_request() receives the
> entire transfer length and cuts it up at chunk boundaries;
>
> neatest is for raid0_make_request() to recognise that 'b' doesn't
> cross a chunk boundary, so it can be sent directly to the physical
> device;
>
> and correct is for blk_queue_split() to requeue 'A' before 'B'.
>
> --
>
> There's also a second issue - with a large raid0 chunk size (256K),
> the segments submitted to the physical device are at least 128K and
> trigger the early unplug code in blk_mq_make_request(), so the
> requests are never merged. There are legitimate reasons for a large
> chunk size, so this seems unhelpful.
>
> --
>
> As I said, I'd appreciate your thoughts.
>
> --
>
> Roger
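To make the reordering concrete, here is a minimal user-space sketch of the
queuing behaviour described above. It is a deliberately simplified model, not
kernel code: the helper names mirror the kernel functions they stand in for,
and CHUNK, NDISKS, and the request geometry are arbitrary illustrative values.

	/* build: cc -o split-order split-order.c */
	#include <stdio.h>

	#define CHUNK   8        /* raid0 chunk size, in "sectors" */
	#define MAX_IO  CHUNK    /* get_max_io_size() == chunk size for raid0 */
	#define NDISKS  2

	struct bio { long start, len; int disk; struct bio *next; };

	static struct bio pool[64];
	static int npool;
	static struct bio *head, *tail;   /* models current->bio_list */

	static struct bio *new_bio(long start, long len)
	{
		struct bio *b = &pool[npool++];
		b->start = start;
		b->len = len;
		b->disk = -1;             /* -1: still targeted at the md device */
		b->next = NULL;
		return b;
	}

	/* cut 'b' at offset 'at' and return the remainder */
	static struct bio *bio_split(struct bio *b, long at)
	{
		struct bio *rest = new_bio(b->start + at, b->len - at);
		b->len = at;
		return rest;
	}

	/* inner generic_make_request(): just append to the bio list */
	static void generic_make_request(struct bio *b)
	{
		b->next = NULL;
		if (tail)
			tail->next = b;
		else
			head = b;
		tail = b;
	}

	static void raid0_make_request(struct bio *b)
	{
		long in_chunk = CHUNK - b->start % CHUNK;
		struct bio *rest = NULL;

		if (b->len > in_chunk)                  /* A -> ab */
			rest = bio_split(b, in_chunk);
		b->disk = (b->start / CHUNK) % NDISKS;  /* map 'a' to a disk */
		generic_make_request(b);                /* 'a' queues after 'B'... */
		if (rest)
			generic_make_request(rest);     /* ...and 'b' after that */
	}

	static void md_make_request(struct bio *b)
	{
		if (b->len > MAX_IO)   /* blk_queue_split(): R -> AB, 'B' queued first */
			generic_make_request(bio_split(b, MAX_IO));
		raid0_make_request(b); /* 'A' handled right away */
	}

	int main(void)
	{
		/* the outermost generic_make_request() loop */
		generic_make_request(new_bio(3, 48)); /* starts mid-chunk, as in the art */
		while (head) {
			struct bio *b = head;
			head = b->next;
			if (!head)
				tail = NULL;
			if (b->disk >= 0)             /* mapped: "dispatch" to hardware */
				printf("disk %d: sectors %ld..%ld\n",
				       b->disk, b->start, b->start + b->len - 1);
			else
				md_make_request(b);
		}
		return 0;
	}

Running it prints, among other lines, "disk 1: sectors 11..15" before
"disk 1: sectors 8..10" and "disk 0: sectors 19..23" before
"disk 0: sectors 16..18": each disk sees its sectors out of order, which is
Roger's 'aeilhd' pattern in miniature.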
* Re: [dm-devel] Raid0 performance regression
From: Paul Menzel @ 2022-01-23 21:34 UTC
To: Roger Willcocks; +Cc: dm-devel, Song Liu, linux-raid, Lukas Straub

Dear Roger,

On 23.01.22 at 19:00, Lukas Straub wrote:
> CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.
>
> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:

>> we noticed a thirty percent drop in performance on one of our raid
>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like

For those outside the CentOS universe, what Linux kernel versions are
those?

>> striping to balance (by time) access to a pair of hardware raid-6
>> arrays. The underlying issue is also present in the native raid0
>> driver, so herewith the gory details; I'd appreciate your thoughts.

[...]

>> As I said, I'd appreciate your thoughts.

Thank you for the report and the analysis.

Is the second issue also a regression? If not, I suggest splitting it
into a separate thread.
Kind regards,

Paul
* Re: [dm-devel] Raid0 performance regression
From: Roger Willcocks @ 2022-01-24 16:48 UTC
To: Paul Menzel; +Cc: dm-devel, Song Liu, linux-raid, Lukas Straub

> On 23 Jan 2022, at 21:34, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Roger,
>
> On 23.01.22 at 19:00, Lukas Straub wrote:
>> CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.
>> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:
>
>>> we noticed a thirty percent drop in performance on one of our raid
>>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
>
> For those outside the CentOS universe, what Linux kernel versions are
> those?

2.6.32 (with backported changes) and 4.18.0 (similarly).

[...]
> Thank you for the report and the analysis.
>
> Is the second issue also a regression? If not, I suggest splitting it
> into a separate thread.

Yes, this is also a regression; both issues above have to be addressed
to recover the original performance.

Specifically, an md raid0 array with a 256K chunk size, interleaving
two 12-disk raid6 devices (Adaptec 3154 controller, 50 MB files stored
contiguously on disk, four threads), can achieve a sequential read
rate of 3800 MB/sec with the (very) old 6.5 kernel; this falls to
2500 MB/sec with the relatively newer kernel.

This change to raid0.c:

-	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+	blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

improves things somewhat; the sub-chunk requests are now submitted in
order, but we still only get 2800 MB/sec because no merging takes
place, and the controller struggles to keep up with the large number
of sub-chunk transfers. This additional change to blk-mq.c:

-	if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+	if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
	    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
		blk_flush_plug_list(plug, false);

brings performance back to 6.5 levels.
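For comparison, the "neatest" fix from the original analysis (sending 'b'
directly to the physical device when it does not cross a chunk boundary) can
be tried in the simplified user-space model shown earlier. This is a sketch of
the idea only, in terms of that toy model, not a proposal for the actual
raid0.c code:

	/* Variant of raid0_make_request() in the toy model above: when the
	 * remainder 'rest' fits inside a single chunk (always true here,
	 * since incoming bios are at most MAX_IO == CHUNK long), map it to
	 * its physical disk immediately instead of sending it back through
	 * md_make_request(), where it would queue behind the already-queued
	 * 'B'. */
	static void raid0_make_request(struct bio *b)
	{
		long in_chunk = CHUNK - b->start % CHUNK;
		struct bio *rest = NULL;

		if (b->len > in_chunk)
			rest = bio_split(b, in_chunk);

		b->disk = (b->start / CHUNK) % NDISKS;      /* map 'a' */
		generic_make_request(b);

		if (rest) {
			/* 'rest' starts on a chunk boundary and is shorter
			 * than a chunk, so it can be mapped straight away. */
			rest->disk = (rest->start / CHUNK) % NDISKS;
			generic_make_request(rest);
		}
	}

With this one change, the model dispatches each disk's sectors in strictly
ascending order, which is at least consistent with the ordering Roger
recovered with his raid0.c patch.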
* Re: [dm-devel] Raid0 performance regression
From: Song Liu @ 2022-01-25 8:00 UTC
To: Roger Willcocks; +Cc: Paul Menzel, dm-devel, linux-raid, Lukas Straub

On Mon, Jan 24, 2022 at 8:49 AM Roger Willcocks <roger@filmlight.ltd.uk> wrote:
>
> [...]
> Yes, this is also a regression; both issues above have to be addressed
> to recover the original performance.
>
> [...]
>
> This change to raid0.c:
>
> -	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
> +	blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

I guess this is OK.

> improves things somewhat; the sub-chunk requests are now submitted in
> order, but we still only get 2800 MB/sec because no merging takes
> place, and the controller struggles to keep up with the large number
> of sub-chunk transfers. This additional change to blk-mq.c:
>
> -	if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> +	if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
> 	    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> 		blk_flush_plug_list(plug, false);

We recently did some optimization for BLK_MAX_REQUEST_COUNT ([1] and
some follow-ups). We can probably do something similar for
BLK_PLUG_FLUSH_SIZE.

Thanks,
Song

[1] https://lore.kernel.org/all/20210907230338.227903-1-songliubraving@fb.com/
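One possible shape for "something similar", following the multiple-queues
approach of [1], is sketched below. This is a hypothetical fragment for
illustration only, not a tested patch: the helper name and the 4x factor are
invented here, while struct blk_plug (with its multiple_queues field),
BLK_MAX_REQUEST_COUNT, BLK_PLUG_FLUSH_SIZE, and blk_rq_bytes() are existing
kernel symbols.

	/* Hypothetical sketch: allow a larger plugged-request size before
	 * flushing when the plug spans multiple queues, instead of disabling
	 * the size check outright as in Roger's experiment.  The caller in
	 * blk_mq_make_request() would use this in place of the open-coded
	 * condition quoted above. */
	static bool blk_plug_should_flush(struct blk_plug *plug,
					  unsigned int request_count,
					  struct request *last)
	{
		unsigned int flush_bytes = BLK_PLUG_FLUSH_SIZE;

		if (plug->multiple_queues)
			flush_bytes *= 4;   /* assumed factor, mirroring [1] */

		return request_count >= BLK_MAX_REQUEST_COUNT ||
		       (last && blk_rq_bytes(last) >= flush_bytes);
	}

Whether a fixed multiplier, a per-queue tunable, or something else is the
right shape is exactly the open question in Song's reply.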