* Re: Improper io_opt setting for md raid5 [not found] <ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5> @ 2025-07-27 10:50 ` Csordás Hunor 2025-07-28 0:39 ` Damien Le Moal 2025-07-29 4:44 ` Martin K. Petersen 0 siblings, 2 replies; 20+ messages in thread From: Csordás Hunor @ 2025-07-27 10:50 UTC (permalink / raw) To: Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, Damien Le Moal Adding the SCSI maintainers because I believe the culprit is in drivers/scsi/sd.c, and Damien Le Moal because he has a pending patch modifying the relevant part and he might be interested in the implications. On 7/15/2025 5:56 PM, Coly Li wrote: > Let me restate the problem I encountered. > 1, There is an 8-disk raid5 with 64K chunk size on my machine, I observe > /sys/block/md0/queue/optimal_io_size is a very large value, which isn’t > a reasonable size IMHO. I have come across the same problem after moving all 8 disks of a RAID6 md array from two separate SATA controllers to an mpt3sas device. In my case, the readahead on the array became almost 4 GB: # grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb} /sys/block/sda/queue/optimal_io_size:16773120 /sys/block/sda/queue/read_ahead_kb:32760 /sys/block/md_helium/queue/optimal_io_size:4293918720 /sys/block/md_helium/queue/read_ahead_kb:4192256 Note: the readahead is supposed to be twice the optimal I/O size (after a unit conversion). On the md array it isn't because of an overflow in blk_apply_bdi_limits. This overflow is avoidable but basically irrelevant; however, it nicely highlights the fact that io_opt should really never get this large. > 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c, > 11939 static const struct scsi_host_template mpt3sas_driver_template = { ... > 11960 .max_sectors = 32767, ... > 11969 }; > at line 11960, max_sectors of mpt3sas driver is defined as 32767.
> > Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(), > shost->opt_sectors is assigned 32767 from the following code, > 240 if (dma_dev->dma_mask) { > 241 shost->opt_sectors = min_t(unsigned int, shost->max_sectors, > 242 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT); > 243 } > > Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following code, > 3785 /* > 3786 * Limit default to SCSI host optimal sector limit if set. There may be > 3787 * an impact on performance for when the size of a request exceeds this > 3788 * host limit. > 3789 */ > 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; > 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { > 3792 lim.io_opt = min_not_zero(lim.io_opt, > 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); > 3794 } > > lim.io_opt of all my sata disks attached to the mpt3sas HBA is 32767 sectors, > because of the above code block. > > Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7, > and the raid component sata hard drive has an io_opt of 32767 sectors, by > calculation in block/blk-settings.c:blk_stack_limits() at line 753, > 753 t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large. > > I know the purpose of lcm_not_zero() is to get an optimized io size for both > the raid device and the underlying component devices, but the resulting io_opt is bigger > than 1 GiB, and that's too big. > > For me, I just feel uncomfortable about using max_sectors as opt_sectors in > sas_host_setup(), but I don't know a better way to improve it. Currently I just > modified the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed > 5~10% sequential write performance improvement (direct io) for my raid5 devices > with fio. In my case, the impact was more noticeable.
The system seemed to work surprisingly fine under light loads, but an increased number of parallel I/O operations completely tanked its performance until I set the readaheads to their expected values and gave the system some time to recover. I came to the same conclusion as Coly Li: io_opt ultimately gets populated from shost->max_sectors, which (in the case of mpt3sas and several other SCSI controllers) contains a value which is both: - unnecessarily large for this purpose and, more importantly, - not a nice number without any large odd divisors, as blk_stack_limits clearly expects. Populating io_opt from shost->max_sectors happens via shost->opt_sectors. This variable was introduced in commits 608128d391fa ("scsi: sd: allow max_sectors be capped at DMA optimal size limit") and 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit"). Despite the (in hindsight perhaps unfortunate) name, it wasn't used to set io_opt. It was optimal in a different sense: it was used as a (user-overridable) upper limit to max_sectors, constraining the size of requests to play nicely with IOMMU which might get slow with large mappings. Commit 608128d391fa even mentions io_opt: It could be considered to have request queues io_opt value initially set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not really the purpose of io_opt. The last part is correct. shost->opt_sectors is an _upper_ bound on the size of requests, while io_opt is used both as a sort of _lower_ bound (in the form of readahead), and as a sort of indivisible "block size" for I/O (by blk_stack_limits). These two existing purposes may or may not already be too much for a single variable; adding a third one clearly doesn't work well. It was commit a23634644afc ("block: take io_opt and io_min into account for max_sectors") which started setting io_opt from shost->opt_sectors. 
It did so to stop abusing max_user_sectors to set max_sectors from shost->opt_sectors, but it ended up misusing another variable for this purpose -- perhaps due to inadvertently conflating the two "optimal" transfer sizes, which are optimal in two very different contexts. Interestingly, while I've verified that the increased values for io_opt and readahead on the actual disks definitely come from this commit (a23634644afc), the io_opt and readahead of the md array are unaffected until commit 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size") due to a weird coincidence. This commit rounds io_opt down to the physical block size in blk_validate_limits. Without this commit, io_opt for the disks is 16776704, which looks even worse at first glance (512 * 32767 instead of 4096 * 4095). However, this ends up overflowing in a funny way when combined with the fact that blk_stack_limits (and thus lcm_not_zero) is called once per component device: u32 t = 3145728; // 3 MB, the optimal I/O size for the array u32 b = 16776704; // the (incorrect) optimal I/O size of the disks u32 x = lcm(t, b); // x == (u32)103076069376 == 4291821568 u32 y = lcm(x, b); // y == (u32)140630117318656 == t Repeat for an even number of component devices to get the right answer from the wrong inputs by an incorrect method. I'm sure the issue can be reproduced before commit 9c0ba14828d6 (although I haven't actually tried -- if I had to, I'd start with an array with an odd number of component devices), but at the same time, the issue may still be present and hidden on some systems even after that commit (for example, the rounding does nothing if the physical block size is 512). This might help a little bit to explain why the problem doesn't seem more widespread. > So there should be something to fix. Can you take a look, or give me some hint > to fix? > > Thanks in advance. > > Coly Li I would have loved to finish with a patch here but I'm not sure what the correct fix is.
shost->opt_sectors was clearly added for a reason and it should reach max_sectors in struct queue_limits in some way. It probably isn't included in max_hw_sectors because it's meant to be overridable. Apparently just setting max_sectors causes problems, and so does setting max_sectors and max_user_sectors. I don't know how to fix this correctly without introducing a new variable to struct queue_limits, but maybe people more familiar with the code can think of a less intrusive way. Hunor Csordás ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor @ 2025-07-28 0:39 ` Damien Le Moal 2025-07-28 0:55 ` Yu Kuai 2025-07-29 4:44 ` Martin K. Petersen 1 sibling, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 0:39 UTC (permalink / raw) To: Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi On 7/27/25 7:50 PM, Csordás Hunor wrote: > Adding the SCSI maintainers because I believe the culprit is in > drivers/scsi/sd.c, and Damien Le Moal because he has a pending > patch modifying the relevant part and he might be interested in the > implications. > > On 7/15/2025 5:56 PM, Coly Li wrote: >> Let me restate the problem I encountered. >> 1, There is an 8-disk raid5 with 64K chunk size on my machine, I observe >> /sys/block/md0/queue/optimal_io_size is a very large value, which isn’t >> a reasonable size IMHO. > > I have come across the same problem after moving all 8 disks of a RAID6 > md array from two separate SATA controllers to an mpt3sas device. In my > case, the readahead on the array became almost 4 GB: > > # grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb} > /sys/block/sda/queue/optimal_io_size:16773120 > /sys/block/sda/queue/read_ahead_kb:32760 For a SATA drive connected to an mpt3sas HBA, I see the same. But note that the optimal_io_size here is completely made up by the HBA/driver because ATA does not advertise/define an optimal IO size. For SATA drives connected to AHCI SATA ports, I see: /sys/block/sda/queue/optimal_io_size:0 /sys/block/sda/queue/read_ahead_kb:8192 read_ahead_kb in this case is twice max_sectors_kb (which with my patch is now 4MB). > /sys/block/md_helium/queue/optimal_io_size:4293918720 > /sys/block/md_helium/queue/read_ahead_kb:4192256 > > Note: the readahead is supposed to be twice the optimal I/O size (after > a unit conversion).
On the md array it isn't because of an overflow in > blk_apply_bdi_limits. This overflow is avoidable but basically > irrelevant; however, it nicely highlights the fact that io_opt should > really never get this large. Only if io_opt is non-zero. If io_opt is zero, then read_ahead_kb by default is twice max_sectors_kb. > >> 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c, >> 11939 static const struct scsi_host_template mpt3sas_driver_template = { > ... >> 11960 .max_sectors = 32767, > ... >> 11969 }; >> at line 11960, max_sectors of mpt3sas driver is defined as 32767. This is another completely made-up value since SCSI allows command transfer lengths up to 4GB (32-bit value in bytes). Even ATA drives allow up to 65536 logical sectors per command (so 65536 * 4K = 256MB per command for 4K logical sector drives). Not sure why it is set to this completely arbitrary value. >> Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(), >> shost->opt_sectors is assigned 32767 from the following code, >> 240 if (dma_dev->dma_mask) { >> 241 shost->opt_sectors = min_t(unsigned int, shost->max_sectors, >> 242 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT); >> 243 } >> >> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following code, >> 3785 /* >> 3786 * Limit default to SCSI host optimal sector limit if set. There may be >> 3787 * an impact on performance for when the size of a request exceeds this >> 3788 * host limit. >> 3789 */ >> 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; >> 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { >> 3792 lim.io_opt = min_not_zero(lim.io_opt, >> 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); >> 3794 } >> >> lim.io_opt of all my sata disks attached to the mpt3sas HBA is 32767 sectors, >> because of the above code block.
>> >> Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7, >> and the raid component sata hard drive has an io_opt of 32767 sectors, by >> calculation in block/blk-settings.c:blk_stack_limits() at line 753, >> 753 t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); >> the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large. md setting its io_opt to 64K*number of drives in the array is strange... It does not have to be that large since io_opt is an upper bound and not an "issue that IO size for optimal performance". io_opt is simply a limit saying: if you exceed that IO size, performance may suffer. So a default of stride size x number of drives for the io_opt may be OK, but that should be bound to some reasonable value. Furthermore, this is likely suboptimal. I would think that setting the md array io_opt initially to min(all drives io_opt) x number of drives would be a better default. >> I know the purpose of lcm_not_zero() is to get an optimized io size for both >> the raid device and the underlying component devices, but the resulting io_opt is bigger >> than 1 GiB, and that's too big. >> >> For me, I just feel uncomfortable about using max_sectors as opt_sectors in >> sas_host_setup(), but I don't know a better way to improve it. Currently I just >> modified the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed >> 5~10% sequential write performance improvement (direct io) for my raid5 devices >> with fio. > In my case, the impact was more noticeable. The system seemed to work > surprisingly fine under light loads, but an increased number of > parallel I/O operations completely tanked its performance until I > set the readaheads to their expected values and gave the system some > time to recover.
> > I came to the same conclusion as Coly Li: io_opt ultimately gets > populated from shost->max_sectors, which (in the case of mpt3sas and > several other SCSI controllers) contains a value which is both: > - unnecessarily large for this purpose and, more importantly, > - not a nice number without any large odd divisors, as blk_stack_limits > clearly expects. Sounds to me like this is an md driver issue: md should tweak the limits automatically calculated by stacking the limits of the array members. > Populating io_opt from shost->max_sectors happens via > shost->opt_sectors. This variable was introduced in commits > 608128d391fa ("scsi: sd: allow max_sectors be capped at DMA optimal > size limit") and 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost > opt_sectors according to DMA optimal limit"). Despite the (in hindsight > perhaps unfortunate) name, it wasn't used to set io_opt. It was optimal > in a different sense: it was used as a (user-overridable) upper limit > to max_sectors, constraining the size of requests to play nicely with > IOMMU which might get slow with large mappings. > > Commit 608128d391fa even mentions io_opt: > > It could be considered to have request queues io_opt value initially > set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not > really the purpose of io_opt. > > The last part is correct. shost->opt_sectors is an _upper_ bound on the > size of requests, while io_opt is used both as a sort of _lower_ bound > (in the form of readahead), and as a sort of indivisible "block size" > for I/O (by blk_stack_limits). These two existing purposes may or may > not already be too much for a single variable; adding a third one > clearly doesn't work well. > > It was commit a23634644afc ("block: take io_opt and io_min into account > for max_sectors") which started setting io_opt from shost->opt_sectors.
> It did so to stop abusing max_user_sectors to set max_sectors from > shost->opt_sectors, but it ended up misusing another variable for this > purpose -- perhaps due to inadvertently conflating the two "optimal" > transfer sizes, which are optimal in two very different contexts. > > Interestingly, while I've verified that the increased values for io_opt > and readahead on the actual disks definitely comes from this commit > (a23634644afc), the io_opt and readahead of the md array are unaffected > until commit 9c0ba14828d6 ("blk-settings: round down io_opt to > physical_block_size") due to a weird coincidence. This commit rounds > io_opt down to the physical block size in blk_validate_limits. Without > this commit, io_opt for the disks is 16776704, which looks even worse > at first glance (512 * 32767 instead of 4096 * 4095). However, this > ends up overflowing in a funny way when combined with the fact that > blk_stack_limits (and thus lcm_not_zero) is called once per component > device: > > u32 t = 3145728; // 3 MB, the optimal I/O size for the array > u32 b = 16776704; // the (incorrect) optimal I/O size of the disks It is not incorrect. It is a made-up value. For a SATA drive, reporting 0 would be the correct thing to do. > u32 x = lcm(t, b); // x == (u32)103076069376 == 4291821568 > u32 y = lcm(x, b); // y == (u32)140630117318656 == t > > Repeat for an even number of component devices to get the right answer > from the wrong inputs by an incorrect method. > > I'm sure the issue can be reproduced before commit 9c0ba14828d6 > (although I haven't actually tried -- if I had to, I'd start with an > array with an odd number of component devices), but at the same time, > the issue may be still present and hidden on some systems even after > that commit (for example, the rounding does nothing if the physical > block size is 512). This might help a little bit to explain why the > problem doesn't seem more widespread. > >> So there should be something to fix. 
Can you take a look, or give me some hint >> to fix? >> >> Thanks in advance. >> >> Coly Li > > I would have loved to finish with a patch here but I'm not sure what > the correct fix is. shost->opt_sectors was clearly added for a reason > and it should reach max_sectors in struct queue_limits in some way. It > probably isn't included in max_hw_sectors because it's meant to be > overridable. Apparently just setting max_sectors causes problems, and > so does setting max_sectors and max_user_sectors. I don't know how to > to fix this correctly without introducing a new variable to struct > queue_limits but maybe people more familiar with the code can think of > a less intrusive way. > > Hunor Csordás > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 0:39 ` Damien Le Moal @ 2025-07-28 0:55 ` Yu Kuai 2025-07-28 2:41 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Yu Kuai @ 2025-07-28 0:55 UTC (permalink / raw) To: Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 8:39, Damien Le Moal 写道: > md setting its io_opt to 64K*number of drives in the array is strange... It > does not have to be that large since io_opt is an upper bound and not an "issue > that IO size for optimal performance". io_opt is simply a limit saying: if you > exceed that IO size, performance may suffer. > At least from the Documentation, for raid arrays, a multiple of io_opt is the preferred io size for optimal io performance, and for raid5, this is chunksize * data disks. > So a default of stride size x number of drives for the io_opt may be OK, but > that should be bound to some reasonable value. Furthermore, this is likely > suboptimal. I would think that setting the md array io_opt initially to > min(all drives io_opt) x number of drives would be a better default. For raid5, this is not ok; the value has to be chunksize * data disks, regardless of the io_opt of the member disks, otherwise raid5 has to issue additional IO to other disks to build xor data. For example: - writing an aligned chunksize to one disk actually means reading a chunksize of old xor data, then writing a chunksize of data and a chunksize of new xor data. - writing an aligned chunksize * data disks, the new xor data can be built directly without reading the old xor data. Thanks, Kuai ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 0:55 ` Yu Kuai @ 2025-07-28 2:41 ` Damien Le Moal 2025-07-28 3:08 ` Yu Kuai 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 2:41 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 9:55 AM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 8:39, Damien Le Moal 写道: >> md setting its io_opt to 64K*number of drives in the array is strange... It >> does not have to be that large since io_opt is an upper bound and not a "issue >> that IO size for optimal performance". io_opt is simply a limit saying: if you >> exceed that IO size, performance may suffer. >> > > At least from Documentation, for raid arrays, multiple of io_opt is the > prefereed io size to the optimal io performance, and for raid5, this is > chunksize * data disks. > >> So a default of stride size x number of drives for the io_opt may be OK, but >> that should be bound to some reasonable value. Furthermore, this is likely >> suboptimal. I woulld think that setting the md array io_opt initially to >> min(all drives io_opt) x number of drives would be a better default. > > For raid5, this is not ok, the value have to be chunksize * data disks, > regardless of io_opt from member disks, otherwise raid5 have to issue > additional IO from other disks to build xor data. > > For example: > > - write aligned chunksize to one disk, actually means read chunksize > old xor data,then write chunksize data and chunksize new xor data. > - write aligned chunksize * data disks, new xor data can be build > directly without reading old xor data. I understand all of that. But you missed my point: io_opt simply indicates an upper bound for an IO size. If exceeded, performance may be degraded. This has *nothing* to do with the io granularity, which for a RAID array should ideally be equal to stride size x number of data disks. This is the confusion here. 
md setting io_opt to stride x number of disks in the array is simply not what io_opt is supposed to indicate. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 2:41 ` Damien Le Moal @ 2025-07-28 3:08 ` Yu Kuai 2025-07-28 3:49 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Yu Kuai @ 2025-07-28 3:08 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 10:41, Damien Le Moal 写道: > On 7/28/25 9:55 AM, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 8:39, Damien Le Moal 写道: >>> md setting its io_opt to 64K*number of drives in the array is strange... It >>> does not have to be that large since io_opt is an upper bound and not a "issue >>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>> exceed that IO size, performance may suffer. >>> >> >> At least from Documentation, for raid arrays, multiple of io_opt is the >> prefereed io size to the optimal io performance, and for raid5, this is >> chunksize * data disks. >> >>> So a default of stride size x number of drives for the io_opt may be OK, but >>> that should be bound to some reasonable value. Furthermore, this is likely >>> suboptimal. I woulld think that setting the md array io_opt initially to >>> min(all drives io_opt) x number of drives would be a better default. >> >> For raid5, this is not ok, the value have to be chunksize * data disks, >> regardless of io_opt from member disks, otherwise raid5 have to issue >> additional IO from other disks to build xor data. >> >> For example: >> >> - write aligned chunksize to one disk, actually means read chunksize >> old xor data,then write chunksize data and chunksize new xor data. >> - write aligned chunksize * data disks, new xor data can be build >> directly without reading old xor data. > > I understand all of that. But you missed my point: io_opt simply indicates an > upper bound for an IO size. If exceeded, performance may be degraded. 
This has > *nothing* to do with the io granularity, which for a RAID array should ideally > be equal to stride size x number of data disks. > > This is the confusion here. md setting io_opt to stride x number of disks in > the array is simply not what io_opt is supposed to indicate. ok, can I ask where this upper bound for IO size comes from? With git log, starting from commit 7e5f5fb09e6f ("block: Update topology documentation"), the documentation contains a special explanation for raid arrays, and the optimal_io_size says: For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired. And this explanation is exactly what raid5 does: it's important that the io size is an aligned multiple of io_opt. Thanks, Kuai > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:08 ` Yu Kuai @ 2025-07-28 3:49 ` Damien Le Moal 2025-07-28 7:14 ` Yu Kuai 2025-07-29 3:49 ` Martin K. Petersen 0 siblings, 2 replies; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 3:49 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 12:08 PM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 10:41, Damien Le Moal 写道: >> On 7/28/25 9:55 AM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/07/28 8:39, Damien Le Moal 写道: >>>> md setting its io_opt to 64K*number of drives in the array is strange... It >>>> does not have to be that large since io_opt is an upper bound and not a "issue >>>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>>> exceed that IO size, performance may suffer. >>>> >>> >>> At least from Documentation, for raid arrays, multiple of io_opt is the >>> prefereed io size to the optimal io performance, and for raid5, this is >>> chunksize * data disks. >>> >>>> So a default of stride size x number of drives for the io_opt may be OK, but >>>> that should be bound to some reasonable value. Furthermore, this is likely >>>> suboptimal. I woulld think that setting the md array io_opt initially to >>>> min(all drives io_opt) x number of drives would be a better default. >>> >>> For raid5, this is not ok, the value have to be chunksize * data disks, >>> regardless of io_opt from member disks, otherwise raid5 have to issue >>> additional IO from other disks to build xor data. >>> >>> For example: >>> >>> - write aligned chunksize to one disk, actually means read chunksize >>> old xor data,then write chunksize data and chunksize new xor data. >>> - write aligned chunksize * data disks, new xor data can be build >>> directly without reading old xor data. >> >> I understand all of that. But you missed my point: io_opt simply indicates an >> upper bound for an IO size. 
If exceeded, performance may be degraded. This has >> *nothing* to do with the io granularity, which for a RAID array should ideally >> be equal to stride size x number of data disks. >> >> This is the confusion here. md setting io_opt to stride x number of disks in >> the array is simply not what io_opt is supposed to indicate. > > ok, can I ask where this upper bound for IO size comes from? SCSI SBC specifications, Block Limits VPD page (B0h): 3 values are important in there: * OPTIMAL TRANSFER LENGTH GRANULARITY: An OPTIMAL TRANSFER LENGTH GRANULARITY field set to a non-zero value indicates the optimal transfer length granularity size in logical blocks for a single command shown in the command column of table 33. If a device server receives one of these commands with a transfer size that is not equal to a multiple of this value, then the device server may incur delays in processing the command. An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates that the device server does not report optimal transfer length granularity. For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note that the naming here is dubious since this is not a minimum. The minimum is the logical block size. This is a "hint" for better performance. For a RAID array, this should be the stripe size of the RAID volume (stride x number of data disks). This value is used for queue->limits.io_min. * MAXIMUM TRANSFER LENGTH: A MAXIMUM TRANSFER LENGTH field set to a non-zero value indicates the maximum transfer length in logical blocks that the device server accepts for a single command shown in table 33. If a device server receives one of these commands with a transfer size greater than this value, then the device server shall terminate the command with CHECK CONDITION status with the sense key set to ILLEGAL REQUEST and the additional sense code set to the value shown in table 33.
A MAXIMUM TRANSFER LENGTH field set to 0000_0000h indicates that the device server does not report a limit on the transfer length. For a SCSI disk, sd.c uses this value for sdkp->max_xfer_blocks. This is a hard limit which will be reflected in queue->limits.max_dev_sectors (max_hw_sectors_kb in sysfs). * OPTIMAL TRANSFER LENGTH: An OPTIMAL TRANSFER LENGTH field set to a non-zero value indicates the optimal transfer size in logical blocks for a single command shown in table 33. If a device server receives one of these commands with a transfer size greater than this value, then the device server may incur delays in processing the command. An OPTIMAL TRANSFER LENGTH field set to 0000_0000h indicates that the device server does not report an optimal transfer size. For a SCSI disk, sd.c uses this value for sdkp->opt_xfer_blocks. This value is used for queue->limits.io_opt. > With git log, starting from commit 7e5f5fb09e6f ("block: Update topology > documentation"), the documentation contains a special explanation for > raid arrays, and the optimal_io_size says: > > For RAID arrays it is usually the > stripe width or the internal track size. A properly aligned > multiple of optimal_io_size is the preferred request size for > workloads where sustained throughput is desired. > > And this explanation is exactly what raid5 does: it's important that > the io size is an aligned multiple of io_opt. Looking at the sysfs doc for the above fields, they are described as follows: * /sys/block/<disk>/queue/minimum_io_size [RO] Storage devices may report a granularity or preferred minimum I/O size which is the smallest request the device can perform without incurring a performance penalty. For disk drives this is often the physical block size. For RAID arrays it is often the stripe chunk size. A properly aligned multiple of minimum_io_size is the preferred request size for workloads where a high number of I/O operations is desired.
So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a RAID array, this indeed should be the stride x number of data disks. * /sys/block/<disk>/queue/max_hw_sectors_kb [RO] This is the maximum number of kilobytes supported in a single data transfer. No problem here. * /sys/block/<disk>/queue/optimal_io_size Storage devices may report an optimal I/O size, which is the device's preferred unit for sustained I/O. This is rarely reported for disk drives. For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired. If no optimal I/O size is reported this file contains 0. Well, I find this definition not correct *at all*. This is repeating the definition of minimum_io_size (limits->io_min) and completely disregards the eventual optimal_io_size limit of the drives in the array. For a raid array, this value should obviously be a multiple of minimum_io_size (the array stripe size), but it can be much larger, since this should be an upper bound for IO size. read_ahead_kb being set using this value is thus not correct, I think. read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:49 ` Damien Le Moal @ 2025-07-28 7:14 ` Yu Kuai 2025-07-28 7:44 ` Damien Le Moal 2025-07-29 3:53 ` Martin K. Petersen 2025-07-29 3:49 ` Martin K. Petersen 1 sibling, 2 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-28 7:14 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 11:49, Damien Le Moal 写道: > On 7/28/25 12:08 PM, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 10:41, Damien Le Moal 写道: >>> On 7/28/25 9:55 AM, Yu Kuai wrote: >>>> Hi, >>>> >>>> 在 2025/07/28 8:39, Damien Le Moal 写道: >>>>> md setting its io_opt to 64K*number of drives in the array is strange... It >>>>> does not have to be that large since io_opt is an upper bound and not a "issue >>>>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>>>> exceed that IO size, performance may suffer. >>>>> >>>> >>>> At least from Documentation, for raid arrays, multiple of io_opt is the >>>> prefereed io size to the optimal io performance, and for raid5, this is >>>> chunksize * data disks. >>>> >>>>> So a default of stride size x number of drives for the io_opt may be OK, but >>>>> that should be bound to some reasonable value. Furthermore, this is likely >>>>> suboptimal. I woulld think that setting the md array io_opt initially to >>>>> min(all drives io_opt) x number of drives would be a better default. >>>> >>>> For raid5, this is not ok, the value have to be chunksize * data disks, >>>> regardless of io_opt from member disks, otherwise raid5 have to issue >>>> additional IO from other disks to build xor data. >>>> >>>> For example: >>>> >>>> - write aligned chunksize to one disk, actually means read chunksize >>>> old xor data,then write chunksize data and chunksize new xor data. >>>> - write aligned chunksize * data disks, new xor data can be build >>>> directly without reading old xor data. 
>>> >>> I understand all of that. But you missed my point: io_opt simply indicates an >>> upper bound for an IO size. If exceeded, performance may be degraded. This has >>> *nothing* to do with the io granularity, which for a RAID array should ideally >>> be equal to stride size x number of data disks. >>> >>> This is the confusion here. md setting io_opt to stride x number of disks in >>> the array is simply not what io_opt is supposed to indicate. >> >> ok, can I ask where is this upper bound for IO size from? > > SCSI SBC specifications, Block Limits VPD page (B0h): > > 3 values are important in there: > > * OPTIMAL TRANSFER LENGTH GRANULARITY: > > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to a non-zero value indicates > the optimal transfer length granularity size in logical blocks for a single > command shown in the command column of table 33. If a device server receives > one of these commands with a transfer size that is not equal to a multiple of > this value, then the device server may incur delays in processing the command. > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates that the > device server does not report optimal transfer length granularity. > > For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note that the > naming here is dubious since this is not a minimum. The minimum is the logical > block size. This is a "hint" for better performance. For a RAID area, this > should be the stripe size of the RAID volume (stride x number of data disks). > This value is used for queue->limits.io_min. > > * MAXIMUM TRANSFER LENGTH: > > A MAXIMUM TRANSFER LENGTH field set to a non-zero value indicates the maximum > transfer length in logical blocks that the device server accepts for a single > command shown in table 33. 
If a device server receives one of these commands > with a transfer size greater than this value, then the device server shall > terminate the command with CHECK CONDITION status with the sense key set to > ILLEGAL REQUEST and the additional sense code set to the value shown in table > 33. A MAXIMUM TRANSFER LENGTH field set to 0000_0000h indicates that the device > server does not report a limit on the transfer length. > > For a SCSI disk, sd.c uses this value for sdkp->max_xfer_blocks. This is a hard > limit which will be reflected in queue->limits.max_dev_sectors > (max_hw_sectors_kb in sysfs). > > * OPTIMAL TRANSFER LENGTH: > > An OPTIMAL TRANSFER LENGTH field set to a non-zero value indicates the optimal > transfer size in logical blocks for a single command shown in table 33. If a > device server receives one of these commands with a transfer size greater than > this value, then the device server may incur delays in processing the command. > An OPTIMAL TRANSFER LENGTH field set to 0000_0000h indicates that the device > server does not report an optimal transfer size. > > For a SCSI disk, sd.c uses this value for sdkp->opt_xfer_blocks. This value is > used for queue->limit.io_opt. Thanks for the explanation. > >> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >> documentation"), the documentation start contain specail explanation for >> raid array, and the optimal_io_size says: >> >> For RAID arrays it is usually the >> stripe width or the internal track size. A properly aligned >> multiple of optimal_io_size is the preferred request size for >> workloads where sustained throughput is desired. >> >> And this explanation is exactly what raid5 did, it's important that >> io size is aligned multiple of io_opt. 
> > Looking at the sysfs doc for the above fields, they are described as follows: > > * /sys/block/<disk>/queue/minimum_io_size > > [RO] Storage devices may report a granularity or preferred > minimum I/O size which is the smallest request the device can > perform without incurring a performance penalty. For disk > drives this is often the physical block size. For RAID arrays > it is often the stripe chunk size. A properly aligned multiple > of minimum_io_size is the preferred request size for workloads > where a high number of I/O operations is desired. > > So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a > RAID array, this indeed should be the stride x number of data disks. Do you mean stripe here? io_min for raid array is always just one chunksize. > > * /sys/block/<disk>/queue/max_hw_sectors_kb > > [RO] This is the maximum number of kilobytes supported in a > single data transfer. > > No problem here. > > * /sys/block/<disk>/queue/optimal_io_size > > Storage devices may report an optimal I/O size, which is > the device's preferred unit for sustained I/O. This is rarely > reported for disk drives. For RAID arrays it is usually the > stripe width or the internal track size. A properly aligned > multiple of optimal_io_size is the preferred request size for > workloads where sustained throughput is desired. If no optimal > I/O size is reported this file contains 0. > > Well, I find this definition not correct *at all*. This is repeating the > definition of minimum_io_size (limits->io_min) and completely disregard the > eventual optimal_io_size limit of the drives in the array. For a raid array, > this value should obviously be a multiple of minimum_io_size (the array stripe > size), but it can be much larger, since this should be an upper bound for IO > size. read_ahead_kb being set using this value is thus not correct I think. > read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. 
I think this is actually different from io_min, and io_opt for different levels is not the same; for raid0, raid10, raid456 (raid1 doesn't have a chunksize): - lim.io_min = mddev->chunk_sectors << 9; - lim.io_opt = lim.io_min * (number of data copies); And I think they do match the definition above, specifically: - a properly aligned multiple of io_min to *prevent a performance penalty*; - a properly aligned multiple of io_opt to *get optimal performance*, the number of data copies times the performance of a single disk; The original problem is that scsi disks report an unusual io_opt of 32767 sectors, and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunksize). The lcm_not_zero() from blk_stack_limits() ends up with a huge value: blk_stack_limits() t->io_min = max(t->io_min, b->io_min); t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. The io_opt is used in a raid array as the minimal aligned size to get optimal IO performance, not as an upper bound. In that respect, using this value for ra_pages makes sense. However, if scsi is using this value as an IO upper bound, then indeed this doesn't make sense. Thanks, Kuai > > ^ permalink raw reply [flat|nested] 20+ messages in thread
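The blow-up Kuai describes is easy to reproduce outside the kernel. A quick sketch (plain Python, using the numbers from this thread: a 448 KiB raid5 stripe stacked on a member disk reporting an io_opt of 32767 sectors):

```python
# Reproducing the io_opt blow-up from lcm_not_zero() in blk_stack_limits()
# (a sketch, not kernel code). 32767 sectors is an odd-sized limit, so it
# shares almost no factors with the 448 KiB stripe, and the lcm explodes.
from math import lcm

SECTOR = 512
raid_io_opt = 64 * 1024 * 7    # 458752 bytes: 64 KiB chunk x 7 data disks
disk_io_opt = 32767 * SECTOR   # 16776704 bytes: the max_sectors-derived limit

stacked = lcm(raid_io_opt, disk_io_opt)
print(stacked)  # 2147418112 (~2 GiB), matching the ">1 GiB" in the report
```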
* Re: Improper io_opt setting for md raid5 2025-07-28 7:14 ` Yu Kuai @ 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai ` (2 more replies) 2025-07-29 3:53 ` Martin K. Petersen 1 sibling, 3 replies; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 7:44 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 4:14 PM, Yu Kuai wrote: >>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>> documentation"), the documentation start contain specail explanation for >>> raid array, and the optimal_io_size says: >>> >>> For RAID arrays it is usually the >>> stripe width or the internal track size. A properly aligned >>> multiple of optimal_io_size is the preferred request size for >>> workloads where sustained throughput is desired. >>> >>> And this explanation is exactly what raid5 did, it's important that >>> io size is aligned multiple of io_opt. >> >> Looking at the sysfs doc for the above fields, they are described as follows: >> >> * /sys/block/<disk>/queue/minimum_io_size >> >> [RO] Storage devices may report a granularity or preferred >> minimum I/O size which is the smallest request the device can >> perform without incurring a performance penalty. For disk >> drives this is often the physical block size. For RAID arrays >> it is often the stripe chunk size. A properly aligned multiple >> of minimum_io_size is the preferred request size for workloads >> where a high number of I/O operations is desired. >> >> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >> RAID array, this indeed should be the stride x number of data disks. > > Do you mean stripe here? io_min for raid array is always just one > chunksize. My bad, yes, that is the definition in sysfs. So io_min is the stride size, where: stride size x number of data disks == stripe_size. 
Note that the chunk_sectors limit is the *stripe* size, not the per-drive stride. Beware of the wording here to avoid confusion (this is all already super confusing!). Well, at least, that is how I interpret the io_min definition of minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For RAID arrays it is often the stripe chunk size." is super confusing. Not entirely sure if stride or stripe was meant here... >> * /sys/block/<disk>/queue/optimal_io_size >> >> Storage devices may report an optimal I/O size, which is >> the device's preferred unit for sustained I/O. This is rarely >> reported for disk drives. For RAID arrays it is usually the >> stripe width or the internal track size. A properly aligned >> multiple of optimal_io_size is the preferred request size for >> workloads where sustained throughput is desired. If no optimal >> I/O size is reported this file contains 0. >> >> Well, I find this definition not correct *at all*. This is repeating the >> definition of minimum_io_size (limits->io_min) and completely disregard the >> eventual optimal_io_size limit of the drives in the array. For a raid array, >> this value should obviously be a multiple of minimum_io_size (the array stripe >> size), but it can be much larger, since this should be an upper bound for IO >> size. read_ahead_kb being set using this value is thus not correct I think. >> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. > > I think this is actually different than io_min, and io_opt for different > levels are not the same, for raid0, raid10, raid456(raid1 doesn't have > chunksize): > - lim.io_min = mddev->chunk_sectors << 9; See above. Given how confusing the definition of minimum_io_size is, not sure that is correct. This code assumes that io_min is the stripe size and not the stride size. > - lim.io_opt = lim.io_min * (number of data copies); I do not understand what you mean by "number of data copies"...
There is no data copy in a RAID 5/6 array. > And I think they do match the definition above, specifically: > - properly multiple aligned io_min to *prevent performance penalty*; Yes. > - properly multiple aligned io_opt to *get optimal performance*, the > number of data copies times the performance of a single disk; That is how this field is defined for RAID, but that is far from what it means for a single disk. It is unfortunate that it was defined like that. For a single disk, io_opt is NOT about getting optimal_performance. It is about an upper bound for the IO size to NOT get a performance penalty (e.g. due to a DMA mapping that is too large for what the IOMMU can handle). And for a RAID array, it means that we should always have io_min == io_opt but it seems that the scsi code and limit stacking code try to make this limit an upper bound on the IO size, aligned to the stripe size. > The orginal problem is that scsi disks report unusual io_opt 32767, > and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise). The > lcm_not_zero() from blk_stack_limits() end up with a huge value: > > blk_stack_limits() > t->io_min = max(t->io_min, b->io_min); > t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); I understand the "problem" that was stated. There is an overflow that result in a large io_opt and a ridiculously large read_ahead_kb. io_opt being large should in my opinion not be an issue in itself, since it should be an upper bound on IO size and not the stripe size (io_min indicates that). >> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. > > The io_opt is used in raid array as minimal aligned size to get optimal > IO performance, not the upper bound. With the respect of this, use this > value for ra_pages make sense. However, if scsi is using this value as > IO upper bound, it's right this doesn't make sense. Here is your issue. 
People misunderstood optimal_io_size and used it instead of the minimal_io_size/io_min limit for the granularity/alignment of IOs. Using optimal_io_size as the "granularity" for optimal IOs that do not require read-modify-write of RAID stripes is simply wrong in my opinion. io_min/minimal_io_size is the attribute indicating that. As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should be initialized to a smaller value aligned to io_min (if io_opt is unreasonably large). Given all of that and how misused io_opt seems to be, I am not sure how to fix this though. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
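For reference, the readahead side effect reported at the top of the thread follows directly from doubling such a huge io_opt. A sketch of the arithmetic (plain Python modeling the 32-bit math in blk_apply_bdi_limits(), using the md RAID6 value from the original report):

```python
# Sketch of how a ~4 GiB io_opt corrupts the default readahead: the
# kernel derives ra_pages from io_opt * 2, and the doubling wraps a
# 32-bit value (modeled here explicitly; Python ints do not overflow).
PAGE_SIZE = 4096
io_opt = 4293918720  # bytes; the md RAID6 io_opt from the original report

ra_pages_intended = io_opt * 2 // PAGE_SIZE           # without overflow
ra_pages_actual = (io_opt * 2) % 2**32 // PAGE_SIZE   # with u32 wraparound

print(ra_pages_actual * PAGE_SIZE // 1024)  # 4192256, the read_ahead_kb seen
```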
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal @ 2025-07-28 9:02 ` Yu Kuai 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:13 ` Hannes Reinecke 2025-07-28 10:56 ` Csordás Hunor 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 2 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-28 9:02 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 15:44, Damien Le Moal 写道: > On 7/28/25 4:14 PM, Yu Kuai wrote: >>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>> documentation"), the documentation start contain specail explanation for >>>> raid array, and the optimal_io_size says: >>>> >>>> For RAID arrays it is usually the >>>> stripe width or the internal track size. A properly aligned >>>> multiple of optimal_io_size is the preferred request size for >>>> workloads where sustained throughput is desired. >>>> >>>> And this explanation is exactly what raid5 did, it's important that >>>> io size is aligned multiple of io_opt. >>> >>> Looking at the sysfs doc for the above fields, they are described as follows: >>> >>> * /sys/block/<disk>/queue/minimum_io_size >>> >>> [RO] Storage devices may report a granularity or preferred >>> minimum I/O size which is the smallest request the device can >>> perform without incurring a performance penalty. For disk >>> drives this is often the physical block size. For RAID arrays >>> it is often the stripe chunk size. A properly aligned multiple >>> of minimum_io_size is the preferred request size for workloads >>> where a high number of I/O operations is desired. >>> >>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >>> RAID array, this indeed should be the stride x number of data disks. >> >> Do you mean stripe here? io_min for raid array is always just one >> chunksize. > > My bad, yes, that is the definition in sysfs. 
So io_min is the stride size, where: > > stride size x number of data disks == stripe_size. > Yes. > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > Beware of the wording here to avoid confusion (this is all already super > confusing !). This is something where we're not on the same page :( For example, take an 8-disk raid5 with the default chunk size. Then the above calculation is: 64k * 7 = 448k The chunksize I meant is 64k... > > Well, at least, that is how I interpret the io_min definition of > minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For > RAID arrays it is often the stripe chunk size." is super confusing. Not > entirely sure if stride or stripe was meant here... > Hope it's clear now. > >>> * /sys/block/<disk>/queue/optimal_io_size >>> >>> Storage devices may report an optimal I/O size, which is >>> the device's preferred unit for sustained I/O. This is rarely >>> reported for disk drives. For RAID arrays it is usually the >>> stripe width or the internal track size. A properly aligned >>> multiple of optimal_io_size is the preferred request size for >>> workloads where sustained throughput is desired. If no optimal >>> I/O size is reported this file contains 0. >>> >>> Well, I find this definition not correct *at all*. This is repeating the >>> definition of minimum_io_size (limits->io_min) and completely disregard the >>> eventual optimal_io_size limit of the drives in the array. For a raid array, >>> this value should obviously be a multiple of minimum_io_size (the array stripe >>> size), but it can be much larger, since this should be an upper bound for IO >>> size. read_ahead_kb being set using this value is thus not correct I think. >>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>> >> I think this is actually different than io_min, and io_opt for different >> levels are not the same, for raid0, raid10, raid456(raid1 doesn't have >> chunksize): >> - lim.io_min = mddev->chunk_sectors << 9; In the above example, io_min = 64k, and io_opt = 448k. And to make sure we're on the same page: io_min is the *stride* and io_opt is the *stripe*. > > See above. Given how confusing the definition of minimum_io_size is, not sure > that is correct. This code assumes that io_min is the stripe size and not the > stride size. > >> - lim.io_opt = lim.io_min * (number of data copies); > > I do not understand what you mean with "number of data copies"... There is no > data copy in a RAID 5/6 array. Yes, this is my bad, *data disks* is the better word. > >> And I think they do match the definition above, specifically: >> - properly multiple aligned io_min to *prevent performance penalty*; > > Yes. > >> - properly multiple aligned io_opt to *get optimal performance*, the >> number of data copies times the performance of a single disk; > > That is how this field is defined for RAID, but that is far from what it means > for a single disk. It is unfortunate that it was defined like that. > > For a single disk, io_opt is NOT about getting optimal_performance. It is about > an upper bound for the IO size to NOT get a performance penalty (e.g. due to a > DMA mapping that is too large for what the IOMMU can handle). The name itself is misleading. :( I didn't know this definition until now. > > And for a RAID array, it means that we should always have io_min == io_opt but > it seems that the scsi code and limit stacking code try to make this limit an > upper bound on the IO size, aligned to the stripe size. > >> The orginal problem is that scsi disks report unusual io_opt 32767, >> and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise).
The >> lcm_not_zero() from blk_stack_limits() end up with a huge value: >> >> blk_stack_limits() >> t->io_min = max(t->io_min, b->io_min); >> t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > > I understand the "problem" that was stated. There is an overflow that result in > a large io_opt and a ridiculously large read_ahead_kb. > io_opt being large should in my opinion not be an issue in itself, since it > should be an upper bound on IO size and not the stripe size (io_min indicates > that). > >>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. >> >> The io_opt is used in raid array as minimal aligned size to get optimal >> IO performance, not the upper bound. With the respect of this, use this >> value for ra_pages make sense. However, if scsi is using this value as >> IO upper bound, it's right this doesn't make sense. > > Here is your issue. People misunderstood optimal_io_size and used that instead > of using minimal_io_size/io_min limit for the granularity/alignment of IOs. > Using optimal_io_size as the "granularity" for optimal IOs that do not require > read-modify-write of RAID stripes is simply wrong in my optinion. > io_min/minimal_io_size is the attribute indicating that. Ok, looks like there are two problems now: a) io_min, size to prevent performance penalty; 1) For raid5, to avoid read-modify-write, this value should be 448k, but now it's 64k; 2) For raid0/raid10, this value is set to 64k now, however, this value should not be set. If the value in member disks is 4k, issuing 4k is just fine, there won't be any performance penalty; 3) For raid1, this value is not set, and will use member disks, this is correct. b) io_opt, size to ??? 4) For raid0/raid10/raid5, this value is set to the minimal IO size to get best performance. 5) For raid1, this value is not set, and will use member disks. Problem a can be fixed easily, but for problem b, I'm not sure how to fix it; it depends on what we think io_opt means.
If io_opt should be *upper bound*, problem 4) should be fixed like case 5), and other places like blk_apply_bdi_limits() setting ra_pages by io_opt should be fixed as well. If io_opt should be *minimal IO size to get best performance*, problem 5) should be fixed like case 4), and I don't know if scsi or other drivers that set the initial io_opt should be changed. :( Thanks, Kuai > > As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should > be initialized to a smaller value aligned to io_min (if io_opt is unreasonably > large). > > Given all of that and how misused io_opt seems to be, I am not sure how to fix > this though. > ^ permalink raw reply [flat|nested] 20+ messages in thread
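Kuai's enumeration of the current per-level behaviour can be condensed into a small sketch (hypothetical `md_limits()` helper, illustrative only; this is not a kernel function):

```python
# Condensed sketch of the md queue-limit choices enumerated above
# (hypothetical helper, not kernel code). chunk is the per-drive chunk
# size in bytes; data_disks is the number of data-bearing disks.
def md_limits(level, chunk, data_disks):
    if level == 1:
        # raid1 has no chunk; limits are inherited from the member disks
        return {"io_min": None, "io_opt": None}
    # raid0/raid10/raid456: io_min = chunk, io_opt = chunk * data disks
    return {"io_min": chunk, "io_opt": chunk * data_disks}

print(md_limits(5, 64 * 1024, 7))  # {'io_min': 65536, 'io_opt': 458752}
```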
* Re: Improper io_opt setting for md raid5 2025-07-28 9:02 ` Yu Kuai @ 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:25 ` Yu Kuai 2025-07-29 22:02 ` Tony Battersby 2025-07-29 6:13 ` Hannes Reinecke 1 sibling, 2 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:23 UTC (permalink / raw) To: Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) > Ok, looks like there are two problems now: > > a) io_min, size to prevent performance penalty; > > 1) For raid5, to avoid read-modify-write, this value should be 448k, > but now it's 64k; You have two penalties for RAID5: Writes smaller than the stripe chunk size and writes smaller than the full stripe width. > 2) For raid0/raid10, this value is set to 64k now, however, this value > should not set. If the value in member disks is 4k, issue 4k is just > fine, there won't be any performance penalty; Correct. > 3) For raid1, this value is not set, and will use member disks, this is > correct. Correct. > b) io_opt, size to ??? > 4) For raid0/raid10/rai5, this value is set to mininal IO size to get > best performance. For RAID 0 you want to set io_opt to the stripe width. io_opt is for sequential, throughput-optimized I/O. Presumably the MD stripe chunk size has been chosen based on knowledge about the underlying disks and their performance. And thus maximum throughput will be achieved when doing full stripe writes across all drives. For software RAID I am not sure how much this really matters in a modern context. It certainly did 25 years ago when we benchmarked things for XFS. Full stripe writes were a big improvement with both software and hardware RAID. But how much this matters today, I am not sure. > 5) For raid1, this value is not set, and will use member disks. Correct. 
> > If io_opt should be *upper bound*, problem 4) should be fixed like case > 5), and other places like blk_apply_bdi_limits() setting ra_pages by > io_opt should be fixed as well. I understand Damien's "upper bound" interpretation but it does not take alignment and granularity into account. And both are imperative for io_opt. > If io_opt should be *mininal IO size to get best performance*, What is "best performance"? IOPS or throughput? io_min is about IOPS. io_opt is about throughput. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 4:23 ` Martin K. Petersen @ 2025-07-29 6:25 ` Yu Kuai 2025-07-29 22:02 ` Tony Battersby 1 sibling, 0 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-29 6:25 UTC (permalink / raw) To: Martin K. Petersen, Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, linux-scsi, yukuai (C) Hi, Martin 在 2025/07/29 12:23, Martin K. Petersen 写道: > >> Ok, looks like there are two problems now: >> >> a) io_min, size to prevent performance penalty; >> >> 1) For raid5, to avoid read-modify-write, this value should be 448k, >> but now it's 64k; > > You have two penalties for RAID5: Writes smaller than the stripe chunk > size and writes smaller than the full stripe width. Yes, the internal IO size for raid5 is 4k; however, only a full stripe write (448k here) can prevent read-modify-write. > >> 2) For raid0/raid10, this value is set to 64k now, however, this value >> should not set. If the value in member disks is 4k, issue 4k is just >> fine, there won't be any performance penalty; > > Correct. > >> 3) For raid1, this value is not set, and will use member disks, this is >> correct. > > Correct. > >> b) io_opt, size to ??? >> 4) For raid0/raid10/rai5, this value is set to mininal IO size to get >> best performance. > > For RAID 0 you want to set io_opt to the stripe width. io_opt is for > sequential, throughput-optimized I/O. Presumably the MD stripe chunk > size has been chosen based on knowledge about the underlying disks and > their performance. And thus maximum throughput will be achieved when > doing full stripe writes across all drives. Yes, raid0/raid10/raid5 all follow the same logic: properly aligned sequential IO can reach the number of data disks times single disk performance. > > For software RAID I am not sure how much this really matters in a modern > context. It certainly did 25 years ago when we benchmarked things for > XFS.
Full stripe writes were a big improvement with both software and > hardware RAID. But how much this matters today, I am not sure. For raid1, write throughput will be no better than single disk performance. However, for reads, the io_opt should be the sum of the member disks' io_opt; see should_choose_next(): for sequential reads, raid1 will switch to the next rdev after reading io_opt bytes from the current rdev. > >> 5) For raid1, this value is not set, and will use member disks. > > Correct. > >> >> If io_opt should be *upper bound*, problem 4) should be fixed like case >> 5), and other places like blk_apply_bdi_limits() setting ra_pages by >> io_opt should be fixed as well. > > I understand Damien's "upper bound" interpretation but it does not take > alignment and granularity into account. And both are imperative for > io_opt. > >> If io_opt should be *mininal IO size to get best performance*, > > What is "best performance"? IOPS or throughput? > > io_min is about IOPS. io_opt is about throughput. I mean throughput here. > Thanks, Kuai ^ permalink raw reply [flat|nested] 20+ messages in thread
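The raid1 sequential-read behaviour Kuai refers to can be sketched as follows (a loose model of the should_choose_next() idea, with a hypothetical pick_mirror() helper; not the actual kernel logic):

```python
# Loose model of raid1 sequential-read balancing: once io_opt bytes have
# been read from the current mirror, rotate to the next one, so that a
# long sequential read ends up spread across all mirrors.
def pick_mirror(current, read_so_far, io_opt, nr_mirrors):
    """Return the mirror index to use for the next sequential read."""
    if read_so_far >= io_opt:
        return (current + 1) % nr_mirrors  # switch after io_opt bytes
    return current  # keep streaming from the current mirror

print(pick_mirror(0, 64 * 1024, 64 * 1024, 2))  # 1: time to switch
```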
* Re: Improper io_opt setting for md raid5 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:25 ` Yu Kuai @ 2025-07-29 22:02 ` Tony Battersby 1 sibling, 0 replies; 20+ messages in thread From: Tony Battersby @ 2025-07-29 22:02 UTC (permalink / raw) To: Martin K. Petersen, Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, linux-scsi, yukuai (C) On 7/29/25 00:23, Martin K. Petersen wrote: >> b) io_opt, size to ??? >> 4) For raid0/raid10/rai5, this value is set to mininal IO size to get >> best performance. > For software RAID I am not sure how much this really matters in a modern > context. It certainly did 25 years ago when we benchmarked things for > XFS. Full stripe writes were a big improvement with both software and > hardware RAID. But how much this matters today, I am not sure. > FWIW, I just posted a patch that aligns writes to stripe boundaries using io_opt: https://lore.kernel.org/all/55deda1d-967d-4d68-a9ba-4d5139374a37@cybernetics.com/ I get about a 2.3% performance improvement with md-raid6, but I have an out-of-tree RAID driver that gets more like 4x improvement. If io_opt means different things to different code, might we consider adding another field to the queue limits to give explicit stripe parameters? Tony Battersby Cybernetics ^ permalink raw reply [flat|nested] 20+ messages in thread
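The idea in Tony's patch, splitting a write so that its tail lands on a stripe boundary, can be sketched like this (hypothetical split_at_stripe() helper; the actual patch linked above is the authoritative implementation):

```python
# Sketch of stripe-aligned write splitting: if a write starts mid-stripe
# and crosses the next io_opt (stripe) boundary, split off a head piece
# so the remainder starts stripe-aligned (hypothetical helper).
def split_at_stripe(offset, length, io_opt):
    """Return (head, tail) lengths for a write at the given byte offset."""
    boundary = (offset // io_opt + 1) * io_opt
    if offset % io_opt == 0 or offset + length <= boundary:
        return (length, 0)  # already aligned, or fits before the boundary
    return (boundary - offset, offset + length - boundary)

# 448 KiB write starting 100 KiB into a 448 KiB stripe:
print(split_at_stripe(100 * 1024, 448 * 1024, 448 * 1024))  # (356352, 102400)
```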
* Re: Improper io_opt setting for md raid5 2025-07-28 9:02 ` Yu Kuai 2025-07-29 4:23 ` Martin K. Petersen @ 2025-07-29 6:13 ` Hannes Reinecke 2025-07-29 6:29 ` Yu Kuai 2025-07-29 22:24 ` Keith Busch 1 sibling, 2 replies; 20+ messages in thread From: Hannes Reinecke @ 2025-07-29 6:13 UTC (permalink / raw) To: Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 11:02, Yu Kuai wrote: > Hi, > > 在 2025/07/28 15:44, Damien Le Moal 写道: >> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>> documentation"), the documentation start contain specail >>>>> explanation for >>>>> raid array, and the optimal_io_size says: >>>>> >>>>> For RAID arrays it is usually the >>>>> stripe width or the internal track size. A properly aligned >>>>> multiple of optimal_io_size is the preferred request size for >>>>> workloads where sustained throughput is desired. >>>>> >>>>> And this explanation is exactly what raid5 did, it's important that >>>>> io size is aligned multiple of io_opt. >>>> >>>> Looking at the sysfs doc for the above fields, they are described as >>>> follows: >>>> >>>> * /sys/block/<disk>/queue/minimum_io_size >>>> >>>> [RO] Storage devices may report a granularity or preferred >>>> minimum I/O size which is the smallest request the device can >>>> perform without incurring a performance penalty. For disk >>>> drives this is often the physical block size. For RAID arrays >>>> it is often the stripe chunk size. A properly aligned multiple >>>> of minimum_io_size is the preferred request size for workloads >>>> where a high number of I/O operations is desired. >>>> >>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY >>>> and for a >>>> RAID array, this indeed should be the stride x number of data disks. >>> >>> Do you mean stripe here? io_min for raid array is always just one >>> chunksize. 
>> >> My bad, yes, that is the definition in sysfs. So io_min is the stride >> size, where: >> >> stride size x number of data disks == stripe_size. >> > Yes. > >> Note that chunk_sectors limit is the *stripe* size, not per drive stride. >> Beware of the wording here to avoid confusion (this is all already super >> confusing !). > > This is something we're not in the same page :( For example, 8 disks > raid5, with default chunk size. Then the above calculation is: > > 64k * 7 = 448k > > The chunksize I said is 64k... Hmm. I always thought that the 'chunksize' is the limit which I/O must not cross to avoid being split. So for RAID 4/5/6 I would have thought this to be the stride size, as MD must split larger I/O onto two disks. Sure, one could argue that the stripe size is the chunk size, but then MD will have to split that I/O... Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 6:13 ` Hannes Reinecke @ 2025-07-29 6:29 ` Yu Kuai 2025-07-29 22:24 ` Keith Busch 1 sibling, 0 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-29 6:29 UTC (permalink / raw) To: Hannes Reinecke, Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/29 14:13, Hannes Reinecke 写道: > On 7/28/25 11:02, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 15:44, Damien Le Moal 写道: >>> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>>> documentation"), the documentation start contain specail >>>>>> explanation for >>>>>> raid array, and the optimal_io_size says: >>>>>> >>>>>> For RAID arrays it is usually the >>>>>> stripe width or the internal track size. A properly aligned >>>>>> multiple of optimal_io_size is the preferred request size for >>>>>> workloads where sustained throughput is desired. >>>>>> >>>>>> And this explanation is exactly what raid5 did, it's important that >>>>>> io size is aligned multiple of io_opt. >>>>> >>>>> Looking at the sysfs doc for the above fields, they are described >>>>> as follows: >>>>> >>>>> * /sys/block/<disk>/queue/minimum_io_size >>>>> >>>>> [RO] Storage devices may report a granularity or preferred >>>>> minimum I/O size which is the smallest request the device can >>>>> perform without incurring a performance penalty. For disk >>>>> drives this is often the physical block size. For RAID arrays >>>>> it is often the stripe chunk size. A properly aligned multiple >>>>> of minimum_io_size is the preferred request size for workloads >>>>> where a high number of I/O operations is desired. >>>>> >>>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY >>>>> and for a >>>>> RAID array, this indeed should be the stride x number of data disks. >>>> >>>> Do you mean stripe here? 
io_min for raid array is always just one >>>> chunksize. >>> >>> My bad, yes, that is the definition in sysfs. So io_min is the stride >>> size, where: >>> >>> stride size x number of data disks == stripe_size. >>> >> Yes. >> >>> Note that chunk_sectors limit is the *stripe* size, not per drive >>> stride. >>> Beware of the wording here to avoid confusion (this is all already super >>> confusing !). >> >> This is something we're not in the same page :( For example, 8 disks >> raid5, with default chunk size. Then the above calculation is: >> >> 64k * 7 = 448k >> >> The chunksize I said is 64k... > > Hmm. I always thought that the 'chunksize' is the limit which I/O must > not cross to avoid being split. > So for RAID 4/5/6 I would have thought this to be the stride size, > as MD must split larger I/O onto two disks. > Sure, one could argue that the stripe size is the chunk size, but then > MD will have to split that I/O... BTW, I always thought chunksize to be stride size simply because there is a metadata field in mddev superblock named 'chunk_size', which is the stride size. Thanks, Kuai > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 6:13 ` Hannes Reinecke 2025-07-29 6:29 ` Yu Kuai @ 2025-07-29 22:24 ` Keith Busch 1 sibling, 0 replies; 20+ messages in thread From: Keith Busch @ 2025-07-29 22:24 UTC (permalink / raw) To: Hannes Reinecke Cc: Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On Tue, Jul 29, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote: > > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > > > Beware of the wording here to avoid confusion (this is all already super > > > confusing !). > > > > This is something we're not in the same page :( For example, 8 disks > > raid5, with default chunk size. Then the above calculation is: > > > > 64k * 7 = 448k > > > > The chunksize I said is 64k... > > Hmm. I always thought that the 'chunksize' is the limit which I/O must > not cross to avoid being split. > So for RAID 4/5/6 I would have thought this to be the stride size, > as MD must split larger I/O onto two disks. > Sure, one could argue that the stripe size is the chunk size, but then > MD will have to split that I/O... Yah, I think that makes sense. At least the way nvme uses "chunk_size", it was assumed to mean the boundary for when the backend handling will split your request for different isolated media. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai @ 2025-07-28 10:56 ` Csordás Hunor 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 0 replies; 20+ messages in thread From: Csordás Hunor @ 2025-07-28 10:56 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/2025 11:02 AM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 15:44, Damien Le Moal 写道: >> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>> documentation"), the documentation start contain specail explanation for >>>>> raid array, and the optimal_io_size says: >>>>> >>>>> For RAID arrays it is usually the >>>>> stripe width or the internal track size. A properly aligned >>>>> multiple of optimal_io_size is the preferred request size for >>>>> workloads where sustained throughput is desired. >>>>> >>>>> And this explanation is exactly what raid5 did, it's important that >>>>> io size is aligned multiple of io_opt. >>>> >>>> Looking at the sysfs doc for the above fields, they are described as follows: >>>> >>>> * /sys/block/<disk>/queue/minimum_io_size >>>> >>>> [RO] Storage devices may report a granularity or preferred >>>> minimum I/O size which is the smallest request the device can >>>> perform without incurring a performance penalty. For disk >>>> drives this is often the physical block size. For RAID arrays >>>> it is often the stripe chunk size. A properly aligned multiple >>>> of minimum_io_size is the preferred request size for workloads >>>> where a high number of I/O operations is desired. >>>> >>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >>>> RAID array, this indeed should be the stride x number of data disks. >>> >>> Do you mean stripe here? io_min for raid array is always just one >>> chunksize. >> >> My bad, yes, that is the definition in sysfs. 
So io_min is the stride size, where: >> >> stride size x number of data disks == stripe_size. >> > Yes. > >> Note that chunk_sectors limit is the *stripe* size, not per drive stride. >> Beware of the wording here to avoid confusion (this is all already super >> confusing !). > > This is something we're not in the same page :( For example, 8 disks > raid5, with default chunk size. Then the above calculation is: > > 64k * 7 = 448k > > The chunksize I said is 64k... >> >> Well, at least, that is how I interpret the io_min definition of >> minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For >> RAID arrays it is often the stripe chunk size." is super confusing. Not >> entirely sure if stride or stripe was meant here... >> > > Hope it's clear now. >> >>>> * /sys/block/<disk>/queue/optimal_io_size >>>> >>>> Storage devices may report an optimal I/O size, which is >>>> the device's preferred unit for sustained I/O. This is rarely >>>> reported for disk drives. For RAID arrays it is usually the >>>> stripe width or the internal track size. A properly aligned >>>> multiple of optimal_io_size is the preferred request size for >>>> workloads where sustained throughput is desired. If no optimal >>>> I/O size is reported this file contains 0. >>>> >>>> Well, I find this definition not correct *at all*. This is repeating the >>>> definition of minimum_io_size (limits->io_min) and completely disregard the >>>> eventual optimal_io_size limit of the drives in the array. For a raid array, >>>> this value should obviously be a multiple of minimum_io_size (the array stripe >>>> size), but it can be much larger, since this should be an upper bound for IO >>>> size. read_ahead_kb being set using this value is thus not correct I think. >>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. 
>>> >>> I think this is actually different than io_min, and io_opt for different >>> levels are not the same, for raid0, raid10, raid456(raid1 doesn't have >>> chunksize): >>> - lim.io_min = mddev->chunk_sectors << 9; > > By the above example, io_min = 64k, and io_opt = 448k. And make sure > we're on the same page, io_min is the *stride* and io_opt is the > *stripe*. >> >> See above. Given how confusing the definition of minimum_io_size is, not sure >> that is correct. This code assumes that io_min is the stripe size and not the >> stride size. >> >>> - lim.io_opt = lim.io_min * (number of data copies); >> >> I do not understand what you mean with "number of data copies"... There is no >> data copy in a RAID 5/6 array. > > Yes, this is my bad, *data disks* is the better word. >> >>> And I think they do match the definition above, specifically: >>> - properly multiple aligned io_min to *prevent performance penalty*; >> >> Yes. >> >>> - properly multiple aligned io_opt to *get optimal performance*, the >>> number of data copies times the performance of a single disk; >> >> That is how this field is defined for RAID, but that is far from what it means >> for a single disk. It is unfortunate that it was defined like that. >> >> For a single disk, io_opt is NOT about getting optimal_performance. It is about >> an upper bound for the IO size to NOT get a performance penalty (e.g. due to a >> DMA mapping that is too large for what the IOMMU can handle). > > The name itself is misleading. :( I didn't know this definition until > now. > >> >> And for a RAID array, it means that we should always have io_min == io_opt but >> it seems that the scsi code and limit stacking code try to make this limit an >> upper bound on the IO size, aligned to the stripe size. >> >>> The orginal problem is that scsi disks report unusual io_opt 32767, >>> and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise). 
>>> The lcm_not_zero() from blk_stack_limits() end up with a huge value:
>>>
>>> blk_stack_limits()
>>>     t->io_min = max(t->io_min, b->io_min);
>>>     t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>>
>> I understand the "problem" that was stated. There is an overflow that results in
>> a large io_opt and a ridiculously large read_ahead_kb.
>> io_opt being large should in my opinion not be an issue in itself, since it
>> should be an upper bound on IO size and not the stripe size (io_min indicates
>> that).
>>
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>>>
>>> The io_opt is used in raid array as minimal aligned size to get optimal
>>> IO performance, not the upper bound. With the respect of this, use this
>>> value for ra_pages make sense. However, if scsi is using this value as
>>> IO upper bound, it's right this doesn't make sense.
>>
>> Here is your issue. People misunderstood optimal_io_size and used that instead
>> of using minimal_io_size/io_min limit for the granularity/alignment of IOs.
>> Using optimal_io_size as the "granularity" for optimal IOs that do not require
>> read-modify-write of RAID stripes is simply wrong in my opinion.
>> io_min/minimal_io_size is the attribute indicating that.

I'm not familiar enough with all the code using io_opt to be certain, but I think it's a little bit the other way around. The problem definitely seems to be that we want to use one variable for multiple different purposes.

io_opt in struct queue_limits is the "optimal I/O size", but "optimal" can mean many things in different contexts. In general, if I want to do some I/O, I can say that the optimal way to do it is to make I/O requests that satisfy some condition. The condition can be many things:

- The request size should be at least X (combining two small requests into a big one may be faster, or extending a small request into a bigger one may avoid having to do another request later).
- The request size should be at most X (for example, because DMA is inefficient on this system with request sizes too large -- this is the _only_ thing that shost->opt_sectors is currently set for).
- The request size should be a multiple of X, _and_ the request should have an alignment of X (this is what a striped md array wants).

And, of course, there can be many other "optimality" conditions. The ones listed above all have a parameter (X), which can, in the context of that condition, be called the "optimal I/O size". These parameters, however, can be very different for each condition; the right parameter for one condition can be completely inappropriate for another.

The SCSI standard may have a definition for "optimal transfer length", but io_opt in struct queue_limits seems to be used for a different purpose since its introduction in 2009:

- It was introduced in commit c72758f33784 ("block: Export I/O topology for block devices and partitions"). The commit message specifically mentions its use by md:

      The io_opt characteristic indicates the optimal I/O size reported by
      the device. This is usually the stripe width for arrays.

- It actually started being set by md in commit 8f6c2e4b325a ("md: Use new topology calls to indicate alignment and I/O sizes"). The commit date is more than a month later than the last, but as far as I can see, they were originally posted in the same patch series: https://lore.kernel.org/all/20090424123054.GA1926@parisc-linux.org/T/#u In the context of that patch series, md was literally the first user of io_opt.
- Using the lowest common multiple of the component devices and the array to calculate the final io_opt of the array happened in commit 70dd5bf3b999 ("block: Stack optimal I/O size"), still in 2009.
It wasn't until commit a23634644afc ("block: take io_opt and io_min into account for max_sectors") in 2024 that sd_revalidate_disk started to set io_opt from what, in the context of the SCSI standard, should be called the optimal I/O size. However, in doing so, it broke md arrays. With my setup, this was hidden until commit 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size"), but nonetheless, it happened here.

> Ok, looks like there are two problems now:
>
> a) io_min, size to prevent performance penalty;
>
> 1) For raid5, to avoid read-modify-write, this value should be 448k,
> but now it's 64k;
> 2) For raid0/raid10, this value is set to 64k now, however, this value
> should not set. If the value in member disks is 4k, issue 4k is just
> fine, there won't be any performance penalty;
> 3) For raid1, this value is not set, and will use member disks, this is
> correct.
>
> b) io_opt, size to ???
> 4) For raid0/raid10/raid5, this value is set to minimal IO size to get
> best performance.
> 5) For raid1, this value is not set, and will use member disks.
>
> Problem a can be fixed easily, and for problem b, I'm not sure how to
> fix it as well, it depends on how we think io_opt is.
>
> If io_opt should be *upper bound*, problem 4) should be fixed like case
> 5), and other places like blk_apply_bdi_limits() setting ra_pages by
> io_opt should be fixed as well.
>
> If io_opt should be *minimal IO size to get best performance*, problem
> 5) should be fixed like case 4), and I don't know if scsi or other
> drivers to set initial io_opt should be changed. :(
>
> Thanks,
> Kuai
>
>>
>> As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
>> be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
>> large).
>>
>> Given all of that and how misused io_opt seems to be, I am not sure how to fix
>> this though.
>>

Hunor Csordás
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai 2025-07-28 10:56 ` Csordás Hunor @ 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:08 UTC (permalink / raw) To: Damien Le Moal Cc: Yu Kuai, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Damien, > My bad, yes, that is the definition in sysfs. So io_min is the stride > size, where: Depends on the RAID type. For RAID0 and 1 there is no inherent penalty wrt. writing less than the stride size. But for RAID 5/6 there clearly is. > stride size x number of data disks == stripe_size. > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > Beware of the wording here to avoid confusion (this is all already super > confusing !). The choice of "chunk" to describe the LBA boundary queue limit is unfortunate since MD uses chunk_sectors as the term for what you call stride. > Well, at least, that is how I interpret the io_min definition of > minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For > RAID arrays it is often the stripe chunk size." is super confusing. Not > entirely sure if stride or stripe was meant here... The stripe chunk or stripe unit is what you call stride. Stripe width is the full stripe across all drives. > As for read_ahead_kb, it should be bounded by io_opt (upper bound) but > should be initialized to a smaller value aligned to io_min (if io_opt > is unreasonably large). In retrospect I am not really a fan of using io_opt for read_ahead_kb since, to my knowledge, there is no guarantee that the readahead I/O will be naturally aligned. That said, I don't really know of devices where this matters much for reads. With writes, this would be much more of an issue. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 7:14 ` Yu Kuai 2025-07-28 7:44 ` Damien Le Moal @ 2025-07-29 3:53 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 3:53 UTC (permalink / raw) To: Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) > The orginal problem is that scsi disks report unusual io_opt 32767, We should fix that. We already ignore a bunch of other oddball clues. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:49 ` Damien Le Moal 2025-07-28 7:14 ` Yu Kuai @ 2025-07-29 3:49 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 3:49 UTC (permalink / raw) To: Damien Le Moal Cc: Yu Kuai, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Damien, > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates > that the device server does not report optimal transfer length > granularity. > > For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note > that the naming here is dubious since this is not a minimum. The > minimum is the logical block size. min_io is a preferred minimum, not an absolute minimum, just like the physical block size. You can do smaller I/Os but you don't want to. > Storage devices may report an optimal I/O size, which is the device's > preferred unit for sustained I/O. This is rarely reported for disk > drives. For RAID arrays it is usually the stripe width or the internal > track size. A properly aligned multiple of optimal_io_size is the > preferred request size for workloads where sustained throughput is > desired. If no optimal I/O size is reported this file contains 0. > > Well, I find this definition not correct *at all*. This is repeating > the definition of minimum_io_size (limits->io_min) and completely > disregard the eventual optimal_io_size limit of the drives in the > array. opt_io defines the optimal I/O size for a sequential/throughput-focused workload. You can do larger I/Os but you don't want to. RAID arrays at the time the spec was written had sophisticated non-volatile caches which managed data in tracks or cache lines. When you issued an I/O which straddled internal cache lines in the array, you would fall back to a slow path as opposed to doing things entirely in hardware. 
So the purpose of the optimal transfer length and granularity in SCSI was to encourage applications to write naturally aligned full tracks/cache lines/stripes. > For a raid array, this value should obviously be a multiple of > minimum_io_size (the array stripe size), SCSI does not require this but we enforce it in sd.c. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor 2025-07-28 0:39 ` Damien Le Moal @ 2025-07-29 4:44 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:44 UTC (permalink / raw) To: Csordás Hunor Cc: Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, Damien Le Moal Csordás, I basically agree with everything in your analysis. >> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following coce, >> 3785 /* >> 3786 * Limit default to SCSI host optimal sector limit if set. There may be >> 3787 * an impact on performance for when the size of a request exceeds this >> 3788 * host limit. >> 3789 */ >> 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; >> 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { >> 3792 lim.io_opt = min_not_zero(lim.io_opt, >> 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); >> 3794 } >> >> lim.io_opt of all my sata disks attached to mpt3sas HBA are all 32767 sectors, >> because the above code block. shost->opt_sectors was originally used to seed rw_max and not io_opt, I think that is the appropriate thing to do. A SCSI host driver reporting some ludicrous limit is not necessarily representative of a "good" I/O size as far as neither disk drive, nor the Linux I/O stack is concerned. shost->opt_sectors should clearly set an upper bound for max_sectors. But I don't think we should ever increase any queue limit based on what is reported in opt_sectors. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
[not found] <ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5>
2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor
2025-07-28 0:39 ` Damien Le Moal
2025-07-28 0:55 ` Yu Kuai
2025-07-28 2:41 ` Damien Le Moal
2025-07-28 3:08 ` Yu Kuai
2025-07-28 3:49 ` Damien Le Moal
2025-07-28 7:14 ` Yu Kuai
2025-07-28 7:44 ` Damien Le Moal
2025-07-28 9:02 ` Yu Kuai
2025-07-29 4:23 ` Martin K. Petersen
2025-07-29 6:25 ` Yu Kuai
2025-07-29 22:02 ` Tony Battersby
2025-07-29 6:13 ` Hannes Reinecke
2025-07-29 6:29 ` Yu Kuai
2025-07-29 22:24 ` Keith Busch
2025-07-28 10:56 ` Csordás Hunor
2025-07-29 4:08 ` Martin K. Petersen
2025-07-29 3:53 ` Martin K. Petersen
2025-07-29 3:49 ` Martin K. Petersen
2025-07-29 4:44 ` Martin K. Petersen