public inbox for linux-scsi@vger.kernel.org
From: Damien Le Moal <dlemoal@kernel.org>
To: "Yu Kuai" <yukuai1@huaweicloud.com>,
	"Csordás Hunor" <csordas.hunor@gmail.com>,
	"Coly Li" <colyli@kernel.org>,
	hch@lst.de
Cc: linux-block@vger.kernel.org,
	"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-scsi@vger.kernel.org, "yukuai (C)" <yukuai3@huawei.com>
Subject: Re: Improper io_opt setting for md raid5
Date: Mon, 28 Jul 2025 16:44:14 +0900	[thread overview]
Message-ID: <fa2f9406-4ee8-45f9-a784-b5042e9f4411@kernel.org> (raw)
In-Reply-To: <9c6f300a-f78f-de6e-4b99-453df377c7ba@huaweicloud.com>

On 7/28/25 4:14 PM, Yu Kuai wrote:
>>> Per git log, starting from commit 7e5f5fb09e6f ("block: Update topology
>>> documentation"), the documentation contains a special explanation for
>>> RAID arrays, and optimal_io_size says:
>>>
>>> For RAID arrays it is usually the
>>> stripe width or the internal track size.  A properly aligned
>>> multiple of optimal_io_size is the preferred request size for
>>> workloads where sustained throughput is desired.
>>>
>>> And this explanation is exactly what raid5 follows; it is important that
>>> the I/O size is a properly aligned multiple of io_opt.
>>
>> Looking at the sysfs doc for the above fields, they are described as follows:
>>
>> * /sys/block/<disk>/queue/minimum_io_size
>>
>> [RO] Storage devices may report a granularity or preferred
>> minimum I/O size which is the smallest request the device can
>> perform without incurring a performance penalty.  For disk
>> drives this is often the physical block size.  For RAID arrays
>> it is often the stripe chunk size.  A properly aligned multiple
>> of minimum_io_size is the preferred request size for workloads
>> where a high number of I/O operations is desired.
>>
>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a
>> RAID array, this indeed should be the stride x number of data disks.
> 
> Do you mean stripe here? io_min for a raid array is always just one
> chunk size.

My bad, yes, that is the definition in sysfs. So io_min is the stride size, where:

stride size x number of data disks == stripe_size.

Note that the chunk_sectors limit is the *stripe* size, not the per-drive stride.
Beware of the wording here to avoid confusion (this is all already super
confusing!).

Well, at least, that is how I interpret the definition of io_min /
minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For
RAID arrays it is often the stripe chunk size." is super confusing. Not
entirely sure if stride or stripe was meant here...


>> * /sys/block/<disk>/queue/optimal_io_size
>>
>> Storage devices may report an optimal I/O size, which is
>> the device's preferred unit for sustained I/O.  This is rarely
>> reported for disk drives.  For RAID arrays it is usually the
>> stripe width or the internal track size.  A properly aligned
>> multiple of optimal_io_size is the preferred request size for
>> workloads where sustained throughput is desired.  If no optimal
>> I/O size is reported this file contains 0.
>>
>> Well, I find this definition not correct *at all*. It repeats the
>> definition of minimum_io_size (limits->io_min) and completely disregards the
>> eventual optimal_io_size limit of the drives in the array. For a raid array,
>> this value should obviously be a multiple of minimum_io_size (the array stripe
>> size), but it can be much larger, since this should be an upper bound for I/O
>> size. read_ahead_kb being set using this value is thus not correct, I think.
>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
> 
> I think this is actually different from io_min, and io_opt for different
> levels is not the same. For raid0, raid10 and raid456 (raid1 doesn't have
> a chunk size):
>  - lim.io_min = mddev->chunk_sectors << 9;

See above. Given how confusing the definition of minimum_io_size is, not sure
that is correct. This code assumes that io_min is the stripe size and not the
stride size.

>  - lim.io_opt = lim.io_min * (number of data copies);

I do not understand what you mean by "number of data copies"... There is no
data copy in a RAID 5/6 array.

> And I think they do match the definition above, specifically:
>  - a properly aligned multiple of io_min to *prevent a performance penalty*;

Yes.

>  - a properly aligned multiple of io_opt to *get optimal performance*, the
>    number of data copies times the performance of a single disk;

That is how this field is defined for RAID, but that is far from what it means
for a single disk. It is unfortunate that it was defined like that.

For a single disk, io_opt is NOT about getting optimal performance. It is about
an upper bound on the I/O size to NOT get a performance penalty (e.g. due to a
DMA mapping that is too large for what the IOMMU can handle).

And for a RAID array, it means that we should always have io_min == io_opt, but
it seems that the scsi code and the limit stacking code try to make this limit
an upper bound on the I/O size, aligned to the stripe size.

> The original problem is that scsi disks report an unusual io_opt of 32767,
> and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunk size). The
> lcm_not_zero() from blk_stack_limits() ends up with a huge value:
> 
> blk_stack_limits()
>  t->io_min = max(t->io_min, b->io_min);
>  t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);

I understand the "problem" that was stated. There is an overflow that results
in a large io_opt and a ridiculously large read_ahead_kb.
io_opt being large should in my opinion not be an issue in itself, since it
should be an upper bound on IO size and not the stripe size (io_min indicates
that).

>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
> 
> io_opt is used in a raid array as the minimal aligned size to get optimal
> I/O performance, not the upper bound. In that respect, using this
> value for ra_pages makes sense. However, if scsi is using this value as
> an I/O upper bound, then indeed this doesn't make sense.

Here is your issue. People misunderstood optimal_io_size and used it instead
of the minimum_io_size/io_min limit for the granularity/alignment of I/Os.
Using optimal_io_size as the "granularity" for optimal I/Os that do not require
a read-modify-write of RAID stripes is simply wrong in my opinion.
io_min/minimum_io_size is the attribute indicating that.

As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
large).

Given all of that and how misused io_opt seems to be, I am not sure how to fix
this though.

-- 
Damien Le Moal
Western Digital Research

