public inbox for linux-block@vger.kernel.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: John Garry <john.g.garry@oracle.com>, linux-block@vger.kernel.org
Cc: hch@lst.de, martin.petersen@oracle.com, axboe@kernel.dk,
	ojaswin@linux.ibm.com, gjoyce@ibm.com
Subject: Re: [PATCH] block: fix atomic write limits for stacked devices
Date: Tue, 3 Jun 2025 20:46:24 +0530	[thread overview]
Message-ID: <0747313c-a70d-482f-8ea6-ce2dff772c2c@linux.ibm.com> (raw)
In-Reply-To: <07cfb3a1-c410-4544-ae4d-5808114e02d7@oracle.com>



On 6/3/25 5:47 PM, John Garry wrote:
> On 03/06/2025 12:27, Nilay Shroff wrote:
>> The current logic applies the bottom device's atomic write limits to
>> the stacked (top) device only if the top device does not support chunk
>> sectors. However, this approach is too restrictive.
> 
> Note that the assumption is that md raid0/10 or dm-stripe chunk size is in io_min. It's not ideal, but that's the field which always holds chunk_sectors for those drivers.
> 
> max_hw_sectors would seem to be more appropriate to me...
> 
>>
>> We should also propagate the bottom device's atomic write limits to the
>> stacked device if atomic_write_hw_unit_{min,max} of the bottom device
>> are aligned with the top device's chunk size (io_min). Failing to do so
>> may unnecessarily reduce the stacked device's atomic write limits to
>> values much smaller than what the hardware can actually support.
>>
>> For example, on an NVMe disk with the following controller capability:
>> $ nvme id-ctrl /dev/nvme1  | grep awupf
>> awupf     : 63
>>
>> Without the patch applied,
>>
>> The bottom device (nvme1c1n1) atomic queue limits:
>> $ /sys/block/nvme1c1n1/queue/atomic_write_boundary_bytes:0
>> $ /sys/block/nvme1c1n1/queue/atomic_write_max_bytes:262144
>> $ /sys/block/nvme1c1n1/queue/atomic_write_unit_max_bytes:262144
>> $ /sys/block/nvme1c1n1/queue/atomic_write_unit_min_bytes:4096
>>
>> The top device (nvme1n1) atomic queue limits:
>> $ /sys/block/nvme1n1/queue/atomic_write_boundary_bytes:0
>> $ /sys/block/nvme1n1/queue/atomic_write_max_bytes:4096
>> $ /sys/block/nvme1n1/queue/atomic_write_unit_max_bytes:4096
>> $ /sys/block/nvme1n1/queue/atomic_write_unit_min_bytes:4096
>>
>> With this patch applied,
>>
>> The top device (nvme1n1) atomic queue limits:
>> /sys/block/nvme1n1/queue/atomic_write_boundary_bytes:0
>> /sys/block/nvme1n1/queue/atomic_write_max_bytes:262144
>> /sys/block/nvme1n1/queue/atomic_write_unit_max_bytes:262144
>> /sys/block/nvme1n1/queue/atomic_write_unit_min_bytes:4096
>>
>> This change ensures that the stacked device retains optimal atomic write
>> capability when alignment permits, improving overall performance and
>> correctness.
>>
>> Reported-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Fixes: d7f36dc446e8 ("block: Support atomic writes limits for stacked devices")
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>   block/blk-settings.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index a000daafbfb4..35c1354dd5ae 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -598,8 +598,14 @@ static bool blk_stack_atomic_writes_head(struct queue_limits *t,
>>           !blk_stack_atomic_writes_boundary_head(t, b))
>>           return false;
>>   -    if (t->io_min <= SECTOR_SIZE) {
>> -        /* No chunk sectors, so use bottom device values directly */
>> +    if (t->io_min <= SECTOR_SIZE ||
>> +        (!(t->atomic_write_hw_unit_max % t->io_min) &&
>> +         !(t->atomic_write_hw_unit_min % t->io_min))) {
> 
> So will this now break md raid0/10 or dm-stripe when t->io_min is set (> SECTOR_SIZE)? I mean, for md raid0/10 or dm-stripe, we should be taking the chunk size into account there, and we no longer seem to be doing so.
> 
Shouldn't it work if we ensure that the bottom device's atomic write unit min/max are
aligned with the top device's chunk sectors, so that the top device can simply copy and
use the bottom device's atomic write limits directly? Or is there a special case for
raid0/10 and dm-stripe which can't handle atomic writes when the stacked device's chunk
size is greater than SECTOR_SIZE?

BTW, there's a typo in the above change; the if check should read as below
(my bad):
    if (t->io_min <= SECTOR_SIZE ||
        (!(b->atomic_write_hw_unit_max % t->io_min) &&
         !(b->atomic_write_hw_unit_min % t->io_min))) {
    ...
    ...

> What is the value of top device io_min and physical_block_size in your example?
The NVMe disk I am using has both t->io_min and t->physical_block_size set
to 4096.

Thanks,
--Nilay



Thread overview: 7+ messages
2025-06-03 11:27 [PATCH] block: fix atomic write limits for stacked devices Nilay Shroff
2025-06-03 12:17 ` John Garry
2025-06-03 15:16   ` Nilay Shroff [this message]
2025-06-04  7:29     ` John Garry
2025-06-04 15:09       ` Nilay Shroff
2025-06-05  9:01         ` John Garry
2025-06-05  9:50           ` Nilay Shroff
