* Zoned storage and BLK_STS_RESOURCE
@ 2024-12-16 19:24 Bart Van Assche
  2024-12-16 20:23 ` Damien Le Moal
  2024-12-17  4:15 ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Bart Van Assche @ 2024-12-16 19:24 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block@vger.kernel.org, Christoph Hellwig


Hi Damien,

If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
against all kernel versions I tried, including kernel version 6.9. Do
you agree that this test should pass? If you agree with this, do you
agree that the only solution is to postpone error handling of zoned
writes until all pending zoned writes have completed and only to
resubmit failed writes after all pending writes have completed?

Thanks,

Bart.


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 19:24 Zoned storage and BLK_STS_RESOURCE Bart Van Assche
@ 2024-12-16 20:23 ` Damien Le Moal
  2024-12-16 20:42   ` Bart Van Assche
  2024-12-17  4:15 ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-16 20:23 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, Christoph Hellwig

On 2024/12/16 11:24, Bart Van Assche wrote:
> 
> Hi Damien,
> 
> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
> against all kernel versions I tried, including kernel version 6.9. Do
> you agree that this test should pass? If you agree with this, do you
> agree that the only solution is to postpone error handling of zoned
> writes until all pending zoned writes have completed and only to
> resubmit failed writes after all pending writes have completed?

Well, of course: if one write fails, the target zone write pointer will not
advance as it should have, so all writes for the same zone after the failed one
will be unaligned and fail. Is that what you are talking about ?

With the fixes applied to rc3, the automatic error recovery in the block layer
is gone. So it is up to the user (FS, DM or application) to do the right thing.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 20:23 ` Damien Le Moal
@ 2024-12-16 20:42   ` Bart Van Assche
  2024-12-16 20:54     ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Bart Van Assche @ 2024-12-16 20:42 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block@vger.kernel.org, Christoph Hellwig


On 12/16/24 12:23 PM, Damien Le Moal wrote:
> On 2024/12/16 11:24, Bart Van Assche wrote:
>> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
>> against all kernel versions I tried, including kernel version 6.9. Do
>> you agree that this test should pass? If you agree with this, do you
>> agree that the only solution is to postpone error handling of zoned
>> writes until all pending zoned writes have completed and only to
>> resubmit failed writes after all pending writes have completed?
> 
> Well, of course: if one write fails, the target zone write pointer will not
> advance as it should have, so all writes for the same zone after the failed one
> will be unaligned and fail. Is that what you are talking about ?
> 
> With the fixes applied to rc3, the automatic error recovery in the block layer
> is gone. So it is up to the user (FS, DM or application) to do the right thing.

Hi Damien,

For non-zoned storage the BLK_STS_RESOURCE status is not reported to the
I/O submitter (filesystem). The BLK_STS_RESOURCE status causes the block
layer to retry a request. For zoned storage, if the block driver reports
the BLK_STS_RESOURCE status and QD > 1, then the submitter
(filesystem) has to retry the I/O. Isn't that inconsistent? Solving this
inconsistency is one reason why I would like to postpone handling of
zoned write errors until all pending I/O has either completed or failed.
Another reason is that this behavior change is an essential step towards
supporting write pipelining. If multiple zoned writes are outstanding,
and the block driver postpones execution of any of these writes (unit
attention, BLK_STS_RESOURCE, ...) then any zoned writes must only be
resubmitted after all pending zoned writes have either completed or failed.
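
To make the non-zoned behavior concrete, here is a minimal sketch of a
blk-mq ->queue_rq() callback; my_device_busy() is a hypothetical helper
and the rest is illustrative only:

  static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
  {
          struct request *rq = bd->rq;

          /* Out of driver resources: the block layer requeues the
           * request and retries it later. The filesystem never sees
           * this failure. */
          if (my_device_busy(hctx->queue))
                  return BLK_STS_RESOURCE;

          blk_mq_start_request(rq);
          /* ... hand the request to the hardware ... */
          return BLK_STS_OK;
  }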

Thanks,

Bart.


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 20:42   ` Bart Van Assche
@ 2024-12-16 20:54     ` Damien Le Moal
  2024-12-16 21:22       ` Bart Van Assche
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-16 20:54 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, Christoph Hellwig

On 2024/12/16 12:42, Bart Van Assche wrote:
> 
> On 12/16/24 12:23 PM, Damien Le Moal wrote:
>> On 2024/12/16 11:24, Bart Van Assche wrote:
>>> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
>>> against all kernel versions I tried, including kernel version 6.9. Do
>>> you agree that this test should pass? If you agree with this, do you
>>> agree that the only solution is to postpone error handling of zoned
>>> writes until all pending zoned writes have completed and only to
>>> resubmit failed writes after all pending writes have completed?
>>
>> Well, of course: if one write fails, the target zone write pointer will not
>> advance as it should have, so all writes for the same zone after the failed one
>> will be unaligned and fail. Is that what you are talking about ?
>>
>> With the fixes applied to rc3, the automatic error recovery in the block layer
>> is gone. So it is up to the user (FS, DM or application) to do the right thing.
> 
> Hi Damien,
> 
> For non-zoned storage the BLK_STS_RESOURCE status is not reported to the
> I/O submitter (filesystem). The BLK_STS_RESOURCE status causes the block
> layer to retry a request. For zoned storage, if the block driver reports
> the BLK_STS_RESOURCE status and QD > 1, then the submitter
> (filesystem) has to retry the I/O. Isn't that inconsistent? Solving this
> inconsistency is one reason why I would like to postpone handling of
> zoned write errors until all pending I/O has either completed or failed.

As I said, if one write does not work, whatever the reason, all other writes
behind it for the same zone will also not work. So yes, handling of errors in
the end needs to be done after all writes come back to the issuer. Nothing new
here. I do not see the issue. And I am not sure where you want to go with this.

> Another reason is that this behavior change is an essential step towards
> supporting write pipelining. If multiple zoned writes are outstanding,
> and the block driver postpones execution of any of these writes (unit
> attention, BLK_STS_RESOURCE, ...) then any zoned writes must only be
> resubmitted after all pending zoned writes have either completed or failed.

Yes. But I am still confused. Where is the problem ?

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 20:54     ` Damien Le Moal
@ 2024-12-16 21:22       ` Bart Van Assche
  2024-12-16 22:49         ` Damien Le Moal
  2024-12-17 14:56         ` Damien Le Moal
  0 siblings, 2 replies; 37+ messages in thread
From: Bart Van Assche @ 2024-12-16 21:22 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block@vger.kernel.org, Christoph Hellwig

On 12/16/24 12:54 PM, Damien Le Moal wrote:
> Yes. But I am still confused. Where is the problem ?

Here:
https://lore.kernel.org/linux-block/95ab028e-6cf7-474e-aa33-37ab3bccd078@kernel.org/.
In that message a different approach is suggested from the one I
described in my previous message.

UFSHCI 3.0 controllers preserve the command order except if these are in
a power-saving mode called auto-hibernation (AH8). When leaving that
mode, commands are submitted in tag order (0..31). The approach
described above provides an elegant solution for the unaligned write
errors that can be caused by command reordering when leaving AH8 mode.
I'm not aware of any other elegant approach to deal with the reordering
that can be caused by leaving the UFSHCI AH8 mode.

Thanks,

Bart.


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 21:22       ` Bart Van Assche
@ 2024-12-16 22:49         ` Damien Le Moal
  2024-12-17 14:56         ` Damien Le Moal
  1 sibling, 0 replies; 37+ messages in thread
From: Damien Le Moal @ 2024-12-16 22:49 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, Christoph Hellwig

On 2024/12/16 13:22, Bart Van Assche wrote:
> On 12/16/24 12:54 PM, Damien Le Moal wrote:
>> Yes. But I am still confused. Where is the problem ?
> 
> Here:
> https://lore.kernel.org/linux-block/95ab028e-6cf7-474e-aa33-37ab3bccd078@kernel.org/.
> In that message a different approach is suggested from the one I
> described in my previous message.
> 
> UFSHCI 3.0 controllers preserve the command order except if these are in
> a power-saving mode called auto-hibernation (AH8). When leaving that
> mode, commands are submitted in tag order (0..31). The approach
> described above provides an elegant solution for the unaligned write
> errors that can be caused by command reordering when leaving AH8 mode.
> I'm not aware of any other elegant approach to deal with the reordering
> that can be caused by leaving the UFSHCI AH8 mode.

Sorry, but I do not know enough about UFS to fully understand the issue, or
comment/propose a solution.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 19:24 Zoned storage and BLK_STS_RESOURCE Bart Van Assche
  2024-12-16 20:23 ` Damien Le Moal
@ 2024-12-17  4:15 ` Christoph Hellwig
  2024-12-17 15:04   ` Damien Le Moal
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-17  4:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, linux-block@vger.kernel.org, Christoph Hellwig

On Mon, Dec 16, 2024 at 11:24:24AM -0800, Bart Van Assche wrote:
>
> Hi Damien,
>
> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
> against all kernel versions I tried, including kernel version 6.9. Do
> you agree that this test should pass?

That test case is not very well documented and you're not explaining
how it fails.

As far as I can tell the test uses fio to write to a SCSI debug device
using the zbd randwrite mode and the io_uring I/O engine of fio.

We've never guaranteed ordering of multiple outstanding asynchronous user
writes on zoned block devices, so from that point of view a "failure" due
to write pointer violations when changing the test to use QD=2 is
entirely expected.



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-16 21:22       ` Bart Van Assche
  2024-12-16 22:49         ` Damien Le Moal
@ 2024-12-17 14:56         ` Damien Le Moal
  2024-12-19  5:55           ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 14:56 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, Christoph Hellwig

On 2024/12/16 13:22, Bart Van Assche wrote:
> On 12/16/24 12:54 PM, Damien Le Moal wrote:
>> Yes. But I am still confused. Where is the problem ?
> 
> Here:
> https://lore.kernel.org/linux-block/95ab028e-6cf7-474e-aa33-37ab3bccd078@kernel.org/.
> In that message a different approach is suggested from the one I
> described in my previous message.

OK. So you are talking about an issue that potentially can happen *if* you
modify zone write plugging to issue more than one write at a time per zone. This
issue of reordering cannot happen today as there is always at most one write per
zone in-flight.
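
Conceptually, the serialization done by zone write plugging looks like
the sketch below (names are illustrative; the real code is struct
blk_zone_wplug in block/blk-zoned.c):

  /* At most one write in flight per zone; later writes wait in the
   * plug and are released one at a time from the completion path. */
  struct zone_wplug_sketch {
          spinlock_t      lock;
          bool            write_in_flight;
          struct bio_list pending;
  };

  static void zone_wplug_submit(struct zone_wplug_sketch *zwp,
                                struct bio *bio)
  {
          spin_lock(&zwp->lock);
          if (zwp->write_in_flight) {
                  /* Hold the BIO until the current write completes. */
                  bio_list_add(&zwp->pending, bio);
                  spin_unlock(&zwp->lock);
                  return;
          }
          zwp->write_in_flight = true;
          spin_unlock(&zwp->lock);
          submit_bio_noacct(bio);
  }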

> UFSHCI 3.0 controllers preserve the command order except if these are in
> a power-saving mode called auto-hibernation (AH8). When leaving that
> mode, commands are submitted in tag order (0..31). The approach
> described above provides an elegant solution for the unaligned write
> errors that can be caused by command reordering when leaving AH8 mode.
> I'm not aware of any other elegant approach to deal with the reordering
> that can be caused by leaving the UFSHCI AH8 mode.

As I said, I do not know enough about UFS to comment on potential solutions as I
do not fully grasp the problem.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17  4:15 ` Christoph Hellwig
@ 2024-12-17 15:04   ` Damien Le Moal
  2024-12-17 18:38     ` Bart Van Assche
  2024-12-17 18:46     ` Jens Axboe
  0 siblings, 2 replies; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 15:04 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche; +Cc: linux-block@vger.kernel.org

On 2024/12/16 20:15, Christoph Hellwig wrote:
> On Mon, Dec 16, 2024 at 11:24:24AM -0800, Bart Van Assche wrote:
>>
>> Hi Damien,
>>
>> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
>> against all kernel versions I tried, including kernel version 6.9. Do
>> you agree that this test should pass?
> 
> That test case is not very well documented and you're not explaining
> how it fails.
> 
> As far as I can tell the test uses fio to write to a SCSI debug device
> using the zbd randwrite mode and the io_uring I/O engine of fio.

Of note about io_uring: if writes are submitted from multiple jobs to multiple
queues, then you will see unaligned write errors, but the same test with libaio
will work just fine. The reason is that io_uring fio engine IO submission only
adds write requests to the io rings, which will then be submitted by the kernel
ring handling later. But at that time, the ordering information is lost and if
the rings are processed in the wrong order, you'll get unaligned errors.

io_uring is thus unsafe for writes to zoned block devices. Trying to do
something about it has been on my to-do list for a while. Been too busy to do
anything yet. The best solution is of course zone append. If the user wants to
use regular writes, then it better tightly control its write IO issuing to be
QD=1 per zone itself as relying on zone write plugging will not be enough.

> We've never guaranteed ordering of multiple outstanding asynchronous user
> writes on zoned block devices, so from that point of view a "failure" due
> to write pointer violations when changing the test to use QD=2 is
> entirely expected.

Not for libaio since the io_submit() call goes down to submit_bio(). So if the
issuing user application does the right synchronization (which fio does), libaio
is safe as we are guaranteed that the writes are placed in order in the zone
write plugs. As explained above, that is not the case with io_uring though.
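
For illustration, a minimal libaio sketch of that property (error
handling omitted; fd, buf0/buf1 and zone_start are assumptions):

  #include <libaio.h>

  static void write_two_in_order(int fd, void *buf0, void *buf1,
                                 long long zone_start)
  {
          io_context_t ctx = 0;
          struct iocb cb0, cb1;
          struct iocb *cbs[2] = { &cb0, &cb1 };
          struct io_event evs[2];

          io_setup(8, &ctx);
          io_prep_pwrite(&cb0, fd, buf0, 4096, zone_start);
          io_prep_pwrite(&cb1, fd, buf1, 4096, zone_start + 4096);
          /* io_submit() runs the submission path synchronously, so
           * both BIOs reach the zone write plug in array order. */
          io_submit(ctx, 2, cbs);
          io_getevents(ctx, 2, 2, evs, NULL);
          io_destroy(ctx);
  }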


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 15:04   ` Damien Le Moal
@ 2024-12-17 18:38     ` Bart Van Assche
  2024-12-17 18:46     ` Jens Axboe
  1 sibling, 0 replies; 37+ messages in thread
From: Bart Van Assche @ 2024-12-17 18:38 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig; +Cc: linux-block@vger.kernel.org

On 12/17/24 7:04 AM, Damien Le Moal wrote:
> On 2024/12/16 20:15, Christoph Hellwig wrote:
>> On Mon, Dec 16, 2024 at 11:24:24AM -0800, Bart Van Assche wrote:
>>>
>>> Hi Damien,
>>>
>>> If 'qd=1' is changed into 'qd=2' in tests/zbd/012 then this test fails
>>> against all kernel versions I tried, including kernel version 6.9. Do
>>> you agree that this test should pass?
>>
>> That test case is not very well documented and you're not explaining
>> how it fails.
>>
>> As far as I can tell the test uses fio to write to a SCSI debug device
>> using the zbd randwrite mode and the io_uring I/O engine of fio.
> 
> Of note about io_uring: if writes are submitted from multiple jobs to multiple
> queues, then you will see unaligned write errors, but the same test with libaio
> will work just fine. The reason is that io_uring fio engine IO submission only
> adds write requests to the io rings, which will then be submitted by the kernel
> ring handling later. But at that time, the ordering information is lost and if
> the rings are processed in the wrong order, you'll get unaligned errors.
> 
> io_uring is thus unsafe for writes to zoned block devices. Trying to do
> something about it has been on my to-do list for a while. Been too busy to do
> anything yet. The best solution is of course zone append. If the user wants to
> use regular writes, then it better tightly control its write IO issuing to be
> QD=1 per zone itself as relying on zone write plugging will not be enough.
> 
>> We've never guaranteed ordering of multiple outstanding asynchronous user
>> writes on zoned block devices, so from that point of view a "failure" due
>> to write pointer violations when changing the test to use QD=2 is
>> entirely expected.
> 
> Not for libaio since the io_submit() call goes down to submit_bio(). So if the
> issuing user application does the right synchronization (which fio does), libaio
> is safe as we are guaranteed that the writes are placed in order in the zone
> write plugs. As explained above, that is not the case with io_uring though.

Thanks, Damien, for sharing this information. After switching to libaio,
the higher queue depth test cases pass with Jens' block-for-next branch.
See also https://github.com/osandov/blktests/pull/156.

Bart.




* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 15:04   ` Damien Le Moal
  2024-12-17 18:38     ` Bart Van Assche
@ 2024-12-17 18:46     ` Jens Axboe
  2024-12-17 18:51       ` Damien Le Moal
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 18:46 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig, Bart Van Assche
  Cc: linux-block@vger.kernel.org

> Of note about io_uring: if writes are submitted from multiple jobs to
> multiple queues, then you will see unaligned write errors, but the
> same test with libaio will work just fine. The reason is that io_uring
> fio engine IO submission only adds write requests to the io rings,
> which will then be submitted by the kernel ring handling later. But at
> that time, the ordering information is lost and if the rings are
> processed in the wrong order, you'll get unaligned errors.

Sorry, but this is woefully incorrect.

Submissions are always in order. I suspect the main difference here is
that some submissions would block, and that will certainly cause the
effective issue point to be reordered, as the initial issue will get
-EAGAIN. This isn't a problem on libaio as it simply blocks on
submission instead. The actual issue path is the same: the kernel
will absolutely see the submissions in order when io_uring_enter() is
called, just like it would when io_submit() is called.

If you stay within the queue limits of the device, then there should be
no reordering. Unless the kernel side prevents non-blocking issues for
some reason.

It's either that, or something being broken with REQ_NOWAIT handling for
zoned writes.
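
For reference, a minimal liburing sketch of the single-ring case being
described (fd, buf0/buf1 and off are assumptions, error handling
omitted):

  #include <liburing.h>

  static void write_two_in_order(int fd, void *buf0, void *buf1,
                                 unsigned long long off)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          int i;

          io_uring_queue_init(8, &ring, 0);
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_write(sqe, fd, buf0, 4096, off);
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_write(sqe, fd, buf1, 4096, off + 4096);
          /* One io_uring_enter() consumes both SQEs in queue order.
           * Only a blocked nowait issue being punted to io-wq can
           * shift the effective issue order. */
          io_uring_submit(&ring);
          for (i = 0; i < 2; i++) {
                  io_uring_wait_cqe(&ring, &cqe);
                  io_uring_cqe_seen(&ring, cqe);
          }
          io_uring_queue_exit(&ring);
  }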

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 18:46     ` Jens Axboe
@ 2024-12-17 18:51       ` Damien Le Moal
  2024-12-17 19:07         ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 18:51 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Bart Van Assche
  Cc: linux-block@vger.kernel.org

On 2024/12/17 10:46, Jens Axboe wrote:
>> Of note about io_uring: if writes are submitted from multiple jobs to
>> multiple queues, then you will see unaligned write errors, but the
>> same test with libaio will work just fine. The reason is that io_uring
>> fio engine IO submission only adds write requests to the io rings,
>> which will then be submitted by the kernel ring handling later. But at
>> that time, the ordering information is lost and if the rings are
>> processed in the wrong order, you'll get unaligned errors.
> 
> Sorry, but this is woefully incorrect.
> 
> Submissions are always in order. I suspect the main difference here is
> that some submissions would block, and that will certainly cause the
> effective issue point to be reordered, as the initial issue will get
> -EAGAIN. This isn't a problem on libaio as it simply blocks on
> submission instead. The actual issue path is the same: the kernel
> will absolutely see the submissions in order when io_uring_enter() is
> called, just like it would when io_submit() is called.

I did not mean to say that the processing of requests in each queue/ring is done
out of order. They are not. What I meant to say is that multiple queues/rings
may be processed in parallel, so if sequential writes are submitted to different
queues, the BIOs for these write IOs may end up being issued out of order to the
zone. Is that an incorrect assumption ? Reading the io_uring code, I think there
is one work item per ring and these are not synchronized.

> 
> If you stay within the queue limits of the device, then there should be
> no reordering. Unless the kernel side prevents non-blocking issues for
> some reason.
> 
> It's either that, or something being broken with REQ_NOWAIT handling for
> zoned writes.
> 


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 18:51       ` Damien Le Moal
@ 2024-12-17 19:07         ` Jens Axboe
  2024-12-17 19:20           ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:07 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig, Bart Van Assche
  Cc: linux-block@vger.kernel.org

On 12/17/24 11:51 AM, Damien Le Moal wrote:
> On 2024/12/17 10:46, Jens Axboe wrote:
>>> Of note about io_uring: if writes are submitted from multiple jobs to
>>> multiple queues, then you will see unaligned write errors, but the
>>> same test with libaio will work just fine. The reason is that io_uring
>>> fio engine IO submission only adds write requests to the io rings,
>>> which will then be submitted by the kernel ring handling later. But at
>>> that time, the ordering information is lost and if the rings are
>>> processed in the wrong order, you'll get unaligned errors.
>>
>> Sorry, but this is woefully incorrect.
>>
>> Submissions are always in order. I suspect the main difference here is
>> that some submissions would block, and that will certainly cause the
>> effective issue point to be reordered, as the initial issue will get
>> -EAGAIN. This isn't a problem on libaio as it simply blocks on
>> submission instead. The actual issue path is the same: the kernel
>> will absolutely see the submissions in order when io_uring_enter() is
>> called, just like it would when io_submit() is called.
> 
> I did not mean to say that the processing of requests in each
> queue/ring is done out of order. They are not. What I meant to say is
> that multiple queues/rings may be processed in parallel, so if
> sequential writes are submitted to different queues, the BIOs for
> these write IOs may end up being issued out of order to the zone. Is
> that an incorrect assumption ? Reading the io_uring code, I think
> there is one work item per ring and these are not synchronized.

Sure, if you have multiple rings, there's no synchronization between
them. Within each ring, reordering in terms of issue can only happen if
the target responds with -EAGAIN to a REQ_NOWAIT request, as they are
always issued in order. If that doesn't happen, there should be no
difference to what the issue looks like with multiple rings or contexts
for io_uring or libaio - any kind of ordering could be observed.

Unsure of which queues you are talking about here, are these the block
level queues?

And ditto on the io_uring question, which work items are we talking
about? There can be any number of requests for any given ring, inflight.

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:07         ` Jens Axboe
@ 2024-12-17 19:20           ` Damien Le Moal
  2024-12-17 19:25             ` Bart Van Assche
  2024-12-17 19:32             ` Jens Axboe
  0 siblings, 2 replies; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 19:20 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Bart Van Assche
  Cc: linux-block@vger.kernel.org

On 2024/12/17 11:07, Jens Axboe wrote:
> On 12/17/24 11:51 AM, Damien Le Moal wrote:
>> On 2024/12/17 10:46, Jens Axboe wrote:
>>>> Of note about io_uring: if writes are submitted from multiple jobs to
>>>> multiple queues, then you will see unaligned write errors, but the
>>>> same test with libaio will work just fine. The reason is that io_uring
>>>> fio engine IO submission only adds write requests to the io rings,
>>>> which will then be submitted by the kernel ring handling later. But at
>>>> that time, the ordering information is lost and if the rings are
>>>> processed in the wrong order, you'll get unaligned errors.
>>>
>>> Sorry, but this is woefully incorrect.
>>>
>>> Submissions are always in order. I suspect the main difference here is
>>> that some submissions would block, and that will certainly cause the
>>> effective issue point to be reordered, as the initial issue will get
>>> -EAGAIN. This isn't a problem on libaio as it simply blocks on
>>> submission instead. The actual issue path is the same: the kernel
>>> will absolutely see the submissions in order when io_uring_enter() is
>>> called, just like it would when io_submit() is called.
>>
>> I did not mean to say that the processing of requests in each
>> queue/ring is done out of order. They are not. What I meant to say is
>> that multiple queues/rings may be processed in parallel, so if
>> sequential writes are submitted to different queues, the BIOs for
>> these write IOs may end up being issued out of order to the zone. Is
>> that an incorrect assumption ? Reading the io_uring code, I think
>> there is one work item per ring and these are not synchronized.
> 
> Sure, if you have multiple rings, there's no synchronization between
> them. Within each ring, reordering in terms of issue can only happen if
> the target responds with -EAGAIN to a REQ_NOWAIT request, as they are
> always issued in order. If that doesn't happen, there should be no
> difference to what the issue looks like with multiple rings or contexts
> for io_uring or libaio - any kind of ordering could be observed.

Yes. The fixes that went into rc3 addressed the REQ_NOWAIT issue. So we are good
on this front.

Unsure of which queues you are talking about here - are these the block
level queues?

My bad. I was talking about the io_uring rings. Not the block layer queues.

> And ditto on the io_uring question, which work items are we talking
> about? There can be any number of requests for any given ring, inflight.

I was talking about the work that gets IOs submitted by the user from the rings
and turns them into BIOs for submission. My understanding is that these are not
synchronized. For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for
X > 1, the fio level synchronization will serialize the calls to io_submit() for
libaio, thus delivering the BIOs to a zone in order in the kernel. With io_uring
as the I/O engine, the same fio level synchronization happens but is only around
the IO getting into the ring. The IOs being turned into BIOs and submitted will
be done outside of the fio serialization and can thus end up being issued out of
order if multiple rings are used. At least, that is my understanding of
io_uring... Am I getting this wrong ?


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:20           ` Damien Le Moal
@ 2024-12-17 19:25             ` Bart Van Assche
  2024-12-17 19:28               ` Damien Le Moal
  2024-12-17 19:32             ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Bart Van Assche @ 2024-12-17 19:25 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe, Christoph Hellwig; +Cc: linux-block@vger.kernel.org

On 12/17/24 11:20 AM, Damien Le Moal wrote:
> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1

Please note that this e-mail thread started by discussing a testcase
with --numjobs=1.

Thanks,

Bart.



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:25             ` Bart Van Assche
@ 2024-12-17 19:28               ` Damien Le Moal
  2024-12-17 19:33                 ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 19:28 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 2024/12/17 11:25, Bart Van Assche wrote:
> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
> 
> Please note that this e-mail thread started by discussing a testcase
> with --numjobs=1.

I missed that. Then io_uring should be fine and behave the same way as libaio.
Since it seems to not be working, we may have a bug beyond the recently fixed
REQ_NOWAIT handling I think. That needs to be looked at.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:20           ` Damien Le Moal
  2024-12-17 19:25             ` Bart Van Assche
@ 2024-12-17 19:32             ` Jens Axboe
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:32 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig, Bart Van Assche
  Cc: linux-block@vger.kernel.org

On 12/17/24 12:20 PM, Damien Le Moal wrote:
> On 2024/12/17 11:07, Jens Axboe wrote:
>> On 12/17/24 11:51 AM, Damien Le Moal wrote:
>>> On 2024/12/17 10:46, Jens Axboe wrote:
>>>>> Of note about io_uring: if writes are submitted from multiple jobs to
>>>>> multiple queues, then you will see unaligned write errors, but the
>>>>> same test with libaio will work just fine. The reason is that io_uring
>>>>> fio engine IO submission only adds write requests to the io rings,
>>>>> which will then be submitted by the kernel ring handling later. But at
>>>>> that time, the ordering information is lost and if the rings are
>>>>> processed in the wrong order, you'll get unaligned errors.
>>>>
>>>> Sorry, but this is woefully incorrect.
>>>>
>>>> Submissions are always in order. I suspect the main difference here is
>>>> that some submissions would block, and that will certainly cause the
>>>> effective issue point to be reordered, as the initial issue will get
>>>> -EAGAIN. This isn't a problem on libaio as it simply blocks on
>>>> submission instead. The actual issue path is the same: the kernel
>>>> will absolutely see the submissions in order when io_uring_enter() is
>>>> called, just like it would when io_submit() is called.
>>>
>>> I did not mean to say that the processing of requests in each
>>> queue/ring is done out of order. They are not. What I meant to say is
>>> that multiple queues/rings may be processed in parallel, so if
>>> sequential writes are submitted to different queues, the BIOs for
>>> these write IOs may end up being issued out of order to the zone. Is
>>> that an incorrect assumption ? Reading the io_uring code, I think
>>> there is one work item per ring and these are not synchronized.
>>
>> Sure, if you have multiple rings, there's no synchronization between
>> them. Within each ring, reordering in terms of issue can only happen if
>> the target responds with -EAGAIN to a REQ_NOWAIT request, as they are
>> always issued in order. If that doesn't happen, there should be no
>> difference to what the issue looks like with multiple rings or contexts
>> for io_uring or libaio - any kind of ordering could be observed.
> 
> Yes. The fixes that went into rc3 addressed the REQ_NOWAIT issue. So
> we are good on this front.

I do remember that fix, but curious if there's still something wrong
here...

>> Unsure of which queues you are talking about here - are these the block
>> level queues?
> 
> My bad. I was talking about the io_uring rings. Not the block layer
> queues.

Gotcha

>> And ditto on the io_uring question, which work items are we talking
>> about? There can be any number of requests for any given ring,
>> inflight.
> 
> I was talking about the work that gets IOs submitted by the user from
> the rings and turns them into BIOs for submission. My understanding is
> that these are not synchronized. For a simple fio "--zonemode=zbd
> --rw=randwrite --numjobs=X" for X > 1, the fio level synchronization
> will serialize the calls to io_submit() for libaio, thus delivering
> the BIOs to a zone in order in the kernel. With io_uring as the I/O
> engine, the same fio level synchronization happens but is only around
> the IO getting into the ring. The IOs being turned into BIOs and
> submitted will be done outside of the fio serialization and can thus
> end up being issued out of order if multiple rings are used. At least,
> that is my understanding of io_uring... Am I getting this wrong ?

Yes, this is totally wrong. If we assume a single ring, and a single
thread/task driving this ring, then any IO being put into the ring is
issued IN ORDER on the kernel side by that very same task once said task
does io_uring_enter() to submit whatever it put in the ring.

Like I mentioned earlier, there's _zero_ difference in the ordering in
which these are issued between libaio and io_uring, for a single ring
(or context) and a single task doing the submit and/or wait for events
call.

Fio doesn't have multiple threads operating on a ring; there's always a
single ring per IO thread. Yes, if you use multiple jobs and expect those
to be ordered, that would obviously be broken across the board,
regardless of the IO engine used. So I realize this isn't the case. But
there still seem to be some misunderstandings here about how io_uring
works, and maybe also an issue that's not being fully understood or is
being blamed on something else.

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:28               ` Damien Le Moal
@ 2024-12-17 19:33                 ` Jens Axboe
  2024-12-17 19:37                   ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:33 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 12/17/24 12:28 PM, Damien Le Moal wrote:
> On 2024/12/17 11:25, Bart Van Assche wrote:
>> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
>>
>> Please note that this e-mail thread started by discussing a testcase
>> with --numjobs=1.
> 
> I missed that. Then io_uring should be fine and behave the same way as libaio.
> Since it seems to not be working, we may have a bug beyond the recently fixed
> REQ_NOWAIT handling I think. That needs to be looked at.

Inflight collision, yes that's what I was getting at - there seems to be
another bug here, and misunderstandings about how io_uring works are
causing it to be ignored and/or not understood.

-- 
Jens Axboe



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:33                 ` Jens Axboe
@ 2024-12-17 19:37                   ` Damien Le Moal
  2024-12-17 19:41                     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 19:37 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 2024/12/17 11:33, Jens Axboe wrote:
> On 12/17/24 12:28 PM, Damien Le Moal wrote:
>> On 2024/12/17 11:25, Bart Van Assche wrote:
>>> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>>>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
>>>
>>> Please note that this e-mail thread started by discussing a testcase
>>> with --numjobs=1.
>>
>> I missed that. Then io_uring should be fine and behave the same way as libaio.
>> Since it seems to not be working, we may have a bug beyond the recently fixed
>> REQ_NOWAIT handling I think. That needs to be looked at.
> 
> Inflight collision, yes that's what I was getting at - there seems to be
> another bug here, and misunderstandings about how io_uring works are
> causing it to be ignored and/or not understood.

OK. Will dig into this because I definitely do not fully understand where the
issue is.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:37                   ` Damien Le Moal
@ 2024-12-17 19:41                     ` Jens Axboe
  2024-12-17 19:48                       ` Damien Le Moal
  2024-12-19  6:00                       ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:41 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 12/17/24 12:37 PM, Damien Le Moal wrote:
> On 2024/12/17 11:33, Jens Axboe wrote:
>> On 12/17/24 12:28 PM, Damien Le Moal wrote:
>>> On 2024/12/17 11:25, Bart Van Assche wrote:
>>>> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>>>>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
>>>>
>>>> Please note that this e-mail thread started by discussing a testcase
>>>> with --numjobs=1.
>>>
>>> I missed that. Then io_uring should be fine and behave the same way as libaio.
>>> Since it seems to not be working, we may have a bug beyond the recently fixed
>>> REQ_NOWAIT handling I think. That needs to be looked at.
>>
>> Inflight collision, yes that's what I was getting at - there seems to be
>> another bug here, and misunderstandings on how io_uring works is causing
>> it to be ignored and/or not understood.
> 
> OK. Will dig into this because I definitely do not fully understand where the
> issue is.

As per earlier replies, it's either -EAGAIN being mishandled, OR it's
driving more IOs than the device supports. For the latter case, io_uring
will NOT block, but libaio will. This means that libaio will sit there
waiting on previous IO to complete, and then issue the next one.
io_uring will punt that IO to io-wq, and then all bets are off in terms
of ordering if you have multiple of these threads blocking on tags and
doing issues. The test case looks like it's freezing the queue, which
means you don't even need more than QD number of IOs inflight. When that
happens, guess what libaio does? That's right, it blocks waiting on the
queue, and io_uring will not block but rather punt those IOs to io-wq.
If you have QD=2, then you now have 2 threads doing IO submission, and
either of them could wake and submit before the other.

Like Christoph alluded to in his first reply, driving more than 1
request inflight is going to be trouble, potentially.
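
To make that concrete, the only pattern that is safe today with regular
writes is QD=1 per zone, waiting for each completion before issuing the
next write. A sketch (fd, buf and zone_start are assumptions):

  static int write_zone_qd1(struct io_uring *ring, int fd, void *buf,
                            unsigned long long zone_start, int nr)
  {
          unsigned long long off = zone_start;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          int i, res;

          for (i = 0; i < nr; i++) {
                  sqe = io_uring_get_sqe(ring);
                  io_uring_prep_write(sqe, fd, buf, 4096, off);
                  io_uring_submit(ring);
                  /* Never more than one write in flight per zone, so
                   * neither io-wq punts nor tag waits can reorder. */
                  io_uring_wait_cqe(ring, &cqe);
                  res = cqe->res;
                  io_uring_cqe_seen(ring, cqe);
                  if (res < 0)
                          return res;
                  off += 4096;
          }
          return 0;
  }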

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:41                     ` Jens Axboe
@ 2024-12-17 19:48                       ` Damien Le Moal
  2024-12-17 19:54                         ` Jens Axboe
  2024-12-19  6:00                       ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 19:48 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 2024/12/17 11:41, Jens Axboe wrote:
> On 12/17/24 12:37 PM, Damien Le Moal wrote:
>> On 2024/12/17 11:33, Jens Axboe wrote:
>>> On 12/17/24 12:28 PM, Damien Le Moal wrote:
>>>> On 2024/12/17 11:25, Bart Van Assche wrote:
>>>>> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>>>>>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
>>>>>
>>>>> Please note that this e-mail thread started by discussing a testcase
>>>>> with --numjobs=1.
>>>>
>>>> I missed that. Then io_uring should be fine and behave the same way as libaio.
>>>> Since it seems to not be working, we may have a bug beyond the recently fixed
>>>> REQ_NOWAIT handling I think. That needs to be looked at.
>>>
>>> Inflight collision, yes that's what I was getting at - there seems to be
>>> another bug here, and misunderstandings about how io_uring works are
>>> causing it to be ignored and/or not understood.
>>
>> OK. Will dig into this because I definitely do not fully understand where the
>> issue is.
> 
> As per earlier replies, it's either -EAGAIN being mishandled, OR it's
> driving more IOs than the device supports. For the latter case, io_uring
> will NOT block, but libaio will. This means that libaio will sit there
> waiting on previous IO to complete, and then issue the next one.
> io_uring will punt that IO to io-wq, and then all bets are off in terms
> of ordering if you have multiple of these threads blocking on tags and
> doing issues. The test case looks like it's freezing the queue, which
> means you don't even need more than QD number of IOs inflight. When that
> happens, guess what libaio does? That's right, it blocks waiting on the
> queue, and io_uring will not block but rather punt those IOs to io-wq.
> If you have QD=2, then you now have 2 threads doing IO submission, and
> either of them could wake and submit before the other.

That sounds like a very good analysis :)

> Like Christoph alluded to in his first reply, driving more than 1
> request inflight is going to be trouble, potentially.

Yes. I think the confusion is with which "inflight" we are talking about.
Between the block layer and the device, zone write plugging prevents more than 1
write per zone, so things are OK (modulo bugs...). But between the application
and the block layer, that is not well managed and as your analysis above shows,
bad things can happen. I will look into it to see if we can do something
sensible. If not, we should at least warn the user, or just outright fail using
io_uring with zoned block devices to avoid bad surprises.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:48                       ` Damien Le Moal
@ 2024-12-17 19:54                         ` Jens Axboe
  2024-12-17 19:58                           ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:54 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 12/17/24 12:48 PM, Damien Le Moal wrote:
> On 2024/12/17 11:41, Jens Axboe wrote:
>> On 12/17/24 12:37 PM, Damien Le Moal wrote:
>>> On 2024/12/17 11:33, Jens Axboe wrote:
>>>> On 12/17/24 12:28 PM, Damien Le Moal wrote:
>>>>> On 2024/12/17 11:25, Bart Van Assche wrote:
>>>>>> On 12/17/24 11:20 AM, Damien Le Moal wrote:
>>>>>>> For a simple fio "--zonemode=zbd --rw=randwrite --numjobs=X" for X > 1
>>>>>>
>>>>>> Please note that this e-mail thread started by discussing a testcase
>>>>>> with --numjobs=1.
>>>>>
>>>>> I missed that. Then io_uring should be fine and behave the same way as libaio.
>>>>> Since it seems to not be working, we may have a bug beyond the recently fixed
>>>>> REQ_NOWAIT handling I think. That needs to be looked at.
>>>>
>>>> Inflight collision, yes that's what I was getting at - there seems to be
>>>> another bug here, and misunderstandings about how io_uring works are
>>>> causing it to be ignored and/or not understood.
>>>
>>> OK. Will dig into this because I definitely do not fully understand where the
>>> issue is.
>>
>> As per earlier replies, it's either -EAGAIN being mishandled, OR it's
>> driving more IOs than the device supports. For the latter case, io_uring
>> will NOT block, but libaio will. This means that libaio will sit there
>> waiting on previous IO to complete, and then issue the next one.
>> io_uring will punt that IO to io-wq, and then all bets are off in terms
>> of ordering if you have multiple of these threads blocking on tags and
>> doing issues. The test case looks like it's freezing the queue, which
>> means you don't even need more than QD number of IOs inflight. When that
>> happens, guess what libaio does? That's right, it blocks waiting on the
>> queue, and io_uring will not block but rather punt those IOs to io-wq.
>> If you have QD=2, then you now have 2 threads doing IO submission, and
>> either of them could wake and submit before the other.
> 
> That sounds like a very good analysis :)
> 
>> Like Christoph alluded to in his first reply, driving more than 1
>> request inflight is going to be trouble, potentially.
> 
> Yes. I think the confusion is with which "inflight" we are talking
> about. Between the block layer and the device, zone write plugging
> prevents more than 1 write per zone, so things are OK (modulo
> bugs...). But between the application and the block layer, that is not
> well managed and as your analysis above shows, bad things can happen.
> I will look into it to see if we can do something sensible. If not, we
> should at least warn the user, or just outright fail using io_uring
> with zoned block devices to avoid bad surprises.

The only reason it happens to work is because libaio is broken, and just
blocks rather than being an actual async issue method. If the queue is
frozen or you run out of tags, you'll just sleep, and hence preserve
ordering. For anything that actually works the way it's supposed to, i.e.
doesn't block on submit, yes you can run into trouble with having multiple
IOs inflight. That seems like a no-brainer to me.

io_uring does support ordering writes - not because of zoning, but to
avoid buffered writes being spread over a bunch of threads and hence
just hammering the inode mutex rather than doing actual useful work. You
could potentially use that. Then all pending writes for that inode would
be ordered, even if punted to io-wq.

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:54                         ` Jens Axboe
@ 2024-12-17 19:58                           ` Jens Axboe
  2024-12-17 20:59                             ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 19:58 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 12/17/24 12:54 PM, Jens Axboe wrote:
> io_uring does support ordering writes - not because of zoning, but to
> avoid buffered writes being spread over a bunch of threads and hence
> just hammering the inode mutex rather than doing actual useful work. You
> could potentially use that. Then all pending writes for that inode would
> be ordered, even if punted to io-wq.

See io_uring/io_uring.c:io_prep_async_work(), which is called when an IO
is added for io-wq execution; io_wq_hash_work() makes sure it'll be
ordered. However, this will still not work if you're driving beyond the
limit of the device queue depth, or if you're doing IOs that may trigger
-EAGAIN spuriously, as you can still have two issuers - the
task itself submitting IO, and the one io-wq worker tasked with doing
blocking writes on this zoned device.
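
Roughly, the hashing looks like this (paraphrased and simplified, not
the verbatim source):

  static void io_prep_async_work(struct io_kiocb *req)
  {
          /* ... */
          if (req->file && (req->flags & REQ_F_ISREG)) {
                  /* Hash the work on the inode: io-wq executes all
                   * hashed work for the same inode serially, in
                   * queue order. */
                  io_wq_hash_work(&req->work, file_inode(req->file));
          }
          /* ... */
  }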

At the end of the day, expecting device side ordering and allowing > 1
in QD will always be inherently broken for zoned devices imho. Just
don't do that... If you allow that and expect ordering, you're always
going to be chasing cases where it breaks.

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:58                           ` Jens Axboe
@ 2024-12-17 20:59                             ` Damien Le Moal
  2024-12-17 21:25                               ` Jens Axboe
  2024-12-18  6:58                               ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Damien Le Moal @ 2024-12-17 20:59 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 2024/12/17 11:58, Jens Axboe wrote:
> On 12/17/24 12:54 PM, Jens Axboe wrote:
>> io_uring does support ordering writes - not because of zoning, but to
>> avoid buffered writes being spread over a bunch of threads and hence
>> just hammering the inode mutex rather than doing actual useful work. You
>> could potentially use that. Then all pending writes for that inode would
>> be ordered, even if punted to io-wq.
> 
>> See io_uring/io_uring.c:io_prep_async_work(), which is called when an IO
>> is added for io-wq execution; io_wq_hash_work() makes sure it'll be
>> ordered. However, this will still not work if you're driving beyond the
>> limit of the device queue depth, or if you're doing IOs that may trigger
>> -EAGAIN spuriously, as you can still have two issuers - the
>> task itself submitting IO, and the one io-wq worker tasked with doing
>> blocking writes on this zoned device.

Thanks for the pointer. Will have a look. It may be as simple as always using
the io-wq worker for zone writes and have these ordered (__WQ_ORDERED). Maybe.


-- 
Damien Le Moal
Western Digital Research


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 20:59                             ` Damien Le Moal
@ 2024-12-17 21:25                               ` Jens Axboe
  2024-12-18  6:58                               ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-12-17 21:25 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Christoph Hellwig
  Cc: linux-block@vger.kernel.org

On 12/17/24 1:59 PM, Damien Le Moal wrote:
> On 2024/12/17 11:58, Jens Axboe wrote:
>> On 12/17/24 12:54 PM, Jens Axboe wrote:
>>> io_uring does support ordering writes - not because of zoning, but to
>>> avoid buffered writes being spread over a bunch of threads and hence
>>> just hammering the inode mutex rather than doing actual useful work. You
>>> could potentially use that. Then all pending writes for that inode would
>>> be ordered, even if punted to io-wq.
>>
>> See io_uring/io_uring.c:io_prep_async_work(), which is called when an IO
>> is added for io-wq execution; io_wq_hash_work() makes sure it'll be
>> ordered. However, this will still not work if you're driving beyond the
>> limit of the device queue depth, or if you're doing IOs that may trigger
>> -EAGAIN spuriously, as you can still have two issuers - the
>> task itself submitting IO, and the one io-wq worker tasked with doing
>> blocking writes on this zoned device.
> 
> Thanks for the pointer. Will have a look. It may be as simple as
> always using the io-wq worker for zone writes and have these ordered
> (__WQ_ORDERED). Maybe.

Right, that should work if you force everything to be served by io-wq
and ensure it's hashed.

-- 
Jens Axboe


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 20:59                             ` Damien Le Moal
  2024-12-17 21:25                               ` Jens Axboe
@ 2024-12-18  6:58                               ` Christoph Hellwig
  2024-12-19 18:04                                 ` Bart Van Assche
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-18  6:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig,
	linux-block@vger.kernel.org

On Tue, Dec 17, 2024 at 12:59:00PM -0800, Damien Le Moal wrote:
> Thanks for the pointer. Will have a look. It may be as simple as always using
> the io-wq worker for zone writes and have these ordered (__WQ_ORDERED). Maybe.

No.  Don't force ordering on people for absolutely no reason.  Use the
new append writes that return the written offsets we've talked about.
Enforcing ordering on I/O submissions/commands/requests always ends up
badly.



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 14:56         ` Damien Le Moal
@ 2024-12-19  5:55           ` Christoph Hellwig
  2024-12-19 17:07             ` Bart Van Assche
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-19  5:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, linux-block@vger.kernel.org, Christoph Hellwig

On Tue, Dec 17, 2024 at 06:56:16AM -0800, Damien Le Moal wrote:
> On 2024/12/16 13:22, Bart Van Assche wrote:
> > On 12/16/24 12:54 PM, Damien Le Moal wrote:
> >> Yes. But I am still confused. Where is the problem ?
> > 
> > Here: 
> > https://lore.kernel.org/linux-block/95ab028e-6cf7-474e-aa33-37ab3bccd078@kernel.org/. 
> > In that message another approach is
> > suggested than what I described in my previous message.
> 
> OK. So you are talking about an issue that potentially can happen *if* you
> modify zone write plugging to issue more than one write at a time per zone. This
> issue of reordering cannot happen today as there is always at most one write per
> zone in-flight.

Let me throw in the reminder that every experiment at forcing order
has failed and I'm still more than just skeptical of Bart's hack.



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-17 19:41                     ` Jens Axboe
  2024-12-17 19:48                       ` Damien Le Moal
@ 2024-12-19  6:00                       ` Christoph Hellwig
  2024-12-19 14:50                         ` Jens Axboe
  2024-12-19 17:12                         ` Bart Van Assche
  1 sibling, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-19  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Damien Le Moal, Bart Van Assche, Christoph Hellwig,
	linux-block@vger.kernel.org

On Tue, Dec 17, 2024 at 12:41:50PM -0700, Jens Axboe wrote:
> As per earlier replies, it's either -EAGAIN being mishandled, OR it's
> driving more IOs than the device supports. For the latter case, io_uring
> will NOT block, but libaio will.

And that's exactly the case here as zoned devices only support a single
I/O to a zone and will have to block for the next one.

> Like Christoph alluded to in his first reply, driving more than 1
> request inflight is going to be trouble, potentially.

If you rely on order.  If you are doing O_APPEND-style I/O on zonefs or
using a real file system it's perfectly fine.


* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19  6:00                       ` Christoph Hellwig
@ 2024-12-19 14:50                         ` Jens Axboe
  2024-12-19 17:12                         ` Bart Van Assche
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-12-19 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Bart Van Assche, linux-block@vger.kernel.org

On 12/18/24 11:00 PM, Christoph Hellwig wrote:
> On Tue, Dec 17, 2024 at 12:41:50PM -0700, Jens Axboe wrote:
>> As per earlier replies, it's either -EAGAIN being mishandled, OR it's
>> driving more IOs than the device supports. For the latter case, io_uring
>> will NOT block, but libaio will.
> 
> And that's exactly the case here as zoned devices only support a single
> I/O to a zone and will have to block for the next one.
> 
>> Like Christoph alluded to in his first reply, driving more than 1
>> request inflight is going to be trouble, potentially.
> 
> If you rely on order.  If you are doing O_APPEND-style I/O on zonefs or
> using a real file system it's perfectly fine.

Yes, this is the way, rather than attempting to guarantee ordering for
multiple IOs. That's fragile and will repeatedly break.

-- 
Jens Axboe



* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19  5:55           ` Christoph Hellwig
@ 2024-12-19 17:07             ` Bart Van Assche
  0 siblings, 0 replies; 37+ messages in thread
From: Bart Van Assche @ 2024-12-19 17:07 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal; +Cc: linux-block@vger.kernel.org

On 12/18/24 9:55 PM, Christoph Hellwig wrote:
> Let me throw in the reminder that every experiment at forcing order
> has failed and I'm still more than just skeptical of Bart's hack.

Huh? This patch series only makes small changes to the non-zoned block
layer code and works very reliably on my test setup, including in
tests with error injection enabled: "[PATCH v16 00/26] Improve write
performance for zoned UFS devices"
(https://lore.kernel.org/linux-block/20241119002815.600608-1-bvanassche@acm.org/).

Bart.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19  6:00                       ` Christoph Hellwig
  2024-12-19 14:50                         ` Jens Axboe
@ 2024-12-19 17:12                         ` Bart Van Assche
  2024-12-19 23:10                           ` Damien Le Moal
  2024-12-21  8:13                           ` Christoph Hellwig
  1 sibling, 2 replies; 37+ messages in thread
From: Bart Van Assche @ 2024-12-19 17:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Damien Le Moal, linux-block@vger.kernel.org

On 12/18/24 10:00 PM, Christoph Hellwig wrote:
> If you rely on order. If you are doing O_APPEND-style I/O on zonefs or
> using a real file system it's perfectly fine.

Using zone append operations defeats two of the advantages of zoned
storage. One of the advantages of zoned storage is that filesystems have
control over the layout of files on flash memory with regular writes.
That advantage is lost when using zone append operations because these
operations may be reordered. Another advantage that is lost is
the reduction in size of the FTL translation table. When using zone
append operations, the filesystem has to store the offsets returned by
zone append operations somewhere. With regular writes this is not
necessary.

Bart.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-18  6:58                               ` Christoph Hellwig
@ 2024-12-19 18:04                                 ` Bart Van Assche
  2024-12-21  8:10                                   ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Bart Van Assche @ 2024-12-19 18:04 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal; +Cc: Jens Axboe, linux-block@vger.kernel.org

On 12/17/24 10:58 PM, Christoph Hellwig wrote:
> Use the new append writes that return the written offsets we've
> talked about.
Here is why this is not a solution for SCSI devices:
* There is no Zone Append command in the relevant SCSI standard (ZBC).
* It is unlikely that a zone append command will ever be standardized
   for SCSI devices. Damien explained a few months ago why. See also
   https://lore.kernel.org/linux-block/4f48b57f-e8bb-4164-355a-95f5887bac36@opensource.wdc.com/
   and
   https://lore.kernel.org/linux-block/6a17bdb3-af76-edf7-85a4-0991ee66d380@opensource.wdc.com/
* For SCSI devices, to support QD > 1 per zone for zone append writes,
   the write order of the emulated zone write commands has to be
   preserved.

In other words, no matter whether we use regular writes or zone append
writes for SCSI devices, something like the write pipelining patch
series that I posted is needed to support QD > 1 per zone for writes to
zoned SCSI devices.
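
To illustrate what emulating zone append with regular writes involves,
here is a rough sketch. The structure and helper below are made up for
illustration and are not the actual sd driver code:

#include <linux/spinlock.h>
#include <linux/types.h>

struct zone_state {
        spinlock_t lock;
        sector_t   wp;          /* cached write pointer */
};

/*
 * Emulate zone append on a ZBC device: pick the LBA from the cached
 * write pointer at submission time and turn the append into a regular
 * write at that LBA.  The returned LBA is only correct if the writes
 * then reach the device in submission order, which is why preserving
 * the write order is needed to support QD > 1 per zone.
 */
static sector_t emulate_zone_append(struct zone_state *z,
                                    unsigned int nr_sectors)
{
        sector_t lba;

        spin_lock(&z->lock);
        lba = z->wp;
        z->wp += nr_sectors;
        spin_unlock(&z->lock);

        return lba;             /* issue a regular WRITE at this LBA */
}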

Bart.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19 17:12                         ` Bart Van Assche
@ 2024-12-19 23:10                           ` Damien Le Moal
  2025-01-06 20:14                             ` Bart Van Assche
  2024-12-21  8:13                           ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2024-12-19 23:10 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig, Jens Axboe
  Cc: linux-block@vger.kernel.org

On 12/20/24 02:12, Bart Van Assche wrote:
> On 12/18/24 10:00 PM, Christoph Hellwig wrote:
>> If you rely on order. If you are doing O_APPEND-style I/O on zonefs or
>> using a real file system it's perfectly fine.
> 
> Using zone append operations defeats two of the advantages of zoned
> storage. One of the advantages of zoned storage is that filesystems have
> control over the layout of files on flash memory with regular writes.

Zone append still allows that: you can still choose the zone you write to.

> That advantage is lost when using zone append operations because these
> operations may be reordered. Another advantage that is lost is
> the reduction in size of the FTL translation table. When using zone

FTL? Are you talking about the device-side FTL? Zone append does not
make things worse in any way compared to regular writes. That depends
entirely on the device design and on whether or not zones are mapped
exactly to erase blocks.

> append operations, the filesystem has to store the offsets returned by
> zone append operations somewhere. With regular writes this is not
> necessary.

That is a very misleading statement. The FS *always* records the write location
of data, be it done with regular writes or with zone append. The difference is
that, given that zone append operations may be reordered, you may need a little
more metadata to record the location of data if your writes end up not being
processed in the same order as issued. But in my opinion, that is a slight
disadvantage that can be completely ignored given the code simplification and
performance advantage that zone append brings.
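
For what it is worth, collecting that extra metadata is cheap: the
block layer hands back the actual write location when the bio
completes. A sketch, in which the context structure and
record_extent() are hypothetical filesystem helpers:

#include <linux/bio.h>

static void zone_append_end_io(struct bio *bio)
{
        struct extent_ctx *ctx = bio->bi_private;       /* hypothetical */

        /*
         * For REQ_OP_ZONE_APPEND the block layer reports the sector
         * at which the data actually landed in bi_iter.bi_sector
         * when the bio completes.
         */
        ctx->sector = bio->bi_iter.bi_sector;
        record_extent(ctx);             /* hypothetical metadata update */
        bio_put(bio);
}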

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19 18:04                                 ` Bart Van Assche
@ 2024-12-21  8:10                                   ` Christoph Hellwig
  2025-01-06 18:54                                     ` Bart Van Assche
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-21  8:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe,
	linux-block@vger.kernel.org

On Thu, Dec 19, 2024 at 10:04:40AM -0800, Bart Van Assche wrote:
> On 12/17/24 10:58 PM, Christoph Hellwig wrote:
>> Use the new append writes that return the written offsets we've
>> talked about.
> Here is why this is not a solution for SCSI devices:
> * There is no Zone Append command in the relevant SCSI standard (ZBC).

Standards exist to be changed. It's not like UFS folks have been pushing
all kinds of novel unproven crap all the time anyway.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19 17:12                         ` Bart Van Assche
  2024-12-19 23:10                           ` Damien Le Moal
@ 2024-12-21  8:13                           ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-12-21  8:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Damien Le Moal,
	linux-block@vger.kernel.org

On Thu, Dec 19, 2024 at 09:12:48AM -0800, Bart Van Assche wrote:
> Using zone append operations defeats two of the advantages of zoned
> storage. One of the advantages of zoned storage is that filesystems have
> control over the layout of files on flash memory with regular writes.

I'm not sure what you mean, but no, they absolutely do not. The FTL
will reorder in any way it sees fit. With zones it will generally do
less so (at least for sensible implementations), but that does not
depend on using write or zone append.

> That advantage is lost when using zone append operations because these
> operations may be reordered.

It is allowed to but doesn't in practice. It will reorder in exactly
the case where your scheme doesn't work at all.

> Another advantage that is lost is
> the reduction in size of the FTL translation table. When using zone
> append operations, the filesystem has to store the offsets returned by
> zone append operations somewhere. With regular writes this is not
> necessary.

The file system always has to track the extents. Bart, I'm not sure
where you get the lines you're spewing here from, but they are
completely decoupled from technical reality. Please spend some time
actually understanding the technologies you make claims about.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-21  8:10                                   ` Christoph Hellwig
@ 2025-01-06 18:54                                     ` Bart Van Assche
  0 siblings, 0 replies; 37+ messages in thread
From: Bart Van Assche @ 2025-01-06 18:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Damien Le Moal, Jens Axboe, linux-block@vger.kernel.org

On 12/21/24 12:10 AM, Christoph Hellwig wrote:
> On Thu, Dec 19, 2024 at 10:04:40AM -0800, Bart Van Assche wrote:
>> On 12/17/24 10:58 PM, Christoph Hellwig wrote:
>>> Use the new append writes that return the written offsets we've
>>> talked about.
>> Here is why this is not a solution for SCSI devices:
>> * There is no Zone Append command in the relevant SCSI standard (ZBC).
> 
> Standards exist to be changed.

(replying to an email of two weeks ago)

I'm not aware of any interest from my employer in standardizing a SCSI
zone append command. Additionally, the last time I tried to make a
nontrivial change to the SCSI standards (adding data lifetime support)
I learned that it is almost impossible for anyone other than the
longtime SCSI committee members to make nontrivial changes. All my
attempts to introduce data lifetime support in SBC failed. It was only
after the SBC editor took over the initiative and wrote a proposal that
progress was made and data lifetime support was added to the SCSI
standards.

If anyone else would like to work on standardizing a SCSI zone append
command I would be happy to help.

Bart.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Zoned storage and BLK_STS_RESOURCE
  2024-12-19 23:10                           ` Damien Le Moal
@ 2025-01-06 20:14                             ` Bart Van Assche
  0 siblings, 0 replies; 37+ messages in thread
From: Bart Van Assche @ 2025-01-06 20:14 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig, Jens Axboe; +Cc: linux-block@vger.kernel.org


On 12/19/24 3:10 PM, Damien Le Moal wrote:
> On 12/20/24 02:12, Bart Van Assche wrote:
>> On 12/18/24 10:00 PM, Christoph Hellwig wrote:
>>> If you rely on order. If you are doing O_APPEND-style I/O on zonefs or
>>> using a real file system it's perfectly fine.
>>
>> Using zone append operations defeats two of the advantages of zoned
>> storage. One of the advantages of zoned storage is that filesystems have
>> control over the layout of files on flash memory with regular writes.
> 
> Zone append still allows that: you can still choose the zone you write to.

(replying to an email of two weeks ago)

Zone append does not allow controlling the LBA if multiple writes are
outstanding for the same zone. We have use cases for which controlling
the file layout is important, namely loading large files and
applications as quickly as possible. At least for UFS devices,
sequential read performance improves if files are laid out sequentially
on the storage medium.
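
A concrete example: issue two 4 KiB zone append operations A and B
back-to-back to the same empty zone. If the device completes B first,
B lands at offset 0 and A at offset 4 KiB. The on-media layout is then
B followed by A, the reverse of the submission order, and the
submitter has no way to force A to land first.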

This is not the first time that the desire to lay out files
sequentially on UFS devices has come up. Others have worked on this
before. See e.g. Jiaming Li, "[RESEND PATCH 0/4] Implement File-Based
optimization functionality", November 2022
(https://lore.kernel.org/linux-scsi/20221102053058.21021-1-lijiaming3@xiaomi.corp-partner.google.com/).
File-Based Optimization (FBO) is a mechanism for requesting UFS devices
to organize logical blocks sequentially, e.g. the logical blocks of a
single file. Note: my employer has not been involved in the
standardization of FBO and has no plans to implement FBO support. We
prefer zoned storage.

Bart.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread

Thread overview: 37+ messages
2024-12-16 19:24 Zoned storage and BLK_STS_RESOURCE Bart Van Assche
2024-12-16 20:23 ` Damien Le Moal
2024-12-16 20:42   ` Bart Van Assche
2024-12-16 20:54     ` Damien Le Moal
2024-12-16 21:22       ` Bart Van Assche
2024-12-16 22:49         ` Damien Le Moal
2024-12-17 14:56         ` Damien Le Moal
2024-12-19  5:55           ` Christoph Hellwig
2024-12-19 17:07             ` Bart Van Assche
2024-12-17  4:15 ` Christoph Hellwig
2024-12-17 15:04   ` Damien Le Moal
2024-12-17 18:38     ` Bart Van Assche
2024-12-17 18:46     ` Jens Axboe
2024-12-17 18:51       ` Damien Le Moal
2024-12-17 19:07         ` Jens Axboe
2024-12-17 19:20           ` Damien Le Moal
2024-12-17 19:25             ` Bart Van Assche
2024-12-17 19:28               ` Damien Le Moal
2024-12-17 19:33                 ` Jens Axboe
2024-12-17 19:37                   ` Damien Le Moal
2024-12-17 19:41                     ` Jens Axboe
2024-12-17 19:48                       ` Damien Le Moal
2024-12-17 19:54                         ` Jens Axboe
2024-12-17 19:58                           ` Jens Axboe
2024-12-17 20:59                             ` Damien Le Moal
2024-12-17 21:25                               ` Jens Axboe
2024-12-18  6:58                               ` Christoph Hellwig
2024-12-19 18:04                                 ` Bart Van Assche
2024-12-21  8:10                                   ` Christoph Hellwig
2025-01-06 18:54                                     ` Bart Van Assche
2024-12-19  6:00                       ` Christoph Hellwig
2024-12-19 14:50                         ` Jens Axboe
2024-12-19 17:12                         ` Bart Van Assche
2024-12-19 23:10                           ` Damien Le Moal
2025-01-06 20:14                             ` Bart Van Assche
2024-12-21  8:13                           ` Christoph Hellwig
2024-12-17 19:32             ` Jens Axboe
