linux-block.vger.kernel.org archive mirror
* What should we do about the nvme atomics mess?
@ 2025-07-07 14:18 Christoph Hellwig
  2025-07-07 14:24 ` Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-07 14:18 UTC (permalink / raw)
  To: Alan Adamson, John Garry, Keith Busch, Martin K. Petersen,
	Jens Axboe
  Cc: linux-nvme, linux-block

Hi all,

I'm a bit lost on what to do about the sad state of NVMe atomic writes.

As a short reminder the main issues are:

 1) there is no flag on a command to request atomic (aka non-torn)
    behavior, instead writes adhering to the atomicity requirements will
    never be torn, and writes not adhering to them can be torn any time.
    This differs from SCSI where atomic writes have to be explicitly
    requested and fail when they can't be satisfied.
 2) the original way to indicate the main atomicity limit is the AWUPF
    field, which is in Identify Controller, but specified in logical
    blocks which only exist at a namespace layer.  This a) led to
    various problems because the limit is a mess when namespaces have
    different logical block sizes, and it b) also causes additional
    issues because NVMe allows it to be different for different
    controllers in the same subsystem.
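
To make issue 2a concrete, here is a minimal sketch of how one AWUPF
value maps to different byte sizes. It assumes only that AWUPF is a
0's based count of logical blocks; the helper name `awupf_bytes` and
the sample values are made up for illustration and are not taken from
the driver.

```python
def awupf_bytes(awupf: int, lba_size: int) -> int:
    """Byte size implied by a 0's based AWUPF for one logical block size."""
    return (awupf + 1) * lba_size


# One controller-wide AWUPF, two namespaces with different LBA formats.
AWUPF = 7  # hypothetical value, i.e. 8 logical blocks
for lba_size in (512, 4096):
    print(f"LBA size {lba_size:>4}: atomic unit {awupf_bytes(AWUPF, lba_size)} bytes")
```

The same field value stands for two different byte limits, so a host
cannot interpret it without first picking a namespace format.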

Commit 8695f060a029 added some sanity checks to deal with issue 2b,
but we kept running into more issues with it.  Partially because
the check wasn't quite correct, but also because we've gotten
reports of controllers that change the AWUPF value when reformatting
namespaces to deal with issue 2a.

And I'm a bit lost on what to do here.

We could:

 I.	 revert the check and the subsequent fixup.  If you really want
	 to use the nvme atomics you had better pray a lot already anyway
	 due to issue 1)
 II.	 limit the check to multi-controller subsystems
 III.	 don't allow atomics on controllers that only report AWUPF and
	 limit support to controllers that support the more sanely
	 defined NAWUPF
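
Option III could look roughly like the sketch below. This is not the
driver's code: it assumes NSFEAT bit 1 is the bit that says the
namespace-level atomic fields are valid and that NAWUPF is 0's based,
and the function name `atomic_write_bytes` is hypothetical.

```python
from typing import Optional

NSFEAT_ATOMICS = 1 << 1  # assumed: NSFEAT bit 1 marks NAWUN/NAWUPF/NACWU as valid


def atomic_write_bytes(nsfeat: int, nawupf: int, lba_size: int) -> Optional[int]:
    """Return the power-fail atomic write size in bytes, or None for no atomics.

    Under option III the controller-scoped AWUPF is never consulted:
    a namespace without valid NAWUPF simply gets no atomic write limit.
    """
    if not (nsfeat & NSFEAT_ATOMICS):
        return None
    return (nawupf + 1) * lba_size


print(atomic_write_bytes(nsfeat=0x2, nawupf=7, lba_size=4096))
print(atomic_write_bytes(nsfeat=0x0, nawupf=7, lba_size=4096))
```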

I guess for 6.16 we are limited to I. to bring us back to the previous
state, but I have a really bad gut feeling about it given the really
bad spec language and a lot of low quality NVMe implementations we're
seeing these days.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
@ 2025-07-07 14:24 ` Keith Busch
  2025-07-07 15:26   ` Hannes Reinecke
  2025-07-08  1:27 ` Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2025-07-07 14:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Adamson, John Garry, Martin K. Petersen, Jens Axboe,
	linux-nvme, linux-block

On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> We could:
> 
>  I.	 revert the check and the subsequent fixup.  If you really want
>          to use the nvme atomics you already better pray a lot anyway
> 	 due to issue 1)
>  II.	 limit the check to multi-controller subsystems
>  III.	 don't allow atomics on controllers that only report AWUPF and
>  	 limit support to controllers that support that more sanely
> 	 defined NAWUPF
> 
> I guess for 6.16 we are limited to I. to bring us back to the previous
> state, but I have a really bad gut feeling about it given the really
> bad spec language and a lot of low quality NVMe implementations we're
> seeing these days.

I like option III. The controller-scoped atomic size is broken for all
the reasons you mentioned, so I vote we not bother trying to make sense
of it.


* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:24 ` Keith Busch
@ 2025-07-07 15:26   ` Hannes Reinecke
  2025-07-07 15:56     ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2025-07-07 15:26 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Alan Adamson, John Garry, Martin K. Petersen, Jens Axboe,
	linux-nvme, linux-block

On 7/7/25 16:24, Keith Busch wrote:
> On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
>> We could:
>>
>>   I.	 revert the check and the subsequent fixup.  If you really want
>>           to use the nvme atomics you already better pray a lot anyway
>> 	 due to issue 1)
>>   II.	 limit the check to multi-controller subsystems
>>   III.	 don't allow atomics on controllers that only report AWUPF and
>>   	 limit support to controllers that support that more sanely
>> 	 defined NAWUPF
>>
>> I guess for 6.16 we are limited to I. to bring us back to the previous
>> state, but I have a really bad gut feeling about it given the really
>> bad spec language and a lot of low quality NVMe implementations we're
>> seeing these days.
> 
> I like option III. The controler scoped atomic size is broken for all
> the reasons you mentioned, so I vote we not bother trying to make sense
> of it.
> 
Agree. We might consider I. as a fixup for stable, but should continue
with III going forward.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: What should we do about the nvme atomics mess?
  2025-07-07 15:26   ` Hannes Reinecke
@ 2025-07-07 15:56     ` Keith Busch
  2025-07-07 23:35       ` Chaitanya Kulkarni
  2025-07-08  9:47       ` Christoph Hellwig
  0 siblings, 2 replies; 19+ messages in thread
From: Keith Busch @ 2025-07-07 15:56 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 05:26:46PM +0200, Hannes Reinecke wrote:
> On 7/7/25 16:24, Keith Busch wrote:
> > On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> > > We could:
> > > 
> > >   I.	 revert the check and the subsequent fixup.  If you really want
> > >           to use the nvme atomics you already better pray a lot anyway
> > > 	 due to issue 1)
> > >   II.	 limit the check to multi-controller subsystems
> > >   III.	 don't allow atomics on controllers that only report AWUPF and
> > >   	 limit support to controllers that support that more sanely
> > > 	 defined NAWUPF
> > > 
> > > I guess for 6.16 we are limited to I. to bring us back to the previous
> > > state, but I have a really bad gut feeling about it given the really
> > > bad spec language and a lot of low quality NVMe implementations we're
> > > seeing these days.
> > 
> > I like option III. The controler scoped atomic size is broken for all
> > the reasons you mentioned, so I vote we not bother trying to make sense
> > of it.
> > 
> Agree. We might consider I. as a fixup for stable, but should continue
> with III going forward.

I think the NVMe TWG might want to consider an ECN to deprecate or at
least recommend against AWUPF, too.

Just to throw AWUPF a lifeline for legacy devices, we could potentially
make sense of the value if Identify Controller says:

  1. CMIC == 0; and
  2. OACS.NMS == 0; and
  3.
    a. FNA.FNS == 1; or
    b. NN == 1

And if those conditions are true, then the controller and namespace
scopes resolve to a single namespace format, so the values should be one
and the same. The only way it could change, then, is a format command,
which means there couldn't be an in-use filesystem depending on it not
changing.
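
The conditions above can be sketched as a single predicate. The bit
positions used here (OACS bit 3 for namespace management, FNA bit 0
for "format applies to all namespaces") are assumptions to be checked
against the spec revision in use, and the function name is
hypothetical.

```python
OACS_NMS = 1 << 3  # assumed bit position: namespace management supported
FNA_FNS = 1 << 0   # assumed bit position: format applies to all namespaces


def awupf_is_unambiguous(cmic: int, oacs: int, fna: int, nn: int) -> bool:
    """True if controller and namespace scope collapse to one namespace format."""
    single_controller = cmic == 0                    # 1. CMIC == 0
    no_ns_management = (oacs & OACS_NMS) == 0        # 2. OACS.NMS == 0
    one_format = bool(fna & FNA_FNS) or nn == 1      # 3. FNA.FNS == 1 or NN == 1
    return single_controller and no_ns_management and one_format


# A simple single-namespace, single-controller device.
print(awupf_is_unambiguous(cmic=0, oacs=0, fna=0, nn=1))
# Same device, but namespace management is supported.
print(awupf_is_unambiguous(cmic=0, oacs=OACS_NMS, fna=0, nn=1))
```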


* Re: What should we do about the nvme atomics mess?
  2025-07-07 15:56     ` Keith Busch
@ 2025-07-07 23:35       ` Chaitanya Kulkarni
  2025-07-08  9:47       ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Chaitanya Kulkarni @ 2025-07-07 23:35 UTC (permalink / raw)
  To: Keith Busch, Hannes Reinecke
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org

On 7/7/25 08:56, Keith Busch wrote:
> On Mon, Jul 07, 2025 at 05:26:46PM +0200, Hannes Reinecke wrote:
>> On 7/7/25 16:24, Keith Busch wrote:
>>> On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
>>>> We could:
>>>>
>>>>    I.	 revert the check and the subsequent fixup.  If you really want
>>>>            to use the nvme atomics you already better pray a lot anyway
>>>> 	 due to issue 1)
>>>>    II.	 limit the check to multi-controller subsystems
>>>>    III.	 don't allow atomics on controllers that only report AWUPF and
>>>>    	 limit support to controllers that support that more sanely
>>>> 	 defined NAWUPF
>>>>
>>>> I guess for 6.16 we are limited to I. to bring us back to the previous
>>>> state, but I have a really bad gut feeling about it given the really
>>>> bad spec language and a lot of low quality NVMe implementations we're
>>>> seeing these days.
>>> I like option III. The controler scoped atomic size is broken for all
>>> the reasons you mentioned, so I vote we not bother trying to make sense
>>> of it.
>>>
>> Agree. We might consider I. as a fixup for stable, but should continue
>> with III going forward.
> I think the NVMe TWG might want to consider an ECN to deprecate or at
> least recommend against AUWPF, too.

We should really find a way to fix this in the spec. I'll be happy to add
this topic to the agenda so we can discuss it at length. Until that happens,
option III seems the right way to fix it.

-ck




* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
  2025-07-07 14:24 ` Keith Busch
@ 2025-07-08  1:27 ` Ming Lei
  2025-07-08  2:27   ` Keith Busch
  2025-07-08  9:38 ` Niklas Cassel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2025-07-08  1:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Adamson, John Garry, Keith Busch, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> 
> As a short reminder the main issues are:
> 
>  1) there is no flag on a command to request atomic (aka non-torn)
>     behavior, instead writes adhering to the atomicy requirements will
>     never be torn, and writes not adhering them can be torn any time.
>     This differs from SCSI where atomic writes have to be be explicitly
>     requested and fail when they can't be satisfied
>  2) the original way to indicate the main atomicy limit is the AWUPF
>     field, which is in Identify Controller, but specified in logical
>     blocks which only exist at a namespace layer.  This a) lead to

If controller-wide AWUPF is a mandatory property, the length has to be
aligned with the block size.

>     various problems because the limit is a mess when namespace have
>     different logical block sizes, and it b) also causes additional
>     issues because NVMe allows it to be different for different
>     controllers in the same subsystem.

The spec clearly states that the controller AWUPF should be supported by
any namespace format:

```
Atomic Write Unit Power Fail (AWUPF): This field indicates the size of the write
operation guaranteed to be written atomically to the NVM across all namespaces
with any supported namespace format during a power fail or error condition.
```

So I am wondering why the nvme driver can't validate NAWUN against AWUPF?


Thanks, 
Ming



* Re: What should we do about the nvme atomics mess?
  2025-07-08  1:27 ` Ming Lei
@ 2025-07-08  2:27   ` Keith Busch
  2025-07-08  2:46     ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2025-07-08  2:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Tue, Jul 08, 2025 at 09:27:06AM +0800, Ming Lei wrote:
> On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> > Hi all,
> > 
> > I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> > 
> > As a short reminder the main issues are:
> > 
> >  1) there is no flag on a command to request atomic (aka non-torn)
> >     behavior, instead writes adhering to the atomicy requirements will
> >     never be torn, and writes not adhering them can be torn any time.
> >     This differs from SCSI where atomic writes have to be be explicitly
> >     requested and fail when they can't be satisfied
> >  2) the original way to indicate the main atomicy limit is the AWUPF
> >     field, which is in Identify Controller, but specified in logical
> >     blocks which only exist at a namespace layer.  This a) lead to
> 
> If controller-wide AWUPF is a must property, the length has to be aligned
> with block size.

What block size? The controller doesn't have one. Block sizes are
properties of namespaces, not controllers or subsystems. If you have 10
namespaces with 10 different block formats, what does AWUPF mean? If the
controller must report something, the only rational thing it could
declare is reduced to the greatest common denominator, which is out of
sync with the true value reported in the appropriately scoped NAWUPF
value.
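
A small sketch of that reduction, under one reading of the spec: the
controller-wide value, counted in logical blocks, must not exceed what
any namespace can guarantee in its own block size. The device numbers
(a 16 KiB atomic unit, 512-byte and 4096-byte formats) and the helper
names are made up for illustration.

```python
def baseline_awupf(namespaces):
    """Largest 0's based AWUPF that holds for every (atomic_bytes, lba_size) pair."""
    blocks = [atomic_bytes // lba_size for atomic_bytes, lba_size in namespaces]
    return min(blocks) - 1


def implied_bytes(awupf, lba_size):
    """Byte size a host would derive from a 0's based block count."""
    return (awupf + 1) * lba_size


# Hypothetical device: 16 KiB true atomic unit, two namespace formats.
namespaces = [(16384, 512), (16384, 4096)]
awupf = baseline_awupf(namespaces)
for atomic_bytes, lba_size in namespaces:
    print(f"lba={lba_size}: true={atomic_bytes} bytes, "
          f"from AWUPF={implied_bytes(awupf, lba_size)} bytes")
```

In this example the 512-byte namespace ends up with a derived limit far
below what the device can actually do, which is the "out of sync"
effect described above.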


* Re: What should we do about the nvme atomics mess?
  2025-07-08  2:27   ` Keith Busch
@ 2025-07-08  2:46     ` Ming Lei
  2025-07-08  2:56       ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2025-07-08  2:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 08:27:43PM -0600, Keith Busch wrote:
> On Tue, Jul 08, 2025 at 09:27:06AM +0800, Ming Lei wrote:
> > On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> > > Hi all,
> > > 
> > > I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> > > 
> > > As a short reminder the main issues are:
> > > 
> > >  1) there is no flag on a command to request atomic (aka non-torn)
> > >     behavior, instead writes adhering to the atomicy requirements will
> > >     never be torn, and writes not adhering them can be torn any time.
> > >     This differs from SCSI where atomic writes have to be be explicitly
> > >     requested and fail when they can't be satisfied
> > >  2) the original way to indicate the main atomicy limit is the AWUPF
> > >     field, which is in Identify Controller, but specified in logical
> > >     blocks which only exist at a namespace layer.  This a) lead to
> > 
> > If controller-wide AWUPF is a must property, the length has to be aligned
> > with block size.
> 
> What block size? The controller doesn't have one. Block sizes are

It should be any NS format's block size.

> properties of namespaces, not controllers or subsystems. If you have 10
> namespaces with 10 different block formats, what does AUWPF mean? If the
> controller must report something, the only rational thing it could
> declare is reduced to the greatest common denominator, which is out of
> sync with the true value reported in the appropriately scoped NAUWPF
> value.

Yes, please see the words I quoted from the NVMe spec; `6.4 Atomic Operations`
also states: `NAWUPF >= AWUPF`.



Thanks,
Ming



* Re: What should we do about the nvme atomics mess?
  2025-07-08  2:46     ` Ming Lei
@ 2025-07-08  2:56       ` Keith Busch
  2025-07-08  3:17         ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2025-07-08  2:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Tue, Jul 08, 2025 at 10:46:06AM +0800, Ming Lei wrote:
> On Mon, Jul 07, 2025 at 08:27:43PM -0600, Keith Busch wrote:
> > On Tue, Jul 08, 2025 at 09:27:06AM +0800, Ming Lei wrote:
> > > On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> > > > Hi all,
> > > > 
> > > > I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> > > > 
> > > > As a short reminder the main issues are:
> > > > 
> > > >  1) there is no flag on a command to request atomic (aka non-torn)
> > > >     behavior, instead writes adhering to the atomicy requirements will
> > > >     never be torn, and writes not adhering them can be torn any time.
> > > >     This differs from SCSI where atomic writes have to be be explicitly
> > > >     requested and fail when they can't be satisfied
> > > >  2) the original way to indicate the main atomicy limit is the AWUPF
> > > >     field, which is in Identify Controller, but specified in logical
> > > >     blocks which only exist at a namespace layer.  This a) lead to
> > > 
> > > If controller-wide AWUPF is a must property, the length has to be aligned
> > > with block size.
> > 
> > What block size? The controller doesn't have one. Block sizes are
> 
> It should be any NS format's block size.

That requires an artificial reduction to a meaningless value.

> > properties of namespaces, not controllers or subsystems. If you have 10
> > namespaces with 10 different block formats, what does AUWPF mean? If the
> > controller must report something, the only rational thing it could
> > declare is reduced to the greatest common denominator, which is out of
> > sync with the true value reported in the appropriately scoped NAUWPF
> > value.
> 
> Yes, please see the words I quoted from NVMe spec, also `6.4 Atomic Operations`
> mentioned: `NAWUPF >= AWUPF`.

The problem is when Namespace X changing its format then alters
Namespace Y's reported atomic size. That's unacceptable for any
filesystem utilizing this feature.


* Re: What should we do about the nvme atomics mess?
  2025-07-08  2:56       ` Keith Busch
@ 2025-07-08  3:17         ` Ming Lei
  0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2025-07-08  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 08:56:58PM -0600, Keith Busch wrote:
> On Tue, Jul 08, 2025 at 10:46:06AM +0800, Ming Lei wrote:
> > On Mon, Jul 07, 2025 at 08:27:43PM -0600, Keith Busch wrote:
> > > On Tue, Jul 08, 2025 at 09:27:06AM +0800, Ming Lei wrote:
> > > > On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> > > > > Hi all,
> > > > > 
> > > > > I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> > > > > 
> > > > > As a short reminder the main issues are:
> > > > > 
> > > > >  1) there is no flag on a command to request atomic (aka non-torn)
> > > > >     behavior, instead writes adhering to the atomicy requirements will
> > > > >     never be torn, and writes not adhering them can be torn any time.
> > > > >     This differs from SCSI where atomic writes have to be be explicitly
> > > > >     requested and fail when they can't be satisfied
> > > > >  2) the original way to indicate the main atomicy limit is the AWUPF
> > > > >     field, which is in Identify Controller, but specified in logical
> > > > >     blocks which only exist at a namespace layer.  This a) lead to
> > > > 
> > > > If controller-wide AWUPF is a must property, the length has to be aligned
> > > > with block size.
> > > 
> > > What block size? The controller doesn't have one. Block sizes are
> > 
> > It should be any NS format's block size.
> 
> That requires an artificial reduction to a meaningless value.

Any value has to be 'block size' aligned.

> 
> > > properties of namespaces, not controllers or subsystems. If you have 10
> > > namespaces with 10 different block formats, what does AUWPF mean? If the
> > > controller must report something, the only rational thing it could
> > > declare is reduced to the greatest common denominator, which is out of
> > > sync with the true value reported in the appropriately scoped NAUWPF
> > > value.
> > 
> > Yes, please see the words I quoted from NVMe spec, also `6.4 Atomic Operations`
> > mentioned: `NAWUPF >= AWUPF`.
> 
> The problem is when Namespace X changes its format that then alters
> Namesace Y's reported atomic size. That's unacceptable for any
> filesystem utilizing this feature.

When X changes its format, the FS has to be unmounted.

The actual length (in bytes) of the atomic write does not change for Y,
just the unit (block size) changes, at least from Yi's report.


Thanks, 
Ming



* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
  2025-07-07 14:24 ` Keith Busch
  2025-07-08  1:27 ` Ming Lei
@ 2025-07-08  9:38 ` Niklas Cassel
  2025-07-08  9:48   ` Christoph Hellwig
  2025-07-08 10:08 ` John Garry
  2025-07-09  7:51 ` Nilay Shroff
  4 siblings, 1 reply; 19+ messages in thread
From: Niklas Cassel @ 2025-07-08  9:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Adamson, John Garry, Keith Busch, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> 
> As a short reminder the main issues are:
> 
>  1) there is no flag on a command to request atomic (aka non-torn)
>     behavior, instead writes adhering to the atomicy requirements will
>     never be torn, and writes not adhering them can be torn any time.
>     This differs from SCSI where atomic writes have to be be explicitly
>     requested and fail when they can't be satisfied
>  2) the original way to indicate the main atomicy limit is the AWUPF
>     field, which is in Identify Controller, but specified in logical
>     blocks which only exist at a namespace layer.  This a) lead to
>     various problems because the limit is a mess when namespace have
>     different logical block sizes, and it b) also causes additional
>     issues because NVMe allows it to be different for different
>     controllers in the same subsystem.
> 
> Commit 8695f060a029 added some sanity checks to deal with issue 2b,
> but we kept running into more issues with it.  Partially because
> the check wasn't quite correct, but also because we've gotten
> reports of controllers that change the AWUPF value when reformatting
> namespaces to deal with issue 2a.
> 
> And I'm a bit lost on what to do here.
> 
> We could:
> 
>  I.	 revert the check and the subsequent fixup.  If you really want
>          to use the nvme atomics you already better pray a lot anyway
> 	 due to issue 1)
>  II.	 limit the check to multi-controller subsystems
>  III.	 don't allow atomics on controllers that only report AWUPF and
>  	 limit support to controllers that support that more sanely
> 	 defined NAWUPF

I like III.

But NVMe should probably push to deprecate AWUPF, and introduce a new field
that is like AWUPF but which is specified in a fixed unit, e.g. bytes or
CAP.MPSMIN. (I'm thinking of e.g. Zone Append Size Limit (ZASL) which is also
a per controller limit, but the value is specified in units of CAP.MPSMIN,
just like the Maximum Data Transfer Size (MDTS).)
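
For comparison, a sketch of how an MDTS-style field decodes without any
reference to a namespace format. It assumes the usual encoding: the
field is a power-of-two exponent in units of the minimum memory page
size, which is 2^(12 + CAP.MPSMIN) bytes, and 0 means no limit. The
function names are hypothetical, and a new atomic-size field using this
encoding is only the suggestion above, not something the spec defines.

```python
from typing import Optional


def mpsmin_bytes(mpsmin: int) -> int:
    """Minimum memory page size in bytes for a given CAP.MPSMIN."""
    return 1 << (12 + mpsmin)


def fixed_unit_limit_bytes(field: int, mpsmin: int) -> Optional[int]:
    """Decode an MDTS-style limit: 2^field units of the minimum page size."""
    if field == 0:
        return None  # 0 is treated as "no limit reported"
    return (1 << field) * mpsmin_bytes(mpsmin)


print(fixed_unit_limit_bytes(field=5, mpsmin=0))
```

The result depends only on controller-level values, so it stays the
same no matter how individual namespaces are formatted.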


Kind regards,
Niklas


* Re: What should we do about the nvme atomics mess?
  2025-07-07 15:56     ` Keith Busch
  2025-07-07 23:35       ` Chaitanya Kulkarni
@ 2025-07-08  9:47       ` Christoph Hellwig
  2025-07-08 15:19         ` Keith Busch
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-08  9:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Christoph Hellwig, Alan Adamson, John Garry,
	Martin K. Petersen, Jens Axboe, linux-nvme, linux-block

On Mon, Jul 07, 2025 at 09:56:53AM -0600, Keith Busch wrote:
> I think the NVMe TWG might want to consider an ECN to deprecate or at
> least recommend against AUWPF, too.

Yeah.  A wording that every controller SHOULD implement NAWUPF if it
implements AWUPF might be good, eventually upgraded to a SHALL.

> Just to throw AWUPF a lifeline for legecy devices, we could potentially
> make sense of the value if Identify Controller says:
> 
>   1. CMIC == 0; and
>   2. OACS.NMS == 0; and

What is NMS meant to say?  namespace management support?

>   3.
>     a. FNA.FNS == 1; or
>     b. NN == 1
> 
> And if those conditions are true, then the controller and namespace
> scopes resolve to a single namespace format, so the values should be one
> in the same. The only way it could change, then, is a format command,
> which means there couldn't be an in-use filesystem depending on it not
> changing.

We could.  But are there many controllers where that would be the
case and where people want to use atomics?


* Re: What should we do about the nvme atomics mess?
  2025-07-08  9:38 ` Niklas Cassel
@ 2025-07-08  9:48   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-08  9:48 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Keith Busch,
	Martin K. Petersen, Jens Axboe, linux-nvme, linux-block

On Tue, Jul 08, 2025 at 11:38:09AM +0200, Niklas Cassel wrote:
> But NVMe should probably push to deprecate AUWPF, and introduce a new field
> that is like AUWPF but which is specified in a fixed unit, e.g. bytes or
> CAP.MPSMIN. (I'm thinking of e.g. Zone Append Size Limit (ZASL) which is also
> a per controller limit, but the value is specified in units of CAP.MPSMIN,
> just like the Maximum Data Transfer Size (MDTS).)

There's no advantage in having yet another field vs mandating NAWUPF.



* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
                   ` (2 preceding siblings ...)
  2025-07-08  9:38 ` Niklas Cassel
@ 2025-07-08 10:08 ` John Garry
  2025-07-09  7:51 ` Nilay Shroff
  4 siblings, 0 replies; 19+ messages in thread
From: John Garry @ 2025-07-08 10:08 UTC (permalink / raw)
  To: Christoph Hellwig, Alan Adamson, Keith Busch, Martin K. Petersen,
	Jens Axboe
  Cc: linux-nvme, linux-block, mcgrof

On 07/07/2025 15:18, Christoph Hellwig wrote:
> Hi all,
> 
> I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> 
> As a short reminder the main issues are:
> 
>   1) there is no flag on a command to request atomic (aka non-torn)
>      behavior, instead writes adhering to the atomicy requirements will
>      never be torn, and writes not adhering them can be torn any time.
>      This differs from SCSI where atomic writes have to be be explicitly
>      requested and fail when they can't be satisfied
>   2) the original way to indicate the main atomicy limit is the AWUPF
>      field, which is in Identify Controller, but specified in logical
>      blocks which only exist at a namespace layer.  This a) lead to
>      various problems because the limit is a mess when namespace have
>      different logical block sizes, and it b) also causes additional
>      issues because NVMe allows it to be different for different
>      controllers in the same subsystem.
> 
> Commit 8695f060a029 added some sanity checks to deal with issue 2b,
> but we kept running into more issues with it.  Partially because
> the check wasn't quite correct, but also because we've gotten
> reports of controllers that change the AWUPF value when reformatting
> namespaces to deal with issue 2a.
> 
> And I'm a bit lost on what to do here.
> 
> We could:
> 
>   I.	 revert the check and the subsequent fixup.  If you really want
>           to use the nvme atomics you already better pray a lot anyway
> 	 due to issue 1)
>   II.	 limit the check to multi-controller subsystems
>   III.	 don't allow atomics on controllers that only report AWUPF and
>   	 limit support to controllers that support that more sanely
> 	 defined NAWUPF

This would help avoid the ambiguity over whether NABSPF is valid if nsfeat
bit 1 is unset.

However, it would be nice to have an idea of how many products (or what
percentage) it would affect today. FWIW, I only have one SSD which supports
atomics, and it does set that bit.

I suppose we could quirk known "good" HW which relies on AWUPF (to
enable atomics), but that is very far from a nice approach.

> 
> I guess for 6.16 we are limited to I. to bring us back to the previous
> state, but I have a really bad gut feeling about it given the really
> bad spec language and a lot of low quality NVMe implementations we're
> seeing these days.
>   not the



* Re: What should we do about the nvme atomics mess?
  2025-07-08  9:47       ` Christoph Hellwig
@ 2025-07-08 15:19         ` Keith Busch
  0 siblings, 0 replies; 19+ messages in thread
From: Keith Busch @ 2025-07-08 15:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hannes Reinecke, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Tue, Jul 08, 2025 at 11:47:48AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 07, 2025 at 09:56:53AM -0600, Keith Busch wrote:
> > Just to throw AWUPF a lifeline for legecy devices, we could potentially
> > make sense of the value if Identify Controller says:
> > 
> >   1. CMIC == 0; and
> >   2. OACS.NMS == 0; and
> 
> What is NMS meant to say?  namespace management support?

Right, namespace management support. The spec calls this field 'NMS' now.
 
> >   3.
> >     a. FNA.FNS == 1; or
> >     b. NN == 1
> > 
> > And if those conditions are true, then the controller and namespace
> > scopes resolve to a single namespace format, so the values should be one
> > in the same. The only way it could change, then, is a format command,
> > which means there couldn't be an in-use filesystem depending on it not
> > changing.
> 
> We could.  But are there many controllers where that would be the
> case and where people want to use atomics?

Maybe not. I still have a lot of 1.0 compliant devices where this might
apply, but I don't have a use case explicitly needing the atomic write
features anyway, so it doesn't matter to me if the driver doesn't report
the limits for them.

So I guess no need to work with such devices at this point, but maybe
just something to consider in the unlikely event someone complains.


* Re: What should we do about the nvme atomics mess?
  2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
                   ` (3 preceding siblings ...)
  2025-07-08 10:08 ` John Garry
@ 2025-07-09  7:51 ` Nilay Shroff
  2025-07-09 21:28   ` Keith Busch
  4 siblings, 1 reply; 19+ messages in thread
From: Nilay Shroff @ 2025-07-09  7:51 UTC (permalink / raw)
  To: Christoph Hellwig, Alan Adamson, John Garry, Keith Busch,
	Martin K. Petersen, Jens Axboe
  Cc: linux-nvme, linux-block



On 7/7/25 7:48 PM, Christoph Hellwig wrote:
> Hi all,
> 
> I'm a bit lost on what to do about the sad state of NVMe atomic writes.
> 
> As a short reminder the main issues are:
> 
>  1) there is no flag on a command to request atomic (aka non-torn)
>     behavior, instead writes adhering to the atomicity requirements will
>     never be torn, and writes not adhering to them can be torn any time.
>     This differs from SCSI where atomic writes have to be explicitly
>     requested and fail when they can't be satisfied
>  2) the original way to indicate the main atomicity limit is the AWUPF
>     field, which is in Identify Controller, but specified in logical
>     blocks which only exist at a namespace layer.  This a) led to
>     various problems because the limit is a mess when namespaces have
>     different logical block sizes, and it b) also causes additional
>     issues because NVMe allows it to be different for different
>     controllers in the same subsystem.
> 
> Commit 8695f060a029 added some sanity checks to deal with issue 2b,
> but we kept running into more issues with it.  Partially because
> the check wasn't quite correct, but also because we've gotten
> reports of controllers that change the AWUPF value when reformatting
> namespaces to deal with issue 2a.
> 
> And I'm a bit lost on what to do here.
> 
> We could:
> 
>  I.	 revert the check and the subsequent fixup.  If you really want
>          to use the nvme atomics you already better pray a lot anyway
> 	 due to issue 1)
>  II.	 limit the check to multi-controller subsystems
>  III.	 don't allow atomics on controllers that only report AWUPF and
>  	 limit support to controllers that support that more sanely
> 	 defined NAWUPF
> 
> I guess for 6.16 we are limited to I. to bring us back to the previous
> state, but I have a really bad gut feeling about it given the really
> bad spec language and a lot of low quality NVMe implementations we're
> seeing these days.
> 
I believe there are multi-controller NVMe disks in the field (including the 
one I have) that do not exhibit such inconsistencies, i.e., they report a
consistent AWUPF value across controllers and do not change it based on 
namespace format. The NVMe specification states this (quoting it from 
NVM-Command-Set-Specification-1.0e):

"The values (referencing AWUPF / AWUN) reported in the Identify Controller
data structure are valid across all namespaces with any supported namespace
format, forming a baseline value that is guaranteed not to change."

While the spec doesn’t explicitly require that AWUPF be consistent across
controllers within the same subsystem, it seems to be implied. That said,
I agree this should have been stated explicitly in the specification.

If vendors strictly adhered to the current spec, we likely wouldn’t be 
facing this issue. That said, given the current behavior, I also support
approach III. However, choosing this approach effectively penalizes vendors
who have implemented atomic write support correctly—that is, those who use
AWUPF to advertise atomic write capabilities, do not rely on NAWUPF, and
report a consistent AWUPF across controllers.

In my opinion, the proper long-term fix is to escalate this to the NVMe 
Technical Work Group (TWG) and propose a specification update that:

- Deprecates the use of AWUPF for advertising atomic write capabilities
- Mandates the use of NAWUPF instead

Once such a spec update is ratified, we can move forward with approach III.

Thanks,
--Nilay



* Re: What should we do about the nvme atomics mess?
  2025-07-09  7:51 ` Nilay Shroff
@ 2025-07-09 21:28   ` Keith Busch
  2025-07-10  5:07     ` Nilay Shroff
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2025-07-09 21:28 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block

On Wed, Jul 09, 2025 at 01:21:17PM +0530, Nilay Shroff wrote:
> I believe there are multi-controller NVMe disks in the field (including the 
> one I have) that do not exhibit such inconsistencies, i.e., they report a
> consistent AWUPF value across controllers and do not change it based on 
> namespace format. The NVMe specification states this (quoting it from 
> NVM-Command-Set-Specification-1.0e):
> 
> "The values (referencing AWUPF / AWUN) reported in the Identify Controller
> data structure are valid across all namespaces with any supported namespace
> format, forming a baseline value that is guaranteed not to change."

I don't think that's a backward compatible requirement. Controllers
often rescale these after a format command, and it was the only way for
1.0 and 1.1 controllers to report atomic sizes.

Let's say the controller can do 128k byte atomic writes. If all
namespaces used 512b LBA format, then AWUPF would be 255. If you change
one namespace format to 4k, AWUPF scales down to 31, yielding a
sub-optimal result for all the other namespaces.
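
To make that arithmetic concrete, here is a small Python sketch. It assumes
AWUPF is a 0's based count of logical blocks (so 255 means 256 blocks). The
helper names are made up for illustration and are not driver code.

```python
def awupf_for(atomic_bytes: int, lba_size: int) -> int:
    """0's based AWUPF a controller would report for one LBA size."""
    blocks = atomic_bytes // lba_size
    if blocks < 1:
        raise ValueError("atomic unit smaller than one logical block")
    return blocks - 1


def atomic_bytes_for(awupf: int, lba_size: int) -> int:
    """Atomic write size in bytes that a host derives from AWUPF."""
    return (awupf + 1) * lba_size


ATOMIC_BYTES = 128 * 1024

all_512 = awupf_for(ATOMIC_BYTES, 512)    # every namespace uses 512b LBAs
with_4k = awupf_for(ATOMIC_BYTES, 4096)   # after one namespace moves to 4k

print(all_512, with_4k)
# What the remaining 512b namespaces can advertise after the rescale.
print(atomic_bytes_for(with_4k, 512))
```

With AWUPF rescaled to 31, a 512b namespace is left advertising 16k instead
of the 128k the controller can actually do.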

> While the spec doesn't explicitly require that AWUPF be consistent across
> controllers within the same subsystem, it seems to be implied. That said,
> I agree this should have been stated explicitly in the specification.

Considering multi-controller subsystems, some controllers might have
namespaces with only 512b formats attached, and other controllers might
have some 4k mixed in, so then they can't all consistently report the
desired AWUPF value. They'd have to just scale AWUPF based on the
largest sector size supported. Which I guess is what the current wording
is guiding toward, but that just suggests host drivers disregard the
value and use NAWUPF instead. So still option III.
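
A rough sketch of that multi-controller case, in Python. It assumes the same
128k atomic unit as above and a 0's based AWUPF. The function name and the
namespace lists are hypothetical and only illustrate the scaling argument.

```python
def awupf_for_controller(atomic_bytes: int, attached_lba_sizes: list[int]) -> int:
    """0's based AWUPF that is safe for every namespace attached to a controller."""
    largest = max(attached_lba_sizes)
    return atomic_bytes // largest - 1


ATOMIC_BYTES = 128 * 1024

ctrl_a = awupf_for_controller(ATOMIC_BYTES, [512, 512])    # 512b namespaces only
ctrl_b = awupf_for_controller(ATOMIC_BYTES, [512, 4096])   # some 4k mixed in

print(ctrl_a, ctrl_b)  # the two controllers disagree

# Scaling by the largest supported sector size makes them agree again,
# at the cost of a smaller limit for the 512b namespaces.
SUPPORTED = [512, 4096]
baseline = awupf_for_controller(ATOMIC_BYTES, SUPPORTED)
print(baseline)
```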


* Re: What should we do about the nvme atomics mess?
  2025-07-09 21:28   ` Keith Busch
@ 2025-07-10  5:07     ` Nilay Shroff
  2025-07-10  7:17       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Nilay Shroff @ 2025-07-10  5:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Alan Adamson, John Garry, Martin K. Petersen,
	Jens Axboe, linux-nvme, linux-block



On 7/10/25 2:58 AM, Keith Busch wrote:
> On Wed, Jul 09, 2025 at 01:21:17PM +0530, Nilay Shroff wrote:
>> I believe there are multi-controller NVMe disks in the field (including the 
>> one I have) that do not exhibit such inconsistencies, i.e., they report a
>> consistent AWUPF value across controllers and do not change it based on 
>> namespace format. The NVMe specification states this (quoting it from 
>> NVM-Command-Set-Specification-1.0e):
>>
>> "The values (referencing AWUPF / AWUN) reported in the Identify Controller
>> data structure are valid across all namespaces with any supported namespace
>> format, forming a baseline value that is guaranteed not to change."
> 
> I don't think that's a backward compatible requirement. Controllers
> often rescale these after a format command, and it was the only way for
> 1.0 and 1.1 controllers to report atomic sizes.
> 
> Let's say the controller can do 128k byte atomic writes. If all
> namespaces used 512b LBA format, then AWUPF would be 255. If you change
> one namespace format to 4k, AWUPF scales down to 31, yielding a
> sub-optimal result for all the other namespaces.
> 
On the multi-controller disk I’ve been testing, each controller consistently
reports an AWUPF value of 63. I created shared namespaces with mixed LBA formats
— some using 512-byte LBAs and others using 4KB LBAs — and observed that the 
AWUPF value remained constant at 63 across all controllers and formats.

This implies that:
- A namespace with 4KB LBA format can support up to 256KB of atomic
  writes (4KB × 64),
- A namespace with 512-byte LBA format can only support up to 32KB of
  atomic writes (512B × 64).

So in this case, it's actually the opposite of what one might assume:
Users of namespaces with 4KB LBA format would see the best possible atomic write
performance, while those using 512-byte LBA format may observe sub-optimal 
performance, since the maximum atomic write size scales down with smaller LBAs.
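
The numbers above can be reproduced with a short Python sketch. It assumes
AWUPF is 0's based, so 63 means 64 logical blocks. The helper name is only
illustrative.

```python
AWUPF = 63  # 0's based value reported by every controller on the tested disk


def atomic_bytes_for(awupf: int, lba_size: int) -> int:
    """Atomic write size in bytes implied by a fixed AWUPF for one LBA size."""
    return (awupf + 1) * lba_size


for lba_size in (512, 4096):
    print(lba_size, atomic_bytes_for(AWUPF, lba_size) // 1024, "KiB")
```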

>> While the spec doesn't explicitly require that AWUPF be consistent across
>> controllers within the same subsystem, it seems to be implied. That said,
>> I agree this should have been stated explicitly in the specification.
> 
> Considering multi-controller subsystems, some controllers might have
> namespaces with only 512b formats attached, and other controllers might
> have some 4k mixed in, so then they can't all consistently report the
> desired AWUPF value. They'd have to just scale AWUPF based on the
> largest sector size supported. Which I guess is what the current wording
> is guiding toward, but that just suggests host drivers disregard the
> value and use NAWUPF instead. So still option III.

Yes, I agree — option III seems to be the best possible way forward. 
However, does this mean we would disregard atomic write support for any
multi-controller NVMe vendor that consistently reports a valid AWUPF value
across all controllers and namespace formats, but sets NAWUPF to zero?
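
To illustrate the question, here is a hypothetical Python sketch of what
option III could mean for such a device. Treating NAWUPF == 0 as "not
reported" is an assumption taken from the wording above, not from the spec
or the driver, and the function is not existing code.

```python
def option_iii_atomic_bytes(nawupf: int, lba_size: int) -> int:
    """Hypothetical option III policy: trust NAWUPF only, never AWUPF.

    Returns 0 when the namespace reports no usable NAWUPF.
    """
    if nawupf == 0:
        return 0  # assumption: zero means "not reported"
    return (nawupf + 1) * lba_size


# Device from the question: consistent AWUPF of 63, NAWUPF left at zero.
print(option_iii_atomic_bytes(nawupf=0, lba_size=4096))
# A device that does fill in NAWUPF keeps its limit.
print(option_iii_atomic_bytes(nawupf=63, lba_size=512))
```

Under this reading, the first device would lose its atomic write limit even
though its AWUPF never changes.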

Thanks,
--Nilay


* Re: What should we do about the nvme atomics mess?
  2025-07-10  5:07     ` Nilay Shroff
@ 2025-07-10  7:17       ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-10  7:17 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Keith Busch, Christoph Hellwig, Alan Adamson, John Garry,
	Martin K. Petersen, Jens Axboe, linux-nvme, linux-block

On Thu, Jul 10, 2025 at 10:37:19AM +0530, Nilay Shroff wrote:
> So in this case, it's actually the opposite of what one might assume:
> Users of namespaces with 4KB LBA format would see the best possible atomic write
> performance, while those using 512-byte LBA format may observe sub-optimal 
> performance, since the maximum atomic write size scales down with smaller LBAs.

The problem is that we need to deal with the worst case and not the
best case.  And NVMe royally messed up there.



Thread overview: 19+ messages
2025-07-07 14:18 What should we do about the nvme atomics mess? Christoph Hellwig
2025-07-07 14:24 ` Keith Busch
2025-07-07 15:26   ` Hannes Reinecke
2025-07-07 15:56     ` Keith Busch
2025-07-07 23:35       ` Chaitanya Kulkarni
2025-07-08  9:47       ` Christoph Hellwig
2025-07-08 15:19         ` Keith Busch
2025-07-08  1:27 ` Ming Lei
2025-07-08  2:27   ` Keith Busch
2025-07-08  2:46     ` Ming Lei
2025-07-08  2:56       ` Keith Busch
2025-07-08  3:17         ` Ming Lei
2025-07-08  9:38 ` Niklas Cassel
2025-07-08  9:48   ` Christoph Hellwig
2025-07-08 10:08 ` John Garry
2025-07-09  7:51 ` Nilay Shroff
2025-07-09 21:28   ` Keith Busch
2025-07-10  5:07     ` Nilay Shroff
2025-07-10  7:17       ` Christoph Hellwig
