Linux-NVME Archive on lore.kernel.org
* Re: [PATCH 0/2] nvme: handle partially unique NID value
       [not found] <20250414090959.2015-1-hare@kernel.org>
@ 2025-04-14 11:19 ` Christoph Hellwig
  2025-04-14 11:31   ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-04-14 11:19 UTC (permalink / raw)
  To: hare; +Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, wagi, linux-nvme

On Mon, Apr 14, 2025 at 11:09:57AM +0200, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Hi all,
> 
> we have encountered a customer issue where the NID values for additional
> namespaces on the same device are not unique in all cases; the NGUID is,
> but the EUI64 is not. Problem is that prior to commit e2724cb9f0c4 these
> devices worked without a problem, but after that all NIDs are blanked out.
> This results in udev not creating persistent device links anymore and the
> system failing to boot.

These devices are so broken that we absolutely should not support them.
You've also received that feedback both in person from me, from Daniel
and from the nvme technical working group.  I'm not sure why you insist
on resending it instead of telling the OEM that specifically requested
this spec-violating behavior from their SSD vendor to stop doing these
broken things in the many months you have known of this gravely
incorrect, indefensible behavior.




* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-04-14 11:19 ` [PATCH 0/2] nvme: handle partially unique NID value Christoph Hellwig
@ 2025-04-14 11:31   ` Hannes Reinecke
  2025-04-14 11:41     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2025-04-14 11:31 UTC (permalink / raw)
  To: Christoph Hellwig, hare
  Cc: Keith Busch, Sagi Grimberg, wagi, linux-nvme,
	Ballard, Curtis C (HPE Storage), Javier Gonzalez

On 4/14/25 13:19, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 11:09:57AM +0200, hare@kernel.org wrote:
>> From: Hannes Reinecke <hare@kernel.org>
>>
>> Hi all,
>>
>> we have encountered a customer issue where the NID values for additional
>> namespaces on the same device are not unique in all cases; the NGUID is,
>> but the EUI64 is not. Problem is that prior to commit e2724cb9f0c4 these
>> devices worked without a problem, but after that all NIDs are blanked out.
>> This results in udev not creating persistent device links anymore and the
>> system failing to boot.
> 
> These devices are so broken that we absolutely should not support them.
> You've also received that feedback both in person from me, from Daniel
> and from the nvme technical working group.  I'm not sure why you insist
> on resending it instead of telling the OEM that specifically requested
> this spec-violating behavior from their SSD vendor to stop doing these
> broken things in the many months you have known of this gravely
> incorrect, indefensible behavior.
> 
Thank you for your kind words.

We have discussed this at LSF, and the involved parties (ie
Samsung as the vendor, HPe as the IHV, and us as the OS provider)
are happy with this approach.
And we have paying customers for which the cited patch caused a 
regression, so ignoring it is not an option for us.
I hoped this patchset would be acceptable for upstream; as it is not,
we will have to include this patchset as a SUSE-specific modification.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-04-14 11:31   ` Hannes Reinecke
@ 2025-04-14 11:41     ` Christoph Hellwig
  2025-04-14 11:55       ` Hannes Reinecke
  2025-04-17 16:56       ` Ballard, Curtis C (HPE Storage)
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-04-14 11:41 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, hare, Keith Busch, Sagi Grimberg, wagi,
	linux-nvme, Ballard, Curtis C (HPE Storage), Javier Gonzalez

On Mon, Apr 14, 2025 at 01:31:29PM +0200, Hannes Reinecke wrote:
> We have discussed this at LSF, and the involved parties (ie
> Samsung as the vendor, HPe as the IHV, and us as the OS provider)
> are happy with this approach.
> And we have paying customers for which the cited patch caused a regression, 
> so ignoring it is not an option for us.

Tell them to fix their broken systems instead of shifting this broken
crap upstream.  Really, we bend over backwards for consumer hardware
that doesn't know better.  We don't add crap for vendors that absolutely
should know better, participate in the working group, and only provide
expensive enterprise hardware, just because they pay you.  If you have
so little spine that you want to accommodate this intentionally broken
behavior, do it in your tree, but don't force the burden on others.



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-04-14 11:41     ` Christoph Hellwig
@ 2025-04-14 11:55       ` Hannes Reinecke
  2025-04-14 11:59         ` Christoph Hellwig
  2025-04-17 16:56       ` Ballard, Curtis C (HPE Storage)
  1 sibling, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2025-04-14 11:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hare, Keith Busch, Sagi Grimberg, linux-nvme,
	Ballard, Curtis C (HPE Storage), Javier Gonzalez, Daniel Wagner

On 4/14/25 13:41, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 01:31:29PM +0200, Hannes Reinecke wrote:
>> We have discussed this at LSF, and the involved parties (ie
>> Samsung as the vendor, HPe as the IHV, and us as the OS provider)
>> are happy with this approach.
>> And we have paying customers for which the cited patch caused a regression,
>> so ignoring it is not an option for us.
> 
> Tell them to fix their broken systems instead of shifting this broken
> crap upstream.  Really, we bend over backwards for consumer hardware
> that doesn't know better.  We don't add crap for vendors that absolutely
> should know better, participate in the working group, and only provide
> expensive enterprise hardware, just because they pay you.  If you have
> so little spine that you want to accommodate this intentionally broken
> behavior, do it in your tree, but don't force the burden on others.

A simple NACK would have been sufficient.

Cheers,

Hannes 'spineless' Reinecke
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-04-14 11:55       ` Hannes Reinecke
@ 2025-04-14 11:59         ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-04-14 11:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, hare, Keith Busch, Sagi Grimberg, linux-nvme,
	Ballard, Curtis C (HPE Storage), Javier Gonzalez, Daniel Wagner

On Mon, Apr 14, 2025 at 01:55:26PM +0200, Hannes Reinecke wrote:
> A simple NACK would have been sufficient.

No, it wouldn't.  Your behavior here, where you keep pushing for
something really stupid after repeated NAKs, is infuriating.  As is the
OEM's behavior of even asking for this to start with.




* RE: [PATCH 0/2] nvme: handle partially unique NID value
  2025-04-14 11:41     ` Christoph Hellwig
  2025-04-14 11:55       ` Hannes Reinecke
@ 2025-04-17 16:56       ` Ballard, Curtis C (HPE Storage)
       [not found]         ` <CGME20250502082359uscas1p1e2a9858dcc9200ab1d1d863c4495fc0a@uscas1p1.samsung.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Ballard, Curtis C (HPE Storage) @ 2025-04-17 16:56 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke
  Cc: hare@kernel.org, Keith Busch, Sagi Grimberg, wagi@lst.de,
	linux-nvme@lists.infradead.org, Javier Gonzalez

Christoph,

There is no debate about whether the NID reporting behavior is incorrect and
has to be fixed. It definitely has to be fixed and is getting fixed for new
drives.

That behavior was a defect, not a request, and I have theories on how people
that probably knew better missed realizing that.

Unfortunately the incorrect implementation was missed for quite a while and
there are drives in the field that have a correct NGUID and an invalid EUI64 in
some specific configurations. There is no simple fix for the drives in the 
field.

I've seen some reflector traffic that suggests that similar behavior has been 
seen in other drives.

Since the NGUID is valid, and is the value used as the unique namespace ID (when
present), the issue didn't create problems in the environment where the drives 
were being used until a uniqueness check was performed on the EUI64.

It is a very serious error that the EUI64 is not unique and it is completely 
appropriate for that to be flagged.

Having a quirk of some kind that allows the drives to be used, when they worked
perfectly previously, seems like the right thing to do.

A discussion on how to appropriately flag this serious error seems to be in 
order if the method proposed by Hannes isn't acceptable.

Curtis
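
The difference between the two policies under discussion can be sketched
as follows. This is only a minimal illustration in Python, not the kernel
code and not the proposed patch: the class, the function names and the
identifier values are all hypothetical, and the "strict" function models
the behavior only as it is described in this thread (all NIDs blanked
once any one of them is found to be duplicated).

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class NamespaceIds:
    """Hypothetical holder for two identifiers reported by a namespace."""
    nsid: int
    nguid: Optional[str]  # hex string standing in for the 16-byte NGUID
    eui64: Optional[str]  # hex string standing in for the 8-byte EUI64


def _duplicates(values: Iterable[Optional[str]]) -> Set[str]:
    """Return the non-empty values that occur more than once."""
    counts = Counter(v for v in values if v is not None)
    return {v for v, n in counts.items() if n > 1}


def blank_all_on_conflict(namespaces: List[NamespaceIds]) -> List[NamespaceIds]:
    """Strict policy (assumed model): one duplicated identifier clears
    every identifier of the affected namespace."""
    dup_nguid = _duplicates(ns.nguid for ns in namespaces)
    dup_eui64 = _duplicates(ns.eui64 for ns in namespaces)
    result = []
    for ns in namespaces:
        if ns.nguid in dup_nguid or ns.eui64 in dup_eui64:
            result.append(replace(ns, nguid=None, eui64=None))
        else:
            result.append(ns)
    return result


def blank_only_duplicated(namespaces: List[NamespaceIds]) -> List[NamespaceIds]:
    """Lenient policy (assumed model): clear only the identifier type
    that is actually duplicated and keep the unique ones."""
    dup_nguid = _duplicates(ns.nguid for ns in namespaces)
    dup_eui64 = _duplicates(ns.eui64 for ns in namespaces)
    return [
        replace(
            ns,
            nguid=None if ns.nguid in dup_nguid else ns.nguid,
            eui64=None if ns.eui64 in dup_eui64 else ns.eui64,
        )
        for ns in namespaces
    ]


# Two namespaces with unique NGUIDs but a shared EUI64 (made-up values).
example = [
    NamespaceIds(nsid=1, nguid="aa01", eui64="e0"),
    NamespaceIds(nsid=2, nguid="aa02", eui64="e0"),
]
strict = blank_all_on_conflict(example)
lenient = blank_only_duplicated(example)
```

With the strict policy both namespaces end up with no identifier at all,
so nothing is left to build a persistent device link from; with the
lenient one the unique NGUIDs survive and only the shared EUI64 is
dropped.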

-----Original Message-----
From: Christoph Hellwig <hch@lst.de> 
Sent: Monday, April 14, 2025 5:41 AM
To: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>; hare@kernel.org; Keith Busch <kbusch@kernel.org>; Sagi Grimberg <sagi@grimberg.me>; wagi@lst.de; linux-nvme@lists.infradead.org; Ballard, Curtis C (HPE Storage) <curtis.ballard@hpe.com>; Javier Gonzalez <javier.gonz@samsung.com>
Subject: Re: [PATCH 0/2] nvme: handle partially unique NID value

On Mon, Apr 14, 2025 at 01:31:29PM +0200, Hannes Reinecke wrote:
> We have discussed this at LSF, and the involved parties (ie
> Samsung as the vendor, HPe as the IHV, and us as the OS provider)
> are happy with this approach.
> And we have paying customers for which the cited patch caused a regression, 
> so ignoring it is not an option for us.

Tell them to fix their broken systems instead of shifting this broken
crap upstream.  Really, we bend over backwards for consumer hardware
that doesn't know better.  We don't add crap for vendors that absolutely
should know better, participate in the working group, and only provide
expensive enterprise hardware, just because they pay you.  If you have
so little spine that you want to accommodate this intentionally broken
behavior, do it in your tree, but don't force the burden on others.




* Re: [PATCH 0/2] nvme: handle partially unique NID value
       [not found]           ` <eee8de0a44074dd3bfb0fc6ec425b647@samsung.com>
@ 2025-05-02 10:25             ` Christoph Hellwig
       [not found]               ` <27a99b458f0144fba094726e4f470552@samsung.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-05-02 10:25 UTC (permalink / raw)
  To: Judy Brock
  Cc: Ballard, Curtis C (HPE Storage), Christoph Hellwig,
	Hannes Reinecke, hare@kernel.org, Keith Busch, Sagi Grimberg,
	wagi@lst.de, linux-nvme@lists.infradead.org, Javier Gonzalez

Judy, stop it.  HP could have trivially asked Samsung for a firmware
update and gotten it in the time they used all their commercial channels
to fight actually having to fix their intentional stupidity.

If you are a supposedly legit enterprise storage vendor and ask your SSD
vendor for a non-standard data corrupting feature you have to admit your
failure and fix it.  And I'm amazed how HP is trying to flex their
commercial muscle to get around having to admit their failure and
fix it, and I'm also really surprised how little spine you folks have to
play along with this.

This is very disappointing and does not make you a trustworthy actor.




* Re: [PATCH 0/2] nvme: handle partially unique NID value
       [not found]               ` <27a99b458f0144fba094726e4f470552@samsung.com>
@ 2025-05-03  3:46                 ` Keith Busch
  2025-05-05  9:51                   ` Javier Gonzalez
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2025-05-03  3:46 UTC (permalink / raw)
  To: Judy Brock
  Cc: Christoph Hellwig, Ballard, Curtis C (HPE Storage),
	Hannes Reinecke, hare@kernel.org, Sagi Grimberg, wagi@lst.de,
	linux-nvme@lists.infradead.org, Javier Gonzalez

On Fri, May 02, 2025 at 11:26:47PM +0000, Judy Brock wrote:
> For example, both companies have "admitted failure" but you haven't
> heard it: the FW in question definitely has a defect. Neither company
> is holding it out as compliant. Both companies have indicated going
> forward, the defective behavior has been corrected.
> 
> Not sure why you keep saying that neither company is willing to fix it.

I'm a little confused. If the conflicting behavior has been corrected,
why is this being discussed here? A device side fix is surely the best
possible outcome for everyone here. Requiring a kernel upgrade to work
around undesirable firmware behavior is a bit unpleasant for end users
when you already have a solution that works with any NVMe-capable OS.



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-05-03  3:46                 ` Keith Busch
@ 2025-05-05  9:51                   ` Javier Gonzalez
  2025-05-05 11:11                     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Javier Gonzalez @ 2025-05-05  9:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Judy Brock, Christoph Hellwig, Ballard, Curtis C (HPE Storage),
	Hannes Reinecke, hare@kernel.org, Sagi Grimberg, wagi@lst.de,
	linux-nvme@lists.infradead.org

On 02.05.2025 21:46, Keith Busch wrote:
>On Fri, May 02, 2025 at 11:26:47PM +0000, Judy Brock wrote:
>> For example, both companies have "admitted failure" but you haven't
>> heard it: the FW in question definitely has a defect. Neither company
>> is holding it out as compliant. Both companies have indicated going
>> forward, the defective behavior has been corrected.
>>
>> Not sure why you keep saying that neither company is willing to fix it.
>
>I'm a little confused. If the conflicting behavior has been corrected,
>why is this being discussed here? A device side fix is surely the best
>possible outcome for everyone here. Requiring a kernel upgrade to work
>around undesirable firmware behavior is a bit unpleasant for end users
>when you already have a solution that works with any NVMe-capable OS.

Agree. I think Hannes' approach to add dynamic quirks was the closest to
an upstreamable solution, as a general quirk for the PM177xx is not
acceptable. But I completely understand Christoph's NAK.

I think we should let HPE distros carry this quirk for drives where they
would not want to roll a FW update.



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-05-05  9:51                   ` Javier Gonzalez
@ 2025-05-05 11:11                     ` Christoph Hellwig
  2025-05-05 13:08                       ` Javier Gonzalez
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-05-05 11:11 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Keith Busch, Judy Brock, Christoph Hellwig,
	Ballard, Curtis C (HPE Storage), Hannes Reinecke, hare@kernel.org,
	Sagi Grimberg, wagi@lst.de, linux-nvme@lists.infradead.org

On Mon, May 05, 2025 at 11:51:39AM +0200, Javier Gonzalez wrote:
> I think we should let HPE distros carry this quirk for drives where they
> would not want to roll a FW update.

Or just goddamn people to upgrade the broken firmware.  Without it
their data is at risk, so they'd better do it.

Also maybe this is a lesson to SSD vendors (and I really mean all of
them) that if they can't push back on broken "features" due to market
dynamics they should at least OEM brand the devices in the identify
data so that the blame gets deflected to the right party.



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-05-05 11:11                     ` Christoph Hellwig
@ 2025-05-05 13:08                       ` Javier Gonzalez
  2025-05-05 13:49                         ` Laurence Oberman
  0 siblings, 1 reply; 13+ messages in thread
From: Javier Gonzalez @ 2025-05-05 13:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Judy Brock, Ballard, Curtis C (HPE Storage),
	Hannes Reinecke, hare@kernel.org, Sagi Grimberg, wagi@lst.de,
	linux-nvme@lists.infradead.org

On 05.05.2025 13:11, Christoph Hellwig wrote:
>On Mon, May 05, 2025 at 11:51:39AM +0200, Javier Gonzalez wrote:
>> I think we should let HPE distros carry this quirk for drives where they
>> would not want to roll a FW update.
>
>Or just goddamn people to upgrade the broken firmware.  Without it
>their data is at risk, so they'd better do it.
>
>Also maybe this is a lesson to SSD vendors (and I really mean all of
>them) that if they can't push back on broken "features" due to market
>dynamics they should at least OEM brand the devices in the identify
>data so that the blame gets deflected to the right party.

Agree. The dynamics of how OEMs want to apply FW updates is up to them,
but there is no doubt this has been a mess. Hope we have learned a
lesson...



* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-05-05 13:08                       ` Javier Gonzalez
@ 2025-05-05 13:49                         ` Laurence Oberman
  2025-05-06  7:07                           ` Javier Gonzalez
  0 siblings, 1 reply; 13+ messages in thread
From: Laurence Oberman @ 2025-05-05 13:49 UTC (permalink / raw)
  To: Javier Gonzalez, Christoph Hellwig
  Cc: Keith Busch, Judy Brock, Ballard, Curtis C (HPE Storage),
	Hannes Reinecke, hare@kernel.org, Sagi Grimberg, wagi@lst.de,
	linux-nvme@lists.infradead.org

On Mon, 2025-05-05 at 15:08 +0200, Javier Gonzalez wrote:
> On 05.05.2025 13:11, Christoph Hellwig wrote:
> > On Mon, May 05, 2025 at 11:51:39AM +0200, Javier Gonzalez wrote:
> > > I think we should let HPE distros carry this quirk for drives
> > > where they
> > > would not want to roll a FW update.
> > 
> > Or just goddamn people to upgrade the broken firmware.  Without it
> > their data is at risk, so they'd better do it.
> > 
> > Also maybe this is a lesson to SSD vendors (and I really mean all
> > of
> > them) that if they can't push back on broken "features" due to
> > market
> > dynamics they should at least OEM brand the devices in the identify
> > data so that the blame gets deflected to the right party.
> 
> Agree. The dynamics of how OEMs want to apply FW updates is up to
> them,
> but there is no doubt this has been a mess. Hope we have learned a
> lesson...
> 

Seems what I sent last week is the same issue.
For now we will fix this in a RHEL only kernel until the vendor gets
F/W fixes out.
I guess there are a lot of devices out in the wild already that have
this issue.

Thanks
Laurence




* Re: [PATCH 0/2] nvme: handle partially unique NID value
  2025-05-05 13:49                         ` Laurence Oberman
@ 2025-05-06  7:07                           ` Javier Gonzalez
  0 siblings, 0 replies; 13+ messages in thread
From: Javier Gonzalez @ 2025-05-06  7:07 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Christoph Hellwig, Keith Busch, Judy Brock,
	Ballard, Curtis C (HPE Storage), Hannes Reinecke, hare@kernel.org,
	Sagi Grimberg, wagi@lst.de, linux-nvme@lists.infradead.org

On 05.05.2025 09:49, Laurence Oberman wrote:
>On Mon, 2025-05-05 at 15:08 +0200, Javier Gonzalez wrote:
>> On 05.05.2025 13:11, Christoph Hellwig wrote:
>> > On Mon, May 05, 2025 at 11:51:39AM +0200, Javier Gonzalez wrote:
>> > > I think we should let HPE distros carry this quirk for drives
>> > > where they
>> > > would not want to roll a FW update.
>> >
>> > Or just goddamn people to upgrade the broken firmware.  Without it
>> > their data is at risk, so they'd better do it.
>> >
>> > Also maybe this is a lesson to SSD vendors (and I really mean all
>> > of
>> > them) that if they can't push back on broken "features" due to
>> > market
>> > dynamics they should at least OEM brand the devices in the identify
>> > data so that the blame gets deflected to the right party.
>>
>> Agree. The dynamics of how OEMs want to apply FW updates is up to
>> them,
>> but there is no doubt this has been a mess. Hope we have learned a
>> lesson...
>>
>
>Seems what I sent last week is the same issue.
>For now we will fix this in a RHEL only kernel until the vendor gets
>F/W fixes out.

This is great. Thanks for the support Laurence!

Curtis,

With SUSE and Red Hat picking up this quirk, is it enough on your end?



