* Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)
@ 2023-06-26 1:15 Bagas Sanjaya
[not found] ` <8f4e6d32-55b3-9df5-de91-1afbcb30ce13@nvidia.com>
2023-06-27 16:10 ` Keith Busch
0 siblings, 2 replies; 26+ messages in thread
From: Bagas Sanjaya @ 2023-06-26 1:15 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini,
Hannes Reinecke
Cc: Linux Kernel Mailing List, Linux Regressions, Linux NVMe
Hi,
I notice a regression report on Bugzilla [1]. Quoting from it:
> Hello,
> since upgrading one of my hosts to KErnel 6.2 i have a problem with 6 nvme disks (SAMSUNG MZ1WV480HCGL-000MV).
>
> This problem didn't occur with Kernel 5.15.
>
> Here is log from dmesg | grep nvme:
>
> nvme nvme0: pci function 0000:04:00.0
> nvme nvme1: pci function 0000:05:00.0
> nvme nvme2: pci function 0000:06:00.0
> nvme nvme3: pci function 0000:07:00.0
> nvme nvme4: pci function 0000:0a:00.0
> nvme nvme5: pci function 0000:0b:00.0
> nvme nvme0: 8/0/0 default/read/poll queues
> nvme nvme1: 8/0/0 default/read/poll queues
> nvme nvme2: 8/0/0 default/read/poll queues
> nvme nvme3: 8/0/0 default/read/poll queues
> nvme nvme4: 8/0/0 default/read/poll queues
> nvme nvme5: 8/0/0 default/read/poll queues
> nvme nvme1: globally duplicate IDs for nsid 1
> nvme nvme1: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme2: globally duplicate IDs for nsid 1
> nvme nvme2: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme4: globally duplicate IDs for nsid 1
> nvme nvme3: globally duplicate IDs for nsid 1
> nvme nvme4: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme3: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme5: globally duplicate IDs for nsid 1
> nvme nvme5: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme0n1: p1 p9
>
> Only one drive, out of 6 is usable.
>
> These drives are rather common datacenter drives (Samsung SM-953).
>
> From another Bug-Report (https://bugzilla.kernel.org/show_bug.cgi?id=217384):
>
> To fix it, please add two strings in pci.c
>
> { PCI_DEVICE(0x144d, 0xa802), /* SAMSUNG MZ1WV480HCGL-000MV */
> .driver_data = NVME_QUIRK_BOGUS_NID, },
>
> Thanks in advance.
See Bugzilla for the full thread.
The reporter had a quirk (see above) that fixed this regression,
nevertheless I'm adding it to regzbot to make sure it doesn't fall
through cracks unnoticed:
#regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
#regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217593
--
An old man doll... just what I always wanted! - Clara
^ permalink raw reply [flat|nested] 26+ messages in thread[parent not found: <8f4e6d32-55b3-9df5-de91-1afbcb30ce13@nvidia.com>]
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) [not found] ` <8f4e6d32-55b3-9df5-de91-1afbcb30ce13@nvidia.com> @ 2023-06-26 18:26 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 0 replies; 26+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-06-26 18:26 UTC (permalink / raw) To: Chaitanya Kulkarni, Bagas Sanjaya, Clemens S. Cc: Linux Kernel Mailing List, Linux Regressions, Linux NVMe, John Meneghini, Sagi Grimberg, Keith Busch, Christoph Hellwig, Hannes Reinecke, Martin Belanger On 26.06.23 17:39, Chaitanya Kulkarni wrote: > On 6/25/2023 6:15 PM, Bagas Sanjaya wrote: >> >> I notice a regression report on Bugzilla [1]. Quoting from it: > > [...] > >>> Only one drive, out of 6 is usable. >>> >>> These drives are rather common datacenter drives (Samsung SM-953). >>> >>> From another Bug-Report (https://bugzilla.kernel.org/show_bug.cgi?id=217384): >>> >>> To fix it, please add two strings in pci.c >>> >>> { PCI_DEVICE(0x144d, 0xa802), /* SAMSUNG MZ1WV480HCGL-000MV */ >>> .driver_data = NVME_QUIRK_BOGUS_NID, }, > > Please send a formal patch I'll be happy to review it .. Huh, sorry, but why? You might not know this, but Bagas is somebody that helps with regression tracking (many thx for this BTW), so he's not even affected by the problem -- and most likely is neither familiar with the code or has the hardware to test the quirk entry. Sure, asking others for help is fine. But from what I see this is something that should be handled by those that developed the culprit. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-06-26 1:15 Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) Bagas Sanjaya [not found] ` <8f4e6d32-55b3-9df5-de91-1afbcb30ce13@nvidia.com> @ 2023-06-27 16:10 ` Keith Busch [not found] ` <9b778d59-c8d5-b037-0eb7-34cee5f273cc@samsung.com> 2023-07-10 8:52 ` Linux regression tracking (Thorsten Leemhuis) 1 sibling, 2 replies; 26+ messages in thread From: Keith Busch @ 2023-06-27 16:10 UTC (permalink / raw) To: Bagas Sanjaya Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux Regressions, Linux NVMe, Pankaj Raghav, Kanchan Joshi On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote: > See Bugzilla for the full thread. > > The reporter had a quirk (see above) that fixed this regression, > nevertheless I'm adding it to regzbot to make sure it doesn't fall > through cracks unnoticed: > > #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593 > #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV These bug reports really should go to the vendors that created the broken device with non-unique "unique" fields. It's easy enough for me to write the quirk patch, but it's not the ideal solution and may harm devices/firmwares with the same VID:DID that don't have this problem. Not being the vendor, I'm not in a postition to know that, so CC'ing some Samsung folks. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <9b778d59-c8d5-b037-0eb7-34cee5f273cc@samsung.com>]
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) [not found] ` <9b778d59-c8d5-b037-0eb7-34cee5f273cc@samsung.com> @ 2023-06-30 15:18 ` Clemens Springsguth 2023-06-30 16:21 ` Pankaj Raghav 0 siblings, 1 reply; 26+ messages in thread From: Clemens Springsguth @ 2023-06-30 15:18 UTC (permalink / raw) To: Pankaj Raghav Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux Regressions, Linux NVMe, Kanchan Joshi, Jinhwan Park, Bagas Sanjaya, Javier Gonzalez Hi Pankaj, i would be very very grateful if you would create and submit the quirk as a proper patch. kind regards Clemens On Fri, Jun 30, 2023 at 10:40 AM Pankaj Raghav <p.raghav@samsung.com> wrote: > > On 2023-06-27 18:10, Keith Busch wrote: > > On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote: > >> See Bugzilla for the full thread. > >> > >> The reporter had a quirk (see above) that fixed this regression, > >> nevertheless I'm adding it to regzbot to make sure it doesn't fall > >> through cracks unnoticed: > >> > >> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593 > >> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV > > > > These bug reports really should go to the vendors that created the > > broken device with non-unique "unique" fields. It's easy enough for me > > to write the quirk patch, but it's not the ideal solution and may harm > > devices/firmwares with the same VID:DID that don't have this problem. > > Not being the vendor, I'm not in a postition to know that, so CC'ing > > some Samsung folks. > > I got a reply from our team internally. There is no plan on releasing > a new firmware for this model and the related models with the same VID:DID > pair. We can add a quirk. > > @Clemens: Let me know if you want me to send the quirk as a proper patch. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-06-30 15:18 ` Clemens Springsguth @ 2023-06-30 16:21 ` Pankaj Raghav 2023-07-03 13:25 ` Clemens Springsguth 0 siblings, 1 reply; 26+ messages in thread From: Pankaj Raghav @ 2023-06-30 16:21 UTC (permalink / raw) To: Clemens Springsguth Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux Regressions, Linux NVMe, Kanchan Joshi, Jinhwan Park, Bagas Sanjaya, Javier Gonzalez On 2023-06-30 17:18, Clemens Springsguth wrote: > Hi Pankaj, > > i would be very very grateful if you would create and submit the quirk > as a proper patch. > No problem. I will send something soon, but could you test it and give me a tested-by tag when I send the patch? I don't have that model at my disposal. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-06-30 16:21 ` Pankaj Raghav @ 2023-07-03 13:25 ` Clemens Springsguth 0 siblings, 0 replies; 26+ messages in thread From: Clemens Springsguth @ 2023-07-03 13:25 UTC (permalink / raw) To: Pankaj Raghav Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux Regressions, Linux NVMe, Kanchan Joshi, Jinhwan Park, Bagas Sanjaya, Javier Gonzalez On Fri, Jun 30, 2023 at 6:21 PM Pankaj Raghav <p.raghav@samsung.com> wrote: > > On 2023-06-30 17:18, Clemens Springsguth wrote: > > Hi Pankaj, > > > > i would be very very grateful if you would create and submit the quirk > > as a proper patch. > > > > No problem. I will send something soon, but could you test it and > give me a tested-by tag when I send the patch? I don't have that > model at my disposal. > Hi Pankaj, I am available for testing with pleasure. kind regards Clemens ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-06-27 16:10 ` Keith Busch [not found] ` <9b778d59-c8d5-b037-0eb7-34cee5f273cc@samsung.com> @ 2023-07-10 8:52 ` Linux regression tracking (Thorsten Leemhuis) 2023-07-10 15:58 ` Keith Busch 1 sibling, 1 reply; 26+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-10 8:52 UTC (permalink / raw) To: Keith Busch, Bagas Sanjaya Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux Regressions, Linux NVMe, Pankaj Raghav, Kanchan Joshi On 27.06.23 18:10, Keith Busch wrote: > On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote: >> See Bugzilla for the full thread. >> >> The reporter had a quirk (see above) that fixed this regression, >> nevertheless I'm adding it to regzbot to make sure it doesn't fall >> through cracks unnoticed: >> >> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593 >> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV > > These bug reports really should go to the vendors that created the > broken device with non-unique "unique" fields. I understand that, but I think we need middlemen for that, as I or Bagas don't have the contacts -- and it's IMHO also a bit much too ask us for in general, as regression tracking is hard enough already. At least unless this becomes something that happen regularly, then a list of persons we could contact would be fine I guess. But we simply can't deal with too many subsystem specific special cases. > It's easy enough for me > to write the quirk patch, but it's not the ideal solution and may harm > devices/firmwares with the same VID:DID that don't have this problem. > Not being the vendor, I'm not in a postition to know that, so CC'ing > some Samsung folks. Another request came in today, even with a pseudo-patch: https://bugzilla.kernel.org/show_bug.cgi?id=217649 To quote: ``` As with numerous NVMe controllers these days, Samsung's MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers from invalid IDs, breaking suspend and hibernate also on the latest kernel 6.4.2. The following change restores this functionality: File: root/drivers/nvme/host/pci.c Change: - { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */ - .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */ + .driver_data = NVME_QUIRK_BOGUS_NID | + NVME_QUIRK_DISABLE_WRITE_ZEROES, }, ``` Ciao, Thorsten ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-10 8:52 ` Linux regression tracking (Thorsten Leemhuis) @ 2023-07-10 15:58 ` Keith Busch [not found] ` <f0fdf86e-4293-8e07-835d-b5a866252068@samsung.com> 0 siblings, 1 reply; 26+ messages in thread From: Keith Busch @ 2023-07-10 15:58 UTC (permalink / raw) To: Linux regressions mailing list Cc: Bagas Sanjaya, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Pankaj Raghav, Kanchan Joshi On Mon, Jul 10, 2023 at 10:52:52AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > On 27.06.23 18:10, Keith Busch wrote: > > On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote: > >> See Bugzilla for the full thread. > >> > >> The reporter had a quirk (see above) that fixed this regression, > >> nevertheless I'm adding it to regzbot to make sure it doesn't fall > >> through cracks unnoticed: > >> > >> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593 > >> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV > > > > These bug reports really should go to the vendors that created the > > broken device with non-unique "unique" fields. > > I understand that, but I think we need middlemen for that, as I or Bagas > don't have the contacts -- and it's IMHO also a bit much too ask us for > in general, as regression tracking is hard enough already. At least > unless this becomes something that happen regularly, then a list of > persons we could contact would be fine I guess. But we simply can't deal > with too many subsystem specific special cases. I'm not asking the Linux regression trackers to fill that role, though. I'm asking people who experience these issues report it to their vendor directly because these device makers apparently have zero clue that their spec non-compliance is causing painful experiences for their customers and annoyance for maintainers. They keep pumping out more and more devices with the same breakage. This particular vendor has been great at engaging with Linux, but that's not necessarily normal among all device makers, and I don't have contacts with the majority of the vendors we've had to quirk for this issue. We did complain to the NVMe spec workgroup that their complaince cert suite is not testing for this. There was a little initial interest in fixing that gap, but it fizzled out... > Another request came in today, even with a pseudo-patch: > https://bugzilla.kernel.org/show_bug.cgi?id=217649 > > To quote: > ``` > As with numerous NVMe controllers these days, Samsung's > MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers > from invalid IDs, breaking suspend and hibernate also on the latest > kernel 6.4.2. > > The following change restores this functionality: > > File: root/drivers/nvme/host/pci.c > Change: > > - { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */ > - .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, > > + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */ > + .driver_data = NVME_QUIRK_BOGUS_NID | > + NVME_QUIRK_DISABLE_WRITE_ZEROES, }, Panjaj, okay with this one too? ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <f0fdf86e-4293-8e07-835d-b5a866252068@samsung.com>]
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) [not found] ` <f0fdf86e-4293-8e07-835d-b5a866252068@samsung.com> @ 2023-07-11 9:39 ` Linux regression tracking (Thorsten Leemhuis) 2023-07-11 12:06 ` Christoph Hellwig 2023-07-16 19:19 ` August Wikerfors 1 sibling, 1 reply; 26+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-11 9:39 UTC (permalink / raw) To: Pankaj Raghav, Keith Busch, Linux regressions mailing list Cc: Bagas Sanjaya, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds [CCing Linus for the "whack a mole" aspect in the second half] On 11.07.23 08:54, Pankaj Raghav wrote: >>> I understand that, but I think we need middlemen for that, as I or Bagas >>> don't have the contacts -- and it's IMHO also a bit much too ask us for >>> in general, as regression tracking is hard enough already. At least >>> unless this becomes something that happen regularly, then a list of >>> persons we could contact would be fine I guess. But we simply can't deal >>> with too many subsystem specific special cases. >> >> I'm not asking the Linux regression trackers to fill that role, though. Well, during our work we often encounter those bugs -- often from people that are no regular developers that already had a hard time understanding the issue and reporting it to us somehow. Asking those to... >> I'm asking people who experience these issues report it to their vendor ...find the right destination and format to report their Linux problems to the vendors is unlikely to fly I suspect. And I'm not sure if that is in our interest, as then it might take a lot longer to get those quirk entries into the kernel source. But whatever, the main reason why I write this mail is different: >> directly because these device makers apparently have zero clue that >> their spec non-compliance is causing painful experiences for their >> customers and annoyance for maintainers. They keep pumping out more and >> more devices with the same breakage. >> >> This particular vendor has been great at engaging with Linux, but that's >> not necessarily normal among all device makers, and I don't have >> contacts with the majority of the vendors we've had to quirk for this >> issue. >> >> We did complain to the NVMe spec workgroup that their complaince cert >> suite is not testing for this. There was a little initial interest in >> fixing that gap, but it fizzled out... Preface: this is not my area of expertise, and maybe I should keep my mouth shut. But whatever. Well, that "They keep pumping out more and more devices with the same breakage" and the "new device" comment from Pankaj below bear the question: should we stop trying to play "whack a mole" with all those quirk entries and handle devices with duplicate ids just like Windows does? That would "make things just work"(tm). And yes, I suspect there are good reasons why we went down the "quirk" route or why abandoning it might be hard. But maybe it's time to reconsider that path, as from my outside point of view things sound a lot like they are somewhat similar to the ACPI problems we dealt with ~15 years ago: we learned that we have to deal with broken ACPI implementations and somehow use them in a way similar to how Windows uses them, as that's the OS the machine was designed for and tested with. Ciao, Thorsten >>> Another request came in today, even with a pseudo-patch: >>> https://bugzilla.kernel.org/show_bug.cgi?id=217649 >>> >>> To quote: >>> ``` >>> As with numerous NVMe controllers these days, Samsung's >>> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers >>> from invalid IDs, breaking suspend and hibernate also on the latest >>> kernel 6.4.2. > [...] >> Panjaj, okay with this one too? > > This looks a like a new device that might have a firmware update. I will ping > internally first. > > As you mentioned, the recent addition of globally unique ID check > is breaking a lot of devices because of non-compliant firmware. I will try to create > some awareness about this issue internally as well. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 9:39 ` Linux regression tracking (Thorsten Leemhuis) @ 2023-07-11 12:06 ` Christoph Hellwig 2023-07-11 12:14 ` Sagi Grimberg ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Christoph Hellwig @ 2023-07-11 12:06 UTC (permalink / raw) To: Linux regressions mailing list Cc: Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds On Tue, Jul 11, 2023 at 11:39:11AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > Well, that "They keep pumping out more and more devices with the same > breakage" and the "new device" comment from Pankaj below bear the > question: should we stop trying to play "whack a mole" with all those > quirk entries and handle devices with duplicate ids just like Windows does? As far as I can tell Windows completely ignores the IDs. Which, looking back, I'd love to be able to do as well, but they are already used by udev for the /dev/disk/by-id/ links. Those are usually not used on desktop systems, as they use the file system labels and UUIDs, but that doesn't work for non-file system uses. And all this has been working really well with the good old enterprise SSDs, it's just that the cheap consumer devices keep fucking it up. If we'd take it away now we'd break existing users, which puts us between a rock and a hard place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 12:06 ` Christoph Hellwig @ 2023-07-11 12:14 ` Sagi Grimberg 2023-07-11 14:12 ` Greg KH 2023-07-11 15:33 ` Keith Busch 2023-07-11 16:47 ` Linus Torvalds 2 siblings, 1 reply; 26+ messages in thread From: Sagi Grimberg @ 2023-07-11 12:14 UTC (permalink / raw) To: Christoph Hellwig, Linux regressions mailing list Cc: Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds >> Well, that "They keep pumping out more and more devices with the same >> breakage" and the "new device" comment from Pankaj below bear the >> question: should we stop trying to play "whack a mole" with all those >> quirk entries and handle devices with duplicate ids just like Windows does? > > As far as I can tell Windows completely ignores the IDs. Which, looking > back, I'd love to be able to do as well, but they are already used > by udev for the /dev/disk/by-id/ links. Those are usually not used > on desktop systems, as they use the file system labels and UUIDs, but > that doesn't work for non-file system uses. > > And all this has been working really well with the good old enterprise > SSDs, it's just that the cheap consumer devices keep fucking it up. > > If we'd take it away now we'd break existing users, which puts us between > a rock and a hard place. Maybe the compromise would be to add a modparam that tells the driver to ignore it altogether (like allow_bogus_identifiers) that would default to false. Then people can just workaround the problem instead of having the back-and-fourth with the vendor? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 12:14 ` Sagi Grimberg @ 2023-07-11 14:12 ` Greg KH 2023-07-11 14:39 ` Sagi Grimberg 0 siblings, 1 reply; 26+ messages in thread From: Greg KH @ 2023-07-11 14:12 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds On Tue, Jul 11, 2023 at 03:14:54PM +0300, Sagi Grimberg wrote: > > > > Well, that "They keep pumping out more and more devices with the same > > > breakage" and the "new device" comment from Pankaj below bear the > > > question: should we stop trying to play "whack a mole" with all those > > > quirk entries and handle devices with duplicate ids just like Windows does? > > > > As far as I can tell Windows completely ignores the IDs. Which, looking > > back, I'd love to be able to do as well, but they are already used > > by udev for the /dev/disk/by-id/ links. Those are usually not used > > on desktop systems, as they use the file system labels and UUIDs, but > > that doesn't work for non-file system uses. > > > > And all this has been working really well with the good old enterprise > > SSDs, it's just that the cheap consumer devices keep fucking it up. > > > > If we'd take it away now we'd break existing users, which puts us between > > a rock and a hard place. > > Maybe the compromise would be to add a modparam that tells the driver > to ignore it altogether (like allow_bogus_identifiers) that would > default to false. Then people can just workaround the problem instead > of having the back-and-fourth with the vendor? > Module parameters do not work on a per-device basis, sorry. This isn't the 1990's anymore, please do not attempt to add new ones :) thanks, greg k-h ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 14:12 ` Greg KH @ 2023-07-11 14:39 ` Sagi Grimberg 0 siblings, 0 replies; 26+ messages in thread From: Sagi Grimberg @ 2023-07-11 14:39 UTC (permalink / raw) To: Greg KH Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds >>>> Well, that "They keep pumping out more and more devices with the same >>>> breakage" and the "new device" comment from Pankaj below bear the >>>> question: should we stop trying to play "whack a mole" with all those >>>> quirk entries and handle devices with duplicate ids just like Windows does? >>> >>> As far as I can tell Windows completely ignores the IDs. Which, looking >>> back, I'd love to be able to do as well, but they are already used >>> by udev for the /dev/disk/by-id/ links. Those are usually not used >>> on desktop systems, as they use the file system labels and UUIDs, but >>> that doesn't work for non-file system uses. >>> >>> And all this has been working really well with the good old enterprise >>> SSDs, it's just that the cheap consumer devices keep fucking it up. >>> >>> If we'd take it away now we'd break existing users, which puts us between >>> a rock and a hard place. >> >> Maybe the compromise would be to add a modparam that tells the driver >> to ignore it altogether (like allow_bogus_identifiers) that would >> default to false. Then people can just workaround the problem instead >> of having the back-and-fourth with the vendor? >> > > Module parameters do not work on a per-device basis, sorry. This isn't > the 1990's anymore, please do not attempt to add new ones :) Don't get me wrong, I don't like adding this. But the source of this is that there are simply too many breakages of non-compliant consumer drives out there that maybe a compromise would be "globally relax compliance check in this specific area" as a workaround. Right now each time this issue is seen in the wild, the only resolution is either the vendor fixing it, or the driver adds a quirk, which is positive and exactly what we want. But more and more users complain, and there is no immediate workaround. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 12:06 ` Christoph Hellwig 2023-07-11 12:14 ` Sagi Grimberg @ 2023-07-11 15:33 ` Keith Busch 2023-07-11 16:47 ` Linus Torvalds 2 siblings, 0 replies; 26+ messages in thread From: Keith Busch @ 2023-07-11 15:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Pankaj Raghav, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환, Linus Torvalds On Tue, Jul 11, 2023 at 02:06:09PM +0200, Christoph Hellwig wrote: > On Tue, Jul 11, 2023 at 11:39:11AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > > Well, that "They keep pumping out more and more devices with the same > > breakage" and the "new device" comment from Pankaj below bear the > > question: should we stop trying to play "whack a mole" with all those > > quirk entries and handle devices with duplicate ids just like Windows does? > > As far as I can tell Windows completely ignores the IDs. Which, looking > back, I'd love to be able to do as well, but they are already used > by udev for the /dev/disk/by-id/ links. Those are usually not used > on desktop systems, as they use the file system labels and UUIDs, but > that doesn't work for non-file system uses. It's also an inescapable requirement for supporting multipath, and a number of the devices being quirked bizarrely also report CMIC/NMIC capabilities. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 12:06 ` Christoph Hellwig 2023-07-11 12:14 ` Sagi Grimberg 2023-07-11 15:33 ` Keith Busch @ 2023-07-11 16:47 ` Linus Torvalds 2023-07-11 17:21 ` Keith Busch 2023-07-12 16:45 ` Christoph Hellwig 2 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2023-07-11 16:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <hch@lst.de> wrote: > > As far as I can tell Windows completely ignores the IDs. Which, looking > back, I'd love to be able to do as well, but they are already used > by udev for the /dev/disk/by-id/ links. Those are usually not used > on desktop systems, as they use the file system labels and UUIDs, but > that doesn't work for non-file system uses. The thing is, the nvme code seems to actively do completely stuipid things in this area. > And all this has been working really well with the good old enterprise > SSDs, it's just that the cheap consumer devices keep f*cking it up. Christoph, deal with reality, not with what you think things should look like. Anybody who expected unique ID's is frankly completely incompetent. People should have *known* not to do this. "Those Who Do Not Learn History Are Doomed To Repeat It" - Santayana and we have NEVER EVER seen devices with reliably unique IDs. Really. We've had these uuid's before (ask Greg about USB devices one day, and that was *recent*). We've always known that vendors will fill in a fixed value, and somebody still decided to make this a correctness issue? Christoph, don't blame vendors. Somebody did indeed f*ck up. But it was you. > If we'd take it away now we'd break existing users, which puts us between > a rock and a hard place. Well, here's a suggestion: stop making it worse. For example, we have this completely unacceptable garbage: ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids); if (ret) { dev_err(ctrl->device, "globally duplicate IDs for nsid %d\n", info->nsid); nvme_print_device_info(ctrl); return ret; } iow, the code even checks for and *notices* that there are duplicate IDs, and what does it do? It then errors out. Then expecting people TO WAIT FOR A NEW KERNEL VERSION when you noticed something wrong? What an absolute crock. So stop blaming anybody else. I think the code should *default* to "unreliable uuid", and then if you're sure it's actually ok, then you use it. Then some rare enterprise user with multipathing - who is going to be very very careful about which device to use anyway - can use the "approved list". Or "Oh, I noticed a non-unique UUID, let me generate one for you based on physical location". But this "my disk doesn't work in v6.0 and later because some clown added a duplicate check that shouldn't be there" is not a good thing to then try to make excuses for. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 16:47 ` Linus Torvalds @ 2023-07-11 17:21 ` Keith Busch 2023-07-11 22:07 ` John Meneghini 2023-07-12 7:38 ` Sagi Grimberg 2023-07-12 16:45 ` Christoph Hellwig 1 sibling, 2 replies; 26+ messages in thread From: Keith Busch @ 2023-07-11 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote: > On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <hch@lst.de> wrote: > For example, we have this completely unacceptable garbage: > > ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids); > if (ret) { > dev_err(ctrl->device, > "globally duplicate IDs for nsid %d\n", info->nsid); > nvme_print_device_info(ctrl); > return ret; > } > > iow, the code even checks for and *notices* that there are duplicate > IDs, and what does it do? It then errors out. This check came from a recent half-baked spec feature called "Dispersed Namespaces" that caused breakage and data corruption when used in Linux. Rather than attempt to support that mostly vendor specific feature, the driver attempted to fence that off as unmaintainable. This check wasn't aimed at enforcing "correctness", but it certainly found a lot of that as collatoral damage. Let's see if we can find a better way to detect the difference with a sane fallback as you suggest. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 17:21 ` Keith Busch @ 2023-07-11 22:07 ` John Meneghini 2023-07-12 7:38 ` Sagi Grimberg 1 sibling, 0 replies; 26+ messages in thread From: John Meneghini @ 2023-07-11 22:07 UTC (permalink / raw) To: Keith Busch, Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 Yes, this is what I thought. This is all the result of the duplicate NID check added to deal with TP4034 Dispersed Namespaces. One suggestion I have would be to limit this check to nvme-of subsystems only. These are the only devices I am aware of out there which support TP4034. Moreover, all nvme-of devices report a valid NID. It's required with NVMe Over Fabrics. The PCIe devices, I expect, don't care. You don't really need a valid NID with a private namespace - which is what most PCIe devices are. I'll wager that if you change nvme_global_check_duplicate_ids() to check only nvme-of subsystems, and simply continue with PCIe subsystems, 90% of these nvme quirks can be removed. John Meneghini Senior Principal Platform Storage Engineer RHEL SST - Platform Storage Group jmeneghi@redhat.com On 7/11/23 13:21, Keith Busch wrote: > On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote: >> On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <hch@lst.de> wrote: >> For example, we have this completely unacceptable garbage: >> >> ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids); >> if (ret) { >> dev_err(ctrl->device, >> "globally duplicate IDs for nsid %d\n", info->nsid); >> nvme_print_device_info(ctrl); >> return ret; >> } >> >> iow, the code even checks for and *notices* that there are duplicate >> IDs, and what does it do? It then errors out. > > This check came from a recent half-baked spec feature called "Dispersed > Namespaces" that caused breakage and data corruption when used in Linux. > Rather than attempt to support that mostly vendor specific feature, the > driver attempted to fence that off as unmaintainable. This check wasn't > aimed at enforcing "correctness", but it certainly found a lot of that > as collatoral damage. Let's see if we can find a better way to detect > the difference with a sane fallback as you suggest. > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 17:21 ` Keith Busch 2023-07-11 22:07 ` John Meneghini @ 2023-07-12 7:38 ` Sagi Grimberg 2023-07-12 16:48 ` Linus Torvalds 1 sibling, 1 reply; 26+ messages in thread From: Sagi Grimberg @ 2023-07-12 7:38 UTC (permalink / raw) To: Keith Busch, Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Bagas Sanjaya, Jens Axboe, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 >> For example, we have this completely unacceptable garbage: >> >> ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids); >> if (ret) { >> dev_err(ctrl->device, >> "globally duplicate IDs for nsid %d\n", info->nsid); >> nvme_print_device_info(ctrl); >> return ret; >> } >> >> iow, the code even checks for and *notices* that there are duplicate >> IDs, and what does it do? It then errors out. > > This check came from a recent half-baked spec feature called "Dispersed > Namespaces" that caused breakage and data corruption when used in Linux. > Rather than attempt to support that mostly vendor specific feature, the > driver attempted to fence that off as unmaintainable. This check wasn't > aimed at enforcing "correctness", but it certainly found a lot of that > as collatoral damage. Let's see if we can find a better way to detect > the difference with a sane fallback as you suggest. Perhaps we could fallback to what we do in wwid_show()? "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id, serial_len, subsys->serial, model_len, subsys->model, head->ns_id) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-12 7:38 ` Sagi Grimberg @ 2023-07-12 16:48 ` Linus Torvalds 0 siblings, 0 replies; 26+ messages in thread From: Linus Torvalds @ 2023-07-12 16:48 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Bagas Sanjaya, Jens Axboe, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Wed, 12 Jul 2023 at 00:38, Sagi Grimberg <sagi@grimberg.me> wrote: > > Perhaps we could fallback to what we do in wwid_show()? > > "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id, serial_len, > subsys->serial, model_len, subsys->model, head->ns_id) That's not going to be any more unique. Yes, there is *slightly* more hope that a simpler "serial number" might be different between devices, but honestly, when I told Christoph to learn from history and not assume UUID's are unique for hardware, that "history" is literally decades of serial numbers. Why do you think things like dm/md generate the UUID and literally write it to disk? It's because those unique serial numbers have never ever existed. And absolutely NONE OF THIS IS NEW. This is not a nvme issue. SCSI disks allegedly has a VPD page 0x80 with a serial number. Not only won't many devices report that VPD in the first place, even if they do there's no guarantee that it's actually unique. And USB devices are *supposed* to have unique device serial numbers ("iSerial"). I think it's literally been in the spec since day#1. And some even do. Quite a lot absolutely do NOT. It's just fundamentally cheaper for hardware manufacturers that do mass manufacturing to do completely identical pieces of hardware. Serial numbers and UUID's are _fundamentally_ hard and expensive at some level. If you sell overpriced garbage to enterprise users, then a serial number or UUID will most definitely be part of the package. Because that is _literally_ what "Enterprise hardware" means: "Overpriced garbage with a provenance and somebody else to blame". But if you sell to the mass market, and aren't going for people who pay extra for provenance, a serial number or a UUID is pure overhead. Sometimes it's easy enough to do, and people do it. But if you *rely* on it, *you* are the problem. Not the vendor that was churning out a million devices for real users, and that was spending all the effort on just making it work. Basically, I think all of these things are purely for fabrics and the nvme target side. Expecting serial numbers from a PCIe device is pure fantasy. And, more importantly, it's stupid and wrong. Exactly because we've been here before. Don't repeat mistakes from the past. By all means _report_ things like serial numbers. They can be very useful for tracking. But never *ever* rely on them for basic functionality. People absolutely will have two identical devices from the same batch, and unless there is some actual hardware compliance verification, they will not have unique serial numbers. In many cases, the best you can hope for is that the duplicate ones are *obviously* duplicate (ie "all zeroes" and similar). But you shouldn't expect or depend on even that to be the case. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-11 16:47 ` Linus Torvalds 2023-07-11 17:21 ` Keith Busch @ 2023-07-12 16:45 ` Christoph Hellwig 2023-07-12 16:51 ` Linus Torvalds 1 sibling, 1 reply; 26+ messages in thread From: Christoph Hellwig @ 2023-07-12 16:45 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote: > Anybody who expected unique ID's is frankly completely incompetent. > People should have *known* not to do this. Except that storage software really does rely on them. Not your laptop (or mine) but all kinds of data center software. Where these IDs have worked fine for decades. It just turns out nvme has the misfortune of trying to deal with both that and cheap consumer crap. > and we have NEVER EVER seen devices with reliably unique IDs. Really. Sorry, but that's bullshit. > iow, the code even checks for and *notices* that there are duplicate > IDs, and what does it do? It then errors out. Yes. Because we have applications which will lose data if they suddently get the wrong device. > I think the code should *default* to "unreliable uuid", and then if > you're sure it's actually ok, then you use it. Then some rare > enterprise user with multipathing - who is going to be very very > careful about which device to use anyway - can use the "approved > list". That doesn't scale either. > Or "Oh, I noticed a non-unique UUID, let me generate one for you based > on physical location". > > But this "my disk doesn't work in v6.0 and later because some clown > added a duplicate check that shouldn't be there" is not a good thing > to then try to make excuses for. Well, let's try something like this (co-developed with Sagi) that allows it for non-multipath PCIe devices, hich covers the quirks we've added so far, but also add a big fat warning, because we know people rely on the /dev/disk/by-id/ links. Probably not on these devices, but who knows. diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 47d7ba2827ff29..37b6fa74666204 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3431,10 +3431,40 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info) ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids); if (ret) { - dev_err(ctrl->device, - "globally duplicate IDs for nsid %d\n", info->nsid); + /* + * We've found two different namespaces on two different + * subsystems that report the same ID. This is pretty nasty + * for anything that actually requires unique device + * identification. In the kernel we need this for multipathing, + * and in user space the /dev/disk/by-id/ links rely on it. + * + * If the device also claims to be multi-path capable back off + * here now and refuse the probe the second device as this is a + * recipe for data corruption. If not this is probably a + * cheap consumer device if on the PCIe bus, so let the user + * proceed and use the shiny toy, but warn that with changing + * probing order (which due to our async probing could just be + * device taking longer to startup) the other device could show + * up at any time. + */ nvme_print_device_info(ctrl); - return ret; + if ((ns->ctrl->ops->flags & NVME_F_FABRICS) || /* !PCIe */ + ((ns->ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) && + info->is_shared)) { + dev_err(ctrl->device, + "ignoring nsid %d because of duplicate IDs\n", + info->nsid); + return ret; + } + + dev_err(ctrl->device, + "clearing duplicate IDs for nsid %d\n", info->nsid); + dev_err(ctrl->device, + "use of /dev/disk/by-id/ may cause data corruption\n"); + memset(&info->ids.nguid, 0, sizeof(info->ids.nguid)); + memset(&info->ids.uuid, 0, sizeof(info->ids.uuid)); + memset(&info->ids.eui64, 0, sizeof(info->ids.eui64)); + ctrl->quirks |= NVME_QUIRK_BOGUS_NID; } mutex_lock(&ctrl->subsys->lock); ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-12 16:45 ` Christoph Hellwig @ 2023-07-12 16:51 ` Linus Torvalds 2023-07-12 16:57 ` Christoph Hellwig 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2023-07-12 16:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Wed, 12 Jul 2023 at 09:45, Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote: > > > and we have NEVER EVER seen devices with reliably unique IDs. Really. > > Sorry, but that's bullshit. Christoph, you are *literally* involved in a discussion where this is the case. What you call "bullshit" is what everybody else here calls REALITY. The fact that *some* devices have serial numbers in no way implies that all of them do. And if you think that they all do, you are the problem. Literally in this case. So you had better really internalize this "there are no reliable serial numbers in general". Because it's simply a FACT. > Well, let's try something like this (co-developed with Sagi) that > allows it for non-multipath PCIe devices, which covers the quirks > we've added so far, but also add a big fat warning, because we know > people rely on the /dev/disk/by-id/ links. Probably not on these > devices, but who knows. Looks sane to me. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-12 16:51 ` Linus Torvalds @ 2023-07-12 16:57 ` Christoph Hellwig 2023-07-12 17:34 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Christoph Hellwig @ 2023-07-12 16:57 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Wed, Jul 12, 2023 at 09:51:52AM -0700, Linus Torvalds wrote: > On Wed, 12 Jul 2023 at 09:45, Christoph Hellwig <hch@lst.de> wrote: > > > > On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote: > > > > > and we have NEVER EVER seen devices with reliably unique IDs. Really. > > > > Sorry, but that's bullshit. > > Christoph, you are *literally* involved in a discussion where this is the case. Yes. But that's not never ever. Enterprise storage devices have been pretty very good at it, because it is part of the purchase specs. So don't claim NEVER EVER, which is just BS. Claim we've been way to optimistic in that even cheap devices would get something so basic, and you're spot on. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-12 16:57 ` Christoph Hellwig @ 2023-07-12 17:34 ` Linus Torvalds 2023-07-13 11:44 ` Christoph Hellwig 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2023-07-12 17:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Wed, 12 Jul 2023 at 09:57, Christoph Hellwig <hch@lst.de> wrote: > > On Wed, Jul 12, 2023 at 09:51:52AM -0700, Linus Torvalds wrote: > > Christoph, you are *literally* involved in a discussion where this is the case. > > Yes. But that's not never ever. Enterprise storage devices have been > pretty very good at it, because it is part of the purchase specs. Your basic logic is lacking. "Humans are animals" and "Some animals have four legs" are both undeniabvly true. That still does *not* mean "Humans have four legs". And yet hat is *literally* the kind of nonsense you are spouting. I very much say that "some devices will not have UUID's". That's a *FACT*. You then say "Some devices have UUID's, so all devices must have UUIDs". That's *STUPID*. Can you really not tell the difference? Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-12 17:34 ` Linus Torvalds @ 2023-07-13 11:44 ` Christoph Hellwig 2023-07-13 16:53 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Christoph Hellwig @ 2023-07-13 11:44 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Wed, Jul 12, 2023 at 10:34:05AM -0700, Linus Torvalds wrote: > That's a *FACT*. > > You then say "Some devices have UUID's, so all devices must have UUIDs". Please stop putting words in my mouth. Maybe instead of shouting it would help to actually read the text? You said we've never seen devices with reliably IDs, which simply isn't true. And that doesn't mean I'm saying all devices have reliably IDs, which somehow you're not trying to make me say. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) 2023-07-13 11:44 ` Christoph Hellwig @ 2023-07-13 16:53 ` Linus Torvalds 0 siblings, 0 replies; 26+ messages in thread From: Linus Torvalds @ 2023-07-13 16:53 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Pankaj Raghav, Keith Busch, Bagas Sanjaya, Jens Axboe, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On Thu, 13 Jul 2023 at 04:44, Christoph Hellwig <hch@lst.de> wrote: > > Please stop putting words in my mouth. Maybe instead of shouting it > would help to actually read the text? I have read the text. You spout unbelievable garbage. Even in this email: > You said we've never seen devices with reliably IDs, which simply isn't > true. And that doesn't mean I'm saying all devices have reliably IDs, > which somehow you're not trying to make me say. When I say "devices", I obviously mean in general. You seem to argue against it. OF COURSE a _single_ device that has shown to have a UUID is likely to then repeat that UUID reliably. But that doesn't mean it's true in general. The fact is, any driver subsystem that deals with more than one device manufacturer had better not rely on any serial number for uniqueness. If some device layer does, and there is a device without serial numbers, then the bug is squarely on the driver, not the device. Because the driver author clearly didn't think things through. THAT is what I mean by "devices do not EVER reliably have uuids". I just checked. I have on my desk access to three machines with NVMe drives. Of them (a) one didn't have any UUID at all, didn't report namespaces and the "wwid" exposed in /proc is that return sysfs_emit(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id, serial_len, subsys->serial, model_len, subsys->model, head->ns_id); thing. (b) the other two had a NULL uuid and nguid, but seem to have a eui64, whatever the hell that is. So just from my limited test, one in three didn't have any UUID at all. Deal with it. Stop claiming it's somehow "reliable". Two out of three out of a random selection is not "reliable". And it has never been reliable, judging by those quirks. When we have manufacturers like Samsung, Phison, ADATA, Lexar, Sandisk, Micron, and Kingston mentioned in the quirk list, we probably have covered most of the consumer SSD manufacturers. It's a sad that NVMe - that started out as a "lean mean low-overhead disk interface definition" results in this kind of mess. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) [not found] ` <f0fdf86e-4293-8e07-835d-b5a866252068@samsung.com> 2023-07-11 9:39 ` Linux regression tracking (Thorsten Leemhuis) @ 2023-07-16 19:19 ` August Wikerfors 1 sibling, 0 replies; 26+ messages in thread From: August Wikerfors @ 2023-07-16 19:19 UTC (permalink / raw) To: Pankaj Raghav Cc: Keith Busch, Linux regressions mailing list, Bagas Sanjaya, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Clemens S., Martin Belanger, Chaitanya Kulkarni, John Meneghini, Hannes Reinecke, Linux Kernel Mailing List, Linux NVMe, Kanchan Joshi, Javier Gonzalez, 박진환 On 2023-07-11 08:54, Pankaj Raghav wrote: >>> Another request came in today, even with a pseudo-patch: >>> https://bugzilla.kernel.org/show_bug.cgi?id=217649 >>> >>> To quote: >>> ``` >>> As with numerous NVMe controllers these days, Samsung's >>> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers >>> from invalid IDs, breaking suspend and hibernate also on the latest >>> kernel 6.4.2. >>> >>> The following change restores this functionality: >>> >>> File: root/drivers/nvme/host/pci.c >>> Change: >>> >>> - { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */ >>> - .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, >>> >>> + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */ >>> + .driver_data = NVME_QUIRK_BOGUS_NID | >>> + NVME_QUIRK_DISABLE_WRITE_ZEROES, }, >> >> Panjaj, okay with this one too? > > This looks a like a new device that might have a firmware update. I will ping > internally first. (Note that this is a different issue from the regression in this thread) I reported this back in November [1] and was told by Kanchan Joshi at Samsung that it was fixed in new firmware [2]. Lenovo was also contacted and said they were working on the update in December [3]. I'm not sure what happened then, but in March, Mark Pearson at Lenovo wrote [4]: > I'm stuck on this one - the FW team reached out to Samsung to see if > there were fixes that we should be picking up and Samsung reported back > that there are no Linux issues reported against this part :( The release process then seems to have started over [5] and the latest update as of May is that the update is supposed to be released this month. [1] https://lore.kernel.org/all/20221116171727.4083-1-git@augustwikerfors.se/ [2] https://lore.kernel.org/all/20221206055928.GB24451@test-zns/ [3] https://github.com/tomsom/yoga-linux/issues/9#issuecomment-1368013293 [4] https://github.com/fwupd/firmware-lenovo/issues/308#issuecomment-1466631468 [5] https://forums.lenovo.com/topic/findpost/27/5196929/5984302 ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2023-07-16 19:25 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-26 1:15 Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD) Bagas Sanjaya
[not found] ` <8f4e6d32-55b3-9df5-de91-1afbcb30ce13@nvidia.com>
2023-06-26 18:26 ` Linux regression tracking (Thorsten Leemhuis)
2023-06-27 16:10 ` Keith Busch
[not found] ` <9b778d59-c8d5-b037-0eb7-34cee5f273cc@samsung.com>
2023-06-30 15:18 ` Clemens Springsguth
2023-06-30 16:21 ` Pankaj Raghav
2023-07-03 13:25 ` Clemens Springsguth
2023-07-10 8:52 ` Linux regression tracking (Thorsten Leemhuis)
2023-07-10 15:58 ` Keith Busch
[not found] ` <f0fdf86e-4293-8e07-835d-b5a866252068@samsung.com>
2023-07-11 9:39 ` Linux regression tracking (Thorsten Leemhuis)
2023-07-11 12:06 ` Christoph Hellwig
2023-07-11 12:14 ` Sagi Grimberg
2023-07-11 14:12 ` Greg KH
2023-07-11 14:39 ` Sagi Grimberg
2023-07-11 15:33 ` Keith Busch
2023-07-11 16:47 ` Linus Torvalds
2023-07-11 17:21 ` Keith Busch
2023-07-11 22:07 ` John Meneghini
2023-07-12 7:38 ` Sagi Grimberg
2023-07-12 16:48 ` Linus Torvalds
2023-07-12 16:45 ` Christoph Hellwig
2023-07-12 16:51 ` Linus Torvalds
2023-07-12 16:57 ` Christoph Hellwig
2023-07-12 17:34 ` Linus Torvalds
2023-07-13 11:44 ` Christoph Hellwig
2023-07-13 16:53 ` Linus Torvalds
2023-07-16 19:19 ` August Wikerfors
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox