* [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
@ 2024-11-12 19:50 Bob Beckett
2024-11-12 20:47 ` Keith Busch
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Bob Beckett @ 2024-11-12 19:50 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: kernel, Robert Beckett, linux-nvme, linux-kernel
From: Robert Beckett <bob.beckett@collabora.com>
We initially put in a quick fix of limiting the queue depth to 1
as experimentation showed that it fixed data corruption on 64GB
steamdecks.
After further experimentation, it appears that the corruption
is fixed by aligning the small dma pool segments to 512 bytes.
Testing via desync image verification shows that it now passes
thousands of verification loops, where previously
it never managed above 7.
Currently it is not known why this fixes the corruption.
Perhaps the device is doing something nasty like using an mmc page
as a cache for the prp lists (mmc min. page size is 512 bytes)
and not invalidating it properly, so that the dma pool change to
treat the segment list as a stack ends up handing back a previous
segment in the same cached page.
This supersedes the previous queue depth limitation, as it fixes
the corruption without incurring the tested 37% performance
degradation.
Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
---
drivers/nvme/host/nvme.h | 5 +++++
drivers/nvme/host/pci.c | 6 +++---
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536..61bba5513de0 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -173,6 +173,11 @@ enum nvme_quirks {
* MSI (but not MSI-X) interrupts are broken and never fire.
*/
NVME_QUIRK_BROKEN_MSI = (1 << 21),
+
+ /*
+ * Align dma pool segment size to 512 bytes
+ */
+ NVME_QUIRK_DMAPOOL_ALIGN_512 = (1 << 22),
};
/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b9fda0b1d9a..6fcd3bb413c4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2700,8 +2700,8 @@ static int nvme_setup_prp_pools(struct nvme_dev *dev)
return -ENOMEM;
/* Optimisation for I/Os between 4k and 128k */
- dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
- 256, 256, 0);
+ dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,256,
+ dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512 ? 512 : 256, 0);
if (!dev->prp_small_pool) {
dma_pool_destroy(dev->prp_page_pool);
return -ENOMEM;
@@ -3449,7 +3449,7 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_VDEVICE(REDHAT, 0x0010), /* Qemu emulated controller */
.driver_data = NVME_QUIRK_BOGUS_NID, },
{ PCI_DEVICE(0x1217, 0x8760), /* O2 Micro 64GB Steam Deck */
- .driver_data = NVME_QUIRK_QDEPTH_ONE },
+ .driver_data = NVME_QUIRK_DMAPOOL_ALIGN_512, },
{ PCI_DEVICE(0x126f, 0x2262), /* Silicon Motion generic */
.driver_data = NVME_QUIRK_NO_DEEPEST_PS |
NVME_QUIRK_BOGUS_NID, },
--
2.45.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
@ 2024-11-12 20:47 ` Keith Busch
2024-11-13 4:31 ` Christoph Hellwig
` (3 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-11-12 20:47 UTC (permalink / raw)
To: Bob Beckett
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> @@ -2700,8 +2700,8 @@ static int nvme_setup_prp_pools(struct nvme_dev *dev)
> return -ENOMEM;
>
> /* Optimisation for I/Os between 4k and 128k */
> - dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
> - 256, 256, 0);
> + dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,256,
> + dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512 ? 512 : 256, 0);
This line is overly long and a bit difficult to read that way. I changed
it to this:
---
@@ -2693,15 +2693,20 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
static int nvme_setup_prp_pools(struct nvme_dev *dev)
{
+ size_t small_align = 256;
+
dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
NVME_CTRL_PAGE_SIZE,
NVME_CTRL_PAGE_SIZE, 0);
if (!dev->prp_page_pool)
return -ENOMEM;
+ if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
+ small_align = 512;
+
/* Optimisation for I/Os between 4k and 128k */
dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
- 256, 256, 0);
+ 256, small_align, 0);
if (!dev->prp_small_pool) {
dma_pool_destroy(dev->prp_page_pool);
return -ENOMEM;
--
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
2024-11-12 20:47 ` Keith Busch
@ 2024-11-13 4:31 ` Christoph Hellwig
2024-11-13 18:05 ` Keith Busch
2024-11-14 11:38 ` Paweł Anikiel
` (2 subsequent siblings)
4 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-11-13 4:31 UTC (permalink / raw)
To: Bob Beckett
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, kernel,
linux-nvme, linux-kernel
On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> From: Robert Beckett <bob.beckett@collabora.com>
>
> We initially put in a quick fix of limiting the queue depth to 1
> as experimentation showed that it fixed data corruption on 64GB
> steamdecks.
>
> After further experimentation, it appears that the corruption
> is fixed by aligning the small dma pool segments to 512 bytes.
> Testing via desync image verification shows that it now passes
> thousands of verification loops, where previously
> it never managed above 7.
As suggested before, instead of changing the pool size please just
always use the large pool for this device.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-13 4:31 ` Christoph Hellwig
@ 2024-11-13 18:05 ` Keith Busch
2024-11-13 20:08 ` Robert Beckett
2024-11-14 5:55 ` Christoph Hellwig
0 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2024-11-13 18:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bob Beckett, Jens Axboe, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
On Wed, Nov 13, 2024 at 05:31:51AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> > From: Robert Beckett <bob.beckett@collabora.com>
> >
> > We initially put in a quick fix of limiting the queue depth to 1
> > as experimentation showed that it fixed data corruption on 64GB
> > steamdecks.
> >
> > After further experimentation, it appears that the corruption
> > is fixed by aligning the small dma pool segments to 512 bytes.
> > Testing via desync image verification shows that it now passes
> > thousands of verification loops, where previously
> > it never managed above 7.
>
> As suggested before, instead of changing the pool size please just
> always use the large pool for this device.
Well, he's doing what I suggested. I thought this was better because it
puts the decision making in the initialization path instead of the IO
path.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-13 18:05 ` Keith Busch
@ 2024-11-13 20:08 ` Robert Beckett
2024-11-14 5:55 ` Christoph Hellwig
1 sibling, 0 replies; 24+ messages in thread
From: Robert Beckett @ 2024-11-13 20:08 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
---- On Wed, 13 Nov 2024 18:05:53 +0000 Keith Busch wrote ---
> On Wed, Nov 13, 2024 at 05:31:51AM +0100, Christoph Hellwig wrote:
> > On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> > > From: Robert Beckett <bob.beckett@collabora.com>
> > >
> > > We initially put in a quick fix of limiting the queue depth to 1
> > > as experimentation showed that it fixed data corruption on 64GB
> > > steamdecks.
> > >
> > > After further experimentation, it appears that the corruption
> > > is fixed by aligning the small dma pool segments to 512 bytes.
> > > Testing via desync image verification shows that it now passes
> > > thousands of verification loops, where previously
> > > it never managed above 7.
> >
> > As suggested before, instead of changing the pool size please just
> > always use the large pool for this device.
>
> Well, he's doing what I suggested. I thought this was better because it
> puts the decision making in the initialization path instead of the IO
> path.
>
yep, this avoids any extra conditional in the fast path
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-13 18:05 ` Keith Busch
2024-11-13 20:08 ` Robert Beckett
@ 2024-11-14 5:55 ` Christoph Hellwig
2024-11-14 13:10 ` Robert Beckett
1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-11-14 5:55 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Bob Beckett, Jens Axboe, Sagi Grimberg, kernel,
linux-nvme, linux-kernel
On Wed, Nov 13, 2024 at 11:05:53AM -0700, Keith Busch wrote:
> Well, he's doing what I suggested. I thought this was better because it
> puts the decision making in the initialization path instead of the IO
> path.
I guess it's fine in that regard. I just usually expect patch
authors to at least reply to reviews..
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
2024-11-12 20:47 ` Keith Busch
2024-11-13 4:31 ` Christoph Hellwig
@ 2024-11-14 11:38 ` Paweł Anikiel
2024-11-14 12:17 ` Christoph Hellwig
` (2 more replies)
2024-11-14 18:00 ` Keith Busch
2024-11-25 11:01 ` Niklas Cassel
4 siblings, 3 replies; 24+ messages in thread
From: Paweł Anikiel @ 2024-11-14 11:38 UTC (permalink / raw)
To: bob.beckett; +Cc: axboe, hch, kbusch, kernel, linux-kernel, linux-nvme, sagi
Hi all,
I've been tracking down an issue that seems to be related (identical?) to
this one, and I would like to propose a different fix.
I have a device with the aforementioned NVMe-eMMC bridge, and I was
experiencing nvme read timeouts after updating the kernel from 5.15 to
6.6. Doing a kernel bisect, I arrived at the same dma pool commit as
Robert in the original thread.
After trying out some changes in the nvme-pci driver, I came up with the
same fix as in this thread: change the alignment of the small pool to
512. However, I wanted to get a deeper understanding of what's going on.
After a lot of analysis, I found out why the nvme timeouts were happening:
The bridge incorrectly implements PRP list chaining.
When doing a read of exactly 264 sectors, and allocating a PRP list with
offset 0xf00, the last PRP entry in that list lies right before a page
boundary. The bridge incorrectly (?) assumes that it's a pointer to a
chained PRP list, tries to do a DMA to address 0x0, gets a bus error,
and crashes.
When doing a write of 264 sectors with PRP list offset of 0xf00,
the bridge treats data as a pointer, and writes incorrect data to
the drive. This might be why Robert is experiencing fs corruption.
So if my findings are right, the correct quirk would be "don't make PRP
lists ending on a page boundary".
Changing the small dma pool alignment to 512 happens to fix the issue
because it never allocates a PRP list with offset 0xf00. Theoretically,
the issue could still happen with the page pool, but this bridge has
a max transfer size of 64 pages, which is not enough to fill an entire
page-sized PRP list.
Robert, could you check that the fs corruption happens only after
transfers of size 257-264 and PRP list offset of 0xf00? This would
confirm my theory.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 11:38 ` Paweł Anikiel
@ 2024-11-14 12:17 ` Christoph Hellwig
2024-11-14 15:37 ` Keith Busch
2024-11-14 13:24 ` Robert Beckett
2024-11-14 15:46 ` Keith Busch
2 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-11-14 12:17 UTC (permalink / raw)
To: Paweł Anikiel
Cc: bob.beckett, axboe, hch, kbusch, kernel, linux-kernel, linux-nvme,
sagi
On Thu, Nov 14, 2024 at 11:38:03AM +0000, Paweł Anikiel wrote:
> When doing a write of 264 sectors with PRP list offset of 0xf00,
> the bridge treats data as a pointer, and writes incorrect data to
> the drive. This might be why Robert is experiencing fs corruption.
Not having the hardware and just speculating, but this seems like
a pretty likely failure mode.
**still cursing nvme for all this PRP brain damage**
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 5:55 ` Christoph Hellwig
@ 2024-11-14 13:10 ` Robert Beckett
0 siblings, 0 replies; 24+ messages in thread
From: Robert Beckett @ 2024-11-14 13:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
---- On Thu, 14 Nov 2024 05:55:44 +0000 Christoph Hellwig wrote ---
> On Wed, Nov 13, 2024 at 11:05:53AM -0700, Keith Busch wrote:
> > Well, he's doing what I suggested. I thought this was better because it
> > puts the decision making in the initialization path instead of the IO
> > path.
>
> I guess it's fine in that regard. It just usually expect patch
> authors to at least reply to reviews..
>
apologies for not replying directly. I assumed a new patch would suffice.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 11:38 ` Paweł Anikiel
2024-11-14 12:17 ` Christoph Hellwig
@ 2024-11-14 13:24 ` Robert Beckett
2024-11-14 14:13 ` Paweł Anikiel
2024-11-14 15:46 ` Keith Busch
2 siblings, 1 reply; 24+ messages in thread
From: Robert Beckett @ 2024-11-14 13:24 UTC (permalink / raw)
To: "Paweł Anikiel"
Cc: axboe, hch, kbusch, kernel, linux-kernel, linux-nvme, sagi
---- On Thu, 14 Nov 2024 11:38:03 +0000 Paweł Anikiel wrote ---
> Hi all,
>
> I've been tracking down an issue that seems to be related (identical?) to
> this one, and I would like to propose a different fix.
>
> I have a device with the aforementioned NVMe-eMMC bridge, and I was
> experiencing nvme read timeouts after updating the kernel from 5.15 to
> 6.6. Doing a kernel bisect, I arrived at the same dma pool commit as
> Robert in the original thread.
>
> After trying out some changes in the nvme-pci driver, I came up with the
> same fix as in this thread: change the alignment of the small pool to
> 512. However, I wanted to get a deeper understanding of what's going on.
>
> After a lot of analysis, I found out why the nvme timeouts were happening:
> The bridge incorrectly implements PRP list chaining.
>
> When doing a read of exactly 264 sectors, and allocating a PRP list with
> offset 0xf00, the last PRP entry in that list lies right before a page
> boundary. The bridge incorrectly (?) assumes that it's a pointer to a
> chained PRP list, tries to do a DMA to address 0x0, gets a bus error,
> and crashes.
>
> When doing a write of 264 sectors with PRP list offset of 0xf00,
> the bridge treats data as a pointer, and writes incorrect data to
> the drive. This might be why Robert is experiencing fs corruption.
>
> So if my findings are right, the correct quirk would be "don't make PRP
> lists ending on a page boundary".
This is interesting.
I had the same idea previously. I initially just changed the hard coded 256 / 8 to use 31 instead, which should have ensured the last entry of each segment never gets used.
When I tested that, it no longer failed, which was a good sign. So then I modified it to only do that on the last 256 byte segment of a page, but then it started failing again.
This led me to believe it was not a chaining issue specifically, so I went looking for other hypotheses, eventually settling on 512 byte alignment.
I never saw any bus error during my testing, just wrong data reads, which then fail image verification. I was expecting iommu error logs if it was trying to access a chain into nowhere whenever it interpreted the last entry in a page as a link. I never saw any iommu errors.
>
> Changing the small dma pool alignment to 512 happens to fix the issue
> because it never allocates a PRP list with offset 0xf00. Theoretically,
> the issue could still happen with the page pool, but this bridge has
> a max transfer size of 64 pages, which is not enough to fill an entire
> page-sized PRP list.
>
> Robert, could you check that the fs corruption happens only after
> transfers of size 257-264 and PRP list offset of 0xf00? This would
> confirm my theory.
I'd be glad to if you could share your testing method.
Currently I use a desync image verification which does lots of reads in parallel and is the only method I've found so far that can repro in a reasonable amount of time.
>
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 13:24 ` Robert Beckett
@ 2024-11-14 14:13 ` Paweł Anikiel
2024-11-14 16:28 ` Robert Beckett
0 siblings, 1 reply; 24+ messages in thread
From: Paweł Anikiel @ 2024-11-14 14:13 UTC (permalink / raw)
To: Robert Beckett; +Cc: axboe, hch, kbusch, kernel, linux-kernel, linux-nvme, sagi
On Thu, Nov 14, 2024 at 2:24 PM Robert Beckett
<bob.beckett@collabora.com> wrote:
> This is interesting.
> > I had the same idea previously. I initially just changed the hard coded 256 / 8 to use 31 instead, which should have ensured the last entry of each segment never gets used.
> > When I tested that, it no longer failed, which was a good sign. So then I modified it to only do that on the last 256 byte segment of a page, but then it started failing again.
Could you elaborate on the "only do that on the last 256 byte segment of
a page" part? How did you check which chunk of the page would be
allocated before choosing the dma pool?
> I never saw any bus error during my testing, just wrong data reads, which then fail image verification. I was expecting iommu error logs if it was trying to access a chain into nowhere whenever it interpreted the last entry in a page as a link. I never saw any iommu errors.
Maybe I misspoke, the "bus error" part was just my speculation, I
didn't look at the IOMMU logs or anything like that.
> I'd be glad to if you could share your testing method.
I dumped all the nvme transfers before the crash happened (using
tracefs), and I saw a read of size 264 = 8 + 256, which led me to the
chaining theory. To test this claim, I wrote a simple pci device
driver which creates one IO queue and submits a read command where the
PRP list is set up in a way that tests if the controller treats it as
a chained list or not. I ran it, and it indeed treated the last PRP
entry as a chained pointer.
It is possible that this is not the only thing that's wrong. Could you
share your patch that checks your "only do that on the last 256 byte
segment of a page" idea?
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 12:17 ` Christoph Hellwig
@ 2024-11-14 15:37 ` Keith Busch
0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-11-14 15:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pawel Anikiel, bob.beckett, axboe, kernel, linux-kernel,
linux-nvme, sagi
On Thu, Nov 14, 2024 at 01:17:51PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 11:38:03AM +0000, Pawel Anikiel wrote:
> > When doing a write of 264 sectors with PRP list offset of 0xf00,
> > the bridge treats data as a pointer, and writes incorrect data to
> > the drive. This might be why Robert is experiencing fs corruption.
>
> Not having the hardware and just speculating, but this seems like
> a pretty likely failure mode.
I can totally believe that. I recall the driver had a similar bug like
15 years ago. I think Nisheeth fixed it. Checks logs... Yep, commit
0d1bc9125890426b52c.
Anyway, even with the quirk, we can still have the last PRP just before
the page boundary from the "large" pool if the device supports MDTS 2MB
or more. What does this device report?
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 11:38 ` Paweł Anikiel
2024-11-14 12:17 ` Christoph Hellwig
2024-11-14 13:24 ` Robert Beckett
@ 2024-11-14 15:46 ` Keith Busch
2024-11-14 16:47 ` Robert Beckett
2 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-11-14 15:46 UTC (permalink / raw)
To: Pawel Anikiel
Cc: bob.beckett, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
On Thu, Nov 14, 2024 at 11:38:03AM +0000, Pawel Anikiel wrote:
> I've been tracking down an issue that seems to be related (identical?) to
> this one, and I would like to propose a different fix.
>
> I have a device with the aforementioned NVMe-eMMC bridge, and I was
> experiencing nvme read timeouts after updating the kernel from 5.15 to
> 6.6. Doing a kernel bisect, I arrived at the same dma pool commit as
> Robert in the original thread.
>
> After trying out some changes in the nvme-pci driver, I came up with the
> same fix as in this thread: change the alignment of the small pool to
> 512. However, I wanted to get a deeper understanding of what's going on.
>
> After a lot of analysis, I found out why the nvme timeouts were happening:
> The bridge incorrectly implements PRP list chaining.
>
> When doing a read of exactly 264 sectors, and allocating a PRP list with
> offset 0xf00, the last PRP entry in that list lies right before a page
> boundary. The bridge incorrectly (?) assumes that it's a pointer to a
> chained PRP list, tries to do a DMA to address 0x0, gets a bus error,
> and crashes.
>
> When doing a write of 264 sectors with PRP list offset of 0xf00,
> the bridge treats data as a pointer, and writes incorrect data to
> the drive. This might be why Robert is experiencing fs corruption.
This sounds very plausible, great analysis. Curious though, even without
the dma pool optimizations, you could still allocate a PRP list at that
offset. I wonder why the problem only showed up once we optimized the
pool allocator.
> So if my findings are right, the correct quirk would be "don't make PRP
> lists ending on a page boundary".
Coincidently enough, the quirk in this patch achieves that. But it's
great to understand why it was successful.
> Changing the small dma pool alignment to 512 happens to fix the issue
> because it never allocates a PRP list with offset 0xf00. Theoretically,
> the issue could still happen with the page pool, but this bridge has
> a max transfer size of 64 pages, which is not enough to fill an entire
> page-sized PRP list.
Thanks, this answers my question in the other thread: MDTS is too small
to hit the same bug with the large pool.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 14:13 ` Paweł Anikiel
@ 2024-11-14 16:28 ` Robert Beckett
2024-11-22 19:36 ` Keith Busch
0 siblings, 1 reply; 24+ messages in thread
From: Robert Beckett @ 2024-11-14 16:28 UTC (permalink / raw)
To: "Paweł Anikiel"
Cc: axboe, hch, kbusch, kernel, linux-kernel, linux-nvme, sagi
---- On Thu, 14 Nov 2024 14:13:52 +0000 Paweł Anikiel wrote ---
> On Thu, Nov 14, 2024 at 2:24 PM Robert Beckett
> <bob.beckett@collabora.com> wrote:
> > This is interesting.
> > I had the same idea previously. I initially just changed the hard coded 256 / 8 to use 31 instead, which should have ensured the last entry of each segment never gets used.
> > When I tested that, it no longer failed, which was a good sign. So then I modified it to only do that on the last 256 byte segment of a page, but then it started failing again.
>
> Could you elaborate the "only do that on the last 256 byte segment of
> a page" part? How did you check which chunk of the page would be
> allocated before choosing the dma pool?
>
> > I never saw any bus error during my testing, just wrong data reads, which then fail image verification. I was expecting iommu error logs if it was trying to access a chain into nowhere whenever it interpreted the last entry in a page as a link. I never saw any iommu errors.
>
> Maybe I misspoke, the "bus error" part was just my speculation, I
> didn't look at the IOMMU logs or anything like that.
>
> > I'd be glad to if you could share your testing method.
>
> I dumped all the nvme transfers before the crash happened (using
> tracefs), and I saw a read of size 264 = 8 + 256, which led me to the
> chaining theory. To test this claim, I wrote a simple pci device
> driver which creates one IO queue and submits a read command where the
> PRP list is set up in a way that tests if the controller treats it as
> a chained list or not. I ran it, and it indeed treated the last PRP
> entry as a chained pointer.
hmm, I guess a simple debugfs trigger file could be used to construct specially formulated requests. Would work as a debug tool.
Though at this point, the simple dmapool alignment param usage fixes both of these scenarios, so it is somewhat academic to keep putting effort into understanding this. I am trying to get answers out of the vendor to confirm any of these theories, which I hope will be more conclusive than our combined inference from testing.
>
> It is possible that this is not the only thing that's wrong. Could you
> share your patch that checks your "only do that on the last 256 byte
> segment of a page" idea?
>
Unfortunately I have already rebased away that change with the new one.
I can go hunting in my reflog to see if I can find it again, but probably easier to just implement it again.
I just hacked in a threshold parameter to the dmapool allocator that told it to allocate a different segment if it was the last in a page and the size was over the threshold.
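Reconstructing that hack from the description above, the allocator-side check might have looked something like this sketch; the function name and threshold semantics are illustrative guesses, not the actual dmapool patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical predicate mirroring the hack described above: skip a
 * candidate pool block when it is the last chunk of its page and the
 * requested size is over the threshold. */
static bool block_usable(size_t off_in_page, size_t block_size,
                         size_t threshold)
{
    bool last_in_page = (off_in_page + block_size == PAGE_SIZE);

    return !(last_in_page && block_size > threshold);
}

int main(void)
{
    /* The 256-byte block at offset 0xf00 is the one that gets skipped. */
    assert(!block_usable(0xf00, 256, 128));
    /* Any earlier block in the page stays usable... */
    assert(block_usable(0xe00, 256, 128));
    /* ...as does an allocation under the threshold, even at page end. */
    assert(block_usable(0xfc0, 64, 128));
    return 0;
}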
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 15:46 ` Keith Busch
@ 2024-11-14 16:47 ` Robert Beckett
0 siblings, 0 replies; 24+ messages in thread
From: Robert Beckett @ 2024-11-14 16:47 UTC (permalink / raw)
To: Keith Busch
Cc: Pawel Anikiel, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]
---- On Thu, 14 Nov 2024 15:46:41 +0000 Keith Busch wrote ---
> On Thu, Nov 14, 2024 at 11:38:03AM +0000, Pawel Anikiel wrote:
> > I've been tracking down an issue that seems to be related (identical?) to
> > this one, and I would like to propose a different fix.
> >
> > I have a device with the aforementioned NVMe-eMMC bridge, and I was
> > experiencing nvme read timeouts after updating the kernel from 5.15 to
> > 6.6. Doing a kernel bisect, I arrived at the same dma pool commit as
> > Robert in the original thread.
> >
> > After trying out some changes in the nvme-pci driver, I came up with the
> > same fix as in this thread: change the alignment of the small pool to
> > 512. However, I wanted to get a deeper understanding of what's going on.
> >
> > After a lot of analysis, I found out why the nvme timeouts were happening:
> > The bridge incorrectly implements PRP list chaining.
> >
> > When doing a read of exactly 264 sectors, and allocating a PRP list with
> > offset 0xf00, the last PRP entry in that list lies right before a page
> > boundary. The bridge incorrectly (?) assumes that it's a pointer to a
> > chained PRP list, tries to do a DMA to address 0x0, gets a bus error,
> > and crashes.
> >
> > When doing a write of 264 sectors with PRP list offset of 0xf00,
> > the bridge treats data as a pointer, and writes incorrect data to
> > the drive. This might be why Robert is experiencing fs corruption.
>
> This sounds very plausible, great analysis. Curious though, even without
> the dma pool optimizations, you could still allocate a PRP list at that
> offset. I wonder why the problem only showed up once we optimized the
> pool allocator.
yeah, me too. This is why I ended up at the 512 byte alignment idea.
I was speculating that maybe it has a 512 byte cache that caches whole 512 byte aligned blocks. So when we issue the previous segment again immediately, while it is still processing the current segment, it may not have invalidated that first segment, and it starts processing it using stale prps that were meant for the previous block.
I'm not sure why firmware that requires the last entry in a page to be a chain pointer would be exposed more readily by the dmapool stack change.
>
> > So if my findings are right, the correct quirk would be "don't make PRP
> > lists ending on a page boundary".
>
> Coincidently enough, the quirk in this patch achieves that. But it's
> great to understand why it was successful.
>
> > Changing the small dma pool alignment to 512 happens to fix the issue
> > because it never allocates a PRP list with offset 0xf00. Theoretically,
> > the issue could still happen with the page pool, but this bridge has
> > a max transfer size of 64 pages, which is not enough to fill an entire
> > page-sized PRP list.
>
> Thanks, this answers my question in the other thread: MDTS is too small
> to hit the same bug with the large pool.
>
I hadn't appreciated that MDTS means we never actually use a full page in a given transfer.
For reference, it is 6, which looks like it means only half a page will ever get used by this device.
In case it is useful for further discussion, attached is the nvme controller identity output
[-- Attachment #2: nvme-id.log --]
[-- Type: application/octet-stream, Size: 7870 bytes --]
NVME Identify Controller:
vid : 0x1217
ssvid : 0x2
sn : NCE777D00B21D
mn : E2M2 64GB
fr : 10100080
rab : 5
ieee : 00d051
cmic : 0
[3:3] : 0 ANA not supported
[2:2] : 0 PCI
[1:1] : 0 Single Controller
[0:0] : 0 Single Port
mdts : 6
cntlid : 0x1
ver : 0x10300
rtd3r : 0x7a120
rtd3e : 0x1e8480
oaes : 0x20
[31:31] : 0 Discovery Log Change Notice Not Supported
[27:27] : 0 Zone Descriptor Changed Notices Not Supported
[15:15] : 0 Normal NSS Shutdown Event Not Supported
[14:14] : 0 Endurance Group Event Aggregate Log Page Change Notice Not Supported
[13:13] : 0 LBA Status Information Notices Not Supported
[12:12] : 0 Predictable Latency Event Aggregate Log Change Notices Not Supported
[11:11] : 0 Asymmetric Namespace Access Change Notices Not Supported
[9:9] : 0 Firmware Activation Notices Not Supported
[8:8] : 0 Namespace Attribute Changed Event Not Supported
[7:0] : 0x20 Reserved
ctratt : 0
[19:19] : 0 Flexible Data Placement Not Supported
[15:15] : 0 Extended LBA Formats Not Supported
[14:14] : 0 Delete NVM Set Not Supported
[13:13] : 0 Delete Endurance Group Not Supported
[12:12] : 0 Variable Capacity Management Not Supported
[11:11] : 0 Fixed Capacity Management Not Supported
[10:10] : 0 Multi Domain Subsystem Not Supported
[9:9] : 0 UUID List Not Supported
[8:8] : 0 SQ Associations Not Supported
[7:7] : 0 Namespace Granularity Not Supported
[6:6] : 0 Traffic Based Keep Alive Not Supported
[5:5] : 0 Predictable Latency Mode Not Supported
[4:4] : 0 Endurance Groups Not Supported
[3:3] : 0 Read Recovery Levels Not Supported
[2:2] : 0 NVM Sets Not Supported
[1:1] : 0 Non-Operational Power State Permissive Not Supported
[0:0] : 0 128-bit Host Identifier Not Supported
rrls : 0
cntrltype : 0
[7:2] : 0 Reserved
[1:0] : 0 Controller type not reported
fguid : 00000000-0000-0000-0000-000000000000
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
[1:1] : 0 NVM subsystem Not part of an Enclosure
[0:0] : 0 NVM subsystem Not part of a Storage Device
vwci : 0
[7:7] : 0 VPD Write Cycles Remaining field is Not valid.
[6:0] : 0 VPD Write Cycles Remaining
mec : 0
[1:1] : 0 NVM subsystem Not contains a Management Endpoint on a PCIe port
[0:0] : 0 NVM subsystem Not contains a Management Endpoint on an SMBus/I2C port
oacs : 0x6
[10:10] : 0 Lockdown Command and Feature Not Supported
[9:9] : 0 Get LBA Status Capability Not Supported
[8:8] : 0 Doorbell Buffer Config Not Supported
[7:7] : 0 Virtualization Management Not Supported
[6:6] : 0 NVMe-MI Send and Receive Not Supported
[5:5] : 0 Directives Not Supported
[4:4] : 0 Device Self-test Not Supported
[3:3] : 0 NS Management and Attachment Not Supported
[2:2] : 0x1 FW Commit and Download Supported
[1:1] : 0x1 Format NVM Supported
[0:0] : 0 Security Send and Receive Not Supported
acl : 4
aerl : 4
frmw : 0xa
[5:5] : 0 Multiple FW or Boot Update Detection Not Supported
[4:4] : 0 Firmware Activate Without Reset Not Supported
[3:1] : 0x5 Number of Firmware Slots
[0:0] : 0 Firmware Slot 1 Read/Write
lpa : 0
[6:6] : 0 Telemetry Log Data Area 4 Not Supported
[5:5] : 0 LID 0x0, Scope of each command in LID 0x5, 0x12, 0x13 Not Supported
[4:4] : 0 Persistent Event log Not Supported
[3:3] : 0 Telemetry host/controller initiated log page Not Supported
[2:2] : 0 Extended data for Get Log Page Not Supported
[1:1] : 0 Command Effects Log Page Not Supported
[0:0] : 0 SMART/Health Log Page per NS Not Supported
elpe : 0
[7:0] : 0 (0's based) Error Log Page Entries (ELPE)
npss : 1
[7:0] : 1 (0's based) Number of Power States Support (NPSS)
avscc : 0
[0:0] : 0 Admin Vendor Specific Commands uses Vendor Specific Format
apsta : 0
[0:0] : 0 Autonomous Power State Transitions Not Supported
wctemp : 343
[15:0] : 70 °C (343 K) Warning Composite Temperature Threshold (WCTEMP)
cctemp : 353
[15:0] : 80 °C (353 K) Critical Composite Temperature Threshold (CCTEMP)
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
[127:0] : 0
Total NVM Capacity (TNVMCAP)
unvmcap : 0
[127:0] : 0
Unallocated NVM Capacity (UNVMCAP)
rpmbs : 0
[31:24]: 0 Access Size
[23:16]: 0 Total Size
[5:3] : 0 Authentication Method
[2:0] : 0 Number of RPMB Units
edstt : 0
dsto : 0
fwug : 0
kas : 0
hctma : 0
[0:0] : 0 Host Controlled Thermal Management Not Supported
mntmt : 0
[15:0] : -273 °C (0 K) Minimum Thermal Management Temperature (MNTMT)
mxtmt : 0
[15:0] : -273 °C (0 K) Maximum Thermal Management Temperature (MXTMT)
sanicap : 0
[31:30] : 0 Additional media modification after sanitize operation completes successfully is not defined
[29:29] : 0 No-Deallocate After Sanitize bit in Sanitize command Supported
[2:2] : 0 Overwrite Sanitize Operation Not Supported
[1:1] : 0 Block Erase Sanitize Operation Not Supported
[0:0] : 0 Crypto Erase Sanitize Operation Not Supported
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
[7:7] : 0 Non-zero group ID Not Supported
[6:6] : 0 Group ID does change
[4:4] : 0 ANA Change state Not Supported
[3:3] : 0 ANA Persistent Loss state Not Supported
[2:2] : 0 ANA Inaccessible state Not Supported
[1:1] : 0 ANA Non-optimized state Not Supported
[0:0] : 0 ANA Optimized state Not Supported
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
[7:4] : 0x6 Max SQ Entry Size (64)
[3:0] : 0x6 Min SQ Entry Size (64)
cqes : 0x44
[7:4] : 0x4 Max CQ Entry Size (16)
[3:0] : 0x4 Min CQ Entry Size (16)
maxcmd : 0
nn : 1
oncs : 0x4
[10:10] : 0 All Fast Copy Not Supported
[9:9] : 0 Copy Single Atomicity Not Supported
[8:8] : 0 Copy Not Supported
[7:7] : 0 Verify Not Supported
[6:6] : 0 Timestamp Not Supported
[5:5] : 0 Reservations Not Supported
[4:4] : 0 Save and Select Not Supported
[3:3] : 0 Write Zeroes Not Supported
[2:2] : 0x1 Data Set Management Supported
[1:1] : 0 Write Uncorrectable Not Supported
[0:0] : 0 Compare Not Supported
fuses : 0
[0:0] : 0 Fused Compare and Write Not Supported
fna : 0
[3:3] : 0 Format NVM Broadcast NSID (FFFFFFFFh) Supported
[2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
[1:1] : 0 Crypto Erase Applies to Single Namespace(s)
[0:0] : 0 Format Applies to Single Namespace(s)
vwc : 0x1
[2:1] : 0 Support for the NSID field set to FFFFFFFFh is not indicated
[0:0] : 0x1 Volatile Write Cache Present
awun : 0
awupf : 0
icsvscc : 0
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
nwpc : 0
[2:2] : 0 Permanent Write Protect Not Supported
[1:1] : 0 Write Protect Until Power Supply Not Supported
[0:0] : 0 No Write Protect and Write Protect Namespace Not Supported
acwu : 0
ocfs : 0
[3:3] : 0 Controller Copy Format 3h Not Supported
[2:2] : 0 Controller Copy Format 2h Not Supported
[1:1] : 0 Controller Copy Format 1h Not Supported
[0:0] : 0 Controller Copy Format 0h Not Supported
sgls : 0
[15:8] : 0 SGL Descriptor Threshold
[1:0] : 0 Scatter-Gather Lists Not Supported
mnan : 0
maxdna : 0
maxcna : 0
oaqd : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
[0:0] : 0 Dynamic Controller Model
msdbd : 0
ofcs : 0
[0:0] : 0 Disconnect command Not Supported
ps 0 : mp:5.00W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
active_power_workload:-
ps 1 : mp:0.0025W non-operational enlat:4000 exlat:10000 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
active_power_workload:-
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
` (2 preceding siblings ...)
2024-11-14 11:38 ` Paweł Anikiel
@ 2024-11-14 18:00 ` Keith Busch
2024-11-14 18:01 ` Jens Axboe
2024-11-25 11:01 ` Niklas Cassel
4 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-11-14 18:00 UTC (permalink / raw)
To: Bob Beckett
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> From: Robert Beckett <bob.beckett@collabora.com>
>
> We initially put in a quick fix of limiting the queue depth to 1
> as experimentation showed that it fixed data corruption on 64GB
> steamdecks.
>
> After further experimentation, it appears that the corruption
> is fixed by aligning the small dma pool segments to 512 bytes.
> Testing via desync image verification shows that it now passes
> thousands of verification loops, where previously
> it never managed above 7.
>
> Currently it is not known why this fixes the corruption.
> Perhaps it is doing something nasty like using an mmc page
> as a cache for the prp lists (mmc min. page size is 512 bytes)
> and not invalidating properly, so that the dma pool change to
> treat the segment list as a stack ends up handing back a previous
> segment in the same cached page.
>
> This supersedes the previous queue depth limitation, as it fixes
> the corruption without incurring the 37% performance
> degradation seen in testing.
>
> Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
I had this queued up for the nvme-6.12 pull request, which I'm about to
send out, but I guess we should drop it until we conclude this
discussion. With 6.12 likely to be released on Sunday, this better
mitigation would need to target 6.13, then stable.
FWIW, given the current understanding with the last entry boundary
chaining idea, the QD1 quirk mitigates that by always allocating a prp
list at page offset 0. So it effectively works around whatever errata
we're dealing with, albeit with an unsurprising performance penalty.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 18:00 ` Keith Busch
@ 2024-11-14 18:01 ` Jens Axboe
0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2024-11-14 18:01 UTC (permalink / raw)
To: Keith Busch, Bob Beckett
Cc: Christoph Hellwig, Sagi Grimberg, kernel, linux-nvme,
linux-kernel
On 11/14/24 11:00 AM, Keith Busch wrote:
> On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
>> From: Robert Beckett <bob.beckett@collabora.com>
>>
>> We initially put in a quick fix of limiting the queue depth to 1
>> as experimentation showed that it fixed data corruption on 64GB
>> steamdecks.
>>
>> After further experimentation, it appears that the corruption
>> is fixed by aligning the small dma pool segments to 512 bytes.
>> Testing via desync image verification shows that it now passes
>> thousands of verification loops, where previously
>> it never managed above 7.
>>
>> Currently it is not known why this fixes the corruption.
>> Perhaps it is doing something nasty like using an mmc page
>> as a cache for the prp lists (mmc min. page size is 512 bytes)
>> and not invalidating properly, so that the dma pool change to
>> treat the segment list as a stack ends up handing back a previous
>> segment in the same cached page.
>>
>> This supersedes the previous queue depth limitation, as it fixes
>> the corruption without incurring the 37% performance
>> degradation seen in testing.
>>
>> Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
>
> I had this queued up for the nvme-6.12 pull request, which I'm about to
> send out, but I guess we should drop it until we conclude this
> discussion. With 6.12 likely to be released on Sunday, this better
> mitigation would need to target 6.13, then stable.
Since it's a long-standing issue, it's not urgent that it go into 6.12 in
the first place. So I'd concur with that assessment, even before this
discussion, it should just go into 6.13 and be marked for stable. It
needs that anyway.
--
Jens Axboe
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-14 16:28 ` Robert Beckett
@ 2024-11-22 19:36 ` Keith Busch
2024-12-09 12:32 ` Robert Beckett
0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-11-22 19:36 UTC (permalink / raw)
To: Robert Beckett
Cc: "Pawel Anikiel", axboe, hch, kernel, linux-kernel,
linux-nvme, sagi
On Thu, Nov 14, 2024 at 04:28:48PM +0000, Robert Beckett wrote:
> ---- On Thu, 14 Nov 2024 14:13:52 +0000 Paweł Anikiel wrote ---
> > On Thu, Nov 14, 2024 at 2:24 PM Robert Beckett
> > bob.beckett@collabora.com> wrote:
> > > This is interesting.
> > > I had the same idea previously. I initially just changed the hard coded 256 / 8 to use 31 instead, which should have ensured the last entry of each segment never gets used.
> > > > When I tested that, it no longer failed, which was a good sign. So then I modified it to only do that on the last 256 byte segment of a page, but then it started failing again.
> >
> > Could you elaborate the "only do that on the last 256 byte segment of
> > a page" part? How did you check which chunk of the page would be
> > allocated before choosing the dma pool?
> >
> > > I never saw any bus error during my testing, just wrong data
> > > read, which then fails image verification. I was expecting iommu
> > > error logs if it was trying to access a chain in to nowhere if it
> > > always interpreted last entry in page as a link. I never saw any
> > > iommu errors.
> >
> > Maybe I misspoke, the "bus error" part was just my speculation, I
> > didn't look at the IOMMU logs or anything like that.
> >
> > > I'd be glad to if you could share your testing method.
> >
> > I dumped all the nvme transfers before the crash happened (using
> > tracefs), and I saw a read of size 264 = 8 + 256, which led me to the
> > chaining theory. To test this claim, I wrote a simple pci device
> > driver which creates one IO queue and submits a read command where the
> > PRP list is set up in a way that tests if the controller treats it as
> > a chained list or not. I ran it, and it indeed treated the last PRP
> > entry as a chained pointer.
> hmm, I guess a simple debugfs trigger file could be used to construct
> specially formulated requests. Would work as a debug tool.
>
> Though at this point, the simple dmapool alignment param usage fixes
> both of these scenarios, so it will be kind of academic to continue
> putting effort in to understand this. I am trying to get answers out
> of the vendor to confirm any of these theories, which I hope will be
> more conclusive than our combined inference from testing.
Any updates on this? I'm satisfied with the quirk patch, so we can move
this forward if you're okay with the current understanding.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
` (3 preceding siblings ...)
2024-11-14 18:00 ` Keith Busch
@ 2024-11-25 11:01 ` Niklas Cassel
4 siblings, 0 replies; 24+ messages in thread
From: Niklas Cassel @ 2024-11-25 11:01 UTC (permalink / raw)
To: Bob Beckett
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, kernel,
linux-nvme, linux-kernel, Gwendal Grignou
On Tue, Nov 12, 2024 at 07:50:00PM +0000, Bob Beckett wrote:
> From: Robert Beckett <bob.beckett@collabora.com>
>
> We initially put in a quick fix of limiting the queue depth to 1
> as experimentation showed that it fixed data corruption on 64GB
> steamdecks.
>
> After further experimentation, it appears that the corruption
> is fixed by aligning the small dma pool segments to 512 bytes.
> Testing via desync image verification shows that it now passes
> thousands of verification loops, where previously
> it never managed above 7.
>
> Currently it is not known why this fixes the corruption.
> Perhaps it is doing something nasty like using an mmc page
> as a cache for the prp lists (mmc min. page size is 512 bytes)
> and not invalidating properly, so that the dma pool change to
> treat the segment list as a stack ends up handing back a previous
> segment in the same cached page.
>
> This supersedes the previous queue depth limitation, as it fixes
> the corruption without incurring the 37% performance
> degradation seen in testing.
>
> Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
> Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
> ---
> drivers/nvme/host/nvme.h | 5 +++++
> drivers/nvme/host/pci.c | 6 +++---
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536..61bba5513de0 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -173,6 +173,11 @@ enum nvme_quirks {
> * MSI (but not MSI-X) interrupts are broken and never fire.
> */
> NVME_QUIRK_BROKEN_MSI = (1 << 21),
> +
> + /*
> + * Align dma pool segment size to 512 bytes
> + */
> + NVME_QUIRK_DMAPOOL_ALIGN_512 = (1 << 22),
> };
>
> /*
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 4b9fda0b1d9a..6fcd3bb413c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2700,8 +2700,8 @@ static int nvme_setup_prp_pools(struct nvme_dev *dev)
> return -ENOMEM;
>
> /* Optimisation for I/Os between 4k and 128k */
> - dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
> - 256, 256, 0);
> + dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,256,
> + dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512 ? 512 : 256, 0);
> if (!dev->prp_small_pool) {
> dma_pool_destroy(dev->prp_page_pool);
> return -ENOMEM;
> @@ -3449,7 +3449,7 @@ static const struct pci_device_id nvme_id_table[] = {
> { PCI_VDEVICE(REDHAT, 0x0010), /* Qemu emulated controller */
> .driver_data = NVME_QUIRK_BOGUS_NID, },
> { PCI_DEVICE(0x1217, 0x8760), /* O2 Micro 64GB Steam Deck */
> - .driver_data = NVME_QUIRK_QDEPTH_ONE },
> + .driver_data = NVME_QUIRK_DMAPOOL_ALIGN_512, },
> { PCI_DEVICE(0x126f, 0x2262), /* Silicon Motion generic */
> .driver_data = NVME_QUIRK_NO_DEEPEST_PS |
> NVME_QUIRK_BOGUS_NID, },
> --
> 2.45.2
>
>
+CC: Gwendal
Since he sent out a patch to revert the original QD=1 quirk,
claiming that the quirk wasn't needed when using the same
NVMe to eMMC bridge with another eMMC device.
From reading about this problem, it sure sounds like a controller
issue... but if it really is eMMC device related, then the quirk
should probably be applied only for certain eMMC devices instead.
Kind regards,
Niklas
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-11-22 19:36 ` Keith Busch
@ 2024-12-09 12:32 ` Robert Beckett
2024-12-09 15:33 ` Paweł Anikiel
0 siblings, 1 reply; 24+ messages in thread
From: Robert Beckett @ 2024-12-09 12:32 UTC (permalink / raw)
To: Keith Busch
Cc: Pawel Anikiel, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
---- On Fri, 22 Nov 2024 19:36:51 +0000 Keith Busch wrote ---
> On Thu, Nov 14, 2024 at 04:28:48PM +0000, Robert Beckett wrote:
> > ---- On Thu, 14 Nov 2024 14:13:52 +0000 Paweł Anikiel wrote ---
> > > On Thu, Nov 14, 2024 at 2:24 PM Robert Beckett
> > > bob.beckett@collabora.com> wrote:
> > > > This is interesting.
> > > > I had the same idea previously. I initially just changed the hard coded 256 / 8 to use 31 instead, which should have ensured the last entry of each segment never gets used.
> > > > When I tested that, it no longer failed, which was a good sign. So then I modified it to only do that on the last 256 byte segment of a page, but then it started failing again.
> > >
> > > Could you elaborate the "only do that on the last 256 byte segment of
> > > a page" part? How did you check which chunk of the page would be
> > > allocated before choosing the dma pool?
> > >
> > > > I never saw any bus error during my testing, just wrong data
> > > > read, which then fails image verification. I was expecting iommu
> > > > error logs if it was trying to access a chain in to nowhere if it
> > > > always interpreted last entry in page as a link. I never saw any
> > > > iommu errors.
> > >
> > > Maybe I misspoke, the "bus error" part was just my speculation, I
> > > didn't look at the IOMMU logs or anything like that.
> > >
> > > > I'd be glad to if you could share your testing method.
> > >
> > > I dumped all the nvme transfers before the crash happened (using
> > > tracefs), and I saw a read of size 264 = 8 + 256, which led me to the
> > > chaining theory. To test this claim, I wrote a simple pci device
> > > driver which creates one IO queue and submits a read command where the
> > > PRP list is set up in a way that tests if the controller treats it as
> > > a chained list or not. I ran it, and it indeed treated the last PRP
> > > entry as a chained pointer.
> > hmm, I guess a simple debugfs trigger file could be used to construct
> > specially formulated requests. Would work as a debug tool.
> >
> > Though at this point, the simple dmapool alignment param usage fixes
> > both of these scenarios, so it will be kind of academic to continue
> > putting effort in to understand this. I am trying to get answers out
> > of the vendor to confirm any of these theories, which I hope will be
> > more conclusive than our combined inference from testing.
>
> Any updates on this? I'm satisfied with the quirk patch, so we can move
> this forward if you're okay with the current understanding.
>
Apologies for the late reply; I think this got missed during a holiday. Thanks for prompting on the previous thread.
I have no further updates on this. I have received no further info from the vendor.
I think we can go ahead and use the alignment patch as is. The only outstanding question was whether it is an
implicit last-entry-per-page chain vs a simple alignment requirement. Either way, using the dmapool
alignment fixes both of these potential causes, so we should just take it as is.
If we ever get any better info and can do a more specific patch in future, we can rework it then.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-09 12:32 ` Robert Beckett
@ 2024-12-09 15:33 ` Paweł Anikiel
2024-12-10 21:36 ` Keith Busch
0 siblings, 1 reply; 24+ messages in thread
From: Paweł Anikiel @ 2024-12-09 15:33 UTC (permalink / raw)
To: Robert Beckett
Cc: Keith Busch, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
On Mon, Dec 9, 2024 at 1:33 PM Robert Beckett <bob.beckett@collabora.com> wrote:
> [...]
> I have no further updates on this. I have received no further info from the vendor.
> I think we can go ahead and use the alignment patch as is. The only outstanding question was whether it is an
> implicit last-entry-per-page chain vs a simple alignment requirement. Either way, using the dmapool
> alignment fixes both of these potential causes, so we should just take it as is.
> If we ever get any better info and can do a more specific patch in future, we can rework it then.
I think the 512 byte alignment fix is good. I tried coming up with
something more specific, but everything I could think of was either
too complicated or artificial.
Regarding the question of whether this is an alignment requirement or
the last PRP entry issue, I strongly believe it's the latter. I have a
piece of code that clearly demonstrates the hardware bug when run on a
device with the nvme bridge. I would really appreciate it if this was
verified and my explanation was included in the patch.
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-09 15:33 ` Paweł Anikiel
@ 2024-12-10 21:36 ` Keith Busch
2024-12-11 10:55 ` Robert Beckett
0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-10 21:36 UTC (permalink / raw)
To: Pawel Anikiel
Cc: Robert Beckett, axboe, hch, kernel, linux-kernel, linux-nvme,
sagi
On Mon, Dec 09, 2024 at 04:33:01PM +0100, Paweł Anikiel wrote:
> On Mon, Dec 9, 2024 at 1:33 PM Robert Beckett <bob.beckett@collabora.com> wrote:
> > [...]
> > I have no further updates on this. I have received no further info from the vendor.
> > I think we can go ahead and use the alignment patch as is. The only outstanding question was whether it is an
> > implicit last-entry-per-page chain vs a simple alignment requirement. Either way, using the dmapool
> > alignment fixes both of these potential causes, so we should just take it as is.
> > If we ever get any better info and can do a more specific patch in future, we can rework it then.
>
> I think the 512 byte alignment fix is good. I tried coming up with
> something more specific, but everything I could think of was either
> too complicated or artificial.
>
> Regarding the question of whether this is an alignment requirement or
> the last PRP entry issue, I strongly believe it's the latter. I have a
> piece of code that clearly demonstrates the hardware bug when run on a
> device with the nvme bridge. I would really appreciate it if this was
> verified and my explanation was included in the patch.
I've pushed this to nvme-6.13 with an updated commit message here:
https://git.infradead.org/?p=nvme.git;a=commitdiff;h=ccd84b8d6f4a60626dacb933b5d56dadca75c0ca
I can force an update if you have any edit suggestions.
Commit message copied below:
Author: Robert Beckett <bob.beckett@collabora.com>
nvme-pci: 512 byte aligned dma pool segment quirk
We initially introduced a quick fix limiting the queue depth to 1 as
experimentation showed that it fixed data corruption on 64GB steamdecks.
Further experimentation revealed corruption only happens when the last
PRP data element aligns to the end of the page boundary. The device
appears to treat this as a PRP chain to a new list instead of the data
element that it actually is. This implementation is in violation
of the spec. Encountering this errata with the Linux driver requires the
host to request a 128k transfer and coincidentally get the last small pool
dma buffer within a page.
The QD1 quirk effectively works around this because the last data PRP
always was at a 248 byte offset from the page start, so it never
appeared at the end of the page. Further, the MDTS is also small enough
that the "large" prp pool can hold enough PRP elements to never get to
the end, so that pool is not a problem either.
Introduce a new quirk to ensure the small pool is always aligned such
that the last PRP element can't appear at the end of the page. This comes
at the expense of wasting 256 bytes per small pool page allocated.
Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-10 21:36 ` Keith Busch
@ 2024-12-11 10:55 ` Robert Beckett
2024-12-17 11:18 ` Paweł Anikiel
0 siblings, 1 reply; 24+ messages in thread
From: Robert Beckett @ 2024-12-11 10:55 UTC (permalink / raw)
To: Keith Busch
Cc: Pawel Anikiel, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
---- On Tue, 10 Dec 2024 21:36:55 +0000 Keith Busch wrote ---
> On Mon, Dec 09, 2024 at 04:33:01PM +0100, Paweł Anikiel wrote:
> > On Mon, Dec 9, 2024 at 1:33 PM Robert Beckett bob.beckett@collabora.com> wrote:
> > > [...]
> > > I have no further updates on this. I have received no further info from the vendor.
> > > I think we can go ahead and use the alignment patch as is. The only outstanding question was whether it is an
> > > implicit last-entry-per-page chain vs a simple alignment requirement. Either way, using the dmapool
> > > alignment fixes both of these potential causes, so we should just take it as is.
> > > If we ever get any better info and can do a more specific patch in future, we can rework it then.
> >
> > I think the 512 byte alignment fix is good. I tried coming up with
> > something more specific, but everything I could think of was either
> > too complicated or artificial.
> >
> > Regarding the question of whether this is an alignment requirement or
> > the last PRP entry issue, I strongly believe it's the latter. I have a
> > piece of code that clearly demonstrates the hardware bug when run on a
> > device with the nvme bridge. I would really appreciate it if this was
> > verified and my explanation was included in the patch.
>
> I've pushed this to nvme-6.13 with an updated commit message here:
>
> https://git.infradead.org/?p=nvme.git;a=commitdiff;h=ccd84b8d6f4a60626dacb933b5d56dadca75c0ca
lgtm. Thanks!
>
> I can force an update if you have any edit suggestions.
>
> Commit message copied below:
>
> Author: Robert Beckett bob.beckett@collabora.com>
>
> nvme-pci: 512 byte aligned dma pool segment quirk
>
> We initially introduced a quick fix limiting the queue depth to 1 as
> experimentation showed that it fixed data corruption on 64GB steamdecks.
>
> Further experimentation revealed corruption only happens when the last
> PRP data element aligns to the end of the page boundary. The device
> appears to treat this as a PRP chain to a new list instead of the data
> element that it actually is. This implementation is in violation
> of the spec. Encountering this errata with the Linux driver requires the
> host to request a 128k transfer and coincidentally get the last small pool
> dma buffer within a page.
>
> The QD1 quirk effectively works around this because the last data PRP
> always was at a 248 byte offset from the page start, so it never
> appeared at the end of the page. Further, the MDTS is also small enough
> that the "large" prp pool can hold enough PRP elements to never get to
> the end, so that pool is not a problem either.
>
> Introduce a new quirk to ensure the small pool is always aligned such
> that the last PRP element can't appear at the end of the page. This comes
> at the expense of wasting 256 bytes per small pool page allocated.
>
> Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
> Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
> Cc: Paweł Anikiel panikiel@google.com>
> Signed-off-by: Robert Beckett bob.beckett@collabora.com>
> Signed-off-by: Keith Busch kbusch@kernel.org>
>
* Re: [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-11 10:55 ` Robert Beckett
@ 2024-12-17 11:18 ` Paweł Anikiel
0 siblings, 0 replies; 24+ messages in thread
From: Paweł Anikiel @ 2024-12-17 11:18 UTC (permalink / raw)
To: Robert Beckett
Cc: Keith Busch, axboe, hch, kernel, linux-kernel, linux-nvme, sagi
On Wed, Dec 11, 2024 at 11:56 AM Robert Beckett
<bob.beckett@collabora.com> wrote:
> ---- On Tue, 10 Dec 2024 21:36:55 +0000 Keith Busch wrote ---
> > On Mon, Dec 09, 2024 at 04:33:01PM +0100, Paweł Anikiel wrote:
> > > On Mon, Dec 9, 2024 at 1:33 PM Robert Beckett bob.beckett@collabora.com> wrote:
> > > > [...]
> > > > I have no further updates on this. I have received no further info from the vendor.
> > > > I think we can go ahead and use the alignment patch as is. The only outstanding question was whether it is an
> > > > implicit last-entry-per-page chain vs a simple alignment requirement. Either way, using the dmapool
> > > > alignment fixes both of these potential causes, so we should just take it as is.
> > > > If we ever get any better info and can do a more specific patch in future, we can rework it then.
> > >
> > > I think the 512 byte alignment fix is good. I tried coming up with
> > > something more specific, but everything I could think of was either
> > > too complicated or artificial.
> > >
> > > Regarding the question of whether this is an alignment requirement or
> > > the last PRP entry issue, I strongly believe it's the latter. I have a
> > > piece of code that clearly demonstrates the hardware bug when run on a
> > > device with the nvme bridge. I would really appreciate it if this was
> > > verified and my explanation was included in the patch.
> >
> > I've pushed this to nvme-6.13 with an updated commit message here:
> >
> > https://git.infradead.org/?p=nvme.git;a=commitdiff;h=ccd84b8d6f4a60626dacb933b5d56dadca75c0ca
>
> lgtm. Thanks!
Looks good to me as well. Thank you!
Thread overview: 24+ messages (newest: 2024-12-17 11:18 UTC)
2024-11-12 19:50 [PATCH] nvme-pci: 512 byte aligned dma pool segment quirk Bob Beckett
2024-11-12 20:47 ` Keith Busch
2024-11-13 4:31 ` Christoph Hellwig
2024-11-13 18:05 ` Keith Busch
2024-11-13 20:08 ` Robert Beckett
2024-11-14 5:55 ` Christoph Hellwig
2024-11-14 13:10 ` Robert Beckett
2024-11-14 11:38 ` Paweł Anikiel
2024-11-14 12:17 ` Christoph Hellwig
2024-11-14 15:37 ` Keith Busch
2024-11-14 13:24 ` Robert Beckett
2024-11-14 14:13 ` Paweł Anikiel
2024-11-14 16:28 ` Robert Beckett
2024-11-22 19:36 ` Keith Busch
2024-12-09 12:32 ` Robert Beckett
2024-12-09 15:33 ` Paweł Anikiel
2024-12-10 21:36 ` Keith Busch
2024-12-11 10:55 ` Robert Beckett
2024-12-17 11:18 ` Paweł Anikiel
2024-11-14 15:46 ` Keith Busch
2024-11-14 16:47 ` Robert Beckett
2024-11-14 18:00 ` Keith Busch
2024-11-14 18:01 ` Jens Axboe
2024-11-25 11:01 ` Niklas Cassel