public inbox for linux-nvme@lists.infradead.org
* max_hw_sectors error caused by recent NVMe driver commit
@ 2023-02-11  5:33 Michael Kelley (LINUX)
  2023-02-11  9:38 ` Daniel Gomez
  2023-02-13  5:59 ` Christoph Hellwig
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-11  5:33 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
	asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
appears to have introduced an error in how max_hw_sectors is calculated.  The
value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
be used.

In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
been set.  dma_addressing_limited() returns "true", and the swiotlb max mapping
size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).

Prior to this commit, max_hw_sectors isn't set until after the call to
dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
nvme_reset_work().   max_hw_sectors is correctly determined based on
values reported by the NVMe controller.

I haven't provided a fix because I'm not that familiar with the overall structure
of the code and the intent of the code reorganization.  I'm not sure if setting
the DMA mask should be moved earlier, or setting max_hw_sectors should
be moved back to its original location.

Michael




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
@ 2023-02-11  9:38 ` Daniel Gomez
  2023-02-13 16:42   ` Michael Kelley (LINUX)
  2023-02-13  5:59 ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-02-11  9:38 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Hi all,

On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> appears to have introduced an error in how max_hw_sectors is calculated.  The
> value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> be used.
>
> In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> been set.  dma_addressing_limited() returns "true", and the swiotlb max mapping
> size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
>
> Prior to this commit, max_hw_sectors isn't set until after the call to
> dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> nvme_reset_work().   max_hw_sectors is correctly determined based on
> values reported by the NVMe controller.
>
> I haven't provided a fix because I'm not that familiar with the overall structure
> of the code and the intent of the code reorganization.  I'm not sure if setting
> the DMA mask should be moved earlier, or setting max_hw_sectors should
> be moved back to its original location.

Yesterday, I ran into the same problem. Below you can find my test
case. I'm trying to get familiar with the code, so any help would be
much appreciated.

I could reproduce it with a simple dd test [1] and btrace. In summary,
multi-page bvec splits are performed in 4 steps/chunks of 63 segments
(63 * 4K / 512 = 504 sectors) and 1 step of 8 segments (8 * 4K / 512 =
32 sectors). But I think it should be just 4 steps of 64 segments as
that is the max_segments upper limit in my test case. Below you can
find my test case and numbers:

Check:
root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb
max_hw_sectors_kb max_segments max_segment_size optimal_io_size
logical_block_size chunk_sectors
max_sectors_kb:252
max_hw_sectors_kb:252
max_segments:64
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512
chunk_sectors:0

dd test:
dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress

btrace snippet:
259,0    0    23476     1.111774467   751  Q  RS 1923072 + 2048 [dd]
259,0    0    23477     1.111774842   751  X  RS 1923072 / 1923576 [dd]
259,0    0    23478     1.111775342   751  G  RS 1923072 + 504 [dd]
259,0    0    23479     1.111775425   751  I  RS 1923072 + 504 [dd]
259,0    0    23480     1.111776133   751  X  RS 1923576 / 1924080 [dd]
259,0    0    23481     1.111776258   751  G  RS 1923576 + 504 [dd]
259,0    0    23482     1.111776300   751  I  RS 1923576 + 504 [dd]
259,0    0    23483     1.111776675   751  X  RS 1924080 / 1924584 [dd]
259,0    0    23484     1.111776967   751  G  RS 1924080 + 504 [dd]
259,0    0    23485     1.111777008   751  I  RS 1924080 + 504 [dd]
259,0    0    23486     1.111777383   751  X  RS 1924584 / 1925088 [dd]
259,0    0    23487     1.111777467   751  G  RS 1924584 + 504 [dd]
259,0    0    23488     1.111777550   751  I  RS 1924584 + 504 [dd]
259,0    0    23489     1.111777758   751  G  RS 1925088 + 32 [dd]
259,0    0    23490     1.111777800   751  I  RS 1925088 + 32 [dd]
259,0    0    23491     1.111779383    36  D  RS 1923072 + 504 [kworker/0:1H]
259,0    0    23492     1.111780092    36  D  RS 1923576 + 504 [kworker/0:1H]
259,0    0    23493     1.111780800    36  D  RS 1924080 + 504 [kworker/0:1H]
259,0    0    23494     1.111781425    36  D  RS 1924584 + 504 [kworker/0:1H]
259,0    0    23495     1.111781717    36  D  RS 1925088 + 32 [kworker/0:1H]
259,0    0    23496     1.112201967   749  C  RS 1923072 + 504 [0]
259,0    0    23497     1.112563925   749  C  RS 1923072 + 2048 [0]
259,0    0    23498     1.112564425   749  C  RS 1923072 + 2016 [0]
259,0    0    23499     1.112564800   749  C  RS 1923072 + 1512 [0]
259,0    0    23500     1.112758217   749  C  RS 1923072 + 1008 [0]

In addition, going back to the commit Michael referenced and running the
test in the same setup, I got the following values:

max_sectors_kb:512
max_hw_sectors_kb:512
max_segments:127
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512

So, I'm not sure why my max_segments went from 64 -> 127 and
max_sectors_kb and max_hw_sectors_kb went from 252 -> 512. Perhaps my
first assumption was wrong?

[1] test:
https://unix.stackexchange.com/questions/529529/why-is-the-size-of-my-io-requests-being-limited-to-about-512k

Daniel

>
> Michael
>
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
  2023-02-11  9:38 ` Daniel Gomez
@ 2023-02-13  5:59 ` Christoph Hellwig
  2023-02-13  6:41   ` Michael Kelley (LINUX)
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-02-13  5:59 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Hi Michael,

does this fix the problem for you?

commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Feb 13 06:58:33 2023 +0100

    nvme dma_mask

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c734934c407ccf..cb068f92bd597e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int dma_address_bits = 64;
 
 	if (pci_enable_device_mem(pdev))
 		return result;
 
 	pci_set_master(pdev);
 
-	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
-		dma_address_bits = 48;
-	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
-		goto disable;
-
 	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
 		result = -ENODEV;
 		goto disable;
@@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 			     quirks);
 	if (ret)
 		goto out_put_device;
-	
+
+	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
+	else
+		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));
 	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
 	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
 


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13  5:59 ` Christoph Hellwig
@ 2023-02-13  6:41   ` Michael Kelley (LINUX)
  2023-02-13  6:42     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-13  6:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
	asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Christoph Hellwig <hch@lst.de> Sent: Sunday, February 12, 2023 10:00 PM
> 
> Hi Michael,
> 
> does this fix the problem for you?

Yes, basic smoke test works.  The underlying NVMe device
capabilities now show through instead of being replaced by
swiotlb values.  I did not do extensive functional testing.

See one code comment below.

Michael

> 
> commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Mon Feb 13 06:58:33 2023 +0100
> 
>     nvme dma_mask
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c734934c407ccf..cb068f92bd597e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  {
>  	int result = -ENOMEM;
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
> -	int dma_address_bits = 64;
> 
>  	if (pci_enable_device_mem(pdev))
>  		return result;
> 
>  	pci_set_master(pdev);
> 
> -	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> -		dma_address_bits = 48;
> -	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
> -		goto disable;
> -
>  	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
>  		result = -ENODEV;
>  		goto disable;
> @@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct
> pci_dev *pdev,
>  			     quirks);
>  	if (ret)
>  		goto out_put_device;
> -
> +
> +	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> +		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
> +	else
> +		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));

Any reason to use dev->dev instead of &pdev->dev?  They should be
the same, but with everything else in this function using &pdev->dev,
the lack of consistency suggests there might be a reason.

>  	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
>  	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13  6:41   ` Michael Kelley (LINUX)
@ 2023-02-13  6:42     ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-02-13  6:42 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 06:41:12AM +0000, Michael Kelley (LINUX) wrote:
> Any reason to use dev->dev instead of &pdev->dev?  They should be
> the same, but with everything else in this function using &pdev->dev,
> the lack of consistency suggests there might be a reason.

No good reason - the code started with copy and paste from the previous
location and I can fix this up.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  9:38 ` Daniel Gomez
@ 2023-02-13 16:42   ` Michael Kelley (LINUX)
  2023-02-13 16:57     ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-13 16:42 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Daniel Gomez <dagmcr@gmail.com> Sent: Saturday, February 11, 2023 1:39 AM
> 
> Hi all,
> 
> On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> > appears to have introduced an error in how max_hw_sectors is calculated.  The
> > value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> > uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> > be used.
> >
> > In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> > But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> > been set.  dma_addressing_limited() returns "true", and the swiotlb max mapping
> > size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
> >
> > Prior to this commit, max_hw_sectors isn't set until after the call to
> > dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> > nvme_reset_work().   max_hw_sectors is correctly determined based on
> > values reported by the NVMe controller.
> >
> > I haven't provided a fix because I'm not that familiar with the overall structure
> > of the code and the intent of the code reorganization.  I'm not sure if setting
> > the DMA mask should be moved earlier, or setting max_hw_sectors should
> > be moved back to its original location.
> 
> Yesterday, I ran into the same problem. Below you can find my test
> case. I'm trying to get familiar with the code, so any help would be
> much appreciated.
> 
> I could reproduce it with a simple dd test [1] and btrace. In summary,
> multi-page bvec splits are performed in 4 steps/chunks of 63 segments
> (63 * 4K / 512 = 504 sectors) and 1 step of 8 segments (8 * 4K / 512 =
> 32 sectors). But I think it should be just 4 steps of 64 segments as
> that is the max_segments upper limit in my test case. Below you can
> find my test case and numbers:

The maximum size of a single I/O transfer is limited to the *smaller*
of two factors:  max_segments and max_sectors_kb.  In your example
below,  the limiting factor is max_sectors_kb.   But note that up to 64
segments *could* be needed for a 252 Kbyte transfer if the starting
memory address is in the middle of a 4 Kbyte page.  A segment is
needed for the partial page at the beginning and another segment
is needed for the partial page at the end.  Also, if the target physical
memory has chunks larger than 4 Kbytes that are contiguous, even
fewer than 63 segments would be needed because each segment
describes a *contiguous* physical memory area using its starting
address and length.  But the I/O transfer would still be limited to
252 Kbytes because of max_sectors_kb.

> 
> Check:
> root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb
> max_hw_sectors_kb max_segments max_segment_size optimal_io_size
> logical_block_size chunk_sectors
> max_sectors_kb:252
> max_hw_sectors_kb:252
> max_segments:64
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> chunk_sectors:0
> 
> dd test:
> dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress
> 
> btrace snippet:
> 259,0    0    23476     1.111774467   751  Q  RS 1923072 + 2048 [dd]
> 259,0    0    23477     1.111774842   751  X  RS 1923072 / 1923576 [dd]
> 259,0    0    23478     1.111775342   751  G  RS 1923072 + 504 [dd]
> 259,0    0    23479     1.111775425   751  I  RS 1923072 + 504 [dd]
> 259,0    0    23480     1.111776133   751  X  RS 1923576 / 1924080 [dd]
> 259,0    0    23481     1.111776258   751  G  RS 1923576 + 504 [dd]
> 259,0    0    23482     1.111776300   751  I  RS 1923576 + 504 [dd]
> 259,0    0    23483     1.111776675   751  X  RS 1924080 / 1924584 [dd]
> 259,0    0    23484     1.111776967   751  G  RS 1924080 + 504 [dd]
> 259,0    0    23485     1.111777008   751  I  RS 1924080 + 504 [dd]
> 259,0    0    23486     1.111777383   751  X  RS 1924584 / 1925088 [dd]
> 259,0    0    23487     1.111777467   751  G  RS 1924584 + 504 [dd]
> 259,0    0    23488     1.111777550   751  I  RS 1924584 + 504 [dd]
> 259,0    0    23489     1.111777758   751  G  RS 1925088 + 32 [dd]
> 259,0    0    23490     1.111777800   751  I  RS 1925088 + 32 [dd]
> 259,0    0    23491     1.111779383    36  D  RS 1923072 + 504 [kworker/0:1H]
> 259,0    0    23492     1.111780092    36  D  RS 1923576 + 504 [kworker/0:1H]
> 259,0    0    23493     1.111780800    36  D  RS 1924080 + 504 [kworker/0:1H]
> 259,0    0    23494     1.111781425    36  D  RS 1924584 + 504 [kworker/0:1H]
> 259,0    0    23495     1.111781717    36  D  RS 1925088 + 32 [kworker/0:1H]
> 259,0    0    23496     1.112201967   749  C  RS 1923072 + 504 [0]
> 259,0    0    23497     1.112563925   749  C  RS 1923072 + 2048 [0]
> 259,0    0    23498     1.112564425   749  C  RS 1923072 + 2016 [0]
> 259,0    0    23499     1.112564800   749  C  RS 1923072 + 1512 [0]
> 259,0    0    23500     1.112758217   749  C  RS 1923072 + 1008 [0]
> 
> In addition, going back to the commit Michael referenced and running the
> test in the same setup, I got the following values:
> 
> max_sectors_kb:512
> max_hw_sectors_kb:512
> max_segments:127
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> 
> So, I'm not sure why my max_segments went from 64 -> 127 and
> max_sectors_kb and max_hw_sectors_kb went from 252 -> 512. Perhaps my
> first assumption was wrong?

The bug I pointed out caused max_hw_sectors to be set incorrectly.  Going
back to the referenced commit reverted the bug, and allowed the correct
value to be set.  In your example, I would guess the value of 512 Kbytes came
from querying the NVMe device for its max transfer size. Ideally, to support
512 Kbyte transfers, you would want 129 segments (to allow for starting in
the middle of a page as described above).  But the value of max_segments
is limited by the NVME driver itself using the value of NVME_MAX_SEGS
defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
sure that the data structure containing the scatterlist fits in a single page.
See nvme_pci_alloc_iod_mempool().

Hopefully this information helps ...

Michael



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13 16:42   ` Michael Kelley (LINUX)
@ 2023-02-13 16:57     ` Keith Busch
  2023-02-17 13:28       ` Daniel Gomez
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2023-02-13 16:57 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Daniel Gomez, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Gerd Bayer, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> Ideally, to support
> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> the middle of a page as describe above).  But the value of max_segments
> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> sure that the data structure containing the scatterlist fits in a single page.
> See nvme_pci_alloc_iod_mempool().

Should be 128 possible segments now in -next, but yeah, 129 would be ideal.

The limit confuses many because user space can sometimes get 512 KiB
IO to work and other times the same program fails, all because of
physical memory contiguity that user space isn't always aware of. A
sure-fire way to never hit that limit is to allocate hugepages.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13 16:57     ` Keith Busch
@ 2023-02-17 13:28       ` Daniel Gomez
  2023-02-17 16:05         ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-02-17 13:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Michael Kelley (LINUX), Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Gerd Bayer, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 5:57 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> > Ideally, to support
> > 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > the middle of a page as describe above).  But the value of max_segments
> > is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > sure that the data structure containing the scatterlist fits in a single page.
> > See nvme_pci_alloc_iod_mempool().
Michael,

Thanks for clarifying. It really helps. Also, it's fixed after
applying Christoph patch [1].

[1] https://lore.kernel.org/all/20230213072035.288225-1-hch@lst.de/

>> value to be set.  In your example, I would guess the value of 512 Kbytes came
>> from querying the NVMe device for its max transfer size. Ideally, to support
>> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
>> the middle of a page as describe above).  But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.

>
> Should be 128 possible segements now in -next, but yeah, 129 would be ideal.

Quoting Michael,

>> the middle of a page as describe above).  But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.

Yes, I can see that. I guess the 129 needs to be reduced to 127 (or,
after Keith's optimization patch, to 128), but not when the device is
limited to a lower max_segments value, because the segments fit in a
single page anyway?

Following the kernel code, I can see the max_hw_sectors_kb is
calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
(MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
change the MDTS, I get the following:

  MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
 ------ ---------------- ------------------- --------------
  0      1024             512                 127
  1      16               8                   3
  2      32               16                  5
  3      64               32                  9
  4      128              64                  17
  5      256              128                 33
  6      512              256                 65
  7      1024             512                 127

>
> The limit confuses many because sometimes user space can sometimes get 512kib
> IO to work and other times the same program fails, and all because of physical
> memory continuity that user space isn't always aware of. A sure-fire way to
> never hit that limit is to allocate hugepages.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-17 13:28       ` Daniel Gomez
@ 2023-02-17 16:05         ` Michael Kelley (LINUX)
  2023-03-03 16:24           ` Daniel Gomez
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-17 16:05 UTC (permalink / raw)
  To: Daniel Gomez, Keith Busch
  Cc: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
	asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> 
> >> value to be set.  In your example, I would guess the value of 512 Kbytes came
> >> from querying the NVMe device for its max transfer size. Ideally, to support
> >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> >> the middle of a page as describe above).  But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
> 
> >
> > Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
> 
> Quoting Michael,
> 
> >> the middle of a page as describe above).  But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
> 
> Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> after Keith optimization patch to 128) but not when the device is
> limited to a lower max_segments value because they fit anyway in a
> single page?

Yes, that's correct.   But "the device is limited to a lower max_segments
value" isn't really because max_segments is limited.  The limit is on
max_hw_sectors_kb derived from the NVMe controller MDTS value,
as you have shown in your table below.  Then the max_segments value
is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
is 128 Kbytes, you can never need more than 33 segments.  Each segment
can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
Then add 1 segment to handle the case where the memory buffer doesn't
start on a page boundary, and you get 33.   I'm making a subtle distinction
here between "max_segments is limited" and "you can't need more than
XX segments for a given max_hw_sectors_kb value".

Michael

> 
> Following the kernel code, I can see the max_hw_sectors_kb is
> calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> change the MDTS, I get the following:
> 
>   MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
>  ------ ---------------- ------------------- --------------
>   0      1024             512                 127
>   1      16               8                   3
>   2      32               16                  5
>   3      64               32                  9
>   4      128              64                  17
>   5      256              128                 33
>   6      512              256                 65
>   7      1024             512                 127
> 
> >
> > The limit confuses many because sometimes user space can sometimes get 512kib
> > IO to work and other times the same program fails, and all because of physical
> > memory continuity that user space isn't always aware of. A sure-fire way to
> > never hit that limit is to allocate hugepages.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-17 16:05         ` Michael Kelley (LINUX)
@ 2023-03-03 16:24           ` Daniel Gomez
  2023-03-03 16:44             ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-03-03 16:24 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Chaitanya Kulkarni, Christoph Hellwig, Gerd Bayer, Keith Busch,
	Sagi Grimberg, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

Hi,

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set.  In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as describe above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > >
> > > Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as describe above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> > after Keith optimization patch to 128) but not when the device is
> > limited to a lower max_segments value because they fit anyway in a
> > single page?
>
> Yes, that's correct.   But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited.  The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below.  Then the max_segments value
> is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments.  Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.

I'm trying to understand and clarify this part. So, I found Keith's
patch [1] where he addresses (and introduces) the same case (quote
from the patch: 'One additional segment is added to account for a
transfer that may start in the middle of a page.'). But I'm not sure
when that case should be considered when we split an I/O. Is that
handled in bio_split_rw + bvec_split_segs functions?

So, for mdts=7 and max_segments being now 128, when can we see and
test that a transfer starts in the middle of a page and therefore,
that we need an extra segment for that? In that case, shouldn't the
split max at 128 - 1 (508 KiB)?

In my tests [2], I can see splits are made in chunks of 128 in some
cases. Whether or not these cases are page-aligned I don’t know but
I'd like to verify that.

[1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/

[2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
count=1 of=/dev/null status=progress; done

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set.  In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as describe above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > >
> > > Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as describe above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or,
> > after Keith's optimization patch, to 128), but not when the device is
> > limited to a lower max_segments value, because those already fit in a
> > single page?
>
> Yes, that's correct.   But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited.  The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below.  Then the max_segments value
> is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments.  Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.   I'm making a subtle distinction
> here between "max_segments is limited" and "you can't need more than
> XX segments for a given max_hw_sectors_kb value".
>
> Michael
>
> >
> > Following the kernel code, I can see the max_hw_sectors_kb is
> > calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> > max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> > (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> > change the MDTS, I get the following:
> >
> >   MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> >  ------ ---------------- ------------------- --------------
> >   0      1024             512                 127
> >   1      16               8                   3
> >   2      32               16                  5
> >   3      64               32                  9
> >   4      128              64                  17
> >   5      256              128                 33
> >   6      512              256                 65
> >   7      1024             512                 127
> >
> > >
> > > The limit confuses many because user space can sometimes get 512KiB
> > > IO to work and other times the same program fails, and all because of physical
> > > memory continuity that user space isn't always aware of. A sure-fire way to
> > > never hit that limit is to allocate hugepages.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-03-03 16:24           ` Daniel Gomez
@ 2023-03-03 16:44             ` Keith Busch
  0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2023-03-03 16:44 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Michael Kelley (LINUX), Chaitanya Kulkarni, Christoph Hellwig,
	Gerd Bayer, Sagi Grimberg, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

On Fri, Mar 03, 2023 at 05:24:43PM +0100, Daniel Gomez wrote:
> On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Yes, that's correct.   But "the device is limited to a lower max_segments
> > value" isn't really because max_segments is limited.  The limit is on
> > max_hw_sectors_kb derived from the NVMe controller MDTS value,
> > as you have shown in your table below.  Then the max_segments value
> > is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
> > is 128 Kbytes, you can never need more than 33 segments.  Each segment
> > can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> > Then add 1 segment to handle the case where the memory buffer doesn't
> > start on a page boundary, and you get 33.
> 
> I'm trying to understand and clarify this part. So, I found Keith's
> patch [1] where he addresses (and introduces) the same case (quote
> from the patch: 'One additional segment is added to account for a
> transfer that may start in the middle of a page.'). But I'm not sure
> when that case should be considered when we split an I/O. Is that
> handled in bio_split_rw + bvec_split_segs functions?

The function responsible for splitting bios to the request_queue's limits is
__bio_split_to_limits(). This happens before blk-mq forms a 'struct request' for
the low-level driver in blk_mq_submit_bio().
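As a rough mental model (not the actual kernel code, which can also split
in the middle of a single vector), the split walks the I/O's vectors and
cuts whenever either the segment or the byte budget would be exceeded:

```python
PAGE = 4096  # assuming 4 KiB pages

def pages_spanned(addr, length):
    # Number of pages a contiguous [addr, addr+length) range touches.
    return (addr + length - 1) // PAGE - addr // PAGE + 1

def split(chunks, max_segments, max_bytes):
    """Toy, chunk-granular model of __bio_split_to_limits(): group
    (addr, length) chunks into sub-I/Os that respect both limits.
    The real code can additionally split inside a chunk."""
    out, cur, segs, nbytes = [], [], 0, 0
    for addr, length in chunks:
        need = pages_spanned(addr, length)
        if cur and (segs + need > max_segments or nbytes + length > max_bytes):
            out.append(cur)
            cur, segs, nbytes = [], 0, 0
        cur.append((addr, length))
        segs += need
        nbytes += length
    if cur:
        out.append(cur)
    return out
```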

> So, for mdts=7 and max_segments now being 128, when can we see and
> test that a transfer starts in the middle of a page and therefore
> needs an extra segment? In that case, shouldn't the split max out at
> 128 - 1 segments (508 KiB)?

You can ensure you start in the middle of a page by allocating page-aligned
memory, then adding an offset.
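For example (illustrative math only, assuming 4 KiB pages): a 512 KiB
buffer that is page-aligned touches exactly 128 pages, while the same
buffer shifted to start mid-page touches 129, which is where the extra
segment comes from:

```python
PAGE = 4096  # assuming 4 KiB pages

def pages_touched(offset_in_page, length, page=PAGE):
    # Worst case, every page the buffer touches becomes its own segment.
    return (offset_in_page + length + page - 1) // page

print(pages_touched(0, 512 * 1024))    # 128: page-aligned
print(pages_touched(512, 512 * 1024))  # 129: starts mid-page
```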

The number of segments you actually get from your allocation depends on how
many discontiguous pages the system backs your virtual memory with. User space
doesn't have direct control over that unless you're using huge pages. It's
entirely possible a 512KiB buffer has just one segment, or 129 (assuming 4k
pages), or any number in between.

> In my tests [2], I can see splits are made in chunks of 128 in some
> cases. Whether or not these cases are page-aligned I don’t know but
> I'd like to verify that.
> 
> [1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
> 
> [2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
> count=1 of=/dev/null status=progress; done


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-03-03 16:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
2023-02-11  9:38 ` Daniel Gomez
2023-02-13 16:42   ` Michael Kelley (LINUX)
2023-02-13 16:57     ` Keith Busch
2023-02-17 13:28       ` Daniel Gomez
2023-02-17 16:05         ` Michael Kelley (LINUX)
2023-03-03 16:24           ` Daniel Gomez
2023-03-03 16:44             ` Keith Busch
2023-02-13  5:59 ` Christoph Hellwig
2023-02-13  6:41   ` Michael Kelley (LINUX)
2023-02-13  6:42     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox