* max_hw_sectors error caused by recent NVMe driver commit
@ 2023-02-11 5:33 Michael Kelley (LINUX)
2023-02-11 9:38 ` Daniel Gomez
2023-02-13 5:59 ` Christoph Hellwig
0 siblings, 2 replies; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-11 5:33 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
asahi@lists.linux.dev, linux-nvme@lists.infradead.org
Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
appears to have introduced an error in how max_hw_sectors is calculated. The
value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
be used.
In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
But dma_addressing_limited() depends on dev->dma_mask, which hasn't
been set yet at that point. dma_addressing_limited() therefore returns
"true", and the swiotlb max mapping size is used, limiting NVMe
transfers to 504 sectors (252 Kbytes).
Prior to this commit, max_hw_sectors isn't set until after the call to
dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
nvme_reset_work(). max_hw_sectors is correctly determined based on
values reported by the NVMe controller.
I haven't provided a fix because I'm not that familiar with the overall structure
of the code and the intent of the code reorganization. I'm not sure if setting
the DMA mask should be moved earlier, or setting max_hw_sectors should
be moved back to its original location.
Michael
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-11 5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
@ 2023-02-11 9:38 ` Daniel Gomez
2023-02-13 16:42 ` Michael Kelley (LINUX)
2023-02-13 5:59 ` Christoph Hellwig
1 sibling, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-02-11 9:38 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org
Hi all,
On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> appears to have introduced an error in how max_hw_sectors is calculated. The
> value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> be used.
>
> In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> been set. dma_addressing_limited() returns "true", and the swiotlb max mapping
> size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
>
> Prior to this commit, max_hw_sectors isn't set until after the call to
> dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> nvme_reset_work(). max_hw_sectors is correctly determined based on
> values reported by the NVMe controller.
>
> I haven't provided a fix because I'm not that familiar with the overall structure
> of the code and the intent of the code reorganization. I'm not sure if setting
> the DMA mask should be moved earlier, or setting max_hw_sectors should
> be moved back to its original location.
Yesterday, I ran into the same problem. Below you can find my test
case. I'm trying to get familiar with the code, so any help would be
much appreciated.
I could reproduce it with a simple dd test [1] and btrace. In summary,
multi-page bvec splits are performed in 4 steps/chunks of 63 segments
(63 * 4K / 512 = 504 sectors) and 1 step of 4 segments (4 * 4K / 512 =
32 sectors). But I think it should be just 4 steps of 64 segments, as
that is the max_segments upper limit in my test case. Here are my
numbers:
Check:
root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb max_hw_sectors_kb max_segments max_segment_size optimal_io_size logical_block_size chunk_sectors
max_sectors_kb:252
max_hw_sectors_kb:252
max_segments:64
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512
chunk_sectors:0
dd test:
dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress
btrace snippet:
259,0 0 23476 1.111774467 751 Q RS 1923072 + 2048 [dd]
259,0 0 23477 1.111774842 751 X RS 1923072 / 1923576 [dd]
259,0 0 23478 1.111775342 751 G RS 1923072 + 504 [dd]
259,0 0 23479 1.111775425 751 I RS 1923072 + 504 [dd]
259,0 0 23480 1.111776133 751 X RS 1923576 / 1924080 [dd]
259,0 0 23481 1.111776258 751 G RS 1923576 + 504 [dd]
259,0 0 23482 1.111776300 751 I RS 1923576 + 504 [dd]
259,0 0 23483 1.111776675 751 X RS 1924080 / 1924584 [dd]
259,0 0 23484 1.111776967 751 G RS 1924080 + 504 [dd]
259,0 0 23485 1.111777008 751 I RS 1924080 + 504 [dd]
259,0 0 23486 1.111777383 751 X RS 1924584 / 1925088 [dd]
259,0 0 23487 1.111777467 751 G RS 1924584 + 504 [dd]
259,0 0 23488 1.111777550 751 I RS 1924584 + 504 [dd]
259,0 0 23489 1.111777758 751 G RS 1925088 + 32 [dd]
259,0 0 23490 1.111777800 751 I RS 1925088 + 32 [dd]
259,0 0 23491 1.111779383 36 D RS 1923072 + 504 [kworker/0:1H]
259,0 0 23492 1.111780092 36 D RS 1923576 + 504 [kworker/0:1H]
259,0 0 23493 1.111780800 36 D RS 1924080 + 504 [kworker/0:1H]
259,0 0 23494 1.111781425 36 D RS 1924584 + 504 [kworker/0:1H]
259,0 0 23495 1.111781717 36 D RS 1925088 + 32 [kworker/0:1H]
259,0 0 23496 1.112201967 749 C RS 1923072 + 504 [0]
259,0 0 23497 1.112563925 749 C RS 1923072 + 2048 [0]
259,0 0 23498 1.112564425 749 C RS 1923072 + 2016 [0]
259,0 0 23499 1.112564800 749 C RS 1923072 + 1512 [0]
259,0 0 23500 1.112758217 749 C RS 1923072 + 1008 [0]
In addition, going back to the commit Michael referenced and running the
test in the same setup, I got the following values:
max_sectors_kb:512
max_hw_sectors_kb:512
max_segments:127
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512
So, I'm not sure why my max_segments went from 64 -> 127 and
max_sectors_kb and max_hw_sectors_kb from 252 -> 512. Perhaps my
first assumption was wrong?
[1] test:
https://unix.stackexchange.com/questions/529529/why-is-the-size-of-my-io-requests-being-limited-to-about-512k
Daniel
>
> Michael
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-11 5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
2023-02-11 9:38 ` Daniel Gomez
@ 2023-02-13 5:59 ` Christoph Hellwig
2023-02-13 6:41 ` Michael Kelley (LINUX)
1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-02-13 5:59 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org
Hi Michael,
does this fix the problem for you?
commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Feb 13 06:58:33 2023 +0100
nvme dma_mask
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c734934c407ccf..cb068f92bd597e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
{
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
- int dma_address_bits = 64;
if (pci_enable_device_mem(pdev))
return result;
pci_set_master(pdev);
- if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
- dma_address_bits = 48;
- if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
- goto disable;
-
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
result = -ENODEV;
goto disable;
@@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
quirks);
if (ret)
goto out_put_device;
-
+
+ if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+ dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
+ else
+ dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));
dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
dma_set_max_seg_size(&pdev->dev, 0xffffffff);
* RE: max_hw_sectors error caused by recent NVMe driver commit
2023-02-13 5:59 ` Christoph Hellwig
@ 2023-02-13 6:41 ` Michael Kelley (LINUX)
2023-02-13 6:42 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-13 6:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
asahi@lists.linux.dev, linux-nvme@lists.infradead.org
From: Christoph Hellwig <hch@lst.de> Sent: Sunday, February 12, 2023 10:00 PM
>
> Hi Michael,
>
> does this fix the problem for you?
Yes, basic smoke test works. The underlying NVMe device
capabilities now show through instead of being replaced by
swiotlb values. I did not do extensive functional testing.
See one code comment below.
Michael
>
> commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
> Author: Christoph Hellwig <hch@lst.de>
> Date: Mon Feb 13 06:58:33 2023 +0100
>
> nvme dma_mask
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c734934c407ccf..cb068f92bd597e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> {
> int result = -ENOMEM;
> struct pci_dev *pdev = to_pci_dev(dev->dev);
> - int dma_address_bits = 64;
>
> if (pci_enable_device_mem(pdev))
> return result;
>
> pci_set_master(pdev);
>
> - if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> - dma_address_bits = 48;
> - if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
> - goto disable;
> -
> if (readl(dev->bar + NVME_REG_CSTS) == -1) {
> result = -ENODEV;
> goto disable;
> @@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct
> pci_dev *pdev,
> quirks);
> if (ret)
> goto out_put_device;
> -
> +
> + if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> + dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
> + else
> + dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));
Any reason to use dev->dev instead of &pdev->dev? They should be
the same, but with everything else in this function using &pdev->dev,
the lack of consistency suggests there might be a reason.
> dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-13 6:41 ` Michael Kelley (LINUX)
@ 2023-02-13 6:42 ` Christoph Hellwig
0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-02-13 6:42 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org
On Mon, Feb 13, 2023 at 06:41:12AM +0000, Michael Kelley (LINUX) wrote:
> Any reason to use dev->dev instead of &pdev->dev? They should be
> the same, but with everything else in this function using &pdev->dev,
> the lack of consistency suggests there might be a reason.
No good reason - the code started with copy and paste from the previous
location and I can fix this up.
* RE: max_hw_sectors error caused by recent NVMe driver commit
2023-02-11 9:38 ` Daniel Gomez
@ 2023-02-13 16:42 ` Michael Kelley (LINUX)
2023-02-13 16:57 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-13 16:42 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org
From: Daniel Gomez <dagmcr@gmail.com> Sent: Saturday, February 11, 2023 1:39 AM
>
> Hi all,
>
> On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> > appears to have introduced an error in how max_hw_sectors is calculated. The
> > value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> > uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> > be used.
> >
> > In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> > But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> > been set. dma_addressing_limited() returns "true", and the swiotlb max mapping
> > size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
> >
> > Prior to this commit, max_hw_sectors isn't set until after the call to
> > dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> > nvme_reset_work(). max_hw_sectors is correctly determined based on
> > values reported by the NVMe controller.
> >
> > I haven't provided a fix because I'm not that familiar with the overall structure
> > of the code and the intent of the code reorganization. I'm not sure if setting
> > the DMA mask should be moved earlier, or setting max_hw_sectors should
> > be moved back to its original location.
>
> Yesterday, I ran into the same problem. Below you can find my test
> case. I'm trying to get familiar with the code, so any help would be
> much appreciated.
>
> I could reproduce it with a simple dd test [1] and btrace. In summary,
> multi-page bvec splits are performed in 4 steps/chunks of 63 segments
> (63 * 4K / 512 = 504 sectors) and 1 step of 4 segments (4 * 4K / 512 =
> 32 sectors). But I think it should be just 4 steps of 64 segments, as
> that is the max_segments upper limit in my test case. Here are my
> numbers:
The maximum size of a single I/O transfer is limited to the *smaller*
of two factors: max_segments and max_sectors_kb. In your example
below, the limiting factor is max_sectors_kb. But note that up to 64
segments *could* be needed for a 252 Kbyte transfer if the starting
memory address is in the middle of a 4 Kbyte page. A segment is
needed for the partial page at the beginning and another segment
is needed for the partial page at the end. Also, if the target physical
memory has chunks larger than 4 Kbytes that are contiguous, even
fewer than 63 segments would be needed because each segment
describes a *contiguous* physical memory area using its starting
address and length. But the I/O transfer would still be limited to
252 Kbytes because of max_sectors_kb.
>
> Check:
> root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb max_hw_sectors_kb max_segments max_segment_size optimal_io_size logical_block_size chunk_sectors
> max_sectors_kb:252
> max_hw_sectors_kb:252
> max_segments:64
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> chunk_sectors:0
>
> dd test:
> dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress
>
> btrace snippet:
> 259,0 0 23476 1.111774467 751 Q RS 1923072 + 2048 [dd]
> 259,0 0 23477 1.111774842 751 X RS 1923072 / 1923576 [dd]
> 259,0 0 23478 1.111775342 751 G RS 1923072 + 504 [dd]
> 259,0 0 23479 1.111775425 751 I RS 1923072 + 504 [dd]
> 259,0 0 23480 1.111776133 751 X RS 1923576 / 1924080 [dd]
> 259,0 0 23481 1.111776258 751 G RS 1923576 + 504 [dd]
> 259,0 0 23482 1.111776300 751 I RS 1923576 + 504 [dd]
> 259,0 0 23483 1.111776675 751 X RS 1924080 / 1924584 [dd]
> 259,0 0 23484 1.111776967 751 G RS 1924080 + 504 [dd]
> 259,0 0 23485 1.111777008 751 I RS 1924080 + 504 [dd]
> 259,0 0 23486 1.111777383 751 X RS 1924584 / 1925088 [dd]
> 259,0 0 23487 1.111777467 751 G RS 1924584 + 504 [dd]
> 259,0 0 23488 1.111777550 751 I RS 1924584 + 504 [dd]
> 259,0 0 23489 1.111777758 751 G RS 1925088 + 32 [dd]
> 259,0 0 23490 1.111777800 751 I RS 1925088 + 32 [dd]
> 259,0 0 23491 1.111779383 36 D RS 1923072 + 504 [kworker/0:1H]
> 259,0 0 23492 1.111780092 36 D RS 1923576 + 504 [kworker/0:1H]
> 259,0 0 23493 1.111780800 36 D RS 1924080 + 504 [kworker/0:1H]
> 259,0 0 23494 1.111781425 36 D RS 1924584 + 504 [kworker/0:1H]
> 259,0 0 23495 1.111781717 36 D RS 1925088 + 32 [kworker/0:1H]
> 259,0 0 23496 1.112201967 749 C RS 1923072 + 504 [0]
> 259,0 0 23497 1.112563925 749 C RS 1923072 + 2048 [0]
> 259,0 0 23498 1.112564425 749 C RS 1923072 + 2016 [0]
> 259,0 0 23499 1.112564800 749 C RS 1923072 + 1512 [0]
> 259,0 0 23500 1.112758217 749 C RS 1923072 + 1008 [0]
>
> In addition, going back to the commit Michael referenced and running the
> test in the same setup, I got the following values:
>
> max_sectors_kb:512
> max_hw_sectors_kb:512
> max_segments:127
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
>
> So, I'm not sure why my max_segments went from 64 -> 127 and
> max_sectors_kb and max_hw_sectors_kb from 252 -> 512. Perhaps my
> first assumption was wrong?
The bug I pointed out caused max_hw_sectors to be set incorrectly. Going
back to the referenced commit reverted the bug, and allowed the correct
value to be set. In your example, I would guess the value of 512 Kbytes came
from querying the NVMe device for its max transfer size. Ideally, to support
512 Kbyte transfers, you would want 129 segments (to allow for starting in
the middle of a page, as described above). But the value of max_segments
is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
sure that the data structure containing the scatterlist fits in a single page.
See nvme_pci_alloc_iod_mempool().
Hopefully this information helps ...
Michael
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-13 16:42 ` Michael Kelley (LINUX)
@ 2023-02-13 16:57 ` Keith Busch
2023-02-17 13:28 ` Daniel Gomez
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2023-02-13 16:57 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Daniel Gomez, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Gerd Bayer, asahi@lists.linux.dev,
linux-nvme@lists.infradead.org
On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> Ideally, to support
> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> the middle of a page as describe above). But the value of max_segments
> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> sure that the data structure containing the scatterlist fits in a single page.
> See nvme_pci_alloc_iod_mempool().
Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
The limit confuses many because user space can sometimes get 512 KiB
IO to work and other times the same program fails, all because of physical
memory contiguity that user space isn't always aware of. A sure-fire way to
never hit that limit is to allocate hugepages.
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-13 16:57 ` Keith Busch
@ 2023-02-17 13:28 ` Daniel Gomez
2023-02-17 16:05 ` Michael Kelley (LINUX)
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-02-17 13:28 UTC (permalink / raw)
To: Keith Busch
Cc: Michael Kelley (LINUX), Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Gerd Bayer, asahi@lists.linux.dev,
linux-nvme@lists.infradead.org
On Mon, Feb 13, 2023 at 5:57 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> > Ideally, to support
> > 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > the middle of a page as describe above). But the value of max_segments
> > is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > sure that the data structure containing the scatterlist fits in a single page.
> > See nvme_pci_alloc_iod_mempool().
Michael,
Thanks for clarifying. It really helps. Also, it's fixed after
applying Christoph's patch [1].
[1] https://lore.kernel.org/all/20230213072035.288225-1-hch@lst.de/
>> value to be set. In your example, I would guess the value of 512 Kbytes came
>> from querying the NVMe device for its max transfer size. Ideally, to support
>> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
>> the middle of a page as describe above). But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.
>
> Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
Quoting Michael,
>> the middle of a page as describe above). But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.
Yes, I can see that. I guess the 129 needs to be reduced to 127 (or,
after Keith's optimization patch, to 128), but not when the device is
limited to a lower max_segments value, because those segments fit in a
single page anyway?
Following the kernel code, I can see the max_hw_sectors_kb is
calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
(MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
change the MDTS, I get the following:
MDTS max_hw_sectors max_hw_sectors_kb max_segments
------ ---------------- ------------------- --------------
0 1024 512 127
1 16 8 3
2 32 16 5
3 64 32 9
4 128 64 17
5 256 128 33
6 512 256 65
7 1024 512 127
>
> The limit confuses many because sometimes user space can sometimes get 512kib
> IO to work and other times the same program fails, and all because of physical
> memory continuity that user space isn't always aware of. A sure-fire way to
> never hit that limit is to allocate hugepages.
* RE: max_hw_sectors error caused by recent NVMe driver commit
2023-02-17 13:28 ` Daniel Gomez
@ 2023-02-17 16:05 ` Michael Kelley (LINUX)
2023-03-03 16:24 ` Daniel Gomez
0 siblings, 1 reply; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-17 16:05 UTC (permalink / raw)
To: Daniel Gomez, Keith Busch
Cc: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
asahi@lists.linux.dev, linux-nvme@lists.infradead.org
From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
>
> >> value to be set. In your example, I would guess the value of 512 Kbytes came
> >> from querying the NVMe device for its max transfer size. Ideally, to support
> >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> >> the middle of a page as describe above). But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
>
> >
> > Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
>
> Quoting Michael,
>
> >> the middle of a page as describe above). But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
>
> Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> after Keith optimization patch to 128) but not when the device is
> limited to a lower max_segments value because they fit anyway in a
> single page?
Yes, that's correct. But "the device is limited to a lower max_segments
value" isn't really because max_segments is limited. The limit is on
max_hw_sectors_kb derived from the NVMe controller MDTS value,
as you have shown in your table below. Then the max_segments value
is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
is 128 Kbytes, you can never need more than 33 segments. Each segment
can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
Then add 1 segment to handle the case where the memory buffer doesn't
start on a page boundary, and you get 33. I'm making a subtle distinction
here between "max_segments is limited" and "you can't need more than
XX segments for a given max_hw_sectors_kb value".
Michael
>
> Following the kernel code, I can see the max_hw_sectors_kb is
> calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> change the MDTS, I get the following:
>
> MDTS max_hw_sectors max_hw_sectors_kb max_segments
> ------ ---------------- ------------------- --------------
> 0 1024 512 127
> 1 16 8 3
> 2 32 16 5
> 3 64 32 9
> 4 128 64 17
> 5 256 128 33
> 6 512 256 65
> 7 1024 512 127
>
> >
> > The limit confuses many because sometimes user space can sometimes get 512kib
> > IO to work and other times the same program fails, and all because of physical
> > memory continuity that user space isn't always aware of. A sure-fire way to
> > never hit that limit is to allocate hugepages.
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-02-17 16:05 ` Michael Kelley (LINUX)
@ 2023-03-03 16:24 ` Daniel Gomez
2023-03-03 16:44 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Gomez @ 2023-03-03 16:24 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Chaitanya Kulkarni, Christoph Hellwig, Gerd Bayer, Keith Busch,
Sagi Grimberg, asahi@lists.linux.dev,
linux-nvme@lists.infradead.org
Hi,
On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set. In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as describe above). But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > >
> > > Should be 128 possible segements now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as describe above). But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> > after Keith optimization patch to 128) but not when the device is
> > limited to a lower max_segments value because they fit anyway in a
> > single page?
>
> Yes, that's correct. But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited. The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below. Then the max_segments value
> is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments. Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.
I'm trying to understand and clarify this part. So, I found Keith's
patch [1] where he addresses (and introduces) the same case (quote
from the patch: 'One additional segment is added to account for a
transfer that may start in the middle of a page.'). But I'm not sure
when that case should be considered when we split an I/O. Is that
handled in bio_split_rw + bvec_split_segs functions?
So, for mdts=7 and max_segments now being 128, when can we see and
test a transfer that starts in the middle of a page and therefore
needs an extra segment? In that case, shouldn't the split max out
at 128 - 1 segments (508 KiB)?
In my tests [2], I can see splits are made in chunks of 128 in some
cases. Whether or not these cases are page-aligned I don’t know but
I'd like to verify that.
[1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
[2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
count=1 of=/dev/null status=progress; done
* Re: max_hw_sectors error caused by recent NVMe driver commit
2023-03-03 16:24 ` Daniel Gomez
@ 2023-03-03 16:44 ` Keith Busch
0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2023-03-03 16:44 UTC (permalink / raw)
To: Daniel Gomez
Cc: Michael Kelley (LINUX), Chaitanya Kulkarni, Christoph Hellwig,
Gerd Bayer, Sagi Grimberg, asahi@lists.linux.dev,
linux-nvme@lists.infradead.org
On Fri, Mar 03, 2023 at 05:24:43PM +0100, Daniel Gomez wrote:
> On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Yes, that's correct. But "the device is limited to a lower max_segments
> > value" isn't really because max_segments is limited. The limit is on
> > max_hw_sectors_kb derived from the NVMe controller MDTS value,
> > as you have shown in your table below. Then the max_segments value
> > is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> > is 128 Kbytes, you can never need more than 33 segments. Each segment
> > can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> > Then add 1 segment to handle the case where the memory buffer doesn't
> > start on a page boundary, and you get 33.
>
> I'm trying to understand and clarify this part. So, I found Keith's
> patch [1] where he addresses (and introduces) the same case (quote
> from the patch: 'One additional segment is added to account for a
> transfer that may start in the middle of a page.'). But I'm not sure
> when that case should be considered when we split an I/O. Is that
> handled in bio_split_rw + bvec_split_segs functions?
The function responsible for splitting bios to the request_queue's
limits is __bio_split_to_limits(). This happens before blk-mq forms a
'struct request' for the low-level driver in blk_mq_submit_bio().
> So, for mdts=7 and max_segments being now 128, when can we see and
> test that a transfer starts in the middle of a page and therefore,
> that we need an extra segment for that? In that case, shouldn't the
> split max at 128 - 1 (508 KiB)?
You can ensure you start in the middle of a page by allocating
page-aligned memory, then adding an offset.
The number of segments you actually get from your allocation depends on
how many discontiguous pages the system provisions for your virtual
memory. User space doesn't have direct control over that unless you're
using huge pages. It's entirely possible a 512 KiB buffer has just one
segment, or 129 (assuming 4K pages), or any number in between.
> In my tests [2], I can see splits are made in chunks of 128 in some
> cases. Whether or not these cases are page-aligned I don’t know but
> I'd like to verify that.
>
> [1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
>
> [2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
> count=1 of=/dev/null status=progress; done