* max_hw_sectors error caused by recent NVMe driver commit
@ 2023-02-11 5:33 Michael Kelley (LINUX)
2023-02-11 9:38 ` Daniel Gomez
2023-02-13 5:59 ` Christoph Hellwig
0 siblings, 2 replies; 11+ messages in thread
From: Michael Kelley (LINUX) @ 2023-02-11 5:33 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
asahi@lists.linux.dev, linux-nvme@lists.infradead.org
Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
appears to have introduced an error in how max_hw_sectors is calculated. The
value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
be used.
In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
been set. dma_addressing_limited() returns "true", and the swiotlb max mapping
size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
Prior to this commit, max_hw_sectors isn't set until after the call to
dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
nvme_reset_work(). max_hw_sectors is correctly determined based on
values reported by the NVMe controller.
I haven't provided a fix because I'm not that familiar with the overall structure
of the code and the intent of the code reorganization. I'm not sure if setting
the DMA mask should be moved earlier, or setting max_hw_sectors should
be moved back to its original location.
Michael
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
@ 2023-02-11  9:38 ` Daniel Gomez
  2023-02-13 16:42   ` Michael Kelley (LINUX)
  2023-02-13  5:59 ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread

From: Daniel Gomez @ 2023-02-11 9:38 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Hi all,

On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> appears to have introduced an error in how max_hw_sectors is calculated. The
> value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> be used.
>
> In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> been set. dma_addressing_limited() returns "true", and the swiotlb max mapping
> size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
>
> Prior to this commit, max_hw_sectors isn't set until after the call to
> dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> nvme_reset_work(). max_hw_sectors is correctly determined based on
> values reported by the NVMe controller.
>
> I haven't provided a fix because I'm not that familiar with the overall structure
> of the code and the intent of the code reorganization. I'm not sure if setting
> the DMA mask should be moved earlier, or setting max_hw_sectors should
> be moved back to its original location.

Yesterday, I ran into the same problem. Below you can find my test case.
I'm trying to get familiar with the code, so any help would be much
appreciated.

I could reproduce it with a simple dd test [1] and btrace. In summary,
multi-page bvec splits are performed in 4 steps/chunks of 63 segments
(63 * 4K / 512 = 504 sectors) and 1 step of 8 segments (8 * 4K / 512 =
32 sectors). But I think it should be just 4 steps of 64 segments, as
that is the max_segments upper limit in my test case. Below you can
find my test case and numbers:

Check:
root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb
max_hw_sectors_kb max_segments max_segment_size optimal_io_size
logical_block_size chunk_sectors
max_sectors_kb:252
max_hw_sectors_kb:252
max_segments:64
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512
chunk_sectors:0

dd test:
dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress

btrace snippet:
259,0  0  23476  1.111774467  751  Q  RS 1923072 + 2048 [dd]
259,0  0  23477  1.111774842  751  X  RS 1923072 / 1923576 [dd]
259,0  0  23478  1.111775342  751  G  RS 1923072 + 504 [dd]
259,0  0  23479  1.111775425  751  I  RS 1923072 + 504 [dd]
259,0  0  23480  1.111776133  751  X  RS 1923576 / 1924080 [dd]
259,0  0  23481  1.111776258  751  G  RS 1923576 + 504 [dd]
259,0  0  23482  1.111776300  751  I  RS 1923576 + 504 [dd]
259,0  0  23483  1.111776675  751  X  RS 1924080 / 1924584 [dd]
259,0  0  23484  1.111776967  751  G  RS 1924080 + 504 [dd]
259,0  0  23485  1.111777008  751  I  RS 1924080 + 504 [dd]
259,0  0  23486  1.111777383  751  X  RS 1924584 / 1925088 [dd]
259,0  0  23487  1.111777467  751  G  RS 1924584 + 504 [dd]
259,0  0  23488  1.111777550  751  I  RS 1924584 + 504 [dd]
259,0  0  23489  1.111777758  751  G  RS 1925088 + 32 [dd]
259,0  0  23490  1.111777800  751  I  RS 1925088 + 32 [dd]
259,0  0  23491  1.111779383   36  D  RS 1923072 + 504 [kworker/0:1H]
259,0  0  23492  1.111780092   36  D  RS 1923576 + 504 [kworker/0:1H]
259,0  0  23493  1.111780800   36  D  RS 1924080 + 504 [kworker/0:1H]
259,0  0  23494  1.111781425   36  D  RS 1924584 + 504 [kworker/0:1H]
259,0  0  23495  1.111781717   36  D  RS 1925088 + 32 [kworker/0:1H]
259,0  0  23496  1.112201967  749  C  RS 1923072 + 504 [0]
259,0  0  23497  1.112563925  749  C  RS 1923072 + 2048 [0]
259,0  0  23498  1.112564425  749  C  RS 1923072 + 2016 [0]
259,0  0  23499  1.112564800  749  C  RS 1923072 + 1512 [0]
259,0  0  23500  1.112758217  749  C  RS 1923072 + 1008 [0]

In addition, going back to Michael's reference commit and running the
test in the same setup, I got the following checks:

max_sectors_kb:512
max_hw_sectors_kb:512
max_segments:127
max_segment_size:4294967295
optimal_io_size:512
logical_block_size:512

So, I'm not sure why my max_segments went from 64 -> 127, and
max_sectors_kb and max_hw_sectors_kb from 252 -> 512. Perhaps my
first assumption was wrong?

[1] test: https://unix.stackexchange.com/questions/529529/why-is-the-size-of-my-io-requests-being-limited-to-about-512k

Daniel

> >
> > Michael
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread
* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  9:38 ` Daniel Gomez
@ 2023-02-13 16:42   ` Michael Kelley (LINUX)
  2023-02-13 16:57     ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread

From: Michael Kelley (LINUX) @ 2023-02-13 16:42 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Daniel Gomez <dagmcr@gmail.com> Sent: Saturday, February 11, 2023 1:39 AM
>
> Hi all,
>
> On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> > appears to have introduced an error in how max_hw_sectors is calculated. The
> > value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> > uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> > be used.
> >
> > In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> > But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> > been set. dma_addressing_limited() returns "true", and the swiotlb max mapping
> > size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
> >
> > Prior to this commit, max_hw_sectors isn't set until after the call to
> > dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> > nvme_reset_work(). max_hw_sectors is correctly determined based on
> > values reported by the NVMe controller.
> >
> > I haven't provided a fix because I'm not that familiar with the overall structure
> > of the code and the intent of the code reorganization. I'm not sure if setting
> > the DMA mask should be moved earlier, or setting max_hw_sectors should
> > be moved back to its original location.
>
> Yesterday, I ran into the same problem. Below you can find my test
> case. I'm trying to get familiar with the code, so any help would be
> much appreciated.
>
> I could reproduce it with a simple dd test [1] and btrace. In summary,
> multi-page bvec splits are performed in 4 steps/chunks of 63 segments
> (63 * 4K / 512 = 504 sectors) and 1 step of 8 segments (8 * 4K / 512 =
> 32 sectors). But I think it should be just 4 steps of 64 segments as
> that is the max_segments upper limit in my test case. Below you can
> find my test case and numbers:

The maximum size of a single I/O transfer is limited by the *smaller* of
two factors: max_segments and max_sectors_kb. In your example below, the
limiting factor is max_sectors_kb. But note that up to 64 segments *could*
be needed for a 252 Kbyte transfer if the starting memory address is in
the middle of a 4 Kbyte page. A segment is needed for the partial page at
the beginning, and another segment is needed for the partial page at the
end.

Also, if the target physical memory has chunks larger than 4 Kbytes that
are contiguous, even fewer than 63 segments would be needed, because each
segment describes a *contiguous* physical memory area using its starting
address and length. But the I/O transfer would still be limited to
252 Kbytes because of max_sectors_kb.

> Check:
> root@qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb
> max_hw_sectors_kb max_segments max_segment_size optimal_io_size
> logical_block_size chunk_sectors
> max_sectors_kb:252
> max_hw_sectors_kb:252
> max_segments:64
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> chunk_sectors:0
>
> dd test:
> dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress
>
> btrace snippet:
> 259,0  0  23476  1.111774467  751  Q  RS 1923072 + 2048 [dd]
> 259,0  0  23477  1.111774842  751  X  RS 1923072 / 1923576 [dd]
> 259,0  0  23478  1.111775342  751  G  RS 1923072 + 504 [dd]
> 259,0  0  23479  1.111775425  751  I  RS 1923072 + 504 [dd]
> 259,0  0  23480  1.111776133  751  X  RS 1923576 / 1924080 [dd]
> 259,0  0  23481  1.111776258  751  G  RS 1923576 + 504 [dd]
> 259,0  0  23482  1.111776300  751  I  RS 1923576 + 504 [dd]
> 259,0  0  23483  1.111776675  751  X  RS 1924080 / 1924584 [dd]
> 259,0  0  23484  1.111776967  751  G  RS 1924080 + 504 [dd]
> 259,0  0  23485  1.111777008  751  I  RS 1924080 + 504 [dd]
> 259,0  0  23486  1.111777383  751  X  RS 1924584 / 1925088 [dd]
> 259,0  0  23487  1.111777467  751  G  RS 1924584 + 504 [dd]
> 259,0  0  23488  1.111777550  751  I  RS 1924584 + 504 [dd]
> 259,0  0  23489  1.111777758  751  G  RS 1925088 + 32 [dd]
> 259,0  0  23490  1.111777800  751  I  RS 1925088 + 32 [dd]
> 259,0  0  23491  1.111779383   36  D  RS 1923072 + 504 [kworker/0:1H]
> 259,0  0  23492  1.111780092   36  D  RS 1923576 + 504 [kworker/0:1H]
> 259,0  0  23493  1.111780800   36  D  RS 1924080 + 504 [kworker/0:1H]
> 259,0  0  23494  1.111781425   36  D  RS 1924584 + 504 [kworker/0:1H]
> 259,0  0  23495  1.111781717   36  D  RS 1925088 + 32 [kworker/0:1H]
> 259,0  0  23496  1.112201967  749  C  RS 1923072 + 504 [0]
> 259,0  0  23497  1.112563925  749  C  RS 1923072 + 2048 [0]
> 259,0  0  23498  1.112564425  749  C  RS 1923072 + 2016 [0]
> 259,0  0  23499  1.112564800  749  C  RS 1923072 + 1512 [0]
> 259,0  0  23500  1.112758217  749  C  RS 1923072 + 1008 [0]
>
> In addition, going back to Michael's reference commit and running the
> test in the same setup, I got the following checks:
>
> max_sectors_kb:512
> max_hw_sectors_kb:512
> max_segments:127
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
>
> So, I'm not sure why my max_segments went from 64 -> 127, and
> max_sectors_kb and max_hw_sectors_kb from 252 -> 512. Perhaps my
> first assumption was wrong?

The bug I pointed out caused max_hw_sectors to be set incorrectly. Going
back to the referenced commit reverted the bug and allowed the correct
value to be set. In your example, I would guess the value of 512 Kbytes
came from querying the NVMe device for its max transfer size. Ideally, to
support 512 Kbyte transfers, you would want 129 segments (to allow for
starting in the middle of a page, as described above). But the value of
max_segments is limited by the NVMe driver itself using the value of
NVME_MAX_SEGS defined in drivers/nvme/host/pci.c. The value of 127 is
chosen to make sure that the data structure containing the scatterlist
fits in a single page. See nvme_pci_alloc_iod_mempool().

Hopefully this information helps ...

Michael

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13 16:42   ` Michael Kelley (LINUX)
@ 2023-02-13 16:57     ` Keith Busch
  2023-02-17 13:28       ` Daniel Gomez
  0 siblings, 1 reply; 11+ messages in thread

From: Keith Busch @ 2023-02-13 16:57 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Daniel Gomez, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> Ideally, to support
> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> the middle of a page as described above). But the value of max_segments
> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> sure that the data structure containing the scatterlist fits in a single page.
> See nvme_pci_alloc_iod_mempool().

Should be 128 possible segments now in -next, but yeah, 129 would be
ideal.

The limit confuses many because user space can sometimes get 512 KiB IO
to work and other times the same program fails, all because of physical
memory contiguity that user space isn't always aware of. A sure-fire way
to never hit that limit is to allocate hugepages.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13 16:57     ` Keith Busch
@ 2023-02-17 13:28       ` Daniel Gomez
  2023-02-17 16:05         ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 11+ messages in thread

From: Daniel Gomez @ 2023-02-17 13:28 UTC (permalink / raw)
To: Keith Busch
Cc: Michael Kelley (LINUX), Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Gerd Bayer, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 5:57 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> > Ideally, to support
> > 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > the middle of a page as described above). But the value of max_segments
> > is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> > defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > sure that the data structure containing the scatterlist fits in a single page.
> > See nvme_pci_alloc_iod_mempool().

Michael,

Thanks for clarifying. It really helps. Also, it's fixed after applying
Christoph's patch [1].

[1] https://lore.kernel.org/all/20230213072035.288225-1-hch@lst.de/

>> value to be set. In your example, I would guess the value of 512 Kbytes came
>> from querying the NVMe device for its max transfer size. Ideally, to support
>> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
>> the middle of a page as described above). But the value of max_segments
>> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.
>
> Should be 128 possible segments now in -next, but yeah, 129 would be ideal.

Quoting Michael,

>> the middle of a page as described above). But the value of max_segments
>> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.

Yes, I can see that. I guess the 129 needs to be reduced to 127 (or,
after Keith's optimization patch, to 128), but not when the device is
limited to a lower max_segments value, because then they fit anyway in a
single page?

Following the kernel code, I can see that max_hw_sectors_kb is
calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is
2 ^ (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
change the MDTS, I get the following:

MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
------ ---------------- ------------------- --------------
0      1024             512                 127
1      16               8                   3
2      32               16                  5
3      64               32                  9
4      128              64                  17
5      256              128                 33
6      512              256                 65
7      1024             512                 127

>
> The limit confuses many because user space can sometimes get 512 KiB
> IO to work and other times the same program fails, all because of physical
> memory contiguity that user space isn't always aware of. A sure-fire way to
> never hit that limit is to allocate hugepages.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-17 13:28       ` Daniel Gomez
@ 2023-02-17 16:05         ` Michael Kelley (LINUX)
  2023-03-03 16:24           ` Daniel Gomez
  0 siblings, 1 reply; 11+ messages in thread

From: Michael Kelley (LINUX) @ 2023-02-17 16:05 UTC (permalink / raw)
To: Daniel Gomez, Keith Busch
Cc: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
	asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
>
> >> value to be set. In your example, I would guess the value of 512 Kbytes came
> >> from querying the NVMe device for its max transfer size. Ideally, to support
> >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> >> the middle of a page as described above). But the value of max_segments
> >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
>
> Quoting Michael,
>
> >> the middle of a page as described above). But the value of max_segments
> >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
>
> Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> after Keith's optimization patch to 128) but not when the device is
> limited to a lower max_segments value because they fit anyway in a
> single page?

Yes, that's correct. But "the device is limited to a lower max_segments
value" isn't really because max_segments is limited. The limit is on
max_hw_sectors_kb, derived from the NVMe controller MDTS value, as you
have shown in your table below. Then the max_segments value is derived
from max_hw_sectors_kb. For example, if max_hw_sectors_kb is 128 Kbytes,
you can never need more than 33 segments. Each segment can describe
4 Kbytes (a page), so with 128 Kbytes you get 32 segments. Then add 1
segment to handle the case where the memory buffer doesn't start on a
page boundary, and you get 33. I'm making a subtle distinction here
between "max_segments is limited" and "you can't need more than XX
segments for a given max_hw_sectors_kb value".

Michael

> Following the kernel code, I can see that max_hw_sectors_kb is
> calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is
> 2 ^ (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> change the MDTS, I get the following:
>
> MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> ------ ---------------- ------------------- --------------
> 0      1024             512                 127
> 1      16               8                   3
> 2      32               16                  5
> 3      64               32                  9
> 4      128              64                  17
> 5      256              128                 33
> 6      512              256                 65
> 7      1024             512                 127
>
> > The limit confuses many because user space can sometimes get 512 KiB
> > IO to work and other times the same program fails, all because of physical
> > memory contiguity that user space isn't always aware of. A sure-fire way to
> > never hit that limit is to allocate hugepages.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-17 16:05         ` Michael Kelley (LINUX)
@ 2023-03-03 16:24           ` Daniel Gomez
  2023-03-03 16:44             ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread

From: Daniel Gomez @ 2023-03-03 16:24 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Chaitanya Kulkarni, Christoph Hellwig, Gerd Bayer, Keith Busch,
	Sagi Grimberg, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Hi,

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> Yes, that's correct. But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited. The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below. Then the max_segments value
> is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments. Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.

I'm trying to understand and clarify this part. So, I found Keith's
patch [1] where he addresses (and introduces) the same case (quote from
the patch: 'One additional segment is added to account for a transfer
that may start in the middle of a page.'). But I'm not sure when that
case should be considered when we split an I/O. Is that handled in the
bio_split_rw + bvec_split_segs functions?

So, for mdts=7 and max_segments being now 128, when can we see and test
that a transfer starts in the middle of a page and that we therefore
need an extra segment for it? In that case, shouldn't the split max out
at 128 - 1 (508 KiB)?

In my tests [2], I can see splits are made in chunks of 128 in some
cases. Whether or not these cases are page-aligned I don't know, but I'd
like to verify that.

[1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/

[2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
count=1 of=/dev/null status=progress; done

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr@gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set. In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as described above). But the value of max_segments
> > >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> > >
> > > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as described above). But the value of max_segments
> > >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> > after Keith's optimization patch to 128) but not when the device is
> > limited to a lower max_segments value because they fit anyway in a
> > single page?
>
> Yes, that's correct. But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited. The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below. Then the max_segments value
> is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments. Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33. I'm making a subtle distinction
> here between "max_segments is limited" and "you can't need more than
> XX segments for a given max_hw_sectors_kb value".
>
> Michael
>
> > Following the kernel code, I can see that max_hw_sectors_kb is
> > calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> > max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is
> > 2 ^ (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> > change the MDTS, I get the following:
> >
> > MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> > ------ ---------------- ------------------- --------------
> > 0      1024             512                 127
> > 1      16               8                   3
> > 2      32               16                  5
> > 3      64               32                  9
> > 4      128              64                  17
> > 5      256              128                 33
> > 6      512              256                 65
> > 7      1024             512                 127
> >
> > > The limit confuses many because user space can sometimes get 512 KiB
> > > IO to work and other times the same program fails, all because of physical
> > > memory contiguity that user space isn't always aware of. A sure-fire way to
> > > never hit that limit is to allocate hugepages.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-03-03 16:24           ` Daniel Gomez
@ 2023-03-03 16:44             ` Keith Busch
  0 siblings, 0 replies; 11+ messages in thread

From: Keith Busch @ 2023-03-03 16:44 UTC (permalink / raw)
To: Daniel Gomez
Cc: Michael Kelley (LINUX), Chaitanya Kulkarni, Christoph Hellwig,
	Gerd Bayer, Sagi Grimberg, asahi@lists.linux.dev,
	linux-nvme@lists.infradead.org

On Fri, Mar 03, 2023 at 05:24:43PM +0100, Daniel Gomez wrote:
> On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > Yes, that's correct. But "the device is limited to a lower max_segments
> > value" isn't really because max_segments is limited. The limit is on
> > max_hw_sectors_kb derived from the NVMe controller MDTS value,
> > as you have shown in your table below. Then the max_segments value
> > is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> > is 128 Kbytes, you can never need more than 33 segments. Each segment
> > can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> > Then add 1 segment to handle the case where the memory buffer doesn't
> > start on a page boundary, and you get 33.
>
> I'm trying to understand and clarify this part. So, I found Keith's
> patch [1] where he addresses (and introduces) the same case (quote
> from the patch: 'One additional segment is added to account for a
> transfer that may start in the middle of a page.'). But I'm not sure
> when that case should be considered when we split an I/O. Is that
> handled in bio_split_rw + bvec_split_segs functions?

The function responsible for splitting bios to the request_queue's
limits is __bio_split_to_limits(). This happens before blk-mq forms a
'struct request' for the low-level driver in blk_mq_submit_bio().

> So, for mdts=7 and max_segments being now 128, when can we see and
> test that a transfer starts in the middle of a page and therefore,
> that we need an extra segment for that? In that case, shouldn't the
> split max out at 128 - 1 (508 KiB)?

You can ensure you start in the middle of a page by allocating
page-aligned memory, then adding an offset. The number of segments you
actually get from your allocation depends on how many discontiguous
pages the system uses to back your virtual memory. User space doesn't
have direct control over that unless you're using hugepages. It's
entirely possible a 512 KiB buffer has just one segment, or 129
(assuming 4k pages), or any number in between.

> In my tests [2], I can see splits are made in chunks of 128 in some
> cases. Whether or not these cases are page-aligned I don't know, but
> I'd like to verify that.
>
> [1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
>
> [2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
> count=1 of=/dev/null status=progress; done

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
  2023-02-11  9:38 ` Daniel Gomez
@ 2023-02-13  5:59 ` Christoph Hellwig
  2023-02-13  6:41   ` Michael Kelley (LINUX)
  1 sibling, 1 reply; 11+ messages in thread

From: Christoph Hellwig @ 2023-02-13 5:59 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

Hi Michael,

does this fix the problem for you?

commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Feb 13 06:58:33 2023 +0100

    nvme dma_mask

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c734934c407ccf..cb068f92bd597e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int dma_address_bits = 64;
 
 	if (pci_enable_device_mem(pdev))
 		return result;
 
 	pci_set_master(pdev);
 
-	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
-		dma_address_bits = 48;
-	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
-		goto disable;
-
 	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
 		result = -ENODEV;
 		goto disable;
@@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 			quirks);
 	if (ret)
 		goto out_put_device;
-
+
+	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
+	else
+		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));
 	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
 	dma_set_max_seg_size(&pdev->dev, 0xffffffff);

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* RE: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13  5:59 ` Christoph Hellwig
@ 2023-02-13  6:41   ` Michael Kelley (LINUX)
  2023-02-13  6:42     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread

From: Michael Kelley (LINUX) @ 2023-02-13 6:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Gerd Bayer,
	asahi@lists.linux.dev, linux-nvme@lists.infradead.org

From: Christoph Hellwig <hch@lst.de> Sent: Sunday, February 12, 2023 10:00 PM
>
> Hi Michael,
>
> does this fix the problem for you?

Yes, basic smoke test works. The underlying NVMe device capabilities now
show through instead of being replaced by swiotlb values. I did not do
extensive functional testing. See one code comment below.

Michael

>
> commit 0467e96bafd2e84b67ba5c122bbbbac5e3f267e9
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Mon Feb 13 06:58:33 2023 +0100
>
>     nvme dma_mask
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c734934c407ccf..cb068f92bd597e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2509,18 +2509,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  {
>  	int result = -ENOMEM;
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
> -	int dma_address_bits = 64;
>
>  	if (pci_enable_device_mem(pdev))
>  		return result;
>
>  	pci_set_master(pdev);
>
> -	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> -		dma_address_bits = 48;
> -	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
> -		goto disable;
> -
>  	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
>  		result = -ENODEV;
>  		goto disable;
> @@ -2998,7 +2992,11 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>  			quirks);
>  	if (ret)
>  		goto out_put_device;
> -
> +
> +	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
> +		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(48));
> +	else
> +		dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64));

Any reason to use dev->dev instead of &pdev->dev? They should be the
same, but with everything else in this function using &pdev->dev, the
lack of consistency suggests there might be a reason.

>  	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
>  	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: max_hw_sectors error caused by recent NVMe driver commit
  2023-02-13  6:41   ` Michael Kelley (LINUX)
@ 2023-02-13  6:42     ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread

From: Christoph Hellwig @ 2023-02-13 6:42 UTC (permalink / raw)
To: Michael Kelley (LINUX)
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Gerd Bayer, asahi@lists.linux.dev, linux-nvme@lists.infradead.org

On Mon, Feb 13, 2023 at 06:41:12AM +0000, Michael Kelley (LINUX) wrote:
> Any reason to use dev->dev instead of &pdev->dev? They should be
> the same, but with everything else in this function using &pdev->dev,
> the lack of consistency suggests there might be a reason.

No good reason - the code started with copy and paste from the previous
location, and I can fix this up.

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2023-03-03 16:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-11  5:33 max_hw_sectors error caused by recent NVMe driver commit Michael Kelley (LINUX)
2023-02-11  9:38 ` Daniel Gomez
2023-02-13 16:42   ` Michael Kelley (LINUX)
2023-02-13 16:57     ` Keith Busch
2023-02-17 13:28       ` Daniel Gomez
2023-02-17 16:05         ` Michael Kelley (LINUX)
2023-03-03 16:24           ` Daniel Gomez
2023-03-03 16:44             ` Keith Busch
2023-02-13  5:59 ` Christoph Hellwig
2023-02-13  6:41   ` Michael Kelley (LINUX)
2023-02-13  6:42     ` Christoph Hellwig