* [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
@ 2023-04-21 8:08 Adrian Huang
2023-04-24 16:37 ` Keith Busch
2023-05-03 16:17 ` Christoph Hellwig
0 siblings, 2 replies; 5+ messages in thread
From: Adrian Huang @ 2023-04-21 8:08 UTC (permalink / raw)
To: linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, iommu,
Adrian Huang, Jiwei Sun, Adrian Huang
From: Adrian Huang <ahuang12@lenovo.com>
When running fio on a 448-core AMD server with an NVMe disk,
a soft lockup or a hard lockup call trace is shown:
[soft lockup]
watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
...
Call Trace:
<IRQ>
fq_flush_timeout+0x7d/0xd0
? __pfx_fq_flush_timeout+0x10/0x10
call_timer_fn+0x2e/0x150
run_timer_softirq+0x48a/0x560
? __pfx_fq_flush_timeout+0x10/0x10
? clockevents_program_event+0xaf/0x130
__do_softirq+0xf1/0x335
irq_exit_rcu+0x9f/0xd0
sysvec_apic_timer_interrupt+0xb4/0xd0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1f/0x30
...
Obviously, fq_flush_timeout spends over 20 seconds. Here is the ftrace log:
| fq_flush_timeout() {
| fq_ring_free() {
| put_pages_list() {
0.170 us | free_unref_page_list();
0.810 us | }
| free_iova_fast() {
| free_iova() {
* 85622.66 us | _raw_spin_lock_irqsave();
2.860 us | remove_iova();
0.600 us | _raw_spin_unlock_irqrestore();
0.470 us | lock_info_report();
2.420 us | free_iova_mem.part.0();
* 85638.27 us | }
* 85638.84 us | }
| put_pages_list() {
0.230 us | free_unref_page_list();
0.470 us | }
... ...
$ 31017069 us | }
Most of the cores contend for the iova_rbtree_lock because of the iova
flush queue mechanism.
[hard lockup]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
Call Trace:
<IRQ>
_raw_spin_lock_irqsave+0x4f/0x60
free_iova+0x27/0xd0
free_iova_fast+0x4d/0x1d0
fq_ring_free+0x9b/0x150
iommu_dma_free_iova+0xb4/0x2e0
__iommu_dma_unmap+0x10b/0x140
iommu_dma_unmap_sg+0x90/0x110
dma_unmap_sg_attrs+0x4a/0x50
nvme_unmap_data+0x5d/0x120 [nvme]
nvme_pci_complete_batch+0x77/0xc0 [nvme]
nvme_irq+0x2ee/0x350 [nvme]
? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
__handle_irq_event_percpu+0x53/0x1a0
handle_irq_event_percpu+0x19/0x60
handle_irq_event+0x3d/0x60
handle_edge_irq+0xb3/0x210
__common_interrupt+0x7f/0x150
common_interrupt+0xc5/0xf0
</IRQ>
<TASK>
asm_common_interrupt+0x2b/0x40
...
ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
the cores contend for the iova_rbtree_lock because of the iova flush
queue mechanism.
[Root Cause]
The root cause is that max_hw_sectors_kb of the NVMe disk (mdts=10)
is 4096 KB, so streaming DMA mappings cannot benefit from the
scalable IOVA mechanism introduced by commit 9257b4a206fc
("iommu/iova: introduce per-cpu caching to iova allocation") when
the mapping length is greater than 128 KB.
To fix the lock contention issue, clamp max_hw_sectors based on the
DMA optimized limit in order to leverage the scalable IOVA mechanism.
Note: The issue does not happen with another NVMe disk (mdts = 5
and max_hw_sectors_kb = 128).
[1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4
Suggested-by: Keith Busch <kbusch@kernel.org>
Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
---
Changes since v1:
- Clamp max_hw_sectors at lower level driver code per Keith's suggestion
drivers/nvme/host/pci.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 282d808400c5..fa351c56d690 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2914,6 +2914,12 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
struct nvme_dev *dev;
int ret = -ENOMEM;
+ /*
+ * Limit the max command size to prevent iod->sg allocations going
+ * over a single page.
+ */
+ size_t max_bytes = NVME_MAX_KB_SZ * 1024;
+
if (node == NUMA_NO_NODE)
set_dev_node(&pdev->dev, first_memory_node);
@@ -2955,12 +2961,10 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
dma_set_max_seg_size(&pdev->dev, 0xffffffff);
- /*
- * Limit the max command size to prevent iod->sg allocations going
- * over a single page.
- */
- dev->ctrl.max_hw_sectors = min_t(u32,
- NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
+ max_bytes = min(max_bytes, dma_max_mapping_size(&pdev->dev));
+ max_bytes = min_not_zero(max_bytes, dma_opt_mapping_size(&pdev->dev));
+ dev->ctrl.max_hw_sectors = max_bytes >> 9;
+
dev->ctrl.max_segments = NVME_MAX_SEGS;
/*
--
2.34.1
* Re: [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
2023-04-21 8:08 [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation Adrian Huang
@ 2023-04-24 16:37 ` Keith Busch
2023-05-03 16:17 ` Christoph Hellwig
1 sibling, 0 replies; 5+ messages in thread
From: Keith Busch @ 2023-04-24 16:37 UTC (permalink / raw)
To: Adrian Huang
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg, iommu,
Jiwei Sun, Adrian Huang
Looks good to me.
Reviewed-by: Keith Busch <kbusch@kernel.org>
* Re: [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
2023-04-21 8:08 [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation Adrian Huang
2023-04-24 16:37 ` Keith Busch
@ 2023-05-03 16:17 ` Christoph Hellwig
2023-05-10 18:22 ` Keith Busch
1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2023-05-03 16:17 UTC (permalink / raw)
To: Adrian Huang
Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, iommu, Jiwei Sun, Adrian Huang
dma_opt_mapping_size falls back to and is bound by dma_max_mapping_size.
So I think we could just do this:
---
From 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd Mon Sep 17 00:00:00 2001
From: Adrian Huang <ahuang12@lenovo.com>
Date: Fri, 21 Apr 2023 16:08:00 +0800
Subject: nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
When running fio on a 448-core AMD server with an NVMe disk,
a soft lockup or a hard lockup call trace is shown:
[soft lockup]
watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
...
Call Trace:
<IRQ>
fq_flush_timeout+0x7d/0xd0
? __pfx_fq_flush_timeout+0x10/0x10
call_timer_fn+0x2e/0x150
run_timer_softirq+0x48a/0x560
? __pfx_fq_flush_timeout+0x10/0x10
? clockevents_program_event+0xaf/0x130
__do_softirq+0xf1/0x335
irq_exit_rcu+0x9f/0xd0
sysvec_apic_timer_interrupt+0xb4/0xd0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1f/0x30
...
Obviously, fq_flush_timeout spends over 20 seconds. Here is the ftrace log:
| fq_flush_timeout() {
| fq_ring_free() {
| put_pages_list() {
0.170 us | free_unref_page_list();
0.810 us | }
| free_iova_fast() {
| free_iova() {
* 85622.66 us | _raw_spin_lock_irqsave();
2.860 us | remove_iova();
0.600 us | _raw_spin_unlock_irqrestore();
0.470 us | lock_info_report();
2.420 us | free_iova_mem.part.0();
* 85638.27 us | }
* 85638.84 us | }
| put_pages_list() {
0.230 us | free_unref_page_list();
0.470 us | }
... ...
$ 31017069 us | }
Most of the cores contend for the iova_rbtree_lock because of the iova
flush queue mechanism.
[hard lockup]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
Call Trace:
<IRQ>
_raw_spin_lock_irqsave+0x4f/0x60
free_iova+0x27/0xd0
free_iova_fast+0x4d/0x1d0
fq_ring_free+0x9b/0x150
iommu_dma_free_iova+0xb4/0x2e0
__iommu_dma_unmap+0x10b/0x140
iommu_dma_unmap_sg+0x90/0x110
dma_unmap_sg_attrs+0x4a/0x50
nvme_unmap_data+0x5d/0x120 [nvme]
nvme_pci_complete_batch+0x77/0xc0 [nvme]
nvme_irq+0x2ee/0x350 [nvme]
? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
__handle_irq_event_percpu+0x53/0x1a0
handle_irq_event_percpu+0x19/0x60
handle_irq_event+0x3d/0x60
handle_edge_irq+0xb3/0x210
__common_interrupt+0x7f/0x150
common_interrupt+0xc5/0xf0
</IRQ>
<TASK>
asm_common_interrupt+0x2b/0x40
...
ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
the cores contend for the iova_rbtree_lock because of the iova flush
queue mechanism.
[Root Cause]
The root cause is that max_hw_sectors_kb of the NVMe disk (mdts=10)
is 4096 KB, so streaming DMA mappings cannot benefit from the
scalable IOVA mechanism introduced by commit 9257b4a206fc
("iommu/iova: introduce per-cpu caching to iova allocation") when
the mapping length is greater than 128 KB.
To fix the lock contention issue, clamp max_hw_sectors based on the
DMA optimized limit in order to leverage the scalable IOVA mechanism.
Note: The issue does not happen with another NVMe disk (mdts = 5
and max_hw_sectors_kb = 128).
[1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4
Suggested-by: Keith Busch <kbusch@kernel.org>
Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 18ca1e3ae07086..922ffe4e28222a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2956,7 +2956,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
* over a single page.
*/
dev->ctrl.max_hw_sectors = min_t(u32,
- NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
+ NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
dev->ctrl.max_segments = NVME_MAX_SEGS;
/*
--
2.39.2
* Re: [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
2023-05-03 16:17 ` Christoph Hellwig
@ 2023-05-10 18:22 ` Keith Busch
2023-05-11 12:06 ` Huang Adrian
0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2023-05-10 18:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Adrian Huang, linux-nvme, Jens Axboe, Sagi Grimberg, iommu,
Jiwei Sun, Adrian Huang
On Wed, May 03, 2023 at 06:17:59PM +0200, Christoph Hellwig wrote:
> dma_opt_mapping_size falls back to and is bound by dma_max_mapping_size.
> So I think we could just do this:
Thanks for discovering this even further simplification. This looks
good to me.
Adrian, are you okay with the suggestion?
> @@ -2956,7 +2956,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> * over a single page.
> */
> dev->ctrl.max_hw_sectors = min_t(u32,
> - NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
> + NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> dev->ctrl.max_segments = NVME_MAX_SEGS;
>
> /*
* Re: [PATCH v2 1/1] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
2023-05-10 18:22 ` Keith Busch
@ 2023-05-11 12:06 ` Huang Adrian
0 siblings, 0 replies; 5+ messages in thread
From: Huang Adrian @ 2023-05-11 12:06 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Sagi Grimberg, iommu,
Jiwei Sun, Adrian Huang
On Thu, May 11, 2023 at 2:22 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, May 03, 2023 at 06:17:59PM +0200, Christoph Hellwig wrote:
> > dma_opt_mapping_size falls back to and is bound by dma_max_mapping_size.
> > So I think we could just do this:
>
> Thanks for discoverying this even futher simplifaction. This looks good
> to me.
>
> Adrian, are you okay with the suggestion?
Yes, this looks good to me. I like the simplicity.
Thanks for Christoph's changes.
> > @@ -2956,7 +2956,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > * over a single page.
> > */
> > dev->ctrl.max_hw_sectors = min_t(u32,
> > - NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
> > + NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> > dev->ctrl.max_segments = NVME_MAX_SEGS;
> >
> > /*