From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Adrian Huang <ahuang12@lenovo.com>,
Keith Busch <kbusch@kernel.org>, Jiwei Sun <sunjw10@lenovo.com>,
Christoph Hellwig <hch@lst.de>, Sasha Levin <sashal@kernel.org>,
sagi@grimberg.me, linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 6.3 03/67] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
Date: Thu, 25 May 2023 14:30:40 -0400
Message-ID: <20230525183144.1717540-3-sashal@kernel.org>
In-Reply-To: <20230525183144.1717540-1-sashal@kernel.org>
From: Adrian Huang <ahuang12@lenovo.com>
[ Upstream commit 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd ]
When running the fio test on a 448-core AMD server with an NVMe disk,
either a soft lockup or a hard lockup call trace is shown:
[soft lockup]
watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
...
Call Trace:
<IRQ>
fq_flush_timeout+0x7d/0xd0
? __pfx_fq_flush_timeout+0x10/0x10
call_timer_fn+0x2e/0x150
run_timer_softirq+0x48a/0x560
? __pfx_fq_flush_timeout+0x10/0x10
? clockevents_program_event+0xaf/0x130
__do_softirq+0xf1/0x335
irq_exit_rcu+0x9f/0xd0
sysvec_apic_timer_interrupt+0xb4/0xd0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1f/0x30
...
Obviously, fq_flush_timeout spends over 20 seconds. Here is the ftrace log:
               |  fq_flush_timeout() {
               |    fq_ring_free() {
               |      put_pages_list() {
    0.170 us   |        free_unref_page_list();
    0.810 us   |      }
               |      free_iova_fast() {
               |        free_iova() {
 * 85622.66 us |          _raw_spin_lock_irqsave();
    2.860 us   |          remove_iova();
    0.600 us   |          _raw_spin_unlock_irqrestore();
    0.470 us   |          lock_info_report();
    2.420 us   |          free_iova_mem.part.0();
 * 85638.27 us |        }
 * 85638.84 us |      }
               |      put_pages_list() {
    0.230 us   |        free_unref_page_list();
    0.470 us   |      }
... ...
$ 31017069 us | }
Most of the cores are contending for iova_rbtree_lock because of
the IOVA flush queue mechanism.
[hard lockup]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
Call Trace:
<IRQ>
_raw_spin_lock_irqsave+0x4f/0x60
free_iova+0x27/0xd0
free_iova_fast+0x4d/0x1d0
fq_ring_free+0x9b/0x150
iommu_dma_free_iova+0xb4/0x2e0
__iommu_dma_unmap+0x10b/0x140
iommu_dma_unmap_sg+0x90/0x110
dma_unmap_sg_attrs+0x4a/0x50
nvme_unmap_data+0x5d/0x120 [nvme]
nvme_pci_complete_batch+0x77/0xc0 [nvme]
nvme_irq+0x2ee/0x350 [nvme]
? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
__handle_irq_event_percpu+0x53/0x1a0
handle_irq_event_percpu+0x19/0x60
handle_irq_event+0x3d/0x60
handle_edge_irq+0xb3/0x210
__common_interrupt+0x7f/0x150
common_interrupt+0xc5/0xf0
</IRQ>
<TASK>
asm_common_interrupt+0x2b/0x40
...
ftrace shows that fq_ring_free spends over 10 seconds [1]. Again, most of
the cores are contending for iova_rbtree_lock because of the IOVA flush
queue mechanism.
[Root Cause]
The root cause is that max_hw_sectors_kb of the NVMe disk (mdts=10)
is 4096 KB, and streaming DMA mappings cannot benefit from the
scalable IOVA mechanism introduced by commit 9257b4a206fc
("iommu/iova: introduce per-cpu caching to iova allocation") when
the mapping length is greater than 128 KB.
To fix the lock contention, clamp max_hw_sectors to the DMA optimal
mapping size so that mappings can use the scalable IOVA mechanism.
Note: The issue does not happen with another NVMe disk (mdts = 5
and max_hw_sectors_kb = 128).
[1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4
Suggested-by: Keith Busch <kbusch@kernel.org>
Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7772c0194d5a..a389f1ea0b151 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2960,7 +2960,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * over a single page.
 	 */
 	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
+		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 
 	/*
--
2.39.2