* next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
@ 2025-06-30  7:50 ` Ben Copeland
  2025-06-30 13:33   ` Christoph Hellwig
  2025-07-03  5:54   ` Kanchan Joshi
  0 siblings, 2 replies; 15+ messages in thread
From: Ben Copeland @ 2025-06-30 7:50 UTC (permalink / raw)
  To: linux-kernel, lkft-triage, regressions, linux-nvme
  Cc: Dan Carpenter, kbusch, axboe, hch, sagi

A regression in the NVMe PCI driver triggers IOMMU DMA warnings during I/O
completion on ARM64 systems. The warnings come from
drivers/iommu/dma-iommu.c during DMA unmapping in NVMe completion
processing, and they appear consistently during normal NVMe I/O.

- CONFIG_ARM64_64K_PAGES=y

Test Environment:
* Ampere Altra - AmpereOne

Regression Analysis:
- New regression? Yes
- Reproducibility? Yes

First seen on next-20250627
Good: next-20250626
Bad: next-20250627

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

## Test log

[   32.699872] WARNING: drivers/iommu/dma-iommu.c:1232 at iommu_dma_unmap_page+0xc4/0xe8, CPU#13: swapper/13/0
[   32.714204] Modules linked in: cdc_ether usbnet sm3_ce nvme sha3_ce nvme_core xhci_pci_renesas arm_cspmu_module arm_spe_pmu ipmi_devintf arm_cmn ipmi_msghandler cppc_cpufreq fuse drm backlight ip_tables x_tables
[   32.732967] CPU: 13 UID: 0 PID: 0 Comm: swapper/13 Tainted: G        W           6.16.0-rc3-next-20250627 #1 PREEMPT
[   32.743562] Tainted: [W]=WARN
[   32.749381] Hardware name: Inspur NF5280R7/Mitchell MB, BIOS 04.04.00004001 2025-02-04 22:23:30 02/04/2025
[   32.759020] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   32.768746] pc : iommu_dma_unmap_page+0xc4/0xe8
[   32.776040] lr : iommu_dma_unmap_page+0x40/0xe8
[   32.780559] sp : ffff8000801afde0
[   32.783861] x29: ffff8000801afde0 x28: 0000000000000005 x27: fff00001d7d230f0
[   32.790983] x26: 0000000000000000 x25: fff00001003da0c8 x24: 0000000000000002
[   32.798106] x23: 0000000000000000 x22: 0000000000001000 x21: 00000000feed5000
[   32.805229] x20: fff00001003da0c8 x19: fff00001d7d23000 x18: 0000000000080000
[   32.812352] x17: 0000000000000040 x16: ffffae5c7594ea68 x15: 0000000000000000
[   32.819474] x14: 000000000007ffff x13: 0000000000000001 x12: 000000000000002c
[   32.826597] x11: 00000000000fffff x10: ffffffffffffffff x9 : ffffae5c76358e60
[   32.833719] x8 : ffff8000801afd68 x7 : ffffae5c76358a78 x6 : 00000000feed5001
[   32.840842] x5 : 000000000000000d x4 : ffffae5c76358a78 x3 : 0000000000000000
[   32.847965] x2 : 0000000000000000 x1 : fff00001027e54c0 x0 : 0000000000000000
[   32.855088] Call trace:
[   32.857521]  iommu_dma_unmap_page+0xc4/0xe8 (P)
[   32.862039]  dma_unmap_page_attrs+0x238/0x2a0
[   32.866385]  nvme_unmap_data+0x1dc/0x278 [nvme]
[   32.870904]  nvme_pci_complete_rq+0xb8/0x118 [nvme]
[   32.878632]  blk_complete_reqs+0x5c/0x80
[   32.885320]  blk_done_softirq+0x28/0x40
[   32.892006]  handle_softirqs+0x16c/0x408
[   32.896436]  __do_softirq+0x1c/0x28
[   32.899912]  ____do_softirq+0x18/0x30
[   32.903561]  call_on_irq_stack+0x24/0x30
[   32.907472]  do_softirq_own_stack+0x24/0x38
[   32.911642]  __irq_exit_rcu+0xd4/0x118
[   32.915378]  irq_exit_rcu+0x18/0x48
[   32.918854]  el1_interrupt+0x38/0x68
[   32.922418]  el1h_64_irq_handler+0x18/0x28
[   32.926502]  el1h_64_irq+0x6c/0x70
[   32.929891]  cpuidle_enter_state+0xcc/0x4c0 (P)
[   32.934410]  cpuidle_enter+0x40/0x60
[   32.937972]  do_idle+0x204/0x288
[   32.941188]  cpu_startup_entry+0x3c/0x50
[   32.945098]  secondary_start_kernel+0x140/0x168
[   32.949617]  __secondary_switched+0xc0/0xc8
[   32.953788] ---[ end trace 0000000000000000 ]---

## Bisection

git bisect start
# bad: [06cae0e3f61c4c1ef18726b817bbb88c29f81e57] nvme-pci: merge the simple PRP and SGL setup into a common helper
git bisect bad 06cae0e3f61c4c1ef18726b817bbb88c29f81e57
# good: [d6c12c69ef4fa33e32ceda4a53991ace01401cd9] block: add scatterlist-less DMA mapping helpers
git bisect good d6c12c69ef4fa33e32ceda4a53991ace01401cd9
# good: [07c81cbf438b769e0d673be3b5c021a424a4dc6f] nvme-pci: refactor nvme_pci_use_sgls
git bisect good 07c81cbf438b769e0d673be3b5c021a424a4dc6f
# first bad commit: [06cae0e3f61c4c1ef18726b817bbb88c29f81e57] nvme-pci: merge the simple PRP and SGL setup into a common helper

## Source

* Kernel version: 6.16.0-rc3-next-20250627
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 2aeda9592360c200085898a258c4754bfe879921
* Git describe: 6.16.0-rc3-next-20250627
* Project details: https://qa-reports.linaro.org/lkft/linux-next-master-ampere/build/next-20250627/
* Architectures: arm64
* Toolchains: gcc-13
* Kconfigs: gcc-13-lkftconfig-64k_page_size

## Build
arm64
* Test LAVA log: https://lkft-staging.validation.linaro.org/scheduler/job/180925#L7760
* Build link: https://storage.tuxsuite.com/public/ampere/ci/builds/2z6SDAcW8aM7PBWZ8oJ82QlfXvH/
* Kernel config: https://storage.tuxsuite.com/public/ampere/ci/builds/2z6SDAcW8aM7PBWZ8oJ82QlfXvH/config

## Steps to reproduce:
# https://storage.tuxsuite.com/public/ampere/ci/builds/2z6SDAcW8aM7PBWZ8oJ82QlfXvH/tuxmake_reproducer.sh

Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30  7:50 ` next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c Ben Copeland
@ 2025-06-30 13:33 ` Christoph Hellwig
  2025-06-30 19:51   ` Ben Copeland
  2025-06-30 20:25   ` Keith Busch
  2025-07-03  5:54 ` Kanchan Joshi
  1 sibling, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-30 13:33 UTC (permalink / raw)
  To: Ben Copeland
  Cc: linux-kernel, lkft-triage, regressions, linux-nvme, Dan Carpenter,
	kbusch, axboe, hch, sagi, iommu

Hi Ben,

> [   32.857521]  iommu_dma_unmap_page+0xc4/0xe8 (P)

Can you resolve this to a source location for me? I.e.:

    gdb vmlinux
    l *(iommu_dma_unmap_page+0xc4)

Also, what IOMMU driver is this device using? It looks like it
might not support a 4k IOMMU page size.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 13:33 ` Christoph Hellwig
@ 2025-06-30 19:51 ` Ben Copeland
  2025-07-01 19:43   ` Keith Busch
  2025-06-30 20:25 ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Copeland @ 2025-06-30 19:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, lkft-triage, regressions, linux-nvme, Dan Carpenter,
	kbusch, axboe, sagi, iommu

Hello Christoph,

On Mon, 30 Jun 2025 at 14:33, Christoph Hellwig <hch@lst.de> wrote:
>
> Hi Ben,
>
> > [   32.857521]  iommu_dma_unmap_page+0xc4/0xe8 (P)
>
> Can you resolve this to a source location for me? I.e.:
>
> gdb vmlinux
>
> l *(iommu_dma_unmap_page+0xc4)

Sure, here's the kernel stack trace.

[   32.699872] WARNING: drivers/iommu/dma-iommu.c:1232 at iommu_dma_unmap_page+0xc4/0xe8, CPU#13: swapper/13/0
[   32.714204] Modules linked in: cdc_ether usbnet sm3_ce nvme sha3_ce nvme_core xhci_pci_renesas arm_cspmu_module arm_spe_pmu ipmi_devintf arm_cmn ipmi_msghandler cppc_cpufreq fuse drm backlight ip_tables x_tables
[   32.732967] CPU: 13 UID: 0 PID: 0 Comm: swapper/13 Tainted: G        W           6.16.0-rc3-next-20250627 #1 PREEMPT
[   32.743562] Tainted: [W]=WARN
[   32.749381] Hardware name: Inspur NF5280R7/Mitchell MB, BIOS 04.04.00004001 2025-02-04 22:23:30 02/04/2025
[   32.759020] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   32.768746] pc : iommu_dma_unmap_page (/builds/linux/drivers/iommu/dma-iommu.c:1232 (discriminator 1))
[   32.776040] lr : iommu_dma_unmap_page (/builds/linux/drivers/iommu/dma-iommu.c:1232 (discriminator 1))
[   32.780559] sp : ffff8000801afde0
[   32.783861] x29: ffff8000801afde0 x28: 0000000000000005 x27: fff00001d7d230f0
[   32.790983] x26: 0000000000000000 x25: fff00001003da0c8 x24: 0000000000000002
[   32.798106] x23: 0000000000000000 x22: 0000000000001000 x21: 00000000feed5000
[   32.805229] x20: fff00001003da0c8 x19: fff00001d7d23000 x18: 0000000000080000
[   32.812352] x17: 0000000000000040 x16: ffffae5c7594ea68 x15: 0000000000000000
[   32.819474] x14: 000000000007ffff x13: 0000000000000001 x12: 000000000000002c
[   32.826597] x11: 00000000000fffff x10: ffffffffffffffff x9 : ffffae5c76358e60
[   32.833719] x8 : ffff8000801afd68 x7 : ffffae5c76358a78 x6 : 00000000feed5001
[   32.840842] x5 : 000000000000000d x4 : ffffae5c76358a78 x3 : 0000000000000000
[   32.847965] x2 : 0000000000000000 x1 : fff00001027e54c0 x0 : 0000000000000000
[   32.855088] Call trace:
[   32.857521]  iommu_dma_unmap_page (/builds/linux/drivers/iommu/dma-iommu.c:1232 (discriminator 1)) (P)
[   32.862039]  dma_unmap_page_attrs (/builds/linux/kernel/dma/mapping.c:193)
[   32.866385]  nvme_unmap_data (/home/ben/linux/linux/drivers/nvme/host/nvme.h:788 /home/ben/linux/linux/drivers/nvme/host/pci.c:1077) nvme
[   32.870904]  nvme_pci_complete_rq (/home/ben/linux/linux/drivers/nvme/host/pci.c:1051 /home/ben/linux/linux/drivers/nvme/host/pci.c:1063 /home/ben/linux/linux/drivers/nvme/host/pci.c:1071) nvme
[   32.878632]  blk_complete_reqs (/builds/linux/block/blk-mq.c:1223 (discriminator 1))
[   32.885320]  blk_done_softirq (/builds/linux/block/blk-mq.c:1230)
[   32.892006]  handle_softirqs (/builds/linux/arch/arm64/include/asm/jump_label.h:36 /builds/linux/include/trace/events/irq.h:142 /builds/linux/kernel/softirq.c:580)
[   32.896436]  __do_softirq (/builds/linux/kernel/softirq.c:614)
[   32.899912]  ____do_softirq (/builds/linux/arch/arm64/kernel/irq.c:82)
[   32.903561]  call_on_irq_stack (/builds/linux/arch/arm64/kernel/entry.S:897)
[   32.907472]  do_softirq_own_stack (/builds/linux/arch/arm64/kernel/irq.c:87)
[   32.911642]  __irq_exit_rcu (/builds/linux/kernel/softirq.c:460 /builds/linux/kernel/softirq.c:680)
[   32.915378]  irq_exit_rcu (/builds/linux/kernel/softirq.c:698 (discriminator 1))
[   32.918854]  el1_interrupt (/builds/linux/arch/arm64/include/asm/current.h:19 /builds/linux/arch/arm64/kernel/entry-common.c:280 /builds/linux/arch/arm64/kernel/entry-common.c:586 /builds/linux/arch/arm64/kernel/entry-common.c:598)
[   32.922418]  el1h_64_irq_handler (/builds/linux/arch/arm64/kernel/entry-common.c:604)
[   32.926502]  el1h_64_irq (/builds/linux/arch/arm64/kernel/entry.S:596)
[   32.929891]  cpuidle_enter_state (/builds/linux/drivers/cpuidle/cpuidle.c:292) (P)
[   32.934410]  cpuidle_enter (/builds/linux/drivers/cpuidle/cpuidle.c:391 (discriminator 2))
[   32.937972]  do_idle (/builds/linux/kernel/sched/idle.c:160 /builds/linux/kernel/sched/idle.c:235 /builds/linux/kernel/sched/idle.c:330)
[   32.941188]  cpu_startup_entry (/builds/linux/kernel/sched/idle.c:428 (discriminator 1))
[   32.945098]  secondary_start_kernel (/builds/linux/arch/arm64/include/asm/atomic_ll_sc.h:95 (discriminator 2) /builds/linux/arch/arm64/include/asm/atomic.h:28 (discriminator 2) /builds/linux/include/linux/atomic/atomic-arch-fallback.h:546 (discriminator 2) /builds/linux/include/linux/atomic/atomic-arch-fallback.h:994 (discriminator 2) /builds/linux/include/linux/atomic/atomic-instrumented.h:436 (discriminator 2) /builds/linux/include/linux/sched/mm.h:37 (discriminator 2) /builds/linux/arch/arm64/kernel/smp.c:214 (discriminator 2))
[   32.949617]  __secondary_switched (/builds/linux/arch/arm64/kernel/head.S:405)
[   32.953788] ---[ end trace 0000000000000000 ]---

>
> Also, what IOMMU driver is this device using? It looks like it
> might not support a 4k IOMMU page size.
>

From the boot log, I can see:

[    1.083447] arm-smmu-v3 arm-smmu-v3.16.auto: option mask 0x0
[    1.083460] arm-smmu-v3 arm-smmu-v3.16.auto: IDR0.HTTU features(0x600000) overridden by FW configuration (0x0)
[    1.083463] arm-smmu-v3 arm-smmu-v3.16.auto: ias 48-bit, oas 48-bit (features 0x0094dfef)

Let me know if there is anything else.

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 19:51 ` Ben Copeland
@ 2025-07-01 19:43 ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2025-07-01 19:43 UTC (permalink / raw)
  To: Ben Copeland
  Cc: Christoph Hellwig, linux-kernel, lkft-triage, regressions,
	linux-nvme, Dan Carpenter, axboe, sagi, iommu

On Mon, Jun 30, 2025 at 08:51:01PM +0100, Ben Copeland wrote:
>
> [    1.083447] arm-smmu-v3 arm-smmu-v3.16.auto: option mask 0x0
> [    1.083460] arm-smmu-v3 arm-smmu-v3.16.auto: IDR0.HTTU features(0x600000) overridden by FW configuration (0x0)
> [    1.083463] arm-smmu-v3 arm-smmu-v3.16.auto: ias 48-bit, oas 48-bit (features 0x0094dfef)

Neat, I have a machine with the same. The iommu granularity appears to be
able to take on various values, so let's see what the iommu group's
pgsize_bitmap is on mine with some bpf and drgn magic:

  # bpftrace -e 'kfunc:iommu_group_show_type { printf("iommu_group: 0x%lx\n", args->group); }' &
  # cat /sys/kernel/iommu_groups/0/type
  iommu_group: 0xffff0000b54d0600

  # drgn
  >>> hex(Object(prog, "struct iommu_group *", 0xffff0000b54d0600).default_domain.pgsize_bitmap)
  '0x20010000'

So 64k is the minimum iommu granularity in its current configuration.
Definitely too big to guarantee coalescing nvme segments, so if yours is
similarly configured, that explains why it takes the path you're taking.

^ permalink raw reply	[flat|nested] 15+ messages in thread
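[Editor's note: pgsize_bitmap is a bitmask with one bit set per page size the IOMMU domain can map, so the minimum granule is its lowest set bit; 0x20010000 decodes to 64K | 512M. A minimal userspace sketch of that decoding — illustrative only, not kernel code, and the helper name is invented:

```c
#include <assert.h>

/* Hypothetical helper: each set bit in pgsize_bitmap is a supported
 * IOMMU page size; the smallest mappable granule is the lowest set bit. */
static unsigned long min_iommu_granule(unsigned long pgsize_bitmap)
{
        /* x & -x isolates the lowest set bit (two's complement trick) */
        return pgsize_bitmap & -pgsize_bitmap;
}
```

Applied to the value above, 0x20010000 yields 0x10000, i.e. a 64K minimum granule — larger than NVMe's 4K controller page, which is why the segments cannot be coalesced here.]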
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 13:33 ` Christoph Hellwig
  2025-06-30 19:51   ` Ben Copeland
@ 2025-06-30 20:25 ` Keith Busch
  2025-07-01  0:54   ` Keith Busch
  ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Keith Busch @ 2025-06-30 20:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben Copeland, linux-kernel, lkft-triage, regressions, linux-nvme,
	Dan Carpenter, axboe, sagi, iommu

On Mon, Jun 30, 2025 at 03:33:43PM +0200, Christoph Hellwig wrote:
> Hi Ben,
>
> > [   32.857521]  iommu_dma_unmap_page+0xc4/0xe8 (P)
>
> Can you resolve this to a source location for me? I.e.:
>
> gdb vmlinux
>
> l *(iommu_dma_unmap_page+0xc4)
>
> Also, what IOMMU driver is this device using? It looks like it
> might not support a 4k IOMMU page size.

I think the PRP handling is broken. At the very least, handling the last
element is wrong if it appears at the end of the list, so I think we
need something like this:

---
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -701,7 +701,7 @@ static void nvme_free_prps(struct request *req)
 	prp_list = iod->descriptors[desc];
 	do {
 		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
-		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+		if (i == NVME_CTRL_PAGE_SIZE >> 3 && length > NVME_CTRL_PAGE_SIZE) {
 			prp_list = iod->descriptors[++desc];
 			i = 0;
 		}
--

But even that, the PRP setup doesn't match the teardown. We're calling
dma_map_page() on each PRP even if consecutive PRP's came from the same
dma mapping segment. So even if it had been coalesced, if the device
doesn't support SGLs, it would still use the PRP unmap path.

Ben, can you check if your device supports sgl?

  # nvme id-ctrl /dev/nvme0 | grep sgl

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 20:25 ` Keith Busch
@ 2025-07-01  0:54 ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2025-07-01 0:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben Copeland, linux-kernel, lkft-triage, regressions, linux-nvme,
	Dan Carpenter, axboe, sagi, iommu

On Mon, Jun 30, 2025 at 02:25:23PM -0600, Keith Busch wrote:
> On Mon, Jun 30, 2025 at 03:33:43PM +0200, Christoph Hellwig wrote:
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -701,7 +701,7 @@ static void nvme_free_prps(struct request *req)
>  	prp_list = iod->descriptors[desc];
>  	do {
>  		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
> -		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> +		if (i == NVME_CTRL_PAGE_SIZE >> 3 && length > NVME_CTRL_PAGE_SIZE) {
>  			prp_list = iod->descriptors[++desc];
>  			i = 0;
>  		}
> --
>
> But even that, the PRP setup doesn't match the teardown. We're calling
> dma_map_page() on each PRP even if consecutive PRP's [...]

Err, I meant to say "dma_unmap_page()".

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 20:25 ` Keith Busch
@ 2025-07-01 13:05 ` Ben Copeland
  2025-07-01 14:20   ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Copeland @ 2025-07-01 13:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, linux-kernel, lkft-triage, regressions,
	linux-nvme, Dan Carpenter, axboe, sagi, iommu

On Mon, 30 Jun 2025 at 21:25, Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Jun 30, 2025 at 03:33:43PM +0200, Christoph Hellwig wrote:
> > Hi Ben,
> >
> > > [   32.857521]  iommu_dma_unmap_page+0xc4/0xe8 (P)
> >
> > Can you resolve this to a source location for me? I.e.:
> >
> > gdb vmlinux
> >
> > l *(iommu_dma_unmap_page+0xc4)
> >
> > Also, what IOMMU driver is this device using? It looks like it
> > might not support a 4k IOMMU page size.
>
> I think the PRP handling is broken. At the very least, handling the last
> element is wrong if it appears at the end of the list, so I think we
> need something like this:
>
> ---
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -701,7 +701,7 @@ static void nvme_free_prps(struct request *req)
>  	prp_list = iod->descriptors[desc];
>  	do {
>  		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
> -		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> +		if (i == NVME_CTRL_PAGE_SIZE >> 3 && length > NVME_CTRL_PAGE_SIZE) {
>  			prp_list = iod->descriptors[++desc];
>  			i = 0;
>  		}
> --
>
> But even that, the PRP setup doesn't match the teardown. We're calling
> dma_map_page() on each PRP even if consecutive PRP's came from the same
> dma mapping segment. So even if it had been coalesced, if the device
> doesn't support SGLs, it would still use the PRP unmap path.

Even though SGLs are not supported, I gave your patch a spin and
saw the same stack traces.

>
> Ben, can you check if your device supports sgl?
>
> # nvme id-ctrl /dev/nvme0 | grep sgl

sgl is not supported by the drive. Here was my output:
---
sgls      : 0

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-01 13:05 ` Ben Copeland
@ 2025-07-01 14:20 ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2025-07-01 14:20 UTC (permalink / raw)
  To: Ben Copeland
  Cc: Christoph Hellwig, linux-kernel, lkft-triage, regressions,
	linux-nvme, Dan Carpenter, axboe, sagi, iommu

On Tue, Jul 01, 2025 at 02:05:04PM +0100, Ben Copeland wrote:
> Even though SGLs are not supported, I gave your patch a spin and
> saw the same stack traces.

Right, that patch applies only if SGL is not supported, but it's only
fixing part of the problem, so I expected it would still fail. The dma
teardown needs to be symmetric to the dma mapping, but we've removed the
sg-table now, so we don't have the original mapping context in the
completion path here. Christoph replied with a patch that tries to infer
it based on the dma_addr continuity, which I think should work for you,
but I'll have to look a bit closer to be confident it's always the
correct assumption.

^ permalink raw reply	[flat|nested] 15+ messages in thread
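[Editor's note: the continuity heuristic Keith refers to can be sketched in isolation — walk per-PRP (addr, len) pairs and merge a pair into the previous unmap range whenever it starts exactly where that range ends. This is a standalone model with invented types and names, not the driver code:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long dma_addr_t;

struct dma_range {
        dma_addr_t addr;
        unsigned int len;
};

/* Merge per-PRP entries into the contiguous ranges that a
 * continuity-based teardown would pass to dma_unmap_page(). */
static size_t coalesce_ranges(const struct dma_range *in, size_t n,
                              struct dma_range *out)
{
        size_t nout = 0;

        for (size_t i = 0; i < n; i++) {
                if (nout && in[i].addr == out[nout - 1].addr + out[nout - 1].len)
                        out[nout - 1].len += in[i].len; /* extend current range */
                else
                        out[nout++] = in[i];           /* start a new range */
        }
        return nout;
}
```

The hazard raised later in the thread is visible in this model: two segments that were mapped separately but happen to receive adjacent IOVAs would be merged into a single unmap call, which is not guaranteed to be valid at the DMA API level.]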
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30 20:25 ` Keith Busch
  2025-07-01  0:54   ` Keith Busch
  2025-07-01 13:05   ` Ben Copeland
@ 2025-07-01 13:29 ` Christoph Hellwig
  2025-07-01 15:58   ` Leon Romanovsky
  2025-07-01 21:00   ` Keith Busch
  2 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-07-01 13:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Ben Copeland, linux-kernel, lkft-triage,
	regressions, linux-nvme, Dan Carpenter, axboe, sagi, iommu,
	Leon Romanovsky

On Mon, Jun 30, 2025 at 02:25:23PM -0600, Keith Busch wrote:
> I think the PRP handling is broken. At the very least, handling the last
> element is wrong if it appears at the end of the list, so I think we
> need something like this:

Yeah.

> But even that, the PRP setup doesn't match the teardown. We're calling
> dma_map_page() on each PRP even if consecutive PRP's came from the same
> dma mapping segment. So even if it had been coalesced, if the device
> doesn't support SGLs, it would still use the PRP unmap path.

Yes, that's broken, and I remember fixing it before. A little digging
shows that my fixes disappeared between the oct 30 version of Leon's
dma-split branch and the latest one somewhere. Below is what should
restore it, but at least when forcing my Intel IOMMU down this path it
still has issues with VPTEs already set. So maybe Bob should not try
it quite yet. I'll try to get to it, but my availability today and
tomorrow is a bit limited.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 38be1505dbd9..02bb5cf5db1a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -678,40 +678,55 @@ static void nvme_free_prps(struct request *req)
 	enum dma_data_direction dir = rq_dma_dir(req);
 	int length = iod->total_len;
 	dma_addr_t dma_addr;
-	int i, desc;
+	int prp_len, i, desc;
 	__le64 *prp_list;
+	dma_addr_t dma_start;
 	u32 dma_len;
 
 	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
-	dma_len = min_t(u32, length,
-		NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
-	length -= dma_len;
+	prp_len = NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1));
+	prp_len = min(length, prp_len);
+	length -= prp_len;
 	if (!length) {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+		dma_unmap_page(dma_dev, dma_addr, prp_len, dir);
 		return;
 	}
 
+	dma_start = dma_addr;
+	dma_len = prp_len;
+	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+
 	if (length <= NVME_CTRL_PAGE_SIZE) {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
-		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
-		dma_unmap_page(dma_dev, dma_addr, length, dir);
-		return;
+		if (dma_addr != dma_start + dma_len) {
+			dma_unmap_page(dma_dev, dma_start, dma_len, dir);
+			dma_start = dma_addr;
+			dma_len = 0;
+		}
+		dma_len += length;
+		goto done;
 	}
 
 	i = 0;
 	desc = 0;
 	prp_list = iod->descriptors[desc];
 	do {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			prp_list = iod->descriptors[++desc];
 			i = 0;
 		}
 
 		dma_addr = le64_to_cpu(prp_list[i++]);
-		dma_len = min(length, NVME_CTRL_PAGE_SIZE);
-		length -= dma_len;
+		if (dma_addr != dma_start + dma_len) {
+			dma_unmap_page(dma_dev, dma_start, dma_len, dir);
+			dma_start = dma_addr;
+			dma_len = 0;
+		}
+		prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+		dma_len += prp_len;
+		length -= prp_len;
 	} while (length);
+
+done:
+	dma_unmap_page(dma_dev, dma_start, dma_len, dir);
 }
 
 static void nvme_free_sgls(struct request *req)

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-01 13:29 ` Christoph Hellwig
@ 2025-07-01 15:58 ` Leon Romanovsky
  2025-07-01 21:00   ` Keith Busch
  1 sibling, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2025-07-01 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Ben Copeland, linux-kernel, lkft-triage, regressions,
	linux-nvme, Dan Carpenter, axboe, sagi, iommu

On Tue, Jul 01, 2025 at 03:29:36PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 30, 2025 at 02:25:23PM -0600, Keith Busch wrote:
> > I think the PRP handling is broken. At the very least, handling the last
> > element is wrong if it appears at the end of the list, so I think we
> > need something like this:
>
> Yeah.
>
> > But even that, the PRP setup doesn't match the teardown. We're calling
> > dma_map_page() on each PRP even if consecutive PRP's came from the same
> > dma mapping segment. So even if it had been coalesced, if the device
> > doesn't support SGLs, it would still use the PRP unmap path.
>
> Yes, that's broken, and I remember fixing it before. A little digging
> shows that my fixes disappeared between the oct 30 version of Leon's
> dma-split branch and the latest one somewhere.

Oct 30 belongs to the RFC/first version of the patches.

Thanks

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-01 13:29 ` Christoph Hellwig
  2025-07-01 15:58   ` Leon Romanovsky
@ 2025-07-01 21:00 ` Keith Busch
  2025-07-03  9:30   ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Keith Busch @ 2025-07-01 21:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben Copeland, linux-kernel, lkft-triage, regressions, linux-nvme,
	Dan Carpenter, axboe, sagi, iommu, Leon Romanovsky

On Tue, Jul 01, 2025 at 03:29:36PM +0200, Christoph Hellwig wrote:
> Yes, that's broken, and I remember fixing it before. A little digging
> shows that my fixes disappeared between the oct 30 version of Leon's
> dma-split branch and the latest one somewhere. Below is what should
> restore it, but at least when forcing my Intel IOMMU down this path it
> still has issues with VPTEs already set. So maybe Bob should not try
> it quite yet. I'll try to get to it, but my availability today and
> tomorrow is a bit limited.

Let's say we're using an ARM64 SMMU configured with 64k granularity like
I showed earlier.

Now let's send a read command with a 128k transfer, and let's assume the
payload is two 64k aligned physical segments, so we have 2 bvecs.

Since nvme's virtual boundary is smaller than the iommu's granule, we
won't attempt to coalesce. We instead iommu map each bvec segment
individually.

And let's say each segment just so happens to get consecutive IOVAs.
The mapping side had done each segment individually, but your proposal
here will assume the contiguous dma_addr ranges were done as a single
larger mapping. Is that okay?

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 38be1505dbd9..02bb5cf5db1a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -678,40 +678,55 @@ static void nvme_free_prps(struct request *req)
>  	enum dma_data_direction dir = rq_dma_dir(req);
>  	int length = iod->total_len;
>  	dma_addr_t dma_addr;
> -	int i, desc;
> +	int prp_len, i, desc;
>  	__le64 *prp_list;
> +	dma_addr_t dma_start;
>  	u32 dma_len;
>  
>  	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
> -	dma_len = min_t(u32, length,
> -		NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
> -	length -= dma_len;
> +	prp_len = NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1));
> +	prp_len = min(length, prp_len);
> +	length -= prp_len;
>  	if (!length) {
> -		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
> +		dma_unmap_page(dma_dev, dma_addr, prp_len, dir);
>  		return;
>  	}
>  
> +	dma_start = dma_addr;
> +	dma_len = prp_len;
> +	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
> +
>  	if (length <= NVME_CTRL_PAGE_SIZE) {
> -		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
> -		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
> -		dma_unmap_page(dma_dev, dma_addr, length, dir);
> -		return;
> +		if (dma_addr != dma_start + dma_len) {
> +			dma_unmap_page(dma_dev, dma_start, dma_len, dir);
> +			dma_start = dma_addr;
> +			dma_len = 0;
> +		}
> +		dma_len += length;
> +		goto done;
>  	}
>  
>  	i = 0;
>  	desc = 0;
>  	prp_list = iod->descriptors[desc];
>  	do {
> -		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
>  		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
>  			prp_list = iod->descriptors[++desc];
>  			i = 0;
>  		}
>  
>  		dma_addr = le64_to_cpu(prp_list[i++]);
> -		dma_len = min(length, NVME_CTRL_PAGE_SIZE);
> -		length -= dma_len;
> +		if (dma_addr != dma_start + dma_len) {
> +			dma_unmap_page(dma_dev, dma_start, dma_len, dir);
> +			dma_start = dma_addr;
> +			dma_len = 0;
> +		}
> +		prp_len = min(length, NVME_CTRL_PAGE_SIZE);
> +		dma_len += prp_len;
> +		length -= prp_len;
>  	} while (length);
> +
> +done:
> +	dma_unmap_page(dma_dev, dma_start, dma_len, dir);
>  }
>  
>  static void nvme_free_sgls(struct request *req)

^ permalink raw reply	[flat|nested] 15+ messages in thread
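[Editor's note: the precondition Keith describes — coalescing is only attempted when the device's virtual boundary is at least as large as the IOMMU granule — can be stated as a one-line predicate. This is a sketch with an invented helper name, not the actual block-layer check:

```c
#include <assert.h>

/* Sketch: IOVA coalescing of a multi-segment request is only safe when
 * every segment boundary the device accepts (virt_boundary_mask + 1)
 * also falls on an IOMMU page boundary, i.e. the device boundary is at
 * least as large as the IOMMU granule. */
static int can_coalesce(unsigned long virt_boundary_mask,
                        unsigned long iommu_granule)
{
        return virt_boundary_mask + 1 >= iommu_granule;
}
```

With NVMe's 4K controller page (mask 0xfff) against the 64K granule reported earlier this returns false, matching the observation that each bvec gets mapped individually.]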
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-01 21:00 ` Keith Busch
@ 2025-07-03  9:30 ` Christoph Hellwig
  2025-07-03 14:29   ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2025-07-03 9:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Ben Copeland, linux-kernel, lkft-triage,
	regressions, linux-nvme, Dan Carpenter, axboe, sagi, iommu,
	Leon Romanovsky

On Tue, Jul 01, 2025 at 05:00:58PM -0400, Keith Busch wrote:
> On Tue, Jul 01, 2025 at 03:29:36PM +0200, Christoph Hellwig wrote:
> > Yes, that's broken, and I remember fixing it before. A little digging
> > shows that my fixes disappeared between the oct 30 version of Leon's
> > dma-split branch and the latest one somewhere. Below is what should
> > restore it, but at least when forcing my Intel IOMMU down this path it
> > still has issues with VPTEs already set. So maybe Bob should not try
> > it quite yet. I'll try to get to it, but my availability today and
> > tomorrow is a bit limited.
>
> Let's say we're using an ARM64 SMMU configured with 64k granularity like
> I showed earlier.
>
> Now let's send a read command with a 128k transfer, and let's assume the
> payload is two 64k aligned physical segments, so we have 2 bvecs.
>
> Since nvme's virtual boundary is smaller than the iommu's granule, we
> won't attempt to coalesce. We instead iommu map each bvec segment
> individually.
>
> And let's say each segment just so happens to get consecutive IOVAs.
> The mapping side had done each segment individually, but your proposal
> here will assume the contiguous dma_addr ranges were done as a single
> larger mapping. Is that okay?

Not on the DMA API level, and depending on the implementation also not
in practice.

I think the idea to reconstruct the dma addresses from PRPs should
be considered a failure by now. It works fine for SGLs, but for
PRPs we're better off just stashing them away. Bob, can you try
something like the patch below? To be fully safe it needs a mempool,
and it could use some cleanups, but it does pass testing on my setup
here, so I'd love to see if it fixes your issue.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5cf90ddc3e9..f1242e321a58 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -262,6 +262,11 @@ enum nvme_iod_flags {
 	IOD_SINGLE_SEGMENT = 1U << 2,
 };
 
+struct dma_vec {
+	dma_addr_t addr;
+	unsigned int len;
+};
+
 /*
  * The nvme_iod describes the data in an I/O.
  */
@@ -274,6 +279,8 @@ struct nvme_iod {
 	unsigned int total_len;
 	struct dma_iova_state dma_state;
 	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
+	struct dma_vec *dma_vecs;
+	unsigned int nr_dma_vecs;
 
 	dma_addr_t meta_dma;
 	struct sg_table meta_sgt;
@@ -676,42 +683,12 @@ static void nvme_free_prps(struct request *req)
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct device *dma_dev = nvmeq->dev->dev;
 	enum dma_data_direction dir = rq_dma_dir(req);
-	int length = iod->total_len;
-	dma_addr_t dma_addr;
-	int i, desc;
-	__le64 *prp_list;
-	u32 dma_len;
-
-	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
-	dma_len = min_t(u32, length,
-		NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
-	length -= dma_len;
-	if (!length) {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
-		return;
-	}
-
-	if (length <= NVME_CTRL_PAGE_SIZE) {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
-		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
-		dma_unmap_page(dma_dev, dma_addr, length, dir);
-		return;
-	}
-
-	i = 0;
-	desc = 0;
-	prp_list = iod->descriptors[desc];
-	do {
-		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
-		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
-			prp_list = iod->descriptors[++desc];
-			i = 0;
-		}
+	unsigned int i;
 
-		dma_addr = le64_to_cpu(prp_list[i++]);
-		dma_len = min(length, NVME_CTRL_PAGE_SIZE);
-		length -= dma_len;
-	} while (length);
+	for (i = 0; i < iod->nr_dma_vecs; i++)
+		dma_unmap_page(dma_dev, iod->dma_vecs[i].addr,
+				iod->dma_vecs[i].len, dir);
+	kfree(iod->dma_vecs);
 }
 
 static void nvme_free_sgls(struct request *req)
@@ -770,6 +747,17 @@ static blk_status_t nvme_pci_setup_data_prp(struct request *req,
 	unsigned int prp_len, i;
 	__le64 *prp_list;
 
+	if (dma_need_unmap(nvmeq->dev->dev)) {
+		/* XXX: use mempool */
+		iod->dma_vecs = kmalloc_array(blk_rq_nr_phys_segments(req),
+				sizeof(struct dma_vec), GFP_NOIO);
+		if (!iod->dma_vecs)
+			return BLK_STS_RESOURCE;
+		iod->dma_vecs[0].addr = iter->addr;
+		iod->dma_vecs[0].len = iter->len;
+		iod->nr_dma_vecs++;
+	}
+
 	/*
 	 * PRP1 always points to the start of the DMA transfers.
 	 *
@@ -793,6 +781,11 @@ static blk_status_t nvme_pci_setup_data_prp(struct request *req,
 				goto bad_sgl;
 			goto done;
 		}
+		if (dma_need_unmap(nvmeq->dev->dev)) {
+			iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
+			iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
+			iod->nr_dma_vecs++;
+		}
 	}
 
 	/*
@@ -838,6 +831,12 @@ static blk_status_t nvme_pci_setup_data_prp(struct request *req,
 					goto bad_sgl;
 				goto done;
 			}
+			if (dma_need_unmap(nvmeq->dev->dev)) {
+				iod->dma_vecs[iod->nr_dma_vecs].addr =
+					iter->addr;
+				iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
+				iod->nr_dma_vecs++;
+			}
 		}
 
 	/*
@@ -1108,6 +1107,7 @@ static blk_status_t nvme_prep_rq(struct request *req)
 	iod->nr_descriptors = 0;
 	iod->total_len = 0;
 	iod->meta_sgt.nents = 0;
+	iod->nr_dma_vecs = 0;
 
 	ret = nvme_setup_cmd(req->q->queuedata, req);
 	if (ret)

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-03  9:30   ` Christoph Hellwig
@ 2025-07-03 14:29 ` Keith Busch
  2025-07-03 15:13   ` Ben Copeland
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2025-07-03 14:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ben Copeland, linux-kernel, lkft-triage, regressions, linux-nvme,
	Dan Carpenter, axboe, sagi, iommu, Leon Romanovsky

On Thu, Jul 03, 2025 at 11:30:42AM +0200, Christoph Hellwig wrote:
> I think the idea to reconstruct the dma addresses from PRPs should
> be considered a failure by now.  It works fine for SGLs, but for
> PRPs we're better off just stashing them away.  Bob, can you try

s/Bob/Ben

> something like the patch below?  To be fully safe it needs a mempool,
> and it could use some cleanups, but it does pass testing on my setup
> here, so I'd love to see if it fixes your issue.

Thanks for confirming.

While this is starting to look a bit messy, I believe it's still an
overall win: you've cut down the vector walking in the setup path from
3 to 1, which reduces a non-trivial amount of overhead for even
moderately sized IO.
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-07-03 14:29 ` Keith Busch
@ 2025-07-03 15:13 ` Ben Copeland
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Copeland @ 2025-07-03 15:13 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, linux-kernel, lkft-triage, regressions,
	linux-nvme, Dan Carpenter, axboe, sagi, iommu, Leon Romanovsky

On Thu, 3 Jul 2025 at 15:29, Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Jul 03, 2025 at 11:30:42AM +0200, Christoph Hellwig wrote:
> > I think the idea to reconstruct the dma addresses from PRPs should
> > be considered a failure by now.  It works fine for SGLs, but for
> > PRPs we're better off just stashing them away.  Bob, can you try
>
> s/Bob/Ben
>
> > something like the patch below?  To be fully safe it needs a mempool,
> > and it could use some cleanups, but it does pass testing on my setup
> > here, so I'd love to see if it fixes your issue.

I have tested it on my system and can no longer see the regression.
Happy to retest when the patch goes through.

Tested-by: Ben Copeland <ben.copeland@linaro.org>

Thank you!
Ben

> Thanks for confirming.
>
> While this is starting to look a bit messy, I believe it's still an
> overall win: you've cut down the vector walking in the setup path from
> 3 to 1, which reduces a non-trivial amount of overhead for even
> moderately sized IO.
* Re: next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c
  2025-06-30  7:50 ` next-20250627: IOMMU DMA warning during NVMe I/O completion after 06cae0e3f61c Ben Copeland
  2025-06-30 13:33 ` Christoph Hellwig
@ 2025-07-03  5:54 ` Kanchan Joshi
  1 sibling, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2025-07-03 5:54 UTC (permalink / raw)
To: Ben Copeland, linux-kernel, lkft-triage, regressions, linux-nvme
Cc: Dan Carpenter, kbusch, axboe, hch, sagi

On 6/30/2025 1:20 PM, Ben Copeland wrote:
> # first bad commit: [06cae0e3f61c4c1ef18726b817bbb88c29f81e57]
> nvme-pci: merge the simple PRP and SGL setup into a common helper

This patch had the problem [0] of not setting 'iod->first_dma', which
can trigger that IOMMU warning on unmap.  But the problem was
transient, as a subsequent patch removed iod->first_dma and made
unmapping work without it.

[0]
-static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+static blk_status_t nvme_pci_setup_data_simple(struct request *req,
+		enum nvme_use_sgl use_sgl)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
-	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
-
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))

iod->first_dma does not get set, and that can cause unmap on an invalid
address during teardown.