From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Alex Williamson <alex.williamson@redhat.com>,
Sasha Levin <sashal@kernel.org>,
kvm@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-5.15] vfio/type1: conditional rescheduling while pinning
Date: Sat, 9 Aug 2025 20:21:01 -0400
Message-ID: <20250810002104.1545396-12-sashal@kernel.org>
In-Reply-To: <20250810002104.1545396-1-sashal@kernel.org>
From: Keith Busch <kbusch@kernel.org>
[ Upstream commit b1779e4f209c7ff7e32f3c79d69bca4e3a3a68b6 ]
A large DMA mapping request can loop through DMA address pinning for
many pages. In cases where THP cannot be used, the repeated vmf_insert_pfn
calls can be costly, so let the task reschedule as needed to prevent CPU
stalls. Failure to do so has potentially harmful side effects, like
increased memory pressure
as unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
rcu:           hardirqs   softirqs   csw/system
rcu:   number:        0        107            0
rcu:  cputime:       50          0        10446   ==> 10556(ms)
rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
<TASK>
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? walk_system_ram_range+0x63/0x120
? walk_system_ram_range+0x46/0x120
? pgprot_writethrough+0x20/0x20
lookup_memtype+0x67/0xf0
track_pfn_insert+0x20/0x40
vmf_insert_pfn_prot+0x88/0x140
vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
__do_fault+0x28/0x1b0
handle_mm_fault+0xef1/0x2560
fixup_user_fault+0xf5/0x270
vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
? futex_wake+0x1c1/0x260
x64_sys_call+0x234/0x17a0
do_syscall_64+0x63/0x130
? exc_page_fault+0x63/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20250715184622.3561598-1-kbusch@meta.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Bug Fix Nature
The commit fixes a critical CPU stall issue that can lead to system
instability. The RCU stall trace in the commit message shows:
- An RCU stall of about 21 seconds (`t=21075 jiffies`, assuming HZ=1000)
- Over 10 seconds of CPU time consumed (`cputime: 10556ms`)
- The stall originates from `vfio_pin_pages_remote()` during large DMA
mapping operations
This is a serious bug that affects system reliability and can cascade
into OOM conditions as mentioned: "increased memory pressure as
unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions."
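The stall duration quoted above depends on the kernel tick rate, which the trace does not state. As a rough sketch, the conversion can be checked with a small userspace helper. The name `stall_jiffies_to_ms` and the `hz` values are illustrative assumptions, not kernel APIs or facts from the report:

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a jiffies count to whole milliseconds
 * for an assumed tick rate. hz stands in for CONFIG_HZ, which the
 * stall report does not state.
 */
uint64_t stall_jiffies_to_ms(uint64_t jiffies, unsigned int hz)
{
	return jiffies * 1000ULL / hz;
}
```

With `hz = 1000`, `t=21075 jiffies` corresponds to about 21 seconds. A kernel built with `hz = 250` would make the same count about 84 seconds.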
## Code Change Analysis
The fix is minimal and well-contained:
1. **Single statement added**: The patch adds one `cond_resched()` call,
preceded by a comment block (7 inserted lines in total)
2. **Strategic placement**: Positioned at the start of the batch-refill
branch (`if (!batch->size)`), before `vaddr_get_pfns()` is called
3. **Clear comment**: The 5-line comment explains exactly why the
reschedule is needed
```c
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
```
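To illustrate how often the new reschedule point is reached, here is a simplified userspace model of the loop shape, not the kernel code itself. `MODEL_BATCH_CAPACITY`, `model_refill()` and `model_pin_pages()` are hypothetical names, and the model assumes every refill returns a full batch:

```c
#include <stddef.h>

/* Hypothetical batch capacity; the real value is not shown in the patch. */
#define MODEL_BATCH_CAPACITY 512

/* Counters collected by the model. */
struct pin_stats {
	size_t resched_points;	/* times the reschedule point was reached */
	size_t pages_pinned;	/* pages consumed from batches */
};

/* Stand-in for vaddr_get_pfns(): returns up to one full batch of pages. */
static size_t model_refill(size_t remaining)
{
	return remaining < MODEL_BATCH_CAPACITY ? remaining : MODEL_BATCH_CAPACITY;
}

/* Model of the pinning loop: one reschedule point per batch refill. */
struct pin_stats model_pin_pages(size_t npage)
{
	struct pin_stats st = { 0, 0 };
	size_t batch_size = 0;

	while (npage) {
		if (!batch_size) {
			/* Where the patch places cond_resched() in the kernel. */
			st.resched_points++;
			batch_size = model_refill(npage);
		}
		/* Consume one page from the current batch. */
		batch_size--;
		npage--;
		st.pages_pinned++;
	}
	return st;
}
```

Under these assumptions the reschedule point is reached once per refill rather than once per page. If a refill returns fewer pages than the capacity, the point is simply reached more often.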
## Stability and Safety
1. **Low risk**: `cond_resched()` is a standard kernel primitive that
only yields CPU if needed
2. **Already used pattern**: My search shows vfio_iommu_type1.c already
uses `cond_resched()` in other places, confirming this is an
established pattern
3. **No functional changes**: The fix doesn't alter the logic of DMA
pinning, it just prevents monopolizing the CPU
4. **Reviewed by RCU maintainer**: Paul E. McKenney's review adds
credibility to the fix
## Stable Tree Criteria Compliance
Per stable-kernel-rules.rst, this meets the criteria:
- **Fixes a real bug**: CPU stalls and potential OOM are serious issues
- **Small change**: Single line addition with comment
- **Obviously correct**: Standard solution for long-running loops
- **Already tested**: Has been in mainline and reviewed by experts
- **No new features**: Pure bug fix, no functionality addition
## Impact Assessment
The bug affects systems performing large DMA mappings through VFIO,
particularly when Transparent Huge Pages (THP) cannot be used. This is
common in:
- Virtual machine device passthrough
- GPU/accelerator passthrough scenarios
- Large memory pinning operations
Without this fix, affected systems can experience:
- RCU stalls leading to system unresponsiveness
- Memory pressure and OOM kills
- Performance degradation for unrelated tasks
The fix prevents these issues with negligible overhead (conditional
reschedule only when needed).
drivers/vfio/vfio_iommu_type1.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b59..f8d68fe77b41 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	while (npage) {
 		if (!batch->size) {
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
+
 			/* Empty batch, so refill it. */
 			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
 					     &pfn, batch);
--
2.39.5