[PATCH AUTOSEL 6.16-5.15] vfio/type1: conditional rescheduling while pinning

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	Sasha Levin <sashal@kernel.org>,
	kvm@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-5.15] vfio/type1: conditional rescheduling while pinning
Date: Sat,  9 Aug 2025 20:21:01 -0400
Message-ID: <20250810002104.1545396-12-sashal@kernel.org>
In-Reply-To: <20250810002104.1545396-1-sashal@kernel.org>

From: Keith Busch <kbusch@kernel.org>

[ Upstream commit b1779e4f209c7ff7e32f3c79d69bca4e3a3a68b6 ]

A large DMA mapping request can loop through dma address pinning for
many pages. In cases where THP cannot be used, the repeated vmf_insert_pfn can
be costly, so let the task reschedule as needed to prevent CPU stalls. Failure to
do so has potentially harmful side effects, like increased memory pressure
as unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions.

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
 rcu:            hardirqs   softirqs   csw/system
 rcu:    number:        0        107            0
 rcu:   cputime:       50          0        10446   ==> 10556(ms)
 rcu:   (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
  <TASK>
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? walk_system_ram_range+0x63/0x120
  ? walk_system_ram_range+0x46/0x120
  ? pgprot_writethrough+0x20/0x20
  lookup_memtype+0x67/0xf0
  track_pfn_insert+0x20/0x40
  vmf_insert_pfn_prot+0x88/0x140
  vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
  __do_fault+0x28/0x1b0
  handle_mm_fault+0xef1/0x2560
  fixup_user_fault+0xf5/0x270
  vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
  vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
  ? futex_wake+0x1c1/0x260
  x64_sys_call+0x234/0x17a0
  do_syscall_64+0x63/0x130
  ? exc_page_fault+0x63/0x130
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20250715184622.3561598-1-kbusch@meta.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

**Backport Status: YES**

This commit should be backported to stable kernel trees. Here's my
extensive analysis:

## Bug Fix Nature

The commit fixes a critical CPU stall issue that can lead to system
instability. The RCU stall trace in the commit message shows:
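- An RCU stall of about 21 seconds, assuming `HZ=1000` (`t=21075 jiffies`)
- About 10.4 seconds of system CPU time (the `10446` ms entry in the
  `cputime` row) within the 10556 ms window reported by the trace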
- The stall originates from `vfio_pin_pages_remote()` during large DMA
  mapping operations
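
As a rough check of the stall duration, a jiffies count converts to
wall-clock time once a tick rate is assumed. The sketch below is a
userspace illustration only: the trace does not state the kernel's tick
rate, so `HZ=1000` is an assumption, and `stall_jiffies_to_ms` is a
hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper (not a kernel API): convert a jiffies
 * count to milliseconds for an assumed tick rate in Hz.
 */
uint64_t stall_jiffies_to_ms(uint64_t jiffies, uint32_t hz)
{
	/* Each jiffy lasts 1000/hz milliseconds. */
	return jiffies * 1000u / hz;
}
```

With `hz` set to 1000, `t=21075 jiffies` maps to 21075 ms; a kernel
built with a lower tick rate would report a proportionally longer stall
for the same count.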

This is a serious bug that affects system reliability and can cascade
into OOM conditions as mentioned: "increased memory pressure as
unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions."

## Code Change Analysis

The fix is minimal and well-contained:
1. **Single functional line**: The patch adds one `cond_resched()` call
   plus its comment, in the hunk starting at line 647
2. **Strategic placement**: Positioned at the start of the batch
   refill branch, before `vaddr_get_pfns()` is called
3. **Clear comment**: The 5-line comment explains exactly why the
   reschedule is needed

```c
+                       /*
+                        * Large mappings may take a while to repeatedly refill
+                        * the batch, so conditionally relinquish the CPU when
+                        * needed to avoid stalls.
+                        */
+                       cond_resched();
```

## Stability and Safety

1. **Low risk**: `cond_resched()` is a standard kernel primitive that
   only yields CPU if needed
2. **Already used pattern**: My search shows vfio_iommu_type1.c already
   uses `cond_resched()` in other places, confirming this is an
   established pattern
3. **No functional changes**: The fix doesn't alter the logic of DMA
   pinning, it just prevents monopolizing the CPU
4. **Reviewed by RCU maintainer**: Paul E. McKenney's review adds
   credibility to the fix
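
The shape of the loop can be approximated in userspace to show where
the yield point sits relative to the refill. This is a minimal sketch,
not the driver code: `sketch_pin_pages`, `sketch_yield_point` and
`SKETCH_BATCH_CAPACITY` are hypothetical names, the capacity value is
illustrative, and `sched_yield()` stands in for `cond_resched()` even
though it offers the CPU unconditionally rather than only when a
reschedule is pending.

```c
#include <sched.h>
#include <stddef.h>

/* Illustrative batch capacity; not the value used by vfio_iommu_type1. */
#define SKETCH_BATCH_CAPACITY 512

/*
 * Userspace stand-in for the kernel's cond_resched(). Unlike
 * cond_resched(), sched_yield() offers the CPU unconditionally.
 */
void sketch_yield_point(void)
{
	sched_yield();
}

/*
 * Walk npage pages in batches and return how many yield points were
 * hit, which is one per batch refill.
 */
size_t sketch_pin_pages(size_t npage)
{
	size_t batch_size = 0;
	size_t yield_points = 0;

	while (npage) {
		if (!batch_size) {
			/* Offer the CPU before the potentially slow refill. */
			sketch_yield_point();
			yield_points++;

			/* Empty batch, so refill it with up to one batch of pages. */
			batch_size = npage < SKETCH_BATCH_CAPACITY ?
				     npage : SKETCH_BATCH_CAPACITY;
		}

		/* Consume one page from the current batch. */
		batch_size--;
		npage--;
	}

	return yield_points;
}
```

Placing the yield point in the refill branch means it runs once per
batch rather than once per page, so its cost scales with the number of
refills.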

## Stable Tree Criteria Compliance

Per stable-kernel-rules.rst, this meets the criteria:
- **Fixes a real bug**: CPU stalls and potential OOM are serious issues
- **Small change**: Single line addition with comment
- **Obviously correct**: Standard solution for long-running loops
- **Already tested**: Has been in mainline and reviewed by experts
- **No new features**: Pure bug fix, no functionality addition

## Impact Assessment

The bug affects systems performing large DMA mappings through VFIO,
particularly when Transparent Huge Pages (THP) cannot be used. This is
common in:
- Virtual machine device passthrough
- GPU/accelerator passthrough scenarios
- Large memory pinning operations

Without this fix, affected systems can experience:
- RCU stalls leading to system unresponsiveness
- Memory pressure and OOM kills
- Performance degradation for unrelated tasks

The fix prevents these issues with negligible overhead (conditional
reschedule only when needed).

 drivers/vfio/vfio_iommu_type1.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b59..f8d68fe77b41 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 	while (npage) {
 		if (!batch->size) {
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
+
 			/* Empty batch, so refill it. */
 			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
 					     &pfn, batch);
-- 
2.39.5
