* [PATCHv2] vfio/type1: conditional rescheduling while pinning
@ 2025-07-15 18:46 Keith Busch
2025-07-15 23:47 ` Paul E. McKenney
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Keith Busch @ 2025-07-15 18:46 UTC (permalink / raw)
To: alex.williamson, kvm, linux-pci; +Cc: paulmck, Keith Busch
From: Keith Busch <kbusch@kernel.org>
A large DMA mapping request can loop through dma address pinning for
many pages. In cases where THP cannot be used, the repeated
vmf_insert_pfn() calls can be costly, so let the task reschedule as
needed to prevent CPU stalls. Failure to do so has potentially harmful
side effects, like increased memory pressure as unrelated RCU tasks are
unable to run their reclaim callbacks, resulting in OOM conditions.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
rcu: hardirqs softirqs csw/system
rcu: number: 0 107 0
rcu: cputime: 50 0 10446 ==> 10556(ms)
rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
<TASK>
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? walk_system_ram_range+0x63/0x120
? walk_system_ram_range+0x46/0x120
? pgprot_writethrough+0x20/0x20
lookup_memtype+0x67/0xf0
track_pfn_insert+0x20/0x40
vmf_insert_pfn_prot+0x88/0x140
vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
__do_fault+0x28/0x1b0
handle_mm_fault+0xef1/0x2560
fixup_user_fault+0xf5/0x270
vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
? futex_wake+0x1c1/0x260
x64_sys_call+0x234/0x17a0
do_syscall_64+0x63/0x130
? exc_page_fault+0x63/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:
Merged up to vfio/next
Moved the cond_resched() to a more appropriate place within the
loop, and added a comment about why it's there.
Update to change log describing one of the consequences of not doing
this.
drivers/vfio/vfio_iommu_type1.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b597..ad599b1601711 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
while (npage) {
if (!batch->size) {
+ /*
+ * Large mappings may take a while to repeatedly refill
+ * the batch, so conditionally relinquish the CPU when
+ * needed to avoid stalls.
+ */
+ cond_resched();
+
/* Empty batch, so refill it. */
ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
&pfn, batch);
--
2.47.1
* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
@ 2025-07-15 23:47 ` Paul E. McKenney
2025-07-16 12:32 ` Jason Gunthorpe
2025-07-16 20:24 ` Alex Williamson
2 siblings, 0 replies; 6+ messages in thread
From: Paul E. McKenney @ 2025-07-15 23:47 UTC (permalink / raw)
To: Keith Busch; +Cc: alex.williamson, kvm, linux-pci, Keith Busch
On Tue, Jul 15, 2025 at 11:46:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A large DMA mapping request can loop through dma address pinning for
> many pages. In cases where THP cannot be used, the repeated
> vmf_insert_pfn() calls can be costly, so let the task reschedule as
> needed to prevent CPU stalls. Failure to do so has potentially harmful
> side effects, like increased memory pressure as unrelated RCU tasks are
> unable to run their reclaim callbacks, resulting in OOM conditions.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
> rcu: hardirqs softirqs csw/system
> rcu: number: 0 107 0
> rcu: cputime: 50 0 10446 ==> 10556(ms)
> rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
> ...
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? walk_system_ram_range+0x63/0x120
> ? walk_system_ram_range+0x46/0x120
> ? pgprot_writethrough+0x20/0x20
> lookup_memtype+0x67/0xf0
> track_pfn_insert+0x20/0x40
> vmf_insert_pfn_prot+0x88/0x140
> vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
> __do_fault+0x28/0x1b0
> handle_mm_fault+0xef1/0x2560
> fixup_user_fault+0xf5/0x270
> vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
> vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
> vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
> ? futex_wake+0x1c1/0x260
> x64_sys_call+0x234/0x17a0
> do_syscall_64+0x63/0x130
> ? exc_page_fault+0x63/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
From an RCU CPU stall-warning viewpoint, given that vaddr_get_pfns()
invokes mmap_read_lock(), this code can schedule:
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> ---
> v1->v2:
>
> Merged up to vfio/next
>
> Moved the cond_resched() to a more appropriate place within the
> loop, and added a comment about why it's there.
>
> Update to change log describing one of the consequences of not doing
> this.
>
> drivers/vfio/vfio_iommu_type1.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1136d7ac6b597..ad599b1601711 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>
> while (npage) {
> if (!batch->size) {
> + /*
> + * Large mappings may take a while to repeatedly refill
> + * the batch, so conditionally relinquish the CPU when
> + * needed to avoid stalls.
> + */
> + cond_resched();
> +
> /* Empty batch, so refill it. */
> ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
> &pfn, batch);
> --
> 2.47.1
>
* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
2025-07-15 23:47 ` Paul E. McKenney
@ 2025-07-16 12:32 ` Jason Gunthorpe
2025-07-16 13:47 ` Keith Busch
2025-07-16 20:24 ` Alex Williamson
2 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 12:32 UTC (permalink / raw)
To: Keith Busch; +Cc: alex.williamson, kvm, linux-pci, paulmck, Keith Busch
On Tue, Jul 15, 2025 at 11:46:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A large DMA mapping request can loop through dma address pinning for
> many pages. In cases where THP cannot be used, the repeated
> vmf_insert_pfn() calls can be costly, so let the task reschedule as
> needed to prevent CPU stalls. Failure to do so has potentially harmful
> side effects, like increased memory pressure as unrelated RCU tasks are
> unable to run their reclaim callbacks, resulting in OOM conditions.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
> rcu: hardirqs softirqs csw/system
> rcu: number: 0 107 0
> rcu: cputime: 50 0 10446 ==> 10556(ms)
> rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
> ...
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? walk_system_ram_range+0x63/0x120
> ? walk_system_ram_range+0x46/0x120
> ? pgprot_writethrough+0x20/0x20
> lookup_memtype+0x67/0xf0
> track_pfn_insert+0x20/0x40
> vmf_insert_pfn_prot+0x88/0x140
> vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
> __do_fault+0x28/0x1b0
> handle_mm_fault+0xef1/0x2560
> fixup_user_fault+0xf5/0x270
> vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
> vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
> vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
> ? futex_wake+0x1c1/0x260
> x64_sys_call+0x234/0x17a0
> do_syscall_64+0x63/0x130
> ? exc_page_fault+0x63/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
>
> Merged up to vfio/next
>
> Moved the cond_resched() to a more appropriate place within the
> loop, and added a comment about why it's there.
>
> Update to change log describing one of the consequences of not doing
> this.
>
> drivers/vfio/vfio_iommu_type1.c | 7 +++++++
> 1 file changed, 7 insertions(+)
You should be making a matching change to iommufd too..
Jason
* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
2025-07-16 12:32 ` Jason Gunthorpe
@ 2025-07-16 13:47 ` Keith Busch
2025-07-16 13:51 ` Jason Gunthorpe
0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2025-07-16 13:47 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Keith Busch, alex.williamson, kvm, linux-pci, paulmck
On Wed, Jul 16, 2025 at 09:32:01AM -0300, Jason Gunthorpe wrote:
>
> You should be making a matching change to iommufd too..
Yeah, okay. My colleague I've been working with on this hasn't yet
reported CPU stalls using iommufd though, so I'm not sure which path
iommufd takes to know where to place cond_resched(). Blindly stumbling
through it, my best guess right now is arriving at the loop in
iopt_fill_domains_pages(), called through iommufd_vfio_map_dma(). Am I
close?
* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
2025-07-16 13:47 ` Keith Busch
@ 2025-07-16 13:51 ` Jason Gunthorpe
0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 13:51 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, alex.williamson, kvm, linux-pci, paulmck
On Wed, Jul 16, 2025 at 07:47:04AM -0600, Keith Busch wrote:
> On Wed, Jul 16, 2025 at 09:32:01AM -0300, Jason Gunthorpe wrote:
> >
> > You should be making a matching change to iommufd too..
>
> Yeah, okay. My colleague I've been working with on this hasn't yet
> reported CPU stalls using iommufd though, so I'm not sure which path
> iommufd takes to know where to place cond_resched(). Blindly stumbling
> through it, my best guess right now is arriving at the loop in
> iopt_fill_domains_pages(), called through iommufd_vfio_map_dma(). Am I
> close?
Yeah, I would guess maybe put it in pfn_reader_next()
But iommufd won't be faulting the MMIO memory from VFIO, so you can't
hit the exact same condition you are describing.
Jason
* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
2025-07-15 23:47 ` Paul E. McKenney
2025-07-16 12:32 ` Jason Gunthorpe
@ 2025-07-16 20:24 ` Alex Williamson
2 siblings, 0 replies; 6+ messages in thread
From: Alex Williamson @ 2025-07-16 20:24 UTC (permalink / raw)
To: Keith Busch; +Cc: kvm, linux-pci, paulmck, Keith Busch
On Tue, 15 Jul 2025 11:46:22 -0700
Keith Busch <kbusch@meta.com> wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A large DMA mapping request can loop through dma address pinning for
> many pages. In cases where THP cannot be used, the repeated
> vmf_insert_pfn() calls can be costly, so let the task reschedule as
> needed to prevent CPU stalls. Failure to do so has potentially harmful
> side effects, like increased memory pressure as unrelated RCU tasks are
> unable to run their reclaim callbacks, resulting in OOM conditions.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
> rcu: hardirqs softirqs csw/system
> rcu: number: 0 107 0
> rcu: cputime: 50 0 10446 ==> 10556(ms)
> rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
> ...
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? walk_system_ram_range+0x63/0x120
> ? walk_system_ram_range+0x46/0x120
> ? pgprot_writethrough+0x20/0x20
> lookup_memtype+0x67/0xf0
> track_pfn_insert+0x20/0x40
> vmf_insert_pfn_prot+0x88/0x140
> vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
> __do_fault+0x28/0x1b0
> handle_mm_fault+0xef1/0x2560
> fixup_user_fault+0xf5/0x270
> vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
> vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
> vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
> ? futex_wake+0x1c1/0x260
> x64_sys_call+0x234/0x17a0
> do_syscall_64+0x63/0x130
> ? exc_page_fault+0x63/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
>
> Merged up to vfio/next
>
> Moved the cond_resched() to a more appropriate place within the
> loop, and added a comment about why it's there.
>
> Update to change log describing one of the consequences of not doing
> this.
>
> drivers/vfio/vfio_iommu_type1.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1136d7ac6b597..ad599b1601711 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>
> while (npage) {
> if (!batch->size) {
> + /*
> + * Large mappings may take a while to repeatedly refill
> + * the batch, so conditionally relinquish the CPU when
> + * needed to avoid stalls.
> + */
> + cond_resched();
> +
> /* Empty batch, so refill it. */
> ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
> &pfn, batch);
Applied to vfio next branch for v6.17. Thanks,
Alex