kvm.vger.kernel.org archive mirror
* [PATCHv2] vfio/type1: conditional rescheduling while pinning
@ 2025-07-15 18:46 Keith Busch
  2025-07-15 23:47 ` Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Keith Busch @ 2025-07-15 18:46 UTC (permalink / raw)
  To: alex.williamson, kvm, linux-pci; +Cc: paulmck, Keith Busch

From: Keith Busch <kbusch@kernel.org>

A large DMA mapping request can loop through DMA address pinning for
many pages. In cases where THP cannot be used, the repeated vmf_insert_pfn
calls can be costly, so let the task reschedule as needed to prevent CPU
stalls. Failing to do so has potentially harmful side effects, like
increased memory pressure, as unrelated RCU tasks are unable to invoke
their reclaim callbacks, resulting in OOM conditions.

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
 rcu:            hardirqs   softirqs   csw/system
 rcu:    number:        0        107            0
 rcu:   cputime:       50          0        10446   ==> 10556(ms)
 rcu:   (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
  <TASK>
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? walk_system_ram_range+0x63/0x120
  ? walk_system_ram_range+0x46/0x120
  ? pgprot_writethrough+0x20/0x20
  lookup_memtype+0x67/0xf0
  track_pfn_insert+0x20/0x40
  vmf_insert_pfn_prot+0x88/0x140
  vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
  __do_fault+0x28/0x1b0
  handle_mm_fault+0xef1/0x2560
  fixup_user_fault+0xf5/0x270
  vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
  vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
  ? futex_wake+0x1c1/0x260
  x64_sys_call+0x234/0x17a0
  do_syscall_64+0x63/0x130
  ? exc_page_fault+0x63/0x130
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:

  Merged up to vfio/next

  Moved the cond_resched() to a more appropriate place within the
  loop, and added a comment about why it's there.

  Update to change log describing one of the consequences of not doing
  this.

 drivers/vfio/vfio_iommu_type1.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b597..ad599b1601711 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 	while (npage) {
 		if (!batch->size) {
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
+
 			/* Empty batch, so refill it. */
 			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
 					     &pfn, batch);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
  2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
@ 2025-07-15 23:47 ` Paul E. McKenney
  2025-07-16 12:32 ` Jason Gunthorpe
  2025-07-16 20:24 ` Alex Williamson
  2 siblings, 0 replies; 6+ messages in thread
From: Paul E. McKenney @ 2025-07-15 23:47 UTC (permalink / raw)
  To: Keith Busch; +Cc: alex.williamson, kvm, linux-pci, Keith Busch

On Tue, Jul 15, 2025 at 11:46:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> [...]
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>

From an RCU CPU stall-warning viewpoint, given that vaddr_get_pfns()
invokes mmap_read_lock(), this code can schedule:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>



* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
  2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
  2025-07-15 23:47 ` Paul E. McKenney
@ 2025-07-16 12:32 ` Jason Gunthorpe
  2025-07-16 13:47   ` Keith Busch
  2025-07-16 20:24 ` Alex Williamson
  2 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 12:32 UTC (permalink / raw)
  To: Keith Busch; +Cc: alex.williamson, kvm, linux-pci, paulmck, Keith Busch

On Tue, Jul 15, 2025 at 11:46:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> [...]
> 
>  drivers/vfio/vfio_iommu_type1.c | 7 +++++++
>  1 file changed, 7 insertions(+)

You should be making a matching change to iommufd too..

Jason


* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
  2025-07-16 12:32 ` Jason Gunthorpe
@ 2025-07-16 13:47   ` Keith Busch
  2025-07-16 13:51     ` Jason Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2025-07-16 13:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Keith Busch, alex.williamson, kvm, linux-pci, paulmck

On Wed, Jul 16, 2025 at 09:32:01AM -0300, Jason Gunthorpe wrote:
> 
> You should be making a matching change to iommufd too..

Yeah, okay. The colleague I've been working with on this hasn't yet
reported CPU stalls using iommufd though, so I'm not sure which path
iommufd takes to know where to place cond_resched(). Blindly stumbling
through it, my best guess right now is arriving at the loop in
iopt_fill_domains_pages(), called through iommufd_vfio_map_dma(). Am I
close?


* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
  2025-07-16 13:47   ` Keith Busch
@ 2025-07-16 13:51     ` Jason Gunthorpe
  0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 13:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: Keith Busch, alex.williamson, kvm, linux-pci, paulmck

On Wed, Jul 16, 2025 at 07:47:04AM -0600, Keith Busch wrote:
> On Wed, Jul 16, 2025 at 09:32:01AM -0300, Jason Gunthorpe wrote:
> > 
> > You should be making a matching change to iommufd too..
> 
> Yeah, okay. My colleauge I've been working with on this hasn't yet
> reported CPU stalls using iommufd though, so I'm not sure which path
> iommufd takes to know where to place cond_resched(). Blindly stumbly
> through it, my best guess right now is arriving at the loop in
> iopt_fill_domains_pages(), called through iommufd_vfio_map_dma(). Am I
> close?

Yeah, I would guess maybe put it in pfn_reader_next().

But iommufd won't be faulting the MMIO memory from VFIO, so you can't
hit the exact same condition you are describing.

Jason


* Re: [PATCHv2] vfio/type1: conditional rescheduling while pinning
  2025-07-15 18:46 [PATCHv2] vfio/type1: conditional rescheduling while pinning Keith Busch
  2025-07-15 23:47 ` Paul E. McKenney
  2025-07-16 12:32 ` Jason Gunthorpe
@ 2025-07-16 20:24 ` Alex Williamson
  2 siblings, 0 replies; 6+ messages in thread
From: Alex Williamson @ 2025-07-16 20:24 UTC (permalink / raw)
  To: Keith Busch; +Cc: kvm, linux-pci, paulmck, Keith Busch

On Tue, 15 Jul 2025 11:46:22 -0700
Keith Busch <kbusch@meta.com> wrote:

> From: Keith Busch <kbusch@kernel.org>
> 
> [...]
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1136d7ac6b597..ad599b1601711 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  
>  	while (npage) {
>  		if (!batch->size) {
> +			/*
> +			 * Large mappings may take a while to repeatedly refill
> +			 * the batch, so conditionally relinquish the CPU when
> +			 * needed to avoid stalls.
> +			 */
> +			cond_resched();
> +
>  			/* Empty batch, so refill it. */
>  			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
>  					     &pfn, batch);

Applied to vfio next branch for v6.17.  Thanks,

Alex


