* [PATCH] vfio/type1: conditional rescheduling while pinning
@ 2025-03-12 22:52 Keith Busch
2025-03-17 21:44 ` Alex Williamson
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-03-12 22:52 UTC (permalink / raw)
To: alex.williamson, kvm; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
A large DMA mapping request can loop through DMA address pinning for
many pages. The repeated vmf_insert_pfn() calls can be costly, so let the
task reschedule as needed to prevent CPU stalls.
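The loop in question has roughly this shape (a simplified sketch with a
hypothetical pin_next_chunk() helper, not the driver code verbatim):

	static long pin_user_range(unsigned long vaddr, long npage)
	{
		long pinned = 0;

		while (npage > 0) {
			/* May fault in and pin the next chunk; for a large
			 * unbroken range, nothing in here yields the CPU. */
			long ret = pin_next_chunk(vaddr, npage);

			if (ret <= 0)
				break;

			vaddr += ret * PAGE_SIZE;
			npage -= ret;
			pinned += ret;

			cond_resched();	/* the yield point added here */
		}

		return pinned;
	}

The resulting stall: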
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
rcu:              hardirqs   softirqs   csw/system
rcu:      number:        0        107          0
rcu:     cputime:       50          0      10446   ==> 10556(ms)
rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
<TASK>
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? walk_system_ram_range+0x63/0x120
? walk_system_ram_range+0x46/0x120
? pgprot_writethrough+0x20/0x20
lookup_memtype+0x67/0xf0
track_pfn_insert+0x20/0x40
vmf_insert_pfn_prot+0x88/0x140
vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
__do_fault+0x28/0x1b0
handle_mm_fault+0xef1/0x2560
fixup_user_fault+0xf5/0x270
vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
? futex_wake+0x1c1/0x260
x64_sys_call+0x234/0x17a0
do_syscall_64+0x63/0x130
? exc_page_fault+0x63/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/vfio/vfio_iommu_type1.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 50ebc9593c9d7..9ad5fcc2de7c7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 		if (unlikely(disable_hugepages))
 			break;
+		cond_resched();
 	}
 
 out:
--
2.47.1
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-12 22:52 [PATCH] vfio/type1: conditional rescheduling while pinning Keith Busch
@ 2025-03-17 21:44 ` Alex Williamson
2025-03-17 22:30 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-03-17 21:44 UTC (permalink / raw)
To: Keith Busch; +Cc: kvm, Keith Busch
On Wed, 12 Mar 2025 15:52:55 -0700
Keith Busch <kbusch@meta.com> wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A large DMA mapping request can loop through DMA address pinning for
> many pages. The repeated vmf_insert_pfn() calls can be costly, so let the
> task reschedule as needed to prevent CPU stalls.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
> rcu:              hardirqs   softirqs   csw/system
> rcu:      number:        0        107          0
> rcu:     cputime:       50          0      10446   ==> 10556(ms)
> rcu: (t=21075 jiffies g=377761 q=204059 ncpus=384)
> ...
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? walk_system_ram_range+0x63/0x120
> ? walk_system_ram_range+0x46/0x120
> ? pgprot_writethrough+0x20/0x20
> lookup_memtype+0x67/0xf0
> track_pfn_insert+0x20/0x40
> vmf_insert_pfn_prot+0x88/0x140
> vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
> __do_fault+0x28/0x1b0
> handle_mm_fault+0xef1/0x2560
> fixup_user_fault+0xf5/0x270
> vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
> vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
> vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
> ? futex_wake+0x1c1/0x260
> x64_sys_call+0x234/0x17a0
> do_syscall_64+0x63/0x130
> ? exc_page_fault+0x63/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/vfio/vfio_iommu_type1.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 50ebc9593c9d7..9ad5fcc2de7c7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>
> 		if (unlikely(disable_hugepages))
> 			break;
> +		cond_resched();
> 	}
>
> out:
Hey Keith, is this still necessary with:
https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
This is currently in linux-next from the vfio next branch and should
pretty much eliminate any stalls related to DMA mapping MMIO BARs.
Also the code here has been refactored in next, so this doesn't apply
anyway, and if there is a resched still needed, this location would
only affect DMA mapping of memory, not device BARs. Thanks,
Alex
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-17 21:44 ` Alex Williamson
@ 2025-03-17 22:30 ` Keith Busch
2025-03-17 22:53 ` Alex Williamson
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-03-17 22:30 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Mon, Mar 17, 2025 at 03:44:17PM -0600, Alex Williamson wrote:
> On Wed, 12 Mar 2025 15:52:55 -0700
> > @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >
> > 		if (unlikely(disable_hugepages))
> > 			break;
> > +		cond_resched();
> > 	}
> >
> > out:
>
> Hey Keith, is this still necessary with:
>
> https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
Thank you for the suggestion. I'll try to fold this into a build, and
see what happens. But from what I can tell, I'm not sure it will help.
We're simply not getting large folios in this path; we're dealing with
individual pages, even though it is a large contiguous range (~60GB, not
necessarily aligned). Should we expect to only be dealing with PUD and
PMD levels with these kinds of mappings?
> This is currently in linux-next from the vfio next branch and should
> pretty much eliminate any stalls related to DMA mapping MMIO BARs.
> Also the code here has been refactored in next, so this doesn't apply
> anyway, and if there is a resched still needed, this location would
> only affect DMA mapping of memory, not device BARs. Thanks,
Thanks for the heads-up. Regardless, this doesn't look like a bad place
for a cond_resched(), though it may not trigger any CPU stall indicator
outside this vfio fault path.
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-17 22:30 ` Keith Busch
@ 2025-03-17 22:53 ` Alex Williamson
2025-03-19 15:47 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-03-17 22:53 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, kvm
On Mon, 17 Mar 2025 16:30:47 -0600
Keith Busch <kbusch@kernel.org> wrote:
> On Mon, Mar 17, 2025 at 03:44:17PM -0600, Alex Williamson wrote:
> > On Wed, 12 Mar 2025 15:52:55 -0700
> > > @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > >
> > > 		if (unlikely(disable_hugepages))
> > > 			break;
> > > +		cond_resched();
> > > 	}
> > >
> > > out:
> >
> > Hey Keith, is this still necessary with:
> >
> > https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
>
> Thank you for the suggestion. I'll try to fold this into a build, and
> see what happens. But from what I can tell, I'm not sure it will help.
> We're simply not getting large folios in this path; we're dealing with
> individual pages, even though it is a large contiguous range (~60GB, not
> necessarily aligned). Should we expect to only be dealing with PUD and
> PMD levels with these kinds of mappings?
IME with QEMU, PMD alignment basically happens without any effort and
gets 90+% of the way there, PUD alignment requires a bit of work[1].
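For anyone wanting the same effect outside QEMU, the usual trick is to
over-allocate and trim; a sketch (assuming a power-of-two, page-multiple
align; the caller would then map the BAR at the returned address with
MAP_FIXED):

	#include <stdint.h>
	#include <sys/mman.h>

	static void *mmap_aligned(size_t size, size_t align)
	{
		size_t total = size + align;
		uint8_t *raw, *aligned;

		raw = mmap(NULL, total, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (raw == MAP_FAILED)
			return NULL;

		aligned = (uint8_t *)(((uintptr_t)raw + align - 1) &
				      ~((uintptr_t)align - 1));

		/* Give back the unaligned head and tail. */
		if (aligned != raw)
			munmap(raw, aligned - raw);
		if (total - (aligned - raw) - size)
			munmap(aligned + size, total - (aligned - raw) - size);

		return aligned;
	}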
> > This is currently in linux-next from the vfio next branch and should
> > pretty much eliminate any stalls related to DMA mapping MMIO BARs.
> > Also the code here has been refactored in next, so this doesn't apply
> > anyway, and if there is a resched still needed, this location would
> > only affect DMA mapping of memory, not device BARs. Thanks,
>
> Thanks for the heads-up. Regardless, this doesn't look like a bad place
> for a cond_resched(), though it may not trigger any CPU stall indicator
> outside this vfio fault path.
Note that we already have a cond_resched() in vfio_iommu_map(), which
we'll hit any time we get a break in a contiguous mapping. We may hit
that regularly enough that it's not an issue for RAM mapping, but I've
certainly seen soft lockups when we have many GiB of contiguous pfnmaps
prior to the series above. Thanks,
Alex
[1]https://gitlab.com/qemu-project/qemu/-/commit/00b519c0bca0e933ed22e2e6f8bca6b23f41f950
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-17 22:53 ` Alex Williamson
@ 2025-03-19 15:47 ` Keith Busch
2025-03-19 18:17 ` Alex Williamson
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-03-19 15:47 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Mon, Mar 17, 2025 at 04:53:47PM -0600, Alex Williamson wrote:
> On Mon, 17 Mar 2025 16:30:47 -0600
> Keith Busch <kbusch@kernel.org> wrote:
>
> > On Mon, Mar 17, 2025 at 03:44:17PM -0600, Alex Williamson wrote:
> > > On Wed, 12 Mar 2025 15:52:55 -0700
> > > > @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > >
> > > > 		if (unlikely(disable_hugepages))
> > > > 			break;
> > > > +		cond_resched();
> > > > 	}
> > > >
> > > > out:
> > >
> > > Hey Keith, is this still necessary with:
> > >
> > > https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
> >
> > Thank you for the suggestion. I'll try to fold this into a build, and
> > see what happens. But from what I can tell, I'm not sure it will help.
> > We're simply not getting large folios in this path; we're dealing with
> > individual pages, even though it is a large contiguous range (~60GB, not
> > necessarily aligned). Should we expect to only be dealing with PUD and
> > PMD levels with these kinds of mappings?
>
> IME with QEMU, PMD alignment basically happens without any effort and
> gets 90+% of the way there, PUD alignment requires a bit of work[1].
>
> > > This is currently in linux-next from the vfio next branch and should
> > > pretty much eliminate any stalls related to DMA mapping MMIO BARs.
> > > Also the code here has been refactored in next, so this doesn't apply
> > > anyway, and if there is a resched still needed, this location would
> > > only affect DMA mapping of memory, not device BARs. Thanks,
> >
> > Thanks for the heads-up. Regardless, this doesn't look like a bad place
> > for a cond_resched(), though it may not trigger any CPU stall indicator
> > outside this vfio fault path.
>
> Note that we already have a cond_resched() in vfio_iommu_map(), which
> we'll hit any time we get a break in a contiguous mapping. We may hit
> that regularly enough that it's not an issue for RAM mapping, but I've
> certainly seen soft lockups when we have many GiB of contiguous pfnmaps
> prior to the series above. Thanks,
So far adding the additional patches has not changed anything. We've
ensured we are using an address and length aligned to 2MB, but it sure
looks like vfio's fault handler is only getting order-0 faults. I'm not
finding anything immediately obvious about what we can change to get the
desired higher-order behavior, though. Any other hints or information I
could provide?
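For reference, the alignment check we're applying is just the usual
power-of-two mask against the 2 MiB PMD size (a sketch; x86-64 page
table geometry assumed):

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	#define SZ_2M	(1UL << 21)

	/* Both the user VA and the mapping length must be PMD-aligned
	 * before a huge fault is even possible at that level. */
	static bool pmd_aligned(uintptr_t vaddr, size_t len)
	{
		return !(vaddr & (SZ_2M - 1)) && !(len & (SZ_2M - 1));
	}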
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-19 15:47 ` Keith Busch
@ 2025-03-19 18:17 ` Alex Williamson
2025-03-19 18:34 ` Keith Busch
2025-07-09 20:18 ` Keith Busch
0 siblings, 2 replies; 11+ messages in thread
From: Alex Williamson @ 2025-03-19 18:17 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, kvm
On Wed, 19 Mar 2025 09:47:05 -0600
Keith Busch <kbusch@kernel.org> wrote:
> On Mon, Mar 17, 2025 at 04:53:47PM -0600, Alex Williamson wrote:
> > On Mon, 17 Mar 2025 16:30:47 -0600
> > Keith Busch <kbusch@kernel.org> wrote:
> >
> > > On Mon, Mar 17, 2025 at 03:44:17PM -0600, Alex Williamson wrote:
> > > > On Wed, 12 Mar 2025 15:52:55 -0700
> > > > > @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > > >
> > > > > 		if (unlikely(disable_hugepages))
> > > > > 			break;
> > > > > +		cond_resched();
> > > > > 	}
> > > > >
> > > > > out:
> > > >
> > > > Hey Keith, is this still necessary with:
> > > >
> > > > https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
> > >
> > > Thank you for the suggestion. I'll try to fold this into a build, and
> > > see what happens. But from what I can tell, I'm not sure it will help.
> > > We're simply not getting large folios in this path; we're dealing with
> > > individual pages, even though it is a large contiguous range (~60GB, not
> > > necessarily aligned). Should we expect to only be dealing with PUD and
> > > PMD levels with these kinds of mappings?
> >
> > IME with QEMU, PMD alignment basically happens without any effort and
> > gets 90+% of the way there, PUD alignment requires a bit of work[1].
> >
> > > > This is currently in linux-next from the vfio next branch and should
> > > > pretty much eliminate any stalls related to DMA mapping MMIO BARs.
> > > > Also the code here has been refactored in next, so this doesn't apply
> > > > anyway, and if there is a resched still needed, this location would
> > > > only affect DMA mapping of memory, not device BARs. Thanks,
> > >
> > > Thanks for the heads-up. Regardless, this doesn't look like a bad place
> > > for a cond_resched(), though it may not trigger any CPU stall indicator
> > > outside this vfio fault path.
> >
> > Note that we already have a cond_resched() in vfio_iommu_map(), which
> > we'll hit any time we get a break in a contiguous mapping. We may hit
> > that regularly enough that it's not an issue for RAM mapping, but I've
> > certainly seen soft lockups when we have many GiB of contiguous pfnmaps
> > prior to the series above. Thanks,
>
> So far adding the additional patches has not changed anything. We've
> ensured we are using an address and length aligned to 2MB, but it sure
> looks like vfio's fault handler is only getting order-0 faults. I'm not
> finding anything immediately obvious about what we can change to get the
> desired higher-order behavior, though. Any other hints or information I
> could provide?
Since you mention folding in the changes, are you working on an upstream
kernel or a downstream backport? Huge pfnmap support was added in
v6.12 via [1]. Without that you'd never see better than an order-0
fault. I hope that's it, because with all the kernel pieces in place it
should "Just work". Thanks,
Alex
[1] https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-19 18:17 ` Alex Williamson
@ 2025-03-19 18:34 ` Keith Busch
2025-03-19 22:13 ` Keith Busch
2025-07-09 20:18 ` Keith Busch
1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-03-19 18:34 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Wed, Mar 19, 2025 at 12:17:04PM -0600, Alex Williamson wrote:
> Since you mention folding in the changes, are you working on an upstream
> kernel or a downstream backport? Huge pfnmap support was added in
> v6.12 via [1]. Without that you'd never see better than an order-0
> fault. I hope that's it, because with all the kernel pieces in place it
> should "Just work". Thanks,
Yep, this is a backport to 6.11, and I included that series. There were
a few extra patches outside it needed to port that far back, but nothing
difficult.
Anyway, since my last email things are looking more successful. We
changed a few things on both the user and kernel side, so we're running
more tests to confirm which change was the necessary one.
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-19 18:34 ` Keith Busch
@ 2025-03-19 22:13 ` Keith Busch
0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2025-03-19 22:13 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Wed, Mar 19, 2025 at 12:34:39PM -0600, Keith Busch wrote:
> On Wed, Mar 19, 2025 at 12:17:04PM -0600, Alex Williamson wrote:
> > Since you mention folding in the changes, are you working on an upstream
> > kernel or a downstream backport? Huge pfnmap support was added in
> > v6.12 via [1]. Without that you'd never see better than an order-0
> > fault. I hope that's it, because with all the kernel pieces in place it
> > should "Just work". Thanks,
>
> Yep, this is a backport to 6.11, and I included that series. There were
> a few extra patches outside it needed to port that far back, but nothing
> difficult.
>
> Anyway, since my last email things are looking more successful. We
> changed a few things on both the user and kernel side, so we're running
> more tests to confirm which change was the necessary one.
Looks like we're okay now. I think the user space part was missing a
MADV_HUGEPAGE in one of the paths, so we were never getting the huge
faults.
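For reference, the missing hint amounts to this (a sketch; one of our
mapping setup paths simply never issued the madvise):

	#include <sys/mman.h>

	size_t len = 60UL << 30;	/* the ~60GB region in question */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* With the system THP policy set to "madvise", only VMAs
	 * carrying this flag are eligible for huge faults. */
	if (buf != MAP_FAILED)
		madvise(buf, len, MADV_HUGEPAGE);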
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-03-19 18:17 ` Alex Williamson
2025-03-19 18:34 ` Keith Busch
@ 2025-07-09 20:18 ` Keith Busch
2025-07-11 20:16 ` Alex Williamson
1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-07-09 20:18 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Wed, Mar 19, 2025 at 12:17:04PM -0600, Alex Williamson wrote:
> On Wed, 19 Mar 2025 09:47:05 -0600
> > >
> > > Note that we already have a cond_resched() in vfio_iommu_map(), which
> > > we'll hit any time we get a break in a contiguous mapping. We may hit
> > > that regularly enough that it's not an issue for RAM mapping, but I've
> > > certainly seen soft lockups when we have many GiB of contiguous pfnmaps
> > > prior to the series above. Thanks,
> >
> > So far adding the additional patches has not changed anything. We've
> > ensured we are using an address and length aligned to 2MB, but it sure
> > looks like vfio's fault handler is only getting order-0 faults. I'm not
> > finding anything immediately obvious about what we can change to get the
> > desired higher-order behavior, though. Any other hints or information I
> > could provide?
>
> Since you mention folding in the changes, are you working on an upstream
> kernel or a downstream backport? Huge pfnmap support was added in
> v6.12 via [1]. Without that you'd never see better than an order-0
> fault. I hope that's it, because with all the kernel pieces in place it
> should "Just work". Thanks,
I think I'm back to needing a cond_resched(). I'm finding that too many
userspace programs, including QEMU, for various reasons do not utilize
hugepage faults, and we're ultimately locking up a CPU for long enough
to cause other nasty side effects, like OOMs due to blocked RCU free
callbacks. As preferable as it is to get everything aligned to use the
faster faults, I don't think the kernel should depend on that to prevent
prolonged CPU lockups. What do you think?
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-07-09 20:18 ` Keith Busch
@ 2025-07-11 20:16 ` Alex Williamson
2025-07-11 20:40 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-07-11 20:16 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, kvm
On Wed, 9 Jul 2025 14:18:58 -0600
Keith Busch <kbusch@kernel.org> wrote:
> On Wed, Mar 19, 2025 at 12:17:04PM -0600, Alex Williamson wrote:
> > On Wed, 19 Mar 2025 09:47:05 -0600
> > > >
> > > > Note that we already have a cond_resched() in vfio_iommu_map(), which
> > > > we'll hit any time we get a break in a contiguous mapping. We may hit
> > > > that regularly enough that it's not an issue for RAM mapping, but I've
> > > > certainly seen soft lockups when we have many GiB of contiguous pfnmaps
> > > > prior to the series above. Thanks,
> > >
> > > So far adding the additional patches has not changed anything. We've
> > > ensured we are using an address and length aligned to 2MB, but it sure
> > > looks like vfio's fault handler is only getting order-0 faults. I'm not
> > > finding anything immediately obvious about what we can change to get the
> > > desired higher-order behavior, though. Any other hints or information I
> > > could provide?
> >
> > Since you mention folding in the changes, are you working on an upstream
> > kernel or a downstream backport? Huge pfnmap support was added in
> > v6.12 via [1]. Without that you'd never see better than an order-0
> > fault. I hope that's it, because with all the kernel pieces in place it
> > should "Just work". Thanks,
>
> I think I'm back to needing a cond_resched(). I'm finding that too many
> userspace programs, including QEMU, for various reasons do not utilize
> hugepage faults, and we're ultimately locking up a CPU for long enough
> to cause other nasty side effects, like OOMs due to blocked RCU free
> callbacks. As preferable as it is to get everything aligned to use the
> faster faults, I don't think the kernel should depend on that to prevent
> prolonged CPU lockups. What do you think?
I'm not opposed to adding a cond_resched, but I'll also note that Peter
Xu has been working on a series that tries to get the right mapping
alignment automatically. It's still a WIP, but it'd be good to know if
that resolves the remaining userspace issues you've seen or we're still
susceptible to apps that aren't even trying to use THP:
https://lore.kernel.org/all/20250613134111.469884-1-peterx@redhat.com/
Thanks,
Alex
* Re: [PATCH] vfio/type1: conditional rescheduling while pinning
2025-07-11 20:16 ` Alex Williamson
@ 2025-07-11 20:40 ` Keith Busch
0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2025-07-11 20:40 UTC (permalink / raw)
To: Alex Williamson; +Cc: Keith Busch, kvm
On Fri, Jul 11, 2025 at 02:16:57PM -0600, Alex Williamson wrote:
> On Wed, 9 Jul 2025 14:18:58 -0600
> Keith Busch <kbusch@kernel.org> wrote:
> > I think I'm back to needing a cond_resched(). I'm finding that too many
> > userspace programs, including QEMU, for various reasons do not utilize
> > hugepage faults, and we're ultimately locking up a CPU for long enough
> > to cause other nasty side effects, like OOMs due to blocked RCU free
> > callbacks. As preferable as it is to get everything aligned to use the
> > faster faults, I don't think the kernel should depend on that to prevent
> > prolonged CPU lockups. What do you think?
>
> I'm not opposed to adding a cond_resched, but I'll also note that Peter
> Xu has been working on a series that tries to get the right mapping
> alignment automatically. It's still a WIP, but it'd be good to know if
> that resolves the remaining userspace issues you've seen or we're still
> susceptible to apps that aren't even trying to use THP:
>
> https://lore.kernel.org/all/20250613134111.469884-1-peterx@redhat.com/
Yes, I saw that series and included that for testing as well.
But I'm finding many machines have the transparent_hugepage/enabled
policy set to "never" due to problems THP usage caused with other
hardware (CXL, I think). I'm trying to get these to use "madvise"
instead, but that's causing other performance regressions in some tests.
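For anyone else chasing this, the knob in question (the bracketed value
is the active policy; "never" suppresses huge faults entirely, while
"madvise" limits them to VMAs flagged with MADV_HUGEPAGE):

	$ cat /sys/kernel/mm/transparent_hugepage/enabled
	always [madvise] never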