[PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
@ 2026-03-17 15:07 Max Boone via B4 Relay
  2026-03-18 21:22 ` Alex Williamson
  0 siblings, 1 reply; 6+ messages in thread
From: Max Boone via B4 Relay @ 2026-03-17 15:07 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, stable, Max Boone

From: Max Boone <mboone@akamai.com>

A race between page table walking (e.g. via procfs numa_maps) and VFIO DMA
pinning can lead to temporary failures in follow_pfnmap_start(). When a
PUD entry is split and concurrently refaulted, the PFNMAP mapping may be
temporarily zapped, causing follow_pfnmap_start() to return an error.

Although follow_pfnmap_start() returns an -EINVAL this is not due to
invalid parameters, but rather because of the pfnmap being non-present.
Treat it as such, and retry by returning -EAGAIN, similar to how GUP
handles such races.

This avoids propagating an unexpected -EINVAL to userspace, like follows:
[dma_map]
dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000

Which would've succeeded on a retry.

Cc: stable@vger.kernel.org
Fixes: a77f9489f1d7 ("vfio: use the new follow_pfnmap API")
Signed-off-by: Max Boone <mboone@akamai.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5167bec14..3a0d0bbb9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -559,9 +559,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 		if (ret)
 			return ret;
 
+		/*
+		 * follow_pfnmap_start() returns -EINVAL for
+		 * invalid parameters and non-present entries.
+		 * If that happens here after a successful
+		 * fixup_user_fault(), it is likely that the
+		 * pfnmap has been zapped. Retry instead of
+		 * failing.
+		 */
 		ret = follow_pfnmap_start(&args);
 		if (ret)
-			return ret;
+			return -EAGAIN;
 	}
 
 	if (write_fault && !args.writable) {

---
base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf

Best regards,
-- 
Max Boone <mboone@akamai.com>



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
  2026-03-17 15:07 [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped Max Boone via B4 Relay
@ 2026-03-18 21:22 ` Alex Williamson
  2026-03-19  8:36   ` Boone, Max
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Williamson @ 2026-03-18 21:22 UTC (permalink / raw)
  To: Max Boone via B4 Relay; +Cc: mboone, kvm, linux-kernel, stable, alex

On Tue, 17 Mar 2026 16:07:45 +0100
Max Boone via B4 Relay <devnull+mboone.akamai.com@kernel.org> wrote:

> From: Max Boone <mboone@akamai.com>
> 
> A race between page table walking (e.g. via procfs numa_maps) and VFIO DMA
> pinning can lead to temporary failures in follow_pfnmap_start(). When a
> PUD entry is split and concurrently refaulted, the PFNMAP mapping may be
> temporarily zapped, causing follow_pfnmap_start() to return an error.
> 
> Although follow_pfnmap_start() returns an -EINVAL this is not due to
> invalid parameters, but rather because of the pfnmap being non-present.
> Treat it as such, and retry by returning -EAGAIN, similar to how GUP
> handles such races.
> 
> This avoids propagating an unexpected -EINVAL to userspace, like follows:
> [dma_map]
> dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
> dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
> dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000
> 
> Which would've succeeded on a retry.
> 
> Cc: stable@vger.kernel.org
> Fixes: a77f9489f1d7 ("vfio: use the new follow_pfnmap API")
> Signed-off-by: Max Boone <mboone@akamai.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5167bec14..3a0d0bbb9 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -559,9 +559,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>  		if (ret)
>  			return ret;
>  
> +		/*
> +		 * follow_pfnmap_start() returns -EINVAL for
> +		 * invalid parameters and non-present entries.
> +		 * If that happens here after a successful
> +		 * fixup_user_fault(), it is likely that the
> +		 * pfnmap has been zapped. Retry instead of
> +		 * failing.
> +		 */

It's a little stronger than that, right?  We're betting that the only
remaining non-zero return is due to a race and we can introduce what
appears to be potential for an infinite loop here because -EAGAIN will
get kicked out to redo the vma_lookup() and fixup_user_fault() should
return a genuine error if we're completely in the weeds.  Should we
make this a little stronger and more specific?  Thanks,

Alex

>  		ret = follow_pfnmap_start(&args);
>  		if (ret)
> -			return ret;
> +			return -EAGAIN;
>  	}
>  
>  	if (write_fault && !args.writable) {
> 
> ---
> base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
> change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf
> 
> Best regards,


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
  2026-03-18 21:22 ` Alex Williamson
@ 2026-03-19  8:36   ` Boone, Max
  2026-03-19 13:18     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 6+ messages in thread
From: Boone, Max @ 2026-03-19  8:36 UTC (permalink / raw)
  To: Alex Williamson, David Hildenbrand
  Cc: Max Boone via B4 Relay, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2222 bytes --]



> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote:
> 
> […]
> 
>> + /*
>> + * follow_pfnmap_start() returns -EINVAL for
>> + * invalid parameters and non-present entries.
>> + * If that happens here after a successful
>> + * fixup_user_fault(), it is likely that the
>> + * pfnmap has been zapped. Retry instead of
>> + * failing.
>> + */
> 
> It's a little stronger than that, right?  We're betting that the only
> remaining non-zero return is due to a race and we can introduce what
> appears to be potential for an infinite loop here because -EAGAIN will
> get kicked out to redo the vma_lookup() and fixup_user_fault() should
> return a genuine error if we're completely in the weeds.  Should we
> make this a little stronger and more specific?  Thanks,

I’d say that the best case would be to have follow_pfnmap_start() return
-EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
again, we could theoretically run into an infinite loop I guess - as the zap
and faulting could run in lockstep (the race window is extremely small
though).

We could make the retry above bounded, and bubble up a -EBUSY such
that users of the ioctl can decide to retry instead of fail?

David, you mentioned that gup already has retry logic that we don’t have
with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run
into an infinite loop with this change?

I see that the same pattern also appears in:
- virt/kvm/kvm_main.c:hva_to_pfn_remapped()
- s390/pci/pci_mmio.c:s390_pci_mmio_write()

And other users of the function are:
- drivers/virt/acrn/mm.c:acrn_vm_ram_map()
- mm/memory.c:generic_access_phys()

Maybe I can better draft in this patch with adding an -ENOENT return to
follow_pfnmap_start() and in the same patchset add retrying logic where
necessary (on first sight, kvm and vfio)?

> 
> Alex
> 
>> ret = follow_pfnmap_start(&args);
>> if (ret)
>> - return ret;
>> + return -EAGAIN;
>> }
>> 
>> if (write_fault && !args.writable) {
>> 
>> ---
>> base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
>> change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf
>> 
>> Best regards,
> 


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3061 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
  2026-03-19  8:36   ` Boone, Max
@ 2026-03-19 13:18     ` David Hildenbrand (Arm)
  2026-03-19 14:30       ` Alex Williamson
  0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 13:18 UTC (permalink / raw)
  To: Boone, Max, Alex Williamson
  Cc: Max Boone via B4 Relay, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On 3/19/26 09:36, Boone, Max wrote:
> 
> 
>> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote:
>>
>> […]
>>
>>> + /*
>>> + * follow_pfnmap_start() returns -EINVAL for
>>> + * invalid parameters and non-present entries.
>>> + * If that happens here after a successful
>>> + * fixup_user_fault(), it is likely that the
>>> + * pfnmap has been zapped. Retry instead of
>>> + * failing.
>>> + */
>>
>> It's a little stronger than that, right?  We're betting that the only
>> remaining non-zero return is due to a race and we can introduce what
>> appears to be potential for an infinite loop here because -EAGAIN will
>> get kicked out to redo the vma_lookup() and fixup_user_fault() should
>> return a genuine error if we're completely in the weeds.  Should we
>> make this a little stronger and more specific?  Thanks,
> 
> I’d say that the best case would be to have follow_pfnmap_start() return
> -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
> again, we could theoretically run into an infinite loop I guess - as the zap
> and faulting could run in lockstep (the race window is extremely small
> though).

Well, in theory :) To hit that race repeatedly, you'd really have to be
quite lucky I guess.

But the real question is: if user space triggered the pinning, and user
space keeps hurting itself to make progress, is that a real problem?

I guess the crucial part would be to

a) Have some cond_resched(() in there?
b) Checking for fatal signals somewhere?
c) Possibly drop locks (mmap lock?) every now and then?

For GUP, a) and b) are in place in __get_user_pages().

c) might be done, but I think it's less deterministic.

> 
> We could make the retry above bounded, and bubble up a -EBUSY such
> that users of the ioctl can decide to retry instead of fail?

Would that be a possible ABI break? You'd really have to only do that in
a case where user space does stupid things, I guess.

> 
> David, you mentioned that gup already has retry logic that we don’t have
> with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run
> into an infinite loop with this change?

GUP triggers page faults through faultin_page(). If handle_mm_fault()
returns

* VM_FAULT_COMPLETED we return -EAGAIN
* VM_FAULT_ERROR we return the error
* VM_FAULT_RETRY we return -EBUSY
* Otherwise 0

In the caller __get_user_pages(), we
* Retry immediately with ret == 0
* Return to the GUP caller (letting it retry) with -EBUSY/-EAGAIN

Having at least a) and b) sounds reasonable. Not sure about having c),
might be tricky if we are not allowed to drop the lock.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
  2026-03-19 13:18     ` David Hildenbrand (Arm)
@ 2026-03-19 14:30       ` Alex Williamson
  2026-03-19 19:44         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Williamson @ 2026-03-19 14:30 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Boone, Max, Max Boone via B4 Relay, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org, alex

On Thu, 19 Mar 2026 14:18:49 +0100
"David Hildenbrand (Arm)" <david@kernel.org> wrote:

> On 3/19/26 09:36, Boone, Max wrote:
> > 
> >   
> >> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote:
> >>
> >> […]
> >>  
> >>> + /*
> >>> + * follow_pfnmap_start() returns -EINVAL for
> >>> + * invalid parameters and non-present entries.
> >>> + * If that happens here after a successful
> >>> + * fixup_user_fault(), it is likely that the
> >>> + * pfnmap has been zapped. Retry instead of
> >>> + * failing.
> >>> + */  
> >>
> >> It's a little stronger than that, right?  We're betting that the only
> >> remaining non-zero return is due to a race and we can introduce what
> >> appears to be potential for an infinite loop here because -EAGAIN will
> >> get kicked out to redo the vma_lookup() and fixup_user_fault() should
> >> return a genuine error if we're completely in the weeds.  Should we
> >> make this a little stronger and more specific?  Thanks,  
> > 
> > I’d say that the best case would be to have follow_pfnmap_start() return
> > -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
> > again, we could theoretically run into an infinite loop I guess - as the zap
> > and faulting could run in lockstep (the race window is extremely small
> > though).  
> 
> Well, in theory :) To hit that race repeatedly, you'd really have to be
> quite lucky I guess.
> 
> But the real question is: if user space triggered the pinning, and user
> space keeps hurting itself to make progress, is that a real problem?

I'd say no, that's not a problem.  It's really just that masking all
non-zero returns as -EAGAIN makes some assumptions about the other
error conditions that could change over time, so minimally those
dependencies should be clearly stated.  Even better would be if we
didn't need to make those assumptions.  Thanks,

Alex

> I guess the crucial part would be to
> 
> a) Have some cond_resched(() in there?
> b) Checking for fatal signals somewhere?
> c) Possibly drop locks (mmap lock?) every now and then?
> 
> For GUP, a) and b) are in place in __get_user_pages().
> 
> c) might be done, but I think it's less deterministic.
> 
> > 
> > We could make the retry above bounded, and bubble up a -EBUSY such
> > that users of the ioctl can decide to retry instead of fail?  
> 
> Would that be a possible ABI break? You'd really have to only do that in
> a case where user space does stupid things, I guess.
> 
> > 
> > David, you mentioned that gup already has retry logic that we don’t have
> > with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run
> > into an infinite loop with this change?  
> 
> GUP triggers page faults through faultin_page(). If handle_mm_fault()
> returns
> 
> * VM_FAULT_COMPLETED we return -EAGAIN
> * VM_FAULT_ERROR we return the error
> * VM_FAULT_RETRY we return -EBUSY
> * Otherwise 0
> 
> In the caller __get_user_pages(), we
> * Retry immediately with ret == 0
> * Return to the GUP caller (letting it retry) with -EBUSY/-EAGAIN
> 
> Having at least a) and b) sounds reasonable. Not sure about having c),
> might be tricky if we are not allowed to drop the lock.
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped
  2026-03-19 14:30       ` Alex Williamson
@ 2026-03-19 19:44         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 19:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Boone, Max, Max Boone via B4 Relay, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On 3/19/26 15:30, Alex Williamson wrote:
> On Thu, 19 Mar 2026 14:18:49 +0100
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
> 
>> On 3/19/26 09:36, Boone, Max wrote:
>>>
>>>   
>>>
>>> I’d say that the best case would be to have follow_pfnmap_start() return
>>> -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
>>> again, we could theoretically run into an infinite loop I guess - as the zap
>>> and faulting could run in lockstep (the race window is extremely small
>>> though).  
>>
>> Well, in theory :) To hit that race repeatedly, you'd really have to be
>> quite lucky I guess.
>>
>> But the real question is: if user space triggered the pinning, and user
>> space keeps hurting itself to make progress, is that a real problem?
> 
> I'd say no, that's not a problem.  It's really just that masking all
> non-zero returns as -EAGAIN makes some assumptions about the other
> error conditions that could change over time, so minimally those
> dependencies should be clearly stated.  Even better would be if we
> didn't need to make those assumptions. 

I agree that follow_pfnmap_start() should be improved, to return -EINVAL
on invalid parameters and -ENOENT on "no present entry"."

And -ENOENT should be documented to be resolvable by triggering a fault.

Looking at it, the retry label in the function also seems avoidable, and
some other checks might be shaky ...

Let me briefly see if I can rework that.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-19 19:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-17 15:07 [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped Max Boone via B4 Relay
2026-03-18 21:22 ` Alex Williamson
2026-03-19  8:36   ` Boone, Max
2026-03-19 13:18     ` David Hildenbrand (Arm)
2026-03-19 14:30       ` Alex Williamson
2026-03-19 19:44         ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox