* [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
2026-05-01 1:20 [PATCH 0/3] mm/hmm: Add mmap lock-drop support for userfaultfd-backed mappings Stanislav Kinsburskii
@ 2026-05-01 1:20 ` Stanislav Kinsburskii
2026-05-12 8:42 ` David Hildenbrand (Arm)
2026-05-01 1:20 ` [PATCH 2/3] mshv: Use hmm_range_fault_unlockable() for userfaultfd support Stanislav Kinsburskii
2026-05-01 1:20 ` [PATCH 3/3] selftests/mm: Add userfaultfd test for HMM unlockable path Stanislav Kinsburskii
2 siblings, 1 reply; 7+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-01 1:20 UTC (permalink / raw)
To: kys, Liam.Howlett, akpm, david, decui, haiyangz, jgg,
corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan, surenb,
vbabka, wei.liu
Cc: linux-doc, linux-hyperv, linux-kernel,
linux-kselftest, linux-mm
Add hmm_range_fault_unlockable(), a new HMM entry point that allows the
mmap read lock to be dropped during page faults. This follows the
int *locked pattern from get_user_pages_remote() in mm/gup.c: callers
pass an int *locked variable indicating they can handle the lock being
dropped.
When locked is non-NULL, hmm_vma_fault() adds FAULT_FLAG_ALLOW_RETRY
and FAULT_FLAG_KILLABLE to the fault flags passed to handle_mm_fault().
If the fault handler drops the mmap lock (returning VM_FAULT_RETRY or
VM_FAULT_COMPLETED), the function sets *locked = 0 and returns 0,
signalling the caller to restart its walk with a fresh notifier
sequence. Fatal signals are checked before returning, matching GUP
behavior. The caller is responsible for re-acquiring the lock and
restarting from the beginning, since previously collected PFNs may be
stale after the lock was dropped.
The existing hmm_range_fault() is refactored into a thin wrapper that
calls hmm_range_fault_unlockable(range, NULL). Passing NULL means
FAULT_FLAG_ALLOW_RETRY is never set, preserving existing behavior for
all current callers with no functional change.
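For illustration, the caller-side pattern looks roughly like this (condensed
from the Documentation/mm/hmm.rst example added by this patch; the
driver-side update lock and mmu_interval_read_retry() handling are elided):

again:
	range.notifier_seq = mmu_interval_read_begin(&interval_sub);
	locked = 1;
	mmap_read_lock(mm);
	ret = hmm_range_fault_unlockable(&range, &locked);
	if (locked)
		mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;
	if (!locked)
		/* lock was dropped; collected PFNs may be stale */
		goto again;
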
Faulting hugetlb pages is not supported on the unlockable path: if a
hugetlb page requires faulting, -EFAULT is returned. This is because
walk_hugetlb_range() holds hugetlb_vma_lock_read across the callback
and unconditionally unlocks on return; if the mmap lock is dropped
inside the callback the VMA may be freed, making the walk framework's
unlock a use-after-free. Hugetlb pages already present in page tables
are handled normally.
Documentation/mm/hmm.rst is updated with a new section describing the
unlockable API, its usage pattern, and the hugetlb limitation.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
Documentation/mm/hmm.rst | 89 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/hmm.h | 1 +
mm/hmm.c | 91 +++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 172 insertions(+), 9 deletions(-)
diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b7..13874b4dfd5f4 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -208,6 +208,95 @@ invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.
+mmap lock-drop support (hmm_range_fault_unlockable)
+=======================================================
+
+Some page fault handlers (e.g., userfaultfd) require the mmap lock to be
+dropped during fault resolution. Drivers that need to support such mappings
+can use::
+
+ int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
+
+This follows the same ``int *locked`` pattern used by ``get_user_pages_remote()``
+in ``mm/gup.c``. The caller sets ``*locked = 1`` and holds the mmap read lock
+before calling. If the lock is dropped during the fault (VM_FAULT_RETRY or
+VM_FAULT_COMPLETED), the function returns 0 with ``*locked = 0``, signalling
+the caller to restart its walk with a fresh notifier sequence. The caller is
+responsible for re-acquiring the lock and restarting from the beginning, since
+previously collected PFNs may be stale.
+
+The usage pattern is::
+
+ int driver_populate_range_unlockable(...)
+ {
+ struct hmm_range range;
+ int locked;
+ ...
+
+ range.notifier = &interval_sub;
+ range.start = ...;
+ range.end = ...;
+ range.hmm_pfns = ...;
+
+ if (!mmget_not_zero(interval_sub->notifier.mm))
+ return -EFAULT;
+
+ again:
+ range.notifier_seq = mmu_interval_read_begin(&interval_sub);
+ locked = 1;
+ mmap_read_lock(mm);
+ ret = hmm_range_fault_unlockable(&range, &locked);
+ if (locked)
+ mmap_read_unlock(mm);
+ if (ret) {
+ if (ret == -EBUSY)
+ goto again;
+ return ret;
+ }
+ if (!locked)
+ goto again;
+
+ take_lock(driver->update);
+ if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
+ release_lock(driver->update);
+ goto again;
+ }
+
+ /* Use pfns array content to update device page table,
+ * under the update lock */
+
+ release_lock(driver->update);
+ return 0;
+ }
+
+Passing a NULL ``locked`` pointer to ``hmm_range_fault_unlockable()`` is
+equivalent to calling ``hmm_range_fault()``: the lock will never be dropped.
+
+Note: hugetlb pages are not supported with the unlockable path. If a hugetlb
+page requires faulting during an ``hmm_range_fault_unlockable()`` call,
+``-EFAULT`` is returned. Hugetlb pages that are already present in page tables
+are handled normally.
+
+This limitation exists because ``walk_hugetlb_range()`` in the page walk
+framework holds ``hugetlb_vma_lock_read`` across the callback and unconditionally
+unlocks on return. If the mmap lock is dropped inside the callback (via
+VM_FAULT_RETRY), the VMA may be freed before the walk framework's unlock,
+resulting in a use-after-free. Possible approaches to lift this limitation in
+the future:
+
+1. Extend the walk framework to allow callbacks to signal that the hugetlb vma
+ lock was dropped (e.g., a flag in ``struct mm_walk`` that tells
+ ``walk_hugetlb_range()`` to skip the unlock).
+
+2. Bypass ``walk_page_range()`` for hugetlb pages in the unlockable path and
+ walk hugetlb page tables directly with custom lock management (similar to
+ how GUP handles hugetlb without the walk framework).
+
+3. Re-acquire the mmap lock before returning from the hugetlb callback (like
+ ``fixup_user_fault()``), ensuring the VMA remains valid for the walk
+ framework's unlock. This changes the "never re-take" contract and would
+ require callers to handle hugetlb differently.
+
Leverage default_flags and pfn_flags_mask
=========================================
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7a..46e581865c48a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -123,6 +123,7 @@ struct hmm_range {
* Please see Documentation/mm/hmm.rst for how to use the range API.
*/
int hmm_range_fault(struct hmm_range *range);
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
/*
* HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83db..9bf2fa37f2efd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,6 +33,7 @@
struct hmm_vma_walk {
struct hmm_range *range;
unsigned long last;
+ int *locked;
};
enum {
@@ -86,10 +87,28 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end,
fault_flags |= FAULT_FLAG_WRITE;
}
- for (; addr < end; addr += PAGE_SIZE)
- if (handle_mm_fault(vma, addr, fault_flags, NULL) &
- VM_FAULT_ERROR)
+ if (hmm_vma_walk->locked)
+ fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+ for (; addr < end; addr += PAGE_SIZE) {
+ vm_fault_t ret;
+
+ ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+
+ if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
+ /*
+ * The mmap lock has been dropped by the fault handler.
+ * Record the failing address and signal lock-drop to
+ * the caller.
+ */
+ *hmm_vma_walk->locked = 0;
+ hmm_vma_walk->last = addr;
+ return -EAGAIN;
+ }
+
+ if (ret & VM_FAULT_ERROR)
return -EFAULT;
+ }
return -EBUSY;
}
@@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
if (required_fault) {
int ret;
+ /*
+ * Faulting hugetlb pages on the unlockable path is not
+ * supported. The walk framework holds hugetlb_vma_lock_read
+ * which must be dropped before handle_mm_fault, but if the
+ * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
+ * be freed and the walk framework's unconditional unlock
+ * becomes a use-after-free.
+ */
+ if (hmm_vma_walk->locked)
+ return -EFAULT;
+
spin_unlock(ptl);
hugetlb_vma_unlock_read(vma);
/*
@@ -655,14 +685,49 @@ static const struct mm_walk_ops hmm_walk_ops = {
*
* This is similar to get_user_pages(), except that it can read the page tables
* without mutating them (ie causing faults).
+ *
+ * The mmap lock must be held by the caller and will remain held on return.
+ * For a variant that allows the mmap lock to be dropped during faults (e.g.,
+ * for userfaultfd support), see hmm_range_fault_unlockable().
*/
int hmm_range_fault(struct hmm_range *range)
{
+ return hmm_range_fault_unlockable(range, NULL);
+}
+EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_range_fault_unlockable - fault a range with mmap lock-drop support
+ * @range: argument structure
+ * @locked: pointer to lock state variable (input: 1; output: 0 if lock
+ * was dropped)
+ *
+ * Similar to hmm_range_fault() but allows the mmap lock to be dropped during
+ * page faults. This enables support for userfaultfd-backed mappings and other
+ * cases where handle_mm_fault() may need to release the mmap lock.
+ *
+ * The caller must hold the mmap read lock and set *locked = 1 before calling.
+ * On return:
+ * - *locked == 1: mmap lock is still held, return value has normal semantics
+ * - *locked == 0: mmap lock was dropped. The caller must re-acquire the lock
+ * and restart the operation. The return value is 0 in this case.
+ *
+ * When the fault handler drops the mmap lock (VM_FAULT_RETRY or
+ * VM_FAULT_COMPLETED), this function does not re-acquire it. It returns
+ * immediately with *locked == 0 so the caller can restart with a fresh
+ * notifier sequence; if a fatal signal is pending, -EINTR is returned.
+ *
+ * Returns 0 on success or a negative error code. See hmm_range_fault() for
+ * the full list of possible errors.
+ */
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked)
+{
+ struct mm_struct *mm = range->notifier->mm;
struct hmm_vma_walk hmm_vma_walk = {
.range = range,
.last = range->start,
+ .locked = locked,
};
- struct mm_struct *mm = range->notifier->mm;
int ret;
mmap_assert_locked(mm);
@@ -674,16 +739,24 @@ int hmm_range_fault(struct hmm_range *range)
return -EBUSY;
ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
&hmm_walk_ops, &hmm_vma_walk);
+ if (ret == -EAGAIN) {
+ /*
+ * The mmap lock was dropped during the fault
+ * (e.g. userfaultfd). Signal the caller to restart
+ * by returning with *locked = 0.
+ */
+ if (fatal_signal_pending(current))
+ return -EINTR;
+ return 0;
+ }
/*
- * When -EBUSY is returned the loop restarts with
- * hmm_vma_walk.last set to an address that has not been stored
- * in pfns. All entries < last in the pfn array are set to their
- * output, and all >= are still at their input values.
+ * -EBUSY: page table changed during the walk.
+ * Restart from hmm_vma_walk.last.
*/
} while (ret == -EBUSY);
return ret;
}
-EXPORT_SYMBOL(hmm_range_fault);
+EXPORT_SYMBOL(hmm_range_fault_unlockable);
/**
* hmm_dma_map_alloc - Allocate HMM map structure
* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
2026-05-01 1:20 ` [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support Stanislav Kinsburskii
@ 2026-05-12 8:42 ` David Hildenbrand (Arm)
2026-05-12 16:18 ` Stanislav Kinsburskii
0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-12 8:42 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys, Liam.Howlett, akpm, decui, haiyangz,
jgg, corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan,
surenb, vbabka, wei.liu
Cc: linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
> + for (; addr < end; addr += PAGE_SIZE) {
> + vm_fault_t ret;
> +
> + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> +
> + if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> + /*
> + * The mmap lock has been dropped by the fault handler.
> + * Record the failing address and signal lock-drop to
> + * the caller.
> + */
> + *hmm_vma_walk->locked = 0;
> + hmm_vma_walk->last = addr;
> + return -EAGAIN;
Okay, so we'll return straight from hmm_vma_fault() to
hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
the hmm_vma_fault() could be called by the caller of walk_page_range(), but
that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
requires the vma in there.
Note: am I wrong, or is hmm_vma_fault() really always called with
required_fault=true?
> + }
> +
> + if (ret & VM_FAULT_ERROR)
> return -EFAULT;
> + }
> return -EBUSY;
> }
>
> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> if (required_fault) {
> int ret;
>
> + /*
> + * Faulting hugetlb pages on the unlockable path is not
> + * supported. The walk framework holds hugetlb_vma_lock_read
> + * which must be dropped before handle_mm_fault, but if the
> + * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> + * be freed and the walk framework's unconditional unlock
> + * becomes a use-after-free.
> + */
> + if (hmm_vma_walk->locked)
> + return -EFAULT;
Just because it's unlockable doesn't mean that you must unlock. Can't this be
kept working as is, just simulating here as if it would not be unlockable?
--
Cheers,
David
* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
2026-05-12 8:42 ` David Hildenbrand (Arm)
@ 2026-05-12 16:18 ` Stanislav Kinsburskii
2026-05-12 19:18 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 7+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-12 16:18 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
>
> > + for (; addr < end; addr += PAGE_SIZE) {
> > + vm_fault_t ret;
> > +
> > + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> > +
> > + if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> > + /*
> > + * The mmap lock has been dropped by the fault handler.
> > + * Record the failing address and signal lock-drop to
> > + * the caller.
> > + */
> > + *hmm_vma_walk->locked = 0;
> > + hmm_vma_walk->last = addr;
> > + return -EAGAIN;
>
>
> Okay, so we'll return straight from hmm_vma_fault() to
> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
>
> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
> requires the vma in there.
>
It looks like a caller can provide a post_vma callback in mm_walk_ops. I
missed that case here. This callback cannot be supported by this change.
I will update the patch.
>
> Note: am I wrong, or is hmm_vma_fault() really always called with
> required_fault=true?
>
No, hmm_pte_need_fault can return false.
> > + }
> > +
> > + if (ret & VM_FAULT_ERROR)
> > return -EFAULT;
> > + }
> > return -EBUSY;
> > }
> >
> > @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> > if (required_fault) {
> > int ret;
> >
> > + /*
> > + * Faulting hugetlb pages on the unlockable path is not
> > + * supported. The walk framework holds hugetlb_vma_lock_read
> > + * which must be dropped before handle_mm_fault, but if the
> > + * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> > + * be freed and the walk framework's unconditional unlock
> > + * becomes a use-after-free.
> > + */
> > + if (hmm_vma_walk->locked)
> > + return -EFAULT;
>
> Just because it's unlockable doesn't mean that you must unlock. Can't this be
> kept working as is, just simulating here as if it would not be unlockable?
>
I’m not sure how to implement this. The walk_page_range code expects the
hugetlb VMA to still be read-locked when we return from
hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
be gone?
I added a note in the docs. Whoever tackles this will likely need to
either rework `walk_page_range` to handle the case where the VMA is
gone, or use a different approach.
Do you have any other suggestions on how to implement it?
Thanks,
Stanislav
>
> --
> Cheers,
>
> David
* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
2026-05-12 16:18 ` Stanislav Kinsburskii
@ 2026-05-12 19:18 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-12 19:18 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
>>
>>> + for (; addr < end; addr += PAGE_SIZE) {
>>> + vm_fault_t ret;
>>> +
>>> + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
>>> +
>>> + if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
>>> + /*
>>> + * The mmap lock has been dropped by the fault handler.
>>> + * Record the failing address and signal lock-drop to
>>> + * the caller.
>>> + */
>>> + *hmm_vma_walk->locked = 0;
>>> + hmm_vma_walk->last = addr;
>>> + return -EAGAIN;
>>
>>
>> Okay, so we'll return straight from hmm_vma_fault() to
>> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
>>
>> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
>> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
>> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
>> requires the vma in there.
>>
>
> It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> missed that case here. This callback cannot be supported by this change.
> I will update the patch.
>
>>
>> Note: am I wrong, or is hmm_vma_fault() really always called with
>> required_fault=true?
>>
>
> No, hmm_pte_need_fault can return false.
That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
required_fault = true;
IOW, there is always a "if (required_fault)" before it one way or the other.
Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
odd thing to do :)
>
>>> + }
>>> +
>>> + if (ret & VM_FAULT_ERROR)
>>> return -EFAULT;
>>> + }
>>> return -EBUSY;
>>> }
>>>
>>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>>> if (required_fault) {
>>> int ret;
>>>
>>> + /*
>>> + * Faulting hugetlb pages on the unlockable path is not
>>> + * supported. The walk framework holds hugetlb_vma_lock_read
>>> + * which must be dropped before handle_mm_fault, but if the
>>> + * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
>>> + * be freed and the walk framework's unconditional unlock
>>> + * becomes a use-after-free.
>>> + */
>>> + if (hmm_vma_walk->locked)
>>> + return -EFAULT;
>>
>> Just because it's unlockable doesn't mean that you must unlock. Can't this be
>> kept working as is, just simulating here as if it would not be unlockable?
>>
>
> I’m not sure how to implement this. The walk_page_range code expects the
> hugetlb VMA to still be read-locked when we return from
> hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> be gone?
>
> I added a note in the docs. Whoever tackles this will likely need to
> either rework `walk_page_range` to handle the case where the VMA is
> gone, or use a different approach.
>
> Do you have any other suggestions on how to implement it?
You just want hmm_vma_fault() to not set
"FAULT_FLAG_ALLOW_RETRY·|·FAULT_FLAG_KILLABLE".
The hacky way could be:
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..83dba990e10a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
if (required_fault) {
+ int *saved_locked = hmm_vma_walk->locked;
int ret;
spin_unlock(ptl);
@@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
* use here of either pte or ptl after dropping the vma
* lock.
*/
+ hmm_vma_walk->locked = NULL;
ret = hmm_vma_fault(addr, end, required_fault, walk);
+ hmm_vma_walk->locked = saved_locked;
hugetlb_vma_lock_read(vma);
return ret;
}
But really, I think we should just try to get uffd support working properly, not
excluding hugetlb.
GUP achieves it properly by performing the fault handling outside of page table
walking context ... essentially what I described in my first comment above:
return the information to the caller and let it just trigger the fault.
The issue here is that we trigger a fault out of walk_hugetlb_range() where we
still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
hugetlb_vma_lock_read pattern.
The fault should just be triggered from a place where we don't have to play with
hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.
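Roughly, something like the following (completely untested; the -ENOENT
"needs a fault" convention and fault_flags are made up here):

	do {
		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
				      &hmm_walk_ops, &hmm_vma_walk);
		if (ret == -ENOENT) {
			/*
			 * The walk only reported the address that needs
			 * faulting; fault it in here, outside of any
			 * walk-internal locking, then redo the walk.
			 */
			bool unlocked = false;

			ret = fixup_user_fault(mm, hmm_vma_walk.last,
					       fault_flags, &unlocked);
			if (!ret)
				ret = -EBUSY;
		}
	} while (ret == -EBUSY);
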
--
Cheers,
David
* [PATCH 2/3] mshv: Use hmm_range_fault_unlockable() for userfaultfd support
2026-05-01 1:20 [PATCH 0/3] mm/hmm: Add mmap lock-drop support for userfaultfd-backed mappings Stanislav Kinsburskii
2026-05-01 1:20 ` [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support Stanislav Kinsburskii
@ 2026-05-01 1:20 ` Stanislav Kinsburskii
2026-05-01 1:20 ` [PATCH 3/3] selftests/mm: Add userfaultfd test for HMM unlockable path Stanislav Kinsburskii
2 siblings, 0 replies; 7+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-01 1:20 UTC (permalink / raw)
To: kys, Liam.Howlett, akpm, david, decui, haiyangz, jgg,
corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan, surenb,
vbabka, wei.liu
Cc: linux-doc, linux-hyperv, linux-kernel,
linux-kselftest, linux-mm
Convert the mshv driver's HMM fault path to use
hmm_range_fault_unlockable() instead of hmm_range_fault(). This enables
userfaultfd-backed guest memory regions by allowing the mmap lock to be
dropped during page fault handling.
Extract the per-VMA walk into a dedicated mshv_region_hmm_fault_walk()
helper. The outer mshv_region_hmm_fault_and_lock() handles the do/while
restart loop: if the lock is dropped during a fault (userfaultfd resolution
or similar) or an invalidation occurs (-EBUSY), the function restarts the
entire walk from the beginning with a fresh notifier_seq, since the VMA
layout may have changed.
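In condensed form, the restart loop in mshv_region_hmm_fault_and_lock()
becomes:

	do {
		range.notifier_seq = mmu_interval_read_begin(range.notifier);
		locked = 1;
		mmap_read_lock(mm);
		ret = mshv_region_hmm_fault_walk(region, &range, start, end,
						 pfns, &locked, do_fault);
		if (locked)
			mmap_read_unlock(mm);
	} while (!locked || ret == -EBUSY);
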
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 127 +++++++++++++++++++++++++++++++--------------
1 file changed, 87 insertions(+), 40 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index d09940e88298e..05665446ca6d9 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -565,6 +565,75 @@ int mshv_region_get(struct mshv_region *region)
return kref_get_unless_zero(&region->mreg_refcount);
}
+/**
+ * mshv_region_hmm_fault_walk - Walk VMAs and fault in pages for a range
+ * @region : Pointer to the memory region structure
+ * @range : HMM range structure (caller sets notifier and notifier_seq)
+ * @start : Starting virtual address of the range to fault (inclusive)
+ * @end : Ending virtual address of the range to fault (exclusive)
+ * @pfns : Output array for page frame numbers with HMM flags
+ * @locked : Pointer to lock state; set to 0 if mmap lock was dropped
+ * @do_fault: If true, fault in missing pages; if false, snapshot only
+ *
+ * Iterates through VMAs covering [start, end), collecting page frame
+ * numbers via hmm_range_fault_unlockable() for each VMA segment.
+ * When @do_fault is true, missing pages are faulted in and write faults
+ * are requested only when both the VMA and the hypervisor mapping permit
+ * writes, to avoid breaking copy-on-write semantics on read-only mappings.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_hmm_fault_walk(struct mshv_region *region,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *pfns,
+ int *locked,
+ bool do_fault)
+{
+ unsigned long cur_start = start;
+ unsigned long *cur_pfns = pfns;
+
+ while (cur_start < end) {
+ struct vm_area_struct *vma;
+
+ vma = vma_lookup(range->notifier->mm, cur_start);
+ if (!vma)
+ return -EFAULT;
+
+ range->hmm_pfns = cur_pfns;
+ range->start = cur_start;
+ range->end = min(vma->vm_end, end);
+ range->default_flags = 0;
+ if (do_fault) {
+ range->default_flags = HMM_PFN_REQ_FAULT;
+ /*
+ * Only request writable pages from HMM when
+ * both the VMA and the hypervisor mapping allow
+ * writes. Without this, hmm_range_fault() would
+ * trigger COW on read-only mappings (e.g. shared
+ * zero pages, file-backed pages), breaking
+ * copy-on-write semantics and potentially
+ * granting the guest write access to shared host
+ * pages.
+ */
+ if ((vma->vm_flags & VM_WRITE) &&
+ (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
+ range->default_flags |= HMM_PFN_REQ_WRITE;
+ }
+
+ int ret = hmm_range_fault_unlockable(range, locked);
+
+ if (ret || !*locked)
+ return ret;
+
+ cur_start = range->end;
+ cur_pfns += (range->end - range->start) >> PAGE_SHIFT;
+ }
+
+ return 0;
+}
+
/**
* mshv_region_hmm_fault_and_lock - Fault in pages across VMAs and lock
* the memory region
@@ -575,11 +644,9 @@ int mshv_region_get(struct mshv_region *region)
* @do_fault: If true, fault in missing pages; if false, snapshot only
* pages already present in page tables
*
- * Iterates through VMAs covering [start, end), collecting page frame
- * numbers via hmm_range_fault() for each VMA segment. When @do_fault
- * is true, missing pages are faulted in and write faults are requested
- * only when both the VMA and the hypervisor mapping permit writes, to
- * avoid breaking copy-on-write semantics on read-only mappings.
+ * Faults in pages covering [start, end) and acquires region->mreg_mutex.
+ * If the mmap lock is dropped during the fault (e.g. by userfaultfd) or
+ * the mmu notifier sequence is invalidated, the entire walk is restarted.
*
* On success, returns with region->mreg_mutex held; the caller is
* responsible for releasing it. Returns -EBUSY if the mmu notifier
@@ -597,47 +664,27 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
.notifier = &region->mreg_mni,
};
struct mm_struct *mm = region->mreg_mni.mm;
+ int locked;
int ret;
- range.notifier_seq = mmu_interval_read_begin(range.notifier);
- mmap_read_lock(mm);
- while (start < end) {
- struct vm_area_struct *vma;
+ do {
+ range.notifier_seq = mmu_interval_read_begin(range.notifier);
+ locked = 1;
+ mmap_read_lock(mm);
- vma = vma_lookup(mm, start);
- if (!vma) {
- ret = -EFAULT;
- break;
- }
+ ret = mshv_region_hmm_fault_walk(region, &range, start, end,
+ pfns, &locked, do_fault);
- range.hmm_pfns = pfns;
- range.start = start;
- range.end = min(vma->vm_end, end);
- range.default_flags = 0;
- if (do_fault) {
- range.default_flags = HMM_PFN_REQ_FAULT;
- /*
- * Only request writable pages from HMM when both
- * the VMA and the hypervisor mapping allow writes.
- * Without this, hmm_range_fault() would trigger
- * COW on read-only mappings (e.g. shared zero
- * pages, file-backed pages), breaking
- * copy-on-write semantics and potentially granting
- * the guest write access to shared host pages.
- */
- if ((vma->vm_flags & VM_WRITE) &&
- (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
- range.default_flags |= HMM_PFN_REQ_WRITE;
- }
+ if (locked)
+ mmap_read_unlock(mm);
- ret = hmm_range_fault(&range);
- if (ret)
- break;
+ /*
+ * If the lock was dropped (by userfaultfd or similar), restart
+ * the entire walk with a fresh notifier_seq since the VMA layout
+ * may have changed. Also restart on -EBUSY (invalidation).
+ */
+ } while (!locked || ret == -EBUSY);
- start = range.end;
- pfns += (range.end - range.start) >> PAGE_SHIFT;
- }
- mmap_read_unlock(mm);
if (ret)
return ret;
* [PATCH 3/3] selftests/mm: Add userfaultfd test for HMM unlockable path
2026-05-01 1:20 [PATCH 0/3] mm/hmm: Add mmap lock-drop support for userfaultfd-backed mappings Stanislav Kinsburskii
2026-05-01 1:20 ` [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support Stanislav Kinsburskii
2026-05-01 1:20 ` [PATCH 2/3] mshv: Use hmm_range_fault_unlockable() for userfaultfd support Stanislav Kinsburskii
@ 2026-05-01 1:20 ` Stanislav Kinsburskii
2 siblings, 0 replies; 7+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-01 1:20 UTC (permalink / raw)
To: kys, Liam.Howlett, akpm, david, decui, haiyangz, jgg,
corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan, surenb,
vbabka, wei.liu
Cc: linux-doc, linux-hyperv, linux-kernel,
linux-kselftest, linux-mm
Add a selftest that exercises hmm_range_fault_unlockable() with a
userfaultfd-backed mapping. The test:
1. Creates an anonymous mmap region
2. Registers it with userfaultfd (UFFDIO_REGISTER_MODE_MISSING)
3. Spawns a handler thread that responds to page faults by filling
pages with a known pattern (0xAB) via UFFDIO_COPY
4. Issues HMM_DMIRROR_READ_UNLOCKABLE to the test_hmm driver, which
calls hmm_range_fault_unlockable() internally
5. Verifies the device read back the data provided by the userfaultfd
handler
This requires changes to the test_hmm kernel module:
- New dmirror_range_fault_unlockable() that uses the new HMM API
- New dmirror_fault_unlockable() and dmirror_read_unlockable() wrappers
- New HMM_DMIRROR_READ_UNLOCKABLE ioctl (0x09)
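At the ioctl level the new path amounts to roughly the following (a sketch;
the test itself goes through the existing hmm_dmirror_cmd() helper, and
buf/mirror/npages/fd stand for the test's mapping, mirror buffer, page count
and open test-device descriptor):

	struct hmm_dmirror_cmd cmd = {
		.addr   = (uintptr_t)buf,	/* uffd-registered range */
		.ptr    = (uintptr_t)mirror,	/* device data is copied back here */
		.npages = npages,
	};

	ret = ioctl(fd, HMM_DMIRROR_READ_UNLOCKABLE, &cmd);
	/* cmd.cpages and cmd.faults report pages read and faults taken */
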
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
lib/test_hmm.c | 122 +++++++++++++++++++++++++++++
lib/test_hmm_uapi.h | 1 +
tools/testing/selftests/mm/hmm-tests.c | 133 ++++++++++++++++++++++++++++++++
3 files changed, 256 insertions(+)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 0964d53365e61..20b14e279a8bd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -327,6 +327,84 @@ static int dmirror_range_fault(struct dmirror *dmirror,
return ret;
}
+static int dmirror_range_fault_unlockable(struct dmirror *dmirror,
+ struct hmm_range *range)
+{
+ struct mm_struct *mm = dmirror->notifier.mm;
+ unsigned long timeout =
+ jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+ int locked;
+ int ret;
+
+ while (true) {
+ if (time_after(jiffies, timeout)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ range->notifier_seq = mmu_interval_read_begin(range->notifier);
+ locked = 1;
+ mmap_read_lock(mm);
+ ret = hmm_range_fault_unlockable(range, &locked);
+ if (locked)
+ mmap_read_unlock(mm);
+ if (ret) {
+ if (ret == -EBUSY)
+ continue;
+ goto out;
+ }
+ if (!locked)
+ continue;
+
+ mutex_lock(&dmirror->mutex);
+ if (mmu_interval_read_retry(range->notifier,
+ range->notifier_seq)) {
+ mutex_unlock(&dmirror->mutex);
+ continue;
+ }
+ break;
+ }
+
+ ret = dmirror_do_fault(dmirror, range);
+
+ mutex_unlock(&dmirror->mutex);
+out:
+ return ret;
+}
+
+static int dmirror_fault_unlockable(struct dmirror *dmirror,
+ unsigned long start,
+ unsigned long end, bool write)
+{
+ struct mm_struct *mm = dmirror->notifier.mm;
+ unsigned long addr;
+ unsigned long pfns[32];
+ struct hmm_range range = {
+ .notifier = &dmirror->notifier,
+ .hmm_pfns = pfns,
+ .pfn_flags_mask = 0,
+ .default_flags =
+ HMM_PFN_REQ_FAULT | (write ? HMM_PFN_REQ_WRITE : 0),
+ .dev_private_owner = dmirror->mdevice,
+ };
+ int ret = 0;
+
+ if (!mmget_not_zero(mm))
+ return 0;
+
+ for (addr = start; addr < end; addr = range.end) {
+ range.start = addr;
+ range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+
+ ret = dmirror_range_fault_unlockable(dmirror, &range);
+ if (ret)
+ break;
+ }
+
+ mmput(mm);
+ return ret;
+}
+
static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
unsigned long end, bool write)
{
@@ -426,6 +504,47 @@ static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
return ret;
}
+static int dmirror_read_unlockable(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ struct dmirror_bounce bounce;
+ unsigned long start, end;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ ret = dmirror_bounce_init(&bounce, start, size);
+ if (ret)
+ return ret;
+
+ while (1) {
+ mutex_lock(&dmirror->mutex);
+ ret = dmirror_do_read(dmirror, start, end, &bounce);
+ mutex_unlock(&dmirror->mutex);
+ if (ret != -ENOENT)
+ break;
+
+ start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+ ret = dmirror_fault_unlockable(dmirror, start, end, false);
+ if (ret)
+ break;
+ cmd->faults++;
+ }
+
+ if (ret == 0) {
+ if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+ bounce.size))
+ ret = -EFAULT;
+ }
+ cmd->cpages = bounce.cpages;
+ dmirror_bounce_fini(&bounce);
+ return ret;
+}
+
static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
unsigned long end, struct dmirror_bounce *bounce)
{
@@ -1537,6 +1656,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
dmirror->flags = cmd.npages;
ret = 0;
break;
+ case HMM_DMIRROR_READ_UNLOCKABLE:
+ ret = dmirror_read_unlockable(dmirror, &cmd);
+ break;
default:
return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f94c6d4573382..076df6df92275 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -38,6 +38,7 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_RELEASE _IOWR('H', 0x07, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_FLAGS _IOWR('H', 0x08, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_READ_UNLOCKABLE _IOWR('H', 0x09, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_FLAG_FAIL_ALLOC (1ULL << 0)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index e8328c89d855e..e7bf061747edd 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -26,6 +26,9 @@
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/time.h>
+#include <sys/syscall.h>
+#include <linux/userfaultfd.h>
+#include <poll.h>
/*
@@ -2852,4 +2855,134 @@ TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
&thp_results, &regular_results);
}
}
+
+/*
+ * Test that HMM can fault in pages backed by userfaultfd using the
+ * hmm_range_fault_unlockable() path. This exercises the lock-drop retry
+ * logic in the HMM framework.
+ */
+struct uffd_thread_args {
+ int uffd;
+ void *page_buffer;
+ unsigned long page_size;
+};
+
+static void *uffd_handler_thread(void *arg)
+{
+ struct uffd_thread_args *args = arg;
+ struct uffd_msg msg;
+ struct uffdio_copy copy;
+ struct pollfd pollfd;
+ int ret;
+
+ pollfd.fd = args->uffd;
+ pollfd.events = POLLIN;
+
+ while (1) {
+ ret = poll(&pollfd, 1, 5000);
+ if (ret <= 0)
+ break;
+
+ ret = read(args->uffd, &msg, sizeof(msg));
+ if (ret != sizeof(msg))
+ break;
+
+ if (msg.event != UFFD_EVENT_PAGEFAULT)
+ break;
+
+ /* Fill the page with a known pattern */
+ memset(args->page_buffer, 0xAB, args->page_size);
+
+ copy.dst = msg.arg.pagefault.address & ~(args->page_size - 1);
+ copy.src = (unsigned long)args->page_buffer;
+ copy.len = args->page_size;
+ copy.mode = 0;
+ copy.copy = 0;
+
+ ret = ioctl(args->uffd, UFFDIO_COPY, &copy);
+ if (ret < 0)
+ break;
+ }
+
+ return NULL;
+}
+
+TEST_F(hmm, userfaultfd_read)
+{
+ struct hmm_buffer *buffer;
+ struct uffd_thread_args uffd_args;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ unsigned char *ptr;
+ pthread_t thread;
+ int uffd;
+ int ret;
+ struct uffdio_api api;
+ struct uffdio_register reg;
+
+ npages = 4;
+ size = npages << self->page_shift;
+
+ /* Create userfaultfd */
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0)
+ SKIP(return, "userfaultfd not available");
+
+ api.api = UFFD_API;
+ api.features = 0;
+ ret = ioctl(uffd, UFFDIO_API, &api);
+ ASSERT_EQ(ret, 0);
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ /* Create anonymous mapping */
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ -1, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Register the region with userfaultfd */
+ reg.range.start = (unsigned long)buffer->ptr;
+ reg.range.len = size;
+ reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+ ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+ ASSERT_EQ(ret, 0);
+
+ /* Set up the handler thread */
+ uffd_args.uffd = uffd;
+ uffd_args.page_buffer = malloc(self->page_size);
+ ASSERT_NE(uffd_args.page_buffer, NULL);
+ uffd_args.page_size = self->page_size;
+
+ ret = pthread_create(&thread, NULL, uffd_handler_thread, &uffd_args);
+ ASSERT_EQ(ret, 0);
+
+ /*
+ * Use the unlockable read path which allows the mmap lock to be
+ * dropped during the fault, enabling userfaultfd resolution.
+ */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ_UNLOCKABLE,
+ buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Verify the device read the data filled by the uffd handler */
+ ptr = buffer->mirror;
+ for (i = 0; i < size; ++i)
+ ASSERT_EQ(ptr[i], (unsigned char)0xAB);
+
+ pthread_join(thread, NULL);
+ free(uffd_args.page_buffer);
+ close(uffd);
+ hmm_buffer_free(buffer);
+}
+
TEST_HARNESS_MAIN