Linux-mm Archive on lore.kernel.org
* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
       [not found]                   ` <e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh>
@ 2025-09-01 16:01                     ` Nikita Kalyazin
  2025-09-08 16:53                       ` Liam R. Howlett
  2025-09-16 19:55                     ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Nikita Kalyazin @ 2025-09-01 16:01 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes
  Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Suren Baghdasaryan,
	linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	Andrea Arcangeli, Oscar Salvador, Axel Rasmussen, Ujwal Kundur



On 04/07/2025 20:39, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250704 11:00]:
>> On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
>>> On 03.07.25 19:48, Mike Rapoport wrote:
>>>> On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
>>>>> On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> The main target of this change is the implementation of UFFD for
>>>>>>> KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
>>>>>>> into the mm codebase.  We usually mean KVM by the "drivers" in this context,
>>>>>>> and it is already somewhat "knowledgeable" of the mm.  I don't think there
>>>>>>> are existing use cases for other drivers to implement this at the moment.
>>>>>>>
>>>>>>> Although I can't see new exports in this series, there is now a way to limit
>>>>>>> exports to particular modules [3].  Would it help if we only do it for KVM
>>>>>>> initially (if/when actually needed)?
>>>>>>
>>>>>> There were talks about pulling out guest_memfd core into mm, but I don't
>>>>>> remember patches about it. If parts of guest_memfd were already in mm/ that
>>>>>> would make easier to export uffd ops to it.
>>>>>
>>>>> Do we have a link to such discussion?  I'm also curious whether that idea
>>>>> was acknowledged by KVM maintainers.
>>>>
>>>> AFAIR it was discussed at one of David's guest_memfd calls
>>>
>>> While it was discussed in the call a couple of times in different context
>>> (guest_memfd as a library / guest_memfd shim), I think we already discussed
>>> it back at LPC last year.
>>>
>>> One of the main reasons for doing that is supporting guest_memfd in other
>>> hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
>>> as well.
>>
>> I see, thanks for the info. I found the series, it's here:
>>
>> https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
>>
>> Here, the question is whether do we still want to keep explicit calls to
>> shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
>> guest-memfd doesn't change a huge lot on answering this question, IIUC.
> 
> Can you explore moving hugetlb_mfill_atomic_pte and
> shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> use the same code?
> 
> That is, split the similar blocks into functions and reduce duplication.
> 
> These are under the UFFD config option and are pretty similar.  This
> will also limit the bleeding of mfill_atomic_mode out of uffd.
> 
> 
> 
> If you look at what the code does in userfaultfd.c, you can see that
> some refactoring is necessary for other reasons:
> 
> mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> call shmem_mfill_atomic_pte() in mm/shmem.c
> 
> mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c
> 
> The shmem call already depends on the vma flags.. which it still does in
> your patch 4 here.  So you've replaced this:
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
>          shmem_mfill_atomic_pte()
> }
> 
> With...
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
> ...
>          uffd_ops->uffd_copy()
> }
> 
> So, really, what needs to happen first is userfaultfd needs to be
> sorted.
> 
> There's no point of creating a vm_ops_uffd if it will just serve as
> replacing the call locations of the functions like this, as it has done
> nothing to simplify the logic.
> 
>> However if we want to generalize userfaultfd capability for a type of
>> memory, we will still need something like the vm_uffd_ops hook to report
>> such information.  It means drivers can still overwrite these, with/without
>> an exported mfill_atomic_install_pte() functions.  I'm not sure whether
>> that eases the concern.
> 
> If we work through the duplication and reduction where possible, the
> path forward may be easier to see.
> 
>>
>> So to me, generalizing the mem type looks helpful with/without moving
>> guest-memfd under mm/.
> 
> Yes, it should decrease the duplication across hugetlb.c and shmem.c,
> but I think that userfaultfd is the place to start.
> 
>>
>> We do have the option to keep hard-code guest-memfd like shmem or
>> hugetlbfs. This is still "doable", but this likely means guest-memfd
>> support for userfaultfd needs to be done after that work.  I did quickly
>> check the status of gunyah hypervisor [1,2,3], I found that all of the
>> efforts are not yet continued in 2025.  The hypervisor last update was Jan
>> 2024 with a batch push [1].
>>
>> I still prefer generalizing uffd capabilities using the ops.  That makes
>> guest-memfd support on MINOR not be blocked and it should be able to be
>> done concurrently v.s. guest-memfd library.  If guest-memfd library idea
>> didn't move on, it's non-issue either.
>>
>> I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
>> I'm going to repost - that looks like the only thing that people are
>> against with, even though that is not my preference, as that'll make the
>> API half-broken on its own.
> 
> The generalisation you did does not generalize much, as I pointed out
> above, and so it seems less impactful than it could be.
> 
> These patches also do not explore what this means for guest_memfd.  So
> it is not clear that the expected behaviour will serve the need.
> 
> You sent a link to an example user.  Can you please keep this work
> together in the patch set so that we know it'll work for your use case
> and allows us an easier way to pull down this work so we can examine it.

Hi Liam, Lorenzo,

With mmap support in guest_memfd recently accepted and merged into 
kvm/next [1], UFFDIO_CONTINUE support in guest_memfd becomes a real use 
case.

From what I understand, it is the API for UFFDIO_COPY (i.e. the 
.uffd_copy callback) proposed by Peter that raises the safety issues, 
while the generalisation of the checks (.uffd_features, .uffd_ioctls) 
and the .uffd_get_folio callback needed for UFFDIO_CONTINUE do not 
introduce such concerns.  To unblock userfaultfd support in 
guest_memfd, would it be acceptable to start by implementing 
.uffd_get_folio/UFFDIO_CONTINUE only, leaving the callback for 
UFFDIO_COPY for later, once we have an idea of a safer API and a clear 
use case for it?

If that's the case, what would be the best way to demonstrate the client 
code?  Would using kvm/next as the base for the aggregated series (new 
mm API + client code in kvm/guest_memfd) work?

Thanks,
Nikita

[1]: https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com/
> 
> Alternatively, you can reduce and combine the memory types without
> exposing the changes externally, if they stand on their own.  But I
> don't think anyone is going to accept using a vm_ops change where a
> direct function call could be used.
> 
>> Said that, I still prefer this against
>> hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.
> 
> It also does nothing with regards to the CONFIG_USERFAULTFD in other
> areas.  My hope is that you could pull out the common code and make the
> CONFIG_ sections smaller.
> 
> And, by the way, it isn't the fact that we're going to copy something
> (or mfill_atomic it, I guess?) but the fact that we're handing out the
> pointer for something else to do that.  Please handle this manipulation
> in the core mm code without handing out pointers to folios, or page
> tables.
> 
> You could do this with the address being passed in and figure out the
> type, or even a function pointer that you specifically pass in an enum
> of the type (I think this is what Lorenzo was suggesting somewhere),
> maybe with multiple flags for actions and fallback (MFILL|ZERO for
> example).
> 
> Thanks,
> Liam
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-01 16:01                     ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Nikita Kalyazin
@ 2025-09-08 16:53                       ` Liam R. Howlett
  2025-09-16 20:05                         ` Peter Xu
  2025-09-17  9:25                         ` Mike Rapoport
  0 siblings, 2 replies; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-08 16:53 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Lorenzo Stoakes, Peter Xu, David Hildenbrand, Mike Rapoport,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

* Nikita Kalyazin <kalyazin@amazon.com> [250901 12:01]:
> 
> 
> On 04/07/2025 20:39, Liam R. Howlett wrote:
> > * Peter Xu <peterx@redhat.com> [250704 11:00]:
> > > On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
> > > > On 03.07.25 19:48, Mike Rapoport wrote:
> > > > > On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> > > > > > On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > > The main target of this change is the implementation of UFFD for
> > > > > > > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > > > > > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > > > > > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > > > > > > are existing use cases for other drivers to implement this at the moment.
> > > > > > > > 
> > > > > > > > Although I can't see new exports in this series, there is now a way to limit
> > > > > > > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > > > > > > initially (if/when actually needed)?
> > > > > > > 
> > > > > > > There were talks about pulling out guest_memfd core into mm, but I don't
> > > > > > > remember patches about it. If parts of guest_memfd were already in mm/ that
> > > > > > > would make easier to export uffd ops to it.
> > > > > > 
> > > > > > Do we have a link to such discussion?  I'm also curious whether that idea
> > > > > > was acknowledged by KVM maintainers.
> > > > > 
> > > > > AFAIR it was discussed at one of David's guest_memfd calls
> > > > 
> > > > While it was discussed in the call a couple of times in different context
> > > > (guest_memfd as a library / guest_memfd shim), I think we already discussed
> > > > it back at LPC last year.
> > > > 
> > > > One of the main reasons for doing that is supporting guest_memfd in other
> > > > hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
> > > > as well.
> > > 
> > > I see, thanks for the info. I found the series, it's here:
> > > 
> > > https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
> > > 
> > > Here, the question is whether do we still want to keep explicit calls to
> > > shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
> > > guest-memfd doesn't change a huge lot on answering this question, IIUC.
> > 
> > Can you explore moving hugetlb_mfill_atomic_pte and
> > shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> > use the same code?
> > 
> > That is, split the similar blocks into functions and reduce duplication.
> > 
> > These are under the UFFD config option and are pretty similar.  This
> > will also limit the bleeding of mfill_atomic_mode out of uffd.
> > 
> > 
> > 
> > If you look at what the code does in userfaultfd.c, you can see that
> > some refactoring is necessary for other reasons:
> > 
> > mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> > (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> > call shmem_mfill_atomic_pte() in mm/shmem.c
> > 
> > mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> > loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c
> > 
> > The shmem call already depends on the vma flags.. which it still does in
> > your patch 4 here.  So you've replaced this:
> > 
> > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > ...
> > } else {
> >          shmem_mfill_atomic_pte()
> > }
> > 
> > With...
> > 
> > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > ...
> > } else {
> > ...
> >          uffd_ops->uffd_copy()
> > }
> > 
> > So, really, what needs to happen first is userfaultfd needs to be
> > sorted.
> > 
> > There's no point of creating a vm_ops_uffd if it will just serve as
> > replacing the call locations of the functions like this, as it has done
> > nothing to simplify the logic.
> > 
> > > However if we want to generalize userfaultfd capability for a type of
> > > memory, we will still need something like the vm_uffd_ops hook to report
> > > such information.  It means drivers can still overwrite these, with/without
> > > an exported mfill_atomic_install_pte() functions.  I'm not sure whether
> > > that eases the concern.
> > 
> > If we work through the duplication and reduction where possible, the
> > path forward may be easier to see.
> > 
> > > 
> > > So to me, generalizing the mem type looks helpful with/without moving
> > > guest-memfd under mm/.
> > 
> > Yes, it should decrease the duplication across hugetlb.c and shmem.c,
> > but I think that userfaultfd is the place to start.
> > 
> > > 
> > > We do have the option to keep hard-code guest-memfd like shmem or
> > > hugetlbfs. This is still "doable", but this likely means guest-memfd
> > > support for userfaultfd needs to be done after that work.  I did quickly
> > > check the status of gunyah hypervisor [1,2,3], I found that all of the
> > > efforts are not yet continued in 2025.  The hypervisor last update was Jan
> > > 2024 with a batch push [1].
> > > 
> > > I still prefer generalizing uffd capabilities using the ops.  That makes
> > > guest-memfd support on MINOR not be blocked and it should be able to be
> > > done concurrently v.s. guest-memfd library.  If guest-memfd library idea
> > > didn't move on, it's non-issue either.
> > > 
> > > I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
> > > I'm going to repost - that looks like the only thing that people are
> > > against with, even though that is not my preference, as that'll make the
> > > API half-broken on its own.
> > 
> > The generalisation you did does not generalize much, as I pointed out
> > above, and so it seems less impactful than it could be.
> > 
> > These patches also do not explore what this means for guest_memfd.  So
> > it is not clear that the expected behaviour will serve the need.
> > 
> > You sent a link to an example user.  Can you please keep this work
> > together in the patch set so that we know it'll work for your use case
> > and allows us an easier way to pull down this work so we can examine it.
> 
> Hi Liam, Lorenzo,
> 
> With mmap support in guest_memfd being recently accepted and merged into
> kvm/next [1], UFFDIO_CONTINUE support in guest_memfd becomes a real use
> case.
> 
> From what I understand, it is the API for UFFDIO_COPY (ie the .uffd_copy
> callback) proposed by Peter that raises the safety issues, while the
> generalisation of the checks (.uffd_features, .uffd_ioctls) and
> .uffd_get_folio needed for UFFDIO_CONTINUE do not introduce such concerns.
> In order to unblock the userfaultfd support in guest_memfd, would it be
> acceptable to start with implementing .uffd_get_folio/UFFDIO_CONTINUE only,
> leaving the callback for UFFDIO_COPY for later when we have an idea about a
> safer API and a clear use case for that?

Reading through the patches, I'm not entirely sure what you are
proposing.

What I was hoping to see from a generalization of the memory types is a
much simpler shared code base, up until the code hits memory-type
specific areas, where a function pointer (or, I guess, a switch
statement) could be used to keep things from getting complicated.

What we don't want is non-mm code specifying values for the function
pointer and doing what they want, or a function pointer that returns a
core mm resource (in the old example this was a vma, here it is a
folio).

From this patch set:
+        * Return: zero if succeeded, negative for errors.
+        */
+       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+                             struct folio **folio);

This is one of the contention points in the current scenario as the
folio would be returned.

If you are going to be manipulating an mm resource it should be in mm
code, especially if it requires mm locks.

Thanks,
Liam



* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
       [not found]                   ` <e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh>
  2025-09-01 16:01                     ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Nikita Kalyazin
@ 2025-09-16 19:55                     ` Peter Xu
  2025-09-19 17:22                       ` Liam R. Howlett
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-16 19:55 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, Mike Rapoport,
	Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Fri, Jul 04, 2025 at 03:39:32PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250704 11:00]:
> > On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
> > > On 03.07.25 19:48, Mike Rapoport wrote:
> > > > On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> > > > > On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > The main target of this change is the implementation of UFFD for
> > > > > > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > > > > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > > > > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > > > > > are existing use cases for other drivers to implement this at the moment.
> > > > > > > 
> > > > > > > Although I can't see new exports in this series, there is now a way to limit
> > > > > > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > > > > > initially (if/when actually needed)?
> > > > > > 
> > > > > > There were talks about pulling out guest_memfd core into mm, but I don't
> > > > > > remember patches about it. If parts of guest_memfd were already in mm/ that
> > > > > > would make easier to export uffd ops to it.
> > > > > 
> > > > > Do we have a link to such discussion?  I'm also curious whether that idea
> > > > > was acknowledged by KVM maintainers.
> > > > 
> > > > AFAIR it was discussed at one of David's guest_memfd calls
> > > 
> > > While it was discussed in the call a couple of times in different context
> > > (guest_memfd as a library / guest_memfd shim), I think we already discussed
> > > it back at LPC last year.
> > > 
> > > One of the main reasons for doing that is supporting guest_memfd in other
> > > hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
> > > as well.
> > 
> > I see, thanks for the info. I found the series, it's here:
> > 
> > https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
> > 
> > Here, the question is whether do we still want to keep explicit calls to
> > shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
> > guest-memfd doesn't change a huge lot on answering this question, IIUC.
> 
> Can you explore moving hugetlb_mfill_atomic_pte and
> shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> use the same code?
> 
> That is, split the similar blocks into functions and reduce duplication.
> 
> These are under the UFFD config option and are pretty similar.  This
> will also limit the bleeding of mfill_atomic_mode out of uffd.

These are different problems to solve.

I did propose a general refactoring of hugetlbfs years ago, but I never
finished it.  That's partly because it's a huge amount of work that
almost nobody likes to review (even if everybody likes to see it land),
and partly because, once we decided not to add more complexity like HGM
into the hugetlbfs code, there was less point in refactoring it.

I hope we can get on the same page about the problem we want to solve
here: supporting userfaultfd on guest-memfd.  Your suggestion could be
helpful, but IMHO it's orthogonal to that problem, and it isn't a
dependency either.  If it were, you would see code duplication
introduced by this series that a later cleanup would remove; that's not
the case.

This series simply solves a different problem.  The order in which we
solve it versus clean things up elsewhere (or decide not to at all)
isn't fixed, IMHO.

> 
> 
> 
> If you look at what the code does in userfaultfd.c, you can see that
> some refactoring is necessary for other reasons:
> 
> mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> call shmem_mfill_atomic_pte() in mm/shmem.c
> 
> mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c

Again, IMHO this still falls into the hugetlbfs refactoring category.  I
would sincerely request that we treat that as a separate topic, because
it's huge and community resources are limited for both development and
review.

Shmem already shares most of its code with anonymous memory; after all,
these two memory types are the only ones besides hugetlbfs that support
userfaultfd.

> 
> The shmem call already depends on the vma flags.. which it still does in
> your patch 4 here.  So you've replaced this:
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
>         shmem_mfill_atomic_pte()
> }
> 
> With...
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
> ...
>         uffd_ops->uffd_copy()
> }

I'm not 100% sure I get this comment.  It's intentional to depend on vma
flags here for shmem.  See Andrea's commit:

commit 5b51072e97d587186c2f5390c8c9c1fb7e179505
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Nov 30 14:09:28 2018 -0800

    userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem

Are you suggesting we should merge it somehow?  I'd appreciate some
concrete example here if so.

> 
> So, really, what needs to happen first is userfaultfd needs to be
> sorted.
> 
> There's no point of creating a vm_ops_uffd if it will just serve as
> replacing the call locations of the functions like this, as it has done
> nothing to simplify the logic.

I had a vague feeling that you may not have understood the goal of this
series; I thought we were on the same page there.  That could be partly
my fault if it's not obvious, and I can try to be more explicit in the
cover letter if I repost.

Please consider supporting guest-memfd as the goal: vm_uffd_ops is what
allows all the changes to happen in guest_memfd.c.  Without it, that
can't happen.  That's the whole point so far.

> 
> > However if we want to generalize userfaultfd capability for a type of
> > memory, we will still need something like the vm_uffd_ops hook to report
> > such information.  It means drivers can still overwrite these, with/without
> > an exported mfill_atomic_install_pte() functions.  I'm not sure whether
> > that eases the concern.
> 
> If we work through the duplication and reduction where possible, the
> path forward may be easier to see.
> 
> > 
> > So to me, generalizing the mem type looks helpful with/without moving
> > guest-memfd under mm/.
> 
> Yes, it should decrease the duplication across hugetlb.c and shmem.c,
> but I think that userfaultfd is the place to start.
> 
> > 
> > We do have the option to keep hard-code guest-memfd like shmem or
> > hugetlbfs. This is still "doable", but this likely means guest-memfd
> > support for userfaultfd needs to be done after that work.  I did quickly
> > check the status of gunyah hypervisor [1,2,3], I found that all of the
> > efforts are not yet continued in 2025.  The hypervisor last update was Jan
> > 2024 with a batch push [1].
> > 
> > I still prefer generalizing uffd capabilities using the ops.  That makes
> > guest-memfd support on MINOR not be blocked and it should be able to be
> > done concurrently v.s. guest-memfd library.  If guest-memfd library idea
> > didn't move on, it's non-issue either.
> > 
> > I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
> > I'm going to repost - that looks like the only thing that people are
> > against with, even though that is not my preference, as that'll make the
> > API half-broken on its own.
> 
> The generalisation you did does not generalize much, as I pointed out
> above, and so it seems less impactful than it could be.
> 
> These patches also do not explore what this means for guest_memfd.  So
> it is not clear that the expected behaviour will serve the need.
> 
> You sent a link to an example user.  Can you please keep this work
> together in the patch set so that we know it'll work for your use case
> and allows us an easier way to pull down this work so we can examine it.
> 
> Alternatively, you can reduce and combine the memory types without
> exposing the changes externally, if they stand on their own.  But I
> don't think anyone is going to accept using a vm_ops change where a
> direct function call could be used.

While I was absent, Nikita sent a version that is based on the library
code.  That uses direct function calls:

https://lore.kernel.org/all/20250915161815.40729-3-kalyazin@amazon.com/

I think it's great that we have another sample implementation of this.

I believe Nikita has no strong opinion on how this lands in the end,
whatever way the community prefers.  I prefer the solution I sent.

Do you have a preference, when explicitly there's the other one to compare?

> 
> > Said that, I still prefer this against
> > hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.
> 
> It also does nothing with regards to the CONFIG_USERFAULTFD in other
> areas.  My hope is that you could pull out the common code and make the
> CONFIG_ sections smaller.
> 
> And, by the way, it isn't the fact that we're going to copy something
> (or mfill_atomic it, I guess?) but the fact that we're handing out the
> pointer for something else to do that.  Please handle this manipulation
> in the core mm code without handing out pointers to folios, or page
> tables.

Is "return pointers to folios" also an issue now in the API?  Are you
NACKing uffd_get_folio() API that this series proposed?

> 
> You could do this with the address being passed in and figure out the
> type, or even a function pointer that you specifically pass in an enum
> of the type (I think this is what Lorenzo was suggesting somewhere),
> maybe with multiple flags for actions and fallback (MFILL|ZERO for
> example).

I didn't quite get this one; I'd appreciate some more elaboration.

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-08 16:53                       ` Liam R. Howlett
@ 2025-09-16 20:05                         ` Peter Xu
  2025-09-17 15:29                           ` Liam R. Howlett
  2025-09-17  9:25                         ` Mike Rapoport
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-16 20:05 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes,
	David Hildenbrand, Mike Rapoport, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> What we don't want is non-mm code specifying values for the function
> pointer and doing what they want, or a function pointer that returns a
> core mm resource (in the old example this was a vma, here it is a
> folio).
> 
> From this patch set:
> +        * Return: zero if succeeded, negative for errors.
> +        */
> +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +                             struct folio **folio);
> 
> This is one of the contention points in the current scenario as the
> folio would be returned.

OK, I hadn't seen this one previously; it partly answers one of my
questions from the other reply, in a way I wished it hadn't.

Could you elaborate why an API returning an folio pointer would be
dangerous?

OTOH, would you consider alloc_pages() or folio_alloc() dangerous too?

They return a folio from the mm core to drivers.  That's the opposite
direction of folio sharing, but it still means the driver can manipulate
the folio / memmap as much as it wants, so sabotaging everything is just
as possible.  Why don't we worry about that?

Are we going to unexport alloc_pages() someday?

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-08 16:53                       ` Liam R. Howlett
  2025-09-16 20:05                         ` Peter Xu
@ 2025-09-17  9:25                         ` Mike Rapoport
  2025-09-17 16:53                           ` Liam R. Howlett
  1 sibling, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-09-17  9:25 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes, Peter Xu,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Hi Liam,

On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> 
> Reading through the patches, I'm not entirely sure what you are
> proposing.
> 
> What I was hoping to see by a generalization of the memory types is a
> much simpler shared code base until the code hit memory type specific
> areas where a function pointer could be used to keep things from getting
> complicated (or, I guess a switch statement..).
> 
> What we don't want is non-mm code specifying values for the function
> pointer and doing what they want, or a function pointer that returns a
> core mm resource (in the old example this was a vma, here it is a
> folio).
> 
> From this patch set:
> +        * Return: zero if succeeded, negative for errors.
> +        */
> +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +                             struct folio **folio);
> 
> This is one of the contention points in the current scenario as the
> folio would be returned.

I don't see a problem with it. It's not any different from
vma_ops->fault(): a callback for a filesystem to get a folio that will be
mapped afterwards by the mm code.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-16 20:05                         ` Peter Xu
@ 2025-09-17 15:29                           ` Liam R. Howlett
  0 siblings, 0 replies; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-17 15:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nikita Kalyazin, Lorenzo Stoakes, David Hildenbrand,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250916 16:05]:
> On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> > What we don't want is non-mm code specifying values for the function
> > pointer and doing what they want, or a function pointer that returns a
> > core mm resource (in the old example this was a vma, here it is a
> > folio).
> > 
> > From this patch set:
> > +        * Return: zero if succeeded, negative for errors.
> > +        */
> > +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > +                             struct folio **folio);
> > 
> > This is one of the contention points in the current scenario as the
> > folio would be returned.
> 
> OK I didn't see this one previously, it partly answers one of my question
> in the other reply, in a way I wished not.
> 
> Could you elaborate why an API returning an folio pointer would be
> dangerous?

I did [1], and Lorenzo did [2].

This is a bad design that has gotten us in trouble and we don't need to
do it.  This is an anti-pattern.

> 
> OTOH, would you think alloc_pages() or folio_alloc() be dangerous too?
> 
> They return a folio from the mm core to drivers, hence it's not the same
> direction of folio sharing, however it also means at least the driver can
> manipulate the folio / memmap as much as it wants, sabotaging everything is
> similarly possible.  Why do we worry about that?

Are you expecting the same number of memory types as drivers?  I'm not
sure I want to live in that reality.

> 
> Are we going to unexport alloc_pages() someday?

I guess, over a long enough timeline all functions will be unexported.

And this is exactly why a 'two phase' approach to 'revisit if necessary'
[3] is a problem.  When we tried to remove the use of mm pointers in
drivers, we were stuck looking at hundreds of lines of code in a single
driver trying to figure out what was going on.

Seriously, I added locking and it added a regression so they removed it
[4].  It took years to get that driver to a more sensible state, and I'm
really happy that Carlos Llamas did all that work!

We regularly had to fight with people to stop caching a pointer
internally.  You know why they needed the vma pointer?  To modify the
vma flag, but only a few modifications were supposed to be made.. yet it
spawned years of cleanup.

And you're asking us to do it again.

Why can't we use enums to figure out what to do [5], one of which could
be the new functionality for guest_memfd?

There are many ways that this can be done with limited code living in
the mm that are safer and more maintainable and testable than handing
out pointers that have locking hell [6] to anyone with a source file.

Thanks,
Liam

[1]. https://lore.kernel.org/linux-mm/5ixvg4tnwj53ixh2fx26dxlctgqtayydqryqxhns6bwj3q3ies@6sttjti5dxt7/
[2]. https://lore.kernel.org/linux-mm/982f4f94-f0bf-45dd-9003-081b76e57027@lucifer.local/
[3]. https://lore.kernel.org/linux-mm/e9235b88-e2be-4358-a6fb-507c5cad6fd9@lucifer.local/
[4]. https://lore.kernel.org/all/20220621140212.vpkio64idahetbyf@revolver/T/#m9d9c8911447e395a73448700d7f06a4366b5ae02
[5]. https://lore.kernel.org/linux-mm/54bb09fc-144b-4d61-8bc2-1eca4d278382@lucifer.local/
[6]. https://elixir.bootlin.com/linux/v6.16.7/source/mm/rmap.c#L21



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-17  9:25                         ` Mike Rapoport
@ 2025-09-17 16:53                           ` Liam R. Howlett
  2025-09-18  8:37                             ` Mike Rapoport
  0 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-17 16:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Peter Xu, David Hildenbrand,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

* Mike Rapoport <rppt@kernel.org> [250917 05:26]:
> Hi Liam,
> 
> On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> > 
> > Reading through the patches, I'm not entirely sure what you are
> > proposing.
> > 
> > What I was hoping to see by a generalization of the memory types is a
> > much simpler shared code base until the code hit memory type specific
> > areas where a function pointer could be used to keep things from getting
> > complicated (or, I guess a switch statement..).
> > 
> > What we don't want is non-mm code specifying values for the function
> > pointer and doing what they want, or a function pointer that returns a
> > core mm resource (in the old example this was a vma, here it is a
> > folio).
> > 
> > From this patch set:
> > +        * Return: zero if succeeded, negative for errors.
> > +        */
> > +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > +                             struct folio **folio);
> > 
> > This is one of the contention points in the current scenario as the
> > folio would be returned.
> 
> I don't see a problem with it. It's not any different from
> vma_ops->fault(): a callback for a filesystem to get a folio that will be
> mapped afterwards by the mm code.
> 

I disagree, the filesystem vma_ops->fault() is not a config option like
this one.  So we are on a path to enable uffd by default, and it really
needs work beyond this series.  Setting up a list head and passing it
through every call stack is far from ideal.

I also think the filesystem model is not one we want to duplicate in mm
for memory types - think of the test issues we have now and then have a
look at the xfstests support of filesystems [1].

So we are on a path of less test coverage, and more code that is
actually about mm that is outside of mm.  So, is there another way?

Cheers,
Liam

[1]. https://github.com/kdave/xfstests/blob/master/README#L37



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-17 16:53                           ` Liam R. Howlett
@ 2025-09-18  8:37                             ` Mike Rapoport
  2025-09-18 16:47                               ` Liam R. Howlett
  0 siblings, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-09-18  8:37 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes, Peter Xu,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Sep 17, 2025 at 12:53:05PM -0400, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@kernel.org> [250917 05:26]:
> > Hi Liam,
> > 
> > On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> > > 
> > > Reading through the patches, I'm not entirely sure what you are
> > > proposing.
> > > 
> > > What I was hoping to see by a generalization of the memory types is a
> > > much simpler shared code base until the code hit memory type specific
> > > areas where a function pointer could be used to keep things from getting
> > > complicated (or, I guess a switch statement..).
> > > 
> > > What we don't want is non-mm code specifying values for the function
> > > pointer and doing what they want, or a function pointer that returns a
> > > core mm resource (in the old example this was a vma, here it is a
> > > folio).
> > > 
> > > From this patch set:
> > > +        * Return: zero if succeeded, negative for errors.
> > > +        */
> > > +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > > +                             struct folio **folio);
> > > 
> > > This is one of the contention points in the current scenario as the
> > > folio would be returned.
> > 
> > I don't see a problem with it. It's not any different from
> > vma_ops->fault(): a callback for a filesystem to get a folio that will be
> > mapped afterwards by the mm code.
> > 
> 
> I disagree, the filesystem vma_ops->fault() is not a config option like
> this one.  So we are on a path to enable uffd by default, and it really
> needs work beyond this series.  Setting up a list head and passing in
> through every call stack is far from idea.

I don't follow you here. How addition of uffd callbacks guarded by a config
option to vma_ops leads to enabling uffd by default?
 
> I also think the filesystem model is not one we want to duplicate in mm
> for memory types - think of the test issues we have now and then have a
> look at the xfstests support of filesystems [1].
> 
> So we are on a path of less test coverage, and more code that is
> actually about mm that is outside of mm.  So, is there another way?

There are quite a few vma_ops outside fs/ not covered by xfstest, so the
test coverage argument is moot at best.
And anything in the kernel can grab a folio and do whatever it pleases.

Nevertheless, let's step back for a second and instead focus on the problem
these patches are trying to solve, which is to allow guest_memfd implement
UFFD_CONTINUE (or minor fault in other terminology). 

This means uffd should be able to map a folio that's already in
guest_memfd page cache to the faulted address. Obviously, the page table
update happens in uffd. But it still has to find what to map and we need
some way to let guest_memfd tell that to uffd.

So we need a hook somewhere that will return a folio matching pgoff in
vma->file->inode.

Do you see a way to implement it otherwise?

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18  8:37                             ` Mike Rapoport
@ 2025-09-18 16:47                               ` Liam R. Howlett
  2025-09-18 17:15                                 ` Nikita Kalyazin
  2025-09-18 18:05                                 ` Mike Rapoport
  0 siblings, 2 replies; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-18 16:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Peter Xu, David Hildenbrand,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

* Mike Rapoport <rppt@kernel.org> [250918 04:37]:
> On Wed, Sep 17, 2025 at 12:53:05PM -0400, Liam R. Howlett wrote:
> > * Mike Rapoport <rppt@kernel.org> [250917 05:26]:
> > > Hi Liam,
> > > 
> > > On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
> > > > 
> > > > Reading through the patches, I'm not entirely sure what you are
> > > > proposing.
> > > > 
> > > > What I was hoping to see by a generalization of the memory types is a
> > > > much simpler shared code base until the code hit memory type specific
> > > > areas where a function pointer could be used to keep things from getting
> > > > complicated (or, I guess a switch statement..).
> > > > 
> > > > What we don't want is non-mm code specifying values for the function
> > > > pointer and doing what they want, or a function pointer that returns a
> > > > core mm resource (in the old example this was a vma, here it is a
> > > > folio).
> > > > 
> > > > From this patch set:
> > > > +        * Return: zero if succeeded, negative for errors.
> > > > +        */
> > > > +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > > > +                             struct folio **folio);
> > > > 
> > > > This is one of the contention points in the current scenario as the
> > > > folio would be returned.
> > > 
> > > I don't see a problem with it. It's not any different from
> > > vma_ops->fault(): a callback for a filesystem to get a folio that will be
> > > mapped afterwards by the mm code.
> > > 
> > 
> > I disagree, the filesystem vma_ops->fault() is not a config option like
> > this one.  So we are on a path to enable uffd by default, and it really
> > needs work beyond this series.  Setting up a list head and passing in
> > through every call stack is far from idea.
> 
> I don't follow you here. How addition of uffd callbacks guarded by a config
> option to vma_ops leads to enabling uffd by default?

Any new memory type that uses the above interface now needs uffd
enabled, anyone planning to use guest_memfd needs it enabled, anyone
able to get a module using this interface needs it enabled (by whoever
gives them the kernel they use).  Kernel providers now need to enable
UFFD - which is different than the example provided.

>  
> > I also think the filesystem model is not one we want to duplicate in mm
> > for memory types - think of the test issues we have now and then have a
> > look at the xfstests support of filesystems [1].
> > 
> > So we are on a path of less test coverage, and more code that is
> > actually about mm that is outside of mm.  So, is there another way?
> 
> There are quite a few vma_ops outside fs/ not covered by xfstest, so the
> test coverage argument is moot at best.
> And anything in the kernel can grab a folio and do whatever it pleases.

Testing filesystems is nothing short of a nightmare and I don't want mm
to march happily towards that end.  This interface is endlessly flexible
and thus endlessly broken and working at the same time.

IOW, we have given anyone wanting to implement a new memory type
infinite freedoms to run afoul, but they won't be looking for those
people when things go horribly wrong - they will most likely see a
memory issue and come here. syzbot will see a hang on some mm lock in an
unrelated task, or whatever.

I would rather avoid the endlessly flexible interface to avoid incorrect
uses in favour of a limited selection of choices, that could be expanded
if necessary, but would be more visible to the mm people going in.  That
is, people can add new memory types through adding them to mm/ instead
of in driver/ or out of tree.

I could very much see someone looking to use this for a binder-type
driver and that might work out really well!  But I don't want someone
doing it and shoving the folio pointer in a custom struct because they
*know* it's fine, so what's the big deal?  I don't mean to pick on
binder, but this example comes to mind.

> 
> Nevertheless, let's step back for a second and instead focus on the problem
> these patches are trying to solve, which is to allow guest_memfd implement
> UFFD_CONTINUE (or minor fault in other terminology). 

Well, this is about modularizing memory types, but the first user is
supposed to be the guest-memfd support.

> 
> This means uffd should be able to map a folio that's already in
> guest_memfd page cache to the faulted address. Obviously, the page table
> update happens in uffd. But it still has to find what to map and we need
> some way to let guest_memfd tell that to uffd.
> 
> So we need a hook somewhere that will return a folio matching pgoff in
> vma->file->inode.
> 
> Do you see a way to implement it otherwise?

I must be missing something.

UFFDIO_CONTINUE currently enters through an ioctl that calls
userfaultfd_continue() -> mfill_atomic_continue()... mfill_atomic() gets
and uses the folio to actually do the work.  Right now, we don't hand
out the folio, so what is different here?

I am under the impression that we don't need to return the folio, but
may need to do work on it.  That is, we can give the mm side what it
needs to call the related memory type functions to service the request.

For example, one could pass in the inode, pgoff, and memory type and the
mm code could then call the fault handler for that memory type?

I didn't think Nikita had a folio returned in his first three patches
[1], but then they built on other patches and it was difficult to follow
along.  Is it because that interface was agreed on in a call on 23 Jan
2025 [2], as somewhat unclearly stated in [1]?

Thanks,
Liam

[1].  https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
[2].  https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 16:47                               ` Liam R. Howlett
@ 2025-09-18 17:15                                 ` Nikita Kalyazin
  2025-09-18 17:45                                   ` Lorenzo Stoakes
  2025-09-18 17:54                                   ` Liam R. Howlett
  2025-09-18 18:05                                 ` Mike Rapoport
  1 sibling, 2 replies; 30+ messages in thread
From: Nikita Kalyazin @ 2025-09-18 17:15 UTC (permalink / raw)
  To: Liam R. Howlett, Mike Rapoport, Lorenzo Stoakes, Peter Xu,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur



On 18/09/2025 17:47, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@kernel.org> [250918 04:37]:
>> On Wed, Sep 17, 2025 at 12:53:05PM -0400, Liam R. Howlett wrote:
>>> * Mike Rapoport <rppt@kernel.org> [250917 05:26]:
>>>> Hi Liam,
>>>>
>>>> On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
>>>>>
>>>>> Reading through the patches, I'm not entirely sure what you are
>>>>> proposing.
>>>>>
>>>>> What I was hoping to see by a generalization of the memory types is a
>>>>> much simpler shared code base until the code hit memory type specific
>>>>> areas where a function pointer could be used to keep things from getting
>>>>> complicated (or, I guess a switch statement..).
>>>>>
>>>>> What we don't want is non-mm code specifying values for the function
>>>>> pointer and doing what they want, or a function pointer that returns a
>>>>> core mm resource (in the old example this was a vma, here it is a
>>>>> folio).
>>>>>
>>>>>  From this patch set:
>>>>> +        * Return: zero if succeeded, negative for errors.
>>>>> +        */
>>>>> +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
>>>>> +                             struct folio **folio);
>>>>>
>>>>> This is one of the contention points in the current scenario as the
>>>>> folio would be returned.
>>>>
>>>> I don't see a problem with it. It's not any different from
>>>> vma_ops->fault(): a callback for a filesystem to get a folio that will be
>>>> mapped afterwards by the mm code.
>>>>
>>>
>>> I disagree, the filesystem vma_ops->fault() is not a config option like
>>> this one.  So we are on a path to enable uffd by default, and it really
>>> needs work beyond this series.  Setting up a list head and passing it
>>> through every call stack is far from ideal.
>>
>> I don't follow you here. How addition of uffd callbacks guarded by a config
>> option to vma_ops leads to enabling uffd by default?
> 
> Any new memory type that uses the above interface now needs uffd
> enabled, anyone planning to use guest_memfd needs it enabled, anyone
> able to get a module using this interface needs it enabled (by whoever
> gives them the kernel they use).  Kernel providers now need to enable
> UFFD - which is different than the example provided.
> 
>>
>>> I also think the filesystem model is not one we want to duplicate in mm
>>> for memory types - think of the test issues we have now and then have a
>>> look at the xfstests support of filesystems [1].
>>>
>>> So we are on a path of less test coverage, and more code that is
>>> actually about mm that is outside of mm.  So, is there another way?
>>
>> There are quite a few vma_ops outside fs/ not covered by xfstest, so the
>> test coverage argument is moot at best.
>> And anything in the kernel can grab a folio and do whatever it pleases.
> 
> Testing filesystems is nothing short of a nightmare and I don't want mm
> to march happily towards that end.  This interface is endlessly flexible
> and thus endlessly broken and working at the same time.
> 
> IOW, we have given anyone wanting to implement a new memory type
> infinite freedoms to run afoul, but they won't be looking for those
> people when things go horribly wrong - they will most likely see a
> memory issue and come here. syzbot will see a hang on some mm lock in an
> unrelated task, or whatever.
> 
> I would rather avoid the endlessly flexible interface to avoid incorrect
> uses in favour of a limited selection of choices, that could be expanded
> if necessary, but would be more visible to the mm people going in.  That
> is, people can add new memory types through adding them to mm/ instead
> of in driver/ or out of tree.
> 
> I could very much see someone looking to use this for a binder-type
> driver and that might work out really well!  But I don't want someone
> doing it and shoving the folio pointer in a custom struct because they
> *know* it's fine, so what's the big deal?  I don't mean to pick on
> binder, but this example comes to mind.
> 
>>
>> Nevertheless, let's step back for a second and instead focus on the problem
>> these patches are trying to solve, which is to allow guest_memfd implement
>> UFFD_CONTINUE (or minor fault in other terminology).
> 
> Well, this is about modularizing memory types, but the first user is
> supposed to be the guest-memfd support.
> 
>>
>> This means uffd should be able to map a folio that's already in
>> guest_memfd page cache to the faulted address. Obviously, the page table
>> update happens in uffd. But it still has to find what to map and we need
>> some way to let guest_memfd tell that to uffd.
>>
>> So we need a hook somewhere that will return a folio matching pgoff in
>> vma->file->inode.
>>
>> Do you see a way to implement it otherwise?
> 
> I must be missing something.
> 
> UFFDIO_CONTINUE currently enters through an ioctl that calls
> userfaultfd_continue() -> mfill_atomic_continue()... mfill_atomic() gets
> and uses the folio to actually do the work.  Right now, we don't hand
> out the folio, so what is different here?
> 
> I am under the impression that we don't need to return the folio, but
> may need to do work on it.  That is, we can give the mm side what it
> needs to call the related memory type functions to service the request.
> 
> For example, one could pass in the inode, pgoff, and memory type and the
> mm code could then call the fault handler for that memory type?
> 
> I didn't think Nikita had a folio returned in his first three patches
> [1], but then they built on other patches and it was difficult to follow
> along.  Is it because that interface was agreed on in a call on 23 Jan
> 2025 [2], as somewhat unclearly stated in [1]?

I believe you can safely ignore what was discussed in [2] as it is 
irrelevant to this discussion.  That was just reasoning why it was 
possible to use UserfaultFD for guest_memfd as opposed to inventing an 
alternative solution to handling faults in userspace.

Regarding returning a folio, [1] was calling vm_ops->fault() in 
UserfaultFD code.  The fault() itself gets a folio (at least in 
guest_memfd implementation [3]).  Does it look like a preferable 
solution to you?

The other patches I was building on top of were mmap support in 
guest_memfd [4], which is currently merged in kvm/next, and also part of 
[3].

[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/david/linux.git/tree/virt/kvm/guest_memfd.c?id=911634bac3107b237dcd8fdcb6ac91a22741cbe7#n347
[4] https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com

> 
> Thanks,
> Liam
> 
> [1].  https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
> [2].  https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 17:15                                 ` Nikita Kalyazin
@ 2025-09-18 17:45                                   ` Lorenzo Stoakes
  2025-09-18 17:53                                     ` David Hildenbrand
  2025-09-18 17:54                                   ` Liam R. Howlett
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 17:45 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Liam R. Howlett, Mike Rapoport, Peter Xu, David Hildenbrand,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

On Thu, Sep 18, 2025 at 06:15:41PM +0100, Nikita Kalyazin wrote:
> The other patches I was building on top of were mmap support in guest_memfd
> [4], which is currently merged in kvm/next, and also part of [3].
>
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/david/linux.git/tree/virt/kvm/guest_memfd.c?id=911634bac3107b237dcd8fdcb6ac91a22741cbe7#n347
> [4] https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com

Note that the mmap implementation is changing to mmap_prepare. It's the
case in point example of where we exposed an internal mm object (VMA) and
experienced significant issues including embargoed critical zero-day
security reports as a result.

I took a quick look at [3] and it seems you're using it in a totally normal
way which is completely fine, however, and it'll be very easily convertible
to mmap_prepare.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 17:45                                   ` Lorenzo Stoakes
@ 2025-09-18 17:53                                     ` David Hildenbrand
  2025-09-18 18:20                                       ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-09-18 17:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Nikita Kalyazin
  Cc: Liam R. Howlett, Mike Rapoport, Peter Xu, Suren Baghdasaryan,
	linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	Andrea Arcangeli, Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On 18.09.25 19:45, Lorenzo Stoakes wrote:
> On Thu, Sep 18, 2025 at 06:15:41PM +0100, Nikita Kalyazin wrote:
>> The other patches I was building on top of were mmap support in guest_memfd
>> [4], which is currently merged in kvm/next, and also part of [3].
>>
>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/david/linux.git/tree/virt/kvm/guest_memfd.c?id=911634bac3107b237dcd8fdcb6ac91a22741cbe7#n347
>> [4] https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com
> 
> Note that the mmap implementation is changing to mmap_prepare. It's the
> case in point example of where we exposed an internal mm object (VMA) and
> experienced significant issues including embargoed critical zero-day
> security reports as a result.
> 
> I took a quick look at [3] and it seems you're using it in a totally normal
> way which is completely fine, however, and it'll be very easily convertible
> to mmap_prepare.

Yes, we'll take care of that once it's (finally :D ) upstream!

Re Nikita: If we could just reuse fault() for userfaultfd purposes, that 
might actually be pretty nice.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 17:15                                 ` Nikita Kalyazin
  2025-09-18 17:45                                   ` Lorenzo Stoakes
@ 2025-09-18 17:54                                   ` Liam R. Howlett
  1 sibling, 0 replies; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-18 17:54 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Mike Rapoport, Lorenzo Stoakes, Peter Xu, David Hildenbrand,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

* Nikita Kalyazin <kalyazin@amazon.com> [250918 13:16]:

...
> > > 
> > > Nevertheless, let's step back for a second and instead focus on the problem
> > > these patches are trying to solve, which is to allow guest_memfd implement
> > > UFFD_CONTINUE (or minor fault in other terminology).
> > 
> > Well, this is about modularizing memory types, but the first user is
> > supposed to be the guest-memfd support.
> > 
> > > 
> > > This means uffd should be able to map a folio that's already in
> > > guest_memfd page cache to the faulted address. Obviously, the page table
> > > update happens in uffd. But it still has to find what to map and we need
> > > some way to let guest_memfd tell that to uffd.
> > > 
> > > So we need a hook somewhere that will return a folio matching pgoff in
> > > vma->file->inode.
> > > 
> > > Do you see a way to implement it otherwise?
> > 
> > I must be missing something.
> > 
> > UFFDIO_CONTINUE currently enters through an ioctl that calls
> > userfaultfd_continue() -> mfill_atomic_continue()... mfill_atomic() gets
> > and uses the folio to actually do the work.  Right now, we don't hand
> > out the folio, so what is different here?
> > 
> > I am under the impression that we don't need to return the folio, but
> > may need to do work on it.  That is, we can give the mm side what it
> > needs to call the related memory type functions to service the request.
> > 
> > For example, one could pass in the inode, pgoff, and memory type and the
> > mm code could then call the fault handler for that memory type?
> > 
> > I didn't think Nikita had a folio returned in his first three patches
> > [1], but then they built on other patches and it was difficult to follow
> > along.  Is it because that interface was agreed on in a call on 23 Jan
> > 2025 [2], as somewhat unclearly stated in [1]?
> 
> I believe you can safely ignore what was discussed in [2] as it is
> irrelevant to this discussion.  That was just reasoning why it was possible
> to use UserfaultFD for guest_memfd as opposed to inventing an alternative
> solution to handling faults in userspace.
> 
> Regarding returning a folio, [1] was calling vm_ops->fault() in UserfaultFD
> code.  The fault() itself gets a folio (at least in guest_memfd
> implementation [3]).  Does it look like a preferable solution to you?

I think this answers my question.. but I want to be sure.  Does that
mean you were getting the folio and doing the work in uffd without
returning the folio?  I tried to get those patches, but they didn't apply
for me.

What I want to do is limit the "memory type" that we support by
restricting what is done to service the fault, and handle that in mm
code (mm/uffd.c or whatever).

What we get is more people using the same fault handler and thus more
eyes and testing.  Less code duplication.

Unless there is a technical reason we need more flexibility?

> 
> The other patches I was building on top of were mmap support in guest_memfd
> [4], which is currently merged in kvm/next, and also part of [3].


Can we process it in the mm without returning the folio like the ioctl
does today, or is there a technical reason that won't work?

Thanks,
Liam


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 16:47                               ` Liam R. Howlett
  2025-09-18 17:15                                 ` Nikita Kalyazin
@ 2025-09-18 18:05                                 ` Mike Rapoport
  2025-09-18 18:32                                   ` Liam R. Howlett
  1 sibling, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-09-18 18:05 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes, Peter Xu,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Sep 18, 2025 at 12:47:41PM -0400, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@kernel.org> [250918 04:37]:
> > On Wed, Sep 17, 2025 at 12:53:05PM -0400, Liam R. Howlett wrote:
> > >
> > > I disagree, the filesystem vma_ops->fault() is not a config option like
> > > this one.  So we are on a path to enable uffd by default, and it really
> > > needs work beyond this series.  Setting up a list head and passing it
> > > through every call stack is far from ideal.
> > 
> > I don't follow you here.  How does the addition of uffd callbacks guarded
> > by a config option to vma_ops lead to enabling uffd by default?
> 
> Any new memory type that uses the above interface now needs uffd
> enabled, anyone planning to use guest_memfd needs it enabled, anyone
> able to get a module using this interface needs it enabled (by whoever
> gives them the kernel they use).  Kernel providers now need to enable
> UFFD - which is different from the example provided.

My understanding of Peter's suggestion is that *if* uffd is enabled, a
memory type *may* implement the API, whatever API we come up with.
  
> > Nevertheless, let's step back for a second and instead focus on the problem
> > these patches are trying to solve, which is to allow guest_memfd to
> > implement UFFD_CONTINUE (or minor fault in other terminology).
> 
> Well, this is about modularizing memory types, but the first user is
> supposed to be the guest-memfd support.
> 
> > 
> > This means uffd should be able to map a folio that's already in
> > guest_memfd page cache to the faulted address. Obviously, the page table
> > update happens in uffd. But it still has to find what to map and we need
> > some way to let guest_memfd tell that to uffd.
> > 
> > So we need a hook somewhere that will return a folio matching pgoff in
> > vma->file->inode.
> > 
> > Do you see a way to implement it otherwise?
> 
> I must be missing something.
> 
> UFFDIO_CONTINUE currently enters through an ioctl that calls
> userfaultfd_continue() -> mfill_atomic_continue()... mfill_atomic() gets
> and uses the folio to actually do the work.  Right now, we don't hand
> out the folio, so what is different here?

The ioctl() is the means for userspace to resolve a page fault, and
mfill_atomic() needs something similar to ->fault() to actually get the
folio.  And in the case of shmem and guest_memfd the folio lives in the
page cache.
 
> I am under the impression that we don't need to return the folio, but
> may need to do work on it.  That is, we can give the mm side what it
> needs to call the related memory type functions to service the request.
> 
> For example, one could pass in the inode, pgoff, and memory type and the
> mm code could then call the fault handler for that memory type?

How does calling the fault handler differ conceptually from calling
uffd_get_folio()?
If you take a look at UFFD_CONTINUE for shmem, this is pretty much what's
happening: the uffd side finds the inode and pgoff and calls
shmem_get_folio(), which is very similar to shmem->fault().

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 17:53                                     ` David Hildenbrand
@ 2025-09-18 18:20                                       ` Peter Xu
  2025-09-18 19:43                                         ` Liam R. Howlett
  2025-09-22 17:20                                         ` David Hildenbrand
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-18 18:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Nikita Kalyazin, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> might actually be pretty nice.

I commented on that.

https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/

That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't necessary,
make it extremely hard to know when to set the flag, and complicate the
fault path unnecessarily.

I think Mike's comment was spot on, that the new API is literally
do_fault() for shmem, but only used in userfaultfd context, so it's even
a one-liner.

I do not maintain mm, so the above is only my two cents; I don't make the
decisions.  Personally I still prefer the current approach of keeping the
mm main fault path clean.

Besides, this series also cleans up code all over the place; the
vm_uffd_ops is a minimal description of a memory type.
So IMHO it's beneficial in other respects as well.  If uffd_copy() is a
concern, fine, we can drop it.  We don't plan to have more users of
UFFDIO_COPY outside of the known three memory types after all.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 18:05                                 ` Mike Rapoport
@ 2025-09-18 18:32                                   ` Liam R. Howlett
  2025-09-18 19:32                                     ` Peter Xu
  2025-09-19  9:05                                     ` Mike Rapoport
  0 siblings, 2 replies; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-18 18:32 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Peter Xu, David Hildenbrand,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

* Mike Rapoport <rppt@kernel.org> [250918 14:05]:

...

>  
> > I am under the impression that we don't need to return the folio, but
> > may need to do work on it.  That is, we can give the mm side what it
> > needs to call the related memory type functions to service the request.
> > 
> > For example, one could pass in the inode, pgoff, and memory type and the
> > mm code could then call the fault handler for that memory type?
> 
> How does calling the fault handler differ conceptually from calling
> uffd_get_folio()?
> If you take a look at UFFD_CONTINUE for shmem, this is pretty much what's
> happening: the uffd side finds the inode and pgoff and calls
> shmem_get_folio(), which is very similar to shmem->fault().

I believe the difference is the location of the code that handles the
folio.  One approach would decouple the folio processing from the mm, while
the other would keep the folio processing within the mm and only decouple
which processing is done.

Does that make sense?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 18:32                                   ` Liam R. Howlett
@ 2025-09-18 19:32                                     ` Peter Xu
  2025-09-19  9:05                                     ` Mike Rapoport
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-18 19:32 UTC (permalink / raw)
  To: Liam R. Howlett, Mike Rapoport, Nikita Kalyazin, Lorenzo Stoakes,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Sep 18, 2025 at 02:32:31PM -0400, Liam R. Howlett wrote:
> I believe the difference is the location of the code that handles the
> folio.  One approach would decouple the folio processing from the mm, while
> the other would keep the folio processing within the mm and only decouple
> which processing is done.
> 
> Does that make sense?

Not to me.

do_fault() allows allocation and some other things (e.g. trappable),
uffd_get_folio() doesn't. They're fundamentally exactly the same otherwise.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 18:20                                       ` Peter Xu
@ 2025-09-18 19:43                                         ` Liam R. Howlett
  2025-09-18 21:07                                           ` Peter Xu
  2025-09-22 17:20                                         ` David Hildenbrand
  1 sibling, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-18 19:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Lorenzo Stoakes, Nikita Kalyazin,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250918 14:21]:
> On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > might actually be pretty nice.
> 
> I commented on that.
> 
> https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
> 
> That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't necessary,
> make it extremely hard to know when to set the flag, and complicate the
> fault path unnecessarily.
> 
> I think Mike's comment was spot on, that the new API is literally
> do_fault() for shmem, but only used in userfaultfd context, so it's even
> a one-liner.
> 
> I do not maintain mm, so the above is only my two cents; I don't make the
> decisions.  Personally I still prefer the current approach of keeping the
> mm main fault path clean.

What we are trying to say is you can have a fault path that takes a type
enum that calls into a function that does whatever you want.  It can
even live in mm/userfaultfd.c.  It can then jump off to mm/guest-memfd.c
which can contain super unique copying of memory if that's needed.

That way driver/i_know_better_that_everyone.c or fs/stature.c don't
decide they can register their uffd and do cool stuff that totally won't
tank the system in random strange ways.

Seriously, how many fault handlers are you expecting to have here?

I'd be surprised if a lot of the code in these memory types isn't
shared, but I guess if they are all over the kernel they'll just clone
the code and bugs (like the arch code does, but with less of a reason).

> Besides, this series also cleans up code all over the place; the
> vm_uffd_ops is a minimal description of a memory type.

6 files changed, 207 insertions(+), 76 deletions(-)

Can you please point me to which patch has clean up?

The mm has uffd code _everywhere_, including mm/userfaultfd.c that jumps
to fs/userfaultfd.c and back.  Every entry has a LIST_HEAD(uf) [1] [2]
[3] in it that is passed through the entire call stack in case it is
needed [4] [5] [6] [7] [8].   It has if statements in core mm functions
in the middle of loops [9].  It complicates error handling and has
tricky locking [10] (full disclosure, I helped with the locking..
totally worth the complication).

This is a seriously complicated feature.

How is a generic callback that splits out into, probably 4?, functions
the deal breaker here?

How is leaking a flag the line that we won't cross?

> So IMHO it's beneficial in other aspects as well.  If uffd_copy() is a
> concern, fine, we drop it.  We don't plan to have more use of UFFDIO_COPY
> outside of the known three memory types after all.

EXACTLY!  There are three memory types and we're going with the most
flexible interface possible, with the most danger.  With guest_memfd
we're up to four functions we'd need.  Why not keep the mm code in the
mm and have four functions to choose from?  If you want 5 we can always
add another.

Regards,
Liam


[1]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/mmap.c#L122
[2]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/vma.c#L3149
[3]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/util.c#L572


[4]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/mmap.c#L338
[5]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/mmap.c#L1063
[6]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/vma.c#L1478
[7]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/vma.c#L1517
[8]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/vma.c#L1563

[9]. https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/vma.c#L1407

[10].  https://elixir.bootlin.com/linux/v6.17-rc6/source/mm/userfaultfd.c#L69




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 19:43                                         ` Liam R. Howlett
@ 2025-09-18 21:07                                           ` Peter Xu
  2025-09-19  1:50                                             ` Liam R. Howlett
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-18 21:07 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
	Nikita Kalyazin, Mike Rapoport, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Thu, Sep 18, 2025 at 03:43:34PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250918 14:21]:
> > On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > > might actually be pretty nice.
> > 
> > I commented on that.
> > 
> > https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/

[1]

> > 
> > That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't necessary,
> > make it extremely hard to know when to set the flag, and complicate the
> > fault path unnecessarily.
> > 
> > I think Mike's comment was spot on, that the new API is literally
> > do_fault() for shmem, but only used in userfaultfd context, so it's even
> > a one-liner.
> > 
> > I do not maintain mm, so the above is only my two cents; I don't make the
> > decisions.  Personally I still prefer the current approach of keeping the
> > mm main fault path clean.
> 
> What we are trying to say is you can have a fault path that takes a type
> enum that calls into a function that does whatever you want.  It can
> even live in mm/userfaultfd.c.  It can then jump off to mm/guest-memfd.c
> which can contain super unique copying of memory if that's needed.

Per your mention of mm/guest-memfd.c, are you suggesting we take the
guest-memfd library approach?

We have in total of at least three proposals:

(a) https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
(b) this one
(c) https://lore.kernel.org/all/20250915161815.40729-1-kalyazin@amazon.com/

I reviewed (a) and (c) and provided my comments.  If you prefer the
library approach, feel free to reply directly in the (c) thread against my
email.

I chose (b), from when it was posted.

> 
> That way driver/i_know_better_that_everyone.c or fs/stature.c don't
> decide they can register their uffd and do cool stuff that totally won't
> tank the system in random strange ways.

What is the difference if they are allowed to register ->fault() and tank
the system?

> 
> Seriously, how many fault handlers are you expecting to have here?

First of all, it's not about "how many".  We can assume one user as of now.
Talking about any future user doesn't really help.  The choice I made above
on (b) is the best solution I think, with any known possible users.  The
plan might change when more use cases pop up.  However, we can only try to
make a fair decision with the current status quo.

OTOH, the vm_uffd_ops also provides other fields (besides uffd_*() hooks).
I wouldn't be surprised if a driver wants to opt in with some of the fields
but no hooks attached at all, when a userfaultfd feature is
automatically friendly to all kinds of memory types.

Consider one VMA that sets UFFDIO_WRITEPROTECT but without most of the
rest.

As I discussed, IMHO (b) is the clean way to describe userfaultfd demand
for any memory type.

> 
> I'd be surprised if a lot of the code in these memory types isn't
> shared, but I guess if they are all over the kernel they'll just clone
> the code and bugs (like the arch code does, but with less of a reason).
> 
> > Besides, this series also cleans up code all over the place; the
> > vm_uffd_ops is a minimal description of a memory type.
> 
> 6 files changed, 207 insertions(+), 76 deletions(-)
> 
> Can you please point me to which patch has clean up?

Patch 4.  If you want me to explain every change I touched that is a
cleanup, I can go into details.  Maybe it's faster if you read it; it's
not a huge patch.

> 
> The mm has uffd code _everywhere_, including mm/userfaultfd.c that jumps
> to fs/userfaultfd.c and back.  Every entry has a LIST_HEAD(uf) [1] [2]
> [3] in it that is passed through the entire call stack in case it is
> needed [4] [5] [6] [7] [8].   It has if statements in core mm functions
> in the middle of loops [9].  It complicates error handling and has
> tricky locking [10] (full disclosure, I helped with the locking..
> totally worth the complication).
> 
> This is a seriously complicated feature.

I know you're the expert on VMAs; you've worked a lot on them.  I
understand those things can cause you frustration when touching that code.
But please let's not bring those into reviewing this series.  They have
nothing to do with this series.

Frankly, I also wish that one day we can get rid of some of them.  It
really depends on how many users there are for the uffd events besides the
generic ones (fault traps).

> 
> How is a generic callback that splits out into, probably 4?, functions
> the deal breaker here?
> 
> How is leaking a flag the line that we won't cross?
> 
> > So IMHO it's beneficial in other aspects as well.  If uffd_copy() is a
> > concern, fine, we drop it.  We don't plan to have more use of UFFDIO_COPY
> > outside of the known three memory types after all.
> 
> EXACTLY!  There are three memory types and we're going to the most
> flexible interface possible, with the most danger.  With guest_memfd
> we're up to four functions we'd need.  Why not keep the mm code in the
> mm and have four functions to choose from?  If you want 5 we can always
> add another.

I assume for most of the rest of the comments, you're suggesting the
library approach.  If so, again, I suggest we discuss explicitly in that
thread.

The library code may (or may not) be useful for other purposes.  For the
support of userfaultfd, it definitely doesn't need to be a dependency.
OTOH, we may still want some abstraction like this one with/without the
library.  If so, I don't really see why we need to push back on this.

AFAIU, the only concern (after dropping uffd_copy) is we'll expose
uffd_get_folio().  Please help explain why ->fault() isn't worse.

If we accepted ->fault() for all these years, I don't see a reason we
should reject ->uffd_get_folio(), especially when one of the goals is to
keep the core mm path clean, per my comment to proposal (a).

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 21:07                                           ` Peter Xu
@ 2025-09-19  1:50                                             ` Liam R. Howlett
  2025-09-19 14:16                                               ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-19  1:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Lorenzo Stoakes, Nikita Kalyazin,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250918 17:07]:
> On Thu, Sep 18, 2025 at 03:43:34PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@redhat.com> [250918 14:21]:
> > > On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > > > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > > > might actually be pretty nice.
> > > 
> > > I commented on that.
> > > 
> > > https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
> 
> [1]
> 
> > > 
> > > That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't necessary,
> > > make it extremely hard to know when to set the flag, and complicate the
> > > fault path unnecessarily.
> > > 
> > > I think Mike's comment was spot on, that the new API is literally
> > > do_fault() for shmem, but only used in userfaultfd context, so it's even
> > > a one-liner.
> > > 
> > > I do not maintain mm, so the above is only my two cents; I don't make the
> > > decisions.  Personally I still prefer the current approach of keeping the
> > > mm main fault path clean.
> > 
> > What we are trying to say is you can have a fault path that takes a type
> > enum that calls into a function that does whatever you want.  It can
> > even live in mm/userfaultfd.c.  It can then jump off to mm/guest-memfd.c
> > which can contain super unique copying of memory if that's needed.
> 
> Per mentioning of mm/guest-memfd.c, are you suggesting to take guest-memfd
> library approach?

> We have in total of at least three proposals:
> 
> (a) https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
> (b) this one
> (c) https://lore.kernel.org/all/20250915161815.40729-1-kalyazin@amazon.com/
> 
> I reviewed (a) and (c) and provided my comments.  If you prefer the
> library approach, feel free to reply directly in the (c) thread against my
> email.
> 
> I chose (b), from when it was posted.

Honestly, I don't know what I'd vote for because I don't like any of
them.  I'd choose (d), the do-nothing option.

> 
> 
> > 
> > That way driver/i_know_better_that_everyone.c or fs/stature.c don't
> > decide they can register their uffd and do cool stuff that totally won't
> > tank the system in random strange ways.
> 
> What is the difference if they are allowed to register ->fault() and tank
> the system?

One less problem.

More people with mm experience looking at the handling of folios.

The common code not being cloned and kept up to date when an issue in
the original is discovered.

Having to only update a few fault handlers when there is a folio or
other mm change.

Hopefully better testing?

> > 
> > Seriously, how many fault handlers are you expecting to have here?
> 
> First of all, it's not about "how many".  We can assume one user as of now.
> Talking about any future user doesn't really help.  The choice I made above
> on (b) is the best solution I think, with any known possible users.  The
> plan might change when more use cases pop up.  However, we can only try to
> make a fair decision with the current status quo.

Planning to handle one, five, or two billion makes a difference in what
you do.  Your plan right now enables everyone to do whatever they want,
where they want.  I don't think we need this sort of flexibility with
the limited number of users?

> 
> OTOH, the vm_uffd_ops also provides other fields (besides uffd_*() hooks).
> I wouldn't be surprised if a driver wants to opt in with some of the fields
> but no hooks attached at all, when a userfaultfd feature is
> automatically friendly to all kinds of memory types.
> 
> Consider one VMA that sets UFFDIO_WRITEPROTECT but without most of the
> rest.
> 
> As I discussed, IMHO (b) is the clean way to describe userfaultfd demand
> for any memory type.
> 
> > 
> > I'd be surprised if a lot of the code in these memory types isn't
> > shared, but I guess if they are all over the kernel they'll just clone
> > the code and bugs (like the arch code does, but with less of a reason).
> > 
> > > Besides, this series also cleans up code all over the place; the
> > > vm_uffd_ops is a minimal description of a memory type.
> > 
> > 6 files changed, 207 insertions(+), 76 deletions(-)
> > 
> > Can you please point me to which patch has clean up?
> 
> Patch 4.  If you want me to explain every change I touched that is a
> cleanup, I can go into details.  Maybe it's faster if you read it; it's
> not a huge patch.

I responded here [1].  I actually put a lot of effort into that response
and took a lot of time to dig into some of this to figure out if it was
possible, and suggested some ideas.

That was back in July, so the details aren't that fresh anymore.  Maybe
you missed my reply?

I was hoping that, if the code was cleaned up, a solution may be more
clear.

I think the ops idea has a lot of positives.  I also think you don't
need to return a folio pointer to make it work.

> > 
> > How is a generic callback that splits out into, probably 4?, functions
> > the deal breaker here?
> > 
> > How is leaking a flag the line that we won't cross?
> > 
> > > So IMHO it's beneficial in other aspects as well.  If uffd_copy() is a
> > > concern, fine, we drop it.  We don't plan to have more use of UFFDIO_COPY
> > > outside of the known three memory types after all.
> > 
> > EXACTLY!  There are three memory types and we're going with the most
> > flexible interface possible, with the most danger.  With guest_memfd
> > we're up to four functions we'd need.  Why not keep the mm code in the
> > mm and have four functions to choose from?  If you want 5 we can always
> > add another.
> 
> I assume for most of the rest of the comments, you're suggesting the
> library approach.  If so, again, I suggest we discuss explicitly in that
> thread.
> 
> The library code may (or may not) be useful for other purposes.  For the
> support of userfaultfd, it definitely doesn't need to be a dependency.
> OTOH, we may still want some abstraction like this one with/without the
> library.  If so, I don't really see why we need to push back on this.
> 
> AFAIU, the only concern (after dropping uffd_copy) is we'll expose
> uffd_get_folio().

If you read my response [1], then you can see that I find the ops
underutilized and lacking code simplification.

> Please help explain why ->fault() isn't worse.

I'm not sure it is worse, but I don't think it's necessary to return a
folio for 4 users.  And I think it could be better if we handled the
operations on the folio internally, if at all possible.

> 
> If we accepted ->fault() for all these years, I don't see a reason we
> should reject ->uffd_get_folio(), especially one of the goals is to keep
> the core mm path clean, per my comment to proposal (a).

I see this argument as saying "there's a hole in our boat so why can't I
make another?"  It's not the direction we have to go to get what we need
right now, so why are we doing it?  Like you said, it can be evaluated
later if things change..

My thoughts were around an idea that we only really need to do a limited
number of operations on that pointer you are returning.  Those
operations may share code, and could be internal to mm.  I don't see
this as (a), (b), or (c), but maybe an addition to (b)?  Maybe we need
more ops to cover the uses?

So, I think I do want the vm_uffd_ops idea, but not as it is written
right now.

Thanks,
Liam

[1]. https://lore.kernel.org/all/e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 18:32                                   ` Liam R. Howlett
  2025-09-18 19:32                                     ` Peter Xu
@ 2025-09-19  9:05                                     ` Mike Rapoport
  1 sibling, 0 replies; 30+ messages in thread
From: Mike Rapoport @ 2025-09-19  9:05 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes, Peter Xu,
	David Hildenbrand, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Sep 18, 2025 at 02:32:31PM -0400, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@kernel.org> [250918 14:05]:
> 
> ...
> 
> >  
> > > I am under the impression that we don't need to return the folio, but
> > > may need to do work on it.  That is, we can give the mm side what it
> > > needs to call the related memory type functions to service the request.
> > > 
> > > For example, one could pass in the inode, pgoff, and memory type and the
> > > mm code could then call the fault handler for that memory type?
> > 
> > How does calling the fault handler differ conceptually from calling
> > uffd_get_folio()?
> > If you take a look at UFFD_CONTINUE for shmem, this is pretty much what's
> > happening: the uffd side finds the inode and pgoff and calls
> > shmem_get_folio(), which is very similar to shmem->fault().
> 
> I believe the difference is the location of the code that handles the
> folio.  One approach would decouple the folio processing from the mm, while
> the other would keep the folio processing within the mm and only decouple
> which processing is done.
> 
> Does that make sense?

No :)

In short, uffd_get_folio() is a special case of ->fault().

tl;dr version is that the whole processing is page fault handling with a
brief excursion to userspace. For VMAs registered with uffd, page faults
are intercepted by uffd and delivered as events to userspace. There are
several ways for userspace to resolve a page fault; in this case we are
talking about UFFD_CONTINUE. Its semantics are similar to a minor fault: if
the faulted folio is already in the page cache of the address space backing
the VMA, it is mapped into the page table. If the folio is not in the page
cache, uffd returns -EFAULT.

So the processing of the folio that uffd_get_folio() does is exactly what
->fault() would do for a folio found in the page cache of the inode backing
the VMA. And unlike ->fault(), uffd_get_folio() should not allocate a new
folio if it's not in the page cache.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19  1:50                                             ` Liam R. Howlett
@ 2025-09-19 14:16                                               ` Peter Xu
  2025-09-19 14:34                                                 ` Lorenzo Stoakes
  2025-09-19 19:38                                                 ` Liam R. Howlett
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-19 14:16 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
	Nikita Kalyazin, Mike Rapoport, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Thu, Sep 18, 2025 at 09:50:49PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250918 17:07]:
> > On Thu, Sep 18, 2025 at 03:43:34PM -0400, Liam R. Howlett wrote:
> > > * Peter Xu <peterx@redhat.com> [250918 14:21]:
> > > > On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > > > > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > > > > might actually be pretty nice.
> > > > 
> > > > I commented on that.
> > > > 
> > > > https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
> > 
> > [1]
> > 
> > > > 
> > > > That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE which isn't necessary,
> > > > make it extremely hard to know when to set the flag, and comlicates the
> > > > fault path which isn't necessary.
> > > > 
> > > > I think Mike's comment was spot on, that the new API is literally
> > > > do_fault() for shmem, but only used in userfaultfd context so it's
> > > > even a one-liner.
> > > > 
> > > > I do not maintain mm, so the above is only my two cents; I don't make
> > > > decisions.  Personally I still prefer the current approach of keeping
> > > > the main mm fault path clean.
> > > 
> > > What we are trying to say is you can have a fault path that takes a type
> > > enum that calls into a function that does whatever you want.  It can
> > > even live in mm/userfaultfd.c.  It can then jump off to mm/guest-memfd.c
> > > which can contain super unique copying of memory if that's needed.
> > 
> > Per mentioning of mm/guest-memfd.c, are you suggesting to take guest-memfd
> > library approach?
> 
> > We have in total of at least three proposals:
> > 
> > (a) https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
> > (b) this one
> > (c) https://lore.kernel.org/all/20250915161815.40729-1-kalyazin@amazon.com/
> > 
> > I reviewed (a) and (c) and I provided my comments.  If you prefer the
> > library approach, feel free to reply directly to (c) thread against my
> > email.
> > 
> > I chose (b), from when it was posted.
> 
> Honestly, I don't know what I'd vote for because I don't like any of
> them.  I'd choose (d), the do-nothing option.
> 
> > 
> > 
> > > 
> > > That way driver/i_know_better_than_everyone.c or fs/stature.c don't
> > > decide they can register their uffd and do cool stuff that totally won't
> > > tank the system in random strange ways.
> > 
> > What is the difference if they are allowed to register ->fault() and tank
> > the system?
> 
> One less problem.
> 
> More people with mm experience looking at the handling of folios.
> 
> The common code not being cloned and kept up to date when an issue in
> the original is discovered.
> 
> Having to only update a few fault handlers when there is a folio or
> other mm change.
> 
> Hopefully better testing?
> 
> > > 
> > > Seriously, how many fault handlers are you expecting to have here?
> > 
> > First of all, it's not about "how many".  We can assume one user as of now.
> > Talking about any future user doesn't really help.  The choice I made above,
> > (b), is the best solution I think, given all known possible users.  The
> > plan might change when more use cases pop up.  However, we can only try to
> > make a fair decision with the current status quo.
> 
> Planning to handle one, five, or two billion makes a difference in what
> you do.  Your plan right now enables everyone to do whatever they want,
> where they want.  I don't think we need this sort of flexibility with
> the limited number of users?
> 
> > 
> > OTOH, the vm_uffd_ops also provides other fields (besides uffd_*() hooks).
> > I wouldn't be surprised if a driver wants to opt in to some of the fields
> > with zero hooks attached at all, when a userfaultfd feature is
> > automatically friendly to all kinds of memory types.
> > 
> > Consider one VMA that needs UFFDIO_WRITEPROTECT but not most of the
> > rest.
> > 
> > As I discussed, IMHO (b) is the clean way to describe userfaultfd demand
> > for any memory type.
> > 
> > > 
> > > I'd be surprised if a lot of the code in these memory types isn't
> > > shared, but I guess if they are all over the kernel they'll just clone
> > > the code and bugs (like the arch code does, but with less of a reason).
> > > 
> > > > Besides, this series also cleans up other places all over the place;
> > > > vm_uffd_ops is a most simplified description of a memory type.
> > > 
> > > 6 files changed, 207 insertions(+), 76 deletions(-)
> > > 
> > > Can you please point me to which patch has clean up?
> > 
> > Patch 4.  If you want me to explain every change I touched that is a
> > cleanup, I can go into details.  Maybe it's faster if you read them, it's
> > not a huge patch.
> 
> I responded here [1].  I actually put a lot of effort into that response
> and took a lot of time to dig into some of this to figure out if it was
> possible, and suggested some ideas.
> 
> That was back in July, so the details aren't that fresh anymore.  Maybe
> you missed my reply?

AFAICT it was the other way round.  My reply is here:

https://lore.kernel.org/all/aMnAscxj_h42wOAC@x1.local/

> 
> I was hoping that, if the code was cleaned up, a solution may be more
> clear.
> 
> I think the ops idea has a lot of positives.  I also think you don't
> need to return a folio pointer to make it work.
> 
> > > 
> > > How is a generic callback that splits out into, probably 4?, functions
> > > the deal breaker here?
> > > 
> > > How is leaking a flag the line that we won't cross?
> > > 
> > > > So IMHO it's beneficial in other aspects as well.  If uffd_copy() is a
> > > > concern, fine, we drop it.  We don't plan to have more use of UFFDIO_COPY
> > > > outside of the known three memory types after all.
> > > 
> > > EXACTLY!  There are three memory types and we're going with the most
> > > flexible interface possible, with the most danger.  With guest_memfd
> > > we're up to four functions we'd need.  Why not keep the mm code in the
> > > mm and have four functions to choose from?  If you want 5 we can always
> > > add another.
> > 
> > I assume for most of the rest comments, you're suggesting the library
> > approach.  If so, again, I suggest we discuss explicitly in that thread.
> > 
> > The library code may (or may not) be useful for other purposes.  For the
> > support of userfaultfd, it definitely doesn't need to be a dependency.
> > OTOH, we may still want some abstraction like this one with/without the
> > library.  If so, I don't really see why we need to pushback on this.
> > 
> > AFAIU, the only concern (after drop uffd_copy) is we'll expose
> > uffd_get_folio().
> 
> If you read my response [1], then you can see that I find the ops
> underutilized and lacks code simplification.

I tried to explain, in my reply above, why what you mentioned is completely
orthogonal to this change.

> 
> > Please help explain why ->fault() isn't worse.
> 
> I'm not sure it is worse, but I don't think it's necessary to return a
> folio for 4 users.  And I think it could be better if we handled the
> operations on the folio internally, if at all possible.
> 
> > 
> > If we accepted ->fault() for all these years, I don't see a reason we
> > should reject ->uffd_get_folio(), especially one of the goals is to keep
> > the core mm path clean, per my comment to proposal (a).
> 
> I see this argument as saying "there's a hole in our boat so why can't I
> make another?"  It's not the direction we have to go to get what we need
> right now, so why are we doing it?  Like you said, it can be evaluated
> later if things change..

You described ->fault() as "a hole in our boat"?  I'm astonished and do not
know what to say to this.

There was a great comment saying one may want to make Linux a unikernel.
I thought it was a good one, but only as a joke. Is it not?

> 
> My thoughts were around an idea that we only really need to do a limited
> number of operations on that pointer you are returning.  Those
> operations may share code, and could be internal to mm.  I don't see
> this as (a), (b), or (c), but maybe an addition to (b)?  Maybe we need
> more ops to cover the uses?

That's exactly what this proposal is about, isn't it?  The userfaultfd minor
fault shares almost all the code except the one hook that fetches a folio
from the memory type's page cache.

"could be internal to mm" is (c) at least.  No one can do what you
mentioned without moving guest-memfd into mm/ first.

Nikita and I drafted these changes, so we likely have a better idea of what
is happening.

Would you perhaps implement your idea, if that's better?  Either you're
right and we're happy to use it, or you'll find what you're missing.

It's not required, but now I think it might be a good idea, given you are so
strongly pushing back on this userfaultfd feature.  I hope it's fair that we
request a better solution from you when you have rejected almost every
solution.

> 
> So, I think I do want the vm_uffd_ops idea, but not as it is written
> right now.
> 
> Thanks,
> Liam
> 
> [1]. https://lore.kernel.org/all/e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh/
> 

-- 
Peter Xu


* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19 14:16                                               ` Peter Xu
@ 2025-09-19 14:34                                                 ` Lorenzo Stoakes
  2025-09-19 15:12                                                   ` Peter Xu
  2025-09-19 19:38                                                 ` Liam R. Howlett
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenzo Stoakes @ 2025-09-19 14:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, David Hildenbrand, Nikita Kalyazin,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Sep 19, 2025 at 10:16:57AM -0400, Peter Xu wrote:
>
> You described ->fault() as "a hole in our boat"?  I'm astonished and do not
> know what to say to this.
>
> There was a great comment saying one may want to make Linux a unikernel.
> I thought it was a good one, but only as a joke. Is it not?
>
> >
> > My thoughts were around an idea that we only really need to do a limited
> > number of operations on that pointer you are returning.  Those
> > operations may share code, and could be internal to mm.  I don't see
> > this as (a), (b), or (c), but maybe an addition to (b)?  Maybe we need
> > more ops to cover the uses?
>
> That's exactly what this proposal is about, isn't it?  The userfaultfd minor
> fault shares almost all the code except the one hook that fetches a folio
> from the memory type's page cache.
>
> "could be internal to mm" is (c) at least.  No one can do what you
> mentioned without moving guest-memfd into mm/ first.
>
> Nikita and I drafted these changes, so we likely have a better idea of what
> is happening.
>
> Would you perhaps implement your idea, if that's better?  Either you're
> right and we're happy to use it, or you'll find what you're missing.
>
> It's not required, but now I think it might be a good idea, given you are so
> strongly pushing back on this userfaultfd feature.  I hope it's fair that we
> request a better solution from you when you have rejected almost every
> solution.

Peter -

I've been staying out of this discussion as I'm about to go to Kernel
Recipes and then off on a (well-needed!) holiday, and I simply lack the
bandwidth right now.

But I think we should all calm down a little here :)

Liam and I (more so Liam recently for above reasons) have pushed back
because we have both personally experienced the consequences of giving
drivers too much flexibility wrt core mm functionality.

This is the sole reason we have done so.

We are both eager to find a way forward that is constructive and works well
for everybody involved. We WANT this series to land.

So I think perhaps we should take a step back and identify clearly what the
issues are and how we might best address them.

I spoke to Mike off-list, who suggested perhaps things aren't quite as
egregious as they seem with uffd_get_folio(), so perhaps this is a means of
moving forward.

But I think in broad terms - let's identify what the sensible options are,
and then drill down into whichever one we agree is best to move forwards
with.

Again, apologies for not being able to be more involved here,
workload/other engagements dictate that I am unable to be.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19 14:34                                                 ` Lorenzo Stoakes
@ 2025-09-19 15:12                                                   ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-19 15:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, David Hildenbrand, Nikita Kalyazin,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Sep 19, 2025 at 03:34:39PM +0100, Lorenzo Stoakes wrote:
> Peter -
> 
> I've been staying out of this discussion as I'm about to go to Kernel
> Recipes and then off on a (well-needed!) holiday, and I simply lack the
> bandwidth right now.
> 
> But I think we should all calm down a little here :)
> 
> Liam and I (more so Liam recently for above reasons) have pushed back
> because we have both personally experienced the consequences of giving
> drivers too much flexibility wrt core mm functionality.
> 
> This is the sole reason we have done so.
> 
> We are both eager to find a way forward that is constructive and works well
> for everybody involved. We WANT this series to land.
> 
> So I think perhaps we should take a step back and identify clearly what the
> issues are and how we might best address them.
> 
> I spoke to Mike off-list, who suggested perhaps things aren't quite as
> egregious as they seem with uffd_get_folio(), so perhaps this is a means of
> moving forward.
> 
> But I think in broad terms - let's identify what the sensible options are,
> and then drill down into whichever one we agree is best to move forwards
> with.
> 
> Again, apologies for not being able to be more involved here,
> workload/other engagements dictate that I am unable to be.

That's totally fine, Lorenzo.  I appreciate your help on figuring things
out.

I do agree the discussion actually went nowhere.

I think the "issue" so far is very much clear: exporting uffd_get_folio(),
as you correctly pointed out, and I'm glad you discussed it with Mike.

My point is that the hook is totally fine, and we need it exactly because we
want to keep the ->fault() semantics clean.

Just to mention, if this series cannot land, I prefer landing Nikita's very
old version (a).  That'll make the mm fault() path ugly, as I pointed out,
but if all the people prefer that and are willing to sign off on it, I'm
OK from a userfaultfd perspective.  I don't make the judgement there.

Then this series can drop uffd_get_folio() and keep the rest in one way or
another, describing memory type attributes only, and only needs a driver to
cooperate with a ->fault() that works for the new flag.  But then this
series will be a pure cleanup.  I'll likely put this series aside once it
stops blocking things, as I also have a queue of my own to flush elsewhere.

I wish we could just go with this series with uffd_get_folio() only.  Feel
free to discuss with more people, and let me know how this series should
move on.

Thanks a lot,

-- 
Peter Xu


* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-16 19:55                     ` Peter Xu
@ 2025-09-19 17:22                       ` Liam R. Howlett
  2025-09-22 16:38                         ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-19 17:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Mike Rapoport, Nikita Kalyazin,
	Lorenzo Stoakes, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250916 15:56]:

...

First, thanks for replying.  Sorry I missed it.

> > 
> > Can you explore moving hugetlb_mfill_atomic_pte and
> > shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> > use the same code?
> > 
> > That is, split the similar blocks into functions and reduce duplication.
> > 
> > These are under the UFFD config option and are pretty similar.  This
> > will also limit the bleeding of mfill_atomic_mode out of uffd.
> 
> These are different problems to solve.
> 
> I did propose a refactoring of hugetlbfs in general years ago.  I didn't
> finish it.  Not only because it's a huge amount of work that almost nobody
> likes to review (even if everybody likes to see it land), but also because,
> since we decided not to add more complexity like HGM into the hugetlbfs
> code, there's less point in refactoring.
> 
> I hope we can be on the same page to understand the problem we want to
> solve here.  The problem is we want to support userfaultfd on guest-memfd.
> Your suggestion could be helpful, but IMHO it's orthogonal to the current
> problem.  It's also not a dependency.  If it were, you would see code
> duplication introduced in this series that a later cleanup could remove.
> That's not the case.
>
> This series simply resolves a totally different problem.  The order of
> solving this problem versus cleaning things up elsewhere (or deciding not
> to do it at all) is not fixed, IMHO.

I find it very difficult to see what abstraction of functions is needed
with the code in the current state.  I feel like we are exposing an
interface before we can see what we need.

> 
> > 
> > 
> > 
> > If you look at what the code does in userfaultfd.c, you can see that
> > some refactoring is necessary for other reasons:
> > 
> > mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> > (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> > call shmem_mfill_atomic_pte() in mm/shmem.c
> > 
> > mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> > loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c
> 
> Again, IMHO this still falls into the hugetlbfs refactoring category.  I
> would sincerely request that we put that aside as a separate topic to
> discuss, because it's huge and community resources are limited for both
> development and review.
> 
> Shmem is already almost sharing code with anonymous, after all these two
> memory types are the only ones that support userfaultfd besides hugetlbfs.
> 
> > 
> > The shmem call already depends on the vma flags.. which it still does in
> > your patch 4 here.  So you've replaced this:
> > 
> > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > ...
> > } else {
> >         shmem_mfill_atomic_pte()
> > }
> > 
> > With...
> > 
> > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > ...
> > } else {
> > ...
> >         uffd_ops->uffd_copy()
> > }
> 
> I'm not 100% sure I get this comment.  It's intentional to depend on vma
> flags here for shmem.  See Andrea's commit:
> 
> commit 5b51072e97d587186c2f5390c8c9c1fb7e179505
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Fri Nov 30 14:09:28 2018 -0800
> 
>     userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
> 
> Are you suggesting we should merge it somehow?  I'd appreciate some
> concrete example here if so.

What I am saying is that the containing function, mfill_atomic_pte(),
should be absorbed into each memory type's ->uffd_copy(), at least
partially.

Shouldn't each memory type do all the necessary checks in ->uffd_copy()?

To put it another way, no shared memory vma will enter the if()
statement, so why are we checking if they need to?

So if the default uffd_copy() does the if (!shared) stuff, you can just
call uffd_ops->uffd_copy() without any check there, right?

> 
> > 
> > So, really, what needs to happen first is userfaultfd needs to be
> > sorted.
> > 
> > There's no point in creating a vm_uffd_ops if it will just serve to
> > replace the call locations of the functions like this, as it has done
> > nothing to simplify the logic.
> 
> I had a vague feeling that you may not have understood the goal of this
> series.  I thought we were on the same page there.  It could be partly my
> fault if that's not obvious; I can try to be more explicit in the cover
> letter if I repost.

I certainly feel like I've lost the overall goal during the
conversations sometimes, thanks for pointing that out.

I keep thinking this is about memory type modularization, I think the
cover letter subject keeps getting in my way.

> 
> Please consider supporting guest-memfd as the goal; vm_uffd_ops is there
> to allow all changes to happen in guest-memfd.c.  Without it, that can't
> happen.  That's the whole point so far.
> 

...

> 
> While I was absent, Nikita sent a version that is based on the library
> code.  That uses direct function calls:
> 
> https://lore.kernel.org/all/20250915161815.40729-3-kalyazin@amazon.com/
> 
> I think it's great we have the other sample of implementing this.
> 
> I believe Nikita has no strong opinion on how this will land in the end,
> whatever way the community prefers.  I prefer the solution I sent.
> 
> Do you have a preference, when explicitly there's the other one to compare?

I like this better.

> 
> > 
> > > Said that, I still prefer this against
> > > hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.
> > 
> > It also does nothing with regards to the CONFIG_USERFAULTFD in other
> > areas.  My hope is that you could pull out the common code and make the
> > CONFIG_ sections smaller.
> > 
> > And, by the way, it isn't the fact that we're going to copy something
> > (or mfill_atomic it, I guess?) but the fact that we're handing out the
> > pointer for something else to do that.  Please handle this manipulation
> > in the core mm code without handing out pointers to folios, or page
> > tables.
> 
> Is "return pointers to folios" also an issue now in the API?  Are you
> NACKing uffd_get_folio() API that this series proposed?

No, I don't think I see the problem with what I'm proposing.

I'd rather find a middle ground.  I was thinking it best to have the
part needing the folio pointer live in mm while the rest remains in the
guestmem implementation.

The other thread seems to imply that it would all have to be pulled in,
and that seems undesirable as well.

It is just that we have had this sort of interface before and it has
gone poorly for us.  I know there are other ways to do what you are doing
here with regard to ->fault(), but this is the most flexible and the
hardest to validate.  If we can avoid exposing it in this way, I'd rather
do that.

> 
> > 
> > You could do this with the address being passed in and figure out the
> > type, or even a function pointer that you specifically pass in an enum
> > of the type (I think this is what Lorenzo was suggesting somewhere),
> > maybe with multiple flags for actions and fallback (MFILL|ZERO for
> > example).
> 
> I didn't quickly get it.  I appreciate if there's some more elaborations.

What I was thinking was the same sort of thing that happens on the
ioctl today, but with a memory type, and only for the callback that
needs the folio.

But again, I'm sure I'm missing something and I'm sorry to drag this
out..

And sorry if I upset you, I feel like we're talking past each other and
causing a lot of stress.

Thanks,
Liam



* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19 14:16                                               ` Peter Xu
  2025-09-19 14:34                                                 ` Lorenzo Stoakes
@ 2025-09-19 19:38                                                 ` Liam R. Howlett
  2025-09-22 16:33                                                   ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-09-19 19:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Lorenzo Stoakes, Nikita Kalyazin,
	Mike Rapoport, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250919 10:17]:

...

> > > > Can you please point me to which patch has clean up?
> > > 
> > > Patch 4.  If you want me to explain every change I touched that is a
> > > cleanup, I can go into details.  Maybe it's faster if you read them, it's
> > > not a huge patch.
> > 
> > I responded here [1].  I actually put a lot of effort into that response
> > and took a lot of time to dig into some of this to figure out if it was
> > possible, and suggested some ideas.
> > 
> > That was back in July, so the details aren't that fresh anymore.  Maybe
> > you missed my reply?
> 
> AFAICT it was the other way round.  My reply is here:
> 
> https://lore.kernel.org/all/aMnAscxj_h42wOAC@x1.local/

Thanks, yes.  I missed your reply.

> > > 
> > > If we accepted ->fault() for all these years, I don't see a reason we
> > > should reject ->uffd_get_folio(), especially one of the goals is to keep
> > > the core mm path clean, per my comment to proposal (a).
> > 
> > I see this argument as saying "there's a hole in our boat so why can't I
> > make another?"  It's not the direction we have to go to get what we need
> > right now, so why are we doing it?  Like you said, it can be evaluated
> > later if things change..
> 
> You described ->fault() as "a hole in our boat"?  I'm astonished and do not
> know what to say to this.
> 
> There was a great comment saying one may want to make Linux a unikernel.
> I thought it was a good one, but only as a joke. Is it not?

Well it's leaking the internals, which is what we don't want to do.
Certainly it is useful and does what is needed.

> 
> > 
> > My thoughts were around an idea that we only really need to do a limited
> > number of operations on that pointer you are returning.  Those
> > operations may share code, and could be internal to mm.  I don't see
> > this as (a), (b), or (c), but maybe an addition to (b)?  Maybe we need
> > more ops to cover the uses?
> 
> That's exactly what this proposal is about, isn't it?  The userfaultfd minor
> fault shares almost all the code except the one hook that fetches a folio
> from the memory type's page cache.
> 
> "could be internal to mm" is (c) at least.  No one can do what you
> mentioned without moving guest-memfd into mm/ first.
> 
> Nikita and I drafted these changes, so we likely have a better idea of what
> is happening.
> 
> Would you perhaps implement your idea, if that's better?  Either you're
> right and we're happy to use it, or you'll find what you're missing.
> 

I spoke to Mike on this and I understand what I was missing.  I'm fine
with the folio part as you have it.

Apologies for holding this up and the added stress on your side.

Thanks,
Liam


* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19 19:38                                                 ` Liam R. Howlett
@ 2025-09-22 16:33                                                   ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-22 16:33 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
	Nikita Kalyazin, Mike Rapoport, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Fri, Sep 19, 2025 at 03:38:51PM -0400, Liam R. Howlett wrote:
> I spoke to Mike on this and I understand what I was missing.  I'm fine
> with the folio part as you have it.
> 
> Apologies for holding this up and the added stress on your side.

Thank you for changing your mind.

Mike, I definitely appreciate your help along the way since the start.

I'll wait for 2-3 more days to see whether there's any further objections
from anyone, or I'll repost.

-- 
Peter Xu


* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-19 17:22                       ` Liam R. Howlett
@ 2025-09-22 16:38                         ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-22 16:38 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, Mike Rapoport,
	Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Michal Hocko, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Fri, Sep 19, 2025 at 01:22:01PM -0400, Liam R. Howlett wrote:
> > > The shmem call already depends on the vma flags.. which it still does in
> > > your patch 4 here.  So you've replaced this:
> > > 
> > > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > > ...
> > > } else {
> > >         shmem_mfill_atomic_pte()
> > > }
> > > 
> > > With...
> > > 
> > > if (!(dst_vma->vm_flags & VM_SHARED)) {
> > > ...
> > > } else {
> > > ...
> > >         uffd_ops->uffd_copy()
> > > }
> > 
> > I'm not 100% sure I get this comment.  It's intentional to depend on vma
> > flags here for shmem.  See Andrea's commit:
> > 
> > commit 5b51072e97d587186c2f5390c8c9c1fb7e179505
> > Author: Andrea Arcangeli <aarcange@redhat.com>
> > Date:   Fri Nov 30 14:09:28 2018 -0800
> > 
> >     userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
> > 
> > Are you suggesting we should merge it somehow?  I'd appreciate some
> > concrete example here if so.
> 
> What I am saying is that the containing function, mfill_atomic_pte(),
> should be absorbed into each memory type's ->uffd_copy(), at least
> partially.
> 
> Shouldn't each memory type do all the necessary checks in ->uffd_copy()?
> 
> To put it another way, no shared memory vma will enter the if()
> statement, so why are we checking if they need to?
> 
> So if the default uffd_copy() does the if (!shared) stuff, you can just
> call uffd_ops->uffd_copy() with out any check there, right?

IIUC this is the only piece of the technical discussion that was not
addressed, so I'm replying explicitly to this one.  Please let me know if I
missed something else.

Here we need to keep the MAP_PRIVATE handling separate, as common code for
UFFDIO_COPY / UFFDIO_ZEROPAGE.  It does not only apply to shmem; it was
about shmem only when Andrea fixed this case simply because hugetlbfs was
already working fine via its separate path.

In general, shmem is a file system, and when it's mapped MAP_PRIVATE (even
though it's called shmem) we should bypass the page cache, hence uffd_copy()
shouldn't be used.  The same should apply to other file systems when
UFFDIO_COPY/UFFDIO_ZEROPAGE are implemented via ->uffd_copy().

However since we're not going to export ->uffd_copy() in the new version,
it's also out of the picture.

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-18 18:20                                       ` Peter Xu
  2025-09-18 19:43                                         ` Liam R. Howlett
@ 2025-09-22 17:20                                         ` David Hildenbrand
  2025-09-22 18:03                                           ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-09-22 17:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Lorenzo Stoakes, Nikita Kalyazin, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

On 18.09.25 20:20, Peter Xu wrote:
> On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
>> Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
>> might actually be pretty nice.
> 
> I commented on that.
> 
> https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
> 
> That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't
> necessary, make it extremely hard to know when to set the flag, and
> complicate the fault path, which also isn't necessary.

I agree that FAULT_FLAG_USERFAULT_CONTINUE would be a very weird thing 
to have.

I was wondering whether it could be abstracted in a cleaner way, similar 
to what you described with the "NO_USERFAULT", but possibly taking it 
one step further (if possible ...).

In your reply you also mentioned "whether we can also avoid reusing 
fault() but instead resolve the page faults using the vm_ops hook too".

So it kind-of is a special type of page fault, but the question would be 
how that could be integrated more cleanly. And as you also point out, 
the question would be which other users it might really have.

In GUP we achieve not triggering userfaultfd by not setting 
FAULT_FLAG_ALLOW_RETRY.

But it's rather that not allowing to retry (drop+retake locks) makes it 
impossible to call into userfaultfd.

So not sure if abstracting/reusing that would make sense.

> 
> I think Mike's comment was spot on, that the new API is literally
> do_fault() for shmem, but only used in userfaultfd context so it's even an
> oneliner.

Right, special fault handling for userfaultfd.

> 
> I do not maintain mm, so above is only my two cents, so I don't make
> decisions.  Personally I still prefer the current approach of keep the mm
> main fault path clean.

Well, as clean as things like FAULT_FLAG_ALLOW_RETRY are. :)

In any case, reading Liam's reply, it sounded like there is a path 
forward, so I don't particularly care; I just wanted to mention what I had 
in mind regarding FAULT_FLAG_ALLOW_RETRY.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-09-22 17:20                                         ` David Hildenbrand
@ 2025-09-22 18:03                                           ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-22 18:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Nikita Kalyazin, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Michal Hocko, Andrea Arcangeli, Oscar Salvador, Axel Rasmussen,
	Ujwal Kundur

On Mon, Sep 22, 2025 at 07:20:50PM +0200, David Hildenbrand wrote:
> On 18.09.25 20:20, Peter Xu wrote:
> > On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > > might actually be pretty nice.
> > 
> > I commented on that.
> > 
> > https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
> > 
> > That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, which isn't necessary,
> > make it extremely hard to know when to set the flag, and complicate the
> > fault path unnecessarily.
> 
> I agree that FAULT_FLAG_USERFAULT_CONTINUE would be a very weird thing to
> have.
> 
> I was wondering whether it could be abstracted in a cleaner way, similar to
> what you described with the "NO_USERFAULT", but possibly taking it one step
> further (if possible ...).
> 
> In your reply you also mentioned "whether we can also avoid reusing fault()
> but instead resolve the page faults using the vm_ops hook too".
> 
> So it kind-of is a special type of page fault, but the question would be how
> that could be integrated more cleanly. And as you also point out, the
> question would be which other users it might really have.
> 
> In GUP we achieve not triggering userfaultfd by not setting
> FAULT_FLAG_ALLOW_RETRY.
> 
> But it's rather that not allowing to retry (drop+retake locks) makes it
> impossible to call into userfaultfd.

Right.

> 
> So not sure if abstracting/reusing that would make sense.

Direct reuse is unlikely.  The current semantics of FAULT_FLAG_ALLOW_RETRY
are very specific; meanwhile, handle_userfault() already has code to warn
when it's not set..

In this context, jumping into handle_userfault() is already wrong, because
this is a request to fetch the page cache without being trappable.
So if there were to be a new flag, it would need to bypass handle_userfault().

But then, if we need a new flag anyway.. IMHO it'll still be cleaner with
uffd_get_folio().  It then sits together with the uffd description of the
memory type when one wants to opt in to MINOR faults, and it is explicitly
invoked only in userfaultfd minor fault resolutions (hence, no chance of
breaking any form of fault() either..).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-09-22 18:03 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20250627154655.2085903-2-peterx@redhat.com>
     [not found] ` <aaaca9d4-b8df-45b8-a3a4-a431c99f26c7@lucifer.local>
     [not found]   ` <CAJuCfpHN6vpDx+UNPEzJgZ_qD9USTJZ_+yZzQg2BpF_aRpufYw@mail.gmail.com>
     [not found]     ` <982f4f94-f0bf-45dd-9003-081b76e57027@lucifer.local>
     [not found]       ` <cda7c46b-c474-48f4-b703-e2f988470f3b@amazon.com>
     [not found]         ` <aGVu1Isy-R9RszxW@kernel.org>
     [not found]           ` <aGWMsfbayEco0j4R@x1.local>
     [not found]             ` <aGbCbW7hUf3a2do2@kernel.org>
     [not found]               ` <289eede1-d47d-49a2-b9b6-ff8050d84893@redhat.com>
     [not found]                 ` <aGfsaIIzHWfjcNFd@x1.local>
     [not found]                   ` <e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh>
2025-09-01 16:01                     ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Nikita Kalyazin
2025-09-08 16:53                       ` Liam R. Howlett
2025-09-16 20:05                         ` Peter Xu
2025-09-17 15:29                           ` Liam R. Howlett
2025-09-17  9:25                         ` Mike Rapoport
2025-09-17 16:53                           ` Liam R. Howlett
2025-09-18  8:37                             ` Mike Rapoport
2025-09-18 16:47                               ` Liam R. Howlett
2025-09-18 17:15                                 ` Nikita Kalyazin
2025-09-18 17:45                                   ` Lorenzo Stoakes
2025-09-18 17:53                                     ` David Hildenbrand
2025-09-18 18:20                                       ` Peter Xu
2025-09-18 19:43                                         ` Liam R. Howlett
2025-09-18 21:07                                           ` Peter Xu
2025-09-19  1:50                                             ` Liam R. Howlett
2025-09-19 14:16                                               ` Peter Xu
2025-09-19 14:34                                                 ` Lorenzo Stoakes
2025-09-19 15:12                                                   ` Peter Xu
2025-09-19 19:38                                                 ` Liam R. Howlett
2025-09-22 16:33                                                   ` Peter Xu
2025-09-22 17:20                                         ` David Hildenbrand
2025-09-22 18:03                                           ` Peter Xu
2025-09-18 17:54                                   ` Liam R. Howlett
2025-09-18 18:05                                 ` Mike Rapoport
2025-09-18 18:32                                   ` Liam R. Howlett
2025-09-18 19:32                                     ` Peter Xu
2025-09-19  9:05                                     ` Mike Rapoport
2025-09-16 19:55                     ` Peter Xu
2025-09-19 17:22                       ` Liam R. Howlett
2025-09-22 16:38                         ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox