All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Mike Rapoport <rppt@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Vlastimil Babka <vbabka@suse.cz>,
	Muchun Song <muchun.song@linux.dev>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	James Houghton <jthoughton@google.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Nikita Kalyazin <kalyazin@amazon.com>,
	Michal Hocko <mhocko@suse.com>,
	David Hildenbrand <david@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ujwal Kundur <ujwal.kundur@gmail.com>
Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
Date: Thu, 3 Jul 2025 11:45:40 -0400	[thread overview]
Message-ID: <aGalpN8m73zQOW7j@x1.local> (raw)
In-Reply-To: <CAJuCfpH1mtKiphP8ipZeD4CNG9Mr4QERJTQAQm_gtZow5G7AAQ@mail.gmail.com>

On Thu, Jul 03, 2025 at 08:01:12AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 2, 2025 at 1:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> > > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > > > driver here.
> > > > >
> > > > > I'm not sure it's really normal to be handing around page table state and
> > > > > folios etc. to a driver like this, this is really... worrying to me.
> > > > >
> > > > > This feels like you're trying to put mm functionality outside of mm?
> > > >
> > > > To second that, two things stick out for me here:
> > > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > > last patch is quite a complex function which itself calls some IMO
> > > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > > modules to implement such functionality seems like a stretch to me but
> > > > maybe this is for some specialized modules which are written by mm
> > > > experts only?
> > >
> > > Largely shmem_mfill_atomic_pte() differs from anonymous memory version
> > > (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> > > whether it's added to the page cache. So instead of uffd_copy(...) we might
> > > add
> > >
> > >       int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> > >       void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
> >
> > Thanks for digging into this, Mike.  It's just that IMHO it may not be
> > enough..
> >
> > I actually tried to think about a more complicated API, but more I thought
> > of that, more it looked like an overkill.  I can list something here to
> > show why I chose the simplest solution with uffd_copy() as of now.
> 
> TBH below does not sound like an overkill to me for keeping mm parts
> to itself without over-exposing them to modules.
> 
> >
> > Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
> > accounting we need to do before allocations, and with/without allocations.
> 
> Ok, this results in an additional folio_prealloc() hook.
> 
> > That accounting can't be put into folio_alloc() yet even if we'll have one,
> > because we could have the folio allocated when reaching here (that is, when
> > *foliop != NULL).  That was a very delicated decision of us to do shmem
> > accounting separately in 2016:
> >
> > https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/
> >
> > Then, there's also the complexity on how the page cache should be managed
> > for any mem type.  For shmem, folio was only injected right before the
> > pgtable installations.  We can't inject it when folio_alloc() because then
> > others can fault-in without data populated. It means we at least need one
> > more API to do page cache injections for the folio just got allocated from
> > folio_alloc().
> 
> folio_add_to_cache() hook?
> 
> >
> > We also may want to have different treatment on how the folio flags are
> > setup.  It may not always happen in folio_alloc().  E.g. for shmem right
> > now we do this right before the page cache injections:
> >
> >         VM_BUG_ON(folio_test_locked(folio));
> >         VM_BUG_ON(folio_test_swapbacked(folio));
> >         __folio_set_locked(folio);
> >         __folio_set_swapbacked(folio);
> >         __folio_mark_uptodate(folio);
> >
> > We may not want to do exactly the same for all the rest mem types.  E.g. we
> > likely don't want to set swapbacked for guest-memfd folios.  We may need
> > one more API to do it.
> 
> Can we do that inside folio_add_to_cache() hook before doing the injection?

The folio can be freed outside this function by userfaultfd callers, so I
wonder if it'll crash the kernel if mm sees a folio locked while being freed.

> 
> >
> > Then if to think about failure path, when we have the question above on
> > shmem acct issue, we may want to have yet another post_failure hook doing
> > shmem_inode_unacct_blocks() properly for shmem.. maybe we don't need that
> > for guest-memfd, but we still need that for shmem to properly unacct when
> > folio allocation succeeded, but copy_from_user failed somehow.
> 
> Failure handling hook.
> 
> >
> > Then the question is, do we really need all these fuss almost for nothing?
> 
> If that helps to keep modules simpler and mm details contained inside
> the core kernel I think it is worth doing. I imagine if the shmem was
> a module then implementing the current API would require exporting
> functions like mfill_atomic_install_pte(). That seems like
> over-exposure to me. And if we can avoid all the locking and
> refcounting that Liam mentioned that would definitely be worth it
> IMHO.
> 
> >
> > Note that if we want, IIUC we can still change this in the future. The
> > initial isolation like this series would still be good to land earlier; we
> > don't even plan to support MISSING for guest-memfd in the near future, but
> > only MINOR mode for now.  We don't necessarily need to work out MISSING
> > immediately to move on useful features landing Linux.  Even if we'll have
> > it for guest-memfd, it shouldn't be a challenge to maintain both.  This is
> > just not a contract we need to sign forever yet.

[1]

> >
> > Hope above clarifies a bit on why I chose the simplest solution as of
> > now. I also don't like this API, as I mentioned in the cover letter. It's
> > really a trade-off I made at least for now the best I can come up with.
> >
> > Said that, if any of us has better solution, please shoot.  I'm always open
> > to better alternatives.
> 
> I didn't explore this code as deep as you have done and therefore
> might not see all the pitfalls but looks like you already considered
> an alternative which does sound better to me. What are the drawbacks
> that I might be missing?

It'll be a complex API from another angle, that we'll need 5-6 APIs just to
implement uffd missing mode.

Meanwhile it'll also already start to introduce those function jumps into
shmem even if one would be enough to decouple it.

As I mentioned above, I'm also not against doing it, but IMHO that does not
need to be done right now [1], as currently it looks to me guest-memfd may
not really need missing mode traps as of now, and I am actually not sure
whether it will.  We may not want to introduce 5-6 APIs with no further
user.  We can also always discuss a proper API when the demand arrives.

If uffd_copy is so concerning to people, there's also another alternative
which is to make vm_uffd_ops only support traps except MISSING.  uffd_copy
is only about missing traps. Then we can drop uffd_copy API, but the API
itself is then broken on its own.

I still feel like we're over-worried on this.  For OOT drivers I never
cared breaking anything.  For in-tree ones, this discussion can really
happen when there's a need to.  And at that time we can also have a look at
the impl using uffd_copy(), maybe it'll not be that bad either.  It seems
to me we don't necessarily need to figure this out right now.  IMHO we can
do it two-steps in worst case.

Thanks,

-- 
Peter Xu



  reply	other threads:[~2025-07-03 15:45 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
2025-06-29  8:50   ` Mike Rapoport
2025-07-02 19:30     ` Peter Xu
2025-06-30 10:15   ` Lorenzo Stoakes
2025-07-01 17:04     ` Suren Baghdasaryan
2025-07-02 15:40       ` Liam R. Howlett
2025-07-02 15:56       ` Lorenzo Stoakes
2025-07-02 17:08         ` Nikita Kalyazin
2025-07-02 17:39           ` Mike Rapoport
2025-07-02 19:46             ` Peter Xu
2025-07-03 17:48               ` Mike Rapoport
2025-07-04  9:34                 ` David Hildenbrand
2025-07-04 14:59                   ` Peter Xu
2025-07-04 19:39                     ` Liam R. Howlett
2025-09-01 16:01                       ` Nikita Kalyazin
2025-09-08 16:53                         ` Liam R. Howlett
2025-09-16 20:05                           ` Peter Xu
2025-09-17 15:29                             ` Liam R. Howlett
2025-09-17  9:25                           ` Mike Rapoport
2025-09-17 16:53                             ` Liam R. Howlett
2025-09-18  8:37                               ` Mike Rapoport
2025-09-18 16:47                                 ` Liam R. Howlett
2025-09-18 17:15                                   ` Nikita Kalyazin
2025-09-18 17:45                                     ` Lorenzo Stoakes
2025-09-18 17:53                                       ` David Hildenbrand
2025-09-18 18:20                                         ` Peter Xu
2025-09-18 19:43                                           ` Liam R. Howlett
2025-09-18 21:07                                             ` Peter Xu
2025-09-19  1:50                                               ` Liam R. Howlett
2025-09-19 14:16                                                 ` Peter Xu
2025-09-19 14:34                                                   ` Lorenzo Stoakes
2025-09-19 15:12                                                     ` Peter Xu
2025-09-19 19:38                                                   ` Liam R. Howlett
2025-09-22 16:33                                                     ` Peter Xu
2025-09-22 17:20                                           ` David Hildenbrand
2025-09-22 18:03                                             ` Peter Xu
2025-09-18 17:54                                     ` Liam R. Howlett
2025-09-18 18:05                                   ` Mike Rapoport
2025-09-18 18:32                                     ` Liam R. Howlett
2025-09-18 19:32                                       ` Peter Xu
2025-09-19  9:05                                       ` Mike Rapoport
2025-09-16 19:55                       ` Peter Xu
2025-09-19 17:22                         ` Liam R. Howlett
2025-09-22 16:38                           ` Peter Xu
2025-07-02 21:24           ` Liam R. Howlett
2025-07-02 21:36             ` Peter Xu
2025-07-03  2:00               ` Liam R. Howlett
2025-07-03 15:24                 ` Peter Xu
2025-07-03 16:15                   ` Lorenzo Stoakes
2025-07-03 17:39                   ` Liam R. Howlett
2025-07-02 20:24         ` Peter Xu
2025-07-03 16:32           ` Lorenzo Stoakes
2025-07-02 18:16       ` Mike Rapoport
2025-07-02 20:22         ` Peter Xu
2025-07-03 15:01           ` Suren Baghdasaryan
2025-07-03 15:45             ` Peter Xu [this message]
2025-07-03 16:01               ` Lorenzo Stoakes
2025-06-27 15:46 ` [PATCH v2 2/4] mm/shmem: Support " Peter Xu
2025-06-29  8:51   ` Mike Rapoport
2025-06-27 15:46 ` [PATCH v2 3/4] mm/hugetlb: " Peter Xu
2025-06-29  8:52   ` Mike Rapoport
2025-06-27 15:46 ` [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
2025-06-29  8:55   ` Mike Rapoport
2025-07-02 20:38     ` Peter Xu
2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
2025-07-01  0:15   ` Andrew Morton
2025-07-02 20:36   ` Peter Xu
2025-07-03 15:55     ` Lorenzo Stoakes
2025-07-03 16:26       ` Peter Xu
2025-07-03 16:44         ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aGalpN8m73zQOW7j@x1.local \
    --to=peterx@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jthoughton@google.com \
    --cc=kalyazin@amazon.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=ujwal.kundur@gmail.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.