All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: David Hildenbrand <david@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
	brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com,
	hughd@google.com, mhocko@suse.com, axelrasmussen@google.com,
	rppt@kernel.org, willy@infradead.org, Liam.Howlett@oracle.com,
	jannh@google.com, zhangpeng362@huawei.com, bgeffon@google.com,
	kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	kernel-team@android.com
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI
Date: Fri, 13 Oct 2023 13:05:55 -0400	[thread overview]
Message-ID: <ZSl488I/W4mz4gnM@x1n> (raw)
In-Reply-To: <CA+EESO5ESxxricWx2EFneizLGj2Cb5tuM3kbAicc0ggA4Wh2oQ@mail.gmail.com>

On Fri, Oct 13, 2023 at 09:49:10AM -0700, Lokesh Gidra wrote:
> On Fri, Oct 13, 2023 at 9:08 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 13, 2023 at 11:56:31AM +0200, David Hildenbrand wrote:
> > > Hi Peter,
> >
> > Hi, David,
> >
> > >
> > > > I used to have the same thought with David on whether we can simplify the
> > > > design to e.g. limit it to single mm.  Then I found that the trickiest is
> > > > actually patch 1 together with the anon_vma manipulations, and the problem
> > > > is that's not avoidable even if we restrict the api to apply on single mm.
> > > >
> > > > What else we can benefit from single mm?  One less mmap read lock, but
> > > > probably that's all we can get; IIUC we need to keep most of the rest of
> > > > the code, e.g. pgtable walks, double pgtable lockings, etc.
> > >
> > > No existing mechanisms move anon pages between unrelated processes, that
> > > naturally makes me nervous if we're doing it "just because we can".
> >
> > IMHO that's also the potential, when guarded with userfaultfd descriptor
> > being shared between two processes.
> >
> > See below with more comment on the raised concerns.
> >
> > >
> > > >
> > > > Actually, even though I have no solid clue, but I had a feeling that there
> > > > can be some interesting way to leverage this across-mm movement, while
> > > > keeping things all safe (by e.g. elaborately requiring other proc to create
> > > > uffd and deliver to this proc).
> > >
> > > Okay, but no real use cases yet.
> >
> > I can provide a "not solid" example.  I didn't mention it because it's
> > really something that just popped into my mind when thinking cross-mm, so I
> > never discussed with anyone yet nor shared it anywhere.
> >
> > Consider VM live upgrade in a generic form (e.g., no VFIO), we can do that
> > very efficiently with shmem or hugetlbfs, but not yet anonymous.  We can do
> > extremely efficient postcopy live upgrade now with anonymous if with REMAP.
> >
> > Basically I see it a potential way of moving memory efficiently especially
> > with thp.
> >
> > >
> > > >
> > > > Considering Andrea's original version already contains those bits and all
> > > > above, I'd vote that we go ahead with supporting two MMs.
> > >
> > > You can do nasty things with that, as it stands, on the upstream codebase.
> > >
> > > If you pin the page in src_mm and move it to dst_mm, you successfully broke
> > > an invariant that "exclusive" means "no other references from other
> > > processes". That page is marked exclusive but it is, in fact, not exclusive.
> >
> > It is still exclusive to the dst mm?  I see your point, but I think you're
> > taking exclusiveness altogether with pinning, and IMHO that may not be
> > always necessary?
> >
> > >
> > > Once you achieved that, you can easily have src_mm not have MMF_HAS_PINNED,
> >
> > (I suppose you meant dst_mm here)
> >
> > > so you can just COW-share that page. Now you successfully broke the
> > > invariant that COW-shared pages must not be pinned. And you can even trigger
> > > VM_BUG_ONs, like in sanity_check_pinned_pages().
> >
> > Yeah, that's really unfortunate.  But frankly, I don't think it's the fault
> > of this new feature, but the rest.
> >
> > Let's imagine if the MMF_HAS_PINNED wasn't proposed as a per-mm flag, but
> > per-vma, which I don't see why we can't because it's simply a hint so far.
> > Then if we apply the same rule here, UFFDIO_REMAP won't even work for
> > single-mm as long as cross-vma. Then UFFDIO_REMAP as a whole feature will
> > be NACKed simply because of this..
> >
> > And I don't think anyone can guarantee a per-vma MMF_HAS_PINNED can never
> > happen, or any further change to pinning solution that may affect this.  So
> > far it just looks unsafe to remap a pin page to me.
> >
> > I don't have a good suggestion here if this is a risk.. I'd think it risky
> > then to do REMAP over pinned pages no matter cross-mm or single-mm.  It
> > means probably we just rule them out: folio_maybe_dma_pinned() may not even
> > be enough to be safe with fast-gup.  We may need page_needs_cow_for_dma()
> > with proper write_protect_seq no matter cross-mm or single-mm?
> >
> > >
> > > Can it all be fixed? Sure, with more complexity. For something without clear
> > > motivation, I'll have to pass.
> >
> > I think what you raised is a valid concern, but IMHO it's better fixed no
> > matter cross-mm or single-mm.  What do you think?
> >
> > In general, pinning lose its whole point here to me for an userspace either
> > if it DONTNEEDs it or REMAP it.  What would be great to do here is we unpin
> > it upon DONTNEED/REMAP/whatever drops the page, because it loses its
> > coherency anyway, IMHO.
> >
> > >
> > > Once there is real demand, we can revisit it and explore what else we would
> > > have to take care of (I don't know how memcg behaves when moving between
> > > completely unrelated processes, maybe that works as expected, I don't know
> > > and I have no time to spare on reviewing features with no real use cases)
> > > and announce it as a new feature.
> >
> > Good point.  memcg is probably needed..
> >
> > So you reminded me to do a more thorough review against zap/fault paths, I
> > think what's missing are (besides page pinning):
> >
> >   - mem_cgroup_charge()/mem_cgroup_uncharge():
> >
> >     (side note: I think folio_throttle_swaprate() is only for when
> >      allocating new pages, so not needed here)
> >
> >   - check_stable_address_space() (under pgtable lock)
> >
> >   - tlb flush
> >
> >     Hmm???????????????? I can't see anywhere we did tlb flush, batched or
> >     not, either single-mm or cross-mm should need it.  Is this missing?
> >
> IIUC, ptep_clear_flush() flushes tlb entry. So I think we are doing
> unbatched flushing. Possibly a nice performance improvement later on
> would be to try doing it batched. Suren can throw more light on it.

Oh yeah.. thanks.

> 
> One thing I was wondering is don't we need cache flush for the src
> pages? mremap's move_page_tables() does it. IMHO, it's required here
> as well.

I commented in my reply, I also think it's needed.  Otherwise for some
arches I think we can have page containing stall data if not fully flushed
before the movement.  x86 is probably fine, though.

> 
> > >
> > >
> > > Note: that (with only reading the documentation) it also kept me wondering
> > > how the MMs are even implied from
> > >
> > >        struct uffdio_move {
> > >            __u64 dst;    /* Destination of move */
> > >            __u64 src;    /* Source of move */
> > >            __u64 len;    /* Number of bytes to move */
> > >            __u64 mode;   /* Flags controlling behavior of move */
> > >            __s64 move;   /* Number of bytes moved, or negated error */
> > >        };
> > >
> > > That probably has to be documented as well, in which address space dst and
> > > src reside.
> >
> > Agreed, some better documentation will never hurt.  Dst should be in the mm
> > address space that was bound to the userfault descriptor.  Src should be in
> > the current mm address space.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
> 

-- 
Peter Xu


  reply	other threads:[~2023-10-13 17:07 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-09  6:42 [PATCH v3 0/3] userfaultfd move option Suren Baghdasaryan
2023-10-09  6:42 ` [PATCH v3 1/3] mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() Suren Baghdasaryan
2023-10-12 22:01   ` Peter Xu
2023-10-13  8:04     ` David Hildenbrand
2023-10-19 15:19       ` Suren Baghdasaryan
2023-10-09  6:42 ` [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI Suren Baghdasaryan
2023-10-09 14:38   ` David Hildenbrand
2023-10-09 16:21     ` Suren Baghdasaryan
2023-10-09 16:23       ` David Hildenbrand
2023-10-09 16:29         ` Lokesh Gidra
2023-10-09 17:56           ` Lokesh Gidra
2023-10-10  1:49             ` Suren Baghdasaryan
2023-10-12 20:11           ` Peter Xu
2023-10-13  9:56             ` David Hildenbrand
2023-10-13 16:08               ` Peter Xu
2023-10-13 16:49                 ` Lokesh Gidra
2023-10-13 17:05                   ` Peter Xu [this message]
2023-10-16 18:01                 ` David Hildenbrand
2023-10-16 19:01                   ` Peter Xu
2023-10-17 15:55                     ` David Hildenbrand
2023-10-17 18:59                       ` Peter Xu
2023-10-19 15:41                         ` David Hildenbrand
2023-10-19 19:53                           ` Peter Xu
2023-10-19 20:02                             ` Suren Baghdasaryan
2023-10-19 20:43                               ` Peter Xu
2023-10-20 10:02                             ` David Hildenbrand
2023-10-20 14:09                               ` Suren Baghdasaryan
2023-10-20 17:16                                 ` David Hildenbrand
2023-10-22 15:46                                   ` Peter Xu
2023-10-23 12:03                                     ` David Hildenbrand
2023-10-23 16:36                                       ` David Hildenbrand
2023-10-23 17:33                                         ` Suren Baghdasaryan
2023-10-19 21:45                 ` Suren Baghdasaryan
2023-10-12 21:59   ` Peter Xu
2023-10-19 21:24     ` Suren Baghdasaryan
2023-10-22 17:01       ` Peter Xu
2023-10-23 17:43         ` Suren Baghdasaryan
2023-10-23 18:37           ` Peter Xu
2023-10-23 19:01             ` Suren Baghdasaryan
2023-10-17 19:39   ` kernel test robot
2023-10-19 21:55     ` Suren Baghdasaryan
2023-10-23 12:29   ` David Hildenbrand
2023-10-23 15:53     ` David Hildenbrand
2023-10-23 19:00       ` Suren Baghdasaryan
2023-10-23 18:56     ` Suren Baghdasaryan
2023-10-24 14:27       ` David Hildenbrand
2023-10-24 14:36         ` Suren Baghdasaryan
2023-10-09  6:42 ` [PATCH v3 3/3] selftests/mm: add UFFDIO_MOVE ioctl test Suren Baghdasaryan
2023-10-12 22:29   ` Peter Xu
2023-10-19 15:43     ` Suren Baghdasaryan
2023-10-19 17:29       ` Axel Rasmussen
2023-10-19 19:33         ` Peter Xu
  -- strict thread matches above, loose matches on Subject: below --
2023-10-19  0:21 [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZSl488I/W4mz4gnM@x1n \
    --to=peterx@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=bgeffon@google.com \
    --cc=brauner@kernel.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jannh@google.com \
    --cc=jdduke@google.com \
    --cc=kaleshsingh@google.com \
    --cc=kernel-team@android.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=mhocko@suse.com \
    --cc=ngeoffray@google.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=surenb@google.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=zhangpeng362@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.