All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Peter Collingbourne <pcc@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Kostya Kortchinsky <kostyak@google.com>,
	Evgenii Stepanov <eugenis@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [PATCH v4] mm: improve mprotect(R|W) efficiency on pages referenced once
Date: Mon, 31 May 2021 18:00:18 -0400	[thread overview]
Message-ID: <YLVcchsuXKV9oG4W@t490s> (raw)
In-Reply-To: <YLDi2WbVKjKnydDw@t490s>

On Fri, May 28, 2021 at 08:32:25AM -0400, Peter Xu wrote:
> On Thu, May 27, 2021 at 06:35:42PM -0700, Peter Collingbourne wrote:
> > On Thu, May 27, 2021 at 6:21 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, May 27, 2021 at 04:37:11PM -0700, Peter Collingbourne wrote:
> > > > On Thu, May 27, 2021 at 2:41 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Peter,
> > > > >
> > > > > On Thu, May 27, 2021 at 12:04:53PM -0700, Peter Collingbourne wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> > > > > > +                               unsigned long cp_flags)
> > > > > > +{
> > > > > > +     if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
> > > > > > +             if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
> > > > > > +                     return false;
> > > > > > +
> > > > > > +             if (page_count(pte_page(pte)) != 1)
> > > > > > +                     return false;
> > > > > > +     }
> > > > >
> > > > > Can we make MM_CP_DIRTY_ACCT still in charge?  IIUC that won't affect your use
> > > > > case, something like:
> > > > >
> > > > >        /* Never apply trick if MM_CP_DIRTY_ACCT not set */
> > > > >        if (!(cp_flags & MM_CP_DIRTY_ACCT))
> > > > >            return false;
> > > > >
> > > > > The thing is that's really what MM_CP_DIRTY_ACCT is about, imho (as its name
> > > > > shows).  Say, we should start to count on the dirty bit for applying the write
> > > > > bit only if that flag set.  With above, I think we can drop the pte_uffd_wp()
> > > > > check below because uffd_wp never applies MM_CP_DIRTY_ACCT when do
> > > > > change_protection().
> > > >
> > > > I don't think that would work. The anonymous pages that we're
> > > > interesting in optimizing are private writable pages, for which
> > > > vma_wants_writenotify(vma, vma->vm_page_prot) would return false (and
> > > > thus MM_CP_DIRTY_ACCT would not be set, and thus your code would
> > > > disable the optimization), because of this code at the top of
> > > > vma_wants_writenotify:
> > > >
> > > >         /* If it was private or non-writable, the write bit is already clear */
> > > >         if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> > > >                 return 0;
> > > >
> > > > IIUC, dirty accountable is about whether we can always apply the
> > > > optimization no matter what the ref count is, so it isn't suitable for
> > > > situations where we need to check the ref count.
> > >
> > > Ah I see.. Though it still looks weird e.g. the first check could have been
> > > done before calling change_protection()?
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 96f4df023439..9270140afbbd 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -548,7 +548,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> > >          * held in write mode.
> > >          */
> > >         vma->vm_flags = newflags;
> > > -       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> > > +       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot) ||
> > > +           (vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE));
> > >         vma_set_page_prot(vma);
> > >
> > >         change_protection(vma, start, end, vma->vm_page_prot,
> > >
> > > Would something like this make the check even faster?
> > 
> > That still doesn't seem like it would work either. I think we need
> > three kinds of behavior (glossing over a bunch of details):
> > 
> > - always make RW for certain shared pages (this is the original dirty
> > accountable behavior)
> > - don't make RW except for page_count==1 for certain private pages
> > - don't optimize at all in other cases
> > 
> > A single bit isn't enough to cover all of these possibilities.
> 
> Yes I guess you're right.
> 
> > 
> > > Meanwhile when I'm looking at the rest I found I cannot understand the other
> > > check in this patch regarding soft dirty:
> > >
> > > +       if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> > > +               return false;
> > >
> > > I'm wondering why it's not:
> > >
> > > +       if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))
> > > +               return false;
> > >
> > > Then I look back and it's indeed what it does before, starting from commit
> > > 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after
> > > VM_SOFTDIRTY cleared", 2014-10-14):
> > >
> > >         if (dirty_accountable && pte_dirty(ptent) &&
> > >                 (pte_soft_dirty(ptent) ||
> > >                 !(vma->vm_flags & VM_SOFTDIRTY)))
> > >
> > > However I don't get it... Shouldn't this be "if soft dirty set, or soft dirty
> > > tracking not enabled, then we can grant the write bit"?  The thing is afaiu
> > > VM_SOFTDIRTY works in the reversed way that soft dirty enabled only if it's
> > > cleared.  Hmm... Am I the only one thinks it wrong?
> > 
> > No strong opinions here, I'm just preserving the original logic.
> 
> Yeah no reason to block your patch.  I'll think about it.
> 
> It's just that the 1st !MM_CP_DIRTY_ACCT check looks very hard to follow,
> however indeed I don't have any better idea to rewrite it.
> 
> Then the patch looks good to me:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Sorry I may need to retrieve the ACK here...

TL;DR: we'd need to skip the optimization for safety when MM_CP_PROT_NUMA set

The thing is that there's report showing that numa could have an issue with the
other patch in Andrea's COR branch (which is a nicer one covering huge pages)
if without prot_numa check there:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=7471fd37fc2b8780fb4a3ab9a5ee9d87752e246a

Now it's still not clear whether the same issue could happen with this patch as
it's using page_count() rather than page_mapcount() due to the different
handling of the two branches, however to be safe and before we figure that
thing out, imho it's better this patch also skips NUMA path completely.

-- 
Peter Xu



  reply	other threads:[~2021-05-31 22:00 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-27 19:04 [PATCH v4] mm: improve mprotect(R|W) efficiency on pages referenced once Peter Collingbourne
2021-05-27 21:41 ` Peter Xu
2021-05-27 23:37   ` Peter Collingbourne
2021-05-28  1:21     ` Peter Xu
2021-05-28  1:35       ` Peter Collingbourne
2021-05-28 12:32         ` Peter Xu
2021-05-31 22:00           ` Peter Xu [this message]
2021-06-01  0:44 ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YLVcchsuXKV9oG4W@t490s \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=eugenis@google.com \
    --cc=kostyak@google.com \
    --cc=linux-mm@kvack.org \
    --cc=pcc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.