From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Nadav Amit <nadav.amit@gmail.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Joerg Roedel <jroedel@suse.de>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
David Woodhouse <dwmw2@infradead.org>,
Alistair Popple <alistair@popple.id.au>,
Michael Ellerman <mpe@ellerman.id.au>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Stephen Rothwell <sfr@canb.auug.org.au>,
iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org,
linux-next@vger.kernel.org
Subject: Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
Date: Tue, 3 Oct 2017 20:15:59 -0400 [thread overview]
Message-ID: <20171004001559.GD20644@redhat.com> (raw)
In-Reply-To: <20171003234215.GA5231@redhat.com>
On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
>
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write to
> > +a page that might now be use by some completely different task.
>
> used
>
> > +is true ven if the thread doing the page table update is preempted right after
>
> even
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >
> > + /*
> > + * Leave pmd empty until pte is filled note we must notify here as
> > + * concurrent CPU thread might write to new page before the call to
> > + * mmu_notifier_invalidate_range_end() happen which can lead to a
>
> happens
>
> > + * device seeing memory write in different order than CPU.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > - /* leave pmd empty until pte is filled */
> >
>
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
>
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
>
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
>
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.
Yes you are right, good idea, i will respin with that too (and with the
various typo you noted thank you for that). I can do 2 patch or 1, i don't
mind either way. I will probably do 2 as first and they can be folded into
1 if people prefer just one.
>
> > + * No need to notify as we downgrading page
>
> are
>
> > + * table protection not changing it to point
> > + * to a new page.
> > + *
>
> > + * No need to notify as we downgrading page table to read only
>
> are
>
> > + * No need to notify as we replacing a read only page with another
>
> are
>
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > - } else
> > + } else {
> > + /*
> > + * We should not need to notify here as we reach this
> > + * case only from freeze_page() itself only call from
> > + * split_huge_page_to_list() so everything below must
> > + * be true:
> > + * - page is not anonymous
> > + * - page is locked
> > + *
> > + * So as it is a shared page and it is locked, it can
> > + * not be remove from the page cache and replace by
> > + * a new page before mmu_notifier_invalidate_range_end
> > + * so no concurrent thread might update its page table
> > + * to point at new page while a device still is using
> > + * this page.
> > + *
> > + * But we can not assume that new user of try_to_unmap
> > + * will have that in mind so just to be safe here call
> > + * mmu_notifier_invalidate_range()
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > dec_mm_counter(mm, mm_counter_file(page));
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > + }
> > discard:
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() as we are
> > + * either replacing a present pte with non present one (either
> > + * a swap or special one). We handling the clearing pte case
> > + * above.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > page_remove_rmap(subpage, PageHuge(page));
> > put_page(page);
> > - mmu_notifier_invalidate_range(mm, address,
> > - address + PAGE_SIZE);
> > }
> >
> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>
> That is the path that unmaps filebacked pages (btw, not necessarily
> shared unlike comment says, they can be private but still filebacked).
I was more refering to the fact that they are in page cache and thus
given current condition they can not be migrated to a new page in our
back. But yes it can be a private mapping, i will fix the comment.
> I'd like some more explanation about the inner working of "that new
> user" as per comment above.
>
> It would be enough to drop mmu_notifier_invalidate_range from above
> without adding it to the filebacked case. The above gives higher prio
> to the hypothetical and uncertain future case, than to the current
> real filebacked case that doesn't need ->invalidate_range inside the
> PT lock, or do you see something that might already need such
> ->invalidate_range?
No i don't see any new user today that might need such invalidate but
i was trying to be extra cautious as i have a tendency to assume that
someone might do a patch that use try_to_unmap() without going through
all the comments in the function and thus possibly using it in a an
unexpected way from mmu_notifier callback point of view. I am fine
with putting the burden on new user to get it right and adding an
extra warning in the function description to try to warn people in a
sensible way.
> I certainly like the patch. I expect ->invalidate_range users will
> like the slight higher complexity in order to eliminate unnecessary
> invalidates that are slowing them down unnecessarily. At the same time
> this is zero risk (because a noop) for all other MMU notifier users
> (those that don't share the primary MMU pagetables, like KVM).
I have another patchset to restore the change_pte optimization for kvm
i want to post too. I will need to do some benchmarking first to make
sure it actualy helps even in a small way.
Thank you for looking into this patch, i will repost with your suggestions
soon.
Jérôme
next prev parent reply other threads:[~2017-10-04 0:16 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-09-01 17:30 [PATCH] mm/mmu_notifier: avoid double notification when it is useless jglisse
2017-10-03 23:42 ` Andrea Arcangeli
2017-10-04 0:15 ` Jerome Glisse [this message]
[not found] ` <0D64494B-AB3D-4091-B75A-883EA37BE098@gmail.com>
2017-10-04 1:20 ` Jerome Glisse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171004001559.GD20644@redhat.com \
--to=jglisse@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alistair@popple.id.au \
--cc=benh@kernel.crashing.org \
--cc=dwmw2@infradead.org \
--cc=iommu@lists.linux-foundation.org \
--cc=jroedel@suse.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-next@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mpe@ellerman.id.au \
--cc=nadav.amit@gmail.com \
--cc=sfr@canb.auug.org.au \
--cc=suravee.suthikulpanit@amd.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).