Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Nadav Amit <nadav.amit@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Joerg Roedel <jroedel@suse.de>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	David Woodhouse <dwmw2@infradead.org>,
	Alistair Popple <alistair@popple.id.au>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org,
	linux-next@vger.kernel.org
Subject: Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
Date: Tue, 3 Oct 2017 20:15:59 -0400	[thread overview]
Message-ID: <20171004001559.GD20644@redhat.com> (raw)
In-Reply-To: <20171003234215.GA5231@redhat.com>

On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write to
> > +a page that might now be use by some completely different task.
> 
> used
> 
> > +is true ven if the thread doing the page table update is preempted right after
> 
> even
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		goto out_free_pages;
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +	/*
> > +	 * Leave pmd empty until pte is filled note we must notify here as
> > +	 * concurrent CPU thread might write to new page before the call to
> > +	 * mmu_notifier_invalidate_range_end() happen which can lead to a
> 
> happens
> 
> > +	 * device seeing memory write in different order than CPU.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -	/* leave pmd empty until pte is filled */
> >  
> 
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
> 
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
> 
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
> 
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.

Yes you are right, good idea, i will respin with that too (and with the
various typo you noted thank you for that). I can do 2 patch or 1, i don't
mind either way. I will probably do 2 as first and they can be folded into
1 if people prefer just one.


> 
> > +				 * No need to notify as we downgrading page
> 
> are
> 
> > +				 * table protection not changing it to point
> > +				 * to a new page.
> > +	 			 *
> 
> > +		 * No need to notify as we downgrading page table to read only
> 
> are
> 
> > +	 * No need to notify as we replacing a read only page with another
> 
> are
> 
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -		} else
> > +		} else {
> > +			/*
> > +			 * We should not need to notify here as we reach this
> > +			 * case only from freeze_page() itself only call from
> > +			 * split_huge_page_to_list() so everything below must
> > +			 * be true:
> > +			 *   - page is not anonymous
> > +			 *   - page is locked
> > +			 *
> > +			 * So as it is a shared page and it is locked, it can
> > +			 * not be remove from the page cache and replace by
> > +			 * a new page before mmu_notifier_invalidate_range_end
> > +			 * so no concurrent thread might update its page table
> > +			 * to point at new page while a device still is using
> > +			 * this page.
> > +			 *
> > +			 * But we can not assume that new user of try_to_unmap
> > +			 * will have that in mind so just to be safe here call
> > +			 * mmu_notifier_invalidate_range()
> > +			 *
> > +			 * See Documentation/vm/mmu_notifier.txt
> > +			 */
> >  			dec_mm_counter(mm, mm_counter_file(page));
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> > +		}
> >  discard:
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * either replacing a present pte with non present one (either
> > +		 * a swap or special one). We handling the clearing pte case
> > +		 * above.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		page_remove_rmap(subpage, PageHuge(page));
> >  		put_page(page);
> > -		mmu_notifier_invalidate_range(mm, address,
> > -					      address + PAGE_SIZE);
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> 
> That is the path that unmaps filebacked pages (btw, not necessarily
> shared unlike comment says, they can be private but still filebacked).

I was more refering to the fact that they are in page cache and thus
given current condition they can not be migrated to a new page in our
back. But yes it can be a private mapping, i will fix the comment.


> I'd like some more explanation about the inner working of "that new
> user" as per comment above.
> 
> It would be enough to drop mmu_notifier_invalidate_range from above
> without adding it to the filebacked case. The above gives higher prio
> to the hypothetical and uncertain future case, than to the current
> real filebacked case that doesn't need ->invalidate_range inside the
> PT lock, or do you see something that might already need such
> ->invalidate_range?

No i don't see any new user today that might need such invalidate but
i was trying to be extra cautious as i have a tendency to assume that
someone might do a patch that use try_to_unmap() without going through
all the comments in the function and thus possibly using it in a an
unexpected way from mmu_notifier callback point of view. I am fine
with putting the burden on new user to get it right and adding an
extra warning in the function description to try to warn people in a
sensible way.


> I certainly like the patch. I expect ->invalidate_range users will
> like the slight higher complexity in order to eliminate unnecessary
> invalidates that are slowing them down unnecessarily. At the same time
> this is zero risk (because a noop) for all other MMU notifier users
> (those that don't share the primary MMU pagetables, like KVM).

I have another patchset to restore the change_pte optimization for kvm
i want to post too. I will need to do some benchmarking first to make
sure it actualy helps even in a small way.

Thank you for looking into this patch, i will repost with your suggestions
soon.

Jérôme

WARNING: multiple messages have this Message-ID (diff)

From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Nadav Amit <nadav.amit@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Joerg Roedel <jroedel@suse.de>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	David Woodhouse <dwmw2@infradead.org>,
	Alistair Popple <alistair@popple.id.au>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org,
	linux-next@vger.kernel.org
Subject: Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
Date: Tue, 3 Oct 2017 20:15:59 -0400	[thread overview]
Message-ID: <20171004001559.GD20644@redhat.com> (raw)
In-Reply-To: <20171003234215.GA5231@redhat.com>

On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write to
> > +a page that might now be use by some completely different task.
> 
> used
> 
> > +is true ven if the thread doing the page table update is preempted right after
> 
> even
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		goto out_free_pages;
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +	/*
> > +	 * Leave pmd empty until pte is filled note we must notify here as
> > +	 * concurrent CPU thread might write to new page before the call to
> > +	 * mmu_notifier_invalidate_range_end() happen which can lead to a
> 
> happens
> 
> > +	 * device seeing memory write in different order than CPU.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -	/* leave pmd empty until pte is filled */
> >  
> 
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
> 
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
> 
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
> 
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.

Yes you are right, good idea, i will respin with that too (and with the
various typo you noted thank you for that). I can do 2 patch or 1, i don't
mind either way. I will probably do 2 as first and they can be folded into
1 if people prefer just one.


> 
> > +				 * No need to notify as we downgrading page
> 
> are
> 
> > +				 * table protection not changing it to point
> > +				 * to a new page.
> > +	 			 *
> 
> > +		 * No need to notify as we downgrading page table to read only
> 
> are
> 
> > +	 * No need to notify as we replacing a read only page with another
> 
> are
> 
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -		} else
> > +		} else {
> > +			/*
> > +			 * We should not need to notify here as we reach this
> > +			 * case only from freeze_page() itself only call from
> > +			 * split_huge_page_to_list() so everything below must
> > +			 * be true:
> > +			 *   - page is not anonymous
> > +			 *   - page is locked
> > +			 *
> > +			 * So as it is a shared page and it is locked, it can
> > +			 * not be remove from the page cache and replace by
> > +			 * a new page before mmu_notifier_invalidate_range_end
> > +			 * so no concurrent thread might update its page table
> > +			 * to point at new page while a device still is using
> > +			 * this page.
> > +			 *
> > +			 * But we can not assume that new user of try_to_unmap
> > +			 * will have that in mind so just to be safe here call
> > +			 * mmu_notifier_invalidate_range()
> > +			 *
> > +			 * See Documentation/vm/mmu_notifier.txt
> > +			 */
> >  			dec_mm_counter(mm, mm_counter_file(page));
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> > +		}
> >  discard:
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * either replacing a present pte with non present one (either
> > +		 * a swap or special one). We handling the clearing pte case
> > +		 * above.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		page_remove_rmap(subpage, PageHuge(page));
> >  		put_page(page);
> > -		mmu_notifier_invalidate_range(mm, address,
> > -					      address + PAGE_SIZE);
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> 
> That is the path that unmaps filebacked pages (btw, not necessarily
> shared unlike comment says, they can be private but still filebacked).

I was more refering to the fact that they are in page cache and thus
given current condition they can not be migrated to a new page in our
back. But yes it can be a private mapping, i will fix the comment.


> I'd like some more explanation about the inner working of "that new
> user" as per comment above.
> 
> It would be enough to drop mmu_notifier_invalidate_range from above
> without adding it to the filebacked case. The above gives higher prio
> to the hypothetical and uncertain future case, than to the current
> real filebacked case that doesn't need ->invalidate_range inside the
> PT lock, or do you see something that might already need such
> ->invalidate_range?

No i don't see any new user today that might need such invalidate but
i was trying to be extra cautious as i have a tendency to assume that
someone might do a patch that use try_to_unmap() without going through
all the comments in the function and thus possibly using it in a an
unexpected way from mmu_notifier callback point of view. I am fine
with putting the burden on new user to get it right and adding an
extra warning in the function description to try to warn people in a
sensible way.


> I certainly like the patch. I expect ->invalidate_range users will
> like the slight higher complexity in order to eliminate unnecessary
> invalidates that are slowing them down unnecessarily. At the same time
> this is zero risk (because a noop) for all other MMU notifier users
> (those that don't share the primary MMU pagetables, like KVM).

I have another patchset to restore the change_pte optimization for kvm
i want to post too. I will need to do some benchmarking first to make
sure it actualy helps even in a small way.

Thank you for looking into this patch, i will repost with your suggestions
soon.

Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2017-10-04  0:15 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-01 17:30 [PATCH] mm/mmu_notifier: avoid double notification when it is useless jglisse
2017-09-01 17:30 ` jglisse
2017-10-03 23:42 ` Andrea Arcangeli
2017-10-03 23:42   ` Andrea Arcangeli
2017-10-04  0:15   ` Jerome Glisse [this message]
2017-10-04  0:15     ` Jerome Glisse
2017-10-04  0:43     ` Nadav Amit
2017-10-04  0:43       ` Nadav Amit
2017-10-04  1:20       ` Jerome Glisse
2017-10-04  1:20         ` Jerome Glisse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171004001559.GD20644@redhat.com \
    --to=jglisse@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alistair@popple.id.au \
    --cc=benh@kernel.crashing.org \
    --cc=dwmw2@infradead.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jroedel@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-next@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=nadav.amit@gmail.com \
    --cc=sfr@canb.auug.org.au \
    --cc=suravee.suthikulpanit@amd.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.