Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Bernhard Held" <berny156@gmx.de>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Takashi Iwai" <tiwai@suse.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Mike Galbraith" <efault@gmx.de>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	axie <axie@amd.com>, "Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
Date: Wed, 30 Aug 2017 13:48:58 -0400	[thread overview]
Message-ID: <20170830174857.GC2386@redhat.com> (raw)
In-Reply-To: <20170830165250.GD13559@redhat.com>

On Wed, Aug 30, 2017 at 06:52:50PM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> > 
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
> 
> I pointed out in earlier email ->invalidate_range can only be
> implemented (as mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that shares the very
> same pagetables with the core linux VM of the primary MMU, and those
> invalidate_range are already called by
> __mmu_notifier_invalidate_range_end. The other bit is done by the MMU
> gather (because mmu_notifier_invalidate_range_start is a noop for
> drivers that implement s->invalidate_range).
> 
> The difference between sharing the same pagetables or not allows for
> ->invalidate_range to work because when the Linux MM changes the
> primary MMU pagetables it also automatically invalidated updates
> secondary MMU at the same time (because of the pagetable sharing
> between primary and secondary MMUs). So then all that is left to do is
> an invalidate_range to flush the secondary MMU TLBs.
> 
> There's no need of action in mmu_notifier_invalidate_range_start for
> those pagetable sharing drivers because there's no risk of a secondary
> MMU shadow pagetable layer to be re-created in between
> mmu_notifier_invalidate_range_start and the actual pagetable
> invalidate because again the pagetables are shared.

Yes but we still need to call invalidate_range() while under the page
table spinlock as this hardware sharing the CPU page table have their
own tlb (not even talking about tlb of each individual devices that
use PASID/ATS).

If we do not call it under spinlock then there is a chance that something
else get mapped between the time we drop the CPU page table and the
time we call mmu_notifier_invalidate_range_end() which itself call
invalidate_range().

Actually i would remove the call to invalidate_range() from range_end()
and audit all places where we call range_end() to assert that there is
a call to invalidate_range() under the page table spinlock that preced
it.

> 
> void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> 				  unsigned long start, unsigned long end)
> {
> 	struct mmu_notifier *mn;
> 	int id;
> 
> 	id = srcu_read_lock(&srcu);
> 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> 		/*
> 		 * Call invalidate_range here too to avoid the need for the
> 		 * subsystem of having to register an invalidate_range_end
> 		 * call-back when there is invalidate_range already. Usually a
> 		 * subsystem registers either invalidate_range_start()/end() or
> 		 * invalidate_range(), so this will be no additional overhead
> 		 * (besides the pointer check).
> 		 */
> 		if (mn->ops->invalidate_range)
> 			mn->ops->invalidate_range(mn, mm, start, end);
> 			^^^^^^^^^^^^^^^^^^^^^^^^^
> 		if (mn->ops->invalidate_range_end)
> 			mn->ops->invalidate_range_end(mn, mm, start, end);
> 	}
> 	srcu_read_unlock(&srcu, id);
> }
> 
> So this conversion from invalidate_page to invalidate_range looks
> superflous and the final mmu_notifier_invalidate_range_end should be
> enough.

See above.

> AFIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
> uses it.

powerpc has something similar too but i don't know its status

> My suggestion is to remove from below all
> mmu_notifier_invalidate_range calls and keep only the
> mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
> intel-svm with it under heavy swapping.
> 
> The only critical constraint to keep for invalidate_range to stay safe
> with a single call of mmu_notifier_invalidate_range_end after put_page
> is that the put_page cannot be the last put_page. That only applies to
> the case where the page isn't freed through MMU gather (MMU gather
> calls mmu_notifier_invalidate_range of its own before freeing the
> page, as opposed mmu gather does nothing for drivers using
> invalidate_range_start/end because invalidate_range_start acted as
> barrier to avoid establishing mappings on the secondary MMUs for
> those).
> 
> Not strictly required but I think it would be safer and more efficient
> to replace the put_page with something like:
> 
> static inline void put_page_not_freeing(struct page *page)
> {
> 	page = compound_head(page);
> 
> 	if (put_page_testzero(page))
> 		VM_WARN_ON_PAGE(1, page);
> }

Yes adding such check make sense.

Thank for looking into this too

Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Bernhard Held" <berny156@gmx.de>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Takashi Iwai" <tiwai@suse.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Mike Galbraith" <efault@gmx.de>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	axie <axie@amd.com>, "Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
Date: Wed, 30 Aug 2017 13:48:58 -0400	[thread overview]
Message-ID: <20170830174857.GC2386@redhat.com> (raw)
In-Reply-To: <20170830165250.GD13559@redhat.com>

On Wed, Aug 30, 2017 at 06:52:50PM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> > 
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
> 
> I pointed out in earlier email ->invalidate_range can only be
> implemented (as mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that shares the very
> same pagetables with the core linux VM of the primary MMU, and those
> invalidate_range are already called by
> __mmu_notifier_invalidate_range_end. The other bit is done by the MMU
> gather (because mmu_notifier_invalidate_range_start is a noop for
> drivers that implement s->invalidate_range).
> 
> The difference between sharing the same pagetables or not allows for
> ->invalidate_range to work because when the Linux MM changes the
> primary MMU pagetables it also automatically invalidated updates
> secondary MMU at the same time (because of the pagetable sharing
> between primary and secondary MMUs). So then all that is left to do is
> an invalidate_range to flush the secondary MMU TLBs.
> 
> There's no need of action in mmu_notifier_invalidate_range_start for
> those pagetable sharing drivers because there's no risk of a secondary
> MMU shadow pagetable layer to be re-created in between
> mmu_notifier_invalidate_range_start and the actual pagetable
> invalidate because again the pagetables are shared.

Yes but we still need to call invalidate_range() while under the page
table spinlock as this hardware sharing the CPU page table have their
own tlb (not even talking about tlb of each individual devices that
use PASID/ATS).

If we do not call it under spinlock then there is a chance that something
else get mapped between the time we drop the CPU page table and the
time we call mmu_notifier_invalidate_range_end() which itself call
invalidate_range().

Actually i would remove the call to invalidate_range() from range_end()
and audit all places where we call range_end() to assert that there is
a call to invalidate_range() under the page table spinlock that preced
it.

> 
> void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> 				  unsigned long start, unsigned long end)
> {
> 	struct mmu_notifier *mn;
> 	int id;
> 
> 	id = srcu_read_lock(&srcu);
> 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> 		/*
> 		 * Call invalidate_range here too to avoid the need for the
> 		 * subsystem of having to register an invalidate_range_end
> 		 * call-back when there is invalidate_range already. Usually a
> 		 * subsystem registers either invalidate_range_start()/end() or
> 		 * invalidate_range(), so this will be no additional overhead
> 		 * (besides the pointer check).
> 		 */
> 		if (mn->ops->invalidate_range)
> 			mn->ops->invalidate_range(mn, mm, start, end);
> 			^^^^^^^^^^^^^^^^^^^^^^^^^
> 		if (mn->ops->invalidate_range_end)
> 			mn->ops->invalidate_range_end(mn, mm, start, end);
> 	}
> 	srcu_read_unlock(&srcu, id);
> }
> 
> So this conversion from invalidate_page to invalidate_range looks
> superflous and the final mmu_notifier_invalidate_range_end should be
> enough.

See above.

> AFIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
> uses it.

powerpc has something similar too but i don't know its status

> My suggestion is to remove from below all
> mmu_notifier_invalidate_range calls and keep only the
> mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
> intel-svm with it under heavy swapping.
> 
> The only critical constraint to keep for invalidate_range to stay safe
> with a single call of mmu_notifier_invalidate_range_end after put_page
> is that the put_page cannot be the last put_page. That only applies to
> the case where the page isn't freed through MMU gather (MMU gather
> calls mmu_notifier_invalidate_range of its own before freeing the
> page, as opposed mmu gather does nothing for drivers using
> invalidate_range_start/end because invalidate_range_start acted as
> barrier to avoid establishing mappings on the secondary MMUs for
> those).
> 
> Not strictly required but I think it would be safer and more efficient
> to replace the put_page with something like:
> 
> static inline void put_page_not_freeing(struct page *page)
> {
> 	page = compound_head(page);
> 
> 	if (put_page_testzero(page))
> 		VM_WARN_ON_PAGE(1, page);
> }

Yes adding such check make sense.

Thank for looking into this too

Jérôme

next prev parent reply	other threads:[~2017-08-30 17:49 UTC|newest]

Thread overview: 160+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-29 23:54 [PATCH 00/13] mmu_notifier kill invalidate_page callback Jérôme Glisse
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 01/13] dax: update to new mmu_notifier semantic Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 02/13] mm/rmap: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  2:46   ` Nadav Amit
2017-08-30  2:46     ` Nadav Amit
2017-08-30  2:59     ` Jerome Glisse
2017-08-30  2:59       ` Jerome Glisse
2017-08-30  3:16       ` Nadav Amit
2017-08-30  3:16         ` Nadav Amit
2017-08-30  3:18         ` Nadav Amit
2017-08-30  3:18           ` Nadav Amit
2017-08-30 17:27     ` Andrea Arcangeli
2017-08-30 17:27       ` Andrea Arcangeli
2017-08-30 18:00       ` Nadav Amit
2017-08-30 18:00         ` Nadav Amit
2017-08-30 21:25         ` Andrea Arcangeli
2017-08-30 21:25           ` Andrea Arcangeli
2017-08-30 23:25           ` Nadav Amit
2017-08-30 23:25             ` Nadav Amit
2017-08-31  0:47             ` Jerome Glisse
2017-08-31  0:47               ` Jerome Glisse
2017-08-31  0:47               ` Jerome Glisse
     [not found]               ` <20170831004719.GF9445-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-31 17:12                 ` Andrea Arcangeli
2017-08-31 17:12                   ` Andrea Arcangeli
2017-08-31 17:12                   ` Andrea Arcangeli
2017-08-31 19:15                   ` Nadav Amit
2017-08-31 19:15                     ` Nadav Amit
2017-08-30 18:20       ` Jerome Glisse
2017-08-30 18:20         ` Jerome Glisse
2017-08-30 18:40         ` Nadav Amit
2017-08-30 18:40           ` Nadav Amit
2017-08-30 20:45           ` Jerome Glisse
2017-08-30 20:45             ` Jerome Glisse
2017-08-30 22:17             ` Andrea Arcangeli
2017-08-30 22:17               ` Andrea Arcangeli
2017-08-30 20:55           ` Andrea Arcangeli
2017-08-30 20:55             ` Andrea Arcangeli
2017-08-30 16:52   ` Andrea Arcangeli
2017-08-30 16:52     ` Andrea Arcangeli
2017-08-30 17:48     ` Jerome Glisse [this message]
2017-08-30 17:48       ` Jerome Glisse
2017-08-30 21:53     ` Linus Torvalds
2017-08-30 21:53       ` Linus Torvalds
2017-08-30 23:01       ` Andrea Arcangeli
2017-08-30 23:01         ` Andrea Arcangeli
2017-08-31 18:25         ` Jerome Glisse
2017-08-31 18:25           ` Jerome Glisse
2017-08-31 19:40           ` Linus Torvalds
2017-08-31 19:40             ` Linus Torvalds
2017-08-29 23:54 ` [PATCH 03/13] powerpc/powernv: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 04/13] drm/amdgpu: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  6:18   ` Christian König
2017-08-30  6:18     ` Christian König
2017-08-29 23:54 ` [PATCH 05/13] IB/umem: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  6:13   ` Leon Romanovsky
2017-08-29 23:54 ` [PATCH 06/13] IB/hfi1: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-09-06 14:08   ` Arumugam, Kamenee
2017-08-29 23:54 ` [PATCH 08/13] iommu/intel: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 09/13] misc/mic/scif: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 10/13] sgi-gru: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 11/13] xen/gntdev: " Jérôme Glisse
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30 19:32   ` Boris Ostrovsky
2017-08-30 19:32   ` Boris Ostrovsky
2017-08-29 23:54 ` [PATCH 12/13] KVM: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 13/13] mm/mmu_notifier: kill invalidate_page Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
     [not found] ` <20170829235447.10050-1-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-29 23:54   ` [PATCH 07/13] iommu/amd: update to new mmu_notifier semantic Jérôme Glisse
2017-08-29 23:54     ` Jérôme Glisse
2017-08-29 23:54     ` Jérôme Glisse
2017-08-30  0:11   ` [PATCH 00/13] mmu_notifier kill invalidate_page callback Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:56     ` Jerome Glisse
2017-08-30  0:56     ` Jerome Glisse
2017-08-30  0:56       ` Jerome Glisse
2017-08-30  0:56       ` Jerome Glisse
2017-08-30  8:40       ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30  8:40       ` Mike Galbraith
2017-08-30 14:57       ` Adam Borowski
2017-08-30 14:57         ` Adam Borowski
2017-08-30 14:57         ` Adam Borowski
2017-09-01 14:47         ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:50           ` taskboxtester
2017-09-01 14:50             ` taskboxtester
2017-09-01 14:47         ` Jeff Cook
2017-08-30 14:57       ` Adam Borowski
2017-08-30 21:51   ` Felix Kuehling
2017-08-31 13:59     ` Jerome Glisse
     [not found]       ` <20170831135953.GA9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-31 14:14         ` Christian König
2017-08-31 18:39         ` Felix Kuehling
2017-08-31 19:00           ` Jerome Glisse
     [not found]             ` <20170831190021.GG9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-31 23:19               ` Felix Kuehling
2017-08-31 23:29                 ` Jerome Glisse
2017-08-30  0:11 ` Linus Torvalds
2017-11-30  9:33 ` BSOD with " Fabian Grünbichler
2017-11-30  9:33   ` Fabian Grünbichler
     [not found]   ` <20171130093320.66cxaoj45g2ttzoh-aVfaTQcAavps8ZkLLAvlZtBPR1lH4CV8@public.gmane.org>
2017-11-30 11:20     ` Paolo Bonzini
2017-11-30 11:20       ` Paolo Bonzini
2017-11-30 11:20       ` Paolo Bonzini
2017-11-30 16:19       ` Radim Krčmář
2017-11-30 16:19         ` Radim Krčmář
2017-11-30 18:05         ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Radim Krčmář
2017-11-30 18:05           ` Radim Krčmář
2017-11-30 18:05           ` Radim Krčmář
2017-11-30 18:05           ` [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier Radim Krčmář
2017-11-30 18:05             ` Radim Krčmář
2017-12-01 15:15             ` Paolo Bonzini
2017-12-01 15:15               ` Paolo Bonzini
2017-12-01 15:15               ` Paolo Bonzini
2017-12-03 17:24               ` Andrea Arcangeli
2017-12-03 17:24                 ` Andrea Arcangeli
2017-12-03 17:24                 ` Andrea Arcangeli
2017-12-01 12:21           ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Fabian Grünbichler
2017-12-01 12:21             ` Fabian Grünbichler
2017-12-01 15:27           ` Paolo Bonzini
2017-12-01 15:27             ` Paolo Bonzini
2017-12-03 17:28           ` Andrea Arcangeli
2017-12-03 17:28             ` Andrea Arcangeli
2017-12-03 17:28             ` Andrea Arcangeli
2017-12-06  2:32           ` Wanpeng Li
2017-12-06  2:32             ` Wanpeng Li
2017-12-06  9:50             ` 王金浦
2017-12-06  9:50               ` 王金浦
2017-12-06 10:00               ` Paolo Bonzini
2017-12-06 10:00                 ` Paolo Bonzini
2017-12-06 10:00                 ` Paolo Bonzini
2017-12-06  8:15           ` Fabian Grünbichler
2017-12-06  8:15             ` Fabian Grünbichler
2017-12-06  8:15             ` Fabian Grünbichler
2017-12-13 12:54           ` Richard Purdie
2017-12-13 12:54             ` Richard Purdie
2017-12-13 12:54             ` Richard Purdie
2017-11-30 16:19       ` BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback Radim Krčmář
2017-11-30 11:20   ` Paolo Bonzini
2017-11-30  9:33 ` Fabian Grünbichler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170830174857.GC2386@redhat.com \
    --to=jglisse@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axie@amd.com \
    --cc=berny156@gmx.de \
    --cc=dan.j.williams@intel.com \
    --cc=efault@gmx.de \
    --cc=kernellwp@gmail.com \
    --cc=kilobyte@angband.pl \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tiwai@suse.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.