From: Andrea Arcangeli <aarcange@redhat.com>
To: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Bernhard Held" <berny156@gmx.de>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Takashi Iwai" <tiwai@suse.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Mike Galbraith" <efault@gmx.de>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	axie <axie@amd.com>, "Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
Date: Wed, 30 Aug 2017 18:52:50 +0200	[thread overview]
Message-ID: <20170830165250.GD13559@redhat.com> (raw)
In-Reply-To: <20170829235447.10050-3-jglisse@redhat.com>

Hello Jerome,

On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidate_range()
> and making sure it is bracketed by calls to mmu_notifier_invalidate_range_start/
> end.
> 
> Note that because we cannot presume the pmd value or pte value we have to
> assume the worst and unconditionally report an invalidation as happening.

As I pointed out in an earlier email, ->invalidate_range can only be
implemented (as a mutually exclusive alternative to
->invalidate_range_start/end) by secondary MMUs that share the very
same pagetables with the primary MMU managed by the core Linux VM, and
those ->invalidate_range callbacks are already invoked by
__mmu_notifier_invalidate_range_end. The other part is done by the MMU
gather (because mmu_notifier_invalidate_range_start is a noop for
drivers that implement ->invalidate_range).

Sharing the same pagetables is what allows ->invalidate_range to work:
when the Linux MM changes the primary MMU pagetables, it automatically
updates the secondary MMU at the same time (because of the pagetable
sharing between primary and secondary MMUs). So then all that is left
to do is an invalidate_range to flush the secondary MMU TLBs.

There's no need for any action in mmu_notifier_invalidate_range_start
for those pagetable-sharing drivers, because there's no risk of a
secondary MMU shadow pagetable layer being re-created between
mmu_notifier_invalidate_range_start and the actual pagetable
invalidate, again because the pagetables are shared.

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/*
		 * Call invalidate_range here too to avoid the need for the
		 * subsystem of having to register an invalidate_range_end
		 * call-back when there is invalidate_range already. Usually a
		 * subsystem registers either invalidate_range_start()/end() or
		 * invalidate_range(), so this will be no additional overhead
		 * (besides the pointer check).
		 */
		if (mn->ops->invalidate_range)
			mn->ops->invalidate_range(mn, mm, start, end);
			^^^^^^^^^^^^^^^^^^^^^^^^^
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}

So this conversion from invalidate_page to invalidate_range looks
superfluous, and the final mmu_notifier_invalidate_range_end should be
enough.

AFAIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
use it.

My suggestion is to remove all the mmu_notifier_invalidate_range calls
below, keep only the mmu_notifier_invalidate_range_end, and test both
amd_iommu_v2 and intel-svm with it under heavy swapping.

The only critical constraint for invalidate_range to stay safe with a
single call of mmu_notifier_invalidate_range_end after put_page is
that the put_page cannot be the last put_page. That only applies to
the case where the page isn't freed through MMU gather (MMU gather
calls mmu_notifier_invalidate_range on its own before freeing the
page; for drivers using invalidate_range_start/end, MMU gather instead
does nothing, because invalidate_range_start already acted as a
barrier against establishing mappings on the secondary MMUs).

Not strictly required but I think it would be safer and more efficient
to replace the put_page with something like:

static inline void put_page_not_freeing(struct page *page)
{
	page = compound_head(page);

	/* The caller must hold its own reference: this put must not free */
	if (put_page_testzero(page))
		VM_WARN_ON_PAGE(1, page);
}

Thanks!
Andrea

> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  
>  		if (ret) {
> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>  			(*cleaned)++;
>  		}
>  	}
>  
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return true;
>  }
>  
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	struct page *subpage;
>  	bool ret = true;
> +	unsigned long start = address, end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				flags & TTU_MIGRATION, page);
>  	}
>  
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				set_huge_swap_pte_at(mm, address,
>  						     pvmw.pte, pteval,
>  						     vma_mmu_pagesize(vma));
> +				mmu_notifier_invalidate_range(mm, address,
> +					address + vma_mmu_pagesize(vma));
>  			} else {
>  				dec_mm_counter(mm, mm_counter(page));
>  				set_pte_at(mm, address, pvmw.pte, pteval);
> @@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else if (PageAnon(page)) {
>  			swp_entry_t entry = { .val = page_private(subpage) };
>  			pte_t swp_pte;
> @@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
>  				WARN_ON_ONCE(1);
>  				ret = false;
> +				/* We have to invalidate as we cleared the pte */
> +				mmu_notifier_invalidate_range(mm, address,
> +							address + PAGE_SIZE);
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> @@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (!PageSwapBacked(page)) {
>  				if (!PageDirty(page)) {
>  					dec_mm_counter(mm, MM_ANONPAGES);
> +					/* Invalidate as we cleared the pte */
> +					mmu_notifier_invalidate_range(mm,
> +						address, address + PAGE_SIZE);
>  					goto discard;
>  				}
>  
> @@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else
>  			dec_mm_counter(mm, mm_counter_file(page));
>  discard:
>  		page_remove_rmap(subpage, PageHuge(page));
>  		put_page(page);
> -		mmu_notifier_invalidate_page(mm, address);
>  	}
> +
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return ret;
>  }
>  
> -- 
> 2.13.5
> 
