Re: [BUG] vfio device assignment regression with THP ref counting redesign

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Date: Fri, 29 Apr 2016 16:34:29 -0600	[thread overview]
Message-ID: <20160429163429.30c9f4a0@t450s.home> (raw)
In-Reply-To: <20160429163444.GM11700@redhat.com>

On Fri, 29 Apr 2016 18:34:44 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> > Hm. I just woke up and haven't got any coffee yet, but I don't why my
> > approach would be worse for performance. Both have the same algorithmic
> > complexity.  
> 
> Even before looking at the overall performance, I'm not sure your
> patch is really fixing it all: you didn't touch reuse_swap_page which
> is used by do_wp_page to know if it can call do_wp_page_reuse. Your
> patch would still trigger a COW instead of calling do_wp_page_reuse,
> but it would only happen if the page was pinned after the pmd split,
> which is probably not what the testcase is triggering. My patch
> instead fixed that too.
> 
> total_mapcount returns the wrong value for reuse_swap_page, which is
> probably why you didn't try to use it there.
> 
> The main issue of my patch is that it has a performance downside that
> is page_mapcount becomes expensive for all other usages, which is
> better than breaking vfio but I couldn't use total_mapcount again
> because it counts things wrong in reuse_swap_page.
> 
> Like I said there's room for optimizations so today I tried to
> optimize more stuff...

I've had this under test for several hours without error.  Thanks!

Alex


> From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
>  during WP faults
> 
> This will provide fully accuracy to the mapcount calculation in the
> write protect faults, so page pinning will not get broken by false
> positive copy-on-writes.
> 
> total_mapcount() isn't the right calculation needed in
> reuse_swap_page, so this introduces a page_trans_huge_mapcount() that
> is effectively the full accurate return value for page_mapcount() if
> dealing with Transparent Hugepages, however we only use the
> page_trans_huge_mapcount() during COW faults where it strictly needed,
> due to its higher runtime cost.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h   |  5 +++++
>  include/linux/swap.h |  3 +--
>  mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
>  mm/swapfile.c        |  5 +----
>  4 files changed, 43 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8fb3604..c2026a1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int total_mapcount(struct page *page);
> +int page_trans_huge_mapcount(struct page *page);
>  #else
>  static inline int total_mapcount(struct page *page)
>  {
>  	return page_mapcount(page);
>  }
> +static inline int page_trans_huge_mapcount(struct page *page)
> +{
> +	return page_mapcount(page);
> +}
>  #endif
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2f6478f..905bf8e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
>  	return 0;
>  }
>  
> -#define reuse_swap_page(page) \
> -	(!PageTransCompound(page) && page_mapcount(page) == 1)
> +#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
>  
>  static inline int try_to_free_swap(struct page *page)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 06bce0f..6a6d9c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
>  	 * We can only reuse the page if nobody else maps the huge page or it's
> -	 * part. We can do it by checking page_mapcount() on each sub-page, but
> -	 * it's expensive.
> -	 * The cheaper way is to check page_count() to be equal 1: every
> -	 * mapcount takes page reference reference, so this way we can
> -	 * guarantee, that the PMD is the only mapping.
> -	 * This can give false negative if somebody pinned the page, but that's
> -	 * fine.
> +	 * part.
>  	 */
> -	if (page_mapcount(page) == 1 && page_count(page) == 1) {
> +	if (page_trans_huge_mapcount(page) == 1) {
>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
>  }
>  
>  /*
> + * This calculates accurately how many mappings a transparent hugepage
> + * has (unlike page_mapcount() which isn't fully accurate). This full
> + * accuracy is primarily needed to know if copy-on-write faults can
> + * takeover the page and change the mapping to read-write instead of
> + * copying them. This is different from total_mapcount() too: we must
> + * not count all mappings on the subpages individually, but instead we
> + * must check the highest mapcount any one of the subpages has.
> + *
> + * It would be entirely safe and even more correct to replace
> + * page_mapcount() with page_trans_huge_mapcount(), however we only
> + * use page_trans_huge_mapcount() in the copy-on-write faults where we
> + * need full accuracy to avoid breaking page pinning.
> + */
> +int page_trans_huge_mapcount(struct page *page)
> +{
> +	int i, ret;
> +
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) + 1;
> +
> +	ret = 0;
> +	if (likely(!PageHuge(page))) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++)
> +			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
> +		if (PageDoubleMap(page))
> +			ret -= 1;
> +	}
> +	ret += compound_mapcount(page);
> +	return ret;
> +}
> +
> +/*
>   * This function splits huge page into normal pages. @page can point to any
>   * subpage of huge page to split. Split doesn't change the position of @page.
>   *
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 83874ec..984470a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return 0;
> -	/* The page is part of THP and cannot be reused */
> -	if (PageTransCompound(page))
> -		return 0;
> -	count = page_mapcount(page);
> +	count = page_trans_huge_mapcount(page);
>  	if (count <= 1 && PageSwapCache(page)) {
>  		count += page_swapcount(page);
>  		if (count == 1 && !PageWriteback(page)) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-04-29 22:34 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-28 16:20 [BUG] vfio device assignment regression with THP ref counting redesign Alex Williamson
2016-04-28 18:17 ` Kirill A. Shutemov
2016-04-28 18:58   ` Alex Williamson
2016-04-28 23:21     ` Andrea Arcangeli
2016-04-29  0:44       ` Alex Williamson
2016-04-29  0:51       ` Kirill A. Shutemov
2016-04-29  2:45         ` Alex Williamson
2016-04-29  7:06           ` Kirill A. Shutemov
2016-04-29 15:12             ` Alex Williamson
2016-04-29 16:34             ` Andrea Arcangeli
2016-04-29 22:34               ` Alex Williamson [this message]
2016-05-02 10:41               ` Kirill A. Shutemov
2016-05-02 11:15                 ` Jerome Glisse
2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
2016-05-02 13:39                     ` Jerome Glisse
2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
2016-05-02 15:22                         ` Jerome Glisse
2016-05-02 16:12                           ` Kirill A. Shutemov
2016-05-02 19:14                             ` Andrea Arcangeli
2016-05-02 19:11                           ` Andrea Arcangeli
2016-05-02 19:02                         ` Andrea Arcangeli
2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
2016-05-02 16:21                       ` Kirill A. Shutemov
2016-05-02 16:22                         ` Oleg Nesterov
2016-05-02 18:03                           ` Kirill A. Shutemov
2016-05-02 17:41                             ` Oleg Nesterov
2016-05-02 18:56                     ` Andrea Arcangeli
2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
2016-05-02 16:00                   ` Kirill A. Shutemov
2016-05-02 18:03                     ` Andrea Arcangeli
2016-05-05  1:19                       ` Alex Williamson
2016-05-05 14:39                         ` Andrea Arcangeli
2016-05-05 15:09                           ` Andrea Arcangeli
2016-05-05 15:11                           ` Kirill A. Shutemov
2016-05-05 15:24                             ` Andrea Arcangeli
2016-05-06  7:29                               ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160429163429.30c9f4a0@t450s.home \
    --to=alex.williamson@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).