Re: [PATCH -V7 RESEND 08/21] swap: Support to read a huge swap cluster for swapin a THP

From: "Huang, Ying" <ying.huang@intel.com>
To: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shaohua Li <shli@kernel.org>, Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>
Subject: Re: [PATCH -V7 RESEND 08/21] swap: Support to read a huge swap cluster for swapin a THP
Date: Sat, 01 Dec 2018 08:34:06 +0800
Message-ID: <8736rirsox.fsf@yhuang-dev.intel.com>
In-Reply-To: <20181130233201.6yuzbhymtjddvf3u@ca-dmjordan1.us.oracle.com> (Daniel Jordan's message of "Fri, 30 Nov 2018 15:32:01 -0800")

Hi, Daniel,

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> Hi Ying,
>
> On Tue, Nov 20, 2018 at 04:54:36PM +0800, Huang Ying wrote:
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 97831166994a..1eedbc0aede2 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -387,14 +389,42 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>  		 * as SWAP_HAS_CACHE.  That's done in later part of code or
>>  		 * else swap_off will be aborted if we return NULL.
>>  		 */
>> -		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
>> +		if (!__swp_swapcount(entry, &entry_size) &&
>> +		    swap_slot_cache_enabled)
>>  			break;
>>  
>>  		/*
>>  		 * Get a new page to read into from swap.
>>  		 */
>> -		if (!new_page) {
>> -			new_page = alloc_page_vma(gfp_mask, vma, addr);
>> +		if (!new_page ||
>> +		    (IS_ENABLED(CONFIG_THP_SWAP) &&
>> +		     hpage_nr_pages(new_page) != entry_size)) {
>> +			if (new_page)
>> +				put_page(new_page);
>> +			if (IS_ENABLED(CONFIG_THP_SWAP) &&
>> +			    entry_size == HPAGE_PMD_NR) {
>> +				gfp_t gfp;
>> +
>> +				gfp = alloc_hugepage_direct_gfpmask(vma, addr);
>
> vma is NULL when we get here from try_to_unuse, so the kernel will die on
> vma->flags inside alloc_hugepage_direct_gfpmask.

Good catch!  Thanks a lot for helping to pinpoint this bug!

> try_to_unuse swaps in before it finds vma's, but even if those were reversed,
> it seems try_to_unuse wouldn't always have a single vma to pass into this path
> since it's walking the swap_map and multiple processes mapping the same huge
> page can have different huge page advice (and maybe mempolicies?), affecting
> the result of alloc_hugepage_direct_gfpmask.  And yet
> alloc_hugepage_direct_gfpmask needs a vma to do its job.  So, I'm not sure how
> to fix this.
>
> If the entry's usage count were 1, we could find the vma in that common case to
> give read_swap_cache_async, and otherwise allocate small pages.  We'd have THPs
> some of the time and be exactly following alloc_hugepage_direct_gfpmask, but
> would also be conservative when it's uncertain.
>
> Or, if the system-wide THP settings allow it then go for it, but otherwise
> ignore vma hints and always fall back to small pages.  This requires another
> way of controlling THP allocations besides alloc_hugepage_direct_gfpmask.
>
> Or maybe try_to_unuse shouldn't allocate hugepages at all, but then no perf
> improvement for try_to_unuse.
>
> What do you think?

I think that swapoff(), which is the main user of try_to_unuse(), isn't a
common operation in practice.  So it's not necessary to make the code more
complex for this case.

In alloc_hugepage_direct_gfpmask(), the only information provided by vma
is: vma->vm_flags & VM_HUGEPAGE.  Because we have no vma available, I think
it is OK to just assume that the flag is cleared.  That is, rely on
system-wide THP settings only.
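
The idea above can be sketched in a small userspace model.  This is only an
illustration, not the patch itself: struct vma_model, VM_HUGEPAGE_MODEL, the
defrag_model modes and the effort_model levels are all hypothetical stand-ins
invented for the sketch, loosely following how the system-wide defrag setting
and the per-vma madvise hint might combine.  The point is the NULL check: a
missing vma is treated as "VM_HUGEPAGE cleared", so the result depends on the
system-wide setting only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's definitions. */
#define VM_HUGEPAGE_MODEL 0x1UL

struct vma_model {
	unsigned long vm_flags;
};

/* Invented names for a system-wide defrag setting. */
enum defrag_model {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/* Invented names for how hard the allocator is allowed to try. */
enum effort_model {
	EFFORT_LIGHT,		/* fail quickly, no reclaim */
	EFFORT_KICK_BACKGROUND,	/* wake background compaction, fail quickly */
	EFFORT_DIRECT_NORETRY,	/* direct reclaim, but do not retry */
	EFFORT_DIRECT,		/* direct reclaim/compaction */
};

/* A NULL vma (as in the swapoff path) is treated as "not madvised". */
static bool vma_madvised_model(const struct vma_model *vma)
{
	return vma && (vma->vm_flags & VM_HUGEPAGE_MODEL);
}

/* Combine the system-wide setting with the (possibly absent) vma hint. */
static enum effort_model hugepage_effort_model(const struct vma_model *vma,
					       enum defrag_model defrag)
{
	bool madvised = vma_madvised_model(vma);

	switch (defrag) {
	case DEFRAG_ALWAYS:
		return madvised ? EFFORT_DIRECT : EFFORT_DIRECT_NORETRY;
	case DEFRAG_DEFER:
		return EFFORT_KICK_BACKGROUND;
	case DEFRAG_DEFER_MADVISE:
		return madvised ? EFFORT_DIRECT : EFFORT_KICK_BACKGROUND;
	case DEFRAG_MADVISE:
		return madvised ? EFFORT_DIRECT : EFFORT_LIGHT;
	default:
		return EFFORT_LIGHT;
	}
}
```

With this shape, a caller that has no vma passes NULL and gets the same answer
as a caller whose vma was never madvised, so no separate code path is needed
for swapoff.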

What do you think about this proposal?

Best Regards,
Huang, Ying


Thread overview: 29+ messages
2018-11-20  8:54 [PATCH -V7 00/21] swap: Swapout/swapin THP in one piece Huang Ying
2018-11-20  8:54 ` Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 02/21] swap: Add __swap_duplicate_locked() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 03/21] swap: Support PMD swap mapping in swap_duplicate() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 04/21] swap: Support PMD swap mapping in put_swap_page() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 06/21] swap: Support PMD swap mapping when splitting huge PMD Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 07/21] swap: Support PMD swap mapping in split_swap_cluster() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 08/21] swap: Support to read a huge swap cluster for swapin a THP Huang Ying
2018-11-30 23:32   ` Daniel Jordan
2018-12-01  0:34     ` Huang, Ying [this message]
2018-12-01  0:34       ` Huang, Ying
2018-12-03 16:15       ` Daniel Jordan
2018-12-04  2:30         ` Huang, Ying
2018-12-04  2:30           ` Huang, Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 09/21] swap: Swapin a THP in one piece Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 10/21] swap: Support to count THP swapin and its fallback Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 11/21] swap: Add sysfs interface to configure THP swapin Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 12/21] swap: Support PMD swap mapping in swapoff Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 13/21] swap: Support PMD swap mapping in madvise_free() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 14/21] swap: Support to move swap account for PMD swap mapping Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 15/21] swap: Support to copy PMD swap mapping when fork() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 16/21] swap: Free PMD swap mapping when zap_huge_pmd() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 17/21] swap: Support PMD swap mapping for MADV_WILLNEED Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 18/21] swap: Support PMD swap mapping in mincore() Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 19/21] swap: Support PMD swap mapping in common path Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 20/21] swap: create PMD swap mapping when unmap the THP Huang Ying
2018-11-20  8:54 ` [PATCH -V7 RESEND 21/21] swap: Update help of CONFIG_THP_SWAP Huang Ying
