All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vlastimil Babka <vbabka@suse.cz>
To: Ebru Akagunduz <ebru.akagunduz@gmail.com>, linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de,
	rientjes@google.com, aneesh.kumar@linux.vnet.ibm.com,
	hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz,
	boaz@plexistor.com, raindel@mellanox.com
Subject: Re: [RESEND RFC v4 3/3] mm: make swapin readahead to improve thp collapse rate
Date: Fri, 4 Sep 2015 18:44:41 +0200	[thread overview]
Message-ID: <55E9CA79.7000103@suse.cz> (raw)
In-Reply-To: <1441313508-4276-4-git-send-email-ebru.akagunduz@gmail.com>

On 09/03/2015 10:51 PM, Ebru Akagunduz wrote:
> This patch makes swapin readahead to improve thp collapse rate.
> When khugepaged scanned pages, there can be a few of the pages
> in swap area.
> 
> With the patch THP can collapse 4kB pages into a THP when
> there are up to max_ptes_swap swap ptes in a 2MB range.
> 
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all. Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
> 
> Without the patch, system did not swap in readahead.
> THP rate was %47 of the program of the memory, it
> did not change over time.
> 
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed %90 of the program's memory.
> 
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> Changes in v2:
>  - Use FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT flag
>    instead of 0x0 when called do_swap_page from
>    __collapse_huge_page_swapin (Rik van Riel)
> 
> Changes in v3:
>  - Catch VM_FAULT_HWPOISON and VM_FAULT_OOM return cases
>    in __collapse_huge_page_swapin (Kirill A. Shutemov)
> 
> Changes in v4:
>  - Fix broken indentation reverting if (...) statement in
>    __collapse_huge_page_swapin (Kirill A. Shutemov)
>  - Fix check statement of ret (Kirill A. Shutemov)
>  - Use swapped_in name instead of swap_pte
> 
> 			After swapped out
> -------------------------------------------------------------------
>               | Anonymous | AnonHugePages | Swap      | Fraction  |
> -------------------------------------------------------------------
> With patch    | 253720 kB | 251904 kB     | 546284 kB |    %99    |
> -------------------------------------------------------------------
> Without patch | 238160 kB | 235520 kB     | 561844 kB |    %98    |
> -------------------------------------------------------------------
> 
>                         After swapped in
> -------------------------------------------------------------------
>               | Anonymous | AnonHugePages | Swap      | Fraction  |
> -------------------------------------------------------------------
> With patch    | 533532 kB | 528384 kB     | 266472 kB |    %90    |
> -------------------------------------------------------------------
> Without patch | 499956 kB | 235520 kB     | 300048 kB |    %47    |
> -------------------------------------------------------------------
> 
> 
>  include/linux/mm.h                 |  4 ++++
>  include/trace/events/huge_memory.h | 24 +++++++++++++++++++++
>  mm/huge_memory.c                   | 43 ++++++++++++++++++++++++++++++++++++++
>  mm/memory.c                        |  2 +-
>  4 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bf6f117..f28d98a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -29,6 +29,10 @@ struct user_struct;
>  struct writeback_control;
>  struct bdi_writeback;
>  
> +extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pte_t *page_table, pmd_t *pmd,
> +			unsigned int flags, pte_t orig_pte);
> +

I'd rather use mm/internal.h here than export the function outside mm.

>  #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
>  extern unsigned long max_mapnr;
>  
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index c2112fd..e530210 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -97,6 +97,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
>  		khugepaged_status_string[__entry->status])
>  );
>  
> +TRACE_EVENT(mm_collapse_huge_page_swapin,
> +
> +	TP_PROTO(struct mm_struct *mm, int swapped_in, int ret),
> +
> +	TP_ARGS(mm, swapped_in, ret),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(int, swapped_in)
> +		__field(int, ret)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->swapped_in = swapped_in;
> +		__entry->ret = ret;
> +	),
> +
> +	TP_printk("mm=%p, swapped_in=%d, ret=%d",
> +		__entry->mm,
> +		__entry->swapped_in,
> +		__entry->ret)
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cb9ab46..f420ef8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2529,6 +2529,47 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
>  	return true;
>  }
>  
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmd,
> +					pte_t *pte)
> +{
> +	unsigned long _address;
> +	pte_t pteval = *pte;
> +	int swapped_in = 0, ret = 0;
> +
> +	pte = pte_offset_map(pmd, address);
> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> +	     pte++, _address += PAGE_SIZE) {
> +		pteval = *pte;
> +		if (!is_swap_pte(pteval))
> +			continue;
> +		swapped_in++;
> +		ret = do_swap_page(mm, vma, _address, pte, pmd,
> +				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> +				   pteval);
> +		if (ret & VM_FAULT_ERROR) {
> +			trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
> +			return;
> +		}
> +		/* pte is unmapped now, we need to map it */
> +		pte = pte_offset_map(pmd, _address);
> +	}
> +	pte--;
> +	pte_unmap(pte);
> +	trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
> +}
> +
> +
> +
>  static void collapse_huge_page(struct mm_struct *mm,
>  				   unsigned long address,
>  				   struct page **hpage,
> @@ -2600,6 +2641,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	anon_vma_lock_write(vma->anon_vma);
>  
> +	__collapse_huge_page_swapin(mm, vma, address, pmd, pte);

Hmm it seems rather wasteful to call this when no swap entries were detected.
Also it seems pointless to try continue collapsing when we have just only issued
async swap-in? What are the chances they would finish in time?

I'm less sure about the relation vs khugepaged_alloc_page(). At this point, we
have already succeeded the hugepage allocation. It makes sense not to swap-in if
we can't allocate a hugepage. It makes also sense not to allocate a hugepage if
we will just issue async swap-ins and then free the hugepage back. Swap-in means
disk I/O that's best avoided if not useful. But the reclaim for hugepage
allocation might also involve disk I/O. At worst, it could be creating new swap
pte's in the very pmd we are scanning... Thoughts?

> +
>  	pte = pte_offset_map(pmd, address);
>  	pte_ptl = pte_lockptr(mm, pmd);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 388dcf9..d3a446c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2442,7 +2442,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
>   * We return with the mmap_sem locked or unlocked in the same cases
>   * as does filemap_fault().
>   */
> -static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unsigned long address, pte_t *page_table, pmd_t *pmd,
>  		unsigned int flags, pte_t orig_pte)
>  {
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Vlastimil Babka <vbabka@suse.cz>
To: Ebru Akagunduz <ebru.akagunduz@gmail.com>, linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de,
	rientjes@google.com, aneesh.kumar@linux.vnet.ibm.com,
	hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz,
	boaz@plexistor.com, raindel@mellanox.com
Subject: Re: [RESEND RFC v4 3/3] mm: make swapin readahead to improve thp collapse rate
Date: Fri, 4 Sep 2015 18:44:41 +0200	[thread overview]
Message-ID: <55E9CA79.7000103@suse.cz> (raw)
In-Reply-To: <1441313508-4276-4-git-send-email-ebru.akagunduz@gmail.com>

On 09/03/2015 10:51 PM, Ebru Akagunduz wrote:
> This patch makes swapin readahead to improve thp collapse rate.
> When khugepaged scanned pages, there can be a few of the pages
> in swap area.
> 
> With the patch THP can collapse 4kB pages into a THP when
> there are up to max_ptes_swap swap ptes in a 2MB range.
> 
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all. Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
> 
> Without the patch, system did not swap in readahead.
> THP rate was %47 of the program of the memory, it
> did not change over time.
> 
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed %90 of the program's memory.
> 
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> Changes in v2:
>  - Use FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT flag
>    instead of 0x0 when called do_swap_page from
>    __collapse_huge_page_swapin (Rik van Riel)
> 
> Changes in v3:
>  - Catch VM_FAULT_HWPOISON and VM_FAULT_OOM return cases
>    in __collapse_huge_page_swapin (Kirill A. Shutemov)
> 
> Changes in v4:
>  - Fix broken indentation reverting if (...) statement in
>    __collapse_huge_page_swapin (Kirill A. Shutemov)
>  - Fix check statement of ret (Kirill A. Shutemov)
>  - Use swapped_in name instead of swap_pte
> 
> 			After swapped out
> -------------------------------------------------------------------
>               | Anonymous | AnonHugePages | Swap      | Fraction  |
> -------------------------------------------------------------------
> With patch    | 253720 kB | 251904 kB     | 546284 kB |    %99    |
> -------------------------------------------------------------------
> Without patch | 238160 kB | 235520 kB     | 561844 kB |    %98    |
> -------------------------------------------------------------------
> 
>                         After swapped in
> -------------------------------------------------------------------
>               | Anonymous | AnonHugePages | Swap      | Fraction  |
> -------------------------------------------------------------------
> With patch    | 533532 kB | 528384 kB     | 266472 kB |    %90    |
> -------------------------------------------------------------------
> Without patch | 499956 kB | 235520 kB     | 300048 kB |    %47    |
> -------------------------------------------------------------------
> 
> 
>  include/linux/mm.h                 |  4 ++++
>  include/trace/events/huge_memory.h | 24 +++++++++++++++++++++
>  mm/huge_memory.c                   | 43 ++++++++++++++++++++++++++++++++++++++
>  mm/memory.c                        |  2 +-
>  4 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bf6f117..f28d98a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -29,6 +29,10 @@ struct user_struct;
>  struct writeback_control;
>  struct bdi_writeback;
>  
> +extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pte_t *page_table, pmd_t *pmd,
> +			unsigned int flags, pte_t orig_pte);
> +

I'd rather use mm/internal.h here than export the function outside mm.

>  #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
>  extern unsigned long max_mapnr;
>  
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index c2112fd..e530210 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -97,6 +97,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
>  		khugepaged_status_string[__entry->status])
>  );
>  
> +TRACE_EVENT(mm_collapse_huge_page_swapin,
> +
> +	TP_PROTO(struct mm_struct *mm, int swapped_in, int ret),
> +
> +	TP_ARGS(mm, swapped_in, ret),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(int, swapped_in)
> +		__field(int, ret)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->swapped_in = swapped_in;
> +		__entry->ret = ret;
> +	),
> +
> +	TP_printk("mm=%p, swapped_in=%d, ret=%d",
> +		__entry->mm,
> +		__entry->swapped_in,
> +		__entry->ret)
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cb9ab46..f420ef8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2529,6 +2529,47 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
>  	return true;
>  }
>  
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmd,
> +					pte_t *pte)
> +{
> +	unsigned long _address;
> +	pte_t pteval = *pte;
> +	int swapped_in = 0, ret = 0;
> +
> +	pte = pte_offset_map(pmd, address);
> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> +	     pte++, _address += PAGE_SIZE) {
> +		pteval = *pte;
> +		if (!is_swap_pte(pteval))
> +			continue;
> +		swapped_in++;
> +		ret = do_swap_page(mm, vma, _address, pte, pmd,
> +				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> +				   pteval);
> +		if (ret & VM_FAULT_ERROR) {
> +			trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
> +			return;
> +		}
> +		/* pte is unmapped now, we need to map it */
> +		pte = pte_offset_map(pmd, _address);
> +	}
> +	pte--;
> +	pte_unmap(pte);
> +	trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
> +}
> +
> +
> +
>  static void collapse_huge_page(struct mm_struct *mm,
>  				   unsigned long address,
>  				   struct page **hpage,
> @@ -2600,6 +2641,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	anon_vma_lock_write(vma->anon_vma);
>  
> +	__collapse_huge_page_swapin(mm, vma, address, pmd, pte);

Hmm it seems rather wasteful to call this when no swap entries were detected.
Also it seems pointless to try continue collapsing when we have just only issued
async swap-in? What are the chances they would finish in time?

I'm less sure about the relation vs khugepaged_alloc_page(). At this point, we
have already succeeded the hugepage allocation. It makes sense not to swap-in if
we can't allocate a hugepage. It makes also sense not to allocate a hugepage if
we will just issue async swap-ins and then free the hugepage back. Swap-in means
disk I/O that's best avoided if not useful. But the reclaim for hugepage
allocation might also involve disk I/O. At worst, it could be creating new swap
pte's in the very pmd we are scanning... Thoughts?

> +
>  	pte = pte_offset_map(pmd, address);
>  	pte_ptl = pte_lockptr(mm, pmd);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 388dcf9..d3a446c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2442,7 +2442,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
>   * We return with the mmap_sem locked or unlocked in the same cases
>   * as does filemap_fault().
>   */
> -static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unsigned long address, pte_t *page_table, pmd_t *pmd,
>  		unsigned int flags, pte_t orig_pte)
>  {
> 


  reply	other threads:[~2015-09-04 16:44 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-03 20:51 [RESEND RFC v4 0/3] mm: make swapin readahead to gain more thp performance Ebru Akagunduz
2015-09-03 20:51 ` Ebru Akagunduz
2015-09-03 20:51 ` [RESEND RFC v4 1/3] mm: add tracepoint for scanning pages Ebru Akagunduz
2015-09-03 20:51   ` Ebru Akagunduz
2015-09-04 16:07   ` Vlastimil Babka
2015-09-04 16:07     ` Vlastimil Babka
2015-09-04 16:44     ` Rik van Riel
2015-09-04 16:44       ` Rik van Riel
2015-09-05 19:24       ` Ebru Akagunduz
2015-09-05 19:24         ` Ebru Akagunduz
2015-09-03 20:51 ` [RESEND RFC v4 2/3] mm: make optimistic check for swapin readahead Ebru Akagunduz
2015-09-03 20:51   ` Ebru Akagunduz
2015-09-04 16:18   ` Vlastimil Babka
2015-09-04 16:18     ` Vlastimil Babka
2015-09-03 20:51 ` [RESEND RFC v4 3/3] mm: make swapin readahead to improve thp collapse rate Ebru Akagunduz
2015-09-03 20:51   ` Ebru Akagunduz
2015-09-04 16:44   ` Vlastimil Babka [this message]
2015-09-04 16:44     ` Vlastimil Babka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55E9CA79.7000103@suse.cz \
    --to=vbabka@suse.cz \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=boaz@plexistor.com \
    --cc=ebru.akagunduz@gmail.com \
    --cc=gorcunov@openvz.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=raindel@mellanox.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=xiexiuqi@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.