Re: [rfc][patch] swap: virtual swap readahead

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [rfc][patch] swap: virtual swap readahead
Date: Mon, 8 Jun 2009 15:52:46 +0800	[thread overview]
Message-ID: <20090608075246.GA12644@localhost> (raw)
In-Reply-To: <1243436746-2698-1-git-send-email-hannes@cmpxchg.org>

[-- Attachment #1: Type: text/plain, Size: 6763 bytes --]

On Wed, May 27, 2009 at 05:05:46PM +0200, Johannes Weiner wrote:
> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
> 
> This works as long as the physical swap slot order approximates the
> LRU order decently, otherwise it wastes memory and IO bandwidth to
> read in pages that are unlikely to be needed soon.
> 
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. high memory pressure situations,
> and this is exactly the situation where swapin should not waste any
> memory or IO bandwidth as both are the most contended resources at
> this point.
> 
> This patch makes swap-in base its readaround window on the virtual
> proximity of pages in the faulting VMA, as an indicator for pages
> needed in the near future, while still taking physical locality of
> swap slots into account.
> 
> This has the advantage of reading in big batches when the LRU order
> matches the swap slot order while automatically throttling readahead
> when the system is thrashing and swap slots are no longer nicely
> grouped by LRU order.

Hi Johannes,

You may want to test the patch against a real desktop :)
The attached scripts can do that. I also have the setup to
test it out conveniently, so if you send me the latest patch..

Thanks,
Fengguang

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/swap_state.c |   80 +++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 57 insertions(+), 23 deletions(-)
> 
> qsbench, 20 runs each, 1.7GB RAM, 2GB swap, 4 cores:
> 
>          "mean (standard deviation) median"
> 
> All values are given in seconds.  I used a t-test to make sure there
> is a statistical difference of at least 95% probability in the
> compared runs for the given number of samples, arithmetic mean and
> standard deviation.
> 
> 1 x 2048M
> vanilla: 391.25 ( 71.76) 384.56
> vswapra: 445.55 ( 83.19) 415.41
> 
> 	This is an actual regression.  I am not yet quite sure why
> 	this happens and I am undecided whether one humonguous active
> 	vma in the system is a common workload.
> 
> 	It's also the only regression I found, with qsbench anyway.  I
> 	started out with powers of two and tweaked the parameters
> 	until the results between the two kernel versions differed.
> 
> 2 x 1024M
> vanilla: 384.25 ( 75.00) 423.08
> vswapra: 290.26 ( 31.38) 299.51
> 
> 4 x 540M
> vanilla: 553.91 (100.02) 554.57
> vswapra: 336.58 ( 52.49) 331.52
> 
> 8 x 280M
> vanilla: 561.08 ( 82.36) 583.12
> vswapra: 319.13 ( 43.17) 307.69
> 
> 16 x 128M
> vanilla: 285.51 (113.20) 236.62
> vswapra: 214.24 ( 62.37) 214.15
> 
> 	All these show a nice improvement in performance and runtime
> 	stability.
> 
> The missing shmem support is a big TODO, I will try to find time to
> tackle this when the overall idea is not refused in the first place.
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..8f8daaa 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -336,11 +336,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> - * Primitive swap readahead code. We simply read an aligned block of
> - * (1 << page_cluster) entries in the swap area. This method is chosen
> - * because it doesn't cost us any seek time.  We also make sure to queue
> - * the 'original' request together with the readahead ones...
> - *
>   * This has been extended to use the NUMA policies from the mm triggering
>   * the readahead.
>   *
> @@ -349,27 +344,66 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	int nr_pages;
> -	struct page *page;
> -	unsigned long offset;
> -	unsigned long end_offset;
> -
> -	/*
> -	 * Get starting offset for readaround, and number of pages to read.
> -	 * Adjust starting address by readbehind (for NUMA interleave case)?
> -	 * No, it's very unlikely that swap layout would follow vma layout,
> -	 * more likely that neighbouring swap pages came from the same node:
> -	 * so use the same "addr" to choose the same node for each swap read.
> -	 */
> -	nr_pages = valid_swaphandles(entry, &offset);
> -	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> -		/* Ok, do the async read-ahead now */
> -		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> -						gfp_mask, vma, addr);
> +	int cluster = 1 << page_cluster;
> +	int window = cluster << PAGE_SHIFT;
> +	unsigned long start, pos, end;
> +	unsigned long pmin, pmax;
> +
> +	/* XXX: fix this for shmem */
> +	if (!vma || !vma->vm_mm)
> +		goto nora;
> +
> +	/* Physical range to read from */
> +	pmin = swp_offset(entry) & ~(cluster - 1);
> +	pmax = pmin + cluster;
> +
> +	/* Virtual range to read from */
> +	start = addr & ~(window - 1);
> +	end = start + window;
> +
> +	for (pos = start; pos < end; pos += PAGE_SIZE) {
> +		struct page *page;
> +		swp_entry_t swp;
> +		spinlock_t *ptl;
> +		pgd_t *pgd;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, pos);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		pud = pud_offset(pgd, pos);
> +		if (!pud_present(*pud))
> +			continue;
> +		pmd = pmd_offset(pud, pos);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> +		if (!is_swap_pte(*pte)) {
> +			pte_unmap_unlock(pte, ptl);
> +			continue;
> +		}
> +		swp = pte_to_swp_entry(*pte);
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
>  		if (!page)
> -			break;
> +			continue;
>  		page_cache_release(page);
>  	}
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> +nora:
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> -- 
> 1.6.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

[-- Attachment #2: run-many-x-apps.sh --]
[-- Type: application/x-sh, Size: 1707 bytes --]

[-- Attachment #3: test-mmap-exec-prot.sh --]
[-- Type: application/x-sh, Size: 203 bytes --]

next prev parent reply	other threads:[~2009-06-08  7:53 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-27 15:05 [rfc][patch] swap: virtual swap readahead Johannes Weiner
2009-05-27 15:05 ` Johannes Weiner
2009-05-27 17:32 ` Rik van Riel
2009-05-27 17:32   ` Rik van Riel
2009-05-27 21:48 ` Andrew Morton
2009-05-27 21:48   ` Andrew Morton
2009-05-28  0:14   ` Johannes Weiner
2009-05-28  0:14     ` Johannes Weiner
2009-06-01  8:05 ` Andi Kleen
2009-06-01  8:05   ` Andi Kleen
2009-06-08  7:52 ` Wu Fengguang [this message]
2009-06-08 17:58   ` Johannes Weiner
2009-06-08 17:58     ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090608075246.GA12644@localhost \
    --to=fengguang.wu@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.