linux-mm.kvack.org archive mirror
* [PATCH] swap: add a simple detector for inappropriate swapin readahead
@ 2013-04-15  4:01 Shaohua Li
  2013-04-16  3:24 ` Simon Jeons
  2013-05-10 19:59 ` Andrew Morton
  0 siblings, 2 replies; 4+ messages in thread
From: Shaohua Li @ 2013-04-15  4:01 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, hughd, khlebnikov, riel, fengguang.wu, minchan

[-- Attachment #1: Type: text/plain, Size: 8475 bytes --]

This patch improves the swap readahead algorithm. It's from Hugh, and I
changed it slightly.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin
is sequential.  This may be OK on a hard disk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages?  But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in causes significant overhead.

This patch adds very simplistic random read detection.  Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
of RAM with 1GB of swap (the hard disk tests had taken painfully long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to a hard disk, SSD for swapping to a VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches.  Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
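
(For reference, a rough user-space sketch of this kind of load follows. This
is my reconstruction of the setup described above, not Hugh's actual test
harness; the page size, touch pattern and argument handling are assumptions.)

/*
 * Touch a 1000MB anonymous mapping either sequentially (five passes) or
 * with the same number of random touches, on a machine booted with about
 * 900MB of RAM and 1GB of swap, so that the load keeps swapping.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE	(1000UL << 20)
#define PAGE		4096UL
#define NPAGES		(MAP_SIZE / PAGE)

int main(int argc, char **argv)
{
	int shared = argc > 1 && !strcmp(argv[1], "shmem");	/* Anon vs Shmem */
	int rnd = argc > 2 && !strcmp(argv[2], "rand");		/* Seq vs Rand */
	int flags = MAP_ANONYMOUS | (shared ? MAP_SHARED : MAP_PRIVATE);
	char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
	unsigned long i, touches = 5 * NPAGES;

	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	srandom(1);
	for (i = 0; i < touches; i++) {
		unsigned long pg = rnd ? random() % NPAGES : i % NPAGES;

		map[pg * PAGE] += 1;	/* fault (and later swap) in one page */
	}
	return 0;
}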

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Shaohua Li:

Tests show Vanilla is slightly better than Hugh's patch in the sequential workload.
I observed that with Hugh's patch the readahead size sometimes shrinks too fast
(from 8 to 1 immediately) in the sequential workload when there is no hit, and in
that case continuing to do readahead is actually beneficial.

I didn't prepare a sophisticated algorithm for the sequential workload because
so far we can't guarantee that sequentially accessed pages are swapped out
sequentially. So I slightly changed Hugh's heuristic - don't shrink the
readahead size too fast.
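
(To make the damping concrete, here is a small user-space sketch of the
window update. It is a simplification of the swapin_nr_pages() logic below:
the adjacent-offset check and the atomics are left out, and the per-round
hit counts are invented purely for illustration.)

#include <stdio.h>

/*
 * Simplified model of the readahead window update: widen to the next power
 * of two (at least 4) when there were hits, drop towards 1 when there were
 * none, but never shrink below half of the previous window.
 */
static unsigned int update_window(unsigned int hits, unsigned int prev,
				  unsigned int max_pages)
{
	unsigned int pages = hits + 2;

	if (pages == 2) {
		pages = 1;		/* no hits (and assume not adjacent) */
	} else {
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}
	if (pages > max_pages)
		pages = max_pages;
	if (pages < prev / 2)		/* don't shrink too fast */
		pages = prev / 2;
	return pages;
}

int main(void)
{
	unsigned int hits[] = { 6, 6, 0, 0, 0, 0 };
	unsigned int window = 1, i;

	/* window goes 8, 8, 4, 2, 1, 1 rather than dropping 8 -> 1 at once */
	for (i = 0; i < 6; i++) {
		window = update_window(hits[i], window, 8);
		printf("round %u: hits=%u window=%u\n", i, hits[i], window);
	}
	return 0;
}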

Here is my test result (unit: seconds, average of 3 runs):
	Vanilla		Hugh		New
Seq	356		370		360
Random	4525		2447		2444

The attached graph shows the swapin/swapout throughput I collected with
'vmstat 2'. The first part is a random workload (up to around 1200 on the
x-axis) and the second part is a sequential workload. Swapin and swapout
throughput are almost identical in steady state in both workloads, which is
the expected behavior; in Vanilla, by contrast, swapin is much larger than
swapout, especially in the random workload (because of wrong readahead).
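
(For reference, a minimal sketch of one way to sample per-interval
swapin/swapout page counts directly from /proc/vmstat. This is my
reconstruction; the graph data itself came from 'vmstat 2'.)

#include <stdio.h>
#include <unistd.h>

/* Read the cumulative pswpin/pswpout counters from /proc/vmstat. */
static void read_swap_counters(unsigned long *in, unsigned long *out)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	*in = *out = 0;
	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "pswpin %lu", in);
		sscanf(line, "pswpout %lu", out);
	}
	fclose(f);
}

int main(void)
{
	unsigned long in0, out0, in1, out1;

	read_swap_counters(&in0, &out0);
	for (;;) {
		sleep(2);
		read_swap_counters(&in1, &out1);
		/* pages swapped in / out during the last 2 seconds */
		printf("%lu %lu\n", in1 - in0, out1 - out0);
		fflush(stdout);
		in0 = in1;
		out0 = out1;
	}
	return 0;
}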

Original-patch-by: Shaohua Li <shli@fusionio.com>
Original-patch-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---

 include/linux/page-flags.h |    4 +-
 mm/swap_state.c            |   63 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 62 insertions(+), 5 deletions(-)

Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h	2013-04-12 15:07:05.011112763 +0800
+++ linux/include/linux/page-flags.h	2013-04-15 11:48:12.161080804 +0800
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2013-04-12 15:07:05.003112912 +0800
+++ linux/mm/swap_state.c	2013-04-15 11:48:12.165078764 +0800
@@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
 	return ret;
 }
 
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages());
@@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
+unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -414,10 +468,13 @@ struct page *swapin_readahead(swp_entry_
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }

[-- Attachment #2: swapra.png --]
[-- Type: image/png, Size: 52071 bytes --]

* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
  2013-04-15  4:01 [PATCH] swap: add a simple detector for inappropriate swapin readahead Shaohua Li
@ 2013-04-16  3:24 ` Simon Jeons
  2013-05-10 19:59 ` Andrew Morton
  1 sibling, 0 replies; 4+ messages in thread
From: Simon Jeons @ 2013-04-16  3:24 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, hughd, khlebnikov, riel, fengguang.wu, minchan

Hi Shaohua,
On 04/15/2013 12:01 PM, Shaohua Li wrote:
> ...
>
> The attached graph shows the swapin/swapout throughput I collected with
> 'vmstat 2'.

Could you tell me how you drew this graph?

> ...

* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
  2013-04-15  4:01 [PATCH] swap: add a simple detector for inappropriate swapin readahead Shaohua Li
  2013-04-16  3:24 ` Simon Jeons
@ 2013-05-10 19:59 ` Andrew Morton
  2013-05-13  5:22   ` Shaohua Li
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2013-05-10 19:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, hughd, khlebnikov, riel, fengguang.wu, minchan

On Mon, 15 Apr 2013 12:01:16 +0800 Shaohua Li <shli@kernel.org> wrote:

> This patch improves the swap readahead algorithm. It's from Hugh, and I
> changed it slightly.
> 
> ...
>

I find the new code a bit harder to follow than it needs to be.

> --- linux.orig/include/linux/page-flags.h	2013-04-12 15:07:05.011112763 +0800
> +++ linux/include/linux/page-flags.h	2013-04-15 11:48:12.161080804 +0800
> @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
>  TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
>  PAGEFLAG(MappedToDisk, mappedtodisk)
>  
> -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
>  PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> -PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
> +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
>  
>  #ifdef CONFIG_HIGHMEM
>  /*
> Index: linux/mm/swap_state.c
> ===================================================================
> --- linux.orig/mm/swap_state.c	2013-04-12 15:07:05.003112912 +0800
> +++ linux/mm/swap_state.c	2013-04-15 11:48:12.165078764 +0800
> @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
>  	return ret;
>  }
>  
> +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);

Some documentation is needed here explaining this variable's role.  If
that is understood then perhaps the reader will be able to work out why
it was initialised to "4".  Or perhaps not.

>  void show_swap_cache_info(void)
>  {
>  	printk("%lu pages in swap cache\n", total_swapcache_pages());
> @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
>  
>  	page = find_get_page(swap_address_space(entry), entry.val);
>  
> -	if (page)
> +	if (page) {
>  		INC_CACHE_INFO(find_success);
> +		if (TestClearPageReadahead(page))
> +			atomic_inc(&swapin_readahead_hits);
> +	}
>  
>  	INC_CACHE_INFO(find_total);
>  	return page;
> @@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
>  	return found_page;
>  }
>  
> +unsigned long swapin_nr_pages(unsigned long offset)

Should be static.

Needs documentation explaining what it does and why.

It would probably be clearer to make `offset' have type pgoff_t. 
That's what swp_offset() returns.  Ditto `entry_offset' in
swapin_readahead().  It's not *really* a pgoff_t, but that's what we
have and it's more informative than a bare ulong.

The documentation should describe the meaning of this function's return
value.

> +{
> +	static unsigned long prev_offset;
> +	unsigned int pages, max_pages, last_ra;
> +	static atomic_t last_readahead_pages;
> +
> +	max_pages = 1 << ACCESS_ONCE(page_cluster);
> +	if (max_pages <= 1)
> +		return 1;
> +
> +	/*
> +	 * This heuristic has been found to work well on both sequential and
> +	 * random loads, swapping to hard disk or to SSD: please don't ask
> +	 * what the "+ 2" means, it just happens to work well, that's all.

OK, I won't.

> +	 */
> +	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> +	if (pages == 2) {
> +		/*
> +		 * We can have no readahead hits to judge by: but must not get
> +		 * stuck here forever, so check for an adjacent offset instead
> +		 * (and don't even bother to check whether swap type is same).
> +		 */
> +		if (offset != prev_offset + 1 && offset != prev_offset - 1)
> +			pages = 1;
> +		prev_offset = offset;
> +	} else {
> +		unsigned int roundup = 4;

What does the "4" mean?

> +		while (roundup < pages)
> +			roundup <<= 1;

Can use something like

		roundup = ilog2(pages) + 2;

And what does the "2" mean?

> +		pages = roundup;
> +	}
> +
> +	if (pages > max_pages)
> +		pages = max_pages;

min()

> +	/* Don't shrink readahead too fast */
> +	last_ra = atomic_read(&last_readahead_pages) / 2;

Why not "3"?

> +	if (pages < last_ra)
> +		pages = last_ra;
> +	atomic_set(&last_readahead_pages, pages);
> +
> +	return pages;
> +}
> +
>  /**
>   * swapin_readahead - swap in pages in hope we need them soon
>   * @entry: swap entry of this memory
> @@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
>  	struct page *page;
> -	unsigned long offset = swp_offset(entry);
> +	unsigned long entry_offset = swp_offset(entry);
> +	unsigned long offset = entry_offset;
>  	unsigned long start_offset, end_offset;
> -	unsigned long mask = (1UL << page_cluster) - 1;
> +	unsigned long mask;
>  	struct blk_plug plug;
>  
> +	mask = swapin_nr_pages(offset) - 1;

This I in fact found to be the most obscure part of the patch. 
swapin_nr_pages() returns a count, but here we're copying it into a
variable which appears to hold a bitmask.  That's a weird thing to do
and only makes sense if it is assured (and designed) that
swapin_nr_pages() returns a power of 2.

Wanna see if we can clear all these things up please?
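
(A worked example of the count-to-mask arithmetic, with made-up numbers; it
only yields a clean aligned window because the count is a power of 2.)

#include <stdio.h>

int main(void)
{
	unsigned long pages = 8;			/* power of 2 */
	unsigned long mask = pages - 1;			/* 0x7 */
	unsigned long offset = 0x1234d;
	unsigned long start_offset = offset & ~mask;	/* 0x12348 */
	unsigned long end_offset = offset | mask;	/* 0x1234f */

	/* prints: cluster [0x12348, 0x1234f] around 0x1234d */
	printf("cluster [%#lx, %#lx] around %#lx\n",
	       start_offset, end_offset, offset);
	return 0;
}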

> +	if (!mask)
> +		goto skip;
> +
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
>  	end_offset = offset | mask;
> ...

* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
  2013-05-10 19:59 ` Andrew Morton
@ 2013-05-13  5:22   ` Shaohua Li
  0 siblings, 0 replies; 4+ messages in thread
From: Shaohua Li @ 2013-05-13  5:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, hughd, khlebnikov, riel, fengguang.wu, minchan

On Fri, May 10, 2013 at 12:59:06PM -0700, Andrew Morton wrote:
> On Mon, 15 Apr 2013 12:01:16 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > This patch improves the swap readahead algorithm. It's from Hugh, and I
> > changed it slightly.
> > 
> > ...
> >
> 
> I find the new code a bit harder to follow than it needs to be.

The patch detects random workloads to avoid false readahead. For a sequential
workload, it's known to be hard to do readahead well, because we can't
guarantee that the memory of a sequential workload lives together on disk. The
original blind readahead doesn't work very well for the sequential workload
either, so the goal is not to regress the sequential workload. There are some
magic numbers here for that. I can't prove the magic numbers are right, but
they just happen to work for simple workloads, sorry :)!

> > --- linux.orig/include/linux/page-flags.h	2013-04-12 15:07:05.011112763 +0800
> > +++ linux/include/linux/page-flags.h	2013-04-15 11:48:12.161080804 +0800
> > @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
> >  TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
> >  PAGEFLAG(MappedToDisk, mappedtodisk)
> >  
> > -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> > +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
> >  PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> > -PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
> > +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
> >  
> >  #ifdef CONFIG_HIGHMEM
> >  /*
> > Index: linux/mm/swap_state.c
> > ===================================================================
> > --- linux.orig/mm/swap_state.c	2013-04-12 15:07:05.003112912 +0800
> > +++ linux/mm/swap_state.c	2013-04-15 11:48:12.165078764 +0800
> > @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
> >  	return ret;
> >  }
> >  
> > +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
> 
> Some documentation is needed here explaining this variable's role.  If
> that is understood then perhaps the reader will be able to work out why
> it was initialised to "4".  Or perhaps not. 

OK, explained it, but the '4' is still a magic number.

> >  void show_swap_cache_info(void)
> >  {
> >  	printk("%lu pages in swap cache\n", total_swapcache_pages());
> > @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
> >  
> >  	page = find_get_page(swap_address_space(entry), entry.val);
> >  
> > -	if (page)
> > +	if (page) {
> >  		INC_CACHE_INFO(find_success);
> > +		if (TestClearPageReadahead(page))
> > +			atomic_inc(&swapin_readahead_hits);
> > +	}
> >  
> >  	INC_CACHE_INFO(find_total);
> >  	return page;
> > @@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
> >  	return found_page;
> >  }
> >  
> > +unsigned long swapin_nr_pages(unsigned long offset)
> 
> Should be static.
> 
> Needs documentation explaining what it does and why.
> 
> It would probably be clearer to make `offset' have type pgoff_t. 
> That's what swp_offset() returns.  Ditto `entry_offset' in
> swapin_readahead().  It's not *really* a pgoff_t, but that's what we
> have and it's more informative than a bare ulong.
> 
> The documentation should describe the meaning of this function's return
> value.

done.
 
> > +{
> > +	static unsigned long prev_offset;
> > +	unsigned int pages, max_pages, last_ra;
> > +	static atomic_t last_readahead_pages;
> > +
> > +	max_pages = 1 << ACCESS_ONCE(page_cluster);
> > +	if (max_pages <= 1)
> > +		return 1;
> > +
> > +	/*
> > +	 * This heuristic has been found to work well on both sequential and
> > +	 * random loads, swapping to hard disk or to SSD: please don't ask
> > +	 * what the "+ 2" means, it just happens to work well, that's all.
> 
> OK, I won't.

Thanks :)
 
> > +	 */
> > +	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> > +	if (pages == 2) {
> > +		/*
> > +		 * We can have no readahead hits to judge by: but must not get
> > +		 * stuck here forever, so check for an adjacent offset instead
> > +		 * (and don't even bother to check whether swap type is same).
> > +		 */
> > +		if (offset != prev_offset + 1 && offset != prev_offset - 1)
> > +			pages = 1;
> > +		prev_offset = offset;
> > +	} else {
> > +		unsigned int roundup = 4;
> 
> What does the "4" mean?

The same magic number.
 
> > +		while (roundup < pages)
> > +			roundup <<= 1;
> 
> Can use something like
> 
> 		roundup = ilog2(pages) + 2;
> 
> And what does the "2" mean?

ilog2 doesn't work here.
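
(For what it's worth, here is a small user-space sketch, my illustration
rather than anything from the patch, comparing what the loop computes with
the suggested ilog2(pages) + 2: the loop yields a power-of-2 page count with
a floor of 4, while ilog2(pages) + 2 yields an exponent plus two, which is
not in general a power of 2.)

#include <stdio.h>

/* Compare the patch's roundup loop with the suggested ilog2(pages) + 2. */
static unsigned int loop_roundup(unsigned int pages)
{
	unsigned int roundup = 4;

	while (roundup < pages)
		roundup <<= 1;
	return roundup;
}

static unsigned int ilog2_u(unsigned int n)	/* floor(log2(n)), n >= 1 */
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

int main(void)
{
	unsigned int pages;

	/* pages = readahead hits + 2, so it is at least 3 in this branch */
	for (pages = 3; pages <= 18; pages++)
		printf("pages=%2u  loop=%2u  ilog2(pages)+2=%2u\n",
		       pages, loop_roundup(pages), ilog2_u(pages) + 2);
	return 0;
}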
 
> > +		pages = roundup;
> > +	}
> > +
> > +	if (pages > max_pages)
> > +		pages = max_pages;
> 
> min()

ok
 
> > +	/* Don't shrink readahead too fast */
> > +	last_ra = atomic_read(&last_readahead_pages) / 2;
> 
> Why not "3"?

A magic number again.

> > +	if (pages < last_ra)
> > +		pages = last_ra;
> > +	atomic_set(&last_readahead_pages, pages);
> > +
> > +	return pages;
> > +}
> > +
> >  /**
> >   * swapin_readahead - swap in pages in hope we need them soon
> >   * @entry: swap entry of this memory
> > @@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
> >  			struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	struct page *page;
> > -	unsigned long offset = swp_offset(entry);
> > +	unsigned long entry_offset = swp_offset(entry);
> > +	unsigned long offset = entry_offset;
> >  	unsigned long start_offset, end_offset;
> > -	unsigned long mask = (1UL << page_cluster) - 1;
> > +	unsigned long mask;
> >  	struct blk_plug plug;
> >  
> > +	mask = swapin_nr_pages(offset) - 1;
> 
> This I in fact found to be the most obscure part of the patch. 
> swapin_nr_pages() returns a count, but here we're copying it into a
> variable which appears to hold a bitmask.  That's a weird thing to do
> and only makes sense if it is assured (and designed) that
> swapin_nr_pages() returns a power of 2.

Yes, it's guaranteed to return a power of 2; that's now commented in the patch.


---
 mm/swap_state.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2013-05-13 11:43:00.137490065 +0800
+++ linux/mm/swap_state.c	2013-05-13 13:18:44.573272832 +0800
@@ -63,6 +63,11 @@ unsigned long total_swapcache_pages(void
 	return ret;
 }
 
+/*
+ * Track how many swap readahead pages actually get hit afterwards. We read
+ * ahead at least swapin_readahead_hits pages. The "4" is arbitrary: if
+ * there are hits, read ahead at least 4 pages.
+ */
 static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
 
 void show_swap_cache_info(void)
@@ -378,9 +383,20 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
-unsigned long swapin_nr_pages(unsigned long offset)
+/*
+ * Return how many swap pages should be read ahead. This detects a random
+ * workload to avoid false readahead. It's hard to do readahead correctly
+ * for a sequential workload, as we can't guarantee its memory lives on
+ * disk sequentially. This still tries to read ahead as many pages as
+ * possible if swapin readahead hits (for example, any hit causes at least
+ * 4 pages of readahead; shrinking is only allowed to half of the last size).
+ *
+ * This is guaranteed to return a power of 2, as swapin_readahead() reads
+ * an aligned cluster.
+ */
+static unsigned long swapin_nr_pages(pgoff_t offset)
 {
-	static unsigned long prev_offset;
+	static pgoff_t prev_offset;
 	unsigned int pages, max_pages, last_ra;
 	static atomic_t last_readahead_pages;
 
@@ -410,13 +426,12 @@ unsigned long swapin_nr_pages(unsigned l
 		pages = roundup;
 	}
 
-	if (pages > max_pages)
-		pages = max_pages;
+	pages = min(pages, max_pages);
 
 	/* Don't shrink readahead too fast */
 	last_ra = atomic_read(&last_readahead_pages) / 2;
-	if (pages < last_ra)
-		pages = last_ra;
+	pages = max(pages, last_ra);
+
 	atomic_set(&last_readahead_pages, pages);
 
 	return pages;
@@ -445,9 +460,9 @@ struct page *swapin_readahead(swp_entry_
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long entry_offset = swp_offset(entry);
-	unsigned long offset = entry_offset;
-	unsigned long start_offset, end_offset;
+	pgoff_t entry_offset = swp_offset(entry);
+	pgoff_t offset = entry_offset;
+	pgoff_t start_offset, end_offset;
 	unsigned long mask;
 	struct blk_plug plug;
 

Thread overview: 4 messages
2013-04-15  4:01 [PATCH] swap: add a simple detector for inappropriate swapin readahead Shaohua Li
2013-04-16  3:24 ` Simon Jeons
2013-05-10 19:59 ` Andrew Morton
2013-05-13  5:22   ` Shaohua Li
