linux-mm.kvack.org archive mirror
* [patch v2]swap: add a simple random read swapin detection
@ 2012-08-27  4:00 Shaohua Li
  2012-08-27 12:57 ` Rik van Riel
  2012-08-27 14:52 ` Konstantin Khlebnikov
  0 siblings, 2 replies; 31+ messages in thread
From: Shaohua Li @ 2012-08-27  4:00 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, riel, fengguang.wu, minchan

The swapin readahead does a blind readahead regardless of whether the swapin
is sequential. This is ok for a hard disk even with random reads, because big
reads have no penalty on a hard disk, and if the readahead pages are garbage,
they can be reclaimed quickly. But for an SSD, a big read is more expensive
than a small read, so if the readahead pages are garbage, such readahead is
pure overhead.

This patch adds a simple random read detection, similar to what file mmap
readahead does. If random reads are detected, swapin readahead will be
skipped. This improves performance a lot for a swap workload with random IO
on a fast SSD.

I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
			runtime change with the patch
randwrite harddisk	-38.7%
seqwrite harddisk	-1.1%
randwrite SSD		-46.9%
seqwrite SSD		+0.3%

For both hard disk and SSD, the randwrite swap workload run time is reduced
significantly. The sequential write swap workload hasn't changed.

Interestingly, the randwrite hard disk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tightens memory pressure and causes more swapout/swapin.
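
In code terms, the heuristic boils down to the three small helpers added to
mm/internal.h (condensed here from the diff below, with the CONFIG_SWAP=n
stubs left out):

#define SWAPRA_MISS  (100)

/* do_swap_page(): the faulting page was already in the swap cache */
static inline void swap_cache_hit(struct vm_area_struct *vma)
{
	atomic_dec_if_positive(&vma->swapra_miss);
}

/* swapin_readahead(): we had to go to disk, so count a miss (capped) */
static inline void swap_cache_miss(struct vm_area_struct *vma)
{
	if (atomic_read(&vma->swapra_miss) < SWAPRA_MISS * 10)
		atomic_inc(&vma->swapra_miss);
}

/* too many recent misses: the access pattern looks random, skip readahead */
static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
{
	return atomic_read(&vma->swapra_miss) > SWAPRA_MISS;
}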

This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch

V1->V2:
1. Move the swap readahead accounting to separate functions as suggested by Riel.
2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/mm_types.h |    3 +++
 mm/internal.h            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c              |    3 ++-
 mm/swap_state.c          |    8 ++++++++
 4 files changed, 57 insertions(+), 1 deletion(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2012-08-22 11:44:53.057913107 +0800
+++ linux/mm/swap_state.c	2012-08-23 17:27:28.560013412 +0800
@@ -20,6 +20,7 @@
 #include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -379,6 +380,12 @@ struct page *swapin_readahead(swp_entry_
 	unsigned long mask = (1UL << page_cluster) - 1;
 	struct blk_plug plug;
 
+	if (vma) {
+		swap_cache_miss(vma);
+		if (swap_cache_skip_readahead(vma))
+			goto skip;
+	}
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -397,5 +404,6 @@ struct page *swapin_readahead(swp_entry_
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h	2012-08-22 11:44:53.077912855 +0800
+++ linux/include/linux/mm_types.h	2012-08-24 13:07:11.798576941 +0800
@@ -279,6 +279,9 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_SWAP
+	atomic_t swapra_miss;
+#endif
 };
 
 struct core_thread {
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2012-08-22 11:44:53.065913005 +0800
+++ linux/mm/memory.c	2012-08-23 17:27:23.424074216 +0800
@@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		goto out_release;
-	}
+	} else if (!(flags & FAULT_FLAG_TRIED))
+		swap_cache_hit(vma);
 
 	locked = lock_page_or_retry(page, mm, flags);
 
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h	2012-08-22 09:51:39.295322268 +0800
+++ linux/mm/internal.h	2012-08-27 11:51:27.447915373 +0800
@@ -356,3 +356,47 @@ extern unsigned long vm_mmap_pgoff(struc
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+
+/*
+ * Unnecessary readahead harms performance. 1. For SSD, a big read is more
+ * expensive than a small read, so extra unnecessary reads are pure overhead.
+ * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
+ * allocates extra memory, which further tightens memory pressure and causes
+ * more swapout/swapin.
+ * This adds a simple swap random access detection. On a swap page fault, if
+ * the page is found in the swap cache, the vma's miss counter is decreased;
+ * otherwise we must do a sync swapin and the counter is increased. Swapin
+ * only does readahead if the counter is below the threshold.
+ */
+#ifdef CONFIG_SWAP
+#define SWAPRA_MISS  (100)
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+	atomic_dec_if_positive(&vma->swapra_miss);
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+	if (atomic_read(&vma->swapra_miss) < SWAPRA_MISS * 10)
+		atomic_inc(&vma->swapra_miss);
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	return atomic_read(&vma->swapra_miss) > SWAPRA_MISS;
+}
+#else
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+#endif


* Re: [patch v2]swap: add a simple random read swapin detection
  2012-08-27  4:00 [patch v2]swap: add a simple random read swapin detection Shaohua Li
@ 2012-08-27 12:57 ` Rik van Riel
  2012-08-27 14:52 ` Konstantin Khlebnikov
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2012-08-27 12:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, fengguang.wu, minchan

On 08/27/2012 12:00 AM, Shaohua Li wrote:
> The swapin readahead does a blind readahead regardless of whether the swapin
> is sequential. This is ok for a hard disk even with random reads, because big
> reads have no penalty on a hard disk, and if the readahead pages are garbage,
> they can be reclaimed quickly. But for an SSD, a big read is more expensive
> than a small read, so if the readahead pages are garbage, such readahead is
> pure overhead.
>
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead will be
> skipped. This improves performance a lot for a swap workload with random IO
> on a fast SSD.
>
> I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
> 			runtime change with the patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%
>
> For both hard disk and SSD, the randwrite swap workload run time is reduced
> significantly. The sequential write swap workload hasn't changed.

Very nice results!

> Signed-off-by: Shaohua Li <shli@fusionio.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [patch v2]swap: add a simple random read swapin detection
  2012-08-27  4:00 [patch v2]swap: add a simple random read swapin detection Shaohua Li
  2012-08-27 12:57 ` Rik van Riel
@ 2012-08-27 14:52 ` Konstantin Khlebnikov
  2012-08-30 10:36   ` [patch v3]swap: " Shaohua Li
  1 sibling, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-08-27 14:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm@kvack.org, akpm@linux-foundation.org, riel@redhat.com,
	fengguang.wu@intel.com, minchan@kernel.org

Shaohua Li wrote:
> The swapin readahead does a blind readahead regardless of whether the swapin
> is sequential. This is ok for a hard disk even with random reads, because big
> reads have no penalty on a hard disk, and if the readahead pages are garbage,
> they can be reclaimed quickly. But for an SSD, a big read is more expensive
> than a small read, so if the readahead pages are garbage, such readahead is
> pure overhead.
>
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead will be
> skipped. This improves performance a lot for a swap workload with random IO
> on a fast SSD.
>
> I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
> 			runtime change with the patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%

Very nice!

>
> For both hard disk and SSD, the randwrite swap workload run time is reduced
> significantly. The sequential write swap workload hasn't changed.
>
> Interestingly, the randwrite hard disk test is improved too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure and causes more swapout/swapin.
>
> This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch
>
> V1->V2:
> 1. Move the swap readahead accounting to separate functions as suggested by Riel.
> 2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.
>
> Signed-off-by: Shaohua Li<shli@fusionio.com>
> ---
>   include/linux/mm_types.h |    3 +++
>   mm/internal.h            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>   mm/memory.c              |    3 ++-
>   mm/swap_state.c          |    8 ++++++++
>   4 files changed, 57 insertions(+), 1 deletion(-)
>

> --- linux.orig/include/linux/mm_types.h	2012-08-22 11:44:53.077912855 +0800
> +++ linux/include/linux/mm_types.h	2012-08-24 13:07:11.798576941 +0800
> @@ -279,6 +279,9 @@ struct vm_area_struct {
>   #ifdef CONFIG_NUMA
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>   #endif
> +#ifdef CONFIG_SWAP
> +	atomic_t swapra_miss;
> +#endif

You can place this atomic in vma->anon_vma; it has a perfect 4-byte hole
right after the "refcount" field. vma->anon_vma already exists, since this
vma already contains anon pages.
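
For illustration, roughly how that works out on 64-bit (only the members
visible in the nearby diff and in anon_vma_ctor() are shown, so treat the
exact layout as an approximation):

struct anon_vma {
	/* ... */
	struct mutex mutex;		/* pointer-aligned */
	atomic_t refcount;		/* 4 bytes */
	atomic_t swapra_miss;		/* would fill the 4-byte padding hole */
	struct list_head head;		/* pointer-aligned again */
};

Since atomic_t is 4 bytes and the following member needs 8-byte alignment,
the compiler currently pads refcount with 4 unused bytes, so the counter
fits there at no extra memory cost.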

>   };
>
>   struct core_thread {


* [patch v3]swap: add a simple random read swapin detection
  2012-08-27 14:52 ` Konstantin Khlebnikov
@ 2012-08-30 10:36   ` Shaohua Li
  2012-08-30 16:03     ` Rik van Riel
  2012-08-30 17:42     ` Minchan Kim
  0 siblings, 2 replies; 31+ messages in thread
From: Shaohua Li @ 2012-08-30 10:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov, akpm
  Cc: linux-mm@kvack.org, riel@redhat.com, fengguang.wu@intel.com,
	minchan@kernel.org

On Mon, Aug 27, 2012 at 06:52:07PM +0400, Konstantin Khlebnikov wrote:
> >--- linux.orig/include/linux/mm_types.h	2012-08-22 11:44:53.077912855 +0800
> >+++ linux/include/linux/mm_types.h	2012-08-24 13:07:11.798576941 +0800
> >@@ -279,6 +279,9 @@ struct vm_area_struct {
> >  #ifdef CONFIG_NUMA
> >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> >  #endif
> >+#ifdef CONFIG_SWAP
> >+	atomic_t swapra_miss;
> >+#endif
> 
> You can place this atomic in vma->anon_vma; it has a perfect 4-byte
> hole right after the "refcount" field. vma->anon_vma already exists,
> since this vma already contains anon pages.

Makes sense. vma->anon_vma could be NULL (shmem), but in the shmem
case, vma could be NULL too, so maybe just ignore it.


Subject: swap: add a simple random read swapin detection

The swapin readahead does a blind readahead regardless of whether the swapin
is sequential. This is ok for a hard disk even with random reads, because big
reads have no penalty on a hard disk, and if the readahead pages are garbage,
they can be reclaimed quickly. But for an SSD, a big read is more expensive
than a small read, so if the readahead pages are garbage, such readahead is
pure overhead.

This patch adds a simple random read detection, similar to what file mmap
readahead does. If random reads are detected, swapin readahead will be
skipped. This improves performance a lot for a swap workload with random IO
on a fast SSD.

I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
			runtime change with the patch
randwrite harddisk	-38.7%
seqwrite harddisk	-1.1%
randwrite SSD		-46.9%
seqwrite SSD		+0.3%

For both hard disk and SSD, the randwrite swap workload run time is reduced
significantly. The sequential write swap workload hasn't changed.

Interestingly, the randwrite hard disk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tightens memory pressure and causes more swapout/swapin.

This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch

V2->V3:
move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 

V1->V2:
1. Move the swap readahead accounting to separate functions as suggested by Riel.
2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/rmap.h |    3 +++
 mm/internal.h        |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c          |    3 ++-
 mm/shmem.c           |    1 +
 mm/swap_state.c      |    6 ++++++
 5 files changed, 62 insertions(+), 1 deletion(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2012-08-29 16:13:00.912112140 +0800
+++ linux/mm/swap_state.c	2012-08-30 18:28:24.678315187 +0800
@@ -20,6 +20,7 @@
 #include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -379,6 +380,10 @@ struct page *swapin_readahead(swp_entry_
 	unsigned long mask = (1UL << page_cluster) - 1;
 	struct blk_plug plug;
 
+	swap_cache_miss(vma);
+	if (swap_cache_skip_readahead(vma))
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -397,5 +402,6 @@ struct page *swapin_readahead(swp_entry_
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2012-08-29 16:13:00.920112040 +0800
+++ linux/mm/memory.c	2012-08-30 13:32:05.425830660 +0800
@@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		goto out_release;
-	}
+	} else if (!(flags & FAULT_FLAG_TRIED))
+		swap_cache_hit(vma);
 
 	locked = lock_page_or_retry(page, mm, flags);
 
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h	2012-08-29 16:13:00.932111888 +0800
+++ linux/mm/internal.h	2012-08-30 18:28:03.698578951 +0800
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/rmap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -356,3 +357,52 @@ extern unsigned long vm_mmap_pgoff(struc
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+
+/*
+ * Unnecessary readahead harms performance. 1. For SSD, a big read is more
+ * expensive than a small read, so extra unnecessary reads are pure overhead.
+ * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
+ * allocates extra memory, which further tightens memory pressure and causes
+ * more swapout/swapin.
+ * This adds a simple swap random access detection. On a swap page fault, if
+ * the page is found in the swap cache, the vma's miss counter is decreased;
+ * otherwise we must do a sync swapin and the counter is increased. Swapin
+ * only does readahead if the counter is below the threshold.
+ */
+#ifdef CONFIG_SWAP
+#define SWAPRA_MISS  (100)
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+	if (vma && vma->anon_vma)
+		atomic_dec_if_positive(&vma->anon_vma->swapra_miss);
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return;
+	if (atomic_read(&vma->anon_vma->swapra_miss) < SWAPRA_MISS * 10)
+		atomic_inc(&vma->anon_vma->swapra_miss);
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return 0;
+	return atomic_read(&vma->anon_vma->swapra_miss) > SWAPRA_MISS;
+}
+#else
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+#endif
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2012-06-01 10:10:31.686394463 +0800
+++ linux/include/linux/rmap.h	2012-08-30 18:10:12.256048781 +0800
@@ -35,6 +35,9 @@ struct anon_vma {
 	 * anon_vma if they are the last user on release
 	 */
 	atomic_t refcount;
+#ifdef CONFIG_SWAP
+	atomic_t swapra_miss;
+#endif
 
 	/*
 	 * NOTE: the LSB of the head.next is set by
Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
+++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
@@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
 	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = spol;
+	pvma.anon_vma = NULL;
 	return swapin_readahead(swap, gfp, &pvma, 0);
 }
 


* Re: [patch v3]swap: add a simple random read swapin detection
  2012-08-30 10:36   ` [patch v3]swap: " Shaohua Li
@ 2012-08-30 16:03     ` Rik van Riel
  2012-08-30 17:42     ` Minchan Kim
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2012-08-30 16:03 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Konstantin Khlebnikov, akpm, linux-mm@kvack.org,
	fengguang.wu@intel.com, minchan@kernel.org

On 08/30/2012 06:36 AM, Shaohua Li wrote:

> Interestingly, the randwrite hard disk test is improved too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure and causes more swapout/swapin.
>
> This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch
>
> V2->V3:
> move swapra_miss to 'struct anon_vma' as suggested by Konstantin.
>
> V1->V2:
> 1. Move the swap readahead accounting to separate functions as suggested by Riel.
> 2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [patch v3]swap: add a simple random read swapin detection
  2012-08-30 10:36   ` [patch v3]swap: " Shaohua Li
  2012-08-30 16:03     ` Rik van Riel
@ 2012-08-30 17:42     ` Minchan Kim
  2012-09-03  7:21       ` [patch v4]swap: " Shaohua Li
  1 sibling, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2012-08-30 17:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Konstantin Khlebnikov, akpm, linux-mm@kvack.org, riel@redhat.com,
	fengguang.wu@intel.com, minchan@kernel.org

On Thu, Aug 30, 2012 at 06:36:12PM +0800, Shaohua Li wrote:
> On Mon, Aug 27, 2012 at 06:52:07PM +0400, Konstantin Khlebnikov wrote:
> > >--- linux.orig/include/linux/mm_types.h	2012-08-22 11:44:53.077912855 +0800
> > >+++ linux/include/linux/mm_types.h	2012-08-24 13:07:11.798576941 +0800
> > >@@ -279,6 +279,9 @@ struct vm_area_struct {
> > >  #ifdef CONFIG_NUMA
> > >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> > >  #endif
> > >+#ifdef CONFIG_SWAP
> > >+	atomic_t swapra_miss;
> > >+#endif
> > 
> > You can place this atomic in vma->anon_vma; it has a perfect 4-byte
> > hole right after the "refcount" field. vma->anon_vma already exists,
> > since this vma already contains anon pages.
> 
> Makes sense. vma->anon_vma could be NULL (shmem), but in the shmem
> case, vma could be NULL too, so maybe just ignore it.
> 
> 
> Subject: swap: add a simple random read swapin detection
> 
> The swapin readahead does a blind readahead regardless of whether the swapin
> is sequential. This is ok for a hard disk even with random reads, because big
> reads have no penalty on a hard disk, and if the readahead pages are garbage,
> they can be reclaimed quickly. But for an SSD, a big read is more expensive
> than a small read, so if the readahead pages are garbage, such readahead is
> pure overhead.
> 
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead will be
> skipped. This improves performance a lot for a swap workload with random IO
> on a fast SSD.
> 
> I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
> 			runtime change with the patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%
> 
> For both hard disk and SSD, the randwrite swap workload run time is reduced
> significantly. The sequential write swap workload hasn't changed.
> 
> Interestingly, the randwrite hard disk test is improved too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure and causes more swapout/swapin.
> 
> This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch
> 
> V2->V3:
> move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 
> 
> V1->V2:
> 1. Move the swap readahead accounting to separate functions as suggested by Riel.
> 2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  include/linux/rmap.h |    3 +++
>  mm/internal.h        |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c          |    3 ++-
>  mm/shmem.c           |    1 +
>  mm/swap_state.c      |    6 ++++++
>  5 files changed, 62 insertions(+), 1 deletion(-)
> 
> Index: linux/mm/swap_state.c
> ===================================================================
> --- linux.orig/mm/swap_state.c	2012-08-29 16:13:00.912112140 +0800
> +++ linux/mm/swap_state.c	2012-08-30 18:28:24.678315187 +0800
> @@ -20,6 +20,7 @@
>  #include <linux/page_cgroup.h>
>  
>  #include <asm/pgtable.h>
> +#include "internal.h"
>  
>  /*
>   * swapper_space is a fiction, retained to simplify the path through
> @@ -379,6 +380,10 @@ struct page *swapin_readahead(swp_entry_
>  	unsigned long mask = (1UL << page_cluster) - 1;
>  	struct blk_plug plug;
>  
> +	swap_cache_miss(vma);
> +	if (swap_cache_skip_readahead(vma))
> +		goto skip;
> +
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
>  	end_offset = offset | mask;
> @@ -397,5 +402,6 @@ struct page *swapin_readahead(swp_entry_
>  	blk_finish_plug(&plug);
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> +skip:
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2012-08-29 16:13:00.920112040 +0800
> +++ linux/mm/memory.c	2012-08-30 13:32:05.425830660 +0800
> @@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
>  		ret = VM_FAULT_HWPOISON;
>  		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>  		goto out_release;
> -	}
> +	} else if (!(flags & FAULT_FLAG_TRIED))
> +		swap_cache_hit(vma);
>  
>  	locked = lock_page_or_retry(page, mm, flags);
>  
> Index: linux/mm/internal.h
> ===================================================================
> --- linux.orig/mm/internal.h	2012-08-29 16:13:00.932111888 +0800
> +++ linux/mm/internal.h	2012-08-30 18:28:03.698578951 +0800
> @@ -12,6 +12,7 @@
>  #define __MM_INTERNAL_H
>  
>  #include <linux/mm.h>
> +#include <linux/rmap.h>
>  
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
> @@ -356,3 +357,52 @@ extern unsigned long vm_mmap_pgoff(struc
>          unsigned long, unsigned long);
>  
>  extern void set_pageblock_order(void);
> +
> +/*
> + * Unnecessary readahead harms performance. 1. For SSD, a big read is more
> + * expensive than a small read, so extra unnecessary reads are pure overhead.
> + * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
> + * allocates extra memory, which further tightens memory pressure and causes
> + * more swapout/swapin.
> + * This adds a simple swap random access detection. On a swap page fault, if
> + * the page is found in the swap cache, the vma's miss counter is decreased;
> + * otherwise we must do a sync swapin and the counter is increased. Swapin
> + * only does readahead if the counter is below the threshold.
> + */
> +#ifdef CONFIG_SWAP
> +#define SWAPRA_MISS  (100)
> +static inline void swap_cache_hit(struct vm_area_struct *vma)
> +{
> +	if (vma && vma->anon_vma)
> +		atomic_dec_if_positive(&vma->anon_vma->swapra_miss);
> +}
> +
> +static inline void swap_cache_miss(struct vm_area_struct *vma)
> +{
> +	if (!vma || !vma->anon_vma)
> +		return;
> +	if (atomic_read(&vma->anon_vma->swapra_miss) < SWAPRA_MISS * 10)

You could use a meaningful macro instead of the magic value 10.

/*
 * If swapra_miss is higher than SWAPRA_SKIP_THRESHOLD, swapin readahead
 * will be skipped.
 * The swapra_miss count can be increased up to SWAPRA_MISS_MAX_COUNT.
 * If cache hits bring the swapra_miss count back below
 * SWAPRA_SKIP_THRESHOLD, swapin readahead can start again.
 */
#define SWAPRA_SKIP_THRESHOLD 100
#define SWAPRA_MISS_MAX_COUNT		(SWAPRA_SKIP_THRESHOLD * 10)

> +		atomic_inc(&vma->anon_vma->swapra_miss);
> +}
> +
> +static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
> +{
> +	if (!vma || !vma->anon_vma)
> +		return 0;
> +	return atomic_read(&vma->anon_vma->swapra_miss) > SWAPRA_MISS;
> +}
> +#else
> +static inline void swap_cache_hit(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void swap_cache_miss(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +
> +#endif
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h	2012-06-01 10:10:31.686394463 +0800
> +++ linux/include/linux/rmap.h	2012-08-30 18:10:12.256048781 +0800
> @@ -35,6 +35,9 @@ struct anon_vma {
>  	 * anon_vma if they are the last user on release
>  	 */
>  	atomic_t refcount;
> +#ifdef CONFIG_SWAP
> +	atomic_t swapra_miss;
> +#endif
>  
>  	/*
>  	 * NOTE: the LSB of the head.next is set by
> Index: linux/mm/shmem.c
> ===================================================================
> --- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
> +++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
> @@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
>  	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
>  	pvma.vm_ops = NULL;
>  	pvma.vm_policy = spol;
> +	pvma.anon_vma = NULL;

So, shmem still always does readahead blindly?

>  	return swapin_readahead(swap, gfp, &pvma, 0);
>  }
>  


* [patch v4]swap: add a simple random read swapin detection
  2012-08-30 17:42     ` Minchan Kim
@ 2012-09-03  7:21       ` Shaohua Li
  2012-09-03  8:32         ` Minchan Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-09-03  7:21 UTC (permalink / raw)
  To: Minchan Kim, akpm
  Cc: Konstantin Khlebnikov, linux-mm@kvack.org, riel@redhat.com,
	fengguang.wu@intel.com

On Fri, Aug 31, 2012 at 02:42:23AM +0900, Minchan Kim wrote:
>  */
> #define SWAPRA_SKIP_THRESHOLD 100
> #define SWAPRA_MISS_MAX_COUNT		(SWAPRA_SKIP_THRESHOLD * 10)

Ok, updated to use macro.
 
> > ===================================================================
> > --- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
> > +++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
> > @@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
> >  	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
> >  	pvma.vm_ops = NULL;
> >  	pvma.vm_policy = spol;
> > +	pvma.anon_vma = NULL;
> 
> So, shmem still always does readahead blindly?

Yes, I have no idea how to prevent the readahead for shmem so far. It could be
a tmpfs file read, so there is never a vma.

Thanks,
Shaohua


Subject: swap: add a simple random read swapin detection

The swapin readahead does a blind readahead regardless of whether the swapin
is sequential. This is ok for a hard disk even with random reads, because big
reads have no penalty on a hard disk, and if the readahead pages are garbage,
they can be reclaimed quickly. But for an SSD, a big read is more expensive
than a small read, so if the readahead pages are garbage, such readahead is
pure overhead.

This patch adds a simple random read detection, similar to what file mmap
readahead does. If random reads are detected, swapin readahead will be
skipped. This improves performance a lot for a swap workload with random IO
on a fast SSD.

I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
			runtime change with the patch
randwrite harddisk	-38.7%
seqwrite harddisk	-1.1%
randwrite SSD		-46.9%
seqwrite SSD		+0.3%

For both hard disk and SSD, the randwrite swap workload run time is reduced
significantly. The sequential write swap workload hasn't changed.

Interestingly, the randwrite hard disk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tightens memory pressure and causes more swapout/swapin.

This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch

V2->V3:
move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 

V1->V2:
1. Move the swap readahead accounting to separate functions as suggested by Riel.
2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |    3 ++
 mm/internal.h        |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c          |    3 +-
 mm/shmem.c           |    1 
 mm/swap_state.c      |    6 +++++
 5 files changed, 64 insertions(+), 1 deletion(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2012-08-29 16:13:00.912112140 +0800
+++ linux/mm/swap_state.c	2012-08-30 18:28:24.678315187 +0800
@@ -20,6 +20,7 @@
 #include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -379,6 +380,10 @@ struct page *swapin_readahead(swp_entry_
 	unsigned long mask = (1UL << page_cluster) - 1;
 	struct blk_plug plug;
 
+	swap_cache_miss(vma);
+	if (swap_cache_skip_readahead(vma))
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -397,5 +402,6 @@ struct page *swapin_readahead(swp_entry_
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2012-08-29 16:13:00.920112040 +0800
+++ linux/mm/memory.c	2012-08-30 13:32:05.425830660 +0800
@@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		goto out_release;
-	}
+	} else if (!(flags & FAULT_FLAG_TRIED))
+		swap_cache_hit(vma);
 
 	locked = lock_page_or_retry(page, mm, flags);
 
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h	2012-08-29 16:13:00.932111888 +0800
+++ linux/mm/internal.h	2012-09-03 15:16:30.566299444 +0800
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/rmap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -356,3 +357,54 @@ extern unsigned long vm_mmap_pgoff(struc
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+
+/*
+ * Unnecessary readahead harms performance. 1. For SSD, a big read is more
+ * expensive than a small read, so extra unnecessary reads are pure overhead.
+ * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
+ * allocates extra memory, which further tightens memory pressure and causes
+ * more swapout/swapin.
+ * This adds a simple swap random access detection. On a swap page fault, if
+ * the page is found in the swap cache, the vma's miss counter is decreased;
+ * otherwise we must do a sync swapin and the counter is increased. Swapin
+ * only does readahead if the counter is below the threshold.
+ */
+#ifdef CONFIG_SWAP
+#define SWAPRA_MISS_THRESHOLD  (100)
+#define SWAPRA_MAX_MISS ((SWAPRA_MISS_THRESHOLD) * 10)
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+	if (vma && vma->anon_vma)
+		atomic_dec_if_positive(&vma->anon_vma->swapra_miss);
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return;
+	if (atomic_read(&vma->anon_vma->swapra_miss) < SWAPRA_MAX_MISS)
+		atomic_inc(&vma->anon_vma->swapra_miss);
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return 0;
+	return atomic_read(&vma->anon_vma->swapra_miss) >
+		SWAPRA_MISS_THRESHOLD;
+}
+#else
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+#endif
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2012-06-01 10:10:31.686394463 +0800
+++ linux/include/linux/rmap.h	2012-08-30 18:10:12.256048781 +0800
@@ -35,6 +35,9 @@ struct anon_vma {
 	 * anon_vma if they are the last user on release
 	 */
 	atomic_t refcount;
+#ifdef CONFIG_SWAP
+	atomic_t swapra_miss;
+#endif
 
 	/*
 	 * NOTE: the LSB of the head.next is set by
Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
+++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
@@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
 	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = spol;
+	pvma.anon_vma = NULL;
 	return swapin_readahead(swap, gfp, &pvma, 0);
 }
 


* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03  7:21       ` [patch v4]swap: " Shaohua Li
@ 2012-09-03  8:32         ` Minchan Kim
  2012-09-03 11:46           ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2012-09-03  8:32 UTC (permalink / raw)
  To: Shaohua Li
  Cc: akpm, Konstantin Khlebnikov, linux-mm@kvack.org, riel@redhat.com,
	fengguang.wu@intel.com

On Mon, Sep 03, 2012 at 03:21:37PM +0800, Shaohua Li wrote:
> On Fri, Aug 31, 2012 at 02:42:23AM +0900, Minchan Kim wrote:
> >  */
> > #define SWAPRA_SKIP_THRESHOLD 100
> > #define SWAPRA_MISS_MAX_COUNT		(SWAPRA_SKIP_THRESHOLD * 10)
> 
> Ok, updated to use macro.
>  
> > > ===================================================================
> > > --- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
> > > +++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
> > > @@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
> > >  	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
> > >  	pvma.vm_ops = NULL;
> > >  	pvma.vm_policy = spol;
> > > +	pvma.anon_vma = NULL;
> > 
> > So, shmem still always does readahead blindly?
> 
> Yes, I have no idea how to prevent the readahead for shmem so far. It could
> be a tmpfs file read, so there is never a vma.

It could be a TODO, or at least something we should note to document that
swap readahead behaves differently for anon pages and tmpfs pages.
So let's add a comment.

> 
> Thanks,
> Shaohua
> 
> 
> Subject: swap: add a simple random read swapin detection
> 
> The swapin readahead does a blind readahead regardless of whether the swapin
> is sequential. This is ok for a hard disk even with random reads, because big
> reads have no penalty on a hard disk, and if the readahead pages are garbage,
> they can be reclaimed quickly. But for an SSD, a big read is more expensive
> than a small read, so if the readahead pages are garbage, such readahead is
> pure overhead.
> 
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead will be
> skipped. This improves performance a lot for a swap workload with random IO
> on a fast SSD.
> 
> I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
> 			runtime change with the patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%
> 
> For both hard disk and SSD, the randwrite swap workload run time is reduced
> significantly. The sequential write swap workload hasn't changed.
> 
> Interestingly, the randwrite hard disk test is improved too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure and causes more swapout/swapin.
> 
> This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch
> 
> V2->V3:
> move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 
> 
> V1->V2:
> 1. Move the swap readahead accounting to separate functions as suggested by Riel.
> 2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/rmap.h |    3 ++
>  mm/internal.h        |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c          |    3 +-
>  mm/shmem.c           |    1 
>  mm/swap_state.c      |    6 +++++
>  5 files changed, 64 insertions(+), 1 deletion(-)
> 
> Index: linux/mm/swap_state.c
> ===================================================================
> --- linux.orig/mm/swap_state.c	2012-08-29 16:13:00.912112140 +0800
> +++ linux/mm/swap_state.c	2012-08-30 18:28:24.678315187 +0800
> @@ -20,6 +20,7 @@
>  #include <linux/page_cgroup.h>
>  
>  #include <asm/pgtable.h>
> +#include "internal.h"
>  
>  /*
>   * swapper_space is a fiction, retained to simplify the path through
> @@ -379,6 +380,10 @@ struct page *swapin_readahead(swp_entry_
>  	unsigned long mask = (1UL << page_cluster) - 1;
>  	struct blk_plug plug;
>  
> +	swap_cache_miss(vma);
> +	if (swap_cache_skip_readahead(vma))
> +		goto skip;
> +
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
>  	end_offset = offset | mask;
> @@ -397,5 +402,6 @@ struct page *swapin_readahead(swp_entry_
>  	blk_finish_plug(&plug);
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> +skip:
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2012-08-29 16:13:00.920112040 +0800
> +++ linux/mm/memory.c	2012-08-30 13:32:05.425830660 +0800
> @@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
>  		ret = VM_FAULT_HWPOISON;
>  		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>  		goto out_release;
> -	}
> +	} else if (!(flags & FAULT_FLAG_TRIED))
> +		swap_cache_hit(vma);
>  
>  	locked = lock_page_or_retry(page, mm, flags);
>  
> Index: linux/mm/internal.h
> ===================================================================
> --- linux.orig/mm/internal.h	2012-08-29 16:13:00.932111888 +0800
> +++ linux/mm/internal.h	2012-09-03 15:16:30.566299444 +0800
> @@ -12,6 +12,7 @@
>  #define __MM_INTERNAL_H
>  
>  #include <linux/mm.h>
> +#include <linux/rmap.h>
>  
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
> @@ -356,3 +357,54 @@ extern unsigned long vm_mmap_pgoff(struc
>          unsigned long, unsigned long);
>  
>  extern void set_pageblock_order(void);
> +
> +/*
> + * Unnecessary readahead harms performance. 1. For SSD, a big read is more
> + * expensive than a small read, so extra unnecessary reads are pure overhead.
> + * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
> + * allocates extra memory, which further tightens memory pressure and causes
> + * more swapout/swapin.
> + * This adds a simple swap random access detection. On a swap page fault, if
> + * the page is found in the swap cache, the vma's miss counter is decreased;
> + * otherwise we must do a sync swapin and the counter is increased. Swapin
> + * only does readahead if the counter is below the threshold.
> + */
> +#ifdef CONFIG_SWAP
> +#define SWAPRA_MISS_THRESHOLD  (100)
> +#define SWAPRA_MAX_MISS ((SWAPRA_MISS_THRESHOLD) * 10)
> +static inline void swap_cache_hit(struct vm_area_struct *vma)
> +{
> +	if (vma && vma->anon_vma)
> +		atomic_dec_if_positive(&vma->anon_vma->swapra_miss);
> +}
> +
> +static inline void swap_cache_miss(struct vm_area_struct *vma)
> +{
> +	if (!vma || !vma->anon_vma)
> +		return;
> +	if (atomic_read(&vma->anon_vma->swapra_miss) < SWAPRA_MAX_MISS)
> +		atomic_inc(&vma->anon_vma->swapra_miss);
> +}
> +
> +static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
> +{
> +	if (!vma || !vma->anon_vma)
> +		return 0;
> +	return atomic_read(&vma->anon_vma->swapra_miss) >
> +		SWAPRA_MISS_THRESHOLD;
> +}
> +#else
> +static inline void swap_cache_hit(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void swap_cache_miss(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +
> +#endif
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h	2012-06-01 10:10:31.686394463 +0800
> +++ linux/include/linux/rmap.h	2012-08-30 18:10:12.256048781 +0800
> @@ -35,6 +35,9 @@ struct anon_vma {
>  	 * anon_vma if they are the last user on release
>  	 */
>  	atomic_t refcount;
> +#ifdef CONFIG_SWAP
> +	atomic_t swapra_miss;
> +#endif

Don't we need initialization?

diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..c0f3221 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -416,6 +416,9 @@ static void anon_vma_ctor(void *data)
 
        mutex_init(&anon_vma->mutex);
        atomic_set(&anon_vma->refcount, 0);
+#ifdef CONFIG_SWAP
+       atomic_set(&anon_vma->swapra_miss, 0);
+#endif
        INIT_LIST_HEAD(&anon_vma->head);
 }
 

-- 
Kind regards,
Minchan Kim


* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03  8:32         ` Minchan Kim
@ 2012-09-03 11:46           ` Shaohua Li
  2012-09-03 19:02             ` Konstantin Khlebnikov
  2012-09-03 22:03             ` [patch v4]swap: add a simple random read swapin detection Minchan Kim
  0 siblings, 2 replies; 31+ messages in thread
From: Shaohua Li @ 2012-09-03 11:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, Konstantin Khlebnikov, linux-mm@kvack.org, riel@redhat.com,
	fengguang.wu@intel.com

On Mon, Sep 03, 2012 at 05:32:45PM +0900, Minchan Kim wrote:
> Don't we need initialization?
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..c0f3221 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -416,6 +416,9 @@ static void anon_vma_ctor(void *data)
>  
>         mutex_init(&anon_vma->mutex);
>         atomic_set(&anon_vma->refcount, 0);
> +#ifdef CONFIG_SWAP
> +       atomic_set(&anon_vma->swapra_miss, 0);
> +#endif
>         INIT_LIST_HEAD(&anon_vma->head);
>  }

Sorry about this silly problem. I'm wondering why I didn't notice it; maybe
because I only tested random swap after moving swapra_miss to anon_vma.


Subject: swap: add a simple random read swapin detection

The swapin readahead does a blind readahead regardless of whether the swapin
is sequential. This is ok for a hard disk even with random reads, because big
reads have no penalty on a hard disk, and if the readahead pages are garbage,
they can be reclaimed quickly. But for an SSD, a big read is more expensive
than a small read, so if the readahead pages are garbage, such readahead is
pure overhead.

This patch adds a simple random read detection, similar to what file mmap
readahead does. If random reads are detected, swapin readahead will be
skipped. This improves performance a lot for a swap workload with random IO
on a fast SSD.

I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
			runtime change with the patch
randwrite harddisk	-38.7%
seqwrite harddisk	-1.1%
randwrite SSD		-46.9%
seqwrite SSD		+0.3%

For both hard disk and SSD, the randwrite swap workload run time is reduced
significantly. The sequential write swap workload hasn't changed.

Interestingly, the randwrite hard disk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tightens memory pressure and causes more swapout/swapin.

This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch

V2->V3:
move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 

V1->V2:
1. Move the swap readahead accounting to separate functions as suggested by Riel.
2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |    3 ++
 mm/internal.h        |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c          |    3 +-
 mm/rmap.c            |    3 ++
 mm/shmem.c           |    1 
 mm/swap_state.c      |    6 +++++
 6 files changed, 67 insertions(+), 1 deletion(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2012-08-29 16:13:00.912112140 +0800
+++ linux/mm/swap_state.c	2012-08-30 18:28:24.678315187 +0800
@@ -20,6 +20,7 @@
 #include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -379,6 +380,10 @@ struct page *swapin_readahead(swp_entry_
 	unsigned long mask = (1UL << page_cluster) - 1;
 	struct blk_plug plug;
 
+	swap_cache_miss(vma);
+	if (swap_cache_skip_readahead(vma))
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -397,5 +402,6 @@ struct page *swapin_readahead(swp_entry_
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2012-08-29 16:13:00.920112040 +0800
+++ linux/mm/memory.c	2012-08-30 13:32:05.425830660 +0800
@@ -2953,7 +2953,8 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		goto out_release;
-	}
+	} else if (!(flags & FAULT_FLAG_TRIED))
+		swap_cache_hit(vma);
 
 	locked = lock_page_or_retry(page, mm, flags);
 
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h	2012-08-29 16:13:00.932111888 +0800
+++ linux/mm/internal.h	2012-09-03 15:16:30.566299444 +0800
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/rmap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -356,3 +357,54 @@ extern unsigned long vm_mmap_pgoff(struc
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+
+/*
+ * Unnecessary readahead harms performance. 1. For SSD, a big read is more
+ * expensive than a small read, so extra unnecessary reads are pure overhead.
+ * For hard disk, this overhead doesn't exist. 2. Unnecessary readahead
+ * allocates extra memory, which further tightens memory pressure and causes
+ * more swapout/swapin.
+ * This adds a simple swap random access detection. On a swap page fault, if
+ * the page is found in the swap cache, the vma's miss counter is decreased;
+ * otherwise we must do a sync swapin and the counter is increased. Swapin
+ * only does readahead if the counter is below the threshold.
+ */
+#ifdef CONFIG_SWAP
+#define SWAPRA_MISS_THRESHOLD  (100)
+#define SWAPRA_MAX_MISS ((SWAPRA_MISS_THRESHOLD) * 10)
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+	if (vma && vma->anon_vma)
+		atomic_dec_if_positive(&vma->anon_vma->swapra_miss);
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return;
+	if (atomic_read(&vma->anon_vma->swapra_miss) < SWAPRA_MAX_MISS)
+		atomic_inc(&vma->anon_vma->swapra_miss);
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	if (!vma || !vma->anon_vma)
+		return 0;
+	return atomic_read(&vma->anon_vma->swapra_miss) >
+		SWAPRA_MISS_THRESHOLD;
+}
+#else
+static inline void swap_cache_hit(struct vm_area_struct *vma)
+{
+}
+
+static inline void swap_cache_miss(struct vm_area_struct *vma)
+{
+}
+
+static inline int swap_cache_skip_readahead(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+#endif
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2012-06-01 10:10:31.686394463 +0800
+++ linux/include/linux/rmap.h	2012-08-30 18:10:12.256048781 +0800
@@ -35,6 +35,9 @@ struct anon_vma {
 	 * anon_vma if they are the last user on release
 	 */
 	atomic_t refcount;
+#ifdef CONFIG_SWAP
+	atomic_t swapra_miss;
+#endif
 
 	/*
 	 * NOTE: the LSB of the head.next is set by
Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2012-08-06 16:00:45.465441525 +0800
+++ linux/mm/shmem.c	2012-08-30 18:10:51.755553250 +0800
@@ -933,6 +933,7 @@ static struct page *shmem_swapin(swp_ent
 	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = spol;
+	pvma.anon_vma = NULL;
 	return swapin_readahead(swap, gfp, &pvma, 0);
 }
 
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2012-06-01 10:10:31.706394210 +0800
+++ linux/mm/rmap.c	2012-09-03 19:42:15.454127265 +0800
@@ -416,6 +416,9 @@ static void anon_vma_ctor(void *data)
 
 	mutex_init(&anon_vma->mutex);
 	atomic_set(&anon_vma->refcount, 0);
+#ifdef CONFIG_SWAP
+	atomic_set(&anon_vma->swapra_miss, 0);
+#endif
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 


* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03 11:46           ` Shaohua Li
@ 2012-09-03 19:02             ` Konstantin Khlebnikov
  2012-09-03 19:05               ` Rik van Riel
  2012-09-03 22:03             ` [patch v4]swap: add a simple random read swapin detection Minchan Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-09-03 19:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Minchan Kim, akpm@linux-foundation.org, linux-mm@kvack.org,
	riel@redhat.com, fengguang.wu@intel.com

Shaohua Li wrote:
> On Mon, Sep 03, 2012 at 05:32:45PM +0900, Minchan Kim wrote:
> Subject: swap: add a simple random read swapin detection
>
> The swapin readahead does a blind readahead regardless of whether the swapin
> is sequential. This is ok for a hard disk even with random reads, because big
> reads have no penalty on a hard disk, and if the readahead pages are garbage,
> they can be reclaimed quickly. But for an SSD, a big read is more expensive
> than a small read, so if the readahead pages are garbage, such readahead is
> pure overhead.
>
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead will be
> skipped. This improves performance a lot for a swap workload with random IO
> on a fast SSD.
>
> I ran an anonymous mmap write micro benchmark, which triggers swapin/swapout.
> 			runtime change with the patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%
>
> For both hard disk and SSD, the randwrite swap workload run time is reduced
> significantly. The sequential write swap workload hasn't changed.
>
> Interestingly, the randwrite hard disk test is improved too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure and causes more swapout/swapin.

Generally speaking, swapin readahead isn't usable while the system is under
memory pressure. A cache hit isn't very probable, because the reclaimer
allocates swap entries in page-LRU order.

But swapin readahead is very useful when the system recovers from memory
pressure; it helps read the whole swap back into memory (a sort of desktop
scenario).

So, I think we can simply disable swapin readahead while the system is under
memory pressure. For example, in a time-based manner: enable it only after a
grace period following the last swap_writepage().
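
A minimal, hypothetical sketch of that time-based gate (nothing like this is
in the patch; last_swap_write, note_swap_writepage() and SWAPRA_GRACE_PERIOD
are made-up names for illustration):

#include <linux/jiffies.h>

/* hypothetical: refreshed on every swapout, e.g. from swap_writepage() */
static unsigned long last_swap_write;

/* hypothetical grace period before readahead is re-enabled */
#define SWAPRA_GRACE_PERIOD	(10 * HZ)

static inline void note_swap_writepage(void)
{
	last_swap_write = jiffies;
}

static inline bool swapra_grace_period_over(void)
{
	/* no swapout for a while => memory pressure has likely eased */
	return time_after(jiffies, last_swap_write + SWAPRA_GRACE_PERIOD);
}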


* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03 19:02             ` Konstantin Khlebnikov
@ 2012-09-03 19:05               ` Rik van Riel
  2012-09-04  7:34                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2012-09-03 19:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Shaohua Li, Minchan Kim, akpm@linux-foundation.org,
	linux-mm@kvack.org, fengguang.wu@intel.com

On 09/03/2012 03:02 PM, Konstantin Khlebnikov wrote:
> Shaohua Li wrote:
>> On Mon, Sep 03, 2012 at 05:32:45PM +0900, Minchan Kim wrote:
>> Subject: swap: add a simple random read swapin detection
>>
>> The swapin readahead does a blind readahead regardless of whether the
>> swapin is sequential. This is ok for a hard disk even with random reads,
>> because big reads have no penalty on a hard disk, and if the readahead
>> pages are garbage, they can be reclaimed quickly. But for an SSD, a big
>> read is more expensive than a small read, so if the readahead pages are
>> garbage, such readahead is pure overhead.
>>
>> This patch adds a simple random read detection, similar to what file mmap
>> readahead does. If random reads are detected, swapin readahead will be
>> skipped. This improves performance a lot for a swap workload with random
>> IO on a fast SSD.
>>
>> I ran an anonymous mmap write micro benchmark, which triggers
>> swapin/swapout.
>>             runtime change with the patch
>> randwrite harddisk    -38.7%
>> seqwrite harddisk    -1.1%
>> randwrite SSD        -46.9%
>> seqwrite SSD        +0.3%
>>
>> For both hard disk and SSD, the randwrite swap workload run time is
>> reduced significantly. The sequential write swap workload hasn't changed.
>>
>> Interestingly, the randwrite hard disk test is improved too. This might be
>> because swapin readahead needs to allocate extra memory, which further
>> tightens memory pressure and causes more swapout/swapin.
>
> Generally speaking swapin readahread isn't usable while system is under
> memory
> pressure. Cache hit isn't very probable, because reclaimer allocates swap
> entries in page-LRU order.
>
> But swapin readahead is very useful if system recovers from memory
> pressure,
> it helps to read whole swap back to memory (a sort of desktop scenario).
>
> So, I think we can simply disable swapin readahead while system is under
> memory
> pressure. For example in time-based manner -- enable it only after grace
> period
> after last swap_writepage().

Determining "under memory pressure" is pretty hard to do.

However, Shaohua's patch provides an easy way to see whether swap
readahead is helping (we are getting pages from the swap cache),
or whether it is not (pages got evicted before someone faulted on
them).

In short, Shaohua's patch not only does roughly what you want, it
does it in a simple way.
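
For reference, the accounting described above boils down to something like
the sketch below. This only approximates the helpers Shaohua's patch adds to
mm/internal.h (which, as of v3/v4, keep the counter in struct anon_vma); the
threshold value and the hit-side helper, including its name, are assumptions:

#define SWAPRA_MISS_THRESHOLD	100	/* assumed cut-off */

/* a swapin fault that missed the swap cache bumps the miss counter */
static inline void swap_cache_miss(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		atomic_inc(&vma->anon_vma->swapra_miss);
}

/* a fault served from the swap cache means readahead helped: back off */
static inline void swap_cache_hit(struct vm_area_struct *vma)
{
	if (vma->anon_vma && atomic_read(&vma->anon_vma->swapra_miss) > 0)
		atomic_dec(&vma->anon_vma->swapra_miss);
}

/* too many misses: treat this mapping's access pattern as random */
static inline bool swap_cache_skip_readahead(struct vm_area_struct *vma)
{
	return vma->anon_vma &&
	       atomic_read(&vma->anon_vma->swapra_miss) > SWAPRA_MISS_THRESHOLD;
}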

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03 11:46           ` Shaohua Li
  2012-09-03 19:02             ` Konstantin Khlebnikov
@ 2012-09-03 22:03             ` Minchan Kim
  1 sibling, 0 replies; 31+ messages in thread
From: Minchan Kim @ 2012-09-03 22:03 UTC (permalink / raw)
  To: Shaohua Li
  Cc: akpm, Konstantin Khlebnikov, linux-mm@kvack.org, riel@redhat.com,
	fengguang.wu@intel.com

On Mon, Sep 03, 2012 at 07:46:31PM +0800, Shaohua Li wrote:
> On Mon, Sep 03, 2012 at 05:32:45PM +0900, Minchan Kim wrote:
> > Don't we need initialization?
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..c0f3221 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -416,6 +416,9 @@ static void anon_vma_ctor(void *data)
> >  
> >         mutex_init(&anon_vma->mutex);
> >         atomic_set(&anon_vma->refcount, 0);
> > +#ifdef CONFIG_SWAP
> > +       atomic_set(&anon_vma->swapra_miss, 0);
> > +#endif
> >         INIT_LIST_HEAD(&anon_vma->head);
> >  }
> 
> Sorry about this silly problem. I'm not sure why I didn't notice it; probably
> because I only tested random swap after moving swapra_miss to anon_vma.
> 
> 
> Subject: swap: add a simple random read swapin detection
> 
> Swapin readahead does a blind readahead regardless of whether the swapin is
> sequential. This is fine for a hard disk and random reads, because large
> reads carry no penalty on a hard disk, and if the readahead pages turn out
> to be garbage, they can be reclaimed quickly. But for an SSD, a large read
> is more expensive than a small one, so if the readahead pages are garbage,
> such readahead is pure overhead.
> 
> This patch adds a simple random read detection, similar to what file mmap
> readahead does. If random reads are detected, swapin readahead is skipped.
> This helps a lot for a swap workload doing random IO on a fast SSD.
> 
> I ran an anonymous mmap write microbenchmark, which triggers swapin/swapout.
> 			runtime change with patch
> randwrite harddisk	-38.7%
> seqwrite harddisk	-1.1%
> randwrite SSD		-46.9%
> seqwrite SSD		+0.3%
> 
> For both hard disk and SSD, the randwrite swap workload runtime is reduced
> significantly; the sequential write swap workload hasn't changed.
> 
> Interestingly, the randwrite hard disk test improves too. This might be
> because swapin readahead needs to allocate extra memory, which further
> tightens memory pressure, leading to more swapout/swapin.
> 
> This patch depends on readahead-fault-retry-breaks-mmap-file-read-random-detection.patch
> 
> V2->V3:
> move swapra_miss to 'struct anon_vma' as suggested by Konstantin. 
> 
> V1->V2:
> 1. Move the swap readahead accounting to separate functions as suggested by Riel.
> 2. Enable the logic only with CONFIG_SWAP enabled as suggested by Minchan.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-03 19:05               ` Rik van Riel
@ 2012-09-04  7:34                 ` Konstantin Khlebnikov
  2012-09-04 14:15                   ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-09-04  7:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Shaohua Li, Minchan Kim, akpm@linux-foundation.org,
	linux-mm@kvack.org, fengguang.wu@intel.com

Rik van Riel wrote:
> On 09/03/2012 03:02 PM, Konstantin Khlebnikov wrote:
>> Shaohua Li wrote:
>>> On Mon, Sep 03, 2012 at 05:32:45PM +0900, Minchan Kim wrote:
>>> Subject: swap: add a simple random read swapin detection
>>>
>>> The swapin readahead does a blind readahead regardless if the swapin is
>>> sequential. This is ok for harddisk and random read, because read big
>>> size has
>>> no penality in harddisk, and if the readahead pages are garbage, they
>>> can be
>>> reclaimed fastly. But for SSD, big size read is more expensive than
>>> small size
>>> read. If readahead pages are garbage, such readahead only has overhead.
>>>
>>> This patch addes a simple random read detection like what file mmap
>>> readahead
>>> does. If random read is detected, swapin readahead will be skipped. This
>>> improves a lot for a swap workload with random IO in a fast SSD.
>>>
>>> I run anonymous mmap write micro benchmark, which will triger
>>> swapin/swapout.
>>>              runtime changes with path
>>> randwrite harddisk    -38.7%
>>> seqwrite harddisk    -1.1%
>>> randwrite SSD        -46.9%
>>> seqwrite SSD        +0.3%
>>>
>>> For both harddisk and SSD, the randwrite swap workload run time is
>>> reduced
>>> significant. sequential write swap workload hasn't chanage.
>>>
>>> Interesting is the randwrite harddisk test is improved too. This might be
>>> because swapin readahead need allocate extra memory, which further tights
>>> memory pressure, so more swapout/swapin.
>>
>> Generally speaking swapin readahread isn't usable while system is under
>> memory
>> pressure. Cache hit isn't very probable, because reclaimer allocates swap
>> entries in page-LRU order.
>>
>> But swapin readahead is very useful if system recovers from memory
>> pressure,
>> it helps to read whole swap back to memory (a sort of desktop scenario).
>>
>> So, I think we can simply disable swapin readahead while system is under
>> memory
>> pressure. For example in time-based manner -- enable it only after grace
>> period
>> after last swap_writepage().
>
> Determining "under memory pressure" is pretty hard to do.

Indeed. But swapin readahead is mostly useless if the system has no free memory.
So the condition could simply be time-based (as above) or nr_free_pages()-based,
or the reclaimer/page allocator could provide some hints.

[readahead would also be useful in swapoff, but swapoff doesn't use it for now]

>
> However, Shaohua's patch provides an easy way to see whether swap
> readahead is helping (we are getting pages from the swap cache),
> or whether it is not (pages got evicted before someone faulted on
> them).
>

[
BTW we can use the readahead bit in page-flags: mark readahead pages with
SetPageReadahead() and collect those marks in do_swap_page(), roughly:
if (TestClearPageReadahead(page))
	swap_readahead_hit(vma);
]

> In short, Shaohua's patch not only does roughly what you want, it
> does it in a simple way.
>

It disables readahead if it is ineffective in one particular VMA, but in the
recovery case that doesn't matter -- we really want to read the whole of swap
back, no matter which VMA the surrounding pages belong to.
[BTW this case was mentioned in your patch which added skipping-over-holes]

And its metric is strange: it looks like it just disables readahead for all
VMAs after a hundred swapins and never enables it again. Why can't we disable
it from the beginning and turn it on when needed? That would be even simpler.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch v4]swap: add a simple random read swapin detection
  2012-09-04  7:34                 ` Konstantin Khlebnikov
@ 2012-09-04 14:15                   ` Rik van Riel
  2012-09-06 11:08                     ` [PATCH RFC] mm/swap: automatic tuning for swapin readahead Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2012-09-04 14:15 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Shaohua Li, Minchan Kim, akpm@linux-foundation.org,
	linux-mm@kvack.org, fengguang.wu@intel.com

On 09/04/2012 03:34 AM, Konstantin Khlebnikov wrote:

> It disables reahahead if it is ineffective in one particular VMA,
> but in recovering-case this does not important -- we really want to read
> whole swap back, no matter which VMA around pages belongs to.
> [BTW this case was mentioned in you patch which added skipping-over-holes]

This is a good point.  It is entirely possible that we may
be better off deciding this on a system wide level, and not
a VMA level, since that would allow for the statistic to
move faster.

On the other hand, keeping readahead enabled for some VMAs
at all times may be required to get the hits we need to
re-enable it for others :)

> And its metric is strange, looks like it just disables headahead for all
> VMAs
> after hundred swapins and never enables it back. Why we cannot disable
> it from
> the beginning and turn it on when needed? This ways is even more simple.

Take a careful look at the code, specifically do_swap_page().
If a page is found in the swap cache, it is counted as a hit.
If enough pages are found in the swap cache, readahead is
enabled again for the VMA.

Having swap readahead enabled by default is probably the best
thing to do, since IO clustering is generally useful.

How would you determine when to "turn it on when needed"?

What kind of criteria would you use?

What would be the threshold for enabling it?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-09-04 14:15                   ` Rik van Riel
@ 2012-09-06 11:08                     ` Konstantin Khlebnikov
  2012-10-01 23:00                       ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-09-06 11:08 UTC (permalink / raw)
  To: Rik van Riel, Shaohua Li
  Cc: Minchan Kim, Andrew Morton, fengguang.wu, linux-kernel, linux-mm

This patch adds a simple tracker for swapin readahead effectiveness, and tunes
the readahead cluster depending on it. It manages an internal state in the
range [0..1024] and scales the readahead order between 0 and the value of
sysctl vm.page-cluster (3 by default). Swapout and readahead misses decrease
the state; swapin and readahead hits increase it:

 Swapin          +1		[page fault, shmem, etc...]
 Swapout         -10
 Readahead hit   +10
 Readahead miss  -1		[unused readahead page removed from swapcache]

If the system is under serious memory pressure, swapin readahead is useless,
because pages in swap are highly fragmented and a cache hit is mostly
impossible. In this case readahead only leads to unnecessary memory
allocations. But readahead helps read all swapped pages back into memory once
the system recovers from memory pressure.

This patch was inspired by a patch from Shaohua Li
http://www.spinics.net/lists/linux-mm/msg41128.html
but my version uses system-wide state rather than per-VMA counters.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h |    1 +
 mm/swap_state.c            |   42 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..3657cdc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -231,6 +231,7 @@ PAGEFLAG(MappedToDisk, mappedtodisk)
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
 PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0cb36fb..d6c7a88 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -53,12 +53,31 @@ static struct {
 	unsigned long find_total;
 } swap_cache_info;
 
+#define SWAP_RA_BITS	10
+
+static atomic_t swap_ra_state = ATOMIC_INIT((1 << SWAP_RA_BITS) - 1);
+static int swap_ra_cluster = 1;
+
+static void swap_ra_update(int delta)
+{
+	int old_state, new_state;
+
+	old_state = atomic_read(&swap_ra_state);
+	new_state = clamp(old_state + delta, 0, 1 << SWAP_RA_BITS);
+	if (old_state != new_state) {
+		atomic_set(&swap_ra_state, new_state);
+		swap_ra_cluster = (page_cluster * new_state) >> SWAP_RA_BITS;
+	}
+}
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages);
-	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
+	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu,"
+		" readahead %d/%d\n",
 		swap_cache_info.add_total, swap_cache_info.del_total,
-		swap_cache_info.find_success, swap_cache_info.find_total);
+		swap_cache_info.find_success, swap_cache_info.find_total,
+		1 << swap_ra_cluster, atomic_read(&swap_ra_state));
 	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
 }
@@ -112,6 +131,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
+		/* FIXME weird place */
+		swap_ra_update(-10); /* swapout, decrease readahead */
 	}
 	return error;
 }
@@ -132,6 +153,8 @@ void __delete_from_swap_cache(struct page *page)
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
+	if (TestClearPageReadahead(page))
+		swap_ra_update(-1); /* readahead miss */
 }
 
 /**
@@ -265,8 +288,11 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(&swapper_space, entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			swap_ra_update(+10); /* readahead hit */
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -374,11 +400,14 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask = (1UL << swap_ra_cluster) - 1;
 	struct blk_plug plug;
 
+	swap_ra_update(+1); /* swapin, increase readahead */
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -392,6 +421,9 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		/* FIXME these pages aren't readahead sometimes */
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-09-06 11:08                     ` [PATCH RFC] mm/swap: automatic tuning for swapin readahead Konstantin Khlebnikov
@ 2012-10-01 23:00                       ` Hugh Dickins
  2012-10-02  8:58                         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-10-01 23:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Shaohua Li
  Cc: Rik van Riel, Minchan Kim, Andrew Morton, Wu Fengguang,
	linux-kernel, linux-mm

Shaohua, Konstantin,

Sorry that it has taken me so long to reply on these swapin readahead
bounding threads, but I had to try some things out before jumping in,
and only found time to experiment last week.

On Thu, 6 Sep 2012, Konstantin Khlebnikov wrote:
> This patch adds simple tracker for swapin readahread effectiveness, and tunes
> readahead cluster depending on it. It manage internal state [0..1024] and scales
> readahead order between 0 and value from sysctl vm.page-cluster (3 by default).
> Swapout and readahead misses decreases state, swapin and ra hits increases it:
> 
>  Swapin          +1		[page fault, shmem, etc... ]
>  Swapout         -10
>  Readahead hit   +10
>  Readahead miss  -1		[removing from swapcache unused readahead page]
> 
> If system is under serious memory pressure swapin readahead is useless, because
> pages in swap are highly fragmented and cache hit is mostly impossible. In this
> case swapin only leads to unnecessary memory allocations. But readahead helps to
> read all swapped pages back to memory if system recovers from memory pressure.
> 
> This patch inspired by patch from Shaohua Li
> http://www.spinics.net/lists/linux-mm/msg41128.html
> mine version uses system wide state rather than per-VMA counters.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>

While I appreciate the usefulness of the idea, I do have some issues
with both implementations - Shaohua's currently in mmotm and next,
and Konstantin's apparently overlooked.

Shaohua, things I don't care for in your patch,
but none of them thoroughly convincing killers:

1. As Konstantin mentioned (in other words), it dignifies the illusion
   that swap is somehow structured by vmas, rather than being a global
   pool allocated by accident of when pages fall to the bottom of lrus.

2. Following on from that, it's unable to extend its optimization to
   randomly accessed tmpfs files or shmem areas (and I don't want that
   horrid pseudo-vma stuff in shmem.c to be extended in any way to deal
   with this - I'd have replaced it years ago by alloc_page_mpol() if I
   had understood the since-acknowledged-broken mempolicy lifetimes).

3. Although putting swapra_miss into struct anon_vma was a neat memory-
   saving idea from Konstantin, anon_vmas are otherwise pretty much self-
   referential, never before holding any control information themselves:
   I hesitate to extend them in this way.

4. I have not actually performed the test to prove it (tell me if I'm
   plain wrong), but experience with trying to modify it tells me that
   if your vma (worse, your anon_vma) is sometimes used for sequential
   access and sometimes for random (or part of it for sequential and
   part of it for random), then a burst of randomness will switch
   readahead off it forever.

Konstantin, given that, I wanted to speak up for your version.
I admire the way you have confined it to swap_state.c (and without
relying upon the FAULT_FLAG_TRIED patch), and make neat use of
PageReadahead and lookup_swap_cache().

But when I compared it against vanilla or Shaohua's patch, okay it's
comparable to Shaohua's on random (a few percent slower?), and works
on shmem where his fails - but it was 50% slower on sequential access
(when testing on this laptop with Intel SSD: not quite the same as in
the tests below, which I left your patch out of).

I thought that's probably due to some off-by-one or other trivial bug
in the patch; but when I looked to correct it, I found that I just
don't understand what your heuristics are up to, the +1s and -1s
and +10s and -10s.  Maybe it's an off-by-ten, I haven't a clue.

Perhaps, with a trivial bugfix, and comments added, yours will be
great.  But it drove me to steal some of your ideas, combining with
a simple heuristic that even I can understand: patch below.

If I boot with mem=900M (and 1G swap: either on hard disk sda, or
on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE,
or MAP_SHARED for a shmem object), and either cycle sequentially round
that making 5M touches (spaced a page apart), or make 5M random touches,
then here are the times in centisecs that I see (but it's only elapsed
that I've been worrying about).

3.6-rc7 swapping to hard disk:
    124 user    6154 system   73921 elapsed -rc7 sda seq
    102 user    8862 system  895392 elapsed -rc7 sda random
    130 user    6628 system   73601 elapsed -rc7 sda shmem seq
    194 user    8610 system 1058375 elapsed -rc7 sda shmem random

3.6-rc7 swapping to SSD:
    116 user    5898 system   24634 elapsed -rc7 sdb seq
     96 user    8166 system   43014 elapsed -rc7 sdb random
    110 user    6410 system   24959 elapsed -rc7 sdb shmem seq
    208 user    8024 system   45349 elapsed -rc7 sdb shmem random

3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), HDD:
    116 user    6258 system   76210 elapsed shli sda seq
     80 user    7716 system  831243 elapsed shli sda random
    128 user    6640 system   73176 elapsed shli sda shmem seq
    212 user    8522 system 1053486 elapsed shli sda shmem random

3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), SSD:
    126 user    5734 system   24198 elapsed shli sdb seq
     90 user    7356 system   26146 elapsed shli sdb random
    128 user    6396 system   24932 elapsed shli sdb shmem seq
    192 user    8006 system   45215 elapsed shli sdb shmem random

3.6-rc7 + my patch, swapping to hard disk:
    126 user    6252 system   75611 elapsed hugh sda seq
     70 user    8310 system  871569 elapsed hugh sda random
    130 user    6790 system   73855 elapsed hugh sda shmem seq
    148 user    7734 system  827935 elapsed hugh sda shmem random

3.6-rc7 + my patch, swapping to SSD:
    116 user    5996 system   24673 elapsed hugh sdb seq
     76 user    7568 system   28075 elapsed hugh sdb random
    132 user    6468 system   25052 elapsed hugh sdb shmem seq
    166 user    7220 system   28249 elapsed hugh sdb shmem random

Mine does look slightly slower than Shaohua's there (except,
of course, on the shmem random): maybe it's just noise,
maybe I have some edge condition to improve, don't know yet.

These tests are, of course, at the single process extreme; I've also
tried my heavy swapping loads, but have not yet discerned a clear
trend on all machines from those.
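
As a point of reference, that single-process test can be approximated by a
small userspace program along the lines of the sketch below (a reconstruction
under stated assumptions: 4K pages, libc random(), one-byte touches; it is
not the actual test program used for the numbers above):

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE	(1000UL << 20)		/* 1000M anonymous mapping */
#define NR_TOUCHES	(5UL << 20)		/* 5M touches, a page apart */
#define PAGE_SZ		4096UL

int main(int argc, char **argv)
{
	int rnd = argc > 1 && !strcmp(argv[1], "random");
	unsigned long npages = MAP_SIZE / PAGE_SZ, i;
	/* MAP_SHARED|MAP_ANONYMOUS instead would exercise the shmem path */
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < NR_TOUCHES; i++) {
		unsigned long pg = rnd ? (unsigned long)random() % npages
				       : i % npages;
		p[pg * PAGE_SZ]++;	/* touch one byte in the chosen page */
	}
	return 0;
}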

Shaohua, Konstantin, do you have any time to try my patch against
whatever loads you were testing with, to see if it's a contender?

Thanks,
Hugh

 include/linux/page-flags.h |    4 +-
 mm/swap_state.c            |   51 ++++++++++++++++++++++++++++++++---
 2 files changed, 50 insertions(+), 5 deletions(-)

--- 3.6.0/include/linux/page-flags.h	2012-08-03 08:31:26.904842267 -0700
+++ linux/include/linux/page-flags.h	2012-09-28 22:02:00.008166986 -0700
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
--- 3.6.0/mm/swap_state.c	2012-08-03 08:31:27.076842271 -0700
+++ linux/mm/swap_state.c	2012-09-28 23:32:59.752577966 -0700
@@ -53,6 +53,8 @@ static struct {
 	unsigned long find_total;
 } swap_cache_info;
 
+static atomic_t swapra_hits = ATOMIC_INIT(0);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages);
@@ -265,8 +267,11 @@ struct page * lookup_swap_cache(swp_entr
 
 	page = find_get_page(&swapper_space, entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapra_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -351,6 +356,41 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
+unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	static unsigned int swapin_pages = 8;
+	unsigned int used, half, pages, max_pages;
+
+	used = atomic_xchg(&swapra_hits, 0) + 1;
+	pages = ACCESS_ONCE(swapin_pages);
+	half = pages >> 1;
+
+	if (!half) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check if swap type is the same).
+		 */
+		if (offset == prev_offset + 1 || offset == prev_offset - 1)
+			pages <<= 1;
+		prev_offset = offset;
+	} else if (used < half) {
+		/* Less than half were used?  Then halve the window size */
+		pages = half;
+	} else if (used > half) {
+		/* More than half were used?  Then double the window size */
+		pages <<= 1;
+	}
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (pages > max_pages)
+		pages = max_pages;
+	if (ACCESS_ONCE(swapin_pages) != pages)
+		swapin_pages = pages;
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -374,11 +414,14 @@ struct page *swapin_readahead(swp_entry_
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -392,6 +435,8 @@ struct page *swapin_readahead(swp_entry_
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-01 23:00                       ` Hugh Dickins
@ 2012-10-02  8:58                         ` Konstantin Khlebnikov
  2012-10-03 21:07                           ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-10-02  8:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

Great job! I'm glad to see that you like my proof-of-concept patch.
I thought the +/-10 logic could switch between the border states smoothly,
but I have no real experience with this kind of fuzzy-logic stuff,
so it's no surprise that my code fails in some cases.
(one note below about the numbers)

Hugh Dickins wrote:
> Shaohua, Konstantin,
>
> Sorry that it takes me so long to to reply on these swapin readahead
> bounding threads, but I had to try some things out before jumping in,
> and only found time to experiment last week.
>
> On Thu, 6 Sep 2012, Konstantin Khlebnikov wrote:
>> This patch adds simple tracker for swapin readahread effectiveness, and tunes
>> readahead cluster depending on it. It manage internal state [0..1024] and scales
>> readahead order between 0 and value from sysctl vm.page-cluster (3 by default).
>> Swapout and readahead misses decreases state, swapin and ra hits increases it:
>>
>>   Swapin          +1           [page fault, shmem, etc... ]
>>   Swapout         -10
>>   Readahead hit   +10
>>   Readahead miss  -1           [removing from swapcache unused readahead page]
>>
>> If system is under serious memory pressure swapin readahead is useless, because
>> pages in swap are highly fragmented and cache hit is mostly impossible. In this
>> case swapin only leads to unnecessary memory allocations. But readahead helps to
>> read all swapped pages back to memory if system recovers from memory pressure.
>>
>> This patch inspired by patch from Shaohua Li
>> http://www.spinics.net/lists/linux-mm/msg41128.html
>> mine version uses system wide state rather than per-VMA counters.
>>
>> Signed-off-by: Konstantin Khlebnikov<khlebnikov@openvz.org>
>
> While I appreciate the usefulness of the idea, I do have some issues
> with both implementations - Shaohua's currently in mmotm and next,
> and Konstantin's apparently overlooked.
>
> Shaohua, things I don't care for in your patch,
> but none of them thoroughly convincing killers:
>
> 1. As Konstantin mentioned (in other words), it dignifies the illusion
>     that swap is somehow structured by vmas, rather than being a global
>     pool allocated by accident of when pages fall to the bottom of lrus.
>
> 2. Following on from that, it's unable to extend its optimization to
>     randomly accessed tmpfs files or shmem areas (and I don't want that
>     horrid pseudo-vma stuff in shmem.c to be extended in any way to deal
>     with this - I'd have replaced it years ago by alloc_page_mpol() if I
>     had understood the since-acknowledged-broken mempolicy lifetimes).
>
> 3. Although putting swapra_miss into struct anon_vma was a neat memory-
>     saving idea from Konstantin, anon_vmas are otherwise pretty much self-
>     referential, never before holding any control information themselves:
>     I hesitate to extend them in this way.
>
> 4. I have not actually performed the test to prove it (tell me if I'm
>     plain wrong), but experience with trying to modify it tells me that
>     if your vma (worse, your anon_vma) is sometimes used for sequential
>     access and sometimes for random (or part of it for sequential and
>     part of it for random), then a burst of randomness will switch
>     readahead off it forever.
>
> Konstantin, given that, I wanted to speak up for your version.
> I admire the way you have confined it to swap_state.c (and without
> relying upon the FAULT_FLAG_TRIED patch), and make neat use of
> PageReadahead and lookup_swap_cache().
>
> But when I compared it against vanilla or Shaohua's patch, okay it's
> comparable (a few percent slower?) than Shaohua's on random, and works
> on shmem where his fails - but it was 50% slower on sequential access
> (when testing on this laptop with Intel SSD: not quite the same as in
> the tests below, which I left your patch out of).
>
> I thought that's probably due to some off-by-one or other trivial bug
> in the patch; but when I looked to correct it, I found that I just
> don't understand what your heuristics are up to, the +1s and -1s
> and +10s and -10s.  Maybe it's an off-by-ten, I haven't a clue.
>
> Perhaps, with a trivial bugfix, and comments added, yours will be
> great.  But it drove me to steal some of your ideas, combining with
> a simple heuristic that even I can understand: patch below.
>
> If I boot with mem=900M (and 1G swap: either on hard disk sda, or
> on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE,
> or MAP_SHARED for a shmem object), and either cycle sequentially round
> that making 5M touches (spaced a page apart), or make 5M random touches,
> then here are the times in centisecs that I see (but it's only elapsed
> that I've been worrying about).
>
> 3.6-rc7 swapping to hard disk:
>      124 user    6154 system   73921 elapsed -rc7 sda seq
>      102 user    8862 system  895392 elapsed -rc7 sda random
>      130 user    6628 system   73601 elapsed -rc7 sda shmem seq
>      194 user    8610 system 1058375 elapsed -rc7 sda shmem random
>
> 3.6-rc7 swapping to SSD:
>      116 user    5898 system   24634 elapsed -rc7 sdb seq
>       96 user    8166 system   43014 elapsed -rc7 sdb random
>      110 user    6410 system   24959 elapsed -rc7 sdb shmem seq
>      208 user    8024 system   45349 elapsed -rc7 sdb shmem random
>
> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), HDD:
>      116 user    6258 system   76210 elapsed shli sda seq
>       80 user    7716 system  831243 elapsed shli sda random
>      128 user    6640 system   73176 elapsed shli sda shmem seq
>      212 user    8522 system 1053486 elapsed shli sda shmem random
>
> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), SSD:
>      126 user    5734 system   24198 elapsed shli sdb seq
>       90 user    7356 system   26146 elapsed shli sdb random
>      128 user    6396 system   24932 elapsed shli sdb shmem seq
>      192 user    8006 system   45215 elapsed shli sdb shmem random
>
> 3.6-rc7 + my patch, swapping to hard disk:
>      126 user    6252 system   75611 elapsed hugh sda seq
>       70 user    8310 system  871569 elapsed hugh sda random
>      130 user    6790 system   73855 elapsed hugh sda shmem seq
>      148 user    7734 system  827935 elapsed hugh sda shmem random
>
> 3.6-rc7 + my patch, swapping to SSD:
>      116 user    5996 system   24673 elapsed hugh sdb seq
>       76 user    7568 system   28075 elapsed hugh sdb random
>      132 user    6468 system   25052 elapsed hugh sdb shmem seq
>      166 user    7220 system   28249 elapsed hugh sdb shmem random
>

Hmm, it would be nice to gather numbers without swapin readahead at all, just
to see the worst possible case for sequential reads and the best for random.
I'll run some tests too; in particular I want to see how it works for less
synthetic workloads.

> Mine does look slightly slower than Shaohua's there (except,
> of course, on the shmem random): maybe it's just noise,
> maybe I have some edge condition to improve, don't know yet.
>
> These tests are, of course, at the single process extreme; I've also
> tried my heavy swapping loads, but have not yet discerned a clear
> trend on all machines from those.
>
> Shaohua, Konstantin, do you have any time to try my patch against
> whatever loads you were testing with, to see if it's a contender?
>
> Thanks,
> Hugh
>
>   include/linux/page-flags.h |    4 +-
>   mm/swap_state.c            |   51 ++++++++++++++++++++++++++++++++---
>   2 files changed, 50 insertions(+), 5 deletions(-)
>
> --- 3.6.0/include/linux/page-flags.h    2012-08-03 08:31:26.904842267 -0700
> +++ linux/include/linux/page-flags.h    2012-09-28 22:02:00.008166986 -0700
> @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
>   TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
>   PAGEFLAG(MappedToDisk, mappedtodisk)
>
> -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
>   PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> -PAGEFLAG(Readahead, reclaim)           /* Reminder to do async read-ahead */
> +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
>
>   #ifdef CONFIG_HIGHMEM
>   /*
> --- 3.6.0/mm/swap_state.c       2012-08-03 08:31:27.076842271 -0700
> +++ linux/mm/swap_state.c       2012-09-28 23:32:59.752577966 -0700
> @@ -53,6 +53,8 @@ static struct {
>          unsigned long find_total;
>   } swap_cache_info;
>
> +static atomic_t swapra_hits = ATOMIC_INIT(0);
> +
>   void show_swap_cache_info(void)
>   {
>          printk("%lu pages in swap cache\n", total_swapcache_pages);
> @@ -265,8 +267,11 @@ struct page * lookup_swap_cache(swp_entr
>
>          page = find_get_page(&swapper_space, entry.val);
>
> -       if (page)
> +       if (page) {
>                  INC_CACHE_INFO(find_success);
> +               if (TestClearPageReadahead(page))
> +                       atomic_inc(&swapra_hits);
> +       }
>
>          INC_CACHE_INFO(find_total);
>          return page;
> @@ -351,6 +356,41 @@ struct page *read_swap_cache_async(swp_e
>          return found_page;
>   }
>
> +unsigned long swapin_nr_pages(unsigned long offset)
> +{
> +       static unsigned long prev_offset;
> +       static unsigned int swapin_pages = 8;
> +       unsigned int used, half, pages, max_pages;
> +
> +       used = atomic_xchg(&swapra_hits, 0) + 1;
> +       pages = ACCESS_ONCE(swapin_pages);
> +       half = pages>>  1;
> +
> +       if (!half) {
> +               /*
> +                * We can have no readahead hits to judge by: but must not get
> +                * stuck here forever, so check for an adjacent offset instead
> +                * (and don't even bother to check if swap type is the same).
> +                */
> +               if (offset == prev_offset + 1 || offset == prev_offset - 1)
> +                       pages<<= 1;
> +               prev_offset = offset;
> +       } else if (used<  half) {
> +               /* Less than half were used?  Then halve the window size */
> +               pages = half;
> +       } else if (used>  half) {
> +               /* More than half were used?  Then double the window size */
> +               pages<<= 1;
> +       }
> +
> +       max_pages = 1<<  ACCESS_ONCE(page_cluster);
> +       if (pages>  max_pages)
> +               pages = max_pages;
> +       if (ACCESS_ONCE(swapin_pages) != pages)
> +               swapin_pages = pages;
> +       return pages;
> +}
> +
>   /**
>    * swapin_readahead - swap in pages in hope we need them soon
>    * @entry: swap entry of this memory
> @@ -374,11 +414,14 @@ struct page *swapin_readahead(swp_entry_
>                          struct vm_area_struct *vma, unsigned long addr)
>   {
>          struct page *page;
> -       unsigned long offset = swp_offset(entry);
> +       unsigned long entry_offset = swp_offset(entry);
> +       unsigned long offset = entry_offset;
>          unsigned long start_offset, end_offset;
> -       unsigned long mask = (1UL<<  page_cluster) - 1;
> +       unsigned long mask;
>          struct blk_plug plug;
>
> +       mask = swapin_nr_pages(offset) - 1;
> +
>          /* Read a page_cluster sized and aligned cluster around offset. */
>          start_offset = offset&  ~mask;
>          end_offset = offset | mask;
> @@ -392,6 +435,8 @@ struct page *swapin_readahead(swp_entry_
>                                                  gfp_mask, vma, addr);
>                  if (!page)
>                          continue;
> +               if (offset != entry_offset)
> +                       SetPageReadahead(page);
>                  page_cache_release(page);
>          }
>          blk_finish_plug(&plug);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-02  8:58                         ` Konstantin Khlebnikov
@ 2012-10-03 21:07                           ` Hugh Dickins
  2012-10-04 16:23                             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-10-03 21:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, 2 Oct 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > 
> > If I boot with mem=900M (and 1G swap: either on hard disk sda, or
> > on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE,
> > or MAP_SHARED for a shmem object), and either cycle sequentially round
> > that making 5M touches (spaced a page apart), or make 5M random touches,
> > then here are the times in centisecs that I see (but it's only elapsed
> > that I've been worrying about).
> > 
> > 3.6-rc7 swapping to hard disk:
> >      124 user    6154 system   73921 elapsed -rc7 sda seq
> >      102 user    8862 system  895392 elapsed -rc7 sda random
> >      130 user    6628 system   73601 elapsed -rc7 sda shmem seq
> >      194 user    8610 system 1058375 elapsed -rc7 sda shmem random
> > 
> > 3.6-rc7 swapping to SSD:
> >      116 user    5898 system   24634 elapsed -rc7 sdb seq
> >       96 user    8166 system   43014 elapsed -rc7 sdb random
> >      110 user    6410 system   24959 elapsed -rc7 sdb shmem seq
> >      208 user    8024 system   45349 elapsed -rc7 sdb shmem random
> > 
> > 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
> > HDD:
> >      116 user    6258 system   76210 elapsed shli sda seq
> >       80 user    7716 system  831243 elapsed shli sda random
> >      128 user    6640 system   73176 elapsed shli sda shmem seq
> >      212 user    8522 system 1053486 elapsed shli sda shmem random
> > 
> > 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
> > SSD:
> >      126 user    5734 system   24198 elapsed shli sdb seq
> >       90 user    7356 system   26146 elapsed shli sdb random
> >      128 user    6396 system   24932 elapsed shli sdb shmem seq
> >      192 user    8006 system   45215 elapsed shli sdb shmem random
> > 
> > 3.6-rc7 + my patch, swapping to hard disk:
> >      126 user    6252 system   75611 elapsed hugh sda seq
> >       70 user    8310 system  871569 elapsed hugh sda random
> >      130 user    6790 system   73855 elapsed hugh sda shmem seq
> >      148 user    7734 system  827935 elapsed hugh sda shmem random
> > 
> > 3.6-rc7 + my patch, swapping to SSD:
> >      116 user    5996 system   24673 elapsed hugh sdb seq
> >       76 user    7568 system   28075 elapsed hugh sdb random
> >      132 user    6468 system   25052 elapsed hugh sdb shmem seq
> >      166 user    7220 system   28249 elapsed hugh sdb shmem random
> > 
> 
> Hmm, It would be nice to gather numbers without swapin readahead at all, just
> to see the the worst possible case for sequential read and the best for
> random.

Right, and also interesting to see what happens if we raise page_cluster
(more of an option than it was, with your or my patch scaling it down).
Run on the same machine under the same conditions:

3.6-rc7 + my patch, swapping to hard disk with page_cluster 0 (no readahead):
    136 user   34038 system  121542 elapsed hugh cluster0 sda seq
    102 user    7928 system  841680 elapsed hugh cluster0 sda random
    130 user   34770 system  118322 elapsed hugh cluster0 sda shmem seq
    160 user    7362 system  756489 elapsed hugh cluster0 sda shmem random

3.6-rc7 + my patch, swapping to SSD with page_cluster 0 (no readahead):
    138 user   32230 system   70018 elapsed hugh cluster0 sdb seq
     88 user    7296 system   25901 elapsed hugh cluster0 sdb random
    154 user   33150 system   69678 elapsed hugh cluster0 sdb shmem seq
    166 user    6936 system   24332 elapsed hugh cluster0 sdb shmem random

3.6-rc7 + my patch, swapping to hard disk with page_cluster 4 (default + 1):
    144 user    4262 system   77950 elapsed hugh cluster4 sda seq
     74 user    8268 system  863871 elapsed hugh cluster4 sda random
    140 user    4880 system   73534 elapsed hugh cluster4 sda shmem seq
    160 user    7788 system  834804 elapsed hugh cluster4 sda shmem random

3.6-rc7 + my patch, swapping to SSD with page_cluster 4 (default + 1):
    124 user    4242 system   21125 elapsed hugh cluster4 sdb seq
     72 user    7680 system   28686 elapsed hugh cluster4 sdb random
    122 user    4622 system   21387 elapsed hugh cluster4 sdb shmem seq
    172 user    7238 system   28226 elapsed hugh cluster4 sdb shmem random

I was at first surprised to see random significantly faster than sequential
on SSD with readahead off, thinking they ought to come out the same.  But
no, that's a warning on the limitations of the test: with an mmap of 1000M
on a machine with mem=900M, the page-by-page sequential is never going to
rehit cache, whereas the random has a good chance of finding in memory.

Which I presume also accounts for the lower user times throughout
for random - but then why not the same for shmem random?

I did start off measuring on the laptop with SSD, mmap 1000M mem=500M;
but once I transferred to the desktop, I rediscovered just how slow
swapping to hard disk can be, couldn't wait days, so made mem=900M.

> I'll run some tests too, especially I want to see how it works for less
> synthetic workloads.

Thank you, that would be valuable.  I expect there to be certain midway
tests on which Shaohua's patch would show up as significantly faster,
where his per-vma approach would beat the global approach; then the
global to improve with growing contention between processes.  But I
didn't devise any such test, and hoped Shaohua might have one.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-03 21:07                           ` Hugh Dickins
@ 2012-10-04 16:23                             ` Konstantin Khlebnikov
  2012-10-08 22:09                               ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-10-04 16:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

[-- Attachment #1: Type: text/plain, Size: 6938 bytes --]

Here are the results of my test. The workload isn't very realistic, but at
least it is threaded: compiling linux-3.6 with defconfig in 16 threads on
tmpfs, with 512MB RAM, a dual-core CPU and an ordinary hard disk (test script
in attachment).

average results for ten runs:

		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
real time	500	542	528	519	500	523	522
user time	738	737	735	737	739	737	739
sys time	93	93	91	92	96	92	93
pgmajfault	62918	110533	92454	78221	54342	86601	77229
pgpgin		2070372	795228	1034046	1471010	3177192	1154532	1599388
pgpgout		2597278	2022037	2110020	2350380	2802670	2286671	2526570
pswpin		462747	138873	202148	310969	739431	232710	341320
pswpout		646363	502599	524613	584731	697797	568784	628677

So the last two columns show mostly equal results: +4.6% and +4.4% in
comparison to the vanilla kernel with RA=3, but your version shows more stable
results (std-error 2.7% against 4.8%); all these numbers are in the big table
in the attachment.



Numbers from your tests, formatted into a table for better readability:

HDD		Vanilla	Shaohua	RA=3	RA=0	RA=4
SEQ, ANON	73921	76210	75611	121542	77950
SEQ, SHMEM	73601	73176	73855	118322	73534
RND, ANON	895392	831243	871569	841680	863871
RND, SHMEM	1058375	1053486	827935	756489	834804

SSD		Vanilla	Shaohua	RA=3	RA=0	RA=4
SEQ, ANON	24634	24198	24673	70018	21125
SEQ, SHMEM	24959	24932	25052	69678	21387
RND, ANON	43014	26146	28075	25901	28686
RND, SHMEM	45349	45215	28249	24332	28226

Hugh Dickins wrote:
> On Tue, 2 Oct 2012, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>>
>>> If I boot with mem=900M (and 1G swap: either on hard disk sda, or
>>> on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE,
>>> or MAP_SHARED for a shmem object), and either cycle sequentially round
>>> that making 5M touches (spaced a page apart), or make 5M random touches,
>>> then here are the times in centisecs that I see (but it's only elapsed
>>> that I've been worrying about).
>>>
>>> 3.6-rc7 swapping to hard disk:
>>>       124 user    6154 system   73921 elapsed -rc7 sda seq
>>>       102 user    8862 system  895392 elapsed -rc7 sda random
>>>       130 user    6628 system   73601 elapsed -rc7 sda shmem seq
>>>       194 user    8610 system 1058375 elapsed -rc7 sda shmem random
>>>
>>> 3.6-rc7 swapping to SSD:
>>>       116 user    5898 system   24634 elapsed -rc7 sdb seq
>>>        96 user    8166 system   43014 elapsed -rc7 sdb random
>>>       110 user    6410 system   24959 elapsed -rc7 sdb shmem seq
>>>       208 user    8024 system   45349 elapsed -rc7 sdb shmem random
>>>
>>> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
>>> HDD:
>>>       116 user    6258 system   76210 elapsed shli sda seq
>>>        80 user    7716 system  831243 elapsed shli sda random
>>>       128 user    6640 system   73176 elapsed shli sda shmem seq
>>>       212 user    8522 system 1053486 elapsed shli sda shmem random
>>>
>>> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
>>> SSD:
>>>       126 user    5734 system   24198 elapsed shli sdb seq
>>>        90 user    7356 system   26146 elapsed shli sdb random
>>>       128 user    6396 system   24932 elapsed shli sdb shmem seq
>>>       192 user    8006 system   45215 elapsed shli sdb shmem random
>>>
>>> 3.6-rc7 + my patch, swapping to hard disk:
>>>       126 user    6252 system   75611 elapsed hugh sda seq
>>>        70 user    8310 system  871569 elapsed hugh sda random
>>>       130 user    6790 system   73855 elapsed hugh sda shmem seq
>>>       148 user    7734 system  827935 elapsed hugh sda shmem random
>>>
>>> 3.6-rc7 + my patch, swapping to SSD:
>>>       116 user    5996 system   24673 elapsed hugh sdb seq
>>>        76 user    7568 system   28075 elapsed hugh sdb random
>>>       132 user    6468 system   25052 elapsed hugh sdb shmem seq
>>>       166 user    7220 system   28249 elapsed hugh sdb shmem random
>>>
>>
>> Hmm, It would be nice to gather numbers without swapin readahead at all, just
>> to see the the worst possible case for sequential read and the best for
>> random.
>
> Right, and also interesting to see what happens if we raise page_cluster
> (more of an option than it was, with your or my patch scaling it down).
> Run on the same machine under the same conditions:
>
> 3.6-rc7 + my patch, swapping to hard disk with page_cluster 0 (no readahead):
>      136 user   34038 system  121542 elapsed hugh cluster0 sda seq
>      102 user    7928 system  841680 elapsed hugh cluster0 sda random
>      130 user   34770 system  118322 elapsed hugh cluster0 sda shmem seq
>      160 user    7362 system  756489 elapsed hugh cluster0 sda shmem random
>
> 3.6-rc7 + my patch, swapping to SSD with page_cluster 0 (no readahead):
>      138 user   32230 system   70018 elapsed hugh cluster0 sdb seq
>       88 user    7296 system   25901 elapsed hugh cluster0 sdb random
>      154 user   33150 system   69678 elapsed hugh cluster0 sdb shmem seq
>      166 user    6936 system   24332 elapsed hugh cluster0 sdb shmem random
>
> 3.6-rc7 + my patch, swapping to hard disk with page_cluster 4 (default + 1):
>      144 user    4262 system   77950 elapsed hugh cluster4 sda seq
>       74 user    8268 system  863871 elapsed hugh cluster4 sda random
>      140 user    4880 system   73534 elapsed hugh cluster4 sda shmem seq
>      160 user    7788 system  834804 elapsed hugh cluster4 sda shmem random
>
> 3.6-rc7 + my patch, swapping to SSD with page_cluster 4 (default + 1):
>      124 user    4242 system   21125 elapsed hugh cluster4 sdb seq
>       72 user    7680 system   28686 elapsed hugh cluster4 sdb random
>      122 user    4622 system   21387 elapsed hugh cluster4 sdb shmem seq
>      172 user    7238 system   28226 elapsed hugh cluster4 sdb shmem random
>
> I was at first surprised to see random significantly faster than sequential
> on SSD with readahead off, thinking they ought to come out the same.  But
> no, that's a warning on the limitations of the test: with an mmap of 1000M
> on a machine with mem=900M, the page-by-page sequential is never going to
> rehit cache, whereas the random has a good chance of finding in memory.
>
> Which I presume also accounts for the lower user times throughout
> for random - but then why not the same for shmem random?
>
> I did start off measuring on the laptop with SSD, mmap 1000M mem=500M;
> but once I transferred to the desktop, I rediscovered just how slow
> swapping to hard disk can be, couldn't wait days, so made mem=900M.
>
>> I'll run some tests too, especially I want to see how it works for less
>> synthetic workloads.
>
> Thank you, that would be valuable.  I expect there to be certain midway
> tests on which Shaohao's patch would show up as significantly faster,
> where his per-vma approach would beat the global approach; then the
> global to improve with growing contention between processes.  But I
> didn't devise any such test, and hoped Shaohua might have one.
>
> Hugh


[-- Attachment #2: test-linux-build.sh --]
[-- Type: application/x-sh, Size: 370 bytes --]

[-- Attachment #3: test-linux-build-results.txt --]
[-- Type: text/plain, Size: 1193 bytes --]

		0-orig.log 	1-nora.log		1-one.log		2-two.log		4-four.log		5-hugh.log		6-shaohua.log
real time	500 [1.9%]	542 [4.5%]	+8.3%	528 [4.7%]	+5.7%	519 [2.6%]	+3.8%	500 [1.2%]	+0.1%	523 [2.7%]	+4.6%	522 [4.4%]	+4.4%
user time	738 [0.5%]	737 [0.7%]	-0.2%	735 [0.4%]	-0.4%	737 [0.4%]	-0.1%	739 [0.4%]	+0.1%	737 [0.4%]	-0.1%	739 [0.3%]	+0.2%
sys time	93 [1.2%]	93 [1.8%]	+0.6%	91 [0.8%]	-1.4%	92 [1.2%]	-1.3%	96 [0.8%]	+3.8%	92 [0.8%]	-1.0%	93 [1.0%]	+0.5%
pgmajfault	62918 [4.2%]	110533 [6.6%]	+75.7%	92454 [4.6%]	+46.9%	78221 [3.3%]	+24.3%	54342 [2.7%]	-13.6%	86601 [3.8%]	+37.6%	77229 [6.4%]	+22.7%
pgpgin		2070372 [4.0%]	795228 [6.7%]	-61.6%	1034046 [2.9%]	-50.1%	1471010 [4.2%]	-28.9%	3177192 [2.1%]	+53.5%	1154532 [3.4%]	-44.2%	1599388 [6.2%]	-22.7%
pgpgout		2597278 [7.9%]	2022037 [9.4%]	-22.1%	2110020 [4.3%]	-18.8%	2350380 [8.1%]	-9.5%	2802670 [6.8%]	+7.9%	2286671 [7.0%]	-12.0%	2526570 [8.1%]	-2.7%
pswpin		462747 [4.5%]	138873 [6.1%]	-70.0%	202148 [3.4%]	-56.3%	310969 [3.7%]	-32.8%	739431 [2.0%]	+59.8%	232710 [3.6%]	-49.7%	341320 [6.9%]	-26.2%
pswpout		646363 [7.9%]	502599 [9.5%]	-22.2%	524613 [4.4%]	-18.8%	584731 [8.2%]	-9.5%	697797 [6.8%]	+8.0%	568784 [7.0%]	-12.0%	628677 [8.1%]	-2.7%

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-04 16:23                             ` Konstantin Khlebnikov
@ 2012-10-08 22:09                               ` Hugh Dickins
  2012-10-08 22:16                                 ` Andrew Morton
                                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Hugh Dickins @ 2012-10-08 22:09 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:

> Here results of my test. Workload isn't very realistic, but at least it
> threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> 
> average results for ten runs:
> 
> 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
> real time	500	542	528	519	500	523	522
> user time	738	737	735	737	739	737	739
> sys time	93	93	91	92	96	92	93
> pgmajfault	62918	110533	92454	78221	54342	86601	77229
> pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
> pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
> pswpin	462747	138873	202148	310969	739431	232710	341320
> pswpout	646363	502599	524613	584731	697797	568784	628677
> 
> So, last two columns shows mostly equal results: +4.6% and +4.4% in
> comparison to vanilla kernel with RA=3, but your version shows more stable
> results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> attachment)

Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
Shaohua and I are both about 4.5% bad for this particular test, but I'm
more consistently bad - hurrah!

I suspect (not a convincing argument) that if the test were just slightly
different (a little more or a little less memory, SSD instead of hard
disk, diskcache instead of tmpfs), then it would come out differently.

Did you draw any conclusions from the numbers you found?

I haven't done any more on this in the last few days, except to verify
that once an anon_vma is judged random with Shaohua's, then it appears
to be condemned to no-readahead ever after.

That's probably something that a hack like I had in mine would fix,
but that addition might change its balance further (and increase vma
or anon_vma size) - not tried yet.

All I want to do right now, is suggest to Andrew that he hold Shaohua's
patch back from 3.7 for the moment: I'll send a response to Sep 7th's
mm-commits mail to suggest that - but no great disaster if he ignores me.

Hugh

> 
> Numbers from your tests formatted into table for better readability
> 				
> HDD		Vanilla	Shaohua	RA=3	RA=0	RA=4
> SEQ, ANON	73921	76210	75611	121542	77950
> SEQ, SHMEM	73601	73176	73855	118322	73534
> RND, ANON	895392	831243	871569	841680	863871
> RND, SHMEM	1058375	1053486	827935	756489	834804
> 
> SDD		Vanilla	Shaohua	RA=3	RA=0	RA=4
> SEQ, ANON	24634	24198	24673	70018	21125
> SEQ, SHMEM	24959	24932	25052	69678	21387
> RND, ANON	43014	26146	28075	25901	28686
> RND, SHMEM	45349	45215	28249	24332	28226


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-08 22:09                               ` Hugh Dickins
@ 2012-10-08 22:16                                 ` Andrew Morton
  2012-10-09  7:53                                 ` Konstantin Khlebnikov
  2012-10-16  0:50                                 ` Shaohua Li
  2 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2012-10-08 22:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Shaohua Li, Rik van Riel, Minchan Kim,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, 8 Oct 2012 15:09:58 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.

Just in the nick of time.

I'll move
swap-add-a-simple-detector-for-inappropriate-swapin-readahead.patch and
swap-add-a-simple-detector-for-inappropriate-swapin-readahead-fix.patch
into the wait-and-see pile.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-08 22:09                               ` Hugh Dickins
  2012-10-08 22:16                                 ` Andrew Morton
@ 2012-10-09  7:53                                 ` Konstantin Khlebnikov
  2012-10-16  0:50                                 ` Shaohua Li
  2 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-10-09  7:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

Hugh Dickins wrote:
> On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
>
>> Here results of my test. Workload isn't very realistic, but at least it
>> threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
>> 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
>>
>> average results for ten runs:
>>
>> 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
>> real time	500	542	528	519	500	523	522
>> user time	738	737	735	737	739	737	739
>> sys time	93	93	91	92	96	92	93
>> pgmajfault	62918	110533	92454	78221	54342	86601	77229
>> pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
>> pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
>> pswpin	462747	138873	202148	310969	739431	232710	341320
>> pswpout	646363	502599	524613	584731	697797	568784	628677
>>
>> So, last two columns shows mostly equal results: +4.6% and +4.4% in
>> comparison to vanilla kernel with RA=3, but your version shows more stable
>> results (std-error 2.7% against 4.8%) (all this numbers in huge table in
>> attachment)
>
> Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> Shaohua and I are both about 4.5% bad for this particular test, but I'm
> more consistently bad - hurrah!
>
> I suspect (not a convincing argument) that if the test were just slightly
> different (a little more or a little less memory, SSD instead of hard
> disk, diskcache instead of tmpfs), then it would come out differently.

Yes, the results depend mostly on tmpfs.

>
> Did you draw any conclusions from the numbers you found?

Yeah, I have some ideas:

The numbers for the vanilla kernel show a strong dependence between run time
and readahead size. It seems the main problem is that tmpfs does not have its
own readahead; it can only rely on swap-in readahead. There are about 25%
readahead hits for RA=3. As the "pswpin" row shows, both your patch and
Shaohua's make the readahead smaller.

Plus, tmpfs doesn't keep a copy of clean pages in swap (unlike anon pages): on
the swapin path it always marks the page dirty and releases the swap entry.
I don't have measurements, but this particular test definitely re-reads some
files multiple times and writes them back to swap after that.

>
> I haven't done any more on this in the last few days, except to verify
> that once an anon_vma is judged random with Shaohua's, then it appears
> to be condemned to no-readahead ever after.
>
> That's probably something that a hack like I had in mine would fix,
> but that addition might change its balance further (and increase vma
> or anon_vma size) - not tried yet.
>
> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.
>
> Hugh
>
>>
>> Numbers from your tests formatted into table for better readability
>> 				
>> HDD		Vanilla	Shaohua	RA=3	RA=0	RA=4
>> SEQ, ANON	73921	76210	75611	121542	77950
>> SEQ, SHMEM	73601	73176	73855	118322	73534
>> RND, ANON	895392	831243	871569	841680	863871
>> RND, SHMEM	1058375	1053486	827935	756489	834804
>>
>> SDD		Vanilla	Shaohua	RA=3	RA=0	RA=4
>> SEQ, ANON	24634	24198	24673	70018	21125
>> SEQ, SHMEM	24959	24932	25052	69678	21387
>> RND, ANON	43014	26146	28075	25901	28686
>> RND, SHMEM	45349	45215	28249	24332	28226


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-08 22:09                               ` Hugh Dickins
  2012-10-08 22:16                                 ` Andrew Morton
  2012-10-09  7:53                                 ` Konstantin Khlebnikov
@ 2012-10-16  0:50                                 ` Shaohua Li
  2012-10-22  7:36                                   ` Shaohua Li
  2 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-10-16  0:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
> 
> > Here results of my test. Workload isn't very realistic, but at least it
> > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> > 
> > average results for ten runs:
> > 
> > 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
> > real time	500	542	528	519	500	523	522
> > user time	738	737	735	737	739	737	739
> > sys time	93	93	91	92	96	92	93
> > pgmajfault	62918	110533	92454	78221	54342	86601	77229
> > pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
> > pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
> > pswpin	462747	138873	202148	310969	739431	232710	341320
> > pswpout	646363	502599	524613	584731	697797	568784	628677
> > 
> > So, last two columns shows mostly equal results: +4.6% and +4.4% in
> > comparison to vanilla kernel with RA=3, but your version shows more stable
> > results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> > attachment)
> 
> Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> Shaohua and I are both about 4.5% bad for this particular test, but I'm
> more consistently bad - hurrah!
> 
> I suspect (not a convincing argument) that if the test were just slightly
> different (a little more or a little less memory, SSD instead of hard
> disk, diskcache instead of tmpfs), then it would come out differently.
> 
> Did you draw any conclusions from the numbers you found?
> 
> I haven't done any more on this in the last few days, except to verify
> that once an anon_vma is judged random with Shaohua's, then it appears
> to be condemned to no-readahead ever after.
> 
> That's probably something that a hack like I had in mine would fix,
> but that addition might change its balance further (and increase vma
> or anon_vma size) - not tried yet.
> 
> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.

Ok, I tested Hugh's patch. My test is a multithreaded random write workload.
With Hugh's patch: 49:28.06 elapsed
With mine:         43:23.39 elapsed
That is about 12% more time with Hugh's patch.

In the stable state of this workload, the SI:SO ratio should be roughly 1:1.
With Hugh's patch it's around 1.6:1, so there is still unnecessary swapin.

I also tried a workload with sequential/random writes mixed; Hugh's patch is
about 10% worse there too.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-16  0:50                                 ` Shaohua Li
@ 2012-10-22  7:36                                   ` Shaohua Li
  2012-10-23  5:16                                     ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-10-22  7:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote:
> On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
> > 
> > > Here results of my test. Workload isn't very realistic, but at least it
> > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> > > 
> > > average results for ten runs:
> > > 
> > > 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
> > > real time	500	542	528	519	500	523	522
> > > user time	738	737	735	737	739	737	739
> > > sys time	93	93	91	92	96	92	93
> > > pgmajfault	62918	110533	92454	78221	54342	86601	77229
> > > pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
> > > pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
> > > pswpin	462747	138873	202148	310969	739431	232710	341320
> > > pswpout	646363	502599	524613	584731	697797	568784	628677
> > > 
> > > So, last two columns shows mostly equal results: +4.6% and +4.4% in
> > > comparison to vanilla kernel with RA=3, but your version shows more stable
> > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> > > attachment)
> > 
> > Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> > Shaohua and I are both about 4.5% bad for this particular test, but I'm
> > more consistently bad - hurrah!
> > 
> > I suspect (not a convincing argument) that if the test were just slightly
> > different (a little more or a little less memory, SSD instead of hard
> > disk, diskcache instead of tmpfs), then it would come out differently.
> > 
> > Did you draw any conclusions from the numbers you found?
> > 
> > I haven't done any more on this in the last few days, except to verify
> > that once an anon_vma is judged random with Shaohua's, then it appears
> > to be condemned to no-readahead ever after.
> > 
> > That's probably something that a hack like I had in mine would fix,
> > but that addition might change its balance further (and increase vma
> > or anon_vma size) - not tried yet.
> > 
> > All I want to do right now, is suggest to Andrew that he hold Shaohua's
> > patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> > mm-commits mail to suggest that - but no great disaster if he ignores me.
> 
> Ok, I tested Hugh's patch. My test is a multithread random write workload.
> With Hugh's patch, 49:28.06elapsed
> With mine, 43:23.39elapsed
> There is 12% more time used with Hugh's patch.
> 
> In the stable state of this workload, SI:SO ratio should be roughly 1:1. With
> Hugh's patch, it's around 1.6:1, there is still unnecessary swapin.
> 
> I also tried a workload with seqential/random write mixed, Hugh's patch is 10%
> bad too.

With the change below, the si/so ratio is back to around 1:1 in my workload. I
guess the run time of my test will be reduced too, though I haven't tested
that yet.
-	used = atomic_xchg(&swapra_hits, 0) + 1;
+	used = atomic_xchg(&swapra_hits, 0);

I'm wondering how a global counter based method can detect readahead
correctly. For example, if there is a sequential access thread and a random
access thread, doesn't this method always make the wrong decision?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-22  7:36                                   ` Shaohua Li
@ 2012-10-23  5:16                                     ` Hugh Dickins
  2012-10-23  5:51                                       ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-10-23  5:16 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Konstantin Khlebnikov, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, 22 Oct 2012, Shaohua Li wrote:
> On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote:
> > On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> > > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
> > > 
> > > > Here results of my test. Workload isn't very realistic, but at least it
> > > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> > > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> > > > 
> > > > average results for ten runs:
> > > > 
> > > > 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
> > > > real time	500	542	528	519	500	523	522
> > > > user time	738	737	735	737	739	737	739
> > > > sys time	93	93	91	92	96	92	93
> > > > pgmajfault	62918	110533	92454	78221	54342	86601	77229
> > > > pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
> > > > pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
> > > > pswpin	462747	138873	202148	310969	739431	232710	341320
> > > > pswpout	646363	502599	524613	584731	697797	568784	628677
> > > > 
> > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in
> > > > comparison to vanilla kernel with RA=3, but your version shows more stable
> > > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> > > > attachment)
> > > 
> > > Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> > > Shaohua and I are both about 4.5% bad for this particular test, but I'm
> > > more consistently bad - hurrah!
> > > 
> > > I suspect (not a convincing argument) that if the test were just slightly
> > > different (a little more or a little less memory, SSD instead of hard
> > > disk, diskcache instead of tmpfs), then it would come out differently.
> > > 
> > > Did you draw any conclusions from the numbers you found?
> > > 
> > > I haven't done any more on this in the last few days, except to verify
> > > that once an anon_vma is judged random with Shaohua's, then it appears
> > > to be condemned to no-readahead ever after.
> > > 
> > > That's probably something that a hack like I had in mine would fix,
> > > but that addition might change its balance further (and increase vma
> > > or anon_vma size) - not tried yet.
> > > 
> > > All I want to do right now, is suggest to Andrew that he hold Shaohua's
> > > patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> > > mm-commits mail to suggest that - but no great disaster if he ignores me.
> > 
> > Ok, I tested Hugh's patch. My test is a multithread random write workload.
> > With Hugh's patch, 49:28.06elapsed
> > With mine, 43:23.39elapsed
> > There is 12% more time used with Hugh's patch.
> > 
> > In the stable state of this workload, SI:SO ratio should be roughly 1:1. With
> > Hugh's patch, it's around 1.6:1, there is still unnecessary swapin.
> > 
> > I also tried a workload with seqential/random write mixed, Hugh's patch is 10%
> > bad too.
> 
> With below change, the si/so ratio is back to around 1:1 in my workload. Guess
> the run time of my test will be reduced too, though I didn't test yet.
> -	used = atomic_xchg(&swapra_hits, 0) + 1;
> +	used = atomic_xchg(&swapra_hits, 0);

Thank you for playing and trying that, I haven't found time to revisit it
at all.  I'll give that adjustment a go at my end.  The "+ 1" was for the
target page itself; but whatever works best, there's not much science to it.

> 
> I'm wondering how could a global counter based method detect readahead
> correctly. For example, if there are a sequential access thread and a random
> access thread, doesn't this method always make wrong decision?

But only in the simplest cases is the sequentiality of placement on swap
well correlated with the sequentiality of placement in virtual memory.
Once you have a sequential access thread and a random access thread
swapping out at the same time, their pages will be interspersed.
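
As a rough editorial illustration of that interspersion (a toy model, not the
real get_swap_page() code: the single shared "next free slot" cursor below is
an assumption standing in for a shared swap cluster), consider two tasks being
reclaimed at the same time:

#include <stdio.h>

int main(void)
{
	unsigned long next_slot = 0;		/* shared allocation cursor */
	unsigned long seq_slots[8], rnd_slots[8];
	int i;

	/* reclaim alternates between the two tasks' pages */
	for (i = 0; i < 8; i++) {
		seq_slots[i] = next_slot++;	/* sequential task's page i */
		rnd_slots[i] = next_slot++;	/* random task's page */
	}

	printf("sequential task's swap offsets:");
	for (i = 0; i < 8; i++)
		printf(" %lu", seq_slots[i]);
	printf("\nrandom task's swap offsets:    ");
	for (i = 0; i < 8; i++)
		printf(" %lu", rnd_slots[i]);
	printf("\n");
	return 0;
}

The sequential task's virtually-adjacent pages land on swap offsets
0 2 4 6 ..., so offset adjacency no longer tracks virtual adjacency for
either task.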

I'm pretty sure that if you give it more thought than I am giving it
at the moment, you can devise a test case which would go amazingly
faster by your per-vma method than by keeping just this global state.

But I doubt such a test case would be so realistic as to deserve that
extra sophistication.  I do prefer to keep the heuristic as stupid and
unpretentious as possible.

Especially when I remember how get_swap_page() stripes across swap
areas of equal priority: my guess is that nobody uses that feature,
and we don't even want to consider it here; but it feels wrong to
ignore it if we aim for more cleverness at the readahead end.

Hugh


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-23  5:16                                     ` Hugh Dickins
@ 2012-10-23  5:51                                       ` Shaohua Li
  2012-10-23 13:41                                         ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-10-23  5:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Rik van Riel, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Oct 22, 2012 at 10:16:40PM -0700, Hugh Dickins wrote:
> On Mon, 22 Oct 2012, Shaohua Li wrote:
> > On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote:
> > > On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> > > > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
> > > > 
> > > > > Here results of my test. Workload isn't very realistic, but at least it
> > > > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> > > > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> > > > > 
> > > > > average results for ten runs:
> > > > > 
> > > > > 		RA=3	RA=0	RA=1	RA=2	RA=4	Hugh	Shaohua
> > > > > real time	500	542	528	519	500	523	522
> > > > > user time	738	737	735	737	739	737	739
> > > > > sys time	93	93	91	92	96	92	93
> > > > > pgmajfault	62918	110533	92454	78221	54342	86601	77229
> > > > > pgpgin	2070372	795228	1034046	1471010	3177192	1154532	1599388
> > > > > pgpgout	2597278	2022037	2110020	2350380	2802670	2286671	2526570
> > > > > pswpin	462747	138873	202148	310969	739431	232710	341320
> > > > > pswpout	646363	502599	524613	584731	697797	568784	628677
> > > > > 
> > > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in
> > > > > comparison to vanilla kernel with RA=3, but your version shows more stable
> > > > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> > > > > attachment)
> > > > 
> > > > Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> > > > Shaohua and I are both about 4.5% bad for this particular test, but I'm
> > > > more consistently bad - hurrah!
> > > > 
> > > > I suspect (not a convincing argument) that if the test were just slightly
> > > > different (a little more or a little less memory, SSD instead of hard
> > > > disk, diskcache instead of tmpfs), then it would come out differently.
> > > > 
> > > > Did you draw any conclusions from the numbers you found?
> > > > 
> > > > I haven't done any more on this in the last few days, except to verify
> > > > that once an anon_vma is judged random with Shaohua's, then it appears
> > > > to be condemned to no-readahead ever after.
> > > > 
> > > > That's probably something that a hack like I had in mine would fix,
> > > > but that addition might change its balance further (and increase vma
> > > > or anon_vma size) - not tried yet.
> > > > 
> > > > All I want to do right now, is suggest to Andrew that he hold Shaohua's
> > > > patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> > > > mm-commits mail to suggest that - but no great disaster if he ignores me.
> > > 
> > > Ok, I tested Hugh's patch. My test is a multithread random write workload.
> > > With Hugh's patch, 49:28.06elapsed
> > > With mine, 43:23.39elapsed
> > > There is 12% more time used with Hugh's patch.
> > > 
> > > In the stable state of this workload, SI:SO ratio should be roughly 1:1. With
> > > Hugh's patch, it's around 1.6:1, there is still unnecessary swapin.
> > > 
> > > I also tried a workload with seqential/random write mixed, Hugh's patch is 10%
> > > bad too.
> > 
> > With below change, the si/so ratio is back to around 1:1 in my workload. Guess
> > the run time of my test will be reduced too, though I didn't test yet.
> > -	used = atomic_xchg(&swapra_hits, 0) + 1;
> > +	used = atomic_xchg(&swapra_hits, 0);
> 
> Thank you for playing and trying that, I haven't found time to revisit it
> at all.  I'll give that adjustment a go at my end.  The "+ 1" was for the
> target page itself; but whatever works best, there's not much science to it.

With the '+1', the minimum readahead window is 2 pages, even for purely random access.
 
> > 
> > I'm wondering how could a global counter based method detect readahead
> > correctly. For example, if there are a sequential access thread and a random
> > access thread, doesn't this method always make wrong decision?
> 
> But only in the simplest cases is the sequentiality of placement on swap
> well correlated with the sequentiality of placement in virtual memory.
> Once you have a sequential access thread and a random access thread
> swapping out at the same time, their pages will be interspersed.
> 
> I'm pretty sure that if you give it more thought than I am giving it
> at the moment, you can devise a test case which would go amazingly
> faster by your per-vma method than by keeping just this global state.
> 
> But I doubt such a test case would be so realistic as to deserve that
> extra sophistication.  I do prefer to keep the heuristic as stupid and
> unpretentious as possible.

I have no strong objection to the global state method, and I agree that keeping
the heuristic simple is preferable for now. I'm happy with the patch if the
'+1' is removed.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-23  5:51                                       ` Shaohua Li
@ 2012-10-23 13:41                                         ` Rik van Riel
  2012-10-24  1:13                                           ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2012-10-23 13:41 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Hugh Dickins, Konstantin Khlebnikov, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 10/23/2012 01:51 AM, Shaohua Li wrote:

> I have no strong point against the global state method. But I'd agree making the
> heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> is removed.

Without the +1, how will you figure out when to re-enable readahead?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-23 13:41                                         ` Rik van Riel
@ 2012-10-24  1:13                                           ` Shaohua Li
  2012-11-06  5:36                                             ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-10-24  1:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Konstantin Khlebnikov, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> On 10/23/2012 01:51 AM, Shaohua Li wrote:
> 
> >I have no strong point against the global state method. But I'd agree making the
> >heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> >is removed.
> 
> Without the +1, how will you figure out when to re-enable readahead?

The code below in swapin_nr_pages() can recover it.
+               if (offset == prev_offset + 1 || offset == prev_offset - 1)
+                       pages <<= 1;

Not perfect, but it should work to some extent. This reminds me to wonder
whether the PageReadahead flag is really required: a hit in the swap cache is
a more reliable way to count readahead hits, and as Hugh mentioned, swap isn't
vma bound.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-10-24  1:13                                           ` Shaohua Li
@ 2012-11-06  5:36                                             ` Shaohua Li
  2012-11-14  9:48                                               ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2012-11-06  5:36 UTC (permalink / raw)
  To: Rik van Riel, Hugh Dickins
  Cc: Konstantin Khlebnikov, Minchan Kim, Andrew Morton, Wu Fengguang,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote:
> On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> > On 10/23/2012 01:51 AM, Shaohua Li wrote:
> > 
> > >I have no strong point against the global state method. But I'd agree making the
> > >heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> > >is removed.
> > 
> > Without the +1, how will you figure out when to re-enable readahead?
> 
> Below code in swapin_nr_pages can recover it.
> +               if (offset == prev_offset + 1 || offset == prev_offset - 1)
> +                       pages <<= 1;
> 
> Not perfect, but should work in some sort. This reminds me to think if
> pagereadahead flag is really required, hit in swap cache is a more reliable way
> to count readahead hit, and as Hugh mentioned, swap isn't vma bound.

Hugh,
ping! Any chance you can check this again?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-11-06  5:36                                             ` Shaohua Li
@ 2012-11-14  9:48                                               ` Hugh Dickins
  2012-11-19  2:33                                                 ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-11-14  9:48 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Rik van Riel, Konstantin Khlebnikov, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel, linux-mm

On Tue, 6 Nov 2012, Shaohua Li wrote:
> On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote:
> > On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> > > On 10/23/2012 01:51 AM, Shaohua Li wrote:
> > > 
> > > >I have no strong point against the global state method. But I'd agree making the
> > > >heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> > > >is removed.
> > > 
> > > Without the +1, how will you figure out when to re-enable readahead?
> > 
> > Below code in swapin_nr_pages can recover it.
> > +               if (offset == prev_offset + 1 || offset == prev_offset - 1)
> > +                       pages <<= 1;
> > 
> > Not perfect, but should work in some sort. This reminds me to think if
> > pagereadahead flag is really required, hit in swap cache is a more reliable way
> > to count readahead hit, and as Hugh mentioned, swap isn't vma bound.
> 
> Hugh,
> ping! Any chance you can check this again?

I apologize, Shaohua, my slowness must be very frustrating for you,
as it is for me too.

Thank you for pointing out how my first patch was reading two pages
instead of one in the random case, explaining its disappointing
performance there: odd how blind I was to that, despite taking stats.

I did experiment with removing the "+ 1" as you did, it worked well
in the random SSD case, but degraded performance in (all? I forget)
the other cases.

I failed to rescue the "algorithm" in that patch, and changed it a
week ago for an even simpler one, that has worked well for me so far.
When I sent you a quick private ack to your ping, I was puzzled by its
"too good to be true" initial test results: once I looked into those,
found my rearrangement of the test script had left out a swapoff,
so the supposed harddisk tests were actually swapping to SSD.

I've finally got around to assembling the results and writing up
some description, starting off from yours.  I think I've gone as
far as I can with this, and don't want to hold you up with further
delays: would it be okay if I simply hand this patch over to you now,
to test and expand upon and add your Sign-off and send in to akpm to
replace your original in mmotm - IF you are satisfied with it?


[PATCH] swap: add a simple detector for inappropriate swapin readahead

swapin readahead does a blind readahead, whether or not the swapin
is sequential.  This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages?  But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.

This patch adds very simplistic random read detection.  Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches.  Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Original-patch-by: Shaohua Li <shli@fusionio.com>
Original-patch-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---

 include/linux/page-flags.h |    4 +-
 mm/swap_state.c            |   55 +++++++++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 5 deletions(-)

--- 3.7-rc5/include/linux/page-flags.h	2012-09-30 16:47:46.000000000 -0700
+++ linux/include/linux/page-flags.h	2012-11-11 09:45:30.908591576 -0800
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
--- 3.7-rc5/mm/swap_state.c	2012-09-30 16:47:46.000000000 -0700
+++ linux/mm/swap_state.c	2012-11-11 09:50:06.520598126 -0800
@@ -53,6 +53,8 @@ static struct {
 	unsigned long find_total;
 } swap_cache_info;
 
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages);
@@ -265,8 +267,11 @@ struct page * lookup_swap_cache(swp_entr
 
 	page = find_get_page(&swapper_space, entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -351,6 +356,42 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
+unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -374,11 +415,16 @@ struct page *swapin_readahead(swp_entry_
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -392,10 +438,13 @@ struct page *swapin_readahead(swp_entry_
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
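
For concrete numbers, here is a rough userspace re-implementation of the
window-sizing logic in swapin_nr_pages() above (an editorial sketch, not part
of the patch; it assumes page_cluster = 3 and, for the zero-hit case, that the
faulting offset is not adjacent to prev_offset):

#include <stdio.h>

static unsigned int window(unsigned int hits, unsigned int page_cluster)
{
	unsigned int pages, max_pages = 1u << page_cluster;

	if (max_pages <= 1)
		return 1;
	pages = hits + 2;		/* the "+ 2" from the patch above */
	if (pages == 2) {
		pages = 1;		/* no hits; assume non-adjacent offset */
	} else {
		unsigned int roundup = 4;
		while (roundup < pages)	/* round up to a power of two */
			roundup <<= 1;
		pages = roundup;
	}
	return pages < max_pages ? pages : max_pages;
}

int main(void)
{
	unsigned int hits;

	for (hits = 0; hits <= 8; hits++)
		printf("hits=%u -> window=%u pages\n", hits, window(hits, 3));
	return 0;
}

Under those assumptions it prints a 1-page window for zero hits, a 4-page
window for one or two hits, and the 8-page maximum from three hits upward.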


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
  2012-11-14  9:48                                               ` Hugh Dickins
@ 2012-11-19  2:33                                                 ` Shaohua Li
  0 siblings, 0 replies; 31+ messages in thread
From: Shaohua Li @ 2012-11-19  2:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Konstantin Khlebnikov, Minchan Kim, Andrew Morton,
	Wu Fengguang, linux-kernel, linux-mm

On Wed, Nov 14, 2012 at 01:48:18AM -0800, Hugh Dickins wrote:
> On Tue, 6 Nov 2012, Shaohua Li wrote:
> > On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote:
> > > On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> > > > On 10/23/2012 01:51 AM, Shaohua Li wrote:
> > > > 
> > > > >I have no strong point against the global state method. But I'd agree making the
> > > > >heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> > > > >is removed.
> > > > 
> > > > Without the +1, how will you figure out when to re-enable readahead?
> > > 
> > > Below code in swapin_nr_pages can recover it.
> > > +               if (offset == prev_offset + 1 || offset == prev_offset - 1)
> > > +                       pages <<= 1;
> > > 
> > > Not perfect, but should work in some sort. This reminds me to think if
> > > pagereadahead flag is really required, hit in swap cache is a more reliable way
> > > to count readahead hit, and as Hugh mentioned, swap isn't vma bound.
> > 
> > Hugh,
> > ping! Any chance you can check this again?
> 
> I apologize, Shaohua, my slowness must be very frustrating for you,
> as it is for me too.

Not at all, thanks for looking at it.
 
> Thank you for pointing out how my first patch was reading two pages
> instead of one in the random case, explaining its disappointing
> performance there: odd how blind I was to that, despite taking stats.
> 
> I did experiment with removing the "+ 1" as you did, it worked well
> in the random SSD case, but degraded performance in (all? I forget)
> the other cases.
> 
> I failed to rescue the "algorithm" in that patch, and changed it a
> week ago for an even simpler one, that has worked well for me so far.
> When I sent you a quick private ack to your ping, I was puzzled by its
> "too good to be true" initial test results: once I looked into those,
> found my rearrangement of the test script had left out a swapoff,
> so the supposed harddisk tests were actually swapping to SSD.
> 
> I've finally got around to assembling the results and writing up
> some description, starting off from yours.  I think I've gone as
> far as I can with this, and don't want to hold you up with further
> delays: would it be okay if I simply hand this patch over to you now,
> to test and expand upon and add your Sign-off and send in to akpm to
> replace your original in mmotm - IF you are satisfied with it?

I played with the patch more. It works as expected in the random access case,
but in the sequential case it regresses against vanilla, maybe because I'm
using a two-socket machine. I explain the reason in the patch and changelog
below.

Below is an add-on patch on top of Hugh's patch. We can apply Hugh's patch
first and then this one if that's ok, or just merge them into one patch.
AKPM, what's your suggestion?

Thanks,
Shaohua

Subject: mm/swap: improve swapin readahead heuristic

Swapout always tries to find a cluster of swap slots. The cluster is shared by
everything that does swapout (kswapds, direct page reclaim), so swapping out
adjacent memory can produce an interleaved access pattern on disk. We do
aggressive swapin in the non-random access case to avoid skipping swapin on
such interleaved patterns.

This really isn't the fault of swapin, but until we improve the swapout
algorithm (for example, by giving each CPU its own swap cluster), aggressive
swapin gives better performance for sequential access.

With the patch below, the heuristic becomes:
1. swapin max_pages pages for any hit
2. otherwise swapin last_readahead_pages*3/4 pages (decay sketched below)
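
As a rough illustration of the decay in step 2 (an editorial userspace sketch,
not kernel code; it assumes page_cluster = 3 so max_pages = 8, a starting
window of 8 pages, and consecutive non-adjacent misses):

#include <stdio.h>

int main(void)
{
	unsigned int pages = 8;	/* assume the last hit pushed the window to max_pages */
	int miss;

	for (miss = 1; miss <= 6; miss++) {
		pages = pages * 3 / 4;	/* the last_readahead_pages * 3 / 4 step */
		if (pages < 1)		/* never below one page, as in the patch */
			pages = 1;
		printf("after %d consecutive misses: window=%u pages\n",
		       miss, pages);
	}
	return 0;
}

So roughly five consecutive non-adjacent misses shrink the window from 8 pages
down to a single page, while any readahead hit (or an adjacent offset)
restores it to max_pages at once.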

The test is done on a two-socket machine (7G memory), so at least 3 tasks are
doing swapout (2 kswapds and one direct page reclaim). The sequential test is
1 thread accessing 14G of memory; the random test is 24 threads randomly
accessing 14G of memory. The data is run time.

		Rand		Seq
vanilla		5678		434
Hugh		2829		625
Hugh+belowpatch	2785		401

For both random and sequential access, the patch below performs well, and is
even slightly better than vanilla for sequential access. I'm not quite sure of
the reason, but I suspect it's because some daemons are doing small random
swapping.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---
 mm/swap_state.c |   31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2012-11-19 09:08:58.171621096 +0800
+++ linux/mm/swap_state.c	2012-11-19 10:01:28.016023822 +0800
@@ -359,6 +359,7 @@ struct page *read_swap_cache_async(swp_e
 unsigned long swapin_nr_pages(unsigned long offset)
 {
 	static unsigned long prev_offset;
+	static atomic_t last_readahead_pages;
 	unsigned int pages, max_pages;
 
 	max_pages = 1 << ACCESS_ONCE(page_cluster);
@@ -366,29 +367,29 @@ unsigned long swapin_nr_pages(unsigned l
 		return 1;
 
 	/*
-	 * This heuristic has been found to work well on both sequential and
-	 * random loads, swapping to hard disk or to SSD: please don't ask
-	 * what the "+ 2" means, it just happens to work well, that's all.
+	 * swapout always tries to find a cluster of swap slots. The cluster is
+	 * shared by everything that does swapout (kswapds, direct page reclaim),
+	 * so swapping out adjacent memory can produce an interleaved access
+	 * pattern on disk. We do aggressive swapin in the non-random access
+	 * case to avoid skipping swapin on such interleaved patterns.
 	 */
-	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
-	if (pages == 2) {
+	pages = atomic_xchg(&swapin_readahead_hits, 0);
+	if (!pages) {
 		/*
 		 * We can have no readahead hits to judge by: but must not get
 		 * stuck here forever, so check for an adjacent offset instead
 		 * (and don't even bother to check whether swap type is same).
 		 */
-		if (offset != prev_offset + 1 && offset != prev_offset - 1)
-			pages = 1;
+		if (offset != prev_offset + 1 && offset != prev_offset - 1) {
+			pages = atomic_read(&last_readahead_pages) * 3 / 4;
+			pages = max_t(unsigned int, pages, 1);
+		} else
+			pages = max_pages;
 		prev_offset = offset;
-	} else {
-		unsigned int roundup = 4;
-		while (roundup < pages)
-			roundup <<= 1;
-		pages = roundup;
-	}
-
-	if (pages > max_pages)
+	} else
 		pages = max_pages;
+
+	atomic_set(&last_readahead_pages, pages);
 	return pages;
 }
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-11-19  2:33 UTC | newest]

Thread overview: 31+ messages
2012-08-27  4:00 [patch v2]swap: add a simple random read swapin detection Shaohua Li
2012-08-27 12:57 ` Rik van Riel
2012-08-27 14:52 ` Konstantin Khlebnikov
2012-08-30 10:36   ` [patch v3]swap: " Shaohua Li
2012-08-30 16:03     ` Rik van Riel
2012-08-30 17:42     ` Minchan Kim
2012-09-03  7:21       ` [patch v4]swap: " Shaohua Li
2012-09-03  8:32         ` Minchan Kim
2012-09-03 11:46           ` Shaohua Li
2012-09-03 19:02             ` Konstantin Khlebnikov
2012-09-03 19:05               ` Rik van Riel
2012-09-04  7:34                 ` Konstantin Khlebnikov
2012-09-04 14:15                   ` Rik van Riel
2012-09-06 11:08                     ` [PATCH RFC] mm/swap: automatic tuning for swapin readahead Konstantin Khlebnikov
2012-10-01 23:00                       ` Hugh Dickins
2012-10-02  8:58                         ` Konstantin Khlebnikov
2012-10-03 21:07                           ` Hugh Dickins
2012-10-04 16:23                             ` Konstantin Khlebnikov
2012-10-08 22:09                               ` Hugh Dickins
2012-10-08 22:16                                 ` Andrew Morton
2012-10-09  7:53                                 ` Konstantin Khlebnikov
2012-10-16  0:50                                 ` Shaohua Li
2012-10-22  7:36                                   ` Shaohua Li
2012-10-23  5:16                                     ` Hugh Dickins
2012-10-23  5:51                                       ` Shaohua Li
2012-10-23 13:41                                         ` Rik van Riel
2012-10-24  1:13                                           ` Shaohua Li
2012-11-06  5:36                                             ` Shaohua Li
2012-11-14  9:48                                               ` Hugh Dickins
2012-11-19  2:33                                                 ` Shaohua Li
2012-09-03 22:03             ` [patch v4]swap: add a simple random read swapin detection Minchan Kim
