linux-mm.kvack.org archive mirror
* [PATCH 0/2] swap: improve swap I/O rate
@ 2012-05-14 11:58 ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Ehrhardt Christian

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single-page 4k requests, despite the fact that swap-in submits
a batch of 1<<page-cluster pages as swap readahead and swap-out writes a list
of pages in shrink_page_list.
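
For reference, the readahead window derives from page-cluster like this (a
simplified sketch of the logic in mm/swap_state.c, also visible in the diff
of patch 1/2):

	unsigned long mask = (1UL << page_cluster) - 1;	/* e.g. 3 -> 0x7 */
	unsigned long start_offset = offset & ~mask;	/* cluster start */
	unsigned long end_offset   = offset | mask;	/* cluster end   */
	/* i.e. up to 1<<page_cluster consecutive swap slots per fault */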

[1/2 swap in improvement]
The read patch shows improvements of up to 50% in swap throughput and much
happier guest systems; even when running at comparable throughput it saves a
lot of I/Os per second, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

[missing patch #3]
I tried to get a similar patch working for swap out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected; I'm open to
suggestions for improving the pageout I/O sizes as well.

Kind regards,
Christian Ehrhardt


Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
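
The fix below brackets the readahead loop with the explicit per-task plugging
API, so the individual 4k reads can merge before they are dispatched; the
pattern is simply:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* submit the batch of read_swap_cache_async() requests here; they
	 * queue up on a per-task plug list where the block layer can merge
	 * adjacent requests before they hit the device queue */
	blk_finish_plug(&plug);	/* flush the merged requests to the device */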

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also lowers resource utilization.

With a load running KVM in heavy memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly,
and the load feels more responsive as well as achieves more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184
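
(Derived from the numbers above: the average read dispatch size grows from
roughly 2,243MiB / 544,701 = ~4.2KiB, i.e. single pages, to
2,937MiB / 214,972 = ~14KiB per I/O.)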

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:48   ` Minchan Kim
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
  2012-05-15 18:24 ` Jens Axboe
  3 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about how the tunable changes that behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM.  See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt.  The swap I/O size.
+page-cluster controls the maximum number of consecutive pages (if
+available) that are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+"Consecutive" here does not mean virtual/physical addresses, but consecutive
+on swap space - that means the pages were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time).  There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if those pages would have
+been part of the consecutive pages that readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4
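
As a usage note, the tunable can be inspected and changed at runtime via
procfs, e.g.:

	# cat /proc/sys/vm/page-cluster
	3
	# echo 0 > /proc/sys/vm/page-cluster	# disables swap readahead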


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2012-05-15  4:38 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
> 
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
> 
> With a load running KVM in heavy memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly,
> and the load feels more responsive as well as achieves more throughput.
> 
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
> IO unplugs:       149,614               Timer unplugs:       2,940
> 
> With the patch:
> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
> IO unplugs:       337,130               Timer unplugs:      11,184
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Reviewed-by: Minchan Kim <minchan@kernel.org>

It does make sense to me.

> ---
>  mm/swap_state.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pagemap.h>
>  #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
>  #include <linux/pagevec.h>
>  #include <linux/migrate.h>
>  #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>  
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>  
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:48   ` Minchan Kim
  2012-05-21  7:24     ` Christian Ehrhardt
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-15  4:48 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code, and add some comments about how the tunable changes that behavior.
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> ---
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM.  See above.
>  
>  page-cluster
>  
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt.  The swap I/O size.
> +page-cluster controls the maximum number of consecutive pages (if
> +available) that are read in from swap in a single attempt. This is the swap


"If available" would be wrong in next kernel because recently Rik submit following patch,

mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4


> +counterpart to page cache readahead.
> +"Consecutive" here does not mean virtual/physical addresses, but consecutive
> +on swap space - that means the pages were swapped out together.
>  
>  It is a logarithmic value - setting it to zero means "1 page", setting
>  it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>  
>  The default value is three (eight pages at a time).  There may be some
>  small benefits in tuning this to a different value if your workload is
>  swap-intensive.
>  
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if those pages would have
> +been part of the consecutive pages that readahead would have brought in.
> +
>  =============================================================
>  
>  panic_on_oom


Otherwise, looks good to me.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:59 ` Minchan Kim
  2012-05-21  7:51   ` Christian Ehrhardt
  2012-05-15 18:24 ` Jens Axboe
  3 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-15  4:59 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in submits
> a batch of 1<<page-cluster pages as swap readahead and swap-out writes a list
> of pages in shrink_page_list.
> 
> [1/2 swap in improvement]
> The read patch shows improvements of up to 50% in swap throughput and much
> happier guest systems; even when running at comparable throughput it saves a
> lot of I/Os per second, leaving resources in the SAN for other consumers.
> 
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
> 
> [missing patch #3]
> I tried to get a similar patch working for swap out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.


I think we have already done that.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so
I/O plugging is already applied there.
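
For reference, that existing plugging looks roughly like this (a simplified
sketch of the mm/vmscan.c code of that time, not an exact quote):

	struct blk_plug plug;

	blk_start_plug(&plug);
	while (/* pages left to scan on the LRU lists */) {
		...
		nr_reclaimed += shrink_list(lru, nr_to_scan, mz, sc, priority);
	}
	blk_finish_plug(&plug);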

> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions for improving the pageout I/O sizes as well.


We could enhance writeout by batching, like ext4_bio_write_page does.
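
For reference, the idea there is to gather consecutive dirty pages into one
bio and submit it once instead of one bio per page; a rough sketch only (the
helper below is hypothetical, and the real ext4 code also tracks io_end
state):

	struct bio *bio = NULL;

	for each page to write {			/* pseudocode loop */
		if (bio && !bio_add_page(bio, page, PAGE_SIZE, 0)) {
			submit_bio(WRITE, bio);		/* bio full: flush */
			bio = NULL;
		}
		if (!bio) {
			bio = alloc_bio_for(page);	/* hypothetical helper */
			bio_add_page(bio, page, PAGE_SIZE, 0);
		}
	}
	if (bio)
		submit_bio(WRITE, bio);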

> 
> Kind regards,
> Christian Ehrhardt
> 
> 
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O
> 
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  mm/swap_state.c             |    5 +++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-15  4:38   ` Minchan Kim
@ 2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe

On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.

> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
                   ` (2 preceding siblings ...)
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-15 18:24 ` Jens Axboe
  3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2012-05-15 18:24 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm

On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in submits
> a batch of 1<<page-cluster pages as swap readahead and swap-out writes a list
> of pages in shrink_page_list.
> 
> [1/2 swap in improvement]
> The read patch shows improvements of up to 50% in swap throughput and much
> happier guest systems; even when running at comparable throughput it saves a
> lot of I/Os per second, leaving resources in the SAN for other consumers.
> 
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
> 
> [missing patch #3]
> I tried to get a similar patch working for swap out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions for improving the pageout I/O sizes as well.
> 
> Kind regards,
> Christian Ehrhardt
> 
> 
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O

Looks good to me, you can add my acked-by to both of them.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-15  4:48   ` Minchan Kim
@ 2012-05-21  7:24     ` Christian Ehrhardt
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21  7:24 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton



On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
>> the code, and add some comments about how the tunable changes that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> ---
>>   Documentation/sysctl/vm.txt |   12 ++++++++++--
>>   1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM.  See above.
>>
>>   page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt.  The swap I/O size.
>> +page-cluster controls the maximum number of consecutive pages (if
>> +available) that are read in from swap in a single attempt. This is the swap
>
>
> "If available" would be wrong in next kernel because recently Rik submit following patch,
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4
>
>

You're right - it's not severely wrong, but if we are fixing the
documentation we can do it right.
I'll send a 2nd version of the patch series with this adapted and all 
the acks I got so far added.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-21  7:51   ` Christian Ehrhardt
  2012-05-21  8:46     ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21  7:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

[...]

>> [missing patch #3]
>> I tried to get a similar patch working for swap out in shrink_page_list. It
>> worked in functional terms, but the additional merging was negligible.
>
>
> I think we have already done that.
> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so
> I/O plugging is already applied there.
>

I saw that code, and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments with plug/unplug in
shrink_page_list, the effective I/O size of swap writes stays at almost 4k.

So far I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:

  94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
  94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
  94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
  94,5   10      468     0.023522235   116  P   N [kswapd0]
  94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
  94,5   10      470     0.023760079   116  U   N [kswapd0] 1
  94,5   10      471     0.023760360   116  D   W 28868648 + 8 ( 468) [kswapd0]
  94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
  94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
  94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
  94,5   10      475     0.023944610   116  I   W 28868656 + 8 ( 52500) [kswapd0]
  94,5   10      476     0.023944735   116  U   N [kswapd0] 1
  94,5   10      477     0.023944892   116  D   W 28868656 + 8 ( 282) [kswapd0]
  94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
  94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
[...]
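
(blktrace event codes: A = remap, Q = queued, G = get request, P = plug,
I = inserted, U = unplug, D = dispatched, C = completed. Note how every
8-sector (4k) write is unplugged and dispatched on its own before the next
one is even queued.)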

But we can split this discussion from my other two patches and I would 
be happy to provide my test environment for further tests if there are 
new suggestions/patches/...

>> Maybe the cond_resched triggers much more often than I expected; I'm open to
>> suggestions for improving the pageout I/O sizes as well.
>
>
> We could enhance writeout by batching, like ext4_bio_write_page does.
>

Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
of buffer layer in mpage_da_submit_io"?



-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-21  8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-05-21  8:09 ` ehrhardt
  2012-05-21  8:51   ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-05-21  8:09 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also lowers resource utilization.

With a load running KVM in heavy memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly,
and the load feels more responsive as well as achieves more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-21  7:51   ` Christian Ehrhardt
@ 2012-05-21  8:46     ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2012-05-21  8:46 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:

> [...]
> 
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap out in
>>> shrink_page_list. It
>>> worked in functional terms, but the additional merging was negligible.
>>
>>
>> I think we have already done that.
>> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list,
>> so I/O plugging is already applied there.
>>
> 
> I saw that code, and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments with plug/unplug in
> shrink_page_list, the effective I/O size of swap writes stays at almost 4k.


I meant that your plugging in shrink_page_list is redundant.

> 
> So far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient - at least for my case.


Yeb.

> You saw the blktrace summaries in my first mail; an excerpt of a write
> submission stream looks like this:
> 
>  94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
>  94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
>  94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
>  94,5   10      468     0.023522235   116  P   N [kswapd0]
>  94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
>  94,5   10      470     0.023760079   116  U   N [kswapd0] 1
>  94,5   10      471     0.023760360   116  D   W 28868648 + 8 ( 468) [kswapd0]
>  94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
>  94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
>  94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
>  94,5   10      475     0.023944610   116  I   W 28868656 + 8 ( 52500) [kswapd0]
>  94,5   10      476     0.023944735   116  U   N [kswapd0] 1
>  94,5   10      477     0.023944892   116  D   W 28868656 + 8 ( 282) [kswapd0]
>  94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
>  94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
> [...]
> 
> But we can split this discussion from my other two patches and I would
> be happy to provide my test environment for further tests if there are
> new suggestions/patches/...
> 
>>> Maybe the cond_resched triggers much more often than I expected; I'm open
>>> to suggestions for improving the pageout I/O sizes as well.
>>
>>
>> We could enhance writeout by batching, like ext4_bio_write_page does.
>>
> 
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io"?


Yeb, I think it's helpful for your case but it's not trivial to implement it, IMHO.

> 
> 
> 



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-21  8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-21  8:51   ` Minchan Kim
  2012-05-21  9:07     ` Christian Ehrhardt
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-21  8:51 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe

On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
> 
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
> 
> With a load running KVM in heavy memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly,
> and the load feels more responsive as well as achieves more throughput.
> 
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
> IO unplugs:       149,614               Timer unplugs:       2,940
> 
> With the patch:
> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
> IO unplugs:       337,130               Timer unplugs:      11,184
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>


Reviewed-by: Minchan Kim <minchan@kernel.org>

Didn't I add my Reviewed-by on your previous version?

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-21  8:51   ` Minchan Kim
@ 2012-05-21  9:07     ` Christian Ehrhardt
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21  9:07 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe



On 05/21/2012 10:51 AM, Minchan Kim wrote:
> On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
[...]
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> Acked-by: Rik van Riel<riel@redhat.com>
>> Acked-by: Jens Axboe<axboe@kernel.dk>
>
>
> Reviewed-by: Minchan Kim<minchan@kernel.org>
>
> Didn't I add my Reviewed-by on your previous version?
>

Sorry, I missed it since you provided good feedback on all three mails. I
had your "otherwise looks good to me" on mail #2 still in mind and didn't
want to be so presumptuous as to convert that into a review or ack
statement.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] swap: allow swap readahead to be merged
  2012-06-04  8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-06-04  8:33 ` ehrhardt
  2012-06-05 23:44   ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-06-04  8:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also lowers resource utilization.

With a load running KVM in heavy memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly,
and the load feels more responsive as well as achieves more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>

---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-06-04  8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-06-05 23:44   ` Andrew Morton
  2012-06-20 15:58     ` Christian Ehrhardt
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-06-05 23:44 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, hughd, minchan

On Mon,  4 Jun 2012 10:33:22 +0200
ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.

Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
common to bother doing any fancy high-level aggregation: just toss it
at the queue and use the general BIO merging.

> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
> 
> With a load running KVM in heavy memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly,
> and the load feels more responsive as well as achieves more throughput.
> 
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
> IO unplugs:       149,614               Timer unplugs:       2,940
> 
> With the patch:
> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
> IO unplugs:       337,130               Timer unplugs:      11,184

This is rather hard to understand.  How much faster did it get?

> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pagemap.h>
>  #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
>  #include <linux/pagevec.h>
>  #include <linux/migrate.h>
>  #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>  
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>  
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);

AFAICT this affects tmpfs as well, and it would be
interesting/useful/diligent to check for performance improvements or
regressions in that area.

And the patch doesn't help swapoff, in try_to_unuse().  Or any other
callers of swap_readpage(), if they exist.

The switch to explicit plugging might have caused swap regressions in
other areas so perhaps a more extensive patch is needed.  But
swapin_readahead() covers most cases and a more extensive patch will
work OK with this one, so I guess we run with the simple patch for now.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-06-05 23:44   ` Andrew Morton
@ 2012-06-20 15:58     ` Christian Ehrhardt
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-06-20 15:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, axboe, hughd, minchan



On 06/06/2012 01:44 AM, Andrew Morton wrote:
> On Mon,  4 Jun 2012 10:33:22 +0200
> ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Swap readahead works fine, but the I/O to disk is almost always done in
>> page-size requests, despite the fact that readahead submits 1<<page-cluster
>> pages at a time.
>> On older kernels the old per-device plugging behavior might have captured
>> this and merged the requests, but currently it all comes down to many more
>> I/Os than required.
>
> Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
> common to bother doing any fancy high-level aggregation: just toss it
> at the queue and use the general BIO merging.
>
>> On a single device this might not be an issue, but as soon as a server runs
>> on shared SAN resources, saving I/Os not only improves swap-in throughput
>> but also lowers resource utilization.
>>
>> With a load running KVM in heavy memory overcommitment (the hot memory
>> is 1.5 times the host memory) swapping throughput improves significantly,
>> and the load feels more responsive as well as achieves more throughput.
>>
>> In a test setup with 16 swap disks, running blktrace on one of those disks
>> shows the improved merging:
>> Prior:
>> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
>> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
>> Reads Requeued:         0               Writes Requeued:         0
>> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
>> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
>> IO unplugs:       149,614               Timer unplugs:       2,940
>>
>> With the patch:
>> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
>> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
>> Reads Requeued:         0               Writes Requeued:         0
>> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
>> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
>> IO unplugs:       337,130               Timer unplugs:      11,184
>
> This is rather hard to understand.  How much faster did it get?

I got ~10% to ~40% more throughput in my cases, and at the same time much
lower CPU consumption when broken down per transferred kilobyte (the
majority of that due to saved interrupts and better cache handling).
In a shared SAN, others might see an additional benefit as well, because
this now causes less protocol overhead.

>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -14,6 +14,7 @@
>>   #include<linux/init.h>
>>   #include<linux/pagemap.h>
>>   #include<linux/backing-dev.h>
>> +#include<linux/blkdev.h>
>>   #include<linux/pagevec.h>
>>   #include<linux/migrate.h>
>>   #include<linux/page_cgroup.h>
>> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>   	unsigned long offset = swp_offset(entry);
>>   	unsigned long start_offset, end_offset;
>>   	unsigned long mask = (1UL<<  page_cluster) - 1;
>> +	struct blk_plug plug;
>>
>>   	/* Read a page_cluster sized and aligned cluster around offset. */
>>   	start_offset = offset&  ~mask;
>> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>   	if (!start_offset)	/* First page is swap header. */
>>   		start_offset++;
>>
>> +	blk_start_plug(&plug);
>>   	for (offset = start_offset; offset<= end_offset ; offset++) {
>>   		/* Ok, do the async read-ahead now */
>>   		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>   			continue;
>>   		page_cache_release(page);
>>   	}
>> +	blk_finish_plug(&plug);
>> +
>>   	lru_add_drain();	/* Push any new pages onto the LRU now */
>>   	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>
> AFACIT this affects tmpfs as well, and it would be
> interesting/useful/diligent to check for performance improvements or
> regressions in that area.
>

A quick test with fio doing 256k sequential writes showed an improvement of
9.1%, but since I'm not sure how big the noise is in this test I'd be
cautious with these results.
Unfortunately I didn't check CPU consumption - it might be that with tmpfs
that's the area where a bigger improvement could be seen.
Well, at least it didn't break - so that's a good result as well.


> And the patch doesn't help swapoff, in try_to_unuse().  Or any other
> callers of swap_readpage(), if they exist.
>
> The switch to explicit plugging might have caused swap regressions in
> other areas so perhaps a more extensive patch is needed.  But
> swapin_readahead() covers most cases and a more extensive patch will
> work OK with this one, so I guess we run witht he simple patch for now.
>

Yeah, all the other swap areas might need re-tuning after the plugging
changes as well, but for example swapoff shouldn't be too performance
critical, right?
As discussed before, I'd be more interested in getting the swap writeout
path to merge better as well.
Eventually - as you said - a later, more complex patch can follow and take
all of these into account.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-06-20 15:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15  4:38   ` Minchan Kim
2012-05-15 17:43   ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-15  4:48   ` Minchan Kim
2012-05-21  7:24     ` Christian Ehrhardt
2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-21  7:51   ` Christian Ehrhardt
2012-05-21  8:46     ` Minchan Kim
2012-05-15 18:24 ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2012-05-21  8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-05-21  8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-21  8:51   ` Minchan Kim
2012-05-21  9:07     ` Christian Ehrhardt
2012-06-04  8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04  8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-06-05 23:44   ` Andrew Morton
2012-06-20 15:58     ` Christian Ehrhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).