* [PATCH 0/2] swap: improve swap I/O rate
@ 2012-05-14 11:58 ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Ehrhardt Christian
From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found that almost all
I/Os were issued as single-page 4k requests, despite the fact that swap-in is
a batch of 1<<page-cluster pages as swap readahead and swap-out is a list of
pages written in shrink_page_list.
[1/2 swap in improvement]
The read patch shows improvements of up to 50% in swap throughput and much
happier guest systems. Even when running with comparable throughput, it saves
a lot of I/Os per second, leaving resources in the SAN for other consumers.
[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.
[missing patch #3]
I tried to get a similar patch working for swap-out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected. I'm open to
suggestions for improving the pageout I/O sizes as well.
Kind regards,
Christian Ehrhardt
Christian Ehrhardt (2):
swap: allow swap readahead to be merged
documentation: update how page-cluster affects swap I/O
Documentation/sysctl/vm.txt | 12 ++++++++++--
mm/swap_state.c | 5 +++++
2 files changed, 15 insertions(+), 2 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
` (2 subsequent siblings)
3 siblings, 2 replies; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,    2,243MiB   Writes Queued:     226,242,  904,968KiB
 Read Dispatches:  544,701,    2,243MiB   Write Dispatches:  159,318,  904,968KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:  544,716,    2,243MiB   Writes Completed:  159,321,  904,980KiB
 Read Merges:       16,187,   64,748KiB   Write Merges:       61,744,  246,976KiB
 IO unplugs:       149,614                Timer unplugs:       2,940
With the patch:
 Reads Queued:     734,315,    2,937MiB   Writes Queued:     300,188,    1,200MiB
 Read Dispatches:  214,972,    2,937MiB   Write Dispatches:  215,176,    1,200MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:  214,971,    2,937MiB   Writes Completed:  215,177,    1,200MiB
 Read Merges:      519,343,    2,077MiB   Write Merges:       73,325,  293,300KiB
 IO unplugs:       337,130                Timer unplugs:      11,184
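For orientation, the two summaries can be reduced to an average size per
dispatched read: 2,243MiB over 544,701 dispatches is roughly 4KiB per request
before the patch, and 2,937MiB over 214,972 dispatches is roughly 14KiB with
it. A minimal userspace sketch of that arithmetic follows. avg_request_kib is
a hypothetical helper name used only for illustration; it is not part of the
patch or of blktrace.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average size of a dispatched request in KiB, rounded to the nearest
 * integer. total_kib is the transferred volume in KiB and dispatches is
 * the number of requests that reached the device.
 * (Hypothetical helper, for illustration only.)
 */
static uint64_t avg_request_kib(uint64_t total_kib, uint64_t dispatches)
{
	assert(dispatches > 0);
	return (total_kib + dispatches / 2) / dispatches;
}
```

Called with the read figures above (MiB converted to KiB), the helper gives 4
for the unpatched run and 14 for the patched run.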
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
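The effect that holding the readahead loop under a plug is meant to enable
can be sketched in userspace C. This is a toy model under stated assumptions,
not the block layer's implementation: it only back-merges a request that
starts exactly where the previous one ends, and it ignores request size
limits, front merges and the I/O scheduler. struct toy_req and
toy_merge_plugged are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One queued request: a run of `count` sectors starting at `sector`. */
struct toy_req {
	unsigned long sector;
	unsigned long count;
};

/*
 * Toy stand-in for merging under a plug: requests collected in submission
 * order are back-merged when a request starts exactly where the previous
 * one ends. Merges in place and returns how many requests would remain
 * to be dispatched.
 */
static size_t toy_merge_plugged(struct toy_req *reqs, size_t n)
{
	size_t out = 0;

	assert(reqs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    reqs[out - 1].sector + reqs[out - 1].count == reqs[i].sector) {
			reqs[out - 1].count += reqs[i].count;	/* back-merge */
			continue;
		}
		reqs[out++] = reqs[i];	/* starts a new request */
	}
	return out;
}
```

With five 8-sector requests at sectors 8, 16, 24, 40 and 48, the toy model
leaves two requests: one of 24 sectors at sector 8 and one of 16 sectors at
sector 40. Without a plug, each of the five would be dispatched on its own.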
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
2012-05-15 4:48 ` Minchan Kim
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-15 18:24 ` Jens Axboe
3 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about what the tunable changes in that
behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
Documentation/sysctl/vm.txt | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages (if
+available) are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
=============================================================
panic_on_oom
--
1.7.0.4
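To make the logarithmic scale and the aligned cluster concrete, here is a
small userspace sketch. readahead_pages and readahead_window are hypothetical
helper names. The start computation and the swap header skip mirror the
swapin_readahead context lines shown in patch 1/2; the end computation
(offset | mask) is not visible in the quoted hunks and is an assumption here.

```c
#include <assert.h>

/* Pages covered by one readahead attempt for a given page-cluster value. */
static unsigned long readahead_pages(unsigned int page_cluster)
{
	assert(page_cluster < 8 * sizeof(unsigned long));
	return 1UL << page_cluster;
}

/*
 * Aligned swap-offset window around `offset`, both ends inclusive.
 * Offset 0 is skipped because the first page holds the swap header.
 */
static void readahead_window(unsigned long offset, unsigned int page_cluster,
			     unsigned long *start, unsigned long *end)
{
	unsigned long mask = readahead_pages(page_cluster) - 1;

	*start = offset & ~mask;
	*end = offset | mask;	/* assumed: not shown in the quoted hunks */
	if (*start == 0)
		(*start)++;
}
```

With the default page-cluster of 3, a fault on swap offset 13 covers offsets 8
to 15, and a fault on offset 5 covers 1 to 7 because of the header page. With
page-cluster 0 the window is the faulting page alone.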
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2012-05-15 4:38 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.
>
> With a load running KVM under heavy memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive as well as achieving more throughput.
>
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
It does make sense to me.
> ---
> mm/swap_state.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
> #include <linux/init.h>
> #include <linux/pagemap.h>
> #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = swp_offset(entry);
> unsigned long start_offset, end_offset;
> unsigned long mask = (1UL << page_cluster) - 1;
> + struct blk_plug plug;
>
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> + blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> continue;
> page_cache_release(page);
> }
> + blk_finish_plug(&plug);
> +
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
> }
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15 4:48 ` Minchan Kim
2012-05-21 7:24 ` Christian Ehrhardt
0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-15 4:48 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code and add some comments about what the tunable will change in that
> behavior.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> ---
> Documentation/sysctl/vm.txt | 12 ++++++++++--
> 1 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM. See above.
>
> page-cluster
>
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt. The swap I/O size.
> +page-cluster controls the number of pages up to which consecutive pages (if
> +available) are read in from swap in a single attempt. This is the swap
"If available" would be wrong in the next kernel, because Rik recently submitted the following patch:
mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4
> +counterpart to page cache readahead.
> +The mentioned consecutivity is not in terms of virtual/physical addresses,
> +but consecutive on swap space - that means they were swapped out together.
>
> It is a logarithmic value - setting it to zero means "1 page", setting
> it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>
> The default value is three (eight pages at a time). There may be some
> small benefits in tuning this to a different value if your workload is
> swap-intensive.
>
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if they would have been part of
> +that consecutive pages readahead would have brought in.
> +
> =============================================================
>
> panic_on_oom
Otherwise, looks good to me.
--
Kind regards,
Minchan Kim
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15 4:59 ` Minchan Kim
2012-05-21 7:51 ` Christian Ehrhardt
2012-05-15 18:24 ` Jens Axboe
3 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-15 4:59 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
>
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found that almost all
> I/Os were issued as single-page 4k requests, despite the fact that swap-in is
> a batch of 1<<page-cluster pages as swap readahead and swap-out is a list of
> pages written in shrink_page_list.
>
> [1/2 swap in improvement]
> The read patch shows improvements of up to 50% in swap throughput and much
> happier guest systems. Even when running with comparable throughput, it saves
> a lot of I/Os per second, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
I think we have already done it.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we
have already applied I/O plugging.
> Maybe the cond_resched triggers much more often than I expected. I'm open to
> suggestions for improving the pageout I/O sizes as well.
We could enhance writeout by batching, like ext4_bio_write_page.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
> swap: allow swap readahead to be merged
> documentation: update how page-cluster affects swap I/O
>
> Documentation/sysctl/vm.txt | 12 ++++++++++--
> mm/swap_state.c | 5 +++++
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
@ 2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.
> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
` (2 preceding siblings ...)
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-15 18:24 ` Jens Axboe
3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2012-05-15 18:24 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm
On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
>
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found that almost all
> I/Os were issued as single-page 4k requests, despite the fact that swap-in is
> a batch of 1<<page-cluster pages as swap readahead and swap-out is a list of
> pages written in shrink_page_list.
>
> [1/2 swap in improvement]
> The read patch shows improvements of up to 50% in swap throughput and much
> happier guest systems. Even when running with comparable throughput, it saves
> a lot of I/Os per second, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
> Maybe the cond_resched triggers much more often than I expected. I'm open to
> suggestions for improving the pageout I/O sizes as well.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
> swap: allow swap readahead to be merged
> documentation: update how page-cluster affects swap I/O
Looks good to me, you can add my acked-by to both of them.
--
Jens Axboe
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-15 4:48 ` Minchan Kim
@ 2012-05-21 7:24 ` Christian Ehrhardt
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:24 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton
On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of
>> the code and add some comments about what the tunable will change in that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> ---
>> Documentation/sysctl/vm.txt | 12 ++++++++++--
>> 1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM. See above.
>>
>> page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt. The swap I/O size.
>> +page-cluster controls the number of pages up to which consecutive pages (if
>> +available) are read in from swap in a single attempt. This is the swap
>
>
> "If available" would be wrong in next kernel because recently Rik submit following patch,
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4
>
>
You're right - it's not severely wrong, but if we are fixing the
documentation we can do it right.
I'll send a second version of the patch series with this adapted and all
the acks I got so far added.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-21 7:51 ` Christian Ehrhardt
2012-05-21 8:46 ` Minchan Kim
0 siblings, 1 reply; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:51 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
[...]
>> [missing patch #3]
>> I tried to get a similar patch working for swap-out in shrink_page_list. It
>> worked in functional terms, but the additional merging was negligible.
>
>
> I think we have already done it.
> Look at shrink_mem_cgroup_zone which ends up calling shrink_page_list so we already have applied
> I/O plugging.
>
I saw that code and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments with plug/unplug in
shrink_page_list, the effective I/O size of swap writes stays at almost 4k.
So far, therefore, I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:
94,4 10 465 0.023520923 116 A W 28868648 + 8 <- (94,5) 28868456
94,5 10 466 0.023521173 116 Q W 28868648 + 8 [kswapd0]
94,5 10 467 0.023522048 116 G W 28868648 + 8 [kswapd0]
94,5 10 468 0.023522235 116 P N [kswapd0]
94,5 10 469 0.023759892 116 I W 28868648 + 8 ( 237844) [kswapd0]
94,5 10 470 0.023760079 116 U N [kswapd0] 1
94,5 10 471 0.023760360 116 D W 28868648 + 8 ( 468) [kswapd0]
94,4 10 472 0.023891235 116 A W 28868656 + 8 <- (94,5) 28868464
94,5 10 473 0.023891454 116 Q W 28868656 + 8 [kswapd0]
94,5 10 474 0.023892110 116 G W 28868656 + 8 [kswapd0]
94,5 10 475 0.023944610 116 I W 28868656 + 8 ( 52500) [kswapd0]
94,5 10 476 0.023944735 116 U N [kswapd0] 1
94,5 10 477 0.023944892 116 D W 28868656 + 8 ( 282) [kswapd0]
94,5 16 19 0.024023192 16033 C W 28868648 + 8 ( 262832) [0]
94,5 24 37 0.024196752 14526 C W 28868656 + 8 ( 251860) [0]
[...]
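Two things can be read off this excerpt. Each write is "+ 8" sectors, which is
4096 bytes if sectors are 512 bytes as blktrace reports them, and the second
write starts at 28868656, exactly where the first one (28868648 + 8) ends. The
pair would therefore be a candidate for a back-merge, but the first request is
already dispatched (D at 0.023760360) before the second one is queued (Q at
0.023891454). A minimal userspace sketch of the two checks follows;
request_bytes and is_contiguous are hypothetical helper names, not blktrace or
kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_BYTES 512UL	/* assumed: trace lengths are 512-byte sectors */

/* Size in bytes of a request spanning `nr_sectors` sectors. */
static unsigned long request_bytes(unsigned long nr_sectors)
{
	return nr_sectors * SECTOR_BYTES;
}

/* True if a request starting at `next` begins right where the previous ends. */
static bool is_contiguous(unsigned long prev_start, unsigned long prev_len,
			  unsigned long next)
{
	assert(prev_start + prev_len >= prev_start);	/* no wraparound */
	return prev_start + prev_len == next;
}
```

For the excerpt, request_bytes(8) is 4096 and
is_contiguous(28868648, 8, 28868656) is true, which matches the observation
that the writes are adjacent on disk yet still issued as separate 4k requests.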
But we can split this discussion from my other two patches and I would
be happy to provide my test environment for further tests if there are
new suggestions/patches/...
>> Maybe the cond_resched triggers much more often than I expected. I'm open to
>> suggestions for improving the pageout I/O sizes as well.
>
>
> We could enhance write out by batch like ext4_bio_write_page.
>
Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
of buffer layer in mpage_da_submit_io" ?
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-05-21 8:09 ` ehrhardt
2012-05-21 8:51 ` Minchan Kim
0 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-05-21 8:09 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,    2,243MiB   Writes Queued:     226,242,  904,968KiB
 Read Dispatches:  544,701,    2,243MiB   Write Dispatches:  159,318,  904,968KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:  544,716,    2,243MiB   Writes Completed:  159,321,  904,980KiB
 Read Merges:       16,187,   64,748KiB   Write Merges:       61,744,  246,976KiB
 IO unplugs:       149,614                Timer unplugs:       2,940
With the patch:
 Reads Queued:     734,315,    2,937MiB   Writes Queued:     300,188,    1,200MiB
 Read Dispatches:  214,972,    2,937MiB   Write Dispatches:  215,176,    1,200MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:  214,971,    2,937MiB   Writes Completed:  215,177,    1,200MiB
 Read Merges:      519,343,    2,077MiB   Write Merges:       73,325,  293,300KiB
 IO unplugs:       337,130                Timer unplugs:      11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-21 7:51 ` Christian Ehrhardt
@ 2012-05-21 8:46 ` Minchan Kim
0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2012-05-21 8:46 UTC (permalink / raw)
To: Christian Ehrhardt
Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:
> [...]
>
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap-out in
>>> shrink_page_list. It
>>> worked in functional terms, but the additional merging was negligible.
>>
>>
>> I think we have already done it.
>> Look at shrink_mem_cgroup_zone which ends up calling shrink_page_list
>> so we already have applied
>> I/O plugging.
>>
>
> I saw that code and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments of plug/unplug in
> shrink_page_list the effective I/O size of swap write stays at almost 4k.
I meant that your plugging in shrink_page_list is redundant.
>
> Thereby so far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient - at least for my case.
Yep.
> You saw the blocktrace summaries in my first mail, an excerpt of a write
> submission stream looks like that:
>
> 94,4 10 465 0.023520923 116 A W 28868648 + 8 <- (94,5) 28868456
> 94,5 10 466 0.023521173 116 Q W 28868648 + 8 [kswapd0]
> 94,5 10 467 0.023522048 116 G W 28868648 + 8 [kswapd0]
> 94,5 10 468 0.023522235 116 P N [kswapd0]
> 94,5 10 469 0.023759892 116 I W 28868648 + 8 ( 237844) [kswapd0]
> 94,5 10 470 0.023760079 116 U N [kswapd0] 1
> 94,5 10 471 0.023760360 116 D W 28868648 + 8 ( 468) [kswapd0]
> 94,4 10 472 0.023891235 116 A W 28868656 + 8 <- (94,5) 28868464
> 94,5 10 473 0.023891454 116 Q W 28868656 + 8 [kswapd0]
> 94,5 10 474 0.023892110 116 G W 28868656 + 8 [kswapd0]
> 94,5 10 475 0.023944610 116 I W 28868656 + 8 ( 52500) [kswapd0]
> 94,5 10 476 0.023944735 116 U N [kswapd0] 1
> 94,5 10 477 0.023944892 116 D W 28868656 + 8 ( 282) [kswapd0]
> 94,5 16 19 0.024023192 16033 C W 28868648 + 8 ( 262832) [0]
> 94,5 24 37 0.024196752 14526 C W 28868656 + 8 ( 251860) [0]
> [...]
>
> But we can split this discussion from my other two patches and I would
> be happy to provide my test environment for further tests if there are
> new suggestions/patches/...
>
>>> Maybe the cond_resched triggers much more often than I expected, I'm
>>> open for
>>> suggestions regarding improving the pageout I/O sizes as well.
>>
>>
>> We could enhance write out by batch like ext4_bio_write_page.
>>
>
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io" ?
Yep, I think it would help in your case, but it's not trivial to implement, IMHO.
>
>
>
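For reference, the two writes in the excerpt above are back to back on disk: 28868648 + 8 ends exactly where 28868656 starts, so they look like merge candidates that were nevertheless dispatched separately. A minimal user-space sketch of that contiguity check follows. The struct and function names are hypothetical and not kernel API, and the real block layer applies further limits (maximum request size, segment counts) that this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* One request as blktrace prints it: "<sector> + <nr_sectors>". */
struct req {
	unsigned long long sector;
	unsigned int nr_sectors;
};

/* True if b starts exactly where a ends, i.e. b could be back-merged into a. */
static bool is_contiguous(const struct req *a, const struct req *b)
{
	return a->sector + a->nr_sectors == b->sector;
}

/*
 * Count how many dispatches remain if every request that directly
 * follows its predecessor on disk is merged into it.
 */
static size_t count_dispatches(const struct req *reqs, size_t n)
{
	size_t dispatches = 0;

	for (size_t i = 0; i < n; i++)
		if (i == 0 || !is_contiguous(&reqs[i - 1], &reqs[i]))
			dispatches++;
	return dispatches;
}
```

Feeding it the two kswapd0 writes from the trace ({28868648, 8} and {28868656, 8}) yields a single dispatch instead of the two seen in the excerpt.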
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-21 8:51 ` Minchan Kim
2012-05-21 9:07 ` Christian Ehrhardt
0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2012-05-21 8:51 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blocktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Didn't I add my Reviewed-by on your previous version?
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:51 ` Minchan Kim
@ 2012-05-21 9:07 ` Christian Ehrhardt
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 9:07 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe
On 05/21/2012 10:51 AM, Minchan Kim wrote:
> On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
[...]
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> Acked-by: Rik van Riel<riel@redhat.com>
>> Acked-by: Jens Axboe<axboe@kernel.dk>
>
>
> Reviewed-by: Minchan Kim<minchan@kernel.org>
>
> Didn't I add my Reviewed-by on your previous version?
>
Sorry, I missed it, since you provided good feedback on all three
mails. I still had your "otherwise looks good to me" from mail #2 in mind
and didn't want to be so presumptuous as to convert that into a review or
ack statement.
--
Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-06-04 8:33 ` ehrhardt
2012-06-05 23:44 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: ehrhardt @ 2012-06-04 8:33 UTC (permalink / raw)
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in page
size requests, despite the fact that readahead submits 1<<page-cluster pages
at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swapin throughput but
also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly,
and the load feels more responsive and achieves more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:    560,888,   2,243MiB  Writes Queued:    226,242, 904,968KiB
Read Dispatches: 544,701,   2,243MiB  Write Dispatches: 159,318, 904,968KiB
Reads Requeued:        0              Writes Requeued:        0
Reads Completed: 544,716,   2,243MiB  Writes Completed: 159,321, 904,980KiB
Read Merges:      16,187,  64,748KiB  Write Merges:      61,744, 246,976KiB
IO unplugs:      149,614              Timer unplugs:      2,940

With the patch:
Reads Queued:    734,315,   2,937MiB  Writes Queued:    300,188,   1,200MiB
Read Dispatches: 214,972,   2,937MiB  Write Dispatches: 215,176,   1,200MiB
Reads Requeued:        0              Writes Requeued:        0
Reads Completed: 214,971,   2,937MiB  Writes Completed: 215,177,   1,200MiB
Read Merges:     519,343,   2,077MiB  Write Merges:      73,325, 293,300KiB
IO unplugs:      337,130              Timer unplugs:     11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
--
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-06-05 23:44 ` Andrew Morton
2012-06-20 15:58 ` Christian Ehrhardt
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-06-05 23:44 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, hughd, minchan
On Mon, 4 Jun 2012 10:33:22 +0200
ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
common to bother doing any fancy high-level aggregation: just toss it
at the queue and use the general BIO merging.
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blocktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
This is rather hard to understand. How much faster did it get?
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
> #include <linux/init.h>
> #include <linux/pagemap.h>
> #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = swp_offset(entry);
> unsigned long start_offset, end_offset;
> unsigned long mask = (1UL << page_cluster) - 1;
> + struct blk_plug plug;
>
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> + blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> continue;
> page_cache_release(page);
> }
> + blk_finish_plug(&plug);
> +
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
AFAICT this affects tmpfs as well, and it would be
interesting/useful/diligent to check for performance improvements or
regressions in that area.
And the patch doesn't help swapoff, in try_to_unuse(). Or any other
callers of swap_readpage(), if they exist.
The switch to explicit plugging might have caused swap regressions in
other areas so perhaps a more extensive patch is needed. But
swapin_readahead() covers most cases and a more extensive patch will
work OK with this one, so I guess we run with the simple patch for now.
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-05 23:44 ` Andrew Morton
@ 2012-06-20 15:58 ` Christian Ehrhardt
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ehrhardt @ 2012-06-20 15:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, axboe, hughd, minchan
On 06/06/2012 01:44 AM, Andrew Morton wrote:
> On Mon, 4 Jun 2012 10:33:22 +0200
> ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Swap readahead works fine, but the I/O to disk is almost always done in page
>> size requests, despite the fact that readahead submits 1<<page-cluster pages
>> at a time.
>> On older kernels the old per device plugging behavior might have captured
>> this and merged the requests, but currently all comes down to much more I/Os
>> than required.
>
> Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
> common to bother doing any fancy high-level aggregation: just toss it
> at the queue and use the general BIO merging.
>
>> On a single device this might not be an issue, but as soon as a server runs
>> on shared SAN resources saving I/Os not only improves swapin throughput but
>> also provides a lower resource utilization.
>>
>> With a load running KVM in a lot of memory overcommitment (the hot memory
>> is 1.5 times the host memory) swapping throughput improves significantly
>> and the load feels more responsive as well as achieves more throughput.
>>
>> In a test setup with 16 swap disks running blocktrace on one of those disks
>> shows the improved merging:
>> Prior:
>> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
>> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
>> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
>> IO unplugs: 149,614 Timer unplugs: 2,940
>>
>> With the patch:
>> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
>> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
>> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
>> IO unplugs: 337,130 Timer unplugs: 11,184
>
> This is rather hard to understand. How much faster did it get?
I got ~10% to ~40% more throughput in my cases and, at the same time, much
lower CPU consumption per transferred kilobyte (the majority of that due to
saved interrupts and better cache handling).
In a shared SAN others might get an additional benefit as well, because
this now causes less protocol overhead.
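The blktrace summaries quoted above also give the average size of a dispatched request: total data divided by the number of dispatches. A minimal user-space sketch of that arithmetic follows. The helper names are hypothetical, and the MiB totals are already rounded in the blktrace output, so the results are approximate.

```c
#include <stdio.h>

/* Average size of one dispatched request in KiB (0 if nothing was dispatched). */
static double avg_dispatch_kib(double total_kib, unsigned long dispatches)
{
	return dispatches ? total_kib / (double)dispatches : 0.0;
}

/* Print read and write request sizes for the two summaries quoted above. */
static void print_request_sizes(void)
{
	/* Totals given in MiB are converted to KiB; counts are the dispatch lines. */
	printf("read  prior:   %.2f KiB\n", avg_dispatch_kib(2243.0 * 1024, 544701));
	printf("read  patched: %.2f KiB\n", avg_dispatch_kib(2937.0 * 1024, 214972));
	printf("write prior:   %.2f KiB\n", avg_dispatch_kib(904968.0, 159318));
	printf("write patched: %.2f KiB\n", avg_dispatch_kib(1200.0 * 1024, 215176));
}
```

With these numbers the average read dispatch grows from roughly 4.2 KiB to roughly 14 KiB, while write dispatches stay at about 5.7 KiB, which matches the patch only touching the read path.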
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -14,6 +14,7 @@
>> #include<linux/init.h>
>> #include<linux/pagemap.h>
>> #include<linux/backing-dev.h>
>> +#include<linux/blkdev.h>
>> #include<linux/pagevec.h>
>> #include<linux/migrate.h>
>> #include<linux/page_cgroup.h>
>> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> unsigned long offset = swp_offset(entry);
>> unsigned long start_offset, end_offset;
>> unsigned long mask = (1UL<< page_cluster) - 1;
>> + struct blk_plug plug;
>>
>> /* Read a page_cluster sized and aligned cluster around offset. */
>> start_offset = offset& ~mask;
>> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> if (!start_offset) /* First page is swap header. */
>> start_offset++;
>>
>> + blk_start_plug(&plug);
>> for (offset = start_offset; offset<= end_offset ; offset++) {
>> /* Ok, do the async read-ahead now */
>> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> continue;
>> page_cache_release(page);
>> }
>> + blk_finish_plug(&plug);
>> +
>> lru_add_drain(); /* Push any new pages onto the LRU now */
>> return read_swap_cache_async(entry, gfp_mask, vma, addr);
>
> AFAICT this affects tmpfs as well, and it would be
> interesting/useful/diligent to check for performance improvements or
> regressions in that area.
>
A quick test with fio doing 256k sequential writes showed an
improvement of 9.1%, but since I'm not sure how big the noise is in this
test I'd be cautious with these results.
Unfortunately I didn't check CPU consumption - it might be that
with tmpfs that's the area where a bigger improvement could be seen.
Well, at least it didn't break - so that's a good result as well.
> And the patch doesn't help swapoff, in try_to_unuse(). Or any other
> callers of swap_readpage(), if they exist.
>
> The switch to explicit plugging might have caused swap regressions in
> other areas so perhaps a more extensive patch is needed. But
> swapin_readahead() covers most cases and a more extensive patch will
> work OK with this one, so I guess we run with the simple patch for now.
>
Yeah, all the other swap areas might need re-tuning after the plugging
changes as well, but swapoff, for example, shouldn't be too performance
critical, right?
As discussed before, I'd be more interested in getting the swap writeout
path to merge better as well.
Eventually - as you said - a later, more complex patch can follow and
take all these into account.
--
Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
Thread overview: 17+ messages
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-15 4:48 ` Minchan Kim
2012-05-21 7:24 ` Christian Ehrhardt
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-21 7:51 ` Christian Ehrhardt
2012-05-21 8:46 ` Minchan Kim
2012-05-15 18:24 ` Jens Axboe
-- strict thread matches above, loose matches on Subject: below --
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-05-21 8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-21 8:51 ` Minchan Kim
2012-05-21 9:07 ` Christian Ehrhardt
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-06-05 23:44 ` Andrew Morton
2012-06-20 15:58 ` Christian Ehrhardt