* [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
0 siblings, 2 replies; 11+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swapin throughput but
also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
With the patch:
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 11+ messages in thread
From: Minchan Kim @ 2012-05-15 4:38 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
It does make sense to me.
> ---
> mm/swap_state.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
> #include <linux/init.h>
> #include <linux/pagemap.h>
> #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = swp_offset(entry);
> unsigned long start_offset, end_offset;
> unsigned long mask = (1UL << page_cluster) - 1;
> + struct blk_plug plug;
>
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> + blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> continue;
> page_cache_release(page);
> }
> + blk_finish_plug(&plug);
> +
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
> }
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
@ 2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-05-21 8:09 ` ehrhardt
2012-05-21 8:51 ` Minchan Kim
0 siblings, 1 reply; 11+ messages in thread
From: ehrhardt @ 2012-05-21 8:09 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swapin throughput but
also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
With the patch:
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-21 8:51 ` Minchan Kim
2012-05-21 9:07 ` Christian Ehrhardt
0 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2012-05-21 8:51 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Didn't I add my Reviewed-by on your previous version?
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-21 8:51 ` Minchan Kim
@ 2012-05-21 9:07 ` Christian Ehrhardt
0 siblings, 0 replies; 11+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 9:07 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe
On 05/21/2012 10:51 AM, Minchan Kim wrote:
> On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
[...]
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> Acked-by: Rik van Riel<riel@redhat.com>
>> Acked-by: Jens Axboe<axboe@kernel.dk>
>
>
> Reviewed-by: Minchan Kim<minchan@kernel.org>
>
> Didn't I add my Reviewed-by on your previous version?
>
Sorry, I missed it since you provided good feedback on all three
mails. I still had your "otherwise looks good to me" reply to mail #2 in
mind and didn't want to be so presumptuous as to convert that into a
Reviewed-by or Acked-by statement.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* [PATCH 0/2] swap: improve swap I/O rate - V2
@ 2012-06-04 8:33 ehrhardt
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-06-04 8:33 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
0 siblings, 2 replies; 11+ messages in thread
From: ehrhardt @ 2012-06-04 8:33 UTC (permalink / raw)
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Ehrhardt Christian
From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
* Update in V3 *
- Added another Reviewed-by
- should be ready for upstream inclusion now
* Update in V2 *
- Adapted the documentation patch according to feedback of Minchan Kim
- Added the Acks I got to V1 so far
In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single-page 4k requests, despite the fact that swap in is a
batch of 1<<page-cluster pages as swap readahead and swap out is a list of
pages written in shrink_page_list.
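
For readers who want to see what "a batch of 1<<page-cluster pages" means
in terms of swap offsets, here is a minimal user-space sketch of the window
arithmetic. It follows the lines visible in the swapin_readahead() diff
context (mask, start_offset, the swap header check). The end_offset
computation is not part of the quoted context and is an assumption here, and
the names ra_window, ra_window_for and ra_window_pages are hypothetical
helpers, not kernel symbols.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper type for illustration, not a kernel structure. */
struct ra_window {
	unsigned long start;	/* first swap offset to read */
	unsigned long end;	/* last swap offset to read (inclusive) */
};

/*
 * Compute the page_cluster sized and aligned window around 'offset',
 * mirroring the arithmetic shown in the swapin_readahead() context.
 */
struct ra_window ra_window_for(unsigned long offset, unsigned int page_cluster)
{
	struct ra_window w;
	unsigned long mask;

	/* The shift below is only defined for shifts smaller than the word size. */
	assert(page_cluster < sizeof(unsigned long) * CHAR_BIT);
	mask = (1UL << page_cluster) - 1;

	w.start = offset & ~mask;
	w.end = offset | mask;	/* assumption: not visible in the quoted diff */
	if (!w.start)		/* first page is the swap header */
		w.start++;
	return w;
}

/* Number of pages the readahead loop would ask for. */
unsigned long ra_window_pages(struct ra_window w)
{
	return w.end - w.start + 1;
}
```

With page-cluster at 3, a fault on swap offset 13 gives the window 8..15,
that is eight page-sized reads which are adjacent on the swap device and
therefore candidates for merging. A fault on offset 5 gives 1..7, because
offset 0 holds the swap header.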
[1/2 swap in improvement]
The read patch shows improvements of up to 50% in swap throughput, much happier
guest systems and, even when running with comparable throughput, a lot of I/Os
per second saved, leaving resources in the SAN for other consumers.
[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.
Kind regards,
Christian Ehrhardt
Christian Ehrhardt (2):
swap: allow swap readahead to be merged
documentation: update how page-cluster affects swap I/O
Documentation/sysctl/vm.txt | 12 ++++++++++--
mm/swap_state.c | 5 +++++
2 files changed, 15 insertions(+), 2 deletions(-)
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-06-04 8:33 ` ehrhardt
2012-06-05 23:44 ` Andrew Morton
2012-06-04 8:33 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
1 sibling, 1 reply; 11+ messages in thread
From: ehrhardt @ 2012-06-04 8:33 UTC (permalink / raw)
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swapin throughput but
also lowers resource utilization.
With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.
In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
With the patch:
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-06-04 8:33 ` ehrhardt
1 sibling, 0 replies; 11+ messages in thread
From: ehrhardt @ 2012-06-04 8:33 UTC (permalink / raw)
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about how the tunable changes that
behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
Documentation/sysctl/vm.txt | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+the consecutive pages that readahead would have brought in.
+
=============================================================
panic_on_oom
--
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-06-05 23:44 ` Andrew Morton
2012-06-20 15:58 ` Christian Ehrhardt
0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2012-06-05 23:44 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, hughd, minchan
On Mon, 4 Jun 2012 10:33:22 +0200
ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<<page-cluster pages
> at a time.
> On older kernels the old per device plugging behavior might have captured
> this and merged the requests, but currently all comes down to much more I/Os
> than required.
Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
common to bother doing any fancy high-level aggregation: just toss it
at the queue and use the general BIO merging.
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also provides a lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory) swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
This is rather hard to understand. How much faster did it get?
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
> #include <linux/init.h>
> #include <linux/pagemap.h>
> #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = swp_offset(entry);
> unsigned long start_offset, end_offset;
> unsigned long mask = (1UL << page_cluster) - 1;
> + struct blk_plug plug;
>
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> + blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> continue;
> page_cache_release(page);
> }
> + blk_finish_plug(&plug);
> +
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
AFAICT this affects tmpfs as well, and it would be
interesting/useful/diligent to check for performance improvements or
regressions in that area.
And the patch doesn't help swapoff, in try_to_unuse(). Or any other
callers of swap_readpage(), if they exist.
The switch to explicit plugging might have caused swap regressions in
other areas so perhaps a more extensive patch is needed. But
swapin_readahead() covers most cases and a more extensive patch will
work OK with this one, so I guess we run with the simple patch for now.
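
As an illustration of why holding a batch back helps, the following
user-space sketch counts how many requests would reach the device if a batch
of page reads is collected first and pages with adjacent swap offsets are
merged. This is a deliberately simplified model and an assumption of this
note: the real block layer merge logic also depends on the I/O scheduler,
request size limits and timing. The function name count_merged_requests is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the requests dispatched for a batch of single-page reads, given
 * their swap offsets in ascending order, when directly adjacent pages are
 * merged into one request. Without any merging the answer would be n.
 */
size_t count_merged_requests(const unsigned long *offsets, size_t n)
{
	size_t requests = 0;
	size_t i;

	assert(n == 0 || offsets != NULL);

	for (i = 0; i < n; i++) {
		/* A new request starts unless this page directly follows the previous one. */
		if (i == 0 || offsets[i] != offsets[i - 1] + 1)
			requests++;
	}
	return requests;
}
```

For a full readahead window such as offsets 8..15 the model yields one
request instead of eight. If some pages in the window are skipped (for
example because they are already in the swap cache), the batch splits into
several smaller requests.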
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-06-05 23:44 ` Andrew Morton
@ 2012-06-20 15:58 ` Christian Ehrhardt
0 siblings, 0 replies; 11+ messages in thread
From: Christian Ehrhardt @ 2012-06-20 15:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, axboe, hughd, minchan
On 06/06/2012 01:44 AM, Andrew Morton wrote:
> On Mon, 4 Jun 2012 10:33:22 +0200
> ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Swap readahead works fine, but the I/O to disk is almost always done in page
>> size requests, despite the fact that readahead submits 1<<page-cluster pages
>> at a time.
>> On older kernels the old per device plugging behavior might have captured
>> this and merged the requests, but currently all comes down to much more I/Os
>> than required.
>
> Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
> common to bother doing any fancy high-level aggregation: just toss it
> at the queue and use the general BIO merging.
>
>> On a single device this might not be an issue, but as soon as a server runs
>> on shared san resources savin I/Os not only improves swapin throughput but
>> also provides a lower resource utilization.
>>
>> With a load running KVM in a lot of memory overcommitment (the hot memory
>> is 1.5 times the host memory) swapping throughput improves significantly
>> and the lead feels more responsive as well as achieves more throughput.
>>
>> In a test setup with 16 swap disks, running blktrace on one of those disks
>> shows the improved merging:
>> Prior:
>> Reads Queued:    560,888,   2,243MiB   Writes Queued:    226,242, 904,968KiB
>> Read Dispatches: 544,701,   2,243MiB   Write Dispatches: 159,318, 904,968KiB
>> Reads Requeued:        0               Writes Requeued:        0
>> Reads Completed: 544,716,   2,243MiB   Writes Completed: 159,321, 904,980KiB
>> Read Merges:      16,187,  64,748KiB   Write Merges:      61,744, 246,976KiB
>> IO unplugs:      149,614               Timer unplugs:      2,940
>>
>> With the patch:
>> Reads Queued:    734,315,   2,937MiB   Writes Queued:    300,188,   1,200MiB
>> Read Dispatches: 214,972,   2,937MiB   Write Dispatches: 215,176,   1,200MiB
>> Reads Requeued:        0               Writes Requeued:        0
>> Reads Completed: 214,971,   2,937MiB   Writes Completed: 215,177,   1,200MiB
>> Read Merges:     519,343,   2,077MiB   Write Merges:      73,325, 293,300KiB
>> IO unplugs:      337,130               Timer unplugs:     11,184
>
> This is rather hard to understand. How much faster did it get?
I measured roughly 10% to 40% more throughput in my test cases, and at the
same time much lower CPU consumption per transferred kilobyte (most of
that due to fewer interrupts and better cache behavior).
On a shared SAN, other hosts might get an additional benefit as well,
because the larger requests cause less protocol overhead.
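
One way to read the blktrace summaries quoted above is to divide the data
volume by the number of dispatches. The following is only a sketch: the
helper names avg_dispatch_kib() and merge_fraction() are made up for
illustration, and the inputs are the read-side values from the two reports.
It gives roughly 4.2 KiB per read dispatch before the patch and roughly
14 KiB with it, with about 3% of queued reads merged before versus about
71% after.

```c
#include <assert.h>
#include <stdio.h>

/* Average size of one dispatched request in KiB, given the total MiB moved.
 * Hypothetical helper, not part of blktrace or the kernel. */
double avg_dispatch_kib(double total_mib, unsigned long dispatches)
{
	assert(dispatches > 0);
	return total_mib * 1024.0 / (double)dispatches;
}

/* Fraction of queued requests that were merged into another request.
 * Hypothetical helper, not part of blktrace or the kernel. */
double merge_fraction(unsigned long merges, unsigned long queued)
{
	assert(queued > 0);
	return (double)merges / (double)queued;
}

/* Print the read-side comparison using the figures from the two reports. */
void report_reads(void)
{
	printf("prior:   %.1f KiB/dispatch, %.1f%% merged\n",
	       avg_dispatch_kib(2243.0, 544701UL),
	       100.0 * merge_fraction(16187UL, 560888UL));
	printf("patched: %.1f KiB/dispatch, %.1f%% merged\n",
	       avg_dispatch_kib(2937.0, 214972UL),
	       100.0 * merge_fraction(519343UL, 734315UL));
}
```
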
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/init.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/backing-dev.h>
>> +#include <linux/blkdev.h>
>>  #include <linux/pagevec.h>
>>  #include <linux/migrate.h>
>>  #include <linux/page_cgroup.h>
>> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  	unsigned long offset = swp_offset(entry);
>>  	unsigned long start_offset, end_offset;
>>  	unsigned long mask = (1UL << page_cluster) - 1;
>> +	struct blk_plug plug;
>>
>>  	/* Read a page_cluster sized and aligned cluster around offset. */
>>  	start_offset = offset & ~mask;
>> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  	if (!start_offset)	/* First page is swap header. */
>>  		start_offset++;
>>
>> +	blk_start_plug(&plug);
>>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>>  		/* Ok, do the async read-ahead now */
>>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  			continue;
>>  		page_cache_release(page);
>>  	}
>> +	blk_finish_plug(&plug);
>> +
>>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>
> AFAICT this affects tmpfs as well, and it would be
> interesting/useful/diligent to check for performance improvements or
> regressions in that area.
>
A quick test with fio doing 256k sequential writes showed an
improvement of 9.1%, but since I'm not sure how big the noise is in this
test, I'd be cautious with these results.
Unfortunately I didn't check CPU consumption; with tmpfs that might be
the area where a bigger improvement could be seen.
Well, at least it didn't break anything, so that's a good result as well.
> And the patch doesn't help swapoff, in try_to_unuse(). Or any other
> callers of swap_readpage(), if they exist.
>
> The switch to explicit plugging might have caused swap regressions in
> other areas so perhaps a more extensive patch is needed. But
> swapin_readahead() covers most cases and a more extensive patch will
> work OK with this one, so I guess we run with the simple patch for now.
>
Yeah, all the other swap areas might need re-tuning after the plugging
changes as well, but swapoff, for example, shouldn't be too performance
critical, right?
As discussed before, I'd be more interested in getting the swap writeout
path to merge requests better as well.
Eventually, as you said, a later and more complex patch can follow and
take all of these into account.
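
To illustrate why holding the readahead submissions back helps, here is a
user-space model of the quoted swapin_readahead() hunk. It is a sketch
under assumptions, not kernel code: cluster_window() mirrors the
start_offset/mask arithmetic from the diff, with end_offset taken as
offset | mask (that line is not part of the quoted hunks), and
merged_request_count() is a hypothetical stand-in for the block layer
merging contiguous pages while a plug is held.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the page_cluster sized and aligned window around a swap offset.
 * Offset 0 is skipped because the first page holds the swap header.
 * The end computation (offset | mask) is an assumption, see above. */
void cluster_window(unsigned long offset, unsigned int page_cluster,
		    unsigned long *start, unsigned long *end)
{
	unsigned long mask = (1UL << page_cluster) - 1;

	assert(start && end);
	*start = offset & ~mask;
	*end = offset | mask;
	if (!*start)
		(*start)++;
}

/* Simplified model of merging under a plug: given ascending page offsets
 * that were queued before dispatch, count how many requests remain once
 * directly adjacent offsets are merged into one request. Without any
 * merging the count would simply be n. */
size_t merged_request_count(const unsigned long *offsets, size_t n)
{
	size_t requests = 0;

	for (size_t i = 0; i < n; i++) {
		/* A new request starts wherever the run of offsets breaks. */
		if (i == 0 || offsets[i] != offsets[i - 1] + 1)
			requests++;
	}
	return requests;
}
```

In this model a fully contiguous window of eight pages collapses into one
request, while every gap in the window (for example a page that is already
in the swap cache) starts a new one.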
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
end of thread, other threads:[~2012-06-20 15:58 UTC | newest]
Thread overview: 11+ messages
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04 8:33 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-06-05 23:44 ` Andrew Morton
2012-06-20 15:58 ` Christian Ehrhardt
2012-06-04 8:33 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
-- strict thread matches above, loose matches on Subject: below --
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-05-21 8:09 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-21 8:51 ` Minchan Kim
2012-05-21 9:07 ` Christian Ehrhardt
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel