* [PATCH 0/2] swap: improve swap I/O rate
@ 2012-05-14 11:58 ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Ehrhardt Christian

From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single-page 4k requests, despite the fact that swap-in is a
batch of 1<<page-cluster pages via swap readahead, and swap-out is a list of
pages written in shrink_page_list.

[1/2 swap-in improvement]
The read patch shows improvements of up to 50% swap throughput, much happier
guest systems, and even when running with comparable throughput a lot of
I/Os per second saved, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
proc/sys/vm/page-cluster no longer matches the code.

[missing patch #3]
I tried to get a similar patch working for swap-out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected; I'm open to
suggestions for improving the pageout I/O sizes as well.

Kind regards,
Christian Ehrhardt


Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 11+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.

With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieves more throughput.
In a test setup with 16 swap disks, running blocktrace on one of those disks
shows the improved merging:

Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4
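[Editorial note] The aligned-window arithmetic the patch's context lines rely
on (mask, start_offset, end_offset) can be modeled outside the kernel; the
following is a hypothetical Python sketch of that math, not kernel code:

```python
def readahead_window(offset, page_cluster):
    """Model of the cluster math in swapin_readahead(): an aligned
    window of 1 << page_cluster swap slots around the faulting offset."""
    mask = (1 << page_cluster) - 1
    start_offset = offset & ~mask  # round down to the cluster boundary
    end_offset = offset | mask     # last slot of the same cluster
    if start_offset == 0:          # first swap slot holds the swap header
        start_offset += 1
    return start_offset, end_offset

# With the default page-cluster of 3, a fault on slot 13 reads slots 8..15.
print(readahead_window(13, 3))  # -> (8, 15)
```

With plugging around the submission loop, those up-to-eight page reads can be
merged into one request instead of being dispatched page by page.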
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 11+ messages in thread
From: Minchan Kim @ 2012-05-15 4:38 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks, running blocktrace on one of those
> disks shows the improved merging:
> Prior:
> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
> IO unplugs:       149,614               Timer unplugs:       2,940
>
> With the patch:
> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
> IO unplugs:       337,130               Timer unplugs:      11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Reviewed-by: Minchan Kim <minchan@kernel.org>

It does make sense to me.

> ---
>  mm/swap_state.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pagemap.h>
>  #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
>  #include <linux/pagevec.h>
>  #include <linux/migrate.h>
>  #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }

--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-15  4:38   ` Minchan Kim
@ 2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe

On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.

> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:48   ` Minchan Kim
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
  2012-05-15 18:24 ` Jens Axboe
  3 siblings, 1 reply; 11+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior
of the code, and add some notes about how the tunable changes that
behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+(if available) are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time). There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part
+of the consecutive pages that readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4
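[Editorial note] To make the logarithmic mapping in the patch concrete, here
is a small illustrative Python snippet (an editorial sketch, not part of the
patch):

```python
# page-cluster is log2 of the batch size: the kernel reads up to
# 1 << page_cluster consecutive swap slots in a single attempt.
def pages_per_attempt(page_cluster):
    return 1 << page_cluster

for v in range(4):
    print(v, "->", pages_per_attempt(v), "pages")
# 0 -> 1 page (readahead effectively disabled)
# 3 -> 8 pages (the default)
```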
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:48   ` Minchan Kim
  2012-05-21  7:24     ` Christian Ehrhardt
  0 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2012-05-15 4:48 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior
> of the code, and add some notes about how the tunable changes that
> behavior.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> ---
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM. See above.
>
>  page-cluster
>
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt. The swap I/O size.
> +page-cluster controls the number of pages up to which consecutive pages
> +(if available) are read in from swap in a single attempt. This is the swap

"If available" would be wrong in the next kernel, because Rik recently
submitted the following patch:

mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4

> +counterpart to page cache readahead.
> +The mentioned consecutivity is not in terms of virtual/physical addresses,
> +but consecutive on swap space - that means they were swapped out together.
>
>  It is a logarithmic value - setting it to zero means "1 page", setting
>  it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>
>  The default value is three (eight pages at a time). There may be some
>  small benefits in tuning this to a different value if your workload is
>  swap-intensive.
>
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if they would have been part
> +of the consecutive pages that readahead would have brought in.
> +
>  =============================================================
>
>  panic_on_oom

Otherwise, looks good to me.

--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-15  4:48 ` Minchan Kim
@ 2012-05-21  7:24   ` Christian Ehrhardt
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:24 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton

On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>
>> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior
>> of the code, and add some notes about how the tunable changes that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>> ---
>>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>>  1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM. See above.
>>
>>  page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt. The swap I/O size.
>> +page-cluster controls the number of pages up to which consecutive pages
>> +(if available) are read in from swap in a single attempt. This is the swap
>
> "If available" would be wrong in the next kernel, because Rik recently
> submitted the following patch:
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4

You're right - it's not severely wrong, but if we are fixing the
documentation we can do it right. I'll send a second version of the patch
series with this adapted and all the acks I got so far added.

--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:59 ` Minchan Kim
  2012-05-21  7:51   ` Christian Ehrhardt
  3 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2012-05-15 4:59 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead, and swap-out is a list
> of pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% swap throughput, much
> happier guest systems, and even when running with comparable throughput a
> lot of I/Os per second saved, leaving resources in the SAN for other
> consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list.
> It worked in functional terms, but the additional merging was negligible.

I think we have already done it.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so
we have already applied I/O plugging.

> Maybe the cond_resched triggers much more often than I expected; I'm open
> to suggestions for improving the pageout I/O sizes as well.

We could enhance write-out by batching, like ext4_bio_write_page.

>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O
>
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  mm/swap_state.c             |    5 +++++
>  2 files changed, 15 insertions(+), 2 deletions(-)

--
Kind regards,
Minchan Kim
* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-21  7:51   ` Christian Ehrhardt
  2012-05-21  8:46     ` Minchan Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

[...]
>> [missing patch #3]
>> I tried to get a similar patch working for swap-out in shrink_page_list.
>> It worked in functional terms, but the additional merging was negligible.
>
> I think we have already done it.
> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so
> we have already applied I/O plugging.

I saw that code and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments of plugging/unplugging
in shrink_page_list, the effective I/O size of swap writes stays at almost
4k.

Thereby so far I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blocktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:

94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
94,5   10      468     0.023522235   116  P   N [kswapd0]
94,5   10      469     0.023759892   116  I   W 28868648 + 8 (   237844) [kswapd0]
94,5   10      470     0.023760079   116  U   N [kswapd0] 1
94,5   10      471     0.023760360   116  D   W 28868648 + 8 (      468) [kswapd0]
94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
94,5   10      475     0.023944610   116  I   W 28868656 + 8 (    52500) [kswapd0]
94,5   10      476     0.023944735   116  U   N [kswapd0] 1
94,5   10      477     0.023944892   116  D   W 28868656 + 8 (      282) [kswapd0]
94,5   16       19     0.024023192 16033  C   W 28868648 + 8 (   262832) [0]
94,5   24       37     0.024196752 14526  C   W 28868656 + 8 (   251860) [0]
[...]

But we can split this discussion from my other two patches, and I would be
happy to provide my test environment for further tests if there are new
suggestions/patches/...

>> Maybe the cond_resched triggers much more often than I expected; I'm open
>> to suggestions for improving the pageout I/O sizes as well.
>
> We could enhance write-out by batching, like ext4_bio_write_page.

Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
of buffer layer in mpage_da_submit_io"?

--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-21  7:51 ` Christian Ehrhardt
@ 2012-05-21  8:46   ` Minchan Kim
  0 siblings, 0 replies; 11+ messages in thread
From: Minchan Kim @ 2012-05-21 8:46 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:
> [...]
>
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap-out in
>>> shrink_page_list. It worked in functional terms, but the additional
>>> merging was negligible.
>>
>> I think we have already done it.
>> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list,
>> so we have already applied I/O plugging.
>
> I saw that code and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments of plugging/unplugging
> in shrink_page_list, the effective I/O size of swap writes stays at
> almost 4k.

I meant your plugging in shrink_page_list is redundant.

> Thereby so far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient - at least for my case.

Yeb.
> You saw the blocktrace summaries in my first mail; an excerpt of a write
> submission stream looks like this:
>
> 94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
> 94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
> 94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
> 94,5   10      468     0.023522235   116  P   N [kswapd0]
> 94,5   10      469     0.023759892   116  I   W 28868648 + 8 (   237844) [kswapd0]
> 94,5   10      470     0.023760079   116  U   N [kswapd0] 1
> 94,5   10      471     0.023760360   116  D   W 28868648 + 8 (      468) [kswapd0]
> 94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
> 94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
> 94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
> 94,5   10      475     0.023944610   116  I   W 28868656 + 8 (    52500) [kswapd0]
> 94,5   10      476     0.023944735   116  U   N [kswapd0] 1
> 94,5   10      477     0.023944892   116  D   W 28868656 + 8 (      282) [kswapd0]
> 94,5   16       19     0.024023192 16033  C   W 28868648 + 8 (   262832) [0]
> 94,5   24       37     0.024196752 14526  C   W 28868656 + 8 (   251860) [0]
> [...]
>
> But we can split this discussion from my other two patches, and I would be
> happy to provide my test environment for further tests if there are new
> suggestions/patches/...
>
>>> Maybe the cond_resched triggers much more often than I expected; I'm
>>> open to suggestions for improving the pageout I/O sizes as well.
>>
>> We could enhance write-out by batching, like ext4_bio_write_page.
>
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io"?

Yeb, I think it's helpful for your case, but it's not trivial to implement,
IMHO.

--
Kind regards,
Minchan Kim
* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  ` (2 preceding siblings ...)
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-15 18:24 ` Jens Axboe
  3 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2012-05-15 18:24 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm

On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead, and swap-out is a list
> of pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% swap throughput, much
> happier guest systems, and even when running with comparable throughput a
> lot of I/Os per second saved, leaving resources in the SAN for other
> consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list.
> It worked in functional terms, but the additional merging was negligible.
> Maybe the cond_resched triggers much more often than I expected; I'm open
> to suggestions for improving the pageout I/O sizes as well.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O

Looks good to me, you can add my acked-by to both of them.

--
Jens Axboe
end of thread, other threads:[~2012-05-21  8:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15  4:38   ` Minchan Kim
2012-05-15 17:43   ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-15  4:48   ` Minchan Kim
2012-05-21  7:24     ` Christian Ehrhardt
2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-21  7:51   ` Christian Ehrhardt
2012-05-21  8:46     ` Minchan Kim
2012-05-15 18:24 ` Jens Axboe