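From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx117.postini.com [74.125.245.117]) by kanga.kvack.org (Postfix) with SMTP id 8C27D8D0001 for ; Mon, 14 May 2012 07:58:34 -0400 (EDT)
Received: from /spool/local by e06smtp16.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
 Violators will be prosecuted for from ; Mon, 14 May 2012 12:58:32 +0100
Received: from d06av02.portsmouth.uk.ibm.com (d06av02.portsmouth.uk.ibm.com [9.149.37.228]) by d06nrmr1407.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4EBwVMZ2449570 for ; Mon, 14 May 2012 12:58:31 +0100
Received: from d06av02.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av02.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4EBwVBa027298 for ; Mon, 14 May 2012 05:58:31 -0600
From: ehrhardt@linux.vnet.ibm.com
Subject: [PATCH 0/2] swap: improve swap I/O rate
Date: Mon, 14 May 2012 13:58:27 +0200
Message-Id: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: axboe@kernel.dk, Ehrhardt Christian

From: Ehrhardt Christian

From: Christian Ehrhardt

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single-page 4k requests. Despite the fact that swap in is a
batch of 1< pages written in shrink_page_list.

[1/2 swap-in improvement]
The read patch shows improvements of up to 50% in swap throughput, much happier
guest systems and, even when running with comparable throughput, a lot of I/Os per
second saved, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

[missing patch #3]
I tried to get a similar patch working for swap out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected; I'm open to
suggestions regarding improving the pageout I/O sizes as well.

Kind regards,
Christian Ehrhardt


Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id 612D88D0009 for ; Mon, 14 May 2012 07:58:36 -0400 (EDT)
Received: from /spool/local by e06smtp12.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!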
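 Violators will be prosecuted for from ; Mon, 14 May 2012 12:58:34 +0100
Received: from d06av06.portsmouth.uk.ibm.com (d06av06.portsmouth.uk.ibm.com [9.149.37.217]) by d06nrmr1407.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4EBwW3g2506836 for ; Mon, 14 May 2012 12:58:32 +0100
Received: from d06av06.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av06.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4EBwWaM026206 for ; Mon, 14 May 2012 05:58:32 -0600
From: ehrhardt@linux.vnet.ibm.com
Subject: [PATCH 1/2] swap: allow swap readahead to be merged
Date: Mon, 14 May 2012 13:58:28 +0200
Message-Id: <1336996709-8304-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: axboe@kernel.dk, Christian Ehrhardt

From: Christian Ehrhardt

Swap readahead works fine, but the I/O to disk is almost always done in page
size requests, despite the fact that readahead submits 1< at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more I/Os
than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also lowers resource utilization.

With a load running KVM with a lot of memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive and achieves more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
 Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
 Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
 IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
 Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
 Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
 Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
 IO unplugs:       337,130               Timer unplugs:      11,184

Signed-off-by: Christian Ehrhardt
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header.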
 */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id 062578D000B for ; Mon, 14 May 2012 07:58:37 -0400 (EDT)
Received: from /spool/local by e06smtp14.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
 Violators will be prosecuted for from ; Mon, 14 May 2012 12:58:36 +0100
Received: from d06av04.portsmouth.uk.ibm.com (d06av04.portsmouth.uk.ibm.com [9.149.37.216]) by d06nrmr1507.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4EBwXJ72465980 for ; Mon, 14 May 2012 12:58:33 +0100
Received: from d06av04.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av04.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4EBwXD2028891 for ; Mon, 14 May 2012 05:58:33 -0600
From: ehrhardt@linux.vnet.ibm.com
Subject: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
Date: Mon, 14 May 2012 13:58:29 +0200
Message-Id: <1336996709-8304-3-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: axboe@kernel.dk, Christian Ehrhardt

From: Christian Ehrhardt

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about what the tunable changes in that
behavior.

Signed-off-by: Christian Ehrhardt
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages (if
+available) are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time). There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx206.postini.com [74.125.245.206]) by kanga.kvack.org (Postfix) with SMTP id 68D8B6B0081 for ; Tue, 15 May 2012 00:38:25 -0400 (EDT)
Message-ID: <4FB1DDE0.2020007@kernel.org>
Date: Tue, 15 May 2012 13:38:56 +0900
From: Minchan Kim
MIME-Version: 1.0
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1336996709-8304-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk, Hugh Dickins , Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1< at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
>
> With a load running KVM with a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive and achieves more throughput.
>
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
>  Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
>  Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
>  Reads Requeued:         0               Writes Requeued:         0
>  Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
>  Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
>  IO unplugs:       149,614               Timer unplugs:       2,940
>
> With the patch:
>  Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
>  Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
>  Reads Requeued:         0               Writes Requeued:         0
>  Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
>  Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
>  IO unplugs:       337,130               Timer unplugs:      11,184
>
> Signed-off-by: Christian Ehrhardt

Reviewed-by: Minchan Kim

It does make sense to me.
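The blkparse read counters above reduce to two figures: the share of queued requests that were merged, and the average number of queued requests per dispatched request. The C sketch below does that arithmetic. It assumes that each queued read is a single 4 KiB page. The write counters support this, since 226,242 × 4 KiB = 904,968 KiB. The struct and function names are illustrative only and are not part of blktrace.

```c
#include <assert.h>
#include <stdio.h>

/* Read-side counters copied from a blkparse summary (illustrative struct). */
struct read_stats {
	unsigned long queued;     /* "Reads Queued" */
	unsigned long dispatched; /* "Read Dispatches" */
	unsigned long merged;     /* "Read Merges" */
};

/* Percentage of queued requests that were merged into another request. */
double merge_percent(const struct read_stats *s)
{
	assert(s->queued > 0);
	return 100.0 * (double)s->merged / (double)s->queued;
}

/* Average number of queued requests (assumed 4 KiB pages) per dispatch. */
double pages_per_dispatch(const struct read_stats *s)
{
	assert(s->dispatched > 0);
	return (double)s->queued / (double)s->dispatched;
}

/* Each queued request is either dispatched itself or merged into another. */
int counts_consistent(const struct read_stats *s)
{
	return s->queued == s->dispatched + s->merged;
}

/* Print a one-line summary for a labelled run. */
void report(const char *label, const struct read_stats *s)
{
	printf("%-6s merged %5.1f%%, %.2f pages/dispatch, consistent: %s\n",
	       label, merge_percent(s), pages_per_dispatch(s),
	       counts_consistent(s) ? "yes" : "no");
}
```

With the "Prior" values, `report("prior", &(struct read_stats){ 560888, 544701, 16187 })` gives about 2.9% merged and 1.03 pages per dispatch. The patched values (734,315 / 214,972 / 519,343) give about 70.7% and 3.42. In both runs, queued = dispatched + merged holds exactly.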
> ---
>  mm/swap_state.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }

-- 
Kind regards,
Minchan Kim
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id 084B16B0081 for ; Tue, 15 May 2012 00:47:44 -0400 (EDT)
Message-ID: <4FB1E00F.2000903@kernel.org>
Date: Tue, 15 May 2012 13:48:15 +0900
From: Minchan Kim
MIME-Version: 1.0
Subject: Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1336996709-8304-3-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-3-git-send-email-ehrhardt@linux.vnet.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk, Rik van Riel , Hugh Dickins , Andrew Morton

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt
>
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code, and add some comments about what the tunable changes in that
> behavior.
>
> Signed-off-by: Christian Ehrhardt
> ---
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM. See above.
>
>  page-cluster
>
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt. The swap I/O size.
> +page-cluster controls the number of pages up to which consecutive pages (if
> +available) are read in from swap in a single attempt.
> This is the swap

"If available" would be wrong in the next kernel, because Rik recently submitted the following patch:

mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4

> +counterpart to page cache readahead.
> +The mentioned consecutivity is not in terms of virtual/physical addresses,
> +but consecutive on swap space - that means they were swapped out together.
>
>  It is a logarithmic value - setting it to zero means "1 page", setting
>  it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>
>  The default value is three (eight pages at a time). There may be some
>  small benefits in tuning this to a different value if your workload is
>  swap-intensive.
>
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if they would have been part of
> +that consecutive pages readahead would have brought in.
> +
>  =============================================================
>
>  panic_on_oom

Otherwise,
Looks good to me.

-- 
Kind regards,
Minchan Kim
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 70E096B0081 for ; Tue, 15 May 2012 00:58:40 -0400 (EDT)
Message-ID: <4FB1E2A0.9050900@kernel.org>
Date: Tue, 15 May 2012 13:59:12 +0900
From: Minchan Kim
MIME-Version: 1.0
Subject: Re: [PATCH 0/2] swap: improve swap I/O rate
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk, Andrew Morton , Hugh Dickins , Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian
>
> From: Christian Ehrhardt
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests. Despite the fact that swap in is a
> batch of 1< pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% in swap throughput, much happier
> guest systems and, even when running with comparable throughput, a lot of I/Os per
> second saved, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.

I think we have already done it.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we have already applied
I/O plugging.
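Whether plugging helps depends on how many adjacent requests are queued before the plug is flushed. The toy model below illustrates that idea in user space. It is loosely inspired by the block layer's per-task plug list, but it is not the kernel's `blk_plug` code. Every name in it (`toy_plug`, `toy_submit`, `count_dispatches`) is hypothetical. The model back-merges a request only into the most recently held one. It counts dispatches for contiguous 8-sector (4 KiB) requests submitted either under one plug or with a flush after every request.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PLUGGED 64

/* One queued request: a contiguous range of 512-byte sectors. */
struct toy_req {
	unsigned long sector; /* first sector */
	unsigned long nr;     /* length in sectors */
};

/* Hypothetical stand-in for a per-task plug list. */
struct toy_plug {
	struct toy_req list[TOY_MAX_PLUGGED];
	size_t count;      /* requests currently held back by the plug */
	size_t dispatched; /* requests handed to the "device" so far */
};

void toy_start_plug(struct toy_plug *p)
{
	p->count = 0;
}

/* Flush: every held request becomes one dispatch; the plug is emptied. */
void toy_finish_plug(struct toy_plug *p)
{
	p->dispatched += p->count;
	p->count = 0;
}

/* Queue a request, back-merging it into the last held one if contiguous. */
void toy_submit(struct toy_plug *p, unsigned long sector, unsigned long nr)
{
	if (p->count > 0) {
		struct toy_req *last = &p->list[p->count - 1];

		if (last->sector + last->nr == sector) {
			last->nr += nr; /* adjacent: grow the held request */
			return;
		}
	}
	if (p->count == TOY_MAX_PLUGGED)
		toy_finish_plug(p); /* list full: flush before queuing more */
	p->list[p->count].sector = sector;
	p->list[p->count].nr = nr;
	p->count++;
}

/* Submit n contiguous 8-sector (4 KiB) requests starting at 'start'.
 * With flush_each set, the plug is flushed after every request. */
size_t count_dispatches(unsigned long start, size_t n, int flush_each)
{
	struct toy_plug p = { .count = 0, .dispatched = 0 };

	toy_start_plug(&p);
	for (size_t i = 0; i < n; i++) {
		toy_submit(&p, start + 8 * i, 8);
		if (flush_each) {
			toy_finish_plug(&p);
			toy_start_plug(&p);
		}
	}
	toy_finish_plug(&p);
	return p.dispatched;
}
```

In this model, `count_dispatches(1000, 8, 0)` yields a single 64-sector dispatch, while `count_dispatches(1000, 8, 1)` yields eight 4 KiB dispatches. The second case resembles a plug that is unplugged after every page.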
> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions regarding improving the pageout I/O sizes as well.

We could enhance writeout by batching, like ext4_bio_write_page does.

>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O
>
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  mm/swap_state.c             |    5 +++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
>

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id B7FC16B004D for ; Tue, 15 May 2012 13:43:54 -0400 (EDT)
Message-ID: <4FB295C5.7080008@redhat.com>
Date: Tue, 15 May 2012 13:43:33 -0400
From: Rik van Riel
MIME-Version: 1.0
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1336996709-8304-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk

On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt
>
> Swap
> readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1< at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more I/Os
> than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
>
> Signed-off-by: Christian Ehrhardt

Acked-by: Rik van Riel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id CF9566B004D for ; Tue, 15 May 2012 14:24:23 -0400 (EDT)
Message-ID: <4FB29F51.8060605@kernel.dk>
Date: Tue, 15 May 2012 20:24:17 +0200
From: Jens Axboe
MIME-Version: 1.0
Subject: Re: [PATCH 0/2] swap: improve swap I/O rate
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org

On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian
>
> From: Christian Ehrhardt
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests. Despite the fact that swap in is a
> batch of 1< pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% in swap throughput, much happier
> guest systems and, even when running with comparable throughput, a lot of I/Os per
> second saved, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions regarding improving the pageout I/O sizes as well.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O

Looks good to me, you can add my Acked-by to both of them.

-- 
Jens Axboe

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id 103DB6B0081 for ; Mon, 21 May 2012 03:25:25 -0400 (EDT)
Received: from /spool/local by e06smtp17.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
 Violators will be prosecuted for from ; Mon, 21 May 2012 08:25:22 +0100
Received: from d06av04.portsmouth.uk.ibm.com (d06av04.portsmouth.uk.ibm.com [9.149.37.216]) by d06nrmr1806.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4L7OrOc2474212 for ; Mon, 21 May 2012 08:24:54 +0100
Received: from d06av04.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av04.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4L7OoNG021594 for ; Mon, 21 May 2012 01:24:50 -0600
Message-ID: <4FB9EDC1.3070401@linux.vnet.ibm.com>
Date: Mon, 21 May 2012 09:24:49 +0200
From: Christian Ehrhardt
MIME-Version: 1.0
Subject: Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1336996709-8304-3-git-send-email-ehrhardt@linux.vnet.ibm.com> <4FB1E00F.2000903@kernel.org>
In-Reply-To: <4FB1E00F.2000903@kernel.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm@kvack.org, axboe@kernel.dk, Rik van Riel , Hugh Dickins , Andrew Morton

On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt
>>
>> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
>> the code, and add some comments about what the tunable changes in that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt
>> ---
>>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>>  1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM. See above.
>>
>>  page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt. The swap I/O size.
>> +page-cluster controls the number of pages up to which consecutive pages (if
>> +available) are read in from swap in a single attempt. This is the swap
>
>
> "If available" would be wrong in the next kernel, because Rik recently submitted the following patch:
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4
>
>

You're right - it's not severely wrong, but if we are fixing the documentation we should do it right.
I'll send a second version of the patch series with this adapted and all the acks I got so far added.

-- 

Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx159.postini.com [74.125.245.159]) by kanga.kvack.org (Postfix) with SMTP id 3E3CF6B0081 for ; Mon, 21 May 2012 03:51:32 -0400 (EDT)
Received: from /spool/local by e06smtp12.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
 Violators will be prosecuted for from ; Mon, 21 May 2012 08:51:30 +0100
Received: from d06av05.portsmouth.uk.ibm.com (d06av05.portsmouth.uk.ibm.com [9.149.37.229]) by d06nrmr1806.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4L7pSxn2150612 for ; Mon, 21 May 2012 08:51:28 +0100
Received: from d06av05.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av05.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4L7pSVA007928 for ; Mon, 21 May 2012 01:51:28 -0600
Message-ID: <4FB9F3FF.7030709@linux.vnet.ibm.com>
Date: Mon, 21 May 2012 09:51:27 +0200
From: Christian Ehrhardt
MIME-Version: 1.0
Subject: Re: [PATCH 0/2] swap: improve swap I/O rate
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <4FB1E2A0.9050900@kernel.org>
In-Reply-To: <4FB1E2A0.9050900@kernel.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm@kvack.org, axboe@kernel.dk, Andrew Morton , Hugh Dickins , Rik van Riel

[...]
>> [missing patch #3]
>> I tried to get a similar patch working for swap out in shrink_page_list. It
>> worked in functional terms, but the additional merging was negligible.
>
>
> I think we have already done it.
> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we have already applied
> I/O plugging.
>

I saw that code, and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments with plug/unplug in
shrink_page_list, the effective I/O size of swap writes stays at about 4k.

So far I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:

  94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
  94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
  94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
  94,5   10      468     0.023522235   116  P   N [kswapd0]
  94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
  94,5   10      470     0.023760079   116  U   N [kswapd0] 1
  94,5   10      471     0.023760360   116  D   W 28868648 + 8 (    468) [kswapd0]
  94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
  94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
  94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
  94,5   10      475     0.023944610   116  I   W 28868656 + 8 (  52500) [kswapd0]
  94,5   10      476     0.023944735   116  U   N [kswapd0] 1
  94,5   10      477     0.023944892   116  D   W 28868656 + 8 (    282) [kswapd0]
  94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
  94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
  [...]

But we can split this discussion off from my other two patches, and I would
be happy to provide my test environment for further tests if there are
new suggestions/patches/...

>> Maybe the cond_resched triggers much more often than I expected; I'm open to
>> suggestions regarding improving the pageout I/O sizes as well.
>
>
> We could enhance writeout by batching, like ext4_bio_write_page does.
>

Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
of buffer layer in mpage_da_submit_io"?

-- 

Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 273FA6B0082 for ; Mon, 21 May 2012 04:09:29 -0400 (EDT)
Received: from /spool/local by e06smtp16.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
 Violators will be prosecuted for from ; Mon, 21 May 2012 09:09:27 +0100
Received: from d06av10.portsmouth.uk.ibm.com (d06av10.portsmouth.uk.ibm.com [9.149.37.251]) by d06nrmr1307.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q4L89K8r2838772 for ; Mon, 21 May 2012 09:09:20 +0100
Received: from d06av10.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av10.portsmouth.uk.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q4L80MIs021795 for ; Mon, 21 May 2012 04:00:22 -0400
From: ehrhardt@linux.vnet.ibm.com
Subject: [PATCH 1/2] swap: allow swap readahead to be merged
Date: Mon, 21 May 2012 10:09:14 +0200
Message-Id: <1337587755-4743-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1337587755-4743-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
References: <1337587755-4743-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: axboe@kernel.dk, Christian Ehrhardt

From: Christian Ehrhardt

Swap readahead works fine, but the I/O to disk is almost always done in page
size requests, despite the fact that readahead submits 1<
Acked-by: Rik van Riel
Acked-by: Jens Axboe
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset =
swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id B4E9B6B0081 for ; Mon, 21 May 2012 04:46:27 -0400 (EDT)
Message-ID: <4FBA00E1.2020103@kernel.org>
Date: Mon, 21 May 2012 17:46:25 +0900
From: Minchan Kim
MIME-Version: 1.0
Subject: Re: [PATCH 0/2] swap: improve swap I/O rate
References: <1336996709-8304-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <4FB1E2A0.9050900@kernel.org> <4FB9F3FF.7030709@linux.vnet.ibm.com>
In-Reply-To: <4FB9F3FF.7030709@linux.vnet.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Christian Ehrhardt
Cc: linux-mm@kvack.org, axboe@kernel.dk, Andrew Morton , Hugh Dickins , Rik van Riel

On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:
> [...]
>
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap out in
>>> shrink_page_list. And
>>> it worked in functional terms, but the additional merging was negligible.
>>
>>
>> I think we have already done it.
>> Look at shrink_mem_cgroup_zone which ends up calling shrink_page_list
>> so we already have applied
>> I/O plugging.
>>
>
> I saw that code and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments of plug/unplug in
> shrink_page_list the effective I/O size of swap write stays at almost 4k.

I meant your plugging in shrink_page_list is redundant.

>
> So far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient, at least for my case.

Yep.

> You saw the blktrace summaries in my first mail; an excerpt of a write
> submission stream looks like this:
>
> 94,4 10 465 0.023520923 116 A W 28868648 + 8 <- (94,5) 28868456
> 94,5 10 466 0.023521173 116 Q W 28868648 + 8 [kswapd0]
> 94,5 10 467 0.023522048 116 G W 28868648 + 8 [kswapd0]
> 94,5 10 468 0.023522235 116 P N [kswapd0]
> 94,5 10 469 0.023759892 116 I W 28868648 + 8 ( 237844) [kswapd0]
> 94,5 10 470 0.023760079 116 U N [kswapd0] 1
> 94,5 10 471 0.023760360 116 D W 28868648 + 8 ( 468) [kswapd0]
> 94,4 10 472 0.023891235 116 A W 28868656 + 8 <- (94,5) 28868464
> 94,5 10 473 0.023891454 116 Q W 28868656 + 8 [kswapd0]
> 94,5 10 474 0.023892110 116 G W 28868656 + 8 [kswapd0]
> 94,5 10 475 0.023944610 116 I W 28868656 + 8 ( 52500) [kswapd0]
> 94,5 10 476 0.023944735 116 U N [kswapd0] 1
> 94,5 10 477 0.023944892 116 D W 28868656 + 8 ( 282) [kswapd0]
> 94,5 16 19 0.024023192 16033 C W 28868648 + 8 ( 262832) [0]
> 94,5 24 37 0.024196752 14526 C W 28868656 + 8 ( 251860) [0]
> [...]
>
> But we can split this discussion from my other two patches and I would
> be happy to provide my test environment for further tests if there are
> new suggestions/patches/...
>
>>> Maybe the cond_resched triggers much more often than I expected. I'm
>>> open to suggestions regarding improving the pageout I/O sizes as well.
>>
>>
>> We could enhance writeout by batching, like ext4_bio_write_page.
>>
>
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io"?

Yep, I think it's helpful for your case, but it's not trivial to implement, IMHO.

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4FBA0203.20509@kernel.org>
Date: Mon, 21 May 2012 17:51:15 +0900
From: Minchan Kim
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
References: <1337587755-4743-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1337587755-4743-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1337587755-4743-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk

On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<
> at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also lowers resource utilization.
>
> With a KVM load running under heavy memory overcommitment (the hot memory
> is 1.5 times the host memory), swap throughput improves significantly
> and the load feels more responsive as well as achieving more throughput.
>
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,   2,243MiB  Writes Queued:     226,242, 904,968KiB
> Read Dispatches:  544,701,   2,243MiB  Write Dispatches:  159,318, 904,968KiB
> Reads Requeued:         0              Writes Requeued:         0
> Reads Completed:   544,716,   2,243MiB  Writes Completed:  159,321, 904,980KiB
> Read Merges:       16,187,  64,748KiB  Write Merges:       61,744, 246,976KiB
> IO unplugs:       149,614              Timer unplugs:       2,940
>
> With the patch:
> Reads Queued:     734,315,   2,937MiB  Writes Queued:     300,188,   1,200MiB
> Read Dispatches:  214,972,   2,937MiB  Write Dispatches:  215,176,   1,200MiB
> Reads Requeued:         0              Writes Requeued:         0
> Reads Completed:   214,971,   2,937MiB  Writes Completed:  215,177,   1,200MiB
> Read Merges:      519,343,   2,077MiB  Write Merges:       73,325, 293,300KiB
> IO unplugs:       337,130              Timer unplugs:      11,184
>
> Signed-off-by: Christian Ehrhardt
> Acked-by: Rik van Riel
> Acked-by: Jens Axboe

Reviewed-by: Minchan Kim

Didn't I add my Reviewed-by on your previous version?

-- 
Kind regards,
Minchan Kim
From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4FBA05C2.5090306@linux.vnet.ibm.com>
Date: Mon, 21 May 2012 11:07:14 +0200
From: Christian Ehrhardt
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
References: <1337587755-4743-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1337587755-4743-2-git-send-email-ehrhardt@linux.vnet.ibm.com> <4FBA0203.20509@kernel.org>
In-Reply-To: <4FBA0203.20509@kernel.org>
To: Minchan Kim
Cc: linux-mm@kvack.org, axboe@kernel.dk

On 05/21/2012 10:51 AM, Minchan Kim wrote:
> On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt
>> [...]
>>
>> Signed-off-by: Christian Ehrhardt
>> Acked-by: Rik van Riel
>> Acked-by: Jens Axboe
>
>
> Reviewed-by: Minchan Kim
>
> Didn't I add my Reviewed-by on your previous version?
>

Sorry, I missed it because you gave good feedback on all three mails.
I still had your "otherwise looks good to me" on mail #2 in mind and didn't
want to be so presumptuous as to convert that into a review or ack statement.

-- 

Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

From mboxrd@z Thu Jan 1 00:00:00 1970
From: ehrhardt@linux.vnet.ibm.com
Subject: [PATCH 1/2] swap: allow swap readahead to be merged
Date: Mon, 4 Jun 2012 10:33:22 +0200
Message-Id: <1338798803-5009-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <1338798803-5009-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
References: <1338798803-5009-1-git-send-email-ehrhardt@linux.vnet.ibm.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: axboe@kernel.dk, hughd@google.com, minchan@kernel.org, Christian Ehrhardt

From: Christian Ehrhardt

Swap readahead works fine, but the I/O to disk is almost always done in page
size requests, despite the fact that readahead submits 1<
Acked-by: Rik van Riel
Acked-by: Jens Axboe
Reviewed-by: Minchan Kim
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4
From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 5 Jun 2012 16:44:42 -0700
From: Andrew Morton
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
Message-Id: <20120605164442.c7d12faa.akpm@linux-foundation.org>
In-Reply-To: <1338798803-5009-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
References: <1338798803-5009-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1338798803-5009-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
To: ehrhardt@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, axboe@kernel.dk, hughd@google.com, minchan@kernel.org

On Mon, 4 Jun 2012 10:33:22 +0200
ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt
>
> Swap readahead works fine, but the I/O to disk is almost always done in page
> size requests, despite the fact that readahead submits 1<
> at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.

Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
common to bother doing any fancy high-level aggregation: just toss it
at the queue and use the general BIO merging.

> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swapin throughput but
> also lowers resource utilization.
>
> With a KVM load running under heavy memory overcommitment (the hot memory
> is 1.5 times the host memory), swap throughput improves significantly
> and the load feels more responsive as well as achieving more throughput.
>
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,   2,243MiB  Writes Queued:     226,242, 904,968KiB
> Read Dispatches:  544,701,   2,243MiB  Write Dispatches:  159,318, 904,968KiB
> Reads Requeued:         0              Writes Requeued:         0
> Reads Completed:   544,716,   2,243MiB  Writes Completed:  159,321, 904,980KiB
> Read Merges:       16,187,  64,748KiB  Write Merges:       61,744, 246,976KiB
> IO unplugs:       149,614              Timer unplugs:       2,940
>
> With the patch:
> Reads Queued:     734,315,   2,937MiB  Writes Queued:     300,188,   1,200MiB
> Read Dispatches:  214,972,   2,937MiB  Write Dispatches:  215,176,   1,200MiB
> Reads Requeued:         0              Writes Requeued:         0
> Reads Completed:   214,971,   2,937MiB  Writes Completed:  215,177,   1,200MiB
> Read Merges:      519,343,   2,077MiB  Write Merges:       73,325, 293,300KiB
> IO unplugs:       337,130              Timer unplugs:      11,184

This is rather hard to understand. How much faster did it get?

> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>  
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>  
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);

AFAICT this affects tmpfs as well, and it would be
interesting/useful/diligent to check for performance improvements or
regressions in that area.

And the patch doesn't help swapoff, in try_to_unuse(). Or any other
callers of swap_readpage(), if they exist.

The switch to explicit plugging might have caused swap regressions in
other areas so perhaps a more extensive patch is needed. But
swapin_readahead() covers most cases and a more extensive patch will
work OK with this one, so I guess we run with the simple patch for now.

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4FE1F32E.6080401@linux.vnet.ibm.com>
Date: Wed, 20 Jun 2012 17:58:38 +0200
From: Christian Ehrhardt
Subject: Re: [PATCH 1/2] swap: allow swap readahead to be merged
References: <1338798803-5009-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1338798803-5009-2-git-send-email-ehrhardt@linux.vnet.ibm.com> <20120605164442.c7d12faa.akpm@linux-foundation.org>
In-Reply-To: <20120605164442.c7d12faa.akpm@linux-foundation.org>
To: Andrew Morton
Cc: linux-mm@kvack.org, axboe@kernel.dk, hughd@google.com, minchan@kernel.org

On 06/06/2012 01:44 AM, Andrew Morton wrote:
> On Mon, 4 Jun 2012 10:33:22 +0200
> ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt
>>
>> Swap readahead works fine, but the I/O to disk is almost always done in page
>> size requests, despite the fact that readahead submits 1<
>> at a time.
>> On older kernels the old per-device plugging behavior might have captured
>> this and merged the requests, but currently it all comes down to many more
>> I/Os than required.
>
> Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently
> common to bother doing any fancy high-level aggregation: just toss it
> at the queue and use the general BIO merging.
>
>> On a single device this might not be an issue, but as soon as a server runs
>> on shared SAN resources, saving I/Os not only improves swapin throughput but
>> also lowers resource utilization.
>>
>> With a KVM load running under heavy memory overcommitment (the hot memory
>> is 1.5 times the host memory), swap throughput improves significantly
>> and the load feels more responsive as well as achieving more throughput.
>>
>> In a test setup with 16 swap disks, running blktrace on one of those disks
>> shows the improved merging:
>> Prior:
>> Reads Queued:     560,888,   2,243MiB  Writes Queued:     226,242, 904,968KiB
>> Read Dispatches:  544,701,   2,243MiB  Write Dispatches:  159,318, 904,968KiB
>> Reads Requeued:         0              Writes Requeued:         0
>> Reads Completed:   544,716,   2,243MiB  Writes Completed:  159,321, 904,980KiB
>> Read Merges:       16,187,  64,748KiB  Write Merges:       61,744, 246,976KiB
>> IO unplugs:       149,614              Timer unplugs:       2,940
>>
>> With the patch:
>> Reads Queued:     734,315,   2,937MiB  Writes Queued:     300,188,   1,200MiB
>> Read Dispatches:  214,972,   2,937MiB  Write Dispatches:  215,176,   1,200MiB
>> Reads Requeued:         0              Writes Requeued:         0
>> Reads Completed:   214,971,   2,937MiB  Writes Completed:  215,177,   1,200MiB
>> Read Merges:      519,343,   2,077MiB  Write Merges:       73,325, 293,300KiB
>> IO unplugs:       337,130              Timer unplugs:      11,184
>
> This is rather hard to understand. How much faster did it get?

I got ~10% to ~40% more throughput in my cases and, at the same time, much
lower CPU consumption when broken down per transferred kilobyte (the majority
of that due to saved interrupts and better cache handling).
In a shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -14,6 +14,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  	unsigned long offset = swp_offset(entry);
>>  	unsigned long start_offset, end_offset;
>>  	unsigned long mask = (1UL << page_cluster) - 1;
>> +	struct blk_plug plug;
>>  
>>  	/* Read a page_cluster sized and aligned cluster around offset. */
>>  	start_offset = offset & ~mask;
>> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  	if (!start_offset)	/* First page is swap header. */
>>  		start_offset++;
>>  
>> +	blk_start_plug(&plug);
>>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>>  		/* Ok, do the async read-ahead now */
>>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  			continue;
>>  		page_cache_release(page);
>>  	}
>> +	blk_finish_plug(&plug);
>> +
>>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>
> AFAICT this affects tmpfs as well, and it would be
> interesting/useful/diligent to check for performance improvements or
> regressions in that area.
>

A quick test with fio doing 256k sequential writes showed an improvement of
9.1%, but since I'm not sure how large the noise is in this test, I'd be
cautious with these results. Unfortunately I didn't check CPU consumption;
with tmpfs that might be the area where a bigger improvement could be seen.
Well, at least it didn't break, so that's a good result as well.

> And the patch doesn't help swapoff, in try_to_unuse(). Or any other
> callers of swap_readpage(), if they exist.
>
> The switch to explicit plugging might have caused swap regressions in
> other areas so perhaps a more extensive patch is needed.
> But swapin_readahead() covers most cases and a more extensive patch will
> work OK with this one, so I guess we run with the simple patch for now.
>

Yeah, all the other swap areas might need re-tuning after the plugging
changes as well, but swapoff, for example, shouldn't be too performance
critical, right?
As discussed before, I'd be more interested in getting the swap writeout
path to merge requests better as well. Eventually, as you said, a later,
more complex patch can follow and take all of these into account.

-- 

Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance