* [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Ehrhardt Christian
Date: 2009-10-09 11:19 UTC
To: linux-mm, linux-kernel
Cc: Jens Axboe, Peter Zijlstra, Andrew Morton, Martin Schwidefsky, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

On one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
default and can be configured per block device queue. On the other hand,
many admins never touch that per-queue setting, so it is reasonable to
ship a sensible default.

This patch makes the value configurable via Kconfig and therefore allows
different defaults to be assigned depending on other Kconfig symbols.

Using this, the patch increases the default max readahead for s390,
improving sequential throughput in a lot of scenarios with almost no
drawbacks (only synthetic workloads with many concurrent sequential read
patterns on a very low-memory system suffer, due to the expected page
cache thrashing).

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 include/linux/mm.h |    2 +-
 mm/Kconfig         |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1169,7 +1169,7 @@ int write_one_page(struct page *page, in
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MAX_READAHEAD CONFIG_VM_MAX_READAHEAD /* kbytes */
 #define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -288,3 +288,22 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config VM_MAX_READAHEAD
+	int "Default max vm readahead size (16-4096 kbytes)"
+	default "512" if S390
+	default "128"
+	range 16 4096
+	help
+	  This entry specifies the default max size used to read ahead
+	  sequential access patterns, in kilobytes.
+
+	  The value can be configured per block device queue in sysfs
+	  (queue/read_ahead_kb); this symbol just defines the default.
+
+	  The default is 128, which has been the kernel default for years
+	  and should suit all kinds of Linux targets.
+
+	  Smaller values might be useful for very memory-constrained systems
+	  like some embedded systems to avoid page cache thrashing, while
+	  larger values can be beneficial to server installations.
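
For reference, the per-queue value the changelog refers to can already be
changed at runtime without a kernel rebuild. A minimal sketch, assuming
/dev/sda as a stand-in device:

    # Show the current max readahead for one queue, in kilobytes
    cat /sys/block/sda/queue/read_ahead_kb

    # Raise it to 512 KB for this device only; takes effect immediately
    echo 512 > /sys/block/sda/queue/read_ahead_kb

    # blockdev exposes the same knob, but in 512-byte sectors:
    # 1024 sectors == 512 KB
    blockdev --setra 1024 /dev/sda
    blockdev --getra /dev/sda

This is the userspace route that several of the replies below converge on.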
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Peter Zijlstra
Date: 2009-10-09 12:20 UTC
To: Ehrhardt Christian
Cc: linux-mm, linux-kernel, Jens Axboe, Andrew Morton, Martin Schwidefsky, Wu Fengguang

On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> On one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
> default and can be configured per block device queue. On the other hand,
> many admins never touch that per-queue setting, so it is reasonable to
> ship a sensible default.
>
> This patch makes the value configurable via Kconfig and therefore allows
> different defaults to be assigned depending on other Kconfig symbols.
>
> Using this, the patch increases the default max readahead for s390,
> improving sequential throughput in a lot of scenarios with almost no
> drawbacks (only synthetic workloads with many concurrent sequential read
> patterns on a very low-memory system suffer, due to the expected page
> cache thrashing).

Why can't this be solved in userspace?

Also, can't we simply raise this number if appropriate? Wu did some
read-ahead thrashing detection bits a long while back which should scale
the read-ahead window back when we're low on memory. Not sure that ever
made it in, but that sounds like a better option than having different
magic numbers for each platform.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Jens Axboe
Date: 2009-10-09 12:29 UTC
To: Peter Zijlstra
Cc: Ehrhardt Christian, linux-mm, linux-kernel, Andrew Morton, Martin Schwidefsky, Wu Fengguang

On Fri, Oct 09 2009, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some
> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory. Not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

Agree, making this a config option (and even defaulting to a different
number because of an arch setting) is crazy.

-- 
Jens Axboe
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Martin Schwidefsky
Date: 2009-10-09 13:49 UTC
To: Jens Axboe
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Andrew Morton, Wu Fengguang

On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > [snip]
> >
> > Why can't this be solved in userspace?
> >
> > Also, can't we simply raise this number if appropriate? Wu did some
> > read-ahead thrashing detection bits a long while back which should scale
> > the read-ahead window back when we're low on memory. Not sure that ever
> > made it in, but that sounds like a better option than having different
> > magic numbers for each platform.
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

The patch from Christian fixes a performance regression in the latest
distributions for s390, so we would opt for a larger value; 512 KB seems
to be a good one. I have no idea what that would do to the embedded
space, which is why Christian chose to make it configurable. Clearly the
better solution would be some sort of system control that can be
modified at runtime.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-09 13:58 UTC
To: Martin Schwidefsky
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > [snip]
> > Agree, making this a config option (and even defaulting to a different
> > number because of an arch setting) is crazy.
>
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512 KB seems
> to be a good one. I have no idea what that would do to the embedded
> space, which is why Christian chose to make it configurable. Clearly the
> better solution would be some sort of system control that can be
> modified at runtime.

So how about doing two patches together?

- lift the default readahead size to around 512 KB
- add some readahead logic to better support the thrashing case

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-11 1:10 UTC
To: Martin Schwidefsky
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Hi Martin,

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> [snip]
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512 KB seems
> to be a good one. I have no idea what that would do to the embedded
> space, which is why Christian chose to make it configurable. Clearly the
> better solution would be some sort of system control that can be
> modified at runtime.

May I ask for more details about your performance regression, and why it
is related to the readahead size? (We didn't change VM_MAX_READAHEAD..)

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Christian Ehrhardt
Date: 2009-10-12 5:53 UTC
To: Wu Fengguang
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Wu Fengguang wrote:
> On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> > [snip]
> > The patch from Christian fixes a performance regression in the latest
> > distributions for s390, so we would opt for a larger value; 512 KB
> > seems to be a good one. [snip]
>
> May I ask for more details about your performance regression, and why it
> is related to the readahead size? (We didn't change VM_MAX_READAHEAD..)

Sure. The performance regression appeared when comparing Novell SLES10
vs. SLES11: while you are right, Wu, that the upstream default never
changed so far, SLES10 had a patch applied that set 512.

As mentioned before, I didn't expect to get a generic 128->512 patch
accepted, hence the configurable solution. But after Peter and Jens
replied so quickly stating that changing the default in the kernel would
be the wrong way to go, I looked for userspace alternatives. At least for
my issues I could fix it with device-specific udev rules too. And as
Andrew mentioned, the diversity of devices causes any default to be wrong
for one installation or another. To address that, the udev approach can
also distinguish between different device types (which might be easier on
s390 than on other architectures, because I only need to take care of two
disk types atm - and both should get 512).

The testcase for anyone who wants to experiment with it is almost too
easy: the biggest impact can be seen with single-thread iozone - I get
~40% better throughput when increasing the readahead size to 512 (even
bigger RA sizes don't help much in my environment, probably due to fast
devices).

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
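
A device-specific udev rule of the kind mentioned above might look like
the following sketch. The rule file name and the device matches are
illustrative assumptions, with sd* standing in for SCSI disks and dasd*
for s390 DASDs - not the rules actually shipped:

    # Install an example rule so matching block devices get a 512 KB max
    # readahead as soon as they appear
    cat > /etc/udev/rules.d/60-readahead.rules <<'EOF'
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="512"
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="dasd[a-z]*", ATTR{queue/read_ahead_kb}="512"
    EOF
    udevadm control --reload-rules    # pick up the new rule without a reboot

The advantage over a one-shot boot script is that the setting is applied
to devices that are hotplugged or attached later, too.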
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-12 6:23 UTC
To: Christian Ehrhardt
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Mon, Oct 12, 2009 at 01:53:01PM +0800, Christian Ehrhardt wrote:
> [snip]
> Sure. The performance regression appeared when comparing Novell SLES10
> vs. SLES11: while you are right, Wu, that the upstream default never
> changed so far, SLES10 had a patch applied that set 512.

I see. I'm curious why SLES11 removed that patch. Did it experience some
regressions with the larger readahead size?

> As mentioned before, I didn't expect to get a generic 128->512 patch
> accepted, hence the configurable solution. But after Peter and Jens
> replied so quickly stating that changing the default in the kernel would
> be the wrong way to go, I looked for userspace alternatives. At least for
> my issues I could fix it with device-specific udev rules too.

OK.

> And as Andrew mentioned, the diversity of devices causes any default to
> be wrong for one installation or another. To address that, the udev
> approach can also distinguish between different device types (which
> might be easier on s390 than on other architectures, because I only need
> to take care of two disk types atm - and both should get 512).

I guess it's not a general solution for all. There are so many devices in
the world, and we have not yet considered the memory/workload
combinations.
> The testcase for anyone who wants to experiment with it is almost too
> easy: the biggest impact can be seen with single-thread iozone - I get
> ~40% better throughput when increasing the readahead size to 512 (even
> bigger RA sizes don't help much in my environment, probably due to fast
> devices).

That's an impressive number - I guess we need a larger default RA size.
But before that, let's learn something from SLES10's experiences :)

Thanks,
Fengguang
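
One plausible way to reproduce the comparison - a sketch only; the file
size, record size, path and device name are examples, not the exact setup
used above:

    # Compare sequential read throughput at two readahead settings.
    # Use a file comfortably larger than RAM, or the read pass will
    # mostly measure the page cache rather than the disk.
    for ra in 128 512; do
        echo $ra > /sys/block/sda/queue/read_ahead_kb
        echo 3 > /proc/sys/vm/drop_caches   # start each run with a cold cache
        iozone -i 0 -i 1 -s 1g -r 64k -f /mnt/test/iozone.tmp
    done

Here -i 0 writes the test file and -i 1 re-reads it sequentially; the
read/reread throughput columns are the numbers of interest.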
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Christian Ehrhardt
Date: 2009-10-12 9:29 UTC
To: Wu Fengguang
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Wu Fengguang wrote:
> [snip]
> I see. I'm curious why SLES11 removed that patch. Did it experience some
> regressions with the larger readahead size?

Only the obvious, expected one with very low free/cacheable memory and a
lot of parallel processes doing sequential I/O: the RA size scales up for
all of them, but 64 x max-RA then doesn't fit.

For example, iozone with 64 threads (each on a disk of its own) and a
sequential read access pattern, with I guess ~10 MB free for cache,
suffered by ~15% due to thrashing. But that is an acceptable regression,
because it is not a relevant customer scenario, while the benefits do
apply to customer scenarios.

[...]

> I guess it's not a general solution for all. There are so many devices
> in the world, and we have not yet considered the memory/workload
> combinations.

I completely agree - let me fix "my" issue via udev for now. And if some
day the readahead mechanism evolves and doesn't need any max RA at all,
we can all be happy.

[...]

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-12 9:39 UTC
To: Christian Ehrhardt
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Mon, Oct 12, 2009 at 05:29:48PM +0800, Christian Ehrhardt wrote:
> [snip]
> For example, iozone with 64 threads (each on a disk of its own) and a
> sequential read access pattern, with I guess ~10 MB free for cache,
> suffered by ~15% due to thrashing.

FYI, I just finished a patch for dealing with readahead thrashing. Will
do some tests and post the results :)

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Andrew Morton
Date: 2009-10-09 21:31 UTC
To: Jens Axboe
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Martin Schwidefsky, Wu Fengguang

On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > [snip]
> >
> > Why can't this be solved in userspace?
> >
> > Also, can't we simply raise this number if appropriate? Wu did some
> > read-ahead thrashing detection bits a long while back which should scale
> > the read-ahead window back when we're low on memory. Not sure that ever
> > made it in, but that sounds like a better option than having different
> > magic numbers for each platform.
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

Given the (increasing) level of disparity between different kinds of
storage devices, having _any_ default is crazy. It would be better to
make some sort of vaguely informed guess at runtime, based upon the
characteristics of the device.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Jens Axboe
Date: 2009-10-10 10:53 UTC
To: Andrew Morton
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Martin Schwidefsky, Wu Fengguang

On Fri, Oct 09 2009, Andrew Morton wrote:
> [snip]
> Given the (increasing) level of disparity between different kinds of
> storage devices, having _any_ default is crazy.

You have to start somewhere :-). 0 is a default, too.

> It would be better to make some sort of vaguely informed guess at
> runtime, based upon the characteristics of the device.

I'm pretty sure the readahead logic already does respond to e.g. memory
pressure; not sure if it attempts to do anything based on how quickly the
device is doing IO. Wu?

-- 
Jens Axboe
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-10 12:40 UTC
To: Jens Axboe
Cc: Andrew Morton, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Martin Schwidefsky

On Sat, Oct 10, 2009 at 06:53:33PM +0800, Jens Axboe wrote:
> On Fri, Oct 09 2009, Andrew Morton wrote:
> > [snip]
> > Given the (increasing) level of disparity between different kinds of
> > storage devices, having _any_ default is crazy.
>
> You have to start somewhere :-). 0 is a default, too.

Yes, an obvious and viable way is to start with a default size and back
off at runtime when thrashing is experienced. Ideally we would use a 4 MB
readahead size per disk; however, there are several constraints:

- readahead thrashing
  This can be detected and handled very well if necessary :)

- mmap readaround size
  Currently one single size is used for both sequential readahead and
  mmap readaround, and a larger readaround size risks more prefetch
  misses (compared to the pretty accurate readahead). I guess that, in
  spite of the increased readaround misses, a large readaround size would
  still help application startup time on a 4 GB desktop. However, it does
  risk working-set thrashing on memory-tight desktops. Maybe we can try
  to detect working-set thrashing too.

- IO latency
  Some workloads may be sensitive to IO latencies. max_sectors_kb may
  help keep IO latency under control with a large readahead size, but
  there may be some tradeoffs in the IO scheduler.

In summary, towards a runtime-dynamic prefetching size, we

- can reliably adapt the readahead size to readahead thrashing
- may reliably adapt the readaround size to working-set thrashing
- don't know in general whether a workload is IO-latency sensitive

> > It would be better to make some sort of vaguely informed guess at
> > runtime, based upon the characteristics of the device.
>
> I'm pretty sure the readahead logic already does respond to e.g. memory
> pressure;

Yes, it's much better than before. Once thrashed, old kernels were
basically reduced to doing 1-page (random) IOs, which is disastrous.

The current kernel behaves as follows. Given

        default_readahead_size > thrashing_readahead_size

the readahead sequence would be

        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        ...

So if read_size=1, it roughly holds that

        average_readahead_size = thrashing_readahead_size / log2(thrashing_readahead_size)
        thrashed_pages         = total_read_pages / 2

And if read_size=LONG_MAX (e.g. sendfile(large_file)):

        average_readahead_size = default_readahead_size
        thrashed_pages         = default_readahead_size - thrashing_readahead_size

In summary, readahead for sendfile() is not adaptive at all. Normal reads
are somewhat adaptive, but not optimal. Anyway, optimal thrashing
readahead is approachable if it's a desirable goal :)

> not sure if it attempts to do anything based on how quickly
> the device is doing IO. Wu?

Not in the current kernel. In fact it is possible to estimate the read
speed of each individual sequential stream, and possibly drop a hint to
the IO scheduler: "someone will block on this IO after 3 seconds". But it
may not deserve the complexity.

Thanks,
Fengguang
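
To put numbers on the ramp-up arithmetic above - a sketch; the thrashing
point of 512 pages and the read_size of one page are arbitrary
assumptions:

    # Simulate one ramp-up cycle: the window doubles from read_size until
    # it crosses the assumed thrashing point, then the sequence restarts.
    awk 'BEGIN {
            thrash = 512                 # assumed thrashing-safe size, in pages
            ra = 1; sum = 0; n = 0       # read_size = 1 page
            while (ra <= thrash) {
                    printf "%d ", ra     # the window sizes actually used
                    sum += ra; n++
                    ra *= 2
            }
            printf "\naverage window: %.1f pages\n", sum / n
            printf "thrash/log2(thrash): %.1f pages\n", thrash / (log(thrash) / log(2))
    }'

The simulated average and the thrash/log2(thrash) expression agree to
within a small constant factor, which is all the "roughly holds" above
claims.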
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Andrew Morton
Date: 2009-10-10 17:41 UTC
To: Wu Fengguang
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Martin Schwidefsky

On Sat, 10 Oct 2009 20:40:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > not sure if it attempts to do anything based on how quickly
> > the device is doing IO. Wu?
>
> Not in the current kernel. In fact it is possible to estimate the read
> speed of each individual sequential stream, and possibly drop a hint to
> the IO scheduler: "someone will block on this IO after 3 seconds". But
> it may not deserve the complexity.

Well, we have a test case. Would any of your design proposals address the
performance problem which motivated the s390 guys to propose this patch?
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-09 13:14 UTC
To: Peter Zijlstra
Cc: Ehrhardt Christian, linux-mm, linux-kernel, Jens Axboe, Andrew Morton, Martin Schwidefsky

On Fri, Oct 09, 2009 at 02:20:30PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some

Agreed, and Ehrhardt's 512 KB readahead size looks like a good default :)

> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory. Not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

The current kernel can roughly estimate the thrashing-safe size (the
context readahead). However, that's not enough: context readahead is
normally active only for interleaved reads, and the normal behavior is to
scale the readahead size up aggressively. For better support of embedded
systems, we may need a flag/mode which says: "we recently experienced
thrashing, so estimate and stick to the thrashing-safe size instead of
continuing to scale up the readahead size and thus risking thrashing
again".

Thanks,
Fengguang