* [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Ehrhardt Christian
Date: 2009-10-09 11:19 UTC
To: linux-mm, linux-kernel
Cc: Jens Axboe, Peter Zijlstra, Andrew Morton, Martin Schwidefsky, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

On one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
default and can be configured per block device queue. On the other hand,
many admins never touch that per-queue setting, so it is reasonable to
ship a sensible default.

This patch makes the value configurable via Kconfig and therefore allows
different defaults to be assigned depending on other Kconfig symbols.

Using this, the patch increases the default max readahead for s390,
improving sequential throughput in a lot of scenarios with almost no
drawbacks (only synthetic workloads with many concurrent sequential read
patterns on a very low-memory system suffer, due to the expected page
cache thrashing).

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 include/linux/mm.h |    2 +-
 mm/Kconfig         |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1169,7 +1169,7 @@ int write_one_page(struct page *page, in
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MAX_READAHEAD CONFIG_VM_MAX_READAHEAD /* kbytes */
 #define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -288,3 +288,22 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config VM_MAX_READAHEAD
+	int "Default max vm readahead size (16-4096 kbytes)"
+	default "512" if S390
+	default "128"
+	range 16 4096
+	help
+	  This entry specifies the default max size used to read ahead
+	  sequential access patterns, in kilobytes.
+
+	  The value can be configured per block device queue in sysfs
+	  (queue/read_ahead_kb); this symbol just defines the default.
+
+	  The default is 128, which has been the kernel default for years
+	  and should suit all kinds of Linux targets.
+
+	  Smaller values might be useful for very memory-constrained systems
+	  like some embedded systems to avoid page cache thrashing, while
+	  larger values can be beneficial to server installations.
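
For reference, the per-queue value the changelog refers to can already be
changed at runtime without a kernel rebuild. A minimal sketch, assuming
/dev/sda as a stand-in device:

    # Show the current max readahead for one queue, in kilobytes
    cat /sys/block/sda/queue/read_ahead_kb

    # Raise it to 512 KB for this device only; takes effect immediately
    echo 512 > /sys/block/sda/queue/read_ahead_kb

    # blockdev exposes the same knob, but in 512-byte sectors:
    # 1024 sectors == 512 KB
    blockdev --setra 1024 /dev/sda
    blockdev --getra /dev/sda

This is the userspace route that several of the replies below converge on.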
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Peter Zijlstra
Date: 2009-10-09 12:20 UTC
To: Ehrhardt Christian
Cc: linux-mm, linux-kernel, Jens Axboe, Andrew Morton, Martin Schwidefsky, Wu Fengguang

On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> On one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
> default and can be configured per block device queue. On the other hand,
> many admins never touch that per-queue setting, so it is reasonable to
> ship a sensible default.
>
> This patch makes the value configurable via Kconfig and therefore allows
> different defaults to be assigned depending on other Kconfig symbols.
>
> Using this, the patch increases the default max readahead for s390,
> improving sequential throughput in a lot of scenarios with almost no
> drawbacks (only synthetic workloads with many concurrent sequential read
> patterns on a very low-memory system suffer, due to the expected page
> cache thrashing).

Why can't this be solved in userspace?

Also, can't we simply raise this number if appropriate? Wu did some
read-ahead thrashing detection bits a long while back which should scale
the read-ahead window back when we're low on memory. Not sure that ever
made it in, but that sounds like a better option than having different
magic numbers for each platform.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Jens Axboe
Date: 2009-10-09 12:29 UTC
To: Peter Zijlstra
Cc: Ehrhardt Christian, linux-mm, linux-kernel, Andrew Morton, Martin Schwidefsky, Wu Fengguang

On Fri, Oct 09 2009, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some
> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory. Not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

Agree, making this a config option (and even defaulting to a different
number because of an arch setting) is crazy.

-- 
Jens Axboe
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Martin Schwidefsky
Date: 2009-10-09 13:49 UTC
To: Jens Axboe
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Andrew Morton, Wu Fengguang

On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > [snip]
> >
> > Why can't this be solved in userspace?
> >
> > Also, can't we simply raise this number if appropriate? Wu did some
> > read-ahead thrashing detection bits a long while back which should scale
> > the read-ahead window back when we're low on memory. Not sure that ever
> > made it in, but that sounds like a better option than having different
> > magic numbers for each platform.
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

The patch from Christian fixes a performance regression in the latest
distributions for s390, so we would opt for a larger value; 512 KB seems
to be a good one. I have no idea what that would do to the embedded
space, which is why Christian chose to make it configurable. Clearly the
better solution would be some sort of system control that can be
modified at runtime.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-09 13:58 UTC
To: Martin Schwidefsky
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > [snip]
> > Agree, making this a config option (and even defaulting to a different
> > number because of an arch setting) is crazy.
>
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512 KB seems
> to be a good one. I have no idea what that would do to the embedded
> space, which is why Christian chose to make it configurable. Clearly the
> better solution would be some sort of system control that can be
> modified at runtime.

So how about doing two patches together?

- lift the default readahead size to around 512 KB
- add some readahead logic to better support the thrashing case

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-11 1:10 UTC
To: Martin Schwidefsky
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Hi Martin,

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> [snip]
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512 KB seems
> to be a good one. I have no idea what that would do to the embedded
> space, which is why Christian chose to make it configurable. Clearly the
> better solution would be some sort of system control that can be
> modified at runtime.

May I ask for more details about your performance regression, and why it
is related to the readahead size? (We didn't change VM_MAX_READAHEAD..)

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Christian Ehrhardt
Date: 2009-10-12 5:53 UTC
To: Wu Fengguang
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Wu Fengguang wrote:
> On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> > [snip]
> > The patch from Christian fixes a performance regression in the latest
> > distributions for s390, so we would opt for a larger value; 512 KB
> > seems to be a good one. [snip]
>
> May I ask for more details about your performance regression, and why it
> is related to the readahead size? (We didn't change VM_MAX_READAHEAD..)

Sure. The performance regression appeared when comparing Novell SLES10
vs. SLES11: while you are right, Wu, that the upstream default never
changed so far, SLES10 had a patch applied that set 512.

As mentioned before, I didn't expect to get a generic 128->512 patch
accepted, hence the configurable solution. But after Peter and Jens
replied so quickly stating that changing the default in the kernel would
be the wrong way to go, I looked for userspace alternatives. At least for
my issues I could fix it with device-specific udev rules too. And as
Andrew mentioned, the diversity of devices causes any default to be wrong
for one installation or another. To address that, the udev approach can
also distinguish between different device types (which might be easier on
s390 than on other architectures, because I only need to take care of two
disk types atm - and both should get 512).

The testcase for anyone who wants to experiment with it is almost too
easy: the biggest impact can be seen with single-thread iozone - I get
~40% better throughput when increasing the readahead size to 512 (even
bigger RA sizes don't help much in my environment, probably due to fast
devices).

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
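
A device-specific udev rule of the kind mentioned above might look like
the following sketch. The rule file name and the device matches are
illustrative assumptions, with sd* standing in for SCSI disks and dasd*
for s390 DASDs - not the rules actually shipped:

    # Install an example rule so matching block devices get a 512 KB max
    # readahead as soon as they appear
    cat > /etc/udev/rules.d/60-readahead.rules <<'EOF'
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="512"
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="dasd[a-z]*", ATTR{queue/read_ahead_kb}="512"
    EOF
    udevadm control --reload-rules    # pick up the new rule without a reboot

The advantage over a one-shot boot script is that the setting is applied
to devices that are hotplugged or attached later, too.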
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-12 6:23 UTC
To: Christian Ehrhardt
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Mon, Oct 12, 2009 at 01:53:01PM +0800, Christian Ehrhardt wrote:
> [snip]
> Sure. The performance regression appeared when comparing Novell SLES10
> vs. SLES11: while you are right, Wu, that the upstream default never
> changed so far, SLES10 had a patch applied that set 512.

I see. I'm curious why SLES11 removed that patch. Did it experience some
regressions with the larger readahead size?

> As mentioned before, I didn't expect to get a generic 128->512 patch
> accepted, hence the configurable solution. But after Peter and Jens
> replied so quickly stating that changing the default in the kernel would
> be the wrong way to go, I looked for userspace alternatives. At least for
> my issues I could fix it with device-specific udev rules too.

OK.

> And as Andrew mentioned, the diversity of devices causes any default to
> be wrong for one installation or another. To address that, the udev
> approach can also distinguish between different device types (which
> might be easier on s390 than on other architectures, because I only need
> to take care of two disk types atm - and both should get 512).

I guess it's not a general solution for all. There are so many devices in
the world, and we have not yet considered the memory/workload
combinations.
> The testcase for anyone who wants to experiment with it is almost too
> easy: the biggest impact can be seen with single-thread iozone - I get
> ~40% better throughput when increasing the readahead size to 512 (even
> bigger RA sizes don't help much in my environment, probably due to fast
> devices).

That's an impressive number - I guess we need a larger default RA size.
But before that, let's learn something from SLES10's experiences :)

Thanks,
Fengguang
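
One plausible way to reproduce the comparison - a sketch only; the file
size, record size, path and device name are examples, not the exact setup
used above:

    # Compare sequential read throughput at two readahead settings.
    # Use a file comfortably larger than RAM, or the read pass will
    # mostly measure the page cache rather than the disk.
    for ra in 128 512; do
        echo $ra > /sys/block/sda/queue/read_ahead_kb
        echo 3 > /proc/sys/vm/drop_caches   # start each run with a cold cache
        iozone -i 0 -i 1 -s 1g -r 64k -f /mnt/test/iozone.tmp
    done

Here -i 0 writes the test file and -i 1 re-reads it sequentially; the
read/reread throughput columns are the numbers of interest.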
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Christian Ehrhardt
Date: 2009-10-12 9:29 UTC
To: Wu Fengguang
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

Wu Fengguang wrote:
> [snip]
> I see. I'm curious why SLES11 removed that patch. Did it experience some
> regressions with the larger readahead size?

Only the obvious, expected one with very low free/cacheable memory and a
lot of parallel processes doing sequential I/O: the RA size scales up for
all of them, but 64 x max-RA then doesn't fit.

For example, iozone with 64 threads (each on a disk of its own) and a
sequential read access pattern, with I guess ~10 MB free for cache,
suffered by ~15% due to thrashing. But that is an acceptable regression,
because it is not a relevant customer scenario, while the benefits do
apply to customer scenarios.

[...]

> I guess it's not a general solution for all. There are so many devices
> in the world, and we have not yet considered the memory/workload
> combinations.

I completely agree - let me fix "my" issue via udev for now. And if some
day the readahead mechanism evolves and doesn't need any max RA at all,
we can all be happy.

[...]

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-12 9:39 UTC
To: Christian Ehrhardt
Cc: Martin Schwidefsky, Jens Axboe, Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton

On Mon, Oct 12, 2009 at 05:29:48PM +0800, Christian Ehrhardt wrote:
> [snip]
> For example, iozone with 64 threads (each on a disk of its own) and a
> sequential read access pattern, with I guess ~10 MB free for cache,
> suffered by ~15% due to thrashing.

FYI, I just finished a patch for dealing with readahead thrashing. Will
do some tests and post the results :)

Thanks,
Fengguang
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Andrew Morton
Date: 2009-10-09 21:31 UTC
To: Jens Axboe
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Martin Schwidefsky, Wu Fengguang

On Fri, 9 Oct 2009 14:29:52 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > [snip]
> >
> > Why can't this be solved in userspace?
> >
> > Also, can't we simply raise this number if appropriate? Wu did some
> > read-ahead thrashing detection bits a long while back which should scale
> > the read-ahead window back when we're low on memory. Not sure that ever
> > made it in, but that sounds like a better option than having different
> > magic numbers for each platform.
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

Given the (increasing) level of disparity between different kinds of
storage devices, having _any_ default is crazy. It would be better to
make some sort of vaguely informed guess at runtime, based upon the
characteristics of the device.
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Jens Axboe
Date: 2009-10-10 10:53 UTC
To: Andrew Morton
Cc: Peter Zijlstra, Ehrhardt Christian, linux-mm, linux-kernel, Martin Schwidefsky, Wu Fengguang

On Fri, Oct 09 2009, Andrew Morton wrote:
> [snip]
> Given the (increasing) level of disparity between different kinds of
> storage devices, having _any_ default is crazy.

You have to start somewhere :-). 0 is a default, too.

> It would be better to make some sort of vaguely informed guess at
> runtime, based upon the characteristics of the device.

I'm pretty sure the readahead logic already does respond to e.g. memory
pressure; not sure if it attempts to do anything based on how quickly the
device is doing IO. Wu?

-- 
Jens Axboe
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-10 12:40 UTC
To: Jens Axboe
Cc: Andrew Morton, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Martin Schwidefsky

On Sat, Oct 10, 2009 at 06:53:33PM +0800, Jens Axboe wrote:
> On Fri, Oct 09 2009, Andrew Morton wrote:
> > [snip]
> > Given the (increasing) level of disparity between different kinds of
> > storage devices, having _any_ default is crazy.
>
> You have to start somewhere :-). 0 is a default, too.

Yes, an obvious and viable way is to start with a default size and back
off at runtime when thrashing is experienced. Ideally we would use a 4 MB
readahead size per disk; however, there are several constraints:

- readahead thrashing
  This can be detected and handled very well if necessary :)

- mmap readaround size
  Currently one single size is used for both sequential readahead and
  mmap readaround, and a larger readaround size risks more prefetch
  misses (compared to the pretty accurate readahead). I guess that, in
  spite of the increased readaround misses, a large readaround size would
  still help application startup time on a 4 GB desktop. However, it does
  risk working-set thrashing on memory-tight desktops. Maybe we can try
  to detect working-set thrashing too.

- IO latency
  Some workloads may be sensitive to IO latencies. max_sectors_kb may
  help keep IO latency under control with a large readahead size, but
  there may be some tradeoffs in the IO scheduler.

In summary, towards a runtime-dynamic prefetching size, we

- can reliably adapt the readahead size to readahead thrashing
- may reliably adapt the readaround size to working-set thrashing
- don't know in general whether a workload is IO-latency sensitive

> > It would be better to make some sort of vaguely informed guess at
> > runtime, based upon the characteristics of the device.
>
> I'm pretty sure the readahead logic already does respond to e.g. memory
> pressure;

Yes, it's much better than before. Once thrashed, old kernels were
basically reduced to doing 1-page (random) IOs, which is disastrous.

The current kernel behaves as follows. Given

        default_readahead_size > thrashing_readahead_size

the readahead sequence would be

        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
        ...

So if read_size=1, it roughly holds that

        average_readahead_size = thrashing_readahead_size / log2(thrashing_readahead_size)
        thrashed_pages         = total_read_pages / 2

And if read_size=LONG_MAX (e.g. sendfile(large_file)):

        average_readahead_size = default_readahead_size
        thrashed_pages         = default_readahead_size - thrashing_readahead_size

In summary, readahead for sendfile() is not adaptive at all. Normal reads
are somewhat adaptive, but not optimal. Anyway, optimal thrashing
readahead is approachable if it's a desirable goal :)

> not sure if it attempts to do anything based on how quickly
> the device is doing IO. Wu?

Not in the current kernel. In fact it is possible to estimate the read
speed of each individual sequential stream, and possibly drop a hint to
the IO scheduler: "someone will block on this IO after 3 seconds". But it
may not deserve the complexity.

Thanks,
Fengguang
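
To put numbers on the ramp-up arithmetic above - a sketch; the thrashing
point of 512 pages and the read_size of one page are arbitrary
assumptions:

    # Simulate one ramp-up cycle: the window doubles from read_size until
    # it crosses the assumed thrashing point, then the sequence restarts.
    awk 'BEGIN {
            thrash = 512                 # assumed thrashing-safe size, in pages
            ra = 1; sum = 0; n = 0       # read_size = 1 page
            while (ra <= thrash) {
                    printf "%d ", ra     # the window sizes actually used
                    sum += ra; n++
                    ra *= 2
            }
            printf "\naverage window: %.1f pages\n", sum / n
            printf "thrash/log2(thrash): %.1f pages\n", thrash / (log(thrash) / log(2))
    }'

The simulated average and the thrash/log2(thrash) expression agree to
within a small constant factor, which is all the "roughly holds" above
claims.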
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Andrew Morton
Date: 2009-10-10 17:41 UTC
To: Wu Fengguang
Cc: Jens Axboe, Peter Zijlstra, Ehrhardt Christian, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Martin Schwidefsky

On Sat, 10 Oct 2009 20:40:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > not sure if it attempts to do anything based on how quickly
> > the device is doing IO. Wu?
>
> Not in the current kernel. In fact it is possible to estimate the read
> speed of each individual sequential stream, and possibly drop a hint to
> the IO scheduler: "someone will block on this IO after 3 seconds". But
> it may not deserve the complexity.

Well, we have a test case. Would any of your design proposals address the
performance problem which motivated the s390 guys to propose this patch?
* Re: [PATCH] mm: make VM_MAX_READAHEAD configurable
From: Wu Fengguang
Date: 2009-10-09 13:14 UTC
To: Peter Zijlstra
Cc: Ehrhardt Christian, linux-mm, linux-kernel, Jens Axboe, Andrew Morton, Martin Schwidefsky

On Fri, Oct 09, 2009 at 02:20:30PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some

Agreed, and Ehrhardt's 512 KB readahead size looks like a good default :)

> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory. Not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

The current kernel can roughly estimate the thrashing-safe size (the
context readahead). However, that's not enough: context readahead is
normally active only for interleaved reads, and the normal behavior is to
scale the readahead size up aggressively. For better support of embedded
systems, we may need a flag/mode which says: "we recently experienced
thrashing, so estimate and stick to the thrashing-safe size instead of
continuing to scale up the readahead size and thus risking thrashing
again".

Thanks,
Fengguang