Re: [PATCH 03/11] readahead: bump up the default readahead size

linux-embedded.vger.kernel.org archive mirror

* Re: [PATCH 03/11] readahead: bump up the default readahead size
       [not found]   ` <4B6FBB3F.4010701@linux.vnet.ibm.com>
@ 2010-02-08 13:46     ` Wu Fengguang
  2010-02-11 21:37       ` Matt Mackall
  0 siblings, 1 reply; 7+ messages in thread
From: Wu Fengguang @ 2010-02-08 13:46 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, Jens Axboe, Chris Mason, Peter Zijlstra,
	Martin Schwidefsky, Clemens Ladisch, Olivier Galibert,
	Linux Memory Management List, linux-fsdevel@vger.kernel.org, LKML,
	Paul Gortmaker, Matt Mackall, David Woodhouse, linux-embedded

Chris,

First, let me inform the linux-embedded maintainers :)

I think it's a good suggestion to add a config option
(CONFIG_READAHEAD_SIZE). I will update the patch.

Thanks,
Fengguang
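
A minimal sketch of how such a config option could feed the existing
define. CONFIG_READAHEAD_SIZE is only proposed in this thread, so the
fallback value, EXAMPLE_PAGE_SIZE (4KB assumed) and the helper
ra_kb_to_pages() are hypothetical names used for illustration, not
kernel code:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical: CONFIG_READAHEAD_SIZE is only proposed in this thread and
 * would normally be generated from Kconfig. Fall back to the new default. */
#ifndef CONFIG_READAHEAD_SIZE
#define CONFIG_READAHEAD_SIZE 512        /* kbytes */
#endif

#define VM_MAX_READAHEAD CONFIG_READAHEAD_SIZE   /* kbytes */

/* Assumed page size for this example only. */
#define EXAMPLE_PAGE_SIZE 4096UL

/* Convert a readahead size in kbytes to a number of pages. */
unsigned long ra_kb_to_pages(unsigned long kb)
{
    assert(kb <= ULONG_MAX / 1024UL);    /* guard the multiplication */
    return (kb * 1024UL) / EXAMPLE_PAGE_SIZE;
}
```

With 4KB pages, the proposed 512KB default corresponds to 128 pages and
the old 128KB default to 32 pages.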

On Mon, Feb 08, 2010 at 03:20:31PM +0800, Christian Ehrhardt wrote:
> This is related to our discussion from October 2009, e.g.
> http://lkml.indiana.edu/hypermail/linux/kernel/0910.1/01468.html
> 
> I work on s390 where, being a mainframe, we only have environments that
> benefit from 512k readahead, but I still expect some embedded devices won't.
> While my idea of making it configurable was not liked in the past, it
> may still be useful when introducing this default change, to let some
> small devices choose without patching the source (a number field defaulting
> to 512 and explaining the history of that value would be really nice).
> 
> For the discussion of 512 vs. 128, I can add from my measurements that I
> have seen the following:
> - 512 is by far superior to 128 for sequential reads
> - improvements of up to +35% with iozone sequential reads, scaling from 1 to
>   64 parallel processes
> - readahead sizes larger than 512 proved not to be "more useful", while
>   increasing the chance of thrashing in low-memory systems
> 
> So I appreciate this change with a little note that I would prefer a 
> config option.
> -> tested & acked-by Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Wu Fengguang wrote:
>  >
>  > Use 512kb max readahead size, and 32kb min readahead size.
>  >
>  > The former helps io performance for common workloads.
>  > The latter will be used in the thrashing safe context readahead.
>  >
>  > -- Rationale for the 512kb size --
>  >
>  > I believe it yields more I/O throughput without noticeably increasing
>  > I/O latency for today's HDD.
>  >
>  > For example, for a 100MB/s and 8ms access time HDD, its random IO or
>  > highly concurrent sequential IO would in theory be:
>  >
>  > io_size  access_time  transfer_time  io_latency    util%  throughput
>  >    (KB)         (ms)           (ms)        (ms)               (KB/s)
>  >       4            8           0.04        8.04    0.49%      497.57
>  >       8            8           0.08        8.08    0.97%      990.33
>  >      16            8           0.16        8.16    1.92%     1961.69
>  >      32            8           0.31        8.31    3.76%     3849.62
>  >      64            8           0.62        8.62    7.25%     7420.29
>  >     128            8           1.25        9.25   13.51%    13837.84
>  >     256            8           2.50       10.50   23.81%    24380.95
>  >     512            8           5.00       13.00   38.46%    39384.62
>  >    1024            8          10.00       18.00   55.56%    56888.89
>  >    2048            8          20.00       28.00   71.43%    73142.86
>  >    4096            8          40.00       48.00   83.33%    85333.33
>  >
>  > The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to
>  > ~39MB/s, while merely increasing the (minimal) IO latency from 9.25ms to 13ms.
>  >
>  > As for SSDs, I find that the Intel X25-M SSD benefits from a large
>  > readahead size even for sequential reads:
>  >
>  >     rasize    1st run     2nd run
>  >     ----------------------------------
>  >       4k      123 MB/s    122 MB/s
>  >      16k      153 MB/s    153 MB/s
>  >      32k      161 MB/s    162 MB/s
>  >      64k      167 MB/s    168 MB/s
>  >     128k      197 MB/s    197 MB/s
>  >     256k      217 MB/s    217 MB/s
>  >     512k      238 MB/s    234 MB/s
>  >       1M      251 MB/s    248 MB/s
>  >       2M      259 MB/s    257 MB/s
>  >       4M      269 MB/s    264 MB/s
>  >       8M      266 MB/s    266 MB/s
>  >
>  > The two other impacts of an enlarged readahead size are
>  >
>  > - memory footprint (caused by readahead miss)
>  >     Sequential readahead hit ratio is pretty high regardless of max
>  >     readahead size; the extra memory footprint is mainly caused by
>  >     enlarged mmap read-around.
>  >     I measured my desktop:
>  >     - under Xwindow:
>  >         128KB readahead hit ratio = 143MB/230MB = 62%
>  >         512KB readahead hit ratio = 138MB/248MB = 55%
>  >           1MB readahead hit ratio = 130MB/253MB = 51%
>  >     - under console: (seems more stable than the Xwindow data)
>  >         128KB readahead hit ratio = 30MB/56MB   = 53%
>  >           1MB readahead hit ratio = 30MB/59MB   = 51%
>  >     So the impact to memory footprint looks acceptable.
>  >
>  > - readahead thrashing
>  >     It will now cost 1MB readahead buffer per stream.  Memory tight
>  >     systems typically do not run multiple streams; but if they do
>  >     so, it should help I/O performance as long as we can avoid
>  >     thrashing, which can be achieved with the following patches.
>  >
>  > -- Benchmarks by Vivek Goyal --
>  >
>  > I have two paths to the HP EVA and a multipath device set up (dm-3).
>  > I ran an increasing number of sequential readers. The file system is ext3
>  > and the file size is 1G.
>  > I ran the tests 3 times (3 sets) and took the average.
>  >
>  > Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
>  > ======================================================================
>  >                     2.6.33-rc5                2.6.33-rc5-readahead
>  > job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)
>  > ---   --- --  ------------   -----------    ------------   -----------
>  > bsr   3   1   141768         130965         190302         97937.3   
>  > bsr   3   2   131979         135402         185636         223286    
>  > bsr   3   4   132351         420733         185986         363658    
>  > bsr   3   8   133152         455434         184352         428478    
>  > bsr   3   16  130316         674499         185646         594311    
>  >
>  > I ran same test on a different piece of hardware. There are few SATA
>  > disks (5-6) in striped configuration behind a hardware RAID controller.
>  >
>  > Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
>  > ======================================================================
>  >                     2.6.33-rc5                2.6.33-rc5-readahead
>  > job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)
>  > ---   --- --  ------------   -----------    ------------   -----------
>  > bsr   3   1   147569         14369.7        160191         22752
>  > bsr   3   2   124716         243932         149343         184698
>  > bsr   3   4   123451         327665         147183         430875
>  > bsr   3   8   122486         455102         144568         484045
>  > bsr   3   16  117645         1.03957e+06    137485         1.06257e+06
>  >
>  > Tested-by: Vivek Goyal <vgoyal@redhat.com>
>  > CC: Jens Axboe <jens.axboe@oracle.com>
>  > CC: Chris Mason <chris.mason@oracle.com>
>  > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
>  > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
>  > CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>  > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>  > ---
>  >  include/linux/mm.h |    4 ++--
>  >  1 file changed, 2 insertions(+), 2 deletions(-)
>  >
>  > --- linux.orig/include/linux/mm.h    2010-01-30 17:38:49.000000000 +0800
>  > +++ linux/include/linux/mm.h    2010-01-30 18:09:58.000000000 +0800
>  > @@ -1184,8 +1184,8 @@ int write_one_page(struct page *page, in
>  >  void task_dirty_inc(struct task_struct *tsk);
>  >
>  >  /* readahead.c */
>  > -#define VM_MAX_READAHEAD    128    /* kbytes */
>  > -#define VM_MIN_READAHEAD    16    /* kbytes (includes current page) */
>  > +#define VM_MAX_READAHEAD    512    /* kbytes */
>  > +#define VM_MIN_READAHEAD    32    /* kbytes (includes current page) */
>  >
>  >  int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  >              pgoff_t offset, unsigned long nr_to_read);
>  >
>  >
> 
> -- 
> 
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, Open Virtualization 
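
The HDD table in the quoted patch description follows from a simple
model: each IO pays a fixed access time plus a transfer time that is
proportional to its size. The sketch below reproduces those columns.
The function and macro names are illustrative, and the table's figures
are matched when 1MB is taken as 1024KB, which is an inference from the
numbers rather than something the patch states:

```c
#include <assert.h>

/* Drive parameters from the example: 100MB/s and 8ms access time.
 * Assumption: 1MB = 1024KB, which is what reproduces the table. */
#define DISK_KB_PER_SEC (100.0 * 1024.0)
#define ACCESS_TIME_MS  8.0

/* Time to move io_size_kb off the platter, in milliseconds. */
double transfer_time_ms(double io_size_kb)
{
    assert(io_size_kb > 0.0);
    return io_size_kb / DISK_KB_PER_SEC * 1000.0;
}

/* Total service time of one IO: seek/rotation plus transfer. */
double io_latency_ms(double io_size_kb)
{
    return ACCESS_TIME_MS + transfer_time_ms(io_size_kb);
}

/* Fraction of the service time spent actually transferring data. */
double utilization(double io_size_kb)
{
    return transfer_time_ms(io_size_kb) / io_latency_ms(io_size_kb);
}

/* Effective throughput in KB/s when every IO pays the access time. */
double throughput_kb_per_sec(double io_size_kb)
{
    return io_size_kb / io_latency_ms(io_size_kb) * 1000.0;
}
```

For a 512KB IO this gives 5ms of transfer on top of 8ms of access time,
so roughly 38% of the time is spent transferring, which matches the
512 row of the table.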

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-08 13:46     ` [PATCH 03/11] readahead: bump up the default readahead size Wu Fengguang
@ 2010-02-11 21:37       ` Matt Mackall
  2010-02-11 23:42         ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Matt Mackall @ 2010-02-11 21:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christian Ehrhardt, Andrew Morton, Jens Axboe, Chris Mason,
	Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded

On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> Chris,
> 
> Firstly inform the linux-embedded maintainers :)
> 
> I think it's a good suggestion to add a config option
> (CONFIG_READAHEAD_SIZE). Will update the patch..

I don't have a strong opinion here beyond the nagging feeling that we
should be using a per-bdev scaling window scheme rather than something
static.

-- 
http://selenic.com : development and support for Mercurial and Linux




* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-11 21:37       ` Matt Mackall
@ 2010-02-11 23:42         ` Jamie Lokier
  2010-02-12  0:04           ` Matt Mackall
  2010-02-12 13:59           ` Wu Fengguang
  0 siblings, 2 replies; 7+ messages in thread
From: Jamie Lokier @ 2010-02-11 23:42 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Wu Fengguang, Christian Ehrhardt, Andrew Morton, Jens Axboe,
	Chris Mason, Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded

Matt Mackall wrote:
> On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > Chris,
> > 
> > Firstly inform the linux-embedded maintainers :)
> > 
> > I think it's a good suggestion to add a config option
> > (CONFIG_READAHEAD_SIZE). Will update the patch..
> 
> I don't have a strong opinion here beyond the nagging feeling that we
> should be using a per-bdev scaling window scheme rather than something
> static.

I agree with both.  100MB/s isn't typical on little devices, even if a
fast ATA disk is attached.  I've got something here where the ATA
interface itself (on a SoC) gets about 10MB/s max when doing nothing
else, or 4MB/s when talking to the network at the same time.
It's not a modern design, but you know, it's junk we try to use :-)

It sounds like a calculation based on throughput and seek time or IOP
rate, and maybe clamped if memory is small, would be good.

Is the window size something that could be meaningfully adjusted
according to live measurements?

-- Jamie





* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-11 23:42         ` Jamie Lokier
@ 2010-02-12  0:04           ` Matt Mackall
  2010-02-12 13:59           ` Wu Fengguang
  1 sibling, 0 replies; 7+ messages in thread
From: Matt Mackall @ 2010-02-12  0:04 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Wu Fengguang, Christian Ehrhardt, Andrew Morton, Jens Axboe,
	Chris Mason, Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded

On Thu, 2010-02-11 at 23:42 +0000, Jamie Lokier wrote:
> Matt Mackall wrote:
> > On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > > Chris,
> > > 
> > > Firstly inform the linux-embedded maintainers :)
> > > 
> > > I think it's a good suggestion to add a config option
> > > (CONFIG_READAHEAD_SIZE). Will update the patch..
> > 
> > I don't have a strong opinion here beyond the nagging feeling that we
> > should be using a per-bdev scaling window scheme rather than something
> > static.
> 
> I agree with both.  100MB/s isn't typical on little devices, even if a
> fast ATA disk is attached.  I've got something here where the ATA
> interface itself (on a SoC) gets about 10MB/s max when doing nothing
> else, or 4MB/s when talking to the network at the same time.
> It's not a modern design, but you know, it's junk we try to use :-)
> 
> It sounds like a calculation based on throughput and seek time or IOP
> rate, and maybe clamped if memory is small, would be good.
> 
> Is the window size something that could be meaningfully adjusted
> according to live measurements?

I think so. You've basically got a few different things you want to
balance: throughput, latency, and memory pressure. Successful readaheads
expand the window, as do empty request queues, while long request queues
and memory reclaim events collapse it. With any luck, we'll then
automatically do the right thing with fast/slow devices on big/small
boxes with varying load. And, like TCP, we don't need to 'know' anything
about the hardware, except to watch what happens when we use it.
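
A rough sketch of the feedback loop described above, under stated
assumptions: the event names, the struct, the 32KB step and the choice
of additive increase with halving on back-pressure (borrowed from TCP
congestion control) are all illustrative and not taken from any
existing kernel code:

```c
#include <assert.h>

/* Hypothetical feedback signals, named after the hints in the text. */
enum ra_event {
    RA_HIT,          /* a readahead was actually consumed by the reader */
    RA_QUEUE_EMPTY,  /* the device request queue ran empty */
    RA_QUEUE_LONG,   /* the request queue is long (latency hint) */
    RA_RECLAIM       /* a memory reclaim event (memory pressure hint) */
};

/* Per-device window state; all sizes in kbytes. */
struct ra_window {
    unsigned int size_kb;
    unsigned int min_kb;
    unsigned int max_kb;
};

#define RA_STEP_KB 32u   /* assumed growth step */

/* Grow the window on success, collapse it on back-pressure. */
void ra_window_update(struct ra_window *w, enum ra_event ev)
{
    assert(w->min_kb > 0 && w->min_kb <= w->max_kb);

    switch (ev) {
    case RA_HIT:
    case RA_QUEUE_EMPTY:
        w->size_kb += RA_STEP_KB;   /* additive increase */
        break;
    case RA_QUEUE_LONG:
    case RA_RECLAIM:
        w->size_kb /= 2;            /* multiplicative decrease */
        break;
    }

    /* Keep the window inside its configured bounds. */
    if (w->size_kb < w->min_kb)
        w->size_kb = w->min_kb;
    if (w->size_kb > w->max_kb)
        w->size_kb = w->max_kb;
}
```

The controller never inspects the hardware. It only reacts to what
happens when the device is used, which is the property the TCP analogy
is after.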

-- 
http://selenic.com : development and support for Mercurial and Linux



* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-11 23:42         ` Jamie Lokier
  2010-02-12  0:04           ` Matt Mackall
@ 2010-02-12 13:59           ` Wu Fengguang
  2010-02-12 20:20             ` Matt Mackall
  1 sibling, 1 reply; 7+ messages in thread
From: Wu Fengguang @ 2010-02-12 13:59 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Matt Mackall, Christian Ehrhardt, Andrew Morton, Jens Axboe,
	Chris Mason, Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded@vger.kernel.org

On Fri, Feb 12, 2010 at 07:42:49AM +0800, Jamie Lokier wrote:
> Matt Mackall wrote:
> > On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > > Chris,
> > > 
> > > Firstly inform the linux-embedded maintainers :)
> > > 
> > > I think it's a good suggestion to add a config option
> > > (CONFIG_READAHEAD_SIZE). Will update the patch..
> > 
> > I don't have a strong opinion here beyond the nagging feeling that we
> > should be using a per-bdev scaling window scheme rather than something
> > static.

It's good to do dynamic scaling -- in fact this patchset has code to
- scale down readahead size (per-bdev) for small devices
- scale down readahead size (per-stream) to the thrashing threshold

At the same time, I'd prefer
- to _only_ scale down (below the default size) for the low end
- and to have a uniform default readahead size for the mainstream

IMHO scaling up automatically
- would be risky
- makes it harder to build one common expectation of Linux behavior
  (not only developers, but also admins will run into the question:
  "what on earth is the readahead size?")
- and is still not likely to please the high end guys ;)

> I agree with both.  100MB/s isn't typical on little devices, even if a
> fast ATA disk is attached.  I've got something here where the ATA
> interface itself (on a SoC) gets about 10MB/s max when doing nothing
> else, or 4MB/s when talking to the network at the same time.
> It's not a modern design, but you know, it's junk we try to use :-)

Good to know this. I guess the same holds for some USB-capable
wireless routers -- they typically don't have hardware powerful enough
to reach the full 100MB/s disk speed.

> It sounds like a calculation based on throughput and seek time or IOP
> rate, and maybe clamped if memory is small, would be good.
> 
> Is the window size something that could be meaningfully adjusted
> according to live measurements?

We currently have live adjustment for
- small devices
- thrashed read streams

We could add new adjustments based on throughput (estimation is the
problem) and memory size.

Note that it does not really hurt to have a big _readahead_ size under low
throughput or small memory conditions, because it's merely the _max_
readahead size: the actual readahead size scales up step-by-step and
scales down if thrashed, and the sequential readahead hit ratio is
pretty high (so no memory/bandwidth is wasted).
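
To illustrate why the maximum is only a cap, here is a simplified
sketch in which the window doubles on each sequential hit until it
reaches the limit. The doubling policy and both function names are
assumptions made for this example; the kernel's actual ramp-up rules
differ in detail:

```c
#include <assert.h>

/* Next window size: double the current one, but never exceed max_kb. */
unsigned int next_ra_size_kb(unsigned int cur_kb, unsigned int max_kb)
{
    assert(cur_kb > 0 && cur_kb <= max_kb);
    /* Compare against max/2 so the doubling cannot overflow. */
    return cur_kb > max_kb / 2 ? max_kb : cur_kb * 2;
}

/* Number of sequential hits needed to grow from start_kb to max_kb. */
unsigned int ra_steps_to_max(unsigned int start_kb, unsigned int max_kb)
{
    unsigned int steps = 0;

    while (start_kb < max_kb) {
        start_kb = next_ra_size_kb(start_kb, max_kb);
        steps++;
    }
    return steps;
}
```

Under this assumed policy a stream starting at 32KB needs four
consecutive sequential hits before it issues a 512KB readahead, so a
short or random reader never gets near the maximum.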

What may hurt is a big mmap _readaround_ size. The larger the
readaround size, the higher the readaround miss ratio (though still not
disastrous), hence more memory pages and bandwidth are wasted. It's not a
big problem for the mainstream; however, embedded systems may be more
sensitive.

I would guess most embedded systems put executables on MTD devices
(can anyone confirm this?). And I wonder if MTDs have general
characteristics that are suitable for a smaller readahead/readaround
size (the two sizes are bundled for simplicity)?

Thanks,
Fengguang


* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-12 13:59           ` Wu Fengguang
@ 2010-02-12 20:20             ` Matt Mackall
  2010-02-21  2:25               ` Wu Fengguang
  0 siblings, 1 reply; 7+ messages in thread
From: Matt Mackall @ 2010-02-12 20:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jamie Lokier, Christian Ehrhardt, Andrew Morton, Jens Axboe,
	Chris Mason, Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded@vger.kernel.org

On Fri, 2010-02-12 at 21:59 +0800, Wu Fengguang wrote:
> On Fri, Feb 12, 2010 at 07:42:49AM +0800, Jamie Lokier wrote:
> > Matt Mackall wrote:
> > > On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > > > Chris,
> > > > 
> > > > Firstly inform the linux-embedded maintainers :)
> > > > 
> > > > I think it's a good suggestion to add a config option
> > > > (CONFIG_READAHEAD_SIZE). Will update the patch..
> > > 
> > > I don't have a strong opinion here beyond the nagging feeling that we
> > > should be using a per-bdev scaling window scheme rather than something
> > > static.
> 
> It's good to do dynamic scaling -- in fact this patchset has code to do
> - scale down readahead size (per-bdev) for small devices

I'm not sure device size is a great metric. It's only weakly correlated
with the things we actually care about: memory pressure (small devices
are often attached to systems with small and therefore full memory) and
latency (small devices are often old and slow and attached to slow
CPUs). I think we should instead use hints about latency (large request
queues) and memory pressure (reclaim passes) directly.

> - scale down readahead size (per-stream) to thrashing threshold

Yeah, I'm happy to call that part orthogonal to this discussion.

> At the same time, I'd prefer
> - to _only_ do scale down (below the default size) for low end
> - and have a uniform default readahead size for the mainstream

I don't think that's important, given that we're dynamically fiddling
with related things.

> IMHO scaling up automatically
> - would be risky

What, explicitly, are the risks? If we bound the window with memory
pressure and latency, I don't think it can get too far out of hand.
There are also some other bounds in here: we have other limits on how
big I/O requests can be.

I'm happy to worry about only scaling down for now, but it's only a
matter of time before we have to bump the number up again.
We've got an IOPS range from < 1 (mp3 player with power-saving
spin-down) to > 1M (high-end SSD). And the one that needs the most
readahead is the former! 

> I would guess most embedded systems put executables on MTD devices
> (anyone to confirm this?).

It's hard to generalize here. Even on flash devices, interleaving with
writes can result in high latencies that make it behave more like
spinning media, but there's no way to generalize about what the write
mix is going to be.

>  And I wonder if MTDs have general
> characteristics that are suitable for smaller readahead/readaround
> size (the two sizes are bundled for simplicity)?

Perhaps, but the trend is definitely towards larger blocks here.

> We could add new adjustments based on throughput (estimation is the
> problem) and memory size.

Note that throughput is not enough information here. More interesting is
the "bandwidth delay product" of the I/O path. If latency (of the whole
I/O stack) is zero, it's basically always better to read on demand. But
if every request takes 100ms whether it's for 4k or 4M (see optical
media), then you might want to consider reading 4M every time. And
latency is of course generally not independent of usage pattern. Which
is why I think TCP-like feedback scaling is the right approach.
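
As a back-of-the-envelope illustration of the bandwidth delay product:
the amount of data that must be requested per IO to cover the latency
is bandwidth times latency. The function name and units below are
chosen for this example only:

```c
#include <assert.h>

/* Data (in KB) that fits "in flight" during one request latency. */
double bdp_kb(double bandwidth_kb_per_sec, double latency_ms)
{
    assert(bandwidth_kb_per_sec >= 0.0 && latency_ms >= 0.0);
    return bandwidth_kb_per_sec * latency_ms / 1000.0;
}
```

Plugging in figures mentioned elsewhere in this thread, purely as an
illustration: a 100MB/s disk with 8ms access time gives about 819KB,
while a 10MB/s interface with the same latency gives about 82KB. With
zero latency the product is zero, which corresponds to the "read on
demand" case.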

-- 
http://selenic.com : development and support for Mercurial and Linux



* Re: [PATCH 03/11] readahead: bump up the default readahead size
  2010-02-12 20:20             ` Matt Mackall
@ 2010-02-21  2:25               ` Wu Fengguang
  0 siblings, 0 replies; 7+ messages in thread
From: Wu Fengguang @ 2010-02-21  2:25 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Jamie Lokier, Christian Ehrhardt, Andrew Morton, Jens Axboe,
	Chris Mason, Peter Zijlstra, Martin Schwidefsky, Clemens Ladisch,
	Olivier Galibert, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML, Paul Gortmaker,
	David Woodhouse, linux-embedded@vger.kernel.org

Hi Matt,

On Sat, Feb 13, 2010 at 04:20:23AM +0800, Matt Mackall wrote:
> On Fri, 2010-02-12 at 21:59 +0800, Wu Fengguang wrote:
> > On Fri, Feb 12, 2010 at 07:42:49AM +0800, Jamie Lokier wrote:
> > > Matt Mackall wrote:
> > > > On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > > > > Chris,
> > > > > 
> > > > > Firstly inform the linux-embedded maintainers :)
> > > > > 
> > > > > I think it's a good suggestion to add a config option
> > > > > (CONFIG_READAHEAD_SIZE). Will update the patch..
> > > > 
> > > > I don't have a strong opinion here beyond the nagging feeling that we
> > > > should be using a per-bdev scaling window scheme rather than something
> > > > static.
> > 
> > It's good to do dynamic scaling -- in fact this patchset has code to do
> > - scale down readahead size (per-bdev) for small devices
> 
> I'm not sure device size is a great metric. It's only weakly correlated

Yes, it's only weakly correlated. However, device size is a good metric
in itself when it's small, e.g. Linus' 500KB sized USB device.

> with the things we actually care about: memory pressure (small devices
> are often attached to systems with small and therefore full memory) and
> latency (small devices are often old and slow and attached to slow
> CPUs). I think we should instead use hints about latency (large request
> queues) and memory pressure (reclaim passes) directly.

In principle I think it's OK to use memory pressure and IO latency as hints.

1) memory pressure

For read-ahead, the memory pressure mainly comes from readahead buffers
consumed by too many concurrent streams. The context readahead in this
patchset can adapt the readahead size to the thrashing threshold well.  So in
principle we don't need to adapt the default _max_ read-ahead size to
memory pressure.

For read-around, the memory pressure mainly comes from read-around misses on
executables/libraries, which could be reduced by scaling down the
read-around size on fast "reclaim passes".

The more straightforward solution could be to limit the default
read-around size in proportion to available system memory, i.e.
                512MB mem => 512KB read-around size
                128MB mem => 128KB read-around size
                 32MB mem =>  32KB read-around size (minimal)
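
The mapping above amounts to 1KB of read-around per 1MB of memory with
a 32KB floor. A minimal sketch follows; the function name is
hypothetical, and the 512KB ceiling is an assumption taken from the new
default maximum rather than something stated in the list above:

```c
/* Default read-around size in KB for a system with mem_mb MB of memory:
 * 1KB per MB, clamped to [32KB, 512KB]. The upper bound is assumed. */
unsigned int default_readaround_kb(unsigned long mem_mb)
{
    if (mem_mb < 32)
        return 32;      /* minimal read-around size */
    if (mem_mb > 512)
        return 512;     /* assumed cap: the default max readahead */
    return (unsigned int)mem_mb;
}
```

A 64MB board would get 64KB of read-around under this rule, while any
machine with 512MB or more keeps the full 512KB.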

2) IO latency

We might estimate the average service time and throughput for IOs of
different sizes, and choose the default readahead size based on
- good throughput
- low service time
- reasonable size bounds

IMHO the estimation should reflect the nature of the device, and should
not depend on specific workloads. Some points:

- in most cases, reducing readahead size on large request queues
  (which are typical in large file servers) only hurts performance
- we don't know whether the application is latency-sensitive (and to
  what degree), hence there is no need to be over-zealous in optimizing
  for latency
- a dynamically changing readahead size is a nightmare for benchmarks

That means avoiding estimation when there are any concurrent
reads/writes.  It also means that the estimation can be turned off for
this boot after enough data has been collected and the averages have
stabilized.

> > - scale down readahead size (per-stream) to thrashing threshold
> 
> Yeah, I'm happy to call that part orthogonal to this discussion.
> 
> > At the same time, I'd prefer
> > - to _only_ do scale down (below the default size) for low end
> > - and have a uniform default readahead size for the mainstream
> 
> I don't think that's important, given that we're dynamically fiddling
> with related things.

Before we can dynamically tune things and do it smart enough, it would
be good to have clear rules :)

> > IMHO scaling up automatically
> > - would be risky
> 
> What, explicitly, are the risks? If we bound the window with memory

Risks could be readahead misses and higher latency. 
Generally the risk:perf_gain ratio goes up for larger readahead size.

> pressure and latency, I don't think it can get too far out of hand.
> There are also some other bounds in here: we have other limits on how
> big I/O requests can be.

OK, if we set some bounds based mainly on foreseeable single device
performance needs... 16MB?

> I'm happy to worry about only scaling down for now, but it's only a
> matter of time before we have to bump the number up again.

Agreed.

> We've got an IOPS range from < 1 (mp3 player with power-saving
> spin-down) to > 1M (high-end SSD). And the one that needs the most
> readahead is the former! 

We have laptop mode for the former, which will elevate readahead size
and (legitimately) disregard IO performance impacts.

> > I would guess most embedded systems put executables on MTD devices
> > (anyone to confirm this?).
> 
> It's hard to generalize here. Even on flash devices, interleaving with
> writes can result in high latencies that make it behave more like
> spinning media, but there's no way to generalize about what the write
> mix is going to be.

I'd prefer not to consider the impact of writes when choosing the default
readahead size.

> >  And I wonder if MTDs have general
> > characteristics that are suitable for smaller readahead/readaround
> > size (the two sizes are bundled for simplicity)?
> 
> Perhaps, but the trend is definitely towards larger blocks here.

OK.

> > We could add new adjustments based on throughput (estimation is the
> > problem) and memory size.
> 
> Note that throughput is not enough information here. More interesting is
> the "bandwidth delay product" of the I/O path. If latency (of the whole
> I/O stack) is zero, it's basically always better to read on demand. But
> if every request takes 100ms whether it's for 4k or 4M (see optical
> media), then you might want to consider reading 4M every time. And
> latency is of course generally not independent of usage pattern. Which
> is why I think TCP-like feedback scaling is the right approach.

OK.

Thanks,
Fengguang
