* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
[not found] ` <20090412084819.GA25314@elte.hu>
@ 2009-04-12 12:35 ` Wu Fengguang
2009-04-16 17:12 ` Vladislav Bolkhovitin
0 siblings, 1 reply; 6+ messages in thread
From: Wu Fengguang @ 2009-04-12 12:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
LKML, Peter Zijlstra, Nick Piggin, Rik van Riel, Linus Torvalds,
Chenfeng Xu, linux-mm
On Sun, Apr 12, 2009 at 04:48:19PM +0800, Ingo Molnar wrote:
>
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Introduce page cache context based readahead algorithm.
> > This is to better support concurrent read streams in general.
>
> > /*
> > + * Count contiguously cached pages from @offset-1 to @offset-@max,
> > + * this count is a conservative estimation of
> > + * - length of the sequential read sequence, or
> > + * - thrashing threshold in memory tight systems
> > + */
> > +static pgoff_t count_history_pages(struct address_space *mapping,
> > + struct file_ra_state *ra,
> > + pgoff_t offset, unsigned long max)
> > +{
> > + pgoff_t head;
> > +
> > + rcu_read_lock();
> > + head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
> > + rcu_read_unlock();
> > +
> > + return offset - 1 - head;
> > +}
>
> Very elegant method! I suspect this will work far better
> than adding various increasingly more complex heuristics.
>
> Emphatically-Acked-by: Ingo Molnar <mingo@elte.hu>
Thank you Ingo!
The only pity is that this heuristic can be defeated by a user space
program that does aggressive drop-behind via fadvise(DONTNEED) calls.
But as long as the drop-behind algorithm is a bit lazy and does not
try to squeeze out the last page at @offset-1, this patch will work
just fine.
The context readahead idea is so fundamental that a slightly modified
algorithm can be used for all kinds of sequential patterns, and it can
automatically adapt to the thrashing threshold.
1 if (probe_page(index - 1)) {
2 begin = next_hole(index, max);
3 H = index - prev_hole(index, 2*max);
4 end = index + H;
5 update_window(begin, end);
6 submit_io();
7 }
[=] history [#] current [_] readahead [.] new readahead
==========================#____________..............
1 ^index-1
2 |----------->[begin
3 |<----------- H -----------|
4 |----------- H ----------->]end
5 [ new window ]
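For illustration, the pseudocode and diagram above can be fleshed out as a small user-space model. This is only a sketch: next_hole()/prev_hole() stand in for the kernel's radix tree hole-scanning helpers, and the page cache is modeled as a plain set of page indexes; none of the names below are real kernel API.

```python
# User-space model of the modified context readahead pseudocode above.

def next_hole(cached, index, max_scan):
    """First non-cached page at or after index, scanning at most max_scan pages."""
    for i in range(index, index + max_scan):
        if i not in cached:
            return i
    return index + max_scan

def prev_hole(cached, index, max_scan):
    """First non-cached page at or before index, scanning at most max_scan pages."""
    for i in range(index, index - max_scan, -1):
        if i not in cached:
            return i
    return index - max_scan

def context_readahead(cached, index, max_scan):
    """Return (begin, end) of the new readahead window, or None."""
    if index - 1 not in cached:                 # probe_page(index - 1)
        return None
    begin = next_hole(cached, index, max_scan)  # skip already-cached pages
    H = index - prev_hole(cached, index, 2 * max_scan)  # history length
    end = index + H     # window sized by history: the thrashing-safe bound
    return begin, end

if __name__ == "__main__":
    # history pages 4..19, current page 20, old readahead window 21..27
    cached = set(range(4, 28))
    print(context_readahead(cached, 20, max_scan=16))   # -> (28, 37)
```

In this example H=17 pages of contiguous history survived in the cache, so the new window never extends further ahead than the history length - the conservative thrashing threshold estimate.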
We didn't do that because we want to
- avoid unnecessary page cache lookups for normal cases
- be more aggressive when thrashings are not a concern
However, readahead thrashing is far more prevalent than one would
expect in an FTP/HTTP file streaming server. It can happen in a modern
server with 16GB memory, 1Gbps outbound bandwidth and 1MB readahead
size, due to the existence of slow streams.
Let's do a coarse calculation. The 8GB of inactive_list pages will be
cycled in 64Gb/1Gbps=64 seconds. This means an async readahead window
must be consumed within 64s, or it will be thrashed. That's a speed of
2048KB/64s=32KB/s. Any client below this speed will create thrashing
in the server. In practice, those poor slow clients could amount to
half of the total connections (partly because it takes them more
time to download anything). The frequent thrashing will in turn
speed up the LRU cycling/aging...
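The coarse calculation can be captured in a tiny helper. This is illustrative only; it just restates the arithmetic above using the message's loose units (GB of inactive list, Gbps of uplink, KB of readahead window):

```python
# Back-of-the-envelope model of the readahead thrashing threshold:
# the inactive list is recycled once per LRU cycle, so an async
# readahead window must be consumed within one cycle to survive.

def lru_cycle_seconds(inactive_gb, uplink_gbps):
    """Seconds to push the whole inactive list out at full uplink speed."""
    return inactive_gb * 8 / uplink_gbps        # GB -> Gb, then divide by Gbps

def thrashing_threshold_kbps(inactive_gb, uplink_gbps, window_kb):
    """Slowest client speed (KB/s) that still avoids readahead thrashing."""
    return window_kb / lru_cycle_seconds(inactive_gb, uplink_gbps)

if __name__ == "__main__":
    # 16GB server: ~8GB inactive list, 1Gbps out, 2048KB readahead window
    print(lru_cycle_seconds(8, 1))               # -> 64.0 seconds
    print(thrashing_threshold_kbps(8, 1, 2048))  # -> 32.0 KB/s
```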
We need a thrashing-safe mode which does
- the above modified context readahead algorithm
- conservative ramp-up of the readahead size
- conservative async readahead size
The main problem is: when shall we switch into that mode?
We can start with aggressive readahead, try to detect readahead
thrashing, and switch into the thrashing-safe mode automatically. This
will work for non-interleaved reads. However, the big file streamer -
lighttpd - does interleaved reads. The current data structure is not
able to detect most readahead thrashing for lighttpd.
Luckily, the non-resident page tracking facility could help in this
case. There, the thrashed pages with their timing info can be found,
based on which we can build an extended context readahead algorithm
that could even overcome the drop-behind problem :)
Thanks,
Fengguang
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
[not found] ` <87zlej7kwf.fsf@basil.nowhere.org>
@ 2009-04-14 9:27 ` Wu Fengguang
2009-04-14 10:00 ` Andi Kleen
0 siblings, 1 reply; 6+ messages in thread
From: Wu Fengguang @ 2009-04-14 9:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
linux-mm, Chenfeng Xu
[-- Attachment #1: Type: text/plain, Size: 4576 bytes --]
On Tue, Apr 14, 2009 at 03:29:52PM +0800, Andi Kleen wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
> >
> > The context readahead algorithm guarantees to discover the sequentialness no
> > matter how the streams are interleaved. For the above example, it will start
> > sequential readahead from pages 2 and 1002.
> >
> > The trick is to poke for page @offset-1 in the page cache when it has no other
> > clues on the sequentialness of request @offset: if the current request belongs
> > to a sequential stream, that stream must have accessed page @offset-1 recently,
>
> Really? The page could be just randomly cached from some previous IO, right?
> I don't think you can detect that, right? [maybe you could play some games
> with the LRU list, but that would likely not be O(1)]
>
> I wonder if this could cause the prefetcher to "switch gears" suddenly
> when it first walks a long stretch of non-cached file data and then
> hits a cached page.
>
> One possible way around this would be to add a concept of "page generation":
> e.g. use a few bits of page flags. Keep a global generation count
> that increases slowly (say, every few seconds). Mark the page with the
> current generation count when the prefetcher takes it. When running
> your algorithm, check the count first and only use pages with a recent
> count.
>
> Not sure if it's a real problem or not.
Good catch Andi!
I'll list some possible situations. I guess you are referring to (2.3)?
1) page at @offset-1 is present, and it may be a trace of:
1.1) (interleaved) sequential reads: we catch it, bingo!
1.2) clustered random reads: readahead will be mostly favorable
If page @offset-1 exists, it means there are two references that come
close in both space (distance < one readahead window) and time
(distance < one LRU scan cycle). So it's a good indication that some
clustered random reads are occurring in the area around @offset.
I have done many experiments on the performance impact of readahead on
clustered random reads - the results are very encouraging. The tests
covered different random read densities, random read sizes, as well as
different thrashing conditions (by varying the dataset:memory ratio).
There is hardly any performance loss (2% in the worst case), but the
gain can be as large as 300%!
Some numbers, curves and analysis can be found in the attached
slides and paper:
- readahead performance slides: Pages 23-26: Clustered random reads
- readahead framework paper: Page 7, Section 4.3: Random reads
The recent readahead addition to mmap reads makes another vivid example
of the stated "readahead is good for clustered random reads" principle.
Readahead's side effects on the random references made by executables
are very good:
- major faults reduced by 1/3
- mmap IO numbers reduced by 1/4
- with no obvious overheads
But as always, one can fake a workload to totally defeat the readahead
heuristics ;-)
2) page at @offset-1 is not present (for a sequential stream)
2.1) aggressive user space drop behind: fixable nuisance
The user space could be doing fadvise(offset-1, DONTNEED), which will
drop the history hint required to enable context readahead. But I
guess when the administrator/developer notices its impact -
performance dropped instead of increased - he can easily fix it up to
do fadvise(offset-2, DONTNEED), or manage his own readahead via
fadvise(WILLNEED).
So this is a nuisance, but a fixable one.
2.2) readahead thrashing: not handled for now
We don't handle this case yet. The current behavior is to stop
readahead and possibly restart the readahead window ramp-up process.
2.3) readahead cache hits: rare case and the impact is temporary
The page at @offset-1 does get referenced by this stream, but it was
created by someone else some time ago. The page at @offset-1 may be
lifted to the active LRU by this second reference, or this may come
too late and the page gets reclaimed - by the time we reference page
@offset.
Normally it's a range of cached pages. We are a) either walking inside
the range and enjoying the cache hits, b) or we walk out of it and
restart readahead by ourselves, c) or the range of cached pages gets
reclaimed while we are walking on it, and hence we cannot find page
@offset-1.
Obviously (c) is rare and temporary and is the main cause of (2.3).
As soon as we go to the next page at @offset+1, we'll find its
'previous' page at @offset to be cached (it was created by us!). So
context readahead starts working again - it's merely delayed by one
page :-)
Thanks,
Fengguang
[-- Attachment #2: readahead-algorithms-and-io-performance.pdf --]
[-- Type: application/pdf, Size: 403230 bytes --]
[-- Attachment #3: readahead-framework.pdf --]
[-- Type: application/pdf, Size: 610798 bytes --]
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
2009-04-14 9:27 ` Wu Fengguang
@ 2009-04-14 10:00 ` Andi Kleen
2009-04-14 10:58 ` Wu Fengguang
0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2009-04-14 10:00 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andi Kleen, Andrew Morton, Vladislav Bolkhovitin, Jens Axboe,
Jeff Moyer, Nick Piggin, Ingo Molnar, Linus Torvalds,
Peter Zijlstra, linux-mm, Chenfeng Xu
> I'll list some possible situations. I guess you are referring to (2.3)?
Yep. Thanks for the detailed analysis.
> 2.3) readahead cache hits: rare case and the impact is temporary
>
> The page at @offset-1 does get referenced by this stream, but it's
> created by someone else at some distant time ago. The page at
> @offset-1 may be lifted to active lru by this second reference, or too
> late and get reclaimed - by the time we reference page @offset.
>
> Normally its a range of cached pages. We are a) either walking inside the
> range and enjoying the cache hits, b) or we walk out of it and restart
> readahead by ourself, c) or the range of cached pages get reclaimed
> while we are walking on them, and hence cannot find page @offset-1.
>
> Obviously (c) is rare and temporary and is the main cause of (2.3).
> As soon as we goto the next page at @offset+1, we'll its 'previous'
> page at @offset to be cached(it is created by us!). So the context
> readahead starts working again - it's merely delayed by one page :-)
Thanks. The question is how much performance impact this has on
the stream that is being read ahead. I guess it would be only a small
"hiccup", with some luck hidden by the block level RA?
The other question would be if it could cause the readahead code
to do a lot of unnecessary work, but your answer seems to be "no". Fine.
I think the concept is sound.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
2009-04-14 10:00 ` Andi Kleen
@ 2009-04-14 10:58 ` Wu Fengguang
2009-04-14 11:11 ` Wu Fengguang
0 siblings, 1 reply; 6+ messages in thread
From: Wu Fengguang @ 2009-04-14 10:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
linux-mm@kvack.org, Chenfeng Xu
On Tue, Apr 14, 2009 at 06:00:02PM +0800, Andi Kleen wrote:
> > I'll list some possible situations. I guess you are referring to (2.3)?
>
> Yep. Thanks for the detailed analysis.
You are welcome :-)
> > 2.3) readahead cache hits: rare case and the impact is temporary
> >
> > The page at @offset-1 does get referenced by this stream, but it's
> > created by someone else at some distant time ago. The page at
> > @offset-1 may be lifted to active lru by this second reference, or too
> > late and get reclaimed - by the time we reference page @offset.
> >
> > Normally its a range of cached pages. We are a) either walking inside the
> > range and enjoying the cache hits, b) or we walk out of it and restart
> > readahead by ourself, c) or the range of cached pages get reclaimed
> > while we are walking on them, and hence cannot find page @offset-1.
> >
> > Obviously (c) is rare and temporary and is the main cause of (2.3).
> > As soon as we goto the next page at @offset+1, we'll its 'previous'
> > page at @offset to be cached(it is created by us!). So the context
> > readahead starts working again - it's merely delayed by one page :-)
>
> Thanks. The question is how much performance impact this has on
> the stream that is readaheaded. I guess it would be only a smaller
> "hickup", with some luck hidden by the block level RA?
Yes, there will be a hiccup: whenever the stream walks out of the
current cached page range (or the pages get reclaimed while we are
walking in it), there will be a 1-page read, followed by a 4-page
readahead, then 8-, 16-, ... page readaheads, i.e. a readahead window
ramp-up process.
That's assuming we are doing 1-page reads. For large sendfile() calls,
the readahead size will be instantly restored to its full size on the
first cache miss.
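The ramp-up described here can be sketched as a toy model. The 4-page restart and the doubling are taken from the description above, not from the exact kernel policy:

```python
# Toy model of the readahead window ramp-up after a cache miss:
# restart small and double on each sequential hit until the full
# readahead size is reached again.

def rampup_sizes(full_size_pages, start=4):
    """Successive readahead sizes issued after a restart."""
    sizes = []
    size = start
    while size < full_size_pages:
        sizes.append(size)
        size *= 2
    sizes.append(full_size_pages)   # cap at the full readahead size
    return sizes

if __name__ == "__main__":
    # e.g. a 128KB (32-page) default readahead size
    print(rampup_sizes(32))   # -> [4, 8, 16, 32]
```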
> The other question would be if it could cause the readahead code
> to do a lot of unnecessary work, but your answer seems to be "no". Fine.
Right. The readahead will automatically be turned off inside the
cached page ranges, and restarted after walking out of them.
> I think the concept is sound.
^_^ Thanks for your appreciation!
Best Regards,
Fengguang
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
2009-04-14 10:58 ` Wu Fengguang
@ 2009-04-14 11:11 ` Wu Fengguang
0 siblings, 0 replies; 6+ messages in thread
From: Wu Fengguang @ 2009-04-14 11:11 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
linux-mm@kvack.org, Chenfeng Xu
On Tue, Apr 14, 2009 at 06:58:24PM +0800, Wu Fengguang wrote:
> On Tue, Apr 14, 2009 at 06:00:02PM +0800, Andi Kleen wrote:
> > > I'll list some possible situations. I guess you are referring to (2.3)?
> >
> > Yep. Thanks for the detailed analysis.
>
> You are welcome :-)
>
> > > 2.3) readahead cache hits: rare case and the impact is temporary
> > >
> > > The page at @offset-1 does get referenced by this stream, but it's
> > > created by someone else at some distant time ago. The page at
> > > @offset-1 may be lifted to active lru by this second reference, or too
> > > late and get reclaimed - by the time we reference page @offset.
> > >
> > > Normally its a range of cached pages. We are a) either walking inside the
> > > range and enjoying the cache hits, b) or we walk out of it and restart
> > > readahead by ourself, c) or the range of cached pages get reclaimed
> > > while we are walking on them, and hence cannot find page @offset-1.
> > >
> > > Obviously (c) is rare and temporary and is the main cause of (2.3).
> > > As soon as we goto the next page at @offset+1, we'll its 'previous'
> > > page at @offset to be cached(it is created by us!). So the context
> > > readahead starts working again - it's merely delayed by one page :-)
> >
> > Thanks. The question is how much performance impact this has on
> > the stream that is readaheaded. I guess it would be only a smaller
> > "hickup", with some luck hidden by the block level RA?
No, the block level RA won't help here, because there are no disk
accesses at all for the cached pages before @offset: the disk RA has
absolutely no idea where the IO for page @offset originates from ;-)
> Yes there will be hickup: whenever the stream walks out of the current
> cached page range(or the pages get reclaimed while we are walking in it),
> there will be a 1-page read, followed by a 4-page readahead, and then 8,
> 16, ... page sized readahead, i.e. a readahead window rampup process.
>
> That's assuming we are doing 1-page reads. For large sendfile() calls,
> the readahead size will be instantly restored to its full size on the
> first cache miss.
>
> > The other question would be if it could cause the readahead code
> > to do a lot of unnecessary work, but your answer seems to be "no". Fine.
>
> Right. The readahead will automatically be turned off inside the
> cached page ranges, and restarted after walking out of it.
>
> > I think the concept is sound.
>
> ^_^ Thanks for your appreciation!
>
>
> Best Regards,
> Fengguang
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm
2009-04-12 12:35 ` [PATCH 3/3] readahead: introduce context readahead algorithm Wu Fengguang
@ 2009-04-16 17:12 ` Vladislav Bolkhovitin
0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2009-04-16 17:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Andrew Morton, Jens Axboe, Jeff Moyer, LKML,
Peter Zijlstra, Nick Piggin, Rik van Riel, Linus Torvalds,
Chenfeng Xu, linux-mm
Wu Fengguang, on 04/12/2009 04:35 PM wrote:
> On Sun, Apr 12, 2009 at 04:48:19PM +0800, Ingo Molnar wrote:
>> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>>> Introduce page cache context based readahead algorithm.
>>> This is to better support concurrent read streams in general.
>>> /*
>>> + * Count contiguously cached pages from @offset-1 to @offset-@max,
>>> + * this count is a conservative estimation of
>>> + * - length of the sequential read sequence, or
>>> + * - thrashing threshold in memory tight systems
>>> + */
>>> +static pgoff_t count_history_pages(struct address_space *mapping,
>>> + struct file_ra_state *ra,
>>> + pgoff_t offset, unsigned long max)
>>> +{
>>> + pgoff_t head;
>>> +
>>> + rcu_read_lock();
>>> + head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
>>> + rcu_read_unlock();
>>> +
>>> + return offset - 1 - head;
>>> +}
>> Very elegant method! I suspect this will work far better
>> than adding various increasingly more complex heuristics.
>>
>> Emphatically-Acked-by: Ingo Molnar <mingo@elte.hu>
>
> Thank you Ingo!
>
> The only pity is that this heuristic can be defeated by some user
> space program that tries to do aggressive drop-behind via
> fadvise(DONTNEED) calls. But as long as the drop-behind algorithm
> be a bit lazy and does not try to squeeze the last page at @offset-1,
> this patch will work just OK.
>
> The context readahead idea is so fundamental, that a slightly modified
> algorithm can be used for all kinds of sequential patterns, and it can
> automatically adapt to the thrashing threshold.
>
> 1 if (probe_page(index - 1)) {
> 2 begin = next_hole(index, max);
> 3 H = index - prev_hole(index, 2*max);
> 4 end = index + H;
> 5 update_window(begin, end);
> 6 submit_io();
> 7 }
>
> [=] history [#] current [_] readahead [.] new readahead
> ==========================#____________..............
> 1 ^index-1
> 2 |----------->[begin
> 3 |<----------- H -----------|
> 4 |----------- H ----------->]end
> 5 [ new window ]
>
>
> We didn't do that because we want to
> - avoid unnecessary page cache lookups for normal cases
> - be more aggressive when thrashings are not a concern
>
> However, readahead thrashings are far more prevalent than one would
> expect in a FTP/HTTP file streaming server. It can happen in a modern
> server with 16GB memory, 1Gbps outbound bandwidth and 1MB readahead
> size, due to the existences of slow streams.
>
> Let's do a coarse calculation. The 8GB inactive_list pages will be
> cycled in 64Gb/1Gbps=64 seconds. This means an async readahead window
> must be consumed in 64s, or it will be thrashed. That's a speed of
> 2048KB/64s=32KB/s. Any client below this speed will create thrashings
> in the server. In practice, those poor slow clients could amount to
> half of the total connections(partly because it will take them more
> time to download anything). The frequent thrashings will in return
> speedup the LRU cycling/aging speed...
>
> We need a thrashing safe mode which do
> - the above modified context readahead algorithm
> - conservative ramp up of readahead size
> - conservative async readahead size
>
> The main problem is: when shall we switch into the mode?
The more I think about an ideal readahead, the more it looks like the
page cache should also keep fairness between its users, similarly to
what is done for CPU (CFS) and disk (CFQ). A slow user should have a
chance to use its chunk of the cache in the face of a too-fast
thrasher.
The main problems are defining what a "user" is and implementing the
fairness in an acceptably simple way.
Maybe something like this: a "user" is an IO context (or IO stream?)
and, if there is a need to get some pages for "user" A, pages
belonging to other "users" would be evicted with an additional weight
W, so A's pages would be preferred during eviction.
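A minimal sketch of that weighting idea, purely illustrative: the "user" tag, the age metric, and the weight W below are all made up here just to show the shape of the policy:

```python
# Sketch of weighted eviction between IO-context "users": when user A
# needs a page, candidates owned by other users have their eviction
# score discounted by the weight W, so A's own pages are preferred as
# victims and slow users keep their chunk of the cache.

def pick_victim(pages, requester, W=2.0):
    """pages: list of (owner, age) pairs; older pages are better victims,
    but pages of other owners are shielded by the weight W."""
    def score(page):
        owner, age = page
        return age if owner == requester else age / W
    return max(pages, key=score)

if __name__ == "__main__":
    pages = [("A", 10), ("B", 16), ("A", 6)]
    # B's page is oldest, but the fast thrasher A evicts its own page first
    print(pick_victim(pages, "A"))   # -> ('A', 10)
```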
Just my 0.05c.
> We can start with aggressive readahead and try to detect readahead
> thrashings and switch into thrashing safe mode automatically. This
> will work for non-interleaved reads. However the big file streamer -
> lighttpd - does interleaved reads. The current data structure is not
> able to detect most readahead thrashings for lighttpd.
>
> Luckily, the non-resident page tracking facility could help this case.
> There the thrashed pages with their timing info can be found, based on
> which we can have some extended context readahead algorithm that could
> even overcome the drop-behind problem :)
>
> Thanks,
> Fengguang
end of thread, other threads:[~2009-04-16 17:12 UTC | newest]
Thread overview: 6+ messages
[not found] <20090412071950.166891982@intel.com>
[not found] ` <20090412072052.686760755@intel.com>
[not found] ` <20090412084819.GA25314@elte.hu>
2009-04-12 12:35 ` [PATCH 3/3] readahead: introduce context readahead algorithm Wu Fengguang
2009-04-16 17:12 ` Vladislav Bolkhovitin
[not found] ` <87zlej7kwf.fsf@basil.nowhere.org>
2009-04-14 9:27 ` Wu Fengguang
2009-04-14 10:00 ` Andi Kleen
2009-04-14 10:58 ` Wu Fengguang
2009-04-14 11:11 ` Wu Fengguang