* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Wu Fengguang
Date: 2009-04-12 12:35 UTC
To: Ingo Molnar
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer, LKML,
    Peter Zijlstra, Nick Piggin, Rik van Riel, Linus Torvalds,
    Chenfeng Xu, linux-mm

On Sun, Apr 12, 2009 at 04:48:19PM +0800, Ingo Molnar wrote:
>
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Introduce page cache context based readahead algorithm.
> > This is to better support concurrent read streams in general.
>
> >  /*
> > + * Count contiguously cached pages from @offset-1 to @offset-@max,
> > + * this count is a conservative estimation of
> > + *	- length of the sequential read sequence, or
> > + *	- thrashing threshold in memory tight systems
> > + */
> > +static pgoff_t count_history_pages(struct address_space *mapping,
> > +				   struct file_ra_state *ra,
> > +				   pgoff_t offset, unsigned long max)
> > +{
> > +	pgoff_t head;
> > +
> > +	rcu_read_lock();
> > +	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
> > +	rcu_read_unlock();
> > +
> > +	return offset - 1 - head;
> > +}
>
> Very elegant method! I suspect this will work far better
> than adding various increasingly more complex heuristics.
>
> Emphatically-Acked-by: Ingo Molnar <mingo@elte.hu>

Thank you Ingo!

The only pity is that this heuristic can be defeated by a user space
program that does aggressive drop-behind via fadvise(DONTNEED) calls.
But as long as the drop-behind algorithm is a bit lazy and does not
try to squeeze out the last page at @offset-1, this patch will work
just fine.

The context readahead idea is so fundamental that a slightly modified
algorithm can be used for all kinds of sequential patterns, and it can
automatically adapt to the thrashing threshold:

	1	if (probe_page(index - 1)) {
	2		begin = next_hole(index, max);
	3		H = index - prev_hole(index, 2*max);
	4		end = index + H;
	5		update_window(begin, end);
	6		submit_io();
	7	}

	    [=] history [#] current [_] readahead [.] new readahead
	    ==========================#____________..............
	1                            ^index-1
	2                             |----------->[begin
	3      |<----------- H -----------|
	4                             |----------- H ----------->]end
	5                                          [ new window ]

We didn't do that because we want to
- avoid unnecessary page cache lookups for the normal cases
- be more aggressive when thrashing is not a concern

However, readahead thrashing is far more prevalent than one would
expect in an FTP/HTTP file streaming server. It can happen in a modern
server with 16GB memory, 1Gbps outbound bandwidth and 1MB readahead
size, due to the existence of slow streams.

Let's do a coarse calculation. The 8GB of inactive_list pages will be
cycled through in 64Gb/1Gbps = 64 seconds. This means an async
readahead window must be consumed within 64s, or it will be thrashed.
That's a speed of 2048KB/64s = 32KB/s: any client below this speed
will create thrashing in the server. In practice, those poor slow
clients could amount to half of the total connections (partly because
it takes them more time to download anything). The frequent thrashing
will in turn speed up the LRU cycling/aging...

We need a thrashing safe mode which does
- the above modified context readahead algorithm
- conservative ramp-up of the readahead size
- conservative async readahead size

The main problem is: when shall we switch into that mode?

We can start with aggressive readahead, try to detect readahead
thrashing, and switch into the thrashing safe mode automatically. This
will work for non-interleaved reads. However the big file streamer -
lighttpd - does interleaved reads, and the current data structure is
not able to detect most readahead thrashing for lighttpd.

Luckily, the non-resident page tracking facility could help this case.
There the thrashed pages, with their timing info, can be found; based
on that we can build an extended context readahead algorithm that
could even overcome the drop-behind problem :)

Thanks,
Fengguang
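[Aside: a minimal sketch of how the history count can drive the window
sizing, following the shape of the posted series and mm/readahead.c.
get_init_ra_size() is the existing ramp-up helper there; treat the
exact policy below as illustrative rather than the patch verbatim.]

	static int try_context_readahead(struct address_space *mapping,
					 struct file_ra_state *ra,
					 pgoff_t offset,
					 unsigned long req_size,
					 unsigned long max)
	{
		pgoff_t size;

		/* conservative stream length / thrashing threshold */
		size = count_history_pages(mapping, ra, offset, max);

		/* no history pages: not a detectable sequential stream */
		if (!size)
			return 0;

		/* stream starts at the beginning of the file */
		if (size >= offset)
			size *= 2;

		/* the history count caps the window near the threshold */
		ra->start = offset;
		ra->size = get_init_ra_size(size + req_size, max);
		ra->async_size = ra->size;

		return 1;
	}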
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Vladislav Bolkhovitin
Date: 2009-04-16 17:12 UTC
To: Wu Fengguang
Cc: Ingo Molnar, Andrew Morton, Jens Axboe, Jeff Moyer, LKML,
    Peter Zijlstra, Nick Piggin, Rik van Riel, Linus Torvalds,
    Chenfeng Xu, linux-mm

Wu Fengguang, on 04/12/2009 04:35 PM wrote:
> On Sun, Apr 12, 2009 at 04:48:19PM +0800, Ingo Molnar wrote:
>> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>>> Introduce page cache context based readahead algorithm.
>>> This is to better support concurrent read streams in general.
[...]
>> Very elegant method! I suspect this will work far better
>> than adding various increasingly more complex heuristics.
>>
>> Emphatically-Acked-by: Ingo Molnar <mingo@elte.hu>
[...]
> We need a thrashing safe mode which does
> - the above modified context readahead algorithm
> - conservative ramp-up of the readahead size
> - conservative async readahead size
>
> The main problem is: when shall we switch into that mode?

The more I think about an ideal readahead, the more it looks like the
page cache should also keep fairness between its users, similarly to
what is done for the CPU (CFS) and the disk (CFQ). A slow user should
have a chance to use its chunk of the cache in the face of a too-fast
thrasher.

The main problems with it are to define what a "user" is and to
implement the fairness in an acceptably simple way. Maybe something
like this: a "user" is an IO context (or IO stream?), and if there is
a need to get some pages for user A, pages belonging to other users
are evicted with an additional weight W, so that A's own pages are
preferred as eviction victims.

Just my 0.05c.
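[Aside: Vladislav's weighted-eviction idea might look roughly like the
sketch below. Everything here is hypothetical - the kernel has no
per-user page ownership tracking, and page_ra_owner() and EVICT_WEIGHT
are made-up names.]

	/*
	 * Hypothetical sketch of per-user cache fairness: when
	 * reclaiming on behalf of IO context A, A's own pages are the
	 * preferred eviction victims, so a fast thrasher mostly eats
	 * its own share of the cache instead of everyone else's.
	 */
	#define EVICT_WEIGHT	4	/* tunable preference weight W */

	static unsigned long eviction_score(struct page *page,
					    struct io_context *reclaimer,
					    unsigned long age)
	{
		/* made-up helper: which IO context read this page in? */
		struct io_context *owner = page_ra_owner(page);

		/* older pages score higher and are evicted sooner */
		if (owner == reclaimer)
			return age * EVICT_WEIGHT;
		return age;
	}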
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Wu Fengguang
Date: 2009-04-14 9:27 UTC
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
    Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
    linux-mm, Chenfeng Xu

On Tue, Apr 14, 2009 at 03:29:52PM +0800, Andi Kleen wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
> >
> > The context readahead algorithm guarantees to discover the
> > sequentialness no matter how the streams are interleaved. For the
> > above example, it will start sequential readahead from page 2
> > and 1002.
> >
> > The trick is to poke for page @offset-1 in the page cache when it
> > has no other clues on the sequentialness of request @offset: if
> > the current request belongs to a sequential stream, that stream
> > must have accessed page @offset-1 recently,
>
> Really? The page could be just randomly cached from some previous
> IO, right? I don't think you can detect that, right? [maybe you
> could play some games with the LRU list, but that would likely not
> be O(1)]
>
> I wonder if this could cause the prefetcher to "switch gears"
> suddenly when it first walks a long stretch of non-cached file data
> and then suddenly hits a cached page.
>
> One possible way around this would be to add a concept of "page
> generation", e.g. using a few bits of page flags. Keep a global
> generation count that increases slowly (let's say every few
> seconds). Mark the page with the current generation count when the
> prefetcher takes it. When doing your algorithm, check the count
> first and only use pages with a recent count.
>
> Not sure if it's a real problem or not.

Good catch Andi! I'll list some possible situations. I guess you are
referring to (2.3)?

1) page at @offset-1 is present, and it may be a trace of:

1.1) (interleaved) sequential reads: we catch it, bingo!

1.2) clustered random reads: readahead will be mostly favorable

If page @offset-1 exists, it means there were two references that came
close in both space (distance < one readahead window) and time
(distance < one LRU scan cycle). So it's a good indication that some
clustered random reads are occurring in the area around @offset.

I have done many experiments on the performance impact of readahead on
clustered random reads - the results are very encouraging. The tests
covered different random read densities and random read sizes, as well
as different thrashing conditions (by varying the dataset:memory
ratio). There is hardly any performance loss (2% in the worst case),
but the gain can be as large as 300%! Some numbers, curves and
analysis can be found in the attached slides and paper:

- readahead performance slides:
  Pages 23-26: Clustered random reads
- readahead framework paper:
  Page 7, Section 4.3: Random reads

The recent readahead addition for mmap reads is another vivid example
of the stated "readahead is good for clustered random reads"
principle. The side effects of readahead on the random references made
by executables are very good:
- major faults reduced by 1/3
- mmap IO numbers reduced by 1/4
- with no obvious overheads

But as always, one can craft a workload to totally defeat the
readahead heuristics ;-)

2) page at @offset-1 is not present (for a sequential stream)

2.1) aggressive user space drop-behind: fixable nuisance

The user space could be doing fadvise(offset-1, DONTNEED), which drops
the history hint required to enable context readahead. But I guess
when the administrator/developer notices its impact - performance
dropped instead of increased - he can easily fix it up to do
fadvise(offset-2, DONTNEED), or to manage its own readahead via
fadvise(WILLNEED). So this is a nuisance, but a fixable one.

2.2) readahead thrashing: not handled for now

We don't handle this for now. The current behavior is to stop
readahead and possibly restart the readahead window ramp-up process.

2.3) readahead cache hits: rare case, and the impact is temporary

The page at @offset-1 does get referenced by this stream, but it was
created by someone else some distant time ago. The page at @offset-1
may be lifted to the active LRU by this second reference, or, too
late, get reclaimed - by the time we reference page @offset.

Normally it's a range of cached pages. We are a) either walking inside
the range and enjoying the cache hits, b) or we walk out of it and
restart readahead by ourselves, c) or the range of cached pages gets
reclaimed while we are walking on them, and hence we cannot find page
@offset-1.

Obviously (c) is rare and temporary, and is the main cause of (2.3).
As soon as we go to the next page at @offset+1, we'll find its
'previous' page at @offset to be cached (it was created by us!). So
the context readahead starts working again - it's merely delayed by
one page :-)

Thanks,
Fengguang

[-- Attachment #2: readahead-algorithms-and-io-performance.pdf, application/pdf, 403230 bytes --]
[-- Attachment #3: readahead-framework.pdf, application/pdf, 610798 bytes --]
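[Aside: Andi's "page generation" idea from above might be sketched
roughly as follows. This is hypothetical only - no such facility
exists in mainline, and the flag-bit helpers set_page_ra_gen() and
page_ra_gen() are made up.]

	/*
	 * Sketch: assume two spare page flag bits hold a generation
	 * stamp; a slow timer bumps the global counter every few
	 * seconds, and the history probe only trusts pages stamped
	 * recently.
	 */
	#define RA_GEN_BITS	2
	#define RA_GEN_MASK	((1 << RA_GEN_BITS) - 1)

	static atomic_t ra_generation;	/* bumped every few seconds */

	static void ra_stamp_page(struct page *page)
	{
		set_page_ra_gen(page,
				atomic_read(&ra_generation) & RA_GEN_MASK);
	}

	/* only treat recently stamped pages as readahead history */
	static bool ra_page_is_recent(struct page *page)
	{
		unsigned int now;

		now = atomic_read(&ra_generation) & RA_GEN_MASK;
		/* accept the current and the previous generation */
		return ((now - page_ra_gen(page)) & RA_GEN_MASK) <= 1;
	}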
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Andi Kleen
Date: 2009-04-14 10:00 UTC
To: Wu Fengguang
Cc: Andi Kleen, Andrew Morton, Vladislav Bolkhovitin, Jens Axboe,
    Jeff Moyer, Nick Piggin, Ingo Molnar, Linus Torvalds,
    Peter Zijlstra, linux-mm, Chenfeng Xu

> I'll list some possible situations. I guess you are referring
> to (2.3)?

Yep. Thanks for the detailed analysis.

> 2.3) readahead cache hits: rare case, and the impact is temporary
>
> The page at @offset-1 does get referenced by this stream, but it was
> created by someone else some distant time ago. The page at @offset-1
> may be lifted to the active LRU by this second reference, or, too
> late, get reclaimed - by the time we reference page @offset.
>
> Normally it's a range of cached pages. We are a) either walking
> inside the range and enjoying the cache hits, b) or we walk out of
> it and restart readahead by ourselves, c) or the range of cached
> pages gets reclaimed while we are walking on them, and hence we
> cannot find page @offset-1.
>
> Obviously (c) is rare and temporary, and is the main cause of (2.3).
> As soon as we go to the next page at @offset+1, we'll find its
> 'previous' page at @offset to be cached (it was created by us!). So
> the context readahead starts working again - it's merely delayed by
> one page :-)

Thanks. The question is how much performance impact this has on the
stream that is read ahead. I guess it would be only a small "hiccup",
with some luck hidden by the block level RA?

The other question would be whether it could cause the readahead code
to do a lot of unnecessary work, but your answer seems to be "no".
Fine.

I think the concept is sound.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Wu Fengguang
Date: 2009-04-14 10:58 UTC
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
    Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
    linux-mm@kvack.org, Chenfeng Xu

On Tue, Apr 14, 2009 at 06:00:02PM +0800, Andi Kleen wrote:
> > I'll list some possible situations. I guess you are referring
> > to (2.3)?
>
> Yep. Thanks for the detailed analysis.

You are welcome :-)

> > 2.3) readahead cache hits: rare case, and the impact is temporary
[...]
>
> Thanks. The question is how much performance impact this has on the
> stream that is read ahead. I guess it would be only a small
> "hiccup", with some luck hidden by the block level RA?

Yes, there will be a hiccup: whenever the stream walks out of the
current cached page range (or the pages get reclaimed while we are
walking in it), there will be a 1-page read, followed by a 4-page
readahead, and then 8-, 16-, ... page sized readahead, i.e. a
readahead window ramp-up process.

That's assuming we are doing 1-page reads. For large sendfile() calls,
the readahead size will be instantly restored to its full size on the
first cache miss.

> The other question would be whether it could cause the readahead
> code to do a lot of unnecessary work, but your answer seems to be
> "no". Fine.

Right. The readahead will automatically be turned off inside the
cached page ranges, and restarted after walking out of them.

> I think the concept is sound.

^_^ Thanks for your appreciation!

Best Regards,
Fengguang
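[Aside: the 1, 4, 8, 16, ... ramp-up described above corresponds to
the window sizing helper in mm/readahead.c of that era, roughly as
below.]

	/*
	 * Grow the readahead window aggressively while it is small,
	 * then keep doubling it, capped at the per-file maximum.
	 * Starting from 1 page this yields 4, 8, 16, ... pages.
	 */
	static unsigned long get_next_ra_size(struct file_ra_state *ra,
					      unsigned long max)
	{
		unsigned long cur = ra->size;
		unsigned long newsize;

		if (cur < max / 16)
			newsize = 4 * cur;
		else
			newsize = 2 * cur;

		return min(newsize, max);
	}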
* Re: [PATCH 3/3] readahead: introduce context readahead algorithm

From: Wu Fengguang
Date: 2009-04-14 11:11 UTC
To: Andi Kleen
Cc: Andrew Morton, Vladislav Bolkhovitin, Jens Axboe, Jeff Moyer,
    Nick Piggin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
    linux-mm@kvack.org, Chenfeng Xu

On Tue, Apr 14, 2009 at 06:58:24PM +0800, Wu Fengguang wrote:
> On Tue, Apr 14, 2009 at 06:00:02PM +0800, Andi Kleen wrote:
[...]
> > Thanks. The question is how much performance impact this has on
> > the stream that is read ahead. I guess it would be only a small
> > "hiccup", with some luck hidden by the block level RA?

No, the block level RA won't help here, because there are no disk
accesses at all for the cached pages before @offset: the disk RA has
absolutely no idea where the IO for page @offset originates from ;-)

> Yes, there will be a hiccup: whenever the stream walks out of the
> current cached page range (or the pages get reclaimed while we are
> walking in it), there will be a 1-page read, followed by a 4-page
> readahead, and then 8-, 16-, ... page sized readahead, i.e. a
> readahead window ramp-up process.
[...]
Thread overview: 6+ messages
-- links below jump to the message on this page --
[not found] <20090412071950.166891982@intel.com>
[not found] ` <20090412072052.686760755@intel.com>
[not found] ` <20090412084819.GA25314@elte.hu>
2009-04-12 12:35 ` [PATCH 3/3] readahead: introduce context readahead algorithm Wu Fengguang
2009-04-16 17:12 ` Vladislav Bolkhovitin
[not found] ` <87zlej7kwf.fsf@basil.nowhere.org>
2009-04-14 9:27 ` Wu Fengguang
2009-04-14 10:00 ` Andi Kleen
2009-04-14 10:58 ` Wu Fengguang
2009-04-14 11:11 ` Wu Fengguang