Downsides to madvise/fadvise(willneed) for application startup

public inbox for linux-kernel@vger.kernel.org

* Downsides to madvise/fadvise(willneed) for application startup
@ 2010-04-05 22:43 Taras Glek
  2010-04-05 23:17 ` Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-05 22:43 UTC
  To: linux-kernel

Hello,
I am working on improving Mozilla startup times. It turns out that page
faults (caused by a lack of cooperation between userspace and the kernel)
are the main cause of slow startup. I need some insight from someone who
understands Linux VM behavior.

Current Situation:
The dynamic linker mmap()s the executable and data sections of our
executable, but it doesn't call madvise().
By default, page faults trigger 131072-byte (128 KiB) reads. To make
matters worse, the compile-time linker and gcc lay out code in a manner
that does not correspond to how the resulting executable will be executed
(i.e. the layout is basically random). This means that during startup,
15-40 MB binaries are read in a basically random fashion. Even if one
orders the binary optimally, throughput is still suboptimal due to the
puny readahead.

IO Hints:
Fortunately, when one specifies madvise(WILLNEED), page faults trigger
2 MB reads, and a binary that tends to take 110 page faults (i.e. the
program stops execution and waits for disk) can be brought down to 6. This
has the potential to double the startup speed of large apps without any
clear downsides. SUSE ships their glibc with a dynamic linker patch to
fadvise() dynamic libraries (not sure why they switched from doing
madvise before).
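
To make the hint concrete, here is a minimal sketch of what issuing
madvise(MADV_WILLNEED) over a whole file mapping could look like. The
helper name map_and_prefetch() is hypothetical and is not taken from
glibc, ld.so or the SUSE patch; it only illustrates the system calls
involved.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* exposes madvise() in glibc headers */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only and hint that the whole
 * mapping will be needed soon. Returns the mapped length on success,
 * -1 on failure. */
long map_and_prefetch(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping keeps the file referenced */
    if (addr == MAP_FAILED)
        return -1;

    /* The hint itself: ask for readahead over [addr, addr + len). */
    int rc = madvise(addr, len, MADV_WILLNEED);

    /* A real loader would keep the mapping; this sketch drops it. */
    munmap(addr, len);
    return rc == 0 ? (long)len : -1;
}
```

A caller would invoke it as map_and_prefetch("/path/to/libfoo.so"),
where the path is a placeholder.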

I filed a glibc bug about this at 
http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented 
with his concern about wasting memory resources. What is the impact of 
madvise(WILLNEED) or the fadvise equivalent on systems under memory 
pressure? Does the kernel simply start ignoring these hints?

Also, once an application is started is it reasonable to keep it 
madvise(WILLNEED)ed or should the madvise flags be reset?

Perhaps the kernel could monitor the page-in patterns to increase the
readahead sizes? This may already happen: I've noticed that a handful of
page faults trigger more than 131072 bytes of IO, so perhaps this just
needs tweaking.

Thanks,
Taras Glek

PS. For more details on this issue see my blog at 
https://blog.mozilla.com/tglek/


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
@ 2010-04-05 23:17 ` Dave Chinner
  2010-04-05 23:52 ` Roland Dreier
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2010-04-05 23:17 UTC
  To: Taras Glek; +Cc: linux-kernel

On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> Hello,
> I am working on improving Mozilla startup times. It turns out that
> page faults(caused by lack of cooperation between user/kernelspace)
> are the main cause of slow startup. I need some insights from
> someone who understands linux vm behavior.
> 
> Current Situation:
> The dynamic linker mmap()s  executable and data sections of our
> executable but it doesn't call madvise().
> By default page faults trigger 131072byte reads. To make matters

Try tuning /sys/block/<dev>/queue/read_ahead_kb and see if that
makes any difference - that's the default maximum readahead for the
given block device and defaults to 128k.
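
For reference, the current value can be read programmatically as well as
with cat. The sketch below is only an illustration: the function name is
made up, and a device name such as "sda" in the path is an assumption
that has to be replaced with the actual block device.

```c
#include <stdio.h>

/* Hypothetical helper: read the integer (in KiB) stored in a
 * sysfs-style file such as /sys/block/sda/queue/read_ahead_kb.
 * Returns the value, or -1 if the file cannot be read or parsed. */
long read_ahead_kb(const char *sysfs_path)
{
    FILE *f = fopen(sysfs_path, "r");
    if (f == NULL)
        return -1;

    long kb = -1;
    if (fscanf(f, "%ld", &kb) != 1)
        kb = -1;

    fclose(f);
    return kb;
}
```

Changing the value means writing a new number to the same file, which
normally requires root.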

There has been some recent work to increase the default readahead
size, so if changing the default improves performance then perhaps
a fix for your problem is already in the works?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
  2010-04-05 23:17 ` Dave Chinner
@ 2010-04-05 23:52 ` Roland Dreier
  2010-04-06 22:09   ` Taras Glek
  2010-04-06  9:51 ` Johannes Weiner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Roland Dreier @ 2010-04-05 23:52 UTC
  To: Taras Glek; +Cc: linux-kernel

Almost certainly teaching my grandmother to suck eggs, but are you aware
of the work Michael Meeks has done on improving openoffice.org startup time?
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 23:52 ` Roland Dreier
@ 2010-04-06 22:09   ` Taras Glek
  0 siblings, 0 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-06 22:09 UTC
  To: Roland Dreier; +Cc: linux-kernel

On 04/05/2010 04:52 PM, Roland Dreier wrote:
> Almost certainly teaching my grandmother to suck eggs, but are you aware
> of the work Michael Meeks has done on improving openoffice.org startup time?
>    
Yes. There were some stones left unturned in the cold startup area.
It turns out that every large application suffers from low IO
throughput, likely due to a lack of cooperation between the dynamic
linker and the kernel.
There is a glibc bug filed on that.

http://sourceware.org/bugzilla/show_bug.cgi?id=11431

Unfortunately, few userspace people seem to know exactly how madvise() 
hints behave, so I was hoping someone on LKML would clue me in.

Taras


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
  2010-04-05 23:17 ` Dave Chinner
  2010-04-05 23:52 ` Roland Dreier
@ 2010-04-06  9:51 ` Johannes Weiner
  2010-04-06 21:57   ` Taras Glek
  2010-04-07  2:24   ` Wu Fengguang
  2010-04-12  8:50 ` Andi Kleen
  2010-04-15 22:53 ` Andrew Morton
  4 siblings, 2 replies; 35+ messages in thread
From: Johannes Weiner @ 2010-04-06  9:51 UTC
  To: Taras Glek; +Cc: Wu Fengguang, linux-mm, linux-kernel

On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> Hello,
> I am working on improving Mozilla startup times. It turns out that page 
> faults(caused by lack of cooperation between user/kernelspace) are the 
> main cause of slow startup. I need some insights from someone who 
> understands linux vm behavior.
> 
> Current Situation:
> The dynamic linker mmap()s  executable and data sections of our 
> executable but it doesn't call madvise().
> By default page faults trigger 131072byte reads. To make matters worse, 
> the compile-time linker + gcc lay out code in a manner that does not 
> correspond to how the resulting executable will be executed(ie the 
> layout is basically random). This means that during startup 15-40mb 
> binaries are read in basically random fashion. Even if one orders the 
> binary optimally, throughput is still suboptimal due to the puny readahead.
> 
> IO Hints:
> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb 
> reads and a binary that tends to take 110 page faults(ie program stops 
> execution and waits for disk) can be reduced down to 6. This has the 
> potential to double application startup of large apps without any clear 
> downsides. Suse ships their glibc with a dynamic linker patch to 
> fadvise() dynamic libraries(not sure why they switched from doing 
> madvise before).
> 
> I filed a glibc bug about this at 
> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented 
> with his concern about wasting memory resources. What is the impact of 
> madvise(WILLNEED) or the fadvise equivalent on systems under memory 
> pressure? Does the kernel simply start ignoring these hints?

It will throttle based on memory pressure.  In idle situations it will
eat your file cache, however, to satisfy the request.

Now, the file cache should be much bigger than the amount of unneeded
pages you prefault with the hint over the whole library, so I guess the
benefit of prefaulting the right pages outweighs the downside of evicting
some cache for unused library pages.

Still, it's a workaround for deficits in the demand-paging/readahead
heuristics and thus a bit ugly, I feel.  Maybe Wu can help.

> Also, once an application is started is it reasonable to keep it 
> madvise(WILLNEED)ed or should the madvise flags be reset?

It's a one-time operation that starts immediate readahead, no permanent
changes are done.

> Perhaps the kernel could monitor the page-in patterns to increase the 
> readahead sizes? This may already happen, I've noticed that a handful of 
> pagefaults trigger > 131072bytes of IO, perhaps this just needs tweaking.

CCd the man :-)

> Thanks,
> Taras Glek
> 
> PS. For more details on this issue see my blog at 
> https://blog.mozilla.com/tglek/


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-06  9:51 ` Johannes Weiner
@ 2010-04-06 21:57   ` Taras Glek
  2010-04-06 22:26     ` Johannes Weiner
  2010-04-07  2:24   ` Wu Fengguang
  1 sibling, 1 reply; 35+ messages in thread
From: Taras Glek @ 2010-04-06 21:57 UTC
  To: Johannes Weiner; +Cc: Wu Fengguang, linux-mm, linux-kernel

On 04/06/2010 02:51 AM, Johannes Weiner wrote:
> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>    
>> Hello,
>> I am working on improving Mozilla startup times. It turns out that page
>> faults(caused by lack of cooperation between user/kernelspace) are the
>> main cause of slow startup. I need some insights from someone who
>> understands linux vm behavior.
>>
>> Current Situation:
>> The dynamic linker mmap()s  executable and data sections of our
>> executable but it doesn't call madvise().
>> By default page faults trigger 131072byte reads. To make matters worse,
>> the compile-time linker + gcc lay out code in a manner that does not
>> correspond to how the resulting executable will be executed(ie the
>> layout is basically random). This means that during startup 15-40mb
>> binaries are read in basically random fashion. Even if one orders the
>> binary optimally, throughput is still suboptimal due to the puny readahead.
>>
>> IO Hints:
>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>> reads and a binary that tends to take 110 page faults(ie program stops
>> execution and waits for disk) can be reduced down to 6. This has the
>> potential to double application startup of large apps without any clear
>> downsides. Suse ships their glibc with a dynamic linker patch to
>> fadvise() dynamic libraries(not sure why they switched from doing
>> madvise before).
>>
>> I filed a glibc bug about this at
>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>> with his concern about wasting memory resources. What is the impact of
>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>> pressure? Does the kernel simply start ignoring these hints?
>>      
> It will throttle based on memory pressure.  In idle situations it will
> eat your file cache, however, to satisfy the request.
>    
Define idle situations. Do you mean that madvise(WILLNEED) will
aggressively read ahead, but only while the CPU (or disk?) is idle?
I am trying to optimize application startup, which means that the CPU is
busy whenever it is not blocked on IO.
> Now, the file cache should be much bigger than the amount of unneeded
> pages you prefault with the hint over the whole library, so I guess the
> benefit of prefaulting the right pages outweighs the downside of evicting
> some cache for unused library pages.
>    
> Still, it's a workaround for deficits in the demand-paging/readahead
> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>
>    
Can't wait to hear the juicy details.
>> Also, once an application is started is it reasonable to keep it
>> madvise(WILLNEED)ed or should the madvise flags be reset?
>>      
> It's a one-time operation that starts immediate readahead, no permanent
> changes are done.
>    
I may be measuring this wrong, but in my experience the only change
madvise(WILLNEED) makes is to increase the length parameter passed to
__do_page_cache_readahead(). My script is at
http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp


Taras


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-06 21:57   ` Taras Glek
@ 2010-04-06 22:26     ` Johannes Weiner
  2010-04-06 22:39       ` Taras Glek
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2010-04-06 22:26 UTC
  To: Taras Glek; +Cc: Wu Fengguang, linux-mm, linux-kernel

On Tue, Apr 06, 2010 at 02:57:30PM -0700, Taras Glek wrote:
> On 04/06/2010 02:51 AM, Johannes Weiner wrote:
> >On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >   
> >>Hello,
> >>I am working on improving Mozilla startup times. It turns out that page
> >>faults(caused by lack of cooperation between user/kernelspace) are the
> >>main cause of slow startup. I need some insights from someone who
> >>understands linux vm behavior.
> >>
> >>Current Situation:
> >>The dynamic linker mmap()s  executable and data sections of our
> >>executable but it doesn't call madvise().
> >>By default page faults trigger 131072byte reads. To make matters worse,
> >>the compile-time linker + gcc lay out code in a manner that does not
> >>correspond to how the resulting executable will be executed(ie the
> >>layout is basically random). This means that during startup 15-40mb
> >>binaries are read in basically random fashion. Even if one orders the
> >>binary optimally, throughput is still suboptimal due to the puny 
> >>readahead.
> >>
> >>IO Hints:
> >>Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>reads and a binary that tends to take 110 page faults(ie program stops
> >>execution and waits for disk) can be reduced down to 6. This has the
> >>potential to double application startup of large apps without any clear
> >>downsides. Suse ships their glibc with a dynamic linker patch to
> >>fadvise() dynamic libraries(not sure why they switched from doing
> >>madvise before).
> >>
> >>I filed a glibc bug about this at
> >>http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>with his concern about wasting memory resources. What is the impact of
> >>madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>pressure? Does the kernel simply start ignoring these hints?
> >>     
> >It will throttle based on memory pressure.  In idle situations it will
> >eat your file cache, however, to satisfy the request.
> >   
> Define idle situations. Do you mean that madv(willneed) will aggresively 
> readahead, but only while cpu(or disk?) is idle?
> I am trying to optimize application startup which means that the cpu is 
> busy while not blocked on io.

Sorry.  I meant without memory pressure.  It will trigger readahead for the
whole page range immediately, unless the sum of free pages and file cache
pages is less than that.

So yes, it will be aggressive against the cache but should not touch things
frequently in use or start swapping for example.

> >>Also, once an application is started is it reasonable to keep it
> >>madvise(WILLNEED)ed or should the madvise flags be reset?
> >>     
> >It's a one-time operation that starts immediate readahead, no permanent
> >changes are done.
> >   
> I may be measuring this wrong, but in my experience the only change 
> madvise(willneed) does in increase the length parameter to 
> __do_page_cache_readahead(). My script is at 
> http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp
> .

Whether the page is read on a major fault or by means of WILLNEED,
they both end up calling this function.  It's just that faulting
does all the heuristics and WILLNEED will just force reading the
pages in the specified range.

But your question whether it would be reasonable to keep the region
WILLNEED madvised makes no sense.  It's just a request to prepopulate
the page cache from disk data immediately instead of waiting for
faults to trigger the reads.
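
One way to observe this from userspace is to ask the kernel which pages
of a file mapping are resident, before and after the WILLNEED call. The
sketch below uses mincore() for that. The helper name is hypothetical,
and newer kernels may, as far as I know, report conservative results for
files the caller does not own, so the count should be treated as
indicative only.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* exposes mincore() in glibc headers */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count the pages of `path` that the kernel
 * reports as resident. Stores the total page count in *total_pages.
 * Returns the resident count, or -1 on failure. */
long resident_pages(const char *path, long *total_pages)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(addr, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* low bit set: page is resident */
    }

    free(vec);
    munmap(addr, len);
    *total_pages = (long)npages;
    return resident;
}
```

Calling it once before and once after madvise(MADV_WILLNEED) on a cold
file should show the resident count growing without any fault having
been taken on the mapping.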


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-06 22:26     ` Johannes Weiner
@ 2010-04-06 22:39       ` Taras Glek
  0 siblings, 0 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-06 22:39 UTC
  To: Johannes Weiner; +Cc: Wu Fengguang, linux-mm, linux-kernel

On 04/06/2010 03:26 PM, Johannes Weiner wrote:
> On Tue, Apr 06, 2010 at 02:57:30PM -0700, Taras Glek wrote:
>    
>> On 04/06/2010 02:51 AM, Johannes Weiner wrote:
>>      
>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>>
>>>        
>>>> Hello,
>>>> I am working on improving Mozilla startup times. It turns out that page
>>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>>> main cause of slow startup. I need some insights from someone who
>>>> understands linux vm behavior.
>>>>
>>>> Current Situation:
>>>> The dynamic linker mmap()s  executable and data sections of our
>>>> executable but it doesn't call madvise().
>>>> By default page faults trigger 131072byte reads. To make matters worse,
>>>> the compile-time linker + gcc lay out code in a manner that does not
>>>> correspond to how the resulting executable will be executed(ie the
>>>> layout is basically random). This means that during startup 15-40mb
>>>> binaries are read in basically random fashion. Even if one orders the
>>>> binary optimally, throughput is still suboptimal due to the puny
>>>> readahead.
>>>>
>>>> IO Hints:
>>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>>> reads and a binary that tends to take 110 page faults(ie program stops
>>>> execution and waits for disk) can be reduced down to 6. This has the
>>>> potential to double application startup of large apps without any clear
>>>> downsides. Suse ships their glibc with a dynamic linker patch to
>>>> fadvise() dynamic libraries(not sure why they switched from doing
>>>> madvise before).
>>>>
>>>> I filed a glibc bug about this at
>>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>>> with his concern about wasting memory resources. What is the impact of
>>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>>> pressure? Does the kernel simply start ignoring these hints?
>>>>
>>>>          
>>> It will throttle based on memory pressure.  In idle situations it will
>>> eat your file cache, however, to satisfy the request.
>>>
>>>        
>> Define idle situations. Do you mean that madv(willneed) will aggresively
>> readahead, but only while cpu(or disk?) is idle?
>> I am trying to optimize application startup which means that the cpu is
>> busy while not blocked on io.
>>      
> Sorry.  I meant without memory pressure.  It will trigger readahead for the
> whole page range immediately, unless the sum of free pages and file cache
> pages is less than that.
>
> So yes, it will be aggressive against the cache but should not touch things
> frequently in use or start swapping for example.
>    
Perfect.
>    
>>>> Also, once an application is started is it reasonable to keep it
>>>> madvise(WILLNEED)ed or should the madvise flags be reset?
>>>>
>>>>          
>>> It's a one-time operation that starts immediate readahead, no permanent
>>> changes are done.
>>>
>>>        
>> I may be measuring this wrong, but in my experience the only change
>> madvise(willneed) does in increase the length parameter to
>> __do_page_cache_readahead(). My script is at
>> http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp
>> .
>>      
> Whether the page is read on a major fault or by means of WILLNEED,
> they both end up calling this function.  It's just that faulting
> does all the heuristics and WILLNEED will just force reading the
> pages in the specified range.
>
> But your question whether it would be reasonable to keep the region
> WILLNEED madvised makes no sense.  It's just a request to prepopulate
> the page cache from disk data immediately instead of waiting for
> faults to trigger the reads.
>    
Ok. Thanks for clarifying that. I was misinterpreting my IO log.
Is there a way to force page faults from a particular memory mapping to
do more readahead, i.e. when WILLNEED is not used?


Have heuristics that read backwards been considered? Currently, if one
faults in the page at offset 4096, that page and a few pages following it
will be preread. It would be interesting to try prereading pages both
before and after the page being faulted in.
For a graph of "backwards" IO see the "Post-linker Fail" section in
http://blog.mozilla.com/tglek/2010/03/24/linux-why-loading-binaries-from-disk-sucks/
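
To spell out the idea, a window centered on the faulting page could be
computed roughly as below. This is purely an illustration of the
arithmetic; the struct and function names are invented here and this is
not the kernel's readahead code.

```c
#include <stddef.h>

struct ra_window {
    size_t start;   /* index of the first page to read */
    size_t count;   /* number of pages to read */
};

/* Illustrative only: center a window of `window_pages` pages on
 * `fault_page` and clamp it to the file, which has `file_pages` pages.
 * Returns an empty window for out-of-range input. */
struct ra_window read_around(size_t fault_page, size_t window_pages,
                             size_t file_pages)
{
    struct ra_window w = { 0, 0 };
    if (file_pages == 0 || window_pages == 0 || fault_page >= file_pages)
        return w;

    size_t half = window_pages / 2;
    w.start = fault_page > half ? fault_page - half : 0;

    size_t end = w.start + window_pages;
    if (end > file_pages)
        end = file_pages;

    w.count = end - w.start;
    return w;
}
```

With 4096-byte pages, a 128 KiB window is 32 pages, so a fault on page
100 would read pages 84 through 115 instead of only the pages from 100
onwards.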


Taras


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-06  9:51 ` Johannes Weiner
  2010-04-06 21:57   ` Taras Glek
@ 2010-04-07  2:24   ` Wu Fengguang
  2010-04-07  2:54     ` Taras Glek
  1 sibling, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2010-04-07  2:24 UTC
  To: Johannes Weiner
  Cc: Taras Glek, linux-mm@kvack.org, linux-kernel@vger.kernel.org

Hi Taras,

On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> > Hello,
> > I am working on improving Mozilla startup times. It turns out that page 
> > faults(caused by lack of cooperation between user/kernelspace) are the 
> > main cause of slow startup. I need some insights from someone who 
> > understands linux vm behavior.

How about improving Fedora (and other distros) to preload Mozilla (and
other apps the user ran during the previous boot) with fadvise() at boot
time? This sounds like the most reasonable option.
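
A preloader along these lines could be as small as the sketch below. It
assumes a boot-time tool that already knows which files to warm up; the
function name and how the file list is obtained are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* exposes posix_fadvise() in glibc headers */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to start reading all of `path`
 * into the page cache. Returns 0 on success, -1 if the file cannot be
 * opened, or the positive error number returned by posix_fadvise(). */
int preload_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Offset 0 with length 0 covers the file from start to end. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    close(fd);
    return rc;
}
```

The call does not wait for the data; it only queues the reads, so a
loop over many files stays cheap for the caller.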

As for the kernel readahead, I have a patchset to increase default
mmap read-around size from 128kb to 512kb (except for small memory
systems).  This should help your case as well.

> > Current Situation:
> > The dynamic linker mmap()s  executable and data sections of our 
> > executable but it doesn't call madvise().
> > By default page faults trigger 131072byte reads. To make matters worse, 
> > the compile-time linker + gcc lay out code in a manner that does not 
> > correspond to how the resulting executable will be executed(ie the 
> > layout is basically random). This means that during startup 15-40mb 
> > binaries are read in basically random fashion. Even if one orders the 
> > binary optimally, throughput is still suboptimal due to the puny readahead.
> > 
> > IO Hints:
> > Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb 
> > reads and a binary that tends to take 110 page faults(ie program stops 
> > execution and waits for disk) can be reduced down to 6. This has the 
> > potential to double application startup of large apps without any clear 
> > downsides.
> >
> > Suse ships their glibc with a dynamic linker patch to fadvise()
> > dynamic libraries(not sure why they switched from doing madvise
> > before).

This is interesting. I wonder how SuSE implements the policy.
Do you have the patch or some strace output that demonstrates the
fadvise() call?

> > I filed a glibc bug about this at 
> > http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented 
> > with his concern about wasting memory resources. What is the impact of 
> > madvise(WILLNEED) or the fadvise equivalent on systems under memory 
> > pressure? Does the kernel simply start ignoring these hints?
> 
> It will throttle based on memory pressure.  In idle situations it will
> eat your file cache, however, to satisfy the request.
> 
> Now, the file cache should be much bigger than the amount of unneeded
> pages you prefault with the hint over the whole library, so I guess the
> benefit of prefaulting the right pages outweighs the downside of evicting
> some cache for unused library pages.
> 
> Still, it's a workaround for deficits in the demand-paging/readahead
> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.

Program page faults are inherently random, so the straightforward
solution would be to increase the mmap read-around size (for desktops
with reasonably large memory), rather than to improve program layout
or readahead heuristics :)

> > Also, once an application is started is it reasonable to keep it 
> > madvise(WILLNEED)ed or should the madvise flags be reset?
> 
> It's a one-time operation that starts immediate readahead, no permanent
> changes are done.

Right. The kernel regards WILLNEED as a readahead request from userspace.

> > Perhaps the kernel could monitor the page-in patterns to increase the 
> > readahead sizes? This may already happen, I've noticed that a handful of 
> > pagefaults trigger > 131072bytes of IO, perhaps this just needs tweaking.
> 
> CCd the man :-)

Thank you :)

Cheers,
Fengguang


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  2:24   ` Wu Fengguang
@ 2010-04-07  2:54     ` Taras Glek
  2010-04-07  4:06       ` Minchan Kim
  2010-04-07  7:38       ` Wu Fengguang
  0 siblings, 2 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-07  2:54 UTC
  To: Wu Fengguang
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> Hi Taras,
>
> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>    
>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>      
>>> Hello,
>>> I am working on improving Mozilla startup times. It turns out that page
>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>> main cause of slow startup. I need some insights from someone who
>>> understands linux vm behavior.
>>>        
> How about improve Fedora (and other distros) to preload Mozilla (and
> other apps the user run at the previous boot) with fadvise() at boot
> time? This sounds like the most reasonable option.
>    
That's a slightly different use case. I'd rather have all large apps
start up as efficiently as possible without any hacks. Though until we
get there, we'll be using all of the hacks we can.
> As for the kernel readahead, I have a patchset to increase default
> mmap read-around size from 128kb to 512kb (except for small memory
> systems).  This should help your case as well.
>    
Yes. Is the current readahead really doing read-around (i.e. does it read
pages before the one being faulted)? From what I've seen, having the
dynamic linker read binary sections backwards causes faults.
http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>    
>>> Current Situation:
>>> The dynamic linker mmap()s  executable and data sections of our
>>> executable but it doesn't call madvise().
>>> By default page faults trigger 131072byte reads. To make matters worse,
>>> the compile-time linker + gcc lay out code in a manner that does not
>>> correspond to how the resulting executable will be executed(ie the
>>> layout is basically random). This means that during startup 15-40mb
>>> binaries are read in basically random fashion. Even if one orders the
>>> binary optimally, throughput is still suboptimal due to the puny readahead.
>>>
>>> IO Hints:
>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>> reads and a binary that tends to take 110 page faults(ie program stops
>>> execution and waits for disk) can be reduced down to 6. This has the
>>> potential to double application startup of large apps without any clear
>>> downsides.
>>>
>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>>> dynamic libraries(not sure why they switched from doing madvise
>>> before).
>>>        
> This is interesting. I wonder how SuSE implements the policy.
> Do you have the patch or some strace output that demonstrates the
> fadvise() call?
>    
glibc-2.3.90-ld.so-madvise.diff in 
http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0: 


As I recall, they just fadvise the file descriptor before accessing it.
>    
>>> I filed a glibc bug about this at
>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>> with his concern about wasting memory resources. What is the impact of
>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>> pressure? Does the kernel simply start ignoring these hints?
>>>        
>> It will throttle based on memory pressure.  In idle situations it will
>> eat your file cache, however, to satisfy the request.
>>
>> Now, the file cache should be much bigger than the amount of unneeded
>> pages you prefault with the hint over the whole library, so I guess the
>> benefit of prefaulting the right pages outweighs the downside of evicting
>> some cache for unused library pages.
>>
>> Still, it's a workaround for deficits in the demand-paging/readahead
>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>>      
> Program page faults are inherently random, so the straightforward
> solution would be to increase the mmap read-around size (for desktops
> with reasonable large memory), rather than to improve program layout
> or readahead heuristics :)
>    
Program page faults may exhibit random behavior once the program has
started.

During startup, the page-in pattern of over-engineered OO applications is
very predictable. Programs are laid out based on compilation units,
which have no relation to how they are executed. Another problem is that
any large old application will have lots of code that is either rarely
executed or completely dead. Random sprinkling of live code among mostly
unneeded code is a problem.
I'm able to reduce startup page faults by 2.5x and memory usage by a few
MB with proper binary layout. Even if one lays out a program wrongly, the
worst-case page-in pattern will be pretty similar to what it is by default.

But yes, I completely agree that it would be awesome to increase the 
readahead size proportionally to available memory. It's a little silly 
to be reading tens of megabytes in 128kb increments :)  You rock for 
trying to modernize this.

>    
>>> Also, once an application is started is it reasonable to keep it
>>> madvise(WILLNEED)ed or should the madvise flags be reset?
>>>        
>> It's a one-time operation that starts immediate readahead, no permanent
>> changes are done.
>>      
> Right. The kernel regard WILLNEED as a readahead request from userspace.
>
>    
>>> Perhaps the kernel could monitor the page-in patterns to increase the
>>> readahead sizes? This may already happen, I've noticed that a handful of
>>> pagefaults trigger>  131072bytes of IO, perhaps this just needs tweaking.
>>>        
>> CCd the man :-)
>>      
> Thank you :)
>
> Cheers,
> Fengguang
>    

Cheers,
Taras


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  2:54     ` Taras Glek
@ 2010-04-07  4:06       ` Minchan Kim
  2010-04-07  7:14         ` Wu Fengguang
  2010-04-07  7:38       ` Wu Fengguang
  1 sibling, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2010-04-07  4:06 UTC
  To: Taras Glek
  Cc: Wu Fengguang, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Wed, Apr 7, 2010 at 11:54 AM, Taras Glek <tglek@mozilla.com> wrote:
> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
>>
>> Hi Taras,
>>
>> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>>
>>>
>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>>
>>>>
>>>> Hello,
>>>> I am working on improving Mozilla startup times. It turns out that page
>>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>>> main cause of slow startup. I need some insights from someone who
>>>> understands linux vm behavior.
>>>>
>>
>> How about improve Fedora (and other distros) to preload Mozilla (and
>> other apps the user run at the previous boot) with fadvise() at boot
>> time? This sounds like the most reasonable option.
>>
>
> That's a slightly different usecase. I'd rather have all large apps startup
> as efficiently as possible without any hacks. Though until we get there,
> we'll be using all of the hacks we can.
>>
>> As for the kernel readahead, I have a patchset to increase default
>> mmap read-around size from 128kb to 512kb (except for small memory
>> systems).  This should help your case as well.
>>
>
> Yes. Is the current readahead really doing read-around(ie does it read pages
> before the one being faulted)? From what I've seen, having the dynamic
> linker read binary sections backwards causes faults.
> http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>>
>>
>>>>
>>>> Current Situation:
>>>> The dynamic linker mmap()s  executable and data sections of our
>>>> executable but it doesn't call madvise().
>>>> By default page faults trigger 131072byte reads. To make matters worse,
>>>> the compile-time linker + gcc lay out code in a manner that does not
>>>> correspond to how the resulting executable will be executed(ie the
>>>> layout is basically random). This means that during startup 15-40mb
>>>> binaries are read in basically random fashion. Even if one orders the
>>>> binary optimally, throughput is still suboptimal due to the puny
>>>> readahead.
>>>>
>>>> IO Hints:
>>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>>> reads and a binary that tends to take 110 page faults(ie program stops
>>>> execution and waits for disk) can be reduced down to 6. This has the
>>>> potential to double application startup of large apps without any clear
>>>> downsides.
>>>>
>>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>>>> dynamic libraries(not sure why they switched from doing madvise
>>>> before).
>>>>
>>
>> This is interesting. I wonder how SuSE implements the policy.
>> Do you have the patch or some strace output that demonstrates the
>> fadvise() call?
>>
>
> glibc-2.3.90-ld.so-madvise.diff in
> http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
>
> As I recall they just fadvise the filedescriptor before accessing it.
>>
>>
>>>>
>>>> I filed a glibc bug about this at
>>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>>> with his concern about wasting memory resources. What is the impact of
>>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>>> pressure? Does the kernel simply start ignoring these hints?
>>>>
>>>
>>> It will throttle based on memory pressure.  In idle situations it will
>>> eat your file cache, however, to satisfy the request.
>>>
>>> Now, the file cache should be much bigger than the amount of unneeded
>>> pages you prefault with the hint over the whole library, so I guess the
>>> benefit of prefaulting the right pages outweighs the downside of evicting
>>> some cache for unused library pages.
>>>
>>> Still, it's a workaround for deficits in the demand-paging/readahead
>>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>>>
>>
>> Program page faults are inherently random, so the straightforward
>> solution would be to increase the mmap read-around size (for desktops
>> with reasonable large memory), rather than to improve program layout
>> or readahead heuristics :)
>>
>
> Program page faults may exhibit random behavior once they've started.
>
> During startup page-in pattern of over-engineered OO applications is very
> predictable. Programs are laid out based on compilation units, which have no
> relation to how they are executed. Another problem is that any large old
> application will have lots of code that is either rarely executed or
> completely dead. Random sprinkling of live code among mostly unneeded code
> is a problem.
> I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with
> proper binary layout. Even if one lays out a program wrongly, the worst-case
> pagein pattern will be pretty similar to what it is by default.
>
> But yes, I completely agree that it would be awesome to increase the
> readahead size proportionally to available memory. It's a little silly to be
> reading tens of megabytes in 128kb increments :)  You rock for trying to
> modernize this.

Hi, Wu and Taras.

I have been watching this thread because I have some experience with
reducing application startup latency on embedded systems.

I think increasing the readahead size is sometimes not good for embedded
systems. Many of them use NAND flash for storage together with a
compressed file system.
On NAND, as you know, the penalty for random reads is much smaller than
on an HDD.
On a compressed file system, the higher the compression, the more
readahead delays startup (a big block has to be read and decompressed).
We had to disable readahead for code pages by hacking the kernel.
That makes the application slower as time goes by, but at that time we
decided that latency was more important than overall performance for
our application.

Of course, the result depends on which file system and compression
ratio are used.
So I think increasing the readahead size might not always be good.

Please consider embedded systems too when you plan to tweak
readahead. :)

-- 
Kind regards,
Minchan Kim


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  4:06       ` Minchan Kim
@ 2010-04-07  7:14         ` Wu Fengguang
  2010-04-07  7:33           ` Minchan Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2010-04-07  7:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 6685 bytes --]

On Wed, Apr 07, 2010 at 12:06:07PM +0800, Minchan Kim wrote:
> On Wed, Apr 7, 2010 at 11:54 AM, Taras Glek <tglek@mozilla.com> wrote:
> > On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> >>
> >> Hi Taras,
> >>
> >> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> >>
> >>>
> >>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >>>
> >>>>
> >>>> Hello,
> >>>> I am working on improving Mozilla startup times. It turns out that page
> >>>> faults(caused by lack of cooperation between user/kernelspace) are the
> >>>> main cause of slow startup. I need some insights from someone who
> >>>> understands linux vm behavior.
> >>>>
> >>
> >> How about improve Fedora (and other distros) to preload Mozilla (and
> >> other apps the user run at the previous boot) with fadvise() at boot
> >> time? This sounds like the most reasonable option.
> >>
> >
> > That's a slightly different usecase. I'd rather have all large apps startup
> > as efficiently as possible without any hacks. Though until we get there,
> > we'll be using all of the hacks we can.
> >>
> >> As for the kernel readahead, I have a patchset to increase default
> >> mmap read-around size from 128kb to 512kb (except for small memory
> >> systems).  This should help your case as well.
> >>
> >
> > Yes. Is the current readahead really doing read-around(ie does it read pages
> > before the one being faulted)? From what I've seen, having the dynamic
> > linker read binary sections backwards causes faults.
> > http://sourceware.org/bugzilla/show_bug.cgi?id=11447
> >>
> >>
> >>>>
> >>>> Current Situation:
> >>>> The dynamic linker mmap()s  executable and data sections of our
> >>>> executable but it doesn't call madvise().
> >>>> By default page faults trigger 131072byte reads. To make matters worse,
> >>>> the compile-time linker + gcc lay out code in a manner that does not
> >>>> correspond to how the resulting executable will be executed(ie the
> >>>> layout is basically random). This means that during startup 15-40mb
> >>>> binaries are read in basically random fashion. Even if one orders the
> >>>> binary optimally, throughput is still suboptimal due to the puny
> >>>> readahead.
> >>>>
> >>>> IO Hints:
> >>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>>> reads and a binary that tends to take 110 page faults(ie program stops
> >>>> execution and waits for disk) can be reduced down to 6. This has the
> >>>> potential to double application startup of large apps without any clear
> >>>> downsides.
> >>>>
> >>>> Suse ships their glibc with a dynamic linker patch to fadvise()
> >>>> dynamic libraries(not sure why they switched from doing madvise
> >>>> before).
> >>>>
> >>
> >> This is interesting. I wonder how SuSE implements the policy.
> >> Do you have the patch or some strace output that demonstrates the
> >> fadvise() call?
> >>
> >
> > glibc-2.3.90-ld.so-madvise.diff in
> > http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
> >
> > As I recall they just fadvise the filedescriptor before accessing it.
> >>
> >>
> >>>>
> >>>> I filed a glibc bug about this at
> >>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>>> with his concern about wasting memory resources. What is the impact of
> >>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>>> pressure? Does the kernel simply start ignoring these hints?
> >>>>
> >>>
> >>> It will throttle based on memory pressure.  In idle situations it will
> >>> eat your file cache, however, to satisfy the request.
> >>>
> >>> Now, the file cache should be much bigger than the amount of unneeded
> >>> pages you prefault with the hint over the whole library, so I guess the
> >>> benefit of prefaulting the right pages outweighs the downside of evicting
> >>> some cache for unused library pages.
> >>>
> >>> Still, it's a workaround for deficits in the demand-paging/readahead
> >>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
> >>>
> >>
> >> Program page faults are inherently random, so the straightforward
> >> solution would be to increase the mmap read-around size (for desktops
> >> with reasonable large memory), rather than to improve program layout
> >> or readahead heuristics :)
> >>
> >
> > Program page faults may exhibit random behavior once they've started.
> >
> > During startup page-in pattern of over-engineered OO applications is very
> > predictable. Programs are laid out based on compilation units, which have no
> > relation to how they are executed. Another problem is that any large old
> > application will have lots of code that is either rarely executed or
> > completely dead. Random sprinkling of live code among mostly unneeded code
> > is a problem.
> > I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with
> > proper binary layout. Even if one lays out a program wrongly, the worst-case
> > pagein pattern will be pretty similar to what it is by default.
> >
> > But yes, I completely agree that it would be awesome to increase the
> > readahead size proportionally to available memory. It's a little silly to be
> > reading tens of megabytes in 128kb increments :)  You rock for trying to
> > modernize this.
> 
> Hi, Wu and Taras.
> 
> I have been watching this thread because I have some experience with
> reducing application startup latency on embedded systems.
> 
> I think increasing the readahead size is sometimes not good for embedded
> systems. Many of them use NAND flash for storage together with a
> compressed file system.
> On NAND, as you know, the penalty for random reads is much smaller than
> on an HDD.
> On a compressed file system, the higher the compression, the more
> readahead delays startup (a big block has to be read and decompressed).
> We had to disable readahead for code pages by hacking the kernel.
> That makes the application slower as time goes by, but at that time we
> decided that latency was more important than overall performance for
> our application.
> 
> Of course, the result depends on which file system and compression
> ratio are used.
> So I think increasing the readahead size might not always be good.
> 
> Please consider embedded systems too when you plan to tweak
> readahead. :)

Minchan, glad to know that you have experience with embedded Linux.

While increasing the general readahead size from 128KB to 512KB, I
also added a limit for mmap read-around: if the system memory size is
less than X MB, then limit the read-around size to X KB. For example,
do only 128KB read-around for a 128MB embedded box, and 32KB
read-around for a 32MB box.

Do you think it is a reasonable safeguard? Patch attached.
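
The limit described above can be worked through in user space. This is
only a sketch of the arithmetic: roundup_pow2(), readaround_pages() and
readaround_kb() are hypothetical helpers, not kernel functions, and the
page size is passed in as a parameter rather than taken from the system.

```c
#include <assert.h>

/* User-space stand-in for the kernel's roundup_pow_of_two();
 * returns 1 for n <= 1. */
unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Read-around window in pages, capped the same way as in the patch:
 * min(ra_pages, roundup_pow_of_two(totalram_pages / 1024)). */
unsigned long readaround_pages(unsigned long ra_pages,
			       unsigned long totalram_pages)
{
	unsigned long cap = roundup_pow2(totalram_pages / 1024);

	return ra_pages < cap ? ra_pages : cap;
}

/* The same cap expressed in KB for a machine with mem_mb of RAM. */
unsigned long readaround_kb(unsigned long default_kb,
			    unsigned long mem_mb,
			    unsigned long page_kb)
{
	assert(page_kb > 0);
	return readaround_pages(default_kb / page_kb,
				mem_mb * 1024 / page_kb) * page_kb;
}
```

Assuming 4KB pages and a 512KB default, readaround_kb(512, 128, 4)
gives 128 and readaround_kb(512, 32, 4) gives 32, while a 4GB machine
(readaround_kb(512, 4096, 4)) keeps the full 512.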

Thanks,
Fengguang


[-- Attachment #2: readahead-small-memory-limit-readaround.patch --]
[-- Type: text/x-diff, Size: 1886 bytes --]

readahead: limit read-ahead size for small memory systems

When lifting the default readahead size from 128KB to 512KB,
make sure it won't add memory pressure to small memory systems.

For read-ahead, the memory pressure is mainly readahead buffers consumed
by too many concurrent streams. The context readahead can adapt
readahead size to thrashing threshold well.  So in principle we don't
need to adapt the default _max_ read-ahead size to memory pressure.

For read-around, the memory pressure is mainly read-around misses on
executables/libraries. Which could be reduced by scaling down
read-around size on fast "reclaim passes".

This patch presents a straightforward solution: to limit default
read-ahead size proportional to available system memory, ie.

                512MB mem => 512KB read-around size limit
                128MB mem => 128KB read-around size limit
                 32MB mem =>  32KB read-around size limit

This will allow power users to adjust read-ahead/read-around size at
once, while saving the low end from unnecessary memory pressure, under
the assumption that low end users have no need to request a large
read-around size.

CC: Matt Mackall <mpm@selenic.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/mm/filemap.c	2010-03-01 13:27:28.000000000 +0800
+++ linux/mm/filemap.c	2010-03-01 13:38:40.000000000 +0800
@@ -1431,7 +1431,8 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = min_t(unsigned long, ra->ra_pages,
+			 roundup_pow_of_two(totalram_pages / 1024));
 	if (ra_pages) {
 		ra->start = max_t(long, 0, offset - ra_pages/2);
 		ra->size = ra_pages;


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  7:14         ` Wu Fengguang
@ 2010-04-07  7:33           ` Minchan Kim
  2010-04-07  7:47             ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2010-04-07  7:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Wed, Apr 7, 2010 at 4:14 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Apr 07, 2010 at 12:06:07PM +0800, Minchan Kim wrote:
>> On Wed, Apr 7, 2010 at 11:54 AM, Taras Glek <tglek@mozilla.com> wrote:
>> > On 04/06/2010 07:24 PM, Wu Fengguang wrote:
>> >>
>> >> Hi Taras,
>> >>
>> >> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>> >>
>> >>>
>> >>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>> >>>
>> >>>>
>> >>>> Hello,
>> >>>> I am working on improving Mozilla startup times. It turns out that page
>> >>>> faults(caused by lack of cooperation between user/kernelspace) are the
>> >>>> main cause of slow startup. I need some insights from someone who
>> >>>> understands linux vm behavior.
>> >>>>
>> >>
>> >> How about improve Fedora (and other distros) to preload Mozilla (and
>> >> other apps the user run at the previous boot) with fadvise() at boot
>> >> time? This sounds like the most reasonable option.
>> >>
>> >
>> > That's a slightly different usecase. I'd rather have all large apps startup
>> > as efficiently as possible without any hacks. Though until we get there,
>> > we'll be using all of the hacks we can.
>> >>
>> >> As for the kernel readahead, I have a patchset to increase default
>> >> mmap read-around size from 128kb to 512kb (except for small memory
>> >> systems).  This should help your case as well.
>> >>
>> >
>> > Yes. Is the current readahead really doing read-around(ie does it read pages
>> > before the one being faulted)? From what I've seen, having the dynamic
>> > linker read binary sections backwards causes faults.
>> > http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>> >>
>> >>
>> >>>>
>> >>>> Current Situation:
>> >>>> The dynamic linker mmap()s  executable and data sections of our
>> >>>> executable but it doesn't call madvise().
>> >>>> By default page faults trigger 131072byte reads. To make matters worse,
>> >>>> the compile-time linker + gcc lay out code in a manner that does not
>> >>>> correspond to how the resulting executable will be executed(ie the
>> >>>> layout is basically random). This means that during startup 15-40mb
>> >>>> binaries are read in basically random fashion. Even if one orders the
>> >>>> binary optimally, throughput is still suboptimal due to the puny
>> >>>> readahead.
>> >>>>
>> >>>> IO Hints:
>> >>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>> >>>> reads and a binary that tends to take 110 page faults(ie program stops
>> >>>> execution and waits for disk) can be reduced down to 6. This has the
>> >>>> potential to double application startup of large apps without any clear
>> >>>> downsides.
>> >>>>
>> >>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>> >>>> dynamic libraries(not sure why they switched from doing madvise
>> >>>> before).
>> >>>>
>> >>
>> >> This is interesting. I wonder how SuSE implements the policy.
>> >> Do you have the patch or some strace output that demonstrates the
>> >> fadvise() call?
>> >>
>> >
>> > glibc-2.3.90-ld.so-madvise.diff in
>> > http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
>> >
>> > As I recall they just fadvise the filedescriptor before accessing it.
>> >>
>> >>
>> >>>>
>> >>>> I filed a glibc bug about this at
>> >>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>> >>>> with his concern about wasting memory resources. What is the impact of
>> >>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>> >>>> pressure? Does the kernel simply start ignoring these hints?
>> >>>>
>> >>>
>> >>> It will throttle based on memory pressure.  In idle situations it will
>> >>> eat your file cache, however, to satisfy the request.
>> >>>
>> >>> Now, the file cache should be much bigger than the amount of unneeded
>> >>> pages you prefault with the hint over the whole library, so I guess the
>> >>> benefit of prefaulting the right pages outweighs the downside of evicting
>> >>> some cache for unused library pages.
>> >>>
>> >>> Still, it's a workaround for deficits in the demand-paging/readahead
>> >>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>> >>>
>> >>
>> >> Program page faults are inherently random, so the straightforward
>> >> solution would be to increase the mmap read-around size (for desktops
>> >> with reasonable large memory), rather than to improve program layout
>> >> or readahead heuristics :)
>> >>
>> >
>> > Program page faults may exhibit random behavior once they've started.
>> >
>> > During startup page-in pattern of over-engineered OO applications is very
>> > predictable. Programs are laid out based on compilation units, which have no
>> > relation to how they are executed. Another problem is that any large old
>> > application will have lots of code that is either rarely executed or
>> > completely dead. Random sprinkling of live code among mostly unneeded code
>> > is a problem.
>> > I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with
>> > proper binary layout. Even if one lays out a program wrongly, the worst-case
>> > pagein pattern will be pretty similar to what it is by default.
>> >
>> > But yes, I completely agree that it would be awesome to increase the
>> > readahead size proportionally to available memory. It's a little silly to be
>> > reading tens of megabytes in 128kb increments :)  You rock for trying to
>> > modernize this.
>>
>> Hi, Wu and Taras.
>>
>> I have been watching this thread because I have some experience with
>> reducing application startup latency on embedded systems.
>>
>> I think increasing the readahead size is sometimes not good for embedded
>> systems. Many of them use NAND flash for storage together with a
>> compressed file system.
>> On NAND, as you know, the penalty for random reads is much smaller than
>> on an HDD.
>> On a compressed file system, the higher the compression, the more
>> readahead delays startup (a big block has to be read and decompressed).
>> We had to disable readahead for code pages by hacking the kernel.
>> That makes the application slower as time goes by, but at that time we
>> decided that latency was more important than overall performance for
>> our application.
>>
>> Of course, the result depends on which file system and compression
>> ratio are used.
>> So I think increasing the readahead size might not always be good.
>>
>> Please consider embedded systems too when you plan to tweak
>> readahead. :)
>
> Minchan, glad to know that you have experience with embedded Linux.
>
> While increasing the general readahead size from 128KB to 512KB, I
> also added a limit for mmap read-around: if the system memory size is
> less than X MB, then limit the read-around size to X KB. For example,
> do only 128KB read-around for a 128MB embedded box, and 32KB
> read-around for a 32MB box.
>
> Do you think it is a reasonable safeguard? Patch attached.

Thanks for the reply, Wu.

I haven't looked at your attachment yet, because memory size was not
the issue in my case.
It was the only application on the system, and the first main
application to start. That means we had enough memory.

I guess there are many embedded systems like that.
At that time I could have disabled readahead completely with
read_ahead_kb, but I didn't want that, because I didn't want to disable
readahead for regular file I/O and for the program's data section.
So, for lack of a better option, I hacked the kernel to disable
readahead for the code section only.
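
For reference, read_ahead_kb is a per-block-device sysfs attribute, so
changing it affects every file on that device, which is the limitation
described above. A minimal sketch for inspecting it follows. The
function reuses the attribute's name for convenience but is a
hypothetical helper, and the device name passed in (for example "sda")
is an assumption about the target system.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the readahead window in KB configured for block device `dev`,
 * or -1 if /sys/block/<dev>/queue/read_ahead_kb cannot be read.
 */
long read_ahead_kb(const char *dev)
{
	char path[256];
	long kb = -1;
	FILE *f;
	int n;

	assert(dev != NULL);
	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/read_ahead_kb", dev);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* device name too long for the buffer */
	f = fopen(path, "r");
	if (!f)
		return -1;	/* no such device or attribute */
	if (fscanf(f, "%ld", &kb) != 1)
		kb = -1;
	fclose(f);
	return kb;
}
```

Writing a value to the same file (as root) changes the window for the
whole device; there is no way to express "code pages only" through this
knob.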

-- 
Kind regards,
Minchan Kim


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  7:33           ` Minchan Kim
@ 2010-04-07  7:47             ` Wu Fengguang
  2010-04-07  8:06               ` Minchan Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2010-04-07  7:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Wed, Apr 07, 2010 at 03:33:52PM +0800, Minchan Kim wrote:
> On Wed, Apr 7, 2010 at 4:14 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Wed, Apr 07, 2010 at 12:06:07PM +0800, Minchan Kim wrote:
> >> On Wed, Apr 7, 2010 at 11:54 AM, Taras Glek <tglek@mozilla.com> wrote:
> >> > On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> >> >>
> >> >> Hi Taras,
> >> >>
> >> >> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> >> >>
> >> >>>
> >> >>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >> >>>
> >> >>>>
> >> >>>> Hello,
> >> >>>> I am working on improving Mozilla startup times. It turns out that page
> >> >>>> faults(caused by lack of cooperation between user/kernelspace) are the
> >> >>>> main cause of slow startup. I need some insights from someone who
> >> >>>> understands linux vm behavior.
> >> >>>>
> >> >>
> >> >> How about improve Fedora (and other distros) to preload Mozilla (and
> >> >> other apps the user run at the previous boot) with fadvise() at boot
> >> >> time? This sounds like the most reasonable option.
> >> >>
> >> >
> >> > That's a slightly different usecase. I'd rather have all large apps startup
> >> > as efficiently as possible without any hacks. Though until we get there,
> >> > we'll be using all of the hacks we can.
> >> >>
> >> >> As for the kernel readahead, I have a patchset to increase default
> >> >> mmap read-around size from 128kb to 512kb (except for small memory
> >> >> systems).  This should help your case as well.
> >> >>
> >> >
> >> > Yes. Is the current readahead really doing read-around(ie does it read pages
> >> > before the one being faulted)? From what I've seen, having the dynamic
> >> > linker read binary sections backwards causes faults.
> >> > http://sourceware.org/bugzilla/show_bug.cgi?id=11447
> >> >>
> >> >>
> >> >>>>
> >> >>>> Current Situation:
> >> >>>> The dynamic linker mmap()s  executable and data sections of our
> >> >>>> executable but it doesn't call madvise().
> >> >>>> By default page faults trigger 131072byte reads. To make matters worse,
> >> >>>> the compile-time linker + gcc lay out code in a manner that does not
> >> >>>> correspond to how the resulting executable will be executed(ie the
> >> >>>> layout is basically random). This means that during startup 15-40mb
> >> >>>> binaries are read in basically random fashion. Even if one orders the
> >> >>>> binary optimally, throughput is still suboptimal due to the puny
> >> >>>> readahead.
> >> >>>>
> >> >>>> IO Hints:
> >> >>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >> >>>> reads and a binary that tends to take 110 page faults(ie program stops
> >> >>>> execution and waits for disk) can be reduced down to 6. This has the
> >> >>>> potential to double application startup of large apps without any clear
> >> >>>> downsides.
> >> >>>>
> >> >>>> Suse ships their glibc with a dynamic linker patch to fadvise()
> >> >>>> dynamic libraries(not sure why they switched from doing madvise
> >> >>>> before).
> >> >>>>
> >> >>
> >> >> This is interesting. I wonder how SuSE implements the policy.
> >> >> Do you have the patch or some strace output that demonstrates the
> >> >> fadvise() call?
> >> >>
> >> >
> >> > glibc-2.3.90-ld.so-madvise.diff in
> >> > http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
> >> >
> >> > As I recall they just fadvise the filedescriptor before accessing it.
> >> >>
> >> >>
> >> >>>>
> >> >>>> I filed a glibc bug about this at
> >> >>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >> >>>> with his concern about wasting memory resources. What is the impact of
> >> >>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >> >>>> pressure? Does the kernel simply start ignoring these hints?
> >> >>>>
> >> >>>
> >> >>> It will throttle based on memory pressure.  In idle situations it will
> >> >>> eat your file cache, however, to satisfy the request.
> >> >>>
> >> >>> Now, the file cache should be much bigger than the amount of unneeded
> >> >>> pages you prefault with the hint over the whole library, so I guess the
> >> >>> benefit of prefaulting the right pages outweighs the downside of evicting
> >> >>> some cache for unused library pages.
> >> >>>
> >> >>> Still, it's a workaround for deficits in the demand-paging/readahead
> >> >>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
> >> >>>
> >> >>
> >> >> Program page faults are inherently random, so the straightforward
> >> >> solution would be to increase the mmap read-around size (for desktops
> >> >> with reasonable large memory), rather than to improve program layout
> >> >> or readahead heuristics :)
> >> >>
> >> >
> >> > Program page faults may exhibit random behavior once they've started.
> >> >
> >> > During startup page-in pattern of over-engineered OO applications is very
> >> > predictable. Programs are laid out based on compilation units, which have no
> >> > relation to how they are executed. Another problem is that any large old
> >> > application will have lots of code that is either rarely executed or
> >> > completely dead. Random sprinkling of live code among mostly unneeded code
> >> > is a problem.
> >> > I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with
> >> > proper binary layout. Even if one lays out a program wrongly, the worst-case
> >> > pagein pattern will be pretty similar to what it is by default.
> >> >
> >> > But yes, I completely agree that it would be awesome to increase the
> >> > readahead size proportionally to available memory. It's a little silly to be
> >> > reading tens of megabytes in 128kb increments :)  You rock for trying to
> >> > modernize this.
> >>
> >> Hi, Wu and Taras.
> >>
> >> I have been watching this thread because I have some experience with
> >> reducing application startup latency on embedded systems.
> >>
> >> I think increasing the readahead size is sometimes not good for embedded
> >> systems. Many of them use NAND flash for storage together with a
> >> compressed file system.
> >> On NAND, as you know, the penalty for random reads is much smaller than
> >> on an HDD.
> >> On a compressed file system, the higher the compression, the more
> >> readahead delays startup (a big block has to be read and decompressed).
> >> We had to disable readahead for code pages by hacking the kernel.
> >> That makes the application slower as time goes by, but at that time we
> >> decided that latency was more important than overall performance for
> >> our application.
> >>
> >> Of course, the result depends on which file system and compression
> >> ratio are used.
> >> So I think increasing the readahead size might not always be good.
> >>
> >> Please consider embedded systems too when you plan to tweak
> >> readahead. :)
> >
> > Minchan, glad to know that you have experience with embedded Linux.
> >
> > While increasing the general readahead size from 128KB to 512KB, I
> > also added a limit for mmap read-around: if the system memory size is
> > less than X MB, then limit the read-around size to X KB. For example,
> > do only 128KB read-around for a 128MB embedded box, and 32KB
> > read-around for a 32MB box.
> >
> > Do you think it is a reasonable safeguard? Patch attached.
> 
> Thanks for the reply, Wu.
> 
> I haven't looked at your attachment yet, because memory size was not
> the issue in my case.

In general, the more memory there is, the less we care about possible
readahead misses :)

> It was the only application on the system, and the first main
> application to start. That means we had enough memory.
> 
> I guess there are many embedded systems like that.
> At that time I could have disabled readahead completely with
> read_ahead_kb, but I didn't want that, because I didn't want to disable
> readahead for regular file I/O and for the program's data section.
> So, for lack of a better option, I hacked the kernel to disable
> readahead for the code section only.

I would like to auto-tune the readahead size based on estimates of the
device's IO throughput and latency; however, that's not easy.

Other than that, if we can assert "this class of devices won't benefit
from large readahead", then we can do some static assignment.

Thanks,
Fengguang


* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  7:47             ` Wu Fengguang
@ 2010-04-07  8:06               ` Minchan Kim
  2010-04-07  8:13                 ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2010-04-07  8:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Wed, Apr 7, 2010 at 4:47 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Apr 07, 2010 at 03:33:52PM +0800, Minchan Kim wrote:
>> On Wed, Apr 7, 2010 at 4:14 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Wed, Apr 07, 2010 at 12:06:07PM +0800, Minchan Kim wrote:
>> >> On Wed, Apr 7, 2010 at 11:54 AM, Taras Glek <tglek@mozilla.com> wrote:
>> >> > On 04/06/2010 07:24 PM, Wu Fengguang wrote:
>> >> >>
>> >> >> Hi Taras,
>> >> >>
>> >> >> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>> >> >>
>> >> >>>
>> >> >>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>> >> >>>
>> >> >>>>
>> >> >>>> Hello,
>> >> >>>> I am working on improving Mozilla startup times. It turns out that page
>> >> >>>> faults(caused by lack of cooperation between user/kernelspace) are the
>> >> >>>> main cause of slow startup. I need some insights from someone who
>> >> >>>> understands linux vm behavior.
>> >> >>>>
>> >> >>
>> >> >> How about improve Fedora (and other distros) to preload Mozilla (and
>> >> >> other apps the user run at the previous boot) with fadvise() at boot
>> >> >> time? This sounds like the most reasonable option.
>> >> >>
>> >> >
>> >> > That's a slightly different usecase. I'd rather have all large apps startup
>> >> > as efficiently as possible without any hacks. Though until we get there,
>> >> > we'll be using all of the hacks we can.
>> >> >>
>> >> >> As for the kernel readahead, I have a patchset to increase default
>> >> >> mmap read-around size from 128kb to 512kb (except for small memory
>> >> >> systems).  This should help your case as well.
>> >> >>
>> >> >
>> >> > Yes. Is the current readahead really doing read-around(ie does it read pages
>> >> > before the one being faulted)? From what I've seen, having the dynamic
>> >> > linker read binary sections backwards causes faults.
>> >> > http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>> >> >>
>> >> >>
>> >> >>>>
>> >> >>>> Current Situation:
>> >> >>>> The dynamic linker mmap()s  executable and data sections of our
>> >> >>>> executable but it doesn't call madvise().
>> >> >>>> By default page faults trigger 131072byte reads. To make matters worse,
>> >> >>>> the compile-time linker + gcc lay out code in a manner that does not
>> >> >>>> correspond to how the resulting executable will be executed(ie the
>> >> >>>> layout is basically random). This means that during startup 15-40mb
>> >> >>>> binaries are read in basically random fashion. Even if one orders the
>> >> >>>> binary optimally, throughput is still suboptimal due to the puny
>> >> >>>> readahead.
>> >> >>>>
>> >> >>>> IO Hints:
>> >> >>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>> >> >>>> reads and a binary that tends to take 110 page faults(ie program stops
>> >> >>>> execution and waits for disk) can be reduced down to 6. This has the
>> >> >>>> potential to double application startup of large apps without any clear
>> >> >>>> downsides.
>> >> >>>>
>> >> >>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>> >> >>>> dynamic libraries(not sure why they switched from doing madvise
>> >> >>>> before).
>> >> >>>>
>> >> >>
>> >> >> This is interesting. I wonder how SuSE implements the policy.
>> >> >> Do you have the patch or some strace output that demonstrates the
>> >> >> fadvise() call?
>> >> >>
>> >> >
>> >> > glibc-2.3.90-ld.so-madvise.diff in
>> >> > http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
>> >> >
>> >> > As I recall they just fadvise the filedescriptor before accessing it.
>> >> >>
>> >> >>
>> >> >>>>
>> >> >>>> I filed a glibc bug about this at
>> >> >>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>> >> >>>> with his concern about wasting memory resources. What is the impact of
>> >> >>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>> >> >>>> pressure? Does the kernel simply start ignoring these hints?
>> >> >>>>
>> >> >>>
>> >> >>> It will throttle based on memory pressure.  In idle situations it will
>> >> >>> eat your file cache, however, to satisfy the request.
>> >> >>>
>> >> >>> Now, the file cache should be much bigger than the amount of unneeded
>> >> >>> pages you prefault with the hint over the whole library, so I guess the
>> >> >>> benefit of prefaulting the right pages outweighs the downside of evicting
>> >> >>> some cache for unused library pages.
>> >> >>>
>> >> >>> Still, it's a workaround for deficits in the demand-paging/readahead
>> >> >>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>> >> >>>
>> >> >>
>> >> >> Program page faults are inherently random, so the straightforward
>> >> >> solution would be to increase the mmap read-around size (for desktops
>> >> >> with reasonable large memory), rather than to improve program layout
>> >> >> or readahead heuristics :)
>> >> >>
>> >> >
>> >> > Program page faults may exhibit random behavior once they've started.
>> >> >
>> >> > During startup page-in pattern of over-engineered OO applications is very
>> >> > predictable. Programs are laid out based on compilation units, which have no
>> >> > relation to how they are executed. Another problem is that any large old
>> >> > application will have lots of code that is either rarely executed or
>> >> > completely dead. Random sprinkling of live code among mostly unneeded code
>> >> > is a problem.
>> >> > I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with
>> >> > proper binary layout. Even if one lays out a program wrongly, the worst-case
>> >> > pagein pattern will be pretty similar to what it is by default.
>> >> >
>> >> > But yes, I completely agree that it would be awesome to increase the
>> >> > readahead size proportionally to available memory. It's a little silly to be
>> >> > reading tens of megabytes in 128kb increments :)  You rock for trying to
>> >> > modernize this.
>> >>
>> >> Hi, Wu and Taras.
>> >>
>> >> I have been watched at this thread.
>> >> That's because I had a experience on reducing startup latency of application
>> >> in embedded system.
>> >>
>> >> I think sometime increasing of readahead size wouldn't good in embedded.
>> >> Many of embedded system has nand as storage and compression file system.
>> >> About nand, as you know, random read effect isn't rather big than hdd.
>> >> About compression file system, as one has a big compression,
>> >> it would make startup late(big block read and decompression).
>> >> We had to disable readahead of code page with kernel hacking.
>> >> And it would make application slow as time goes by.
>> >> But at that time we thought latency is more important than performance
>> >> on our application.
>> >>
>> >> Of course, it is different whenever what is file system and
>> >> compression ratio we use .
>> >> So I think increasing of readahead size might always be not good.
>> >>
>> >> Please, consider embedded system when you have a plan to tweak
>> >> readahead, too. :)
>> >
>> > Minchan, glad to know that you have experiences on embedded Linux.
>> >
>> > While increasing the general readahead size from 128kb to 512kb, I
>> > also added a limit for mmap read-around: if system memory size is less
>> > than X MB, then limit read-around size to X KB. For example, do only
>> > 128KB read-around for a 128MB embedded box, and 32KB ra for 32MB box.
>> >
>> > Do you think it a reasonable safety guard? Patch attached.
>>
>> Thanks for reply, Wu.
>>
>> I didn't have looked at the your attachment.
>> That's because it's not matter of memory size in my case.
>
> In general, the more memory size, the less we care about the possible
> readahead misses :)
>
>> It was alone application on system and it was first main application of system.
>> It means we had a enough memory.
>>
>> I guess there are such many of embedded system.
>> At that time, although I could disable readahead totally with read_ahead_kb,
>> I didn't want it. That's because I don't want to disable readahead on
>> the file I/O
>> and data section of program. So at a loss, I hacked kernel to disable
>> readahead of
>> only code section.
>
> I would like to auto tune readahead size based on the device's
> IO throughput and latency estimation, however that's not easy..

Indeed.

> Other than that, if we can assert "this class of devices won't benefit
> from large readahead", then we can do some static assignment.

A few months ago, I saw your patch about enhancing readahead.
At that time, many people tested several sizes of USB sticks and SSDs,
which consist of NAND devices.
The results were good as long as readahead stayed below some crossover point.
So I think we need readahead for file I/O on non-rotational devices, too.

But startup latency is more important than file I/O performance on some machines.
According to our analysis at that time, readahead of application code slowed down startup.
In addition, during bootup, the cache hit ratio was very small.

So I hoped we could disable readahead for the code section only (i.e., roughly
the exec vma's filemap fault). :)

I don't want you to solve this problem right now.
I just want you to understand some of the problems embedded systems have,
for when you enhance readahead in the future.  :)
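
For reference, the read_ahead_kb knob mentioned in the quoted text is a
per-device setting under /sys/block/<dev>/queue/ on Linux, next to the
rotational flag. The sketch below only reads the two values; the helper
name is hypothetical, and "sda" in the usage note is a placeholder
device name.

```python
from pathlib import Path


def queue_settings(queue_dir):
    """Read (read_ahead_kb, rotational) from a block device queue directory.

    queue_dir is expected to look like /sys/block/<dev>/queue.
    """
    queue = Path(queue_dir)
    read_ahead_kb = int((queue / "read_ahead_kb").read_text().strip())
    # The kernel reports "1" for rotating media and "0" for flash-like devices.
    rotational = (queue / "rotational").read_text().strip() == "1"
    return read_ahead_kb, rotational
```

Something like queue_settings("/sys/block/sda/queue") would report the
current values. Note that this knob applies to the whole device, so it
cannot express the "code section only" policy described above.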

> Thanks,
> Fengguang
>



-- 
Kind regards,
Minchan Kim

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  8:06               ` Minchan Kim
@ 2010-04-07  8:13                 ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2010-04-07  8:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

Minchan,

> A few month ago, I saw your patch about enhancing readahead.
> At that time, many guys tested several size of USB and SSD which are
> consist of nand device.
> The result is good if we does readahead untile some crossover point.
> So I think we need readahead about file I/O in non-rotation device, too.
> 
> But startup latency is important than file I/O performance in some machine.
> With analysis at that time, code readahead of application affected slow startup.
> In addition, during bootup, cache hit ratio was very small.
> 
> So I hoped we can disable readahead just only code section(ie, roughly
> exec vma's filemap fault). :)
> 
> I don't want you to solve this problem right now.
> Just let you understand embedded system's some problem
> for enhancing readahead in future.  :)

Yeah, I had never heard of such a demand. Definitely good to know!

Thanks,
Fengguang

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  2:54     ` Taras Glek
  2010-04-07  4:06       ` Minchan Kim
@ 2010-04-07  7:38       ` Wu Fengguang
  2010-04-08 17:44         ` Taras Glek
  1 sibling, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2010-04-07  7:38 UTC (permalink / raw)
  To: Taras Glek
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed, Apr 07, 2010 at 10:54:58AM +0800, Taras Glek wrote:
> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> > Hi Taras,
> >
> > On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> >    
> >> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >>      
> >>> Hello,
> >>> I am working on improving Mozilla startup times. It turns out that page
> >>> faults(caused by lack of cooperation between user/kernelspace) are the
> >>> main cause of slow startup. I need some insights from someone who
> >>> understands linux vm behavior.
> >>>        
> > How about improve Fedora (and other distros) to preload Mozilla (and
> > other apps the user run at the previous boot) with fadvise() at boot
> > time? This sounds like the most reasonable option.
> >    
> That's a slightly different usecase. I'd rather have all large apps 
> startup as efficiently as possible without any hacks. Though until we 
> get there, we'll be using all of the hacks we can.

Boot time user space readahead can do better than kernel heuristic
readahead in several ways:

- it can collect better knowledge of which files/pages will be used,
  which leads to a high readahead hit ratio and less cache consumption

- it can submit readahead requests for many files in parallel,
  which enables queuing (elevator, NCQ etc.) optimizations

So I wouldn't call it a dirty hack :)
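
A minimal sketch of such a user space preloader, assuming it is handed a
list of files recorded during a previous boot. The function name and the
way the list is obtained are hypothetical. The hint itself is the
standard posix_fadvise(POSIX_FADV_WILLNEED) call, issued back to back
for every file without waiting for the data, so the block layer is free
to queue and reorder the requests.

```python
import os


def prefetch_files(paths):
    """Ask the kernel to start reading each file into the page cache.

    Returns the list of paths for which the hint was issued.
    """
    hinted = []
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # file vanished or is unreadable; skip it
        try:
            # offset=0, len=0 means "from the start to the end of the file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            hinted.append(path)
        except OSError:
            pass  # the hint is advisory; ignore filesystems that reject it
        finally:
            os.close(fd)
    return hinted
```

Hinting whole files is the crudest form of this. A real preloader would
presumably record page ranges and pass them as offset/len instead.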

> > As for the kernel readahead, I have a patchset to increase default
> > mmap read-around size from 128kb to 512kb (except for small memory
> > systems).  This should help your case as well.
> >    
> Yes. Is the current readahead really doing read-around(ie does it read 
> pages before the one being faulted)? From what I've seen, having the 

Sure. It will do read-around from current fault offset - 64kb to +64kb.

> dynamic linker read binary sections backwards causes faults.
> http://sourceware.org/bugzilla/show_bug.cgi?id=11447

There is too much data in
http://people.mozilla.com/~tglek/startup/systemtap_graphs/ld_bug/report.txt
Can you show me the relevant lines? (I wonder if I could ever find such lines..)

> >    
> >>> Current Situation:
> >>> The dynamic linker mmap()s  executable and data sections of our
> >>> executable but it doesn't call madvise().
> >>> By default page faults trigger 131072byte reads. To make matters worse,
> >>> the compile-time linker + gcc lay out code in a manner that does not
> >>> correspond to how the resulting executable will be executed(ie the
> >>> layout is basically random). This means that during startup 15-40mb
> >>> binaries are read in basically random fashion. Even if one orders the
> >>> binary optimally, throughput is still suboptimal due to the puny readahead.
> >>>
> >>> IO Hints:
> >>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>> reads and a binary that tends to take 110 page faults(ie program stops
> >>> execution and waits for disk) can be reduced down to 6. This has the
> >>> potential to double application startup of large apps without any clear
> >>> downsides.
> >>>
> >>> Suse ships their glibc with a dynamic linker patch to fadvise()
> >>> dynamic libraries(not sure why they switched from doing madvise
> >>> before).
> >>>        
> > This is interesting. I wonder how SuSE implements the policy.
> > Do you have the patch or some strace output that demonstrates the
> > fadvise() call?
> >    
> glibc-2.3.90-ld.so-madvise.diff in 
> http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0: 

550 Can't open
/pub/linux/distributions/suse/pub/suse/update/10.1/rpm/src/glibc-2.4-31.12.3.src.rpm:
No such file or directory

OK I give up.

> As I recall they just fadvise the filedescriptor before accessing it.

Obviously this is a bit risky for small memory systems..

> >>> I filed a glibc bug about this at
> >>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>> with his concern about wasting memory resources. What is the impact of
> >>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>> pressure? Does the kernel simply start ignoring these hints?
> >>>        
> >> It will throttle based on memory pressure.  In idle situations it will
> >> eat your file cache, however, to satisfy the request.
> >>
> >> Now, the file cache should be much bigger than the amount of unneeded
> >> pages you prefault with the hint over the whole library, so I guess the
> >> benefit of prefaulting the right pages outweighs the downside of evicting
> >> some cache for unused library pages.
> >>
> >> Still, it's a workaround for deficits in the demand-paging/readahead
> >> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
> >>      
> > Program page faults are inherently random, so the straightforward
> > solution would be to increase the mmap read-around size (for desktops
> > with reasonable large memory), rather than to improve program layout
> > or readahead heuristics :)
> >    
> Program page faults may exhibit random behavior once they've started.

Right.

> During startup page-in pattern of over-engineered OO applications is 
> very predictable. Programs are laid out based on compilation units, 
> which have no relation to how they are executed. Another problem is that 
> any large old application will have lots of code that is either rarely 
> executed or completely dead. Random sprinkling of live code among mostly 
> unneeded code is a problem.

Agreed.

> I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB 
> with proper binary layout. Even if one lays out a program wrongly, the 
> worst-case pagein pattern will be pretty similar to what it is by default.

That's great. When will we enjoy your research fruits? :)

> But yes, I completely agree that it would be awesome to increase the 
> readahead size proportionally to available memory. It's a little silly 
> to be reading tens of megabytes in 128kb increments :)  You rock for 
> trying to modernize this.

Thank you. I guess the 128kb is more than ten years old..

Cheers,
Fengguang

> >    
> >>> Also, once an application is started is it reasonable to keep it
> >>> madvise(WILLNEED)ed or should the madvise flags be reset?
> >>>        
> >> It's a one-time operation that starts immediate readahead, no permanent
> >> changes are done.
> >>      
> > Right. The kernel regard WILLNEED as a readahead request from userspace.
> >
> >    
> >>> Perhaps the kernel could monitor the page-in patterns to increase the
> >>> readahead sizes? This may already happen, I've noticed that a handful of
> >>> pagefaults trigger>  131072bytes of IO, perhaps this just needs tweaking.
> >>>        
> >> CCd the man :-)
> >>      
> > Thank you :)
> >
> > Cheers,
> > Fengguang
> >    
> 
> Cheers,
> Taras

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-07  7:38       ` Wu Fengguang
@ 2010-04-08 17:44         ` Taras Glek
  2010-04-12  2:27           ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Taras Glek @ 2010-04-08 17:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 04/07/2010 12:38 AM, Wu Fengguang wrote:
> On Wed, Apr 07, 2010 at 10:54:58AM +0800, Taras Glek wrote:
>    
>> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
>>      
>>> Hi Taras,
>>>
>>> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>>>
>>>        
>>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>>>
>>>>          
>>>>> Hello,
>>>>> I am working on improving Mozilla startup times. It turns out that page
>>>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>>>> main cause of slow startup. I need some insights from someone who
>>>>> understands linux vm behavior.
>>>>>
>>>>>            
>>> How about improve Fedora (and other distros) to preload Mozilla (and
>>> other apps the user run at the previous boot) with fadvise() at boot
>>> time? This sounds like the most reasonable option.
>>>
>>>        
>> That's a slightly different usecase. I'd rather have all large apps
>> startup as efficiently as possible without any hacks. Though until we
>> get there, we'll be using all of the hacks we can.
>>      
> Boot time user space readahead can do better than kernel heuristic
> readahead in several ways:
>
> - it can collect better knowledge on which files/pages will be used
>    which lead to high readahead hit ratio and less cache consumption
>
> - it can submit readahead requests for many files in parallel,
>    which enables queuing (elevator, NCQ etc.) optimizations
>
> So I won't call it dirty hack :)
>
>    
Fair enough.
>>> As for the kernel readahead, I have a patchset to increase default
>>> mmap read-around size from 128kb to 512kb (except for small memory
>>> systems).  This should help your case as well.
>>>
>>>        
>> Yes. Is the current readahead really doing read-around(ie does it read
>> pages before the one being faulted)? From what I've seen, having the
>>      
> Sure. It will do read-around from current fault offset - 64kb to +64kb.
>    
That's excellent.
>    
>> dynamic linker read binary sections backwards causes faults.
>> http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>>      
> There are too many data in
> http://people.mozilla.com/~tglek/startup/systemtap_graphs/ld_bug/report.txt
> Can you show me the relevant lines? (wondering if I can ever find such lines..)
>    
The first part of the file lists the sections in a file and their hex
offset+size.

Lines like "0 512 offset(#1)" mean a read of 512 bytes at position 0.
Incidentally, this first read comes from vfs_read, so the log doesn't
take readahead into account (unlike the other reads, which are caused
by mmap page faults).

So
15310848 131072 offset(#2)=====================
eaa73c 1523c .bss
eaa73c 19d1e .comment

15142912 131072 offset(#3)=====================
e810d4 200 .dynamic
e812d4 470 .got
e81744 3b50 .got.plt
e852a0 2549c .data

This shows 2 reads where the dynamic linker first seeks to the end of the
file (to zero out .bss, causing IO via COW) and then backtracks to
read in .dynamic. However, you are right: all of the backtracking reads
are more than 64K back.
Thanks for explaining that. I am guessing your change to boost
read-around will fix this issue nicely for Firefox.
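
The two reads above can be checked against the read-around window with a
little arithmetic. This is a sketch under stated assumptions: 4KB pages,
section offsets taken as file offsets, and a window centred on the
faulting page (fault - size/2 to fault + size/2), which is how the
"- 64kb to +64kb" description reads. The helper names are made up for
the example.

```python
KB = 1024
PAGE_SIZE = 4 * KB  # assumed page size


def page_of(offset):
    """Round a file offset down to the start of its page."""
    return offset & ~(PAGE_SIZE - 1)


def readaround_window(fault_offset, size_kb):
    """Return (start, end) byte offsets of a read-around centred on the fault."""
    half = size_kb * KB // 2
    start = max(0, fault_offset - half)
    return start, start + size_kb * KB


def covered(window, offset):
    """True if offset falls inside the half-open window [start, end)."""
    start, end = window
    return start <= offset < end


bss_fault = page_of(0xEAA73C)      # .bss, read #2 in the log
dynamic_fault = page_of(0xE810D4)  # .dynamic, read #3 in the log

# Distance the linker jumps back between the two faults, in KB.
backtrack_kb = (bss_fault - dynamic_fault) // KB
```

With a 128KB window the .bss fault reads from offset 15310848, which
matches offset(#2) in the log, and the .dynamic page lies 164KB before
the faulting page, so it is outside the window and takes its own fault.
With a 512KB window centred the same way, the .dynamic page would
already be covered by the first read.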

>>>
>>>        
>>>>> Current Situation:
>>>>> The dynamic linker mmap()s  executable and data sections of our
>>>>> executable but it doesn't call madvise().
>>>>> By default page faults trigger 131072byte reads. To make matters worse,
>>>>> the compile-time linker + gcc lay out code in a manner that does not
>>>>> correspond to how the resulting executable will be executed(ie the
>>>>> layout is basically random). This means that during startup 15-40mb
>>>>> binaries are read in basically random fashion. Even if one orders the
>>>>> binary optimally, throughput is still suboptimal due to the puny readahead.
>>>>>
>>>>> IO Hints:
>>>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>>>> reads and a binary that tends to take 110 page faults(ie program stops
>>>>> execution and waits for disk) can be reduced down to 6. This has the
>>>>> potential to double application startup of large apps without any clear
>>>>> downsides.
>>>>>
>>>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>>>>> dynamic libraries(not sure why they switched from doing madvise
>>>>> before).
>>>>>
>>>>>            
>>> This is interesting. I wonder how SuSE implements the policy.
>>> Do you have the patch or some strace output that demonstrates the
>>> fadvise() call?
>>>
>>>        
>> glibc-2.3.90-ld.so-madvise.diff in
>> http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
>>      
> 550 Can't open
> /pub/linux/distributions/suse/pub/suse/update/10.1/rpm/src/glibc-2.4-31.12.3.src.rpm:
> No such file or directory
>
> OK I give up.
>
>    
>> As I recall they just fadvise the filedescriptor before accessing it.
>>      
> Obviously this is a bit risky for small memory systems..
>
>    
>>>>> I filed a glibc bug about this at
>>>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>>>> with his concern about wasting memory resources. What is the impact of
>>>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>>>> pressure? Does the kernel simply start ignoring these hints?
>>>>>
>>>>>            
>>>> It will throttle based on memory pressure.  In idle situations it will
>>>> eat your file cache, however, to satisfy the request.
>>>>
>>>> Now, the file cache should be much bigger than the amount of unneeded
>>>> pages you prefault with the hint over the whole library, so I guess the
>>>> benefit of prefaulting the right pages outweighs the downside of evicting
>>>> some cache for unused library pages.
>>>>
>>>> Still, it's a workaround for deficits in the demand-paging/readahead
>>>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
>>>>
>>>>          
>>> Program page faults are inherently random, so the straightforward
>>> solution would be to increase the mmap read-around size (for desktops
>>> with reasonable large memory), rather than to improve program layout
>>> or readahead heuristics :)
>>>
>>>        
>> Program page faults may exhibit random behavior once they've started.
>>      
> Right.
>
>    
>> During startup page-in pattern of over-engineered OO applications is
>> very predictable. Programs are laid out based on compilation units,
>> which have no relation to how they are executed. Another problem is that
>> any large old application will have lots of code that is either rarely
>> executed or completely dead. Random sprinkling of live code among mostly
>> unneeded code is a problem.
>>      
> Agreed.
>
>    
>> I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB
>> with proper binary layout. Even if one lays out a program wrongly, the
>> worst-case pagein pattern will be pretty similar to what it is by default.
>>      
> That's great. When will we enjoy your research fruits? :)
>    
Released it yesterday. Hopefully other bloated binaries will benefit 
from this too.

http://blog.mozilla.com/tglek/2010/04/07/icegrind-valgrind-plugin-for-optimizing-cold-startup/

Thanks a lot Wu, I feel I understand the kernel side of what's happening 
now.

Taras

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-08 17:44         ` Taras Glek
@ 2010-04-12  2:27           ` Wu Fengguang
  2010-04-12  3:25             ` Minchan Kim
  2010-04-12  4:43             ` drepper
  0 siblings, 2 replies; 35+ messages in thread
From: Wu Fengguang @ 2010-04-12  2:27 UTC (permalink / raw)
  To: Taras Glek
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Apr 09, 2010 at 01:44:41AM +0800, Taras Glek wrote:
> On 04/07/2010 12:38 AM, Wu Fengguang wrote:
> > On Wed, Apr 07, 2010 at 10:54:58AM +0800, Taras Glek wrote:
> >    
> >> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> >>      
> >>> Hi Taras,
> >>>
> >>> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> >>>
> >>>        
> >>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >>>>
> >>>>          
> >>>>> Hello,
> >>>>> I am working on improving Mozilla startup times. It turns out that page
> >>>>> faults(caused by lack of cooperation between user/kernelspace) are the
> >>>>> main cause of slow startup. I need some insights from someone who
> >>>>> understands linux vm behavior.
> >>>>>
> >>>>>            
> >>> How about improve Fedora (and other distros) to preload Mozilla (and
> >>> other apps the user run at the previous boot) with fadvise() at boot
> >>> time? This sounds like the most reasonable option.
> >>>
> >>>        
> >> That's a slightly different usecase. I'd rather have all large apps
> >> startup as efficiently as possible without any hacks. Though until we
> >> get there, we'll be using all of the hacks we can.
> >>      
> > Boot time user space readahead can do better than kernel heuristic
> > readahead in several ways:
> >
> > - it can collect better knowledge on which files/pages will be used
> >    which lead to high readahead hit ratio and less cache consumption
> >
> > - it can submit readahead requests for many files in parallel,
> >    which enables queuing (elevator, NCQ etc.) optimizations
> >
> > So I won't call it dirty hack :)
> >
> >    
> Fair enough.
> >>> As for the kernel readahead, I have a patchset to increase default
> >>> mmap read-around size from 128kb to 512kb (except for small memory
> >>> systems).  This should help your case as well.
> >>>
> >>>        
> >> Yes. Is the current readahead really doing read-around(ie does it read
> >> pages before the one being faulted)? From what I've seen, having the
> >>      
> > Sure. It will do read-around from current fault offset - 64kb to +64kb.
> >    
> That's excellent.
> >    
> >> dynamic linker read binary sections backwards causes faults.
> >> http://sourceware.org/bugzilla/show_bug.cgi?id=11447
> >>      
> > There are too many data in
> > http://people.mozilla.com/~tglek/startup/systemtap_graphs/ld_bug/report.txt
> > Can you show me the relevant lines? (wondering if I can ever find such lines..)
> >    
> The first part of the file lists sections in a file and their hex 
> offset+size.
 
> lines like 0 512 offset(#1) mean a read at position 0 of 512 bytes. 
> Incidentally this first read is coming from vfs_read, so the log doesn't 
> take account readahead (unlike the other reads caused by mmap page faults).

Yes, every binary/library starts with this 512-byte read.  It is requested
by ld.so/ld-linux.so, and will trigger a 4-page readahead. That is not
a good readahead size. I wonder if ld.so can switch to an mmap read for
the first read, in order to trigger a larger 128kb readahead. However,
this will introduce a little overhead on VMA operations.
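
To illustrate the idea only (this is not ld.so code), here is a sketch
that fetches the first 512 bytes of a file through a read-only mapping
instead of read(). The helper name is hypothetical. Whether the access
ends up triggering the 128kb mmap read-around depends on the kernel's
fault path as described above, so treat that part as an assumption.

```python
import mmap
import os


def read_header_via_mmap(path, length=512):
    """Return the first `length` bytes of a file through a read-only mapping."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapping:
            # Slicing copies out of the mapped pages, faulting them in if needed.
            return mapping[:length]
    finally:
        os.close(fd)
```

The extra mmap/munmap pair is the VMA overhead mentioned above. ld.so
maps the file shortly afterwards anyway, so the real cost would depend
on whether that mapping could be reused.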

> So
> 15310848 131072 offset(#2)=====================
> eaa73c 1523c .bss
> eaa73c 19d1e .comment
> 
> 15142912 131072 offset(#3)=====================
> e810d4 200 .dynamic
> e812d4 470 .got
> e81744 3b50 .got.plt
> e852a0 2549c .data
> 
> Shows 2 reads where the dynamic linker first seeks to the end of the 
> file(to zero out .bss, causing IO via COW) and the backtracks to
> read in .dynamic. However you are right, all of the backtracking reads 
> are over 64K.

This is an interesting finding to me. Thanks for the explanation :)

> Thanks for explaining that. I am guessing your change to boost 
> readaround will fix this issue nicely for firefox.

You are welcome.

> >>>
> >>>        
> >>>>> Current Situation:
> >>>>> The dynamic linker mmap()s  executable and data sections of our
> >>>>> executable but it doesn't call madvise().
> >>>>> By default page faults trigger 131072byte reads. To make matters worse,
> >>>>> the compile-time linker + gcc lay out code in a manner that does not
> >>>>> correspond to how the resulting executable will be executed(ie the
> >>>>> layout is basically random). This means that during startup 15-40mb
> >>>>> binaries are read in basically random fashion. Even if one orders the
> >>>>> binary optimally, throughput is still suboptimal due to the puny readahead.
> >>>>>
> >>>>> IO Hints:
> >>>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>>>> reads and a binary that tends to take 110 page faults(ie program stops
> >>>>> execution and waits for disk) can be reduced down to 6. This has the
> >>>>> potential to double application startup of large apps without any clear
> >>>>> downsides.
> >>>>>
> >>>>> Suse ships their glibc with a dynamic linker patch to fadvise()
> >>>>> dynamic libraries(not sure why they switched from doing madvise
> >>>>> before).
> >>>>>
> >>>>>            
> >>> This is interesting. I wonder how SuSE implements the policy.
> >>> Do you have the patch or some strace output that demonstrates the
> >>> fadvise() call?
> >>>
> >>>        
> >> glibc-2.3.90-ld.so-madvise.diff in
> >> http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:
> >>      
> > 550 Can't open
> > /pub/linux/distributions/suse/pub/suse/update/10.1/rpm/src/glibc-2.4-31.12.3.src.rpm:
> > No such file or directory
> >
> > OK I give up.
> >
> >    
> >> As I recall they just fadvise the filedescriptor before accessing it.
> >>      
> > Obviously this is a bit risky for small memory systems..
> >
> >    
> >>>>> I filed a glibc bug about this at
> >>>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>>>> with his concern about wasting memory resources. What is the impact of
> >>>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>>>> pressure? Does the kernel simply start ignoring these hints?
> >>>>>
> >>>>>            
> >>>> It will throttle based on memory pressure.  In idle situations it will
> >>>> eat your file cache, however, to satisfy the request.
> >>>>
> >>>> Now, the file cache should be much bigger than the amount of unneeded
> >>>> pages you prefault with the hint over the whole library, so I guess the
> >>>> benefit of prefaulting the right pages outweighs the downside of evicting
> >>>> some cache for unused library pages.
> >>>>
> >>>> Still, it's a workaround for deficits in the demand-paging/readahead
> >>>> heuristics and thus a bit ugly, I feel.  Maybe Wu can help.
> >>>>
> >>>>          
> >>> Program page faults are inherently random, so the straightforward
> >>> solution would be to increase the mmap read-around size (for desktops
> >>> with reasonable large memory), rather than to improve program layout
> >>> or readahead heuristics :)
> >>>
> >>>        
> >> Program page faults may exhibit random behavior once they've started.
> >>      
> > Right.
> >
> >    
> >> During startup page-in pattern of over-engineered OO applications is
> >> very predictable. Programs are laid out based on compilation units,
> >> which have no relation to how they are executed. Another problem is that
> >> any large old application will have lots of code that is either rarely
> >> executed or completely dead. Random sprinkling of live code among mostly
> >> unneeded code is a problem.
> >>      
> > Agreed.
> >
> >    
> >> I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB
> >> with proper binary layout. Even if one lays out a program wrongly, the
> >> worst-case pagein pattern will be pretty similar to what it is by default.
> >>      
> > That's great. When will we enjoy your research fruits? :)
> >    
> Released it yesterday. Hopefully other bloated binaries will benefit 
> from this too.
> 
> http://blog.mozilla.com/tglek/2010/04/07/icegrind-valgrind-plugin-for-optimizing-cold-startup/

It sounds painful to produce the valgrind log; fortunately the end
user won't suffer.

Is it viable to turn on the "-ffunction-sections -fdata-sections"
options distribution-wide? If so, you may sell it to Fedora :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-12  2:27           ` Wu Fengguang
@ 2010-04-12  3:25             ` Minchan Kim
  2010-04-12  4:58               ` Wu Fengguang
  2010-04-12  4:43             ` drepper
  1 sibling, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2010-04-12  3:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

Hi, Wu.

On Mon, Apr 12, 2010 at 11:27 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Fri, Apr 09, 2010 at 01:44:41AM +0800, Taras Glek wrote:
>> On 04/07/2010 12:38 AM, Wu Fengguang wrote:
>> > On Wed, Apr 07, 2010 at 10:54:58AM +0800, Taras Glek wrote:
>> >
>> >> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
>> >>
>> >>> Hi Taras,
>> >>>
>> >>> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>> >>>
>> >>>
>> >>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>> >>>>
>> >>>>
>> >>>>> Hello,
>> >>>>> I am working on improving Mozilla startup times. It turns out that page
>> >>>>> faults(caused by lack of cooperation between user/kernelspace) are the
>> >>>>> main cause of slow startup. I need some insights from someone who
>> >>>>> understands linux vm behavior.
>> >>>>>
>> >>>>>
>> >>> How about improve Fedora (and other distros) to preload Mozilla (and
>> >>> other apps the user run at the previous boot) with fadvise() at boot
>> >>> time? This sounds like the most reasonable option.
>> >>>
>> >>>
>> >> That's a slightly different usecase. I'd rather have all large apps
>> >> startup as efficiently as possible without any hacks. Though until we
>> >> get there, we'll be using all of the hacks we can.
>> >>
>> > Boot time user space readahead can do better than kernel heuristic
>> > readahead in several ways:
>> >
>> > - it can collect better knowledge on which files/pages will be used
>> >    which lead to high readahead hit ratio and less cache consumption
>> >
>> > - it can submit readahead requests for many files in parallel,
>> >    which enables queuing (elevator, NCQ etc.) optimizations
>> >
>> > So I won't call it dirty hack :)
>> >
>> >
>> Fair enough.
>> >>> As for the kernel readahead, I have a patchset to increase default
>> >>> mmap read-around size from 128kb to 512kb (except for small memory
>> >>> systems).  This should help your case as well.
>> >>>
>> >>>
>> >> Yes. Is the current readahead really doing read-around(ie does it read
>> >> pages before the one being faulted)? From what I've seen, having the
>> >>
>> > Sure. It will do read-around from current fault offset - 64kb to +64kb.
>> >
>> That's excellent.
>> >
>> >> dynamic linker read binary sections backwards causes faults.
>> >> http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>> >>
>> > There are too many data in
>> > http://people.mozilla.com/~tglek/startup/systemtap_graphs/ld_bug/report.txt
>> > Can you show me the relevant lines? (wondering if I can ever find such lines..)
>> >
>> The first part of the file lists sections in a file and their hex
>> offset+size.
>
>> lines like 0 512 offset(#1) mean a read at position 0 of 512 bytes.
>> Incidentally this first read is coming from vfs_read, so the log doesn't
>> take account readahead (unlike the other reads caused by mmap page faults).
>
> Yes, every binary/library starts with this 512b read.  It is requested
> by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not
> good readahead. I wonder if ld.so can switch to mmap read for the
> first read, in order to trigger a larger 128kb readahead. However this
> will introduce a little overhead on VMA operations.

AFAIK, the kernel itself reads the first sector (ELF header and so on)
when the file is the executable being exec'ed.
In fs/exec.c:

prepare_binprm()
{
	...
	return kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
}

But the dynamic loader uses libc_read to read the header of each shared library.

So you may have a chance to increase the readahead size for the binary, but it
is hard to do for shared libraries. Many apps have lots of shared libraries, so
a binary-only solution would not gain much performance. :(

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-12  3:25             ` Minchan Kim
@ 2010-04-12  4:58               ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2010-04-12  4:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

Hi Minchan,

> > Yes, every binary/library starts with this 512b read.  It is requested
> > by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not
> > good readahead. I wonder if ld.so can switch to mmap read for the
> > first read, in order to trigger a larger 128kb readahead. However this
> > will introduce a little overhead on VMA operations.

Correction with data: on my system, ld.so does one 832-byte initial read for every library:

        $ strace true
        execve("/bin/true", ["true"], [/* 44 vars */]) = 0
        brk(0)                                  = 0x608000
        mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3ea0000
        access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
        mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e9e000
        access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
        open("/etc/ld.so.cache", O_RDONLY)      = 3
        fstat(3, {st_mode=S_IFREG|0644, st_size=140899, ...}) = 0
        mmap(NULL, 140899, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3b3e7b000
        close(3)                                = 0
        access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
        open("/lib/libc.so.6", O_RDONLY)        = 3
==>     read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\353\1\0\0\0\0\0@"..., 832) = 832
        fstat(3, {st_mode=S_IFREG|0755, st_size=1379752, ...}) = 0
        mmap(NULL, 3487784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb3b3931000
        mprotect(0x7fb3b3a7b000, 2097152, PROT_NONE) = 0
        mmap(0x7fb3b3c7b000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14a000) = 0x7fb3b3c7b000
        mmap(0x7fb3b3c80000, 18472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3c80000
        close(3)                                = 0
        mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e7a000
        mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e79000
        arch_prctl(ARCH_SET_FS, 0x7fb3b3e796f0) = 0
        mprotect(0x7fb3b3c7b000, 16384, PROT_READ) = 0
        mprotect(0x7fb3b3ea1000, 4096, PROT_READ) = 0
        munmap(0x7fb3b3e7b000, 140899)          = 0
        brk(0)                                  = 0x608000
        brk(0x629000)                           = 0x629000
        open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
        fstat(3, {st_mode=S_IFREG|0644, st_size=4332320, ...}) = 0
        mmap(NULL, 4332320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3b350f000
        close(3)                                = 0
        close(1)                                = 0
        close(2)                                = 0
        exit_group(0)                           = ?

> AFAIK, kernel reads first sector(ELF header and so one)  of binary in
> case of binary.
> in fs/exec.c,
> prepare_binprm()
> {
> ...
> return kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
> }

Thanks for pointing this out. Yes we may optimize the binary part by
adding a readahead call before the kernel_read().
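
For comparison, the same idea can be tried from userspace without touching
the kernel. This is only a minimal sketch, assuming a hint via
posix_fadvise(POSIX_FADV_WILLNEED) before the first small read; the function
name and the hint_len parameter are made up for illustration and are not
taken from glibc, the SuSE patch, or fs/exec.c.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not glibc or kernel code): ask the kernel to start
 * reading the first hint_len bytes of the file, then do the small header
 * read that would happen anyway.
 * Returns the number of header bytes read, or -1 on error.
 */
ssize_t read_header_with_hint(const char *path, void *buf, size_t len,
                              off_t hint_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advisory only: a failure here does not affect correctness, so the
     * error code that posix_fadvise() returns is deliberately ignored. */
    (void)posix_fadvise(fd, 0, hint_len, POSIX_FADV_WILLNEED);

    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}
```

Whether the hint helps depends on how much of the hinted range is really
used afterwards; on a small-memory system a large hint_len may just evict
other cache.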
 
> But dynamic loader uses libc_read for reading of shared library's one.
> 
> So you may have a chance to increase readahead size on binary but hard on shared
> library. Many of app have lots of shared library so the solution of
> only binary isn't big about
> performance. :(

Yeah, it won't be a big optimization..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-12  2:27           ` Wu Fengguang
  2010-04-12  3:25             ` Minchan Kim
@ 2010-04-12  4:43             ` drepper
  2010-04-12  4:46               ` Taras Glek
  2010-04-12  4:50               ` Wu Fengguang
  1 sibling, 2 replies; 35+ messages in thread
From: drepper @ 2010-04-12  4:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 936 bytes --]

On Sun, Apr 11, 2010 at 19:27, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Yes, every binary/library starts with this 512b read.  It is requested
> by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not
> good readahead. I wonder if ld.so can switch to mmap read for the
> first read, in order to trigger a larger 128kb readahead.

We first need to know the sizes of the segments and their location in the binary.  The binaries we use now are somewhat well laid out.  The read-only segment starts at offset 0 etc.  But this doesn't have to be the case.  The dynamic linker has to be generic.  Also, even if we start mapping at offset zero, how much to map?  The file might contain debug info which must not be mapped.  Therefore the first read loads enough of the headers to make all of the decisions.  Yes, we could do a mmap of one page instead of the read.  But that's more expensive in general, isn't it?
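
To make the "how much to map" question concrete, here is a rough sketch of
what can be learned from that first read. It assumes a 64-bit ELF file and
uses the 832-byte buffer size seen in strace output; the function name is
hypothetical and this is not the actual ld.so code, which handles many more
cases.

```c
#define _GNU_SOURCE
#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the start of an ELF64 file and return the file
 * offset just past the last PT_LOAD segment, i.e. the part of the file that
 * would be mapped. Anything after it (debug info, section headers) is not.
 * Returns -1 on error or if the file is not ELF64.
 */
int64_t elf_loadable_extent(const char *path)
{
    unsigned char buf[832];
    Elf64_Ehdr eh;
    int64_t end = -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, sizeof buf);
    if (n < (ssize_t)sizeof eh) {
        close(fd);
        return -1;
    }
    memcpy(&eh, buf, sizeof eh);

    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(Elf64_Phdr)) {
        close(fd);
        return -1;
    }

    for (unsigned i = 0; i < eh.e_phnum; i++) {
        uint64_t off = eh.e_phoff + (uint64_t)i * sizeof(Elf64_Phdr);
        Elf64_Phdr ph;

        if (off + sizeof ph <= (uint64_t)n) {
            /* Program header was covered by the initial read. */
            memcpy(&ph, buf + off, sizeof ph);
        } else if (pread(fd, &ph, sizeof ph, (off_t)off) !=
                   (ssize_t)sizeof ph) {
            /* Not covered, and the extra read failed. */
            end = -1;
            break;
        }

        if (ph.p_type == PT_LOAD) {
            int64_t seg_end = (int64_t)(ph.p_offset + ph.p_filesz);
            if (seg_end > end)
                end = seg_end;
        }
    }

    close(fd);
    return end;
}
```

The point of the sketch is only that the extent is not known until the
program headers have been parsed, which is why some read has to come first.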

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 272 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-12  4:43             ` drepper
@ 2010-04-12  4:46               ` Taras Glek
  2010-04-12  4:50               ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-12  4:46 UTC (permalink / raw)
  To: drepper
  Cc: Wu Fengguang, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On 04/11/2010 09:43 PM, drepper@gmail.com wrote:
> On Sun, Apr 11, 2010 at 19:27, Wu Fengguang <fengguang.wu@intel.com> 
> wrote:
>> Yes, every binary/library starts with this 512b read.  It is requested
>> by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not
>> good readahead. I wonder if ld.so can switch to mmap read for the
>> first read, in order to trigger a larger 128kb readahead.
>
> We first need to know the sizes of the segments and their location in 
> the binary.  The binaries we use now are somewhat well laid out.  The 
> read-only segment starts at offset 0 etc.  But this doesn't have to be 
> the case.  The dynamic linker has to be generic.  Also, even if we 
> start mapping at offset zero, how much to map?  The file might contain 
> debug info which must not be mapped.  Therefore the first read loads 
> enough of the headers to make all of the decisions.  Yes, we could do 
> a mmap of one page instead of the read.  But that's more expensive in 
> general, isn't it?
Can this not be cached for prelinked files? I think it is reasonable for 
the GNU dynamic linker to optimize for the optimal layout 
produced by GNU tools of the same generation.

Taras

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-12  4:43             ` drepper
  2010-04-12  4:46               ` Taras Glek
@ 2010-04-12  4:50               ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2010-04-12  4:50 UTC (permalink / raw)
  To: drepper@gmail.com
  Cc: Taras Glek, Johannes Weiner, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Mon, Apr 12, 2010 at 12:43:00PM +0800, drepper@gmail.com wrote:
> On Sun, Apr 11, 2010 at 19:27, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> Yes, every binary/library starts with this 512b read.  It is requested
>> by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not
>> good readahead. I wonder if ld.so can switch to mmap read for the
>> first read, in order to trigger a larger 128kb readahead.
>
> We first need to know the sizes of the segments and their location
> in the binary.  The binaries we use now are somewhat well laid out.
> The read-only segment starts at offset 0 etc.  But this doesn't have
> to be the case.  The dynamic linker has to be generic.  Also, even
> if we start mapping at offset zero, how much to map?  The file might
> contain debug info which must not be mapped.  Therefore the first
> read loads enough of the headers to make all of the decisions.  Yes,

I once read the ld code; it's more complex than I expected.

> we could do a mmap of one page instead of the read.  But that's more
> expensive in general, isn't it?

Right. Without considering IO, a simple read(512) is more efficient than
mmap()+read+munmap().
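
The difference in system call count can be seen by writing both variants
out. This is an illustrative sketch only; both function names are made up,
and the mmap variant assumes the header fits in one page.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Variant 1: three system calls (open, read, close). */
ssize_t header_via_read(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}

/* Variant 2: five system calls (open, fstat, mmap, close, munmap), plus a
 * page fault when the mapping is first touched. Copies at most one page. */
ssize_t header_via_mmap(const char *path, void *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    /* Never copy past the first page or past end of file. */
    if (len > page)
        len = page;
    if ((off_t)len > st.st_size)
        len = (size_t)st.st_size;
    if (len == 0) {
        close(fd);
        return 0;
    }

    void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    memcpy(buf, p, len);   /* first touch takes the page fault */
    munmap(p, page);
    return (ssize_t)len;
}
```

Both return the same bytes; the mmap variant only pays off if the larger
read-around it triggers is actually wanted.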

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
                   ` (2 preceding siblings ...)
  2010-04-06  9:51 ` Johannes Weiner
@ 2010-04-12  8:50 ` Andi Kleen
  2010-04-15 22:53 ` Andrew Morton
  4 siblings, 0 replies; 35+ messages in thread
From: Andi Kleen @ 2010-04-12  8:50 UTC (permalink / raw)
  To: Taras Glek; +Cc: linux-kernel

Taras Glek <tglek@mozilla.com> writes:

> Hello,
> I am working on improving Mozilla startup times. It turns out that
> page faults(caused by lack of cooperation between user/kernelspace)
> are the main cause of slow startup. I need some insights from someone
> who understands linux vm behavior.

I have an older patch to create dynamic bitmaps based on the last 
run and only prefetch those pages. 

It wasn't entirely a win for everything and didn't work for shared
libraries, but with some additional tuning the approach still has
potential I think, by combining memory saving with prefetching.
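
To illustrate the general idea (not the patch itself, which works inside the
kernel), a userspace approximation could record which pages of a mapping are
resident with mincore() and later prefetch only those with
madvise(MADV_WILLNEED). A minimal sketch, with hypothetical function names
and the bitmap kept in memory rather than saved between runs:

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Record which pages of [addr, addr + len) are currently resident.
 * addr must be page aligned. Returns a malloc'ed array with one byte per
 * page (bit 0 set means resident), or NULL on error. Caller frees it.
 */
unsigned char *record_resident(void *addr, size_t len, size_t *npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    *npages = (len + page - 1) / page;

    unsigned char *vec = malloc(*npages);
    if (vec != NULL && mincore(addr, len, vec) != 0) {
        free(vec);
        vec = NULL;
    }
    return vec;
}

/*
 * Issue one madvise(MADV_WILLNEED) per contiguous run of recorded pages.
 * Returns the number of runs for which the hint was accepted.
 */
size_t prefetch_recorded(void *addr, const unsigned char *vec, size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t runs = 0;

    for (size_t i = 0; i < npages; ) {
        if (!(vec[i] & 1)) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < npages && (vec[i] & 1))
            i++;
        if (madvise((char *)addr + start * page, (i - start) * page,
                    MADV_WILLNEED) == 0)
            runs++;
    }
    return runs;
}
```

A real implementation would have to store the bitmap per file and deal with
the file changing between runs, which this sketch does not attempt.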

ftp://firstfloor.org/pub/ak/pbitmap/INTRO
http://halobates.de/dp2.pdf

For your use case the algorithm would likely need some glibc support.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
                   ` (3 preceding siblings ...)
  2010-04-12  8:50 ` Andi Kleen
@ 2010-04-15 22:53 ` Andrew Morton
  2010-04-15 23:21   ` Zan Lynx
                     ` (2 more replies)
  4 siblings, 3 replies; 35+ messages in thread
From: Andrew Morton @ 2010-04-15 22:53 UTC (permalink / raw)
  To: Taras Glek; +Cc: linux-kernel

On Mon, 05 Apr 2010 15:43:02 -0700
Taras Glek <tglek@mozilla.com> wrote:

> To make matters worse, 
> the compile-time linker + gcc lay out code in a manner that does not 
> correspond to how the resulting executable will be executed(ie the 
> layout is basically random).

Yes, the linker scrambles the executable's block ordering.

This just isn't an interesting case.  World-wide, the number of people
who compile their own web browser and execute it from the file which ld
produced is, umm, seven.

So I'd suggest that you always copy the executable to a temp file and
mv it back before running any timing tests.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 22:53 ` Andrew Morton
@ 2010-04-15 23:21   ` Zan Lynx
  2010-04-15 20:42     ` Andrew Morton
  2010-04-16 11:41     ` Andi Kleen
  2010-04-16  0:41   ` Taras Glek
  2010-04-16 11:40   ` Andi Kleen
  2 siblings, 2 replies; 35+ messages in thread
From: Zan Lynx @ 2010-04-15 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Taras Glek, linux-kernel

On 4/15/10 4:53 PM, Andrew Morton wrote:

> This just isn't an interesting case.  World-wide, the number of people
> who compile their own web browser and execute it from the file which ld
> produced is, umm, seven.

Gentoo users? Linux From Scratch?

There are many more than 7 of us. Unless you are talking about the build 
environments always running some tool after ld which I am not aware of.


-- 
Zan Lynx
zlynx@acm.org

"Knowledge is Power.  Power Corrupts.  Study Hard.  Be Evil."

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 23:21   ` Zan Lynx
@ 2010-04-15 20:42     ` Andrew Morton
  2010-04-16 11:41     ` Andi Kleen
  1 sibling, 0 replies; 35+ messages in thread
From: Andrew Morton @ 2010-04-15 20:42 UTC (permalink / raw)
  To: Zan Lynx; +Cc: Taras Glek, linux-kernel

On Thu, 15 Apr 2010 17:21:25 -0600 Zan Lynx <zlynx@acm.org> wrote:

> On 4/15/10 4:53 PM, Andrew Morton wrote:
> 
> > This just isn't an interesting case.  World-wide, the number of people
> > who compile their own web browser and execute it from the file which ld
> > produced is, umm, seven.
> 
> Gentoo users? Linux From Scratch?
> 
> There are many more than 7 of us. Unless you are talking about the build 
> environments always running some tool after ld which I am not aware of.
> 

OK, eight then.

But I still don't think it's the case we should optimise for.  Not if
it impacts the common case even the slightest.  It'd be far far better
to change those distros to perform the very cheap, once-off step of
straightening out their executables (including shared libraries).


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 23:21   ` Zan Lynx
  2010-04-15 20:42     ` Andrew Morton
@ 2010-04-16 11:41     ` Andi Kleen
  2010-04-16 12:23       ` Theodore Tso
  2010-04-16 12:23       ` Theodore Tso
  1 sibling, 2 replies; 35+ messages in thread
From: Andi Kleen @ 2010-04-16 11:41 UTC (permalink / raw)
  To: Zan Lynx; +Cc: Andrew Morton, Taras Glek, linux-kernel

Zan Lynx <zlynx@acm.org> writes:

> On 4/15/10 4:53 PM, Andrew Morton wrote:
>
>> This just isn't an interesting case.  World-wide, the number of people
>> who compile their own web browser and execute it from the file which ld
>> produced is, umm, seven.
>
> Gentoo users? Linux From Scratch?

"make install" tends to copy. I am not aware of any Makefiles
that link directly to /usr/bin, and usually that wouldn't work
anyways because of permissions. copy fixes the problem.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-16 11:41     ` Andi Kleen
@ 2010-04-16 12:23       ` Theodore Tso
  2010-04-16 12:23       ` Theodore Tso
  1 sibling, 0 replies; 35+ messages in thread
From: Theodore Tso @ 2010-04-16 12:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Zan Lynx, Andrew Morton, Taras Glek, linux-kernel

On Apr 16, 2010, at 7:41 AM, Andi Kleen wrote:

> Zan Lynx <zlynx@acm.org> writes:
> 
>> On 4/15/10 4:53 PM, Andrew Morton wrote:
>> 
>>> This just isn't an interesting case.  World-wide, the number of people
>>> who compile their own web browser and execute it from the file which ld
>>> produced is, umm, seven.
>> 
>> Gentoo users? Linux From Scratch?
> 
> "make install" tends to copy. I am not aware of any Makefiles
> that link directly to /usr/bin, and usually that wouldn't work
> anyways because of permissions. copy fixes the problem.

... and those people who are executing the binary out of the build directory are probably running the regression test (i.e., "make; make check") and on most developer machines, if they're lucky they have enough memory that the executable will still be in their page cache.   :-)

This being said, on modern file systems (e.g., btrfs, ext4, xfs, et al.), delayed allocation should mostly hide this problem; and if not, and the linker can estimate in advance how big the resulting binary will be, it could be modified to use the fallocate(2) system call.
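
A sketch of what that could look like for any tool that knows its output
size up front. It uses posix_fallocate(3), the portable wrapper, rather than
calling fallocate(2) directly; the function name is hypothetical and nothing
here is taken from the linker sources.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) an output file and reserve
 * `size` bytes for it before any data is written, so the file system can
 * pick a contiguous region. Returns an open fd, or -1 with errno set.
 */
int create_preallocated(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number instead of setting errno. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }
    return fd;
}
```

If the estimate is too large the caller would need to ftruncate() the file
to its final size after writing.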

-- Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 22:53 ` Andrew Morton
  2010-04-15 23:21   ` Zan Lynx
@ 2010-04-16  0:41   ` Taras Glek
  2010-04-15 22:21     ` Andrew Morton
  2010-04-16 11:40   ` Andi Kleen
  2 siblings, 1 reply; 35+ messages in thread
From: Taras Glek @ 2010-04-16  0:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On 04/15/2010 03:53 PM, Andrew Morton wrote:
> On Mon, 05 Apr 2010 15:43:02 -0700
> Taras Glek<tglek@mozilla.com>  wrote:
>
>    
>> To make matters worse,
>> the compile-time linker + gcc lay out code in a manner that does not
>> correspond to how the resulting executable will be executed(ie the
>> layout is basically random).
>>      
> Yes, the linker scrambles the executable's block ordering.
>
> This just isn't an interesting case.  World-wide, the number of people
> who compile their own web browser and execute it from the file which ld
> produced is, umm, seven.
>    
I'm sorry that you don't find this interesting. I did not suggest that 
people compile their own browser to get a perfect layout. This is 
something that Mozilla can do when preparing builds and it's also 
something distributions can do. It just so happens that large parts of 
startup will be very similar for every single Firefox install, so we might as 
well lay out the binary accordingly.
> So I'd suggest that you always copy the executable to a temp file and
> mv it back before running any timing tests.
>    
You mean to get it into a cache or to hope to avoid fragmentation?  If 
you are suggesting this to avoid measuring the startup overhead of 
paging the binary in, I strongly disagree. It is the slowest part of 
firefox startup and needs to be addressed.

Taras



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-16  0:41   ` Taras Glek
@ 2010-04-15 22:21     ` Andrew Morton
  2010-04-16  2:37       ` Taras Glek
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2010-04-15 22:21 UTC (permalink / raw)
  To: Taras Glek; +Cc: linux-kernel

On Thu, 15 Apr 2010 17:41:48 -0700 Taras Glek <tglek@mozilla.com> wrote:

> On 04/15/2010 03:53 PM, Andrew Morton wrote:
> > On Mon, 05 Apr 2010 15:43:02 -0700
> > Taras Glek<tglek@mozilla.com>  wrote:
> >
> >    
> >> To make matters worse,
> >> the compile-time linker + gcc lay out code in a manner that does not
> >> correspond to how the resulting executable will be executed(ie the
> >> layout is basically random).
> >>      
> > Yes, the linker scrambles the executable's block ordering.
> >
> > This just isn't an interesting case.  World-wide, the number of people
> > who compile their own web browser and execute it from the file which ld
> > produced is, umm, seven.
> >    
>
> I'm sorry that you don't find this interesting.

It's not a case we should optimise for.  It's perfectly reasonable for
the kernel to assume that the executable is reasonably well-laid-out on
disk.  And if it _isn't_ well-laid-out then that should be fixed in
userspace, because for simple locality-of-reference reasons, that's
always going to produce the fastest result.

Plus it's the common case as well - the executable was copied from DVD
or over the network or whatever.

Plus it's so utterly trivial for people who compile-their-own to
straighten the file out - just run cp!  These people have gone and
screwed up their file layout - they should fix that, rather than trying
to get the kernel to perform the impossible for them.  See?

> I did not suggest that 
> people compile their own browser to get a perfect layout. This is 
> something that Mozilla can do when preparing builds and it's also 
> something distributions can do. It just so happens that large parts of 
> startup will be very similar for every single firefox install, might as 
> well layout the binary accordingly.
> > So I'd suggest that you always copy the executable to a temp file and
> > mv it back before running any timing tests.
> >    
> You mean to get it into a cache or to hope to avoid fragmentation?  If 
> you are suggesting this to avoid measuring the startup overhead of 
> paging the binary in, I strongly disagree. It is the slowest part of 
> firefox startup and needs to be addressed.

No, nothing like that at all.

What I'm saying is that you shouldn't be testing or attempting to
optimise for files which were laid out by ld.  Because those files are
an utter mess - the block ordering is simply all over the place.  And
the great majority of people aren't using executables which were laid out
on disk by ld!

Instead, straighten out the block layout with `cp', then go and do the
testing and the optimisation.  Because if you're not taking this first
step then you're just not serious about performance at all!

Here's a small executable, as laid out by ld:

File offset	disk blocks
0-0: 		18383385-18383385 (1)
1-1: 		18383389-18383389 (1)
2-3: 		18383392-18383393 (2)
4-4: 		18383400-18383400 (1)
5-7: 		18383430-18383432 (3)
8-11: 		18383450-18383453 (4)
12-12: 		18383423-18383423 (1)
13-14: 		18383447-18383448 (2)
15-16: 		18383474-18383475 (2)
17-17: 		18383390-18383390 (1)
18-18: 		18383398-18383398 (1)
19-20: 		18383418-18383419 (2)
21-21: 		18383421-18383421 (1)
22-22: 		18383397-18383397 (1)
23-23: 		18383399-18383399 (1)
24-24: 		18383407-18383407 (1)
25-25: 		18383391-18383391 (1)
26-26: 		18383396-18383396 (1)
27-28: 		18383394-18383395 (2)
29-34: 		18383401-18383406 (6)
35-38: 		18383425-18383428 (4)
39-39: 		18383433-18383433 (1)
40-40: 		18383463-18383463 (1)
41-44: 		18383490-18383493 (4)
45-45: 		18383409-18383409 (1)
46-46: 		18383422-18383422 (1)
47-47: 		18383442-18383442 (1)
48-48: 		18383410-18383410 (1)
49-49: 		18383420-18383420 (1)
50-50: 		18383424-18383424 (1)
51-51: 		18383429-18383429 (1)
52-54: 		18383411-18383413 (3)
55-56: 		18383416-18383417 (2)
57-64: 		18383434-18383441 (8)
65-66: 		18383458-18383459 (2)
67-68: 		18383414-18383415 (2)
69-70: 		18383387-18383388 (2)
71-71: 		18383408-18383408 (1)
72-74: 		18383443-18383445 (3)

Not only is it fragmented, it's also in jumbled-up order.

And here it is after I did `cp':

0-11:		18391043-18391054 (12)
12-15:		18391056-18391059 (4)
16-74:		18391064-18391122 (59)

Trying to get the kernel to fix up the first case is daft, when it is
so easy to fix and so obviously _needs_ fixing.
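
For anyone who wants to build the "cp to a temp file and mv it back" step
into a script or install tool, it amounts to roughly the following. This is
a minimal sketch with a made-up function name; it assumes that a single
sequential write gives the file system the chance to allocate the blocks in
order, which it does not itself verify.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `path` sequentially into a temp file in the
 * same directory, then rename the copy over the original.
 * Returns 0 on success, -1 on error.
 */
int rewrite_sequentially(const char *path)
{
    char tmp[4096];
    char buf[65536];
    struct stat st;
    ssize_t n = -1;
    int ok = 0;

    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", path) >= (int)sizeof tmp)
        return -1;

    int in = open(path, O_RDONLY);
    if (in < 0)
        return -1;
    int out = mkstemp(tmp);
    if (out < 0) {
        close(in);
        return -1;
    }

    /* Keep the permission bits so an executable stays executable. */
    if (fstat(in, &st) == 0 && fchmod(out, st.st_mode & 07777) == 0) {
        ok = 1;
        while (ok && (n = read(in, buf, sizeof buf)) > 0) {
            /* write() may be partial, so loop until the chunk is out. */
            for (ssize_t off = 0; ok && off < n; ) {
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0)
                    ok = 0;
                else
                    off += w;
            }
        }
        if (n < 0)
            ok = 0;
    }

    close(in);
    if (close(out) != 0)
        ok = 0;
    if (ok && rename(tmp, path) != 0)
        ok = 0;
    if (!ok)
        unlink(tmp);
    return ok ? 0 : -1;
}
```

Ownership, timestamps and extended attributes are not preserved here; plain
cp and mv remain the simpler choice when they are available.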

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 22:21     ` Andrew Morton
@ 2010-04-16  2:37       ` Taras Glek
  0 siblings, 0 replies; 35+ messages in thread
From: Taras Glek @ 2010-04-16  2:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On 04/15/2010 03:21 PM, Andrew Morton wrote:
> On Thu, 15 Apr 2010 17:41:48 -0700 Taras Glek<tglek@mozilla.com>  wrote:
>
>    
>> On 04/15/2010 03:53 PM, Andrew Morton wrote:
>>      
>>> On Mon, 05 Apr 2010 15:43:02 -0700
>>> Taras Glek<tglek@mozilla.com>   wrote:
>>>
>>>
>>>        
>>>> To make matters worse,
>>>> the compile-time linker + gcc lay out code in a manner that does not
>>>> correspond to how the resulting executable will be executed(ie the
>>>> layout is basically random).
>>>>
>>>>          
>>> Yes, the linker scrambles the executable's block ordering.
>>>
>>> This just isn't an interesting case.  World-wide, the number of people
>>> who compile their own web browser and execute it from the file which ld
>>> produced is, umm, seven.
>>>
>>>        
>> I'm sorry that you don't find this interesting.
>>      
> It's not a case we should optimise for.  It's perfectly reasonable for
> the kernel to assume that the executable is reasonably well-laid-out on
> disk.  And if it _isn't_ well-laid-out then that should be fixed in
> userspace, because for simple locality-of-reference reasons, that's
> always going to produce the fastest result.
>
> Plus it's the common case as well - the executable was copied from DVD
> or over the network or whatever.
>
> Plus it's so utterly trivial for people who compile-their-own to
> straighten the file out - just run cp!  These people have gone and
> screwed up their file layout - they should fix that, rather than trying
> to get the kernel to perform the impossible for them.  See?
>
>    
>> I did not suggest that
>> people compile their own browser to get a perfect layout. This is
>> something that Mozilla can do when preparing builds and it's also
>> something distributions can do. It just so happens that large parts of
>> startup will be very similar for every single firefox install, might as
>> well layout the binary accordingly.
>>      
>>> So I'd suggest that you always copy the executable to a temp file and
>>> mv it back before running any timing tests.
>>>
>>>        
>> You mean to get it into a cache or to hope to avoid fragmentation?  If
>> you are suggesting this to avoid measuring the startup overhead of
>> paging the binary in, I strongly disagree. It is the slowest part of
>> firefox startup and needs to be addressed.
>>      
> No, nothing like that at all.
>
> What I'm saying is that you shouldn't be testing or attempting to
> optimise for files which were laid out by ld.  Because those files are
> an utter mess - the block ordering is simply all over the place.  And
> the great majority of people aren't using executables which were laid out
> on disk by ld!
>
> Instead, straighten out the block layout with `cp', then go and do the
> testing and the optimisation.  Because if you're not taking this first
> step then you're just not serious about performance at all!
>
> Here's a small executable, as laid out by ld:
>
> File offset	disk blocks
> 0-0: 		18383385-18383385 (1)
> 1-1: 		18383389-18383389 (1)
> 2-3: 		18383392-18383393 (2)
> 4-4: 		18383400-18383400 (1)
> 5-7: 		18383430-18383432 (3)
> 8-11: 		18383450-18383453 (4)
> 12-12: 		18383423-18383423 (1)
> 13-14: 		18383447-18383448 (2)
> 15-16: 		18383474-18383475 (2)
> 17-17: 		18383390-18383390 (1)
> 18-18: 		18383398-18383398 (1)
> 19-20: 		18383418-18383419 (2)
> 21-21: 		18383421-18383421 (1)
> 22-22: 		18383397-18383397 (1)
> 23-23: 		18383399-18383399 (1)
> 24-24: 		18383407-18383407 (1)
> 25-25: 		18383391-18383391 (1)
> 26-26: 		18383396-18383396 (1)
> 27-28: 		18383394-18383395 (2)
> 29-34: 		18383401-18383406 (6)
> 35-38: 		18383425-18383428 (4)
> 39-39: 		18383433-18383433 (1)
> 40-40: 		18383463-18383463 (1)
> 41-44: 		18383490-18383493 (4)
> 45-45: 		18383409-18383409 (1)
> 46-46: 		18383422-18383422 (1)
> 47-47: 		18383442-18383442 (1)
> 48-48: 		18383410-18383410 (1)
> 49-49: 		18383420-18383420 (1)
> 50-50: 		18383424-18383424 (1)
> 51-51: 		18383429-18383429 (1)
> 52-54: 		18383411-18383413 (3)
> 55-56: 		18383416-18383417 (2)
> 57-64: 		18383434-18383441 (8)
> 65-66: 		18383458-18383459 (2)
> 67-68: 		18383414-18383415 (2)
> 69-70: 		18383387-18383388 (2)
> 71-71: 		18383408-18383408 (1)
> 72-74: 		18383443-18383445 (3)
>
> Not only is it fragmented, it's also in jumbled-up order.
>
> And here it is after I did `cp':
>
> 0-11:		18391043-18391054 (12)
> 12-15:		18391056-18391059 (4)
> 16-74:		18391064-18391122 (59)
>
> Trying to get the kernel to fix up the first case is daft, when it is
> so easy to fix and so obviously _needs_ fixing.
>    
Yeah ok. We are talking about different things. I meant the linker lays 
out the program badly, ie within the executable itself. Turns out that 
naively concatenating various compilation units makes for binaries that 
load slowly due to excessive seeking within the file. I wasn't talking 
about filesystem fragmentation. I agree that filesystem bustage caused 
by executing the linker isn't interesting.

Taras
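
To make the distinction concrete, here is a toy model (an assumption for illustration, not a measurement) of why layout inside the binary matters independently of filesystem fragmentation. It assumes each access to an uncached page triggers one blocking read of an aligned window, ignores eviction and seek cost, and uses 4 KiB pages, so the 131072-byte and 2mb read sizes mentioned at the start of the thread become 32 and 512 pages. The 40 MiB binary size and the 320 pages touched at startup are hypothetical numbers.

```python
PAGE_SIZE = 4096
DEFAULT_WINDOW = 131072 // PAGE_SIZE             # 32 pages per fault-triggered read
WILLNEED_WINDOW = (2 * 1024 * 1024) // PAGE_SIZE  # 512 pages per read

BINARY_PAGES = (40 * 1024 * 1024) // PAGE_SIZE    # hypothetical 40 MiB binary


def count_blocking_reads(touched_pages, window_pages):
    """Count reads in the toy model: one read per aligned window touched."""
    cached_windows = set()
    reads = 0
    for page in touched_pages:
        window = page // window_pages
        if window not in cached_windows:
            cached_windows.add(window)
            reads += 1
    return reads


# Startup code scattered through the whole file: one needed page in every 32.
scattered = list(range(0, BINARY_PAGES, 32))
# The same number of pages packed together at the front of the file.
packed = list(range(len(scattered)))

for name, pages in (("scattered", scattered), ("packed", packed)):
    print(name,
          "default window:", count_blocking_reads(pages, DEFAULT_WINDOW),
          "2 MiB window:", count_blocking_reads(pages, WILLNEED_WINDOW))
```

In this model a larger read window cuts the number of blocking reads for the scattered layout, but packing the startup pages together cuts it further for either window size, which is the ordering argument made above. A real kernel's readahead heuristics are more involved than this.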

* Re: Downsides to madvise/fadvise(willneed) for application startup
  2010-04-15 22:53 ` Andrew Morton
  2010-04-15 23:21   ` Zan Lynx
  2010-04-16  0:41   ` Taras Glek
@ 2010-04-16 11:40   ` Andi Kleen
  2 siblings, 0 replies; 35+ messages in thread
From: Andi Kleen @ 2010-04-16 11:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Taras Glek, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:
>
> Yes, the linker scrambles the executable's block ordering.
>
> This just isn't an interesting case.  World-wide, the number of people
> who compile their own web browser and execute it from the file which ld
> produced is, umm, seven.

My understanding was that this is usually gone when you use a delayed
allocation fs (xfs, ext4), unless your link sequence takes much longer
than the flush window.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

Thread overview: 35+ messages
2010-04-05 22:43 Downsides to madvise/fadvise(willneed) for application startup Taras Glek
2010-04-05 23:17 ` Dave Chinner
2010-04-05 23:52 ` Roland Dreier
2010-04-06 22:09   ` Taras Glek
2010-04-06  9:51 ` Johannes Weiner
2010-04-06 21:57   ` Taras Glek
2010-04-06 22:26     ` Johannes Weiner
2010-04-06 22:39       ` Taras Glek
2010-04-07  2:24   ` Wu Fengguang
2010-04-07  2:54     ` Taras Glek
2010-04-07  4:06       ` Minchan Kim
2010-04-07  7:14         ` Wu Fengguang
2010-04-07  7:33           ` Minchan Kim
2010-04-07  7:47             ` Wu Fengguang
2010-04-07  8:06               ` Minchan Kim
2010-04-07  8:13                 ` Wu Fengguang
2010-04-07  7:38       ` Wu Fengguang
2010-04-08 17:44         ` Taras Glek
2010-04-12  2:27           ` Wu Fengguang
2010-04-12  3:25             ` Minchan Kim
2010-04-12  4:58               ` Wu Fengguang
2010-04-12  4:43             ` drepper
2010-04-12  4:46               ` Taras Glek
2010-04-12  4:50               ` Wu Fengguang
2010-04-12  8:50 ` Andi Kleen
2010-04-15 22:53 ` Andrew Morton
2010-04-15 23:21   ` Zan Lynx
2010-04-15 20:42     ` Andrew Morton
2010-04-16 11:41     ` Andi Kleen
2010-04-16 12:23       ` Theodore Tso
2010-04-16 12:23       ` Theodore Tso
2010-04-16  0:41   ` Taras Glek
2010-04-15 22:21     ` Andrew Morton
2010-04-16  2:37       ` Taras Glek
2010-04-16 11:40   ` Andi Kleen