* [PATCH 1/3] Use rounddown_pow_of_two() in zone_batchsize()
From: David Howells <dhowells@redhat.com>
Date: 2009-05-05 21:26 UTC
To: torvalds, akpm, npiggin
Cc: gerg, linux-kernel, David Howells

Use rounddown_pow_of_two(N) in zone_batchsize() rather than (1 << (fls(N)-1)):
the two are equivalent, and with the former it is easier to see what is going
on.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2f2699..8add7da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2706,7 +2706,7 @@ static int zone_batchsize(struct zone *zone)
 	 * of pages of one half of the possible page colors
 	 * and the other with pages of the other colors.
 	 */
-	batch = (1 << (fls(batch + batch/2)-1)) - 1;
+	batch = rounddown_pow_of_two(batch + batch/2) - 1;
 
 	return batch;
 }
* [PATCH 2/3] NOMMU: Clamp zone_batchsize() to 0 under NOMMU conditions
From: David Howells <dhowells@redhat.com>
Date: 2009-05-05 21:26 UTC
To: torvalds, akpm, npiggin
Cc: gerg, linux-kernel, David Howells

Clamp zone_batchsize() to 0 under NOMMU conditions to stop
free_hot_cold_page() from queueing and batching frees.

The problem is that under NOMMU conditions it is really important to be able
to allocate large contiguous chunks of memory, but when munmap() or
exit_mmap() releases big stretches of memory, the return of these to the
buddy allocator can be deferred, and when it does finally happen, it can be
in small chunks.

Whilst the fragmentation this incurs isn't so much of a problem under MMU
conditions - as userspace VM is glued together from individual pages with the
aid of the MMU - it is a real problem if there isn't an MMU.

By clamping the page freeing queue size to 0, pages are returned to the
allocator immediately, the buddy detector is more likely to be able to glue
them together into large chunks straight away, and fragmentation is less
likely to occur.

By disabling the batching of frees, and by turning off the trimming of excess
space during boot, Coldfire can manage to boot.

Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/page_alloc.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8add7da..fe753ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,6 +2681,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 
 static int zone_batchsize(struct zone *zone)
 {
+#ifdef CONFIG_MMU
 	int batch;
 
 	/*
@@ -2709,6 +2710,23 @@ static int zone_batchsize(struct zone *zone)
 	batch = rounddown_pow_of_two(batch + batch/2) - 1;
 
 	return batch;
+
+#else
+	/* The deferral and batching of frees should be suppressed under NOMMU
+	 * conditions.
+	 *
+	 * The problem is that NOMMU needs to be able to allocate large chunks
+	 * of contiguous memory as there's no hardware page translation to
+	 * assemble apparent contiguous memory from discontiguous pages.
+	 *
+	 * Queueing large contiguous runs of pages for batching, however,
+	 * causes the pages to actually be freed in smaller chunks.  As there
+	 * can be a significant delay between the individual batches being
+	 * recycled, this leads to the once large chunks of space being
+	 * fragmented and becoming unavailable for high-order allocations.
+	 */
+	return 0;
+#endif
 }
 
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
* [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
From: David Howells <dhowells@redhat.com>
Date: 2009-05-05 21:26 UTC
To: torvalds, akpm, npiggin
Cc: gerg, linux-kernel, David Howells

NOMMU mmap() has an option, controlled by a sysctl variable, that determines
whether the allocations made by do_mmap_private() should have the excess
space trimmed off and returned to the allocator.  Make the initial setting of
this variable a Kconfig configuration option.

The reason there can be excess space is that the allocator only allocates in
power-of-2 size chunks, but mmap() calls can be made in sizes that aren't a
power of 2.

There are two alternatives:

 (1) Keep the excess as dead space.  The dead space then remains unused for
     the lifetime of the mapping.  Mappings of shared objects such as libc,
     ld.so or busybox's text segment may retain their dead space forever.

 (2) Return the excess to the allocator.  This means that the dead space is
     limited to less than a page per mapping, but it means that for a
     transient process, there's more chance of fragmentation as the excess
     space may be reused fairly quickly.

During the boot process, a lot of transient processes are created, and this
can cause a lot of fragmentation as the pagecache and various slabs grow
greatly during this time.

By turning off the trimming of excess space during boot and disabling the
batching of frees, Coldfire can manage to boot.

A better way of doing things might be to have /sbin/init turn this option
off.  By that point, libc, ld.so and init - which are all long-duration
processes - have all been loaded and trimmed.

Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/Kconfig |   28 ++++++++++++++++++++++++++++
 mm/nommu.c |    2 +-
 2 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 57971d2..c2b57d8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -225,3 +225,31 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
 	bool
+
+config NOMMU_INITIAL_TRIM_EXCESS
+	int "Turn on mmap() excess space trimming before booting"
+	depends on !MMU
+	default 1
+	help
+	  The NOMMU mmap() frequently needs to allocate large contiguous chunks
+	  of memory on which to store mappings, but it can only ask the system
+	  allocator for chunks in 2^N*PAGE_SIZE amounts - which is frequently
+	  more than it requires.  To deal with this, mmap() is able to trim off
+	  the excess and return it to the allocator.
+
+	  If trimming is enabled, the excess is trimmed off and returned to the
+	  system allocator, which can cause extra fragmentation, particularly
+	  if there are a lot of transient processes.
+
+	  If trimming is disabled, the excess is kept, but not used, which for
+	  long-term mappings means that the space is wasted.
+
+	  Trimming can be dynamically controlled through a sysctl option
+	  (/proc/sys/vm/nr_trim_pages) which specifies the minimum number of
+	  excess pages there must be before trimming should occur, or zero if
+	  no trimming is to occur.
+
+	  This option specifies the initial value of the sysctl.  The default
+	  of 1 says that all excess pages should be trimmed.
+
+	  See Documentation/nommu-mmap.txt for more information.

diff --git a/mm/nommu.c b/mm/nommu.c
index 41dc127..cdc6f60 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
-int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
+int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
 int heap_stack_gap = 0;
 
 atomic_long_t mmap_pages_allocated;
* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
From: Andrew Morton @ 2009-05-05 22:15 UTC
To: David Howells
Cc: torvalds, npiggin, gerg, linux-kernel, dhowells

On Tue, 05 May 2009 22:26:48 +0100 David Howells <dhowells@redhat.com> wrote:

> NOMMU mmap() has an option controlled by a sysctl variable that determines
> whether the allocations made by do_mmap_private() should have the excess space
> trimmed off and returned to the allocator. Make the initial setting of this
> variable a Kconfig configuration option.
>
> The reason there can be excess space is that the allocator only allocates in
> power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power
> of 2.
>
> There are two alternatives:
>
> (1) Keep the excess as dead space. The dead space then remains unused for the
>     lifetime of the mapping. Mappings of shared objects such as libc, ld.so
>     or busybox's text segment may retain their dead space forever.
>
> (2) Return the excess to the allocator. This means that the dead space is
>     limited to less than a page per mapping, but it means that for a transient
>     process, there's more chance of fragmentation as the excess space may be
>     reused fairly quickly.
>
> During the boot process, a lot of transient processes are created, and this can
> cause a lot of fragmentation as the pagecache and various slabs grow greatly
> during this time.
>
> By turning off the trimming of excess space during boot and disabling batching
> of frees, Coldfire can manage to boot.
>
> A better way of doing things might be to have /sbin/init turn this option off.
> By that point libc, ld.so and init - which are all long-duration processes -
> have all been loaded and trimmed.
>

Nasty problem.

> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
>  int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
>  int sysctl_overcommit_ratio = 50; /* default is 50% */
>  int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
> -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
> +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
>  int heap_stack_gap = 0;
>

But there's a risk of -ENOMEM regression on other systems here?

It's unlikely to be a huge problem for real-world embedded developers,
as long as they know about this change.  And because you set the
Kconfig default to "no change" then I guess they'll be none the wiser.

I think that patches 2 and 3 (and #1 unless I reorder and redo things)
are 2.6.30 material.  Agree?
* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
From: Greg Ungerer @ 2009-05-06 0:09 UTC
To: Andrew Morton
Cc: David Howells, torvalds, npiggin, linux-kernel

Andrew Morton wrote:
> On Tue, 05 May 2009 22:26:48 +0100
> David Howells <dhowells@redhat.com> wrote:
>
>> NOMMU mmap() has an option controlled by a sysctl variable that determines
>> whether the allocations made by do_mmap_private() should have the excess space
>> trimmed off and returned to the allocator. Make the initial setting of this
>> variable a Kconfig configuration option.
>>
>> The reason there can be excess space is that the allocator only allocates in
>> power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power
>> of 2.
>>
>> There are two alternatives:
>>
>> (1) Keep the excess as dead space. The dead space then remains unused for the
>>     lifetime of the mapping. Mappings of shared objects such as libc, ld.so
>>     or busybox's text segment may retain their dead space forever.
>>
>> (2) Return the excess to the allocator. This means that the dead space is
>>     limited to less than a page per mapping, but it means that for a transient
>>     process, there's more chance of fragmentation as the excess space may be
>>     reused fairly quickly.
>>
>> During the boot process, a lot of transient processes are created, and this can
>> cause a lot of fragmentation as the pagecache and various slabs grow greatly
>> during this time.
>>
>> By turning off the trimming of excess space during boot and disabling batching
>> of frees, Coldfire can manage to boot.

To put that in better perspective: it's not that all ColdFire platforms
don't boot.  It depends very much on what you try to run from user space.
Typical small setups (which, realistically, is most ColdFire systems) just
don't try to run that much.  As with anything embedded, there is a great
variance in what people try to do...

Regards
Greg

>> A better way of doing things might be to have /sbin/init turn this option off.
>> By that point libc, ld.so and init - which are all long-duration processes -
>> have all been loaded and trimmed.
>>
>
> Nasty problem.
>
>> --- a/mm/nommu.c
>> +++ b/mm/nommu.c
>> @@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
>>  int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
>>  int sysctl_overcommit_ratio = 50; /* default is 50% */
>>  int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
>> -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
>> +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
>>  int heap_stack_gap = 0;
>>
>
> But there's a risk of -ENOMEM regression on other systems here?
>
> It's unlikely to be a huge problem for real-world embedded developers,
> as long as they know about this change.  And because you set the
> Kconfig default to "no change" then I guess they'll be none the wiser.
>
> I think that patches 2 and 3 (and #1 unless I reorder and redo things)
> are 2.6.30 material.  Agree?

-- 
------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
From: David Howells @ 2009-05-06 11:42 UTC
To: Andrew Morton
Cc: dhowells, torvalds, npiggin, gerg, linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> Nasty problem.

Yes.  That's part of the fun of the NOMMU world.  It has many of the same
problems as the MMU world - just more exaggerated.

> > -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
> > +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
> >  int heap_stack_gap = 0;
> >
>
> But there's a risk of -ENOMEM regression on other systems here?

There shouldn't be (assuming you mean with this patch); the default is the
same as the original value.

> It's unlikely to be a huge problem for real-world embedded developers,
> as long as they know about this change.  And because you set the
> Kconfig default to "no change" then I guess they'll be none the wiser.
>
> I think that patches 2 and 3 (and #1 unless I reorder and redo things)
> are 2.6.30 material.  Agree?

Assuming you mean go in before 2.6.30 is cut, then yes.  If you want, I can
reorder the patches to put #1 last.

David