public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/3] Use rounddown_pow_of_two() in zone_batchsize()
@ 2009-05-05 21:26 David Howells
  2009-05-05 21:26 ` [PATCH 2/3] NOMMU: Clamp zone_batchsize() to 0 under NOMMU conditions David Howells
  2009-05-05 21:26 ` [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable David Howells
  0 siblings, 2 replies; 6+ messages in thread
From: David Howells @ 2009-05-05 21:26 UTC (permalink / raw)
  To: torvalds, akpm, npiggin; +Cc: gerg, linux-kernel, David Howells

Use rounddown_pow_of_two(N) in zone_batchsize() rather than (1 << (fls(N)-1)),
as the two are equivalent and the former makes it easier to see what is going
on.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2f2699..8add7da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2706,7 +2706,7 @@ static int zone_batchsize(struct zone *zone)
 	 * of pages of one half of the possible page colors
 	 * and the other with pages of the other colors.
 	 */
-	batch = (1 << (fls(batch + batch/2)-1)) - 1;
+	batch = rounddown_pow_of_two(batch + batch/2) - 1;
 
 	return batch;
 }


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] NOMMU: Clamp zone_batchsize() to 0 under NOMMU conditions
  2009-05-05 21:26 [PATCH 1/3] Use rounddown_pow_of_two() in zone_batchsize() David Howells
@ 2009-05-05 21:26 ` David Howells
  2009-05-05 21:26 ` [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable David Howells
  1 sibling, 0 replies; 6+ messages in thread
From: David Howells @ 2009-05-05 21:26 UTC (permalink / raw)
  To: torvalds, akpm, npiggin; +Cc: gerg, linux-kernel, David Howells

Clamp zone_batchsize() to 0 under NOMMU conditions to stop free_hot_cold_page()
from queueing and batching frees.

The problem is that under NOMMU conditions it is really important to be able to
allocate large contiguous chunks of memory, but when munmap() or exit_mmap()
releases big stretches of memory, their return to the buddy allocator can be
deferred, and when it does finally happen, it can be in small chunks.

Whilst the fragmentation this incurs isn't so much of a problem under MMU
conditions, because userspace VM is glued together from individual pages with
the aid of the MMU, it is a real problem if there isn't an MMU.

By clamping the page freeing queue size to 0, pages are returned to the
allocator immediately, so the buddy allocator is more likely to be able to
coalesce them into large chunks straight away, and fragmentation is less
likely to occur.

By disabling batching of frees, and by turning off the trimming of excess space
during boot, Coldfire can manage to boot.

Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/page_alloc.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8add7da..fe753ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,6 +2681,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 
 static int zone_batchsize(struct zone *zone)
 {
+#ifdef CONFIG_MMU
 	int batch;
 
 	/*
@@ -2709,6 +2710,23 @@ static int zone_batchsize(struct zone *zone)
 	batch = rounddown_pow_of_two(batch + batch/2) - 1;
 
 	return batch;
+
+#else
+	/* The deferral and batching of frees should be suppressed under NOMMU
+	 * conditions.
+	 *
+	 * The problem is that NOMMU needs to be able to allocate large chunks
+	 * of contiguous memory as there's no hardware page translation to
+	 * assemble apparent contiguous memory from discontiguous pages.
+	 *
+	 * Queueing large contiguous runs of pages for batching, however,
+	 * causes the pages to actually be freed in smaller chunks.  As there
+	 * can be a significant delay between the individual batches being
+	 * recycled, this leads to the once large chunks of space being
+	 * fragmented and becoming unavailable for high-order allocations.
+	 */
+	return 0;
+#endif
 }
 
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)



* [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
  2009-05-05 21:26 [PATCH 1/3] Use rounddown_pow_of_two() in zone_batchsize() David Howells
  2009-05-05 21:26 ` [PATCH 2/3] NOMMU: Clamp zone_batchsize() to 0 under NOMMU conditions David Howells
@ 2009-05-05 21:26 ` David Howells
  2009-05-05 22:15   ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: David Howells @ 2009-05-05 21:26 UTC (permalink / raw)
  To: torvalds, akpm, npiggin; +Cc: gerg, linux-kernel, David Howells

NOMMU mmap() has an option controlled by a sysctl variable that determines
whether the allocations made by do_mmap_private() should have the excess space
trimmed off and returned to the allocator.  Make the initial setting of this
variable a Kconfig configuration option.

The reason there can be excess space is that the allocator only allocates in
power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power
of 2.

There are two alternatives:

 (1) Keep the excess as dead space.  The dead space then remains unused for the
     lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
     or busybox's text segment may retain their dead space forever.

 (2) Return the excess to the allocator.  This means that the dead space is
     limited to less than a page per mapping, but it means that for a transient
     process, there's more chance of fragmentation as the excess space may be
     reused fairly quickly.

During the boot process, a lot of transient processes are created, and this can
cause a lot of fragmentation as the pagecache and various slabs grow greatly
during this time.

By turning off the trimming of excess space during boot and disabling batching
of frees, Coldfire can manage to boot.

A better way of doing things might be to have /sbin/init turn this option off.
By that point libc, ld.so and init - which are all long-duration processes -
have all been loaded and trimmed.
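In that scheme, the switch-off could be a one-liner in an early init script
(a sketch; /proc/sys/vm/nr_trim_pages is the sysctl named above, and 0 means
never trim):

```shell
#!/bin/sh
# Once the long-lived mappings (libc, ld.so, init itself) have been loaded
# and trimmed, stop trimming so that the many transient boot-time processes
# cause less fragmentation.
echo 0 > /proc/sys/vm/nr_trim_pages
```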

Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
---

 mm/Kconfig |   28 ++++++++++++++++++++++++++++
 mm/nommu.c |    2 +-
 2 files changed, 29 insertions(+), 1 deletions(-)


diff --git a/mm/Kconfig b/mm/Kconfig
index 57971d2..c2b57d8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -225,3 +225,31 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
 	bool
+
+config NOMMU_INITIAL_TRIM_EXCESS
+	int "Turn on mmap() excess space trimming before booting"
+	depends on !MMU
+	default 1
+	help
+	  The NOMMU mmap() frequently needs to allocate large contiguous chunks
+	  of memory on which to store mappings, but it can only ask the system
+	  allocator for chunks in 2^N*PAGE_SIZE amounts - which is frequently
+	  more than it requires.  To deal with this, mmap() is able to trim off
+	  the excess and return it to the allocator.
+
+	  If trimming is enabled, the excess is trimmed off and returned to the
+	  system allocator, which can cause extra fragmentation, particularly
+	  if there are a lot of transient processes.
+
+	  If trimming is disabled, the excess is kept, but not used, which for
+	  long-term mappings means that the space is wasted.
+
+	  Trimming can be dynamically controlled through a sysctl option
+	  (/proc/sys/vm/nr_trim_pages) which specifies the minimum number of
+	  excess pages there must be before trimming should occur, or zero if
+	  no trimming is to occur.
+
+	  This option specifies the initial value of the sysctl.  The default
+	  of 1 says that all excess pages should be trimmed.
+
+	  See Documentation/nommu-mmap.txt for more information.
diff --git a/mm/nommu.c b/mm/nommu.c
index 41dc127..cdc6f60 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
-int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
+int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
 int heap_stack_gap = 0;
 
 atomic_long_t mmap_pages_allocated;



* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
  2009-05-05 21:26 ` [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable David Howells
@ 2009-05-05 22:15   ` Andrew Morton
  2009-05-06  0:09     ` Greg Ungerer
  2009-05-06 11:42     ` David Howells
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2009-05-05 22:15 UTC (permalink / raw)
  To: David Howells; +Cc: torvalds, npiggin, gerg, linux-kernel, dhowells

On Tue, 05 May 2009 22:26:48 +0100
David Howells <dhowells@redhat.com> wrote:

> NOMMU mmap() has an option controlled by a sysctl variable that determines
> whether the allocations made by do_mmap_private() should have the excess space
> trimmed off and returned to the allocator.  Make the initial setting of this
> variable a Kconfig configuration option.
> 
> The reason there can be excess space is that the allocator only allocates in
> power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power
> of 2.
> 
> There are two alternatives:
> 
>  (1) Keep the excess as dead space.  The dead space then remains unused for the
>      lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
>      or busybox's text segment may retain their dead space forever.
> 
>  (2) Return the excess to the allocator.  This means that the dead space is
>      limited to less than a page per mapping, but it means that for a transient
>      process, there's more chance of fragmentation as the excess space may be
>      reused fairly quickly.
> 
> During the boot process, a lot of transient processes are created, and this can
> cause a lot of fragmentation as the pagecache and various slabs grow greatly
> during this time.
> 
> By turning off the trimming of excess space during boot and disabling batching
> of frees, Coldfire can manage to boot.
> 
> A better way of doing things might be to have /sbin/init turn this option off.
> By that point libc, ld.so and init - which are all long-duration processes -
> have all been loaded and trimmed.
> 

Nasty problem.

> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
>  int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
>  int sysctl_overcommit_ratio = 50; /* default is 50% */
>  int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
> -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
> +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
>  int heap_stack_gap = 0;
>  

But there's a risk of -ENOMEM regression on other system here?

It's unlikely to be a huge problem for real-world embedded developers,
as long as they know about this change.  And because you set the
Kconfig default to "no change" then I guess they'll be none the wiser.

I think that patches 2 and 3 (and #1 unless I reorder and redo things)
are 2.6.30 material.  Agree?



* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
  2009-05-05 22:15   ` Andrew Morton
@ 2009-05-06  0:09     ` Greg Ungerer
  2009-05-06 11:42     ` David Howells
  1 sibling, 0 replies; 6+ messages in thread
From: Greg Ungerer @ 2009-05-06  0:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Howells, torvalds, npiggin, linux-kernel



Andrew Morton wrote:
> On Tue, 05 May 2009 22:26:48 +0100
> David Howells <dhowells@redhat.com> wrote:
> 
>> NOMMU mmap() has an option controlled by a sysctl variable that determines
>> whether the allocations made by do_mmap_private() should have the excess space
>> trimmed off and returned to the allocator.  Make the initial setting of this
>> variable a Kconfig configuration option.
>>
>> The reason there can be excess space is that the allocator only allocates in
>> power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power
>> of 2.
>>
>> There are two alternatives:
>>
>>  (1) Keep the excess as dead space.  The dead space then remains unused for the
>>      lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
>>      or busybox's text segment may retain their dead space forever.
>>
>>  (2) Return the excess to the allocator.  This means that the dead space is
>>      limited to less than a page per mapping, but it means that for a transient
>>      process, there's more chance of fragmentation as the excess space may be
>>      reused fairly quickly.
>>
>> During the boot process, a lot of transient processes are created, and this can
>> cause a lot of fragmentation as the pagecache and various slabs grow greatly
>> during this time.
>>
>> By turning off the trimming of excess space during boot and disabling batching
>> of frees, Coldfire can manage to boot.

To put that in better perspective: it's not that all ColdFire platforms
fail to boot.  It depends very much on what you try to run from user space.
Typical small setups (which realistically covers most ColdFire systems)
just don't try to run that much.  As with anything embedded, there is great
variance in what people try to do...

Regards
Greg



>> A better way of doing things might be to have /sbin/init turn this option off.
>> By that point libc, ld.so and init - which are all long-duration processes -
>> have all been loaded and trimmed.
>>
> 
> Nasty problem.
> 
>> --- a/mm/nommu.c
>> +++ b/mm/nommu.c
>> @@ -66,7 +66,7 @@ struct percpu_counter vm_committed_as;
>>  int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
>>  int sysctl_overcommit_ratio = 50; /* default is 50% */
>>  int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
>> -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
>> +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
>>  int heap_stack_gap = 0;
>>  
> 
> But there's a risk of -ENOMEM regression on other system here?
> 
> It's unlikely to be a huge problem for real-world embedded developers,
> as long as they know about this change.  And because you set the
> Kconfig default to "no change" then I guess they'll be none the wiser.
> 
> I think that patches 2 and 3 (and #1 unless I reorder and redo things)
> are 2.6.30 material.  Agree?
> 
> 

-- 
------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com


* Re: [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable
  2009-05-05 22:15   ` Andrew Morton
  2009-05-06  0:09     ` Greg Ungerer
@ 2009-05-06 11:42     ` David Howells
  1 sibling, 0 replies; 6+ messages in thread
From: David Howells @ 2009-05-06 11:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dhowells, torvalds, npiggin, gerg, linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> Nasty problem.

Yes.  That's part of the fun of the NOMMU world.  It has many of the same
problems as the MMU world - just more exaggerated.

> > -int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
> > +int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
> >  int heap_stack_gap = 0;
> >  
> 
> But there's a risk of -ENOMEM regression on other system here?

There shouldn't be (assuming you mean with this patch): the default is the
same as the original value.

> It's unlikely to be a huge problem for real-world embedded developers,
> as long as they know about this change.  And because you set the
> Kconfig default to "no change" then I guess they'll be none the wiser.
> 
> I think that patches 2 and 3 (and #1 unless I reorder and redo things)
> are 2.6.30 material.  Agree?

Assuming you mean go in before 2.6.30 is cut, then yes.  If you want, I can
reorder the patches to put #1 last.

David


end of thread, other threads:[~2009-05-06 11:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-05 21:26 [PATCH 1/3] Use rounddown_pow_of_two() in zone_batchsize() David Howells
2009-05-05 21:26 ` [PATCH 2/3] NOMMU: Clamp zone_batchsize() to 0 under NOMMU conditions David Howells
2009-05-05 21:26 ` [PATCH 3/3] NOMMU: Make the initial mmap allocation excess behaviour Kconfig configurable David Howells
2009-05-05 22:15   ` Andrew Morton
2009-05-06  0:09     ` Greg Ungerer
2009-05-06 11:42     ` David Howells
