free bootmem feedback patch

public inbox for linux-ia64@vger.kernel.org

* free bootmem feedback patch
@ 2004-07-13 22:59 Joshua Aas
  2004-07-13 23:14 ` Luck, Tony
                   ` (25 more replies)
  0 siblings, 26 replies; 27+ messages in thread
From: Joshua Aas @ 2004-07-13 22:59 UTC (permalink / raw)
  To: linux-ia64

Hello,

NUMA machines with a lot of memory/nodes appear to hang when freeing boot memory, as it can take on the order of 4 minutes. I would like to propose this patch, which adds progress feedback during this time. On a machine with only one memory region (a single node), it prints:

Freeing boot memory... done

It prints more dots every x nodes handled based on the total number of nodes, in such a way that the line never exceeds 80 chars. This way it is possible to see progress being made. I have tested this on a few machines, including a 512p/512GB machine, and it works fine.

Please apply.

Signed-off-by: Josh Aas <josha@sgi.com>

-----------------------------------------------------------------------
--- linux-2.6.7-clean/arch/ia64/mm/init.c       2004-06-16 00:19:22.000000000 -0500
+++ linux-2.6.7/arch/ia64/mm/init.c     2004-07-13 15:25:46.000000000 -0500
@@ -516,7 +516,7 @@ mem_init (void)
        long reserved_pages, codesize, datasize, initsize;
        unsigned long num_pgt_pages;
        pg_data_t *pgdat;
-       int i;
+       int i, pgdat_count;
        static struct kcore_list kcore_mem, kcore_vmem, kcore_kernel;
 
 #ifdef CONFIG_PCI
@@ -540,8 +540,29 @@ mem_init (void)
        kclist_add(&kcore_vmem, (void *)VMALLOC_START, VMALLOC_END-VMALLOC_START);
        kclist_add(&kcore_kernel, _stext, _end - _stext);
 
+       /*
+        * Give nice feedback while freeing boot memory. Each entry in pgdat corresponds to
+        * a memory zone, presumably a node in a NUMA machine. We need nice feedback so that 
+        * machines with lots of nodes/memory don't appear to be hanging.
+        */
+       printk(KERN_INFO "Freeing boot memory...");
+#      define NUM_FREE_BOOT_MEM_MSG_CHAR_COUNT 27 /* 22 + 5 for done message */
+       pgdat_count = 0;
+       i = 0;
        for_each_pgdat(pgdat)
+               pgdat_count++;
+       pgdat_count = (pgdat_count / (80 - NUM_FREE_BOOT_MEM_MSG_CHAR_COUNT)) - 1;
+       for_each_pgdat(pgdat) {
                totalram_pages += free_all_bootmem_node(pgdat);
+               if (i == pgdat_count) {
+                       printk(".");
+                       i = 0;
+               }
+               else {
+                       i++;
+               }
+       }
+       printk(" done\n");
 
        reserved_pages = 0;
        efi_memmap_walk(count_reserved_pages, &reserved_pages);

-----------------------------------------------------------------------
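
The dot-throttling arithmetic described above can be sketched in user space as follows. This is only an illustration, not the posted code: `nodes_per_dot` and `count_dots` are hypothetical helper names, the 80-column width and the 27 fixed characters are taken from the patch, and ceiling division is an assumed choice that keeps the number of dots within the remaining 53 columns. Unlike the behaviour described in the mail, this sketch emits one dot even for a single node.

```c
#include <assert.h>

#define LINE_WIDTH  80
#define FIXED_CHARS 27  /* "Freeing boot memory..." (22) + " done" (5) */

/*
 * Hypothetical helper: how many nodes must be handled per printed dot
 * so that at most (LINE_WIDTH - FIXED_CHARS) dots appear on the line.
 */
static int nodes_per_dot(int node_count)
{
	int budget = LINE_WIDTH - FIXED_CHARS;           /* 53 columns for dots */
	int step;

	assert(node_count >= 0);
	step = (node_count + budget - 1) / budget;       /* ceiling division */
	return step > 0 ? step : 1;
}

/* Hypothetical helper: number of dots printed for node_count nodes. */
static int count_dots(int node_count)
{
	int step = nodes_per_dot(node_count);
	int dots = 0;
	int i;

	for (i = 1; i <= node_count; i++)
		if (i % step == 0)                       /* one dot per 'step' nodes */
			dots++;
	return dots;
}
```

With these assumptions a 512-node machine gets one dot per 10 nodes, so the whole line stays under 80 characters.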

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
@ 2004-07-13 23:14 ` Luck, Tony
  2004-07-13 23:52 ` Joshua Aas
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-07-13 23:14 UTC (permalink / raw)
  To: linux-ia64

>NUMA machines with a lot of memory/nodes appear to hang when 
>freeing boot memory, as it can take on the order of 4 minutes. 

Isn't this a feature of the amount of memory being freed, rather
than the fact that there are many nodes?  I.e. if someone has a
single node box with a few TB of memory, they'll still have the
multi-minute delay, but your patch won't give them any progress
dots.

-Tony


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
  2004-07-13 23:14 ` Luck, Tony
@ 2004-07-13 23:52 ` Joshua Aas
  2004-07-14  8:44 ` Andi Kleen
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Joshua Aas @ 2004-07-13 23:52 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote:
>>NUMA machines with a lot of memory/nodes appear to hang when 
>>freeing boot memory, as it can take on the order of 4 minutes. 
> 
> 
> Isn't this a feature of the amount of memory being freed, rather
> than the fact that there are many nodes?  I.e. if someone has a
> single node box with a few TB of memory, they'll still have the
> multi-minute delay, but your patch won't give them any progress
> dots.
> 
> -Tony

In the case of a single massive memory region you would still see "Freeing boot memory...", you just wouldn't see progress. That would make me feel better than nothing. It wouldn't be easy to add progress feedback within a single region without making one line of feedback per region - which seems like way too much to me on machines at and above 128-256 regions. The way I've done it, single region machines get notification, multi-region machines see progress, and either way it only makes one line of feedback. Also, correct me if I'm wrong, but machines with 250GB-4TB or so of memory, where the pause is longest, are more likely to be multi-region machines.

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
  2004-07-13 23:14 ` Luck, Tony
  2004-07-13 23:52 ` Joshua Aas
@ 2004-07-14  8:44 ` Andi Kleen
  2004-07-14  9:17 ` William Lee Irwin III
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2004-07-14  8:44 UTC (permalink / raw)
  To: linux-ia64

> 
> Freeing boot memory... done
> 
> It prints more dots every x nodes handled based on the total number of 
> nodes, in such a way that the line never exceeds 80 chars. This way it is 
> possible to see progress being made. I have tested this on a few machines, 
> including a 512p/512GB machine, and it works fine.

Better would be probably to speed this up.

I bet most of the overhead comes from taking locks in free_pages.
How about adding a lockless ____free_pages and using that? 
Adding a more optimized path for this is also not out of the question.

Another way would be to increase the bootmem page size, but
this won't cut down the overhead of free_pages.

-Andi


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (2 preceding siblings ...)
  2004-07-14  8:44 ` Andi Kleen
@ 2004-07-14  9:17 ` William Lee Irwin III
  2004-07-14  9:19 ` William Lee Irwin III
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-14  9:17 UTC (permalink / raw)
  To: linux-ia64

On Tue, Jul 13, 2004 at 05:59:46PM -0500, Joshua Aas wrote:
> NUMA machines with a lot of memory/nodes appear to hang when freeing boot 
> memory, as it can take on the order of 4 minutes. I would like to propose 
> this patch, which adds progress feedback during this time. On a machine 
> with only one memory region (a single node), it prints:
> Freeing boot memory... done
> It prints more dots every x nodes handled based on the total number of 
> nodes, in such a way that the line never exceeds 80 chars. This way it is 
> possible to see progress being made. I have tested this on a few machines, 
> including a 512p/512GB machine, and it works fine.

Could I get a look at the memory map for this machine?

Also, it may make more sense to speed this up by freeing higher-order
pages at a time.


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (3 preceding siblings ...)
  2004-07-14  9:17 ` William Lee Irwin III
@ 2004-07-14  9:19 ` William Lee Irwin III
  2004-07-14 16:17 ` Joshua Aas
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-14  9:19 UTC (permalink / raw)
  To: linux-ia64

At some point in the past, someone wrote:
>> Freeing boot memory... done
>> It prints more dots every x nodes handled based on the total number of 
>> nodes, in such a way that the line never exceeds 80 chars. This way it is 
>> possible to see progress being made. I have tested this on a few machines, 
>> including a 512p/512GB machine, and it works fine.

On Wed, Jul 14, 2004 at 10:44:31AM +0200, Andi Kleen wrote:
> Better would be probably to speed this up.
> I bet most of the overhead comes from taking locks in free_pages.
> How about adding a lockless ____free_pages and using that? 
> Adding a more optimized path for this is also not out of the question.
> Another way would be to increase the bootmem page size, but
> this won't cut down the overhead of free_pages.

That also sounds like an excellent idea.


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (4 preceding siblings ...)
  2004-07-14  9:19 ` William Lee Irwin III
@ 2004-07-14 16:17 ` Joshua Aas
  2004-07-14 18:34 ` Luck, Tony
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Joshua Aas @ 2004-07-14 16:17 UTC (permalink / raw)
  To: linux-ia64

> On Wed, Jul 14, 2004 at 10:44:31AM +0200, Andi Kleen wrote:
> 
>>Better would be probably to speed this up.
>>I bet most of the overhead comes from taking locks in free_pages.
>>How about adding a lockless ____free_pages and using that? 
>>Adding a more optimized path for this is also not out of the question.
>>Another way would be to increase the bootmem page size, but
>>this won't cut down the overhead of free_pages.
> 
> 
> That also sounds like an excellent idea.

Seems like a good idea to me too. I'll try to come up with a patch that doesn't do locking since it is early enough in the boot process, and maybe I can get rid of some looping/function calls too. This theoretical patch seems like a good idea in any case, but my progress patch might still be necessary on top of that since I'm not sure how much time it will actually save. On machines with 1-8TB of memory the wait might still be considerable (i.e. on the order of 3-8 minutes).

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (5 preceding siblings ...)
  2004-07-14 16:17 ` Joshua Aas
@ 2004-07-14 18:34 ` Luck, Tony
  2004-07-14 22:12 ` William Lee Irwin III
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-07-14 18:34 UTC (permalink / raw)
  To: linux-ia64

There are certainly some opportunities to speed up the code in
free_all_bootmem_core() ... only one cpu is running, so using
atomic operations to "ClearPageReserved(page)" is overkill. As
is calling __free_page() [Which promptly tests that this isn't
a reserved page, and does more atomic ops to decrement the count
from 1 (which we just set it to) to zero.   What a waste of time!]

>Also, it may make more sense to speed this up by freeing higher-order
>pages at a time.

This looks like the most promising way to put a major dent in the
time taken.  If you make sure that the bits in the bitmap line up
correctly so that the physical addresses come out right, then you
have an easy test for "if (v == ~0UL) {" to find larger order pages
(5 on 32-bit systems, 6 on 64-bit).  A factor 64 speedup would
turn your 5 minute hang into a 5 second pause :-).
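
A rough user-space model of that whole-word test, counting how many free operations would be issued. This is a sketch under stated assumptions, not kernel code: `struct free_stats` and `free_by_word` are hypothetical names, a set bit in the map is assumed to mean "reserved" (as in the bootmem bitmap), and alignment of each word to an order-log2(BITS_PER_LONG) block is simply assumed.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bookkeeping: pages released and free operations issued. */
struct free_stats {
	unsigned long pages;
	unsigned long calls;
};

/*
 * Walk the bitmap one word at a time. A word with no reserved bits is
 * released as a single higher-order block (one call); otherwise each
 * free page in the word is released individually.
 */
static struct free_stats free_by_word(const unsigned long *map, size_t nwords)
{
	struct free_stats s = { 0, 0 };
	size_t w;

	for (w = 0; w < nwords; w++) {
		unsigned long v = ~map[w];       /* set bit in v == free page */

		if (v == ~0UL) {                 /* whole word is free */
			s.pages += BITS_PER_LONG;
			s.calls += 1;
		} else {
			unsigned long m;

			for (m = 1; m; m <<= 1) {
				if (v & m) {
					s.pages++;
					s.calls++;
				}
			}
		}
	}
	return s;
}
```

On a fully free map the call count drops by a factor of BITS_PER_LONG, which is where the "factor 64" figure comes from on a 64-bit system; the per-page work of touching each page structure is not modelled here.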

Some prefetches in that loop might help too (so you don't
have to take a full L3 cache miss on every page structure).

Another approach might be to just free enough pages on each node
to get all the cpus up and running, and then have one cpu on each
node free the remaining pages for the node. But this sounds like
major surgery (and wouldn't help my hypothetical machine that has
huge memory on just one node).

-Tony


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (6 preceding siblings ...)
  2004-07-14 18:34 ` Luck, Tony
@ 2004-07-14 22:12 ` William Lee Irwin III
  2004-07-15 19:11 ` Luck, Tony
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-14 22:12 UTC (permalink / raw)
  To: linux-ia64

At some point in the past, I wrote:
>> Also, it may make more sense to speed this up by freeing higher-order
>> pages at a time.

On Wed, Jul 14, 2004 at 11:34:23AM -0700, Luck, Tony wrote:
> This looks like the most promising way to put a major dent in the
> time taken.  If you make sure that the bits in the bitmap line up
> correctly so that the physical addresses come out right, then you
> have an easy test for "if (v == ~0UL) {" to find larger order pages
> (5 on 32-bit systems, 6 on 64-bit).  A factor 64 speedup would
> turn your 5 minute hang into a 5 second pause :-).
> Some prefetches in that loop might help too (so you don't
> have to take a full L3 cache miss on every page structure).
> Another approach might be to just free enough pages on each node
> to get all the cpus up and running, and then have one cpu on each
> node free the remaining pages for the node. But this sounds like
> major surgery (and wouldn't help my hypothetical machine that has
> huge memory on just one node).

I'd even go so far as to suggest scanning the bitmap until the
maximally sized higher-order page to free at the cursor (determined
by cursor alignment and the first bit indicating a reserved page or
otherwise MAX_ORDER bits out from the cursor) is found. More ideally
(2.7) I'd prefer to rewrite the thing altogether and explicitly track
ranges at all times, so eliminating all linear scans but the final one,
which is then very natural to arrange to free maximally sized higher-
order pages and has far fewer degenerate cases than the bitmaps.
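
One way to read the "maximally sized higher-order page at the cursor" idea is sketched below in plain C. It is an interpretation, not code from this thread: `max_order_at` is a hypothetical name, `run` is assumed to be the number of consecutive free pages starting at the cursor, and `cap` is an assumed inclusive upper limit on the order.

```c
#include <assert.h>

/*
 * Hypothetical helper: largest order such that a block of 2^order pages
 * starting at page frame 'cursor' is naturally aligned, fits inside the
 * run of 'run' free pages, and does not exceed 'cap'.
 */
static unsigned int max_order_at(unsigned long cursor, unsigned long run,
				 unsigned int cap)
{
	unsigned int order = 0;

	assert(run > 0);
	while (order < cap) {
		unsigned long next = 1UL << (order + 1);   /* candidate block size */

		if ((cursor & (next - 1)) != 0 || next > run)
			break;                              /* misaligned or too long */
		order++;
	}
	return order;
}
```

For example, a cursor at page frame 8 with 100 free pages ahead can only start an order-3 block, because 8 is not aligned to 16.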

Parallelization sounds very interesting, but something I don't have the
machine and other resources to address. I suspect as node initialization
is rearranged for the purpose of hotplugging memory this will become
easier, but have no specific ideas (or even knowledge of what's going on)
about the hotplug memory and/or bootmem parallelization topic, apart
from believing it's a good idea.

-- wli


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (7 preceding siblings ...)
  2004-07-14 22:12 ` William Lee Irwin III
@ 2004-07-15 19:11 ` Luck, Tony
  2004-07-15 19:31 ` Matthew Wilcox
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-07-15 19:11 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 2188 bytes --]

William Irwin wrote:
>I'd even go so far as to suggest scanning the bitmap until the
>maximally sized higher-order page to free at the cursor (determined
>by cursor alignment and the first bit indicating a reserved page or
>otherwise MAX_ORDER bits out from the cursor) is found.

Attached is my version that just frees pages in O(log2(BITS_PER_LONG))
pieces ... because I'm too lazy to figure out all the boundary
conditions to implement wli's excellent suggestion.

On my 2G machine, this reduced the time to free all the bootmem
from 41ms to 18.1ms ... so it should shave some minutes off the
monster SGI box that started this thread, but perhaps not enough
to avoid the "looks like the system is hung" problem as you will
still be looking at a couple of minutes :-(  Since I only got a
55% reduction, rather than a factor of 64 I expect that modifying
to look at larger order pages may have diminishing returns.

Notes and Caveats:
1) Is there a define someplace for log2(BITS_PER_LONG)?  I couldn't
find one, which is why I calculate the "order" in this patch.

2) On ia64 it looks like the pages and the bitmap are nicely aligned
so that a longword in node_bootmem_map[] corresponds to an order 6
page with the right physical alignment.  This may not be true on other
architectures, and mm/bootmem.c is generic code.  Before this patch
can go into the base, someone would have to check all the other
architectures (some of which might not care to change ... if they
only support 4G or less, then the benefits of saving a few milliseconds
during boot may not inspire them to mess with the boot code).  For
this reason I'm not adding a "Signed-off-by" to this patch because
I don't want the flak for breaking other architectures ... but by
all means take this patch and try it out.

I did also play with adding a "prefetchw(page + N)" to the original
loop.  With N>=6 the saving is about 5% for me.  SGI might need a
bigger N, and might see more savings, since most of the accesses
here are to remote nodes.  I didn't try this in my modified patch
(the gains may be smaller as we don't do as much work in between
touching each page).

-Tony

[-- Attachment #2: bootmem.patch --]
[-- Type: application/octet-stream, Size: 1110 bytes --]

diff -ru linux-2.6.8-rc1-orig/mm/bootmem.c linux-2.6.8-rc1/mm/bootmem.c
--- linux-2.6.8-rc1-orig/mm/bootmem.c	2004-07-11 10:34:19.000000000 -0700
+++ linux-2.6.8-rc1/mm/bootmem.c	2004-07-15 11:22:33.464794109 -0700
@@ -259,9 +259,12 @@
 	unsigned long i, count, total = 0;
 	unsigned long idx;
 	unsigned long *map; 
+	int order;

 	BUG_ON(!bdata->node_bootmem_map);

+	for (order = 0; (1 << order) < BITS_PER_LONG; order++)
+		;
 	count = 0;
 	/* first extant page of the node */
 	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
@@ -269,7 +272,20 @@
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
-		if (v) {
+		if (v == ~0UL) {
+			int j;
+
+			count += BITS_PER_LONG;
+			ClearPageReserved(page);
+			set_page_count(page, 1);
+			for (j = 1; j < BITS_PER_LONG; j++) {
+				ClearPageReserved(page + j);
+				set_page_count(page + j, 0);
+			}
+			__free_pages(page, order);
+			i += BITS_PER_LONG;
+			page += BITS_PER_LONG;
+		} else if (v) {
 			unsigned long m;
 			for (m = 1; m && i < idx; m<<=1, page++, i++) {
 				if (v & m) {


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (8 preceding siblings ...)
  2004-07-15 19:11 ` Luck, Tony
@ 2004-07-15 19:31 ` Matthew Wilcox
  2004-07-15 20:21 ` David Mosberger
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2004-07-15 19:31 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 15, 2004 at 12:11:07PM -0700, Luck, Tony wrote:
> Notes and Caveats:
> 1) Is there a define someplace for log2(BITS_PER_LONG)?  I couldn't
> find one, which is why I calculate the "order" in this patch.

ffs()?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (9 preceding siblings ...)
  2004-07-15 19:31 ` Matthew Wilcox
@ 2004-07-15 20:21 ` David Mosberger
  2004-07-15 23:16 ` William Lee Irwin III
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: David Mosberger @ 2004-07-15 20:21 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 15 Jul 2004 12:11:07 -0700, "Luck, Tony" <tony.luck@intel.com> said:

  Tony> Notes and Caveats: 1) Is there a define someplace for
  Tony> log2(BITS_PER_LONG)?  I couldn't find one, which is why I
  Tony> calculate the "order" in this patch.

BITS_PER_LONG is a power-of-two, so ffs() will do.  For more general
cases, you may want asm/page.h:get_order() instead.

	--david


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (10 preceding siblings ...)
  2004-07-15 20:21 ` David Mosberger
@ 2004-07-15 23:16 ` William Lee Irwin III
  2004-07-15 23:34 ` Matthew Wilcox
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-15 23:16 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 15, 2004 at 12:11:07PM -0700, Luck, Tony wrote:
> Attached is my version that just frees pages in O(log2(BITS_PER_LONG))
> pieces ... because I'm too lazy to figure out all the boundary
> conditions to implement wli's excellent suggestion.
> On my 2G machine, this reduced the time to free all the bootmem
> from 41ms to 18.1ms ... so it should shave some minutes off the
> monster SGI box that started this thread, but perhaps not enough
> to avoid the "looks like the system is hung" problem as you will
> still be looking at a couple of minutes :-(  Since I only got a
> 55% reduction, rather than a factor of 64 I expect that modifying
> to look at larger order pages may have diminishing returns.

I'll work relative to this for the rest. I'd recommend using __ffs()
instead of the loop. Also, combining this with a specialized page
freeing function that doesn't e.g. fiddle with page references


On Thu, Jul 15, 2004 at 12:11:07PM -0700, Luck, Tony wrote:
> Notes and Caveats:
> 1) Is there a define someplace for log2(BITS_PER_LONG)?  I couldn't
> find one, which is why I calculate the "order" in this patch.

Unfortunately no. It would be nice to have one.


On Thu, Jul 15, 2004 at 12:11:07PM -0700, Luck, Tony wrote:
> 2) On ia64 it looks like the pages and the bitmap are nicely aligned
> so that a longword in node_bootmem_map[] corresponds to an order 6
> page with the right physical alignment.  This may not be true on other
> architectures, and mm/bootmem.c is generic code.  Before this patch
> can go into the base, someone would have to check all the other
> architectures (some of which might not care to change ... if they
> only support 4G or less, then the benefits of saving a few milliseconds
> during boot may not inspire them to mess with the boot code).  For
> this reason I'm not adding a "Signed-off-by" to this patch because
> I don't want the flak for breaking other architectures ... but by
> all means take this patch and try it out.

The common case is the bitmap and mem_map[] starting at 0. The
remaining cases are pretty marginalized. This can actually be checked
at runtime by checking the alignment of ->node_boot_start, e.g. maybe
if  (!~v && !((__pa(bdata->node_boot_start) >> PAGE_SHIFT) & ((1 << MAX_ORDER) - 1)))
instead of just !~v.


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (11 preceding siblings ...)
  2004-07-15 23:16 ` William Lee Irwin III
@ 2004-07-15 23:34 ` Matthew Wilcox
  2004-07-15 23:53 ` Luck, Tony
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2004-07-15 23:34 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 15, 2004 at 04:16:38PM -0700, William Lee Irwin III wrote:
> On Thu, Jul 15, 2004 at 12:11:07PM -0700, Luck, Tony wrote:
> > Notes and Caveats:
> > 1) Is there a define someplace for log2(BITS_PER_LONG)?  I couldn't
> > find one, which is why I calculate the "order" in this patch.
> 
> Unfortunately no. It would be nice to have one.

#define log2(x) (ffs(x) - 1)

(ffs returns a value from 0 to 32 whereas log2 wants to return -1 to 31)

Obviously it rounds down, but that's entirely consistent with the behaviour
of, say, 7/4 in C.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (12 preceding siblings ...)
  2004-07-15 23:34 ` Matthew Wilcox
@ 2004-07-15 23:53 ` Luck, Tony
  2004-07-16  0:09 ` David Mosberger
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-07-15 23:53 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

>> still be looking at a couple of minutes :-(  Since I only got a
>> 55% reduction, rather than a factor of 64 I expect that modifying
>> to look at larger order pages may have diminishing returns.
>
>I'll work relative to this for the rest. I'd recommend using __ffs()
>instead of the loop. Also, combining this with a specialized page
>freeing function that doesn't e.g. fiddle with page references

The returns to freeing larger pages do indeed diminish fast.  I
added simple "look at the next word" and "look at the next
three words" hacks to see what the times looked like with
order=7 and order=8 ... and found that order 8 is only 1.8%
faster than order 6.

>The common case is the bitmap and mem_map[] starting at 0.

Sadly not quite 0.  PG_reserved is set for each page structure
and must be cleared ... so we have to touch every page structure
at least once :-(  On a 4TB machine that's 0.25 billion cache
misses (with a 16K page).
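
The page-structure count behind that figure can be checked with a few lines of C. `page_struct_count` is a hypothetical helper, and binary units (1 TB = 2^40 bytes, 16K = 2^14 bytes) are assumed.

```c
#include <assert.h>

/* Hypothetical helper: number of page structures covering mem_bytes. */
static unsigned long long page_struct_count(unsigned long long mem_bytes,
					    unsigned long long page_bytes)
{
	assert(page_bytes > 0);
	return mem_bytes / page_bytes;
}
```

Under those assumptions 4TB / 16K is 2^28 = 268,435,456 page structures, which matches the rough "0.25 billion" quoted above.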


>remaining cases are pretty marginalized. This can actually be checked
>at runtime by checking the alignment of ->node_boot_start, e.g. maybe
>if  (!~v && !((__pa(bdata->node_boot_start) >> PAGE_SHIFT) & 
>((1 << MAX_ORDER) - 1)))
>instead of just !~v.

That check can be done once (for each node) outside the loop. The
exact expression used to set the "gofast" variable in my patch
may need some tweaking.
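
For illustration, a per-node alignment test of this kind could look like the sketch below. It is not the expression used in the attached patch: `word_aligned_start` is a hypothetical name, and it only checks that the node's first page frame number is a multiple of BITS_PER_LONG, which is the minimum assumed to be needed for one bitmap word to map onto one naturally aligned block.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: returns 1 when the first page frame of the node
 * (boot_start >> page_shift) is aligned to BITS_PER_LONG pages.
 */
static int word_aligned_start(unsigned long boot_start, unsigned int page_shift)
{
	unsigned long first_pfn = boot_start >> page_shift;

	return (first_pfn & (BITS_PER_LONG - 1)) == 0;
}
```

Because the result depends only on the node's start address, it can be computed once per node before the bitmap loop.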

New patch attached.

-Tony

[-- Attachment #2: bootmem.patch --]
[-- Type: application/octet-stream, Size: 1103 bytes --]

--- linux-2.6.8-rc1-orig/mm/bootmem.c	2004-07-11 10:34:19.000000000 -0700
+++ linux-2.6.8-rc1/mm/bootmem.c	2004-07-15 16:45:04.255195717 -0700
@@ -259,6 +259,7 @@
 	unsigned long i, count, total = 0;
 	unsigned long idx;
 	unsigned long *map; 
+	int gofast = 0;
 
 	BUG_ON(!bdata->node_bootmem_map);
 
@@ -267,9 +268,23 @@
 	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
+	if (bdata->node_boot_start == 0 ||
+	    ffs(bdata->node_boot_start) - PAGE_SHIFT > ffs(BITS_PER_LONG))
+		gofast = 1;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
-		if (v) {
+		if (gofast && v == ~0UL) {
+			int j;
+
+			count += BITS_PER_LONG;
+			ClearPageReserved(page);
+			set_page_count(page, 1);
+			for (j = 1; j < BITS_PER_LONG; j++)
+				ClearPageReserved(page + j);
+			__free_pages(page, ffs(BITS_PER_LONG)-1);
+			i += BITS_PER_LONG;
+			page += BITS_PER_LONG;
+		} else if (v) {
 			unsigned long m;
 			for (m = 1; m && i < idx; m<<=1, page++, i++) {
 				if (v & m) {


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (13 preceding siblings ...)
  2004-07-15 23:53 ` Luck, Tony
@ 2004-07-16  0:09 ` David Mosberger
  2004-07-16  0:11 ` William Lee Irwin III
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: David Mosberger @ 2004-07-16  0:09 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 16 Jul 2004 00:34:20 +0100, Matthew Wilcox <willy@debian.org> said:

  Matthew> #define log2(x) (ffs(x) - 1)

At the risk of pointing out the obvious, this definition only works
for integer-powers of two...

get_order() works for any "unsigned long" value.

	--david


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (14 preceding siblings ...)
  2004-07-16  0:09 ` David Mosberger
@ 2004-07-16  0:11 ` William Lee Irwin III
  2004-07-16  0:18 ` Matthew Wilcox
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-16  0:11 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 15, 2004 at 04:16:38PM -0700, William Lee Irwin III wrote:
>> Unfortunately no. It would be nice to have one.

On Fri, Jul 16, 2004 at 12:34:20AM +0100, Matthew Wilcox wrote:
> #define log2(x) (ffs(x) - 1)
> (ffs returns a value from 0 to 32 whereas log2 wants to return -1 to 31)
> Obviously it rounds down, but that's entirely consistent with the behaviour
> of, say, 7/4 in C.

The kernel uses __ffs(), which returns from 0 to 31 and is undefined on 0.
Its rounding behavior is wrong for log2(). Anyhow, we happen to know it's
a power of 2 and that __ffs() will do.


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (15 preceding siblings ...)
  2004-07-16  0:11 ` William Lee Irwin III
@ 2004-07-16  0:18 ` Matthew Wilcox
  2004-07-16  0:18 ` William Lee Irwin III
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2004-07-16  0:18 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 15, 2004 at 05:09:00PM -0700, David Mosberger wrote:
> >>>>> On Fri, 16 Jul 2004 00:34:20 +0100, Matthew Wilcox <willy@debian.org> said:
> 
>   Matthew> #define log2(x) (ffs(x) - 1)
> 
> At the risk of pointing out the obvious, this definition only works
> for integer-powers of two...
> 
> get_order() works for any "unsigned long" value.

Oh, um, I didn't realise that ffs() worked backwards.  fls() - 1, then.
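
The difference between the two approaches can be seen with portable stand-ins for the bit-search primitives. `lowest_set_bit` and `highest_set_bit` are hypothetical names written out as loops; they model the 0-based results of "ffs(x) - 1" and "fls(x) - 1" respectively and are not the kernel implementations.

```c
#include <assert.h>

/* 0-based index of the lowest set bit, i.e. what ffs(x) - 1 yields. */
static int lowest_set_bit(unsigned long x)
{
	int n = 0;

	assert(x != 0);
	while (!(x & 1UL)) {
		x >>= 1;
		n++;
	}
	return n;
}

/* 0-based index of the highest set bit, i.e. floor(log2(x)). */
static int highest_set_bit(unsigned long x)
{
	int n = 0;

	assert(x != 0);
	while (x >>= 1)
		n++;
	return n;
}
```

Both agree on powers of two such as 64 (result 6), which is all the bootmem case needs; for 12 they give 2 and 3, and only the latter is a rounded-down log2.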

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (16 preceding siblings ...)
  2004-07-16  0:18 ` Matthew Wilcox
@ 2004-07-16  0:18 ` William Lee Irwin III
  2004-08-03 17:53 ` Josh Aas
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-07-16  0:18 UTC (permalink / raw)
  To: linux-ia64

At some point in the past, I wrote:
>> I'll work relative to this for the rest. I'd recommend using __ffs()
>> instead of the loop. Also, combining this with a specialized page
>> freeing function that doesn't e.g. fiddle with page references

On Thu, Jul 15, 2004 at 04:53:33PM -0700, Luck, Tony wrote:
> The returns to freeing larger pages do indeed diminish fast.  I
> added simple "look at the next word" and "look at the next
> three words" hacks to see what the times looked like with
> order=7 and order=8 ... and found that order 8 is only 1.8%
> faster than order 6.

This is likely explained by the differences in remaining buddy bitmap
depth.


At some point in the past, I wrote:
>> The common case is the bitmap and mem_map[] starting at 0.

On Thu, Jul 15, 2004 at 04:53:33PM -0700, Luck, Tony wrote:
> Sadly not quite 0.  PG_reserved is set for each page structure
> and must be cleared ... so we have to touch every page structure
> at least once :-(  On a 4TB machine that's 0.25 billion cache
> misses (with a 16K page).

I had more in mind that PC's etc. had mem_map[]/etc. starting at 0.
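
The 0.25 billion figure quoted above can be checked with a small user-space sketch. page_structs() is a hypothetical helper, not a kernel function, and the 16K page size is the one named in the quoted text:

```c
/*
 * Number of page structures needed to describe mem_bytes of memory
 * when each page is (1 << page_shift) bytes.
 */
static unsigned long long page_structs(unsigned long long mem_bytes,
				       unsigned int page_shift)
{
	return mem_bytes >> page_shift;
}
```

With 4TB (4ULL << 40) and a page shift of 14, this gives 2^28 = 268,435,456 page structures, which is 0.25 billion when a billion is taken as 2^30.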

Moving on, free_area_init_core() makes a rather pessimal assumption
and sets all pages reserved and all refcounts to 0. Setting the
refcounts to 0 is okay, but needs an entrypoint exported that can
free such pages, likely worth implementing in tandem with lockfree
freeing. In this arrangement, the entire bitmap needs iterating over
to mark the pages that aren't going to be freed reserved.


At some point in the past, I wrote:
>> remaining cases are pretty marginalized. This can actually be checked
>> at runtime by checking the alignment of ->node_boot_start, e.g. maybe
>> if  (!~v && !((__pa(bdata->node_boot_start) >> PAGE_SHIFT) & 
>> ((1 << MAX_ORDER) - 1)))
>> instead of just !~v.

On Thu, Jul 15, 2004 at 04:53:33PM -0700, Luck, Tony wrote:
> That check can be done once (for each node) outside the loop. The
> exact expression used to set the "gofast" variable in my patch
> may need some tweaking.
> New patch attached.

I presumed enough compiler QOI for loop invariant hoisting, which I
suppose is a mistake with gcc.
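
A minimal user-space sketch of doing the alignment test once per node follows. It is neither the expression from the patch nor the MAX_ORDER form quoted above; it only tests whether the node's first page frame falls on a BITS_PER_LONG boundary. The constants and the name node_can_gofast() are assumptions made for illustration (16K pages, 64-bit longs):

```c
#define PAGE_SHIFT_SKETCH	14	/* assumption: 16K pages */
#define BITS_PER_LONG_SKETCH	64	/* assumption: 64-bit longs */

/*
 * Returns nonzero when the node's first page frame number is a
 * multiple of BITS_PER_LONG, so that each bitmap word describes one
 * naturally aligned block of 64 pages.  The result depends only on
 * per-node data, so it can be computed once, before the loop over
 * the bitmap.
 */
static int node_can_gofast(unsigned long long node_boot_start)
{
	unsigned long long first_pfn = node_boot_start >> PAGE_SHIFT_SKETCH;

	return (first_pfn & (BITS_PER_LONG_SKETCH - 1)) == 0;
}
```

Under these assumptions a node starting at 1MB (pfn 64) passes the test, while one starting at 512KB (pfn 32) does not.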


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (17 preceding siblings ...)
  2004-07-16  0:18 ` William Lee Irwin III
@ 2004-08-03 17:53 ` Josh Aas
  2004-08-03 23:53 ` William Lee Irwin III
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Josh Aas @ 2004-08-03 17:53 UTC (permalink / raw)
  To: linux-ia64

Are there any outstanding issues with Tony's second revision of the 
free_all_bootmem_core function? Do we still have the problem of making 
sure longword in node_bootmem_map[] corresponds to an order 6 page with 
the right physical alignment? The second revision looks good to me. If I 
could get some more feedback on it I'll clean up any remaining issues so 
it can land sometime soon. I'll post test results (unpatched vs. 
patched) on a big machine later this afternoon.

wli - do you still want to see the memory map for my big test machine 
(512GB RAM)?

-Josh

Luck, Tony wrote:
>>>still be looking at a couple of minutes :-(  Since I only got a
>>>55% reduction, rather than a factor of 64 I expect that modifying
>>>to look at larger order pages may have diminishing returns.
>>
>>I'll work relative to this for the rest. I'd recommend using __ffs()
>>instead of the loop. Also, combining this with a specialized page
>>freeing function that doesn't e.g. fiddle with page references
> 
> 
> The returns to freeing larger pages do indeed diminish fast.  I
> added simple "look at the next word" and "look at the next
> three words" hacks to see what the times looked like with
> order=7 and order=8 ... and found that order 8 is only 1.8%
> faster than order 6.
> 
> 
>>The common case is the bitmap and mem_map[] starting at 0.
> 
> 
> Sadly not quite 0.  PG_reserved is set for each page structure
> and must be cleared ... so we have to touch every page structure
> at least once :-(  On a 4TB machine that's 0.25 billion cache
> misses (with a 16K page).
> 
> 
> 
>>remaining cases are pretty marginalized. This can actually be checked
>>at runtime by checking the alignment of ->node_boot_start, e.g. maybe
>>if  (!~v && !((__pa(bdata->node_boot_start) >> PAGE_SHIFT) & 
>>((1 << MAX_ORDER) - 1)))
>>instead of just !~v.
> 
> 
> That check can be done once (for each node) outside the loop. The
> exact expression used to set the "gofast" variable in my patch
> may need some tweaking.
> 
> New patch attached.
> 
> -Tony

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (18 preceding siblings ...)
  2004-08-03 17:53 ` Josh Aas
@ 2004-08-03 23:53 ` William Lee Irwin III
  2004-08-06 14:11 ` Josh Aas
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-08-03 23:53 UTC (permalink / raw)
  To: linux-ia64

On Tue, Aug 03, 2004 at 12:53:53PM -0500, Josh Aas wrote:
> Are there any outstanding issues with Tony's second revision of the 
> free_all_bootmem_core function? Do we still have the problem of making 
> sure longword in node_bootmem_map[] corresponds to an order 6 page with 
> the right physical alignment? The second revision looks good to me. If I 
> could get some more feedback on it I'll clean up any remaining issues so 
> it can land sometime soon. I'll post test results (unpatched vs. 
> patched) on a big machine later this afternoon.

I think it's fine.

On Tue, Aug 03, 2004 at 12:53:53PM -0500, Josh Aas wrote:
> wli - do you still want to see the memory map for my big test machine 
> (512GB RAM)?

Sure.


-- wli


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (19 preceding siblings ...)
  2004-08-03 23:53 ` William Lee Irwin III
@ 2004-08-06 14:11 ` Josh Aas
  2004-08-06 14:17 ` William Lee Irwin III
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Josh Aas @ 2004-08-06 14:11 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

Attached is an improved version of Tony Luck's patch. It shaves another 
~25% off by not using atomic ops to clear the page reserved bits and 
prefetching. Tony - will you sign off on it with me and we'll get this in?

Unfortunately, this still leaves a ~1 minute delay with no indication of 
what is going on for 4TB machines, and ~2 minutes for 8TB. Thus, I'd 
still like to see my progress indicator patch go in. I am guessing 
memory sizes are only going to get bigger than even 8 TB, and memory is 
not going to get faster at the rate the totals increase (it certainly 
didn't double in speed between 4 and 8 TB installations). Thoughts?

Signed-off-by: Josh Aas <josha@sgi.com>

-Josh

William Lee Irwin III wrote:
> On Tue, Aug 03, 2004 at 12:53:53PM -0500, Josh Aas wrote:
> 
>>Are there any outstanding issues with Tony's second revision of the 
>>free_all_bootmem_core function? Do we still have the problem of making 
>>sure longword in node_bootmem_map[] corresponds to an order 6 page with 
>>the right physical alignment? The second revision looks good to me. If I 
>>could get some more feedback on it I'll clean up any remaining issues so 
>>it can land sometime soon. I'll post test results (unpatched vs. 
>>patched) on a big machine later this afternoon.
> 
> 
> I think it's fine.
> 
> On Tue, Aug 03, 2004 at 12:53:53PM -0500, Josh Aas wrote:
> 
>>wli - do you still want to see the memory map for my big test machine 
>>(512GB RAM)?
> 
> 
> Sure.
> 
> 
> -- wli

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068

[-- Attachment #2: bootmem3.patch --]
[-- Type: text/x-patch, Size: 1470 bytes --]

--- a/mm/bootmem.c	2004-08-05 15:33:39.000000000 -0500
+++ b/mm/bootmem.c	2004-08-05 16:25:05.000000000 -0500
@@ -259,6 +259,7 @@ static unsigned long __init free_all_boo
 	unsigned long i, count, total = 0;
 	unsigned long idx;
 	unsigned long *map; 
+	int gofast = 0;
 
 	BUG_ON(!bdata->node_bootmem_map);
 
@@ -267,14 +268,32 @@ static unsigned long __init free_all_boo
 	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
+	if (bdata->node_boot_start == 0 ||
+	    ffs(bdata->node_boot_start) - PAGE_SHIFT > ffs(BITS_PER_LONG))
+		gofast = 1;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
-		if (v) {
+		if (gofast && v == ~0UL) {
+			int j;
+
+			count += BITS_PER_LONG;
+			(page)->flags &= ~(1UL << PG_reserved);
+			set_page_count(page, 1);
+			for (j = 1; j < BITS_PER_LONG; j++) {
+				if (j + 16 < BITS_PER_LONG) {
+                      			prefetchw(page + j + 16);
+                                }
+				(page + j)->flags &= ~(1UL << PG_reserved);
+			}	
+			__free_pages(page, ffs(BITS_PER_LONG)-1);
+			i += BITS_PER_LONG;
+			page += BITS_PER_LONG;
+		} else if (v) {
 			unsigned long m;
 			for (m = 1; m && i < idx; m<<=1, page++, i++) {
 				if (v & m) {
 					count++;
-					ClearPageReserved(page);
+					(page)->flags &= ~(1UL << PG_reserved);	
 					set_page_count(page, 1);
 					__free_page(page);
 				}


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (20 preceding siblings ...)
  2004-08-06 14:11 ` Josh Aas
@ 2004-08-06 14:17 ` William Lee Irwin III
  2004-08-06 17:58 ` Luck, Tony
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-08-06 14:17 UTC (permalink / raw)
  To: linux-ia64

On Fri, Aug 06, 2004 at 09:11:50AM -0500, Josh Aas wrote:
> Attached is an improved version of Tony Luck's patch. It shaves another 
> ~25% off by not using atomic ops to clear the page reserved bits and 
> prefetching. Tony - will you sign off on it with me and we'll get this in?
> Unfortunately, this still leaves a ~1 minute delay with no indication of 
> what is going on for 4TB machines, and ~2 minutes for 8TB. Thus, I'd 
> still like to see my progress indicator patch go in. I am guessing 
> memory sizes are only going to get bigger than even 8 TB, and memory is 
> not going to get faster at the rate the totals increase (it certainly 
> didn't double in speed between 4 and 8 TB installations). Thoughts?
> Signed-off-by: Josh Aas <josha@sgi.com>

There are still stronger attacks on this problem. I'll go about porting
one or more of those up to current. (The one I'm favoring at the moment
has actually never had broad exposure in the past).

This will be a time-consuming affair, so it will be fine to merge
things like this in the interim. I would give an ETA of several weeks
for an initial release, largely for testing on a sufficiently broad
variety of architectures beforehand.


-- wli


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (21 preceding siblings ...)
  2004-08-06 14:17 ` William Lee Irwin III
@ 2004-08-06 17:58 ` Luck, Tony
  2004-08-06 18:27 ` Josh Aas
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-08-06 17:58 UTC (permalink / raw)
  To: linux-ia64

Josh Aas wrote:
>Attached is an improved version of Tony Luck's patch. It 
>shaves another ~25% off by not using atomic ops to clear the
>page reserved bits and prefetching. Tony - will you sign off
>on it with me and we'll get this in?

Does the change:
-			ClearPageReserved(page);
+			(page)->flags &= ~(1UL << PG_reserved);

in the "else if (v)" clause actually make any difference?  I would
expect not (unless you have bizarrely fragmented memory), so you
could just leave the atomic operations there.

But it might be prettier to define a BootClearPageReserved() function
in page-flags.h rather than expose the details of the implementation
here in bootmem.c (in which case you could use this new function
in the "else if (v)" clause too for symmetry).

Finally you have a magic "16" in the prefetchw() path.  Where did
it come from, and is it the right number for non-ia64 machines?

Here's a "Signed-off-by: Tony Luck <tony.luck@intel.com>" that
you can cut-n-paste onto the patch when you repost to LKML.

>Unfortunately, this still leaves a ~1 minute delay with no
>indication of what is going on for 4TB machines, and ~2 minutes
>for 8TB.

Agreed that's long enough to cause worry, concern, even panic
amongst nervous system administrators.  But I think that wli will
hack this time down a lot more when he gets his patch together.
So I'd like to wait and see what that looks like before I add the
progress dots patch.

-Tony


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (22 preceding siblings ...)
  2004-08-06 17:58 ` Luck, Tony
@ 2004-08-06 18:27 ` Josh Aas
  2004-08-06 20:09 ` Luck, Tony
  2004-08-06 20:51 ` William Lee Irwin III
  25 siblings, 0 replies; 27+ messages in thread
From: Josh Aas @ 2004-08-06 18:27 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote:
> Does the change:
> -			ClearPageReserved(page);
> +			(page)->flags &= ~(1UL << PG_reserved);
> 
> in the "else if (v)" clause actually make any difference?  I would
> expect not (unless you have bizarrely fragmented memory), so you
> could just leave the atomic operations there.

I imagine that change would make an improvement, no matter how small it 
is. Why not just do it? Especially after I define a macro.

> But it might be prettier to define a BootClearPageReserved() function
> in page-flags.h rather than expose the details of the implementation
> here in bootmem.c (in which case you could use this new function
> in the "else if (v)" clause too for symmetry).

I'll go ahead and create a macro for the non-atomic bit clear. Perhaps 
call it "ClearPageReservedNoAtomic" instead of "BootClearPageReserved"?
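
A minimal user-space sketch of what such a macro might look like is below. struct page_sketch and the bit number PG_reserved_sketch are hypothetical stand-ins for the kernel's struct page and PG_reserved; only the shape of the macro is the point:

```c
#define PG_reserved_sketch	11	/* hypothetical bit number, illustration only */

/* Hypothetical stand-in for struct page, reduced to its flags word. */
struct page_sketch {
	unsigned long flags;
};

/*
 * Plain read-modify-write of the flags word.  Unlike an atomic bit
 * clear, this is only safe while nothing else can modify the same
 * word concurrently, which is the situation this thread assumes for
 * bootmem freeing during early boot.
 */
#define ClearPageReservedNoAtomic(page) \
	((page)->flags &= ~(1UL << PG_reserved_sketch))
```

Keeping the bit manipulation behind a macro like this means bootmem.c would not need to know how the flag is stored, which is the concern raised about exposing implementation details.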

> Finally you have a magic "16" in the prefetchw() path.  Where did
> it come from, and is it the right number for non-ia64 machines?

My bad for not explaining that. In my limited testing of this particular 
number it seemed to be the sweet spot for prefetching. I think it is a 
good number for all architectures because 32 bit architectures won't 
need the speed boost as much as 64 bit architectures. They are not 
likely to have > 4GB memory, in which case this function's time is 
probably less than 1 second anyway. So even though you have to wait 16 
iterations before you start getting the prefetched advantage, 32 bit 
architectures still get the improvement on half the iterations. It 
should be good for other 64 bit architectures because all architectures 
should be able to complete the prefetch by the 16th iteration, and all 
64 bit architectures will get the prefetched advantage in 48/64 
iterations (that might be OBO but whatever). I haven't used any 64 bit 
machines other than ia64 (or used prefetching before for that matter), 
so let me know if my logic is wrong. I do think that number could be 
tuned to a slightly better value, but 16 is nice and safe for now. And I 
only have ia64 boxes around.
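
The iteration counts above can be checked with a small user-space sketch that mirrors the shape of the inner loop in the patch (j runs from 1 to BITS_PER_LONG - 1, and a prefetch is issued whenever j + 16 < BITS_PER_LONG). The function name prefetched_pages() is hypothetical, and the sketch only counts prefetches issued; it says nothing about whether a given prefetch completes in time on any architecture:

```c
/*
 * Count how many pages in one bitmap word's worth of pages are the
 * target of a prefetch issued `distance` iterations earlier.
 */
static int prefetched_pages(int bits_per_long, int distance)
{
	int j, n = 0;

	for (j = 1; j < bits_per_long; j++) {
		/* Same guard as the patch: stay inside this word's pages. */
		if (j + distance < bits_per_long)
			n++;
	}
	return n;
}
```

With a distance of 16 this gives 47 of 64 pages for 64-bit longs and 15 of 32 for 32-bit longs, so the 48/64 and "half" estimates are each off by one.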

> Here's a "Signed-off-by: Tony Luck <tony.luck@intel.com>" that
> you can cut-n-paste onto the patch when you repost to LKML.

Thanks.

>>Unfortunately, this still leaves a ~1 minute delay with no
>>indication of what is going on for 4TB machines, and ~2 minutes
>>for 8TB.
> 
> 
> Agreed that's long enough to cause worry, concern, even panic
> amongst nervous system administrators.  But I think that wli will
> hack this time down a lot more when he gets his patch together.
> So I'd like to wait and see what that looks like before I add the
> progress dots patch.

Sounds good.

-- 
Josh Aas
Silicon Graphics, Inc. (SGI)
Linux System Software
651-683-3068


* RE: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (23 preceding siblings ...)
  2004-08-06 18:27 ` Josh Aas
@ 2004-08-06 20:09 ` Luck, Tony
  2004-08-06 20:51 ` William Lee Irwin III
  25 siblings, 0 replies; 27+ messages in thread
From: Luck, Tony @ 2004-08-06 20:09 UTC (permalink / raw)
  To: linux-ia64

>I'll go ahead and create a macro for the non-atomic bit clear. Perhaps 
>call it "ClearPageReservedNoAtomic" instead of "BootClearPageReserved"?

That looks like a more meaningful name than my suggestion.

>> Finally you have a magic "16" in the prefetchw() path.  Where did
>> it come from, and is it the right number for non-ia64 machines?
>
>My bad for not explaining that. In my limited testing of this particular 
>number it seemed to be the sweet spot for prefetching. I think it is a 
>good number for all architectures because 32 bit architectures won't 
>need the speed boost as much as 64 bit architectures. They are not 
>likely to have > 4GB memory, in which case this function's time is 
>probably less than 1 second anyway.

There are quite a few x86 boxes running with 16GB ... but the max
is 64GB, which is still below the pain threshold for seeing an
unreasonable delay during this routine ... so I agree that perfect
tuning is irrelevant for 32-bit machines here.  You should include
your explanatory paragraph when you post to LKML.

-Tony


* Re: free bootmem feedback patch
  2004-07-13 22:59 free bootmem feedback patch Joshua Aas
                   ` (24 preceding siblings ...)
  2004-08-06 20:09 ` Luck, Tony
@ 2004-08-06 20:51 ` William Lee Irwin III
  25 siblings, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2004-08-06 20:51 UTC (permalink / raw)
  To: linux-ia64

At some point in the past, Josh Aas wrote:
>> My bad for not explaining that. In my limited testing of this particular 
>> number it seemed to be the sweet spot for prefetching. I think it is a 
>> good number for all architectures because 32 bit architectures won't 
>> need the speed boost as much as 64 bit architectures. They are not 
>> likely to have > 4GB memory, in which case this function's time is 
>> probably less than 1 second anyway.

On Fri, Aug 06, 2004 at 01:09:34PM -0700, Luck, Tony wrote:
> There are quite a few x86 boxes running with 16GB ... but the max
> is 64GB, which is still below the pain threshold for seeing an
> unreasonable delay during this routine ... so I agree that perfect
> tuning is irrelevant for 32-bit machines here.  You should include
> your explanatory paragraph when you post to LKML.

The 32-bit situation is even more unusual than that. Bootmem is only
used for ZONE_NORMAL, the region of physical memory usually mapped
into the majority of kernel virtual address space. So even while physical
memory may span 64GB, no more than 896MB or so are ever tracked by
the bootmem allocator. ->zone_mem_map for highmem zones is set up
later by hand.
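
To put a number on how little work that is, here is a user-space sketch of the bitmap size for 896MB. It assumes 4K pages and 32-bit longs, which are assumptions about a typical 32-bit x86 configuration and are not stated in this thread; bitmap_words() is a hypothetical helper:

```c
/*
 * Number of bitmap words needed to track mem_bytes of memory at one
 * bit per page, rounding up to a whole word.
 */
static unsigned long long bitmap_words(unsigned long long mem_bytes,
				       unsigned int page_shift,
				       unsigned int bits_per_long)
{
	unsigned long long pages = mem_bytes >> page_shift;

	return (pages + bits_per_long - 1) / bits_per_long;
}
```

Under those assumptions 896MB is 229,376 pages, or 7,168 bitmap words, roughly three orders of magnitude fewer pages than the 2^28 page structures of a 4TB machine with 16K pages.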

-- wli


end of thread, other threads:[~2004-08-06 20:51 UTC | newest]

Thread overview: 27+ messages
2004-07-13 22:59 free bootmem feedback patch Joshua Aas
2004-07-13 23:14 ` Luck, Tony
2004-07-13 23:52 ` Joshua Aas
2004-07-14  8:44 ` Andi Kleen
2004-07-14  9:17 ` William Lee Irwin III
2004-07-14  9:19 ` William Lee Irwin III
2004-07-14 16:17 ` Joshua Aas
2004-07-14 18:34 ` Luck, Tony
2004-07-14 22:12 ` William Lee Irwin III
2004-07-15 19:11 ` Luck, Tony
2004-07-15 19:31 ` Matthew Wilcox
2004-07-15 20:21 ` David Mosberger
2004-07-15 23:16 ` William Lee Irwin III
2004-07-15 23:34 ` Matthew Wilcox
2004-07-15 23:53 ` Luck, Tony
2004-07-16  0:09 ` David Mosberger
2004-07-16  0:11 ` William Lee Irwin III
2004-07-16  0:18 ` Matthew Wilcox
2004-07-16  0:18 ` William Lee Irwin III
2004-08-03 17:53 ` Josh Aas
2004-08-03 23:53 ` William Lee Irwin III
2004-08-06 14:11 ` Josh Aas
2004-08-06 14:17 ` William Lee Irwin III
2004-08-06 17:58 ` Luck, Tony
2004-08-06 18:27 ` Josh Aas
2004-08-06 20:09 ` Luck, Tony
2004-08-06 20:51 ` William Lee Irwin III
