public inbox for linux-ia64@vger.kernel.org
* [patch 0/4] ia64 SPARSEMEM
@ 2005-05-23 17:50 Bob Picco
  2005-05-24  3:29 ` David Mosberger
                   ` (24 more replies)
  0 siblings, 25 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-23 17:50 UTC (permalink / raw)
  To: linux-ia64


Tony,

This is a series of patches which enable SPARSEMEM for ia64.  It's against
rc4-mm2.  The patches have been tested with the FLATMEM, DISCONTIG and
SPARSEMEM memory configurations, on simulated NUMA hardware, an rx2600,
and HPSIM.  An early version of ia64 SPARSEMEM was tested by Jesse.  It
would be ideal to have another test pass on SGI NUMA hardware.

Ultimately I would like to eliminate DISCONTIG and VIRTUAL_MEM_MAP.  Before
this can be accomplished, more performance comparisons between SPARSEMEM
and DISCONTIG+VIRTUAL_MEM_MAP are required.  I did some preliminary performance
work with a 2-CPU rx2600.  The results were comparable and no noticeable
regression was observed.  Further work on multi-node NUMA machines is
required.

thanks,

bob


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
@ 2005-05-24  3:29 ` David Mosberger
  2005-05-24 14:33 ` Bob Picco
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: David Mosberger @ 2005-05-24  3:29 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Mon, 23 May 2005 13:50:31 -0400, Bob Picco <bob.picco@hp.com> said:

  Bob> Ultimately I would like to eliminate [...] VIRTUAL_MEM_MAP.

Why?

Considering this:

+#ifdef CONFIG_SPARSEMEM
+ /*
+ * SECTION_SIZE_BITS            2^N: how big each section will be
+ * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
+ * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
+ */

The virtual mem-map seems like a much nicer solution to me.

	--david


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
  2005-05-24  3:29 ` David Mosberger
@ 2005-05-24 14:33 ` Bob Picco
  2005-05-24 16:27 ` Bob Picco
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-24 14:33 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote:	[Mon May 23 2005, 11:29:47PM EDT]
> >>>>> On Mon, 23 May 2005 13:50:31 -0400, Bob Picco <bob.picco@hp.com> said:
> 
>   Bob> Ultimately I would like to eliminate [...] VIRTUAL_MEM_MAP.
> 
> Why?
> 
> Considering this:
> 
> +#ifdef CONFIG_SPARSEMEM
> + /*
> + * SECTION_SIZE_BITS            2^N: how big each section will be
> + * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
> + * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
> + */
> 
> The virtual mem-map seems like a much nicer solution to me.
> 
> 	--david
David,

VIRTUAL_MEM_MAP was introduced on ia64 because of memory holes on ia64
platforms.  The mem_map of the pglist_data is a pointer to a virtually
contiguous array of page structures for a node's memory.  To eliminate
memory waste, contiguous memory holes don't have page structures.  The
alternative would be for holes in memory to be represented by reserved
page structures.  The VIRTUAL_MEM_MAP solution requires ia64_pfn_valid
and a hook in the buddy allocator to check for holes in memory.
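
For reference, ia64_pfn_valid() works by probing whether the piece of the
virtual mem_map that would hold a pfn's page struct is actually mapped.
A simplified sketch of the idea (the real arch/ia64/mm code also probes
the tail of the struct):

	/* sketch: a pfn is valid only if the struct page describing it
	 * is backed by a mapped page in the virtual mem_map */
	int ia64_pfn_valid(unsigned long pfn)
	{
		char byte;
		struct page *pg = pfn_to_page(pfn);

		/* __get_user() fails cleanly on a mem_map hole */
		return __get_user(byte, (char __user *) pg) == 0;
	}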

SPARSEMEM has eliminated mem_map, ia64_pfn_valid and the buddy allocator
hook.  There is a small cost for SPARSEMEM: any section which has both
memory and holes requires reserved pages for the holes.  I can see
SPARSEMEM replacing DISCONTIG_MEM+VIRTUAL_MEM_MAP, so we'd basically
have a single memory support code path on ia64 for both NUMA and non-NUMA.

So AFAICS there is no advantage to retaining VIRTUAL_MEM_MAP should SPARSEMEM
match or exceed the performance of DISCONTIG_MEM+VIRTUAL_MEM_MAP.

bob


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
  2005-05-24  3:29 ` David Mosberger
  2005-05-24 14:33 ` Bob Picco
@ 2005-05-24 16:27 ` Bob Picco
  2005-05-26  0:32 ` Luck, Tony
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-24 16:27 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote:	[Mon May 23 2005, 11:29:47PM EDT]
> >>>>> On Mon, 23 May 2005 13:50:31 -0400, Bob Picco <bob.picco@hp.com> said:
> 
>   Bob> Ultimately I would like to eliminate [...] VIRTUAL_MEM_MAP.
> 
> Why?
> 
> Considering this:
> 
> +#ifdef CONFIG_SPARSEMEM
> + /*
> + * SECTION_SIZE_BITS            2^N: how big each section will be
> + * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
> + * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
> + */
> 
> The virtual mem-map seems like a much nicer solution to me.
> 
> 	--david
Hi David,

I don't perceive VIRTUAL_MEM_MAP as a nicer solution.  Actually SPARSEMEM
is a very elegant solution for memory holes and hotplug memory.

VIRTUAL_MEM_MAP accommodates the physical holes in memory on ia64.  This is
achieved by allocating a contiguous virtual region covering the page
structures from the minimum to the maximum pfn on a node.  The page struct
array is stored in node_mem_map of the pglist_data for each node.  Page
structs that represent holes in memory aren't allocated when they span a
whole page of physical memory.  CONFIG_HOLES_IN_ZONE checks in bad_range
and calls to ia64_pfn_valid are used to detect these holes in the buddy
allocator.

SPARSEMEM carves memory into sections of 1<<SECTION_SIZE_BITS bytes.  Any
section entirely consumed by a hole is marked as invalid.  A section with
both holes and memory results in the holes becoming reserved pages.
SPARSEMEM eliminates CONFIG_HOLES_IN_ZONE and ia64_pfn_valid.
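
Roughly, the lookup becomes the following (simplified from the generic
sparsemem code; valid_section() just tests a flag in the section entry):

	#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
	#define pfn_to_section_nr(pfn)	((pfn) >> PFN_SECTION_SHIFT)

	static inline int pfn_valid(unsigned long pfn)
	{
		if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
			return 0;
		return valid_section(&mem_section[pfn_to_section_nr(pfn)]);
	}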

So AFAICS DISCONTIG_MEM+VIRTUAL_MEM_MAP won't provide any advantage should
SPARSEMEM perform as well or better.  We could eventually reduce the ia64
memory Kconfig complexity and the code in the ia64 mm directory
significantly.  This could all be a pipe dream should performance not win
out.

thanks,

bob


* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (2 preceding siblings ...)
  2005-05-24 16:27 ` Bob Picco
@ 2005-05-26  0:32 ` Luck, Tony
  2005-05-26 20:09 ` David Mosberger
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2005-05-26  0:32 UTC (permalink / raw)
  To: linux-ia64


>+#ifdef CONFIG_SPARSEMEM
>+ /*
>+ * SECTION_SIZE_BITS            2^N: how big each section will be
>+ * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
>+ * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
>+ */

MAX_PHYSADDR_BITS is apparently never used ... what's the distinction
between it and MAX_PHYSMEM_BITS?  From the comments, I'd guess that you
really meant to use MAX_PHYSADDR_BITS in this:

#define SECTIONS_SHIFT          (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Pursuing Jack Steiner's line of questioning on how this works for
the SGI Altix ... it would appear that he will need to use 50 for
MAX_PHYSMEM_BITS, and probably 32 for SECTION_SIZE_BITS (but maybe
a smaller number ... his banks of memory all start on 4G boundaries,
but could be as small as 1G ... can you have a chunk with an empty
tail?).  So SGI will end up with 2^(50-32) = 256K entries in mem_section[]
(or perhaps 4x that if sections must be fully populated).  All allocated
on the boot node ... and perhaps consuming a significant portion of
the kernel memory mapped by dtr[0].
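
For concreteness, the footprint math, assuming one 8-byte entry per
section (the real struct may be bigger):

	unsigned long entries = 1UL << (50 - 32);	/* 256K sections */
	unsigned long bytes   = entries * 8;		/* 2MB of mem_section[] */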


It will be interesting to see performance numbers on how this compares
against VIRTUAL_MEM_MAP ... trading cache misses vs. TLB misses.

-Tony


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (3 preceding siblings ...)
  2005-05-26  0:32 ` Luck, Tony
@ 2005-05-26 20:09 ` David Mosberger
  2005-05-26 20:54 ` Bob Picco
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: David Mosberger @ 2005-05-26 20:09 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 24 May 2005 12:27:39 -0400, Bob Picco <bob.picco@hp.com> said:

  >> + * SECTION_SIZE_BITS            2^N: how big each section will be

  Bob> I don't perceive VIRTUAL_MEM_MAP as a nicer solution.  Actually
  Bob> SPARSEMEM is a very elegant solution for memory holes and
  Bob> hotplug memory.

Except that it requires configuration?  What happens if
SECTION_SIZE_BITS is chosen wrong or poorly for a given (perhaps new)
platform?

	--david


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (4 preceding siblings ...)
  2005-05-26 20:09 ` David Mosberger
@ 2005-05-26 20:54 ` Bob Picco
  2005-05-26 21:02 ` Dave Hansen
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-26 20:54 UTC (permalink / raw)
  To: linux-ia64

luck wrote:	[Wed May 25 2005, 08:32:54PM EDT]
> 
> >+#ifdef CONFIG_SPARSEMEM
> >+ /*
> >+ * SECTION_SIZE_BITS            2^N: how big each section will be
> >+ * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
> >+ * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
> >+ */
> 
> MAX_PHYSADDR_BITS is apparently never used ... what's the distinction
Ah, MAX_PHYSADDR_BITS appears to be unused by all arches ported to SPARSEMEM.  I
wonder if it's a remnant of NONLINEAR.  Dave, do you recall?
> between it and MAX_PHYSMEM_BITS?  From the comments, I'd guess that you
> really meant to use MAX_PHYSADDR_BITS in this:
> 
> #define SECTIONS_SHIFT          (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> 
> Pursuing Jack Steiner's line of questioning on how this works for
> the SGI Altix ... it would appear that he will need to use 50 for
> MAX_PHYSMEM_BITS, and probably 32 for SECTION_SIZE_BITS (but maybe
I went back and reviewed Jack's email.  I must be blind, but I don't see why
he would need more than 44 physical memory bits.  I agree that
should you need 50 physical address bits, then you should use
32 bits for SECTION_SIZE_BITS.
> a smaller number ... his banks of memory all start on 4G boundaries,
> but could be as small as 1G ... can you have a chunk with an empty
> tail?).  So SGI will end up with 2^(50-32) = 256K entries in mem_section[]
> (or perhaps 4x that if sections must be fully populated).  All allocated
> on the boot node ... and perhaps consuming a significant portion of
> the kernel memory mapped by dtr[0].
Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
be an excessive 2^((50-28)+3) = 32MB and is not advised.
> 
> 
> It will be interesting to see performance numbers on how this compares
> against VIRTUAL_MEM_MAP ... trading cache misses vs. TLB misses.
> 
> -Tony
bob


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (5 preceding siblings ...)
  2005-05-26 20:54 ` Bob Picco
@ 2005-05-26 21:02 ` Dave Hansen
  2005-05-26 21:34 ` Luck, Tony
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2005-05-26 21:02 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2005-05-26 at 16:54 -0400, Bob Picco wrote:
> luck wrote:	[Wed May 25 2005, 08:32:54PM EDT]
> > 
> > >+#ifdef CONFIG_SPARSEMEM
> > >+ /*
> > >+ * SECTION_SIZE_BITS            2^N: how big each section will be
> > >+ * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
> > >+ * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
> > >+ */
> > 
> > MAX_PHYSADDR_BITS is apparently never used ... what's the distinction
>
> Ah, MAX_PHYSADDR_BITS appears to be unused by all arches ported to SPARSEMEM.  I
> wonder if it's a remnant of NONLINEAR.  Dave, do you recall?

Yep, I think it's a remnant from nonlinear.  Something about how sparse
the memory will be.  We can compress the virtual address space if we're
careful.

-- Dave



* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (6 preceding siblings ...)
  2005-05-26 21:02 ` Dave Hansen
@ 2005-05-26 21:34 ` Luck, Tony
  2005-05-26 21:44 ` Jack Steiner
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2005-05-26 21:34 UTC (permalink / raw)
  To: linux-ia64


>I went back and reviewed Jack's email.  I must be blind, but I don't see why
>he would need more than 44 physical memory bits.  I agree that
>should you need 50 physical address bits, then you should use
>32 bits for SECTION_SIZE_BITS.

Jack's e-mail only really covered the banks of memory within a node.  A
physical address on Altix looks like this:

	+-------------------------------------------------------+
	| 0   | NASID | AS | BN | 00 |     bank-offset          |
	+-------------------------------------------------------+

Where "NASID" is a physical node number ... 11 bits I think
      "AS" defines the type of memory ... 2-bits with value 0x3 for normal memory
      "BN" is the bank number within a node.

A system with N nodes doesn't necessarily have NASIDs assigned 0, 1, ... N-1

It is theoretically possible to build a 2-node system with NASIDs of 0 and 2047
[though SGI have told me that this isn't usually done].  But just for grins the
memory map for this with 4G on each of the two nodes looks like:

Node 0:
	0x0000003000000000-0x000000303FFFFFFF (bank 0)
	0x0000003400000000-0x000000343FFFFFFF (bank 1)
	0x0000003800000000-0x000000383FFFFFFF (bank 2)
	0x0000003C00000000-0x0000003C3FFFFFFF (bank 3)
Node 1:
	0x0001FFF000000000-0x0001FFF03FFFFFFF (bank 0)
	0x0001FFF400000000-0x0001FFF43FFFFFFF (bank 1)
	0x0001FFF800000000-0x0001FFF83FFFFFFF (bank 2)
	0x0001FFFC00000000-0x0001FFFC3FFFFFFF (bank 3)

(and I may have slipped a binary bit here ... perhaps SGI only
needs 49 bits, not 50).

>> a smaller number ... his banks of memory all start on 4G boundaries,
>> but could be as small as 1G ... can you have a chunk with an empty
>> tail?).

You didn't explicitly answer this question ... all the banks in the Altix
start on 4G boundaries ... but the DIMM size is 1G ... so will this compel
them to use a CHUNK size of 30 (to handle the 1G increment), or can they
use 32?

>Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
>it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
>be an excessive 2^((50-28)+3) = 32MB and is not advised.

I think they need 49 and either 32 or 30 ... depending on whether a chunk
has to be fully populated.

-Tony


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (7 preceding siblings ...)
  2005-05-26 21:34 ` Luck, Tony
@ 2005-05-26 21:44 ` Jack Steiner
  2005-05-26 21:51 ` Bob Picco
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Jack Steiner @ 2005-05-26 21:44 UTC (permalink / raw)
  To: linux-ia64

On Thu, May 26, 2005 at 04:54:08PM -0400, Bob Picco wrote:
> luck wrote:	[Wed May 25 2005, 08:32:54PM EDT]
> > 
> > >+#ifdef CONFIG_SPARSEMEM
> > >+ /*
> > >+ * SECTION_SIZE_BITS            2^N: how big each section will be
> > >+ * MAX_PHYSADDR_BITS            2^N: how much physical address space we have
> > >+ * MAX_PHYSMEM_BITS             2^N: how much memory we can have in that space
> > >+ */
> > 
> > MAX_PHYSADDR_BITS is apparently never used ... what's the distinction
> Ah, MAX_PHYSADDR_BITS appears to be unused by all arches ported to SPARSEMEM.  I
> wonder if it's a remnant of NONLINEAR.  Dave, do you recall?
> > between it and MAX_PHYSMEM_BITS?  From the comments, I'd guess that you
> > really meant to use MAX_PHYSADDR_BITS in this:
> > 
> > #define SECTIONS_SHIFT          (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> > 
> > Pursuing Jack Steiner's line of questioning on how this works for
> > the SGI Altix ... it would appear that he will need to use 50 for
> > MAX_PHYSMEM_BITS, and probably 32 for SECTION_SIZE_BITS (but maybe
> I went back and reviewed Jack's email.  I must be blind, but I don't see why
> he would need more than 44 physical memory bits.  I agree that
> should you need 50 physical address bits, then you should use
> 32 bits for SECTION_SIZE_BITS.

Ahhhh. You folks are a step ahead of me. I was just in the process of trying
to figure out the various options.

We definitely need 50-bit physical addresses (49 on today's hardware, but more
coming).

A physical address on Altix looks like:

	+-------------+--+--------------------+
	|  NODE #     |AS|       NodeOffset   |
	+-------------+--+--------------------+
	 4           3 33 3                  0
	 9           8 76 5                  0
	
		Bits [48:38] contain a node number in the range 0..2047
			(another bit will be added soon)
		Bits [37:36] always contain a "3" for WB RAM.
		Bits [35:0]  contain the node offset
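
A hypothetical decode of that layout (helper and constant names are mine,
not SGI's):

	#define ALTIX_NODE_SHIFT	38
	#define ALTIX_NODE_MASK		0x7ffUL		/* nodes 0..2047 */
	#define ALTIX_AS_SHIFT		36
	#define ALTIX_AS_WB_RAM		0x3UL		/* "3" == WB RAM */
	#define ALTIX_OFFSET_MASK	((1UL << 36) - 1)

	static inline unsigned long altix_nasid(unsigned long paddr)
	{
		return (paddr >> ALTIX_NODE_SHIFT) & ALTIX_NODE_MASK;
	}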

Node numbers are not dense & do not start at 0. Large systems can
be partitioned into smaller chunks. Node numbers within a partition
are typically not interleaved with the node numbers of other partitions, but
it is possible to have a partition with almost any subset of node numbers.
For example, a partition could consist of nodes 1536, 1538, & 1552.


> > a smaller number ... his banks of memory all start on 4G boundaries,

All banks (currently) start on 16GB boundaries. I don't think it
matters, but directory memory occupies the last 1/32 of each DIMM. This
means that memory blocks are slightly smaller than you might expect. The
BIOS marks the directory memory as "unavailable".


> > but could be as small as 1G ... can you have a chunk with an empty
> > tail?).  So SGI will end up with 2^(50-32) = 256K entries in mem_section[]
> > (or perhaps 4x that if sections must be fully populated).  All allocated
> > on the boot node ... and perhaps consuming a significant portion of
> > the kernel memory mapped by dtr[0].
> Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
> it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
> be an excessive 2^((50-28)+3) = 32MB and is not advised.
> > 
> > 
> > It will be interesting to see performance numbers on how this compares
> > against VIRTUAL_MEM_MAP ... trading cache misses vs. TLB misses.

I just finished building & booting a SPARSEMEM kernel.  No problems, but I have
not run any performance tests yet.

I had MAX_PHYSMEM_BITS set to the wrong value. I was on a small
system so it did not cause problems. I'll fix the size before running 
performance tests.

I noticed that available memory seems slightly smaller, but I have not
tracked down the cause.
 BASELINE
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3820304   3510272    310032
   1   3882992   3800224     82768
   2   3883008   3794352     88656
   3   3882992   3801552     81440
   4   3883008   3802272     80736

 SPARSE
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3820256   3320784    499472
   1   3882992   3741328    141664
   2   3883008   3749536    133472
   3   3882992   3751392    131600
   4   3883008   3758368    124640
	


> > 
> > -Tony
> bob

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (8 preceding siblings ...)
  2005-05-26 21:44 ` Jack Steiner
@ 2005-05-26 21:51 ` Bob Picco
  2005-05-26 22:03 ` Luck, Tony
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-26 21:51 UTC (permalink / raw)
  To: linux-ia64

luck wrote:	[Thu May 26 2005, 05:34:24PM EDT]
> 
> >I went back and reviewed Jack's email.  I must be blind, but I don't see why
> >he would need more than 44 physical memory bits.  I agree that
> >should you need 50 physical address bits, then you should use
> >32 bits for SECTION_SIZE_BITS.
> 
> Jack's e-mail only really covered the banks of memory within a node.  A
> physical address on Altix looks like this:
> 
> 	+-------------------------------------------------------+
> 	| 0   | NASID | AS | BN | 00 |     bank-offset          |
> 	+-------------------------------------------------------+
> 
> Where "NASID" is a physical node number ... 11 bits I think
>       "AS" defines the type of memory ... 2-bits with value 0x3 for normal memory
>       "BN" is the bank number within a node.
> 
> A system with N nodes doesn't necessarily have NASIDs assigned 0, 1, ... N-1
> 
> It is theoretically possible to build a 2-node system with NASIDs of 0 and 2047
> [though SGI have told me that this isn't usually done].  But just for grins the
> memory map for this with 4G on each of the two nodes looks like:
> 
> Node 0:
> 	0x0000003000000000-0x000000303FFFFFFF (bank 0)
> 	0x0000003400000000-0x000000343FFFFFFF (bank 1)
> 	0x0000003800000000-0x000000383FFFFFFF (bank 2)
> 	0x0000003C00000000-0x0000003C3FFFFFFF (bank 3)
> Node 1:
> 	0x0001FFF000000000-0x0001FFF03FFFFFFF (bank 0)
> 	0x0001FFF400000000-0x0001FFF43FFFFFFF (bank 1)
> 	0x0001FFF800000000-0x0001FFF83FFFFFFF (bank 2)
> 	0x0001FFFC00000000-0x0001FFFC3FFFFFFF (bank 3)
> 
> (and I may have slipped a binary bit here ... perhaps SGI only
> needs 49 bits, not 50).
> 
Okay, thanks for the explanation.
> >> a smaller number ... his banks of memory all start on 4G boundaries,
> >> but could be as small as 1G ... can you have a chunk with an empty
> >> tail?).
> 
> You didn't explicitly answer this question ... all the banks in the Altix
> start on 4G boundaries ... but the DIMM size is 1G ... so will this compel
> them to use a CHUNK size of 30 (to handle the 1G increment), or can they
> use 32?
Sorry.  I would use 30, unless banks need to be totally populated; in that
case, 32, to reduce the section map size for SPARSEMEM.
> 
> >Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
> >it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
> >be an excessive 2^((50-28)+3) = 32MB and is not advised.
> 
> I think they need 49 and either 32 or 30 ... depending on whether a chunk
> has to be fully populated.
It doesn't have to be fully populated but we'd like to minimize reserved
pages.
> 
> -Tony
bob


* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (9 preceding siblings ...)
  2005-05-26 21:51 ` Bob Picco
@ 2005-05-26 22:03 ` Luck, Tony
  2005-05-26 22:04 ` Bob Picco
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2005-05-26 22:03 UTC (permalink / raw)
  To: linux-ia64

>Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
>it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
>be an excessive 2^((50-28)+3) = 32MB and is not advised.

While you can tune a custom kernel for a particular system configuration,
we might have to use the 50/28 configuration for the generic "defconfig"
case ... that kernel should be bootable anywhere.   SGI needs (or will
need) the "50" for total physical size, and other platforms may need the
"28" (is that even small enough?  We currently have "granule" sizes of
16M and 64M to cope with odd holes in the physical address space ... so
perhaps the section size might need to be even smaller: 24 or 26?  How did
you come up with 28 as the low bound for SECTION_BITS?)

-Tony


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (10 preceding siblings ...)
  2005-05-26 22:03 ` Luck, Tony
@ 2005-05-26 22:04 ` Bob Picco
  2005-05-27  5:14 ` Yasunori Goto
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-26 22:04 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote:	[Thu May 26 2005, 04:09:06PM EDT]
> >>>>> On Tue, 24 May 2005 12:27:39 -0400, Bob Picco <bob.picco@hp.com> said:
> 
>   >> + * SECTION_SIZE_BITS            2^N: how big each section will be
> 
>   Bob> I don't perceive VIRTUAL_MEM_MAP as a nicer solution.  Actually
>   Bob> SPARSEMEM is a very elegant solution for memory holes and
>   Bob> hotplug memory.
> 
> Except that it requires configuration?  What happens if
> SECTION_SIZE_BITS is chosen wrong or poorly for a given (perhaps new)
> platform?
> 
> 	--david
> -
Yes, I suppose this is a valid point.  Choosing a bad value for a very
sparse address space or a machine with a small quantity of memory could
consume lots of useless memory for the section map and/or reserved page
structures.

I'd prefer not to have a platform specific solution.  This needs to be given
some more thought.

bob


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (11 preceding siblings ...)
  2005-05-26 22:04 ` Bob Picco
@ 2005-05-27  5:14 ` Yasunori Goto
  2005-05-27 10:35 ` Bob Picco
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Yasunori Goto @ 2005-05-27  5:14 UTC (permalink / raw)
  To: linux-ia64

Hello, Jack-san.

> > > a smaller number ... his banks of memory all start on 4G boundaries,
> 
> All banks (currently) start on 16GB boundaries. I don't think it
> matters, but directory memory occupies the last 1/32 of each DIMM. This
> means that memory blocks are slightly smaller than you might expect. The
> bios marks the directory memory as "unavailable".

Then I don't think the section size must match the directory memory
size.  When pages are registered via __free_pages(), the kernel
distinguishes RAM from firmware areas (e.g. with page_is_ram()) before
freeing them; firmware pages become reserved pages, and the kernel
doesn't use them.  If you add code so that the kernel asks the BIOS,
the directory memory pages will end up reserved.  I think the section
size can be 16GB in your case.
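
A rough sketch of what I mean, with page_is_ram() standing in for
whatever firmware memory-map query is available:

	/* sketch: hand real RAM to the buddy allocator, leave the
	 * firmware/directory pages reserved so the kernel never
	 * touches them */
	static void free_section_pages(unsigned long start_pfn,
				       unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_page(pfn);

			if (!page_is_ram(pfn)) {
				SetPageReserved(page);
				continue;
			}
			__free_pages(page, 0);
		}
	}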

Thanks.
-- 
Yasunori Goto 



* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (12 preceding siblings ...)
  2005-05-27  5:14 ` Yasunori Goto
@ 2005-05-27 10:35 ` Bob Picco
  2005-05-27 16:23 ` David Mosberger
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-05-27 10:35 UTC (permalink / raw)
  To: linux-ia64

luck wrote:	[Thu May 26 2005, 06:03:29PM EDT]
> >Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
> >it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
> >be an excessive 2^((50-28)+3) = 32MB and is not advised.
> 
> While you can tune a custom kernel for a particular system configuration,
> we might have to use the 50/28 configuration for the generic "defconfig"
> case ... that kernel should be bootable anywhere.   SGI needs (or will
> need) the "50" for total physical size, and other platforms may need the
> "28" (is that even small enough?  We currently have "granule" sizes of
> 16M and 64M to cope with odd holes in the physical address space ... so
> perhaps the section size might need to be even smaller: 24 or 26?  How did
> you come up with 28 as the low bound for SECTION_BITS?)
28 seems to be the least costly in terms of consuming reserved pages.  Of
course, reducing it decreases reserved page consumption, but at the cost of
increasing the mem_section size.  Should 50 have to be the defconfig value,
then I'd recommend 30 for SECTION_SIZE_BITS.  Memory-wise that's
2^((50-30)+3) = 8MB for mem_section.  Smaller than 30 SECTION_BITS for
50-bit physical consumes too much memory for mem_section.

The correction below will fix a mishap in the Kconfig patch.  I booted on
a 4GB rx2600 with mem=2GB.  No problems, but the gap between reserved pages
and detected RAM is slightly disconcerting.  I'm investigating this further.
> 
> -Tony
bob

Index: linux-2.6.12-rc4-mm2-broken/arch/ia64/Kconfig
===================================================================
--- linux-2.6.12-rc4-mm2-broken.orig/arch/ia64/Kconfig	2005-05-25 12:28:37.000000000 -0400
+++ linux-2.6.12-rc4-mm2-broken/arch/ia64/Kconfig	2005-05-26 19:20:57.000000000 -0400
@@ -321,7 +321,7 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
 	depends on NEED_MULTIPLE_NODES
 
 config SECTION_BITS
-	int
+	int "SECTION_BITS (28-32)" if !HUGETLB_PAGE
 	depends on SPARSEMEM
 	range 28 32	if !HUGETLB_PAGE
 	default "32"	if HUGETLB_PAGE
@@ -330,7 +330,7 @@ config SECTION_BITS
 	  Size of memory section in bits.
 
 config PHYSICAL_MEMORY_BITS
-	int
+	int "PHYSICAL_MEMORY_BITS (44-50)"
 	depends on SPARSEMEM
 	range 44 50
 	default 44
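
The intent is that the arch header then just picks these values up.  A
sketch of how include/asm-ia64/sparsemem.h might consume them (my
assumption about this series, not a verbatim hunk):

	#ifdef CONFIG_SPARSEMEM
	#define SECTION_SIZE_BITS	CONFIG_SECTION_BITS
	#define MAX_PHYSADDR_BITS	CONFIG_PHYSICAL_MEMORY_BITS
	#define MAX_PHYSMEM_BITS	CONFIG_PHYSICAL_MEMORY_BITS
	#endif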



* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (13 preceding siblings ...)
  2005-05-27 10:35 ` Bob Picco
@ 2005-05-27 16:23 ` David Mosberger
  2005-05-27 22:04 ` Jack Steiner
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: David Mosberger @ 2005-05-27 16:23 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 27 May 2005 06:35:48 -0400, Bob Picco <bob.picco@hp.com> said:

  Bob> luck wrote:	[Thu May 26 2005, 06:03:29PM EDT]
  >> >Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
  >> >it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
  >> >be an excessive 2^((50-28)+3) = 32MB and is not advised.

  >> While you can tune a custom kernel for a particular system configuration,
  >> we might have to use the 50/28 configuration for the generic "defconfig"
  >> case ... that kernel should be bootable anywhere.   SGI needs (or will
  >> need) the "50" for total physical size, and other platforms may need the
  >> "28" (is that even small enough?  We currently have "granule" sizes of
  >> 16M and 64M to cope with odd holes in the physical address space ... so
  >> perhaps the section size might need to be even smaller: 24 or 26?  How did
  >> you come up with 28 as the low bound for SECTION_BITS?)

  Bob> 28 seems to be the least costly in terms of consuming reserved
  Bob> pages.  Of course, reducing it decreases reserved page
  Bob> consumption, but at the cost of increasing the mem_section size.
  Bob> Should 50 have to be the defconfig value, then I'd recommend 30
  Bob> for SECTION_SIZE_BITS.  Memory-wise that's 2^((50-30)+3) = 8MB
  Bob> for mem_section.  Smaller than 30 SECTION_BITS for 50-bit
  Bob> physical consumes too much memory for mem_section.

This discussion just demonstrates what a pain such config parameters
are.  Such stuff needs to be
self-adjusting/self-tuning/able-to-cover-any-configuration.  Virtual
mem-map seems to qualify much better here.

	--david


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (14 preceding siblings ...)
  2005-05-27 16:23 ` David Mosberger
@ 2005-05-27 22:04 ` Jack Steiner
  2005-05-30  0:18 ` KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Jack Steiner @ 2005-05-27 22:04 UTC (permalink / raw)
  To: linux-ia64

On Fri, May 27, 2005 at 09:23:24AM -0700, David Mosberger wrote:
> >>>>> On Fri, 27 May 2005 06:35:48 -0400, Bob Picco <bob.picco@hp.com> said:
> 
>   Bob> luck wrote:	[Thu May 26 2005, 06:03:29PM EDT]
>   >> >Well, worst case it would consume 2^((50-32)+3) bytes (2MB).  I would hope that
>   >> >it's not configured for 28 SECTION_SIZE_BITS and 50 physical.  That would
>   >> >be an excessive 2^((50-28)+3) = 32MB and is not advised.
> 
>   >> While you can tune a custom kernel for a particular system configuration,
>   >> we might have to use the 50/28 configuration for the generic "defconfig"
>   >> case ... that kernel should be bootable anywhere.   SGI needs (or will
>   >> need) the "50" for total physical size, and other platforms may need the
>   >> "28" (is that even small enough?  We currently have "granule" sizes of
>   >> 16M and 64M to cope with odd holes in the physical address space ... so
>   >> perhaps the section size might need to be even smaller: 24 or 26?  How did
>   >> you come up with 28 as the low bound for SECTION_BITS?)
> 
>   Bob> 28 seems to be the least costly in terms of consuming reserved
>   Bob> pages.  Of course, reducing it decreases reserved page
>   Bob> consumption, but at the cost of increasing the mem_section size.
>   Bob> Should 50 have to be the defconfig value, then I'd recommend 30
>   Bob> for SECTION_SIZE_BITS.  Memory-wise that's 2^((50-30)+3) = 8MB
>   Bob> for mem_section.  Smaller than 30 SECTION_BITS for 50-bit
>   Bob> physical consumes too much memory for mem_section.
> 
> This discussion just demonstrates what a pain such config parameters
> are.  Such stuff needs to be
> self-adjusting/self-tuning/able-to-cover-any-configuration.  Virtual
> mem-map seems to qualify much better here.
> 
> 	--david

Another point to consider...

Both RH & SuSE release a single kernel binary image that is used on all
platforms. IA64 discovers the platform type (Dig, SGI, HP, ...) during
boot & sets up platform callouts to platform specific functions. We
will need to select values for both CONFIG_SECTION_BITS &
CONFIG_PHYSICAL_MEMORY_BITS that work on all IA64 platforms (or make these
values configurable at runtime)

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (15 preceding siblings ...)
  2005-05-27 22:04 ` Jack Steiner
@ 2005-05-30  0:18 ` KAMEZAWA Hiroyuki
  2005-05-31 17:55 ` Luck, Tony
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-05-30  0:18 UTC (permalink / raw)
  To: linux-ia64

Hi Bob,
> Another point to consider...
> 
> Both RH & SuSE release a single kernel binary image that is used on all
> platforms. IA64 discovers the platform type (Dig, SGI, HP, ...) during
> boot & sets up platform callouts to platform specific functions. We
> will need to select values for both CONFIG_SECTION_BITS &
> CONFIG_PHYSICAL_MEMORY_BITS that work on all IA64 platforms (or make these
> values configurable at runtime)
> 
After reading the e-mails,
	physical address size -- 50 bits
	section size	      -- 32 bits
will be good as a default.
This can work on any ia64 platform && full-sized Hugetlb can be used without any tuning.

If people want to hot-remove smaller memory sections, they should adjust the params.

-- Kame



* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (16 preceding siblings ...)
  2005-05-30  0:18 ` KAMEZAWA Hiroyuki
@ 2005-05-31 17:55 ` Luck, Tony
  2005-05-31 18:14 ` Dave Hansen
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2005-05-31 17:55 UTC (permalink / raw)
  To: linux-ia64

>After reading the e-mails,
>	physical address size -- 50 bits
>	section size	      -- 32 bits
>will be good as a default.
>This can work on any ia64 platform && full-sized Hugetlb can
>be used without any tuning.

But we'll need to update the physical address size every 18 months or
so to add another bit as memory density increases.

CONFIG_VIRTUAL_MEM_MAP will automatically adjust to each system, and won't
need tweaking periodically.

The issue here appears to be that 32-bit systems don't have any spare
virtual address space to devote to a sparse virtual mem_map[] array, so
they need to have the indirection ... but the degree of sparseness tends
to be limited as they support fewer physical address bits.

What are the other benefits of SPARSEMEM that I'm missing?

-Tony


* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (17 preceding siblings ...)
  2005-05-31 17:55 ` Luck, Tony
@ 2005-05-31 18:14 ` Dave Hansen
  2005-05-31 18:15 ` Jack Steiner
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2005-05-31 18:14 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2005-05-31 at 10:55 -0700, Luck, Tony wrote:
> What are the other benefits of SPARSEMEM that I'm missing?

* The same implementation works everywhere, on every architecture.
* It has good tlb behavior.
* It is faster and has a lower icache footprint than existing
  discontigmem implementations.
* On a theoretical 16TB ppc64 system with 16MB sections, the overhead of
  the mem_section[] table is 8MB.

Also, nothing seriously confines us to a flat array of mem_sections,
that's just the only implementation right now.  The pagetables that are
walked in the TLB miss handler (for vmem_map[]) could just as easily be
a set of two-level mem_section tables that are walked in software.  That
just adds an extra load to the pfn_to_page() path.  Plus, if somebody
does this, all sparsemem architectures can benefit.
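
A sketch of that two-level idea (names illustrative; only the root table
is statically sized, and leaf tables are allocated only where memory
actually exists):

	#define SECTIONS_PER_ROOT	(PAGE_SIZE / sizeof(struct mem_section))
	#define NR_SECTION_ROOTS	(NR_MEM_SECTIONS / SECTIONS_PER_ROOT)

	struct mem_section *mem_section_roots[NR_SECTION_ROOTS];

	static inline struct mem_section *__nr_to_section(unsigned long nr)
	{
		struct mem_section *root = mem_section_roots[nr / SECTIONS_PER_ROOT];

		if (!root)
			return NULL;			/* hole: nothing here */
		return &root[nr % SECTIONS_PER_ROOT];	/* the extra load */
	}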

-- Dave



* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (18 preceding siblings ...)
  2005-05-31 18:14 ` Dave Hansen
@ 2005-05-31 18:15 ` Jack Steiner
  2005-05-31 21:41 ` Luck, Tony
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Jack Steiner @ 2005-05-31 18:15 UTC (permalink / raw)
  To: linux-ia64

On Mon, May 30, 2005 at 09:18:22AM +0900, KAMEZAWA Hiroyuki wrote:
> Hi Bob,
> >Another point to consider...
> >
> >Both RH & SuSE release a single kernel binary image that is used on all
> >platforms. IA64 discovers the platform type (Dig, SGI, HP, ...) during
> >boot & sets up platform callouts to platform specific functions. We
> >will need to select values for both CONFIG_SECTION_BITS &
> >CONFIG_PHYSICAL_MEMORY_BITS that work on all IA64 platforms (or make these
> >values configurable at runtime)
> >
> After reading the e-mails,
> 	physical address size -- 50 bits
> 	section size	      -- 32 bits

A significant number of SGI systems contain 2GB memory chunks (4x512MB
DIMMs configured as a 2GB memory bank).  I think these systems would prefer
a section size of 31 bits.  Right?

I'm concerned that there may not be a value of "section size" & "physical
address size" that is optimal across all of the various IA64 platforms.

(Functionally, everything appears to work fine. I'm in the process of running a
few benchmarks on a 64p system. More later....).


> will be good as a default.
> This can work on any ia64 platform && full-sized Hugetlb can be used
> without any tuning.
> 
> If people want to hot-remove smaller memory sections, they should adjust
> the params.
> 
> -- Kame

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (19 preceding siblings ...)
  2005-05-31 18:15 ` Jack Steiner
@ 2005-05-31 21:41 ` Luck, Tony
  2005-05-31 21:58 ` Dave Hansen
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2005-05-31 21:41 UTC (permalink / raw)
  To: linux-ia64

Dave Hansen wrote:
>* The same implementation works everywhere, on every architecture.
Truth in advertising might require that this patch be renamed as
"SOMEWHATSPARSE", as it breaks down for extremely sparse physical
address space machines.  It looks like the existing SGI Altix is at
the ragged edge of what can be practically supported.

>* It has good tlb behavior.
True ... definitely better than VIRTUAL_MEM_MAP. But what effect does
this have on system level performance?

>* It is faster and has a lower icache footprint than existing
>  discontigmem implementations.
Did I miss some benchmark results?

>* On a theoretical 16TB ppc64 system with 16MB sections, the overhead of
>  the mem_section[] table is 8MB.
Back to the "somewhat sparse" arguments of point #1. In fact this theoretical
system isn't "sparse" at all!

>Also, nothing seriously confines us to a flat array of mem_sections,
>that's just the only implementation right now.  The pagetables that are
>walked in the TLB miss handler (for vmem_map[]) could just as easily be
>a set of two-level mem_section tables that are walked in software.  That
>just adds an extra load to the pfn_to_page() path.  Plus, if somebody
 ^^^^^^^^^^^^^^^^^^^^^^^
>does this, all sparsemem architectures can benefit.

What would the performance impacts of this extra load be? pfn_to_page()
appears to be a pretty common operation.


Overall the SPARSEMEM patch is a good clean-up to an area of code that is
unspeakably complex (because of all the interactions between NUMA, DISCONTIG,
and VIRT_MEM_MAP).  So I'm in full agreement that something needs to be done.
I'm just not convinced yet that the existing form of the SPARSEMEM patch is
the greatest thing since sliced bread.

-Tony


* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (20 preceding siblings ...)
  2005-05-31 21:41 ` Luck, Tony
@ 2005-05-31 21:58 ` Dave Hansen
  2005-06-01  1:37 ` Bob Picco
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2005-05-31 21:58 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2005-05-31 at 14:41 -0700, Luck, Tony wrote:
> >* It has good tlb behavior.
> True ... definitely better than VIRTUAL_MEM_MAP. But what effect does
> this have on system level performance?

It slightly improves performance on everything I've run it on, at least
compared to discontigmem.  That means a few ppc64 configurations, x86
summit, and NUMAQ.

> >* It is faster and has a lower icache footprint than existing
> >  discontigmem implementations.
> Did I miss some benchmark results?

I've posted them a few times.  The gain is somewhere in the 1-2% range
on NUMAQ.  Nothing substantial.

I can dig the results up again, but they're going to mean close to
nothing on your hardware.  I'd suggest running it yourself, and seeing
exactly how it behaves.

> >* On a theoretical 16TB ppc64 system with 16MB sections, the overhead of
> >  the mem_section[] table is 8MB.
> Back to the "somewhat sparse" arguments of point #1. In fact this theoretical
> system isn't "sparse" at all!

Well, the overhead is still 8MB, even if the system only has 32MB:
16MB@0 and 16MB@(16TB-16MB).  That's pretty sparse.

In any case, I agree that the current code isn't optimal across all ia64
platforms.  But, I don't think we're seriously tied to that single, flat
array.  It's just the easiest way to do it for now.

> >Also, nothing seriously confines us to a flat array of mem_sections,
> >that's just the only implementation right now.  The pagetables that are
> >walked in the TLB miss handler (for vmem_map[]) could just as easily be
> >a set of two-level mem_section tables that are walked in software.  That
> >just adds an extra load to the pfn_to_page() path.  Plus, if somebody
>  ^^^^^^^^^^^^^^^^^^^^^^^
> >does this, all sparsemem architectures can benefit.
> 
> What would the performance impacts of this extra load be? pfn_to_page()
> appears to be a pretty common operation.

On a normal, x86 flatmem system there's a single load to do
pfn_to_page() from *mem_map.  With today's sparsemem, that goes to two
loads (page->flags and mem_section[section]).  I haven't been able to
measure the effect of this extra load on any macro-benchmarks.
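
In macro form, the difference looks roughly like this (simplified;
page_to_section() pulls the section number out of page->flags):

	#ifdef CONFIG_FLATMEM
	/* one load: the mem_map base */
	#define page_to_pfn(page)	((unsigned long)((page) - mem_map))
	#else	/* CONFIG_SPARSEMEM */
	/* two loads: page->flags, then the section's mem_map base */
	#define page_to_pfn(pg)						\
	({	struct page *__pg = (pg);				\
		__pg - __section_mem_map_addr(__nr_to_section(		\
			page_to_section(__pg)));			\
	})
	#endif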

-- Dave



* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (21 preceding siblings ...)
  2005-05-31 21:58 ` Dave Hansen
@ 2005-06-01  1:37 ` Bob Picco
  2005-06-01  9:14 ` Andy Whitcroft
  2005-06-01 22:48 ` David Mosberger
  24 siblings, 0 replies; 26+ messages in thread
From: Bob Picco @ 2005-06-01  1:37 UTC (permalink / raw)
  To: linux-ia64

Dave Hansen wrote:	[Tue May 31 2005, 02:14:22PM EDT]
> On Tue, 2005-05-31 at 10:55 -0700, Luck, Tony wrote:
> > What are the other benefits of SPARSEMEM that I'm missing?
> 
> * The same implementation works everywhere, on every architecture.
> * It has good tlb behavior.
> * It is faster and has a lower icache footprint than existing
>   discontigmem implementations.
> * On a theoretical 16TB ppc64 system with 16MB sections, the overhead of
>   the mem_section[] table is 8MB.
> 
> Also, nothing seriously confines us to a flat array of mem_sections,
I started on this Friday.  The holiday weekend and two days of not
feeling well caused a delay.  I should have it all worked out by tomorrow
or Thursday.  Hopefully it will eliminate the objections raised.
> that's just the only implementation right now.  The pagetables that are
> walked in the TLB miss handler (for vmem_map[]) could just as easily be
> a set of two-level mem_section tables that are walked in software.  That
> just adds an extra load to the pfn_to_page() path.  Plus, if somebody
> does this, all sparsemem architectures can benefit.
> 
> -- Dave
bob


* Re: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (22 preceding siblings ...)
  2005-06-01  1:37 ` Bob Picco
@ 2005-06-01  9:14 ` Andy Whitcroft
  2005-06-01 22:48 ` David Mosberger
  24 siblings, 0 replies; 26+ messages in thread
From: Andy Whitcroft @ 2005-06-01  9:14 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote:
> Dave Hansen wrote:
> 
>>* The same implementation works everywhere, on every architecture.
> 
> Truth in advertising might require that this patch be renamed as
> "SOMEWHATSPARSE", as it breaks down for extremely sparse physical
> address space machines.  It looks like the existing SGI Altix is at
> the ragged edge of what can be practically supported.

> Overall the SPARSEMEM patch is a good clean-up to an area of code that is
> unspeakably complex (because of all the interactions between NUMA, DISCONTIG,
> and VIRT_MEM_MAP).  So I'm in full agreement that something needs to be done.
> I'm just not convinced yet that the existing form of the SPARSEMEM patch is
> the greatest thing since sliced bread.

As Dave has already indicated, the key here is that this is a common
implementation.  As the code stood, we had a common implementation of
'flat' memory, and per-architecture memory models where that was not
sufficient.  Over time a number of key architectures have met the
non-contiguous memory problem and solved it in nearly the same manner.
What we set out to do was to try and provide a common implementation for
a more complex discontiguous memory format.

SPARSEMEM isn't trying to be the answer for every problem; it is more a
simple and clean implementation which serves the majority of the current
users well.  Where that is insufficient, we still have the option of an
architecture-specific memory model, and significant effort has been made
in the foundations already accepted to keep them usable by such an
implementation.

That said, it was also expected that SPARSEMEM would form the foundation
for more complex memory models.  As you mention, there are already some
big-iron machines which push the simplistic mapping we currently have to
the very limit and probably beyond.  As Dave says, if this is fixed in
SPARSEMEM then we gain for any other architecture that finds itself up
against the limits.

-apw


* RE: [patch 0/4] ia64 SPARSEMEM
  2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
                   ` (23 preceding siblings ...)
  2005-06-01  9:14 ` Andy Whitcroft
@ 2005-06-01 22:48 ` David Mosberger
  24 siblings, 0 replies; 26+ messages in thread
From: David Mosberger @ 2005-06-01 22:48 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 31 May 2005 11:14:22 -0700, Dave Hansen <haveblue@us.ibm.com> said:

  Dave> The pagetables that are walked in the TLB miss handler (for
  Dave> vmem_map[]) could just as easily be a set of two-level
  Dave> mem_section tables that are walked in software.

SPARSEMEM might be a lot more interesting if it could "degenerate"
into VIRTUAL_MEM_MAP.  That is, on platforms with enough virtual
space, you should be able to let the hardware/MMU-subsystem do the
translation.
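
In other words, on a machine with enough virtual space the sparsemem
bookkeeping could collapse back to something like this (sketch; no such
config option exists today):

	/* page structs for all sections live in one virtually
	 * contiguous array; holes are simply left unmapped and the
	 * MMU absorbs the indirection */
	struct page *vmem_map;

	#define pfn_to_page(pfn)	(vmem_map + (pfn))
	#define page_to_pfn(page)	((unsigned long)((page) - vmem_map))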

	--david


end of thread

Thread overview: 26+ messages
2005-05-23 17:50 [patch 0/4] ia64 SPARSEMEM Bob Picco
2005-05-24  3:29 ` David Mosberger
2005-05-24 14:33 ` Bob Picco
2005-05-24 16:27 ` Bob Picco
2005-05-26  0:32 ` Luck, Tony
2005-05-26 20:09 ` David Mosberger
2005-05-26 20:54 ` Bob Picco
2005-05-26 21:02 ` Dave Hansen
2005-05-26 21:34 ` Luck, Tony
2005-05-26 21:44 ` Jack Steiner
2005-05-26 21:51 ` Bob Picco
2005-05-26 22:03 ` Luck, Tony
2005-05-26 22:04 ` Bob Picco
2005-05-27  5:14 ` Yasunori Goto
2005-05-27 10:35 ` Bob Picco
2005-05-27 16:23 ` David Mosberger
2005-05-27 22:04 ` Jack Steiner
2005-05-30  0:18 ` KAMEZAWA Hiroyuki
2005-05-31 17:55 ` Luck, Tony
2005-05-31 18:14 ` Dave Hansen
2005-05-31 18:15 ` Jack Steiner
2005-05-31 21:41 ` Luck, Tony
2005-05-31 21:58 ` Dave Hansen
2005-06-01  1:37 ` Bob Picco
2005-06-01  9:14 ` Andy Whitcroft
2005-06-01 22:48 ` David Mosberger
