public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] ia64:  numa emulation
@ 2005-10-20 20:36 Lee Schermerhorn
  2005-10-20 20:52 ` Chen, Kenneth W
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Lee Schermerhorn @ 2005-10-20 20:36 UTC (permalink / raw)
  To: linux-ia64

Subdivide an ia64 SMP platform into fake numa nodes -- e.g., for testing
fixes and tool development w/o having access to actual numa platform.

I developed this patch when I saw Andi Kleen's x86_64 numa emulation
patch and have been maintaining it for my own use since 2.6.10.  Peter
Chubb suggested that I post it to the ia64 list for discussion, and
perhaps [someday] inclusion in the kernel.  Below is a longish
description of the patch.  I'll post the actual patch as a separate
message.

By the way:  I know of at least one other such patch that Bob Picco
maintains for his use, altho' I wasn't aware of it when I put this one
together.  Some messages I've seen on this [or maybe other?] mailing
list lead me to believe that other such patches may exist.

---
IA64 Numa Emulation

How to configure:

Select NUMA_EMU config parameter.

[shamelessly copied from Andi Kleen's x86_64 numa emulation]

----------------
How to invoke:

boot command line argument:  numa=fake[=N]

[also copied from Andi's x86_64 numa emulation]

N = number of fake nodes requested, default=2.
N is constrained to min( # actual cpus, MAX_NUMNODES )
N < 2 abandons numa emulation
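
The constraints above can be sketched as follows; MAX_NUMNODES here is
an arbitrary stand-in for the kernel config value, and the function
name is illustrative, not from the patch:

```c
/* Sketch of validating the numa=fake[=N] request.  MAX_NUMNODES is an
 * assumed value, not the kernel's actual configuration. */
#define MAX_NUMNODES 16

/* Clamp the requested node count as described: N is limited to
 * min(# actual cpus, MAX_NUMNODES); N < 2 abandons emulation. */
static int validate_fake_nodes(int requested, int ncpus)
{
	int max = ncpus < MAX_NUMNODES ? ncpus : MAX_NUMNODES;

	if (requested > max)
		requested = max;
	return requested < 2 ? 0 : requested;	/* 0 => no emulation */
}
```

So, e.g., requesting 8 nodes on a 4-cpu box yields 4 emulated nodes.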

----------------
Other limitations:

1) won't attempt to subdivide an actual NUMA platform.  Abandons
   emulation if the system's ACPI provides an SRAT with >1 proximity
   domain.

2) no attempt to support memory-less nodes nor cpu-less nodes.

3) no way to specify memory per node.  Simply divides existing phys
   mem ~equally among emulated nodes.  Excess [remainder] goes to
   node 0.

4) simple round robin assignment of phys cpus to emulated nodes.
   Will require work for, e.g., multicore/SMT if we want siblings
   in the same node.  To do otherwise would probably confuse the
   scheduling domain setup.
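
Limitations 3 and 4 amount to simple arithmetic; a sketch, with
illustrative function names and made-up units rather than the patch's
actual code:

```c
/* Sketch of limitations 3 and 4: phys mem split ~equally with the
 * excess [remainder] going to node 0, and cpus assigned round robin.
 * Names and units are illustrative. */

/* Memory assigned to an emulated node: total/nnodes each; node 0
 * also absorbs the remainder. */
static unsigned long node_mem(unsigned long total, int node, int nnodes)
{
	unsigned long per_node = total / nnodes;

	return node == 0 ? per_node + total % nnodes : per_node;
}

/* Simple round robin phys-cpu -> emulated-node mapping; SMT siblings
 * may land in different nodes [the work item noted in limitation 4]. */
static int cpu_to_fake_node(int phys_cpu, int nnodes)
{
	return phys_cpu % nnodes;
}
```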

----------------
Approach:

During acpi_arch_numa_fixup():
after finding numa=fake[=N] boot parameter, and validating N,
<=1 SRAT proximity domain, etc., populate variables and structures
that would be filled from ACPI/SRAT/SLIT in actual numa platform:  

	+ pxm_flags[] -- array of SRAT proximity domains
	+ node_cpuid[] -- array of {phys cpuid, pxm/node id}
		  indexed by logical cpu id.
	+ srat_num_cpus -- # of cpus found in [emulated] SRAT
	+ node_memblks[] -- array of phys memory ranges and
	  pxm/node affinity for each
	+ num_node_memblks -- number of valid entries in 
	  node_memblks[] array.
	+ slit_table -- pointer to emulated ACPI SLIT
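
A schematic of that population step might look like the following; the
struct layout, array sizes, and build_fake_srat() name are simplified
stand-ins for the ia64 originals, and the SLIT distances use ACPI's
conventional 10-local/20-remote values:

```c
/* Schematic of filling the structures that ACPI SRAT/SLIT parsing
 * would normally populate.  Layouts and names are simplified
 * stand-ins, not the actual ia64 definitions. */
#define MAX_CPUS  8
#define MAX_NODES 4

struct fake_node_cpuid { int phys_id; int nid; };

static struct fake_node_cpuid node_cpuid[MAX_CPUS];
static int srat_num_cpus;
static unsigned char slit_table[MAX_NODES][MAX_NODES];

/* One node_cpuid[] entry per cpu, node chosen round robin; SLIT
 * distances: 10 on the diagonal [local], 20 elsewhere [remote]. */
static void build_fake_srat(int ncpus, int nnodes)
{
	int cpu, i, j;

	for (cpu = 0; cpu < ncpus; cpu++) {
		node_cpuid[cpu].phys_id = cpu;	/* logical == phys here */
		node_cpuid[cpu].nid = cpu % nnodes;
	}
	srat_num_cpus = ncpus;

	for (i = 0; i < nnodes; i++)
		for (j = 0; j < nnodes; j++)
			slit_table[i][j] = (i == j) ? 10 : 20;
}
```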

Then, acpi_arch_numa_fixup() will just "do the right thing".  Emulation
is at a low enough level that the rest of the system required no
additional changes.

Note:  I use a custom EFI memory map walker to find physical memory,
regardless of its usage, to populate the node_memblks[] array and to
compute total memory.
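
The walker itself isn't shown here, but an EFI memory map walk
generally has to advance by the firmware-reported descriptor stride
rather than sizeof().  A minimal sketch, with an illustrative
descriptor layout [the real efi_memory_desc_t has more fields]:

```c
/* Sketch of walking an EFI-style memory map.  Descriptors are packed
 * with a runtime-reported stride, so the walk advances by desc_size,
 * not sizeof(struct).  The struct here is illustrative only. */
#include <stdint.h>

struct efi_memdesc {
	uint32_t type;
	uint64_t phys_start;
	uint64_t num_pages;	/* 4KB EFI pages */
};

/* Total conventional memory over the map, the way the custom walker
 * sums phys mem before splitting it across fake nodes.
 * Type 7 == EfiConventionalMemory in the UEFI spec. */
static uint64_t total_memory(void *map, unsigned long map_size,
			     unsigned long desc_size)
{
	uint64_t total = 0;
	char *p;

	for (p = map; p < (char *)map + map_size; p += desc_size) {
		struct efi_memdesc *md = (struct efi_memdesc *)p;

		if (md->type == 7)
			total += md->num_pages << 12;
	}
	return total;
}
```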

----------------
Interaction with Virtual Mem Map and Buddy System:

IA64 [prior to sparsemem] uses a virtual memory map for page structs.
The virtual mem map for each node must be aligned on a max order
boundary.  For a real system, with holes between nodes, it does not
matter if memory [page structs] actually exists for all virtual pages
in the virtual mem map.  However, with emulated nodes, there may be no
holes between the emulated nodes' assigned memory.  Therefore, the
boundaries between nodes must be on a max order boundary.  Otherwise,
pages under the virtual mem map for, say, node 1 that is rounded down
to a max order boundary will be assigned to both node 0 AND node 1.
[Voice of experience ;-)]

So, I have to round UP the boundary between nodes to a max order
boundary.  I had to decrease CONFIG_FORCE_MAX_ZONEORDER to 14 [256M
boundary] when numa emulation is configured in, to give node 0
sufficient memory on smallish [~4GB] systems.
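
With the constants quoted above [16KB ia64 pages and order 14, giving
2^(14+14) = 256M max order spans], the round-up is straightforward
mask arithmetic; a sketch, with an illustrative function name:

```c
/* The round-up described above: with 16KB pages and
 * CONFIG_FORCE_MAX_ZONEORDER = 14, a max order span is
 * 2^(14+14) = 256M, so each emulated node boundary is rounded UP
 * to a 256M boundary.  Function name is illustrative. */
#include <stdint.h>

#define PAGE_SHIFT	14			/* 16KB ia64 pages */
#define MAX_ORDER	14			/* CONFIG_FORCE_MAX_ZONEORDER */
#define MAX_ORDER_BYTES	(1ULL << (PAGE_SHIFT + MAX_ORDER))	/* 256M */

static uint64_t node_boundary(uint64_t addr)
{
	/* Round UP so no max order block straddles two fake nodes. */
	return (addr + MAX_ORDER_BYTES - 1) & ~(MAX_ORDER_BYTES - 1);
}
```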

Also, because of the interaction of max order and hugetlbfs [hard coded
HPAGE_SHIFT_DEFAULT], I disable hugetlbfs when numa emulation is
configured.  Back around 2.6.10, I recall seeing discussion of
dynamically sizing hugetlbfs based on max order.  I haven't tracked
this to see if it went in.  But, if it did, one could reenable
hugetlbfs and adapt to the reduced max order automatically.

None of this may be required with sparsemem.

----------------
Testing:

I have tested this numa emulation on an HP rx2600 [2-cpu ia64] with 2
nodes [of course!] under each kernel release since 2.6.10.  I tested on
an rx46xx [4-cpus] with 2, 3, and 4 nodes under 2.6.13.  Also, under
2.6.13, I layered on the memory hotplug patch series from sr71.net,
along with Marcello's migration cache and Ray Bryant's manual page
migration patches.  With this config, I was able to migrate pages
between emulated nodes.

Your mileage may vary...

Lee Schermerhorn



^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [PATCH 0/1] ia64:  numa emulation
  2005-10-20 20:36 [PATCH 0/1] ia64: numa emulation Lee Schermerhorn
@ 2005-10-20 20:52 ` Chen, Kenneth W
  2005-10-20 20:58 ` Lee Schermerhorn
  2005-10-20 21:47 ` Chen, Kenneth W
  2 siblings, 0 replies; 4+ messages in thread
From: Chen, Kenneth W @ 2005-10-20 20:52 UTC (permalink / raw)
  To: linux-ia64

Lee Schermerhorn wrote on Thursday, October 20, 2005 1:36 PM
> Also, because of interaction of max order and huge tlb fs [hard
> coded HPAGE_SHIFT_DEFAULT], I disable hugetlbfs when numa emulation
> is configured.

Hugetlbfs can still be safely enabled.  The only downside is that the
system won't be able to reserve any hugetlb pages with the default
size.  Maybe it is senseless, but no harm is done by enabling it.


> Back around 2.6.10, I recall seeing discussion of dynamically sizing
> hugetlbfs based on max order.  I haven't tracked this to see if it
> went in.  But, if it did, one could reenable hugetlbfs and adapted
> to the reduced max order automatically.

I have a patch that does better than that.  We can change the hugetlb
page size at run time.  I will clean up my patch and post it.

- Ken



* RE: [PATCH 0/1] ia64:  numa emulation
  2005-10-20 20:36 [PATCH 0/1] ia64: numa emulation Lee Schermerhorn
  2005-10-20 20:52 ` Chen, Kenneth W
@ 2005-10-20 20:58 ` Lee Schermerhorn
  2005-10-20 21:47 ` Chen, Kenneth W
  2 siblings, 0 replies; 4+ messages in thread
From: Lee Schermerhorn @ 2005-10-20 20:58 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2005-10-20 at 13:52 -0700, Chen, Kenneth W wrote:
> Lee Schermerhorn wrote on Thursday, October 20, 2005 1:36 PM
> > Also, because of interaction of max order and huge tlb fs [hard
> > coded HPAGE_SHIFT_DEFAULT], I disable hugetlbfs when numa emulation
> > is configured.
> 
> Hugetlbfs can still be safely enabled.  The only downside is system
> won't be able to reserve any hugetlb page with the default size. Maybe
> it is senseless, but no harm is done here by enabling it.

OK.  I wasn't sure, and didn't have need for hugetlbfs at the time, so
I just disabled it by default when emulation is selected.  One could
always override that default.

> 
> 
> > Back around 2.6.10, I recall seeing discussion of dynamically sizing
> > hugetlbfs based on max order.  I haven't tracked this to see if it
> > went in.  But, if it did, one could reenable hugetlbfs and adapted
> > to the reduced max order automatically.
> 
> I have a patch does better than that.  We can change the hugetlb page
> size at run time.  I will clean up my patch and post it.
> 

Cool.  Sounds like a generic [!just ia64] patch, right?

Lee



* RE: [PATCH 0/1] ia64:  numa emulation
  2005-10-20 20:36 [PATCH 0/1] ia64: numa emulation Lee Schermerhorn
  2005-10-20 20:52 ` Chen, Kenneth W
  2005-10-20 20:58 ` Lee Schermerhorn
@ 2005-10-20 21:47 ` Chen, Kenneth W
  2 siblings, 0 replies; 4+ messages in thread
From: Chen, Kenneth W @ 2005-10-20 21:47 UTC (permalink / raw)
  To: linux-ia64

Lee Schermerhorn wrote on Thursday, October 20, 2005 1:59 PM
> > > Back around 2.6.10, I recall seeing discussion of dynamically
> > > sizing hugetlbfs based on max order.  I haven't tracked this
> > > to see if it went in.  But, if it did, one could reenable
> > > hugetlbfs and adapted to the reduced max order automatically.
> > 
> > I have a patch does better than that.  We can change the hugetlb
> > page size at run time.  I will clean up my patch and post it.
> > 
> 
> Cool.  Sounds like a generic [!just ia64] patch, right?


Well, the patch is ia64 only, since x86 and x86_64 don't offer
multiple page sizes for hugetlb pages.  In any case, this probably
should be arch specific, since it uses an arch specific API to find
out the supported page sizes.

- Ken


