From: Erich Focht <efocht@ess.nec.de>
To: linux-ia64@vger.kernel.org
Subject: Re: [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES
Date: Mon, 19 Aug 2002 16:33:40 +0000 [thread overview]
Message-ID: <marc-linux-ia64-105590701905954@msgid-missing> (raw)
In-Reply-To: <marc-linux-ia64-105590701905942@msgid-missing>
Hi Jack,
thanks very much for the detailed comments. There was (at least) one
mistake in my previous mail, sorry if that generated confusion. As far
as I understand, you agree with getting rid of CHUNKS and replacing
the one macro using them with GRANULE. Also I hope we can limit the
usage of PXM to be only within arch/ia64/acpi.c.
On Friday 16 August 2002 23:53, Jack Steiner wrote:
> > things and get rid of some concepts. In my opinion we need only
> > the following concepts inside DISCONTIGMEM:
> > - node IDs (AKA compact node IDs or logical nodes).
> > - physical node IDs
> > - clumps (I'd prefer the name memory BANKS here, as a clump suggests
> > something to be contiguous, without holes (German: Klumpen)).
> >
> > In the initialisation phase we need:
> > - memory blocks (AKA chunks?) (contiguous pieces of memory on one
> > node, provided by ACPI, only used for setup. No size or alignment
> > expected. Needed later for paddr_to_nid() but that's all.)
> > - proximity domains (only ACPI NUMA setup, invisible to the rest of
> > the DISCONTIG code).
...
> > Therefore a node would have several memory banks which are not
> > necessarily adjacent in the physical memory space. There can be gaps
> > or banks from other nodes interleaved. In the mem_map array there is
> > space reserved for page struct entries of ALL pages of one bank,
> > existent or not. Memory holes between banks don't build holes in the
> > mem_map array.
>
> If the mem_map has entries for pages that don't exist, how do you handle
> code that scans the mem_map array? How does code recognize & skip pages
> associated with missing memory? For examples, see show_mem()
> & get_discontig_info(). (Maybe I misunderstood your proposal here.)
Sorry, I didn't mean to change anything related to mem_map. I just
described the current state wrongly. From the following piece of code
in discontig_paging_init() I thought that for each clump we will have
PLAT_CLUMPSIZE/PAGE_SIZE struct page entries in the mem_map array. I
missed the readjustment of npages below...
mycnodeid = boot_get_local_cnodeid();
for (cnodeid = 0; cnodeid < numnodes; cnodeid++) {
	...
	page = NODE_DATA(cnodeid)->node_mem_map;
	bdp = BOOT_NODE_DATA(cnodeid)->bdata;
	while (bdp->node_low_pfn) {
		kaddr = (unsigned long)__va(bdp->node_boot_start);
		ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
		while (kaddr < ekaddr) {
			node_data[mycnodeid]->clump_mem_map_base[PLAT_CLUMP_MEM_MAP_INDEX(kaddr)] = page;
			npages = PLAT_CLUMPSIZE/PAGE_SIZE;
			/* readjust npages when the block ends before a full clump */
			if (kaddr + (npages<<PAGE_SHIFT) > ekaddr)
				npages = (ekaddr - kaddr) >> PAGE_SHIFT;
			for (i = 0; i < npages; i++, page++, kaddr += PAGE_SIZE)
				page->virtual = (void*)kaddr;
		}
		bdp++;
	}
	...
}
Which means: for each memory block we get from the ACPI SRAT table we
will have struct pages reserved, and no more. Hotpluggable (non-existent?)
memory should appear here too, I guess, but that should do no harm.
> On the SGI platform, "physical node number" has a very precise definition.
> This is not true on all architectures. On SGI, the physical number is bits
> [48:38] of the physical address. In addition, a system can run with a
> sparse set of physical node numbers. For example, a 3 node system could
> have physical node 512, 800 & 2012.
It's somewhat similar on the NEC Azusa/Asama, where one can get the
physical node number from the SAPIC_ID of the CPUs.
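As an aside, the SN2 scheme Jack describes (physical node number in bits
[48:38] of the physical address) amounts to a shift and a mask. A minimal
sketch, with illustrative names that are not the actual kernel macros:

```c
#include <stdint.h>

/* Hypothetical sketch of the SN2-style encoding described above:
 * the physical node number occupies bits [48:38] of the physical
 * address, i.e. 11 bits, giving node numbers 0..2047. */
#define SN2_NODE_SHIFT 38
#define SN2_NODE_MASK  0x7ffULL		/* 11 bits */

static inline unsigned int paddr_to_physnode(uint64_t paddr)
{
	/* shift the node field down and mask off everything else */
	return (unsigned int)((paddr >> SN2_NODE_SHIFT) & SN2_NODE_MASK);
}
```

This also shows why a sparse set of physical node numbers (512, 800, 2012)
is natural: the node field is just part of the address, not an index into
a dense table.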
> > > proximity domain numbers - these numbers are assigned by ACPI.
> > > Each platform must provide a platform specific function
> > > for mapping proximity node numbers to physical node numbers.
> >
> > The proximity domain numbers are unnecessary. They are just other
>
> Unfortunately, for SGI hardware, proximity domain numbers can't be the same
> as a physical node number. ACPI limits proximity domain numbers to 0..254.
> On SGI, physical node numbers are 0..2047. Fortunately, we found a way to
> compress the physical node number into a proximity domain number. In the
> future, though, our current "trick" may no longer work. If we can get
> the proximity domain numbers changed to 0..65K, then I
> agree that it could be the same as the physical node number.
> Is there any chance we can get this changed???
>
> > (compact) mapping. Only SGI uses the pxm numbers later as:
> > #define PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) \
> > __pa(SN2_KADDR(PLAT_PXM_TO_PHYS_NODE_NUMBER(nid_to_pxm_map[cnode]) ...
> > but it is clear that what they actually want to do is translate the
> > compact node id to a physical node id. They just misuse the PXM
> > translation tables for this. All reference to proximity domain numbers
> > can be eliminated after the ACPI setup phase. Maybe we need some map
> > when hotplugging and adjusting a physical->logical translation table,
> > but not in DISCONTIG.
The only place where the PXM is used (after ACPI based initialisation) is
the macro above. And there you do
cnode ---> PXM number ---> physical node id
I suggest something trivial: use a physnode_map[] table initialised
by the ACPI routines which detected the nodes. So just do
cnode ---> physical node id
The physnode_map table will have only numnodes entries and we will never
see PXM outside the ACPI init routines.
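The proposal above can be sketched in a few lines. This is an illustration
of the idea only; the names (physnode_map, register_node, cnode_to_physnode)
are hypothetical, not existing kernel symbols:

```c
/* Sketch of the proposed cnode -> physical node translation:
 * the ACPI SRAT parsing code fills physnode_map[] once, and no
 * later code ever needs to look at a PXM number again. */
#define MAX_NUMNODES 128

static int physnode_map[MAX_NUMNODES];	/* cnode -> physical node id */
static int numnodes;			/* number of detected nodes */

/* called once per node from the ACPI init routines */
static void register_node(int cnode, int physnode)
{
	physnode_map[cnode] = physnode;
	if (cnode >= numnodes)
		numnodes = cnode + 1;
}

static int cnode_to_physnode(int cnode)
{
	return physnode_map[cnode];
}
```

With this, PLAT_BOOTMEM_ALLOC_GOAL would call cnode_to_physnode(cnode)
directly instead of going through nid_to_pxm_map[] and the PXM tables.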
> > > - Memory is conceptually divided into chunks. A chunk is either
> > > completely present, or else the kernel assumes it is completely
> > > absent. Each node consists of a number of possibly discontiguous
> > > chunks.
> >
> > When reading the code I get the impression that the concept of a CHUNK
> > isn't really needed in the code. The definitions are misleading
> > because they suggest that CHUNKS are equally sized (there is a
> > CHUNKSHIFT) and we should expect ACPI to give us a bunch of
> > chunks. But all we really need these for is to check whether a
> > physical address is valid or to find out to which node a physical
> > address belongs. When building the mem_map and the page struct
> > entries we need to know whether a page is inside a valid memory block
> > or not, no matter what this memory block looks like, how big it is,
> > or whether it fits into one clump or not. On Azusa a chunk returned by
> > ACPI can span the whole node memory, thus the rule: "a clump is made
> > of chunks" is not valid.
>
> Agree that CHUNK is barely used. I think the way GRANULE is being used, it
> may replace the need for CHUNKs.
Fine! Then we can get rid of CHUNKS and PXMs in the mmzone*.h files?
> However, since IA64 doesn't currently implement a kern_addr_valid() function,
> CHUNK is not currently used.
>
> Do you know if kern_addr_valid() for IA64 is planned in the future???
No idea. Do you think we need such a thing? We lived without it for quite
a while...
> > > - each node has a single contiguous mem_map array. The array contains
> > > page struct entries for every page on the node. There are no "holes" in
> > > the mem_map array. The node data area (see below) has pointers to the
> > > start of the mem_map entries for each clump on the node.
> >
> > The mem_map array is the same on each node, copied from the boot_node
> > to all other nodes. It contains page_struct entries for ALL pages on
> > ALL nodes (if I interpret discontig_paging_init() correctly). The
> > first two sentences need to be reformulated.
>
> I think the first two sentences are correct, but the last one is
> misleading. Is this better:
>
> - each node has a single contiguous page_struct array. This array contains
> page struct entries for every page that is actually present on the node.
> There are no "holes" in the page_struct array for non-existent memory. Note
> that adjacent entries in the array are NOT necessarily for contiguous
> physical pages if there are multiple non-contiguous clumps on the node.
>
> The node data area (see below) has pointers to the start of the
> page_struct entries for each clump on the node.
Agreed. I misunderstood discontig_paging_init.
> > > PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) - Calculate a "goal" value to be
> > > passed to __alloc_bootmem_node for allocating structures on nodes so
> that they don't alias to the same line in the cache as the previously
> allocated structure. You can return 0 if your platform doesn't have this
> > > problem.
...
> I'm not real happy with this solution, but I think it works. To verify it,
> I added a printk right after the point in discontig.c that does the
> allocate:
...
>
> Looks ok, although the node 0 allocation is not necessarily ideal.
OK, looks reasonable. Will think of something similar for DIG64
platforms.
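The cache-aliasing idea behind PLAT_BOOTMEM_ALLOC_GOAL can be illustrated
with a trivial per-node running offset. This is purely a sketch of the
concept under assumed names (last_alloc_end, bootmem_alloc_goal,
CACHE_LINE_SIZE), not the SN2 implementation:

```c
#include <stdint.h>

/* Illustrative cache-coloring "goal": start the next per-node
 * allocation one cache line beyond where the previous allocation
 * on that node ended, so successive structures don't land on the
 * same cache line. */
#define CACHE_LINE_SIZE 128
#define NR_SKETCH_NODES 8

static uint64_t last_alloc_end[NR_SKETCH_NODES];  /* per-node end offset */

static uint64_t bootmem_alloc_goal(int node)
{
	/* offset the goal by one cache line past the last allocation */
	return last_alloc_end[node] + CACHE_LINE_SIZE;
}
```

The value returned would then be handed to __alloc_bootmem_node() as its
goal argument; platforms without the aliasing problem can simply return 0.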
> Consistency in naming is what is important. We should all agree on the
> terminology & variable naming conventions. We also need to be clear that
> the maximum node number is NOT the same as NR_NODES-1.
>
> If I understand your proposal,
>
> logical nodes are:
> values: 0..NR_NODES-1.
> names are (pick one) node, nodenum, lnode, cnode, ...
>
> physical nodes:
> values: are 0 .. ???
> names: pnode, physnode, ....
>
> Lets pick the names we want to use.
OK. I like node & physnode (the latter is pretty rare, I guess).
> > > PLAT_MAX_NODE_NUMBER - maximum physical node number plus 1
> >
> > And this one should be MAX_PHYS_NODES or NR_PHYS_NODES.
>
> These names are confusing. For example, the SGI SN2 system has
> maximum number of nodes is 128
> maximum node number 2047
>
> According to the current discontig patch for SN2:
> PLAT_MAX_NODE_NUMBER = 2048
> PLAT_MAX_COMPACT_NODES = 128
>
> (Note: I don't object to changing names, but we need both abstractions.)
I understand your objection. For the first macro we need to remind
readers that it is the highest possible physical node ID in the system.
PLAT_MAX_PHYSNODE_NUMBER sounds too long, so maybe:
MAX_PHYSNODE_ID = 2048 ?
Anyway, this one is only used for the physical_node_map[] which maps
a physnode to a cnode ID, so maybe we shouldn't worry too much about the
name.
Anyway, instead of PLAT_MAX_COMPACT_NODES I'd prefer
NR_NODES = 128
or
MAX_NUMNODES = 128,
whatever is used on other architectures.
Best regards,
Erich
2002-08-16 11:44 [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES Erich Focht
2002-08-16 21:53 ` Jack Steiner
2002-08-16 22:05 ` Martin J. Bligh
2002-08-16 22:13 ` Martin J. Bligh
2002-08-16 22:28 ` Jack Steiner
2002-08-16 23:46 ` Martin J. Bligh
2002-08-17 0:26 ` Jack Steiner
2002-08-19 16:33 ` Erich Focht [this message]
2002-08-19 21:34 ` Jack Steiner