From: Jack Steiner <steiner@sgi.com>
To: linux-ia64@vger.kernel.org
Subject: Re: [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES
Date: Mon, 19 Aug 2002 21:34:00 +0000 [thread overview]
Message-ID: <marc-linux-ia64-105590701905955@msgid-missing> (raw)
In-Reply-To: <marc-linux-ia64-105590701905942@msgid-missing>
>
> Hi Jack,
>
> thanks very much for the detailed comments. There was (at least) one
> mistake in my previous mail, sorry if that generated confusion. As far
> as I understand, you agree with getting rid of CHUNKS and replacing
> the one macro using them with GRANULE. Also I hope we can limit the
> usage of PXM to be only within arch/ia64/acpi.c.
Agree on both points.
>
> On Friday 16 August 2002 23:53, Jack Steiner wrote:
> > > things and get rid of some concepts. In my opinion we need only
> > > the following concepts inside DISCONTIGMEM:
> > > - node IDs (AKA compact node IDs or logical nodes).
> > > - physical node IDs
> > > - clumps (I'd prefer the name memory BANKS here, as a clump suggests
> > > something to be contiguous, without holes (German: Klumpen)).
> > >
> > > In the initialisation phase we need:
> > > - memory blocks (AKA chunks?) (contiguous pieces of memory on one
> > > node, provided by ACPI, only used for setup. No size or alignment
> > > expected. Needed later for paddr_to_nid() but that's all.)
> > > - proximity domains (only ACPI NUMA setup, invisible to the rest of
> > > the DISCONTIG code).
>
> > > Therefore a node would have several memory banks which are not
> > > necessarily adjacent in the physical memory space. There can be gaps
> > > or banks from other nodes interleaved. In the mem_map array there is
> > > space reserved for page struct entries of ALL pages of one bank,
> > > existent or not. Memory holes between banks don't build holes in the
> > > mem_map array.
> >
> > If the mem_map has entries for pages that dont exist, how do you handle
> > code that scans the mem_map array. How does code recognize & skip pages
> > associated with missing memory?? For examples, see show_mem()
> > & get_discontig_info(). (Maybe I misunderstood your proposal here).
>
> Sorry, I didn't mean to change anything related to mem_map. I just
> described the current state wrongly. From the following piece of code
> in discontig_paging_init() I thought that for each clump we will have
> PLAT_CLUMPSIZE/PAGE_SIZE struct page entries in the mem_map array. I
> missed the readjustment of npages below...
>
>
> mycnodeid = boot_get_local_cnodeid();
> for (cnodeid = 0; cnodeid < numnodes; cnodeid++) {
> page = NODE_DATA(cnodeid)->node_mem_map;
> bdp = BOOT_NODE_DATA(cnodeid)->bdata;
> while (bdp->node_low_pfn) {
> kaddr = (unsigned long)__va(bdp->node_boot_start);
> ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
> while (kaddr < ekaddr) {
> node_data[mycnodeid]->clump_mem_map_base[PLAT_CLUMP_MEM_MAP_INDEX(kaddr)] = page;
> npages = PLAT_CLUMPSIZE/PAGE_SIZE;
> if (kaddr + (npages<<PAGE_SHIFT) > ekaddr)
> npages = (ekaddr - kaddr) >> PAGE_SHIFT;
> for (i = 0; i < npages; i++, page++, kaddr += PAGE_SIZE)
> page->virtual = (void*)kaddr;
> }
> bdp++;
> }
> }
>
> Which means: for each memory block we get from the ACPI SRAT table we
> will have struct pages reserved. Not more. Hotpluggable (non-existent?)
> memory should appear here, too, I guess. But that should not harm.
>
>
>
> > On the SGI platform, "physical node number" has a very precise definition.
> > This is not true on all architectures. On SGI, the physical number is bits
> > [48:38] of the physical address. In addition, a system can run with a
> > sparse set of physical node numbers. For example, a 3 node system could
> > have physical node 512, 800 & 2012.
>
> It's somewhat similar on the NEC Azusa/Asama, where one can get the
> physical node number from the SAPIC_ID of the CPUs.
FWIW, physical node number is also in the sapic_id of the SGI systems too.
>
>
>
> > > > proximity domain numbers - these numbers are assigned by ACPI.
> > > > Each platform must provide a platform specific function
> > > > for mapping proximity node numbers to physical node numbers.
> > >
> > > The proximity domain numbers are unnecessary. They are just other
> >
> > Unfortunately, for SGI hardware, proximity domain numbers cant be the same
> > as a physical node number. ACPI limits proximity domain numbers to 0..254.
> > On SGI, physical node numbers are 0..2047. Fortunately, we found a way to
> > compress the physical node number into a proximity domain number. In the
> > future, though, our current "trick" may no longer work. If we can get the
> > the proximity domain numbers changed to 0..65K, then I
> > agree that it could be the same as the physical node number.
> > Is there any chance we can get this changed???
> >
> > > (compact) mapping. Only SGI uses the pxm numbers later as:
> > > #define PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) \
> > > __pa(SN2_KADDR(PLAT_PXM_TO_PHYS_NODE_NUMBER(nid_to_pxm_map[cnode])
> > > but it is clear that what they actually want to do is translate the
> > > compact node id to a physical node id. They just misuse the PXM
> > > translation tables for this. All reference to proximity domain numbers
> > > can be eliminated after the ACPI setup phase. Maybe we need some map
> > > when hotplugging and adjusting a physical->logical translation table,
> > > but not in DISCONTIG.
>
> The only place where the PXM is used (after ACPI based initialisation) is
> the macro above. And there you do
> cnode ---> PXM number ---> physical node id
> I suggest something trivial: use a physnode_map[] table initialised
> by the ACPI routines which detected the nodes. So just do
> cnode ---> physical node id
> The physnode_map table will have only numnodes entries and we will never
> see PXM outside the ACPI init routines.
Seems ok.
I assume there is still an ACPI platform-specific macro or function to convert
PXM numbers into physical node numbers.
>
>
>
> > > > - Memory is conceptually divided into chunks. A chunk is either
> > > > completely present, or else the kernel assumes it is completely
> > > > absent. Each node consists of a number of possibly discontiguous
> > > > chunks.
> > >
> > > When reading the code I get the impression that the concept of a CHUNK
> > > isn't really needed in the code. The definitions are misleading
> > > because they suggest that CHUNKS are equally sized (there is a
> > > CHUNKSHIFT) and we should expect ACPI to give us a bunch of
> > > chunks. But all we really need these for is to check whether a
> > > physical address is valid or to find out to which node a physical
> > > address belongs to. When building the mem_map and the page struct
> > > entries we need to know whether a page is inside a valid memory block
> > > or not, no matter how this memory block looks like, how big it is
> > > of whether it fits into one clump or not. On Azusa a chunk returned by
> > > ACPI can span the whole node memory, thus the rule: "a clump is made
> > > of chunks" is not valid.
> >
> > Agree that CHUNK is barely used. I think the way GRANULE is being used, it
> > may replace the need for CHUNKs.
>
> Fine! Then we can get rid of CHUNKS and PXMs in the mmzone*.h files?
PXM, yes.
I also think we can get rid of CHUNKs. If we decide we need need to support
kern_addr_valid(), we will need to use GRANULE the same way as we currently use
CHUNK. However, I think it is safe to wait until we understand the
requirement for kern_addr_valid() (as long as we know how to do it).
>
> > However, since IA64 doesnt current implement a kern_addr_valid() function,
> > CHUNK is not currently used.
> >
> > Do you know if kern_addr_valid() for IA64 is planned in the future???
> No idea. Do you think we need such a thing? We lived without it for quite
> a while...
No idea either...
>
> > > > - each node has a single contiguous mem_map array. The array contains
> > > > page struct entries for every page on the node. There are no "holes" in
> > > > the mem_map array. The node data area (see below) has pointers to the
> > > > start of the mem_map entries for each clump on the node.
> > >
> > > The mem_map array is the same on each node, copied from the boot_node
> > > to all other nodes. It contains page_struct entries for ALL pages on
> > > ALL nodes (if I interpret discontig_paging_init() correctly). The
> > > first two sentences need to be reformulated.
> >
> > I think the first two sentences are correct, but the last one is
> > misleading. Is this better:
> >
> > - each node has a single contiguous page_struct array. This array contains
> > page struct entries for every page that is actually present on the node
> > There are no "holes" in the page_struct array for non-existent memory. Note
> > that adjacent entries in the array are NOT necessarily for contiguous
> > physical pages if there are multiple non-contiguous clumps on the node.
> >
> > The node data area (see below) has pointers to the start of the
> > page_struct entries for each clump on the node.
>
> Agreed. I misunderstood discontig_paging_init.
>
>
> > > > PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) - Calculate a "goal" value to be
> > > > passed to __alloc_bootmem_node for allocating structures on nodes so
> > > > that they dont alias to the same line in the cache as the previous
> > > > allocated structure. You can return 0 if your platform doesnt have this
> > > > problem.
> > I'm not real happy with this solution, but I think it works. To verify it,
> > I added a printk right after the point in discontig.c that does the
> > allocate:
> >
> >
> > Looks ok, although the node 0 allocation is not necessarily ideal.
>
> OK, looks reasonable. Will think of something similar for DIG64
> platforms.
>
>
>
> > Consistency in naming is what is important. We should all agree on the
> > terminology & variable naming conventions. We also need to be clear that
> > maximum node node is NOT the same as NR_NODES-1.
> >
> > If I understand your proposal,
> >
> > locical nodes are:
> > values: 0..NR_NODES-1.
> > names are (pick one) node, nodenum, lnode, cnode, ...
> >
> > physical nodes:
> > values: are 0 .. ???
> > names: pnode, physnode, ....
> >
> > Lets pick the names we want to use.
>
> OK. I like node & physnode (the later is pretty rare, I guess).
>
> > > > PLAT_MAX_NODE_NUMBER - maximum physical node number plus 1
> > >
> > > And this one should be MAX_PHYS_NODES or NR_PHYS_NODES.
> >
> > These names are confusing. For example, the SGI SN2 system has
> > maximum number of nodes is 128
> > maximum node number 2047
> >
> > According to the current discontig patch for SN2:
> > PLAT_MAX_NODE_NUMBER = 2048
> > PLAT_MAX_COMPACT_NODES = 128
> >
> > (Note: I dont object to changing names, but we need both abstractions).
>
> I understand your objection. For the first macro we need to remind
> readers that it is the highest possible physical node ID in the system.
> PLAT_MAX_PHYSNODE_NUMBER sounds too long, so maybe:
> MAX_PHYSNODE_ID = 2048 ?
> Anyway, this one is only used for the physical_node_map[] which maps
> a physnode to a cnode ID, so maybe we shouldn't worry too much about the
> name.
> Anyway, instead of PLAT_MAX_COMPACT_NODES I'd prefer
> NR_NODES = 128
> or
> MAX_NUMNODES = 128,
> whatever is used on other architectures.
OK.
>
> Best regards,
> Erich
>
>
--
Thanks
Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
prev parent reply other threads:[~2002-08-19 21:34 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2002-08-16 11:44 [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES Erich Focht
2002-08-16 21:53 ` Jack Steiner
2002-08-16 22:05 ` Martin J. Bligh
2002-08-16 22:13 ` Martin J. Bligh
2002-08-16 22:28 ` Jack Steiner
2002-08-16 23:46 ` Martin J. Bligh
2002-08-17 0:26 ` Jack Steiner
2002-08-19 16:33 ` Erich Focht
2002-08-19 21:34 ` Jack Steiner [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=marc-linux-ia64-105590701905955@msgid-missing \
--to=steiner@sgi.com \
--cc=linux-ia64@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.