From mboxrd@z Thu Jan  1 00:00:00 1970
From: Erich Focht
Date: Fri, 16 Aug 2002 11:44:33 +0000
Subject: [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES
Message-Id:
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

Hi Tony,

hmmm, no comments to your post yesterday... maybe we find more people
interested in this on the lia64 ML?

Actually I wanted to make some suggestions regarding the names, but
after looking at some code I'd rather suggest simplifying things and
getting rid of some concepts. In my opinion we need only the following
concepts inside DISCONTIGMEM:

 - node IDs (AKA compact node IDs or logical nodes),
 - physical node IDs,
 - clumps (I'd prefer the name memory BANKS here, as a clump suggests
   something contiguous, without holes (German: Klumpen)).

In the initialisation phase we need:

 - memory blocks (AKA chunks?): contiguous pieces of memory on one
   node, provided by ACPI, only used for setup. No size or alignment
   expected. Needed later for paddr_to_nid(), but that's all.
 - proximity domains (only ACPI NUMA setup, invisible to the rest of
   the DISCONTIG code).

This reduces the number of platform specific macros considerably and
should improve the readability of the code.

Therefore a node would have several memory banks which are not
necessarily adjacent in the physical memory space. There can be gaps,
or banks from other nodes interleaved. In the mem_map array there is
space reserved for page struct entries of ALL pages of one bank,
existent or not. Memory holes between banks don't create holes in the
mem_map array.

Appended are some comments to the mem.txt attachment, somewhat
lengthy, but explaining in more detail what I summarized above.

---------- comments to mem.txt (in include/asm-ia64/mmzone.h) ---------

> - Nodes are numbered several ways:
>
>   compact node numbers - compact node numbers are a dense numbering
>	of all the nodes in the system.
>	An N node system will have compact nodes numbered 0 .. N-1.
>	There is no significance to the node numbers. The compact node
>	number assigned to a specific physical node may vary from boot
>	to boot. The boot node is not necessarily node 0.

I'd prefer to call them "logical node numbers" or just "node numbers",
similar to CPUs. We don't have compact CPU IDs.

>   proximity domain numbers - these numbers are assigned by ACPI.
>	Each platform must provide a platform specific function for
>	mapping proximity node numbers to physical node numbers.

The proximity domain numbers are unnecessary. They are just other
"physical" node numbers which are only interesting in the setup phase,
when the ACPI _PXM numbers help to build the physical to logical
(compact) mapping. Only SGI uses the pxm numbers later, as:

#define PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) \
	__pa(SN2_KADDR(PLAT_PXM_TO_PHYS_NODE_NUMBER(nid_to_pxm_map[cnode]) ...

but it is clear that what they actually want to do is translate the
compact node id to a physical node id. They just misuse the PXM
translation tables for this. All references to proximity domain
numbers can be eliminated after the ACPI setup phase. Maybe we need
some map when hotplugging, for adjusting a physical->logical
translation table, but not in DISCONTIG.

> - Memory is conceptually divided into chunks. A chunk is either
>   completely present, or else the kernel assumes it is completely
>   absent. Each node consists of a number of possibly discontiguous
>   chunks.

When reading the code I get the impression that the concept of a CHUNK
isn't really needed in the code. The definitions are misleading because
they suggest that CHUNKS are equally sized (there is a CHUNKSHIFT) and
we should expect ACPI to give us a bunch of chunks. But all we really
need these for is to check whether a physical address is valid, or to
find out which node a physical address belongs to.
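To make that last point concrete: checking validity and finding the
owning node of a physical address needs nothing more than the list of
memory blocks delivered by ACPI. A minimal sketch (the structure
layout, the MAX_NR_MEMBLKS name and the plain-integer addresses are my
assumptions for illustration, not code from the patch):

```c
#include <assert.h>

#define MAX_NR_MEMBLKS	64	/* assumed limit, one entry per SRAT block */

/* One contiguous block of memory reported by the ACPI SRAT; no size
 * or alignment assumptions, just a range and the node it belongs to. */
struct memblk {
	unsigned long	start;
	unsigned long	size;
	int		nid;
};

static struct memblk memblks[MAX_NR_MEMBLKS];
static int nr_memblks;

/* Scan the block list; returns the owning node, or -1 for a hole. */
static int paddr_to_nid(unsigned long paddr)
{
	int i;

	for (i = 0; i < nr_memblks; i++)
		if (paddr >= memblks[i].start &&
		    paddr <  memblks[i].start + memblks[i].size)
			return memblks[i].nid;
	return -1;
}
```

A linear scan is good enough here, since this is only needed at setup
time and for the occasional paddr_to_nid() query.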
When building the mem_map and the page struct entries we need to know
whether a page is inside a valid memory block or not, no matter what
this memory block looks like, how big it is, or whether it fits into
one clump or not. On Azusa a chunk returned by ACPI can span the whole
node memory, so the rule "a clump is made of chunks" does not hold.

I tried to find the places where the CHUNKs are used:

 - PLAT_CHUNKNUM: used by SGI for kern_addr_valid in the form
   VALIDCHUNK(PLAT_CHUNKNUM(_kav)), but VALIDCHUNK always returns 1!
   So it is not needed!
 - PLAT_CHUNKSIZE: only used in CHUNKROUNDUP in discontig.c. I think
   we can recode this to round up to a GRANULE boundary; that's what
   we really want, I guess.
 - *_CHUNKSHIFT is only used for PLAT_CHUNKSIZE and PLAT_CHUNKNUM.

On NEC Azusa ACPI returns each available contiguous memory block as
one SRAT table entry. The size and the alignment can vary; there are
no fixed size chunks. For building up the clumps, we don't need to
know anything about these chunks! If a clump has holes, the setup
routine will take care of them. All we need is the list of memory
blocks delivered by ACPI and their assignment to nodes. The maximum
number of memory blocks expected is currently set to PLAT_MAXCLUMPS.
I think this is wrong, as a clump can contain multiple memory blocks.

I would like to eliminate the CHUNK concept and the need for setting a
lot of CHUNK related macros for each platform. All we really need is
MAX_NR_MEMBLKS, and only the setup routines will deal with these
blocks. Call the ACPI memory blocks CHUNKS again, if you want, but
they are only needed in the setup phase related to ACPI and shouldn't
need their own philosophy within DISCONTIG.

> - A contiguous group of memory chunks that reside on the same node
>   are referred to as a clump. Note that a clump may be partially
>   present. (Note, on some hardware implementations, a clump is the
>   same as a memory bank or a DIMM).
>
> - a node consists of multiple clumps of memory.
>   From a NUMA perspective, accesses to all clumps on the node have
>   the same latency. Except for zone issues, the clumps are treated
>   as equivalent for allocation/performance purposes.
>
> - each node has a single contiguous mem_map array. The array contains
>   page struct entries for every page on the node. There are no
>   "holes" in the mem_map array. The node data area (see below) has
>   pointers to the start of the mem_map entries for each clump on the
>   node.

The mem_map array is the same on each node, copied from the boot_node
to all other nodes. It contains page_struct entries for ALL pages on
ALL nodes (if I interpret discontig_paging_init() correctly). The
first two sentences need to be reformulated.

> - each platform is responsible for defining the following constants
>   & functions:
>
>   PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) - Calculate a "goal" value to
>	be passed to __alloc_bootmem_node for allocating structures on
>	nodes so that they don't alias to the same line in the cache
>	as the previously allocated structure. You can return 0 if
>	your platform doesn't have this problem.
>	(Note: need better solution but works for now ZZZ).

Either I misunderstood something, or the definition in
include/asm-ia64/sn/sn2/mmzone_sn2.h doesn't really unalias the
cachelines. This would be nice to have!

>   PLAT_CHUNKSIZE - defines the size of the platform memory chunk.

Get rid of this.

>   PLAT_CHUNKNUM(kaddr) - takes a kaddr & returns its chunk number

Get rid of this.

>   PLAT_CLUMP_MEM_MAP_INDEX(kaddr) - Given a kaddr, find the index
>	into the clump_mem_map_base array of the page struct entry for
>	the first page of the clump.
>
>   PLAT_CLUMP_OFFSET(kaddr) - find the byte offset of a kaddr within
>	the clump that contains it.
>
>   PLAT_CLUMPSIZE - defines the size in bytes of the smallest clump
>	supported on the platform.

This definition is misleading. The clumps are all the same size.
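Just to be explicit about what the two PLAT_CLUMP_* macros boil down
to when all clumps have the same power-of-two size: a shift and a
mask. (The constants and names below are invented for illustration,
not the DIG64 or SN2 values, and kaddr is treated as a plain integer
offset rather than a real ia64 kernel address.)

```c
#include <assert.h>

#define CLUMP_SHIFT	30			/* hypothetical 1GB clumps */
#define CLUMP_SIZE	(1UL << CLUMP_SHIFT)

/* Index into the per-node clump_mem_map_base array for the clump
 * containing kaddr. */
static unsigned long clump_mem_map_index(unsigned long kaddr)
{
	return kaddr >> CLUMP_SHIFT;
}

/* Byte offset of kaddr within the clump that contains it. */
static unsigned long clump_offset(unsigned long kaddr)
{
	return kaddr & (CLUMP_SIZE - 1);
}
```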
Suppose you have banks of 1GB which you want to call clumps (for me
"bank" sounds better than "clump" because I can associate it with
something I know from looking inside a computer). The minimum size of
a bank is 128MB, because this is the smallest DIMM you can insert.
Setting PLAT_CLUMPSIZE to 128MB leads to too small page struct lists
when setting up the mem_map (at least on DIG64). I'd suggest instead:

    PLAT_CLUMPSIZE - defines the size in bytes of the biggest clump
	supported on the platform. Make sure that
	(PLAT_CLUMPS_PER_NODE * PLAT_CLUMPSIZE) is big enough for the
	maximum memory per node supported by the platform.

>   PLAT_CLUMPS_PER_NODE - maximum number of clumps per node
>
>   PLAT_MAXCLUMPS - maximum number of clumps on all nodes combined
>
>   PLAT_MAX_COMPACT_NODES - maximum number of nodes in a system. (do
>	not confuse this with the maximum node number. Nodes can be
>	sparsely numbered).

The name for this is MAX_NUMNODES or just NR_NODES. There was a patch
from IBM changing everything to NR_NODES. That's also why I prefer
calling compact nodes just "nodes".

>   PLAT_MAX_NODE_NUMBER - maximum physical node number plus 1

And this one should be MAX_PHYS_NODES or NR_PHYS_NODES.

>   PLAT_PXM_TO_PHYS_NODE_NUMBER(pxm) - convert a proximity_domain
>	number (from ACPI) into a physical node number

Get rid of this. Not needed outside the ACPI SRAT/SLIT interpretation
routines.

Ideas? Comments?

Regards,
Erich

On Thursday 15 August 2002 20:05, Luck, Tony wrote:
> Attached is the preamble to mmzone.h, which describes how
> the ia64 discontig patch uses "CLUMPS" and "CHUNKS" to
> split up memory into various sized pieces to make handling
> easier for different parts of kernel code. It doesn't
> mention "GRANULES", which are yet another ia64ism for
> keeping track of aggregates of memory which aren't directly
> related to discontig memory support, but I thought that I'd
> include them here, so we covered every kind of aggregate.
>
> I'm spawning this thread to try to come up with some good
> documentation for all of the above concepts, to make the
> discontig patch easier to understand, and thus make it more
> likely to be accepted, and easier to maintain the code.
>
> The Atlas authors are not particularly attached to the
> "CLUMP" and "CHUNK" names, and GRANULE was more or less
> disowned at birth by its author (see the comment in pgtable.h),
> so if you have better names, please suggest them!
>
> Definitions:
>
> GRANULE - contiguous, self-sized-aligned block of memory, all
> of which exists and has the same physical caching attributes.
> The kernel maps memory at this granularity using a single
> TLB entry (hence the alignment and cache-attribute requirements).
>
> CHUNK - A (usually) larger memory area, all of which exists.
>
> CLUMP - A (potentially) even larger memory area, providing only a
> base address alignment on which CHUNKS of memory may be found.
> E.g. the base address for a node (or memory bank within a node).
> On systems that need to set the CHUNK size greater than the CLUMP
> size, only a few CHUNKS at the start of a CLUMP exist.
>
>
> Rationale - Hardware designers have had various degrees of
> "creativity" when coming up with memory maps for machines. Linux
> needs an efficient way of getting from a physical address to the
> page structure that contains all the information about the page.
> In a machine with contiguous memory, we simply allocate an array
> of page structures and use the physical page number as an index
> into the array. CLUMPS and CHUNKS provide for an efficient way
> to get from a sparse physical page number to the page structure.
> On many systems the CLUMP may be the same size as the CHUNK.
>
> -Tony
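P.S.: to illustrate the CHUNKROUNDUP change I proposed above, rounding
up to a granule boundary is just the usual align-up idiom. (The 16MB
granule value and the macro name here are only examples of mine, not a
statement about any particular config.)

```c
#include <assert.h>

/* Example granule size only; the real value is a platform choice. */
#define GRANULE_SHIFT	24			/* 16MB */
#define GRANULE_SIZE	(1UL << GRANULE_SHIFT)

/* Round an address or size up to the next granule boundary; this is
 * all that CHUNKROUNDUP in discontig.c really needs to do. */
#define GRANULE_ROUNDUP(x) \
	(((unsigned long)(x) + GRANULE_SIZE - 1) & ~(GRANULE_SIZE - 1))
```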