* [PATCH] Documentation/vm/numa
@ 2002-04-19 3:56 Mel
2002-04-19 5:18 ` Martin J. Bligh
0 siblings, 1 reply; 6+ messages in thread
From: Mel @ 2002-04-19 3:56 UTC (permalink / raw)
To: linux-kernel
Below is a small extension of the numa file in the vm Documentation
directory which tries to give a brief explanation of the pg_data_t and
zone_t structs. The patch is against 2.4.19pre7 but I think it will apply
to any 2.4.x or 2.5.x kernel. No code changes. Comments, corrections and
opinions very welcome
Mel
--- linux-2.4.19pre7.orig/Documentation/vm/numa Fri Aug 4 19:23:37 2000
+++ linux-2.4.19pre7.mel/Documentation/vm/numa Fri Apr 19 04:46:23 2002
@@ -1,4 +1,5 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
+Appended Apr 2002 by Mel Gorman <mel@csn.ul.ie>
The intent of this file is to have an uptodate, running commentary
from different people about NUMA specific code in the Linux vm.
@@ -39,3 +40,123 @@
NUMA port achieves more maturity. The call alloc_pages_node has been
added, so that drivers can make the call and not worry about whether
it is running on a NUMA or UMA platform.
+
+
+1.1 Nodes
+=========
+
+A node is described by the pg_data_t struct. Each node can have one or
+more of the three zone types ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA, but
+only one zone of each type. It is the responsibility of the buddy
+allocator to make sure pages are allocated from the proper nodes.
+
+It is declared as
+
+typedef struct pglist_data {
+ zone_t node_zones[MAX_NR_ZONES];
+ zonelist_t node_zonelists[GFP_ZONEMASK+1];
+ int nr_zones;
+ struct page *node_mem_map;
+ unsigned long *valid_addr_bitmap;
+ struct bootmem_data *bdata;
+ unsigned long node_start_paddr;
+ unsigned long node_start_mapnr;
+ unsigned long node_size;
+ int node_id;
+ struct pglist_data *node_next;
+} pg_data_t;
+
+ node_zones The zones for this node. Currently ZONE_HIGHMEM,
+ ZONE_NORMAL, ZONE_DMA.
+
+ node_zonelists This is the order of zones that allocations are
+ preferred from. build_zonelists() in page_alloc.c does
+ the work when called by free_area_init_core(). So a failed
+ allocation in ZONE_HIGHMEM may fall back to ZONE_NORMAL
+ and then to ZONE_DMA. See the buddy algorithm for details.
+
+ nr_zones Number of zones in this node, between 1 and 3
+
+ node_mem_map The first page of the physical block this node represents
+
+ valid_addr_bitmap Not certain, but appears to be a bitmap of where holes are in memory
+
+ bdata Used only when starting up the node. Mainly confined
+ to bootmem.c
+
+ node_start_paddr The starting physical address of the node?
+
+ node_start_mapnr This appears to be a "nice" place to put the zone inside
+ the larger mem_map. It's set during free_area_init_core().
+ Presumably there is some architecture-dependent way of
+ defining nice.
+
+ node_size The total number of pages in this node
+
+ node_id The ID of the node, starts at 0
+
+ node_next Pointer to next node in a linear list. NULL terminated
+
+1.2 Zones
+=========
+ Each pg_data_t node will be aware of one or more zones that it can
+allocate pages from. The possible zones are ZONE_HIGHMEM, ZONE_NORMAL
+and ZONE_DMA. There can only be one zone of each type per pg_data_t.
+Each zone is suitable for a particular use, but there is not necessarily
+a penalty for using the wrong zone as there is for using the wrong
+pg_data_t.
+
+ typedef struct zone_struct {
+ /*
+ * Commonly accessed fields:
+ */
+ spinlock_t lock;
+ unsigned long free_pages;
+ unsigned long pages_min, pages_low, pages_high;
+ int need_balance;
+
+ /*
+ * free areas of different sizes
+ */
+ free_area_t free_area[MAX_ORDER];
+
+ /*
+ * Discontig memory support fields.
+ */
+ struct pglist_data *zone_pgdat;
+ struct page *zone_mem_map;
+ unsigned long zone_start_paddr;
+ unsigned long zone_start_mapnr;
+
+ /*
+ * rarely used fields:
+ */
+ char *name;
+ unsigned long size;
+ } zone_t;
+
+ lock A lock to protect the zone
+ free_pages Total number of free pages in the zone
+ pages_min When free pages drop to pages_min, the allocator will
+ do the kswapd work itself in a synchronous fashion
+ pages_low When free pages drop to pages_low, kswapd is woken up
+ pages_high Once kswapd is woken, it won't sleep until pages_high pages
+ are free
+ need_balance A flag kswapd uses to determine if it needs to balance
+ free_area Used by the buddy algorithm
+ zone_pgdat Points to the parent pg_data_t
+ zone_mem_map The first page in mem_map this zone refers to
+ zone_start_paddr Physical address of zone
+ zone_start_mapnr Offset within mem_map?
+ name The string name of the zone
+ size Self explanatory
+
+1.3 Relationship
+================
+
+       pg_data_t ----------> pg_data_t ----------> pg_data_t ------> NULL
+        /  |  \               /  |  \               /  |  \
+ zone_t zone_t zone_t  zone_t zone_t zone_t  zone_t zone_t zone_t
+
* Re: [PATCH] Documentation/vm/numa
2002-04-19 3:56 [PATCH] Documentation/vm/numa Mel
@ 2002-04-19 5:18 ` Martin J. Bligh
2002-04-19 20:25 ` Eric W. Biederman
0 siblings, 1 reply; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-19 5:18 UTC (permalink / raw)
To: Mel, linux-kernel
> Below is a small extension of the numa file in the vm Documentation
> directory which tries to give a brief explanation of the pg_data_t and
> zone_t structs. The patch is against 2.4.19pre7 but I think it will apply
> to any 2.4.x or 2.5.x kernel. No code changes. Comments, corrections and
> opinions very welcome
Nice job. Thanks for doing this.
> + node_start_paddr The starting physical address of the node?
Yes. Note that it's an unsigned long, which doesn't work for ia32
with PAE (just as a for instance). Using a pfn (page frame number)
would be a better choice here, IMHO (see my definition of pfn below).
Same comment for zone_start_paddr.
> + node_start_mapnr This appears to be a "nice" place to put the zone inside
> +                  the larger mem_map. It's set during free_area_init_core().
> +                  Presumably there is some architecture-dependent way of
> +                  defining nice.
Nice would be a strange term to use in this context - it's an ugly
hack ;-) It's the offset of the lmem_map array for that node within
the global mem_map array, which doesn't really exist (I'm calling
them arrays, though they're really just pointers - we use them like
arrays though). mem_map is some arbitrary constant, set to point to
PAGE_OFFSET or something similar.
> + zone_start_mapnr Address inside mem_map ?
Yes. Same comment as above ;-)
Note that there are two possible ways to define a pfn, in my mind.
One would be page_phys_addr >> PAGE_SHIFT. The other would be the
offset of the struct page for that page within the mythical mem_map
array. I prefer the former, though it probably contradicts everyone
else ;-) It's useful to have some way to pass around a 36 bit address
inside a 32 bit field.
M.
* Re: [PATCH] Documentation/vm/numa
2002-04-19 5:18 ` Martin J. Bligh
@ 2002-04-19 20:25 ` Eric W. Biederman
2002-04-19 21:37 ` Martin J. Bligh
0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2002-04-19 20:25 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Mel, linux-kernel
"Martin J. Bligh" <Martin.Bligh@us.ibm.com> writes:
>
> Note that there are two possible ways to define a pfn, in my mind.
> One would be page_phys_addr >> PAGE_SHIFT. The other would be the
> offset of the struct page for that page within the mythical mem_map
> array. I prefer the former, though it probably contradicts everyone
> else ;-) It's useful to have some way to pass around a 36 bit address
> inside a 32 bit field.
A page frame number (pfn) is definitely the former
(page_phys_addr >> PAGE_SHIFT).
Eric
* Re: [PATCH] Documentation/vm/numa
2002-04-19 20:25 ` Eric W. Biederman
@ 2002-04-19 21:37 ` Martin J. Bligh
2002-04-20 16:05 ` Mel
0 siblings, 1 reply; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-19 21:37 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Mel, linux-kernel
>> Note that there are two possible ways to define a pfn, in my mind.
>> One would be page_phys_addr >> PAGE_SHIFT. The other would be the
>> offset of the struct page for that page within the mythical mem_map
>> array. I prefer the former, though it probably contradicts everyone
>> else ;-) It's useful to have some way to pass around a 36 bit address
>> inside a 32 bit field.
>
> A page frame number (pfn) is definitely the former
> (page_phys_addr >> PAGE_SHIFT).
That's how I'd conceptually define it ... unfortunately the latter
definition also matches for non-discontigmem machines, and it's easy to
think of it that way. I guess it's just everything that says "mapnr" in it
that needs killing then ... I'm off to make some patches.
M.
* Re: [PATCH] Documentation/vm/numa
2002-04-19 21:37 ` Martin J. Bligh
@ 2002-04-20 16:05 ` Mel
2002-04-20 19:47 ` Martin J. Bligh
0 siblings, 1 reply; 6+ messages in thread
From: Mel @ 2002-04-20 16:05 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: linux-kernel
Thanks for the clarification. I've attached the patch again below with the
corrections you suggested. Does it look OK?
--- linux-2.4.19pre7.orig/Documentation/vm/numa Fri Aug 4 19:23:37 2000
+++ linux-2.4.19pre7.mel/Documentation/vm/numa Sat Apr 20 16:58:52 2002
@@ -1,4 +1,6 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
+ Apr 2002 by Mel Gorman <mel@csn.ul.ie>
+Additional credit to Martin J. Bligh <Martin.Bligh@us.ibm.com>
The intent of this file is to have an uptodate, running commentary
from different people about NUMA specific code in the Linux vm.
@@ -39,3 +41,129 @@
NUMA port achieves more maturity. The call alloc_pages_node has been
added, so that drivers can make the call and not worry about whether
it is running on a NUMA or UMA platform.
+
+
+1.1 Nodes
+=========
+
+A node is described by the pg_data_t struct. Each node can have one or
+more of the three zone types ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA, but
+only one zone of each type. It is the responsibility of the buddy
+allocator to make sure pages are allocated from the proper nodes.
+
+It is declared as
+
+typedef struct pglist_data {
+ zone_t node_zones[MAX_NR_ZONES];
+ zonelist_t node_zonelists[GFP_ZONEMASK+1];
+ int nr_zones;
+ struct page *node_mem_map;
+ unsigned long *valid_addr_bitmap;
+ struct bootmem_data *bdata;
+ unsigned long node_start_paddr;
+ unsigned long node_start_mapnr;
+ unsigned long node_size;
+ int node_id;
+ struct pglist_data *node_next;
+} pg_data_t;
+
+ node_zones The zones for this node. Currently ZONE_HIGHMEM,
+ ZONE_NORMAL, ZONE_DMA.
+
+ node_zonelists This is the order of zones that allocations are
+ preferred from. build_zonelists() in page_alloc.c does
+ the work when called by free_area_init_core(). So a failed
+ allocation in ZONE_HIGHMEM may fall back to ZONE_NORMAL
+ and then to ZONE_DMA. See the buddy algorithm for details.
+
+ nr_zones Number of zones in this node, between 1 and 3
+
+ node_mem_map The first page of the physical block this node represents
+
+ valid_addr_bitmap Not certain, but appears to be a bitmap of where holes are in memory
+
+ bdata Used only when starting up the node. Mainly confined
+ to bootmem.c
+
+ node_start_paddr The starting physical address of the node. This doesn't
+ work very well as an unsigned long as it breaks for
+ ia32 with PAE, for example. A more suitable solution would
+ be to record this as a Page Frame Number (pfn). This could
+ be trivially defined as (page_phys_addr >> PAGE_SHIFT).
+ Alternatively, it could be the struct page * index inside
+ mem_map.
+
+ node_start_mapnr This gives the offset within lmem_map, which is
+ contained within the global mem_map. lmem_map is the
+ mapping of page frames for this node
+
+ node_size The total number of pages in this node
+
+ node_id The ID of the node, starts at 0
+
+ node_next Pointer to next node in a linear list. NULL terminated
+
+1.2 Zones
+=========
+ Each pg_data_t node will be aware of one or more zones that it can
+allocate pages from. The possible zones are ZONE_HIGHMEM, ZONE_NORMAL
+and ZONE_DMA. There can only be one zone of each type per pg_data_t.
+Each zone is suitable for a particular use, but there is not necessarily
+a penalty for using the wrong zone as there is for using the wrong
+pg_data_t.
+
+ typedef struct zone_struct {
+ /*
+ * Commonly accessed fields:
+ */
+ spinlock_t lock;
+ unsigned long free_pages;
+ unsigned long pages_min, pages_low, pages_high;
+ int need_balance;
+
+ /*
+ * free areas of different sizes
+ */
+ free_area_t free_area[MAX_ORDER];
+
+ /*
+ * Discontig memory support fields.
+ */
+ struct pglist_data *zone_pgdat;
+ struct page *zone_mem_map;
+ unsigned long zone_start_paddr;
+ unsigned long zone_start_mapnr;
+
+ /*
+ * rarely used fields:
+ */
+ char *name;
+ unsigned long size;
+ } zone_t;
+
+ lock A lock to protect the zone
+ free_pages Total number of free pages in the zone
+ pages_min When free pages drop to pages_min, the allocator will
+ do the kswapd work itself in a synchronous fashion
+ pages_low When free pages drop to pages_low, kswapd is woken up
+ pages_high Once kswapd is woken, it won't sleep until pages_high pages
+ are free
+ need_balance A flag kswapd uses to determine if it needs to balance
+ free_area Used by the buddy algorithm
+ zone_pgdat Points to the parent pg_data_t
+ zone_mem_map The first page in mem_map this zone refers to
+ zone_start_paddr Same note as for node_start_paddr
+ zone_start_mapnr Same note as for node_start_mapnr
+ name The string name of the zone
+ size Self explanatory
+
+1.3 Relationship
+================
+
+       pg_data_t ----------> pg_data_t ----------> pg_data_t ------> NULL
+        /  |  \               /  |  \               /  |  \
+ zone_t zone_t zone_t  zone_t zone_t zone_t  zone_t zone_t zone_t
+
+Note that there are not necessarily three zones per pg_data_t node.
* Re: [PATCH] Documentation/vm/numa
2002-04-20 16:05 ` Mel
@ 2002-04-20 19:47 ` Martin J. Bligh
0 siblings, 0 replies; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-20 19:47 UTC (permalink / raw)
To: Mel; +Cc: linux-kernel
> + node_start_paddr The starting physical address of the node. This doesn't
> + work really well as an unsigned long as it breaks for
> + ia32 with PAE for example. A more suitable solution would be
> + to record this as a Page Frame Number (pfn). This could be
> + trivially defined as (page_phys_addr >> PAGE_SHIFT).
This looks fine.
> + Alternatively, it could be the struct page * index inside
> + mem_map.
But I'd just omit this last sentence ... that only works for machines
with a contig mem_map (not NUMA), and it's kind of an accidental
kludge that happens to work in some cases, not the proper definition
(what I originally posted was confusing).
I actually submitted a patch a couple of days ago to fix these to
a pfn instead. If that gets in, we can update the documentation then.
M.