public inbox for linux-kernel@vger.kernel.org
* [PATCH] Documentation/vm/numa
@ 2002-04-19  3:56 Mel
  2002-04-19  5:18 ` Martin J. Bligh
  0 siblings, 1 reply; 6+ messages in thread
From: Mel @ 2002-04-19  3:56 UTC (permalink / raw)
  To: linux-kernel

Below is a small extension of the numa file in the vm Documenation branch
which tries to give a brief explanation about pg_data_t and zone_t
structs. Patch is against 2.4.19pre7 but I think it'll apply to any 2.4.x
or 2.5.x kernel. No change of code etc etc etc. Comments, corrections and
opinions very welcome

			Mel



--- linux-2.4.19pre7.orig/Documentation/vm/numa	Fri Aug  4 19:23:37 2000
+++ linux-2.4.19pre7.mel/Documentation/vm/numa	Fri Apr 19 04:46:23 2002
@@ -1,4 +1,5 @@
 Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
+Appended Apr 2002 by Mel Gorman   <mel@csn.ul.ie>

 The intent of this file is to have an uptodate, running commentary
 from different people about NUMA specific code in the Linux vm.
@@ -39,3 +40,123 @@
 NUMA port achieves more maturity. The call alloc_pages_node has been
 added, so that drivers can make the call and not worry about whether
 it is running on a NUMA or UMA platform.
+
+
+Nodes
+=====
+
+A node is described by the pg_data_t struct. Each node can have one or
+more of the three zone types ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA,
+but only one zone of each type. It is the responsibility of the buddy
+allocator to make sure pages are allocated from the proper nodes.
+
+It is declared as
+
+typedef struct pglist_data {
+        zone_t node_zones[MAX_NR_ZONES];
+        zonelist_t node_zonelists[GFP_ZONEMASK+1];
+        int nr_zones;
+        struct page *node_mem_map;
+        unsigned long *valid_addr_bitmap;
+        struct bootmem_data *bdata;
+        unsigned long node_start_paddr;
+        unsigned long node_start_mapnr;
+        unsigned long node_size;
+        int node_id;
+        struct pglist_data *node_next;
+} pg_data_t;
+
+ node_zones        The zones for this node. Currently ZONE_HIGHMEM,
+                   ZONE_NORMAL, ZONE_DMA.
+
+ node_zonelists    The order of zones that allocations are preferred
+                   from. build_zonelists() in page_alloc.c sets this up
+                   when called by free_area_init_core(). A failed
+                   allocation from ZONE_HIGHMEM may fall back to
+                   ZONE_NORMAL or, failing that, to ZONE_DMA. See the
+                   buddy algorithm for details.
+
+ nr_zones          Number of zones in this node, between 1 and 3
+
+ node_mem_map      The first page of the physical block this node represents
+
+ valid_addr_bitmap A bitmap that appears to mark where "holes" are in
+                   the node's memory (not certain of this)
+
+ bdata             Used only when starting up the node. Mainly confined
+                   to bootmem.c
+
+ node_start_paddr  The starting physical address of the node?
+
+ node_start_mapnr  This appears to be a "nice" place to put the zone
+                   inside the larger mem_map. It's set during
+                   free_area_init_core(). Presumably there is some
+                   architecture-dependent way of defining nice.
+
+ node_size         The total number of pages in this node
+
+ node_id           The ID of the node, starts at 0
+
+ node_next         Pointer to next node in a linear list. NULL terminated
+
+1.2   Zones
+===========
+  Each pg_data_t node will be aware of one or more zones that it can
+allocate pages from. The possible zones are ZONE_HIGHMEM, ZONE_NORMAL
+and ZONE_DMA. There can only be one zone of each type per pg_data_t.
+Each zone is suitable for a particular use, but there is not necessarily
+a penalty for using the wrong zone like there is with the wrong
+pg_data_t.
+
+  typedef struct zone_struct {
+          /*
+           * Commonly accessed fields:
+           */
+          spinlock_t              lock;
+          unsigned long           free_pages;
+          unsigned long           pages_min, pages_low, pages_high;
+          int                     need_balance;
+
+          /*
+           * free areas of different sizes
+           */
+          free_area_t             free_area[MAX_ORDER];
+
+          /*
+           * Discontig memory support fields.
+           */
+          struct pglist_data      *zone_pgdat;
+          struct page             *zone_mem_map;
+          unsigned long           zone_start_paddr;
+          unsigned long           zone_start_mapnr;
+
+          /*
+           * rarely used fields:
+           */
+          char                    *name;
+          unsigned long           size;
+  } zone_t;
+
+ lock             A lock to protect the zone
+ free_pages       Total number of free pages in the zone
+ pages_min        When pages_min is reached, the allocator will do the
+                  kswapd work itself in a synchronous fashion
+ pages_low        When pages_low is reached, kswapd is woken up by the
+                  buddy allocator
+ pages_high       Once kswapd is woken, it won't sleep until pages_high pages
+                  are free
+ need_balance     A flag kswapd uses to determine if it needs to balance
+ free_area        Used by the buddy algorithm
+ zone_pgdat       Points to the parent pg_data_t
+ zone_mem_map     The first page in mem_map this zone refers to
+ zone_start_paddr Physical address of zone
+ zone_start_mapnr Address inside mem_map?
+ name             The string name of the zone
+ size             Self explanatory
+
+1.3   Relationship
+==================
+
+        pg_data_t ------->  pg_data_t ------->     pg_data_t -------> NULL
+           / | \               / | \                 / | \
+      -----  |  -----     -----  |  -----       -----  |  -----
+      |      |      |     |      |      |       |      |      |
+  zone_t  zone_t  zone_t zone_t zone_t zone_t zone_t zone_t zone_t
+
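To make the node/zone relationship in the diagram above concrete, here is a
minimal userspace sketch (not kernel code; the field names follow the patch,
everything else is simplified and illustrative) that walks the NULL-terminated
pg_data_t list and its zones:

```c
#include <stddef.h>

#define MAX_NR_ZONES 3  /* ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM */

/* Simplified userspace models of the structs described above. */
typedef struct zone_struct {
    unsigned long free_pages;
    const char *name;
} zone_t;

typedef struct pglist_data {
    zone_t node_zones[MAX_NR_ZONES];
    int nr_zones;                    /* between 1 and 3 */
    int node_id;
    struct pglist_data *node_next;   /* NULL-terminated, as in 1.3 */
} pg_data_t;

/* Walk every node, then every zone within it, exactly as the
 * relationship diagram suggests. */
unsigned long total_free_pages(pg_data_t *pgdat)
{
    unsigned long total = 0;
    for (; pgdat != NULL; pgdat = pgdat->node_next) {
        int i;
        for (i = 0; i < pgdat->nr_zones; i++)
            total += pgdat->node_zones[i].free_pages;
    }
    return total;
}
```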



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Documentation/vm/numa
  2002-04-19  3:56 [PATCH] Documentation/vm/numa Mel
@ 2002-04-19  5:18 ` Martin J. Bligh
  2002-04-19 20:25   ` Eric W. Biederman
  0 siblings, 1 reply; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-19  5:18 UTC (permalink / raw)
  To: Mel, linux-kernel

> Below is a small extension of the numa file in the vm Documenation branch
> which tries to give a brief explanation about pg_data_t and zone_t
> structs. Patch is against 2.4.19pre7 but I think it'll apply to any 2.4.x
> or 2.5.x kernel. No change of code etc etc etc. Comments, corrections and
> opinions very welcome

Nice job. Thanks for doing this.

> + node_start_paddr  The starting physical address of the node?

Yes. Note that it's an unsigned long, which doesn't work for ia32 
with PAE (just as a for instance). Using a pfn (page frame number) 
would be a better choice here, IMHO (see my definition of pfn below).
Same comment for zone_start_paddr.

> + node_start_mapnr  This appears to be a "nice" place to put the zone
> +                   inside the larger mem_map. It's set during
> +                   free_area_init_core(). Presumably there is some
> +                   architecture-dependent way of defining nice.

Nice would be a strange term to use in this context - it's an ugly
hack ;-) It's the offset of the lmem_map array for that node within 
the global mem_map array, which doesn't really exist (I'm calling 
them arrays, though they're really just pointers - we use them like 
arrays though). mem_map is some arbitrary constant, set to point to 
PAGE_OFFSET or something similar.
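In other words, node_start_mapnr is just pointer arithmetic on struct page
arrays. A tiny userspace illustration (the helper name and sizes are mine,
purely for demonstration):

```c
/* Trivial stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

/* node_start_mapnr is the offset of a node's lmem_map within the
 * global mem_map: the difference between the two pointers, measured
 * in struct page units. */
unsigned long start_mapnr(struct page *mem_map, struct page *lmem_map)
{
    return (unsigned long)(lmem_map - mem_map);
}
```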

> + zone_start_mapnr Address inside mem_map?

Yes. Same comment as above ;-)

Note that there are two possible ways to define a pfn, in my mind.
One would be page_phys_addr >> PAGE_SHIFT. The other would be the
offset of the struct page for that page within the mythical mem_map
array. I prefer the former, though it probably contradicts everyone
else ;-) It's useful to have some way to pass around a 36 bit address
inside a 32 bit field.
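Martin's first definition can be sketched in a few lines; note how a 36-bit
PAE byte address shifted down by PAGE_SHIFT fits comfortably in 32 bits (the
helper names here are mine, not kernel API):

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages, as on ia32 */

/* pfn = page_phys_addr >> PAGE_SHIFT: a 36-bit physical address
 * becomes a 24-bit frame number, so it fits in a 32-bit field. */
uint32_t paddr_to_pfn(uint64_t paddr)
{
    return (uint32_t)(paddr >> PAGE_SHIFT);
}

/* And back again: widen first, then shift up. */
uint64_t pfn_to_paddr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```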

M.



* Re: [PATCH] Documentation/vm/numa
  2002-04-19  5:18 ` Martin J. Bligh
@ 2002-04-19 20:25   ` Eric W. Biederman
  2002-04-19 21:37     ` Martin J. Bligh
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2002-04-19 20:25 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Mel, linux-kernel

"Martin J. Bligh" <Martin.Bligh@us.ibm.com> writes:
> 
> Note that there are two possible ways to define a pfn, in my mind.
> One would be page_phys_addr >> PAGE_SHIFT. The other would be the
> offset of the struct page for that page within the mythical mem_map
> array. I prefer the former, though it probably contradicts everyone
> else ;-) It's useful to have some way to pass around a 36 bit address
> inside a 32 bit field.

A page frame number (pfn) is definitely the former 
(page_phys_addr >> PAGE_SHIFT).

Eric


* Re: [PATCH] Documentation/vm/numa
  2002-04-19 20:25   ` Eric W. Biederman
@ 2002-04-19 21:37     ` Martin J. Bligh
  2002-04-20 16:05       ` Mel
  0 siblings, 1 reply; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-19 21:37 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Mel, linux-kernel

>> Note that there are two possible ways to define a pfn, in my mind.
>> One would be page_phys_addr >> PAGE_SHIFT. The other would be the
>> offset of the struct page for that page within the mythical mem_map
>> array. I prefer the former, though it probably contradicts everyone
>> else ;-) It's useful to have some way to pass around a 36 bit address
>> inside a 32 bit field.
> 
> A page frame number (pfn) is definitely the former 
> (page_phys_addr >> PAGE_SHIFT).

That's how I'd conceptually define it ... unfortunately the latter
definition also matches for non-discontigmem machines, and it's easy
to think of it that way. I guess it's just everything that says "mapnr"
in it that needs killing then ... I'm off to make some patches.

M.



* Re: [PATCH] Documentation/vm/numa
  2002-04-19 21:37     ` Martin J. Bligh
@ 2002-04-20 16:05       ` Mel
  2002-04-20 19:47         ` Martin J. Bligh
  0 siblings, 1 reply; 6+ messages in thread
From: Mel @ 2002-04-20 16:05 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel


Thanks for the clarification. I've the patch attached below again with the
corrections you made. Does it look ok?




--- linux-2.4.19pre7.orig/Documentation/vm/numa	Fri Aug  4 19:23:37 2000
+++ linux-2.4.19pre7.mel/Documentation/vm/numa	Sat Apr 20 16:58:52 2002
@@ -1,4 +1,6 @@
 Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
+        Apr 2002 by Mel Gorman   <mel@csn.ul.ie>
+Additional credit to Martin J. Bligh <Martin.Bligh@us.ibm.com>

 The intent of this file is to have an uptodate, running commentary
 from different people about NUMA specific code in the Linux vm.
@@ -39,3 +41,129 @@
 NUMA port achieves more maturity. The call alloc_pages_node has been
 added, so that drivers can make the call and not worry about whether
 it is running on a NUMA or UMA platform.
+
+
+Nodes
+=====
+
+A node is described by the pg_data_t struct. Each node can have one or
+more of the three zone types ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA,
+but only one zone of each type. It is the responsibility of the buddy
+allocator to make sure pages are allocated from the proper nodes.
+
+It is declared as
+
+typedef struct pglist_data {
+        zone_t node_zones[MAX_NR_ZONES];
+        zonelist_t node_zonelists[GFP_ZONEMASK+1];
+        int nr_zones;
+        struct page *node_mem_map;
+        unsigned long *valid_addr_bitmap;
+        struct bootmem_data *bdata;
+        unsigned long node_start_paddr;
+        unsigned long node_start_mapnr;
+        unsigned long node_size;
+        int node_id;
+        struct pglist_data *node_next;
+} pg_data_t;
+
+ node_zones        The zones for this node. Currently ZONE_HIGHMEM,
+                   ZONE_NORMAL, ZONE_DMA.
+
+ node_zonelists    The order of zones that allocations are preferred
+                   from. build_zonelists() in page_alloc.c sets this up
+                   when called by free_area_init_core(). A failed
+                   allocation from ZONE_HIGHMEM may fall back to
+                   ZONE_NORMAL or, failing that, to ZONE_DMA. See the
+                   buddy algorithm for details.
+
+ nr_zones          Number of zones in this node, between 1 and 3
+
+ node_mem_map      The first page of the physical block this node represents
+
+ valid_addr_bitmap A bitmap that appears to mark where "holes" are in
+                   the node's memory (not certain of this)
+
+ bdata             Used only when starting up the node. Mainly confined
+                   to bootmem.c
+
+ node_start_paddr  The starting physical address of the node. An unsigned
+                   long does not work well for this as it breaks on ia32
+                   with PAE, for example. A more suitable solution would
+                   be to record it as a Page Frame Number (pfn), which
+                   could be trivially defined as
+                   (page_phys_addr >> PAGE_SHIFT).
+                   Alternatively, it could be the struct page * index
+                   inside mem_map.
+
+ node_start_mapnr  The offset of this node's lmem_map within the global
+                   mem_map. lmem_map is the mapping of page frames for
+                   this node
+
+ node_size         The total number of pages in this node
+
+ node_id           The ID of the node, starts at 0
+
+ node_next         Pointer to next node in a linear list. NULL terminated
+
+1.2   Zones
+===========
+  Each pg_data_t node will be aware of one or more zones that it can
+allocate pages from. The possible zones are ZONE_HIGHMEM, ZONE_NORMAL
+and ZONE_DMA. There can only be one zone of each type per pg_data_t.
+Each zone is suitable for a particular use, but there is not necessarily
+a penalty for using the wrong zone like there is with the wrong
+pg_data_t.
+
+  typedef struct zone_struct {
+          /*
+           * Commonly accessed fields:
+           */
+          spinlock_t              lock;
+          unsigned long           free_pages;
+          unsigned long           pages_min, pages_low, pages_high;
+          int                     need_balance;
+
+          /*
+           * free areas of different sizes
+           */
+          free_area_t             free_area[MAX_ORDER];
+
+          /*
+           * Discontig memory support fields.
+           */
+          struct pglist_data      *zone_pgdat;
+          struct page             *zone_mem_map;
+          unsigned long           zone_start_paddr;
+          unsigned long           zone_start_mapnr;
+
+          /*
+           * rarely used fields:
+           */
+          char                    *name;
+          unsigned long           size;
+  } zone_t;
+
+ lock             A lock to protect the zone
+ free_pages       Total number of free pages in the zone
+ pages_min        When pages_min is reached, the allocator will do the
+                  kswapd work itself in a synchronous fashion
+ pages_low        When pages_low is reached, kswapd is woken up by the
+                  buddy allocator
+ pages_high       Once kswapd is woken, it won't sleep until pages_high pages
+                  are free
+ need_balance     A flag kswapd uses to determine if it needs to balance
+ free_area        Used by the buddy algorithm
+ zone_pgdat       Points to the parent pg_data_t
+ zone_mem_map     The first page in mem_map this zone refers to
+ zone_start_paddr Same note as for node_start_paddr
+ zone_start_mapnr Same note as for node_start_mapnr
+ name             The string name of the zone
+ size             Self explanatory
+
+1.3   Relationship
+==================
+
+        pg_data_t ------->  pg_data_t ------->     pg_data_t -------> NULL
+           / | \               / | \                 / | \
+      -----  |  -----     -----  |  -----       -----  |  -----
+      |      |      |     |      |      |       |      |      |
+  zone_t  zone_t  zone_t zone_t zone_t zone_t zone_t zone_t zone_t
+
+Note that there are not necessarily three zones per pg_data_t node.
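The three watermarks described above can be read as a small state machine.
As I read 2.4's __alloc_pages, falling below pages_low wakes kswapd, falling
below pages_min makes the caller reclaim synchronously, and kswapd keeps
working until pages_high is restored. A hedged userspace sketch (the enum,
struct layout and function are illustrative, not the kernel's actual
control flow):

```c
/* Simplified zone with just the watermark fields from zone_t. */
typedef struct {
    unsigned long free_pages;
    unsigned long pages_min, pages_low, pages_high;
    int need_balance;
} zone_t;

typedef enum { ZONE_OK, ZONE_WAKE_KSWAPD, ZONE_RECLAIM_SYNC } zone_state;

/* Classify a zone against its watermarks (pages_min < pages_low
 * < pages_high). Illustrative only. */
zone_state check_watermarks(zone_t *z)
{
    if (z->free_pages < z->pages_min) {
        z->need_balance = 1;
        return ZONE_RECLAIM_SYNC;   /* caller does kswapd's work itself */
    }
    if (z->free_pages < z->pages_low) {
        z->need_balance = 1;
        return ZONE_WAKE_KSWAPD;    /* kswapd runs until pages_high */
    }
    return ZONE_OK;
}
```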



* Re: [PATCH] Documentation/vm/numa
  2002-04-20 16:05       ` Mel
@ 2002-04-20 19:47         ` Martin J. Bligh
  0 siblings, 0 replies; 6+ messages in thread
From: Martin J. Bligh @ 2002-04-20 19:47 UTC (permalink / raw)
  To: Mel; +Cc: linux-kernel

> + node_start_paddr  The starting physical address of the node. An unsigned
> +                   long does not work well for this as it breaks on ia32
> +                   with PAE, for example. A more suitable solution would
> +                   be to record it as a Page Frame Number (pfn), which
> +                   could be trivially defined as
> +                   (page_phys_addr >> PAGE_SHIFT).

This looks fine.

> +                   Alternatively, it could be the struct page * index inside
> +                   mem_map.

But I'd just omit this last sentence ... that only works for machines 
with a contig mem_map (not NUMA), and it's kind of an accidental
kludge that happens to work in some cases, not the proper definition
(what I originally posted was confusing).

I actually submitted a patch a couple of days ago to fix these to
a pfn instead. If that gets in, we can update the documentation then.

M.


end of thread, other threads:[~2002-04-20 19:47 UTC | newest]

Thread overview: 6+ messages
2002-04-19  3:56 [PATCH] Documentation/vm/numa Mel
2002-04-19  5:18 ` Martin J. Bligh
2002-04-19 20:25   ` Eric W. Biederman
2002-04-19 21:37     ` Martin J. Bligh
2002-04-20 16:05       ` Mel
2002-04-20 19:47         ` Martin J. Bligh
