[RFC] [PATCH] Power Managed memory base enabling

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] [PATCH] Power Managed memory base enabling
@ 2007-03-05 18:18 Mark Gross
  2007-03-06  1:26 ` KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Mark Gross @ 2007-03-05 18:18 UTC (permalink / raw)
  To: linux-mm, linux-pm; +Cc: torvalds, akpm, mark.gross, neelam.chandwani

The following patch is to help enable both allocation and access based
memory power optimization policies for systems that have the capability
to put sticks of memory selectively into lower power states based on
workload requirement.

To be clear PM-memory will not be useful unless you have workloads that
can take advantage of it.  The identified workloads are not desktop
workloads.  However; there is a non-zero number of interested users with
applicable workloads that make pushing the enabling patches out to the
community worth while.  These workloads tend to be within network
elements and servers where memory utilization tracks traffic load.  

This patch is very simple.  It is independent of the ongoing
anti-fragmentation, page migration and hot remove activities.  It should
coexist and be complimentary to those.

The goals of this patch are:
* provide a method for identifying power managed memory that could change
state at runtime under policy manager and OS control.
* avoid start up allocations from putting kernel memory structures into
such memory at boot time.
* be minimal and transparent for platforms without such memory.

It makes use of the existing NUMA implementation.  

It implements a convention on the 4 bytes of "Proximity Domain ID"
within the SRAT memory affinity structure as defined in ACPI3.0a.  If
bit 31 is set, then the memory range represented by that PXM is assumed
to be power managed.  We are working on defining a "standard" for
identifying such memory areas as power manageable and progress committee
based.  

We are going with the above convention for the time being as it doesn't
violate ACPI specifications.  Some time in the future after the
committees are finished some of this code may need to be updated.

To exercise the capability on a platform with PM-memory, you will still
need to include a policy manager with some code to trigger the state
changes to enable transition into and out of a low power state. 

More will be done, but for now we would like to get this base enabling
into the upstream kernel as an initial step.

Thanks,

--mgross 

Signed-off-by: Mark Gross <mark.gross@intel.com>



diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/numa.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c
--- linux-2.6.20-mm2/arch/x86_64/mm/numa.c	2007-02-23 11:20:38.000000000 -0800
+++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c	2007-03-02 15:15:53.000000000 -0800
@@ -156,12 +156,55 @@
 }
 #endif
 
+/* we need a place to save the next start address to use for each node because
+ * we need to allocate the pgdata and bootmem for power managed memory in
+ * non-power managed nodes.  We do this by saving off where we can start
+ * allocating in the nodes and updating them as the boot up proceeds.
+ */
+static unsigned long bootmem_start[MAX_NUMNODES];
+
+
 static void * __init
 early_node_mem(int nodeid, unsigned long start, unsigned long end,
 	      unsigned long size)
 {
-	unsigned long mem = find_e820_area(start, end, size);
+	unsigned long mem;
 	void *ptr;
+	if (bootmem_start[nodeid] <= start) {
+		bootmem_start[nodeid] = start;
+	}
+
+	mem = -1L;
+	if (power_managed_node(nodeid)) {
+		int non_pm_node = find_closest_non_pm_node(nodeid);
+
+		if (!node_online(non_pm_node)) {
+			return NULL; /* expect nodeid to get setup on the next
+					pass of setup_node_boot_mem after
+					non_pm_node is online*/
+		} else {
+			/* We set up the allocation in the non_pm_node
+			 * get the end of non_pm_node boot allocations
+			 * allocate from there.
+			 */
+			unsigned int non_pm_end;
+
+			non_pm_end = (NODE_DATA(non_pm_node)->node_start_pfn +
+				NODE_DATA(non_pm_node)->node_spanned_pages)
+					<< PAGE_SHIFT;
+
+			mem = find_e820_area(bootmem_start[non_pm_node],
+					non_pm_end, size);
+			/* now increment bootmem_start for next call */
+			if (mem!= -1L)
+				bootmem_start[non_pm_node] =
+					round_up(mem + size, PAGE_SIZE);
+		}
+	} else {
+		mem = find_e820_area(bootmem_start[nodeid], end, size);
+		if (mem!= -1L)
+			bootmem_start[nodeid] = round_up(mem + size, PAGE_SIZE);
+	}	
 	if (mem != -1L)
 		return __va(mem);
 	ptr = __alloc_bootmem_nopanic(size,
@@ -180,6 +223,7 @@
 	unsigned long start_pfn, end_pfn, bootmap_pages, bootmap_size, bootmap_start; 
 	unsigned long nodedata_phys;
 	void *bootmap;
+	int temp_id;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
 
 	start = round_up(start, ZONE_ALIGN); 
@@ -219,8 +263,13 @@
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
+	if (power_managed_node(nodeid))
+		temp_id = find_closest_non_pm_node(nodeid);
+	else
+		temp_id = nodeid;
+
+	reserve_bootmem_node(NODE_DATA(temp_id), nodedata_phys, pgdat_size);
+	reserve_bootmem_node(NODE_DATA(temp_id), bootmap_start, bootmap_pages<<PAGE_SHIFT);
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
@@ -243,11 +292,21 @@
 	memmapsize = sizeof(struct page) * (end_pfn-start_pfn);
 	limit = end_pfn << PAGE_SHIFT;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
-	NODE_DATA(nodeid)->node_mem_map = 
-		__alloc_bootmem_core(NODE_DATA(nodeid)->bdata, 
-				memmapsize, SMP_CACHE_BYTES, 
-				round_down(limit - memmapsize, PAGE_SIZE), 
+	if (power_managed_node(nodeid)) {
+		int non_pm_node = find_closest_non_pm_node(nodeid);
+
+		NODE_DATA(nodeid)->node_mem_map =
+			__alloc_bootmem_core(NODE_DATA(non_pm_node)->bdata,
+				memmapsize, SMP_CACHE_BYTES,
+				round_down(limit - memmapsize, PAGE_SIZE),
 				limit);
+	} else {
+ 		NODE_DATA(nodeid)->node_mem_map =
+			__alloc_bootmem_core(NODE_DATA(nodeid)->bdata,
+				memmapsize, SMP_CACHE_BYTES,
+				round_down(limit - memmapsize, PAGE_SIZE),
+				limit);
+	}
 	printk(KERN_DEBUG "Node %d memmap at 0x%p size %lu first pfn 0x%p\n",
 			nodeid, NODE_DATA(nodeid)->node_mem_map,
 			memmapsize, NODE_DATA(nodeid)->node_mem_map);
@@ -266,7 +325,10 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_to_node[i] != NUMA_NO_NODE)
 			continue;
- 		numa_set_node(i, rr);
+		if (power_managed_node(rr))
+			numa_set_node(i,find_closest_non_pm_node(rr));
+		else
+			numa_set_node(i, rr);
 		rr = next_node(rr, node_online_map);
 		if (rr == MAX_NUMNODES)
 			rr = first_node(node_online_map);
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/srat.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c
--- linux-2.6.20-mm2/arch/x86_64/mm/srat.c	2007-02-23 11:20:38.000000000 -0800
+++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c	2007-03-02 15:15:53.000000000 -0800
@@ -28,6 +28,7 @@
 static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
+static int pm_node[MAX_NUMNODES];
 static int found_add_area __initdata;
 int hotadd_percent __initdata = 0;
 
@@ -299,7 +300,11 @@
 		return;
 	start = ma->base_address;
 	end = start + ma->length;
-	pxm = ma->proximity_domain;
+	pxm = ma->proximity_domain & 0x0000ffff;
+	if (ma->proximity_domain & (1<<31))
+		pm_node[pxm] = 1;
+	else
+		pm_node[pxm] = 0;
 	node = setup_node(pxm);
 	if (node < 0) {
 		printk(KERN_ERR "SRAT: Too many proximity domains.\n");
@@ -467,8 +472,6 @@
 	return acpi_slit->entry[index + node_to_pxm(b)];
 }
 
-EXPORT_SYMBOL(__node_distance);
-
 int memory_add_physaddr_to_nid(u64 start)
 {
 	int i, ret = 0;
@@ -479,5 +482,36 @@
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 
+int __power_managed_node(int srat_node)
+{
+	return pm_node[node_to_pxm(srat_node)];
+}
+
+int __power_managed_memory_present(void)
+{
+	int j;
+
+	for (j=0; j<MAX_LOCAL_APIC; j++) {
+		if(__power_managed_node(j) )
+			return 1;
+	}
+	return 0;
+}
+
+int __find_closest_non_pm_node(int nodeid)
+{
+	int i, dist, closest, temp;
+
+	dist = closest= 255;
+	for_each_node(i) {
+		if ((i != nodeid) && !power_managed_node(i)) {
+			temp = __node_distance(nodeid, i );
+			if (temp < dist) {
+				closest = i;
+				dist = temp;
+			}
+		}
+	}
+	return closest;
+}
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/include/linux/mm.h linux-2.6.20-mm2-monroe/include/linux/mm.h
--- linux-2.6.20-mm2/include/linux/mm.h	2007-02-23 11:20:40.000000000 -0800
+++ linux-2.6.20-mm2-monroe/include/linux/mm.h	2007-03-02 15:15:53.000000000 -0800
@@ -1226,5 +1226,9 @@
 
 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);
 
+int power_managed_memory_present(void);
+int power_managed_node(int srat_node);
+int find_closest_non_pm_node(int nodeid);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/bootmem.c linux-2.6.20-mm2-monroe/mm/bootmem.c
--- linux-2.6.20-mm2/mm/bootmem.c	2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/bootmem.c	2007-03-02 15:17:06.000000000 -0800
@@ -417,13 +417,14 @@
 void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
 				      unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
+	int i;
 
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal, 0);
-		if (ptr)
-			return ptr;
+	for_each_online_node(i) {
+		if ((!power_managed_node(i)) && (ptr =
+			__alloc_bootmem_core(NODE_DATA(i)->bdata, size,
+				align, goal, 0)))
+			return(ptr);
 	}
 	return NULL;
 }
@@ -463,16 +464,15 @@
 void * __init __alloc_bootmem_low(unsigned long size, unsigned long align,
 				  unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
+	int i;
 
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal,
-						ARCH_LOW_ADDRESS_LIMIT);
-		if (ptr)
-			return ptr;
+	for_each_online_node(i) {
+		if ((!power_managed_node(i)) && (ptr =
+			__alloc_bootmem_core(NODE_DATA(i)->bdata, size,
+				align, goal, ARCH_LOW_ADDRESS_LIMIT)))
+			return(ptr);
 	}
-
 	/*
 	 * Whoops, we cannot satisfy the allocation request.
 	 */
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
--- linux-2.6.20-mm2/mm/memory.c	2007-02-23 11:20:40.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/memory.c	2007-03-02 15:15:53.000000000 -0800
@@ -2882,3 +2882,29 @@
 	return buf - old_buf;
 }
 EXPORT_SYMBOL_GPL(access_process_vm);
+
+#ifdef __x86_64__
+extern int __power_managed_memory_present(void);
+extern int __power_managed_node(int srat_node);
+extern int __find_closest_non_pm_node(int nodeid);
+#else
+inline int __power_managed_memory_present(void) { return 0};
+inline int __power_managed_node(int srat_node) { return 0};
+inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
+#endif
+
+int power_managed_memory_present(void)
+{
+	return __power_managed_memory_present();
+}
+
+int power_managed_node(int srat_node)
+{
+	return __power_managed_node(srat_node);
+}
+
+int find_closest_non_pm_node(int nodeid)
+{
+	return __find_closest_non_pm_node(nodeid);
+}
+
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/mempolicy.c linux-2.6.20-mm2-monroe/mm/mempolicy.c
--- linux-2.6.20-mm2/mm/mempolicy.c	2007-02-23 11:20:40.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/mempolicy.c	2007-03-02 15:15:53.000000000 -0800
@@ -1617,8 +1617,13 @@
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
-		printk("numa_policy_init: interleaving failed\n");
+	if (power_managed_memory_present()) {
+		if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
+			printk("numa_policy_init: interleaving failed\n");
+	} else {
+		if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+			printk("numa_policy_init: interleaving failed\n");
+	}
 }
 
 /* Reset policy of current process to default */
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/page_alloc.c linux-2.6.20-mm2-monroe/mm/page_alloc.c
--- linux-2.6.20-mm2/mm/page_alloc.c	2007-02-23 11:20:40.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/page_alloc.c	2007-03-02 15:15:53.000000000 -0800
@@ -2308,8 +2308,17 @@
 					* sizeof(wait_queue_head_t);
 
  	if (system_state == SYSTEM_BOOTING) {
-		zone->wait_table = (wait_queue_head_t *)
-			alloc_bootmem_node(pgdat, alloc_size);
+		if(power_managed_node(pgdat->node_id)) {
+			int nid;
+
+			nid = find_closest_non_pm_node(pgdat->node_id);
+			zone->wait_table = (wait_queue_head_t *)
+				alloc_bootmem_node(NODE_DATA(nid), alloc_size);
+		} else
+		{
+			zone->wait_table = (wait_queue_head_t *)
+				alloc_bootmem_node(pgdat, alloc_size);
+		}
 	} else {
 		/*
 		 * This case means that a zone whose size was 0 gets new memory
@@ -2824,8 +2833,15 @@
 		end = ALIGN(end, MAX_ORDER_NR_PAGES);
 		size =  (end - start) * sizeof(struct page);
 		map = alloc_remap(pgdat->node_id, size);
-		if (!map)
+		if (!map) {
+		if(power_managed_node(pgdat->node_id)) {
+			int nid;
+
+			nid = find_closest_non_pm_node(pgdat->node_id);
+			map = alloc_bootmem_node(NODE_DATA(nid), size);
+		} else	
 			map = alloc_bootmem_node(pgdat, size);
+		}
 		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
 		printk(KERN_DEBUG
 			"Node %d memmap at 0x%p size %lu first pfn 0x%p\n",
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/slab.c linux-2.6.20-mm2-monroe/mm/slab.c
--- linux-2.6.20-mm2/mm/slab.c	2007-02-23 11:20:40.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/slab.c	2007-03-02 15:15:53.000000000 -0800
@@ -3378,6 +3378,7 @@
  *
  * Fallback to other node is possible if __GFP_THISNODE is not set.
  */
+
 static __always_inline void *
 __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 		   void *caller)
@@ -3391,6 +3392,9 @@
 	if (unlikely(nodeid == -1))
 		nodeid = numa_node_id();
 
+	if (power_managed_node(nodeid) )
+		nodeid = find_closest_non_pm_node(nodeid);
+
 	if (unlikely(!cachep->nodelists[nodeid])) {
 		/* Node not bootstrapped yet */
 		ptr = fallback_alloc(cachep, flags);
@@ -3664,6 +3668,8 @@
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
+	if (power_managed_node(nodeid) )
+			nodeid = find_closest_non_pm_node(nodeid);
 	return __cache_alloc_node(cachep, flags, nodeid,
 			__builtin_return_address(0));
 }
diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/sparse.c linux-2.6.20-mm2-monroe/mm/sparse.c
--- linux-2.6.20-mm2/mm/sparse.c	2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.20-mm2-monroe/mm/sparse.c	2007-03-02 15:15:53.000000000 -0800
@@ -49,6 +49,8 @@
 	struct mem_section *section = NULL;
 	unsigned long array_size = SECTIONS_PER_ROOT *
 				   sizeof(struct mem_section);
+	if (power_managed_node(nid))
+		nid = find_closest_non_pm_node(nid);
 
 	if (slab_is_available())
 		section = kmalloc_node(array_size, GFP_KERNEL, nid);
@@ -215,6 +217,9 @@
 	struct mem_section *ms = __nr_to_section(pnum);
 	int nid = sparse_early_nid(ms);
 
+	if (power_managed_node(nid))
+		nid = find_closest_non_pm_node(nid);
+
 	map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
 	if (map)
 		return map;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-05 18:18 [RFC] [PATCH] Power Managed memory base enabling Mark Gross
@ 2007-03-06  1:26 ` KAMEZAWA Hiroyuki
  2007-03-06 15:54   ` Mark Gross
  2007-03-06 15:09 ` David Rientjes
  2007-03-26 12:48 ` [linux-pm] " Pavel Machek
  2 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-06  1:26 UTC (permalink / raw)
  To: mgross; +Cc: linux-mm, linux-pm, torvalds, akpm, mark.gross, neelam.chandwani

On Mon, 5 Mar 2007 10:18:26 -0800
Mark Gross <mgross@linux.intel.com> wrote:

> It implements a convention on the 4 bytes of "Proximity Domain ID"
> within the SRAT memory affinity structure as defined in ACPI3.0a.  If
> bit 31 is set, then the memory range represented by that PXM is assumed
> to be power managed.  We are working on defining a "standard" for
> identifying such memory areas as power manageable and progress committee
> based.  
> 

This usage of bit 31 surprized me ;)
I think some vendor(sgi?) now using 4byte pxm...
no problem ? and othre OSs will handle this ?

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-05 18:18 [RFC] [PATCH] Power Managed memory base enabling Mark Gross
  2007-03-06  1:26 ` KAMEZAWA Hiroyuki
@ 2007-03-06 15:09 ` David Rientjes
  2007-03-06 16:47   ` Mark Gross
  2007-03-26 12:48 ` [linux-pm] " Pavel Machek
  2 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2007-03-06 15:09 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Mon, 5 Mar 2007, Mark Gross wrote:

> To exercise the capability on a platform with PM-memory, you will still
> need to include a policy manager with some code to trigger the state
> changes to enable transition into and out of a low power state. 
> 

Thanks for pushing this type of work to the community.

What type of policy manager did you have in mind for state transition?  
Since you're basing it on existing NUMA code, are you looking at something 
like /sys/devices/system/node/node*/power that would be responsible for 
migrating pages off the PM-memory it represents and then transitioning the 
hardware into a suspend or standby state?

The biggest concern is obviously going to be the interleaving.

> More will be done, but for now we would like to get this base enabling
> into the upstream kernel as an initial step.
> 

Might be a premature question, but will there be upstream support for 
transitioning the hardware state?  If so, it would be interesting to hear 
what the preliminary enter and exit latencies are for each.

Few comments on the patch follow.

> diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/numa.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c
> --- linux-2.6.20-mm2/arch/x86_64/mm/numa.c	2007-02-23 11:20:38.000000000 -0800
> +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c	2007-03-02 15:15:53.000000000 -0800
> @@ -156,12 +156,55 @@
>  }
>  #endif
>  
> +/* we need a place to save the next start address to use for each node because
> + * we need to allocate the pgdata and bootmem for power managed memory in
> + * non-power managed nodes.  We do this by saving off where we can start
> + * allocating in the nodes and updating them as the boot up proceeds.
> + */
> +static unsigned long bootmem_start[MAX_NUMNODES];
> +

When we're going through setup_node_bootmem(), we're already going to have 
the pm_node[] information populated for power management node detection. 
It can be represented by a nodemask (see below).  So the code in 
early_node_mem() could be simplified and more robust by eliminating 
bootmem_start[] and exporting nodes_parsed from srat.c.

We can get away with this because nodes_parsed is marked __initdata and 
will still be valid at this point.

>  static void * __init
>  early_node_mem(int nodeid, unsigned long start, unsigned long end,
>  	      unsigned long size)
>  {
> -	unsigned long mem = find_e820_area(start, end, size);
> +	unsigned long mem;
>  	void *ptr;
> +	if (bootmem_start[nodeid] <= start) {
> +		bootmem_start[nodeid] = start;
> +	}
> +
> +	mem = -1L;
> +	if (power_managed_node(nodeid)) {
> +		int non_pm_node = find_closest_non_pm_node(nodeid);
> +
> +		if (!node_online(non_pm_node)) {
> +			return NULL; /* expect nodeid to get setup on the next
> +					pass of setup_node_boot_mem after
> +					non_pm_node is online*/
> +		} else {
> +			/* We set up the allocation in the non_pm_node
> +			 * get the end of non_pm_node boot allocations
> +			 * allocate from there.
> +			 */
> +			unsigned int non_pm_end;
> +
> +			non_pm_end = (NODE_DATA(non_pm_node)->node_start_pfn +
> +				NODE_DATA(non_pm_node)->node_spanned_pages)
> +					<< PAGE_SHIFT;
> +
> +			mem = find_e820_area(bootmem_start[non_pm_node],
> +					non_pm_end, size);
> +			/* now increment bootmem_start for next call */
> +			if (mem!= -1L)
> +				bootmem_start[non_pm_node] =
> +					round_up(mem + size, PAGE_SIZE);
> +		}
> +	} else {
> +		mem = find_e820_area(bootmem_start[nodeid], end, size);
> +		if (mem!= -1L)
> +			bootmem_start[nodeid] = round_up(mem + size, PAGE_SIZE);
> +	}	
>  	if (mem != -1L)
>  		return __va(mem);
>  	ptr = __alloc_bootmem_nopanic(size,

Then the change above becomes much easier:

	if (power_managed_node(nodeid)) {
		int new_node = node_remap(nodeid, *nodes_parsed, *pm_nodes);
		if (nodeid != new_node) {
			start = NODE_DATA(new_node)->node_start_pfn;
			end = start + NODE_DATA(new_node)->node_spanned_pages;
		}
	}
	mem = find_e820_area(start, end, size);

> diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/srat.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c
> --- linux-2.6.20-mm2/arch/x86_64/mm/srat.c	2007-02-23 11:20:38.000000000 -0800
> +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c	2007-03-02 15:15:53.000000000 -0800
> @@ -28,6 +28,7 @@
>  static nodemask_t nodes_parsed __initdata;
>  static struct bootnode nodes[MAX_NUMNODES] __initdata;
>  static struct bootnode nodes_add[MAX_NUMNODES];
> +static int pm_node[MAX_NUMNODES];
>  static int found_add_area __initdata;
>  int hotadd_percent __initdata = 0;
>  

I would recommend making this a nodemask that is an extern from 
include/asm-x86_64/numa.h:

	nodemask_t pm_nodes;

> @@ -479,5 +482,36 @@
>  
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>  
> +int __power_managed_node(int srat_node)
> +{
> +	return pm_node[node_to_pxm(srat_node)];
> +}
> +
> +int __power_managed_memory_present(void)
> +{
> +	int j;
> +
> +	for (j=0; j<MAX_LOCAL_APIC; j++) {
> +		if(__power_managed_node(j) )
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +int __find_closest_non_pm_node(int nodeid)
> +{
> +	int i, dist, closest, temp;
> +
> +	dist = closest= 255;
> +	for_each_node(i) {
> +		if ((i != nodeid) && !power_managed_node(i)) {
> +			temp = __node_distance(nodeid, i );
> +			if (temp < dist) {
> +				closest = i;
> +				dist = temp;
> +			}
> +		}
> +	}
> +	return closest;
> +}

Then all these functions become trivial:

	int __power_managed_node(int nid)
	{
		return node_isset(node_to_pxm(nid), pm_nodes);
	}

	int __power_managed_memory_present(void)
	{
		return !nodes_empty(pm_nodes);
	}

	int __find_closest_non_pm_node(int nid)
	{
		int node;
		node = next_node(nid, pm_nodes);
		if (node == MAX_NUMNODES)
			node = first_node(pm_nodes);
	}

> diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
> --- linux-2.6.20-mm2/mm/memory.c	2007-02-23 11:20:40.000000000 -0800
> +++ linux-2.6.20-mm2-monroe/mm/memory.c	2007-03-02 15:15:53.000000000 -0800
> @@ -2882,3 +2882,29 @@
>  	return buf - old_buf;
>  }
>  EXPORT_SYMBOL_GPL(access_process_vm);
> +
> +#ifdef __x86_64__
> +extern int __power_managed_memory_present(void);
> +extern int __power_managed_node(int srat_node);
> +extern int __find_closest_non_pm_node(int nodeid);
> +#else
> +inline int __power_managed_memory_present(void) { return 0};
> +inline int __power_managed_node(int srat_node) { return 0};
> +inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
> +#endif
> +
> +int power_managed_memory_present(void)
> +{
> +	return __power_managed_memory_present();
> +}
> +
> +int power_managed_node(int srat_node)
> +{
> +	return __power_managed_node(srat_node);
> +}
> +
> +int find_closest_non_pm_node(int nodeid)
> +{
> +	return __find_closest_non_pm_node(nodeid);
> +}
> +

Probably should reconsider extern declarations in .c files.

> diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/mempolicy.c linux-2.6.20-mm2-monroe/mm/mempolicy.c
> --- linux-2.6.20-mm2/mm/mempolicy.c	2007-02-23 11:20:40.000000000 -0800
> +++ linux-2.6.20-mm2-monroe/mm/mempolicy.c	2007-03-02 15:15:53.000000000 -0800
> @@ -1617,8 +1617,13 @@
>  	/* Set interleaving policy for system init. This way not all
>  	   the data structures allocated at system boot end up in node zero. */
>  
> -	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> -		printk("numa_policy_init: interleaving failed\n");
> +	if (power_managed_memory_present()) {
> +		if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
> +			printk("numa_policy_init: interleaving failed\n");
> +	} else {
> +		if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> +			printk("numa_policy_init: interleaving failed\n");
> +	}
>  }
>  
>  /* Reset policy of current process to default */

These prink comments are misleading since MPOL_DEFAULT doesn't attempt to 
set interleaving policy.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06  1:26 ` KAMEZAWA Hiroyuki
@ 2007-03-06 15:54   ` Mark Gross
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Gross @ 2007-03-06 15:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-pm, torvalds, akpm, mark.gross, neelam.chandwani

On Tue, Mar 06, 2007 at 10:26:28AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 5 Mar 2007 10:18:26 -0800
> Mark Gross <mgross@linux.intel.com> wrote:
> 
> > It implements a convention on the 4 bytes of "Proximity Domain ID"
> > within the SRAT memory affinity structure as defined in ACPI3.0a.  If
> > bit 31 is set, then the memory range represented by that PXM is assumed
> > to be power managed.  We are working on defining a "standard" for
> > identifying such memory areas as power manageable and progress committee
> > based.  
> > 
> 
> This usage of bit 31 surprized me ;)
It was not my first choice but, adding a new flag bit takes the ACPI
standards committee to rubber stamp the notion.  That is a work in
progress.  The "architects" are pondering the nuances and implications
of this subject as we speak.  I'm sure something wonderful is forth
coming.

We are trying to get this code out there to enable OSV support for a
product with a first generation of power managed memory coming out this
summer in the ATCA form factor, the MPCBL0050.

Its my hope that this convention will not be disruptive or create too
much legacy once the ACPI committee catches up with this technology.
Its not expected to be a problem, as there is only one publicly
available platform rolling out this year with it.

> I think some vendor(sgi?) now using 4byte pxm...

I don't know if SGI has any system that use all 4 bytes of PXM.  I did
notice that until recently the ACPI code in Linux only used the first
byte of that field calling the upper bytes as reserved.  This should be
the first code in Linux to overload the meeting of this bit.

> no problem ? and othre OSs will handle this ?
>
I hope there is no problem. I posted this RFC to find out ;)

I don't know if any other OS's know about this type of memory.  

--mgross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06 15:09 ` David Rientjes
@ 2007-03-06 16:47   ` Mark Gross
  2007-03-06 17:12     ` David Rientjes
  2007-03-07  2:40     ` David Rientjes
  0 siblings, 2 replies; 13+ messages in thread
From: Mark Gross @ 2007-03-06 16:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, Mar 06, 2007 at 07:09:14AM -0800, David Rientjes wrote:
> On Mon, 5 Mar 2007, Mark Gross wrote:
> 
> > To exercise the capability on a platform with PM-memory, you will still
> > need to include a policy manager with some code to trigger the state
> > changes to enable transition into and out of a low power state. 
> > 
> 
> Thanks for pushing this type of work to the community.
> 
> What type of policy manager did you have in mind for state transition?  
> Since you're basing it on existing NUMA code, are you looking at something 
> like /sys/devices/system/node/node*/power that would be responsible for 
> migrating pages off the PM-memory it represents and then transitioning the 
> hardware into a suspend or standby state?

For the initial version of HW that can do this we are stuck with
allocation based decisions where a complete solution needs page
migration.

Yes, a sysfs interface is being looked at to export the control to a
user mode daemon doing running some kind of policy manager, and if/when
page migration happens it will be hooked up to this interface.


> 
> The biggest concern is obviously going to be the interleaving.

Power friendly interleaving schemes will still be available.  They will
likely be limited to interleaving across at most 2 sticks.

The tests I've seen have shown that by-4 verses by-2 interleave on
modern hardware, isn't noticeable except for lmbench stream.  

It may not be a one size fits all technology.

> 
> > More will be done, but for now we would like to get this base enabling
> > into the upstream kernel as an initial step.
> > 
> 
> Might be a premature question, but will there be upstream support for 
> transitioning the hardware state?  If so, it would be interesting to hear 
> what the preliminary enter and exit latencies are for each.

The code MC registers to re-train the memory lanes are somewhat
protected and will be implemented in the platform FW / BIOS.  I don't
think code to do that will be pushed up stream. 

> 
> Few comments on the patch follow.
> 
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/numa.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c
> > --- linux-2.6.20-mm2/arch/x86_64/mm/numa.c	2007-02-23 11:20:38.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c	2007-03-02 15:15:53.000000000 -0800
> > @@ -156,12 +156,55 @@
> >  }
> >  #endif
> >  
> > +/* we need a place to save the next start address to use for each node because
> > + * we need to allocate the pgdata and bootmem for power managed memory in
> > + * non-power managed nodes.  We do this by saving off where we can start
> > + * allocating in the nodes and updating them as the boot up proceeds.
> > + */
> > +static unsigned long bootmem_start[MAX_NUMNODES];
> > +
> 
> When we're going through setup_node_bootmem(), we're already going to have 
> the pm_node[] information populated for power management node detection. 
> It can be represented by a nodemask (see below).  So the code in 
> early_node_mem() could be simplified and more robust by eliminating 
> bootmem_start[] and exporting nodes_parsed from srat.c.
> 
> We can get away with this because nodes_parsed is marked __initdata and 
> will still be valid at this point.
> 
> >  static void * __init
> >  early_node_mem(int nodeid, unsigned long start, unsigned long end,
> >  	      unsigned long size)
> >  {
> > -	unsigned long mem = find_e820_area(start, end, size);
> > +	unsigned long mem;
> >  	void *ptr;
> > +	if (bootmem_start[nodeid] <= start) {
> > +		bootmem_start[nodeid] = start;
> > +	}
> > +
> > +	mem = -1L;
> > +	if (power_managed_node(nodeid)) {
> > +		int non_pm_node = find_closest_non_pm_node(nodeid);
> > +
> > +		if (!node_online(non_pm_node)) {
> > +			return NULL; /* expect nodeid to get setup on the next
> > +					pass of setup_node_boot_mem after
> > +					non_pm_node is online*/
> > +		} else {
> > +			/* We set up the allocation in the non_pm_node
> > +			 * get the end of non_pm_node boot allocations
> > +			 * allocate from there.
> > +			 */
> > +			unsigned int non_pm_end;
> > +
> > +			non_pm_end = (NODE_DATA(non_pm_node)->node_start_pfn +
> > +				NODE_DATA(non_pm_node)->node_spanned_pages)
> > +					<< PAGE_SHIFT;
> > +
> > +			mem = find_e820_area(bootmem_start[non_pm_node],
> > +					non_pm_end, size);
> > +			/* now increment bootmem_start for next call */
> > +			if (mem!= -1L)
> > +				bootmem_start[non_pm_node] =
> > +					round_up(mem + size, PAGE_SIZE);
> > +		}
> > +	} else {
> > +		mem = find_e820_area(bootmem_start[nodeid], end, size);
> > +		if (mem!= -1L)
> > +			bootmem_start[nodeid] = round_up(mem + size, PAGE_SIZE);
> > +	}	
> >  	if (mem != -1L)
> >  		return __va(mem);
> >  	ptr = __alloc_bootmem_nopanic(size,
> 
> Then the change above becomes much easier:
> 
> 	if (power_managed_node(nodeid)) {
> 		int new_node = node_remap(nodeid, *nodes_parsed, *pm_nodes);
> 		if (nodeid != new_node) {
> 			start = NODE_DATA(new_node)->node_start_pfn;
> 			end = start + NODE_DATA(new_node)->node_spanned_pages;
> 		}
> 	}
> 	mem = find_e820_area(start, end, size);
> 

Let me give your idea a spin and get back to you. 


> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/srat.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c
> > --- linux-2.6.20-mm2/arch/x86_64/mm/srat.c	2007-02-23 11:20:38.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c	2007-03-02 15:15:53.000000000 -0800
> > @@ -28,6 +28,7 @@
> >  static nodemask_t nodes_parsed __initdata;
> >  static struct bootnode nodes[MAX_NUMNODES] __initdata;
> >  static struct bootnode nodes_add[MAX_NUMNODES];
> > +static int pm_node[MAX_NUMNODES];
> >  static int found_add_area __initdata;
> >  int hotadd_percent __initdata = 0;
> >  
> 
> I would recommend making this a nodemask that is an extern from 
> include/asm-x86_64/numa.h:
> 
> 	nodemask_t pm_nodes;
> 
> > @@ -479,5 +482,36 @@
> >  
> >  	return ret;
> >  }
> > -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >  
> > +int __power_managed_node(int srat_node)
> > +{
> > +	return pm_node[node_to_pxm(srat_node)];
> > +}
> > +
> > +int __power_managed_memory_present(void)
> > +{
> > +	int j;
> > +
> > +	for (j=0; j<MAX_LOCAL_APIC; j++) {
> > +		if(__power_managed_node(j) )
> > +			return 1;
> > +	}
> > +	return 0;
> > +}
> > +
> > +int __find_closest_non_pm_node(int nodeid)
> > +{
> > +	int i, dist, closest, temp;
> > +
> > +	dist = closest= 255;
> > +	for_each_node(i) {
> > +		if ((i != nodeid) && !power_managed_node(i)) {
> > +			temp = __node_distance(nodeid, i );
> > +			if (temp < dist) {
> > +				closest = i;
> > +				dist = temp;
> > +			}
> > +		}
> > +	}
> > +	return closest;
> > +}
> 
> Then all these functions become trivial:
> 
> 	int __power_managed_node(int nid)
> 	{
> 		return node_isset(node_to_pxm(nid), pm_nodes);
> 	}
> 
> 	int __power_managed_memory_present(void)
> 	{
> 		return !nodes_empty(pm_nodes);
> 	}
> 
> 	int __find_closest_non_pm_node(int nid)
> 	{
> 		int node;
> 		node = next_node(nid, pm_nodes);
> 		if (node == MAX_NUMNODES)
> 			node = first_node(pm_nodes);
> 	}
> 
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
> > --- linux-2.6.20-mm2/mm/memory.c	2007-02-23 11:20:40.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/mm/memory.c	2007-03-02 15:15:53.000000000 -0800
> > @@ -2882,3 +2882,29 @@
> >  	return buf - old_buf;
> >  }
> >  EXPORT_SYMBOL_GPL(access_process_vm);
> > +
> > +#ifdef __x86_64__
> > +extern int __power_managed_memory_present(void);
> > +extern int __power_managed_node(int srat_node);
> > +extern int __find_closest_non_pm_node(int nodeid);
> > +#else
> > +inline int __power_managed_memory_present(void) { return 0};
> > +inline int __power_managed_node(int srat_node) { return 0};
> > +inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
> > +#endif
> > +
> > +int power_managed_memory_present(void)
> > +{
> > +	return __power_managed_memory_present();
> > +}
> > +
> > +int power_managed_node(int srat_node)
> > +{
> > +	return __power_managed_node(srat_node);
> > +}
> > +
> > +int find_closest_non_pm_node(int nodeid)
> > +{
> > +	return __find_closest_non_pm_node(nodeid);
> > +}
> > +
> 
> Probably should reconsider extern declarations in .c files.
>

Yeah, but I couldn't think of a better place to put this code or how to
make it portable to non x86_64 architectures.  Recommendations gratefully
accepted.


> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/mempolicy.c linux-2.6.20-mm2-monroe/mm/mempolicy.c
> > --- linux-2.6.20-mm2/mm/mempolicy.c	2007-02-23 11:20:40.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/mm/mempolicy.c	2007-03-02 15:15:53.000000000 -0800
> > @@ -1617,8 +1617,13 @@
> >  	/* Set interleaving policy for system init. This way not all
> >  	   the data structures allocated at system boot end up in node zero. */
> >  
> > -	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> > -		printk("numa_policy_init: interleaving failed\n");
> > +	if (power_managed_memory_present()) {
> > +		if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
> > +			printk("numa_policy_init: interleaving failed\n");
> > +	} else {
> > +		if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> > +			printk("numa_policy_init: interleaving failed\n");
> > +	}
> >  }
> >  
> >  /* Reset policy of current process to default */
> 
> These prink comments are misleading since MPOL_DEFAULT doesn't attempt to 
> set interleaving policy.
>
oop, cut and paste bug. 


Thanks,

--mgross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06 16:47   ` Mark Gross
@ 2007-03-06 17:12     ` David Rientjes
  2007-03-06 17:20       ` Mark Gross
  2007-03-07  2:40     ` David Rientjes
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2007-03-06 17:12 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, 6 Mar 2007, Mark Gross wrote:

> For the initial version of HW that can do this we are stuck with
> allocation based decisions where a complete solution needs page
> migration.
> 
> Yes, a sysfs interface is being looked at to export the control to a
> user mode daemon doing running some kind of policy manager, and if/when
> page migration happens it will be hooked up to this interface.
> 

Is do_migrate_pages() currently unsatisfactory for this?

> > > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
> > > --- linux-2.6.20-mm2/mm/memory.c	2007-02-23 11:20:40.000000000 -0800
> > > +++ linux-2.6.20-mm2-monroe/mm/memory.c	2007-03-02 15:15:53.000000000 -0800
> > > @@ -2882,3 +2882,29 @@
> > >  	return buf - old_buf;
> > >  }
> > >  EXPORT_SYMBOL_GPL(access_process_vm);
> > > +
> > > +#ifdef __x86_64__
> > > +extern int __power_managed_memory_present(void);
> > > +extern int __power_managed_node(int srat_node);
> > > +extern int __find_closest_non_pm_node(int nodeid);
> > > +#else
> > > +inline int __power_managed_memory_present(void) { return 0};
> > > +inline int __power_managed_node(int srat_node) { return 0};
> > > +inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
> > > +#endif
> > > +
> > > +int power_managed_memory_present(void)
> > > +{
> > > +	return __power_managed_memory_present();
> > > +}
> > > +
> > > +int power_managed_node(int srat_node)
> > > +{
> > > +	return __power_managed_node(srat_node);
> > > +}
> > > +
> > > +int find_closest_non_pm_node(int nodeid)
> > > +{
> > > +	return __find_closest_non_pm_node(nodeid);
> > > +}
> > > +
> > 
> > Probably should reconsider extern declarations in .c files.
> >
> 
> Yeah, but I couldn't think of a better place to put this code or how to
> make it portable to non x86_64 architectures.  Recommendations gratefully
> accepted.
> 

I would add this to include/asm-x86_64/topology.h:

	extern int __power_managed_memory_present(void);
	extern int __power_managed_node(int);
	extern int __find_closest_non_pm_node(int);
	#define power_managed_memory_present()	__power_managed_memory_present()
	#define power_managed_node(nid)		__power_managed_node(nid)
	#define find_closest_non_pm_node(nid)	__find_closest_non_pm_node(nid)

and then put the actual functions in arch/x86_64/numa.c.  Then something 
like this in include/linux/topology.h would probably suffice:

	#ifndef find_closest_non_pm_node
	#define find_closest_non_pm_node(nid)	do {} while(0)
	#endif

etc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06 17:12     ` David Rientjes
@ 2007-03-06 17:20       ` Mark Gross
  2007-03-06 17:33         ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Gross @ 2007-03-06 17:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, Mar 06, 2007 at 09:12:06AM -0800, David Rientjes wrote:
> On Tue, 6 Mar 2007, Mark Gross wrote:
> 
> > For the initial version of HW that can do this we are stuck with
> > allocation based decisions where a complete solution needs page
> > migration.
> > 
> > Yes, a sysfs interface is being looked at to export the control to a
> > user mode daemon doing running some kind of policy manager, and if/when
> > page migration happens it will be hooked up to this interface.
> > 
> 
> Is do_migrate_pages() currently unsatisfactory for this?

This looks like it should be good for this application!  How stable is
this?  The next phase of this work is to export the policy interfaces
and hook up the page migration.  I'm somewhat new to the mm code.

Thanks!

> 
> > > > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
> > > > --- linux-2.6.20-mm2/mm/memory.c	2007-02-23 11:20:40.000000000 -0800
> > > > +++ linux-2.6.20-mm2-monroe/mm/memory.c	2007-03-02 15:15:53.000000000 -0800
> > > > @@ -2882,3 +2882,29 @@
> > > >  	return buf - old_buf;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(access_process_vm);
> > > > +
> > > > +#ifdef __x86_64__
> > > > +extern int __power_managed_memory_present(void);
> > > > +extern int __power_managed_node(int srat_node);
> > > > +extern int __find_closest_non_pm_node(int nodeid);
> > > > +#else
> > > > +inline int __power_managed_memory_present(void) { return 0};
> > > > +inline int __power_managed_node(int srat_node) { return 0};
> > > > +inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
> > > > +#endif
> > > > +
> > > > +int power_managed_memory_present(void)
> > > > +{
> > > > +	return __power_managed_memory_present();
> > > > +}
> > > > +
> > > > +int power_managed_node(int srat_node)
> > > > +{
> > > > +	return __power_managed_node(srat_node);
> > > > +}
> > > > +
> > > > +int find_closest_non_pm_node(int nodeid)
> > > > +{
> > > > +	return __find_closest_non_pm_node(nodeid);
> > > > +}
> > > > +
> > > 
> > > Probably should reconsider extern declarations in .c files.
> > >
> > 
> > Yeah, but I couldn't think of a better place to put this code or how to
> > make it portable to non x86_64 architectures.  Recommendations gratefully
> > accepted.
> > 
> 
> I would add this to include/asm-x86_64/topology.h:
> 
> 	extern int __power_managed_memory_present(void);
> 	extern int __power_managed_node(int);
> 	extern int __find_closest_non_pm_node(int);
> 	#define power_managed_memory_present()	__power_managed_memory_present()
> 	#define power_managed_node(nid)		__power_managed_node(nid)
> 	#define find_closest_non_pm_node(nid)	__find_closest_non_pm_node(nid)
> 
> and then put the actual functions in arch/x86_64/numa.c.  Then something 
> like this in include/linux/topology.h would probably suffice:
> 
> 	#ifndef find_closest_non_pm_node
> 	#define find_closest_non_pm_node(nid)	do {} while(0)
> 	#endif
> 
> etc.
> 

thanks,

--mgross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06 17:20       ` Mark Gross
@ 2007-03-06 17:33         ` David Rientjes
  0 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2007-03-06 17:33 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, 6 Mar 2007, Mark Gross wrote:

> > Is do_migrate_pages() currently unsatisfactory for this?
> 
> This looks like it should be good for this application!  How stable is
> this?  The next phase of this work is to export the policy interfaces
> and hook up the page migration.  I'm somewhat new to the mm code.
> 

Since you've already used a NUMA approach to flagging PM-memory, you'd 
probably want to use this interface through mempolicy in your migration.  
There's currently work to do lockless VMA scanning that was posted just 
yesterday to linux-mm and that's a bottleneck in this migration.

Take a look at update_nodemask() in kernel/cpuset.c for how it migrates 
pages from a source set of nodes to a destination set using 
memory_migrate.  The cpuset specifics are explained in 
Documentation/cpusets.txt, but the basics are that you'll want to use 
memory_migrate to start the migration when you remove a node from your 
nodemask (another reason why I suggested the use of a nodemask instead of 
a simple array).

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-06 16:47   ` Mark Gross
  2007-03-06 17:12     ` David Rientjes
@ 2007-03-07  2:40     ` David Rientjes
  2007-03-09 20:53       ` Mark Gross
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2007-03-07  2:40 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, 6 Mar 2007, Mark Gross wrote:

> Let me give your idea a spin and get back to you. 
> 

Something like the following might be a little better.

 [ You might consider adding this as a configuration option such
   as CONFIG_NODE_POWER_MANAGEMENT so that power_managed_node(nid)
   always returns 0 when this isn't defined in arch/x86_64/Kconfig. ]

		David
---
 arch/x86_64/mm/numa.c         |   59 ++++++++++++++++++++++++++++++++++++++--
 arch/x86_64/mm/srat.c         |    7 +++-
 include/asm-x86_64/numa.h     |    2 +
 include/asm-x86_64/topology.h |    7 +++++
 include/linux/topology.h      |   18 ++++++++++++
 mm/bootmem.c                  |   34 +++++++++++++----------
 mm/mempolicy.c                |    9 +++++-
 mm/page_alloc.c               |   19 ++++++++++++-
 mm/slab.c                     |    7 +++++
 mm/sparse.c                   |    8 +++++
 10 files changed, 146 insertions(+), 24 deletions(-)

diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/nodemask.h>
+#include <linux/acpi.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -159,8 +160,23 @@ static void * __init
 early_node_mem(int nodeid, unsigned long start, unsigned long end,
 	      unsigned long size)
 {
-	unsigned long mem = find_e820_area(start, end, size);
+	unsigned long mem;
 	void *ptr;
+
+	/*
+	 * If this is a power-managed node, we need to allocate this memory
+	 * elsewhere so we remap it, if possible.
+	 */
+	if (power_managed_node(nodeid)) {
+		int new_node = node_remap(nodeid, nodes_parsed, non_pm_nodes);
+		if (nodeid != new_node) {
+			start = NODE_DATA(new_node)->node_start_pfn;
+			end = start + NODE_DATA(new_node)->node_spanned_pages;
+			nodeid = new_node;
+		}
+	}
+	mem = find_e820_area(start, end, size);
+
 	if (mem != -1L)
 		return __va(mem);
 	ptr = __alloc_bootmem_nopanic(size,
@@ -180,6 +196,7 @@ void __init setup_node_bootmem(int nodeid, unsigned long start, unsigned long en
 	unsigned long nodedata_phys;
 	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
+	int reserve_nid = nodeid;
 
 	start = round_up(start, ZONE_ALIGN); 
 
@@ -218,6 +235,13 @@ void __init setup_node_bootmem(int nodeid, unsigned long start, unsigned long en
 
 	free_bootmem_with_active_regions(nodeid, end);
 
+	/*
+	 * Make sure we reserve bootmem on a node that is not under power
+	 * management.
+	 */
+	if (power_managed_node(nodeid))
+		reserve_nid = node_remap(nodeid, nodes_parsed, non_pm_nodes);
+
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
 #ifdef CONFIG_ACPI_NUMA
@@ -242,6 +266,9 @@ void __init setup_node_zones(int nodeid)
 	memmapsize = sizeof(struct page) * (end_pfn-start_pfn);
 	limit = end_pfn << PAGE_SHIFT;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
+	if (power_managed_node(nodeid))
+		nodeid = node_remap(nodeid, nodes_parsed, non_pm_nodes);
+
 	NODE_DATA(nodeid)->node_mem_map = 
 		__alloc_bootmem_core(NODE_DATA(nodeid)->bdata, 
 				memmapsize, SMP_CACHE_BYTES, 
@@ -255,7 +282,7 @@ void __init setup_node_zones(int nodeid)
 
 void __init numa_init_array(void)
 {
-	int rr, i;
+	int rr, i, nodeid;
 	/* There are unfortunately some poorly designed mainboards around
 	   that only connect memory to a single CPU. This breaks the 1:1 cpu->node
 	   mapping. To avoid this fill in the mapping for all possible
@@ -265,7 +292,11 @@ void __init numa_init_array(void)
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_to_node[i] != NUMA_NO_NODE)
 			continue;
- 		numa_set_node(i, rr);
+		if (power_managed_node(rr))
+			nodeid = node_remap(rr, nodes_parsed, non_pm_nodes);
+		else
+			nodeid = rr;
+ 		numa_set_node(i, nodeid);
 		rr = next_node(rr, node_online_map);
 		if (rr == MAX_NUMNODES)
 			rr = first_node(node_online_map);
@@ -681,3 +712,25 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 #endif
+
+int __power_managed_node(int nid)
+{
+	return !node_isset(node_to_pxm(nid), non_pm_nodes);
+}
+
+int __power_managed_memory_present(void)
+{
+	return !nodes_full(non_pm_nodes);
+}
+
+int __find_closest_non_pm_node(int nid)
+{
+	int node;
+	nodemask_t new_nodes;
+
+	nodes_and(new_nodes, non_pm_nodes, node_online_map);
+	node = next_node(nid, non_pm_nodes);
+	if (node == MAX_NUMNODES)
+		node = first_node(non_pm_nodes);
+	return node;
+}
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -25,10 +25,11 @@ int acpi_numa __initdata;
 
 static struct acpi_table_slit *acpi_slit;
 
-static nodemask_t nodes_parsed __initdata;
+nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
 int hotadd_percent __initdata = 0;
+nodemask_t non_pm_nodes __read_mostly = NODE_MASK_ALL;
 
 /* Too small nodes confuse the VM badly. Usually they result
    from BIOS bugs. */
@@ -298,8 +299,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		return;
 	start = ma->base_address;
 	end = start + ma->length;
-	pxm = ma->proximity_domain;
+	pxm = ma->proximity_domain & 0x0000ffff;
 	node = setup_node(pxm);
+	if (ma->proximity_domain & (1 << 31))
+		node_clear(node, non_pm_nodes);
 	if (node < 0) {
 		printk(KERN_ERR "SRAT: Too many proximity domains.\n");
 		bad_srat();
diff --git a/include/asm-x86_64/numa.h b/include/asm-x86_64/numa.h
--- a/include/asm-x86_64/numa.h
+++ b/include/asm-x86_64/numa.h
@@ -18,6 +18,8 @@ extern int numa_off;
 extern void numa_set_node(int cpu, int node);
 extern void srat_reserve_add_area(int nodeid);
 extern int hotadd_percent;
+extern nodemask_t nodes_parsed __initdata;
+extern nodemask_t non_pm_nodes;
 
 extern unsigned char apicid_to_node[256];
 #ifdef CONFIG_NUMA
diff --git a/include/asm-x86_64/topology.h b/include/asm-x86_64/topology.h
--- a/include/asm-x86_64/topology.h
+++ b/include/asm-x86_64/topology.h
@@ -18,6 +18,13 @@ extern int __node_distance(int, int);
 /* #else fallback version */
 #endif
 
+extern int __power_managed_memory_present(void);
+extern int __power_managed_node(int);
+extern int __find_closest_non_pm_node(int);
+#define power_managed_memory_present()	__power_managed_memory_present()
+#define power_managed_node(nid)		__power_managed_node(nid)
+#define find_closest_non_pm_node(nid)	__find_closest_non_pm_node(nid)
+
 #define cpu_to_node(cpu)		(cpu_to_node[cpu])
 #define parent_node(node)		(node)
 #define node_to_first_cpu(node) 	(first_cpu(node_to_cpumask[node]))
diff --git a/include/linux/topology.h b/include/linux/topology.h
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -67,6 +67,24 @@
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
+#ifndef power_managed_memory_present
+static inline int power_managed_memory_present(void)
+{
+	return 0;
+}
+#endif
+#ifndef power_managed_node
+static inline int power_managed_node(int nid)
+{
+	return 0;
+}
+#endif
+#ifndef find_closest_non_pm_node
+static inline int find_closest_non_pm_node(int nid)
+{
+	return nid;
+}
+#endif
 
 /*
  * Below are the 3 major initializers used in building sched_domains:
diff --git a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -417,14 +417,16 @@ unsigned long __init free_all_bootmem(void)
 void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
 				      unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
-
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal, 0);
-		if (ptr)
-			return ptr;
-	}
+	int i;
+
+	for_each_online_node(i)
+		if (!power_managed_node(i)) {
+			ptr = __alloc_bootmem_core(NODE_DATA(i)->bdata,
+						   size, align, goal, 0);
+			if (ptr)
+				return ptr;
+		}
 	return NULL;
 }
 
@@ -463,15 +465,17 @@ void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
 void * __init __alloc_bootmem_low(unsigned long size, unsigned long align,
 				  unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
-
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal,
-						ARCH_LOW_ADDRESS_LIMIT);
-		if (ptr)
-			return ptr;
-	}
+	int i;
+
+	for_each_online_node(i)
+		if (!power_managed_node(i)) {
+			ptr = __alloc_bootmem_core(NODE_DATA(i)->bdata,
+						   size, align, goal,
+						   ARCH_LOW_ADDRESS_LIMIT);
+			if (ptr)
+				return ptr;
+		}
 
 	/*
 	 * Whoops, we cannot satisfy the allocation request.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1609,8 +1609,13 @@ void __init numa_policy_init(void)
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
-		printk("numa_policy_init: interleaving failed\n");
+	if (power_managed_memory_present()) {
+		if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
+			printk("numa_policy_init: default failed\n");
+	} else {
+		if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+			printk("numa_policy_init: interleaving failed\n");
+	}
 }
 
 /* Reset policy of current process to default */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,14 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 					* sizeof(wait_queue_head_t);
 
  	if (system_state == SYSTEM_BOOTING) {
+		struct pglist_data *alloc_pgdat = pgdat;
+
+		if (power_managed_node(pgdat->node_id)) {
+			int nid;
+			nid = find_closest_non_pm_node(pgdat->node_id);
+			alloc_pgdat = NODE_DATA(nid);
+		}
+
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, alloc_size);
 	} else {
@@ -3203,6 +3211,7 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
 	if (!pgdat->node_mem_map) {
 		unsigned long size, start, end;
 		struct page *map;
+		struct pglist_data *alloc_pgdat = pgdat;
 
 		/*
 		 * The zone's endpoints aren't required to be MAX_ORDER
@@ -3214,8 +3223,14 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
 		end = ALIGN(end, MAX_ORDER_NR_PAGES);
 		size =  (end - start) * sizeof(struct page);
 		map = alloc_remap(pgdat->node_id, size);
-		if (!map)
-			map = alloc_bootmem_node(pgdat, size);
+		if (!map) {
+			if (power_managed_node(pgdat->node_id)) {
+				int nid;
+				nid = find_closest_non_pm_node(pgdat->node_id);
+				alloc_pgdat = NODE_DATA(nid);
+			}
+			map = alloc_bootmem_node(alloc_pgdat, size);
+		}
 		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
 		printk(KERN_DEBUG
 			"Node %d memmap at 0x%p size %lu first pfn 0x%p\n",
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3399,6 +3399,10 @@ __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(nodeid == -1))
 		nodeid = numa_node_id();
 
+	/* We cannot allocate objects to nodes subject to power management */
+	if (power_managed_node(nodeid))
+		nodeid = find_closest_non_pm_node(nodeid);
+
 	if (unlikely(!cachep->nodelists[nodeid])) {
 		/* Node not bootstrapped yet */
 		ptr = fallback_alloc(cachep, flags);
@@ -3672,6 +3676,9 @@ out:
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
+	/* We cannot allocate objects to nodes subject to power management */
+	if (power_managed_node(nodeid))
+		nodeid = find_closest_non_pm_node(nodeid);
 	return __cache_alloc_node(cachep, flags, nodeid,
 			__builtin_return_address(0));
 }
diff --git a/mm/sparse.c b/mm/sparse.c
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -50,6 +50,10 @@ static struct mem_section *sparse_index_alloc(int nid)
 	unsigned long array_size = SECTIONS_PER_ROOT *
 				   sizeof(struct mem_section);
 
+	/* The node we allocate on must not be subject to power management */
+	if (power_managed_node(nid))
+		nid = find_closest_non_pm_node(nid);
+
 	if (slab_is_available())
 		section = kmalloc_node(array_size, GFP_KERNEL, nid);
 	else
@@ -215,6 +219,10 @@ static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
 	struct mem_section *ms = __nr_to_section(pnum);
 	int nid = sparse_early_nid(ms);
 
+	/* The node we allocate on must not be subject to power management */
+	if (power_managed_node(nid))
+		nid = find_closest_non_pm_node(nid);
+
 	map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
 	if (map)
 		return map;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-07  2:40     ` David Rientjes
@ 2007-03-09 20:53       ` Mark Gross
  2007-03-09 21:27         ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Gross @ 2007-03-09 20:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Tue, Mar 06, 2007 at 06:40:36PM -0800, David Rientjes wrote:
> On Tue, 6 Mar 2007, Mark Gross wrote:
> 
> > Let me give your idea a spin and get back to you. 
> > 
> 
> Something like the following might be a little better.

Thanks!  I've got things cleaned up and working with as many of your
ideas as I could get working.  I liked many of the changes you offered
in the patch you sent to me off list.  

One thing I found was your patch didn't use the SLIT data in computing
the nearest non PM node, and I had to be careful about the difference
between the PM memory PXM bitmap and node id's.  After I accounted for
the not_to_pxm mapping things started working for me.

BTW re basing to 2.6.21rc3mm2, resulted in one 4k allocation in my
PM-zones.  I'll be looking for where that allocation is coming from
after I get this post finished.

--mgross

Singed-off-by: Mark Gross <mark.gross@intel.com>

diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/arch/x86_64/mm/numa.c linux-2.6.21rc3mm2-monroe/arch/x86_64/mm/numa.c
--- linux-2.6.21rc3mm2/arch/x86_64/mm/numa.c	2007-03-08 11:14:19.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/arch/x86_64/mm/numa.c	2007-03-09 10:23:25.000000000 -0800
@@ -155,19 +155,47 @@
 }
 #endif
 
+/* we need a place to save the next start address to use for each node because
+ * we need to allocate the pgdata and bootmem for power managed memory in
+ * non-power managed nodes.  We do this by saving off where we can start
+ * allocating in the nodes and updating them as the boot up proceeds.
+ */
+static unsigned long bootmem_start[MAX_NUMNODES];
+
+
 static void * __init
 early_node_mem(int nodeid, unsigned long start, unsigned long end,
 	      unsigned long size)
 {
-	unsigned long mem = find_e820_area(start, end, size);
+	unsigned long mem;
 	void *ptr;
-	if (mem != -1L)
+	int nid;
+	
+	if (bootmem_start[nodeid] < start) {
+		bootmem_start[nodeid] = start;
+	}
+
+	mem = -1L;
+	nid = nearest_non_pm_node(nodeid);
+	if (nid != nodeid) {
+		if (!node_online(nid))
+			return NULL;
+
+		end = (NODE_DATA(nid)->node_start_pfn +
+			NODE_DATA(nid)->node_spanned_pages)
+				<< PAGE_SHIFT;
+	}
+	mem = find_e820_area(bootmem_start[nid], end, size);
+	if (mem!= -1L) {
+		/* now increment bootmem_start for next call */
+		bootmem_start[nid] = round_up(mem + size, PAGE_SIZE);
 		return __va(mem);
+	}
 	ptr = __alloc_bootmem_nopanic(size,
 				SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS));
 	if (ptr == 0) {
 		printk(KERN_ERR "Cannot find %lu bytes in node %d\n",
-			size, nodeid);
+			size, nid);
 		return NULL;
 	}
 	return ptr;
@@ -179,6 +207,7 @@
 	unsigned long start_pfn, end_pfn, bootmap_pages, bootmap_size, bootmap_start; 
 	unsigned long nodedata_phys;
 	void *bootmap;
+	int non_pm_node = nearest_non_pm_node(nodeid);
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
 
 	start = round_up(start, ZONE_ALIGN); 
@@ -218,8 +247,8 @@
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
+	reserve_bootmem_node(NODE_DATA(non_pm_node), nodedata_phys, pgdat_size);
+	reserve_bootmem_node(NODE_DATA(non_pm_node), bootmap_start, bootmap_pages<<PAGE_SHIFT);
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
@@ -230,8 +259,9 @@
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
+	int non_pm_node = nearest_non_pm_node(nodeid);
 
- 	start_pfn = node_start_pfn(nodeid);
+	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
 
 	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
@@ -242,11 +272,11 @@
 	memmapsize = sizeof(struct page) * (end_pfn-start_pfn);
 	limit = end_pfn << PAGE_SHIFT;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
-	NODE_DATA(nodeid)->node_mem_map = 
-		__alloc_bootmem_core(NODE_DATA(nodeid)->bdata, 
-				memmapsize, SMP_CACHE_BYTES, 
-				round_down(limit - memmapsize, PAGE_SIZE), 
-				limit);
+	NODE_DATA(nodeid)->node_mem_map =
+		__alloc_bootmem_core(NODE_DATA(non_pm_node)->bdata,
+			memmapsize, SMP_CACHE_BYTES,
+			round_down(limit - memmapsize, PAGE_SIZE),
+			limit);
 	printk(KERN_DEBUG "Node %d memmap at 0x%p size %lu first pfn 0x%p\n",
 			nodeid, NODE_DATA(nodeid)->node_mem_map,
 			memmapsize, NODE_DATA(nodeid)->node_mem_map);
@@ -265,7 +295,8 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_to_node[i] != NUMA_NO_NODE)
 			continue;
- 		numa_set_node(i, rr);
+		numa_set_node(i,nearest_non_pm_node(rr));
+		//numa_set_node(i, rr);
 		rr = next_node(rr, node_online_map);
 		if (rr == MAX_NUMNODES)
 			rr = first_node(node_online_map);
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/arch/x86_64/mm/srat.c linux-2.6.21rc3mm2-monroe/arch/x86_64/mm/srat.c
--- linux-2.6.21rc3mm2/arch/x86_64/mm/srat.c	2007-03-08 11:14:19.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/arch/x86_64/mm/srat.c	2007-03-09 11:00:51.000000000 -0800
@@ -27,6 +27,7 @@
 
 static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
+static nodemask_t pm_nodes __read_mostly;
 static int found_add_area __initdata;
 int hotadd_percent __initdata = 0;
 
@@ -34,6 +35,9 @@
    from BIOS bugs. */
 #define NODE_MIN_SIZE (4*1024*1024)
 
+/* ACPI bit to represent power management node */
+#define POWER_MANAGEMENT_ACPI_BIT	(1 << 31)
+
 static __init int setup_node(int pxm)
 {
 	return acpi_map_pxm_to_node(pxm);
@@ -298,7 +302,10 @@
 		return;
 	start = ma->base_address;
 	end = start + ma->length;
-	pxm = ma->proximity_domain;
+	pxm = ma->proximity_domain & ~POWER_MANAGEMENT_ACPI_BIT;
+	if (ma->proximity_domain & POWER_MANAGEMENT_ACPI_BIT)
+		node_set(pxm, pm_nodes);
+
 	node = setup_node(pxm);
 	if (node < 0) {
 		printk(KERN_ERR "SRAT: Too many proximity domains.\n");
@@ -486,3 +493,35 @@
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 
+int __power_managed_node(int nid)
+{
+	return node_isset(node_to_pxm(nid), pm_nodes);
+}
+
+int __power_managed_memory_present(void)
+{
+	return !nodes_empty(pm_nodes);
+}
+
+int __nearest_non_pm_node(int nid)
+{
+	int i, dist, closest, temp;
+	
+	if (!__power_managed_node(nid))
+		return nid;
+	dist = closest= 255;
+	for_each_node(i) {
+		if (__power_managed_node(i))
+			continue;
+
+		if (i != nid) {
+			temp = __node_distance(nid, i );
+			if (temp < dist) {
+				closest = i;
+				dist = temp;
+			}
+		}
+	}
+	BUG_ON(closest == 255);
+	return closest;
+}
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/include/asm-x86_64/topology.h linux-2.6.21rc3mm2-monroe/include/asm-x86_64/topology.h
--- linux-2.6.21rc3mm2/include/asm-x86_64/topology.h	2007-03-08 11:14:20.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/include/asm-x86_64/topology.h	2007-03-09 10:23:25.000000000 -0800
@@ -18,6 +18,13 @@
 /* #else fallback version */
 #endif
 
+extern int __power_managed_node(int);
+extern int __power_managed_memory_present(void);
+extern int __nearest_non_pm_node(int);
+#define power_managed_node(nid)		__power_managed_node(nid)
+#define power_managed_memory_present()	__power_managed_memory_present()
+#define nearest_non_pm_node(nid)	__nearest_non_pm_node(nid)
+
 #define cpu_to_node(cpu)		(cpu_to_node[cpu])
 #define parent_node(node)		(node)
 #define node_to_first_cpu(node) 	(first_cpu(node_to_cpumask[node]))
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/include/linux/topology.h linux-2.6.21rc3mm2-monroe/include/linux/topology.h
--- linux-2.6.21rc3mm2/include/linux/topology.h	2007-03-08 11:14:08.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/include/linux/topology.h	2007-03-09 10:23:25.000000000 -0800
@@ -67,6 +67,24 @@
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
+#ifndef power_managed_node
+static inline int power_managed_node(int nid)
+{
+	return 0;
+}
+#endif
+#ifndef power_managed_memory_present
+static inline int power_managed_memory_present(void)
+{
+	return 0;
+}
+#endif
+#ifndef nearest_non_pm_node
+static inline int nearest_non_pm_node(int nid)
+{
+	return nid;
+}
+#endif
 
 /*
  * Below are the 3 major initializers used in building sched_domains:
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/mm/bootmem.c linux-2.6.21rc3mm2-monroe/mm/bootmem.c
--- linux-2.6.21rc3mm2/mm/bootmem.c	2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/mm/bootmem.c	2007-03-09 10:23:25.000000000 -0800
@@ -417,11 +417,14 @@
 void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
 				      unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
+	int i;
 
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal, 0);
+	for_each_online_node(i) {
+		if (power_managed_node(i))
+			continue;
+		ptr = __alloc_bootmem_core(NODE_DATA(i)->bdata, size,
+					align, goal, 0);
 		if (ptr)
 			return ptr;
 	}
@@ -463,12 +466,14 @@
 void * __init __alloc_bootmem_low(unsigned long size, unsigned long align,
 				  unsigned long goal)
 {
-	bootmem_data_t *bdata;
 	void *ptr;
+	int i;
 
-	list_for_each_entry(bdata, &bdata_list, list) {
-		ptr = __alloc_bootmem_core(bdata, size, align, goal,
-						ARCH_LOW_ADDRESS_LIMIT);
+	for_each_online_node(i) {
+		if (power_managed_node(i))
+			continue;
+		ptr = __alloc_bootmem_core(NODE_DATA(i)->bdata, size, align,
+					goal, ARCH_LOW_ADDRESS_LIMIT);
 		if (ptr)
 			return ptr;
 	}
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/mm/mempolicy.c linux-2.6.21rc3mm2-monroe/mm/mempolicy.c
--- linux-2.6.21rc3mm2/mm/mempolicy.c	2007-03-08 11:14:20.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/mm/mempolicy.c	2007-03-09 10:23:25.000000000 -0800
@@ -1609,8 +1609,13 @@
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
-		printk("numa_policy_init: interleaving failed\n");
+	if (power_managed_memory_present()) {
+		if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
+			printk("numa_policy_init: default failed\n");
+	} else {
+		if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+			printk("numa_policy_init: interleaving failed\n");
+	}
 }
 
 /* Reset policy of current process to default */
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/mm/page_alloc.c linux-2.6.21rc3mm2-monroe/mm/page_alloc.c
--- linux-2.6.21rc3mm2/mm/page_alloc.c	2007-03-08 11:14:20.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/mm/page_alloc.c	2007-03-09 10:23:25.000000000 -0800
@@ -2600,8 +2600,10 @@
 					* sizeof(wait_queue_head_t);
 
  	if (system_state == SYSTEM_BOOTING) {
+		int nid = nearest_non_pm_node(pgdat->node_id);
+		
 		zone->wait_table = (wait_queue_head_t *)
-			alloc_bootmem_node(pgdat, alloc_size);
+			alloc_bootmem_node(NODE_DATA(nid), alloc_size);
 	} else {
 		/*
 		 * This case means that a zone whose size was 0 gets new memory
@@ -3215,8 +3217,11 @@
 		end = ALIGN(end, MAX_ORDER_NR_PAGES);
 		size =  (end - start) * sizeof(struct page);
 		map = alloc_remap(pgdat->node_id, size);
-		if (!map)
-			map = alloc_bootmem_node(pgdat, size);
+		if (!map) {
+			int nid = nearest_non_pm_node(pgdat->node_id);
+
+			map = alloc_bootmem_node(NODE_DATA(nid), size);
+		}
 		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
 		printk(KERN_DEBUG
 			"Node %d memmap at 0x%p size %lu first pfn 0x%p\n",
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/mm/slab.c linux-2.6.21rc3mm2-monroe/mm/slab.c
--- linux-2.6.21rc3mm2/mm/slab.c	2007-03-08 11:14:20.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/mm/slab.c	2007-03-09 10:23:25.000000000 -0800
@@ -3399,6 +3399,7 @@
 	if (unlikely(nodeid == -1))
 		nodeid = numa_node_id();
 
+	nodeid = nearest_non_pm_node(nodeid);
 	if (unlikely(!cachep->nodelists[nodeid])) {
 		/* Node not bootstrapped yet */
 		ptr = fallback_alloc(cachep, flags);
@@ -3672,6 +3673,7 @@
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
+	nodeid = nearest_non_pm_node(nodeid);
 	return __cache_alloc_node(cachep, flags, nodeid,
 			__builtin_return_address(0));
 }
diff -urN -X linux-2.6.21rc3mm2/Documentation/dontdiff linux-2.6.21rc3mm2/mm/sparse.c linux-2.6.21rc3mm2-monroe/mm/sparse.c
--- linux-2.6.21rc3mm2/mm/sparse.c	2007-02-04 10:44:54.000000000 -0800
+++ linux-2.6.21rc3mm2-monroe/mm/sparse.c	2007-03-09 10:23:25.000000000 -0800
@@ -49,7 +49,8 @@
 	struct mem_section *section = NULL;
 	unsigned long array_size = SECTIONS_PER_ROOT *
 				   sizeof(struct mem_section);
-
+	
+	nid = nearest_non_pm_node(nid);
 	if (slab_is_available())
 		section = kmalloc_node(array_size, GFP_KERNEL, nid);
 	else
@@ -215,6 +216,7 @@
 	struct mem_section *ms = __nr_to_section(pnum);
 	int nid = sparse_early_nid(ms);
 
+	nid = nearest_non_pm_node(nid);
 	map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
 	if (map)
 		return map;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-09 21:27         ` David Rientjes
@ 2007-03-09 21:26           ` Mark Gross
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Gross @ 2007-03-09 21:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Fri, Mar 09, 2007 at 01:27:36PM -0800, David Rientjes wrote:
> On Fri, 9 Mar 2007, Mark Gross wrote:
> 
> > +int __nearest_non_pm_node(int nid)
> > +{
> > +	int i, dist, closest, temp;
> > +	
> > +	if (!__power_managed_node(nid))
> > +		return nid;
> > +	dist = closest= 255;
> > +	for_each_node(i) {
> 
> Shouldn't this be for_each_online_node(i) ?

yes.


thanks, 

--mgross

> 
> > +		if (__power_managed_node(i))
> > +			continue;
> > +
> > +		if (i != nid) {
> > +			temp = __node_distance(nid, i );
> > +			if (temp < dist) {
> > +				closest = i;
> > +				dist = temp;
> > +			}
> > +		}
> > +	}
> > +	BUG_ON(closest == 255);
> > +	return closest;
> > +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH] Power Managed memory base enabling
  2007-03-09 20:53       ` Mark Gross
@ 2007-03-09 21:27         ` David Rientjes
  2007-03-09 21:26           ` Mark Gross
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2007-03-09 21:27 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, Linus Torvalds, Andrew Morton, mark.gross,
	neelam.chandwani

On Fri, 9 Mar 2007, Mark Gross wrote:

> +int __nearest_non_pm_node(int nid)
> +{
> +	int i, dist, closest, temp;
> +	
> +	if (!__power_managed_node(nid))
> +		return nid;
> +	dist = closest= 255;
> +	for_each_node(i) {

Shouldn't this be for_each_online_node(i) ?

> +		if (__power_managed_node(i))
> +			continue;
> +
> +		if (i != nid) {
> +			temp = __node_distance(nid, i );
> +			if (temp < dist) {
> +				closest = i;
> +				dist = temp;
> +			}
> +		}
> +	}
> +	BUG_ON(closest == 255);
> +	return closest;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [linux-pm] [RFC] [PATCH] Power Managed memory base enabling
  2007-03-05 18:18 [RFC] [PATCH] Power Managed memory base enabling Mark Gross
  2007-03-06  1:26 ` KAMEZAWA Hiroyuki
  2007-03-06 15:09 ` David Rientjes
@ 2007-03-26 12:48 ` Pavel Machek
  2 siblings, 0 replies; 13+ messages in thread
From: Pavel Machek @ 2007-03-26 12:48 UTC (permalink / raw)
  To: Mark Gross
  Cc: linux-mm, linux-pm, mark.gross, akpm, torvalds, neelam.chandwani

Hi!

> It implements a convention on the 4 bytes of "Proximity Domain ID"
> within the SRAT memory affinity structure as defined in ACPI3.0a.  If
> bit 31 is set, then the memory range represented by that PXM is assumed
> to be power managed.  We are working on defining a "standard" for
> identifying such memory areas as power manageable and progress committee
> based.  
...
> More will be done, but for now we would like to get this base enabling
> into the upstream kernel as an initial step.

I'm not sure if the hack above does not disqualify it from
mainstream...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-03-26 12:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-05 18:18 [RFC] [PATCH] Power Managed memory base enabling Mark Gross
2007-03-06  1:26 ` KAMEZAWA Hiroyuki
2007-03-06 15:54   ` Mark Gross
2007-03-06 15:09 ` David Rientjes
2007-03-06 16:47   ` Mark Gross
2007-03-06 17:12     ` David Rientjes
2007-03-06 17:20       ` Mark Gross
2007-03-06 17:33         ` David Rientjes
2007-03-07  2:40     ` David Rientjes
2007-03-09 20:53       ` Mark Gross
2007-03-09 21:27         ` David Rientjes
2007-03-09 21:26           ` Mark Gross
2007-03-26 12:48 ` [linux-pm] " Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).