From: Mark Gross <mgross@linux.intel.com>
To: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org, linux-pm@lists.osdl.org,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
mark.gross@intel.com, neelam.chandwani@intel.com
Subject: Re: [RFC] [PATCH] Power Managed memory base enabling
Date: Tue, 6 Mar 2007 08:47:22 -0800 [thread overview]
Message-ID: <20070306164722.GB22725@linux.intel.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0703051941310.18703@chino.kir.corp.google.com>
On Tue, Mar 06, 2007 at 07:09:14AM -0800, David Rientjes wrote:
> On Mon, 5 Mar 2007, Mark Gross wrote:
>
> > To exercise the capability on a platform with PM-memory, you will still
> > need to include a policy manager with some code to trigger the state
> > changes to enable transition into and out of a low power state.
> >
>
> Thanks for pushing this type of work to the community.
>
> What type of policy manager did you have in mind for state transition?
> Since you're basing it on existing NUMA code, are you looking at something
> like /sys/devices/system/node/node*/power that would be responsible for
> migrating pages off the PM-memory it represents and then transitioning the
> hardware into a suspend or standby state?
For the initial version of HW that can do this we are stuck with
allocation based decisions where a complete solution needs page
migration.
Yes, a sysfs interface is being looked at to export the control to a
user mode daemon doing running some kind of policy manager, and if/when
page migration happens it will be hooked up to this interface.
>
> The biggest concern is obviously going to be the interleaving.
Power friendly interleaving schemes will still be available. They will
likely be limited to interleaving across at most 2 sticks.
The tests I've seen have shown that by-4 verses by-2 interleave on
modern hardware, isn't noticeable except for lmbench stream.
It may not be a one size fits all technology.
>
> > More will be done, but for now we would like to get this base enabling
> > into the upstream kernel as an initial step.
> >
>
> Might be a premature question, but will there be upstream support for
> transitioning the hardware state? If so, it would be interesting to hear
> what the preliminary enter and exit latencies are for each.
The code MC registers to re-train the memory lanes are somewhat
protected and will be implemented in the platform FW / BIOS. I don't
think code to do that will be pushed up stream.
>
> Few comments on the patch follow.
>
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/numa.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c
> > --- linux-2.6.20-mm2/arch/x86_64/mm/numa.c 2007-02-23 11:20:38.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/numa.c 2007-03-02 15:15:53.000000000 -0800
> > @@ -156,12 +156,55 @@
> > }
> > #endif
> >
> > +/* we need a place to save the next start address to use for each node because
> > + * we need to allocate the pgdata and bootmem for power managed memory in
> > + * non-power managed nodes. We do this by saving off where we can start
> > + * allocating in the nodes and updating them as the boot up proceeds.
> > + */
> > +static unsigned long bootmem_start[MAX_NUMNODES];
> > +
>
> When we're going through setup_node_bootmem(), we're already going to have
> the pm_node[] information populated for power management node detection.
> It can be represented by a nodemask (see below). So the code in
> early_node_mem() could be simplified and more robust by eliminating
> bootmem_start[] and exporting nodes_parsed from srat.c.
>
> We can get away with this because nodes_parsed is marked __initdata and
> will still be valid at this point.
>
> > static void * __init
> > early_node_mem(int nodeid, unsigned long start, unsigned long end,
> > unsigned long size)
> > {
> > - unsigned long mem = find_e820_area(start, end, size);
> > + unsigned long mem;
> > void *ptr;
> > + if (bootmem_start[nodeid] <= start) {
> > + bootmem_start[nodeid] = start;
> > + }
> > +
> > + mem = -1L;
> > + if (power_managed_node(nodeid)) {
> > + int non_pm_node = find_closest_non_pm_node(nodeid);
> > +
> > + if (!node_online(non_pm_node)) {
> > + return NULL; /* expect nodeid to get setup on the next
> > + pass of setup_node_boot_mem after
> > + non_pm_node is online*/
> > + } else {
> > + /* We set up the allocation in the non_pm_node
> > + * get the end of non_pm_node boot allocations
> > + * allocate from there.
> > + */
> > + unsigned int non_pm_end;
> > +
> > + non_pm_end = (NODE_DATA(non_pm_node)->node_start_pfn +
> > + NODE_DATA(non_pm_node)->node_spanned_pages)
> > + << PAGE_SHIFT;
> > +
> > + mem = find_e820_area(bootmem_start[non_pm_node],
> > + non_pm_end, size);
> > + /* now increment bootmem_start for next call */
> > + if (mem!= -1L)
> > + bootmem_start[non_pm_node] =
> > + round_up(mem + size, PAGE_SIZE);
> > + }
> > + } else {
> > + mem = find_e820_area(bootmem_start[nodeid], end, size);
> > + if (mem!= -1L)
> > + bootmem_start[nodeid] = round_up(mem + size, PAGE_SIZE);
> > + }
> > if (mem != -1L)
> > return __va(mem);
> > ptr = __alloc_bootmem_nopanic(size,
>
> Then the change above becomes much easier:
>
> if (power_managed_node(nodeid)) {
> int new_node = node_remap(nodeid, *nodes_parsed, *pm_nodes);
> if (nodeid != new_node) {
> start = NODE_DATA(new_node)->node_start_pfn;
> end = start + NODE_DATA(new_node)->node_spanned_pages;
> }
> }
> mem = find_e820_area(start, end, size);
>
Let me give your idea a spin and get back to you.
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/arch/x86_64/mm/srat.c linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c
> > --- linux-2.6.20-mm2/arch/x86_64/mm/srat.c 2007-02-23 11:20:38.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/arch/x86_64/mm/srat.c 2007-03-02 15:15:53.000000000 -0800
> > @@ -28,6 +28,7 @@
> > static nodemask_t nodes_parsed __initdata;
> > static struct bootnode nodes[MAX_NUMNODES] __initdata;
> > static struct bootnode nodes_add[MAX_NUMNODES];
> > +static int pm_node[MAX_NUMNODES];
> > static int found_add_area __initdata;
> > int hotadd_percent __initdata = 0;
> >
>
> I would recommend making this a nodemask that is an extern from
> include/asm-x86_64/numa.h:
>
> nodemask_t pm_nodes;
>
> > @@ -479,5 +482,36 @@
> >
> > return ret;
> > }
> > -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >
> > +int __power_managed_node(int srat_node)
> > +{
> > + return pm_node[node_to_pxm(srat_node)];
> > +}
> > +
> > +int __power_managed_memory_present(void)
> > +{
> > + int j;
> > +
> > + for (j=0; j<MAX_LOCAL_APIC; j++) {
> > + if(__power_managed_node(j) )
> > + return 1;
> > + }
> > + return 0;
> > +}
> > +
> > +int __find_closest_non_pm_node(int nodeid)
> > +{
> > + int i, dist, closest, temp;
> > +
> > + dist = closest= 255;
> > + for_each_node(i) {
> > + if ((i != nodeid) && !power_managed_node(i)) {
> > + temp = __node_distance(nodeid, i );
> > + if (temp < dist) {
> > + closest = i;
> > + dist = temp;
> > + }
> > + }
> > + }
> > + return closest;
> > +}
>
> Then all these functions become trivial:
>
> int __power_managed_node(int nid)
> {
> return node_isset(node_to_pxm(nid), pm_nodes);
> }
>
> int __power_managed_memory_present(void)
> {
> return !nodes_empty(pm_nodes);
> }
>
> int __find_closest_non_pm_node(int nid)
> {
> int node;
> node = next_node(nid, pm_nodes);
> if (node == MAX_NUMNODES)
> node = first_node(pm_nodes);
> }
>
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/memory.c linux-2.6.20-mm2-monroe/mm/memory.c
> > --- linux-2.6.20-mm2/mm/memory.c 2007-02-23 11:20:40.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/mm/memory.c 2007-03-02 15:15:53.000000000 -0800
> > @@ -2882,3 +2882,29 @@
> > return buf - old_buf;
> > }
> > EXPORT_SYMBOL_GPL(access_process_vm);
> > +
> > +#ifdef __x86_64__
> > +extern int __power_managed_memory_present(void);
> > +extern int __power_managed_node(int srat_node);
> > +extern int __find_closest_non_pm_node(int nodeid);
> > +#else
> > +inline int __power_managed_memory_present(void) { return 0};
> > +inline int __power_managed_node(int srat_node) { return 0};
> > +inline int __find_closest_non_pm_node(int nodeid) { return nodeid};
> > +#endif
> > +
> > +int power_managed_memory_present(void)
> > +{
> > + return __power_managed_memory_present();
> > +}
> > +
> > +int power_managed_node(int srat_node)
> > +{
> > + return __power_managed_node(srat_node);
> > +}
> > +
> > +int find_closest_non_pm_node(int nodeid)
> > +{
> > + return __find_closest_non_pm_node(nodeid);
> > +}
> > +
>
> Probably should reconsider extern declarations in .c files.
>
Yeah, but I couldn't think of a better place to put this code or how to
make it portable to non x86_64 architectures. Recommendations gratefully
accepted.
> > diff -urN -X linux-2.6.20-mm2/Documentation/dontdiff linux-2.6.20-mm2/mm/mempolicy.c linux-2.6.20-mm2-monroe/mm/mempolicy.c
> > --- linux-2.6.20-mm2/mm/mempolicy.c 2007-02-23 11:20:40.000000000 -0800
> > +++ linux-2.6.20-mm2-monroe/mm/mempolicy.c 2007-03-02 15:15:53.000000000 -0800
> > @@ -1617,8 +1617,13 @@
> > /* Set interleaving policy for system init. This way not all
> > the data structures allocated at system boot end up in node zero. */
> >
> > - if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> > - printk("numa_policy_init: interleaving failed\n");
> > + if (power_managed_memory_present()) {
> > + if (do_set_mempolicy(MPOL_DEFAULT, &node_online_map))
> > + printk("numa_policy_init: interleaving failed\n");
> > + } else {
> > + if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
> > + printk("numa_policy_init: interleaving failed\n");
> > + }
> > }
> >
> > /* Reset policy of current process to default */
>
> These prink comments are misleading since MPOL_DEFAULT doesn't attempt to
> set interleaving policy.
>
oop, cut and paste bug.
Thanks,
--mgross
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-03-06 16:47 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-03-05 18:18 [RFC] [PATCH] Power Managed memory base enabling Mark Gross
2007-03-06 1:26 ` KAMEZAWA Hiroyuki
2007-03-06 15:54 ` Mark Gross
2007-03-06 15:09 ` David Rientjes
2007-03-06 16:47 ` Mark Gross [this message]
2007-03-06 17:12 ` David Rientjes
2007-03-06 17:20 ` Mark Gross
2007-03-06 17:33 ` David Rientjes
2007-03-07 2:40 ` David Rientjes
2007-03-09 20:53 ` Mark Gross
2007-03-09 21:27 ` David Rientjes
2007-03-09 21:26 ` Mark Gross
2007-03-26 12:48 ` [linux-pm] " Pavel Machek
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070306164722.GB22725@linux.intel.com \
--to=mgross@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=linux-mm@kvack.org \
--cc=linux-pm@lists.osdl.org \
--cc=mark.gross@intel.com \
--cc=neelam.chandwani@intel.com \
--cc=rientjes@google.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).