* NUMA node information for pages
@ 2014-03-31 23:41 Ulrich Drepper
[not found] ` <533a1566.540db40a.274d.ffff8616SMTPIN_ADDED_BROKEN@mx.google.com>
[not found] ` <533a1563.ad318c0a.6a93.182bSMTPIN_ADDED_BROKEN@mx.google.com>
0 siblings, 2 replies; 9+ messages in thread
From: Ulrich Drepper @ 2014-03-31 23:41 UTC (permalink / raw)
To: anatol.pomozov, jkosina, akpm, xemul, rientjes, paul.gortmaker,
n-horiguchi, linux-kernel
I might be missing something but I couldn't find a way to use the
pagemap information to then look up the NUMA node the respective page is
located on. Especially when analyzing anomalities this is really
useful. The /proc/kpageflags and /proc/kpagecount files don't have that
information.
If this is correct, could the attached patch be considered? It's really
simple and follows the same line as the kpageflags file.
Signed-off-by: Ulrich Drepper <drepper@gmail.com>
Documentation/vm/pagemap.txt | 3 ++
fs/proc/page.c | 50
+++++++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 5948e45..413b34c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -34,6 +34,9 @@ There are three components to pagemap:
* /proc/kpagecount. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
+ * /proc/kpagenode. This file contains a 32-bit number of the NUMA node
+ each page is mapped on.
+
* /proc/kpageflags. This file contains a 64-bit set of flags for each
page, indexed by PFN.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e647c55..65bea9f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -15,6 +15,9 @@
#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KNIDSIZE sizeof(s32)
+#define KNIDMASK (KNIDSIZE - 1)
+
/* /proc/kpagecount - an array exposing page counts
*
* Each entry is a u64 representing the corresponding
@@ -212,10 +215,57 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};
+/* /proc/kpagenode - an array exposing node information for pages
+ *
+ * Each entry is a s32 representing the corresponding
+ * physical page flags.
+ */
+
+static ssize_t kpagenode_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ unsigned long src = *ppos;
+ unsigned long pfn = src / KNIDSIZE;
+ ssize_t ret = 0;
+
+ count = min_t(unsigned long, count, (max_pfn * KNIDSIZE) - src);
+ if (src & KNIDSIZE || count & KNIDMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ int nid;
+ if (pfn_valid(pfn))
+ nid = pfn_to_nid(pfn);
+ else
+ nid = -1;
+
+ if (put_user(nid, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KNIDSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagenode_operations = {
+ .llseek = mem_lseek,
+ .read = kpagenode_read,
+};
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+ proc_create("kpagenode", S_IRUSR, NULL, &proc_kpagenode_operations);
return 0;
}
fs_initcall(proc_page_init);
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: NUMA node information for pages
[not found] ` <533a1566.540db40a.274d.ffff8616SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-04-01 4:28 ` David Rientjes
0 siblings, 0 replies; 9+ messages in thread
From: David Rientjes @ 2014-04-01 4:28 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: drepper, anatol.pomozov, jkosina, akpm, xemul, paul.gortmaker,
linux-kernel
On Mon, 31 Mar 2014, Naoya Horiguchi wrote:
> > I might be missing something but I couldn't find a way to use the
> > pagemap information to then look up the NUMA node the respective page is
> > located on. Especially when analyzing anomalities this is really
> > useful. The /proc/kpageflags and /proc/kpagecount files don't have that
> > information.
> >
> > If this is correct, could the attached patch be considered? It's really
> > simple and follows the same line as the kpageflags file.
>
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me that adding a new interface somewhere under
> /sys/devices/system/node/nodeN which shows pfn range of a given node.
If that's the direction we're going, I'd much prefer just the physical
start and end addresses be exported rather than pfn so we don't need to do
getpagesize() in userspace.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: NUMA node information for pages
[not found] ` <533a1563.ad318c0a.6a93.182bSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-04-08 1:56 ` Ulrich Drepper
[not found] ` <5343806c.100cc30a.0461.ffffc401SMTPIN_ADDED_BROKEN@mx.google.com>
0 siblings, 1 reply; 9+ messages in thread
From: Ulrich Drepper @ 2014-04-08 1:56 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: Anatol Pomozov, Jiri Kosina, Andrew Morton, xemul, rientjes,
paul.gortmaker, Linux Kernel Mailing List
On Mon, Mar 31, 2014 at 9:24 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me that adding a new interface somewhere under
> /sys/devices/system/node/nodeN which shows pfn range of a given node.
> If this doesn't work for your usecase, could you explain more about how you
> use this information?
I have no problem with that type of interface. It'll be more work
figuring out the details since the interface I proposed is trivial and
mimics that of kpageflags etc but that's manageable.
I'll see whether I can figure out the necessary details. I imagine
that if the PFN are indeed always clustered for each node then, as
David proposes, text output like
PFNSTART PFNSTOP
in a file below /sys/devices/system/node/nodeN should be sufficient.
How does memory hot plug work in this situation? If the PFNs are
allocated dense at startup then there might potentially be many ranges
for each node.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: NUMA node information for pages
[not found] ` <5343806c.100cc30a.0461.ffffc401SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-04-10 0:41 ` David Rientjes
[not found] ` <5345fe27.82dab40a.0831.0af9SMTPIN_ADDED_BROKEN@mx.google.com>
0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2014-04-10 0:41 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: drepper, anatol.pomozov, jkosina, akpm, xemul, paul.gortmaker,
linux-kernel
On Tue, 8 Apr 2014, Naoya Horiguchi wrote:
> memory hotplug is done in memory block basis, so if we get info from under
> /sys/devices/system/memory/memory<ID> it should be memory hotplug-aware
> (/sys/devices/system/memory/memory<ID>/state shows online/offline status.)
>
> And IIUC, "pfn-node_id" mapping might be already available for userspace.
> /sys/devices/system/memory/block_size_bytes exports memory block size,
> so we can simply map pfn (physical address) into memory block ID by
> (physicall address)/(memory block size), then we can find associated node
> from /sys/devices/system/memory/memory<ID>
>
> $ ls -l /sys/devices/system/memory/memory0
> ...
> lrwxrwxrwx 1 root root 0 Apr 8 00:15 node0 -> ../../node/node0
>
That's only possible with sparsemem and if you have memory hotplug
enabled. I'm thinking that Ulrich is looking for a solution that won't
have such a dependency and work for all memory models (including one that
disables NUMA and simply represents all memory as one big node).
[ And that block_size_bytes file is absolutely horrid, why are we
exporting all this information in hex and not telling anybody? ]
I'd much prefer a single change that works for everybody and userspace can
rely on exporting accurate information as long as sysfs is mounted, and
not even need to rely on getpagesize() to convert from pfn to physical
address: just simple {start,end}_phys_addr files added to
/sys/devices/system/node/nodeN/ for node N. Online information can
already be parsed for these ranges from /sys/devices/system/node/online.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: NUMA node information for pages
[not found] ` <5345fe27.82dab40a.0831.0af9SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-04-10 22:06 ` David Rientjes
[not found] ` <53474709.e59ec20a.3bd5.3b91SMTPIN_ADDED_BROKEN@mx.google.com>
0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2014-04-10 22:06 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: drepper, anatol.pomozov, jkosina, akpm, xemul, paul.gortmaker,
linux-kernel
On Wed, 9 Apr 2014, Naoya Horiguchi wrote:
> > [ And that block_size_bytes file is absolutely horrid, why are we
> > exporting all this information in hex and not telling anybody? ]
>
> Indeed, this kind of implicit hex numbers are commonly used in many place.
> I guess that it's maybe for historical reasons.
>
I think it was meant to be simple to that you could easily add the length
to the start, but it should at least prefix this with `0x'. That code has
been around for years, though, so we probably can't fix it now.
> > I'd much prefer a single change that works for everybody and userspace can
> > rely on exporting accurate information as long as sysfs is mounted, and
> > not even need to rely on getpagesize() to convert from pfn to physical
> > address: just simple {start,end}_phys_addr files added to
> > /sys/devices/system/node/nodeN/ for node N. Online information can
> > already be parsed for these ranges from /sys/devices/system/node/online.
>
> OK, so what if some node has multiple address ranges? I don't think that
> start(end)_phys_addr simply returns minimum (maximum) possible address is optimal,
> because users can't know about void range between valid address ranges
> (non-exist pfn should not belong to any node).
> Are printing multilined (or comma-separated) ranges preferable for example
> like below?
>
> $ cat /sys/devices/system/node/nodeN/phys_addr
> 0x0-0x80000000
> 0x100000000-0x180000000
>
What the...? nodeN should represent the pgdat for that node and a pgdat
can only have a single range. I'm suggesting that
/sys/devices/system/node/nodeN/start_phys_addr returns
node_start_pfn(N) << PAGE_SHIFT and
/sys/devices/system/node/nodeN/end_phys_addr returns
node_end_pfn(N) << PAGE_SHIFT and prefix them correctly this time with
`0x'.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
[not found] ` <53474709.e59ec20a.3bd5.3b91SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-04-11 11:00 ` David Rientjes
2014-04-11 16:24 ` Dave Hansen
0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2014-04-11 11:00 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: drepper, anatol.pomozov, jkosina, akpm, xemul, paul.gortmaker,
linux-kernel, linux-mm
On Thu, 10 Apr 2014, Naoya Horiguchi wrote:
> Yes, that's right, but it seems to me that just node_start_pfn and node_end_pfn
> is not enough because there can be holes (without any page struct backed) inside
> [node_start_pfn, node_end_pfn), and it's not aware of memory hotplug.
>
So? Who cares if there are non-addressable holes in part of the span?
Ulrich, correct me if I'm wrong, but it seems you're looking for just a
address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
expecting that there are no holes in a node for things like acpi or I/O or
reserved memory.
The node spans a contiguous length of memory, there's no consideration for
addresses that aren't actually backed by physical memory. We are just
representing proximity domains that have a base address and length in the
acpi world.
Memory hotplug is already taken care of because onlining and offlining
nodes already add these node classes and {start,end}_phys_addr would
show up automatically. If you use node_start_pfn(nid) and
node_end_pfn(nid) as suggested, there's no futher consideration needed for
hotplug.
I think trying to represent holes and handling different memory models and
hotplug in special ways is complete overkill.
Ulrich, can I have your ack?
---
Documentation/ABI/stable/sysfs-devices-node | 12 ++++++++++++
drivers/base/node.c | 18 ++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -63,6 +63,18 @@ Description:
The node's hit/miss statistics, in units of pages.
See Documentation/numastat.txt
+What: /sys/devices/system/node/nodeX/start_phys_addr
+Date: April 2014
+Contact: David Rientjes <rientjes@google.com>
+Description:
+ The physical base address of this node.
+
+What: /sys/devices/system/node/nodeX/end_phys_addr
+Date: April 2014
+Contact: David Rientjes <rientjes@google.com>
+Description:
+ The physical base + length address of this node.
+
What: /sys/devices/system/node/nodeX/distance
Date: October 2002
Contact: Linux Memory Management list <linux-mm@kvack.org>
diff --git a/drivers/base/node.c b/drivers/base/node.c
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -170,6 +170,20 @@ static ssize_t node_read_numastat(struct device *dev,
}
static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
+static ssize_t node_read_start_phys_addr(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "0x%lx\n", node_start_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(start_phys_addr, S_IRUGO, node_read_start_phys_addr, NULL);
+
+static ssize_t node_read_end_phys_addr(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "0x%lx\n", node_end_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(end_phys_addr, S_IRUGO, node_read_end_phys_addr, NULL);
+
static ssize_t node_read_vmstat(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -286,6 +300,8 @@ static int register_node(struct node *node, int num, struct node *parent)
device_create_file(&node->dev, &dev_attr_cpulist);
device_create_file(&node->dev, &dev_attr_meminfo);
device_create_file(&node->dev, &dev_attr_numastat);
+ device_create_file(&node->dev, &dev_attr_start_phys_addr);
+ device_create_file(&node->dev, &dev_attr_end_phys_addr);
device_create_file(&node->dev, &dev_attr_distance);
device_create_file(&node->dev, &dev_attr_vmstat);
@@ -311,6 +327,8 @@ void unregister_node(struct node *node)
device_remove_file(&node->dev, &dev_attr_cpulist);
device_remove_file(&node->dev, &dev_attr_meminfo);
device_remove_file(&node->dev, &dev_attr_numastat);
+ device_remove_file(&node->dev, &dev_attr_start_phys_addr);
+ device_remove_file(&node->dev, &dev_attr_end_phys_addr);
device_remove_file(&node->dev, &dev_attr_distance);
device_remove_file(&node->dev, &dev_attr_vmstat);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
2014-04-11 11:00 ` [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages) David Rientjes
@ 2014-04-11 16:24 ` Dave Hansen
2014-04-11 22:13 ` David Rientjes
0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2014-04-11 16:24 UTC (permalink / raw)
To: David Rientjes, Naoya Horiguchi
Cc: drepper, anatol.pomozov, jkosina, akpm, xemul, paul.gortmaker,
linux-kernel, linux-mm
On 04/11/2014 04:00 AM, David Rientjes wrote:
> On Thu, 10 Apr 2014, Naoya Horiguchi wrote:
>> > Yes, that's right, but it seems to me that just node_start_pfn and node_end_pfn
>> > is not enough because there can be holes (without any page struct backed) inside
>> > [node_start_pfn, node_end_pfn), and it's not aware of memory hotplug.
>> >
> So? Who cares if there are non-addressable holes in part of the span?
> Ulrich, correct me if I'm wrong, but it seems you're looking for just a
> address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
> expecting that there are no holes in a node for things like acpi or I/O or
> reserved memory.
...
> I think trying to represent holes and handling different memory models and
> hotplug in special ways is complete overkill.
This isn't just about memory hotplug or different memory models. There
are systems out there today, in production, that have layouts like this:
|------Node0-----|
|------Node1-----|
and this:
|------Node0-----|
|-Node1-|
For those systems, this interface has no meaning. Given a page in the
shared-span areas, this interface provides no way to figure out which
node it is in.
If you want a non-portable hack that just works on one system, I'd
suggest parsing the existing firmware tables.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
2014-04-11 16:24 ` Dave Hansen
@ 2014-04-11 22:13 ` David Rientjes
2014-04-11 22:53 ` Dave Hansen
0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2014-04-11 22:13 UTC (permalink / raw)
To: Dave Hansen
Cc: Naoya Horiguchi, drepper, anatol.pomozov, jkosina, akpm, xemul,
paul.gortmaker, linux-kernel, linux-mm
On Fri, 11 Apr 2014, Dave Hansen wrote:
> > So? Who cares if there are non-addressable holes in part of the span?
> > Ulrich, correct me if I'm wrong, but it seems you're looking for just a
> > address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
> > expecting that there are no holes in a node for things like acpi or I/O or
> > reserved memory.
> ...
> > I think trying to represent holes and handling different memory models and
> > hotplug in special ways is complete overkill.
>
> This isn't just about memory hotplug or different memory models. There
> are systems out there today, in production, that have layouts like this:
>
> |------Node0-----|
> |------Node1-----|
>
> and this:
>
> |------Node0-----|
> |-Node1-|
>
What additional information, in your opinion, can we export to assist
userspace in making this determination that $address is on $nid?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
2014-04-11 22:13 ` David Rientjes
@ 2014-04-11 22:53 ` Dave Hansen
0 siblings, 0 replies; 9+ messages in thread
From: Dave Hansen @ 2014-04-11 22:53 UTC (permalink / raw)
To: David Rientjes
Cc: Naoya Horiguchi, drepper, anatol.pomozov, jkosina, akpm, xemul,
paul.gortmaker, linux-kernel, linux-mm
On 04/11/2014 03:13 PM, David Rientjes wrote:
> What additional information, in your opinion, can we export to assist
> userspace in making this determination that $address is on $nid?
In the case of overlapping nodes, the only place we actually have *all*
of the information is in the 'struct page' itself. Ulrich's original
patch obviously _works_, and especially if it's an interface only for
debugging purposes, it seems silly to spend virtually any time
optimizing it. Keeping it close to pagemap's implementation lessens the
likelihood that we'll screw things up.
I assume that the original problem was trying to figure out what NUMA
affinity a given range of pages mapped in to a _process_ have, and that
/proc/$pid/numamaps is too coarse. Is that right, Ulrich?
If you want to go the route of calculating and exporting the physical
ranges that nodes uniquely own, you've *GOT* to handle the overlaps.
Naoya had the right idea. His idea seemed to get shot down with the
misunderstanding that node pfn ranges never overlap.
The only other question is how many of these kpage* things we're going
to put in here until we've exported the entire contents of 'struct page'
5 times over. :)
We could add some tracepoints to the pagemap to dump lots of information
in to a trace buffer that could be later read back. If you want
detailed information (NUMA for instance), you turn the tracepoints and
read pagemap for the range you care about.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-04-11 22:53 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-31 23:41 NUMA node information for pages Ulrich Drepper
[not found] ` <533a1566.540db40a.274d.ffff8616SMTPIN_ADDED_BROKEN@mx.google.com>
2014-04-01 4:28 ` David Rientjes
[not found] ` <533a1563.ad318c0a.6a93.182bSMTPIN_ADDED_BROKEN@mx.google.com>
2014-04-08 1:56 ` Ulrich Drepper
[not found] ` <5343806c.100cc30a.0461.ffffc401SMTPIN_ADDED_BROKEN@mx.google.com>
2014-04-10 0:41 ` David Rientjes
[not found] ` <5345fe27.82dab40a.0831.0af9SMTPIN_ADDED_BROKEN@mx.google.com>
2014-04-10 22:06 ` David Rientjes
[not found] ` <53474709.e59ec20a.3bd5.3b91SMTPIN_ADDED_BROKEN@mx.google.com>
2014-04-11 11:00 ` [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages) David Rientjes
2014-04-11 16:24 ` Dave Hansen
2014-04-11 22:13 ` David Rientjes
2014-04-11 22:53 ` Dave Hansen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox