* [patch] Simple Topology API
@ 2002-07-13 0:35 Matthew Dobson
2002-07-13 2:49 ` Andrew Morton
2002-07-13 8:04 ` Alexander Viro
0 siblings, 2 replies; 20+ messages in thread
From: Matthew Dobson @ 2002-07-13 0:35 UTC (permalink / raw)
To: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds,
Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 452 bytes --]
Here is a very rudimentary topology API for NUMA systems. It uses prctl() for
the userland calls, and exposes some useful things to userland. It would be
nice to expose these simple structures to both users and the kernel itself.
Any architecture wishing to use this API simply has to write a .h file that
defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.
Voila! Instant inclusion in the topology!
Enjoy!
-Matt
[-- Attachment #2: 2.5.25-simple_topo.patch --]
[-- Type: text/plain, Size: 11411 bytes --]
diff -Nur linux-2.5.25-vanilla/include/asm-i386/core_ibmnumaq.h linux-2.5.25-api/include/asm-i386/core_ibmnumaq.h
--- linux-2.5.25-vanilla/include/asm-i386/core_ibmnumaq.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/asm-i386/core_ibmnumaq.h Thu Jul 11 13:58:25 2002
@@ -0,0 +1,62 @@
+/*
+ * linux/include/asm-i386/core_ibmnumaq.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_CORE_IBMNUMAQ_H_
+#define _ASM_CORE_IBMNUMAQ_H_
+
+/*
+ * These functions need to be defined for every architecture.
+ * The first five are necessary for the Memory Binding API to function.
+ * The last is needed by several pieces of NUMA code.
+ */
+
+
+/* Returns the number of the node containing CPU 'cpu' */
+#define _cpu_to_node(cpu) (cpu_to_logical_apicid(cpu) >> 4)
+
+/* Returns the number of the node containing MemBlk 'memblk' */
+#define _memblk_to_node(memblk) (memblk)
+
+/* Returns the number of the node containing Node 'nid'. This architecture is flat,
+ so it is a pretty simple function! */
+#define _node_to_node(nid) (nid)
+
+/* Returns the number of the first CPU on Node 'node' */
+static inline int _node_to_cpu(int node)
+{
+ int i, cpu, logical_apicid = node << 4;
+
+ for(i = 1; i < 16; i <<= 1)
+ if ((cpu = logical_apicid_to_cpu(logical_apicid | i)) >= 0)
+ return cpu;
+
+ return 0;
+}
+
+/* Returns the number of the first MemBlk on Node 'node' */
+#define _node_to_memblk(node) (node)
+
+#endif /* _ASM_CORE_IBMNUMAQ_H_ */
diff -Nur linux-2.5.25-vanilla/include/asm-i386/mmzone.h linux-2.5.25-api/include/asm-i386/mmzone.h
--- linux-2.5.25-vanilla/include/asm-i386/mmzone.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/asm-i386/mmzone.h Fri Jul 12 16:10:43 2002
@@ -0,0 +1,49 @@
+/*
+ * linux/include/asm-i386/mmzone.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_MMZONE_H_
+#define _ASM_MMZONE_H_
+
+#include <asm/smpboot.h>
+
+#ifdef CONFIG_IBMNUMAQ
+#include <asm/core_ibmnumaq.h>
+/* Other architectures wishing to use this simple topology API should fill
+ in the below functions as appropriate in their own <arch>.h file. */
+#else /* !CONFIG_IBMNUMAQ */
+
+#define _cpu_to_node(cpu) (0)
+#define _memblk_to_node(memblk) (0)
+#define _node_to_node(nid) (0)
+#define _node_to_cpu(node) (0)
+#define _node_to_memblk(node) (0)
+
+#endif /* CONFIG_IBMNUMAQ */
+
+/* Returns the number of the current Node. */
+#define numa_node_id() (_cpu_to_node(smp_processor_id()))
+
+#endif /* _ASM_MMZONE_H_ */
diff -Nur linux-2.5.25-vanilla/include/linux/membind.h linux-2.5.25-api/include/linux/membind.h
--- linux-2.5.25-vanilla/include/linux/membind.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/linux/membind.h Fri Jul 12 16:31:30 2002
@@ -0,0 +1,38 @@
+/*
+ * linux/include/linux/membind.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _LINUX_MEMBIND_H_
+#define _LINUX_MEMBIND_H_
+
+int cpu_to_node(int);
+int memblk_to_node(int);
+int node_to_node(int);
+int node_to_cpu(int);
+int node_to_memblk(int);
+int get_curr_cpu(void);
+int get_curr_node(void);
+
+#endif /* _LINUX_MEMBIND_H_ */
diff -Nur linux-2.5.25-vanilla/include/linux/prctl.h linux-2.5.25-api/include/linux/prctl.h
--- linux-2.5.25-vanilla/include/linux/prctl.h Fri Jul 5 16:42:28 2002
+++ linux-2.5.25-api/include/linux/prctl.h Wed Jul 10 13:58:17 2002
@@ -26,4 +26,17 @@
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */
+/* Get CPU/Node */
+#define PR_GET_CURR_CPU 13
+#define PR_GET_CURR_NODE 14
+
+/* XX to Node conversion functions */
+#define PR_CPU_TO_NODE 15
+#define PR_MEMBLK_TO_NODE 16
+#define PR_NODE_TO_NODE 17
+
+/* Node to XX conversion functions */
+#define PR_NODE_TO_CPU 18
+#define PR_NODE_TO_MEMBLK 19
+
#endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.5.25-vanilla/kernel/membind.c linux-2.5.25-api/kernel/membind.c
--- linux-2.5.25-vanilla/kernel/membind.c Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/kernel/membind.c Fri Jul 12 16:13:17 2002
@@ -0,0 +1,130 @@
+/*
+ * linux/kernel/membind.c
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#include <linux/kernel.h>
+#include <linux/unistd.h>
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/membind.h>
+#include <linux/mmzone.h>
+#include <linux/errno.h>
+#include <linux/smp.h>
+
+extern unsigned long memblk_online_map;
+
+/*
+ * cpu_to_node(cpu): Returns the number of the most specific Node
+ * containing CPU 'cpu'.
+ */
+inline int cpu_to_node(int cpu)
+{
+ if (cpu == -1) /* return highest numbered node */
+ return (numnodes - 1);
+
+ if ((cpu < 0) || (cpu >= NR_CPUS) ||
+ (!(cpu_online_map & (1 << cpu)))) /* invalid cpu # */
+ return -ENODEV;
+
+ return _cpu_to_node(cpu);
+}
+
+/*
+ * memblk_to_node(memblk): Returns the number of the most specific Node
+ * containing Memory Block 'memblk'.
+ */
+inline int memblk_to_node(int memblk)
+{
+ if (memblk == -1) /* return highest numbered node */
+ return (numnodes - 1);
+
+ if ((memblk < 0) || (memblk >= NR_MEMBLKS) ||
+ (!(memblk_online_map & (1 << memblk)))) /* invalid memblk # */
+ return -ENODEV;
+
+ return _memblk_to_node(memblk);
+}
+
+/*
+ * node_to_node(nid): Returns the number of the most specific Node that
+ * encompasses Node 'nid'. Some may call this the parent Node of 'nid'.
+ */
+int node_to_node(int nid)
+{
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_node(nid);
+}
+
+/*
+ * node_to_cpu(nid): Returns the lowest numbered CPU on Node 'nid'
+ */
+inline int node_to_cpu(int nid)
+{
+ if (nid == -1) /* return highest numbered cpu */
+ return (num_online_cpus() - 1);
+
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_cpu(nid);
+}
+
+/*
+ * node_to_memblk(nid): Returns the lowest numbered MemBlk on Node 'nid'
+ */
+inline int node_to_memblk(int nid)
+{
+ if (nid == -1) /* return highest numbered memblk */
+ return (num_online_memblks() - 1);
+
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_memblk(nid);
+}
+
+/*
+ * get_curr_cpu(): Returns the currently executing CPU number.
+ * For now, this has only mild usefulness, as this information could
+ * change on the return from syscall (which automatically calls schedule()).
+ * Due to this, the data could be stale by the time it gets back to the user.
+ * It will have to do, until a better method is found.
+ */
+inline int get_curr_cpu(void)
+{
+ return smp_processor_id();
+}
+
+/*
+ * get_curr_node(): Returns the number of the Node containing
+ * the currently executing CPU. Subject to the same caveat
+ * as the get_curr_cpu() call.
+ */
+inline int get_curr_node(void)
+{
+ return cpu_to_node(get_curr_cpu());
+}
diff -Nur linux-2.5.25-vanilla/kernel/sys.c linux-2.5.25-api/kernel/sys.c
--- linux-2.5.25-vanilla/kernel/sys.c Fri Jul 5 16:42:04 2002
+++ linux-2.5.25-api/kernel/sys.c Fri Jul 12 16:11:16 2002
@@ -19,6 +19,7 @@
#include <linux/tqueue.h>
#include <linux/device.h>
#include <linux/times.h>
+#include <linux/membind.h>
#include <asm/uaccess.h>
#include <asm/io.h>
@@ -1291,6 +1292,27 @@
}
current->keep_capabilities = arg2;
break;
+ case PR_GET_CURR_CPU:
+ error = (long) get_curr_cpu();
+ break;
+ case PR_GET_CURR_NODE:
+ error = (long) get_curr_node();
+ break;
+ case PR_CPU_TO_NODE:
+ error = (long) cpu_to_node((int)arg2);
+ break;
+ case PR_MEMBLK_TO_NODE:
+ error = (long) memblk_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_NODE:
+ error = (long) node_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_CPU:
+ error = (long) node_to_cpu((int)arg2);
+ break;
+ case PR_NODE_TO_MEMBLK:
+ error = (long) node_to_memblk((int)arg2);
+ break;
default:
error = -EINVAL;
break;
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch] Simple Topology API
  2002-07-13  0:35 [patch] Simple Topology API Matthew Dobson
@ 2002-07-13  2:49 ` Andrew Morton
  2002-07-15 18:49   ` Matthew Dobson
  2002-07-13  8:04 ` Alexander Viro
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-07-13 2:49 UTC (permalink / raw)
To: colpatch; +Cc: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds

Matthew Dobson wrote:
>
> Here is a very rudimentary topology API for NUMA systems.  It uses prctl() for
> the userland calls, and exposes some useful things to userland.  It would be
> nice to expose these simple structures to both users and the kernel itself.
> Any architecture wishing to use this API simply has to write a .h file that
> defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.

Matt,

I suspect what happens when these patches come out is that most people simply
don't have the knowledge/time/experience/context to judge them, and nothing
ends up happening.  No way would I pretend to be able to comment on the
big picture, that's for sure.

If the code is clean, the interfaces make sense, the impact on other
platforms is minimised and the stakeholders are OK with it then that
should be sufficient, yes?

AFAIK, the interested parties with this and the memory binding API are
ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
if the owners of those platforms could review this work and say "yes,
this is something we can use and build upon".  Have they done that?

I'd have a few micro-observations:

> ...
> --- linux-2.5.25-vanilla/kernel/membind.c	Wed Dec 31 16:00:00 1969
> +++ linux-2.5.25-api/kernel/membind.c	Fri Jul 12 16:13:17 2002
> ..
> +inline int memblk_to_node(int memblk)

The inlines with global scope in this file seem strange?

Matthew Dobson wrote:
>
> Here is a Memory Binding API
> ...
> +	memblk_binding:	{ MEMBLK_NO_BINDING, MPOL_STRICT },	\
> ...
> +typedef struct memblk_list {
> +	memblk_bitmask_t	bitmask;
> +	int			behavior;
> +	rwlock_t		lock;
> +} memblk_list_t;

Is it possible to reduce this type to something smaller for
CONFIG_NUMA=n?

In the above task_struct initialiser you should initialise the
rwlock to RWLOCK_LOCK_UNLOCKED.

It's nice to use the `name:value' initialiser format in there, too.

> ...
> +int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
> +{
> ...
> +	read_lock_irqsave(&current->memblk_binding.lock, flags);

Your code accesses `current' a lot.  You'll find that the code
generation is fairly poor - evaluating `current' chews 10-15
bytes of code.  You can perform a manual CSE by copying current
into a local, and save a few cycles.

> ...
> +struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
> +{
> ...
> +	spin_lock_irqsave(&node_lock, flags);
> +	temp = pgdat_list;
> +	spin_unlock_irqrestore(&node_lock, flags);

Not sure what you're trying to lock here, but you're not locking
it ;) This is either racy code or unneeded locking.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch] Simple Topology API
  2002-07-13  2:49 ` Andrew Morton
@ 2002-07-15 18:49   ` Matthew Dobson
  0 siblings, 0 replies; 20+ messages in thread
From: Matthew Dobson @ 2002-07-15 18:49 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds

Andrew Morton wrote:
> Matt,
>
> I suspect what happens when these patches come out is that most people simply
> don't have the knowledge/time/experience/context to judge them, and nothing
> ends up happening.  No way would I pretend to be able to comment on the
> big picture, that's for sure.

Absolutely correct.  I know that most people here on LKML don't have 8, 16,
32, or more CPU systems to test this code on, or for that matter, even care
about code designed for said systems.  I'm lucky enough to get to work on
such machines, and I'm sure there are others out there (as evidenced by some
of the replies I've gotten) that do care.  Also, there are publicly available
NUMA machines in the OSDL that people can use to "play" on large systems.  I
hope that by seeing code and using these systems, some more people might get
interested in some of the interesting scalability issues that crop up with
these machines.

> If the code is clean, the interfaces make sense, the impact on other
> platforms is minimised and the stakeholders are OK with it then that
> should be sufficient, yes?

I would hope so.  That's what I'm trying to establish! ;)

> AFAIK, the interested parties with this and the memory binding API are
> ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
> if the owners of those platforms could review this work and say "yes,
> this is something we can use and build upon".  Have they done that?

I've gotten some feedback from large systems people.  I hope to get feedback
from anyone with large systems that could potentially use this kind of API,
and get a "this is great" or a "this sucks".
I believe that bigger systems need new ways to improve efficiency and
scalability beyond what the kernel offers now.  I know I do...

> I'd have a few micro-observations:
>
>> ...
>> --- linux-2.5.25-vanilla/kernel/membind.c	Wed Dec 31 16:00:00 1969
>> +++ linux-2.5.25-api/kernel/membind.c	Fri Jul 12 16:13:17 2002
>> ..
>> +inline int memblk_to_node(int memblk)
>
> The inlines with global scope in this file seem strange?
>
> Matthew Dobson wrote:
>
>> Here is a Memory Binding API
>> ...
>> +	memblk_binding:	{ MEMBLK_NO_BINDING, MPOL_STRICT },	\
>
>> ...
>> +typedef struct memblk_list {
>> +	memblk_bitmask_t	bitmask;
>> +	int			behavior;
>> +	rwlock_t		lock;
>> +} memblk_list_t;
>
> Is it possible to reduce this type to something smaller for
> CONFIG_NUMA=n?

Probably...  I'll look at that today...

> In the above task_struct initialiser you should initialise the
> rwlock to RWLOCK_LOCK_UNLOCKED.

Yep..  Totally forgot about that! :(

> It's nice to use the `name:value' initialiser format in there, too.

Sure, enhanced readability is always a good thing!

>> ...
>> +int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
>> +{
>> ...
>> +	read_lock_irqsave(&current->memblk_binding.lock, flags);
>
> Your code accesses `current' a lot.  You'll find that the code
> generation is fairly poor - evaluating `current' chews 10-15
> bytes of code.  You can perform a manual CSE by copying current
> into a local, and save a few cycles.

Sure..  I've actually gotten a couple different ideas about improving the
efficiency of that function, and will also be rewriting that today..

>> ...
>> +struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
>> +{
>> ...
>> +	spin_lock_irqsave(&node_lock, flags);
>> +	temp = pgdat_list;
>> +	spin_unlock_irqrestore(&node_lock, flags);
>
> Not sure what you're trying to lock here, but you're not locking
> it ;) This is either racy code or unneeded locking.

To be honest, I'm not entirely sure what that's locking either.
That is the non-NUMA path of that function, and the locking was in the
original code, so I just moved it along.  After doing a bit of searching,
that lock seems COMPLETELY useless there.  Especially since in the original
function, a few lines further down, pgdat_list is read again, without the
lock!  I guess, unless someone here says otherwise, I'll pull that locking
out of the next rev.

Thanks for all the feedback.  I'll incorporate most of it into the next rev
of the patch!

Cheers!

-Matt

> Thanks.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch] Simple Topology API
  2002-07-13  0:35 [patch] Simple Topology API Matthew Dobson
  2002-07-13  2:49 ` Andrew Morton
@ 2002-07-13  8:04 ` Alexander Viro
  2002-07-13 17:13   ` Albert D. Cahalan
  2002-07-15 23:52   ` Matthew Dobson
  1 sibling, 2 replies; 20+ messages in thread
From: Alexander Viro @ 2002-07-13 8:04 UTC (permalink / raw)
To: Matthew Dobson
Cc: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds,
	Andrew Morton

On Fri, 12 Jul 2002, Matthew Dobson wrote:

> Here is a very rudimentary topology API for NUMA systems.  It uses prctl() for
> the userland calls, and exposes some useful things to userland.  It would be
> nice to expose these simple structures to both users and the kernel itself.
> Any architecture wishing to use this API simply has to write a .h file that
> defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.
> Voila!  Instant inclusion in the topology!
>
> Enjoy!

It's hard to enjoy the use of prctl().  Especially for things like
"give me the number of the first CPU in node <n>" - it ain't no
process control, no matter how you stretch it.

<soapbox> That's yet another demonstration of the evil of multiplexing
syscalls.  They hide the broken APIs and make them easy to introduce.
And broken APIs get introduced - through each of these.  prctl(), fcntl(),
ioctl() - you name it.  Please, don't do that. </soapbox>

Please, replace that API with something sane.  "Current processor" and
_maybe_ "current node" are reasonable per-process things (even though
the latter is obviously redundant).  They are inherently racy, however -
if you get scheduled on the return from syscall the value may have
nothing to do with reality by the time you return to userland.  The rest
is obviously system-wide _and_ not process-related (it's "tell me about
the configuration of the machine").  Implementing them as prctls makes
absolutely no sense.  If anything, that's sysctl material.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch] Simple Topology API
  2002-07-13  8:04 ` Alexander Viro
@ 2002-07-13 17:13   ` Albert D. Cahalan
  0 siblings, 0 replies; 20+ messages in thread
From: Albert D. Cahalan @ 2002-07-13 17:13 UTC (permalink / raw)
To: Alexander Viro
Cc: Matthew Dobson, linux-kernel, Michael Hohnbaum, Martin Bligh,
	Linus Torvalds, Andrew Morton

Alexander Viro writes:

> It's hard to enjoy the use of prctl().  Especially for things like
> "give me the number of the first CPU in node <n>" - it ain't no
> process control, no matter how you stretch it.

Yeah... eeew.

> <soapbox> That's yet another demonstration of the evil of multiplexing
> syscalls.  They hide the broken APIs and make them easy to introduce.
> And broken APIs get introduced - through each of these.  prctl(), fcntl(),
> ioctl() - you name it.  Please, don't do that. </soapbox>

This wouldn't happen if it wasn't so damn hard to add a syscall.
If you make people go through all the arch maintainers just to add
a simple arch-independent syscall, they'll just bolt their code into
some dark hidden corner of the kernel.  That's life.  Make syscalls
easy to write, and this won't happen.

Can you guess what would happen if you got rid of prctl(), fcntl(),
and ioctl()?  We'd get apps with code like this:

// write address of one of these to /proc/orifice
typedef struct evil {
	int version;		// struct version
	struct evil *next;	// next in list
	struct evil *prev;	// prev in list
	char opcode;		// indicates what we will do
	int (*fn)(void *);	// callback function (if not NULL)
	void *addr;		// an address in kernel memory
	short flags;		// 0x0001 call fn w/ ints off, 0x0002 w/ BKL
	double timeout;		// in microfortnights (uses APIC's NMI)
} evil;

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch] Simple Topology API
  2002-07-13  8:04 ` Alexander Viro
  2002-07-13 17:13   ` Albert D. Cahalan
@ 2002-07-15 23:52   ` Matthew Dobson
  1 sibling, 0 replies; 20+ messages in thread
From: Matthew Dobson @ 2002-07-15 23:52 UTC (permalink / raw)
To: Alexander Viro
Cc: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds,
	Andrew Morton

Al,

If I can get 1-2 syscalls for the Topo API, and 1-2 for the Membind API,
I'll gladly make the changes.  For now, though, prctl() works fine.  If it
needs to be changed at some point, it can be done in about 5 minutes...

As far as the raciness of the get_curr_cpu & get_curr_node calls, that is
noted in the comments.  Until we get a better way of exposing the current
working processor to userspace, they'll have to do.  I believe that having
*some* idea of where you're running is better than having *no* idea of
where you're running.

-Matt

Alexander Viro wrote:
>
> On Fri, 12 Jul 2002, Matthew Dobson wrote:
>
>> Here is a very rudimentary topology API for NUMA systems.  It uses prctl() for
>> the userland calls, and exposes some useful things to userland.  It would be
>> nice to expose these simple structures to both users and the kernel itself.
>> Any architecture wishing to use this API simply has to write a .h file that
>> defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.
>> Voila!  Instant inclusion in the topology!
>>
>> Enjoy!
>
> It's hard to enjoy the use of prctl().  Especially for things like
> "give me the number of the first CPU in node <n>" - it ain't no
> process control, no matter how you stretch it.
>
> <soapbox> That's yet another demonstration of the evil of multiplexing
> syscalls.  They hide the broken APIs and make them easy to introduce.
> And broken APIs get introduced - through each of these.  prctl(), fcntl(),
> ioctl() - you name it.  Please, don't do that. </soapbox>
>
> Please, replace that API with something sane.  "Current processor" and
> _maybe_ "current node" are reasonable per-process things (even though
> the latter is obviously redundant).  They are inherently racy, however -
> if you get scheduled on the return from syscall the value may have
> nothing to do with reality by the time you return to userland.  The rest
> is obviously system-wide _and_ not process-related (it's "tell me about
> the configuration of the machine").  Implementing them as prctls makes
> absolutely no sense.  If anything, that's sysctl material.

^ permalink raw reply	[flat|nested] 20+ messages in thread
[parent not found: <3D2F75D7.3060105@us.ibm.com.suse.lists.linux.kernel>]
[parent not found: <3D2F9521.96D7080B@zip.com.au.suse.lists.linux.kernel>]
* Re: [patch] Simple Topology API
       [not found] ` <3D2F9521.96D7080B@zip.com.au.suse.lists.linux.kernel>
@ 2002-07-13 20:08   ` Andi Kleen
  2002-07-14 19:17     ` Linus Torvalds
  2002-07-15 17:48     ` Matthew Dobson
  0 siblings, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2002-07-13 20:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Michael Hohnbaum, Martin Bligh, Linus Torvalds

Andrew Morton <akpm@zip.com.au> writes:

> AFAIK, the interested parties with this and the memory binding API are
> ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
> if the owners of those platforms could review this work and say "yes,
> this is something we can use and build upon".  Have they done that?

Comment from the x86-64 side:

Current x86-64 NUMA essentially has no 'nodes', just each CPU has
local memory that is slightly faster than remote memory.  This means
the node number would always be identical to the CPU number.  As long
as the API provides it, that's ok for me.  Just the node concept will
not be very useful on that platform.  memblk will also be identity
mapped to node/cpu.

Some way to tell user space about memory affinity seems to be useful,
but...

General comment:

I don't see what the application should do with the memblk concept
currently.  Just knowing about it doesn't seem too useful.  Surely it
needs some way to allocate memory in a specific memblk to be useful?
Also doesn't it need to know how much memory is available in each
memblk?  (otherwise I don't see how it could do any useful partitioning)

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch] Simple Topology API
  2002-07-13 20:08 ` Andi Kleen
@ 2002-07-14 19:17   ` Linus Torvalds
  2002-07-14 19:43     ` Andi Kleen
                        ` (3 more replies)
  2002-07-16 19:03   ` Martin J. Bligh
                        ` (2 subsequent siblings)
  3 siblings, 4 replies; 20+ messages in thread
From: Linus Torvalds @ 2002-07-14 19:17 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, Michael Hohnbaum, Martin Bligh

[ I've been off-line for a week, so I didn't follow all of the
  discussion, but here goes anyway ]

On 13 Jul 2002, Andi Kleen wrote:
>
> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory.  This means
> the node number would always be identical to the CPU number.  As long
> as the API provides it, that's ok for me.  Just the node concept will
> not be very useful on that platform.  memblk will also be identity
> mapped to node/cpu.

The whole "node" concept sounds broken.  There is no such thing as a node,
since even within nodes latencies will easily differ for different CPU's
if you have local memories for CPU's within a node (which is clearly the
only sane thing to do).

If you want to model memory behaviour, you should have memory descriptors
(in linux parlance, "zone_t") have an array of latencies to each CPU.  That
latency is _not_ a "is this memory local to this CPU" kind of number, that
simply doesn't make any sense.  The fact is, what matters is the number of
hops.  Maybe you want to allow one hop, but not five.

Then, make the memory binding interface a function of just what kind of
latency you allow from a set X of CPU's.  Simple, straightforward, and it
has a direct meaning in real life, which makes it unambiguous.

So your "memory affinity" system call really needs just one number: the
acceptable latency.  You may also want to have a CPU-set argument, although
I suspect that it's equally correct to just assume that the CPU-set is the
set of CPU's that the process can already run on.

After that, creating a new zone array is nothing more than:

 - give each zone a "latency value", which is simply the minimum of all
   the latencies for that zone from CPU's that are in the CPU set.

 - sort the zone array, lowest latency first.

 - the passed-in latency is the cut-off-point - clear the end of the
   array (with the sanity check that you always accept one zone, even
   if it happens to have a latency higher than the one passed in).

End result: you end up with a priority-sorted array of acceptable zones.
In other words, a zone list.  Which is _exactly_ what you want anyway
(that's what the current "zone_table" is).

And then you associate that zone-list with the process, and use that
zone-list for all process allocations.

Advantages:

 - very direct mapping to what the hardware actually does

 - no complex data structures for topology

 - works for all topologies, the process doesn't even have to know, you
   can trivially encode it all internally in the kernel by just having
   the CPU latency map for each memory zone we know about.

Disadvantages:

 - you cannot create "crazy" memory bindings.  You can only say "I don't
   want to allocate from slow memory".  You _can_ do crazy things by
   initially using a different CPU binding, then doing the memory
   binding, and then re-doing the CPU binding.  So if you _want_ bad
   memory bindings you can create them, but you have to work at it.

 - we have to use some standard latency measure, either purely
   time-based (which changes from machine to machine), or based on some
   notion of "relative to local memory".

My personal suggestion would be the "relative to local memory" thing, and
call that 10 units.  So a cross-CPU (but same module) hop might imply a
latency of 15, while a memory access that goes over the backbone between
modules might be a 35.  And one that takes two hops might be 55.

So then, for each CPU in a machine, you can _trivially_ create the mapping
from each memory zone to that CPU.  And that's all you really care about.

No?

		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API 2002-07-14 19:17 ` Linus Torvalds @ 2002-07-14 19:43 ` Andi Kleen 2002-07-15 2:34 ` Eric W. Biederman 2002-07-16 19:03 ` Martin J. Bligh ` (2 subsequent siblings) 3 siblings, 1 reply; 20+ messages in thread From: Andi Kleen @ 2002-07-14 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Andrew Morton, linux-kernel, Michael Hohnbaum, Martin Bligh, Paul McKenney On Sun, Jul 14, 2002 at 12:17:25PM -0700, Linus Torvalds wrote: > The whole "node" concept sounds broken. There is no such thing as a node, > since even within nodes latencies will easily differ for different CPU's > if you have local memories for CPU's within a node (which is clearly the > only sane thing to do). I basically agree, but then when you go for a full graph everything becomes very complex. It's not clear if that much detail is useful for the application. > latency is _not_ a "is this memory local to this CPU" kind of number, that > simply doesn't make any sense. The fact is, what matters is the number of > hops. Maybe you want to allow one hop, but not five. > > Then, make the memory binding interface a function of just what kind of > latency you allow from a set X of CPU's. Simple, straightforward, and it > has a direct meaning in real life, which makes it unabiguous. Hmm - that could be a problem for applications that care less about latency, but more about equal use of bandwidth (see below). They just want their datastructures to be spread out evenly over all the available memory controllers. I don't see how that could be done with a single latency value; you really need some more complete idea about the topology. At least on Hammer the latency difference is small enough that caring about the overall bandwidth makes more sense. > And then you associate that zone-list with the process, and use that > zone-list for all process allocations. That's the basic idea sure for normal allocations from applications that do not care much about NUMA. 
But "numa aware" applications want to do other things like:
- put some memory area into every node (e.g. for the numa equivalent
  of per CPU data in the kernel)
- "stripe" a shared memory segment over all available memory
  subsystems (e.g. to use memory bandwidth fully if you know your
  interconnect can take it; that's e.g. the case on the Hammer)

As I understood it this API is supposed to be the base of such a NUMA
API for applications (just offer the information, but no way to use it
usefully yet)

More comments from the NUMA gurus please.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-14 19:43 ` Andi Kleen
@ 2002-07-15  2:34   ` Eric W. Biederman
  2002-07-15 15:25     ` Sandy Harris
  0 siblings, 1 reply; 20+ messages in thread

From: Eric W. Biederman @ 2002-07-15 2:34 UTC (permalink / raw)
To: Andi Kleen
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Michael Hohnbaum,
    Martin Bligh, Paul McKenney

Andi Kleen <ak@suse.de> writes:
>
> At least on Hammer the latency difference is small enough that
> caring about the overall bandwidth makes more sense.

I agree. I will have to look closer but unless there is more juice
than I have seen in Hyper-Transport it is going to become one of the
architectural bottlenecks of the Hammer.

Currently you get 1600MB/s in a single direction. Not too bad. But
when the memory controllers get out to dual channel DDR-II 400, the
local bandwidth to that memory is 6400MB/s, and the bandwidth to
remote memory 1600MB/s, or 3200MB/s (if reads are as common as
writes).

So I suspect bandwidth intensive applications will really benefit
from local memory optimization on the Hammer. I can buy that the
latency is negligible; the fact the links don't appear to scale in
bandwidth as well as the connection to memory may be a bigger issue.

> > And then you associate that zone-list with the process, and use that
> > zone-list for all process allocations.
>
> That's the basic idea sure for normal allocations from applications
> that do not care much about NUMA.
>
> But "numa aware" applications want to do other things like:
> - put some memory area into every node (e.g. for the numa equivalent
>   of per CPU data in the kernel)
> - "stripe" a shared memory segment over all available memory
>   subsystems (e.g. to use memory bandwidth fully if you know your
>   interconnect can take it; that's e.g. the case on the Hammer)

The latter I really quite believe. Even dual channel PC2100 can exceed
your interprocessor bandwidth. And yes I have measured 2000MB/s memory
copy with an Athlon MP and PC2100 memory.
Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-15  2:34 ` Eric W. Biederman
@ 2002-07-15 15:25   ` Sandy Harris
  2002-07-15 16:33     ` Chris Friesen
  2002-07-16 10:30     ` Eric W. Biederman
  0 siblings, 2 replies; 20+ messages in thread

From: Sandy Harris @ 2002-07-15 15:25 UTC (permalink / raw)
To: linux-kernel

"Eric W. Biederman" wrote:
>
> Andi Kleen <ak@suse.de> writes:
> >
> > At least on Hammer the latency difference is small enough that
> > caring about the overall bandwidth makes more sense.
>
> I agree. I will have to look closer but unless there is more
> juice than I have seen in Hyper-Transport it is going to become
> one of the architectural bottlenecks of the Hammer.
>
> Currently you get 1600MB/s in a single direction.

That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
CPU for the desktop market). The spec allows 2, 4, 8, 16 or 32-bit
channels. If I recall correctly, the AMD presentation at OLS said
Sledgehammer (server market) uses 16-bit.

> Not too bad.
> But when the memory controllers get out to dual channel DDR-II 400,
> the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> writes).
>
> So I suspect bandwidth intensive applications will really benefit
> from local memory optimization on the Hammer. I can buy that the
> latency is negligible,

I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
to the other CPU, one for I/O. Latency may well be negligible there.

Sledgehammer has three links, can do no-glue 4-way with each CPU
using two links to talk to others, one for I/O.

 I/O -- A ------ B -- I/O
        |        |
        |        |
 I/O -- C ------ D -- I/O

They can also go to no-glue 8-way:

 I/O -- A ------ B ------ E ------ G -- I/O
        |        |        |        |
        |        |        |        |
 I/O -- C ------ D ------ F ------ H -- I/O

I suspect latency may become an issue when more than one link is
involved and there can be contention.

Beyond 8-way, you need glue logic (hypertransport switches?)
and latency seems bound to become an issue.

> the fact the links don't appear to scale
> in bandwidth as well as the connection to memory may be a bigger
> issue.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-15 15:25 ` Sandy Harris
@ 2002-07-15 16:33   ` Chris Friesen
  2002-07-16 10:30   ` Eric W. Biederman
  1 sibling, 0 replies; 20+ messages in thread

From: Chris Friesen @ 2002-07-15 16:33 UTC (permalink / raw)
To: Sandy Harris; +Cc: linux-kernel

Sandy Harris wrote:

> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

According to the AMD talk at OLS, worst case on a 4-way is better than
current best-case on a uniprocessor athlon.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Nope. Just extend the ladder. Each cpu talks to three other entities,
either cpu or I/O. Can be extended arbitrarily until latencies are too
high.

Chris

--
Chris Friesen                | MailStop: 043/33/F10
Nortel Networks              | work: (613) 765-0557
3500 Carling Avenue          | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada    | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-15 15:25 ` Sandy Harris
  2002-07-15 16:33 ` Chris Friesen
@ 2002-07-16 10:30 ` Eric W. Biederman
  2002-07-16 12:59   ` Rik van Riel
  2002-07-16 15:45   ` Martin J. Bligh
  1 sibling, 2 replies; 20+ messages in thread

From: Eric W. Biederman @ 2002-07-16 10:30 UTC (permalink / raw)
To: Sandy Harris; +Cc: linux-kernel

Sandy Harris <pashley@storm.ca> writes:

> "Eric W. Biederman" wrote:
> >
> > Andi Kleen <ak@suse.de> writes:
> > >
> > > At least on Hammer the latency difference is small enough that
> > > caring about the overall bandwidth makes more sense.
> >
> > I agree. I will have to look closer but unless there is more
> > juice than I have seen in Hyper-Transport it is going to become
> > one of the architectural bottlenecks of the Hammer.
> >
> > Currently you get 1600MB/s in a single direction.
>
> That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
> CPU for the desktop market). The spec allows 2, 4, 8, 16 or 32-bit
> channels. If I recall correctly, the AMD presentation at OLS said
> Sledgehammer (server market) uses 16-bit.

Thanks, my confusion. The danger of having more bandwidth to memory
than to other processors is still present, but it may be one of those
places where the cpu designers are able to stay one step ahead of the
problem. I will definitely agree the problem goes away for the short
term with a 32-bit link.

> > Not too bad.
> > But when the memory controllers get out to dual channel DDR-II 400,
> > the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> > remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> > writes).
> >
> > So I suspect bandwidth intensive applications will really benefit
> > from local memory optimization on the Hammer. I can buy that the
> > latency is negligible,
>
> I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
> to the other CPU, one for I/O. Latency may well be negligible there.
>
> Sledgehammer has three links, can do no-glue 4-way with each CPU
> using two links to talk to others, one for I/O.
>
>  I/O -- A ------ B -- I/O
>         |        |
>         |        |
>  I/O -- C ------ D -- I/O
>
> They can also go to no-glue 8-way:
>
>  I/O -- A ------ B ------ E ------ G -- I/O
>         |        |        |        |
>         |        |        |        |
>  I/O -- C ------ D ------ F ------ H -- I/O
>
> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

I think the 8-way topology is a little more interesting than
presented. But if not it does look like you can run into issues.
The more I look at it there appears to be a strong dynamic balance in
the architecture between having just enough bandwidth, and low enough
latency not to become a bottleneck, and having a low hardware cost.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Beyond 8-way you get into another system architecture entirely, which
should be considered on its own merits. In large part cache
directories and other very sophisticated techniques are needed when
you scale a system beyond the SMP point. As long as the inter-cpu
bandwidth is >= the memory bandwidth on a single memory controller
Hammer can probably get away with being just a better SMP, and not
really a NUMA design.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-16 10:30 ` Eric W. Biederman
@ 2002-07-16 12:59   ` Rik van Riel
  2002-07-16 15:45   ` Martin J. Bligh
  1 sibling, 0 replies; 20+ messages in thread

From: Rik van Riel @ 2002-07-16 12:59 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Sandy Harris, linux-kernel

On 16 Jul 2002, Eric W. Biederman wrote:
> Sandy Harris <pashley@storm.ca> writes:

> >  I/O -- A ------ B ------ E ------ G -- I/O
> >         |        |        |        |
> >         |        |        |        |
> >  I/O -- C ------ D ------ F ------ H -- I/O
>
> > I suspect latency may become an issue when more than one link is
> > involved and there can be contention.
>
> I think the 8-way topology is a little more interesting than
> presented. But if not it does look like you can run into issues.

IIRC

 I/O -- A ------ B ------ E ------ G -- I/O
        |         \      /         |
        |          \    /          |
        |            XX            |
        |          /    \          |
        |         /      \         |
 I/O -- C ------ D ------ F ------ H -- I/O

Where B is connected to F and D to E. Obviously this setup has a
maximum hop count of 3 from any cpu to any other cpu, as opposed to a
maximum hop count of 4 for the simple ladder.

regards,

Rik
--
http://www.linuxsymposium.org/2002/

"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-16 10:30 ` Eric W. Biederman
  2002-07-16 12:59 ` Rik van Riel
@ 2002-07-16 15:45 ` Martin J. Bligh
  1 sibling, 0 replies; 20+ messages in thread

From: Martin J. Bligh @ 2002-07-16 15:45 UTC (permalink / raw)
To: Eric W. Biederman, Sandy Harris; +Cc: linux-kernel

> > They can also go to no-glue 8-way:
> >
> >  I/O -- A ------ B ------ E ------ G -- I/O
> >         |        |        |        |
> >         |        |        |        |
> >  I/O -- C ------ D ------ F ------ H -- I/O
>
> I think the 8-way topology is a little more interesting than
> presented. But if not it does look like you can run into issues.
> The more I look at it there appears to be a strong dynamic balance
> in the architecture between having just enough bandwidth, and low
> enough latency not to become a bottleneck, and having a low hardware
> cost.

Whilst I don't have a definitive diagram, the "back of a napkin"
sketches we came up with at an OLS dinner looked like this:

 I/O -- A ------ B ---- E ------ G -- I/O
        |          \/            |
        |          /\            |
 I/O -- C ------ D ---- F ------ H -- I/O

(please excuse my poor artistic skills). That reduces the max hops
from 4 to 3 (if I haven't screwed something up).

M.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-14 19:17 ` Linus Torvalds
  2002-07-14 19:43 ` Andi Kleen
@ 2002-07-16 19:03 ` Martin J. Bligh
  2002-07-16 22:29   ` Matthew Dobson
  2002-07-17  0:21   ` Michael Hohnbaum
  3 siblings, 0 replies; 20+ messages in thread

From: Martin J. Bligh @ 2002-07-16 19:03 UTC (permalink / raw)
To: Linus Torvalds, Andi Kleen
Cc: Andrew Morton, linux-kernel, Michael Hohnbaum

> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).

Define a node as a group of CPUs with the same set of latencies to
memory. Then you get something that makes sense for everyone, and
reduces the storage of duplicated data. If your latencies for each CPU
are different, define a 1-1 mapping between nodes and CPUs. If you
really want to store everything for each CPU, that's fine.

> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.

I can't help thinking that we'd be better off making the mechanism as
generic as possible, and not trying to predict all the weird and
wonderful things people might want to do (eg striping), then implement
what you describe as a policy decision.

M.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-14 19:17 ` Linus Torvalds
  2002-07-14 19:43 ` Andi Kleen
  2002-07-16 19:03 ` Martin J. Bligh
@ 2002-07-16 22:29 ` Matthew Dobson
  2002-07-17  0:21 ` Michael Hohnbaum
  3 siblings, 0 replies; 20+ messages in thread

From: Matthew Dobson @ 2002-07-16 22:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andi Kleen, Andrew Morton, linux-kernel, Michael Hohnbaum,
    Martin Bligh

Linus Torvalds wrote:
> [ I've been off-line for a week, so I didn't follow all of the discussion,
>   but here goes anyway ]
>
> On 13 Jul 2002, Andi Kleen wrote:
>
> > Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> > local memory that is slightly faster than remote memory. This means
> > the node number would be always identical to the CPU number. As long
> > as the API provides it's ok for me. Just the node concept will not be
> > very useful on that platform. memblk will also be identity mapped to
> > node/cpu.
>
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).

If you're saying local memories for *each* CPU within a node, then no,
that is not the only sane thing to do. There are some architectures
that do, and some that do not. The Hammer architecture, to the best of
my knowledge, has memory hanging off of each CPU; however, NUMA-Q, the
main one I work with, has local memory for each group of 4 CPUs. If
you're speaking only of node-local memory, ie: memory local to all the
CPUs on the 'node', then all local CPUs should have the same latency
to that memory.

> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.
>
> Then, make the memory binding interface a function of just what kind of
> latency you allow from a set X of CPU's. Simple, straightforward, and it
> has a direct meaning in real life, which makes it unambiguous.

I mostly agree with you here, except I really do believe that we
should use the node abstraction. It adds little overhead, but buys us
a good bit. Nodes, according to the API, are defined on a per-arch
basis, allowing us to sanely define nodes on our NUMA-Q hardware
(node==4cpus), AMD people to sanely define nodes on their hardware
(node==cpu), and others to define nodes to whatever they want. We will
avoid redundant data in many cases, and in the simplest case this
defaults to your node==cpu behavior anyway. If we do use CPU-Mem
latencies, the NUMA-Q platform (and I'm sure others) would only be
able to distinguish between local and remote CPUs, not individual
remote CPUs.

> So your "memory affinity" system call really needs just one number: the
> acceptable latency. You may also want to have a CPU-set argument, although
> I suspect that it's equally correct to just assume that the CPU-set is the
> set of CPU's that the process can already run on.
>
> After that, creating a new zone array is nothing more than:
>
>  - give each zone a "latency value", which is simply the minimum of all
>    the latencies for that zone from CPU's that are in the CPU set.
>
>  - sort the zone array, lowest latency first.
>
>  - the passed-in latency is the cut-off-point - clear the end of the
>    array (with the sanity check that you always accept one zone, even if
>    it happens to have a latency higher than the one passed in).
>
> End result: you end up with a priority-sorted array of acceptable zones.
> In other words, a zone list. Which is _exactly_ what you want anyway
> (that's what the current "zone_table" is).
>
> And then you associate that zone-list with the process, and use that
> zone-list for all process allocations.

It seems as though you'd be throwing out some useful data. For
example, imagine you have a 2 quad NUMAQ system. Each quad contains 4
CPUs and a block of memory. Now if we use all of the CPUs as our CPU
set, zone 0 (memory block on quad 0) will have a latency of 1 (b/c it
is one hop from the first 4 cpus), as will zone 1 (memory block on
quad 1), b/c it is one hop from the second 4 cpus. Now it would appear
that since these zones both have the same latency, they would be
equally good choices. This isn't true, since if the process is on CPUs
0-3, it should allocate on zone 0, and vice versa for CPUs 4-7.
Latency shouldn't be the ONLY way to make decisions.

> Advantages:
>
>  - very direct mapping to what the hardware actually does

For some architectures, but for some it isn't.

>  - no complex data structures for topology

Agreed.

>  - works for all topologies, the process doesn't even have to know, you
>    can trivially encode it all internally in the kernel by just having the
>    CPU latency map for each memory zone we know about.

True, but the point of this API is to allow processes that *DO* want
to know to make intelligent decisions! Those that don't care can still
go on, blissfully unaware they are on a NUMA system.

> Disadvantages:
>
>  - you cannot create "crazy" memory bindings. You can only say "I don't
>    want to allocate from slow memory". You _can_ do crazy things by
>    initially using a different CPU binding, then doing the memory
>    binding, and then re-doing the CPU binding. So if you _want_ bad memory
>    bindings you can create them, but you have to work at it.

Why limit the process? The overhead is so small to allow processes to
do anything they want, why not allow them?

>  - we have to use some standard latency measure, either purely time-based
>    (which changes from machine to machine), or based on some notion of
>    "relative to local memory".
>
> My personal suggestion would be the "relative to local memory" thing, and
> call that 10 units. So a cross-CPU (but same module) hop might imply a
> latency of 15, while a memory access that goes over the backbone between
> modules might be a 35. And one that takes two hops might be 55.

Absolutely true. I think that "relative to local memory" is a great
measuring stick. It is pretty much platform agnostic, assuming every
platform has some concept of "local" memory.

I basically think that we should give processes that care the ability
to do just about anything they want, no matter how crazy... Most
processes will never even attempt to look at their default bindings,
never mind change them. Plus, we're making mechanism decisions that
will (hopefully) be around for some time. I'm sure people will come up
with things we can't even imagine, so the more powerful the API the
better.

<prepares to stop, drop, and roll>

-Matt

> So then, for each CPU in a machine, you can _trivially_ create the mapping
> from each memory zone to that CPU. And that's all you really care about.
>
> No?
>
> 		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-14 19:17 ` Linus Torvalds
  ` (2 preceding siblings ...)
  2002-07-16 22:29 ` Matthew Dobson
@ 2002-07-17  0:21 ` Michael Hohnbaum
  3 siblings, 0 replies; 20+ messages in thread

From: Michael Hohnbaum @ 2002-07-17 0:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andi Kleen, Andrew Morton, linux-kernel, Martin Bligh

On Sun, 2002-07-14 at 12:17, Linus Torvalds wrote:
>
> [ I've been off-line for a week, so I didn't follow all of the discussion,
>   but here goes anyway ]
>
> On 13 Jul 2002, Andi Kleen wrote:
> >
> > Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> > local memory that is slightly faster than remote memory. This means
> > the node number would be always identical to the CPU number. As long
> > as the API provides it's ok for me. Just the node concept will not be
> > very useful on that platform. memblk will also be identity mapped to
> > node/cpu.
>
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).
>
> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.

How NUMA binding APIs have been used successfully is for a group of
processes to all decide to "hang out" together in the same vicinity so
that they can optimize access to shared memory. In existing NUMA
systems, this has meant deciding to execute on a subset of the nodes.
In a system with 4 nodes, on NUMAQ machines, the latency is either
local or remote. Thus if one sets a binding/latency argument such that
remote accesses are allowed, then all of the nodes are fair game,
otherwise only the local node is used. So a set of processes decides
to occupy two nodes. Using strictly a latency argument, there is no
way to specify this. One could use cpu binding to restrict the
processes to the two nodes, but not the memory - unless you now
associate cpus with memory/nodes and are back to maintaining topology
info.

Another shortcoming of latency based binding is if a process executes
on a node for awhile, then gets moved to another node. In that case
the best memory allocation depends on several factors. The first is to
determine where we measure latency from - the node where the process
is currently executing or where it had been executing? From a
scheduling perspective we are playing with the idea of a home node. If
a process is dispatched off of the home node, should memory
allocations come from the home node or the current node? If neither
has available memory, should latency be considered from home or
current?

The other way that memory binding has been used is for a large
process, typically a database, to want control over where it places
memory and data structures. It is not a latency issue, but rather one
of how the work is distributed across the system.

The memory binding that Matt proposed allows an architecture to define
what a memory block is, and the application to determine how it wants
to bind to the memory blocks. It is only intended for use by
applications that are aware of the complexities involved, and these
will have to be knowledgeable about the systems that they are on.
Hopefully, by default, Linux will do the right things as far as memory
placement for processes that choose to leave it up to the system -
which will be the majority of apps. However, the apps that want to
specify their memory placement want the ability to have explicit
control over the nodes they land on. It is not strictly a latency
based decision.

                Michael
--
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
  2002-07-13 20:08 ` Andi Kleen
  2002-07-14 19:17 ` Linus Torvalds
@ 2002-07-15 17:48 ` Matthew Dobson
  1 sibling, 0 replies; 20+ messages in thread

From: Matthew Dobson @ 2002-07-15 17:48 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, linux-kernel, Michael Hohnbaum, Martin Bligh,
    Linus Torvalds

Andi Kleen wrote:
> Andrew Morton <akpm@zip.com.au> writes:
> > AFAIK, the interested parties with this and the memory binding API are
> > ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon. It would be helpful
> > if the owners of those platforms could review this work and say "yes,
> > this is something we can use and build upon". Have they done that?
>
> Comment from the x86-64 side:
>
> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would be always identical to the CPU number. As long
> as the API provides it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.
>
> Some way to tell user space about memory affinity seems to be useful,
> but...

That shouldn't be a problem at all. Since each architecture is
responsible for defining the 5 main topology functions, you could do
this:

#define _cpu_to_node(cpu)		(cpu)
#define _memblk_to_node(memblk)		(memblk)
#define _node_to_node(node)		(node)
#define _node_to_cpu(node)		(node)
#define _node_to_memblk(node)		(node)

> General comment:
>
> I don't see what the application should do with the memblk concept
> currently. Just knowing about it doesn't seem too useful.
> Surely it needs some way to allocate memory in a specific memblk to
> be useful? Also doesn't it need to know how much memory is available
> in each memblk? (otherwise I don't see how it could do any useful
> partitioning)

For that, you need to look at the Memory Binding API that I sent out
moments after this patch... It builds on top of this infrastructure to
allow binding processes to individual memory blocks or groups of
memory blocks.

Cheers!

-Matt

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [patch[ Simple Topology API
@ 2002-07-15 19:50 Jukka Honkela
  0 siblings, 0 replies; 20+ messages in thread

From: Jukka Honkela @ 2002-07-15 19:50 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-kernel

Chris Friesen wrote:
> > Beyond 8-way, you need glue logic (hypertransport switches?) and
> > latency seems bound to become an issue.
>
> Nope. Just extend the ladder. Each cpu talks to three other entities,
> either cpu or I/O. Can be extended arbitrarily until latencies are too
> high.

You seem to be missing one critical piece from the OLS talk. The HT
protocol (or something related) can't handle more than 8 CPU's in a
single configuration. You need some kind of bridge to connect more
than 8 CPU's together, although systems with more than 8 CPU's have
not been discussed officially anywhere, afaik.

8 CPU's and fewer belong to the SUMO category (Sufficiently Uniform
Memory Organization, apparently new AMD terminology) whereas 9 CPU's
and more is likely to be NUMA.

--
Jukka Honkela

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2002-07-17 0:20 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-13 0:35 [patch[ Simple Topology API Matthew Dobson
2002-07-13 2:49 ` Andrew Morton
2002-07-15 18:49 ` Matthew Dobson
2002-07-13 8:04 ` Alexander Viro
2002-07-13 17:13 ` Albert D. Cahalan
2002-07-15 23:52 ` Matthew Dobson
[not found] <3D2F75D7.3060105@us.ibm.com.suse.lists.linux.kernel>
[not found] ` <3D2F9521.96D7080B@zip.com.au.suse.lists.linux.kernel>
2002-07-13 20:08 ` Andi Kleen
2002-07-14 19:17 ` Linus Torvalds
2002-07-14 19:43 ` Andi Kleen
2002-07-15 2:34 ` Eric W. Biederman
2002-07-15 15:25 ` Sandy Harris
2002-07-15 16:33 ` Chris Friesen
2002-07-16 10:30 ` Eric W. Biederman
2002-07-16 12:59 ` Rik van Riel
2002-07-16 15:45 ` Martin J. Bligh
2002-07-16 19:03 ` Martin J. Bligh
2002-07-16 22:29 ` Matthew Dobson
2002-07-17 0:21 ` Michael Hohnbaum
2002-07-15 17:48 ` Matthew Dobson
-- strict thread matches above, loose matches on Subject: below --
2002-07-15 19:50 Jukka Honkela
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox