* Re: latest linus-2.5 BK broken
@ 2002-06-18 23:38 Michael Hohnbaum
2002-06-18 23:57 ` Ingo Molnar
0 siblings, 1 reply; 70+ messages in thread
From: Michael Hohnbaum @ 2002-06-18 23:38 UTC (permalink / raw)
To: torvalds; +Cc: rusty, rml, linux-kernel, colpatch
On Tuesday, June 18 2002, Linus Torvalds wrote:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
>
>> NO. They want to be node-affine. They don't want to specify what
>> CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a
> volvo, but that's idiotic.
>
> Besides, even that would be broken. You want bitmaps, because bitmaps
> is really what it is all about. It's NOT about "I must run on this
> CPU", it can equally well be "I mustn't run on those two CPU's that
> are hosting the RT part of this thing" or something like that.
>
> Linus
A bit mask is a very good choice for the sched_setaffinity()
interface. I would suggest an additional argument be added
which would indicate the resource that the process is to be
affined to. That way this interface could be used for binding
processes to cpus, memory nodes, perhaps NUMA nodes, and,
as discussed recently in another thread, other processes.
Personally, I see NUMA nodes as overkill if a process can
already be bound to cpus and memory nodes.
There has been an effort made to address the needs for binding
processes to processors, memory nodes, etc. for NUMA machines.
A proposed API has been developed and implemented. See
http://lse.sourceforge.net/numa/numa_api.html for a spec on
the API. Matt Dobson has posted the implementation to lkml
as a patch against 2.5 several times, but it has not seen much
discussion. Many of the capabilities provided by the NUMA API
could be provided through sched_setaffinity() as described
above.
Michael Hohnbaum
hohnbaum@us.ibm.com
^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: latest linus-2.5 BK broken
  2002-06-18 23:57 ` Ingo Molnar
  0 siblings, 3 replies; 70+ messages in thread
From: Ingo Molnar @ 2002-06-18 23:57 UTC (permalink / raw)
To: Michael Hohnbaum
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel, colpatch

On 18 Jun 2002, Michael Hohnbaum wrote:

> A bit mask is a very good choice for the sched_setaffinity()
> interface. [...]

thanks :)

> [...] I would suggest an additional argument be added
> which would indicate the resource that the process is to be
> affined to. That way this interface could be used for binding
> processes to cpus, memory nodes, perhaps NUMA nodes, and,
> as discussed recently in another thread, other processes.
> Personally, I see NUMA nodes as an overkill, if a process
> can be bound to cpus and memory nodes.

are you sure we want one generic, process-based affinity interface?

i think affinity to certain memory regions might need to be more
fine-grained than this. E.g. it could be useful to define a per-file
(per-inode) 'backing store memory node' that the file is affine to; this
would cause the pagecache for that file to be allocated in that memory
node. Process-based affinity does not describe this in a natural way.
Another example, memory maps: we might want a certain memory map (vma)
allocated in a given memory node, independently of where the process that
is faulting a given page resides.

and it might certainly make sense to have some sort of 'default memory
affinity' for a process as well, but this should be a different syscall -
it really does a much different thing than CPU affinity. The CPU resource
is 'used' only temporarily, with little footprint, while memory is often
used for a very long timespan, and the affinity strategies differ
greatly. Also, memory as a resource is much more complex than CPU; e.g.
it must handle things like over-allocation and fallback to 'nearby' nodes
when a node is full.

so i'd suggest actually creating a good memory-affinity syscall interface
instead of trying to generalize it into the simple, robust, finite
CPU-affinity syscalls.

	Ingo
* Re: latest linus-2.5 BK broken
  2002-06-19  0:08 ` Ingo Molnar
  2 siblings, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2002-06-19 0:08 UTC (permalink / raw)
To: Michael Hohnbaum
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel, colpatch

another thought would be that the 'default' memory affinity can be
derived from the CPU affinity. A default process, one which is affine to
all CPUs, can have memory allocated from all memory nodes. A process
which is bound to a given set of CPUs should get its memory allocated
from the nodes that 'belong' to those CPUs.

the topology might not be as simple as this, but generally it's the CPU
that drives the topology, so a given CPU affinity mask leads to a
specific 'preferred memory nodes' bitmask - there isn't much choice
needed on the user's part; in fact it might be counterproductive to bind
a process to some CPU and bind its memory allocations to a very distant
memory node. While mathematically there is not necessarily any 1:1
relationship between CPU affinity and 'best memory affinity',
technologically there is.

per-object affinity might still be possible under this scheme; it would
override whatever 'default' memory affinity is derived from the CPU
affinity mask. [that would enable, for example, an important database
file to be locked to a given memory node, so that helper processes
executing on distant CPUs will not cause a distant pagecache page to be
allocated.]

another advantage is that this removes from the application writer the
burden of having to figure out the actual memory topology and fit the
CPU affinity to the memory affinity (and vice versa). The kernel can
figure out a good default memory affinity based on the CPU affinity
mask.

(so everything so far points in the direction of having a simple CPU
affinity syscall, which we have now.)
	Ingo
* Re: latest linus-2.5 BK broken
  2002-06-19  1:00 ` Matthew Dobson
  2 siblings, 0 replies; 70+ messages in thread
From: Matthew Dobson @ 2002-06-19 1:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Michael Hohnbaum, Linus Torvalds, Rusty Russell, Robert Love, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3108 bytes --]

Ingo Molnar wrote:
> On 18 Jun 2002, Michael Hohnbaum wrote:
>> [...] I would suggest an additional argument be added
>> which would indicate the resource that the process is to be
>> affined to. That way this interface could be used for binding
>> processes to cpus, memory nodes, perhaps NUMA nodes, and,
>> as discussed recently in another thread, other processes.
>> Personally, I see NUMA nodes as an overkill, if a process
>> can be bound to cpus and memory nodes.
>
> are you sure we want one generic, process-based affinity interface?
>
> i think the affinity to certain memory regions might need to be more
> finegrained than this. Eg. it could be useful to define a per-file
> (per-inode) 'backing store memory node' that the file is affine to. This
> will eg. cause the pagecache to be allocated in the memory node.
> Process-based affinity does not describe this in a natural way. Another
> example, memory maps: we might want to have a certain memory map (vma)
> allocated in a given memory node, independently of where the process that
> is faulting a given pages resides.
>
> and it might certainly make sense to have some sort of 'default memory
> affinity' for a process as well, but this should be a different syscall -
> it really does a much different thing than CPU affinity. The CPU resource
> is 'used' only temporarily with little footprint, while memory usage is
> often for a very long timespan, and the affinity strategies differ
> greatly. Also, memory as a resource is much more complex than CPU, eg.
> it must handle things like over-allocation, fallback to 'nearby' nodes if a
> node is full, etc.

I've attached copies of the patches that Michael referred to in his email
so you can see where we're going with this. I think that we have (at
least the beginnings of) what you've described. The patch allows
processes to bind to specific CPUs (via bitmask) and/or specific memory
blocks. You can set these up to complement each other, or to something
completely arbitrary (for debugging purposes, etc.). It also includes the
beginnings of very simple topology info, with some simple
arch-independent calls (cpu_to_node, node_to_cpu, node_to_memblk, etc.).
Of course these require some lower-level hooks for each architecture that
wants to use them, but those should be simple calls.

I've been sidetracked on other things for about a month, but I plan on
getting back to this patch ASAP (this week) and porting it forward to the
latest version. It is currently only up to 2.5.14. If anyone has any
suggestions for other features, changes, comments, flames, ANYTHING,
please let me know.

> so i'd suggest to actually create a good memory-affinity syscall interface
> instead of trying to generalize it into the simple, robust, finite
> CPU-affinity syscalls.
See above ;)

-Matt

> 	Ingo
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

[-- Attachment #2: numa_api-arch_dep-2.5.14.patch --]
[-- Type: text/plain, Size: 4065 bytes --]

diff -Nur linux-2.5.12-vanilla/include/asm-i386/core_ibmnumaq.h linux-2.5.12-api/include/asm-i386/core_ibmnumaq.h
--- linux-2.5.12-vanilla/include/asm-i386/core_ibmnumaq.h	Wed Dec 31 16:00:00 1969
+++ linux-2.5.12-api/include/asm-i386/core_ibmnumaq.h	Wed May  1 17:24:25 2002
@@ -0,0 +1,61 @@
+/*
+ * linux/include/asm-i386/core_ibmnumaq.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_CORE_IBMNUMAQ_H_
+#define _ASM_CORE_IBMNUMAQ_H_
+
+/*
+ * These functions need to be defined for every architecture.
+ * The first five are necessary for the NUMA API to function.
+ * The last is needed by several pieces of NUMA code.
+ */
+
+/* Returns the number of the node containing CPU 'cpu' */
+#define _cpu_to_node(cpu)	(cpu_to_logical_apicid(cpu) >> 4)
+
+/* Returns the number of the node containing MemBlk 'memblk' */
+#define _memblk_to_node(memblk)	(memblk)
+
+/* Returns the number of the node containing Node 'nid'.  This architecture
+   is flat, so it is a pretty simple function. */
+#define _node_to_node(nid)	(nid)
+
+/* Returns the number of the first CPU on Node 'node' */
+static inline int _node_to_cpu(int node)
+{
+	int i, cpu, logical_apicid = node << 4;
+
+	for(i = 1; i < 16; i <<= 1)
+		if ((cpu = logical_apicid_to_cpu(logical_apicid | i)) >= 0)
+			return cpu;
+
+	return 0;
+}
+
+/* Returns the number of the first MemBlk on Node 'node' */
+#define _node_to_memblk(node)	(node)
+
+#endif /* _ASM_CORE_IBMNUMAQ_H_ */
diff -Nur linux-2.5.12-vanilla/include/asm-i386/mmzone.h linux-2.5.12-api/include/asm-i386/mmzone.h
--- linux-2.5.12-vanilla/include/asm-i386/mmzone.h	Wed Dec 31 16:00:00 1969
+++ linux-2.5.12-api/include/asm-i386/mmzone.h	Wed May  1 17:24:25 2002
@@ -0,0 +1,45 @@
+/*
+ * linux/include/asm-i386/mmzone.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_MMZONE_H_
+#define _ASM_MMZONE_H_
+
+#include <asm/smpboot.h>
+
+#ifdef CONFIG_IBMNUMAQ
+#include <asm/core_ibmnumaq.h>
+#else /* !CONFIG_IBMNUMAQ */
+#define _cpu_to_node(cpu)	(0)
+#define _memblk_to_node(memblk)	(0)
+#define _node_to_node(nid)	(0)
+#define _node_to_cpu(node)	(0)
+#define _node_to_memblk(node)	(0)
+#endif /* CONFIG_IBMNUMAQ */
+
+/* Returns the number of the current Node. */
+#define numa_node_id()	(_cpu_to_node(smp_processor_id()))
+
+#endif /* _ASM_MMZONE_H_ */

[-- Attachment #3: numa_api-arch_indep-impl-2.5.14.patch --]
[-- Type: text/plain, Size: 17845 bytes --]

diff -Nur linux-2.5.8-vanilla/kernel/Makefile linux-2.5.8-api/kernel/Makefile
--- linux-2.5.8-vanilla/kernel/Makefile	Sun Apr 14 12:18:47 2002
+++ linux-2.5.8-api/kernel/Makefile	Mon Apr 22 15:35:16 2002
@@ -15,7 +15,7 @@
 obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
	module.o exit.o itimer.o info.o time.o softirq.o resource.o \
	sysctl.o capability.o ptrace.o timer.o user.o \
-	signal.o sys.o kmod.o context.o futex.o platform.o
+	signal.o sys.o kmod.o context.o futex.o platform.o numa.o

 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -Nur linux-2.5.8-vanilla/kernel/fork.c linux-2.5.8-api/kernel/fork.c
--- linux-2.5.8-vanilla/kernel/fork.c	Sun Apr 14 12:18:45 2002
+++ linux-2.5.8-api/kernel/fork.c	Tue Apr 23 14:49:29 2002
@@ -707,6 +707,20 @@
	spin_lock_init(&p->sigmask_lock);
	}
 #endif
+	if (!null_restrict(&p->numa_launch_policy)){
+		p->numa_binding = p->numa_launch_policy;
+		p->cpus_allowed = p->numa_binding.cpus.list & p->numa_restrict.cpus.list;
+		if (!(p->cpus_allowed & cpu_online_map))
+			BUG();
+		if (p->cpus_allowed & (1UL << smp_processor_id()))
+			p->thread_info->cpu = smp_processor_id();
+		else
+			p->thread_info->cpu = __ffs(p->cpus_allowed & cpu_online_map);
+	} else
+		p->thread_info->cpu = smp_processor_id();
+	numa_set_init(&p->numa_launch_policy);
+
rwlock_init(&p->numa_api_lock); + p->array = NULL; p->lock_depth = -1; /* -1 = no lock */ p->start_time = jiffies; diff -Nur linux-2.5.8-vanilla/kernel/numa.c linux-2.5.8-api/kernel/numa.c --- linux-2.5.8-vanilla/kernel/numa.c Wed Dec 31 16:00:00 1969 +++ linux-2.5.8-api/kernel/numa.c Mon Apr 29 11:21:27 2002 @@ -0,0 +1,378 @@ +/* + * linux/kernel/numa.c + * + * Written by: Matthew Dobson, IBM Corporation + * + * Copyright (C) 2002, IBM Corp. + * + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ * + * Send feedback to <colpatch@us.ibm.com> + */ +#include <linux/kernel.h> +#include <linux/unistd.h> +#include <linux/config.h> +#include <linux/sched.h> +#include <linux/numa.h> +#include <linux/mmzone.h> +#include <linux/errno.h> +#include <linux/smp.h> + + +#define is_valid_cpu_behavior(x) (x == CPU_BIND_STRICT) +#define is_valid_memblk_behavior(x) (((x & 0x7) == MPOL_STRICT) || ((x & 0x7) == MPOL_LOOSE)) + +#define is_numa_subset(x, y) (!((x) & ~(y))) /* test whether x is a subset of y */ + + +extern int nummemblks; +extern unsigned long memblk_online_map; + +/* + * set_restricted_cpus(): Sets up a new CPU Restriction Set + */ +int set_restricted_cpus(numa_bitmap_t cpus, numa_set_t *numamap) +{ + int ret; + unsigned long flags; + numa_bitmap_t cpu_binding; + + ret = -ENODEV; + /* Make sure that at least one of the cpus in the new restriction set is online. */ + if (!(cpus & cpu_online_map)) + goto out; + + read_lock_irqsave(¤t->numa_api_lock, flags); + cpu_binding = current->numa_binding.cpus.list; + /* If there is a binding, at least one of the bound cpus must be valid in the + new restriction set. */ + if ((!null_restrict(¤t->numa_binding)) && + (!(cpu_binding & cpus))) + goto out_unlock; + + ret = -EPERM; + /* If the new restriction expands upon the old restriction, the caller must + have CAP_SYS_NICE. */ + if ((!is_numa_subset(cpus, current->numa_restrict.cpus.list)) && + (!capable(CAP_SYS_NICE))) + goto out_unlock; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + write_lock_irqsave(¤t->numa_api_lock, flags); + current->numa_restrict.cpus.list = cpus; + write_unlock_irqrestore(¤t->numa_api_lock, flags); + + /* Set cpus_allowed to the current binding masked against the new list of allowed cpus. 
*/ + set_cpus_allowed(current, cpu_binding & cpus); + ret = 0; + goto out; + + out_unlock: + read_unlock_irqrestore(¤t->numa_api_lock, flags); + out: + return ret; +} + +/* + * set_restricted_memblks(): Sets up a new MemBlk Restriction Set + */ +int set_restricted_memblks(numa_bitmap_t memblks, numa_set_t *numamap) +{ + int ret; + unsigned long flags; + + ret = -ENODEV; + /* Make sure that at least one of the memblks in the new restriction set is online. */ + if (!(memblks & memblk_online_map)) + goto out; + + read_lock_irqsave(¤t->numa_api_lock, flags); + /* If there is a binding, at least one of the bound memblks must be valid in the + new restriction set. */ + if ((!null_restrict(¤t->numa_binding)) && + (!(current->numa_binding.memblks.list & memblks))) + goto out_unlock; + + ret = -EPERM; + /* If the new restriction expands upon the old restriction, the caller + must have CAP_SYS_NICE. */ + if ((!is_numa_subset(memblks, current->numa_restrict.memblks.list)) && + (!capable(CAP_SYS_NICE))) + goto out_unlock; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + write_lock_irqsave(¤t->numa_api_lock, flags); + current->numa_restrict.memblks.list = memblks; + write_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = 0; + goto out; + + out_unlock: + read_unlock_irqrestore(¤t->numa_api_lock, flags); + out: + return ret; +} + +/* + * get_restricted_cpus(): Returns the current CPU Restriction Set + */ +inline numa_bitmap_t get_restricted_cpus(void) +{ + unsigned long flags; + numa_bitmap_t cpu_restriction; + + read_lock_irqsave(¤t->numa_api_lock, flags); + cpu_restriction = current->numa_restrict.cpus.list; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + return cpu_restriction; +} + +/* + * get_restricted_memblks(): Returns the current MemBlk Restriction Set + */ +inline numa_bitmap_t get_restricted_memblks(void) +{ + unsigned long flags; + numa_bitmap_t memblk_restriction; + + read_lock_irqsave(¤t->numa_api_lock, flags); + memblk_restriction = 
current->numa_restrict.memblks.list; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + return memblk_restriction; +} + +/* + * cpu_to_node(cpu): Returns the number of the most specific Node + * containing CPU 'cpu'. + */ +inline int cpu_to_node(int cpu) +{ + if (cpu == -1) /* return highest numbered node */ + return (numnodes - 1); + + if ((cpu < 0) || (cpu >= NR_CPUS) || + (!(cpu_online_map & (1 << cpu)))) /* invalid cpu # */ + return -ENODEV; + + return _cpu_to_node(cpu_logical_map(cpu)); +} + +/* + * memblk_to_node(memblk): Returns the number of the most specific Node + * containing Memory Block 'memblk'. + */ +inline int memblk_to_node(int memblk) +{ + if (memblk == -1) /* return highest numbered node */ + return (numnodes - 1); + + if ((memblk < 0) || (memblk >= NR_MEMBLKS) || + (!(memblk_online_map & (1 << memblk)))) /* invalid memblk # */ + return -ENODEV; + + return _memblk_to_node(memblk); +} + +/* + * node_to_node(nid): Returns the number of the of the most specific Node that + * encompasses Node 'nid'. Some may call this the parent Node of 'nid'. + */ +int node_to_node(int nid) +{ + if ((nid < 0) || (nid >= numnodes)) /* invalid node # */ + return -ENODEV; + + return _node_to_node(nid); +} + +/* + * node_to_cpu(nid): Returns the lowest numbered CPU on Node 'nid' + */ +inline int node_to_cpu(int nid) +{ + if (nid == -1) /* return highest numbered cpu */ + return (smp_num_cpus - 1); + + if ((nid < 0) || (nid >= numnodes)) /* invalid node # */ + return -ENODEV; + + return _node_to_cpu(nid); +} + +/* + * node_to_memblk(nid): Returns the lowest numbered MemBlk on Node 'nid' + */ +inline int node_to_memblk(int nid) +{ + if (nid == -1) /* return highest numbered memblk */ + return (nummemblks - 1); + + if ((nid < 0) || (nid >= numnodes)) /* invalid node # */ + return -ENODEV; + + return _node_to_memblk(nid); +} + +/* + * get_cpu(): Returns the currently executing CPU number. 
+ * For now, this has only mild usefulness, as this information could + * change on the return from syscall (which automatically calls schedule()). + * Due to this, the data could be stale by the time it gets back to the user. + * It will have to do, until a better method is found. + */ +inline int get_cpu(void) +{ + return smp_processor_id(); +} + +/* + * get_node(): Returns the number of the Node containing + * the currently executing CPU. Subject to the same caveat + * as the get_cpu() call. + */ +inline int get_node(void) +{ + return cpu_to_node(get_cpu()); +} + +/* + * bind_to_cpu(): Sets up a new CPU Binding + */ +int bind_to_cpu(numa_bitmap_t cpus, int behavior) +{ + int ret; + unsigned long flags; + numa_bitmap_t cpu_restriction; + + read_lock_irqsave(¤t->numa_api_lock, flags); + ret = -ENODEV; + /* Make sure that at least one of the cpus in the new binding is online, AND + in the current restriction set. */ + if (!(cpus & cpu_online_map & current->numa_restrict.cpus.list)) + goto out_unlock; + cpu_restriction = current->numa_restrict.cpus.list; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = -EINVAL; + /* Test to make sure the behavior argument is valid. */ + if (!is_valid_cpu_behavior(behavior)) + goto out; + + write_lock_irqsave(¤t->numa_api_lock, flags); + current->numa_binding.cpus.list = cpus; + current->numa_binding.cpus.behavior = behavior; + write_unlock_irqrestore(¤t->numa_api_lock, flags); + + /* Set cpus_allowed to the new binding masked against the current list of allowed cpus. 
*/ + set_cpus_allowed(current, cpus & cpu_restriction); + ret = 0; + goto out; + + out_unlock: + read_unlock_irqrestore(¤t->numa_api_lock, flags); + out: + return ret; +} + +/* + * bind_to_memblk(): Sets up a new MemBlk Binding + */ +int bind_to_memblk(numa_bitmap_t memblks, int behavior) +{ + int ret; + unsigned long flags; + + read_lock_irqsave(¤t->numa_api_lock, flags); + ret = -ENODEV; + /* Make sure that at least one of the memblks in the new binding is online, AND + in the current restriction set. */ + if (!(memblks & memblk_online_map & current->numa_restrict.memblks.list)) + goto out_unlock; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = -EINVAL; + /* Test to make sure the behavior argument is valid. */ + if (!is_valid_memblk_behavior(behavior)) + goto out; + + write_lock_irqsave(¤t->numa_api_lock, flags); + current->numa_binding.memblks.list = memblks; + current->numa_binding.memblks.behavior = behavior; + write_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = 0; + goto out; + + out_unlock: + read_unlock_irqrestore(¤t->numa_api_lock, flags); + out: + return ret; +} + +/* + * bind_memory(): Will eventually set up a memory binding for a specific chunk of memory. + * Specifically, the chunk starting at 'start' through 'len' bytes. As of now, it doesn't + * *quite* do that. ;) + */ +inline int bind_memory(unsigned long start, size_t len, numa_bitmap_t memblks, int behavior) +{ + return -ENOTSUPP; +} + +/* + * set_launch_policy(): Sets up a new Launch Policy for current process + */ +int set_launch_policy(numa_bitmap_t cpus, int cpu_behavior, + numa_bitmap_t memblks, int memblk_behavior) +{ + int ret; + unsigned long flags; + + read_lock_irqsave(¤t->numa_api_lock, flags); + ret = -ENODEV; + /* Make sure that at least one of the cpus and one of the memblks in the new + binding are online, AND in the current restriction set. 
*/ + if ((!(cpus & cpu_online_map & current->numa_restrict.cpus.list)) || + (!(memblks & memblk_online_map & current->numa_restrict.memblks.list))) + goto out_unlock; + read_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = -EINVAL; + /* Test to make sure the behavior arguments are valid. */ + if ((!is_valid_cpu_behavior(cpu_behavior)) || + (!is_valid_memblk_behavior(memblk_behavior))) + goto out; + + write_lock_irqsave(¤t->numa_api_lock, flags); + current->numa_launch_policy.cpus.list = cpus; + current->numa_launch_policy.cpus.behavior = cpu_behavior; + current->numa_launch_policy.memblks.list = memblks; + current->numa_launch_policy.memblks.behavior = memblk_behavior; + write_unlock_irqrestore(¤t->numa_api_lock, flags); + + ret = 0; + goto out; + + out_unlock: + read_unlock_irqrestore(¤t->numa_api_lock, flags); + out: + return ret; +} diff -Nur linux-2.5.8-vanilla/mm/numa.c linux-2.5.8-api/mm/numa.c --- linux-2.5.8-vanilla/mm/numa.c Sun Apr 14 12:18:49 2002 +++ linux-2.5.8-api/mm/numa.c Wed Apr 24 11:26:18 2002 @@ -8,8 +8,11 @@ #include <linux/bootmem.h> #include <linux/mmzone.h> #include <linux/spinlock.h> +#include <linux/numa.h> int numnodes = 1; /* Initialized for UMA platforms */ +int nummemblks = 0; +unsigned long memblk_online_map = 0UL; /* Similar to cpu_online_map, but for memory blocks */ static bootmem_data_t contig_bootmem_data; pg_data_t contig_page_data = { bdata: &contig_bootmem_data }; @@ -27,6 +30,10 @@ { free_area_init_core(0, &contig_page_data, &mem_map, zones_size, zone_start_paddr, zholes_size, pmap); + contig_page_data.node_id = 0; + contig_page_data.memblk_id = 0; + nummemblks = 1; + memblk_online_map = 1UL; } #endif /* !CONFIG_DISCONTIGMEM */ @@ -71,6 +78,11 @@ free_area_init_core(nid, pgdat, &discard, zones_size, zone_start_paddr, zholes_size, pmap); pgdat->node_id = nid; + pgdat->memblk_id = nummemblks; + if (test_and_set_bit(nummemblks++, &memblk_online_map)){ + printk("memblk alread counted?!?!\n"); + BUG(); + } /* * Get space for 
the valid bitmap. @@ -88,6 +100,8 @@ return __alloc_pages(gfp_mask, order, pgdat->node_zonelists + (gfp_mask & GFP_ZONEMASK)); } +#ifdef CONFIG_NUMA + /* * This can be refined. Currently, tries to do round robin, instead * should do concentratic circle search, starting from current node. @@ -96,35 +110,84 @@ { struct page *ret = 0; pg_data_t *start, *temp; -#ifndef CONFIG_NUMA + int search_twice = 0; + numa_bitmap_t memblk_bitmask, memblk_bitmask2; unsigned long flags; - static pg_data_t *next = 0; -#endif if (order >= MAX_ORDER) return NULL; -#ifdef CONFIG_NUMA + + read_lock_irqsave(¤t->numa_api_lock, flags); + if (null_restrict(¤t->numa_binding)) + /* if there is no binding, only search the restriction set */ + memblk_bitmask = current->numa_restrict.memblks.list; + else { + /* if there is a binding, search it */ + memblk_bitmask = current->numa_binding.memblks.list; + if (current->numa_binding.memblks.behavior == MPOL_LOOSE){ + /* and if it is a loose binding, remember to search + the restriction if we come up empty */ + search_twice = 1; + /* no need to search the memblks in the binding again, + so we'll mask them out. 
*/ + memblk_bitmask2 = current->numa_restrict.memblks.list & ~memblk_bitmask; + } + } + read_unlock_irqrestore(¤t->numa_api_lock, flags); + +search_through_memblks: temp = NODE_DATA(numa_node_id()); -#else - spin_lock_irqsave(&node_lock, flags); - if (!next) next = pgdat_list; - temp = next; - next = next->node_next; - spin_unlock_irqrestore(&node_lock, flags); -#endif start = temp; while (temp) { - if ((ret = alloc_pages_pgdat(temp, gfp_mask, order))) - return(ret); + if (memblk_bitmask & (1 << temp->memblk_id)) + if ((ret = alloc_pages_pgdat(temp, gfp_mask, order))) + return(ret); temp = temp->node_next; } temp = pgdat_list; while (temp != start) { + if (memblk_bitmask & (1 << temp->memblk_id)) + if ((ret = alloc_pages_pgdat(temp, gfp_mask, order))) + return(ret); + temp = temp->node_next; + } + + if (search_twice) { + /* + * If we failed to find a "preferred" memblk, try again + * looking for anything that's allowed (in restrict), but + * skip those memblks we've already looked at + */ + search_twice = 0; /* no infinite loops, please */ + memblk_bitmask = memblk_bitmask2; + goto search_through_memblks; + } + return(0); +} + +#else /* !CONFIG_NUMA */ + +struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order) +{ + struct page *ret = 0; + pg_data_t *temp; + unsigned long flags; + + if (order >= MAX_ORDER) + return NULL; + + spin_lock_irqsave(&node_lock, flags); + temp = pgdat_list; + spin_unlock_irqrestore(&node_lock, flags); + + while (temp) { if ((ret = alloc_pages_pgdat(temp, gfp_mask, order))) return(ret); temp = temp->node_next; } return(0); } + +#endif /* CONFIG_NUMA */ #endif /* CONFIG_DISCONTIGMEM */ diff -Nur linux-2.5.8-vanilla/mm/page_alloc.c linux-2.5.8-api/mm/page_alloc.c --- linux-2.5.8-vanilla/mm/page_alloc.c Sun Apr 14 12:18:44 2002 +++ linux-2.5.8-api/mm/page_alloc.c Mon Apr 22 15:35:16 2002 @@ -41,6 +41,9 @@ static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, }; static int zone_balance_max[MAX_NR_ZONES] __initdata = 
 { 255 , 255, 255, };
+extern int nummemblks;
+extern unsigned long memblk_online_map;
+
 /*
  * Free_page() adds the page to the free lists. This is optimized for
  * fast normal cases (no error jumps taken normally).
  */
@@ -955,6 +958,10 @@
 void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 0, 0, 0);
+	contig_page_data.node_id = 0;
+	contig_page_data.memblk_id = 0;
+	nummemblks = 1;
+	memblk_online_map = 1UL;
 }
 
 static int __init setup_mem_frac(char *str)

[-- Attachment #4: numa_api-arch_indep-setup-2.5.14.patch --]
[-- Type: text/plain, Size: 6538 bytes --]

diff -Nur linux-2.5.8-vanilla/include/linux/init_task.h linux-2.5.8-api/include/linux/init_task.h
--- linux-2.5.8-vanilla/include/linux/init_task.h	Mon Apr 22 17:20:20 2002
+++ linux-2.5.8-api/include/linux/init_task.h	Fri Apr 26 15:22:52 2002
@@ -59,6 +59,10 @@
     children:		LIST_HEAD_INIT(tsk.children),		\
     sibling:		LIST_HEAD_INIT(tsk.sibling),		\
     thread_group:	LIST_HEAD_INIT(tsk.thread_group),	\
+    numa_restrict:	NEW_NUMA_SET,				\
+    numa_binding:	NEW_NUMA_SET,				\
+    numa_launch_policy:	NEW_NUMA_SET,				\
+    numa_api_lock:	RW_LOCK_UNLOCKED,			\
     wait_chldexit:	__WAIT_QUEUE_HEAD_INITIALIZER(tsk.wait_chldexit),\
     real_timer:		{					\
 	function:	it_real_fn				\
diff -Nur linux-2.5.8-vanilla/include/linux/mmzone.h linux-2.5.8-api/include/linux/mmzone.h
--- linux-2.5.8-vanilla/include/linux/mmzone.h	Mon Apr 22 17:13:25 2002
+++ linux-2.5.8-api/include/linux/mmzone.h	Fri Apr 26 17:15:28 2002
@@ -136,6 +136,7 @@
 	unsigned long node_start_mapnr;
 	unsigned long node_size;
 	int node_id;
+	int memblk_id;	/* A unique ID for each memory block (physical contiguous chunk of memory) */
 	struct pglist_data *node_next;
 } pg_data_t;
@@ -163,14 +164,15 @@
 #define NODE_MEM_MAP(nid)	mem_map
 #define MAX_NR_NODES		1
 
-#else /* !CONFIG_DISCONTIGMEM */
+#endif /* !CONFIG_DISCONTIGMEM */
 
-#include <asm/mmzone.h>
+#if defined (CONFIG_DISCONTIGMEM) || defined (CONFIG_NUMA)
+#include <asm/mmzone.h>
 
 /* page->zone is currently 8 bits ... */
 #define MAX_NR_NODES	(255 / MAX_NR_ZONES)
 
-#endif /* !CONFIG_DISCONTIGMEM */
+#endif /* CONFIG_DISCONTIGMEM || CONFIG_NUMA */
 
 #define MAP_ALIGN(x)	((((x) % sizeof(mem_map_t)) == 0) ? (x) : ((x) + \
 			sizeof(mem_map_t) - ((x) % sizeof(mem_map_t))))
diff -Nur linux-2.5.8-vanilla/include/linux/numa.h linux-2.5.8-api/include/linux/numa.h
--- linux-2.5.8-vanilla/include/linux/numa.h	Wed Dec 31 16:00:00 1969
+++ linux-2.5.8-api/include/linux/numa.h	Mon Apr 29 11:03:20 2002
@@ -0,0 +1,76 @@
+/*
+ * linux/include/linux/numa.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _LINUX_NUMA_H_
+#define _LINUX_NUMA_H_
+
+#include <linux/types.h>
+
+#ifdef CONFIG_NUMA
+#define NR_MEMBLKS	32	/* Max number of Memory Blocks */
+#else
+#define NR_MEMBLKS	1
+#endif
+
+typedef unsigned long numa_bitmap_t;
+#define NUMA_BITMAP_NONE	(~((numa_bitmap_t) 0))
+
+#define CPU_BIND_STRICT	0
+
+#define MPOL_FIRST	1	/* UNUSED FOR NOW */
+#define MPOL_STRIPE	2	/* UNUSED FOR NOW */
+#define MPOL_RR		4	/* UNUSED FOR NOW */
+#define MPOL_STRICT	8	/* Memory MUST be allocated according to binding */
+#define MPOL_LOOSE	16	/* Memory must try to be allocated according to binding first,
+				   and can fall back to restriction if necessary */
+
+
+typedef struct numa_list {
+	numa_bitmap_t	list;
+	int		behavior;
+} numa_list_t;
+
+typedef struct numa_set {
+	numa_list_t	cpus;
+	numa_list_t	memblks;
+} numa_set_t;
+
+
+/* Initializes a numa_set_t to be an empty set. */
+#define numa_set_init(x)	do { (x)->cpus.list = NUMA_BITMAP_NONE;\
+				(x)->memblks.list = NUMA_BITMAP_NONE;\
+				(x)->cpus.behavior = CPU_BIND_STRICT;\
+				(x)->memblks.behavior = MPOL_STRICT; } while(0)
+
+/* Assignment initializer for a numa_set_t to be an empty set */
+#define NEW_NUMA_SET	{ {NUMA_BITMAP_NONE, CPU_BIND_STRICT}, \
+			  {NUMA_BITMAP_NONE, MPOL_STRICT} }
+
+/* Tests whether a numa_set_t represents an empty restriction (ie: all 1's.  All cpus/memblks allowed.) */
+#define null_restrict(x)	(((x)->cpus.list == NUMA_BITMAP_NONE) && \
+				 ((x)->memblks.list == NUMA_BITMAP_NONE))
+
+#endif /* _LINUX_NUMA_H_ */
diff -Nur linux-2.5.8-vanilla/include/linux/sched.h linux-2.5.8-api/include/linux/sched.h
--- linux-2.5.8-vanilla/include/linux/sched.h	Mon Apr 22 17:13:27 2002
+++ linux-2.5.8-api/include/linux/sched.h	Fri Apr 26 15:14:15 2002
@@ -28,6 +28,7 @@
 #include <linux/securebits.h>
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
+#include <linux/numa.h>
 
 struct exec_domain;
 
@@ -286,6 +287,12 @@
 	struct task_struct *pidhash_next;
 	struct task_struct **pidhash_pprev;
 
+	/* additional NUMA stuff */
+	numa_set_t numa_restrict;
+	numa_set_t numa_binding;
+	numa_set_t numa_launch_policy;
+	rwlock_t numa_api_lock;	/* protects the preceding 3 structs */
+
 	wait_queue_head_t wait_chldexit;	/* for wait4() */
 	struct completion *vfork_done;		/* for vfork() */
diff -Nur linux-2.5.8-vanilla/include/linux/smp.h linux-2.5.8-api/include/linux/smp.h
--- linux-2.5.8-vanilla/include/linux/smp.h	Mon Apr 22 17:13:25 2002
+++ linux-2.5.8-api/include/linux/smp.h	Fri Apr 26 15:14:15 2002
@@ -90,6 +90,7 @@
 #define cpu_number_map(cpu)			0
 #define smp_call_function(func,info,retry,wait)	({ 0; })
 #define cpu_online_map				1
+#define memblk_online_map			1
 static inline void smp_send_reschedule(int cpu) { }
 static inline void smp_send_reschedule_all(void) { }
 #define __per_cpu_data
diff -Nur linux-2.5.8-vanilla/kernel/sched.c linux-2.5.8-api/kernel/sched.c
--- linux-2.5.8-vanilla/kernel/sched.c	Mon Apr 22 13:17:43 2002
+++ linux-2.5.8-api/kernel/sched.c	Mon Apr 22 15:35:16 2002
@@ -357,7 +357,7 @@
 	runqueue_t *rq;
 
 	preempt_disable();
-	rq = this_rq();
+	rq = task_rq(p);
 	spin_lock_irq(&rq->lock);
 	p->state = TASK_RUNNING;
@@ -371,7 +371,6 @@
 		p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
 		p->prio = effective_prio(p);
 	}
-	p->thread_info->cpu = smp_processor_id();
 	activate_task(p, rq);
 	spin_unlock_irq(&rq->lock);
@@ -1662,8 +1661,7 @@
 	migration_req_t req;
 	runqueue_t *rq;
 
-	new_mask &= cpu_online_map;
-	if (!new_mask)
+	if (!(new_mask & cpu_online_map))
 		BUG();
 
 	preempt_disable();

[-- Attachment #5: numa_api-prctl-2.5.14.patch --]
[-- Type: text/plain, Size: 3053 bytes --]

diff -Nur linux-2.5.8-vanilla/include/linux/prctl.h linux-2.5.8-api/include/linux/prctl.h
--- linux-2.5.8-vanilla/include/linux/prctl.h	Sun Apr 14 12:18:54 2002
+++ linux-2.5.8-api/include/linux/prctl.h	Wed Apr 24 17:31:33 2002
@@ -26,4 +26,31 @@
 # define PR_FPEMU_NOPRINT	1	/* silently emulate fp operations accesses */
 # define PR_FPEMU_SIGFPE	2	/* don't emulate fp operations, send SIGFPE instead */
 
+/* Get/Set Restricted CPUs and MemBlks */
+#define PR_SET_RESTRICTED_CPUS		11
+#define PR_SET_RESTRICTED_MEMBLKS	12
+#define PR_GET_RESTRICTED_CPUS		13
+#define PR_GET_RESTRICTED_MEMBLKS	14
+
+/* Get CPU/Node */
+#define PR_GET_CPU	15
+#define PR_GET_NODE	16
+
+/* X to Node conversion functions */
+#define PR_CPU_TO_NODE		17
+#define PR_MEMBLK_TO_NODE	18
+#define PR_NODE_TO_NODE		19
+
+/* Node to X conversion functions */
+#define PR_NODE_TO_CPU		20
+#define PR_NODE_TO_MEMBLK	21
+
+/* Set CPU/MemBlk/Memory Bindings */
+#define PR_BIND_TO_CPUS		22
+#define PR_BIND_TO_MEMBLKS	23
+#define PR_BIND_MEMORY		24
+
+/* Set Launch Policy */
+#define PR_SET_LAUNCH_POLICY	25
+
 #endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.5.8-vanilla/kernel/sys.c linux-2.5.8-api/kernel/sys.c
--- linux-2.5.8-vanilla/kernel/sys.c	Sun Apr 14 12:18:45 2002
+++ linux-2.5.8-api/kernel/sys.c	Wed Apr 24 17:32:17 2002
@@ -16,6 +16,7 @@
 #include <linux/highuid.h>
 #include <linux/fs.h>
 #include <linux/device.h>
+#include <linux/numa.h>
 
 #include <asm/uaccess.h>
 #include <asm/io.h>
@@ -1277,6 +1278,51 @@
 			break;
 		}
 		current->keep_capabilities = arg2;
+		break;
+	case PR_SET_RESTRICTED_CPUS:
+		error = (long) set_restricted_cpus((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+		break;
+	case PR_SET_RESTRICTED_MEMBLKS:
+		error = (long) set_restricted_memblks((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+		break;
+	case PR_GET_RESTRICTED_CPUS:
+		error = (long) get_restricted_cpus();
+		break;
+	case PR_GET_RESTRICTED_MEMBLKS:
+		error = (long) get_restricted_memblks();
+		break;
+	case PR_GET_CPU:
+		error = (long) get_cpu();
+		break;
+	case PR_GET_NODE:
+		error = (long) get_node();
+		break;
+	case PR_CPU_TO_NODE:
+		error = (long) cpu_to_node((int)arg2);
+		break;
+	case PR_MEMBLK_TO_NODE:
+		error = (long) memblk_to_node((int)arg2);
+		break;
+	case PR_NODE_TO_NODE:
+		error = (long) node_to_node((int)arg2);
+		break;
+	case PR_NODE_TO_CPU:
+		error = (long) node_to_cpu((int)arg2);
+		break;
+	case PR_NODE_TO_MEMBLK:
+		error = (long) node_to_memblk((int)arg2);
+		break;
+	case PR_BIND_TO_CPUS:
+		error = (long) bind_to_cpu((numa_bitmap_t)arg2, (int)arg3);
+		break;
+	case PR_BIND_TO_MEMBLKS:
+		error = (long) bind_to_memblk((numa_bitmap_t)arg2, (int)arg3);
+		break;
+	case PR_BIND_MEMORY:
+		error = (long) bind_memory((unsigned long)arg2, (size_t)arg3, (numa_bitmap_t)arg4, (int)arg5);
+		break;
+	case PR_SET_LAUNCH_POLICY:
+		error = (long) set_launch_policy((numa_bitmap_t)arg2, (int)arg3, (numa_bitmap_t)arg4, (int)arg5);
 		break;
 	default:
 		error = -EINVAL;

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-18 23:38 latest linus-2.5 BK broken Michael Hohnbaum
  2002-06-18 23:57 ` Ingo Molnar
  2002-06-19  0:08   ` Ingo Molnar
  2002-06-19  1:00   ` Matthew Dobson
@ 2002-06-19 23:48   ` Michael Hohnbaum
  2 siblings, 0 replies; 70+ messages in thread
From: Michael Hohnbaum @ 2002-06-19 23:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel, Matthew Dobson

On Tue, 2002-06-18 at 16:57, Ingo Molnar wrote:
>
> On 18 Jun 2002, Michael Hohnbaum wrote:
>
> > [...] I would suggest an additional argument be added
> > which would indicate the resource that the process is to be
> > affined to.  That way this interface could be used for binding
> > processes to cpus, memory nodes, perhaps NUMA nodes, and,
> > as discussed recently in another thread, other processes.
> > Personally, I see NUMA nodes as an overkill, if a process
> > can be bound to cpus and memory nodes.
>
> are you sure we want one generic, process-based affinity interface?

No, I'm not sure that is what we want.  I see that as a compromise
solution - something that would allow some of the simple binding
capabilities, but not necessarily a full-blown solution.  I agree with
your comments below that memory binding/allocation is much more complex
than CPU binding, so additional flexibility in specifying memory
binding is needed.  However, wanting to start simple, the first step is
to affine a process to memory on one or more nodes.

> i think the affinity to certain memory regions might need to be more
> finegrained than this. Eg. it could be useful to define a per-file
> (per-inode) 'backing store memory node' that the file is affine to. This
> will eg. cause the pagecache to be allocated in the memory node.
> Process-based affinity does not describe this in a natural way. Another
> example, memory maps: we might want to have a certain memory map (vma)
> allocated in a given memory node, independently of where the process that
> is faulting a given pages resides.
>
> and it might certainly make sense to have some sort of 'default memory
> affinity' for a process as well, but this should be a different syscall -

This is close to what is currently implemented - memory is allocated,
by default, on the node that the process is executing on when the
request for memory is made.

Even if a process is affined to multiple CPUs that span node
boundaries, it is best for performance to dispatch the process on only
one node (provided the cpu cycles are available).  The NUMA extensions
to the scheduler try to do this.  Similarly, all memory for a process
should be allocated from that one node.  If memory is exhausted on that
node, any other nodes that the process has affinity to cpus on should
then be used.  In other words, each process should have a home node
that is preferred for dispatch and memory allocation.  The process may
have affinity to other nodes, which would be used only if the home node
had a significant resource shortage.

> it really does a much different thing than CPU affinity. The CPU resource
> is 'used' only temporarily with little footprint, while memory usage is
> often for a very long timespan, and the affinity strategies differ
> greatly. Also, memory as a resource is much more complex than CPU, eg. it
> must handle things like over-allocation, fallback to 'nearby' nodes if a
> node is full, etc.
>
> so i'd suggest to actually create a good memory-affinity syscall interface
> instead of trying to generalize it into the simple, robust, finite
> CPU-affinity syscalls.

We have attempted to do that.  Please look at the API definition:
http://lse.sourceforge.net/numa/numa_api.html

If it would help, we could break out just the memory portion of this
API (both in the specification and the implementation) and submit those
for comment.  What do you think?

>
> 	Ingo
>

Michael Hohnbaum
hohnbaum@us.ibm.com

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-24 21:28 Paul McKenney
  0 siblings, 0 replies; 70+ messages in thread
From: Paul McKenney @ 2002-06-24 21:28 UTC (permalink / raw)
To: Larry McVoy; +Cc: linux-kernel

Hello, Larry,

Our SMP cluster discussion was quite a bit of fun, very challenging!
I still stand by my assessment:

> The Score.
>
> Paul agreed that SMP Clusters could be implemented.  He was not
> sure that it could achieve good performance, but could not prove
> otherwise.  Although he suspected that the complexity might be
> less than the proprietary highly parallel Unixes, he was not
> convinced that it would be less than Linux would be, given the
> Linux community's emphasis on simplicity in addition to performance.

See you at Ottawa!

					Thanx, Paul

> Larry McVoy <lm@bitmover.com>
> Sent by: linux-kernel-owner@vger.kernel.org
> 06/19/2002 10:24 PM
>
> > I totally agree, mostly I was playing devils advocate.  The model
> > actually in my head is when you have multiple kernels but they talk
> > well enough that the applications have to care in areas where it
> > doesn't make a performance difference (There's got to be one of those).
> ....
> > The compute cluster problem is an interesting one.  The big items
> > I see on the todo list are:
> >
> >	- Scalable fast distributed file system (Lustre looks like a
> >	  possibility)
> >	- Sub application level checkpointing.
> >
> > Services like schedulers already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful, once it gets the ability to suspend jobs.
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel like nfsd.  Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead.  DEC did it and Sun has been
> passing around these slides for a few weeks, so maybe they'll do it too.
> Then Linux can join the party after it has become a fine grained,
> locked to hell and back, soft "realtime", numa enabled, bloated piece
> of crap like all the other kernels and we'll get to go through the
> "let's reinvent Unix for the 3rd time in 40 years" all over again.
> What fun.  Not.
>
> Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy
> to talk it over with anyone who wants to think about it.  Paul McKenney
> from IBM came down to San Francisco to talk to me about it, put me
> through an 8 or 9 hour session which felt like a PhD exam, and
> after trying to poke holes in it grudgingly let on that maybe it was
> a good idea.  He was kind enough to write up what he took away
> from it; here it is.
>
> --lm
>
> From: "Paul McKenney" <Paul.McKenney@us.ibm.com>
> To: lm@bitmover.com, tytso@mit.edu
> Subject: Greatly enjoyed our discussion yesterday!
> Date: Fri, 9 Nov 2001 18:48:56 -0800
>
> Hello!
>
> I greatly enjoyed our discussion yesterday!  Here are the pieces of it that
> I recall; I know that you will not be shy about correcting any errors and
> omissions.
>
> 					Thanx, Paul
>
> 		Larry McVoy's SMP Clusters
> 		Discussion on November 8, 2001
> 	Larry McVoy, Ted Ts'o, and Paul McKenney
>
> What is SMP Clusters?
>
> 	SMP Clusters is a method of partitioning an SMP (symmetric
> 	multiprocessing) machine's CPUs, memory, and I/O devices
> 	so that multiple "OSlets" run on this machine.  Each OSlet
> 	owns and controls its partition.  A given partition is
> 	expected to contain from 4-8 CPUs, its share of memory,
> 	and its share of I/O devices.  A machine large enough to
> 	have SMP Clusters profitably applied is expected to have
> 	enough of the standard I/O adapters (e.g., ethernet,
> 	SCSI, FC, etc.) so that each OSlet would have at least
> 	one of each.
>
> 	Each OSlet has the same data structures that an isolated
> 	OS would have for the same amount of resources.  Unless
> 	interactions with the OSlets are required, an OSlet runs
> 	very nearly the same code over very nearly the same data
> 	as would a standalone OS.
>
> 	Although each OSlet is in most ways its own machine, the
> 	full set of OSlets appears as one OS to any user programs
> 	running on any of the OSlets.  In particular, processes on
> 	one OSlet can share memory with processes on other OSlets,
> 	can send signals to processes on other OSlets, communicate
> 	via pipes and Unix-domain sockets with processes on other
> 	OSlets, and so on.  Performance of operations spanning
> 	multiple OSlets may be somewhat slower than operations local
> 	to a single OSlet, but the difference will not be noticeable
> 	except to users who are engaged in careful performance
> 	analysis.
>
> 	The goals of the SMP Cluster approach are:
>
> 	1. Allow the core kernel code to use simple locking designs.
> 	2. Present applications with a single-system view.
> 	3. Maintain good (linear!) scalability.
> 	4. Not degrade the performance of a single CPU beyond that
> 	   of a standalone OS running on the same resources.
> 	5. Minimize modification of core kernel code.  Modified or
> 	   rewritten device drivers, filesystems, and
> 	   architecture-specific code is permitted, perhaps even
> 	   encouraged.  ;-)
>
> OS Boot
>
> 	Early-boot code/firmware must partition the machine, and prepare
> 	tables for each OSlet that describe the resources that each
> 	OSlet owns.  Each OSlet must be made aware of the existence of
> 	all the other OSlets, and will need some facility to allow
> 	efficient determination of which OSlet a given resource belongs
> 	to (for example, to determine which OSlet a given page is owned
> 	by).
>
> 	At some point in the boot sequence, each OSlet creates a "proxy
> 	task" for each of the other OSlets that provides shared services
> 	to them.
>
> 	Issues:
>
> 	1. Some systems may require device probing to be done
> 	   by a central program, possibly before the OSlets are
> 	   spawned.  Systems that react in an unfriendly manner
> 	   to failed probes might be in this class.
>
> 	2. Interrupts must be set up very carefully.  On some
> 	   systems, the interrupt system may constrain the ways
> 	   in which the system is partitioned.
>
> Shared Operations
>
> 	This section describes some possible implementations and issues
> 	with a number of the shared operations.
>
> 	Shared operations include:
>
> 	1. Page fault on memory owned by some other OSlet.
> 	2. Manipulation of processes running on some other OSlet.
> 	3. Access to devices owned by some other OSlet.
> 	4. Reception of network packets intended for some other OSlet.
> 	5. SysV msgq and sema operations on msgq and sema objects
> 	   accessed by processes running on multiple of the OSlets.
> 	6. Access to filesystems owned by some other OSlet.  The
> 	   /tmp directory gets special mention.
> 	7. Pipes connecting processes in different OSlets.
> 	8. Creation of processes that are to run on a different
> 	   OSlet than their parent.
> 	9. Processing of exit()/wait() pairs involving processes
> 	   running on different OSlets.
>
> 	Page Fault
>
> 		As noted earlier, each OSlet maintains a proxy process
> 		for each other OSlet (so that for an SMP Cluster made
> 		up of N OSlets, there are N*(N-1) proxy processes).
>
> 		When a process in OSlet A wishes to map a file
> 		belonging to OSlet B, it makes a request to B's proxy
> 		process corresponding to OSlet A.  The proxy process
> 		maps the desired file and takes a page fault at the
> 		desired address (translated as needed, since the file
> 		will usually not be mapped to the same location in the
> 		proxy and client processes), forcing the page into
> 		OSlet B's memory.  The proxy process then passes the
> 		corresponding physical address back to the client
> 		process, which maps it.
>
> 		Issues:
>
> 		o How to coordinate pageout?  Two approaches:
>
> 		  1. Use mlock in the proxy process so that
> 		     only the client process can do the pageout.
>
> 		  2. Make the two OSlets coordinate their
> 		     pageouts.  This is more complex, but will
> 		     be required in some form or another to
> 		     prevent OSlets from "ganging up" on one
> 		     of their number, exhausting its memory.
>
> 		o When OSlet A ejects the memory from its working
> 		  set, where does it put it?
>
> 		  1. Throw it away, and go to the proxy process
> 		     as needed to get it back.
>
> 		  2. Augment core VM as needed to track the
> 		     "guest" memory.  This may be needed for
> 		     performance, but...
>
> 		o Some code is required in the pagein() path to
> 		  figure out that the proxy must be used.
>
> 		  1. Larry stated that he is willing to be
> 		     punched in the nose to get this code in.  ;-)
> 		     The amount of this code is minimized by
> 		     creating SMP-clusters-specific filesystems,
> 		     which have their own functions for mapping
> 		     and releasing pages.  (Does this really
> 		     cover OSlet A's paging out of this memory?)
>
> 		o How are pagein()s going to be even halfway fast
> 		  if IPC to the proxy is involved?
>
> 		  1. Just do it.  Page faults should not be
> 		     all that frequent with today's memory
> 		     sizes.  (But then why do we care so
> 		     much about page-fault performance???)
>
> 		  2. Use "doors" (from Sun), which are very
> 		     similar to protected procedure call
> 		     (from K42/Tornado/Hurricane).  The idea
> 		     is that the CPU in OSlet A that is handling
> 		     the page fault temporarily -becomes- a
> 		     member of OSlet B by using OSlet B's page
> 		     tables for the duration.  This results in
> 		     some interesting issues:
>
> 		     a. What happens if a process wants to
> 			block while "doored"?  Does it
> 			switch back to being an OSlet A
> 			process?
>
> 		     b. What happens if a process takes an
> 			interrupt (which corresponds to
> 			OSlet A) while doored (thus using
> 			OSlet B's page tables)?
>
> 			i.  Prevent this by disabling
> 			    interrupts while doored.
> 			    This could pose problems
> 			    with relatively long VM
> 			    code paths.
>
> 			ii. Switch back to OSlet A's
> 			    page tables upon interrupt,
> 			    and switch back to OSlet B's
> 			    page tables upon return
> 			    from interrupt.  On machines
> 			    not supporting ASID, take a
> 			    TLB-flush hit in both
> 			    directions.  Also likely
> 			    requires common text (at
> 			    least for low-level interrupts)
> 			    for all OSlets, making it more
> 			    difficult to support OSlets
> 			    running different versions of
> 			    the OS.
>
> 			    Furthermore, the last time
> 			    that Paul suggested adding
> 			    instructions to the interrupt
> 			    path, several people politely
> 			    informed him that this would
> 			    require a nose punching.  ;-)
>
> 		     c. If a bunch of OSlets simultaneously
> 			decide to invoke their proxies on
> 			a particular OSlet, that OSlet gets
> 			lock contention corresponding to
> 			the number of CPUs on the system
> 			rather than to the number in a
> 			single OSlet.  Some approaches to
> 			handle this:
>
> 			i.  Stripe -everything-, rely
> 			    on entropy to save you.
> 			    May still have problems with
> 			    hotspots (e.g., which of the
> 			    OSlets has the root of the
> 			    root filesystem?).
>
> 			ii. Use some sort of queued lock
> 			    to limit the number of CPUs that
> 			    can be running proxy processes
> 			    in a given OSlet.  This does
> 			    not really help scaling, but
> 			    would make the contention
> 			    less destructive to the
> 			    victim OSlet.
>
> 		o How to balance memory usage across the OSlets?
>
> 		  1. Don't bother, let paging deal with it.
> 		     Paul's previous experience with this
> 		     philosophy was not encouraging.  (You
> 		     can end up with one OSlet thrashing
> 		     due to the memory load placed on it by
> 		     other OSlets, which don't see any
> 		     memory pressure.)
>
> 		  2. Use some global memory-pressure scheme
> 		     to even things out.  Seems possible;
> 		     Paul is concerned about the complexity
> 		     of this approach.  If this approach is
> 		     taken, make sure someone with some
> 		     control-theory experience is involved.
>
> 	Manipulation of Processes Running on Some Other OSlet
>
> 		The general idea here is to implement something similar
> 		to a vproc layer.  This is common code, and thus requires
> 		someone to sacrifice their nose.  There was some discussion
> 		of other things that this would be useful for, but I have
> 		lost them.
>
> 		Manipulations discussed included signals and job control.
>
> 		Issues:
>
> 		o Should process information be replicated across
> 		  the OSlets for performance reasons?  If so, how
> 		  much, and how to synchronize?
>
> 		  1. No, just use doors.  See above discussion.
>
> 		  2. Yes.  No discussion of synchronization
> 		     methods.  (Hey, we had to leave -something-
> 		     for later!)
>
> 	Access to Devices Owned by Some Other OSlet
>
> 		Larry mentioned a /rdev, but if we discussed any details
> 		of this, I have lost them.  Presumably, one would use some
> 		sort of IPC or doors to make this work.
>
> 	Reception of Network Packets Intended for Some Other OSlet
>
> 		An OSlet receives a packet, and realizes that it is
> 		destined for a process running in some other OSlet.
> 		How is this handled without rewriting most of the
> 		networking stack?
>
> 		The general approach was to add a NAT-like layer that
> 		inspected the packet and determined which OSlet it was
> 		destined for.  The packet was then forwarded to the
> 		correct OSlet, and subjected to full IP-stack processing.
>
> 		Issues:
>
> 		o If the address map in the kernel is not to be
> 		  manipulated on each packet reception, there
> 		  needs to be a circular buffer in each OSlet for
> 		  each of the other OSlets (again, N*(N-1) buffers).
> 		  In order to prevent the buffer from needing to
> 		  be exceedingly large, packets must be bcopy()ed
> 		  into this buffer by the OSlet that received
> 		  the packet, and then bcopy()ed out by the OSlet
> 		  containing the target process.  This could add
> 		  a fair amount of overhead.
>
> 		  1. Just accept the overhead.  Rely on this
> 		     being an uncommon case (see the next issue).
>
> 		  2. Come up with some other approach, possibly
> 		     involving the user address space of the
> 		     proxy process.  We could not articulate
> 		     such an approach, but it was late and we
> 		     were tired.
>
> 		o If there are two processes that share the FD
> 		  on which the packet could be received, and these
> 		  two processes are in two different OSlets, and
> 		  neither is in the OSlet that received the packet,
> 		  what the heck do you do???
>
> 		  1. Prevent this from happening by refusing
> 		     to allow processes holding a TCP connection
> 		     open to move to another OSlet.  This could
> 		     result in load-balance problems in some
> 		     workloads, though neither Paul nor Ted were
> 		     able to come up with a good example on the
> 		     spot (seeing as BAAN has not been doing really
> 		     well of late).
>
> 		     To indulge in l'esprit d'escalier...  How
> 		     about a timesharing system that users
> 		     access from the network?  A single user
> 		     would have to log on twice to run a job
> 		     that consumed more than one OSlet if each
> 		     process in the job might legitimately need
> 		     access to stdin.
>
> 		  2. Do all protocol processing on the OSlet
> 		     on which the packet was received, and
> 		     straighten things out when delivering
> 		     the packet data to the receiving process.
> 		     This likely requires changes to common
> 		     code, hence someone to volunteer their nose.
>
> 	SysV msgq and sema Operations
>
> 		We didn't discuss these.  None of us seem to be SysV fans,
> 		but these must be made to work regardless.
>
> 		Larry says that shm should be implemented in terms of mmap(),
> 		so that this case reduces to the page-mapping discussed above.
> 		Of course, one would need a filesystem large enough to handle
> 		the largest possible shmget.  Paul supposes that one could
> 		dynamically create a memory filesystem to avoid problems here,
> 		but is in no way volunteering his nose to this cause.
>
> 	Access to Filesystems Owned by Some Other OSlet
>
> 		For the most part, this reduces to the mmap case.  However,
> 		partitioning popular filesystems over the OSlets could be
> 		very helpful.  Larry mentioned that this had been prototyped.
> 		Paul cannot remember if Larry promised to send papers or
> 		other documentation, but duly requests them after the fact.
>
> 		Larry suggests having a local /tmp, so that /tmp is in effect
> 		private to each OSlet.  There would be a /gtmp that would
> 		be a globally visible /tmp equivalent.  We went round and
> 		round on software compatibility, Paul suggesting a hashed
> 		filesystem as an alternative.  Larry eventually pointed out
> 		that one could just issue different mount commands to get
> 		a global filesystem in /tmp, and create a per-OSlet /ltmp.
> 		This would allow people to determine their own level of
> 		risk/performance.
>
> 	Pipes Connecting Processes in Different OSlets
>
> 		This was mentioned, but I have forgotten the details.
> 		My vague recollections lead me to believe that some
> 		nose-punching was required, but I must defer to Larry
> 		and Ted.
>
> 		Ditto for Unix-domain sockets.
>
> 	Creation of Processes on a Different OSlet Than Their Parent
>
> 		There would be an inherited attribute that would prevent
> 		fork() or exec() from creating its child on a different
> 		OSlet.  This attribute would be set by default to prevent
> 		too many surprises.  Things like make(1) would clear
> 		this attribute to allow amazingly fast kernel builds.
>
> 		There would also be a system call that would cause the
> 		child to be placed on a specified OSlet (Paul suggested
> 		use of HP's "launch policy" concept to avoid adding yet
> 		another dimension to the exec() combinatorial explosion).
>
> 		The discussion of packet reception led Larry to suggest
> 		that cross-OSlet process creation would be prohibited if
> 		the parent and child shared a socket.  See above for the
> 		load-balancing concern and corresponding l'esprit d'escalier.
>
> 	Processing of exit()/wait() Pairs Crossing OSlet Boundaries
>
> 		We didn't discuss this.  My guess is that vproc deals
> 		with it.  Some care is required when optimizing for this.
> 		If one hands off to a remote parent that dies before
> 		doing a wait(), one would not want one of the init
> 		processes getting a nasty surprise.
>
> 		(Yes, there are separate init processes for each OSlet.
> 		We did not talk about implications of this, which might
> 		occur if one were to need to send a signal intended to
> 		be received by all the replicated processes.)
>
> Other Desiderata:
>
> 	1. Ability of surviving OSlets to continue running after one of their
> 	   number fails.
>
> 	   Paul was quite skeptical of this.  Larry suggested that the
> 	   "door" mechanism could use a dynamic-linking strategy.  Paul
> 	   remained skeptical.  ;-)
>
> 	2. Ability to run different versions of the OS on different OSlets.
>
> 	   Some discussion of this above.
>
> The Score.
>
> 	Paul agreed that SMP Clusters could be implemented.  He was not
> 	sure that it could achieve good performance, but could not prove
> 	otherwise.  Although he suspected that the complexity might be
> 	less than the proprietary highly parallel Unixes, he was not
> 	convinced that it would be less than Linux would be, given the
> 	Linux community's emphasis on simplicity in addition to performance.
>
> --
> ---
> Larry McVoy		lm at bitmover.com	http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-21 12:59 Jesse Pollard
0 siblings, 0 replies; 70+ messages in thread
From: Jesse Pollard @ 2002-06-21 12:59 UTC (permalink / raw)
To: dalecki, Linus Torvalds; +Cc: linux-kernel
Martin Dalecki <dalecki@evision-ventures.com>:
>Yes HT gives 12%. Naive SMP gives 50% and good SMP (aka crossbar bus)
>gives 70% for two CPUs. All those numbers are well below the level
>where more than 2-4 makes hardly any sense... Amdahl still bites you if you
>read it like:
...
I think your numbers are a little low - I've seen between 50% and 80% on
master/slave SMP depending on the job: 50% if both processes are heavily
syscall oriented, 75% (or thereabouts) when both processes are more normally
balanced, and 80% if both processes are more compute bound.
Good SMP with a crossbar switch bus should give close to 95%. Good SMP
alone should give about 75%.
My experience with a good crossbar switch is based on Cray UNICOS/YMP/SV
hardware - a well tuned hardware platform, and a slightly less well tuned
SMP implementation, though the UNICOS 10 rewrite may have fixed the
SMP implementation.
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil
Any opinions expressed are solely my own.
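The scaling percentages traded in this subthread follow directly from Amdahl's law; a quick sketch (the parallel fractions below are inferred for illustration, not taken from the posts):

```c
/* Amdahl's law: with parallelizable fraction p of the work,
 * the speedup on n CPUs is 1 / ((1 - p) + p / n). */
static double amdahl_speedup(double p, int n)
{
	return 1.0 / ((1.0 - p) + p / n);
}
```

For example, the quoted "80% per-CPU efficiency on two CPUs" means a speedup of 1.6, which corresponds to a parallel fraction of p = 0.75.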
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-21 7:31 Martin Knoblauch
0 siblings, 0 replies; 70+ messages in thread
From: Martin Knoblauch @ 2002-06-21 7:31 UTC (permalink / raw)
To: linux-kernel
> If one wants to get a grasp of what the next generation of
> really fast computers will look like. Well: they will be based
> on Josephson junctions. TRW will build them (the same company
> as the Voyager probe). Look there, they don't plan for thousands of
> CPUs, they plan for a few CPUs in liquid helium:
>
> http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
The first thing that caught my eye on page 2 was the 4096 processors. Hmm...
Martin
--
----------------------------------
Martin Knoblauch
knobi@knobisoft.de
http://www.knobisoft.de
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-20 23:48 Miles Lane
0 siblings, 0 replies; 70+ messages in thread
From: Miles Lane @ 2002-06-20 23:48 UTC (permalink / raw)
To: Martin Dalecki; +Cc: LKML
Martin Dalecki wrote:
<snip>
> You don't read economic papers, do you? Or what is it with this
> plummeting server/PC market around us? Or increased notebook sales?
> (Typical market saturation symptom, like the second car for the
> family :-).
>
> I suggest it's precisely the end of the open invention curve out there:
>
> 1. Nowadays the CPUs are indeed good enough for most of the common tasks.
> WindowsXP tries hard to help overcome this :-). But in reality Win2000
> is just fine for office work.
>
> 2. The technology in question is starting to hit real physical barriers because
> it appears more and more that not everything coming out of the labs
> can be implemented at reasonable costs.
Martin, perhaps you haven't seen this article.
This news seems to contradict your assertion that cost is going
to become a big problem as we attempt to continue tracking the
price/performance trajectory of Moore's law.
http://www.nytimes.com/reuters/technology/tech-technology-chip.html
Miles
^ permalink raw reply [flat|nested] 70+ messages in thread[parent not found: <E17KSLb-0007Dj-00@wagner.rustcorp.com.au>]
* Re: latest linus-2.5 BK broken
[not found] <E17KSLb-0007Dj-00@wagner.rustcorp.com.au>
@ 2002-06-19 0:12 ` Linus Torvalds
2002-06-19 15:23 ` Rusty Russell
0 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-19 0:12 UTC (permalink / raw)
To: Rusty Russell; +Cc: Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> - new_mask &= cpu_online_map;
> + /* Eliminate offline cpus from the mask */
> + for (i = 0; i < NR_CPUS; i++)
> + if (!cpu_online(i))
> + new_mask &= ~(1<<i);
> +
And why can't cpu_online_map be a bitmap?
What's your beef against sane and efficient data structures? The above is
just crazy. Just add a

	#define NRCPUWORDS ROUND_UP(NR_CPUS, BITS_PER_LONG)

	typedef struct cpu_mask {
		unsigned long mask[NRCPUWORDS];
	} cpu_mask_t;

and then add a few simple operations like

	cpumask_and(cpu_mask_t *res, cpu_mask_t *a, cpu_mask_t *b);

and friends. See how we handle this issue in <linux/signal.h>, which has
perfectly efficient things to do all the same issues (ie see how
"sigemptyset()" and friends compile to efficient code for the "normal"
cases).
This is not rocket science, and I find it ridiculous that you claim to
worry about scaling up to thousands of CPU's, and then you try to send me
absolute crap like the above which clearly is unacceptable for lots of
CPU's.
No, C doesn't have built-in support for bitmap operations except on a
small scale level (ie single words), and yes, clearly that's why Linux
tends to prefer only small bitmaps, but NO, that does not make bitmaps
evil.
Linus
^ permalink raw reply [flat|nested] 70+ messages in thread
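Linus's sketch above can be fleshed out in portable C. The following is a userspace illustration (the helper names cpumask_set/cpumask_test are mine, not from the mail) showing how a fixed-size array of longs gives a word-at-a-time AND:

```c
#include <string.h>

#define NR_CPUS       1024
#define BITS_PER_LONG (8 * (int)sizeof(long))
#define NRCPUWORDS    ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct cpu_mask {
	unsigned long mask[NRCPUWORDS];
} cpu_mask_t;

/* Set one CPU bit in the mask. */
static void cpumask_set(cpu_mask_t *m, int cpu)
{
	m->mask[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

/* Test one CPU bit in the mask. */
static int cpumask_test(const cpu_mask_t *m, int cpu)
{
	return (int)((m->mask[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL);
}

/* The word-at-a-time AND that makes the dense representation cheap. */
static void cpumask_and(cpu_mask_t *res, const cpu_mask_t *a, const cpu_mask_t *b)
{
	int i;

	for (i = 0; i < NRCPUWORDS; i++)
		res->mask[i] = a->mask[i] & b->mask[i];
}
```

When NR_CPUS fits in a single long, the loop has a constant trip count of one, so the compiler reduces it to a single AND instruction: the sigset_t trick Linus points to.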
* Re: latest linus-2.5 BK broken
2002-06-19 0:12 ` Linus Torvalds
@ 2002-06-19 15:23 ` Rusty Russell
2002-06-19 16:28 ` Linus Torvalds
0 siblings, 1 reply; 70+ messages in thread
From: Rusty Russell @ 2002-06-19 15:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kernel Mailing List
In message <Pine.LNX.4.33.0206181701240.2562-100000@penguin.transmeta.com> you write:
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > - new_mask &= cpu_online_map;
> > + /* Eliminate offline cpus from the mask */
> > + for (i = 0; i < NR_CPUS; i++)
> > + if (!cpu_online(i))
> > + new_mask &= ~(1<<i);
> > +
>
> And why can't cpu_online_map be a bitmap?
>
> What's your beef against sane and efficient data structures? The above is
> just crazy.
Oh, it can be. I wasn't going to require something from all archs for
this one case (well, it was more like zero cases when I first did the
patch).
> and then add a few simple operations like
>
> cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
Sure... or just make all archs supply a "cpus_online_of(mask)" which
does that, unless there are other interesting cases. Or we can go the
other way and have a general "and_region(void *res, void *a, void *b,
int len)". Which one do you want?
> This is not rocket science, and I find it ridiculous that you claim to
> worry about scaling up to thousands of CPU's, and then you try to send me
> absolute crap like the above which clearly is unacceptable for lots of
> CPU's.
Spinning 1000 times doesn't faze me until someone complains. Breaking
userspace code does. One can be fixed if it proves to be a bottleneck.
Understand?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-19 15:23 ` Rusty Russell
@ 2002-06-19 16:28 ` Linus Torvalds
2002-06-19 20:57 ` Rusty Russell
0 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-19 16:28 UTC (permalink / raw)
To: Rusty Russell; +Cc: Kernel Mailing List
On Thu, 20 Jun 2002, Rusty Russell wrote:
> > and then add a few simple operations like
> >
> > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
>
> Sure... or just make all archs supply a "cpus_online_of(mask)" which
> does that, unless there are other interesting cases. Or we can go the
> other way and have a general "and_region(void *res, void *a, void *b,
> int len)". Which one do you want?
There are definitely other "interesting" cases that already do the full
bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/
sigfillset. It's really the exact same code, and the exact same issues.
The problem with a generic "and_region" is that it's a slight amount of
work to make sure that we optimize for the common cases (and since I'm not
a huge believer in hundreds of nodes, I consider the common case to be a
single word). And do things like just automatically get the UP case right:
which we do right now by just virtue of having a constant cpu_online_map,
and letting the compiler just do the (obvious) optimizations.
I'm a _huge_ believer in having generic code that is automatically
optimized away by the compiler into nothingness. (And by contrast, I
absolutely _detest_ #ifdef's in source code that make those optimizations
explicit.) But that sometimes requires some thought, notably making sure
that all constants hang around as constants all the way to the code
generation phase (this tends to mean inline functions and #defines).
It _would_ probably be worthwhile to try to have better support for
"bitmaps" as real kernel data structures, since we actually have this
problem in multiple places. Right now we already use bitmaps for signal
handling (one or two words, constant size), for FD_SET's (variable size),
for various filesystems (variable size, largish), and for a lot of random
drivers (some variable, some constant).
It wasn't that long ago that I added a "bitmap_member()" macro to
<linux/types.h> to declare bitmaps exactly because a lot of people _were_
doing it and getting it wrong. Actually, the most common case was not a
bug, but a latent problem with code that did something like

	unsigned char bitmap[BITMAP_SIZE/8];

which works on x86 as long as the bitmap size is a multiple of 8.
It would probably make sense to make a real <linux/bitmap.h>, move
bitmap_member() there (and rename it to "bitmap_declare()" - it's called
member because all the places I first looked at were structure members),
and add some simple generic routines for handling these things. (We've
obviously had the bit_set/clear/test() stuff forever, but the more
involved stuff should be fairly easy to abstract out too, instead of
having special functions for signal masks.)
> Breaking userspace code does. One can be fixed if it proves to be a
> bottleneck. Understand?
What I don't understand is why you don't accept the fact that these things
can be considered infinitely big. There's nothing fundamentally wrong with
static allocation. People who build thousand-node systems _are_ going to
compile their own distribution. Trust me. They aren't just going to slap
down redhat-7.3 on a 16k-node ASCI Purple. It makes no sense to do that.
They may want to run quake or something standard on it without
recompiling, but especially the maintenance stuff - the stuff which cares
about CPU affinity - is a no-brainer.
So you can easily just accept the fact that at some point the max number
of CPU's can be considered fixed. And that "some point" isn't even very
high, especially since bitmaps _are_ so dense that there is basically no
overhead to just starting out with

	#define MAX_CPU (1024)
	bitmap_declare(cpu_bitmap, MAX_CPU);

and let it be at that. That 1024 is already ridiculously high, in my
opinion - simply because people who are playing with bigger numbers _are_
going to be able to just increase the number and recompile.
Linus
^ permalink raw reply [flat|nested] 70+ messages in thread
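The latent sizing bug Linus mentions (a char array sized `BITS/8`, which truncates whenever the bit count is not a multiple of 8) and the rounding-up fix are easy to see side by side. A sketch, with the macro spelled out as in the mail:

```c
#define BITS_PER_LONG (8 * (int)sizeof(long))

/* Rounds the word count up, so any bit count gets enough storage. */
#define bitmap_declare(name, bits) \
	unsigned long name[((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG]

bitmap_declare(good, 70);   /* enough longs to hold all 70 bits */
unsigned char bad[70 / 8];  /* integer division truncates: only 64 bits */
```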
* Re: latest linus-2.5 BK broken 2002-06-19 16:28 ` Linus Torvalds @ 2002-06-19 20:57 ` Rusty Russell 0 siblings, 0 replies; 70+ messages in thread From: Rusty Russell @ 2002-06-19 20:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List, paulus In message <Pine.LNX.4.44.0206190907520.2053-100000@home.transmeta.com> you wri te: > > > On Thu, 20 Jun 2002, Rusty Russell wrote: > > > and then add a few simple operations like > > > > > > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b); > > > > Sure... or just make all archs supply a "cpus_online_of(mask)" which > > does that, unless there are other interesting cases. Or we can go the > > other way and have a general "and_region(void *res, void *a, void *b, > > int len)". Which one do you want? > > There are definitely other "interesting" cases that already do the full > bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/ > sigfillset. It's really the exact same code, and the exact same issues. > > The problem with a generic "and_region" is that it's a slight amount of > work to make sure that we optimize for the common cases (and since I'm not > a huge believer in hundreds of nodes, I consider the common case to be a > single word). And do things like just automatically get the UP case right: > which we do right now by just virtue of having a constant cpu_online_mask, > and letting the compiler just do the (obvious) optimizations. Sure, completely agreed. Normal tricks here: 1 long turns into equivalent to dst = a & b, the other cases are handled with varying amount of suckiness. Code and optimization tested on 2.95.4 and 3.0.4 (both PPC), kernel compiled on my x86 box back in .au. > It would probably make sense to make a real <linux/bitmap.h>, move the > bitmap_member() there (and rename to "bitmap_declare()" - it's called > member because all the places I first looked at were structure members), > and add some simple generic routines for handling these things. 
I renamed it to DECLARE_BITMAP() to match list, mutex et al. and moved it to linux/bitops.h. PS. Please sort out merging with Paulus's stuff: I'd like to compile on PPC soon since I'm laptop-only for two more weeks 8) Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/bitops.h working-2.5.23-bitops/include/linux/bitops.h --- linux-2.5.23/include/linux/bitops.h Fri Jun 7 13:59:07 2002 +++ working-2.5.23-bitops/include/linux/bitops.h Thu Jun 20 06:55:51 2002 @@ -2,6 +2,27 @@ #define _LINUX_BITOPS_H #include <asm/bitops.h> +#define DECLARE_BITMAP(name,bits) \ + unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG] + +#ifndef HAVE_ARCH_AND_REGION +void __and_region(unsigned long num, unsigned char *dst, + const unsigned char *a, const unsigned char *b); +#endif + +/* For the moment, handle 1 long case fast, leave rest to __and_region. */ +#define and_region(num,dst,a,b) \ +do { \ + if (__alignof__(*(a)) == __alignof__(long) \ + && __alignof__(*(b)) == __alignof__(long) \ + && __builtin_constant_p(num) \ + && (num) == sizeof(long)) { \ + *((unsigned long *)(dst)) = \ + (*(unsigned long *)(a) & *(unsigned long *)(b)); \ + } else \ + __and_region((num), (void*)(dst), (void*)(a), (void*)(b)); \ +} while(0) + /* * ffs: find first bit set. 
This is defined the same way as * the libc and compiler builtin ffs routines, therefore @@ -106,8 +127,5 @@ res = (res & 0x33) + ((res >> 2) & 0x33); return (res & 0x0F) + ((res >> 4) & 0x0F); } - -#include <asm/bitops.h> - #endif diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/types.h working-2.5.23-bitops/include/linux/types.h --- linux-2.5.23/include/linux/types.h Mon Jun 17 23:19:25 2002 +++ working-2.5.23-bitops/include/linux/types.h Thu Jun 20 06:14:39 2002 @@ -3,9 +3,6 @@ #ifdef __KERNEL__ #include <linux/config.h> - -#define bitmap_member(name,bits) \ - unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG] #endif #include <linux/posix_types.h> diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/sound/ac97_codec.h working-2.5.23-bitops/include/sound/ac97_codec.h --- linux-2.5.23/include/sound/ac97_codec.h Mon Jun 17 23:19:25 2002 +++ working-2.5.23-bitops/include/sound/ac97_codec.h Thu Jun 20 06:31:35 2002 @@ -25,6 +25,7 @@ * */ +#include <linux/bitops.h> #include "control.h" #include "info.h" @@ -160,7 +161,7 @@ unsigned int rates_mic_adc; unsigned int spdif_status; unsigned short regs[0x80]; /* register cache */ - bitmap_member(reg_accessed, 0x80); /* bit flags */ + DECLARE_BITMAP(reg_accessed, 0x80); /* bit flags */ union { /* vendor specific code */ struct { unsigned short unchained[3]; // 0 = C34, 1 = C79, 2 = C69 diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/Makefile working-2.5.23-bitops/kernel/Makefile --- linux-2.5.23/kernel/Makefile Mon Jun 10 16:03:56 2002 +++ working-2.5.23-bitops/kernel/Makefile Thu Jun 20 06:27:29 2002 @@ -10,12 +10,12 @@ O_TARGET := kernel.o export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \ - printk.o platform.o suspend.o + printk.o platform.o suspend.o bitops.o 
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \ module.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o \ - signal.o sys.o kmod.o context.o futex.o platform.o + signal.o sys.o kmod.o context.o futex.o platform.o bitops.o obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += ksyms.o diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/bitops.c working-2.5.23-bitops/kernel/bitops.c --- linux-2.5.23/kernel/bitops.c Thu Jan 1 10:00:00 1970 +++ working-2.5.23-bitops/kernel/bitops.c Thu Jun 20 06:52:29 2002 @@ -0,0 +1,32 @@ +#include <linux/config.h> +#include <linux/bitops.h> +#include <linux/module.h> + +#ifndef HAVE_ARCH_AND_REGION +/* Generic is fairly stupid: archs should optimize properly. */ +void __and_region(unsigned long num, unsigned char *dst, + const unsigned char *a, const unsigned char *b) +{ + unsigned long i; + + /* Copy first bytes, until one is long aligned. */ + for (i = 0; i < num && ((unsigned long)a+i) % __alignof__(long); i++) + dst[i] = (a[i] & b[i]); + + /* If they are all aligned, do long-at-a-time copy */ + if (((unsigned long)b+i)%__alignof__(long) == 0 + && ((unsigned long)dst+i)%__alignof__(long) == 0) { + for (; i + sizeof(long) <= num; i += sizeof(long)) { + *(unsigned long *)(dst+i) + = (*(unsigned long *)(a+i) + & *(unsigned long *)(b+i)); + } + } + + /* Do whatever is left. 
*/ + for (; i < num; i++) + dst[i] = (a[i] & b[i]); +} + +EXPORT_SYMBOL(__and_region); +#endif diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_clientmgr.h working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h --- linux-2.5.23/sound/core/seq/seq_clientmgr.h Mon Jun 17 23:19:26 2002 +++ working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h Thu Jun 20 06:34:16 2002 @@ -53,8 +53,8 @@ char name[64]; /* client name */ int number; /* client number */ unsigned int filter; /* filter flags */ - bitmap_member(client_filter, 256); - bitmap_member(event_filter, 256); + DECLARE_BITMAP(client_filter, 256); + DECLARE_BITMAP(event_filter, 256); snd_use_lock_t use_lock; int event_lost; /* ports */ diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_queue.h working-2.5.23-bitops/sound/core/seq/seq_queue.h --- linux-2.5.23/sound/core/seq/seq_queue.h Mon Jun 17 23:19:26 2002 +++ working-2.5.23-bitops/sound/core/seq/seq_queue.h Thu Jun 20 06:34:11 2002 @@ -26,6 +26,7 @@ #include "seq_lock.h" #include <linux/interrupt.h> #include <linux/list.h> +#include <linux/bitops.h> #define SEQ_QUEUE_NO_OWNER (-1) @@ -51,7 +52,7 @@ spinlock_t check_lock; /* clients which uses this queue (bitmap) */ - bitmap_member(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS); + DECLARE_BITMAP(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS); unsigned int clients; /* users of this queue */ struct semaphore timer_mutex; ^ permalink raw reply [flat|nested] 70+ messages in thread
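The generic fallback in the patch above can be exercised outside the kernel. Here is a userspace sketch of the same byte-wise AND (omitting the long-at-a-time fast path for brevity; the helper name is mine):

```c
#include <stddef.h>

/* Byte-wise AND of two buffers into dst, as in the patch's slow path;
 * correctness does not depend on the alignment of any argument. */
static void and_region_bytes(size_t num, unsigned char *dst,
			     const unsigned char *a, const unsigned char *b)
{
	size_t i;

	for (i = 0; i < num; i++)
		dst[i] = a[i] & b[i];
}
```

The kernel version only differs by adding the aligned long-at-a-time loop in the middle, which is a pure optimization: the result is identical.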
* latest linus-2.5 BK broken @ 2002-06-18 17:18 James Simmons 2002-06-18 17:46 ` Robert Love 0 siblings, 1 reply; 70+ messages in thread From: James Simmons @ 2002-06-18 17:18 UTC (permalink / raw) To: Linux Kernel Mailing List gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c sched.c: In function `sys_sched_setaffinity': sched.c:1329: `cpu_online_map' undeclared (first use in this function) sched.c:1329: (Each undeclared identifier is reported only once sched.c:1329: for each function it appears in.) sched.c: In function `sys_sched_getaffinity': sched.c:1389: `cpu_online_map' undeclared (first use in this function) make[1]: *** [sched.o] Error 1 . --- |o_o | |:_/ | Give Micro$oft the Bird!!!! // \ \ Use Linux!!!! (| | ) /'\_ _/`\ \___)=(___/ ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 17:18 James Simmons @ 2002-06-18 17:46 ` Robert Love 2002-06-18 18:51 ` Rusty Russell 0 siblings, 1 reply; 70+ messages in thread From: Robert Love @ 2002-06-18 17:46 UTC (permalink / raw) To: James Simmons; +Cc: Linux Kernel Mailing List, rusty On Tue, 2002-06-18 at 10:18, James Simmons wrote: > gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c > sched.c: In function `sys_sched_setaffinity': > sched.c:1329: `cpu_online_map' undeclared (first use in this function) > sched.c:1329: (Each undeclared identifier is reported only once > sched.c:1329: for each function it appears in.) > sched.c: In function `sys_sched_getaffinity': > sched.c:1389: `cpu_online_map' undeclared (first use in this function) > make[1]: *** [sched.o] Error 1 Rusty, I assume this is a side-effect of the hotplug merge? Can you fix this or tell me what is the new equivalent of cpu_online_map? Robert Love ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 17:46 ` Robert Love @ 2002-06-18 18:51 ` Rusty Russell 2002-06-18 18:43 ` Zwane Mwaikambo ` (2 more replies) 0 siblings, 3 replies; 70+ messages in thread From: Rusty Russell @ 2002-06-18 18:51 UTC (permalink / raw) To: Robert Love; +Cc: Linux Kernel Mailing List, torvalds In message <1024422409.1476.208.camel@sinai> you write: > On Tue, 2002-06-18 at 10:18, James Simmons wrote: > > > gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wst rict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -f no-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 - nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sche d -c -o sched.o sched.c > > sched.c: In function `sys_sched_setaffinity': > > sched.c:1329: `cpu_online_map' undeclared (first use in this function) > > sched.c:1329: (Each undeclared identifier is reported only once > > sched.c:1329: for each function it appears in.) > > sched.c: In function `sys_sched_getaffinity': > > sched.c:1389: `cpu_online_map' undeclared (first use in this function) > > make[1]: *** [sched.o] Error 1 > > Rusty, I assume this is a side-effect of the hotplug merge? Yes, sorry. > Can you fix this or tell me what is the new equivalent of > cpu_online_map? Well, I'm heading away from assumptions on the arch representations of online CPUs (which the NUMA guys need anyway). You could do a loop here, but the real problem is the broken userspace interface. Can you fix this so it takes a single CPU number please? ie. /* -1 = remove affinity */ sys_sched_setaffinity(pid_t pid, int cpu); This will work everywhere, and doesn't require userspace to know the size of the cpu bitmask etc. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 18:51 ` Rusty Russell
@ 2002-06-18 18:43 ` Zwane Mwaikambo
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 19:29 ` Benjamin LaHaise
2 siblings, 0 replies; 70+ messages in thread
From: Zwane Mwaikambo @ 2002-06-18 18:43 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List, torvalds
Hi Rusty,
On Wed, 19 Jun 2002, Rusty Russell wrote:
> > Can you fix this or tell me what is the new equivalent of
> > cpu_online_map?
>
> Well, I'm heading away from assumptions on the arch representations of
> online CPUs (which the NUMA guys need anyway).
Will there also be some sort of facility to determine which node a cpu
belongs to? This would be quite handy in other areas.
Cheers,
Zwane Mwaikambo
-- http://function.linuxpower.ca
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 18:51 ` Rusty Russell 2002-06-18 18:43 ` Zwane Mwaikambo @ 2002-06-18 18:56 ` Linus Torvalds 2002-06-18 18:59 ` Robert Love 2002-06-18 20:05 ` Rusty Russell 2002-06-18 19:29 ` Benjamin LaHaise 2 siblings, 2 replies; 70+ messages in thread From: Linus Torvalds @ 2002-06-18 18:56 UTC (permalink / raw) To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List On Wed, 19 Jun 2002, Rusty Russell wrote: > > You could do a loop here, but the real problem is the broken userspace > interface. Can you fix this so it takes a single CPU number please? NO. Rusty, people want to do "node-affine" stuff, which absolutely requires you to be able to give CPU "collections". Single CPU's need not apply. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 18:56 ` Linus Torvalds @ 2002-06-18 18:59 ` Robert Love 2002-06-18 20:05 ` Rusty Russell 1 sibling, 0 replies; 70+ messages in thread From: Robert Love @ 2002-06-18 18:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rusty Russell, Linux Kernel Mailing List On Tue, 2002-06-18 at 11:56, Linus Torvalds wrote: > NO. > > Rusty, people want to do "node-affine" stuff, which absolutely requires > you to be able to give CPU "collections". Single CPU's need not apply. I would also hate to have to make 32 system calls to get the affinity mask I want. If anything, I think the interface is not collective _enough_ - further abstractions like psets seem to be in favor, not dropping down to a one-CPU-and-task per-call thing. Not that I am complaining, I am happy with the interface... Robert Love ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 18:59 ` Robert Love
@ 2002-06-18 20:05 ` Rusty Russell
2002-06-18 20:05 ` Linus Torvalds
1 sibling, 1 reply; 70+ messages in thread
From: Rusty Russell @ 2002-06-18 20:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Robert Love, Linux Kernel Mailing List
In message <Pine.LNX.4.44.0206181155280.4552-100000@home.transmeta.com> you write:
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > You could do a loop here, but the real problem is the broken userspace
> > interface. Can you fix this so it takes a single CPU number please?
>
> NO.
>
> Rusty, people want to do "node-affine" stuff, which absolutely requires
> you to be able to give CPU "collections". Single CPU's need not apply.
NO. They want to be node-affine. They don't want to specify what
CPUs they attach to.
Understand?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 20:05 ` Rusty Russell @ 2002-06-18 20:05 ` Linus Torvalds 2002-06-18 20:31 ` Rusty Russell 0 siblings, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2002-06-18 20:05 UTC (permalink / raw) To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List On Wed, 19 Jun 2002, Rusty Russell wrote: > > NO. They want to be node-affine. They don't want to specify what > CPUs they attach to. So you're going to have separate interfaces for that? Gag me with a volvo, but that's idiotic. Besides, even that would be broken. You want bitmaps, because bitmaps is really what it is all about. It's NOT about "I must run on this CPU", it can equally well be "I mustn't run on those two CPU's that are hosting the RT part of this thing" or something like that. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 20:05 ` Linus Torvalds
@ 2002-06-18 20:31 ` Rusty Russell
2002-06-18 20:41 ` Linus Torvalds
2002-06-18 20:55 ` Robert Love
0 siblings, 2 replies; 70+ messages in thread
From: Rusty Russell @ 2002-06-18 20:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Robert Love, Linux Kernel Mailing List
In message <Pine.LNX.4.44.0206181302300.872-100000@home.transmeta.com> you write:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > NO. They want to be node-affine. They don't want to specify what
> > CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a volvo,
> but that's idiotic.
No, you have accepted a non-portable userspace interface and put it in
generic code. THAT is idiotic. So any program that doesn't use the
following is broken:

#include <limits.h>
#define BITS_PER_LONG (sizeof(long)*CHAR_BIT)

int set_cpu(int cpu)
{
	size_t size = sizeof(unsigned long);
	unsigned long *bitmask = NULL;
	int ret;

	do {
		size *= 2;
		bitmask = realloc(bitmask, size);
		memset(bitmask, 0, size);
		bitmask[cpu / BITS_PER_LONG] = (1UL << (cpu % BITS_PER_LONG));
		ret = sched_setaffinity(getpid(), size, bitmask);
	} while (ret < 0 && errno == EINVAL);
	return ret;
}

> Besides, even that would be broken. You want bitmaps, because bitmaps is
> really what it is all about. It's NOT about "I must run on this CPU", it
> can equally well be "I mustn't run on those two CPU's that are hosting the
> RT part of this thing" or something like that.
Just bind to a cpu != those two CPUs. I could come up with contrived
examples too, but I'm trying to save userspace programmers and those
who have to port to new architectures. If you don't know how to do it
well, do it simply.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 20:31 ` Rusty Russell @ 2002-06-18 20:41 ` Linus Torvalds 2002-06-18 21:12 ` Benjamin LaHaise 2002-06-18 20:55 ` Robert Love 1 sibling, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2002-06-18 20:41 UTC (permalink / raw) To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List On Wed, 19 Jun 2002, Rusty Russell wrote: > > So any program that doesn't use the following is broken: That wasn't so hard, was it? Besides, we've had this interface for about 15 years, and it's called "select()". It scales fine to thousands of descriptors, and we're talking about something that is a hell of a lot less timing-critical than select ever was. "Earth to Rusty, come in Rusty.." How do we handle the bitmaps in select()? Right. We assume some size that is plenty good enough. Come back to me when something simple like #define MAX_CPUNR 1024 unsigned long cpumask[MAX_CPUNR / BITS_PER_LONG]; doesn't work. The existing interface is _fine_, and when somebody actually has a machine with more than 1024 CPU's (yeah, right, I'm really worried), the existing interface will cause graceful errors instead of doing something unexpected. And if you're telling me that people who care about CPU affinity cannot fathom a simple bitmap of longs, you're just out to lunch. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
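The "graceful errors" Linus refers to come down to a simple length check on the kernel side: a user buffer smaller than the kernel's fixed-size mask fails with EINVAL instead of being silently truncated. A sketch (the constant and helper name are illustrative, not the eventual kernel code):

```c
#include <errno.h>

#define MAX_CPUNR 1024

/* Reject user masks too small to cover every possible CPU. */
static int check_affinity_len(unsigned int user_len)
{
	if (user_len < MAX_CPUNR / 8)	/* kernel mask size in bytes */
		return -EINVAL;
	return 0;
}
```

A userspace program built for a larger mask than the kernel's simply passes a bigger buffer and still works; one built for a smaller mask gets a clean error it can react to.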
* Re: latest linus-2.5 BK broken
2002-06-18 20:41 ` Linus Torvalds
@ 2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:45 ` Bill Huey
0 siblings, 2 replies; 70+ messages in thread
From: Benjamin LaHaise @ 2002-06-18 21:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List
On Tue, Jun 18, 2002 at 01:41:12PM -0700, Linus Torvalds wrote:
> That wasn't so hard, was it?
>
> Besides, we've had this interface for about 15 years, and it's called
> "select()". It scales fine to thousands of descriptors, and we're talking
> about something that is a hell of a lot less timing-critical than select
> ever was.
I take issue with the statement that select scales fine to thousands of
file descriptors. It doesn't. For fairly trivial workloads it degrades
to 0 operations per second with more than a few dozen file descriptors in
the array but only one descriptor active. To sustain decent throughput,
select needs something like 50% of the file descriptors in an array to be
active at every select() call, which makes it unsuitable for things like
LDAP servers, or HTTP/FTP where the clients are behind slow connections
or interactive (like in the real world). I've benchmarked it -- we should
really include something like /dev/epoll in the kernel to improve this
case.
Still, I think the bitmap approach in this case is useful, as having
affinity to multiple CPUs can be needed, and it is not a frequently
occurring operation (unlike select()).
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 21:12 ` Benjamin LaHaise @ 2002-06-18 21:08 ` Cort Dougan 2002-06-18 21:47 ` Linus Torvalds 2002-06-19 10:21 ` Padraig Brady 2002-06-18 21:45 ` Bill Huey 1 sibling, 2 replies; 70+ messages in thread From: Cort Dougan @ 2002-06-18 21:08 UTC (permalink / raw) To: Benjamin LaHaise Cc: Linus Torvalds, Rusty Russell, Robert Love, Linux Kernel Mailing List I agree with you there. It's not easy, and I'd claim it's not possible given that no-one has done it yet, to have a select() call that is speedy for both 0-10 and 1k file descriptors. } I take issue with the statement that select scales fine to thousands of } file descriptors. It doesn't. For fairly trivial workloads it degrades } to 0 operations per second with more than a few dozen filedescriptors in } the array, but only one descriptor being active. To sustain decent } throughput, select needs something like 50% of the filedescriptors in an } array to be active at every select() call, which makes in unsuitable for } things like LDAP servers, or HTTP/FTP where the clients are behind slow } connections or interactive (like in the real world). I've benchmarked } it -- we should really include something like /dev/epoll in the kernel } to improve this case. } } Still, I think the bitmap approach in this case is useful, as having } affinity to multiple CPUs can be needed, and it is not a frequently } occuring operation (unlike select()). } } -ben } -- } "You will be reincarnated as a toad; and you will be much happier." } - } To unsubscribe from this list: send the line "unsubscribe linux-kernel" in } the body of a message to majordomo@vger.kernel.org } More majordomo info at http://vger.kernel.org/majordomo-info.html } Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-18 21:08 ` Cort Dougan
@ 2002-06-18 21:47   ` Linus Torvalds
  2002-06-19 12:29     ` Eric W. Biederman
  2002-06-19 10:21   ` Padraig Brady
  1 sibling, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-18 21:47 UTC (permalink / raw)
To: Cort Dougan
Cc: Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Tue, 18 Jun 2002, Cort Dougan wrote:
>
> I agree with you there. It's not easy, and I'd claim it's not possible
> given that no-one has done it yet, to have a select() call that is speedy
> for both 0-10 and 1k file descriptors.

Actually, select() scales a lot better than poll() for _dense_ bitmaps.
The problem with non-scalability ends up being either sparse bitmaps
(minor problem, poll() can help) or just the work involved in watching a
large number of fd's (major problem, but totally unrelated to the bitmap
itself, and poll() usually makes it worse thanks to more data to be
moved).

Anyway, I was talking about the scalability of the _data_structure_, not
the scalability performance-wise. Performance scalability is a non-issue
for something like setaffinity(), since it's just not called at any rate
approaching poll.

From a data structure standpoint, bitmaps are clearly the simplest dense
representation, and scale perfectly well to any reasonable number of
CPU's.

If we end up using a default of 1024, maybe you'll have to recompile that
part of the system that has anything to do with CPU affinity in about
10-20 years by just upping the number a bit. Quite frankly, that's going
to be the _least_ of the issues.

		Linus

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-18 21:47 ` Linus Torvalds
@ 2002-06-19 12:29   ` Eric W. Biederman
  2002-06-19 17:27     ` Linus Torvalds
  0 siblings, 1 reply; 70+ messages in thread
From: Eric W. Biederman @ 2002-06-19 12:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Linus Torvalds <torvalds@transmeta.com> writes:

> If we end up using a default of 1024, maybe you'll have to recompile that
> part of the system that has anything to do with CPU affinity in about
> 10-20 years by just upping the number a bit. Quite frankly, that's going
> to be the _least_ of the issues.

:) 10-20 years, or until someone finds a good way to implement a single
system image on Linux clusters. They are already in the 1000s-of-nodes,
dual-processors-per-node category. And as things continue they might
even grow bigger.

Eric

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-19 12:29 ` Eric W. Biederman
@ 2002-06-19 17:27   ` Linus Torvalds
  2002-06-20  3:57     ` Eric W. Biederman
  0 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-19 17:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On 19 Jun 2002, Eric W. Biederman wrote:
>
> 10-20 years or someone finds a good way to implement a single system
> image on linux clusters. They are already into the 1000s of nodes,
> and dual processors per node category. And as things continue they
> might even grow bigger.

Oh, clusters are a separate issue. I'm absolutely 100% convinced that
you don't want to have a "single kernel" for a cluster; you want to run
independent kernels with good communication infrastructure between them
(ie a global filesystem, and try to make the networking look uniform).

Trying to have a single kernel for thousands of nodes is just crazy.
Even if the system were ccNUMA and _could_ do it in theory.

The NUMA work can probably take single-kernel to maybe 64+ nodes, before
people just start turning stark raving mad. There's no way you'll have a
single kernel for thousands of CPU's and still stay sane and claim any
reasonable performance under generic loads.

So don't confuse the issue with clusters like that. The "set_affinity()"
call simply doesn't have anything to do with them. If you want to move
processes between nodes on such a cluster, you'll probably need
user-level help; the kernel is unlikely to do it for you.

		Linus

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-19 17:27 ` Linus Torvalds
@ 2002-06-20  3:57   ` Eric W. Biederman
  2002-06-20  5:24     ` Larry McVoy
  2002-06-20 16:30     ` Cort Dougan
  0 siblings, 2 replies; 70+ messages in thread
From: Eric W. Biederman @ 2002-06-20 3:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Linus Torvalds <torvalds@transmeta.com> writes:

> On 19 Jun 2002, Eric W. Biederman wrote:
> >
> > 10-20 years or someone finds a good way to implement a single system
> > image on linux clusters. They are already into the 1000s of nodes,
> > and dual processors per node category. And as things continue they
> > might even grow bigger.
>
> Oh, clusters are a separate issue. I'm absolutely 100% convinced that you
> don't want to have a "single kernel" for a cluster, you want to run
> independent kernels with good communication infrastructure between them
> (ie global filesystem, and try to make the networking look uniform).
>
> Trying to have a single kernel for thousands of nodes is just crazy. Even
> if the system were ccNUMA and _could_ do it in theory.

I totally agree; mostly I was playing devil's advocate. The model
actually in my head is one where you have multiple kernels, but they
talk well enough that the applications don't have to care, in areas
where it doesn't make a performance difference. (There's got to be one
of those.)

> The NUMA work can probably take single-kernel to maybe 64+ nodes, before
> people just start turning stark raving mad. There's no way you'll have
> single-kernel for thousands of CPU's, and still stay sane and claim any
> reasonable performance under generic loads.
>
> So don't confuse the issue with clusters like that. The "set_affinity()"
> call simply doesn't have anything to do with them. If you want to move
> processes between nodes on such a cluster, you'll probably need user-level
> help, the kernel is unlikely to do it for you.

Agreed.

The compute cluster problem is an interesting one. The big items I see
on the todo list are:

- A scalable, fast distributed file system (Lustre looks like a
  possibility)
- Sub-application-level checkpointing.

Services, like schedulers, already exist.

Basically the job of a cluster scheduler gets much easier, and the
scheduler more powerful, once it gets the ability to suspend jobs.
Checkpointing buys three things: the ability to preempt jobs, the
ability to migrate processes, and the ability to recover from failed
nodes (assuming the failed hardware didn't corrupt your job's
checkpoint).

Once solutions to the cluster problems become well understood I wouldn't
be surprised if some of the supporting services started to live in the
kernel, like nfsd. Parts of the distributed filesystem certainly will.

I suspect process checkpointing and restoring will evolve something like
pthread support, with some code in user space and some generic helpers
in the kernel, as clean pieces of the job can be broken off. The
challenge is only how to save/restore interprocess communications.
Things like moving a TCP connection from one node to another are
interesting problems. But I suspect most of the hard problems that we
need kernel help with can have uses independent of checkpointing.
Already we have web server farms that spread connections to a single IP
across nodes.

Eric

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 3:57 ` Eric W. Biederman @ 2002-06-20 5:24 ` Larry McVoy 2002-06-20 7:26 ` Andreas Dilger 2002-06-20 14:54 ` Eric W. Biederman 2002-06-20 16:30 ` Cort Dougan 1 sibling, 2 replies; 70+ messages in thread From: Larry McVoy @ 2002-06-20 5:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List > I totally agree, mostly I was playing devils advocate. The model > actually in my head is when you have multiple kernels but they talk > well enough that the applications have to care in areas where it > doesn't make a performance difference (There's got to be one of those). .... > The compute cluster problem is an interesting one. The big items > I see on the todo list are: > > - Scalable fast distributed file system (Lustre looks like a > possibility) > - Sub application level checkpointing. > > Services like a schedulers, already exist. > > Basically the job of a cluster scheduler gets much easier, and the > scheduler more powerful once it gets the ability to suspend jobs. > Checkpointing buys three things. The ability to preempt jobs, the > ability to migrate processes, and the ability to recover from failed > nodes, (assuming the failed hardware didn't corrupt your jobs > checkpoint). > > Once solutions to the cluster problems become well understood I > wouldn't be surprised if some of the supporting services started to > live in the kernel like nfsd. Parts of the distributed filesystem > certainly will. http://www.bitmover.com/cc-pitch I've been trying to get Linus to listen to this for years and he keeps on flogging the tired SMP horse instead. DEC did it and Sun has been passing around these slides for a few weeks, so maybe they'll do it too. 
Then Linux can join the party after it has become a fine-grained,
locked-to-hell-and-back, soft-"realtime", NUMA-enabled, bloated piece of
crap like all the other kernels, and we'll get to go through the "let's
reinvent Unix for the 3rd time in 40 years" exercise all over again.
What fun. Not.

Sorry to be grumpy. Go read the slides; I'll be at OLS, and I'd be happy
to talk it over with anyone who wants to think about it. Paul McKenney
from IBM came down to San Francisco to talk to me about it, put me
through an 8 or 9 hour session which felt like a PhD exam, and after
trying to poke holes in it grudgingly let on that maybe it was a good
idea. He was kind enough to write up what he took away from it; here it
is.

--lm

From: "Paul McKenney" <Paul.McKenney@us.ibm.com>
To: lm@bitmover.com, tytso@mit.edu
Subject: Greatly enjoyed our discussion yesterday!
Date: Fri, 9 Nov 2001 18:48:56 -0800

Hello!

I greatly enjoyed our discussion yesterday! Here are the pieces of it
that I recall; I know that you will not be shy about correcting any
errors and omissions.

					Thanx, Paul

Larry McVoy's SMP Clusters Discussion on November 8, 2001
Larry McVoy, Ted Ts'o, and Paul McKenney

What is SMP Clusters?

SMP Clusters is a method of partitioning an SMP (symmetric
multiprocessing) machine's CPUs, memory, and I/O devices so that
multiple "OSlets" run on this machine. Each OSlet owns and controls its
partition. A given partition is expected to contain from 4-8 CPUs, its
share of memory, and its share of I/O devices. A machine large enough to
have SMP Clusters profitably applied is expected to have enough of the
standard I/O adapters (e.g., ethernet, SCSI, FC, etc.) so that each
OSlet would have at least one of each.

Each OSlet has the same data structures that an isolated OS would have
for the same amount of resources. Unless interactions with other OSlets
are required, an OSlet runs very nearly the same code over very nearly
the same data as would a standalone OS.
Although each OSlet is in most ways its own machine, the full set of
OSlets appears as one OS to any user programs running on any of the
OSlets. In particular, processes on one OSlet can share memory with
processes on other OSlets, can send signals to processes on other
OSlets, communicate via pipes and Unix-domain sockets with processes on
other OSlets, and so on. Performance of operations spanning multiple
OSlets may be somewhat slower than operations local to a single OSlet,
but the difference will not be noticeable except to users who are
engaged in careful performance analysis.

The goals of the SMP Cluster approach are:

1. Allow the core kernel code to use simple locking designs.
2. Present applications with a single-system view.
3. Maintain good (linear!) scalability.
4. Not degrade the performance of a single CPU beyond that of a
   standalone OS running on the same resources.
5. Minimize modification of core kernel code. Modified or rewritten
   device drivers, filesystems, and architecture-specific code is
   permitted, perhaps even encouraged. ;-)

OS Boot

Early-boot code/firmware must partition the machine, and prepare tables
for each OSlet that describe the resources that each OSlet owns. Each
OSlet must be made aware of the existence of all the other OSlets, and
will need some facility to allow efficient determination of which OSlet
a given resource belongs to (for example, to determine which OSlet a
given page is owned by). At some point in the boot sequence, each OSlet
creates a "proxy task" for each of the other OSlets that provides shared
services to them.

Issues:

1. Some systems may require device probing to be done by a central
   program, possibly before the OSlets are spawned. Systems that react
   in an unfriendly manner to failed probes might be in this class.
2. Interrupts must be set up very carefully. On some systems, the
   interrupt system may constrain the ways in which the system is
   partitioned.
Shared Operations This section describes some possible implementations and issues with a number of the shared operations. Shared operations include: 1. Page fault on memory owned by some other OSlet. 2. Manipulation of processes running on some other OSlet. 3. Access to devices owned by some other OSlet. 4. Reception of network packets intended for some other OSlet. 5. SysV msgq and sema operations on msgq and sema objects accessed by processes running on multiple of the OSlets. 6. Access to filesystems owned by some other OSlet. The /tmp directory gets special mention. 7. Pipes connecting processes in different OSlets. 8. Creation of processes that are to run on a different OSlet than their parent. 9. Processing of exit()/wait() pairs involving processes running on different OSlets. Page Fault As noted earlier, each OSlet maintains a proxy process for each other OSlet (so that for an SMP Cluster made up of N OSlets, there are N*(N-1) proxy processes). When a process in OSlet A wishes to map a file belonging to OSlet B, it makes a request to B's proxy process corresponding to OSlet A. The proxy process maps the desired file and takes a page fault at the desired address (translated as needed, since the file will usually not be mapped to the same location in the proxy and client processes), forcing the page into OSlet B's memory. The proxy process then passes the corresponding physical address back to the client process, which maps it. Issues: o How to coordinate pageout? Two approaches: 1. Use mlock in the proxy process so that only the client process can do the pageout. 2. Make the two OSlets coordinate their pageouts. This is more complex, but will be required in some form or another to prevent OSlets from "ganging up" on one of their number, exhausting its memory. o When OSlet A ejects the memory from its working set, where does it put it? 1. Throw it away, and go to the proxy process as needed to get it back. 2. 
Augment core VM as needed to track the "guest" memory. This may be needed for performance, but... o Some code is required in the pagein() path to figure out that the proxy must be used. 1. Larry stated that he is willing to be punched in the nose to get this code in. ;-) The amount of this code is minimized by creating SMP-clusters-specific filesystems, which have their own functions for mapping and releasing pages. (Does this really cover OSlet A's paging out of this memory?) o How are pagein()s going to be even halfway fast if IPC to the proxy is involved? 1. Just do it. Page faults should not be all that frequent with today's memory sizes. (But then why do we care so much about page-fault performance???) 2. Use "doors" (from Sun), which are very similar to protected procedure call (from K42/Tornado/Hurricane). The idea is that the CPU in OSlet A that is handling the page fault temporarily -becomes- a member of OSlet B by using OSlet B's page tables for the duration. This results in some interesting issues: a. What happens if a process wants to block while "doored"? Does it switch back to being an OSlet A process? b. What happens if a process takes an interrupt (which corresponds to OSlet A) while doored (thus using OSlet B's page tables)? i. Prevent this by disabling interrupts while doored. This could pose problems with relatively long VM code paths. ii. Switch back to OSlet A's page tables upon interrupt, and switch back to OSlet B's page tables upon return from interrupt. On machines not supporting ASID, take a TLB-flush hit in both directions. Also likely requires common text (at least for low-level interrupts) for all OSlets, making it more difficult to support OSlets running different versions of the OS. Furthermore, the last time that Paul suggested adding instructions to the interrupt path, several people politely informed him that this would require a nose punching. ;-) c. 
If a bunch of OSlets simultaneously decide to invoke their proxies on a
particular OSlet, that OSlet gets lock contention corresponding to the
number of CPUs on the system rather than to the number in a single
OSlet. Some approaches to handle this:

   i.  Stripe -everything-, rely on entropy to save you. May still have
       problems with hotspots (e.g., which of the OSlets has the root
       of the root filesystem?).
   ii. Use some sort of queued lock to limit the number of CPUs that
       can be running proxy processes in a given OSlet. This does not
       really help scaling, but would make the contention less
       destructive to the victim OSlet.

o How to balance memory usage across the OSlets?
  1. Don't bother, let paging deal with it. Paul's previous experience
     with this philosophy was not encouraging. (You can end up with one
     OSlet thrashing due to the memory load placed on it by other
     OSlets, which don't see any memory pressure.)
  2. Use some global memory-pressure scheme to even things out. Seems
     possible; Paul is concerned about the complexity of this approach.
     If this approach is taken, make sure someone with some
     control-theory experience is involved.

Manipulation of Processes Running on Some Other OSlet

The general idea here is to implement something similar to a vproc
layer. This is common code, and thus requires someone to sacrifice
their nose. There was some discussion of other things that this would
be useful for, but I have lost them. Manipulations discussed included
signals and job control.

Issues:

o Should process information be replicated across the OSlets for
  performance reasons? If so, how much, and how to synchronize?
  1. No, just use doors. See above discussion.
  2. Yes. No discussion of synchronization methods. (Hey, we had to
     leave -something- for later!)

Access to Devices Owned by Some Other OSlet

Larry mentioned a /rdev, but if we discussed any details of this, I have
lost them. Presumably, one would use some sort of IPC or doors to make
this work.
Reception of Network Packets Intended for Some Other OSlet. An OSlet receives a packet, and realizes that it is destined for a process running in some other OSlet. How is this handled without rewriting most of the networking stack? The general approach was to add a NAT-like layer that inspected the packet and determined which OSlet it was destined for. The packet was then forwarded to the correct OSlet, and subjected to full IP-stack processing. Issues: o If the address map in the kernel is not to be manipulated on each packet reception, there needs to be a circular buffer in each OSlet for each of the other OSlets (again, N*(N-1) buffers). In order to prevent the buffer from needing to be exceedingly large, packets must be bcopy()ed into this buffer by the OSlet that received the packet, and then bcopy()ed out by the OSlet containing the target process. This could add a fair amount of overhead. 1. Just accept the overhead. Rely on this being an uncommon case (see the next issue). 2. Come up with some other approach, possibly involving the user address space of the proxy process. We could not articulate such an approach, but it was late and we were tired. o If there are two processes that share the FD on which the packet could be received, and these two processes are in two different OSlets, and neither is in the OSlet that received the packet, what the heck do you do??? 1. Prevent this from happening by refusing to allow processes holding a TCP connection open to move to another OSlet. This could result in load-balance problems in some workloads, though neither Paul nor Ted were able to come up with a good example on the spot (seeing as BAAN has not been doing really well of late). To indulge in l'esprit d'escalier... How about a timesharing system that users access from the network? A single user would have to log on twice to run a job that consumed more than one OSlet if each process in the job might legitimately need access to stdin. 2. 
Do all protocol processing on the OSlet on which the packet was
received, and straighten things out when delivering the packet data to
the receiving process. This likely requires changes to common code,
hence someone to volunteer their nose.

SysV msgq and sema Operations

We didn't discuss these. None of us seem to be SysV fans, but these
must be made to work regardless. Larry says that shm should be
implemented in terms of mmap(), so that this case reduces to the
page-mapping discussed above. Of course, one would need a filesystem
large enough to handle the largest possible shmget. Paul supposes that
one could dynamically create a memory filesystem to avoid problems
here, but is in no way volunteering his nose to this cause.

Access to Filesystems Owned by Some Other OSlet

For the most part, this reduces to the mmap case. However, partitioning
popular filesystems over the OSlets could be very helpful. Larry
mentioned that this had been prototyped. Paul cannot remember if Larry
promised to send papers or other documentation, but duly requests them
after the fact.

Larry suggests having a local /tmp, so that /tmp is in effect private
to each OSlet. There would be a /gtmp that would be a globally visible
/tmp equivalent. We went round and round on software compatibility,
Paul suggesting a hashed filesystem as an alternative. Larry eventually
pointed out that one could just issue different mount commands to get a
global filesystem in /tmp, and create a per-OSlet /ltmp. This would
allow people to determine their own level of risk/performance.

Pipes Connecting Processes in Different OSlets

This was mentioned, but I have forgotten the details. My vague
recollections lead me to believe that some nose-punching was required,
but I must defer to Larry and Ted. Ditto for Unix-domain sockets.

Creation of Processes on a Different OSlet Than Their Parent

There would be an inherited attribute that would prevent fork() or
exec() from creating its child on a different OSlet.
This attribute would be set by default to prevent too many surprises.
Things like make(1) would clear this attribute to allow amazingly fast
kernel builds. There would also be a system call that would cause the
child to be placed on a specified OSlet (Paul suggested use of HP's
"launch policy" concept to avoid adding yet another dimension to the
exec() combinatorial explosion).

The discussion of packet reception led Larry to suggest that cross-OSlet
process creation would be prohibited if the parent and child shared a
socket. See above for the load-balancing concern and corresponding
l'esprit d'escalier.

Processing of exit()/wait() Pairs Crossing OSlet Boundaries

We didn't discuss this. My guess is that vproc deals with it. Some care
is required when optimizing for this. If one hands off to a remote
parent that dies before doing a wait(), one would not want one of the
init processes getting a nasty surprise. (Yes, there are separate init
processes for each OSlet. We did not talk about the implications of
this, which might occur if one were to need to send a signal intended
to be received by all the replicated processes.)

Other Desiderata:

1. Ability of surviving OSlets to continue running after one of their
   number fails. Paul was quite skeptical of this. Larry suggested that
   the "door" mechanism could use a dynamic-linking strategy. Paul
   remained skeptical. ;-)
2. Ability to run different versions of the OS on different OSlets.
   Some discussion of this above.

The Score

Paul agreed that SMP Clusters could be implemented. He was not sure
that it could achieve good performance, but could not prove otherwise.
Although he suspected that the complexity might be less than that of
the proprietary highly parallel Unixes, he was not convinced that it
would be less than Linux's, given the Linux community's emphasis on
simplicity in addition to performance.

--
---
Larry McVoy            lm at bitmover.com           http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 5:24 ` Larry McVoy @ 2002-06-20 7:26 ` Andreas Dilger 2002-06-20 14:54 ` Eric W. Biederman 1 sibling, 0 replies; 70+ messages in thread From: Andreas Dilger @ 2002-06-20 7:26 UTC (permalink / raw) To: Larry McVoy, Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Jun 19, 2002 22:24 -0700, Larry McVoy wrote: > Linus Torvalds <torvalds@transmeta.com> writes: > > The compute cluster problem is an interesting one. The big items > > I see on the todo list are: > > > > - Scalable fast distributed file system (Lustre looks like a > > possibility) Well, I can speak to this a little bit... Given Lustre's ext3 underpinnings, we have been thinking of some interesting methods by which we could take an existing ext3 filesystem on a disk and "clusterify" it (i.e. have distributed coherency across multiple clients). This would be perfectly suited for application on a CC cluster. Given that the network communication protocols are also abstracted out from the Lustre core, it would probably be trivial for someone with network/VM experience to write a "no-op" networking layer which basically did little more than passing around page addresses and faulting the right pages into each OSlet. The protocol design is already set up to handle direct DMA between client and storage target, and a CC cluster could also do away with the actual copy involved in the DMA. We can already do "zero copy" I/O between user-space and a remote disk with O_DIRECT and the right network hardware (which does direct DMA from one node to another). > "Paul McKenney" <Paul.McKenney@us.ibm.com> writes: > Access to Devices Owned by Some Other OSlet > > Larry mentioned a /rdev, but if we discussed any details > of this, I have lost them. Presumably, one would use some > sort of IPC or doors to make this work. 
I would just make access to remote devices act like NBD or something, and have similar "network/proxy" kernel drivers to all "remote" devices. At boot time something like devfs would instantiate the "proxy" drivers for all of the kernels except the one which is "in control" of that device. For example /dev/hda would be a real IDE disk device driver on the controlling node, but would be NBD in all of the other OSlets. It would have the same major/minor number across all OSlets so that it presented a uniform interface to user-space. While in some cases (e.g. FC) you could have shared-access directly to the device, other devices don't have the correct locking mechanisms internally to be accessed by more than one thread at a time. As the "network" layer between two OSlets would run basically at memory speeds, this would not impose much of an overhead. The proxy device interfaces would be equally useful between OSlets as with two remote machines (e.g. remote modem access), so I have no doubt that many of them already exist, and the others could be written rather easily. > Access to Filesystems Owned by Some Other OSlet. > > For the most part, this reduces to the mmap case. However, > partitioning popular filesystems over the OSlets could be > very helpful. Larry mentioned that this had been prototyped. > Paul cannot remember if Larry promised to send papers or > other documentation, but duly requests them after the fact. > > Larry suggests having a local /tmp, so that /tmp is in effect > private to each OSlet. There would be a /gtmp that would > be a globally visible /tmp equivalent. We went round and > round on software compatibility, Paul suggesting a hashed > filesystem as an alternative. Larry eventually pointed out > that one could just issue different mount commands to get > a global filesystem in /tmp, and create a per-OSlet /ltmp. > This would allow people to determine their own level of > risk/performance. Nah, just use a cluster filesystem for everything ;-). 
As I mentioned previously, Lustre could run from a single (optionally shared-access) disk (with proper, relatively minor, hacks that are just in the discussion phase now), or it can run from distributed disks that serve the data to the remote clients. With smart allocation of resources, OSlets will prefer to create new files on their "local" storage unless there are resource shortages. The fast "networking" between OSlets means even "remote" disk access is cheap. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20  5:24 ` Larry McVoy
  2002-06-20  7:26   ` Andreas Dilger
@ 2002-06-20 14:54   ` Eric W. Biederman
  1 sibling, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2002-06-20 14:54 UTC (permalink / raw)
To: Larry McVoy
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Larry McVoy <lm@bitmover.com> writes:

> > I totally agree, mostly I was playing devil's advocate. The model
> > actually in my head is when you have multiple kernels but they talk
> > well enough that the applications don't have to care in areas where it
> > doesn't make a performance difference (There's got to be one of those).
>
> ....
>
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> >   possibility)
> > - Sub-application-level checkpointing.
> >
> > Services, like schedulers, already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful once it gets the ability to suspend jobs.
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel, like nfsd. Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead.

Hmm. My impression is that Linux has been doing SMP mostly because it
hasn't become a nightmare so far. Linus just a moment ago noted that
there are scalability limits to SMP.

As for the cc-SMP stuff:

a) Except for dual-CPU systems, no one makes affordable SMPs.
b) It doesn't solve anything except your problem with locks.

You have presented your idea, and maybe it will be useful. But at the
moment it is not the place to start. What I need today is process
checkpointing. The rest comes in easy incremental steps from there.

For me the natural place to start is with clusters; they are cheaper and
more accessible than SMPs. And then work on the clustering software with
gradual refinements until it can be managed as one machine. At that
point it should be easy to compare which does a better job for SMPs.

Eric

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 3:57 ` Eric W. Biederman 2002-06-20 5:24 ` Larry McVoy @ 2002-06-20 16:30 ` Cort Dougan 2002-06-20 17:15 ` Linus Torvalds ` (3 more replies) 1 sibling, 4 replies; 70+ messages in thread From: Cort Dougan @ 2002-06-20 16:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List "Beating the SMP horse to death" does make sense for 2 processor SMP machines. When 64 processor machines become commodity (Linux is a commodity hardware OS) something will have to be done. When research groups put Linux on 1k processors - it's an experiment. I don't think they have much right to complain that Linux doesn't scale up to that level - it's not designed to. That being said, large clusters are an interesting research area but it is _not_ a failing of Linux that it doesn't scale to them. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 16:30 ` Cort Dougan @ 2002-06-20 17:15 ` Linus Torvalds 2002-06-21 6:15 ` Eric W. Biederman 2002-06-20 17:16 ` RW Hawkins ` (2 subsequent siblings) 3 siblings, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2002-06-20 17:15 UTC (permalink / raw) To: Cort Dougan Cc: Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Thu, 20 Jun 2002, Cort Dougan wrote: > > "Beating the SMP horse to death" does make sense for 2 processor SMP > machines. It makes fine sense for any tightly coupled system, where the tight coupling is cost-efficient. Today that means 2 CPU's, and maybe 4. Things like SMT (Intel calls it "HT") increase that to 4/8. It's just _cheaper_ to do that kind of built-in SMP support than it is to not use it. The important part of what Cort says is "commodity". Not the "small number of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING HARDWARE BASE in the commodity space. ccNuma and clusters just aren't even on the _radar_ from a commodity standpoint. While commodity 4- and 8-way SMP is just a few years away. So because SMP hardware is cheap and efficient, all reasonable scalability work is done on SMP. And the fringe is just that - fringe. The numa/cluster fringe tends to try to use SMP approaches because they know they are a minority, and they want to try to leverage off the commodity. And it will continue to be this way for the foreseeable future. People should just accept the fact. The only thing that may change the current state of affairs is that some cluster/numa issues are slowly percolating down and they may become more commoditized. For example, I think the AMD approach to SMP on the hammer series is "local memories" with a fast CPU interconnect. That's a lot more NUMA than we're used to in the PC space. 
On the other hand, another interesting trend seems to be that since commoditizing NUMA ends up being done with a lot of integration, the actual _latency_ difference is so small that those potential future commodity NUMA boxes can be considered largely UMA/SMP. And I guarantee Linux will scale up fine to 16 CPU's, once that is commodity. And the rest is just not all that important. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 17:15 ` Linus Torvalds @ 2002-06-21 6:15 ` Eric W. Biederman 2002-06-21 17:50 ` Larry McVoy 0 siblings, 1 reply; 70+ messages in thread From: Eric W. Biederman @ 2002-06-21 6:15 UTC (permalink / raw) To: Linus Torvalds Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List Linus Torvalds <torvalds@transmeta.com> writes: > On Thu, 20 Jun 2002, Cort Dougan wrote: > > > > "Beating the SMP horse to death" does make sense for 2 processor SMP > > machines. > > It makes fine sense for any tightly coupled system, where the tight > coupling is cost-efficient. > > Today that means 2 CPU's, and maybe 4. > > Things like SMT (Intel calls it "HT") increase that to 4/8. It's just > _cheaper_ to do that kind of built-in SMP support than it is to not use > it. > > The important part of what Cort says is "commodity". Not the "small number > of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING > HARDWARE BASE in the commodity space. Commodity is the wrong word. Volume is the right word. Volumes of machines, volumes of money, and volumes of developers. > ccNuma and clusters just aren't even on the _radar_ from a commodity > standpoint. While commodity 4- and 8-way SMP is just a few years away. I bet it is easy to find a 2-4 way heterogeneous pile of computers in many a developer's personal possession that could be turned into a cluster if the software wasn't so inconvenient to use, or if there was a good reason to run computer systems that way. Clusters and ccNuma are entirely different animals. ccNuma is about specialized hardware. Clusters are about using commodity hardware in a different way. > So because SMP hardware is cheap and efficient, all reasonable scalability > work is done on SMP. And the fringe is just that - fringe. 
> The numa/cluster fringe tends to try to use SMP approaches because they know > they are a minority, and they want to try to leverage off the commodity. The cluster fringe is a minority. But the high performance computing and batch scheduling minority has done a lot of the theoretical and developmental computer science work in the past. And I would be surprised if they weren't influential in the future. But like most research, a lot of it is trying suboptimal solutions that eventually get ditched. The only SMP-like stuff I have seen in clustering is the attempts to make clusters simpler to use. And the question I hear is how simple can we make it without sacrificing scalability. > And it will continue to be this way for the foreseeable future. People > should just accept the fact. I apparently see things differently. That the clusters will be a minority, certainly. That the people working on them are hopelessly fringe, not a bit. Clusters of Linux machines scale acceptably. And for a certain set of people they get the job done. The problem is making it more convenient to get the job done. And just like in hardware, where integration can make extra hardware features essentially free, the next step is to begin integrating cluster features into Linux, both kernel and user space. Basically the technique is: implement something that works. Then find the clean, efficient way to do it. If that takes kernel support, write a kernel patch and get it in. > And I guarantee Linux will scale up fine to 16 CPU's, once that is > commodity. And the rest is just not all that important. It works just fine on my little 20-node, 20-kernel test machine too. I think Larry's perspective is interesting and if the common cluster software gets working well enough I might even try it. But until a big SMP becomes commodity I don't see the point. Eric ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-21 6:15 ` Eric W. Biederman @ 2002-06-21 17:50 ` Larry McVoy 2002-06-21 17:55 ` Robert Love 2002-06-22 18:25 ` Eric W. Biederman 0 siblings, 2 replies; 70+ messages in thread From: Larry McVoy @ 2002-06-21 17:50 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote: > I think Larry's perspective is interesting and if the common cluster > software gets working well enough I might even try it. But until a > big SMP becomes commodity I don't see the point. The real point is that multi threading screws up your kernel. All the Linux hackers are going through the learning curve on threading and think I'm an alarmist or a nut. After Linux works on a 64 way box, I suspect that the majority of them will secretly admit that threading does screw up the kernel but at that point it's far too late. The current approach is a lot like western medicine. Wait until the cancer shows up and then make an effort to get rid of it. My suggested approach is to take steps to make sure the cancer never gets here in the first place. It's proactive rather than reactive. And the reason I harp on this is that I'm positive (and history supports me 100%) that the reactive approach doesn't work, you'll be stuck with it, there is no way to "fix" it other than starting over with a new kernel. Then we get to repeat this whole discussion in 15 years with one of the Linux veterans trying to explain to the NewOS guys that multi threading really isn't as cool as it sounds and they should try this other approach. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-21 17:50 ` Larry McVoy @ 2002-06-21 17:55 ` Robert Love 2002-06-22 18:25 ` Eric W. Biederman 1 sibling, 0 replies; 70+ messages in thread From: Robert Love @ 2002-06-21 17:55 UTC (permalink / raw) To: Larry McVoy Cc: Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Linux Kernel Mailing List On Fri, 2002-06-21 at 10:50, Larry McVoy wrote: > The real point is that multi threading screws up your kernel. All the Linux > hackers are going through the learning curve on threading and think I'm an > alarmist or a nut. After Linux works on a 64 way box, I suspect that the > majority of them will secretly admit that threading does screw up the kernel > but at that point it's far too late. Larry, this is a point you have made several times and admittedly one I agree with. I fail to see how the high-end scaling will not compromise the low-end and I am genuinely concerned Linux will become Solaris. I do not know what to do to prevent it - and I am certainly not saying we should outright prevent certain things, but it worries me. You are going to be in Ottawa next week? Maybe we can talk about it... Robert Love ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-21 17:50 ` Larry McVoy 2002-06-21 17:55 ` Robert Love @ 2002-06-22 18:25 ` Eric W. Biederman 2002-06-22 19:26 ` Larry McVoy 1 sibling, 1 reply; 70+ messages in thread From: Eric W. Biederman @ 2002-06-22 18:25 UTC (permalink / raw) To: Larry McVoy Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List Larry McVoy <lm@bitmover.com> writes: > On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote: > > I think Larry's perspective is interesting and if the common cluster > > software gets working well enough I might even try it. But until a > > big SMP becomes commodity I don't see the point. > > The real point is that multi threading screws up your kernel. All the Linux > hackers are going through the learning curve on threading and think I'm an > alarmist or a nut. After Linux works on a 64 way box, I suspect that the > majority of them will secretly admit that threading does screw up the kernel > but at that point it's far too late. I don't see an argument that locks that get too fine-grained are not an issue. However, even traditional versions of single-CPU Unix are multi-threaded. The locking in a multi-CPU design just makes that explicit. And the only really nasty place to get locks is when you get a noticeable number of them in your device drivers. With the core code you can fix it without worrying about killing the OS. > The current approach is a lot like western medicine. Wait until the > cancer shows up and then make an effort to get rid of it. My suggested > approach is to take steps to make sure the cancer never gets here in > the first place. It's proactive rather than reactive. And the reason > I harp on this is that I'm positive (and history supports me 100%) > that the reactive approach doesn't work, you'll be stuck with it, > there is no way to "fix" it other than starting over with a new kernel. 
> Then we get to repeat this whole discussion in 15 years with one of the > Linux veterans trying to explain to the NewOS guys that multi threading > really isn't as cool as it sounds and they should try this other > approach. Proactive: don't add a lock unless you can really justify that you need it. That is well suited to open source code review type practices, and it appears to be what we are doing now. And if you don't add locks you certainly don't get into a lock tangle. As for "history supports me 100%": all I see is that evolution of code, as it dynamically gathers the requirements instead of magically knowing them, does much better than design as a long term strategy. Of course you design the parts you can see, but everyone has a limited ability to see the future. To specifics, I don't see the point of OSlets on a single cpu that is hyper threaded. Traditional threading appears to make more sense to me. Similarly I don't see the point in the 2-4 cpu range. Given that there are some scales where you don't want/need more than one kernel, who has a machine where OSlets start to pay off? They don't exist in commodity hardware, so being proactive now looks stupid. The only practical course I see is to work on solutions that work on clusters of commodity machines. At least anyone who wants one can get one. If you can produce a single system image, the big iron guys can tweak the startup routine and run that on their giant NUMA or SMP machines. Eric ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-22 18:25 ` Eric W. Biederman @ 2002-06-22 19:26 ` Larry McVoy 2002-06-22 22:25 ` Eric W. Biederman ` (2 more replies) 0 siblings, 3 replies; 70+ messages in thread From: Larry McVoy @ 2002-06-22 19:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Larry McVoy, Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote: > I don't see an argument that locks that get too fine-grained are not an > issue. However, even traditional versions of single-CPU Unix are multi > threaded. The locking in a multi-CPU design just makes that explicit. > > And the only really nasty place to get locks is when you get a > noticeable number of them in your device drivers. With the core code > you can fix it without worrying about killing the OS. Just out of curiosity, have you actually ever worked on a fine grain threaded OS? One that scales to at least 32 processors? Solaris? IRIX? Others? It makes a difference; if you've been there, your perspective is somewhat different than just talking about it. If you have worked on one, for how long? Did you support the source base after it matured for any length of time? > Proactive: don't add a lock unless you can really justify that you need > it. That is well suited to open source code review type practices, > and it appears to be what we are doing now. And if you don't add > locks you certainly don't get into a lock tangle. That's a great theory. I support that theory; life would be great if it matched that theory. Unfortunately, I don't know of any kernel which matches that theory, do you? Linux certainly doesn't. FreeBSD certainly doesn't. Solaris/IRIX crossed that point years ago. So where is the OS which has managed to resist the lock tangle?
linux-2.5$ bk -r grep CONFIG_SMP | wc -l
1290
That's a lot of ifdefs for a supposedly tangle free kernel. 
And I suspect that the threading people will say Linux doesn't really scale beyond 2-4 CPUs for any I/O bound work load today. What's it going to be when Linux is at 32 CPUs? Solaris was around 3000 statically allocated locks when I left and I think it was scaling to maybe 8. At SGI, they were carefully putting the lock on the same cache line as the data structure that it protected, for all locked data structure which had any contention. The limit as the number of CPUs goes up is that each read/write cache line in the data segment has a lock. They certainly weren't there, but they were much closer than you might guess. It was definitely the norm that you laid out your locks with the data, it was that pervasive. Take a walk through sched.c and you can see the mess starting. How can anyone support that code on both UP and SMP? You are already supporting two code bases. Imagine what it is going to look like when the NUMA people get done. Don't forget the preempt people. Oh, yeah, let's throw in some soft realtime, that shouldn't screw things up too much. > To specifics, I don't see the point of OSlets on a single cpu that is > hyper threaded. Traditional threading appears to make more sense to > me. Similarly I don't see the point in the 2-4 cpu range. In general I agree with you here, but I think you haven't really considered all the options. I can see the benefit on a *single* CPU. There are all sorts of interesting games you could play in the area of fault tolerance and containment. Imagine a system, like what IBM has, that runs lots of copies of Linux with the mmap sharing turned off. ISPs would love it. Jeff Dike pointed out that if UML can run one kernel in user space, why not N? And if so, the OS clustering stuff could be done on top of UML and then "ported" to real hardware. I think that's a great idea, and you can carry it farther, you could run multiple kernels just for fault containment. See Sun's domains, DEC's Galaxy. 
> Given that there are some scales when you don't want/need more than > one kernel, who has a machine where OSlets start to pay off? They > don't exist in commodity hardware, so being proactive now looks > stupid. Not as stupid as having a kernel no one can maintain and not being able to do anything about it. There seems to be a subthread of elitist macho attitude along the lines of "oh, it won't be that bad, and besides, if you aren't good enough to code in a fine grained locked, soft real time, preempted, NUMA aware kernel, then you just shouldn't be in the kernel". I'm not saying you are saying that, but I've definitely heard it on the list. It's a great thing for bragging rights but it's a horrible thing from the sustainability point of view. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-22 19:26 ` Larry McVoy @ 2002-06-22 22:25 ` Eric W. Biederman 2002-06-22 23:10 ` Larry McVoy 2002-06-23 6:34 ` William Lee Irwin III 2002-06-23 22:56 ` Kai Henningsen 2 siblings, 1 reply; 70+ messages in thread From: Eric W. Biederman @ 2002-06-22 22:25 UTC (permalink / raw) To: Larry McVoy Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List Larry McVoy <lm@bitmover.com> writes: > On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote: > > To specifics, I don't see the point of OSlets on a single cpu that is > > hyper threaded. Traditional threading appears to make more sense to > > me. Similarly I don't see the point in the 2-4 cpu range. > > In general I agree with you here, but I think you haven't really considered > all the options. I can see the benefit on a *single* CPU. There are all > sorts of interesting games you could play in the area of fault tolerance > and containment. Imagine a system, like what IBM has, that runs lots of > copies of Linux with the mmap sharing turned off. ISPs would love > it. Hmm. Perhaps. But you are fundamentally susceptible to the base kernel, and the hardware on the machine. > Jeff Dike pointed out that if UML can run one kernel in user space, why > not N? And if so, the OS clustering stuff could be done on top of > UML and then "ported" to real hardware. I think that's a great idea, > and you can carry it farther, you could run multiple kernels just for > fault containment. See Sun's domains, DEC's Galaxy. Right. A clustered environment is accessible. For the most part I don't have a problem (except checkpointing) that is facilitated by running Linux under Linux. Currently my problem to solve is compute clusters. My current worries are not whether I can scale a kernel to 64 cpus. My practical worries are whether my user space will scale to 1000 dual processor machines. 
The important point for me is that there are a fair number of fundamentally hard problems in getting multiple kernels to look like one. Especially when you start with a maximum decoupling. And you seem to assume that solving these problems is trivial. Maybe it is maintainable when you get done, but there is a huge amount of work to get there. I haven't heard of a distributed OS as anything other than a dream, or a prototype with scaling problems. > > Given that there are some scales when you don't want/need more than > > one kernel, who has a machine where OSlets start to pay off? They > > don't exist in commodity hardware, so being proactive now looks > > stupid. > > Not as stupid as having a kernel no one can maintain and not being able > to do anything about it. There seems to be a subthread of elitist macho > attitude along the lines of "oh, it won't be that bad, and besides, > if you aren't good enough to code in a fine grained locked, soft real > time, preempted, NUMA aware, then you just shouldn't be in the kernel". > I'm not saying you are saying that, but I've definitely heard it on > the list. Hmm. I see a bulk of the on-going kernel work composed of projects to make the whole kernel easier to maintain. Especially interesting is the work that makes drivers relatively easy, and free from all of this cruft. Running some numbers (wc -l kernel/*.c fs/*.c mm/*.c):
1.2.12: 18813 lines
2.2.12: 37510 lines
2.5.14: 55701 lines
So the core kernel is growing, but at a fairly slow rate. Only worrying about the 60 thousand lines of generic kernel code is much better than worrying about the 3 million lines of driver code. And since you thought it was an interesting statistic:
grep CONFIG_SMP kernel/*.c fs/*.c mm/*.c init/*.c | wc -l
44
So most of the code that cares about SMP is not in the core of the kernel, but is mostly the code that actually implements SMP support. 
So in thinking about it, I agree that the constant simplification work that is done to the linux kernel looks like one of the most important activities long term. > It's a great thing for bragging rights but it's a horrible thing from > the sustainability point of view. Given that the simplification efforts tend to be some of the highest priority activities in the kernel, and the easiest patches to get accepted, I don't get the feeling that we are walking into a long term maintenance problem. As for bragging rights, my kernel work tends to be some of the easiest code I have to write. I have no doubts that C is a high level programming language. Eric ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-22 22:25 ` Eric W. Biederman @ 2002-06-22 23:10 ` Larry McVoy 0 siblings, 0 replies; 70+ messages in thread From: Larry McVoy @ 2002-06-22 23:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Larry McVoy, Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Sat, Jun 22, 2002 at 04:25:29PM -0600, Eric W. Biederman wrote: > The important point for me is that there are a fair number of > fundamentally hard problems to get multiple kernels look like one. > Especially when you start with a maximum decoupling. And you seem to > assume that solving these problems are trivial. No such assumption was made. Poke through my slides, you'll see that I think it will take a reasonable amount of effort to get there. I actually spelled out the staffing and the time estimates. Start asking around and you'll find that senior people who _have_ gone the multi threading route agree that this approach gets you to the same place with less than 1/10th the amount of work. The last guy who agreed with that statement was the guy who headed up the threading design and implementation of Solaris, he's at Netapp now. In fairness to you, I'm doing the same thing you are: I'm arguing about something I haven't done. On the other hand, I have been through (twice) the thing that you are saying is no problem and every person who has been there agrees with me that it sucks. It's doable, but it's a nightmare to maintain, it easily increases the subtlety of kernel interactions by an order of magnitude, probably closer to two orders. And I have done enough of what I've described to know it can be done. People who have deep knowledge of the fine grained approach have tried to prove that I was wrong and failed, repeatedly. They may not agree that this is a better way but they can't show that it won't work. > Maybe it is maintainable when you get done but there is a huge amount > of work to get there. 
> I haven't heard of a distributed OS as anything > other than a dream, or a prototype with scaling problems. This is a distributed OS on one system; that's a lot easier than a distributed OS across machine boundaries. And if you are worried about scaling problems, you don't understand the design. The OS cluster idea multi threads all data structures for free. No locks on 99% of the data structures that you would need locks on in an SMP OS. Think about this fact: if you have lock contention you don't scale. So you thread until you don't. Go do the math that shows how tiny a fraction of 1% of lock contention screws your scaling; everyone has bumped up against those curves. So the goal of any multithreaded OS is ZERO lock contention. Makes you wonder why the locks are there in the first place. They are trying to get to where I want to go but they are definitely doing it the hard way. > > Not as stupid as having a kernel no one can maintain and not being able > > to do anything about it. There seems to be a subthread of elitist macho > > attitude along the lines of "oh, it won't be that bad, and besides, > > if you aren't good enough to code in a fine grained locked, soft real > > time, preempted, NUMA aware, then you just shouldn't be in the kernel". > > I'm not saying you are saying that, but I've definitely heard it on > > the list. > > Hmm. I see a bulk of the on-going kernel work composed of projects to > make the whole kernel easier to maintain. [...] > I don't get the feeling that we are walking into a long > term maintenance problem. I don't mean to harp on this, but if you are going to comment on how hard it is to maintain a kernel could you please give us some idea of why it is you think as you do? Do you have some prior experience with a project of this size that shows what you believe to be true in practice? You keep suggesting that there isn't a problem, that we aren't headed for a problem. Why is that? Do you know something I don't? 
I've certainly seen what happens to a kernel source base as it goes through this process a few times and my experience is that what you are saying is the opposite of what happens. So if you've got some different experience, how about sharing it? Maybe there is some way to do what you are suggesting will happen, but I haven't ever seen it personally, nor have I ever heard of it occurring in any long lived project. All projects become more complex as time goes on; it's a direct result of the demands placed on any successful project. > So in thinking about it, I agree that the constant simplification work > that is done to the linux kernel looks like one of the most important > activities long term. What constant simplification work? The generic part of the kernel does more or less what it did a few years ago yet it has grown at a pretty fast clip. Talk to the embedded people and ask them if they think it has gotten simpler. By what standard has the kernel become less complex? -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-22 19:26 ` Larry McVoy 2002-06-22 22:25 ` Eric W. Biederman @ 2002-06-23 6:34 ` William Lee Irwin III 2002-06-23 22:56 ` Kai Henningsen 2 siblings, 0 replies; 70+ messages in thread From: William Lee Irwin III @ 2002-06-23 6:34 UTC (permalink / raw) To: Larry McVoy, Eric W. Biederman, Larry McVoy, Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Sat, Jun 22, 2002 at 12:26:56PM -0700, Larry McVoy wrote: > Not as stupid as having a kernel noone can maintain and not being able > to do anything about it. There seems to be a subthread of elitist macho > attitude along the lines of "oh, it won't be that bad, and besides, > if you aren't good enough to code in a fine grained locked, soft real > time, preempted, NUMA aware, then you just shouldn't be in the kernel". > I'm not saying you are saying that, but I've definitely heard it on > the list. I've been accused of this, so I'll state for the record: my views on locking are not efficiency-related in the least. They have to do with ensuring that locks protect well-defined data and that locking constructs are clean (e.g. nonrecursive and no implicit drop or acquire). My duties are not directly related to locking, and I only push the agenda I do as a low-priority kernel janitoring effort. As this is not a scalability issue, I'll not press it further in this thread. Cheers, Bill ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-22 19:26 ` Larry McVoy 2002-06-22 22:25 ` Eric W. Biederman 2002-06-23 6:34 ` William Lee Irwin III @ 2002-06-23 22:56 ` Kai Henningsen 2 siblings, 0 replies; 70+ messages in thread From: Kai Henningsen @ 2002-06-23 22:56 UTC (permalink / raw) To: linux-kernel; +Cc: lm lm@bitmover.com (Larry McVoy) wrote on 22.06.02 in <20020622122656.W23670@work.bitmover.com>: > Just out of curiousity, have you actually ever worked on a fine grain > threaded OS? One that scales to at least 32 processors? Solaris? IRIX? > Others? It makes a difference, if you've been there, your perspective is IIRC, you said that your proposed system should have one oslet per about 4 CPUs. And I see many people claiming that current Linux locking is aimed at being good with about 4 CPUs. Maybe I'm dense, but it seems to me that means current Linux locking is aimed at exactly the spot where you argue it should be aimed *anyway*. What am I not seeing? MfG Kai ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 16:30 ` Cort Dougan 2002-06-20 17:15 ` Linus Torvalds @ 2002-06-20 17:16 ` RW Hawkins 2002-06-20 17:23 ` Cort Dougan 2002-06-20 20:40 ` Martin Dalecki 2002-06-21 5:34 ` Eric W. Biederman 3 siblings, 1 reply; 70+ messages in thread From: RW Hawkins @ 2002-06-20 17:16 UTC (permalink / raw) To: Cort Dougan Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List You're missing the point. Larry is saying "I have been down this road before, take heed". We don't want to waste the time reinventing bloat when we can learn from others mistakes. -RW Cort Dougan wrote: >"Beating the SMP horse to death" does make sense for 2 processor SMP >machines. When 64 processor machines become commodity (Linux is a >commodity hardware OS) something will have to be done. When research >groups put Linux on 1k processors - it's an experiment. I don't think they >have much right to complain that Linux doesn't scale up to that level - >it's not designed to. > >That being said, large clusters are an interesting research area but it is >_not_ a failing of Linux that it doesn't scale to them. >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 17:16 ` RW Hawkins @ 2002-06-20 17:23 ` Cort Dougan 0 siblings, 0 replies; 70+ messages in thread From: Cort Dougan @ 2002-06-20 17:23 UTC (permalink / raw) To: RW Hawkins Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List I'm not disagreeing with Larry here. I'm just pointing out that mainline Linux cares about what is commodity. That's 1-2 processors and 2-4 on some PPC and other boards. I'm keenly interested in 1k processors, as is Larry, and scaling Linux up to them. I'm don't disagree with Linus' path for Linux staying on SMP for now. Scaling up to huge clusters isn't a mainline Linux concern. It's a very interesting research area, though. In fact, some research I work on. } You're missing the point. Larry is saying "I have been down this road } before, take heed". We don't want to waste the time reinventing bloat } when we can learn from others mistakes. } } -RW } } Cort Dougan wrote: } } >"Beating the SMP horse to death" does make sense for 2 processor SMP } >machines. When 64 processor machines become commodity (Linux is a } >commodity hardware OS) something will have to be done. When research } >groups put Linux on 1k processors - it's an experiment. I don't think they } >have much right to complain that Linux doesn't scale up to that level - } >it's not designed to. } > } >That being said, large clusters are an interesting research area but it is } >_not_ a failing of Linux that it doesn't scale to them. 
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 16:30 ` Cort Dougan
  2002-06-20 17:15   ` Linus Torvalds
  2002-06-20 17:16   ` RW Hawkins
@ 2002-06-20 20:40   ` Martin Dalecki
  2002-06-20 20:53     ` Linus Torvalds
    ` (2 more replies)
  2002-06-21  5:34   ` Eric W. Biederman
  3 siblings, 3 replies; 70+ messages in thread
From: Martin Dalecki @ 2002-06-20 20:40 UTC (permalink / raw)
To: Cort Dougan
Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Cort Dougan wrote:
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines. When 64 processor machines become commodity (Linux is a
> commodity hardware OS) something will have to be done. When research

64 processor machines will *never* become a commodity because:

1. It's not like parallel machines are something entirely new. They have
been around for an awfully long time on this planet (nearly longer than
myself).

2. See 1: even dual CPU machines are a rarity even *now*.

3. Nobody needs them for the usual tasks; they are a *waste* of
resources, and economics still applies.

4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)

5. It will never become a commodity to run highly transactional
workloads, where integrated bunches of 4 make sense. Neither will it be
common to solve partial differential equations for aeroplane dynamics or
to calculate the behaviour of a hydrogen bomb.

6. Even in the aerodynamics department, a mere 14-CPU machine was very,
very fast (NEC SX-3R).

7. Hyper-threaded cores hardly make sense beyond 2.

8. Amdahl's law is math, not a decree from the Central Committee of the
Communist Party or George Bush. You cannot overrule it.

One exception could be dedicated rendering CPUs - which is the direction
graphics cards are apparently heading - but they will hardly ever need a
general purpose operating system. But even then - I'm still among the
people who are not interested in any OpenGL or Direct-whatever...
Even the worst graphics cards these days drive my display screens at the
resolutions I want just fine.

PS. I'm sick of seeing bunches of PCs which are accidentally in the same
room showing up nowadays in the list of the 500 fastest computers in the
world. It makes this list useless... If one wants a grasp of how the next
generation of really fast computers will look, well: they will be based
on Josephson junctions. TRW will build them (the same company as the
Voyager probe). Look there - they don't plan for thousands of CPUs, they
plan for a few CPUs in liquid helium:

http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 20:40 ` Martin Dalecki
@ 2002-06-20 20:53   ` Linus Torvalds
  2002-06-20 21:27     ` Martin Dalecki
  2002-06-20 21:13   ` Timothy D. Witham
  2002-06-21 19:53   ` Rob Landley
  2 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-20 20:53 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> 2. See 1: even dual CPU machines are a rarity even *now*.

With stuff like HT, you may well not be able to _buy_ an intel desktop
machine with just "one" CPU.

Get with the flow. The old Windows codebase is dead as far as new
machines are concerned, which means that there is no reason to hold back
any more: all OS's support SMP.

> 3. Nobody needs them for the usual tasks; they are a *waste*
> of resources, and economics still applies.

That's a load of bull.

For usual tasks, two CPU's give clearly better responsiveness than one.
If only because one of them may be doing the computation, and the other
may be doing GUI.

The number of people doing things like mp3 ripping is apparently quite
high. And it's definitely CPU-intensive.

Now, I suspect that past two CPU's you won't find much added oomph, but
the load-balancing of just two is definitely noticeable on a personal
scale. I just don't want to use UP machines any more unless they have
other things going for them (ie really really small).

> 4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)

That's not true either.

You can easily make _cheap_ hardware scale to 4, no problem. You may not
want a shared bus, but hey, that's a small implementation detail. Most
new CPU's have the interconnect hardware on-die (either now or planned).

Intel made SMP cheap by putting all the glue logic on-chip and in the
standard chipsets.

And besides, you don't actually need to _scale_ well, if the actual
incremental costs are low.
That's the whole point with the P4-HT, of course. Intel claims 5% die
area addition for a 30% scaling. They may be full of sh*t, of course, and
it may be that the added complexity in the control logic hurts them in
other areas (longer pipeline, whatever), but the point is that if it's
cheap, the second CPU doesn't have to "scale".

Linus

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 20:53 ` Linus Torvalds
@ 2002-06-20 21:27   ` Martin Dalecki
  2002-06-20 21:37     ` Linus Torvalds
  2002-06-21 20:38     ` Rob Landley
  0 siblings, 2 replies; 70+ messages in thread
From: Martin Dalecki @ 2002-06-20 21:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>> 2. See 1: even dual CPU machines are a rarity even *now*.
>
> With stuff like HT, you may well not be able to _buy_ an intel desktop
> machine with just "one" CPU.

Linus, you forget one simple fact - an HT CPU is *not* two CPUs. It is
one CPU with a slightly better utilization of the superscalar pipelines.
And it's only slightly better. Just another way of increasing the fill
rate of the pipelines for some specific tasks.

> Get with the flow. The old Windows codebase is dead as far as new
> machines are concerned, which means that there is no reason to hold
> back any more: all OS's support SMP.
>
>> 3. Nobody needs them for the usual tasks; they are a *waste*
>> of resources, and economics still applies.
>
> That's a load of bull.

Did I mention that ARMs are the most-sold CPUs out there?

> For usual tasks, two CPU's give clearly better responsiveness than one.
> If only because one of them may be doing the computation, and the other
> may be doing GUI.

For the usual task of controlling just the fuel level of the motor or
the like, one CPU does fine. For the other usual tasks - well, dissect a
PCMCIA WLAN card or some reasonably fast ethernet card or some hard
disk. You will find tons of independent CPUs in your system... but they
are hardly SMP-connected. For the other usual tasks my single Athlon is
just fine.

The main argument is: yes, it makes sense to use additional CPUs for
work offload on dedicated tasks, but the normal case is not to do it the
SMP way.
> The number of people doing things like mp3 ripping is apparently quite
> high. And it's definitely CPU-intensive.
>
> Now, I suspect that past two CPU's you won't find much added oomph, but

Well, on Intel two CPUs give you about 1.5x the horsepower of a single
CPU. On good SMP systems it's about 1.7x.

> the load-balancing of just two is definitely noticeable on a personal
> scale. I just don't want to use UP machines any more unless they have
> other things going for them (ie really really small).
>
>> 4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)
>
> That's not true either.
>
> You can easily make _cheap_ hardware scale to 4, no problem. You may not
> want a shared bus, but hey, that's a small implementation detail. Most
> new CPU's have the interconnect hardware on-die (either now or planned).
>
> Intel made SMP cheap by putting all the glue logic on-chip and in the
> standard chipsets.

Not if I look out to buy a real SMP board. They are still very expensive
in comparison to normal boards. However, indeed, they are nowadays
affordable.

> And besides, you don't actually need to _scale_ well, if the actual
> incremental costs are low. That's the whole point with the P4-HT, of
> course. Intel claims 5% die area addition for a 30% scaling. They may be

The 30% - I never saw it in the Intel paper. I remember they talk about
20% + something. And 30% is a *peak* value. The paper in question talks
about 12% on average. An awful lot for 5% die area (a 2.4x win), though,
especially if you look at the constant increase in the die area of CPUs
relative to their speed, factoring out the scaling of the production
process. If one factors out the production process scaling, modern CPUs
waste transistors like nobody's business in comparison to their older
siblings. (Remember, the 8088 was about 29K transistors, not 140M!) But
it's not much in absolute numbers...
> full of sh*t, of course, and it may be that the added complexity in the
> control logic hurts them in other areas (longer pipeline, whatever), but
> the point is that if it's cheap, the second CPU doesn't have to "scale".

The main hurting point is the quadrupling of the correctness testing
effort. Longer pipelines - I hardly think so. The synchronization
infrastructure for out-of-order execution was already there in the last
CPU generation. This is the reason why it's so cheap in terms of die
real estate to add it now.

BTW, them pulling this trick shows nicely that we are now at a point
where there will be hardly any further increase in the deployment of
micro-scale parallelism in CPU design... And not just on behalf of the
CPU - even more importantly, you could read it as a public admission of
the fact that we are near the end of static optimizations through
improvements in compiler technology as well. Oh, the compiler people
have been promising miracles constantly since the first days of
pipelining, of course...

In view of this I would love to see how they intend to HT the VLSI
design of the Itanic :-).

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:27 ` Martin Dalecki
@ 2002-06-20 21:37   ` Linus Torvalds
  2002-06-20 21:59     ` Martin Dalecki
  2002-06-21 20:38     ` Rob Landley
  1 sibling, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-20 21:37 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Linus, you forget one simple fact - an HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> superscalar pipelines.

Doesn't matter. It's SMP to software, _and_ it is a perfect example of
how integration, in the form of almost free transistors, changes the
economics.

> Just another way of increasing the fill rate of the pipelines
> for some specific tasks.

Integration is _not_ "just another way".

Integration fundamentally changes the whole equation.

When you integrate the SMP capabilities on the CPU, suddenly the world
changes, because suddenly SMP is cheap and easy to do for motherboard
manufacturers that would never have done it before. Suddenly SMP is
available at mass-market prices.

When you integrate multiple CPU's on one standard die (either HT or real
CPU's), the same thing happens.

When you start integrating crossbars etc "numa-like" stuff, like Hammer
apparently is doing, you get the same old technology, but it _behaves_
differently.

You see this outside CPU's too.

When people started integrating high-performance 3D onto a single die,
the _market_ changed. The way people used it changed. It's largely the
same technology that has been around for a long time in visual
workstations, but it's DIFFERENT thanks to low prices and easy
integration into bog-standard PC's.

A 3D tech person might say that the technology is still the same.

But a real human will notice that it's radically different.

> Did I mention that ARMs are the most-sold CPUs out there?

Doesn't matter.
Did I mention that microbes are the most populous form of living beings?
Does that make any difference to us as humans? Should that make us think
we want to be microbes? Or should it mean that we're somehow inferior?
Obviously not.

Did you mention that there are a lot more resistors in computers than
CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
ways - even though the amount of fundamental technology inherent in a
modern motherboard in _just_ the passive components like the resistor
network is way beyond what people built just a few years ago.

Linus

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:37 ` Linus Torvalds
@ 2002-06-20 21:59   ` Martin Dalecki
  2002-06-20 22:18     ` Linus Torvalds
    ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Martin Dalecki @ 2002-06-20 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>> Linus, you forget one simple fact - an HT CPU is *not* two CPUs.
>> It is one CPU with a slightly better utilization of the
>> superscalar pipelines.
>
> Doesn't matter. It's SMP to software, _and_ it is a perfect example of
> how integration, in the form of almost free transistors, changes the
> economics.

Well, but this still doesn't make SMP magically scale better. HT gives
you about a 12% increase in throughput on average. This will hardly
improve your MP3 ripping experience :-).

> Integration is _not_ "just another way".
>
> Integration fundamentally changes the whole equation.
>
> When you integrate the SMP capabilities on the CPU, suddenly the world
> changes, because suddenly SMP is cheap and easy to do for motherboard
> manufacturers that would never have done it before. Suddenly SMP is
> available at mass-market prices.

And suddenly the chipset manufacturers start to buy CPU designs like
crazy, because they can see what will come next... of course.

> When you integrate multiple CPU's on one standard die (either HT or
> real CPU's), the same thing happens.

Again: HT is still only one CPU. You are too software-centric :-).

> When you start integrating crossbars etc "numa-like" stuff, like Hammer
> apparently is doing, you get the same old technology, but it _behaves_
> differently.

Yes, HT gives 12%, naive SMP gives 50%, and good SMP (aka crossbar bus)
gives 70% for two CPUs. All those numbers are well below the level at
which more than 2-4 CPUs would make any sense...
Amdahl still bites you if you read those numbers the other way around:
88% waste (well, actually this time not), 50% waste, 20% waste, as you
scale.

However, crossbar switches do indeed allow for up to 64 CPUs, and more
importantly, they're the first step in a long time toward better overall
system throughput. However, they will still not be anywhere near a
commodity - too much heat for the foreseeable future.

> You see this outside CPU's too.
>
> When people started integrating high-performance 3D onto a single die,
> the _market_ changed. The way people used it changed. It's largely the
> same technology that has been around for a long time in visual
> workstations, but it's DIFFERENT thanks to low prices and easy
> integration into bog-standard PC's.
>
> A 3D tech person might say that the technology is still the same.
>
> But a real human will notice that it's radically different.

Yes, but you can drive the technology only up to the perceptual limits
of a human. For example, for about 6 years now all those advancements in
the graphics area have been largely uninteresting to me. I don't play
computer games. Never - they are too boring. Yet another fan in my
computer - no thanks.

> Did you mention that there are a lot more resistors in computers than
> CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
> ways - even though the amount of fundamental technology inherent in a
> modern motherboard in _just_ the passive components like the resistor
> network is way beyond what people built just a few years ago.

Well, the last real technological jump comparable to the invention of
television was actually due to the kind of CPUs which you compare to
microbes - mobiles :-). And well, I'm awaiting the day when there will
be some WinWLAN card as shoddy as those Winmodems are... Fortunately
they made 802.11b complicated enough :-) But with a crossbar switch in
place they could well make up for the latency on the main CPU... oh
fear... oh scare...
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:59 ` Martin Dalecki
@ 2002-06-20 22:18   ` Linus Torvalds
  2002-06-20 22:41     ` Martin Dalecki
  2002-06-21  7:43   ` Zwane Mwaikambo
  2002-06-21 21:02   ` Rob Landley
  2 siblings, 1 reply; 70+ messages in thread
From: Linus Torvalds @ 2002-06-20 22:18 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Yes, HT gives 12%, naive SMP gives 50%, and good SMP (aka crossbar bus)
> gives 70% for two CPUs. All those numbers are well below the level
> at which more than 2-4 CPUs would make any sense...

You don't _understand_. If it's "free", you take that 70% for the second
CPU, and the additional 20% for the next two. Don't bother repeating
yourself about Amdahl's law. Realize what Moore's law says: things get
cheaper over time. A _lot_ cheaper.

It's still a fact that people are willing to pay for performance. Even
if they strictly don't "need" it (but who are you or I to say who
"needs" performance?).

At which point it doesn't _matter_ if you only get 70% or 30% or 12%
improvement. If it's within "cheap enough", people will buy it. In fact,
once it gets "too cheap", people will buy something more expensive just
because a cheap PC obviously isn't good enough. That's _reality_.

Your "efficiency" arguments have no basis in the real life of economics
in a developing market. Only embedded people care about absolute cost
and absolute efficiencies ("it's not worth it for us to go for a more
powerful CPU, since we don't need it"). The rest of the world takes that
66MHz improvement (in a CPU that does multiple gigahertz) and is happy
about it. Or takes the added 12%, and is happy about it.

Humans are not rational creatures. We're _rationalizing_ creatures, and
we love rationalizing that big machine that just makes us feel better.

Linus

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 22:18 ` Linus Torvalds
@ 2002-06-20 22:41   ` Martin Dalecki
  2002-06-21  0:09     ` Allen Campbell
  0 siblings, 1 reply; 70+ messages in thread
From: Martin Dalecki @ 2002-06-20 22:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

Linus Torvalds wrote:
> At which point it doesn't _matter_ if you only get 70% or 30% or 12%
> improvement. If it's within "cheap enough", people will buy it. In fact,
> once it gets "too cheap", people will buy something more expensive just
> because a cheap PC obviously isn't good enough. That's _reality_.
>
> Your "efficiency" arguments have no basis in the real life of economics
> in a developing market. Only embedded people care about absolute cost
> and absolute efficiencies ("it's not worth it for us to go for a more
> powerful CPU, since we don't need it"). The rest of the world takes that
> 66MHz improvement (in a CPU that does multiple gigahertz) and is happy
> about it. Or takes the added 12%, and is happy about it.

You don't read economics papers, do you? Or what is it with this
plummeting server/PC market around us? Or the increased notebook sales?
(A typical market saturation symptom, like the second car for the
family :-). I suggest it's precisely the end of the open invention curve
out there:

1. Nowadays the CPUs are indeed good enough for most of the common
tasks. WindowsXP tries hard to help overcome this :-). But in reality
Win2000 is just fine for office work.

2. The technology in question is starting to hit real physical barriers,
because it appears more and more that not everything coming out of the
labs can be implemented at reasonable cost.

> Humans are not rational creatures. We're _rationalizing_ creatures, and
> we love rationalizing that big machine that just makes us feel better.
Perhaps it's just still too deep into my brain that the overwhelming
part of the PC market is still determined by corporate buyers (70%). And
they look for efficiency (well, within wide boundaries :-). There is,
for example, not much of a rush to upgrade from NT 4.0 or Win2000 to
WindowsXP. Not only due to "political" reasons, but because a normal PC
from a few years ago still does the job for office productivity. Quite a
way from the days of yearly upgrades all around the office :-)...

And finally, the whole thing driving the movement behind S/390 boxen
running Linux OS instances is consolidation and costs too...

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 22:41 ` Martin Dalecki
@ 2002-06-21  0:09   ` Allen Campbell
  0 siblings, 0 replies; 70+ messages in thread
From: Allen Campbell @ 2002-06-21 0:09 UTC (permalink / raw)
To: Martin Dalecki; +Cc: linux-kernel

> Perhaps it's just still too deep into my brain that
> the overwhelming part of the PC market is still determined
> by corporate buyers (70%). And they look for efficiency (well within
> wide boundaries :-).

Most of those buyers care about cost efficiency, not design efficiency.
If a 4-way Dell can just match a 2-way Sun, and for half the cost, guess
who gets the sale. Doesn't matter if it's "naive" SMP or a beautiful
cross-bar design, blessed by MIT. Yes, it's ugly. Sure, it would be nice
if everyone loved computing so much that they actually cared enough to
make the distinction. They don't. Get over it.

As long as Linux is true to the market it will thrive. The moment the
motivation becomes someone's pedantic notion of "purity", it's gone. I
believe Linus understands this, and I'm thankful. I'm guessing that gift
of understanding comes from a time when a certain programmer couldn't
afford to pay for the elegance that was offered at the time.

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:59 ` Martin Dalecki
  2002-06-20 22:18   ` Linus Torvalds
@ 2002-06-21  7:43   ` Zwane Mwaikambo
  2002-06-21 21:02   ` Rob Landley
  2 siblings, 0 replies; 70+ messages in thread
From: Zwane Mwaikambo @ 2002-06-21 7:43 UTC (permalink / raw)
To: Martin Dalecki
Cc: Linus Torvalds, Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thu, 20 Jun 2002, Martin Dalecki wrote:

> > When you integrate multiple CPU's on one standard die (either HT or
> > real CPU's), the same thing happens.
>
> Again: HT is still only one CPU. You are too software-centric :-).

Can't help it... Remember i386/i387?

-- 
http://function.linuxpower.ca

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:59 ` Martin Dalecki
  2002-06-20 22:18   ` Linus Torvalds
  2002-06-21  7:43   ` Zwane Mwaikambo
@ 2002-06-21 21:02   ` Rob Landley
  2 siblings, 0 replies; 70+ messages in thread
From: Rob Landley @ 2002-06-21 21:02 UTC (permalink / raw)
To: Martin Dalecki, Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thursday 20 June 2002 05:59 pm, Martin Dalecki wrote:

> Well, but this still doesn't make SMP magically scale
> better. HT gives you about a 12% increase in throughput on average.
> This will hardly improve your MP3 ripping experience :-).

HT is currently sopping up the idle time on the second and third
execution core in the processor, and the fact that the processor before
HT only had as many cores as it could at least sometimes use means that
these execution cores aren't always idle. That said, there's nothing to
stop them from adding a fourth execution core to the die and getting a
25% boost, and then a fifth core and getting a little boost from that
too. (And when you add the sixth core, teach the processor about the
concept of a third thread, at which point you just write an instruction
dispatcher feeding an arbitrary number of thread instruction streams
into an arbitrary number of execution cores, and then add cores to your
heart's content until you start having numa problems in your L1
cache... :)

By the way, your mp3 ripping experience is largely about latency, which
HT does help. (Realtime is all about getting a tiny amount of work done
NOW, rather than a lot of work done after a significant fraction of a
second scheduling delay.) As long as ripping and playback don't skip,
processes that can be batched aren't really the problem. (Suck this CD
dry, crunch it to files in this directory, I'm going to answer email in
the meantime.)
> > When you integrate multiple CPU's on one standard die (either HT or
> > real CPU's), the same thing happens.
>
> Again: HT is still only one CPU. You are too software-centric :-).

It's a CPU that literally can advance two processes at once. Not "time
slice, time slice, time slice" with evil context switches in between
trashing your cache, but actual parallel processing. My understanding is
that with HT turned on, one of your three execution cores is devoted to
each thread, and they get to fight over who gets to use the third each
clock cycle. So you get to queue up DMA for that screaming scsi card
without waiting for your other system call to exit its critical region.
Hence the latency picture is REALLY NICE...

> However, crossbar switches do indeed allow for up to 64 CPUs, and more
> importantly, they're the first step in a long time toward better
> overall system throughput. However, they will still not be anywhere
> near a commodity - too much heat for the foreseeable future.

If you can do 8-way SMP/SMT on a chip (does SMT with twice as many
execution cores as threads count as "real" SMP to you?), and then you
fit that in an 8-way motherboard, boom: you have 64-way. Without really
needing crossbar switches if you don't want to go that way...

Sooner or later they'll just have an arbitrary execution core scheduler,
and they won't have a fixed ratio of threads to cores; you'll just feed
the chip what you've got and it'll power down any cores that aren't in
use this clock cycle. I can easily see transmeta scaling code morphing
up to dozens or even hundreds of execution cores in that case... That's
a few years in the future, though.

> > A 3D tech person might say that the technology is still the same.
> >
> > But a real human will notice that it's radically different.
>
> Yes, but you can drive the technology only up to the perceptual limits
> of a human. For example, for about 6 years now all those advancements
> in the graphics area have been largely uninteresting to me.
> I don't
> play computer games. Never - they are too boring. Yet another
> fan in my computer - no thanks.

"It doesn't interest me, so it's not interesting" is not a good
argument. But the fact is that the human visual perception threshold has
long been reported to be 80 million triangles per second, and we're
approaching the ability to do that in real time with commodity
off-the-shelf video cards. (Another two or three generations of Moore's
law and we WON'T be able to see the difference...) That is a point.

> Well, the last real technological jump comparable to the invention
> of television was actually due to the kind of CPUs which you
> compare to microbes - mobiles :-). And well, I'm awaiting the
> day when there will be some WinWLAN card as shoddy as those Win
> modems are... Fortunately they made 802.11b complicated enough :-)
> But with a crossbar switch in place they could well make up for
> the latency on the main CPU... oh fear... oh scare...

The latency in the cat 5 dwarfs any latency you're going to have on the
motherboard, and that's something they deal with by just making gigabit
and higher synchronous. No reason you can't have a win-ethernet card,
except that 100baseT is now $4.50 on a card (and a lot less as a chip on
the motherboard, and that's just a licensing cost, the IC is pennies),
and your "last mile" cable modem or DSL still isn't maxing out the ten
megabit ethernet connection you're really hooking up to the internet
through... There's no excess cost to squeeze out of here by going to a
DSP...

Rob

^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-20 21:27 ` Martin Dalecki
  2002-06-20 21:37   ` Linus Torvalds
@ 2002-06-21 20:38   ` Rob Landley
  1 sibling, 0 replies; 70+ messages in thread
From: Rob Landley @ 2002-06-21 20:38 UTC (permalink / raw)
To: Martin Dalecki, Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List

On Thursday 20 June 2002 05:27 pm, Martin Dalecki wrote:
> Linus Torvalds wrote:
> > On Thu, 20 Jun 2002, Martin Dalecki wrote:
> >> 2. See 1: even dual CPU machines are a rarity even *now*.
> >
> > With stuff like HT, you may well not be able to _buy_ an intel desktop
> > machine with just "one" CPU.
>
> Linus, you forget one simple fact - an HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> superscalar pipelines. And it's only slightly better.
> Just another way of increasing the fill rate of the pipelines
> for some specific tasks.

Wrong.

RISC let you have two execution cores dispatching instructions in
parallel. (Two instructions per clock). AMD expanded this to three
execution cores in the Athlon with clever and insanely complex
cisc->risc translation and pipeline organizing circuitry. Intel couldn't
match that (at first) and went to VLIW, hence itanic.

VLIW/EPIC was an attempt to figure out how to keep more execution cores
busy without having each one know what the other ones are doing, and
searching for parallelism in a single instruction stream. Offload the
parallelism-finding work onto the compiler, batch the resulting
instructions together in groups, and explicitly feed an instruction to
each execution core, each clock cycle. If there's nothing for it to do,
feed it a NOP. That way you can have three execution cores (getting
three instructions per clock), and you can even do four or five or six
cores receiving big batches of parallel instructions and executing the
whole mess each clock cycle in parallel.
Of course the real bottleneck in a processor that's clock multiplied by
a factor of 20 relative to the motherboard it sits in is the memory bus
speed, and L1 cache size (since it's up to 20x slower when it hits the
edge of the cache), and VLIW makes the memory bus MORE of a bottleneck,
so the resulting performance sucks tremendously. Oops. Back to the
drawing board. (R.I.P. itanium, modulo intel's marketing budget...)

Hyper-threading is another way to keep extra execution cores busy: teach
the chip about processes and dole the execution cores out to each
process depending on how many they can use. (One, two, or three,
depending on how parallel the next few instructions in the thread are.)
Of course each thread needs its own register profile, but register
renaming for speculative execution is way more complicated than that.
And you need to teach the MMU how to look at more than one set of page
tables at a time, but that's doable too.

Putting full-blown SMP on a chip means you're duplicating all sorts of
circuitry: your L1 cache, your bus interface logic, etc. SMT is
basically SMP on a chip that shares the L1 cache, AND gives you an
excuse to EXPAND it (they've got the transistor budget: Xeons have a
megabyte or more of L2 cache, there's just a case of diminishing
returns. Now, they get to spend the transistors for a larger cache and
actually have it MEAN something.)

And yes, you could go beyond three execution cores with SMT. You could
go to five or six execution cores, and have three threads of execution
if you really wanted to. The design gets a little more complicated, but
not really all that much, since the purpose is to SEPARATE what the
threads are doing, as opposed to the traditional "is core #1 going to
interfere with what core #2 is doing"? You may wind up designing a
full-blown instruction scheduler, but if that's too complex you could
always put it in software and call it code morphing II.
:) We've had a variant of multiprocessing on a chip since the original pentium, we just called it pipelining. Saying SMT is not "true SMP" is splitting hairs, and an attempt to win an argument by redefining the words used in the original statement. (I wasn't wrong: that color's not blue!) > > Get with the flow. The old Windows codebase is dead as far as new > > machines are concerned, which means that there is no reason to hold back > > any more: all OS's support SMP. > > > >>3. Nobody needs them for the usual tasks they are a *waste* > >>of resources and economics still applies. > > > > That's a load of bull. > > Did I mention that ARMs are the most sold CPUs out there? So they finally passed the enormous installed base of Z80's in traffic lights, elevators, and microwaves? Bully for them. What USE this information is remains an open question. > For the usual task of controlling just the fuel level of the motor > or the like one CPU is fine. For the other usual > tasks - well dissect a PCMCIA WLAN card or some reasonably fast > ethernet card or some hard disk. You will find tons of > independent CPUs in your system... but they are hardly SMP > connected. For the other usual task my single Athlon is > just fine. And the Z80 hooked up to an S100 bus running CP/M shall always rule forever and ever hallelujah amen. Case dismissed. > > The number of people doing things like mp3 ripping is apparently quite > > high. And it's definitely CPU-intensive. > > > > Now, I suspect that past two CPU's you won't find much added oomph, but > > Well on intel two CPUs give you about 1.5 times the horsepower of > a single CPU. On good SMP systems it's about 1.7. Intel's traditional way of doing SMP sucks (the memory bus is STILL the main bottleneck to performance: let's share it!), and most PC OSes have traditionally had mondo lock contention doing even simple things. Okay. So? > > Intel made SMP cheap by putting all the glue logic on-chip and in the > > standard chipsets.
> > Not if I look out to buy a real SMP board. Again with the "the PC isn't a real computer" line of argument... > They are still > very expensive in comparison to normal boards. However > indeed they are nowadays affordable. A year and a half ago I worked at the company that prototyped the first dual Athlon board (Boxxtech: tyan owed them a favor). Intel was never interested in bringing out a dual celeron motherboard (the first celerons were so cache-crippled trying to SMP them was just painful). They ONLY wanted to do SMP at the high end, and as processors came down in price they yanked the SMP support circuitry. Add in the fact that the Intel SMP bus still sucks tremendously and the dominant OS through windows 98 couldn't even understand two graphics cards (and often got confused by two NETWORK cards) and we're not talking a recipe for widespread adoption here... > > And besides, you don't actually need to _scale_ well, if the actual > > incremental costs are low. That's the whole point with the P4-HT, of > > course. Intel claims 5% die area addition for a 30% scaling. They may be > > The 30% - I never saw it in the intel paper. I remember they talk > about 20% + something. And 30% is a *peak* value. Sure. Keeping that third execution core busy 24/7. On the rare instances their pipeline organizer can devote that third execution core to advancing the first process, preventing it from doing so is slowing that first process down by repurposing a resource that would NOT otherwise have been wasted. (Minus 3% performance penalty for extra cache thrashing and memory bus contention.) Now add a FOURTH execution core to the chip, bump the L1 cache size a bit, and watch performance go up 25%... I am REALLY waiting for AMD to start doing this. We've been waiting for "smp on a chip" (outside of PPC) for years, without ever explaining what the advantage was of giving each one its own bus interface unit and L1 cache... > The paper in question talks about 12% on average.
Awful much for > 5% die area (2.4 factor win), esp. if you look at the constant > increase of die area of CPUs in comparison to the speed, factoring out > the scaling of the production process. If one factors out > the production process scale, modern CPUs are wasting transistors like > no good in comparison to their older siblings. (Remember the 8088 was > just about 29k transistors and not 140M!). > But it's not much in absolute numbers... Yeah. It's called "a good idea" instead of brute force throwing transistors at the problem. Even Intel's allowed to have the occasional good idea. (After itanium they're certainly due for one!) > > full of sh*t, of course, and it may be that the added complexity in the > > control logic hurts them in other areas (longer pipeline, whatever), but > > the point is that if it's cheap, the second CPU doesn't have to "scale". > > The main hurting point is the quadrupling of the correctness testing > effort. Longer pipelines - I hardly think so. The synchronization > infrastructure for out of order execution was already there in the last CPU > generation. This is the reason why it's so cheap in terms of die real estate to > add it now. In theory they might even be able to get rid of some of it, as long as they can keep all their execution cores busy 99% of the time without it. (Picking three simultaneously runnable instructions from two different threads of execution is a fundamentally easier problem than consistently picking even two instructions from one thread.) And it's a far cry from the itanium's way of handling branch prediction to keep the cores busy. (Execute BOTH forks and throw away the one we don't take! Yeah, that'll guarantee we waste work so we LOOK busy, but don't actually run noticeably faster! Brilliant! (What, is the goal to make the chip run hot? A 95% prediction rate isn't enough for you, and you're STILL going to stall the pipeline when you hit the edge of the L1 cache anyway...)) > BTW.
Them pulling this trick shows nicely that we are now at a point > where there will be hardly any increase in the deployment of micro-scale > parallelism in CPU design nowadays... Famous last words... > And not just on behalf of > the CPU - even more importantly you could read it as a public admission of the > fact that we are near the end of static optimizations by improvements in > compiler technology as well. Oh the compiler people promise miracles > constantly since the first days of pipelines of course... Trust me: GCC 3.x can still be seriously improved upon. > In view of this I would love to see how they intend > to HT the VLSI design of the Itanic :-). Well, the rumors are that Intel is going to bury iTanic in a sea trench and license x86-64. AMD has confirmed that intel licensed the rights to the x86-64 instruction set, and intel's prototype is apparently called yamhill: http://www.matrixlist.com/pipermail/pc_support/2002-May/001416.html Whether or not AMD got a license to the inevitable hyper-threading patents in return, I have no idea. (If AMD would just buy transmeta and be done with it, I'd feel more comfortable predicting them. I have friends who work there, that rumor mill's bandwidth is full of the trouble they're having with absolutely sucky motherboard chipsets and nvidia writing out-of-spec graphics cards that the chipsets are actually designed to compensate for, and as such wind up screwing up other things by being out of spec. Or something like that, that's the trouble with rumors, details get mangled...) Rob ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 20:40 ` Martin Dalecki 2002-06-20 20:53 ` Linus Torvalds @ 2002-06-20 21:13 ` Timothy D. Witham 2002-06-21 19:53 ` Rob Landley 2 siblings, 0 replies; 70+ messages in thread From: Timothy D. Witham @ 2002-06-20 21:13 UTC (permalink / raw) To: Martin Dalecki Cc: Cort Dougan, Eric W. Biederman, Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Thu, 2002-06-20 at 13:40, Martin Dalecki wrote: > Użytkownik Cort Dougan napisał: > > "Beating the SMP horse to death" does make sense for 2 processor SMP > > machines. When 64 processor machines become commodity (Linux is a > > commodity hardware OS) something will have to be done. When research > > > 8. Amdahl's law is math and not a decree from the Central Committee of > the Communist Party or George Bush. You cannot overrule it. > Boy, I haven't been beat up by Amdahl's law for at least 10 years. :-) A point to mention is that Amdahl's law also applies to scaling on clusters. Same issues as SMP as far as application scalability is concerned. But the point is that there are a whole bunch of applications where the serial portion can be reduced to such a small amount that they can benefit from lots of CPUs. > One exception could be dedicated rendering CPUs - which is the > direction where graphics cards are apparently heading - but they > will hardly ever need a general purpose operating system. But even then - > I'm still in the bunch of people who are not interested > in any OpenGL or Direct whatever... The worst graphics cards > these days drive my display screens at the resolutions I wish them to > just fine. > > PS. I'm sick of seeing bunches of PC's which are accidentally in > the same room nowadays in the list of the 500 fastest computers > in the world. It makes this list useless... > > If one wants to have a grasp of how the next generation of > really fast computers will look. Well: they will be based > on Josephson junctions.
TRW will build them (same company > as the Voyager probe). Look there - they don't plan for thousands of CPUs, > they plan for a few CPUs in liquid helium: > > http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2 > > You know there used to be a whole bunch of companies doing this sort of work and they all went out of business because people could build a cluster out of off-the-shelf parts for 1/10 of the cost and get good enough performance. ETA, CDC, the old Cray - the list goes on. All gone from the CPU business because good enough, cheap enough wins every time. Tim > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Timothy D. Witham - Lab Director - wookie@osdlab.org Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office) (503)-702-2871 (cell) (503)-626-2436 (fax) ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 20:40 ` Martin Dalecki 2002-06-20 20:53 ` Linus Torvalds 2002-06-20 21:13 ` Timothy D. Witham @ 2002-06-21 19:53 ` Rob Landley 2 siblings, 0 replies; 70+ messages in thread From: Rob Landley @ 2002-06-21 19:53 UTC (permalink / raw) To: Martin Dalecki, Cort Dougan Cc: Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List On Thursday 20 June 2002 04:40 pm, Martin Dalecki wrote: > Użytkownik Cort Dougan napisał: > > "Beating the SMP horse to death" does make sense for 2 processor SMP > > machines. When 64 processor machines become commodity (Linux is a > > commodity hardware OS) something will have to be done. When research > > 64 processor machines will *never* become a commodity because: > > 1. It's not like parallel machines are something entirely new. They have been > around for an awful long time on this planet. (nearly longer than myself) > > 2. See 1. even dual CPU machines are a rarity even *now*. DOS was a reverse engineered clone of CP/M with some unix features bolted on in the early 80's. Dos couldn't multitask on a single CPU. Dos couldn't handle more than one video card. DOS could barely keep track of more than one hard drive. Windows 3.1 through Windows 98 (and bill gates' 1/8 scale clone wini-me) were based on DOS, they couldn't take advantage of SMP if their life depended on it. NT through 4.0 had a market share dwarfed by the macintosh. > 3. Nobody needs them for the usual tasks they are a *waste* > of resources and economics still applies. Until moore's law hits atomic resolution, sure. How long that will take is hotly debated... > 4. SMP doesn't scale behind 4. Point. (64 hardly makes sense...) Actually it does, just not with Intel's brain dead memory bus architecture. EV6 goes to 32-way pretty well. The question is, at what point is it cheaper to just go to NUMA or clusters. (And at what point do your trace lengths get long enough that SMP starts acting like NUMA.
And at what point do your cluster interconnects get fast enough that something like mosix starts acting like numa?) And the REALLY interesting advance is SMT (hyper-threading), rather than SMP. How do you go beyond the athlon's three execution cores without running out of parallel instructions to feed them? Simple, teach the chip about processes, so it can advance multiple points of execution to keep the cores fed. This lets you throw a higher transistor budget at the L1 and L2 caches without encountering diminishing returns as well. It's pretty straightforward, and at the very least allows dispatching interrupts in parallel and lets your GUI overlap drawing on the screen with the processing to figure out what goes on the screen. Between the two of them, even X11 might finally give me smooth mouse scrolling, one of these days... :) SMP on a chip really is overkill. Why give the multiple processors their own cache and memory bus interface? Waste of transistors, power, heat, etc... SMT is minimalist SMP on a chip... > 5. It will never become a commodity to run highly transactional > workloads where integrated bunches of 4 make sense. Neither will > it be common to solve partial differential equations for aeroplane > dynamics or to calculate the behaviour of a hydrogen bomb. No, but it will be common to display bidirectional MP4 compressed video through an encrypted link, with sound, quite possibly in a window while you do other stuff with the machine. And some day voice recognition may actually replace "the clapper" to turn your light off when you get into bed at night... > One exception could be dedicated rendering CPUs - which is the > direction where graphics cards are apparently heading - but they "heading"? Headed. (What did you think your 3D accelerator card was?) > PS. I'm sick of seeing bunches of PC's which are accidentally in > the same room nowadays in the list of the 500 fastest computers > in the world. It makes this list useless...
It shows who has money to throw at the problem, and approximately how much, which is all it ever really showed... > If one wants to have a grasp of how the next generation of > really fast computers will look. Well: they will be based > on Josephson junctions. TRW will build them (same company > as the Voyager probe). Look there - they don't plan for thousands of CPUs, > they plan for a few CPUs in liquid helium: > > http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper >.pdf&DIR=2 And cray bathed their circuitry in Fluorinert decades ago. Liquid Helium ain't winding up on my desktop any time soon, and my laptop outperforms a cray-1, and I use it for a dozen variations of text editing (coding, email...) Not interesting. Rob ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-20 16:30 ` Cort Dougan ` (2 preceding siblings ...) 2002-06-20 20:40 ` Martin Dalecki @ 2002-06-21 5:34 ` Eric W. Biederman 3 siblings, 0 replies; 70+ messages in thread From: Eric W. Biederman @ 2002-06-21 5:34 UTC (permalink / raw) To: Cort Dougan Cc: Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love, Linux Kernel Mailing List Cort Dougan <cort@fsmlabs.com> writes: > "Beating the SMP horse to death" does make sense for 2 processor SMP > machines. When 64 processor machines become commodity (Linux is a > commodity hardware OS) something will have to be done. When research > groups put Linux on 1k processors - it's an experiment. I don't think they > have much right to complain that Linux doesn't scale up to that level - > it's not designed to. > > That being said, large clusters are an interesting research area but it is > _not_ a failing of Linux that it doesn't scale to them. Linux in a classic beowulf configuration scales just fine. To be clear, I am talking about a batch scheduling system, where jobs which run for hours at a time and on many nodes, possibly the entire cluster at once, are scheduled on some number of commodity systems with a good network interconnect. The concern now is not does it work, or does it work well, but can it be made more convenient to use. Eric ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 21:08 ` Cort Dougan 2002-06-18 21:47 ` Linus Torvalds @ 2002-06-19 10:21 ` Padraig Brady 1 sibling, 0 replies; 70+ messages in thread From: Padraig Brady @ 2002-06-19 10:21 UTC (permalink / raw) To: Cort Dougan; +Cc: Benjamin LaHaise, Linux Kernel Mailing List Cort Dougan wrote: > I agree with you there. It's not easy, and I'd claim it's not possible > given that no-one has done it yet, to have a select() call that is speedy > for both 0-10 and 1k file descriptors. Have you noticed yesterday's + today's fixup patch from Andi Kleen: http://marc.theaimsgroup.com/?l=linux-kernel&m=102446644619648&w=2 Padraig. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 21:12 ` Benjamin LaHaise 2002-06-18 21:08 ` Cort Dougan @ 2002-06-18 21:45 ` Bill Huey 1 sibling, 0 replies; 70+ messages in thread From: Bill Huey @ 2002-06-18 21:45 UTC (permalink / raw) To: Benjamin LaHaise Cc: Linus Torvalds, Rusty Russell, Robert Love, Linux Kernel Mailing List, Bill Huey On Tue, Jun 18, 2002 at 05:12:00PM -0400, Benjamin LaHaise wrote: > connections or interactive (like in the real world). I've benchmarked > it -- we should really include something like /dev/epoll in the kernel > to improve this case. Heh, try kqueue(). ;) It's a pretty workable API and there seems to be a lot of momentum in the BSDs (Darwin, FreeBSD) for it. bill ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 20:31 ` Rusty Russell 2002-06-18 20:41 ` Linus Torvalds @ 2002-06-18 20:55 ` Robert Love 2002-06-19 13:31 ` Rusty Russell 1 sibling, 1 reply; 70+ messages in thread From: Robert Love @ 2002-06-18 20:55 UTC (permalink / raw) To: Rusty Russell; +Cc: Linus Torvalds, Linux Kernel Mailing List On Tue, 2002-06-18 at 13:31, Rusty Russell wrote: > No, you have accepted a non-portable userspace interface and put it in > generic code. THAT is idiotic. > > So any program that doesn't use the following is broken: On top of what Linus replied, there is the issue that if your task does not know how many CPUs can be in the system then setting its affinity is worthless in 90% of the cases. I.e., everyone today can write code like sched_setaffinity(0, sizeof(unsigned long), &mask) but let's say this code is executed on a system with a different number of bits in the CPU mask. What do you do with the new/old bits? Ignore them? Set new ones to zero? To 1? Summarily, setting CPU affinity is something that is naturally low-level enough that it only makes sense when you know what you are setting and not setting. While a mask of -1 may always make sense, random bitmaps (think RT stuff here) are explicit for the number of CPUs given. The interface is designed to make this as easy and clean as possible - i.e., the size check, etc. Robert Love ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 20:55 ` Robert Love @ 2002-06-19 13:31 ` Rusty Russell 0 siblings, 0 replies; 70+ messages in thread From: Rusty Russell @ 2002-06-19 13:31 UTC (permalink / raw) To: Robert Love; +Cc: Linus Torvalds, Linux Kernel Mailing List In message <1024433739.922.236.camel@sinai> you write: > On Tue, 2002-06-18 at 13:31, Rusty Russell wrote: > > > No, you have accepted a non-portable userspace interface and put it in > > generic code. THAT is idiotic. > > > > So any program that doesn't use the following is broken: > > On top of what Linus replied, there is the issue that if your task does > not know how many CPUs can be in the system then setting its affinity is > worthless in 90% of the cases. No. You can read the cpus out of /proc/cpuinfo, and say "I want to be on <some cpu I found>" or "I want one copy for each processor", or even "I want every processor but the one the other task just bound to". This is 99% of actual usage. But I can see the man page now: The third arg to set/getaffinity is the size of a kernel data structure. There is no way to know this size: it is dependent on architecture and kernel configuration. You can pass a larger data structure and the higher bits are ignored: try 1024? > I.e., everyone today can write code like > > sched_setaffinity(0, sizeof(unsigned long), &mask) NO THEY CAN'T. How will ia64 deal with this in ia32 binaries? How will Sparc64 deal with this in 32-bit binaries? How will PPC64 deal with this in PPC32 binaries? How will x86_64 deal with this in x86 binaries? They'll have to either break compatibility, or guess and fill accordingly. And when new CPUs come online? At the moment you effectively zero-fill, because you can't tell what you're supposed to do here. So you can never truly reset your affinity once it's set. > Summarily, setting CPU affinity is something that is naturally low-level
While a mask of -1 may always make sense, random bitmaps > (think RT stuff here) are explicit for the number of CPUs given. You've designed an interface where the easiest thing to do is the wrong thing (as per your example). This is the hallmark of bad design. *If* there had been a way to tell the bitmask size which was introduced at the same time, it might have been acceptable. But there isn't at the moment, so people are writing bugs right now. Untested patch below, seems to compile (hard to tell since PPC is v. broken right now) Summary: 1) Easy to write portable "set this cpu" code. 2) Both system calls now handle NR_CPUS > sizeof(long)*8. 3) Things which have set affinity once can now get back on new cpus as they come up. 4) Trivial to extend for hyperthreading on a per-arch basis. Linus, think and apply, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. --- linux-2.5.22/include/linux/affinity.h Thu Jan 1 10:00:00 1970 +++ working-2.5.22-linus/include/linux/affinity.h Wed Jun 19 22:09:47 2002 @@ -0,0 +1,9 @@ +#ifndef _LINUX_AFFINITY_H +#define _LINUX_AFFINITY_H +enum { + /* Set affinity to these processors */ + LINUX_AFFINITY_INCLUDE, + /* Set affinity to all *but* these processors */ + LINUX_AFFINITY_EXCLUDE, +}; +#endif --- working-2.5.22-linus/kernel/sched.c.~1~ Tue Jun 18 23:48:03 2002 +++ working-2.5.22-linus/kernel/sched.c Wed Jun 19 23:28:32 2002 @@ -26,6 +26,7 @@ #include <linux/interrupt.h> #include <linux/completion.h> #include <linux/kernel_stat.h> +#include <linux/affinity.h> /* * Convert user-nice values [ -20 ... 0 ... 19 ] @@ -1309,25 +1310,57 @@ /** * sys_sched_setaffinity - set the cpu affinity of a process * @pid: pid of the process + * @include: is this include or exclude?
* @len: length in bytes of the bitmask pointed to by user_mask_ptr - * @user_mask_ptr: user-space pointer to the new cpu mask + * @user_mask_ptr: user-space pointer to bitmask of cpus to include/exclude */ -asmlinkage int sys_sched_setaffinity(pid_t pid, unsigned int len, - unsigned long *user_mask_ptr) +asmlinkage int sys_sched_setaffinity(pid_t pid, + int include, + unsigned int len, + unsigned char *user_mask_ptr) { - unsigned long new_mask; + bitmap_member(new_mask, NR_CPUS); task_t *p; int retval; + unsigned int i; - if (len < sizeof(new_mask)) - return -EINVAL; - - if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask))) + memset(new_mask, 0x00, sizeof(new_mask)); + if (copy_from_user(new_mask, user_mask_ptr, + min((size_t)len, sizeof(new_mask)))) return -EFAULT; - new_mask &= cpu_online_map; - if (!new_mask) + /* longer is OK, as long as they don't actually set any of the bits. */ + if (len > sizeof(new_mask)) { + unsigned char c; + for (i = sizeof(new_mask); i < len; i++) { + if (get_user(c, user_mask_ptr+i)) + return -EFAULT; + if (c != 0) + return -ENOENT; + } + } + + /* Check for cpus that aren't online/don't exist */ + for (i = 0; i < ARRAY_SIZE(new_mask) * 8; i++) { + if (i >= NR_CPUS || !cpu_online(i)) { + if (test_bit(i, new_mask)) + return -ENOENT; + } + } + + /* Invert the mask in the exclude case. */ + if (include == LINUX_AFFINITY_EXCLUDE) { + for (i = 0; i < ARRAY_SIZE(new_mask); i++) + new_mask[i] = ~new_mask[i]; + } else if (include != LINUX_AFFINITY_INCLUDE) { return -EINVAL; + } + + /* The new mask must mention some online cpus */ + for (i = 0; !cpu_online(i) || !test_bit(i, new_mask); i++) + if (i == NR_CPUS-1) + /* This is kinda true... */ + return -EWOULDBLOCK; read_lock(&tasklist_lock); @@ -1351,7 +1384,8 @@ goto out_unlock; retval = 0; - set_cpus_allowed(p, new_mask); + /* FIXME: set_cpus_allowed should take an array...
*/ + set_cpus_allowed(p, new_mask[0]); out_unlock: put_task_struct(p); @@ -1363,37 +1397,27 @@ * @pid: pid of the process * @len: length in bytes of the bitmask pointed to by user_mask_ptr * @user_mask_ptr: user-space pointer to hold the current cpu mask + * Returns the size that is required to hold the complete cpu mask. */ asmlinkage int sys_sched_getaffinity(pid_t pid, unsigned int len, - unsigned long *user_mask_ptr) + void *user_mask_ptr) { - unsigned long mask; - unsigned int real_len; + bitmap_member(mask, NR_CPUS) = { 0 }; task_t *p; - int retval; - - real_len = sizeof(mask); - - if (len < real_len) - return -EINVAL; read_lock(&tasklist_lock); - - retval = -ESRCH; p = find_process_by_pid(pid); - if (!p) - goto out_unlock; - - retval = 0; - mask = p->cpus_allowed & cpu_online_map; - -out_unlock: + if (!p) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + memcpy(mask, &p->cpus_allowed, sizeof(p->cpus_allowed)); read_unlock(&tasklist_lock); - if (retval) - return retval; - if (copy_to_user(user_mask_ptr, &mask, real_len)) + + if (copy_to_user(user_mask_ptr, &mask, + min((unsigned)sizeof(p->cpus_allowed), len))) return -EFAULT; - return real_len; + return sizeof(p->cpus_allowed); } asmlinkage long sys_sched_yield(void) @@ -1727,9 +1751,11 @@ migration_req_t req; runqueue_t *rq; +#if 0 /* This is checked for userspace, and kernel shouldn't do this */ new_mask &= cpu_online_map; if (!new_mask) BUG(); +#endif preempt_disable(); rq = task_rq_lock(p, &flags); ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 18:51 ` Rusty Russell 2002-06-18 18:43 ` Zwane Mwaikambo 2002-06-18 18:56 ` Linus Torvalds @ 2002-06-18 19:29 ` Benjamin LaHaise 2002-06-18 19:19 ` Zwane Mwaikambo 2002-06-18 20:13 ` Rusty Russell 2 siblings, 2 replies; 70+ messages in thread From: Benjamin LaHaise @ 2002-06-18 19:29 UTC (permalink / raw) To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote: > You could do a loop here, but the real problem is the broken userspace > interface. Can you fix this so it takes a single CPU number please? > > ie. > /* -1 = remove affinity */ > sys_sched_setaffinity(pid_t pid, int cpu); > > This will work everywhere, and doesn't require userspace to know the > size of the cpu bitmask etc. That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or quads that share caches. -ben -- "You will be reincarnated as a toad; and you will be much happier." ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 19:29 ` Benjamin LaHaise @ 2002-06-18 19:19 ` Zwane Mwaikambo 2002-06-18 19:49 ` Benjamin LaHaise 2002-06-18 20:13 ` Rusty Russell 1 sibling, 1 reply; 70+ messages in thread From: Zwane Mwaikambo @ 2002-06-18 19:19 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List On Tue, 18 Jun 2002, Benjamin LaHaise wrote: > > /* -1 = remove affinity */ > > sys_sched_setaffinity(pid_t pid, int cpu); > > > > This will work everywhere, and doesn't require userspace to know the > > size of the cpu bitmask etc. > > That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or > quads that share caches. Hmm i don't understand, mind explaining why it wouldn't work on HT? Cheers, Zwane Mwaikambo -- http://function.linuxpower.ca ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 19:19 ` Zwane Mwaikambo @ 2002-06-18 19:49 ` Benjamin LaHaise 2002-06-18 19:27 ` Zwane Mwaikambo 0 siblings, 1 reply; 70+ messages in thread From: Benjamin LaHaise @ 2002-06-18 19:49 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List On Tue, Jun 18, 2002 at 09:19:40PM +0200, Zwane Mwaikambo wrote: > Hmm i don't understand, mind explaining why it wouldn't work on HT? On HyperThreading, you want to specify that either cpu in a pair is okay. In larger SMP machines that share a cache between 4 CPUs, the mask is likely to contain all 4 CPUs in each quad. -ben -- "You will be reincarnated as a toad; and you will be much happier." ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 19:49 ` Benjamin LaHaise @ 2002-06-18 19:27 ` Zwane Mwaikambo 0 siblings, 0 replies; 70+ messages in thread From: Zwane Mwaikambo @ 2002-06-18 19:27 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List On Tue, 18 Jun 2002, Benjamin LaHaise wrote: > On HyperThreading, you want to specify that either cpu in a pair is > okay. In larger SMP machines that share a cache between 4 CPUs, the > mask is likely to contain all 4 CPUs in each quad. Hmm so you want to apply the same 'node' principle to HT? The way HT works I can see why that would be a good idea. Node affinity on the quads makes sense, and distinguishing which cpus belong to which quads would also help for irq affinity. Thanks, Zwane Mwaikambo -- http://function.linuxpower.ca ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken 2002-06-18 19:29 ` Benjamin LaHaise 2002-06-18 19:19 ` Zwane Mwaikambo @ 2002-06-18 20:13 ` Rusty Russell 2002-06-18 20:21 ` Linus Torvalds 2002-06-18 22:03 ` Ingo Molnar 1 sibling, 2 replies; 70+ messages in thread From: Rusty Russell @ 2002-06-18 20:13 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Robert Love, torvalds, Linux Kernel Mailing List In message <20020618152949.B16091@redhat.com> you write: > On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote: > > You could do a loop here, but the real problem is the broken userspace > > interface. Can you fix this so it takes a single CPU number please? > > > > ie. > > /* -1 = remove affinity */ > > sys_sched_setaffinity(pid_t pid, int cpu); > > > > This will work everywhere, and doesn't require userspace to know the > > size of the cpu bitmask etc. > > That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or > quads that share caches. This is the NUMA "I want to be in this group" problem. If you're serious about this, you'll go for a sys_sched_groupaffinity call, or add an extra arg to sys_sched_setaffinity, or simply use the top 16 bits of the cpu arg. You will also add /proc/cpugroups or something to export this information to users so there's a point. Sorry, the current interface is insufficient for NUMA *and* is impossible[1] for the user to use correctly. Rusty. [1] Defined as "too hard for them to ever do it properly" -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: latest linus-2.5 BK broken
  2002-06-18 20:13 ` Rusty Russell
@ 2002-06-18 20:21   ` Linus Torvalds
  2002-06-18 22:03   ` Ingo Molnar
  1 sibling, 0 replies; 70+ messages in thread
From: Linus Torvalds @ 2002-06-18 20:21 UTC (permalink / raw)
To: Rusty Russell; +Cc: Benjamin LaHaise, Robert Love, Linux Kernel Mailing List

On Wed, 19 Jun 2002, Rusty Russell wrote:
> > That doesn't work.  Think of SMT CPU pairs (aka HyperThreading) or
> > quads that share caches.
>
> This is the NUMA "I want to be in this group" problem.  If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or
> add an extra arg to sys_sched_setaffinity, or simply use the top 16
> bits of the cpu arg.

Oh, yes. That makes sense. NOT.

> Sorry, the current interface is insufficient for NUMA *and* is
> impossible[1] for the user to use correctly.

Don't be silly.

Give _one_ good reason why the affinity system call cannot take a
simple bitmask? It's trivial to use, your arguments do not make any
sense.

		Linus

^ permalink raw reply	[flat|nested] 70+ messages in thread
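[Editorially, the "trivial to use" claim can be sketched from userspace. This uses the modern glibc wrapper and `cpu_set_t`, which postdate this 2.5-era thread; the raw syscall under discussion took a plain `unsigned long` bitmask, but the shape is the same: one bit per CPU, set as many as you like.]

```c
#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling process to an arbitrary set of CPUs, expressed as
 * a list of CPU numbers that we pack into a bitmask.  A sketch only:
 * bind_to_cpus() is a hypothetical helper name, not a kernel API.
 * Returns 0 on success, -1 (with errno set) on failure. */
static int bind_to_cpus(const int *cpus, int n)
{
    cpu_set_t mask;
    int i;

    CPU_ZERO(&mask);
    for (i = 0; i < n; i++)
        CPU_SET(cpus[i], &mask);    /* one bit per allowed CPU */

    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

With a bitmask, `bind_to_cpus((int[]){0, 1}, 2)` expresses "either CPU of an HT sibling pair" in one call, something a single-CPU-number interface cannot say at all.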
* Re: latest linus-2.5 BK broken
  2002-06-18 20:13 ` Rusty Russell
  2002-06-18 20:21   ` Linus Torvalds
@ 2002-06-18 22:03   ` Ingo Molnar
  1 sibling, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2002-06-18 22:03 UTC (permalink / raw)
To: Rusty Russell
Cc: Benjamin LaHaise, Robert Love, torvalds, Linux Kernel Mailing List

On Wed, 19 Jun 2002, Rusty Russell wrote:

> This is the NUMA "I want to be in this group" problem.  If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or add
> an extra arg to sys_sched_setaffinity, or simply use the top 16 bits of
> the cpu arg.

the reason why i picked a linear cpu bitmask for the first patches to
do affinity syscalls (which ultimately found their way into 2.5) was
very simple: we do *NOT* want to deal with cache hierarchies in the
kernel, at this point. enumerating CPUs and giving processes the
ability to bind themselves to an arbitrary set of CPUs is enough.

*IF* user-space wants to do more, then it can get and use whatever
NUMA information it wants. There could even be separate sets of
syscalls to get the exact CPU cache hierarchy of the system, although
that would have to be done really well to be truly generic and
long-living.

so in this case the simplest approach that scales well to a reasonable
number of CPUs (thousands, at least) won.

> You will also add /proc/cpugroups or something to export this
> information to users so there's a point.

and this might not even be enough. Cache hierarchies can be pretty
non-trivial, and it's not necessarily a distinct group of CPUs: it
could be a hierarchy of multiple levels, or it could even be an
asymmetric distribution of caches. In fact it might not even be
expressible in 'group' categories at all - caches could be
interconnected in a 2D or even 3D topology. Or multiprocessing CPUs
could have dynamic caches in the future: 'cache on demand' allocated
to a cache-happy CPU, while another CPU with a smaller working set
uses less cache space. [obviously the technology is not available
today.]

one thing i was *very* sure about: we frankly don't have the slightest
clue what the really big systems will look like in 10 or 20 years. So
hardcoding anything like 'group affinity' or some of today's NUMA
hierarchies would be pretty shortsighted. I'm convinced that the
'opaque' solution, the simple but generic setaffinity system call, is
the right choice.

	Ingo

^ permalink raw reply	[flat|nested] 70+ messages in thread
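[Editorially, the argument for an opaque flat bitmask is that userspace can encode whatever grouping it discovers on its own, with no kernel-side notion of 'groups'. A small sketch; the CPU lists below are made-up example inputs, not anything the 2.5 kernel exported.]

```c
/* Pack a userspace-discovered CPU group (an HT pair, a cache-sharing
 * quad, a NUMA node, anything) into the flat bitmask the syscall
 * takes.  The kernel never needs to know what the group "means". */
unsigned long mask_for_group(const int *cpus, int n)
{
    unsigned long mask = 0;
    int i;

    for (i = 0; i < n; i++)
        mask |= 1UL << cpus[i];     /* bit k set => CPU k allowed */
    return mask;
}
```

For example, a cache-sharing quad of CPUs 4-7 becomes the mask `0xf0`, and an HT sibling pair {0, 1} becomes `0x3`; both pass through the same interface unchanged.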
end of thread, other threads:[~2002-06-24 21:28 UTC | newest]
Thread overview: 70+ messages
-- links below jump to the message on this page --
2002-06-18 23:38 latest linus-2.5 BK broken Michael Hohnbaum
2002-06-18 23:57 ` Ingo Molnar
2002-06-19 0:08 ` Ingo Molnar
2002-06-19 1:00 ` Matthew Dobson
2002-06-19 23:48 ` Michael Hohnbaum
-- strict thread matches above, loose matches on Subject: below --
2002-06-24 21:28 Paul McKenney
2002-06-21 12:59 Jesse Pollard
2002-06-21 7:31 Martin Knoblauch
2002-06-20 23:48 Miles Lane
[not found] <E17KSLb-0007Dj-00@wagner.rustcorp.com.au>
2002-06-19 0:12 ` Linus Torvalds
2002-06-19 15:23 ` Rusty Russell
2002-06-19 16:28 ` Linus Torvalds
2002-06-19 20:57 ` Rusty Russell
2002-06-18 17:18 James Simmons
2002-06-18 17:46 ` Robert Love
2002-06-18 18:51 ` Rusty Russell
2002-06-18 18:43 ` Zwane Mwaikambo
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 18:59 ` Robert Love
2002-06-18 20:05 ` Rusty Russell
2002-06-18 20:05 ` Linus Torvalds
2002-06-18 20:31 ` Rusty Russell
2002-06-18 20:41 ` Linus Torvalds
2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:47 ` Linus Torvalds
2002-06-19 12:29 ` Eric W. Biederman
2002-06-19 17:27 ` Linus Torvalds
2002-06-20 3:57 ` Eric W. Biederman
2002-06-20 5:24 ` Larry McVoy
2002-06-20 7:26 ` Andreas Dilger
2002-06-20 14:54 ` Eric W. Biederman
2002-06-20 16:30 ` Cort Dougan
2002-06-20 17:15 ` Linus Torvalds
2002-06-21 6:15 ` Eric W. Biederman
2002-06-21 17:50 ` Larry McVoy
2002-06-21 17:55 ` Robert Love
2002-06-22 18:25 ` Eric W. Biederman
2002-06-22 19:26 ` Larry McVoy
2002-06-22 22:25 ` Eric W. Biederman
2002-06-22 23:10 ` Larry McVoy
2002-06-23 6:34 ` William Lee Irwin III
2002-06-23 22:56 ` Kai Henningsen
2002-06-20 17:16 ` RW Hawkins
2002-06-20 17:23 ` Cort Dougan
2002-06-20 20:40 ` Martin Dalecki
2002-06-20 20:53 ` Linus Torvalds
2002-06-20 21:27 ` Martin Dalecki
2002-06-20 21:37 ` Linus Torvalds
2002-06-20 21:59 ` Martin Dalecki
2002-06-20 22:18 ` Linus Torvalds
2002-06-20 22:41 ` Martin Dalecki
2002-06-21 0:09 ` Allen Campbell
2002-06-21 7:43 ` Zwane Mwaikambo
2002-06-21 21:02 ` Rob Landley
2002-06-21 20:38 ` Rob Landley
2002-06-20 21:13 ` Timothy D. Witham
2002-06-21 19:53 ` Rob Landley
2002-06-21 5:34 ` Eric W. Biederman
2002-06-19 10:21 ` Padraig Brady
2002-06-18 21:45 ` Bill Huey
2002-06-18 20:55 ` Robert Love
2002-06-19 13:31 ` Rusty Russell
2002-06-18 19:29 ` Benjamin LaHaise
2002-06-18 19:19 ` Zwane Mwaikambo
2002-06-18 19:49 ` Benjamin LaHaise
2002-06-18 19:27 ` Zwane Mwaikambo
2002-06-18 20:13 ` Rusty Russell
2002-06-18 20:21 ` Linus Torvalds
2002-06-18 22:03 ` Ingo Molnar