* latest linus-2.5 BK broken
@ 2002-06-18 17:18 James Simmons
2002-06-18 17:46 ` Robert Love
0 siblings, 1 reply; 97+ messages in thread
From: James Simmons @ 2002-06-18 17:18 UTC (permalink / raw)
To: Linux Kernel Mailing List
gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c
sched.c: In function `sys_sched_setaffinity':
sched.c:1329: `cpu_online_map' undeclared (first use in this function)
sched.c:1329: (Each undeclared identifier is reported only once
sched.c:1329: for each function it appears in.)
sched.c: In function `sys_sched_getaffinity':
sched.c:1389: `cpu_online_map' undeclared (first use in this function)
make[1]: *** [sched.o] Error 1
. ---
|o_o |
|:_/ | Give Micro$oft the Bird!!!!
// \ \ Use Linux!!!!
(| | )
/'\_ _/`\
\___)=(___/
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 17:18 latest linus-2.5 BK broken James Simmons
@ 2002-06-18 17:46 ` Robert Love
2002-06-18 18:51 ` Rusty Russell
0 siblings, 1 reply; 97+ messages in thread
From: Robert Love @ 2002-06-18 17:46 UTC (permalink / raw)
To: James Simmons; +Cc: Linux Kernel Mailing List, rusty
On Tue, 2002-06-18 at 10:18, James Simmons wrote:
> gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c
> sched.c: In function `sys_sched_setaffinity':
> sched.c:1329: `cpu_online_map' undeclared (first use in this function)
> sched.c:1329: (Each undeclared identifier is reported only once
> sched.c:1329: for each function it appears in.)
> sched.c: In function `sys_sched_getaffinity':
> sched.c:1389: `cpu_online_map' undeclared (first use in this function)
> make[1]: *** [sched.o] Error 1
Rusty, I assume this is a side-effect of the hotplug merge?
Can you fix this or tell me what is the new equivalent of
cpu_online_map?
Robert Love
* Re: latest linus-2.5 BK broken
2002-06-18 18:51 ` Rusty Russell
@ 2002-06-18 18:43 ` Zwane Mwaikambo
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 19:29 ` Benjamin LaHaise
2 siblings, 0 replies; 97+ messages in thread
From: Zwane Mwaikambo @ 2002-06-18 18:43 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List, torvalds
Hi Rusty,
On Wed, 19 Jun 2002, Rusty Russell wrote:
> > Can you fix this or tell me what is the new equivalent of
> > cpu_online_map?
>
> Well, I'm heading away from assumptions on the arch representations of
> online CPUs (which the NUMA guys need anyway).
Will there also be some sort of facility to determine which node a cpu is
on? This would be quite handy in other areas.
Cheers,
Zwane Mwaikambo
--
http://function.linuxpower.ca
* Re: latest linus-2.5 BK broken
2002-06-18 17:46 ` Robert Love
@ 2002-06-18 18:51 ` Rusty Russell
2002-06-18 18:43 ` Zwane Mwaikambo
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Rusty Russell @ 2002-06-18 18:51 UTC (permalink / raw)
To: Robert Love; +Cc: Linux Kernel Mailing List, torvalds
In message <1024422409.1476.208.camel@sinai> you write:
> On Tue, 2002-06-18 at 10:18, James Simmons wrote:
>
> > gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c
> > sched.c: In function `sys_sched_setaffinity':
> > sched.c:1329: `cpu_online_map' undeclared (first use in this function)
> > sched.c:1329: (Each undeclared identifier is reported only once
> > sched.c:1329: for each function it appears in.)
> > sched.c: In function `sys_sched_getaffinity':
> > sched.c:1389: `cpu_online_map' undeclared (first use in this function)
> > make[1]: *** [sched.o] Error 1
>
> Rusty, I assume this is a side-effect of the hotplug merge?
Yes, sorry.
> Can you fix this or tell me what is the new equivalent of
> cpu_online_map?
Well, I'm heading away from assumptions on the arch representations of
online CPUs (which the NUMA guys need anyway).
You could do a loop here, but the real problem is the broken userspace
interface. Can you fix this so it takes a single CPU number please?
ie.
/* -1 = remove affinity */
sys_sched_setaffinity(pid_t pid, int cpu);
This will work everywhere, and doesn't require userspace to know the
size of the cpu bitmask etc.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: latest linus-2.5 BK broken
2002-06-18 18:51 ` Rusty Russell
2002-06-18 18:43 ` Zwane Mwaikambo
@ 2002-06-18 18:56 ` Linus Torvalds
2002-06-18 18:59 ` Robert Love
2002-06-18 20:05 ` Rusty Russell
2002-06-18 19:29 ` Benjamin LaHaise
2 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2002-06-18 18:56 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> You could do a loop here, but the real problem is the broken userspace
> interface. Can you fix this so it takes a single CPU number please?
NO.
Rusty, people want to do "node-affine" stuff, which absolutely requires
you to be able to give CPU "collections". Single CPU's need not apply.
Linus
* Re: latest linus-2.5 BK broken
2002-06-18 18:56 ` Linus Torvalds
@ 2002-06-18 18:59 ` Robert Love
2002-06-18 20:05 ` Rusty Russell
1 sibling, 0 replies; 97+ messages in thread
From: Robert Love @ 2002-06-18 18:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rusty Russell, Linux Kernel Mailing List
On Tue, 2002-06-18 at 11:56, Linus Torvalds wrote:
> NO.
>
> Rusty, people want to do "node-affine" stuff, which absolutely requires
> you to be able to give CPU "collections". Single CPU's need not apply.
I would also hate to have to make 32 system calls to get the affinity
mask I want.
If anything, I think the interface is not collective _enough_ - further
abstractions like psets seem to be in favor, not dropping down to a
one-CPU-and-task per-call thing. Not that I am complaining, I am happy
with the interface...
Robert Love
* Re: latest linus-2.5 BK broken
2002-06-18 19:29 ` Benjamin LaHaise
@ 2002-06-18 19:19 ` Zwane Mwaikambo
2002-06-18 19:49 ` Benjamin LaHaise
2002-06-18 20:13 ` Rusty Russell
1 sibling, 1 reply; 97+ messages in thread
From: Zwane Mwaikambo @ 2002-06-18 19:19 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List
On Tue, 18 Jun 2002, Benjamin LaHaise wrote:
> > /* -1 = remove affinity */
> > sys_sched_setaffinity(pid_t pid, int cpu);
> >
> > This will work everywhere, and doesn't require userspace to know the
> > size of the cpu bitmask etc.
>
> That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> quads that share caches.
Hmm, I don't understand; mind explaining why it wouldn't work on HT?
Cheers,
Zwane Mwaikambo
--
http://function.linuxpower.ca
* Re: latest linus-2.5 BK broken
2002-06-18 19:49 ` Benjamin LaHaise
@ 2002-06-18 19:27 ` Zwane Mwaikambo
0 siblings, 0 replies; 97+ messages in thread
From: Zwane Mwaikambo @ 2002-06-18 19:27 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List
On Tue, 18 Jun 2002, Benjamin LaHaise wrote:
> On HyperThreading, you want to specify that either cpu in a pair is
> okay. In larger SMP machines that share a cache between 4 CPUs, the
> mask is likely to contain all 4 CPUs in each quad.
Hmm, so you want to apply the same 'node' principle to HT? Given the way HT
works, I can see why that would be a good idea. Node affinity on the quads
makes sense, and distinguishing which cpus belong to which quads would also
help with irq affinity.
Thanks,
Zwane Mwaikambo
--
http://function.linuxpower.ca
* Re: latest linus-2.5 BK broken
2002-06-18 18:51 ` Rusty Russell
2002-06-18 18:43 ` Zwane Mwaikambo
2002-06-18 18:56 ` Linus Torvalds
@ 2002-06-18 19:29 ` Benjamin LaHaise
2002-06-18 19:19 ` Zwane Mwaikambo
2002-06-18 20:13 ` Rusty Russell
2 siblings, 2 replies; 97+ messages in thread
From: Benjamin LaHaise @ 2002-06-18 19:29 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List
On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote:
> You could do a loop here, but the real problem is the broken userspace
> interface. Can you fix this so it takes a single CPU number please?
>
> ie.
> /* -1 = remove affinity */
> sys_sched_setaffinity(pid_t pid, int cpu);
>
> This will work everywhere, and doesn't require userspace to know the
> size of the cpu bitmask etc.
That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
quads that share caches.
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
* Re: latest linus-2.5 BK broken
2002-06-18 19:19 ` Zwane Mwaikambo
@ 2002-06-18 19:49 ` Benjamin LaHaise
2002-06-18 19:27 ` Zwane Mwaikambo
0 siblings, 1 reply; 97+ messages in thread
From: Benjamin LaHaise @ 2002-06-18 19:49 UTC (permalink / raw)
To: Zwane Mwaikambo; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List
On Tue, Jun 18, 2002 at 09:19:40PM +0200, Zwane Mwaikambo wrote:
> Hmm i don't understand, mind explaining why it wouldn't work on HT?
On HyperThreading, you want to specify that either cpu in a pair is
okay. In larger SMP machines that share a cache between 4 CPUs, the
mask is likely to contain all 4 CPUs in each quad.
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
* Re: latest linus-2.5 BK broken
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 18:59 ` Robert Love
@ 2002-06-18 20:05 ` Rusty Russell
2002-06-18 20:05 ` Linus Torvalds
1 sibling, 1 reply; 97+ messages in thread
From: Rusty Russell @ 2002-06-18 20:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Robert Love, Linux Kernel Mailing List
In message <Pine.LNX.4.44.0206181155280.4552-100000@home.transmeta.com> you write:
>
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > You could do a loop here, but the real problem is the broken userspace
> > interface. Can you fix this so it takes a single CPU number please?
>
> NO.
>
> Rusty, people want to do "node-affine" stuff, which absolutely requires
> you to be able to give CPU "collections". Single CPU's need not apply.
NO. They want to be node-affine. They don't want to specify what
CPUs they attach to.
Understand?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: latest linus-2.5 BK broken
2002-06-18 20:05 ` Rusty Russell
@ 2002-06-18 20:05 ` Linus Torvalds
2002-06-18 20:31 ` Rusty Russell
0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-18 20:05 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> NO. They want to be node-affine. They don't want to specify what
> CPUs they attach to.
So you're going to have separate interfaces for that? Gag me with a volvo,
but that's idiotic.
Besides, even that would be broken. You want bitmaps, because bitmaps is
really what it is all about. It's NOT about "I must run on this CPU", it
can equally well be "I mustn't run on those two CPU's that are hosting the
RT part of this thing" or something like that.
Linus
* Re: latest linus-2.5 BK broken
2002-06-18 19:29 ` Benjamin LaHaise
2002-06-18 19:19 ` Zwane Mwaikambo
@ 2002-06-18 20:13 ` Rusty Russell
2002-06-18 20:21 ` Linus Torvalds
2002-06-18 22:03 ` Ingo Molnar
1 sibling, 2 replies; 97+ messages in thread
From: Rusty Russell @ 2002-06-18 20:13 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Robert Love, torvalds, Linux Kernel Mailing List
In message <20020618152949.B16091@redhat.com> you write:
> On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote:
> > You could do a loop here, but the real problem is the broken userspace
> > interface. Can you fix this so it takes a single CPU number please?
> >
> > ie.
> > /* -1 = remove affinity */
> > sys_sched_setaffinity(pid_t pid, int cpu);
> >
> > This will work everywhere, and doesn't require userspace to know the
> > size of the cpu bitmask etc.
>
> That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> quads that share caches.
This is the NUMA "I want to be in this group" problem. If you're
serious about this, you'll go for a sys_sched_groupaffinity call, or
add an extra arg to sys_sched_setaffinity, or simply use the top 16
bits of the cpu arg.
You will also add /proc/cpugroups or something to export this
information to users so there's a point.
Sorry, the current interface is insufficient for NUMA *and* is
impossible[1] for the user to use correctly.
Rusty.
[1] Defined as "too hard for them to ever do it properly"
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: latest linus-2.5 BK broken
2002-06-18 20:13 ` Rusty Russell
@ 2002-06-18 20:21 ` Linus Torvalds
2002-06-18 22:03 ` Ingo Molnar
1 sibling, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2002-06-18 20:21 UTC (permalink / raw)
To: Rusty Russell; +Cc: Benjamin LaHaise, Robert Love, Linux Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> > That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> > quads that share caches.
>
> This is the NUMA "I want to be in this group" problem. If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or
> add an extra arg to sys_sched_setaffinity, or simply use the top 16
> bits of the cpu arg.
Oh, yes. That makes sense. NOT.
> Sorry, the current interface is insufficient for NUMA *and* is
> impossible[1] for the user to use correctly.
Don't be silly.
Give _one_ good reason why the affinity system call cannot take a simple
bitmask? It's trivial to use, your arguments do not make any sense.
Linus
* Re: latest linus-2.5 BK broken
2002-06-18 20:05 ` Linus Torvalds
@ 2002-06-18 20:31 ` Rusty Russell
2002-06-18 20:41 ` Linus Torvalds
2002-06-18 20:55 ` Robert Love
0 siblings, 2 replies; 97+ messages in thread
From: Rusty Russell @ 2002-06-18 20:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Robert Love, Linux Kernel Mailing List
In message <Pine.LNX.4.44.0206181302300.872-100000@home.transmeta.com> you write:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > NO. They want to be node-affine. They don't want to specify what
> > CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a volvo,
> but that's idiotic.
No, you have accepted a non-portable userspace interface and put it in
generic code. THAT is idiotic.
So any program that doesn't use the following is broken:
#include <limits.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BITS_PER_LONG (sizeof(long)*CHAR_BIT)

int set_cpu(int cpu)
{
	size_t size = sizeof(unsigned long);
	unsigned long *bitmask = NULL;
	int ret;

	do {
		size *= 2;
		bitmask = realloc(bitmask, size);
		memset(bitmask, 0, size);
		/* don't write past the buffer while it is still too small */
		if ((size_t)cpu < size * CHAR_BIT)
			bitmask[cpu / BITS_PER_LONG] = 1UL << (cpu % BITS_PER_LONG);
		ret = sched_setaffinity(getpid(), size, bitmask);
	} while (ret < 0 && errno == EINVAL);
	free(bitmask);
	return ret;
}
> Besides, even that would be broken. You want bitmaps, because bitmaps is
> really what it is all about. It's NOT about "I must run on this CPU", it
> can equally well be "I mustn't run on those two CPU's that are hosting the
> RT part of this thing" or something like that.
Just bind to a cpu != those two CPUs. I could come up with contrived
examples too, but I'm trying to save userspace programmers and those
who have to port to new architectures.
If you don't know how to do it well, do it simply.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: latest linus-2.5 BK broken
2002-06-18 20:31 ` Rusty Russell
@ 2002-06-18 20:41 ` Linus Torvalds
2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 20:55 ` Robert Love
1 sibling, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-18 20:41 UTC (permalink / raw)
To: Rusty Russell; +Cc: Robert Love, Linux Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> So any program that doesn't use the following is broken:
That wasn't so hard, was it?
Besides, we've had this interface for about 15 years, and it's called
"select()". It scales fine to thousands of descriptors, and we're talking
about something that is a hell of a lot less timing-critical than select
ever was.
"Earth to Rusty, come in Rusty.."
How do we handle the bitmaps in select()? Right. We assume some size that
is plenty good enough. Come back to me when something simple like
#define MAX_CPUNR 1024
unsigned long cpumask[MAX_CPUNR / BITS_PER_LONG];
doesn't work.
The existing interface is _fine_, and when somebody actually has a machine
with more than 1024 CPU's (yeah, right, I'm really worried), the existing
interface will cause graceful errors instead of doing something
unexpected.
And if you're telling me that people who care about CPU affinity cannot
fathom a simple bitmap of longs, you're just out to lunch.
Linus
* Re: latest linus-2.5 BK broken
2002-06-18 20:31 ` Rusty Russell
2002-06-18 20:41 ` Linus Torvalds
@ 2002-06-18 20:55 ` Robert Love
2002-06-19 13:31 ` Rusty Russell
1 sibling, 1 reply; 97+ messages in thread
From: Robert Love @ 2002-06-18 20:55 UTC (permalink / raw)
To: Rusty Russell; +Cc: Linus Torvalds, Linux Kernel Mailing List
On Tue, 2002-06-18 at 13:31, Rusty Russell wrote:
> No, you have accepted a non-portable userspace interface and put it in
> generic code. THAT is idiotic.
>
> So any program that doesn't use the following is broken:
On top of what Linus replied, there is the issue that if your task does
not know how many CPUs can be in the system then setting its affinity is
worthless in 90% of the cases.
I.e., everyone today can write code like
sched_setaffinity(0, sizeof(unsigned long), &mask)
but let's say this code is executed on a system with a different number
of bits in the CPU mask. What do you do with the new/old bits? Ignore
them? Set new ones to zero? To 1?
In summary, setting CPU affinity is something that is naturally low-level
enough that it only makes sense when you know what you are setting and not
setting. While a mask of -1 may always make sense, arbitrary bitmaps
(think RT stuff here) are explicit for the number of CPUs given.
The interface is designed to make this as easy and clean as possible - i.e.,
the size check, etc.
Robert Love
* Re: latest linus-2.5 BK broken
2002-06-18 21:12 ` Benjamin LaHaise
@ 2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:47 ` Linus Torvalds
2002-06-19 10:21 ` Padraig Brady
2002-06-18 21:45 ` Bill Huey
1 sibling, 2 replies; 97+ messages in thread
From: Cort Dougan @ 2002-06-18 21:08 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Linus Torvalds, Rusty Russell, Robert Love,
Linux Kernel Mailing List
I agree with you there. It's not easy, and I'd claim it's not possible
given that no-one has done it yet, to have a select() call that is speedy
for both 0-10 and 1k file descriptors.
} I take issue with the statement that select scales fine to thousands of
} file descriptors. It doesn't. For fairly trivial workloads it degrades
} to 0 operations per second with more than a few dozen file descriptors in
} the array, but only one descriptor being active. To sustain decent
} throughput, select needs something like 50% of the file descriptors in an
} array to be active at every select() call, which makes it unsuitable for
} things like LDAP servers, or HTTP/FTP where the clients are behind slow
} connections or interactive (like in the real world). I've benchmarked
} it -- we should really include something like /dev/epoll in the kernel
} to improve this case.
}
} Still, I think the bitmap approach in this case is useful, as having
} affinity to multiple CPUs can be needed, and it is not a frequently
} occurring operation (unlike select()).
}
} -ben
} --
} "You will be reincarnated as a toad; and you will be much happier."
* Re: latest linus-2.5 BK broken
2002-06-18 20:41 ` Linus Torvalds
@ 2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:45 ` Bill Huey
0 siblings, 2 replies; 97+ messages in thread
From: Benjamin LaHaise @ 2002-06-18 21:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rusty Russell, Robert Love, Linux Kernel Mailing List
On Tue, Jun 18, 2002 at 01:41:12PM -0700, Linus Torvalds wrote:
> That wasn't so hard, was it?
>
> Besides, we've had this interface for about 15 years, and it's called
> "select()". It scales fine to thousands of descriptors, and we're talking
> about something that is a hell of a lot less timing-critical than select
> ever was.
I take issue with the statement that select scales fine to thousands of
file descriptors. It doesn't. For fairly trivial workloads it degrades
to 0 operations per second with more than a few dozen file descriptors in
the array, but only one descriptor being active. To sustain decent
throughput, select needs something like 50% of the file descriptors in an
array to be active at every select() call, which makes it unsuitable for
things like LDAP servers, or HTTP/FTP where the clients are behind slow
connections or interactive (like in the real world). I've benchmarked
it -- we should really include something like /dev/epoll in the kernel
to improve this case.
Still, I think the bitmap approach in this case is useful, as having
affinity to multiple CPUs can be needed, and it is not a frequently
occurring operation (unlike select()).
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
* Re: latest linus-2.5 BK broken
2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 21:08 ` Cort Dougan
@ 2002-06-18 21:45 ` Bill Huey
1 sibling, 0 replies; 97+ messages in thread
From: Bill Huey @ 2002-06-18 21:45 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Linus Torvalds, Rusty Russell, Robert Love,
Linux Kernel Mailing List, Bill Huey
On Tue, Jun 18, 2002 at 05:12:00PM -0400, Benjamin LaHaise wrote:
> connections or interactive (like in the real world). I've benchmarked
> it -- we should really include something like /dev/epoll in the kernel
> to improve this case.
Heh, try kqueue(). ;)
It's a pretty workable API and there seems to be a lot of momentum in
the BSDs (Darwin, FreeBSD) for it.
bill
* Re: latest linus-2.5 BK broken
2002-06-18 21:08 ` Cort Dougan
@ 2002-06-18 21:47 ` Linus Torvalds
2002-06-19 12:29 ` Eric W. Biederman
2002-06-19 10:21 ` Padraig Brady
1 sibling, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-18 21:47 UTC (permalink / raw)
To: Cort Dougan
Cc: Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On Tue, 18 Jun 2002, Cort Dougan wrote:
>
> I agree with you there. It's not easy, and I'd claim it's not possible
> given that no-one has done it yet, to have a select() call that is speedy
> for both 0-10 and 1k file descriptors.
Actually, select() scales a lot better than poll() for _dense_ bitmaps.
The problem with non-scalability ends up being either sparse bitmaps
(minor problem, poll() can help) or just the work involved in watching a
large number of fd's (major problem, but totally unrelated to the bitmap
itself, and poll() usually makes it worse thanks to more data to be
moved).
Anyway, I was talking about the scalability of the _data_structure_, not
the scalability performance-wise. Performance scalability is a non-issue
for something like setaffinity(), since it's just not called at any rate
approaching poll.
From a data structure standpoint, bitmaps are clearly the simplest dense
representation, and scale perfectly well to any reasonable number of
CPU's.
If we end up using a default of 1024, maybe you'll have to recompile that
part of the system that has anything to do with CPU affinity in about
10-20 years by just upping the number a bit. Quite frankly, that's going
to be the _least_ of the issues.
Linus
* Re: latest linus-2.5 BK broken
2002-06-18 20:13 ` Rusty Russell
2002-06-18 20:21 ` Linus Torvalds
@ 2002-06-18 22:03 ` Ingo Molnar
1 sibling, 0 replies; 97+ messages in thread
From: Ingo Molnar @ 2002-06-18 22:03 UTC (permalink / raw)
To: Rusty Russell
Cc: Benjamin LaHaise, Robert Love, torvalds,
Linux Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
> This is the NUMA "I want to be in this group" problem. If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or add
> an extra arg to sys_sched_setaffinity, or simply use the top 16 bits of
> the cpu arg.
the reason why i picked a linear cpu bitmask for the first patches to do
affinity syscalls (which ultimately found their way into 2.5) was very
simple: we do *NOT* want to deal with cache hierarchies in the kernel, at
this point.
enumerating CPUs and giving processes the ability to bind themselves to an
arbitrary set of CPUs is enough. *IF* user-space wants to do more then
they can get and use whatever NUMA information they want. There could even
be separate sets of syscalls perhaps to get the exact CPU cache hierarchy
of the system, although that would have to be done really well to be truly
generic and long-living.
so in this case the simplest approach that scales well to a reasonable
number of CPUs (thousands, at least) won.
> You will also add /proc/cpugroups or something to export this
> information to users so there's a point.
and this might not even be enough. Cache hierarchies can be pretty
non-trivial, and it's not necessarily a distinct group of CPUs; it could
be a hierarchy of multiple levels, or it could even be an asymmetric
distribution of caches. In fact it might not even be expressible in
'group' categories - caches could be interconnected in a 2D or even 3D
topology. Or multiprocessing CPUs could have dynamic caches in the future
- 'cache on demand' allocated to a cache-happy CPU, while another CPU with
a smaller working set will use less cache space. [obviously the technology
is not available today.]
one thing i was *very* sure about: we frankly don't have the slightest clue
how the really big systems will look in 10 or 20 years. So
hardcoding anything like 'group affinity' or some of today's NUMA
hierarchies would be pretty shortsighted. I'm convinced that the 'opaque'
solution, the simple but generic setaffinity system call is the right
choice.
Ingo
* Re: latest linus-2.5 BK broken
@ 2002-06-18 23:38 Michael Hohnbaum
2002-06-18 23:57 ` Ingo Molnar
0 siblings, 1 reply; 97+ messages in thread
From: Michael Hohnbaum @ 2002-06-18 23:38 UTC (permalink / raw)
To: torvalds; +Cc: rusty, rml, linux-kernel, colpatch
On Tuesday, June 18 2002, Linus Torvalds wrote:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> NO. They want to be node-affine. They don't want to specify what
> CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a
> volvo, but that's idiotic.
>
> Besides, even that would be broken. You want bitmaps, because bitmaps
> is really what it is all about. It's NOT about "I must run on this
> CPU", it can equally well be "I mustn't run on those two CPU's that
> are hosting the RT part of this thing" or something like that.
>
> Linus
A bit mask is a very good choice for the sched_setaffinity()
interface. I would suggest an additional argument be added
which would indicate the resource that the process is to be
affined to. That way this interface could be used for binding
processes to cpus, memory nodes, perhaps NUMA nodes, and,
as discussed recently in another thread, other processes.
Personally, I see NUMA nodes as overkill, if a process
can be bound to cpus and memory nodes.
There has been an effort made to address the needs for binding
processes to processors, memory nodes, etc. for NUMA machines.
A proposed API has been developed and implemented. See
http://lse.sourceforge.net/numa/numa_api.html for a spec on
the API. Matt Dobson has posted the implementation to lkml
as a patch against 2.5 several times, but has not seen much
discussion. I could see much of the capabilities provided
in the NUMA API being provided through the sched_setaffinity()
as described above.
Michael Hohnbaum
hohnbaum@us.ibm.com
* Re: latest linus-2.5 BK broken
2002-06-18 23:38 Michael Hohnbaum
@ 2002-06-18 23:57 ` Ingo Molnar
2002-06-19 0:08 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Ingo Molnar @ 2002-06-18 23:57 UTC (permalink / raw)
To: Michael Hohnbaum
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel,
colpatch
On 18 Jun 2002, Michael Hohnbaum wrote:
> A bit mask is a very good choice for the sched_setaffinity()
> interface. [...]
thanks :)
> [...] I would suggest an additional argument be added
> which would indicate the resource that the process is to be
> affined to. That way this interface could be used for binding
> processes to cpus, memory nodes, perhaps NUMA nodes, and,
> as discussed recently in another thread, other processes.
> Personally, I see NUMA nodes as an overkill, if a process
> can be bound to cpus and memory nodes.
are you sure we want one generic, process-based affinity interface?
i think the affinity to certain memory regions might need to be more
finegrained than this. Eg. it could be useful to define a per-file
(per-inode) 'backing store memory node' that the file is affine to. This
will eg. cause the pagecache to be allocated in the memory node.
Process-based affinity does not describe this in a natural way. Another
example, memory maps: we might want to have a certain memory map (vma)
allocated in a given memory node, independently of where the process that
is faulting a given page resides.
and it might certainly make sense to have some sort of 'default memory
affinity' for a process as well, but this should be a different syscall -
it really does a much different thing than CPU affinity. The CPU resource
is 'used' only temporarily with little footprint, while memory usage is
often for a very long timespan, and the affinity strategies differ
greatly. Also, memory as a resource is much more complex than CPU, eg. it
must handle things like over-allocation, fallback to 'nearby' nodes if a
node is full, etc.
so i'd suggest to actually create a good memory-affinity syscall interface
instead of trying to generalize it into the simple, robust, finite
CPU-affinity syscalls.
Ingo
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 23:57 ` Ingo Molnar
@ 2002-06-19 0:08 ` Ingo Molnar
2002-06-19 1:00 ` Matthew Dobson
2002-06-19 23:48 ` Michael Hohnbaum
2 siblings, 0 replies; 97+ messages in thread
From: Ingo Molnar @ 2002-06-19 0:08 UTC (permalink / raw)
To: Michael Hohnbaum
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel,
colpatch
another thought would be that the 'default' memory affinity can be derived
from the CPU affinity. A default process, one which is affine to all CPUs,
can have memory allocated from all memory nodes. A process which is bound
to a given set of CPUs, should get its memory allocated from the nodes
that 'belong' to those CPUs.
the topology might not be as simple as this, but generally it's the CPU
that drives the topology, so a given CPU affinity mask leads to a specific
'preferred memory nodes' bitmask - there isn't much choice needed on the
user's part; in fact it might be counterproductive to bind a process to
some CPU and bind its memory allocations to a very distant memory node.
While mathematically there is not necessarily any 1:1 relationship between
CPU affinity and 'best memory affinity', technologically there is.
per-object affinity might still be possible under this scheme; it would
override whatever 'default' memory affinity is derived from the CPU
affinity mask. [that would allow, for example, an important database
file to be locked to a given memory node, so that helper processes executing
on distant CPUs do not cause a distant pagecache page to be allocated.]
another advantage is that this removes the burden from the application
writer of having to figure out the actual memory topology and fitting the
CPU affinity to the memory affinity (and vice versa). The kernel can
figure out a good default memory affinity based on the CPU affinity mask.
(so everything so far points in the direction of having a simple CPU
affinity syscall, which we have now.)
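The derivation Ingo describes can be sketched in plain C. The 4-CPUs-per-node `cpu_to_node()` mapping and the `NR_CPUS` value are assumptions made up for this sketch, not anything from the kernel:

```c
/* Derive a 'preferred memory nodes' bitmask from a CPU affinity mask:
 * every node that owns at least one allowed CPU becomes a preferred
 * node.  The topology below (4 CPUs per node) is invented. */
#define NR_CPUS 16

static int cpu_to_node(int cpu)
{
	return cpu >> 2;	/* assume 4 CPUs per node */
}

static unsigned long default_node_mask(unsigned long cpu_mask)
{
	unsigned long node_mask = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_mask & (1UL << cpu))
			node_mask |= 1UL << cpu_to_node(cpu);
	return node_mask;
}
```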
Ingo
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
[not found] <E17KSLb-0007Dj-00@wagner.rustcorp.com.au>
@ 2002-06-19 0:12 ` Linus Torvalds
2002-06-19 15:23 ` Rusty Russell
0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-19 0:12 UTC (permalink / raw)
To: Rusty Russell; +Cc: Kernel Mailing List
On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> - new_mask &= cpu_online_map;
> + /* Eliminate offline cpus from the mask */
> + for (i = 0; i < NR_CPUS; i++)
> + if (!cpu_online(i))
> + new_mask &= ~(1<<i);
> +
And why can't cpu_online_map be a bitmap?
What's your beef against sane and efficient data structures? The above is
just crazy.
Just add a
	#define NRCPUWORDS (ROUND_UP(NR_CPUS, BITS_PER_LONG) / BITS_PER_LONG)

	typedef struct cpu_mask {
		unsigned long mask[NRCPUWORDS];
	} cpu_mask_t;
and then add a few simple operations like
cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
and friends. See how we handle this issue in <linux/signal.h>, which has
perfectly efficient ways to do all the same things (ie see how
"sigemptyset()" and friends compile to efficient code for the "normal"
cases).
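A standalone sketch of the cpu_mask_t Linus is describing, modelled on the sigset_t handling in <linux/signal.h>. The `NR_CPUS` value, the helper names, and the open-coded round-up are assumptions for this example:

```c
#include <string.h>

/* Fixed-size CPU bitmap, sized in longs like sigset_t. */
#define NR_CPUS 96
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define NR_CPU_WORDS ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct cpu_mask {
	unsigned long mask[NR_CPU_WORDS];
} cpu_mask_t;

static void cpumask_empty(cpu_mask_t *m)
{
	memset(m, 0, sizeof(*m));
}

static void cpumask_set(cpu_mask_t *m, int cpu)
{
	m->mask[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

static int cpumask_test(const cpu_mask_t *m, int cpu)
{
	return (m->mask[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1;
}

static void cpumask_and(cpu_mask_t *res, const cpu_mask_t *a,
			const cpu_mask_t *b)
{
	unsigned int i;

	for (i = 0; i < NR_CPU_WORDS; i++)
		res->mask[i] = a->mask[i] & b->mask[i];
}
```

When NR_CPUS fits in one long, NR_CPU_WORDS is 1 and the loop collapses to a single AND, which is the "compiles to efficient code for the normal cases" property of the signal.h helpers.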
This is not rocket science, and I find it ridiculous that you claim to
worry about scaling up to thousands of CPUs, and then you try to send me
absolute crap like the above which clearly is unacceptable for lots of
CPUs.
No, C doesn't have built-in support for bitmap operations except on a
small scale level (ie single words), and yes, clearly that's why Linux
tends to prefer only small bitmaps, but NO, that does not make bitmaps
evil.
Linus
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 23:57 ` Ingo Molnar
2002-06-19 0:08 ` Ingo Molnar
@ 2002-06-19 1:00 ` Matthew Dobson
2002-06-19 23:48 ` Michael Hohnbaum
2 siblings, 0 replies; 97+ messages in thread
From: Matthew Dobson @ 2002-06-19 1:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Michael Hohnbaum, Linus Torvalds, Rusty Russell, Robert Love,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3108 bytes --]
Ingo Molnar wrote:
> On 18 Jun 2002, Michael Hohnbaum wrote:
>>[...] I would suggest an additional argument be added
>>which would indicate the resource that the process is to be
>>affined to. That way this interface could be used for binding
>>processes to cpus, memory nodes, perhaps NUMA nodes, and,
>>as discussed recently in another thread, other processes.
>>Personally, I see NUMA nodes as overkill, if a process
>>can be bound to cpus and memory nodes.
>
>
> are you sure we want one generic, process-based affinity interface?
>
> i think the affinity to certain memory regions might need to be more
> fine-grained than this. Eg. it could be useful to define a per-file
> (per-inode) 'backing store memory node' that the file is affine to. This
> will eg. cause the pagecache to be allocated in that memory node.
> Process-based affinity does not describe this in a natural way. Another
> example, memory maps: we might want to have a certain memory map (vma)
> allocated in a given memory node, independently of where the process that
> is faulting a given page resides.
>
> and it might certainly make sense to have some sort of 'default memory
> affinity' for a process as well, but this should be a different syscall -
> it really does a much different thing than CPU affinity. The CPU resource
> is 'used' only temporarily with little footprint, while memory usage is
> often for a very long timespan, and the affinity strategies differ
> greatly. Also, memory as a resource is much more complex than CPU, eg. it
> must handle things like over-allocation, fallback to 'nearby' nodes if a
> node is full, etc.
I've attached copies of the patch that Michael referred to in his email so you
can see where we're going with this. I think that we have (at least the
beginnings of) what you've described. The patch allows processes to bind to
specific CPUs (via bitmask) and/or specific memory blocks. You can set these
up to complement each other, or to something completely arbitrary (for
debugging purposes, etc). It also includes the beginnings of very simple
topology info with some simple arch-independent calls (cpu_to_node,
node_to_cpu, node_to_memblk, etc.). Of course these do require some lower-level
hooks for each architecture that wants to use them, but they should be simple
calls. I've been sidetracked on other things for about a month, but I plan on
getting back to this patch ASAP (this week) and porting it forward to the
latest version. It is currently only up to 2.5.14.
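As a rough illustration of how those arch-independent topology calls compose, here is a toy model using a made-up flat layout of 4 CPUs and one memory block per node; in the patch itself the real mappings come from per-arch hooks like `_cpu_to_node()` in core_ibmnumaq.h:

```c
/* Invented flat topology: node n owns CPUs 4n..4n+3 and memblk n. */
#define CPUS_PER_NODE 4

static int cpu_to_node(int cpu)     { return cpu / CPUS_PER_NODE; }
static int node_to_cpu(int node)    { return node * CPUS_PER_NODE; } /* first CPU on node */
static int node_to_memblk(int node) { return node; }                 /* first memblk on node */

/* Composing the calls: find the memory block local to a given CPU. */
static int cpu_to_local_memblk(int cpu)
{
	return node_to_memblk(cpu_to_node(cpu));
}
```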
If anyone has any suggestions for other features, changes, comments, flames,
ANYTHING, please let me know.
> so i'd suggest to actually create a good memory-affinity syscall interface
> instead of trying to generalize it into the simple, robust, finite
> CPU-affinity syscalls.
See above ;)
-Matt
>
> Ingo
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
[-- Attachment #2: numa_api-arch_dep-2.5.14.patch --]
[-- Type: text/plain, Size: 4065 bytes --]
diff -Nur linux-2.5.12-vanilla/include/asm-i386/core_ibmnumaq.h linux-2.5.12-api/include/asm-i386/core_ibmnumaq.h
--- linux-2.5.12-vanilla/include/asm-i386/core_ibmnumaq.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.12-api/include/asm-i386/core_ibmnumaq.h Wed May 1 17:24:25 2002
@@ -0,0 +1,61 @@
+/*
+ * linux/include/asm-i386/mmzone.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_CORE_IBMNUMAQ_H_
+#define _ASM_CORE_IBMNUMAQ_H_
+
+/*
+ * These functions need to be defined for every architecture.
+ * The first five are necessary for the NUMA API to function.
+ * The last is needed by several pieces of NUMA code.
+ */
+
+/* Returns the number of the node containing CPU 'cpu' */
+#define _cpu_to_node(cpu) (cpu_to_logical_apicid(cpu) >> 4)
+
+/* Returns the number of the node containing MemBlk 'memblk' */
+#define _memblk_to_node(memblk) (memblk)
+
+/* Returns the number of the node containing Node 'nid'. This architecture is flat,
+ so it is a pretty simple function. */
+#define _node_to_node(nid) (nid)
+
+/* Returns the number of the first CPU on Node 'node' */
+static inline int _node_to_cpu(int node)
+{
+ int i, cpu, logical_apicid = node << 4;
+
+ for(i = 1; i < 16; i <<= 1)
+ if ((cpu = logical_apicid_to_cpu(logical_apicid | i)) >= 0)
+ return cpu;
+
+ return 0;
+}
+
+/* Returns the number of the first MemBlk on Node 'node' */
+#define _node_to_memblk(node) (node)
+
+#endif /* _ASM_CORE_IBMNUMAQ_H_ */
diff -Nur linux-2.5.12-vanilla/include/asm-i386/mmzone.h linux-2.5.12-api/include/asm-i386/mmzone.h
--- linux-2.5.12-vanilla/include/asm-i386/mmzone.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.12-api/include/asm-i386/mmzone.h Wed May 1 17:24:25 2002
@@ -0,0 +1,45 @@
+/*
+ * linux/include/asm-i386/mmzone.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _ASM_MMZONE_H_
+#define _ASM_MMZONE_H_
+
+#include <asm/smpboot.h>
+
+#ifdef CONFIG_IBMNUMAQ
+#include <asm/core_ibmnumaq.h>
+#else /* !CONFIG_IBMNUMAQ */
+#define _cpu_to_node(cpu) (0)
+#define _memblk_to_node(memblk) (0)
+#define _node_to_node(nid) (0)
+#define _node_to_cpu(node) (0)
+#define _node_to_memblk(node) (0)
+#endif /* CONFIG_IBMNUMAQ */
+
+/* Returns the number of the current Node. */
+#define numa_node_id() (_cpu_to_node(smp_processor_id()))
+
+#endif /* _ASM_MMZONE_H_ */
[-- Attachment #3: numa_api-arch_indep-impl-2.5.14.patch --]
[-- Type: text/plain, Size: 17845 bytes --]
diff -Nur linux-2.5.8-vanilla/kernel/Makefile linux-2.5.8-api/kernel/Makefile
--- linux-2.5.8-vanilla/kernel/Makefile Sun Apr 14 12:18:47 2002
+++ linux-2.5.8-api/kernel/Makefile Mon Apr 22 15:35:16 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o numa.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -Nur linux-2.5.8-vanilla/kernel/fork.c linux-2.5.8-api/kernel/fork.c
--- linux-2.5.8-vanilla/kernel/fork.c Sun Apr 14 12:18:45 2002
+++ linux-2.5.8-api/kernel/fork.c Tue Apr 23 14:49:29 2002
@@ -707,6 +707,20 @@
spin_lock_init(&p->sigmask_lock);
}
#endif
+ if (!null_restrict(&p->numa_launch_policy)){
+ p->numa_binding = p->numa_launch_policy;
+ p->cpus_allowed = p->numa_binding.cpus.list & p->numa_restrict.cpus.list;
+ if (!(p->cpus_allowed & cpu_online_map))
+ BUG();
+ if (p->cpus_allowed & (1UL << smp_processor_id()))
+ p->thread_info->cpu = smp_processor_id();
+ else
+ p->thread_info->cpu = __ffs(p->cpus_allowed & cpu_online_map);
+ } else
+ p->thread_info->cpu = smp_processor_id();
+ numa_set_init(&p->numa_launch_policy);
+ rwlock_init(&p->numa_api_lock);
+
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;
diff -Nur linux-2.5.8-vanilla/kernel/numa.c linux-2.5.8-api/kernel/numa.c
--- linux-2.5.8-vanilla/kernel/numa.c Wed Dec 31 16:00:00 1969
+++ linux-2.5.8-api/kernel/numa.c Mon Apr 29 11:21:27 2002
@@ -0,0 +1,378 @@
+/*
+ * linux/kernel/numa.c
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#include <linux/kernel.h>
+#include <linux/unistd.h>
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/numa.h>
+#include <linux/mmzone.h>
+#include <linux/errno.h>
+#include <linux/smp.h>
+
+
+#define is_valid_cpu_behavior(x) (x == CPU_BIND_STRICT)
+#define is_valid_memblk_behavior(x) (((x & 0x7) == MPOL_STRICT) || ((x & 0x7) == MPOL_LOOSE))
+
+#define is_numa_subset(x, y) (!((x) & ~(y))) /* test whether x is a subset of y */
+
+
+extern int nummemblks;
+extern unsigned long memblk_online_map;
+
+/*
+ * set_restricted_cpus(): Sets up a new CPU Restriction Set
+ */
+int set_restricted_cpus(numa_bitmap_t cpus, numa_set_t *numamap)
+{
+ int ret;
+ unsigned long flags;
+ numa_bitmap_t cpu_binding;
+
+ ret = -ENODEV;
+ /* Make sure that at least one of the cpus in the new restriction set is online. */
+ if (!(cpus & cpu_online_map))
+ goto out;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ cpu_binding = current->numa_binding.cpus.list;
+ /* If there is a binding, at least one of the bound cpus must be valid in the
+ new restriction set. */
+ if ((!null_restrict(¤t->numa_binding)) &&
+ (!(cpu_binding & cpus)))
+ goto out_unlock;
+
+ ret = -EPERM;
+ /* If the new restriction expands upon the old restriction, the caller must
+ have CAP_SYS_NICE. */
+ if ((!is_numa_subset(cpus, current->numa_restrict.cpus.list)) &&
+ (!capable(CAP_SYS_NICE)))
+ goto out_unlock;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ write_lock_irqsave(¤t->numa_api_lock, flags);
+ current->numa_restrict.cpus.list = cpus;
+ write_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ /* Set cpus_allowed to the current binding masked against the new list of allowed cpus. */
+ set_cpus_allowed(current, cpu_binding & cpus);
+ ret = 0;
+ goto out;
+
+ out_unlock:
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+ out:
+ return ret;
+}
+
+/*
+ * set_restricted_memblks(): Sets up a new MemBlk Restriction Set
+ */
+int set_restricted_memblks(numa_bitmap_t memblks, numa_set_t *numamap)
+{
+ int ret;
+ unsigned long flags;
+
+ ret = -ENODEV;
+ /* Make sure that at least one of the memblks in the new restriction set is online. */
+ if (!(memblks & memblk_online_map))
+ goto out;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ /* If there is a binding, at least one of the bound memblks must be valid in the
+ new restriction set. */
+ if ((!null_restrict(¤t->numa_binding)) &&
+ (!(current->numa_binding.memblks.list & memblks)))
+ goto out_unlock;
+
+ ret = -EPERM;
+ /* If the new restriction expands upon the old restriction, the caller
+ must have CAP_SYS_NICE. */
+ if ((!is_numa_subset(memblks, current->numa_restrict.memblks.list)) &&
+ (!capable(CAP_SYS_NICE)))
+ goto out_unlock;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ write_lock_irqsave(¤t->numa_api_lock, flags);
+ current->numa_restrict.memblks.list = memblks;
+ write_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = 0;
+ goto out;
+
+ out_unlock:
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+ out:
+ return ret;
+}
+
+/*
+ * get_restricted_cpus(): Returns the current CPU Restriction Set
+ */
+inline numa_bitmap_t get_restricted_cpus(void)
+{
+ unsigned long flags;
+ numa_bitmap_t cpu_restriction;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ cpu_restriction = current->numa_restrict.cpus.list;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ return cpu_restriction;
+}
+
+/*
+ * get_restricted_memblks(): Returns the current MemBlk Restriction Set
+ */
+inline numa_bitmap_t get_restricted_memblks(void)
+{
+ unsigned long flags;
+ numa_bitmap_t memblk_restriction;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ memblk_restriction = current->numa_restrict.memblks.list;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ return memblk_restriction;
+}
+
+/*
+ * cpu_to_node(cpu): Returns the number of the most specific Node
+ * containing CPU 'cpu'.
+ */
+inline int cpu_to_node(int cpu)
+{
+ if (cpu == -1) /* return highest numbered node */
+ return (numnodes - 1);
+
+ if ((cpu < 0) || (cpu >= NR_CPUS) ||
+ (!(cpu_online_map & (1 << cpu)))) /* invalid cpu # */
+ return -ENODEV;
+
+ return _cpu_to_node(cpu_logical_map(cpu));
+}
+
+/*
+ * memblk_to_node(memblk): Returns the number of the most specific Node
+ * containing Memory Block 'memblk'.
+ */
+inline int memblk_to_node(int memblk)
+{
+ if (memblk == -1) /* return highest numbered node */
+ return (numnodes - 1);
+
+ if ((memblk < 0) || (memblk >= NR_MEMBLKS) ||
+ (!(memblk_online_map & (1 << memblk)))) /* invalid memblk # */
+ return -ENODEV;
+
+ return _memblk_to_node(memblk);
+}
+
+/*
+ * node_to_node(nid): Returns the number of the most specific Node that
+ * encompasses Node 'nid'. Some may call this the parent Node of 'nid'.
+ */
+int node_to_node(int nid)
+{
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_node(nid);
+}
+
+/*
+ * node_to_cpu(nid): Returns the lowest numbered CPU on Node 'nid'
+ */
+inline int node_to_cpu(int nid)
+{
+ if (nid == -1) /* return highest numbered cpu */
+ return (smp_num_cpus - 1);
+
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_cpu(nid);
+}
+
+/*
+ * node_to_memblk(nid): Returns the lowest numbered MemBlk on Node 'nid'
+ */
+inline int node_to_memblk(int nid)
+{
+ if (nid == -1) /* return highest numbered memblk */
+ return (nummemblks - 1);
+
+ if ((nid < 0) || (nid >= numnodes)) /* invalid node # */
+ return -ENODEV;
+
+ return _node_to_memblk(nid);
+}
+
+/*
+ * get_cpu(): Returns the currently executing CPU number.
+ * For now, this has only mild usefulness, as this information could
+ * change on the return from syscall (which automatically calls schedule()).
+ * Due to this, the data could be stale by the time it gets back to the user.
+ * It will have to do, until a better method is found.
+ */
+inline int get_cpu(void)
+{
+ return smp_processor_id();
+}
+
+/*
+ * get_node(): Returns the number of the Node containing
+ * the currently executing CPU. Subject to the same caveat
+ * as the get_cpu() call.
+ */
+inline int get_node(void)
+{
+ return cpu_to_node(get_cpu());
+}
+
+/*
+ * bind_to_cpu(): Sets up a new CPU Binding
+ */
+int bind_to_cpu(numa_bitmap_t cpus, int behavior)
+{
+ int ret;
+ unsigned long flags;
+ numa_bitmap_t cpu_restriction;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ ret = -ENODEV;
+ /* Make sure that at least one of the cpus in the new binding is online, AND
+ in the current restriction set. */
+ if (!(cpus & cpu_online_map & current->numa_restrict.cpus.list))
+ goto out_unlock;
+ cpu_restriction = current->numa_restrict.cpus.list;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = -EINVAL;
+ /* Test to make sure the behavior argument is valid. */
+ if (!is_valid_cpu_behavior(behavior))
+ goto out;
+
+ write_lock_irqsave(¤t->numa_api_lock, flags);
+ current->numa_binding.cpus.list = cpus;
+ current->numa_binding.cpus.behavior = behavior;
+ write_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ /* Set cpus_allowed to the new binding masked against the current list of allowed cpus. */
+ set_cpus_allowed(current, cpus & cpu_restriction);
+ ret = 0;
+ goto out;
+
+ out_unlock:
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+ out:
+ return ret;
+}
+
+/*
+ * bind_to_memblk(): Sets up a new MemBlk Binding
+ */
+int bind_to_memblk(numa_bitmap_t memblks, int behavior)
+{
+ int ret;
+ unsigned long flags;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ ret = -ENODEV;
+ /* Make sure that at least one of the memblks in the new binding is online, AND
+ in the current restriction set. */
+ if (!(memblks & memblk_online_map & current->numa_restrict.memblks.list))
+ goto out_unlock;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = -EINVAL;
+ /* Test to make sure the behavior argument is valid. */
+ if (!is_valid_memblk_behavior(behavior))
+ goto out;
+
+ write_lock_irqsave(¤t->numa_api_lock, flags);
+ current->numa_binding.memblks.list = memblks;
+ current->numa_binding.memblks.behavior = behavior;
+ write_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = 0;
+ goto out;
+
+ out_unlock:
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+ out:
+ return ret;
+}
+
+/*
+ * bind_memory(): Will eventually set up a memory binding for a specific chunk of memory.
+ * Specifically, the chunk starting at 'start' through 'len' bytes. As of now, it doesn't
+ * *quite* do that. ;)
+ */
+inline int bind_memory(unsigned long start, size_t len, numa_bitmap_t memblks, int behavior)
+{
+ return -ENOTSUPP;
+}
+
+/*
+ * set_launch_policy(): Sets up a new Launch Policy for current process
+ */
+int set_launch_policy(numa_bitmap_t cpus, int cpu_behavior,
+ numa_bitmap_t memblks, int memblk_behavior)
+{
+ int ret;
+ unsigned long flags;
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ ret = -ENODEV;
+ /* Make sure that at least one of the cpus and one of the memblks in the new
+ binding are online, AND in the current restriction set. */
+ if ((!(cpus & cpu_online_map & current->numa_restrict.cpus.list)) ||
+ (!(memblks & memblk_online_map & current->numa_restrict.memblks.list)))
+ goto out_unlock;
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = -EINVAL;
+ /* Test to make sure the behavior arguments are valid. */
+ if ((!is_valid_cpu_behavior(cpu_behavior)) ||
+ (!is_valid_memblk_behavior(memblk_behavior)))
+ goto out;
+
+ write_lock_irqsave(¤t->numa_api_lock, flags);
+ current->numa_launch_policy.cpus.list = cpus;
+ current->numa_launch_policy.cpus.behavior = cpu_behavior;
+ current->numa_launch_policy.memblks.list = memblks;
+ current->numa_launch_policy.memblks.behavior = memblk_behavior;
+ write_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+ ret = 0;
+ goto out;
+
+ out_unlock:
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+ out:
+ return ret;
+}
diff -Nur linux-2.5.8-vanilla/mm/numa.c linux-2.5.8-api/mm/numa.c
--- linux-2.5.8-vanilla/mm/numa.c Sun Apr 14 12:18:49 2002
+++ linux-2.5.8-api/mm/numa.c Wed Apr 24 11:26:18 2002
@@ -8,8 +8,11 @@
#include <linux/bootmem.h>
#include <linux/mmzone.h>
#include <linux/spinlock.h>
+#include <linux/numa.h>
int numnodes = 1; /* Initialized for UMA platforms */
+int nummemblks = 0;
+unsigned long memblk_online_map = 0UL; /* Similar to cpu_online_map, but for memory blocks */
static bootmem_data_t contig_bootmem_data;
pg_data_t contig_page_data = { bdata: &contig_bootmem_data };
@@ -27,6 +30,10 @@
{
free_area_init_core(0, &contig_page_data, &mem_map, zones_size,
zone_start_paddr, zholes_size, pmap);
+ contig_page_data.node_id = 0;
+ contig_page_data.memblk_id = 0;
+ nummemblks = 1;
+ memblk_online_map = 1UL;
}
#endif /* !CONFIG_DISCONTIGMEM */
@@ -71,6 +78,11 @@
free_area_init_core(nid, pgdat, &discard, zones_size, zone_start_paddr,
zholes_size, pmap);
pgdat->node_id = nid;
+ pgdat->memblk_id = nummemblks;
+ if (test_and_set_bit(nummemblks++, &memblk_online_map)){
+ printk("memblk already counted?!?!\n");
+ BUG();
+ }
/*
* Get space for the valid bitmap.
@@ -88,6 +100,8 @@
return __alloc_pages(gfp_mask, order, pgdat->node_zonelists + (gfp_mask & GFP_ZONEMASK));
}
+#ifdef CONFIG_NUMA
+
/*
* This can be refined. Currently, tries to do round robin, instead
* should do concentratic circle search, starting from current node.
@@ -96,35 +110,84 @@
{
struct page *ret = 0;
pg_data_t *start, *temp;
-#ifndef CONFIG_NUMA
+ int search_twice = 0;
+ numa_bitmap_t memblk_bitmask, memblk_bitmask2;
unsigned long flags;
- static pg_data_t *next = 0;
-#endif
if (order >= MAX_ORDER)
return NULL;
-#ifdef CONFIG_NUMA
+
+ read_lock_irqsave(¤t->numa_api_lock, flags);
+ if (null_restrict(¤t->numa_binding))
+ /* if there is no binding, only search the restriction set */
+ memblk_bitmask = current->numa_restrict.memblks.list;
+ else {
+ /* if there is a binding, search it */
+ memblk_bitmask = current->numa_binding.memblks.list;
+ if (current->numa_binding.memblks.behavior == MPOL_LOOSE){
+ /* and if it is a loose binding, remember to search
+ the restriction if we come up empty */
+ search_twice = 1;
+ /* no need to search the memblks in the binding again,
+ so we'll mask them out. */
+ memblk_bitmask2 = current->numa_restrict.memblks.list & ~memblk_bitmask;
+ }
+ }
+ read_unlock_irqrestore(¤t->numa_api_lock, flags);
+
+search_through_memblks:
temp = NODE_DATA(numa_node_id());
-#else
- spin_lock_irqsave(&node_lock, flags);
- if (!next) next = pgdat_list;
- temp = next;
- next = next->node_next;
- spin_unlock_irqrestore(&node_lock, flags);
-#endif
start = temp;
while (temp) {
- if ((ret = alloc_pages_pgdat(temp, gfp_mask, order)))
- return(ret);
+ if (memblk_bitmask & (1 << temp->memblk_id))
+ if ((ret = alloc_pages_pgdat(temp, gfp_mask, order)))
+ return(ret);
temp = temp->node_next;
}
temp = pgdat_list;
while (temp != start) {
+ if (memblk_bitmask & (1 << temp->memblk_id))
+ if ((ret = alloc_pages_pgdat(temp, gfp_mask, order)))
+ return(ret);
+ temp = temp->node_next;
+ }
+
+ if (search_twice) {
+ /*
+ * If we failed to find a "preferred" memblk, try again
+ * looking for anything that's allowed (in restrict), but
+ * skip those memblks we've already looked at
+ */
+ search_twice = 0; /* no infinite loops, please */
+ memblk_bitmask = memblk_bitmask2;
+ goto search_through_memblks;
+ }
+ return(0);
+}
+
+#else /* !CONFIG_NUMA */
+
+struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+ struct page *ret = 0;
+ pg_data_t *temp;
+ unsigned long flags;
+
+ if (order >= MAX_ORDER)
+ return NULL;
+
+ spin_lock_irqsave(&node_lock, flags);
+ temp = pgdat_list;
+ spin_unlock_irqrestore(&node_lock, flags);
+
+ while (temp) {
if ((ret = alloc_pages_pgdat(temp, gfp_mask, order)))
return(ret);
temp = temp->node_next;
}
return(0);
}
+
+#endif /* CONFIG_NUMA */
#endif /* CONFIG_DISCONTIGMEM */
diff -Nur linux-2.5.8-vanilla/mm/page_alloc.c linux-2.5.8-api/mm/page_alloc.c
--- linux-2.5.8-vanilla/mm/page_alloc.c Sun Apr 14 12:18:44 2002
+++ linux-2.5.8-api/mm/page_alloc.c Mon Apr 22 15:35:16 2002
@@ -41,6 +41,9 @@
static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, };
static int zone_balance_max[MAX_NR_ZONES] __initdata = { 255 , 255, 255, };
+extern int nummemblks;
+extern unsigned long memblk_online_map;
+
/*
* Free_page() adds the page to the free lists. This is optimized for
* fast normal cases (no error jumps taken normally).
@@ -955,6 +958,10 @@
void __init free_area_init(unsigned long *zones_size)
{
free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 0, 0, 0);
+ contig_page_data.node_id = 0;
+ contig_page_data.memblk_id = 0;
+ nummemblks = 1;
+ memblk_online_map = 1UL;
}
static int __init setup_mem_frac(char *str)
[-- Attachment #4: numa_api-arch_indep-setup-2.5.14.patch --]
[-- Type: text/plain, Size: 6538 bytes --]
diff -Nur linux-2.5.8-vanilla/include/linux/init_task.h linux-2.5.8-api/include/linux/init_task.h
--- linux-2.5.8-vanilla/include/linux/init_task.h Mon Apr 22 17:20:20 2002
+++ linux-2.5.8-api/include/linux/init_task.h Fri Apr 26 15:22:52 2002
@@ -59,6 +59,10 @@
children: LIST_HEAD_INIT(tsk.children), \
sibling: LIST_HEAD_INIT(tsk.sibling), \
thread_group: LIST_HEAD_INIT(tsk.thread_group), \
+ numa_restrict: NEW_NUMA_SET, \
+ numa_binding: NEW_NUMA_SET, \
+ numa_launch_policy: NEW_NUMA_SET, \
+ numa_api_lock: RW_LOCK_UNLOCKED, \
wait_chldexit: __WAIT_QUEUE_HEAD_INITIALIZER(tsk.wait_chldexit),\
real_timer: { \
function: it_real_fn \
diff -Nur linux-2.5.8-vanilla/include/linux/mmzone.h linux-2.5.8-api/include/linux/mmzone.h
--- linux-2.5.8-vanilla/include/linux/mmzone.h Mon Apr 22 17:13:25 2002
+++ linux-2.5.8-api/include/linux/mmzone.h Fri Apr 26 17:15:28 2002
@@ -136,6 +136,7 @@
unsigned long node_start_mapnr;
unsigned long node_size;
int node_id;
+ int memblk_id; /* A unique ID for each memory block (physically contiguous chunk of memory) */
struct pglist_data *node_next;
} pg_data_t;
@@ -163,14 +164,15 @@
#define NODE_MEM_MAP(nid) mem_map
#define MAX_NR_NODES 1
-#else /* !CONFIG_DISCONTIGMEM */
+#endif /* !CONFIG_DISCONTIGMEM */
-#include <asm/mmzone.h>
+#if defined (CONFIG_DISCONTIGMEM) || defined (CONFIG_NUMA)
+#include <asm/mmzone.h>
/* page->zone is currently 8 bits ... */
#define MAX_NR_NODES (255 / MAX_NR_ZONES)
-#endif /* !CONFIG_DISCONTIGMEM */
+#endif /* CONFIG_DISCONTIGMEM || CONFIG_NUMA */
#define MAP_ALIGN(x) ((((x) % sizeof(mem_map_t)) == 0) ? (x) : ((x) + \
sizeof(mem_map_t) - ((x) % sizeof(mem_map_t))))
diff -Nur linux-2.5.8-vanilla/include/linux/numa.h linux-2.5.8-api/include/linux/numa.h
--- linux-2.5.8-vanilla/include/linux/numa.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.8-api/include/linux/numa.h Mon Apr 29 11:03:20 2002
@@ -0,0 +1,76 @@
+/*
+ * linux/include/linux/numa.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _LINUX_NUMA_H_
+#define _LINUX_NUMA_H_
+
+#include <linux/types.h>
+
+#ifdef CONFIG_NUMA
+#define NR_MEMBLKS 32 /* Max number of Memory Blocks */
+#else
+#define NR_MEMBLKS 1
+#endif
+
+typedef unsigned long numa_bitmap_t;
+#define NUMA_BITMAP_NONE (~((numa_bitmap_t) 0))
+
+#define CPU_BIND_STRICT 0
+
+#define MPOL_FIRST 1 /* UNUSED FOR NOW */
+#define MPOL_STRIPE 2 /* UNUSED FOR NOW */
+#define MPOL_RR 4 /* UNUSED FOR NOW */
+#define MPOL_STRICT 8 /* Memory MUST be allocated according to binding */
+#define MPOL_LOOSE 16 /* Memory must try to be allocated according to binding first,
+ and can fall back to restriction if necessary */
+
+
+typedef struct numa_list {
+ numa_bitmap_t list;
+ int behavior;
+} numa_list_t;
+
+typedef struct numa_set {
+ numa_list_t cpus;
+ numa_list_t memblks;
+} numa_set_t;
+
+
+/* Initializes a numa_set_t to be an empty set. */
+#define numa_set_init(x) do { (x)->cpus.list = NUMA_BITMAP_NONE;\
+ (x)->memblks.list = NUMA_BITMAP_NONE;\
+ (x)->cpus.behavior = CPU_BIND_STRICT;\
+ (x)->memblks.behavior = MPOL_STRICT; } while(0)
+
+/* Assignment initializer for a numa_set_t to be an empty set */
+#define NEW_NUMA_SET { {NUMA_BITMAP_NONE, CPU_BIND_STRICT}, \
+ {NUMA_BITMAP_NONE, MPOL_STRICT} }
+
+/* Tests whether a numa_set_t represents an empty restriction (ie: all 1's. All cpus/memblks allowed.) */
+#define null_restrict(x) (((x)->cpus.list == NUMA_BITMAP_NONE) && \
+ ((x)->memblks.list == NUMA_BITMAP_NONE))
+
+#endif /* _LINUX_NUMA_H_ */
diff -Nur linux-2.5.8-vanilla/include/linux/sched.h linux-2.5.8-api/include/linux/sched.h
--- linux-2.5.8-vanilla/include/linux/sched.h Mon Apr 22 17:13:27 2002
+++ linux-2.5.8-api/include/linux/sched.h Fri Apr 26 15:14:15 2002
@@ -28,6 +28,7 @@
#include <linux/securebits.h>
#include <linux/fs_struct.h>
#include <linux/compiler.h>
+#include <linux/numa.h>
struct exec_domain;
@@ -286,6 +287,12 @@
struct task_struct *pidhash_next;
struct task_struct **pidhash_pprev;
+ /* additional NUMA stuff */
+ numa_set_t numa_restrict;
+ numa_set_t numa_binding;
+ numa_set_t numa_launch_policy;
+ rwlock_t numa_api_lock; /* protects the preceding 3 structs */
+
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
diff -Nur linux-2.5.8-vanilla/include/linux/smp.h linux-2.5.8-api/include/linux/smp.h
--- linux-2.5.8-vanilla/include/linux/smp.h Mon Apr 22 17:13:25 2002
+++ linux-2.5.8-api/include/linux/smp.h Fri Apr 26 15:14:15 2002
@@ -90,6 +90,7 @@
#define cpu_number_map(cpu) 0
#define smp_call_function(func,info,retry,wait) ({ 0; })
#define cpu_online_map 1
+#define memblk_online_map 1
static inline void smp_send_reschedule(int cpu) { }
static inline void smp_send_reschedule_all(void) { }
#define __per_cpu_data
diff -Nur linux-2.5.8-vanilla/kernel/sched.c linux-2.5.8-api/kernel/sched.c
--- linux-2.5.8-vanilla/kernel/sched.c Mon Apr 22 13:17:43 2002
+++ linux-2.5.8-api/kernel/sched.c Mon Apr 22 15:35:16 2002
@@ -357,7 +357,7 @@
runqueue_t *rq;
preempt_disable();
- rq = this_rq();
+ rq = task_rq(p);
spin_lock_irq(&rq->lock);
p->state = TASK_RUNNING;
@@ -371,7 +371,6 @@
p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
p->prio = effective_prio(p);
}
- p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);
spin_unlock_irq(&rq->lock);
@@ -1662,8 +1661,7 @@
migration_req_t req;
runqueue_t *rq;
- new_mask &= cpu_online_map;
- if (!new_mask)
+ if (!(new_mask & cpu_online_map))
BUG();
preempt_disable();
[-- Attachment #5: numa_api-prctl-2.5.14.patch --]
[-- Type: text/plain, Size: 3053 bytes --]
diff -Nur linux-2.5.8-vanilla/include/linux/prctl.h linux-2.5.8-api/include/linux/prctl.h
--- linux-2.5.8-vanilla/include/linux/prctl.h Sun Apr 14 12:18:54 2002
+++ linux-2.5.8-api/include/linux/prctl.h Wed Apr 24 17:31:33 2002
@@ -26,4 +26,31 @@
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */
+/* Get/Set Restricted CPUs and MemBlks */
+#define PR_SET_RESTRICTED_CPUS 11
+#define PR_SET_RESTRICTED_MEMBLKS 12
+#define PR_GET_RESTRICTED_CPUS 13
+#define PR_GET_RESTRICTED_MEMBLKS 14
+
+/* Get CPU/Node */
+#define PR_GET_CPU 15
+#define PR_GET_NODE 16
+
+/* X to Node conversion functions */
+#define PR_CPU_TO_NODE 17
+#define PR_MEMBLK_TO_NODE 18
+#define PR_NODE_TO_NODE 19
+
+/* Node to X conversion functions */
+#define PR_NODE_TO_CPU 20
+#define PR_NODE_TO_MEMBLK 21
+
+/* Set CPU/MemBlk/Memory Bindings */
+#define PR_BIND_TO_CPUS 22
+#define PR_BIND_TO_MEMBLKS 23
+#define PR_BIND_MEMORY 24
+
+/* Set Launch Policy */
+#define PR_SET_LAUNCH_POLICY 25
+
#endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.5.8-vanilla/kernel/sys.c linux-2.5.8-api/kernel/sys.c
--- linux-2.5.8-vanilla/kernel/sys.c Sun Apr 14 12:18:45 2002
+++ linux-2.5.8-api/kernel/sys.c Wed Apr 24 17:32:17 2002
@@ -16,6 +16,7 @@
#include <linux/highuid.h>
#include <linux/fs.h>
#include <linux/device.h>
+#include <linux/numa.h>
#include <asm/uaccess.h>
#include <asm/io.h>
@@ -1277,6 +1278,51 @@
break;
}
current->keep_capabilities = arg2;
+ break;
+ case PR_SET_RESTRICTED_CPUS:
+ error = (long) set_restricted_cpus((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+ break;
+ case PR_SET_RESTRICTED_MEMBLKS:
+ error = (long) set_restricted_memblks((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+ break;
+ case PR_GET_RESTRICTED_CPUS:
+ error = (long) get_restricted_cpus();
+ break;
+ case PR_GET_RESTRICTED_MEMBLKS:
+ error = (long) get_restricted_memblks();
+ break;
+ case PR_GET_CPU:
+ error = (long) get_cpu();
+ break;
+ case PR_GET_NODE:
+ error = (long) get_node();
+ break;
+ case PR_CPU_TO_NODE:
+ error = (long) cpu_to_node((int)arg2);
+ break;
+ case PR_MEMBLK_TO_NODE:
+ error = (long) memblk_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_NODE:
+ error = (long) node_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_CPU:
+ error = (long) node_to_cpu((int)arg2);
+ break;
+ case PR_NODE_TO_MEMBLK:
+ error = (long) node_to_memblk((int)arg2);
+ break;
+ case PR_BIND_TO_CPUS:
+ error = (long) bind_to_cpu((numa_bitmap_t)arg2, (int)arg3);
+ break;
+ case PR_BIND_TO_MEMBLKS:
+ error = (long) bind_to_memblk((numa_bitmap_t)arg2, (int)arg3);
+ break;
+ case PR_BIND_MEMORY:
+ error = (long) bind_memory((unsigned long)arg2, (size_t)arg3, (numa_bitmap_t)arg4, (int)arg5);
+ break;
+ case PR_SET_LAUNCH_POLICY:
+ error = (long) set_launch_policy((numa_bitmap_t)arg2, (int)arg3, (numa_bitmap_t)arg4, (int)arg5);
break;
default:
error = -EINVAL;
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:47 ` Linus Torvalds
@ 2002-06-19 10:21 ` Padraig Brady
1 sibling, 0 replies; 97+ messages in thread
From: Padraig Brady @ 2002-06-19 10:21 UTC (permalink / raw)
To: Cort Dougan; +Cc: Benjamin LaHaise, Linux Kernel Mailing List
Cort Dougan wrote:
> I agree with you there. It's not easy, and I'd claim it's not possible
> given that no-one has done it yet, to have a select() call that is speedy
> for both 0-10 and 1k file descriptors.
Have you noticed yesterday's and today's fixup patch from Andi Kleen:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102446644619648&w=2
Padraig.
* Re: latest linus-2.5 BK broken
2002-06-18 21:47 ` Linus Torvalds
@ 2002-06-19 12:29 ` Eric W. Biederman
2002-06-19 17:27 ` Linus Torvalds
0 siblings, 1 reply; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-19 12:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
Linus Torvalds <torvalds@transmeta.com> writes:
> If we end up using a default of 1024, maybe you'll have to recompile that
> part of the system that has anything to do with CPU affinity in about
> 10-20 years by just upping the number a bit. Quite frankly, that's going
> to be the _least_ of the issues.
:)
10-20 years or someone finds a good way to implement a single system
image on linux clusters. They are already into the 1000s of nodes,
and dual processors per node category. And as things continue they
might even grow bigger.
Eric
* Re: latest linus-2.5 BK broken
2002-06-18 20:55 ` Robert Love
@ 2002-06-19 13:31 ` Rusty Russell
0 siblings, 0 replies; 97+ messages in thread
From: Rusty Russell @ 2002-06-19 13:31 UTC (permalink / raw)
To: Robert Love; +Cc: Linus Torvalds, Linux Kernel Mailing List
In message <1024433739.922.236.camel@sinai> you write:
> On Tue, 2002-06-18 at 13:31, Rusty Russell wrote:
>
> > No, you have accepted a non-portable userspace interface and put it in
> > generic code. THAT is idiotic.
> >
> > So any program that doesn't use the following is broken:
>
> On top of what Linus replied, there is the issue that if your task does
> not know how many CPUs can be in the system then setting its affinity is
> worthless in 90% of the cases.
No. You can read the cpus out of /proc/cpuinfo, and say "I want to be
on <some cpu I found>" or "I want one copy for each processor", or
even "I want every processor but the one the other task just bound
to". This is 99% of actual usage.
But I can see the man page now:
The third arg to set/getaffinity is the size of a kernel data
structure. There is no way to know this size: it is dependent
on architecture and kernel configuration. You can pass a
larger datastructure and the higher bits are ignored: try
1024?
> I.e., everyone today can write code like
>
> sched_setaffinity(0, sizeof(unsigned long), &mask)
NO THEY CAN'T. How will ia64 deal with this in ia32 binaries? How
will Sparc64 deal with this in 32-bit binaries? How will PPC64 deal
with this in PPC32 binaries? How will x86_64 deal with this in x86
binaries?
They'll have to either break compatibility, or guess and fill
accordingly.
And when new CPUS come online? At the moment you effectively
zero-fill, because you can't tell what you're supposed to do here. So
you can never truly reset your affinity once it's set.
> Summarily, setting CPU affinity is something that is naturally low-level
> enough it only makes sense when you know what you are setting and not
> setting. While a mask of -1 may always make sense, random bitmaps
> (think RT stuff here) are explicit for the number of CPUs given.
You've designed an interface where the easiest thing to do is the
wrong thing (as per your example). This is the hallmark of bad
design.
*If* there had been a way to tell the bitmask size which was
introduced at the same time, it might have been acceptable. But there
isn't at the moment, so people are writing bugs right now.
Untested patch below, seems to compile (hard to tell since PPC is
v. broken right now)
Summary:
1) Easy to write portable "set this cpu" code.
2) Both system calls now handle NR_CPUS > sizeof(long)*8.
3) Things which have set affinity once can now get back on new cpus as
they come up.
4) Trivial to extend for hyperthreading on a per-arch basis.
Linus, think and apply,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
--- linux-2.5.22/include/linux/affinity.h Thu Jan 1 10:00:00 1970
+++ working-2.5.22-linus/include/linux/affinity.h Wed Jun 19 22:09:47 2002
@@ -0,0 +1,9 @@
+#ifndef _LINUX_AFFINITY_H
+#define _LINUX_AFFINITY_H
+enum {
+ /* Set affinity to these processors */
+ LINUX_AFFINITY_INCLUDE,
+ /* Set affinity to all *but* these processors */
+ LINUX_AFFINITY_EXCLUDE,
+};
+#endif
--- working-2.5.22-linus/kernel/sched.c.~1~ Tue Jun 18 23:48:03 2002
+++ working-2.5.22-linus/kernel/sched.c Wed Jun 19 23:28:32 2002
@@ -26,6 +26,7 @@
#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/kernel_stat.h>
+#include <linux/affinity.h>
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -1309,25 +1310,57 @@
/**
* sys_sched_setaffinity - set the cpu affinity of a process
* @pid: pid of the process
+ * @include: is this include or exclude?
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
- * @user_mask_ptr: user-space pointer to the new cpu mask
+ * @user_mask_ptr: user-space pointer to bitmask of cpus to include/exclude
*/
-asmlinkage int sys_sched_setaffinity(pid_t pid, unsigned int len,
- unsigned long *user_mask_ptr)
+asmlinkage int sys_sched_setaffinity(pid_t pid,
+ int include,
+ unsigned int len,
+ unsigned char *user_mask_ptr)
{
- unsigned long new_mask;
+ bitmap_member(new_mask, NR_CPUS);
task_t *p;
int retval;
+ unsigned int i;
- if (len < sizeof(new_mask))
- return -EINVAL;
-
- if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
+ memset(new_mask, 0x00, sizeof(new_mask));
+ if (copy_from_user(new_mask, user_mask_ptr,
+ min((size_t)len, sizeof(new_mask))))
return -EFAULT;
- new_mask &= cpu_online_map;
- if (!new_mask)
+ /* longer is OK, as long as they don't actually set any of the bits. */
+ if (len > sizeof(new_mask)) {
+ unsigned char c;
+ for (i = sizeof(new_mask); i < len; i++) {
+ if (get_user(c, user_mask_ptr+i))
+ return -EFAULT;
+ if (c != 0)
+ return -ENOENT;
+ }
+ }
+
+ /* Check for cpus that aren't online/don't exist */
+ for (i = 0; i < sizeof(new_mask) * 8; i++) {
+ if (i >= NR_CPUS || !cpu_online(i)) {
+ if (test_bit(i, new_mask))
+ return -ENOENT;
+ }
+ }
+
+ /* Invert the mask in the exclude case. */
+ if (include == LINUX_AFFINITY_EXCLUDE) {
+ for (i = 0; i < ARRAY_SIZE(new_mask); i++)
+ new_mask[i] = ~new_mask[i];
+ } else if (include != LINUX_AFFINITY_INCLUDE) {
return -EINVAL;
+ }
+
+ /* The new mask must mention some online cpus */
+ for (i = 0; !cpu_online(i) || !test_bit(i, new_mask); i++)
+ if (i == NR_CPUS-1)
+ /* This is kinda true... */
+ return -EWOULDBLOCK;
read_lock(&tasklist_lock);
@@ -1351,7 +1384,8 @@
goto out_unlock;
retval = 0;
- set_cpus_allowed(p, new_mask);
+ /* FIXME: set_cpus_allowed should take an array... */
+ set_cpus_allowed(p, new_mask[0]);
out_unlock:
put_task_struct(p);
@@ -1363,37 +1397,27 @@
* @pid: pid of the process
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
* @user_mask_ptr: user-space pointer to hold the current cpu mask
+ * Returns the size required to hold the complete cpu mask.
*/
asmlinkage int sys_sched_getaffinity(pid_t pid, unsigned int len,
- unsigned long *user_mask_ptr)
+ void *user_mask_ptr)
{
- unsigned long mask;
- unsigned int real_len;
+ bitmap_member(mask, NR_CPUS) = { 0 };
task_t *p;
- int retval;
-
- real_len = sizeof(mask);
-
- if (len < real_len)
- return -EINVAL;
read_lock(&tasklist_lock);
-
- retval = -ESRCH;
p = find_process_by_pid(pid);
- if (!p)
- goto out_unlock;
-
- retval = 0;
- mask = p->cpus_allowed & cpu_online_map;
-
-out_unlock:
+ if (!p) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ memcpy(mask, &p->cpus_allowed, sizeof(p->cpus_allowed));
read_unlock(&tasklist_lock);
- if (retval)
- return retval;
- if (copy_to_user(user_mask_ptr, &mask, real_len))
+
+ if (copy_to_user(user_mask_ptr, &mask,
+ min((unsigned)sizeof(p->cpus_allowed), len)))
return -EFAULT;
- return real_len;
+ return sizeof(p->cpus_allowed);
}
asmlinkage long sys_sched_yield(void)
@@ -1727,9 +1751,11 @@
migration_req_t req;
runqueue_t *rq;
+#if 0 /* This is checked for userspace, and kernel shouldn't do this */
new_mask &= cpu_online_map;
if (!new_mask)
BUG();
+#endif
preempt_disable();
rq = task_rq_lock(p, &flags);
* Re: latest linus-2.5 BK broken
2002-06-19 0:12 ` Linus Torvalds
@ 2002-06-19 15:23 ` Rusty Russell
2002-06-19 16:28 ` Linus Torvalds
0 siblings, 1 reply; 97+ messages in thread
From: Rusty Russell @ 2002-06-19 15:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kernel Mailing List
In message <Pine.LNX.4.33.0206181701240.2562-100000@penguin.transmeta.com> you
write:
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > - new_mask &= cpu_online_map;
> > + /* Eliminate offline cpus from the mask */
> > + for (i = 0; i < NR_CPUS; i++)
> > + if (!cpu_online(i))
> > + new_mask &= ~(1<<i);
> > +
>
> And why can't cpu_online_map be a bitmap?
>
> What's your beef against sane and efficient data structures? The above is
> just crazy.
Oh, it can be. I wasn't going to require something from all archs for
this one case (well, it was more like zero cases when I first did the
patch).
> and then add a few simple operations like
>
> cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
Sure... or just make all archs supply a "cpus_online_of(mask)" which
does that, unless there are other interesting cases. Or we can go the
other way and have a general "and_region(void *res, void *a, void *b,
int len)". Which one do you want?
> This is not rocket science, and I find it ridiculous that you claim to
> worry about scaling up to thousands of CPU's, and then you try to send me
> absolute crap like the above which clearly is unacceptable for lots of
> CPU's.
Spinning 1000 times doesn't faze me until someone complains.
Breaking userspace code does. One can be fixed if it proves to be a
bottleneck. Understand?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: latest linus-2.5 BK broken
2002-06-19 15:23 ` Rusty Russell
@ 2002-06-19 16:28 ` Linus Torvalds
2002-06-19 20:57 ` Rusty Russell
0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-19 16:28 UTC (permalink / raw)
To: Rusty Russell; +Cc: Kernel Mailing List
On Thu, 20 Jun 2002, Rusty Russell wrote:
> > and then add a few simple operations like
> >
> > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
>
> Sure... or just make all archs supply a "cpus_online_of(mask)" which
> does that, unless there are other interesting cases. Or we can go the
> other way and have a general "and_region(void *res, void *a, void *b,
> int len)". Which one do you want?
There are definitely other "interesting" cases that already do the full
bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/
sigfillset. It's really the exact same code, and the exact same issues.
The problem with a generic "and_region" is that it's a slight amount of
work to make sure that we optimize for the common cases (and since I'm not
a huge believer in hundreds of nodes, I consider the common case to be a
single word). And do things like just automatically get the UP case right:
which we do right now by just virtue of having a constant cpu_online_mask,
and letting the compiler just do the (obvious) optimizations.
I'm a _huge_ believer in having generic code that is automatically
optimized away by the compiler into nothingness. (And by contrast, I
absolutely _detest_ #ifdef's in source code that makes those optimizations
explicit). But that sometimes requires some thought, notably making sure
that all constants hang around as constants all the way to the code
generation phase (this tends to mean inline functions and #defines).
It _would_ probably be worthwhile to try to have better support for
"bitmaps" as real kernel data structures, since we actually have this
problem in multiple places. Right now we already use bitmaps for signal
handling (one or two words, constant size), for FD_SET's (variable size),
for various filesystems (variable size, largish), and for a lot of random
drivers (some variable, some constant).
It wasn't that long ago that I added a "bitmap_member()" macro to
<linux/types.h> to declare bitmaps exactly because a lot of people _were_
doing it and getting it wrong. Actually, the most common case was not a
bug, but a latent problem with code that did something like
unsigned char bitmap[BITMAP_SIZE/8];
which works on x86 as long as the bitmap size was a multiple of 8.
It would probably make sense to make a real <linux/bitmap.h>, move the
bitmap_member() there (and rename to "bitmap_declare()" - it's called
member because all the places I first looked at were structure members),
and add some simple generic routines for handling these things.
(We've obviously had the bit_set/clear/test() stuff forever, but the more
involved stuff should be fairly easy to abstract out too, instead of
having special functions for signal masks).
> Breaking userspace code does. One can be fixed if it proves to be a
> bottleneck. Understand?
What I don't understand is why you don't accept the fact that these
things can be considered infinitely big. There's nothing fundamentally
wrong with static allocation.
People who build thousand-node systems _are_ going to compile their own
distribution. Trust me. They aren't just going to slap down redhat-7.3 on
a 16k-node ASCI Purple. It makes no sense to do that. They may want to run
quake or something standard on it without recompiling, but especially the
maintenance stuff - the stuff which cares about CPU affinity - is a
nobrainer.
So you can easily just accept the fact that at some point the max number
of CPU's can be considered fixed. And that "some point" isn't even very
high, especially since bitmaps _are_ so dense that there is basically no
overhead to just starting out with
#define MAX_CPU (1024)
bitmap_declare(cpu_bitmap, MAX_CPU);
and let it be at that. That 1024 is already ridiculously high, in my
opinion - simply because people who are playing with bigger numbers _are_
going to be able to just increase the number and recompile.
Linus
* Re: latest linus-2.5 BK broken
2002-06-19 12:29 ` Eric W. Biederman
@ 2002-06-19 17:27 ` Linus Torvalds
2002-06-20 3:57 ` Eric W. Biederman
0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-19 17:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On 19 Jun 2002, Eric W. Biederman wrote:
>
> 10-20 years or someone finds a good way to implement a single system
> image on linux clusters. They are already into the 1000s of nodes,
> and dual processors per node category. And as things continue they
> might even grow bigger.
Oh, clusters are a separate issue. I'm absolutely 100% conviced that you
don't want to have a "single kernel" for a cluster, you want to run
independent kernels with good communication infrastructure between them
(ie global filesystem, and try to make the networking look uniform).
Trying to have a single kernel for thousands of nodes is just crazy. Even
if the system were ccNuma and _could_ do it in theory.
The NuMA work can probably take single-kernel to maybe 64+ nodes, before
people just start turning stark raving mad. There's no way you'll have
single-kernel for thousands of CPU's, and still stay sane and claim any
reasonable performance under generic loads.
So don't confuse the issue with clusters like that. The "set_affinity()"
call simply doesn't have anything to do with them. If you want to move
processes between nodes on such a cluster, you'll probably need user-level
help, the kernel is unlikely to do it for you.
Linus
* Re: latest linus-2.5 BK broken
2002-06-19 16:28 ` Linus Torvalds
@ 2002-06-19 20:57 ` Rusty Russell
0 siblings, 0 replies; 97+ messages in thread
From: Rusty Russell @ 2002-06-19 20:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kernel Mailing List, paulus
In message <Pine.LNX.4.44.0206190907520.2053-100000@home.transmeta.com> you write:
>
>
> On Thu, 20 Jun 2002, Rusty Russell wrote:
> > > and then add a few simple operations like
> > >
> > > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
> >
> > Sure... or just make all archs supply a "cpus_online_of(mask)" which
> > does that, unless there are other interesting cases. Or we can go the
> > other way and have a general "and_region(void *res, void *a, void *b,
> > int len)". Which one do you want?
>
> There are definitely other "interesting" cases that already do the full
> bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/
> sigfillset. It's really the exact same code, and the exact same issues.
>
> The problem with a generic "and_region" is that it's a slight amount of
> work to make sure that we optimize for the common cases (and since I'm not
> a huge believer in hundreds of nodes, I consider the common case to be a
> single word). And do things like just automatically get the UP case right:
> which we do right now by just virtue of having a constant cpu_online_mask,
> and letting the compiler just do the (obvious) optimizations.
Sure, completely agreed.
Normal tricks here: 1 long turns into equivalent to dst = a & b, the
other cases are handled with varying amount of suckiness. Code and
optimization tested on 2.95.4 and 3.0.4 (both PPC), kernel compiled on
my x86 box back in .au.
> It would probably make sense to make a real <linux/bitmap.h>, move the
> bitmap_member() there (and rename to "bitmap_declare()" - it's called
> member because all the places I first looked at were structure members),
> and add some simple generic routines for handling these things.
I renamed it to DECLARE_BITMAP() to match list, mutex et al. and moved
it to linux/bitops.h.
PS. Please sort out merging with Paulus's stuff: I'd like to compile
on PPC soon since I'm laptop-only for two more weeks 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/bitops.h working-2.5.23-bitops/include/linux/bitops.h
--- linux-2.5.23/include/linux/bitops.h Fri Jun 7 13:59:07 2002
+++ working-2.5.23-bitops/include/linux/bitops.h Thu Jun 20 06:55:51 2002
@@ -2,6 +2,27 @@
#define _LINUX_BITOPS_H
#include <asm/bitops.h>
+#define DECLARE_BITMAP(name,bits) \
+ unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG]
+
+#ifndef HAVE_ARCH_AND_REGION
+void __and_region(unsigned long num, unsigned char *dst,
+ const unsigned char *a, const unsigned char *b);
+#endif
+
+/* For the moment, handle 1 long case fast, leave rest to __and_region. */
+#define and_region(num,dst,a,b) \
+do { \
+ if (__alignof__(*(a)) == __alignof__(long) \
+ && __alignof__(*(b)) == __alignof__(long) \
+ && __builtin_constant_p(num) \
+ && (num) == sizeof(long)) { \
+ *((unsigned long *)(dst)) = \
+ (*(unsigned long *)(a) & *(unsigned long *)(b)); \
+ } else \
+ __and_region((num), (void*)(dst), (void*)(a), (void*)(b)); \
+} while(0)
+
/*
* ffs: find first bit set. This is defined the same way as
* the libc and compiler builtin ffs routines, therefore
@@ -106,8 +127,5 @@
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res & 0x0F) + ((res >> 4) & 0x0F);
}
-
-#include <asm/bitops.h>
-
#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/types.h working-2.5.23-bitops/include/linux/types.h
--- linux-2.5.23/include/linux/types.h Mon Jun 17 23:19:25 2002
+++ working-2.5.23-bitops/include/linux/types.h Thu Jun 20 06:14:39 2002
@@ -3,9 +3,6 @@
#ifdef __KERNEL__
#include <linux/config.h>
-
-#define bitmap_member(name,bits) \
- unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG]
#endif
#include <linux/posix_types.h>
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/sound/ac97_codec.h working-2.5.23-bitops/include/sound/ac97_codec.h
--- linux-2.5.23/include/sound/ac97_codec.h Mon Jun 17 23:19:25 2002
+++ working-2.5.23-bitops/include/sound/ac97_codec.h Thu Jun 20 06:31:35 2002
@@ -25,6 +25,7 @@
*
*/
+#include <linux/bitops.h>
#include "control.h"
#include "info.h"
@@ -160,7 +161,7 @@
unsigned int rates_mic_adc;
unsigned int spdif_status;
unsigned short regs[0x80]; /* register cache */
- bitmap_member(reg_accessed, 0x80); /* bit flags */
+ DECLARE_BITMAP(reg_accessed, 0x80); /* bit flags */
union { /* vendor specific code */
struct {
unsigned short unchained[3]; // 0 = C34, 1 = C79, 2 = C69
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/Makefile working-2.5.23-bitops/kernel/Makefile
--- linux-2.5.23/kernel/Makefile Mon Jun 10 16:03:56 2002
+++ working-2.5.23-bitops/kernel/Makefile Thu Jun 20 06:27:29 2002
@@ -10,12 +10,12 @@
O_TARGET := kernel.o
export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
- printk.o platform.o suspend.o
+ printk.o platform.o suspend.o bitops.o
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o bitops.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/bitops.c working-2.5.23-bitops/kernel/bitops.c
--- linux-2.5.23/kernel/bitops.c Thu Jan 1 10:00:00 1970
+++ working-2.5.23-bitops/kernel/bitops.c Thu Jun 20 06:52:29 2002
@@ -0,0 +1,32 @@
+#include <linux/config.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+#ifndef HAVE_ARCH_AND_REGION
+/* Generic is fairly stupid: archs should optimize properly. */
+void __and_region(unsigned long num, unsigned char *dst,
+ const unsigned char *a, const unsigned char *b)
+{
+ unsigned long i;
+
+ /* Copy first bytes, until one is long aligned. */
+ for (i = 0; i < num && ((unsigned long)a+i) % __alignof__(long); i++)
+ dst[i] = (a[i] & b[i]);
+
+ /* If they are all aligned, do long-at-a-time copy */
+ if (((unsigned long)b+i)%__alignof__(long) == 0
+ && ((unsigned long)dst+i)%__alignof__(long) == 0) {
+ for (; i + sizeof(long) <= num; i += sizeof(long)) {
+ *(unsigned long *)(dst+i)
+ = (*(unsigned long *)(a+i)
+ & *(unsigned long *)(b+i));
+ }
+ }
+
+ /* Do whatever is left. */
+ for (; i < num; i++)
+ dst[i] = (a[i] & b[i]);
+}
+
+EXPORT_SYMBOL(__and_region);
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_clientmgr.h working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h
--- linux-2.5.23/sound/core/seq/seq_clientmgr.h Mon Jun 17 23:19:26 2002
+++ working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h Thu Jun 20 06:34:16 2002
@@ -53,8 +53,8 @@
char name[64]; /* client name */
int number; /* client number */
unsigned int filter; /* filter flags */
- bitmap_member(client_filter, 256);
- bitmap_member(event_filter, 256);
+ DECLARE_BITMAP(client_filter, 256);
+ DECLARE_BITMAP(event_filter, 256);
snd_use_lock_t use_lock;
int event_lost;
/* ports */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_queue.h working-2.5.23-bitops/sound/core/seq/seq_queue.h
--- linux-2.5.23/sound/core/seq/seq_queue.h Mon Jun 17 23:19:26 2002
+++ working-2.5.23-bitops/sound/core/seq/seq_queue.h Thu Jun 20 06:34:11 2002
@@ -26,6 +26,7 @@
#include "seq_lock.h"
#include <linux/interrupt.h>
#include <linux/list.h>
+#include <linux/bitops.h>
#define SEQ_QUEUE_NO_OWNER (-1)
@@ -51,7 +52,7 @@
spinlock_t check_lock;
/* clients which uses this queue (bitmap) */
- bitmap_member(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS);
+ DECLARE_BITMAP(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS);
unsigned int clients; /* users of this queue */
struct semaphore timer_mutex;
* Re: latest linus-2.5 BK broken
2002-06-18 23:57 ` Ingo Molnar
2002-06-19 0:08 ` Ingo Molnar
2002-06-19 1:00 ` Matthew Dobson
@ 2002-06-19 23:48 ` Michael Hohnbaum
2 siblings, 0 replies; 97+ messages in thread
From: Michael Hohnbaum @ 2002-06-19 23:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Rusty Russell, Robert Love, linux-kernel,
Matthew Dobson
On Tue, 2002-06-18 at 16:57, Ingo Molnar wrote:
>
> On 18 Jun 2002, Michael Hohnbaum wrote:
>
> > [...] I would suggest an additional argument be added
> > which would indicate the resource that the process is to be
> > affined to. That way this interface could be used for binding
> > processes to cpus, memory nodes, perhaps NUMA nodes, and,
> > as discussed recently in another thread, other processes.
> > Personally, I see NUMA nodes as an overkill, if a process
> > can be bound to cpus and memory nodes.
>
> are you sure we want one generic, process-based affinity interface?
No, I'm not sure that is what we want. I see that as a compromise
solution. Something that would allow some of the simple binding
capabilities, but not necessarily a full blown solution.
I agree with your comments below that memory binding/allocation is
much more complex than CPU binding, so additional flexibility in
specifying memory binding is needed. However, wanting to start
simple, the first step is to affine a process to memory on one or
more nodes.
> i think the affinity to certain memory regions might need to be more
> finegrained than this. Eg. it could be useful to define a per-file
> (per-inode) 'backing store memory node' that the file is affine to. This
> will eg. cause the pagecache to be allocated in the memory node.
> Process-based affinity does not describe this in a natural way. Another
> example, memory maps: we might want to have a certain memory map (vma)
> allocated in a given memory node, independently of where the process that
> is faulting a given pages resides.
>
> and it might certainly make sense to have some sort of 'default memory
> affinity' for a process as well, but this should be a different syscall -
This is close to what is currently implemented: memory is allocated,
by default, on the node that the process is executing on when the
request for memory is made.
span node boundaries, it performs better to dispatch the process on only
one node (provided the CPU cycles are available). The NUMA extensions
to the scheduler try to do this. Similarly, all memory for a process
should be allocated from that one node. If memory is exhausted on
that node, any other nodes that the process has affinity to cpus on
should then be used. In other words, each process should have a home
node that is preferred for dispatch and memory allocation. The process
may have affinity to other nodes, which would be used only if the home
quad had a significant resource shortage.
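The home-node policy described above can be sketched as a simple selection loop (entirely hypothetical names and data layout; this is not the actual NUMA scheduler or allocator code):

```c
#include <assert.h>

#define MAX_NODES 8
#define NO_NODE   (-1)

struct task_placement {
    int home_node;           /* preferred node for dispatch and memory */
    int affine[MAX_NODES];   /* 1 if the task may use this node's CPUs */
};

/* In a real kernel this would come from the VM; here it is test data. */
static long free_pages_on_node[MAX_NODES];

/* Pick a node for an allocation: the home node first, then any other
 * node the task has CPU affinity to, mirroring the fallback order in
 * the paragraph above. */
static int pick_alloc_node(const struct task_placement *tp, long pages_needed)
{
    int node;

    if (free_pages_on_node[tp->home_node] >= pages_needed)
        return tp->home_node;

    for (node = 0; node < MAX_NODES; node++)
        if (node != tp->home_node && tp->affine[node] &&
            free_pages_on_node[node] >= pages_needed)
            return node;

    return NO_NODE;  /* caller would fall back to global allocation/reclaim */
}
```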
> it really does a much different thing than CPU affinity. The CPU resource
> is 'used' only temporarily with little footprint, while memory usage is
> often for a very long timespan, and the affinity strategies differ
> greatly. Also, memory as a resource is much more complex than CPU, eg. it
> must handle things like over-allocation, fallback to 'nearby' nodes if a
> node is full, etc.
>
> so i'd suggest to actually create a good memory-affinity syscall interface
> instead of trying to generalize it into the simple, robust, finite
> CPU-affinity syscalls.
We have attempted to do that. Please look at the API definition at
http://lse.sourceforge.net/numa/numa_api.html. If it would help,
we could break out just the memory portion of this API (both in the
specification and the implementation) and submit those for comment.
What do you think?
>
> Ingo
>
Michael Hohnbaum
hohnbaum@us.ibm.com
* Re: latest linus-2.5 BK broken
2002-06-19 17:27 ` Linus Torvalds
@ 2002-06-20 3:57 ` Eric W. Biederman
2002-06-20 5:24 ` Larry McVoy
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
0 siblings, 2 replies; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-20 3:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
Linus Torvalds <torvalds@transmeta.com> writes:
> On 19 Jun 2002, Eric W. Biederman wrote:
> >
> > 10-20 years, or until someone finds a good way to implement a single system
> > image on linux clusters. They are already into the 1000s of nodes,
> > and dual processors per node category. And as things continue they
> > might even grow bigger.
>
> Oh, clusters are a separate issue. I'm absolutely 100% convinced that you
> don't want to have a "single kernel" for a cluster, you want to run
> independent kernels with good communication infrastructure between them
> (ie global filesystem, and try to make the networking look uniform).
>
> Trying to have a single kernel for thousands of nodes is just crazy. Even
> if the system were ccNuma and _could_ do it in theory.
I totally agree; mostly I was playing devil's advocate. The model
actually in my head is one where you have multiple kernels, but they talk
well enough that the applications only have to care in areas where it
makes a performance difference (there's got to be one of those).
> The NuMA work can probably take single-kernel to maybe 64+ nodes, before
> people just start turning stark raving mad. There's no way you'll have
> single-kernel for thousands of CPU's, and still stay sane and claim any
> reasonable performance under generic loads.
>
> So don't confuse the issue with clusters like that. The "set_affinity()"
> call simply doesn't have anything to do with them. If you want to move
> processes between nodes on such a cluster, you'll probably need user-level
> help, the kernel is unlikely to do it for you.
Agreed.
The compute cluster problem is an interesting one. The big items
I see on the todo list are:
- Scalable fast distributed file system (Lustre looks like a
possibility)
- Sub application level checkpointing.
Services like schedulers already exist.
Basically the job of a cluster scheduler gets much easier, and the
scheduler more powerful once it gets the ability to suspend jobs.
Checkpointing buys three things: the ability to preempt jobs, the
ability to migrate processes, and the ability to recover from failed
nodes (assuming the failed hardware didn't corrupt your job's
checkpoint).
Once solutions to the cluster problems become well understood I
wouldn't be surprised if some of the supporting services started to
live in the kernel like nfsd. Parts of the distributed filesystem
certainly will.
I suspect process checkpointing and restoring will evolve something
like pthread support: some code in user space, and some generic
helpers in the kernel as clean pieces of the job can be broken off.
broken off. The challenge is only how to save/restore interprocess
communications. Things like moving a tcp connection from one node to
another are interesting problems.
But also I suspect most of the hard problems that we need kernel help
with can have uses independent of checkpointing. Already we have web
server farms that spread connections to a single ip across nodes.
Eric
* Re: latest linus-2.5 BK broken
2002-06-20 3:57 ` Eric W. Biederman
@ 2002-06-20 5:24 ` Larry McVoy
2002-06-20 7:26 ` Andreas Dilger
` (2 more replies)
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
1 sibling, 3 replies; 97+ messages in thread
From: Larry McVoy @ 2002-06-20 5:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
> I totally agree; mostly I was playing devil's advocate. The model
> actually in my head is one where you have multiple kernels, but they talk
> well enough that the applications only have to care in areas where it
> makes a performance difference (there's got to be one of those).
....
> The compute cluster problem is an interesting one. The big items
> I see on the todo list are:
>
> - Scalable fast distributed file system (Lustre looks like a
> possibility)
> - Sub application level checkpointing.
>
> Services like schedulers already exist.
>
> Basically the job of a cluster scheduler gets much easier, and the
> scheduler more powerful once it gets the ability to suspend jobs.
> Checkpointing buys three things: the ability to preempt jobs, the
> ability to migrate processes, and the ability to recover from failed
> nodes (assuming the failed hardware didn't corrupt your job's
> checkpoint).
>
> Once solutions to the cluster problems become well understood I
> wouldn't be surprised if some of the supporting services started to
> live in the kernel like nfsd. Parts of the distributed filesystem
> certainly will.
http://www.bitmover.com/cc-pitch
I've been trying to get Linus to listen to this for years and he keeps
on flogging the tired SMP horse instead. DEC did it and Sun has been
passing around these slides for a few weeks, so maybe they'll do it too.
Then Linux can join the party after it has become a fine grained,
locked to hell and back, soft "realtime", numa enabled, bloated piece
of crap like all the other kernels and we'll get to go through the
"let's reinvent Unix for the 3rd time in 40 years" all over again.
What fun. Not.
Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy
to talk it over with anyone who wants to think about it. Paul McKenney
from IBM came down to San Francisco to talk to me about it, put me
through an 8- or 9-hour session which felt like a PhD exam, and
after trying to poke holes in it grudgingly let on that maybe it was
a good idea. He was kind enough to write up what he took away
from it; here it is.
--lm
From: "Paul McKenney" <Paul.McKenney@us.ibm.com>
To: lm@bitmover.com, tytso@mit.edu
Subject: Greatly enjoyed our discussion yesterday!
Date: Fri, 9 Nov 2001 18:48:56 -0800
Hello!
I greatly enjoyed our discussion yesterday! Here are the pieces of it that
I recall, I know that you will not be shy about correcting any errors and
omissions.
Thanx, Paul
Larry McVoy's SMP Clusters
Discussion on November 8, 2001
Larry McVoy, Ted T'so, and Paul McKenney
What is SMP Clusters?
SMP Clusters is a method of partitioning an SMP (symmetric
multiprocessing) machine's CPUs, memory, and I/O devices
so that multiple "OSlets" run on this machine. Each OSlet
owns and controls its partition. A given partition is
expected to contain from 4-8 CPUs, its share of memory,
and its share of I/O devices. A machine large enough to
have SMP Clusters profitably applied is expected to have
enough of the standard I/O adapters (e.g., ethernet,
SCSI, FC, etc.) so that each OSlet would have at least
one of each.
Each OSlet has the same data structures that an isolated
OS would have for the same amount of resources. Unless
interactions with other OSlets are required, an OSlet runs
very nearly the same code over very nearly the same data
as would a standalone OS.
Although each OSlet is in most ways its own machine, the
full set of OSlets appears as one OS to any user programs
running on any of the OSlets. In particular, processes on
one OSlet can share memory with processes on other OSlets,
can send signals to processes on other OSlets, communicate
via pipes and Unix-domain sockets with processes on other
OSlets, and so on. Performance of operations spanning
multiple OSlets may be somewhat slower than operations local
to a single OSlet, but the difference will not be noticeable
except to users who are engaged in careful performance
analysis.
The goals of the SMP Cluster approach are:
1. Allow the core kernel code to use simple locking designs.
2. Present applications with a single-system view.
3. Maintain good (linear!) scalability.
4. Not degrade the performance of a single CPU beyond that
of a standalone OS running on the same resources.
5. Minimize modification of core kernel code. Modified or
rewritten device drivers, filesystems, and
architecture-specific code is permitted, perhaps even
encouraged. ;-)
OS Boot
Early-boot code/firmware must partition the machine, and prepare
tables for each OSlet that describe the resources that each
OSlet owns. Each OSlet must be made aware of the existence of
all the other OSlets, and will need some facility to allow
efficient determination of which OSlet a given resource belongs
to (for example, to determine which OSlet a given page is owned
by).
At some point in the boot sequence, each OSlet creates a "proxy
task" for each of the other OSlets that provides shared services
to them.
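Assuming boot-time firmware hands each OSlet a table of contiguous resource ranges, the "which OSlet owns this page" lookup mentioned above could be sketched like this (hypothetical structures, not from the discussion):

```c
#include <assert.h>
#include <stddef.h>

struct mem_range {
    unsigned long start_pfn;   /* first page frame in the range */
    unsigned long end_pfn;     /* one past the last frame */
    int owner_oslet;
};

/* Boot code would fill this from firmware tables, sorted by start_pfn. */
static const struct mem_range ranges[] = {
    { 0,      4096,  0 },
    { 4096,   8192,  1 },
    { 8192,  12288,  2 },
};

/* Which OSlet owns a given page frame?  Linear scan for clarity; a
 * real implementation would binary-search or precompute a shift. */
static int pfn_to_oslet(unsigned long pfn)
{
    size_t i;

    for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
        if (pfn >= ranges[i].start_pfn && pfn < ranges[i].end_pfn)
            return ranges[i].owner_oslet;
    return -1;  /* not memory we know about */
}
```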
Issues:
1. Some systems may require device probing to be done
by a central program, possibly before the OSlets are
spawned. Systems that react in an unfriendly manner
to failed probes might be in this class.
2. Interrupts must be set up very carefully. On some
systems, the interrupt system may constrain the ways
in which the system is partitioned.
Shared Operations
This section describes some possible implementations and issues
with a number of the shared operations.
Shared operations include:
1. Page fault on memory owned by some other OSlet.
2. Manipulation of processes running on some other OSlet.
3. Access to devices owned by some other OSlet.
4. Reception of network packets intended for some other OSlet.
5. SysV msgq and sema operations on msgq and sema objects
accessed by processes running on more than one OSlet.
6. Access to filesystems owned by some other OSlet. The
/tmp directory gets special mention.
7. Pipes connecting processes in different OSlets.
8. Creation of processes that are to run on a different
OSlet than their parent.
9. Processing of exit()/wait() pairs involving processes
running on different OSlets.
Page Fault
As noted earlier, each OSlet maintains a proxy process
for each other OSlet (so that for an SMP Cluster made
up of N OSlets, there are N*(N-1) proxy processes).
When a process in OSlet A wishes to map a file
belonging to OSlet B, it makes a request to B's proxy
process corresponding to OSlet A. The proxy process
maps the desired file and takes a page fault at the
desired address (translated as needed, since the file
will usually not be mapped to the same location in the
proxy and client processes), forcing the page into
OSlet B's memory. The proxy process then passes the
corresponding physical address back to the client
process, which maps it.
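That fault path can be sketched roughly as follows (all names here are made up for illustration; the real IPC/"door" mechanics are exactly what is at issue below, and the physical address is faked from the file offset purely for testability):

```c
#include <assert.h>

#define BAD_PADDR ((unsigned long)-1)

struct fault_req {
    int client_oslet;        /* OSlet A, where the faulting process runs */
    int file_oslet;          /* OSlet B, owner of the file */
    unsigned long file_off;  /* offset the client faulted on */
};

/* Stand-in for OSlet B's proxy process: "map" the file, fault the page
 * into B's memory, and hand back the page-aligned physical address. */
static unsigned long proxy_handle_fault(const struct fault_req *req)
{
    if (req->file_oslet == req->client_oslet)
        return BAD_PADDR;    /* local faults never go through a proxy */
    return 0x40000000UL + (req->file_off & ~0xFFFUL);
}

/* Client side in OSlet A: ask B's proxy, then map the returned frame.
 * In the real design this call would be an IPC or "door" invocation. */
static unsigned long client_fault(int a, int b, unsigned long off)
{
    struct fault_req req = { a, b, off };

    return proxy_handle_fault(&req);
}
```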
Issues:
o How to coordinate pageout? Two approaches:
1. Use mlock in the proxy process so that
only the client process can do the pageout.
2. Make the two OSlets coordinate their
pageouts. This is more complex, but will
be required in some form or another to
prevent OSlets from "ganging up" on one
of their number, exhausting its memory.
o When OSlet A ejects the memory from its working
set, where does it put it?
1. Throw it away, and go to the proxy process
as needed to get it back.
2. Augment core VM as needed to track the
"guest" memory. This may be needed for
performance, but...
o Some code is required in the pagein() path to
figure out that the proxy must be used.
1. Larry stated that he is willing to be
punched in the nose to get this code in. ;-)
The amount of this code is minimized by
creating SMP-clusters-specific filesystems,
which have their own functions for mapping
and releasing pages. (Does this really
cover OSlet A's paging out of this memory?)
o How are pagein()s going to be even halfway fast
if IPC to the proxy is involved?
1. Just do it. Page faults should not be
all that frequent with today's memory
sizes. (But then why do we care so
much about page-fault performance???)
2. Use "doors" (from Sun), which are very
similar to protected procedure call
(from K42/Tornado/Hurricane). The idea
is that the CPU in OSlet A that is handling
the page fault temporarily -becomes- a
member of OSlet B by using OSlet B's page
tables for the duration. This results in
some interesting issues:
a. What happens if a process wants to
block while "doored"? Does it
switch back to being an OSlet A
process?
b. What happens if a process takes an
interrupt (which corresponds to
OSlet A) while doored (thus using
OSlet B's page tables)?
i. Prevent this by disabling
interrupts while doored.
This could pose problems
with relatively long VM
code paths.
ii. Switch back to OSlet A's
page tables upon interrupt,
and switch back to OSlet B's
page tables upon return
from interrupt. On machines
not supporting ASID, take a
TLB-flush hit in both
directions. Also likely
requires common text (at
least for low-level interrupts)
for all OSlets, making it more
difficult to support OSlets
running different versions of
the OS.
Furthermore, the last time
that Paul suggested adding
instructions to the interrupt
path, several people politely
informed him that this would
require a nose punching. ;-)
c. If a bunch of OSlets simultaneously
decide to invoke their proxies on
a particular OSlet, that OSlet gets
lock contention corresponding to
the number of CPUs on the system
rather than to the number in a
single OSlet. Some approaches to
handle this:
i. Stripe -everything-, rely
on entropy to save you.
May still have problems with
hotspots (e.g., which of the
OSlets has the root of the
root filesystem?).
ii. Use some sort of queued lock
to limit the number of CPUs that
can be running proxy processes
in a given OSlet. This does
not really help scaling, but
would make the contention
less destructive to the
victim OSlet.
o How to balance memory usage across the OSlets?
1. Don't bother, let paging deal with it.
Paul's previous experience with this
philosophy was not encouraging. (You
can end up with one OSlet thrashing
due to the memory load placed on it by
other OSlets, which don't see any
memory pressure.)
2. Use some global memory-pressure scheme
to even things out. Seems possible,
Paul is concerned about the complexity
of this approach. If this approach is
taken, make sure someone with some
control-theory experience is involved.
Manipulation of Processes Running on Some Other OSlet.
The general idea here is to implement something similar
to a vproc layer. This is common code, and thus requires
someone to sacrifice their nose. There was some discussion
of other things that this would be useful for, but I have
lost them.
Manipulations discussed included signals and job control.
Issues:
o Should process information be replicated across
the OSlets for performance reasons? If so, how
much, and how to synchronize.
1. No, just use doors. See above discussion.
2. Yes. No discussion of synchronization
methods. (Hey, we had to leave -something-
for later!)
Access to Devices Owned by Some Other OSlet
Larry mentioned a /rdev, but if we discussed any details
of this, I have lost them. Presumably, one would use some
sort of IPC or doors to make this work.
Reception of Network Packets Intended for Some Other OSlet.
An OSlet receives a packet, and realizes that it is
destined for a process running in some other OSlet.
How is this handled without rewriting most of the
networking stack?
The general approach was to add a NAT-like layer that
inspected the packet and determined which OSlet it was
destined for. The packet was then forwarded to the
correct OSlet, and subjected to full IP-stack processing.
Issues:
o If the address map in the kernel is not to be
manipulated on each packet reception, there
needs to be a circular buffer in each OSlet for
each of the other OSlets (again, N*(N-1) buffers).
In order to prevent the buffer from needing to
be exceedingly large, packets must be bcopy()ed
into this buffer by the OSlet that received
the packet, and then bcopy()ed out by the OSlet
containing the target process. This could add
a fair amount of overhead.
1. Just accept the overhead. Rely on this
being an uncommon case (see the next issue).
2. Come up with some other approach, possibly
involving the user address space of the
proxy process. We could not articulate
such an approach, but it was late and we
were tired.
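The circular buffer in option 1 amounts to a single-producer/single-consumer packet ring with one copy on each side. A minimal userspace sketch of the idea (hypothetical; it ignores the memory barriers and interrupt context a real implementation would need):

```c
#include <assert.h>
#include <string.h>

#define RING_SLOTS 8
#define MAX_PKT    1514

/* One forwarding ring per (receiving OSlet, target OSlet) pair; with N
 * OSlets that is the N*(N-1) buffers mentioned above.  One producer
 * (the OSlet that took the interrupt) and one consumer (the OSlet
 * owning the destination socket), so plain head/tail indices suffice
 * in this sketch. */
struct pkt_ring {
    unsigned head, tail;
    unsigned len[RING_SLOTS];
    unsigned char data[RING_SLOTS][MAX_PKT];
};

static int ring_put(struct pkt_ring *r, const void *pkt, unsigned len)
{
    if (r->head - r->tail == RING_SLOTS || len > MAX_PKT)
        return -1;                                 /* full: drop or back-pressure */
    memcpy(r->data[r->head % RING_SLOTS], pkt, len);  /* first bcopy, in */
    r->len[r->head % RING_SLOTS] = len;
    r->head++;
    return 0;
}

static int ring_get(struct pkt_ring *r, void *out)
{
    unsigned slot, len;

    if (r->head == r->tail)
        return -1;                                 /* empty */
    slot = r->tail % RING_SLOTS;
    len = r->len[slot];
    memcpy(out, r->data[slot], len);               /* second bcopy, out */
    r->tail++;
    return (int)len;
}
```

The two memcpy() calls are exactly the overhead the discussion worries about; avoiding them is what option 2 gestures at.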
o If there are two processes that share the FD
on which the packet could be received, and these
two processes are in two different OSlets, and
neither is in the OSlet that received the packet,
what the heck do you do???
1. Prevent this from happening by refusing
to allow processes holding a TCP connection
open to move to another OSlet. This could
result in load-balance problems in some
workloads, though neither Paul nor Ted were
able to come up with a good example on the
spot (seeing as BAAN has not been doing really
well of late).
To indulge in l'esprit d'escalier... How
about a timesharing system that users
access from the network? A single user
would have to log on twice to run a job
that consumed more than one OSlet if each
process in the job might legitimately need
access to stdin.
2. Do all protocol processing on the OSlet
on which the packet was received, and
straighten things out when delivering
the packet data to the receiving process.
This likely requires changes to common
code, hence someone to volunteer their nose.
SysV msgq and sema Operations
We didn't discuss these. None of us seem to be SysV fans,
but these must be made to work regardless.
Larry says that shm should be implemented in terms of mmap(),
so that this case reduces to page-mapping discussed above.
Of course, one would need a filesystem large enough to handle
the largest possible shmget. Paul supposes that one could
dynamically create a memory filesystem to avoid problems here,
but is in no way volunteering his nose to this cause.
Access to Filesystems Owned by Some Other OSlet.
For the most part, this reduces to the mmap case. However,
partitioning popular filesystems over the OSlets could be
very helpful. Larry mentioned that this had been prototyped.
Paul cannot remember if Larry promised to send papers or
other documentation, but duly requests them after the fact.
Larry suggests having a local /tmp, so that /tmp is in effect
private to each OSlet. There would be a /gtmp that would
be a globally visible /tmp equivalent. We went round and
round on software compatibility, Paul suggesting a hashed
filesystem as an alternative. Larry eventually pointed out
that one could just issue different mount commands to get
a global filesystem in /tmp, and create a per-OSlet /ltmp.
This would allow people to determine their own level of
risk/performance.
Pipes Connecting Processes in Different OSlets.
This was mentioned, but I have forgotten the details.
My vague recollections lead me to believe that some
nose-punching was required, but I must defer to Larry
and Ted.
Ditto for Unix-domain sockets.
Creation of Processes on a Different OSlet Than Their Parent.
There would be an inherited attribute that would prevent
fork() or exec() from creating its child on a different
OSlet. This attribute would be set by default to prevent
too many surprises. Things like make(1) would clear
this attribute to allow amazingly fast kernel builds.
There would also be a system call that would cause the
child to be placed on a specified OSlet (Paul suggested
use of HP's "launch policy" concept to avoid adding yet
another dimension to the exec() combinatorial explosion).
The discussion of packet reception led Larry to suggest
that cross-OSlet process creation would be prohibited if
the parent and child shared a socket. See above for the
load-balancing concern and corresponding l'esprit d'escalier.
Processing of exit()/wait() Pairs Crossing OSlet Boundaries
We didn't discuss this. My guess is that vproc deals
with it. Some care is required when optimizing for this.
If one hands off to a remote parent that dies before
doing a wait(), one would not want one of the init
processes getting a nasty surprise.
(Yes, there are separate init processes for each OSlet.
We did not talk about implications of this, which might
occur if one were to need to send a signal intended to
be received by all the replicated processes.)
Other Desiderata:
1. Ability of surviving OSlets to continue running after one of their
number fails.
Paul was quite skeptical of this. Larry suggested that the
"door" mechanism could use a dynamic-linking strategy. Paul
remained skeptical. ;-)
2. Ability to run different versions of the OS on different OSlets.
Some discussion of this above.
The Score.
Paul agreed that SMP Clusters could be implemented. He was not
sure that it could achieve good performance, but could not prove
otherwise. Although he suspected that the complexity might be
less than the proprietary highly parallel Unixes, he was not
convinced that it would be less than Linux would be, given the
Linux community's emphasis on simplicity in addition to performance.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: latest linus-2.5 BK broken
2002-06-20 5:24 ` Larry McVoy
@ 2002-06-20 7:26 ` Andreas Dilger
2002-06-20 14:54 ` Eric W. Biederman
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
2 siblings, 0 replies; 97+ messages in thread
From: Andreas Dilger @ 2002-06-20 7:26 UTC (permalink / raw)
To: Larry McVoy, Eric W. Biederman, Linus Torvalds, Cort Dougan,
Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On Jun 19, 2002 22:24 -0700, Larry McVoy wrote:
> Eric W. Biederman writes:
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)
Well, I can speak to this a little bit... Given Lustre's ext3
underpinnings, we have been thinking of some interesting methods
by which we could take an existing ext3 filesystem on a disk and
"clusterify" it (i.e. have distributed coherency across multiple
clients). This would be perfectly suited for application on a
CC cluster.
Given that the network communication protocols are also abstracted
out from the Lustre core, it would probably be trivial for someone
with network/VM experience to write a "no-op" networking layer
which basically did little more than passing around page addresses
and faulting the right pages into each OSlet. The protocol design
is already set up to handle direct DMA between client and storage
target, and a CC cluster could also do away with the actual copy
involved in the DMA. We can already do "zero copy" I/O between
user-space and a remote disk with O_DIRECT and the right network
hardware (which does direct DMA from one node to another).
> "Paul McKenney" <Paul.McKenney@us.ibm.com> writes:
> Access to Devices Owned by Some Other OSlet
>
> Larry mentioned a /rdev, but if we discussed any details
> of this, I have lost them. Presumably, one would use some
> sort of IPC or doors to make this work.
I would just make access to remote devices act like NBD or something,
and have similar "network/proxy" kernel drivers to all "remote" devices.
At boot time something like devfs would instantiate the "proxy"
drivers for all of the kernels except the one which is "in control"
of that device.
For example /dev/hda would be a real IDE disk device driver on the
controlling node, but would be NBD in all of the other OSlets. It would
have the same major/minor number across all OSlets so that it presented
a uniform interface to user-space. While in some cases (e.g. FC) you
could have shared-access directly to the device, other devices don't
have the correct locking mechanisms internally to be accessed by more
than one thread at a time.
As the "network" layer between two OSlets would run basically at memory
speeds, this would not impose much of an overhead. The proxy device
interfaces would be equally useful between OSlets as with two remote
machines (e.g. remote modem access), so I have no doubt that many of
them already exist, and the others could be written rather easily.
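A sketch of that registration-time choice (hypothetical structures; this is not devfs or NBD code): each OSlet walks the shared boot-time device table and binds either the native driver or a proxy, keeping the device numbers identical everywhere.

```c
#include <assert.h>

enum drv_kind { DRV_NATIVE, DRV_PROXY };

struct blkdev_binding {
    int major, minor;
    int owner_oslet;       /* which OSlet really drives the hardware */
    enum drv_kind kind;    /* filled in per-OSlet below */
};

/* Bind the native driver if this OSlet owns the device, otherwise an
 * NBD-like proxy pointed at the owner.  Same major/minor either way,
 * so userspace sees one uniform device namespace. */
static void bind_devices(struct blkdev_binding *tab, int n, int my_oslet)
{
    int i;

    for (i = 0; i < n; i++)
        tab[i].kind = (tab[i].owner_oslet == my_oslet) ? DRV_NATIVE
                                                       : DRV_PROXY;
}
```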
> Access to Filesystems Owned by Some Other OSlet.
>
> For the most part, this reduces to the mmap case. However,
> partitioning popular filesystems over the OSlets could be
> very helpful. Larry mentioned that this had been prototyped.
> Paul cannot remember if Larry promised to send papers or
> other documentation, but duly requests them after the fact.
>
> Larry suggests having a local /tmp, so that /tmp is in effect
> private to each OSlet. There would be a /gtmp that would
> be a globally visible /tmp equivalent. We went round and
> round on software compatibility, Paul suggesting a hashed
> filesystem as an alternative. Larry eventually pointed out
> that one could just issue different mount commands to get
> a global filesystem in /tmp, and create a per-OSlet /ltmp.
> This would allow people to determine their own level of
> risk/performance.
Nah, just use a cluster filesystem for everything ;-). As I mentioned
previously, Lustre could run from a single (optionally shared-access) disk
(with proper, relatively minor, hacks that are just in the discussion
phase now), or it can run from distributed disks that serve the data to
the remote clients. With smart allocation of resources, OSlets will
prefer to create new files on their "local" storage unless there are
resource shortages. The fast "networking" between OSlets means even
"remote" disk access is cheap.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: latest linus-2.5 BK broken
2002-06-20 5:24 ` Larry McVoy
2002-06-20 7:26 ` Andreas Dilger
@ 2002-06-20 14:54 ` Eric W. Biederman
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
2 siblings, 0 replies; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-20 14:54 UTC (permalink / raw)
To: Larry McVoy
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Larry McVoy <lm@bitmover.com> writes:
> > I totally agree; mostly I was playing devil's advocate. The model
> > actually in my head is one where you have multiple kernels, but they talk
> > well enough that the applications only have to care in areas where it
> > makes a performance difference (there's got to be one of those).
>
> ....
>
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)
> > - Sub application level checkpointing.
> >
> > Services like schedulers already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful once it gets the ability to suspend jobs.
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel like nfsd. Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead.
Hmm. My impression is that Linux has been doing SMP mostly because
it hasn't become a nightmare so far. Linus just a moment ago noted
that there are scalability limits to SMP.
As for the cc-SMP stuff:
a) Except for dual-CPU systems, no one makes affordable SMPs.
b) It doesn't solve anything except your problem with locks.
You have presented your idea, and maybe it will be useful. But at
the moment it is not the place to start. What I need today is process
checkpointing. The rest comes in easy incremental steps from there.
For me the natural place to start is with clusters, they are cheaper
and more accessible than SMPs. And then work on the clustering
software with gradual refinements until it can be managed as one
machine. At that point it should be easy to compare which does a
better job for SMPs.
Eric
* McVoy's Clusters (was Re: latest linus-2.5 BK broken)
2002-06-20 5:24 ` Larry McVoy
2002-06-20 7:26 ` Andreas Dilger
2002-06-20 14:54 ` Eric W. Biederman
@ 2002-06-20 15:41 ` Sandy Harris
2002-06-20 17:10 ` William Lee Irwin III
` (2 more replies)
2 siblings, 3 replies; 97+ messages in thread
From: Sandy Harris @ 2002-06-20 15:41 UTC (permalink / raw)
To: Linux Kernel Mailing List
[ I removed half a dozen cc's on this, and am just sending to the
list. Do people actually want the cc's?]
Larry McVoy wrote:
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes,
For large multi-processor systems, it isn't clear that those matter
much. On single-user systems I've tried, ps -ax | wc -l usually
gives some number 50 < n < 100. For a multi-user general purpose
system, my guess would be something under 50 system processes plus
50 per user. So for a dozen to 20 users on a departmental server,
under 1000. A server for a big application, like database or web,
would have fewer users and more threads, but still only a few 100
or at most, say 2000.
So at something like 8 CPUs in a personal workstation and 128 or
256 for a server, things average out to 8 processes per CPU, and
it is not clear that process migration or any form of pre-emption
beyond the usual kernel scheduling is needed.
What combination of resources and loads do you think preemption
and migration are need for?
> > and the ability to recover from failed nodes, (assuming the
> > failed hardware didn't corrupt your jobs checkpoint).
That matters, but it isn't entirely clear that it needs to be done
in the kernel. Things like databases and journalling filesystems
already have their own mechanisms and it is not remarkably onerous
to put them into applications where required.
[big snip]
> Larry McVoy's SMP Clusters
>
> Discussion on November 8, 2001
>
> Larry McVoy, Ted T'so, and Paul McKenney
>
> What is SMP Clusters?
>
> SMP Clusters is a method of partitioning an SMP (symmetric
> multiprocessing) machine's CPUs, memory, and I/O devices
> so that multiple "OSlets" run on this machine. Each OSlet
> owns and controls its partition. A given partition is
> expected to contain from 4-8 CPUs, its share of memory,
> and its share of I/O devices. A machine large enough to
> have SMP Clusters profitably applied is expected to have
> enough of the standard I/O adapters (e.g., ethernet,
> SCSI, FC, etc.) so that each OSlet would have at least
> one of each.
I'm not sure whose definition this is:
supercomputer: a device for converting compute-bound problems
into I/O-bound problems
but I suspect it is at least partially correct, and Beowulfs are
sometimes just devices to convert them to network-bound problems.
For a network-bound task like web serving, I can see a large
payoff in having each OSlet doing its own I/O.
However, in general I fail to see why each OSlet should have
independent resources rather than something like using one to
run a shared file system and another to handle the networking
for everybody.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 3:57 ` Eric W. Biederman
2002-06-20 5:24 ` Larry McVoy
@ 2002-06-20 16:30 ` Cort Dougan
2002-06-20 17:15 ` Linus Torvalds
` (3 more replies)
1 sibling, 4 replies; 97+ messages in thread
From: Cort Dougan @ 2002-06-20 16:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
"Beating the SMP horse to death" does make sense for 2 processor SMP
machines. When 64 processor machines become commodity (Linux is a
commodity hardware OS) something will have to be done. When research
groups put Linux on 1k processors - it's an experiment. I don't think they
have much right to complain that Linux doesn't scale up to that level -
it's not designed to.
That being said, large clusters are an interesting research area but it is
_not_ a failing of Linux that it doesn't scale to them.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
@ 2002-06-20 17:10 ` William Lee Irwin III
2002-06-20 20:42 ` Timothy D. Witham
2002-06-21 5:16 ` Eric W. Biederman
2002-06-22 14:14 ` Kai Henningsen
2 siblings, 1 reply; 97+ messages in thread
From: William Lee Irwin III @ 2002-06-20 17:10 UTC (permalink / raw)
To: Sandy Harris; +Cc: Linux Kernel Mailing List
On Thu, Jun 20, 2002 at 11:41:45AM -0400, Sandy Harris wrote:
> For large multi-processor systems, it isn't clear that those matter
> much. On single user systems I've tried , ps -ax | wc -l usually
> gives some number 50 < n < 100. For a multi-user general purpose
> system, my guess would be something under 50 system processes plus
> 50 per user. So for a dozen to 20 users on a departmental server,
> under 1000. A server for a big application, like database or web,
> would have fewer users and more threads, but still only a few 100
> or at most, say 2000.
Certain unnameable databases like to have 2K processes at minimum and
see task counts soar even higher under significant loads.
Also, the scholastic departmental servers I've seen in action generally
host 300+ users, with something less than 50 tasks per logged-in user
and something more than 50 for the baseline. On the school-wide one I
used, which hosted 10K+ (40K+?) users, generally only between 500 and
2500 (with a typical maximum around 1500) were logged in simultaneously,
and the task/user count was more like 5-10, with a number of them (most?)
riding at 2 or 3 (shell + MUA, or shell + 2 tasks for rlogin to
elsewhere). The uncertainty about the number of accounts is due to no
userlists being visible.
I can try to contact some of the users or administrators if better
numbers are needed, though it may not work as I've long since graduated.
Cheers,
Bill
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
@ 2002-06-20 17:15 ` Linus Torvalds
2002-06-21 6:15 ` Eric W. Biederman
2002-06-20 17:16 ` RW Hawkins
` (2 subsequent siblings)
3 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-20 17:15 UTC (permalink / raw)
To: Cort Dougan
Cc: Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On Thu, 20 Jun 2002, Cort Dougan wrote:
>
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines.
It makes fine sense for any tightly coupled system, where the tight
coupling is cost-efficient.
Today that means 2 CPU's, and maybe 4.
Things like SMT (Intel calls it "HT") increase that to 4/8. It's just
_cheaper_ to do that kind of built-in SMP support than it is to not use
it.
The important part of what Cort says is "commodity". Not the "small number
of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING
HARDWARE BASE in the commodity space.
ccNuma and clusters just aren't even on the _radar_ from a commodity
standpoint. While commodity 4- and 8-way SMP is just a few years away.
So because SMP hardware is cheap and efficient, all reasonable scalability
work is done on SMP. And the fringe is just that - fringe. The
numa/cluster fringe tends to try to use SMP approaches because they know
they are a minority, and they want to try to leverage off the commodity.
And it will continue to be this way for the foreseeable future. People
should just accept the fact.
The only thing that may change the current state of affairs is that some
cluster/numa issues are slowly percolating down and they may become more
commoditized. For example, I think the AMD approach to SMP on the hammer
series is "local memories" with a fast CPU interconnect. That's a lot more
NUMA than we're used to in the PC space.
On the other hand, another interesting trend seems to be that since
commoditizing NUMA ends up being done with a lot of integration, the
actual _latency_ difference is so small that those potential future
commodity NUMA boxes can be considered largely UMA/SMP.
And I guarantee Linux will scale up fine to 16 CPU's, once that is
commodity. And the rest is just not all that important.
Linus
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
2002-06-20 17:15 ` Linus Torvalds
@ 2002-06-20 17:16 ` RW Hawkins
2002-06-20 17:23 ` Cort Dougan
2002-06-20 20:40 ` Martin Dalecki
2002-06-21 5:34 ` Eric W. Biederman
3 siblings, 1 reply; 97+ messages in thread
From: RW Hawkins @ 2002-06-20 17:16 UTC (permalink / raw)
To: Cort Dougan
Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
You're missing the point. Larry is saying "I have been down this road
before, take heed". We don't want to waste time reinventing bloat
when we can learn from others' mistakes.
-RW
Cort Dougan wrote:
>"Beating the SMP horse to death" does make sense for 2 processor SMP
>machines. When 64 processor machines become commodity (Linux is a
>commodity hardware OS) something will have to be done. When research
>groups put Linux on 1k processors - it's an experiment. I don't think they
>have much right to complain that Linux doesn't scale up to that level -
>it's not designed to.
>
>That being said, large clusters are an interesting research area but it is
>_not_ a failing of Linux that it doesn't scale to them.
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 17:16 ` RW Hawkins
@ 2002-06-20 17:23 ` Cort Dougan
0 siblings, 0 replies; 97+ messages in thread
From: Cort Dougan @ 2002-06-20 17:23 UTC (permalink / raw)
To: RW Hawkins
Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
I'm not disagreeing with Larry here. I'm just pointing out that mainline
Linux cares about what is commodity. That's 1-2 processors, and 2-4 on
some PPC and other boards.
I'm keenly interested in 1k processors, as is Larry, and in scaling Linux
up to them. I don't disagree with Linus' path of Linux staying with SMP
for now. Scaling up to huge clusters isn't a mainline Linux concern. It's
a very interesting research area, though - in fact, some of the research
I work on.
} You're missing the point. Larry is saying "I have been down this road
} before, take heed". We don't want to waste the time reinventing bloat
} when we can learn from others mistakes.
}
} -RW
}
} Cort Dougan wrote:
}
} >"Beating the SMP horse to death" does make sense for 2 processor SMP
} >machines. When 64 processor machines become commodity (Linux is a
} >commodity hardware OS) something will have to be done. When research
} >groups put Linux on 1k processors - it's an experiment. I don't think they
} >have much right to complain that Linux doesn't scale up to that level -
} >it's not designed to.
} >
} >That being said, large clusters are an interesting research area but it is
} >_not_ a failing of Linux that it doesn't scale to them.
} >
} >
}
}
}
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
2002-06-20 17:15 ` Linus Torvalds
2002-06-20 17:16 ` RW Hawkins
@ 2002-06-20 20:40 ` Martin Dalecki
2002-06-20 20:53 ` Linus Torvalds
` (2 more replies)
2002-06-21 5:34 ` Eric W. Biederman
3 siblings, 3 replies; 97+ messages in thread
From: Martin Dalecki @ 2002-06-20 20:40 UTC (permalink / raw)
To: Cort Dougan
Cc: Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
User Cort Dougan wrote:
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines. When 64 processor machines become commodity (Linux is a
> commodity hardware OS) something will have to be done. When research
64-processor machines will *never* become a commodity because:
1. It's not as if parallel machines are something entirely new. They have
been around on this planet for an awfully long time (nearly longer than
I have).
2. See 1: even dual-CPU machines are a rarity even *now*.
3. Nobody needs them for the usual tasks; they are a *waste*
of resources, and economics still applies.
4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)
5. It will never become a commodity to run highly transactional
workloads, where integrated bunches of 4 make sense. Neither will
it be common to solve partial differential equations for aeroplane
dynamics or to calculate the behaviour of a hydrogen bomb.
6. Even in the aerodynamics department, a machine with only 14 CPUs was
very, very fast. (NEC SX-3R)
7. Hyper-threaded cores hardly make sense beyond 2.
8. Amdahl's law is math, not a decree from the Central Committee of
the Communist Party or George Bush. You cannot overrule it.
One exception could be dedicated rendering CPUs - which is the
direction graphics cards are apparently heading - but they
will hardly ever need a general-purpose operating system. Even then,
I'm still among the people who are not interested
in any OpenGL or Direct-whatever... The worst graphics cards
these days drive my screens at the resolutions I want them to
just fine.
PS. I'm sick of seeing bunches of PCs that merely happen to be in
the same room showing up in the list of the 500 fastest computers
in the world. It makes the list useless...
If one wants a grasp of what the next generation of
really fast computers will look like: they will be based
on Josephson junctions. TRW will build them (the same company
as the Voyager probe). Note that they don't plan for thousands of CPUs;
they plan for a few CPUs in liquid helium:
http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)
2002-06-20 17:10 ` William Lee Irwin III
@ 2002-06-20 20:42 ` Timothy D. Witham
0 siblings, 0 replies; 97+ messages in thread
From: Timothy D. Witham @ 2002-06-20 20:42 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Sandy Harris, Linux Kernel Mailing List
Another point: I've seen large multi-user machines roll over
a 32-bit pid in less than half an hour. So not only is it a large
number of processes, it is also a very dynamic process environment.
Tim
On Thu, 2002-06-20 at 10:10, William Lee Irwin III wrote:
> On Thu, Jun 20, 2002 at 11:41:45AM -0400, Sandy Harris wrote:
> > For large multi-processor systems, it isn't clear that those matter
> > much. On single user systems I've tried , ps -ax | wc -l usually
> > gives some number 50 < n < 100. For a multi-user general purpose
> > system, my guess would be something under 50 system processes plus
> > 50 per user. So for a dozen to 20 users on a departmental server,
> > under 1000. A server for a big application, like database or web,
> > would have fewer users and more threads, but still only a few 100
> > or at most, say 2000.
>
> Certain unnameable databases like to have 2K processes at minimum and
> see task counts soar even higher under significant loads.
>
> Also, the scholastic departmental servers I've seen in action generally
> host 300+ users with something less than 50/logged in user and something
> more than 50 for the baseline. For the school-wide one I used hosting
> 10K+ (40K+?) users generally only between 500 and 2500 (where the non-rare
> maximum was around 1500) are logged in simultaneously, and the task/user
> count was more like 5-10, with a number of them (most?) riding at 2 or 3
> (shell + MUA or shell + 2 tasks for rlogin to elsewhere). The uncertainty
> with respect to number of accounts is due to no userlists being visible.
>
> I can try to contact some of the users or administrators if better
> numbers are needed, though it may not work as I've long since graduated.
>
> Cheers,
> Bill
--
Timothy D. Witham - Lab Director - wookie@osdlab.org
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 20:40 ` Martin Dalecki
@ 2002-06-20 20:53 ` Linus Torvalds
2002-06-20 21:27 ` Martin Dalecki
2002-06-20 21:13 ` Timothy D. Witham
2002-06-21 19:53 ` Rob Landley
2 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-20 20:53 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> 2. See 1. even dual CPU machines are a rarity even *now*.
With stuff like HT, you may well not be able to _buy_ an intel desktop
machine with just "one" CPU.
Get with the flow. The old Windows codebase is dead as far as new machines
are concerned, which means that there is no reason to hold back any more:
all OS's support SMP.
> 3. Nobody needs them for the usual tasks they are a *waste*
> of resources and economics still applies.
That's a load of bull.
For usual tasks, two CPU's give clearly better responsiveness than one. If
only because one of them may be doing the computation, and the other may
be doing GUI.
The number of people doing things like mp3 ripping is apparently quite
high. And it's definitely CPU-intensive.
Now, I suspect that past two CPU's you won't find much added oomph, but
the load-balancing of just two is definitely noticeable on a personal
scale. I just don't want to use UP machines any more unless they have
other things going for them (ie really really small).
> 4. SMP doesn't scale behind 4. Point. (64 hardly makes sense...)
That's not true either.
You can easily make _cheap_ hardware scale to 4, no problem. You may not
want a shared bus, but hey, that's a small implementation detail. Most new
CPU's have the interconnect hardware on-die (either now or planned).
Intel made SMP cheap by putting all the glue logic on-chip and in the
standard chipsets.
And besides, you don't actually need to _scale_ well, if the actual
incremental costs are low. That's the whole point with the P4-HT, of
course. Intel claims 5% die area addition for a 30% scaling. They may be
full of sh*t, of course, and it may be that the added complexity in the
control logic hurts them in other areas (longer pipeline, whatever), but
the point is that if it's cheap, the second CPU doesn't have to "scale".
Linus
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 20:40 ` Martin Dalecki
2002-06-20 20:53 ` Linus Torvalds
@ 2002-06-20 21:13 ` Timothy D. Witham
2002-06-21 19:53 ` Rob Landley
2 siblings, 0 replies; 97+ messages in thread
From: Timothy D. Witham @ 2002-06-20 21:13 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Thu, 2002-06-20 at 13:40, Martin Dalecki wrote:
> User Cort Dougan wrote:
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines. When 64 processor machines become commodity (Linux is a
> > commodity hardware OS) something will have to be done. When research
>
>
> 8. Amdahls law is math and not a decret from the Central Komitee of
> the Kommunist Party or George Bush. You can not overrule it.
>
Boy, I haven't been beat up by Amdahl's law for at least 10 years. :-)
A point to mention is that Amdahl's law also applies to scaling on
clusters. Same issues as SMP as far as application scalability
is concerned.
But the point is that there are a whole bunch of applications where
the serial portion can be reduced to such a small amount that they
can benefit from lots of CPUs.
> One exception could be dedicated rendering CPUs - which is the
> direction where graphics cards are apparently heading - but they
> will hardly ever need a general purpose operating system. But even then -
> I'm still in the bunch of people who are not interrested
> in any OpenGL or Direct whatever... The worsest graphics cards
> those days drive my display screens at the resolutions I wish them too
> just fine.
>
> PS. I'm sick of seeing bunches of PC's which are accidentally in
> the same room nowadays in the list of the 500 fastest computers
> on the world. It makes this list useless...
>
> If one want's to have a grasp on how the next generation of
> really fast computers will look alike. Well: they will be based
> on Johnson-junctions. TRW will build them (same company
> as Voyager sonde). Look there they don't plan for thousands of CPUs
> they plan for few CPUs in liquid helium:
>
> http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
>
>
You know, there used to be a whole bunch of companies doing this
sort of work, and they all went out of business because people could
build a cluster out of off-the-shelf parts for 1/10 of the cost and
get good-enough performance. ETA, CDC, the old Cray - the list goes
on. All gone from the CPU business, because good enough and cheap
enough wins every time.
Tim
--
Timothy D. Witham - Lab Director - wookie@osdlab.org
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 20:53 ` Linus Torvalds
@ 2002-06-20 21:27 ` Martin Dalecki
2002-06-20 21:37 ` Linus Torvalds
2002-06-21 20:38 ` Rob Landley
0 siblings, 2 replies; 97+ messages in thread
From: Martin Dalecki @ 2002-06-20 21:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
User Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>>2. See 1. even dual CPU machines are a rarity even *now*.
>
>
> With stuff like HT, you may well not be able to _buy_ an intel desktop
> machine with just "one" CPU.
Linus, you forget one simple fact - an HT CPU is *not* two CPUs.
It is one CPU with slightly better utilization of the
superscalar pipelines. And it's only slightly better -
just another way of increasing the fill rate of the pipelines
for some specific tasks.
> Get with the flow. The old Windows codebase is dead as far as new machines
> are concerned, which means that there is no reason to hold back any more:
> all OS's support SMP.
>
>
>>3. Nobody needs them for the usual tasks they are a *waste*
>>of resources and economics still applies.
>
>
> That's a load of bull.
Did I mention that ARMs are the most sold CPUs out there?
> For usual tasks, two CPU's give clearly better responsiveness than one. If
> only because one of them may be doing the computation, and the other may
> be doing GUI.
For the usual task of controlling just the fuel level of an engine
or the like, one CPU does fine. For the other usual tasks - well,
dissect a PCMCIA WLAN card, a reasonably fast ethernet card, or a
hard disk. You will find tons of independent CPUs in your system...
but they are hardly SMP-connected. For the other usual tasks my
single Athlon is just fine. The main argument is: yes, it makes
sense to use additional CPUs to offload dedicated work, but the
normal case is not to do it the SMP way.
> The number of people doing things like mp3 ripping is apparently quite
> high. And it's definitely CPU-intensive.
>
> Now, I suspect that past two CPU's you won't find much added oomph, but
Well, on Intel, two CPUs give you about 1.5 times the horsepower of
a single CPU. On good SMP systems it's about 1.7.
> the load-balancing of just two is definitely noticeable on a personal
> scale. I just don't want to use UP machines any more unless they have
> other things going for them (ie really really small).
> >
>>4. SMP doesn't scale behind 4. Point. (64 hardly makes sense...)
> >
> That's not true either.
>
> You can easily make _cheap_ hardware scale to 4, no problem. You may not
> want a shared bus, but hey, they's a small implementation detail. Most new
> CPU's have the interconnect hardware on-die (either now or planned).
>
> Intel made SMP cheap by putting all the glue logic on-chip and in the
> standard chipsets.
Not when I look to buy a real SMP board. They are still
very expensive in comparison to normal boards, though
admittedly they are affordable nowadays.
> And besides, you don't actually need to _scale_ well, if the actual
> incremental costs are low. That's the whole point with the P4-HT, of
> course. Intel claims 5% die area addition for a 30% scaling. They may be
The 30% - I never saw it in the Intel paper. I remember they talk
about 20-something percent, and 30% is a *peak* value.
The paper in question talks about 12% on average. Still a lot for
5% die area (a 2.4x win), especially considering the constant
increase in CPU die area relative to speed once you factor out
the scaling of the production process. With the process scaling
factored out, modern CPUs waste transistors like nothing compared
to their older siblings. (Remember, the 8088 was about 29K
transistors, not 140M!)
But it's not much in absolute numbers...
> full of sh*t, of course, and it may be that the added complexity in the
> control logic hurts them in other areas (longer pipeline, whatever), but
> the point is that if it's cheap, the second CPU doesn't have to "scale".
The main hurting point is the quadrupling of the correctness-testing
effort. Longer pipelines - I hardly think so. The synchronization
infrastructure for out-of-order execution was already there in the last
CPU generation; that is why it's so cheap in terms of die estate to add
this now.
BTW, them pulling this trick shows nicely that we are now at a point
where there will hardly be any further increase in the deployment of
micro-scale parallelism in CPU design... And not just on the CPU side -
even more importantly, you could read it as a public admission that we
are near the end of static optimization through improvements in
compiler technology as well. Of course, the compiler people have been
promising miracles constantly since the first days of pipelining...
In view of this, I would love to see how they intend
to HT the VLSI design of the Itanic :-).
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:27 ` Martin Dalecki
@ 2002-06-20 21:37 ` Linus Torvalds
2002-06-20 21:59 ` Martin Dalecki
2002-06-21 16:01 ` Re: latest linus-2.5 BK broken Sandy Harris
2002-06-21 20:38 ` Rob Landley
1 sibling, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2002-06-20 21:37 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Linus you forget one simple fact - a HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> super scalar pipelines.
Doesn't matter. It's SMP to software, _and_ it is a perfect example of how
integration, in the form of almost free transistors, changes the
economics.
> Just another way of increasind the fill reate of the pipelines
> for some specific tasks.
Integration is _not_ "just another way".
Integration fundamentally changes the whole equation.
When you integrate the SMP capabilities on the CPU, suddenly the world
changes, because suddenly SMP is cheap and easy to do for motherboard
manufacturers that would never have done it before. Suddenly SMP is
available at mass-market prices.
When you integrate multiple CPU's on one standard die (either HT or real
CPU's), the same thing happens.
When you start integrating crossbars etc "numa-like" stuff, like Hammer
apparently is doing, you get the same old technology, but it _behaves_
differently.
You see this outside CPU's too.
When people started integrating high-performance 3D onto a single die, the
_market_ changed. The way people used it changed. It's largely the same
technology that has been around for a long time in visual workstations,
but it's DIFFERENT thanks to low prices and easy integration into
bog-standard PC's.
A 3D tech person might say that the technology is still the same.
But a real human will notice that it's radically different.
> Did I mention that ARMs are the most sold CPUs out there?
Doesn't matter. Did I mention that microbes are the most populous form of
living beings? Does that make any difference to us as humans? Should that
make us think we want to be microbes? Or should it mean that we're somehow
inferior? Obviously not.
Did you mention that there are a lot more resistors in computers than
CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
ways - even though the amount of fundamental technology inherent in
_just_ the passive components of a modern motherboard, like the
resistor network, is way beyond what people built just a few years ago.
Linus
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:37 ` Linus Torvalds
@ 2002-06-20 21:59 ` Martin Dalecki
2002-06-20 22:18 ` Linus Torvalds
` (2 more replies)
2002-06-21 16:01 ` Re: latest linus-2.5 BK broken Sandy Harris
1 sibling, 3 replies; 97+ messages in thread
From: Martin Dalecki @ 2002-06-20 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
User Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>>Linus you forget one simple fact - a HT CPU is *not* two CPUs.
>>It is one CPU with a slightly better utilization of the
>>super scalar pipelines.
>
>
> Doesn't matter. It's SMP to software, _and_ it is a perfect example of how
> integration, in the form of almost free transistors, changes the
> economics.
Well, but this still doesn't make SMP magically scale
better. HT gives you about a 12% increase in throughput on average.
That will hardly improve your MP3-ripping experience :-).
> Integration is _not_ "just another way".
>
> Integration fundamentally changes the whole equation.
>
> When you integrate the SMP capabilities on the CPU, suddenly the world
> changes, because suddenly SMP is cheap and easy to do for motherboard
> manufacturers that would never have done it before. Suddenly SMP is
> available at mass-market prices.
And suddenly the chipset manufacturers start buying CPU
designs like crazy, because they can see what will come next... of course.
> When you integrate multiple CPU's on one standard die (either HT or real
> CPU's), the same thing happens.
Again, HT is still only one CPU. You are too software-centric :-).
> When you start integrating crossbars etc "numa-like" stuff, like Hammer
> apparently is doing, you get the same old technology, but it _behaves_
> differently.
Yes: HT gives 12%, naive SMP gives 50%, and good SMP (aka crossbar bus)
gives 70% for two CPUs. All those numbers are well below the level where
more than 2-4 CPUs makes much sense... Amdahl still bites you if you
read them as:
88% waste (well, actually this time not)
50% waste
30% waste
at scale.
However, crossbar switches do indeed allow for up to
64 CPUs, and more importantly they are the first step in a long time
toward better overall system throughput. Still, they will be
nowhere near commodity - too much heat for the foreseeable future.
> You see this outside CPU's too.
>
> When people started integrating high-performance 3D onto a single die, the
> _market_ changed. The way people used it changed. It's largely the same
> technology that has been around for a long time in visual workstations,
> but it's DIFFERENT thanks to low prices and easy integration into
> bog-standard PC's.
>
> A 3D tech person might say that the technology is still the same.
>
> But a real human will notice that it's radically different.
Yes, but you can drive the technology only up to the perceptual limits
of a human. For example, for about six years all those advancements
in the graphics area have been largely uninteresting to me. I don't
play computer games. Never - they are too boring. Yet another
fan in my computer - no thanks.
> Did you mention that there are a lot more resistors in computers than
> CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
> ways - even though the amount of fundamental technology inherent on a
> modern motherboard in _just_ the passive components like the resistor
> network is way beyond what people built just a few years ago.
Well, the last real technological jump comparable to the invention
of television was actually due to the kind of CPUs you
compare to microbes - mobiles :-). And I'm awaiting the
day when there will be some WinWLAN card as shoddy as those
Winmodems... Fortunately they made 802.11b complicated enough :-).
But with a crossbar switch in place they could well make up for
the latency on the main CPU... oh fear... oh scare...
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:59 ` Martin Dalecki
@ 2002-06-20 22:18 ` Linus Torvalds
2002-06-20 22:41 ` Martin Dalecki
2002-06-21 7:43 ` Zwane Mwaikambo
2002-06-21 21:02 ` Rob Landley
2 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2002-06-20 22:18 UTC (permalink / raw)
To: Martin Dalecki
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Yes, HT gives 12%, naive SMP gives 50%, and good SMP (i.e. a crossbar
> bus) gives 70% for two CPUs. All those numbers are well below the level
> where more than 2-4 CPUs makes much sense...
You don't _understand_.
If it's "free", you take that 70% for the second CPU, and the additional
20% for the next two.
Don't bother repeating yourself about Amdahl's law. Realize what Moore's
law says: things get cheaper over time. A _lot_ cheaper.
It's still a fact that people are willing to pay for performance. Even if
they strictly don't "need" it (but who are you or I to say who "needs"
performance?).
At which point it doesn't _matter_ if you only get 70% or 30% or 12%
improvement. If it's within "cheap enough", people will buy it. In fact,
once it gets "too cheap", people will buy something more expensive just
because a cheap PC obviously isn't good enough. That's _reality_.
Your "efficiency" arguments have no basis in the real life of economics in
a developing market. Only embedded people care about absolute cost and
absolute efficiencies ("it's not worth it for us to go for a more powerful
CPU, since we don't need it"). The rest of the world takes that 66MHz
improvement (in a CPU that does multiple gigahertz) and is happy about it.
Or takes the added 12%, and is happy about it.
Humans are not rational creatures. We're _rationalizing_ creatures, and we
love rationalizing that big machine that just makes us feel better.
Linus
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 22:18 ` Linus Torvalds
@ 2002-06-20 22:41 ` Martin Dalecki
2002-06-21 0:09 ` Allen Campbell
0 siblings, 1 reply; 97+ messages in thread
From: Martin Dalecki @ 2002-06-20 22:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Linus Torvalds wrote:
> At which point it doesn't _matter_ if you only get 70% or 30% or 12%
> improvement. If it's within "cheap enough", people will buy it. In fact,
> once it gets "too cheap", people will buy something more expensive just
> because a cheap PC obviously isn't good enough. That's _reality_.
>
> Your "efficiency" arguments have no basis in the real life of economics in
> a developing market. Only embedded people care about absolute cost and
> absolute efficiencies ("it's not worth it for us to go for a more powerful
> CPU, since we don't need it"). The rest of the world takes that 66MHz
> improvement (in a CPU that does multiple gigahertz) and is happy about it.
> Or takes the added 12%, and is happy about it.
You don't read economics papers, do you? Or what is it with this
plummeting server/PC market around us? Or increased notebook sales?
(A typical market-saturation symptom, like the second car for the
family :-).
I suggest it's precisely the end of the open invention curve out there:
1. Nowadays the CPUs are indeed good enough for most of the common tasks.
WindowsXP tries hard to help overcome this :-). But in reality Win2000
is just fine for office work.
2. The technology in question is starting to hit real physical barriers,
because it appears more and more that not everything coming out of the
labs can be implemented at reasonable cost.
> Humans are not rational creatures. We're _rationalizing_ creatures, and we
> love rationalizing that big machine that just makes us feel better.
Perhaps it's just still too deep in my brain that
the overwhelming part of the PC market is still determined
by corporate buyers (70%). And they look for efficiency (well, within
wide boundaries :-). There is, for example, not much of a rush from
Win4.0 or Win2000 to WindowsXP. Not only for "political" reasons,
but because a normal PC from a few years ago still does the job
for office productivity. Quite a change from the days of yearly upgrades
all around the office :-)... And finally, what is driving
the movement behind AS/390 boxen running Linux OS instances is
consolidation and cost, too...
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-20 23:48 Miles Lane
0 siblings, 0 replies; 97+ messages in thread
From: Miles Lane @ 2002-06-20 23:48 UTC (permalink / raw)
To: Martin Dalecki; +Cc: LKML
Martin Dalecki wrote:
<snip>
> You don't read economics papers, do you? Or what is it with this
> plummeting server/PC market around us? Or increased notebook sales?
> (A typical market-saturation symptom, like the second car for the
> family :-).
>
> I suggest it's precisely the end of the open invention curve out there:
>
> 1. Nowadays the CPUs are indeed good enough for most of the common tasks.
> WindowsXP tries hard to help overcome this :-). But in reality Win2000
> is just fine for office work.
>
> 2. The technology in question is starting to hit real physical barriers,
> because it appears more and more that not everything coming out of the
> labs can be implemented at reasonable cost.
Martin, perhaps you haven't seen this article.
This news seems to contradict your assertion that cost is going
to become a big problem as we attempt to continue tracking the
price/performance trajectory of Moore's law.
http://www.nytimes.com/reuters/technology/tech-technology-chip.html
Miles
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 22:41 ` Martin Dalecki
@ 2002-06-21 0:09 ` Allen Campbell
0 siblings, 0 replies; 97+ messages in thread
From: Allen Campbell @ 2002-06-21 0:09 UTC (permalink / raw)
To: Martin Dalecki; +Cc: linux-kernel
> Perhaps it's just still too deep in my brain that
> the overwhelming part of the PC market is still determined
> by corporate buyers (70%). And they look for efficiency (well, within
> wide boundaries :-).
Most of those buyers care about cost efficiency, not design
efficiency. If a 4-way Dell can just match a 2-way Sun, and for
half the cost, guess who gets the sale. It doesn't matter if it's
"naive" SMP or a beautiful crossbar design blessed by MIT. Yes,
it's ugly. Sure, it would be nice if everyone loved computing so
much that they actually cared enough to make the distinction. They
don't. Get over it.
As long as Linux is true to the market it will thrive. The moment
the motivation becomes someone's pedantic notion of "purity", it's
gone. I believe Linus understands this, and I'm thankful. I'm
guessing that gift of understanding comes from a time when a certain
programmer couldn't afford to pay for the elegance that was offered
at the time.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
2002-06-20 17:10 ` William Lee Irwin III
@ 2002-06-21 5:16 ` Eric W. Biederman
2002-06-22 14:14 ` Kai Henningsen
2 siblings, 0 replies; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-21 5:16 UTC (permalink / raw)
To: Sandy Harris; +Cc: Linux Kernel Mailing List
Sandy Harris <pashley@storm.ca> writes:
> [ I removed half a dozen cc's on this, and am just sending to the
> list. Do people actually want the cc's?]
>
> Larry McVoy wrote:
>
> > > Checkpointing buys three things. The ability to preempt jobs, the
> > > ability to migrate processes,
> For large multi-processor systems, it isn't clear that those matter
> much.
For systems that are built because no single machine can run your
compute-intensive application fast enough, they matter quite
a bit.
> What combination of resources and loads do you think preemption
> and migration are needed for?
Good answers have already been given.
The problem domain I am looking at is compute clusters. The
solutions are useful elsewhere, but in compute clusters they are
extremely valuable.
> > > and the ability to recover from failed nodes, (assuming the
> > > failed hardware didn't corrupt your jobs checkpoint).
>
> That matters, but it isn't entirely clear that it needs to be done
> in the kernel.
I agree, glibc would be fine, but it must be below the level of
the application. Generally it is a pretty onerous task to checkpoint
a random program. For proof, attempt to checkpoint your X desktop -
the infrastructure is there to do it.
Every application must be capable of being checkpointed for the cluster
batch scheduler to take advantage of it.
Example case.
[Preemption]
You start job1, a compute-intensive application that runs for 4 days
on 100 CPUs. Your job is low priority.
In comes job2, a high-priority job that runs for 4 hours and needs 256
CPUs.
job1 is preempted. With checkpoint support it can be saved and
restarted later. Without checkpoint support it is simply killed.
[Migration]
Migration is needed for failing hardware, or to move low-priority jobs
out of the way onto less capable nodes that are going unused.
Or to restart a job that failed, on other hardware.
Eric
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
` (2 preceding siblings ...)
2002-06-20 20:40 ` Martin Dalecki
@ 2002-06-21 5:34 ` Eric W. Biederman
3 siblings, 0 replies; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-21 5:34 UTC (permalink / raw)
To: Cort Dougan
Cc: Linus Torvalds, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
Cort Dougan <cort@fsmlabs.com> writes:
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines. When 64 processor machines become commodity (Linux is a
> commodity hardware OS) something will have to be done. When research
> groups put Linux on 1k processors - it's an experiment. I don't think they
> have much right to complain that Linux doesn't scale up to that level -
> it's not designed to.
>
> That being said, large clusters are an interesting research area but it is
> _not_ a failing of Linux that it doesn't scale to them.
Linux in a classic Beowulf configuration scales just fine. To be clear,
I am talking about a batch-scheduling system, where jobs which run for
hours at a time and on many nodes - possibly the entire cluster at a
time - are scheduled onto some number of commodity systems with a good
network interconnect.
The concern now is not whether it works, or whether it works well, but
whether it can be made more convenient to use.
Eric
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 17:15 ` Linus Torvalds
@ 2002-06-21 6:15 ` Eric W. Biederman
2002-06-21 17:50 ` Larry McVoy
0 siblings, 1 reply; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-21 6:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
Linus Torvalds <torvalds@transmeta.com> writes:
> On Thu, 20 Jun 2002, Cort Dougan wrote:
> >
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines.
>
> It makes fine sense for any tightly coupled system, where the tight
> coupling is cost-efficient.
>
> Today that means 2 CPU's, and maybe 4.
>
> Things like SMT (Intel calls it "HT") increase that to 4/8. It's just
> _cheaper_ to do that kind of built-in SMP support than it is to not use
> it.
>
> The important part of what Cort says is "commodity". Not the "small number
> of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING
> HARDWARE BASE in the commodity space.
Commodity is the wrong word. Volume is the right word. Volumes of machines,
volumes of money, and volumes of developers.
> ccNuma and clusters just aren't even on the _radar_ from a commodity
> standpoint. While commodity 4- and 8-way SMP is just a few years away.
I bet it is easy to find a 2-4 way heterogeneous pile of
computers in many a developer's personal possession that could be turned
into a cluster, if the software weren't so inconvenient to use, or if
there were a good reason to run computer systems that way.
Clusters and ccNuma are entirely different animals. ccNuma is about
specialized hardware. Clusters are about using commodity hardware in
a different way.
> So because SMP hardware is cheap and efficient, all reasonable scalability
> work is done on SMP. And the fringe is just that - fringe. The
> numa/cluster fringe tends to try to use SMP approaches because they know
> they are a minority, and they want to try to leverage off the commodity.
The cluster fringe is a minority. But the high-performance-computing
and batch-scheduling minority has done a lot of the theoretical and
developmental computer science in the past. And I
would be surprised if they weren't influential in the future. But,
like most research, a lot of it is trying suboptimal solutions that
eventually get ditched.
The only SMP-like stuff I have seen in clustering is the attempts to
make clusters simpler to use. And the question I hear is how simple
we can make it without sacrificing scalability.
> And it will continue to be this way for the forseeable future. People
> should just accept the fact.
I apparently see things differently. That clusters will be a
minority - certainly. That the people working on them are hopelessly on
the fringes - not a bit.
Clusters of Linux machines scale acceptably, and for a certain set of
people they get the job done. The problem is making it more convenient
to get the job done. And just as in hardware, where integration can
make extra features essentially free, the next step is to begin
integrating cluster features into Linux, both kernel and user space.
Basically the technique is: implement something that works, then
find the clean, efficient way to do it. If that takes kernel support,
write a kernel patch, and get it in.
> And I guarantee Linux will scale up fine to 16 CPU's, once that is
> commodity. And the rest is just not all that important.
It works just fine on my little 20-node, 20-kernel test machine too.
I think Larry's perspective is interesting and if the common cluster
software gets working well enough I might even try it. But until a
big SMP becomes commodity I don't see the point.
Eric
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-21 7:31 Martin Knoblauch
0 siblings, 0 replies; 97+ messages in thread
From: Martin Knoblauch @ 2002-06-21 7:31 UTC (permalink / raw)
To: linux-kernel
> If one wants to have a grasp of how the next generation of
> really fast computers will look, well: they will be based
> on Josephson junctions. TRW will build them (same company
> as Voyager probe). Look there they don't plan for thousands of CPUs
----------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>they plan for a few CPUs in liquid helium:
>
>
>
http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
>
The first thing that caught my eye on page 2 was the 4096 processors. Hmm...
Martin
--
----------------------------------
Martin Knoblauch
knobi@knobisoft.de
http://www.knobisoft.de
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:59 ` Martin Dalecki
2002-06-20 22:18 ` Linus Torvalds
@ 2002-06-21 7:43 ` Zwane Mwaikambo
2002-06-21 21:02 ` Rob Landley
2 siblings, 0 replies; 97+ messages in thread
From: Zwane Mwaikambo @ 2002-06-21 7:43 UTC (permalink / raw)
To: Martin Dalecki
Cc: Linus Torvalds, Cort Dougan, Eric W. Biederman, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Thu, 20 Jun 2002, Martin Dalecki wrote:
> > When you integrate multiple CPU's on one standard die (either HT or real
> > CPU's), the same thing happens.
>
> Again HT is still only one CPU. You are too software centric :-).
Can't help it...
Remember i386/i387?
--
http://function.linuxpower.ca
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-21 12:59 Jesse Pollard
0 siblings, 0 replies; 97+ messages in thread
From: Jesse Pollard @ 2002-06-21 12:59 UTC (permalink / raw)
To: dalecki, Linus Torvalds; +Cc: linux-kernel
Martin Dalecki <dalecki@evision-ventures.com>:
>Yes, HT gives 12%, naive SMP gives 50%, and good SMP (i.e. a crossbar
>bus) gives 70% for two CPUs. All those numbers are well below the level
>where more than 2-4 CPUs makes much sense... Amdahl still bites you if
>you read it as:
...
I think your numbers are a little low - I've seen between 50% and 80% on
master/slave SMP, depending on the job: 50% if both processes are heavily
syscall-oriented, 75% (or thereabouts) when both processes are more normally
balanced, and 80% if both processes are more compute-bound.
Good SMP with a crossbar switch bus should give close to 95%. Good SMP
alone should give about 75%.
My experience with a good crossbar switch is based on Cray UNICOS/YMP/SV
hardware. A well-tuned hardware platform, and a slightly less well-tuned
SMP implementation, though the UNICOS 10 rewrite may have fixed the
SMP implementation.
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil
Any opinions expressed are solely my own.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Re: latest linus-2.5 BK broken
2002-06-20 21:37 ` Linus Torvalds
2002-06-20 21:59 ` Martin Dalecki
@ 2002-06-21 16:01 ` Sandy Harris
1 sibling, 0 replies; 97+ messages in thread
From: Sandy Harris @ 2002-06-21 16:01 UTC (permalink / raw)
To: Linux Kernel Mailing List
Linus Torvalds wrote:
> Integration is _not_ "just another way".
>
> Integration fundamentally changes the whole equation.
>
> When you integrate the SMP capabilities on the CPU, suddenly the world
> changes, because suddenly SMP is cheap and easy to do for motherboard
> manufacturers that would never have done it before. Suddenly SMP is
> available at mass-market prices.
>
> When you integrate multiple CPU's on one standard die (either HT or real
> CPU's), the same thing happens.
>
> When you start integrating crossbars etc "numa-like" stuff, like Hammer
> apparently is doing, you get the same old technology, but it _behaves_
> differently.
>
> You see this outside CPU's too.
>
> When people started integrating high-performance 3D ...
It seems to me we're talking about several different ways to get
parallelism in volume hardware: SMP, smarter peripherals, and
various sorts of cluster (Beowulf compute engines, redundant for
high availability, load-sharing for web servers or other I/O-bound
loads, ...). Great. All have their place.
I wonder, though, about one that doesn't seem to be discussed
much: asymmetric multiprocessing.
One example is IBM mainframes with their channel processors; not
just smart peripherals but whole CPUs dedicated to I/O control.
Another was the VAX 782, two 780s with a fat bus-to-bus cable
and each CPU getting DMA into the other's memory. One CPU ran
most of the kernel, the other all the user processes.
To what extent is this becoming relevant to Linux, with the port
to System/390 and the trend toward I2O devices in PCs? How does it
affect the overall design?
I rather like the notion of a machine with most of the kernel,
including all disk and net I/O, running on, say, a pair of ARMs
while a quad of 64-bit whatevers runs the user processes. This
might give better $/power/heat/... tradeoffs than just going
to 8-way systems.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-21 6:15 ` Eric W. Biederman
@ 2002-06-21 17:50 ` Larry McVoy
2002-06-21 17:55 ` Robert Love
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Larry McVoy @ 2002-06-21 17:50 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote:
> I think Larry's perspective is interesting and if the common cluster
> software gets working well enough I might even try it. But until a
> big SMP becomes commodity I don't see the point.
The real point is that multi threading screws up your kernel. All the Linux
hackers are going through the learning curve on threading and think I'm an
alarmist or a nut. After Linux works on a 64 way box, I suspect that the
majority of them will secretly admit that threading does screw up the kernel
but at that point it's far too late.
The current approach is a lot like western medicine. Wait until the
cancer shows up and then make an effort to get rid of it. My suggested
approach is to take steps to make sure the cancer never gets here in
the first place. It's proactive rather than reactive. And the reason
I harp on this is that I'm positive (and history supports me 100%)
that the reactive approach doesn't work, you'll be stuck with it,
there is no way to "fix" it other than starting over with a new kernel.
Then we get to repeat this whole discussion in 15 years with one of the
Linux veterans trying to explain to the NewOS guys that multi threading
really isn't as cool as it sounds and they should try this other approach.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-21 17:50 ` Larry McVoy
@ 2002-06-21 17:55 ` Robert Love
2002-06-21 18:09 ` Linux, the microkernel (was Re: latest linus-2.5 BK broken) Jeff Garzik
2002-06-22 18:25 ` latest linus-2.5 BK broken Eric W. Biederman
2 siblings, 0 replies; 97+ messages in thread
From: Robert Love @ 2002-06-21 17:55 UTC (permalink / raw)
To: Larry McVoy
Cc: Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Linux Kernel Mailing List
On Fri, 2002-06-21 at 10:50, Larry McVoy wrote:
> The real point is that multi threading screws up your kernel. All the Linux
> hackers are going through the learning curve on threading and think I'm an
> alarmist or a nut. After Linux works on a 64 way box, I suspect that the
> majority of them will secretly admit that threading does screw up the kernel
> but at that point it's far too late.
Larry, this is a point you have made several times and admittedly one I
agree with. I fail to see how the high-end scaling will not compromise
the low-end and I am genuinely concerned Linux will become Solaris.
I do not know what to do to prevent it - and I am certainly not saying
we should outright prevent certain things, but it worries me. You are
going to be in Ottawa next week? Maybe we can talk about it...
Robert Love
^ permalink raw reply [flat|nested] 97+ messages in thread
* Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 17:50 ` Larry McVoy
2002-06-21 17:55 ` Robert Love
@ 2002-06-21 18:09 ` Jeff Garzik
2002-06-21 18:46 ` Cort Dougan
2002-06-21 19:34 ` Rob Landley
2002-06-22 18:25 ` latest linus-2.5 BK broken Eric W. Biederman
2 siblings, 2 replies; 97+ messages in thread
From: Jeff Garzik @ 2002-06-21 18:09 UTC (permalink / raw)
To: Larry McVoy
Cc: Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
Larry McVoy wrote:
> the first place. It's proactive rather than reactive. And the reason
> I harp on this is that I'm positive (and history supports me 100%)
> that the reactive approach doesn't work, you'll be stuck with it,
> there is no way to "fix" it other than starting over with a new kernel.
> Then we get to repeat this whole discussion in 15 years with one of the
> Linux veterans trying to explain to the NewOS guys that multi threading
> really isn't as cool as it sounds and they should try this other approach.
One point that is missed, I think, is that Linux secretly wants to be a
microkernel.
Oh, I don't mean a microkernel in the strict sense; we are continuing
to push the dogma of "do it in userspace" or "do it in process context"
(IOW userspace in the kernel).
Look at the kernel now -- the current kernel is not simply an
event-driven, monolithic program [the traditional kernel design]. Linux
also depends on a number of kernel threads to perform various
asynchronous tasks. We have had userspace agents managing bits of
hardware for a while now, and that trend is only going to be reinforced
with Al's initramfs.
IMO, the trend of the kernel is towards a collection of asynchronous
tasks, which lends itself to high parallelism. Hardware itself is
trending towards playing friendly with other hardware in the system
(examples: TCQ-driven bus release and interrupt coalescing), another
element of parallelism.
I don't see the future of Linux as a twisted nightmare of spinlocks.
Jeff
(I wonder if, shades of the old Linus/Tanenbaum flamewar, I will catch
hell from Linus for mentioning the word "microkernel" :))
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 18:09 ` Linux, the microkernel (was Re: latest linus-2.5 BK broken) Jeff Garzik
@ 2002-06-21 18:46 ` Cort Dougan
2002-06-21 20:25 ` Daniel Phillips
2002-06-21 19:34 ` Rob Landley
1 sibling, 1 reply; 97+ messages in thread
From: Cort Dougan @ 2002-06-21 18:46 UTC (permalink / raw)
To: Jeff Garzik
Cc: Larry McVoy, Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
That's not a microkernel design philosophy, it's a good OS design
philosophy. If it doesn't _have_ to be in the kernel, it generally
shouldn't be.
I agree with you that Linux is already a loosely connected yet highly
interdependent set of asynchronous tasks. That makes for a system that
is very difficult to analyze.
I don't see Linux being in serious jeopardy, in the short term, of
becoming Solaris. It only aims at running on 1-4 processors and does a
pretty good job of that. Most sane people realize, as Larry points out,
that the current design will not scale to 64 processors and beyond.
That's obvious; it's not an alarmist or deep statement. The key is to
realize that it's not _meant_ to scale that high right now.
I've done a little work with Larry's suggestion for scaling Linux and it's
very smart in that it solves the problem in a very simple and elegant way.
DEC did the same thing with Galaxy some time ago but they layered it with
so much of their cluster software and OpenVMS that it lost all the
performance that it had gained by being clever. If you want a simple
description of the idea (the way I am working on it), it's a software
version of NORMA.
Linux's sweet spot is 2-4 processors, and it probably shouldn't try to
change that. Going higher is a very hard problem. Many systems have
failed in exactly the same way trying to do that sort of thing. Just
cluster a bunch of those 2-4 processor Linuxes (a room full of boxes, a
large 64-way IBM server, or some hybrid) and you have a clean solution.
} Oh, I don't mean the strict definition of microkernel, we are continuing
} to push the dogma of "do it in userspace" or "do it in process context"
} (IOW userspace in the kernel).
}
} Look at the kernel now -- the current kernel is not simply an
} event-driven, monolithic program [the traditional kernel design]. Linux
} also depends on a number of kernel threads to perform various
} asynchronous tasks. We have had userspace agents managing bits of
} hardware for a while now, and that trend is only going to be reinforced
} with Al's initramfs.
}
} IMO, the trend of the kernel is towards a collection of asynchronous
} tasks, which lends itself to high parallelism. Hardware itself is
} trending towards playing friendly with other hardware in the system
} (examples: TCQ-driven bus release and interrupt coalescing), another
} element of parallelism.
}
} I don't see the future of Linux as a twisted nightmare of spinlocks.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 18:09 ` Linux, the microkernel (was Re: latest linus-2.5 BK broken) Jeff Garzik
2002-06-21 18:46 ` Cort Dougan
@ 2002-06-21 19:34 ` Rob Landley
2002-06-22 15:31 ` Alan Cox
1 sibling, 1 reply; 97+ messages in thread
From: Rob Landley @ 2002-06-21 19:34 UTC (permalink / raw)
To: Jeff Garzik, Larry McVoy
Cc: Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Friday 21 June 2002 02:09 pm, Jeff Garzik wrote:
> One point that is missed, I think, is that Linux secretly wants to be a
> microkernel.
>
> Oh, I don't mean the strict definition of microkernel, we are continuing
> to push the dogma of "do it in userspace" or "do it in process context"
> (IOW userspace in the kernel).
...
>
>
> (I wonder if, shades of the old Linus/Tanenbaum flamewar, I will catch
> hell from Linus for mentioning the word "microkernel" :))
Amateur computer historian piping up...
A microkernel design was actually made to work once, with good performance.
It was about fifteen years ago, in the Amiga. Know how they pulled it off?
Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
unit.
No memory protection meant that message passing devolved to "here's a
pointer, please don't eat my data". And it's message passing that kills
microkernels: all that busywork copying data (or, worse, playing with
page tables) kills your performance and makes the
sucker undebuggable. You wind up jumping through hoops to get access to the
data you need, and at any given point there are three different copies of it
flying across the memory bus, getting out of sync with each other and needing
a forest of locks to even TRY to resolve.
In the Linux kernel, even when we have process context we can "reach out and
touch someone" any time we want to. No message passing nightmares, just keep
track of what you're exporting or Al will flame you. :) Lock, diddle the
original, unlock, move on. No copies, no version skew.
A microkernel design WITHOUT message passing is really just an extremely
modular monolithic kernel. Modularization, like object-oriented programming,
is cool up until the point you let it turn into a religion. As long as you
don't end up fighting your design, unable to access your own
data when you really need it because it's on the wrong side of a relatively
arbitrary boundary, modularization is a good thing.
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 20:40 ` Martin Dalecki
2002-06-20 20:53 ` Linus Torvalds
2002-06-20 21:13 ` Timothy D. Witham
@ 2002-06-21 19:53 ` Rob Landley
2 siblings, 0 replies; 97+ messages in thread
From: Rob Landley @ 2002-06-21 19:53 UTC (permalink / raw)
To: Martin Dalecki, Cort Dougan
Cc: Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On Thursday 20 June 2002 04:40 pm, Martin Dalecki wrote:
> User Cort Dougan wrote:
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines. When 64 processor machines become commodity (Linux is a
> > commodity hardware OS) something will have to be done. When research
>
> 64 processor machines will *never* become a commodity because:
>
> 1. It's not like parallel machines are something entirely new. They have
> been around for an awfully long time on this planet. (nearly longer than myself)
>
> 2. See 1. Even dual CPU machines are a rarity even *now*.
DOS was a reverse-engineered clone of CP/M with some unix features bolted on
in the early 80's. DOS couldn't multitask on a single CPU. DOS couldn't
handle more than one video card. DOS could barely keep track of more than
one hard drive.
Windows 3.1 through Windows 98 (and Bill Gates' 1/8-scale clone wini-me) were
based on DOS; they couldn't take advantage of SMP if their life depended on
it. NT through 4.0 had a market share dwarfed by the Macintosh.
> 3. Nobody needs them for the usual tasks they are a *waste*
> of resources and economics still applies.
Until Moore's law hits atomic resolution, sure. How long that will take is
hotly debated...
> 4. SMP doesn't scale beyond 4. Point. (64 hardly makes sense...)
Actually it does, just not with Intel's brain dead memory bus architecture.
EV6 goes to 32-way pretty well.
The question is, at what point is it cheaper to just go to NUMA or clusters.
(And at what point do your trace lengths get long enough that SMP starts
acting like NUMA? And at what point do your cluster interconnects get fast
enough that something like Mosix starts acting like NUMA?)
And the REALLY interesting advance is SMT (hyper-threading), rather than SMP.
How do you go beyond the athlon's three execution cores without running out
of parallel instructions to feed them? Simple, teach the chip about
processes, so it can advance multiple points of execution to keep the cores
fed. This lets you throw a higher transistor budget at the L1 and L2 caches
without encountering diminishing returns as well. It's pretty
straightforward, and at the very least allows dispatching interrupts in
parallel and lets your GUI overlap drawing on the screen with the processing
to figure out what goes on the screen. Between the two of them, even X11
might finally give me smooth mouse scrolling, one of these days... :)
SMP on a chip really is overkill. Why give the multiple processors their own
cache and memory bus interface? Waste of transistors, power, heat, etc...
SMT is minimalist SMP on a chip...
> 5. It will never become a commodity to run highly transactional
> workloads where integrated bunches of 4 make sense. Neither will
> it be common to solve partial differential equations for aeroplane
> dynamics or to calculate the behaviour of a hydrogen bomb.
No, but it will be common to display bidirectional MP4 compressed video
through an encrypted link, with sound, quite possibly in a window while you
do other stuff with the machine. And some day voice recognition may actually
replace "the clapper" to turn your light off when you get into bed at night...
> One exception could be dedicated rendering CPUs - which is the
> direction where graphics cards are apparently heading - but they
"heading"? Headed. (What did you think your 3D accelerator card was?)
> PS. I'm sick of seeing bunches of PC's which are accidentally in
> the same room nowadays in the list of the 500 fastest computers
> on the world. It makes this list useless...
It shows who has money to throw at the problem, and approximately how much,
which is all it ever really showed...
> If one wants to have a grasp on how the next generation of
> really fast computers will look. Well: they will be based
> on Josephson junctions. TRW will build them (same company
> as the Voyager probe). Look there, they don't plan for thousands of CPUs;
> they plan for a few CPUs in liquid helium:
>
> http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
And Cray bathed their circuitry in Fluorinert decades ago. Liquid helium
ain't winding up on my desktop any time soon, and my laptop outperforms a
Cray-1, and I use it for a dozen variations of text editing (coding,
email...). Not interesting.
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 18:46 ` Cort Dougan
@ 2002-06-21 20:25 ` Daniel Phillips
2002-06-22 1:07 ` Horst von Brand
0 siblings, 1 reply; 97+ messages in thread
From: Daniel Phillips @ 2002-06-21 20:25 UTC (permalink / raw)
To: Cort Dougan, Jeff Garzik
Cc: Larry McVoy, Eric W. Biederman, Linus Torvalds, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Friday 21 June 2002 20:46, Cort Dougan wrote:
> I don't see Linux being in serious jeopardy in the short-term of becoming
> solaris. It only aims at running on 1-4 processors and does a pretty good
> job of that. Most sane people realize, as Larry points out, that the
> current design will not scale to 64 processors and beyond. That's obvious,
> it's not an alarmist or deep statement. The key is to realize that it's
> not _meant_ to scale that high right now.
And originally, it was never meant to scale to more than one processor.
--
Daniel
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:27 ` Martin Dalecki
2002-06-20 21:37 ` Linus Torvalds
@ 2002-06-21 20:38 ` Rob Landley
1 sibling, 0 replies; 97+ messages in thread
From: Rob Landley @ 2002-06-21 20:38 UTC (permalink / raw)
To: Martin Dalecki, Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Thursday 20 June 2002 05:27 pm, Martin Dalecki wrote:
> User Linus Torvalds wrote:
> > On Thu, 20 Jun 2002, Martin Dalecki wrote:
> >>2. See 1. even dual CPU machines are a rarity even *now*.
> >
> > With stuff like HT, you may well not be able to _buy_ an intel desktop
> > machine with just "one" CPU.
>
> Linus you forget one simple fact - a HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> superscalar pipelines. And it's only slightly better.
> Just another way of increasing the fill rate of the pipelines
> for some specific tasks.
Wrong.
RISC let you have two execution cores dispatching instructions in parallel.
(Two instructions per clock). AMD expanded this to three execution cores in
the Athlon with clever and insanely complex cisc->risc translation and
pipeline organizing circuitry. Intel couldn't match that (at first) and went
to VLIW, hence itanic.
VLIW/EPIC was an attempt to figure out how to keep more execution cores busy
without having each one know what the other ones are doing, and without the
hardware searching for parallelism in a single instruction stream. Offload
the parallelism-finding work onto the compiler, batch the resulting
instructions together in groups, and explicitly feed an instruction to each
execution core, each clock cycle. If there's nothing for a core to do, feed
it a NOP. That way you can have three execution cores (getting three
instructions per clock), and you can even do four or five or six cores
receiving big batches of parallel instructions and executing the whole mess
each clock cycle in parallel.
Of course the real bottleneck in a processor that's clock-multiplied by a
factor of 20 relative to the motherboard it sits in is the memory bus speed,
and L1 cache size (since it's up to 20x slower when it hits the edge of the
cache), and VLIW makes the memory bus MORE of a bottleneck, so the resulting
performance sucks tremendously. Oops. Back to the drawing board. (R.I.P.
itanium, modulo intel's marketing budget...)
Hyper-threading is another way to keep extra execution cores busy: teach the
chip about processes and dole the execution cores out to each process
depending on how many they can use. (One, two, or three, depending on how
parallel the next few instructions in the thread are.)
Of course each thread needs its own register profile, but register renaming
for speculative execution is way more complicated than that. And you need to
teach the MMU how to look at more than one set of page tables at a time, but
that's doable too.
Putting full-blown SMP on a chip means you're duplicating all sorts of
circuitry: your L1 cache, your bus interface logic, etc. SMT is basically
SMP on a chip that shares the L1 cache, AND gives you an excuse to EXPAND it.
(They've got the transistor budget: Xeons have a megabyte or more of cache;
there's just a case of diminishing returns. Now they get to spend the
transistors on a larger cache and actually have it MEAN something.)
And yes, you could go beyond three execution cores with SMT. You could go to
five or six execution cores, and have three threads of execution if you
really wanted to. The design gets a little more complicated, but not really
all that much, since the purpose is to SEPARATE what the threads are doing,
as opposed to the traditional "is core #2 going to interfere with what core
#1 is doing?" You may wind up designing a full-blown instruction scheduler,
but if that's too complex you could always put it in software and call it
code morphing II. :)
We've had a variant of multiprocessing on a chip since the original pentium,
we just called it pipelining. Saying SMT is not "true SMP" is splitting
hairs, and an attempt to win an argument by redefining the words used in the
original statement. (I wasn't wrong: that color's not blue!)
> > Get with the flow. The old Windows codebase is dead as far as new
> > machines are concerned, which means that there is no reason to hold back
> > any more: all OS's support SMP.
> >
> >>3. Nobody needs them for the usual tasks they are a *waste*
> >>of resources and economics still applies.
> >
> > That's a load of bull.
>
> Did I mention that ARMs are the most sold CPUs out there?
So they finally passed the enormous installed base of Z80's in traffic
lights, elevators, and microwaves? Bully for them.
What USE this information is remains an open question.
> For the usual task of controlling just the fuel level of the motor
> or the like, one CPU is fine. For the other usual
> tasks - well, dissect a PCMCIA WLAN card or some reasonably fast
> ethernet card or some hard disk. You will find tons of
> independent CPUs in your system... but they are hardly SMP
> connected. For the other usual task my single Athlon is
> just fine.
And the Z80 hooked up to an S100 bus running CP/M shall always rule forever
and ever, hallelujah, amen. Case dismissed.
> > The number of people doing things like mp3 ripping is apparently quite
> > high. And it's definitely CPU-intensive.
> >
> > Now, I suspect that past two CPU's you won't find much added oomph, but
>
> Well, on intel two CPUs give you about 1.5 times the horsepower of
> a single CPU. On good SMP systems it's about 1.7.
Intel's traditional way of doing SMP sucks (the memory bus is STILL the main
bottleneck to performance: let's share it!), and most PC OSes have
traditionally had mondo lock contention doing even simple things. Okay. So?
> > Intel made SMP cheap by putting all the glue logic on-chip and in the
> > standard chipsets.
>
> Not if I look out to buy a real SMP board.
Again with the "the PC isn't a real computer" line of argument...
> They are still
> very expensive in comparison to normal boards. However
> indeed they are nowadays affordable.
A year and a half ago I worked at the company that prototyped the first dual
Athlon board (Boxxtech: Tyan owed them a favor). Intel was never interested
in bringing out a dual celeron motherboard (the first celerons were so
cache-crippled that trying to SMP them was just painful). They only wanted
to do SMP at the high end, and as processors came down in price they yanked
the SMP support circuitry.
Add in the fact that the Intel SMP bus still sucks tremendously and that the
dominant OS through Windows 98 couldn't even understand two graphics cards
(and often got confused by two NETWORK cards), and we're not talking a recipe
for widespread adoption here...
> > And besides, you don't actually need to _scale_ well, if the actual
> > incremental costs are low. That's the whole point with the P4-HT, of
> > course. Intel claims 5% die area addition for a 30% scaling. They may be
>
> The 30% - I never saw it in the intel paper. I remember they talked
> about 20% + something. And 30% is a *peak* value.
Sure. Keeping that third execution core busy 24/7. In the rare instances
where their pipeline organizer can devote that third execution core to
advancing the first process, preventing it from doing so slows that first
process down by repurposing a resource that would NOT otherwise have been
wasted. (Minus a 3% performance penalty for extra cache thrashing and memory
bus contention.)
Now add a FOURTH execution core to the chip, bump the L1 cache size a bit,
and watch performance go up 25%...
I am REALLY waiting for AMD to start doing this. We've been waiting for "smp
on a chip" (outside of PPC) for years, without ever explaining what the
advantage was of giving each one its own bus interface unit and L1 cache...
> The paper in question talks about 12% on average. Awfully much for
> 5% die area (a 2.4x win), esp. if you look at the constant
> increase of die area of CPUs in comparison to the speed, factoring out
> the scaling of the production process. If one factors out
> the production-process scale, modern CPUs are wasting transistors like
> nobody's business in comparison to their older siblings. (Remember the
> 8088 was just about 22 thousand transistors and not 140M!)
> But it's not much in absolute numbers...
Yeah. It's called "a good idea" instead of brute force throwing transistors
at the problem. Even Intel's allowed to have the occasional good idea.
(After itanium they're certainly due for one!)
> > full of sh*t, of course, and it may be that the added complexity in the
> > control logic hurts them in other areas (longer pipeline, whatever), but
> > the point is that if it's cheap, the second CPU doesn't have to "scale".
>
> The main hurting point is the quadrupling of the correctness-testing
> effort. Longer pipelines - I hardly think so. The synchronization
> infrastructure for out-of-order execution was already there in the last CPU
> generation. This is the reason why it's so cheap in terms of die real
> estate to add it now.
In theory they might even be able to get rid of some of it, as long as they
can keep all their execution cores busy 99% of the time without it. (Picking
three simultaneously runnable instructions from two different threads of
execution is a fundamentally easier problem than consistently picking even
two instructions from one thread.)
And it's a far cry from the itanium's way of handling branch prediction to
keep the cores busy. (Execute BOTH forks and throw away the one we don't
take! Yeah, that'll guarantee we waste work so we LOOK busy, but don't
actually run noticeably faster! Brilliant! (What, is the goal to make the
chip run hot? A 95% prediction rate isn't enough for you, and you're STILL
going to stall the pipeline when you hit the edge of the L1 cache anyway...))
> BTW. Them pulling this trick shows nicely that we are now at a point
> where there will be hardly any increase in the deployment of micro-scale
> parallelism in CPU design nowadays...
Famous last words...
> And not just on behalf of
> the CPU - even more importantly, you could read it as a public admission of
> the fact that we are near the end of static optimizations via improvements
> in compiler technology as well. Oh, the compiler people have promised
> miracles constantly since the first days of pipelining, of course...
Trust me: GCC 3.x can still be seriously improved upon.
> In view of this I would love to see how they intend
> to HT the VLSI design of the Itanic :-).
Well, the rumors are that Intel is going to bury iTanic in a sea trench and
license x86-64. AMD has confirmed that intel licensed the rights to the
x86-64 instruction set, and intel's prototype is apparently called yamhill:
http://www.matrixlist.com/pipermail/pc_support/2002-May/001416.html
Whether or not AMD got a license to the inevitable hyper-threading patents in
return, I have no idea. (If AMD would just buy transmeta and be done with
it, I'd feel more comfortable predicting them. I have friends who work
there; that rumor mill's bandwidth is full of the trouble they're having with
absolutely sucky motherboard chipsets, and nvidia shipping out-of-spec
graphics cards that the chipsets are actually designed to compensate for,
which wind up screwing up other things by being out of spec. Or something
like that; that's the trouble with rumors, details get mangled...)
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-20 21:59 ` Martin Dalecki
2002-06-20 22:18 ` Linus Torvalds
2002-06-21 7:43 ` Zwane Mwaikambo
@ 2002-06-21 21:02 ` Rob Landley
2002-06-22 3:57 ` (RFC)i386 arch autodetect( was Re: latest linus-2.5 BK broken ) Matthew D. Pitts
2 siblings, 1 reply; 97+ messages in thread
From: Rob Landley @ 2002-06-21 21:02 UTC (permalink / raw)
To: Martin Dalecki, Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
On Thursday 20 June 2002 05:59 pm, Martin Dalecki wrote:
> Well, but this simply still doesn't make SMP magically scale
> better. HT gives you about a 12% increase in throughput on average.
> This will hardly improve your MP3 ripping experience :-).
HT is currently sopping up the idle time on the second and third execution
cores in the processor, and the fact that the processor before HT only had as
many cores as it could at least sometimes use means that these execution
cores aren't always idle.
That said, there's nothing to stop them from adding a fourth execution core
to the die and getting a 25% boost, and then a fifth core and getting a
little boost from that too. (And when you add the sixth core, teach the
processor about the concept of a third thread, at which point you just write
an instruction dispatcher feeding an arbitrary number of thread instruction
streams into an arbitrary number of execution cores, and then add cores to
your heart's content until you start having NUMA problems in your L1
cache... :)
By the way, your mp3 ripping experience is largely about latency, which HT
does help. (Realtime is all about getting a tiny amount of work done NOW,
rather than a lot of work done after a significant fraction of a second
scheduling delay.) As long as ripping and playback don't skip, processes
that can be batched aren't really the problem. (Suck this CD dry, crunch it
to files in this directory, I'm going to answer email in the meantime.)
> > When you integrate multiple CPU's on one standard die (either HT or real
> > CPU's), the same thing happens.
>
> Again HT is still only one CPU. You are too software centric :-).
It's a CPU that literally can advance two processes at once. Not "time
slice, time slice, time slice" with evil context switches in between
thrashing your cache, but actual parallel processing.
My understanding is that with HT turned on, one of your three execution cores
is devoted to each thread, and they get to fight over who gets to use the
third each clock cycle. So you get to queue up DMA for that screaming scsi
card without waiting for your other system call to exit its critical region.
Hence the latency picture is REALLY NICE...
> However crossbar switches are indeed allowing for maximally
> 64 CPUs and, more importantly, it's the first step in a long time
> to provide better overall system throughput. However they will still
> not be near any commodity - too much heat for the foreseeable future.
If you can do 8-way SMP/SMT on a chip (does SMT with twice as many execution
cores as threads count as "real" SMP to you?), and then you fit that in an
8-way motherboard, boom: you have 64 way. Without really needing crossbar
switches if you don't want to go that way...
Sooner or later they'll just have an arbitrary execution-core scheduler, and
they won't have a fixed ratio of threads to cores; you'll just feed the
chip what you've got and it'll power down any cores that aren't in use this
clock cycle. I can easily see transmeta scaling code morphing up to dozens
or even hundreds of execution cores in that case...
That's a few years in the future, though.
> > A 3D tech person might say that the technology is still the same.
> >
> > But a real human will notice that it's radically different.
>
> Yes, but you can drive the technology only up to the perceptual limits
> of a human. For example, since about 6 years all those advancements
> in the graphics area are largely uninteresting to me. I don't
> play computer games. Never - they are too boring. Yet another
> fan in my computer - no thanks.
"It doesn't interest me so it's not interesting" is not a good argument. But
the fact that the human visual-perception threshold has long been reported to
be 80 million triangles per second, and that we're approaching the ability to
do that in real time with commodity off-the-shelf video cards (another two or
three generations of Moore's law and we WON'T be able to see the
difference...) - that is a point.
> Well, the last real technological jump comparable to the invention
> of television was actually due to this kind of CPU which you
> compare to microbes - mobiles :-). And well, I'm awaiting the
> day when there will be some WinWLAN card as shoddy as those Win
> modems are... Fortunately they made 802.11b complicated enough :-)
> But with a crossbar switch in place they could well make up for
> the latency on the main CPU... oh fear... oh scare...
The latency in the Cat 5 dwarfs any latency you're going to have on the
motherboard, and that's something they deal with by just making gigabit and
higher synchronous. No reason you can't have a win-ethernet card, except that
100baseT is now $4.50 on a card (and a lot less as a chip on the motherboard,
and that's just a licensing cost; the IC is pennies), and your "last mile"
cable modem or DSL still isn't maxing out the ten megabit ethernet connection
you're really hooking up to the internet through...
There's no excess cost to squeeze out of here by going to a DSP...
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 20:25 ` Daniel Phillips
@ 2002-06-22 1:07 ` Horst von Brand
2002-06-22 1:23 ` Larry McVoy
0 siblings, 1 reply; 97+ messages in thread
From: Horst von Brand @ 2002-06-22 1:07 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linux Kernel Mailing List
[Cc:s heavily snipped]
Daniel Phillips <phillips@arcor.de> said:
> On Friday 21 June 2002 20:46, Cort Dougan wrote:
> > I don't see Linux being in serious jeopardy in the short-term of becoming
> > solaris. It only aims at running on 1-4 processors and does a pretty good
> > job of that. Most sane people realize, as Larry points out, that the
> > current design will not scale to 64 processors and beyond. That's obvious,
> > it's not an alarmist or deep statement. The key is to realize that it's
> > not _meant_ to scale that high right now.
>
> And originally, it was never meant to scale to more than one processor.
Right. If they had designed it for 4/8 CPUs from the start, they would
surely have gotten it dead wrong. Just to find out how wrong around now...
If 64-way becomes commodity one day in whatever form the hardware people
dream up, Linux will surely follow.
--
Horst von Brand vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Viña del Mar, Chile +56 32 672616
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 1:07 ` Horst von Brand
@ 2002-06-22 1:23 ` Larry McVoy
2002-06-22 12:41 ` Roman Zippel
2002-06-23 15:15 ` Sandy Harris
0 siblings, 2 replies; 97+ messages in thread
From: Larry McVoy @ 2002-06-22 1:23 UTC (permalink / raw)
To: Horst von Brand; +Cc: Daniel Phillips, Linux Kernel Mailing List
On Fri, Jun 21, 2002 at 09:07:10PM -0400, Horst von Brand wrote:
> Right. If they had designed it for 4/8 CPUs from the start, they would
> surely have gotten it dead wrong. Just to find out how wrong around now...
I couldn't disagree more. The reason that all the SMP threaded OS's start
to suck is that managers say "Yeah, one CPU is good but how about 2?" Then
a year goes by and then they say "Yeah, 2 CPUs are good but how about 4?".
Etc. So the system is never designed, it is hacked. It's no wonder they
suck.
My point has always been that if you were told up front that you needed to
hit 2 orders of magnitude more CPUs than you have today, the design you'd
end up with would be very different than the "just hack it some more to get
2x more CPUs".
The interesting thing is to look at the ways you'd deal with 1024 processors
and then work backwards to see how you scale it down to 1. There is NO WAY
to scale a fine-grain threaded system which works on a 1024-CPU system down
to a 1-CPU system; those are profoundly different.
I think you could take the OS-cluster idea and scale it up as well as down.
Scaling down is really important: Linux works well in the embedded space, and
that is probably the greatest financial success story Linux has. Let's not
screw it up.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 97+ messages in thread
* (RFC)i386 arch autodetect( was Re: latest linus-2.5 BK broken )
2002-06-21 21:02 ` Rob Landley
@ 2002-06-22 3:57 ` Matthew D. Pitts
2002-06-22 4:54 ` William Lee Irwin III
0 siblings, 1 reply; 97+ messages in thread
From: Matthew D. Pitts @ 2002-06-22 3:57 UTC (permalink / raw)
To: Rob Landley, Martin Dalecki, Linus Torvalds
Cc: Cort Dougan, Eric W. Biederman, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Rob, et al...
Is there any plan to rewrite the i386 architecture to support auto-detection
of cpu's? Given the nature of the discussion here, that is something I would
love to tackle.
Matthew
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: (RFC)i386 arch autodetect( was Re: latest linus-2.5 BK broken )
2002-06-22 3:57 ` (RFC)i386 arch autodetect( was Re: latest linus-2.5 BK broken ) Matthew D. Pitts
@ 2002-06-22 4:54 ` William Lee Irwin III
0 siblings, 0 replies; 97+ messages in thread
From: William Lee Irwin III @ 2002-06-22 4:54 UTC (permalink / raw)
To: Matthew D. Pitts
Cc: Rob Landley, Martin Dalecki, Linus Torvalds, Cort Dougan,
Eric W. Biederman, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List, Martin.Bligh, colpatch, cleverdj
On Fri, Jun 21, 2002 at 11:57:57PM -0400, Matthew D. Pitts wrote:
> Is there any plan to rewrite the i386 architecture to support auto-detection
> of cpu's? Given the nature of the discussion here, that is something I would
> love to tackle.
Probably more urgent are APIC drivers, as it seems too many people are
trying to stuff flat bitmasks into ICR2 and other places directly and
breaking things. Also, an understanding of the fact that when flat
logical mode is being used no CPUs beyond #7 can be addressed
would be nice to have, if only for a more graceful failure mode than
deadlock in flush_tlb_all() and/or the cpu wakeup code when more CPUs
than the configured APIC mode can address are present. Perhaps afterward
a dynamic understanding of when the clustered hierarchical destination
format is required can be explored, involving dynamically switching
between the flat destination format with logical IDs and the clustered
hierarchical destination format with logical IDs plus physical IDs for
cluster-local IPIs. On the NUMA-Q, the arrangement of cluster IDs is
dynamically configurable, allowing one to configure each cluster so that
the broadcast cluster ID clashes only with the local cluster, and so
64 CPUs may be addressed, as cluster-local IPIs may be done physically.
Cheers,
Bill
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 15:31 ` Alan Cox
@ 2002-06-22 12:24 ` Rob Landley
2002-06-22 19:00 ` Ruth Ivimey-Cook
2002-06-22 21:09 ` jdow
2002-06-23 21:40 ` [OT] " Xavier Bestel
2 siblings, 1 reply; 97+ messages in thread
From: Rob Landley @ 2002-06-22 12:24 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel Mailing List
On Saturday 22 June 2002 11:31 am, Alan Cox wrote:
> > A microkernel design was actually made to work once, with good
> > performance. It was about fifteen years ago, in the amiga. Know how they
> > pulled it off? Commodore used a mutant ultra-cheap 68030 that had -NO-
> > memory management unit.
>
> Vanilla 68000 actually. And it never worked well - the UI folks had
> to use a library not threads. The fs performance sucked
I dug through my notes a bit, and the interview I was thinking of (with one
of the designers before he died, Jay Miner I think) said that when they did
upgrade to the 68030 (long after the A1000), they specifically commissioned
an MMU-less version (the 68EC030), and that if they'd had to deal with an MMU
in the first place he doubted they could ever have gotten a microkernel
architecture to work.
Unfortunately, all I have from said interview at the moment are the notes I
took. My first year of computer history research was a learning experience
about how to do research, back before I learned to store the URL the notes
came from with the notes (no, the fact that it's in my bookmarks list doesn't
mean I can find it again), and to save pages to my hard drive because the
links have been known to go away over time... :)
On a side note, it's fun looking through the tanenbaum-torvalds debate
archive and seeing all the people holding up the amiga as an example of a
successful microkernel with decent performance, and noting the lack of MMU...
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 1:23 ` Larry McVoy
@ 2002-06-22 12:41 ` Roman Zippel
2002-06-23 15:15 ` Sandy Harris
1 sibling, 0 replies; 97+ messages in thread
From: Roman Zippel @ 2002-06-22 12:41 UTC (permalink / raw)
To: Larry McVoy; +Cc: Horst von Brand, Daniel Phillips, Linux Kernel Mailing List
Hi,
On Fri, 21 Jun 2002, Larry McVoy wrote:
> On Fri, Jun 21, 2002 at 09:07:10PM -0400, Horst von Brand wrote:
> > Right. If they had designed it for 4/8 CPUs from the start, they would
> > surely have gotten it dead wrong. Just to find out how wrong around now...
>
> I couldn't disagree more. The reason that all the SMP threaded OS's start
> to suck is that managers say "Yeah, one CPU is good but how about 2?" Then
> a year goes by and then they say "Yeah, 2 CPUs are good but how about 4?".
> Etc. So the system is never designed, it is hacked. It's no wonder they
> suck.
That's the important difference here, we have no managers forcing us to
specific goals. We have the time to develop a good solution, we are not
forced to accept a solution which sucks. We have the freedom to constantly
break the kernel and we don't have to maintain backwards compatibility,
which especially with regard to locking would really suck.
bye, Roman
* Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
2002-06-20 17:10 ` William Lee Irwin III
2002-06-21 5:16 ` Eric W. Biederman
@ 2002-06-22 14:14 ` Kai Henningsen
2 siblings, 0 replies; 97+ messages in thread
From: Kai Henningsen @ 2002-06-22 14:14 UTC (permalink / raw)
To: linux-kernel
pashley@storm.ca (Sandy Harris) wrote on 20.06.02 in <3D11F7B9.27C74922@storm.ca>:
> For large multi-processor systems, it isn't clear that those matter
> much. On single user systems I've tried , ps -ax | wc -l usually
> gives some number 50 < n < 100. For a multi-user general purpose
156 here right now, and I'd call that a light load. On a
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 350.818
with 768 MB - not the fastest machine around.
MfG Kai
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-21 19:34 ` Rob Landley
@ 2002-06-22 15:31 ` Alan Cox
2002-06-22 12:24 ` Rob Landley
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Alan Cox @ 2002-06-22 15:31 UTC (permalink / raw)
To: Rob Landley
Cc: Jeff Garzik, Larry McVoy, Eric W. Biederman, Linus Torvalds,
Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
> A microkernel design was actually made to work once, with good performance.
> It was about fifteen years ago, in the amiga. Know how they pulled it off?
> Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
> unit.
Vanilla 68000 actually. And it never worked well - the UI folks had
to use a library not threads. The fs performance sucked
* Re: latest linus-2.5 BK broken
2002-06-21 17:50 ` Larry McVoy
2002-06-21 17:55 ` Robert Love
2002-06-21 18:09 ` Linux, the microkernel (was Re: latest linus-2.5 BK broken) Jeff Garzik
@ 2002-06-22 18:25 ` Eric W. Biederman
2002-06-22 19:26 ` Larry McVoy
2 siblings, 1 reply; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-22 18:25 UTC (permalink / raw)
To: Larry McVoy
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Larry McVoy <lm@bitmover.com> writes:
> On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote:
> > I think Larry's perspective is interesting and if the common cluster
> > software gets working well enough I might even try it. But until a
> > big SMP becomes commodity I don't see the point.
>
> The real point is that multi threading screws up your kernel. All the Linux
> hackers are going through the learning curve on threading and think I'm an
> alarmist or a nut. After Linux works on a 64 way box, I suspect that the
> majority of them will secretly admit that threading does screw up the kernel
> but at that point it's far too late.
I don't see an argument that locks that get too fine-grained are not an
issue. However, even traditional versions of single-CPU Unix are
multithreaded. The locking in a multi-CPU design just makes that explicit.
And the only really nasty place to get locks is when you get a
noticeable number of them in your device drivers. With the core code
you can fix it without worrying about killing the OS.
> The current approach is a lot like western medicine. Wait until the
> cancer shows up and then make an effort to get rid of it. My suggested
> approach is to take steps to make sure the cancer never gets here in
> the first place. It's proactive rather than reactive. And the reason
> I harp on this is that I'm positive (and history supports me 100%)
> that the reactive approach doesn't work, you'll be stuck with it,
> there is no way to "fix" it other than starting over with a new kernel.
> Then we get to repeat this whole discussion in 15 years with one of the
> Linux veterans trying to explain to the NewOS guys that multi threading
> really isn't as cool as it sounds and they should try this other
> approach.
Proactive: don't add a lock unless you can really justify that you need
it. That is well suited to open source code review type practices,
and it appears to be what we are doing now. And if you don't add
locks you certainly don't get into a lock tangle.
As for history supporting you 100%, all I see is that evolution of code,
as it dynamically gathers the requirements instead of magically
knowing them, does much better than up-front design as a long-term
strategy. Of course you design the parts you can see, but everyone has a
limited ability to see the future.
To specifics, I don't see the point of OSlets on a single cpu that is
hyper threaded. Traditional threading appears to make more sense to
me. Similarly I don't see the point in the 2-4 cpu range.
Given that there are some scales when you don't want/need more than
one kernel, who has a machine where OSlets start to pay off? They
don't exist in commodity hardware, so being proactive now looks
stupid.
The only practical course I see is to work on solutions that work on
clusters of commodity machines. At least any one who wants one can
get one. If you can produce a single system image, the big iron guys
can tweak the startup routine and run that on their giant NUMA or SMP
machines.
Eric
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 12:24 ` Rob Landley
@ 2002-06-22 19:00 ` Ruth Ivimey-Cook
0 siblings, 0 replies; 97+ messages in thread
From: Ruth Ivimey-Cook @ 2002-06-22 19:00 UTC (permalink / raw)
To: Rob Landley; +Cc: Alan Cox, Linux Kernel Mailing List
On Sat, 22 Jun 2002, Rob Landley wrote:
>On Saturday 22 June 2002 11:31 am, Alan Cox wrote:
>> > A microkernel design was actually made to work once, with good
>> > performance. It was about fifteen years ago, in the amiga. Know how they
>> > pulled it off? Commodore used a mutant ultra-cheap 68030 that had -NO-
>> > memory management unit.
>>
>> Vanilla 68000 actually. And it never worked well - the UI folks had
>> to use a library not threads. The fs performance sucked
Threads (in the sense of tasks[1]) in fact worked extremely well and very
efficiently on the Amiga, and "Intuition" was always coded as one thread and
was modified to use them more widely as the programmers had time and
resources to do so.
>On a side note, it's fun looking through the tanenbaum-torvalds debate
>archive and see all the people holding up the amiga as an example of a
>successful microkernel with decent performance, and note the lack of MMU...
I was very happy indeed with the performance of the computer, given the 0.25
MIPS CPU. The "Exec" scheduler was an extremely good design of its type, as
has been recognised in various places since.
The filesystem of the Amiga was very slow because it was a very definitely
second-best setup; the original Amiga Corp. folks ran out of cash and in the
end the filesystem from another OS, Tripos, was grafted in. Not only was it
not what was originally designed in, but it was written in an
almost-incompatible language (BCPL).
However, I won't argue about MMU vs non-MMU; it was obvious from the start
that any kind of memory protection between tasks would render a great deal of
the system design useless, because the whole system shared memory and
resources. How else did people get away with application footprints 1/5 to
1/10 that of equivalents on Windows?
Regards,
Ruth
[1] Exec only understood "tasks" as the basic scheduling unit; a task could be
extended to become a process if access to the filesystem was required, but
doing so did not change the scheduling cost at all.
--
Ruth Ivimey-Cook
Software engineer and technical writer.
* Re: latest linus-2.5 BK broken
2002-06-22 18:25 ` latest linus-2.5 BK broken Eric W. Biederman
@ 2002-06-22 19:26 ` Larry McVoy
2002-06-22 22:25 ` Eric W. Biederman
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Larry McVoy @ 2002-06-22 19:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Larry McVoy, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote:
> I don't see a argument that locks that get to fine grained are not an
> issue. However even traditional version of single cpu unix are multi
> threaded. The locking in a multi cpu design just makes that explicit.
>
> And the only really nasty place to get locks is when you get a
> noticeable number of them in your device drivers. With the core code
> you can fix it without out worrying about killing the OS.
Just out of curiosity, have you actually ever worked on a fine grain
threaded OS? One that scales to at least 32 processors? Solaris? IRIX?
Others? It makes a difference, if you've been there, your perspective is
somewhat different than just talking about it. If you have worked on one,
for how long? Did you support the source base after it matured for any
length of time?
> Proactive don't add a lock unless you can really justify that you need
> it. That is well suited to open source code review type practices,
> and it appears to be what we are doing now. And if you don't add
> locks you certainly don't get into a lock tangle.
That's a great theory. I support that theory; life would be great if it
matched that theory. Unfortunately, I don't know of any kernel which
matches that theory, do you? Linux certainly doesn't. FreeBSD certainly
doesn't. Solaris/IRIX crossed that point years ago. So where is the
OS which has managed to resist the lock tangle?
linux-2.5$ bk -r grep CONFIG_SMP | wc -l
1290
That's a lot of ifdefs for a supposedly tangle free kernel. And I suspect
that the threading people will say Linux doesn't really scale beyond
2-4 CPUs for any I/O bound work load today. What's it going to be when
Linux is at 32 CPUs? Solaris was around 3000 statically allocated locks
when I left and I think it was scaling to maybe 8. At SGI, they were
carefully putting the lock on the same cache line as the data structure
that it protected, for every locked data structure that had any contention.
The limit as the number of CPUs goes up is that each read/write cache
line in the data segment has a lock. They certainly weren't there,
but they were much closer than you might guess. It was definitely the
norm that you laid out your locks with the data, it was that pervasive.
Take a walk through sched.c and you can see the mess starting. How
can anyone support that code on both UP and SMP? You are already
supporting two code bases. Imagine what it is going to look like when
the NUMA people get done. Don't forget the preempt people. Oh, yeah,
let's throw in some soft realtime, that shouldn't screw things up too
much.
> To specifics, I don't see the point of OSlets on a single cpu that is
> hyper threaded. Traditional threading appears to make more sense to
> me. Similarly I don't see the point in the 2-4 cpu range.
In general I agree with you here, but I think you haven't really considered
all the options. I can see the benefit on a *single* CPU. There are all
sorts of interesting games you could play in the area of fault tolerance
and containment. Imagine a system, like what IBM has, that runs lots of
copies of Linux with the mmap sharing turned off. ISPs would love it.
Jeff Dike pointed out that if UML can run one kernel in user space, why
not N? And if so, the OS clustering stuff could be done on top of
UML and then "ported" to real hardware. I think that's a great idea,
and you can carry it farther, you could run multiple kernels just for
fault containment. See Sun's domains, DEC's Galaxy.
> Given that there are some scales when you don't want/need more than
> one kernel, who has a machine where OSlets start to pay off? They
> don't exist in commodity hardware, so being proactive now looks
> stupid.
Not as stupid as having a kernel no one can maintain and not being able
to do anything about it. There seems to be a subthread of elitist macho
attitude along the lines of "oh, it won't be that bad, and besides,
if you aren't good enough to code in a fine grained locked, soft real
time, preempted, NUMA aware kernel, then you just shouldn't be in the kernel".
I'm not saying you are saying that, but I've definitely heard it on
the list.
It's a great thing for bragging rights but it's a horrible thing from
the sustainability point of view.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 15:31 ` Alan Cox
2002-06-22 12:24 ` Rob Landley
@ 2002-06-22 21:09 ` jdow
2002-06-23 17:56 ` John Alvord
2002-06-23 21:40 ` [OT] " Xavier Bestel
2 siblings, 1 reply; 97+ messages in thread
From: jdow @ 2002-06-22 21:09 UTC (permalink / raw)
To: Rob Landley, Alan Cox
Cc: Jeff Garzik, Larry McVoy, Eric W. Biederman, Linus Torvalds,
Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
From: "Alan Cox" <alan@lxorguk.ukuu.org.uk>
> > A microkernel design was actually made to work once, with good performance.
> > It was about fifteen years ago, in the amiga. Know how they pulled it off?
> > Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
> > unit.
>
> Vanilla 68000 actually. And it never worked well - the UI folks had
> to use a library not threads. The fs performance sucked
Some things just cannot be passed by..... The Amiga HAS worked well and
DOES work well - - - FINALLY. (It took several years and a VERY serious
debugging effort with Bill Hawes and Bryce Nesbitt finding and quashing
all manner of bad or missing pointer checks and the like. They made the
OS itself a remarkable work of art.)
You are right, Alan, in that it used a vanilla, slow, 68000 in its original
incarnation. A company named Metacomco generated the "DOS" part of the
system. IMAO they should have been sued for malpractice. The only good feature
the file system had was its resilience. Had it been coded correctly loss of
data would have been hard to achieve short of physical disk problems. Later
incarnations of the file system proved it could be remarkably fast accessing
specific files. Directory listings remain agonizingly slow.
The OS "exec" library is remarkably compact, quick, and resilient. It does
suffer from not using memory protection. However, in a testament to some
Amiga programmers AmigaDOS can survive months of up time with typical single
user loads with its latest incarnation. (My slightly hypertrophied A3000T
sits over there running some applications for me 24x7 quite nicely, thank
you.) This has had me musing about the relative quality of Linux applications
that blithely throw segfaults rather than check for overflows, null pointers,
and the like. I'd NEVER let the typical Linux application touch my Amigas for
two reasons, the crashes are annoying and they mean there are security holes
waiting for exploitation.
Having "everything" in the system a shared library has some advantages for
updating things on the fly without reboots, as is routinely exploited within
the Linux world. A side effect of the way this was implemented yields a
rather endearing Amiga trait, you cannot exceed array boundaries in most of
the OS and shared libraries. Arrays are eschewed in favor of linked lists.
'Tis a shame the idiots who owned and ran the company sucked it dry and
tossed the remains. The OS could wring remarkable performance out of rather
antiquated hardware, well in excess of what Apple could wring out of the
same hardware.
{^_-} I am rather fond of the tool. And I note it has (and in some instances
still) performed admirably in near real-time applications such as
show control (EFX in Las Vegas for one) and telemetry reception and
analysis (at NASA.) To be sure AmigaDOS 1.0 through 1.3 were rather
dreadful. 2.04 was remarkable. No AmigaDOS was EVER even approximately
as bad as an abortion I had to work on called GRiD-OS, however.
* Re: latest linus-2.5 BK broken
2002-06-22 19:26 ` Larry McVoy
@ 2002-06-22 22:25 ` Eric W. Biederman
2002-06-22 23:10 ` Larry McVoy
2002-06-23 6:34 ` William Lee Irwin III
2002-06-23 22:56 ` Kai Henningsen
2 siblings, 1 reply; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-22 22:25 UTC (permalink / raw)
To: Larry McVoy
Cc: Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Larry McVoy <lm@bitmover.com> writes:
> On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote:
> > To specifics, I don't see the point of OSlets on a single cpu that is
> > hyper threaded. Traditional threading appears to make more sense to
> > me. Similarly I don't see the point in the 2-4 cpu range.
>
> In general I agree with you here, but I think you haven't really considered
> all the options. I can see the benefit on a *single* CPU. There are all
> sorts of interesting games you could play in the area of fault tolerance
> and containment. Imagine a system, like what IBM has, that runs lots of
> copies of Linux with the mmap sharing turned off. ISPs would love
> it.
Hmm. Perhaps. But you are fundamentally susceptible to the base
kernel, and the hardware on the machine.
> Jeff Dike pointed out that if UML can run one kernel in user space, why
> not N? And if so, the OS clustering stuff could be done on top of
> UML and then "ported" to real hardware. I think that's a great idea,
> and you can carry it farther, you could run multiple kernels just for
> fault containment. See Sun's domains, DEC's Galaxy.
Right. A clustered environment is accessible. For the most part I
don't have a problem (except checkpointing) that is facilitated by
running Linux under Linux.
Currently my problem to solve is compute clusters. My current worry
is not whether I can scale a kernel to 64 CPUs. My practical worry is
whether my user space will scale to 1000 dual-processor machines.
The important point for me is that there are a fair number of
fundamentally hard problems in making multiple kernels look like one,
especially when you start with maximum decoupling. And you seem to
assume that solving these problems is trivial.
Maybe it is maintainable when you get done but there is a huge amount
of work to get there. I haven't heard of a distributed OS as anything
other than a dream, or a prototype with scaling problems.
> > Given that there are some scales when you don't want/need more than
> > one kernel, who has a machine where OSlets start to pay off? They
> > don't exist in commodity hardware, so being proactive now looks
> > stupid.
>
> Not as stupid as having a kernel noone can maintain and not being able
> to do anything about it. There seems to be a subthread of elitist macho
> attitude along the lines of "oh, it won't be that bad, and besides,
> if you aren't good enough to code in a fine grained locked, soft real
> time, preempted, NUMA aware, then you just shouldn't be in the kernel".
> I'm not saying you are saying that, but I've definitely heard it on
> the list.
Hmm. I see a bulk of the on-going kernel work composed of projects to
make the whole kernel easier to maintain. Especially interesting is
the work that makes drivers relatively easy, and free from all of this
cruft.
Running some numbers (wc -l kernel/*.c fs/*.c mm/*.c)
1.2.12: 18813 lines
2.2.12: 37510 lines
2.5.14: 55701 lines
So the core kernel is growing, but at a fairly slow rate. Only worrying
about the 60 thousand lines of generic kernel code is much better than
worrying about the 3 million lines of driver code.
And since you thought it was an interesting statistic:
grep CONFIG_SMP kernel/*.c fs/*.c mm/*.c init/*.c | wc -l
44
So most of the code that cares about SMP is not in the core of the
kernel, but is mostly the code that actually implements SMP support.
So in thinking about it, I agree that the constant simplification work
that is done to the linux kernel looks like one of the most important
activities long term.
> It's a great thing for bragging rights but it's a horrible thing from
> the sustainability point of view.
Given that the simplification efforts tend to be some of the highest
priority activities in the kernel, and the easiest patches to get
accepted, I don't get the feeling that we are walking into a long
term maintenance problem.
As for bragging rights, my kernel work tends to be some of the easiest
code I have to write. I have no doubts that C is a high level
programming language.
Eric
* Re: latest linus-2.5 BK broken
2002-06-22 22:25 ` Eric W. Biederman
@ 2002-06-22 23:10 ` Larry McVoy
0 siblings, 0 replies; 97+ messages in thread
From: Larry McVoy @ 2002-06-22 23:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Larry McVoy, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Sat, Jun 22, 2002 at 04:25:29PM -0600, Eric W. Biederman wrote:
> The important point for me is that there are a fair number of
> fundamentally hard problems to get multiple kernels look like one.
> Especially when you start with a maximum decoupling. And you seem to
> assume that solving these problems are trivial.
No such assumption was made. Poke through my slides, you'll see that I
think it will take a reasonable amount of effort to get there. I actually
spelled out the staffing and the time estimates. Start asking around and
you'll find that senior people who _have_ gone the multi threading route
agree that this approach gets you to the same place with less than 1/10th
the amount of work. The last guy who agreed with that statement was the
guy who headed up the threading design and implementation of Solaris,
he's at Netapp now.
In fairness to you, I'm doing the same thing you are: I'm arguing about
something I haven't done. On the other hand, I have been through (twice)
the thing that you are saying is no problem and every person who has been
there agrees with me that it sucks. It's doable, but it's a nightmare to
maintain, it easily increases the subtlety of kernel interactions by an
order of magnitude, probably closer to two orders.
And I have done enough of what I've described to know it can be done.
People who have deep knowledge of the fine grained approach have tried
to prove that I was wrong and failed, repeatedly. They may not agree
that this is a better way but they can't show that it won't work.
> Maybe it is maintainable when you get done but there is a huge amount
> of work to get there. I haven't heard of a distributed OS as anything
> other than a dream, or a prototype with scaling problems.
This is a distributed OS on one system, that's a lot easier than a
distributed OS across machine boundaries. And if you are worried about
scaling problems, you don't understand the design. The OS cluster idea
multi threads all data structures for free. No locks on 99% of the
data structures that you would need locks on in an SMP os.
Think about this fact: if you have lock contention you don't scale. So
you thread until you don't. Go do the math that shows how tiny a
fraction of 1% of lock contention screws your scaling; everyone has
bumped up against those curves. So the goal of any multithreaded OS
is ZERO lock contention. Makes you wonder why the locks are there
in the first place. They are trying to get to where I want to go but
they are definitely doing it the hard way.
> > Not as stupid as having a kernel noone can maintain and not being able
> > to do anything about it. There seems to be a subthread of elitist macho
> > attitude along the lines of "oh, it won't be that bad, and besides,
> > if you aren't good enough to code in a fine grained locked, soft real
> > time, preempted, NUMA aware, then you just shouldn't be in the kernel".
> > I'm not saying you are saying that, but I've definitely heard it on
> > the list.
>
> Hmm. I see a bulk of the on-going kernel work composed of projects to
> make the whole kernel easier to maintain.
[...]
> I don't get the feeling that we are walking into a long
> term maintenance problem.
I don't mean to harp on this, but if you are going to comment on how
hard it is to maintain a kernel could you please give us some idea of
why it is you think as you do? Do you have some prior experience with a
project of this size that shows what you believe to be true in practice?
You keep suggesting that there isn't a problem, that we aren't headed for
a problem. Why is that? Do you know something I don't? I've certainly
seen what happens to a kernel source base as it goes through this process
a few times and my experience is that what you are saying is the opposite
of what happens. So if you've got some different experience, how about
sharing it? Maybe there is some way to do what you are suggesting will
happen, but I haven't ever seen it personally, nor have I ever heard
of it occurring in any long lived project. All projects become more
complex as time goes on, it's a direct result of the demands placed on
any successful project.
> So in thinking about I agree that the constant simplification work
> that is done to the linux kernel looks like one of the most important
> activities long term.
What constant simplification work? The generic part of the kernel does
more or less what it did a few years ago yet it has grown at a pretty fast
clip. Talk to the embedded people and ask them if they think it has gotten
simpler. By what standard has the kernel become less complex?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: latest linus-2.5 BK broken
2002-06-22 19:26 ` Larry McVoy
2002-06-22 22:25 ` Eric W. Biederman
@ 2002-06-23 6:34 ` William Lee Irwin III
2002-06-23 22:56 ` Kai Henningsen
2 siblings, 0 replies; 97+ messages in thread
From: William Lee Irwin III @ 2002-06-23 6:34 UTC (permalink / raw)
To: Larry McVoy, Eric W. Biederman, Larry McVoy, Linus Torvalds,
Cort Dougan, Benjamin LaHaise, Rusty Russell, Robert Love,
Linux Kernel Mailing List
On Sat, Jun 22, 2002 at 12:26:56PM -0700, Larry McVoy wrote:
> Not as stupid as having a kernel noone can maintain and not being able
> to do anything about it. There seems to be a subthread of elitist macho
> attitude along the lines of "oh, it won't be that bad, and besides,
> if you aren't good enough to code in a fine grained locked, soft real
> time, preempted, NUMA aware, then you just shouldn't be in the kernel".
> I'm not saying you are saying that, but I've definitely heard it on
> the list.
I've been accused of this, so I'll state for the record: my views on
locking are not efficiency-related in the least. They have to do with
ensuring that locks protect well-defined data and that locking
constructs are clean (e.g. nonrecursive and no implicit drop or acquire).
My duties are not directly related to locking, and I only push the
agenda I do as a low-priority kernel janitoring effort. As this is not
a scalability issue, I'll not press it further in this thread.
Cheers,
Bill
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 1:23 ` Larry McVoy
2002-06-22 12:41 ` Roman Zippel
@ 2002-06-23 15:15 ` Sandy Harris
2002-06-23 17:29 ` Jakob Oestergaard
` (2 more replies)
1 sibling, 3 replies; 97+ messages in thread
From: Sandy Harris @ 2002-06-23 15:15 UTC (permalink / raw)
To: Linux Kernel Mailing List
Larry McVoy wrote:
> The interesting thing is to look at the ways you'd deal with a 1024 processors
> and then work backwards to see how you scale it down to 1. There is NO WAY
> to scale a fine grain threaded system which works on a 1024 system down to
> a 1 CPU system, those are profoundly different.
>
> I think you could take the OS cluster idea and scale it up as well as down.
> Scaling down is really important, Linux works well in the embedded space,
> that is probably the greatest financial success story that Linux has, let's
> not screw it up.
Assuming we can get 4-way right, methinks Larry's ideas are likely to be a
whole lot easier way to handle a 32 or 64-way box than trying to re-design
the kernel sufficiently to do that well without destroying anything
important in the 1<= nCPU <= 4 case. Especially so because 16 to 64-way
clusters are common as dirt, and we can borrow tested tools. Anything that
works on a 16-box Beowulf ought to adapt nicely to a 64-way box with 16
of Larry's OSlets.
However, it is a lot harder to see that Larry's stuff is the right way
to deal with a 1024-CPU system. At that point, you've got perhaps 256
4-way groups running OSlets. How does communication overhead scale, and
do we have reason to suppose it is tolerable at 1024?
Also, it isn't as clear that clustering experience applies. Are clusters
that size built hierarchically? Is a 1024-CPU Beowulf practical, and if so
do you build it as a Beowulf of 32 32-CPU Beowulfs? Is something analogous
required in the OSlet approach? Would it work?
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-23 15:15 ` Sandy Harris
@ 2002-06-23 17:29 ` Jakob Oestergaard
2002-06-24 6:27 ` Craig I. Hagan
2002-06-24 10:59 ` Eric W. Biederman
2 siblings, 0 replies; 97+ messages in thread
From: Jakob Oestergaard @ 2002-06-23 17:29 UTC (permalink / raw)
To: Sandy Harris; +Cc: Linux Kernel Mailing List
On Sun, Jun 23, 2002 at 11:15:53AM -0400, Sandy Harris wrote:
> Larry McVoy wrote:
>
...
> Also, it isn't as clear that clustering experience applies. Are clusters
> that size built hierachically? Is a 1024-CPU Beowulf practical, and if so
> do you build it as a Beowulf of 32 32-CPU Beowulfs? Is something analogous
> required in the OSlet approach? would it work?
Well yes and no. Often the hierarchy is really shallow. A typical
(larger) Beowulf (if such a thing exists) could be ~50 nodes per 100Mbit
switch, heaps of those switches go into (interconnected) gigabit
switches, and that's it. There are *many* 'wulfs out there with just
one or a few switches - but they are not 1024 CPUs either.
Much more specialized interconnects are often used. The SP/2 (IBM) used
something resembling "one big switch", which was in reality a number of
cleverly connected smaller switches (sorry, forgot the topology) - so no
real hierarchy, similar bandwidth and latency between any two nodes in a
several-hundred node cluster.
The "Earth Simulator" (the #1 on www.top500.org) is using a one-stage
crossbar for its 5000+ nodes.
My personal pet theory is, in short, that the hardware stays fairly flat
- not because it is beneficial (on the contrary!), but because
software assumes that it is flat. The software paradigms in practical
use today have not changed since the early '80s and as long as the
hardware manages to stay "almost flat" that's not going to change.
--
................................................................
: jakob@unthought.net : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 21:09 ` jdow
@ 2002-06-23 17:56 ` John Alvord
2002-06-23 20:48 ` jdow
0 siblings, 1 reply; 97+ messages in thread
From: John Alvord @ 2002-06-23 17:56 UTC (permalink / raw)
To: jdow
Cc: Rob Landley, Alan Cox, Jeff Garzik, Larry McVoy,
Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
On Sat, 22 Jun 2002 14:09:30 -0700, "jdow" <jdow@earthlink.net> wrote:
>From: "Alan Cox" <alan@lxorguk.ukuu.org.uk>
>
>> > A microkernel design was actually made to work once, with good performance.
>> > It was about fifteen years ago, in the amiga. Know how they pulled it off?
>> > Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
>> > unit.
>>
>> Vanilla 68000 actually. And it never worked well - the UI folks had
>> to use a library not threads. The fs performance sucked
>
>Some things just cannot be passed by..... The Amiga HAS worked well and
>DOES work well - - - FINALLY. (It took several years and a VERY serious
>debugging effort with Bill Hawes and Bryce Nesbitt finding and quashing
>all manner of bad or missing pointer checks and the like. They made the
>OS itself a remarkable work of art.)
Was that the same Bill Hawes who hung around L-K quashing bugs for a
year or so (maybe 3-4 years ago?)
john alvord
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-23 17:56 ` John Alvord
@ 2002-06-23 20:48 ` jdow
0 siblings, 0 replies; 97+ messages in thread
From: jdow @ 2002-06-23 20:48 UTC (permalink / raw)
To: John Alvord
Cc: Rob Landley, Alan Cox, Jeff Garzik, Larry McVoy,
Eric W. Biederman, Linus Torvalds, Cort Dougan, Benjamin LaHaise,
Rusty Russell, Robert Love, Linux Kernel Mailing List
From: "John Alvord" <jalvo@mbay.net>
>On Sat, 22 Jun 2002 14:09:30 -0700, "jdow" <jdow@earthlink.net> wrote:
>>From: "Alan Cox" <alan@lxorguk.ukuu.org.uk>
>>
>>>> A microkernel design was actually made to work once, with good performance.
>>>> It was about fifteen years ago, in the amiga. Know how they pulled it off?
>>>> Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
>>>> unit.
>>>
>>> Vanilla 68000 actually. And it never worked well - the UI folks had
>>> to use a library not threads. The fs performance sucked
>>
>>Some things just cannot be passed by..... The Amiga HAS worked well and
>>DOES work well - - - FINALLY. (It took several years and a VERY serious
>>debugging effort with Bill Hawes and Bryce Nesbitt finding and quashing
>>all manner of bad or missing pointer checks and the like. They made the
>>OS itself a remarkable work of art.)
>Was that the same Bill Hawes who hung around L-K quashing bugs for a
>year or so (maybe 3-4 years ago?)
I believe it was. That is about where I lost track of him. I hope he is
doing well wherever he is. You folks here should have done almost anything
to keep him around.
{^_^}
^ permalink raw reply [flat|nested] 97+ messages in thread
* [OT] Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-22 15:31 ` Alan Cox
2002-06-22 12:24 ` Rob Landley
2002-06-22 21:09 ` jdow
@ 2002-06-23 21:40 ` Xavier Bestel
2 siblings, 0 replies; 97+ messages in thread
From: Xavier Bestel @ 2002-06-23 21:40 UTC (permalink / raw)
To: Alan Cox
Cc: Rob Landley, Jeff Garzik, Larry McVoy, Eric W. Biederman,
Linus Torvalds, Cort Dougan, Benjamin LaHaise, Rusty Russell,
Robert Love, Linux Kernel Mailing List
Le sam 22/06/2002 à 17:31, Alan Cox a écrit :
> > A microkernel design was actually made to work once, with good performance.
> > It was about fifteen years ago, in the amiga. Know how they pulled it off?
> > Commodore used a mutant ultra-cheap 68030 that had -NO- memory management
> > unit.
>
> Vanilla 68000 actually. And it never worked well - the UI folks had
> to use a library not threads. The fs performance sucked
<troll feeding>
IIRC all simple UI things were done in the "input task" context (the
task moving the mouse pointer, to simplify things) and heavier-duty work
had to be offloaded to the right task - using message passing of course.
This was not the intended design, which was to make Intuition a real
device (in the amiga sense, i.e. it could have its own task), but you
know, AmigaOS was a commercial proprietary OS with deadlines and a
complex history. That's why it had a really sucky fs, too (put your
floppy in the drive, type dir, drink a coffee while listening to your
disk being eaten, and watch the command output appear one line per second).
Xav
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
2002-06-22 19:26 ` Larry McVoy
2002-06-22 22:25 ` Eric W. Biederman
2002-06-23 6:34 ` William Lee Irwin III
@ 2002-06-23 22:56 ` Kai Henningsen
2 siblings, 0 replies; 97+ messages in thread
From: Kai Henningsen @ 2002-06-23 22:56 UTC (permalink / raw)
To: linux-kernel; +Cc: lm
lm@bitmover.com (Larry McVoy) wrote on 22.06.02 in <20020622122656.W23670@work.bitmover.com>:
> Just out of curiosity, have you actually ever worked on a fine grain
> threaded OS? One that scales to at least 32 processors? Solaris? IRIX?
> Others? It makes a difference, if you've been there, your perspective is
IIRC, you said that your proposed system should have one oslet per about 4
CPUs. And I see many people claiming that current Linux locking is aimed
at being good with about 4 CPUs.
Maybe I'm dense, but it seems to me that means current Linux locking is
aimed at exactly the spot where you argue it should be aimed *anyway*.
What am I not seeing?
MfG Kai
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-23 15:15 ` Sandy Harris
2002-06-23 17:29 ` Jakob Oestergaard
@ 2002-06-24 6:27 ` Craig I. Hagan
2002-06-24 13:06 ` J.A. Magallon
2002-06-24 10:59 ` Eric W. Biederman
2 siblings, 1 reply; 97+ messages in thread
From: Craig I. Hagan @ 2002-06-24 6:27 UTC (permalink / raw)
To: Sandy Harris; +Cc: Linux Kernel Mailing List
> Also, it isn't as clear that clustering experience applies. Are clusters
> that size built hierarchically? Is a 1024-CPU Beowulf practical, and if so
> do you build it as a Beowulf of 32 32-CPU Beowulfs? Is something analogous
> required in the OSlet approach? would it work?
A system of that size has many "practical" applications. It *can* be done
without partitioning it into a tree hierarchy; however, you will need a very
capable interconnect (Quadrics and Myrinet come to mind). Note that you'll have a
tiered switching hierarchy even if the nodes are presented in a flat layer.
IMHO nearly any level of breakout for grid computing (basically a cluster
hierarchy) starts to become interesting as a function of your app/problem size
and how many simultaneous jobs you are running.
Of course, we can stop and hit reality for a second: not many people can afford
a 1024 cpu cluster, hence the proliferation of smaller ones ;)
-- craig
.- ... . -.-. .-. . - -- . ... ... .- --. .
Craig I. Hagan "It's a small world, but I wouldn't want to back it up"
hagan(at)cih.com "True hackers don't die, their ttl expires"
"It takes a village to raise an idiot, but an idiot can raze a village"
Stop the spread of spam, use a sendmail condom!
http://www.cih.com/~hagan/smtpd-hacks
In Bandwidth we trust
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-23 15:15 ` Sandy Harris
2002-06-23 17:29 ` Jakob Oestergaard
2002-06-24 6:27 ` Craig I. Hagan
@ 2002-06-24 10:59 ` Eric W. Biederman
2 siblings, 0 replies; 97+ messages in thread
From: Eric W. Biederman @ 2002-06-24 10:59 UTC (permalink / raw)
To: Sandy Harris; +Cc: Linux Kernel Mailing List
Sandy Harris <pashley@storm.ca> writes:
> Larry McVoy wrote:
>
> > The interesting thing is to look at the ways you'd deal with a 1024 processors
>
> > and then work backwards to see how you scale it down to 1. There is NO WAY
> > to scale a fine grain threaded system which works on a 1024 system down to
> > a 1 CPU system, those are profoundly different.
> >
> > I think you could take the OS cluster idea and scale it up as well as down.
> > Scaling down is really important, Linux works well in the embedded space,
> > that is probably the greatest financial success story that Linux has, let's
> > not screw it up.
>
> Assuming we can get 4-way right, methinks Larry's ideas are likely to be a
> whole lot easier way to handle a 32 or 64-way box than trying to re-design
> the kernel sufficiently to do that well without destroying anything
> important in the 1<= nCPU <= 4 case. Especially so because 16 to 64-way
> clusters are common as dirt, and we can borrow tested tools. Anything that
> works on a 16-box Beowulf ought to adapt nicely to a 64-way box with 16
> of Larry's OSlets.
I wonder sometimes. With a 16-way cluster practically any tool will
work without giving you problems. I don't think many of the tools have
progressed beyond the "make it work" stage into polish yet.
> However, it is a lot harder to see that Larry's stuff is the right way
> to deal with a 1024-CPU system. At that point, you've got perhaps 256
> 4-way groups running OSlets. How does communication overhead scale, and
> do we have reason to suppose it is tolerable at 1024?
The rule is to communicate as little as possible. Because even if you
have a very low latency interconnect, with insane amounts of
bandwidth, it is needed for your application, not for cluster
management services.
> Also, it isn't as clear that clustering experience applies. Are clusters
> that size built hierarchically? Is a 1024-CPU Beowulf practical, and if so
> do you build it as a Beowulf of 32 32-CPU Beowulfs? Is something analogous
> required in the OSlet approach? would it work?
A cluster with 960 compute nodes (each 2-way) is being built for
Lawrence Livermore National Lab. http://www.llnl.gov/linux/mcr/.
The insane part is that the Lustre filesystem is going to be a 32-node
cluster in and of itself.
So there will be experience out there.
Eric
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: Linux, the microkernel (was Re: latest linus-2.5 BK broken)
2002-06-24 6:27 ` Craig I. Hagan
@ 2002-06-24 13:06 ` J.A. Magallon
0 siblings, 0 replies; 97+ messages in thread
From: J.A. Magallon @ 2002-06-24 13:06 UTC (permalink / raw)
To: Craig I. Hagan; +Cc: Sandy Harris, Linux Kernel Mailing List
On 2002.06.24 Craig I. Hagan wrote:
>> Also, it isn't as clear that clustering experience applies. Are clusters
>> that size built hierarchically? Is a 1024-CPU Beowulf practical, and if so
>> do you build it as a Beowulf of 32 32-CPU Beowulfs? Is something analogous
>> required in the OSlet approach? would it work?
>
>a system of that size has many "practical" applications. It *can* be done
>without partitioning it into a tree hierarchy, however, you will need a very
>capable interconnect (Quadrics and Myrinet come to mind). Note that you'll have a
>tiered switching hierarchy even if the nodes are presented in a flat layer.
>
>IMHO nearly any level of breakout for grid computing (basically a cluster
>hierarchy) starts to become interesting as a function of your app/problem size
>and how many simultaneous jobs you are running.
>
>Of course, we can stop and hit reality for a second: not many people can afford
>a 1024 cpu cluster, hence the proliferation of smaller ones ;)
>
You do not have to go so far. Take a simple cluster of dual Xeon boxes (i.e.,
4 'cpus' per box). Current clustering software (MPI, PVM) is not ready to
handle a 2-level hierarchy: one with slow communications over TCP and a lower
level working as a shared-memory, threadable cluster.
It would not be so strange nowadays (nor too expensive) to have 8-16
nodes with 4 cpus each.
--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:jamagallon@able.es \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.6mdk)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: latest linus-2.5 BK broken
@ 2002-06-24 21:28 Paul McKenney
0 siblings, 0 replies; 97+ messages in thread
From: Paul McKenney @ 2002-06-24 21:28 UTC (permalink / raw)
To: Larry McVoy; +Cc: linux-kernel
Hello, Larry,
Our SMP cluster discussion was quite a bit of fun, very challenging!
I still stand by my assessment:
> The Score.
>
> Paul agreed that SMP Clusters could be implemented. He was not
> sure that it could achieve good performance, but could not prove
> otherwise. Although he suspected that the complexity might be
> less than the proprietary highly parallel Unixes, he was not
> convinced that it would be less than Linux would be, given the
> Linux community's emphasis on simplicity in addition to performance.
See you at Ottawa!
Thanx, Paul
> Larry McVoy <lm@bitmover.com>
> Sent by: linux-kernel-owner@vger.kernel.org
> 06/19/2002 10:24 PM
>
> > I totally agree, mostly I was playing devil's advocate. The model
> > actually in my head is when you have multiple kernels but they talk
> > well enough that the applications have to care in areas where it
> > doesn't make a performance difference (There's got to be one of those).
>
> ....
>
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)
> > - Sub application level checkpointing.
> >
> > Services like schedulers already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful once it gets the ability to suspend jobs.
> > Checkpointing buys three things. The ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel like nfsd. Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead. DEC did it and Sun has been
> passing around these slides for a few weeks, so maybe they'll do it too.
> Then Linux can join the party after it has become a fine grained,
> locked to hell and back, soft "realtime", numa enabled, bloated piece
> of crap like all the other kernels and we'll get to go through the
> "let's reinvent Unix for the 3rd time in 40 years" all over again.
> What fun. Not.
>
> Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy
> to talk it over with anyone who wants to think about it. Paul McKenney
> from IBM came down to San Francisco to talk to me about it, put me
> through an 8 or 9 hour session which felt like a PhD exam, and
> after trying to poke holes in it grudgingly let on that maybe it was
> a good idea. He was kind enough to write up what he took away
> from it, here it is.
>
> --lm
>
> From: "Paul McKenney" <Paul.McKenney@us.ibm.com>
> To: lm@bitmover.com, tytso@mit.edu
> Subject: Greatly enjoyed our discussion yesterday!
> Date: Fri, 9 Nov 2001 18:48:56 -0800
>
> Hello!
>
> I greatly enjoyed our discussion yesterday! Here are the pieces of it that
> I recall; I know that you will not be shy about correcting any errors and
> omissions.
>
> Thanx, Paul
>
> Larry McVoy's SMP Clusters
>
> Discussion on November 8, 2001
>
> Larry McVoy, Ted Ts'o, and Paul McKenney
>
>
> What is SMP Clusters?
>
> SMP Clusters is a method of partitioning an SMP (symmetric
> multiprocessing) machine's CPUs, memory, and I/O devices
> so that multiple "OSlets" run on this machine. Each OSlet
> owns and controls its partition. A given partition is
> expected to contain from 4-8 CPUs, its share of memory,
> and its share of I/O devices. A machine large enough to
> have SMP Clusters profitably applied is expected to have
> enough of the standard I/O adapters (e.g., ethernet,
> SCSI, FC, etc.) so that each OSlet would have at least
> one of each.
>
> Each OSlet has the same data structures that an isolated
> OS would have for the same amount of resources. Unless
> interactions with the OSlets are required, an OSlet runs
> very nearly the same code over very nearly the same data
> as would a standalone OS.
>
> Although each OSlet is in most ways its own machine, the
> full set of OSlets appears as one OS to any user programs
> running on any of the OSlets. In particular, processes on
> one OSlet can share memory with processes on other OSlets,
> can send signals to processes on other OSlets, communicate
> via pipes and Unix-domain sockets with processes on other
> OSlets, and so on. Performance of operations spanning
> multiple OSlets may be somewhat slower than operations local
> to a single OSlet, but the difference will not be noticeable
> except to users who are engaged in careful performance
> analysis.
>
> The goals of the SMP Cluster approach are:
>
> 1. Allow the core kernel code to use simple locking designs.
> 2. Present applications with a single-system view.
> 3. Maintain good (linear!) scalability.
> 4. Not degrade the performance of a single CPU beyond that
> of a standalone OS running on the same resources.
> 5. Minimize modification of core kernel code. Modified or
> rewritten device drivers, filesystems, and
> architecture-specific code is permitted, perhaps even
> encouraged. ;-)
>
>
> OS Boot
>
> Early-boot code/firmware must partition the machine, and prepare
> tables for each OSlet that describe the resources that each
> OSlet owns. Each OSlet must be made aware of the existence of
> all the other OSlets, and will need some facility to allow
> efficient determination of which OSlet a given resource belongs
> to (for example, to determine which OSlet a given page is owned
> by).
>
> At some point in the boot sequence, each OSlet creates a "proxy
> task" for each of the other OSlets that provides shared services
> to them.
>
> Issues:
>
> 1. Some systems may require device probing to be done
> by a central program, possibly before the OSlets are
> spawned. Systems that react in an unfriendly manner
> to failed probes might be in this class.
>
> 2. Interrupts must be set up very carefully. On some
> systems, the interrupt system may constrain the ways
> in which the system is partitioned.
>
>
> Shared Operations
>
> This section describes some possible implementations and issues
> with a number of the shared operations.
>
> Shared operations include:
>
> 1. Page fault on memory owned by some other OSlet.
> 2. Manipulation of processes running on some other OSlet.
> 3. Access to devices owned by some other OSlet.
> 4. Reception of network packets intended for some other OSlet.
> 5. SysV msgq and sema operations on msgq and sema objects
> accessed by processes running on multiple OSlets.
> 6. Access to filesystems owned by some other OSlet. The
> /tmp directory gets special mention.
> 7. Pipes connecting processes in different OSlets.
> 8. Creation of processes that are to run on a different
> OSlet than their parent.
> 9. Processing of exit()/wait() pairs involving processes
> running on different OSlets.
>
> Page Fault
>
> As noted earlier, each OSlet maintains a proxy process
> for each other OSlet (so that for an SMP Cluster made
> up of N OSlets, there are N*(N-1) proxy processes).
>
> When a process in OSlet A wishes to map a file
> belonging to OSlet B, it makes a request to B's proxy
> process corresponding to OSlet A. The proxy process
> maps the desired file and takes a page fault at the
> desired address (translated as needed, since the file
> will usually not be mapped to the same location in the
> proxy and client processes), forcing the page into
> OSlet B's memory. The proxy process then passes the
> corresponding physical address back to the client
> process, which maps it.
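[Editorial aside: the proxy arrangement above can be sketched in userspace C. The struct, function names, and the faked "physical address" are illustrative assumptions, not kernel code.]

```c
#include <assert.h>

#define NR_OSLETS 16

/* Each OSlet keeps one proxy process for every other OSlet,
 * so an N-OSlet machine runs N*(N-1) proxies in total. */
static int total_proxies(int n_oslets)
{
        return n_oslets * (n_oslets - 1);
}

/* Hypothetical request a client in OSlet A sends to OSlet B's proxy.
 * The proxy maps the file, faults the page into B's memory, and hands
 * back a physical address for the client to map.  Here the "physical
 * address" is faked from the owner ID and the page-aligned offset. */
struct fault_request {
        int client_oslet;
        long long file_id;
        long long offset;
};

static long long proxy_fault(int owner_oslet, const struct fault_request *req)
{
        return ((long long)owner_oslet << 32) | (req->offset & ~0xfffLL);
}
```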
>
> Issues:
>
> o How to coordinate pageout? Two approaches:
>
> 1. Use mlock in the proxy process so that
> only the client process can do the pageout.
>
> 2. Make the two OSlets coordinate their
> pageouts. This is more complex, but will
> be required in some form or another to
> prevent OSlets from "ganging up" on one
> of their number, exhausting its memory.
>
> o When OSlet A ejects the memory from its working
> set, where does it put it?
>
> 1. Throw it away, and go to the proxy process
> as needed to get it back.
>
> 2. Augment core VM as needed to track the
> "guest" memory. This may be needed for
> performance, but...
>
> o Some code is required in the pagein() path to
> figure out that the proxy must be used.
>
> 1. Larry stated that he is willing to be
> punched in the nose to get this code in. ;-)
> The amount of this code is minimized by
> creating SMP-clusters-specific filesystems,
> which have their own functions for mapping
> and releasing pages. (Does this really
> cover OSlet A's paging out of this memory?)
>
> o How are pagein()s going to be even halfway fast
> if IPC to the proxy is involved?
>
> 1. Just do it. Page faults should not be
> all that frequent with today's memory
> sizes. (But then why do we care so
> much about page-fault performance???)
>
> 2. Use "doors" (from Sun), which are very
> similar to protected procedure call
> (from K42/Tornado/Hurricane). The idea
> is that the CPU in OSlet A that is handling
> the page fault temporarily -becomes- a
> member of OSlet B by using OSlet B's page
> tables for the duration. This results in
> some interesting issues:
>
> a. What happens if a process wants to
> block while "doored"? Does it
> switch back to being an OSlet A
> process?
>
> b. What happens if a process takes an
> interrupt (which corresponds to
> OSlet A) while doored (thus using
> OSlet B's page tables)?
>
> i. Prevent this by disabling
> interrupts while doored.
> This could pose problems
> with relatively long VM
> code paths.
>
> ii. Switch back to OSlet A's
> page tables upon interrupt,
> and switch back to OSlet B's
> page tables upon return
> from interrupt. On machines
> not supporting ASID, take a
> TLB-flush hit in both
> directions. Also likely
> requires common text (at
> least for low-level interrupts)
> for all OSlets, making it more
> difficult to support OSlets
> running different versions of
> the OS.
>
> Furthermore, the last time
> that Paul suggested adding
> instructions to the interrupt
> path, several people politely
> informed him that this would
> require a nose punching. ;-)
>
> c. If a bunch of OSlets simultaneously
> decide to invoke their proxies on
> a particular OSlet, that OSlet gets
> lock contention corresponding to
> the number of CPUs on the system
> rather than to the number in a
> single OSlet. Some approaches to
> handle this:
>
> i. Stripe -everything-, rely
> on entropy to save you.
> May still have problems with
> hotspots (e.g., which of the
> OSlets has the root of the
> root filesystem?).
>
> ii. Use some sort of queued lock
> to limit the number of CPUs that
> can be running proxy processes
> in a given OSlet. This does
> not really help scaling, but
> would make the contention
> less destructive to the
> victim OSlet.
>
> o How to balance memory usage across the OSlets?
>
> 1. Don't bother, let paging deal with it.
> Paul's previous experience with this
> philosophy was not encouraging. (You
> can end up with one OSlet thrashing
> due to the memory load placed on it by
> other OSlets, which don't see any
> memory pressure.)
>
> 2. Use some global memory-pressure scheme
> to even things out. Seems possible,
> Paul is concerned about the complexity
> of this approach. If this approach is
> taken, make sure someone with some
> control-theory experience is involved.
>
>
> Manipulation of Processes Running on Some Other OSlet.
>
> The general idea here is to implement something similar
> to a vproc layer. This is common code, and thus requires
> someone to sacrifice their nose. There was some discussion
> of other things that this would be useful for, but I have
> lost them.
>
> Manipulations discussed included signals and job control.
>
> Issues:
>
> o Should process information be replicated across
> the OSlets for performance reasons? If so, how
> much, and how to synchronize.
>
> 1. No, just use doors. See above discussion.
>
> 2. Yes. No discussion of synchronization
> methods. (Hey, we had to leave -something-
> for later!)
>
> Access to Devices Owned by Some Other OSlet
>
> Larry mentioned a /rdev, but if we discussed any details
> of this, I have lost them. Presumably, one would use some
> sort of IPC or doors to make this work.
>
> Reception of Network Packets Intended for Some Other OSlet.
>
> An OSlet receives a packet, and realizes that it is
> destined for a process running in some other OSlet.
> How is this handled without rewriting most of the
> networking stack?
>
> The general approach was to add a NAT-like layer that
> inspected the packet and determined which OSlet it was
> destined for. The packet was then forwarded to the
> correct OSlet, and subjected to full IP-stack processing.
>
> Issues:
>
> o If the address map in the kernel is not to be
> manipulated on each packet reception, there
> needs to be a circular buffer in each OSlet for
> each of the other OSlets (again, N*(N-1) buffers).
> In order to prevent the buffer from needing to
> be exceedingly large, packets must be bcopy()ed
> into this buffer by the OSlet that received
> the packet, and then bcopy()ed out by the OSlet
> containing the target process. This could add
> a fair amount of overhead.
>
> 1. Just accept the overhead. Rely on this
> being an uncommon case (see the next issue).
>
> 2. Come up with some other approach, possibly
> involving the user address space of the
> proxy process. We could not articulate
> such an approach, but it was late and we
> were tired.
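[Editorial aside: the per-pair circular buffer with its two bcopy()s can be sketched as below. Slot count, names, and the drop-on-full policy are illustrative assumptions; memcpy stands in for bcopy.]

```c
#include <assert.h>
#include <string.h>

#define RING_SLOTS 4
#define MTU 1500

/* One of the N*(N-1) per-pair rings: the OSlet that received the packet
 * copies it in, the OSlet owning the target process copies it out. */
struct pkt_ring {
        unsigned char slot[RING_SLOTS][MTU];
        size_t len[RING_SLOTS];
        unsigned head, tail;
};

static int ring_put(struct pkt_ring *r, const void *pkt, size_t len)
{
        if (r->head - r->tail == RING_SLOTS || len > MTU)
                return -1;              /* ring full: drop (or backpressure) */
        unsigned i = r->head++ % RING_SLOTS;
        memcpy(r->slot[i], pkt, len);   /* first copy, by receiving OSlet */
        r->len[i] = len;
        return 0;
}

static size_t ring_get(struct pkt_ring *r, void *pkt)
{
        if (r->tail == r->head)
                return 0;               /* empty */
        unsigned i = r->tail++ % RING_SLOTS;
        memcpy(pkt, r->slot[i], r->len[i]); /* second copy, by target OSlet */
        return r->len[i];
}
```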
>
> o If there are two processes that share the FD
> on which the packet could be received, and these
> two processes are in two different OSlets, and
> neither is in the OSlet that received the packet,
> what the heck do you do???
>
> 1. Prevent this from happening by refusing
> to allow processes holding a TCP connection
> open to move to another OSlet. This could
> result in load-balance problems in some
> workloads, though neither Paul nor Ted were
> able to come up with a good example on the
> spot (seeing as BAAN has not been doing really
> well of late).
>
> To indulge in l'esprit d'escalier... How
> about a timesharing system that users
> access from the network? A single user
> would have to log on twice to run a job
> that consumed more than one OSlet if each
> process in the job might legitimately need
> access to stdin.
>
> 2. Do all protocol processing on the OSlet
> on which the packet was received, and
> straighten things out when delivering
> the packet data to the receiving process.
> This likely requires changes to common
> code, hence someone to volunteer their nose.
>
>
> SysV msgq and sema Operations
>
> We didn't discuss these. None of us seem to be SysV fans,
> but these must be made to work regardless.
>
> Larry says that shm should be implemented in terms of mmap(),
> so that this case reduces to page-mapping discussed above.
> Of course, one would need a filesystem large enough to handle
> the largest possible shmget. Paul supposes that one could
> dynamically create a memory filesystem to avoid problems here,
> but is in no way volunteering his nose to this cause.
>
>
> Access to Filesystems Owned by Some Other OSlet.
>
> For the most part, this reduces to the mmap case. However,
> partitioning popular filesystems over the OSlets could be
> very helpful. Larry mentioned that this had been prototyped.
> Paul cannot remember if Larry promised to send papers or
> other documentation, but duly requests them after the fact.
>
> Larry suggests having a local /tmp, so that /tmp is in effect
> private to each OSlet. There would be a /gtmp that would
> be a globally visible /tmp equivalent. We went round and
> round on software compatibility, Paul suggesting a hashed
> filesystem as an alternative. Larry eventually pointed out
> that one could just issue different mount commands to get
> a global filesystem in /tmp, and create a per-OSlet /ltmp.
> This would allow people to determine their own level of
> risk/performance.
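[Editorial aside: the mount-choice idea might look like the following fstab fragment. The /ltmp and /gtmp paths and the NFS export are illustrative assumptions, not something the thread specifies.]

```
# Per-OSlet private /tmp (fast, local to this OSlet):
tmpfs          /tmp    tmpfs   defaults    0 0
# Globally visible equivalent, exported by one OSlet (names illustrative):
oslet0:/gtmp   /gtmp   nfs     defaults    0 0
```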
>
>
> Pipes Connecting Processes in Different OSlets.
>
> This was mentioned, but I have forgotten the details.
> My vague recollections lead me to believe that some
> nose-punching was required, but I must defer to Larry
> and Ted.
>
> Ditto for Unix-domain sockets.
>
>
> Creation of Processes on a Different OSlet Than Their Parent.
>
> There would be an inherited attribute that would prevent
> fork() or exec() from creating its child on a different
> OSlet. This attribute would be set by default to prevent
> too many surprises. Things like make(1) would clear
> this attribute to allow amazingly fast kernel builds.
>
> There would also be a system call that would cause the
> child to be placed on a specified OSlet (Paul suggested
> use of HP's "launch policy" concept to avoid adding yet
> another dimension to the exec() combinatorial explosion).
>
> The discussion of packet reception led Larry to suggest
> that cross-OSlet process creation would be prohibited if
> the parent and child shared a socket. See above for the
> load-balancing concern and corresponding l'esprit d'escalier.
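[Editorial aside: the inherited spawn-locality attribute can be modeled in a few lines of userspace C. The flag name and the spawn() helper are hypothetical, not a real Linux API.]

```c
#include <assert.h>

/* Hypothetical inherited "stay on this OSlet" flag: set by default to
 * avoid surprises; something like make(1) would clear it to fan a
 * build out across OSlets. */
#define TF_OSLET_LOCAL 0x1

struct task {
        int flags;
        int oslet;
};

/* Child inherits the parent's flags; it may land on another OSlet
 * only if the parent cleared TF_OSLET_LOCAL. */
static struct task spawn(const struct task *parent, int wanted_oslet)
{
        struct task child = *parent;
        if (!(parent->flags & TF_OSLET_LOCAL))
                child.oslet = wanted_oslet;
        return child;
}
```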
>
>
> Processing of exit()/wait() Pairs Crossing OSlet Boundaries
>
> We didn't discuss this. My guess is that vproc deals
> with it. Some care is required when optimizing for this.
> If one hands off to a remote parent that dies before
> doing a wait(), one would not want one of the init
> processes getting a nasty surprise.
>
> (Yes, there are separate init processes for each OSlet.
> We did not talk about implications of this, which might
> occur if one were to need to send a signal intended to
> be received by all the replicated processes.)
>
>
> Other Desiderata:
>
> 1. Ability of surviving OSlets to continue running after one of their
> number fails.
>
> Paul was quite skeptical of this. Larry suggested that the
> "door" mechanism could use a dynamic-linking strategy. Paul
> remained skeptical. ;-)
>
> 2. Ability to run different versions of the OS on different OSlets.
>
> Some discussion of this above.
>
>
> The Score.
>
> Paul agreed that SMP Clusters could be implemented. He was not
> sure that it could achieve good performance, but could not prove
> otherwise. Although he suspected that the complexity might be
> less than the proprietary highly parallel Unixes, he was not
> convinced that it would be less than Linux would be, given the
> Linux community's emphasis on simplicity in addition to performance.
>
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 97+ messages in thread
end of thread, other threads:[~2002-06-24 21:28 UTC | newest]
Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-06-18 17:18 latest linus-2.5 BK broken James Simmons
2002-06-18 17:46 ` Robert Love
2002-06-18 18:51 ` Rusty Russell
2002-06-18 18:43 ` Zwane Mwaikambo
2002-06-18 18:56 ` Linus Torvalds
2002-06-18 18:59 ` Robert Love
2002-06-18 20:05 ` Rusty Russell
2002-06-18 20:05 ` Linus Torvalds
2002-06-18 20:31 ` Rusty Russell
2002-06-18 20:41 ` Linus Torvalds
2002-06-18 21:12 ` Benjamin LaHaise
2002-06-18 21:08 ` Cort Dougan
2002-06-18 21:47 ` Linus Torvalds
2002-06-19 12:29 ` Eric W. Biederman
2002-06-19 17:27 ` Linus Torvalds
2002-06-20 3:57 ` Eric W. Biederman
2002-06-20 5:24 ` Larry McVoy
2002-06-20 7:26 ` Andreas Dilger
2002-06-20 14:54 ` Eric W. Biederman
2002-06-20 15:41 ` McVoy's Clusters (was Re: latest linus-2.5 BK broken) Sandy Harris
2002-06-20 17:10 ` William Lee Irwin III
2002-06-20 20:42 ` Timothy D. Witham
2002-06-21 5:16 ` Eric W. Biederman
2002-06-22 14:14 ` Kai Henningsen
2002-06-20 16:30 ` latest linus-2.5 BK broken Cort Dougan
2002-06-20 17:15 ` Linus Torvalds
2002-06-21 6:15 ` Eric W. Biederman
2002-06-21 17:50 ` Larry McVoy
2002-06-21 17:55 ` Robert Love
2002-06-21 18:09 ` Linux, the microkernel (was Re: latest linus-2.5 BK broken) Jeff Garzik
2002-06-21 18:46 ` Cort Dougan
2002-06-21 20:25 ` Daniel Phillips
2002-06-22 1:07 ` Horst von Brand
2002-06-22 1:23 ` Larry McVoy
2002-06-22 12:41 ` Roman Zippel
2002-06-23 15:15 ` Sandy Harris
2002-06-23 17:29 ` Jakob Oestergaard
2002-06-24 6:27 ` Craig I. Hagan
2002-06-24 13:06 ` J.A. Magallon
2002-06-24 10:59 ` Eric W. Biederman
2002-06-21 19:34 ` Rob Landley
2002-06-22 15:31 ` Alan Cox
2002-06-22 12:24 ` Rob Landley
2002-06-22 19:00 ` Ruth Ivimey-Cook
2002-06-22 21:09 ` jdow
2002-06-23 17:56 ` John Alvord
2002-06-23 20:48 ` jdow
2002-06-23 21:40 ` [OT] " Xavier Bestel
2002-06-22 18:25 ` latest linus-2.5 BK broken Eric W. Biederman
2002-06-22 19:26 ` Larry McVoy
2002-06-22 22:25 ` Eric W. Biederman
2002-06-22 23:10 ` Larry McVoy
2002-06-23 6:34 ` William Lee Irwin III
2002-06-23 22:56 ` Kai Henningsen
2002-06-20 17:16 ` RW Hawkins
2002-06-20 17:23 ` Cort Dougan
2002-06-20 20:40 ` Martin Dalecki
2002-06-20 20:53 ` Linus Torvalds
2002-06-20 21:27 ` Martin Dalecki
2002-06-20 21:37 ` Linus Torvalds
2002-06-20 21:59 ` Martin Dalecki
2002-06-20 22:18 ` Linus Torvalds
2002-06-20 22:41 ` Martin Dalecki
2002-06-21 0:09 ` Allen Campbell
2002-06-21 7:43 ` Zwane Mwaikambo
2002-06-21 21:02 ` Rob Landley
2002-06-22 3:57 ` (RFC)i386 arch autodetect( was Re: latest linus-2.5 BK broken ) Matthew D. Pitts
2002-06-22 4:54 ` William Lee Irwin III
2002-06-21 16:01 ` Re: latest linus-2.5 BK broken Sandy Harris
2002-06-21 20:38 ` Rob Landley
2002-06-20 21:13 ` Timothy D. Witham
2002-06-21 19:53 ` Rob Landley
2002-06-21 5:34 ` Eric W. Biederman
2002-06-19 10:21 ` Padraig Brady
2002-06-18 21:45 ` Bill Huey
2002-06-18 20:55 ` Robert Love
2002-06-19 13:31 ` Rusty Russell
2002-06-18 19:29 ` Benjamin LaHaise
2002-06-18 19:19 ` Zwane Mwaikambo
2002-06-18 19:49 ` Benjamin LaHaise
2002-06-18 19:27 ` Zwane Mwaikambo
2002-06-18 20:13 ` Rusty Russell
2002-06-18 20:21 ` Linus Torvalds
2002-06-18 22:03 ` Ingo Molnar
-- strict thread matches above, loose matches on Subject: below --
2002-06-18 23:38 Michael Hohnbaum
2002-06-18 23:57 ` Ingo Molnar
2002-06-19 0:08 ` Ingo Molnar
2002-06-19 1:00 ` Matthew Dobson
2002-06-19 23:48 ` Michael Hohnbaum
[not found] <E17KSLb-0007Dj-00@wagner.rustcorp.com.au>
2002-06-19 0:12 ` Linus Torvalds
2002-06-19 15:23 ` Rusty Russell
2002-06-19 16:28 ` Linus Torvalds
2002-06-19 20:57 ` Rusty Russell
2002-06-20 23:48 Miles Lane
2002-06-21 7:31 Martin Knoblauch
2002-06-21 12:59 Jesse Pollard
2002-06-24 21:28 Paul McKenney