* Re: Linux 2.4.17-pre5
2001-12-09 1:58 ` Rusty Russell
@ 2001-12-09 2:35 ` Davide Libenzi
2001-12-09 6:20 ` Rusty Russell
2001-12-09 16:24 ` Alan Cox
2001-12-09 16:16 ` Alan Cox
2001-12-09 19:38 ` Marcelo Tosatti
2 siblings, 2 replies; 46+ messages in thread
From: Davide Libenzi @ 2001-12-09 2:35 UTC (permalink / raw)
To: Rusty Russell; +Cc: Alan Cox, anton, davej, marcelo, lkml, Linus Torvalds
On Sun, 9 Dec 2001, Rusty Russell wrote:
> With HMT/hyperthread:
> Fifth process scheduled on 4 (shared with 0).
>
> When processes on 1, 2, or 3 schedule(), that processor sits
> idle, while processor 0/4 is doing double work (ie. only 2 in
> 5 chance that the right process will schedule() first).
>
> Finally, 0 or 4 will schedule() then wakeup, and be pulled
> onto another CPU (unless they are all busy again).
It's not easy to get this right anyway.
Using the scheduler I'm working on, with a trigger load level of 2,
as soon as the idle task is scheduled it goes to grab the task waiting
on the other CPU and makes it run.
But this is not always the right thing to do and, what's harder, it's
very problematic to tell when behaving that way is right and when it
is not.
Think about a task that has built up its cache image on that CPU and
is very likely to be woken up again very soon.
By picking up another task you're going to thrash its cache image.
What I'm thinking is to have the idle task, instead of sitting
permanently halt()ed waiting for an IPI, wake up at each timer tick to
check the overall balancing status.
Each time an imbalance is discovered a counter ( on the idle CPU ) is
increased and, when this counter rises above a certain value, the
move is actually performed.
This way we'll have a single value that makes the scheduler behave
differently depending on its setting.
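A minimal sketch of that tick-driven counter, with hypothetical names
and thresholds (nothing here is actual 2.4 kernel code):

```c
#include <assert.h>

/* Hypothetical sketch of the tick-driven idle balancing described
 * above.  Each idle CPU keeps an "unbalance" counter; the timer tick
 * bumps it while some other runqueue stays over the trigger length,
 * and only when the counter crosses a threshold is a task stolen. */
#define TRIGGER_RQ_LEN   2   /* runqueue length that counts as unbalanced */
#define STEAL_THRESHOLD  4   /* ticks of sustained imbalance before moving */

struct idle_state {
        int unbalance_ticks;     /* consecutive ticks imbalance was seen */
};

/* Called from the timer tick on an idle CPU.  Returns 1 when the
 * imbalance has persisted long enough to justify a migration. */
static int idle_tick_check(struct idle_state *st, int busiest_rq_len)
{
        if (busiest_rq_len >= TRIGGER_RQ_LEN) {
                if (++st->unbalance_ticks >= STEAL_THRESHOLD) {
                        st->unbalance_ticks = 0;
                        return 1;        /* steal a task now */
                }
        } else {
                st->unbalance_ticks = 0; /* balance restored, start over */
        }
        return 0;
}
```

Raising or lowering STEAL_THRESHOLD is the single knob that makes the
idle more eager or more reluctant to move tasks.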
- Davide
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 2:35 ` Davide Libenzi
@ 2001-12-09 6:20 ` Rusty Russell
2001-12-09 16:24 ` Alan Cox
1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2001-12-09 6:20 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Alan Cox, anton, davej, marcelo, lkml, Linus Torvalds
In message <Pine.LNX.4.40.0112081824210.1019-100000@blue1.dev.mcafeelabs.com> you write:
> It's not easy to get this right anyway.
Balancing the pull and push mechanisms in the scheduler while trying
to predict the future? "Not easy" is an excellent description.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 2:35 ` Davide Libenzi
2001-12-09 6:20 ` Rusty Russell
@ 2001-12-09 16:24 ` Alan Cox
2001-12-09 19:48 ` Davide Libenzi
` (2 more replies)
1 sibling, 3 replies; 46+ messages in thread
From: Alan Cox @ 2001-12-09 16:24 UTC (permalink / raw)
To: Davide Libenzi
Cc: Rusty Russell, Alan Cox, anton, davej, marcelo, lkml,
Linus Torvalds
> Using the scheduler i'm working on and setting a trigger load level of 2,
> as soon as the idle is scheduled it'll go to grab the task waiting on the
> other cpu and it'll make it running.
That rapidly gets you thrashing around as I suspect you've found.
I'm currently using the following rule in wake up
if (current->mm->runnable > 0)          /* One already running ? */
        cpu = current->mm->last_cpu;
else if ((cpu = idle_cpu()) == -1)      /* No idle CPU found ? */
        cpu = cpu_num[fast_fl1(runnable_set)];
That is:
- If we are already running threads with this mm on a CPU, throw them
  at the same core.
- If there is an idle CPU, use it.
- Otherwise take the mask of currently executing priority levels, find
  the last set bit (lowest priority) being executed, and look up a CPU
  running at that priority.
Then the idle stealing code will do the rest of the balancing, but at least
it converges towards each mm living on one cpu core.
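The three steps can be sketched like this (every identifier here is
hypothetical; the idle-CPU lookup and the priority table are
assumptions for illustration, not actual kernel interfaces):

```c
#include <assert.h>

#define NO_CPU (-1)

/* Index of the highest set bit, standing in for fast_fl1(): the last
 * set bit of the priority-level mask, i.e. the least urgent level
 * currently being executed. */
static int find_last_set(unsigned int mask)
{
        int bit = -1;
        while (mask) { mask >>= 1; bit++; }
        return bit;
}

/* Hedged sketch of the three-step wake-up rule described above. */
static int pick_wakeup_cpu(int mm_runnable, int mm_last_cpu,
                           int an_idle_cpu, unsigned int prio_levels,
                           const int *cpu_at_prio)
{
        if (mm_runnable > 0)            /* a thread of this mm already runs */
                return mm_last_cpu;     /* keep the mm on one core */
        if (an_idle_cpu != NO_CPU)      /* a fully idle CPU exists */
                return an_idle_cpu;
        /* otherwise preempt whichever CPU runs at the least urgent level */
        return cpu_at_prio[find_last_set(prio_levels)];
}
```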
Alan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 16:24 ` Alan Cox
@ 2001-12-09 19:48 ` Davide Libenzi
2001-12-09 22:44 ` Mike Kravetz
2001-12-19 22:16 ` Pavel Machek
2 siblings, 0 replies; 46+ messages in thread
From: Davide Libenzi @ 2001-12-09 19:48 UTC (permalink / raw)
To: Alan Cox; +Cc: Rusty Russell, anton, davej, marcelo, lkml, Linus Torvalds
On Sun, 9 Dec 2001, Alan Cox wrote:
> > Using the scheduler i'm working on and setting a trigger load level of 2,
> > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > other cpu and it'll make it running.
>
> That rapidly gets you thrashing around as I suspect you've found.
Not really, because I can make the same choices inside the idle code,
out of the fast path, without slowing the currently running CPU ( the
waker ).
> I'm currently using the following rule in wake up
>
> if (current->mm->runnable > 0)          /* One already running ? */
>         cpu = current->mm->last_cpu;
> else if ((cpu = idle_cpu()) == -1)      /* No idle CPU found ? */
>         cpu = cpu_num[fast_fl1(runnable_set)];
>
> that is
> If we are running threads with this mm on a cpu throw them at the
> same core
> If there is an idle CPU use it
> Take the mask of currently executing priority levels, find the last
> set bit (lowest pri) being executed, and look up a cpu running at
> that priority
>
> Then the idle stealing code will do the rest of the balancing, but at least
> it converges towards each mm living on one cpu core.
I've done a lot of experiments weighing the cost of moving tasks, with
the related TLB flushes and cache image thrashing, against the cost of
actually leaving a CPU idle for a given period of time.
For example, on a dual-CPU box the cost of leaving a CPU idle for more
than 40-50 ms is higher than immediately filling it with a stolen task
( trigger rq length == 2 ).
This picture should vary a lot on big SMP systems; that's why I'm
looking for a tunable solution where it's easy to adjust the
scheduler's behavior to the underlying architecture.
For example, by leaving balancing decisions inside the idle code we'll
have a bit more time to consider the different moving costs/metrics
that will be present, for example, in NUMA machines.
By expressing the cost of moving as CPU idle time we'll have pretty
good granularity, and we could say, for example, that the tolerable
cost of moving a task on a given architecture is 40 ms of idle time.
This means that if during 4 consecutive timer ticks ( on 100 HZ archs )
the idle CPU has found an "unbalanced" system, it's allowed to steal a
task to run.
Or better, it's allowed to steal a task from a CPU set that has a
"distance" <= 40 ms from its own set.
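A sketch of that metric under the stated assumptions (HZ == 100, so
10 ms per tick; the per-set "distance" value is a hypothetical table,
not an existing kernel structure):

```c
#include <assert.h>

/* Hypothetical sketch of the idle-time cost gate described above. */
#define HZ 100                       /* 100 Hz timer, 10 ms per tick */

static int ms_to_ticks(int ms)
{
        return ms * HZ / 1000;
}

/* Stealing is allowed only once the imbalance has lasted as many
 * ticks as the tolerable move cost, and only from a CPU set whose
 * "distance" (in ms) is within that same cost. */
static int may_steal(int idle_ticks, int move_cost_ms, int set_distance_ms)
{
        return idle_ticks >= ms_to_ticks(move_cost_ms) &&
               set_distance_ms <= move_cost_ms;
}
```

With a 40 ms tolerable cost this reduces to the "4 consecutive ticks"
rule on a 100 HZ architecture.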
- Davide
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 16:24 ` Alan Cox
2001-12-09 19:48 ` Davide Libenzi
@ 2001-12-09 22:44 ` Mike Kravetz
2001-12-09 23:50 ` Davide Libenzi
2001-12-09 23:57 ` Alan Cox
2001-12-19 22:16 ` Pavel Machek
2 siblings, 2 replies; 46+ messages in thread
From: Mike Kravetz @ 2001-12-09 22:44 UTC (permalink / raw)
To: Alan Cox
Cc: Davide Libenzi, Rusty Russell, anton, davej, marcelo, lkml,
Linus Torvalds
On Sun, Dec 09, 2001 at 04:24:59PM +0000, Alan Cox wrote:
> I'm currently using the following rule in wake up
>
> if (current->mm->runnable > 0)          /* One already running ? */
>         cpu = current->mm->last_cpu;
> else if ((cpu = idle_cpu()) == -1)      /* No idle CPU found ? */
>         cpu = cpu_num[fast_fl1(runnable_set)];
>
> that is
> If we are running threads with this mm on a cpu throw them at the
> same core
> If there is an idle CPU use it
> Take the mask of currently executing priority levels, find the last
> set bit (lowest pri) being executed, and look up a cpu running at
> that priority
>
> Then the idle stealing code will do the rest of the balancing, but at least
> it converges towards each mm living on one cpu core.
This implies that the idle loop will poll looking for work to do.
Is that correct? Davide's scheduler also does this. I believe
the current default idle loop (at least for i386) does as little
as possible and stops executing instructions. Comments in the code
mention power consumption. Should we be concerned with this?
--
Mike
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 22:44 ` Mike Kravetz
@ 2001-12-09 23:50 ` Davide Libenzi
2001-12-09 23:57 ` Alan Cox
1 sibling, 0 replies; 46+ messages in thread
From: Davide Libenzi @ 2001-12-09 23:50 UTC (permalink / raw)
To: Mike Kravetz
Cc: Alan Cox, Rusty Russell, anton, davej, marcelo, lkml,
Linus Torvalds
On Sun, 9 Dec 2001, Mike Kravetz wrote:
> On Sun, Dec 09, 2001 at 04:24:59PM +0000, Alan Cox wrote:
> > I'm currently using the following rule in wake up
> >
> > if (current->mm->runnable > 0)          /* One already running ? */
> >         cpu = current->mm->last_cpu;
> > else if ((cpu = idle_cpu()) == -1)      /* No idle CPU found ? */
> >         cpu = cpu_num[fast_fl1(runnable_set)];
> >
> > that is
> > If we are running threads with this mm on a cpu throw them at the
> > same core
> > If there is an idle CPU use it
> > Take the mask of currently executing priority levels, find the last
> > set bit (lowest pri) being executed, and look up a cpu running at
> > that priority
> >
> > Then the idle stealing code will do the rest of the balancing, but at least
> > it converges towards each mm living on one cpu core.
>
> This implies that the idle loop will poll looking for work to do.
> Is that correct? Davide's scheduler also does this. I believe
> the current default idle loop (at least for i386) does as little
> as possible and stops executing instructions. Comments in the code
> mention power consumption. Should we be concerned with this?
My idea is not to poll ( due to energy issues ) but to wake up the
idle tasks ( kernel/timer.c ) at every timer tick to let them monitor
the overall balancing status.
- Davide
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 22:44 ` Mike Kravetz
2001-12-09 23:50 ` Davide Libenzi
@ 2001-12-09 23:57 ` Alan Cox
1 sibling, 0 replies; 46+ messages in thread
From: Alan Cox @ 2001-12-09 23:57 UTC (permalink / raw)
To: Mike Kravetz
Cc: Alan Cox, Davide Libenzi, Rusty Russell, anton, davej, marcelo,
lkml, Linus Torvalds
> This implies that the idle loop will poll looking for work to do.
> Is that correct? Davide's scheduler also does this. I believe
> the current default idle loop (at least for i386) does as little
> as possible and stops executing instructions. Comments in the code
> mention power consumption. Should we be concerned with this?
You can poll or IPI. An IPI has the problem that IPIs are horribly
slow on the Pentium II/III. On the other hand the Athlon and PIV seem
to both have that bit sorted.
It's really an implementation detail whether you poll for work or
someone kicks you. Since we know what the other processors are doing
and who is idle, we know when we need to kick them.
Alan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 16:24 ` Alan Cox
2001-12-09 19:48 ` Davide Libenzi
2001-12-09 22:44 ` Mike Kravetz
@ 2001-12-19 22:16 ` Pavel Machek
2001-12-20 19:10 ` Davide Libenzi
2 siblings, 1 reply; 46+ messages in thread
From: Pavel Machek @ 2001-12-19 22:16 UTC (permalink / raw)
To: Alan Cox
Cc: Davide Libenzi, Rusty Russell, anton, davej, marcelo, lkml,
Linus Torvalds
Hi!
> > Using the scheduler i'm working on and setting a trigger load level of 2,
> > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > other cpu and it'll make it running.
>
> That rapidly gets you thrashing around as I suspect you've found.
>
> I'm currently using the following rule in wake up
>
> if(current->mm->runnable > 0) /* One already running ? */
> cpu = current->mm->last_cpu;
Is this really a win?
I mean, if I have two tasks that can run from L2 cache, I want them on
different physical CPUs even if they share current->mm, no?
Pavel
--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-19 22:16 ` Pavel Machek
@ 2001-12-20 19:10 ` Davide Libenzi
0 siblings, 0 replies; 46+ messages in thread
From: Davide Libenzi @ 2001-12-20 19:10 UTC (permalink / raw)
To: Pavel Machek
Cc: Alan Cox, Rusty Russell, anton, davej, marcelo, lkml,
Linus Torvalds
On Wed, 19 Dec 2001, Pavel Machek wrote:
> Hi!
>
> > > Using the scheduler i'm working on and setting a trigger load level of 2,
> > > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > > other cpu and it'll make it running.
> >
> > That rapidly gets you thrashing around as I suspect you've found.
> >
> > I'm currently using the following rule in wake up
> >
> > if(current->mm->runnable > 0) /* One already running ? */
> > cpu = current->mm->last_cpu;
>
> Is this really a win?
>
> I mean, if I have two tasks that can run from L2 cache, I want them on
> different physical CPUs even if they share current->mm, no?
It depends. If you've got two CPUs and these two tasks are the only
ones running, then yes, running them on separate CPUs is OK because
the parallelism you gain pays you back for the cache issues.
And this is the automatic behavior that you'll get with sane
schedulers.
But as a general rule, matching mms should lead to an attempt to run
them on the same CPU ( give it preference, don't force it ).
- Davide
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 1:58 ` Rusty Russell
2001-12-09 2:35 ` Davide Libenzi
@ 2001-12-09 16:16 ` Alan Cox
2001-12-10 0:21 ` Rusty Russell
2001-12-09 19:38 ` Marcelo Tosatti
2 siblings, 1 reply; 46+ messages in thread
From: Alan Cox @ 2001-12-09 16:16 UTC (permalink / raw)
To: Rusty Russell; +Cc: Alan Cox, anton, davej, marcelo, linux-kernel, torvalds
> > I trust Intels own labs over you on this one.
> This is voodoo optimization. I don't care WHO did it.
Why don't you spend some time making the PPC64 port actually follow
basic things like the coding standard? It's not voodoo optimisation,
it's benchmarked work from Intel.
> Given another chip with similar technology (eg. PPC's Hardware Multi
> Threading) and the same patch, dbench runs 1 - 10 on 4-way makes NO
> POSITIVE DIFFERENCE.
Well, let me guess: perhaps the PPC hardware MT is different. Real
numbers have been done. Getting uppity because we have HT code in that
happens to clash with your work isn't helpful. The fact that the IBM
PPC64 port is 9 months behind in this area doesn't mean the rest of us
can wait. When the PPC64 port is usable, mergeable and resembles
actual Linux code then this can be looked at again for 2.4.
Perhaps you'd like to submit your PPC64 HT patches to the list today
so that they can be tried comparatively against the Intel HT code, and
we can see if yours is a better generic solution?
For 2.5 the scheduler needs a rewrite anyway, so it's a non-issue
there.
Alan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 16:16 ` Alan Cox
@ 2001-12-10 0:21 ` Rusty Russell
2001-12-10 0:41 ` Alan Cox
0 siblings, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2001-12-10 0:21 UTC (permalink / raw)
To: Alan Cox; +Cc: anton, davej, marcelo, linux-kernel, torvalds
In message <E16D6cn-00071w-00@the-village.bc.nu> you write:
> Its not voodoo optimisation, its benchmarked work from Intel.
At the very least, please pass this paraphrase on to the Intel people.
I asserted:
If you number each CPU so its two IDs are smp_num_cpus()/2
apart, you will NOT need to put some crappy hack in the
scheduler to pack your CPUs correctly.
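The numbering asserted above can be sketched as follows (sibling_id is
an illustrative helper, not a kernel function): IDs 0..n/2-1 are
distinct physical cores, IDs n/2..n-1 their siblings, so filling from
the bottom packs physical cores before doubling up.

```c
#include <assert.h>

/* Illustrative only: map a logical CPU ID to its hyperthread sibling
 * under the "two IDs smp_num_cpus()/2 apart" numbering. */
static int sibling_id(int cpu, int smp_num_cpus)
{
        return (cpu + smp_num_cpus / 2) % smp_num_cpus;
}
```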
> Perhaps you'd like to submit your PPC64 HT patches to the list today
> so that they can be tried comparitively on the Intel HT and we can see if
> its a better generic solution ?
I apologize: clearly my previous post was far too long, as you
obviously did not read it. There is no sched.c patch.
> For 2.5 the scheduler needs a rewrite anyway so its a non issue there.
Disagree. Without widespread understanding of how the simple
scheduler works, writing a more complex one is doomed.
The Intel people, whom you assure me "know what their chip needs"
obviously have trouble understanding the subtleties of the current
scheduler. What hope the rest of us?
Rusty.
PS. Alan, go back and READ my analysis, or this will be a VERY
long thread.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 0:21 ` Rusty Russell
@ 2001-12-10 0:41 ` Alan Cox
2001-12-10 2:10 ` Martin J. Bligh
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Alan Cox @ 2001-12-10 0:41 UTC (permalink / raw)
To: Rusty Russell; +Cc: Alan Cox, anton, davej, marcelo, linux-kernel, torvalds
> If you number each CPU so its two IDs are smp_num_cpus()/2
> apart, you will NOT need to put some crappy hack in the
> scheduler to pack your CPUs correctly.
Which is a major change to the x86 tree and an invasive one. Right now
the x86 is doing a 1:1 mapping, and I can offer Marcelo no proof that
somewhere buried in the x86 arch code there isn't something that
assumes this or wrongly mixes up a logical and a physical CPU ID.
At best you are exploiting an obscure quirk of the current scheduler
that is quite likely to break the moment someone factors power
management into the idling equation (turning CPUs off and on is more
expensive, so if you idle a CPU you want to keep it as the idle one
for PM). Congratulations on your zen-like mastery of the scheduler
algorithm. Now tell me that property won't change.
> > For 2.5 the scheduler needs a rewrite anyway so its a non issue there.
>
> Disagree. Without widespread understanding of how the simple
> scheduler works, writing a more complex one is doomed.
The simple scheduler doesn't work. I've run about 20 schedulers on
playing cards, and at the point where you are shuffling things around
and it's clear what is happening, it's actually hard not to start
laughing at the current scheduler once you hit a serious load or a
serious number of processors.
It's a great scheduler for a single or dual processor 486/Pentium type
box running a home environment. It gets a bit flaky by the time it's
running Oracle on a 4-way, and very flaky by the time it's running
Lotus back ends on an 8-way. It doesn't take lunacy like Java, broken
JVM implementations and VolanoMark to make it go astray.
Alan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 0:41 ` Alan Cox
@ 2001-12-10 2:10 ` Martin J. Bligh
2001-12-10 5:40 ` Rusty Russell
2001-12-10 5:31 ` Rusty Russell
2001-12-11 9:00 ` Eric W. Biederman
2 siblings, 1 reply; 46+ messages in thread
From: Martin J. Bligh @ 2001-12-10 2:10 UTC (permalink / raw)
To: Alan Cox, Rusty Russell; +Cc: anton, davej, marcelo, linux-kernel, torvalds
>> If you number each CPU so its two IDs are smp_num_cpus()/2
>> apart, you will NOT need to put some crappy hack in the
>> scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.
I don't think it matters whether you do a 1:1 map or not, since the
NUMA-Q boxes work fine without assuming this (I don't use physical
APIC IDs at all, except from the I/O APIC to broadcast), and I don't
think anyone else does either after we bootstrap.
It shouldn't be all that hard to check. Mentally mark every place we
look at the physical APIC ID, and which variables are set from it and
thus "tainted". I did this once - I don't think it's very many at all.
I don't think changing the order in which we look at
phys_cpu_present_map should make much of a difference.
On the other hand, relying on the "arbitrary" CPU numbers either way
doesn't seem like the best of ideas ;-)
Martin.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 2:10 ` Martin J. Bligh
@ 2001-12-10 5:40 ` Rusty Russell
0 siblings, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2001-12-10 5:40 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Alan Cox, anton, davej, marcelo, linux-kernel, torvalds
In message <2899076331.1007921423@[10.10.1.2]> you write:
> On the other hand, relying on the "arbitrary" cpu numbers either way doesn't
> seem like the best of ideas ;-)
Well, putting a comment in reschedule_idle() saying "we fill from the
bottom, and some ports rely on this" might be nice 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 0:41 ` Alan Cox
2001-12-10 2:10 ` Martin J. Bligh
@ 2001-12-10 5:31 ` Rusty Russell
2001-12-10 8:28 ` Alan Cox
2001-12-11 9:00 ` Eric W. Biederman
2 siblings, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2001-12-10 5:31 UTC (permalink / raw)
To: Alan Cox; +Cc: anton, davej, marcelo, linux-kernel, torvalds
In message <E16DEVr-0008SW-00@the-village.bc.nu> you write:
> > If you number each CPU so its two IDs are smp_num_cpus()/2
> > apart, you will NOT need to put some crappy hack in the
> > scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.
Agreed, but does the current x86 code map them like this or not?
If it does, I'm curious as to why they saw a problem which this fixed.
I've been playing with this on and off for months, and trying to
understand what is happening. I posted my findings, and I'd really
like to get some feedback from others doing the same thing.
BUT I CAN'T DO THAT WHEN THERE'S NO DISCUSSION ABOUT PATCHES FROM
ANONYMOUS SOURCES WHICH GET MERGED! FUCK ARGHH FUCK FUCK FUCK.
(BTW, "I trust Intel engineers" is NOT discussion).
> Congratulations on your zen like mastery of the scheduler algorithm.
8) I just try to understand what I've seen on real hardware. It leads
me to believe that HMT cannot be a win in # processes = # CPUs + 1
situations on a non-preemptible scheduler.
> The simple scheduler doesn't work. I've run about 20 schedulers on playing
> cards, and at the point you are shuffling things around and its clear what
> is happening its actually hard not to start laughing at the current
> scheduler once you hit a serious load or serious amounts of processors.
Ack. Even in the limited case of trying to get HMT to work reasonably
in simple cases, scheduling changes are not simply "change the
goodness() function and it alters behavior". 8(
BTW: Alchemy, Voodoo, Zen and Cards. Maybe you should start hacking
on something more deterministic? 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 5:31 ` Rusty Russell
@ 2001-12-10 8:28 ` Alan Cox
2001-12-10 23:12 ` James Cleverdon
0 siblings, 1 reply; 46+ messages in thread
From: Alan Cox @ 2001-12-10 8:28 UTC (permalink / raw)
To: Rusty Russell; +Cc: Alan Cox, anton, davej, marcelo, linux-kernel, torvalds
> Agreed, but does the current x86 code map them like this or not?
> If it does, I'm curious as to why they saw a problem which this fixed.
The current x86 code maps the logical CPUs the same as the physical
ones. In other words, it's however they come off the mainboard, which
for HT seems to be with each HT pair as (n, n+1).
> understand what is happening. I posted my findings, and I'd really
> like to get some feedback from others doing the same thing.
I never saw your stuff.
> BUT I CAN'T DO THAT WHEN THERE'S NO DISCUSSION ABOUT PATCHES FROM
> ANONYMOUS SOURCES WHICH GET MERGED! FUCK ARGHH FUCK FUCK FUCK.
A mailing list doesn't scale to that. I do have a cunning-plan (tm),
but that requires some work, and while it's doable for 2.2 or 2.4, I
know that making Linus do or change any tiny bit of his behaviour
isn't going to happen, which rather limits the options.
Think about
mail patch to linus-patches@...
linus-patches@ is a script that does
find the diff
find the paths in the diff
regexp them against the list of notifications
email each matching notification a copy
with the regexps including
* torvalds@transmeta.com
so it's like mailing Linus, but anyone who cares can web add/remove
themselves from the cc list, and it's invisible to Linus too.
> BTW: Alchemy, Voodoo, Zen and Cards. Maybe you should start hacking
> on something more deterministic? 8)
Well, actually Alchemy is MIPS and has nothing to do with me. Trying
to turn a Voodoo card into a video4linux overlay is, Zen is
ftp.linux.org.uk, and I hack card drivers. So 3 out of 4.
Alan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 8:28 ` Alan Cox
@ 2001-12-10 23:12 ` James Cleverdon
2001-12-10 23:30 ` Alan Cox
0 siblings, 1 reply; 46+ messages in thread
From: James Cleverdon @ 2001-12-10 23:12 UTC (permalink / raw)
To: linux-kernel
On Monday 10 December 2001 12:28 am, Alan Cox wrote:
> > Agreed, but does the current x86 code map them like this or not?
> > If it does, I'm curious as to why they saw a problem which this fixed.
>
> The current x86 code maps the logical cpus as with the physical ones. In
> other words its how they come off the mainboard. Which for HT seems to
> be with each HT as (n, n+1)
Yes. Intel has defined the LSB of the physical APIC ID to be the
"hyperthreading" bit. Even-numbered IDs are the real CPUs; odd IDs are
the virtual CPUs. (Or, as wli calls them, Schwarzenegger and DeVito. ;^)
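Under that numbering the relationship is trivial to express
(illustrative helpers, not kernel code):

```c
#include <assert.h>

/* Intel's scheme as described above: the low bit of the physical
 * APIC ID distinguishes the virtual sibling, so siblings come in
 * (n, n+1) pairs sharing one physical core. */
static int is_virtual_cpu(int apicid)
{
        return apicid & 1;     /* odd IDs are the virtual CPUs */
}

static int physical_core(int apicid)
{
        return apicid >> 1;    /* both siblings map to one core */
}
```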
This may complicate Rusty's zen scheduler scheme. It certainly has made life
complicated for the BIOS folks. They had to sort all the real CPUs to the
front of the ACPI table, lest those folks so benighted as to run the crippled
version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
eight.
Anyway, with Intel's new numbering scheme you only get two real CPUs per
logical cluster of 4, which is kind of a pain....
> > understand what is happening. I posted my findings, and I'd really
> > like to get some feedback from others doing the same thing.
[ Snip! ]
>
> Alan
--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
jamesclv@us.ibm.com | cleverdj@us.ibm.com
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 23:12 ` James Cleverdon
@ 2001-12-10 23:30 ` Alan Cox
2001-12-11 9:16 ` Robert Varga
0 siblings, 1 reply; 46+ messages in thread
From: Alan Cox @ 2001-12-10 23:30 UTC (permalink / raw)
To: jamesclv; +Cc: linux-kernel
> This may complicate Rusty's zen scheduler scheme. It certainly has made life
> complicated for the BIOS folks. They had to sort all the real CPUs to the
> front of the ACPI table, lest those folks so benighted as to run the crippled
> version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
> eight.
Rotfl, oh that is beautiful
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 23:30 ` Alan Cox
@ 2001-12-11 9:16 ` Robert Varga
2001-12-11 9:23 ` David Weinehall
0 siblings, 1 reply; 46+ messages in thread
From: Robert Varga @ 2001-12-11 9:16 UTC (permalink / raw)
To: Alan Cox; +Cc: jamesclv, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 791 bytes --]
On Mon, Dec 10, 2001 at 11:30:24PM +0000, Alan Cox wrote:
> > This may complicate Rusty's zen scheduler scheme. It certainly has made life
> > complicated for the BIOS folks. They had to sort all the real CPUs to the
> > front of the ACPI table, lest those folks so benighted as to run the crippled
> > version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
> > eight.
>
> Rotfl, oh that is beautiful
As it happens, a guy from Microsoft sitting next to me as I read this
claims the Datacenter version of W2K has no limitation on the number
of processors.
--
Kind regards,
Robert Varga
------------------------------------------------------------------------------
n@hq.sk http://hq.sk/~nite/gpgkey.txt
[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-11 9:16 ` Robert Varga
@ 2001-12-11 9:23 ` David Weinehall
0 siblings, 0 replies; 46+ messages in thread
From: David Weinehall @ 2001-12-11 9:23 UTC (permalink / raw)
To: Robert Varga; +Cc: Alan Cox, jamesclv, linux-kernel
On Tue, Dec 11, 2001 at 10:16:41AM +0100, Robert Varga wrote:
> On Mon, Dec 10, 2001 at 11:30:24PM +0000, Alan Cox wrote:
> > > This may complicate Rusty's zen scheduler scheme. It certainly
> > > has made life complicated for the BIOS folks. They had to sort
> > > all the real CPUs to the front of the ACPI table, lest those folks
> > > so benighted as to run the crippled version of Win2K (which only
> > > on-lines 8 CPUs) only get four real CPUs out of eight.
> >
> > Rotfl, oh that is beautiful
>
> As it happens a guy from microsoft sitting next to me as I read this
> claims the DataCenter version of W2K has no limitation on number of
> processors.
Well, Datacenter is the non-crippled version. You get to pay a lot for
getting a non-crippled version, though. Soooo lame.
/David
_ _
// David Weinehall <tao@acc.umu.se> /> Northern lights wander \\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-10 0:41 ` Alan Cox
2001-12-10 2:10 ` Martin J. Bligh
2001-12-10 5:31 ` Rusty Russell
@ 2001-12-11 9:00 ` Eric W. Biederman
2001-12-11 23:14 ` Alan Cox
2 siblings, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2001-12-11 9:00 UTC (permalink / raw)
To: Alan Cox; +Cc: Rusty Russell, anton, davej, marcelo, linux-kernel, torvalds
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> > If you number each CPU so its two IDs are smp_num_cpus()/2
> > apart, you will NOT need to put some crappy hack in the
> > scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.
Actually we don't do a 1:1 physical-to-logical mapping. I currently
have a board that has physical IDs of 0:6 and logical IDs of 0:1,
with no changes to the current x86 code.
>
> At best you are exploiting an obscure quirk of the current scheduler that is
> quite likely to break the moment someone factors power management into the
> idling equation (turning cpus off and on is more expensive so if you idle
> a cpu you want to keep it the idle one for PM). Congratulations on your
> zen like mastery of the scheduler algorithm. Now tell me it wont change in
> that property.
The idea of a CPU priority for filling sounds like a nice one, even
if we don't use the CPU ID bits for it.
Eric
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-11 9:00 ` Eric W. Biederman
@ 2001-12-11 23:14 ` Alan Cox
0 siblings, 0 replies; 46+ messages in thread
From: Alan Cox @ 2001-12-11 23:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Rusty Russell, anton, davej, marcelo, linux-kernel,
torvalds
> Actually we don't do a 1:1 physical to logical mapping. I currently
> have a board that has physical id's of: 0:6 and logical id's of 0:1
> with no changes to the current x86 code.
I mistook the physical mapping for the APIC one. My fault:
/*
 * On x86 all CPUs are mapped 1:1 to the APIC space.
 * This simplifies scheduling and IPI sending and
 * compresses data structures.
 */
static inline int cpu_logical_map(int cpu)
{
        return cpu;
}

static inline int cpu_number_map(int cpu)
{
        return cpu;
}
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Linux 2.4.17-pre5
2001-12-09 1:58 ` Rusty Russell
2001-12-09 2:35 ` Davide Libenzi
2001-12-09 16:16 ` Alan Cox
@ 2001-12-09 19:38 ` Marcelo Tosatti
2 siblings, 0 replies; 46+ messages in thread
From: Marcelo Tosatti @ 2001-12-09 19:38 UTC (permalink / raw)
To: Rusty Russell; +Cc: Alan Cox, anton, davej, linux-kernel, torvalds
On Sun, 9 Dec 2001, Rusty Russell wrote:
> In message <E16Crs9-0003Gc-00@the-village.bc.nu> you write:
> > > The sched.c change is also useless (ie. only harmful). Anton and I looked at
> > > adapting the scheduler for hyperthreading, but it looks like the recent
> > > changes have had the side effect of making hyperthreading + the current
> >
> > I trust Intels own labs over you on this one.
>
> This is voodoo optimization. I don't care WHO did it.
>
> Marcelo, drop the patch. Please delay scheduler hacks until they can
> be verified to actually do something.
>
> Given another chip with similar technology (eg. PPC's Hardware Multi
> Threading) and the same patch, dbench runs 1 - 10 on 4-way makes NO
> POSITIVE DIFFERENCE.
>
> http://samba.org/~anton/linux/HMT/
>
> > I suspect they know what their chip needs.
>
> I find your faith in J. Random Intel Engineer fascinating.
>
> ================
>
> The current scheduler actually works quite well if you number your
> CPUs right, and to fix the corner cases takes more than this change.
> First some simple terminology: let's assume we have two "sides" to
> each CPU (ie. each CPU has two IDs, smp_num_cpus()/2 apart):
>
> 0 1 2 3
> 4 5 6 7
>
> The current scheduler code reschedule_idle()s (pushes) from 0 to 3
> first anyway, so if we're less than 50% utilized it tends to "just
> work". Note that it doesn't stop the schedule() (pulls) on 4 - 7 from
> grabbing a process to run even if there is a fully idle CPU, so it's
> far from perfect.
>
> Now let's look at the performance-problematic case: dbench 5.
>
> Without HMT/hyperthread:
> Fifth process not scheduled at all.
>
> When any of the first four processes schedule(), the fifth
> process is pulled onto that processor.
>
> With HMT/hyperthread:
> Fifth process scheduled on 4 (shared with 0).
>
> When processes on 1, 2, or 3 schedule(), that processor sits
> idle, while processor 0/4 is doing double work (ie. only 2 in
> 5 chance that the right process will schedule() first).
>
> Finally, 0 or 4 will schedule() then wakeup, and be pulled
> onto another CPU (unless they are all busy again).
>
> The result is that dbench 5 runs significantly SLOWER with
> hyperthreading than without. We really want to pull a process off a
> cpu it is running on, if we are completely idle and it is running on a
> double-used CPU. Note that dbench 6 is almost back to normal
> performance, since the probability of the right process scheduling
> first becomes 4 in 6.
>
> Now, the Intel hack changes reschedule_idle() to push onto the first
> completely idle CPU above all others. Nice idea: the only problem is
> finding a load where that actually happens, since we push onto low
> numbers first anyway. If we have an average of <= 4 running
> processes, they spread out nicely, and if we have an average of > 4
> then there are no fully idle processes and this hack is useless.
Rusty,
I've applied the Intel HT code because it is non-intrusive.
If you really want to see that change removed, please show me
(sensible) benchmark numbers where the code actually fucks up
performance.
Thanks
^ permalink raw reply [flat|nested] 46+ messages in thread