public inbox for linux-kernel@vger.kernel.org
* Re: [Lse-tech] Re: CPU affinity & IPI latency
@ 2001-07-13 20:17 Shailabh Nagar
  2001-07-15 20:51 ` Davide Libenzi
  0 siblings, 1 reply; 11+ messages in thread
From: Shailabh Nagar @ 2001-07-13 20:17 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel, Larry McVoy


Davide,

> Global scheduling decisions should be triggered in response of load unbalancing
> and not at each schedule() run otherwise we're going to introduce a common lock
> that will limit the overall scalability.

That's correct, though it raises the question: who will define
"load imbalance", and at what granularity? In the load-balancing extensions
to MQ (http://lse.sourceforge.net/scheduling/LB/poolMQ.html), load balancing
is done at a frequency specified when the load-balancing module is
loaded. The parameter can be changed dynamically through a /proc interface,
so we are providing a knob for the user/sysadmin to control how often
load balancing runs.
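
To make the "who defines imbalance" question concrete, here is a minimal user-space sketch of a single balancing pass. It is not the poolMQ code: the function name `balance_once`, the `threshold` parameter, and the use of run-queue length as the load metric are all assumptions made for this example. A periodic timer would call such a pass at the configured frequency.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical balancing pass over per-CPU run-queue lengths.
 * If the gap between the busiest and the idlest CPU exceeds `threshold`,
 * move half of the gap from the busiest to the idlest CPU.
 * Returns the number of tasks moved (0 if the load counts as balanced).
 */
static int balance_once(int *nr_running, size_t ncpus, int threshold)
{
    size_t busiest = 0, idlest = 0;
    int diff, moved;

    assert(nr_running != NULL && ncpus > 0 && threshold >= 0);

    /* Find the longest and the shortest run queue. */
    for (size_t i = 1; i < ncpus; i++) {
        if (nr_running[i] > nr_running[busiest])
            busiest = i;
        if (nr_running[i] < nr_running[idlest])
            idlest = i;
    }

    diff = nr_running[busiest] - nr_running[idlest];
    if (diff <= threshold)
        return 0;               /* within tolerance: no global action */

    moved = diff / 2;
    nr_running[busiest] -= moved;
    nr_running[idlest] += moved;
    return moved;
}
```

With run-queue lengths {4, 0, 1, 1} and a threshold of 1, one pass moves two tasks from CPU 0 to CPU 1. A larger threshold or a lower call frequency means fewer cross-CPU operations at the cost of a less even load, which is the trade-off the knob exposes.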

> My idea about the future of the scheduler is to have a config options users can
> chose depending upon the machine use.
> By trying to keep a unique scheduler for both UP and MP is like going to give
> the same answer to different problems and the scaling factor ( of the scheduler
> itself ) on SMP will never improve.

That is true to an extent. It would be convenient for us as scheduler
rewriters to have neatly differentiated classes like UP, SMP, BIG_SMP, NUMA
etc. But it forces all other scheduler-sensitive code to think of each of
these cases separately and is exactly the reason why #ifdef's are
discouraged for critical kernel code like the scheduler.

It's certainly a challenge to provide SMP/NUMA scalability in the scheduler
(and elsewhere in the kernel) without having to resort to an #ifdef.

Shailabh




* Re: CPU affinity & IPI latency
@ 2001-07-13 19:17 Davide Libenzi
  2001-07-13 19:39 ` [Lse-tech] " Gerrit Huizenga
  0 siblings, 1 reply; 11+ messages in thread
From: Davide Libenzi @ 2001-07-13 19:17 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, Andi Kleen, linux-kernel, Larry McVoy


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
>> 
>> I personally think that a standard scheduler/cpu is the way to go for SMP.
>> I saw the one IBM guys did and I think that the wrong catch there is trying
>> always to grab the best task to run over all CPUs.
> 
> That was me/us.  Most of the reason for making 'global scheduling'
> decisions was an attempt to maintain the same behavior as the existing
> scheduler.  We are trying to see how well we can make this scheduler
> scale, while maintaining this global behavior.  Thought is that if
> there was ever any hope of someone adopting this scheduler, they would
> be more likely to do so if it attempted to maintain existing behavior.

The problem, IMHO, is that we're trying to extend what is correct behaviour on
the UP scheduler ( pick the best task to run ) to SMP machines.
Global scheduling decisions should be triggered in response to load imbalance
and not at each schedule() run, otherwise we're going to introduce a common lock
that will limit the overall scalability.
My idea about the future of the scheduler is to have a config option users can
choose depending upon the machine's use.
Trying to keep a single scheduler for both UP and MP is like giving
the same answer to different problems, and the scaling factor ( of the scheduler
itself ) on SMP will never improve.
The code inside kernel/sched.c should be reorganized ( it even contains
non-scheduler code ) so that the various CONFIG_SCHED* options will not introduce
any mess inside the code ( possibly by having the code in separate files
kernel/sched*.c ).




- Davide


* Re: CPU affinity & IPI latency
@ 2001-07-13 17:05 Mike Kravetz
  2001-07-13 19:51 ` David Lang
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2001-07-13 17:05 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> 
> Especially this benchmark.  It's certainly possible to get dramatically better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
> the benchmark is single threaded.  In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.  
> 
> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.
> 
> But making the system do that naively is a bad idea.

I agree, and can't imagine the system ever attempting to take this
into account and leave these 5 tasks on the same CPU.
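
For readers who want to reproduce the pinned case by hand, the following is a minimal sketch using `sched_setaffinity()`, an interface found in later Linux kernels and not in the kernels discussed in this thread. The helper names `pin_to_cpu` and `allowed_cpu_count` are hypothetical and exist only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Restrict the calling process to a single CPU.
 * Returns 0 on success, -1 on error (errno set by sched_setaffinity()).
 */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    assert(CPU_COUNT(&set) == 1);   /* exactly one CPU in the mask */

    /* pid 0 means "the calling process". */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling process may run on, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Calling `pin_to_cpu()` in each benchmark process before the token loop starts keeps all of them on one CPU. That is the explicit, user-requested form of the placement Larry describes, as opposed to the scheduler guessing it.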

At the other extreme is my observation that 2 tasks on an 8 CPU
system are 'round robined' among all 8 CPUs.  I think having the
2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
to CPU affinity.  Actually, the scheduler does 'try' to do this.

It is clear that the behavior of lat_ctx bypasses almost all of
the scheduler's attempts at CPU affinity.  The real question is,
"How often in running 'real workloads' are the scheduler's attempts
at CPU affinity bypassed?".

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

* CPU affinity & IPI latency
@ 2001-07-12 23:40 Mike Kravetz
  2001-07-15  7:42 ` Troy Benjegerdes
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2001-07-12 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, lse-tech

This discussion was started on 'lse-tech@lists.sourceforge.net'.
I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task
was being bounced back and forth between the 2 CPUs on his 2-way
system.  I had seen similar behavior when running the context
switching test of LMbench.  When running lat_ctx with only two
threads on an 8 CPU system, one would ?expect? the two threads
to be confined to two of the 8 CPUs in the system.  However, what
I have observed is that the threads are effectively 'round
robined' among all the CPUs and they all end up bearing
an equivalent amount of the CPU load.  To more easily observe
this, increase the number of 'TRIPS' in the benchmark to a really
large number.
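
One way to watch this distribution from inside a task is to sample the CPU it is running on. The sketch below assumes glibc's `sched_getcpu()`, which is newer than the systems discussed here; the function name `sample_cpus` and the yield-between-samples loop are illustrative choices, not part of LMbench.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Sample the current CPU `samples` times, yielding between samples,
 * and count hits per CPU in hist[0..ncpus-1]. The caller must zero
 * `hist` first. Returns the number of distinct CPUs observed.
 */
static int sample_cpus(long samples, long *hist, size_t ncpus)
{
    int distinct = 0;

    assert(hist != NULL);

    for (long i = 0; i < samples; i++) {
        int cpu = sched_getcpu();

        /* Ignore failures and CPUs outside the histogram. */
        if (cpu >= 0 && (size_t)cpu < ncpus) {
            if (hist[cpu]++ == 0)
                distinct++;
        }
        sched_yield();  /* give the scheduler a chance to move us */
    }
    return distinct;
}
```

A task that keeps its CPU affinity would report a histogram concentrated on one or two entries. The round-robin behavior described above would instead show counts spread across all CPUs.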

After a little investigation, I believe this 'situation' is caused
by the latency of the reschedule IPI used by the scheduler.  Recall
that in lat_ctx all threads are in a tight loop consisting of:

pipe_read()
pipe_write()
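
As a rough user-space picture of that loop, here is a minimal sketch with two processes passing a one-byte token over a pair of pipes. It is not the lat_ctx source: `token_ping_pong` is a hypothetical helper, and the real benchmark also handles timing, working-set size, and more than two processes.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Pass a one-byte token between a parent and a child `trips` times.
 * Each side blocks in read() until the other side writes, so every
 * trip forces a wakeup of the peer followed by a sleep of the caller.
 * Returns the number of round trips completed, or -1 on setup error.
 */
static long token_ping_pong(long trips)
{
    int to_child[2], to_parent[2];
    char token = 't';
    long done = 0;
    pid_t pid;

    assert(trips >= 0);

    if (pipe(to_child) < 0)
        return -1;
    if (pipe(to_parent) < 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: wait for the token, then write it straight back. */
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &token, 1) == 1) {
            if (write(to_parent[1], &token, 1) != 1)
                _exit(1);
        }
        _exit(0);   /* read() saw EOF: the parent closed its end */
    }

    /* Parent: send the token, then block until it comes back. */
    close(to_child[0]);
    close(to_parent[1]);
    for (long i = 0; i < trips; i++) {
        if (write(to_child[1], &token, 1) != 1)
            break;
        if (read(to_parent[0], &token, 1) != 1)
            break;
        done++;
    }

    close(to_child[1]);     /* EOF ends the child's loop */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return done;
}
```

The point of interest for the scheduler is that each write() makes the peer runnable just before the writer itself blocks in read().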

Both threads 'start' on the same CPU and are sitting in pipe_read
waiting for data.  A token is written to the pipe and one thread
is awakened.  The awakened thread then immediately writes the token
back to the pipe which ultimately results in a call to reschedule_idle()
that will 'initiate' the scheduling of the other thread.  In
reschedule_idle() we can not take the 'fast path' because WE are
currently executing on the other thread's preferred CPU.  Therefore,
reschedule_idle() chooses the oldest idle CPU and sends the IPI.
However, before the IPI is received (and schedule() run) on the
remote CPU, the currently running thread calls pipe_read which
blocks and calls schedule().  Since the other task has yet to be
scheduled on the other CPU, it is scheduled to run on the current
CPU.  Both tasks continue to execute on the one CPU until such time
that an IPI induced schedule() on the other CPU hits a window where
it finds one of the tasks to schedule.  We continue in this way,
migrating the tasks to the oldest idle CPU and eventually cycling our
way through all the CPUs.

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would
help alleviate this situation.  However, that is not the case.  With
idle=poll the CPU load is not as evenly distributed among the CPUs,
but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything?  Should one
expect tasks to stay on their preferred CPUs if possible?

Thoughts/comments?
-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

