public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* CPU affinity & IPI latency
@ 2001-07-12 23:40 Mike Kravetz
  2001-07-13  0:22 ` Davide Libenzi
  2001-07-15  7:42 ` Troy Benjegerdes
  0 siblings, 2 replies; 30+ messages in thread
From: Mike Kravetz @ 2001-07-12 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, lse-tech

This discussion was started on 'lse-tech@lists.sourceforge.net'.
I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task
was being bounced back and forth between the 2 CPUs on his 2-way
system.  I had seen similar behavior when running the context
switching test of LMbench.  When running lat_ctx with only two
threads on an 8 CPU system, one would ?expect? the two threads
to be confined to two of the 8 CPUs in the system.  However, what
I have observed is that the threads are effectively 'round
robined' among all the CPUs and they all end up bearing
an equivalent amount of the CPU load.  To more easily observe
this, increase the number of 'TRIPS' in the benchmark to a really
large number.

After a little investigation, I believe this 'situation' is caused
by the latency of the reschedule IPI used by the scheduler.  Recall
that in lat_ctx all threads are in a tight loop consisting of:

pipe_read()
pipe_write()

Both threads 'start' on the same CPU and are sitting in pipe_read
waiting for data.  A token is written to the pipe and one thread
is awakened.  The awakened thread, then immediately writes the token
back to the pipe which ultimately results in a call to reschedule_idle()
that will 'initiate' the scheduling of the other thread.  In
reschedule_idle() we can not take the 'fast path' because WE are
currently executing on the other thread's preferred CPU.  Therefore,
reschedule_idle() chooses the oldest idle CPU and sends the IPI.
However, before the IPI is received (and schedule() run) on the
remote CPU, the currently running thread calls pipe_read which
blocks and calls schedule().  Since the other task has yet to be
scheduled on the other CPU, it is scheduled to run on the current
CPU.  Both tasks continue to execute on the one CPU until such time
that an IPI induced schedule() on the other CPU hits a window where
it finds one of the tasks to schedule.  We continue in this way,
migrating the tasks to the oldest idle CPU and eventually cycling our
way through all the CPUs.

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would
help alleviate this situation.  However, that is not the case.  With
idle=poll the CPU load is not as evenly distributed among the CPUs,
but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything?  Should one
expect tasks to stay their preferred CPUs if possible?

Thoughts/comments
-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: CPU affinity & IPI latency
@ 2001-07-14  3:25 Hubertus Franke
  2001-07-16 16:14 ` Mike Kravetz
  0 siblings, 1 reply; 30+ messages in thread
From: Hubertus Franke @ 2001-07-14  3:25 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, linux-kernel



Mike, could we utilize the existing mechanism such as has_cpu.

Here is my approach/suggestion.
We introduce in the sched_data[cpu] a resched_task slot;

     struct task *resched_task;

When in reschedule_idle() a target cpu is to be decided for task <p>
we check the resched_task slot. If the slot is pointing
to some task, then this task should be considered running
and we should not consider the preemption goodness to the
currently running task as we know it already got IPI'ed.
See also the schedule() function describe later on.

reschedule_idle(struct task *p)
{
     :
     :

     struct task *rst = sched_data[target_cpu].resched_task;
      struct task *rmt = sched_data[target_cpu].current;
     if (rst != NULL) {
          if (preemption_goodness(p,rst, ...) > 1)
                    p->processor = target_cpu;
                    p->has_cpu   = 1;
               rst->has_cpu = 0; /* reset as we overwrite */
               sched_data[target_cpu].resched_task = p;
               /* so we make old <rst> available for scheduling
                * and temp-bind <p> to target_cpu */
                * don't have to send IPI as this to handles race
                * condition and we are holding scheduling lock
                */
          } else {
               continue;
               /* we know that the current priority won't be
                * larger than the one of <rst> otherwise this
                * would have been picked up in the schedule
                */
          }
     } else {
          /* standard stuff that we always do */
     }
}

In schedule() we need to check whether a reschedule reservation is held.
First we go through the standard <stillrunning> check to compute the
initial
<c> goodness value. Then under

still_running_back:

     <loop through the runqueue>
     /* note that <rst> will be ignored due to <has_cpu=1> flag */

     /* check wether reservation <rst> exists */
     rst = sched_data[this_cpu].resched_task;
     if (rst != NULL) {
          c = goodness(rst,..);
          if (c > best_prio) {
               best_prio = goodness(rst,..);
               next = rst;
               sched_data[this_cpu].resched_task = NULL;
          } else {
               /* need to return rst back to scheduable state */
               rst->has_cpu = 0;
          }
      }


This approach would eliminate the need to check during runqueue scan
to check for each task's saved_cpus_allowed and would also make sure that
only one task reserves running on a particular cpu. Reservations are
preempted through the existing mechanism, namely goodness comparision,
and such "preempted" tasks are returned to general scheduability.
are put back into the

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Mike Kravetz <mkravetz@sequent.com>@vger.kernel.org on 07/13/2001 06:43:05
PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   David Lang <david.lang@digitalinsight.com>
cc:   Larry McVoy <lm@bitmover.com>, Davide Libenzi
      <davidel@xmailserver.org>, lse-tech@lists.sourceforge.net, Andi Kleen
      <ak@suse.de>, linux-kernel@vger.kernel.org
Subject:  Re: CPU affinity & IPI latency



On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
>
> David Lang

That does sound like the same behavior I was seeing with lat_ctx.  Like
I said in my previous note, the scheduler does try to take CPU affinity
into account.  reschedule_idle() does a pretty good job of determining
what CPU a task should run on.  In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle.  However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually ran on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'.
  With a little collapsing of existing fields, there is room to put
  this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
  is the CPU it is associated with (p->processor), then temporarily
  bind the task to that CPU.  The task is temporarily bound to the
  CPU by overwriting the 'cpus_allowed' field such that the task can
  only be scheduled on the target CPU.  Of course, the original
  value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
  will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs.  This could potentially
waste CPU cycles, and delay context switches.  In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: CPU affinity & IPI latency
@ 2001-07-16 10:10 Hubertus Franke
  2001-07-16 16:16 ` Davide Libenzi
  0 siblings, 1 reply; 30+ messages in thread
From: Hubertus Franke @ 2001-07-16 10:10 UTC (permalink / raw)
  To: linux-kernel, lse-tech


David, you are preaching to choir.

One can not have it both ways, at least without "#ifdef"s.
As Mike stated, we made the decision to adhere to current scheduling
semantics
purely for the purspose of comparision and increased adaptation chances.
As shown with the LoadBalancing addition to MQ, there are simple ways to
relax and completely eliminate the feedback between the queues, if one so
desires.

As for the solutions you proposed for the "switching problem", namely the
wakeup
list. I don't think you want a list here. A list would basically mean that
you
would need to walk it and come up with a single decision again. I think
what
I proposed, namely a per-CPU reschedule reservation that can be overwritten
taking
PROC_CHANGE_PENALTY or some form of it into account, seems a better
solution.
Open to discussions...

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Davide Libenzi <davidel@xmailserver.org>@vger.kernel.org on 07/15/2001
04:02:21 PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   Mike Kravetz <mkravetz@sequent.com>
cc:   linux-kernel@vger.kernel.org, Andi Kleen <ak@suse.de>,
      lse-tech@lists.sourceforge.net, Larry McVoy <lm@bitmover.com>, David
      Lang <david.lang@digitalinsight.com>
Subject:  Re: CPU affinity & IPI latency




On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>>
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times
to
>> verify it was actually faster)
>>
>> David Lang
>
> That does sound like the same behavior I was seeing with lat_ctx.  Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account.  reschedule_idle() does a pretty good job of determining
> what CPU a task should run on.  In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle.  However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
>
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
>
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
>
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.
>
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs.  This could potentially
> waste CPU cycles, and delay context switches.  In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
>
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem of the current scheduler is that it acts like an infinite
feedback
system.
When we're going to decide if we've to move a task we analyze the status at
the
current time without taking in account the system status at previous time
values.
But the feedback we send ( IPI to move the task ) take a finite time to hit
the
target CPU and, meanwhile, the system status changes.
So we're going to apply a feedback calculated in time T0 to a time Tn, and
this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of electronic example but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback
delivery
time, is :

1) lower the feedback amount, that means, try to minimize task movements

2) a low pass filter, that means, when we're going to decide the sort (
move )
        of a task, we've to weight the system status with the one that it
had
        at previous time values





- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: CPU affinity & IPI latency
@ 2001-07-16 18:26 Hubertus Franke
  0 siblings, 0 replies; 30+ messages in thread
From: Hubertus Franke @ 2001-07-16 18:26 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


Well, actually the sole purpose of <has_cpu> is
to lock a task out of a scheduling decision. Remember
during transient states, there are two tasks that have
the <has_cpu> flag set for a particular cpu.
So I think using <has_cpu> is kosher and preferred in
this situation, IMHO.

As you saw from David Levines reply, he thinks that
a list is more appropriate for more generic decision
support.
But I don't like the fact that you lock down several
tasks at that point. In my solution, you return the
task back to general schedulability.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Mike Kravetz <mkravetz@sequent.com> on 07/16/2001 12:14:46 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz <mkravetz@sequent.com>, lse-tech@lists.sourceforge.net,
      linux-kernel@vger.kernel.org
Subject:  Re: CPU affinity & IPI latency



On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>
> Mike, could we utilize the existing mechanism such as has_cpu.
>

I like it.  Especially the way you eliminated the situation where
we would have multiple tasks waiting for schedule.  Hope this is
not a frequent situation!!!  The only thing I don't like is the
use of has_cpu to prevent the task from being scheduled.  Right
now, I can't think of any problems with it.  However, in the past
I have been bit by using fields for purposes other than what they
were designed for.

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center




^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: CPU affinity & IPI latency
@ 2001-07-16 21:45 Hubertus Franke
  2001-07-16 22:56 ` Davide Libenzi
  0 siblings, 1 reply; 30+ messages in thread
From: Hubertus Franke @ 2001-07-16 21:45 UTC (permalink / raw)
  To: linux-kernel, lse-tech


Clean, but in this solutions, you can lock out tasks for a
several cycles, if indeed several of them where added
in the time before we can service the IPI on the target cpu.
You can end up with strange priority inversion problems,
because these tasks aren't seen by the general scheduler
or let alone are on the runqueue.
That's why I opted in my design for a single slot.

One could opt for the following implementation to ensure only
a single waiting task, namely put the already enqueued one back
if the goodness is worse.


static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
       if (ad->wlist &&
          (preemption_goodness(tsk,ad->wlist,ad->curr->mm) > 0))
        {
            add_to_runqueue(ad->wlist,tsk);
            if (task_on_runqueue(tsk))
                list_del(&tsk->run_list);
            /* obsolete now tsk->wlist_next = ad->wlist; */
            ad->wlist = tsk;
        }
}

I also don't like the fact that with reschedule_idle() you
just put the task into the runqueue, then you take it out again,
just to put it back into it after the IPI and that as it seems
on every reschedule_idle() path.

In my design one simply wouldn't send the IPI to a target CPU that
has a pending IPI waiting and preemption wouldn't happen based on
the goodness values.

I grant, that my design seems a bit more intrusive on the code,
but you were argueing just yesterday not to get stuck up with
closeness to the current code and semantics.

Comments ?

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Davide Libenzi <davidel@xmailserver.org>@ewok.dev.mcafeelabs.com on
07/16/2001 05:25:57 PM

Sent by:  davidel@ewok.dev.mcafeelabs.com


To:   Mike Kravetz <mkravetz@sequent.com>
cc:   linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net,
      Hubertus Franke/Watson/IBM@IBMUS
Subject:  Re: CPU affinity & IPI latency




On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>>
>> Mike, could we utilize the existing mechanism such as has_cpu.
>>
>
> I like it.  Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule.  Hope this is
> not a frequent situation!!!  The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled.  Right
> now, I can't think of any problems with it.  However, in the past
> I have been bit by using fields for purposes other than what they
> were designed for.

How about this ( draft ) :


struct task_struct {
        ...
        struct task_struct * wlist_next;
        ...
};


static union {
        struct schedule_data {
                struct task_struct * curr;
                struct task_struct * wlist;
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0}}};


static inline struct task_struct * wpick(aligned_data * ad) {
        struct task_struct * tsk = ad->wlist;
        if (tsk) {
                ad->wlist = tsk->wlist_next;
                add_to_runqueue(tsk);
        }
        return tsk;
}

static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
        if (task_on_runqueue(tsk))
                list_del(&tsk->run_list);
        tsk->wlist_next = ad->wlist;
        ad->wlist = tsk;
}


asmlinkage void schedule(void)
{
        ...
        if ((next = wpick(sched_data)))
                goto ...;
        ...
}

In reschedule_idle() when before sending the IPI we do a wpush().
We modify aligned_data->wlist and tsk->wlist_next under runqueue_lock so we
don't need another one.
A slight change is needed to reschedule_idle() to handle the new field.
Pros to this solution are 1) we are not going to give other fields a
different
meaning 2) when the idle will call schedule it'll pick the task w/o rescan.




- Davide





^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2001-07-16 22:53 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-07-12 23:40 CPU affinity & IPI latency Mike Kravetz
2001-07-13  0:22 ` Davide Libenzi
2001-07-13  0:36   ` Larry McVoy
2001-07-13  2:06     ` Mark Hahn
2001-07-13 16:41     ` Davide Libenzi
2001-07-13 17:31       ` Mike Kravetz
2001-07-13 19:17         ` Davide Libenzi
2001-07-13 19:39           ` [Lse-tech] " Gerrit Huizenga
2001-07-13 20:05             ` Davide Libenzi
2001-07-13 17:05     ` Mike Kravetz
2001-07-13 19:51       ` David Lang
2001-07-13 22:43         ` Mike Kravetz
2001-07-15 20:02           ` Davide Libenzi
2001-07-15 20:10             ` [Lse-tech] " Andi Kleen
2001-07-15 20:15           ` Andi Kleen
2001-07-15 20:31             ` Davide Libenzi
2001-07-16 15:46             ` [Lse-tech] " Mike Kravetz
2001-07-13 19:54       ` Chris Wedgwood
2001-07-15  7:42 ` Troy Benjegerdes
2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
2001-07-15 17:00     ` Troy Benjegerdes
2001-07-16  0:58       ` Mike Kravetz
  -- strict thread matches above, loose matches on Subject: below --
2001-07-14  3:25 Hubertus Franke
2001-07-16 16:14 ` Mike Kravetz
2001-07-16 21:25   ` Davide Libenzi
2001-07-16 10:10 Hubertus Franke
2001-07-16 16:16 ` Davide Libenzi
2001-07-16 18:26 Hubertus Franke
2001-07-16 21:45 Hubertus Franke
2001-07-16 22:56 ` Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox