public inbox for linux-kernel@vger.kernel.org
From: Larry McVoy <lm@bitmover.com>
To: Bill Davidsen <davidsen@tmr.com>
Cc: Larry McVoy <lm@bitmover.com>,
	lse-tech@lists.sourceforge.net,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [Lse-tech] NUMA scheduling
Date: Mon, 25 Feb 2002 12:02:42 -0800	[thread overview]
Message-ID: <20020225120242.F22497@work.bitmover.com> (raw)
In-Reply-To: <Pine.LNX.3.96.1020225142845.17391A-100000@gatekeeper.tmr.com>; from davidsen@tmr.com on Mon, Feb 25, 2002 at 02:49:40PM -0500

On Mon, Feb 25, 2002 at 02:49:40PM -0500, Bill Davidsen wrote:
> On Mon, 25 Feb 2002, Larry McVoy wrote:
> 
> > If you read the early hardware papers on SMP, they all claim "Symmetric
> > Multi Processor", i.e., you can run any process on any CPU.  Skip forward
> > 3 years, now read the cache affinity papers from the same hardware people.
> > You have to step back and squint but what you'll see is that these papers
> > could be summarized in one sentence:
> > 
> > 	"Oops, we lied, it's not really symmetric at all"
> > 
> > You should treat each CPU as a mini system and think of a process reschedule
> > someplace else as a checkpoint/restart and assume that is heavy weight.  In
> > fact, I'd love to see the scheduler code forcibly sleep the process for 
> > 500 milliseconds each time it lands on a different CPU.  Tune the system
> > to work well with that, then take out the sleep, and you'll have the right
> > answer.
> 
>   Unfortunately this is an overly simple view of how SMP works. The only
> justification for CPU latency is to preserve cache contents. Trying to
> express this as a single number is bound to produce suboptimal results.

And here is the other side of the coin.  Remember what we are doing.
We're in the middle of a context switch, trying to figure out where we
should run this process.  We would like context switches to be fast.
Any work we do here is at direct odds with our goals.  SGI took the
approach that your statements would imply, i.e., approximate the 
cache footprint, figure out if it was big or small, and use that to
decide where to land the process.  This has two fatal flaws:
a) Because there is no generic hardware interface to say "how many cache
   lines are mine", you approximate that by looking at how much of its
   timeslice the process used; if it used a lot, you guess it filled
   the cache.  This doesn't work at all for I/O bound processes, which
   typically run in short bursts.  So IRIX would bounce these around
   for no good reason, resulting in crappy I/O perf.  I got about another
   20% in BDS by locking down the processes (BDS delivered 3.2GBytes/sec
   of NFS traffic, sustained, in 1996).
b) All of the "thinking" you do to figure out where to land the process
   contributes directly to the cost of the context switch.  Linux has
   nice light context switches, let's keep it that way.

Summary: SGI managed to get optimal usage out of their caches for long
running, CPU bound fortran jobs at the expense of time sharing and
I/O jobs.  I'm happy to let SGI win in that space.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


Thread overview: 19+ messages
2002-02-22 18:56 NUMA scheduling Mike Kravetz
2002-02-22 19:14 ` [Lse-tech] " Jesse Barnes
2002-02-22 19:29   ` Peter Rival
2002-02-22 23:59 ` Mike Kravetz
2002-02-25 18:32 ` Erich Focht
2002-02-25 18:55   ` Martin J. Bligh
2002-02-25 19:03     ` Larry McVoy
2002-02-25 19:28       ` Davide Libenzi
2002-02-25 19:45         ` Davide Libenzi
2002-02-25 19:35       ` Timothy D. Witham
2002-02-25 19:49       ` Bill Davidsen
2002-02-25 20:02         ` Larry McVoy [this message]
2002-02-25 20:18           ` Davide Libenzi
2002-02-26  5:14           ` Bill Davidsen
2002-02-25 23:35     ` [Lse-tech] [rebalance at: do_fork() vs. do_execve()] " Andy Pfiffer
2002-02-26 10:33     ` [Lse-tech] " Erich Focht
2002-02-26 15:30       ` Martin J. Bligh
2002-02-27 16:56         ` Erich Focht
2002-02-26 19:03       ` Mike Kravetz
