* Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Rob Landley @ 2002-10-23 21:26 UTC (permalink / raw)
To: linux-kernel

Kernel hooks is back with new links. Also new versions of the Linux Trace Toolkit and sys_epoll, new stuff from the 2.5 status list, and new stuff STILL showing up on linux-kernel. (Still no 2.5 patch for Alan's 32 bit dev_t, though.)

Richard J. Moore has stepped up to defend "VM Large Page support", which has become "hugetlb update". I don't know if this counts as a new feature or a bugfix, but it's back...

Due to numerous complaints (okay, one, but technically that's a number) I've tried to reformat a bit to give a slightly less eye-searingly hideous layout, and reorganized the -mm stuff to be together in one clump.

And so:

----------

Linus returns from the Linux Lunacy Cruise after Sunday, October 27th. (See http://www.geekcruises.com/itinerary/ll2_itinerary.html. He's off to Jamaica, mon.) The following features aim to be ready for submission to Linus by Monday, October 28th, to be considered for inclusion (in 2.5.45) before the feature freeze on Thursday, October 31 (Halloween). (L minus four days, and counting...)

Note: if you want to submit a new entry to this list, PLEASE provide a URL where the patch can be found, plus any descriptive announcement you think useful (user space tools, etc.). This doesn't have to be a web page devoted to the patch; if the patch has been posted to linux-kernel, a URL to the post on any linux-kernel archive site is fine.
If you don't know of one, a good site for looking at the threaded archive is:
http://lists.insecure.org/lists/linux-kernel/

A more searchable archive is available at:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&group=mlist.linux.kernel

This archive seems less likely to mangle your patch for cut and pasting (especially if you click "raw download" at the top of the message), although it's a real pain to actually try to read:
http://marc.theaimsgroup.com/?l=linux-kernel

This list is just pending features trying to get in before feature freeze. It's primarily for features that need more testing, or might otherwise get forgotten in the rush. If you want to know what's already gone in, or what's being worked on for the next development cycle, check out http://kernelnewbies.org/status.

You can get Andrew Morton's -mm tree here, including a broken-out patches directory and a description file:
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.44

Alan Cox's -ac tree comes from here:
http://www.kernel.org/pub/linux/kernel/people/alan/

Thanks to Rusty Russell and Guillaume Boissiere, whose respective 2.5 merge candidate lists have been ruthlessly strip-mined in the process of assembling this. And to everybody who's emailed stuff.

And now, in no particular order:

============================ Pending features: =============================

1) New kernel configuration system (Roman Zippel)

Announcement:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6898.html
Code:
http://www.xs4all.nl/~zippel/lc/

Linus has actually looked fairly favorably on this one so far:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3250.html

----------------------------------------------------------------------------

2) ext2/ext3 extended attributes and access control lists (Ted Tso) (in -mm)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6787.html
Code:
bk://extfs.bkbits.net/extfs-2.5-update
http://thunk.org/tytso/linux/extfs-2.5
(Or just grab it from the -mm tree.)
(Considering that EA/ACL infrastructure is already in, and supported by XFS and JFS, this one's pretty close to a shoo-in.)

----------------------------------------------------------------------------

3) Page table sharing (Daniel Phillips, Dave McCracken) (in -mm)

Announce:
http://www.geocrawler.com/mail/msg.php3?msg_id=7855063&list=35
Patch from the -mm tree:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/shpte-ng.patch

Ed Tomlinson seems to have a show-stopper bug for this one (although he tells me in email he'd like to see it go in anyway):
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7147.html

----------------------------------------------------------------------------

4) Improved Hugetlb support (Richard J. Moore) (in -mm tree)

(Dunno if this is exactly a feature, but giving it the benefit of the doubt...)

Description:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/description
Patches (everything starting with "htlb" or "hugetlb"):
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/

----------------------------------------------------------------------------

5) Generic Nonlinear Mappings (Ingo Molnar) (in -mm)

It's new, very close to the deadline, and needs testing and discussion. I'm still a touch vague on what it actually does, but there's a thread.
Announcement, patch, and start of thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103530883511032&w=2

----------------------------------------------------------------------------

6) Linux Trace Toolkit (LTT) (Karim Yaghmour)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7016.html
Patch:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021022-2.2.bz2
User tools:
http://opersys.com/ftp/pub/LTT/TraceToolkit-0.9.6pre2.tgz

----------------------------------------------------------------------------

7) Device mapper for Logical Volume Manager (LVM2) (LVM2 team) (in -ac)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536883428443&w=2
Download:
http://people.sistina.com/~thornber/patches/2.5-stable/
Home page:
http://www.sistina.com/products_lvm.htm

----------------------------------------------------------------------------

8) EVMS (Enterprise Volume Management System) (EVMS team)

Home page:
http://sourceforge.net/projects/evms

----------------------------------------------------------------------------

9) Kernel Probes (IBM, contact: Vamsi Krishna S)

Kprobes announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528410215211&w=2
Base Kprobes patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528425615302&w=2
KProbes->DProbes patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454215523&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454015520&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528485415813&w=2
Official IBM download site for the most recent versions (gzipped tarballs):
http://www-124.ibm.com/linux/patches/?project_id=141
See also the DProbes home page:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

A good explanation of the difference between kprobes, dprobes, and kernel hooks is here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103532874900445&w=2

And a clarification: just kprobes is being submitted for 2.5.45, not the whole of dprobes:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536827928012&w=2

----------------------------------------------------------------------------

10) High resolution timers (George Anzinger, etc.)

Home page:
http://high-res-timers.sourceforge.net/
Patch via evil sourceforge download auto-mirror thing:
http://prdownloads.sourceforge.net/high-res-timers/hrtimers-support-2.5.36-1.0.patch?download

Linus has unresolved concerns with this one, by the way:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3463.html

Note: The Google posix timer patch forwarded by Jim Houston is being merged into this patch:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8068.html

----------------------------------------------------------------------------

11) Linux Kernel Crash Dumps (Matt Robinson, LKCD team)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536576625905&w=2
Code:
http://lkcd.sourceforge.net/download/latest/

----------------------------------------------------------------------------

12) Rewrite of the console layer (James Simmons)

Home page:
http://linuxconsole.sourceforge.net/
Patch (unknown version, but the home page only has a random CVS du jour link):
http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz
Bitkeeper tree:
http://linuxconsole.bkbits.net

----------------------------------------------------------------------------

13) Kexec, launch new Linux kernel from Linux (Eric W. Biederman)

Announcement with links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6584.html

And this thread is just too brazen not to include:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7952.html

----------------------------------------------------------------------------

14) USAGI IPv6 (Yoshifuji Hideyaki)

README:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/README.IPSEC
Patch:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/ipsec-2.5.43-ALL-03.patch.gz

----------------------------------------------------------------------------

15) MMU-less processor support (Greg Ungerer)

Announcement with lots of links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7027.html

----------------------------------------------------------------------------

16) sys_epoll (i.e. /dev/poll) (Davide Libenzi)

Home page:
http://www.xmailserver.org/linux-patches/nio-improve.html
Patch:
http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-0.7.diff

Linus participated repeatedly in a thread on this one too, expressing concerns which (hopefully) have been addressed. See:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6428.html

----------------------------------------------------------------------------

17) CD Recording/sgio patches (Jens Axboe)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8060.html
Patch:
http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.5/2.5.44/sgio-14b.diff.bz2

----------------------------------------------------------------------------

18) In-kernel module loader (Rusty Russell)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6214.html
Patch:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/module-x86-18-10-2002.2.5.43.diff.gz

----------------------------------------------------------------------------

19) Unified Boot/Module parameter support (Rusty Russell)

Note: depends on the in-kernel module loader.
Huge disorganized heap o' patches with no explanation:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/

----------------------------------------------------------------------------

20) Hotplug CPU Removal (Rusty Russell)

Even bigger, more disorganized heap o' patches:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotplug/

----------------------------------------------------------------------------

21) Unlimited groups patch (Tim Hockin)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761319825&w=2
Patch set:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524717119443&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761819834&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761619831&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761519829&w=2

----------------------------------------------------------------------------

22) Initramfs (Al Viro)

Way back when, Al said:
http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html
I THINK this is the most recent patch:
ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C40
And Linus recently made happy noises about the idea:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/1110.html

----------------------------------------------------------------------------

23) Kernel Hooks (IBM, contact: Vamsi Krishna S.)

Website:
http://www-124.ibm.com/linux/projects/kernelhooks/
Download site:
http://www-124.ibm.com/linux/patches/?patch_id=595
Posted patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103364774926440&w=2

----------------------------------------------------------------------------

24) NMI request/release interface (Corey Minyard)

He says:
> Add a request/release mechanism to the kernel (x86 only for now) for NMIs.
...
> I have modified the nmi watchdog to use this interface, and it
> seems to work ok. Keith Owens is copied to see if he would be
> interested in converting kdb to use this, if it gets put into the kernel.
The latest patch so far:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540434409894&w=2

----------------------------------------------------------------------------

25) Digital Video Broadcasting layer (LinuxTV team)

Home page:
http://www.linuxtv.org:81/dvb/
Download:
http://www.linuxtv.org:81/download/dvb/

----------------------------------------------------------------------------

26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)

Home page:
http://home.arcor.de/efocht/sched/
Patch:
http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

----------------------------------------------------------------------------

27) DriverFS Topology (Matthew Dobson)

Announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103523702710396&w=2
Patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540707113401&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757613962&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540758013984&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757513957&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757813966&w=2

----------------------------------------------------------------------------

28) Advanced TCA Disk Hotswap (Steven Dake)

At the last minute, Steven Dake submitted (and if he'd cc'd the list, I could have linked to this message as the announcement, hint hint...):

> Please add to your 2.5.45 list:
>
> "Advanced TCA Disk Hotswap".
>
> This is a generic feature that provides good hotswap support for SCSI
> and FibreChannel disk devices. The entire SCSI layer has been properly
> analyzed to provide correct locking, and a complete RAMFS filesystem is
> available to control the kernel disk hotswap operations.
>
> Both Alan Cox and Greg KH have looked at the patch for 2.4 and suggested
> that if I ported to 2.5 and made some changes (as I have in the latest port)
> this feature would be a good candidate for the 2.5 kernel.
>
> The sourceforge site for the latest patches is:
> https://sourceforge.net/projects/atca-hotswap/
>
> The lkml announcement for this latest port is:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103541572622729&w=2
>
> A thread discussing Advanced TCA hotswap (of which this patch is one
> part) can be found at:
> http://marc.theaimsgroup.com/?t=103462115700001&r=1&w=2
>
> Thanks!
> -steve

======================== Unresolved issues: =========================

1) Hyperthread-aware scheduler
2) Connection tracking optimizations

No URLs to a patch for either. Anybody want to come out in favor of these with an announcement and a pointer to a version being suggested for inclusion?

3) IPSEC (David Miller, Alexey)
4) New CryptoAPI (James Morris)

David S. Miller said:
> No URLs, being coded as I type this :-)
>
> Some of the ipv4 infrastructure is in 2.5.44

(Note, this may conflict with Yoshifuji Hideyaki's ipv6 ipsec stuff. If not, I'd like to collate or clarify the entries.) USAGI ipv6 is in the first section and this isn't, because I have a URL to an existing USAGI patch and don't for this. I have no idea how much overlap there is between these projects, or whether they're considered parts of the same project or submitted individually...

5) ReiserFS 4

Hans Reiser said:
> We will send Reiser4 out soon, probably around the 27th.
>
> Hans

See also http://www.namesys.com/v4/fast_reiser4.html

Hans and Jens Axboe are arguing about whether or not Reiser4 is a potential post-freeze addition. That thread starts here:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7140.html

6) 32-bit dev_t

Alan Cox said:
> The big one missing is 32bit dev_t. That's the killer item we have left.

But he did not provide a URL to a patch. Presumably it's in his tree and capable of being extracted from it, so I guess it's already in good hands? (I dunno, ask him.)

He also mentioned:
> Oh, other one I missed - DVB layer - digital tv etc.
> Pretty much essential now for Europe, but again it's basically all driver layer

But it's not clear whether this is an item that must go in before the feature freeze or not at all, which is what this list tries to focus on.

Then Dan Kegel pointed out:
> One possible page to quote for 32 bit dev_t:
> http://lwn.net/Articles/11583/

7) Online EXT3 resize support

A thread over whether or not this is self-contained enough and low enough impact to go in after the feature freeze starts here:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7680.html

I mention it just in case it isn't. (We've had offline EXT3 resize for a while; this is apparently twiddling a mounted partition without unplugging it first, or even wearing rubber boots.)

-- 
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad, CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Michael Hohnbaum @ 2002-10-24 16:17 UTC
To: landley; +Cc: linux-kernel, Erich Focht

On Wed, 2002-10-23 at 14:26, Rob Landley wrote:
> 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
>
> Home page:
> http://home.arcor.de/efocht/sched/
>
> Patch:
> http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

The simple NUMA scheduler patch, which is ready for inclusion, is a separate project from Erich's NUMA scheduler extensions. Information on the simple NUMA scheduler is contained in these lkml postings:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2

The most recent version has been split into two patches for 2.5.44:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2

-- 
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
[parent not found: <200210240750.09751.landley@trommello.org>]
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Michael Hohnbaum @ 2002-10-24 19:01 UTC
To: landley; +Cc: linux-kernel, Erich Focht

On Thu, 2002-10-24 at 05:50, Rob Landley wrote:
> On Thursday 24 October 2002 11:17, Michael Hohnbaum wrote:
> > On Wed, 2002-10-23 at 14:26, Rob Landley wrote:
> > > 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
> > >
> > > Home page:
> > > http://home.arcor.de/efocht/sched/
> > >
> > > Patch:
> > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch
> >
> > The simple NUMA scheduler patch, which is ready for inclusion, is a
> > separate project from Erich's NUMA scheduler extensions. Information
> > on the simple NUMA scheduler is contained in these lkml postings:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2
> >
> > The most recent version has been split into two patches for 2.5.44:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2
>
> Any relation to http://lse.sourceforge.net/numa/, which the 2.5 status list
> says is in "Alpha" state, two steps down from "Ready"?
>
> Rob

Yes and no. At one point I was working with Erich, moving his NUMA scheduler to 2.5 and testing it on our NUMA hardware. However, it was not looking like his NUMA scheduler was going to be ready for 2.5, so I went off on a separate effort to produce a much smaller, simpler patch providing rudimentary NUMA support within the scheduler. This patch does not have all the functionality of Erich's, but it does provide definite performance improvements on NUMA machines with no degradation on non-NUMA SMP.
It is much smaller and less intrusive, and has been tested on multiple NUMA architectures (including by Erich on the NEC IA64 NUMA box).

The 2.5 status list has not been updated to reflect this separate effort, and I believe it incorrectly lists this entry as "ready". There really are now two NUMA scheduler projects:

* Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
* Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)

-- 
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Erich Focht @ 2002-10-24 21:51 UTC
To: Michael Hohnbaum, landley; +Cc: linux-kernel

Hi Rob and Michael,

I need to correct some inaccuracies and, of course, advertise my approach :-)

On Thursday 24 October 2002 21:01, Michael Hohnbaum wrote:
> > > > 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
> > > >
> > > > Home page:
> > > > http://home.arcor.de/efocht/sched/
> > > >
> > > > Patch:
> > > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

These are old. I posted the newer patches (split up in order to clearly separate the functionality additions) to LKML:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387719030&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387519026&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441119407&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441319411&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441419416&w=2

They should work for any NUMA platform by just adding a call to build_pools() in smp_cpus_done(). They work for non-NUMA platforms the same way as the O(1) scheduler (though the code looks different).

A test overview is in:
http://lwn.net/Articles/12546/

This suggests that taking only patches 01+02 already gives you a VERY good NUMA scheduler. They deliver the infrastructure for later developments (patches 03+05) which we can further research and tune, or give only to special customers.

> The 2.5 status list has not been updated to reflect this separate
> effort, and I believe incorrectly lists this entry as "ready". There
> really are now two NUMA scheduler projects:
>
> * Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
> * Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)
This is not correct. We have had the node affine scheduler in production for six months on top of 2.4 kernels and are happy with it. It is a lot more than alpha or beta; it already makes customers happy.

The situation is really funny: everybody seems to agree that the design ideas in my NUMA approach are sane and exactly what we want to have on a NUMA platform in the end. But instead of concentrating on tuning the parameters for the many different NUMA platforms and reshaping this approach to make it acceptable, IBM concentrates on a very much stripped down approach. I understand that this project has been started to make the inclusion of some NUMA scheduler easier. But in the end, the simple NUMA scheduler will have to develop into a much more complex thing and in some form or another replicate the design ideas of my node affine scheduler.

On machines with a poor NUMA ratio like NUMAQ, the simple NUMA change helps. For machines with a good NUMA ratio like the NEC Azusa and NEC TX7 you need a little bit more. AMD Hammer SMP and ppc64 are certainly in the same class as the Azusa/TX7, and as soon as Hammer SMP systems are around, the pressure for a full featured NUMA scheduler will be much higher. A NUMA scheduler extension of the 2.6 kernel fits very well with the development effort done for better scalability and enterprise-level fitness of Linux. Check http://lwn.net/Articles/12546/ to see that it makes a difference to have more than O(1) on NUMA machines!

I'd definitely prefer the inclusion of my 01+02 patches (I'd have to maintain less code to keep the customers happy); on the other side, including Michael's patch would be better than not adding NUMA scheduler support at all.

Best regards,
Erich
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Martin J. Bligh @ 2002-10-24 22:38 UTC
To: Erich Focht, Michael Hohnbaum, landley; +Cc: linux-kernel

> The situation is really funny: everybody seems to agree that the design
> ideas in my NUMA approach are sane and exactly what we want to have on
> a NUMA platform in the end. But instead of concentrating on tuning the
> parameters for the many different NUMA platforms and reshaping this
> approach to make it acceptable, IBM concentrates on a very much stripped
> down approach.

From my point of view, the reason for focusing on this was that your scheduler degraded the performance on my machine, rather than boosting it. Half of that was the more complex stuff you added on top ... it's a lot easier to start with something simple that works and build on it, than to fix something that's complex and doesn't work well.

I still haven't been able to get your scheduler to boot for about the last month without crashing the system. Andrew says he has it booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and see how it looks. If the numbers look good for doing boring things like kernel compile, SDET, etc., I'm happy.

M.
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Erich Focht @ 2002-10-25 8:15 UTC
To: Martin J. Bligh, Michael Hohnbaum, landley; +Cc: linux-kernel

On Friday 25 October 2002 00:38, Martin J. Bligh wrote:
> > The situation is really funny: everybody seems to agree that the design
> > ideas in my NUMA approach are sane and exactly what we want to have on
> > a NUMA platform in the end. But instead of concentrating on tuning the
> > parameters for the many different NUMA platforms and reshaping this
> > approach to make it acceptable, IBM concentrates on a very much stripped
> > down approach.
>
> From my point of view, the reason for focusing on this was that
> your scheduler degraded the performance on my machine, rather than
> boosting it. Half of that was the more complex stuff you added on
> top ... it's a lot easier to start with something simple that works
> and build on it, than fix something that's complex and doesn't work
> well.

You're talking about one of the first 2.5 versions of the patch. It has changed a lot since then, thanks to your feedback, too.

> I still haven't been able to get your scheduler to boot for about
> the last month without crashing the system. Andrew says he has it
> booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and
> see how it looks. If the numbers look good for doing boring things
> like kernel compile, SDET, etc, I'm happy.

I thought this problem was well understood! For reasons independent of my patch you have to boot your machines with the "notsc" option. This leaves the cache_decay_ticks variable initialized to zero, which my patch doesn't like. I'm trying to deal with this inside the patch, but there is still a small window when the variable is zero.
In my opinion this needs to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine with cache_decay_ticks=0 is pure nonsense, as it switches off the cache affinity which you absolutely need! So even if "notsc" is a legal option, it should be fixed such that it doesn't leave your machine without cache affinity. That would in any case give you a falsified behavior of the O(1) scheduler.

Erich
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Martin J. Bligh @ 2002-10-25 23:26 UTC
To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel

> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

Right. But I've been struggling to boot anything later than that ;-)

> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

OK, well we seem to have it working on one machine, but not on another, though those should be identical; I'm playing around with the differences. The first major thing I noticed is that the working box has gcc 3.1, and the non-working one gcc 2.95.4 (Debian woody). I suspect it's a subtle timing thing, or something equally horrible.

Changing the non-working box to gcc 3.1 instead (which I *really* don't want to do long term unless we prove there's a bug in 2.95 ... gcc 3.x is disgustingly slow) resulted in it getting a little further, but then it got the following oops ... does this provide any clues?

CPU 7 IS NOW UP!
Starting migration thread for cpu 7
Bringing up 8
CPU 8 IS NOW UP!
Starting migration thread for cpu 8
divide error: 0000
CPU:    4
EIP:    0060:[<c011ac38>]    Not tainted
EFLAGS: 00010002
EIP is at task_to_steal+0x118/0x260
eax: 00000001   ebx: f01c5040   ecx: 00000000   edx: 00000000
esi: 00000063   edi: f01c5020   ebp: f0197ee8   esp: f0197eac
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060)
Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c
       c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c
       c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000
Call Trace:
 [<c011829c>] load_balance+0x8c/0x140
 [<c0118588>] scheduler_tick+0x238/0x360
 [<c0123347>] tasklet_hi_action+0x77/0xc0
 [<c0105420>] default_idle+0x0/0x50
 [<c0126bd5>] update_process_times+0x45/0x60
 [<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120
 [<c0105420>] default_idle+0x0/0x50
 [<c010815e>] apic_timer_interrupt+0x1a/0x20
 [<c0105420>] default_idle+0x0/0x50
 [<c0105420>] default_idle+0x0/0x50
 [<c010544a>] default_idle+0x2a/0x50
 [<c01054ea>] cpu_idle+0x3a/0x50
 [<c011db20>] printk+0x140/0x180
Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44

This is 2.5.44-mm4 + your patches 1,2,3,5, I think.

M.
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 23:26 ` Martin J. Bligh @ 2002-10-25 23:45 ` Martin J. Bligh 2002-10-26 0:02 ` Martin J. Bligh 1 sibling, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-25 23:45 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel > divide error: 0000 > > CPU: 4 > EIP: 0060:[<c011ac38>] Not tainted > EFLAGS: 00010002 > EIP is at task_to_steal+0x118/0x260 > eax: 00000001 ebx: f01c5040 ecx: 00000000 edx: 00000000 > esi: 00000063 edi: f01c5020 ebp: f0197ee8 esp: f0197eac > ds: 0068 es: 0068 ss: 0068 > Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060) > Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c > c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c > c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000 > Call Trace: > [<c011829c>] load_balance+0x8c/0x140 > [<c0118588>] scheduler_tick+0x238/0x360 > [<c0123347>] tasklet_hi_action+0x77/0xc0 > [<c0105420>] default_idle+0x0/0x50 > [<c0126bd5>] update_process_times+0x45/0x60 > [<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120 > [<c0105420>] default_idle+0x0/0x50 > [<c010815e>] apic_timer_interrupt+0x1a/0x20 > [<c0105420>] default_idle+0x0/0x50 > [<c0105420>] default_idle+0x0/0x50 > [<c010544a>] default_idle+0x2a/0x50 > [<c01054ea>] cpu_idle+0x3a/0x50 > [<c011db20>] printk+0x140/0x180 > > Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44 Dump of assembler code for function task_to_steal: 0xc011ab20 <task_to_steal>: push %ebp 0xc011ab21 <task_to_steal+1>: mov %esp,%ebp 0xc011ab23 <task_to_steal+3>: push %edi 0xc011ab24 <task_to_steal+4>: push %esi 0xc011ab25 <task_to_steal+5>: push %ebx 0xc011ab26 <task_to_steal+6>: sub $0x30,%esp 0xc011ab29 <task_to_steal+9>: movl $0x0,0xffffffdc(%ebp) 0xc011ab30 <task_to_steal+16>: mov 0xc(%ebp),%eax 0xc011ab33 <task_to_steal+19>: movl $0x0,0xffffffe8(%ebp) 0xc011ab3a <task_to_steal+26>: mov 
0x8(%ebp),%edx 0xc011ab3d <task_to_steal+29>: mov 0xc034afe0(,%eax,4),%eax 0xc011ab44 <task_to_steal+36>: sar $0x4,%eax 0xc011ab47 <task_to_steal+39>: mov %eax,0xffffffec(%ebp) 0xc011ab4a <task_to_steal+42>: mov 0x20(%edx),%eax 0xc011ab4d <task_to_steal+45>: mov (%eax),%esi 0xc011ab4f <task_to_steal+47>: test %esi,%esi 0xc011ab51 <task_to_steal+49>: je 0xc011ad6a <task_to_steal+586> 0xc011ab57 <task_to_steal+55>: mov %eax,0xffffffe4(%ebp) 0xc011ab5a <task_to_steal+58>: movl $0x0,0xfffffff0(%ebp) 0xc011ab61 <task_to_steal+65>: mov 0xffffffe4(%ebp),%ebx 0xc011ab64 <task_to_steal+68>: add $0x4,%ebx 0xc011ab67 <task_to_steal+71>: mov %ebx,0xffffffd0(%ebp) 0xc011ab6a <task_to_steal+74>: lea 0x0(%esi),%esi 0xc011ab70 <task_to_steal+80>: mov 0xfffffff0(%ebp),%ebx 0xc011ab73 <task_to_steal+83>: test %ebx,%ebx 0xc011ab75 <task_to_steal+85>: jne 0xc011acec <task_to_steal+460> 0xc011ab7b <task_to_steal+91>: mov 0xffffffe4(%ebp),%edx 0xc011ab7e <task_to_steal+94>: mov 0x4(%edx),%eax 0xc011ab81 <task_to_steal+97>: test %eax,%eax 0xc011ab83 <task_to_steal+99>: jne 0xc011ace4 <task_to_steal+452> 0xc011ab89 <task_to_steal+105>: mov 0xffffffd0(%ebp),%ecx 0xc011ab8c <task_to_steal+108>: mov 0x4(%ecx),%eax 0xc011ab8f <task_to_steal+111>: test %eax,%eax 0xc011ab91 <task_to_steal+113>: jne 0xc011acd9 <task_to_steal+441> 0xc011ab97 <task_to_steal+119>: mov 0xffffffd0(%ebp),%ebx 0xc011ab9a <task_to_steal+122>: mov 0x8(%ebx),%eax 0xc011ab9d <task_to_steal+125>: test %eax,%eax 0xc011ab9f <task_to_steal+127>: jne 0xc011acce <task_to_steal+430> 0xc011aba5 <task_to_steal+133>: mov 0xffffffd0(%ebp),%edx 0xc011aba8 <task_to_steal+136>: mov 0xc(%edx),%eax 0xc011abab <task_to_steal+139>: test %eax,%eax 0xc011abad <task_to_steal+141>: je 0xc011acbf <task_to_steal+415> 0xc011abb3 <task_to_steal+147>: bsf %eax,%eax 0xc011abb6 <task_to_steal+150>: add $0x60,%eax 0xc011abb9 <task_to_steal+153>: mov %eax,0xfffffff0(%ebp) 0xc011abbc <task_to_steal+156>: cmpl $0x8c,0xfffffff0(%ebp) 0xc011abc3 
<task_to_steal+163>: je 0xc011ac9e <task_to_steal+382> 0xc011abc9 <task_to_steal+169>: mov 0xfffffff0(%ebp),%ebx 0xc011abcc <task_to_steal+172>: mov 0xffffffe4(%ebp),%eax 0xc011abcf <task_to_steal+175>: mov 0xc034b4e0,%edx 0xc011abd5 <task_to_steal+181>: lea 0x18(%eax,%ebx,8),%ebx 0xc011abd9 <task_to_steal+185>: mov %ebx,0xffffffe0(%ebp) 0xc011abdc <task_to_steal+188>: mov 0x4(%ebx),%ebx 0xc011abdf <task_to_steal+191>: mov %edx,0xffffffcc(%ebp) 0xc011abe2 <task_to_steal+194>: lea 0x0(%esi,1),%esi 0xc011abe9 <task_to_steal+201>: lea 0x0(%edi,1),%edi 0xc011abf0 <task_to_steal+208>: lea 0xffffffe0(%ebx),%edi 0xc011abf3 <task_to_steal+211>: mov 0xc0348e68,%eax 0xc011abf8 <task_to_steal+216>: mov 0x30(%edi),%edx 0xc011abfb <task_to_steal+219>: sub %edx,%eax 0xc011abfd <task_to_steal+221>: cmp 0xffffffcc(%ebp),%eax 0xc011ac00 <task_to_steal+224>: jbe 0xc011ac70 <task_to_steal+336> 0xc011ac02 <task_to_steal+226>: mov 0x8(%ebp),%ecx 0xc011ac05 <task_to_steal+229>: mov 0x14(%ecx),%ecx 0xc011ac08 <task_to_steal+232>: cmp %ecx,%edi 0xc011ac0a <task_to_steal+234>: mov %ecx,0xffffffc8(%ebp) 0xc011ac0d <task_to_steal+237>: je 0xc011ac70 <task_to_steal+336> 0xc011ac0f <task_to_steal+239>: movzbl 0xc(%ebp),%ecx 0xc011ac13 <task_to_steal+243>: mov 0x38(%edi),%eax 0xc011ac16 <task_to_steal+246>: shr %cl,%eax 0xc011ac18 <task_to_steal+248>: and $0x1,%eax 0xc011ac1b <task_to_steal+251>: je 0xc011ac70 <task_to_steal+336> 0xc011ac1d <task_to_steal+253>: mov 0x48(%edi),%esi 0xc011ac20 <task_to_steal+256>: test %esi,%esi 0xc011ac22 <task_to_steal+258>: jne 0xc011ac83 <task_to_steal+355> 0xc011ac24 <task_to_steal+260>: mov 0xc0348e68,%eax 0xc011ac29 <task_to_steal+265>: xor %edx,%edx 0xc011ac2b <task_to_steal+267>: mov $0x63,%esi 0xc011ac30 <task_to_steal+272>: mov 0x30(%edi),%ecx 0xc011ac33 <task_to_steal+275>: sub %ecx,%eax 0xc011ac35 <task_to_steal+277>: mov 0x44(%edi),%ecx 0xc011ac38 <task_to_steal+280>: divl 0xffffffcc(%ebp) 0xc011ac3b <task_to_steal+283>: mov 0xffffffc8(%ebp),%edx 
0xc011ac3e <task_to_steal+286>: cmp $0x64,%eax 0xc011ac41 <task_to_steal+289>: cmovl %eax,%esi 0xc011ac44 <task_to_steal+292>: cmp %ecx,0xffffffec(%ebp) 0xc011ac47 <task_to_steal+295>: lea 0x64(%esi),%eax 0xc011ac4a <task_to_steal+298>: cmove %eax,%esi 0xc011ac4d <task_to_steal+301>: mov 0x4(%edx),%eax 0xc011ac50 <task_to_steal+304>: lea 0xffffff9c(%esi),%edx 0xc011ac53 <task_to_steal+307>: mov 0xc(%eax),%eax 0xc011ac56 <task_to_steal+310>: mov 0xc034afe0(,%eax,4),%eax 0xc011ac5d <task_to_steal+317>: sar $0x4,%eax 0xc011ac60 <task_to_steal+320>: cmp %eax,%ecx 0xc011ac62 <task_to_steal+322>: cmove %edx,%esi 0xc011ac65 <task_to_steal+325>: cmp 0xffffffdc(%ebp),%esi 0xc011ac68 <task_to_steal+328>: jle 0xc011ac70 <task_to_steal+336> 0xc011ac6a <task_to_steal+330>: mov %esi,0xffffffdc(%ebp) 0xc011ac6d <task_to_steal+333>: mov %edi,0xffffffe8(%ebp) 0xc011ac70 <task_to_steal+336>: mov (%ebx),%ebx 0xc011ac72 <task_to_steal+338>: cmp 0xffffffe0(%ebp),%ebx 0xc011ac75 <task_to_steal+341>: jne 0xc011abf0 <task_to_steal+208> 0xc011ac7b <task_to_steal+347>: incl 0xfffffff0(%ebp) 0xc011ac7e <task_to_steal+350>: jmp 0xc011ab70 <task_to_steal+80> 0xc011ac83 <task_to_steal+355>: mov %edi,(%esp,1) 0xc011ac86 <task_to_steal+358>: call 0xc0118070 <upd_node_mem> 0xc011ac8b <task_to_steal+363>: mov 0x8(%ebp),%edx 0xc011ac8e <task_to_steal+366>: mov 0xc034b4e0,%eax 0xc011ac93 <task_to_steal+371>: mov %eax,0xffffffcc(%ebp) 0xc011ac96 <task_to_steal+374>: mov 0x14(%edx),%edx 0xc011ac99 <task_to_steal+377>: mov %edx,0xffffffc8(%ebp) 0xc011ac9c <task_to_steal+380>: jmp 0xc011ac24 <task_to_steal+260> 0xc011ac9e <task_to_steal+382>: mov 0x8(%ebp),%eax 0xc011aca1 <task_to_steal+385>: mov 0xffffffe4(%ebp),%edx 0xc011aca4 <task_to_steal+388>: cmp 0x20(%eax),%edx 0xc011aca7 <task_to_steal+391>: jne 0xc011acb4 <task_to_steal+404> 0xc011aca9 <task_to_steal+393>: mov 0x1c(%eax),%ecx 0xc011acac <task_to_steal+396>: mov %ecx,0xffffffe4(%ebp) 0xc011acaf <task_to_steal+399>: jmp 0xc011ab5a 
<task_to_steal+58> 0xc011acb4 <task_to_steal+404>: mov 0xffffffe8(%ebp),%eax 0xc011acb7 <task_to_steal+407>: add $0x30,%esp 0xc011acba <task_to_steal+410>: pop %ebx 0xc011acbb <task_to_steal+411>: pop %esi 0xc011acbc <task_to_steal+412>: pop %edi 0xc011acbd <task_to_steal+413>: pop %ebp 0xc011acbe <task_to_steal+414>: ret 0xc011acbf <task_to_steal+415>: mov 0xffffffd0(%ebp),%ecx 0xc011acc2 <task_to_steal+418>: bsf 0x10(%ecx),%eax 0xc011acc6 <task_to_steal+422>: sub $0xffffff80,%eax 0xc011acc9 <task_to_steal+425>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acce <task_to_steal+430>: bsf %eax,%eax 0xc011acd1 <task_to_steal+433>: add $0x40,%eax 0xc011acd4 <task_to_steal+436>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acd9 <task_to_steal+441>: bsf %eax,%eax 0xc011acdc <task_to_steal+444>: add $0x20,%eax 0xc011acdf <task_to_steal+447>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ace4 <task_to_steal+452>: bsf %eax,%eax 0xc011ace7 <task_to_steal+455>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acec <task_to_steal+460>: mov 0xfffffff0(%ebp),%eax 0xc011acef <task_to_steal+463>: xor %esi,%esi 0xc011acf1 <task_to_steal+465>: mov 0xfffffff0(%ebp),%ecx 0xc011acf4 <task_to_steal+468>: mov 0xffffffd0(%ebp),%ebx 0xc011acf7 <task_to_steal+471>: sar $0x5,%eax 0xc011acfa <task_to_steal+474>: and $0x1f,%ecx 0xc011acfd <task_to_steal+477>: lea (%ebx,%eax,4),%edi 0xc011ad00 <task_to_steal+480>: je 0xc011ad2b <task_to_steal+523> 0xc011ad02 <task_to_steal+482>: mov (%edi),%eax 0xc011ad04 <task_to_steal+484>: shr %cl,%eax 0xc011ad06 <task_to_steal+486>: bsf %eax,%esi 0xc011ad09 <task_to_steal+489>: jne 0xc011ad10 <task_to_steal+496> 0xc011ad0b <task_to_steal+491>: mov $0x20,%esi 0xc011ad10 <task_to_steal+496>: mov $0x20,%eax 0xc011ad15 <task_to_steal+501>: sub %ecx,%eax 0xc011ad17 <task_to_steal+503>: cmp %eax,%esi 0xc011ad19 <task_to_steal+505>: jge 0xc011ad26 <task_to_steal+518> 0xc011ad1b <task_to_steal+507>: mov 0xfffffff0(%ebp),%edx 0xc011ad1e <task_to_steal+510>: lea (%edx,%esi,1),%eax 
0xc011ad21 <task_to_steal+513>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ad26 <task_to_steal+518>: mov %eax,%esi 0xc011ad28 <task_to_steal+520>: add $0x4,%edi 0xc011ad2b <task_to_steal+523>: mov 0xffffffd0(%ebp),%ecx 0xc011ad2e <task_to_steal+526>: mov %edi,%eax 0xc011ad30 <task_to_steal+528>: mov $0x8c,%edx 0xc011ad35 <task_to_steal+533>: mov %edi,%ebx 0xc011ad37 <task_to_steal+535>: sub %ecx,%eax 0xc011ad39 <task_to_steal+537>: shl $0x3,%eax 0xc011ad3c <task_to_steal+540>: sub %eax,%edx 0xc011ad3e <task_to_steal+542>: add $0x1f,%edx 0xc011ad41 <task_to_steal+545>: shr $0x5,%edx 0xc011ad44 <task_to_steal+548>: mov %edx,0xffffffd4(%ebp) 0xc011ad47 <task_to_steal+551>: mov %edx,%ecx 0xc011ad49 <task_to_steal+553>: xor %eax,%eax 0xc011ad4b <task_to_steal+555>: repz scas %es:(%edi),%eax 0xc011ad4d <task_to_steal+557>: je 0xc011ad55 <task_to_steal+565> 0xc011ad4f <task_to_steal+559>: lea 0xfffffffc(%edi),%edi 0xc011ad52 <task_to_steal+562>: bsf (%edi),%eax 0xc011ad55 <task_to_steal+565>: sub %ebx,%edi 0xc011ad57 <task_to_steal+567>: shl $0x3,%edi 0xc011ad5a <task_to_steal+570>: add %edi,%eax 0xc011ad5c <task_to_steal+572>: mov %eax,%edx 0xc011ad5e <task_to_steal+574>: mov 0xfffffff0(%ebp),%eax 0xc011ad61 <task_to_steal+577>: add %esi,%eax 0xc011ad63 <task_to_steal+579>: add %edx,%eax 0xc011ad65 <task_to_steal+581>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ad6a <task_to_steal+586>: mov 0x8(%ebp),%ecx 0xc011ad6d <task_to_steal+589>: mov 0x1c(%ecx),%ecx 0xc011ad70 <task_to_steal+592>: jmp 0xc011acac <task_to_steal+396> 0xc011ad75 <task_to_steal+597>: nop 0xc011ad76 <task_to_steal+598>: lea 0x0(%esi),%esi 0xc011ad79 <task_to_steal+601>: lea 0x0(%edi,1),%edi End of assembler dump. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 23:26 ` Martin J. Bligh 2002-10-25 23:45 ` Martin J. Bligh @ 2002-10-26 0:02 ` Martin J. Bligh 1 sibling, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 0:02 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel

>> I thought this problem is well understood! For some reasons independent of
>> my patch you have to boot your machines with the "notsc" option. This
>> leaves the cache_decay_ticks variable initialized to zero which my patch
>> doesn't like. I'm trying to deal with this inside the patch but there is
>> still a small window when the variable is zero. In my opinion this needs
>> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
>> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
>> affinity which you absolutely need! So even if "notsc" is a legal option,
>> it should be fixed such that it doesn't leave your machine without cache
>> affinity. That would anyway give you a falsified behavior of the O(1)
>> scheduler.

> EIP is at task_to_steal+0x118/0x260

This turned out to be:

	weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;

So I guess that window is still biting you. I'll see if I can fix it properly.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
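[Editorial note: the crash pinned down above is a plain integer divide-by-zero in kernel mode. The snippet below is a minimal userspace model of the failing expression with a defensive clamp; the variable values, the task_weight() helper name, and the clamp itself are illustrative, not the actual scheduler fix that was merged.]

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's globals (values made up). */
static unsigned long jiffies = 1000;
static unsigned long cache_decay_ticks;  /* left at zero by the "notsc" window */

/* Model of the crashing calculation in task_to_steal(): dividing by
 * cache_decay_ticks raises a divide-error trap when it is still zero.
 * Clamping the divisor to 1 avoids the trap while preserving the
 * relative ordering of the weights. */
static unsigned long task_weight(unsigned long sleep_timestamp)
{
	unsigned long decay = cache_decay_ticks ? cache_decay_ticks : 1;
	return (jiffies - sleep_timestamp) / decay;
}
```

The real fix discussed in the thread is different: initialize cache_decay_ticks sensibly in arch/i386/kernel/smpboot.c even when "notsc" is used, so the window never exists.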
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 8:15 ` Erich Focht 2002-10-25 23:26 ` Martin J. Bligh @ 2002-10-26 18:58 ` Martin J. Bligh 2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh 2 siblings, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 18:58 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, landley; +Cc: linux-kernel

>> I still haven't been able to get your scheduler to boot for about
>> the last month without crashing the system. Andrew says he has it
>> booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and
>> see how it looks. If the numbers look good for doing boring things
>> like kernel compile, SDET, etc, I'm happy.
>
> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

Oh, not sure if I ever replied to this or not. I don't *have* to boot with notsc, I just usually do. And it crashed either way, so it's a different problem (changing versions of gcc seems to perturb it too). BUT ... your new patches 1 and 2 don't have this problem. See followup email in a second.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-25 8:15 ` Erich Focht 2002-10-25 23:26 ` Martin J. Bligh 2002-10-26 18:58 ` Martin J. Bligh @ 2002-10-26 19:14 ` Martin J. Bligh 2002-10-27 18:16 ` Martin J. Bligh 2 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 19:14 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, mingo, habanero; +Cc: linux-kernel, lse-tech

>> From my point of view, the reason for focussing on this was that
>> your scheduler degraded the performance on my machine, rather than
>> boosting it. Half of that was the more complex stuff you added on
>> top ... it's a lot easier to start with something simple that works
>> and build on it, than fix something that's complex and doesn't work
>> well.
>
> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

OK, I went to your latest patches (just 1 and 2). And they worked! You've fixed the performance degradation problems for kernel compile (now a 14% improvement in systime), that core set works without further futzing about or crashing, with or without TSC, on either version of gcc ... congrats! It also produces the fastest system time for kernel compile I've ever seen ... this core set seems to be good (I'm still less than convinced about the further patches, but we can work on those one at a time now you've got it all broken out and modular).

Michael posted slightly different looking results for virgin 44 yesterday - the main difference between virgin 44 and 44-mm4 for this stuff is probably the per-cpu hot & cold pages (Ingo, this is like your original per-cpu pages).

All results are for a 16-way NUMA-Q (P3 700MHz, 2Mb cache), 16Gb RAM.
Kernbench:
                        Elapsed       User     System        CPU
2.5.44-mm4              19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum        19.422s   189.828s    40.204s    1196.2%
2.5.44-mm4-focht12      19.316s   189.514s    36.704s    1146.8%

Schedbench 4:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              32.45      49.47     129.86       0.82
2.5.44-mm4-hbaum        31.31      43.85     125.29       0.84
2.5.44-mm4-focht12      38.50      45.34     154.05       1.07

Schedbench 8:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              39.90      61.48     319.26       2.79
2.5.44-mm4-hbaum        32.63      46.56     261.10       1.99
2.5.44-mm4-focht12      35.56      46.57     284.53       1.97

Schedbench 16:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              62.99      93.59    1008.01       5.11
2.5.44-mm4-hbaum        49.78      76.71     796.68       4.43
2.5.44-mm4-focht12      51.94      61.43     831.26       4.68

Schedbench 32:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              88.13     194.53    2820.54      11.52
2.5.44-mm4-hbaum        54.67     147.30    1749.77       7.91
2.5.44-mm4-focht12      55.43     119.49    1773.97       8.41

Schedbench 64:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4             159.92     653.79   10235.93      25.16
2.5.44-mm4-hbaum        65.20     300.58    4173.26      16.82
2.5.44-mm4-focht12      56.49     235.78    3615.71      18.05

There's a small degradation at the low end of schedbench (Erich's numa_test) in there ... would be nice to fix, but I'm less worried about that (where the machine is lightly loaded) than the other numbers. Kernbench is just gcc-2.95-4 compiling the 2.4.17 kernel doing a "make -j24 bzImage".

diffprofile 2.5.44-mm4 2.5.44-mm4-hbaum
(for kernbench, + got worse by adding the patch, - got better)

     184 vm_enough_memory
     154 d_lookup
      83 do_schedule
      75 page_add_rmap
      73 strnlen_user
      58 find_get_page
      52 flush_signal_handlers
...
     -61 pte_alloc_one
     -63 do_wp_page
     -85 .text.lock.file_table
     -96 __set_page_dirty_buffers
    -112 clear_page_tables
    -118 get_empty_filp
    -134 free_hot_cold_page
    -144 page_remove_rmap
    -150 __copy_to_user
    -213 zap_pte_range
    -217 buffered_rmqueue
    -875 __copy_from_user
   -1015 do_anonymous_page

diffprofile 2.5.44-mm4 2.5.44-mm4-focht12
(for kernbench, + got worse by adding the patch, - got better)

<nothing significantly degraded>
....
     -57 path_lookup
     -69 do_page_fault
     -73 vm_enough_memory
     -77 filemap_nopage
     -78 do_no_page
     -83 __set_page_dirty_buffers
     -83 __fput
     -84 do_schedule
     -97 find_get_page
    -106 file_move
    -115 free_hot_cold_page
    -115 clear_page_tables
    -130 d_lookup
    -147 atomic_dec_and_lock
    -157 page_add_rmap
    -197 buffered_rmqueue
    -236 zap_pte_range
    -264 get_empty_filp
    -271 __copy_to_user
    -464 page_remove_rmap
    -573 .text.lock.file_table
    -618 __copy_from_user
    -823 do_anonymous_page

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh @ 2002-10-27 18:16 ` Martin J. Bligh 2002-10-28 0:32 ` Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-27 18:16 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, mingo, habanero; +Cc: linux-kernel, lse-tech

> OK, I went to your latest patches (just 1 and 2). And they worked!
> You've fixed the performance degradation problems for kernel compile
> (now a 14% improvement in systime), that core set works without
> further futzing about or crashing, with or without TSC, on either
> version of gcc ... congrats!

So I have a slight correction to make to the above ;-) Your patches do work just fine, no crashes any more. HOWEVER ... turns out I only had the first patch installed, not both. Silly mistake, but turns out to be very interesting.

So your second patch is the balance on exec stuff ... I've looked at it, and think it's going to be very expensive to do in practice, at least the simplistic "recalc everything on every exec" approach. It does benefit the low end schedbench results, but not the high end ones, and you can see the cost of your second patch in the system times of the kernbench.

In summary, I think I like the first patch alone better than the combination, but will have a play at making a cross between the two.
As I have very little context about the scheduler, would appreciate any help anyone would like to volunteer ;-)

Corrected results are:

Kernbench:
                         Elapsed       User     System        CPU
2.5.44-mm4               19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum         19.422s   189.828s    40.204s    1196.2%
2.5.44-mm4-focht-1        19.46s   189.838s    37.938s      1171%
2.5.44-mm4-focht-12       20.32s       190s      44.4s    1153.6%

Schedbench 4:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 32.45      49.47     129.86       0.82
2.5.44-mm4-hbaum           31.31      43.85     125.29       0.84
2.5.44-mm4-focht-1         38.61      45.15     154.48       1.06
2.5.44-mm4-focht-12        23.23      38.87      92.99       0.85

Schedbench 8:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 39.90      61.48     319.26       2.79
2.5.44-mm4-hbaum           32.63      46.56     261.10       1.99
2.5.44-mm4-focht-1         37.76      61.09     302.17       2.55
2.5.44-mm4-focht-12        28.40      34.43     227.25       2.09

Schedbench 16:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 62.99      93.59    1008.01       5.11
2.5.44-mm4-hbaum           49.78      76.71     796.68       4.43
2.5.44-mm4-focht-1         51.69      60.23     827.20       4.95
2.5.44-mm4-focht-12        51.24      60.86     820.08       4.23

Schedbench 32:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 88.13     194.53    2820.54      11.52
2.5.44-mm4-hbaum           54.67     147.30    1749.77       7.91
2.5.44-mm4-focht-1         56.71     123.62    1815.12       7.92
2.5.44-mm4-focht-12        55.69     118.85    1782.25       7.28

Schedbench 64:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                159.92     653.79   10235.93      25.16
2.5.44-mm4-hbaum           65.20     300.58    4173.26      16.82
2.5.44-mm4-focht-1         55.60     232.36    3558.98      17.61
2.5.44-mm4-focht-12        56.03     234.45    3586.46      15.76

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-27 18:16 ` Martin J. Bligh @ 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh ` (3 more replies) 0 siblings, 4 replies; 33+ messages in thread From: Erich Focht @ 2002-10-28 0:32 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Sunday 27 October 2002 19:16, Martin J. Bligh wrote:
> > OK, I went to your latest patches (just 1 and 2). And they worked!
> > You've fixed the performance degradation problems for kernel compile
> > (now a 14% improvement in systime), that core set works without
> > further futzing about or crashing, with or without TSC, on either
> > version of gcc ... congrats!
>
> So I have a slight correction to make to the above ;-) Your patches
> do work just fine, no crashes any more. HOWEVER ... turns out I only
> had the first patch installed, not both. Silly mistake, but turns out
> to be very interesting.
>
> So your second patch is the balance on exec stuff ... I've looked at
> it, and think it's going to be very expensive to do in practice, at
> least the simplistic "recalc everything on every exec" approach. It
> does benefit the low end schedbench results, but not the high end ones,
> and you can see the cost of your second patch in the system times of
> the kernbench.

This is interesting, indeed. As you might have seen from the tests I posted on LKML I could not see that effect on our IA64 NUMA machine. Which raises the question: is it expensive to recalculate the load when doing an exec (which I should also see), or is the strategy of equally distributing the jobs across the nodes bad for certain load+architecture combinations? As I'm not seeing the effect, maybe you could do the following experiment: in sched_best_node() keep only the "while" loop at the beginning. This leads to a cheap selection of the next node, just a simple round robin.
Regarding the schedbench results: are they averages over multiple runs? The numa_test needs to be repeated a few times to get statistically meaningful results.

Thanks,
Erich

> In summary, I think I like the first patch alone better than the
> combination, but will have a play at making a cross between the two.
> As I have very little context about the scheduler, would appreciate
> any help anyone would like to volunteer ;-)
>
> Corrected results are:
>
> Kernbench:
>                          Elapsed       User     System        CPU
> 2.5.44-mm4               19.676s   192.794s    42.678s    1197.4%
> 2.5.44-mm4-hbaum         19.422s   189.828s    40.204s    1196.2%
> 2.5.44-mm4-focht-1        19.46s   189.838s    37.938s      1171%
> 2.5.44-mm4-focht-12       20.32s       190s      44.4s    1153.6%
>
> Schedbench 4:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 32.45      49.47     129.86       0.82
> 2.5.44-mm4-hbaum           31.31      43.85     125.29       0.84
> 2.5.44-mm4-focht-1         38.61      45.15     154.48       1.06
> 2.5.44-mm4-focht-12        23.23      38.87      92.99       0.85
>
> Schedbench 8:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 39.90      61.48     319.26       2.79
> 2.5.44-mm4-hbaum           32.63      46.56     261.10       1.99
> 2.5.44-mm4-focht-1         37.76      61.09     302.17       2.55
> 2.5.44-mm4-focht-12        28.40      34.43     227.25       2.09
>
> Schedbench 16:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 62.99      93.59    1008.01       5.11
> 2.5.44-mm4-hbaum           49.78      76.71     796.68       4.43
> 2.5.44-mm4-focht-1         51.69      60.23     827.20       4.95
> 2.5.44-mm4-focht-12        51.24      60.86     820.08       4.23
>
> Schedbench 32:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 88.13     194.53    2820.54      11.52
> 2.5.44-mm4-hbaum           54.67     147.30    1749.77       7.91
> 2.5.44-mm4-focht-1         56.71     123.62    1815.12       7.92
> 2.5.44-mm4-focht-12        55.69     118.85    1782.25       7.28
>
> Schedbench 64:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                159.92     653.79   10235.93      25.16
> 2.5.44-mm4-hbaum           65.20     300.58    4173.26      16.82
> 2.5.44-mm4-focht-1         55.60     232.36    3558.98      17.61
> 2.5.44-mm4-focht-12        56.03     234.45    3586.46      15.76

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht @ 2002-10-27 23:52 ` Martin J. Bligh 2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum 2002-10-28 0:31 ` Martin J. Bligh ` (2 subsequent siblings) 3 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-27 23:52 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations?

I suspect the former. Bouncing a whole pile of cachelines every time would be much more expensive for me than it would for you, and kernbench will be heavy on exec.

> As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning. This
> leads to a cheap selection of the next node, just a simple round robin.

Maybe I could just send you the profiles instead ;-) If I have more time, I'll try your suggestion. I'm trying Michael's balance_exec on top of your patch 1 at the moment, but I'm somewhat confused by his code for sched_best_cpu.
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu, cur_cpu, node;
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
+	if (++node >= numnodes)
+		node = 0;
+
+	cur_cpu = __node_to_first_cpu(node);
+	minload = cpu_rq(best_cpu)->nr_running;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(cur_cpu))
+			continue;
+
+		if (minload > cpu_rq(cur_cpu)->nr_running) {
+			minload = cpu_rq(cur_cpu)->nr_running;
+			best_cpu = cur_cpu;
+		}
+		if (++cur_cpu >= NR_CPUS)
+			cur_cpu = 0;
+	}
+	__get_cpu_var(last_exec_cpu) = best_cpu;
+	return best_cpu;
+}

Michael, the way I read the NR_CPUS loop, you walk every cpu in the system, and take the best from all of them. In which case what's the point of the last_exec_cpu stuff? On the other hand, I changed your NR_CPUS to 4 (ie just walk the cpus in that node), and it got worse. So perhaps I'm just misreading your code ... and it does seem significantly cheaper to execute than Erich's.

Erich, on the other hand, your code does this:

+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (pooldata_is_locked())
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	}
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}

which seems to me to walk every runqueue in the system (in sched_best_node), then walk one node's worth all over again in sched_best_cpu .... doesn't it? Again, I may be misreading this ... haven't looked at the scheduler much. But I can't help feeling some sort of lazy evaluation is in order ....

And what's this doing?

+	do {
+		/* atomic_inc_return is not implemented on all archs [EF] */
+		atomic_inc(&sched_node);
+		best_node = atomic_read(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));

I really don't think putting a global atomic in there is going to be cheap ....
> Regarding the schedbench results: are they averages over multiple runs?
> The numa_test needs to be repeated a few times to get statistically
> meaningful results.

No. But I don't have 2 hours to run each set of tests either. I did a couple of runs, and didn't see huge variances. Seems stable enough.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-27 23:52 ` Martin J. Bligh @ 2002-10-28 0:55 ` Michael Hohnbaum 2002-10-28 4:23 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Michael Hohnbaum @ 2002-10-28 0:55 UTC (permalink / raw) To: Martin J. Bligh Cc: Erich Focht, mingo, Andrew Theurer, linux-kernel, lse-tech

> I'm trying Michael's balance_exec on top of your patch 1 at the
> moment, but I'm somewhat confused by his code for sched_best_cpu.
>
> +static int sched_best_cpu(struct task_struct *p)
> +{
> +	int i, minload, best_cpu, cur_cpu, node;
> +	best_cpu = task_cpu(p);
> +	if (cpu_rq(best_cpu)->nr_running <= 2)
> +		return best_cpu;
> +
> +	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
> +	if (++node >= numnodes)
> +		node = 0;
> +
> +	cur_cpu = __node_to_first_cpu(node);
> +	minload = cpu_rq(best_cpu)->nr_running;
> +
> +	for (i = 0; i < NR_CPUS; i++) {
> +		if (!cpu_online(cur_cpu))
> +			continue;
> +
> +		if (minload > cpu_rq(cur_cpu)->nr_running) {
> +			minload = cpu_rq(cur_cpu)->nr_running;
> +			best_cpu = cur_cpu;
> +		}
> +		if (++cur_cpu >= NR_CPUS)
> +			cur_cpu = 0;
> +	}
> +	__get_cpu_var(last_exec_cpu) = best_cpu;
> +	return best_cpu;
> +}
>
> Michael, the way I read the NR_CPUS loop, you walk every cpu
> in the system, and take the best from all of them. In which case
> what's the point of the last_exec_cpu stuff? On the other hand,
> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
> and it got worse. So perhaps I'm just misreading your code ...
> and it does seem significantly cheaper to execute than Erich's.

You are reading it correctly. The only thing that the last_exec_cpu does is to help spread the load across nodes. Without it, what was happening is that node 0 would get completely loaded, then node 1, etc. With it, in cases where one or more runqueues have the same length, the one chosen tends to get spread out a bit. Not the greatest solution, but it helps.
-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486

^ permalink raw reply	[flat|nested] 33+ messages in thread
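[Editorial note: the tie-breaking Michael describes above can be modelled in a few lines. The sketch below is a simplified userspace model: start the minimum-load scan just past the last chosen cpu, so equal-length runqueues are picked round-robin instead of always favouring cpu 0. The real patch starts at the first cpu of the node after last_exec_cpu's node and reads the kernel's runqueues; pick_cpu() and the arrays here are invented for illustration.]

```c
#include <assert.h>

#define NCPUS 8

/* Illustrative per-cpu runqueue lengths; not kernel data structures. */
static int nr_running[NCPUS];
static int last_exec_cpu;

/* Scan all cpus for the shortest runqueue, but begin just past the
 * previously chosen cpu. When several runqueues tie for the minimum,
 * the first one encountered wins, so the rotating start point spreads
 * ties across cpus instead of piling everything onto cpu 0. */
static int pick_cpu(void)
{
	int i, cur = (last_exec_cpu + 1) % NCPUS;
	int best = cur, minload = nr_running[cur];

	for (i = 0; i < NCPUS; i++) {
		if (nr_running[cur] < minload) {
			minload = nr_running[cur];
			best = cur;
		}
		cur = (cur + 1) % NCPUS;
	}
	last_exec_cpu = best;
	return best;
}
```

With all runqueues empty, successive calls return 1, 2, 3, ... rather than 0 every time, which is exactly the spreading effect described above.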
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum @ 2002-10-28 4:23 ` Martin J. Bligh 0 siblings, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 4:23 UTC (permalink / raw) To: Michael Hohnbaum Cc: Erich Focht, mingo, Andrew Theurer, linux-kernel, lse-tech

>> Michael, the way I read the NR_CPUS loop, you walk every cpu
>> in the system, and take the best from all of them. In which case
>> what's the point of the last_exec_cpu stuff? On the other hand,
>> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
>> and it got worse. So perhaps I'm just misreading your code ...
>> and it does seem significantly cheaper to execute than Erich's.
>
> You are reading it correct. The only thing that the last_exec_cpu
> does is to help spread the load across nodes. Without that what was
> happening is that node 0 would get completely loaded, then node 1,
> etc. With it, in cases where one or more runqueues have the same
> length, the one chosen tends to get spread out a bit. Not the
> greatest solution, but it helps.

OK. I made a simple boring optimisation to your patch. Shaved almost a second off system time for kernbench, and seems idiotproof to me, shouldn't change anything apart from touching fewer runqueues: if we find a runqueue with nr_running == 0, stop searching ... we ain't going to find anything better ;-)

Kernbench:
                                Elapsed       User     System        CPU
2.5.44-mm4                      19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum-1              19.746s   189.232s    38.354s    1152.2%
2.5.44-mm4-hbaum-12             19.322s   190.176s    40.354s    1192.6%
2.5.44-mm4-hbaum-12-firstzero   19.292s    189.66s    39.428s    1187.4%

Patch is probably space-eaten, so just whack it in by hand.
--- 2.5.44-mm4-hbaum-12/kernel/sched.c	2002-10-27 19:54:25.000000000 -0800
+++ 2.5.44-mm4-hbaum-12-first_low/kernel/sched.c	2002-10-27 16:42:10.000000000 -0800
@@ -2206,6 +2206,8 @@
 		if (minload > cpu_rq(cur_cpu)->nr_running) {
 			minload = cpu_rq(cur_cpu)->nr_running;
 			best_cpu = cur_cpu;
+			if (minload == 0)
+				break;
 		}
 		if (++cur_cpu >= NR_CPUS)
 			cur_cpu = 0;

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh @ 2002-10-28 0:31 ` Martin J. Bligh 2002-10-28 16:34 ` Erich Focht 2002-10-28 0:46 ` Martin J. Bligh 2002-10-28 7:16 ` Martin J. Bligh 3 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 0:31 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

OK, so I'm trying to read your patch 1, fairly unsuccessfully (seems to be a lot more complex than Michael's).

Can you explain pool_lock? It does actually seem to work, but it's rather confusing ....

build_pools() has a comment above it saying:

+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */

But then you promptly call pooldata_lock inside build_pools anyway ... looks like it's just a naff comment, but doesn't help much.

Leaving aside the acknowledged mind-boggling ugliness of pooldata_lock(), what exactly is this lock protecting, and when? The only thing that actually calls pooldata_lock is build_pools, right? And the only other thing that looks at it is sched_balance_exec via pooldata_is_locked ... can that happen before build_pools (seems like you're in deep trouble if it does anyway, as it'll just block). If you really still need to do this, RCU is now in the kernel ;-) If not, can we just chuck all that stuff?

M.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:31 ` Martin J. Bligh @ 2002-10-28 16:34 ` Erich Focht 2002-10-28 16:57 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 16:34 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 01:31, Martin J. Bligh wrote:
> OK, so I'm trying to read your patch 1, fairly unsuccessfully
> (seems to be a lot more complex than Michael's).
>
> Can you explain pool_lock? It does actually seem to work, but
> it's rather confusing ....

The pool data is needed to be able to loop over only the CPUs of one node. I'm convinced we'll need to do that sometime, no matter how simple the core of the NUMA scheduler is. The pool_lock is protecting that data while it is built. This could happen more often in the future, if somebody starts hotplugging CPUs.

> build_pools() has a comment above it saying:
>
> +/*
> + * Call pooldata_lock() before calling this function and
> + * pooldata_unlock() after!
> + */
>
> But then you promptly call pooldata_lock inside build_pools
> anyway ... looks like it's just a naff comment, but doesn't
> help much.

Sorry, the comment came from a former version...

> just block). If you really still need to do this, RCU is now
> in the kernel ;-) If not, can we just chuck all that stuff?

I'm preparing a core patch which doesn't need the pool_lock. I'll send it out today.

Regards, Erich

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 16:34 ` Erich Focht @ 2002-10-28 16:57 ` Martin J. Bligh 2002-10-28 17:26 ` Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 16:57 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech > The pool data is needed to be able to loop over the CPUs of one node, > only. I'm convinced we'll need to do that sometime, no matter how simple > the core of the NUMA scheduler is. Hmmm ... is using node_to_cpumask from the topology stuff, then looping over that bitmask insufficient? > The pool_lock is protecting that data while it is built. This can happen > in future more often, if somebody starts hotplugging CPUs. Heh .... when someone actually does that, we'll have a lot more problems than just this to solve. Would be nice to keep this stuff simple for now, if possible. > Sorry, the comment came from a former version... No problem, I suspected that was all it was. >> just block). If you really still need to do this, RCU is now >> in the kernel ;-) If not, can we just chuck all that stuff? > > I'm preparing a core patch which doesn't need the pool_lock. I'll send it > out today. Cool! Thanks, M. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 16:57 ` Martin J. Bligh @ 2002-10-28 17:26 ` Erich Focht 2002-10-28 17:35 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 17:26 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech [-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On Monday 28 October 2002 17:57, Martin J. Bligh wrote:
> > I'm preparing a core patch which doesn't need the pool_lock. I'll send it
> > out today.
>
> Cool! Thanks,

OK, here it comes. The core doesn't use the loop_over_node() macro any more. There's one big loop over the CPUs for computing node loads and the most loaded CPUs in find_busiest_queue. The call to build_pools() isn't critical any more. Functionality is the same as in the previous patch (i.e. steal delays, ranking of task_to_steal, etc...). I kept the loop_over_node() macro for compatibility reasons with the additional patches. You might need to replace in the additional patches:

	numpools       -> numpools()
	pool_nr_cpus[] -> pool_ncpus()

I'm puzzled about the initial load balancing impact and have to think about the results I've seen from you so far... In the environments I am used to, the frequency of exec syscalls is rather low, therefore I didn't care too much about the sched_balance_exec performance and preferred to try harder to achieve good distribution across the nodes.
Regards, Erich [-- Attachment #2: 01-numa_sched_core-2.5.39-12b.patch --] [-- Type: text/x-diff, Size: 16562 bytes --] diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Mon Oct 28 10:15:28 2002 @@ -1194,6 +1194,9 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA + build_pools(); +#endif } void __init smp_intr_init() diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c --- a/arch/ia64/kernel/smpboot.c Tue Oct 22 15:46:38 2002 +++ b/arch/ia64/kernel/smpboot.c Mon Oct 28 10:15:28 2002 @@ -397,7 +397,7 @@ unsigned long cache_decay_ticks; /* # of static void smp_tune_scheduling (void) { - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ printk("task migration cache decay timeout: %ld msecs.\n", (cache_decay_ticks + 1) * 1000 / HZ); @@ -508,6 +508,9 @@ smp_cpus_done (unsigned int dummy) printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n", num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100); +#ifdef CONFIG_NUMA + build_pools(); +#endif } int __devinit diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 15:03:54 2002 +++ b/include/linux/sched.h Mon Oct 28 12:12:22 2002 @@ -22,6 +22,7 @@ extern unsigned long event; #include <asm/mmu.h> #include <linux/smp.h> +#include <asm/topology.h> #include <linux/sem.h> #include <linux/signal.h> #include <linux/securebits.h> @@ -167,7 +168,6 @@ extern void update_one_process(struct ta extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; - #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); @@ -457,6 
+457,9 @@ extern void set_cpus_allowed(task_t *p, # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA +extern void build_pools(void); +#endif extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Fri Sep 27 23:50:27 2002 +++ b/kernel/sched.c Mon Oct 28 16:59:23 2002 @@ -154,6 +154,9 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + unsigned long wait_time; + int wait_node; + } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -173,6 +176,62 @@ static struct runqueue runqueues[NR_CPUS # define task_running(rq, p) ((rq)->curr == (p)) #endif +#define cpu_to_node(cpu) __cpu_to_node(cpu) + +#ifdef CONFIG_NUMA +/* Number of CPUs per pool: sane values until all CPUs are up */ +int _pool_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS }; +int pool_cpus[NR_CPUS]; /* list of cpus sorted by node number */ +int pool_ptr[MAX_NUMNODES+1]; /* pointer into the sorted list */ +unsigned long pool_mask[MAX_NUMNODES]; +#define numpools() numnodes +#define pool_ncpus(pool) _pool_nr_cpus[pool] + +#define POOL_DELAY_IDLE (1*HZ/1000) +#define POOL_DELAY_BUSY (20*HZ/1000) + +#define loop_over_node(i,cpu,n) \ + for(i=pool_ptr[n], cpu=pool_cpus[i]; i<pool_ptr[n+1]; \ + i++, cpu=pool_cpus[i]) + + +/* + * Build pool data after all CPUs have come up. 
+ */ +void build_pools(void) +{ + int n, cpu, ptr; + unsigned long mask; + + ptr=0; + for (n=0; n<numnodes; n++) { + mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map; + pool_ptr[n] = ptr; + for (cpu=0; cpu<NR_CPUS; cpu++) + if (mask & (1UL << cpu)) + pool_cpus[ptr++] = cpu; + pool_ncpus(n) = ptr - pool_ptr[n];; + } + printk("CPU pools : %d\n",numpools()); + for (n=0;n<numpools();n++) + printk("pool %d : %lx\n",n,pool_mask[n]); + if (cache_decay_ticks==1) + printk("WARNING: cache_decay_ticks=1, probably unset by platform. Running with poor CPU affinity!\n"); +#ifdef CONFIG_X86_NUMAQ + /* temporarilly set this to a reasonable value for NUMAQ */ + cache_decay_ticks=8; +#endif +} + +#else +#define numpools() 1 +#define pool_ncpus(pool) num_online_cpus() +#define POOL_DELAY_IDLE 0 +#define POOL_DELAY_BUSY 0 +#define loop_over_node(i,cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++) +#endif + + /* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without @@ -632,121 +691,146 @@ static inline unsigned int double_lock_b } /* - * find_busiest_queue - find the busiest runqueue. - */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) -{ - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their node. + * + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. + * + * This routine implements node balancing by delaying steals from remote + * nodes more if the own node is (within margins) averagely loaded. The + * most loaded node is remembered as well as the time (jiffies). 
In the + * following calls to the load_balancer the time is compared with + * POOL_DELAY_BUSY (if load is around the average) or POOL_DELAY_IDLE (if own + * node is unloaded) if the most loaded node didn't change. This gives less + * loaded nodes the chance to approach the average load but doesn't exclude + * busy nodes from stealing (just in case the cpus_allowed mask isn't good + * for the idle nodes). + * This concept can be extended easilly to more than two levels (multi-level + * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance + * dependent steal delays. + * + * <efocht@ess.nec.de> + */ +static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running) +{ + runqueue_t *busiest = NULL, *this_rq = cpu_rq(this_cpu), *src_rq; + int best_cpu, this_pool, max_pool_load, pool_idx; + int pool_load[MAX_NUMNODES], cpu_load[MAX_NUMNODES]; + int cpu_idx[MAX_NUMNODES]; + int cpu, pool, load, avg_load, i, steal_delay; + + /* Need at least ~25% imbalance to trigger balancing. */ +#define CPUS_BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) - /* - * We search all runqueues to find the most busy one. - * We do this lockless to reduce cache-bouncing overhead, - * we re-check the 'best' source CPU later on again, with - * the lock held. - * - * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. - * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. - * - * The 'current runqueue length' is a statistical maximum variable, - * for that one we take the longer one - to avoid fluctuations in - * the other direction. So for a load-balance to happen it needs - * stable long runqueue on the target CPU and stable short runqueue - * on the local runqueue. 
- * - * We make an exception if this CPU is about to become idle - in - * that case we are less picky about moving a task across CPUs and - * take what can be taken. - */ if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) - nr_running = this_rq->nr_running; + *nr_running = this_rq->nr_running; else - nr_running = this_rq->prev_nr_running[this_cpu]; - - busiest = NULL; - max_load = 1; - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) - continue; + *nr_running = this_rq->prev_nr_running[this_cpu]; - rq_src = cpu_rq(i); - if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) - load = rq_src->nr_running; + /* compute all pool loads and save their max cpu loads */ + for (pool=0; pool<MAX_NUMNODES; pool++) + cpu_load[pool] = -1; + + for (cpu=0; cpu<NR_CPUS; cpu++) { + if (!cpu_online(cpu)) continue; + pool = cpu_to_node(cpu); + src_rq = cpu_rq(cpu); + if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu])) + load = src_rq->nr_running; else - load = this_rq->prev_nr_running[i]; - this_rq->prev_nr_running[i] = rq_src->nr_running; + load = this_rq->prev_nr_running[cpu]; + this_rq->prev_nr_running[cpu] = src_rq->nr_running; - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + pool_load[pool] += load; + if (load > cpu_load[pool]) { + cpu_load[pool] = load; + cpu_idx[pool] = cpu; } } - if (likely(!busiest)) - goto out; + this_pool = cpu_to_node(this_cpu); + best_cpu = cpu_idx[this_pool]; + if (best_cpu != this_cpu) + if (!CPUS_BALANCED(cpu_load[this_pool],*nr_running)) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; + goto out; + } +#ifdef CONFIG_NUMA - *imbalance = (max_load - nr_running) / 2; +#define POOLS_BALANCED(comp,this) (((comp) -(this)) < 50) + avg_load = pool_load[this_pool]; + pool_load[this_pool] = max_pool_load = + pool_load[this_pool]*100/pool_ncpus(this_pool); + pool_idx = this_pool; + for (i = 1; i < numpools(); i++) { + pool = (i + this_pool) % numpools(); + 
avg_load += pool_load[pool]; + pool_load[pool]=pool_load[pool]*100/pool_ncpus(pool); + if (pool_load[pool] > max_pool_load) { + max_pool_load = pool_load[pool]; + pool_idx = pool; + } + } - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; + best_cpu = (pool_idx==this_pool) ? -1 : cpu_idx[pool_idx]; + /* Exit if not enough imbalance on any remote node. */ + if ((best_cpu < 0) || (max_pool_load <= 100) || + POOLS_BALANCED(max_pool_load,pool_load[this_pool])) { + this_rq->wait_node = -1; goto out; } - - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; + avg_load = avg_load*100/num_online_cpus(); + /* Wait longer before stealing if own pool's load is average. */ + if (POOLS_BALANCED(avg_load,pool_load[this_pool])) + steal_delay = POOL_DELAY_BUSY; + else + steal_delay = POOL_DELAY_IDLE; + /* if we have a new most loaded node: just mark it */ + if (this_rq->wait_node != pool_idx) { + this_rq->wait_node = pool_idx; + this_rq->wait_time = jiffies; + goto out; + } else + /* old most loaded node: check if waited enough */ + if (jiffies - this_rq->wait_time < steal_delay) + goto out; + + if ((best_cpu >= 0) && + (!CPUS_BALANCED(cpu_load[pool_idx],*nr_running))) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; } -out: +#endif + out: return busiest; } /* - * pull_task - move a task from a remote runqueue to the local runqueue. - * Both runqueues must be locked. + * Find a task to steal from the busiest RQ. The busiest->lock must be held + * while calling this routine. 
*/ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu) { - dequeue_task(p, src_array); - src_rq->nr_running--; - set_task_cpu(p, this_cpu); - this_rq->nr_running++; - enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); -} - -/* - * Current runqueue is empty, or rebalance tick: if there is an - * inbalance (current runqueue is too short) then pull from - * busiest runqueue(s). - * - * We call this with the current runqueue locked, - * irqs disabled. - */ -static void load_balance(runqueue_t *this_rq, int idle) -{ - int imbalance, idx, this_cpu = smp_processor_id(); - runqueue_t *busiest; + int idx; + task_t *next = NULL, *tmp; prio_array_t *array; struct list_head *head, *curr; - task_t *tmp; + int weight, maxweight=0; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); - if (!busiest) - goto out; + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed. + */ + +#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ + ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + p != rq->curr && \ + ((p)->cpus_allowed & (1UL<<(this_cpu)))) /* * We first consider expired tasks. 
Those will likely not be @@ -772,7 +856,7 @@ skip_bitmap: array = busiest->active; goto new_array; } - goto out_unlock; + goto out; } head = array->queue + idx; @@ -780,33 +864,72 @@ skip_bitmap: skip_queue: tmp = list_entry(curr, task_t, run_list); + if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { + weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks; + if (weight > maxweight) { + maxweight = weight; + next = tmp; + } + } + curr = curr->next; + if (curr != head) + goto skip_queue; + idx++; + goto skip_bitmap; + + out: + return next; +} + +/* + * pull_task - move a task from a remote runqueue to the local runqueue. + * Both runqueues must be locked. + */ +static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +{ + dequeue_task(p, src_array); + src_rq->nr_running--; + set_task_cpu(p, this_cpu); + this_rq->nr_running++; + enqueue_task(p, this_rq->active); /* - * We do not migrate tasks that are: - * 1) running (obviously), or - * 2) cannot be migrated to this CPU due to cpus_allowed, or - * 3) are cache-hot on their current CPU. + * Note that idle threads have a prio of MAX_PRIO, for this test + * to be always true for them. */ + if (p->prio < this_rq->curr->prio) + set_need_resched(); +} -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) - - curr = curr->prev; - - if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } +/* + * Current runqueue is empty, or rebalance tick: if there is an + * inbalance (current runqueue is too short) then pull from + * busiest runqueue(s). + * + * We call this with the current runqueue locked, + * irqs disabled. 
+ */ +static void load_balance(runqueue_t *this_rq, int idle) +{ + int nr_running, this_cpu = task_cpu(this_rq->curr); + task_t *tmp; + runqueue_t *busiest; + + busiest = find_busiest_queue(this_cpu, idle, &nr_running); + if (!busiest) + goto out; + + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running + 1) + goto out_unlock; + + tmp = task_to_steal(busiest, this_cpu); + if (!tmp) + goto out_unlock; + pull_task(busiest, tmp->array, tmp, this_rq, this_cpu); out_unlock: spin_unlock(&busiest->lock); out: @@ -819,10 +942,10 @@ out: * frequency and balancing agressivity depends on whether the CPU is * idle or not. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on * systems with HZ=100, every 10 msecs.) */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) static inline void idle_tick(runqueue_t *rq) @@ -2027,7 +2150,8 @@ static int migration_thread(void * data) spin_unlock_irqrestore(&rq->lock, flags); p = req->task; - cpu_dest = __ffs(p->cpus_allowed); + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map); + rq_dest = cpu_rq(cpu_dest); repeat: cpu_src = task_cpu(p); @@ -2130,6 +2254,8 @@ void __init sched_init(void) __set_bit(MAX_PRIO, array->bitmap); } } + if (cache_decay_ticks) + cache_decay_ticks=1; /* * We have to do a little magic to get the first * thread right in SMP mode. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 17:26 ` Erich Focht @ 2002-10-28 17:35 ` Martin J. Bligh 2002-10-29 0:07 ` [Lse-tech] " Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 17:35 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> I'm puzzled about the initial load balancing impact and have to think
> about the results I've seen from you so far... In the environments I am
> used to, the frequency of exec syscalls is rather low, therefore I didn't
> care too much about the sched_balance_exec performance and preferred to
> try harder to achieve good distribution across the nodes.

OK, but take a look at Michael's second patch. It still looks at nr_running on every queue in the system (with some slightly strange code to make a rotating choice of nodes in the case of equality), so should still be able to make the best decision .... *but* it seems to be much cheaper to execute. Not sure why at this point, given the last results I sent you last night ;-)

M.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 17:35 ` Martin J. Bligh @ 2002-10-29 0:07 ` Erich Focht 0 siblings, 0 replies; 33+ messages in thread From: Erich Focht @ 2002-10-29 0:07 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:35, Martin J. Bligh wrote:
> > I'm puzzled about the initial load balancing impact and have to think
> > about the results I've seen from you so far... In the environments I am
> > used to, the frequency of exec syscalls is rather low, therefore I didn't
> > care too much about the sched_balance_exec performance and preferred to
> > try harder to achieve good distribution across the nodes.
>
> OK, but take a look at Michael's second patch. It still looks at
> nr_running on every queue in the system (with some slightly strange
> code to make a rotating choice of nodes in the case of equality),
> so should still be able to make the best decision .... *but* it
> seems to be much cheaper to execute. Not sure why at this point,
> given the last results I sent you last night ;-)

Yes, I like it! I needed some time to understand that the per_cpu variables can spread the exec'd tasks across the nodes as well as the atomic sched_node. Sure, I'd like to select the least loaded node instead of the least loaded CPU. It may well be that you have just created 10 threads on a node (by fork, so they are still on their original CPU) and have an idle CPU in the same node (which hasn't yet stolen the newly created tasks). Suppose your instant load looks like this:

node 0: cpu0: 1,  cpu1: 1, cpu2: 1, cpu3: 1
node 1: cpu4: 10, cpu5: 0, cpu6: 1, cpu7: 1

If you exec on cpu0 before cpu5 managed to steal something from cpu4, you'll aim for cpu5. This would just increase the node imbalance and force more of the threads on cpu4 to move to node 0, which is maybe bad for them. Just an example...
If you start considering non-trivial cpus_allowed masks, you might get more of these cases. We could take this as a design target for the initial load balancer and keep the fastest version we currently have for the benchmarks we currently use (Michael's). Regards, Erich ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh 2002-10-28 0:31 ` Martin J. Bligh @ 2002-10-28 0:46 ` Martin J. Bligh 2002-10-28 17:11 ` Erich Focht 2002-10-28 17:38 ` Erich Focht 2002-10-28 7:16 ` Martin J. Bligh 3 siblings, 2 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 0:46 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

OK, so I tried Michael's without the balance_exec code as well, then Erich's main patch with Michael's balance_exec (which seems to be cheaper to calculate). Turns out I was actually running an older version of Michael's patch .... with his latest stuff it actually seems to perform better pretty much across the board (comparing 2.5.44-mm4-focht-12 and 2.5.44-mm4-hbaum-12). And it's also a lot simpler.

Erich, what does all the pool stuff actually buy us over what Michael is doing? Seems to be rather more complex, but maybe it's useful for something we're just not measuring here?
2.5.44-mm4            Virgin
2.5.44-mm4-focht-1    Focht main
2.5.44-mm4-hbaum-1    Hbaum main
2.5.44-mm4-focht-12   Focht main + Focht balance_exec
2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec

Kernbench:
                      Elapsed    User       System    CPU
2.5.44-mm4            19.676s    192.794s   42.678s   1197.4%
2.5.44-mm4-focht-1    19.46s     189.838s   37.938s   1171%
2.5.44-mm4-hbaum-1    19.746s    189.232s   38.354s   1152.2%
2.5.44-mm4-focht-12   20.32s     190s       44.4s     1153.6%
2.5.44-mm4-hbaum-12   19.322s    190.176s   40.354s   1192.6%
2.5.44-mm4-f1-h2      19.398s    190.118s   40.06s    1186%

Schedbench 4:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            32.45      49.47      129.86    0.82
2.5.44-mm4-focht-1    38.61      45.15      154.48    1.06
2.5.44-mm4-hbaum-1    37.81      46.44      151.26    0.78
2.5.44-mm4-focht-12   23.23      38.87      92.99     0.85
2.5.44-mm4-hbaum-12   22.26      34.70      89.09     0.70
2.5.44-mm4-f1-h2      21.39      35.97      85.57     0.81

Schedbench 8:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            39.90      61.48      319.26    2.79
2.5.44-mm4-focht-1    37.76      61.09      302.17    2.55
2.5.44-mm4-hbaum-1    43.18      56.74      345.54    1.71
2.5.44-mm4-focht-12   28.40      34.43      227.25    2.09
2.5.44-mm4-hbaum-12   30.71      45.87      245.75    1.43
2.5.44-mm4-f1-h2      36.11      45.18      288.98    2.10

Schedbench 16:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            62.99      93.59      1008.01   5.11
2.5.44-mm4-focht-1    51.69      60.23      827.20    4.95
2.5.44-mm4-hbaum-1    52.57      61.54      841.38    3.93
2.5.44-mm4-focht-12   51.24      60.86      820.08    4.23
2.5.44-mm4-hbaum-12   52.33      62.23      837.46    3.84
2.5.44-mm4-f1-h2      51.76      60.15      828.33    5.67

Schedbench 32:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            88.13      194.53     2820.54   11.52
2.5.44-mm4-focht-1    56.71      123.62     1815.12   7.92
2.5.44-mm4-hbaum-1    54.57      153.56     1746.45   9.20
2.5.44-mm4-focht-12   55.69      118.85     1782.25   7.28
2.5.44-mm4-hbaum-12   54.36      135.30     1739.95   8.09
2.5.44-mm4-f1-h2      55.97      119.28     1791.39   7.20

Schedbench 64:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            159.92     653.79     10235.93  25.16
2.5.44-mm4-focht-1    55.60      232.36     3558.98   17.61
2.5.44-mm4-hbaum-1    71.48      361.77     4575.45   18.53
2.5.44-mm4-focht-12   56.03      234.45     3586.46   15.76
2.5.44-mm4-hbaum-12   56.91      240.89     3642.99   15.67
2.5.44-mm4-f1-h2      56.48      246.93     3615.32   16.97

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:46 ` Martin J. Bligh @ 2002-10-28 17:11 ` Erich Focht 2002-10-28 18:32 ` Martin J. Bligh 2002-10-28 17:38 ` Erich Focht 1 sibling, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 17:11 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech [-- Attachment #1: Type: text/plain, Size: 5026 bytes --] On Monday 28 October 2002 01:46, Martin J. Bligh wrote: > Erich, what does all the pool stuff actually buy us over what > Michael is doing? Seems to be rather more complex, but maybe > it's useful for something we're just not measuring here? The more complicated stuff is for achieving equal load between the nodes. It delays steals more when the stealing node is averagely loaded, less when it is unloaded. This is the place where we can make it cope with more complex machines with multiple levels of memory hierarchy (like our 32 CPU TX7). Equal load among the nodes is important if you have memory bandwidth eaters, as the bandwidth in a node is limited. When introducing node affinity (which shows good results for me!) you also need a more careful ranking of the tasks which are candidates to be stolen. The routine task_to_steal does this and is another source of complexity. It is another point where the multilevel stuff comes in. In the core part of the patch the rank of the steal candidates is computed by only taking into account the time which a task has slept. I attach the script for getting some statistics on the numa_test. I consider this test more sensitive to NUMA effects, as it is a bandwidth eater also needing good latency. (BTW, Martin: in the numa_test script I've sent you the PROBLEMSIZE must be set to 1000000!). 
Regards, Erich

>
> 2.5.44-mm4            Virgin
> 2.5.44-mm4-focht-1    Focht main
> 2.5.44-mm4-hbaum-1    Hbaum main
> 2.5.44-mm4-focht-12   Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec
>
> Kernbench:
>                       Elapsed    User       System    CPU
> 2.5.44-mm4            19.676s    192.794s   42.678s   1197.4%
> 2.5.44-mm4-focht-1    19.46s     189.838s   37.938s   1171%
> 2.5.44-mm4-hbaum-1    19.746s    189.232s   38.354s   1152.2%
> 2.5.44-mm4-focht-12   20.32s     190s       44.4s     1153.6%
> 2.5.44-mm4-hbaum-12   19.322s    190.176s   40.354s   1192.6%
> 2.5.44-mm4-f1-h2      19.398s    190.118s   40.06s    1186%
>
> Schedbench 4:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            32.45      49.47      129.86    0.82
> 2.5.44-mm4-focht-1    38.61      45.15      154.48    1.06
> 2.5.44-mm4-hbaum-1    37.81      46.44      151.26    0.78
> 2.5.44-mm4-focht-12   23.23      38.87      92.99     0.85
> 2.5.44-mm4-hbaum-12   22.26      34.70      89.09     0.70
> 2.5.44-mm4-f1-h2      21.39      35.97      85.57     0.81
>
> Schedbench 8:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            39.90      61.48      319.26    2.79
> 2.5.44-mm4-focht-1    37.76      61.09      302.17    2.55
> 2.5.44-mm4-hbaum-1    43.18      56.74      345.54    1.71
> 2.5.44-mm4-focht-12   28.40      34.43      227.25    2.09
> 2.5.44-mm4-hbaum-12   30.71      45.87      245.75    1.43
> 2.5.44-mm4-f1-h2      36.11      45.18      288.98    2.10
>
> Schedbench 16:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            62.99      93.59      1008.01   5.11
> 2.5.44-mm4-focht-1    51.69      60.23      827.20    4.95
> 2.5.44-mm4-hbaum-1    52.57      61.54      841.38    3.93
> 2.5.44-mm4-focht-12   51.24      60.86      820.08    4.23
> 2.5.44-mm4-hbaum-12   52.33      62.23      837.46    3.84
> 2.5.44-mm4-f1-h2      51.76      60.15      828.33    5.67
>
> Schedbench 32:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            88.13      194.53     2820.54   11.52
> 2.5.44-mm4-focht-1    56.71      123.62     1815.12   7.92
> 2.5.44-mm4-hbaum-1    54.57      153.56     1746.45   9.20
> 2.5.44-mm4-focht-12   55.69      118.85     1782.25   7.28
> 2.5.44-mm4-hbaum-12   54.36      135.30     1739.95   8.09
> 2.5.44-mm4-f1-h2      55.97      119.28     1791.39   7.20
>
> Schedbench 64:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            159.92     653.79     10235.93  25.16
> 2.5.44-mm4-focht-1    55.60      232.36     3558.98   17.61
> 2.5.44-mm4-hbaum-1    71.48      361.77     4575.45   18.53
> 2.5.44-mm4-focht-12   56.03      234.45     3586.46   15.76
> 2.5.44-mm4-hbaum-12   56.91      240.89     3642.99   15.67
> 2.5.44-mm4-f1-h2      56.48      246.93     3615.32   16.97

[-- Attachment #2: numabench --]
[-- Type: application/x-shellscript, Size: 874 bytes --]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:11       ` Erich Focht
@ 2002-10-28 18:32         ` Martin J. Bligh
  0 siblings, 0 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28 18:32 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> Erich, what does all the pool stuff actually buy us over what
>> Michael is doing? Seems to be rather more complex, but maybe
>> it's useful for something we're just not measuring here?
>
> The more complicated stuff is for achieving equal load between the
> nodes. It delays steals more when the stealing node is close to the
> average load, less when it is underloaded. This is the place where we
> can make it cope with more complex machines with multiple levels of
> memory hierarchy (like our 32 CPU TX7). Equal load among the nodes is
> important if you have memory bandwidth eaters, as the bandwidth in a
> node is limited.
>
> When introducing node affinity (which shows good results for me!) you
> also need a more careful ranking of the tasks which are candidates to
> be stolen. The routine task_to_steal does this and is another source
> of complexity. It is another point where the multilevel stuff comes
> in. In the core part of the patch the rank of the steal candidates is
> computed by taking into account only the time which a task has slept.

OK, it all sounds sane, just rather complicated ;-) I'm going to trawl
through your stuff with Michael, and see if we can simplify it a bit
somehow whilst not changing the functionality. Your first patch seems
to work just fine, it's just the complexity that bugs me a bit. The
combination of your first patch with Michael's balance_exec stuff
actually seems to work pretty well ... I'll poke at the new patch you
sent me + Michael's exec balance + the little perf tweak I made to it,
and see what happens ;-)

> I attach the script for getting some statistics on the numa_test. I
> consider this test more sensitive to NUMA effects, as it is a
> bandwidth eater that also needs good latency.
> (BTW, Martin: in the numa_test script I've sent you, PROBLEMSIZE must
> be set to 1000000!)

It is ;-) I'm running 44-mm4, not virgin remember, so things like
hot & cold page lists may make it faster?

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28  0:46       ` Martin J. Bligh
  2002-10-28 17:11       ` Erich Focht
@ 2002-10-28 17:38       ` Erich Focht
  2002-10-28 17:36         ` Martin J. Bligh
  1 sibling, 1 reply; 33+ messages in thread
From: Erich Focht @ 2002-10-28 17:38 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> 2.5.44-mm4           Virgin
> 2.5.44-mm4-focht-1   Focht main
> 2.5.44-mm4-hbaum-1   Hbaum main
> 2.5.44-mm4-focht-12  Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12  Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2     Focht main + Hbaum balance_exec
>
> Schedbench 4:
>                      Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4           32.45      49.47      129.86    0.82
> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81

One more remark: you seem to have made the numa_test shorter. That
reduces it to being simply a check of the initial load balancing, as
the hackbench running in the background (and aimed at disturbing the
initial load balancing) might start too late. You will most probably
not see the impact of node affinity with such short-running tests. But
we weren't talking about node affinity, yet...

Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:38       ` Erich Focht
@ 2002-10-28 17:36         ` Martin J. Bligh
  2002-10-28 23:49           ` Erich Focht
  2002-10-29 22:39           ` Erich Focht
  0 siblings, 2 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28 17:36 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> Schedbench 4:
>>                      Elapsed    TotalUser  TotalSys  AvgUser
>> 2.5.44-mm4           32.45      49.47      129.86    0.82
>> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
>> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
>> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
>> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
>> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
>
> One more remark: you seem to have made the numa_test shorter. That
> reduces it to being simply a check of the initial load balancing, as
> the hackbench running in the background (and aimed at disturbing the
> initial load balancing) might start too late. You will most probably
> not see the impact of node affinity with such short-running tests.
> But we weren't talking about node affinity, yet...

I didn't modify what you sent me at all ... perhaps my machine is
just faster than yours?

/me ducks & runs ;-)

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:36       ` Martin J. Bligh
@ 2002-10-28 23:49         ` Erich Focht
  2002-10-29  0:00           ` Martin J. Bligh
  1 sibling, 1 reply; 33+ messages in thread
From: Erich Focht @ 2002-10-28 23:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >>                      Elapsed    TotalUser  TotalSys  AvgUser
> >> 2.5.44-mm4           32.45      49.47      129.86    0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
> >
> > One more remark: you seem to have made the numa_test shorter. That
> > reduces it to being simply a check of the initial load balancing,
> > as the hackbench running in the background (and aimed at disturbing
> > the initial load balancing) might start too late. You will most
> > probably not see the impact of node affinity with such
> > short-running tests. But we weren't talking about node affinity,
> > yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

:-)))

I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get something
around 16s on a 2.8GHz Xeon. On a 1.6GHz Athlon it's 22s. Both times
running ./numa_test 2 on a dual-CPU box. The user time is pretty
independent of the OS (but the scheduling influences it a lot).

But: you have a node-level cache! Maybe the whole memory footprint
fits inside that one, and then things can go really fast. Hmmm, I
guess I'll need some cache detection in the future to make sure the BM
really runs in memory... Increasing PROBLEMSIZE might help, but we can
do that later, when testing affinity (I'm not giving up on this
idea... ;-)

Regards,
Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 23:49       ` Erich Focht
@ 2002-10-29  0:00         ` Martin J. Bligh
  2002-10-29  1:12           ` Gerrit Huizenga
  0 siblings, 1 reply; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-29  0:00 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> I didn't modify what you sent me at all ... perhaps my machine is
>> just faster than yours?
>>
>> /me ducks & runs ;-)
>
> :-)))
>
> I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get something
> around 16s on a 2.8GHz Xeon. On a 1.6GHz Athlon it's 22s. Both times
> running ./numa_test 2 on a dual-CPU box. The user time is pretty
> independent of the OS (but the scheduling influences it a lot).

I have 700MHz P3 Xeons, but I have 2Mb of L2 cache on them, which is
much better than the newer chips. That might make a big difference.

> But: you have a node-level cache! Maybe the whole memory footprint
> fits inside that one, and then things can go really fast. Hmmm, I
> guess I'll need some cache detection in the future to make sure the
> BM really runs in memory... Increasing PROBLEMSIZE might help, but
> we can do that later, when testing affinity (I'm not giving up on
> this idea... ;-)

Yup, 32Mb cache. Not sure if it's faster than local memory or not.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-29  0:00       ` Martin J. Bligh
@ 2002-10-29  1:12         ` Gerrit Huizenga
  0 siblings, 0 replies; 33+ messages in thread
From: Gerrit Huizenga @ 2002-10-29  1:12 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

In message <737410000.1035849619@flay>, "Martin J. Bligh" writes:
>
> Yup, 32Mb cache. Not sure if it's faster than local memory or not.

Yes, NUMA-Q cache can be faster than local memory, but it *only*
caches remote memory. Some other architectures use the L3 cache to
cache *all* memory (local _and_ remote). Reasoning: why pollute the
valuable cache with things that are already close at hand?

gerrit

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:36       ` Martin J. Bligh
  2002-10-28 23:49       ` Erich Focht
@ 2002-10-29 22:39       ` Erich Focht
  1 sibling, 0 replies; 33+ messages in thread
From: Erich Focht @ 2002-10-29 22:39 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >>                      Elapsed    TotalUser  TotalSys  AvgUser
> >> 2.5.44-mm4           32.45      49.47      129.86    0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
> >
> > One more remark: you seem to have made the numa_test shorter. That
> > reduces it to being simply a check of the initial load balancing,
> > as the hackbench running in the background (and aimed at disturbing
> > the initial load balancing) might start too late. You will most
> > probably not see the impact of node affinity with such
> > short-running tests. But we weren't talking about node affinity,
> > yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

Aaargh, now I understand!!! You just have wrong labels in your table,
they are permuted! This makes more sense:

> >>                      AvgUser    Elapsed    TotalUser  TotalSys
> >> 2.5.44-mm4           32.45      49.47      129.86     0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48     1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26     0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99      0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09      0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57      0.81

Regards,
Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28  0:32       ` Erich Focht
                         ` (2 preceding siblings ...)
  2002-10-28  0:46       ` Martin J. Bligh
@ 2002-10-28  7:16       ` Martin J. Bligh
  3 siblings, 0 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28  7:16 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML, I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see), or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations? As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning.
> This leads to a cheap selection of the next node, just a simple
> round robin.

I did this ... presume that's what you meant:

static int sched_best_node(struct task_struct *p)
{
	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
	int cpu, pool, load;
	unsigned long mask = p->cpus_allowed & cpu_online_map;

	do {
		/* atomic_inc_return is not implemented on all archs [EF] */
		atomic_inc(&sched_node);
		best_node = atomic_read(&sched_node) % numpools;
	} while (!(pool_mask[best_node] & mask));

	return best_node;
}

Odd. Seems to make it even worse.

Kernbench:
                          Elapsed   User      System   CPU
2.5.44-mm4-focht-12       20.32s    190s      44.4s    1153.6%
2.5.44-mm4-focht-12-lobo  21.362s   193.71s   48.672s  1134%

The diffprofiles below look like this just makes it make bad
decisions. Very odd ... compare with what happened when I put
Michael's balance_exec on instead. I'm tired, maybe I did something
silly.

diffprofile 2.5.44-mm4-focht-1 2.5.44-mm4-focht-12
       606 page_remove_rmap
       566 do_schedule
       488 page_add_rmap
       475 .text.lock.file_table
       370 __copy_to_user
       306 strnlen_user
       272 d_lookup
       235 find_get_page
       233 get_empty_filp
       193 atomic_dec_and_lock
       161 copy_process
       159 sched_best_node
       135 flush_signal_handlers
       131 complete
       116 filemap_nopage
       109 __fput
       105 path_lookup
       103 follow_mount
        95 zap_pte_range
        92 file_move
        91 do_no_page
        87 release_task
        80 do_page_fault
        62 lru_cache_add
        62 link_path_walk
        62 do_generic_mapping_read
        57 find_trylock_page
        55 release_pages
        50 dup_task_struct
...
       -73 do_anonymous_page
      -478 __copy_from_user

diffprofile 2.5.44-mm4-focht-12 2.5.44-mm4-focht-12-lobo
       567 do_schedule
       482 do_anonymous_page
       383 page_remove_rmap
       336 __copy_from_user
       333 page_add_rmap
       241 zap_pte_range
       213 init_private_file
       189 strnlen_user
       186 buffered_rmqueue
       172 find_get_page
       124 complete
       111 filemap_nopage
        97 free_hot_cold_page
        89 flush_signal_handlers
        86 clear_page_tables
        79 do_page_fault
        79 copy_process
        75 d_lookup
        74 path_lookup
        71 sched_best_cpu
        68 do_no_page
        58 release_pages
        58 __set_page_dirty_buffers
        52 wait_for_completion
        51 release_task
        51 handle_mm_fault
...
       -53 lru_cache_add
       -73 dentry_open
      -100 sched_best_node
      -108 file_ra_state_init
      -402 .text.lock.file_table

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)
  2002-10-23 21:26 Crunch time -- the musical. (2.5 merge candidate list 1.5) Rob Landley
  2002-10-24 16:17 ` Michael Hohnbaum
@ 2002-10-25 14:46 ` Kevin Corry
  1 sibling, 0 replies; 33+ messages in thread
From: Kevin Corry @ 2002-10-25 14:46 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Wednesday 23 October 2002 16:26, Rob Landley wrote:
> Due to numerous complaints (okay, one, but technically that's a number)
> tried to reformat a bit to have a slightly less eye-searingly hideous
> layout. And reorganized the -mm stuff to be together in one clump.
>
> And so:
> ......
> ---------------------------------------------------------------------------
>
> 8) EVMS (Enterprise Volume Management System) (EVMS team)
>
> Home page:
> http://sourceforge.net/projects/evms
>
> ---------------------------------------------------------------------------

Rob,

Can you please add the following links for the EVMS project:

Home page:
http://evms.sourceforge.net

Download:
http://evms.sourceforge.net/patches/

Some related discussions:
http://marc.theaimsgroup.com/?t=103359686900003&r=1&w=2
http://marc.theaimsgroup.com/?t=103439913000001&r=1&w=2
http://marc.theaimsgroup.com/?w=2&r=1&s=%5Bpatch%5D+evms+core&q=t

Thanks!
--
Kevin Corry
corryk@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 33+ messages in thread
end of thread, other threads:[~2002-10-29 22:33 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-23 21:26 Crunch time -- the musical. (2.5 merge candidate list 1.5) Rob Landley
2002-10-24 16:17 ` Michael Hohnbaum
[not found] ` <200210240750.09751.landley@trommello.org>
2002-10-24 19:01 ` Michael Hohnbaum
2002-10-24 21:51 ` Erich Focht
2002-10-24 22:38 ` Martin J. Bligh
2002-10-25 8:15 ` Erich Focht
2002-10-25 23:26 ` Martin J. Bligh
2002-10-25 23:45 ` Martin J. Bligh
2002-10-26 0:02 ` Martin J. Bligh
2002-10-26 18:58 ` Martin J. Bligh
2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh
2002-10-27 18:16 ` Martin J. Bligh
2002-10-28 0:32 ` Erich Focht
2002-10-27 23:52 ` Martin J. Bligh
2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum
2002-10-28 4:23 ` Martin J. Bligh
2002-10-28 0:31 ` Martin J. Bligh
2002-10-28 16:34 ` Erich Focht
2002-10-28 16:57 ` Martin J. Bligh
2002-10-28 17:26 ` Erich Focht
2002-10-28 17:35 ` Martin J. Bligh
2002-10-29 0:07 ` [Lse-tech] " Erich Focht
2002-10-28 0:46 ` Martin J. Bligh
2002-10-28 17:11 ` Erich Focht
2002-10-28 18:32 ` Martin J. Bligh
2002-10-28 17:38 ` Erich Focht
2002-10-28 17:36 ` Martin J. Bligh
2002-10-28 23:49 ` Erich Focht
2002-10-29 0:00 ` Martin J. Bligh
2002-10-29 1:12 ` Gerrit Huizenga
2002-10-29 22:39 ` Erich Focht
2002-10-28 7:16 ` Martin J. Bligh
2002-10-25 14:46 ` Crunch time -- the musical. (2.5 " Kevin Corry
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox