* Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Rob Landley @ 2002-10-23 21:26 UTC (permalink / raw)
To: linux-kernel

Kernel hooks is back with new links. Also new versions of the Linux Trace Toolkit and sys_epoll, new stuff from the 2.5 status list, and new stuff STILL showing up on linux-kernel. (Still no 2.5 patch for Alan's 32 bit dev_t, though.)

Richard J. Moore has stepped up to defend "VM Large Page support", which has become "hugetlb update". I don't know if this counts as a new feature or a bugfix, but it's back...

Due to numerous complaints (okay, one, but technically that's a number) I've tried to reformat a bit to give a slightly less eye-searingly hideous layout, and reorganized the -mm stuff to be together in one clump.

And so:

----------

Linus returns from the Linux Lunacy Cruise after Sunday, October 27th. (See http://www.geekcruises.com/itinerary/ll2_itinerary.html. He's off to Jamaica, mon.) The following features aim to be ready for submission to Linus by Monday, October 28th, to be considered for inclusion (in 2.5.45) before the feature freeze on Thursday, October 31 (Halloween). (L minus four days, and counting...)

Note: if you want to submit a new entry to this list, PLEASE provide a URL where the patch can be found, plus any descriptive announcement you think useful (user space tools, etc.). This doesn't have to be a web page devoted to the patch; if the patch has been posted to linux-kernel, a URL to the post on any linux-kernel archive site is fine.
If you don't know of one, a good site for looking at the threaded archive is:
http://lists.insecure.org/lists/linux-kernel/

A more searchable archive is available at:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&group=mlist.linux.kernel

This archive seems less likely to mangle your patch for cut and pasting (especially if you click "raw download" at the top of the message), although it's a real pain to actually try to read:
http://marc.theaimsgroup.com/?l=linux-kernel

This list is just pending features trying to get in before feature freeze. It's primarily for features that need more testing, or might otherwise get forgotten in the rush. If you want to know what's already gone in, or what's being worked on for the next development cycle, check out http://kernelnewbies.org/status.

You can get Andrew Morton's -mm tree here, including a broken-out patches directory and a description file:
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.44

Alan Cox's -ac tree comes from here:
http://www.kernel.org/pub/linux/kernel/people/alan/

Thanks to Rusty Russell and Guillaume Boissiere, whose respective 2.5 merge candidate lists have been ruthlessly strip-mined in the process of assembling this. And to everybody who's emailed stuff.

And now, in no particular order:

============================ Pending features: =============================

1) New kernel configuration system (Roman Zippel)

Announcement:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6898.html
Code:
http://www.xs4all.nl/~zippel/lc/

Linus has actually looked fairly favorably on this one so far:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3250.html

----------------------------------------------------------------------------

2) ext2/ext3 extended attributes and access control lists (Ted Tso) (in -mm)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6787.html
Code:
bk://extfs.bkbits.net/extfs-2.5-update
http://thunk.org/tytso/linux/extfs-2.5
(Or just grab it from the -mm tree.)
(Considering that EA/ACL infrastructure is already in, and supported by XFS and JFS, this one's pretty close to a shoo-in.)

----------------------------------------------------------------------------

3) Page table sharing (Daniel Phillips, Dave McCracken) (in -mm)

Announce:
http://www.geocrawler.com/mail/msg.php3?msg_id=7855063&list=35
Patch from the -mm tree:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/shpte-ng.patch

Ed Tomlinson seems to have a show-stopper bug for this one (although he tells me in email he'd like to see it go in anyway):
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7147.html

----------------------------------------------------------------------------

4) Improved Hugetlb support (Richard J. Moore) (in -mm tree)

(Dunno if this is exactly a feature, but giving it the benefit of the doubt...)

Description:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/description
Patches (everything starting with "htlb" or "hugetlb"):
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/

----------------------------------------------------------------------------

5) Generic Nonlinear Mappings (Ingo Molnar) (in -mm)

It's new, very close to the deadline, and needs testing and discussion. I'm still a touch vague on what it actually does, but there's a thread.
Announcement, patch, and start of thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103530883511032&w=2

----------------------------------------------------------------------------

6) Linux Trace Toolkit (LTT) (Karim Yaghmour)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7016.html
Patch:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021022-2.2.bz2
User tools:
http://opersys.com/ftp/pub/LTT/TraceToolkit-0.9.6pre2.tgz

----------------------------------------------------------------------------

7) Device mapper for Logical Volume Manager (LVM2) (LVM2 team) (in -ac)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536883428443&w=2
Download:
http://people.sistina.com/~thornber/patches/2.5-stable/
Home page:
http://www.sistina.com/products_lvm.htm

----------------------------------------------------------------------------

8) EVMS (Enterprise Volume Management System) (EVMS team)

Home page:
http://sourceforge.net/projects/evms

----------------------------------------------------------------------------

9) Kernel Probes (IBM, contact: Vamsi Krishna S)

Kprobes announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528410215211&w=2
Base Kprobes patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528425615302&w=2
KProbes->DProbes patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454215523&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454015520&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528485415813&w=2
Official IBM download site for the most recent versions (gzipped tarballs):
http://www-124.ibm.com/linux/patches/?project_id=141
See also the DProbes home page:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

A good explanation of the difference between kprobes, dprobes, and kernel hooks is here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103532874900445&w=2

And a clarification: just kprobes is being submitted for 2.5.45, not the whole of dprobes:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536827928012&w=2

----------------------------------------------------------------------------

10) High resolution timers (George Anzinger, etc.)

Home page:
http://high-res-timers.sourceforge.net/
Patch via evil sourceforge download auto-mirror thing:
http://prdownloads.sourceforge.net/high-res-timers/hrtimers-support-2.5.36-1.0.patch?download

Linus has unresolved concerns with this one, by the way:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3463.html

Note: The Google posix timer patch forwarded by Jim Houston is being merged into this patch:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8068.html

----------------------------------------------------------------------------

11) Linux Kernel Crash Dumps (Matt Robinson, LKCD team)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536576625905&w=2
Code:
http://lkcd.sourceforge.net/download/latest/

----------------------------------------------------------------------------

12) Rewrite of the console layer (James Simmons)

Home page:
http://linuxconsole.sourceforge.net/
Patch (unknown version, but the home page only has a random CVS du jour link):
http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz
Bitkeeper tree:
http://linuxconsole.bkbits.net

----------------------------------------------------------------------------

13) Kexec, launch new Linux kernel from Linux (Eric W. Biederman)

Announcement with links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6584.html

And this thread is just too brazen not to include:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7952.html

----------------------------------------------------------------------------

14) USAGI IPv6 (Yoshifuji Hideyaki)

README:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/README.IPSEC
Patch:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/ipsec-2.5.43-ALL-03.patch.gz

----------------------------------------------------------------------------

15) MMU-less processor support (Greg Ungerer)

Announcement with lots of links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7027.html

----------------------------------------------------------------------------

16) sys_epoll (i.e. /dev/poll) (Davide Libenzi)

Home page:
http://www.xmailserver.org/linux-patches/nio-improve.html
Patch:
http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-0.7.diff

Linus participated repeatedly in a thread on this one too, expressing concerns which (hopefully) have been addressed. See:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6428.html

----------------------------------------------------------------------------

17) CD Recording/sgio patches (Jens Axboe)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8060.html
Patch:
http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.5/2.5.44/sgio-14b.diff.bz2

----------------------------------------------------------------------------

18) In-kernel module loader (Rusty Russell)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6214.html
Patch:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/module-x86-18-10-2002.2.5.43.diff.gz

----------------------------------------------------------------------------

19) Unified Boot/Module parameter support (Rusty Russell)

Note: depends on the in-kernel module loader.
Huge disorganized heap o' patches with no explanation:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/

----------------------------------------------------------------------------

20) Hotplug CPU Removal (Rusty Russell)

Even bigger, more disorganized heap o' patches:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotplug/

----------------------------------------------------------------------------

21) Unlimited groups patch (Tim Hockin)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761319825&w=2
Patch set:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524717119443&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761819834&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761619831&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761519829&w=2

----------------------------------------------------------------------------

22) Initramfs (Al Viro)

Way back when, Al said:
http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html
I THINK this is the most recent patch:
ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C40
And Linus recently made happy noises about the idea:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/1110.html

----------------------------------------------------------------------------

23) Kernel Hooks (IBM, contact: Vamsi Krishna S.)

Website:
http://www-124.ibm.com/linux/projects/kernelhooks/
Download site:
http://www-124.ibm.com/linux/patches/?patch_id=595
Posted patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103364774926440&w=2

----------------------------------------------------------------------------

24) NMI request/release interface (Corey Minyard)

He says:
> Add a request/release mechanism to the kernel (x86 only for now) for NMIs.
...
> I have modified the nmi watchdog to use this interface, and it
> seems to work ok. Keith Owens is copied to see if he would be
> interested in converting kdb to use this, if it gets put into the kernel.
The latest patch so far:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540434409894&w=2

----------------------------------------------------------------------------

25) Digital Video Broadcasting layer (LinuxTV team)

Home page:
http://www.linuxtv.org:81/dvb/
Download:
http://www.linuxtv.org:81/download/dvb/

----------------------------------------------------------------------------

26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)

Home page:
http://home.arcor.de/efocht/sched/
Patch:
http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

----------------------------------------------------------------------------

27) DriverFS Topology (Matthew Dobson)

Announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103523702710396&w=2
Patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540707113401&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757613962&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540758013984&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757513957&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757813966&w=2

----------------------------------------------------------------------------

28) Advanced TCA Disk Hotswap (Steven Dake)

At the last minute, Steven Dake submitted (and if he'd cc'd the list, I could have linked to this message as the announcement, hint hint...):

> Please add to your 2.5.45 list:
>
> "Advanced TCA Disk Hotswap".
>
> This is a generic feature that provides good hotswap support for SCSI
> and FibreChannel disk devices. The entire SCSI layer has been properly
> analyzed to provide correct locking, and a complete RAMFS filesystem is
> available to control the kernel disk hotswap operations.
>
> Both Alan Cox and Greg KH have looked at the patch for 2.4 and suggested
> that if I ported to 2.5 and made some changes (as I have in the latest port)
> this feature would be a good candidate for the 2.5 kernel.
>
> The sourceforge site for the latest patches is:
> https://sourceforge.net/projects/atca-hotswap/
>
> The lkml announcement for this latest port is:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103541572622729&w=2
>
> A thread discussing Advanced TCA hotswap (of which this patch is one
> part) can be found at:
> http://marc.theaimsgroup.com/?t=103462115700001&r=1&w=2
>
> Thanks!
> -steve

======================== Unresolved issues: =========================

1) Hyperthread-aware scheduler
2) Connection tracking optimizations

No URLs to a patch for either. Anybody want to come out in favor of these with an announcement and a pointer to a version being suggested for inclusion?

3) IPSEC (David Miller, Alexey)
4) New CryptoAPI (James Morris)

David S. Miller said:
> No URLs, being coded as I type this :-)
>
> Some of the ipv4 infrastructure is in 2.5.44

(Note, this may conflict with Yoshifuji Hideyaki's ipv6 ipsec stuff. If not, I'd like to collate or clarify the entries.) USAGI ipv6 is in the first section and this isn't, because I have a URL to an existing USAGI patch and don't for this. I have no idea how much overlap there is between these projects, or whether they're considered parts of the same project or submitted individually...

5) ReiserFS 4

Hans Reiser said:
> We will send Reiser4 out soon, probably around the 27th.
>
> Hans

See also http://www.namesys.com/v4/fast_reiser4.html

Hans and Jens Axboe are arguing about whether or not Reiser4 is a potential post-freeze addition. That thread starts here:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7140.html

6) 32-bit dev_t

Alan Cox said:
> The big one missing is 32bit dev_t. That's the killer item we have left.

But he did not provide a URL to a patch. Presumably it's in his tree and capable of being extracted from it, so I guess it's already in good hands? (I dunno, ask him.)

He also mentioned:
> Oh, other one I missed - DVB layer - digital tv etc.
> Pretty much essential now for Europe, but again it's basically all driver layer

But it's not clear whether this is an item that must go in before the feature freeze or not at all, which is what this list tries to focus on.

Then Dan Kegel pointed out:
> One possible page to quote for 32 bit dev_t:
> http://lwn.net/Articles/11583/

7) Online EXT3 resize support

A thread over whether or not this is self-contained enough and low enough impact to go in after the feature freeze starts here:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7680.html

I mention it just in case it isn't. (We've had offline EXT3 resize for a while; this is apparently twiddling a mounted partition without unplugging it first, or even wearing rubber boots.)

-- 
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad, CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Michael Hohnbaum @ 2002-10-24 16:17 UTC
To: landley; +Cc: linux-kernel, Erich Focht

On Wed, 2002-10-23 at 14:26, Rob Landley wrote:
> 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
>
> Home page:
> http://home.arcor.de/efocht/sched/
>
> Patch:
> http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

The simple NUMA scheduler patch, which is ready for inclusion, is a separate project from Erich's NUMA scheduler extensions. Information on the simple NUMA scheduler is contained in these lkml postings:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2

The most recent version has been split into two patches for 2.5.44:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2

-- 
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
[parent not found: <200210240750.09751.landley@trommello.org>]
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Michael Hohnbaum @ 2002-10-24 19:01 UTC
To: landley; +Cc: linux-kernel, Erich Focht

On Thu, 2002-10-24 at 05:50, Rob Landley wrote:
> On Thursday 24 October 2002 11:17, Michael Hohnbaum wrote:
> > On Wed, 2002-10-23 at 14:26, Rob Landley wrote:
> > > 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
> > >
> > > Home page:
> > > http://home.arcor.de/efocht/sched/
> > >
> > > Patch:
> > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch
> >
> > The simple NUMA scheduler patch, which is ready for inclusion, is a
> > separate project from Erich's NUMA scheduler extensions. Information
> > on the simple NUMA scheduler is contained in these lkml postings:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2
> >
> > The most recent version has been split into two patches for 2.5.44:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2
>
> Any relation to http://lse.sourceforge.net/numa/, which the 2.5 status list
> says is in "Alpha" state, two steps down from "Ready"?
>
> Rob

Yes and no. At one point I was working with Erich, moving his NUMA scheduler to 2.5 and testing it on our NUMA hardware. However, it was not looking like his NUMA scheduler was going to be ready for 2.5, so I went off on a separate effort to produce a much smaller, simpler patch providing rudimentary NUMA support within the scheduler. This patch does not have all the functionality of Erich's, but it does provide definite performance improvements on NUMA machines with no degradation on non-NUMA SMP.
It is much smaller and less intrusive, and has been tested on multiple NUMA architectures (including by Erich on the NEC IA64 NUMA box).

The 2.5 status list has not been updated to reflect this separate effort, and I believe it incorrectly lists this entry as "ready". There really are now two NUMA scheduler projects:

* Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
* Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)

-- 
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Erich Focht @ 2002-10-24 21:51 UTC
To: Michael Hohnbaum, landley; +Cc: linux-kernel

Hi Rob and Michael,

I need to correct some inaccuracies and, of course, advertise my approach :-)

On Thursday 24 October 2002 21:01, Michael Hohnbaum wrote:
> > > > 26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)
> > > >
> > > > Home page:
> > > > http://home.arcor.de/efocht/sched/
> > > >
> > > > Patch:
> > > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

These are old. I posted the newer patches (split up in order to clearly separate the functionality additions) to LKML:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387719030&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387519026&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441119407&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441319411&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441419416&w=2

They should work for any NUMA platform by just adding a call to build_pools() in smp_cpus_done(). They work for non-NUMA platforms the same way as the O(1) scheduler (though the code looks different).

A test overview is in:
http://lwn.net/Articles/12546/

This suggests that taking only patches 01+02 already gives you a VERY good NUMA scheduler. They deliver the infrastructure for later developments (patches 03+05) which we can further research and tune, or give only to special customers.

> The 2.5 status list has not been updated to reflect this separate
> effort, and I believe incorrectly lists this entry as "ready". There
> really are now two NUMA scheduler projects:
>
> * Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
> * Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)
This is not correct. We have had the node affine scheduler in production for six months on top of 2.4 kernels and are happy with it. It is a lot more than alpha or beta; it already makes customers happy.

The situation is really funny: everybody seems to agree that the design ideas in my NUMA approach are sane and exactly what we want to have on a NUMA platform in the end. But instead of concentrating on tuning the parameters for the many different NUMA platforms and reshaping this approach to make it acceptable, IBM concentrates on a very much stripped down approach. I understand that this project has been started to make the inclusion of some NUMA scheduler easier. But in the end, the simple NUMA scheduler will have to develop into a much more complex thing and in some form or another replicate the design ideas of my node affine scheduler.

On machines with a poor NUMA ratio like NUMAQ, the simple NUMA change helps. For machines with a good NUMA ratio like the NEC Azusa and NEC TX7 you need a little bit more. AMD Hammer SMP and ppc64 are certainly in the same class as the Azusa/TX7, and as soon as Hammer SMP systems are around, the pressure for a full featured NUMA scheduler will be much higher. A NUMA scheduler extension of the 2.6 kernel fits very well with the development effort done for better scalability and enterprise-level fitness of Linux. Check http://lwn.net/Articles/12546/ to see that it makes a difference to have more than O(1) on NUMA machines!

I'd definitely prefer the inclusion of my 01+02 patches (I'd have to maintain less code to keep the customers happy); on the other side, including Michael's patch would be better than not adding NUMA scheduler support at all.

Best regards,
Erich
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Martin J. Bligh @ 2002-10-24 22:38 UTC
To: Erich Focht, Michael Hohnbaum, landley; +Cc: linux-kernel

> The situation is really funny: everybody seems to agree that the design
> ideas in my NUMA approach are sane and exactly what we want to have on
> a NUMA platform in the end. But instead of concentrating on tuning the
> parameters for the many different NUMA platforms and reshaping this
> approach to make it acceptable, IBM concentrates on a very much stripped
> down approach.

From my point of view, the reason for focusing on this was that your scheduler degraded the performance on my machine, rather than boosting it. Half of that was the more complex stuff you added on top ... it's a lot easier to start with something simple that works and build on it, than to fix something that's complex and doesn't work well.

I still haven't been able to get your scheduler to boot for about the last month without crashing the system. Andrew says he has it booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and see how it looks. If the numbers look good for doing boring things like kernel compile, SDET, etc., I'm happy.

M.
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Erich Focht @ 2002-10-25 8:15 UTC
To: Martin J. Bligh, Michael Hohnbaum, landley; +Cc: linux-kernel

On Friday 25 October 2002 00:38, Martin J. Bligh wrote:
> > The situation is really funny: everybody seems to agree that the design
> > ideas in my NUMA approach are sane and exactly what we want to have on
> > a NUMA platform in the end. But instead of concentrating on tuning the
> > parameters for the many different NUMA platforms and reshaping this
> > approach to make it acceptable, IBM concentrates on a very much stripped
> > down approach.
>
> From my point of view, the reason for focusing on this was that
> your scheduler degraded the performance on my machine, rather than
> boosting it. Half of that was the more complex stuff you added on
> top ... it's a lot easier to start with something simple that works
> and build on it, than fix something that's complex and doesn't work
> well.

You're talking about one of the first 2.5 versions of the patch. It has changed a lot since then, thanks to your feedback, too.

> I still haven't been able to get your scheduler to boot for about
> the last month without crashing the system. Andrew says he has it
> booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and
> see how it looks. If the numbers look good for doing boring things
> like kernel compile, SDET, etc, I'm happy.

I thought this problem was well understood! For reasons independent of my patch you have to boot your machines with the "notsc" option. This leaves the cache_decay_ticks variable initialized to zero, which my patch doesn't like. I'm trying to deal with this inside the patch, but there is still a small window when the variable is zero.
In my opinion this needs to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine with cache_decay_ticks=0 is pure nonsense, as it switches off the cache affinity which you absolutely need! So even if "notsc" is a legal option, it should be fixed such that it doesn't leave your machine without cache affinity. That would in any case give you a falsified behavior of the O(1) scheduler.

Erich
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

From: Martin J. Bligh @ 2002-10-25 23:26 UTC
To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel

> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

Right. But I've been struggling to boot anything later than that ;-)

> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

OK, well we seem to have it working on one machine, but not on another, though those should be identical; I'm playing around with the differences. The first major thing I noticed is that the working box has gcc 3.1, and the non-working one gcc 2.95.4 (Debian woody). I suspect it's a subtle timing thing, or something equally horrible.

Changing the non-working box to gcc 3.1 instead (which I *really* don't want to do long term unless we prove there's a bug in 2.95 ... gcc 3.x is disgustingly slow) resulted in it getting a little further, but then it got the following oops ... does this provide any clues?

CPU 7 IS NOW UP!
Starting migration thread for cpu 7
Bringing up 8
CPU 8 IS NOW UP!
Starting migration thread for cpu 8
divide error: 0000
CPU:    4
EIP:    0060:[<c011ac38>]    Not tainted
EFLAGS: 00010002
EIP is at task_to_steal+0x118/0x260
eax: 00000001   ebx: f01c5040   ecx: 00000000   edx: 00000000
esi: 00000063   edi: f01c5020   ebp: f0197ee8   esp: f0197eac
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060)
Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c
       c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c
       c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000
Call Trace:
 [<c011829c>] load_balance+0x8c/0x140
 [<c0118588>] scheduler_tick+0x238/0x360
 [<c0123347>] tasklet_hi_action+0x77/0xc0
 [<c0105420>] default_idle+0x0/0x50
 [<c0126bd5>] update_process_times+0x45/0x60
 [<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120
 [<c0105420>] default_idle+0x0/0x50
 [<c010815e>] apic_timer_interrupt+0x1a/0x20
 [<c0105420>] default_idle+0x0/0x50
 [<c0105420>] default_idle+0x0/0x50
 [<c010544a>] default_idle+0x2a/0x50
 [<c01054ea>] cpu_idle+0x3a/0x50
 [<c011db20>] printk+0x140/0x180
Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44

This is 2.5.44-mm4 + your patches 1,2,3,5, I think.

M.
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 23:26 ` Martin J. Bligh @ 2002-10-25 23:45 ` Martin J. Bligh 2002-10-26 0:02 ` Martin J. Bligh 1 sibling, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-25 23:45 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel > divide error: 0000 > > CPU: 4 > EIP: 0060:[<c011ac38>] Not tainted > EFLAGS: 00010002 > EIP is at task_to_steal+0x118/0x260 > eax: 00000001 ebx: f01c5040 ecx: 00000000 edx: 00000000 > esi: 00000063 edi: f01c5020 ebp: f0197ee8 esp: f0197eac > ds: 0068 es: 0068 ss: 0068 > Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060) > Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c > c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c > c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000 > Call Trace: > [<c011829c>] load_balance+0x8c/0x140 > [<c0118588>] scheduler_tick+0x238/0x360 > [<c0123347>] tasklet_hi_action+0x77/0xc0 > [<c0105420>] default_idle+0x0/0x50 > [<c0126bd5>] update_process_times+0x45/0x60 > [<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120 > [<c0105420>] default_idle+0x0/0x50 > [<c010815e>] apic_timer_interrupt+0x1a/0x20 > [<c0105420>] default_idle+0x0/0x50 > [<c0105420>] default_idle+0x0/0x50 > [<c010544a>] default_idle+0x2a/0x50 > [<c01054ea>] cpu_idle+0x3a/0x50 > [<c011db20>] printk+0x140/0x180 > > Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44 Dump of assembler code for function task_to_steal: 0xc011ab20 <task_to_steal>: push %ebp 0xc011ab21 <task_to_steal+1>: mov %esp,%ebp 0xc011ab23 <task_to_steal+3>: push %edi 0xc011ab24 <task_to_steal+4>: push %esi 0xc011ab25 <task_to_steal+5>: push %ebx 0xc011ab26 <task_to_steal+6>: sub $0x30,%esp 0xc011ab29 <task_to_steal+9>: movl $0x0,0xffffffdc(%ebp) 0xc011ab30 <task_to_steal+16>: mov 0xc(%ebp),%eax 0xc011ab33 <task_to_steal+19>: movl $0x0,0xffffffe8(%ebp) 0xc011ab3a <task_to_steal+26>: mov 
0x8(%ebp),%edx 0xc011ab3d <task_to_steal+29>: mov 0xc034afe0(,%eax,4),%eax 0xc011ab44 <task_to_steal+36>: sar $0x4,%eax 0xc011ab47 <task_to_steal+39>: mov %eax,0xffffffec(%ebp) 0xc011ab4a <task_to_steal+42>: mov 0x20(%edx),%eax 0xc011ab4d <task_to_steal+45>: mov (%eax),%esi 0xc011ab4f <task_to_steal+47>: test %esi,%esi 0xc011ab51 <task_to_steal+49>: je 0xc011ad6a <task_to_steal+586> 0xc011ab57 <task_to_steal+55>: mov %eax,0xffffffe4(%ebp) 0xc011ab5a <task_to_steal+58>: movl $0x0,0xfffffff0(%ebp) 0xc011ab61 <task_to_steal+65>: mov 0xffffffe4(%ebp),%ebx 0xc011ab64 <task_to_steal+68>: add $0x4,%ebx 0xc011ab67 <task_to_steal+71>: mov %ebx,0xffffffd0(%ebp) 0xc011ab6a <task_to_steal+74>: lea 0x0(%esi),%esi 0xc011ab70 <task_to_steal+80>: mov 0xfffffff0(%ebp),%ebx 0xc011ab73 <task_to_steal+83>: test %ebx,%ebx 0xc011ab75 <task_to_steal+85>: jne 0xc011acec <task_to_steal+460> 0xc011ab7b <task_to_steal+91>: mov 0xffffffe4(%ebp),%edx 0xc011ab7e <task_to_steal+94>: mov 0x4(%edx),%eax 0xc011ab81 <task_to_steal+97>: test %eax,%eax 0xc011ab83 <task_to_steal+99>: jne 0xc011ace4 <task_to_steal+452> 0xc011ab89 <task_to_steal+105>: mov 0xffffffd0(%ebp),%ecx 0xc011ab8c <task_to_steal+108>: mov 0x4(%ecx),%eax 0xc011ab8f <task_to_steal+111>: test %eax,%eax 0xc011ab91 <task_to_steal+113>: jne 0xc011acd9 <task_to_steal+441> 0xc011ab97 <task_to_steal+119>: mov 0xffffffd0(%ebp),%ebx 0xc011ab9a <task_to_steal+122>: mov 0x8(%ebx),%eax 0xc011ab9d <task_to_steal+125>: test %eax,%eax 0xc011ab9f <task_to_steal+127>: jne 0xc011acce <task_to_steal+430> 0xc011aba5 <task_to_steal+133>: mov 0xffffffd0(%ebp),%edx 0xc011aba8 <task_to_steal+136>: mov 0xc(%edx),%eax 0xc011abab <task_to_steal+139>: test %eax,%eax 0xc011abad <task_to_steal+141>: je 0xc011acbf <task_to_steal+415> 0xc011abb3 <task_to_steal+147>: bsf %eax,%eax 0xc011abb6 <task_to_steal+150>: add $0x60,%eax 0xc011abb9 <task_to_steal+153>: mov %eax,0xfffffff0(%ebp) 0xc011abbc <task_to_steal+156>: cmpl $0x8c,0xfffffff0(%ebp) 0xc011abc3 
<task_to_steal+163>: je 0xc011ac9e <task_to_steal+382> 0xc011abc9 <task_to_steal+169>: mov 0xfffffff0(%ebp),%ebx 0xc011abcc <task_to_steal+172>: mov 0xffffffe4(%ebp),%eax 0xc011abcf <task_to_steal+175>: mov 0xc034b4e0,%edx 0xc011abd5 <task_to_steal+181>: lea 0x18(%eax,%ebx,8),%ebx 0xc011abd9 <task_to_steal+185>: mov %ebx,0xffffffe0(%ebp) 0xc011abdc <task_to_steal+188>: mov 0x4(%ebx),%ebx 0xc011abdf <task_to_steal+191>: mov %edx,0xffffffcc(%ebp) 0xc011abe2 <task_to_steal+194>: lea 0x0(%esi,1),%esi 0xc011abe9 <task_to_steal+201>: lea 0x0(%edi,1),%edi 0xc011abf0 <task_to_steal+208>: lea 0xffffffe0(%ebx),%edi 0xc011abf3 <task_to_steal+211>: mov 0xc0348e68,%eax 0xc011abf8 <task_to_steal+216>: mov 0x30(%edi),%edx 0xc011abfb <task_to_steal+219>: sub %edx,%eax 0xc011abfd <task_to_steal+221>: cmp 0xffffffcc(%ebp),%eax 0xc011ac00 <task_to_steal+224>: jbe 0xc011ac70 <task_to_steal+336> 0xc011ac02 <task_to_steal+226>: mov 0x8(%ebp),%ecx 0xc011ac05 <task_to_steal+229>: mov 0x14(%ecx),%ecx 0xc011ac08 <task_to_steal+232>: cmp %ecx,%edi 0xc011ac0a <task_to_steal+234>: mov %ecx,0xffffffc8(%ebp) 0xc011ac0d <task_to_steal+237>: je 0xc011ac70 <task_to_steal+336> 0xc011ac0f <task_to_steal+239>: movzbl 0xc(%ebp),%ecx 0xc011ac13 <task_to_steal+243>: mov 0x38(%edi),%eax 0xc011ac16 <task_to_steal+246>: shr %cl,%eax 0xc011ac18 <task_to_steal+248>: and $0x1,%eax 0xc011ac1b <task_to_steal+251>: je 0xc011ac70 <task_to_steal+336> 0xc011ac1d <task_to_steal+253>: mov 0x48(%edi),%esi 0xc011ac20 <task_to_steal+256>: test %esi,%esi 0xc011ac22 <task_to_steal+258>: jne 0xc011ac83 <task_to_steal+355> 0xc011ac24 <task_to_steal+260>: mov 0xc0348e68,%eax 0xc011ac29 <task_to_steal+265>: xor %edx,%edx 0xc011ac2b <task_to_steal+267>: mov $0x63,%esi 0xc011ac30 <task_to_steal+272>: mov 0x30(%edi),%ecx 0xc011ac33 <task_to_steal+275>: sub %ecx,%eax 0xc011ac35 <task_to_steal+277>: mov 0x44(%edi),%ecx 0xc011ac38 <task_to_steal+280>: divl 0xffffffcc(%ebp) 0xc011ac3b <task_to_steal+283>: mov 0xffffffc8(%ebp),%edx 
0xc011ac3e <task_to_steal+286>: cmp $0x64,%eax 0xc011ac41 <task_to_steal+289>: cmovl %eax,%esi 0xc011ac44 <task_to_steal+292>: cmp %ecx,0xffffffec(%ebp) 0xc011ac47 <task_to_steal+295>: lea 0x64(%esi),%eax 0xc011ac4a <task_to_steal+298>: cmove %eax,%esi 0xc011ac4d <task_to_steal+301>: mov 0x4(%edx),%eax 0xc011ac50 <task_to_steal+304>: lea 0xffffff9c(%esi),%edx 0xc011ac53 <task_to_steal+307>: mov 0xc(%eax),%eax 0xc011ac56 <task_to_steal+310>: mov 0xc034afe0(,%eax,4),%eax 0xc011ac5d <task_to_steal+317>: sar $0x4,%eax 0xc011ac60 <task_to_steal+320>: cmp %eax,%ecx 0xc011ac62 <task_to_steal+322>: cmove %edx,%esi 0xc011ac65 <task_to_steal+325>: cmp 0xffffffdc(%ebp),%esi 0xc011ac68 <task_to_steal+328>: jle 0xc011ac70 <task_to_steal+336> 0xc011ac6a <task_to_steal+330>: mov %esi,0xffffffdc(%ebp) 0xc011ac6d <task_to_steal+333>: mov %edi,0xffffffe8(%ebp) 0xc011ac70 <task_to_steal+336>: mov (%ebx),%ebx 0xc011ac72 <task_to_steal+338>: cmp 0xffffffe0(%ebp),%ebx 0xc011ac75 <task_to_steal+341>: jne 0xc011abf0 <task_to_steal+208> 0xc011ac7b <task_to_steal+347>: incl 0xfffffff0(%ebp) 0xc011ac7e <task_to_steal+350>: jmp 0xc011ab70 <task_to_steal+80> 0xc011ac83 <task_to_steal+355>: mov %edi,(%esp,1) 0xc011ac86 <task_to_steal+358>: call 0xc0118070 <upd_node_mem> 0xc011ac8b <task_to_steal+363>: mov 0x8(%ebp),%edx 0xc011ac8e <task_to_steal+366>: mov 0xc034b4e0,%eax 0xc011ac93 <task_to_steal+371>: mov %eax,0xffffffcc(%ebp) 0xc011ac96 <task_to_steal+374>: mov 0x14(%edx),%edx 0xc011ac99 <task_to_steal+377>: mov %edx,0xffffffc8(%ebp) 0xc011ac9c <task_to_steal+380>: jmp 0xc011ac24 <task_to_steal+260> 0xc011ac9e <task_to_steal+382>: mov 0x8(%ebp),%eax 0xc011aca1 <task_to_steal+385>: mov 0xffffffe4(%ebp),%edx 0xc011aca4 <task_to_steal+388>: cmp 0x20(%eax),%edx 0xc011aca7 <task_to_steal+391>: jne 0xc011acb4 <task_to_steal+404> 0xc011aca9 <task_to_steal+393>: mov 0x1c(%eax),%ecx 0xc011acac <task_to_steal+396>: mov %ecx,0xffffffe4(%ebp) 0xc011acaf <task_to_steal+399>: jmp 0xc011ab5a 
<task_to_steal+58> 0xc011acb4 <task_to_steal+404>: mov 0xffffffe8(%ebp),%eax 0xc011acb7 <task_to_steal+407>: add $0x30,%esp 0xc011acba <task_to_steal+410>: pop %ebx 0xc011acbb <task_to_steal+411>: pop %esi 0xc011acbc <task_to_steal+412>: pop %edi 0xc011acbd <task_to_steal+413>: pop %ebp 0xc011acbe <task_to_steal+414>: ret 0xc011acbf <task_to_steal+415>: mov 0xffffffd0(%ebp),%ecx 0xc011acc2 <task_to_steal+418>: bsf 0x10(%ecx),%eax 0xc011acc6 <task_to_steal+422>: sub $0xffffff80,%eax 0xc011acc9 <task_to_steal+425>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acce <task_to_steal+430>: bsf %eax,%eax 0xc011acd1 <task_to_steal+433>: add $0x40,%eax 0xc011acd4 <task_to_steal+436>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acd9 <task_to_steal+441>: bsf %eax,%eax 0xc011acdc <task_to_steal+444>: add $0x20,%eax 0xc011acdf <task_to_steal+447>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ace4 <task_to_steal+452>: bsf %eax,%eax 0xc011ace7 <task_to_steal+455>: jmp 0xc011abb9 <task_to_steal+153> 0xc011acec <task_to_steal+460>: mov 0xfffffff0(%ebp),%eax 0xc011acef <task_to_steal+463>: xor %esi,%esi 0xc011acf1 <task_to_steal+465>: mov 0xfffffff0(%ebp),%ecx 0xc011acf4 <task_to_steal+468>: mov 0xffffffd0(%ebp),%ebx 0xc011acf7 <task_to_steal+471>: sar $0x5,%eax 0xc011acfa <task_to_steal+474>: and $0x1f,%ecx 0xc011acfd <task_to_steal+477>: lea (%ebx,%eax,4),%edi 0xc011ad00 <task_to_steal+480>: je 0xc011ad2b <task_to_steal+523> 0xc011ad02 <task_to_steal+482>: mov (%edi),%eax 0xc011ad04 <task_to_steal+484>: shr %cl,%eax 0xc011ad06 <task_to_steal+486>: bsf %eax,%esi 0xc011ad09 <task_to_steal+489>: jne 0xc011ad10 <task_to_steal+496> 0xc011ad0b <task_to_steal+491>: mov $0x20,%esi 0xc011ad10 <task_to_steal+496>: mov $0x20,%eax 0xc011ad15 <task_to_steal+501>: sub %ecx,%eax 0xc011ad17 <task_to_steal+503>: cmp %eax,%esi 0xc011ad19 <task_to_steal+505>: jge 0xc011ad26 <task_to_steal+518> 0xc011ad1b <task_to_steal+507>: mov 0xfffffff0(%ebp),%edx 0xc011ad1e <task_to_steal+510>: lea (%edx,%esi,1),%eax 
0xc011ad21 <task_to_steal+513>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ad26 <task_to_steal+518>: mov %eax,%esi 0xc011ad28 <task_to_steal+520>: add $0x4,%edi 0xc011ad2b <task_to_steal+523>: mov 0xffffffd0(%ebp),%ecx 0xc011ad2e <task_to_steal+526>: mov %edi,%eax 0xc011ad30 <task_to_steal+528>: mov $0x8c,%edx 0xc011ad35 <task_to_steal+533>: mov %edi,%ebx 0xc011ad37 <task_to_steal+535>: sub %ecx,%eax 0xc011ad39 <task_to_steal+537>: shl $0x3,%eax 0xc011ad3c <task_to_steal+540>: sub %eax,%edx 0xc011ad3e <task_to_steal+542>: add $0x1f,%edx 0xc011ad41 <task_to_steal+545>: shr $0x5,%edx 0xc011ad44 <task_to_steal+548>: mov %edx,0xffffffd4(%ebp) 0xc011ad47 <task_to_steal+551>: mov %edx,%ecx 0xc011ad49 <task_to_steal+553>: xor %eax,%eax 0xc011ad4b <task_to_steal+555>: repz scas %es:(%edi),%eax 0xc011ad4d <task_to_steal+557>: je 0xc011ad55 <task_to_steal+565> 0xc011ad4f <task_to_steal+559>: lea 0xfffffffc(%edi),%edi 0xc011ad52 <task_to_steal+562>: bsf (%edi),%eax 0xc011ad55 <task_to_steal+565>: sub %ebx,%edi 0xc011ad57 <task_to_steal+567>: shl $0x3,%edi 0xc011ad5a <task_to_steal+570>: add %edi,%eax 0xc011ad5c <task_to_steal+572>: mov %eax,%edx 0xc011ad5e <task_to_steal+574>: mov 0xfffffff0(%ebp),%eax 0xc011ad61 <task_to_steal+577>: add %esi,%eax 0xc011ad63 <task_to_steal+579>: add %edx,%eax 0xc011ad65 <task_to_steal+581>: jmp 0xc011abb9 <task_to_steal+153> 0xc011ad6a <task_to_steal+586>: mov 0x8(%ebp),%ecx 0xc011ad6d <task_to_steal+589>: mov 0x1c(%ecx),%ecx 0xc011ad70 <task_to_steal+592>: jmp 0xc011acac <task_to_steal+396> 0xc011ad75 <task_to_steal+597>: nop 0xc011ad76 <task_to_steal+598>: lea 0x0(%esi),%esi 0xc011ad79 <task_to_steal+601>: lea 0x0(%edi,1),%edi End of assembler dump. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 23:26 ` Martin J. Bligh 2002-10-25 23:45 ` Martin J. Bligh @ 2002-10-26 0:02 ` Martin J. Bligh 1 sibling, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 0:02 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: linux-kernel

>> I thought this problem is well understood! For some reasons independent of
>> my patch you have to boot your machines with the "notsc" option. This
>> leaves the cache_decay_ticks variable initialized to zero which my patch
>> doesn't like. I'm trying to deal with this inside the patch but there is
>> still a small window when the variable is zero. In my opinion this needs
>> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
>> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
>> affinity which you absolutely need! So even if "notsc" is a legal option,
>> it should be fixed such that it doesn't leave your machine without cache
>> affinity. That would anyway give you a falsified behavior of the O(1)
>> scheduler.

> EIP is at task_to_steal+0x118/0x260

This turned out to be:

	weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;

So I guess that window is still biting you. I'll see if I can fix it properly.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
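[Editorial note: the crash pinned down above is a plain integer divide-by-zero in kernel mode. The snippet below is a minimal userspace model of the failing expression with a defensive clamp; the variable values, the task_weight() helper name, and the clamp itself are illustrative, not the actual scheduler fix that was merged.]

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's globals (values made up). */
static unsigned long jiffies = 1000;
static unsigned long cache_decay_ticks;  /* left at zero by the "notsc" window */

/* Model of the crashing calculation in task_to_steal(): dividing by
 * cache_decay_ticks raises a divide-error trap when it is still zero.
 * Clamping the divisor to 1 avoids the trap while preserving the
 * relative ordering of the weights. */
static unsigned long task_weight(unsigned long sleep_timestamp)
{
	unsigned long decay = cache_decay_ticks ? cache_decay_ticks : 1;
	return (jiffies - sleep_timestamp) / decay;
}
```

The real fix discussed in the thread is different: initialize cache_decay_ticks sensibly in arch/i386/kernel/smpboot.c even when "notsc" is used, so the window never exists.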
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5) 2002-10-25 8:15 ` Erich Focht 2002-10-25 23:26 ` Martin J. Bligh @ 2002-10-26 18:58 ` Martin J. Bligh 2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh 2 siblings, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 18:58 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, landley; +Cc: linux-kernel

>> I still haven't been able to get your scheduler to boot for about
>> the last month without crashing the system. Andrew says he has it
>> booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and
>> see how it looks. If the numbers look good for doing boring things
>> like kernel compile, SDET, etc, I'm happy.
>
> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

Oh, not sure if I ever replied to this or not. I don't *have* to boot with notsc, I just usually do. And it crashed either way, so it's a different problem (changing versions of gcc seems to perturb it too). BUT ... your new patches 1 and 2 don't have this problem. See followup email in a second.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-25 8:15 ` Erich Focht 2002-10-25 23:26 ` Martin J. Bligh 2002-10-26 18:58 ` Martin J. Bligh @ 2002-10-26 19:14 ` Martin J. Bligh 2002-10-27 18:16 ` Martin J. Bligh 2 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-26 19:14 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, mingo, habanero; +Cc: linux-kernel, lse-tech

>> From my point of view, the reason for focussing on this was that
>> your scheduler degraded the performance on my machine, rather than
>> boosting it. Half of that was the more complex stuff you added on
>> top ... it's a lot easier to start with something simple that works
>> and build on it, than fix something that's complex and doesn't work
>> well.
>
> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

OK, I went to your latest patches (just 1 and 2). And they worked! You've fixed the performance degradation problems for kernel compile (now a 14% improvement in systime), that core set works without further futzing about or crashing, with or without TSC, on either version of gcc ... congrats! It also produces the fastest system time for kernel compile I've ever seen ... this core set seems to be good (I'm still less than convinced about the further patches, but we can work on those one at a time now you've got it all broken out and modular).

Michael posted slightly different looking results for virgin 44 yesterday - the main difference between virgin 44 and 44-mm4 for this stuff is probably the per-cpu hot & cold pages (Ingo, this is like your original per-cpu pages).

All results are for a 16-way NUMA-Q (P3 700MHz, 2Mb cache), 16Gb RAM.
Kernbench:
                        Elapsed       User     System        CPU
2.5.44-mm4              19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum        19.422s   189.828s    40.204s    1196.2%
2.5.44-mm4-focht12      19.316s   189.514s    36.704s    1146.8%

Schedbench 4:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              32.45      49.47     129.86       0.82
2.5.44-mm4-hbaum        31.31      43.85     125.29       0.84
2.5.44-mm4-focht12      38.50      45.34     154.05       1.07

Schedbench 8:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              39.90      61.48     319.26       2.79
2.5.44-mm4-hbaum        32.63      46.56     261.10       1.99
2.5.44-mm4-focht12      35.56      46.57     284.53       1.97

Schedbench 16:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              62.99      93.59    1008.01       5.11
2.5.44-mm4-hbaum        49.78      76.71     796.68       4.43
2.5.44-mm4-focht12      51.94      61.43     831.26       4.68

Schedbench 32:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4              88.13     194.53    2820.54      11.52
2.5.44-mm4-hbaum        54.67     147.30    1749.77       7.91
2.5.44-mm4-focht12      55.43     119.49    1773.97       8.41

Schedbench 64:
                      Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4             159.92     653.79   10235.93      25.16
2.5.44-mm4-hbaum        65.20     300.58    4173.26      16.82
2.5.44-mm4-focht12      56.49     235.78    3615.71      18.05

There's a small degradation at the low end of schedbench (Erich's numa_test) in there ... would be nice to fix, but I'm less worried about that (where the machine is lightly loaded) than the other numbers. Kernbench is just gcc-2.95-4 compiling the 2.4.17 kernel doing a "make -j24 bzImage".

diffprofile 2.5.44-mm4 2.5.44-mm4-hbaum
(for kernbench, + got worse by adding the patch, - got better)

     184 vm_enough_memory
     154 d_lookup
      83 do_schedule
      75 page_add_rmap
      73 strnlen_user
      58 find_get_page
      52 flush_signal_handlers
...
     -61 pte_alloc_one
     -63 do_wp_page
     -85 .text.lock.file_table
     -96 __set_page_dirty_buffers
    -112 clear_page_tables
    -118 get_empty_filp
    -134 free_hot_cold_page
    -144 page_remove_rmap
    -150 __copy_to_user
    -213 zap_pte_range
    -217 buffered_rmqueue
    -875 __copy_from_user
   -1015 do_anonymous_page

diffprofile 2.5.44-mm4 2.5.44-mm4-focht12
(for kernbench, + got worse by adding the patch, - got better)

<nothing significantly degraded>
....
     -57 path_lookup
     -69 do_page_fault
     -73 vm_enough_memory
     -77 filemap_nopage
     -78 do_no_page
     -83 __set_page_dirty_buffers
     -83 __fput
     -84 do_schedule
     -97 find_get_page
    -106 file_move
    -115 free_hot_cold_page
    -115 clear_page_tables
    -130 d_lookup
    -147 atomic_dec_and_lock
    -157 page_add_rmap
    -197 buffered_rmqueue
    -236 zap_pte_range
    -264 get_empty_filp
    -271 __copy_to_user
    -464 page_remove_rmap
    -573 .text.lock.file_table
    -618 __copy_from_user
    -823 do_anonymous_page

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh @ 2002-10-27 18:16 ` Martin J. Bligh 2002-10-28 0:32 ` Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-27 18:16 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum, mingo, habanero; +Cc: linux-kernel, lse-tech

> OK, I went to your latest patches (just 1 and 2). And they worked!
> You've fixed the performance degradation problems for kernel compile
> (now a 14% improvement in systime), that core set works without
> further futzing about or crashing, with or without TSC, on either
> version of gcc ... congrats!

So I have a slight correction to make to the above ;-) Your patches do work just fine, no crashes any more. HOWEVER ... turns out I only had the first patch installed, not both. Silly mistake, but turns out to be very interesting.

So your second patch is the balance on exec stuff ... I've looked at it, and think it's going to be very expensive to do in practice, at least the simplistic "recalc everything on every exec" approach. It does benefit the low end schedbench results, but not the high end ones, and you can see the cost of your second patch in the system times of the kernbench.

In summary, I think I like the first patch alone better than the combination, but will have a play at making a cross between the two.
As I have very little context about the scheduler, would appreciate any help anyone would like to volunteer ;-)

Corrected results are:

Kernbench:
                         Elapsed       User     System        CPU
2.5.44-mm4               19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum         19.422s   189.828s    40.204s    1196.2%
2.5.44-mm4-focht-1        19.46s   189.838s    37.938s      1171%
2.5.44-mm4-focht-12       20.32s       190s      44.4s    1153.6%

Schedbench 4:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 32.45      49.47     129.86       0.82
2.5.44-mm4-hbaum           31.31      43.85     125.29       0.84
2.5.44-mm4-focht-1         38.61      45.15     154.48       1.06
2.5.44-mm4-focht-12        23.23      38.87      92.99       0.85

Schedbench 8:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 39.90      61.48     319.26       2.79
2.5.44-mm4-hbaum           32.63      46.56     261.10       1.99
2.5.44-mm4-focht-1         37.76      61.09     302.17       2.55
2.5.44-mm4-focht-12        28.40      34.43     227.25       2.09

Schedbench 16:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 62.99      93.59    1008.01       5.11
2.5.44-mm4-hbaum           49.78      76.71     796.68       4.43
2.5.44-mm4-focht-1         51.69      60.23     827.20       4.95
2.5.44-mm4-focht-12        51.24      60.86     820.08       4.23

Schedbench 32:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                 88.13     194.53    2820.54      11.52
2.5.44-mm4-hbaum           54.67     147.30    1749.77       7.91
2.5.44-mm4-focht-1         56.71     123.62    1815.12       7.92
2.5.44-mm4-focht-12        55.69     118.85    1782.25       7.28

Schedbench 64:
                         Elapsed  TotalUser   TotalSys    AvgUser
2.5.44-mm4                159.92     653.79   10235.93      25.16
2.5.44-mm4-hbaum           65.20     300.58    4173.26      16.82
2.5.44-mm4-focht-1         55.60     232.36    3558.98      17.61
2.5.44-mm4-focht-12        56.03     234.45    3586.46      15.76

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-27 18:16 ` Martin J. Bligh @ 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh ` (3 more replies) 0 siblings, 4 replies; 33+ messages in thread From: Erich Focht @ 2002-10-28 0:32 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Sunday 27 October 2002 19:16, Martin J. Bligh wrote:
> > OK, I went to your latest patches (just 1 and 2). And they worked!
> > You've fixed the performance degradation problems for kernel compile
> > (now a 14% improvement in systime), that core set works without
> > further futzing about or crashing, with or without TSC, on either
> > version of gcc ... congrats!
>
> So I have a slight correction to make to the above ;-) Your patches
> do work just fine, no crashes any more. HOWEVER ... turns out I only
> had the first patch installed, not both. Silly mistake, but turns out
> to be very interesting.
>
> So your second patch is the balance on exec stuff ... I've looked at
> it, and think it's going to be very expensive to do in practice, at
> least the simplistic "recalc everything on every exec" approach. It
> does benefit the low end schedbench results, but not the high end ones,
> and you can see the cost of your second patch in the system times of
> the kernbench.

This is interesting, indeed. As you might have seen from the tests I posted on LKML I could not see that effect on our IA64 NUMA machine. Which raises the question: is it expensive to recalculate the load when doing an exec (which I should also see), or is the strategy of equally distributing the jobs across the nodes bad for certain load+architecture combinations? As I'm not seeing the effect, maybe you could do the following experiment: in sched_best_node() keep only the "while" loop at the beginning. This leads to a cheap selection of the next node, just a simple round robin.
Regarding the schedbench results: are they averages over multiple runs? The numa_test needs to be repeated a few times to get statistically meaningful results.

Thanks,
Erich

> In summary, I think I like the first patch alone better than the
> combination, but will have a play at making a cross between the two.
> As I have very little context about the scheduler, would appreciate
> any help anyone would like to volunteer ;-)
>
> Corrected results are:
>
> Kernbench:
>                          Elapsed       User     System        CPU
> 2.5.44-mm4               19.676s   192.794s    42.678s    1197.4%
> 2.5.44-mm4-hbaum         19.422s   189.828s    40.204s    1196.2%
> 2.5.44-mm4-focht-1        19.46s   189.838s    37.938s      1171%
> 2.5.44-mm4-focht-12       20.32s       190s      44.4s    1153.6%
>
> Schedbench 4:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 32.45      49.47     129.86       0.82
> 2.5.44-mm4-hbaum           31.31      43.85     125.29       0.84
> 2.5.44-mm4-focht-1         38.61      45.15     154.48       1.06
> 2.5.44-mm4-focht-12        23.23      38.87      92.99       0.85
>
> Schedbench 8:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 39.90      61.48     319.26       2.79
> 2.5.44-mm4-hbaum           32.63      46.56     261.10       1.99
> 2.5.44-mm4-focht-1         37.76      61.09     302.17       2.55
> 2.5.44-mm4-focht-12        28.40      34.43     227.25       2.09
>
> Schedbench 16:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 62.99      93.59    1008.01       5.11
> 2.5.44-mm4-hbaum           49.78      76.71     796.68       4.43
> 2.5.44-mm4-focht-1         51.69      60.23     827.20       4.95
> 2.5.44-mm4-focht-12        51.24      60.86     820.08       4.23
>
> Schedbench 32:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                 88.13     194.53    2820.54      11.52
> 2.5.44-mm4-hbaum           54.67     147.30    1749.77       7.91
> 2.5.44-mm4-focht-1         56.71     123.62    1815.12       7.92
> 2.5.44-mm4-focht-12        55.69     118.85    1782.25       7.28
>
> Schedbench 64:
>                          Elapsed  TotalUser   TotalSys    AvgUser
> 2.5.44-mm4                159.92     653.79   10235.93      25.16
> 2.5.44-mm4-hbaum           65.20     300.58    4173.26      16.82
> 2.5.44-mm4-focht-1         55.60     232.36    3558.98      17.61
> 2.5.44-mm4-focht-12        56.03     234.45    3586.46      15.76

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht @ 2002-10-27 23:52 ` Martin J. Bligh 2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum 2002-10-28 0:31 ` Martin J. Bligh ` (2 subsequent siblings) 3 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-27 23:52 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations?

I suspect the former. Bouncing a whole pile of cachelines every time would be much more expensive for me than it would for you, and kernbench will be heavy on exec.

> As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning. This
> leads to a cheap selection of the next node, just a simple round robin.

Maybe I could just send you the profiles instead ;-) If I have more time, I'll try your suggestion. I'm trying Michael's balance_exec on top of your patch 1 at the moment, but I'm somewhat confused by his code for sched_best_cpu.
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu, cur_cpu, node;
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
+	if (++node >= numnodes)
+		node = 0;
+
+	cur_cpu = __node_to_first_cpu(node);
+	minload = cpu_rq(best_cpu)->nr_running;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(cur_cpu))
+			continue;
+
+		if (minload > cpu_rq(cur_cpu)->nr_running) {
+			minload = cpu_rq(cur_cpu)->nr_running;
+			best_cpu = cur_cpu;
+		}
+		if (++cur_cpu >= NR_CPUS)
+			cur_cpu = 0;
+	}
+	__get_cpu_var(last_exec_cpu) = best_cpu;
+	return best_cpu;
+}

Michael, the way I read the NR_CPUS loop, you walk every cpu in the system, and take the best from all of them. In which case what's the point of the last_exec_cpu stuff? On the other hand, I changed your NR_CPUS to 4 (ie just walk the cpus in that node), and it got worse. So perhaps I'm just misreading your code ... and it does seem significantly cheaper to execute than Erich's.

Erich, on the other hand, your code does this:

+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (pooldata_is_locked())
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	}
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}

which seems to me to walk every runqueue in the system (in sched_best_node), then walk one node's worth all over again in sched_best_cpu .... doesn't it? Again, I may be misreading this ... haven't looked at the scheduler much. But I can't help feeling some sort of lazy evaluation is in order ....

And what's this doing?

+	do {
+		/* atomic_inc_return is not implemented on all archs [EF] */
+		atomic_inc(&sched_node);
+		best_node = atomic_read(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));

I really don't think putting a global atomic in there is going to be cheap ....
> Regarding the schedbench results: are they averages over multiple runs?
> The numa_test needs to be repeated a few times to get statistically
> meaningful results.

No. But I don't have 2 hours to run each set of tests either. I did a couple of runs, and didn't see huge variances. Seems stable enough.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-27 23:52 ` Martin J. Bligh @ 2002-10-28 0:55 ` Michael Hohnbaum 2002-10-28 4:23 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Michael Hohnbaum @ 2002-10-28 0:55 UTC (permalink / raw) To: Martin J. Bligh Cc: Erich Focht, mingo, Andrew Theurer, linux-kernel, lse-tech

> I'm trying Michael's balance_exec on top of your patch 1 at the
> moment, but I'm somewhat confused by his code for sched_best_cpu.
>
> +static int sched_best_cpu(struct task_struct *p)
> +{
> +	int i, minload, best_cpu, cur_cpu, node;
> +	best_cpu = task_cpu(p);
> +	if (cpu_rq(best_cpu)->nr_running <= 2)
> +		return best_cpu;
> +
> +	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
> +	if (++node >= numnodes)
> +		node = 0;
> +
> +	cur_cpu = __node_to_first_cpu(node);
> +	minload = cpu_rq(best_cpu)->nr_running;
> +
> +	for (i = 0; i < NR_CPUS; i++) {
> +		if (!cpu_online(cur_cpu))
> +			continue;
> +
> +		if (minload > cpu_rq(cur_cpu)->nr_running) {
> +			minload = cpu_rq(cur_cpu)->nr_running;
> +			best_cpu = cur_cpu;
> +		}
> +		if (++cur_cpu >= NR_CPUS)
> +			cur_cpu = 0;
> +	}
> +	__get_cpu_var(last_exec_cpu) = best_cpu;
> +	return best_cpu;
> +}
>
> Michael, the way I read the NR_CPUS loop, you walk every cpu
> in the system, and take the best from all of them. In which case
> what's the point of the last_exec_cpu stuff? On the other hand,
> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
> and it got worse. So perhaps I'm just misreading your code ...
> and it does seem significantly cheaper to execute than Erich's.

You are reading it correctly. The only thing that the last_exec_cpu does is to help spread the load across nodes. Without it, what was happening is that node 0 would get completely loaded, then node 1, etc. With it, in cases where one or more runqueues have the same length, the one chosen tends to get spread out a bit. Not the greatest solution, but it helps.
-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486

^ permalink raw reply	[flat|nested] 33+ messages in thread
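[Editorial note: the tie-breaking Michael describes above can be modelled in a few lines. The sketch below is a simplified userspace model: start the minimum-load scan just past the last chosen cpu, so equal-length runqueues are picked round-robin instead of always favouring cpu 0. The real patch starts at the first cpu of the node after last_exec_cpu's node and reads the kernel's runqueues; pick_cpu() and the arrays here are invented for illustration.]

```c
#include <assert.h>

#define NCPUS 8

/* Illustrative per-cpu runqueue lengths; not kernel data structures. */
static int nr_running[NCPUS];
static int last_exec_cpu;

/* Scan all cpus for the shortest runqueue, but begin just past the
 * previously chosen cpu. When several runqueues tie for the minimum,
 * the first one encountered wins, so the rotating start point spreads
 * ties across cpus instead of piling everything onto cpu 0. */
static int pick_cpu(void)
{
	int i, cur = (last_exec_cpu + 1) % NCPUS;
	int best = cur, minload = nr_running[cur];

	for (i = 0; i < NCPUS; i++) {
		if (nr_running[cur] < minload) {
			minload = nr_running[cur];
			best = cur;
		}
		cur = (cur + 1) % NCPUS;
	}
	last_exec_cpu = best;
	return best;
}
```

With all runqueues empty, successive calls return 1, 2, 3, ... rather than 0 every time, which is exactly the spreading effect described above.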
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum @ 2002-10-28 4:23 ` Martin J. Bligh 0 siblings, 0 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 4:23 UTC (permalink / raw) To: Michael Hohnbaum Cc: Erich Focht, mingo, Andrew Theurer, linux-kernel, lse-tech

>> Michael, the way I read the NR_CPUS loop, you walk every cpu
>> in the system, and take the best from all of them. In which case
>> what's the point of the last_exec_cpu stuff? On the other hand,
>> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
>> and it got worse. So perhaps I'm just misreading your code ...
>> and it does seem significantly cheaper to execute than Erich's.
>
> You are reading it correct. The only thing that the last_exec_cpu
> does is to help spread the load across nodes. Without that what was
> happening is that node 0 would get completely loaded, then node 1,
> etc. With it, in cases where one or more runqueues have the same
> length, the one chosen tends to get spread out a bit. Not the
> greatest solution, but it helps.

OK. I made a simple boring optimisation to your patch. Shaved almost a second off system time for kernbench, and seems idiotproof to me, shouldn't change anything apart from touching fewer runqueues: if we find a runqueue with nr_running == 0, stop searching ... we ain't going to find anything better ;-)

Kernbench:
                                Elapsed       User     System        CPU
2.5.44-mm4                      19.676s   192.794s    42.678s    1197.4%
2.5.44-mm4-hbaum-1              19.746s   189.232s    38.354s    1152.2%
2.5.44-mm4-hbaum-12             19.322s   190.176s    40.354s    1192.6%
2.5.44-mm4-hbaum-12-firstzero   19.292s    189.66s    39.428s    1187.4%

Patch is probably space-eaten, so just whack it in by hand.
--- 2.5.44-mm4-hbaum-12/kernel/sched.c	2002-10-27 19:54:25.000000000 -0800
+++ 2.5.44-mm4-hbaum-12-first_low/kernel/sched.c	2002-10-27 16:42:10.000000000 -0800
@@ -2206,6 +2206,8 @@
 		if (minload > cpu_rq(cur_cpu)->nr_running) {
 			minload = cpu_rq(cur_cpu)->nr_running;
 			best_cpu = cur_cpu;
+			if (minload == 0)
+				break;
 		}
 		if (++cur_cpu >= NR_CPUS)
 			cur_cpu = 0;

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh @ 2002-10-28 0:31 ` Martin J. Bligh 2002-10-28 16:34 ` Erich Focht 2002-10-28 0:46 ` Martin J. Bligh 2002-10-28 7:16 ` Martin J. Bligh 3 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 0:31 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

OK, so I'm trying to read your patch 1, fairly unsuccessfully (seems to be a lot more complex than Michael's).

Can you explain pool_lock? It does actually seem to work, but it's rather confusing ....

build_pools() has a comment above it saying:

+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */

But then you promptly call pooldata_lock inside build_pools anyway ... looks like it's just a naff comment, but doesn't help much.

Leaving aside the acknowledged mind-boggling ugliness of pooldata_lock(), what exactly is this lock protecting, and when? The only thing that actually calls pooldata_lock is build_pools, right? And the only other thing that looks at it is sched_balance_exec via pooldata_is_locked ... can that happen before build_pools (seems like you're in deep trouble if it does anyway, as it'll just block). If you really still need to do this, RCU is now in the kernel ;-) If not, can we just chuck all that stuff?

M.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:31 ` Martin J. Bligh @ 2002-10-28 16:34 ` Erich Focht 2002-10-28 16:57 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 16:34 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 01:31, Martin J. Bligh wrote:
> OK, so I'm trying to read your patch 1, fairly unsuccessfully
> (seems to be a lot more complex than Michael's).
>
> Can you explain pool_lock? It does actually seem to work, but
> it's rather confusing ....

The pool data is needed to be able to loop over only the CPUs of one node. I'm convinced we'll need to do that sometime, no matter how simple the core of the NUMA scheduler is. The pool_lock is protecting that data while it is built. This could happen more often in the future, if somebody starts hotplugging CPUs.

> build_pools() has a comment above it saying:
>
> +/*
> + * Call pooldata_lock() before calling this function and
> + * pooldata_unlock() after!
> + */
>
> But then you promptly call pooldata_lock inside build_pools
> anyway ... looks like it's just a naff comment, but doesn't
> help much.

Sorry, the comment came from a former version...

> just block). If you really still need to do this, RCU is now
> in the kernel ;-) If not, can we just chuck all that stuff?

I'm preparing a core patch which doesn't need the pool_lock. I'll send it out today.

Regards, Erich

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 16:34 ` Erich Focht @ 2002-10-28 16:57 ` Martin J. Bligh 2002-10-28 17:26 ` Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 16:57 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech > The pool data is needed to be able to loop over the CPUs of one node, > only. I'm convinced we'll need to do that sometime, no matter how simple > the core of the NUMA scheduler is. Hmmm ... is using node_to_cpumask from the topology stuff, then looping over that bitmask insufficient? > The pool_lock is protecting that data while it is built. This can happen > in future more often, if somebody starts hotplugging CPUs. Heh .... when someone actually does that, we'll have a lot more problems than just this to solve. Would be nice to keep this stuff simple for now, if possible. > Sorry, the comment came from a former version... No problem, I suspected that was all it was. >> just block). If you really still need to do this, RCU is now >> in the kernel ;-) If not, can we just chuck all that stuff? > > I'm preparing a core patch which doesn't need the pool_lock. I'll send it > out today. Cool! Thanks, M. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 16:57 ` Martin J. Bligh @ 2002-10-28 17:26 ` Erich Focht 2002-10-28 17:35 ` Martin J. Bligh 0 siblings, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 17:26 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech [-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On Monday 28 October 2002 17:57, Martin J. Bligh wrote:
> > I'm preparing a core patch which doesn't need the pool_lock. I'll send it
> > out today.
>
> Cool! Thanks,

OK, here it comes. The core doesn't use the loop_over_node() macro any more. There's one big loop over the CPUs for computing node loads and the most loaded CPUs in find_busiest_queue. The call to build_pools() isn't critical any more. Functionality is the same as in the previous patch (i.e. steal delays, ranking of task_to_steal, etc...). I kept the loop_over_node() macro for compatibility reasons with the additional patches. You might need to replace in the additional patches:

	numpools       -> numpools()
	pool_nr_cpus[] -> pool_ncpus()

I'm puzzled about the initial load balancing impact and have to think about the results I've seen from you so far... In the environments I am used to, the frequency of exec syscalls is rather low, therefore I didn't care too much about the sched_balance_exec performance and preferred to try harder to achieve good distribution across the nodes.
Regards, Erich [-- Attachment #2: 01-numa_sched_core-2.5.39-12b.patch --] [-- Type: text/x-diff, Size: 16562 bytes --] diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002 +++ b/arch/i386/kernel/smpboot.c Mon Oct 28 10:15:28 2002 @@ -1194,6 +1194,9 @@ int __devinit __cpu_up(unsigned int cpu) void __init smp_cpus_done(unsigned int max_cpus) { zap_low_mappings(); +#ifdef CONFIG_NUMA + build_pools(); +#endif } void __init smp_intr_init() diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c --- a/arch/ia64/kernel/smpboot.c Tue Oct 22 15:46:38 2002 +++ b/arch/ia64/kernel/smpboot.c Mon Oct 28 10:15:28 2002 @@ -397,7 +397,7 @@ unsigned long cache_decay_ticks; /* # of static void smp_tune_scheduling (void) { - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */ printk("task migration cache decay timeout: %ld msecs.\n", (cache_decay_ticks + 1) * 1000 / HZ); @@ -508,6 +508,9 @@ smp_cpus_done (unsigned int dummy) printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n", num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100); +#ifdef CONFIG_NUMA + build_pools(); +#endif } int __devinit diff -urNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h Tue Oct 8 15:03:54 2002 +++ b/include/linux/sched.h Mon Oct 28 12:12:22 2002 @@ -22,6 +22,7 @@ extern unsigned long event; #include <asm/mmu.h> #include <linux/smp.h> +#include <asm/topology.h> #include <linux/sem.h> #include <linux/signal.h> #include <linux/securebits.h> @@ -167,7 +168,6 @@ extern void update_one_process(struct ta extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; - #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); @@ -457,6 
+457,9 @@ extern void set_cpus_allowed(task_t *p, # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA +extern void build_pools(void); +#endif extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); diff -urNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c Fri Sep 27 23:50:27 2002 +++ b/kernel/sched.c Mon Oct 28 16:59:23 2002 @@ -154,6 +154,9 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + unsigned long wait_time; + int wait_node; + } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -173,6 +176,62 @@ static struct runqueue runqueues[NR_CPUS # define task_running(rq, p) ((rq)->curr == (p)) #endif +#define cpu_to_node(cpu) __cpu_to_node(cpu) + +#ifdef CONFIG_NUMA +/* Number of CPUs per pool: sane values until all CPUs are up */ +int _pool_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS }; +int pool_cpus[NR_CPUS]; /* list of cpus sorted by node number */ +int pool_ptr[MAX_NUMNODES+1]; /* pointer into the sorted list */ +unsigned long pool_mask[MAX_NUMNODES]; +#define numpools() numnodes +#define pool_ncpus(pool) _pool_nr_cpus[pool] + +#define POOL_DELAY_IDLE (1*HZ/1000) +#define POOL_DELAY_BUSY (20*HZ/1000) + +#define loop_over_node(i,cpu,n) \ + for(i=pool_ptr[n], cpu=pool_cpus[i]; i<pool_ptr[n+1]; \ + i++, cpu=pool_cpus[i]) + + +/* + * Build pool data after all CPUs have come up. 
+ */ +void build_pools(void) +{ + int n, cpu, ptr; + unsigned long mask; + + ptr=0; + for (n=0; n<numnodes; n++) { + mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map; + pool_ptr[n] = ptr; + for (cpu=0; cpu<NR_CPUS; cpu++) + if (mask & (1UL << cpu)) + pool_cpus[ptr++] = cpu; + pool_ncpus(n) = ptr - pool_ptr[n];; + } + printk("CPU pools : %d\n",numpools()); + for (n=0;n<numpools();n++) + printk("pool %d : %lx\n",n,pool_mask[n]); + if (cache_decay_ticks==1) + printk("WARNING: cache_decay_ticks=1, probably unset by platform. Running with poor CPU affinity!\n"); +#ifdef CONFIG_X86_NUMAQ + /* temporarilly set this to a reasonable value for NUMAQ */ + cache_decay_ticks=8; +#endif +} + +#else +#define numpools() 1 +#define pool_ncpus(pool) num_online_cpus() +#define POOL_DELAY_IDLE 0 +#define POOL_DELAY_BUSY 0 +#define loop_over_node(i,cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++) +#endif + + /* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without @@ -632,121 +691,146 @@ static inline unsigned int double_lock_b } /* - * find_busiest_queue - find the busiest runqueue. - */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) -{ - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their node. + * + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. + * + * This routine implements node balancing by delaying steals from remote + * nodes more if the own node is (within margins) averagely loaded. The + * most loaded node is remembered as well as the time (jiffies). 
In the + * following calls to the load_balancer the time is compared with + * POOL_DELAY_BUSY (if load is around the average) or POOL_DELAY_IDLE (if own + * node is unloaded) if the most loaded node didn't change. This gives less + * loaded nodes the chance to approach the average load but doesn't exclude + * busy nodes from stealing (just in case the cpus_allowed mask isn't good + * for the idle nodes). + * This concept can be extended easilly to more than two levels (multi-level + * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance + * dependent steal delays. + * + * <efocht@ess.nec.de> + */ +static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running) +{ + runqueue_t *busiest = NULL, *this_rq = cpu_rq(this_cpu), *src_rq; + int best_cpu, this_pool, max_pool_load, pool_idx; + int pool_load[MAX_NUMNODES], cpu_load[MAX_NUMNODES]; + int cpu_idx[MAX_NUMNODES]; + int cpu, pool, load, avg_load, i, steal_delay; + + /* Need at least ~25% imbalance to trigger balancing. */ +#define CPUS_BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4)) - /* - * We search all runqueues to find the most busy one. - * We do this lockless to reduce cache-bouncing overhead, - * we re-check the 'best' source CPU later on again, with - * the lock held. - * - * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. - * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. - * - * The 'current runqueue length' is a statistical maximum variable, - * for that one we take the longer one - to avoid fluctuations in - * the other direction. So for a load-balance to happen it needs - * stable long runqueue on the target CPU and stable short runqueue - * on the local runqueue. 
- * - * We make an exception if this CPU is about to become idle - in - * that case we are less picky about moving a task across CPUs and - * take what can be taken. - */ if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) - nr_running = this_rq->nr_running; + *nr_running = this_rq->nr_running; else - nr_running = this_rq->prev_nr_running[this_cpu]; - - busiest = NULL; - max_load = 1; - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) - continue; + *nr_running = this_rq->prev_nr_running[this_cpu]; - rq_src = cpu_rq(i); - if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) - load = rq_src->nr_running; + /* compute all pool loads and save their max cpu loads */ + for (pool=0; pool<MAX_NUMNODES; pool++) + cpu_load[pool] = -1; + + for (cpu=0; cpu<NR_CPUS; cpu++) { + if (!cpu_online(cpu)) continue; + pool = cpu_to_node(cpu); + src_rq = cpu_rq(cpu); + if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu])) + load = src_rq->nr_running; else - load = this_rq->prev_nr_running[i]; - this_rq->prev_nr_running[i] = rq_src->nr_running; + load = this_rq->prev_nr_running[cpu]; + this_rq->prev_nr_running[cpu] = src_rq->nr_running; - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + pool_load[pool] += load; + if (load > cpu_load[pool]) { + cpu_load[pool] = load; + cpu_idx[pool] = cpu; } } - if (likely(!busiest)) - goto out; + this_pool = cpu_to_node(this_cpu); + best_cpu = cpu_idx[this_pool]; + if (best_cpu != this_cpu) + if (!CPUS_BALANCED(cpu_load[this_pool],*nr_running)) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; + goto out; + } +#ifdef CONFIG_NUMA - *imbalance = (max_load - nr_running) / 2; +#define POOLS_BALANCED(comp,this) (((comp) -(this)) < 50) + avg_load = pool_load[this_pool]; + pool_load[this_pool] = max_pool_load = + pool_load[this_pool]*100/pool_ncpus(this_pool); + pool_idx = this_pool; + for (i = 1; i < numpools(); i++) { + pool = (i + this_pool) % numpools(); + 
avg_load += pool_load[pool]; + pool_load[pool]=pool_load[pool]*100/pool_ncpus(pool); + if (pool_load[pool] > max_pool_load) { + max_pool_load = pool_load[pool]; + pool_idx = pool; + } + } - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; + best_cpu = (pool_idx==this_pool) ? -1 : cpu_idx[pool_idx]; + /* Exit if not enough imbalance on any remote node. */ + if ((best_cpu < 0) || (max_pool_load <= 100) || + POOLS_BALANCED(max_pool_load,pool_load[this_pool])) { + this_rq->wait_node = -1; goto out; } - - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; + avg_load = avg_load*100/num_online_cpus(); + /* Wait longer before stealing if own pool's load is average. */ + if (POOLS_BALANCED(avg_load,pool_load[this_pool])) + steal_delay = POOL_DELAY_BUSY; + else + steal_delay = POOL_DELAY_IDLE; + /* if we have a new most loaded node: just mark it */ + if (this_rq->wait_node != pool_idx) { + this_rq->wait_node = pool_idx; + this_rq->wait_time = jiffies; + goto out; + } else + /* old most loaded node: check if waited enough */ + if (jiffies - this_rq->wait_time < steal_delay) + goto out; + + if ((best_cpu >= 0) && + (!CPUS_BALANCED(cpu_load[pool_idx],*nr_running))) { + busiest = cpu_rq(best_cpu); + this_rq->wait_node = -1; } -out: +#endif + out: return busiest; } /* - * pull_task - move a task from a remote runqueue to the local runqueue. - * Both runqueues must be locked. + * Find a task to steal from the busiest RQ. The busiest->lock must be held + * while calling this routine. 
*/ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu) { - dequeue_task(p, src_array); - src_rq->nr_running--; - set_task_cpu(p, this_cpu); - this_rq->nr_running++; - enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); -} - -/* - * Current runqueue is empty, or rebalance tick: if there is an - * inbalance (current runqueue is too short) then pull from - * busiest runqueue(s). - * - * We call this with the current runqueue locked, - * irqs disabled. - */ -static void load_balance(runqueue_t *this_rq, int idle) -{ - int imbalance, idx, this_cpu = smp_processor_id(); - runqueue_t *busiest; + int idx; + task_t *next = NULL, *tmp; prio_array_t *array; struct list_head *head, *curr; - task_t *tmp; + int weight, maxweight=0; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); - if (!busiest) - goto out; + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed. + */ + +#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ + ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + p != rq->curr && \ + ((p)->cpus_allowed & (1UL<<(this_cpu)))) /* * We first consider expired tasks. 
Those will likely not be @@ -772,7 +856,7 @@ skip_bitmap: array = busiest->active; goto new_array; } - goto out_unlock; + goto out; } head = array->queue + idx; @@ -780,33 +864,72 @@ skip_bitmap: skip_queue: tmp = list_entry(curr, task_t, run_list); + if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { + weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks; + if (weight > maxweight) { + maxweight = weight; + next = tmp; + } + } + curr = curr->next; + if (curr != head) + goto skip_queue; + idx++; + goto skip_bitmap; + + out: + return next; +} + +/* + * pull_task - move a task from a remote runqueue to the local runqueue. + * Both runqueues must be locked. + */ +static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +{ + dequeue_task(p, src_array); + src_rq->nr_running--; + set_task_cpu(p, this_cpu); + this_rq->nr_running++; + enqueue_task(p, this_rq->active); /* - * We do not migrate tasks that are: - * 1) running (obviously), or - * 2) cannot be migrated to this CPU due to cpus_allowed, or - * 3) are cache-hot on their current CPU. + * Note that idle threads have a prio of MAX_PRIO, for this test + * to be always true for them. */ + if (p->prio < this_rq->curr->prio) + set_need_resched(); +} -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) - - curr = curr->prev; - - if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } +/* + * Current runqueue is empty, or rebalance tick: if there is an + * inbalance (current runqueue is too short) then pull from + * busiest runqueue(s). + * + * We call this with the current runqueue locked, + * irqs disabled. 
+ */ +static void load_balance(runqueue_t *this_rq, int idle) +{ + int nr_running, this_cpu = task_cpu(this_rq->curr); + task_t *tmp; + runqueue_t *busiest; + + busiest = find_busiest_queue(this_cpu, idle, &nr_running); + if (!busiest) + goto out; + + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running + 1) + goto out_unlock; + + tmp = task_to_steal(busiest, this_cpu); + if (!tmp) + goto out_unlock; + pull_task(busiest, tmp->array, tmp, this_rq, this_cpu); out_unlock: spin_unlock(&busiest->lock); out: @@ -819,10 +942,10 @@ out: * frequency and balancing agressivity depends on whether the CPU is * idle or not. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on * systems with HZ=100, every 10 msecs.) */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) static inline void idle_tick(runqueue_t *rq) @@ -2027,7 +2150,8 @@ static int migration_thread(void * data) spin_unlock_irqrestore(&rq->lock, flags); p = req->task; - cpu_dest = __ffs(p->cpus_allowed); + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map); + rq_dest = cpu_rq(cpu_dest); repeat: cpu_src = task_cpu(p); @@ -2130,6 +2254,8 @@ void __init sched_init(void) __set_bit(MAX_PRIO, array->bitmap); } } + if (cache_decay_ticks) + cache_decay_ticks=1; /* * We have to do a little magic to get the first * thread right in SMP mode. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 17:26 ` Erich Focht @ 2002-10-28 17:35 ` Martin J. Bligh 2002-10-29 0:07 ` [Lse-tech] " Erich Focht 0 siblings, 1 reply; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 17:35 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> I'm puzzled about the initial load balancing impact and have to think
> about the results I've seen from you so far... In the environments I am
> used to, the frequency of exec syscalls is rather low, therefore I didn't
> care too much about the sched_balance_exec performance and preferred to
> try harder to achieve good distribution across the nodes.

OK, but take a look at Michael's second patch. It still looks at nr_running on every queue in the system (with some slightly strange code to make a rotating choice of nodes in the case of equality), so should still be able to make the best decision .... *but* it seems to be much cheaper to execute. Not sure why at this point, given the last results I sent you last night ;-)

M.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 17:35 ` Martin J. Bligh @ 2002-10-29 0:07 ` Erich Focht 0 siblings, 0 replies; 33+ messages in thread From: Erich Focht @ 2002-10-29 0:07 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:35, Martin J. Bligh wrote:
> > I'm puzzled about the initial load balancing impact and have to think
> > about the results I've seen from you so far... In the environments I am
> > used to, the frequency of exec syscalls is rather low, therefore I didn't
> > care too much about the sched_balance_exec performance and preferred to
> > try harder to achieve good distribution across the nodes.
>
> OK, but take a look at Michael's second patch. It still looks at
> nr_running on every queue in the system (with some slightly strange
> code to make a rotating choice of nodes in the case of equality),
> so should still be able to make the best decision .... *but* it
> seems to be much cheaper to execute. Not sure why at this point,
> given the last results I sent you last night ;-)

Yes, I like it! I needed some time to understand that the per_cpu variables can spread the exec'd tasks across the nodes as well as the atomic sched_node. Sure, I'd like to select the least loaded node instead of the least loaded CPU. It may well be that you have just created 10 threads on a node (by fork, so they are still on their original CPU) and have an idle CPU in the same node (which hasn't yet stolen the newly created tasks). Suppose your instant load looks like this:

node 0: cpu0: 1,  cpu1: 1, cpu2: 1, cpu3: 1
node 1: cpu4: 10, cpu5: 0, cpu6: 1, cpu7: 1

If you exec on cpu0 before cpu5 managed to steal something from cpu4, you'll aim for cpu5. This would just increase the node imbalance and force more of the threads on cpu4 to move to node 0, which is maybe bad for them. Just an example...
If you start considering non-trivial cpus_allowed masks, you might get more of these cases. We could take this as a design target for the initial load balancer and keep the fastest version we currently have for the benchmarks we currently use (Michael's). Regards, Erich ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:32 ` Erich Focht 2002-10-27 23:52 ` Martin J. Bligh 2002-10-28 0:31 ` Martin J. Bligh @ 2002-10-28 0:46 ` Martin J. Bligh 2002-10-28 17:11 ` Erich Focht 2002-10-28 17:38 ` Erich Focht 2002-10-28 7:16 ` Martin J. Bligh 3 siblings, 2 replies; 33+ messages in thread From: Martin J. Bligh @ 2002-10-28 0:46 UTC (permalink / raw) To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

OK, so I tried Michael's without the balance_exec code as well, then Erich's main patch with Michael's balance_exec (which seems to be cheaper to calculate). Turns out I was actually running an older version of Michael's patch .... with his latest stuff it actually seems to perform better pretty much across the board (comparing 2.5.44-mm4-focht-12 and 2.5.44-mm4-hbaum-12). And it's also a lot simpler.

Erich, what does all the pool stuff actually buy us over what Michael is doing? Seems to be rather more complex, but maybe it's useful for something we're just not measuring here?
2.5.44-mm4            Virgin
2.5.44-mm4-focht-1    Focht main
2.5.44-mm4-hbaum-1    Hbaum main
2.5.44-mm4-focht-12   Focht main + Focht balance_exec
2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec

Kernbench:
                      Elapsed    User       System    CPU
2.5.44-mm4            19.676s    192.794s   42.678s   1197.4%
2.5.44-mm4-focht-1    19.46s     189.838s   37.938s   1171%
2.5.44-mm4-hbaum-1    19.746s    189.232s   38.354s   1152.2%
2.5.44-mm4-focht-12   20.32s     190s       44.4s     1153.6%
2.5.44-mm4-hbaum-12   19.322s    190.176s   40.354s   1192.6%
2.5.44-mm4-f1-h2      19.398s    190.118s   40.06s    1186%

Schedbench 4:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            32.45      49.47      129.86    0.82
2.5.44-mm4-focht-1    38.61      45.15      154.48    1.06
2.5.44-mm4-hbaum-1    37.81      46.44      151.26    0.78
2.5.44-mm4-focht-12   23.23      38.87      92.99     0.85
2.5.44-mm4-hbaum-12   22.26      34.70      89.09     0.70
2.5.44-mm4-f1-h2      21.39      35.97      85.57     0.81

Schedbench 8:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            39.90      61.48      319.26    2.79
2.5.44-mm4-focht-1    37.76      61.09      302.17    2.55
2.5.44-mm4-hbaum-1    43.18      56.74      345.54    1.71
2.5.44-mm4-focht-12   28.40      34.43      227.25    2.09
2.5.44-mm4-hbaum-12   30.71      45.87      245.75    1.43
2.5.44-mm4-f1-h2      36.11      45.18      288.98    2.10

Schedbench 16:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            62.99      93.59      1008.01   5.11
2.5.44-mm4-focht-1    51.69      60.23      827.20    4.95
2.5.44-mm4-hbaum-1    52.57      61.54      841.38    3.93
2.5.44-mm4-focht-12   51.24      60.86      820.08    4.23
2.5.44-mm4-hbaum-12   52.33      62.23      837.46    3.84
2.5.44-mm4-f1-h2      51.76      60.15      828.33    5.67

Schedbench 32:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            88.13      194.53     2820.54   11.52
2.5.44-mm4-focht-1    56.71      123.62     1815.12   7.92
2.5.44-mm4-hbaum-1    54.57      153.56     1746.45   9.20
2.5.44-mm4-focht-12   55.69      118.85     1782.25   7.28
2.5.44-mm4-hbaum-12   54.36      135.30     1739.95   8.09
2.5.44-mm4-f1-h2      55.97      119.28     1791.39   7.20

Schedbench 64:
                      Elapsed    TotalUser  TotalSys  AvgUser
2.5.44-mm4            159.92     653.79     10235.93  25.16
2.5.44-mm4-focht-1    55.60      232.36     3558.98   17.61
2.5.44-mm4-hbaum-1    71.48      361.77     4575.45   18.53
2.5.44-mm4-focht-12   56.03      234.45     3586.46   15.76
2.5.44-mm4-hbaum-12   56.91      240.89     3642.99   15.67
2.5.44-mm4-f1-h2      56.48      246.93     3615.32   16.97

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5) 2002-10-28 0:46 ` Martin J. Bligh @ 2002-10-28 17:11 ` Erich Focht 2002-10-28 18:32 ` Martin J. Bligh 2002-10-28 17:38 ` Erich Focht 1 sibling, 1 reply; 33+ messages in thread From: Erich Focht @ 2002-10-28 17:11 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech [-- Attachment #1: Type: text/plain, Size: 5026 bytes --] On Monday 28 October 2002 01:46, Martin J. Bligh wrote: > Erich, what does all the pool stuff actually buy us over what > Michael is doing? Seems to be rather more complex, but maybe > it's useful for something we're just not measuring here? The more complicated stuff is for achieving equal load between the nodes. It delays steals more when the stealing node is averagely loaded, less when it is unloaded. This is the place where we can make it cope with more complex machines with multiple levels of memory hierarchy (like our 32 CPU TX7). Equal load among the nodes is important if you have memory bandwidth eaters, as the bandwidth in a node is limited. When introducing node affinity (which shows good results for me!) you also need a more careful ranking of the tasks which are candidates to be stolen. The routine task_to_steal does this and is another source of complexity. It is another point where the multilevel stuff comes in. In the core part of the patch the rank of the steal candidates is computed by only taking into account the time which a task has slept. I attach the script for getting some statistics on the numa_test. I consider this test more sensitive to NUMA effects, as it is a bandwidth eater also needing good latency. (BTW, Martin: in the numa_test script I've sent you the PROBLEMSIZE must be set to 1000000!). 
Regards, Erich

>
> 2.5.44-mm4            Virgin
> 2.5.44-mm4-focht-1    Focht main
> 2.5.44-mm4-hbaum-1    Hbaum main
> 2.5.44-mm4-focht-12   Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec
>
> Kernbench:
>                       Elapsed    User       System    CPU
> 2.5.44-mm4            19.676s    192.794s   42.678s   1197.4%
> 2.5.44-mm4-focht-1    19.46s     189.838s   37.938s   1171%
> 2.5.44-mm4-hbaum-1    19.746s    189.232s   38.354s   1152.2%
> 2.5.44-mm4-focht-12   20.32s     190s       44.4s     1153.6%
> 2.5.44-mm4-hbaum-12   19.322s    190.176s   40.354s   1192.6%
> 2.5.44-mm4-f1-h2      19.398s    190.118s   40.06s    1186%
>
> Schedbench 4:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            32.45      49.47      129.86    0.82
> 2.5.44-mm4-focht-1    38.61      45.15      154.48    1.06
> 2.5.44-mm4-hbaum-1    37.81      46.44      151.26    0.78
> 2.5.44-mm4-focht-12   23.23      38.87      92.99     0.85
> 2.5.44-mm4-hbaum-12   22.26      34.70      89.09     0.70
> 2.5.44-mm4-f1-h2      21.39      35.97      85.57     0.81
>
> Schedbench 8:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            39.90      61.48      319.26    2.79
> 2.5.44-mm4-focht-1    37.76      61.09      302.17    2.55
> 2.5.44-mm4-hbaum-1    43.18      56.74      345.54    1.71
> 2.5.44-mm4-focht-12   28.40      34.43      227.25    2.09
> 2.5.44-mm4-hbaum-12   30.71      45.87      245.75    1.43
> 2.5.44-mm4-f1-h2      36.11      45.18      288.98    2.10
>
> Schedbench 16:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            62.99      93.59      1008.01   5.11
> 2.5.44-mm4-focht-1    51.69      60.23      827.20    4.95
> 2.5.44-mm4-hbaum-1    52.57      61.54      841.38    3.93
> 2.5.44-mm4-focht-12   51.24      60.86      820.08    4.23
> 2.5.44-mm4-hbaum-12   52.33      62.23      837.46    3.84
> 2.5.44-mm4-f1-h2      51.76      60.15      828.33    5.67
>
> Schedbench 32:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            88.13      194.53     2820.54   11.52
> 2.5.44-mm4-focht-1    56.71      123.62     1815.12   7.92
> 2.5.44-mm4-hbaum-1    54.57      153.56     1746.45   9.20
> 2.5.44-mm4-focht-12   55.69      118.85     1782.25   7.28
> 2.5.44-mm4-hbaum-12   54.36      135.30     1739.95   8.09
> 2.5.44-mm4-f1-h2      55.97      119.28     1791.39   7.20
>
> Schedbench 64:
>                       Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4            159.92     653.79     10235.93  25.16
> 2.5.44-mm4-focht-1    55.60      232.36     3558.98   17.61
> 2.5.44-mm4-hbaum-1    71.48      361.77     4575.45   18.53
> 2.5.44-mm4-focht-12   56.03      234.45     3586.46   15.76
> 2.5.44-mm4-hbaum-12   56.91      240.89     3642.99   15.67
> 2.5.44-mm4-f1-h2      56.48      246.93     3615.32   16.97

[-- Attachment #2: numabench --]
[-- Type: application/x-shellscript, Size: 874 bytes --]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:11       ` Erich Focht
@ 2002-10-28 18:32         ` Martin J. Bligh
  0 siblings, 0 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28 18:32 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> Erich, what does all the pool stuff actually buy us over what
>> Michael is doing? Seems to be rather more complex, but maybe
>> it's useful for something we're just not measuring here?
>
> The more complicated stuff is for achieving equal load between the
> nodes. It delays steals more when the stealing node is close to the
> average load, less when it is underloaded. This is the place where we
> can make it cope with more complex machines with multiple levels of
> memory hierarchy (like our 32 CPU TX7). Equal load among the nodes is
> important if you have memory bandwidth eaters, as the bandwidth in a
> node is limited.
>
> When introducing node affinity (which shows good results for me!) you
> also need a more careful ranking of the tasks which are candidates to
> be stolen. The routine task_to_steal does this and is another source
> of complexity. It is another point where the multilevel stuff comes
> in. In the core part of the patch the rank of the steal candidates is
> computed by taking into account only the time which a task has slept.

OK, it all sounds sane, just rather complicated ;-) I'm going to trawl
through your stuff with Michael, and see if we can simplify it a bit
somehow whilst not changing the functionality. Your first patch seems
to work just fine, it's just the complexity that bugs me a bit. The
combination of your first patch with Michael's balance_exec stuff
actually seems to work pretty well ... I'll poke at the new patch you
sent me + Michael's exec balance + the little perf tweak I made to it,
and see what happens ;-)

> I attach the script for getting some statistics on the numa_test. I
> consider this test more sensitive to NUMA effects, as it is a
> bandwidth eater that also needs good latency.
> (BTW, Martin: in the numa_test script I've sent you, PROBLEMSIZE must
> be set to 1000000!)

It is ;-) I'm running 44-mm4, not virgin remember, so things like
hot & cold page lists may make it faster?

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28  0:46       ` Martin J. Bligh
  2002-10-28 17:11       ` Erich Focht
@ 2002-10-28 17:38       ` Erich Focht
  2002-10-28 17:36         ` Martin J. Bligh
  1 sibling, 1 reply; 33+ messages in thread
From: Erich Focht @ 2002-10-28 17:38 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> 2.5.44-mm4           Virgin
> 2.5.44-mm4-focht-1   Focht main
> 2.5.44-mm4-hbaum-1   Hbaum main
> 2.5.44-mm4-focht-12  Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12  Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2     Focht main + Hbaum balance_exec
>
> Schedbench 4:
>                      Elapsed    TotalUser  TotalSys  AvgUser
> 2.5.44-mm4           32.45      49.47      129.86    0.82
> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81

One more remark: you seem to have made the numa_test shorter. That
reduces it to being simply a check of the initial load balancing, as
the hackbench running in the background (and aimed at disturbing the
initial load balancing) might start too late. You will most probably
not see the impact of node affinity with such short-running tests. But
we weren't talking about node affinity, yet...

Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:38       ` Erich Focht
@ 2002-10-28 17:36         ` Martin J. Bligh
  2002-10-28 23:49           ` Erich Focht
  2002-10-29 22:39           ` Erich Focht
  0 siblings, 2 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28 17:36 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> Schedbench 4:
>>                      Elapsed    TotalUser  TotalSys  AvgUser
>> 2.5.44-mm4           32.45      49.47      129.86    0.82
>> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
>> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
>> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
>> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
>> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
>
> One more remark: you seem to have made the numa_test shorter. That
> reduces it to being simply a check of the initial load balancing, as
> the hackbench running in the background (and aimed at disturbing the
> initial load balancing) might start too late. You will most probably
> not see the impact of node affinity with such short-running tests.
> But we weren't talking about node affinity, yet...

I didn't modify what you sent me at all ... perhaps my machine is
just faster than yours?

/me ducks & runs ;-)

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:36       ` Martin J. Bligh
@ 2002-10-28 23:49         ` Erich Focht
  2002-10-29  0:00           ` Martin J. Bligh
  1 sibling, 1 reply; 33+ messages in thread
From: Erich Focht @ 2002-10-28 23:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >>                      Elapsed    TotalUser  TotalSys  AvgUser
> >> 2.5.44-mm4           32.45      49.47      129.86    0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
> >
> > One more remark: you seem to have made the numa_test shorter. That
> > reduces it to being simply a check of the initial load balancing,
> > as the hackbench running in the background (and aimed at disturbing
> > the initial load balancing) might start too late. You will most
> > probably not see the impact of node affinity with such
> > short-running tests. But we weren't talking about node affinity,
> > yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

:-)))

I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get something
around 16s on a 2.8GHz Xeon. On a 1.6GHz Athlon it's 22s. Both times
running ./numa_test 2 on a dual-CPU box. The user time is pretty
independent of the OS (but the scheduling influences it a lot).

But: you have a node-level cache! Maybe the whole memory footprint
fits inside that one, and then things can go really fast. Hmmm, I
guess I'll need some cache detection in the future to make sure the BM
really runs in memory... Increasing PROBLEMSIZE might help, but we can
do that later, when testing affinity (I'm not giving up on this
idea... ;-)

Regards,
Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 23:49       ` Erich Focht
@ 2002-10-29  0:00         ` Martin J. Bligh
  2002-10-29  1:12           ` Gerrit Huizenga
  0 siblings, 1 reply; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-29  0:00 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

>> I didn't modify what you sent me at all ... perhaps my machine is
>> just faster than yours?
>>
>> /me ducks & runs ;-)
>
> :-)))
>
> I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get something
> around 16s on a 2.8GHz Xeon. On a 1.6GHz Athlon it's 22s. Both times
> running ./numa_test 2 on a dual-CPU box. The user time is pretty
> independent of the OS (but the scheduling influences it a lot).

I have 700MHz P3 Xeons, but I have 2Mb of L2 cache on them, which is
much better than the newer chips. That might make a big difference.

> But: you have a node-level cache! Maybe the whole memory footprint
> fits inside that one, and then things can go really fast. Hmmm, I
> guess I'll need some cache detection in the future to make sure the
> BM really runs in memory... Increasing PROBLEMSIZE might help, but
> we can do that later, when testing affinity (I'm not giving up on
> this idea... ;-)

Yup, 32Mb cache. Not sure if it's faster than local memory or not.

M.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-29  0:00       ` Martin J. Bligh
@ 2002-10-29  1:12         ` Gerrit Huizenga
  0 siblings, 0 replies; 33+ messages in thread
From: Gerrit Huizenga @ 2002-10-29  1:12 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

In message <737410000.1035849619@flay>, "Martin J. Bligh" writes:
>
> Yup, 32Mb cache. Not sure if it's faster than local memory or not.

Yes, NUMA-Q cache can be faster than local memory, but it *only*
caches remote memory. Some other architectures use the L3 cache to
cache *all* memory (local _and_ remote). Reasoning: why pollute the
valuable cache with things that are already close at hand?

gerrit

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28 17:36       ` Martin J. Bligh
  2002-10-28 23:49       ` Erich Focht
@ 2002-10-29 22:39       ` Erich Focht
  1 sibling, 0 replies; 33+ messages in thread
From: Erich Focht @ 2002-10-29 22:39 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >>                      Elapsed    TotalUser  TotalSys  AvgUser
> >> 2.5.44-mm4           32.45      49.47      129.86    0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48    1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26    0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99     0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09     0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57     0.81
> >
> > One more remark: you seem to have made the numa_test shorter. That
> > reduces it to being simply a check of the initial load balancing,
> > as the hackbench running in the background (and aimed at disturbing
> > the initial load balancing) might start too late. You will most
> > probably not see the impact of node affinity with such
> > short-running tests. But we weren't talking about node affinity,
> > yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

Aaargh, now I understand!!! You just have wrong labels in your table,
they are permuted! This makes more sense:

> >>                      AvgUser    Elapsed    TotalUser  TotalSys
> >> 2.5.44-mm4           32.45      49.47      129.86     0.82
> >> 2.5.44-mm4-focht-1   38.61      45.15      154.48     1.06
> >> 2.5.44-mm4-hbaum-1   37.81      46.44      151.26     0.78
> >> 2.5.44-mm4-focht-12  23.23      38.87      92.99      0.85
> >> 2.5.44-mm4-hbaum-12  22.26      34.70      89.09      0.70
> >> 2.5.44-mm4-f1-h2     21.39      35.97      85.57      0.81

Regards,
Erich

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)
  2002-10-28  0:32       ` Erich Focht
                         ` (2 preceding siblings ...)
  2002-10-28  0:46       ` Martin J. Bligh
@ 2002-10-28  7:16       ` Martin J. Bligh
  3 siblings, 0 replies; 33+ messages in thread
From: Martin J. Bligh @ 2002-10-28  7:16 UTC (permalink / raw)
  To: Erich Focht; +Cc: Michael Hohnbaum, mingo, habanero, linux-kernel, lse-tech

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML, I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see), or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations? As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning.
> This leads to a cheap selection of the next node, just a simple
> round robin.

I did this ... presume that's what you meant:

static int sched_best_node(struct task_struct *p)
{
	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
	int cpu, pool, load;
	unsigned long mask = p->cpus_allowed & cpu_online_map;

	do {
		/* atomic_inc_return is not implemented on all archs [EF] */
		atomic_inc(&sched_node);
		best_node = atomic_read(&sched_node) % numpools;
	} while (!(pool_mask[best_node] & mask));

	return best_node;
}

Odd. Seems to make it even worse.

Kernbench:
                          Elapsed   User      System   CPU
2.5.44-mm4-focht-12       20.32s    190s      44.4s    1153.6%
2.5.44-mm4-focht-12-lobo  21.362s   193.71s   48.672s  1134%

The diffprofiles below look like this just makes it make bad
decisions. Very odd ... compare with what happened when I put
Michael's balance_exec on instead. I'm tired, maybe I did something
silly.

diffprofile 2.5.44-mm4-focht-1 2.5.44-mm4-focht-12
       606 page_remove_rmap
       566 do_schedule
       488 page_add_rmap
       475 .text.lock.file_table
       370 __copy_to_user
       306 strnlen_user
       272 d_lookup
       235 find_get_page
       233 get_empty_filp
       193 atomic_dec_and_lock
       161 copy_process
       159 sched_best_node
       135 flush_signal_handlers
       131 complete
       116 filemap_nopage
       109 __fput
       105 path_lookup
       103 follow_mount
        95 zap_pte_range
        92 file_move
        91 do_no_page
        87 release_task
        80 do_page_fault
        62 lru_cache_add
        62 link_path_walk
        62 do_generic_mapping_read
        57 find_trylock_page
        55 release_pages
        50 dup_task_struct
...
       -73 do_anonymous_page
      -478 __copy_from_user

diffprofile 2.5.44-mm4-focht-12 2.5.44-mm4-focht-12-lobo
       567 do_schedule
       482 do_anonymous_page
       383 page_remove_rmap
       336 __copy_from_user
       333 page_add_rmap
       241 zap_pte_range
       213 init_private_file
       189 strnlen_user
       186 buffered_rmqueue
       172 find_get_page
       124 complete
       111 filemap_nopage
        97 free_hot_cold_page
        89 flush_signal_handlers
        86 clear_page_tables
        79 do_page_fault
        79 copy_process
        75 d_lookup
        74 path_lookup
        71 sched_best_cpu
        68 do_no_page
        58 release_pages
        58 __set_page_dirty_buffers
        52 wait_for_completion
        51 release_task
        51 handle_mm_fault
...
       -53 lru_cache_add
       -73 dentry_open
      -100 sched_best_node
      -108 file_ra_state_init
      -402 .text.lock.file_table

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)
  2002-10-23 21:26 Crunch time -- the musical. (2.5 merge candidate list 1.5) Rob Landley
  2002-10-24 16:17 ` Michael Hohnbaum
@ 2002-10-25 14:46 ` Kevin Corry
  1 sibling, 0 replies; 33+ messages in thread
From: Kevin Corry @ 2002-10-25 14:46 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Wednesday 23 October 2002 16:26, Rob Landley wrote:
> Due to numerous complaints (okay, one, but technically that's a number)
> tried to reformat a bit to have a slightly less eye-searingly hideous
> layout. And reorganized the -mm stuff to be together in one clump.
>
> And so:
> ......
> ---------------------------------------------------------------------------
>
> 8) EVMS (Enterprise Volume Management System) (EVMS team)
>
> Home page:
> http://sourceforge.net/projects/evms
>
> ---------------------------------------------------------------------------

Rob,

Can you please add the following links for the EVMS project:

Home page:
http://evms.sourceforge.net

Download:
http://evms.sourceforge.net/patches/

Some related discussions:
http://marc.theaimsgroup.com/?t=103359686900003&r=1&w=2
http://marc.theaimsgroup.com/?t=103439913000001&r=1&w=2
http://marc.theaimsgroup.com/?w=2&r=1&s=%5Bpatch%5D+evms+core&q=t

Thanks!
--
Kevin Corry
corryk@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 33+ messages in thread
end of thread, other threads:[~2002-10-29 22:33 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-23 21:26 Crunch time -- the musical. (2.5 merge candidate list 1.5) Rob Landley
2002-10-24 16:17 ` Michael Hohnbaum
[not found] ` <200210240750.09751.landley@trommello.org>
2002-10-24 19:01 ` Michael Hohnbaum
2002-10-24 21:51 ` Erich Focht
2002-10-24 22:38 ` Martin J. Bligh
2002-10-25 8:15 ` Erich Focht
2002-10-25 23:26 ` Martin J. Bligh
2002-10-25 23:45 ` Martin J. Bligh
2002-10-26 0:02 ` Martin J. Bligh
2002-10-26 18:58 ` Martin J. Bligh
2002-10-26 19:14 ` NUMA scheduler (was: 2.5 " Martin J. Bligh
2002-10-27 18:16 ` Martin J. Bligh
2002-10-28 0:32 ` Erich Focht
2002-10-27 23:52 ` Martin J. Bligh
2002-10-28 0:55 ` [Lse-tech] " Michael Hohnbaum
2002-10-28 4:23 ` Martin J. Bligh
2002-10-28 0:31 ` Martin J. Bligh
2002-10-28 16:34 ` Erich Focht
2002-10-28 16:57 ` Martin J. Bligh
2002-10-28 17:26 ` Erich Focht
2002-10-28 17:35 ` Martin J. Bligh
2002-10-29 0:07 ` [Lse-tech] " Erich Focht
2002-10-28 0:46 ` Martin J. Bligh
2002-10-28 17:11 ` Erich Focht
2002-10-28 18:32 ` Martin J. Bligh
2002-10-28 17:38 ` Erich Focht
2002-10-28 17:36 ` Martin J. Bligh
2002-10-28 23:49 ` Erich Focht
2002-10-29 0:00 ` Martin J. Bligh
2002-10-29 1:12 ` Gerrit Huizenga
2002-10-29 22:39 ` Erich Focht
2002-10-28 7:16 ` Martin J. Bligh
2002-10-25 14:46 ` Crunch time -- the musical. (2.5 " Kevin Corry
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox