From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org, torvalds@osdl.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	akpm@osdl.org, alan@lxorguk.ukuu.org.uk, nickpiggin@yahoo.com.au,
	suresh.b.siddha@intel.com, Christoph Lameter <clameter@sgi.com>,
	John Hawkes <hawkes@sgi.com>, Ingo Molnar <mingo@elte.hu>,
	Peter Williams <pwil3058@bigpond.net.au>,
	Greg Kroah-Hartman <gregkh@suse.de>
Subject: [patch 17/67] Fix longstanding load balancing bug in the scheduler
Date: Wed, 11 Oct 2006 14:04:49 -0700
Message-ID: <20061011210449.GR16627@kroah.com>
In-Reply-To: <20061011210310.GA16627@kroah.com>

[-- Attachment #1: fix-longstanding-load-balancing-bug-in-the-scheduler.patch --]
[-- Type: text/plain, Size: 6738 bytes --]


-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Christoph Lameter <christoph@sgi.com>

The scheduler will stop load balancing if the busiest processor contains
only processes pinned via processor affinity.

The scheduler currently searches for the busiest cpu only once.  If it
cannot pull any tasks away from the busiest cpu because they are pinned,
the scheduler goes into a corner and sulks, leaving the idle processors
idle.

For example, if processor 0 is busy running four tasks pinned via taskset,
processor 1 has no tasks, and two processes have just been started on
processor 2, then the scheduler will not move one of the two processes
away from processor 2.

This patch fixes that issue by forcing the scheduler to come out of its
corner: when no task can be pulled from the busiest processor, that
processor is removed from the set of candidates and load balancing is
retried against the remaining processors.
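
The retry described above can be sketched in plain userspace C.  This is
only an illustration, not the kernel code: struct toy_rq,
toy_find_busiest() and toy_load_balance() are hypothetical names, a 32 bit
word stands in for cpumask_t, and "busiest" simply means "most queued
tasks".

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 3

/* Hypothetical stand-in for a runqueue; not the kernel's struct rq. */
struct toy_rq {
	int nr_running;		/* tasks queued on this cpu */
	int nr_pinned;		/* how many of them are pinned here */
};

/*
 * Return the most loaded cpu that is still set in mask and has at least
 * two more tasks than this_cpu, or -1 if there is none.
 */
static int toy_find_busiest(const struct toy_rq *rqs, int this_cpu,
			    uint32_t mask)
{
	int max_load = rqs[this_cpu].nr_running + 1;
	int busiest = -1;
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		if (i == this_cpu || !(mask & (1u << i)))
			continue;
		if (rqs[i].nr_running > max_load) {
			max_load = rqs[i].nr_running;
			busiest = i;
		}
	}
	return busiest;
}

/*
 * Pull one task to this_cpu.  Returns the cpu it was pulled from, or -1
 * if nothing could be moved.
 */
static int toy_load_balance(struct toy_rq *rqs, int this_cpu)
{
	uint32_t cpus = (1u << NR_CPUS) - 1;	/* start with all cpus */

	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	while (cpus) {
		int busiest = toy_find_busiest(rqs, this_cpu, cpus);

		if (busiest < 0)
			return -1;		/* nothing left to try */

		if (rqs[busiest].nr_pinned < rqs[busiest].nr_running) {
			rqs[busiest].nr_running--;
			rqs[this_cpu].nr_running++;
			return busiest;
		}

		/* Everything there is pinned: drop that cpu and retry. */
		cpus &= ~(1u << busiest);
	}
	return -1;
}
```

With the example above (four pinned tasks on cpu 0, none on cpu 1, two
unpinned tasks on cpu 2), balancing on behalf of cpu 1 first picks cpu 0,
finds everything pinned, drops it from the mask and then pulls from cpu 2.
Without the retry the sketch would give up after the first pick.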

This patch was originally developed by John Hawkes and discussed at
http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

I have removed extraneous material and gone back to equipping struct rq
with the cpu the queue is associated with, since this makes the patch much
simpler and it is likely that others in the future will have the same
difficulty figuring out which processor owns which runqueue.

The overhead added by this patch is a single word on the stack if the
kernel is configured to support 32 cpus or fewer (32 bit).  For 32 bit
environments the maximum number of cpus that can be configured is 255,
which would result in the use of 32 additional bytes on the stack.  On IA64
up to 1k cpus can be configured, which will result in the use of 128
additional bytes on the stack.  The maximum additional cache footprint is
one cacheline.  Typically memory use will be much less than a cacheline and
the additional cpumask will be placed on the stack in a cacheline that
already contains other local variables.
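
The byte counts above follow from rounding the cpu count up to whole
machine words.  A small sketch of that arithmetic, assuming the mask is
stored as an array of longs; toy_cpumask_bytes() is a hypothetical helper,
not a kernel function:

```c
#include <assert.h>

/*
 * Bytes needed for a bitmap with one bit per cpu, stored as an array of
 * longs that are bits_per_long bits wide.
 */
static unsigned int toy_cpumask_bytes(unsigned int nr_cpus,
				      unsigned int bits_per_long)
{
	unsigned int longs;

	assert(bits_per_long == 32 || bits_per_long == 64);

	/* Round up to a whole number of longs. */
	longs = (nr_cpus + bits_per_long - 1) / bits_per_long;
	return longs * (bits_per_long / 8);
}
```

For 32 cpus on 32 bit this gives 4 bytes (one word), for 255 cpus on 32 bit
it gives 32 bytes, and for 1024 cpus on 64 bit it gives 128 bytes.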


Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: John Hawkes <hawkes@sgi.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 kernel/sched.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 8 deletions(-)

--- linux-2.6.18.orig/kernel/sched.c
+++ linux-2.6.18/kernel/sched.c
@@ -238,6 +238,7 @@ struct rq {
 	/* For active balancing */
 	int active_balance;
 	int push_cpu;
+	int cpu;		/* cpu of this runqueue */
 
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
@@ -267,6 +268,15 @@ struct rq {
 
 static DEFINE_PER_CPU(struct rq, runqueues);
 
+static inline int cpu_of(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+	return rq->cpu;
+#else
+	return 0;
+#endif
+}
+
 /*
  * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
  * See detach_destroy_domains: synchronize_sched for details.
@@ -2211,7 +2221,8 @@ out:
  */
 static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
-		   unsigned long *imbalance, enum idle_type idle, int *sd_idle)
+		   unsigned long *imbalance, enum idle_type idle, int *sd_idle,
+		   cpumask_t *cpus)
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
@@ -2248,7 +2259,12 @@ find_busiest_group(struct sched_domain *
 		sum_weighted_load = sum_nr_running = avg_load = 0;
 
 		for_each_cpu_mask(i, group->cpumask) {
-			struct rq *rq = cpu_rq(i);
+			struct rq *rq;
+
+			if (!cpu_isset(i, *cpus))
+				continue;
+
+			rq = cpu_rq(i);
 
 			if (*sd_idle && !idle_cpu(i))
 				*sd_idle = 0;
@@ -2466,13 +2482,17 @@ ret:
  */
 static struct rq *
 find_busiest_queue(struct sched_group *group, enum idle_type idle,
-		   unsigned long imbalance)
+		   unsigned long imbalance, cpumask_t *cpus)
 {
 	struct rq *busiest = NULL, *rq;
 	unsigned long max_load = 0;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
+
+		if (!cpu_isset(i, *cpus))
+			continue;
+
 		rq = cpu_rq(i);
 
 		if (rq->nr_running == 1 && rq->raw_weighted_load > imbalance)
@@ -2511,6 +2531,7 @@ static int load_balance(int this_cpu, st
 	struct sched_group *group;
 	unsigned long imbalance;
 	struct rq *busiest;
+	cpumask_t cpus = CPU_MASK_ALL;
 
 	if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
 	    !sched_smt_power_savings)
@@ -2518,13 +2539,15 @@ static int load_balance(int this_cpu, st
 
 	schedstat_inc(sd, lb_cnt[idle]);
 
-	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
+							&cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle, imbalance);
+	busiest = find_busiest_queue(group, idle, imbalance, &cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2549,8 +2572,12 @@ static int load_balance(int this_cpu, st
 		double_rq_unlock(this_rq, busiest);
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
-		if (unlikely(all_pinned))
+		if (unlikely(all_pinned)) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
 			goto out_balanced;
+		}
 	}
 
 	if (!nr_moved) {
@@ -2639,18 +2666,22 @@ load_balance_newidle(int this_cpu, struc
 	unsigned long imbalance;
 	int nr_moved = 0;
 	int sd_idle = 0;
+	cpumask_t cpus = CPU_MASK_ALL;
 
 	if (sd->flags & SD_SHARE_CPUPOWER && !sched_smt_power_savings)
 		sd_idle = 1;
 
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
-	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE,
+				&sd_idle, &cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance,
+				&cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;
@@ -2668,6 +2699,12 @@ load_balance_newidle(int this_cpu, struc
 					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, NEWLY_IDLE, NULL);
 		spin_unlock(&busiest->lock);
+
+		if (!nr_moved) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
+		}
 	}
 
 	if (!nr_moved) {
@@ -6747,6 +6784,7 @@ void __init sched_init(void)
 			rq->cpu_load[j] = 0;
 		rq->active_balance = 0;
 		rq->push_cpu = 0;
+		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif

--
