public inbox for linux-kernel@vger.kernel.org
From: Peter Williams <pwil3058@bigpond.net.au>
To: Andrew Morton <akpm@osdl.org>
Cc: suresh.b.siddha@intel.com, kernel@kolivas.org, npiggin@suse.de,
	mingo@elte.hu, rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	torvalds@osdl.org
Subject: Re: [rfc][patch] sched: remove smpnice
Date: Mon, 13 Feb 2006 12:06:49 +1100	[thread overview]
Message-ID: <43EFDBA9.4060107@bigpond.net.au> (raw)
In-Reply-To: <43EFC05F.6010204@bigpond.net.au>

[-- Attachment #1: Type: text/plain, Size: 3226 bytes --]

Peter Williams wrote:
> Peter Williams wrote:
> 
>> Andrew Morton wrote:
>>
>>> Peter Williams <pwil3058@bigpond.net.au> wrote:
>>>
>>>> I don't think either of these issues warrant abandoning smpnice.  
>>>> The first is highly unlikely to occur on real systems and the second 
>>>> is just an example of the patch doing its job (maybe too 
>>>> officiously).  I don't think users would notice either on real systems.
>>>>
>>>> Even if you pull it from 2.6.16 rather than upgrading it with my 
>>>> patch can you please leave both in -mm?
>>>
>>> Yes, I have done that.  I currently have:
>>>
>>> sched-restore-smpnice.patch
>>> sched-modified-nice-support-for-smp-load-balancing.patch
>>> sched-cleanup_task_activated.patch
>>> sched-alter_uninterruptible_sleep_interactivity.patch
>>> sched-make_task_noninteractive_use_sleep_type.patch
>>> sched-dont_decrease_idle_sleep_avg.patch
>>> sched-include_noninteractive_sleep_in_idle_detect.patch
>>> sched-new-sched-domain-for-representing-multi-core.patch
>>> sched-fix-group-power-for-allnodes_domains.patch
>>
>>
>>
>> OK.  Having slept on these problems, I am now of the opinion that 
>> they are caused by the use of NICE_TO_BIAS_PRIO(0) to set *imbalance 
>> inside the (*imbalance < SCHED_LOAD_SCALE) if statement in 
>> find_busiest_group().  What is happening here is that, even though 
>> the imbalance is less than one (average) task, the decision is 
>> sometimes made to move a task anyway; with the current version this 
>> decision can be subverted in two ways: 1) if the task to be moved 
>> has a nice value less than zero, the value of *imbalance that is set 
>> will be too small for move_tasks() to move it; and 2) if there are a 
>> number of tasks with nice values greater than zero on "busiest", 
>> more than one of them may be moved, as the value of *imbalance that 
>> is set may be big enough to cover more than one of these tasks.
>>
>> The fix for this problem is to replace NICE_TO_BIAS_PRIO(0) with the 
>> "average bias prio per runnable task" on "busiest".  This will 
>> (generally) result in a larger value for *imbalance in case 1 above 
>> and a smaller one in case 2, alleviating both problems.  A patch to 
>> apply this fix is attached.
>>
>> Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
>>
>> Could you please add this patch to -mm so that it can be tested?
>>
>> [old patch removed]
> 
> 
> Can we pull this one, please?  I mistakenly assumed that busiest was 
> a run queue when it's actually a sched group. :-(
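
The two failure modes quoted above can be sketched in userspace C.  The 
weights below are illustrative only (the real NICE_TO_BIAS_PRIO values 
in the patch series differ), and move_tasks() is approximated as a 
greedy mover, not the kernel's actual loop:

```c
#include <assert.h>

/* Approximation of move_tasks() for illustration: keep pulling tasks
 * off "busiest" while the next task's bias prio still fits inside the
 * remaining *imbalance.  Returns how many tasks were moved. */
static unsigned moved_count(const unsigned long *bias, unsigned n,
			    unsigned long imbalance)
{
	unsigned moved = 0;

	while (moved < n && bias[moved] <= imbalance) {
		imbalance -= bias[moved];
		moved++;
	}
	return moved;
}
```

With an assumed nice-0 bias of 20: in case 1, a single nice -5 task of 
bias 25 against *imbalance = 20 moves nothing; in case 2, three nice +5 
tasks of bias 9 each against the same *imbalance move two tasks.  
Substituting the average bias prio (25 and 9 respectively) moves 
exactly one task in each case.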

Here's a fixed version that takes into account the fact that busiest is 
a group, not a run queue.  This patch also includes "improvements" to the 
logic at the end of find_busiest_queue() to take account of the fact 
that the loads are weighted.  Hopefully, the comments in the code 
explain these changes.
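
In outline, the new small-imbalance path amounts to the following 
(standalone C sketch; the SCHED_LOAD_SCALE value and the biased_load() 
body here are illustrative stand-ins, not the kernel's definitions, and 
the out_balanced escape when *imbalance <= avg_bias_prio / 2 is omitted):

```c
#define SCHED_LOAD_SCALE 128UL	/* illustrative value only */

/* Stand-in for the kernel's biased_load(): strip the load scaling
 * factor, converting to the biased-prio units used by move_tasks(). */
static unsigned long biased_load(unsigned long weighted_load)
{
	return weighted_load / SCHED_LOAD_SCALE;
}

/* Central substitution made by the patch: when the descaled imbalance
 * is below the busiest group's average bias prio per runnable task,
 * bump it to that average (rather than to NICE_TO_BIAS_PRIO(0)) so
 * that one average task on "busiest" can be moved. */
static unsigned long fixup_imbalance(unsigned long imbalance,
				     unsigned long busiest_prio_bias,
				     unsigned long busiest_nr_running)
{
	unsigned long avg_bias_prio = busiest_prio_bias / busiest_nr_running;

	imbalance = biased_load(imbalance);
	if (imbalance < avg_bias_prio)
		imbalance = avg_bias_prio;
	return imbalance;
}
```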

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>

Because this is a more substantial change it may be advisable to wait 
for comments from Ingo and Nick before including this in -mm for testing.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

[-- Attachment #2: fix-smpnice-light-load-problem-v2 --]
[-- Type: text/plain, Size: 3456 bytes --]

Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c	2006-02-13 10:23:12.000000000 +1100
+++ MM-2.6.X/kernel/sched.c	2006-02-13 11:46:41.000000000 +1100
@@ -2033,9 +2033,11 @@ find_busiest_group(struct sched_domain *
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
 	unsigned long max_pull;
+	unsigned long avg_bias_prio, busiest_prio_bias, busiest_nr_running;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
+	busiest_prio_bias = busiest_nr_running = 0;
 	if (idle == NOT_IDLE)
 		load_idx = sd->busy_idx;
 	else if (idle == NEWLY_IDLE)
@@ -2047,13 +2049,16 @@ find_busiest_group(struct sched_domain *
 		unsigned long load;
 		int local_group;
 		int i;
+		unsigned long sum_prio_bias, sum_nr_running;
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
 		/* Tally up the load of all CPUs in the group */
-		avg_load = 0;
+		sum_prio_bias = sum_nr_running = avg_load = 0;
 
 		for_each_cpu_mask(i, group->cpumask) {
+			runqueue_t *rq = cpu_rq(i);
+
 			if (*sd_idle && !idle_cpu(i))
 				*sd_idle = 0;
 
@@ -2064,6 +2069,8 @@ find_busiest_group(struct sched_domain *
 				load = source_load(i, load_idx);
 
 			avg_load += load;
+			sum_prio_bias += rq->prio_bias;
+			sum_nr_running += rq->nr_running;
 		}
 
 		total_load += avg_load;
@@ -2078,6 +2085,8 @@ find_busiest_group(struct sched_domain *
 		} else if (avg_load > max_load) {
 			max_load = avg_load;
 			busiest = group;
+			busiest_prio_bias = sum_prio_bias;
+			busiest_nr_running = sum_nr_running;
 		}
 		group = group->next;
 	} while (group != sd->groups);
@@ -2111,12 +2120,25 @@ find_busiest_group(struct sched_domain *
 				(avg_load - this_load) * this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 
-	if (*imbalance < SCHED_LOAD_SCALE) {
+	/* assume that busiest_nr_running > 0 */
+	avg_bias_prio = busiest_prio_bias / busiest_nr_running;
+	/*
+	 * Get rid of the scaling factor, rounding down as we divide and
+	 * converting to biased load for use by move_tasks()
+	 */
+	*imbalance = biased_load(*imbalance);
+	/*
+	 * if *imbalance is less than the average runnable task biased prio
+	 * there is no guarantee that any tasks will be moved so we'll have
+	 * a think about bumping its value to force at least one task to be
+	 * moved
+	 */
+	if (*imbalance < avg_bias_prio) {
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
 
-		if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
-			*imbalance = NICE_TO_BIAS_PRIO(0);
+		if (biased_load(max_load - this_load) >= avg_bias_prio*2) {
+			*imbalance = avg_bias_prio;
 			return busiest;
 		}
 
@@ -2146,18 +2168,19 @@ find_busiest_group(struct sched_domain *
 		pwr_move /= SCHED_LOAD_SCALE;
 
 		/* Move if we gain throughput */
-		if (pwr_move <= pwr_now)
-			goto out_balanced;
+		if (pwr_move <= pwr_now) {
+			/* or if there's a reasonable chance that *imbalance
+			 * is big enough to cause a move
+			 */
+			if (*imbalance <= avg_bias_prio / 2)
+				goto out_balanced;
+			else
+				return busiest;
+		}
 
-		*imbalance = NICE_TO_BIAS_PRIO(0);
-		return busiest;
+		*imbalance = avg_bias_prio;
 	}
 
-	/*
-	 * Get rid of the scaling factor, rounding down as we divide and
-	 * converting to biased load for use by move_tasks()
-	 */
-	*imbalance = biased_load(*imbalance);
 	return busiest;
 
 out_balanced:


Thread overview: 33+ messages
2006-02-07 14:28 [rfc][patch] sched: remove smpnice Nick Piggin
2006-02-07 14:57 ` Con Kolivas
2006-02-07 15:05   ` Nick Piggin
2006-02-07 22:15   ` Andrew Morton
2006-02-07 23:11     ` Con Kolivas
2006-02-07 23:36       ` Andrew Morton
2006-02-08  3:28         ` Nick Piggin
2006-02-08 14:56         ` Ingo Molnar
2006-02-10  7:01         ` Siddha, Suresh B
2006-02-10  7:17           ` Andrew Morton
2006-02-10  7:23             ` Con Kolivas
2006-02-10  9:06             ` Ingo Molnar
2006-02-11  1:27             ` Peter Williams
2006-02-11  2:00               ` Andrew Morton
2006-02-12  1:13                 ` Peter Williams
2006-02-12 23:10                   ` Peter Williams
2006-02-13  1:06                     ` Peter Williams [this message]
2006-02-14  0:37                       ` Peter Williams
2006-02-14  8:53                         ` Siddha, Suresh B
2006-02-11  3:36               ` Peter Williams
2006-02-11  4:04               ` Peter Williams
2006-02-14  9:07               ` Siddha, Suresh B
2006-02-14 22:40                 ` Peter Williams
2006-02-14 23:44                   ` Paul Jackson
2006-02-15  0:09                     ` Peter Williams
2006-02-15  1:00                       ` Paul Jackson
2006-02-15  7:07                   ` Siddha, Suresh B
2006-02-15 22:36                     ` Peter Williams
2006-02-15 23:29                       ` Peter Williams
2006-02-13 14:12           ` Con Kolivas
2006-02-07 23:20     ` Peter Williams
2006-02-07 23:29       ` Con Kolivas
2006-02-07 23:36       ` Martin Bligh
