* [PATCH] sched: fix erroneous sysctl_sched_nr_migrate logic
@ 2011-05-04 19:15 Vladimir Davydov
2011-05-06 7:18 ` Ingo Molnar
From: Vladimir Davydov @ 2011-05-04 19:15 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, Vladimir Davydov
During load balance, the scheduler must not iterate over more than
sysctl_sched_nr_migrate (32 by default) tasks, but at present this limit is
enforced only per task group. That means that if there is only one task group
in the system, the scheduler never iterates over more than 32 tasks in a
single balance run, but if there are N task groups, it can iterate over up to
N * 32 tasks. This patch makes the limit system-wide, as it should be.
---
kernel/sched_fair.c | 35 +++++++++++++++++------------------
1 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37f2262..a8fe580 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2142,9 +2142,9 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- struct cfs_rq *busiest_cfs_rq)
+ unsigned int *loops_left, struct cfs_rq *busiest_cfs_rq)
{
- int loops = 0, pulled = 0;
+ int pulled = 0;
long rem_load_move = max_load_move;
struct task_struct *p, *n;
@@ -2152,8 +2152,9 @@ balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
goto out;
list_for_each_entry_safe(p, n, &busiest_cfs_rq->tasks, se.group_node) {
- if (loops++ > sysctl_sched_nr_migrate)
+ if (!*loops_left)
break;
+ --*loops_left;
if ((p->se.load.weight >> 1) > rem_load_move ||
!can_migrate_task(p, busiest, this_cpu, sd, idle,
@@ -2170,8 +2171,10 @@ balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
* kernels will stop after the first task is pulled to minimize
* the critical section.
*/
- if (idle == CPU_NEWLY_IDLE)
+ if (idle == CPU_NEWLY_IDLE) {
+ *loops_left = 0;
break;
+ }
#endif
/*
@@ -2239,7 +2242,7 @@ static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned)
+ int *all_pinned, unsigned int *loops_left)
{
long rem_load_move = max_load_move;
int busiest_cpu = cpu_of(busiest);
@@ -2264,9 +2267,12 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
rem_load = div_u64(rem_load, busiest_h_load + 1);
moved_load = balance_tasks(this_rq, this_cpu, busiest,
- rem_load, sd, idle, all_pinned,
+ rem_load, sd, idle, all_pinned, loops_left,
busiest_cfs_rq);
+ if (!*loops_left)
+ break;
+
if (!moved_load)
continue;
@@ -2290,11 +2296,11 @@ static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned)
+ int *all_pinned, unsigned int *loops_left)
{
return balance_tasks(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
- &busiest->cfs);
+ loops_left, &busiest->cfs);
}
#endif
@@ -2311,28 +2317,21 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
int *all_pinned)
{
unsigned long total_load_moved = 0, load_moved;
+ unsigned int loops_left = sysctl_sched_nr_migrate;
do {
load_moved = load_balance_fair(this_rq, this_cpu, busiest,
max_load_move - total_load_moved,
- sd, idle, all_pinned);
+ sd, idle, all_pinned, &loops_left);
total_load_moved += load_moved;
#ifdef CONFIG_PREEMPT
- /*
- * NEWIDLE balancing is a source of latency, so preemptible
- * kernels will stop after the first task is pulled to minimize
- * the critical section.
- */
- if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
- break;
-
if (raw_spin_is_contended(&this_rq->lock) ||
raw_spin_is_contended(&busiest->lock))
break;
#endif
- } while (load_moved && max_load_move > total_load_moved);
+ } while (load_moved && max_load_move > total_load_moved && loops_left);
return total_load_moved > 0;
}
--
1.7.0.4
* Re: [PATCH] sched: fix erroneous sysctl_sched_nr_migrate logic
2011-05-04 19:15 [PATCH] sched: fix erroneous sysctl_sched_nr_migrate logic Vladimir Davydov
@ 2011-05-06 7:18 ` Ingo Molnar
[not found] ` <9532E773-BD05-4A18-B9BD-6B667F15FF91@parallels.com>
From: Ingo Molnar @ 2011-05-06 7:18 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Peter Zijlstra, linux-kernel, Nikhil Rao, Mike Galbraith,
Srivatsa Vaddagiri, Stephan Barwolf, Nikunj A. Dadhania
* Vladimir Davydov <vdavydov@parallels.com> wrote:
> During load balance, the scheduler must not iterate over more than
> sysctl_sched_nr_migrate (32 by default) tasks, but at present this limit is
> enforced only per task group. That means that if there is only one task group
> in the system, the scheduler never iterates over more than 32 tasks in a
> single balance run, but if there are N task groups, it can iterate over up to
> N * 32 tasks. This patch makes the limit system-wide, as it should be.
> ---
> kernel/sched_fair.c | 35 +++++++++++++++++------------------
> 1 files changed, 17 insertions(+), 18 deletions(-)
Well, you are right that we currently scale with "nr_groups*32", but changing
this will have an effect on default scheduling behavior, especially if there
are a lot of groups.
So either it has to be shown (measured, demonstrated) that the current behavior
is catastrophic or clearly bad in some workloads, or it has to be shown
(measured) that it has no bad effect on the balancing quality of workloads
involving a lot of groups.
What was the motivation for the patch - have you noticed it via review, or have
you run into a workload that demonstrated it? Such details need to be in
changelogs.
If there's adverse effect on balancing quality we might still do something
about the number of iterations, but it all has to be done a lot more carefully
than just capping it to 32 globally, without any numbers and analysis ...
Thanks,
Ingo
* Re: [PATCH] sched: fix erroneous sysctl_sched_nr_migrate logic
[not found] ` <9532E773-BD05-4A18-B9BD-6B667F15FF91@parallels.com>
@ 2011-05-10 9:10 ` Peter Zijlstra
From: Peter Zijlstra @ 2011-05-10 9:10 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Nikhil Rao,
Mike Galbraith, Srivatsa Vaddagiri, Stephan Barwolf,
Nikunj A. Dadhania
On Fri, 2011-05-06 at 14:02 +0400, Vladimir Davydov wrote:
>
> But looking through the code, I found the definition:
>
>     /*
>      * Number of tasks to iterate in a single balance run.
>      * Limited because this is done with IRQs disabled.
>      */
>     const_debug unsigned int sysctl_sched_nr_migrate = 32;
>
> However, AFAIS from the code, the number of tasks to iterate is
> virtually unlimited. So I conclude either the comment is confusing, or
> the logic is wrong.
>
Right, so I mostly agree with you (haven't actually read your patch yet); the
one worry I have is that we'll get stuck endlessly trying to balance the first
cgroup: when there are enough tasks in there but not enough weight, we'll get
stuck not making much progress.
This is one of the many things with the whole cgroup mess that needs
proper sorting out.
So yes, I agree with your interpretation of the sysctl, but I don't think a
straightforward accounting 'fix' will actually result in a better kernel.