* [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
[not found] <20140617130442.29933.54945.stgit@tkhai>
@ 2014-06-17 13:24 ` Kirill Tkhai
2014-06-23 10:07 ` Peter Zijlstra
2014-06-17 13:24 ` [PATCH 2/3] sched/rt: __disable_runtime: Enqueue just unthrottled rt_rq back on the stack Kirill Tkhai
2014-06-17 13:24 ` [PATCH 3/3] sched: Rework check_for_tasks() Kirill Tkhai
2 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2014-06-17 13:24 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
We kill rq->rd on the CPU_DOWN_PREPARE stage:
cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline
This unthrottles all throttled cfs_rqs.
But the cpu is still able to call schedule() till
take_cpu_down->__cpu_disable()
is called from stop_machine.
In this case, the tasks from the just-unthrottled cfs_rqs are
pickable in the standard scheduler way, and they are picked by
the dying cpu. The cfs_rqs become throttled again, and
migrate_tasks() in migration_call skips their tasks (one more
unthrottle in migrate_tasks()->CPU_DYING does not happen, because
rq->rd is already NULL).
The patch sets runtime_enabled to zero. This guarantees that the
runtime is no longer accounted, so the cfs_rqs cannot be throttled
again (cfs_rq->runtime_remaining stays at the value 1 set above),
and their tasks remain pickable in migrate_tasks().
runtime_enabled is recalculated when the rq becomes online again.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..f120525 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3775,6 +3775,19 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
hrtimer_cancel(&cfs_b->slack_timer);
}
+static void __maybe_unused update_runtime_enabled(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+
+ raw_spin_lock(&cfs_b->lock);
+ cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
+ raw_spin_unlock(&cfs_b->lock);
+ }
+}
+
static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct cfs_rq *cfs_rq;
@@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
cfs_rq->runtime_remaining = 1;
if (cfs_rq_throttled(cfs_rq))
unthrottle_cfs_rq(cfs_rq);
+
+ /*
+ * Offline rq is schedulable till cpu is completely disabled
+ * in take_cpu_down(), so we prevent new cfs throttling here.
+ */
+ cfs_rq->runtime_enabled = 0;
}
}
@@ -3831,6 +3850,7 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
return NULL;
}
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
#endif /* CONFIG_CFS_BANDWIDTH */
@@ -7325,6 +7345,8 @@ void trigger_load_balance(struct rq *rq)
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
+
+ update_runtime_enabled(rq);
}
static void rq_offline_fair(struct rq *rq)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/3] sched/rt: __disable_runtime: Enqueue just unthrottled rt_rq back on the stack
From: Kirill Tkhai @ 2014-06-17 13:24 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
Make the just-unthrottled rt_rq available for pick_next_task().
Otherwise, its tasks stay imprisoned for a long time, until the
dead cpu becomes alive again.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/rt.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a490831..671a8b5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -740,6 +740,9 @@ balanced:
rt_rq->rt_throttled = 0;
raw_spin_unlock(&rt_rq->rt_runtime_lock);
raw_spin_unlock(&rt_b->rt_runtime_lock);
+
+ /* Make rt_rq available for pick_next_task() */
+ sched_rt_rq_enqueue(rt_rq);
}
}
* [PATCH 3/3] sched: Rework check_for_tasks()
From: Kirill Tkhai @ 2014-06-17 13:24 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
1) Iterate through all threads in the system.
Check all threads, not only group leaders.
2) Check p->on_rq instead of p->state and cputime.
A preempted task in a !TASK_RUNNING state, or a just-created
task, may be queued, and we want those to be reported too.
3) Use read_lock() instead of write_lock().
This function does not change any structures, so
read_lock() is enough.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@kernel.org>
---
kernel/cpu.c | 33 ++++++++++++++++++++-------------
1 file changed, 20 insertions(+), 13 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a343bde..81e2a38 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -274,21 +274,28 @@ void clear_tasks_mm_cpumask(int cpu)
rcu_read_unlock();
}
-static inline void check_for_tasks(int cpu)
+static inline void check_for_tasks(int dead_cpu)
{
- struct task_struct *p;
- cputime_t utime, stime;
+ struct task_struct *g, *p;
- write_lock_irq(&tasklist_lock);
- for_each_process(p) {
- task_cputime(p, &utime, &stime);
- if (task_cpu(p) == cpu && p->state == TASK_RUNNING &&
- (utime || stime))
- pr_warn("Task %s (pid = %d) is on cpu %d (state = %ld, flags = %x)\n",
- p->comm, task_pid_nr(p), cpu,
- p->state, p->flags);
- }
- write_unlock_irq(&tasklist_lock);
+ read_lock_irq(&tasklist_lock);
+ do_each_thread(g, p) {
+ if (!p->on_rq)
+ continue;
+ /*
+ * We do the check with unlocked task_rq(p)->lock.
+ * Order the reading to do not warn about a task,
+ * which was running on this cpu in the past, and
+ * it's just been woken on another cpu.
+ */
+ rmb();
+ if (task_cpu(p) != dead_cpu)
+ continue;
+
+ pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
+ p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
+ } while_each_thread(g, p);
+ read_unlock_irq(&tasklist_lock);
}
struct take_cpu_down_param {
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: Peter Zijlstra @ 2014-06-23 10:07 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> cfs_rq->runtime_remaining = 1;
> if (cfs_rq_throttled(cfs_rq))
> unthrottle_cfs_rq(cfs_rq);
> +
> + /*
> + * Offline rq is schedulable till cpu is completely disabled
> + * in take_cpu_down(), so we prevent new cfs throttling here.
> + */
> + cfs_rq->runtime_enabled = 0;
Does it make sense to clear this before calling unthrottle_cfs_rq()?
Just to make sure they're in the right order..
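In code, the suggested ordering would look roughly like this (an untested sketch of the patched loop in unthrottle_offline_cfs_rqs(), reordered as suggested; not a compiled diff):

```c
for_each_leaf_cfs_rq(rq, cfs_rq) {
	if (!cfs_rq->runtime_enabled)
		continue;
	/*
	 * Clear runtime_enabled before unthrottling, so the
	 * freshly unthrottled cfs_rq cannot be throttled again
	 * while the dying cpu is still schedulable.
	 */
	cfs_rq->runtime_enabled = 0;
	cfs_rq->runtime_remaining = 1;
	if (cfs_rq_throttled(cfs_rq))
		unthrottle_cfs_rq(cfs_rq);
}
```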
> }
> }
* Re: [PATCH 2/3] sched/rt: __disable_runtime: Enqueue just unthrottled rt_rq back on the stack
From: Peter Zijlstra @ 2014-06-23 10:12 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Tue, Jun 17, 2014 at 05:24:16PM +0400, Kirill Tkhai wrote:
>
> Make rt_rq available for pick_next_task(). Otherwise, their tasks
> stay prisoned long time till dead cpu becomes alive again.
>
> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
> CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> CC: Mike Galbraith <umgwanakikbuti@gmail.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ingo Molnar <mingo@kernel.org>
> ---
> kernel/sched/rt.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a490831..671a8b5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -740,6 +740,9 @@ balanced:
Depending on what you use to generate patches, can you:
QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
or .gitconfig:
[diff "default"]
xfuncname = "^[[:alpha:]$_].*[^:]$"
That avoids mistaking labels (like the above balanced:) for function
names.
The patch itself makes sense though.
* Re: [PATCH 3/3] sched: Rework check_for_tasks()
From: Peter Zijlstra @ 2014-06-23 10:24 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Tue, Jun 17, 2014 at 05:24:22PM +0400, Kirill Tkhai wrote:
>
> 1)Iterate throw all of threads in the system.
thru
> Check for all threads, not only for group leaders.
>
> 2)Check for p->on_rq instead of p->state and cputime.
> Preempted task in !TASK_RUNNING state OR just
> created task may be queued, that we want to be
> reported too.
>
> 3)Use read_lock() instead of write_lock().
> This function does not change any structures, and
> read_lock() is enough.
>
> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
> CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> CC: Mike Galbraith <umgwanakikbuti@gmail.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ingo Molnar <mingo@kernel.org>
> ---
> kernel/cpu.c | 33 ++++++++++++++++++++-------------
> 1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index a343bde..81e2a38 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -274,21 +274,28 @@ void clear_tasks_mm_cpumask(int cpu)
> rcu_read_unlock();
> }
>
> -static inline void check_for_tasks(int cpu)
> +static inline void check_for_tasks(int dead_cpu)
> {
> - struct task_struct *p;
> - cputime_t utime, stime;
> + struct task_struct *g, *p;
>
> - write_lock_irq(&tasklist_lock);
> - for_each_process(p) {
> - task_cputime(p, &utime, &stime);
> - if (task_cpu(p) == cpu && p->state == TASK_RUNNING &&
> - (utime || stime))
> - pr_warn("Task %s (pid = %d) is on cpu %d (state = %ld, flags = %x)\n",
> - p->comm, task_pid_nr(p), cpu,
> - p->state, p->flags);
> - }
> - write_unlock_irq(&tasklist_lock);
> + read_lock_irq(&tasklist_lock);
> + do_each_thread(g, p) {
> + if (!p->on_rq)
> + continue;
> + /*
> + * We do the check with unlocked task_rq(p)->lock.
> + * Order the reading to do not warn about a task,
> + * which was running on this cpu in the past, and
> + * it's just been woken on another cpu.
> + */
> + rmb();
smp_rmb();
> + if (task_cpu(p) != dead_cpu)
> + continue;
But because we don't have rq->lock held, we can be in the middle of a
wakeup and miss a task.
Then again, I suppose anything without rq->lock can and will miss tasks.
> + pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
> + p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
> + } while_each_thread(g, p);
> + read_unlock_irq(&tasklist_lock);
> }
>
> struct take_cpu_down_param {
>
>
>
* Re: [PATCH 3/3] sched: Rework check_for_tasks()
From: Kirill Tkhai @ 2014-06-23 10:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Mon, 23/06/2014 at 12:24 +0200, Peter Zijlstra wrote:
> On Tue, Jun 17, 2014 at 05:24:22PM +0400, Kirill Tkhai wrote:
> >
> > 1)Iterate throw all of threads in the system.
>
> thru
Thanks :)
>
> > Check for all threads, not only for group leaders.
> >
> > 2)Check for p->on_rq instead of p->state and cputime.
> > Preempted task in !TASK_RUNNING state OR just
> > created task may be queued, that we want to be
> > reported too.
> >
> > 3)Use read_lock() instead of write_lock().
> > This function does not change any structures, and
> > read_lock() is enough.
> >
> > Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
> > CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > CC: Mike Galbraith <umgwanakikbuti@gmail.com>
> > CC: Peter Zijlstra <peterz@infradead.org>
> > CC: Ingo Molnar <mingo@kernel.org>
> > ---
> > kernel/cpu.c | 33 ++++++++++++++++++++-------------
> > 1 file changed, 20 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index a343bde..81e2a38 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -274,21 +274,28 @@ void clear_tasks_mm_cpumask(int cpu)
> > rcu_read_unlock();
> > }
> >
> > -static inline void check_for_tasks(int cpu)
> > +static inline void check_for_tasks(int dead_cpu)
> > {
> > - struct task_struct *p;
> > - cputime_t utime, stime;
> > + struct task_struct *g, *p;
> >
> > - write_lock_irq(&tasklist_lock);
> > - for_each_process(p) {
> > - task_cputime(p, &utime, &stime);
> > - if (task_cpu(p) == cpu && p->state == TASK_RUNNING &&
> > - (utime || stime))
> > - pr_warn("Task %s (pid = %d) is on cpu %d (state = %ld, flags = %x)\n",
> > - p->comm, task_pid_nr(p), cpu,
> > - p->state, p->flags);
> > - }
> > - write_unlock_irq(&tasklist_lock);
> > + read_lock_irq(&tasklist_lock);
> > + do_each_thread(g, p) {
> > + if (!p->on_rq)
> > + continue;
> > + /*
> > + * We do the check with unlocked task_rq(p)->lock.
> > + * Order the reading to do not warn about a task,
> > + * which was running on this cpu in the past, and
> > + * it's just been woken on another cpu.
> > + */
> > + rmb();
>
> smp_rmb();
>
> > + if (task_cpu(p) != dead_cpu)
> > + continue;
>
> But because we don't have rq->lock held, we can be in the middle of a
> wakeup and miss a task.
>
> Then again, I suppose anything without rq->lock can and will miss tasks.
If we use rq->lock it's possible to move check_for_tasks() to kernel/sched/core.c.
And we can leave TASK_RUNNING check for waking tasks. Maybe something like this?
static inline void check_for_tasks(int dead_cpu)
{
struct task_struct *g, *p;
struct rq *rq = cpu_rq(dead_cpu);
read_lock_irq(&tasklist_lock);
raw_spin_lock(&rq->lock);
do_each_thread(g, p) {
if (!p->on_rq && p->state != TASK_RUNNING)
continue;
if (task_cpu(p) != dead_cpu)
continue;
pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
} while_each_thread(g, p);
raw_spin_unlock(&rq->lock);
read_unlock_irq(&tasklist_lock);
}
It still does not give a 100% guarantee... Should we take p->pi_lock for every task?
> > + pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
> > + p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
> > + } while_each_thread(g, p);
> > + read_unlock_irq(&tasklist_lock);
> > }
> >
> > struct take_cpu_down_param {
> >
> >
> >
Regards,
Kirill
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: Kirill Tkhai @ 2014-06-23 10:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Mon, 23/06/2014 at 12:07 +0200, Peter Zijlstra wrote:
> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
> > @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> > cfs_rq->runtime_remaining = 1;
> > if (cfs_rq_throttled(cfs_rq))
> > unthrottle_cfs_rq(cfs_rq);
> > +
> > + /*
> > + * Offline rq is schedulable till cpu is completely disabled
> > + * in take_cpu_down(), so we prevent new cfs throttling here.
> > + */
> > + cfs_rq->runtime_enabled = 0;
>
> Does it make sense to clear this before calling unthrottle_cfs_rq()?
> Just to make sure they're in the right order..
This looks good to me. I'll test and resend.
>
> > }
> > }
* Re: [PATCH 3/3] sched: Rework check_for_tasks()
From: Peter Zijlstra @ 2014-06-23 14:21 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko
On Mon, Jun 23, 2014 at 02:52:18PM +0400, Kirill Tkhai wrote:
> > Then again, I suppose anything without rq->lock can and will miss tasks.
>
> If we use rq->lock it's possible to move check_for_tasks() to kernel/sched/core.c.
>
> And we can leave TASK_RUNNING check for waking tasks. Maybe something like this?
>
> static inline void check_for_tasks(int dead_cpu)
> {
> struct task_struct *g, *p;
> struct rq *rq = cpu_rq(dead_cpu);
>
> read_lock_irq(&tasklist_lock);
> raw_spin_lock(&rq->lock);
>
> do_each_thread(g, p) {
> if (!p->on_rq && p->state != TASK_RUNNING)
> continue;
> if (task_cpu(p) != dead_cpu)
> continue;
>
> pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
> p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
> } while_each_thread(g, p);
>
> raw_spin_unlock(&rq->lock);
> read_unlock_irq(&tasklist_lock);
> }
>
> It still does not give a 100% guarantee... Should we take p->pi_lock for every task?
seeing how rq->lock nests inside of ->pi_lock that's going to be
somewhat icky.
I think we can live with a false negative, given how much people run
this nonsense it'll trigger eventually.
False positives would be bad though :-)
So I think we can keep your original (lock-free) proposal.
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: bsegall @ 2014-06-23 17:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, tkhai, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko, pjt
Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>> cfs_rq->runtime_remaining = 1;
>> if (cfs_rq_throttled(cfs_rq))
>> unthrottle_cfs_rq(cfs_rq);
>> +
>> + /*
>> + * Offline rq is schedulable till cpu is completely disabled
>> + * in take_cpu_down(), so we prevent new cfs throttling here.
>> + */
>> + cfs_rq->runtime_enabled = 0;
>
> Does it make sense to clear this before calling unthrottle_cfs_rq()?
> Just to make sure they're in the right order..
I believe that order is irrelevant here - I do not believe that
unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
see any code that will check it at all from unthrottle, although I might
be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
but that should be fine as long as for_each_leaf_cfs_rq is sorted
correctly.
That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
wake another task onto this cpu, which could then enqueue_throttle its
cfs_rq (which previously had no tasks and thus wasn't on
leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
turn runtime_enabled on.
I think the general idea of turning runtime_enabled off during offline
is fine, ccing pjt to double check.
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: Kirill Tkhai @ 2014-06-23 20:49 UTC (permalink / raw)
To: bsegall, Peter Zijlstra
Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Srikar Dronamraju,
Mike Galbraith, Konstantin Khorenko, pjt
On 23.06.2014 21:29, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>> cfs_rq->runtime_remaining = 1;
>>> if (cfs_rq_throttled(cfs_rq))
>>> unthrottle_cfs_rq(cfs_rq);
>>> +
>>> + /*
>>> + * Offline rq is schedulable till cpu is completely disabled
>>> + * in take_cpu_down(), so we prevent new cfs throttling here.
>>> + */
>>> + cfs_rq->runtime_enabled = 0;
>>
>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>> Just to make sure they're in the right order..
>
> I believe that order is irrelevant here - I do not believe that
> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
> see any code that will check it at all from unthrottle, although I might
> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
> but that should be fine as long as for_each_leaf_cfs_rq is sorted
> correctly.
I think this is correct. Still, we could change the order in the
hope that somebody reworking this code in the future can skip over
this opaque place more easily. OK, I don't know which way is
better :)
> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
> wake another task onto this cpu, which could then enqueue_throttle its
> cfs_rq (which previously had no tasks and thus wasn't on
> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
> turn runtime_enabled on.
We mask the cpu inactive at the CPU_DOWN_PREPARE stage (in
sched_cpu_inactive()). After that, other cpus are not able to wake
a task onto it.
The rq is masked offline in cpuset_cpu_inactive() (during the same
stage), but the priority of sched_cpu_inactive() is higher than
the priority of cpuset_cpu_inactive():
CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
CPU_PRI_CPUSET_INACTIVE = INT_MIN,
This guarantees that nobody could use dying_cpu even before we start
unthrottling. Another cpu will see dying_cpu is inactive.
So, it looks like we are free of this problem.
> I think the general idea of turning runtime_enabled off during offline
> is fine, ccing pjt to double check.
Thanks,
Kirill
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: bsegall @ 2014-06-23 21:05 UTC (permalink / raw)
To: tkhai
Cc: Peter Zijlstra, Kirill Tkhai, linux-kernel, Ingo Molnar,
Srikar Dronamraju, Mike Galbraith, Konstantin Khorenko, pjt
Kirill Tkhai <tkhai@yandex.ru> writes:
> On 23.06.2014 21:29, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>>> cfs_rq->runtime_remaining = 1;
>>>> if (cfs_rq_throttled(cfs_rq))
>>>> unthrottle_cfs_rq(cfs_rq);
>>>> +
>>>> + /*
>>>> + * Offline rq is schedulable till cpu is completely disabled
>>>> + * in take_cpu_down(), so we prevent new cfs throttling here.
>>>> + */
>>>> + cfs_rq->runtime_enabled = 0;
>>>
>>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>>> Just to make sure they're in the right order..
>>
>> I believe that order is irrelevant here - I do not believe that
>> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
>> see any code that will check it at all from unthrottle, although I might
>> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
>> but that should be fine as long as for_each_leaf_cfs_rq is sorted
>> correctly.
>
> I think this is correct. We may change the order just for the hope, that
> anybody will work on it in some way in the future, and this person could
> skip this opaque place. Ok, I don't know how is better :)
>
>> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
>> wake another task onto this cpu, which could then enqueue_throttle its
>> cfs_rq (which previously had no tasks and thus wasn't on
>> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
>> turn runtime_enabled on.
>
> We mask cpu inactive on CPU_DOWN_PREPARE stage (in sched_cpu_inactive).
> Other cpu is not able to wake a task there after that.
>
> rq is masked offline in cpuset_cpu_inactive() (during the same stage).
> But priority of sched_cpu_inactive() is higher than priority of
> cpuset_cpu_inactive().
>
> CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
> CPU_PRI_CPUSET_INACTIVE = INT_MIN,
>
> This guarantees that nobody could use dying_cpu even before we start
> unthrottling. Another cpu will see dying_cpu is inactive.
>
> So, it looks like we are free of this problem.
Ah, ok, I haven't looked that hard at hotplug, and wasn't sure of the
ordering there. We still have the tg_set_cfs_bandwidth issue, because
that uses for_each_possible_cpu. However, with the addition of
rq_online_fair, that can probably be changed to for_each_active_cpu, and
then I think we would be fine.
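Ben's suggestion would amount to something like the following in tg_set_cfs_bandwidth() (an untested sketch: for_each_online_cpu() is used here as the concrete form of "active" iteration, `runtime_enabled` stands for the value computed from the new quota, and the surrounding period/quota update under the bandwidth lock is omitted):

```c
/*
 * Iterate only cpus whose rq can be online, so runtime_enabled is
 * never re-set on a dying cpu; rq_online_fair() ->
 * update_runtime_enabled() recomputes it when the cpu comes back.
 */
for_each_online_cpu(i) {
	struct cfs_rq *cfs_rq = tg->cfs_rq[i];
	struct rq *rq = cfs_rq->rq;

	raw_spin_lock_irq(&rq->lock);
	cfs_rq->runtime_enabled = runtime_enabled;
	cfs_rq->runtime_remaining = 0;
	if (cfs_rq_throttled(cfs_rq))
		unthrottle_cfs_rq(cfs_rq);
	raw_spin_unlock_irq(&rq->lock);
}
```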
* Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq
From: Kirill Tkhai @ 2014-06-23 21:15 UTC (permalink / raw)
To: bsegall
Cc: Peter Zijlstra, Kirill Tkhai, linux-kernel, Ingo Molnar,
Srikar Dronamraju, Mike Galbraith, Konstantin Khorenko, pjt
On 24.06.2014 01:05, bsegall@google.com wrote:
> Kirill Tkhai <tkhai@yandex.ru> writes:
>
>> On 23.06.2014 21:29, bsegall@google.com wrote:
>>> Peter Zijlstra <peterz@infradead.org> writes:
>>>
>>>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>>>> cfs_rq->runtime_remaining = 1;
>>>>> if (cfs_rq_throttled(cfs_rq))
>>>>> unthrottle_cfs_rq(cfs_rq);
>>>>> +
>>>>> + /*
>>>>> + * Offline rq is schedulable till cpu is completely disabled
>>>>> + * in take_cpu_down(), so we prevent new cfs throttling here.
>>>>> + */
>>>>> + cfs_rq->runtime_enabled = 0;
>>>>
>>>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>>>> Just to make sure they're in the right order..
>>>
>>> I believe that order is irrelevant here - I do not believe that
>>> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
>>> see any code that will check it at all from unthrottle, although I might
>>> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
>>> but that should be fine as long as for_each_leaf_cfs_rq is sorted
>>> correctly.
>>
>> I think this is correct. We may change the order just for the hope, that
>> anybody will work on it in some way in the future, and this person could
>> skip this opaque place. Ok, I don't know how is better :)
>>
>>> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
>>> wake another task onto this cpu, which could then enqueue_throttle its
>>> cfs_rq (which previously had no tasks and thus wasn't on
>>> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
>>> turn runtime_enabled on.
>>
>> We mask cpu inactive on CPU_DOWN_PREPARE stage (in sched_cpu_inactive).
>> Other cpu is not able to wake a task there after that.
>>
>> rq is masked offline in cpuset_cpu_inactive() (during the same stage).
>> But priority of sched_cpu_inactive() is higher than priority of
>> cpuset_cpu_inactive().
>>
>> CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
>> CPU_PRI_CPUSET_INACTIVE = INT_MIN,
>>
>> This guarantees that nobody could use dying_cpu even before we start
>> unthrottling. Another cpu will see dying_cpu is inactive.
>>
>> So, it looks like we are free of this problem.
>
> Ah, ok, I haven't looked that hard at hotplug, and wasn't sure of the
> ordering there. We still have the tg_set_cfs_bandwidth issue, because
> that uses for_each_possible_cpu. However, with the addition of
> rq_online_fair, that can probably be changed to for_each_active_cpu, and
> then I think we would be fine.
Ok, now I see. Thanks, Ben.