[PATCHSET sched/core] sched: prepare for cmwq

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCHSET sched/core] sched: prepare for cmwq
@ 2010-05-13 10:48 Tejun Heo
  2010-05-13 10:48 ` [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-13 10:48 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel

Hello, Ingo, Peter.

These four patches are the scheduler modifications necessary for cmwq
and contains the following four patches.

 0001-sched-consult-online-mask-instead-of-active-in-selec.patch
 0002-sched-implement-__set_cpus_allowed.patch
 0003-sched-refactor-try_to_wake_up.patch
 0004-sched-add-hooks-for-workqueue.patch

The first three have been posted multiple times as part of cmwq
patchset multiple times.  Other than description updates to explain
why they are needed, these three patches haven't changed from the last
posting[L].

The last patch adds two hardcoded workqueue hooks as suggested in in
the unify-tracers-in-sched thread[U].  These new hardcoded hooks are
much simpler and no scheduler behavior change is leaked outside of
sched.c.  Workqueue workers are marked with PF_WQ_WORKER and hooks are
called only for them.  Instead of exporting try_to_wake_up_local(),
the sleeping hook now returns a pointer to local task which is woken
up by scheduler.  The hooks are defined as noop for now and will be
filled later by cmwq implementation.

The change from sched_notifier to the hardcoded hooks doesn't make any
notable difference to cmwq implementation itself.

This patchset is available in the following git tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq

and contains the following changes.

 include/linux/sched.h    |   15 ++-
 kernel/fork.c            |    2 
 kernel/sched.c           |  191 +++++++++++++++++++++++++++++++++--------------
 kernel/workqueue_sched.h |   16 +++
 4 files changed, 163 insertions(+), 61 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/954759
[U] http://thread.gmane.org/gmane.linux.kernel/980857

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq()
  2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
@ 2010-05-13 10:48 ` Tejun Heo
  2010-05-31  8:01   ` Peter Zijlstra
  2010-05-13 10:48 ` [PATCH 2/4] sched: implement __set_cpus_allowed() Tejun Heo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-13 10:48 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel; +Cc: Tejun Heo

If called after sched_class chooses a CPU which isn't in a task's
cpus_allowed mask, select_fallback_rq() can end up migrating a task
which is bound to an !active but online cpu to an active cpu.  This is
dangerous because active is cleared before CPU_DOWN_PREPARE is called
and subsystems expect affinities of kthreads and other tasks to be
maintained till their CPU_DOWN_PREPARE callbacks are complete.

Consult cpu_online_mask instead.

This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
callback creates the trustee kthread and kthread_bind()s it to the
target cpu, and the trustee is expected to run on that cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b531d79..aca4a20 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2262,12 +2262,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
 
 	/* Look for allowed, online CPU in same node. */
-	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
+	for_each_cpu_and(dest_cpu, nodemask, cpu_online_mask)
 		if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
 			return dest_cpu;
 
 	/* Any allowed, online CPU? */
-	dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
+	dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
 	if (dest_cpu < nr_cpu_ids)
 		return dest_cpu;
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
  2010-05-13 10:48 ` [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
@ 2010-05-13 10:48 ` Tejun Heo
  2010-05-31  8:01   ` Peter Zijlstra
  2010-05-13 10:48 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-13 10:48 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel; +Cc: Tejun Heo, Rusty Russell, Mike Galbraith

Concurrency managed workqueue needs to be able to migrate tasks to a
cpu which is online but !active for the following two purposes.

p1. To guarantee forward progress during cpu down sequence.  Each
    workqueue which could be depended upon during memory allocation
    has an emergency worker task which is summoned when a pending work
    on such workqueue can't be serviced immediately.  cpu hotplug
    callbacks expect workqueues to work during cpu down sequence
    (usually so that they can flush them), so, to guarantee forward
    progress, it should be possible to summon emergency workers to
    !active but online cpus.

p2. To migrate back unbound workers when a cpu comes back online.
    When a cpu goes down, existing workers are unbound from the cpu
    and allowed to run on other cpus if there still are pending or
    running works.  If the cpu comes back online while those workers
    are still around, those workers are migrated back and re-bound to
    the cpu.  This isn't strictly required for correctness as long as
    those unbound workers don't execute works which are newly
    scheduled after the cpu comes back online; however, migrating back
    the workers has the advantage of making the behavior more
    consistent thus avoiding surprises which are difficult to expect
    and reproduce, and being actually cleaner and easier to implement.

To implement this, __set_cpus_allowed() is factored out from
set_cpus_allowed_ptr() and @force parameter is added to it.  The
latter is now a wrapper around the former with @force set to %false.
When @force is %false, the following original behaviors are
maintained.

c1. Check whether PF_THREAD_BOUND is set.  This is set for bound
    kthreads so that they can't be moved around.

c2. Check whether the target cpu is still marked active -
    cpu_active().  Active state is cleared early while downing a cpu.

When @force parameter is %true, __set_cpus_allowed() ignores c1 and
uses cpu online state instead of active state for c2.

Due to the way migration is implemented, the @force parameter needs to
be passed over to the migration cpu_stop callback.  @force parameter
is added to struct migration_req and passed to __migrate_task().

Please note the naming discrepancy between set_cpus_allowed_ptr() and
the new functions.  The _ptr suffix is from the days when cpumask API
wasn't mature and future changes should drop it from
set_cpus_allowed_ptr() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |   14 +++++++++---
 kernel/sched.c        |   51 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dfea405..ef6067c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,11 +1776,11 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SMP
-extern int set_cpus_allowed_ptr(struct task_struct *p,
-				const struct cpumask *new_mask);
+extern int __set_cpus_allowed(struct task_struct *p,
+			      const struct cpumask *new_mask, bool force);
 #else
-static inline int set_cpus_allowed_ptr(struct task_struct *p,
-				       const struct cpumask *new_mask)
+static inline int __set_cpus_allowed(struct task_struct *p,
+				     const struct cpumask *new_mask, bool force)
 {
 	if (!cpumask_test_cpu(0, new_mask))
 		return -EINVAL;
@@ -1788,6 +1788,12 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
 }
 #endif
 
+static inline int set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	return __set_cpus_allowed(p, new_mask, false);
+}
+
 #ifndef CONFIG_CPUMASK_OFFSTACK
 static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 {
diff --git a/kernel/sched.c b/kernel/sched.c
index aca4a20..ecf024d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2038,6 +2038,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 struct migration_arg {
 	struct task_struct *task;
 	int dest_cpu;
+	bool force;
 };
 
 static int migration_cpu_stop(void *data);
@@ -3111,7 +3112,7 @@ void sched_exec(void)
 	 */
 	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
 	    likely(cpu_active(dest_cpu)) && migrate_task(p, dest_cpu)) {
-		struct migration_arg arg = { p, dest_cpu };
+		struct migration_arg arg = { p, dest_cpu, false };
 
 		task_rq_unlock(rq, &flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
@@ -5280,17 +5281,27 @@ static inline void sched_init_granularity(void)
  *    is done.
  */
 
-/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
+/**
+ * __set_cpus_allowed - change a task's CPU affinity
+ * @p: task to change CPU affinity for
+ * @new_mask: new CPU affinity
+ * @force: override CPU active status and PF_THREAD_BOUND check
+ *
+ * Migrate the thread to a proper CPU and schedule it away if the CPU
+ * it's executing on is removed from the allowed bitmask.
+ *
+ * The caller must have a valid reference to the task, the task must
+ * not exit() & deallocate itself prematurely. The call is not atomic;
+ * no spinlocks may be held.
  *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
+ * If @force is %true, PF_THREAD_BOUND test is bypassed and CPU active
+ * state is ignored as long as the CPU is online.
  */
-int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+int __set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask,
+		       bool force)
 {
+	const struct cpumask *cpu_cand_mask =
+		force ? cpu_online_mask : cpu_active_mask;
 	unsigned long flags;
 	struct rq *rq;
 	unsigned int dest_cpu;
@@ -5309,12 +5320,12 @@ again:
 		goto again;
 	}
 
-	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
+	if (!cpumask_intersects(new_mask, cpu_cand_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+	if (unlikely((p->flags & PF_THREAD_BOUND) && !force && p != current &&
 		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
 		ret = -EINVAL;
 		goto out;
@@ -5331,9 +5342,9 @@ again:
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;
 
-	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
+	dest_cpu = cpumask_any_and(cpu_cand_mask, new_mask);
 	if (migrate_task(p, dest_cpu)) {
-		struct migration_arg arg = { p, dest_cpu };
+		struct migration_arg arg = { p, dest_cpu, force };
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, &flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
@@ -5345,7 +5356,7 @@ out:
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
+EXPORT_SYMBOL_GPL(__set_cpus_allowed);
 
 /*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
@@ -5358,12 +5369,15 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
  *
  * Returns non-zero if task was successfully migrated.
  */
-static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
+static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu,
+			  bool force)
 {
+	const struct cpumask *cpu_cand_mask =
+		force ? cpu_online_mask : cpu_active_mask;
 	struct rq *rq_dest, *rq_src;
 	int ret = 0;
 
-	if (unlikely(!cpu_active(dest_cpu)))
+	if (unlikely(!cpumask_test_cpu(dest_cpu, cpu_cand_mask)))
 		return ret;
 
 	rq_src = cpu_rq(src_cpu);
@@ -5408,7 +5422,8 @@ static int migration_cpu_stop(void *data)
 	 * be on another cpu but it doesn't matter.
 	 */
 	local_irq_disable();
-	__migrate_task(arg->task, raw_smp_processor_id(), arg->dest_cpu);
+	__migrate_task(arg->task, raw_smp_processor_id(), arg->dest_cpu,
+		       arg->force);
 	local_irq_enable();
 	return 0;
 }
@@ -5435,7 +5450,7 @@ void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
 	 * in the racer should migrate the task anyway.
 	 */
 	if (needs_cpu)
-		__migrate_task(p, dead_cpu, dest_cpu);
+		__migrate_task(p, dead_cpu, dest_cpu, false);
 	local_irq_restore(flags);
 }
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] sched: refactor try_to_wake_up()
  2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
  2010-05-13 10:48 ` [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
  2010-05-13 10:48 ` [PATCH 2/4] sched: implement __set_cpus_allowed() Tejun Heo
@ 2010-05-13 10:48 ` Tejun Heo
  2010-05-13 10:48 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
  2010-05-17 23:13 ` [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
  4 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-13 10:48 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel; +Cc: Tejun Heo, Mike Galbraith

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating the same object code as before or slightly different one
due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Mike Galbraith suggested rename to ttwu_post_activation() from
ttwu_woken_up() and comment update in try_to_wake_up().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   83 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ecf024d..0b6da84 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2322,11 +2322,52 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local,
+				 unsigned long en_flags)
+{
+	schedstat_inc(p, se.statistics.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.statistics.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.statistics.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+
+	activate_task(rq, p, en_flags);
+}
+
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+					int wake_flags, bool success)
+{
+	trace_sched_wakeup(p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_woken)
+		p->sched_class->task_woken(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
+ * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2334,7 +2375,8 @@ static void update_avg(u64 *avg, u64 sample)
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2414,38 +2456,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
-	activate_task(rq, p, en_flags);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu, en_flags);
 	success = 1;
-
 out_running:
-	trace_sched_wakeup(p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_woken)
-		p->sched_class->task_woken(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_post_activation(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] sched: add hooks for workqueue
  2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
                   ` (2 preceding siblings ...)
  2010-05-13 10:48 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo
@ 2010-05-13 10:48 ` Tejun Heo
  2010-05-31  8:01   ` Peter Zijlstra
  2010-05-17 23:13 ` [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
  4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-13 10:48 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel; +Cc: Tejun Heo, Mike Galbraith

Concurrency managed workqueue needs to know when workers are going to
sleep and waking up, and, when a worker goes to sleep, be able to wake
up another worker to maintain adequate concurrency.  This patch
introduces PF_WQ_WORKER to identify workqueue workers and adds the
following two hooks.

* wq_worker_waking_up(): called when a worker is woken up.

* wq_worker_sleeping(): called when a worker is going to sleep and may
  return a pointer to a local task which should be woken up.  The
  returned task is woken up using try_to_wake_up_local() which is
  simplified ttwu which is called under rq lock and can only wake up
  local tasks.

Both hooks are currently defined as noop in kernel/workqueue_sched.h.
Later cmwq implementation will replace them with proper
implementation.

These hooks are hard coded as they'll always be enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h    |    1 +
 kernel/fork.c            |    2 +-
 kernel/sched.c           |   53 ++++++++++++++++++++++++++++++++++++++++++++-
 kernel/workqueue_sched.h |   16 +++++++++++++
 4 files changed, 69 insertions(+), 3 deletions(-)
 create mode 100644 kernel/workqueue_sched.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef6067c..553ffae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1702,6 +1702,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
+#define PF_WQ_WORKER	0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC	0x00000040	/* forked but didn't exec */
 #define PF_MCE_PROCESS  0x00000080      /* process policy on mce errors */
 #define PF_SUPERPRIV	0x00000100	/* used super-user privileges */
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..1440f90 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,7 +900,7 @@ static void copy_flags(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long new_flags = p->flags;
 
-	new_flags &= ~PF_SUPERPRIV;
+	new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
 	new_flags |= PF_FORKNOEXEC;
 	new_flags |= PF_STARTING;
 	p->flags = new_flags;
diff --git a/kernel/sched.c b/kernel/sched.c
index 0b6da84..460b176 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -77,6 +77,7 @@
 #include <asm/irq_regs.h>
 
 #include "sched_cpupri.h"
+#include "workqueue_sched.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
@@ -2361,6 +2362,9 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	/* if a worker is waking up, notify workqueue */
+	if ((p->flags & PF_WQ_WORKER) && success)
+		wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /**
@@ -2469,6 +2473,37 @@ out:
 }
 
 /**
+ * try_to_wake_up_local - try to wake up a local task with rq lock held
+ * @p: the thread to be awakened
+ *
+ * Put @p on the run-queue if it's not alredy there.  The caller must
+ * ensure that this_rq() is locked, @p is bound to this_rq() and not
+ * the current task.  this_rq() stays locked over invocation.
+ */
+static void try_to_wake_up_local(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	BUG_ON(rq != this_rq());
+	BUG_ON(p == current);
+	lockdep_assert_held(&rq->lock);
+
+	if (!(p->state & TASK_NORMAL))
+		return;
+
+	if (!p->se.on_rq) {
+		if (likely(!task_running(rq, p))) {
+			schedstat_inc(rq, ttwu_count);
+			schedstat_inc(rq, ttwu_local);
+		}
+		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+		success = true;
+	}
+	ttwu_post_activation(p, rq, 0, success);
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -3673,10 +3708,24 @@ need_resched_nonpreemptible:
 	clear_tsk_need_resched(prev);
 
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
-		if (unlikely(signal_pending_state(prev->state, prev)))
+		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
-		else
+		} else {
+			/*
+			 * If a worker is going to sleep, notify and
+			 * ask workqueue whether it wants to wake up a
+			 * task to maintain concurrency.  If so, wake
+			 * up the task.
+			 */
+			if (prev->flags & PF_WQ_WORKER) {
+				struct task_struct *to_wakeup;
+
+				to_wakeup = wq_worker_sleeping(prev, cpu);
+				if (to_wakeup)
+					try_to_wake_up_local(to_wakeup);
+			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
new file mode 100644
index 0000000..af040ba
--- /dev/null
+++ b/kernel/workqueue_sched.h
@@ -0,0 +1,16 @@
+/*
+ * kernel/workqueue_sched.h
+ *
+ * Scheduler hooks for concurrency managed workqueue.  Only to be
+ * included from sched.c and workqueue.c.
+ */
+static inline void wq_worker_waking_up(struct task_struct *task,
+				       unsigned int cpu)
+{
+}
+
+static inline struct task_struct *wq_worker_sleeping(struct task_struct *task,
+						     unsigned int cpu)
+{
+	return NULL;
+}
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
                   ` (3 preceding siblings ...)
  2010-05-13 10:48 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
@ 2010-05-17 23:13 ` Tejun Heo
  2010-05-21 13:25   ` Tejun Heo
  4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-17 23:13 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel

On 05/13/2010 12:48 PM, Tejun Heo wrote:
> These four patches are the scheduler modifications necessary for cmwq
> and contains the following four patches.

Ping.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-17 23:13 ` [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
@ 2010-05-21 13:25   ` Tejun Heo
  2010-05-23  9:05     ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-21 13:25 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel

Hello, Peter, Ingo.

On 05/18/2010 01:13 AM, Tejun Heo wrote:
> On 05/13/2010 12:48 PM, Tejun Heo wrote:
>> These four patches are the scheduler modifications necessary for cmwq
>> and contains the following four patches.
> 
> Ping.
> 

I know we're inside -rc1 merge window but I'm really hoping to push
cmwq through linux-next during this devel cycle as the whole thing
basically has been blocked unchanged on scheduler changes for a couple
of months now.  So, can we please get this thing moving?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-21 13:25   ` Tejun Heo
@ 2010-05-23  9:05     ` Ingo Molnar
  2010-05-23  9:08       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-05-23  9:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: peterz, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Peter, Ingo.
> 
> On 05/18/2010 01:13 AM, Tejun Heo wrote:
> > On 05/13/2010 12:48 PM, Tejun Heo wrote:
> >> These four patches are the scheduler modifications necessary for cmwq
> >> and contains the following four patches.
> > 
> > Ping.
> > 
> 
> I know we're inside -rc1 merge window but I'm really 
> hoping to push cmwq through linux-next during this devel 
> cycle as the whole thing basically has been blocked 
> unchanged on scheduler changes for a couple of months 
> now.  So, can we please get this thing moving?

IIRC these changes are for some sort of speedup for btrfs, 
right?

I would really like to see before/after numbers on some 
important FS workload. How much of a speedup does it give?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23  9:05     ` Ingo Molnar
@ 2010-05-23  9:08       ` Tejun Heo
  2010-05-23  9:13         ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-23  9:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel

Hello, Ingo.

On 05/23/2010 11:05 AM, Ingo Molnar wrote:
> IIRC these changes are for some sort of speedup for btrfs, 
> right?

Eh.... What are you talking about?  These are for cmwq.  The "must
unify scheduler notification hooks", "nah, if always enabled, go with
hard coded hooks" stuff.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23  9:08       ` Tejun Heo
@ 2010-05-23  9:13         ` Ingo Molnar
  2010-05-23  9:24           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-05-23  9:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: peterz, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> On 05/23/2010 11:05 AM, Ingo Molnar wrote:
> > IIRC these changes are for some sort of speedup for btrfs, 
> > right?
> 
> Eh.... What are you talking about?  These are for cmwq.  
> The "must unify scheduler notification hooks", "nah, if 
> always enabled, go with hard coded hooks" stuff.

... which is then used for the concurrency-managed 
workqueue feature, which is then used for a BTRFS speedup, 
right?

( If that's not the plan then color me confused :)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23  9:13         ` Ingo Molnar
@ 2010-05-23  9:24           ` Tejun Heo
  2010-05-23 10:20             ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-23  9:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel

Hello,

On 05/23/2010 11:13 AM, Ingo Molnar wrote:
> ... which is then used for the concurrency-managed 
> workqueue feature, which is then used for a BTRFS speedup, 
> right?
>
> ( If that's not the plan then color me confused :)

Yeap, you're officially confused. :-) It doesn't have much to do with
btrfs at this point and performance test result was posted earlier
this year (on your request too) where it performed quite similarly to
the workqueue code on the favorable side while providing much easier
API for its users.

The whole thing has been sitting in my queue for months now blocked on
these scheduler changes.  I've requested multiple times to push this
forward on at least a separate sched devel branch and make incremental
changes from there if scheduler side is not satisfactory so that the
whole code base can be tested in wider scope and fix and improvement
history can be kept in git, about which you always have had a pretty
strong opinion about.  There is no outstanding major opposition to
cmwq itself (fscache conversion needs more performance testing tho).
The whole thing is blocked on these scheduler changes.  What's up?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23  9:24           ` Tejun Heo
@ 2010-05-23 10:20             ` Ingo Molnar
  2010-05-23 10:26               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-05-23 10:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: peterz, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 05/23/2010 11:13 AM, Ingo Molnar wrote:
>
> > ... which is then used for the concurrency-managed 
> > workqueue feature, which is then used for a BTRFS 
> > speedup, right?
> >
> > ( If that's not the plan then color me confused :)
> 
> Yeap, you're officially confused. :-) It doesn't have 
> much to do with btrfs at this point [...]

Not exclusive to btrfs of course, but IIRC one of the 
examples given in one of those threads was btrfs 
scalability, right?

In case it's not obvious, i'm trying to find a tangible 
improvement to examine. No need to resend stuff, do you 
have a link to that old performance measurement by any 
chance?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23 10:20             ` Ingo Molnar
@ 2010-05-23 10:26               ` Tejun Heo
  2010-05-27  8:26                 ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-23 10:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel

Hello, Ingo.

On 05/23/2010 12:20 PM, Ingo Molnar wrote:
> Not exclusive to btrfs of course, but IIRC one of the examples given
> in one of those threads was btrfs scalability, right?

Hmmm... I don't think so.  If btrfs ever was mentioned, it must have
been pretty tangential.  Nothing rings a bell to me.

> In case it's not obvious, i'm trying to find a tangible improvement
> to examine. No need to resend stuff, do you have a link to that old
> performance measurement by any chance?

Sure.

 http://thread.gmane.org/gmane.linux.kernel/939353

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET sched/core] sched: prepare for cmwq
  2010-05-23 10:26               ` Tejun Heo
@ 2010-05-27  8:26                 ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-27  8:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel

On 05/23/2010 12:26 PM, Tejun Heo wrote:
> Hello, Ingo.
> 
> On 05/23/2010 12:20 PM, Ingo Molnar wrote:
>> Not exclusive to btrfs of course, but IIRC one of the examples given
>> in one of those threads was btrfs scalability, right?
> 
> Hmmm... I don't think so.  If btrfs ever was mentioned, it must have
> been pretty tangential.  Nothing rings a bell to me.
> 
>> In case it's not obvious, i'm trying to find a tangible improvement
>> to examine. No need to resend stuff, do you have a link to that old
>> performance measurement by any chance?
> 
> Sure.
> 
>  http://thread.gmane.org/gmane.linux.kernel/939353

Ingo, any update?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq()
  2010-05-13 10:48 ` [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
@ 2010-05-31  8:01   ` Peter Zijlstra
  2010-05-31  9:48     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31  8:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel

On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
> If called after sched_class chooses a CPU which isn't in a task's
> cpus_allowed mask, select_fallback_rq() can end up migrating a task
> which is bound to an !active but online cpu to an active cpu.  This is
> dangerous because active is cleared before CPU_DOWN_PREPARE is called
> and subsystems expect affinities of kthreads and other tasks to be
> maintained till their CPU_DOWN_PREPARE callbacks are complete.

So 6ad4c188 (sched: Fix balance vs hotplug race) moved it that early
because it was done too late.

Could we not instead do it explicitly after CPU_DOWN_PREPARE? It would
of course mean removing the partition_sched_domain() and
generate_sched_domains() calls from these callbacks and doing it
explicitly.

So we need to do it before we take the CPU down, but I think we can do
it after DOWN_PREPARE.

> Consult cpu_online_mask instead.
> 
> This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
> callback creates the trustee kthread and kthread_bind()s it to the
> target cpu, and the trustee is expected to run on that cpu.

It doesn't explain wth a trustee kthread is, which pretty much renders
the whole paragraph useless.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-13 10:48 ` [PATCH 2/4] sched: implement __set_cpus_allowed() Tejun Heo
@ 2010-05-31  8:01   ` Peter Zijlstra
  2010-05-31  9:55     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31  8:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
> Concurrency managed workqueue needs to be able to migrate tasks to a
> cpu which is online but !active for the following two purposes.
> 
> p1. To guarantee forward progress during cpu down sequence.  Each
>     workqueue which could be depended upon during memory allocation
>     has an emergency worker task which is summoned when a pending work
>     on such workqueue can't be serviced immediately.  cpu hotplug
>     callbacks expect workqueues to work during cpu down sequence
>     (usually so that they can flush them), so, to guarantee forward
>     progress, it should be possible to summon emergency workers to
>     !active but online cpus.

If we do the thing suggested in the previous patch, that is move
clearing active and rebuilding the sched domains until right after
DOWN_PREPARE, this goes away, right?

> p2. To migrate back unbound workers when a cpu comes back online.
>     When a cpu goes down, existing workers are unbound from the cpu
>     and allowed to run on other cpus if there still are pending or
>     running works.  If the cpu comes back online while those workers
>     are still around, those workers are migrated back and re-bound to
>     the cpu.  This isn't strictly required for correctness as long as
>     those unbound workers don't execute works which are newly
>     scheduled after the cpu comes back online; however, migrating back
>     the workers has the advantage of making the behavior more
>     consistent thus avoiding surprises which are difficult to expect
>     and reproduce, and being actually cleaner and easier to implement.

I still don't like this much, if you mark these tasks to simply die when
the queue is exhausted, and flush the queue explicitly on
CPU_UP_PREPARE, you should never need to do this.


After which I think you don't need this patch anymore..

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] sched: add hooks for workqueue
  2010-05-13 10:48 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
@ 2010-05-31  8:01   ` Peter Zijlstra
  2010-05-31  9:58     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31  8:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Mike Galbraith

On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
> Concurrency managed workqueue needs to know when workers are going to
> sleep and waking up, and, when a worker goes to sleep, be able to wake
> up another worker to maintain adequate concurrency.  This patch
> introduces PF_WQ_WORKER to identify workqueue workers and adds the
> following two hooks.
> 
> * wq_worker_waking_up(): called when a worker is woken up.
> 
> * wq_worker_sleeping(): called when a worker is going to sleep and may
>   return a pointer to a local task which should be woken up.  The
>   returned task is woken up using try_to_wake_up_local() which is
>   simplified ttwu which is called under rq lock and can only wake up
>   local tasks.

This changelog seems to lack explanation for why you need the wakeup
callback.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq()
  2010-05-31  8:01   ` Peter Zijlstra
@ 2010-05-31  9:48     ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-31  9:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel

Hello,

On 05/31/2010 10:01 AM, Peter Zijlstra wrote:
> On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
>> If called after sched_class chooses a CPU which isn't in a task's
>> cpus_allowed mask, select_fallback_rq() can end up migrating a task
>> which is bound to an !active but online cpu to an active cpu.  This is
>> dangerous because active is cleared before CPU_DOWN_PREPARE is called
>> and subsystems expect affinities of kthreads and other tasks to be
>> maintained till their CPU_DOWN_PREPARE callbacks are complete.
> 
> So 6ad4c188 (sched: Fix balance vs hotplug race) moved it that early
> because it was done too late.
> 
> Could we not instead do it explicitly after CPU_DOWN_PREPARE? It would
> of course mean removing the partition_sched_domain() and
> generate_sched_domains() calls from these callbacks and doing it
> explicitly.
> 
> So we need to do it before we take the CPU down, but I think we can do
> it after DOWN_PREPARE.

Hmm.... yeah, I suppose that would be cleaner.  Maybe doing it from
the lowest priority DOWN_PREPARE notifier would do the trick.  I'll
give it a shot.

>> Consult cpu_online_mask instead.
>>
>> This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
>> callback creates the trustee kthread and kthread_bind()s it to the
>> target cpu, and the trustee is expected to run on that cpu.
> 
> It doesn't explain wth a trustee kthread is, which pretty much renders
> the whole paragraph useless.

Come on.  The paragraph makes perfect sense even if you remove the
words "trustee" completely.  It's saying that it's creating a kthread
CPU_DOWN_PREPARE and binds it to the cpu and expects it to stay there.
If you're not completely comfortable with the use of "trustee" without
proper introduction, I'll be happy to remove it, but limited amount of
forward reference is helpful when searching the history backwards.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31  8:01   ` Peter Zijlstra
@ 2010-05-31  9:55     ` Tejun Heo
  2010-05-31 10:01       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-31  9:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

Hello,

On 05/31/2010 10:01 AM, Peter Zijlstra wrote:
> On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
>> Concurrency managed workqueue needs to be able to migrate tasks to a
>> cpu which is online but !active for the following two purposes.
>>
>> p1. To guarantee forward progress during cpu down sequence.  Each
>>     workqueue which could be depended upon during memory allocation
>>     has an emergency worker task which is summoned when a pending work
>>     on such workqueue can't be serviced immediately.  cpu hotplug
>>     callbacks expect workqueues to work during cpu down sequence
>>     (usually so that they can flush them), so, to guarantee forward
>>     progress, it should be possible to summon emergency workers to
>>     !active but online cpus.
> 
> If we do the thing suggested in the previous patch, that is move
> clearing active and rebuilding the sched domains until right after
> DOWN_PREPARE, this goes away, right?

Hmmm... yeah, if the usual set_cpus_allowed_ptr() keeps working
throughout CPU_DOWN_PREPARE, this probably goes away.  I'll give it a
shot.

>> p2. To migrate back unbound workers when a cpu comes back online.
>>     When a cpu goes down, existing workers are unbound from the cpu
>>     and allowed to run on other cpus if there still are pending or
>>     running works.  If the cpu comes back online while those workers
>>     are still around, those workers are migrated back and re-bound to
>>     the cpu.  This isn't strictly required for correctness as long as
>>     those unbound workers don't execute works which are newly
>>     scheduled after the cpu comes back online; however, migrating back
>>     the workers has the advantage of making the behavior more
>>     consistent thus avoiding surprises which are difficult to expect
>>     and reproduce, and being actually cleaner and easier to implement.
> 
> I still don't like this much, if you mark these tasks to simply die when
> the queue is exhausted, and flush the queue explicitly on
> CPU_UP_PREPARE, you should never need to do this.

I don't think flushing from CPU_UP_PREPARE would be a good idea.
There is no guarantee that those works will finish in short (human
scale) time, but we can update cpu_active mask before other
CPU_UP_PREPARE notifiers are executed so that it's symmetrical to cpu
down path and then this problem goes away the exact same way, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] sched: add hooks for workqueue
  2010-05-31  8:01   ` Peter Zijlstra
@ 2010-05-31  9:58     ` Tejun Heo
  2010-05-31 10:05       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-31  9:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Mike Galbraith

Hello, Peter.

On 05/31/2010 10:01 AM, Peter Zijlstra wrote:
> On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
>> Concurrency managed workqueue needs to know when workers are going to
>> sleep and waking up, and, when a worker goes to sleep, be able to wake
>> up another worker to maintain adequate concurrency.  This patch
>> introduces PF_WQ_WORKER to identify workqueue workers and adds the
>> following two hooks.
>>
>> * wq_worker_waking_up(): called when a worker is woken up.
>>
>> * wq_worker_sleeping(): called when a worker is going to sleep and may
>>   return a pointer to a local task which should be woken up.  The
>>   returned task is woken up using try_to_wake_up_local() which is
>>   simplified ttwu which is called under rq lock and can only wake up
>>   local tasks.
> 
> This changelog seems to lack explanation for why you need the wakeup
> callback.

Because cmwq "needs to know when workers are going to sleep and waking
up, and, when a worker goes to sleep, be able to wake up another
worker to maintain adequate concurrency".  Sure, I can add more but
again forward reference here would work pretty well when digging
through history.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31  9:55     ` Tejun Heo
@ 2010-05-31 10:01       ` Peter Zijlstra
  2010-05-31 10:02         ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31 10:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On Mon, 2010-05-31 at 11:55 +0200, Tejun Heo wrote:
> I don't think flushing from CPU_UP_PREPARE would be a good idea.
> There is no guarantee that those works will finish in short (human
> scale) time,

If not, we should cure it by ensuring these works won't be left
lingering I think.

Also, I really don't much care about how fast we can hotplug cycle
things -- its an utter slow path.

>  but we can update cpu_active mask before other
> CPU_UP_PREPARE notifiers are executed so that it's symmetrical to cpu
> down path and then this problem goes away the exact same way, right?

Ah, no, we cannot mark it active before its actually up, because at that
time we'll actually try and run stuff on it, which clearly won't work
when its not there to run stuff.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:01       ` Peter Zijlstra
@ 2010-05-31 10:02         ` Peter Zijlstra
  2010-05-31 10:06           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31 10:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On Mon, 2010-05-31 at 12:01 +0200, Peter Zijlstra wrote:
> >  but we can update cpu_active mask before other
> > CPU_UP_PREPARE notifiers are executed so that it's symmetrical to cpu
> > down path and then this problem goes away the exact same way, right?
> 
> Ah, no, we cannot mark it active before its actually up, because at that
> time we'll actually try and run stuff on it, which clearly won't work
> when its not there to run stuff. 

So we should clear it _before_ we go down, and set it _after_ we're up.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] sched: add hooks for workqueue
  2010-05-31  9:58     ` Tejun Heo
@ 2010-05-31 10:05       ` Peter Zijlstra
  2010-05-31 10:07         ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31 10:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Mike Galbraith

On Mon, 2010-05-31 at 11:58 +0200, Tejun Heo wrote:
> Hello, Peter.
> 
> On 05/31/2010 10:01 AM, Peter Zijlstra wrote:
> > On Thu, 2010-05-13 at 12:48 +0200, Tejun Heo wrote:
> >> Concurrency managed workqueue needs to know when workers are going to
> >> sleep and waking up, and, when a worker goes to sleep, be able to wake
> >> up another worker to maintain adequate concurrency.  This patch
> >> introduces PF_WQ_WORKER to identify workqueue workers and adds the
> >> following two hooks.
> >>
> >> * wq_worker_waking_up(): called when a worker is woken up.
> >>
> >> * wq_worker_sleeping(): called when a worker is going to sleep and may
> >>   return a pointer to a local task which should be woken up.  The
> >>   returned task is woken up using try_to_wake_up_local() which is
> >>   simplified ttwu which is called under rq lock and can only wake up
> >>   local tasks.
> > 
> > This changelog seems to lack explanation for why you need the wakeup
> > callback.
> 
> Because cmwq "needs to know when workers are going to sleep and waking
> up, and, when a worker goes to sleep, be able to wake up another
> worker to maintain adequate concurrency".  

That again only explains what you use the sleep hook for, not what you
want to use the wakeup hook for (putting a worker to sleep when there
are now 2 runnable seems like a good use).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:02         ` Peter Zijlstra
@ 2010-05-31 10:06           ` Tejun Heo
  2010-05-31 10:15             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-31 10:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On 05/31/2010 12:02 PM, Peter Zijlstra wrote:
> On Mon, 2010-05-31 at 12:01 +0200, Peter Zijlstra wrote:
>>>  but we can update cpu_active mask before other
>>> CPU_UP_PREPARE notifiers are executed so that it's symmetrical to cpu
>>> down path and then this problem goes away the exact same way, right?
>>
>> Ah, no, we cannot mark it active before its actually up, because at that
>> time we'll actually try and run stuff on it, which clearly won't work
>> when its not there to run stuff. 
> 
> So we should clear it _before_ we go down, and set it _after_ we're up.
> 

Yeah, sure, I misspoke.  I meant CPU_ONLINE not CPU_UP_PREPARE.  So,
we can mark a cpu active before other CPU_ONLINE callbacks are run.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] sched: add hooks for workqueue
  2010-05-31 10:05       ` Peter Zijlstra
@ 2010-05-31 10:07         ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-31 10:07 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Mike Galbraith

On 05/31/2010 12:05 PM, Peter Zijlstra wrote:
> That again only explains what you use the sleep hook for, not what you
> want to use the wakeup hook for (putting a worker to sleep when there
> are now 2 runnable seems like a good use).

Alright, will add that.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:06           ` Tejun Heo
@ 2010-05-31 10:15             ` Peter Zijlstra
  2010-05-31 10:19               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31 10:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On Mon, 2010-05-31 at 12:06 +0200, Tejun Heo wrote:
>  I meant CPU_ONLINE not CPU_UP_PREPARE.  So,
> we can mark a cpu active before other CPU_ONLINE callbacks are run.

OK, that should work.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:15             ` Peter Zijlstra
@ 2010-05-31 10:19               ` Tejun Heo
  2010-05-31 10:46                 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2010-05-31 10:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

Hello,

On 05/31/2010 12:15 PM, Peter Zijlstra wrote:
> On Mon, 2010-05-31 at 12:06 +0200, Tejun Heo wrote:
>>  I meant CPU_ONLINE not CPU_UP_PREPARE.  So,
>> we can mark a cpu active before other CPU_ONLINE callbacks are run.
> 
> OK, that should work.

The only remaining problemw is PF_THREAD_BOUND.  It needs to be set
for wq workers (including rescuers).  Would it be okay to add an
interface which only ignores PF_THREAD_BOUND?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:19               ` Tejun Heo
@ 2010-05-31 10:46                 ` Peter Zijlstra
  2010-05-31 11:47                   ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-05-31 10:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

On Mon, 2010-05-31 at 12:19 +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/31/2010 12:15 PM, Peter Zijlstra wrote:
> > On Mon, 2010-05-31 at 12:06 +0200, Tejun Heo wrote:
> >>  I meant CPU_ONLINE not CPU_UP_PREPARE.  So,
> >> we can mark a cpu active before other CPU_ONLINE callbacks are run.
> > 
> > OK, that should work.
> 
> The only remaining problemw is PF_THREAD_BOUND.  It needs to be set
> for wq workers (including rescuers).  Would it be okay to add an
> interface which only ignores PF_THREAD_BOUND?

So workers _are_ thread bound and don't migrate, so for those I don't
see the problem, that is of course until we have to break affinity when
their CPU goes down, at which point we should clear PF_THREAD_BOUND I
think.

For the rescue thread, why not set PF_THREAD_BOUND when its running
worklets and clear it when done? That way we get extra checking that
people won't migrate it when its not allowed.

Does that work, or did I miss something?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] sched: implement __set_cpus_allowed()
  2010-05-31 10:46                 ` Peter Zijlstra
@ 2010-05-31 11:47                   ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-31 11:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Rusty Russell, Mike Galbraith

Hello,

On 05/31/2010 12:46 PM, Peter Zijlstra wrote:
> So workers _are_ thread bound and don't migrate, so for those I don't
> see the problem, that is of course until we have to break affinity when
> their CPU goes down, at which point we should clear PF_THREAD_BOUND I
> think.
> 
> For the rescue thread, why not set PF_THREAD_BOUND when its running
> worklets and clear it when done? That way we get extra checking that
> people won't migrate it when its not allowed.
> 
> Does that work, or did I miss something?

The rescuers migrate themselves so should be okay.  I still think it
would be better to migrate back detached workers.  It's a slow path
for sure but there's a big gaping hole between machine-slow and
human-slow and works can be human-slow (ie. hotplug operation not
finished in several minutes).  I'll see if that's implementable w/o
adding an additional interface.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/4] sched: refactor try_to_wake_up()
  2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
@ 2010-05-31 18:56 ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme
  Cc: Tejun Heo, Mike Galbraith

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating the same object code as before or slightly different one
due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Mike Galbraith suggested rename to ttwu_post_activation() from
ttwu_woken_up() and comment update in try_to_wake_up().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   83 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 80a806e..d54f063 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2321,11 +2321,52 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local,
+				 unsigned long en_flags)
+{
+	schedstat_inc(p, se.statistics.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.statistics.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.statistics.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+
+	activate_task(rq, p, en_flags);
+}
+
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+					int wake_flags, bool success)
+{
+	trace_sched_wakeup(p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_woken)
+		p->sched_class->task_woken(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
+ * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2333,7 +2374,8 @@ static void update_avg(u64 *avg, u64 sample)
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2413,38 +2455,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
-	activate_task(rq, p, en_flags);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu, en_flags);
 	success = 1;
-
 out_running:
-	trace_sched_wakeup(p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_woken)
-		p->sched_class->task_woken(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_post_activation(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2010-05-31 18:58 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-13 10:48 [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
2010-05-13 10:48 ` [PATCH 1/4] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
2010-05-31  8:01   ` Peter Zijlstra
2010-05-31  9:48     ` Tejun Heo
2010-05-13 10:48 ` [PATCH 2/4] sched: implement __set_cpus_allowed() Tejun Heo
2010-05-31  8:01   ` Peter Zijlstra
2010-05-31  9:55     ` Tejun Heo
2010-05-31 10:01       ` Peter Zijlstra
2010-05-31 10:02         ` Peter Zijlstra
2010-05-31 10:06           ` Tejun Heo
2010-05-31 10:15             ` Peter Zijlstra
2010-05-31 10:19               ` Tejun Heo
2010-05-31 10:46                 ` Peter Zijlstra
2010-05-31 11:47                   ` Tejun Heo
2010-05-13 10:48 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo
2010-05-13 10:48 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
2010-05-31  8:01   ` Peter Zijlstra
2010-05-31  9:58     ` Tejun Heo
2010-05-31 10:05       ` Peter Zijlstra
2010-05-31 10:07         ` Tejun Heo
2010-05-17 23:13 ` [PATCHSET sched/core] sched: prepare for cmwq Tejun Heo
2010-05-21 13:25   ` Tejun Heo
2010-05-23  9:05     ` Ingo Molnar
2010-05-23  9:08       ` Tejun Heo
2010-05-23  9:13         ` Ingo Molnar
2010-05-23  9:24           ` Tejun Heo
2010-05-23 10:20             ` Ingo Molnar
2010-05-23 10:26               ` Tejun Heo
2010-05-27  8:26                 ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
2010-05-31 18:56 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).