* [PATCHSET sched/core] sched: prepare for cmwq, take#2
@ 2010-05-31 18:56 Tejun Heo
  2010-05-31 18:56 ` [PATCH 1/4] sched: define and use CPU_PRI_* enums for cpu notifier priorities Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme

Hello,

This is the second take of the prepare-for-cmwq patchset.  Changes from
the first take[L] are

* consult-online-mask-instead-of-active-in-select_fallback_rq and
  implement-__set_cpus_allowed have been replaced by updates to when
  cpu_active and cpuset configuration are updated, as suggested by Peter.

* Patch description updated in 0004 as suggested by Peter.

 0001-sched-define-and-use-CPU_PRI_-enums-for-cpu-notifier.patch
 0002-sched-adjust-when-cpu_active-and-cpuset-configuratio.patch
 0003-sched-refactor-try_to_wake_up.patch
 0004-sched-add-hooks-for-workqueue.patch

This patchset is available in the following git tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq

and contains the following changes.

 include/linux/cpu.h        |   25 ++++++
 include/linux/perf_event.h |    2 
 include/linux/sched.h      |    1 
 kernel/cpu.c               |    6 -
 kernel/cpuset.c            |   46 +++++++-----
 kernel/fork.c              |    2 
 kernel/sched.c             |  169 ++++++++++++++++++++++++++++++++++-----------
 kernel/workqueue_sched.h   |   16 ++++
 8 files changed, 203 insertions(+), 64 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/984967


* [PATCH 1/4] sched: define and use CPU_PRI_* enums for cpu notifier priorities
  2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
@ 2010-05-31 18:56 ` Tejun Heo
  2010-05-31 18:56 ` [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme
  Cc: Tejun Heo, Peter Zijlstra

Instead of hardcoding priority 10 and 20 in sched and perf, collect
them into CPU_PRI_* enums.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 include/linux/cpu.h        |    9 +++++++++
 include/linux/perf_event.h |    2 +-
 kernel/sched.c             |    2 +-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index e287863..2d90738 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -48,6 +48,15 @@ extern ssize_t arch_cpu_release(const char *, size_t);
 #endif
 struct notifier_block;
 
+/*
+ * CPU notifier priorities.
+ */
+enum {
+	/* migration should happen before other stuff but after perf */
+	CPU_PRI_PERF		= 20,
+	CPU_PRI_MIGRATION	= 10,
+};
+
 #ifdef CONFIG_SMP
 /* Need to know about CPUs going up/down? */
 #if defined(CONFIG_HOTPLUG_CPU) || !defined(MODULE)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c8e3754..11a84f9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -990,7 +990,7 @@ static inline void perf_event_disable(struct perf_event *event)		{ }
 #define perf_cpu_notifier(fn)					\
 do {								\
 	static struct notifier_block fn##_nb __cpuinitdata =	\
-		{ .notifier_call = fn, .priority = 20 };	\
+		{ .notifier_call = fn, .priority = CPU_PRI_PERF }; \
 	fn(&fn##_nb, (unsigned long)CPU_UP_PREPARE,		\
 		(void *)(unsigned long)smp_processor_id());	\
 	fn(&fn##_nb, (unsigned long)CPU_STARTING,		\
diff --git a/kernel/sched.c b/kernel/sched.c
index b531d79..7f07048 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5837,7 +5837,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
  */
 static struct notifier_block __cpuinitdata migration_notifier = {
 	.notifier_call = migration_call,
-	.priority = 10
+	.priority = CPU_PRI_MIGRATION,
 };
 
 static int __init migration_init(void)
-- 
1.6.4.2



* [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
  2010-05-31 18:56 ` [PATCH 1/4] sched: define and use CPU_PRI_* enums for cpu notifier priorities Tejun Heo
@ 2010-05-31 18:56 ` Tejun Heo
  2010-06-01 15:57   ` Peter Zijlstra
  2010-05-31 18:56 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo
  2010-05-31 18:56 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
  3 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme
  Cc: Tejun Heo, Peter Zijlstra, Paul Menage

Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier.  When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration is again updated from the
same cpu notifier.

For cpu notifiers, this presents an inconsistent state.  Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no longer active.

Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
The down path is updated similarly.  This guarantees that all other
cpu notifiers see consistent cpu_active and cpuset configuration.

This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, a hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.
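
For illustration, this is roughly the kind of CPU_DOWN_PREPARE callback
that hits the problem (a hypothetical sketch, not code from this series;
the callback and thread-function names are made up):

static int example_down_prepare(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;
	struct task_struct *k;

	if ((action & ~CPU_TASKS_FROZEN) != CPU_DOWN_PREPARE)
		return NOTIFY_DONE;

	/* create a helper thread and pin it to the cpu going down */
	k = kthread_create(example_thread_fn, hcpu, "example/%u", cpu);
	if (IS_ERR(k))
		return NOTIFY_BAD;
	kthread_bind(k, cpu);

	/*
	 * Without this patch, cpu_active is already clear at this point,
	 * so the wakeup below may place the thread on some other cpu
	 * despite the binding above.
	 */
	wake_up_process(k);
	return NOTIFY_OK;
}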

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Menage <menage@google.com>
---
 include/linux/cpu.h |   16 ++++++++++++++++
 kernel/cpu.c        |    6 ------
 kernel/cpuset.c     |   46 ++++++++++++++++++++++++++++------------------
 kernel/sched.c      |   31 ++++++++++++++++++++++++++++++-
 4 files changed, 74 insertions(+), 25 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 2d90738..de6b172 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -52,6 +52,22 @@ struct notifier_block;
  * CPU notifier priorities.
  */
 enum {
+	/*
+	 * SCHED_ACTIVE marks a cpu which is coming up active during
+	 * CPU_ONLINE and CPU_DOWN_FAILED and must be the first
+	 * notifier.  CPUSET_ACTIVE adjusts cpuset according to
+	 * cpu_active mask right after SCHED_ACTIVE.  During
+	 * CPU_DOWN_PREPARE, SCHED_INACTIVE and CPUSET_INACTIVE are
+	 * ordered in the similar way.
+	 *
+	 * This ordering guarantees consistent cpu_active mask and
+	 * migration behavior to all cpu notifiers.
+	 */
+	CPU_PRI_SCHED_ACTIVE	= INT_MAX,
+	CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
+	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
+	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
+
 	/* migration should happen before other stuff but after perf */
 	CPU_PRI_PERF		= 20,
 	CPU_PRI_MIGRATION	= 10,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5457775..69d25a7 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -211,12 +211,9 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 		return -EINVAL;
 
 	cpu_hotplug_begin();
-	set_cpu_active(cpu, false);
 	err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
 					hcpu, -1, &nr_calls);
 	if (err == NOTIFY_BAD) {
-		set_cpu_active(cpu, true);
-
 		nr_calls--;
 		__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					  hcpu, nr_calls, NULL);
@@ -228,7 +225,6 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 
 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
-		set_cpu_active(cpu, true);
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					    hcpu) == NOTIFY_BAD)
@@ -309,8 +305,6 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
 		goto out_notify;
 	BUG_ON(!cpu_online(cpu));
 
-	set_cpu_active(cpu, true);
-
 	/* Now call notifier in preparation. */
 	raw_notifier_call_chain(&cpu_chain, CPU_ONLINE | mod, hcpu);
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9a50c5f..2ad660c 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2071,31 +2071,17 @@ static void scan_for_empty_cpusets(struct cpuset *root)
  * but making no active use of cpusets.
  *
  * This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_online_map on each CPU hotplug (cpuhp) event.
+ * cpu_active_mask on each CPU hotplug (cpuhp) event.
  *
  * Called within get_online_cpus().  Needs to call cgroup_lock()
  * before calling generate_sched_domains().
  */
-static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
-				unsigned long phase, void *unused_cpu)
+static void cpuset_update_active_cpus(void)
 {
 	struct sched_domain_attr *attr;
 	cpumask_var_t *doms;
 	int ndoms;
 
-	switch (phase) {
-	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
-	case CPU_DOWN_PREPARE:
-	case CPU_DOWN_PREPARE_FROZEN:
-	case CPU_DOWN_FAILED:
-	case CPU_DOWN_FAILED_FROZEN:
-		break;
-
-	default:
-		return NOTIFY_DONE;
-	}
-
 	cgroup_lock();
 	mutex_lock(&callback_mutex);
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
@@ -2106,8 +2092,31 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
 
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
+}
 
-	return NOTIFY_OK;
+static int __cpuinit cpuset_cpu_active(struct notifier_block *nfb,
+				       unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_ONLINE:
+	case CPU_DOWN_FAILED:
+		cpuset_update_active_cpus();
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static int __cpuexit cpuset_cpu_inactive(struct notifier_block *nfb,
+					 unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_PREPARE:
+		cpuset_update_active_cpus();
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
@@ -2161,7 +2170,8 @@ void __init cpuset_init_smp(void)
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
 	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
 
-	hotcpu_notifier(cpuset_track_online_cpus, 0);
+	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
+	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
 	hotplug_memory_notifier(cpuset_track_online_nodes, 10);
 
 	cpuset_wq = create_singlethread_workqueue("cpuset");
diff --git a/kernel/sched.c b/kernel/sched.c
index 7f07048..80a806e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5840,17 +5840,46 @@ static struct notifier_block __cpuinitdata migration_notifier = {
 	.priority = CPU_PRI_MIGRATION,
 };
 
+static int __cpuinit sched_cpu_active(struct notifier_block *nfb,
+				      unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_ONLINE:
+	case CPU_DOWN_FAILED:
+		set_cpu_active((long)hcpu, true);
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static int __cpuexit sched_cpu_inactive(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_PREPARE:
+		set_cpu_active((long)hcpu, false);
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
 static int __init migration_init(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
 	int err;
 
-	/* Start one for the boot CPU: */
+	/* Initialize migration for the boot CPU */
 	err = migration_call(&migration_notifier, CPU_UP_PREPARE, cpu);
 	BUG_ON(err == NOTIFY_BAD);
 	migration_call(&migration_notifier, CPU_ONLINE, cpu);
 	register_cpu_notifier(&migration_notifier);
 
+	/* Register cpu active notifiers */
+	cpu_notifier(sched_cpu_active, CPU_PRI_SCHED_ACTIVE);
+	cpu_notifier(sched_cpu_inactive, CPU_PRI_SCHED_INACTIVE);
+
 	return 0;
 }
 early_initcall(migration_init);
-- 
1.6.4.2



* [PATCH 3/4] sched: refactor try_to_wake_up()
  2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
  2010-05-31 18:56 ` [PATCH 1/4] sched: define and use CPU_PRI_* enums for cpu notifier priorities Tejun Heo
  2010-05-31 18:56 ` [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining Tejun Heo
@ 2010-05-31 18:56 ` Tejun Heo
  2010-05-31 18:56 ` [PATCH 4/4] sched: add hooks for workqueue Tejun Heo
  3 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme
  Cc: Tejun Heo, Mike Galbraith

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating either the same object code as before or a slightly
different one due to different register assignment.

This is to help the future implementation of try_to_wake_up_local().

Mike Galbraith suggested renaming ttwu_woken_up() to
ttwu_post_activation() and updating the comment in try_to_wake_up().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   83 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 80a806e..d54f063 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2321,11 +2321,52 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local,
+				 unsigned long en_flags)
+{
+	schedstat_inc(p, se.statistics.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.statistics.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.statistics.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+
+	activate_task(rq, p, en_flags);
+}
+
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+					int wake_flags, bool success)
+{
+	trace_sched_wakeup(p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_woken)
+		p->sched_class->task_woken(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
+ * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2333,7 +2374,8 @@ static void update_avg(u64 *avg, u64 sample)
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2413,38 +2455,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
-	activate_task(rq, p, en_flags);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu, en_flags);
 	success = 1;
-
 out_running:
-	trace_sched_wakeup(p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_woken)
-		p->sched_class->task_woken(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_post_activation(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
-- 
1.6.4.2



* [PATCH 4/4] sched: add hooks for workqueue
  2010-05-31 18:56 [PATCHSET sched/core] sched: prepare for cmwq, take#2 Tejun Heo
                   ` (2 preceding siblings ...)
  2010-05-31 18:56 ` [PATCH 3/4] sched: refactor try_to_wake_up() Tejun Heo
@ 2010-05-31 18:56 ` Tejun Heo
  3 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-05-31 18:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, rusty, paulus, acme
  Cc: Tejun Heo, Mike Galbraith

Concurrency managed workqueue needs to know when workers are going to
sleep and when they wake up.  Using these two hooks, cmwq keeps track
of the current concurrency level, throttles execution of new work items
if it's too high, and wakes up another worker from the sleep hook if it
becomes too low.

This patch introduces PF_WQ_WORKER to identify workqueue workers and
adds the following two hooks.

* wq_worker_waking_up(): called when a worker is woken up.

* wq_worker_sleeping(): called when a worker is going to sleep.  It may
  return a pointer to a local task which should be woken up.  The
  returned task is woken up using try_to_wake_up_local(), a simplified
  version of ttwu which is called with the rq lock held and can only
  wake up local tasks.

Both hooks are currently defined as noops in kernel/workqueue_sched.h.
A later cmwq patch will replace them with proper implementations; a
rough sketch of what those might look like follows below.

These hooks are hard coded as they'll always be enabled.
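
Purely for illustration, a hypothetical sketch of such implementations;
the per-cpu wq_nr_running counter and the first_idle_worker() helper
are made-up names, not part of this patch:

/* per-cpu count of runnable workers, maintained by cmwq (assumed) */
static DEFINE_PER_CPU(atomic_t, wq_nr_running);

void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
{
	/* one more worker is runnable on this cpu */
	atomic_inc(&per_cpu(wq_nr_running, cpu));
}

struct task_struct *wq_worker_sleeping(struct task_struct *task,
				       unsigned int cpu)
{
	/*
	 * The last runnable worker on this cpu is going to sleep.  Hand
	 * back an idle worker for the scheduler to wake up so the cpu
	 * doesn't stall on pending work.
	 */
	if (atomic_dec_and_test(&per_cpu(wq_nr_running, cpu)))
		return first_idle_worker(cpu);	/* may be NULL */
	return NULL;
}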

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h    |    1 +
 kernel/fork.c            |    2 +-
 kernel/sched.c           |   53 ++++++++++++++++++++++++++++++++++++++++++++-
 kernel/workqueue_sched.h |   16 +++++++++++++
 4 files changed, 69 insertions(+), 3 deletions(-)
 create mode 100644 kernel/workqueue_sched.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dfea405..02f960a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1702,6 +1702,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
+#define PF_WQ_WORKER	0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC	0x00000040	/* forked but didn't exec */
 #define PF_MCE_PROCESS  0x00000080      /* process policy on mce errors */
 #define PF_SUPERPRIV	0x00000100	/* used super-user privileges */
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..1440f90 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,7 +900,7 @@ static void copy_flags(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long new_flags = p->flags;
 
-	new_flags &= ~PF_SUPERPRIV;
+	new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
 	new_flags |= PF_FORKNOEXEC;
 	new_flags |= PF_STARTING;
 	p->flags = new_flags;
diff --git a/kernel/sched.c b/kernel/sched.c
index d54f063..374a9a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -77,6 +77,7 @@
 #include <asm/irq_regs.h>
 
 #include "sched_cpupri.h"
+#include "workqueue_sched.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
@@ -2360,6 +2361,9 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	/* if a worker is waking up, notify workqueue */
+	if ((p->flags & PF_WQ_WORKER) && success)
+		wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /**
@@ -2468,6 +2472,37 @@ out:
 }
 
 /**
+ * try_to_wake_up_local - try to wake up a local task with rq lock held
+ * @p: the thread to be awakened
+ *
+ * Put @p on the run-queue if it's not already there.  The caller must
+ * ensure that this_rq() is locked, @p is bound to this_rq() and not
+ * the current task.  this_rq() stays locked over invocation.
+ */
+static void try_to_wake_up_local(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	BUG_ON(rq != this_rq());
+	BUG_ON(p == current);
+	lockdep_assert_held(&rq->lock);
+
+	if (!(p->state & TASK_NORMAL))
+		return;
+
+	if (!p->se.on_rq) {
+		if (likely(!task_running(rq, p))) {
+			schedstat_inc(rq, ttwu_count);
+			schedstat_inc(rq, ttwu_local);
+		}
+		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+		success = true;
+	}
+	ttwu_post_activation(p, rq, 0, success);
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -3672,10 +3707,24 @@ need_resched_nonpreemptible:
 	clear_tsk_need_resched(prev);
 
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
-		if (unlikely(signal_pending_state(prev->state, prev)))
+		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
-		else
+		} else {
+			/*
+			 * If a worker is going to sleep, notify and
+			 * ask workqueue whether it wants to wake up a
+			 * task to maintain concurrency.  If so, wake
+			 * up the task.
+			 */
+			if (prev->flags & PF_WQ_WORKER) {
+				struct task_struct *to_wakeup;
+
+				to_wakeup = wq_worker_sleeping(prev, cpu);
+				if (to_wakeup)
+					try_to_wake_up_local(to_wakeup);
+			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
new file mode 100644
index 0000000..af040ba
--- /dev/null
+++ b/kernel/workqueue_sched.h
@@ -0,0 +1,16 @@
+/*
+ * kernel/workqueue_sched.h
+ *
+ * Scheduler hooks for concurrency managed workqueue.  Only to be
+ * included from sched.c and workqueue.c.
+ */
+static inline void wq_worker_waking_up(struct task_struct *task,
+				       unsigned int cpu)
+{
+}
+
+static inline struct task_struct *wq_worker_sleeping(struct task_struct *task,
+						     unsigned int cpu)
+{
+	return NULL;
+}
-- 
1.6.4.2



* Re: [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-05-31 18:56 ` [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining Tejun Heo
@ 2010-06-01 15:57   ` Peter Zijlstra
  2010-06-01 16:58     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-06-01 15:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On Mon, 2010-05-31 at 20:56 +0200, Tejun Heo wrote:
> Currently, when a cpu goes down, cpu_active is cleared before
> CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
> default priority cpu notifier.  When a cpu is coming up, it's set
> before CPU_ONLINE but cpuset configuration again is updated from the
> same cpu notifier.
> 
> For cpu notifiers, this presents an inconsistent state.  Threads which
> a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
> migrated to other cpus because the cpu is no more inactive.
> 
> Fix it by updating cpu_active in the highest priority cpu notifier and
> cpuset configuration in the second highest when a cpu is coming up.
> Down path is updated similarly.  This guarantees that all other cpu
> notifiers see consistent cpu_active and cpuset configuration.
> 
> This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
> callback creates a kthread and kthread_bind()s it to the target cpu,
> and the thread is expected to run on that cpu.

I know we all love notifier lists, but doesn't the below code get lots
more readable if we don't play tricks with notifier priorities and
simply hardcode the few (perf/sched/cpuset) callbacks into the hotplug
paths?

Also, I'm afraid you've now inverted the relation between
cpu_active_mask and partition_sched_domains().

You need to first set/clear the active mask, then rebuild the domain.
But with your patch partition_sched_domains() gets called in the regular
DOWN_PREPARE path, while we only clear active at the very end, which
means we build the wrong domains.




* Re: [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-01 15:57   ` Peter Zijlstra
@ 2010-06-01 16:58     ` Tejun Heo
  2010-06-01 17:20       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2010-06-01 16:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

Hello,

On 06/01/2010 05:57 PM, Peter Zijlstra wrote:
> I know we all love notifier lists, but doesn't the below code get lots
> more readable if we don't play tricks with notifier priorities and
> simply hardcode the few (perf/sched/cpuset) callbacks into the hotplug
> paths?

Maybe, maybe not.  In this case, I kind of like that the failure case
doesn't have to be explicitly rolled back, but if you prefer hard
coding that's fine too.

> Also, I'm afraid you've now inverted the relation between
> cpu_active_mask and parition_sched_domains().
>
> You need to first set/clear the active mask, then rebuild the domain.
> But with your patch parition_sched_domains() gets called in the regular
> DOWN_PREPARE path, while we only clear active at the very end, which
> means we build the wrong domains.

Ah, right, I forgot about that.  So the things that need to be ordered
are the cpu_active mask update, cpuset configuration and sched domains,
right?

Thanks.

-- 
tejun


* Re: [PATCH 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-01 16:58     ` Tejun Heo
@ 2010-06-01 17:20       ` Peter Zijlstra
  2010-06-02 16:03         ` [PATCH UPDATED " Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-06-01 17:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On Tue, 2010-06-01 at 18:58 +0200, Tejun Heo wrote:
> Hello,
> 
> On 06/01/2010 05:57 PM, Peter Zijlstra wrote:
> > I know we all love notifier lists, but doesn't the below code get lots
> > more readable if we don't play tricks with notifier priorities and
> > simply hardcode the few (perf/sched/cpuset) callbacks into the hotplug
> > paths?
> 
> Maybe, maybe not.  In this case, I kind of like that sleep failure
> case doesn't have to be explicitly rolled back but if you like hard
> coding that's fine too.

Hurm, yeah that rollback might make things messy indeed.

> > Also, I'm afraid you've now inverted the relation between
> > cpu_active_mask and parition_sched_domains().
> >
> > You need to first set/clear the active mask, then rebuild the domain.
> > But with your patch parition_sched_domains() gets called in the regular
> > DOWN_PREPARE path, while we only clear active at the very end, which
> > means we build the wrong domains.
> 
> Ah, right forgot about that.  So, the things that need to be ordered
> are cpu_active mask update, cpuset configuration and sched domains,
> right?

Right, I think that should cover it.


* [PATCH UPDATED 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-01 17:20       ` Peter Zijlstra
@ 2010-06-02 16:03         ` Tejun Heo
  2010-06-04  7:43           ` Tejun Heo
  2010-06-04 11:58           ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Tejun Heo @ 2010-06-02 16:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier.  When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration is again updated from the
same cpu notifier.

For cpu notifiers, this presents an inconsistent state.  Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no longer active.

Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
The down path is updated similarly.  This guarantees that all other
cpu notifiers see consistent cpu_active and cpuset configuration.

The cpuset_track_online_cpus() notifier is converted to
cpuset_update_active_cpus() which just updates the configuration and
is now called from the cpuset_cpu_[in]active() notifiers registered
from sched_init_smp().  If cpuset is disabled,
cpuset_update_active_cpus() degenerates into partition_sched_domains(),
making a separate notifier for !CONFIG_CPUSETS unnecessary.

This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, a hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Menage <menage@google.com>
---
Okay, how about this one?  Using notifiers seems better for the
following reasons.

* Rollback on failure.

* cpuset/sched_domain don't expect to be called before smp
  configuration is complete.  If hardcoded into cpu_up/down(),
  condition checks would need to be added so that they're skipped while
  the system is bringing up the cpus for the first time.

Works fine here w/ CPUSET enabled and disabled.

Thanks.

 include/linux/cpu.h    |   16 +++++++++++
 include/linux/cpuset.h |    6 ++++
 kernel/cpu.c           |    6 ----
 kernel/cpuset.c        |   21 +--------------
 kernel/sched.c         |   67 ++++++++++++++++++++++++++++++++++++-------------
 5 files changed, 74 insertions(+), 42 deletions(-)

Index: work/include/linux/cpu.h
===================================================================
--- work.orig/include/linux/cpu.h
+++ work/include/linux/cpu.h
@@ -52,6 +52,22 @@ struct notifier_block;
  * CPU notifier priorities.
  */
 enum {
+	/*
+	 * SCHED_ACTIVE marks a cpu which is coming up active during
+	 * CPU_ONLINE and CPU_DOWN_FAILED and must be the first
+	 * notifier.  CPUSET_ACTIVE adjusts cpuset according to
+	 * cpu_active mask right after SCHED_ACTIVE.  During
+	 * CPU_DOWN_PREPARE, SCHED_INACTIVE and CPUSET_INACTIVE are
+	 * ordered in the similar way.
+	 *
+	 * This ordering guarantees consistent cpu_active mask and
+	 * migration behavior to all cpu notifiers.
+	 */
+	CPU_PRI_SCHED_ACTIVE	= INT_MAX,
+	CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
+	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
+	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
+
 	/* migration should happen before other stuff but after perf */
 	CPU_PRI_PERF		= 20,
 	CPU_PRI_MIGRATION	= 10,
Index: work/kernel/cpu.c
===================================================================
--- work.orig/kernel/cpu.c
+++ work/kernel/cpu.c
@@ -211,12 +211,9 @@ static int __ref _cpu_down(unsigned int
 		return -EINVAL;

 	cpu_hotplug_begin();
-	set_cpu_active(cpu, false);
 	err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
 					hcpu, -1, &nr_calls);
 	if (err == NOTIFY_BAD) {
-		set_cpu_active(cpu, true);
-
 		nr_calls--;
 		__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					  hcpu, nr_calls, NULL);
@@ -228,7 +225,6 @@ static int __ref _cpu_down(unsigned int

 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
-		set_cpu_active(cpu, true);
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					    hcpu) == NOTIFY_BAD)
@@ -309,8 +305,6 @@ static int __cpuinit _cpu_up(unsigned in
 		goto out_notify;
 	BUG_ON(!cpu_online(cpu));

-	set_cpu_active(cpu, true);
-
 	/* Now call notifier in preparation. */
 	raw_notifier_call_chain(&cpu_chain, CPU_ONLINE | mod, hcpu);

Index: work/kernel/cpuset.c
===================================================================
--- work.orig/kernel/cpuset.c
+++ work/kernel/cpuset.c
@@ -2071,31 +2071,17 @@ static void scan_for_empty_cpusets(struc
  * but making no active use of cpusets.
  *
  * This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_online_map on each CPU hotplug (cpuhp) event.
+ * cpu_active_mask on each CPU hotplug (cpuhp) event.
  *
  * Called within get_online_cpus().  Needs to call cgroup_lock()
  * before calling generate_sched_domains().
  */
-static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
-				unsigned long phase, void *unused_cpu)
+void cpuset_update_active_cpus(void)
 {
 	struct sched_domain_attr *attr;
 	cpumask_var_t *doms;
 	int ndoms;

-	switch (phase) {
-	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
-	case CPU_DOWN_PREPARE:
-	case CPU_DOWN_PREPARE_FROZEN:
-	case CPU_DOWN_FAILED:
-	case CPU_DOWN_FAILED_FROZEN:
-		break;
-
-	default:
-		return NOTIFY_DONE;
-	}
-
 	cgroup_lock();
 	mutex_lock(&callback_mutex);
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
@@ -2106,8 +2092,6 @@ static int cpuset_track_online_cpus(stru

 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
-
-	return NOTIFY_OK;
 }

 #ifdef CONFIG_MEMORY_HOTPLUG
@@ -2161,7 +2145,6 @@ void __init cpuset_init_smp(void)
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
 	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

-	hotcpu_notifier(cpuset_track_online_cpus, 0);
 	hotplug_memory_notifier(cpuset_track_online_nodes, 10);

 	cpuset_wq = create_singlethread_workqueue("cpuset");
Index: work/kernel/sched.c
===================================================================
--- work.orig/kernel/sched.c
+++ work/kernel/sched.c
@@ -5840,17 +5840,46 @@ static struct notifier_block __cpuinitda
 	.priority = CPU_PRI_MIGRATION,
 };

+static int __cpuinit sched_cpu_active(struct notifier_block *nfb,
+				      unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_ONLINE:
+	case CPU_DOWN_FAILED:
+		set_cpu_active((long)hcpu, true);
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static int __cpuexit sched_cpu_inactive(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_PREPARE:
+		set_cpu_active((long)hcpu, false);
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
 static int __init migration_init(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
 	int err;

-	/* Start one for the boot CPU: */
+	/* Initialize migration for the boot CPU */
 	err = migration_call(&migration_notifier, CPU_UP_PREPARE, cpu);
 	BUG_ON(err == NOTIFY_BAD);
 	migration_call(&migration_notifier, CPU_ONLINE, cpu);
 	register_cpu_notifier(&migration_notifier);

+	/* Register cpu active notifiers */
+	cpu_notifier(sched_cpu_active, CPU_PRI_SCHED_ACTIVE);
+	cpu_notifier(sched_cpu_inactive, CPU_PRI_SCHED_INACTIVE);
+
 	return 0;
 }
 early_initcall(migration_init);
@@ -7309,29 +7338,35 @@ int __init sched_create_sysfs_power_savi
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */

-#ifndef CONFIG_CPUSETS
 /*
- * Add online and remove offline CPUs from the scheduler domains.
- * When cpusets are enabled they take over this function.
+ * Update cpusets according to cpu_active mask.  If cpusets are
+ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
+ * around partition_sched_domains().
  */
-static int update_sched_domains(struct notifier_block *nfb,
-				unsigned long action, void *hcpu)
+static int __cpuinit cpuset_cpu_active(struct notifier_block *nfb,
+				       unsigned long action, void *hcpu)
 {
-	switch (action) {
+	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
-	case CPU_DOWN_PREPARE:
-	case CPU_DOWN_PREPARE_FROZEN:
 	case CPU_DOWN_FAILED:
-	case CPU_DOWN_FAILED_FROZEN:
-		partition_sched_domains(1, NULL, NULL);
+		cpuset_update_active_cpus();
 		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}

+static int __cpuexit cpuset_cpu_inactive(struct notifier_block *nfb,
+					 unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_PREPARE:
+		cpuset_update_active_cpus();
+		return NOTIFY_OK;
 	default:
 		return NOTIFY_DONE;
 	}
 }
-#endif

 static int update_runtime(struct notifier_block *nfb,
 				unsigned long action, void *hcpu)
@@ -7377,10 +7412,8 @@ void __init sched_init_smp(void)
 	mutex_unlock(&sched_domains_mutex);
 	put_online_cpus();

-#ifndef CONFIG_CPUSETS
-	/* XXX: Theoretical race here - CPU may be hotplugged now */
-	hotcpu_notifier(update_sched_domains, 0);
-#endif
+	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
+	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

 	/* RT runtime code needs to handle some hotplug events */
 	hotcpu_notifier(update_runtime, 0);
Index: work/include/linux/cpuset.h
===================================================================
--- work.orig/include/linux/cpuset.h
+++ work/include/linux/cpuset.h
@@ -20,6 +20,7 @@ extern int number_of_cpusets;	/* How man

 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
+extern void cpuset_update_active_cpus(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -96,6 +97,11 @@ static inline void set_mems_allowed(node
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}

+static inline void cpuset_update_active_cpus(void)
+{
+	partition_sched_domains(1, NULL, NULL);
+}
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {


* Re: [PATCH UPDATED 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-02 16:03         ` [PATCH UPDATED " Tejun Heo
@ 2010-06-04  7:43           ` Tejun Heo
  2010-06-04 11:58           ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2010-06-04  7:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On 06/02/2010 06:03 PM, Tejun Heo wrote:
> Okay, how about this one?  Using notifiers seems better for the
> following reasons.
> 
> * Rollback on failure.
> 
> * cpuset/sched_domain don't expect to be called before smp
>   configuration is complete.  If hardcoded into cpu_up/down(),
>   condition checks need to be added so that they're skipped if the
>   system is bringing up the cpus for the first time.
> 
> Works fine here w/ CPUSET enabled and disabled.

Ping. Peter.

-- 
tejun


* Re: [PATCH UPDATED 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-02 16:03         ` [PATCH UPDATED " Tejun Heo
  2010-06-04  7:43           ` Tejun Heo
@ 2010-06-04 11:58           ` Peter Zijlstra
  2010-06-04 12:03             ` Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-06-04 11:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On Wed, 2010-06-02 at 18:03 +0200, Tejun Heo wrote:
> Currently, when a cpu goes down, cpu_active is cleared before
> CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
> default priority cpu notifier.  When a cpu is coming up, it's set
> before CPU_ONLINE but cpuset configuration again is updated from the
> same cpu notifier.
> 
> For cpu notifiers, this presents an inconsistent state.  Threads which
> a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
> migrated to other cpus because the cpu is no more inactive.
> 
> Fix it by updating cpu_active in the highest priority cpu notifier and
> cpuset configuration in the second highest when a cpu is coming up.
> Down path is updated similarly.  This guarantees that all other cpu
> notifiers see consistent cpu_active and cpuset configuration.
> 
> cpuset_track_online_cpus() notifier is converted to
> cpuset_update_active_cpus() which just updates the configuration and
> now called from cpuset_cpu_[in]active() notifiers registered from
> sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
> degenerates into partition_sched_domains() making separate notifier
> for !CONFIG_CPUSETS unnecessary.
> 
> This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
> callback creates a kthread and kthread_bind()s it to the target cpu,
> and the thread is expected to run on that cpu.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Paul Menage <menage@google.com>
> ---
> Okay, how about this one?  Using notifiers seems better for the
> following reasons.
> 
> * Rollback on failure.
> 
> * cpuset/sched_domain don't expect to be called before smp
>   configuration is complete.  If hardcoded into cpu_up/down(),
>   condition checks need to be added so that they're skipped if the
>   system is bringing up the cpus for the first time.
> 
> Works fine here w/ CPUSET enabled and disabled.

OK, this looks good. 

Thanks Tejun.


* Re: [PATCH UPDATED 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-04 11:58           ` Peter Zijlstra
@ 2010-06-04 12:03             ` Tejun Heo
  2010-06-04 12:07               ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2010-06-04 12:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On 06/04/2010 01:58 PM, Peter Zijlstra wrote:
>> * Rollback on failure.
>>
>> * cpuset/sched_domain don't expect to be called before smp
>>   configuration is complete.  If hardcoded into cpu_up/down(),
>>   condition checks need to be added so that they're skipped if the
>>   system is bringing up the cpus for the first time.
>>
>> Works fine here w/ CPUSET enabled and disabled.
> 
> OK, this looks good. 

Cool, can I add your Acked-by's to these and send them towards Ingo?

Thanks.

-- 
tejun


* Re: [PATCH UPDATED 2/4] sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining
  2010-06-04 12:03             ` Tejun Heo
@ 2010-06-04 12:07               ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2010-06-04 12:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, linux-kernel, rusty, paulus, acme, Paul Menage

On Fri, 2010-06-04 at 14:03 +0200, Tejun Heo wrote:
> On 06/04/2010 01:58 PM, Peter Zijlstra wrote:
> >> * Rollback on failure.
> >>
> >> * cpuset/sched_domain don't expect to be called before smp
> >>   configuration is complete.  If hardcoded into cpu_up/down(),
> >>   condition checks need to be added so that they're skipped if the
> >>   system is bringing up the cpus for the first time.
> >>
> >> Works fine here w/ CPUSET enabled and disabled.
> > 
> > OK, this looks good. 
> 
> Cool, can I add your acked-by's to and send them towards Ingo?

Yes.

