* [PATCH] sched/kthread: Complain loudly when others violate our flags
@ 2011-09-27 21:17 Steven Rostedt
2011-09-28 0:22 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-09-27 21:17 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, Peter Zijlstra, Thomas Gleixner
For convenience and optimization, we are going to use the task's
PF_THREAD_BOUND flag to know whether a task is bound to a CPU or not,
as that is what the flag means.
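The contract being described can be modeled in plain userspace C. This is only a sketch: the `PF_THREAD_BOUND` value matches the 3.x `include/linux/sched.h` definition, and the `*_model` functions are hypothetical stand-ins for `kthread_bind()` and `set_cpus_allowed_ptr()`, not kernel code:

```c
#include <assert.h>

#define PF_THREAD_BOUND 0x04000000 /* value as in 3.x <linux/sched.h> */

struct task_model {
	unsigned int flags;
	int cpu;
};

/* Model of kthread_bind(): pin the thread and mark it bound for life. */
static void kthread_bind_model(struct task_model *t, int cpu)
{
	t->cpu = cpu;
	t->flags |= PF_THREAD_BOUND;
}

/* Model of set_cpus_allowed_ptr(): third parties may not move a bound thread. */
static int set_cpus_allowed_model(struct task_model *t, int new_cpu)
{
	if (t->flags & PF_THREAD_BOUND)
		return -1; /* modeled -EINVAL */
	t->cpu = new_cpu;
	return 0;
}
```

Under this reading, once the flag is set the task's CPU never changes again, which is exactly the invariant the RT kernel relies on below.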
In the RT kernel we depend greatly on this meaning, as it is how we
know whether we should manually bind a task to a CPU or not. But I've
spent the last two days hunting down a bug where things were not working
as they should.
Finally, I added a simple patch to mainline (one I think should be
accepted permanently), shown at the bottom of this email, and it
triggered the following:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/sched.c:2236 set_task_cpu+0x137/0x1ba()
Hardware name: Precision WorkStation 470
Modules linked in:
Pid: 3, comm: ksoftirqd/0 Tainted: G W 3.1.0-rc7-test+ #262
Call Trace:
[<ffffffff81051728>] warn_slowpath_common+0x83/0x9b
[<ffffffff8105175a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff8104af0f>] set_task_cpu+0x137/0x1ba
[<ffffffff810824aa>] ? lock_acquire+0x118/0x151
[<ffffffff8104b846>] ? try_to_wake_up+0x2e/0x1db
[<ffffffff8104b91d>] try_to_wake_up+0x105/0x1db
[<ffffffff8103d4ae>] ? complete+0x1e/0x4f
[<ffffffff8104ba05>] default_wake_function+0x12/0x14
[<ffffffff8103c1c6>] __wake_up_common+0x4d/0x83
[<ffffffff8103d4ae>] ? complete+0x1e/0x4f
[<ffffffff8103d4cc>] complete+0x3c/0x4f
[<ffffffff81238789>] blk_end_sync_rq+0x31/0x35
[<ffffffff81238758>] ? blk_rq_map_user+0x210/0x210
[<ffffffff8123408b>] blk_finish_request+0x206/0x238
[<ffffffff815022ec>] ? _raw_spin_lock_irqsave+0x51/0x5c
[<ffffffff812345bf>] ? blk_end_bidi_request+0x32/0x5d
[<ffffffff812345cd>] blk_end_bidi_request+0x40/0x5d
[<ffffffff81234624>] blk_end_request+0x10/0x12
[<ffffffff8131af7b>] scsi_io_completion+0x1dc/0x4d7
[<ffffffff81312f0c>] scsi_finish_command+0xe4/0xed
[<ffffffff8131ace2>] scsi_softirq_done+0x109/0x112
[<ffffffff8123964d>] blk_done_softirq+0x7f/0x93
[<ffffffff81057c4c>] __do_softirq+0x107/0x24a
[<ffffffff81057e4a>] run_ksoftirqd+0xbb/0x1b4
[<ffffffff81057d8f>] ? __do_softirq+0x24a/0x24a
[<ffffffff8106ea8e>] kthread+0x9f/0xa7
[<ffffffff81505977>] ? sub_preempt_count+0x95/0xa8
[<ffffffff8150a084>] kernel_thread_helper+0x4/0x10
[<ffffffff81502e78>] ? retint_restore_args+0x13/0x13
[<ffffffff8106e9ef>] ? __init_kthread_worker+0x5a/0x5a
[<ffffffff8150a080>] ? gs_change+0x13/0x13
---[ end trace 5a5d197966b56a53 ]---
migrating bounded task kworker/u:1:49
I looked at the task that it tried to migrate, and it happened to be the
kworker thread! Then I went into kernel/workqueue.c and found this
nonsense:
	if (bind && !on_unbound_cpu)
		kthread_bind(worker->task, gcwq->cpu);
	else {
		worker->task->flags |= PF_THREAD_BOUND;
		if (on_unbound_cpu)
			worker->flags |= WORKER_UNBOUND;
	}
Nothing but the scheduler and kthread_bind() has the right to set the
PF_THREAD_BOUND flag. Especially when the thread IS NOT BOUNDED!!!!!!
I don't go around and stick my hand down your pants to play with your
flags! Don't stick your hand in ours and play with our flags!
WTF is the workqueue code setting the PF_THREAD_BOUND flag manually?
Talk about fragile coupling! You just made this flag meaningless. Don't
do that.
Sorry but I just wasted two whole days because of this nonsense and I'm
not particularly happy about it.
-- Steve
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
diff --git a/kernel/sched.c b/kernel/sched.c
index ec5f472..682a90c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2233,6 +2233,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	if (task_cpu(p) != new_cpu) {
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+		if (WARN_ON(p->flags & PF_THREAD_BOUND))
+			printk(KERN_WARNING "migrating bounded task %s:%d\n",
+			       p->comm, p->pid);
 	}

 	__set_task_cpu(p, new_cpu);
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Thomas Gleixner @ 2011-09-28 0:22 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Tejun Heo, LKML, Peter Zijlstra

On Tue, 27 Sep 2011, Steven Rostedt wrote:
> I looked at the task that it tried to migrate, and it happened to be the
> kworker thread! Then I went into kernel/workqueue.c and found this
> nonsense:
>
> 	if (bind && !on_unbound_cpu)
> 		kthread_bind(worker->task, gcwq->cpu);
> 	else {
> 		worker->task->flags |= PF_THREAD_BOUND;
> 		if (on_unbound_cpu)
> 			worker->flags |= WORKER_UNBOUND;
> 	}
>
> Nothing but the scheduler and kthread_bind() has the right to set the
> PF_THREAD_BOUND flag. Especially when the thread IS NOT BOUNDED!!!!!!

Yikes, I somehow missed that gem. :(

> I don't go around and stick my hand down your pants to play with your
> flags! Don't stick your hand in ours and play with our flags!
>
> WTF is the workqueue code setting the PF_THREAD_BOUND flag manually?
> Talk about fragile coupling! You just made this flag meaningless. Don't
> do that.

And looking at the hotplug code in that very file is just making me more
nervous about that abuse.

> Sorry but I just wasted two whole days because of this nonsense and I'm
> not particularly happy about it.
>
> -- Steve
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ec5f472..682a90c 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2233,6 +2233,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  	if (task_cpu(p) != new_cpu) {
>  		p->se.nr_migrations++;
>  		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
> +		if (WARN_ON(p->flags & PF_THREAD_BOUND))
> +			printk(KERN_WARNING "migrating bounded task %s:%d\n",
> +			       p->comm, p->pid);

That should be an if (WARN(...)) and, if it triggers, we should just
refuse to migrate the task. So probably set_task_cpu() is not the proper
place for that. Can we move that check one layer up to avoid it reaching
set_task_cpu()?

>  	}
>
>  	__set_task_cpu(p, new_cpu);

And please submit a patch which gets rid of

> 	worker->task->flags |= PF_THREAD_BOUND;

Thanks,

	tglx
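Thomas's suggestion (complain *and* refuse the migration) can be sketched in userspace C. This is a hypothetical model, not kernel code: `WARN_MODEL` is a stand-in for the kernel's `WARN()`, which also returns the tested condition so it can drive control flow:

```c
#include <stdio.h>

#define PF_THREAD_BOUND 0x04000000 /* value as in 3.x <linux/sched.h> */

struct task_model {
	unsigned int flags;
	int cpu;
};

/* Stand-in for the kernel's WARN(cond, fmt, ...): complain, then return
 * the condition so the caller can act on it. */
#define WARN_MODEL(cond, fmt, ...) ({					\
	int __warned = !!(cond);					\
	if (__warned)							\
		fprintf(stderr, fmt, ##__VA_ARGS__);			\
	__warned;							\
})

/* tglx's suggestion, modeled: warn *and* refuse the migration, leaving
 * the bound task where it is instead of only logging the violation. */
static int set_task_cpu_model(struct task_model *p, int new_cpu)
{
	if (WARN_MODEL(p->flags & PF_THREAD_BOUND,
		       "refusing to migrate bound task to cpu %d\n", new_cpu))
		return -1;
	p->cpu = new_cpu;
	return 0;
}
```

The difference from the posted patch is that `WARN_ON()` there only reports the violation while the migration still happens; returning early makes the invariant self-enforcing.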
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-09-30 3:48 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

Hello, Steven. Sorry about the delay, coming back from a rather long
vacation.

On Tue, Sep 27, 2011 at 05:17:34PM -0400, Steven Rostedt wrote:
> I looked at the task that it tried to migrate, and it happened to be the
> kworker thread! Then I went into kernel/workqueue.c and found this
> nonsense:
>
> 	if (bind && !on_unbound_cpu)
> 		kthread_bind(worker->task, gcwq->cpu);
> 	else {
> 		worker->task->flags |= PF_THREAD_BOUND;
> 		if (on_unbound_cpu)
> 			worker->flags |= WORKER_UNBOUND;
> 	}
>
> Nothing but the scheduler and kthread_bind() has the right to set the
> PF_THREAD_BOUND flag. Especially when the thread IS NOT BOUNDED!!!!!!
>
> I don't go around and stick my hand down your pants to play with your
> flags! Don't stick your hand in ours and play with our flags!
>
> WTF is the workqueue code setting the PF_THREAD_BOUND flag manually?
> Talk about fragile coupling! You just made this flag meaningless. Don't
> do that.

IIRC, this was because there was no way to set PF_THREAD_BOUND once a
kthread starts to run, and workers can stay active across a CPU bring
down/up cycle. Per-cpu kthreads need PF_THREAD_BOUND to prevent cpu
affinity manipulation by third parties, for correctness.

> Sorry but I just wasted two whole days because of this nonsense and I'm
> not particularly happy about it.

Sorry that it wasted your time and made you unhappy, but wouldn't
grepping for its usage have been the logical thing to do if you wanted
to add to what it meant? PF_THREAD_BOUND meaning "the task's affinity or
cpuset can't be manipulated by a third party" seems like a valid
interpretation.

Simply removing it would allow breaking workqueue from userland by
manipulating affinity. How about testing PF_WQ_WORKER in
set_cpus_allowed_ptr() (and maybe cpuset, I'm not sure)?

Thanks.

-- tejun
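Tejun's alternative (gate on `PF_WQ_WORKER` instead of `PF_THREAD_BOUND`) can also be sketched as a userspace model. This is an illustration only; the `PF_WQ_WORKER` value matches the 3.x `sched.h` definition, and `set_cpus_allowed_model()` with its `from_user` parameter is a hypothetical stand-in, not the real `set_cpus_allowed_ptr()` signature:

```c
#include <assert.h>

#define PF_WQ_WORKER 0x00000020 /* "I'm a workqueue worker", as in 3.x sched.h */

struct task_model {
	unsigned int flags;
	int cpu;
};

/* Tejun's idea, modeled: reject affinity changes coming from userland
 * for workqueue workers, while kernel-internal callers (including the
 * workqueue code itself) may still move them across hotplug events. */
static int set_cpus_allowed_model(struct task_model *p, int new_cpu,
				  int from_user)
{
	if (from_user && (p->flags & PF_WQ_WORKER))
		return -1; /* modeled -EPERM: hands off worker affinity */
	p->cpu = new_cpu;
	return 0;
}
```

The point of contention in the thread is exactly this split: the flag would protect workers from userland without claiming they are permanently pinned.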
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Steven Rostedt @ 2011-09-30 4:05 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

On Thu, 2011-09-29 at 20:48 -0700, Tejun Heo wrote:
> > WTF is the workqueue code setting the PF_THREAD_BOUND flag manually?
> > Talk about fragile coupling! You just made this flag meaningless. Don't
> > do that.
>
> IIRC, this was because there was no way to set PF_THREAD_BOUND once a
> kthread starts to run and workers can stay active across CPU bring
> down/up cycle. Per-cpu kthreads need PF_THREAD_BOUND to prevent cpu
> affinity manipulation by third party for correctness.

That third party also includes workqueue. It shouldn't mess with that
flag.

> > Sorry but I just wasted two whole days because of this nonsense and I'm
> > not particularly happy about it.
>
> Sorry that it wasted your time and made you unhappy but wouldn't
> grepping for its usage a logical thing if you wanted to add to what it
> meant? PF_THREAD_BOUND meaning the task's affinity or cpuset can't be
> manipulated by third party seems like a valid interpretation.

But you set PF_THREAD_BOUND on tasks that ARE NOT BOUND! That's just
wrong. It's not about "don't touch this affinity"; it's about "this
thread has been pinned to a CPU and will not change for the life of the
thread".

> Simply removing it would allow breaking workqueue from userland by
> manipulating affinity. How about testing PF_WQ_WORKER in
> set_cpus_allowed_ptr() (and maybe cpuset, I'm not sure)?

Do you realize that these threads migrate? If you didn't, it just proves
that you shouldn't touch it.

-- Steve
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-09-30 4:14 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

Hello,

On Fri, Sep 30, 2011 at 12:05:29AM -0400, Steven Rostedt wrote:
> > Simply removing it would allow breaking workqueue from userland by
> > manipulating affinity. How about testing PF_WQ_WORKER in
> > set_cpus_allowed_ptr() (and maybe cpuset, I'm not sure)?
>
> Do you realize that these threads migrate? If you didn't, it just proves
> that you shouldn't touch it.

It's rather complicated due to the way workqueues have been used. You
can't tell whether a work really needs to be pinned to one cpu for
correctness until actual cpu hotplug time, as wq users are expected to
flush such work items from hotplug notifiers, so wq workers should be
pinned to their respective cpus while the cpus are up but also need to
step aside once the unplug notifiers are finished.

So, we had a "don't diddle with my affinity" flag and you want to change
it, which is fine given good enough justification. Let's talk about this
in the other reply.

Thank you.

-- tejun
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-09-30 8:34 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Thu, 2011-09-29 at 21:14 -0700, Tejun Heo wrote:
> It's rather complicated due to the way workqueues have been used. You
> can't tell whether a work really needs to be pinned to one cpu for
> correctness until actual cpu hotplug time as wq users are expected to
> flush such work items from hotplug notifiers, so wq workers should be
> pinned on the respective cpus while the cpus are up but also need to
> step aside once unplug notifiers are finished.

That really is crack, and I've always said so. How about we fix that
utter workqueue unplug nightmare..

---
Subject: workqueue: Fix cpuhotplug trainwreck

The current workqueue code does crazy stuff on cpu unplug: it relies on
forced affinity breakage, thereby violating per-cpu expectations. Worse,
it tries to re-attach to a cpu if the thing comes up again before all
previously queued works are finished. This breaks (admittedly bonkers)
cpu-hotplug use that relies on a down-up cycle to push all usage away.

Introduce a new WQ_NON_AFFINE flag that indicates a per-cpu workqueue
will not respect cpu affinity, and use this to migrate all its pending
works to whatever cpu is doing cpu-down. For the rest, simply flush all
per-cpu works and don't mess about.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/cpu.h       |   14 +-
 include/linux/workqueue.h |    5 +-
 kernel/workqueue.c        |  539 ++++++++++++---------------------------
 3 files changed, 152 insertions(+), 406 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index b1a635a..c9d6885 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -60,14 +60,16 @@ enum {
 	 */
 	CPU_PRI_SCHED_ACTIVE	= INT_MAX,
 	CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
-	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
-	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,

 	/* migration should happen before other stuff but after perf */
-	CPU_PRI_PERF		= 20,
-	CPU_PRI_MIGRATION	= 10,
-	/* prepare workqueues for other notifiers */
-	CPU_PRI_WORKQUEUE	= 5,
+	CPU_PRI_PERF			= 20,
+	CPU_PRI_MIGRATION		= 10,
+	CPU_PRI_WORKQUEUE_ACTIVE	= 5,  /* prepare workqueues for others */
+	CPU_PRI_NORMAL			= 0,
+	CPU_PRI_WORKQUEUE_INACTIVE	= -5, /* flush workqueues after others */
+
+	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
+	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
 };

 #define CPU_ONLINE		0x0002 /* CPU (unsigned)v is up */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0d556de..d68ead8 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -254,9 +254,10 @@ enum {
 	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
 	WQ_HIGHPRI		= 1 << 4, /* high priority */
 	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu instensive workqueue */
+	WQ_NON_AFFINE		= 1 << 6, /* free to move works around cpus */

-	WQ_DRAINING		= 1 << 6, /* internal: workqueue is draining */
-	WQ_RESCUER		= 1 << 7, /* internal: workqueue has rescuer */
+	WQ_DRAINING		= 1 << 7, /* internal: workqueue is draining */
+	WQ_RESCUER		= 1 << 8, /* internal: workqueue has rescuer */

 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1783aab..b067aa9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,7 @@
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
 #include <linux/idr.h>
+#include <linux/delay.h>

 #include "workqueue_sched.h"

@@ -57,20 +58,10 @@ enum {
 	WORKER_DIE		= 1 << 1,	/* die die die */
 	WORKER_IDLE		= 1 << 2,	/* is idle */
 	WORKER_PREP		= 1 << 3,	/* preparing to run works */
-	WORKER_ROGUE		= 1 << 4,	/* not bound to any cpu */
-	WORKER_REBIND		= 1 << 5,	/* mom is home, come back */
-	WORKER_CPU_INTENSIVE	= 1 << 6,	/* cpu intensive */
-	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
+	WORKER_CPU_INTENSIVE	= 1 << 4,	/* cpu intensive */
+	WORKER_UNBOUND		= 1 << 5,	/* worker is unbound */

-	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_ROGUE | WORKER_REBIND |
-				  WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
-
-	/* gcwq->trustee_state */
-	TRUSTEE_START		= 0,		/* start */
-	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
-	TRUSTEE_BUTCHER		= 2,		/* butcher workers */
-	TRUSTEE_RELEASE		= 3,		/* release workers */
-	TRUSTEE_DONE		= 4,		/* trustee is done */
+	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE | WORKER_UNBOUND,

 	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */
 	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
@@ -84,7 +75,6 @@ enum {
 						   (min two ticks) */
 	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
 	CREATE_COOLDOWN		= HZ,		/* time to breath after fail */
-	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */

 	/*
 	 * Rescue workers are used only on emergencies and shared by
@@ -136,7 +126,6 @@ struct worker {
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
-	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
 };

 /*
@@ -163,10 +152,8 @@ struct global_cwq {

 	struct ida		worker_ida;	/* L: for worker IDs */

-	struct task_struct	*trustee;	/* L: for gcwq shutdown */
-	unsigned int		trustee_state;	/* L: trustee state */
-	wait_queue_head_t	trustee_wait;	/* trustee wait */
 	struct worker		*first_idle;	/* L: first idle worker */
+	wait_queue_head_t	idle_wait;
 } ____cacheline_aligned_in_smp;

 /*
@@ -979,13 +966,38 @@ static bool is_chained_work(struct workqueue_struct *wq)
 	return false;
 }

-static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
-			 struct work_struct *work)
+static void ___queue_work(struct workqueue_struct *wq, struct global_cwq *gcwq,
+			  struct work_struct *work)
 {
-	struct global_cwq *gcwq;
 	struct cpu_workqueue_struct *cwq;
 	struct list_head *worklist;
 	unsigned int work_flags;
+
+	/* gcwq determined, get cwq and queue */
+	cwq = get_cwq(gcwq->cpu, wq);
+	trace_workqueue_queue_work(gcwq->cpu, cwq, work);
+
+	BUG_ON(!list_empty(&work->entry));
+
+	cwq->nr_in_flight[cwq->work_color]++;
+	work_flags = work_color_to_flags(cwq->work_color);
+
+	if (likely(cwq->nr_active < cwq->max_active)) {
+		trace_workqueue_activate_work(work);
+		cwq->nr_active++;
+		worklist = gcwq_determine_ins_pos(gcwq, cwq);
+	} else {
+		work_flags |= WORK_STRUCT_DELAYED;
+		worklist = &cwq->delayed_works;
+	}
+
+	insert_work(cwq, work, worklist, work_flags);
+}
+
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
+			 struct work_struct *work)
+{
+	struct global_cwq *gcwq;
 	unsigned long flags;

 	debug_work_activate(work);
@@ -1031,27 +1043,32 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		spin_lock_irqsave(&gcwq->lock, flags);
 	}

-	/* gcwq determined, get cwq and queue */
-	cwq = get_cwq(gcwq->cpu, wq);
-	trace_workqueue_queue_work(cpu, cwq, work);
+	___queue_work(wq, gcwq, work);

-	BUG_ON(!list_empty(&work->entry));
+	spin_unlock_irqrestore(&gcwq->lock, flags);
+}

-	cwq->nr_in_flight[cwq->work_color]++;
-	work_flags = work_color_to_flags(cwq->work_color);
+/**
+ * queue_work_on - queue work on specific cpu
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * Returns 0 if @work was already on a queue, non-zero otherwise.
+ *
+ * We queue the work to a specific CPU, the caller must ensure it
+ * can't go away.
+ */
+static int
+__queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
+{
+	int ret = 0;

-	if (likely(cwq->nr_active < cwq->max_active)) {
-		trace_workqueue_activate_work(work);
-		cwq->nr_active++;
-		worklist = gcwq_determine_ins_pos(gcwq, cwq);
-	} else {
-		work_flags |= WORK_STRUCT_DELAYED;
-		worklist = &cwq->delayed_works;
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		__queue_work(cpu, wq, work);
+		ret = 1;
 	}
-
-	insert_work(cwq, work, worklist, work_flags);
-
-	spin_unlock_irqrestore(&gcwq->lock, flags);
+	return ret;
 }

 /**
@@ -1068,34 +1085,19 @@ int queue_work(struct workqueue_struct *wq, struct work_struct *work)
 {
 	int ret;

-	ret = queue_work_on(get_cpu(), wq, work);
+	ret = __queue_work_on(get_cpu(), wq, work);
 	put_cpu();

 	return ret;
 }
 EXPORT_SYMBOL_GPL(queue_work);

-/**
- * queue_work_on - queue work on specific cpu
- * @cpu: CPU number to execute work on
- * @wq: workqueue to use
- * @work: work to queue
- *
- * Returns 0 if @work was already on a queue, non-zero otherwise.
- *
- * We queue the work to a specific CPU, the caller must ensure it
- * can't go away.
- */
 int queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 {
-	int ret = 0;
+	WARN_ON(wq->flags & WQ_NON_AFFINE);

-	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
-		__queue_work(cpu, wq, work);
-		ret = 1;
-	}
-	return ret;
+	return __queue_work_on(cpu, wq, work);
 }
 EXPORT_SYMBOL_GPL(queue_work_on);

@@ -1141,6 +1143,8 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;

+	WARN_ON((wq->flags & WQ_NON_AFFINE) && cpu != -1);
+
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		unsigned int lcpu;

@@ -1206,12 +1210,13 @@ static void worker_enter_idle(struct worker *worker)
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &gcwq->idle_list);

-	if (likely(!(worker->flags & WORKER_ROGUE))) {
-		if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
-			mod_timer(&gcwq->idle_timer,
-				  jiffies + IDLE_WORKER_TIMEOUT);
-	} else
-		wake_up_all(&gcwq->trustee_wait);
+	if (gcwq->nr_idle == gcwq->nr_workers)
+		wake_up_all(&gcwq->idle_wait);
+
+	if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer)) {
+		mod_timer(&gcwq->idle_timer,
+			  jiffies + IDLE_WORKER_TIMEOUT);
+	}

 	/* sanity check nr_running */
 	WARN_ON_ONCE(gcwq->nr_workers == gcwq->nr_idle &&
@@ -1303,23 +1308,6 @@ __acquires(&gcwq->lock)
 	}
 }

-/*
- * Function for worker->rebind_work used to rebind rogue busy workers
- * to the associated cpu which is coming back online.  This is
- * scheduled by cpu up but can race with other cpu hotplug operations
- * and may be executed twice without intervening cpu down.
- */
-static void worker_rebind_fn(struct work_struct *work)
-{
-	struct worker *worker = container_of(work, struct worker, rebind_work);
-	struct global_cwq *gcwq = worker->gcwq;
-
-	if (worker_maybe_bind_and_lock(worker))
-		worker_clr_flags(worker, WORKER_REBIND);
-
-	spin_unlock_irq(&gcwq->lock);
-}
-
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1328,7 +1316,6 @@ static struct worker *alloc_worker(void)
 	if (worker) {
 		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
-		INIT_WORK(&worker->rebind_work, worker_rebind_fn);
 		/* on creation a worker is in !idle && prep state */
 		worker->flags = WORKER_PREP;
 	}
@@ -1668,13 +1655,6 @@ static bool manage_workers(struct worker *worker)

 	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;

-	/*
-	 * The trustee might be waiting to take over the manager
-	 * position, tell it we're done.
-	 */
-	if (unlikely(gcwq->trustee))
-		wake_up_all(&gcwq->trustee_wait);
-
 	return ret;
 }

@@ -3214,171 +3194,71 @@ EXPORT_SYMBOL_GPL(work_busy);
  * gcwqs serve mix of short, long and very long running works making
  * blocked draining impractical.
  *
- * This is solved by allowing a gcwq to be detached from CPU, running
- * it with unbound (rogue) workers and allowing it to be reattached
- * later if the cpu comes back online.  A separate thread is created
- * to govern a gcwq in such state and is called the trustee of the
- * gcwq.
- *
- * Trustee states and their descriptions.
- *
- * START	Command state used on startup.  On CPU_DOWN_PREPARE, a
- *		new trustee is started with this state.
- *
- * IN_CHARGE	Once started, trustee will enter this state after
- *		assuming the manager role and making all existing
- *		workers rogue.  DOWN_PREPARE waits for trustee to
- *		enter this state.  After reaching IN_CHARGE, trustee
- *		tries to execute the pending worklist until it's empty
- *		and the state is set to BUTCHER, or the state is set
- *		to RELEASE.
- *
- * BUTCHER	Command state which is set by the cpu callback after
- *		the cpu has went down.  Once this state is set trustee
- *		knows that there will be no new works on the worklist
- *		and once the worklist is empty it can proceed to
- *		killing idle workers.
- *
- * RELEASE	Command state which is set by the cpu callback if the
- *		cpu down has been canceled or it has come online
- *		again.  After recognizing this state, trustee stops
- *		trying to drain or butcher and clears ROGUE, rebinds
- *		all remaining workers back to the cpu and releases
- *		manager role.
- *
- * DONE		Trustee will enter this state after BUTCHER or RELEASE
- *		is complete.
- *
- *          trustee                 CPU                draining
- *        took over                down               complete
- * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
- *                        |                     |                  ^
- *                        | CPU is back online  v   return workers |
- *                         ----------------> RELEASE --------------
  */

-/**
- * trustee_wait_event_timeout - timed event wait for trustee
- * @cond: condition to wait for
- * @timeout: timeout in jiffies
- *
- * wait_event_timeout() for trustee to use.  Handles locking and
- * checks for RELEASE request.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by trustee.
- *
- * RETURNS:
- * Positive indicating left time if @cond is satisfied, 0 if timed
- * out, -1 if canceled.
- */
-#define trustee_wait_event_timeout(cond, timeout) ({			\
-	long __ret = (timeout);						\
-	while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) &&	\
-	       __ret) {							\
-		spin_unlock_irq(&gcwq->lock);				\
-		__wait_event_timeout(gcwq->trustee_wait, (cond) ||	\
-			(gcwq->trustee_state == TRUSTEE_RELEASE),	\
-			__ret);						\
-		spin_lock_irq(&gcwq->lock);				\
-	}								\
-	gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret);		\
-})
+static int __devinit workqueue_cpu_up_callback(struct notifier_block *nfb,
+					       unsigned long action,
+					       void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct global_cwq *gcwq = get_gcwq(cpu);
+	struct worker *uninitialized_var(new_worker);
+	unsigned long flags;

-/**
- * trustee_wait_event - event wait for trustee
- * @cond: condition to wait for
- *
- * wait_event() for trustee to use.  Automatically handles locking and
- * checks for CANCEL request.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by trustee.
- *
- * RETURNS:
- * 0 if @cond is satisfied, -1 if canceled.
- */
-#define trustee_wait_event(cond) ({					\
-	long __ret1;							\
-	__ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
-	__ret1 < 0 ? -1 : 0;						\
-})
+	action &= ~CPU_TASKS_FROZEN;

-static int __cpuinit trustee_thread(void *__gcwq)
-{
-	struct global_cwq *gcwq = __gcwq;
-	struct worker *worker;
-	struct work_struct *work;
-	struct hlist_node *pos;
-	long rc;
-	int i;
+	switch (action) {
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		new_worker = create_worker(gcwq, false);
+		if (!new_worker)
+			return NOTIFY_BAD;
+	}

-	BUG_ON(gcwq->cpu != smp_processor_id());
+	/* some are called w/ irq disabled, don't disturb irq status */
+	spin_lock_irqsave(&gcwq->lock, flags);

-	spin_lock_irq(&gcwq->lock);
-	/*
-	 * Claim the manager position and make all workers rogue.
-	 * Trustee must be bound to the target cpu and can't be
-	 * cancelled.
-	 */
-	BUG_ON(gcwq->cpu != smp_processor_id());
-	rc = trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS));
-	BUG_ON(rc < 0);
+	switch (action) {
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		gcwq->first_idle = new_worker;
+		break;

-	gcwq->flags |= GCWQ_MANAGING_WORKERS;
+	case CPU_UP_CANCELED:
+		destroy_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;

-	list_for_each_entry(worker, &gcwq->idle_list, entry)
-		worker->flags |= WORKER_ROGUE;
+	case CPU_ONLINE:
+		spin_unlock_irq(&gcwq->lock);
+		kthread_bind(gcwq->first_idle->task, cpu);
+		spin_lock_irq(&gcwq->lock);
+		gcwq->flags |= GCWQ_MANAGE_WORKERS;
+		start_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
+	}

-	for_each_busy_worker(worker, i, pos, gcwq)
-		worker->flags |= WORKER_ROGUE;
+	spin_unlock_irqrestore(&gcwq->lock, flags);

-	/*
-	 * Call schedule() so that we cross rq->lock and thus can
-	 * guarantee sched callbacks see the rogue flag.  This is
-	 * necessary as scheduler callbacks may be invoked from other
-	 * cpus.
-	 */
-	spin_unlock_irq(&gcwq->lock);
-	schedule();
-	spin_lock_irq(&gcwq->lock);
+	return notifier_from_errno(0);
+}

-	/*
-	 * Sched callbacks are disabled now.  Zap nr_running.  After
-	 * this, nr_running stays zero and need_more_worker() and
-	 * keep_working() are always true as long as the worklist is
-	 * not empty.
-	 */
-	atomic_set(get_gcwq_nr_running(gcwq->cpu), 0);
+static void flush_gcwq(struct global_cwq *gcwq)
+{
+	struct work_struct *work, *nw;
+	struct worker *worker, *n;
+	LIST_HEAD(non_affine_works);

-	spin_unlock_irq(&gcwq->lock);
-	del_timer_sync(&gcwq->idle_timer);
 	spin_lock_irq(&gcwq->lock);
+	list_for_each_entry_safe(work, nw, &gcwq->worklist, entry) {
+		struct workqueue_struct *wq = get_work_cwq(work)->wq;

-	/*
-	 * We're now in charge.  Notify and proceed to drain.  We need
-	 * to keep the gcwq running during the whole CPU down
-	 * procedure as other cpu hotunplug callbacks may need to
-	 * flush currently running tasks.
-	 */
-	gcwq->trustee_state = TRUSTEE_IN_CHARGE;
-	wake_up_all(&gcwq->trustee_wait);
+		if (wq->flags & WQ_NON_AFFINE)
+			list_move(&work->entry, &non_affine_works);
+	}

-	/*
-	 * The original cpu is in the process of dying and may go away
-	 * anytime now.  When that happens, we and all workers would
-	 * be migrated to other cpus.  Try draining any left work.  We
-	 * want to get it over with ASAP - spam rescuers, wake up as
-	 * many idlers as necessary and create new ones till the
-	 * worklist is empty.  Note that if the gcwq is frozen, there
-	 * may be frozen works in freezable cwqs.  Don't declare
-	 * completion while frozen.
-	 */
-	while (gcwq->nr_workers != gcwq->nr_idle ||
-	       gcwq->flags & GCWQ_FREEZING ||
-	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+	while (!list_empty(&gcwq->worklist)) {
 		int nr_works = 0;

 		list_for_each_entry(work, &gcwq->worklist, entry) {
@@ -3392,190 +3272,54 @@ static int __cpuinit trustee_thread(void *__gcwq)
 			wake_up_process(worker->task);
 		}

+		spin_unlock_irq(&gcwq->lock);
+
 		if (need_to_create_worker(gcwq)) {
-			spin_unlock_irq(&gcwq->lock);
-			worker = create_worker(gcwq, false);
-			spin_lock_irq(&gcwq->lock);
-			if (worker) {
-				worker->flags |= WORKER_ROGUE;
+			worker = create_worker(gcwq, true);
+			if (worker)
 				start_worker(worker);
-			}
 		}

-		/* give a breather */
-		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
-			break;
-	}
-
-	/*
-	 * Either all works have been scheduled and cpu is down, or
-	 * cpu down has already been canceled.  Wait for and butcher
-	 * all workers till we're canceled.
-	 */
-	do {
-		rc = trustee_wait_event(!list_empty(&gcwq->idle_list));
-		while (!list_empty(&gcwq->idle_list))
-			destroy_worker(list_first_entry(&gcwq->idle_list,
-							struct worker, entry));
-	} while (gcwq->nr_workers && rc >= 0);
+		wait_event_timeout(gcwq->idle_wait,
+				   gcwq->nr_idle == gcwq->nr_workers, HZ/10);

-	/*
-	 * At this point, either draining has completed and no worker
-	 * is left, or cpu down has been canceled or the cpu is being
-	 * brought back up.  There shouldn't be any idle one left.
-	 * Tell the remaining busy ones to rebind once it finishes the
-	 * currently scheduled works by scheduling the rebind_work.
-	 */
-	WARN_ON(!list_empty(&gcwq->idle_list));
-
-	for_each_busy_worker(worker, i, pos, gcwq) {
-		struct work_struct *rebind_work = &worker->rebind_work;
-
-		/*
-		 * Rebind_work may race with future cpu hotplug
-		 * operations.  Use a separate flag to mark that
-		 * rebinding is scheduled.
-		 */
-		worker->flags |= WORKER_REBIND;
-		worker->flags &= ~WORKER_ROGUE;
+		spin_lock_irq(&gcwq->lock);
+	}

-		/* queue rebind_work, wq doesn't matter, use the default one */
-		if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
-				     work_data_bits(rebind_work)))
-			continue;
+	WARN_ON(gcwq->nr_workers != gcwq->nr_idle);

-		debug_work_activate(rebind_work);
-		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
-	}
+	list_for_each_entry_safe(worker, n, &gcwq->idle_list, entry)
+		destroy_worker(worker);

-	/* relinquish manager role */
-	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+	WARN_ON(gcwq->nr_workers || gcwq->nr_idle);

-	/* notify completion */
-	gcwq->trustee = NULL;
-	gcwq->trustee_state = TRUSTEE_DONE;
-	wake_up_all(&gcwq->trustee_wait);
 	spin_unlock_irq(&gcwq->lock);
-	return 0;
-}

-/**
- * wait_trustee_state - wait for trustee to enter the specified state
- * @gcwq: gcwq the trustee of interest belongs to
- * @state: target state to wait for
- *
- * Wait for the trustee to reach @state.  DONE is already matched.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by cpu_callback.
- */
-static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
-__releases(&gcwq->lock)
-__acquires(&gcwq->lock)
-{
-	if (!(gcwq->trustee_state == state ||
-	      gcwq->trustee_state == TRUSTEE_DONE)) {
-		spin_unlock_irq(&gcwq->lock);
-		__wait_event(gcwq->trustee_wait,
-			     gcwq->trustee_state == state ||
-			     gcwq->trustee_state == TRUSTEE_DONE);
-		spin_lock_irq(&gcwq->lock);
+	gcwq = get_gcwq(get_cpu());
+	spin_lock_irq(&gcwq->lock);
+	list_for_each_entry_safe(work, nw, &non_affine_works, entry) {
+		list_del_init(&work->entry);
+		___queue_work(get_work_cwq(work)->wq, gcwq, work);
 	}
+	spin_unlock_irq(&gcwq->lock);
+	put_cpu();
 }

-static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
+static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
 						unsigned long action,
 						void *hcpu)
 {
 	unsigned int cpu = (unsigned long)hcpu;
 	struct global_cwq *gcwq = get_gcwq(cpu);
-	struct task_struct *new_trustee = NULL;
-	struct worker *uninitialized_var(new_worker);
-	unsigned long flags;

 	action &= ~CPU_TASKS_FROZEN;

 	switch (action) {
 	case CPU_DOWN_PREPARE:
-		new_trustee = kthread_create(trustee_thread, gcwq,
-					     "workqueue_trustee/%d\n", cpu);
-		if (IS_ERR(new_trustee))
-			return notifier_from_errno(PTR_ERR(new_trustee));
-		kthread_bind(new_trustee, cpu);
-		/* fall through */
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		new_worker = create_worker(gcwq, false);
-		if (!new_worker) {
-			if (new_trustee)
-				kthread_stop(new_trustee);
-			return NOTIFY_BAD;
-		}
-	}
-
-	/* some are called w/ irq disabled, don't disturb irq status */
-	spin_lock_irqsave(&gcwq->lock, flags);
-
-	switch (action) {
-	case CPU_DOWN_PREPARE:
-		/* initialize trustee and tell it to acquire the gcwq */
-		BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
-		gcwq->trustee = new_trustee;
-		gcwq->trustee_state = TRUSTEE_START;
-		wake_up_process(gcwq->trustee);
-		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
-		/* fall through */
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		gcwq->first_idle = new_worker;
-		break;
-
-	case CPU_DYING:
-		/*
-		 * Before this, the trustee and all workers except for
-		 * the ones which are still executing works from
-		 * before the last CPU down must be on the cpu.  After
-		 * this, they'll all be diasporas.
-		 */
-		gcwq->flags |= GCWQ_DISASSOCIATED;
-		break;
-
-	case CPU_POST_DEAD:
-		gcwq->trustee_state = TRUSTEE_BUTCHER;
-		/* fall through */
-	case CPU_UP_CANCELED:
-		destroy_worker(gcwq->first_idle);
-		gcwq->first_idle = NULL;
-		break;
-
-	case CPU_DOWN_FAILED:
-	case CPU_ONLINE:
-		gcwq->flags &= ~GCWQ_DISASSOCIATED;
-		if (gcwq->trustee_state != TRUSTEE_DONE) {
-			gcwq->trustee_state = TRUSTEE_RELEASE;
-			wake_up_process(gcwq->trustee);
-			wait_trustee_state(gcwq, TRUSTEE_DONE);
-		}
-
-		/*
-		 * Trustee is done and there might be no worker left.
-		 * Put the first_idle in and request a real manager to
-		 * take a look.
- */ - spin_unlock_irq(&gcwq->lock); - kthread_bind(gcwq->first_idle->task, cpu); - spin_lock_irq(&gcwq->lock); - gcwq->flags |= GCWQ_MANAGE_WORKERS; - start_worker(gcwq->first_idle); - gcwq->first_idle = NULL; + flush_gcwq(gcwq); break; } - spin_unlock_irqrestore(&gcwq->lock, flags); - return notifier_from_errno(0); } @@ -3772,7 +3516,8 @@ static int __init init_workqueues(void) unsigned int cpu; int i; - cpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE); + cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_ACTIVE); + hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_INACTIVE); /* initialize gcwqs */ for_each_gcwq_cpu(cpu) { @@ -3795,9 +3540,7 @@ static int __init init_workqueues(void) (unsigned long)gcwq); ida_init(&gcwq->worker_ida); - - gcwq->trustee_state = TRUSTEE_DONE; - init_waitqueue_head(&gcwq->trustee_wait); + init_waitqueue_head(&gcwq->idle_wait); } /* create the initial worker */ ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
  2011-09-30  8:34             ` Peter Zijlstra
@ 2011-09-30  8:55               ` Tejun Heo
  2011-09-30  9:04                 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2011-09-30  8:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Thomas Gleixner

Hello, Peter.

On Fri, Sep 30, 2011 at 10:34:59AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-29 at 21:14 -0700, Tejun Heo wrote:
> > It's rather complicated due to the way workqueues have been used.  You
> > can't tell whether a work really needs to be pinned to one cpu for
> > correctness until actual cpu hotplug time as wq users are expected to
> > flush such work items from hotplug notifiers, so wq workers should be
> > pinned on the respective cpus while the cpus are up but also need to
> > step aside once unplug notifiers are finished.
>
> That really is crack, and I've always said so.
>
> How about we fix that utter workqueue unplug nightmare..

The patch is corrupt (=20/=3D's).  Can you please re-send?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
  2011-09-30  8:55               ` Tejun Heo
@ 2011-09-30  9:04                 ` Peter Zijlstra
  2011-10-03  1:15                   ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2011-09-30  9:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Fri, 2011-09-30 at 01:55 -0700, Tejun Heo wrote:
> Hello, Peter.
>
> On Fri, Sep 30, 2011 at 10:34:59AM +0200, Peter Zijlstra wrote:
> > On Thu, 2011-09-29 at 21:14 -0700, Tejun Heo wrote:
> > > It's rather complicated due to the way workqueues have been used.  You
> > > can't tell whether a work really needs to be pinned to one cpu for
> > > correctness until actual cpu hotplug time as wq users are expected to
> > > flush such work items from hotplug notifiers, so wq workers should be
> > > pinned on the respective cpus while the cpus are up but also need to
> > > step aside once unplug notifiers are finished.
> >
> > That really is crack, and I've always said so.
> >
> > How about we fix that utter workqueue unplug nightmare..
>
> The patch is corrupt (=20/=3D's).  Can you please re-send?

Fucking mailer insists on quoted-printable for plain text :/

I've got something like the below awk in my mail->patch pipe:

/Content-Transfer-Encoding:.*quoted-printable.*/ {
	decode = 1;
}

// {
	tmp = $0

	if (!decode) {
		print tmp
		next
	}

	if (concat) {
		tmp = last tmp
		concat = 0;
	}

	if (tmp ~ /=$/) {
		concat = 1;
		gsub("=$", "", tmp);
	}

	offset = 0;
	while (match(tmp, /=[[:xdigit:]][[:xdigit:]]/, a)) {
		if (a[0] < offset)
			break;

		hex = substr(a[0], 2)
		char = sprintf("%c", strtonum("0x"hex))
		gsub(a[0], char, tmp)
		offset = a[0];
	}

	if (concat) {
		last = tmp
		next
	}

	print tmp
}

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
  2011-09-30  9:04                 ` Peter Zijlstra
@ 2011-10-03  1:15                   ` Tejun Heo
  2011-10-03 10:20                     ` Peter Zijlstra
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Tejun Heo @ 2011-10-03  1:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Thomas Gleixner

Hello,

On Fri, Sep 30, 2011 at 11:04:33AM +0200, Peter Zijlstra wrote:
> > The patch is corrupt (=20/=3D's).  Can you please re-send?
>
> Fucking mailer insists on quoted-printable for plain text :/

If you can't get the mailer working properly, please attach the patch.
At the end is decoded patch (there still seem to be some artifacts,
dunno why).

Anyways, I don't think I'm gonna take this one.  There are some
attractions to the approach - ie. making the users determine whether
they need strict affinity or not and mandating those users to shut
down properly from cpu down callbacks - and if we were doing this from
scratch, this probably would be a sane choice.  But we already have
tons of users and relatively well tested code.  I don't see compelling
reasons to perturb that at this point.

Also, on a quick glance, the change is breaking the non-reentrance
guarantee.

Thanks.
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index b1a635a..c9d6885 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -60,14 +60,16 @@ enum {
 	 */
 	CPU_PRI_SCHED_ACTIVE	= INT_MAX,
 	CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
-	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
-	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
 
 	/* migration should happen before other stuff but after perf */
-	CPU_PRI_PERF		= 20,
-	CPU_PRI_MIGRATION	= 10,
-	/* prepare workqueues for other notifiers */
-	CPU_PRI_WORKQUEUE	= 5,
+	CPU_PRI_PERF			= 20,
+	CPU_PRI_MIGRATION		= 10,
+	CPU_PRI_WORKQUEUE_ACTIVE	= 5,  /* prepare workqueues for others */
+	CPU_PRI_NORMAL			= 0,
+	CPU_PRI_WORKQUEUE_INACTIVE	= -5, /* flush workqueues after others */
+
+	CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
+	CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
 };
 
 #define CPU_ONLINE		0x0002 /* CPU (unsigned)v is up */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0d556de..d68ead8 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -254,9 +254,10 @@ enum {
 	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
 	WQ_HIGHPRI		= 1 << 4, /* high priority */
 	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu instensive workqueue */
+	WQ_NON_AFFINE		= 1 << 6, /* free to move works around cpus */
 
-	WQ_DRAINING		= 1 << 6, /* internal: workqueue is draining */
-	WQ_RESCUER		= 1 << 7, /* internal: workqueue has rescuer */
+	WQ_DRAINING		= 1 << 7, /* internal: workqueue is draining */
+	WQ_RESCUER		= 1 << 8, /* internal: workqueue has rescuer */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1783aab..a57acb8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,7 @@
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
 #include <linux/idr.h>
+#include <linux/delay.h>
 
 #include "workqueue_sched.h"
 
@@ -57,20 +58,10 @@ enum {
 	WORKER_DIE		= 1 << 1,	/* die die die */
 	WORKER_IDLE		= 1 << 2,	/* is idle */
 	WORKER_PREP		= 1 << 3,	/* preparing to run works */
-	WORKER_ROGUE		= 1 << 4,	/* not bound to any cpu */
-	WORKER_REBIND		= 1 << 5,	/* mom is home, come back */
-	WORKER_CPU_INTENSIVE	= 1 << 6,	/* cpu intensive */
-	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
+	WORKER_CPU_INTENSIVE	= 1 << 4,	/* cpu intensive */
+	WORKER_UNBOUND		= 1 << 5,	/* worker is unbound */
 
-	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_ROGUE | WORKER_REBIND |
-				  WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
-
-	/* gcwq->trustee_state */
-	TRUSTEE_START		= 0,		/* start */
-	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
-	TRUSTEE_BUTCHER		= 2,		/* butcher workers */
-	TRUSTEE_RELEASE		= 3,		/* release workers */
-	TRUSTEE_DONE		= 4,		/* trustee is done */
+	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
 
 	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */
 	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
@@ -84,7 +75,6 @@ enum {
 						(min two ticks) */
 	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
 	CREATE_COOLDOWN		= HZ,		/* time to breath after fail */
-	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */
 
 	/*
 	 * Rescue workers are used only on emergencies and shared by
@@ -136,7 +126,6 @@ struct worker {
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
-	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
 };
 
 /*
@@ -163,10 +152,8 @@ struct global_cwq {
 
 	struct ida		worker_ida;	/* L: for worker IDs */
 
-	struct task_struct	*trustee;	/* L: for gcwq shutdown */
-	unsigned int		trustee_state;	/* L: trustee state */
-	wait_queue_head_t	trustee_wait;	/* trustee wait */
 	struct worker		*first_idle;	/* L: first idle worker */
+	wait_queue_head_t	idle_wait;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -979,13 +966,38 @@ static bool is_chained_work(struct workqueue_struct *wq)
 	return false;
 }
 
-static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
-			 struct work_struct *work)
+static void ___queue_work(struct workqueue_struct *wq, struct global_cwq *gcwq,
+			  struct work_struct *work)
 {
-	struct global_cwq *gcwq;
 	struct cpu_workqueue_struct *cwq;
 	struct list_head *worklist;
 	unsigned int work_flags;
+
+	/* gcwq determined, get cwq and queue */
+	cwq = get_cwq(gcwq->cpu, wq);
+	trace_workqueue_queue_work(gcwq->cpu, cwq, work);
+
+	BUG_ON(!list_empty(&work->entry));
+
+	cwq->nr_in_flight[cwq->work_color]++;
+	work_flags = work_color_to_flags(cwq->work_color);
+
+	if (likely(cwq->nr_active < cwq->max_active)) {
+		trace_workqueue_activate_work(work);
+		cwq->nr_active++;
+		worklist = gcwq_determine_ins_pos(gcwq, cwq);
+	} else {
+		work_flags |= WORK_STRUCT_DELAYED;
+		worklist = &cwq->delayed_works;
+	}
+
+	insert_work(cwq, work, worklist, work_flags);
+}
+
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
+			 struct work_struct *work)
+{
+	struct global_cwq *gcwq;
 	unsigned long flags;
 
 	debug_work_activate(work);
@@ -1031,27 +1043,32 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		spin_lock_irqsave(&gcwq->lock, flags);
 	}
 
-	/* gcwq determined, get cwq and queue */
-	cwq = get_cwq(gcwq->cpu, wq);
-	trace_workqueue_queue_work(cpu, cwq, work);
+	___queue_work(wq, gcwq, work);
 
-	BUG_ON(!list_empty(&work->entry));
+	spin_unlock_irqrestore(&gcwq->lock, flags);
+}
 
-	cwq->nr_in_flight[cwq->work_color]++;
-	work_flags = work_color_to_flags(cwq->work_color);
+/**
+ * queue_work_on - queue work on specific cpu
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * Returns 0 if @work was already on a queue, non-zero otherwise.
+ *
+ * We queue the work to a specific CPU, the caller must ensure it
+ * can't go away.
+ */
+static int
+__queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
+{
+	int ret = 0;
 
-	if (likely(cwq->nr_active < cwq->max_active)) {
-		trace_workqueue_activate_work(work);
-		cwq->nr_active++;
-		worklist = gcwq_determine_ins_pos(gcwq, cwq);
-	} else {
-		work_flags |= WORK_STRUCT_DELAYED;
-		worklist = &cwq->delayed_works;
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		__queue_work(cpu, wq, work);
+		ret = 1;
 	}
-
-	insert_work(cwq, work, worklist, work_flags);
-
-	spin_unlock_irqrestore(&gcwq->lock, flags);
+	return ret;
 }
 
 /**
@@ -1068,34 +1085,19 @@ int queue_work(struct workqueue_struct *wq, struct work_struct *work)
 {
 	int ret;
 
-	ret = queue_work_on(get_cpu(), wq, work);
+	ret = __queue_work_on(get_cpu(), wq, work);
 	put_cpu();
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
-/**
- * queue_work_on - queue work on specific cpu
- * @cpu: CPU number to execute work on
- * @wq: workqueue to use
- * @work: work to queue
- *
- * Returns 0 if @work was already on a queue, non-zero otherwise.
- *
- * We queue the work to a specific CPU, the caller must ensure it
- * can't go away.
- */
 int
 queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 {
-	int ret = 0;
+	WARN_ON(wq->flags & WQ_NON_AFFINE);
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
-		__queue_work(cpu, wq, work);
-		ret = 1;
-	}
-	return ret;
+	return __queue_work_on(cpu, wq, work);
 }
 EXPORT_SYMBOL_GPL(queue_work_on);
 
@@ -1141,6 +1143,8 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;
 
+	WARN_ON((wq->flags & WQ_NON_AFFINE) && cpu != -1);
+
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		unsigned int lcpu;
 
@@ -1206,12 +1210,13 @@ static void worker_enter_idle(struct worker *worker)
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &gcwq->idle_list);
 
-	if (likely(!(worker->flags & WORKER_ROGUE))) {
-		if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
-			mod_timer(&gcwq->idle_timer,
-				  jiffies + IDLE_WORKER_TIMEOUT);
-	} else
-		wake_up_all(&gcwq->trustee_wait);
+	if (gcwq->nr_idle == gcwq->nr_workers)
+		wake_up_all(&gcwq->idle_wait);
+
+	if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer)) {
+		mod_timer(&gcwq->idle_timer,
+			  jiffies + IDLE_WORKER_TIMEOUT);
+	}
 
 	/* sanity check nr_running */
 	WARN_ON_ONCE(gcwq->nr_workers == gcwq->nr_idle &&
@@ -1303,23 +1308,6 @@ __acquires(&gcwq->lock)
 	}
 }
 
-/*
- * Function for worker->rebind_work used to rebind rogue busy workers
- * to the associated cpu which is coming back online.  This is
- * scheduled by cpu up but can race with other cpu hotplug operations
- * and may be executed twice without intervening cpu down.
- */
-static void worker_rebind_fn(struct work_struct *work)
-{
-	struct worker *worker = container_of(work, struct worker, rebind_work);
-	struct global_cwq *gcwq = worker->gcwq;
-
-	if (worker_maybe_bind_and_lock(worker))
-		worker_clr_flags(worker, WORKER_REBIND);
-
-	spin_unlock_irq(&gcwq->lock);
-}
-
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1328,7 +1316,6 @@ static struct worker *alloc_worker(void)
 	if (worker) {
 		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
-		INIT_WORK(&worker->rebind_work, worker_rebind_fn);
 		/* on creation a worker is in !idle && prep state */
 		worker->flags = WORKER_PREP;
 	}
@@ -1668,13 +1655,6 @@ static bool manage_workers(struct worker *worker)
 
 	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
 
-	/*
-	 * The trustee might be waiting to take over the manager
-	 * position, tell it we're done.
-	 */
-	if (unlikely(gcwq->trustee))
-		wake_up_all(&gcwq->trustee_wait);
-
 	return ret;
 }
 
@@ -3214,171 +3194,71 @@ EXPORT_SYMBOL_GPL(work_busy);
  * gcwqs serve mix of short, long and very long running works making
  * blocked draining impractical.
  *
- * This is solved by allowing a gcwq to be detached from CPU, running
- * it with unbound (rogue) workers and allowing it to be reattached
- * later if the cpu comes back online.  A separate thread is created
- * to govern a gcwq in such state and is called the trustee of the
- * gcwq.
- *
- * Trustee states and their descriptions.
- *
- * START	Command state used on startup.  On CPU_DOWN_PREPARE, a
- *		new trustee is started with this state.
- *
- * IN_CHARGE	Once started, trustee will enter this state after
- *		assuming the manager role and making all existing
- *		workers rogue.  DOWN_PREPARE waits for trustee to
- *		enter this state.  After reaching IN_CHARGE, trustee
- *		tries to execute the pending worklist until it's empty
- *		and the state is set to BUTCHER, or the state is set
- *		to RELEASE.
- *
- * BUTCHER	Command state which is set by the cpu callback after
- *		the cpu has went down.  Once this state is set trustee
- *		knows that there will be no new works on the worklist
- *		and once the worklist is empty it can proceed to
- *		killing idle workers.
- *
- * RELEASE	Command state which is set by the cpu callback if the
- *		cpu down has been canceled or it has come online
- *		again.  After recognizing this state, trustee stops
- *		trying to drain or butcher and clears ROGUE, rebinds
- *		all remaining workers back to the cpu and releases
- *		manager role.
- *
- * DONE		Trustee will enter this state after BUTCHER or RELEASE
- *		is complete.
- *
- *		trustee			CPU		draining
- *		took over		down		complete
- * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
- *		|			|			^
- *		| CPU is back online	v   return workers	|
- *		----------------> RELEASE --------------
  */
 
-/**
- * trustee_wait_event_timeout - timed event wait for trustee
- * @cond: condition to wait for
- * @timeout: timeout in jiffies
- *
- * wait_event_timeout() for trustee to use.  Handles locking and
- * checks for RELEASE request.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by trustee.
- *
- * RETURNS:
- * Positive indicating left time if @cond is satisfied, 0 if timed
- * out, -1 if canceled.
- */
-#define trustee_wait_event_timeout(cond, timeout) ({			\
-	long __ret = (timeout);						\
-	while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) &&	\
-	       __ret) {							\
-		spin_unlock_irq(&gcwq->lock);				\
-		__wait_event_timeout(gcwq->trustee_wait, (cond) ||	\
-			(gcwq->trustee_state == TRUSTEE_RELEASE),	\
-			__ret);						\
-		spin_lock_irq(&gcwq->lock);				\
-	}								\
-	gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret);		\
-})
+static int __devinit workqueue_cpu_up_callback(struct notifier_block *nfb,
+					       unsigned long action,
+					       void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct global_cwq *gcwq = get_gcwq(cpu);
+	struct worker *uninitialized_var(new_worker);
+	unsigned long flags;
 
-/**
- * trustee_wait_event - event wait for trustee
- * @cond: condition to wait for
- *
- * wait_event() for trustee to use.  Automatically handles locking and
- * checks for CANCEL request.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by trustee.
- *
- * RETURNS:
- * 0 if @cond is satisfied, -1 if canceled.
- */
-#define trustee_wait_event(cond) ({					\
-	long __ret1;							\
-	__ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
-	__ret1 < 0 ? -1 : 0;						\
-})
+	action &= ~CPU_TASKS_FROZEN;
 
-static int __cpuinit trustee_thread(void *__gcwq)
-{
-	struct global_cwq *gcwq = __gcwq;
-	struct worker *worker;
-	struct work_struct *work;
-	struct hlist_node *pos;
-	long rc;
-	int i;
+	switch (action) {
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		new_worker = create_worker(gcwq, false);
+		if (!new_worker)
+			return NOTIFY_BAD;
+	}
 
-	BUG_ON(gcwq->cpu != smp_processor_id());
+	/* some are called w/ irq disabled, don't disturb irq status */
+	spin_lock_irqsave(&gcwq->lock, flags);
 
-	spin_lock_irq(&gcwq->lock);
-	/*
-	 * Claim the manager position and make all workers rogue.
-	 * Trustee must be bound to the target cpu and can't be
-	 * cancelled.
-	 */
-	BUG_ON(gcwq->cpu != smp_processor_id());
-	rc = trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS));
-	BUG_ON(rc < 0);
+	switch (action) {
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		gcwq->first_idle = new_worker;
+		break;
 
-	gcwq->flags |= GCWQ_MANAGING_WORKERS;
+	case CPU_UP_CANCELED:
+		destroy_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
 
-	list_for_each_entry(worker, &gcwq->idle_list, entry)
-		worker->flags |= WORKER_ROGUE;
+	case CPU_ONLINE:
+		spin_unlock_irq(&gcwq->lock);
+		kthread_bind(gcwq->first_idle->task, cpu);
+		spin_lock_irq(&gcwq->lock);
+		gcwq->flags |= GCWQ_MANAGE_WORKERS;
+		start_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
+	}
 
-	for_each_busy_worker(worker, i, pos, gcwq)
-		worker->flags |= WORKER_ROGUE;
+	spin_unlock_irqrestore(&gcwq->lock, flags);
 
-	/*
-	 * Call schedule() so that we cross rq->lock and thus can
-	 * guarantee sched callbacks see the rogue flag.  This is
-	 * necessary as scheduler callbacks may be invoked from other
-	 * cpus.
-	 */
-	spin_unlock_irq(&gcwq->lock);
-	schedule();
-	spin_lock_irq(&gcwq->lock);
+	return notifier_from_errno(0);
+}
 
-	/*
-	 * Sched callbacks are disabled now.  Zap nr_running.  After
-	 * this, nr_running stays zero and need_more_worker() and
-	 * keep_working() are always true as long as the worklist is
-	 * not empty.
-	 */
-	atomic_set(get_gcwq_nr_running(gcwq->cpu), 0);
+static void flush_gcwq(struct global_cwq *gcwq)
+{
+	struct work_struct *work, *nw;
+	struct worker *worker, *n;
+	LIST_HEAD(non_affine_works);
 
-	spin_unlock_irq(&gcwq->lock);
-	del_timer_sync(&gcwq->idle_timer);
 	spin_lock_irq(&gcwq->lock);
+	list_for_each_entry_safe(work, nw, &gcwq->worklist, entry) {
+		struct workqueue_struct *wq = get_work_cwq(work)->wq;
 
-	/*
-	 * We're now in charge.  Notify and proceed to drain.  We need
-	 * to keep the gcwq running during the whole CPU down
-	 * procedure as other cpu hotunplug callbacks may need to
-	 * flush currently running tasks.
-	 */
-	gcwq->trustee_state = TRUSTEE_IN_CHARGE;
-	wake_up_all(&gcwq->trustee_wait);
+		if (wq->flags & WQ_NON_AFFINE)
+			list_move(&work->entry, &non_affine_works);
+	}
 
-	/*
-	 * The original cpu is in the process of dying and may go away
-	 * anytime now.  When that happens, we and all workers would
-	 * be migrated to other cpus.  Try draining any left work.  We
-	 * want to get it over with ASAP - spam rescuers, wake up as
-	 * many idlers as necessary and create new ones till the
-	 * worklist is empty.  Note that if the gcwq is frozen, there
-	 * may be frozen works in freezable cwqs.  Don't declare
-	 * completion while frozen.
-	 */
-	while (gcwq->nr_workers != gcwq->nr_idle ||
-	       gcwq->flags & GCWQ_FREEZING ||
-	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+	while (!list_empty(&gcwq->worklist)) {
 		int nr_works = 0;
 
 		list_for_each_entry(work, &gcwq->worklist, entry) {
@@ -3392,190 +3272,50 @@ static int __cpuinit trustee_thread(void *__gcwq)
 			wake_up_process(worker->task);
 		}
 
+		spin_unlock_irq(&gcwq->lock);
+
 		if (need_to_create_worker(gcwq)) {
-			spin_unlock_irq(&gcwq->lock);
-			worker = create_worker(gcwq, false);
-			spin_lock_irq(&gcwq->lock);
-			if (worker) {
-				worker->flags |= WORKER_ROGUE;
+			worker = create_worker(gcwq, true);
+			if (worker)
 				start_worker(worker);
-			}
 		}
 
-		/* give a breather */
-		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
-			break;
-	}
-
-	/*
-	 * Either all works have been scheduled and cpu is down, or
-	 * cpu down has already been canceled.  Wait for and butcher
-	 * all workers till we're canceled.
-	 */
-	do {
-		rc = trustee_wait_event(!list_empty(&gcwq->idle_list));
-		while (!list_empty(&gcwq->idle_list))
-			destroy_worker(list_first_entry(&gcwq->idle_list,
-							struct worker, entry));
-	} while (gcwq->nr_workers && rc >= 0);
-
-	/*
-	 * At this point, either draining has completed and no worker
-	 * is left, or cpu down has been canceled or the cpu is being
-	 * brought back up.  There shouldn't be any idle one left.
-	 * Tell the remaining busy ones to rebind once it finishes the
-	 * currently scheduled works by scheduling the rebind_work.
-	 */
-	WARN_ON(!list_empty(&gcwq->idle_list));
-
-	for_each_busy_worker(worker, i, pos, gcwq) {
-		struct work_struct *rebind_work = &worker->rebind_work;
-
-		/*
-		 * Rebind_work may race with future cpu hotplug
-		 * operations.  Use a separate flag to mark that
-		 * rebinding is scheduled.
-		 */
-		worker->flags |= WORKER_REBIND;
-		worker->flags &= ~WORKER_ROGUE;
-
-		/* queue rebind_work, wq doesn't matter, use the default one */
-		if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
-				     work_data_bits(rebind_work)))
-			continue;
+		wait_event_timeout(gcwq->idle_wait,
+				gcwq->nr_idle == gcwq->nr_workers, HZ/10);
 
-		debug_work_activate(rebind_work);
-		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
+		spin_lock_irq(&gcwq->lock);
 	}
 
-	/* relinquish manager role */
-	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+	WARN_ON(gcwq->nr_workers != gcwq->nr_idle);
 
-	/* notify completion */
-	gcwq->trustee = NULL;
-	gcwq->trustee_state = TRUSTEE_DONE;
-	wake_up_all(&gcwq->trustee_wait);
-	spin_unlock_irq(&gcwq->lock);
-	return 0;
-}
+	WARN_ON(gcwq->nr_workers || gcwq->nr_idle);
 
-/**
- * wait_trustee_state - wait for trustee to enter the specified state
- * @gcwq: gcwq the trustee of interest belongs to
- * @state: target state to wait for
- *
- * Wait for the trustee to reach @state.  DONE is already matched.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock) which may be released and regrabbed
- * multiple times.  To be used by cpu_callback.
- */
-static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
-__releases(&gcwq->lock)
-__acquires(&gcwq->lock)
-{
-	if (!(gcwq->trustee_state == state ||
-	      gcwq->trustee_state == TRUSTEE_DONE)) {
-		spin_unlock_irq(&gcwq->lock);
-		__wait_event(gcwq->trustee_wait,
-			     gcwq->trustee_state == state ||
-			     gcwq->trustee_state == TRUSTEE_DONE);
-		spin_lock_irq(&gcwq->lock);
+	spin_unlock_irq(&gcwq->lock);
+
+	gcwq = get_gcwq(get_cpu());
+	spin_lock_irq(&gcwq->lock);
+	list_for_each_entry_safe(work, nw, &non_affine_works, entry) {
+		list_del_init(&work->entry);
+		___queue_work(get_work_cwq(work)->wq, gcwq, work);
 	}
+	spin_unlock_irq(&gcwq->lock);
+	put_cpu();
 }
 
-static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
+static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
 					    unsigned long action,
 					    void *hcpu)
 {
 	unsigned int cpu = (unsigned long)hcpu;
 	struct global_cwq *gcwq = get_gcwq(cpu);
-	struct task_struct *new_trustee = NULL;
-	struct worker *uninitialized_var(new_worker);
-	unsigned long flags;
 
 	action &= ~CPU_TASKS_FROZEN;
 
 	switch (action) {
 	case CPU_DOWN_PREPARE:
-		new_trustee = kthread_create(trustee_thread, gcwq,
-					     "workqueue_trustee/%d\n", cpu);
-		if (IS_ERR(new_trustee))
-			return notifier_from_errno(PTR_ERR(new_trustee));
-		kthread_bind(new_trustee, cpu);
-		/* fall through */
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		new_worker = create_worker(gcwq, false);
-		if (!new_worker) {
-			if (new_trustee)
-				kthread_stop(new_trustee);
-			return NOTIFY_BAD;
-		}
-	}
-
-	/* some are called w/ irq disabled, don't disturb irq status */
-	spin_lock_irqsave(&gcwq->lock, flags);
-
-	switch (action) {
-	case CPU_DOWN_PREPARE:
-		/* initialize trustee and tell it to acquire the gcwq */
-		BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
-		gcwq->trustee = new_trustee;
-		gcwq->trustee_state = TRUSTEE_START;
-		wake_up_process(gcwq->trustee);
-		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
-		/* fall through */
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		gcwq->first_idle = new_worker;
-		break;
-
-	case CPU_DYING:
-		/*
-		 * Before this, the trustee and all workers except for
-		 * the ones which are still executing works from
-		 * before the last CPU down must be on the cpu.  After
-		 * this, they'll all be diasporas.
-		 */
-		gcwq->flags |= GCWQ_DISASSOCIATED;
-		break;
-
-	case CPU_POST_DEAD:
-		gcwq->trustee_state = TRUSTEE_BUTCHER;
-		/* fall through */
-	case CPU_UP_CANCELED:
-		destroy_worker(gcwq->first_idle);
-		gcwq->first_idle = NULL;
-		break;
-
-	case CPU_DOWN_FAILED:
-	case CPU_ONLINE:
-		gcwq->flags &= ~GCWQ_DISASSOCIATED;
-		if (gcwq->trustee_state != TRUSTEE_DONE) {
-			gcwq->trustee_state = TRUSTEE_RELEASE;
-			wake_up_process(gcwq->trustee);
-			wait_trustee_state(gcwq, TRUSTEE_DONE);
-		}
-
-		/*
-		 * Trustee is done and there might be no worker left.
-		 * Put the first_idle in and request a real manager to
-		 * take a look.
-		 */
-		spin_unlock_irq(&gcwq->lock);
-		kthread_bind(gcwq->first_idle->task, cpu);
-		spin_lock_irq(&gcwq->lock);
-		gcwq->flags |= GCWQ_MANAGE_WORKERS;
-		start_worker(gcwq->first_idle);
-		gcwq->first_idle = NULL;
+		flush_gcwq(gcwq);
 		break;
 	}
 
-	spin_unlock_irqrestore(&gcwq->lock, flags);
-
 	return notifier_from_errno(0);
 }
 
@@ -3772,7 +3512,8 @@ static int __init init_workqueues(void)
 	unsigned int cpu;
 	int i;
 
-	cpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE);
+	cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_ACTIVE);
+	hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_INACTIVE);
 
 	/* initialize gcwqs */
 	for_each_gcwq_cpu(cpu) {
@@ -3795,9 +3536,7 @@ static int __init init_workqueues(void)
 			    (unsigned long)gcwq);
 		ida_init(&gcwq->worker_ida);
-
-		gcwq->trustee_state = TRUSTEE_DONE;
-		init_waitqueue_head(&gcwq->trustee_wait);
+		init_waitqueue_head(&gcwq->idle_wait);
 	}
 
 	/* create the initial worker */

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
  2011-10-03  1:15                   ` Tejun Heo
@ 2011-10-03 10:20                     ` Peter Zijlstra
  2011-10-07  1:40                       ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2011-10-03 10:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Sun, 2011-10-02 at 18:15 -0700, Tejun Heo wrote:
> Anyways, I don't think I'm gonna take this one.  There are some
> attractions to the approach - ie. making the users determine whether
> they need strict affinity or not and mandating those users to shut
> down properly from cpu down callbacks and if we're doing this from the
> scratch, this probably would be a sane choice.  But we already have
> tons of users and relatively well tested code.  I don't see compelling
> reasons to perturb that at this point.

So wtf am I going to do with people who want PF_THREAD_BOUND to actually
do what it means? Put a warning in the scheduler code to flag all
violations and let you sort out the workqueue fallout?

I didn't write this change for fun, I actually need to get
PF_THREAD_BOUND back to sanity; this change alone isn't enough, but it
gets rid of the worst abuse. This isn't frivolous perturbation.

> Also, on a quick glance, the change is breaking non-reentrance
> guarantee.

How so? Afaict it does exactly what the trustee thread used to do, or is
it related to the NON_AFFINE moving of the worklets around?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-10-07  1:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Thomas Gleixner

Hello, Peter.

Sorry about the delays.  I just moved and things are still pretty
hectic.

On Mon, Oct 03, 2011 at 12:20:38PM +0200, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 18:15 -0700, Tejun Heo wrote:
> > Anyways, I don't think I'm gonna take this one.  There are some
> > attractions to the approach - ie. making the users determine whether
> > they need strict affinity or not and mandating those users to shut
> > down properly from cpu down callbacks and if we're doing this from the
> > scratch, this probably would be a sane choice.  But we already have
> > tons of users and relatively well tested code.  I don't see compelling
> > reasons to perturb that at this point.
>
> So wtf am I going to do with people who want PF_THREAD_BOUND to actually
> do what it means? Put a warning in the scheduler code to flag all
> violations and let you sort out the workqueue fallout?

The only thing workqueue requires is that set_cpus_allowed_ptr() and
cpuset don't diddle with workers.  We can either add another flag or
just add a check for PF_WQ_WORKER there.

> I didn't write this change for fun, I actually need to get
> PF_THREAD_BOUND back to sanity; this change alone isn't enough, but it
> gets rid of the worst abuse. This isn't frivolous perturbation.

One thing I was curious about, and Steven didn't answer, was whether the
RT requirement prohibits bound threads from changing their own affinity,
because that's currently explicitly allowed whether the new affinity is
a single CPU or any set of CPUs.

> > Also, on a quick glance, the change is breaking non-reentrance
> > guarantee.
>
> How so? Afaict it does exactly what the trustee thread used to do, or is
> it related to the NON_AFFINE moving the worklets around?

Yeap, that was my bad.  I thought you were pushing out works to other
cpus before flushing local.

Can you please re-post a rolled-up patch?  Hopefully at least boot
tested?

One thing that's not clear to me is how the new code is gonna handle
long-running work items.  It seems like the code assumes that all
long-running work items should be queued on a NON_AFFINE workqueue.  Is
that correct?

The trustee thing is pretty complex but the requirements there are
pretty weird for historical reasons.  If it can simplify the code while
following the requirements (or audit and update all users), great, but I
still can't see how it can.

Thanks.

--
tejun
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-10-03 10:33 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Mon, 2011-10-03 at 12:20 +0200, Peter Zijlstra wrote:
> I didn't write this change for fun, I actually need to get
> PF_THREAD_BOUND back to sanity, this change alone isn't enough, but it
> gets rid of the worst abuse. This isn't frivolous perturbation.

Not to mention it gets rid of ~250 lines, and it adds stricter checking
and thereby removes the need to manually flush queues if you really
meant things to be per-cpu.

I mean, really, what's the downside?
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Thomas Gleixner @ 2011-10-03 13:43 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Tejun Heo, Steven Rostedt, LKML

On Mon, 3 Oct 2011, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 12:20 +0200, Peter Zijlstra wrote:
> > I didn't write this change for fun, I actually need to get
> > PF_THREAD_BOUND back to sanity, this change alone isn't enough, but it
> > gets rid of the worst abuse. This isn't frivolous perturbation.
>
> Not to mention it gets rid of ~250 lines, and it adds stricter checking
> and thereby removes the need to manually flush queues if you really
> meant things to be per-cpu.
>
> I mean, really, what's the downside?

Nothing. If code can be simplified and at the same time gets stricter
semantics and becomes less fragile, then there is no valid reason to
refuse such a cleanup.

Thanks,

	tglx
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-10-03 10:50 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Mon, 2011-10-03 at 12:20 +0200, Peter Zijlstra wrote:
> I didn't write this change for fun, I actually need to get
> PF_THREAD_BOUND back to sanity, this change alone isn't enough, but it
> gets rid of the worst abuse.

I think the below patch is sufficient (on top of the previous one), but
I haven't actually built or boot tested it (with the scheduler tests
that yell on a PF_THREAD_BOUND task migrating -- note to self: need to
exclude self migration due to the below).

It also looks like there's some more room for simplification now that we
don't actually need to worry about trying to run on cpus that aren't
there anymore.

---
Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c
+++ linux-2.6/kernel/workqueue.c
@@ -1285,8 +1285,14 @@ __acquires(&gcwq->lock)
 		 * it races with cpu hotunplug operation.  Verify
 		 * against GCWQ_DISASSOCIATED.
 		 */
-		if (!(gcwq->flags & GCWQ_DISASSOCIATED))
+		if (!(gcwq->flags & GCWQ_DISASSOCIATED)) {
+			/*
+			 * Since we're binding to a particular cpu and need to
+			 * stay there for correctness, mark us PF_THREAD_BOUND.
+			 */
+			task->flags |= PF_THREAD_BOUND;
 			set_cpus_allowed_ptr(task, get_cpu_mask(gcwq->cpu));
+		}

 		spin_lock_irq(&gcwq->lock);
 		if (gcwq->flags & GCWQ_DISASSOCIATED)
@@ -1308,6 +1314,18 @@
 	}
 }

+static void worker_unbind_and_unlock(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	struct task_struct *task = worker->task;
+
+	/*
+	 * Its no longer required we're PF_THREAD_BOUND, the work is done.
+	 */
+	task->flags &= ~PF_THREAD_BOUND;
+	spin_unlock_irq(&gcwq->lock);
+}
+
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1370,15 +1388,9 @@ static struct worker *create_worker(stru
 	if (IS_ERR(worker->task))
 		goto fail;

-	/*
-	 * A rogue worker will become a regular one if CPU comes
-	 * online later on.  Make sure every worker has
-	 * PF_THREAD_BOUND set.
-	 */
 	if (bind && !on_unbound_cpu)
 		kthread_bind(worker->task, gcwq->cpu);
 	else {
-		worker->task->flags |= PF_THREAD_BOUND;
 		if (on_unbound_cpu)
 			worker->flags |= WORKER_UNBOUND;
 	}
@@ -2055,7 +2067,7 @@ static int rescuer_thread(void *__wq)
 		if (keep_working(gcwq))
 			wake_up_worker(gcwq);

-		spin_unlock_irq(&gcwq->lock);
+		worker_unbind_and_unlock(rescuer);
 	}

 	schedule();
@@ -3004,7 +3016,6 @@ struct workqueue_struct *__alloc_workque
 		if (IS_ERR(rescuer->task))
 			goto err;

-		rescuer->task->flags |= PF_THREAD_BOUND;
 		wake_up_process(rescuer->task);
 	}
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-09-30  9:23 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

> static void flush_gcwq(struct global_cwq *gcwq)
> {
> 	struct work_struct *work, *nw;
> 	struct worker *worker, *n;
> 	LIST_HEAD(non_affine_works);
>
> 	spin_lock_irq(&gcwq->lock);
> 	list_for_each_entry_safe(work, nw, &gcwq->worklist, entry) {
> 		struct workqueue_struct *wq = get_work_cwq(work)->wq;
>
> 		if (wq->flags & WQ_NON_AFFINE)
> 			list_move(&work->entry, &non_affine_works);
> 	}
>
> 	while (!list_empty(&gcwq->worklist)) {
> 		int nr_works = 0;
>
> 		list_for_each_entry(work, &gcwq->worklist, entry) {
> 			send_mayday(work);
> 			nr_works++;
> 		}
>
> 		list_for_each_entry(worker, &gcwq->idle_list, entry) {
> 			if (!nr_works--)
> 				break;
> 			wake_up_process(worker->task);
> 		}
>
> 		spin_unlock_irq(&gcwq->lock);
>
> 		if (need_to_create_worker(gcwq)) {
> 			worker = create_worker(gcwq, true);
> 			if (worker)
> 				start_worker(worker);
> 		}
>
> 		wait_event_timeout(gcwq->idle_wait,
> 				   gcwq->nr_idle == gcwq->nr_workers, HZ/10);
>
> 		spin_lock_irq(&gcwq->lock);
> 	}

We could probably replace that loop with one of the flush/barrier
thingies, but I got lost and wanted to post something. Using a proper
flush would also get rid of that icky waitqueue.

>
> 	WARN_ON(gcwq->nr_workers != gcwq->nr_idle);
>
> 	list_for_each_entry_safe(worker, n, &gcwq->idle_list, entry)
> 		destroy_worker(worker);
>
> 	WARN_ON(gcwq->nr_workers || gcwq->nr_idle);
>
> 	spin_unlock_irq(&gcwq->lock);
>
> 	gcwq = get_gcwq(get_cpu());
> 	spin_lock_irq(&gcwq->lock);
> 	list_for_each_entry_safe(work, nw, &non_affine_works, entry) {
> 		list_del_init(&work->entry);
> 		___queue_work(get_work_cwq(work)->wq, gcwq, work);
> 	}
> 	spin_unlock_irq(&gcwq->lock);
> 	put_cpu();
> }
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-09-30  9:29 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Fri, 2011-09-30 at 10:34 +0200, Peter Zijlstra wrote:
> Introduce a new WQ_NON_AFFINE flag that indicates a per-cpu workqueue
> will not respect cpu affinity and use this to migrate all its pending
> works to whatever cpu is doing cpu-down.

Forgot to mention, it will also dis-allow queue_work_on() usage for
these workqueues, since that API implies you actually care what cpu the
work runs on, which cannot be guaranteed.
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Peter Zijlstra @ 2011-09-30  9:27 UTC (permalink / raw)
To: Tejun Heo; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Thu, 2011-09-29 at 20:48 -0700, Tejun Heo wrote:
> IIRC, this was because there was no way to set PF_THREAD_BOUND once a
> kthread starts to run and workers can stay active across CPU bring
> down/up cycle.  Per-cpu kthreads need PF_THREAD_BOUND to prevent cpu
> affinity manipulation by third party for correctness.

But that's the whole point, isn't it? You mark threads that aren't
strictly per-cpu with that. Aside from the unplug trainwreck, you also
mark your unbound workers with that. There is no correctness issue what
so ever with those, and userspace moving them about doesn't matter one
whit.
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-10-03  1:22 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Thomas Gleixner

On Fri, Sep 30, 2011 at 11:27:45AM +0200, Peter Zijlstra wrote:
> There is no correctness issue what so ever with those, and userspace
> moving them about doesn't matter one whit.

I wanted to make it consistent and disallow diddling wq worker affinity
from userland.  As those workers are shared and managed dynamically, it
didn't make sense to allow affinity manipulation.

That said, it might make sense allowing control over affinity of
unbound wq workers for things like cpu isolation.  Userspace doing that
per-worker doesn't make whole lot of sense.  Allowing putting unbound
workers automatically into a cpuset would work better, I think.

Thanks.

--
tejun
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-09-30  3:55 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

On Tue, Sep 27, 2011 at 05:17:34PM -0400, Steven Rostedt wrote:
> For convenience and optimization, we are going to use the task's flag
> PF_THREAD_BOUND as a way to know if a task is bound to a CPU or not. As
> that is what the flag means.

And I'm not sure this is a very good idea.  Until now, PF_THREAD_BOUND
meant "this task is explicitly bound to the current affinity (whatever
it is) and thus it shouldn't be manipulated by third party".  The
suggested change makes the flag much less orthogonal and useful.  How
much of convenience and optimization are we talking about?

Thanks.

--
tejun
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Steven Rostedt @ 2011-09-30  4:08 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

On Thu, 2011-09-29 at 20:55 -0700, Tejun Heo wrote:
> On Tue, Sep 27, 2011 at 05:17:34PM -0400, Steven Rostedt wrote:
> > For convenience and optimization, we are going to use the task's flag
> > PF_THREAD_BOUND as a way to know if a task is bound to a CPU or not. As
> > that is what the flag means.
>
> And I'm not sure this is a very good idea.  Until now, PF_THREAD_BOUND
> meant "this task is explicitly bound to the current affinity (whatever
> it is) and thus it shouldn't be manipulated by third party".

That's exactly what we want it to mean. Not manipulated by workqueue or
anyone else. We don't plan on setting this flag if a task just happens
to have only one CPU in its affinity. We only set this flag if the task
is bound to a CPU for good. When we see this flag set, we can therefore
assume that it will not migrate!

> The
> suggested change makes the flag much less orthogonal and useful.  How
> much of convenience and optimization are we talking about?

Enough to not have to manually pin the task to the cpu at every lock.

-- Steve
* Re: [PATCH] sched/kthread: Complain loudly when others violate our flags
From: Tejun Heo @ 2011-09-30  4:28 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Peter Zijlstra, Thomas Gleixner

Hello,

On Fri, Sep 30, 2011 at 12:08:45AM -0400, Steven Rostedt wrote:
> On Thu, 2011-09-29 at 20:55 -0700, Tejun Heo wrote:
> > On Tue, Sep 27, 2011 at 05:17:34PM -0400, Steven Rostedt wrote:
> > > For convenience and optimization, we are going to use the task's flag
> > > PF_THREAD_BOUND as a way to know if a task is bound to a CPU or not. As
> > > that is what the flag means.
> >
> > And I'm not sure this is a very good idea.  Until now, PF_THREAD_BOUND
> > meant "this task is explicitly bound to the current affinity (whatever
> > it is) and thus it shouldn't be manipulated by third party".
>
> That's exactly what we want it to mean. Not manipulated by workqueue or
> anyone else.

Umm... those are wq workers and wq is third party?  I'm sorry that you
wasted your time but let's please drop the bickering over the
interpretation of the flag.

> We don't plan on setting this flag if a task just happens
> to have only one CPU in its affinity. We only set this flag if the task
> is bound to a CPU for good. When we see this flag set, we can therefore
> assume that it will not migrate!

Up to now, it did NOT mean that.  It just blocked out affinity
manipulation by third party and cpuset attach.  From what you write, I
think the added assumptions are...

* The affinity should be confined to single CPU.

* Once set, the flag can't be cleared and affinity can't be changed by
  anyone.

Am I right?

> > suggested change makes the flag much less orthogonal and useful.  How
> > much of convenience and optimization are we talking about?
>
> Enough to not have to manually pin the task to the cpu at every lock.

As I wrote before, we can test PF_WQ_WORKER too where appropriate
(probably just set_cpus_allowed_ptr) and avoid using PF_THREAD_BOUND
for wq workers.  We can also add a new flag, but with wq being the only
user, that probably is an overkill.

Thank you.

--
tejun
end of thread, other threads: [~2011-10-07  1:40 UTC | newest]

Thread overview: 21+ messages
2011-09-27 21:17 [PATCH] sched/kthread: Complain loudly when others violate our flags Steven Rostedt
2011-09-28  0:22 ` Thomas Gleixner
2011-09-30  3:48 ` Tejun Heo
2011-09-30  4:05 ` Steven Rostedt
2011-09-30  4:14 ` Tejun Heo
2011-09-30  8:34 ` Peter Zijlstra
2011-09-30  8:55 ` Tejun Heo
2011-09-30  9:04 ` Peter Zijlstra
2011-10-03  1:15 ` Tejun Heo
2011-10-03 10:20 ` Peter Zijlstra
2011-10-07  1:40 ` Tejun Heo
2011-10-03 10:33 ` Peter Zijlstra
2011-10-03 13:43 ` Thomas Gleixner
2011-10-03 10:50 ` Peter Zijlstra
2011-09-30  9:23 ` Peter Zijlstra
2011-09-30  9:29 ` Peter Zijlstra
2011-09-30  9:27 ` Peter Zijlstra
2011-10-03  1:22 ` Tejun Heo
2011-09-30  3:55 ` Tejun Heo
2011-09-30  4:08 ` Steven Rostedt
2011-09-30  4:28 ` Tejun Heo