public inbox for linux-kernel@vger.kernel.org
* [PATCHSET] concurrency managed workqueue, take#3
@ 2010-01-18  0:57 Tejun Heo
From: Tejun Heo @ 2010-01-18  0:57 UTC (permalink / raw)
  To: torvalds, mingo, peterz, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, all.

This is the third take of cmwq (concurrency managed workqueue)
patchset.  It's on top of the current linus#master
066000dd856709b6980123eb39b957fe26993f7b (v2.6.33-rc3).  Git tree is
available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Quilt series is available at

  http://master.kernel.org/~tj/patches/review-cmwq.tar.gz

Changes from the last take[L]
=============================

* Scheduler code to select the fallback cpu has changed and caused a
  problem with kthread_bind()ing from CPU_DOWN_PREPARE.  It is fixed
  by adding 0001-sched-consult-online-mask-instead-of-active-in-selec.patch.

* 0002-0028 haven't changed but are included for completeness.

* 0029-0040 are added to convert libata, async, fscache, cifs and
  gfs2 to use workqueues and to kill slow-work, which is left without
  any user after the conversions.

New patches in this series are

 0001-sched-consult-online-mask-instead-of-active-in-selec.patch
 0029-workqueue-add-system_wq-and-system_single_wq.patch
 0030-workqueue-implement-work_busy.patch
 0031-libata-take-advantage-of-cmwq-and-remove-concurrency.patch
 0032-async-introduce-workqueue-based-alternative-implemen.patch
 0033-async-convert-async-users-to-use-the-new-implementat.patch
 0034-async-kill-original-implementation.patch
 0035-fscache-convert-object-to-use-workqueue-instead-of-s.patch
 0036-fscache-convert-operation-to-use-workqueue-instead-o.patch
 0037-fscache-drop-references-to-slow-work.patch
 0038-cifs-use-workqueue-instead-of-slow-work.patch
 0039-gfs2-use-workqueue-instead-of-slow-work.patch
 0040-slow-work-kill-it.patch

0001 is the aforementioned scheduler fix.

0029-0030 prepare wq for conversions.
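As a rough illustration of what 0029 makes possible, a driver can
queue onto the shared system_wq instead of creating its own workqueue.
This sketch is hypothetical (my_work_fn and the call site are made-up
names, not from the patches):

	#include <linux/workqueue.h>

	static void my_work_fn(struct work_struct *work)
	{
		/* executed from the shared per-cpu worker pool */
	}

	static DECLARE_WORK(my_work, my_work_fn);

	static void kick_my_work(void)
	{
		/* no dedicated workqueue needed anymore */
		queue_work(system_wq, &my_work);
	}
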

0031 converts libata to use cmwq and removes its concurrency
limitations.

0032-0034 reimplement async using two workqueues.

0035-0037 convert fscache to use workqueues instead of slow-work.

0038-0039 convert cifs and gfs2 to use workqueues instead of
slow-work.

0040 kills slow-work, which no longer has any user.

Please note that slow-work conversion is missing a couple of
capabilities.

* sysctls to control concurrency level.

* workqueue busyness notification, which was used to make fscache
  works yield the context and retry instead of waiting while holding
  the context.

The former can easily be added.  The latter isn't difficult to add
either but I was a bit doubtful about its usefulness.  David, do you
think this is really needed?

With the above omissions and the removal of the slow-work
documentation, the whole series ends up reducing the line count by
around a hundred lines.  I'll append the diffstat output at the end
of this email.

The libata conversion removes 13 lines of code while lifting two
annoying concurrency limitations.

The new async implementation is shorter by about two hundred lines
while providing about the same capability and removing a dedicated
thread pool.

Although there are some minor differences, the capability provided by
slow-work is basically identical to that provided by cmwq.  Other
than a few places which depend on slow-work specific features, the
conversion of slow-work users to cmwq is fairly straightforward.  The
ref count is incremented on queue and decremented at the end of the
callback.  Module draining is replaced with workqueue flushing.
Concurrency limit is replaced with max_active.  The removal of
slow-work brings in the largest code reduction of about 2000 lines and
removes yet another dedicated thread pool.
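The conversion pattern above might look like the following sketch.
All names here (my_obj, my_wq, my_obj_release) are hypothetical
stand-ins, not code from the actual patches:

	struct my_obj {
		struct kref		ref;
		struct work_struct	work;	/* INIT_WORK()ed at alloc */
	};

	static struct workqueue_struct *my_wq;	/* created with max_active
						 * replacing the slow-work
						 * concurrency limit */

	static void my_obj_release(struct kref *ref)
	{
		kfree(container_of(ref, struct my_obj, ref));
	}

	static void my_obj_work(struct work_struct *work)
	{
		struct my_obj *obj = container_of(work, struct my_obj, work);

		/* ... what the slow-work execute callback used to do ... */

		/* the ref taken on queueing is dropped at the end */
		kref_put(&obj->ref, my_obj_release);
	}

	static void my_obj_queue(struct my_obj *obj)
	{
		kref_get(&obj->ref);		/* ref incremented on queue */
		if (!queue_work(my_wq, &obj->work))
			kref_put(&obj->ref, my_obj_release); /* already pending */
	}

	static void my_module_exit(void)
	{
		/* module draining is replaced with workqueue flushing */
		flush_workqueue(my_wq);
		destroy_workqueue(my_wq);
	}
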

slow-work is probably the largest chunk which can be replaced by
cmwq, but, as the libata case shows, small conversions can bring
noticeable benefits, and there are other places which have had to
deal with similar limitations.

Please note that the slow-work conversions haven't been signed off
yet.  Those changes need careful review from David before going
anywhere.

Performance test
================

Another issue raised was performance.  I tried a few things but
couldn't find a realistic and easy test scenario which could expose
wq performance differences.  As many have pointed out, wq just isn't
a very hot path.  I ended up writing a simplistic wq load generator.

wq workload is generated by the perf-wq.c module, a very simple
synthetic wq load generator (I'll post it as a reply to this message).
A work is described by four parameters - burn_usecs,
mean_sleep_msecs, mean_resched_msecs and factor.  It randomly splits
burn_usecs into
two, burns the first part, sleeps for 0 - 2 * mean_sleep_msecs, burns
what's left of burn_usecs and then reschedules itself in 0 - 2 *
mean_resched_msecs.  factor is used to tune the number of cycles to
match execution duration.
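The burn/sleep/burn/resched cycle described above might translate
into something like the sketch below.  perf-wq.c itself is posted
separately in this thread, so every name here (struct perf_work,
perf_work_fn) is a guess for illustration, not the actual code, and
the factor-based cycle tuning is omitted:

	struct perf_work {
		struct delayed_work	dwork;	/* requeues itself */
		struct workqueue_struct	*wq;
		unsigned int		burn_usecs;
		unsigned int		mean_sleep_msecs;
		unsigned int		mean_resched_msecs;
		unsigned int		cycles;
	};

	static void perf_work_fn(struct work_struct *work)
	{
		struct perf_work *pw = container_of(to_delayed_work(work),
						    struct perf_work, dwork);
		unsigned int first = random32() % (pw->burn_usecs + 1);

		udelay(first);				/* burn first part */
		msleep(random32() % (2 * pw->mean_sleep_msecs));
		udelay(pw->burn_usecs - first);		/* burn what's left */

		if (--pw->cycles > 0)		/* resched in 0 - 2 * mean */
			queue_delayed_work(pw->wq, &pw->dwork,
				msecs_to_jiffies(random32() %
					(2 * pw->mean_resched_msecs)));
	}
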

It issues three types of works - short, medium and long, each with two
burn durations L and S.

	burn/L(us)	burn/S(us)	mean_sleep(ms)	mean_resched(ms) cycles
 short	50		1		1		10		 454
 medium	50		2		10		50		 125
 long	50		4		100		250		 42

And then these works are put into the following workloads.  The lower
numbered workloads have more short/medium works.

 workload 0
 * 12 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 1
 *  8 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 2
 *  4 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 3
 *  2 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 4
 *  2 wqs with 4 short works
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 5
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

The above wq loads are run in parallel with mencoder converting a 76M
mjpeg file into mpeg4, which takes 25.59 seconds with a standard
deviation of 0.19 without wq loading.  The CPU was an Intel Netburst
Celeron running at 2.66GHz (chosen for its small cache size and
slowness).  wl0 and wl1 were tested only with burn/S.  Each test case
was run 11 times and the first run was discarded.

	 vanilla/L	cmwq/L		vanilla/S	cmwq/S
 wl0					26.18 d0.24	26.27 d0.29
 wl1					26.50 d0.45	26.52 d0.23
 wl2	26.62 d0.35	26.53 d0.23	26.14 d0.22	26.12 d0.32
 wl3	26.30 d0.25	26.29 d0.26	25.94 d0.25	26.17 d0.30
 wl4	26.26 d0.23	25.93 d0.24	25.90 d0.23	25.91 d0.29
 wl5	25.81 d0.33	25.88 d0.25	25.63 d0.27	25.59 d0.26

There is no significant difference between the two.  Maybe the code
overhead and the benefits coming from context sharing cancel each
other out nicely.  With longer burns, cmwq looks better, but by
nothing significant.  With shorter burns, other than wl3 spiking up
for vanilla, which probably would go away if the test were repeated,
the two perform virtually identically.

The above is an exaggerated synthetic test, and the performance
difference will be even less noticeable in either direction under
realistic workloads.

cmwq extends workqueue such that it can serve as a robust async
mechanism which can be used (mostly) universally without introducing
any noticeable performance degradation.

Thanks.

diffstat
========
 Documentation/slow-work.txt   |  322 -----
 arch/ia64/kernel/smpboot.c    |    2 
 arch/ia64/kvm/Kconfig         |    1 
 arch/powerpc/kvm/Kconfig      |    1 
 arch/s390/kvm/Kconfig         |    1 
 arch/x86/kernel/smpboot.c     |    2 
 arch/x86/kvm/Kconfig          |    1 
 drivers/acpi/battery.c        |    4 
 drivers/acpi/osl.c            |   41 
 drivers/ata/libata-core.c     |   50 
 drivers/ata/libata-eh.c       |    4 
 drivers/ata/libata-scsi.c     |   11 
 drivers/ata/libata.h          |    1 
 drivers/ata/pata_legacy.c     |    2 
 drivers/base/core.c           |    2 
 drivers/base/dd.c             |    2 
 drivers/md/raid5.c            |    4 
 drivers/s390/block/dasd.c     |    4 
 drivers/scsi/sd.c             |    8 
 fs/cachefiles/namei.c         |   28 
 fs/cachefiles/rdwr.c          |    4 
 fs/cifs/Kconfig               |    1 
 fs/cifs/cifsfs.c              |    6 
 fs/cifs/cifsglob.h            |    8 
 fs/cifs/dir.c                 |    2 
 fs/cifs/file.c                |   22 
 fs/cifs/misc.c                |   15 
 fs/fscache/Kconfig            |    1 
 fs/fscache/internal.h         |    2 
 fs/fscache/main.c             |   25 
 fs/fscache/object-list.c      |   12 
 fs/fscache/object.c           |   67 -
 fs/fscache/operation.c        |   67 -
 fs/fscache/page.c             |   36 
 fs/gfs2/Kconfig               |    1 
 fs/gfs2/incore.h              |    3 
 fs/gfs2/main.c                |    9 
 fs/gfs2/ops_fstype.c          |    8 
 fs/gfs2/recovery.c            |   52 
 fs/gfs2/recovery.h            |    4 
 fs/gfs2/sys.c                 |    3 
 include/linux/async.h         |   17 
 include/linux/fscache-cache.h |   49 
 include/linux/kvm_host.h      |    4 
 include/linux/libata.h        |    2 
 include/linux/preempt.h       |   48 
 include/linux/sched.h         |   71 -
 include/linux/slow-work.h     |  163 --
 include/linux/stop_machine.h  |    6 
 include/linux/workqueue.h     |  109 +
 init/Kconfig                  |   28 
 init/do_mounts.c              |    2 
 init/main.c                   |    4 
 kernel/Makefile               |    2 
 kernel/async.c                |  393 +-----
 kernel/irq/autoprobe.c        |    2 
 kernel/module.c               |    4 
 kernel/power/process.c        |   21 
 kernel/sched.c                |  334 +++--
 kernel/slow-work-debugfs.c    |  227 ---
 kernel/slow-work.c            | 1068 ----------------
 kernel/slow-work.h            |   72 -
 kernel/stop_machine.c         |  151 +-
 kernel/sysctl.c               |    8 
 kernel/trace/Kconfig          |    4 
 kernel/workqueue.c            | 2697 ++++++++++++++++++++++++++++++++++++------
 virt/kvm/kvm_main.c           |   26 
 67 files changed, 3120 insertions(+), 3231 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/929641



Thread overview: 102+ messages
2010-01-18  0:57 [PATCHSET] concurrency managed workqueue, take#3 Tejun Heo
2010-01-18  0:57 ` [PATCH 01/40] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
2010-01-18 10:13   ` Peter Zijlstra
2010-01-18 11:26     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 02/40] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 03/40] sched: refactor try_to_wake_up() Tejun Heo
2010-01-18  0:57 ` [PATCH 04/40] sched: implement __set_cpus_allowed() Tejun Heo
2010-01-18  9:56   ` Peter Zijlstra
2010-01-18 11:22     ` Tejun Heo
2010-01-18 11:41       ` Peter Zijlstra
2010-01-19  1:07         ` Tejun Heo
2010-01-19  8:37           ` Peter Zijlstra
2010-01-20  8:35             ` Tejun Heo
2010-01-20  8:50               ` Peter Zijlstra
2010-01-20  9:00                 ` Tejun Heo
2010-01-20  8:59                   ` Peter Zijlstra
2010-01-24  8:18               ` Tejun Heo
2010-01-18  0:57 ` [PATCH 05/40] sched: make sched_notifiers unconditional Tejun Heo
2010-01-18  0:57 ` [PATCH 06/40] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
2010-01-18  9:57   ` Peter Zijlstra
2010-01-18 11:31     ` Tejun Heo
2010-01-18 12:49       ` Peter Zijlstra
2010-01-19  1:04         ` Tejun Heo
2010-01-19  8:28           ` Tejun Heo
2010-01-19  8:55             ` Peter Zijlstra
2010-01-20  8:47               ` Tejun Heo
2010-01-18  0:57 ` [PATCH 07/40] sched: implement try_to_wake_up_local() Tejun Heo
2010-01-18  0:57 ` [PATCH 08/40] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
2010-01-18  0:57 ` [PATCH 09/40] stop_machine: reimplement without using workqueue Tejun Heo
2010-01-18  0:57 ` [PATCH 10/40] workqueue: misc/cosmetic updates Tejun Heo
2010-01-18  0:57 ` [PATCH 11/40] workqueue: merge feature parameters into flags Tejun Heo
2010-01-18  0:57 ` [PATCH 12/40] workqueue: define both bit position and mask for work flags Tejun Heo
2010-01-18  0:57 ` [PATCH 13/40] workqueue: separate out process_one_work() Tejun Heo
2010-01-18  0:57 ` [PATCH 14/40] workqueue: temporarily disable workqueue tracing Tejun Heo
2010-01-18  0:57 ` [PATCH 15/40] workqueue: kill cpu_populated_map Tejun Heo
2010-01-18  0:57 ` [PATCH 16/40] workqueue: update cwq alignement Tejun Heo
2010-01-18  0:57 ` [PATCH 17/40] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
2010-01-18  0:57 ` [PATCH 18/40] workqueue: introduce worker Tejun Heo
2010-01-18  0:57 ` [PATCH 19/40] workqueue: reimplement work flushing using linked works Tejun Heo
2010-01-18  0:57 ` [PATCH 20/40] workqueue: implement per-cwq active work limit Tejun Heo
2010-01-18  0:57 ` [PATCH 21/40] workqueue: reimplement workqueue freeze using max_active Tejun Heo
2010-01-18  0:57 ` [PATCH 22/40] workqueue: introduce global cwq and unify cwq locks Tejun Heo
2010-01-18  0:57 ` [PATCH 23/40] workqueue: implement worker states Tejun Heo
2010-01-18  0:57 ` [PATCH 24/40] workqueue: reimplement CPU hotplugging support using trustee Tejun Heo
2010-01-18  0:57 ` [PATCH 25/40] workqueue: make single thread workqueue shared worker pool friendly Tejun Heo
2010-01-18  0:57 ` [PATCH 26/40] workqueue: use shared worklist and pool all workers per cpu Tejun Heo
2010-01-18  0:57 ` [PATCH 27/40] workqueue: implement concurrency managed dynamic worker pool Tejun Heo
2010-01-18  0:57 ` [PATCH 28/40] workqueue: increase max_active of keventd and kill current_is_keventd() Tejun Heo
2010-01-18  0:57 ` [PATCH 29/40] workqueue: add system_wq and system_single_wq Tejun Heo
2010-01-18  0:57 ` [PATCH 30/40] workqueue: implement work_busy() Tejun Heo
2010-01-18  2:52   ` Andy Walls
2010-01-18  5:41     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 31/40] libata: take advantage of cmwq and remove concurrency limitations Tejun Heo
2010-01-18 15:48   ` Stefan Richter
2010-01-19  0:49     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 32/40] async: introduce workqueue based alternative implementation Tejun Heo
2010-01-18  6:01   ` Arjan van de Ven
2010-01-18  8:49     ` Tejun Heo
2010-01-18 15:25       ` Arjan van de Ven
2010-01-19  0:57         ` Tejun Heo
2010-01-19  0:57           ` Arjan van de Ven
2010-01-19  7:56             ` Tejun Heo
2010-01-19 14:37               ` Arjan van de Ven
2010-01-20  0:19                 ` Tejun Heo
2010-01-20  0:31                   ` Arjan van de Ven
2010-01-20  2:08                     ` Tejun Heo
2010-01-20  6:03                       ` Arjan van de Ven
2010-01-20  8:24                         ` Tejun Heo
2010-01-22 10:59                           ` [PATCH] async: use workqueue for worker pool Tejun Heo
2010-01-18  0:57 ` [PATCH 33/40] async: convert async users to use the new implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 34/40] async: kill original implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 35/40] fscache: convert object to use workqueue instead of slow-work Tejun Heo
2010-02-12 18:03   ` David Howells
2010-02-13  5:43     ` Tejun Heo
2010-02-15 15:04       ` David Howells
2010-02-16  3:40         ` Tejun Heo
2010-02-16  3:59           ` Tejun Heo
2010-02-16 18:05           ` David Howells
2010-02-16 23:50             ` Tejun Heo
2010-02-18 11:50               ` David Howells
2010-02-18 12:33                 ` Tejun Heo
2010-01-18  0:57 ` [PATCH 36/40] fscache: convert operation " Tejun Heo
2010-01-18  0:57 ` [PATCH 37/40] fscache: drop references to slow-work Tejun Heo
2010-01-18  0:57 ` [PATCH 38/40] cifs: use workqueue instead of slow-work Tejun Heo
2010-01-19 12:20   ` Jeff Layton
2010-01-20  0:15     ` Tejun Heo
2010-01-20  0:56       ` Jeff Layton
2010-01-20  1:23         ` Tejun Heo
2010-01-22 11:14           ` [PATCH UPDATED " Tejun Heo
2010-01-22 11:45             ` Jeff Layton
2010-01-24  8:25               ` Tejun Heo
2010-01-24 12:13                 ` Jeff Layton
2010-01-25 15:25                   ` Tejun Heo
2010-01-18  0:57 ` [PATCH 39/40] gfs2: " Tejun Heo
2010-01-18  9:45   ` Steven Whitehouse
2010-01-18 11:24     ` Tejun Heo
2010-01-18 12:07       ` Steven Whitehouse
2010-01-19  1:00         ` Tejun Heo
2010-01-19  8:46           ` [PATCH UPDATED " Tejun Heo
2010-01-18  0:57 ` [PATCH 40/40] slow-work: kill it Tejun Heo
2010-01-18  1:03 ` perf-wq.c used to generate synthetic workload Tejun Heo
2010-01-18 16:13 ` [PATCHSET] concurrency managed workqueue, take#3 Stefan Richter
