public inbox for linux-kernel@vger.kernel.org
* [PATCHSET] concurrency managed workqueue, take#3
@ 2010-01-18  0:57 Tejun Heo
From: Tejun Heo @ 2010-01-18  0:57 UTC (permalink / raw)
  To: torvalds, mingo, peterz, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, all.

This is the third take of cmwq (concurrency managed workqueue)
patchset.  It's on top of the current linus#master
066000dd856709b6980123eb39b957fe26993f7b (v2.6.33-rc3).  Git tree is
available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Quilt series is available at

  http://master.kernel.org/~tj/patches/review-cmwq.tar.gz

Changes from the last take[L]
=============================

* Scheduler code to select the fallback cpu has changed and caused a
  problem with kthread_bind()ing from CPU_DOWN_PREPARE.  It is fixed
  by adding 0001-sched-consult-online-mask-instead-of-active-in-selec.patch.

* 0002-0028 haven't changed but are included for completeness.

* 0029-0040 are added to convert libata, async, fscache, cifs and
  gfs2 to use workqueues and to kill slow-work, which is left without
  any user after the conversions.

New patches in this series are

 0001-sched-consult-online-mask-instead-of-active-in-selec.patch
 0029-workqueue-add-system_wq-and-system_single_wq.patch
 0030-workqueue-implement-work_busy.patch
 0031-libata-take-advantage-of-cmwq-and-remove-concurrency.patch
 0032-async-introduce-workqueue-based-alternative-implemen.patch
 0033-async-convert-async-users-to-use-the-new-implementat.patch
 0034-async-kill-original-implementation.patch
 0035-fscache-convert-object-to-use-workqueue-instead-of-s.patch
 0036-fscache-convert-operation-to-use-workqueue-instead-o.patch
 0037-fscache-drop-references-to-slow-work.patch
 0038-cifs-use-workqueue-instead-of-slow-work.patch
 0039-gfs2-use-workqueue-instead-of-slow-work.patch
 0040-slow-work-kill-it.patch

0001 is the aforementioned scheduler fix.

0029-0030 prepare wq for conversions.
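As a rough illustration of what 0029 makes possible, a driver can
queue onto the shared system_wq instead of creating its own workqueue.
This sketch is hypothetical (my_work_fn and the call site are made-up
names, not from the patches):

	#include <linux/workqueue.h>

	static void my_work_fn(struct work_struct *work)
	{
		/* executed from the shared per-cpu worker pool */
	}

	static DECLARE_WORK(my_work, my_work_fn);

	static void kick_my_work(void)
	{
		/* no dedicated workqueue needed anymore */
		queue_work(system_wq, &my_work);
	}
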

0031 converts libata to use cmwq and removes its concurrency
limitations.

0032-0034 reimplement async using two workqueues.

0035-0037 convert fscache to use workqueues instead of slow-work.

0038-0039 convert cifs and gfs2 to use workqueues instead of
slow-work.

0040 kills slow-work, which no longer has any user.

Please note that slow-work conversion is missing a couple of
capabilities.

* sysctls to control concurrency level.

* workqueue busyness notification, which was used to make fscache
  works yield the context and retry instead of waiting while holding
  the context.

The former can easily be added.  The latter isn't difficult to add
either but I was a bit doubtful about its usefulness.  David, do you
think this is really needed?

With the above omissions and the removal of the slow-work
documentation, the whole series ends up reducing the line count by
around a hundred lines.  I'll append the diffstat output at the end
of this email.

The libata conversion removes 13 lines of code while lifting two
annoying concurrency limitations.

The new async implementation is shorter by about two hundred lines
while providing about the same capability and removing a dedicated
thread pool.

Although there are some minor differences, the capability provided by
slow-work is basically identical to that provided by cmwq.  Other
than a few places which depend on slow-work specific features, the
conversion of slow-work users to cmwq is fairly straightforward.  The
ref count is incremented on queue and decremented at the end of the
callback.  Module draining is replaced with workqueue flushing.
Concurrency limit is replaced with max_active.  The removal of
slow-work brings in the largest code reduction of about 2000 lines and
removes yet another dedicated thread pool.
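The conversion pattern above might look like the following sketch.
All names here (my_obj, my_wq, my_obj_release) are hypothetical
stand-ins, not code from the actual patches:

	struct my_obj {
		struct kref		ref;
		struct work_struct	work;	/* INIT_WORK()ed at alloc */
	};

	static struct workqueue_struct *my_wq;	/* created with max_active
						 * replacing the slow-work
						 * concurrency limit */

	static void my_obj_release(struct kref *ref)
	{
		kfree(container_of(ref, struct my_obj, ref));
	}

	static void my_obj_work(struct work_struct *work)
	{
		struct my_obj *obj = container_of(work, struct my_obj, work);

		/* ... what the slow-work execute callback used to do ... */

		/* the ref taken on queueing is dropped at the end */
		kref_put(&obj->ref, my_obj_release);
	}

	static void my_obj_queue(struct my_obj *obj)
	{
		kref_get(&obj->ref);		/* ref incremented on queue */
		if (!queue_work(my_wq, &obj->work))
			kref_put(&obj->ref, my_obj_release); /* already pending */
	}

	static void my_module_exit(void)
	{
		/* module draining is replaced with workqueue flushing */
		flush_workqueue(my_wq);
		destroy_workqueue(my_wq);
	}
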

slow-work is probably the largest chunk which can be replaced by
cmwq, but, as the libata case shows, small conversions can bring
noticeable benefits, and there are other places which have had to
deal with similar limitations.

Please note that the slow-work conversions haven't been signed off
yet.  Those changes need careful review from David before going
anywhere.

Performance test
================

Another issue raised was performance.  I tried a few things but
couldn't find a realistic and easy test scenario which could expose
wq performance differences.  As many have pointed out, wq just isn't
a very hot path.  I ended up writing a simplistic wq load generator.

wq workload is generated by the perf-wq.c module, a very simple
synthetic wq load generator (I'll post it as a reply to this message).
A work is described by four parameters - burn_usecs,
mean_sleep_msecs, mean_resched_msecs and factor.  It randomly splits
burn_usecs into
two, burns the first part, sleeps for 0 - 2 * mean_sleep_msecs, burns
what's left of burn_usecs and then reschedules itself in 0 - 2 *
mean_resched_msecs.  factor is used to tune the number of cycles to
match execution duration.
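The burn/sleep/burn/resched cycle described above might translate
into something like the sketch below.  perf-wq.c itself is posted
separately in this thread, so every name here (struct perf_work,
perf_work_fn) is a guess for illustration, not the actual code, and
the factor-based cycle tuning is omitted:

	struct perf_work {
		struct delayed_work	dwork;	/* requeues itself */
		struct workqueue_struct	*wq;
		unsigned int		burn_usecs;
		unsigned int		mean_sleep_msecs;
		unsigned int		mean_resched_msecs;
		unsigned int		cycles;
	};

	static void perf_work_fn(struct work_struct *work)
	{
		struct perf_work *pw = container_of(to_delayed_work(work),
						    struct perf_work, dwork);
		unsigned int first = random32() % (pw->burn_usecs + 1);

		udelay(first);				/* burn first part */
		msleep(random32() % (2 * pw->mean_sleep_msecs));
		udelay(pw->burn_usecs - first);		/* burn what's left */

		if (--pw->cycles > 0)		/* resched in 0 - 2 * mean */
			queue_delayed_work(pw->wq, &pw->dwork,
				msecs_to_jiffies(random32() %
					(2 * pw->mean_resched_msecs)));
	}
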

It issues three types of works - short, medium and long, each with two
burn durations L and S.

	burn/L(us)	burn/S(us)	mean_sleep(ms)	mean_resched(ms) cycles
 short	50		1		1		10		 454
 medium	50		2		10		50		 125
 long	50		4		100		250		 42

And then these works are put into the following workloads.  The lower
numbered workloads have more short/medium works.

 workload 0
 * 12 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 1
 *  8 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 2
 *  4 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 3
 *  2 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 4
 *  2 wqs with 4 short works
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 5
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

The above wq loads are run in parallel with mencoder converting a 76M
mjpeg file into mpeg4, which takes 25.59 seconds with a standard
deviation of 0.19 without wq loading.  The CPU was an Intel Netburst
Celeron running at 2.66GHz (chosen for its small cache size and
slowness).  wl0 and wl1 were tested only with burn/S.  Each test case
was run 11 times and the first run was discarded.

	 vanilla/L	cmwq/L		vanilla/S	cmwq/S
 wl0					26.18 d0.24	26.27 d0.29
 wl1					26.50 d0.45	26.52 d0.23
 wl2	26.62 d0.35	26.53 d0.23	26.14 d0.22	26.12 d0.32
 wl3	26.30 d0.25	26.29 d0.26	25.94 d0.25	26.17 d0.30
 wl4	26.26 d0.23	25.93 d0.24	25.90 d0.23	25.91 d0.29
 wl5	25.81 d0.33	25.88 d0.25	25.63 d0.27	25.59 d0.26

There is no significant difference between the two.  Maybe the code
overhead and the benefits coming from context sharing cancel each
other out nicely.  With longer burns, cmwq looks better, but by
nothing significant.  With shorter burns, other than wl3 spiking up
for vanilla, which probably would go away if the test were repeated,
the two perform virtually identically.

The above is an exaggerated synthetic test, and the performance
difference will be even less noticeable in either direction under
realistic workloads.

cmwq extends workqueue such that it can serve as a robust async
mechanism which can be used (mostly) universally without introducing
any noticeable performance degradation.

Thanks.

diffstat
========
 Documentation/slow-work.txt   |  322 -----
 arch/ia64/kernel/smpboot.c    |    2 
 arch/ia64/kvm/Kconfig         |    1 
 arch/powerpc/kvm/Kconfig      |    1 
 arch/s390/kvm/Kconfig         |    1 
 arch/x86/kernel/smpboot.c     |    2 
 arch/x86/kvm/Kconfig          |    1 
 drivers/acpi/battery.c        |    4 
 drivers/acpi/osl.c            |   41 
 drivers/ata/libata-core.c     |   50 
 drivers/ata/libata-eh.c       |    4 
 drivers/ata/libata-scsi.c     |   11 
 drivers/ata/libata.h          |    1 
 drivers/ata/pata_legacy.c     |    2 
 drivers/base/core.c           |    2 
 drivers/base/dd.c             |    2 
 drivers/md/raid5.c            |    4 
 drivers/s390/block/dasd.c     |    4 
 drivers/scsi/sd.c             |    8 
 fs/cachefiles/namei.c         |   28 
 fs/cachefiles/rdwr.c          |    4 
 fs/cifs/Kconfig               |    1 
 fs/cifs/cifsfs.c              |    6 
 fs/cifs/cifsglob.h            |    8 
 fs/cifs/dir.c                 |    2 
 fs/cifs/file.c                |   22 
 fs/cifs/misc.c                |   15 
 fs/fscache/Kconfig            |    1 
 fs/fscache/internal.h         |    2 
 fs/fscache/main.c             |   25 
 fs/fscache/object-list.c      |   12 
 fs/fscache/object.c           |   67 -
 fs/fscache/operation.c        |   67 -
 fs/fscache/page.c             |   36 
 fs/gfs2/Kconfig               |    1 
 fs/gfs2/incore.h              |    3 
 fs/gfs2/main.c                |    9 
 fs/gfs2/ops_fstype.c          |    8 
 fs/gfs2/recovery.c            |   52 
 fs/gfs2/recovery.h            |    4 
 fs/gfs2/sys.c                 |    3 
 include/linux/async.h         |   17 
 include/linux/fscache-cache.h |   49 
 include/linux/kvm_host.h      |    4 
 include/linux/libata.h        |    2 
 include/linux/preempt.h       |   48 
 include/linux/sched.h         |   71 -
 include/linux/slow-work.h     |  163 --
 include/linux/stop_machine.h  |    6 
 include/linux/workqueue.h     |  109 +
 init/Kconfig                  |   28 
 init/do_mounts.c              |    2 
 init/main.c                   |    4 
 kernel/Makefile               |    2 
 kernel/async.c                |  393 +-----
 kernel/irq/autoprobe.c        |    2 
 kernel/module.c               |    4 
 kernel/power/process.c        |   21 
 kernel/sched.c                |  334 +++--
 kernel/slow-work-debugfs.c    |  227 ---
 kernel/slow-work.c            | 1068 ----------------
 kernel/slow-work.h            |   72 -
 kernel/stop_machine.c         |  151 +-
 kernel/sysctl.c               |    8 
 kernel/trace/Kconfig          |    4 
 kernel/workqueue.c            | 2697 ++++++++++++++++++++++++++++++++++++------
 virt/kvm/kvm_main.c           |   26 
 67 files changed, 3120 insertions(+), 3231 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/929641



Thread overview: 102+ messages
2010-01-18  0:57 [PATCHSET] concurrency managed workqueue, take#3 Tejun Heo
2010-01-18  0:57 ` [PATCH 01/40] sched: consult online mask instead of active in select_fallback_rq() Tejun Heo
2010-01-18 10:13   ` Peter Zijlstra
2010-01-18 11:26     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 02/40] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 03/40] sched: refactor try_to_wake_up() Tejun Heo
2010-01-18  0:57 ` [PATCH 04/40] sched: implement __set_cpus_allowed() Tejun Heo
2010-01-18  9:56   ` Peter Zijlstra
2010-01-18 11:22     ` Tejun Heo
2010-01-18 11:41       ` Peter Zijlstra
2010-01-19  1:07         ` Tejun Heo
2010-01-19  8:37           ` Peter Zijlstra
2010-01-20  8:35             ` Tejun Heo
2010-01-20  8:50               ` Peter Zijlstra
2010-01-20  9:00                 ` Tejun Heo
2010-01-20  8:59                   ` Peter Zijlstra
2010-01-24  8:18               ` Tejun Heo
2010-01-18  0:57 ` [PATCH 05/40] sched: make sched_notifiers unconditional Tejun Heo
2010-01-18  0:57 ` [PATCH 06/40] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
2010-01-18  9:57   ` Peter Zijlstra
2010-01-18 11:31     ` Tejun Heo
2010-01-18 12:49       ` Peter Zijlstra
2010-01-19  1:04         ` Tejun Heo
2010-01-19  8:28           ` Tejun Heo
2010-01-19  8:55             ` Peter Zijlstra
2010-01-20  8:47               ` Tejun Heo
2010-01-18  0:57 ` [PATCH 07/40] sched: implement try_to_wake_up_local() Tejun Heo
2010-01-18  0:57 ` [PATCH 08/40] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
2010-01-18  0:57 ` [PATCH 09/40] stop_machine: reimplement without using workqueue Tejun Heo
2010-01-18  0:57 ` [PATCH 10/40] workqueue: misc/cosmetic updates Tejun Heo
2010-01-18  0:57 ` [PATCH 11/40] workqueue: merge feature parameters into flags Tejun Heo
2010-01-18  0:57 ` [PATCH 12/40] workqueue: define both bit position and mask for work flags Tejun Heo
2010-01-18  0:57 ` [PATCH 13/40] workqueue: separate out process_one_work() Tejun Heo
2010-01-18  0:57 ` [PATCH 14/40] workqueue: temporarily disable workqueue tracing Tejun Heo
2010-01-18  0:57 ` [PATCH 15/40] workqueue: kill cpu_populated_map Tejun Heo
2010-01-18  0:57 ` [PATCH 16/40] workqueue: update cwq alignement Tejun Heo
2010-01-18  0:57 ` [PATCH 17/40] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
2010-01-18  0:57 ` [PATCH 18/40] workqueue: introduce worker Tejun Heo
2010-01-18  0:57 ` [PATCH 19/40] workqueue: reimplement work flushing using linked works Tejun Heo
2010-01-18  0:57 ` [PATCH 20/40] workqueue: implement per-cwq active work limit Tejun Heo
2010-01-18  0:57 ` [PATCH 21/40] workqueue: reimplement workqueue freeze using max_active Tejun Heo
2010-01-18  0:57 ` [PATCH 22/40] workqueue: introduce global cwq and unify cwq locks Tejun Heo
2010-01-18  0:57 ` [PATCH 23/40] workqueue: implement worker states Tejun Heo
2010-01-18  0:57 ` [PATCH 24/40] workqueue: reimplement CPU hotplugging support using trustee Tejun Heo
2010-01-18  0:57 ` [PATCH 25/40] workqueue: make single thread workqueue shared worker pool friendly Tejun Heo
2010-01-18  0:57 ` [PATCH 26/40] workqueue: use shared worklist and pool all workers per cpu Tejun Heo
2010-01-18  0:57 ` [PATCH 27/40] workqueue: implement concurrency managed dynamic worker pool Tejun Heo
2010-01-18  0:57 ` [PATCH 28/40] workqueue: increase max_active of keventd and kill current_is_keventd() Tejun Heo
2010-01-18  0:57 ` [PATCH 29/40] workqueue: add system_wq and system_single_wq Tejun Heo
2010-01-18  0:57 ` [PATCH 30/40] workqueue: implement work_busy() Tejun Heo
2010-01-18  2:52   ` Andy Walls
2010-01-18  5:41     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 31/40] libata: take advantage of cmwq and remove concurrency limitations Tejun Heo
2010-01-18 15:48   ` Stefan Richter
2010-01-19  0:49     ` Tejun Heo
2010-01-18  0:57 ` [PATCH 32/40] async: introduce workqueue based alternative implementation Tejun Heo
2010-01-18  6:01   ` Arjan van de Ven
2010-01-18  8:49     ` Tejun Heo
2010-01-18 15:25       ` Arjan van de Ven
2010-01-19  0:57         ` Tejun Heo
2010-01-19  0:57           ` Arjan van de Ven
2010-01-19  7:56             ` Tejun Heo
2010-01-19 14:37               ` Arjan van de Ven
2010-01-20  0:19                 ` Tejun Heo
2010-01-20  0:31                   ` Arjan van de Ven
2010-01-20  2:08                     ` Tejun Heo
2010-01-20  6:03                       ` Arjan van de Ven
2010-01-20  8:24                         ` Tejun Heo
2010-01-22 10:59                           ` [PATCH] async: use workqueue for worker pool Tejun Heo
2010-01-18  0:57 ` [PATCH 33/40] async: convert async users to use the new implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 34/40] async: kill original implementation Tejun Heo
2010-01-18  0:57 ` [PATCH 35/40] fscache: convert object to use workqueue instead of slow-work Tejun Heo
2010-02-12 18:03   ` David Howells
2010-02-13  5:43     ` Tejun Heo
2010-02-15 15:04       ` David Howells
2010-02-16  3:40         ` Tejun Heo
2010-02-16  3:59           ` Tejun Heo
2010-02-16 18:05           ` David Howells
2010-02-16 23:50             ` Tejun Heo
2010-02-18 11:50               ` David Howells
2010-02-18 12:33                 ` Tejun Heo
2010-01-18  0:57 ` [PATCH 36/40] fscache: convert operation " Tejun Heo
2010-01-18  0:57 ` [PATCH 37/40] fscache: drop references to slow-work Tejun Heo
2010-01-18  0:57 ` [PATCH 38/40] cifs: use workqueue instead of slow-work Tejun Heo
2010-01-19 12:20   ` Jeff Layton
2010-01-20  0:15     ` Tejun Heo
2010-01-20  0:56       ` Jeff Layton
2010-01-20  1:23         ` Tejun Heo
2010-01-22 11:14           ` [PATCH UPDATED " Tejun Heo
2010-01-22 11:45             ` Jeff Layton
2010-01-24  8:25               ` Tejun Heo
2010-01-24 12:13                 ` Jeff Layton
2010-01-25 15:25                   ` Tejun Heo
2010-01-18  0:57 ` [PATCH 39/40] gfs2: " Tejun Heo
2010-01-18  9:45   ` Steven Whitehouse
2010-01-18 11:24     ` Tejun Heo
2010-01-18 12:07       ` Steven Whitehouse
2010-01-19  1:00         ` Tejun Heo
2010-01-19  8:46           ` [PATCH UPDATED " Tejun Heo
2010-01-18  0:57 ` [PATCH 40/40] slow-work: kill it Tejun Heo
2010-01-18  1:03 ` perf-wq.c used to generate synthetic workload Tejun Heo
2010-01-18 16:13 ` [PATCHSET] concurrency managed workqueue, take#3 Stefan Richter
