public inbox for linux-kernel@vger.kernel.org
From: Tejun Heo <tj@kernel.org>
To: Oleg Nesterov <oleg@redhat.com>
Cc: torvalds@linux-foundation.org, mingo@elte.hu,
	peterz@infradead.org, awalls@radix.net,
	linux-kernel@vger.kernel.org, jeff@garzik.org,
	akpm@linux-foundation.org, jens.axboe@oracle.com,
	rusty@rustcorp.com.au, cl@linux-foundation.org,
	dhowells@redhat.com, arjan@linux.intel.com, avi@redhat.com,
	johannes@sipsolutions.net, andi@firstfloor.org
Subject: Re: [PATCH 18/43] workqueue: reimplement workqueue flushing using color coded works
Date: Tue, 02 Mar 2010 02:33:25 +0900	[thread overview]
Message-ID: <4B8BFA65.8090706@kernel.org> (raw)
In-Reply-To: <20100228203136.GA28915@redhat.com>

Hello, Oleg.

On 03/01/2010 05:31 AM, Oleg Nesterov wrote:
> On 02/26, Tejun Heo wrote:
>>
>> +	/*
>> +	 * The last color is no color used for works which don't
>> +	 * participate in workqueue flushing.
>> +	 */
>> +	WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS) - 1,
>> +	WORK_NO_COLOR		= WORK_NR_COLORS,
> 
> Not sure this makes sense, but instead of special NO_COLOR reserved
> for barriers we could change insert_wq_barrier() to use cwq == NULL
> to mark this work as no-color. Note that the barriers doesn't need a
> valid get_wq_data() or even _PENDING bit set (apart from BUG_ON()
> check in run_workqueue).

Later, when all the worklists on a cpu are unified into a single
per-cpu worklist, work->data is the only way to determine which cwq
and wq a work belongs to, and process_one_work() depends on it.
Looking at how it's used, I suspect process_one_work() could survive
without knowing the cwq and wq of a work, but that would require a
number of condition checks throughout the queueing, execution and
debugfs code (or we could set current_cwq to a bogus barrier wq).

Yeah, it's definitely something to think about, but I think 15 flush
colors should be more than enough, especially given that colors gain
throughput as pressure builds up, so I'm a bit reluctant to break
invariants just to gain one more color at this point.
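
For reference, here's a plain-C sketch of how the color constants fit
together; the shift value is illustrative, not the kernel's exact bit
layout:

```c
/*
 * Userspace sketch of the color constants (WORK_STRUCT_COLOR_SHIFT
 * is an assumed value, not the kernel's exact bit layout).  With 4
 * color bits, colors 0..14 are usable flush colors and 15 is
 * reserved as WORK_NO_COLOR for works (e.g. barriers) which don't
 * participate in flushing.
 */
enum {
	WORK_STRUCT_COLOR_SHIFT	= 4,	/* assumed field position */
	WORK_STRUCT_COLOR_BITS	= 4,

	WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS) - 1,
	WORK_NO_COLOR		= WORK_NR_COLORS,
};

static unsigned int work_color_to_flags(int color)
{
	return color << WORK_STRUCT_COLOR_SHIFT;
}

static int work_flags_to_color(unsigned int flags)
{
	return (flags >> WORK_STRUCT_COLOR_SHIFT) &
		((1 << WORK_STRUCT_COLOR_BITS) - 1);
}
```

Round-tripping any color through the flags preserves it, and
WORK_NO_COLOR is simply the all-ones color value.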

>> +static int work_flags_to_color(unsigned int flags)
>> +{
>> +	return (flags >> WORK_STRUCT_COLOR_SHIFT) &
>> +		((1 << WORK_STRUCT_COLOR_BITS) - 1);
>> +}
> 
> perhaps work_to_color(work) makes more sense, every caller needs to
> read work->data anyway.

I think it better matches work_color_to_flags(), but yeah, doing
*work_data_bits() in every caller is ugly.  I'll change it to
work_color().
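
I.e. something like the following sketch; the struct is a toy
stand-in for the real work_struct, whose ->data is actually an
atomic_long_t, and the shift value is assumed:

```c
/*
 * Sketch of the renamed helper.  struct work_struct here is a toy
 * stand-in; in the kernel ->data is an atomic_long_t accessed via
 * work_data_bits().
 */
enum {
	WORK_STRUCT_COLOR_SHIFT	= 4,	/* assumed field position */
	WORK_STRUCT_COLOR_BITS	= 4,
};

struct work_struct {
	unsigned long data;	/* flags in the low bits */
};

static unsigned long *work_data_bits(struct work_struct *work)
{
	return &work->data;
}

/* takes the work itself so callers don't touch ->data directly */
static int work_color(struct work_struct *work)
{
	return (*work_data_bits(work) >> WORK_STRUCT_COLOR_SHIFT) &
		((1 << WORK_STRUCT_COLOR_BITS) - 1);
}
```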

>> +static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
>> +{
>> +	/* ignore uncolored works */
>> +	if (color == WORK_NO_COLOR)
>> +		return;
>> +
>> +	cwq->nr_in_flight[color]--;
>> +
>> +	/* is flush in progress and are we at the flushing tip? */
>> +	if (likely(cwq->flush_color != color))
>> +		return;
>> +
>> +	/* are there still in-flight works? */
>> +	if (cwq->nr_in_flight[color])
>> +		return;
>> +
>> +	/* this cwq is done, clear flush_color */
>> +	cwq->flush_color = -1;
> 
> Well. I am not that smart to understand this patch quickly ;) will try to
> read it more tomorrow, but...

Sadly, this is one complex piece of code.  I tried to make it simpler
but couldn't come up with anything significantly better, and it still
takes me some time to get my head around it even though I wrote it
and have spent quite some time with it.  If you can suggest something
simpler, that would be great.

> Very naive question: why do we need cwq->nr_in_flight[] at all? Can't
> cwq_dec_nr_in_flight(color) do
> 
> 	if (WORK_NO_COLOR)
> 		return;
> 
> 	if (cwq->flush_color == -1)
> 		return;
> 
> 	BUG_ON(cwq->flush_color != color);
> 	if (next_work && work_to_color(next_work) == color)
> 		return;
> 
> 	cwq->flush_color = -1;
> 	if (atomic_dec_and_test(nr_cwqs_to_flush))
> 		complete(first_flusher);
> 
> ?

Issues I can think of off the top of my head are...

1. Works are removed from the worklist before the callback is called
   and are inaccessible once the callback starts running, so
   completion order can't be tracked with the work structures
   themselves.

2. Even if it could be tracked that way, with later unified worklist,
   it will require walking the worklist looking for works with
   matching cwq.

3. Works are transferred to the ->scheduled list as linked works,
   which ends up fragmenting the worklist before execution starts.
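
To make point 1 concrete, here's a userspace toy model (hypothetical
names, not the real cwq): once a callback runs the work is off every
list, so the only way a flush can know a color has drained is a
counter that goes up on queue and down on completion.

```c
/*
 * Toy model of per-color in-flight accounting (names mirror the
 * patch but the struct is a toy, not the real cpu_workqueue_struct).
 */
#define WORK_NR_COLORS	15

struct toy_cwq {
	int nr_in_flight[WORK_NR_COLORS];
	int flush_color;	/* -1 if no flush in progress */
};

static void toy_queue(struct toy_cwq *cwq, int color)
{
	cwq->nr_in_flight[color]++;
}

/* returns 1 when this cwq's part of the flush completes */
static int toy_complete(struct toy_cwq *cwq, int color)
{
	cwq->nr_in_flight[color]--;

	/* is flush in progress and are we at the flushing tip? */
	if (cwq->flush_color != color)
		return 0;

	/* are there still in-flight works? */
	if (cwq->nr_in_flight[color])
		return 0;

	/* this cwq is done, clear flush_color */
	cwq->flush_color = -1;
	return 1;
}
```

Note the counter is the only state consulted on completion; no list
walk is needed, which is exactly what points 1-3 rule out.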

> And I can't understand ->nr_cwqs_to_flush logic.
> 
> Say, we have 2 CPUs and both have a single pending work with color == 0.
> 
> flush_workqueue()->flush_workqueue_prep_cwqs() does for_each_possible_cpu()
> and each iteration does atomic_inc(nr_cwqs_to_flush).
> 
> What if the first CPU flushes its work between these 2 iterations?
> 
> Afaics, in this case this first CPU will complete(first_flusher->done),
> then the flusher will increment nr_cwqs_to_flush again during the
> second iteration.
> 
> Then the flusher drops ->flush_mutex and wait_for_completion() succeeds
> before the second CPU flushes its work.
> 
> No?

Yes, thanks for spotting it.  It should start with 1 and do
dec_and_test after the iterations are complete.  Will fix.
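
In other words, bias the count by one so a cwq completing
mid-iteration can never drop it to zero early.  A minimal userspace
sketch of that pattern, where complete_flush() stands in for
complete(&first_flusher->done):

```c
#include <stdatomic.h>

/*
 * Sketch of the biased-count fix: nr_cwqs_to_flush starts at 1, so
 * cwqs finishing while the iteration is still in progress can't
 * complete the flush prematurely; the initiator drops the bias only
 * after all iterations are done.
 */
static atomic_int nr_cwqs_to_flush;
static int flush_completed;

static void complete_flush(void)
{
	flush_completed = 1;	/* stand-in for complete() */
}

static void cwq_done(void)
{
	/* dec_and_test: previous value 1 means we dropped it to 0 */
	if (atomic_fetch_sub(&nr_cwqs_to_flush, 1) == 1)
		complete_flush();
}

static void start_flush(int nr_cpus)
{
	atomic_store(&nr_cwqs_to_flush, 1);	/* the bias */
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		atomic_fetch_add(&nr_cwqs_to_flush, 1);
		/* a cwq may call cwq_done() here without firing the
		 * completion, because the bias keeps the count > 0 */
	}
	cwq_done();				/* drop the bias */
}
```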

>>  void flush_workqueue(struct workqueue_struct *wq)
>>  {
>> ...
>> +	 * First flushers are responsible for cascading flushes and
>> +	 * handling overflow.  Non-first flushers can simply return.
>> +	 */
>> +	if (wq->first_flusher != &this_flusher)
>> +		return;
>> +
>> +	mutex_lock(&wq->flush_mutex);
>> +
>> +	wq->first_flusher = NULL;
>> +
>> +	BUG_ON(!list_empty(&this_flusher.list));
>> +	BUG_ON(wq->flush_color != this_flusher.flush_color);
>> +
>> +	while (true) {
>> +		struct wq_flusher *next, *tmp;
>> +
>> +		/* complete all the flushers sharing the current flush color */
>> +		list_for_each_entry_safe(next, tmp, &wq->flusher_queue, list) {
>> +			if (next->flush_color != wq->flush_color)
>> +				break;
>> +			list_del_init(&next->list);
>> +			complete(&next->done);
> 
> Is it really possible that "next" can ever have the same flush_color?

Yeap, when all the colors are consumed, flushers wait on the overflow
queue.  On the next flush completion, all flushers on the overflow
list get assigned the same color, so that flush colors gain
throughput as pressure increases instead of building up a long queue
of flushers.
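
A toy model of the color exhaustion side of this (hypothetical names,
not the real flusher structures): with a flush in progress at
flush_color, only 14 more flushes can claim colors before the next
one wraps around and must overflow.

```c
/*
 * Toy model of flush color allocation.  work_color advances per
 * flush; once advancing it would collide with the in-progress
 * flush_color, the flusher overflows and will later share one color
 * with every other overflowed flusher.
 */
#define WORK_NR_COLORS	15

struct toy_wq {
	int work_color;		/* color given to newly queued works */
	int flush_color;	/* color currently being flushed */
	int overflow;		/* flushers waiting for a free color */
};

static int next_color(int color)
{
	return (color + 1) % WORK_NR_COLORS;
}

/* returns the color the flusher waits on, or -1 if it overflowed */
static int toy_start_flush(struct toy_wq *wq)
{
	if (next_color(wq->work_color) == wq->flush_color) {
		wq->overflow++;		/* all colors in use */
		return -1;
	}

	int color = wq->work_color;
	wq->work_color = next_color(wq->work_color);
	return color;
}
```

Every overflowed flusher later gets one shared color, so a burst of
flushers costs a single color instead of one each.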

Thanks.

-- 
tejun
