Re: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Hurley <peter@hurleysoftware.com>
To: Roman Pen <roman.penyaev@profitbricks.com>, Tejun Heo <tj@kernel.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>,
	Michael Wang <yun.wang@profitbricks.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org
Subject: Re: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Tue, 26 Apr 2016 13:58:07 -0700	[thread overview]
Message-ID: <571FD65F.2080901@hurleysoftware.com> (raw)
In-Reply-To: <1461669335-19522-1-git-send-email-roman.penyaev@profitbricks.com>

On 04/26/2016 04:15 AM, Roman Pen wrote:
> The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
> with the following backtrace:
> 
> [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
> [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
> [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
> [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
> [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
> [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
> [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
> [  601.350965] Call Trace:
> [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
> [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
> [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
> [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
> [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
> [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
> [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
> [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
> [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
> [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
> [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
> [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
> [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
> [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
> [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
> [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
> [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
> [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
> [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
> [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
> [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
> 
> The underlying device is a null_blk, with default parameters:
> 
>   queue_mode    = MQ
>   submit_queues = 1
> 
> Verification that nullb0 has something inflight:
> 
> root@pserver8:~# cat /sys/block/nullb0/inflight
>        0        1
> root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
> ...
> /sys/block/nullb0/mq/0/cpu2/rq_list
> CTX pending:
>         ffff8838038e2400
> ...
> 
> During debug it became clear that stalled request is always inserted in
> the rq_list from the following path:
> 
>    save_stack_trace_tsk + 34
>    blk_mq_insert_requests + 231
>    blk_mq_flush_plug_list + 281
>    blk_flush_plug_list + 199
>    wait_on_page_bit + 192
>    __filemap_fdatawait_range + 228
>    filemap_fdatawait_range + 20
>    filemap_write_and_wait_range + 63
>    blkdev_fsync + 27
>    vfs_fsync_range + 73
>    blkdev_write_iter + 202
>    __vfs_write + 170
>    vfs_write + 169
>    kernel_write + 56
> 
> So blk_flush_plug_list() was called with from_schedule == true.
> 
> If from_schedule is true, that means that finally blk_mq_insert_requests()
> offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
> i.e. it calls kblockd_schedule_delayed_work_on().
> 
> That means, that we race with another CPU, which is about to execute
> __blk_mq_run_hw_queue() work.
> 
> Further debugging shows the following traces from different CPUs:
> 
>   CPU#0                                  CPU#1
>   ----------------------------------     -------------------------------
>   reqeust A inserted
>   STORE hctx->ctx_map[0] bit marked
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>                                          request B inserted
>                                          STORE hctx->ctx_map[1] bit marked
>                                          kblockd_schedule...() returns 0
>   *** WORK PENDING bit is cleared ***
>   flush_busy_ctxs() is executed, but
>   bit 1, set by CPU#1, is not observed
> 
> As a result request B pended forever.
> 
> This behaviour can be explained by speculative LOAD of hctx->ctx_map on
> CPU#0, which is reordered with clear of PENDING bit and executed _before_
> actual STORE of bit 1 on CPU#1.
> 
> The proper fix is an explicit full barrier <mfence>, which guarantees
> that clear of PENDING bit is to be executed before all possible
> speculative LOADS or STORES inside actual work function.
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
> Cc: Michael Wang <yun.wang@profitbricks.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: linux-block@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 450c21f..0ec0594 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -649,6 +649,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	/*
> +	 * The following mb guarantees that previous clear of a PENDING bit
> +	 * will not be reordered with any speculative LOADS or STORES from
> +	 * work->current_func, which is executed afterwards.

Just to be clear: reordering with STORES in the work function is not a problem
because no conclusion can be reached regarding the scheduling state of the work
function from outside the work function, other than by testing PENDING.

IOW,

	is_running = 1
	clear PENDING

is equivalent to

	clear PENDING
	is_running = 1

for any other logic like

	if (is_running) {
	}

or

	if (!is_running) {
	}


It's only LOADS in the work function that must follow STORES to PENDING.
That the full barrier will also order stores is unintended effect.



>  This possible
> +	 * reordering can lead to a missed execution on attempt to qeueue
> +	 * the same @work.  E.g. consider this case:
> +	 *
> +	 *   CPU#0                         CPU#1
> +	 *   ----------------------------  --------------------------------
> +	 *
> +	 * 1  STORE event_indicated
> +	 * 2  queue_work_on() {
> +	 * 3    test_and_set_bit(PENDING)
> +	 * 4 }                             set_..._and_clear_pending() {
> +	 * 5                                 set_work_data() # clear bit
> +	 * 6                                 smp_mb()
> +	 * 7                               work->current_func() {
> +	 * 8				      LOAD event_indicated
> +	 *				   }
> +	 *
> +	 * Without an explicit full barrier speculative LOAD on line 8 can
> +	 * be executed before CPU#0 does STORE on line 1.  If that happens,
> +	 * CPU#0 observes the PENDING bit is still set and new execution of
> +	 * a @work is not queued in a hope, that CPU#1 will eventually
> +	 * finish the queued @work.  Meanwhile CPU#1 does not see
> +	 * event_indicated is set, because speculative LOAD was executed
> +	 * before actual STORE.
> +	 */
> +	smp_mb();
>  }
>  
>  static void clear_work_data(struct work_struct *work)
>

     prev parent reply	other threads:[~2016-04-26 20:58 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-26 11:15 [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO Roman Pen
2016-04-26 15:21 ` Tejun Heo
2016-04-26 20:58 ` Peter Hurley [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=571FD65F.2080901@hurleysoftware.com \
    --to=peter@hurleysoftware.com \
    --cc=axboe@kernel.dk \
    --cc=gi-oh.kim@profitbricks.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=roman.penyaev@profitbricks.com \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.