[PATCH 3.14 20/37] workqueue: fix ghost PENDING flag while doing MQ IO

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Roman Pen <roman.penyaev@profitbricks.com>,
	Gioh Kim <gi-oh.kim@profitbricks.com>,
	Michael Wang <yun.wang@profitbricks.com>,
	Tejun Heo <tj@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org
Subject: [PATCH 3.14 20/37] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon,  2 May 2016 17:12:09 -0700	[thread overview]
Message-ID: <20160503000424.209611241@linuxfoundation.org> (raw)
In-Reply-To: <20160503000423.577563140@linuxfoundation.org>

3.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Roman Pen <roman.penyaev@profitbricks.com>

commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/workqueue.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -629,6 +629,35 @@ static void set_work_pool_and_clear_pend
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that previous clear of a PENDING bit
+	 * will not be reordered with any speculative LOADS or STORES from
+	 * work->current_func, which is executed afterwards.  This possible
+	 * reordering can lead to a missed execution on attempt to qeueue
+	 * the same @work.  E.g. consider this case:
+	 *
+	 *   CPU#0                         CPU#1
+	 *   ----------------------------  --------------------------------
+	 *
+	 * 1  STORE event_indicated
+	 * 2  queue_work_on() {
+	 * 3    test_and_set_bit(PENDING)
+	 * 4 }                             set_..._and_clear_pending() {
+	 * 5                                 set_work_data() # clear bit
+	 * 6                                 smp_mb()
+	 * 7                               work->current_func() {
+	 * 8				      LOAD event_indicated
+	 *				   }
+	 *
+	 * Without an explicit full barrier speculative LOAD on line 8 can
+	 * be executed before CPU#0 does STORE on line 1.  If that happens,
+	 * CPU#0 observes the PENDING bit is still set and new execution of
+	 * a @work is not queued in a hope, that CPU#1 will eventually
+	 * finish the queued @work.  Meanwhile CPU#1 does not see
+	 * event_indicated is set, because speculative LOAD was executed
+	 * before actual STORE.
+	 */
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)

next prev parent reply	other threads:[~2016-05-03  1:57 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-03  0:11 [PATCH 3.14 00/37] 3.14.68-stable review Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 01/37] ARM: OMAP2+: hwmod: Fix updating of sysconfig register Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 02/37] assoc_array: dont call compare_object() on a node Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 03/37] usb: xhci: fix wild pointers in xhci_mem_cleanup Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 04/37] usb: hcd: out of bounds access in for_each_companion Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 05/37] lib: lz4: fixed zram with lz4 on big endian machines Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 06/37] x86/iopl/64: Properly context-switch IOPL on Xen PV Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 07/37] futex: Acknowledge a new waiter in counter before plist Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 08/37] drm/qxl: fix cursor position with non-zero hotspot Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 09/37] crypto: ccp - Prevent information leakage on export Greg Kroah-Hartman
2016-05-03  0:11 ` [PATCH 3.14 10/37] crypto: gcm - Fix rfc4543 decryption crash Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 11/37] nl80211: check netlink protocol in socket release notification Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 12/37] Input: gtco - fix crash on detecting device without endpoints Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 13/37] pinctrl: single: Fix pcs_parse_bits_in_pinctrl_entry to use __ffs than ffs Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 14/37] i2c: cpm: Fix build break due to incompatible pointer types Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 15/37] i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 16/37] EDAC: i7core, sb_edac: Dont return NOTIFY_BAD from mce_decoder callback Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 17/37] ASoC: s3c24xx: use const snd_soc_component_driver pointer Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 18/37] ASoC: rt5640: Correct the digital interface data select Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 19/37] efi: Fix out-of-bounds read in variable_matches() Greg Kroah-Hartman
2016-05-03  0:12 ` Greg Kroah-Hartman [this message]
2016-05-03  0:12 ` [PATCH 3.14 21/37] USB: usbip: fix potential out-of-bounds write Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 22/37] paride: make verbose parameter an int again Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 23/37] fbdev: da8xx-fb: fix videomodes of lcd panels Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 24/37] misc/bmp085: Enable building as a module Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 25/37] rtc: hym8563: fix invalid year calculation Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 27/37] drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 28/37] ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 29/37] serial: sh-sci: Remove cpufreq notifier to fix crash/deadlock Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 30/37] include/linux/poison.h: fix LIST_POISON{1,2} offset Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 31/37] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 32/37] perf stat: Document --detailed option Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 34/37] bus: imx-weim: Take the status property value into account Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 35/37] jme: Do not enable NIC WoL functions on S0 Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 36/37] jme: Fix device PM wakeup API usage Greg Kroah-Hartman
2016-05-03  0:12 ` [PATCH 3.14 37/37] sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race Greg Kroah-Hartman
2016-05-03  7:19 ` [PATCH 3.14 00/37] 3.14.68-stable review Guenter Roeck
2016-05-03 18:21   ` Greg Kroah-Hartman
2016-05-03 15:00 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160503000424.209611241@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@kernel.dk \
    --cc=gi-oh.kim@profitbricks.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=roman.penyaev@profitbricks.com \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.