public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sasha.levin@oracle.com>
To: stable@vger.kernel.org, stable-commits@vger.kernel.org
Cc: Roman Pen <roman.penyaev@profitbricks.com>,
	Gioh Kim <gi-oh.kim@profitbricks.com>,
	Michael Wang <yun.wang@profitbricks.com>,
	Tejun Heo <tj@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sasha Levin <sasha.levin@oracle.com>
Subject: [added to the 4.1 stable tree] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Thu, 19 May 2016 00:19:15 -0400	[thread overview]
Message-ID: <1463631606-32540-16-git-send-email-sasha.levin@oracle.com> (raw)
In-Reply-To: <1463631606-32540-1-git-send-email-sasha.levin@oracle.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This patch has been added to the 4.1 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit 346c09f80459a3ad97df1816d6d606169a51001a ]

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
---
 kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6d63116..32152a4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -654,6 +654,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that previous clear of a PENDING bit
+	 * will not be reordered with any speculative LOADS or STORES from
+	 * work->current_func, which is executed afterwards.  This possible
+	 * reordering can lead to a missed execution on attempt to qeueue
+	 * the same @work.  E.g. consider this case:
+	 *
+	 *   CPU#0                         CPU#1
+	 *   ----------------------------  --------------------------------
+	 *
+	 * 1  STORE event_indicated
+	 * 2  queue_work_on() {
+	 * 3    test_and_set_bit(PENDING)
+	 * 4 }                             set_..._and_clear_pending() {
+	 * 5                                 set_work_data() # clear bit
+	 * 6                                 smp_mb()
+	 * 7                               work->current_func() {
+	 * 8				      LOAD event_indicated
+	 *				   }
+	 *
+	 * Without an explicit full barrier speculative LOAD on line 8 can
+	 * be executed before CPU#0 does STORE on line 1.  If that happens,
+	 * CPU#0 observes the PENDING bit is still set and new execution of
+	 * a @work is not queued in a hope, that CPU#1 will eventually
+	 * finish the queued @work.  Meanwhile CPU#1 does not see
+	 * event_indicated is set, because speculative LOAD was executed
+	 * before actual STORE.
+	 */
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)
-- 
2.5.0


  parent reply	other threads:[~2016-05-19  4:20 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-19  4:19 [added to the 4.1 stable tree] Revert "usb: hub: do not clear BOS field during reset device" Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] stable: remove artifact created on backport Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] iwlwifi: pcie: lower the debug level for RSA semaphore access Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ASoC: rt5640: Correct the digital interface data select Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] regulator: s2mps11: Fix invalid selector mask and voltages for buck9 Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] libahci: save port map for forced port map Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ata: ahci-platform: Add ports-implemented DT bindings Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] regmap: spmi: Fix regmap_spmi_ext_read in multi-byte case Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] iio: ak8975: Fix NULL pointer exception on early interrupt Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] efi: Fix out-of-bounds read in variable_matches() Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] USB: serial: cp210x: add ID for Link ECU Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] USB: serial: cp210x: add Straizona Focusers device ids Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] [media] v4l2-dv-timings.h: fix polarity for 4k formats Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] MD: make bio mergeable Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Add dock support for ThinkPad X260 Sasha Levin
2016-05-19  4:19 ` Sasha Levin [this message]
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/dp/mst: Get validated port ref in drm_dp_update_payload_part1() Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/dp/mst: Restore primary hub guid on resume Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] cxl: Keep IRQ mappings on context teardown Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/radeon: fix vertical bars appear on monitor (v2) Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] IB/security: Restrict use of the write() interface Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] mm: vmscan: reclaim highmem zone if buffer_heads is over limit Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] mm: soft-offline: don't free target page in successful page migration Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_* Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: usb-audio: Quirk for yet another Phoenix Audio devices (v2) Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] atomic_open(): fix the handling of create_error Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] Drivers: hv: ring_buffer.c: fix comment style Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] Drivers: hv_vmbus: Fix signal to host condition Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] Drivers: hv: vmbus: Fix signaling logic in hv_need_to_signal_on_read() Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] powerpc: Fix bad inline asm constraint in create_zero_mask() Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] Minimal fix-up of bad hashing behavior of hash_64() Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] tracing: Don't display trigger file for events that can't be enabled Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/radeon: make sure vertical front porch is at least 1 Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] MAINTAINERS: Remove asterisk from EFI directory names Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ACPICA: Dispatcher: Update thread ID for recursive method calls Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] crypto: hash - Fix page length clamping in hash walk Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] x86/sysfb_efi: Fix valid BAR address range check Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] fs/pnode.c: treat zero mnt_group_id-s as unequal Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] propogate_mnt: Handle the first propogated copy being a slave Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/radeon: fix DP link training issue with second 4K monitor Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] mm, cma: prevent nr_isolated_* counters from going negative Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] get_rock_ridge_filename(): handle malformed NM entries Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Apply fix for white noise on Asus N550JV, too Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Fix white noise on Asus UX501VW headset Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] Input: max8997-haptic - fix NULL pointer dereference Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] drm/i915: Bail out of pipe config compute loop on LPT Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Fix broken reconfig Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Asus N750JV external subwoofer fixup Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Fix white noise on Asus N750JV headphone Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: hda - Fix subwoofer pin on ASUS N751 and N551 Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] ALSA: usb-audio: Yet another Phoneix Audio device quirk Sasha Levin
2016-05-19  4:19 ` [added to the 4.1 stable tree] perf/core: Disable the event on a truncated AUX record Sasha Levin
2016-05-23  6:59   ` Alexander Shishkin
2016-05-30 21:50     ` Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] tools lib traceevent: Do not reassign parg after collapse_tree() Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] workqueue: fix rebind bound workers warning Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] drm/radeon: fix DP mode validation Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] ocfs2: fix SGID not inherited issue Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] ocfs2: fix posix_acl_create deadlock Sasha Levin
2016-05-19  4:20 ` [added to the 4.1 stable tree] nf_conntrack: avoid kernel pointer value leak in slab name Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1463631606-32540-16-git-send-email-sasha.levin@oracle.com \
    --to=sasha.levin@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=gi-oh.kim@profitbricks.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=roman.penyaev@profitbricks.com \
    --cc=stable-commits@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox