From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Coly Li <colyli@suse.de>,
	Jens Axboe <axboe@kernel.dk>, kbuild test robot <lkp@intel.com>
Subject: [PATCH 5.2 20/37] bcache: fix race in btree_flush_write()
Date: Fri, 13 Sep 2019 14:07:25 +0100
Message-ID: <20190913130518.849251524@linuxfoundation.org>
In-Reply-To: <20190913130510.727515099@linuxfoundation.org>

There is a race between mca_reap(), btree_node_free() and the journal
code in btree_flush_write(), which results in very rare and strange
deadlocks or panics that are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; select-and-flush is a two-step operation. Between these
two steps, the following may happen inside the race window:
- The selected btree node is reaped by mca_reap() and allocated to
  another requester for a different btree node.
- The selected btree node is selected, flushed and released by the mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it
first takes b->write_lock with mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree
node whose write_lock is already held, a deadlock very probably happens
here. A worse case is that the memory of the selected btree node has
been released; then any reference to this btree node (e.g.
b->write_lock) will trigger a NULL pointer dereference panic.

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one by one over a fairly long time period.

This race was not easy to reproduce before. On a Lenovo SR650 server
with 48 Xeon cores, configured with 1 NVMe SSD as the cache device and
an MD raid0 device assembled from 3 NVMe SSDs as the backing device, the
race can be observed roughly once every 10,000 calls to
btree_flush_write(). Both deadlocks and kernel panics happened as the
aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when a btree node is selected, and cleared after the btree node
is flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, the race is avoided.
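
The flag idea can be sketched in userspace C roughly as follows. This is
only an illustration under assumptions, not the bcache code: struct
node, NODE_JOURNAL_FLUSH, journal_seq and all function names are
hypothetical, and the c->bucket_lock that the patch holds while
selecting a node is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BTREE_NODE_journal_flush. */
#define NODE_JOURNAL_FLUSH (1UL << 0)

/* Hypothetical userspace node, not struct btree. */
struct node {
	atomic_ulong flags;
	unsigned long journal_seq;	/* 0: no journal pin; smaller is older */
};

static void node_init(struct node *n, unsigned long journal_seq)
{
	atomic_init(&n->flags, 0);
	n->journal_seq = journal_seq;
}

/* Pick the node with the oldest pin and mark it before returning it. */
static struct node *select_for_flush(struct node *nodes, size_t nr)
{
	struct node *best = NULL;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!nodes[i].journal_seq)
			continue;
		if (!best || nodes[i].journal_seq < best->journal_seq)
			best = &nodes[i];
	}
	if (best)
		atomic_fetch_or(&best->flags, NODE_JOURNAL_FLUSH);
	return best;
}

/* Flush is done: drop the pin and clear the mark. */
static void finish_flush(struct node *n)
{
	n->journal_seq = 0;
	atomic_fetch_and(&n->flags, ~NODE_JOURNAL_FLUSH);
}

/* Reaper side: a marked node must not be reused. */
static bool can_reap(struct node *n)
{
	return !(atomic_load(&n->flags) & NODE_JOURNAL_FLUSH);
}
```

In this sketch a node returned by select_for_flush() stays unreapable
until finish_flush() is called on it, which is the window the flag is
meant to cover.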

One corner case should be noted, that is btree_node_free(). It might
be called in some error handling code paths. For example, the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path
and the selected btree node has the BTREE_NODE_journal_flush bit set,
just delay for 1 us and retry. In this case the btree node won't be
skipped; just retry until the BTREE_NODE_journal_flush bit is cleared,
and then free the btree node memory.
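
The delay-and-retry behaviour can be sketched in userspace C as below.
Again this is an illustration under assumptions rather than the kernel
code: struct node, NODE_JOURNAL_FLUSH, the delay hook and the
fake_flusher test double are all hypothetical, and the b->write_lock
that the patch drops before each delay is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for BTREE_NODE_journal_flush. */
#define NODE_JOURNAL_FLUSH (1UL << 0)

/* Hypothetical userspace node, not struct btree. */
struct node {
	atomic_ulong flags;
	bool freed;
};

/* Hypothetical delay hook; the patch itself calls udelay(1) here. */
typedef void (*delay_fn)(void *ctx);

/*
 * Wait until the flusher has cleared the flag, then free the node.
 * Returns how many times the delay hook ran. This loops forever if
 * nobody ever clears the flag.
 */
static unsigned int node_free_when_idle(struct node *n, delay_fn delay,
					void *ctx)
{
	unsigned int retries = 0;

	while (atomic_load(&n->flags) & NODE_JOURNAL_FLUSH) {
		delay(ctx);
		retries++;
	}
	n->freed = true;
	return retries;
}

/* Test double: pretends the flush completes after a few delays. */
struct fake_flusher {
	struct node *n;
	unsigned int delays_until_done;
};

static void fake_delay(void *ctx)
{
	struct fake_flusher *f = ctx;

	if (f->delays_until_done > 0 && --f->delays_until_done == 0)
		atomic_fetch_and(&f->n->flags, ~NODE_JOURNAL_FLUSH);
}
```

The point of the sketch is that the free path does not give up on a
flagged node; it only postpones the free until the flag is gone.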

Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/btree.c   | 28 +++++++++++++++++++++++++++-
 drivers/md/bcache/btree.h   |  2 ++
 drivers/md/bcache/journal.c |  7 +++++++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 9788b2ee6638f..5cf3247e8afb2 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -35,7 +35,7 @@
 #include <linux/rcupdate.h>
 #include <linux/sched/clock.h>
 #include <linux/rculist.h>
-
+#include <linux/delay.h>
 #include <trace/events/bcache.h>
 
 /*
@@ -655,12 +655,25 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
 		up(&b->io_mutex);
 	}
 
+retry:
 	/*
 	 * BTREE_NODE_dirty might be cleared in btree_flush_btree() by
 	 * __bch_btree_node_write(). To avoid an extra flush, acquire
 	 * b->write_lock before checking BTREE_NODE_dirty bit.
 	 */
 	mutex_lock(&b->write_lock);
+	/*
+	 * If this btree node is selected in btree_flush_write() by journal
+	 * code, delay and retry until the node is flushed by journal code
+	 * and BTREE_NODE_journal_flush bit cleared by btree_flush_write().
+	 */
+	if (btree_node_journal_flush(b)) {
+		pr_debug("bnode %p is flushing by journal, retry", b);
+		mutex_unlock(&b->write_lock);
+		udelay(1);
+		goto retry;
+	}
+
 	if (btree_node_dirty(b))
 		__bch_btree_node_write(b, &cl);
 	mutex_unlock(&b->write_lock);
@@ -1077,7 +1090,20 @@ static void btree_node_free(struct btree *b)
 
 	BUG_ON(b == b->c->root);
 
+retry:
 	mutex_lock(&b->write_lock);
+	/*
+	 * If the btree node is selected and flushing in btree_flush_write(),
+	 * delay and retry until the BTREE_NODE_journal_flush bit cleared,
+	 * then it is safe to free the btree node here. Otherwise this btree
+	 * node will be in race condition.
+	 */
+	if (btree_node_journal_flush(b)) {
+		mutex_unlock(&b->write_lock);
+		pr_debug("bnode %p journal_flush set, retry", b);
+		udelay(1);
+		goto retry;
+	}
 
 	if (btree_node_dirty(b)) {
 		btree_complete_write(b, btree_current_write(b));
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d1c72ef64edf5..76cfd121a4861 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -158,11 +158,13 @@ enum btree_flags {
 	BTREE_NODE_io_error,
 	BTREE_NODE_dirty,
 	BTREE_NODE_write_idx,
+	BTREE_NODE_journal_flush,
 };
 
 BTREE_FLAG(io_error);
 BTREE_FLAG(dirty);
 BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
 
 static inline struct btree_write *btree_current_write(struct btree *b)
 {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index cae2aff5e27ae..33556acdcf9cd 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,7 @@ static void btree_flush_write(struct cache_set *c)
 retry:
 	best = NULL;
 
+	mutex_lock(&c->bucket_lock);
 	for_each_cached_btree(b, c, i)
 		if (btree_current_write(b)->journal) {
 			if (!best)
@@ -417,9 +418,14 @@ retry:
 		}
 
 	b = best;
+	if (b)
+		set_btree_node_journal_flush(b);
+	mutex_unlock(&c->bucket_lock);
+
 	if (b) {
 		mutex_lock(&b->write_lock);
 		if (!btree_current_write(b)->journal) {
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
 			mutex_unlock(&b->write_lock);
 			/* We raced */
 			atomic_long_inc(&c->retry_flush_write);
@@ -427,6 +433,7 @@ retry:
 		}
 
 		__bch_btree_node_write(b, NULL);
+		clear_bit(BTREE_NODE_journal_flush, &b->flags);
 		mutex_unlock(&b->write_lock);
 	}
 }
-- 
2.20.1




