From: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
To: Coly Li <colyli@suse.de>
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [PATCH 06/29] bcache: fix race in btree_flush_write()
Date: Thu, 27 Jun 2019 17:16:12 +0800
Message-ID: <20190627091611.GA15287@byw>
In-Reply-To: <20190614131358.2771-7-colyli@suse.de>
On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
> There is a race between mca_reap(), btree_node_free() and the journal
> code btree_flush_write(), which results in a very rare and strange
> deadlock or panic that is very hard to reproduce.
>
> Let me explain how the race happens. In btree_flush_write() one btree
> node with the oldest journal pin is selected, then it is flushed to the
> cache device; select-and-flush is a two-step operation. Between these
> two steps, either of the following may happen inside the race window:
> - The selected btree node is reaped by mca_reap() and its memory is
> allocated to other requesters for another btree node.
> - The selected btree node is selected, flushed and released by the mca
> shrink callback bch_mca_scan().
> When btree_flush_write() tries to flush the selected btree node, it
> first takes b->write_lock via mutex_lock(). If the race happens and the
> memory of the selected btree node has been allocated to another btree
> node whose write_lock is already held, a deadlock very probably happens
> here. A worse case is that the memory of the selected btree node has
> been released; then any reference to this btree node (e.g.
> b->write_lock) triggers a NULL pointer dereference panic.
>
> This race was introduced in commit cafe56359144 ("bcache: A block layer
> cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
> occupancy during journal"), which selected 128 btree nodes and flushed
> them one-by-one in a quite long time period.
>
> Such a race was not easy to reproduce before. On a Lenovo SR650 server
> with 48 Xeon cores, with 1 NVMe SSD configured as cache device and an
> MD raid0 device assembled from 3 NVMe SSDs as backing device, this race
> can be observed around once every 10,000 calls to btree_flush_write().
> Both deadlock and kernel panic happened as aftermath of the race.
>
> The idea of the fix is to add a btree node flag BTREE_NODE_journal_flush.
> It is set when btree nodes are selected, and cleared after the btree
> nodes are flushed. When mca_reap() encounters a btree node with this
> bit set, the node is skipped. Since mca_reap() only reaps btree nodes
> without the BTREE_NODE_journal_flush flag, the race is avoided.
>
> One corner case should be noticed, that is btree_node_free(). It might
> be called in some error handling code paths. For example, consider the
> following code piece from btree_split(),
> 2149 err_free2:
> 2150 bkey_put(b->c, &n2->key);
> 2151 btree_node_free(n2);
> 2152 rw_unlock(true, n2);
> 2153 err_free1:
> 2154 bkey_put(b->c, &n1->key);
> 2155 btree_node_free(n1);
> 2156 rw_unlock(true, n1);
> At lines 2151 and 2155, the btree nodes n2 and n1 are released without
> mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
> If btree_node_free() is called directly in such an error handling path,
> and the selected btree node has the BTREE_NODE_journal_flush bit set,
> just wait for 1 jiffy and retry. In this case the btree node won't be
> skipped; btree_node_free() retries until the BTREE_NODE_journal_flush
> bit is cleared, then frees the btree node memory.
>
> Waiting for 1 jiffy inside btree_node_free() does not hurt performance
> much here, for the following reasons:
> - The error handling code path is not frequently executed, and a race
> inside this cold path should be very rare. If the very rare race does
> happen in the cold code path, waiting 1 jiffy is acceptable.
> - If btree_node_free() is called inside mca_reap(), the
> BTREE_NODE_journal_flush bit has already been checked, so no wait
> happens here.
>
> Besides the above fix, the way flushing btree nodes are selected is
> also changed in this patch. Let me explain what changes in this patch.
Then this change should be split into another patch. :)
>
Thread overview: 34+ messages
2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
2019-06-14 13:13 ` [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal" Coly Li
2019-06-14 13:13 ` [PATCH 02/29] bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" Coly Li
2019-06-14 13:13 ` [PATCH 03/29] bcache: add code comments for journal_read_bucket() Coly Li
2019-06-14 13:13 ` [PATCH 04/29] bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() Coly Li
2019-06-14 13:13 ` [PATCH 05/29] bcache: remove retry_flush_write from struct cache_set Coly Li
2019-06-14 13:13 ` [PATCH 06/29] bcache: fix race in btree_flush_write() Coly Li
2019-06-27 9:16 ` Yaowei Bai [this message]
2019-06-27 11:47 ` Coly Li
2019-06-27 12:45 ` Coly Li
2019-06-14 13:13 ` [PATCH 07/29] bcache: add reclaimed_journal_buckets to struct cache_set Coly Li
2019-06-14 13:13 ` [PATCH 08/29] bcache: fix return value error in bch_journal_read() Coly Li
2019-06-14 13:13 ` [PATCH 09/29] Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" Coly Li
2019-06-14 13:13 ` [PATCH 10/29] bcache: avoid flushing btree node in cache_set_flush() if io disabled Coly Li
2019-06-14 13:13 ` [PATCH 11/29] bcache: ignore read-ahead request failure on backing device Coly Li
2019-06-14 13:13 ` [PATCH 12/29] bcache: add io error counting in write_bdev_super_endio() Coly Li
2019-06-14 13:13 ` [PATCH 13/29] bcache: remove "XXX:" comment line from run_cache_set() Coly Li
2019-06-14 13:13 ` [PATCH 14/29] bcache: remove unnecessary prefetch() in bset_search_tree() Coly Li
2019-06-14 13:13 ` [PATCH 15/29] bcache: use sysfs_match_string() instead of __sysfs_match_string() Coly Li
2019-06-14 13:13 ` [PATCH 16/29] bcache: add return value check to bch_cached_dev_run() Coly Li
2019-06-14 13:13 ` [PATCH 17/29] bcache: remove unncessary code in bch_btree_keys_init() Coly Li
2019-06-14 13:13 ` [PATCH 18/29] bcache: check CACHE_SET_IO_DISABLE in allocator code Coly Li
2019-06-14 13:13 ` [PATCH 19/29] bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() Coly Li
2019-06-14 13:13 ` [PATCH 20/29] bcache: more detailed error message to bcache_device_link() Coly Li
2019-06-14 13:13 ` [PATCH 21/29] bcache: add more error message in bch_cached_dev_attach() Coly Li
2019-06-14 13:13 ` [PATCH 22/29] bcache: shrink btree node cache after bch_btree_check() Coly Li
2019-06-14 13:13 ` [PATCH 23/29] bcache: improve error message in bch_cached_dev_run() Coly Li
2019-06-14 13:13 ` [PATCH 24/29] bcache: make bset_search_tree() be more understandable Coly Li
2019-06-14 13:13 ` [PATCH 25/29] bcache: add pendings_cleanup to stop pending bcache device Coly Li
2019-06-14 13:13 ` [PATCH 26/29] bcache: avoid a deadlock in bcache_reboot() Coly Li
2019-06-14 13:13 ` [PATCH 27/29] bcache: acquire bch_register_lock later in cached_dev_detach_finish() Coly Li
2019-06-14 13:13 ` [PATCH 28/29] bcache: acquire bch_register_lock later in cached_dev_free() Coly Li
2019-06-14 13:13 ` [PATCH 29/29] bcache: fix potential deadlock in cached_def_free() Coly Li
2019-06-20 9:34 ` [PATCH 00/29] bcache candidate patches for Linux v5.3 Jens Axboe