From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Google-Smtp-Source: AIpwx49TKsY0iN1USE29/q5FKP7ZEIYuI+lj/yE5ydr8bjgFygEwFJH5hC1Ha4t8fkfGbLoe+8z+ ARC-Seal: i=1; a=rsa-sha256; t=1524653102; cv=none; d=google.com; s=arc-20160816; b=Lzz2Or5jlGJ3afBAIhD2YIGof0D3Fbq6kmDvikRI8hZnIH7Kl3866+JarfAJRqUVoJ p6SaI5Ak+PtCRFd1JEIjKFhrbQzzmmbpuQEs7T3zttMIIhBvFbnaKEbg08cXuJElwRj4 vsHYT7LMo4VnFkA6DocudrDeQlWzpDGCSacANgSvU2mjcvb5BHgasnnlszPD5if1fGO6 hjryro1axpzSW1f36ZOSM/1YA5fMhaOVjQnMg+3+refBLPtq37kt0yIiNENgHsbkunAW x8CeQqyx2vlkYhYFL9zZsa7bm7Z2MKb3efZeaCB6H4HDjBsZrbKPlztPI77CZ7EPDTgH GTeg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=mime-version:user-agent:references:in-reply-to:message-id:date :subject:cc:to:from:arc-authentication-results; bh=cxEc54InZLnu4pEfB0KZINs4BKxo0U0YLUrNwtCvBpo=; b=PH5pamFKTY/pG2aHCwtihIfLrvBYcJa+yICuX/jcD1zYmGtZWgQD+nZ9jnhcti2zPU A/hJDtdyWNlz8C1JcNZNS0nuiwCWN+1cu1O1IpxjR/SOoWSotyLPh/2PixxVnAoN8Un7 tXaWMBygsmx6cRNm05WzjPGr1wQMkBR7xp3rHIE8f0InAunikfeMXgQIdeeKvvJgRsIr rFH+avmwU8JXvXW3NjitlYZtp8ZypYjjNfF+PIr9xZVs9lt1f+fOX5J38lEjgpu65Ed2 LvYPJMli/8tuEZTGrdq379/QyAMO8Jj2c5RXnHHLre7Ud/kKJV336H8TWh5MzM+2WK5H E2eQ== ARC-Authentication-Results: i=1; mx.google.com; spf=softfail (google.com: domain of transitioning gregkh@linuxfoundation.org does not designate 90.92.61.202 as permitted sender) smtp.mailfrom=gregkh@linuxfoundation.org Authentication-Results: mx.google.com; spf=softfail (google.com: domain of transitioning gregkh@linuxfoundation.org does not designate 90.92.61.202 as permitted sender) smtp.mailfrom=gregkh@linuxfoundation.org From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Hua Rui , Tang Junhui , Michael Lyle , Jens Axboe , Sasha Levin Subject: [PATCH 4.14 148/183] bcache: fix for allocator and register thread race Date: Wed, 25 Apr 2018 12:36:08 +0200 Message-Id: <20180425103248.485231934@linuxfoundation.org> X-Mailer: git-send-email 2.17.0 In-Reply-To: <20180425103242.532713678@linuxfoundation.org> References: <20180425103242.532713678@linuxfoundation.org> User-Agent: quilt/0.65 X-stable: review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-LABELS: =?utf-8?b?IlxcU2VudCI=?= X-GMAIL-THRID: =?utf-8?q?1598714651958775857?= X-GMAIL-MSGID: =?utf-8?q?1598714651958775857?= X-Mailing-List: linux-kernel@vger.kernel.org List-ID: 4.14-stable review patch. If anyone has any objections, please let me know. ------------------ From: Tang Junhui [ Upstream commit 682811b3ce1a5a4e20d700939a9042f01dbc66c4 ] After long time running of random small IO writing, I reboot the machine, and after the machine power on, I found bcache got stuck, the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [] closure_sync+0x25/0x90 [bcache] [] bch_journal+0x118/0x2b0 [bcache] [] bch_journal_meta+0x47/0x70 [bcache] [] bch_prio_write+0x237/0x340 [bcache] [] bch_allocator_thread+0x3c8/0x3d0 [bcache] [] kthread+0xcf/0xe0 [] ret_from_fork+0x58/0x90 [] 0xffffffffffffffff [root@ceph153 ~]# cat /proc/2038/task/*/stack [] __bch_btree_map_nodes+0x12d/0x150 [bcache] [] bch_btree_insert+0xf1/0x170 [bcache] [] bch_journal_replay+0x13f/0x230 [bcache] [] run_cache_set+0x79a/0x7c2 [bcache] [] register_bcache+0xd48/0x1310 [bcache] [] kobj_attr_store+0xf/0x20 [] sysfs_write_file+0xc6/0x140 [] vfs_write+0xbd/0x1e0 [] SyS_write+0x7f/0xe0 [] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering cache device. I reboot the machine several times, the issue always exsit in this machine. I debug the code, and found the call trace as bellow: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), It will check if there is enough buckets of RESERVE_BTREE type, since allocator thread did not work yet, so no buckets of RESERVE_BTREE type allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep. Then the allocator thread initialized, the call trace is bellow: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), It will check if journal is full by journal_full(), but the long time random small IO writing causes the exhaustion of journal buckets(journal.blocks_free=0), In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree nodes writing over or there has already some journal buckets space, then the allocator thread goes to sleep. but in btree_flush_write(), since bch_journal_replay() is not finished, so no btree nodes have journal (condition "if (btree_current_write(b)->journal)" never satisfied), so we got no btree node to flush, no journal bucket released, and allocator sleep all the times. Through the above analysis, we can see that: 1) Register thread wait for allocator thread to allocate buckets of RESERVE_BTREE type; 2) Alloctor thread wait for register thread to replay journal, so it can flush btree nodes and get journal bucket. then they are all got stuck by waiting for each other. Hua Rui provided a patch for me, by allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get bucket when btree node splitting and no need to waiting for the allocator thread. I tested it, it has effect, and register thread run a step forward, but finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and in the same time, alloctor thread did not flush enough btree nodes to release a journal bucket, so they all got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough? By experience and test, I think it should be as much as journal buckets. Then I modify the code as this patch, and test in the machine, and it works. This patch modified base on Hua Rui’s patch, and allocate more buckets of RESERVE_BTREE type in advance to avoid register thread and allocate thread going to wait for each other. [patch v2] ca->sb.njournal_buckets would be 0 in the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui Signed-off-by: Tang Junhui Reviewed-by: Michael Lyle Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman --- drivers/md/bcache/btree.c | 9 ++++++--- drivers/md/bcache/super.c | 13 ++++++++++++- 2 files changed, 18 insertions(+), 4 deletions(-) --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1868,14 +1868,17 @@ void bch_initial_gc_finish(struct cache_ */ for_each_cache(ca, c, i) { for_each_bucket(b, ca) { - if (fifo_full(&ca->free[RESERVE_PRIO])) + if (fifo_full(&ca->free[RESERVE_PRIO]) && + fifo_full(&ca->free[RESERVE_BTREE])) break; if (bch_can_invalidate_bucket(ca, b) && !GC_MARK(b)) { __bch_invalidate_one_bucket(ca, b); - fifo_push(&ca->free[RESERVE_PRIO], - b - ca->buckets); + if (!fifo_push(&ca->free[RESERVE_PRIO], + b - ca->buckets)) + fifo_push(&ca->free[RESERVE_BTREE], + b - ca->buckets); } } } --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1829,6 +1829,7 @@ void bch_cache_release(struct kobject *k static int cache_alloc(struct cache *ca) { size_t free; + size_t btree_buckets; struct bucket *b; __module_get(THIS_MODULE); @@ -1836,9 +1837,19 @@ static int cache_alloc(struct cache *ca) bio_init(&ca->journal.bio, ca->journal.bio.bi_inline_vecs, 8); + /* + * when ca->sb.njournal_buckets is not zero, journal exists, + * and in bch_journal_replay(), tree node may split, + * so bucket of RESERVE_BTREE type is needed, + * the worst situation is all journal buckets are valid journal, + * and all the keys need to replay, + * so the number of RESERVE_BTREE type buckets should be as much + * as journal buckets + */ + btree_buckets = ca->sb.njournal_buckets ?: 8; free = roundup_pow_of_two(ca->sb.nbuckets) >> 10; - if (!init_fifo(&ca->free[RESERVE_BTREE], 8, GFP_KERNEL) || + if (!init_fifo(&ca->free[RESERVE_BTREE], btree_buckets, GFP_KERNEL) || !init_fifo_exact(&ca->free[RESERVE_PRIO], prio_buckets(ca), GFP_KERNEL) || !init_fifo(&ca->free[RESERVE_MOVINGGC], free, GFP_KERNEL) || !init_fifo(&ca->free[RESERVE_NONE], free, GFP_KERNEL) ||