From: "Coly Li"
Subject: Re: [PATCH] bcache: convert bch_register_lock to rw_semaphore
In-Reply-To: <20260318-wujing-bcache-v1-1-f0b9aaf3f81d@gmail.com>
References: <20260318-wujing-bcache-v1-1-f0b9aaf3f81d@gmail.com>
To: "Qiliang Yuan"
Cc: "Kent Overstreet"
Date: Wed, 18 Mar 2026 19:47:14 +0800
X-Mailing-List: linux-bcache@vger.kernel.org

On Wed, Mar 18, 2026 at 03:52:46PM +0800, Qiliang Yuan wrote:
> Refactor the global bch_register_lock from a mutex to an rw_semaphore to
> resolve severe lock contention and hung tasks (State D) during large-scale
> bcache device registration and concurrent sysfs access.
>
> Representative call trace from logs:
> [ 243.082130] INFO: task bcache_cache_se:3496 blocked for more than 121 seconds.
> [ 243.130817] Call trace:
> [ 243.134161] __switch_to+0x7c/0xbc
> [ 243.138461] __schedule+0x338/0x6f0
> [ 243.142847] schedule+0x50/0xe0
> [ 243.146884] schedule_preempt_disabled+0x18/0x24
> [ 243.152400] __mutex_lock.constprop.0+0x1d4/0x5ec
> [ 243.158002] __mutex_lock_slowpath+0x1c/0x30
> [ 243.163170] mutex_lock+0x50/0x60
> [ 243.167397] bch_cache_set_store+0x40/0x80 [bcache]
> [ 243.173175] sysfs_kf_write+0x4c/0x5c
> [ 243.177735] kernfs_fop_write_iter+0x130/0x1c0
> [ 243.183077] new_sync_write+0xec/0x18c
> [ 243.187724] vfs_write+0x214/0x2ac
> [ 243.192022] ksys_write+0x70/0xfc
> [ 243.196234] __arm64_sys_write+0x24/0x30
> [ 243.201057] invoke_syscall+0x50/0x11c
> [ 243.205705] el0_svc_common.constprop.0+0x158/0x164
> [ 243.211483] do_el0_svc+0x2c/0x9c
> [ 243.215696] el0_svc+0x20/0x30
> [ 243.219648] el0_sync_handler+0xb0/0xb4
> [ 243.224384] el0_sync+0x160/0x180
>
> This addresses the long-standing issue where a single slow bcache device
> initialization could block the entire system's bcache management path.

Yes, this is an already known issue. The root cause is that all metadata
and data buckets on the cache device are shared among all cached devices.
When a cached device is attached to or detached from the cache device,
there is no better way to distinguish the metadata and data of different
cached devices, so a big bch_register_lock is the only choice.

I see the issue you want to solve, but it is hard to solve because of the
above root cause. As for your patch, there is an obvious regression, which
I point out inline.

> Signed-off-by: Qiliang Yuan
> ---
>  drivers/md/bcache/bcache.h  |  2 +-
>  drivers/md/bcache/request.c | 18 +++++-----
>  drivers/md/bcache/super.c   | 85 +++++++++++++++++++++++++--------------------
>  drivers/md/bcache/sysfs.c   | 82 ++++++++++++++++++++++++++++++++++++++++---
>  drivers/md/bcache/sysfs.h   |  8 ++---
>  5 files changed, 139 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 8ccacba855475..7ab36987e945b 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -1003,7 +1003,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
>  extern struct workqueue_struct *bcache_wq;
>  extern struct workqueue_struct *bch_journal_wq;
>  extern struct workqueue_struct *bch_flush_wq;
> -extern struct mutex bch_register_lock;
> +extern struct rw_semaphore bch_register_lock;
>  extern struct list_head bch_cache_sets;

This change is a headache: changing the data type of the global
bch_register_lock may break the kernel ABI and make life hard for
downstream kernel maintainers.

[snipped]

> @@ -2029,8 +2029,12 @@ static int run_cache_set(struct cache_set *c)
>  			goto err;
>
>  		err = "error in recovery";
> -		if (bch_btree_check(c))
> +		downgrade_write(&bch_register_lock);
> +		if (bch_btree_check(c)) {
> +			up_read(&bch_register_lock);
> +			down_write(&bch_register_lock);
>  			goto err;
> +		}
>
>  		bch_journal_mark(c, &journal);
>  		bch_initial_gc_finish(c);

Consider one of the regressions. Before bch_btree_check() is called, the
cache set kobjects are already created and linked, which means the cache
set sysfs interface can be accessed before bch_btree_check() is called.
In the above code there is a window between up_read() and down_write().
If down_write() is blocked by a reader in another thread and someone
triggers the unregister sysfs interface, try to imagine what will happen.
I don't see that this is actually broken, but it looks really
uncomfortable.
The current mutex makes sure that such partial-initialization
circumstances won't happen. This is one example, and not the only one.

[snipped]

Coly Li