public inbox for linux-bcache@vger.kernel.org
 help / color / mirror / Atom feed
From: "Coly Li" <colyli@fnnas.com>
To: "Qiliang Yuan" <realwujing@gmail.com>
Cc: "Kent Overstreet" <kent.overstreet@linux.dev>,
	 <linux-bcache@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] bcache: convert bch_register_lock to rw_semaphore
Date: Wed, 18 Mar 2026 19:47:14 +0800	[thread overview]
Message-ID: <abqFB7e0KWxBV08Y@studio.local> (raw)
In-Reply-To: <20260318-wujing-bcache-v1-1-f0b9aaf3f81d@gmail.com>

On Wed, Mar 18, 2026 at 03:52:46PM +0800, Qiliang Yuan wrote:
> Refactor the global bch_register_lock from a mutex to an rw_semaphore to
> resolve severe lock contention and hung tasks (State D) during large-scale
> bcache device registration and concurrent sysfs access.
> 
> Representative call trace from logs:
> [  243.082130] INFO: task bcache_cache_se:3496 blocked for more than 121 seconds.
> [  243.130817] Call trace:
> [  243.134161]  __switch_to+0x7c/0xbc
> [  243.138461]  __schedule+0x338/0x6f0
> [  243.142847]  schedule+0x50/0xe0
> [  243.146884]  schedule_preempt_disabled+0x18/0x24
> [  243.152400]  __mutex_lock.constprop.0+0x1d4/0x5ec
> [  243.158002]  __mutex_lock_slowpath+0x1c/0x30
> [  243.163170]  mutex_lock+0x50/0x60
> [  243.167397]  bch_cache_set_store+0x40/0x80 [bcache]
> [  243.173175]  sysfs_kf_write+0x4c/0x5c
> [  243.177735]  kernfs_fop_write_iter+0x130/0x1c0
> [  243.183077]  new_sync_write+0xec/0x18c
> [  243.187724]  vfs_write+0x214/0x2ac
> [  243.192022]  ksys_write+0x70/0xfc
> [  243.196234]  __arm64_sys_write+0x24/0x30
> [  243.201057]  invoke_syscall+0x50/0x11c
> [  243.205705]  el0_svc_common.constprop.0+0x158/0x164
> [  243.211483]  do_el0_svc+0x2c/0x9c
> [  243.215696]  el0_svc+0x20/0x30
> [  243.219648]  el0_sync_handler+0xb0/0xb4
> [  243.224384]  el0_sync+0x160/0x180
> 
> This addresses the long-standing issue where a single slow bcache device
> initialization could block the entire system's bcache management path.

Yes, this is an already know issue. The root cause is becasue all meta data and
data buckets on cache device are shared among all cached devices. When a cached
device is attached to or detached from cache device, there is no better method
to distinct meta data/data from different cached device, a big bch_register_lock
is the have-to choice.

I see the issue you want to solve, but it is hard due to the above root cause.
And for your patch, there is obvious regression. I will list it in line.



> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
>  drivers/md/bcache/bcache.h  |  2 +-
>  drivers/md/bcache/request.c | 18 +++++-----
>  drivers/md/bcache/super.c   | 85 +++++++++++++++++++++++++--------------------
>  drivers/md/bcache/sysfs.c   | 82 ++++++++++++++++++++++++++++++++++++++++---
>  drivers/md/bcache/sysfs.h   |  8 ++---
>  5 files changed, 139 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 8ccacba855475..7ab36987e945b 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -1003,7 +1003,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
>  extern struct workqueue_struct *bcache_wq;
>  extern struct workqueue_struct *bch_journal_wq;
>  extern struct workqueue_struct *bch_flush_wq;
> -extern struct mutex bch_register_lock;
> +extern struct rw_semaphore bch_register_lock;
>  extern struct list_head bch_cache_sets;

This is a headache change, because you change global bch_register_lock data
type, it may break kernel ABI and bring up hard life for downstream kernel
maintainers.

[snipped]

> @@ -2029,8 +2029,12 @@ static int run_cache_set(struct cache_set *c)
>  			goto err;
>  
>  		err = "error in recovery";
> -		if (bch_btree_check(c))
> +		downgrade_write(&bch_register_lock);
> +		if (bch_btree_check(c)) {
> +			up_read(&bch_register_lock);
> +			down_write(&bch_register_lock);
>  			goto err;
> +		}
>  
>  		bch_journal_mark(c, &journal);
>  		bch_initial_gc_finish(c);

Consider one of the regressions, before bch_btree_check() is called, the cache
set kobjects are created and linked already. It means the cache set sysfs inter-
face can be accessed before calling bch_tree_check(). In the above code there
is a gap/window between up_read() and down_write(), if down_write() is blocked
by other reader from other thread, and someone triggers the unregister sysfs
interface, try to image what will happen? I don't see this is broken issue, but
it looks really uncomfortable.

Current mutex will make sure such parital initalization circumstances won't
happen.

This is one example, and not the only one.

[snipped]

Coly Li

      reply	other threads:[~2026-03-18 11:47 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-18  7:52 [PATCH] bcache: convert bch_register_lock to rw_semaphore Qiliang Yuan
2026-03-18 11:47 ` Coly Li [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abqFB7e0KWxBV08Y@studio.local \
    --to=colyli@fnnas.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=realwujing@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox