Linux block layer
 help / color / mirror / Atom feed
* [PATCH v2] virtio-blk: clamp zone report to the report buffer capacity
From: Michael Bommarito @ 2026-06-07 12:48 UTC (permalink / raw)
  To: Michael S . Tsirkin, Jason Wang, Stefan Hajnoczi, Jens Axboe
  Cc: Xuan Zhuo, virtualization, linux-block, linux-kernel

virtblk_report_zones() trusts the device-reported number of zones when
walking the report buffer:

	nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
		   nr_zones);
	...
	for (i = 0; i < nz && zone_idx < nr_zones; i++) {
		ret = virtblk_parse_zone(vblk, &report->zones[i], ...);

The buffer is allocated by virtblk_alloc_report_buffer(), whose size is
capped by the queue's max hardware sectors and max segments and can
therefore hold fewer descriptors than nr_zones. nz is bounded only by
the device-supplied report->nr_zones and the requested nr_zones, never
by the buffer's descriptor capacity. At probe time the request count is
unbounded (blk_revalidate_disk_zones() calls report_zones() with
nr_zones == UINT_MAX), so the device-supplied report->nr_zones is the
sole gate: a device that reports more zones than fit in the buffer
drives the loop to read report->zones[i] past the end of the allocation.

A malicious or buggy virtio-blk device that reports an inflated nr_zones
triggers this during zone revalidation at probe. KASAN reports a
vmalloc-out-of-bounds read in virtblk_report_zones() against the report
buffer allocated a few lines earlier.

Clamp nz to the number of descriptors that actually fit in the report
buffer.

Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
v2: drop the explanatory comment per Michael S. Tsirkin's review; the
    clamp itself is unchanged.

 drivers/block/virtio_blk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27..32bf3ba 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -689,6 +689,8 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
 
 		nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
 			   nr_zones);
+		nz = min_t(u64, nz,
+			   (buflen - sizeof(*report)) / sizeof(report->zones[0]));
 		if (!nz)
 			break;
 

base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.53.0

^ permalink raw reply related

* [PATCH v2] zram: fix partial I/O gating on non-4K PAGE_SIZE
From: Jianyue Wu @ 2026-06-07 15:31 UTC (permalink / raw)
  To: minchan
  Cc: Jianyue Wu, Sergey Senozhatsky, Jens Axboe, linux-kernel,
	linux-block

IS_ENABLED() mainly for CONFIG_* symbols. ZRAM_PARTIAL_IO is a macro
defined as 1 on non-4K builds, so IS_ENABLED(ZRAM_PARTIAL_IO) becomes
IS_ENABLED(1) and evaluates false.

Replace that check with PAGE_SIZE == 4096 and fold is_partial_io() into
one helper so partial-I/O policy stays consistent. PAGE_SIZE is a
build-time constant, so the PAGE_SIZE == 4096 checks fold away on the
configurations where partial I/O is supported.

Tested-on: Raspberry Pi 5 (BCM2712, 4 KiB and 16 KiB page kernels)

Signed-off-by: Jianyue Wu <wujianyue000@gmail.com>
---
To: Minchan Kim <minchan@kernel.org>
To: Sergey Senozhatsky <senozhatsky@chromium.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Cc: linux-block@vger.kernel.org

zram: fix partial I/O gating on non-4K PAGE_SIZE

On PAGE_SIZE > 4K, zram writeback can use sub-page bvec I/O. The
synchronous read_from_bdev() path is used for that case.

v1 used IS_ENABLED(ZRAM_PARTIAL_IO) where ZRAM_PARTIAL_IO is a local
macro defined as 1, so the check expands to IS_ENABLED(1) and is always
false. The WARN_ON_ONCE(!IS_ENABLED(...)) guard then rejects the sync
path with -EIO.

Replace that check with PAGE_SIZE == 4096 and fold is_partial_io() into
one helper so partial-I/O policy stays consistent. PAGE_SIZE is a
build-time constant, so the PAGE_SIZE == 4096 checks fold away on the
configurations where partial I/O is supported.

Testing (Raspberry Pi 5, BCM2712, rpi-6.12.y):

16 KiB kernel (6.12.92-v8-16k+):
- full-page I/O: PASS
- sub-page I/O: PASS
- writeback-backed read: PASS (bd_reads=100)
- no zram WARNING in dmesg

4 KiB kernel (6.12.92-v8-4k+):
- full-page I/O: PASS
- sub-page I/O: PASS (regression; partial path not used by design)
- writeback-backed read: PASS (bd_reads=161)
- no WARN_ON_ONCE(PAGE_SIZE == 4096) in dmesg

Writeback tests use a loop block device as backing_dev. On 4 KiB
builds partial-path success is not required by design because
is_partial_io() is always false when PAGE_SIZE == 4096.
---
Changes in v2:
- Use PAGE_SIZE == 4096 for the read_from_bdev() guard.
- Fold is_partial_io() into one helper.
- Expand commit message with root cause, impact, and build-time note.
- Add Raspberry Pi 5 validation on 4 KiB and 16 KiB kernels.
- Link to v1: https://lore.kernel.org/r/20260531-zram-fix-partial-io-config-check-on-akpm-v1-1-eb085d98faea@gmail.com

Jianyue Wu (1):
  zram: fix partial I/O gating on non-4K PAGE_SIZE
---
 drivers/block/zram/zram_drv.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6e1330ce4bc1..ddea09afd8dd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -216,18 +216,12 @@ static bool zram_can_store_page(struct zram *zram)
 	return !zram->limit_pages || alloced_pages <= zram->limit_pages;
 }
 
-#if PAGE_SIZE != 4096
 static inline bool is_partial_io(struct bio_vec *bvec)
 {
+	if (PAGE_SIZE == 4096)
+		return false;
 	return bvec->bv_len != PAGE_SIZE;
 }
-#define ZRAM_PARTIAL_IO		1
-#else
-static inline bool is_partial_io(struct bio_vec *bvec)
-{
-	return false;
-}
-#endif
 
 #if defined CONFIG_ZRAM_WRITEBACK || defined CONFIG_ZRAM_MULTI_COMP
 struct zram_pp_slot {
@@ -1510,7 +1504,8 @@ static int read_from_bdev(struct zram *zram, struct page *page, u32 index,
 {
 	atomic64_inc(&zram->stats.bd_reads);
 	if (!parent) {
-		if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
+		/* Sub-page I/O only exists on non-4K PAGE_SIZE builds. */
+		if (WARN_ON_ONCE(PAGE_SIZE == 4096))
 			return -EIO;
 		return read_from_bdev_sync(zram, page, index, blk_idx);
 	}

---
base-commit: 404fb4f38e8f38469dfff4df0205c9d18eeb1f57
change-id: 20260531-zram-fix-partial-io-config-check-on-akpm-c62b972416f8

Best regards,
-- 
Jianyue Wu <wujianyue000@gmail.com>


^ permalink raw reply related

* Re: [PATCH] block: Add WQ_PERCPU to alloc_workqueue users
From: Marco Crivellari @ 2026-06-07 18:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-block, Tejun Heo, Lai Jiangshan,
	Frederic Weisbecker, Sebastian Andrzej Siewior, Michal Hocko,
	Damien Le Moal
In-Reply-To: <178068011170.856067.204306158214952007.b4-ty@b4>

On Fri, Jun 5, 2026 at 7:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>
>
> On Thu, 04 Jun 2026 12:53:47 +0200, Marco Crivellari wrote:
> > This continues the effort to refactor workqueue APIs, which began with
> > the introduction of new workqueues and a new alloc_workqueue flag in:
> >
> >    commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
> >    commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
> >
> > The refactoring is going to alter the default behavior of
> > alloc_workqueue() to be unbound by default.
> >
> > [...]
>
> Applied, thanks!
>
> [1/1] block: Add WQ_PERCPU to alloc_workqueue users
>       commit: 7e712f292e7f01e91d09e83eb7b9526f77f66c71
>

Many thanks!

-- 

Marco Crivellari

SUSE Labs

^ permalink raw reply

* Re: [PATCH 2/5] bfq: protect q->blkg_list iteration in bfq_end_wr_async() with blkcg_mutex
From: yu kuai @ 2026-06-08  2:48 UTC (permalink / raw)
  To: Nilay Shroff, Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Bart Van Assche, linux-block,
	cgroups, linux-kernel, yukuai
In-Reply-To: <a532857a-16d5-4bef-bbd1-3bc080363182@linux.ibm.com>

Hi,

在 2026/6/5 1:31, Nilay Shroff 写道:
> On 6/3/26 6:57 PM, Yu Kuai wrote:
>> bfq_end_wr_async() iterates q->blkg_list while only holding bfqd->lock,
>> but not blkcg_mutex. This can race with blkg_free_workfn() that removes
>> blkgs from the list while holding blkcg_mutex.
>>
>> Add blkcg_mutex protection in bfq_end_wr() before taking bfqd->lock to
>> ensure proper synchronization when iterating q->blkg_list.
>>
>> Signed-off-by: Yu Kuai <yukuai@fygo.io>
>> ---
>>   block/bfq-cgroup.c  | 3 ++-
>>   block/bfq-iosched.c | 6 ++++++
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
>> index 37ab70930c8d..f765e767d36a 100644
>> --- a/block/bfq-cgroup.c
>> +++ b/block/bfq-cgroup.c
>> @@ -939,11 +939,12 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
>>       struct blkcg_gq *blkg;
>>         list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
>>           struct bfq_group *bfqg = blkg_to_bfqg(blkg);
>>   -        bfq_end_wr_async_queues(bfqd, bfqg);
>> +        if (bfqg)
>> +            bfq_end_wr_async_queues(bfqd, bfqg);
>>       }
>>       bfq_end_wr_async_queues(bfqd, bfqd->root_group);
>>   }
>>     static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 141c602d5e85..42ccfd0c6140 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -2643,10 +2643,13 @@ void bfq_end_wr_async_queues(struct bfq_data 
>> *bfqd,
>>   static void bfq_end_wr(struct bfq_data *bfqd)
>>   {
>>       struct bfq_queue *bfqq;
>>       int i;
>>   +#ifdef CONFIG_BFQ_GROUP_IOSCHED
>> +    mutex_lock(&bfqd->queue->blkcg_mutex);
>> +#endif
>>       spin_lock_irq(&bfqd->lock);
>>         for (i = 0; i < bfqd->num_actuators; i++) {
>>           list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
>>               bfq_bfqq_end_wr(bfqq);
>> @@ -2654,10 +2657,13 @@ static void bfq_end_wr(struct bfq_data *bfqd)
>>       list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
>>           bfq_bfqq_end_wr(bfqq);
>>       bfq_end_wr_async(bfqd);
>>         spin_unlock_irq(&bfqd->lock);
>> +#ifdef CONFIG_BFQ_GROUP_IOSCHED
>> +    mutex_unlock(&bfqd->queue->blkcg_mutex);
>> +#endif
>>   }
>
> The above change protects the q->blkg_list iteration in 
> bfq_end_wr_async()
> against list removal in blkg_free_workfn(). However the blkg insertion in
> blkg_create() still doesn't use q->blkcg_mutex and so list traversal in
> bfq_end_wr_async() may still race with blkg_create().
>
> So I think we may also need to protect blkg insert in blkg_create() using
> q->blkcg_mutex.

Yes, this is done in another huge patchset, because currently blkg_create()
is protected by queue_lock and can be called under rcu, code refactor will
be required.

I'll send the first set soon.

>
> Thanks,
> --Nilay

-- 
Thanks,
Kuai

^ permalink raw reply

* [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm

From: Yu Kuai <yukuai@fygo.io>

Hi,

This series is the follow-up blk-cgroup locking cleanup on top of the
earlier blkg-list protection fixes, and prepares blk-cgroup to stop using
q->queue_lock as the global blkg lifetime/iteration lock.

The current queue_lock based protection is hard to maintain because
queue_lock is used from hardirq and softirq completion paths, while some
blkcg cgroup file paths also need to iterate blkgs, print policy data, or
create blkgs from RCU-protected contexts.  This series first tightens the
blkcg-side lifetime rules:

- blkcg_print_stat() iterates blkgs under blkcg->lock with IRQs disabled.
- policy data freeing is delayed past an RCU grace period.
- blkcg_print_blkgs(), blkg lookup/create, bio association, page-IO
  association, blkg destruction, and BFQ initialization stop nesting
  queue_lock under RCU or blkcg->lock.

Using blkcg->lock and RCU for blkcg-owned lists/data keeps the lock order
local to blk-cgroup and avoids extending queue_lock into cgroup file
iteration paths.  It also makes the subsequent conversion to q->blkcg_mutex
possible without carrying forward queue_lock's interrupt-context
constraints.

Yu Kuai (8):
  blk-cgroup: protect iterating blkgs with blkcg->lock in
    blkcg_print_stat()
  blk-cgroup: delay freeing policy data after rcu grace period
  blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs()
  blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
  blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg()
  blk-cgroup: don't nest queue_lock under blkcg->lock in
    blkcg_destroy_blkgs()
  mm/page_io: don't nest queue_lock under rcu in
    bio_associate_blkg_from_page()
  block, bfq: don't grab queue_lock to initialize bfq

 block/bfq-cgroup.c        |  17 ++++-
 block/bfq-iosched.c       |   5 --
 block/blk-cgroup-rwstat.c |  15 ++--
 block/blk-cgroup.c        | 151 ++++++++++++++++++++++----------------
 block/blk-cgroup.h        |   8 +-
 block/blk-iocost.c        |  22 ++++--
 block/blk-iolatency.c     |  10 ++-
 block/blk-throttle.c      |  13 +++-
 mm/page_io.c              |   7 +-
 9 files changed, 158 insertions(+), 90 deletions(-)


base-commit: b23df513de562739af61fa61ba80ef5e8059a636
-- 
2.51.0

^ permalink raw reply

* [PATCH 1/8] blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

blkcg_print_one_stat() will be called for each blkg:
- access blkg->iostat, which is freed from rcu callback
  blkg_free_workfn();
- access policy data from pd_stat_fn(), which is freed from
  pd_free_fn(), while pd_free_fn() can be called by removing blkcg or
  deactivating policy;

Take blkcg->lock while iterating so the blkgs stay online and both
blkg->iostat and policy data for activated policies stay valid.  Use
irq-safe locking because blkcg->lock can be nested under q->queue_lock,
which is used from IRQ completion paths.

Prepare to convert protecting blkgs from request_queue with mutex.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c75b2a103bbc..b55c43f72bcb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1241,17 +1241,14 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
 		css_rstat_flush(&blkcg->css);
 
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
-		spin_lock_irq(&blkg->q->queue_lock);
+	guard(spinlock_irq)(&blkcg->lock);
+	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node)
 		blkcg_print_one_stat(blkg, sf);
-		spin_unlock_irq(&blkg->q->queue_lock);
-	}
-	rcu_read_unlock();
+
 	return 0;
 }
 
 static struct cftype blkcg_files[] = {
 	{
-- 
2.51.0

^ permalink raw reply related

* [PATCH 2/8] blk-cgroup: delay freeing policy data after rcu grace period
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

Currently blkcg_print_blkgs() must hold RCU to iterate blkgs from a
blkcg, and prfill() must hold queue_lock to prevent policy data from
being freed by policy deactivation. As a consequence, queue_lock has to
be nested under RCU from blkcg_print_blkgs().

Delay freeing policy data until after an RCU grace period so prfill() can
be protected by RCU alone.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c    |  9 ++++++++-
 block/blk-cgroup.h    |  2 ++
 block/blk-iocost.c    | 14 ++++++++++++--
 block/blk-iolatency.c | 10 +++++++++-
 block/blk-throttle.c  | 13 +++++++++++--
 5 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index f765e767d36a..56f60e36c799 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -548,17 +548,24 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
 	bfqg->active_entities = 0;
 	bfqg->num_queues_with_pending_reqs = 0;
 	bfqg->rq_pos_tree = RB_ROOT;
 }
 
-static void bfq_pd_free(struct blkg_policy_data *pd)
+static void bfqg_release(struct rcu_head *rcu)
 {
+	struct blkg_policy_data *pd =
+		container_of(rcu, struct blkg_policy_data, rcu_head);
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
 	bfqg_put(bfqg);
 }
 
+static void bfq_pd_free(struct blkg_policy_data *pd)
+{
+	call_rcu(&pd->rcu_head, bfqg_release);
+}
+
 static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
 	bfqg_stats_reset(&bfqg->stats);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 1cce3294634d..fd206d1fa3c9 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -138,10 +138,12 @@ static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
 struct blkg_policy_data {
 	/* the blkg and policy id this per-policy data belongs to */
 	struct blkcg_gq			*blkg;
 	int				plid;
 	bool				online;
+
+	struct rcu_head			rcu_head;
 };
 
 /*
  * Policies that need to keep per-blkcg data which is independent from any
  * request_queue associated to it should implement cpd_alloc/free_fn()
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 0cca88a366dc..c136b1f46fcc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -3024,10 +3024,20 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 	spin_lock_irqsave(&ioc->lock, flags);
 	weight_updated(iocg, &now);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 }
 
+static void iocg_release(struct rcu_head *rcu)
+{
+	struct blkg_policy_data *pd =
+		container_of(rcu, struct blkg_policy_data, rcu_head);
+	struct ioc_gq *iocg = pd_to_iocg(pd);
+
+	free_percpu(iocg->pcpu_stat);
+	kfree(iocg);
+}
+
 static void ioc_pd_free(struct blkg_policy_data *pd)
 {
 	struct ioc_gq *iocg = pd_to_iocg(pd);
 	struct ioc *ioc = iocg->ioc;
 	unsigned long flags;
@@ -3048,12 +3058,12 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
 
 		spin_unlock_irqrestore(&ioc->lock, flags);
 
 		hrtimer_cancel(&iocg->waitq_timer);
 	}
-	free_percpu(iocg->pcpu_stat);
-	kfree(iocg);
+
+	call_rcu(&pd->rcu_head, iocg_release);
 }
 
 static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)
 {
 	struct ioc_gq *iocg = pd_to_iocg(pd);
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 53e8dd2dfa8a..c79056410cd9 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -1026,17 +1026,25 @@ static void iolatency_pd_offline(struct blkg_policy_data *pd)
 
 	iolatency_set_min_lat_nsec(blkg, 0);
 	iolatency_clear_scaling(blkg);
 }
 
-static void iolatency_pd_free(struct blkg_policy_data *pd)
+static void iolat_release(struct rcu_head *rcu)
 {
+	struct blkg_policy_data *pd =
+		container_of(rcu, struct blkg_policy_data, rcu_head);
 	struct iolatency_grp *iolat = pd_to_lat(pd);
+
 	free_percpu(iolat->stats);
 	kfree(iolat);
 }
 
+static void iolatency_pd_free(struct blkg_policy_data *pd)
+{
+	call_rcu(&pd->rcu_head, iolat_release);
+}
+
 static struct cftype iolatency_files[] = {
 	{
 		.name = "latency",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = iolatency_print_limit,
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cabf91f0d0dc..0f89fb03cdb6 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -351,20 +351,29 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
 }
 
-static void throtl_pd_free(struct blkg_policy_data *pd)
+static void tg_release(struct rcu_head *rcu)
 {
+	struct blkg_policy_data *pd =
+		container_of(rcu, struct blkg_policy_data, rcu_head);
 	struct throtl_grp *tg = pd_to_tg(pd);
 
-	timer_delete_sync(&tg->service_queue.pending_timer);
 	blkg_rwstat_exit(&tg->stat_bytes);
 	blkg_rwstat_exit(&tg->stat_ios);
 	kfree(tg);
 }
 
+static void throtl_pd_free(struct blkg_policy_data *pd)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+
+	timer_delete_sync(&tg->service_queue.pending_timer);
+	call_rcu(&pd->rcu_head, tg_release);
+}
+
 static struct throtl_grp *
 throtl_rb_first(struct throtl_service_queue *parent_sq)
 {
 	struct rb_node *n;
 
-- 
2.51.0

^ permalink raw reply related

* [PATCH 3/8] blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

With previous modification to delay freeing policy data after an RCU grace
period, prfill() can run under RCU instead of taking queue_lock. However,
policy teardown can still clear blkg->pd[plid] after blkcg_print_blkgs()
observes the policy enabled bit.

Load policy data once with READ_ONCE() and skip the blkg if teardown
already cleared it. Do the same in recursive stat walks for descendant
blkgs. Remove the stale BFQ debug queue_lock assertion because
blkcg_print_blkgs() no longer calls prfill() with queue_lock held. This
also lets ioc_qos_prfill() and ioc_cost_model_prfill() use IRQ-safe
ioc->lock locking without re-enabling IRQs while queue_lock is still held.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c        |  8 +++++---
 block/blk-cgroup-rwstat.c | 15 +++++++++------
 block/blk-cgroup.c        | 22 +++++++++++++---------
 block/blk-cgroup.h        |  6 +++---
 block/blk-iocost.c        |  8 ++++----
 5 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 56f60e36c799..904d9e0d9029 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -1146,20 +1146,22 @@ static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
 	struct blkcg_gq *blkg = pd_to_blkg(pd);
 	struct blkcg_gq *pos_blkg;
 	struct cgroup_subsys_state *pos_css;
 	u64 sum = 0;
 
-	lockdep_assert_held(&blkg->q->queue_lock);
-
 	rcu_read_lock();
 	blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
+		struct blkg_policy_data *pd;
 		struct bfq_stat *stat;
 
 		if (!pos_blkg->online)
 			continue;
 
-		stat = (void *)blkg_to_pd(pos_blkg, &blkcg_policy_bfq) + off;
+		pd = blkg_to_pd(pos_blkg, &blkcg_policy_bfq);
+		if (!pd)
+			continue;
+		stat = (void *)pd + off;
 		sum += bfq_stat_read(stat) + atomic64_read(&stat->aux_cnt);
 	}
 	rcu_read_unlock();
 
 	return __blkg_prfill_u64(sf, pd, sum);
diff --git a/block/blk-cgroup-rwstat.c b/block/blk-cgroup-rwstat.c
index a55fb0c53558..aae910713814 100644
--- a/block/blk-cgroup-rwstat.c
+++ b/block/blk-cgroup-rwstat.c
@@ -99,26 +99,29 @@ void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol,
 {
 	struct blkcg_gq *pos_blkg;
 	struct cgroup_subsys_state *pos_css;
 	unsigned int i;
 
-	lockdep_assert_held(&blkg->q->queue_lock);
+	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	memset(sum, 0, sizeof(*sum));
-	rcu_read_lock();
 	blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
 		struct blkg_rwstat *rwstat;
 
 		if (!pos_blkg->online)
 			continue;
 
-		if (pol)
-			rwstat = (void *)blkg_to_pd(pos_blkg, pol) + off;
-		else
+		if (pol) {
+			struct blkg_policy_data *pd = blkg_to_pd(pos_blkg, pol);
+
+			if (!pd)
+				continue;
+			rwstat = (void *)pd + off;
+		} else {
 			rwstat = (void *)pos_blkg + off;
+		}
 
 		for (i = 0; i < BLKG_RWSTAT_NR; i++)
 			sum->cnt[i] += blkg_rwstat_read_counter(rwstat, i);
 	}
-	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b55c43f72bcb..46fc65050c38 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -699,13 +699,13 @@ const char *blkg_dev_name(struct blkcg_gq *blkg)
  * @data: data to be passed to @prfill
  * @show_total: to print out sum of prfill return values or not
  *
  * This function invokes @prfill on each blkg of @blkcg if pd for the
  * policy specified by @pol exists.  @prfill is invoked with @sf, the
- * policy data and @data and the matching queue lock held.  If @show_total
- * is %true, the sum of the return values from @prfill is printed with
- * "Total" label at the end.
+ * policy data and @data under RCU read lock.  If @show_total is %true, the
+ * sum of the return values from @prfill is printed with "Total" label at the
+ * end.
  *
  * This is to be used to construct print functions for
  * cftype->read_seq_string method.
  */
 void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
@@ -717,14 +717,18 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
 	struct blkcg_gq *blkg;
 	u64 total = 0;
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
-		spin_lock_irq(&blkg->q->queue_lock);
-		if (blkcg_policy_enabled(blkg->q, pol))
-			total += prfill(sf, blkg->pd[pol->plid], data);
-		spin_unlock_irq(&blkg->q->queue_lock);
+		struct blkg_policy_data *pd;
+
+		if (!blkcg_policy_enabled(blkg->q, pol))
+			continue;
+
+		pd = blkg_to_pd(blkg, pol);
+		if (pd)
+			total += prfill(sf, pd, data);
 	}
 	rcu_read_unlock();
 
 	if (show_total)
 		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
@@ -1591,11 +1595,11 @@ static void blkcg_policy_teardown_pds(struct request_queue *q,
 		if (pd) {
 			if (pd->online && pol->pd_offline_fn)
 				pol->pd_offline_fn(pd);
 			pd->online = false;
 			pol->pd_free_fn(pd);
-			blkg->pd[pol->plid] = NULL;
+			WRITE_ONCE(blkg->pd[pol->plid], NULL);
 		}
 		spin_unlock(&blkcg->lock);
 	}
 }
 
@@ -1683,11 +1687,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 		spin_lock(&blkg->blkcg->lock);
 
 		pd->blkg = blkg;
 		pd->plid = pol->plid;
-		blkg->pd[pol->plid] = pd;
+		WRITE_ONCE(blkg->pd[pol->plid], pd);
 
 		if (pol->pd_init_fn)
 			pol->pd_init_fn(pd);
 
 		if (pol->pd_online_fn)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index fd206d1fa3c9..5402b4ff6f3f 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -279,13 +279,13 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
  * @pol: policy of interest
  *
  * Return pointer to private data associated with the @blkg-@pol pair.
  */
 static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
-						  struct blkcg_policy *pol)
+						  const struct blkcg_policy *pol)
 {
-	return blkg ? blkg->pd[pol->plid] : NULL;
+	return blkg ? READ_ONCE(blkg->pd[pol->plid]) : NULL;
 }
 
 static inline struct blkcg_policy_data *blkcg_to_cpd(struct blkcg *blkcg,
 						     struct blkcg_policy *pol)
 {
@@ -488,11 +488,11 @@ static inline int blkcg_activate_policy(struct gendisk *disk,
 					const struct blkcg_policy *pol) { return 0; }
 static inline void blkcg_deactivate_policy(struct gendisk *disk,
 					   const struct blkcg_policy *pol) { }
 
 static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
-						  struct blkcg_policy *pol) { return NULL; }
+						  const struct blkcg_policy *pol) { return NULL; }
 static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 static inline void blk_cgroup_bio_start(struct bio *bio) { }
 static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) { return true; }
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index c136b1f46fcc..1f3f6e0f8901 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -3188,11 +3188,11 @@ static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct ioc *ioc = pd_to_iocg(pd)->ioc;
 
 	if (!dname)
 		return 0;
 
-	spin_lock(&ioc->lock);
+	spin_lock_irq(&ioc->lock);
 	seq_printf(sf, "%s enable=%d ctrl=%s rpct=%u.%02u rlat=%u wpct=%u.%02u wlat=%u min=%u.%02u max=%u.%02u\n",
 		   dname, ioc->enabled, ioc->user_qos_params ? "user" : "auto",
 		   ioc->params.qos[QOS_RPPM] / 10000,
 		   ioc->params.qos[QOS_RPPM] % 10000 / 100,
 		   ioc->params.qos[QOS_RLAT],
@@ -3201,11 +3201,11 @@ static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
 		   ioc->params.qos[QOS_WLAT],
 		   ioc->params.qos[QOS_MIN] / 10000,
 		   ioc->params.qos[QOS_MIN] % 10000 / 100,
 		   ioc->params.qos[QOS_MAX] / 10000,
 		   ioc->params.qos[QOS_MAX] % 10000 / 100);
-	spin_unlock(&ioc->lock);
+	spin_unlock_irq(&ioc->lock);
 	return 0;
 }
 
 static int ioc_qos_show(struct seq_file *sf, void *v)
 {
@@ -3386,18 +3386,18 @@ static u64 ioc_cost_model_prfill(struct seq_file *sf,
 	u64 *u = ioc->params.i_lcoefs;
 
 	if (!dname)
 		return 0;
 
-	spin_lock(&ioc->lock);
+	spin_lock_irq(&ioc->lock);
 	seq_printf(sf, "%s ctrl=%s model=linear "
 		   "rbps=%llu rseqiops=%llu rrandiops=%llu "
 		   "wbps=%llu wseqiops=%llu wrandiops=%llu\n",
 		   dname, ioc->user_cost_model ? "user" : "auto",
 		   u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
 		   u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]);
-	spin_unlock(&ioc->lock);
+	spin_unlock_irq(&ioc->lock);
 	return 0;
 }
 
 static int ioc_cost_model_show(struct seq_file *sf, void *v)
 {
-- 
2.51.0


^ permalink raw reply related

* [PATCH 4/8] blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

Change this in two steps:

1) hold rcu lock and do blkg_lookup() from fast path;
2) hold queue_lock directly from slow path, and don't nest it under rcu
   lock;

Prepare to convert protecting blkcg with blkcg_mutex instead of
queue_lock.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 57 +++++++++++++++++++++++++++++-----------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 46fc65050c38..e2896d582235 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -466,26 +466,21 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	struct blkcg_gq *blkg;
-	unsigned long flags;
-
-	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	blkg = blkg_lookup(blkcg, q);
-	if (blkg)
-		return blkg;
-
-	spin_lock_irqsave(&q->queue_lock, flags);
+	rcu_read_lock();
 	blkg = blkg_lookup(blkcg, q);
 	if (blkg) {
 		if (blkcg != &blkcg_root &&
 		    blkg != rcu_dereference(blkcg->blkg_hint))
 			rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		goto found;
+		rcu_read_unlock();
+		return blkg;
 	}
+	rcu_read_unlock();
 
 	/*
 	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
 	 * non-root blkgs have access to their parents.  Returns the closest
 	 * blkg to the intended blkg should blkg_create() fail.
@@ -513,12 +508,10 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		}
 		if (pos == blkcg)
 			break;
 	}
 
-found:
-	spin_unlock_irqrestore(&q->queue_lock, flags);
 	return blkg;
 }
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
@@ -2098,10 +2091,22 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
 		return;
 	blkcg_scale_delay(blkg, now);
 	atomic64_add(delta, &blkg->delay_nsec);
 }
 
+static inline struct blkcg_gq *blkg_lookup_tryget(struct blkcg_gq *blkg)
+{
+retry:
+	if (blkg_tryget(blkg))
+		return blkg;
+
+	blkg = blkg->parent;
+	if (blkg)
+		goto retry;
+
+	return NULL;
+}
 /**
  * blkg_tryget_closest - try and get a blkg ref on the closet blkg
  * @bio: target bio
  * @css: target css
  *
@@ -2110,24 +2115,34 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
  * up taking a reference on or %NULL if no reference was taken.
  */
 static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 		struct cgroup_subsys_state *css)
 {
-	struct blkcg_gq *blkg, *ret_blkg = NULL;
+	struct request_queue *q = bio->bi_bdev->bd_queue;
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct blkcg_gq *blkg;
 
 	rcu_read_lock();
-	blkg = blkg_lookup_create(css_to_blkcg(css), bio->bi_bdev->bd_disk);
-	while (blkg) {
-		if (blkg_tryget(blkg)) {
-			ret_blkg = blkg;
-			break;
-		}
-		blkg = blkg->parent;
-	}
+	blkg = blkg_lookup(blkcg, q);
+	if (likely(blkg))
+		blkg = blkg_lookup_tryget(blkg);
 	rcu_read_unlock();
 
-	return ret_blkg;
+	if (blkg)
+		return blkg;
+
+	/*
+	 * Fast path failed, we're probably issuing IO in this cgroup the first
+	 * time, hold lock to create new blkg.
+	 */
+	spin_lock_irq(&q->queue_lock);
+	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+	if (blkg)
+		blkg = blkg_lookup_tryget(blkg);
+	spin_unlock_irq(&q->queue_lock);
+
+	return blkg;
 }
 
 /**
  * bio_associate_blkg_from_css - associate a bio with a specified css
  * @bio: target bio
-- 
2.51.0

^ permalink raw reply related

* [PATCH 5/8] blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

If a bio is already associated with a blkg, the blkcg is already pinned
until the bio is done, so there is no need for RCU protection. Otherwise,
protect blkcg_css() with RCU independently. Prepare to protect blkcg with
blkcg_mutex instead of queue_lock.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index e2896d582235..8c9ca52a54f4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -2186,20 +2186,24 @@ void bio_associate_blkg(struct bio *bio)
 	struct cgroup_subsys_state *css;
 
 	if (blk_op_is_passthrough(bio->bi_opf))
 		return;
 
-	rcu_read_lock();
-
-	if (bio->bi_blkg)
+	if (bio->bi_blkg) {
 		css = bio_blkcg_css(bio);
-	else
+		bio_associate_blkg_from_css(bio, css);
+	} else {
+		rcu_read_lock();
 		css = blkcg_css();
+		if (!css_tryget_online(css))
+			css = NULL;
+		rcu_read_unlock();
 
-	bio_associate_blkg_from_css(bio, css);
-
-	rcu_read_unlock();
+		bio_associate_blkg_from_css(bio, css);
+		if (css)
+			css_put(css);
+	}
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg);
 
 /**
  * bio_clone_blkg_association - clone blkg association from src to dst bio
-- 
2.51.0

^ permalink raw reply related

* [PATCH 6/8] blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

The correct lock order is q->queue_lock before blkcg->lock, and in order
to prevent deadlock from blkcg_destroy_blkgs(), trylock is used for
q->queue_lock while blkcg->lock is already held, this is hacky.

Refactor blkcg_destroy_blkgs() to hold blkcg->lock only long enough to
get the first blkg and then release it. Then take q->queue_lock and
blkcg->lock in the correct order to destroy the blkg. This is a very cold
path, so the extra lock/unlock cycles are acceptable.

Also prepare to convert protecting blkcg with blkcg_mutex instead of
queue_lock.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8c9ca52a54f4..d1f69a23c9d6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1289,10 +1289,25 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  *
  * 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
  *    This finally frees the blkcg.
  */
 
+static struct blkcg_gq *blkcg_get_first_blkg(struct blkcg *blkcg)
+{
+	struct blkcg_gq *blkg = NULL;
+
+	spin_lock_irq(&blkcg->lock);
+	if (!hlist_empty(&blkcg->blkg_list)) {
+		blkg = hlist_entry(blkcg->blkg_list.first, struct blkcg_gq,
+				   blkcg_node);
+		blkg_get(blkg);
+	}
+	spin_unlock_irq(&blkcg->lock);
+
+	return blkg;
+}
+
 /**
  * blkcg_destroy_blkgs - responsible for shooting down blkgs
  * @blkcg: blkcg of interest
  *
  * blkgs should be removed while holding both q and blkcg locks.  As blkcg lock
@@ -1302,36 +1317,28 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  *
  * This is the blkcg counterpart of ioc_release_fn().
  */
 static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
-	might_sleep();
+	struct blkcg_gq *blkg;
 
-	spin_lock_irq(&blkcg->lock);
+	might_sleep();
 
-	while (!hlist_empty(&blkcg->blkg_list)) {
-		struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
-						struct blkcg_gq, blkcg_node);
+	while ((blkg = blkcg_get_first_blkg(blkcg))) {
 		struct request_queue *q = blkg->q;
 
-		if (need_resched() || !spin_trylock(&q->queue_lock)) {
-			/*
-			 * Given that the system can accumulate a huge number
-			 * of blkgs in pathological cases, check to see if we
-			 * need to rescheduling to avoid softlockup.
-			 */
-			spin_unlock_irq(&blkcg->lock);
-			cond_resched();
-			spin_lock_irq(&blkcg->lock);
-			continue;
-		}
+		spin_lock_irq(&q->queue_lock);
+		spin_lock(&blkcg->lock);
 
 		blkg_destroy(blkg);
-		spin_unlock(&q->queue_lock);
-	}
 
-	spin_unlock_irq(&blkcg->lock);
+		spin_unlock(&blkcg->lock);
+		spin_unlock_irq(&q->queue_lock);
+
+		blkg_put(blkg);
+		cond_resched();
+	}
 }
 
 /**
  * blkcg_pin_online - pin online state
  * @blkcg_css: blkcg of interest
-- 
2.51.0

^ permalink raw reply related

* [PATCH 7/8] mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page()
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

Take a css reference under RCU, drop RCU, and then associate the bio with
the blkg. This avoids nesting queue_lock under RCU and prepares to protect
blkcg with blkcg_mutex instead of queue_lock.

Use css_tryget() instead of css_tryget_online() so swap writeback for
pages charged to a dying memcg still passes the dying css to
bio_associate_blkg_from_css(). That preserves the existing closest-live
ancestor fallback instead of charging those bios to the root blkg.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 mm/page_io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 70cea9e24d2f..3b54c60c278e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -315,12 +315,17 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
 		return;
 
 	rcu_read_lock();
 	memcg = folio_memcg(folio);
 	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
-	bio_associate_blkg_from_css(bio, css);
+	if (!css || !css_tryget(css))
+		css = NULL;
 	rcu_read_unlock();
+
+	bio_associate_blkg_from_css(bio, css);
+	if (css)
+		css_put(css);
 }
 #else
 #define bio_associate_blkg_from_page(bio, folio)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
-- 
2.51.0

^ permalink raw reply related

* [PATCH 8/8] block, bfq: don't grab queue_lock to initialize bfq
From: Yu Kuai @ 2026-06-08  3:42 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, axboe, yukuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>

From: Yu Kuai <yukuai@fygo.io>

The request_queue is frozen and quiesced while the elevator init_sched()
method runs, so queue_lock is not needed for BFQ cgroup initialization.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-iosched.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 42ccfd0c6140..5cabee2d4e7c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7207,14 +7207,11 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
 	if (!bfqd)
 		return -ENOMEM;
 
 	eq->elevator_data = bfqd;
-
-	spin_lock_irq(&q->queue_lock);
 	q->elevator = eq;
-	spin_unlock_irq(&q->queue_lock);
 
 	/*
 	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
 	 * Grab a permanent reference to it, so that the normal code flow
 	 * will not attempt to free it.
@@ -7243,11 +7240,10 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	bfqd->num_actuators = 1;
 	/*
 	 * If the disk supports multiple actuators, copy independent
 	 * access ranges from the request queue structure.
 	 */
-	spin_lock_irq(&q->queue_lock);
 	if (ia_ranges) {
 		/*
 		 * Check if the disk ia_ranges size exceeds the current bfq
 		 * actuator limit.
 		 */
@@ -7269,11 +7265,10 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	/* Otherwise use single-actuator dev info */
 	if (bfqd->num_actuators == 1) {
 		bfqd->sector[0] = 0;
 		bfqd->nr_sectors[0] = get_capacity(q->disk);
 	}
-	spin_unlock_irq(&q->queue_lock);
 
 	INIT_LIST_HEAD(&bfqd->dispatch);
 
 	hrtimer_setup(&bfqd->idle_slice_timer, bfq_idle_slice_timer, CLOCK_MONOTONIC,
 		      HRTIMER_MODE_REL);
-- 
2.51.0

^ permalink raw reply related

* Re: [PATCH 1/4] block: add a macro to initialize the status table
From: Christoph Hellwig @ 2026-06-08  5:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, Jonathan Corbet, linux-block,
	linux-doc, Keith Busch
In-Reply-To: <aiMZ-PXXQ-NxOHT4@casper.infradead.org>

On Fri, Jun 05, 2026 at 07:48:24PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 05, 2026 at 08:44:27PM +0200, Christoph Hellwig wrote:
> > Prepare for adding a new value to the error table by adding a macro
> > to fill it.
> 
> > +#define ENT(_tag, _errno, _desc)	\
> > +[BLK_STS_##_tag] = {				\
> > +	.errno		= _errno,		\
> > +	.name		= _desc,		\
> 
> Bleh.  I hate this.  Before, I can grep for BLK_STS_NOSPC and find it.

You will still find BLK_STS_NOSPC itself in include/linux/blk_types.h

> After, I can't.  Yes, I know we have a lot of such things already, but
> I don't like adding more.

I'm not a huge fan off CPP pasting, and especially thing we should never
use it to define global symbols (hi page/folio flag helpers!), and in
general try to avoid using them as much as possible.  But I think here
the need to keep the names in sync with the tags exposed in debugfs is
more important than the grepability.  Especially as this sits in _the_
core block file, so it can't be easily missed.

^ permalink raw reply

* configurable block error injection v3
From: Christoph Hellwig @ 2026-06-08  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc

Hi all,

this series adds a new configurable block error injection facility.
We already have a few to inject block errors, but unfortunately most
of them are either not very useful or hard to use, or both:

 - The fail_make_request failure injection point can't distinguish
   different commands, different ranges in the file and can only injection
   plain I/O errors.
 - the should_fail_bio 'dynamic' failure injection has all the same issues
   as fail_make_request
 - dm-error can only fail all command in the table using BLK_STS_IOERR
   and requires setting up a new block device
 - dm-flakey and dm-dust allow all kinds of configurability, but still
   don't have good error selection, no good support for non-read/write
   commands and are limited to the dm table alignment requirements,
   which for zoned devices enforces setting them up for an entire zone.
   They also once again require setting up a stacked block device,
   which is really annoying in harnesses like xfstests

This series adds a new debugfs-based block layer error injection
that allows to configure what operations and ranges the injection
applied to, and what status to return.  It also allows to configure a
failure ratio similar to the xfs errortag injection.

Changes since v2:
 - improve the documentation a bit
 - fix a spelling mistake in a comment

Changes since v1:
 - drop the should_fail_bio removal and cleanup depending on it, as it's
   used by eBPF programs and thus a hidden UABI.
 - as a result split the code out to it's own Kconfig symbol
 - various error handling fixed pointed out by Keith
 - documentation spelling fixes pointed out by Randy

Diffstat:
 Documentation/block/error-injection.rst |   59 ++++++
 Documentation/block/index.rst           |    1 
 block/Kconfig                           |    7 
 block/Makefile                          |    1 
 block/blk-core.c                        |   86 ++++++--
 block/blk-sysfs.c                       |    4 
 block/blk.h                             |   15 +
 block/error-injection.c                 |  308 ++++++++++++++++++++++++++++++++
 block/genhd.c                           |    4 
 include/linux/blkdev.h                  |    6 
 10 files changed, 471 insertions(+), 20 deletions(-)

^ permalink raw reply

* [PATCH 1/4] block: add a macro to initialize the status table
From: Christoph Hellwig @ 2026-06-08  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-1-hch@lst.de>

Prepare for adding a new value to the error table by adding a macro
to fill it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b0f0a304ea0b..1614323282f1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,39 +132,44 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+#define ENT(_tag, _errno, _desc)	\
+[BLK_STS_##_tag] = {				\
+	.errno		= _errno,		\
+	.name		= _desc,		\
+}
 static const struct {
 	int		errno;
 	const char	*name;
 } blk_errors[] = {
-	[BLK_STS_OK]		= { 0,		"" },
-	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
-	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
-	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
-	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
-	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
-	[BLK_STS_RESV_CONFLICT]	= { -EBADE,	"reservation conflict" },
-	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
-	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
-	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
-	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
-	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
-	[BLK_STS_OFFLINE]	= { -ENODEV,	"device offline" },
+	ENT(OK,			0,		""),
+	ENT(NOTSUPP,		-EOPNOTSUPP,	"operation not supported"),
+	ENT(TIMEOUT,		-ETIMEDOUT,	"timeout"),
+	ENT(NOSPC,		-ENOSPC,	"critical space allocation"),
+	ENT(TRANSPORT,		-ENOLINK,	"recoverable transport"),
+	ENT(TARGET,		-EREMOTEIO,	"critical target"),
+	ENT(RESV_CONFLICT,	-EBADE,		"reservation conflict"),
+	ENT(MEDIUM,		-ENODATA,	"critical medium"),
+	ENT(PROTECTION,		-EILSEQ,	"protection"),
+	ENT(RESOURCE,		-ENOMEM,	"kernel resource"),
+	ENT(DEV_RESOURCE,	-EBUSY,		"device resource"),
+	ENT(AGAIN,		-EAGAIN,	"nonblocking retry"),
+	ENT(OFFLINE,		-ENODEV,	"device offline"),
 
 	/* device mapper special case, should not leak out: */
-	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },
+	ENT(DM_REQUEUE,		-EREMCHG,	"dm internal retry"),
 
 	/* zone device specific errors */
-	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
-	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
+	ENT(ZONE_OPEN_RESOURCE, -ETOOMANYREFS,	"open zones exceeded"),
+	ENT(ZONE_ACTIVE_RESOURCE, -EOVERFLOW,	"active zones exceeded"),
 
 	/* Command duration limit device-side timeout */
-	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
-
-	[BLK_STS_INVAL]		= { -EINVAL,	"invalid" },
+	ENT(DURATION_LIMIT,	-ETIME,		"duration limit exceeded"),
+	ENT(INVAL,		-EINVAL,	"invalid"),
 
 	/* everything else not covered above: */
-	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
+	ENT(IOERR,		-EIO,		"I/O"),
 };
+#undef ENT
 
 blk_status_t errno_to_blk_status(int errno)
 {
-- 
2.53.0


^ permalink raw reply related

* [PATCH 2/4] block: add a "tag" for block status codes
From: Christoph Hellwig @ 2026-06-08  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-1-hch@lst.de>

The full name of the status codes is not good for user interfaces as it
can contain white spaces.  Add the name of the status code without the
BLK_STS_ prefix as a tag so that it can be used for user interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 28 ++++++++++++++++++++++++++++
 block/blk.h      |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1614323282f1..7aa9cd110bdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,10 +135,12 @@ EXPORT_SYMBOL_GPL(blk_op_str);
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
+	.tag		= __stringify(_tag),	\
 	.name		= _desc,		\
 }
 static const struct {
 	int		errno;
+	const char	*tag;
 	const char	*name;
 } blk_errors[] = {
 	ENT(OK,			0,		""),
@@ -203,6 +205,32 @@ const char *blk_status_to_str(blk_status_t status)
 	return blk_errors[idx].name;
 }
 
+const char *blk_status_to_tag(blk_status_t status)
+{
+	int idx = (__force int)status;
+
+	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
+		return "<null>";
+	return blk_errors[idx].tag;
+}
+
+blk_status_t tag_to_blk_status(const char *tag)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
+		if (blk_errors[i].tag &&
+		    !strcmp(blk_errors[i].tag, tag))
+			return (__force blk_status_t)i;
+	}
+
+	/*
+	 * Return BLK_STS_OK for mismatches as this function is intended to
+	 * parse error status values.
+	 */
+	return BLK_STS_OK;
+}
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba0..0eb8e932ec66 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -50,6 +50,8 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 const char *blk_status_to_str(blk_status_t status);
+const char *blk_status_to_tag(blk_status_t status);
+blk_status_t tag_to_blk_status(const char *tag);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 3/4] block: add a str_to_blk_op helper
From: Christoph Hellwig @ 2026-06-08  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-1-hch@lst.de>

Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string.
This will be used for the error injection debugfs interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 10 ++++++++++
 block/blk.h      |  1 +
 2 files changed, 11 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7aa9cd110bdd..aa90aad6da13 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,6 +132,16 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+enum req_op str_to_blk_op(const char *op)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_op_name); i++)
+		if (blk_op_name[i] && !strcmp(blk_op_name[i], op))
+			return (enum req_op)i;
+	return REQ_OP_LAST;
+}
+
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
diff --git a/block/blk.h b/block/blk.h
index 0eb8e932ec66..e8b7d5517086 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -52,6 +52,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q);
 const char *blk_status_to_str(blk_status_t status);
 const char *blk_status_to_tag(blk_status_t status);
 blk_status_t tag_to_blk_status(const char *tag);
+enum req_op str_to_blk_op(const char *op);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 4/4] block: add configurable error injection
From: Christoph Hellwig @ 2026-06-08  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-1-hch@lst.de>

Add a new block error injection interface that allows to inject specific
status code for specific ranges.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 Documentation/block/error-injection.rst |  59 +++++
 Documentation/block/index.rst           |   1 +
 block/Kconfig                           |   7 +
 block/Makefile                          |   1 +
 block/blk-core.c                        |   3 +
 block/blk-sysfs.c                       |   4 +
 block/blk.h                             |  12 +
 block/error-injection.c                 | 308 ++++++++++++++++++++++++
 block/genhd.c                           |   4 +
 include/linux/blkdev.h                  |   6 +
 10 files changed, 405 insertions(+)
 create mode 100644 Documentation/block/error-injection.rst
 create mode 100644 block/error-injection.c

diff --git a/Documentation/block/error-injection.rst b/Documentation/block/error-injection.rst
new file mode 100644
index 000000000000..a96b7af362c5
--- /dev/null
+++ b/Documentation/block/error-injection.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Configurable Error Injection
+============================
+
+Overview
+--------
+
+Configurable error injection allows injecting specific block layer status codes
+for ranges of a block device.  Errors can be injected unconditionally, or with a
+given probability.
+
+To use configurable error injection, CONFIG_BLK_ERROR_INJECTION must be enabled.
+
+The only interface is the error_injection debugfs file, which is created for
+each registered gendisk.  Writes to this file are used to create or delete rules
+and reads return a list of the current error injection sites.
+
+Options
+-------
+
+The following options specify the operations:
+
+===================	=======================================================
+add			add a new rule
+removeall		remove all existing rules
+===================	=======================================================
+
+The following options specify the details of the rule for the add operation:
+
+===================	=======================================================
+op=<string>		block layer operation this rule applies to.  This uses
+			the XYZ for each REQ_OP_XYZ operation, e.g. READ, WRITE
+			or DISCARD. Mandatory.
+status=<string>		Status to return.  This uses XYZ for each BLK_STS_XYZ
+			code, e.g. IOERR or MEDIUM. Mandatory.
+start=<number>		First block layer sector the rule applies to.
+			Optional, defaults to 0.
+nr_sectors=<number>	Number of sectors this rule applies.
+			Optional, defaults to the remainder of the device.
+chance=<number>		Only return a failure with a likelihood of 1/chance.
+			Optional, defaults to 1 (always).
+===================	=======================================================
+
+Example
+-------
+
+Return BLK_STS_IOERR for one in 10 reads of sector 0 of /dev/nvme0n1:
+
+	$ echo 'add,op=READ,start=0,status=IOERR,chance=10' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Return BLK_STS_MEDIUM for every write to /dev/nvme0n1:
+
+	$ echo 'add,op=WRITE,start=0,status=MEDIUM' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Remove all rules for /dev/nvme0n1:
+
+	$ echo 'removeall' > /sys/kernel/debug/block/nvme0n1/error_injection
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..bfa1bbd31ddf 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -22,3 +22,4 @@ Block
    switching-sched
    writeback_cache_control
    ublk
+   error-injection
diff --git a/block/Kconfig b/block/Kconfig
index 15027963472d..7651b86eed56 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -221,6 +221,13 @@ config BLOCK_HOLDER_DEPRECATED
 config BLK_MQ_STACKING
 	bool
 
+config BLK_ERROR_INJECTION
+	bool "Enable block layer error injection"
+	help
+	  Enable inserting arbitrary block errors through a debugfs interface.
+
+	  See Documentation/block/error-injection.rst for details.
+
 source "block/Kconfig.iosched"
 
 endif # BLOCK
diff --git a/block/Makefile b/block/Makefile
index 54130faacc21..e7bd320e3d69 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,7 @@ obj-y		:= bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
 			disk-events.o blk-ia-ranges.o early-lookup.o
 
+obj-$(CONFIG_BLK_ERROR_INJECTION) += error-injection.o
 obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
diff --git a/block/blk-core.c b/block/blk-core.c
index aa90aad6da13..268735582ef1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -767,6 +767,9 @@ static void __submit_bio_noacct_mq(struct bio *bio)
 
 void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 {
+	if (unlikely(blk_error_inject(bio)))
+		return;
+
 	blk_cgroup_bio_start(bio);
 
 	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f22c1f253eb3..8a0c2be48a31 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -933,6 +933,8 @@ static void blk_debugfs_remove(struct gendisk *disk)
 
 	blk_debugfs_lock_nomemsave(q);
 	blk_trace_shutdown(q);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_exit(disk);
 	debugfs_remove_recursive(q->debugfs_dir);
 	q->debugfs_dir = NULL;
 	q->sched_debugfs_dir = NULL;
@@ -963,6 +965,8 @@ int blk_register_queue(struct gendisk *disk)
 
 	memflags = blk_debugfs_lock(q);
 	q->debugfs_dir = debugfs_create_dir(disk->disk_name, blk_debugfs_root);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_init(disk);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_register(q);
 	blk_debugfs_unlock(q, memflags);
diff --git a/block/blk.h b/block/blk.h
index e8b7d5517086..10df23b2cb90 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -660,6 +660,18 @@ static inline bool should_fail_request(struct block_device *part,
 }
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+void blk_error_injection_init(struct gendisk *disk);
+void blk_error_injection_exit(struct gendisk *disk);
+bool __blk_error_inject(struct bio *bio);
+static inline bool blk_error_inject(struct bio *bio)
+{
+	if (!IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		return false;
+	if (!test_bit(GD_ERROR_INJECT, &bio->bi_bdev->bd_disk->state))
+		return false;
+	return __blk_error_inject(bio);
+}
+
 /*
  * Optimized request reference counting. Ideally we'd make timeouts be more
  * clever, as that's the only reason we need references at all... But until
diff --git a/block/error-injection.c b/block/error-injection.c
new file mode 100644
index 000000000000..3ca4ad297683
--- /dev/null
+++ b/block/error-injection.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Christoph Hellwig.
+ */
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include "blk.h"
+
+struct blk_error_inject {
+	struct list_head		entry;
+	sector_t			start;
+	sector_t			end;
+	enum req_op			op;
+	blk_status_t			status;
+
+	/* only inject every 1 / chance times */
+	unsigned int			chance;
+};
+
+bool __blk_error_inject(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		if (bio->bi_iter.bi_sector <= inj->end &&
+		    bio_end_sector(bio) > inj->start &&
+		    bio_op(bio) == inj->op) {
+			blk_status_t status = inj->status;
+
+			if (inj->chance > 1 &&
+			    (get_random_u32() % inj->chance) != 0)
+				continue;
+
+			pr_info_ratelimited("%pg: injecting %s error for %s at sector %llu:%u\n",
+					disk->part0,
+					blk_status_to_str(status),
+					blk_op_str(inj->op),
+					bio->bi_iter.bi_sector,
+					bio_sectors(bio));
+			rcu_read_unlock();
+			bio_endio_status(bio, status);
+			return true;
+		}
+	}
+	rcu_read_unlock();
+	return false;
+}
+
+static int error_inject_add(struct gendisk *disk, enum req_op op,
+		sector_t start, u64 nr_sectors, blk_status_t status,
+		unsigned int chance)
+{
+	struct blk_error_inject *inj;
+	int error = -EINVAL;
+
+	if (op == REQ_OP_LAST)
+		return -EINVAL;
+	if (status == BLK_STS_OK)
+		return -EINVAL;
+
+	inj = kzalloc_obj(*inj);
+	if (!inj)
+		return -ENOMEM;
+
+	if (nr_sectors) {
+		if (U64_MAX - nr_sectors < start)
+			goto out_free_inj;
+		inj->end = start + nr_sectors - 1;
+	} else {
+		inj->end = U64_MAX;
+	}
+
+	inj->op = op;
+	inj->start = start;
+	inj->status = status;
+	inj->chance = chance;
+
+	pr_debug_ratelimited("%pg: adding %s injection for %s at sector %llu:%llu\n",
+			disk->part0, blk_status_to_str(status),
+			blk_op_str(op),
+			start, nr_sectors);
+
+	/*
+	 * Add to the front of the list so that newer entries can partially
+	 * override other entries.  This also intentionally allows duplicate
+	 * entries as there is no real reason to reject them.
+	 */
+	mutex_lock(&disk->error_injection_lock);
+	if (!disk_live(disk)) {
+		mutex_unlock(&disk->error_injection_lock);
+		error = -ENODEV;
+		goto out_free_inj;
+	}
+	list_add_rcu(&inj->entry, &disk->error_injection_list);
+	set_bit(GD_ERROR_INJECT, &disk->state);
+	mutex_unlock(&disk->error_injection_lock);
+	return 0;
+
+out_free_inj:
+	kfree(inj);
+	return error;
+}
+
+static void error_inject_removall(struct gendisk *disk)
+{
+	struct blk_error_inject *inj;
+
+	mutex_lock(&disk->error_injection_lock);
+	clear_bit(GD_ERROR_INJECT, &disk->state);
+	while ((inj = list_first_entry_or_null(&disk->error_injection_list,
+			struct blk_error_inject, entry))) {
+		list_del_rcu(&inj->entry);
+		mutex_unlock(&disk->error_injection_lock);
+
+		kfree_rcu_mightsleep(inj);
+
+		mutex_lock(&disk->error_injection_lock);
+	}
+	mutex_unlock(&disk->error_injection_lock);
+}
+
+enum options {
+	Opt_add			= (1u << 0),
+	Opt_removeall		= (1u << 1),
+
+	Opt_op			= (1u << 16),
+	Opt_start		= (1u << 17),
+	Opt_nr_sectors		= (1u << 18),
+	Opt_status		= (1u << 19),
+	Opt_chance		= (1u << 20),
+
+	Opt_invalid,
+};
+
+static const match_table_t opt_tokens = {
+	{ Opt_add,			"add",			},
+	{ Opt_removeall,		"removeall",		},
+	{ Opt_op,			"op=%s",		},
+	{ Opt_start,			"start=%u"		},
+	{ Opt_nr_sectors,		"nr_sectors=%u"		},
+	{ Opt_status,			"status=%s"		},
+	{ Opt_chance,			"chance=%u"		},
+	{ Opt_invalid,			NULL,			},
+};
+
+static int match_op(substring_t *args, enum req_op *op)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*op = str_to_blk_op(tag);
+	if (*op == REQ_OP_LAST)
+		pr_warn("invalid op '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static int match_status(substring_t *args, blk_status_t *status)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*status = tag_to_blk_status(tag);
+	if (!*status)
+		pr_warn("invalid status '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static ssize_t blk_error_injection_parse_options(struct gendisk *disk,
+		char *options)
+{
+	enum { Unset, Add, Removeall } action = Unset;
+	unsigned int option_mask = 0, chance = 1;
+	enum req_op op = REQ_OP_LAST;
+	u64 start = 0, nr_sectors = 0;
+	blk_status_t status = BLK_STS_OK;
+	substring_t args[MAX_OPT_ARGS];
+	char *p;
+
+	while ((p = strsep(&options, ",\n")) != NULL) {
+		int error = 0;
+		ssize_t token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, opt_tokens, args);
+		option_mask |= token;
+		switch (token) {
+		case Opt_add:
+			if (action != Unset)
+				return -EINVAL;
+			action = Add;
+			break;
+		case Opt_removeall:
+			if (action != Unset)
+				return -EINVAL;
+			action = Removeall;
+			break;
+		case Opt_op:
+			error = match_op(args, &op);
+			break;
+		case Opt_start:
+			error = match_u64(args, &start);
+			break;
+		case Opt_nr_sectors:
+			error = match_u64(args, &nr_sectors);
+			break;
+		case Opt_status:
+			error = match_status(args, &status);
+			break;
+		case Opt_chance:
+			error = match_uint(args, &chance);
+			if (!error && chance == 0)
+				error = -EINVAL;
+			break;
+		default:
+			pr_warn("unknown parameter or missing value '%s'\n", p);
+			error = -EINVAL;
+		}
+		if (error)
+			return error;
+	}
+
+	switch (action) {
+	case Add:
+		return error_inject_add(disk, op, start, nr_sectors, status,
+				chance);
+	case Removeall:
+		if (option_mask & ~Opt_removeall)
+			return -EINVAL;
+		error_inject_removall(disk);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static ssize_t blk_error_injection_write(struct file *file,
+		const char __user *ubuf, size_t count, loff_t *pos)
+{
+	struct gendisk *disk = file_inode(file)->i_private;
+	char *options;
+	int error;
+
+	options = memdup_user_nul(ubuf, count);
+	if (IS_ERR(options))
+		return PTR_ERR(options);
+	error = blk_error_injection_parse_options(disk, options);
+	kfree(options);
+
+	if (error)
+		return error;
+	return count;
+}
+
+static int blk_error_injection_show(struct seq_file *s, void *private)
+{
+	struct gendisk *disk = s->private;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		seq_printf(s, "%llu:%llu status=%s,chance=%u",
+			inj->start, inj->end,
+			blk_status_to_tag(inj->status), inj->chance);
+		seq_putc(s, '\n');
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int blk_error_injection_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_error_injection_show, inode->i_private);
+}
+
+static int blk_error_injection_release(struct inode *inode, struct file *file)
+{
+	return single_release(inode, file);
+}
+
+static const struct file_operations blk_error_injection_fops = {
+	.owner		= THIS_MODULE,
+	.write		= blk_error_injection_write,
+	.read		= seq_read,
+	.open		= blk_error_injection_open,
+	.release	= blk_error_injection_release,
+};
+
+void blk_error_injection_init(struct gendisk *disk)
+{
+	debugfs_create_file("error_injection", 0600, disk->queue->debugfs_dir,
+			disk, &blk_error_injection_fops);
+}
+
+void blk_error_injection_exit(struct gendisk *disk)
+{
+	error_inject_removall(disk);
+}
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..f84b6a355b57 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1485,6 +1485,10 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
+#endif
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	mutex_init(&disk->error_injection_lock);
+	INIT_LIST_HEAD(&disk->error_injection_list);
 #endif
 	mutex_init(&disk->rqos_state_mutex);
 	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 57e84d59a642..5070851cf924 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -176,6 +176,7 @@ struct gendisk {
 #define GD_SUPPRESS_PART_SCAN		5
 #define GD_OWNS_QUEUE			6
 #define GD_ZONE_APPEND_USED		7
+#define GD_ERROR_INJECT			8
 
 	struct mutex open_mutex;	/* open/close mutex */
 	unsigned open_partitions;	/* number of open partitions */
@@ -227,6 +228,11 @@ struct gendisk {
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
 
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	struct mutex		error_injection_lock;
+	struct list_head	error_injection_list;
+#endif
+
 	struct mutex rqos_state_mutex;	/* rqos state change mutex */
 };
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH] rust: block: require `Sync` for `Operations::QueueData`
From: Andreas Hindborg @ 2026-06-08  8:24 UTC (permalink / raw)
  To: Boqun Feng, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Jens Axboe, Daniel Almeida
  Cc: linux-block, rust-for-linux, linux-kernel, Andreas Hindborg

The queue data installed in a `GenDisk` is stored in the request queue and
handed back to the driver as a shared borrow through the `queue_rq` and
`commit_rqs` callbacks. Both callbacks obtain that borrow via
`ForeignOwnable::borrow` and may execute concurrently on several CPUs,
since the block layer runs one hardware queue per CPU. That means a shared
reference to the same queue data can be live on multiple threads at once,
which is only sound when the referent is `Sync`.

The initial `GenDisk` private data support omitted this bound, so a
driver could install a non-`Sync` type as queue data and then access
it concurrently from multiple CPUs without synchronization. Add a
`Sync` bound to the `QueueData` associated type to rule that out.

Fixes: 90d952fac8ac ("rust: block: add `GenDisk` private data support")
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
---
 rust/kernel/block/mq/operations.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
index 8ad46129a52c..89029f468f44 100644
--- a/rust/kernel/block/mq/operations.rs
+++ b/rust/kernel/block/mq/operations.rs
@@ -30,7 +30,7 @@
 pub trait Operations: Sized {
     /// Data associated with the `struct request_queue` that is allocated for
     /// the `GenDisk` associated with this `Operations` implementation.
-    type QueueData: ForeignOwnable;
+    type QueueData: ForeignOwnable + Sync;
 
     /// Called by the kernel to queue a request with the driver. If `is_last` is
     /// `false`, the driver is allowed to defer committing the request.

---
base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
change-id: 20260608-queue-data-sync-80b66ab312ac

Best regards,
--  
Andreas Hindborg <a.hindborg@kernel.org>



^ permalink raw reply related

* [PATCH] block: fix arg type in `blk_mq_update_nr_hw_queues`
From: Andreas Hindborg @ 2026-06-08  8:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, Andreas Hindborg

The type of the argument `nr_hw_queues` in the function
`blk_mq_update_nr_hw_queues` is a signed integer. This is wrong,
considering the field `nr_hw_queues` of `struct blk_mq_tag_set` is
unsigned. Thus, change the type of the parameter to unsigned.

Cascade the change to downstream functions.

Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
---
 block/blk-mq.c         | 13 +++++++------
 include/linux/blk-mq.h |  2 +-
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..5eb9b52f3146 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4816,10 +4816,10 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 
 static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
 				struct blk_mq_tag_set *set,
-				int new_nr_hw_queues)
+				unsigned int new_nr_hw_queues)
 {
 	struct blk_mq_tags **new_tags;
-	int i;
+	unsigned int i;
 
 	if (set->nr_hw_queues >= new_nr_hw_queues)
 		return NULL;
@@ -5134,12 +5134,12 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
-							int nr_hw_queues)
+					 unsigned int nr_hw_queues)
 {
 	struct request_queue *q;
-	int prev_nr_hw_queues = set->nr_hw_queues;
+	unsigned int prev_nr_hw_queues = set->nr_hw_queues;
 	unsigned int memflags;
-	int i;
+	unsigned int i;
 	struct xarray elv_tbl;
 	struct blk_mq_tags **new_tags;
 	bool queues_frozen = false;
@@ -5234,7 +5234,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		__blk_mq_free_map_and_rqs(set, i);
 }
 
-void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
+void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
+				unsigned int nr_hw_queues)
 {
 	down_write(&set->update_nr_hwq_lock);
 	mutex_lock(&set->tag_list_lock);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..f73b4a1f16db 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -974,7 +974,7 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset);
-void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
+void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 

---
base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
change-id: 20260608-update-hw-nodes-arg-940ecec0380a

Best regards,
--  
Andreas Hindborg <a.hindborg@kernel.org>



^ permalink raw reply related

* Re: [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
From: Jan Kara @ 2026-06-08  9:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-1-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:07, Christian Brauner wrote:
> blk_mode_t and fop_flags_t are both plain 'unsigned int __bitwise' flag
> typedefs, exactly like the gfp_t, slab_flags_t and fmode_t that already
> live in <linux/types.h>. Move them there so they are available
> everywhere without having to drag in a subsystem header.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/blkdev.h | 2 --
>  include/linux/fs.h     | 2 --
>  include/linux/types.h  | 2 ++
>  3 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 890128cdea1c..c8494d64a69d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -126,8 +126,6 @@ struct blk_integrity {
>  	unsigned char				pi_tuple_size;
>  };
>  
> -typedef unsigned int __bitwise blk_mode_t;
> -
>  /* open for reading */
>  #define BLK_OPEN_READ		((__force blk_mode_t)(1 << 0))
>  /* open for writing */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..e9346be8470f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1921,8 +1921,6 @@ struct dir_context {
>  struct io_uring_cmd;
>  struct offset_ctx;
>  
> -typedef unsigned int __bitwise fop_flags_t;
> -
>  struct file_operations {
>  	struct module *owner;
>  	fop_flags_t fop_flags;
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 608050dbca6a..ef026585420b 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -163,6 +163,8 @@ typedef u32 dma_addr_t;
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
>  typedef unsigned int __bitwise fmode_t;
> +typedef unsigned int __bitwise blk_mode_t;
> +typedef unsigned int __bitwise fop_flags_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH RFC 3/8] fs: refuse to claim any frozen block device
From: Jan Kara @ 2026-06-08 10:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-3-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:09, Christian Brauner wrote:
> setup_bdev_super() already refuses to bring a filesystem up on a frozen
> block device but only for the primary device. Now that filesystems claim
> every device through fs_bdev_file_open_by_{dev,path}(), do that check
> once in the registration helper so it covers all of them.
> 
> Drop the now-redundant check from setup_bdev_super().
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/super.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index e0174d5819a0..cea743f699e4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1690,6 +1690,17 @@ static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
>  	sb->s_count++;
>  	spin_unlock(&sb_lock);
>  
> +	/*
> +	 * Don't bring a filesystem up on a frozen device.  The entry is already
> +	 * published, so a freeze either is seen here or finds it and waits in
> +	 * super_lock() until this mount is born or (on -EBUSY) dies.  The mount
> +	 * aborts, so the entry is torn down without rebalancing @fs_bdev_active.
> +	 */
> +	if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
> +		fs_bdev_holder_put(h);
> +		return -EBUSY;
> +	}
> +
>  	return 0;
>  }

Shouldn't this check be common also for the branch where we only increase
the refcount? Or is a filesystem where a superblock claims the bdev
multiple times and can get frozen inbetween too insane?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Jan Kara @ 2026-06-08 10:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-2-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:08, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
> 
> Introduce a global dev_t-keyed rhltable mapping each block device to the
> superblock(s) using it. The holder argument becomes purely the block
> layer's exclusivity token (a superblock, or a file_system_type for
> shared devices) and is no longer needed by the fs specific callbacks.
> 
> Registration keeps one entry per (device, superblock). When a filesystem
> claims a device it already uses (xfs with its log on the data device), no
> second entry is added, so each superblock is acted on once.
> 
> Each table entry holds a passive reference (s_count) on its superblock,
> so the struct stays valid for as long as the entry is reachable. The
> callbacks look the device up in the table and act on every superblock
> using it:
> 
> Unlinking an entry is deferred to the last unpin, so a cursor never
> resumes from a removed node. After this it's possible to act on all
> superblocks that share a given device.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good! One comment below:

>  static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);

The moment we drop bd_holder_lock, there's nothing which prevents the bdev
owner from changing. So this can lead to a situation where we miss calling
->mark_dead callback of the new holder. Similarly for all the other holder
ops. I didn't find a situation where it would actually matter so I think
we're fine but it's a potential catch. Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

>  
> -	if (sb->s_op->remove_bdev) {
> -		int ret;
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -		ret = sb->s_op->remove_bdev(sb, bdev);
> -		if (!ret) {
> -			super_unlock_shared(sb);
> -			return;
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE)) {
> +			if (!sb->s_op->remove_bdev ||
> +			    sb->s_op->remove_bdev(sb, bdev)) {
> +				if (!surprise)
> +					sync_filesystem(sb);
> +				shrink_dcache_sb(sb);
> +				evict_inodes(sb);
> +				if (sb->s_op->shutdown)
> +					sb->s_op->shutdown(sb);
> +			}
>  		}
> -		/* Fallback to shutdown. */
> +		super_unlock_shared(sb);
>  	}
> -
> -	if (!surprise)
> -		sync_filesystem(sb);
> -	shrink_dcache_sb(sb);
> -	evict_inodes(sb);
> -	if (sb->s_op->shutdown)
> -		sb->s_op->shutdown(sb);
> -
> -	super_unlock_shared(sb);
>  }
>  
>  static void fs_bdev_sync(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	sync_filesystem(sb);
> -	super_unlock_shared(sb);
> -}
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -static struct super_block *get_bdev_super(struct block_device *bdev)
> -{
> -	bool active = false;
> -	struct super_block *sb;
> -
> -	sb = bdev_super_lock(bdev, true);
> -	if (sb) {
> -		active = atomic_inc_not_zero(&sb->s_active);
> -		super_unlock_excl(sb);
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE))
> +			sync_filesystem(sb);
> +		super_unlock_shared(sb);
>  	}
> -	if (!active)
> -		return NULL;
> -	return sb;
>  }
>  
>  /**
> - * fs_bdev_freeze - freeze owning filesystem of block device
> + * fs_bdev_freeze - freeze every superblock using a block device
>   * @bdev: block device
>   *
> - * Freeze the filesystem that owns this block device if it is still
> - * active.
> - *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> + * Freeze each live superblock using @bdev.  A superblock owning several block
> + * devices is frozen once per device and stays frozen until all are thawed; the
> + * block layer nests these freezes so the count stays balanced.
>   *
> - * Return: If the freeze was successful zero is returned. If the freeze
> - *         failed a negative error code is returned.
> + * Return: 0, or the error from the one superblock on a single-fs device.  When
> + *         several superblocks share @bdev a per-superblock failure is swallowed
> + *         (see below), but a sync_blockdev() failure is always reported.
>   */
>  static int fs_bdev_freeze(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error = 0;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->freeze_super)
> -		error = sb->s_op->freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_freeze(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/*
> +	 * When several superblocks share the device, keep it frozen even if some
> +	 * of them failed to freeze and swallow the error: rolling the rest back
> +	 * via thaw_super() can fail too, so neither is a clear win. A single
> +	 * filesystem (count == 1) still reports its error.
> +	 */
> +	if (error && count > 1)
> +		error = 0;
>  	if (!error)
>  		error = sync_blockdev(bdev);
> -	deactivate_super(sb);
>  	return error;
>  }
>  
>  /**
> - * fs_bdev_thaw - thaw owning filesystem of block device
> + * fs_bdev_thaw - thaw every superblock using a block device
>   * @bdev: block device
>   *
> - * Thaw the filesystem that owns this block device.
> + * The counterpart to fs_bdev_freeze(): thaw each live superblock using @bdev.
> + * A zero return does not imply a superblock is fully unfrozen; it may have been
> + * frozen more than once (by the kernel or via another device).
>   *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> - *
> - * Return: If the thaw was successful zero is returned. If the thaw
> - *         failed a negative error code is returned. If this function
> - *         returns zero it doesn't mean that the filesystem is unfrozen
> - *         as it may have been frozen multiple times (kernel may hold a
> - *         freeze or might be frozen from other block devices).
> + * Return: 0, or the first error on a single-fs device; a shared device swallows
> + *         per-superblock errors, as fs_bdev_freeze() does.
>   */
>  static int fs_bdev_thaw(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	/*
> -	 * The block device may have been frozen before it was claimed by a
> -	 * filesystem. Concurrently another process might try to mount that
> -	 * frozen block device and has temporarily claimed the block device for
> -	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> -	 * mounter is already about to abort mounting because they still saw an
> -	 * elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return
> -	 * NULL in that case.
> -	 */
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->thaw_super)
> -		error = sb->s_op->thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	deactivate_super(sb);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_thaw(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/* Shared device: swallow per-superblock errors, like fs_bdev_freeze(). */
> +	if (error && count > 1)
> +		error = 0;
>  	return error;
>  }
>  
> @@ -1602,6 +1651,131 @@ const struct blk_holder_ops fs_holder_ops = {
>  };
>  EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
> +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct rhlist_head *list, *pos;
> +	struct fs_bdev_holder *h;
> +	int err;
> +
> +	/*
> +	 * A superblock may claim one device more than once (xfs with its log on
> +	 * the data device).  Keep a single entry per (device, superblock) and
> +	 * count the claims in @fs_bdev_active; the entry lives until the last one
> +	 * is released.
> +	 */
> +	scoped_guard(rcu) {
> +		list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +		rhl_for_each_entry_rcu(h, pos, list, node)
> +			if (h->sb == sb && refcount_inc_not_zero(&h->fs_bdev_active))
> +				return 0;
> +	}
> +
> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
> +	if (!h)
> +		return -ENOMEM;
> +	h->dev = dev;
> +	h->sb = sb;
> +	refcount_set(&h->fs_bdev_passive, 1);
> +	refcount_set(&h->fs_bdev_active, 1);
> +
> +	err = rhltable_insert(&fs_bdev_supers, &h->node, fs_bdev_params);
> +	if (err) {
> +		kfree(h);
> +		return err;
> +	}
> +
> +	/* The sb->s_count ref keeps @h->sb valid for as long as the entry exists. */
> +	spin_lock(&sb_lock);
> +	sb->s_count++;
> +	spin_unlock(&sb_lock);
> +
> +	return 0;
> +}
> +
> +/**
> + * fs_bdev_file_open_by_dev - claim a block device on behalf of a superblock
> + * @dev: block device number
> + * @mode: open mode
> + * @holder: block-layer exclusivity token (a superblock, or the file_system_type
> + *          when the device may be shared by several superblocks of that type)
> + * @sb: superblock to drive fs_holder_ops events for
> + *
> + * Open @dev with &fs_holder_ops and register that @sb uses it, so device
> + * removal/sync/freeze/thaw are propagated to @sb (and any other superblock
> + * sharing @dev).  Must be paired with fs_bdev_file_release().
> + *
> + * Return: an opened block-device file or an ERR_PTR().
> + */
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_dev(dev, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_dev);
> +
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_path(path, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_path);
> +
> +/**
> + * fs_bdev_file_release - release a block device claimed for a superblock
> + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> + * @sb: superblock the device was claimed for
> + *
> + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> + * pinning cursor defers the actual unlink).  Then close the block device.
> + */
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct fs_bdev_holder *h, *found = NULL;
> +	struct rhlist_head *list, *pos;
> +
> +	rcu_read_lock();
> +	list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +	rhl_for_each_entry_rcu(h, pos, list, node) {
> +		if (h->sb != sb)
> +			continue;
> +		/* At most one entry per (dev, sb); the last claim drops the bias. */
> +		if (refcount_dec_and_test(&h->fs_bdev_active))
> +			found = h;
> +		break;
> +	}
> +	rcu_read_unlock();
> +	if (found)
> +		fs_bdev_holder_put(found);
> +	bdev_fput(bdev_file);
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
> +
>  int setup_bdev_super(struct super_block *sb, int sb_flags,
>  		struct fs_context *fc)
>  {
> @@ -1609,7 +1783,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	struct file *bdev_file;
>  	struct block_device *bdev;
>  
> -	bdev_file = bdev_file_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
> +	bdev_file = fs_bdev_file_open_by_dev(sb->s_dev, mode, sb, sb);
>  	if (IS_ERR(bdev_file)) {
>  		if (fc)
>  			errorf(fc, "%s: Can't open blockdev", fc->source);
> @@ -1623,7 +1797,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	 * writable from userspace even for a read-only block device.
>  	 */
>  	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EACCES;
>  	}
>  
> @@ -1634,7 +1808,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
>  		if (fc)
>  			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EBUSY;
>  	}
>  	spin_lock(&sb_lock);
> @@ -1725,7 +1899,7 @@ void kill_block_super(struct super_block *sb)
>  	generic_shutdown_super(sb);
>  	if (bdev) {
>  		sync_blockdev(bdev);
> -		bdev_fput(sb->s_bdev_file);
> +		fs_bdev_file_release(sb->s_bdev_file, sb);
>  	}
>  }
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c8494d64a69d..43d37c02febf 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1760,13 +1760,6 @@ struct blk_holder_ops {
>  	int (*thaw)(struct block_device *bdev);
>  };
>  
> -/*
> - * For filesystems using @fs_holder_ops, the @holder argument passed to
> - * helpers used to open and claim block devices via
> - * bd_prepare_to_claim() must point to a superblock.
> - */
> -extern const struct blk_holder_ops fs_holder_ops;
> -
>  /*
>   * Return the correct open flags for blkdev_get_by_* for super block flags
>   * as stored in sb->s_flags.
> diff --git a/include/linux/fs/super.h b/include/linux/fs/super.h
> index f21ffbb6dea5..721d842e3b24 100644
> --- a/include/linux/fs/super.h
> +++ b/include/linux/fs/super.h
> @@ -235,4 +235,11 @@ int freeze_super(struct super_block *super, enum freeze_holder who,
>  int thaw_super(struct super_block *super, enum freeze_holder who,
>  	       const void *freeze_owner);
>  
> +struct file;
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb);
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb);
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb);
> +
>  #endif /* _LINUX_FS_SUPER_H */
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH RFC 4/8] xfs: port to fs_bdev_file_open_by_path()
From: Jan Kara @ 2026-06-08 10:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-4-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:10, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against mp->m_super, and convert the matching releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/xfs/xfs_buf.c   |  2 +-
>  fs/xfs/xfs_super.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 580d40a5ee57..3d3b29edb156 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1601,7 +1601,7 @@ xfs_free_buftarg(
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  	/* the main block device is closed by kill_block_super */
>  	if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
> -		bdev_fput(btp->bt_file);
> +		fs_bdev_file_release(btp->bt_file, btp->bt_mount->m_super);
>  	kfree(btp);
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f8de44443e81..304667210695 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -400,8 +400,8 @@ xfs_blkdev_get(
>  	blk_mode_t		mode;
>  
>  	mode = sb_open_mode(mp->m_super->s_flags);
> -	*bdev_filep = bdev_file_open_by_path(name, mode,
> -			mp->m_super, &fs_holder_ops);
> +	*bdev_filep = fs_bdev_file_open_by_path(name, mode,
> +			mp->m_super, mp->m_super);
>  	if (IS_ERR(*bdev_filep)) {
>  		error = PTR_ERR(*bdev_filep);
>  		*bdev_filep = NULL;
> @@ -526,7 +526,7 @@ xfs_open_devices(
>  		mp->m_logdev_targp = mp->m_ddev_targp;
>  		/* Handle won't be used, drop it */
>  		if (logdev_file)
> -			bdev_fput(logdev_file);
> +			fs_bdev_file_release(logdev_file, mp->m_super);
>  	}
>  
>  	return 0;
> @@ -538,10 +538,10 @@ xfs_open_devices(
>  	xfs_free_buftarg(mp->m_ddev_targp);
>   out_close_rtdev:
>  	 if (rtdev_file)
> -		bdev_fput(rtdev_file);
> +		fs_bdev_file_release(rtdev_file, mp->m_super);
>   out_close_logdev:
>  	if (logdev_file)
> -		bdev_fput(logdev_file);
> +		fs_bdev_file_release(logdev_file, mp->m_super);
>  	return error;
>  }
>  
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox