Linux block layer


* [RFC PATCH v1 07/17] block: support non-blocking bio allocation with a bdev
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

bio_alloc_clone(), bio_init_clone(), and bio_alloc_bioset() can be called
with non-blocking GFP masks.  Passing a bdev into bio initialization may
need to associate blkcg state.  Once creation of a missing blkg is
serialized by q->blkcg_mutex, that association can sleep.

Keep the generic block layer simple by letting bio_alloc_bioset() handle this
case directly.  Non-blocking allocations initialize the bio without a bdev,
set the bdev fields, and associate the blkg with nowait=true.  If the needed
blkg is missing and would have to be created, the allocation fails like any
other allocation failure (NULL from bio_alloc_bioset(), -ENOMEM from
bio_init_clone()), so the caller can retry from a blocking context.

Blocking callers keep the existing allocation-time association behavior.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bio.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)
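
A minimal userspace sketch of the nowait initialization flow described in the
commit message, not the kernel implementation.  Every toy_* name is a
hypothetical stand-in: toy_group models a blkg that may not exist yet, and
toy_associate() models an association that must refuse to create the group
when nowait is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
struct toy_group {
	bool exists;		/* false: the group would have to be created */
};

struct toy_bio {
	int dev;		/* -1 means "no device set" */
	struct toy_group *grp;	/* NULL means "not associated" */
};

/*
 * Associate @bio with @g.  Creating a missing group is treated as a
 * sleeping operation, so it is refused when @nowait is true.
 */
static bool toy_associate(struct toy_bio *bio, struct toy_group *g, bool nowait)
{
	if (!g->exists) {
		if (nowait)
			return false;
		g->exists = true;	/* blocking callers may create it */
	}
	bio->grp = g;
	return true;
}

/*
 * Same shape as the nowait path: initialize without a device, set the
 * device, then try a non-sleeping association and undo on failure.
 */
static bool toy_bio_init_nowait(struct toy_bio *bio, int dev,
				struct toy_group *g)
{
	assert(bio != NULL);

	bio->dev = -1;
	bio->grp = NULL;
	if (dev < 0)
		return true;	/* no device: nothing to associate */

	bio->dev = dev;
	if (toy_associate(bio, g, true))
		return true;

	bio->dev = -1;		/* undo, so the caller sees a clean bio */
	return false;
}
```

In this model a caller that gets false back can retry later from a context
that is allowed to block, which is the contract the commit message describes.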

diff --git a/block/bio.c b/block/bio.c
index b74e9961c8ee..863ae73a4222 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -259,6 +259,20 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 }
 EXPORT_SYMBOL(bio_init);
 
+static bool bio_init_nowait(struct bio *bio, struct block_device *bdev,
+		struct bio_vec *table, unsigned short max_vecs, blk_opf_t opf)
+{
+	bio_init(bio, NULL, table, max_vecs, opf);
+	if (bdev) {
+		bio_set_dev_no_blkg(bio, bdev);
+		if (bio_associate_blkg(bio, true))
+			return true;
+		bio_uninit(bio);
+		return false;
+	}
+	return true;
+}
+
 /**
  * bio_reset - reinitialize a bio
  * @bio:	bio to reset
@@ -599,12 +613,25 @@ struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs,
 		}
 	}
 
-	if (nr_vecs && nr_vecs <= BIO_INLINE_VECS)
-		bio_init_inline(bio, bdev, nr_vecs, opf);
-	else
-		bio_init(bio, bdev, bvecs, nr_vecs, opf);
+	if (nr_vecs && nr_vecs <= BIO_INLINE_VECS) {
+		bvecs = bio_inline_vecs(bio);
+		if (gfpflags_allow_blocking(saved_gfp))
+			bio_init(bio, bdev, bvecs, nr_vecs, opf);
+		else if (!bio_init_nowait(bio, bdev, bvecs, nr_vecs, opf))
+			goto fail_free_bio;
+	} else {
+		if (gfpflags_allow_blocking(saved_gfp))
+			bio_init(bio, bdev, bvecs, nr_vecs, opf);
+		else if (!bio_init_nowait(bio, bdev, bvecs, nr_vecs, opf))
+			goto fail_free_bio;
+	}
 	bio->bi_pool = bs;
 	return bio;
+
+fail_free_bio:
+	bio->bi_pool = bs;
+	bio_put(bio);
+	return NULL;
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
 
@@ -857,7 +884,9 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 		if (bio->bi_bdev == bio_src->bi_bdev &&
 		    bio_flagged(bio_src, BIO_REMAPPED))
 			bio_set_flag(bio, BIO_REMAPPED);
-		bio_clone_blkg_association(bio, bio_src, false);
+		if (!bio_clone_blkg_association(bio, bio_src,
+					!gfpflags_allow_blocking(gfp)))
+			return -ENOMEM;
 	}
 
 	if (bio_crypt_clone(bio, bio_src, gfp) < 0)
@@ -913,9 +942,14 @@ EXPORT_SYMBOL(bio_alloc_clone);
 int bio_init_clone(struct block_device *bdev, struct bio *bio,
 		struct bio *bio_src, gfp_t gfp)
 {
+	bool blocking = gfpflags_allow_blocking(gfp);
 	int ret;
 
-	bio_init(bio, bdev, bio_src->bi_io_vec, 0, bio_src->bi_opf);
+	if (blocking)
+		bio_init(bio, bdev, bio_src->bi_io_vec, 0, bio_src->bi_opf);
+	else if (!bio_init_nowait(bio, bdev, bio_src->bi_io_vec, 0,
+				bio_src->bi_opf))
+		return -ENOMEM;
 	ret = __bio_clone(bio, bio_src, gfp);
 	if (ret)
 		bio_uninit(bio);
-- 
2.51.0


* [RFC PATCH v1 08/17] bcache: avoid sleeping blkg association from locked paths
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

cached_dev_cache_miss() allocates cache_bio with GFP_NOWAIT.  Passing a bdev
to bio_alloc_bioset() can attach blkcg state.  Once blkg lookup is protected
by q->blkcg_mutex, creating a missing blkg during that attach can sleep.

Use the nowait bio allocation/association path.  If the cache bio needs a
missing blkg to be created, fail the association and fall back to the existing
miss submission path.

journal_write_unlocked() also resets journal bios while holding the journal
spinlock.  Reset those bios without a bdev, set bi_bdev while still under the
lock, and associate blkcg after dropping the lock.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/bcache/journal.c | 9 ++++++---
 drivers/md/bcache/request.c | 2 ++
 2 files changed, 8 insertions(+), 3 deletions(-)
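
The journal change splits the work into two phases.  The sketch below models
that split in userspace, assuming a plain flag in place of the journal
spinlock; toy_lock_held and the other toy_* names are hypothetical and exist
only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; not a kernel type. */
struct toy_bio {
	int dev;
	bool associated;
};

static bool toy_lock_held;	/* models "the journal spinlock is held" */

/* Stand-in for an association that may sleep: must not run under the lock. */
static void toy_associate_may_sleep(struct toy_bio *bio)
{
	assert(!toy_lock_held);
	bio->associated = true;
}

/*
 * Phase 1 (locked): only set the device on each bio.
 * Phase 2 (unlocked): associate, then "submit".
 * Returns the number of bios submitted.
 */
static size_t toy_journal_write(struct toy_bio *bios, size_t n, int dev)
{
	size_t i, submitted = 0;

	toy_lock_held = true;
	for (i = 0; i < n; i++) {
		bios[i].dev = dev;
		bios[i].associated = false;
	}
	toy_lock_held = false;

	for (i = 0; i < n; i++) {
		toy_associate_may_sleep(&bios[i]);
		submitted++;
	}
	return submitted;
}
```

The assert() in toy_associate_may_sleep() plays the role a might_sleep()
style check would play in real code: it trips if association is ever moved
back under the lock.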

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 144693b7c46a..49d2fb9a5f20 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -714,8 +714,9 @@ static CLOSURE_CALLBACK(journal_write_unlocked)
 
 		atomic_long_add(sectors, &ca->meta_sectors_written);
 
-		bio_reset(bio, ca->bdev, REQ_OP_WRITE | 
-			  REQ_SYNC | REQ_META | REQ_PREFLUSH | REQ_FUA);
+		bio_reset(bio, NULL, REQ_OP_WRITE | REQ_SYNC | REQ_META |
+			  REQ_PREFLUSH | REQ_FUA);
+		bio->bi_bdev = ca->bdev;
 		bio->bi_iter.bi_sector	= PTR_OFFSET(k, i);
 		bio->bi_iter.bi_size = sectors << 9;
 
@@ -740,8 +741,10 @@ static CLOSURE_CALLBACK(journal_write_unlocked)
 
 	spin_unlock(&c->journal.lock);
 
-	while ((bio = bio_list_pop(&list)))
+	while ((bio = bio_list_pop(&list))) {
+		bio_associate_blkg(bio, false);
 		closure_bio_submit(c, bio, cl);
+	}
 
 	continue_at(cl, journal_write_done, NULL);
 }
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c2b7a694ea99..647ca5018d07 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -932,6 +932,8 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	if (!cache_bio)
 		goto out_submit;
 
+	if (!bio_clone_blkg_association(cache_bio, miss, true))
+		goto out_put;
 	cache_bio->bi_iter.bi_sector	= miss->bi_iter.bi_sector;
 	cache_bio->bi_iter.bi_size	= s->insert_bio_sectors << 9;
 
-- 
2.51.0


* [RFC PATCH v1 09/17] dm bufio: avoid blkg association from GFP_NOWAIT bio init
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

dm-bufio allocates a bio with bio_kmalloc(GFP_NOWAIT) and then initializes it
with the target bdev.  That initialization can attach blkcg state.  Once blkg
lookup is protected by q->blkcg_mutex, creating a missing blkg there can sleep.

Initialize the bio without a bdev, set the bdev fields, and associate blkcg
with nowait=true.  Fall back to dm_io if a missing blkg would need to be
created.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-bufio.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 26fedf5883ef..2002d9020dd6 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1347,7 +1347,14 @@ static void use_bio(struct dm_buffer *b, enum req_op op, sector_t sector,
 		use_dmio(b, op, sector, n_sectors, offset, ioprio);
 		return;
 	}
-	bio_init_inline(bio, b->c->bdev, 1, op);
+	bio_init_inline(bio, NULL, 1, op);
+	bio_set_dev_no_blkg(bio, b->c->bdev);
+	if (!bio_associate_blkg(bio, true)) {
+		bio_uninit(bio);
+		kfree(bio);
+		use_dmio(b, op, sector, n_sectors, offset, ioprio);
+		return;
+	}
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = bio_complete;
 	bio->bi_private = b;
-- 
2.51.0


* [RFC PATCH v1 10/17] dm pcache: handle non-blocking bio clone init failure
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

dm-pcache may preallocate backing requests with GFP_NOWAIT and initialize
the embedded bio with bio_init_clone().  Non-blocking clone initialization
can now fail if cloning the blkg association would need to create a blkg.

Check the return value and free the preallocated request on failure so the
existing caller can retry through its GFP_NOIO preallocation path.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-pcache/backing_dev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
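
The failure handling described above (release the preallocated request, let
the caller retry in a blocking mode) can be sketched in userspace as follows.
This is an illustration under stated assumptions, not dm-pcache code: the
one-slot toy_pool, toy_init_clone() and the would_sleep parameter are all
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot pool and request; not kernel types. */
struct toy_pool {
	int free_slots;
};

struct toy_req {
	bool in_use;
};

static struct toy_req toy_slot;	/* the single preallocated request */

/* Hypothetical init step: fails in nowait mode when setup would sleep. */
static int toy_init_clone(bool nowait, bool would_sleep)
{
	return (nowait && would_sleep) ? -ENOMEM : 0;
}

/* Take a request from the pool; give it back if initialization fails. */
static struct toy_req *toy_req_alloc(struct toy_pool *pool, bool nowait,
				     bool would_sleep)
{
	if (pool->free_slots == 0)
		return NULL;
	pool->free_slots--;
	toy_slot.in_use = true;

	if (toy_init_clone(nowait, would_sleep)) {
		toy_slot.in_use = false;
		pool->free_slots++;	/* do not leak the pool slot */
		return NULL;
	}
	return &toy_slot;
}

/* Caller-side pattern: try non-blocking first, then fall back. */
static struct toy_req *toy_req_alloc_retry(struct toy_pool *pool,
					   bool would_sleep)
{
	struct toy_req *req = toy_req_alloc(pool, true, would_sleep);

	if (!req)
		req = toy_req_alloc(pool, false, would_sleep);
	return req;
}
```

The point of the sketch is the accounting: after a failed nowait attempt the
pool has the same number of free slots as before, so the blocking retry can
succeed.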

diff --git a/drivers/md/dm-pcache/backing_dev.c b/drivers/md/dm-pcache/backing_dev.c
index 7165fc0364bb..5bde289ec5d7 100644
--- a/drivers/md/dm-pcache/backing_dev.c
+++ b/drivers/md/dm-pcache/backing_dev.c
@@ -204,6 +204,7 @@ static struct pcache_backing_dev_req *req_type_req_alloc(struct pcache_backing_d
 	struct pcache_request *pcache_req = opts->req.upper_req;
 	struct pcache_backing_dev_req *backing_req;
 	struct bio *orig = pcache_req->bio;
+	int ret;
 
 	backing_req = mempool_alloc(&backing_dev->req_pool, opts->gfp_mask);
 	if (!backing_req)
@@ -211,13 +212,20 @@ static struct pcache_backing_dev_req *req_type_req_alloc(struct pcache_backing_d
 
 	memset(backing_req, 0, sizeof(struct pcache_backing_dev_req));
 
-	bio_init_clone(backing_dev->dm_dev->bdev, &backing_req->bio, orig, opts->gfp_mask);
+	ret = bio_init_clone(backing_dev->dm_dev->bdev, &backing_req->bio,
+			     orig, opts->gfp_mask);
+	if (ret)
+		goto free_backing_req;
 
 	backing_req->type = BACKING_DEV_REQ_TYPE_REQ;
 	backing_req->backing_dev = backing_dev;
 	atomic_inc(&backing_dev->inflight_reqs);
 
 	return backing_req;
+
+free_backing_req:
+	mempool_free(backing_req, &backing_dev->req_pool);
+	return NULL;
 }
 
 static struct pcache_backing_dev_req *kmem_type_req_alloc(struct pcache_backing_dev *backing_dev,
-- 
2.51.0


* [RFC PATCH v1 11/17] block: avoid scheduling from non-blocking helper allocations
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blk_alloc_discard_bio() and blk_rq_map_bio_alloc() can be used with
non-blocking GFP masks.  Their bio allocation now handles bdev association in
nowait mode, so the helpers can pass the target bdev directly and avoid local
open-coded association paths.

The discard helper can also be reached from io_uring with GFP_NOWAIT.  Keep
its long-loop cond_resched() only for blocking callers.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-lib.c | 3 ++-
 block/blk-map.c | 7 +------
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 688bc67cbf73..b5645f8f69b6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -56,7 +56,8 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
 	 * discards (like mkfs).  Be nice and allow us to schedule out to avoid
 	 * softlocking if preempt is disabled.
 	 */
-	cond_resched();
+	if (gfpflags_allow_blocking(gfp_mask))
+		cond_resched();
 	return bio;
 }
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97..75c7b864c15a 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -46,14 +46,9 @@ static struct bio *blk_rq_map_bio_alloc(struct request *rq,
 		unsigned int nr_vecs, gfp_t gfp_mask)
 {
 	struct block_device *bdev = rq->q->disk ? rq->q->disk->part0 : NULL;
-	struct bio *bio;
 
-	bio = bio_alloc_bioset(bdev, nr_vecs, rq->cmd_flags, gfp_mask,
+	return bio_alloc_bioset(bdev, nr_vecs, rq->cmd_flags, gfp_mask,
 				&fs_bio_set);
-	if (!bio)
-		return NULL;
-
-	return bio;
 }
 
 /**
-- 
2.51.0


* [RFC PATCH v1 12/17] dm: avoid sleeping blkg association from NOWAIT remaps
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

DM allocates normal NOWAIT target clones with GFP_NOWAIT.  Targets that set
needs_bio_set_dev can therefore make alloc_tio() associate blkcg state from a
non-blocking allocation path.  Once blkg lookup is protected by
q->blkcg_mutex, that association may sleep while creating a missing blkg.

Set the default bdev without blkcg association first, then associate blkcg
with nowait=true for non-blocking allocations.  If a blkg would need creating,
fail the NOWAIT allocation with BLK_STS_AGAIN.

Targets that advertise DM_TARGET_NOWAIT may also remap bios in their map
functions.  Those remaps update only the bdev for NOWAIT bios, then
DM submission clones the original bio's blkg association with nowait=true
before lower submission.  If that would need to sleep, complete the clone with
BLK_STS_AGAIN.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-linear.c        |  2 +-
 drivers/md/dm-stripe.c        |  6 +++---
 drivers/md/dm-switch.c        |  2 +-
 drivers/md/dm-unstripe.c      |  2 +-
 drivers/md/dm.c               | 28 +++++++++++++++++++++++++---
 include/linux/device-mapper.h |  8 ++++++++
 6 files changed, 39 insertions(+), 9 deletions(-)
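
A rough userspace model of the remap-then-associate split for NOWAIT bios,
offered as a sketch rather than as DM's actual code.  TOY_REQ_NOWAIT,
TOY_STS_AGAIN and the group_exists parameter are hypothetical; group_exists
stands for "the association can complete without sleeping".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_REQ_NOWAIT	0x1u	/* hypothetical flag, not the kernel's */

enum toy_status { TOY_STS_OK, TOY_STS_AGAIN };

/* Hypothetical stand-in for a bio; not a kernel type. */
struct toy_bio {
	unsigned int flags;
	int dev;
	bool associated;
	enum toy_status status;
};

/* Remap step: NOWAIT bios only get the device; others associate now. */
static void toy_remap(struct toy_bio *bio, int dev)
{
	bio->dev = dev;
	bio->associated = !(bio->flags & TOY_REQ_NOWAIT);
}

/*
 * Submit step: a bio that is not yet associated gets one attempt.  For a
 * NOWAIT bio the attempt must not sleep, so it completes with
 * TOY_STS_AGAIN when the group is missing.
 */
static enum toy_status toy_submit(struct toy_bio *bio, bool group_exists)
{
	if (!bio->associated) {
		if ((bio->flags & TOY_REQ_NOWAIT) && !group_exists) {
			bio->status = TOY_STS_AGAIN;
			return bio->status;
		}
		bio->associated = true;
	}
	bio->status = TOY_STS_OK;
	return bio->status;
}
```

Deferring the association to the submit step keeps the target map functions
free of anything that could sleep, at the cost of one extra failure mode that
the submitter has to report.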

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 38c17846deb0..f75a372acd20 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -90,7 +90,7 @@ int linear_map(struct dm_target *ti, struct bio *bio)
 {
 	struct linear_c *lc = ti->private;
 
-	bio_set_dev(bio, lc->dev->bdev);
+	dm_bio_set_dev(bio, lc->dev->bdev);
 	bio->bi_iter.bi_sector = linear_map_sector(ti, bio->bi_iter.bi_sector);
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 750865fd3ae7..73f9483a3e8a 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -257,7 +257,7 @@ static int stripe_map_range(struct stripe_c *sc, struct bio *bio,
 	stripe_map_range_sector(sc, bio_end_sector(bio),
 				target_stripe, &end);
 	if (begin < end) {
-		bio_set_dev(bio, sc->stripe[target_stripe].dev->bdev);
+		dm_bio_set_dev(bio, sc->stripe[target_stripe].dev->bdev);
 		bio->bi_iter.bi_sector = begin +
 			sc->stripe[target_stripe].physical_start;
 		bio->bi_iter.bi_size = to_bytes(end - begin);
@@ -278,7 +278,7 @@ int stripe_map(struct dm_target *ti, struct bio *bio)
 	if (bio->bi_opf & REQ_PREFLUSH) {
 		target_bio_nr = dm_bio_get_target_bio_nr(bio);
 		BUG_ON(target_bio_nr >= sc->stripes);
-		bio_set_dev(bio, sc->stripe[target_bio_nr].dev->bdev);
+		dm_bio_set_dev(bio, sc->stripe[target_bio_nr].dev->bdev);
 		return DM_MAPIO_REMAPPED;
 	}
 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD) ||
@@ -293,7 +293,7 @@ int stripe_map(struct dm_target *ti, struct bio *bio)
 			  &stripe, &bio->bi_iter.bi_sector);
 
 	bio->bi_iter.bi_sector += sc->stripe[stripe].physical_start;
-	bio_set_dev(bio, sc->stripe[stripe].dev->bdev);
+	dm_bio_set_dev(bio, sc->stripe[stripe].dev->bdev);
 
 	return DM_MAPIO_REMAPPED;
 }
diff --git a/drivers/md/dm-switch.c b/drivers/md/dm-switch.c
index 5952f02de1e6..9eea6c263eed 100644
--- a/drivers/md/dm-switch.c
+++ b/drivers/md/dm-switch.c
@@ -323,7 +323,7 @@ static int switch_map(struct dm_target *ti, struct bio *bio)
 	sector_t offset = dm_target_offset(ti, bio->bi_iter.bi_sector);
 	unsigned int path_nr = switch_get_path_nr(sctx, offset);
 
-	bio_set_dev(bio, sctx->path_list[path_nr].dmdev->bdev);
+	dm_bio_set_dev(bio, sctx->path_list[path_nr].dmdev->bdev);
 	bio->bi_iter.bi_sector = sctx->path_list[path_nr].start + offset;
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c
index bfcbe6bfa71a..900b1ac88bc8 100644
--- a/drivers/md/dm-unstripe.c
+++ b/drivers/md/dm-unstripe.c
@@ -136,7 +136,7 @@ static int unstripe_map(struct dm_target *ti, struct bio *bio)
 {
 	struct unstripe_c *uc = ti->private;
 
-	bio_set_dev(bio, uc->dev->bdev);
+	dm_bio_set_dev(bio, uc->dev->bdev);
 	bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c54636235ffe..6dde3c699122 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -610,6 +610,8 @@ static void free_io(struct dm_io *io)
 	bio_put(&io->tio.clone);
 }
 
+static void free_tio(struct bio *clone);
+
 static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 			     unsigned int target_bio_nr, unsigned int *len, gfp_t gfp_mask)
 {
@@ -644,8 +646,12 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 
 	/* Set default bdev, but target must bio_set_dev() before issuing IO */
 	clone->bi_bdev = md->disk->part0;
-	if (likely(ti != NULL) && unlikely(ti->needs_bio_set_dev))
-		bio_set_dev(clone, md->disk->part0);
+	if (likely(ti != NULL) && unlikely(ti->needs_bio_set_dev)) {
+		bio_set_dev_no_blkg(clone, md->disk->part0);
+		if (!bio_associate_blkg(clone,
+				!gfpflags_allow_blocking(gfp_mask)))
+			goto fail;
+	}
 
 	if (len) {
 		clone->bi_iter.bi_size = to_bytes(*len);
@@ -654,6 +660,14 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 	}
 
 	return clone;
+
+fail:
+	if (dm_tio_flagged(clone_to_tio(clone), DM_TIO_INSIDE_DM_IO)) {
+		clone->bi_bdev = NULL;
+		clone_to_tio(clone)->io = NULL;
+	}
+	free_tio(clone);
+	return NULL;
 }
 
 static void free_tio(struct bio *clone)
@@ -1364,7 +1378,15 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	if (!tgt_clone)
 		tgt_clone = clone;
 
-	bio_clone_blkg_association(tgt_clone, io->orig_bio, false);
+	if (tgt_clone->bi_opf & REQ_NOWAIT) {
+		if (!bio_clone_blkg_association(tgt_clone, io->orig_bio, true)) {
+			tgt_clone->bi_status = BLK_STS_AGAIN;
+			tgt_clone->bi_end_io(tgt_clone);
+			return;
+		}
+	} else {
+		bio_clone_blkg_association(tgt_clone, io->orig_bio, false);
+	}
 
 	/*
 	 * Account io->origin_bio to DM dev on behalf of target
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index cd4faaf5d427..ca1e1cfee74f 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -713,6 +713,14 @@ module_exit(dm_##name##_exit)
 #define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL		4
 
+static inline void dm_bio_set_dev(struct bio *bio, struct block_device *bdev)
+{
+	if (bio->bi_opf & REQ_NOWAIT)
+		bio_set_dev_no_blkg(bio, bdev);
+	else
+		bio_set_dev(bio, bdev);
+}
+
 #define dm_sector_div64(x, y)( \
 { \
 	u64 _res; \
-- 
2.51.0


* [RFC PATCH v1 13/17] bfq: avoid blkg lookup from locked cgroup update
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

bfq_bio_bfqg() is called while bfqd->lock is held from the merge and
request insertion paths. It walks bio->bi_blkg and its parent chain to
find the closest online BFQ group, and also updates bio->bi_blkg when
the original association points at an offline or otherwise unusable
blkg.

After missing blkg creation is protected by q->blkcg_mutex,
bio_associate_blkg_from_css() can sleep on lookup misses. BFQ must not
call it while holding bfqd->lock. The blkg BFQ wants is already known
from the existing bio->bi_blkg ancestry walk, so update bio->bi_blkg by
swapping references to that existing blkg directly instead of looking it
up again by css.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
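
The reference swap described above can be illustrated with a small userspace
refcount model.  This is a sketch only: toy_group and its plain int counter
are hypothetical and ignore the atomicity a real refcount needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted group and bio; not kernel types. */
struct toy_group {
	int refs;
};

struct toy_bio {
	struct toy_group *grp;	/* assumed non-NULL, holds one reference */
};

static void toy_get(struct toy_group *g)
{
	g->refs++;
}

static void toy_put(struct toy_group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

/*
 * Point @bio at @g without any lookup: take the new reference before
 * dropping the old one, and do nothing if the bio already uses @g.
 */
static void toy_bio_update_group(struct toy_bio *bio, struct toy_group *g)
{
	if (bio->grp == g)
		return;

	toy_get(g);
	toy_put(bio->grp);
	bio->grp = g;
}
```

Because the target group is already reachable through the ancestry walk, no
lookup and no creation is needed, so nothing in this path can sleep.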

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 5c2faf56c8ef..06c4ec6d5e35 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -604,6 +604,16 @@ static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg)
 	}
 }
 
+static void bfq_bio_update_blkg(struct bio *bio, struct blkcg_gq *blkg)
+{
+	if (bio->bi_blkg == blkg)
+		return;
+
+	blkg_get(blkg);
+	blkg_put(bio->bi_blkg);
+	bio->bi_blkg = blkg;
+}
+
 struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 {
 	struct blkcg_gq *blkg = bio->bi_blkg;
@@ -616,14 +626,13 @@ struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 		}
 		bfqg = blkg_to_bfqg(blkg);
 		if (bfqg->pd.online) {
-			bio_associate_blkg_from_css(bio, &blkg->blkcg->css, false);
+			bfq_bio_update_blkg(bio, blkg);
 			return bfqg;
 		}
 		blkg = blkg->parent;
 	}
-	bio_associate_blkg_from_css(bio,
-				&bfqg_to_blkg(bfqd->root_group)->blkcg->css,
-				false);
+	blkg = bfqg_to_blkg(bfqd->root_group);
+	bfq_bio_update_blkg(bio, blkg);
 	return bfqd->root_group;
 }
 
-- 
2.51.0


* [RFC PATCH v1 14/17] blk-cgroup: protect blkgs with blkcg_mutex
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

queue_lock is still needed by block core users, but blkcg no longer needs
it for blkg topology now that throttle runtime state has a private lock.

Move queue-local blkg synchronization to q->blkcg_mutex. Hold it while
looking up, creating and destroying blkgs, while preparing and undoing
configuration, and while activating or deactivating policies.

Update the BFQ, iocost, iolatency and throttle paths which walk
q->blkg_list or access per-blkg policy state to use the same lock.

blkcg->lock still protects blkcg-local radix tree and list updates. Some
lookups under blkcg_mutex can race with blkcg updates done for other
queues, so keep those lookups in RCU read-side critical sections. In
particular, protect the parent lookup in blkg_create() and the parent
walk in blkg_lookup_create().

Nowait bio association remains non-blocking after the lock conversion: if
RCU lookup misses, preemptible task-context callers can try q->blkcg_mutex
and create the missing blkg without sleeping. Atomic callers, contended
mutexes, or allocation failures keep the fail-fast behavior.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c    |  10 +--
 block/blk-cgroup.c    | 199 +++++++++++++++++++++++-------------------
 block/blk-cgroup.h    |  16 ++--
 block/blk-iocost.c    |   5 +-
 block/blk-iolatency.c |   7 +-
 block/blk-throttle.c  |  10 +--
 6 files changed, 136 insertions(+), 111 deletions(-)
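
The fail-fast rule for nowait association after the lock conversion can be
modeled in userspace with a C11 atomic_flag standing in for the mutex.  This
is an assumption-laden sketch, not the blk-cgroup code: toy_queue, the
atomic_ctx parameter and the single group_exists flag are hypothetical, and
the unlocked read of group_exists only stands in for the RCU lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical queue with one possible group; not a kernel type. */
struct toy_queue {
	atomic_flag busy;	/* stand-in for a mutex; set means "locked" */
	bool group_exists;
};

static void toy_queue_init(struct toy_queue *q)
{
	atomic_flag_clear(&q->busy);
	q->group_exists = false;
}

/* Returns true if the lock was taken; never waits. */
static bool toy_trylock(struct toy_queue *q)
{
	return !atomic_flag_test_and_set(&q->busy);
}

static void toy_unlock(struct toy_queue *q)
{
	atomic_flag_clear(&q->busy);
}

/*
 * Nowait association: lookup first, then an opportunistic trylock.
 * @atomic_ctx models a caller that may not take the lock at all.
 */
static bool toy_associate_nowait(struct toy_queue *q, bool atomic_ctx)
{
	if (q->group_exists)
		return true;		/* lookup hit, no lock needed */
	if (atomic_ctx)
		return false;		/* cannot even try the lock */
	if (!toy_trylock(q))
		return false;		/* contended: fail fast */

	q->group_exists = true;		/* create the missing group */
	toy_unlock(q);
	return true;
}
```

The three early returns correspond to the cases listed in the commit message:
a lookup hit, an atomic caller, and a contended lock.  Allocation failure is
not modeled here.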

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 06c4ec6d5e35..8a3ff9510386 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -426,7 +426,7 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
 
 	parent = bfqg_parent(bfqg);
 
-	lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->queue_lock);
+	lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->blkcg_mutex);
 
 	if (unlikely(!parent))
 		return;
@@ -884,7 +884,7 @@ static void bfq_reparent_active_queues(struct bfq_data *bfqd,
  *		    and reparent its children entities.
  * @pd: descriptor of the policy going offline.
  *
- * blkio already grabs the queue_lock for us, so no need to use
+ * blkio already grabs the blkcg_mutex for us, so no need to use
  * RCU-based magic
  */
 static void bfq_pd_offline(struct blkg_policy_data *pd)
@@ -957,8 +957,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	struct blkcg_gq *blkg;
 
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
-	spin_lock(&bfqd->lock);
+	spin_lock_irq(&bfqd->lock);
 
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
@@ -967,8 +966,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	}
 	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
 
-	spin_unlock(&bfqd->lock);
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&bfqd->lock);
 	mutex_unlock(&q->blkcg_mutex);
 }
 
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 92846094043a..71313bb3c4f3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,6 +30,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/part_stat.h>
+#include <linux/preempt.h>
 #include "blk.h"
 #include "blk-cgroup.h"
 #include "blk-ioprio.h"
@@ -131,9 +132,7 @@ static void blkg_free_workfn(struct work_struct *work)
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
 	if (blkg->parent)
 		blkg_put(blkg->parent);
-	spin_lock_irq(&q->queue_lock);
 	list_del_init(&blkg->q_node);
-	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	/*
@@ -382,7 +381,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	struct blkcg_gq *blkg;
 	int i, ret;
 
-	lockdep_assert_held(&disk->queue->queue_lock);
+	lockdep_assert_held(&disk->queue->blkcg_mutex);
 
 	/* request_queue is dying, do not create/recreate a blkg */
 	if (blk_queue_dying(disk->queue)) {
@@ -402,12 +401,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
+		rcu_read_lock();
 		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
 		if (WARN_ON_ONCE(!blkg->parent)) {
+			rcu_read_unlock();
 			ret = -ENODEV;
 			goto err_free_blkg;
 		}
 		blkg_get(blkg->parent);
+		rcu_read_unlock();
 	}
 
 	/* invoke per-policy init */
@@ -419,7 +421,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	}
 
 	/* insert */
-	spin_lock(&blkcg->lock);
+	spin_lock_irq(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, disk->queue->id, blkg);
 	if (likely(!ret)) {
 		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
@@ -436,7 +438,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		}
 	}
 	blkg->online = true;
-	spin_unlock(&blkcg->lock);
+	spin_unlock_irq(&blkcg->lock);
 
 	if (!ret)
 		return blkg;
@@ -459,7 +461,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
  * Lookup blkg for the @blkcg - @disk pair.  If it doesn't exist, try to
  * create one.  blkg creation is performed recursively from blkcg_root such
  * that all non-root blkg's have access to the parent blkg.  This function
- * should be called under RCU read lock and takes @disk->queue->queue_lock.
+ * must be called with @disk->queue->blkcg_mutex held.
  *
  * Returns the blkg or the closest blkg if blkg_create() fails as it walks
  * down from root.
@@ -491,6 +493,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		struct blkcg *parent = blkcg_parent(blkcg);
 		struct blkcg_gq *ret_blkg = q->root_blkg;
 
+		rcu_read_lock();
 		while (parent) {
 			blkg = blkg_lookup(parent, q);
 			if (blkg) {
@@ -501,6 +504,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 			pos = parent;
 			parent = blkcg_parent(parent);
 		}
+		rcu_read_unlock();
 
 		blkg = blkg_create(pos, disk, NULL);
 		if (IS_ERR(blkg)) {
@@ -519,7 +523,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	struct blkcg *blkcg = blkg->blkcg;
 	int i;
 
-	lockdep_assert_held(&blkg->q->queue_lock);
+	lockdep_assert_held(&blkg->q->blkcg_mutex);
 	lockdep_assert_held(&blkcg->lock);
 
 	/*
@@ -547,8 +551,8 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
 	/*
-	 * Both setting lookup hint to and clearing it from @blkg are done
-	 * under queue_lock.  If it's not pointing to @blkg now, it never
+	 * Both setting lookup hint to and clearing it from @blkg are done under
+	 * blkcg_mutex.  If it's not pointing to @blkg now, it never
 	 * will.  Hint assignment itself can race safely.
 	 */
 	if (rcu_access_pointer(blkcg->blkg_hint) == blkg)
@@ -569,24 +573,21 @@ static void blkg_destroy_all(struct gendisk *disk)
 	int i;
 
 restart:
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
 			continue;
 
-		spin_lock(&blkcg->lock);
+		spin_lock_irq(&blkcg->lock);
 		blkg_destroy(blkg);
-		spin_unlock(&blkcg->lock);
+		spin_unlock_irq(&blkcg->lock);
 
-		/*
-		 * in order to avoid holding the spin lock for too long, release
-		 * it when a batch of blkgs are destroyed.
-		 */
+		/* Avoid holding blkcg_mutex for too long. */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
-			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
@@ -605,7 +606,7 @@ static void blkg_destroy_all(struct gendisk *disk)
 	}
 
 	q->root_blkg = NULL;
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
@@ -822,8 +823,8 @@ EXPORT_SYMBOL_GPL(blkg_conf_open_bdev);
  * @ctx->blkg to the blkg being configured.
  *
  * blkg_conf_open_bdev() must be called on @ctx beforehand. On success, this
- * function returns with queue lock held and must be followed by
- * blkg_conf_close_bdev().
+ * function returns with blkcg_mutex held and must be followed by
+ * blkg_conf_unprep().
  */
 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		   struct blkg_conf_ctx *ctx)
@@ -841,7 +842,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 
 	/* Prevent concurrent with blkcg_deactivate_policy() */
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
 
 	if (!blkcg_policy_enabled(q, pol)) {
 		ret = -EOPNOTSUPP;
@@ -862,35 +862,34 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		struct blkcg_gq *new_blkg;
 
 		parent = blkcg_parent(blkcg);
+		rcu_read_lock();
 		while (parent && !blkg_lookup(parent, q)) {
 			pos = parent;
 			parent = blkcg_parent(parent);
 		}
-
-		/* Drop locks to do new blkg allocation with GFP_KERNEL. */
-		spin_unlock_irq(&q->queue_lock);
+		rcu_read_unlock();
 
 		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto fail_exit;
+			goto fail_unlock;
 		}
 
 		if (radix_tree_preload(GFP_KERNEL)) {
 			blkg_free(new_blkg);
 			ret = -ENOMEM;
-			goto fail_exit;
+			goto fail_unlock;
 		}
 
-		spin_lock_irq(&q->queue_lock);
-
 		if (!blkcg_policy_enabled(q, pol)) {
 			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
 			goto fail_preloaded;
 		}
 
+		rcu_read_lock();
 		blkg = blkg_lookup(pos, q);
+		rcu_read_unlock();
 		if (blkg) {
 			blkg_free(new_blkg);
 		} else {
@@ -907,15 +906,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			goto success;
 	}
 success:
-	mutex_unlock(&q->blkcg_mutex);
 	ctx->blkg = blkg;
 	return 0;
 
 fail_preloaded:
 	radix_tree_preload_end();
 fail_unlock:
-	spin_unlock_irq(&q->queue_lock);
-fail_exit:
 	mutex_unlock(&q->blkcg_mutex);
 	/*
 	 * If queue was bypassing, we should retry.  Do so after a
@@ -938,7 +934,7 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_unprep(struct blkg_conf_ctx *ctx)
 {
 	WARN_ON_ONCE(!ctx->blkg);
-	spin_unlock_irq(&ctx->bdev->bd_disk->queue->queue_lock);
+	mutex_unlock(&ctx->bdev->bd_disk->queue->blkcg_mutex);
 	ctx->blkg = NULL;
 }
 EXPORT_SYMBOL_GPL(blkg_conf_unprep);
@@ -1258,8 +1254,9 @@ static struct blkcg_gq *blkcg_get_first_blkg(struct blkcg *blkcg)
  * blkcg_destroy_blkgs - responsible for shooting down blkgs
  * @blkcg: blkcg of interest
  *
- * blkgs should be removed while holding both q and blkcg locks.  As blkcg lock
- * is nested inside q lock, this function performs reverse double lock dancing.
+ * blkgs should be removed while holding both q->blkcg_mutex and blkcg->lock.
+ * As blkcg->lock is nested inside q->blkcg_mutex, this function performs
+ * reverse double lock dancing.
  * Destroying the blkgs releases the reference held on the blkcg's css allowing
  * blkcg_css_free to eventually be called.
  *
@@ -1274,13 +1271,13 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 	while ((blkg = blkcg_get_first_blkg(blkcg))) {
 		struct request_queue *q = blkg->q;
 
-		spin_lock_irq(&q->queue_lock);
-		spin_lock(&blkcg->lock);
+		mutex_lock(&q->blkcg_mutex);
+		spin_lock_irq(&blkcg->lock);
 
 		blkg_destroy(blkg);
 
-		spin_unlock(&blkcg->lock);
-		spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&blkcg->lock);
+		mutex_unlock(&q->blkcg_mutex);
 
 		blkg_put(blkg);
 		cond_resched();
@@ -1472,21 +1469,20 @@ int blkcg_init_disk(struct gendisk *disk)
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	/* Make sure the root blkg exists. */
-	/* spin_lock_irq can serve as RCU read-side critical section. */
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_create(&blkcg_root, disk, new_blkg);
 	if (IS_ERR(blkg))
 		goto err_unlock;
 	q->root_blkg = blkg;
-	spin_unlock_irq(&q->queue_lock);
 
 	if (preloaded)
 		radix_tree_preload_end();
+	mutex_unlock(&q->blkcg_mutex);
 
 	return 0;
 
 err_unlock:
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 	if (preloaded)
 		radix_tree_preload_end();
 	return PTR_ERR(blkg);
@@ -1526,6 +1522,42 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+static void blkg_free_policy_data(struct blkcg_gq *blkg,
+				  const struct blkcg_policy *pol)
+{
+	struct blkcg *blkcg = blkg->blkcg;
+	struct blkg_policy_data *pd;
+	bool online = false;
+
+	lockdep_assert_held(&blkg->q->blkcg_mutex);
+
+	/*
+	 * ->pd_offline_fn() may need blkg->pd[] to stay installed, while
+	 * ->pd_free_fn() can sleep.  Mark offline under blkcg->lock, run
+	 * the offline callback, detach under blkcg->lock, then free.
+	 */
+	spin_lock_irq(&blkcg->lock);
+	pd = blkg->pd[pol->plid];
+	if (pd) {
+		online = pd->online;
+		pd->online = false;
+	}
+	spin_unlock_irq(&blkcg->lock);
+
+	if (!pd)
+		return;
+
+	if (online && pol->pd_offline_fn)
+		pol->pd_offline_fn(pd);
+
+	spin_lock_irq(&blkcg->lock);
+	WARN_ON_ONCE(blkg->pd[pol->plid] != pd);
+	WRITE_ONCE(blkg->pd[pol->plid], NULL);
+	spin_unlock_irq(&blkcg->lock);
+
+	pol->pd_free_fn(pd);
+}
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a gendisk
  * @disk: gendisk of interest
@@ -1535,9 +1567,9 @@ EXPORT_SYMBOL_GPL(io_cgrp_subsys);
  * bypass mode to populate its blkgs with policy_data for @pol.
  *
  * Activation happens with @disk bypassed, so nobody would be accessing blkgs
- * from IO path.  Update of each blkg is protected by both queue and blkcg
- * locks so that holding either lock and testing blkcg_policy_enabled() is
- * always enough for dereferencing policy data.
+ * from IO path.  Update of each blkg is protected by q->blkcg_mutex and
+ * blkcg->lock so that holding either lock and testing blkcg_policy_enabled()
+ * is always enough for dereferencing policy data.
  *
  * The caller is responsible for synchronizing [de]activations and policy
  * [un]registerations.  Returns 0 on success, -errno on failure.
@@ -1563,8 +1595,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
 retry:
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1572,14 +1605,15 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 		if (blkg->pd[pol->plid])
 			continue;
+		if (hlist_unhashed(&blkg->blkcg_node))
+			continue;
 
-		/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
+		/* If prealloc matches, use it; otherwise try GFP_NOWAIT. */
 		if (blkg == pinned_blkg) {
 			pd = pd_prealloc;
 			pd_prealloc = NULL;
 		} else {
-			pd = pol->pd_alloc_fn(disk, blkg->blkcg,
-					      GFP_NOWAIT);
+			pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOWAIT);
 		}
 
 		if (!pd) {
@@ -1592,7 +1626,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			blkg_get(blkg);
 			pinned_blkg = blkg;
 
-			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 
 			if (pd_prealloc)
 				pol->pd_free_fn(pd_prealloc);
@@ -1600,11 +1634,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 						       GFP_KERNEL);
 			if (pd_prealloc)
 				goto retry;
-			else
-				goto enomem;
+			goto enomem;
 		}
 
-		spin_lock(&blkg->blkcg->lock);
+		spin_lock_irq(&blkg->blkcg->lock);
 
 		pd->blkg = blkg;
 		pd->plid = pol->plid;
@@ -1617,14 +1650,14 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			pol->pd_online_fn(pd);
 		pd->online = true;
 
-		spin_unlock(&blkg->blkcg->lock);
+		spin_unlock_irq(&blkg->blkcg->lock);
 	}
 
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
-	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
@@ -1635,23 +1668,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 enomem:
 	/* alloc failed, take down everything */
-	spin_lock_irq(&q->queue_lock);
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-		struct blkg_policy_data *pd;
-
-		spin_lock(&blkcg->lock);
-		pd = blkg->pd[pol->plid];
-		if (pd) {
-			if (pd->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(pd);
-			pd->online = false;
-			pol->pd_free_fn(pd);
-			WRITE_ONCE(blkg->pd[pol->plid], NULL);
-		}
-		spin_unlock(&blkcg->lock);
-	}
-	spin_unlock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
+	list_for_each_entry(blkg, &q->blkg_list, q_node)
+		blkg_free_policy_data(blkg, pol);
 	ret = -ENOMEM;
 	goto out;
 }
@@ -1679,24 +1698,12 @@ void blkcg_deactivate_policy(struct gendisk *disk,
 		memflags = blk_mq_freeze_queue(q);
 
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
 
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-
-		spin_lock(&blkcg->lock);
-		if (blkg->pd[pol->plid]) {
-			if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(blkg->pd[pol->plid]);
-			pol->pd_free_fn(blkg->pd[pol->plid]);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
+	list_for_each_entry(blkg, &q->blkg_list, q_node)
+		blkg_free_policy_data(blkg, pol);
 
-	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	if (queue_is_mq(q))
@@ -2082,16 +2089,32 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 
 	if (blkg)
 		return blkg;
+	if (nowait) {
+		/*
+		 * mutex_trylock() itself does not sleep, but mutexes still
+		 * follow task-context locking rules.  Keep atomic nowait callers
+		 * on the strict fail-fast path.
+		 */
+		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
+			return NULL;
+
+		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+		if (blkg)
+			blkg = blkg_lookup_tryget(blkg);
+		mutex_unlock(&q->blkcg_mutex);
+
+		return blkg;
+	}
 
 	/*
 	 * Fast path failed, we're probably issuing IO in this cgroup the first
 	 * time, hold lock to create new blkg.
 	 */
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
 	if (blkg)
 		blkg = blkg_lookup_tryget(blkg);
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	return blkg;
 }
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 615390f751aa..5aaf2d54d17e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -66,7 +66,7 @@ struct blkcg_gq {
 	/* reference count */
 	struct percpu_ref		refcnt;
 
-	/* is this blkg online? protected by both blkcg and q locks */
+	/* is this blkg online? protected by blkcg->lock and q->blkcg_mutex */
 	bool				online;
 
 	struct blkg_iostat_set __percpu	*iostat_cpu;
@@ -224,9 +224,9 @@ int blkg_conf_open_bdev(struct blkg_conf_ctx *ctx)
 	__cond_acquires(0, &ctx->bdev->bd_queue->rq_qos_mutex);
 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		   struct blkg_conf_ctx *ctx)
-	__cond_acquires(0, &ctx->bdev->bd_disk->queue->queue_lock);
+	__cond_acquires(0, &ctx->bdev->bd_disk->queue->blkcg_mutex);
 void blkg_conf_unprep(struct blkg_conf_ctx *ctx)
-	__releases(ctx->bdev->bd_disk->queue->queue_lock);
+	__releases(ctx->bdev->bd_disk->queue->blkcg_mutex);
 void blkg_conf_close_bdev(struct blkg_conf_ctx *ctx)
 	__releases(&ctx->bdev->bd_queue->rq_qos_mutex);
 
@@ -255,7 +255,7 @@ static inline bool bio_issue_as_root_blkg(struct bio *bio)
  *
  * Lookup blkg for the @blkcg - @q pair.
  *
- * Must be called in a RCU critical section.
+ * Must be called in a RCU critical section or with q->blkcg_mutex held.
  */
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 					   struct request_queue *q)
@@ -266,7 +266,7 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 		return q->root_blkg;
 
 	blkg = rcu_dereference_check(blkcg->blkg_hint,
-			lockdep_is_held(&q->queue_lock));
+			lockdep_is_held(&q->blkcg_mutex));
 	if (blkg && blkg->q == q)
 		return blkg;
 
@@ -350,9 +350,9 @@ static inline void blkg_put(struct blkcg_gq *blkg)
  * @p_blkg: target blkg to walk descendants of
  *
  * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
- * read locked.  If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs.  The caller may
- * update @pos_css by calling css_rightmost_descendant() to skip subtree.
+ * read locked.  If called under either blkcg->lock or q->blkcg_mutex, the
+ * iteration is guaranteed to include all and only online blkgs.  The caller
+ * may update @pos_css by calling css_rightmost_descendant() to skip subtree.
  * @p_blkg is included in the iteration and the first node to be visited.
  */
 #define blkg_for_each_descendant_pre(d_blkg, pos_css, p_blkg)		\
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 8b2aeba2e1e3..ae50d143e4fc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -3143,6 +3143,7 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
 	struct blkg_conf_ctx ctx;
 	struct ioc_now now;
 	struct ioc_gq *iocg;
+	unsigned long flags;
 	u32 v;
 	int ret;
 
@@ -3195,11 +3196,11 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
 			goto unprep;
 	}
 
-	spin_lock(&iocg->ioc->lock);
+	spin_lock_irqsave(&iocg->ioc->lock, flags);
 	iocg->cfg_weight = v * WEIGHT_ONE;
 	ioc_now(iocg->ioc, &now);
 	weight_updated(iocg, &now);
-	spin_unlock(&iocg->ioc->lock);
+	spin_unlock_irqrestore(&iocg->ioc->lock, flags);
 
 	ret = 0;
 
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index cef02b6c5fa9..30e23fee4f15 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -639,6 +639,7 @@ static void blkcg_iolatency_exit(struct rq_qos *rqos)
 	timer_shutdown_sync(&blkiolat->timer);
 	flush_work(&blkiolat->enable_work);
 	blkcg_deactivate_policy(rqos->disk, &blkcg_policy_iolatency);
+	flush_work(&blkiolat->enable_work);
 	kfree(blkiolat);
 }
 
@@ -811,16 +812,18 @@ static void iolatency_clear_scaling(struct blkcg_gq *blkg)
 	if (blkg->parent) {
 		struct iolatency_grp *iolat = blkg_to_lat(blkg->parent);
 		struct child_latency_info *lat_info;
+		unsigned long flags;
+
 		if (!iolat)
 			return;
 
 		lat_info = &iolat->child_lat;
-		spin_lock(&lat_info->lock);
+		spin_lock_irqsave(&lat_info->lock, flags);
 		atomic_set(&lat_info->scale_cookie, DEFAULT_SCALE_COOKIE);
 		lat_info->last_scale_event = 0;
 		lat_info->scale_grp = NULL;
 		lat_info->scale_lat = 0;
-		spin_unlock(&lat_info->lock);
+		spin_unlock_irqrestore(&lat_info->lock, flags);
 	}
 }
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7bca2805404f..ef3edd5a4785 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1777,10 +1777,10 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 	if (!blk_throtl_activated(q))
 		return;
 
-	spin_lock_irq(&q->queue_lock);
-	spin_lock(&td->lock);
+	mutex_lock(&q->blkcg_mutex);
+	spin_lock_irq(&td->lock);
 	/*
-	 * queue_lock is held, rcu lock is not needed here technically.
+	 * blkcg_mutex is held, rcu lock is not needed here technically.
 	 * However, rcu lock is still held to emphasize that following
 	 * path need RCU protection and to prevent warning from lockdep.
 	 */
@@ -1797,8 +1797,8 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 		tg_cancel_writeback_bios(blkg_to_tg(blkg), cancel_bios);
 	}
 	rcu_read_unlock();
-	spin_unlock(&td->lock);
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&td->lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	for (rw = READ; rw <= WRITE; rw++) {
 		struct bio *bio;
-- 
2.51.0
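
Note on the teardown ordering in blkg_free_policy_data() above: the
comment there describes four steps (mark offline under the lock, run the
offline callback unlocked, detach under the lock, then free).  A minimal
userspace sketch of that ordering follows.  It is not kernel code: a
pthread mutex stands in for blkcg->lock, and struct group,
struct policy_data, pd_offline() and pd_free() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct policy_data {
	bool online;
};

struct group {
	pthread_mutex_t lock;		/* stands in for blkcg->lock */
	struct policy_data *pd;		/* stands in for blkg->pd[plid] */
};

/* Counts offline callbacks that still saw the data installed. */
static int offline_seen_installed;

/* Hypothetical offline callback: it may still look at g->pd. */
static void pd_offline(struct group *g, struct policy_data *pd)
{
	if (g->pd == pd)
		offline_seen_installed++;
}

/* Hypothetical free callback: in the kernel this one may sleep. */
static void pd_free(struct policy_data *pd)
{
	free(pd);
}

/* Returns true if policy data was present and has been torn down. */
static bool group_free_policy_data(struct group *g)
{
	struct policy_data *pd;
	bool online = false;

	/* Step 1: mark offline under the lock, leave the pointer installed. */
	pthread_mutex_lock(&g->lock);
	pd = g->pd;
	if (pd) {
		online = pd->online;
		pd->online = false;
	}
	pthread_mutex_unlock(&g->lock);

	if (!pd)
		return false;

	/* Step 2: run the offline callback without holding the lock. */
	if (online)
		pd_offline(g, pd);

	/* Step 3: detach under the lock. */
	pthread_mutex_lock(&g->lock);
	assert(g->pd == pd);	/* mirrors the WARN_ON_ONCE() in the patch */
	g->pd = NULL;
	pthread_mutex_unlock(&g->lock);

	/* Step 4: free with no lock held. */
	pd_free(pd);
	return true;
}
```

The sketch relies on an outer serializing lock (q->blkcg_mutex in the
patch) to keep a second teardown from racing between steps 1 and 3; it
does not model that outer lock.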



* [RFC PATCH v1 15/17] blk-cgroup: remove blkg radix tree preloading
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blkg creation is now serialized by q->blkcg_mutex and no longer runs
under q->queue_lock.  The radix tree is initialized with GFP_NOWAIT, so
radix_tree_insert() cannot sleep while blkcg->lock is held and the old
preload dance is no longer needed.

Remove the preload calls and the associated unwind path.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 71313bb3c4f3..b99ab8d67798 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -420,7 +420,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 			pol->pd_init_fn(blkg->pd[i]);
 	}
 
-	/* insert */
 	spin_lock_irq(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, disk->queue->id, blkg);
 	if (likely(!ret)) {
@@ -875,16 +874,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			goto fail_unlock;
 		}
 
-		if (radix_tree_preload(GFP_KERNEL)) {
-			blkg_free(new_blkg);
-			ret = -ENOMEM;
-			goto fail_unlock;
-		}
-
 		if (!blkcg_policy_enabled(q, pol)) {
 			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
-			goto fail_preloaded;
+			goto fail_unlock;
 		}
 
 		rcu_read_lock();
@@ -896,12 +889,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			blkg = blkg_create(pos, disk, new_blkg);
 			if (IS_ERR(blkg)) {
 				ret = PTR_ERR(blkg);
-				goto fail_preloaded;
+				goto fail_unlock;
 			}
 		}
 
-		radix_tree_preload_end();
-
 		if (pos == blkcg)
 			goto success;
 	}
@@ -909,8 +900,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	ctx->blkg = blkg;
 	return 0;
 
-fail_preloaded:
-	radix_tree_preload_end();
 fail_unlock:
 	mutex_unlock(&q->blkcg_mutex);
 	/*
@@ -1448,7 +1437,6 @@ int blkcg_init_disk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	struct blkcg_gq *new_blkg, *blkg;
-	bool preloaded;
 
 	/*
 	 * If the queue is shared across disk rebind (e.g., SCSI), the
@@ -1466,8 +1454,6 @@ int blkcg_init_disk(struct gendisk *disk)
 	if (!new_blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	/* Make sure the root blkg exists. */
 	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_create(&blkcg_root, disk, new_blkg);
@@ -1475,16 +1461,12 @@ int blkcg_init_disk(struct gendisk *disk)
 		goto err_unlock;
 	q->root_blkg = blkg;
 
-	if (preloaded)
-		radix_tree_preload_end();
 	mutex_unlock(&q->blkcg_mutex);
 
 	return 0;
 
 err_unlock:
 	mutex_unlock(&q->blkcg_mutex);
-	if (preloaded)
-		radix_tree_preload_end();
 	return PTR_ERR(blkg);
 }
 
-- 
2.51.0



* [RFC PATCH v1 16/17] blk-cgroup: allocate blkgs in blkg_create
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

Now that radix tree preloading is gone, callers no longer need to allocate a
blkg before entering blkg_create(). Move allocation into blkg_create() and
pass the desired GFP mask instead.

Use GFP_NOIO for runtime and config blkg creation so slow paths can sleep
without recursing into IO reclaim, keep GFP_KERNEL for root blkg setup, and
use GFP_ATOMIC when nowait bio association creates a missing blkg after a
successful q->blkcg_mutex trylock.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 89 ++++++++++------------------------------------
 1 file changed, 18 insertions(+), 71 deletions(-)
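
Illustration only, not part of the patch: a minimal userspace sketch of
the idea in the commit message, where the create helper allocates for
itself and the caller only says whether the allocation may sleep.  The
nowait caller takes the creation mutex with trylock and fails fast.
Assumptions: a pthread mutex stands in for q->blkcg_mutex, the two-value
enum alloc_mode collapses the GFP_KERNEL/GFP_NOIO/GFP_ATOMIC choices,
the unlocked fast-path lookup is left out, and every name below is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GFP mask. */
enum alloc_mode {
	ALLOC_MAY_SLEEP,
	ALLOC_NOWAIT,
};

struct queue {
	pthread_mutex_t create_mutex;	/* stands in for q->blkcg_mutex */
	int *group;			/* NULL until created */
	bool simulate_pressure;		/* make non-sleeping allocations fail */
};

/* Non-sleeping allocations are the ones allowed to fail under pressure. */
static int *group_alloc(struct queue *q, enum alloc_mode mode)
{
	if (mode == ALLOC_NOWAIT && q->simulate_pressure)
		return NULL;
	return calloc(1, sizeof(int));
}

/* Caller holds q->create_mutex.  Allocation happens here, not in the caller. */
static int group_create(struct queue *q, enum alloc_mode mode)
{
	int *g = group_alloc(q, mode);

	if (!g)
		return -ENOMEM;
	q->group = g;
	return 0;
}

/* Returns the group, or NULL if a nowait caller could not get one. */
static int *group_get(struct queue *q, bool nowait)
{
	int *g;

	assert(q != NULL);

	if (nowait) {
		/* Fail fast instead of waiting for the mutex. */
		if (pthread_mutex_trylock(&q->create_mutex) != 0)
			return NULL;
	} else {
		pthread_mutex_lock(&q->create_mutex);
	}

	if (!q->group)
		group_create(q, nowait ? ALLOC_NOWAIT : ALLOC_MAY_SLEEP);
	g = q->group;

	pthread_mutex_unlock(&q->create_mutex);
	return g;
}
```

A nowait caller that gets NULL back is expected to retry from a context
that may block, which then goes through the ALLOC_MAY_SLEEP branch.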

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b99ab8d67798..ddc9073d7ab9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -371,14 +371,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 	return NULL;
 }
 
-/*
- * If @new_blkg is %NULL, this function tries to allocate a new one as
- * necessary using %GFP_NOWAIT.  @new_blkg is always consumed on return.
- */
 static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
-				    struct blkcg_gq *new_blkg)
+				    gfp_t gfp_mask)
 {
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg = NULL;
 	int i, ret;
 
 	lockdep_assert_held(&disk->queue->blkcg_mutex);
@@ -389,15 +385,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		goto err_free_blkg;
 	}
 
-	/* allocate */
-	if (!new_blkg) {
-		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
-		if (unlikely(!new_blkg)) {
-			ret = -ENOMEM;
-			goto err_free_blkg;
-		}
+	blkg = blkg_alloc(blkcg, disk, gfp_mask);
+	if (unlikely(!blkg)) {
+		ret = -ENOMEM;
+		goto err_free_blkg;
 	}
-	blkg = new_blkg;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
@@ -447,8 +439,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	return ERR_PTR(ret);
 
 err_free_blkg:
-	if (new_blkg)
-		blkg_free(new_blkg);
+	if (blkg)
+		blkg_free(blkg);
 	return ERR_PTR(ret);
 }
 
@@ -505,7 +497,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		}
 		rcu_read_unlock();
 
-		blkg = blkg_create(pos, disk, NULL);
+		blkg = blkg_create(pos, disk, GFP_NOIO);
 		if (IS_ERR(blkg)) {
 			blkg = ret_blkg;
 			break;
@@ -858,7 +850,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	while (true) {
 		struct blkcg *pos = blkcg;
 		struct blkcg *parent;
-		struct blkcg_gq *new_blkg;
 
 		parent = blkcg_parent(blkcg);
 		rcu_read_lock();
@@ -868,14 +859,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		}
 		rcu_read_unlock();
 
-		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
-		if (unlikely(!new_blkg)) {
-			ret = -ENOMEM;
-			goto fail_unlock;
-		}
-
 		if (!blkcg_policy_enabled(q, pol)) {
-			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
 			goto fail_unlock;
 		}
@@ -883,10 +867,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		rcu_read_lock();
 		blkg = blkg_lookup(pos, q);
 		rcu_read_unlock();
-		if (blkg) {
-			blkg_free(new_blkg);
-		} else {
-			blkg = blkg_create(pos, disk, new_blkg);
+		if (!blkg) {
+			blkg = blkg_create(pos, disk, GFP_NOIO);
 			if (IS_ERR(blkg)) {
 				ret = PTR_ERR(blkg);
 				goto fail_unlock;
@@ -1436,7 +1418,7 @@ void blkg_init_queue(struct request_queue *q)
 int blkcg_init_disk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
-	struct blkcg_gq *new_blkg, *blkg;
+	struct blkcg_gq *blkg;
 
 	/*
 	 * If the queue is shared across disk rebind (e.g., SCSI), the
@@ -1450,13 +1432,9 @@ int blkcg_init_disk(struct gendisk *disk)
 	 */
 	wait_var_event(&q->root_blkg, !READ_ONCE(q->root_blkg));
 
-	new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
-	if (!new_blkg)
-		return -ENOMEM;
-
 	/* Make sure the root blkg exists. */
 	mutex_lock(&q->blkcg_mutex);
-	blkg = blkg_create(&blkcg_root, disk, new_blkg);
+	blkg = blkg_create(&blkcg_root, disk, GFP_KERNEL);
 	if (IS_ERR(blkg))
 		goto err_unlock;
 	q->root_blkg = blkg;
@@ -1559,8 +1537,7 @@ static void blkg_free_policy_data(struct blkcg_gq *blkg,
 int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 {
 	struct request_queue *q = disk->queue;
-	struct blkg_policy_data *pd_prealloc = NULL;
-	struct blkcg_gq *blkg, *pinned_blkg = NULL;
+	struct blkcg_gq *blkg;
 	unsigned int memflags;
 	int ret;
 
@@ -1578,7 +1555,6 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
 
-retry:
 	mutex_lock(&q->blkcg_mutex);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
@@ -1590,34 +1566,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 		if (hlist_unhashed(&blkg->blkcg_node))
 			continue;
 
-		/* If prealloc matches, use it; otherwise try GFP_NOWAIT. */
-		if (blkg == pinned_blkg) {
-			pd = pd_prealloc;
-			pd_prealloc = NULL;
-		} else {
-			pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOWAIT);
-		}
-
-		if (!pd) {
-			/*
-			 * GFP_NOWAIT failed.  Free the existing one and
-			 * prealloc for @blkg w/ GFP_KERNEL.
-			 */
-			if (pinned_blkg)
-				blkg_put(pinned_blkg);
-			blkg_get(blkg);
-			pinned_blkg = blkg;
-
-			mutex_unlock(&q->blkcg_mutex);
-
-			if (pd_prealloc)
-				pol->pd_free_fn(pd_prealloc);
-			pd_prealloc = pol->pd_alloc_fn(disk, blkg->blkcg,
-						       GFP_KERNEL);
-			if (pd_prealloc)
-				goto retry;
+		pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOIO);
+		if (!pd)
 			goto enomem;
-		}
 
 		spin_lock_irq(&blkg->blkcg->lock);
 
@@ -1642,15 +1593,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
-	if (pinned_blkg)
-		blkg_put(pinned_blkg);
-	if (pd_prealloc)
-		pol->pd_free_fn(pd_prealloc);
 	return ret;
 
 enomem:
 	/* alloc failed, take down everything */
-	mutex_lock(&q->blkcg_mutex);
 	list_for_each_entry(blkg, &q->blkg_list, q_node)
 		blkg_free_policy_data(blkg, pol);
 	ret = -ENOMEM;
@@ -2080,7 +2026,8 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
 			return NULL;
 
-		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
+					  GFP_ATOMIC);
 		if (blkg)
 			blkg = blkg_lookup_tryget(blkg);
 		mutex_unlock(&q->blkcg_mutex);
-- 
2.51.0



* [RFC PATCH v1 17/17] blk-cgroup: share blkg creation between lookup and config prep
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache
In-Reply-To: <20260704195124.1375075-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blkg_conf_prep() open-codes the same parent walk and blkg creation that
blkg_lookup_create() already performs. Make blkg_lookup_create() report
whether the target blkg was created or found while still returning the
closest existing blkg on failure, then have blkg_conf_prep() use the
helper and treat errors as config failures.

This keeps the bio association path's closest-blkg fallback and removes
the duplicate config path loop.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 81 +++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 55 deletions(-)
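
Illustration only, not part of the patch: a minimal userspace sketch of
the calling convention described in the commit message, where the helper
returns 0 or a negative errno and always stores either the target or the
closest existing ancestor through an out parameter.  Assumptions:
struct node, obj_create(), lookup_create() and the alloc_budget knob are
hypothetical, a plain parent pointer replaces the blkcg hierarchy, and
locking and RCU are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	struct node *parent;	/* NULL for the root */
	void *obj;		/* per-node object, NULL until created */
};

/* Test knob: how many more creations are allowed to succeed. */
static int alloc_budget;

static void *obj_create(struct node *n)
{
	if (alloc_budget <= 0)
		return NULL;
	alloc_budget--;
	n->obj = malloc(1);
	return n->obj;
}

/*
 * On success, *objp is the target's object and 0 is returned.  On failure,
 * *objp is the closest existing ancestor's object (NULL if there is none)
 * and -ENOMEM is returned.
 */
static int lookup_create(struct node *target, void **objp)
{
	assert(objp != NULL);

	if (target->obj) {
		*objp = target->obj;
		return 0;
	}

	/* Create top-down so every new object has an existing parent. */
	for (;;) {
		struct node *pos = target;
		struct node *parent = target->parent;
		void *closest;

		/* Find the topmost node on the path that still lacks an object. */
		while (parent && !parent->obj) {
			pos = parent;
			parent = parent->parent;
		}
		closest = parent ? parent->obj : NULL;

		if (!obj_create(pos)) {
			*objp = closest;
			return -ENOMEM;
		}
		if (pos == target) {
			*objp = pos->obj;
			return 0;
		}
	}
}
```

A config-style caller would treat any non-zero return as a failure,
while an association-style caller could ignore the return value and use
whatever *objp holds, which is the split between the two callers that
the commit message describes.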

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ddc9073d7ab9..ae481bcde934 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -448,17 +448,19 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
  * blkg_lookup_create - lookup blkg, try to create one if not there
  * @blkcg: blkcg of interest
  * @disk: gendisk of interest
+ * @gfp_mask: allocation mask to use
+ * @blkgp: out parameter for the target blkg, or closest blkg on failure
  *
  * Lookup blkg for the @blkcg - @disk pair.  If it doesn't exist, try to
  * create one.  blkg creation is performed recursively from blkcg_root such
  * that all non-root blkg's have access to the parent blkg.  This function
  * must be called with @disk->queue->blkcg_mutex held.
  *
- * Returns the blkg or the closest blkg if blkg_create() fails as it walks
- * down from root.
+ * On success, *@blkgp points to the target blkg and 0 is returned.  On
+ * failure, *@blkgp points to the closest blkg and the errno is returned.
  */
-static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-		struct gendisk *disk)
+static int blkg_lookup_create(struct blkcg *blkcg, struct gendisk *disk,
+			      gfp_t gfp_mask, struct blkcg_gq **blkgp)
 {
 	struct request_queue *q = disk->queue;
 	struct blkcg_gq *blkg;
@@ -470,7 +472,8 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		    blkg != rcu_dereference(blkcg->blkg_hint))
 			rcu_assign_pointer(blkcg->blkg_hint, blkg);
 		rcu_read_unlock();
-		return blkg;
+		*blkgp = blkg;
+		return 0;
 	}
 	rcu_read_unlock();
 
@@ -497,16 +500,16 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		}
 		rcu_read_unlock();
 
-		blkg = blkg_create(pos, disk, GFP_NOIO);
+		blkg = blkg_create(pos, disk, gfp_mask);
 		if (IS_ERR(blkg)) {
-			blkg = ret_blkg;
-			break;
+			*blkgp = ret_blkg;
+			return PTR_ERR(blkg);
+		}
+		if (pos == blkcg) {
+			*blkgp = blkg;
+			return 0;
 		}
-		if (pos == blkcg)
-			break;
 	}
-
-	return blkg;
 }
 
 static void blkg_destroy(struct blkcg_gq *blkg)
@@ -839,46 +842,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		goto fail_unlock;
 	}
 
-	blkg = blkg_lookup(blkcg, q);
-	if (blkg)
-		goto success;
-
-	/*
-	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
-	 * non-root blkgs have access to their parents.
-	 */
-	while (true) {
-		struct blkcg *pos = blkcg;
-		struct blkcg *parent;
-
-		parent = blkcg_parent(blkcg);
-		rcu_read_lock();
-		while (parent && !blkg_lookup(parent, q)) {
-			pos = parent;
-			parent = blkcg_parent(parent);
-		}
-		rcu_read_unlock();
-
-		if (!blkcg_policy_enabled(q, pol)) {
-			ret = -EOPNOTSUPP;
-			goto fail_unlock;
-		}
-
-		rcu_read_lock();
-		blkg = blkg_lookup(pos, q);
-		rcu_read_unlock();
-		if (!blkg) {
-			blkg = blkg_create(pos, disk, GFP_NOIO);
-			if (IS_ERR(blkg)) {
-				ret = PTR_ERR(blkg);
-				goto fail_unlock;
-			}
-		}
+	ret = blkg_lookup_create(blkcg, disk, GFP_NOIO, &blkg);
+	if (ret)
+		goto fail_unlock;
 
-		if (pos == blkcg)
-			goto success;
-	}
-success:
 	ctx->blkg = blkg;
 	return 0;
 
@@ -2018,6 +1985,8 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 	if (blkg)
 		return blkg;
 	if (nowait) {
+		int ret;
+
 		/*
 		 * mutex_trylock() itself does not sleep, but mutexes still
 		 * follow task-context locking rules.  Keep atomic nowait callers
@@ -2026,9 +1995,11 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
 			return NULL;
 
-		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
-					  GFP_ATOMIC);
-		if (blkg)
+		ret = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
+					 GFP_ATOMIC, &blkg);
+		if (ret)
+			blkg = NULL;
+		else if (blkg)
 			blkg = blkg_lookup_tryget(blkg);
 		mutex_unlock(&q->blkcg_mutex);
 
@@ -2040,7 +2011,7 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 	 * time, hold lock to create new blkg.
 	 */
 	mutex_lock(&q->blkcg_mutex);
-	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+	blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk, GFP_NOIO, &blkg);
 	if (blkg)
 		blkg = blkg_lookup_tryget(blkg);
 	mutex_unlock(&q->blkcg_mutex);
-- 
2.51.0



* Re: [PATCH 0/7] rust: Use kernel style vertical imports in various drivers
From: Guru Das Srinagesh @ 2026-07-05  0:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Miguel Ojeda, rust-for-linux, linux-kernel, Danilo Krummrich,
	Abdiel Janulgue, Daniel Almeida, Robin Murphy, Andreas Hindborg,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Tamir Duberstein, Alexandre Courbot,
	Onur Özkan, Drew Fustini, Guo Ren, Fu Wei, Michal Wilczynski,
	Uwe Kleine-König, Rafael J. Wysocki, Viresh Kumar,
	Jens Axboe, FUJITA Tomonori, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Airlie, Simona Vetter, driver-core, linux-riscv, linux-pwm,
	linux-pm, linux-block, netdev, nova-gpu, dri-devel
In-Reply-To: <c89a1bc8-6cc1-4e16-ac95-add389e45a2b@lunn.ch>

On Mon, Jun 29, 2026 at 04:06:36PM +0200, Andrew Lunn wrote:
> On Sun, Jun 28, 2026 at 08:38:14PM -0700, Guru Das Srinagesh wrote:
> > Came across a recent commit bc58905eb07 ("samples: rust_misc_device: use
> > vertical import style") and found a few more locations that could
> > benefit from this cleanup. No functional changes.
> > 
> > Signed-off-by: Guru Das Srinagesh <linux@gurudas.dev>
> > ---
> > Guru Das Srinagesh (7):
> >       samples: rust_dma: use vertical import style
> >       pwm: th1520: use vertical import style
> >       cpufreq: rcpufreq_dt: use vertical import style
> >       block: rnull: use vertical import style
> >       net: phy: ax88796b: use vertical import style
> >       net: phy: qt2025: use vertical import style
> >       drm/nova: use vertical import style
> 
> You have multiple subsystems here, so you need to split this patch
> setup, per subsystem, and submit them separately. Maintainers only
> accept patchsets for their own subsystems.
> 
> For netdev, please take a read of:
> 
> https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html
> 
> You need to get the correct tree, and set the Subject: line correctly.
> 
>     Andrew

Hi Andrew,

Thanks for the feedback.

I was aware of the per-subsystem rule, but reasoned that since these changes are
purely about Rust import formatting, with no functional impact on any
subsystem, they might go through the rust-for-linux tree with acks from the
respective subsystem maintainers. The Rust coding style is independent of any
subsystem-specific guidelines.

Is that reasoning off-base, or is the right path to split these out per subsystem
regardless?

Miguel, could you please indicate if you have a preference here?

Thank you.

^ permalink raw reply

* Re: [PATCH 3/7] cpufreq: rcpufreq_dt: use vertical import style
From: Guru Das Srinagesh @ 2026-07-05  1:01 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Zhongqiu Han, Miguel Ojeda, rust-for-linux, linux-kernel,
	Danilo Krummrich, Abdiel Janulgue, Daniel Almeida, Robin Murphy,
	Andreas Hindborg, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Tamir Duberstein,
	Alexandre Courbot, Onur Özkan, Drew Fustini, Guo Ren, Fu Wei,
	Michal Wilczynski, Uwe Kleine-König, Rafael J. Wysocki,
	Viresh Kumar, Jens Axboe, FUJITA Tomonori, Andrew Lunn,
	Heiner Kallweit, Russell King, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Airlie, Simona Vetter,
	driver-core, linux-riscv, linux-pwm, linux-pm, linux-block,
	netdev, nova-gpu, dri-devel
In-Reply-To: <CANiq72=SQK7pd-fj+4MOb=E6=R-DHCcLcuCvN=us2E5o7Rcy2A@mail.gmail.com>

On Tue, Jun 30, 2026 at 10:35:00AM +0200, Miguel Ojeda wrote:
> On Mon, Jun 29, 2026 at 2:43 PM Zhongqiu Han
> <zhongqiu.han@oss.qualcomm.com> wrote:
> >
> > If the preferred style is to place each imported item on its own line,
> > shouldn't imports such as
> >
> >      cpu, cpufreq,
> >
> > be formatted similarly as well?
> 
> Indeed, good eyes.
> 
> To do what we want, `rustfmt` needs the `//` at the end of that level
> too (in the future, it will be without the `//`), i.e. the patch
> probably passes `rustfmtcheck`, but it still needs to split that line
> and add the other `//`.

Hi Zhongqiu, Miguel:

Yes, I did run `make LLVM=1 rustfmtcheck` and it passed on this series. I will fix
the missed ones in v2.

While investigating this, I found that adding the config `imports_layout =
"Vertical"` to the rustfmt options would fix all the imports automatically, including
the ones I missed. I ran it locally on the files touched in this series using rustfmt
nightly and it correctly fixed the imports as desired:

    rustup run nightly rustfmt --unstable-features \
      --config "imports_layout=Vertical" \
      --config-path .rustfmt.toml <file>

Unfortunately, since `imports_layout` is currently an unstable option [1], it
cannot be used right away.

However, .rustfmt.toml already has a section of commented-out unstable options kept
as a reference for when they stabilize. Would it make sense to add `#imports_layout =
"Vertical"` there? If so, I can include it in v2.

[1]: https://github.com/rust-lang/rustfmt/issues/3361
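
For illustration, the layout discussed above can be sketched as follows. This
is only a minimal example, assuming std paths in place of the kernel crate so
that it builds outside the kernel tree; the `summarize` helper is hypothetical
and exists only so that every import is used. The point is the trailing `//`
after the last item at each nesting level, which is what a line such as
`cpu, cpufreq,` is missing.

```rust
// Vertical import style: one item per line, with a trailing `//` after the
// last item at every nesting level so that stable rustfmt keeps the layout.
use std::{
    collections::{
        BTreeMap,
        HashMap, //
    },
    fmt::Write, //
};

/// Hypothetical helper: counts each word and renders the counts in sorted order.
fn summarize(words: &[&str]) -> String {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &word in words {
        *counts.entry(word).or_insert(0) += 1;
    }

    // Move the counts into a BTreeMap to get a deterministic, sorted order.
    let sorted: BTreeMap<&str, usize> = counts.into_iter().collect();

    let mut out = String::new();
    for (word, count) in &sorted {
        // Writing to a String cannot fail, so the unwrap is safe here.
        write!(out, "{word}={count};").unwrap();
    }
    out
}

fn main() {
    println!("{}", summarize(&["cpu", "cpufreq", "cpu"]));
}
```

With the unstable `imports_layout = "Vertical"` option the empty comments
should no longer be needed, which matches the "in the future, it will be
without the `//`" remark quoted above.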

^ permalink raw reply

* Re: [PATCH 1/7] samples: rust_dma: use vertical import style
From: Guru Das Srinagesh @ 2026-07-05  1:15 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Miguel Ojeda, rust-for-linux, linux-kernel, Abdiel Janulgue,
	Daniel Almeida, Robin Murphy, Andreas Hindborg, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Tamir Duberstein, Alexandre Courbot,
	Onur Özkan, Drew Fustini, Guo Ren, Fu Wei, Michal Wilczynski,
	Uwe Kleine-König, Rafael J. Wysocki, Viresh Kumar,
	Jens Axboe, FUJITA Tomonori, Andrew Lunn, Heiner Kallweit,
	Russell King, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Airlie, Simona Vetter, driver-core,
	linux-riscv, linux-pwm, linux-pm, linux-block, netdev, nova-gpu,
	dri-devel
In-Reply-To: <DJMAFSKM4K9W.19ED2PVPU6S3R@kernel.org>

On Tue, Jun 30, 2026 at 11:48:48AM +0200, Danilo Krummrich wrote:
> On Mon Jun 29, 2026 at 5:38 AM CEST, Guru Das Srinagesh wrote:
> >      page, pci,
> 
> Can you please also convert this one? Patch 7 also misses at least one case.

Will fix this and the missed ones in Patch 7 as well. Thanks for your review, Danilo.

^ permalink raw reply

* Re: [PATCH v2 1/5] nbd: simplify find_fallback() by removing redundant logic
From: yu kuai @ 2026-07-05  9:32 UTC (permalink / raw)
  To: Yang Erkun, josef, axboe, hch, yukuai
  Cc: yi.zhang, chengzhihao1, echo.chenlin, leo.lilong, wangkefeng.wang,
	linux-block, nbd
In-Reply-To: <20260625084458.4171890-2-yangerkun@huawei.com>

On 2026/6/25 16:44, Yang Erkun wrote:

> The second conditional checking nsock->fallback_index validity is the
> logical inverse of the first, so drop it and let execution fall through
> naturally. Consolidate the two identical dev_err_ratelimited() + return
> paths into a single no_fallback label to reduce duplication.
>
> Signed-off-by: Long Li<leo.lilong@huawei.com>
> ---
>   drivers/block/nbd.c | 37 ++++++++++++++-----------------------
>   1 file changed, 14 insertions(+), 23 deletions(-)
Reviewed-by: Yu Kuai <yukuai@fygo.io>

-- 
Thanks,
Kuai

^ permalink raw reply

* Re: [PATCH v2 2/5] nbd: replace socks pointer array with xarray
From: yu kuai @ 2026-07-05 10:02 UTC (permalink / raw)
  To: Yang Erkun, josef, axboe, hch, yukuai
  Cc: yi.zhang, chengzhihao1, echo.chenlin, leo.lilong, wangkefeng.wang,
	linux-block, nbd
In-Reply-To: <20260625084458.4171890-3-yangerkun@huawei.com>

Hi,

On 2026/6/25 16:44, Yang Erkun wrote:
> Replace the krealloc-based struct nbd_sock **socks array with struct
> xarray socks. Each nbd sock is fully initialized before being stored
> into the xarray via xa_store(), ensuring concurrent readers calling
> xa_load() never observe a partially initialized socket.
>
> Convert all array index accesses to xa_load() and open-coded for-loops
> to xa_for_each().
>
> Signed-off-by: Long Li<leo.lilong@huawei.com>

An xarray may not be a good idea for the IO hot path because of the overhead.

https://lore.kernel.org/all/60f9a88b-b750-3579-bdfd-5421f2040406@huaweicloud.com/

-- 
Thanks,
Kuai

^ permalink raw reply
