[RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex

* [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex
@ 2026-07-04 19:51 Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 01/17] nvme-multipath: retarget failedover bios from requeue work Yu Kuai
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

This RFC moves queue-local blkg topology synchronization from
q->queue_lock to q->blkcg_mutex.

q->queue_lock is a hot block-layer spinlock used by request queue runtime
paths, and it is also used in irq-disabled or otherwise atomic contexts.
Using it to protect blkg topology makes blkg lookup, creation,
destruction, policy activation, and policy-state walks inherit those atomic
locking constraints.  That forces awkward preallocation schemes such as
radix-tree preloading and prevents missing-blkg creation from sleeping,
even though blkg creation is a blkcg control-plane operation rather than a
queue dispatch fast-path operation.

q->blkcg_mutex is a better fit for blkg protection because it is already a
queue-local blkcg lock, it can serialize the full lookup/create/destroy and
policy activation path, and it allows allocation and parent lookup to run
from sleepable contexts.  Moving blkg topology under q->blkcg_mutex also
separates blkcg topology from queue runtime locking, reducing queue_lock
scope and making the locking rules for blkcg policy users explicit.

bio_set_dev() and bio allocation with a bdev can associate a bio with the
destination queue's blkg.  Once missing blkg creation is serialized by
q->blkcg_mutex, those helpers may sleep when they create a blkg.  The first
part of the series therefore audits callers that can reach these helpers
from completion, spinlocked, irq-disabled, GFP_NOWAIT, or other
non-blocking paths, and either moves association to process context or uses
a nowait association path that avoids sleeping.

The preparatory patches cover NVMe multipath requeue, dm-thin and
dm-snapshot map paths, blk-throttle's private runtime lock, atomic bio
allocation helpers, bcache, dm-bufio, dm-pcache, DM NOWAIT clones/remaps,
and BFQ's locked cgroup update path.  The final blkcg patches then move
blkg lookup/create/destroy, policy activation, and configuration
preparation to q->blkcg_mutex; remove radix-tree preloading; move blkg
allocation into blkg_create(); and share creation code between bio
association and config preparation.

This is an RFC because the locking conversion changes a central blkcg lifetime
path and relies on all non-sleepable bio association users either being
converted or tolerating nowait association failure.

One intentional tradeoff is left in the nowait paths.  They first try to
associate with an existing blkg.  If a thread issues IO to a queue for the
first time from a GFP_NOWAIT or otherwise non-blocking path, the cgroup's
blkg for that queue may not exist yet.  After blkg topology moves to
q->blkcg_mutex (patch 14), preemptible task-context callers trylock
q->blkcg_mutex and attempt blkg creation.  Once allocation moves into
blkg_create() (patch 16), that opportunistic nowait creation uses
GFP_ATOMIC.  If the caller is in atomic context, q->blkcg_mutex is
contended, or allocation fails, the nowait helper still fails and the
caller needs to retry from a blocking context, defer the association, or
fall back to an existing slow path.
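
The nowait flow described above can be sketched in userspace C.  This is
an illustration only, not code from the series: every sketch_* name is
hypothetical, an atomic_flag stands in for a trylock on q->blkcg_mutex, a
plain list walk stands in for the RCU lookup, and malloc() stands in for a
GFP_ATOMIC allocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are kernel types or APIs. */
struct sketch_blkg {
	int cgroup_id;
	struct sketch_blkg *next;
};

struct sketch_queue {
	atomic_flag blkcg_mutex_held;	/* set while "q->blkcg_mutex" is held */
	struct sketch_blkg *blkg_list;	/* models the queue's blkg topology */
};

static bool sketch_mutex_trylock(struct sketch_queue *q)
{
	/* test_and_set returns the old value: false means we took the lock */
	return !atomic_flag_test_and_set(&q->blkcg_mutex_held);
}

static void sketch_mutex_unlock(struct sketch_queue *q)
{
	atomic_flag_clear(&q->blkcg_mutex_held);
}

/* Lookup of an existing blkg; concurrent readers are not modeled here. */
static struct sketch_blkg *sketch_lookup(struct sketch_queue *q, int id)
{
	for (struct sketch_blkg *b = q->blkg_list; b; b = b->next)
		if (b->cgroup_id == id)
			return b;
	return NULL;
}

/*
 * Nowait association.  Returns the blkg, or NULL when the caller has to
 * retry from a blocking context, defer, or use its slow path.
 */
static struct sketch_blkg *sketch_associate_nowait(struct sketch_queue *q,
						   int id, bool atomic_ctx)
{
	struct sketch_blkg *blkg = sketch_lookup(q, id);

	if (blkg)
		return blkg;		/* existing blkg: nothing to create */
	if (atomic_ctx)
		return NULL;		/* cannot even try the mutex */
	if (!sketch_mutex_trylock(q))
		return NULL;		/* contended: do not wait */

	blkg = sketch_lookup(q, id);	/* recheck now that we hold the lock */
	if (!blkg) {
		blkg = malloc(sizeof(*blkg));	/* may fail, like GFP_ATOMIC */
		if (blkg) {
			blkg->cgroup_id = id;
			blkg->next = q->blkg_list;
			q->blkg_list = blkg;
		}
	}
	sketch_mutex_unlock(q);
	return blkg;
}
```

A caller that gets NULL back corresponds to the failure case in the
paragraph above: it is expected to retry from a blocking context, defer
the association, or use an existing slow path.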

Patch layout:

Patch 1: move NVMe multipath failover bio retargeting to requeue work so
bio_set_dev() runs from process context instead of completion context.

Patches 2-3: remove or avoid bio_set_dev() while dm-thin and dm-snapshot
locks are held, and restore blkcg association later where needed.

Patch 4: give blk-throttle its own runtime-state lock so blkcg topology
can be moved away from queue_lock.

Patches 5-7: add bio_alloc_atomic(), make bio association nowait-aware,
and make bio allocation with a bdev fail rather than sleep for
non-blocking callers.

Patches 8-12: convert bcache, dm-bufio, dm-pcache, block helper
allocations, and DM NOWAIT remaps/clones to the new nowait or deferred
association model.

Patch 13: avoid a sleeping blkg lookup from BFQ while bfqd->lock is held.

Patch 14: protect queue-local blkg lookup, creation, destruction, policy
activation, and policy state walks with q->blkcg_mutex.  This also makes
preemptible nowait bio association try q->blkcg_mutex instead of failing
immediately after an RCU lookup miss.

Patch 15: remove radix-tree preloading after blkg creation no longer runs
under queue_lock.

Patch 16: allocate blkgs inside blkg_create() and use GFP_ATOMIC for the
nowait bio-association trylock creation path.

Patch 17: share blkg creation between bio association and config
preparation.

Yu Kuai (17):
  nvme-multipath: retarget failedover bios from requeue work
  dm thin: avoid bio_set_dev under pool lock
  dm snapshot: avoid bio_set_dev in locked map paths
  blk-throttle: protect throttle state with td lock
  block: add bio_alloc_atomic() for atomic bio users
  blk-cgroup: support non-blocking bio association
  block: support non-blocking bio allocation with a bdev
  bcache: avoid sleeping blkg association from locked paths
  dm bufio: avoid blkg association from GFP_NOWAIT bio init
  dm pcache: handle non-blocking bio clone init failure
  block: avoid scheduling from non-blocking helper allocations
  dm: avoid sleeping blkg association from NOWAIT remaps
  bfq: avoid blkg lookup from locked cgroup update
  blk-cgroup: protect blkgs with blkcg_mutex
  blk-cgroup: remove blkg radix tree preloading
  blk-cgroup: allocate blkgs in blkg_create
  blk-cgroup: share blkg creation between lookup and config prep

 block/bfq-cgroup.c                 |  26 +-
 block/bio.c                        |  50 +++-
 block/blk-cgroup.c                 | 397 ++++++++++++-----------------
 block/blk-cgroup.h                 |  16 +-
 block/blk-crypto-fallback.c        |   2 +-
 block/blk-iocost.c                 |   5 +-
 block/blk-iolatency.c              |   7 +-
 block/blk-lib.c                    |   3 +-
 block/blk-map.c                    |   7 +-
 block/blk-throttle.c               |  93 +++++--
 drivers/md/bcache/journal.c        |   9 +-
 drivers/md/bcache/request.c        |   4 +-
 drivers/md/dm-bufio.c              |   9 +-
 drivers/md/dm-linear.c             |   2 +-
 drivers/md/dm-pcache/backing_dev.c |  10 +-
 drivers/md/dm-snap.c               |  29 ++-
 drivers/md/dm-stripe.c             |   6 +-
 drivers/md/dm-switch.c             |   2 +-
 drivers/md/dm-thin.c               |   3 -
 drivers/md/dm-unstripe.c           |   2 +-
 drivers/md/dm.c                    |  28 +-
 drivers/md/md.c                    |   2 +-
 drivers/nvdimm/nd_virtio.c         |  11 +-
 drivers/nvme/host/multipath.c      |   4 +-
 fs/gfs2/lops.c                     |   3 +-
 fs/ocfs2/cluster/heartbeat.c       |  15 +-
 include/linux/bio.h                |  53 ++--
 include/linux/device-mapper.h      |   8 +
 include/linux/writeback.h          |   2 +-
 mm/page_io.c                       |   2 +-
 30 files changed, 467 insertions(+), 343 deletions(-)

base-commit: a1c8bdbbd72564cebb0d02948c1ed57b80b2e773
-- 
2.51.0

* [RFC PATCH v1 01/17] nvme-multipath: retarget failedover bios from requeue work
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 02/17] dm thin: avoid bio_set_dev under pool lock Yu Kuai
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

bio_set_dev() is about to become explicitly sleepable because it can
associate the bio with a blkg for the destination queue.  NVMe failover
can run from request completion context, and nvme_failover_req() also holds
head->requeue_lock with interrupts disabled while it steals bios from the
failed request.  Calling bio_set_dev() there is not safe once the helper is
allowed to sleep.

The requeue lock only protects head->requeue_list.  Keep the list
manipulation under that lock, but defer retargeting to nvme_requeue_work(),
which already drains the list from process context before resubmitting each
bio.  The bios remain private to the requeue list until the worker pops
them, so moving the device switch there preserves the existing retry flow
while avoiding a sleepable helper in completion context.
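
The split between the locked list manipulation and the deferred
retargeting can be sketched in userspace C.  This is a minimal model under
stated assumptions, not the driver code: all sketch_* names are
hypothetical, a bool stands in for holding head->requeue_lock, and an int
stands in for the block device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the NVMe or block-layer types. */
struct sketch_bio {
	int dev;			/* models bio->bi_bdev */
	struct sketch_bio *next;	/* models bio->bi_next */
};

struct sketch_head {
	bool requeue_locked;		/* models holding head->requeue_lock */
	struct sketch_bio *requeue_list;/* models head->requeue_list */
	int head_dev;			/* models head->disk->part0 */
	int submitted;			/* bios handed back for submission */
};

/* Models a helper that may sleep: it must not run under the lock. */
static void sketch_set_dev(struct sketch_head *h, struct sketch_bio *bio)
{
	assert(!h->requeue_locked);
	bio->dev = h->head_dev;
}

/* Completion side: only append the failed request's bios to the list. */
static void sketch_failover(struct sketch_head *h, struct sketch_bio *bios)
{
	struct sketch_bio **tail = &h->requeue_list;

	h->requeue_locked = true;
	while (*tail)
		tail = &(*tail)->next;
	*tail = bios;			/* no retargeting while locked */
	h->requeue_locked = false;
}

/* Worker side: detach the list under the lock, retarget outside it. */
static void sketch_requeue_work(struct sketch_head *h)
{
	struct sketch_bio *bio, *next;

	h->requeue_locked = true;
	bio = h->requeue_list;
	h->requeue_list = NULL;
	h->requeue_locked = false;

	for (; bio; bio = next) {
		next = bio->next;
		bio->next = NULL;
		sketch_set_dev(h, bio);	/* safe: lock already dropped */
		h->submitted++;
	}
}
```

In this model the bios keep their old device while they sit on the requeue
list and are only switched once the worker has popped them, which mirrors
the ordering argued for above.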

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/nvme/host/multipath.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9b9a657fa330..76baa180ae1c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -149,7 +149,6 @@ void nvme_failover_req(struct request *req)
 	struct nvme_ns *ns = req->q->queuedata;
 	u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
 	unsigned long flags;
-	struct bio *bio;
 
 	nvme_mpath_clear_current_path(ns);
 	atomic_long_inc(&ns->failover);
@@ -165,8 +164,6 @@ void nvme_failover_req(struct request *req)
 	}
 
 	spin_lock_irqsave(&ns->head->requeue_lock, flags);
-	for (bio = req->bio; bio; bio = bio->bi_next)
-		bio_set_dev(bio, ns->head->disk->part0);
 	blk_steal_bios(&ns->head->requeue_list, req);
 	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
 
@@ -684,6 +681,7 @@ static void nvme_requeue_work(struct work_struct *work)
 		next = bio->bi_next;
 		bio->bi_next = NULL;
 
+		bio_set_dev(bio, head->disk->part0);
 		submit_bio_noacct(bio);
 	}
 }
-- 
2.51.0


* [RFC PATCH v1 02/17] dm thin: avoid bio_set_dev under pool lock
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 01/17] nvme-multipath: retarget failedover bios from requeue work Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 03/17] dm snapshot: avoid bio_set_dev in locked map paths Yu Kuai
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

bio_set_dev() is about to become explicitly sleepable because it can
associate the bio with a blkg for the destination queue.  pool_map()
calls bio_set_dev() while holding pool->lock with interrupts disabled,
which would be invalid once bio_set_dev() may sleep.

The lock is not needed in this map path.  The pool target is a singleton
mapping and pool_map() only reads pt->data_dev, which is a target-private
device reference acquired during construction and released during target
destruction.  It does not inspect or modify pool state protected by
pool->lock.

Remove the lock so the remap stays in the normal sleepable DM map context
while the data device pointer remains stable for the table lifetime.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-thin.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 59392de7a477..358ed77ffb2b 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3438,14 +3438,11 @@ static int pool_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 static int pool_map(struct dm_target *ti, struct bio *bio)
 {
 	struct pool_c *pt = ti->private;
-	struct pool *pool = pt->pool;
 
 	/*
 	 * As this is a singleton target, ti->begin is always zero.
 	 */
-	spin_lock_irq(&pool->lock);
 	bio_set_dev(bio, pt->data_dev->bdev);
-	spin_unlock_irq(&pool->lock);
 
 	return DM_MAPIO_REMAPPED;
 }
-- 
2.51.0


* [RFC PATCH v1 03/17] dm snapshot: avoid bio_set_dev in locked map paths
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 01/17] nvme-multipath: retarget failedover bios from requeue work Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 02/17] dm thin: avoid bio_set_dev under pool lock Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 04/17] blk-throttle: protect throttle state with td lock Yu Kuai
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

bio_set_dev() is about to become explicitly sleepable.  It currently
updates the bio's target device and then associates the bio with the
destination queue's blkcg state.  After blkcg lookup/creation is moved
under the queue's blkcg_mutex, that association may take blkcg_mutex and
allocate a new blkg.  Callers therefore must not invoke bio_set_dev() from
atomic or otherwise non-sleepable sections.

snapshot_map() has several remap decisions inside
dm_exception_table_lock(), which nests the completed and pending
exception hash-table spinlocks.  Those locks protect the lookup result,
pending-exception insertion, pe->started, and the pending bio lists until
the bio has either been returned to DM core or queued on the pending
exception.  Dropping the locks just to call bio_set_dev() would require
revalidating the exception state and preserving the pending-list ordering
rules; calling a sleepable bio_set_dev() while holding the spinlocks is not
allowed either.

Split out snapshot_bio_set_dev() for these locked remap decisions.  It only
performs the non-sleeping part of bio_set_dev(): clear BIO_REMAPPED, clear
BIO_BPS_THROTTLED when the bdev changes, and update bi_bdev.  It
deliberately does not associate the bio with a blkg while snapshot locks
are held.

This does not lose blkcg attribution for the normal DM_MAPIO_REMAPPED case.
After the target returns, DM core submits the mapped bio through
dm_submit_bio_remap(), and that helper clones the blkg association from the
original bio in the normal submission context.

Some snapshot bios are not submitted by DM core immediately.  Writes
waiting for a pending exception and bios queued during snapshot merge are
kept on snapshot-owned lists and submitted later after copy or merge
completion.  Once bio_set_dev() is no longer used in the locked path,
these delayed bios also need their blkcg association restored at submission
time.  Submit those bios through dm_submit_bio_remap() instead of
submit_bio_noacct() so the association is cloned from the original bio
after the snapshot locks have been released.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-snap.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 1489fda9d24a..373a94156ec7 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -192,6 +192,19 @@ static sector_t chunk_to_sector(struct dm_exception_store *store,
 	return chunk << store->chunk_shift;
 }

+/*
+ * Snapshot exception-table locks are spinlocks.  Only update the target
+ * device while holding them; dm_submit_bio_remap() will associate target-owned
+ * bios with the original bio's blkg from a sleepable submission context.
+ */
+static void snapshot_bio_set_dev(struct bio *bio, struct block_device *bdev)
+{
+	bio_clear_flag(bio, BIO_REMAPPED);
+	if (bio->bi_bdev != bdev)
+		bio_clear_flag(bio, BIO_BPS_THROTTLED);
+	bio->bi_bdev = bdev;
+}
+
 static int bdev_equal(struct block_device *lhs, struct block_device *rhs)
 {
 	/*
@@ -1566,7 +1579,7 @@ static void flush_bios(struct bio *bio)
 	while (bio) {
 		n = bio->bi_next;
 		bio->bi_next = NULL;
-		submit_bio_noacct(bio);
+		dm_submit_bio_remap(bio, NULL);
 		bio = n;
 	}
 }
@@ -1586,7 +1599,7 @@ static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio)
 		bio->bi_next = NULL;
 		r = do_origin(s->origin, bio, false);
 		if (r == DM_MAPIO_REMAPPED)
-			submit_bio_noacct(bio);
+			dm_submit_bio_remap(bio, NULL);
 		bio = n;
 	}
 }
@@ -1827,7 +1840,7 @@ static void start_full_bio(struct dm_snap_pending_exception *pe,
 	bio->bi_end_io = full_bio_end_io;
 	bio->bi_private = callback_data;

-	submit_bio_noacct(bio);
+	dm_submit_bio_remap(bio, NULL);
 }

 static struct dm_snap_pending_exception *
@@ -1898,7 +1911,7 @@ __find_pending_exception(struct dm_snapshot *s,
 static void remap_exception(struct dm_snapshot *s, struct dm_exception *e,
 			    struct bio *bio, chunk_t chunk)
 {
-	bio_set_dev(bio, s->cow->bdev);
+	snapshot_bio_set_dev(bio, s->cow->bdev);
 	bio->bi_iter.bi_sector =
 		chunk_to_sector(s->store, dm_chunk_number(e->new_chunk) +
 				(chunk - e->old_chunk)) +
@@ -1982,7 +1995,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 			 * defeat the goal of freeing space in origin that is
 			 * implied by the "discard_passdown_origin" feature)
 			 */
-			bio_set_dev(bio, s->origin->bdev);
+			snapshot_bio_set_dev(bio, s->origin->bdev);
 			track_chunk(s, bio, chunk);
 			goto out_unlock;
 		}
@@ -2081,7 +2094,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 			goto out;
 		}
 	} else {
-		bio_set_dev(bio, s->origin->bdev);
+		snapshot_bio_set_dev(bio, s->origin->bdev);
 		track_chunk(s, bio, chunk);
 	}

@@ -2143,7 +2156,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
 		    chunk >= s->first_merging_chunk &&
 		    chunk < (s->first_merging_chunk +
 			     s->num_merging_chunks)) {
-			bio_set_dev(bio, s->origin->bdev);
+			snapshot_bio_set_dev(bio, s->origin->bdev);
 			bio_list_add(&s->bios_queued_during_merge, bio);
 			r = DM_MAPIO_SUBMITTED;
 			goto out_unlock;
@@ -2157,7 +2170,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
 	}

 redirect_to_origin:
-	bio_set_dev(bio, s->origin->bdev);
+	snapshot_bio_set_dev(bio, s->origin->bdev);

 	if (bio_data_dir(bio) == WRITE) {
 		up_write(&s->lock);
-- 
2.51.0

* [RFC PATCH v1 04/17] blk-throttle: protect throttle state with td lock
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
                   ` (2 preceding siblings ...)
  2026-07-04 19:51 ` [RFC PATCH v1 03/17] dm snapshot: avoid bio_set_dev in locked map paths Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 05/17] block: add bio_alloc_atomic() for atomic bio users Yu Kuai
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

blk-throttle currently uses q->queue_lock for both blkcg topology and its
own runtime state. That coupling prevents blkg topology protection from
moving cleanly to q->blkcg_mutex.

Add a throttle-private spinlock and use it for throttle service queues,
pending timers, runtime counters and config updates. Keep queue_lock only
where the current intermediate code still walks blkcg topology.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-throttle.c | 87 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 67 insertions(+), 20 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ffc3b70065d4..7bca2805404f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -30,6 +30,9 @@ static struct workqueue_struct *kthrotld_workqueue;
 
 struct throtl_data
 {
+	/* protects throttle service queues and group runtime state */
+	spinlock_t lock;
+
 	/* service tree for active throtl groups */
 	struct throtl_service_queue service_queue;
 
@@ -346,11 +349,16 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
+	struct throtl_data *td = tg->td;
+	unsigned long flags;
+
+	spin_lock_irqsave(&td->lock, flags);
 	/*
 	 * We don't want new groups to escape the limits of its ancestors.
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
+	spin_unlock_irqrestore(&td->lock, flags);
 }
 
 static void tg_release(struct rcu_head *rcu)
@@ -368,7 +376,7 @@ static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
 
-	timer_delete_sync(&tg->service_queue.pending_timer);
+	timer_shutdown_sync(&tg->service_queue.pending_timer);
 	call_rcu(&pd->rcu_head, tg_release);
 }
 
@@ -1142,9 +1150,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
 	else
 		q = td->queue;
 
-	spin_lock_irq(&q->queue_lock);
+	spin_lock_irq(&td->lock);
 
-	if (!q->root_blkg)
+	if (!READ_ONCE(q->root_blkg))
 		goto out_unlock;
 
 again:
@@ -1168,9 +1176,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
 			break;
 
 		/* this dispatch windows is still open, relax and repeat */
-		spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&td->lock);
 		cpu_relax();
-		spin_lock_irq(&q->queue_lock);
+		spin_lock_irq(&td->lock);
 	}
 
 	if (!dispatched)
@@ -1193,7 +1201,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
 		queue_work(kthrotld_workqueue, &td->dispatch_work);
 	}
 out_unlock:
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&td->lock);
 }
 
 /**
@@ -1209,7 +1217,6 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 	struct throtl_data *td = container_of(work, struct throtl_data,
 					      dispatch_work);
 	struct throtl_service_queue *td_sq = &td->service_queue;
-	struct request_queue *q = td->queue;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
 	struct blk_plug plug;
@@ -1217,11 +1224,11 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 
 	bio_list_init(&bio_list_on_stack);
 
-	spin_lock_irq(&q->queue_lock);
+	spin_lock_irq(&td->lock);
 	for (rw = READ; rw <= WRITE; rw++)
 		while ((bio = throtl_pop_queued(td_sq, NULL, rw)))
 			bio_list_add(&bio_list_on_stack, bio);
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&td->lock);
 
 	if (!bio_list_empty(&bio_list_on_stack)) {
 		blk_start_plug(&plug);
@@ -1299,7 +1306,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	rcu_read_unlock();
 
 	/*
-	 * We're already holding queue_lock and know @tg is valid.  Let's
+	 * We're already holding td->lock and know @tg is valid.  Let's
 	 * apply the new config directly.
 	 *
 	 * Restart the slices for both READ and WRITES. It might happen
@@ -1327,6 +1334,7 @@ static int blk_throtl_init(struct gendisk *disk)
 		return -ENOMEM;
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
+	spin_lock_init(&td->lock);
 	throtl_service_queue_init(&td->service_queue);
 
 	memflags = blk_mq_freeze_queue(disk->queue);
@@ -1381,6 +1389,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	spin_lock_irq(&tg->td->lock);
 	tg_update_carryover(tg);
 
 	if (is_u64)
@@ -1389,6 +1398,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
 
 	tg_conf_updated(tg, false);
+	spin_unlock_irq(&tg->td->lock);
 	ret = 0;
 
 unprep:
@@ -1563,6 +1573,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		goto close_bdev;
 
 	tg = blkg_to_tg(ctx.blkg);
+	spin_lock_irq(&tg->td->lock);
 	tg_update_carryover(tg);
 
 	v[0] = tg->bps[READ];
@@ -1586,11 +1597,11 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		p = tok;
 		strsep(&p, "=");
 		if (!p || (sscanf(p, "%llu", &val) != 1 && strcmp(p, "max")))
-			goto unprep;
+			goto unlock;
 
 		ret = -ERANGE;
 		if (!val)
-			goto unprep;
+			goto unlock;
 
 		ret = -EINVAL;
 		if (!strcmp(tok, "rbps"))
@@ -1602,7 +1613,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		else if (!strcmp(tok, "wiops"))
 			v[3] = min_t(u64, val, UINT_MAX);
 		else
-			goto unprep;
+			goto unlock;
 	}
 
 	tg->bps[READ] = v[0];
@@ -1611,7 +1622,11 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	tg->iops[WRITE] = v[3];
 
 	tg_conf_updated(tg, false);
+	spin_unlock_irq(&tg->td->lock);
 	ret = 0;
+	goto unprep;
+unlock:
+	spin_unlock_irq(&tg->td->lock);
 unprep:
 	blkg_conf_unprep(&ctx);
 close_bdev:
@@ -1636,6 +1651,28 @@ static void throtl_shutdown_wq(struct request_queue *q)
 	cancel_work_sync(&td->dispatch_work);
 }
 
+static void throtl_shutdown_timers(struct request_queue *q)
+{
+	struct throtl_data *td = q->td;
+	struct blkcg_gq *blkg;
+
+	/*
+	 * blkg_destroy_all() has already offlined the policy, but blkg policy
+	 * data is freed asynchronously.  Shut down per-group timers before
+	 * freeing td, as their callbacks still dereference tg->td.
+	 */
+	mutex_lock(&q->blkcg_mutex);
+	list_for_each_entry(blkg, &q->blkg_list, q_node) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg)
+			timer_shutdown_sync(&tg->service_queue.pending_timer);
+	}
+	mutex_unlock(&q->blkcg_mutex);
+
+	timer_shutdown_sync(&td->service_queue.pending_timer);
+}
+
 static void tg_flush_bios(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1669,7 +1706,13 @@ static void tg_flush_bios(struct throtl_grp *tg)
 
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
-	tg_flush_bios(pd_to_tg(pd));
+	struct throtl_grp *tg = pd_to_tg(pd);
+	struct throtl_data *td = tg->td;
+	unsigned long flags;
+
+	spin_lock_irqsave(&td->lock, flags);
+	tg_flush_bios(tg);
+	spin_unlock_irqrestore(&td->lock, flags);
 }
 
 struct blkcg_policy blkcg_policy_throtl = {
@@ -1725,6 +1768,7 @@ static void tg_cancel_writeback_bios(struct throtl_grp *tg,
 void blk_throtl_cancel_bios(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
+	struct throtl_data *td = q->td;
 	struct cgroup_subsys_state *pos_css;
 	struct blkcg_gq *blkg;
 	struct bio_list cancel_bios[2] = { };
@@ -1734,6 +1778,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 		return;
 
 	spin_lock_irq(&q->queue_lock);
+	spin_lock(&td->lock);
 	/*
 	 * queue_lock is held, rcu lock is not needed here technically.
 	 * However, rcu lock is still held to emphasize that following
@@ -1752,6 +1797,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 		tg_cancel_writeback_bios(blkg_to_tg(blkg), cancel_bios);
 	}
 	rcu_read_unlock();
+	spin_unlock(&td->lock);
 	spin_unlock_irq(&q->queue_lock);
 
 	for (rw = READ; rw <= WRITE; rw++) {
@@ -1791,7 +1837,6 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
 
 bool __blk_throtl_bio(struct bio *bio)
 {
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	struct blkcg_gq *blkg = bio->bi_blkg;
 	struct throtl_qnode *qn = NULL;
 	struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1801,7 +1846,7 @@ bool __blk_throtl_bio(struct bio *bio)
 	struct throtl_data *td = tg->td;
 
 	rcu_read_lock();
-	spin_lock_irq(&q->queue_lock);
+	spin_lock_irq(&td->lock);
 	sq = &tg->service_queue;
 
 	while (true) {
@@ -1877,7 +1922,7 @@ bool __blk_throtl_bio(struct bio *bio)
 	}
 
 out_unlock:
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&td->lock);
 
 	rcu_read_unlock();
 	return throttled;
@@ -1886,17 +1931,19 @@ bool __blk_throtl_bio(struct bio *bio)
 void blk_throtl_exit(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
+	struct throtl_data *td = q->td;
 
 	/*
 	 * blkg_destroy_all() already deactivate throtl policy, just check and
 	 * free throtl data.
 	 */
-	if (!q->td)
+	if (!td)
 		return;
 
-	timer_delete_sync(&q->td->service_queue.pending_timer);
+	throtl_shutdown_timers(q);
 	throtl_shutdown_wq(q);
-	kfree(q->td);
+	q->td = NULL;
+	kfree(td);
 }
 
 static int __init throtl_init(void)
-- 
2.51.0


* [RFC PATCH v1 05/17] block: add bio_alloc_atomic() for atomic bio users
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
                   ` (3 preceding siblings ...)
  2026-07-04 19:51 ` [RFC PATCH v1 04/17] blk-throttle: protect throttle state with td lock Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 06/17] blk-cgroup: support non-blocking bio association Yu Kuai
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

Add bio_alloc_atomic() for callers that need a GFP_ATOMIC bio from the
default bio set but cannot safely pass a bdev during allocation. The
helper returns an unattached bio, leaving callers to set bi_bdev and
attach blkcg state explicitly before submission.

Use the helper for virtio-pmem flush child bios and OCFS2 heartbeat I/O.
Both allocate bios from atomic paths and must avoid creating missing blkgs
once blkg creation is protected by q->blkcg_mutex. virtio-pmem clones the
parent bio's blkg association; OCFS2 binds heartbeat I/O to the root blkg.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/nvdimm/nd_virtio.c   |  8 ++++----
 fs/ocfs2/cluster/heartbeat.c | 15 ++++++++++++---
 include/linux/bio.h          |  6 ++++++
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4176046627be..13d1ed1c466c 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -115,13 +115,13 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 	 * parent bio. Otherwise directly call nd_region flush.
 	 */
 	if (bio && bio->bi_iter.bi_sector != -1) {
-		struct bio *child = bio_alloc(bio->bi_bdev, 0,
-					      REQ_OP_WRITE | REQ_PREFLUSH,
-					      GFP_ATOMIC);
+		struct bio *child = bio_alloc_atomic(0,
+						REQ_OP_WRITE | REQ_PREFLUSH);
 
 		if (!child)
 			return -ENOMEM;
-		bio_clone_blkg_association(child, bio);
+		child->bi_bdev = bio->bi_bdev;
+			bio_clone_blkg_association(child, bio);
 		child->bi_iter.bi_sector = -1;
 		bio_chain(child, bio);
 		submit_bio(child);
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index d12784aaaa4b..ec70f3b62837 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -10,6 +10,7 @@
 #include <linux/module.h>
 #include <linux/fs.h>
 #include <linux/bio.h>
+#include <linux/blk-cgroup.h>
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/file.h>
@@ -519,16 +520,24 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 	struct bio *bio;
 	struct page *page;
 
-	/* Testing has shown this allocation to take long enough under
+	/*
+	 * Testing has shown this allocation to take long enough under
 	 * GFP_KERNEL that the local node can get fenced. It would be
 	 * nicest if we could pre-allocate these bios and avoid this
-	 * all together. */
-	bio = bio_alloc(reg_bdev(reg), 16, opf, GFP_ATOMIC);
+	 * all together.
+	 *
+	 * Use the atomic bio allocation helper so bio_init() does not create a
+	 * missing blkg. Heartbeat IO is cluster-liveness IO, so account it to
+	 * the root blkcg instead.
+	 */
+	bio = bio_alloc_atomic(16, opf);
 	if (!bio) {
 		mlog(ML_ERROR, "Could not alloc slots BIO!\n");
 		bio = ERR_PTR(-ENOMEM);
 		goto bail;
 	}
+	bio->bi_bdev = reg_bdev(reg);
+	bio_associate_blkg_from_css(bio, blkcg_root_css);
 
 	/* Must put everything in 512 byte sectors for the bio... */
 	bio->bi_iter.bi_sector = (reg->hr_start_block + cs) << (bits - 9);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8f33f717b14f..f7d94d37893f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -366,6 +366,12 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
 	return bio_alloc_bioset(bdev, nr_vecs, opf, gfp_mask, &fs_bio_set);
 }
 
+static inline struct bio *bio_alloc_atomic(unsigned short nr_vecs,
+					   blk_opf_t opf)
+{
+	return bio_alloc_bioset(NULL, nr_vecs, opf, GFP_ATOMIC, &fs_bio_set);
+}
+
 void submit_bio(struct bio *bio);
 
 extern void bio_endio(struct bio *);
-- 
2.51.0



* [RFC PATCH v1 06/17] blk-cgroup: support non-blocking bio association
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

Allow bio association helpers to be called from non-blocking paths by
returning whether the association succeeded and by taking a nowait argument.
The normal callers pass nowait=false and keep the existing behavior of
creating missing blkgs.

For nowait=true, the helper only succeeds when the needed blkg already
exists.  This lets callers set or clone a bio's bdev without entering the
sleepable missing-blkg creation path.
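
The nowait contract can be modelled in user space roughly as follows. This
is only a sketch: struct model_queue, struct model_bio and
model_associate() are hypothetical names, and a boolean flag stands in for
the real blkg lookup and creation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one queue's blkg state for a given cgroup. */
struct model_queue {
	bool blkg_exists;	/* true once the blkg has been created */
	int creations;		/* how many times the slow path ran */
};

/* Hypothetical stand-in for a bio; only the association is modelled. */
struct model_bio {
	struct model_queue *blkg;	/* NULL while unassociated */
};

/*
 * Model of the contract: with nowait=true the call succeeds only if the
 * blkg already exists; with nowait=false it may take the creation path,
 * which is the part that can sleep in the real kernel.
 */
static bool model_associate(struct model_bio *bio, struct model_queue *q,
			    bool nowait)
{
	assert(bio && q);

	bio->blkg = NULL;	/* drop any previous association first */

	if (!q->blkg_exists) {
		if (nowait)
			return false;	/* would have to sleep: fail */
		q->blkg_exists = true;	/* stands in for blkg creation */
		q->creations++;
	}

	bio->blkg = q;
	return true;
}
```

A caller that gets false back is left with an unassociated bio and has to
pick a fallback itself, for example failing the allocation or retrying
from a context that may sleep.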

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c           |  5 ++--
 block/bio.c                  |  6 ++---
 block/blk-cgroup.c           | 44 ++++++++++++++++++++++++---------
 block/blk-crypto-fallback.c  |  2 +-
 drivers/md/bcache/request.c  |  2 +-
 drivers/md/dm.c              |  2 +-
 drivers/md/md.c              |  2 +-
 drivers/nvdimm/nd_virtio.c   |  5 +++-
 fs/gfs2/lops.c               |  3 +--
 fs/ocfs2/cluster/heartbeat.c |  2 +-
 include/linux/bio.h          | 47 ++++++++++++++++++++++++------------
 include/linux/writeback.h    |  2 +-
 mm/page_io.c                 |  2 +-
 13 files changed, 82 insertions(+), 42 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index e82ff03bda02..5c2faf56c8ef 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -616,13 +616,14 @@ struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 		}
 		bfqg = blkg_to_bfqg(blkg);
 		if (bfqg->pd.online) {
-			bio_associate_blkg_from_css(bio, &blkg->blkcg->css);
+			bio_associate_blkg_from_css(bio, &blkg->blkcg->css, false);
 			return bfqg;
 		}
 		blkg = blkg->parent;
 	}
 	bio_associate_blkg_from_css(bio,
-				&bfqg_to_blkg(bfqd->root_group)->blkcg->css);
+				&bfqg_to_blkg(bfqd->root_group)->blkcg->css,
+				false);
 	return bfqd->root_group;
 }
 
diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a967..b74e9961c8ee 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -236,7 +236,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_blkg = NULL;
 	bio->issue_time_ns = 0;
 	if (bdev)
-		bio_associate_blkg(bio);
+		bio_associate_blkg(bio, false);
 #ifdef CONFIG_BLK_CGROUP_IOCOST
 	bio->bi_iocost_cost = 0;
 #endif
@@ -281,7 +281,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 	bio->bi_io_vec = bv;
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
-		bio_associate_blkg(bio);
+		bio_associate_blkg(bio, false);
 	bio->bi_opf = opf;
 }
 EXPORT_SYMBOL(bio_reset);
@@ -857,7 +857,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 		if (bio->bi_bdev == bio_src->bi_bdev &&
 		    bio_flagged(bio_src, BIO_REMAPPED))
 			bio_set_flag(bio, BIO_REMAPPED);
-		bio_clone_blkg_association(bio, bio_src);
+		bio_clone_blkg_association(bio, bio_src, false);
 	}
 
 	if (bio_crypt_clone(bio, bio_src, gfp) < 0)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2a1f5903f24..92846094043a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -2068,7 +2068,7 @@ static inline struct blkcg_gq *blkg_lookup_tryget(struct blkcg_gq *blkg)
  * up taking a reference on or %NULL if no reference was taken.
  */
 static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
-		struct cgroup_subsys_state *css)
+		struct cgroup_subsys_state *css, bool nowait)
 {
 	struct request_queue *q = bio->bi_bdev->bd_queue;
 	struct blkcg *blkcg = css_to_blkcg(css);
@@ -2110,18 +2110,30 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
  * A reference will be taken on the blkg and will be released when @bio is
  * freed.
  */
-void bio_associate_blkg_from_css(struct bio *bio,
-				 struct cgroup_subsys_state *css)
+bool bio_associate_blkg_from_css(struct bio *bio,
+		struct cgroup_subsys_state *css, bool nowait)
 {
-	if (bio->bi_blkg)
+	struct blkcg_gq *blkg;
+
+	if (!nowait)
+		might_sleep();
+
+	if (bio->bi_blkg) {
 		blkg_put(bio->bi_blkg);
+		bio->bi_blkg = NULL;
+	}
 
 	if (css && css->parent) {
-		bio->bi_blkg = blkg_tryget_closest(bio, css);
+		blkg = blkg_tryget_closest(bio, css, nowait);
+		if (!blkg)
+			return false;
+		bio->bi_blkg = blkg;
 	} else {
 		blkg_get(bdev_get_queue(bio->bi_bdev)->root_blkg);
 		bio->bi_blkg = bdev_get_queue(bio->bi_bdev)->root_blkg;
 	}
+
+	return true;
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
 
@@ -2134,16 +2146,19 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
  * already associated, the css is reused and association redone as the
  * request_queue may have changed.
  */
-void bio_associate_blkg(struct bio *bio)
+bool bio_associate_blkg(struct bio *bio, bool nowait)
 {
 	struct cgroup_subsys_state *css;
+	bool ret;
 
 	if (blk_op_is_passthrough(bio->bi_opf))
-		return;
+		return true;
+	if (!bio->bi_bdev)
+		return true;
 
 	if (bio->bi_blkg) {
 		css = bio_blkcg_css(bio);
-		bio_associate_blkg_from_css(bio, css);
+		return bio_associate_blkg_from_css(bio, css, nowait);
 	} else {
 		rcu_read_lock();
 		css = blkcg_css();
@@ -2151,9 +2166,10 @@ void bio_associate_blkg(struct bio *bio)
 			css = NULL;
 		rcu_read_unlock();
 
-		bio_associate_blkg_from_css(bio, css);
+		ret = bio_associate_blkg_from_css(bio, css, nowait);
 		if (css)
 			css_put(css);
+		return ret;
 	}
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg);
@@ -2163,10 +2179,14 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg);
  * @dst: destination bio
  * @src: source bio
  */
-void bio_clone_blkg_association(struct bio *dst, struct bio *src)
+bool bio_clone_blkg_association(struct bio *dst, struct bio *src, bool nowait)
 {
-	if (src->bi_blkg)
-		bio_associate_blkg_from_css(dst, bio_blkcg_css(src));
+	if (!src->bi_blkg)
+		return true;
+	if (!dst->bi_bdev)
+		return false;
+
+	return bio_associate_blkg_from_css(dst, bio_blkcg_css(src), nowait);
 }
 EXPORT_SYMBOL_GPL(bio_clone_blkg_association);
 
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 2a5c52ab74b4..b99470bee8b6 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -187,7 +187,7 @@ static struct bio *blk_crypto_alloc_enc_bio(struct bio *bio_src,
 	bio->bi_write_hint	= bio_src->bi_write_hint;
 	bio->bi_write_stream	= bio_src->bi_write_stream;
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
-	bio_clone_blkg_association(bio, bio_src);
+	bio_clone_blkg_association(bio, bio_src, false);
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3fa3b13a410f..c2b7a694ea99 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -848,7 +848,7 @@ static CLOSURE_CALLBACK(cached_dev_read_done)
 		s->iop.bio->bi_iter.bi_sector =
 			s->cache_miss->bi_iter.bi_sector;
 		s->iop.bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
-		bio_clone_blkg_association(s->iop.bio, s->cache_miss);
+		bio_clone_blkg_association(s->iop.bio, s->cache_miss, false);
 		bch_bio_map(s->iop.bio, NULL);
 
 		bio_copy_data(s->cache_miss, s->iop.bio);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7287bed6eb64..c54636235ffe 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1364,7 +1364,7 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	if (!tgt_clone)
 		tgt_clone = clone;
 
-	bio_clone_blkg_association(tgt_clone, io->orig_bio);
+	bio_clone_blkg_association(tgt_clone, io->orig_bio, false);
 
 	/*
 	 * Account io->origin_bio to DM dev on behalf of target
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d1465bcd86c8..d63c8841aaad 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9355,7 +9355,7 @@ void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 		return;
 
 	bio_chain(discard_bio, bio);
-	bio_clone_blkg_association(discard_bio, bio);
+	bio_clone_blkg_association(discard_bio, bio, false);
 	mddev_trace_remap(mddev, discard_bio, bio->bi_iter.bi_sector);
 	submit_bio_noacct(discard_bio);
 }
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 13d1ed1c466c..0391b41a4fce 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -121,7 +121,10 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 		if (!child)
 			return -ENOMEM;
 		child->bi_bdev = bio->bi_bdev;
-			bio_clone_blkg_association(child, bio);
+		if (!bio_clone_blkg_association(child, bio, true)) {
+			bio_put(child);
+			return -ENOMEM;
+		}
 		child->bi_iter.bi_sector = -1;
 		bio_chain(child, bio);
 		submit_bio(child);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 6dabe73ad790..ac45ccbde2a9 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -484,7 +484,7 @@ static struct bio *gfs2_chain_bio(struct bio *prev, unsigned int nr_iovecs,
 	struct bio *new;
 
 	new = bio_alloc(prev->bi_bdev, nr_iovecs, opf, GFP_NOIO);
-	bio_clone_blkg_association(new, prev);
+	bio_clone_blkg_association(new, prev, false);
 	new->bi_iter.bi_sector = sector;
 	bio_chain(new, prev);
 	submit_bio(prev);
@@ -1114,4 +1114,3 @@ const struct gfs2_log_operations *gfs2_log_ops[] = {
 	&gfs2_revoke_lops,
 	NULL,
 };
-
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index ec70f3b62837..eb7f30707092 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -537,7 +537,7 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 		goto bail;
 	}
 	bio->bi_bdev = reg_bdev(reg);
-	bio_associate_blkg_from_css(bio, blkcg_root_css);
+	bio_associate_blkg_from_css(bio, blkcg_root_css, true);
 
 	/* Must put everything in 512 byte sectors for the bio... */
 	bio->bi_iter.bi_sector = (reg->hr_start_block + cs) << (bits - 9);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f7d94d37893f..026df09a2546 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -508,19 +508,39 @@ static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 #define bio_dev(bio) \
 	disk_devt((bio)->bi_bdev->bd_disk)
 
+static inline void bio_set_dev_no_blkg(struct bio *bio,
+		struct block_device *bdev)
+{
+	bio_clear_flag(bio, BIO_REMAPPED);
+	if (bio->bi_bdev != bdev)
+		bio_clear_flag(bio, BIO_BPS_THROTTLED);
+	bio->bi_bdev = bdev;
+}
+
 #ifdef CONFIG_BLK_CGROUP
-void bio_associate_blkg(struct bio *bio);
-void bio_associate_blkg_from_css(struct bio *bio,
-				 struct cgroup_subsys_state *css);
-void bio_clone_blkg_association(struct bio *dst, struct bio *src);
+bool bio_associate_blkg(struct bio *bio, bool nowait);
+bool bio_associate_blkg_from_css(struct bio *bio,
+				 struct cgroup_subsys_state *css,
+				 bool nowait);
+bool bio_clone_blkg_association(struct bio *dst, struct bio *src,
+				bool nowait);
 void blkcg_punt_bio_submit(struct bio *bio);
 #else	/* CONFIG_BLK_CGROUP */
-static inline void bio_associate_blkg(struct bio *bio) { }
-static inline void bio_associate_blkg_from_css(struct bio *bio,
-					       struct cgroup_subsys_state *css)
-{ }
-static inline void bio_clone_blkg_association(struct bio *dst,
-					      struct bio *src) { }
+static inline bool bio_associate_blkg(struct bio *bio, bool nowait)
+{
+	return true;
+}
+static inline bool bio_associate_blkg_from_css(struct bio *bio,
+					       struct cgroup_subsys_state *css,
+					       bool nowait)
+{
+	return true;
+}
+static inline bool bio_clone_blkg_association(struct bio *dst,
+					      struct bio *src, bool nowait)
+{
+	return true;
+}
 static inline void blkcg_punt_bio_submit(struct bio *bio)
 {
 	submit_bio(bio);
@@ -529,11 +549,8 @@ static inline void blkcg_punt_bio_submit(struct bio *bio)
 
 static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
 {
-	bio_clear_flag(bio, BIO_REMAPPED);
-	if (bio->bi_bdev != bdev)
-		bio_clear_flag(bio, BIO_BPS_THROTTLED);
-	bio->bi_bdev = bdev;
-	bio_associate_blkg(bio);
+	bio_set_dev_no_blkg(bio, bdev);
+	bio_associate_blkg(bio, false);
 }
 
 /*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 62552a2ce5b9..8165536fbbb0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -262,7 +262,7 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
 	 * regular writeback instead of writing things out itself.
 	 */
 	if (wbc->wb)
-		bio_associate_blkg_from_css(bio, wbc->wb->blkcg_css);
+		bio_associate_blkg_from_css(bio, wbc->wb->blkcg_css, false);
 }
 
 void inode_switch_wbs_work_fn(struct work_struct *work);
diff --git a/mm/page_io.c b/mm/page_io.c
index c96d3e4cf872..48404f8604cb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -321,7 +321,7 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
 		css = NULL;
 	rcu_read_unlock();
 
-	bio_associate_blkg_from_css(bio, css);
+	bio_associate_blkg_from_css(bio, css, false);
 	if (css)
 		css_put(css);
 }
-- 
2.51.0



* [RFC PATCH v1 07/17] block: support non-blocking bio allocation with a bdev
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

bio_alloc_clone(), bio_init_clone(), and bio_alloc_bioset() can be called
with non-blocking GFP masks.  Passing a bdev into bio initialization may
need to associate blkcg state and, after missing blkg creation is serialized
by q->blkcg_mutex, that association can sleep.

Keep the generic block layer simple by letting bio_alloc_bioset() handle this
case directly.  Non-blocking allocations initialize the bio without a bdev,
set the bdev fields, and associate the blkg with nowait=true.  If the needed
blkg is missing and would have to be created, allocation fails normally so the
caller can retry from a blocking context.

Blocking callers keep the existing allocation-time association behavior.
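
A user-space sketch of that allocation policy, including the caller-side
retry, might look like this. All names here (MODEL_GFP_BLOCKING, struct
model_bdev, model_bio_alloc(), model_alloc_with_retry()) are hypothetical
and only model the behaviour described above; they are not kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_GFP_BLOCKING 0x1u	/* hypothetical flag: caller may sleep */

/* Hypothetical bdev: only tracks whether the needed blkg exists. */
struct model_bdev {
	bool blkg_exists;
};

struct model_bio {
	struct model_bdev *bdev;
	bool has_blkg;
};

static bool model_allow_blocking(unsigned int gfp)
{
	return (gfp & MODEL_GFP_BLOCKING) != 0;
}

/*
 * Blocking callers may create a missing blkg at allocation time.
 * Non-blocking callers only succeed if the blkg is already there.
 */
static struct model_bio *model_bio_alloc(struct model_bdev *bdev,
					 unsigned int gfp)
{
	struct model_bio *bio = calloc(1, sizeof(*bio));

	if (!bio)
		return NULL;
	bio->bdev = bdev;
	if (!bdev)
		return bio;	/* no bdev, nothing to associate */

	if (!bdev->blkg_exists) {
		if (!model_allow_blocking(gfp)) {
			free(bio);	/* fail like an ordinary allocation */
			return NULL;
		}
		bdev->blkg_exists = true;	/* stands in for creation */
	}
	bio->has_blkg = true;
	return bio;
}

/* Caller side: try without blocking, then retry where sleeping is fine. */
static struct model_bio *model_alloc_with_retry(struct model_bdev *bdev)
{
	struct model_bio *bio = model_bio_alloc(bdev, 0);

	if (!bio)
		bio = model_bio_alloc(bdev, MODEL_GFP_BLOCKING);
	return bio;
}
```

The point of the model is that a non-blocking caller sees the same NULL
return it already has to handle for memory pressure, so no new error path
is needed on the caller side.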

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bio.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b74e9961c8ee..863ae73a4222 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -259,6 +259,20 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 }
 EXPORT_SYMBOL(bio_init);
 
+static bool bio_init_nowait(struct bio *bio, struct block_device *bdev,
+		struct bio_vec *table, unsigned short max_vecs, blk_opf_t opf)
+{
+	bio_init(bio, NULL, table, max_vecs, opf);
+	if (bdev) {
+		bio_set_dev_no_blkg(bio, bdev);
+		if (bio_associate_blkg(bio, true))
+			return true;
+		bio_uninit(bio);
+		return false;
+	}
+	return true;
+}
+
 /**
  * bio_reset - reinitialize a bio
  * @bio:	bio to reset
@@ -599,12 +613,25 @@ struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs,
 		}
 	}
 
-	if (nr_vecs && nr_vecs <= BIO_INLINE_VECS)
-		bio_init_inline(bio, bdev, nr_vecs, opf);
-	else
-		bio_init(bio, bdev, bvecs, nr_vecs, opf);
+	if (nr_vecs && nr_vecs <= BIO_INLINE_VECS) {
+		bvecs = bio_inline_vecs(bio);
+		if (gfpflags_allow_blocking(saved_gfp))
+			bio_init(bio, bdev, bvecs, nr_vecs, opf);
+		else if (!bio_init_nowait(bio, bdev, bvecs, nr_vecs, opf))
+			goto fail_free_bio;
+	} else {
+		if (gfpflags_allow_blocking(saved_gfp))
+			bio_init(bio, bdev, bvecs, nr_vecs, opf);
+		else if (!bio_init_nowait(bio, bdev, bvecs, nr_vecs, opf))
+			goto fail_free_bio;
+	}
 	bio->bi_pool = bs;
 	return bio;
+
+fail_free_bio:
+	bio->bi_pool = bs;
+	bio_put(bio);
+	return NULL;
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
 
@@ -857,7 +884,9 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 		if (bio->bi_bdev == bio_src->bi_bdev &&
 		    bio_flagged(bio_src, BIO_REMAPPED))
 			bio_set_flag(bio, BIO_REMAPPED);
-		bio_clone_blkg_association(bio, bio_src, false);
+		if (!bio_clone_blkg_association(bio, bio_src,
+					!gfpflags_allow_blocking(gfp)))
+			return -ENOMEM;
 	}
 
 	if (bio_crypt_clone(bio, bio_src, gfp) < 0)
@@ -913,9 +942,14 @@ EXPORT_SYMBOL(bio_alloc_clone);
 int bio_init_clone(struct block_device *bdev, struct bio *bio,
 		struct bio *bio_src, gfp_t gfp)
 {
+	bool blocking = gfpflags_allow_blocking(gfp);
 	int ret;
 
-	bio_init(bio, bdev, bio_src->bi_io_vec, 0, bio_src->bi_opf);
+	if (blocking)
+		bio_init(bio, bdev, bio_src->bi_io_vec, 0, bio_src->bi_opf);
+	else if (!bio_init_nowait(bio, bdev, bio_src->bi_io_vec, 0,
+				bio_src->bi_opf))
+		return -ENOMEM;
 	ret = __bio_clone(bio, bio_src, gfp);
 	if (ret)
 		bio_uninit(bio);
-- 
2.51.0



* [RFC PATCH v1 08/17] bcache: avoid sleeping blkg association from locked paths
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

cached_dev_cache_miss() allocates cache_bio with GFP_NOWAIT.  Once blkg
lookup is protected by q->blkcg_mutex, passing a bdev to bio_alloc_bioset()
attaches blkcg state and may sleep to create a missing blkg.

Use the nowait bio allocation/association path.  If the cache bio needs a
missing blkg to be created, fail the association and fall back to the existing
miss submission path.

journal_write_unlocked() also resets journal bios while holding the journal
spinlock.  Reset those bios without a bdev, set bi_bdev while still under the
lock, and associate blkcg after dropping the lock.
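
The journal ordering can be sketched in user space as below. This is a
model only: struct model_journal, struct model_bio and the model_*()
functions are hypothetical, and a boolean replaces the real spinlock so
the "never associate under the lock" rule can be checked with an assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_BIOS 3

struct model_bio {
	int bdev_id;
	bool has_blkg;
};

struct model_journal {
	bool locked;	/* models holding the journal spinlock */
	struct model_bio bios[MODEL_NR_BIOS];
};

/* Possibly-sleeping step: must never run while the lock is held. */
static void model_associate(struct model_journal *j, struct model_bio *bio)
{
	assert(!j->locked);
	bio->has_blkg = true;
}

/* Returns the number of bios that were associated and "submitted". */
static int model_journal_write(struct model_journal *j, int bdev_id)
{
	struct model_bio *list[MODEL_NR_BIOS];
	size_t n = 0, i;
	int submitted = 0;

	j->locked = true;
	for (i = 0; i < MODEL_NR_BIOS; i++) {
		struct model_bio *bio = &j->bios[i];

		bio->has_blkg = false;	/* reset without a bdev */
		bio->bdev_id = bdev_id;	/* plain store, cannot sleep */
		list[n++] = bio;
	}
	j->locked = false;

	/* Association is deferred until the lock has been dropped. */
	for (i = 0; i < n; i++) {
		model_associate(j, list[i]);
		submitted++;
	}
	return submitted;
}
```

Setting the bdev under the lock is just a pointer store, so only the step
that might need blkg creation has to move past the unlock.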

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/bcache/journal.c | 9 ++++++---
 drivers/md/bcache/request.c | 2 ++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 144693b7c46a..49d2fb9a5f20 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -714,8 +714,9 @@ static CLOSURE_CALLBACK(journal_write_unlocked)
 
 		atomic_long_add(sectors, &ca->meta_sectors_written);
 
-		bio_reset(bio, ca->bdev, REQ_OP_WRITE | 
-			  REQ_SYNC | REQ_META | REQ_PREFLUSH | REQ_FUA);
+		bio_reset(bio, NULL, REQ_OP_WRITE | REQ_SYNC | REQ_META |
+			  REQ_PREFLUSH | REQ_FUA);
+		bio->bi_bdev = ca->bdev;
 		bio->bi_iter.bi_sector	= PTR_OFFSET(k, i);
 		bio->bi_iter.bi_size = sectors << 9;
 
@@ -740,8 +741,10 @@ static CLOSURE_CALLBACK(journal_write_unlocked)
 
 	spin_unlock(&c->journal.lock);
 
-	while ((bio = bio_list_pop(&list)))
+	while ((bio = bio_list_pop(&list))) {
+		bio_associate_blkg(bio, false);
 		closure_bio_submit(c, bio, cl);
+	}
 
 	continue_at(cl, journal_write_done, NULL);
 }
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c2b7a694ea99..647ca5018d07 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -932,6 +932,8 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	if (!cache_bio)
 		goto out_submit;
 
+	if (!bio_clone_blkg_association(cache_bio, miss, true))
+		goto out_put;
 	cache_bio->bi_iter.bi_sector	= miss->bi_iter.bi_sector;
 	cache_bio->bi_iter.bi_size	= s->insert_bio_sectors << 9;
 
-- 
2.51.0



* [RFC PATCH v1 09/17] dm bufio: avoid blkg association from GFP_NOWAIT bio init
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

dm-bufio allocates a bio with bio_kmalloc(GFP_NOWAIT) and then initializes it
with the target bdev.  That initialization can attach blkcg state and sleep to
create a missing blkg once blkg lookup is protected by q->blkcg_mutex.

Initialize the bio without a bdev, set the bdev fields, and associate blkcg
with nowait=true.  Fall back to dm_io if a missing blkg would need to be
created.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-bufio.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 26fedf5883ef..2002d9020dd6 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1347,7 +1347,14 @@ static void use_bio(struct dm_buffer *b, enum req_op op, sector_t sector,
 		use_dmio(b, op, sector, n_sectors, offset, ioprio);
 		return;
 	}
-	bio_init_inline(bio, b->c->bdev, 1, op);
+	bio_init_inline(bio, NULL, 1, op);
+	bio_set_dev_no_blkg(bio, b->c->bdev);
+	if (!bio_associate_blkg(bio, true)) {
+		bio_uninit(bio);
+		kfree(bio);
+		use_dmio(b, op, sector, n_sectors, offset, ioprio);
+		return;
+	}
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = bio_complete;
 	bio->bi_private = b;
-- 
2.51.0



* [RFC PATCH v1 10/17] dm pcache: handle non-blocking bio clone init failure
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

dm-pcache may preallocate backing requests with GFP_NOWAIT and initialize
the embedded bio with bio_init_clone().  Non-blocking clone initialization
can now fail if cloning the blkg association would need to create a blkg.

Check the return value and free the preallocated request on failure so the
existing caller can retry through its GFP_NOIO preallocation path.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-pcache/backing_dev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-pcache/backing_dev.c b/drivers/md/dm-pcache/backing_dev.c
index 7165fc0364bb..5bde289ec5d7 100644
--- a/drivers/md/dm-pcache/backing_dev.c
+++ b/drivers/md/dm-pcache/backing_dev.c
@@ -204,6 +204,7 @@ static struct pcache_backing_dev_req *req_type_req_alloc(struct pcache_backing_d
 	struct pcache_request *pcache_req = opts->req.upper_req;
 	struct pcache_backing_dev_req *backing_req;
 	struct bio *orig = pcache_req->bio;
+	int ret;
 
 	backing_req = mempool_alloc(&backing_dev->req_pool, opts->gfp_mask);
 	if (!backing_req)
@@ -211,13 +212,20 @@ static struct pcache_backing_dev_req *req_type_req_alloc(struct pcache_backing_d
 
 	memset(backing_req, 0, sizeof(struct pcache_backing_dev_req));
 
-	bio_init_clone(backing_dev->dm_dev->bdev, &backing_req->bio, orig, opts->gfp_mask);
+	ret = bio_init_clone(backing_dev->dm_dev->bdev, &backing_req->bio,
+			     orig, opts->gfp_mask);
+	if (ret)
+		goto free_backing_req;
 
 	backing_req->type = BACKING_DEV_REQ_TYPE_REQ;
 	backing_req->backing_dev = backing_dev;
 	atomic_inc(&backing_dev->inflight_reqs);
 
 	return backing_req;
+
+free_backing_req:
+	mempool_free(backing_req, &backing_dev->req_pool);
+	return NULL;
 }
 
 static struct pcache_backing_dev_req *kmem_type_req_alloc(struct pcache_backing_dev *backing_dev,
-- 
2.51.0



* [RFC PATCH v1 11/17] block: avoid scheduling from non-blocking helper allocations
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

blk_alloc_discard_bio() and blk_rq_map_bio_alloc() can be used with
non-blocking GFP masks.  Their bio allocation now handles bdev association in
nowait mode, so the helpers can pass the target bdev directly and avoid local
open-coded association paths.

The discard helper can also be reached from io_uring with GFP_NOWAIT.  Keep
its long-loop cond_resched() only for blocking callers.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-lib.c | 3 ++-
 block/blk-map.c | 7 +------
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 688bc67cbf73..b5645f8f69b6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -56,7 +56,8 @@ struct bio *blk_alloc_discard_bio(struct block_device *bdev,
 	 * discards (like mkfs).  Be nice and allow us to schedule out to avoid
 	 * softlocking if preempt is disabled.
 	 */
-	cond_resched();
+	if (gfpflags_allow_blocking(gfp_mask))
+		cond_resched();
 	return bio;
 }
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97..75c7b864c15a 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -46,14 +46,9 @@ static struct bio *blk_rq_map_bio_alloc(struct request *rq,
 		unsigned int nr_vecs, gfp_t gfp_mask)
 {
 	struct block_device *bdev = rq->q->disk ? rq->q->disk->part0 : NULL;
-	struct bio *bio;
 
-	bio = bio_alloc_bioset(bdev, nr_vecs, rq->cmd_flags, gfp_mask,
+	return bio_alloc_bioset(bdev, nr_vecs, rq->cmd_flags, gfp_mask,
 				&fs_bio_set);
-	if (!bio)
-		return NULL;
-
-	return bio;
 }
 
 /**
-- 
2.51.0



* [RFC PATCH v1 12/17] dm: avoid sleeping blkg association from NOWAIT remaps
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

DM allocates normal NOWAIT target clones with GFP_NOWAIT.  Targets that set
needs_bio_set_dev can therefore make alloc_tio() associate blkcg state from a
non-blocking allocation path, which may sleep while creating a missing blkg
after blkg lookup is protected by q->blkcg_mutex.

Set the default bdev without blkcg association first, then associate blkcg
with nowait=true for non-blocking allocations.  If a blkg would need creating,
fail the NOWAIT allocation with BLK_STS_AGAIN.

Targets that advertise DM_TARGET_NOWAIT may also remap bios in their map
functions.  For NOWAIT bios those remaps update only the bdev; DM submission
then clones the original bio's blkg association with nowait=true before
submitting to the lower device.  If that would need to sleep, complete the
clone with BLK_STS_AGAIN.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 drivers/md/dm-linear.c        |  2 +-
 drivers/md/dm-stripe.c        |  6 +++---
 drivers/md/dm-switch.c        |  2 +-
 drivers/md/dm-unstripe.c      |  2 +-
 drivers/md/dm.c               | 28 +++++++++++++++++++++++++---
 include/linux/device-mapper.h |  8 ++++++++
 6 files changed, 39 insertions(+), 9 deletions(-)
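
Note for readers: the two-step pattern described in the commit message
(set the device first, then attempt a non-blocking association and report
"again" if that would have to create state) can be sketched in user-space
C.  This is only an illustrative model, not the kernel code: every
sketch_* name, the struct layouts and the status enum are hypothetical
stand-ins for bio, request_queue and BLK_STS_AGAIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
enum sketch_status { SKETCH_STS_OK, SKETCH_STS_AGAIN };

struct sketch_queue {
	int blkg;          /* placeholder object standing in for a blkg */
	bool blkg_exists;  /* false until a blocking caller creates it */
};

struct sketch_bio {
	int dev;           /* stands in for bio->bi_bdev */
	const int *blkg;   /* stands in for bio->bi_blkg, NULL if unset */
	bool nowait;       /* stands in for REQ_NOWAIT in bio->bi_opf */
};

/* Associate @bio with the queue's blkg; refuse instead of "sleeping". */
bool sketch_associate_blkg(struct sketch_bio *bio, struct sketch_queue *q,
			   bool nowait)
{
	assert(bio != NULL && q != NULL);

	if (!q->blkg_exists) {
		if (nowait)
			return false;	/* creation might sleep: fail fast */
		q->blkg_exists = true;	/* a blocking caller may create it */
	}
	bio->blkg = &q->blkg;
	return true;
}

/* Remap: update the device only, then associate as a separate step. */
enum sketch_status sketch_remap(struct sketch_bio *bio,
				struct sketch_queue *q, int dev)
{
	bio->dev = dev;
	bio->blkg = NULL;
	if (!sketch_associate_blkg(bio, q, bio->nowait))
		return SKETCH_STS_AGAIN;
	return SKETCH_STS_OK;
}
```

In this model a nowait bio that hits a queue without a blkg gets
SKETCH_STS_AGAIN with its device already updated, while a blocking bio
creates the blkg so that later nowait bios succeed.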

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 38c17846deb0..f75a372acd20 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -90,7 +90,7 @@ int linear_map(struct dm_target *ti, struct bio *bio)
 {
 	struct linear_c *lc = ti->private;
 
-	bio_set_dev(bio, lc->dev->bdev);
+	dm_bio_set_dev(bio, lc->dev->bdev);
 	bio->bi_iter.bi_sector = linear_map_sector(ti, bio->bi_iter.bi_sector);
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 750865fd3ae7..73f9483a3e8a 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -257,7 +257,7 @@ static int stripe_map_range(struct stripe_c *sc, struct bio *bio,
 	stripe_map_range_sector(sc, bio_end_sector(bio),
 				target_stripe, &end);
 	if (begin < end) {
-		bio_set_dev(bio, sc->stripe[target_stripe].dev->bdev);
+		dm_bio_set_dev(bio, sc->stripe[target_stripe].dev->bdev);
 		bio->bi_iter.bi_sector = begin +
 			sc->stripe[target_stripe].physical_start;
 		bio->bi_iter.bi_size = to_bytes(end - begin);
@@ -278,7 +278,7 @@ int stripe_map(struct dm_target *ti, struct bio *bio)
 	if (bio->bi_opf & REQ_PREFLUSH) {
 		target_bio_nr = dm_bio_get_target_bio_nr(bio);
 		BUG_ON(target_bio_nr >= sc->stripes);
-		bio_set_dev(bio, sc->stripe[target_bio_nr].dev->bdev);
+		dm_bio_set_dev(bio, sc->stripe[target_bio_nr].dev->bdev);
 		return DM_MAPIO_REMAPPED;
 	}
 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD) ||
@@ -293,7 +293,7 @@ int stripe_map(struct dm_target *ti, struct bio *bio)
 			  &stripe, &bio->bi_iter.bi_sector);
 
 	bio->bi_iter.bi_sector += sc->stripe[stripe].physical_start;
-	bio_set_dev(bio, sc->stripe[stripe].dev->bdev);
+	dm_bio_set_dev(bio, sc->stripe[stripe].dev->bdev);
 
 	return DM_MAPIO_REMAPPED;
 }
diff --git a/drivers/md/dm-switch.c b/drivers/md/dm-switch.c
index 5952f02de1e6..9eea6c263eed 100644
--- a/drivers/md/dm-switch.c
+++ b/drivers/md/dm-switch.c
@@ -323,7 +323,7 @@ static int switch_map(struct dm_target *ti, struct bio *bio)
 	sector_t offset = dm_target_offset(ti, bio->bi_iter.bi_sector);
 	unsigned int path_nr = switch_get_path_nr(sctx, offset);
 
-	bio_set_dev(bio, sctx->path_list[path_nr].dmdev->bdev);
+	dm_bio_set_dev(bio, sctx->path_list[path_nr].dmdev->bdev);
 	bio->bi_iter.bi_sector = sctx->path_list[path_nr].start + offset;
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c
index bfcbe6bfa71a..900b1ac88bc8 100644
--- a/drivers/md/dm-unstripe.c
+++ b/drivers/md/dm-unstripe.c
@@ -136,7 +136,7 @@ static int unstripe_map(struct dm_target *ti, struct bio *bio)
 {
 	struct unstripe_c *uc = ti->private;
 
-	bio_set_dev(bio, uc->dev->bdev);
+	dm_bio_set_dev(bio, uc->dev->bdev);
 	bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
 
 	return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c54636235ffe..6dde3c699122 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -610,6 +610,8 @@ static void free_io(struct dm_io *io)
 	bio_put(&io->tio.clone);
 }
 
+static void free_tio(struct bio *clone);
+
 static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 			     unsigned int target_bio_nr, unsigned int *len, gfp_t gfp_mask)
 {
@@ -644,8 +646,12 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 
 	/* Set default bdev, but target must bio_set_dev() before issuing IO */
 	clone->bi_bdev = md->disk->part0;
-	if (likely(ti != NULL) && unlikely(ti->needs_bio_set_dev))
-		bio_set_dev(clone, md->disk->part0);
+	if (likely(ti != NULL) && unlikely(ti->needs_bio_set_dev)) {
+		bio_set_dev_no_blkg(clone, md->disk->part0);
+		if (!bio_associate_blkg(clone,
+				!gfpflags_allow_blocking(gfp_mask)))
+			goto fail;
+	}
 
 	if (len) {
 		clone->bi_iter.bi_size = to_bytes(*len);
@@ -654,6 +660,14 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
 	}
 
 	return clone;
+
+fail:
+	if (dm_tio_flagged(clone_to_tio(clone), DM_TIO_INSIDE_DM_IO)) {
+		clone->bi_bdev = NULL;
+		clone_to_tio(clone)->io = NULL;
+	}
+	free_tio(clone);
+	return NULL;
 }
 
 static void free_tio(struct bio *clone)
@@ -1364,7 +1378,15 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	if (!tgt_clone)
 		tgt_clone = clone;
 
-	bio_clone_blkg_association(tgt_clone, io->orig_bio, false);
+	if (tgt_clone->bi_opf & REQ_NOWAIT) {
+		if (!bio_clone_blkg_association(tgt_clone, io->orig_bio, true)) {
+			tgt_clone->bi_status = BLK_STS_AGAIN;
+			tgt_clone->bi_end_io(tgt_clone);
+			return;
+		}
+	} else {
+		bio_clone_blkg_association(tgt_clone, io->orig_bio, false);
+	}
 
 	/*
 	 * Account io->origin_bio to DM dev on behalf of target
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index cd4faaf5d427..ca1e1cfee74f 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -713,6 +713,14 @@ module_exit(dm_##name##_exit)
 #define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL		4
 
+static inline void dm_bio_set_dev(struct bio *bio, struct block_device *bdev)
+{
+	if (bio->bi_opf & REQ_NOWAIT)
+		bio_set_dev_no_blkg(bio, bdev);
+	else
+		bio_set_dev(bio, bdev);
+}
+
 #define dm_sector_div64(x, y)( \
 { \
 	u64 _res; \
-- 
2.51.0



* [RFC PATCH v1 13/17] bfq: avoid blkg lookup from locked cgroup update
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

bfq_bio_bfqg() is called while bfqd->lock is held from the merge and
request insertion paths. It walks bio->bi_blkg and its parent chain to
find the closest online BFQ group, and also updates bio->bi_blkg when
the original association points at an offline or otherwise unusable
blkg.

After missing blkg creation is protected by q->blkcg_mutex,
bio_associate_blkg_from_css() can sleep on a lookup miss, so BFQ must not
call it while holding the bfqd->lock spinlock. The blkg BFQ wants is
already known from the existing bio->bi_blkg ancestry walk, so update
bio->bi_blkg by taking a reference on that blkg and dropping the old one
instead of looking it up again by css.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
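
Note for readers: the reference swap described in the commit message can
be modelled in user-space C as below.  This is a sketch under stated
assumptions, not the BFQ code: the sketch_* types and the plain integer
refcount are hypothetical stand-ins for blkcg_gq, blkg_get() and
blkg_put(), and the "online" flag stands in for bfqg->pd.online.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a plain counter replaces the percpu refcount. */
struct sketch_blkg {
	int refcnt;
	bool online;
	struct sketch_blkg *parent;
};

struct sketch_bio {
	struct sketch_blkg *blkg;	/* the bio holds one reference */
};

void sketch_blkg_get(struct sketch_blkg *g) { g->refcnt++; }
void sketch_blkg_put(struct sketch_blkg *g) { g->refcnt--; }

/* Point @bio at @blkg: take the new reference before dropping the old. */
void sketch_bio_update_blkg(struct sketch_bio *bio, struct sketch_blkg *blkg)
{
	assert(bio->blkg != NULL && blkg != NULL);

	if (bio->blkg == blkg)
		return;

	sketch_blkg_get(blkg);
	sketch_blkg_put(bio->blkg);
	bio->blkg = blkg;
}

/* Walk up from the bio's blkg to the closest online one, else @root. */
struct sketch_blkg *sketch_closest_online(struct sketch_bio *bio,
					  struct sketch_blkg *root)
{
	struct sketch_blkg *g = bio->blkg;

	while (g) {
		if (g->online) {
			sketch_bio_update_blkg(bio, g);
			return g;
		}
		g = g->parent;
	}
	sketch_bio_update_blkg(bio, root);
	return root;
}
```

The walk never performs a lookup or an allocation, which is what makes it
safe to run in a context that must not sleep.  Taking the new reference
before dropping the old one is the conventional order for such swaps.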

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 5c2faf56c8ef..06c4ec6d5e35 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -604,6 +604,16 @@ static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg)
 	}
 }
 
+static void bfq_bio_update_blkg(struct bio *bio, struct blkcg_gq *blkg)
+{
+	if (bio->bi_blkg == blkg)
+		return;
+
+	blkg_get(blkg);
+	blkg_put(bio->bi_blkg);
+	bio->bi_blkg = blkg;
+}
+
 struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 {
 	struct blkcg_gq *blkg = bio->bi_blkg;
@@ -616,14 +626,13 @@ struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 		}
 		bfqg = blkg_to_bfqg(blkg);
 		if (bfqg->pd.online) {
-			bio_associate_blkg_from_css(bio, &blkg->blkcg->css, false);
+			bfq_bio_update_blkg(bio, blkg);
 			return bfqg;
 		}
 		blkg = blkg->parent;
 	}
-	bio_associate_blkg_from_css(bio,
-				&bfqg_to_blkg(bfqd->root_group)->blkcg->css,
-				false);
+	blkg = bfqg_to_blkg(bfqd->root_group);
+	bfq_bio_update_blkg(bio, blkg);
 	return bfqd->root_group;
 }
 
-- 
2.51.0



* [RFC PATCH v1 14/17] blk-cgroup: protect blkgs with blkcg_mutex
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

queue_lock is still needed by block core users, but blkcg no longer needs
it for blkg topology now that throttle runtime state has a private lock.

Move queue-local blkg synchronization to q->blkcg_mutex. Hold it while
looking up, creating and destroying blkgs, while preparing and undoing
configuration, and while activating or deactivating policies.

Update the BFQ, iocost, iolatency and throttle paths which walk
q->blkg_list or access per-blkg policy state to use the same lock.

blkcg->lock still protects blkcg-local radix tree and list updates. Some
lookups under blkcg_mutex can race with blkcg updates done for other
queues, so keep those lookups in RCU read-side critical sections. In
particular, protect the parent lookup in blkg_create() and the parent
walk in blkg_lookup_create().

Nowait bio association remains non-blocking after the lock conversion: if
the RCU lookup misses, preemptible task-context callers use
mutex_trylock() on q->blkcg_mutex and, if they get it, create the missing
blkg without sleeping. Atomic callers, a contended mutex, or an allocation
failure keep the fail-fast behavior.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c    |  10 +--
 block/blk-cgroup.c    | 199 +++++++++++++++++++++++-------------------
 block/blk-cgroup.h    |  16 ++--
 block/blk-iocost.c    |   5 +-
 block/blk-iolatency.c |   7 +-
 block/blk-throttle.c  |  10 +--
 6 files changed, 136 insertions(+), 111 deletions(-)
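
Note for readers: the nowait path described in the last paragraph of the
commit message (lock-free lookup first, then a try-lock that never waits,
with atomic callers failing fast) can be sketched in user-space C11.  This
is an illustrative model only: the sketch_* names are hypothetical, an
atomic_flag plays the role of q->blkcg_mutex, an atomic_bool plays the
role of the RCU-visible blkg, and the atomic_ctx parameter stands in for
the !preemptible() check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue; not a kernel structure. */
struct sketch_queue {
	atomic_flag topo_busy;	/* plays the role of q->blkcg_mutex */
	atomic_bool has_blkg;	/* plays the role of the published blkg */
};

void sketch_queue_init(struct sketch_queue *q)
{
	assert(q != NULL);
	atomic_flag_clear(&q->topo_busy);
	atomic_init(&q->has_blkg, false);
}

/* Try-lock: returns true if the lock was taken; never waits. */
bool sketch_topo_trylock(struct sketch_queue *q)
{
	return !atomic_flag_test_and_set(&q->topo_busy);
}

void sketch_topo_unlock(struct sketch_queue *q)
{
	atomic_flag_clear(&q->topo_busy);
}

/* Nowait association: true if a blkg is available without waiting. */
bool sketch_nowait_get_blkg(struct sketch_queue *q, bool atomic_ctx)
{
	/* Fast path: lock-free lookup. */
	if (atomic_load(&q->has_blkg))
		return true;

	/* Atomic callers and contended locks stay on the fail-fast path. */
	if (atomic_ctx || !sketch_topo_trylock(q))
		return false;

	/* Lock held: create and publish the missing blkg. */
	atomic_store(&q->has_blkg, true);
	sketch_topo_unlock(q);
	return true;
}
```

In this model a miss from atomic context or under contention reports
failure immediately, an uncontended task-context miss creates the blkg,
and every later caller hits the lock-free fast path.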

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 06c4ec6d5e35..8a3ff9510386 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -426,7 +426,7 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
 
 	parent = bfqg_parent(bfqg);
 
-	lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->queue_lock);
+	lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->blkcg_mutex);
 
 	if (unlikely(!parent))
 		return;
@@ -884,7 +884,7 @@ static void bfq_reparent_active_queues(struct bfq_data *bfqd,
  *		    and reparent its children entities.
  * @pd: descriptor of the policy going offline.
  *
- * blkio already grabs the queue_lock for us, so no need to use
+ * blkio already grabs the blkcg_mutex for us, so no need to use
  * RCU-based magic
  */
 static void bfq_pd_offline(struct blkg_policy_data *pd)
@@ -957,8 +957,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	struct blkcg_gq *blkg;
 
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
-	spin_lock(&bfqd->lock);
+	spin_lock_irq(&bfqd->lock);
 
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
@@ -967,8 +966,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	}
 	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
 
-	spin_unlock(&bfqd->lock);
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&bfqd->lock);
 	mutex_unlock(&q->blkcg_mutex);
 }
 
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 92846094043a..71313bb3c4f3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,6 +30,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/part_stat.h>
+#include <linux/preempt.h>
 #include "blk.h"
 #include "blk-cgroup.h"
 #include "blk-ioprio.h"
@@ -131,9 +132,7 @@ static void blkg_free_workfn(struct work_struct *work)
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
 	if (blkg->parent)
 		blkg_put(blkg->parent);
-	spin_lock_irq(&q->queue_lock);
 	list_del_init(&blkg->q_node);
-	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	/*
@@ -382,7 +381,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	struct blkcg_gq *blkg;
 	int i, ret;
 
-	lockdep_assert_held(&disk->queue->queue_lock);
+	lockdep_assert_held(&disk->queue->blkcg_mutex);
 
 	/* request_queue is dying, do not create/recreate a blkg */
 	if (blk_queue_dying(disk->queue)) {
@@ -402,12 +401,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
+		rcu_read_lock();
 		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
 		if (WARN_ON_ONCE(!blkg->parent)) {
+			rcu_read_unlock();
 			ret = -ENODEV;
 			goto err_free_blkg;
 		}
 		blkg_get(blkg->parent);
+		rcu_read_unlock();
 	}
 
 	/* invoke per-policy init */
@@ -419,7 +421,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	}
 
 	/* insert */
-	spin_lock(&blkcg->lock);
+	spin_lock_irq(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, disk->queue->id, blkg);
 	if (likely(!ret)) {
 		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
@@ -436,7 +438,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		}
 	}
 	blkg->online = true;
-	spin_unlock(&blkcg->lock);
+	spin_unlock_irq(&blkcg->lock);
 
 	if (!ret)
 		return blkg;
@@ -459,7 +461,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
  * Lookup blkg for the @blkcg - @disk pair.  If it doesn't exist, try to
  * create one.  blkg creation is performed recursively from blkcg_root such
  * that all non-root blkg's have access to the parent blkg.  This function
- * should be called under RCU read lock and takes @disk->queue->queue_lock.
+ * must be called with @disk->queue->blkcg_mutex held.
  *
  * Returns the blkg or the closest blkg if blkg_create() fails as it walks
  * down from root.
@@ -491,6 +493,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		struct blkcg *parent = blkcg_parent(blkcg);
 		struct blkcg_gq *ret_blkg = q->root_blkg;
 
+		rcu_read_lock();
 		while (parent) {
 			blkg = blkg_lookup(parent, q);
 			if (blkg) {
@@ -501,6 +504,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 			pos = parent;
 			parent = blkcg_parent(parent);
 		}
+		rcu_read_unlock();
 
 		blkg = blkg_create(pos, disk, NULL);
 		if (IS_ERR(blkg)) {
@@ -519,7 +523,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	struct blkcg *blkcg = blkg->blkcg;
 	int i;
 
-	lockdep_assert_held(&blkg->q->queue_lock);
+	lockdep_assert_held(&blkg->q->blkcg_mutex);
 	lockdep_assert_held(&blkcg->lock);
 
 	/*
@@ -547,8 +551,8 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
 	/*
-	 * Both setting lookup hint to and clearing it from @blkg are done
-	 * under queue_lock.  If it's not pointing to @blkg now, it never
+	 * Both setting lookup hint to and clearing it from @blkg are done under
+	 * blkcg_mutex.  If it's not pointing to @blkg now, it never
 	 * will.  Hint assignment itself can race safely.
 	 */
 	if (rcu_access_pointer(blkcg->blkg_hint) == blkg)
@@ -569,24 +573,21 @@ static void blkg_destroy_all(struct gendisk *disk)
 	int i;
 
 restart:
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
 			continue;
 
-		spin_lock(&blkcg->lock);
+		spin_lock_irq(&blkcg->lock);
 		blkg_destroy(blkg);
-		spin_unlock(&blkcg->lock);
+		spin_unlock_irq(&blkcg->lock);
 
-		/*
-		 * in order to avoid holding the spin lock for too long, release
-		 * it when a batch of blkgs are destroyed.
-		 */
+		/* Avoid holding blkcg_mutex for too long. */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
-			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
@@ -605,7 +606,7 @@ static void blkg_destroy_all(struct gendisk *disk)
 	}
 
 	q->root_blkg = NULL;
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
@@ -822,8 +823,8 @@ EXPORT_SYMBOL_GPL(blkg_conf_open_bdev);
  * @ctx->blkg to the blkg being configured.
  *
  * blkg_conf_open_bdev() must be called on @ctx beforehand. On success, this
- * function returns with queue lock held and must be followed by
- * blkg_conf_close_bdev().
+ * function returns with blkcg_mutex held and must be followed by
+ * blkg_conf_unprep().
  */
 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		   struct blkg_conf_ctx *ctx)
@@ -841,7 +842,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 
 	/* Prevent concurrent with blkcg_deactivate_policy() */
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
 
 	if (!blkcg_policy_enabled(q, pol)) {
 		ret = -EOPNOTSUPP;
@@ -862,35 +862,34 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		struct blkcg_gq *new_blkg;
 
 		parent = blkcg_parent(blkcg);
+		rcu_read_lock();
 		while (parent && !blkg_lookup(parent, q)) {
 			pos = parent;
 			parent = blkcg_parent(parent);
 		}
-
-		/* Drop locks to do new blkg allocation with GFP_KERNEL. */
-		spin_unlock_irq(&q->queue_lock);
+		rcu_read_unlock();
 
 		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto fail_exit;
+			goto fail_unlock;
 		}
 
 		if (radix_tree_preload(GFP_KERNEL)) {
 			blkg_free(new_blkg);
 			ret = -ENOMEM;
-			goto fail_exit;
+			goto fail_unlock;
 		}
 
-		spin_lock_irq(&q->queue_lock);
-
 		if (!blkcg_policy_enabled(q, pol)) {
 			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
 			goto fail_preloaded;
 		}
 
+		rcu_read_lock();
 		blkg = blkg_lookup(pos, q);
+		rcu_read_unlock();
 		if (blkg) {
 			blkg_free(new_blkg);
 		} else {
@@ -907,15 +906,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			goto success;
 	}
 success:
-	mutex_unlock(&q->blkcg_mutex);
 	ctx->blkg = blkg;
 	return 0;
 
 fail_preloaded:
 	radix_tree_preload_end();
 fail_unlock:
-	spin_unlock_irq(&q->queue_lock);
-fail_exit:
 	mutex_unlock(&q->blkcg_mutex);
 	/*
 	 * If queue was bypassing, we should retry.  Do so after a
@@ -938,7 +934,7 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_unprep(struct blkg_conf_ctx *ctx)
 {
 	WARN_ON_ONCE(!ctx->blkg);
-	spin_unlock_irq(&ctx->bdev->bd_disk->queue->queue_lock);
+	mutex_unlock(&ctx->bdev->bd_disk->queue->blkcg_mutex);
 	ctx->blkg = NULL;
 }
 EXPORT_SYMBOL_GPL(blkg_conf_unprep);
@@ -1258,8 +1254,9 @@ static struct blkcg_gq *blkcg_get_first_blkg(struct blkcg *blkcg)
  * blkcg_destroy_blkgs - responsible for shooting down blkgs
  * @blkcg: blkcg of interest
  *
- * blkgs should be removed while holding both q and blkcg locks.  As blkcg lock
- * is nested inside q lock, this function performs reverse double lock dancing.
+ * blkgs should be removed while holding both q->blkcg_mutex and blkcg->lock.
+ * As blkcg->lock is nested inside q->blkcg_mutex, this function performs
+ * reverse double lock dancing.
  * Destroying the blkgs releases the reference held on the blkcg's css allowing
  * blkcg_css_free to eventually be called.
  *
@@ -1274,13 +1271,13 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 	while ((blkg = blkcg_get_first_blkg(blkcg))) {
 		struct request_queue *q = blkg->q;
 
-		spin_lock_irq(&q->queue_lock);
-		spin_lock(&blkcg->lock);
+		mutex_lock(&q->blkcg_mutex);
+		spin_lock_irq(&blkcg->lock);
 
 		blkg_destroy(blkg);
 
-		spin_unlock(&blkcg->lock);
-		spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&blkcg->lock);
+		mutex_unlock(&q->blkcg_mutex);
 
 		blkg_put(blkg);
 		cond_resched();
@@ -1472,21 +1469,20 @@ int blkcg_init_disk(struct gendisk *disk)
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	/* Make sure the root blkg exists. */
-	/* spin_lock_irq can serve as RCU read-side critical section. */
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_create(&blkcg_root, disk, new_blkg);
 	if (IS_ERR(blkg))
 		goto err_unlock;
 	q->root_blkg = blkg;
-	spin_unlock_irq(&q->queue_lock);
 
 	if (preloaded)
 		radix_tree_preload_end();
+	mutex_unlock(&q->blkcg_mutex);
 
 	return 0;
 
 err_unlock:
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 	if (preloaded)
 		radix_tree_preload_end();
 	return PTR_ERR(blkg);
@@ -1526,6 +1522,42 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+static void blkg_free_policy_data(struct blkcg_gq *blkg,
+				  const struct blkcg_policy *pol)
+{
+	struct blkcg *blkcg = blkg->blkcg;
+	struct blkg_policy_data *pd;
+	bool online = false;
+
+	lockdep_assert_held(&blkg->q->blkcg_mutex);
+
+	/*
+	 * ->pd_offline_fn() may need blkg->pd[] to stay installed, while
+	 * ->pd_free_fn() can sleep.  Mark offline under blkcg->lock, run
+	 * the offline callback, detach under blkcg->lock, then free.
+	 */
+	spin_lock_irq(&blkcg->lock);
+	pd = blkg->pd[pol->plid];
+	if (pd) {
+		online = pd->online;
+		pd->online = false;
+	}
+	spin_unlock_irq(&blkcg->lock);
+
+	if (!pd)
+		return;
+
+	if (online && pol->pd_offline_fn)
+		pol->pd_offline_fn(pd);
+
+	spin_lock_irq(&blkcg->lock);
+	WARN_ON_ONCE(blkg->pd[pol->plid] != pd);
+	WRITE_ONCE(blkg->pd[pol->plid], NULL);
+	spin_unlock_irq(&blkcg->lock);
+
+	pol->pd_free_fn(pd);
+}
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a gendisk
  * @disk: gendisk of interest
@@ -1535,9 +1567,9 @@ EXPORT_SYMBOL_GPL(io_cgrp_subsys);
  * bypass mode to populate its blkgs with policy_data for @pol.
  *
  * Activation happens with @disk bypassed, so nobody would be accessing blkgs
- * from IO path.  Update of each blkg is protected by both queue and blkcg
- * locks so that holding either lock and testing blkcg_policy_enabled() is
- * always enough for dereferencing policy data.
+ * from IO path.  Update of each blkg is protected by q->blkcg_mutex and
+ * blkcg->lock so that holding either lock and testing blkcg_policy_enabled()
+ * is always enough for dereferencing policy data.
  *
  * The caller is responsible for synchronizing [de]activations and policy
  * [un]registerations.  Returns 0 on success, -errno on failure.
@@ -1563,8 +1595,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
 retry:
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1572,14 +1605,15 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 		if (blkg->pd[pol->plid])
 			continue;
+		if (hlist_unhashed(&blkg->blkcg_node))
+			continue;
 
-		/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
+		/* If prealloc matches, use it; otherwise try GFP_NOWAIT. */
 		if (blkg == pinned_blkg) {
 			pd = pd_prealloc;
 			pd_prealloc = NULL;
 		} else {
-			pd = pol->pd_alloc_fn(disk, blkg->blkcg,
-					      GFP_NOWAIT);
+			pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOWAIT);
 		}
 
 		if (!pd) {
@@ -1592,7 +1626,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			blkg_get(blkg);
 			pinned_blkg = blkg;
 
-			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 
 			if (pd_prealloc)
 				pol->pd_free_fn(pd_prealloc);
@@ -1600,11 +1634,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 						       GFP_KERNEL);
 			if (pd_prealloc)
 				goto retry;
-			else
-				goto enomem;
+			goto enomem;
 		}
 
-		spin_lock(&blkg->blkcg->lock);
+		spin_lock_irq(&blkg->blkcg->lock);
 
 		pd->blkg = blkg;
 		pd->plid = pol->plid;
@@ -1617,14 +1650,14 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 			pol->pd_online_fn(pd);
 		pd->online = true;
 
-		spin_unlock(&blkg->blkcg->lock);
+		spin_unlock_irq(&blkg->blkcg->lock);
 	}
 
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
-	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
@@ -1635,23 +1668,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 
 enomem:
 	/* alloc failed, take down everything */
-	spin_lock_irq(&q->queue_lock);
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-		struct blkg_policy_data *pd;
-
-		spin_lock(&blkcg->lock);
-		pd = blkg->pd[pol->plid];
-		if (pd) {
-			if (pd->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(pd);
-			pd->online = false;
-			pol->pd_free_fn(pd);
-			WRITE_ONCE(blkg->pd[pol->plid], NULL);
-		}
-		spin_unlock(&blkcg->lock);
-	}
-	spin_unlock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
+	list_for_each_entry(blkg, &q->blkg_list, q_node)
+		blkg_free_policy_data(blkg, pol);
 	ret = -ENOMEM;
 	goto out;
 }
@@ -1679,24 +1698,12 @@ void blkcg_deactivate_policy(struct gendisk *disk,
 		memflags = blk_mq_freeze_queue(q);
 
 	mutex_lock(&q->blkcg_mutex);
-	spin_lock_irq(&q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
 
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-
-		spin_lock(&blkcg->lock);
-		if (blkg->pd[pol->plid]) {
-			if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(blkg->pd[pol->plid]);
-			pol->pd_free_fn(blkg->pd[pol->plid]);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
+	list_for_each_entry(blkg, &q->blkg_list, q_node)
+		blkg_free_policy_data(blkg, pol);
 
-	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	if (queue_is_mq(q))
@@ -2082,16 +2089,32 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 
 	if (blkg)
 		return blkg;
+	if (nowait) {
+		/*
+		 * mutex_trylock() itself does not sleep, but mutexes still
+		 * follow task-context locking rules.  Keep atomic nowait callers
+		 * on the strict fail-fast path.
+		 */
+		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
+			return NULL;
+
+		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+		if (blkg)
+			blkg = blkg_lookup_tryget(blkg);
+		mutex_unlock(&q->blkcg_mutex);
+
+		return blkg;
+	}
 
 	/*
 	 * Fast path failed, we're probably issuing IO in this cgroup the first
 	 * time, hold lock to create new blkg.
 	 */
-	spin_lock_irq(&q->queue_lock);
+	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
 	if (blkg)
 		blkg = blkg_lookup_tryget(blkg);
-	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	return blkg;
 }
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 615390f751aa..5aaf2d54d17e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -66,7 +66,7 @@ struct blkcg_gq {
 	/* reference count */
 	struct percpu_ref		refcnt;
 
-	/* is this blkg online? protected by both blkcg and q locks */
+	/* is this blkg online? protected by blkcg->lock and q->blkcg_mutex */
 	bool				online;
 
 	struct blkg_iostat_set __percpu	*iostat_cpu;
@@ -224,9 +224,9 @@ int blkg_conf_open_bdev(struct blkg_conf_ctx *ctx)
 	__cond_acquires(0, &ctx->bdev->bd_queue->rq_qos_mutex);
 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		   struct blkg_conf_ctx *ctx)
-	__cond_acquires(0, &ctx->bdev->bd_disk->queue->queue_lock);
+	__cond_acquires(0, &ctx->bdev->bd_disk->queue->blkcg_mutex);
 void blkg_conf_unprep(struct blkg_conf_ctx *ctx)
-	__releases(ctx->bdev->bd_disk->queue->queue_lock);
+	__releases(ctx->bdev->bd_disk->queue->blkcg_mutex);
 void blkg_conf_close_bdev(struct blkg_conf_ctx *ctx)
 	__releases(&ctx->bdev->bd_queue->rq_qos_mutex);
 
@@ -255,7 +255,7 @@ static inline bool bio_issue_as_root_blkg(struct bio *bio)
  *
  * Lookup blkg for the @blkcg - @q pair.
  *
- * Must be called in a RCU critical section.
+ * Must be called in a RCU critical section or with q->blkcg_mutex held.
  */
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 					   struct request_queue *q)
@@ -266,7 +266,7 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 		return q->root_blkg;
 
 	blkg = rcu_dereference_check(blkcg->blkg_hint,
-			lockdep_is_held(&q->queue_lock));
+			lockdep_is_held(&q->blkcg_mutex));
 	if (blkg && blkg->q == q)
 		return blkg;
 
@@ -350,9 +350,9 @@ static inline void blkg_put(struct blkcg_gq *blkg)
  * @p_blkg: target blkg to walk descendants of
  *
  * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
- * read locked.  If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs.  The caller may
- * update @pos_css by calling css_rightmost_descendant() to skip subtree.
+ * read locked.  If called under either blkcg->lock or q->blkcg_mutex, the
+ * iteration is guaranteed to include all and only online blkgs.  The caller
+ * may update @pos_css by calling css_rightmost_descendant() to skip subtree.
  * @p_blkg is included in the iteration and the first node to be visited.
  */
 #define blkg_for_each_descendant_pre(d_blkg, pos_css, p_blkg)		\
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 8b2aeba2e1e3..ae50d143e4fc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -3143,6 +3143,7 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
 	struct blkg_conf_ctx ctx;
 	struct ioc_now now;
 	struct ioc_gq *iocg;
+	unsigned long flags;
 	u32 v;
 	int ret;
 
@@ -3195,11 +3196,11 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
 			goto unprep;
 	}
 
-	spin_lock(&iocg->ioc->lock);
+	spin_lock_irqsave(&iocg->ioc->lock, flags);
 	iocg->cfg_weight = v * WEIGHT_ONE;
 	ioc_now(iocg->ioc, &now);
 	weight_updated(iocg, &now);
-	spin_unlock(&iocg->ioc->lock);
+	spin_unlock_irqrestore(&iocg->ioc->lock, flags);
 
 	ret = 0;
 
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index cef02b6c5fa9..30e23fee4f15 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -639,6 +639,7 @@ static void blkcg_iolatency_exit(struct rq_qos *rqos)
 	timer_shutdown_sync(&blkiolat->timer);
 	flush_work(&blkiolat->enable_work);
 	blkcg_deactivate_policy(rqos->disk, &blkcg_policy_iolatency);
+	flush_work(&blkiolat->enable_work);
 	kfree(blkiolat);
 }
 
@@ -811,16 +812,18 @@ static void iolatency_clear_scaling(struct blkcg_gq *blkg)
 	if (blkg->parent) {
 		struct iolatency_grp *iolat = blkg_to_lat(blkg->parent);
 		struct child_latency_info *lat_info;
+		unsigned long flags;
+
 		if (!iolat)
 			return;
 
 		lat_info = &iolat->child_lat;
-		spin_lock(&lat_info->lock);
+		spin_lock_irqsave(&lat_info->lock, flags);
 		atomic_set(&lat_info->scale_cookie, DEFAULT_SCALE_COOKIE);
 		lat_info->last_scale_event = 0;
 		lat_info->scale_grp = NULL;
 		lat_info->scale_lat = 0;
-		spin_unlock(&lat_info->lock);
+		spin_unlock_irqrestore(&lat_info->lock, flags);
 	}
 }
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7bca2805404f..ef3edd5a4785 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1777,10 +1777,10 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 	if (!blk_throtl_activated(q))
 		return;
 
-	spin_lock_irq(&q->queue_lock);
-	spin_lock(&td->lock);
+	mutex_lock(&q->blkcg_mutex);
+	spin_lock_irq(&td->lock);
 	/*
-	 * queue_lock is held, rcu lock is not needed here technically.
+	 * blkcg_mutex is held, rcu lock is not needed here technically.
 	 * However, rcu lock is still held to emphasize that following
 	 * path need RCU protection and to prevent warning from lockdep.
 	 */
@@ -1797,8 +1797,8 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 		tg_cancel_writeback_bios(blkg_to_tg(blkg), cancel_bios);
 	}
 	rcu_read_unlock();
-	spin_unlock(&td->lock);
-	spin_unlock_irq(&q->queue_lock);
+	spin_unlock_irq(&td->lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	for (rw = READ; rw <= WRITE; rw++) {
 		struct bio *bio;
-- 
2.51.0



* [RFC PATCH v1 15/17] blk-cgroup: remove blkg radix tree preloading
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
                   ` (13 preceding siblings ...)
  2026-07-04 19:51 ` [RFC PATCH v1 14/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 16/17] blk-cgroup: allocate blkgs in blkg_create Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 17/17] blk-cgroup: share blkg creation between lookup and config prep Yu Kuai
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

blkg creation is now serialized by q->blkcg_mutex and no longer runs
under q->queue_lock.  The blkg radix tree is initialized with GFP_NOWAIT,
so radix_tree_insert() does not sleep while blkcg->lock is held, and
blkg_create() already propagates an insertion failure to its caller.  The
old preload dance is therefore no longer needed.

Remove the preload calls and the associated unwind path.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
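Reviewer note, not part of the commit message: without preloading, the
insert under blkcg->lock either succeeds or fails immediately, and callers
need only one unwind label.  Below is a minimal user-space sketch of that
control flow, not kernel code.  model_tree, model_queue, model_tree_insert()
and model_init_disk() are hypothetical names invented for illustration; only
the 0 / -EEXIST / -ENOMEM return convention is borrowed from
radix_tree_insert().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4

/* Hypothetical stand-in for blkcg->blkg_tree: a small id-indexed table. */
struct model_tree {
	void *slot[MODEL_SLOTS];
	bool alloc_fails;	/* simulate a failed non-sleeping node allocation */
};

/* Hypothetical stand-in for a queue; lock_depth models q->blkcg_mutex. */
struct model_queue {
	struct model_tree tree;
	int lock_depth;
};

/*
 * Follows the radix_tree_insert() convention of 0, -EEXIST or -ENOMEM.
 * This model additionally returns -EINVAL for an out-of-range id.
 */
static int model_tree_insert(struct model_tree *t, unsigned int id, void *item)
{
	if (id >= MODEL_SLOTS)
		return -EINVAL;
	if (t->slot[id])
		return -EEXIST;
	if (t->alloc_fails)
		return -ENOMEM;
	t->slot[id] = item;
	return 0;
}

/* One lock, one insert, one unwind label: no preload state to undo. */
static int model_init_disk(struct model_queue *q, unsigned int id, void *blkg)
{
	int ret;

	q->lock_depth++;		/* mutex_lock(&q->blkcg_mutex) */
	ret = model_tree_insert(&q->tree, id, blkg);
	if (ret)
		goto err_unlock;
	q->lock_depth--;		/* mutex_unlock(&q->blkcg_mutex) */
	return 0;

err_unlock:
	q->lock_depth--;
	return ret;
}
```

In this model a caller checks the return value of model_init_disk() and
treats any non-zero result as a failed setup; the lock depth is balanced on
both paths.
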
 block/blk-cgroup.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 71313bb3c4f3..b99ab8d67798 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -420,7 +420,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 			pol->pd_init_fn(blkg->pd[i]);
 	}
 
-	/* insert */
 	spin_lock_irq(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, disk->queue->id, blkg);
 	if (likely(!ret)) {
@@ -875,16 +874,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			goto fail_unlock;
 		}
 
-		if (radix_tree_preload(GFP_KERNEL)) {
-			blkg_free(new_blkg);
-			ret = -ENOMEM;
-			goto fail_unlock;
-		}
-
 		if (!blkcg_policy_enabled(q, pol)) {
 			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
-			goto fail_preloaded;
+			goto fail_unlock;
 		}
 
 		rcu_read_lock();
@@ -896,12 +889,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			blkg = blkg_create(pos, disk, new_blkg);
 			if (IS_ERR(blkg)) {
 				ret = PTR_ERR(blkg);
-				goto fail_preloaded;
+				goto fail_unlock;
 			}
 		}
 
-		radix_tree_preload_end();
-
 		if (pos == blkcg)
 			goto success;
 	}
@@ -909,8 +900,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	ctx->blkg = blkg;
 	return 0;
 
-fail_preloaded:
-	radix_tree_preload_end();
 fail_unlock:
 	mutex_unlock(&q->blkcg_mutex);
 	/*
@@ -1448,7 +1437,6 @@ int blkcg_init_disk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	struct blkcg_gq *new_blkg, *blkg;
-	bool preloaded;
 
 	/*
 	 * If the queue is shared across disk rebind (e.g., SCSI), the
@@ -1466,8 +1454,6 @@ int blkcg_init_disk(struct gendisk *disk)
 	if (!new_blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	/* Make sure the root blkg exists. */
 	mutex_lock(&q->blkcg_mutex);
 	blkg = blkg_create(&blkcg_root, disk, new_blkg);
@@ -1475,16 +1461,12 @@ int blkcg_init_disk(struct gendisk *disk)
 		goto err_unlock;
 	q->root_blkg = blkg;
 
-	if (preloaded)
-		radix_tree_preload_end();
 	mutex_unlock(&q->blkcg_mutex);
 
 	return 0;
 
 err_unlock:
 	mutex_unlock(&q->blkcg_mutex);
-	if (preloaded)
-		radix_tree_preload_end();
 	return PTR_ERR(blkg);
 }
 
-- 
2.51.0



* [RFC PATCH v1 16/17] blk-cgroup: allocate blkgs in blkg_create
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
                   ` (14 preceding siblings ...)
  2026-07-04 19:51 ` [RFC PATCH v1 15/17] blk-cgroup: remove blkg radix tree preloading Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  2026-07-04 19:51 ` [RFC PATCH v1 17/17] blk-cgroup: share blkg creation between lookup and config prep Yu Kuai
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

After radix tree preloading is gone, callers no longer need to allocate a
blkg before entering blkg_create(). Move allocation into blkg_create() and
pass the desired GFP mask instead.

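Use GFP_NOIO for runtime and config blkg creation so slow paths can sleep
without recursing into IO reclaim, keep GFP_KERNEL for root blkg setup, and
use GFP_ATOMIC when nowait bio association creates a missing blkg after a
successful q->blkcg_mutex trylock.

blkcg_activate_policy() now allocates policy data with GFP_NOIO while
holding q->blkcg_mutex, which removes the GFP_NOWAIT attempt and the
preallocate-and-retry loop.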

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
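Reviewer note, not part of the commit message: the ownership change can be
sketched in user space as follows.  Allocation moves inside the create
helper, the caller only picks the allocation mask, and every error path
frees what the helper itself allocated.  This is an illustration under
stated assumptions, not the kernel implementation: model_gfp, model_blkg,
model_blkg_create() and the failure-injection flags are hypothetical, and
the mask is merely recorded, not interpreted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the GFP masks named in the commit message. */
enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_NOIO, MODEL_GFP_ATOMIC };

struct model_blkg {
	int id;
	enum model_gfp gfp;	/* mask the object was allocated with */
};

static int model_live_blkgs;	/* allocated and not yet freed */
static bool model_alloc_fails;	/* inject an allocation failure */
static bool model_insert_fails;	/* inject a failure after allocation */

static struct model_blkg *model_blkg_alloc(int id, enum model_gfp gfp)
{
	struct model_blkg *blkg;

	if (model_alloc_fails)
		return NULL;
	blkg = malloc(sizeof(*blkg));
	if (!blkg)
		return NULL;
	blkg->id = id;
	blkg->gfp = gfp;
	model_live_blkgs++;
	return blkg;
}

static void model_blkg_free(struct model_blkg *blkg)
{
	free(blkg);
	model_live_blkgs--;
}

/*
 * Allocate and "insert" in one place.  Returns the new object, or NULL
 * with *err set to a negative errno.  The caller never frees on failure.
 */
static struct model_blkg *model_blkg_create(int id, enum model_gfp gfp,
					    int *err)
{
	struct model_blkg *blkg = model_blkg_alloc(id, gfp);

	if (!blkg) {
		*err = -ENOMEM;
		goto err_free_blkg;
	}
	if (model_insert_fails) {
		*err = -EEXIST;
		goto err_free_blkg;
	}
	*err = 0;
	return blkg;

err_free_blkg:
	if (blkg)
		model_blkg_free(blkg);
	return NULL;
}
```

A caller in this model passes the mask that matches its context, for
example MODEL_GFP_NOIO from a path that may sleep and MODEL_GFP_ATOMIC from
a path that must not, and has no cleanup to do when NULL comes back.
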
 block/blk-cgroup.c | 89 ++++++++++------------------------------------
 1 file changed, 18 insertions(+), 71 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b99ab8d67798..ddc9073d7ab9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -371,14 +371,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 	return NULL;
 }
 
-/*
- * If @new_blkg is %NULL, this function tries to allocate a new one as
- * necessary using %GFP_NOWAIT.  @new_blkg is always consumed on return.
- */
 static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
-				    struct blkcg_gq *new_blkg)
+				    gfp_t gfp_mask)
 {
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg = NULL;
 	int i, ret;
 
 	lockdep_assert_held(&disk->queue->blkcg_mutex);
@@ -389,15 +385,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		goto err_free_blkg;
 	}
 
-	/* allocate */
-	if (!new_blkg) {
-		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
-		if (unlikely(!new_blkg)) {
-			ret = -ENOMEM;
-			goto err_free_blkg;
-		}
+	blkg = blkg_alloc(blkcg, disk, gfp_mask);
+	if (unlikely(!blkg)) {
+		ret = -ENOMEM;
+		goto err_free_blkg;
 	}
-	blkg = new_blkg;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
@@ -447,8 +439,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	return ERR_PTR(ret);
 
 err_free_blkg:
-	if (new_blkg)
-		blkg_free(new_blkg);
+	if (blkg)
+		blkg_free(blkg);
 	return ERR_PTR(ret);
 }
 
@@ -505,7 +497,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		}
 		rcu_read_unlock();
 
-		blkg = blkg_create(pos, disk, NULL);
+		blkg = blkg_create(pos, disk, GFP_NOIO);
 		if (IS_ERR(blkg)) {
 			blkg = ret_blkg;
 			break;
@@ -858,7 +850,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	while (true) {
 		struct blkcg *pos = blkcg;
 		struct blkcg *parent;
-		struct blkcg_gq *new_blkg;
 
 		parent = blkcg_parent(blkcg);
 		rcu_read_lock();
@@ -868,14 +859,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		}
 		rcu_read_unlock();
 
-		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
-		if (unlikely(!new_blkg)) {
-			ret = -ENOMEM;
-			goto fail_unlock;
-		}
-
 		if (!blkcg_policy_enabled(q, pol)) {
-			blkg_free(new_blkg);
 			ret = -EOPNOTSUPP;
 			goto fail_unlock;
 		}
@@ -883,10 +867,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		rcu_read_lock();
 		blkg = blkg_lookup(pos, q);
 		rcu_read_unlock();
-		if (blkg) {
-			blkg_free(new_blkg);
-		} else {
-			blkg = blkg_create(pos, disk, new_blkg);
+		if (!blkg) {
+			blkg = blkg_create(pos, disk, GFP_NOIO);
 			if (IS_ERR(blkg)) {
 				ret = PTR_ERR(blkg);
 				goto fail_unlock;
@@ -1436,7 +1418,7 @@ void blkg_init_queue(struct request_queue *q)
 int blkcg_init_disk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
-	struct blkcg_gq *new_blkg, *blkg;
+	struct blkcg_gq *blkg;
 
 	/*
 	 * If the queue is shared across disk rebind (e.g., SCSI), the
@@ -1450,13 +1432,9 @@ int blkcg_init_disk(struct gendisk *disk)
 	 */
 	wait_var_event(&q->root_blkg, !READ_ONCE(q->root_blkg));
 
-	new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
-	if (!new_blkg)
-		return -ENOMEM;
-
 	/* Make sure the root blkg exists. */
 	mutex_lock(&q->blkcg_mutex);
-	blkg = blkg_create(&blkcg_root, disk, new_blkg);
+	blkg = blkg_create(&blkcg_root, disk, GFP_KERNEL);
 	if (IS_ERR(blkg))
 		goto err_unlock;
 	q->root_blkg = blkg;
@@ -1559,8 +1537,7 @@ static void blkg_free_policy_data(struct blkcg_gq *blkg,
 int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 {
 	struct request_queue *q = disk->queue;
-	struct blkg_policy_data *pd_prealloc = NULL;
-	struct blkcg_gq *blkg, *pinned_blkg = NULL;
+	struct blkcg_gq *blkg;
 	unsigned int memflags;
 	int ret;
 
@@ -1578,7 +1555,6 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
 
-retry:
 	mutex_lock(&q->blkcg_mutex);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
@@ -1590,34 +1566,9 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 		if (hlist_unhashed(&blkg->blkcg_node))
 			continue;
 
-		/* If prealloc matches, use it; otherwise try GFP_NOWAIT. */
-		if (blkg == pinned_blkg) {
-			pd = pd_prealloc;
-			pd_prealloc = NULL;
-		} else {
-			pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOWAIT);
-		}
-
-		if (!pd) {
-			/*
-			 * GFP_NOWAIT failed.  Free the existing one and
-			 * prealloc for @blkg w/ GFP_KERNEL.
-			 */
-			if (pinned_blkg)
-				blkg_put(pinned_blkg);
-			blkg_get(blkg);
-			pinned_blkg = blkg;
-
-			mutex_unlock(&q->blkcg_mutex);
-
-			if (pd_prealloc)
-				pol->pd_free_fn(pd_prealloc);
-			pd_prealloc = pol->pd_alloc_fn(disk, blkg->blkcg,
-						       GFP_KERNEL);
-			if (pd_prealloc)
-				goto retry;
+		pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_NOIO);
+		if (!pd)
 			goto enomem;
-		}
 
 		spin_lock_irq(&blkg->blkcg->lock);
 
@@ -1642,15 +1593,10 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
-	if (pinned_blkg)
-		blkg_put(pinned_blkg);
-	if (pd_prealloc)
-		pol->pd_free_fn(pd_prealloc);
 	return ret;
 
 enomem:
 	/* alloc failed, take down everything */
-	mutex_lock(&q->blkcg_mutex);
 	list_for_each_entry(blkg, &q->blkg_list, q_node)
 		blkg_free_policy_data(blkg, pol);
 	ret = -ENOMEM;
@@ -2080,7 +2026,8 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
 			return NULL;
 
-		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
+					  GFP_ATOMIC);
 		if (blkg)
 			blkg = blkg_lookup_tryget(blkg);
 		mutex_unlock(&q->blkcg_mutex);
-- 
2.51.0



* [RFC PATCH v1 17/17] blk-cgroup: share blkg creation between lookup and config prep
  2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
                   ` (15 preceding siblings ...)
  2026-07-04 19:51 ` [RFC PATCH v1 16/17] blk-cgroup: allocate blkgs in blkg_create Yu Kuai
@ 2026-07-04 19:51 ` Yu Kuai
  16 siblings, 0 replies; 18+ messages in thread
From: Yu Kuai @ 2026-07-04 19:51 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

blkg_conf_prep() open-codes the same parent walk and blkg creation that
blkg_lookup_create() already performs. Make blkg_lookup_create() take the
GFP mask from its caller and return 0 when the target blkg was found or
created, or an errno otherwise. The blkg is passed back through a new
@blkgp argument, which still points to the closest existing blkg on
failure. Then have blkg_conf_prep() use the helper and treat errors as
config failures.

This keeps the bio association path's closest-blkg fallback and removes
the duplicate config path loop.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
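Reviewer note, not part of the commit message: the new calling convention
(status code plus out parameter, with a closest-ancestor fallback on
failure) can be sketched in user space as below.  This is a simplified
model, not the kernel code: model_cg, has_blkg, create_fails and
model_lookup_create() are hypothetical, locking and RCU are omitted, and
the root is assumed to always have a blkg.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blkcg and its per-queue blkg state. */
struct model_cg {
	struct model_cg *parent;	/* NULL for the root */
	bool has_blkg;			/* models blkg_lookup() succeeding */
	bool create_fails;		/* inject a creation failure here */
};

/*
 * Returns 0 with *out set to @cg when its blkg exists or was created.
 * Returns a negative errno with *out set to the closest ancestor that
 * already has a blkg when creation fails on the way down.
 */
static int model_lookup_create(struct model_cg *cg, struct model_cg **out)
{
	if (cg->has_blkg) {
		*out = cg;
		return 0;
	}

	for (;;) {
		struct model_cg *pos = cg;
		struct model_cg *parent = cg->parent;

		/* Find the topmost ancestor that still lacks a blkg. */
		while (parent && !parent->has_blkg) {
			pos = parent;
			parent = parent->parent;
		}

		if (pos->create_fails) {
			*out = parent;	/* closest ancestor with a blkg */
			return -ENOMEM;
		}
		pos->has_blkg = true;

		if (pos == cg) {
			*out = cg;
			return 0;
		}
	}
}
```

The two kinds of caller then differ only in how they read the result: a
config-style caller fails on any non-zero return, while a bio-association
style caller may ignore the return value and use *out whenever it is
non-NULL.
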
 block/blk-cgroup.c | 81 +++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 55 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ddc9073d7ab9..ae481bcde934 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -448,17 +448,19 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
  * blkg_lookup_create - lookup blkg, try to create one if not there
  * @blkcg: blkcg of interest
  * @disk: gendisk of interest
+ * @gfp_mask: allocation mask to use
+ * @blkgp: out parameter for the target blkg, or closest blkg on failure
  *
  * Lookup blkg for the @blkcg - @disk pair.  If it doesn't exist, try to
  * create one.  blkg creation is performed recursively from blkcg_root such
  * that all non-root blkg's have access to the parent blkg.  This function
  * must be called with @disk->queue->blkcg_mutex held.
  *
- * Returns the blkg or the closest blkg if blkg_create() fails as it walks
- * down from root.
+ * On success, *@blkgp points to the target blkg and 0 is returned.  On
+ * failure, *@blkgp points to the closest blkg and the errno is returned.
  */
-static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-		struct gendisk *disk)
+static int blkg_lookup_create(struct blkcg *blkcg, struct gendisk *disk,
+			      gfp_t gfp_mask, struct blkcg_gq **blkgp)
 {
 	struct request_queue *q = disk->queue;
 	struct blkcg_gq *blkg;
@@ -470,7 +472,8 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		    blkg != rcu_dereference(blkcg->blkg_hint))
 			rcu_assign_pointer(blkcg->blkg_hint, blkg);
 		rcu_read_unlock();
-		return blkg;
+		*blkgp = blkg;
+		return 0;
 	}
 	rcu_read_unlock();
 
@@ -497,16 +500,16 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		}
 		rcu_read_unlock();
 
-		blkg = blkg_create(pos, disk, GFP_NOIO);
+		blkg = blkg_create(pos, disk, gfp_mask);
 		if (IS_ERR(blkg)) {
-			blkg = ret_blkg;
-			break;
+			*blkgp = ret_blkg;
+			return PTR_ERR(blkg);
+		}
+		if (pos == blkcg) {
+			*blkgp = blkg;
+			return 0;
 		}
-		if (pos == blkcg)
-			break;
 	}
-
-	return blkg;
 }
 
 static void blkg_destroy(struct blkcg_gq *blkg)
@@ -839,46 +842,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		goto fail_unlock;
 	}
 
-	blkg = blkg_lookup(blkcg, q);
-	if (blkg)
-		goto success;
-
-	/*
-	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
-	 * non-root blkgs have access to their parents.
-	 */
-	while (true) {
-		struct blkcg *pos = blkcg;
-		struct blkcg *parent;
-
-		parent = blkcg_parent(blkcg);
-		rcu_read_lock();
-		while (parent && !blkg_lookup(parent, q)) {
-			pos = parent;
-			parent = blkcg_parent(parent);
-		}
-		rcu_read_unlock();
-
-		if (!blkcg_policy_enabled(q, pol)) {
-			ret = -EOPNOTSUPP;
-			goto fail_unlock;
-		}
-
-		rcu_read_lock();
-		blkg = blkg_lookup(pos, q);
-		rcu_read_unlock();
-		if (!blkg) {
-			blkg = blkg_create(pos, disk, GFP_NOIO);
-			if (IS_ERR(blkg)) {
-				ret = PTR_ERR(blkg);
-				goto fail_unlock;
-			}
-		}
+	ret = blkg_lookup_create(blkcg, disk, GFP_NOIO, &blkg);
+	if (ret)
+		goto fail_unlock;
 
-		if (pos == blkcg)
-			goto success;
-	}
-success:
 	ctx->blkg = blkg;
 	return 0;
 
@@ -2018,6 +1985,8 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 	if (blkg)
 		return blkg;
 	if (nowait) {
+		int ret;
+
 		/*
 		 * mutex_trylock() itself does not sleep, but mutexes still
 		 * follow task-context locking rules.  Keep atomic nowait callers
@@ -2026,9 +1995,11 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 		if (!preemptible() || !mutex_trylock(&q->blkcg_mutex))
 			return NULL;
 
-		blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
-					  GFP_ATOMIC);
-		if (blkg)
+		ret = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk,
+					 GFP_ATOMIC, &blkg);
+		if (ret)
+			blkg = NULL;
+		else if (blkg)
 			blkg = blkg_lookup_tryget(blkg);
 		mutex_unlock(&q->blkcg_mutex);
 
@@ -2040,7 +2011,7 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
 	 * time, hold lock to create new blkg.
 	 */
 	mutex_lock(&q->blkcg_mutex);
-	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+	blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk, GFP_NOIO, &blkg);
 	if (blkg)
 		blkg = blkg_lookup_tryget(blkg);
 	mutex_unlock(&q->blkcg_mutex);
-- 
2.51.0



Thread overview: 18+ messages
2026-07-04 19:51 [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 01/17] nvme-multipath: retarget failedover bios from requeue work Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 02/17] dm thin: avoid bio_set_dev under pool lock Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 03/17] dm snapshot: avoid bio_set_dev in locked map paths Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 04/17] blk-throttle: protect throttle state with td lock Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 05/17] block: add bio_alloc_atomic() for atomic bio users Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 06/17] blk-cgroup: support non-blocking bio association Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 07/17] block: support non-blocking bio allocation with a bdev Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 08/17] bcache: avoid sleeping blkg association from locked paths Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 09/17] dm bufio: avoid blkg association from GFP_NOWAIT bio init Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 10/17] dm pcache: handle non-blocking bio clone init failure Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 11/17] block: avoid scheduling from non-blocking helper allocations Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 12/17] dm: avoid sleeping blkg association from NOWAIT remaps Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 13/17] bfq: avoid blkg lookup from locked cgroup update Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 14/17] blk-cgroup: protect blkgs with blkcg_mutex Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 15/17] blk-cgroup: remove blkg radix tree preloading Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 16/17] blk-cgroup: allocate blkgs in blkg_create Yu Kuai
2026-07-04 19:51 ` [RFC PATCH v1 17/17] blk-cgroup: share blkg creation between lookup and config prep Yu Kuai
