Linux block layer
* [RFC PATCH v1 00/17] blk-cgroup: protect blkgs with blkcg_mutex
@ 2026-07-04 19:51 Yu Kuai
From: Yu Kuai @ 2026-07-04 19:51 UTC
  To: Jens Axboe, Tejun Heo
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, Alasdair Kergon,
	Benjamin Marzinski, Mike Snitzer, Mikulas Patocka, Dongsheng Yang,
	Zheng Gu, Coly Li, Kent Overstreet, Josef Bacik, Yu Kuai,
	Nilay Shroff, linux-block, cgroups, linux-nvme, dm-devel,
	linux-bcache

From: Yu Kuai <yukuai@fygo.io>

This RFC moves queue-local blkg topology synchronization from
q->queue_lock to q->blkcg_mutex.

q->queue_lock is a hot block-layer spinlock used by request queue runtime
paths, and it is also used in irq-disabled or otherwise atomic contexts.
Using it to protect blkg topology makes blkg lookup, creation,
destruction, policy activation, and policy-state walks inherit those atomic
locking constraints.  That forces awkward preallocation schemes such as
radix-tree preloading and prevents missing-blkg creation from sleeping,
even though blkg creation is a blkcg control-plane operation rather than a
queue dispatch fast-path operation.

q->blkcg_mutex is a better fit for blkg protection because it is already a
queue-local blkcg lock, it can serialize the full lookup/create/destroy and
policy activation path, and it allows allocation and parent lookup to run
from sleepable contexts.  Moving blkg topology under q->blkcg_mutex also
separates blkcg topology from queue runtime locking, reducing queue_lock
scope and making the locking rules for blkcg policy users explicit.
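To illustrate the shape of the sleepable path described above, here is a
minimal userspace sketch, not kernel code and not taken from the patches.
struct queue, struct grp, grp_mutex and grp_lookup_create() are
hypothetical stand-ins for the request queue, a blkg, q->blkcg_mutex and
the lookup/create path.  A pthread mutex plays the role of the kernel
mutex, and malloc() plays the role of an allocation that is allowed to
sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_CGROUPS 8	/* hypothetical fixed table size for the sketch */

struct grp {		/* stand-in for a blkg */
	int cgroup_id;
};

struct queue {		/* stand-in for a request queue */
	pthread_mutex_t grp_mutex;	/* plays the role of q->blkcg_mutex */
	struct grp *grps[MAX_CGROUPS];	/* per-queue group table */
};

static void queue_init(struct queue *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->grp_mutex, NULL);
	for (int i = 0; i < MAX_CGROUPS; i++)
		q->grps[i] = NULL;
}

/*
 * Look up the group for cgroup_id, creating it if it is missing.
 * Lookup, allocation and insertion all happen under one sleepable lock,
 * so no preallocation is needed before taking it.
 * Returns NULL for an out-of-range id or on allocation failure.
 */
static struct grp *grp_lookup_create(struct queue *q, int cgroup_id)
{
	struct grp *g;

	if (cgroup_id < 0 || cgroup_id >= MAX_CGROUPS)
		return NULL;

	pthread_mutex_lock(&q->grp_mutex);
	g = q->grps[cgroup_id];
	if (!g) {
		g = malloc(sizeof(*g));	/* may block; fine under a mutex */
		if (g) {
			g->cgroup_id = cgroup_id;
			q->grps[cgroup_id] = g;
		}
	}
	pthread_mutex_unlock(&q->grp_mutex);
	return g;
}
```

The point of the sketch is only that allocation sits inside the locked
region.  With a spinlock in place of the mutex, the allocation would have
to be done up front or be non-sleeping.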

bio_set_dev() and bio allocation with a bdev can associate a bio with the
destination queue's blkg.  Once missing blkg creation is serialized by
q->blkcg_mutex, those helpers may sleep when they create a blkg.  The first
part of the series therefore audits callers that can reach these helpers
from completion, spinlocked, irq-disabled, GFP_NOWAIT, or other
non-blocking paths, and either moves association to process context or uses
a nowait association path that avoids sleeping.
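The "move association to process context" option can be pictured with a
small userspace sketch.  This is an assumption-laden model, not the code
from the series: struct bio here is a toy type, and requeue_list,
failover_from_completion() and requeue_work() are hypothetical names.
The sketch is single-threaded, so it omits the lock a real pending list
would need.

```c
#include <assert.h>
#include <stddef.h>

struct bio {			/* toy stand-in, not the kernel struct bio */
	struct bio *next;
	int dev;		/* current target device id */
	int associated;		/* 1 once the (possibly sleeping) step ran */
};

struct requeue_list {		/* singly linked FIFO of deferred bios */
	struct bio *head;
	struct bio **tail;
};

static void requeue_init(struct requeue_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/*
 * Models the non-blocking side (for example a completion handler):
 * it only queues the bio and does no retargeting or association.
 */
static void failover_from_completion(struct requeue_list *l, struct bio *b)
{
	assert(b != NULL);
	b->next = NULL;
	*l->tail = b;
	l->tail = &b->next;
}

/*
 * Models the process-context worker: it drains the list and performs
 * the step that may sleep (retarget plus association).
 * Returns the number of bios handled.
 */
static int requeue_work(struct requeue_list *l, int new_dev)
{
	struct bio *b = l->head;
	int handled = 0;

	requeue_init(l);	/* detach the whole list first */
	while (b) {
		struct bio *next = b->next;

		b->next = NULL;
		b->dev = new_dev;	/* stands in for bio_set_dev() */
		b->associated = 1;
		handled++;
		b = next;
	}
	return handled;
}
```

The design choice shown is that the non-blocking side never touches the
target device; everything that might sleep is confined to the worker.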

The preparatory patches cover NVMe multipath requeue, dm-thin and
dm-snapshot map paths, blk-throttle's private runtime lock, atomic bio
allocation helpers, bcache, dm-bufio, dm-pcache, DM NOWAIT clones/remaps,
and BFQ's locked cgroup update path.  The final blkcg patches then move
blkg lookup/create/destroy, policy activation, and configuration
preparation to q->blkcg_mutex; remove radix-tree preloading; move blkg
allocation into blkg_create(); and share creation code between bio
association and config preparation.

This is an RFC because the locking conversion changes a central blkcg
lifetime path and relies on all non-sleepable bio association users either
being converted or tolerating nowait association failure.

One intentional tradeoff is left in the nowait paths.  They first try to
associate with an existing blkg.  If a thread issues IO to a queue for the
first time from a GFP_NOWAIT or otherwise non-blocking path, the cgroup's
blkg for that queue may not exist yet.  After blkg topology moves to
q->blkcg_mutex, preemptible task-context callers trylock q->blkcg_mutex
and attempt blkg creation.  Once allocation moves into blkg_create(), that
opportunistic nowait creation uses GFP_ATOMIC.  If the caller is in atomic
context, q->blkcg_mutex is contended, or allocation fails, the nowait
helper still fails and the caller needs to retry from a blocking context,
defer the association, or fall back to an existing slow path.
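The nowait behaviour described above can be sketched in userspace C as
follows.  This is a hedged model, not the helper added by the series:
grp_get_nowait(), struct queue, struct grp and the explicit in_atomic
parameter are all hypothetical.  An acquire load on an insert-only table
stands in for the RCU lookup, pthread_mutex_trylock() stands in for
trying q->blkcg_mutex, and malloc() stands in for a GFP_ATOMIC
allocation.  Group destruction is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_CGROUPS 8	/* hypothetical fixed table size for the sketch */

struct grp {		/* stand-in for a blkg */
	int cgroup_id;
};

struct queue {		/* stand-in for a request queue */
	pthread_mutex_t grp_mutex;	/* plays the role of q->blkcg_mutex */
	_Atomic(struct grp *) grps[MAX_CGROUPS];
};

static void queue_init(struct queue *q)
{
	pthread_mutex_init(&q->grp_mutex, NULL);
	for (int i = 0; i < MAX_CGROUPS; i++)
		atomic_init(&q->grps[i], NULL);
}

/* Lockless lookup; safe here only because entries are never removed. */
static struct grp *grp_lookup(struct queue *q, int id)
{
	return atomic_load_explicit(&q->grps[id], memory_order_acquire);
}

/*
 * Non-blocking association: returns the group, or NULL if the caller
 * must retry from a blocking context or defer the association.
 */
static struct grp *grp_get_nowait(struct queue *q, int id, bool in_atomic)
{
	struct grp *g;

	assert(id >= 0 && id < MAX_CGROUPS);

	g = grp_lookup(q, id);		/* 1. existing group: always fine */
	if (g)
		return g;
	if (in_atomic)			/* 2. atomic caller: cannot take mutex */
		return NULL;
	if (pthread_mutex_trylock(&q->grp_mutex) != 0)
		return NULL;		/* 3. contended: fail, do not wait */

	g = grp_lookup(q, id);		/* recheck now that we hold the lock */
	if (!g) {
		g = malloc(sizeof(*g));	/* models a non-sleeping allocation */
		if (g) {
			g->cgroup_id = id;
			atomic_store_explicit(&q->grps[id], g,
					      memory_order_release);
		}
	}
	pthread_mutex_unlock(&q->grp_mutex);
	return g;			/* NULL here means allocation failed */
}
```

A caller that gets NULL back would take one of the three fallbacks listed
above.  The sketch does not pick one, since the right choice depends on
the caller.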

Patch layout:

Patch 1: move NVMe multipath failover bio retargeting to requeue work so
bio_set_dev() runs from process context instead of completion context.

Patches 2-3: remove or avoid bio_set_dev() while dm-thin and dm-snapshot
locks are held, and restore blkcg association later where needed.

Patch 4: give blk-throttle its own runtime-state lock so blkcg topology
can be moved away from queue_lock.

Patches 5-7: add bio_alloc_atomic(), make bio association nowait-aware,
and make bio allocation with a bdev fail rather than sleep for
non-blocking callers.

Patches 8-12: convert bcache, dm-bufio, dm-pcache, block helper
allocations, and DM NOWAIT remaps/clones to the new nowait or deferred
association model.

Patch 13: avoid a sleeping blkg lookup from BFQ while bfqd->lock is held.

Patch 14: protect queue-local blkg lookup, creation, destruction, policy
activation, and policy state walks with q->blkcg_mutex.  This also makes
preemptible nowait bio association try q->blkcg_mutex instead of failing
immediately after an RCU lookup miss.

Patch 15: remove radix-tree preloading after blkg creation no longer runs
under queue_lock.

Patch 16: allocate blkgs inside blkg_create() and use GFP_ATOMIC for the
nowait bio-association trylock creation path.

Patch 17: share blkg creation between bio association and config
preparation.

Yu Kuai (17):
  nvme-multipath: retarget failedover bios from requeue work
  dm thin: avoid bio_set_dev under pool lock
  dm snapshot: avoid bio_set_dev in locked map paths
  blk-throttle: protect throttle state with td lock
  block: add bio_alloc_atomic() for atomic bio users
  blk-cgroup: support non-blocking bio association
  block: support non-blocking bio allocation with a bdev
  bcache: avoid sleeping blkg association from locked paths
  dm bufio: avoid blkg association from GFP_NOWAIT bio init
  dm pcache: handle non-blocking bio clone init failure
  block: avoid scheduling from non-blocking helper allocations
  dm: avoid sleeping blkg association from NOWAIT remaps
  bfq: avoid blkg lookup from locked cgroup update
  blk-cgroup: protect blkgs with blkcg_mutex
  blk-cgroup: remove blkg radix tree preloading
  blk-cgroup: allocate blkgs in blkg_create
  blk-cgroup: share blkg creation between lookup and config prep

 block/bfq-cgroup.c                 |  26 +-
 block/bio.c                        |  50 +++-
 block/blk-cgroup.c                 | 397 ++++++++++++-----------------
 block/blk-cgroup.h                 |  16 +-
 block/blk-crypto-fallback.c        |   2 +-
 block/blk-iocost.c                 |   5 +-
 block/blk-iolatency.c              |   7 +-
 block/blk-lib.c                    |   3 +-
 block/blk-map.c                    |   7 +-
 block/blk-throttle.c               |  93 +++++--
 drivers/md/bcache/journal.c        |   9 +-
 drivers/md/bcache/request.c        |   4 +-
 drivers/md/dm-bufio.c              |   9 +-
 drivers/md/dm-linear.c             |   2 +-
 drivers/md/dm-pcache/backing_dev.c |  10 +-
 drivers/md/dm-snap.c               |  29 ++-
 drivers/md/dm-stripe.c             |   6 +-
 drivers/md/dm-switch.c             |   2 +-
 drivers/md/dm-thin.c               |   3 -
 drivers/md/dm-unstripe.c           |   2 +-
 drivers/md/dm.c                    |  28 +-
 drivers/md/md.c                    |   2 +-
 drivers/nvdimm/nd_virtio.c         |  11 +-
 drivers/nvme/host/multipath.c      |   4 +-
 fs/gfs2/lops.c                     |   3 +-
 fs/ocfs2/cluster/heartbeat.c       |  15 +-
 include/linux/bio.h                |  53 ++--
 include/linux/device-mapper.h      |   8 +
 include/linux/writeback.h          |   2 +-
 mm/page_io.c                       |   2 +-
 30 files changed, 467 insertions(+), 343 deletions(-)


base-commit: a1c8bdbbd72564cebb0d02948c1ed57b80b2e773
-- 
2.51.0
