Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH v3] block: assign caller-specific lockdep class to disk->open_mutex
From: Tetsuo Handa @ 2026-06-03 11:54 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche, Jens Axboe
  Cc: linux-block, LKML, Andrew Morton, Ming Lei, Damien Le Moal,
	Qu Wenruo, Hillf Danton, Miguel Ojeda
In-Reply-To: <420f723a-8168-4f56-b84a-2a36ecd87fea@I-love.SAKURA.ne.jp>

The block core currently allocates a single monolithic lockdep key for
disk->open_mutex across all callers. This single key conflates locking
hierarchies between independent block streams. For example, if a stacked
driver like loop flushes its internal workqueues inside lo_release() while
holding its own open_mutex, lockdep views this as a potential ABBA deadlock
against the underlying storage stack, leading to numerous circular
dependency splats.

To structurally reduce false positives, this patch splits the global
monolithic lock class into distinct, per-caller instances during disk
allocation. This is done by replacing "struct lock_class_key" with
"struct gendisk_lkclass", which contains two instances of
"struct lock_class_key" for the legacy "(bio completion)" map and
disk->open_mutex respectively.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
Changes in v3:
- Adjusted Rust part for safe pointer passing, pointed out by sashiko
  ( https://sashiko.dev/#/patchset/420f723a-8168-4f56-b84a-2a36ecd87fea%40I-love.SAKURA.ne.jp ) .

Changes in v2:
- Replaced a two-element array with a struct with two named members, suggested by Bart Van Assche
  ( https://lore.kernel.org/all/4cf7ecc7-932c-4589-9d0f-3e025e83e27c@acm.org/ ).
- Added changes needed by Rust block layer bindings and rnull module, pointed out by sashiko
  ( https://sashiko.dev/#/patchset/147ed056-03d9-4214-b925-0f10fc00cf27%40I-love.SAKURA.ne.jp ).

Testing result of v1:
- I kept v1 patch in linux-next for several more days, but result was that
  some of circular dependency splats which I thought this change succeeded to
  eliminate are still getting reported. That is, we need to determine whether
  we should make this change without example syzbot reports that demonstrates
  difference. But in general, keeping locking chains simpler and shorter
  should be a good change.

Acknowledgment:
  Since I have no experience with Rust, changes needed by Rust block layer
  bindings and rnull module are made based on conversation with the Gemini
  AI collaborator.

 block/blk-mq.c                   |  4 ++--
 block/blk.h                      |  2 +-
 block/genhd.c                    |  8 +++----
 drivers/block/rnull/rnull.rs     |  2 +-
 drivers/scsi/sd.c                |  2 +-
 drivers/scsi/sr.c                |  2 +-
 include/linux/blk-mq.h           |  6 ++---
 include/linux/blkdev.h           |  9 +++++--
 rust/kernel/block/mq.rs          |  2 +-
 rust/kernel/block/mq/gen_disk.rs | 41 ++++++++++++++++++++++++++++++--
 10 files changed, 60 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a24175441380..5203e8cc6a28 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4492,7 +4492,7 @@ EXPORT_SYMBOL(blk_mq_destroy_queue);
 
 struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata,
-		struct lock_class_key *lkclass)
+		struct gendisk_lkclass *lkclass)
 {
 	struct request_queue *q;
 	struct gendisk *disk;
@@ -4513,7 +4513,7 @@ struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 EXPORT_SYMBOL(__blk_mq_alloc_disk);
 
 struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
-		struct lock_class_key *lkclass)
+		struct gendisk_lkclass *lkclass)
 {
 	struct gendisk *disk;
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..611bcd655357 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -614,7 +614,7 @@ void drop_partition(struct block_device *part);
 void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
 
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
-		struct lock_class_key *lkclass);
+		struct gendisk_lkclass *lkclass);
 struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..8f4a3d8ca15e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1444,7 +1444,7 @@ dev_t part_devt(struct gendisk *disk, u8 partno)
 }
 
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
-		struct lock_class_key *lkclass)
+		struct gendisk_lkclass *lkclass)
 {
 	struct gendisk *disk;
 
@@ -1467,7 +1467,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 		goto out_free_bdi;
 
 	disk->node_id = node_id;
-	mutex_init(&disk->open_mutex);
+	mutex_init_with_key(&disk->open_mutex, &lkclass->open_mutex_lkclass);
 	xa_init(&disk->part_tbl);
 	if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
 		goto out_destroy_part_tbl;
@@ -1482,7 +1482,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	device_initialize(disk_to_dev(disk));
 	inc_diskseq(disk);
 	q->disk = disk;
-	lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
+	lockdep_init_map(&disk->lockdep_map, "(bio completion)", &lkclass->bio_lkclass, 0);
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
 #endif
@@ -1506,7 +1506,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 }
 
 struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
-		struct lock_class_key *lkclass)
+		struct gendisk_lkclass *lkclass)
 {
 	struct queue_limits default_lim = { };
 	struct request_queue *q;
diff --git a/drivers/block/rnull/rnull.rs b/drivers/block/rnull/rnull.rs
index 0ca8715febe8..476a8910c432 100644
--- a/drivers/block/rnull/rnull.rs
+++ b/drivers/block/rnull/rnull.rs
@@ -61,7 +61,7 @@ fn new(
             .logical_block_size(block_size)?
             .physical_block_size(block_size)?
             .rotational(rotational)
-            .build(fmt!("{}", name.to_str()?), tagset, queue_data)
+            .build(fmt!("{}", name.to_str()?), tagset, queue_data, kernel::my_gendisk_lkclass!())
     }
 }
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 599e75f33334..63fe8c86606a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -112,7 +112,7 @@ static DEFINE_MUTEX(sd_mutex_lock);
 static mempool_t *sd_page_pool;
 static mempool_t *sd_large_page_pool;
 static atomic_t sd_large_page_pool_users = ATOMIC_INIT(0);
-static struct lock_class_key sd_bio_compl_lkclass;
+static struct gendisk_lkclass sd_bio_compl_lkclass;
 
 static const char *sd_cache_types[] = {
 	"write through", "none", "write back",
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index c36c54ecd354..734567ae0e43 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -106,7 +106,7 @@ static struct scsi_driver sr_template = {
 static unsigned long sr_index_bits[SR_DISKS / BITS_PER_LONG];
 static DEFINE_SPINLOCK(sr_index_lock);
 
-static struct lock_class_key sr_bio_compl_lkclass;
+static struct gendisk_lkclass sr_bio_compl_lkclass;
 
 static int sr_open(struct cdrom_device_info *, int);
 static void sr_release(struct cdrom_device_info *);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..5aa17e82c3ba 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -726,15 +726,15 @@ enum {
 
 struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata,
-		struct lock_class_key *lkclass);
+		struct gendisk_lkclass *lkclass);
 #define blk_mq_alloc_disk(set, lim, queuedata)				\
 ({									\
-	static struct lock_class_key __key;				\
+	static struct gendisk_lkclass __key;				\
 									\
 	__blk_mq_alloc_disk(set, lim, queuedata, &__key);		\
 })
 struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
-		struct lock_class_key *lkclass);
+		struct gendisk_lkclass *lkclass);
 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata);
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..28b0aee6b3ba 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -49,6 +49,11 @@ extern const struct device_type disk_type;
 extern const struct device_type part_type;
 extern const struct class block_class;
 
+struct gendisk_lkclass {
+	struct lock_class_key bio_lkclass;
+	struct lock_class_key open_mutex_lkclass;
+};
+
 /*
  * Maximum number of blkcg policies allowed to be registered concurrently.
  * Defined here to simplify include dependency.
@@ -974,7 +979,7 @@ int bdev_disk_changed(struct gendisk *disk, bool invalidate);
 
 void put_disk(struct gendisk *disk);
 struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
-		struct lock_class_key *lkclass);
+		struct gendisk_lkclass *lkclass);
 
 /**
  * blk_alloc_disk - allocate a gendisk structure
@@ -990,7 +995,7 @@ struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
  */
 #define blk_alloc_disk(lim, node_id)					\
 ({									\
-	static struct lock_class_key __key;				\
+	static struct gendisk_lkclass __key;				\
 									\
 	__blk_alloc_disk(lim, node_id, &__key);				\
 })
diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
index 1fd0d54dd549..10f22b200567 100644
--- a/rust/kernel/block/mq.rs
+++ b/rust/kernel/block/mq.rs
@@ -88,7 +88,7 @@
 //!     Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?;
 //! let mut disk = gen_disk::GenDiskBuilder::new()
 //!     .capacity_sectors(4096)
-//!     .build(fmt!("myblk"), tagset, ())?;
+//!     .build(fmt!("myblk"), tagset, (), kernel::my_gendisk_lkclass!())?;
 //!
 //! # Ok::<(), kernel::error::Error>(())
 //! ```
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..7e669ca5c032 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -11,7 +11,6 @@
     error::{self, from_err_ptr, Result},
     fmt::{self, Write},
     prelude::*,
-    static_lock_class,
     str::NullTerminatedFormatter,
     sync::Arc,
     types::{ForeignOwnable, ScopeGuard},
@@ -38,6 +37,43 @@ fn default() -> Self {
     }
 }
 
+/// A wrapper type for safely passing "struct gendisk_lkclass" argument.
+///
+/// This type can only be instantiated via the [`my_gendisk_lkclass!`] macro.
+pub struct GenDiskLockClass(pub(crate) *mut bindings::gendisk_lkclass);
+
+impl GenDiskLockClass {
+    /// Retrieve the underlying raw pointer.
+    pub(crate) fn as_ptr(&self) -> *mut bindings::gendisk_lkclass {
+        self.0
+    }
+}
+
+#[doc(hidden)]
+pub mod __internal {
+    use super::*;
+
+    /// Internal constructor used ONLY by the `my_gendisk_lkclass!` macro.
+    ///
+    /// SAFETY: `ptr` must point to a valid static `gendisk_lkclass` instance.
+    pub const unsafe fn new_lock_class(ptr: *mut bindings::gendisk_lkclass) -> GenDiskLockClass {
+        GenDiskLockClass(ptr)
+    }
+}
+
+/// Helper macro to generate a unique caller-local static lock class struct
+#[macro_export]
+macro_rules! my_gendisk_lkclass {
+    () => {{
+        static mut LKCLASS: $crate::bindings::gendisk_lkclass = $crate::bindings::gendisk_lkclass {
+            bio_lkclass: const { unsafe { ::core::mem::zeroed() } },
+            open_mutex_lkclass: const { unsafe { ::core::mem::zeroed() } },
+        };
+
+        unsafe { $crate::block::mq::gen_disk::__internal::new_lock_class(&raw mut LKCLASS) }
+    }};
+}
+
 impl GenDiskBuilder {
     /// Create a new instance.
     pub fn new() -> Self {
@@ -100,6 +136,7 @@ pub fn build<T: Operations>(
         name: fmt::Arguments<'_>,
         tagset: Arc<TagSet<T>>,
         queue_data: T::QueueData,
+        lkclass: GenDiskLockClass,
     ) -> Result<GenDisk<T>> {
         let data = queue_data.into_foreign();
         let recover_data = ScopeGuard::new(|| {
@@ -121,7 +158,7 @@ pub fn build<T: Operations>(
                 tagset.raw_tag_set(),
                 &mut lim,
                 data,
-                static_lock_class!().as_ptr(),
+                lkclass.as_ptr(),
             )
         })?;
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH 15/79] block: rnull: add `use_per_node_hctx` config option
From: Andreas Hindborg @ 2026-06-03 12:09 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Alice Ryhl, Boqun Feng, Jens Axboe, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Trevor Gross,
	Danilo Krummrich, FUJITA Tomonori, Frederic Weisbecker,
	Lyude Paul, Thomas Gleixner, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Lorenzo Stoakes, Liam R. Howlett, linux-block,
	rust-for-linux, linux-kernel, linux-mm
In-Reply-To: <CANiq72n5+JadJUW-GO0cwQrq9eJW7cB3ZK7zx4J484RzjxmYEA@mail.gmail.com>

Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> writes:

> On Mon, Mar 16, 2026 at 2:58 PM Andreas Hindborg <a.hindborg@kernel.org> wrote:
>>
>> This is intentional. The line gets too long if I pull it up, and I don't
>
> Do you mean you don't want to go over the limit or, otherwise, that
> something else complains?
>
> i.e. the limit is not a hard requirement -- exceptions can be made
> where reasonable (as long as `rustfmt` is clean), and moving the line
> to an odd position doesn't sound worth it.

Yes, checkpatch was complaining. I'll keep the long line then.

Best regards,
Andreas Hindborg




^ permalink raw reply

* Re: [PATCH 14/79] block: rnull: add submit queue count config option
From: Andreas Hindborg @ 2026-06-03 12:00 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Boqun Feng, Jens Axboe, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Trevor Gross,
	Danilo Krummrich, FUJITA Tomonori, Frederic Weisbecker,
	Lyude Paul, Thomas Gleixner, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Lorenzo Stoakes, Liam R. Howlett, linux-block,
	rust-for-linux, linux-kernel, linux-mm
In-Reply-To: <abfeIpz8HP_Pr5Yv@google.com>

Alice Ryhl <aliceryhl@google.com> writes:

> On Mon, Feb 16, 2026 at 12:35:01AM +0100, Andreas Hindborg wrote:
>> Allow user space to control the number of submission queues when creating
>> null block devices.
>> 
>> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
>> ---
>>  drivers/block/rnull/configfs.rs | 56 +++++++++++++++++++++++++++++++++--------
>>  drivers/block/rnull/rnull.rs    | 56 +++++++++++++++++++++++++++--------------
>>  2 files changed, 83 insertions(+), 29 deletions(-)
>> 
>> diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
>> index b5dc30c5d3e20..fd3cbf7aa012e 100644
>> --- a/drivers/block/rnull/configfs.rs
>> +++ b/drivers/block/rnull/configfs.rs
>> @@ -59,7 +59,10 @@ impl AttributeOperations<0> for Config {
>>  
>>      fn show(_this: &Config, page: &mut [u8; PAGE_SIZE]) -> Result<usize> {
>>          let mut writer = kernel::str::Formatter::new(page);
>> -        writer.write_str("blocksize,size,rotational,irqmode,completion_nsec,memory_backed\n")?;
>> +        writer.write_str(
>> +            "blocksize,size,rotational,irqmode,completion_nsec,memory_backed\
>> +             submit_queues\n",
>> +        )?;
>
> Missing comma? If so, this may indicate missing test coverage :)

Indeed :o


Best regards,
Andreas Hindborg



^ permalink raw reply

* Re: [PATCH 26/79] block: rnull: add badblocks support
From: Andreas Hindborg @ 2026-06-03 12:43 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Boqun Feng, Jens Axboe, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Trevor Gross,
	Danilo Krummrich, FUJITA Tomonori, Frederic Weisbecker,
	Lyude Paul, Thomas Gleixner, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Lorenzo Stoakes, Liam R. Howlett, linux-block,
	rust-for-linux, linux-kernel, linux-mm
In-Reply-To: <abfxSTL2qTMtpnY7@google.com>

Alice Ryhl <aliceryhl@google.com> writes:

> On Mon, Feb 16, 2026 at 12:35:13AM +0100, Andreas Hindborg wrote:
>> Add badblocks support to the rnull driver with a configfs interface for
>> managing bad sectors.
>> 
>> - Configfs attribute for adding/removing bad blocks via "+start-end" and
>>   "-start-end" syntax.
>> - Request handling that checks for bad blocks and returns IO errors.
>> - Updated request completion to handle error status properly.
>> 
>> The badblocks functionality is disabled by default and is enabled when
>> first bad block is added.
>> 
>> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
>
>> +    fn store(this: &DeviceConfig, page: &[u8]) -> Result {
>> +        // This attribute can be set while device is powered.
>> +
>> +        for line in core::str::from_utf8(page)?.lines() {
>> +            let mut chars = line.chars();
>> +            match chars.next() {
>> +                Some(sign @ '+' | sign @ '-') => {
>> +                    if let Some((start, end)) = chars.as_str().split_once('-') {
>> +                        let start: u64 = start.parse().map_err(|_| EINVAL)?;
>> +                        let end: u64 = end.parse().map_err(|_| EINVAL)?;
>> +
>> +                        if start > end {
>> +                            return Err(EINVAL);
>> +                        }
>> +
>> +                        this.data.lock().bad_blocks.enable();
>> +
>> +                        if sign == '+' {
>> +                            this.data.lock().bad_blocks.set_bad(start..=end, true)?;
>> +                        } else {
>> +                            this.data.lock().bad_blocks.set_good(start..=end)?;
>
> Taking lock twice: TOCTOU.

I fixed all of these, thanks for reporting.

>
>> @@ -118,6 +125,7 @@ fn make_group(
>>                      home_node: bindings::NUMA_NO_NODE,
>>                      discard: false,
>>                      no_sched: false,
>> +                    bad_blocks: Arc::pin_init(BadBlocks::new(false), GFP_KERNEL)?,
>>                  }),
>>              }),
>>              core::iter::empty(),
>
> [..]
>
>> @@ -155,6 +160,7 @@ fn init(_module: &'static ThisModule) -> impl PinInit<Self, Error> {
>>                      home_node: *module_parameters::home_node.value(),
>>                      discard: *module_parameters::discard.value() != 0,
>>                      no_sched: *module_parameters::no_sched.value() != 0,
>> +                    bad_blocks: Arc::pin_init(BadBlocks::new(false), GFP_KERNEL)?,
>
> It seems weird to construct this Arc in two places. Is it shared or not?

In the case where the device is created via configfs, the `BadBlocks`
instance is shared. When the device is constructed via module
parameters, the `BadBlocks` instance is not shared. In this case it is
actually not used, as there is no way to enable the code path. But we
need to provide an instance anyway.

I did not want to use an Option, because I did not want the checks on
access.


Best regards,
Andreas Hindborg




^ permalink raw reply

* [PATCH 0/5] blk-cgroup: fix blkg list and policy data races
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel

This small series fixes races between blkg destruction, q->blkg_list
iteration, and blkcg policy activation.

The first two patches serialize q->blkg_list walks in blkg_destroy_all()
and BFQ writeback weight-raising teardown with blkcg_mutex. The next two
patches close policy activation races with concurrent blkg destruction,
including skipping blkgs that are already dying. The final patch factors
the common policy data teardown loop.

This uses blkcg_mutex rather than extending queue_lock coverage because
the races are about blkg list visibility and policy-data lifetime, not
request-queue dispatch state. blkg_free_workfn() already uses
blkcg_mutex to serialize policy-data freeing with policy deactivation
and removes blkgs from q->blkg_list only after that teardown. Taking the
same mutex around the remaining q->blkg_list walkers gives one sleepable
serialization point for blkg lifetime, avoids adding more queue_lock
nesting, and prepares the follow-up conversion that removes queue_lock
from blkcg list protection entirely.

Yu Kuai (2):
  blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
    blkcg_mutex
  bfq: protect q->blkg_list iteration in bfq_end_wr_async() with
    blkcg_mutex

Zheng Qixing (3):
  blk-cgroup: fix race between policy activation and blkg destruction
  blk-cgroup: skip dying blkg in blkcg_activate_policy()
  blk-cgroup: factor policy pd teardown loop into helper

 block/bfq-cgroup.c  |  3 ++-
 block/bfq-iosched.c |  6 +++++
 block/blk-cgroup.c  | 65 ++++++++++++++++++++++++---------------------
 3 files changed, 43 insertions(+), 31 deletions(-)

-- 
2.51.0

^ permalink raw reply

* [PATCH 1/5] blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with blkcg_mutex
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the list
while holding blkcg_mutex.

Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 554c87bb4a86..a98a22e06fd1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -573,10 +573,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 	struct blkcg_gq *blkg;
 	int count = BLKG_DESTROY_BATCH_SIZE;
 	int i;
 
 restart:
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
@@ -591,10 +592,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 		 * it when a batch of blkgs are destroyed.
 		 */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
 	}
 
@@ -610,10 +612,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 			__clear_bit(pol->plid, q->blkcg_pols);
 	}
 
 	q->root_blkg = NULL;
 	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
 
 static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
-- 
2.51.0


^ permalink raw reply related

* [PATCH 2/5] bfq: protect q->blkg_list iteration in bfq_end_wr_async() with blkcg_mutex
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

bfq_end_wr_async() iterates q->blkg_list while only holding bfqd->lock,
but not blkcg_mutex. This can race with blkg_free_workfn() that removes
blkgs from the list while holding blkcg_mutex.

Add blkcg_mutex protection in bfq_end_wr() before taking bfqd->lock to
ensure proper synchronization when iterating q->blkg_list.

Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/bfq-cgroup.c  | 3 ++-
 block/bfq-iosched.c | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 37ab70930c8d..f765e767d36a 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -939,11 +939,12 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	struct blkcg_gq *blkg;
 
 	list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
 		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
 
-		bfq_end_wr_async_queues(bfqd, bfqg);
+		if (bfqg)
+			bfq_end_wr_async_queues(bfqd, bfqg);
 	}
 	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
 }
 
 static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 141c602d5e85..42ccfd0c6140 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2643,10 +2643,13 @@ void bfq_end_wr_async_queues(struct bfq_data *bfqd,
 static void bfq_end_wr(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq;
 	int i;
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	mutex_lock(&bfqd->queue->blkcg_mutex);
+#endif
 	spin_lock_irq(&bfqd->lock);
 
 	for (i = 0; i < bfqd->num_actuators; i++) {
 		list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
 			bfq_bfqq_end_wr(bfqq);
@@ -2654,10 +2657,13 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
 		bfq_bfqq_end_wr(bfqq);
 	bfq_end_wr_async(bfqd);
 
 	spin_unlock_irq(&bfqd->lock);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	mutex_unlock(&bfqd->queue->blkcg_mutex);
+#endif
 }
 
 static sector_t bfq_io_struct_pos(void *io_struct, bool request)
 {
 	if (request)
-- 
2.51.0


^ permalink raw reply related

* [PATCH 3/5] blk-cgroup: fix race between policy activation and blkg destruction
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a98a22e06fd1..007dfc4f9c59 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1612,10 +1612,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
 		return -EINVAL;
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
+	mutex_lock(&q->blkcg_mutex);
 retry:
 	spin_lock_irq(&q->queue_lock);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1674,10 +1676,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
 	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
 		blkg_put(pinned_blkg);
 	if (pd_prealloc)
-- 
2.51.0


^ permalink raw reply related

* [PATCH 4/5] blk-cgroup: skip dying blkg in blkcg_activate_policy()
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.

T1:                               T2:
                                  blkg_destroy
                                  kill(&blkg->refcnt) // blkg->refcnt=1->0
                                  blkg_release // call_rcu(__blkg_release)
                                  ...
                                  blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf

Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 007dfc4f9c59..b479a9796fb3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1623,10 +1623,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
 		struct blkg_policy_data *pd;
 
 		if (blkg->pd[pol->plid])
 			continue;
+		if (hlist_unhashed(&blkg->blkcg_node))
+			continue;
 
 		/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
 		if (blkg == pinned_blkg) {
 			pd = pd_prealloc;
 			pd_prealloc = NULL;
-- 
2.51.0


^ permalink raw reply related

* [PATCH 5/5] blk-cgroup: factor policy pd teardown loop into helper
From: Yu Kuai @ 2026-06-03 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

From: Zheng Qixing <zhengqixing@huawei.com>

Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability.

No functional change intended.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 57 ++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b479a9796fb3..c75b2a103bbc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1575,10 +1575,35 @@ struct cgroup_subsys io_cgrp_subsys = {
 	.depends_on = 1 << memory_cgrp_id,
 #endif
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+/*
+ * Tear down per-blkg policy data for @pol on @q.
+ */
+static void blkcg_policy_teardown_pds(struct request_queue *q,
+				      const struct blkcg_policy *pol)
+{
+	struct blkcg_gq *blkg;
+
+	list_for_each_entry(blkg, &q->blkg_list, q_node) {
+		struct blkcg *blkcg = blkg->blkcg;
+		struct blkg_policy_data *pd;
+
+		spin_lock(&blkcg->lock);
+		pd = blkg->pd[pol->plid];
+		if (pd) {
+			if (pd->online && pol->pd_offline_fn)
+				pol->pd_offline_fn(pd);
+			pd->online = false;
+			pol->pd_free_fn(pd);
+			blkg->pd[pol->plid] = NULL;
+		}
+		spin_unlock(&blkcg->lock);
+	}
+}
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a gendisk
  * @disk: gendisk of interest
  * @pol: blkcg policy to activate
  *
@@ -1690,25 +1715,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	return ret;
 
 enomem:
 	/* alloc failed, take down everything */
 	spin_lock_irq(&q->queue_lock);
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-		struct blkg_policy_data *pd;
-
-		spin_lock(&blkcg->lock);
-		pd = blkg->pd[pol->plid];
-		if (pd) {
-			if (pd->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(pd);
-			pd->online = false;
-			pol->pd_free_fn(pd);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	ret = -ENOMEM;
 	goto out;
 }
 EXPORT_SYMBOL_GPL(blkcg_activate_policy);
@@ -1723,11 +1734,10 @@ EXPORT_SYMBOL_GPL(blkcg_activate_policy);
  */
 void blkcg_deactivate_policy(struct gendisk *disk,
 			     const struct blkcg_policy *pol)
 {
 	struct request_queue *q = disk->queue;
-	struct blkcg_gq *blkg;
 	unsigned int memflags;
 
 	if (!blkcg_policy_enabled(q, pol))
 		return;
 
@@ -1736,24 +1746,11 @@ void blkcg_deactivate_policy(struct gendisk *disk,
 
 	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
-
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-
-		spin_lock(&blkcg->lock);
-		if (blkg->pd[pol->plid]) {
-			if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(blkg->pd[pol->plid]);
-			pol->pd_free_fn(blkg->pd[pol->plid]);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
-
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Christian Brauner @ 2026-06-03 13:42 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <7c5bfcf0-36a3-4cc6-bf31-6af4fc901c37@linux.alibaba.com>

> May I ask if it's an urgent 7.2 work? If not, I could

No no, it's way too late for that this cycle.

> make a preparation patch for the upcoming 7.2 cycle
> to handle erofs_map_dev() failure here so you don't
> need to bother with this in this patchset.

Sounds good. I take it you can just do this yourself without me.

> I will seek more time to resolve the recent todos

Thanks!

> yet always intercepted by other unrelated stuffs.

:)

^ permalink raw reply

* [PATCH v2] rust: add procedural macro for declaring configfs attributes
From: Malte Wechter @ 2026-06-03 14:08 UTC (permalink / raw)
  To: Andreas Hindborg, Breno Leitao, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Jens Axboe
  Cc: linux-kernel, rust-for-linux, linux-block, Malte Wechter

Implement `configfs_attrs!` as a procedural macro using `syn`, this
improves readability and maintainability. Remove the old macro and
replace all uses with the new macro. Add the new macro implementation
file to MAINTAINERS.

Signed-off-by: Malte Wechter <maltewechter@gmail.com>
---
Changes in v2:
- Add a try_parse helper function to macros/helpers.rs
- Fix bug where 'child' configuration gets dropped if trailing comma is missing (sashiko)
- Link to v1: https://lore.kernel.org/r/20260520-configfs-syn-v1-1-6c5b80a9cef2@gmail.com
---
 MAINTAINERS                     |   1 +
 drivers/block/rnull/configfs.rs |   2 +-
 rust/kernel/configfs.rs         | 251 ----------------------------------------
 rust/macros/configfs_attrs.rs   | 182 +++++++++++++++++++++++++++++
 rust/macros/helpers.rs          |  15 +++
 rust/macros/lib.rs              |  85 ++++++++++++++
 samples/rust/rust_configfs.rs   |   2 +-
 7 files changed, 285 insertions(+), 253 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb1c75afd163..45f7a1ec93b45 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6464,6 +6464,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux.git config
 F:	fs/configfs/
 F:	include/linux/configfs.h
 F:	rust/kernel/configfs.rs
+F:	rust/macros/configfs_attrs.rs
 F:	samples/configfs/
 F:	samples/rust/rust_configfs.rs
 
diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
index 7c2eb5c0b7228..f28ec69d79846 100644
--- a/drivers/block/rnull/configfs.rs
+++ b/drivers/block/rnull/configfs.rs
@@ -4,8 +4,8 @@
 use kernel::{
     block::mq::gen_disk::{GenDisk, GenDiskBuilder},
     configfs::{self, AttributeOperations},
-    configfs_attrs,
     fmt::{self, Write as _},
+    macros::configfs_attrs,
     new_mutex,
     page::PAGE_SIZE,
     prelude::*,
diff --git a/rust/kernel/configfs.rs b/rust/kernel/configfs.rs
index 2339c6467325d..7a91e36677f52 100644
--- a/rust/kernel/configfs.rs
+++ b/rust/kernel/configfs.rs
@@ -791,254 +791,3 @@ fn as_ptr(&self) -> *const bindings::config_item_type {
         self.item_type.get()
     }
 }
-
-/// Define a list of configfs attributes statically.
-///
-/// Invoking the macro in the following manner:
-///
-/// ```ignore
-/// let item_type = configfs_attrs! {
-///     container: configfs::Subsystem<Configuration>,
-///     data: Configuration,
-///     child: Child,
-///     attributes: [
-///         message: 0,
-///         bar: 1,
-///     ],
-/// };
-/// ```
-///
-/// Expands the following output:
-///
-/// ```ignore
-/// let item_type = {
-///     static CONFIGURATION_MESSAGE_ATTR: kernel::configfs::Attribute<
-///         0,
-///         Configuration,
-///         Configuration,
-///     > = unsafe {
-///         kernel::configfs::Attribute::new({
-///             const S: &str = "message\u{0}";
-///             const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
-///                 S.as_bytes()
-///             ) {
-///                 Ok(v) => v,
-///                 Err(_) => {
-///                     core::panicking::panic_fmt(core::const_format_args!(
-///                         "string contains interior NUL"
-///                     ));
-///                 }
-///             };
-///             C
-///         })
-///     };
-///
-///     static CONFIGURATION_BAR_ATTR: kernel::configfs::Attribute<
-///             1,
-///             Configuration,
-///             Configuration
-///     > = unsafe {
-///         kernel::configfs::Attribute::new({
-///             const S: &str = "bar\u{0}";
-///             const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
-///                 S.as_bytes()
-///             ) {
-///                 Ok(v) => v,
-///                 Err(_) => {
-///                     core::panicking::panic_fmt(core::const_format_args!(
-///                         "string contains interior NUL"
-///                     ));
-///                 }
-///             };
-///             C
-///         })
-///     };
-///
-///     const N: usize = (1usize + (1usize + 0usize)) + 1usize;
-///
-///     static CONFIGURATION_ATTRS: kernel::configfs::AttributeList<N, Configuration> =
-///         unsafe { kernel::configfs::AttributeList::new() };
-///
-///     {
-///         const N: usize = 0usize;
-///         unsafe { CONFIGURATION_ATTRS.add::<N, 0, _>(&CONFIGURATION_MESSAGE_ATTR) };
-///     }
-///
-///     {
-///         const N: usize = (1usize + 0usize);
-///         unsafe { CONFIGURATION_ATTRS.add::<N, 1, _>(&CONFIGURATION_BAR_ATTR) };
-///     }
-///
-///     static CONFIGURATION_TPE:
-///       kernel::configfs::ItemType<configfs::Subsystem<Configuration> ,Configuration>
-///         = kernel::configfs::ItemType::<
-///                 configfs::Subsystem<Configuration>,
-///                 Configuration
-///                 >::new_with_child_ctor::<N,Child>(
-///             &THIS_MODULE,
-///             &CONFIGURATION_ATTRS
-///         );
-///
-///     &CONFIGURATION_TPE
-/// }
-/// ```
-#[macro_export]
-macro_rules! configfs_attrs {
-    (
-        container: $container:ty,
-        data: $data:ty,
-        attributes: [
-            $($name:ident: $attr:literal),* $(,)?
-        ] $(,)?
-    ) => {
-        $crate::configfs_attrs!(
-            count:
-            @container($container),
-            @data($data),
-            @child(),
-            @no_child(x),
-            @attrs($($name $attr)*),
-            @eat($($name $attr,)*),
-            @assign(),
-            @cnt(0usize),
-        )
-    };
-    (
-        container: $container:ty,
-        data: $data:ty,
-        child: $child:ty,
-        attributes: [
-            $($name:ident: $attr:literal),* $(,)?
-        ] $(,)?
-    ) => {
-        $crate::configfs_attrs!(
-            count:
-            @container($container),
-            @data($data),
-            @child($child),
-            @no_child(),
-            @attrs($($name $attr)*),
-            @eat($($name $attr,)*),
-            @assign(),
-            @cnt(0usize),
-        )
-    };
-    (count:
-     @container($container:ty),
-     @data($data:ty),
-     @child($($child:ty)?),
-     @no_child($($no_child:ident)?),
-     @attrs($($aname:ident $aattr:literal)*),
-     @eat($name:ident $attr:literal, $($rname:ident $rattr:literal,)*),
-     @assign($($assign:block)*),
-     @cnt($cnt:expr),
-    ) => {
-        $crate::configfs_attrs!(
-            count:
-            @container($container),
-            @data($data),
-            @child($($child)?),
-            @no_child($($no_child)?),
-            @attrs($($aname $aattr)*),
-            @eat($($rname $rattr,)*),
-            @assign($($assign)* {
-                const N: usize = $cnt;
-                // The following macro text expands to a call to `Attribute::add`.
-
-                // SAFETY: By design of this macro, the name of the variable we
-                // invoke the `add` method on below, is not visible outside of
-                // the macro expansion. The macro does not operate concurrently
-                // on this variable, and thus we have exclusive access to the
-                // variable.
-                unsafe {
-                    $crate::macros::paste!(
-                        [< $data:upper _ATTRS >]
-                            .add::<N, $attr, _>(&[< $data:upper _ $name:upper _ATTR >])
-                    )
-                };
-            }),
-            @cnt(1usize + $cnt),
-        )
-    };
-    (count:
-     @container($container:ty),
-     @data($data:ty),
-     @child($($child:ty)?),
-     @no_child($($no_child:ident)?),
-     @attrs($($aname:ident $aattr:literal)*),
-     @eat(),
-     @assign($($assign:block)*),
-     @cnt($cnt:expr),
-    ) =>
-    {
-        $crate::configfs_attrs!(
-            final:
-            @container($container),
-            @data($data),
-            @child($($child)?),
-            @no_child($($no_child)?),
-            @attrs($($aname $aattr)*),
-            @assign($($assign)*),
-            @cnt($cnt),
-        )
-    };
-    (final:
-     @container($container:ty),
-     @data($data:ty),
-     @child($($child:ty)?),
-     @no_child($($no_child:ident)?),
-     @attrs($($name:ident $attr:literal)*),
-     @assign($($assign:block)*),
-     @cnt($cnt:expr),
-    ) =>
-    {
-        $crate::macros::paste!{
-            {
-                $(
-                    // SAFETY: We are expanding `configfs_attrs`.
-                    static [< $data:upper _ $name:upper _ATTR >]:
-                        $crate::configfs::Attribute<$attr, $data, $data> =
-                            unsafe {
-                                $crate::configfs::Attribute::new(
-                                    $crate::c_str!(::core::stringify!($name)),
-                                )
-                            };
-                )*
-
-
-                // We need space for a null terminator.
-                const N: usize = $cnt + 1usize;
-
-                // SAFETY: We are expanding `configfs_attrs`.
-                static [< $data:upper _ATTRS >]:
-                $crate::configfs::AttributeList<N, $data> =
-                    unsafe { $crate::configfs::AttributeList::new() };
-
-                $($assign)*
-
-                $(
-                    const [<$no_child:upper>]: bool = true;
-
-                    static [< $data:upper _TPE >] : $crate::configfs::ItemType<$container, $data>  =
-                        $crate::configfs::ItemType::<$container, $data>::new::<N>(
-                            &THIS_MODULE, &[<$ data:upper _ATTRS >]
-                        );
-                )?
-
-                $(
-                    static [< $data:upper _TPE >]:
-                        $crate::configfs::ItemType<$container, $data>  =
-                            $crate::configfs::ItemType::<$container, $data>::
-                            new_with_child_ctor::<N, $child>(
-                                &THIS_MODULE, &[<$ data:upper _ATTRS >]
-                            );
-                )?
-
-                & [< $data:upper _TPE >]
-            }
-        }
-    };
-
-}
-
-pub use crate::configfs_attrs;
diff --git a/rust/macros/configfs_attrs.rs b/rust/macros/configfs_attrs.rs
new file mode 100644
index 0000000000000..a7fc75cdebcc0
--- /dev/null
+++ b/rust/macros/configfs_attrs.rs
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+
+use quote::{
+    quote, //
+    ToTokens,
+};
+
+use proc_macro2::{
+    Span, //
+};
+
+use syn::{
+    bracketed,
+    parse::{
+        Parse,
+        ParseStream, //
+    },
+    punctuated::Punctuated,
+    spanned::Spanned,
+    Ident,
+    LitInt,
+    Token,
+    Type, //
+};
+
+use crate::helpers::try_parse;
+
+pub(crate) struct ConfigfsAttrs {
+    container: Type,
+    data: Type,
+    child: Option<Type>,
+    attributes: Vec<(Ident, LitInt)>,
+}
+
+fn parse_attribute_field(stream: ParseStream<'_>) -> syn::Result<(Ident, LitInt)> {
+    let id = stream.parse::<syn::Ident>()?;
+    let _colon = stream.parse::<Token![:]>()?;
+    let v = stream.parse::<LitInt>()?;
+    Ok((id, v))
+}
+
+fn parse_named_field(stream: ParseStream<'_>, name: &str) -> syn::Result<Type> {
+    let name_id = stream.parse::<Ident>()?;
+    if name_id != name {
+        return Err(parse_field_error(name_id.span(), &name_id, name));
+    }
+    let _colon = stream.parse::<Token![:]>()?;
+    let v = stream.parse::<Type>()?;
+    stream.parse::<Token![,]>()?;
+    Ok(v)
+}
+
+fn parse_attributes(stream: ParseStream<'_>) -> syn::Result<Vec<(Ident, LitInt)>> {
+    let attribute_id = stream.parse::<Ident>()?;
+    if attribute_id != "attributes" {
+        return Err(syn::Error::new(
+            attribute_id.span(),
+            format!(
+                "unexpected identifier: {}, expected: 'attributes'",
+                attribute_id
+            ),
+        ));
+    }
+    stream.parse::<Token![:]>()?;
+    let attr_stream;
+    let _bracket = bracketed!(attr_stream in stream);
+    let attributes = Punctuated::<(Ident, LitInt), Token![,]>::parse_terminated_with(
+        &attr_stream,
+        parse_attribute_field,
+    )?;
+    let attributes = attributes.into_iter().collect::<Vec<_>>();
+
+    stream.parse::<Option<Token![,]>>()?;
+    Ok(attributes)
+}
+
+fn parse_field_error(span: Span, name: &Ident, expected_name: &str) -> syn::Error {
+    syn::Error::new(
+        span,
+        format!("Unexpected field name, got: {name} expected: {expected_name}"),
+    )
+}
+
+impl Parse for ConfigfsAttrs {
+    fn parse(input: ParseStream<'_>) -> syn::Result<Self> {
+        let container = try_parse(input, |s| parse_named_field(s, "container"))?;
+        let data = try_parse(input, |s| parse_named_field(s, "data"))?;
+        let child = try_parse(input, |s| parse_named_field(s, "child")).ok();
+        let attributes = parse_attributes(input)?;
+
+        Ok(ConfigfsAttrs {
+            container,
+            data,
+            child,
+            attributes,
+        })
+    }
+}
+
+fn make_static_ident<T: ToTokens>(ty: &T, suffix: &str) -> syn::Ident {
+    let raw_id = quote! { #ty }.to_string();
+
+    // Sanitizing syn::Type::Path, this is safe since it is
+    // only used as the identifier.
+    let normalized = raw_id
+        .split("::")
+        .map(|s| String::from(s.trim()))
+        .reduce(|a, b| format!("{a}_{b}"))
+        .expect("Cannot be empty")
+        .to_uppercase()
+        .replace(|c: char| !c.is_alphanumeric(), "_");
+
+    Ident::new(&format!("{}_{}", normalized, suffix), ty.span())
+}
+
+pub(crate) fn configfs_attrs(cfs_attrs: ConfigfsAttrs) -> proc_macro2::TokenStream {
+    let (container_ty, data_ty) = (&cfs_attrs.container, &cfs_attrs.data);
+
+    let data_tp_ident = make_static_ident(data_ty, "TPE");
+    let data_attr_ident = make_static_ident(data_ty, "ATTR_LIST");
+
+    let n = cfs_attrs.attributes.len() + 1;
+
+    let attr_list = quote! {
+        static #data_attr_ident: kernel::configfs::AttributeList<#n, #data_ty> =
+            // SAFETY: We are expanding `configfs_attrs`.
+            unsafe { kernel::configfs::AttributeList::new() };
+    };
+
+    let mut attrs = Vec::new();
+    for (attr_idx, (name, id)) in cfs_attrs.attributes.iter().enumerate() {
+        let name_with_attr = make_static_ident(name, "ATTR");
+
+        let id: u64 = match id.base10_parse::<u64>() {
+            Ok(v) => v,
+            Err(_) => {
+                return syn::Error::new(id.span(), "Could not parse attribute ID as a u64")
+                    .to_compile_error();
+            }
+        };
+
+        attrs.push(quote! {
+        static #name_with_attr: kernel::configfs::Attribute<#id, #data_ty, #data_ty> =
+            // SAFETY: We are expanding `configfs_attrs`.
+            unsafe {
+              kernel::configfs::Attribute::new(kernel::c_str!(::core::stringify!(#name)))
+            };
+
+          // SAFETY: By design of this macro, the name of the variable we
+          // invoke the `add` method on below, is not visible outside of
+          // the macro expansion. The macro does not operate concurrently
+          // on this variable, and thus we have exclusive access to the
+          // variable.
+          unsafe { #data_attr_ident.add::<#attr_idx, #id, _>(&#name_with_attr) }
+        });
+    }
+
+    let has_child_code = if let Some(child) = cfs_attrs.child {
+        quote! { new_with_child_ctor::<#n, #child>}
+    } else {
+        quote! { new::<#n> }
+    };
+
+    let data_type = quote! {
+        {
+            static #data_tp_ident:
+            kernel::configfs::ItemType<#container_ty, #data_ty> =
+                kernel::configfs::ItemType::<#container_ty, #data_ty>::#has_child_code(
+                    &THIS_MODULE, &#data_attr_ident
+                );
+            &#data_tp_ident
+        }
+    };
+
+    quote! {
+        {
+            #attr_list
+            #(#attrs)*
+            #data_type
+        }
+    }
+}
diff --git a/rust/macros/helpers.rs b/rust/macros/helpers.rs
index d18fbf4daa0a5..fdab8804e1ba9 100644
--- a/rust/macros/helpers.rs
+++ b/rust/macros/helpers.rs
@@ -4,6 +4,7 @@
 use quote::ToTokens;
 use syn::{
     parse::{
+        discouraged::Speculative,
         Parse,
         ParseStream, //
     },
@@ -54,6 +55,20 @@ pub(crate) fn file() -> String {
     }
 }
 
+pub(crate) fn try_parse<T>(
+    input: ParseStream<'_>,
+    parser: impl FnOnce(ParseStream<'_>) -> Result<T>,
+) -> Result<T> {
+    let fork = input.fork();
+    match parser(&fork) {
+        Ok(value) => {
+            input.advance_to(&fork);
+            Ok(value)
+        }
+        Err(e) => Err(e),
+    }
+}
+
 /// Obtain all `#[cfg]` attributes.
 pub(crate) fn gather_cfg_attrs(attr: &[Attribute]) -> impl Iterator<Item = &Attribute> + '_ {
     attr.iter().filter(|a| a.path().is_ident("cfg"))
diff --git a/rust/macros/lib.rs b/rust/macros/lib.rs
index 2cfd59e0f9e7c..745ba83c828b9 100644
--- a/rust/macros/lib.rs
+++ b/rust/macros/lib.rs
@@ -15,6 +15,8 @@
 #![cfg_attr(not(CONFIG_RUSTC_HAS_SPAN_FILE), feature(proc_macro_span))]
 
 mod concat_idents;
+#[cfg(CONFIG_CONFIGFS_FS)]
+mod configfs_attrs;
 mod export;
 mod fmt;
 mod helpers;
@@ -489,3 +491,86 @@ pub fn kunit_tests(attr: TokenStream, input: TokenStream) -> TokenStream {
         .unwrap_or_else(|e| e.into_compile_error())
         .into()
 }
+
+/// Define a list of configfs attributes statically.
+///
+/// # Examples
+///
+/// ```ignore
+/// let item_type = configfs_attrs! {
+///     container: configfs::Subsystem<Configuration>,
+///     data: Configuration,
+///     child: Child,
+///     attributes: [
+///         message: 0,
+///         bar: 1,
+///     ],
+/// };
+///```
+///
+/// Expands the following output:
+///    let item_type = {
+///         static CONFIGURATION_ATTR_LIST: kernel::configfs::AttributeList<
+///             3usize,
+///             Configuration,
+///         > = unsafe { kernel::configfs::AttributeList::new() };
+///         static MESSAGE_ATTR: kernel::configfs::Attribute<
+///             0u64,
+///             Configuration,
+///             Configuration,
+///         > = unsafe {
+///             kernel::configfs::Attribute::new({
+///                 const S: &str = "message\u{0}";
+///                 const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
+///                     S.as_bytes(),
+///                 ) {
+///                     Ok(v) => v,
+///                     Err(_) => {
+///                         ::core::panicking::panic_fmt(
+///                             format_args!("string contains interior NUL"),
+///                         );
+///                     }
+///                 };
+///                 C
+///             })
+///         };
+///         unsafe { CONFIGURATION_ATTR_LIST.add::<0usize, 0u64, _>(&MESSAGE_ATTR) }
+///         static BAR_ATTR: kernel::configfs::Attribute<
+///             1u64,
+///             Configuration,
+///             Configuration,
+///         > = unsafe {
+///             kernel::configfs::Attribute::new({
+///                 const S: &str = "bar\u{0}";
+///                 const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
+///                     S.as_bytes(),
+///                 ) {
+///                     Ok(v) => v,
+///                     Err(_) => {
+///                         ::core::panicking::panic_fmt(
+///                             format_args!("string contains interior NUL"),
+///                         );
+///                     }
+///                 };
+///                 C
+///             })
+///         };
+///         unsafe { CONFIGURATION_ATTR_LIST.add::<1usize, 1u64, _>(&BAR_ATTR) }
+///         {
+///             static CONFIGURATION_TPE: kernel::configfs::ItemType<
+///                 Subsystem<Configuration>,
+///                 Configuration,
+///             > = kernel::configfs::ItemType::<
+///                 Subsystem<Configuration>,
+///                 Configuration,
+///             >::new_with_child_ctor::<3usize, Child>(&THIS_MODULE, &CONFIGURATION_ATTR_LIST);
+///             &CONFIGURATION_TPE
+///         }
+///     };
+///
+#[cfg(CONFIG_CONFIGFS_FS)]
+#[proc_macro]
+pub fn configfs_attrs(input: TokenStream) -> TokenStream {
+    configfs_attrs::configfs_attrs(parse_macro_input!(input as configfs_attrs::ConfigfsAttrs))
+        .into()
+}
diff --git a/samples/rust/rust_configfs.rs b/samples/rust/rust_configfs.rs
index a1bd9db6010da..876462f7789d1 100644
--- a/samples/rust/rust_configfs.rs
+++ b/samples/rust/rust_configfs.rs
@@ -4,7 +4,7 @@
 
 use kernel::alloc::flags;
 use kernel::configfs;
-use kernel::configfs::configfs_attrs;
+use kernel::macros::configfs_attrs;
 use kernel::new_mutex;
 use kernel::page::PAGE_SIZE;
 use kernel::prelude::*;

---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260417-configfs-syn-191e07130027

Best regards,
-- 
Malte Wechter <maltewechter@gmail.com>


^ permalink raw reply related

* Re: [PATCH 10/79] block: rust: add `command` getter to `Request`
From: Gary Guo @ 2026-06-03 14:21 UTC (permalink / raw)
  To: Andreas Hindborg, Alice Ryhl
  Cc: Boqun Feng, Jens Axboe, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Trevor Gross,
	Danilo Krummrich, FUJITA Tomonori, Frederic Weisbecker,
	Lyude Paul, Thomas Gleixner, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Lorenzo Stoakes, Liam R. Howlett, linux-block,
	rust-for-linux, linux-kernel, linux-mm
In-Reply-To: <87mrxbkfme.fsf@t14s.mail-host-address-is-not-set>

On Wed Jun 3, 2026 at 12:50 PM BST, Andreas Hindborg wrote:
> Alice Ryhl <aliceryhl@google.com> writes:
>
>> On Mon, Feb 16, 2026 at 12:34:57AM +0100, Andreas Hindborg wrote:
>>> Add a method to extract the command operation code from a request. The
>>> command is obtained by masking the lower bits of `cmd_flags` as defined by
>>> `REQ_OP_BITS`. This allows Rust block drivers to determine the type of
>>> operation being requested.
>>> 
>>> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
>>
>> With the nit below fixed:
>> Reviewed-by: Alice Ryhl <aliceryhl@google.com>
>>
>>>  rust/kernel/block/mq/request.rs | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>> 
>>> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
>>> index b49197a0c66d7..0dd329ae93dfc 100644
>>> --- a/rust/kernel/block/mq/request.rs
>>> +++ b/rust/kernel/block/mq/request.rs
>>> @@ -78,6 +78,12 @@ pub(crate) unsafe fn aref_from_raw(ptr: *mut bindings::request) -> ARef<Self> {
>>>          unsafe { ARef::from_raw(NonNull::new_unchecked(ptr.cast())) }
>>>      }
>>>  
>>> +    /// Get the command identifier for the request
>>> +    pub fn command(&self) -> u32 {
>>> +        // SAFETY: By C API contract and type invariant, `cmd_flags` is valid for read
>>> +        unsafe { (*self.0.get()).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
>>
>> Nit: scope of unsafe
>>
>> 	unsafe { (*self.0.get()).cmd_flags } & ((1 << bindings::REQ_OP_BITS) - 1)
>
> The `&` is parsed as reference operator with this change. But we can do
> this:
>
>         use core::ops::BitAnd;
>         // SAFETY: By C API contract and type invariant, `cmd_flags` is valid for read
>         unsafe { (*self.0.get()).cmd_flags }.bitand((1u32 << bindings::REQ_OP_BITS) - 1)

Wrapping the unsafe block in a parenthesis should do the trick.

Best,
Gary



^ permalink raw reply

* Re: [PATCH v2] rust: add procedural macro for declaring configfs attributes
From: Gary Guo @ 2026-06-03 14:32 UTC (permalink / raw)
  To: Malte Wechter, Andreas Hindborg, Breno Leitao, Miguel Ojeda,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Jens Axboe
  Cc: linux-kernel, rust-for-linux, linux-block
In-Reply-To: <20260603-configfs-syn-v2-1-cb58489c2647@gmail.com>

On Wed Jun 3, 2026 at 3:08 PM BST, Malte Wechter wrote:
> Implement `configfs_attrs!` as a procedural macro using `syn`, this
> improves readability and maintainability. Remove the old macro and
> replace all uses with the new macro. Add the new macro implementation
> file to MAINTAINERS.
>
> Signed-off-by: Malte Wechter <maltewechter@gmail.com>
> ---
> Changes in v2:
> - Add a try_parse helper function to macros/helpers.rs
> - Fix bug where 'child' configuration gets dropped if trailing comma is missing (sashiko)
> - Link to v1: https://lore.kernel.org/r/20260520-configfs-syn-v1-1-6c5b80a9cef2@gmail.com
> ---
>  MAINTAINERS                     |   1 +
>  drivers/block/rnull/configfs.rs |   2 +-
>  rust/kernel/configfs.rs         | 251 ----------------------------------------
>  rust/macros/configfs_attrs.rs   | 182 +++++++++++++++++++++++++++++
>  rust/macros/helpers.rs          |  15 +++
>  rust/macros/lib.rs              |  85 ++++++++++++++
>  samples/rust/rust_configfs.rs   |   2 +-
>  7 files changed, 285 insertions(+), 253 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb1c75afd163..45f7a1ec93b45 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6464,6 +6464,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux.git config
>  F:	fs/configfs/
>  F:	include/linux/configfs.h
>  F:	rust/kernel/configfs.rs
> +F:	rust/macros/configfs_attrs.rs
>  F:	samples/configfs/
>  F:	samples/rust/rust_configfs.rs
>  
> diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
> index 7c2eb5c0b7228..f28ec69d79846 100644
> --- a/drivers/block/rnull/configfs.rs
> +++ b/drivers/block/rnull/configfs.rs
> @@ -4,8 +4,8 @@
>  use kernel::{
>      block::mq::gen_disk::{GenDisk, GenDiskBuilder},
>      configfs::{self, AttributeOperations},
> -    configfs_attrs,
>      fmt::{self, Write as _},
> +    macros::configfs_attrs,
>      new_mutex,
>      page::PAGE_SIZE,
>      prelude::*,
> diff --git a/rust/macros/configfs_attrs.rs b/rust/macros/configfs_attrs.rs
> new file mode 100644
> index 0000000000000..a7fc75cdebcc0
> --- /dev/null
> +++ b/rust/macros/configfs_attrs.rs
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +use quote::{
> +    quote, //
> +    ToTokens,
> +};
> +
> +use proc_macro2::{
> +    Span, //
> +};
> +
> +use syn::{
> +    bracketed,
> +    parse::{
> +        Parse,
> +        ParseStream, //
> +    },
> +    punctuated::Punctuated,
> +    spanned::Spanned,
> +    Ident,
> +    LitInt,
> +    Token,
> +    Type, //
> +};
> +
> +use crate::helpers::try_parse;
> +
> +pub(crate) struct ConfigfsAttrs {
> +    container: Type,
> +    data: Type,
> +    child: Option<Type>,
> +    attributes: Vec<(Ident, LitInt)>,
> +}
> +
> +fn parse_attribute_field(stream: ParseStream<'_>) -> syn::Result<(Ident, LitInt)> {
> +    let id = stream.parse::<syn::Ident>()?;
> +    let _colon = stream.parse::<Token![:]>()?;
> +    let v = stream.parse::<LitInt>()?;
> +    Ok((id, v))
> +}
> +
> +fn parse_named_field(stream: ParseStream<'_>, name: &str) -> syn::Result<Type> {
> +    let name_id = stream.parse::<Ident>()?;
> +    if name_id != name {
> +        return Err(parse_field_error(name_id.span(), &name_id, name));
> +    }
> +    let _colon = stream.parse::<Token![:]>()?;
> +    let v = stream.parse::<Type>()?;
> +    stream.parse::<Token![,]>()?;
> +    Ok(v)
> +}
> +
> +fn parse_attributes(stream: ParseStream<'_>) -> syn::Result<Vec<(Ident, LitInt)>> {
> +    let attribute_id = stream.parse::<Ident>()?;
> +    if attribute_id != "attributes" {
> +        return Err(syn::Error::new(
> +            attribute_id.span(),
> +            format!(
> +                "unexpected identifier: {}, expected: 'attributes'",
> +                attribute_id
> +            ),
> +        ));
> +    }
> +    stream.parse::<Token![:]>()?;
> +    let attr_stream;
> +    let _bracket = bracketed!(attr_stream in stream);
> +    let attributes = Punctuated::<(Ident, LitInt), Token![,]>::parse_terminated_with(
> +        &attr_stream,
> +        parse_attribute_field,
> +    )?;
> +    let attributes = attributes.into_iter().collect::<Vec<_>>();
> +
> +    stream.parse::<Option<Token![,]>>()?;
> +    Ok(attributes)
> +}
> +
> +fn parse_field_error(span: Span, name: &Ident, expected_name: &str) -> syn::Error {
> +    syn::Error::new(
> +        span,
> +        format!("Unexpected field name, got: {name} expected: {expected_name}"),
> +    )
> +}
> +
> +impl Parse for ConfigfsAttrs {
> +    fn parse(input: ParseStream<'_>) -> syn::Result<Self> {
> +        let container = try_parse(input, |s| parse_named_field(s, "container"))?;
> +        let data = try_parse(input, |s| parse_named_field(s, "data"))?;
> +        let child = try_parse(input, |s| parse_named_field(s, "child")).ok();
> +        let attributes = parse_attributes(input)?;

I have a `parse_ordered_fields!()` macro in module.rs, please extract it to
helpers and use it instead of implementing the parsing in an ad-hoc manner.

> +
> +        Ok(ConfigfsAttrs {
> +            container,
> +            data,
> +            child,
> +            attributes,
> +        })
> +    }
> +}
> +
> +fn make_static_ident<T: ToTokens>(ty: &T, suffix: &str) -> syn::Ident {
> +    let raw_id = quote! { #ty }.to_string();
> +
> +    // Sanitizing syn::Type::Path, this is safe since it is
> +    // only used as the identifier.
> +    let normalized = raw_id
> +        .split("::")
> +        .map(|s| String::from(s.trim()))
> +        .reduce(|a, b| format!("{a}_{b}"))
> +        .expect("Cannot be empty")
> +        .to_uppercase()
> +        .replace(|c: char| !c.is_alphanumeric(), "_");
> +
> +    Ident::new(&format!("{}_{}", normalized, suffix), ty.span())
> +}
> +
> +pub(crate) fn configfs_attrs(cfs_attrs: ConfigfsAttrs) -> proc_macro2::TokenStream {
> +    let (container_ty, data_ty) = (&cfs_attrs.container, &cfs_attrs.data);
> +
> +    let data_tp_ident = make_static_ident(data_ty, "TPE");
> +    let data_attr_ident = make_static_ident(data_ty, "ATTR_LIST");

Instead of creating identifers like this, just scope them properly so that
a fixed identifier is used without colliding.

> +
> +    let n = cfs_attrs.attributes.len() + 1;
> +
> +    let attr_list = quote! {
> +        static #data_attr_ident: kernel::configfs::AttributeList<#n, #data_ty> =
> +            // SAFETY: We are expanding `configfs_attrs`.
> +            unsafe { kernel::configfs::AttributeList::new() };
> +    };
> +
> +    let mut attrs = Vec::new();
> +    for (attr_idx, (name, id)) in cfs_attrs.attributes.iter().enumerate() {
> +        let name_with_attr = make_static_ident(name, "ATTR");
> +
> +        let id: u64 = match id.base10_parse::<u64>() {
> +            Ok(v) => v,
> +            Err(_) => {
> +                return syn::Error::new(id.span(), "Could not parse attribute ID as a u64")
> +                    .to_compile_error();
> +            }
> +        };

Why parsing? The ID looks like it's just substituted as is.

Best,
Gary

> +
> +        attrs.push(quote! {
> +        static #name_with_attr: kernel::configfs::Attribute<#id, #data_ty, #data_ty> =
> +            // SAFETY: We are expanding `configfs_attrs`.
> +            unsafe {
> +              kernel::configfs::Attribute::new(kernel::c_str!(::core::stringify!(#name)))
> +            };
> +
> +          // SAFETY: By design of this macro, the name of the variable we
> +          // invoke the `add` method on below, is not visible outside of
> +          // the macro expansion. The macro does not operate concurrently
> +          // on this variable, and thus we have exclusive access to the
> +          // variable.
> +          unsafe { #data_attr_ident.add::<#attr_idx, #id, _>(&#name_with_attr) }
> +        });
> +    }
> +
> +    let has_child_code = if let Some(child) = cfs_attrs.child {
> +        quote! { new_with_child_ctor::<#n, #child>}
> +    } else {
> +        quote! { new::<#n> }
> +    };
> +
> +    let data_type = quote! {
> +        {
> +            static #data_tp_ident:
> +            kernel::configfs::ItemType<#container_ty, #data_ty> =
> +                kernel::configfs::ItemType::<#container_ty, #data_ty>::#has_child_code(
> +                    &THIS_MODULE, &#data_attr_ident
> +                );
> +            &#data_tp_ident
> +        }
> +    };
> +
> +    quote! {
> +        {
> +            #attr_list
> +            #(#attrs)*
> +            #data_type
> +        }
> +    }
> +}
> diff --git a/rust/macros/helpers.rs b/rust/macros/helpers.rs
> index d18fbf4daa0a5..fdab8804e1ba9 100644
> --- a/rust/macros/helpers.rs
> +++ b/rust/macros/helpers.rs
> @@ -4,6 +4,7 @@
>  use quote::ToTokens;
>  use syn::{
>      parse::{
> +        discouraged::Speculative,
>          Parse,
>          ParseStream, //
>      },
> @@ -54,6 +55,20 @@ pub(crate) fn file() -> String {
>      }
>  }
>  
> +pub(crate) fn try_parse<T>(
> +    input: ParseStream<'_>,
> +    parser: impl FnOnce(ParseStream<'_>) -> Result<T>,
> +) -> Result<T> {
> +    let fork = input.fork();
> +    match parser(&fork) {
> +        Ok(value) => {
> +            input.advance_to(&fork);
> +            Ok(value)
> +        }
> +        Err(e) => Err(e),
> +    }
> +}
> +
>  /// Obtain all `#[cfg]` attributes.
>  pub(crate) fn gather_cfg_attrs(attr: &[Attribute]) -> impl Iterator<Item = &Attribute> + '_ {
>      attr.iter().filter(|a| a.path().is_ident("cfg"))
> diff --git a/rust/macros/lib.rs b/rust/macros/lib.rs
> index 2cfd59e0f9e7c..745ba83c828b9 100644
> --- a/rust/macros/lib.rs
> +++ b/rust/macros/lib.rs
> @@ -15,6 +15,8 @@
>  #![cfg_attr(not(CONFIG_RUSTC_HAS_SPAN_FILE), feature(proc_macro_span))]
>  
>  mod concat_idents;
> +#[cfg(CONFIG_CONFIGFS_FS)]
> +mod configfs_attrs;
>  mod export;
>  mod fmt;
>  mod helpers;
> @@ -489,3 +491,86 @@ pub fn kunit_tests(attr: TokenStream, input: TokenStream) -> TokenStream {
>          .unwrap_or_else(|e| e.into_compile_error())
>          .into()
>  }
> +
> +/// Define a list of configfs attributes statically.
> +///
> +/// # Examples
> +///
> +/// ```ignore
> +/// let item_type = configfs_attrs! {
> +///     container: configfs::Subsystem<Configuration>,
> +///     data: Configuration,
> +///     child: Child,
> +///     attributes: [
> +///         message: 0,
> +///         bar: 1,
> +///     ],
> +/// };
> +///```
> +///
> +/// Expands the following output:
> +///    let item_type = {
> +///         static CONFIGURATION_ATTR_LIST: kernel::configfs::AttributeList<
> +///             3usize,
> +///             Configuration,
> +///         > = unsafe { kernel::configfs::AttributeList::new() };
> +///         static MESSAGE_ATTR: kernel::configfs::Attribute<
> +///             0u64,
> +///             Configuration,
> +///             Configuration,
> +///         > = unsafe {
> +///             kernel::configfs::Attribute::new({
> +///                 const S: &str = "message\u{0}";
> +///                 const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
> +///                     S.as_bytes(),
> +///                 ) {
> +///                     Ok(v) => v,
> +///                     Err(_) => {
> +///                         ::core::panicking::panic_fmt(
> +///                             format_args!("string contains interior NUL"),
> +///                         );
> +///                     }
> +///                 };
> +///                 C
> +///             })
> +///         };
> +///         unsafe { CONFIGURATION_ATTR_LIST.add::<0usize, 0u64, _>(&MESSAGE_ATTR) }
> +///         static BAR_ATTR: kernel::configfs::Attribute<
> +///             1u64,
> +///             Configuration,
> +///             Configuration,
> +///         > = unsafe {
> +///             kernel::configfs::Attribute::new({
> +///                 const S: &str = "bar\u{0}";
> +///                 const C: &kernel::str::CStr = match kernel::str::CStr::from_bytes_with_nul(
> +///                     S.as_bytes(),
> +///                 ) {
> +///                     Ok(v) => v,
> +///                     Err(_) => {
> +///                         ::core::panicking::panic_fmt(
> +///                             format_args!("string contains interior NUL"),
> +///                         );
> +///                     }
> +///                 };
> +///                 C
> +///             })
> +///         };
> +///         unsafe { CONFIGURATION_ATTR_LIST.add::<1usize, 1u64, _>(&BAR_ATTR) }
> +///         {
> +///             static CONFIGURATION_TPE: kernel::configfs::ItemType<
> +///                 Subsystem<Configuration>,
> +///                 Configuration,
> +///             > = kernel::configfs::ItemType::<
> +///                 Subsystem<Configuration>,
> +///                 Configuration,
> +///             >::new_with_child_ctor::<3usize, Child>(&THIS_MODULE, &CONFIGURATION_ATTR_LIST);
> +///             &CONFIGURATION_TPE
> +///         }
> +///     };
> +///
> +#[cfg(CONFIG_CONFIGFS_FS)]
> +#[proc_macro]
> +pub fn configfs_attrs(input: TokenStream) -> TokenStream {
> +    configfs_attrs::configfs_attrs(parse_macro_input!(input as configfs_attrs::ConfigfsAttrs))
> +        .into()
> +}
> diff --git a/samples/rust/rust_configfs.rs b/samples/rust/rust_configfs.rs
> index a1bd9db6010da..876462f7789d1 100644
> --- a/samples/rust/rust_configfs.rs
> +++ b/samples/rust/rust_configfs.rs
> @@ -4,7 +4,7 @@
>  
>  use kernel::alloc::flags;
>  use kernel::configfs;
> -use kernel::configfs::configfs_attrs;
> +use kernel::macros::configfs_attrs;
>  use kernel::new_mutex;
>  use kernel::page::PAGE_SIZE;
>  use kernel::prelude::*;
>
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260417-configfs-syn-191e07130027
>
> Best regards,



^ permalink raw reply

* Re: [PATCH 0/5] blk-cgroup: fix blkg list and policy data races
From: Jens Axboe @ 2026-06-03 14:35 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

On 6/3/26 7:27 AM, Yu Kuai wrote:
> This small series fixes races between blkg destruction, q->blkg_list
> iteration, and blkcg policy activation.

In spam:

Authentication-Results: mx.google.com; spf=pass (google.com: domain of srs0=eh6w=d7=fygo.io=yukuai@kernel.org designates 2600:3c04:e001:324:0:1991:8:25 as permitted sender) smtp.mailfrom="SRS0=EH6w=D7=fygo.io=yukuai@kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE) header.from=fygo.io

if you don't get this sorted out, your messages will be missed as they
end up in spam for most people.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH 0/5] blk-cgroup: fix blkg list and policy data races
From: Jens Axboe @ 2026-06-03 15:08 UTC (permalink / raw)
  To: yukuai, Yu Kuai
  Cc: Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff, Bart Van Assche,
	linux-block, cgroups, linux-kernel
In-Reply-To: <189b8601-87a6-423a-b267-9d550c31c086@fnnas.com>

On 6/3/26 9:06 AM, yukuai wrote:
> Hi,
> 
> ? 2026/6/3 22:35, Jens Axboe ??:
>> On 6/3/26 7:27 AM, Yu Kuai wrote:
>>> This small series fixes races between blkg destruction, q->blkg_list
>>> iteration, and blkcg policy activation.
>> In spam: Authentication-Results: mx.google.com <mx.google.com>; spf=pass (google.com: <google.com:> domain of srs0=eh6w=d7=fygo.io=yukuai@kernel.org designates 2600:3c04:e001:324:0:1991:8:25 as permitted sender) smtp.mailfrom="SRS0=EH6w=D7=fygo.io=yukuai@kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE) header.from=fygo.io <header.from=fygo.io> if you don't get this sorted out, your messages will be missed as they end up in spam for most people.
> 
> Thanks again for the notice. I think I finally figured out why dmarc failed. 

This email looks better!

-- 
Jens Axboe

^ permalink raw reply

* [PATCH blktests] throtl/008: Add a test for the iocost cgroup controller
From: Bart Van Assche @ 2026-06-03 20:50 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: Damien Le Moal, linux-block, Bart Van Assche

Add a test for read and write IOPS throttling. The test implementation
has been generated by Antigravity and Gemini with a few manual edits.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 tests/throtl/008     | 152 +++++++++++++++++++++++++++++++++++++++++++
 tests/throtl/008.out |   2 +
 2 files changed, 154 insertions(+)
 create mode 100755 tests/throtl/008
 create mode 100644 tests/throtl/008.out

diff --git a/tests/throtl/008 b/tests/throtl/008
new file mode 100755
index 000000000000..0570fc0188e0
--- /dev/null
+++ b/tests/throtl/008
@@ -0,0 +1,152 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2026 Google LLC
+#
+# Test cgroup iocost IOPS limiting.
+
+. tests/block/rc
+. common/null_blk
+. common/cgroup
+. common/fio
+
+DESCRIPTION="test cgroup iocost controller limits"
+
+requires() {
+	_have_cgroup2_controller io
+	_have_null_blk
+	_have_fio
+	_have_program bc
+	if [[ ! -e "$(_cgroup2_base_dir)/io.cost.qos" ]]; then
+		SKIP_REASONS+=("iocost controller not supported (CONFIG_BLK_CGROUP_IOCOST)")
+		return 1
+	fi
+}
+
+test() {
+	echo "Running ${TEST_NAME}"
+
+	# Create a null_blk instance
+	local null_blk_params=(
+		blocksize=4096
+		completion_nsec=0
+		memory_backed=0
+		size=1024 # MB
+		submit_queues=1
+		power=1
+	)
+	_init_null_blk nr_devices=0 queue_mode=2 &&
+	if ! _configure_null_blk nullb0 "${null_blk_params[@]}"; then
+		echo "Configuring null_blk failed"
+		return 1
+	fi
+
+	local dev_t
+	dev_t=$(</sys/block/nullb0/dev)
+	if [[ -z $dev_t ]]; then
+		echo "Failed to get major:minor for nullb0"
+		_exit_null_blk
+		return 1
+	fi
+
+	# Initialize cgroups
+	if ! _init_cgroup2; then
+		echo "Initializing cgroup2 failed"
+		_exit_null_blk
+		return 1
+	fi
+
+	# Enable io controller in the subtree
+	local deactivate_io_ctrlr=false
+	if ! grep -wq io "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
+		if ! echo "+io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
+			echo "Failed to enable io controller on cgroup root"
+			_exit_cgroup2
+			_exit_null_blk
+			return 1
+		fi
+		deactivate_io_ctrlr=true
+	fi
+
+	if ! echo "+io" > "$CGROUP2_DIR/cgroup.subtree_control"; then
+		echo "Failed to enable io controller on $CGROUP2_DIR"
+		if [[ $deactivate_io_ctrlr == true ]]; then
+			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
+		fi
+		_exit_cgroup2
+		_exit_null_blk
+		return 1
+	fi
+
+	# Configure iocost controller globally for the device
+	# min=100.00 max=100.00 forces vrate to be fixed at 100%
+	if ! echo "$dev_t enable=1 min=100.00 max=100.00" > "$(_cgroup2_base_dir)/io.cost.qos"; then
+		echo "Failed to configure io.cost.qos"
+		if [[ $deactivate_io_ctrlr == true ]]; then
+			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
+		fi
+		_exit_cgroup2
+		_exit_null_blk
+		return 1
+	fi
+
+	# Limit read iops to 100 and write iops to 10.
+	# Setting rbps/wbps to 0 means they are ignored (costs calculated purely based on IOPS)
+	if ! echo "$dev_t ctrl=user model=linear rbps=0 rseqiops=100 rrandiops=100 wbps=0 wseqiops=10 wrandiops=10" > "$(_cgroup2_base_dir)/io.cost.model"; then
+		echo "Failed to configure io.cost.model"
+		echo "$dev_t enable=0" > "$(_cgroup2_base_dir)/io.cost.qos"
+		if [[ $deactivate_io_ctrlr == true ]]; then
+			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
+		fi
+		_exit_cgroup2
+		_exit_null_blk
+		return 1
+	fi
+
+	# Create a child cgroup for test
+	local cg_name="testgrp"
+	local cg_path="$CGROUP2_DIR/$cg_name"
+	mkdir "$cg_path"
+
+	# 1. Run FIO read test
+	local -a FIO_PERF_FIELDS
+	FIO_PERF_FIELDS=("read iops")
+	if ! _fio_perf --bs=4k --rw=randread --name=read-test --filename=/dev/nullb0 \
+			--ioengine=libaio --direct=1 --iodepth=64 --time_based --runtime=10 \
+			--cgroup="blktests/$cg_name" > /dev/null 2>&1; then
+		echo "FIO read test failed"
+	else
+		local read_iops=${TEST_RUN["read iops"]}
+		if [[ -z $read_iops ]]; then
+			echo "Read IOPS is empty"
+		elif (( "$(echo "$read_iops < 90 || $read_iops > 110" | bc -l)" )); then
+			echo "Read IOPS $read_iops not in expected range [90, 110]"
+		fi
+	fi
+
+	# 2. Run FIO write test
+	FIO_PERF_FIELDS=("write iops")
+	if ! _fio_perf --bs=4k --rw=randwrite --name=write-test --filename=/dev/nullb0 \
+			--ioengine=libaio --direct=1 --iodepth=64 --time_based --runtime=10 \
+			--cgroup="blktests/$cg_name" > /dev/null 2>&1; then
+		echo "FIO write test failed"
+	else
+		local write_iops=${TEST_RUN["write iops"]}
+		if [[ -z $write_iops ]]; then
+			echo "Write IOPS is empty"
+		elif (( "$(echo "$write_iops < 8 || $write_iops > 12" | bc -l)" )); then
+			echo "Write IOPS $write_iops not in expected range [8, 12]"
+		fi
+	fi
+
+	# Clean up cgroups
+	rmdir "$cg_path"
+	if [[ $deactivate_io_ctrlr == true ]]; then
+		{ echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"; } &> "${FULL}"
+	fi
+	_exit_cgroup2
+
+	# Clean up null_blk
+	_exit_null_blk
+
+	echo "Test complete"
+}
diff --git a/tests/throtl/008.out b/tests/throtl/008.out
new file mode 100644
index 000000000000..890ff7f226d1
--- /dev/null
+++ b/tests/throtl/008.out
@@ -0,0 +1,2 @@
+Running throtl/008
+Test complete

^ permalink raw reply related

* [syzbot] Monthly block report (Jun 2026)
From: syzbot @ 2026-06-04  4:32 UTC (permalink / raw)
  To: linux-block, linux-kernel, syzkaller-bugs

Hello block maintainers/developers,

This is a 31-day syzbot report for the block subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/block

During the period, 0 new issues were detected and 0 were fixed.
In total, 17 issues are still open and 112 have already been fixed.
There are also 26 low-priority issues.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 9088    Yes   KMSAN: kernel-infoleak in filemap_read
                  https://syzkaller.appspot.com/bug?extid=905d785c4923bea2c1db
<2> 1729    Yes   INFO: task hung in blkdev_fallocate
                  https://syzkaller.appspot.com/bug?extid=39b75c02b8be0a061bfc
<3> 215     Yes   general protection fault in rtlock_slowlock_locked
                  https://syzkaller.appspot.com/bug?extid=08df3e4c9b304b37cb04
<4> 60724   Yes   possible deadlock in __submit_bio
                  https://syzkaller.appspot.com/bug?extid=949ae54e95a2fab4cbb4
<5> 69      Yes   INFO: task hung in blk_trace_remove (2)
                  https://syzkaller.appspot.com/bug?extid=2373f6be3e6de4f92562
<6> 18      No    KASAN: slab-out-of-bounds Read in blk_mq_free_rqs
                  https://syzkaller.appspot.com/bug?extid=e90526cab23b9efcd03c

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders

To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem

You may send multiple commands in a single email message.

^ permalink raw reply

* Re: [PATCH blktests] throtl/008: Add a test for the iocost cgroup controller
From: Shin'ichiro Kawasaki @ 2026-06-04  6:03 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Damien Le Moal, linux-block
In-Reply-To: <20260603205007.2654971-1-bvanassche@acm.org>

[-- Attachment #1: Type: text/plain, Size: 3597 bytes --]

On Jun 03, 2026 / 13:50, Bart Van Assche wrote:
> Add a test for read and write IOPS throttling. The test implementation
> has been generated by Antigravity and Gemini with a few manual edits.

Hi Bart, thanks for the patch. I think it's nice to add this test case to
cover the iocost controller code paths. It might be the better to note
'the iocost controller' in the commit message also. It will be consistent
with the DESCRIPTION in the test case.

> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  tests/throtl/008     | 152 +++++++++++++++++++++++++++++++++++++++++++
>  tests/throtl/008.out |   2 +
>  2 files changed, 154 insertions(+)
>  create mode 100755 tests/throtl/008
>  create mode 100644 tests/throtl/008.out
> 
> diff --git a/tests/throtl/008 b/tests/throtl/008
> new file mode 100755
> index 000000000000..0570fc0188e0
> --- /dev/null
> +++ b/tests/throtl/008
> @@ -0,0 +1,152 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2026 Google LLC
> +#
> +# Test cgroup iocost IOPS limiting.
> +
> +. tests/block/rc

I guess this should be tests/throtl/rc. With this, the two lines below will not
be required.

> +. common/null_blk
> +. common/cgroup
> +. common/fio
> +
> +DESCRIPTION="test cgroup iocost controller limits"
> +
> +requires() {
> +	_have_cgroup2_controller io
> +	_have_null_blk
> +	_have_fio
> +	_have_program bc

group_require() in tsts/throtl/rc covers the 3 lines out of the 4 lines above.

> +	if [[ ! -e "$(_cgroup2_base_dir)/io.cost.qos" ]]; then
> +		SKIP_REASONS+=("iocost controller not supported (CONFIG_BLK_CGROUP_IOCOST)")
> +		return 1
> +	fi
> +}
> +
> +test() {
> +	echo "Running ${TEST_NAME}"
> +
> +	# Create a null_blk instance
> +	local null_blk_params=(
> +		blocksize=4096
> +		completion_nsec=0
> +		memory_backed=0
> +		size=1024 # MB
> +		submit_queues=1
> +		power=1
> +	)
> +	_init_null_blk nr_devices=0 queue_mode=2 &&
> +	if ! _configure_null_blk nullb0 "${null_blk_params[@]}"; then
> +		echo "Configuring null_blk failed"
> +		return 1
> +	fi
> +
> +	local dev_t
> +	dev_t=$(</sys/block/nullb0/dev)
> +	if [[ -z $dev_t ]]; then
> +		echo "Failed to get major:minor for nullb0"
> +		_exit_null_blk
> +		return 1
> +	fi
> +
> +	# Initialize cgroups
> +	if ! _init_cgroup2; then
> +		echo "Initializing cgroup2 failed"
> +		_exit_null_blk
> +		return 1
> +	fi
> +
> +	# Enable io controller in the subtree
> +	local deactivate_io_ctrlr=false
> +	if ! grep -wq io "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
> +		if ! echo "+io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
> +			echo "Failed to enable io controller on cgroup root"
> +			_exit_cgroup2
> +			_exit_null_blk
> +			return 1
> +		fi
> +		deactivate_io_ctrlr=true
> +	fi
> +
> +	if ! echo "+io" > "$CGROUP2_DIR/cgroup.subtree_control"; then
> +		echo "Failed to enable io controller on $CGROUP2_DIR"
> +		if [[ $deactivate_io_ctrlr == true ]]; then
> +			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
> +		fi
> +		_exit_cgroup2
> +		_exit_null_blk
> +		return 1
> +	fi

The initialization part above has duplications with _set_up_throtl() in
tests/throtl/rc. Is it considered to use _set_up_throtl()? I guess it will
simplify this test case, and make it consisten with other test cases in the
throtl group. It will also allow to run the test case for scsi_debug.

I created a trial patch that applis on top of this patch, which uses
_set_up_throtl(). It is attached to this mail for your reference. It drops
some null_blk options. As far as I did trial runs, it does not look affecting
the run results.


[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 4582 bytes --]

diff --git a/tests/throtl/008 b/tests/throtl/008
index 0570fc0..b1bcf84 100755
--- a/tests/throtl/008
+++ b/tests/throtl/008
@@ -4,76 +4,36 @@
 #
 # Test cgroup iocost IOPS limiting.
 
-. tests/block/rc
-. common/null_blk
-. common/cgroup
+. tests/throtl/rc
 . common/fio
 
 DESCRIPTION="test cgroup iocost controller limits"
 
 requires() {
-	_have_cgroup2_controller io
-	_have_null_blk
 	_have_fio
-	_have_program bc
 	if [[ ! -e "$(_cgroup2_base_dir)/io.cost.qos" ]]; then
 		SKIP_REASONS+=("iocost controller not supported (CONFIG_BLK_CGROUP_IOCOST)")
 		return 1
 	fi
 }
 
+set_conditions() {
+	_set_throtl_blkdev_type "$@"
+}
+
 test() {
 	echo "Running ${TEST_NAME}"
 
-	# Create a null_blk instance
-	local null_blk_params=(
-		blocksize=4096
-		completion_nsec=0
-		memory_backed=0
-		size=1024 # MB
-		submit_queues=1
-		power=1
-	)
-	_init_null_blk nr_devices=0 queue_mode=2 &&
-	if ! _configure_null_blk nullb0 "${null_blk_params[@]}"; then
-		echo "Configuring null_blk failed"
+	# Set up throtl device and cgroup
+	if ! _set_up_throtl; then
 		return 1
 	fi
 
 	local dev_t
-	dev_t=$(</sys/block/nullb0/dev)
+	dev_t=$(</sys/block/"$THROTL_DEV"/dev)
 	if [[ -z $dev_t ]]; then
-		echo "Failed to get major:minor for nullb0"
-		_exit_null_blk
-		return 1
-	fi
-
-	# Initialize cgroups
-	if ! _init_cgroup2; then
-		echo "Initializing cgroup2 failed"
-		_exit_null_blk
-		return 1
-	fi
-
-	# Enable io controller in the subtree
-	local deactivate_io_ctrlr=false
-	if ! grep -wq io "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
-		if ! echo "+io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"; then
-			echo "Failed to enable io controller on cgroup root"
-			_exit_cgroup2
-			_exit_null_blk
-			return 1
-		fi
-		deactivate_io_ctrlr=true
-	fi
-
-	if ! echo "+io" > "$CGROUP2_DIR/cgroup.subtree_control"; then
-		echo "Failed to enable io controller on $CGROUP2_DIR"
-		if [[ $deactivate_io_ctrlr == true ]]; then
-			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
-		fi
-		_exit_cgroup2
-		_exit_null_blk
+		echo "Failed to get major:minor for $THROTL_DEV"
+		_clean_up_throtl
 		return 1
 	fi
 
@@ -81,11 +41,7 @@ test() {
 	# min=100.00 max=100.00 forces vrate to be fixed at 100%
 	if ! echo "$dev_t enable=1 min=100.00 max=100.00" > "$(_cgroup2_base_dir)/io.cost.qos"; then
 		echo "Failed to configure io.cost.qos"
-		if [[ $deactivate_io_ctrlr == true ]]; then
-			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
-		fi
-		_exit_cgroup2
-		_exit_null_blk
+		_clean_up_throtl
 		return 1
 	fi
 
@@ -94,25 +50,21 @@ test() {
 	if ! echo "$dev_t ctrl=user model=linear rbps=0 rseqiops=100 rrandiops=100 wbps=0 wseqiops=10 wrandiops=10" > "$(_cgroup2_base_dir)/io.cost.model"; then
 		echo "Failed to configure io.cost.model"
 		echo "$dev_t enable=0" > "$(_cgroup2_base_dir)/io.cost.qos"
-		if [[ $deactivate_io_ctrlr == true ]]; then
-			echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"
-		fi
-		_exit_cgroup2
-		_exit_null_blk
+		_clean_up_throtl
 		return 1
 	fi
 
 	# Create a child cgroup for test
 	local cg_name="testgrp"
-	local cg_path="$CGROUP2_DIR/$cg_name"
+	local cg_path="$CGROUP2_DIR/$THROTL_DIR/$cg_name"
 	mkdir "$cg_path"
 
 	# 1. Run FIO read test
 	local -a FIO_PERF_FIELDS
 	FIO_PERF_FIELDS=("read iops")
-	if ! _fio_perf --bs=4k --rw=randread --name=read-test --filename=/dev/nullb0 \
+	if ! _fio_perf --bs=4k --rw=randread --name=read-test --filename=/dev/"$THROTL_DEV" \
 			--ioengine=libaio --direct=1 --iodepth=64 --time_based --runtime=10 \
-			--cgroup="blktests/$cg_name" > /dev/null 2>&1; then
+			--cgroup="blktests/$THROTL_DIR/$cg_name" > /dev/null 2>&1; then
 		echo "FIO read test failed"
 	else
 		local read_iops=${TEST_RUN["read iops"]}
@@ -125,9 +77,9 @@ test() {
 
 	# 2. Run FIO write test
 	FIO_PERF_FIELDS=("write iops")
-	if ! _fio_perf --bs=4k --rw=randwrite --name=write-test --filename=/dev/nullb0 \
+	if ! _fio_perf --bs=4k --rw=randwrite --name=write-test --filename=/dev/"$THROTL_DEV" \
 			--ioengine=libaio --direct=1 --iodepth=64 --time_based --runtime=10 \
-			--cgroup="blktests/$cg_name" > /dev/null 2>&1; then
+			--cgroup="blktests/$THROTL_DIR/$cg_name" > /dev/null 2>&1; then
 		echo "FIO write test failed"
 	else
 		local write_iops=${TEST_RUN["write iops"]}
@@ -140,13 +92,7 @@ test() {
 
 	# Clean up cgroups
 	rmdir "$cg_path"
-	if [[ $deactivate_io_ctrlr == true ]]; then
-		{ echo "-io" > "$(_cgroup2_base_dir)/cgroup.subtree_control"; } &> "${FULL}"
-	fi
-	_exit_cgroup2
-
-	# Clean up null_blk
-	_exit_null_blk
+	_clean_up_throtl
 
 	echo "Test complete"
 }

^ permalink raw reply related

* [PATCH] nvmet-rdma: reject inline data with a nonzero offset
From: Bryam Vargas @ 2026-06-04  8:46 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, Chaitanya Kulkarni
  Cc: linux-nvme, linux-rdma, linux-block
In-Reply-To: <ahm6Ksr3rfGdnOsN@kbusch-mbp>

nvmet_rdma_map_sgl_inline() takes a host-controlled offset and length
from the inline SGL descriptor and bounds-checks them against the
per-port inline_data_size:

	u64 off = le64_to_cpu(sgl->addr);
	u32 len = le32_to_cpu(sgl->length);
	...
	if (off + len > rsp->queue->dev->inline_data_size)
		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;

This is unsound whenever the offset is nonzero:

 - "off + len" is evaluated in u64 and wraps modulo 2^64.  A descriptor
   with addr = 0xfffffffffffffe00 and length = 0x1000 wraps the sum to
   0xe00 and passes the check.  nvmet_rdma_use_inline_sg() then stores
   the offset into scatterlist::offset (unsigned int) and the block
   layer reads out of bounds of the inline page; a large len also makes
   num_pages(len) exceed NVMET_RDMA_MAX_INLINE_SGE and overruns the
   fixed-size inline_sg[] array.

 - Even computed without wrapping, inline_data_size is configurable up
   to max(SZ_16K, PAGE_SIZE).  An offset in (PAGE_SIZE, inline_data_size]
   passes the bound and then "PAGE_SIZE - off" in
   nvmet_rdma_use_inline_sg() underflows, leaving scatterlist::length at
   ~4 GiB and the offset pointing past the first inline page.

A nonzero inline offset is never legitimate here.  nvmet advertises
icdoff = 0, nvme_rdma_setup_ctrl() refuses to use a controller that
reports a nonzero icdoff ("icdoff is not supported!"), and
nvme_rdma_map_sg_inline() sets the inline descriptor addr to icdoff, so
a compliant initiator always sends offset 0.  nvmet_rdma_use_inline_sg()
likewise assumes the inline data begins at the start of the first inline
page (the RNIC DMAs it to page offset 0); any nonzero offset also
mis-describes the scatterlist even when it is in bounds.

Reject a nonzero offset directly.  This closes the u64 overflow, the
inline_sg[] overrun and the PAGE_SIZE - off underflow together, and is
simpler than bounding the offset.

Fixes: 0d5ee2b2ab4f ("nvmet-rdma: support max(16KB, PAGE_SIZE) inline data")
Cc: stable@vger.kernel.org
Reported-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
Keith, thanks for the suggested form

	if (off > rsp->queue->dev->inline_data_size ||
	    len > rsp->queue->dev->inline_data_size - off)

It does stop the u64 overflow, but while testing it I found it is still
incomplete when a port is configured with inline_data_size > PAGE_SIZE
(it is settable up to max(SZ_16K, PAGE_SIZE)): an offset in
(PAGE_SIZE, inline_data_size] passes that bound and then "PAGE_SIZE - off"
in nvmet_rdma_use_inline_sg() underflows, leaving scatterlist::length at
~4 GiB pointing past the first inline page. The block backend then
executes the out-of-bounds read (KASAN trace below). Since a compliant
initiator never sends a nonzero inline offset (nvmet advertises
icdoff = 0 and nvme_rdma_setup_ctrl() refuses a nonzero icdoff),
rejecting off != 0 closes that case too and is even simpler, so this
formal patch uses that instead of bounding the offset.

Verified on a KASAN build (inline_data_size = 16384) over an rdma_rxe
soft-RoCE loopback nvmet-rdma target with a block backend:
  - offset 0, 4 KiB inline write: succeeds, clean (control).
  - offset 8192, len 4096: without this patch the bounds check passes
    and the block backend executes the out-of-bounds read
      BUG: KASAN: slab-out-of-bounds in copy_folio_from_iter_atomic
      Read of size 4096 ...
    with this patch it is rejected ("invalid inline data offset!").
  - offset 4095 (< PAGE_SIZE): without this patch it is in bounds but
    mis-describes the SGL (NVME_SC_SGL_INVALID_DATA, no OOB); with this
    patch it is rejected up front.
  - offset 0 keeps working (no regression for compliant initiators).

 drivers/nvme/target/rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -854,7 +854,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
 		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}

-	if (off + len > rsp->queue->dev->inline_data_size) {
+	if (off || len > rsp->queue->dev->inline_data_size) {
 		pr_err("invalid inline data offset!\n");
 		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
 	}
--
2.43.0

^ permalink raw reply

* Re: [PATCH] nvmet-rdma: reject inline data with a nonzero offset
From: Keith Busch @ 2026-06-04  9:32 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme,
	linux-rdma, linux-block
In-Reply-To: <20260604084624.120032-1-hexlabsecurity@proton.me>

On Thu, Jun 04, 2026 at 08:46:33AM +0000, Bryam Vargas wrote:
> A nonzero inline offset is never legitimate here.  nvmet advertises
> icdoff = 0, nvme_rdma_setup_ctrl() refuses to use a controller that
> reports a nonzero icdoff ("icdoff is not supported!"), and
> nvme_rdma_map_sg_inline() sets the inline descriptor addr to icdoff, so
> a compliant initiator always sends offset 0.  nvmet_rdma_use_inline_sg()
> likewise assumes the inline data begins at the start of the first inline
> page (the RNIC DMAs it to page offset 0); any nonzero offset also
> mis-describes the scatterlist even when it is in bounds.

Wait, is this accurate? I'm pretty sure icdoff == 0 just means the host
can start the inline data immediately after the SQE, not that it
necessarily must do that. My understanding is offsets are still allowed
as long as the total length fits.

^ permalink raw reply

* Re: [PATCH] nvmet-rdma: reject inline data with a nonzero offset
From: Keith Busch @ 2026-06-04 10:22 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme,
	linux-rdma, linux-block
In-Reply-To: <20260604084624.120032-1-hexlabsecurity@proton.me>

On Thu, Jun 04, 2026 at 08:46:33AM +0000, Bryam Vargas wrote:
> It does stop the u64 overflow, but while testing it I found it is still
> incomplete when a port is configured with inline_data_size > PAGE_SIZE
> (it is settable up to max(SZ_16K, PAGE_SIZE)): an offset in
> (PAGE_SIZE, inline_data_size] passes that bound and then "PAGE_SIZE - off"
> in nvmet_rdma_use_inline_sg() underflows, leaving scatterlist::length at
> ~4 GiB pointing past the first inline page.

Then the use_inline_sg() should find the appropriate index and offset
accordingly.

---
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 1234567..abcdefg 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -821,22 +821,29 @@ static void nvmet_rdma_use_inline_sg(struct nvmet_rdma_rsp *rsp, u32 len,
 		u64 off)
 {
-	int sg_count = num_pages(len);
+	u64 page_off = off % PAGE_SIZE;
+	u64 page_idx = off / PAGE_SIZE;
+	int sg_count = num_pages(page_off + len);
 	struct scatterlist *sg;
 	int i;
 
-	sg = rsp->cmd->inline_sg;
+	sg = &rsp->cmd->inline_sg[page_idx];
 	for (i = 0; i < sg_count; i++, sg++) {
 		if (i < sg_count - 1)
 			sg_unmark_end(sg);
 		else
 			sg_mark_end(sg);
-		sg->offset = off;
-		sg->length = min_t(int, len, PAGE_SIZE - off);
+		sg->offset = page_off;
+		sg->length = min_t(u64, len, PAGE_SIZE - page_off);
 		len -= sg->length;
-		if (!i)
-			off = 0;
+		page_off = 0;
 	}
 
-	rsp->req.sg = rsp->cmd->inline_sg;
+	rsp->req.sg = &rsp->cmd->inline_sg[page_idx];
 	rsp->req.sg_cnt = sg_count;
 }
 
@@ -857,7 +864,8 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
 			return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
-	if (off + len > rsp->queue->dev->inline_data_size) {
+	if (off > rsp->queue->dev->inline_data_size ||
+	    len > rsp->queue->dev->inline_data_size - off) {
 		pr_err("invalid inline data offset!\n");
 		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
 	}
--

^ permalink raw reply

* [PATCH] block: Add WQ_PERCPU to alloc_workqueue users
From: Marco Crivellari @ 2026-06-04 10:53 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: Tejun Heo, Lai Jiangshan, Frederic Weisbecker,
	Sebastian Andrzej Siewior, Marco Crivellari, Michal Hocko,
	Damien Le Moal, Jens Axboe

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
---
 block/blk-zoned.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 6a221c180889..bea817f3de56 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1924,7 +1924,7 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
 		goto free_hash;

 	disk->zone_wplugs_wq =
-		alloc_workqueue("%s_zwplugs", WQ_MEM_RECLAIM | WQ_HIGHPRI,
+		alloc_workqueue("%s_zwplugs", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_PERCPU,
 				pool_size, disk->disk_name);
 	if (!disk->zone_wplugs_wq)
 		goto destroy_pool;
-- 
2.54.0

^ permalink raw reply related

* Re: configurable block error injection
From: Daniel Gomez @ 2026-06-04 11:06 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jonathan Corbet, linux-block, linux-doc, bpf, linux-kselftest,
	Luis Chamberlain, Masami Hiramatsu, Brendan Gregg, GOST,
	Shin'ichiro Kawasaki
In-Reply-To: <20260602150503.GA6887@lst.de>

On 02/06/2026 17.05, Christoph Hellwig wrote:
> On Tue, Jun 02, 2026 at 11:58:25AM +0200, Daniel Gomez wrote:
>> I wonder if the block layer would be interested in moving block error
>> injection off the should_fail() fault injection framework and extending
>> the ALLOW_ERROR_INJECTION annotation instead and offloading all the
>> debugfs configuration logic (block/error-injection.c) into eBPF?
> 
> I've looked into plain ALLOW_ERROR_INJECTION-based injection and it
> is not very useful.  I didn't even now eBPF could use it, 

For context: Josef Bacik introduced it first for BPF only
(BPF_ALLOW_ERROR_INJECTION). Masami Hiramatsu then generalized it into
the error injection framework, renaming it to ALLOW_ERROR_INJECTION
and adding the fail_function debugfs interface (which calls
should_fail()). So annotating a function with ALLOW_ERROR_INJECTION
gives you both backends at once: debugfs (fail_function) and eBPF
(bpf_override_return()).

> but I
> looked into other eBPF injections and at least for my uses cases
> it was a bit of a mess.  

Agreed, and I had not looked closely enough at the series
before proposing the wrong primitives. ALLOW_ERROR_INJECTION /
bpf_override_return() are not sufficient here: bpf_override_return()
cannot set bio->bi_status or call bio_endio(), which I think is the
key operation here.

IIU the series correctly, and oversimplifying: when
injection is enabled and a bio matches, the block layer completes the
bio inline with the chosen blk_status_t (the status= rule from debugfs)
via bio_endio_status(). The submission path returns to the caller
immediately, with the bio already in the error state. Nothing is ever
sent to the device, but the completion path sees the injected error.

submit_bio() -> submit_bio_noacct() -> submit_bio_noacct_nocheck()
      +-- Path 1: no match -> continue normal IO submission
      +-- Path 2: match (diskN) -> blk_error_inject()
                     -> bio_endio_status(bio, inj->status)
                          -> bio_endio()
                     // error injected. bio completed

So this is bio mutating + a bio_endio() call, not a return override.
That can't be solved with ALLOW_ERROR_INJECTION. But we can use
BPF_PROG_TYPE_STRUCT_OPS instead: the kernel keeps ownership of the
bio_endio_status() call, and the eBPF program only drives the policy,
ie. which blk_status_t to return for a given bio, based on whatever 
heuristics it implements.

> I'd have to allow access to certain bio

With struct_ops the bio is passed to the ebpf side as read-only, bio 
fields can be read to decide the policy but cannot write them. Is 
read-only access to bio fields itself a concern?

> fields and would have create a stable UAPI for commands and status
> using the fake BTF struct access which really would not be a good
> idea here as we need to be able to change internals.  

That should not be a problem at all. With CO-RE (compile once, run
everywhere) the program resolves the bio field offsets against the
BTF of the kernel it loads on, so it adapts dynamically if the layout
changes. The contract is just the struct_ops callback signature: a
struct bio * argument and a blk_status_t return. And that doesn't imply
any UAPI commitment AFAIK.

> Additionally
> having fully BTF-enabled toolchains in test VMs is not great either.

Are you referring to the old BCC toolchain requirements [1]? This is
solved in CO-RE [2]. The toolchain (Clang/LLVM, pahole) stays on the
build host; the test VM only needs the prebuilt BPF object, libbpf at
runtime, and the kernel's own BTF (CONFIG_DEBUG_INFO_BTF). No compiler
or BTF toolchain is required inside the VM. Clang/LLVM 10+ is enough to
build CO-RE libbpf tools [3].

Link: https://ebpf.io/what-is-ebpf/#how-are-ebpf-programs-written [1]
Link: https://nakryiko.com/posts/bpf-portability-and-co-re/ [2]
Link: https://github.com/libbpf/libbpf#bpf-co-re-compile-once--run-everywhere [1]

> 
> I've also not actually found any good map type for range lookups,
> which is kinda essential here.

Are you referring to the bio sector range comparison in
__blk_error_inject()? I don't think that needs to be delegated to a
BPF map (Documentation/bpf/maps.rst). The ebpf side has direct access
to the bio fields, so it can apply the same sector/op filtering
__blk_error_inject() does today. That match is already a linear list
walk, so the ebpf program just runs the same [start, end] condition
check.

In summary: what do you think of evolving this series
into eBPF, but BPF_PROG_TYPE_STRUCT_OPS instead of
ALLOW_ERROR_INJECTION/bpf_override_return()?. The configurable debugfs
injection is a nice enhancement over what we had, but the matching is
static (op/start/nr_sectors/status/chance) per gendisk. A struct_ops
hook would expand the failure model to anything derivable from the bio
(plus any state the program keeps), with the kernel still owning the
bio_endio_status() mutation and only the policy moving out of the tree.
The benefit is the flexibility to express the policy without hard-coding
the model in the kernel.

The open question is whether programmable injection is something we want
to support in-tree, or whether more debugfs knobs suffice. Thoughts?

^ permalink raw reply

* Re: [PATCH] lib: free pagelist on error in iov_iter_extract_pages()
From: Fedor Pchelkin @ 2026-06-04 13:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Antipov, Jens Axboe, Christoph Hellwig, linux-block,
	linux-fsdevel, lvc-project, Caleb Sander Mateos
In-Reply-To: <20260508144823.0213964cb15e7ef8fbdcd7c5@linux-foundation.org>

On Fri, 08. May 14:48, Andrew Morton wrote:
> On Fri,  8 May 2026 14:13:29 +0300 Dmitry Antipov <dmantipov@yandex.ru> wrote:
> 
> > Since 'iov_iter_extract_pages()' may allocate new pagelist if the passed
> > one isn't large enough, the worst-case scenario may be:
> > 
> > ...
> > struct page *stack_pages[SMALL];
> > struct page **pages = stack_pages;
> > ...
> > if (iov_iter_extract_pages(i..., &pages, ...) <= 0) {
> >         /* Even in case of error, new pagelist may be allocated */
> >         if (pages != stack_pages)
> >                 kvfree(pages);                                  [1]
> >         /* The rest of error handling and return */
> > }
> > /* Regular flow */
> > ...
> > if (pages != stack_pages)
> >         kvfree(pages);
> > ...
> > return 0;
> > 
> > If you're unlucky so SMALL amount of pages wasn't enough and new
> > pagelist was allocated, missing [1] causes the memory leak similar
> > to one I've recently observed and fixed for 6.12 in [2]. So adjust
> > 'iov_iter_extract_pages()' to make such a cleanup itself rather than
> > rely on caller's handling on error paths, thus making [1] not needed.
> 
> AI review said things:
> 	https://sashiko.dev/#/patchset/20260508111329.329943-1-dmantipov@yandex.ru

The current v1 patch is in mm-nonmm-unstable branch of your repo [1].
Please drop it.

Apart from AI, there were comments from Caleb Sander Mateos in this
thread.  v2 then appeared [2], but it had some issues, too.  Eventually it
seems a direct patch into block subsystem [3] has found some positive
feedback.  (Though it's still not applied but that question is left for
block subsystem maintainers)

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-nonmm-unstable&id=3ba07338fe5aae5f0f6b09d9029e1ccb4434b867
[2]: https://lore.kernel.org/all/20260512170525.357573-1-dmantipov@yandex.ru/
[3]: https://lore.kernel.org/all/20260513070515.528861-1-dmantipov@yandex.ru/

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox