Linux block layer

* [PATCH] virtio-blk: use little-endian types for the zoned fields
From: Michael Bommarito @ 2026-06-17 15:17 UTC
  To: Michael S . Tsirkin, Jason Wang
  Cc: Stefan Hajnoczi, Stefano Garzarella, Dmitry Fomichev,
	Damien Le Moal, Jens Axboe, Paolo Bonzini, virtualization,
	linux-block, linux-kernel

The zoned block-device fields in the virtio-blk header are typed
__virtio{32,64}, so their endianness follows VIRTIO_F_VERSION_1. The
zoned feature is only defined for VIRTIO 1.x devices, and the virtio
specification defines all of its fields as little-endian. Commit
b16a1756c716 ("virtio_blk: mark all zone fields LE") tagged them
__le* for exactly this reason, but commit f1ba4e674feb ("virtio-blk:
fix to match virtio spec") re-applied the reviewed version of the
original zoned series -- which predated b16a1756 -- and silently
restored the __virtio* typing together with the matching
virtio*_to_cpu() / virtio_cread() accessors in the driver.

Restore the little-endian typing for the zoned configuration-space
characteristics, the zone descriptor, the zone report header and the
ZONE_APPEND in-header sector, and read them with le*_to_cpu() and
virtio_cread_le() to match.

There is no functional change on any spec-compliant device: zoned
requires VIRTIO_F_VERSION_1, and for a VERSION_1 device
virtio*_to_cpu() is identical to le*_to_cpu(). The change makes the
uapi types describe the actual wire format and removes a latent
endianness mismatch for a (non-conformant) legacy device on a
big-endian guest.

Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec")
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Testing:
 - Builds with no new warnings; sparse endian-clean (C=2,
   __CHECK_ENDIAN__, CONFIG_BLK_DEV_ZONED=y) both before and after.
 - Booted under QEMU with a host-managed zoned device exposed through
   virtio-blk. Zone revalidation, blkzone report and a sequential
   write / write-pointer check return correct values; blktests zbd
   device tests 001-006 (sysfs+ioctl, report zone, reset, write split,
   write ordering, revalidate) pass, with results identical before and
   after this change -- expected, since on a VIRTIO_F_VERSION_1 device
   virtio*_to_cpu() == le*_to_cpu().

 drivers/block/virtio_blk.c      | 38 +++++++++++++++------------------
 include/uapi/linux/virtio_blk.h | 18 ++++++++--------
 2 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27fe00f3..5532cfbde7bfe 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -99,7 +99,7 @@ struct virtblk_req {
 		 * be the last byte.
 		 */
 		struct {
-			__virtio64 sector;
+			__le64 sector;
 			u8 status;
 		} zone_append;
 	} in_hdr;
@@ -335,14 +335,12 @@ static inline void virtblk_request_done(struct request *req)
 {
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	blk_status_t status = virtblk_result(virtblk_vbr_status(vbr));
-	struct virtio_blk *vblk = req->mq_hctx->queue->queuedata;
 
 	virtblk_unmap_data(req, vbr);
 	virtblk_cleanup_cmd(req);
 
 	if (req_op(req) == REQ_OP_ZONE_APPEND)
-		req->__sector = virtio64_to_cpu(vblk->vdev,
-						vbr->in_hdr.zone_append.sector);
+		req->__sector = le64_to_cpu(vbr->in_hdr.zone_append.sector);
 
 	blk_mq_end_request(req, status);
 }
@@ -589,13 +587,13 @@ static int virtblk_parse_zone(struct virtio_blk *vblk,
 {
 	struct blk_zone zone = { };
 
-	zone.start = virtio64_to_cpu(vblk->vdev, entry->z_start);
+	zone.start = le64_to_cpu(entry->z_start);
 	if (zone.start + vblk->zone_sectors <= get_capacity(vblk->disk))
 		zone.len = vblk->zone_sectors;
 	else
 		zone.len = get_capacity(vblk->disk) - zone.start;
-	zone.capacity = virtio64_to_cpu(vblk->vdev, entry->z_cap);
-	zone.wp = virtio64_to_cpu(vblk->vdev, entry->z_wp);
+	zone.capacity = le64_to_cpu(entry->z_cap);
+	zone.wp = le64_to_cpu(entry->z_wp);
 
 	switch (entry->z_type) {
 	case VIRTIO_BLK_ZT_SWR:
@@ -687,8 +685,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
 		if (ret)
 			goto fail_report;
 
-		nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
-			   nr_zones);
+		nz = min_t(u64, le64_to_cpu(report->nr_zones), nr_zones);
 		if (!nz)
 			break;
 
@@ -698,8 +695,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
 			if (ret)
 				goto fail_report;
 
-			sector = virtio64_to_cpu(vblk->vdev,
-						 report->zones[i].z_start) +
+			sector = le64_to_cpu(report->zones[i].z_start) +
 				 vblk->zone_sectors;
 			zone_idx++;
 		}
@@ -725,18 +721,18 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 
 	lim->features |= BLK_FEAT_ZONED;
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_open_zones, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_open_zones, &v);
 	lim->max_open_zones = v;
 	dev_dbg(&vdev->dev, "max open zones = %u\n", v);
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_active_zones, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_active_zones, &v);
 	lim->max_active_zones = v;
 	dev_dbg(&vdev->dev, "max active zones = %u\n", v);
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.write_granularity, &wg);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.write_granularity, &wg);
 	if (!wg) {
 		dev_warn(&vdev->dev, "zero write granularity reported\n");
 		return -ENODEV;
@@ -750,8 +746,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 	 * virtio ZBD specification doesn't require zones to be a power of
 	 * two sectors in size, but the code in this driver expects that.
 	 */
-	virtio_cread(vdev, struct virtio_blk_config, zoned.zone_sectors,
-		     &vblk->zone_sectors);
+	virtio_cread_le(vdev, struct virtio_blk_config, zoned.zone_sectors,
+			&vblk->zone_sectors);
 	if (vblk->zone_sectors == 0 || !is_power_of_2(vblk->zone_sectors)) {
 		dev_err(&vdev->dev,
 			"zoned device with non power of two zone size %u\n",
@@ -767,8 +763,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 		lim->max_hw_discard_sectors = 0;
 	}
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_append_sectors, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_append_sectors, &v);
 	if (!v) {
 		dev_warn(&vdev->dev, "zero max_append_sectors reported\n");
 		return -ENODEV;
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 3744e4da1b2a7..5af2a0300bb9d 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -140,11 +140,11 @@ struct virtio_blk_config {
 
 	/* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
 	struct virtio_blk_zoned_characteristics {
-		__virtio32 zone_sectors;
-		__virtio32 max_open_zones;
-		__virtio32 max_active_zones;
-		__virtio32 max_append_sectors;
-		__virtio32 write_granularity;
+		__le32 zone_sectors;
+		__le32 max_open_zones;
+		__le32 max_active_zones;
+		__le32 max_append_sectors;
+		__le32 write_granularity;
 		__u8 model;
 		__u8 unused2[3];
 	} zoned;
@@ -241,11 +241,11 @@ struct virtio_blk_outhdr {
  */
 struct virtio_blk_zone_descriptor {
 	/* Zone capacity */
-	__virtio64 z_cap;
+	__le64 z_cap;
 	/* The starting sector of the zone */
-	__virtio64 z_start;
+	__le64 z_start;
 	/* Zone write pointer position in sectors */
-	__virtio64 z_wp;
+	__le64 z_wp;
 	/* Zone type */
 	__u8 z_type;
 	/* Zone state */
@@ -254,7 +254,7 @@ struct virtio_blk_zone_descriptor {
 };
 
 struct virtio_blk_zone_report {
-	__virtio64 nr_zones;
+	__le64 nr_zones;
 	__u8 reserved[56];
 	struct virtio_blk_zone_descriptor zones[];
 };
-- 
2.53.0
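
For illustration, the equivalence claimed in the commit message can be
sketched in userspace C. This is a toy model under stated assumptions, not
the kernel's implementation: the toy_* helpers and the explicit
guest_big_endian flag are hypothetical names made up for the example, standing
in for le64_to_cpu() and virtio64_to_cpu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode 8 wire bytes as little-endian, independent of host byte order. */
static uint64_t toy_le64_to_cpu(const uint8_t b[8])
{
	uint64_t v = 0;

	assert(b != NULL);
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | b[i];
	return v;
}

/* Decode 8 wire bytes as big-endian (what a big-endian guest sees natively). */
static uint64_t toy_be64_to_cpu(const uint8_t b[8])
{
	uint64_t v = 0;

	assert(b != NULL);
	for (int i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}

/*
 * Toy "virtio-endian" read: little-endian when the device negotiated
 * VERSION_1, otherwise guest-native byte order (legacy behaviour).
 */
static uint64_t toy_virtio64_to_cpu(bool version_1, bool guest_big_endian,
				    const uint8_t b[8])
{
	if (version_1 || !guest_big_endian)
		return toy_le64_to_cpu(b);
	return toy_be64_to_cpu(b);
}
```

In this model the two decoders agree in every case except a legacy
(non-VERSION_1) device on a big-endian guest, which is the only combination
where the __virtio* typing and the __le* typing would read a zoned field
differently.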



* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-17 15:08 UTC
  To: Dr. David Alan Gilbert
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajGtbuJ2kXo1GZ1d@gallifrey>

On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2

So this is a subtle difference from your original report which ran
lvcreate a little differently:

  # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2

This patch series addresses the problem in the original report, where the
"--type mirror" parameter selects dm-raid1.c instead of md/raid1.c.
Knowing that detail makes this a trivial matter to fix now, so I'll send
a separate patch for that. But this series should be good to go for the
original issue on the legacy dm mirror.


* Re: [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: Jens Axboe @ 2026-06-17 15:05 UTC
  To: Anuj Gupta, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Alok Rathore
In-Reply-To: <20260617060850.1244788-1-anuj20.g@samsung.com>

On 6/17/26 12:08 AM, Anuj Gupta wrote:
> blk_hctx_poll() can busy-poll until a completion is found or
> need_resched() becomes true. On preemptible kernels, the scheduler can
> set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
> return before the loop condition re-evaluates it. After the context
> switch, the flag is cleared, so the poller can continue spinning instead
> of returning to its caller.
> 
> This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
> which holds the rcu_read_lock() while calling bio_poll(). If another
> poller on the same polled queue drains the available completions, this
> poller may repeatedly find no completions and remain inside the RCU
> read-side critical section long enough to trigger RCU stall reports:
> 
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
> rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
> task:fio state:R  running task     stack:0     pid:3961
> Call Trace:
> <TASK>
> ? nvme_poll+0x36/0xa0 [nvme]
> ? blk_hctx_poll+0x39/0x90
> ? blk_mq_poll+0x30/0x60
> ? bio_poll+0x87/0x170
> ? iocb_bio_iopoll+0x32/0x50
> ? io_uring_classic_poll+0x25/0x50
> ? io_do_iopoll+0x216/0x420
> ? __do_sys_io_uring_enter+0x2c7/0x7c0
> 
> Reproducible with:
> 
> fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
> --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
> --registerfiles=1 --group_reporting --thread
> 
> Record the starting jiffy and exit the loop once jiffies has advanced.
> This bounds each blk_hctx_poll() invocation while also covering the
> case where the reschedule flag was cleared by the context switch
> before the loop condition could observe it.
> 
> Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
> Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
> ---
>  block/blk-mq.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..ae6c5f4b80ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>  			 struct io_comp_batch *iob, unsigned int flags)
>  {
>  	int ret;
> +	unsigned long timeout = jiffies + 1;

I wonder if it'd be better to set this to jiffies + 2, just to avoid
skipping after a single loop if jiffies changes right after this.
Probably not a big deal, on average it should be fine. But also should
not really matter if this is potentially spinning uselessly 10ms or 20ms
at most, if HZ == 100. Similarly, this is also a misuse/misconfiguration
if you end up having multiple pollers on the same queue. Yes it'll work,
but it's a terrible idea for obvious reasons. Hence the patch is mostly
about ensuring that bad case isn't TOO terrible. But you'd really want
to sort out the app/config side of things in any case.
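
The jiffies + 1 versus jiffies + 2 trade-off can be sketched with a small
userspace simulation. Everything here is an assumption made for
illustration: toy_time_before() mimics a wraparound-safe tick comparison,
toy_spin_steps() is a hypothetical stand-in for a poll loop that never finds
a completion, and the loop is assumed to exit once the tick counter reaches
the recorded timeout.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe "a is before b" for a free-running tick counter. */
static int toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Simulate a poll loop that never finds a completion.  Time advances in
 * sub-tick steps: the tick counter increments once every steps_per_tick
 * steps, and the loop starts start_phase steps into the current tick.
 * Returns how many steps the loop spins before the timeout fires.
 */
static unsigned long toy_spin_steps(unsigned long start_tick,
				    unsigned long start_phase,
				    unsigned long steps_per_tick,
				    unsigned long budget_ticks)
{
	unsigned long tick = start_tick;
	unsigned long phase = start_phase;
	unsigned long timeout = start_tick + budget_ticks;
	unsigned long spun = 0;

	assert(steps_per_tick > 0 && start_phase < steps_per_tick);

	while (toy_time_before(tick, timeout)) {
		spun++;
		if (++phase == steps_per_tick) {
			phase = 0;
			tick++;	/* the timer interrupt advances the counter */
		}
	}
	return spun;
}
```

With 10 steps per tick, a budget of one tick spins for anywhere between 1
and 10 steps depending on where in the tick the loop started, while a budget
of two ticks spins for between 11 and 20 steps. That is the "10ms or 20ms at
most" bound at HZ == 100, and it shows why a one-tick budget can expire
almost immediately.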

-- 
Jens Axboe


* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Jianyue Wu @ 2026-06-17 14:01 UTC
  To: Christoph Hellwig
  Cc: Andrew Morton, Chris Li, Baoquan He, Nhat Pham, Barry Song,
	Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <20260617061743.GA19844@lst.de>

On Wed, Jun 17, 2026 at 2:17 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jun 17, 2026 at 11:38:02AM +0800, Jianyue Wu wrote:
> > Before I rework or drop the RFC, could you outline how you see that
> > core-side model working? In particular:
> >   - How should a compressed backend like zram or future block device
> >     plug into swap_iocb / swap_ops?
>
> I don't think that is the right layer.  The virtual swap layer that is
> currently in the process of being upstreamed is the right level, and
> the actual swap devices or swap files are just a dumb backend for what
> the higher level code does.
>
> >   - What role do you expect zram to keep while the legacy block interface
> >     remains: current block swap only, or something else?
>
> For now we'll need to keep it working as-is.  It is heavily used in
> android and potentially elsewhere.  Once we have zswap fully working
> in the virtual swap layer world it might make sense to say never
> compress again in zram when REQ_SWAP is set (or maybe a new
> REQ_COMPRESSED) so that we can use the core compression code without
> breaking existing setups.
>
Hello Christoph,

Thanks for the clarification.

I understand the goal is to have more common code in the core layer,
with dumb backends. On the swap path, once core has already compressed
the data, zram would only store it and not compress again, while
non-swap use of zram stays as-is.

Thanks,
Jianyue


* Re: [PATCH v3] rust: add procedural macro for declaring configfs attributes
From: Malte Wechter @ 2026-06-17 13:28 UTC
  To: Miguel Ojeda
  Cc: Andreas Hindborg, Breno Leitao, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Jens Axboe, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin,
	linux-kernel, rust-for-linux, linux-block, linux-modules
In-Reply-To: <CANiq72=fUuX7fF1=AQ12ahYHwW7k2i28Hkpo0TtUhwV7qeojQQ@mail.gmail.com>


On 6/17/26 11:32 AM, Miguel Ojeda wrote:
> On Wed, Jun 17, 2026 at 11:13 AM Malte Wechter <maltewechter@gmail.com> wrote:
>> As of now doc strings are not generated for private items in the macros
>> crate. I am moving the `parse_ordered_fields!` macro into
>> macros/helpers.rs but this means the doc strings are not generated for
>> the macro anymore. The `parse_ordered_fields!` macro is a larger helper
>> function, and the doc strings are relevant and helpful for macro
>> developers who want to use it.
> If it is private, then it is what it is, don't worry about it --
> developers can still read the source code.
>
> But, yes, having a render of the private items is something I have
> wanted for a long time, but as a runtime toggle, so that it is easy to
> go from one to the other (and without having to have 2 entire copies
> of the docs).
>
> Please see the entry "Private documentation (perhaps as an extension of
> the private items/fields toggle)" I have at:
>
>    https://github.com/Rust-for-Linux/linux/issues/350
>
> Upstream `rustdoc` implemented an MVP of the idea via CSS/JS in this draft PR:
>
>    https://github.com/rust-lang/rust/pull/141299
>
> If you want to help on that, then you could try it and leave some
> feedback there! :)
>
> Thanks!
>
> Cheers,
> Miguel
I will leave it as it is then.
I will try and look at it if time permits :)

Best regards,

Malte



* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Christian Brauner @ 2026-06-17 12:41 UTC
  To: Vincent Mailhol
  Cc: Jens Axboe, Davidlohr Bueso, Alexander Viro, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:
> DPS [1] defines GPT partition type UUIDs for OS partitions and
> attributes that control whether such partitions should be
> automatically discovered. The specification states that:
> 
>   The OS can discover and mount the necessary file systems with a
>   non-existent or incomplete /etc/fstab file and without the root=
>   kernel command line option.
> 
> DPS is already implemented in systemd-gpt-auto-generator [2], which,
> when embedded in an initrd, indeed allows automatic detection of the
> root filesystem through its partition type UUID.
> 
> This series adds this discovery feature directly into the kernel so
> that people who are not using systemd or not using an initrd can still
> benefit from it. The implementation follows the same model as
> systemd-gpt-auto-generator:

I happen to co-maintain the DPS. It is userspace policy and complex
userspace policy at that and does not belong in the kernel.

This also implements a really tiny portion of the spec. It deals with a
lot more complex concepts such as automatic partitioning during
installation, verity, LUKS, containers. This is really not intended for
the kernel at all. I mean, it's great that this spec is being used but I
do not want this in the kernel just for the sake of auto-discovery.

The DPS is completely generic and can be implemented by tooling other
than systemd (util-linux implements it and so does refind iirc). I think
not wanting to use or build alternative userspace tooling for this is a
really weak argument for pushing this into the kernel.


* Re: [PATCH v17 05/10] rust: page: convert to `Ownable`
From: Alice Ryhl @ 2026-06-17 11:36 UTC
  To: Andreas Hindborg
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman, Dave Ertman,
	Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core, Asahi Lina
In-Reply-To: <20260604-unique-ref-v17-5-7b4c3d2930b9@kernel.org>

On Thu, Jun 04, 2026 at 10:11:17PM +0200, Andreas Hindborg wrote:
> From: Asahi Lina <lina@asahilina.net>
> 
> This allows Page references to be returned as borrowed references,
> without necessarily owning the struct page.
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> [ Andreas: Fix formatting and add a safety comment. ]
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
> ---
>  rust/kernel/page.rs | 38 +++++++++++++++++++++++++-------------
>  1 file changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/rust/kernel/page.rs b/rust/kernel/page.rs
> index 3bdcee0e16a8..844c75e54134 100644
> --- a/rust/kernel/page.rs
> +++ b/rust/kernel/page.rs
> @@ -10,6 +10,11 @@
>      bindings,
>      error::code::*,
>      error::Result,
> +    types::{
> +        Opaque,
> +        Ownable,
> +        Owned, //
> +    },
>      uaccess::UserSliceReader, //
>  };
>  use core::{
> @@ -105,7 +110,7 @@ pub const fn page_align(addr: usize) -> Option<usize> {
>  ///
>  /// [`VBox`]: kernel::alloc::VBox
>  /// [`Vmalloc`]: kernel::alloc::allocator::Vmalloc
> -pub struct BorrowedPage<'a>(ManuallyDrop<Page>, PhantomData<&'a Page>);
> +pub struct BorrowedPage<'a>(ManuallyDrop<NonNull<Page>>, PhantomData<&'a Page>);

BorrowedPage<'a> is no longer needed because it's just &Page.

Alice


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Nilay Shroff @ 2026-06-17 11:08 UTC
  To: Shin'ichiro Kawasaki; +Cc: Ming Lei, linux-block, Jens Axboe
In-Reply-To: <ajCa9GrGoB4uXRpS@shinmob>

On 6/16/26 6:50 AM, Shin'ichiro Kawasaki wrote:
> On Jun 12, 2026 / 17:15, Nilay Shroff wrote:
>> On 6/12/26 4:36 PM, Ming Lei wrote:
>>> On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
>>>> On Jun 11, 2026 / 06:22, Ming Lei wrote:
>>>>> Hi Shin'ichiro,
>>>>
>>>> Hi Ming, thanks for the comments.
>>>>
>>>>>
>>>>> On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
>>>>>> I observed that the blktests test case block/005 hangs on a specific
>>>>>> server hardware using a specific HDD as a block device. During the test
>>>>>> case run, the kernel reported a KASAN null-ptr-deref (and other memory
>>>>>> corruption symptoms) [2]. This failure looked sporadic and hardware-
>>>>>> dependent.
>>>>>>
>>>>>>   From the kernel message, I noticed that udev-worker wrote to the
>>>>>> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
>>>>>> The test case block/005 also wrote to the same sysfs attribute, which
>>>>>
>>>>> sysfs write is supposed to be serialized...
>>>>
>>>> I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
>>>> I found elevator_change() call is guarded with the rw_semaphore
>>>> "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
>>>> lock. This does not serialize the sysfs writes.
>>>
>>> Please see kernfs_fop_write_iter(), in which mutex is held before calling
>>> ->write().
>>>
>> I think you're referring to @of->mutex here; however of->mutex is per struct
>> kernfs_open_file, which is associated with an open instance of the sysfs file.
>> The important point is that two separate opens can have different kernfs_open_file
>> instances and therefore different mutexes. Thus, concurrent write to same sysfs
>> attribute from two different processes may still be possible.
> 
> Thanks Nilay, I added debug prints to print the @of->mutex address, and observed
> that the address is different for each process and each file open. So, I don't
> think sysfs writes are serialized.
> 
>>
>>
>>>>
>>>> I tried the patch below to replace the reader lock with the writer lock. In
>>>> a quick trial, it appears to work: the kernel message is no longer observed and
>>>> the new test case does not cause hangs. I will do further testing to confirm
>>>> that this change does not trigger other new lockdep WARNs. Assuming it has no
>>>> such side effects, I hope this fix approach is acceptable. It doesn't add
>>>> a new lock, so I think it's the better option.
>>>>
>>>> diff --git a/block/elevator.c b/block/elevator.c
>>>> index 3bcd37c2aa34..b03185a217ff 100644
>>>> --- a/block/elevator.c
>>>> +++ b/block/elevator.c
>>>> @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>>    	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
>>>>    	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
>>>>    	 */
>>>> -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
>>>> +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
>>>>    		ret = -EBUSY;
>>>>    		goto out;
>>>>    	}
>>>> @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>>    	} else {
>>>>    		ret = -ENOENT;
>>>>    	}
>>>> -	up_read(&set->update_nr_hwq_lock);
>>>> +	up_write(&set->update_nr_hwq_lock);
>>>>    out:
>>>>    	if (ctx.type)
>>>>
>>>> [...]
>>>>
>>>>> blk_mq_sched_reg_debugfs already takes the debugfs lock, so I feel the proper
>>>>> fix could be to check for and avoid the null-ptr-deref.
>>>>
>>>> Actually, the null-ptr-deref is only one of the failure symptoms. A KASAN
>>>> slab-use-after-free is also observed [3], so I'm guessing that adding null
>>>> checks may not be enough.
>>>>
>>>>> Adding a new lock should usually be the last resort, especially since queue
>>>>> freeze depends on this one.
>>>>
>>>> Got it, thanks.
>>>>
>>>>
>>>> [3] KASAN slab-use-after-free
>>>
>>> Then you need to figure out the exact slab type and check if the pointer is cleared
>>> during free.
>>>
>>> Anyway, there is a guard already, so I don't see a reason to add a new lock to
>>> cover it.
>>>
>> Regarding the observed failure, my understanding is that blk_mq_debugfs_register_sched()
>> and blk_mq_debugfs_register_sched_hctx() access q->elevator without holding q->elevator_lock.
>> If multiple scheduler update paths run concurrently, one path can replace and free the
>> elevator while another path is still using it, which would explain the observed KASAN
>> use-after-free and NULL pointer dereference reports.
> 
> I have the same view. I think the use-after-free and the null-ptr-deref indicate
> that the elevator_queue object pointed to by q->elevator is the problem. References
> to the object are also kept in struct elv_change_ctx as ctx->old and
> ctx->new. Since these multiple references are used concurrently, I'm not sure
> that adding pointer clears and null checks would fix the problem.
> 
>>
>> With the proposed change, upgrading update_nr_hwq_lock from a reader lock to a writer
>> lock in elv_iosched_store() would serialize concurrent scheduler updates and therefore
>> prevent multiple elevator switch operations from running at the same time.
>>
>> Another way to fix this might be to acquire q->elevator_lock in blk_mq_sched_reg_debugfs()
>> and thus serialize access to q->elevator in blk_mq_debugfs_register_sched() and
>> blk_mq_debugfs_register_sched_hctx().
> 
> Thanks for the idea. I tried the patch below [X], but it triggered a WARN in
> debugfs_create_files() in block/blk-mq-debugfs.c [Y]. So I'm afraid this
> approach does not work.
> 
> At this moment, the writer lock in elv_iosched_store() looks like the solution
> to me, but further comments on other possible solutions are welcome.
> 
> 
> [X]
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 0a00f5a76f5a..12c582b6c713 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -394,9 +394,11 @@ void blk_mq_sched_reg_debugfs(struct request_queue *q)
>   	unsigned long i;
>   
>   	memflags = blk_debugfs_lock(q);
> +	mutex_lock(&q->elevator_lock);
>   	blk_mq_debugfs_register_sched(q);
>   	queue_for_each_hw_ctx(q, hctx, i)
>   		blk_mq_debugfs_register_sched_hctx(q, hctx);
> +	mutex_unlock(&q->elevator_lock);
>   	blk_debugfs_unlock(q, memflags);
>   }
>   
> 
> [Y]
> 
>   static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
>                                    void *data,
>                                    const struct blk_mq_debugfs_attr *attr)
>   {
>           lockdep_assert_held(&q->debugfs_mutex);
>           /*
>            * debugfs_mutex should not be nested under other locks that can be
>            * grabbed while queue is frozen.
>            */
>           lockdep_assert_not_held(&q->elevator_lock);    <----
>           lockdep_assert_not_held(&q->rq_qos_mutex);
>
> 

Yeah, I recall that assertion was added to avoid potential circular lockdep dependencies
when reclaim recurses back into the block layer. The concern is that ->elevator_lock and
->rq_qos_mutex can be acquired in code paths after the queue has been frozen. Consider
a scenario where one task freezes the queue and then attempts to acquire ->elevator_lock,
while another task already holds ->elevator_lock and subsequently triggers memory reclaim.
If reclaim recurses into the block layer, it may require forward progress on the same
frozen queue, which cannot happen until the freeze is lifted. This creates a circular
dependency involving queue freeze, reclaim, and ->elevator_lock (or ->rq_qos_mutex).

Given the above, I'm fine with the earlier approach of upgrading update_nr_hwq_lock from
a reader lock to a writer lock in elv_iosched_store(). That directly serializes concurrent
scheduler updates and avoids the race on q->elevator without introducing additional lock
ordering concerns.
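
The difference between the reader and writer trylock can be sketched with a
toy, single-threaded model of the acquisition rules. This is only an
illustration under stated assumptions: struct toy_rwsem and the toy_*
helpers are hypothetical and do not reproduce the kernel's rw_semaphore
implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a reader/writer semaphore's trylock rules. */
struct toy_rwsem {
	unsigned int readers;	/* number of shared holders */
	bool writer;		/* true while held exclusively */
};

/* Shared acquisition: succeeds unless a writer holds the lock. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

/* Exclusive acquisition: succeeds only if nobody holds the lock. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers)
		return false;
	sem->writer = true;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}
```

In this model two callers can both pass toy_down_read_trylock() and proceed
side by side, which corresponds to two concurrent scheduler writes entering
the elevator change path. With toy_down_write_trylock() the second caller
fails while the first holds the lock, and in elv_iosched_store() that failure
is reported as -EBUSY.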

Thanks,
--Nilay


* Re: [PATCH v15 0/8] blk: honor isolcpus configuration
From: Aaron Tomlin @ 2026-06-17 11:01 UTC
  To: axboe, kbusch, hch, sagi, mst
  Cc: hare, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, kch,
	ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel
In-Reply-To: <20260521232956.553287-1-atomlin@atomlin.com>

On Thu, May 21, 2026 at 07:29:48PM -0400, Aaron Tomlin wrote:
> Hi,
> 
> I have decided to drive this series forward on behalf of Daniel Wagner, the
> original author. The series has been rebased on v7.1-rc4-100-g8bc67e4db64a.
> 
> This series introduces a new CPU isolation feature, "isolcpus=io_queue",
> designed to protect isolated cores from the disruptive hardware interrupts
> generated by high-performance multi-queue devices.
> 
> When enabled, it fundamentally alters how the generic IRQ subsystem and the
> block layer (blk-mq) map hardware queues:
> 
>     1.  Restricted IRQ Affinity: Managed hardware interrupts are strictly
>         confined to online housekeeping CPUs.
> 
>     2.  Transparent I/O Submission: Applications running on isolated CPUs
>         can still seamlessly submit I/O requests; however, the resulting
>         hardware completion interrupts are safely routed to a designated
>         housekeeping CPU.
> 
>     3.  Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
>         mapping logic is extended to distribute hardware contexts evenly
>         among the available housekeeping CPUs, preventing MSI-X vector
>         exhaustion while maintaining optimal cache locality where possible.
> 
> To prevent I/O stalls, the block layer is additionally hardened to reject
> hot-plug requests that attempt to offline a housekeeping CPU if it is the
> last remaining CPU actively serving an online isolated core.

Hi everyone,

I am writing to politely follow up and request feedback on the v15
iteration of the 'isolcpus=io_queue' patch series.

As noted in the cover letter, this version introduces a major architectural
simplification compared to the older v12 design. Specifically, the complex
"top-down" mask plumbing and struct irq_affinity modifications have been
completely abandoned.

Instead, this iteration relies on a much cleaner, centralized approach
using direct isolation querying via housekeeping_cpumask(HK_TYPE_IO_QUEUE)
within the genirq/affinity subsystem. This pivot successfully decouples the
core infrastructure changes from driver-specific implementations, which
should significantly reduce the maintenance burden.
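
For readers new to the series, the queue placement described in the cover
letter can be pictured with a small userspace model. This is only a toy
sketch under stated assumptions, not the code in the series: map_queues(),
NR_CPUS_DEMO and the plain bitmask are hypothetical stand-ins for struct
cpumask and housekeeping_cpumask(HK_TYPE_IO_QUEUE), and cache locality is
not modelled at all.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_DEMO 8

/*
 * Toy model: bit N of hk_mask is set if CPU N is a housekeeping CPU.
 * Queues (and therefore their interrupts) are owned by housekeeping
 * CPUs only; every CPU is then mapped to one of those queues so that
 * isolated CPUs can still submit I/O.
 * Returns the number of queues in use, or 0 if there are no
 * housekeeping CPUs or no queues.
 */
static unsigned int map_queues(uint32_t hk_mask, unsigned int nr_queues,
			       unsigned int cpu_to_queue[NR_CPUS_DEMO],
			       unsigned int queue_irq_cpu[NR_CPUS_DEMO])
{
	unsigned int cpu, used = 0, next = 0;

	if (!nr_queues)
		return 0;

	/* Pass 1: hand queues to housekeeping CPUs, sharing once they run out. */
	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++) {
		if (!(hk_mask & (1u << cpu)))
			continue;
		if (used < nr_queues) {
			queue_irq_cpu[used] = cpu; /* IRQ stays on this CPU */
			cpu_to_queue[cpu] = used++;
		} else {
			cpu_to_queue[cpu] = next++ % used;
		}
	}
	if (!used)
		return 0;

	/* Pass 2: spread isolated CPUs round-robin over the existing queues. */
	next = 0;
	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++) {
		if (hk_mask & (1u << cpu))
			continue;
		cpu_to_queue[cpu] = next++ % used;
	}
	return used;
}
```

In this model an isolated CPU never owns a queue interrupt; it only
borrows a queue whose completion interrupt lands on a housekeeping CPU.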

I would greatly appreciate it if anyone has the bandwidth to review this
new approach. Please let me know your thoughts or if there are any further
refinements needed.

Thank you for your time and guidance.

Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v17 05/10] rust: page: convert to `Ownable`
From: Alice Ryhl @ 2026-06-17  9:54 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman, Dave Ertman,
	Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core, Asahi Lina
In-Reply-To: <20260604-unique-ref-v17-5-7b4c3d2930b9@kernel.org>

On Thu, Jun 4, 2026 at 10:14 PM Andreas Hindborg <a.hindborg@kernel.org> wrote:
>
> From: Asahi Lina <lina@asahilina.net>
>
> This allows Page references to be returned as borrowed references,
> without necessarily owning the struct page.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> [ Andreas: Fix formatting and add a safety comment. ]
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>

This will not compile unless Rust Binder is also updated.

Alice

^ permalink raw reply

* Re: Landlock: LANDLOCK_ACCESS_FS_IOCTL_DEV bypass via io_uring IORING_OP_URING_CMD
From: Günther Noack @ 2026-06-17  9:47 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Mickaël Salaün, Paul Moore, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, linux-security-module, io-uring,
	linux-block, linux-nvme, linux-kernel
In-Reply-To: <20260616201633.275067-1-hexlabsecurity@proton.me>

Hello Bryam!

Thanks for the report!

On Tue, Jun 16, 2026 at 08:16:41PM +0000, Bryam Vargas wrote:
> Hello Mickaël, and Landlock / io_uring folks,
> 
> A task confined by a Landlock ruleset that grants READ_FILE/WRITE_FILE on a block
> or NVMe character device but withholds LANDLOCK_ACCESS_FS_IOCTL_DEV can still
> reach the device-command surface through io_uring IORING_OP_URING_CMD with the
> IOCTL_DEV check bypassed: the request enters the device-command handler (block
> discard, or the NVMe char-device passthrough) where the equivalent ioctl(2) is
> denied. The destructive completion and the NVMe-admin surface follow from the
> code -- see Impact.
> 
> Affected
> --------
> Any kernel with CONFIG_SECURITY_LANDLOCK=y and Landlock enabled that supports
> LANDLOCK_ACCESS_FS_IOCTL_DEV (Landlock ABI >= 5, since Linux 6.8) and io_uring
> uring_cmd for the device class (block BLOCK_URING_CMD_DISCARD; NVMe passthrough).
> Confirmed by source inspection on mainline (v7.1-rc7) and reproduced on Linux
> 7.0.11 (Landlock ABI 8). The confined task needs a writable fd to a device it is
> legitimately allowed to use (e.g. a partition/loop device or an NVMe namespace
> passed into a container or granted by the ruleset); no CAP is required to reach
> the io_uring path. The gap is structural -- Landlock has never registered a
> uring_cmd hook -- so it is present from ABI 5 (Linux 6.8) through current
> mainline (v7.1-rc7) and is not a regression tied to a single Fixes: commit.
> 
> Root cause
> ----------
> On the ioctl(2) path, the syscall handler in fs/ioctl.c calls
> security_file_ioctl() (its only call site on the ioctl(2) path) before
> dispatching to do_vfs_ioctl(); that reaches Landlock hook_file_ioctl_common(),
> which denies a device ioctl unless the file's
> allowed_access holds LANDLOCK_ACCESS_FS_IOCTL_DEV (BLKDISCARD/BLKSECDISCARD/
> BLKZEROOUT and NVMe passthrough are not in the is_masked_device_ioctl()
> allow-list, so they require the right).
> 
> io_uring reaches the same device-command surface by a different producer:
> 
>   IORING_OP_URING_CMD -> io_uring_cmd()   io_uring/uring_cmd.c
>    -> security_uring_cmd(ioucmd)          (the ONLY LSM gate on this path)
>    -> file->f_op->uring_cmd()             e.g. blkdev_uring_cmd() / nvme_ns_chr_uring_cmd()
> 
> Landlock's LSM_HOOK_INIT list (security/landlock/fs.c, net.c, task.c) registers
> file_ioctl/file_ioctl_compat but no uring_cmd hook -- only SELinux
> (selinux_uring_cmd) and Smack (smack_uring_cmd) gate this surface -- so
> security_uring_cmd() returns 0 for a Landlocked task and hook_file_ioctl /
> IOCTL_DEV is never consulted. For block, blkdev_cmd_discard() is then gated only
> by BLK_OPEN_WRITE; for NVMe, nvme_ns_chr_uring_cmd() reaches the admin/IO
> passthrough with no security_file_ioctl on the path. There is no shared helper
> that re-applies the IOCTL_DEV check.
> 
> SELinux and Smack hooking uring_cmd while Landlock does not is the coverage
> asymmetry; the Landlock documentation describes IOCTL_DEV as gating ioctl(2) but
> does not mention io_uring.
> 
> Reproducer
> ----------
> A self-contained PoC is available on request (it needs root only to set up a loop
> block device and open it; Landlock enforcement is uid-independent, so the
> confined child demonstrates the gap regardless of the setup uid). The child
> applies a Landlock ruleset handling READ_FILE|WRITE_FILE|IOCTL_DEV with a rule
> granting only READ_FILE|WRITE_FILE on the device, then:
> 
>   (1) ioctl(fd, BLKDISCARD, range)        -> -EACCES  (Landlock enforces IOCTL_DEV)
>   (2) IORING_OP_URING_CMD,
>       cmd_op = BLOCK_URING_CMD_DISCARD     -> reaches the block command handler
> 
> Observed on Linux 7.0.11 (Landlock ABI 8):
> 
>   [1] ioctl(BLKDISCARD)   -> ret=-1 errno=13 (Permission denied)
>   [2] uring_cmd(DISCARD)  -> cqe.res=-22 (Invalid argument)
> 
> A Landlock denial is always -EACCES; the io_uring path returned -EINVAL, which
> originates in a post-authorization check inside the block command handler
> (blk_validate_byte_range() in blkdev_cmd_discard()), reached only after
> security_uring_cmd() returned 0. So this run demonstrates the authorization
> bypass -- the request traversed the LSM gate into the block device-command
> handler with no IOCTL_DEV check -- and then failed a parameter check, not an
> authorization check. The destructive completion (an authorized discard with a
> granularity-aligned range) is the expected behaviour but was not exercised in
> this run.
> 
> Impact
> ------
> Demonstrated: the LANDLOCK_ACCESS_FS_IOCTL_DEV authorization is bypassed. The
> device-command request reaches the block command handler with no Landlock check;
> the only remaining gate is BLK_OPEN_WRITE (held, since the policy granted write).
> Inferred from the code, not exercised here: an authorized DISCARD with a valid
> range completes (DISCARD/secure-erase semantics, destroying on-device data), and
> the same missing hook leaves the NVMe char-device uring_cmd surface ungated --
> nvme_ns_chr_uring_cmd (namespace device /dev/nvmeXnY) -> nvme_ns_uring_cmd for
> NVME_URING_CMD_IO/IO_VEC passthrough, and nvme_dev_uring_cmd (controller device
> /dev/nvmeX) for NVME_URING_CMD_ADMIN (format, sanitize, firmware download,
> security send) -- both reach f_op->uring_cmd with no Landlock/IOCTL_DEV gate.
> 
> So the confirmed finding is a missing authorization (the confined task escapes
> its own IOCTL_DEV restriction); the destructive data effect and the NVMe-admin
> high-water-mark follow from the code but are not shown in the run above. The
> proven authorization bypass alone scores CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N
> (6.5 Medium) -- S:C because the confined task crosses the Landlock policy
> boundary it was placed under, I:H because the bypassed path reaches a handler
> whose authorized completion modifies device data. With the device command
> completing destructively the projected ceiling is
> CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:H (8.4 High), the A:H component
> reasoned from the source rather than executed. No memory safety is involved.
> 
> Suggested direction
> -------------------
> Have Landlock register a uring_cmd hook that maps the device command to the same
> checks the ioctl path applies (IOCTL_DEV, and truncate where relevant), so a
> single chokepoint covers every f_op->uring_cmd provider (block, NVMe, ublk, and
> any future one). Mirrors how SELinux/Smack already gate this surface.
> 
> I am happy to send a patch for this if you would like.

I have read through the code a bit, but I am not sure I follow the argument of
this report. Let me paraphrase my understanding --

* LANDLOCK_ACCESS_FS_IOCTL_DEV is documented as blocking ioctl(2)
  commands on opened character and block devices.
  (cf. https://docs.kernel.org/userspace-api/landlock.html#filesystem-flags)

* One of many block-device IOCTL operations is BLKDISCARD.

* Block devices offer BLKDISCARD over io_uring as well,
  but io_uring does *not* offer a generic interface through which you
  can do IOCTLs.  It *only* implements BLOCK_URING_CMD_DISCARD in that
  place.  The header where that constant is defined happens to use one
  of the ioctl macros to construct the number, but points out that "It's
  a different number space from ioctl()" (see
  include/uapi/linux/blkdev.h).

So... while this is similar to IOCTL, and while this block device operation is
also available through ioctl(2), this is a different command multiplexer
than IOCTL and I am not convinced that that namespace should be guarded with
the same LANDLOCK_ACCESS_FS_IOCTL_DEV access right.
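
To make the "different number space" point concrete, here is a minimal
sketch comparing the two command numbers. BLKDISCARD comes from
<linux/fs.h>. DEMO_URING_CMD_DISCARD is a local, hypothetical name; its
value is an assumption based on my reading of include/uapi/linux/blkdev.h
and should be checked against the header.

```c
#include <assert.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <linux/ioctl.h>   /* _IO, _IOC_TYPE, _IOC_NR */

/*
 * Assumption: same encoding as BLOCK_URING_CMD_DISCARD in
 * include/uapi/linux/blkdev.h. Defined under a local name so this
 * builds against older installed headers.
 */
#define DEMO_URING_CMD_DISCARD _IO(0x12, 0)

/* Returns 1 if both commands reuse the block layer's 0x12 type byte. */
static int same_type_byte(void)
{
	return _IOC_TYPE(BLKDISCARD) == _IOC_TYPE(DEMO_URING_CMD_DISCARD);
}

/* Returns 1 if both commands carry the same command number. */
static int same_command_number(void)
{
	return _IOC_NR(BLKDISCARD) == _IOC_NR(DEMO_URING_CMD_DISCARD);
}
```

Under that assumption the type byte matches but the command number does
not, so a filter keyed on ioctl(2) command values would not recognise
the uring_cmd opcode.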

Do I understand correctly that the only operation affected in this report is
BLOCK_URING_CMD_DISCARD?  Or are there other operations affected by this
(through other devices)?  I saw you also mentioned the truncate right above,
but I assume that for this access right you have not found a way to side-step
it (assuming that this calls the more specific LSM hooks).

Thanks,
—Günther

^ permalink raw reply

* Re: [PATCH v3] rust: add procedural macro for declaring configfs attributes
From: Miguel Ojeda @ 2026-06-17  9:32 UTC (permalink / raw)
  To: Malte Wechter
  Cc: Andreas Hindborg, Breno Leitao, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Jens Axboe, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin,
	linux-kernel, rust-for-linux, linux-block, linux-modules
In-Reply-To: <75552e2d-0bc8-40a0-b783-fba64482aea9@gmail.com>

On Wed, Jun 17, 2026 at 11:13 AM Malte Wechter <maltewechter@gmail.com> wrote:
>
> As of now doc strings are not generated for private items in the macros
> crate. I am moving the `parse_ordered_fields!` macro into
> macros/helpers.rs but this means the doc strings are not generated for
> the macro anymore. The `parse_ordered_fields!` macro is a larger helper
> function, and the doc strings are relevant and helpful for macro
> developers that wants to use it.

If it is private, then it is what it is, don't worry about it --
developers can still read the source code.

But, yes, having a render of the private items is something I have
wanted for a long time, but as a runtime toggle, so that it is easy to
go from one to the other (and without having to have 2 entire copies
of the docs).

Please see the entry "Private documentation (perhaps as an extension of
the private items/fields toggle)" I have at:

  https://github.com/Rust-for-Linux/linux/issues/350

Upstream `rustdoc` implemented an MVP of the idea via CSS/JS in this draft PR:

  https://github.com/rust-lang/rust/pull/141299

If you want to help on that, then you could try it and leave some
feedback there! :)

Thanks!

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260617062523.GA20041@lst.de>

> No, we don't need a secondary device number to sb mapping.  On the other
> hand we do need the deviceloss, freeze etc upcalls to work for owners
> that are not file systems like mdraid or dm, even if they have been
> slow to pick this.  The whole idea of the holder ops is to abstract
> away from who holds it instead of adding back the broken hard coding
> of the superblock.  Otherwise you're just badly reinventing get_super.

No, the expanded version works for all device numbers. There's also no
hardcoding, and non-fs users may of course do whatever they want with
their holder ops. erofs has always had a non-1:1 relationship between
devices and filesystems, and for that case it seems sane. I'm happy to
let the series sit for a bit to gather input and do the security
mediation patches first. The series are complementary.

^ permalink raw reply

* Re: [PATCH v3] rust: add procedural macro for declaring configfs attributes
From: Malte Wechter @ 2026-06-17  9:13 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Andreas Hindborg, Breno Leitao, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Jens Axboe, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin,
	linux-kernel, rust-for-linux, linux-block, linux-modules
In-Reply-To: <CANiq72=RX5V6W+1tj0GHxZusrk5OqYbZ5-xV=wvSssrx_CWXAA@mail.gmail.com>


On 6/13/26 12:41 PM, Miguel Ojeda wrote:
> Hi Malte,
>
> Some quick notes...
>
> On Fri, Jun 12, 2026 at 3:29 PM Malte Wechter <maltewechter@gmail.com> wrote:
>> +/// ```ignore
> Empty /// before examples.
>
>> +///     // This will extract "foo: <field>" into a variable named "foo".
> ` instead of "
>
> i.e. please use Markdown
>
>> +///```
> Missing space indentation
>
>> +/// Expands the following output:
>> +///    let item_type = {
> Missing example block, both at the beginning and the end.
>
> Please double-check by generating the docs and looking at how they
> appear in the browser.
>
> The prefix of the title should likely be `rust: configfs:`.
>
> Thanks!
>
> Cheers,
> Miguel

As of now, doc strings are not generated for private items in the macros
crate. I am moving the `parse_ordered_fields!` macro into
macros/helpers.rs, but this means the doc strings are no longer generated
for the macro. The `parse_ordered_fields!` macro is a larger helper, and
its doc strings are relevant and helpful for macro developers who want
to use it.

You can enable documenting private items:

diff --git a/rust/Makefile b/rust/Makefile
index b361bfedfdf0..b4239443307e 100644
--- a/rust/Makefile
+++ b/rust/Makefile
@@ -147,6 +147,7 @@ quiet_cmd_rustdoc = RUSTDOC $(if $(rustdoc_host),H, ) $<
      OBJTREE=$(abspath $(objtree)) \
      $(RUSTDOC) $(filter-out $(skip_flags) --remap-path-scope=%,$(if 
$(rustdoc_host),$(rust_common_flags),$(rust_flags))) \
          $(rustc_target_flags) -L$(objtree)/$(obj) \
+        --document-private-items \
          -Zunstable-options --generate-link-to-definition \
          --output $(rustdoc_output) \
          --crate-name $(subst rustdoc-,,$@) \

But this causes _all_ private items to be rendered, which is not
ideal. How should I proceed?

Best regards,

Malte



^ permalink raw reply related

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-17  7:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Jens Axboe, James Bottomley, Linux SCSI List,
	linux-block
In-Reply-To: <93a82831-608d-4462-a019-26b3adc7089c@suse.de>

[-- Attachment #1.1: Type: text/plain, Size: 721 bytes --]

> What tests did you perform?
> I'm pretty sure you see an improvement when having just a few drives,
> but what about having a lot of them (ie tens of drives)?
> The whole point of this was to increase fairness between drives, so
> of course removing it will make an individual drive going faster ...

Initially, we ran tests with 8 drives and saw positive results. However, we
completed tests with 16 drives and are seeing performance drops at higher
iodepths (>=128) with this patch. This appears to be due to the removal of
the per-queue throttle (hctx_may_queue).

We are currently running additional tests to better understand this
behavior. I will provide an update once I have more meaningful data.
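
For context, the kind of per-queue throttle being discussed can be
sketched as follows. This is a rough userspace model, not the kernel
code: fair_share_depth() and may_queue() are hypothetical names, and the
round-up division and the floor of 4 reflect my reading of
hctx_may_queue(), so treat both as assumptions.

```c
#include <assert.h>

/*
 * Rough model of a fair-share limit on a shared tag set: each active
 * queue may use about total_depth / active_queues tags, with a small
 * floor so that no queue is starved completely.
 */
static unsigned int fair_share_depth(unsigned int total_depth,
				     unsigned int active_queues)
{
	unsigned int depth;

	if (active_queues == 0)
		return total_depth;

	/* Round up so the shares cover the whole tag set. */
	depth = (total_depth + active_queues - 1) / active_queues;
	return depth < 4 ? 4 : depth;
}

/* Returns 1 if a queue with 'in_flight' requests may take another tag. */
static int may_queue(unsigned int total_depth, unsigned int active_queues,
		     unsigned int in_flight)
{
	return in_flight < fair_share_depth(total_depth, active_queues);
}
```

With a hypothetical 1024-tag set, 8 active drives get a share of 128
each and 16 drives get 64 each, so under this model a per-drive iodepth
of 128 only runs into the limit in the 16-drive case.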

Thanks,
Sumit

[-- Attachment #1.2: Type: text/html, Size: 2102 bytes --]

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* [PATCH 2/2 blktests] src/miniublk: fall back to legacy opcodes on older kernels
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad
In-Reply-To: <20260617072516.6238-1-sebastian.chlad@suse.com>

Try ioctl-encoded ADD_DEV and GET_DEV_INFO first; if either fails,
retry with the legacy raw opcode. After a successful bootstrap
command, derive use_ioctl from UBLK_F_CMD_IOCTL_ENCODE in dev_info.flags
so all subsequent control and IO commands use the mode reported by the
kernel.

Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>
---
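The probing order described above can be sketched in isolation as
follows. This is a simplified model, not the miniublk code: submit_fn,
demo_dev, DEMO_F_CMD_IOCTL_ENCODE and the two mock kernels are
hypothetical, and the -EOPNOTSUPP returned by the "old kernel" mock is
just a stand-in for whatever error a real kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for submitting one control command. */
typedef int (*submit_fn)(bool ioctl_encoded, unsigned int *flags_out);

/* Stand-in for UBLK_F_CMD_IOCTL_ENCODE. */
#define DEMO_F_CMD_IOCTL_ENCODE (1u << 0)

struct demo_dev {
	bool use_ioctl;
	unsigned int flags;
};

/*
 * Try the ioctl-encoded opcode first, retry once with the legacy
 * opcode, then trust the feature flag the kernel reported.
 */
static int demo_bootstrap(struct demo_dev *dev, submit_fn submit)
{
	int ret;

	dev->use_ioctl = true;
	ret = submit(true, &dev->flags);
	if (ret < 0) {
		dev->use_ioctl = false;
		ret = submit(false, &dev->flags);
	}
	if (ret >= 0)
		dev->use_ioctl = !!(dev->flags & DEMO_F_CMD_IOCTL_ENCODE);
	return ret;
}

/* Mock of an old kernel: only legacy opcodes work, flag never set. */
static int old_kernel(bool ioctl_encoded, unsigned int *flags_out)
{
	if (ioctl_encoded)
		return -EOPNOTSUPP;
	*flags_out = 0;
	return 0;
}

/* Mock of a new kernel built without legacy opcode support. */
static int new_kernel(bool ioctl_encoded, unsigned int *flags_out)
{
	if (!ioctl_encoded)
		return -EOPNOTSUPP;
	*flags_out = DEMO_F_CMD_IOCTL_ENCODE;
	return 0;
}
```

The final mode comes from the reported flag rather than from which
attempt succeeded, which matters on kernels that accept both encodings.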
 src/miniublk.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/miniublk.c b/src/miniublk.c
index 5a35ca7..494a4ae 100644
--- a/src/miniublk.c
+++ b/src/miniublk.c
@@ -112,6 +112,7 @@ struct ublk_dev {
 	int fds[2];	/* fds[0] points to /dev/ublkcN */
 	int nr_fds;
 	int ctrl_fd;
+	bool use_ioctl;
 	struct io_uring ring;
 };
 
@@ -235,7 +236,7 @@ static inline int ublk_setup_ring(struct io_uring *r, int depth,
 
 static inline void ublk_ctrl_init_cmd(struct ublk_dev *dev,
 		struct io_uring_sqe *sqe,
-		struct ublk_ctrl_cmd_data *data)
+		struct ublk_ctrl_cmd_data *data, __u32 cmd_op)
 {
 	struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
 	struct ublksrv_ctrl_cmd *cmd = (struct ublksrv_ctrl_cmd *)ublk_get_sqe_cmd(sqe);
@@ -255,25 +256,34 @@ static inline void ublk_ctrl_init_cmd(struct ublk_dev *dev,
 	cmd->dev_id = info->dev_id;
 	cmd->queue_id = -1;
 
-	ublk_set_sqe_cmd_op(sqe, data->cmd_op);
+	ublk_set_sqe_cmd_op(sqe, cmd_op);
 
 	io_uring_sqe_set_data(sqe, cmd);
 }
 
+static void ublk_update_ioctl_encoding(struct ublk_dev *dev)
+{
+	dev->use_ioctl = !!(dev->dev_info.flags & UBLK_F_CMD_IOCTL_ENCODE);
+}
+
 static int __ublk_ctrl_cmd(struct ublk_dev *dev,
 		struct ublk_ctrl_cmd_data *data)
 {
 	struct io_uring_sqe *sqe;
 	struct io_uring_cqe *cqe;
+	__u32 cmd_op = data->cmd_op;
 	int ret = -EINVAL;
 
+	if (!dev->use_ioctl)
+		cmd_op = _IOC_NR(cmd_op);
+
 	sqe = io_uring_get_sqe(&dev->ring);
 	if (!sqe) {
 		ublk_err("%s: can't get sqe ret %d\n", __func__, ret);
 		return ret;
 	}
 
-	ublk_ctrl_init_cmd(dev, sqe, data);
+	ublk_ctrl_init_cmd(dev, sqe, data, cmd_op);
 
 	ret = io_uring_submit(&dev->ring);
 	if (ret < 0) {
@@ -321,8 +331,19 @@ int ublk_ctrl_add_dev(struct ublk_dev *dev)
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
 	};
+	int ret;
 
-	return __ublk_ctrl_cmd(dev, &data);
+	ret = __ublk_ctrl_cmd(dev, &data);
+	if (ret < 0) {
+		/* retry with legacy opcode on older kernels */
+		dev->use_ioctl = false;
+		ret = __ublk_ctrl_cmd(dev, &data);
+	}
+
+	if (ret >= 0)
+		ublk_update_ioctl_encoding(dev);
+
+	return ret;
 }
 
 int ublk_ctrl_del_dev(struct ublk_dev *dev)
@@ -343,8 +364,19 @@ int ublk_ctrl_get_info(struct ublk_dev *dev)
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
 	};
+	int ret;
 
-	return __ublk_ctrl_cmd(dev, &data);
+	ret = __ublk_ctrl_cmd(dev, &data);
+	if (ret < 0 && dev->use_ioctl) {
+		/* retry with legacy opcode on older kernels */
+		dev->use_ioctl = false;
+		ret = __ublk_ctrl_cmd(dev, &data);
+	}
+
+	if (ret >= 0)
+		ublk_update_ioctl_encoding(dev);
+
+	return ret;
 }
 
 int ublk_ctrl_set_params(struct ublk_dev *dev,
@@ -453,6 +485,8 @@ static struct ublk_dev *ublk_ctrl_init()
 	struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
 	int ret;
 
+	dev->use_ioctl = true; /* use ioctl opcodes by default */
+
 	dev->ctrl_fd = open(CTRL_DEV, O_RDWR);
 	if (dev->ctrl_fd < 0) {
 		ublk_err("control dev %s can't be opened: %m %d\n", CTRL_DEV, errno);
@@ -628,6 +662,9 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	else
 		cmd_op = UBLK_U_IO_FETCH_REQ;
 
+	if (!q->dev->use_ioctl)
+		cmd_op = _IOC_NR(cmd_op);
+
 	sqe = io_uring_get_sqe(&q->ring);
 	if (!sqe) {
 		ublk_err("%s: run out of sqe %d, tag %d\n",
-- 
2.51.0


^ permalink raw reply related

* [PATCH 1/2 blktests] src/miniublk: switch to ioctl-encoded ublk commands
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad
In-Reply-To: <20260617072516.6238-1-sebastian.chlad@suse.com>

Kernels built without CONFIG_BLKDEV_UBLK_LEGACY_OPCODES reject the
legacy raw UBLK_CMD_* and UBLK_IO_* opcodes. Switch miniublk to use
the ioctl-encoded UBLK_U_CMD_* and UBLK_U_IO_* variants defined in
linux/ublk_cmd.h instead.

For IO commands, the ioctl-encoded opcode is used for submission while
_IOC_NR() extracts the raw NR bits for build_user_data(), keeping the
user_data tag encoding intact.

Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>
---
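As a small illustration of the _IOC_NR() step mentioned above: the
ioctl-encoded opcode packs direction, size, type and number into one
value, and _IOC_NR() recovers the number field. The struct and the
DEMO_* names below are hypothetical and are not the real ublk
definitions; only the <linux/ioctl.h> macros are real.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Hypothetical command payload, used only to give _IOWR() a size. */
struct demo_io_cmd {
	unsigned short q_id;
	unsigned short tag;
	int result;
	unsigned long long addr;
};

#define DEMO_IO_FETCH_REQ   0x20	/* raw (legacy-style) opcode */
#define DEMO_U_IO_FETCH_REQ \
	_IOWR('u', DEMO_IO_FETCH_REQ, struct demo_io_cmd)

/* Recover the raw opcode from the ioctl-encoded one. */
static unsigned int raw_opcode(unsigned int encoded)
{
	return _IOC_NR(encoded);
}
```

The encoded value is what goes into the SQE, while the raw number is
small enough to keep fitting in the user_data tag layout.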
 src/miniublk.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/src/miniublk.c b/src/miniublk.c
index f98f850..5a35ca7 100644
--- a/src/miniublk.c
+++ b/src/miniublk.c
@@ -294,7 +294,7 @@ static int __ublk_ctrl_cmd(struct ublk_dev *dev,
 int ublk_ctrl_stop_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_STOP_DEV,
+		.cmd_op	= UBLK_U_CMD_STOP_DEV,
 	};
 
 	return __ublk_ctrl_cmd(dev, &data);
@@ -304,7 +304,7 @@ int ublk_ctrl_start_dev(struct ublk_dev *dev,
 		int daemon_pid)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_START_DEV,
+		.cmd_op	= UBLK_U_CMD_START_DEV,
 		.flags	= CTRL_CMD_HAS_DATA,
 	};
 
@@ -316,7 +316,7 @@ int ublk_ctrl_start_dev(struct ublk_dev *dev,
 int ublk_ctrl_add_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_ADD_DEV,
+		.cmd_op	= UBLK_U_CMD_ADD_DEV,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
@@ -328,7 +328,7 @@ int ublk_ctrl_add_dev(struct ublk_dev *dev)
 int ublk_ctrl_del_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op = UBLK_CMD_DEL_DEV,
+		.cmd_op = UBLK_U_CMD_DEL_DEV,
 		.flags = 0,
 	};
 
@@ -338,7 +338,7 @@ int ublk_ctrl_del_dev(struct ublk_dev *dev)
 int ublk_ctrl_get_info(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_GET_DEV_INFO,
+		.cmd_op	= UBLK_U_CMD_GET_DEV_INFO,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
@@ -351,7 +351,7 @@ int ublk_ctrl_set_params(struct ublk_dev *dev,
 		struct ublk_params *params)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_SET_PARAMS,
+		.cmd_op	= UBLK_U_CMD_SET_PARAMS,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)params,
 		.len = sizeof(*params),
@@ -364,7 +364,7 @@ static int ublk_ctrl_get_params(struct ublk_dev *dev,
 		struct ublk_params *params)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_GET_PARAMS,
+		.cmd_op	= UBLK_U_CMD_GET_PARAMS,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)params,
 		.len = sizeof(*params),
@@ -378,7 +378,7 @@ static int ublk_ctrl_get_params(struct ublk_dev *dev,
 static int ublk_ctrl_start_user_recover(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_START_USER_RECOVERY,
+		.cmd_op	= UBLK_U_CMD_START_USER_RECOVERY,
 		.flags	= 0,
 	};
 
@@ -389,7 +389,7 @@ static int ublk_ctrl_end_user_recover(struct ublk_dev *dev,
 		int daemon_pid)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_END_USER_RECOVERY,
+		.cmd_op	= UBLK_U_CMD_END_USER_RECOVERY,
 		.flags	= CTRL_CMD_HAS_DATA,
 	};
 
@@ -624,9 +624,9 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 		return 0;
 
 	if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP)
-		cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
-	else if (io->flags & UBLKSRV_NEED_FETCH_RQ)
-		cmd_op = UBLK_IO_FETCH_REQ;
+		cmd_op = UBLK_U_IO_COMMIT_AND_FETCH_REQ;
+	else
+		cmd_op = UBLK_U_IO_FETCH_REQ;
 
 	sqe = io_uring_get_sqe(&q->ring);
 	if (!sqe) {
@@ -637,7 +637,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 
 	cmd = (struct ublksrv_io_cmd *)ublk_get_sqe_cmd(sqe);
 
-	if (cmd_op == UBLK_IO_COMMIT_AND_FETCH_REQ)
+	if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP)
 		cmd->result = io->result;
 
 	/* These fields should be written once, never change */
@@ -650,7 +650,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	cmd->addr	= (__u64)io->buf_addr;
 	cmd->q_id	= q->q_id;
 
-	user_data = build_user_data(tag, cmd_op, 0, 0);
+	user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0);
 	io_uring_sqe_set_data64(sqe, user_data);
 
 	io->flags = 0;
@@ -658,7 +658,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	q->cmd_inflight += 1;
 
 	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n",
-			__func__, q->q_id, tag, cmd_op,
+			__func__, q->q_id, tag, _IOC_NR(cmd_op),
 			io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING));
 	return 1;
 }
-- 
2.51.0


^ permalink raw reply related

* [PATCH 0/2 blktests] Update the miniublk to use ioctl opcodes
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad

miniublk currently uses only legacy opcodes. Kernels built without
CONFIG_BLKDEV_UBLK_LEGACY_OPCODES reject them with -EOPNOTSUPP, causing
all ublk tests to fail. The first patch solves the problem by switching
to ioctl-encoded opcodes, and the second patch adds a fallback to legacy
opcodes for testing older kernels.

I tested against an old 6.3 kernel supporting only legacy opcodes, against
a new kernel with both ioctl and legacy opcodes enabled, and against a
new kernel with ioctl opcodes and no support for the legacy ones.

Sebastian Chlad (2):
  src/miniublk: switch to ioctl-encoded ublk commands
  src/miniublk: fall back to legacy opcodes on older kernels

 src/miniublk.c | 77 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 20 deletions(-)

-- 
2.51.0

^ permalink raw reply

* Re: [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: changfengnan @ 2026-06-17  7:07 UTC (permalink / raw)
  To: Anuj Gupta
  Cc: axboe, hch, kbusch, lidiangang, tom.leiming, nj.shetty, joshi.k,
	anuj1072538, linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <20260617060850.1244788-1-anuj20.g@samsung.com>

Looks good to me.
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>

> From: "Anuj Gupta"<anuj20.g@samsung.com>
> Date:  Wed, Jun 17, 2026, 14:15
> Subject:  [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
> To: <axboe@kernel.dk>, <hch@lst.de>, <kbusch@kernel.org>, <lidiangang@bytedance.com>, <changfengnan@bytedance.com>, <tom.leiming@gmail.com>, <nj.shetty@samsung.com>, <joshi.k@samsung.com>, <anuj1072538@gmail.com>
> Cc: <linux-block@vger.kernel.org>, "Anuj Gupta"<anuj20.g@samsung.com>, "Alok Rathore"<alok.rathore@samsung.com>
> blk_hctx_poll() can busy-poll until a completion is found or
> need_resched() becomes true. On preemptible kernels, the scheduler can
> set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
> return before the loop condition re-evaluates it. After the context
> switch, the flag is cleared, so the poller can continue spinning instead
> of returning to its caller.
> 
> This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
> which holds the rcu_read_lock() while calling bio_poll(). If another
> poller on the same polled queue drains the available completions, this
> poller may repeatedly find no completions and remain inside the RCU
> read-side critical section long enough to trigger RCU stall reports:
> 
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
> rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
> task:fio state:R  running task     stack:0     pid:3961
> Call Trace:
> <TASK>
> ? nvme_poll+0x36/0xa0 [nvme]
> ? blk_hctx_poll+0x39/0x90
> ? blk_mq_poll+0x30/0x60
> ? bio_poll+0x87/0x170
> ? iocb_bio_iopoll+0x32/0x50
> ? io_uring_classic_poll+0x25/0x50
> ? io_do_iopoll+0x216/0x420
> ? __do_sys_io_uring_enter+0x2c7/0x7c0
> 
> Reproducible with:
> 
> fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
> --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
> --registerfiles=1 --group_reporting --thread
> 
> Record the starting jiffy and exit the loop once jiffies has advanced.
> This bounds each blk_hctx_poll() invocation while also covering the
> case where the reschedule flag was cleared by the context switch
> before the loop condition could observe it.
> 
> Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
> Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
> ---
>  block/blk-mq.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..ae6c5f4b80ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>                           struct io_comp_batch *iob, unsigned int flags)
>  {
>          int ret;
> +        unsigned long timeout = jiffies + 1;
>  
>          do {
>                  ret = q->mq_ops->poll(hctx, iob);
> @@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>                  if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>                          break;
>                  cpu_relax();
> -        } while (!need_resched());
> +        } while (!need_resched() && time_before(jiffies, timeout));
>  
>          return 0;
>  }
> -- 
> 2.25.1
> 
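
The bounding idea in the patch can be tried out in userspace with a
simulated tick counter. This is a sketch under stated assumptions, not
the kernel code: demo_time_before() imitates the kernel's time_before(),
while bounded_poll(), poll_fn and the mock poller are hypothetical, and
the need_resched() half of the loop condition is left out.

```c
#include <assert.h>
#include <limits.h>

/*
 * Wraparound-safe "a is before b" for a free-running tick counter.
 * The unsigned-to-signed conversion relies on two's-complement
 * behaviour, which holds on the platforms Linux supports.
 */
#define demo_time_before(a, b) \
	((long)((unsigned long)(a) - (unsigned long)(b)) < 0)

/* Hypothetical poll callback: returns > 0 when a completion was found. */
typedef int (*poll_fn)(void *ctx);

/*
 * Poll until a completion is found or the tick counter advances.
 * '*ticks' models jiffies. Returns the number of poll attempts made.
 */
static unsigned int bounded_poll(volatile unsigned long *ticks,
				 poll_fn poll, void *ctx)
{
	unsigned long timeout = *ticks + 1;
	unsigned int attempts = 0;

	do {
		attempts++;
		if (poll(ctx) > 0)
			break;
	} while (demo_time_before(*ticks, timeout));

	return attempts;
}

/* Mock poller: never completes, advances the tick on its third call. */
struct mock {
	volatile unsigned long *ticks;
	unsigned int calls;
};

static int never_completes(void *ctx)
{
	struct mock *m = ctx;

	if (++m->calls == 3)
		(*m->ticks)++;
	return 0;
}
```

Because the comparison is done on the signed difference, the bound still
holds when the counter wraps from ULONG_MAX to 0 during the poll.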

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christoph Hellwig @ 2026-06-17  6:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-fragil-duktus-nachverfolgen-60f54584c206@brauner>

On Tue, Jun 16, 2026 at 04:59:53PM +0200, Christian Brauner wrote:
> > Err, no.  block devices need to have a specific owner.  If erofs wants
> > to share a device between superblock it needs to come up with an entity
> > that owns the block devices which is not a superblock.
> 
> It already did.
> 
> > IMHO sharing devices between superblocks is a bad idea, but that ship
> > has sailed, but please keep it contained inside of erofs.
> 
> We need a simple device number to superblock mapping anyway and that can
> simply be centralized in the vfs. And it can work with anon device
> numbers and block device numbers uniformly.

No, we don't need a secondary device number to sb mapping.  On the other
hand we do need the device loss, freeze etc. upcalls to work for owners
that are not file systems, like mdraid or dm, even if they have been
slow to pick this up.  The whole idea of the holder ops is to abstract
away from who holds it instead of adding back the broken hard coding
of the superblock.  Otherwise you're just badly reinventing get_super.

If erofs already has an owner entity it just needs custom holder ops for
that.

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Jianyue Wu, Christoph Hellwig, Andrew Morton, Chris Li,
	Baoquan He, Nhat Pham, Barry Song, Kairui Song, Kemeng Shi,
	Youngjun Park, Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle),
	Jan Kara, linux-mm, linux-kernel, linux-block, linux-doc,
	Brian Geffon
In-Reply-To: <ajIYFtADxQDq8q1P@google.com>

On Wed, Jun 17, 2026 at 12:46:53PM +0900, Sergey Senozhatsky wrote:
> Those are fantastic questions, thank you for asking them.
> Can we elaborate on zram being a "legacy interface"?

Compression is functionality that fundamentally belongs in the core
swap code, not in a virtual block device.  Between the
backing-store-less zswap and the virtual swap layer, the core swap code
is now getting to the point where we don't need to rely on hacks like a
compressing ramdisk.

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:17 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <CAJxJ_jhK+zkpjhs3YsQ9RoasKYh+E0NweQci0sPAEY1ne5LmBA@mail.gmail.com>

On Wed, Jun 17, 2026 at 11:38:02AM +0800, Jianyue Wu wrote:
> Before I rework or drop the RFC, could you outline how you see that
> core-side model working? In particular:
>   - How should a compressed backend like zram or future block device
>     plug into swap_iocb / swap_ops?

I don't think that is the right layer.  The virtual swap layer that is
currently in the process of being upstreamed is the right level, and
the actual swap devices or swap files are just a dumb backend for what
the higher-level code does.

>   - What role do you expect zram to keep while the legacy block interface
>     remains: current block swap only, or something else?

For now we'll need to keep it working as-is.  It is heavily used in
Android and potentially elsewhere.  Once we have zswap fully working
in the virtual swap layer world it might make sense to say never
compress again in zram when REQ_SWAP is set (or maybe a new
REQ_COMPRESSED) so that we can use the core compression code without
breaking existing setups.

^ permalink raw reply

* [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta @ 2026-06-17  6:08 UTC (permalink / raw)
  To: axboe, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <CGME20260617061531epcas5p26e62bfdf2e91b646611191e4451d9843@epcas5p2.samsung.com>

blk_hctx_poll() can busy-poll until a completion is found or
need_resched() becomes true. On preemptible kernels, the scheduler can
set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
return before the loop condition re-evaluates it. After the context
switch, the flag is cleared, so the poller can continue spinning instead
of returning to its caller.

This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
which holds the rcu_read_lock() while calling bio_poll(). If another
poller on the same polled queue drains the available completions, this
poller may repeatedly find no completions and remain inside the RCU
read-side critical section long enough to trigger RCU stall reports:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
task:fio state:R  running task     stack:0     pid:3961
Call Trace:
<TASK>
? nvme_poll+0x36/0xa0 [nvme]
? blk_hctx_poll+0x39/0x90
? blk_mq_poll+0x30/0x60
? bio_poll+0x87/0x170
? iocb_bio_iopoll+0x32/0x50
? io_uring_classic_poll+0x25/0x50
? io_do_iopoll+0x216/0x420
? __do_sys_io_uring_enter+0x2c7/0x7c0

Reproducible with:

fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
--numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
--registerfiles=1 --group_reporting --thread

Record the starting jiffy and exit the loop once jiffies has advanced.
This bounds each blk_hctx_poll() invocation while also covering the
case where the reschedule flag was cleared by the context switch
before the loop condition could observe it.

Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..ae6c5f4b80ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			 struct io_comp_batch *iob, unsigned int flags)
 {
 	int ret;
+	unsigned long timeout = jiffies + 1;

 	do {
 		ret = q->mq_ops->poll(hctx, iob);
@@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
 		cpu_relax();
-	} while (!need_resched());
+	} while (!need_resched() && time_before(jiffies, timeout));

 	return 0;
 }
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta/Anuj Gupta @ 2026-06-17  6:14 UTC (permalink / raw)
  To: Fengnan, axboe, hch, kbusch, lidiangang, tom.leiming, nj.shetty,
	joshi.k, anuj1072538
  Cc: linux-block, Alok Rathore
In-Reply-To: <2e916cee-3a82-47ac-a416-b52a9744cdd5@bytedance.com>

On 6/12/2026 7:23 AM, Fengnan wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 4c5c16cce4f8..d85fa4a51e79 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>>    			 struct io_comp_batch *iob, unsigned int flags)
>>    {
>>    	int ret;
>> +	unsigned long start = jiffies;
> how about this :
> 
> unsigned long timeout = jiffies + 1;
> ...
> } while (!need_resched() && time_before(jiffies, timeout));

Thanks for taking a look.
These are functionally identical, but your form is the established
idiom elsewhere.
I will switch to that in v2.
--
Anuj

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Sergey Senozhatsky @ 2026-06-17  6:10 UTC (permalink / raw)
  To: Jianyue Wu, Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, Chris Li, Baoquan He,
	Nhat Pham, Barry Song, Kairui Song, Kemeng Shi, Youngjun Park,
	Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc, Brian Geffon
In-Reply-To: <CAJxJ_jiM_-a52EOm896FXkdH+wRxjSHJx+MW6b-ewNLVkp4uSw@mail.gmail.com>

Hi,

On (26/06/17 13:44), Jianyue Wu wrote:
> Hello Sergey,
> 
> On Wed, Jun 17, 2026 at 11:46 AM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> > Can we elaborate on zram being a "legacy interface"?
> My previous wording was ambiguous. Actually I didn't mean it is a
> legacy interface.

Oh, your wording wasn't ambiguous.  I simply forgot to direct my
previous email to Christoph.

^ permalink raw reply
