Linux block layer


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 20:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ajBRnUqXd7DqxLiG@kbusch-mbp>

On Mon, Jun 15, 2026 at 01:25:17PM -0600, Keith Busch wrote:
> In the meantime, since I so far can't reproduce this after including my
> previous proposal, I may have to request trying out a debug patch to get
> some more visibility on what's happening if that's okay.

Going in a different direction here, there's no reason to recreate the
lower level bios from scratch when they originate from an incoming bio.
We can just clone it along with an iterator pointing to the original.

Can you try this one out? This was successful when I ran your reproducer
and cuts out a lot of code too with a performance bonus for large IO.

---
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 1db565b376200..28adfeb58f240 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -170,12 +170,11 @@ struct dpages {
 			 struct page **p, unsigned long *len, unsigned int *offset);
 	void (*next_page)(struct dpages *dp);
 
-	union {
-		unsigned int context_u;
-		struct bvec_iter context_bi;
-	};
+	unsigned int context_u;
 	void *context_ptr;
 
+	struct bio *orig_bio;
+
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 };
@@ -210,44 +209,6 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned int o
 	dp->context_ptr = pl;
 }
 
-/*
- * Functions for getting the pages from a bvec.
- */
-static void bio_get_page(struct dpages *dp, struct page **p,
-			 unsigned long *len, unsigned int *offset)
-{
-	struct bio_vec bvec = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
-					     dp->context_bi);
-
-	*p = bvec.bv_page;
-	*len = bvec.bv_len;
-	*offset = bvec.bv_offset;
-
-	/* avoid figuring it out again in bio_next_page() */
-	dp->context_bi.bi_sector = (sector_t)bvec.bv_len;
-}
-
-static void bio_next_page(struct dpages *dp)
-{
-	unsigned int len = (unsigned int)dp->context_bi.bi_sector;
-
-	bvec_iter_advance((struct bio_vec *)dp->context_ptr,
-			  &dp->context_bi, len);
-}
-
-static void bio_dp_init(struct dpages *dp, struct bio *bio)
-{
-	dp->get_page = bio_get_page;
-	dp->next_page = bio_next_page;
-
-	/*
-	 * We just use bvec iterator to retrieve pages, so it is ok to
-	 * access the bvec table directly here
-	 */
-	dp->context_ptr = bio->bi_io_vec;
-	dp->context_bi = bio->bi_iter;
-}
-
 /*
  * Functions for getting the pages from a VMA.
  */
@@ -332,6 +293,21 @@ static void do_region(const blk_opf_t opf, unsigned int region,
 		return;
 	}
 
+	if (dp->orig_bio) {
+		bio = bio_alloc_clone(where->bdev, dp->orig_bio, GFP_NOIO,
+				      &io->client->bios);
+		bio->bi_iter.bi_sector = where->sector;
+		bio->bi_iter.bi_size = where->count << SECTOR_SHIFT;
+		bio->bi_opf = opf;
+		bio->bi_end_io = endio;
+		bio->bi_ioprio = ioprio;
+		store_io_and_region_in_bio(bio, io, region);
+
+		atomic_inc(&io->count);
+		submit_bio(bio);
+		return;
+	}
+
 	/*
 	 * where->count may be zero if op holds a flush and we need to
 	 * send a zero-sized flush.
@@ -468,6 +444,7 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 
 	dp->vma_invalidate_address = NULL;
 	dp->vma_invalidate_size = 0;
+	dp->orig_bio = NULL;
 
 	switch (io_req->mem.type) {
 	case DM_IO_PAGE_LIST:
@@ -475,7 +452,11 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 		break;
 
 	case DM_IO_BIO:
-		bio_dp_init(dp, io_req->mem.ptr.bio);
+		/*
+		 * The destination bios clone this bio's biovec directly, so
+		 * there are no per-page accessors to set up here.
+		 */
+		dp->orig_bio = io_req->mem.ptr.bio;
 		break;
 
 	case DM_IO_VMA:
-- 


* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Vincent Mailhol @ 2026-06-15 20:19 UTC (permalink / raw)
  To: Dave Hansen, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
In-Reply-To: <03be57ae-0e41-4b8a-adc5-bdd85ccce951@intel.com>

On 15/06/2026 at 18:46, Dave Hansen wrote:
> On 6/15/26 09:09, Vincent Mailhol wrote:
>> +#ifdef CONFIG_X86_64
>> +#define DPS_ROOT_PARTITION_TYPE_UUID "4f68bce3-e8cd-4db1-96e7-fbcaf984b709"
>> +#else
>> +#define DPS_ROOT_PARTITION_TYPE_UUID "44479540-f297-41b2-9af7-d131d5f0458a"
>> +#endif
> 
> This doesn't make a whole lot of sense to me. 64-bit kernels can run
> 32-bit userspace just fine.
> 
> But this #ifdef as proposed means that only a 32-bit *OR* 64-bit kernel
> can auto-discover a given partition.
> 
> I kinda think you should just have an array of strings for these things,
> maybe glued together with some preprocessor magic. Logically something
> like this:
> 
> const char* const uuids[] = {
> #ifdef CONFIG_ARM64
> 	"b921b045-1df0-41c3-af44-4c6f280d3fae",
> #endif
> #ifdef CONFIG_X86_64
> 	"4f68bce3-e8cd-4db1-96e7-fbcaf984b709",
> #endif
> #if defined(CONFIG_X86) && defined(CONFIG_COMPAT_32)
> 	"44479540-f297-41b2-9af7-d131d5f0458a",
> #endif
> ...
> };
> 
> ... and then search the array. I honestly don't think you need to
> sprinkle UUIDs all over the architectures.
> 
> It could probably also be done almost entirely in Kconfig. This could be
> in, say block/partitions/Kconfig, or arch/*/Kconfig:
> 
> config DPS_ROOT_PARTITION_TYPE_UUID_1
> 	string
>         default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
> 	default "b921b045-1df0-41c3-af44-4c6f280d3fae" if ARM64
> 	...
> 
> config DPS_ROOT_PARTITION_TYPE_UUID_2
> 	string
>         default "44479540-f297-41b2-9af7-..." if X86 && COMPAT_32
> 
> const char* const uuids[] = {
> #ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1
> 	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1,
> #endif
> #ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2
> 	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2,
> #endif
> ...
> };
> 
> There are a lot of ways to do this. I'm just not a super big fan of the
> current proposal.
> 
> So, boiling it down:
> 
> 1. Should more than one UUID be supported per kernel build?

I didn't pay much attention to this, but this is a very good point.

The Discoverable Partitions Specification is not clear about this
point. All it has to say is:

  On systems *with matching architecture*, the first partition with
  this type UUID on the disk containing the active EFI ESP is
  automatically mounted to the root directory /.

Does an x86_32 system match an x86_64 partition? Wouldn't make sense.
Does an x86_64 system match an x86_32 partition? Could be.

My feeling is that the intent was an *exact* match. This is supported
by the implementation in systemd, which just checks against
SD_GPT_ROOT_NATIVE (which corresponds to the exact match).

  https://github.com/systemd/systemd/blob/main/src/udev/udev-builtin-blkid.c#L243-L247

*But* there are some hints about a secondary UUID. In my terminal I have:

  $ systemd-id128 show root root-secondary
  NAME           ID                              
  root           4f68bce3e8cd4db196e7fbcaf984b709
  root-secondary 44479540f29741b29af7d131d5f0458a

where root is the x86_64 one and root-secondary is the x86_32 one. So
although I see no match logic in the code, the ID table has it!

That said, your points make sense to me, and I would be supportive of
allowing a search for a secondary UUID as a kernel extension. If we do
so, I think the only constraint should be to make sure that we check
for the exact match first (e.g. check x86_64 type before x86_32 type).

Would that make sense?
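
A minimal sketch of that ordering, written as plain userspace C rather
than kernel code: every candidate partition is checked against the
native UUID before the secondary one is considered at all. The function
name dps_pick_root(), the two macro names and the flat array of
partition type strings are hypothetical stand-ins for illustration; the
UUID values are the x86_64 and x86_32 ones quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-ins for the Kconfig strings on an x86_64 build. */
#define DPS_ROOT_UUID_PRIMARY   "4f68bce3-e8cd-4db1-96e7-fbcaf984b709"
#define DPS_ROOT_UUID_SECONDARY "44479540-f297-41b2-9af7-d131d5f0458a"

/* Ordered by priority: the exact (native) match always comes first. */
static const char *const dps_root_uuids[] = {
	DPS_ROOT_UUID_PRIMARY,
	DPS_ROOT_UUID_SECONDARY,
};

/*
 * Return the index of the first partition whose type UUID matches the
 * highest-priority entry of dps_root_uuids[], or -1 if none matches.
 * The outer loop walks the UUIDs, so a secondary match is only returned
 * when no partition at all carries the primary UUID.
 */
static int dps_pick_root(const char *const *part_types, size_t nparts)
{
	size_t nuuids = sizeof(dps_root_uuids) / sizeof(dps_root_uuids[0]);

	assert(part_types != NULL || nparts == 0);

	for (size_t u = 0; u < nuuids; u++)
		for (size_t p = 0; p < nparts; p++)
			if (strcasecmp(part_types[p], dps_root_uuids[u]) == 0)
				return (int)p;
	return -1;
}
```

With this loop order, a disk that lists an x86_32 root partition before
an x86_64 one still resolves to the x86_64 partition.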

> 2. Should the UUIDs be defined in arch code or generic code?

I think that you convinced me to put it in generic code.

> 3. Kconfig or #ifdefs?

I would say Kconfig. If we go for the exact match only, that would be:

  CONFIG_DPS_ROOT_PARTITION_TYPE_UUID

If we allow more as an extension, that would become:

  - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID for the exact match
  - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY for the compatible
    one.

The drawback is that some entries will be in both:

  config DPS_ROOT_PARTITION_TYPE_UUID
  	string
  	  default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
  	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86

  config DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY
  	string
  	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86_64 && COMPAT_32

And I don't think we need more than two.

A bonus question: should those Kconfig entries be hidden? I prefer the
hidden option because it doesn't add that much code and I thought this
was not worth bothering the user with one more menuconfig question.
But I would be happy to change if people think this is worth a
menuconfig entry.

Yours sincerely,
Vincent Mailhol


* Re: [PATCH 08/19] parisc: define DPS root partition type UUID
From: Helge Deller @ 2026-06-15 20:27 UTC (permalink / raw)
  To: James Bottomley, Vincent Mailhol, Jens Axboe, Davidlohr Bueso,
	Alexander Viro, Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel, linux-parisc
In-Reply-To: <0158c89cffd76c621607f66d1889ccc084754729.camel@HansenPartnership.com>

On 6/15/26 22:02, James Bottomley wrote:
> On Mon, 2026-06-15 at 18:09 +0200, Vincent Mailhol wrote:
>> DPS [1] assigns GPT partition type UUIDs to operating system
>> partitions. Root partitions use architecture-specific type UUIDs so
>> the OS can discover the intended root filesystem without relying on a
>> root=  cmdline option.
>>
>> Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for parisc and
>> select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

Vincent, first of all thank you for at least trying to include parisc (and other
niche Linux ports) in the specification! (whatever the outcome is!)

>> [1] The Discoverable Partitions Specification (DPS)
>> Link:
>> https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
> 
> How are you planning to make this work for parisc?  Some systems have a
> PALO boot partition (fdisk type 0xf0) but the more modern way is to
> place palo inside a hidden ext4 inode in /boot.  The way parisc IODC
> works is very similar to the way MSDOS boots with the palo location
> table in the first block so I theorize that would probably work for gpt
> partitions as well ... I'm just not sure anyone has tested it.
> 
> However, to get this to work with PALO for auto discovery, you'd need
> palo patches to recognize the DPS UUID and no-one seems to have
> submitted anything to palo for this.

Maybe it's not necessary that palo does this job?
palo could stay as is and load kernel and the initrd.
Then the kernel (or the scripts in initrd) could try to find the root
partition on its own (and handle GPT discs).

I even once started porting grub to parisc (which is currently on hold
because I'm busy with other stuff). If I ever finish it, having such a
mechanism/constant already in place is IMHO beneficial.

Helge


* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Vincent Mailhol @ 2026-06-15 20:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Davidlohr Bueso, Christian Brauner, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615170432.GW2636677@ZenIV>

On 15/06/2026 at 19:04, Al Viro wrote:
> On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:
> 
>> Tested with GRUB, which implements the LoaderDevicePartUUID EFI variable
>> in its bli module [3]. With this, I was able to boot a kernel with a
>> completely empty cmdline and no initrd.
>>
>> [1] The Discoverable Partitions Specification (DPS)
>> Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
>>
>> [2] systemd-gpt-auto-generator
>> Link: https://www.freedesktop.org/software/systemd/man/latest/systemd-gpt-auto-generator.html
>>
>> [3] GRUB -- §16.2 bli
>> Link: https://www.gnu.org/software/grub/manual/grub/html_node/bli_005fmodule.html
> 
> So what does that thing, tied to EFI as it is, have to do with architectures where
> 	* firmware is rather unlike EFI

I made CONFIG_DPS_ROOT_AUTO_DISCOVERY depend on CONFIG_EFI for this reason.

> 	* firmware wouldn't know what to do with GPT
> 	* GRUB is *not* ported to, let alone used
> such as, say it, the very first one mentioned at your [1]?

Fair point. I just did:

  $ git grep "^config EFI$"
  arch/arm/Kconfig:config EFI
  arch/arm64/Kconfig:config EFI
  arch/loongarch/Kconfig:config EFI
  arch/riscv/Kconfig:config EFI
  arch/x86/Kconfig:config EFI

Anything not in this list is dead code at the moment.

> Or is that conditional upon "if anyone wants to design replacement firmware
> for those, and if they agree to follow our wishlist"?

No, it was just an oversight from my side. I will just keep arm, arm64,
loongarch, riscv and x86 in my v2.


Yours sincerely,
Vincent Mailhol



* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Vincent Mailhol @ 2026-06-15 20:39 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Hansen
  Cc: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara, linux-kernel, linux-block, linux-efi,
	linux-fsdevel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
In-Reply-To: <ajA-0YJW4gLr02c7@casper.infradead.org>

On 15/06/2026 at 20:05, Matthew Wilcox wrote:
> On Mon, Jun 15, 2026 at 09:46:41AM -0700, Dave Hansen wrote:
>> There are a lot of ways to do this. I'm just not a super big fan of the
>> current proposal.
>>
>> So, boiling it down:
>>
>> 1. Should more than one UUID be supported per kernel build?
>> 2. Should the UUIDs be defined in arch code or generic code?
>> 3. Kconfig or #ifdefs?
> 
> Further questions ... why do this in the kernel?

Most of the plumbing was already there, so the feature is still
tiny. It seems like a reasonable trade-off to me.

> Seems perfectly suited to be in initramfs where we can throw away the
> code after boot.

The added code uses the __init attribute for this exact reason: so that
its memory can be reclaimed after boot.

-- 
Yours sincerely,
Vincent Mailhol



* Re: [PATCH 08/19] parisc: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 20:43 UTC (permalink / raw)
  To: Helge Deller, James Bottomley, Jens Axboe, Davidlohr Bueso,
	Alexander Viro, Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel, linux-parisc
In-Reply-To: <0290e66b-1da2-4706-ab28-2d47f164df08@gmx.de>

On 15/06/2026 at 22:27, Helge Deller wrote:
> On 6/15/26 22:02, James Bottomley wrote:
>> On Mon, 2026-06-15 at 18:09 +0200, Vincent Mailhol wrote:
>>> DPS [1] assigns GPT partition type UUIDs to operating system
>>> partitions. Root partitions use architecture-specific type UUIDs so
>>> the OS can discover the intended root filesystem without relying on a
>>> root=  cmdline option.
>>>
>>> Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for parisc and
>>> select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.
> 
> Vincent, first of all thank you for at least trying to include parisc
> (and other niche Linux ports) in the specification! (whatever the
> outcome is!)

You are welcome. My personal interest is only x86_64 at the moment, but
at least I tried to make it useful to the broader community!

>>> [1] The Discoverable Partitions Specification (DPS)
>>> Link:
>>> https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
>>
>> How are you planning to make this work for parisc?  Some systems have a
>> PALO boot partition (fdisk type 0xf0) but the more modern way is to
>> place palo inside a hidden ext4 inode in /boot.  The way parisc IODC
>> works is very similar to the way MSDOS boots with the palo location
>> table in the first block so I theorize that would probably work for gpt
>> partitions as well ... I'm just not sure anyone has tested it.
>>
>> However, to get this to work with PALO for auto discovery, you'd need
>> palo patches to recognize the DPS UUID and no-one seems to have
>> submitted anything to palo for this.
> 
> Maybe it's not necessary that palo does this job?
> palo could stay as is and load kernel and the initrd.
> Then the kernel (or the scripts in initrd) could try to find the root
> partition on its own (and handle GPT discs).
> 
> I even once started porting grub to parisc (which is currently on hold
> because I'm busy with other stuff). If I ever finish it, having such a
> mechanism/constant already in place is IMHO beneficial.

You can see my answer to Alexander on the cover letter. This was an
oversight. parisc does not have CONFIG_EFI to begin with, so the feature
is just dead code there.

I will remove parisc (and all other architectures which do not have
CONFIG_EFI) in v2. If someone wants to implement EFI support for those
architectures, only then can we revisit this DPS topic for them.


Yours sincerely,
Vincent Mailhol



* Re: [PATCH] block: check bio split for unaligned bvec
From: Keith Busch @ 2026-06-15 22:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-block, axboe, Carlos Maiolino
In-Reply-To: <20260615133549.GC26132@lst.de>

On Mon, Jun 15, 2026 at 03:35:49PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 12, 2026 at 03:32:04PM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Offsets and lengths need to be validated against the dma alignment. This
> > check was skipped for a sufficiently small bio with a single bvec, which
> > may allow an invalid request dispatched to the driver. Force the
> > validation for an unaligned bvec by forcing the bio split path that
> > handles this condition.
> 
> This fix itself looks good, but we'll also need something similar
> for bio-based drivers that never call into the splitting helper.

Totally agree. I'm looking at all the .submit_bio drivers, and I think
they fall into one of four categories:

  1: already split (md/nvme-mp/drbd; dm conditional)
  2: don't split
      * btt, dcssblk: already reject unaligned
      * n64cart: WARNs, but potentially proceeds to undefined behavior
      * nfhd: silently corrupts, but looks like a driver problem
  3: can handle arbitrary memory but advertise default dma_alignment=511
      (brd, pmem, zram, ps3vram, simdisk - "limits lie")
  4: forward/self-split (bcache)

I think the block layer can fix 3 with a BLK_FEAT flag to allow a zero
dma_alignment limit for the drivers that really don't need it from the
source buffer.

As for the rest, I don't know of anyone caring to ensure n64 or nfhd are
correctly handling degenerate applications.
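
As a rough illustration of the validation discussed in this thread, the
sketch below models the per-segment alignment test in plain userspace C.
It is not the block layer's code: struct seg and both function names are
hypothetical, and the only things taken from the discussion are that
dma_alignment is a mask (511 by default) and that a zero mask would mean
"no constraint on the source buffer".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of a bvec: offset in the page, length. */
struct seg {
	uint32_t offset;
	uint32_t len;
};

/*
 * dma_alignment is a mask (alignment - 1), e.g. 511 for 512-byte
 * alignment. A mask of 0 accepts any offset and length.
 */
static bool seg_is_aligned(const struct seg *s, uint32_t dma_alignment)
{
	return ((s->offset | s->len) & dma_alignment) == 0;
}

/* True only if every segment passes the mask test. */
static bool segs_are_aligned(const struct seg *segs, size_t n,
			     uint32_t dma_alignment)
{
	assert(segs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (!seg_is_aligned(&segs[i], dma_alignment))
			return false;
	return true;
}
```

Under this model, a driver in category 3 would pass a mask of 0 and
accept a segment such as { .offset = 100, .len = 512 }, while the
default mask of 511 rejects it.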


* Re: [PATCH] sunvdc: fix -EIO issue due to lack of retries
From: John Paul Adrian Glaubitz @ 2026-06-15 22:09 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <418310b3-2b77-4534-b2fd-27dcc11e333c@kernel.dk>

Hi,

On Mon, 2025-10-06 at 08:59 -0600, Jens Axboe wrote:
> John reports that since commit:
> 
> a11f6ca9aef9 ("sunvdc: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN")
> 
> users of Linux inside Solaris ldom see occasional -EIO errors because
> the request send loop now times out. The current loop does 10 retries,
> and inside vio_ldc_send() a further 1000 1usec retries are done as well.
> Even with 10.5 msec of busy loop retries that's apparently not enough to
> always succeed.
> 
> Rather than introduce continued busy looping, requeue the request and
> have the delayed queue kicking retry the request after another 10ms.
> This obviously isn't ideal, but there's seemingly no way to wait for
> this type of event. And if 10ms of busy looping was not enough to make
> progress, then presumably this is an edge condition and we just need to
> guarantee to make forward progress at some later point in time. That's
> more suitably done through letting the CPU tend to other work, rather
> than sitting in a tight loop retrying.
> 
> Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> Link: https://lore.kernel.org/all/20251006100226.4246-2-glaubitz@physik.fu-berlin.de/
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> Caveat: 100% untested, not even compiled. Sending out on John's behest.
> 
> diff --git a/drivers/block/sunvdc.c b/drivers/block/sunvdc.c
> index db1fe9772a4d..aa49dffb1b53 100644
> --- a/drivers/block/sunvdc.c
> +++ b/drivers/block/sunvdc.c
> @@ -539,6 +539,7 @@ static blk_status_t vdc_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	struct vdc_port *port = hctx->queue->queuedata;
>  	struct vio_dring_state *dr;
>  	unsigned long flags;
> +	int ret;
>  
>  	dr = &port->vio.drings[VIO_DRIVER_TX_RING];
>  
> @@ -560,7 +561,13 @@ static blk_status_t vdc_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		return BLK_STS_DEV_RESOURCE;
>  	}
>  
> -	if (__send_request(bd->rq) < 0) {
> +	ret = __send_request(bd->rq);
> +	if (ret == -EAGAIN) {
> +		spin_unlock_irqrestore(&port->vio.lock, flags);
> +		/* already spun for 10msec, defer 10msec and retry */
> +		blk_mq_delay_kick_requeue_list(hctx->queue, 10);
> +		return BLK_STS_DEV_RESOURCE;
> +	} else if (ret < 0) {
>  		spin_unlock_irqrestore(&port->vio.lock, flags);
>  		return BLK_STS_IOERR;
>  	}

I will give this patch a try this week as I finally want to get this fixed.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Qu Wenruo @ 2026-06-15 22:43 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <ajAU1yLd32BCiCNj@infradead.org>



On 2026/6/16 00:35, Christoph Hellwig wrote:
> On Fri, Jun 12, 2026 at 07:21:14PM +0930, Qu Wenruo wrote:
>> If a direct IO requires bounced pages for stable buffer, it will always
>> allocate memory, and both bio_iov_iter_bounce_write() and
>> bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
>> can sleep and break NOWAIT requirement.
>>
>> So we need to reject such NOWAIT and BOUNCE direct IO in
>> iomap_dio_bio_iter().
> 
> That's a bit heavy handed. Just do a noretry allocation.

From the comment of __GFP_NORETRY:

  * %__GFP_NORETRY: The VM implementation will try only very lightweight
  * memory direct reclaim to get some memory under memory pressure (thus
  * it can sleep).

It looks like NORETRY can still sleep, thus again breaking the NOWAIT
requirement.

I think you're talking about GFP_NOWAIT?


* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-15 22:53 UTC (permalink / raw)
  To: Leonid Ravich
  Cc: Herbert Xu, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260615111459.9452-1-lravich@amazon.com>

On Mon, Jun 15, 2026 at 11:14:56AM +0000, Leonid Ravich wrote:
> The series adds a per-request "data unit size" to the skcipher API
> so a caller can submit several data units (typically 512..4096-byte
> sectors) sharing one starting IV in a single request.  Algorithms
> derive each data unit's IV from the caller-supplied IV by treating
> it as a 128-bit little-endian counter and adding the data-unit
> index, matching the layout produced by dm-crypt's plain64 IV mode
> and by typical inline-encryption hardware.
> 
> This mirrors the data_unit_size concept already exposed by
> struct blk_crypto_config for inline encryption.
> 
> The first user is dm-crypt, which today issues one skcipher request
> per sector and so pays a per-sector cost in request allocation,
> callback dispatch, completion handling, and scatterlist setup.
> 
> Proof-of-concept performance numbers from the RFC reply [1]: +19%
> throughput / -40% CPU on a single-core arm64 system with a hardware
> XTS-AES-256 accelerator running fio 4 KiB sequential writes through
> dm-crypt, when an out-of-tree arm64 xts driver advertises
> CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU.  This series itself does not
> include arch enablement; the fast path is opt-in per driver, the
> slow path is universal via the auto-splitter.
> 
> The native fast path amortises both per-sector dispatch and per-sector
> crypto setup across a bio - the measured win above, on an engine that
> offloads the AES compute.  The auto-splitter is for correctness and
> reach: any consumer can set data_unit_size and get correct output with
> the per-request allocation/callback/completion cost removed, but it
> still issues one alg->encrypt per data unit, so on a software cipher it
> saves only dispatch overhead (no throughput figure claimed - that is
> hardware- and workload-dependent).  What it guarantees unconditionally
> is byte-identical output (Verification below) at O(entries + units),
> walking the scatterlists with a pair of struct scatter_walk cursors
> rather than rescanning from the head per unit.

So in other words, this series slows down dm-crypt and crypto_skcipher
for everyone to optimize for an out-of-tree driver.  And there's also no
benchmark showing that your driver is even worth it over just using the
CPU.
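
For readers following the IV scheme described in the quoted cover
letter, here is a minimal userspace sketch of "treat the IV as a 128-bit
little-endian counter and add the data-unit index". The function name
derive_du_iv() and the IV_LEN macro are hypothetical; this is only the
arithmetic as described in the text, not code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IV_LEN 16	/* 128-bit IV */

/*
 * out = base + index, where base and out are 128-bit little-endian
 * integers (byte 0 is least significant). Overflow wraps modulo 2^128.
 * out may alias base: each byte is read before it is written.
 */
static void derive_du_iv(uint8_t out[IV_LEN], const uint8_t base[IV_LEN],
			 uint64_t index)
{
	unsigned int carry = 0;

	assert(out != NULL && base != NULL);

	for (int i = 0; i < IV_LEN; i++) {
		/* The index only contributes to the low 8 bytes. */
		unsigned int add = i < 8 ?
			(unsigned int)((index >> (8 * i)) & 0xff) : 0;
		unsigned int sum = base[i] + add + carry;

		out[i] = (uint8_t)sum;
		carry = sum >> 8;
	}
}
```

With a plain64-style starting IV (sector number in the low 8 bytes,
zeros above), data unit N simply gets the IV of sector + N unless the
low 64 bits overflow, in which case the carry moves into byte 8.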

- Eric


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Vjaceslavs Klimovs @ 2026-06-15 23:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Thorsten Leemhuis, kbusch, trnka, linux-block, dm-devel,
	Linux kernel regressions list
In-Reply-To: <ai_1LYtofh1fwD-N@gallifrey>

Hi Dave, all,

I'm one of the original reporters and very much a user, not a block/dm
developer, so please sanity-check all of this.

Your trace looks like what the two earlier reports hit: a read reaching
a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
that may help read the trace: blk_io_trace.error is a __u16, so the
bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
65531 = -EIO).
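
The conversion is easy to check by hand; a small sketch, assuming the
usual Linux errno values (EINVAL = 22, EIO = 5). The helper name
trace_error_to_errno() is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Reinterpret a 16-bit unsigned trace error field as the signed value
 * it was stored from. Values of 0x8000 and above map to negatives.
 */
static int trace_error_to_errno(uint16_t raw)
{
	int val = raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;

	/* Sanity check: truncating back to 16 bits gives the traced value. */
	assert((uint16_t)val == raw);
	return val;
}
```

So 65514 is 65536 - 22 and 65531 is 65536 - 5, which is where the two
errnos above come from.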

The WARN itself is new, the bad bio isn't. bio_add_page() only started
rejecting len == 0 in 643893647cac ("block: reject zero length in
bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
That fits your "not a recent regression": the condition is older, v7.1
just made it loud.

For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
origin looks like 5ff3f74e145a ("block: simplify direct io validity
check", v6.18): blkdev_dio_invalid() now checks only aggregate
ki_pos | count alignment and dropped the per-segment
bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
longer gets -EINVAL at the fops boundary. But your reproducer reads a
file, which goes through the filesystem O_DIRECT path and never calls
blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
that one entry point.
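
To make the difference between the two checks concrete, here is a
userspace sketch (not the kernel's code) of an aggregate-only test next
to a per-segment walk. Both function names and the separate address and
length masks are assumptions made for the example; only the idea of
"ki_pos | count" versus checking each segment comes from the commit
discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Aggregate check: file position and total byte count only. */
static bool dio_aggregate_ok(uint64_t pos, const struct iovec *iov, int n,
			     unsigned int lbs)
{
	uint64_t count = 0;

	/* lbs is the logical block size and must be a power of two. */
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);

	for (int i = 0; i < n; i++)
		count += iov[i].iov_len;
	return ((pos | count) & (lbs - 1)) == 0;
}

/* Per-segment check: every base address and every length. */
static bool dio_segments_ok(const struct iovec *iov, int n,
			    uintptr_t addr_mask, size_t len_mask)
{
	for (int i = 0; i < n; i++) {
		if ((uintptr_t)iov[i].iov_base & addr_mask)
			return false;
		if (iov[i].iov_len & len_mask)
			return false;
	}
	return true;
}
```

A two-segment request of 100 + 412 bytes at offset 0 passes the
aggregate test for 512-byte blocks (the total is 512) but fails the
per-segment walk, which is the kind of request that would previously
have been rejected with -EINVAL before reaching the lower layers.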

dm-mirror then hangs because Keith's f7b24c7b41f2 only covers md
raid1/raid10; legacy dm-mirror (dm-raid1.c) has no equivalent and
rebuilds the empty read onto the other leg. Note the leg's status isn't
even consistent (your SATA path returns BLK_STS_IOERR, not
BLK_STS_INVAL), so copying that status check into dm-mirror probably
wouldn't catch every case.

For what it's worth, that points me toward rejecting the empty or
misaligned bio once, at submission, with -EINVAL, rather than teaching
each consumer to tolerate it. But you'll know the tradeoffs far better
than I do.

I have a small QEMU + LVM raid1/mirror setup that reproduces the
block-device variant and bisects to 5ff3f74e. Happy to run your file
reproducer with some instrumentation at the dm-mirror read entry
(bi_size vs bio_sectors vs bvec lengths) to see whether the bio is
already empty on arrival or built that way on the retry, and to test
any patch.

Thanks,
Vjaceslavs


On Mon, Jun 15, 2026 at 5:50 AM Dr. David Alan Gilbert
<linux@treblig.org> wrote:
>
> * Thorsten Leemhuis (regressions@leemhuis.info) wrote:
> > On 6/14/26 19:57, Dr. David Alan Gilbert wrote:
> > >
> > >   I've got a repeatable raid hang/warn and would appreciate some pointers
> > > as where to debug.
> > >   (I've been logging stuff on  https://bugzilla.kernel.org/show_bug.cgi?id=221535 )
> >
> > Note: not my area of expertise, so I might be sending you totally
> > off-track with this comment. Feel free to ignore it. But FWIW:
>
> Hi Thorsten,
>   Thanks for the reply - these do seem to be related!
> (So copying in Keith, Vjaceslavs, and Tomáš )
> (Not my area either).
>
> > Have you seen these reports?
> > https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/
> > https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
>
> I hadn't!  Those are both the problem I originally was trying to debug
> and stumbled into the WARN/BUG/hang with my test program.
>
> > The former led to a fix in the mdraid code that should be in the kernel
> > version you are using. But in a reply to the latter report the reporter
> > claimed that that fix is not enough (claiming "this was obvious" and
> > also using dm), but things then stalled there.
>
> Yeh I see my world has Keith's f7b24c7b41f23
>
> I think the problem I'm seeing is zero length requests coming from somewhere.
>
> The WARN I'm seeing in 7.1.0-rc7+ is:
>
> [ 2681.597042] device-mapper: raid1: Mirror read failed from 252:25. Trying alternative device.
> [ 2681.631933] ------------[ cut here ]------------
> [ 2681.631939] WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#22: kworker/22:0/18929
>
> 1039 int bio_add_page(struct bio *bio, struct page *page,
> 1040                  unsigned int len, unsigned int offset)
> 1041 {
> 1042         if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> 1043                 return 0;
> 1044         if (WARN_ON_ONCE(len == 0))
> 1045                 return 0;
>
> So it's the ' if (WARN_ON_ONCE(len == 0))'
>
> and the warn I got on the older 7.0.8 was:
> [Sun May 17 17:22:52 2026] WARNING: drivers/scsi/scsi_lib.c:1140 at scsi_alloc_sgtables+0x38a/0x400, CPU#28: kworker/28:1H/3943
>
> which I *think* corresponds to:
> 1164         if (WARN_ON_ONCE(!nr_segs))
> 1165                 return BLK_STS_IOERR;
>
> so it sounds like we need to find where zero length requests are coming from??
>
> Thanks again,
>
> Dave
>
> > Ciao, Thorsten
> >
> > >   This started off as debugging a case where I'd get my RAID1 (on the host)
> > > getting a reliable 'rescheduling sector'/disk failure while running the qemu block test suite
> > > during a qemu build, but then I tried to build a smaller discrete
> > > test, and now I've got a simply triggerable warn and test hang.
> > > There's no errors from the underlying SATA layer on the storage,
> > > everything resyncs just fine.
> > >
> > > I've got an existing LVM vg ('main') with two mirrors on sda2, and sdb2
> > > which are SATA disks.
> > >
> > > # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > > # mkfs.ext4 /dev/mapper/main-lvol0
> > > # mount /dev/mapper/main-lvol0 /mnt/tmp/
> > > # chmod a+rwx /mnt/tmp
> > >
> > > $ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1
> > >
> > > (I then wait for the IO to stop)
> > >
> > > then we've got this little test program:
> > >
> > > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><-->
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <asm-generic/fcntl.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > >
> > >
> > > const char* path="/mnt/tmp/testfile";
> > > static char buf[8192];
> > >
> > > int main()
> > > {
> > >   int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);
> > >
> > >   errno=0;
> > >   int res3=pread(fd, buf, 4096, 0);
> > >   printf("pread of 4096 said: %d (%m)\n", res3);
> > >
> > > }
> > > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><-->
> > >
> > > running that, either hangs or gets a 'pread of 4096 said: -1 (Input/output error)'
> > > when it hangs it's unkillable.
> > >
> > > at the moment (on 7.1.0-rc7) this is giving:
> > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > >
> > > (full backtrace below)
> > > (Note there is a moan in there about sdb IO error - repeated a lot - but
> > > again, there's no SATA level errors, and the drive is fine on smart, and
> > > I can read the whole of the underlying lvm mirrors, so I don't think it's
> > > physically there).
> > >
> > > I did a blktrace, although that gives me a 23G blkparse output, hmm
> > > (I see each event repeated a lot - maybe per thread?)
> > >
> > > 252,26  15        1     0.000000000  3435  Q  RS 264192 + 8 [dbf]
> > >   252,26 is /dev/mapper/main-lvol0
> > > 252,24  15        1     0.000005501  3435  A  RS 264192 + 8 <- (252,26) 264192
> > >   252,24 is main-lvol0_mimage_0
> > > 252,24  15        2     0.000005761  3435  Q  RS 264192 + 8 [dbf]
> > >   8,0   15        1     0.000008646  3435  A  RS 71634944 + 8 <- (252,24) 264192
> > >     so that's sda
> > >   8,0   15        2     0.000008787  3435  A  RS 73734144 + 8 <- (8,2) 71634944
> > >     I guess mapping down from sda2 to sda
> > >   8,0   15        3     0.000009037  3435  Q  RS 73734144 + 8 [dbf]
> > >   8,0   15        4     0.000009809  3435  C  RS 73734144 + 8 [65514]
> > >       ??? Hmm what's the 65514 there?
> > > 252,24  15        3     0.000010320  3435  C  RS 264192 + 8 [65514]
> > > 252,25  15        1     0.000290384   369  Q   R 264192 + 8 [kworker/15:1]
> > >    252,25 is main-lvol0_mimage_1
> > >
> > > and at this point I'm a bit lost as to what I'm looking for.
> > >
> > > Hints appreciated!
> > >
> > > (I don't believe this is a regression - or at least not recent)
> > >
> > > Dave
> > >
> > >
> > >
> > >
> > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > > Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync.
> > > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not tainted 7.1.0-rc7+ #786 PREEMPT(lazy)
> > > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > > Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250
> > > Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c
> > > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246
> > > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 RCX: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 R09: ffffd1fb8176fc40
> > > Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 R12: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 R15: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > > Jun 14 18:08:32 dalek kernel: Call Trace:
> > > Jun 14 18:08:32 dalek kernel:  <TASK>
> > > Jun 14 18:08:32 dalek kernel:  do_region+0x227/0x2a0
> > > Jun 14 18:08:32 dalek kernel:  dispatch_io+0xf1/0x150
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  dm_io+0x169/0x2d0
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_reads+0x149/0x230
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_mirror+0x11a/0x2b0
> > > Jun 14 18:08:32 dalek kernel:  process_one_work+0x19e/0x390
> > > Jun 14 18:08:32 dalek kernel:  worker_thread+0x1a6/0x310
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_worker_thread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  kthread+0xe4/0x120
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork+0x1a1/0x270
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork_asm+0x1a/0x30
> > > Jun 14 18:08:32 dalek kernel:  </TASK>
> > > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]---
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek kernel: WARNING: drivers/scsi/scsi_lib.c:1164 at scsi_alloc_sgtables+0x38a/0x400, CPU#15: kworker/15:1/369
> > > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Tainted: G        W           7.1.0-rc7+ #786 PREEMPT(lazy)
> > > Jun 14 18:08:32 dalek kernel: Tainted: [W]=WARN
> > > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > > Jun 14 18:08:32 dalek kernel: RIP: 0010:scsi_alloc_sgtables+0x38a/0x400
> > > Jun 14 18:08:32 dalek kernel: Code: 8b 3d ba 2d a9 01 e9 d1 fd ff ff 48 8b 75 00 48 8d bb f0 fe ff ff e8 15 b7 b0 ff 48 89 ab e0 00 00 00 89 45 08 e9 30 ff ff ff <0f> 0b 4c 8b 6c 24 30 b8 0a 00 00 00 e9 21 ff ff ff b8 09 00 00 00
> > > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176f7f0 EFLAGS: 00010246
> > > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffff8d1aedad0110 RCX: 0000000000000009
> > > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: ffffffff99c15960 RDI: ffff8d1aedad0110
> > > Jun 14 18:08:32 dalek kernel: RBP: ffff8d1a93d17000 R08: ffff8d1aedad0110 R09: ffff8d1a818fa800
> > > Jun 14 18:08:32 dalek kernel: R10: 7020676e69736961 R11: 0000000000000000 R12: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: R13: 0000000000000000 R14: ffff8d1a93394000 R15: ffff8d1a93d17000
> > > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > > Jun 14 18:08:32 dalek kernel: Call Trace:
> > > Jun 14 18:08:32 dalek kernel:  <TASK>
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  sd_setup_read_write_cmnd+0x9d/0x740
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  scsi_queue_rq+0x4d2/0x890
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_dispatch_rq_list+0x241/0x530
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  ? sbitmap_get+0x61/0x100
> > > Jun 14 18:08:32 dalek kernel:  __blk_mq_do_dispatch_sched+0x330/0x340
> > > Jun 14 18:08:32 dalek kernel:  __blk_mq_sched_dispatch_requests+0x143/0x180
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_sched_dispatch_requests+0x2d/0x70
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_run_hw_queue+0x2bf/0x350
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_dispatch_list+0x172/0x350
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_flush_plug_list+0x51/0x1a0
> > > Jun 14 18:08:32 dalek kernel:  ? blk_mq_submit_bio+0x71c/0x9f0
> > > Jun 14 18:08:32 dalek kernel:  __blk_flush_plug+0x112/0x180
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  __submit_bio+0x19c/0x260
> > > Jun 14 18:08:32 dalek kernel:  __submit_bio_noacct+0x8e/0x210
> > > Jun 14 18:08:32 dalek kernel:  do_region+0x14c/0x2a0
> > > Jun 14 18:08:32 dalek kernel:  dispatch_io+0xf1/0x150
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  dm_io+0x169/0x2d0
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_reads+0x149/0x230
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_mirror+0x11a/0x2b0
> > > Jun 14 18:08:32 dalek kernel:  process_one_work+0x19e/0x390
> > > Jun 14 18:08:32 dalek kernel:  worker_thread+0x1a6/0x310
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_worker_thread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  kthread+0xe4/0x120
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork+0x1a1/0x270
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork_asm+0x1a/0x30
> > > Jun 14 18:08:32 dalek kernel:  </TASK>
> > > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]---
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:37 dalek kernel: blk_print_req_error: 241000 callbacks suppressed
> > > Jun 14 18:08:37 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > >
> > >
> >
> --
>  -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-16  0:06 UTC (permalink / raw)
  To: Vjaceslavs Klimovs
  Cc: Dr. David Alan Gilbert, Thorsten Leemhuis, trnka, linux-block,
	dm-devel, Linux kernel regressions list
In-Reply-To: <CAC_j7i0eDccVWzPeRafM50mZEOFHPz2cwd=RZqqx6TK2EVRFvw@mail.gmail.com>

On Mon, Jun 15, 2026 at 04:16:12PM -0700, Vjaceslavs Klimovs wrote:
> Your trace looks like what the two earlier reports hit: a read reaching
> a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
> that may help read the trace: blk_io_trace.error is a __u16, so the
> bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
> 65531 = -EIO).
> 
> The WARN itself is new, the bad bio isn't. bio_add_page() only started
> rejecting len == 0 in 643893647cac ("block: reject zero length in
> bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
> scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
> That fits your "not a recent regression": the condition is older, v7.1
> just made it loud.
> 
> For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
> origin looks like 5ff3f74e145a ("block: simplify direct io validity
> check", v6.18): blkdev_dio_invalid() now checks only aggregate
> ki_pos | count alignment and dropped the per-segment
> bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
> longer gets -EINVAL at the fops boundary. But your reproducer reads a
> file, which goes through the filesystem O_DIRECT path and never calls
> blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
> that one entry point.
> 
> dm-mirror then hangs because Keith's f7b24c7b41f2 only covers md
> raid1/raid10; legacy dm-mirror (dm-raid1.c) has no equivalent and
> rebuilds the empty read onto the other leg. Note the leg's status isn't
> even consistent (your SATA path returns BLK_STS_IOERR, not
> BLK_STS_INVAL), so copying that status check into dm-mirror probably
> wouldn't catch every case.
> 
> For what it's worth, that points me toward rejecting the empty or
> misaligned bio once, at submission, with -EINVAL, rather than teaching
> each consumer to tolerate it. But you'll know the tradeoffs far better
> than I do.
> 
> I have a small QEMU + LVM raid1/mirror setup that reproduces the
> block-device variant and bisects to 5ff3f74e. Happy to run your file
> reproducer with some instrumentation at the dm-mirror read entry
> (bi_size vs bio_sectors vs bvec lengths) to see whether the bio is
> already empty on arrival or built that way on the retry, and to test
> any patch.

Thanks for following up here. I didn't initially see your follow-up
until Thorsten linked it. I apologize for missing that; this feature is
important, so I don't want to see anything regress for it.

There is a known bug fix I think future tests should include:

  https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/

This likely isn't the fix you're looking for, but including it rules out
conditions that are not important here.

After that, can we try this suggestion and see if the hang goes away?

  https://lore.kernel.org/linux-block/ajBb8tK-0aJBpIgF@kbusch-mbp/

I expect the original test case to still return an error (and I think it
was designed to), but it shouldn't produce the warn or bug splats with a
stuck uninterruptible task.
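
As an aside on reading the blkparse output: the quoted explanation says
blk_io_trace.error is a __u16, so the bracketed values on the C lines are
errnos seen as unsigned 16-bit numbers. A minimal userspace sketch of that
conversion follows. The names trace_error_to_errno() and
trace_error_selfcheck() are made up for illustration and are not part of
blktrace.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a blktrace API): reinterpret the unsigned
 * 16-bit value that blkparse prints in brackets as a signed errno.
 */
static int trace_error_to_errno(uint16_t raw)
{
	/* Values with the top bit set are negative numbers in 16 bits. */
	return raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;
}

/* Checks the two values discussed in this thread. */
static void trace_error_selfcheck(void)
{
	assert(trace_error_to_errno(65514) == -EINVAL);
	assert(trace_error_to_errno(65531) == -EIO);
	assert(trace_error_to_errno(0) == 0);
}
```

Calling trace_error_selfcheck() from a small main() is enough to confirm
the mapping on a given system.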

^ permalink raw reply

* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Dave Hansen @ 2026-06-16  0:09 UTC (permalink / raw)
  To: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
In-Reply-To: <ea71a433-fdcc-429d-adce-14e9fb726957@kernel.org>

On 6/15/26 13:19, Vincent Mailhol wrote:
...
> That said, your points make sense to me, and I would be supportive to
> allow a search for a secondary UUID as a kernel extension. If we do
> so, I think the only constraint should be to make sure that we check
> for the exact match first (e.g. check x86_64 type before x86_32 type).
> 
> Would that make sense?

Yep, that makes sense to me.

>> 2. Should the UUIDs be defined in arch code or generic code?
> 
> I think that you convinced me to put it in generic code.
> 
>> 3. Kconfig or #ifdefs?
> 
> I would say Kconfig. If we go for the exact match only, that would be:
> 
>   CONFIG_DPS_ROOT_PARTITION_TYPE_UUID
> 
> If we allow more as an extension, that would become:
> 
>   - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID for the exact match
>   - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY for the compatible
>     one.
> 
> The drawback is that some entries will be in both:
> 
>   config DPS_ROOT_PARTITION_TYPE_UUID
>   	string
>   	  default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
>   	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86
> 
>   config DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY
>   	string
>   	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86_64 && COMPAT_32
> 
> And I don't think we need more than two.

That's not ideal, but it's also a completely static thing that will get
written very, very rarely.

> A bonus question: should those Kconfig entries be hidden? I prefer the
> hidden option because it doesn't add that much code and I thought this
> was not worth bothering the user with one more menuconfig question.
> But I would be happy to change if people think this is worth a
> menuconfig entry.

Yeah, it should be hidden. Anybody that wants to change it for whatever
reason can edit the .config file or hack Kconfig.

^ permalink raw reply

* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Qu Wenruo @ 2026-06-16  1:00 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <efa3f1f9-20c2-44c0-914c-5e34a40235a1@gmx.com>



On 2026/6/16 08:13, Qu Wenruo wrote:
> 
> 
> On 2026/6/16 00:35, Christoph Hellwig wrote:
>> On Fri, Jun 12, 2026 at 07:21:14PM +0930, Qu Wenruo wrote:
>>> If a direct IO requires bounced pages for stable buffer, it will always
>>> allocate memory, and both bio_iov_iter_bounce_write() and
>>> bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
>>> can sleep and break NOWAIT requirement.
>>>
>>> So we need to reject such NOWAIT and BOUNCE direct IO in
>>> iomap_dio_bio_iter().
>>
>> That's a bit heavy handed. Just do a noretry allocation.
> 
>  From the comment of __GFP_NORETRY:
> 
>   * %__GFP_NORETRY: The VM implementation will try only very lightweight
>   * memory direct reclaim to get some memory under memory pressure (thus
>   * it can sleep).
> 
> It looks like NORETRY can still sleep, thus again breaking NOWAIT 
> requirement.
> 
> I think you're talking about GFP_NOWAIT?
> 

After scanning the code for related memory allocation, there are some 
other locations doing memory allocation, including but not limited to:

- iomap_dio_alloc_bio()
   This one is a little tricky: if we pass GFP_NOWAIT, we break the
   old assumption that the function will always return a bio.

- fscrypt_set_bio_crypt_ctx()
   This one is fine so far, as neither XFS nor btrfs supports fscrypt yet.

Also, any memory allocation failure in NOWAIT mode should return -EAGAIN,
not -ENOMEM, so that the caller can retry in blocking mode as a fallback.
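
The -EAGAIN convention above can be sketched in userspace as follows. This
is only a model under stated assumptions: toy_alloc(), toy_submit_dio() and
toy_submit_with_fallback() are hypothetical names, malloc() stands in for
the kernel allocators, and the "pressure" flag simulates a non-sleeping
allocation that fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical allocator: a nowait request fails under simulated memory
 * pressure instead of sleeping, while a blocking request always gets memory.
 */
static void *toy_alloc(size_t size, bool nowait, bool pressure)
{
	if (nowait && pressure)
		return NULL;
	return malloc(size);
}

/* Hypothetical submission path: a nowait failure is -EAGAIN, not -ENOMEM. */
static int toy_submit_dio(size_t size, bool nowait, bool pressure, void **out)
{
	void *buf;

	assert(out != NULL);
	buf = toy_alloc(size, nowait, pressure);
	if (!buf)
		return nowait ? -EAGAIN : -ENOMEM;
	*out = buf;
	return 0;
}

/* Caller side: try the optimistic mode first, then fall back to blocking. */
static int toy_submit_with_fallback(size_t size, bool pressure, void **out)
{
	int ret = toy_submit_dio(size, true, pressure, out);

	if (ret == -EAGAIN)
		ret = toy_submit_dio(size, false, pressure, out);
	return ret;
}
```

The point of the sketch is only the error code: had the nowait path
returned -ENOMEM, the caller could not tell "retry in blocking mode" apart
from a real out-of-memory failure.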

Since NOWAIT itself is only an optimistic flag, and the caller should
always have a blocking mode as a fallback, I'd prefer to reject
NOWAIT + BOUNCE direct writes completely inside btrfs for now,

and leave all the missing NOWAIT handling in iomap to a dedicated series.

Would that be acceptable to you?

Thanks,
Qu

^ permalink raw reply

* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-16  1:20 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Ming Lei, linux-block, Jens Axboe
In-Reply-To: <3db036fe-747f-44eb-93c3-595350278297@linux.ibm.com>

On Jun 12, 2026 / 17:15, Nilay Shroff wrote:
> On 6/12/26 4:36 PM, Ming Lei wrote:
> > On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
> > > On Jun 11, 2026 / 06:22, Ming Lei wrote:
> > > > Hi Shin'ichiro,
> > > 
> > > Hi Ming, thanks for the comments.
> > > 
> > > > 
> > > > On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > > > > I observed that the blktests test case block/005 hangs on a specific
> > > > > server hardware using a specific HDD as a block device. During the test
> > > > > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > > > > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > > > > dependent.
> > > > > 
> > > > >  From the kernel message, I noticed that udev-worker wrote to the
> > > > > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > > > > The test case block/005 also wrote to the same sysfs attribute, which
> > > > 
> > > > sysfs write is supposed to be serialized...
> > > 
> > > I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
> > > I found elevator_change() call is guarded with the rw_semaphore
> > > "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
> > > lock. This does not serialize the sysfs writes.
> > 
> > Please see kernfs_fop_write_iter(), in which mutex is held before calling
> > ->write().
> > 
> I think you're referring to @of->mutex here; however of->mutex is per struct
> kernfs_open_file, which is associated with an open instance of the sysfs file.
> The important point is that two separate opens can have different kernfs_open_file
> instances and therefore different mutexes. Thus, concurrent write to same sysfs
> attribute from two different processes may still be possible.

Thanks Nilay. I added debug prints of the @of->mutex address and observed that
the address is different for each process and each file open. So I don't think
sysfs writes are serialized.
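
A toy userspace model of the locking situation discussed here, as a sketch
only: struct toy_lock and the toy_*() helpers are hypothetical, are not
thread-safe, and merely mimic the trylock semantics of a per-open mutex and
of a shared rw_semaphore. Nothing is ever unlocked, which models a second
sysfs writer arriving while the first is still inside the store handler.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy lock that can be taken shared (readers) or exclusive (writer). */
struct toy_lock {
	int readers;
	bool writer;
};

static bool toy_read_trylock(struct toy_lock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static bool toy_write_trylock(struct toy_lock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

/*
 * Hypothetical model of one sysfs write entering the store handler.
 * per_open models the mutex of one open file instance, shared models
 * the lock common to all writers of the attribute.
 */
static int toy_store_enter(struct toy_lock *per_open, struct toy_lock *shared,
			   bool exclusive)
{
	bool ok;

	assert(per_open && shared);
	/* Each open has its own lock, so separate opens never contend here. */
	if (!toy_write_trylock(per_open))
		return -EBUSY;
	ok = exclusive ? toy_write_trylock(shared) : toy_read_trylock(shared);
	return ok ? 0 : -EBUSY;
}
```

With exclusive == false, two different opens both get 0 and run the handler
concurrently. With exclusive == true, the second one gets -EBUSY, which is
the behaviour the down_write_trylock() change is meant to produce.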

> 
> 
> > > 
> > > I tried the patch below to replace the reader lock with the writer lock. With
> > > a quick trial, it looks working. The kernel message is no longer observed and
> > > the new test case does not cause hangs. I will do further testing to confirm
> > > that this change does not trigger other new lockdep WARNs. Assuming it does not
> > > have such side effects, I hope this fix approach is acceptable. It doesn't add
> > > the new lock, so I think it's the better.
> > > 
> > > diff --git a/block/elevator.c b/block/elevator.c
> > > index 3bcd37c2aa34..b03185a217ff 100644
> > > --- a/block/elevator.c
> > > +++ b/block/elevator.c
> > > @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> > >   	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
> > >   	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
> > >   	 */
> > > -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
> > > +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
> > >   		ret = -EBUSY;
> > >   		goto out;
> > >   	}
> > > @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> > >   	} else {
> > >   		ret = -ENOENT;
> > >   	}
> > > -	up_read(&set->update_nr_hwq_lock);
> > > +	up_write(&set->update_nr_hwq_lock);
> > >   out:
> > >   	if (ctx.type)
> > > 
> > > [...]
> > > 
> > > > blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> > > > fix could be check & avoid the null-ptr-deref.
> > > 
> > > Actually, null-ptr-deref is one of the failure symptoms. KASAN
> > > slab-use-after-free is also observed [3]. So I'm guessing that adding
> > > null checks may not be enough.
> > > 
> > > > Adding new lock should be the last straw usually, especially this one is
> > > > depended by queue freeze.
> > > 
> > > Got it, thanks.
> > > 
> > > 
> > > [3] KASAN slab-use-after-free
> > 
> > Then you need to figure out the exact slab type and check if the pointer is cleared
> > during free.
> > 
> > Anyway, there is guard already, not see reason to add new lock for covering
> > it.
> > 
> Regarding the observed failure, my understanding is that blk_mq_debugfs_register_sched()
> and blk_mq_debugfs_register_sched_hctx() access q->elevator without holding q->elevator_lock.
> If multiple scheduler update paths run concurrently, one path can replace and free the
> elevator while another path is still using it, which would explain the observed KASAN
> use-after-free and NULL pointer dereference reports.

I have the same view. I think the use-after-free and the null-ptr-deref indicate
that the elevator_queue object pointed to by q->elevator is the problem.
References to the object are also kept in struct elv_change_ctx as ctx->old and
ctx->new. These multiple references are used concurrently, so I'm not sure that
adding pointer clears and null checks would fix the problem.

> 
> With the proposed change, upgrading update_nr_hwq_lock from a reader lock to a writer
> lock in elv_iosched_store() would serialize concurrent scheduler updates and therefore
> prevent multiple elevator switch operations from running at the same time.
> 
> Another way to fix this might be to acquire q->elevator_lock in blk_mq_sched_reg_debugfs()
> and thus serialize access to q->elevator in blk_mq_debugfs_register_sched() and
> blk_mq_debugfs_register_sched_hctx().

Thanks for the idea. I tried the patch below [X], but it triggered a WARN in
debugfs_create_files() in block/blk-mq-debugfs.c [Y]. So I'm afraid this
approach does not work.

At the moment, the writer lock in elv_iosched_store() looks like the solution
to me, but further comments on other possible solutions are welcome.


[X]

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0a00f5a76f5a..12c582b6c713 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -394,9 +394,11 @@ void blk_mq_sched_reg_debugfs(struct request_queue *q)
 	unsigned long i;
 
 	memflags = blk_debugfs_lock(q);
+	mutex_lock(&q->elevator_lock);
 	blk_mq_debugfs_register_sched(q);
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_debugfs_register_sched_hctx(q, hctx);
+	mutex_unlock(&q->elevator_lock);
 	blk_debugfs_unlock(q, memflags);
 }
 

[Y]

static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
                                 void *data,
                                 const struct blk_mq_debugfs_attr *attr)
{
        lockdep_assert_held(&q->debugfs_mutex);
        /*
         * debugfs_mutex should not be nested under other locks that can be
         * grabbed while queue is frozen.
         */
        lockdep_assert_not_held(&q->elevator_lock);     <----
        lockdep_assert_not_held(&q->rq_qos_mutex);


^ permalink raw reply related

* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Hou Tao @ 2026-06-16  1:23 UTC (permalink / raw)
  To: yukuai, Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1
In-Reply-To: <70642ddf-9ed9-45cb-bf40-891a07247c97@fnnas.com>

Hi,

On 6/16/2026 12:16 AM, Yu Kuai wrote:
> Hi,
>
> On 2026/6/15 19:55, Zizhi Wo wrote:
>> From: Zizhi Wo <wozizhi@huawei.com>
>>
>> [BUG]
>> Our fuzz testing triggered a blkcg use-after-free issue:
>>
>>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>>    Call Trace:
>>    ...
>>    blkcg_deactivate_policy+0x244/0x4d0
>>    ioc_rqos_exit+0x44/0xe0
>>    rq_qos_exit+0xba/0x120
>>    __del_gendisk+0x50b/0x800
>>    del_gendisk+0xff/0x190
>>    ...
>>
>> [CAUSE]
>> process1						process2
>> cgroup_rmdir
>> ...
>>    css_killed_work_fn
>>      offline_css
>>      ...
>>        blkcg_destroy_blkgs
>>        ...
>>          __blkg_release
>> 	  css_put(&blkg->blkcg->css)
>>            blkg_free
>> 	    INIT_WORK(xxx, blkg_free_workfn)
>> 	    schedule_work
>>      css_put
>>      ...
>>        blkcg_css_free
>>          kfree(blkcg)--------blkcg has been freed!!!
>> ====================================schedule_work
>>                blkg_free_workfn
>> 							__del_gendisk
>> 							  rq_qos_exit
>> 							    ioc_rqos_exit
>> 							      blkcg_deactivate_policy
>> 							        mutex_lock(&q->blkcg_mutex)
>> 								spin_lock_irq(&q->queue_lock)
>> 							        list_for_each_entry(blkg, xxx)
>> 								  blkcg = blkg->blkcg
>> 								  spin_lock(&blkcg->lock)-------UAF!!!
>> 	        mutex_lock(&q->blkcg_mutex)
>> 	        spin_lock_irq(&q->queue_lock)
>> 	        /* Only then is the blkg removed from the list */
>> 	        list_del_init(&blkg->q_node)
>>
>> As a result, a blkg can still be reachable through q->blkg_list while
>> its ->blkcg has already been freed.
>>
>> [Fix]
>> Fix this by deferring the blkcg css_put() until after the blkg has been
>> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
>> blkcg outlives every blkg still reachable through q->blkg_list, so any
>> iterator holding q->queue_lock is guaranteed to observe a valid
>> blkg->blkcg.
>>
>> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
>> so that the css reference is owned by the alloc/free pair rather than
>> straddling layers:
>> blkg_alloc()  <-> blkg_free()
>> blkg_create() <-> blkg_destroy()
>>
>> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
>> Suggested-by: Hou Tao <houtao1@huawei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>> ---
>> v2:
>>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>>     css reference follows the blkg's own lifetime, making the put in
>>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>>
>> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>>
>>   block/blk-cgroup.c | 24 ++++++++++++------------
>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index bc63bd220865..27414c291e49 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>>   	if (blkg->parent)
>>   		blkg_put(blkg->parent);
>>   	spin_lock_irq(&q->queue_lock);
>>   	list_del_init(&blkg->q_node);
>>   	spin_unlock_irq(&q->queue_lock);
>> +	/*
>> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
>> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
>> +	 */
>> +	css_put(&blkg->blkcg->css);
>>   	mutex_unlock(&q->blkcg_mutex);
> Please move css_put after mutex_unlock, unless there is a strong reason.

I think blkcg_mutex is used here to serialize access to blkg->q_node
and blkg->blkcg. We could move the css_put() after the mutex_unlock(),
but it would still depend implicitly on the mutex_lock()/mutex_unlock()
pair on blkcg_mutex. Rather than rely on that implicit dependency, we put
the css_put() inside the lock to make it explicit.
>
> With above change, feel free to add:
>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
>
>>   
>>   	blk_put_queue(q);
>>   	free_percpu(blkg->iostat_cpu);
>>   	percpu_ref_exit(&blkg->refcnt);
>> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>>   	 * blkg_stat_lock is for serializing blkg stat update
>>   	 */
>>   	for_each_possible_cpu(cpu)
>>   		__blkcg_rstat_flush(blkcg, cpu);
>>   
>> -	/* release the blkcg and parent blkg refs this blkg has been holding */
>> -	css_put(&blkg->blkcg->css);
>>   	blkg_free(blkg);
>>   }
>>   
>>   /*
>>    * A group is RCU protected, but having an rcu lock does not mean that one
>> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>   	if (!blkg->iostat_cpu)
>>   		goto out_exit_refcnt;
>>   	if (!blk_get_queue(disk->queue))
>>   		goto out_free_iostat;
>> +	/* blkg holds a reference to blkcg */
>> +	if (!css_tryget_online(&blkcg->css))
>> +		goto out_put_queue;
>>   
>>   	blkg->q = disk->queue;
>>   	INIT_LIST_HEAD(&blkg->q_node);
>>   	blkg->blkcg = blkcg;
>>   	blkg->iostat.blkg = blkg;
>> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>   
>>   out_free_pds:
>>   	while (--i >= 0)
>>   		if (blkg->pd[i])
>>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>> +	css_put(&blkcg->css);
>> +out_put_queue:
>>   	blk_put_queue(disk->queue);
>>   out_free_iostat:
>>   	free_percpu(blkg->iostat_cpu);
>>   out_exit_refcnt:
>>   	percpu_ref_exit(&blkg->refcnt);
>> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>   	if (blk_queue_dying(disk->queue)) {
>>   		ret = -ENODEV;
>>   		goto err_free_blkg;
>>   	}
>>   
>> -	/* blkg holds a reference to blkcg */
>> -	if (!css_tryget_online(&blkcg->css)) {
>> -		ret = -ENODEV;
>> -		goto err_free_blkg;
>> -	}
>> -
>>   	/* allocate */
>>   	if (!new_blkg) {
>>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>>   		if (unlikely(!new_blkg)) {
>>   			ret = -ENOMEM;
>> -			goto err_put_css;
>> +			goto err_free_blkg;
>>   		}
>>   	}
>>   	blkg = new_blkg;
>>   
>>   	/* link parent */
>>   	if (blkcg_parent(blkcg)) {
>>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>>   		if (WARN_ON_ONCE(!blkg->parent)) {
>>   			ret = -ENODEV;
>> -			goto err_put_css;
>> +			goto err_free_blkg;
>>   		}
>>   		blkg_get(blkg->parent);
>>   	}
>>   
>>   	/* invoke per-policy init */
>> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>   
>>   	/* @blkg failed fully initialized, use the usual release path */
>>   	blkg_put(blkg);
>>   	return ERR_PTR(ret);
>>   
>> -err_put_css:
>> -	css_put(&blkcg->css);
>>   err_free_blkg:
>>   	if (new_blkg)
>>   		blkg_free(new_blkg);
>>   	return ERR_PTR(ret);
>>   }


^ permalink raw reply

* [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Zizhi Wo @ 2026-06-16  1:17 UTC (permalink / raw)
  To: axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai, wozizhi

From: Zizhi Wo <wozizhi@huawei.com>

[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:

  BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
  Call Trace:
  ...
  blkcg_deactivate_policy+0x244/0x4d0
  ioc_rqos_exit+0x44/0xe0
  rq_qos_exit+0xba/0x120
  __del_gendisk+0x50b/0x800
  del_gendisk+0xff/0x190
  ...

[CAUSE]
process1						process2
cgroup_rmdir
...
  css_killed_work_fn
    offline_css
    ...
      blkcg_destroy_blkgs
      ...
        __blkg_release
	  css_put(&blkg->blkcg->css)
          blkg_free
	    INIT_WORK(xxx, blkg_free_workfn)
	    schedule_work
    css_put
    ...
      blkcg_css_free
        kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
              blkg_free_workfn
							__del_gendisk
							  rq_qos_exit
							    ioc_rqos_exit
							      blkcg_deactivate_policy
							        mutex_lock(&q->blkcg_mutex)
								spin_lock_irq(&q->queue_lock)
							        list_for_each_entry(blkg, xxx)
								  blkcg = blkg->blkcg
								  spin_lock(&blkcg->lock)-------UAF!!!
	        mutex_lock(&q->blkcg_mutex)
	        spin_lock_irq(&q->queue_lock)
	        /* Only then is the blkg removed from the list */
	        list_del_init(&blkg->q_node)

As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.

[Fix]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.

While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc()  <-> blkg_free()
blkg_create() <-> blkg_destroy()

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
---
v3:
 - move css_put() after mutex_unlock() in blkg_free_workfn().

v2:
 - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
   css reference follows the blkg's own lifetime, making the put in
   blkg_free_workfn() symmetric with the get in blkg_alloc().

v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
 block/blk-cgroup.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bc63bd220865..3ac41f766caf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -136,6 +136,11 @@ static void blkg_free_workfn(struct work_struct *work)
 	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
+	/*
+	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
+	 * so concurrent iterators won't see a blkg with a freed blkcg.
+	 */
+	css_put(&blkg->blkcg->css);
 	blk_put_queue(q);
 	free_percpu(blkg->iostat_cpu);
 	percpu_ref_exit(&blkg->refcnt);
@@ -179,8 +184,6 @@ static void __blkg_release(struct rcu_head *rcu)
 	for_each_possible_cpu(cpu)
 		__blkcg_rstat_flush(blkcg, cpu);
 
-	/* release the blkcg and parent blkg refs this blkg has been holding */
-	css_put(&blkg->blkcg->css);
 	blkg_free(blkg);
 }
 
@@ -313,6 +316,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 		goto out_exit_refcnt;
 	if (!blk_get_queue(disk->queue))
 		goto out_free_iostat;
+	/* blkg holds a reference to blkcg */
+	if (!css_tryget_online(&blkcg->css))
+		goto out_put_queue;
 
 	blkg->q = disk->queue;
 	INIT_LIST_HEAD(&blkg->q_node);
@@ -353,6 +359,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 	while (--i >= 0)
 		if (blkg->pd[i])
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+	css_put(&blkcg->css);
+out_put_queue:
 	blk_put_queue(disk->queue);
 out_free_iostat:
 	free_percpu(blkg->iostat_cpu);
@@ -381,18 +389,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		goto err_free_blkg;
 	}
 
-	/* blkg holds a reference to blkcg */
-	if (!css_tryget_online(&blkcg->css)) {
-		ret = -ENODEV;
-		goto err_free_blkg;
-	}
-
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 	}
 	blkg = new_blkg;
@@ -402,7 +404,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
 		if (WARN_ON_ONCE(!blkg->parent)) {
 			ret = -ENODEV;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 		blkg_get(blkg->parent);
 	}
@@ -442,8 +444,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	blkg_put(blkg);
 	return ERR_PTR(ret);
 
-err_put_css:
-	css_put(&blkcg->css);
 err_free_blkg:
 	if (new_blkg)
 		blkg_free(new_blkg);
-- 
2.52.0
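
The ordering argument in the [Fix] section above can be illustrated with a
small single-threaded user-space model. This is only a sketch under stated
assumptions: struct owner and struct node are hypothetical stand-ins for
blkcg and blkg, the "freed" flag models kfree(blkcg), and the locking
(q->blkcg_mutex, q->queue_lock) is left out entirely. It is not the kernel
code, only a picture of why the put has to come after the unlink.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for blkcg; not the kernel structure. */
struct owner {
	int refs;	/* models the css reference count */
	bool freed;	/* models kfree(blkcg) having happened */
};

/* Hypothetical stand-in for blkg; not the kernel structure. */
struct node {
	struct owner *owner;	/* models blkg->blkcg */
	bool linked;		/* models membership in q->blkg_list */
};

static void owner_put(struct owner *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->freed = true;
}

/*
 * Old order: the reference is dropped while the node is still linked.
 * Returns true if an iterator walking the list could have seen a node
 * whose owner was already freed.
 */
static bool release_put_first(struct node *n)
{
	bool window;

	owner_put(n->owner);
	window = n->linked && n->owner->freed;
	n->linked = false;
	return window;
}

/* New order: unlink first, and only then drop the owner reference. */
static bool release_unlink_first(struct node *n)
{
	bool window;

	n->linked = false;
	owner_put(n->owner);
	window = n->linked && n->owner->freed;
	return window;
}
```

In this model release_put_first() reports a window in which a linked node
points at a freed owner, while release_unlink_first() does not, and the
owner is still released in both cases.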


^ permalink raw reply related

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Vjaceslavs Klimovs @ 2026-06-16  1:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Dr. David Alan Gilbert, Thorsten Leemhuis, trnka, linux-block,
	dm-devel, Linux kernel regressions list
In-Reply-To: <ajCTaUaACV9eNmWo@kbusch-mbp>

Hi Keith,

Thanks. I tested both patches on current mainline
(v7.1-rc7-271-g424280953322) with my QEMU + LVM "--type mirror"
reproducer (virtio-blk, cache=none, aio=native).

With only the "block: check bio split for unaligned bvec" patch, the
hang still reproduces. The WARN fires from a kmirrord worker:

  WARNING: block/bio.c:1044 at bio_add_page+0x108/0x200
  Workqueue: kmirrord do_mirror
  Call Trace:
   bio_add_page+0x108/0x200
   do_region+0x21d/0x270
   dispatch_io+0xf1/0x150
   dm_io+0x136/0x240
   do_reads+0x13e/0x210
   do_mirror+0x117/0x2b0

and the VM then wedges.

With the dm-io.c clone patch applied on top, the WARN and the hang are
both gone. dm-mirror just fails the read instead:

  device-mapper: raid1: Mirror read failed from 252:0. Trying alternative device.
  device-mapper: raid1: All sides of mirror have failed.
  device-mapper: raid1: Read failure on mirror device 252:1.  Failing I/O.

The guest still gets an I/O error, as you expected, but the host stays
up: no splat, no stuck task. For comparison, on the same kernel the
"--type raid1" case boots the guest and reads fine, and the 128 MB
mirror seed write goes through the clone path without trouble, so
normal I/O looks unaffected.

Thanks,
Vjaceslavs

On Mon, Jun 15, 2026 at 5:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Jun 15, 2026 at 04:16:12PM -0700, Vjaceslavs Klimovs wrote:
> > Your trace looks like what the two earlier reports hit: a read reaching
> > a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
> > that may help read the trace: blk_io_trace.error is a __u16, so the
> > bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
> > 65531 = -EIO).
> >
> > The WARN itself is new, the bad bio isn't. bio_add_page() only started
> > rejecting len == 0 in 643893647cac ("block: reject zero length in
> > bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
> > scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
> > That fits your "not a recent regression": the condition is older, v7.1
> > just made it loud.
> >
> > For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
> > origin looks like 5ff3f74e145a ("block: simplify direct io validity
> > check", v6.18): blkdev_dio_invalid() now checks only aggregate
> > ki_pos | count alignment and dropped the per-segment
> > bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
> > longer gets -EINVAL at the fops boundary. But your reproducer reads a
> > file, which goes through the filesystem O_DIRECT path and never calls
> > blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
> > that one entry point.
> >
> > dm-mirror then hangs because Keith's f7b24c7b41f2 only covers md
> > raid1/raid10; legacy dm-mirror (dm-raid1.c) has no equivalent and
> > rebuilds the empty read onto the other leg. Note the leg's status isn't
> > even consistent (your SATA path returns BLK_STS_IOERR, not
> > BLK_STS_INVAL), so copying that status check into dm-mirror probably
> > wouldn't catch every case.
> >
> > For what it's worth, that points me toward rejecting the empty or
> > misaligned bio once, at submission, with -EINVAL, rather than teaching
> > each consumer to tolerate it. But you'll know the tradeoffs far better
> > than I do.
> >
> > I have a small QEMU + LVM raid1/mirror setup that reproduces the
> > block-device variant and bisects to 5ff3f74e. Happy to run your file
> > reproducer with some instrumentation at the dm-mirror read entry
> > (bi_size vs bio_sectors vs bvec lengths) to see whether the bio is
> > already empty on arrival or built that way on the retry, and to test
> > any patch.
>
> Thanks for following up here. I didn't initially see your follow-up
> until Thorsten linked it. I apologize for missing that. This feature is
> important, so I don't want to see anything regress for it.
>
> There is a known bug fix I think future tests should include:
>
>   https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/
>
> This likely isn't the fix you're looking for, but including it rules out
> conditions that are not important here.
>
> After that, can we try this suggestion and see if the hang goes away?
>
>   https://lore.kernel.org/linux-block/ajBb8tK-0aJBpIgF@kbusch-mbp/
>
> I expect the original test case to still return an error (and I think it
> was designed to), but it shouldn't produce the warn or bug splats with a
> stuck uninterruptible task.
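
The aside quoted above, that the bracketed trace values are errnos stored in
a __u16, can be checked with a few lines of user-space C. This is a minimal
sketch: trace_error_to_errno() is a hypothetical helper, not part of
blktrace, and the check assumes the Linux errno numbering (EINVAL = 22,
EIO = 5).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: recover the signed errno from a 16-bit field that
 * holds a negative errno truncated to unsigned 16 bits.
 */
static int trace_error_to_errno(uint16_t raw)
{
	/* Values with the top bit set are negative numbers modulo 65536. */
	return raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;
}

/* The two values quoted in the thread, assuming Linux errno numbers. */
static void check_thread_values(void)
{
	assert(trace_error_to_errno(65514) == -EINVAL);	/* 65536 - 22 */
	assert(trace_error_to_errno(65531) == -EIO);	/* 65536 - 5 */
}
```

Going the other way, (uint16_t)(-EINVAL) gives 65514, which is how the value
ends up in the trace in the first place.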

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16  4:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260615225317.GB28589@quark>

On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
>
> So in other words, this series slows down dm-crypt and crypto_skcipher
> for everyone to optimize for an out-of-tree driver.  And there's also no
> benchmark showing that your driver is even worth it over just using the
> CPU.

There is no reason why the software fallback should be slower
than the status quo.  Existing callers of the Crypto API will
be issuing one indirect function call per data unit.  With the
new scheme, the indirect call per unit moves from the caller
into the Crypto API.

In fact, we could move it down further and improve upon the
status quo by splitting the data in each algorithm implementation
so that the calls per unit become direct function calls and only
the overall call into the Crypto API remains indirect.

But yes it would be nice to provide numbers for the fallback
path to verify that we didn't get this case terribly wrong.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
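
The general principle described above (one indirect call per data unit
versus one indirect call per request with direct calls per unit) can be
sketched in user-space C. Everything here is hypothetical: unit_fn, batch_fn
and the XOR "cipher" are illustration only and have nothing to do with the
skcipher API or with what the patchset actually does. The sketch only counts
indirect calls. It says nothing about the other costs raised in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-unit and per-request entry points. */
typedef void (*unit_fn)(unsigned char *unit, size_t unit_len);
typedef void (*batch_fn)(unsigned char *buf, size_t len, size_t unit_len);

/* Toy transform standing in for an algorithm; not a real cipher. */
static void xor_unit(unsigned char *unit, size_t unit_len)
{
	for (size_t i = 0; i < unit_len; i++)
		unit[i] ^= 0x5a;
}

/* The implementation splits the request and calls xor_unit() directly. */
static void xor_batch(unsigned char *buf, size_t len, size_t unit_len)
{
	assert(unit_len != 0);
	for (size_t off = 0; off + unit_len <= len; off += unit_len)
		xor_unit(buf + off, unit_len);
}

/* Caller loops: one indirect call per data unit. Returns the call count. */
static unsigned long per_unit_dispatch(unit_fn fn, unsigned char *buf,
				       size_t len, size_t unit_len)
{
	unsigned long indirect_calls = 0;

	assert(unit_len != 0);
	for (size_t off = 0; off + unit_len <= len; off += unit_len) {
		fn(buf + off, unit_len);
		indirect_calls++;
	}
	return indirect_calls;
}

/* Batched: a single indirect call covers the whole request. */
static unsigned long batched_dispatch(batch_fn fn, unsigned char *buf,
				      size_t len, size_t unit_len)
{
	fn(buf, len, unit_len);
	return 1;
}
```

For a 4096-byte request with 512-byte units, per_unit_dispatch() makes eight
indirect calls and batched_dispatch() makes one, with identical output.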

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-16  4:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <ajDNT5jVGgRtiNH6@gondor.apana.org.au>

On Tue, Jun 16, 2026 at 12:13:03PM +0800, Herbert Xu wrote:
> On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
> >
> > So in other words, this series slows down dm-crypt and crypto_skcipher
> > for everyone to optimize for an out-of-tree driver.  And there's also no
> > benchmark showing that your driver is even worth it over just using the
> > CPU.
> 
> There is no reason why the software fallback should be slower
> than the status quo.  Existing callers of the Crypto API will
> be issuing one indirect function call per data unit.  With the
> new scheme, the indirect call per unit moves from the caller
> into the Crypto API.

Have you checked the code?  This patchset adds overhead in multiple
places.  Dynamically allocating multiple scatterlists and then parsing
them, adding a new field to skcipher_request for everyone, new checks in
crypto_skcipher_en/decrypt for everyone, new checks to validate the data
unit size that the caller knew was valid in the first place, etc.

> In fact, we could move it down further and improve upon the
> status quo by splitting the data in each algorithm implementation
> so that the calls per unit become direct function calls and only
> the overall call into the Crypto API remains indirect.

That's not what this patchset does.  But also, as we know, a better way
to eliminate "Crypto API" overhead is to call the algorithms directly.

- Eric

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16  4:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260616045023.GA113934@sol>

On Mon, Jun 15, 2026 at 09:50:23PM -0700, Eric Biggers wrote:
>
> Have you checked the code?  This patchset adds overhead in multiple
> places.  Dynamically allocating multiple scatterlists and then parsing
> them, adding a new field to skcipher_request for everyone, new checks in
> crypto_skcipher_en/decrypt for everyone, new checks to validate the data
> unit size that the caller knew was valid in the first place, etc.

No I have not :)

I'm just stating the general principle.

Of course I will not apply the patch-set until I have reviewed it
properly.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Christoph Hellwig @ 2026-06-16  5:18 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, linux-block,
	linux-fsdevel, linux-xfs
In-Reply-To: <efa3f1f9-20c2-44c0-914c-5e34a40235a1@gmx.com>

On Tue, Jun 16, 2026 at 08:13:27AM +0930, Qu Wenruo wrote:
> It looks like NORETRY can still sleep, thus again breaking NOWAIT
> requirement.
> 
> I think you're talking about GFP_NOWAIT?

Yes, GFP_NOWAIT would be a better fit.

^ permalink raw reply

* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Christoph Hellwig @ 2026-06-16  5:19 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, linux-block,
	linux-fsdevel, linux-xfs
In-Reply-To: <8a0e1881-fe73-4912-95f9-8eac998840d5@gmx.com>

On Tue, Jun 16, 2026 at 10:30:27AM +0930, Qu Wenruo wrote:
> After scanning the code for related memory allocation, there are some other
> locations doing memory allocation, including but not limited to:
> 
> - iomap_dio_alloc_bio()
>   This one is a little tricky, if we pass GFP_NOWAIT, we can break the
>   old assumption that the function will always return a bio.
> 
> - fscrypt_set_bio_crypt_ctx()
>   This one is fine so far, as neither XFS nor btrfs support fscrypt yet.
> 
> And any memory allocation failure in NOWAIT mode should return -EAGAIN, not
> -ENOMEM so that the caller can retry in blocking mode as a fallback.
> 
> To me, considering NOWAIT itself is only an optimistic flag, and caller
> should always have a blocking mode as fallback, I'd prefer to reject NOWAIT
> + BOUNCE direct writes completely inside btrfs for now.

If you want to do that in btrfs please do it there.

> And leave all the missing NOWAIT handling in iomap in a dedicated series.

This might be worth looking into.
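
The error-code point quoted above (an allocation failure in NOWAIT mode
should surface as -EAGAIN so the caller can retry in blocking mode, rather
than as -ENOMEM) can be sketched in user-space C. This is only an
illustration: alloc_bounce_buffer() and its simulate_failure parameter are
hypothetical, malloc() stands in for the kernel allocator, and none of this
is iomap code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a buffer and map a failure to the error
 * the caller should see. simulate_failure forces the failure path so the
 * mapping can be exercised without exhausting memory.
 */
static int alloc_bounce_buffer(size_t len, bool nowait, bool simulate_failure,
			       void **out)
{
	void *buf;

	assert(out != NULL);
	buf = simulate_failure ? NULL : malloc(len);
	if (!buf) {
		*out = NULL;
		/* NOWAIT callers are expected to retry in blocking mode. */
		return nowait ? -EAGAIN : -ENOMEM;
	}
	*out = buf;
	return 0;
}
```

A caller that sees -EAGAIN from the NOWAIT attempt would then reissue the
request without NOWAIT, which is the fallback the quoted text relies on.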


^ permalink raw reply

* Re: [GIT PULL] Block updates for 7.2
From: pr-tracker-bot @ 2026-06-16  7:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, linux-block@vger.kernel.org
In-Reply-To: <28f00610-4a45-47f5-9e08-468c31736090@kernel.dk>

The pull request you sent on Mon, 15 Jun 2026 09:24:22 -0600:

> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/for-7.2/block-20260615

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ba9c792c824fff732df85119011d399d9b6d9155

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

[CHANGELOG]
v4:
- Follow iomap/block layer code style to avoid lines over 80 chars

- Reject NOWAIT BOUNCE direct writes inside btrfs
  The iomap code still allocates memory with GFP_KERNEL in other
  locations.
  For now just disable NOWAIT BOUNCE direct writes and let the caller
  fall back to blocking mode.

v3:
- Fix a bug in error handling of bio_iov_iter_bounce_write()
  Which can lead to generic/708 failure on btrfs.

- Respect nofault flag in bio_iov_iter_bounce_write()
  To avoid btrfs specific deadlocks.

- Reject NOWAIT and BOUNCE direct IOs
  Since BOUNCE always allocates pages using GFP_KERNEL, which can sleep
  and break the NOWAIT requirement, such a combination has to be rejected.

v2:
- Rework the comment in btrfs_dio_write()

Commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the
inode requires checksum") solved the csum mismatch caused by unstable
direct IO buffers, but it has a pretty hefty performance penalty.

Meanwhile, upstream iomap has introduced the IOMAP_DIO_BOUNCE flag to get
stable buffers without falling back to buffered IO.

Using that flag btrfs can reach 95% of the original zero-copy direct IO
performance, almost 2x the current buffered fallback performance.

However, during my tests I found several bugs related to iomap that
can lead to direct IO test case failures:

- generic/708
  Results in garbage at the end of the writes. This is a bug in the
  error handling of a short copy.

  Fixed in the first patch.

- Deadlock if using the page cache as direct IO buffer
  This is because bio_iov_iter_bounce_write() doesn't respect
  iov_iter::nofault flag.

  Fixed in the second patch.

- Possible NOWAIT and BOUNCE conflicts
  The BOUNCE flag, for both reads and writes, allocates new folios using
  GFP_KERNEL, which can sleep and break the NOWAIT requirement.

  Reject such a combination in btrfs when enabling IOMAP_DIO_BOUNCE
  support.

The final patch enables btrfs to use the IOMAP_DIO_BOUNCE flag, so that
even with data checksums we do not need to fall back to buffered IO, and
it reclaims most of the lost direct IO performance.

Qu Wenruo (3):
  block: revert the iov_iter after a short copy in
    bio_iov_iter_bounce_write()
  block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered
    IO

 block/bio.c          | 21 +++++++++++++---
 fs/btrfs/direct-io.c | 58 ++++++++++++++++++++++----------------------
 2 files changed, 47 insertions(+), 32 deletions(-)

-- 
2.54.0

^ permalink raw reply
