Linux block layer
* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Dave Hansen @ 2026-06-16  0:09 UTC
  To: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
In-Reply-To: <ea71a433-fdcc-429d-adce-14e9fb726957@kernel.org>

On 6/15/26 13:19, Vincent Mailhol wrote:
...
> That said, your points make sense to me, and I would be supportive of
> allowing a search for a secondary UUID as a kernel extension. If we do
> so, I think the only constraint should be to make sure that we check
> for the exact match first (e.g. check x86_64 type before x86_32 type).
> 
> Would that make sense?

Yep, that makes sense to me.
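
To make the ordering constraint concrete, here is a minimal userspace
sketch of "exact match first, compatible match second". It is only an
illustration under stated assumptions: pick_root_partition() and its
arguments are hypothetical names, not code from this series, and the
partition table is modelled as a plain array of type UUID strings.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/*
 * Hypothetical helper, not from the patch series: return the index of
 * the partition to use as root, or -1 if none matches. An exact
 * (primary) type UUID match always wins over a compatible (secondary)
 * one, regardless of the order of the partitions on disk.
 */
static int pick_root_partition(const char *const *type_uuids, size_t n,
                               const char *primary, const char *secondary)
{
    size_t i;

    assert(primary != NULL);

    /* Pass 1: exact match, e.g. the x86_64 type on an x86_64 kernel. */
    for (i = 0; i < n; i++)
        if (strcasecmp(type_uuids[i], primary) == 0)
            return (int)i;

    /* Pass 2: compatible match, only if a secondary UUID is configured. */
    if (secondary != NULL)
        for (i = 0; i < n; i++)
            if (strcasecmp(type_uuids[i], secondary) == 0)
                return (int)i;

    return -1;
}
```

With a disk that lists the x86 (32-bit) root first and the x86_64 root
second, an x86_64 kernel would pick index 1. If only the x86 partition
exists, the secondary pass picks it, and with no secondary UUID
configured nothing matches.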

>> 2. Should the UUIDs be defined in arch code or generic code?
> 
> I think that you convinced me to put it in generic code.
> 
>> 3. Kconfig or #ifdefs?
> 
> I would say Kconfig. If we go for the exact match only, that would be:
> 
>   CONFIG_DPS_ROOT_PARTITION_TYPE_UUID
> 
> If we allow more as an extension, that would become:
> 
>   - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID for the exact match
>   - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY for the compatible
>     one.
> 
> The drawback is that some entries will be in both:
> 
>   config DPS_ROOT_PARTITION_TYPE_UUID
>   	string
>   	  default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
>   	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86
> 
>   config DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY
>   	string
>   	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86_64 && COMPAT_32
> 
> And I don't think we need more than two.

That's not ideal, but it's also a completely static thing that will get
written very, very rarely.

> A bonus question: should those Kconfig entries be hidden? I prefer the
> hidden option because it doesn't add that much code and I thought this
> was not worth bothering the user with one more menuconfig question.
> But I would be happy to change if people think this is worth a
> menuconfig entry.

Yeah, it should be hidden. Anybody that wants to change it for whatever
reason can edit the .config file or hack Kconfig.


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-16  0:06 UTC
  To: Vjaceslavs Klimovs
  Cc: Dr. David Alan Gilbert, Thorsten Leemhuis, trnka, linux-block,
	dm-devel, Linux kernel regressions list
In-Reply-To: <CAC_j7i0eDccVWzPeRafM50mZEOFHPz2cwd=RZqqx6TK2EVRFvw@mail.gmail.com>

On Mon, Jun 15, 2026 at 04:16:12PM -0700, Vjaceslavs Klimovs wrote:
> Your trace looks like what the two earlier reports hit: a read reaching
> a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
> that may help read the trace: blk_io_trace.error is a __u16, so the
> bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
> 65531 = -EIO).
> 
> The WARN itself is new, the bad bio isn't. bio_add_page() only started
> rejecting len == 0 in 643893647cac ("block: reject zero length in
> bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
> scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
> That fits your "not a recent regression": the condition is older, v7.1
> just made it loud.
> 
> For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
> origin looks like 5ff3f74e145a ("block: simplify direct io validity
> check", v6.18): blkdev_dio_invalid() now checks only aggregate
> ki_pos | count alignment and dropped the per-segment
> bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
> longer gets -EINVAL at the fops boundary. But your reproducer reads a
> file, which goes through the filesystem O_DIRECT path and never calls
> blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
> that one entry point.
> 
> dm-mirror then hangs because Keith's f7b24c7b41f2 only covers md
> raid1/raid10; legacy dm-mirror (dm-raid1.c) has no equivalent and
> rebuilds the empty read onto the other leg. Note the leg's status isn't
> even consistent (your SATA path returns BLK_STS_IOERR, not
> BLK_STS_INVAL), so copying that status check into dm-mirror probably
> wouldn't catch every case.
> 
> For what it's worth, that points me toward rejecting the empty or
> misaligned bio once, at submission, with -EINVAL, rather than teaching
> each consumer to tolerate it. But you'll know the tradeoffs far better
> than I do.
> 
> I have a small QEMU + LVM raid1/mirror setup that reproduces the
> block-device variant and bisects to 5ff3f74e. Happy to run your file
> reproducer with some instrumentation at the dm-mirror read entry
> (bi_size vs bio_sectors vs bvec lengths) to see whether the bio is
> already empty on arrival or built that way on the retry, and to test
> any patch.

Thanks for following up here. I didn't initially see your follow-up
until Thorsten linked it. I apologize for missing that; this feature is
important, so I don't want to see anything regress for it.

There is a known bug fix I think future tests should include:

  https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/

This likely isn't the fix you're looking for, but including it rules out
conditions that are not important here.

After that, can we try this suggestion and see if the hang goes away?

  https://lore.kernel.org/linux-block/ajBb8tK-0aJBpIgF@kbusch-mbp/

I expect the original test case to still return an error (and I think it
was designed to), but it shouldn't produce the warn or bug splats with a
stuck uninterruptible task.


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Vjaceslavs Klimovs @ 2026-06-15 23:16 UTC
  To: Dr. David Alan Gilbert
  Cc: Thorsten Leemhuis, kbusch, trnka, linux-block, dm-devel,
	Linux kernel regressions list
In-Reply-To: <ai_1LYtofh1fwD-N@gallifrey>

Hi Dave, all,

I'm one of the original reporters and very much a user, not a block/dm
developer, so please sanity-check all of this.

Your trace looks like what the two earlier reports hit: a read reaching
a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
that may help read the trace: blk_io_trace.error is a __u16, so the
bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
65531 = -EIO).
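
For anyone decoding the bracketed values by hand, the conversion can be
sketched as below. This is a small userspace illustration, not blktrace
or kernel code; trace_error_to_errno() is a hypothetical helper name.

```c
#include <assert.h>   /* only needed for the sanity checks mentioned below */
#include <errno.h>
#include <stdint.h>

/*
 * Recover the signed value that was truncated into a 16-bit unsigned
 * trace field. Values of 0x8000 and above are treated as negative
 * (two's complement), so 65514 becomes -22.
 */
static int trace_error_to_errno(uint16_t raw)
{
    return raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;
}
```

On Linux, where EINVAL is 22 and EIO is 5,
trace_error_to_errno(65514) == -EINVAL and
trace_error_to_errno(65531) == -EIO both hold.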

The WARN itself is new, the bad bio isn't. bio_add_page() only started
rejecting len == 0 in 643893647cac ("block: reject zero length in
bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
That fits your "not a recent regression": the condition is older, v7.1
just made it loud.

For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
origin looks like 5ff3f74e145a ("block: simplify direct io validity
check", v6.18): blkdev_dio_invalid() now checks only aggregate
ki_pos | count alignment and dropped the per-segment
bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
longer gets -EINVAL at the fops boundary. But your reproducer reads a
file, which goes through the filesystem O_DIRECT path and never calls
blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
that one entry point.
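
To show why the aggregate check alone can let such a request through,
here is a simplified userspace model of the two styles of validation.
It is a sketch under assumptions: the function names are hypothetical,
the masks are passed in by hand, and the real kernel helpers operate on
an iov_iter and differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Aggregate check: only the file position and total byte count are tested. */
static bool dio_aggregate_ok(uint64_t pos, const struct iovec *iov, size_t n,
                             uint64_t blk_mask)
{
    uint64_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        count += iov[i].iov_len;
    return ((pos | count) & blk_mask) == 0;
}

/* Per-segment walk: every buffer address and length is tested on its own. */
static bool dio_segments_ok(const struct iovec *iov, size_t n,
                            uint64_t addr_mask, uint64_t len_mask)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if ((uintptr_t)iov[i].iov_base & addr_mask)
            return false;
        if (iov[i].iov_len & len_mask)
            return false;
    }
    return true;
}
```

A two-segment request of 100 + 3996 bytes at offset 0 sums to 4096, so
the aggregate check with a 512-byte mask accepts it, while the
per-segment walk rejects it because neither length is a multiple of 512.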

dm-mirror then hangs because Keith's f7b24c7b41f2 only covers md
raid1/raid10; legacy dm-mirror (dm-raid1.c) has no equivalent and
rebuilds the empty read onto the other leg. Note the leg's status isn't
even consistent (your SATA path returns BLK_STS_IOERR, not
BLK_STS_INVAL), so copying that status check into dm-mirror probably
wouldn't catch every case.

For what it's worth, that points me toward rejecting the empty or
misaligned bio once, at submission, with -EINVAL, rather than teaching
each consumer to tolerate it. But you'll know the tradeoffs far better
than I do.

I have a small QEMU + LVM raid1/mirror setup that reproduces the
block-device variant and bisects to 5ff3f74e. Happy to run your file
reproducer with some instrumentation at the dm-mirror read entry
(bi_size vs bio_sectors vs bvec lengths) to see whether the bio is
already empty on arrival or built that way on the retry, and to test
any patch.

Thanks,
Vjaceslavs


On Mon, Jun 15, 2026 at 5:50 AM Dr. David Alan Gilbert
<linux@treblig.org> wrote:
>
> * Thorsten Leemhuis (regressions@leemhuis.info) wrote:
> > On 6/14/26 19:57, Dr. David Alan Gilbert wrote:
> > >
> > >   I've got a repeatable raid hang/warn and would appreciate some pointers
> > > as where to debug.
> > >   (I've been logging stuff on  https://bugzilla.kernel.org/show_bug.cgi?id=221535 )
> >
> > Note: not my area of expertise, so I might be sending you totally
> > off-track with this comment. Feel free to ignore it. But FWIW:
>
> Hi Thorsten,
>   Thanks for the reply - these do seem to be related!
> (So copying in Keith, Vjaceslavs, and Tomáš )
> (Not my area either).
>
> > Have you seen these reports?
> > https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/
> > https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
>
> I hadn't!  Those are both the problem I originally was trying to debug
> and stumbled into the WARN/BUG/hang with my test program.
>
> > The former led to a fix in the mdraid code that should be in the kernel
> > version you are using. But in a reply to the latter report the reporter
> > claimed that that fix is not enough (claiming "this was obvious" and
> > also using dm), but things then stalled there.
>
> Yeh I see my world has Keith's f7b24c7b41f23
>
> I think the problem I'm seeing is zero length requests coming from somewhere.
>
> The WARN I'm seeing in 7.1.0-rc7+ is:
>
> [ 2681.597042] device-mapper: raid1: Mirror read failed from 252:25. Trying alternative device.
> [ 2681.631933] ------------[ cut here ]------------
> [ 2681.631939] WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#22: kworker/22:0/18929
>
> 1039 int bio_add_page(struct bio *bio, struct page *page,
> 1040                  unsigned int len, unsigned int offset)
> 1041 {
> 1042         if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> 1043                 return 0;
> 1044         if (WARN_ON_ONCE(len == 0))
> 1045                 return 0;
>
> So it's the ' if (WARN_ON_ONCE(len == 0))'
>
> and the warn I got on the older 7.0.8 was:
> [Sun May 17 17:22:52 2026] WARNING: drivers/scsi/scsi_lib.c:1140 at scsi_alloc_sgtables+0x38a/0x400, CPU#28: kworker/28:1H/3943
>
> which I *think* corresponds to:
> 1164         if (WARN_ON_ONCE(!nr_segs))
> 1165                 return BLK_STS_IOERR;
>
> so it sounds like we need to find where zero length requests are coming from??
>
> Thanks again,
>
> Dave
>
> > Ciao, Thorsten
> >
> > >   This started off as debugging a case where I'd get my RAID1 (on the host)
> > > getting a reliable 'rescheduling sector'/disk failure while running the qemu block test suite
> > > during a qemu build, but then I tried to build a smaller discrete
> > > test, and now I've got a simply triggerable warn and test hang.
> > > There's no errors from the underlying SATA layer on the storage,
> > > everything resyncs just fine.
> > >
> > > I've got an existing LVM vg ('main') with two mirrors on sda2, and sdb2
> > > which are SATA disks.
> > >
> > > # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > > # mkfs.ext4 /dev/mapper/main-lvol0
> > > # mount /dev/mapper/main-lvol0 /mnt/tmp/
> > > # chmod a+rwx /mnt/tmp
> > >
> > > $ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1
> > >
> > > (I then wait for the IO to stop)
> > >
> > > then we've got this little test program:
> > >
> > > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><-->
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <asm-generic/fcntl.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > >
> > >
> > > const char* path="/mnt/tmp/testfile";
> > > static char buf[8192];
> > >
> > > int main()
> > > {
> > >   int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);
> > >
> > >   errno=0;
> > >   int res3=pread(fd, buf, 4096, 0);
> > >   printf("pread of 4096 said: %d (%m)\n", res3);
> > >
> > > }
> > > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><-->
> > >
> > > running that, either hangs or gets a 'pread of 4096 said: -1 (Input/output error)'
> > > when it hangs it's unkillable.
> > >
> > > at the moment (on 7.1.0-rc7) this is giving:
> > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > >
> > > (full backtrace below)
> > > (Note there is a moan in there about sdb IO error - repeated a lot - but
> > > again, there's no SATA level errors, and the drive is fine on smart, and
> > > I can read the whole of the underlying lvm mirrors, so I don't think it's
> > > physically there).
> > >
> > > I did a blktrace, although that gives me a 23G blkparse output, hmm
> > > (I see each event repeated a lot - maybe per thread?)
> > >
> > > 252,26  15        1     0.000000000  3435  Q  RS 264192 + 8 [dbf]
> > >   252,26 is /dev/mapper/main-lvol0
> > > 252,24  15        1     0.000005501  3435  A  RS 264192 + 8 <- (252,26) 264192
> > >   252,24 is main-lvol0_mimage_0
> > > 252,24  15        2     0.000005761  3435  Q  RS 264192 + 8 [dbf]
> > >   8,0   15        1     0.000008646  3435  A  RS 71634944 + 8 <- (252,24) 264192
> > >     so that's sda
> > >   8,0   15        2     0.000008787  3435  A  RS 73734144 + 8 <- (8,2) 71634944
> > >     I guess mapping down from sda2 to sda
> > >   8,0   15        3     0.000009037  3435  Q  RS 73734144 + 8 [dbf]
> > >   8,0   15        4     0.000009809  3435  C  RS 73734144 + 8 [65514]
> > >       ??? Hmm what's the 65514 there?
> > > 252,24  15        3     0.000010320  3435  C  RS 264192 + 8 [65514]
> > > 252,25  15        1     0.000290384   369  Q   R 264192 + 8 [kworker/15:1]
> > >    252,25 is main-lvol0_mimage_1
> > >
> > > and at this point I'm a bit lost as to what I'm looking for.
> > >
> > > Hints appreciated!
> > >
> > > (I don't believe this is a regression - or at least not recent)
> > >
> > > Dave
> > >
> > >
> > >
> > >
> > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > > Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync.
> > > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not tainted 7.1.0-rc7+ #786 PREEMPT(lazy)
> > > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > > Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250
> > > Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c
> > > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246
> > > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 RCX: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 R09: ffffd1fb8176fc40
> > > Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 R12: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 R15: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > > Jun 14 18:08:32 dalek kernel: Call Trace:
> > > Jun 14 18:08:32 dalek kernel:  <TASK>
> > > Jun 14 18:08:32 dalek kernel:  do_region+0x227/0x2a0
> > > Jun 14 18:08:32 dalek kernel:  dispatch_io+0xf1/0x150
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  dm_io+0x169/0x2d0
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_reads+0x149/0x230
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_mirror+0x11a/0x2b0
> > > Jun 14 18:08:32 dalek kernel:  process_one_work+0x19e/0x390
> > > Jun 14 18:08:32 dalek kernel:  worker_thread+0x1a6/0x310
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_worker_thread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  kthread+0xe4/0x120
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork+0x1a1/0x270
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork_asm+0x1a/0x30
> > > Jun 14 18:08:32 dalek kernel:  </TASK>
> > > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]---
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek kernel: WARNING: drivers/scsi/scsi_lib.c:1164 at scsi_alloc_sgtables+0x38a/0x400, CPU#15: kworker/15:1/369
> > > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Tainted: G        W           7.1.0-rc7+ #786 PREEMPT(lazy)
> > > Jun 14 18:08:32 dalek kernel: Tainted: [W]=WARN
> > > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > > Jun 14 18:08:32 dalek kernel: RIP: 0010:scsi_alloc_sgtables+0x38a/0x400
> > > Jun 14 18:08:32 dalek kernel: Code: 8b 3d ba 2d a9 01 e9 d1 fd ff ff 48 8b 75 00 48 8d bb f0 fe ff ff e8 15 b7 b0 ff 48 89 ab e0 00 00 00 89 45 08 e9 30 ff ff ff <0f> 0b 4c 8b 6c 24 30 b8 0a 00 00 00 e9 21 ff ff ff b8 09 00 00 00
> > > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176f7f0 EFLAGS: 00010246
> > > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffff8d1aedad0110 RCX: 0000000000000009
> > > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: ffffffff99c15960 RDI: ffff8d1aedad0110
> > > Jun 14 18:08:32 dalek kernel: RBP: ffff8d1a93d17000 R08: ffff8d1aedad0110 R09: ffff8d1a818fa800
> > > Jun 14 18:08:32 dalek kernel: R10: 7020676e69736961 R11: 0000000000000000 R12: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: R13: 0000000000000000 R14: ffff8d1a93394000 R15: ffff8d1a93d17000
> > > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > > Jun 14 18:08:32 dalek kernel: Call Trace:
> > > Jun 14 18:08:32 dalek kernel:  <TASK>
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  sd_setup_read_write_cmnd+0x9d/0x740
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  scsi_queue_rq+0x4d2/0x890
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_dispatch_rq_list+0x241/0x530
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  ? sbitmap_get+0x61/0x100
> > > Jun 14 18:08:32 dalek kernel:  __blk_mq_do_dispatch_sched+0x330/0x340
> > > Jun 14 18:08:32 dalek kernel:  __blk_mq_sched_dispatch_requests+0x143/0x180
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_sched_dispatch_requests+0x2d/0x70
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_run_hw_queue+0x2bf/0x350
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_dispatch_list+0x172/0x350
> > > Jun 14 18:08:32 dalek kernel:  blk_mq_flush_plug_list+0x51/0x1a0
> > > Jun 14 18:08:32 dalek kernel:  ? blk_mq_submit_bio+0x71c/0x9f0
> > > Jun 14 18:08:32 dalek kernel:  __blk_flush_plug+0x112/0x180
> > > Jun 14 18:08:32 dalek kernel:  ? srso_return_thunk+0x5/0x5f
> > > Jun 14 18:08:32 dalek kernel:  __submit_bio+0x19c/0x260
> > > Jun 14 18:08:32 dalek kernel:  __submit_bio_noacct+0x8e/0x210
> > > Jun 14 18:08:32 dalek kernel:  do_region+0x14c/0x2a0
> > > Jun 14 18:08:32 dalek kernel:  dispatch_io+0xf1/0x150
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  dm_io+0x169/0x2d0
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_get_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_bio_next_page+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_reads+0x149/0x230
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_read_callback+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  do_mirror+0x11a/0x2b0
> > > Jun 14 18:08:32 dalek kernel:  process_one_work+0x19e/0x390
> > > Jun 14 18:08:32 dalek kernel:  worker_thread+0x1a6/0x310
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_worker_thread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  kthread+0xe4/0x120
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork+0x1a1/0x270
> > > Jun 14 18:08:32 dalek kernel:  ? __pfx_kthread+0x10/0x10
> > > Jun 14 18:08:32 dalek kernel:  ret_from_fork_asm+0x1a/0x30
> > > Jun 14 18:08:32 dalek kernel:  </TASK>
> > > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]---
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > > Jun 14 18:08:37 dalek kernel: blk_print_req_error: 241000 callbacks suppressed
> > > Jun 14 18:08:37 dalek kernel: I/O error, dev sdb, sector 50606087 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > >
> > >
> >
> --
>  -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/


* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-15 22:53 UTC
  To: Leonid Ravich
  Cc: Herbert Xu, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260615111459.9452-1-lravich@amazon.com>

On Mon, Jun 15, 2026 at 11:14:56AM +0000, Leonid Ravich wrote:
> The series adds a per-request "data unit size" to the skcipher API
> so a caller can submit several data units (typically 512..4096-byte
> sectors) sharing one starting IV in a single request.  Algorithms
> derive each data unit's IV from the caller-supplied IV by treating
> it as a 128-bit little-endian counter and adding the data-unit
> index, matching the layout produced by dm-crypt's plain64 IV mode
> and by typical inline-encryption hardware.
> 
> This mirrors the data_unit_size concept already exposed by
> struct blk_crypto_config for inline encryption.
> 
> The first user is dm-crypt, which today issues one skcipher request
> per sector and so pays a per-sector cost in request allocation,
> callback dispatch, completion handling, and scatterlist setup.
> 
> Proof-of-concept performance numbers from the RFC reply [1]: +19%
> throughput / -40% CPU on a single-core arm64 system with a hardware
> XTS-AES-256 accelerator running fio 4 KiB sequential writes through
> dm-crypt, when an out-of-tree arm64 xts driver advertises
> CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU.  This series itself does not
> include arch enablement; the fast path is opt-in per driver, the
> slow path is universal via the auto-splitter.
> 
> The native fast path amortises both per-sector dispatch and per-sector
> crypto setup across a bio - the measured win above, on an engine that
> offloads the AES compute.  The auto-splitter is for correctness and
> reach: any consumer can set data_unit_size and get correct output with
> the per-request allocation/callback/completion cost removed, but it
> still issues one alg->encrypt per data unit, so on a software cipher it
> saves only dispatch overhead (no throughput figure claimed - that is
> hardware- and workload-dependent).  What it guarantees unconditionally
> is byte-identical output (Verification below) at O(entries + units),
> walking the scatterlists with a pair of struct scatter_walk cursors
> rather than rescanning from the head per unit.
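
For reference, the IV derivation described in the cover letter (treat
the caller's IV as a 128-bit little-endian counter and add the data-unit
index) can be sketched in userspace C as below. This is an illustration
only: data_unit_iv() and IV_LEN are hypothetical names, not taken from
the series, and a 64-bit index is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IV_LEN 16

/* out = base + index, with base and out as 128-bit little-endian integers. */
static void data_unit_iv(uint8_t out[IV_LEN], const uint8_t base[IV_LEN],
                         uint64_t index)
{
    unsigned int carry = 0;
    int i;

    for (i = 0; i < IV_LEN; i++) {
        /* Byte i of the index; a 64-bit index contributes nothing above byte 7. */
        unsigned int add = i < 8 ? (unsigned int)((index >> (8 * i)) & 0xff) : 0;
        unsigned int sum = base[i] + add + carry;

        out[i] = (uint8_t)(sum & 0xff);
        carry = sum >> 8;
    }
    /* A carry out of byte 15 is dropped, so the counter wraps at 2^128. */
}
```

With a base IV whose low eight bytes are all 0xff, index 1 carries into
byte 8: the low eight bytes become zero and byte 8 becomes 1. Index 0
returns the base IV unchanged.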

So in other words, this series slows down dm-crypt and crypto_skcipher
for everyone to optimize for an out-of-tree driver.  And there's also no
benchmark showing that your driver is even worth it over just using the
CPU.

- Eric


* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Qu Wenruo @ 2026-06-15 22:43 UTC
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <ajAU1yLd32BCiCNj@infradead.org>



On 2026/6/16 00:35, Christoph Hellwig wrote:
> On Fri, Jun 12, 2026 at 07:21:14PM +0930, Qu Wenruo wrote:
>> If a direct IO requires bounced pages for stable buffer, it will always
>> allocate memory, and both bio_iov_iter_bounce_write() and
>> bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
>> can sleep and break NOWAIT requirement.
>>
>> So we need to reject such NOWAIT and BOUNCE direct IO in
>> iomap_dio_bio_iter().
> 
> That's a bit heavy handed. Just do a noretry allocation.

From the comment of __GFP_NORETRY:

  * %__GFP_NORETRY: The VM implementation will try only very lightweight
  * memory direct reclaim to get some memory under memory pressure (thus
  * it can sleep).

It looks like NORETRY can still sleep, thus again breaking the NOWAIT
requirement.

I think you're talking about GFP_NOWAIT?
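
To spell out the distinction, here is a toy userspace model. It is only
a sketch: the MODEL_* bit values are arbitrary and are NOT the kernel's
real encodings, and model_may_sleep() merely mirrors the idea behind
gfpflags_allow_blocking(), which keys off the direct reclaim bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Model bits: arbitrary values, not the kernel's real encodings. */
#define MODEL_DIRECT_RECLAIM 0x01u /* caller may enter direct reclaim (can sleep) */
#define MODEL_KSWAPD_RECLAIM 0x02u /* may wake kswapd (does not sleep) */
#define MODEL_IO             0x04u
#define MODEL_FS             0x08u
#define MODEL_NORETRY        0x10u /* give up early; reclaim is still allowed */

#define MODEL_GFP_KERNEL \
    (MODEL_DIRECT_RECLAIM | MODEL_KSWAPD_RECLAIM | MODEL_IO | MODEL_FS)
#define MODEL_GFP_NOWAIT (MODEL_KSWAPD_RECLAIM)

/* An allocation may block exactly when direct reclaim is permitted. */
static bool model_may_sleep(unsigned int flags)
{
    return (flags & MODEL_DIRECT_RECLAIM) != 0;
}
```

In this model, MODEL_GFP_KERNEL | MODEL_NORETRY still may sleep, because
NORETRY does not clear the direct reclaim bit, while MODEL_GFP_NOWAIT
may not.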


* Re: [PATCH] sunvdc: fix -EIO issue due to lack of retries
From: John Paul Adrian Glaubitz @ 2026-06-15 22:09 UTC
  To: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <418310b3-2b77-4534-b2fd-27dcc11e333c@kernel.dk>

Hi,

On Mon, 2025-10-06 at 08:59 -0600, Jens Axboe wrote:
> John reports that since commit:
> 
> a11f6ca9aef9 ("sunvdc: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN")
> 
> users of Linux inside Solaris ldom see occasional -EIO errors because
> the request send loop now times out. The current loop does 10 retries,
> and inside vio_ldc_send() a further 1000 1usec retries are done as well.
> Even with 10.5 msec of busy loop retries that's apparently not enough to
> always succeed.
> 
> Rather than introduce continued busy looping, requeue the request and
> have the delayed queue kicking retry the request after another 10ms.
> This obviously isn't ideal, but there's seemingly no way to wait for
> this type of event. And if 10ms of busy looping was not enough to make
> progress, then presumably this is an edge condition and we just need to
> guarantee to make forward progress at some later point in time. That's
> more suitably done through letting the CPU tend to other work, rather
> than sitting in a tight loop retrying.
> 
> Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> Link: https://lore.kernel.org/all/20251006100226.4246-2-glaubitz@physik.fu-berlin.de/
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> Caveat: 100% untested, not even compiled. Sending out on John's behest.
> 
> diff --git a/drivers/block/sunvdc.c b/drivers/block/sunvdc.c
> index db1fe9772a4d..aa49dffb1b53 100644
> --- a/drivers/block/sunvdc.c
> +++ b/drivers/block/sunvdc.c
> @@ -539,6 +539,7 @@ static blk_status_t vdc_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	struct vdc_port *port = hctx->queue->queuedata;
>  	struct vio_dring_state *dr;
>  	unsigned long flags;
> +	int ret;
>  
>  	dr = &port->vio.drings[VIO_DRIVER_TX_RING];
>  
> @@ -560,7 +561,13 @@ static blk_status_t vdc_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		return BLK_STS_DEV_RESOURCE;
>  	}
>  
> -	if (__send_request(bd->rq) < 0) {
> +	ret = __send_request(bd->rq);
> +	if (ret == -EAGAIN) {
> +		spin_unlock_irqrestore(&port->vio.lock, flags);
> +		/* already spun for 10msec, defer 10msec and retry */
> +		blk_mq_delay_kick_requeue_list(hctx->queue, 10);
> +		return BLK_STS_DEV_RESOURCE;
> +	} else if (ret < 0) {
>  		spin_unlock_irqrestore(&port->vio.lock, flags);
>  		return BLK_STS_IOERR;
>  	}

I will give this patch a try this week as I finally want to get this fixed.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


* Re: [PATCH] block: check bio split for unaligned bvec
From: Keith Busch @ 2026-06-15 22:08 UTC
  To: Christoph Hellwig; +Cc: Keith Busch, linux-block, axboe, Carlos Maiolino
In-Reply-To: <20260615133549.GC26132@lst.de>

On Mon, Jun 15, 2026 at 03:35:49PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 12, 2026 at 03:32:04PM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Offsets and lengths need to be validated against the dma alignment. This
> > check was skipped for a sufficiently small bio with a single bvec, which
> > may allow an invalid request to be dispatched to the driver. Force the
> > validation for an unaligned bvec by forcing the bio split path that
> > handles this condition.
> 
> This fix itself looks good, but we'll also need something similar
> for bio-based drivers that never call into the splitting helper.

Totally agree. I'm looking at all the .submit_bio drivers, and I think
they fall into one of four categories:

  1: already split (md/nvme-mp/drbd; dm conditional)
  2: don't split
      * btt, dcssblk: already reject unaligned
      * n64cart: WARNs, but potentially proceeds to undefined behavior
      * nfhd: silently corrupts, but looks like a driver problem
  3: can handle arbitrary memory but advertise default dma_alignment=511
      (brd, pmem, zram, ps3vram, simdisk - "limits lie")
  4: forward/self-split (bcache)

I think the block layer can fix 3 with a BLK_FEAT flag to allow a zero
dma_alignment limit for the drivers that really don't need it from the
source buffer.
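
To make the category 3 point concrete, here is a minimal userspace sketch
of the mask test implied by a dma_alignment limit. It is not the block
layer's code: bvec_is_aligned() is a hypothetical helper, and the only
assumption is that the limit is a bitmask (511 meaning 512-byte multiples)
applied to both the offset and the length of a segment.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: a segment honors the limit
 * when neither its offset nor its length has any bit set inside the mask.
 */
static bool bvec_is_aligned(uint32_t offset, uint32_t len,
			    uint32_t dma_alignment)
{
	return ((offset | len) & dma_alignment) == 0;
}
```

With the default mask of 511, a 4096-byte segment at offset 256 fails the
test. With a mask of 0, which is what such a flag would permit, any offset
and length pass.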

As for the rest, I don't know of anyone caring to ensure n64 or nfhd are
correctly handling degenerate applications.


* Re: [PATCH 08/19] parisc: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 20:43 UTC (permalink / raw)
  To: Helge Deller, James Bottomley, Jens Axboe, Davidlohr Bueso,
	Alexander Viro, Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel, linux-parisc
In-Reply-To: <0290e66b-1da2-4706-ab28-2d47f164df08@gmx.de>

On 15/06/2026 at 22:27, Helge Deller wrote:
> On 6/15/26 22:02, James Bottomley wrote:
>> On Mon, 2026-06-15 at 18:09 +0200, Vincent Mailhol wrote:
>>> DPS [1] assigns GPT partition type UUIDs to operating system
>>> partitions. Root partitions use architecture-specific type UUIDs so
>>> the OS can discover the intended root filesystem without relying on a
>>> root=  cmdline option.
>>>
>>> Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for parisc and
>>> select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.
> 
> Vincent, first of all thank you for at least trying to include parisc
> (and other niche Linux ports) in the specification! (whatever the
> outcome is!)

You are welcome. My personal interest is only x86_64 at the moment, but
at least I tried to make it useful to the broader community!

>>> [1] The Discoverable Partitions Specification (DPS)
>>> Link:
>>> https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
>>
>> How are you planning to make this work for parisc?  Some systems have a
>> PALO boot partition (fdisk type 0xf0) but the more modern way is to
>> place palo inside a hidden ext4 inode in /boot.  The way parisc IODC
>> works is very similar to the way MSDOS boots with the palo location
>> table in the first block so I theorize that would probably work for gpt
>> partitions as well ... I'm just not sure anyone has tested it.
>>
>> However, to get this to work with PALO for auto discovery, you'd need
>> palo patches to recognize the DPS UUID and no-one seems to have
>> submitted anything to palo for this.
> 
> Maybe it's not necessary that palo does this job?
> palo could stay as is and load kernel and the initrd.
> Then the kernel (or the scripts in initrd) could try to find the root
> partition on its own (and handle GPT discs).
> 
> I even once started porting grub to parisc (which is currently on hold
> because I'm busy with other stuff). If I ever finish it, having such a
> mechanism/constant already in place is IMHO beneficial.

You can see my answer to Alexander on the cover letter. This was an
oversight. parisc does not have CONFIG_EFI to begin with, so the feature
is just dead code there.

I will remove parisc (and all other architectures which do not have
CONFIG_EFI) in v2. If someone implements EFI support for those
architectures, we can revisit this DPS topic for them then.


Yours sincerely,
Vincent Mailhol



* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Vincent Mailhol @ 2026-06-15 20:39 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Hansen
  Cc: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara, linux-kernel, linux-block, linux-efi,
	linux-fsdevel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
In-Reply-To: <ajA-0YJW4gLr02c7@casper.infradead.org>

On 15/06/2026 at 20:05, Matthew Wilcox wrote:
> On Mon, Jun 15, 2026 at 09:46:41AM -0700, Dave Hansen wrote:
>> There are a lot of ways to do this. I'm just not a super big fan of the
>> current proposal.
>>
>> So, boiling it down:
>>
>> 1. Should more than one UUID be supported per kernel build?
>> 2. Should the UUIDs be defined in arch code or generic code?
>> 3. Kconfig or #ifdefs?
> 
> Further questions ... why do this in the kernel?

Most of the plumbing was already there, so the feature stays tiny. It
seems like a reasonable trade-off to me.

> Seems perfectly suited to be in initramfs where we can throw away the
> code after boot.

The added code uses the __init attribute for this exact reason: so that
its memory can be reclaimed after boot.

-- 
Yours sincerely,
Vincent Mailhol



* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Vincent Mailhol @ 2026-06-15 20:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Davidlohr Bueso, Christian Brauner, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615170432.GW2636677@ZenIV>

On 15/06/2026 at 19:04, Al Viro wrote:
> On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:
> 
>> Tested with GRUB, which implements the LoaderDevicePartUUID EFI variable
>> in its bli module [3]. With this, I was able to boot a kernel with a
>> completely empty cmdline and no initrd.
>>
>> [1] The Discoverable Partitions Specification (DPS)
>> Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
>>
>> [2] systemd-gpt-auto-generator
>> Link: https://www.freedesktop.org/software/systemd/man/latest/systemd-gpt-auto-generator.html
>>
>> [3] GRUB -- §16.2 bli
>> Link: https://www.gnu.org/software/grub/manual/grub/html_node/bli_005fmodule.html
> 
> So what does that thing, tied to EFI as it is, have to do with architectures where
> 	* firmware is rather unlike EFI

I made CONFIG_DPS_ROOT_AUTO_DISCOVERY depend on CONFIG_EFI for this reason.

> 	* firmware wouldn't know what to do with GPT
> 	* GRUB is *not* ported to, let alone used
> such as, say it, the very first one mentioned at your [1]?

Fair point. I just did:

  $ git grep "^config EFI$"
  arch/arm/Kconfig:config EFI
  arch/arm64/Kconfig:config EFI
  arch/loongarch/Kconfig:config EFI
  arch/riscv/Kconfig:config EFI
  arch/x86/Kconfig:config EFI

Anything not in this list is dead code at the moment.

> Or is that conditional upon "if anyone wants to design replacement firmware
> for those, and if they agree to follow our wishlist"?

No, it was just an oversight from my side. I will just keep arm, arm64,
loongarch, riscv and x86 in my v2.


Yours sincerely,
Vincent Mailhol



* Re: [PATCH 08/19] parisc: define DPS root partition type UUID
From: Helge Deller @ 2026-06-15 20:27 UTC (permalink / raw)
  To: James Bottomley, Vincent Mailhol, Jens Axboe, Davidlohr Bueso,
	Alexander Viro, Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel, linux-parisc
In-Reply-To: <0158c89cffd76c621607f66d1889ccc084754729.camel@HansenPartnership.com>

On 6/15/26 22:02, James Bottomley wrote:
> On Mon, 2026-06-15 at 18:09 +0200, Vincent Mailhol wrote:
>> DPS [1] assigns GPT partition type UUIDs to operating system
>> partitions. Root partitions use architecture-specific type UUIDs so
>> the OS can discover the intended root filesystem without relying on a
>> root=  cmdline option.
>>
>> Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for parisc and
>> select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

Vincent, first of all thank you for at least trying to include parisc (and
other niche Linux ports) in the specification! (whatever the outcome is!)

>> [1] The Discoverable Partitions Specification (DPS)
>> Link:
>> https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
> 
> How are you planning to make this work for parisc?  Some systems have a
> PALO boot partition (fdisk type 0xf0) but the more modern way is to
> place palo inside a hidden ext4 inode in /boot.  The way parisc IODC
> works is very similar to the way MSDOS boots with the palo location
> table in the first block so I theorize that would probably work for gpt
> partitions as well ... I'm just not sure anyone has tested it.
> 
> However, to get this to work with PALO for auto discovery, you'd need
> palo patches to recognize the DPS UUID and no-one seems to have
> submitted anything to palo for this.

Maybe it's not necessary that palo does this job?
palo could stay as is and load kernel and the initrd.
Then the kernel (or the scripts in initrd) could try to find the root
partition on its own (and handle GPT discs).

I even once started porting grub to parisc (which is currently on hold
because I'm busy with other stuff). If I ever finish it, having such a
mechanism/constant already in place is IMHO beneficial.

Helge


* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Vincent Mailhol @ 2026-06-15 20:19 UTC (permalink / raw)
  To: Dave Hansen, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
In-Reply-To: <03be57ae-0e41-4b8a-adc5-bdd85ccce951@intel.com>

On 15/06/2026 at 18:46, Dave Hansen wrote:
> On 6/15/26 09:09, Vincent Mailhol wrote:
>> +#ifdef CONFIG_X86_64
>> +#define DPS_ROOT_PARTITION_TYPE_UUID "4f68bce3-e8cd-4db1-96e7-fbcaf984b709"
>> +#else
>> +#define DPS_ROOT_PARTITION_TYPE_UUID "44479540-f297-41b2-9af7-d131d5f0458a"
>> +#endif
> 
> This doesn't make a whole lot of sense to me. 64-bit kernels can run
> 32-bit userspace just fine.
> 
> But this #ifdef as proposed means that only a 32-bit *OR* 64-bit kernel
> can auto-discover a given partition.
> 
> I kinda think you should just have an array of strings for these things,
> maybe glued together with some preprocessor magic. Logically something
> like this:
> 
> const char* const uuids[] = {
> #ifdef CONFIG_ARM64
> 	"b921b045-1df0-41c3-af44-4c6f280d3fae",
> #endif
> #ifdef CONFIG_X86_64
> 	"4f68bce3-e8cd-4db1-96e7-fbcaf984b709",
> #endif
> #if defined(CONFIG_X86) && defined(CONFIG_COMPAT_32)
> 	"44479540-f297-41b2-9af7-d131d5f0458a",
> #endif
> ...
> };
> 
> ... and then search the array. I honestly don't think you need to
> sprinkle UUIDs all over the architectures.
> 
> It could probably also be done almost entirely in Kconfig. This could be
> in, say block/partitions/Kconfig, or arch/*/Kconfig:
> 
> config DPS_ROOT_PARTITION_TYPE_UUID_1
> 	string
> 	default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
> 	default "b921b045-1df0-41c3-af44-4c6f280d3fae" if ARM64
> 	...
> 
> config DPS_ROOT_PARTITION_TYPE_UUID_2
> 	string
> 	default "44479540-f297-41b2-9af7-..." if X86 && COMPAT_32
> 
> const char* const uuids[] = {
> #ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1
> 	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1,
> #endif
> #ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2
> 	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2,
> #endif
> ...
> };
> 
> There are a lot of ways to do this. I'm just not a super big fan of the
> current proposal.
> 
> So, boiling it down:
> 
> 1. Should more than one UUID be supported per kernel build?

I didn't pay much attention to this, but this is a very good point.

The Discoverable Partitions Specification is not clear about this
point. All it has to say is:

  On systems *with matching architecture*, the first partition with
  this type UUID on the disk containing the active EFI ESP is
  automatically mounted to the root directory /.

Does an x86_32 system match an x86_64 partition? Wouldn't make sense.
Does an x86_64 system match an x86_32 partition? Could be.

My feeling is that the intent was an *exact* match. This is supported
by the implementation in systemd, which only checks against
SD_GPT_ROOT_NATIVE (which corresponds to the exact match).

  https://github.com/systemd/systemd/blob/main/src/udev/udev-builtin-blkid.c#L243-L247

*But* there are some hints about a secondary UUID. In my terminal I have:

  $ systemd-id128 show root root-secondary
  NAME           ID                              
  root           4f68bce3e8cd4db196e7fbcaf984b709
  root-secondary 44479540f29741b29af7d131d5f0458a

where root is the x86_64 type and root-secondary is the x86_32 one. So
although I see no match logic in the code, the ID table has it!

That said, your points make sense to me, and I would support allowing
a search for a secondary UUID as a kernel extension. If we do so, I
think the only constraint should be to make sure that we check for the
exact match first (e.g. check the x86_64 type before the x86_32 type).

Would that make sense?
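
As a rough illustration of that ordering constraint, the sketch below
scans the whole partition table for the primary type UUID before it
considers the secondary one. All names here (struct part, find_root())
are hypothetical and not part of the series. It is plain userspace C, and
the case-insensitive compare is my own assumption.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical partition descriptor: only the GPT type UUID matters here. */
struct part {
	const char *type_uuid;
};

/*
 * Return the index of the root partition, or -1 if none matches.
 * Every partition is tested against the primary (exact match) UUID before
 * the secondary (compatible) UUID is considered at all. A NULL UUID
 * disables the corresponding pass.
 */
static int find_root(const struct part *parts, size_t n,
		     const char *primary, const char *secondary)
{
	const char *wanted[] = { primary, secondary };

	for (size_t pass = 0; pass < 2; pass++) {
		if (!wanted[pass])
			continue;
		for (size_t i = 0; i < n; i++)
			if (!strcasecmp(parts[i].type_uuid, wanted[pass]))
				return (int)i;
	}
	return -1;
}
```

The point of the two passes is that a disk carrying both an x86_32 and an
x86_64 root partition resolves to the x86_64 one on a 64-bit kernel,
whatever their order in the table.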

> 2. Should the UUIDs be defined in arch code or generic code?

I think that you convinced me to put it in generic code.

> 3. Kconfig or #ifdefs?

I would say Kconfig. If we go for the exact match only, that would be:

  CONFIG_DPS_ROOT_PARTITION_TYPE_UUID

If we allow more as an extension, that would become:

  - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID for the exact match
  - CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY for the compatible
    one.

The drawback is that some entries will be in both:

  config DPS_ROOT_PARTITION_TYPE_UUID
  	string
  	  default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
  	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86

  config DPS_ROOT_PARTITION_TYPE_UUID_SECONDARY
  	string
  	  default "44479540-f297-41b2-9af7-d131d5f0458a" if X86_64 && COMPAT_32

And I don't think we need more than two.


A bonus question: should those Kconfig entries be hidden? I prefer the
hidden option because it doesn't add that much code and I thought this
was not worth bothering the user with one more menuconfig question.
But I would be happy to change this if people think it is worth a
menuconfig entry.


Yours sincerely,
Vincent Mailhol


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 20:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ajBRnUqXd7DqxLiG@kbusch-mbp>

On Mon, Jun 15, 2026 at 01:25:17PM -0600, Keith Busch wrote:
> In the meantime, since I so far can't reproduce this after including my
> previous proposal, I may have to request trying out a debug patch to get
> some more visibility on what's happening if that's okay.

Going in a different direction here, there's no reason to recreate the
lower level bios from scratch when they originate from an incoming bio.
We can just clone it along with an iterator pointing to the original.

Can you try this one out? This was successful when I ran your reproducer
and cuts out a lot of code too with a performance bonus for large IO.

---
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 1db565b376200..28adfeb58f240 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -170,12 +170,11 @@ struct dpages {
 			 struct page **p, unsigned long *len, unsigned int *offset);
 	void (*next_page)(struct dpages *dp);
 
-	union {
-		unsigned int context_u;
-		struct bvec_iter context_bi;
-	};
+	unsigned int context_u;
 	void *context_ptr;
 
+	struct bio *orig_bio;
+
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 };
@@ -210,44 +209,6 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned int o
 	dp->context_ptr = pl;
 }
 
-/*
- * Functions for getting the pages from a bvec.
- */
-static void bio_get_page(struct dpages *dp, struct page **p,
-			 unsigned long *len, unsigned int *offset)
-{
-	struct bio_vec bvec = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
-					     dp->context_bi);
-
-	*p = bvec.bv_page;
-	*len = bvec.bv_len;
-	*offset = bvec.bv_offset;
-
-	/* avoid figuring it out again in bio_next_page() */
-	dp->context_bi.bi_sector = (sector_t)bvec.bv_len;
-}
-
-static void bio_next_page(struct dpages *dp)
-{
-	unsigned int len = (unsigned int)dp->context_bi.bi_sector;
-
-	bvec_iter_advance((struct bio_vec *)dp->context_ptr,
-			  &dp->context_bi, len);
-}
-
-static void bio_dp_init(struct dpages *dp, struct bio *bio)
-{
-	dp->get_page = bio_get_page;
-	dp->next_page = bio_next_page;
-
-	/*
-	 * We just use bvec iterator to retrieve pages, so it is ok to
-	 * access the bvec table directly here
-	 */
-	dp->context_ptr = bio->bi_io_vec;
-	dp->context_bi = bio->bi_iter;
-}
-
 /*
  * Functions for getting the pages from a VMA.
  */
@@ -332,6 +293,21 @@ static void do_region(const blk_opf_t opf, unsigned int region,
 		return;
 	}
 
+	if (dp->orig_bio) {
+		bio = bio_alloc_clone(where->bdev, dp->orig_bio, GFP_NOIO,
+				      &io->client->bios);
+		bio->bi_iter.bi_sector = where->sector;
+		bio->bi_iter.bi_size = where->count << SECTOR_SHIFT;
+		bio->bi_opf = opf;
+		bio->bi_end_io = endio;
+		bio->bi_ioprio = ioprio;
+		store_io_and_region_in_bio(bio, io, region);
+
+		atomic_inc(&io->count);
+		submit_bio(bio);
+		return;
+	}
+
 	/*
 	 * where->count may be zero if op holds a flush and we need to
 	 * send a zero-sized flush.
@@ -468,6 +444,7 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 
 	dp->vma_invalidate_address = NULL;
 	dp->vma_invalidate_size = 0;
+	dp->orig_bio = NULL;
 
 	switch (io_req->mem.type) {
 	case DM_IO_PAGE_LIST:
@@ -475,7 +452,11 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 		break;
 
 	case DM_IO_BIO:
-		bio_dp_init(dp, io_req->mem.ptr.bio);
+		/*
+		 * The destination bios clone this bio's biovec directly, so
+		 * there are no per-page accessors to set up here.
+		 */
+		dp->orig_bio = io_req->mem.ptr.bio;
 		break;
 
 	case DM_IO_VMA:
-- 


* Re: [PATCH 08/19] parisc: define DPS root partition type UUID
From: James Bottomley @ 2026-06-15 20:02 UTC (permalink / raw)
  To: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel, Helge Deller,
	linux-parisc
In-Reply-To: <20260615-discoverable-root_partitions-v1-8-39c78fac42e2@kernel.org>

On Mon, 2026-06-15 at 18:09 +0200, Vincent Mailhol wrote:
> DPS [1] assigns GPT partition type UUIDs to operating system
> partitions. Root partitions use architecture-specific type UUIDs so
> the OS can discover the intended root filesystem without relying on a
> root=  cmdline option.
> 
> Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for parisc and
> select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.
> 
> [1] The Discoverable Partitions Specification (DPS)
> Link:
> https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

How are you planning to make this work for parisc?  Some systems have a
PALO boot partition (fdisk type 0xf0) but the more modern way is to
place palo inside a hidden ext4 inode in /boot.  The way parisc IODC
works is very similar to the way MSDOS boots with the palo location
table in the first block so I theorize that would probably work for gpt
partitions as well ... I'm just not sure anyone has tested it.

However, to get this to work with PALO for auto discovery, you'd need
palo patches to recognize the DPS UUID and no-one seems to have
submitted anything to palo for this.

Regards,

James



* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 19:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ajA5kpUraXqh9ag9@gallifrey>

On Mon, Jun 15, 2026 at 05:42:42PM +0000, Dr. David Alan Gilbert wrote:
> * Keith Busch (kbusch@kernel.org) wrote:
> > On Mon, Jun 15, 2026 at 04:37:39PM +0000, Dr. David Alan Gilbert wrote:
> > > Hi Keith,
> > >   Thanks for the patch, alas it doesn't seem to be helping here;
> > >  the first warn is still the same
> > > and it still hangs the test process hard and eventually BUGs at
> > > 
> > > void blk_mq_end_request(struct request *rq, blk_status_t error)
> > > {
> > >         if (blk_update_request(rq, error, blk_rq_bytes(rq)))
> > >                 BUG();
> > 
> > Oh, that was not expected.
> > 
> > What is the dma alignment requirement of your backing devices? You can
> > find the attribute for sda at /sys/block/sda/queue/dma_alignment. I'm
> > expecting 511, but just want to double check.
> 
> Yeh looks like it:
> /sys/block/sda/queue/dma_alignment:511
> /sys/block/sdb/queue/dma_alignment:511
> 
> all of the lvm also looks like it is.

Thanks for confirming.

I'm struggling to see how you're getting there with your reproducer with
the proposal included. I can see other shortcomings with preadv or
really large preads, but not with a 4k pread. For those other issues
this patch can fix it:

  https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/

It is currently staged for upstream, so it hasn't landed yet. Again, I
don't think those conditions apply to what you're seeing, but it is
worth a shot on top of the previous proposal to use byte units instead
of sectors.
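
For reference, the reasoning behind byte units can be sketched in
userspace. This is not dm-io code: drain_in_sectors() and drain_in_bytes()
are hypothetical helpers that model a loop consuming a buffer in fixed
chunks, and to_sector() here only mirrors the usual shift by SECTOR_SHIFT.
The assumption is that a chunk may be smaller than one 512-byte sector.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Bytes to whole sectors, rounding down. */
static uint64_t to_sector(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/*
 * Consume total_bytes in chunks of chunk_bytes while counting progress in
 * sectors. Returns the number of steps, or -1 when a step makes no
 * progress (the point where an unguarded loop would spin forever).
 */
static int drain_in_sectors(uint64_t total_bytes, uint64_t chunk_bytes)
{
	uint64_t remaining = to_sector(total_bytes);
	int steps = 0;

	while (remaining) {
		uint64_t done = to_sector(chunk_bytes);

		if (!done)
			return -1;	/* sub-sector chunk rounds down to 0 */
		remaining -= done < remaining ? done : remaining;
		steps++;
	}
	return steps;
}

/* Same loop, but counting progress in bytes. */
static int drain_in_bytes(uint64_t total_bytes, uint64_t chunk_bytes)
{
	uint64_t remaining = total_bytes;
	int steps = 0;

	while (remaining) {
		uint64_t done = chunk_bytes < remaining ? chunk_bytes : remaining;

		if (!done)
			return -1;	/* a zero-length chunk cannot progress */
		remaining -= done;
		steps++;
	}
	return steps;
}
```

With 4096 bytes and 256-byte chunks, the sector-based loop stalls on the
first step while the byte-based loop finishes in 16. With 512-byte chunks
both finish in 8 steps.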

In the meantime, since I so far can't reproduce this after including my
previous proposal, I may have to request trying out a debug patch to get
some more visibility on what's happening if that's okay.


* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Matthew Wilcox @ 2026-06-15 18:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara, linux-kernel, linux-block, linux-efi,
	linux-fsdevel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
In-Reply-To: <03be57ae-0e41-4b8a-adc5-bdd85ccce951@intel.com>

On Mon, Jun 15, 2026 at 09:46:41AM -0700, Dave Hansen wrote:
> There are a lot of ways to do this. I'm just not a super big fan of the
> current proposal.
> 
> So, boiling it down:
> 
> 1. Should more than one UUID be supported per kernel build?
> 2. Should the UUIDs be defined in arch code or generic code?
> 3. Kconfig or #ifdefs?

Further questions ... why do this in the kernel?  Seems perfectly
suited to be in initramfs where we can throw away the code after boot.


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Dr. David Alan Gilbert @ 2026-06-15 17:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, dm-devel
In-Reply-To: <ajA0L-u-r4nhbpfl@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Mon, Jun 15, 2026 at 04:37:39PM +0000, Dr. David Alan Gilbert wrote:
> > Hi Keith,
> >   Thanks for the patch, alas it doesn't seem to be helping here;
> >  the first warn is still the same
> > and it still hangs the test process hard and eventually BUGs at
> > 
> > void blk_mq_end_request(struct request *rq, blk_status_t error)
> > {
> >         if (blk_update_request(rq, error, blk_rq_bytes(rq)))
> >                 BUG();
> 
> Oh, that was not expected.
> 
> What is the dma alignment requirement of your backing devices? You can
> find the attribute for sda at /sys/block/sda/queue/dma_alignment. I'm
> expecting 511, but just want to double check.

Yeh looks like it:
/sys/block/sda/queue/dma_alignment:511
/sys/block/sdb/queue/dma_alignment:511

all of the lvm also looks like it is.

Dave

-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 17:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ajAqUwo3eTm1c2_J@gallifrey>

On Mon, Jun 15, 2026 at 04:37:39PM +0000, Dr. David Alan Gilbert wrote:
> Hi Keith,
>   Thanks for the patch, alas it doesn't seem to be helping here;
>  the first warn is still the same
> and it still hangs the test process hard and eventually BUGs at
> 
> void blk_mq_end_request(struct request *rq, blk_status_t error)
> {
>         if (blk_update_request(rq, error, blk_rq_bytes(rq)))
>                 BUG();

Oh, that was not expected.

What is the dma alignment requirement of your backing devices? You can
find the attribute for sda at /sys/block/sda/queue/dma_alignment. I'm
expecting 511, but just want to double check.


* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Al Viro @ 2026-06-15 17:04 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Jens Axboe, Davidlohr Bueso, Christian Brauner, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:

> Tested with GRUB, which implements the LoaderDevicePartUUID EFI variable
> in its bli module [3]. With this, I was able to boot a kernel with a
> completely empty cmdline and no initrd.
> 
> [1] The Discoverable Partitions Specification (DPS)
> Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
> 
> [2] systemd-gpt-auto-generator
> Link: https://www.freedesktop.org/software/systemd/man/latest/systemd-gpt-auto-generator.html
> 
> [3] GRUB -- §16.2 bli
> Link: https://www.gnu.org/software/grub/manual/grub/html_node/bli_005fmodule.html

So what does that thing, tied to EFI as it is, have to do with architectures where
	* firmware is rather unlike EFI
	* firmware wouldn't know what to do with GPT
	* GRUB is *not* ported to, let alone used
such as, say it, the very first one mentioned at your [1]?

Or is that conditional upon "if anyone wants to design replacement firmware
for those, and if they agree to follow our wishlist"?


* Re: [PATCH 12/19] x86: define DPS root partition type UUIDs
From: Dave Hansen @ 2026-06-15 16:46 UTC (permalink / raw)
  To: Vincent Mailhol, Jens Axboe, Davidlohr Bueso, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
In-Reply-To: <20260615-discoverable-root_partitions-v1-12-39c78fac42e2@kernel.org>

On 6/15/26 09:09, Vincent Mailhol wrote:
> +#ifdef CONFIG_X86_64
> +#define DPS_ROOT_PARTITION_TYPE_UUID "4f68bce3-e8cd-4db1-96e7-fbcaf984b709"
> +#else
> +#define DPS_ROOT_PARTITION_TYPE_UUID "44479540-f297-41b2-9af7-d131d5f0458a"
> +#endif

This doesn't make a whole lot of sense to me. 64-bit kernels can run
32-bit userspace just fine.

But this #ifdef as proposed means that only a 32-bit *OR* 64-bit kernel
can auto-discover a given partition.

I kinda think you should just have an array of strings for these things,
maybe glued together with some preprocessor magic. Logically something
like this:

const char* const uuids[] = {
#ifdef CONFIG_ARM64
	"b921b045-1df0-41c3-af44-4c6f280d3fae",
#endif
#ifdef CONFIG_X86_64
	"4f68bce3-e8cd-4db1-96e7-fbcaf984b709",
#endif
#if defined(CONFIG_X86) && defined(CONFIG_COMPAT_32)
	"44479540-f297-41b2-9af7-d131d5f0458a",
#endif
...
};

... and then search the array. I honestly don't think you need to
sprinkle UUIDs all over the architectures.

It could probably also be done almost entirely in Kconfig. This could be
in, say block/partitions/Kconfig, or arch/*/Kconfig:

config DPS_ROOT_PARTITION_TYPE_UUID_1
	string
	default "4f68bce3-e8cd-4db1-96e7-fbcaf984b709" if X86_64
	default "b921b045-1df0-41c3-af44-4c6f280d3fae" if ARM64
	...

config DPS_ROOT_PARTITION_TYPE_UUID_2
	string
	default "44479540-f297-41b2-9af7-..." if X86 && COMPAT_32

const char* const uuids[] = {
#ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1
	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_1,
#endif
#ifdef CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2
	CONFIG_DPS_ROOT_PARTITION_TYPE_UUID_2,
#endif
...
};

There are a lot of ways to do this. I'm just not a super big fan of the
current proposal.

So, boiling it down:

1. Should more than one UUID be supported per kernel build?
2. Should the UUIDs be defined in arch code or generic code?
3. Kconfig or #ifdefs?



* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Dr. David Alan Gilbert @ 2026-06-15 16:37 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, dm-devel
In-Reply-To: <ajAb0m9cNraQn2Pw@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Mon, Jun 15, 2026 at 09:20:23AM -0600, Keith Busch wrote:
> > On Sun, Jun 14, 2026 at 05:57:48PM +0000, Dr. David Alan Gilbert wrote:
> > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > > Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync.
> > > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not tainted 7.1.0-rc7+ #786 PREEMPT(lazy) 
> > > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > > Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250
> > > Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c
> > > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246
> > > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 RCX: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 R09: ffffd1fb8176fc40
> > > Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 R12: 0000000000000000
> > > Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 R15: ffff8d1a8eb28b00
> > > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > > Jun 14 18:08:32 dalek kernel: Call Trace:
> > > Jun 14 18:08:32 dalek kernel:  <TASK>
> > > Jun 14 18:08:32 dalek kernel:  do_region+0x227/0x2a0
> > 
> > I think the problem is that do_region is tracking the "remaining" in
> > sector granularity, but devices can have dma alignment such that it's
> > valid to have sub-sector vectors. Rounding the length appended
> > to_sectors() creates a 0 length subtraction, so the loop thinks no
> > progress is made and loops forever. If we track it in bytes instead of
> > sectors, then that should fix this observation.
> 
> I recreated your observation and this patch below appears to fix the
> stuck behavior.

Hi Keith,
  Thanks for the patch; alas, it doesn't seem to be helping here. The
first warning is still the same, and it still hangs the test process hard
and eventually BUGs at:

void blk_mq_end_request(struct request *rq, blk_status_t error)
{
        if (blk_update_request(rq, error, blk_rq_bytes(rq)))
                BUG();

Dave

> ---
> diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
> index 1db565b376200..d72b9331c2fd1 100644
> --- a/drivers/md/dm-io.c
> +++ b/drivers/md/dm-io.c
> @@ -362,19 +362,26 @@ static void do_region(const blk_opf_t opf, unsigned int region,
>                         bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
>                         remaining -= num_sectors;
>                 } else {
> -                       while (remaining) {
> +                       unsigned long byte_remaining = to_bytes(remaining);
> +
> +                       while (byte_remaining) {
>                                 /*
>                                  * Try and add as many pages as possible.
>                                  */
>                                 dp->get_page(dp, &page, &len, &offset);
> -                               len = min(len, to_bytes(remaining));
> +                               len = min(len, byte_remaining);
>                                 if (!bio_add_page(bio, page, len, offset))
>                                         break;
> 
>                                 offset = 0;
> -                               remaining -= to_sector(len);
> +                               byte_remaining -= len;
>                                 dp->next_page(dp);
>                         }
> +                       remaining = to_sector(byte_remaining);
>                 }
> 
>                 atomic_inc(&io->count);
> --
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
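
Keith's hypothesis above (progress tracked in sectors while the appended
length can be smaller than a sector) can be illustrated with a small
userspace sketch. It assumes 512-byte sectors (SECTOR_SHIFT of 9); the
helper names bytes_to_sectors(), consume_in_sectors() and
consume_in_bytes() are hypothetical and are not the dm-io code. It shows
only the arithmetic of the suspected loop, and, as reported in this
message, the byte-based patch did not resolve the hang on the reporter's
system.

```c
#include <assert.h>

#define SECTOR_SHIFT 9 /* assumption: 512-byte sectors */

/* Truncating conversion: any length below 512 bytes becomes 0 sectors. */
static unsigned long bytes_to_sectors(unsigned long bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/*
 * Sector-granular accounting. Returns the number of chunks consumed, or
 * -1 as soon as one iteration makes no progress (a loop without this
 * check would spin forever at that point).
 */
static long consume_in_sectors(unsigned long total_bytes, unsigned long chunk)
{
	unsigned long remaining = bytes_to_sectors(total_bytes);
	long steps = 0;

	assert(chunk > 0);
	while (remaining) {
		unsigned long max = remaining << SECTOR_SHIFT;
		unsigned long len = chunk < max ? chunk : max;
		unsigned long done = bytes_to_sectors(len);

		if (done == 0)
			return -1; /* sub-sector chunk rounds down to 0 */
		remaining -= done;
		steps++;
	}
	return steps;
}

/* Byte-granular accounting: every non-empty chunk makes progress. */
static long consume_in_bytes(unsigned long total_bytes, unsigned long chunk)
{
	unsigned long remaining = total_bytes;
	long steps = 0;

	assert(chunk > 0);
	while (remaining) {
		unsigned long len = chunk < remaining ? chunk : remaining;

		remaining -= len;
		steps++;
	}
	return steps;
}
```

With 4096 bytes to transfer, 512-byte chunks finish in 8 steps under
either scheme, while 256-byte chunks stall the sector-based counter and
finish in 16 steps with the byte-based one.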


* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Yu Kuai @ 2026-06-15 16:16 UTC (permalink / raw)
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260615115556.1225472-1-wozizhi@huaweicloud.com>

Hi,

On 2026/6/15 19:55, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
>
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
>
>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>    Call Trace:
>    ...
>    blkcg_deactivate_policy+0x244/0x4d0
>    ioc_rqos_exit+0x44/0xe0
>    rq_qos_exit+0xba/0x120
>    __del_gendisk+0x50b/0x800
>    del_gendisk+0xff/0x190
>    ...
>
> [CAUSE]
> process1						process2
> cgroup_rmdir
> ...
>    css_killed_work_fn
>      offline_css
>      ...
>        blkcg_destroy_blkgs
>        ...
>          __blkg_release
> 	  css_put(&blkg->blkcg->css)
>            blkg_free
> 	    INIT_WORK(xxx, blkg_free_workfn)
> 	    schedule_work
>      css_put
>      ...
>        blkcg_css_free
>          kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
>                blkg_free_workfn
> 							__del_gendisk
> 							  rq_qos_exit
> 							    ioc_rqos_exit
> 							      blkcg_deactivate_policy
> 							        mutex_lock(&q->blkcg_mutex)
> 								spin_lock_irq(&q->queue_lock)
> 							        list_for_each_entry(blkg, xxx)
> 								  blkcg = blkg->blkcg
> 								  spin_lock(&blkcg->lock)-------UAF!!!
> 	        mutex_lock(&q->blkcg_mutex)
> 	        spin_lock_irq(&q->queue_lock)
> 	        /* Only then is the blkg removed from the list */
> 	        list_del_init(&blkg->q_node)
>
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
>
> [Fix]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
>
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc()  <-> blkg_free()
> blkg_create() <-> blkg_destroy()
>
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> ---
> v2:
>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>     css reference follows the blkg's own lifetime, making the put in
>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>
> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>
>   block/blk-cgroup.c | 24 ++++++++++++------------
>   1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..27414c291e49 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>   	if (blkg->parent)
>   		blkg_put(blkg->parent);
>   	spin_lock_irq(&q->queue_lock);
>   	list_del_init(&blkg->q_node);
>   	spin_unlock_irq(&q->queue_lock);
> +	/*
> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
> +	 */
> +	css_put(&blkg->blkcg->css);
>   	mutex_unlock(&q->blkcg_mutex);

Please move css_put() after mutex_unlock(), unless there is a strong
reason to keep it under the mutex.

With the above change, feel free to add:

Reviewed-by: Yu Kuai <yukuai@fygo.io>

>   
>   	blk_put_queue(q);
>   	free_percpu(blkg->iostat_cpu);
>   	percpu_ref_exit(&blkg->refcnt);
> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>   	 * blkg_stat_lock is for serializing blkg stat update
>   	 */
>   	for_each_possible_cpu(cpu)
>   		__blkcg_rstat_flush(blkcg, cpu);
>   
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
>   	blkg_free(blkg);
>   }
>   
>   /*
>    * A group is RCU protected, but having an rcu lock does not mean that one
> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>   	if (!blkg->iostat_cpu)
>   		goto out_exit_refcnt;
>   	if (!blk_get_queue(disk->queue))
>   		goto out_free_iostat;
> +	/* blkg holds a reference to blkcg */
> +	if (!css_tryget_online(&blkcg->css))
> +		goto out_put_queue;
>   
>   	blkg->q = disk->queue;
>   	INIT_LIST_HEAD(&blkg->q_node);
>   	blkg->blkcg = blkcg;
>   	blkg->iostat.blkg = blkg;
> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>   
>   out_free_pds:
>   	while (--i >= 0)
>   		if (blkg->pd[i])
>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> +	css_put(&blkcg->css);
> +out_put_queue:
>   	blk_put_queue(disk->queue);
>   out_free_iostat:
>   	free_percpu(blkg->iostat_cpu);
>   out_exit_refcnt:
>   	percpu_ref_exit(&blkg->refcnt);
> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>   	if (blk_queue_dying(disk->queue)) {
>   		ret = -ENODEV;
>   		goto err_free_blkg;
>   	}
>   
> -	/* blkg holds a reference to blkcg */
> -	if (!css_tryget_online(&blkcg->css)) {
> -		ret = -ENODEV;
> -		goto err_free_blkg;
> -	}
> -
>   	/* allocate */
>   	if (!new_blkg) {
>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>   		if (unlikely(!new_blkg)) {
>   			ret = -ENOMEM;
> -			goto err_put_css;
> +			goto err_free_blkg;
>   		}
>   	}
>   	blkg = new_blkg;
>   
>   	/* link parent */
>   	if (blkcg_parent(blkcg)) {
>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>   		if (WARN_ON_ONCE(!blkg->parent)) {
>   			ret = -ENODEV;
> -			goto err_put_css;
> +			goto err_free_blkg;
>   		}
>   		blkg_get(blkg->parent);
>   	}
>   
>   	/* invoke per-policy init */
> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>   
>   	/* @blkg failed fully initialized, use the usual release path */
>   	blkg_put(blkg);
>   	return ERR_PTR(ret);
>   
> -err_put_css:
> -	css_put(&blkcg->css);
>   err_free_blkg:
>   	if (new_blkg)
>   		blkg_free(new_blkg);
>   	return ERR_PTR(ret);
>   }

-- 
Thanks,
Kuai


* [PATCH 19/19] docs: document discoverable root partitions
From: Vincent Mailhol @ 2026-06-15 16:09 UTC (permalink / raw)
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

Document the automatic root block device discovery feature.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 Documentation/admin-guide/discoverable-root.rst | 33 +++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |  1 +
 Documentation/admin-guide/kernel-parameters.txt |  5 ++++
 3 files changed, 39 insertions(+)

diff --git a/Documentation/admin-guide/discoverable-root.rst b/Documentation/admin-guide/discoverable-root.rst
new file mode 100644
index 000000000000..9645bf39e405
--- /dev/null
+++ b/Documentation/admin-guide/discoverable-root.rst
@@ -0,0 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _discoverable_root:
+
+Discoverable root partitions
+============================
+
+On EFI systems using a supported architecture, the kernel can discover the root
+block device from GPT partition type UUID metadata on the disk containing the
+active EFI System Partition.
+
+This follows the `Discoverable Partitions Specification`_ which defines a list
+of architecture-specific root partition type UUIDs.
+
+Specifying ``root=`` on the kernel command line takes precedence and entirely
+disables this automatic root partition discovery.
+
+The disk to search is identified by the Boot Loader Interface
+``LoaderDevicePartUUID`` EFI variable. If multiple partitions on that disk match
+the architecture root partition type UUID, the kernel selects the first match in
+block device enumeration order. Systems should not expose multiple eligible root
+partitions unless that ordering is intended.
+
+Partitions marked with the DPS ``no-auto`` GPT attribute are skipped. This
+allows a partition with an otherwise discoverable type UUID to opt out from
+automatic discovery.
+
+The DPS read-only attribute is not enforced by kernel root discovery. The
+root filesystem is mounted read-only by default unless ``rw`` is specified,
+and user space remains responsible for later remount policy.
+
+.. _Discoverable Partitions Specification:
+   https://uapi-group.org/specifications/specs/discoverable_partitions_specification/
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index cd28dfe91b06..0d9c2796ae09 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -50,6 +50,7 @@ Booting the kernel
 
    bootconfig
    kernel-parameters
+   discoverable-root
    efi-stub
    initrd
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f68bf1cdb53b..c9bfa010883c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6696,6 +6696,11 @@ Kernel parameters
 			ramdisk, "nfs" and "cifs" for root on a network file
 			system, or "mtd" and "ubi" for mounting from raw flash.
 
+			If this option is omitted, the kernel may try to
+			discover the root block device from the GPT partition
+			type UUID metadata when additional requirements are met.
+			See Documentation/admin-guide/discoverable-root.rst.
+
 	rootdelay=	[KNL] Delay (in seconds) to pause before attempting to
 			mount the root filesystem
 

-- 
2.53.0
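
The selection policy documented in this patch (first partition with the
architecture's root type UUID, skipping any marked no-auto) can be
sketched in userspace C as follows. This is an illustration under stated
assumptions, not the kernel implementation: struct part and pick_root()
are hypothetical names, type UUIDs are assumed to be normalized to
lowercase strings, and the array order stands in for block device
enumeration order. The no-auto flag is bit 63 of the GPT partition
attributes, as defined by the Discoverable Partitions Specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GPT_ATTR_NO_AUTO (1ULL << 63) /* DPS "no-auto" attribute bit */

/* Hypothetical, simplified view of one GPT partition entry. */
struct part {
	const char *type_uuid; /* lowercase partition type UUID string */
	uint64_t attrs;        /* GPT attribute bits */
};

/*
 * Return the index of the first eligible root partition, or -1 if none.
 * A partition is eligible when its type UUID matches and it is not
 * marked no-auto.
 */
static int pick_root(const struct part *parts, int n,
		     const char *root_type_uuid)
{
	assert(parts != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		if (strcmp(parts[i].type_uuid, root_type_uuid) != 0)
			continue; /* not a root partition for this arch */
		if (parts[i].attrs & GPT_ATTR_NO_AUTO)
			continue; /* opted out of automatic discovery */
		return i;
	}
	return -1;
}
```

Because the first match wins, two eligible partitions on the same disk
are resolved purely by their order, which is why the document advises
against exposing more than one unless that ordering is intended.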



* [PATCH 18/19] init: discover root by DPS partition type UUID
From: Vincent Mailhol @ 2026-06-15 16:09 UTC (permalink / raw)
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

When the root= cmdline option is absent, try to discover the root block
device using the architecture's DPS root partition type UUID.

DPS limits root discovery to the disk containing the active EFI System
Partition. Read LoaderDevicePartUUID from the Boot Loader Interface and
pass it to early_lookup_bdev_by_type_uuid() so the block lookup only
considers partitions on that disk.

Print a dedicated wait message while waiting for a discoverable root
partition and emit an informational message when discovery succeeds.

If LoaderDevicePartUUID cannot be read or does not contain a valid UUID,
clear root_wait so the kernel does not keep retrying a discovery path
that cannot succeed.

Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 init/do_mounts.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 2 deletions(-)

diff --git a/init/do_mounts.c b/init/do_mounts.c
index 5fb5aeb88da9..20c176945b32 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -10,6 +10,7 @@
 #include <linux/delay.h>
 #include <linux/mount.h>
 #include <linux/device.h>
+#include <linux/efi.h>
 #include <linux/init.h>
 #include <linux/fs.h>
 #include <linux/initrd.h>
@@ -19,6 +20,8 @@
 #include <linux/ramfs.h>
 #include <linux/shmem_fs.h>
 #include <linux/ktime.h>
+#include <linux/ucs2_string.h>
+#include <linux/uuid.h>
 
 #include <linux/nfs_fs.h>
 #include <linux/nfs_fs_sb.h>
@@ -402,9 +405,86 @@ void __init mount_root(char *root_device_name)
 	}
 }
 
+#ifdef CONFIG_DPS_ROOT_AUTO_DISCOVERY
+static char efi_partuuid[EFI_VARIABLE_GUID_LEN + 1] __initdata;
+
+static int __init efi_loader_get_device_part_uuid(char *efi_uuid, size_t size)
+{
+	efi_char16_t efi_uuid_ucs2[EFI_VARIABLE_GUID_LEN + 1] = {};
+	unsigned long efi_uuid_ucs2_size = sizeof(efi_uuid_ucs2);
+	efi_status_t status;
+
+	if (!efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE))
+		return -EOPNOTSUPP;
+
+	status = efi.get_variable(L"LoaderDevicePartUUID",
+				  &LINUX_EFI_LOADER_ENTRY_GUID, NULL,
+				  &efi_uuid_ucs2_size, efi_uuid_ucs2);
+	if (status != EFI_SUCCESS)
+		return efi_status_to_err(status);
+
+	if (ucs2_as_utf8((u8 *)efi_uuid, efi_uuid_ucs2, size) != UUID_STRING_LEN ||
+	    !uuid_is_valid(efi_uuid))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int __init lookup_dps_root(dev_t *dev)
+{
+	static const char dps_root_partition_type_uuid[] __initconst =
+		DPS_ROOT_PARTITION_TYPE_UUID;
+	int err;
+
+	err = early_lookup_bdev_by_type_uuid(dps_root_partition_type_uuid,
+					     efi_partuuid, dev);
+	if (!err)
+		pr_info("VFS: Discovered root partition with GPT type UUID %s\n",
+			dps_root_partition_type_uuid);
+
+	return err;
+}
+
+static dev_t __init try_dps_root_discovery(void)
+{
+	dev_t dev;
+	int err;
+
+	err = efi_loader_get_device_part_uuid(efi_partuuid,
+					      sizeof(efi_partuuid));
+	if (err) {
+		pr_err("VFS: Unable to get LoaderDevicePartUUID EFI variable: %pe, skipping root partition discovery\n",
+			ERR_PTR(err));
+		if (root_wait) {
+			pr_err("Disabling rootwait\n");
+			root_wait = 0;
+		}
+		return 0;
+	}
+
+	if (!lookup_dps_root(&dev))
+		return dev;
+
+	return 0;
+}
+#else
+static int __init lookup_dps_root(dev_t *dev)
+{
+	return 0;
+}
+
+static dev_t __init try_dps_root_discovery(void)
+{
+	return 0;
+}
+#endif
+
 static int __init lookup_root_device(char *root_device_name)
 {
-	return early_lookup_bdev(root_device_name, &ROOT_DEV);
+	if (root_device_name[0])
+		return early_lookup_bdev(root_device_name, &ROOT_DEV);
+	else
+		return lookup_dps_root(&ROOT_DEV);
 }
 
 /* wait for any asynchronous scanning to complete */
@@ -415,7 +495,10 @@ static void __init wait_for_root(char *root_device_name)
 	if (ROOT_DEV != 0)
 		return;
 
-	pr_info("Waiting for root device %s...\n", root_device_name);
+	if (root_device_name[0])
+		pr_info("Waiting for root device %s...\n", root_device_name);
+	else if (IS_ENABLED(CONFIG_DPS_ROOT_AUTO_DISCOVERY))
+		pr_info("Waiting for discoverable root partition...\n");
 
 	end = ktime_add_ms(ktime_get_raw(), root_wait);
 
@@ -480,6 +563,8 @@ void __init prepare_namespace(void)
 
 	if (saved_root_name[0])
 		ROOT_DEV = parse_root_device(saved_root_name);
+	else
+		ROOT_DEV = try_dps_root_discovery();
 
 	initrd_load();
 

-- 
2.53.0
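
The validation step in efi_loader_get_device_part_uuid() (the converted
string must be exactly UUID_STRING_LEN characters and pass
uuid_is_valid()) can be approximated in userspace as below. This is a
sketch, not the kernel helpers: uuid_string_is_valid() is a hypothetical
name that folds the length check and the format check into one function,
assuming the canonical 8-4-4-4-12 hexadecimal layout.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define UUID_STRING_LEN 36 /* "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" */

/*
 * Accept only a 36-character string with hyphens at offsets 8, 13, 18
 * and 23 and hexadecimal digits everywhere else.
 */
static bool uuid_string_is_valid(const char *s)
{
	assert(s != NULL);
	if (strlen(s) != UUID_STRING_LEN)
		return false;

	for (size_t i = 0; i < UUID_STRING_LEN; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return false;
		} else if (!isxdigit((unsigned char)s[i])) {
			return false;
		}
	}
	return true;
}
```

Rejecting a malformed value early matches the intent described in the
commit message: a LoaderDevicePartUUID that cannot be parsed makes the
discovery path unable to succeed, so there is no point in waiting on it.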



* [PATCH 17/19] init: factor out root device lookup into lookup_root_device()
From: Vincent Mailhol @ 2026-06-15 16:09 UTC (permalink / raw)
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS root detection will also need to work if root_wait is set, meaning
that wait_for_root() needs to handle the DPS logic.

Move early_lookup_bdev() out of wait_for_root() into the new
lookup_root_device() so later changes can extend the lookup policy
without duplicating the retry logic.

Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 init/do_mounts.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/init/do_mounts.c b/init/do_mounts.c
index 95e0b3a0f711..5fb5aeb88da9 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -402,6 +402,11 @@ void __init mount_root(char *root_device_name)
 	}
 }
 
+static int __init lookup_root_device(char *root_device_name)
+{
+	return early_lookup_bdev(root_device_name, &ROOT_DEV);
+}
+
 /* wait for any asynchronous scanning to complete */
 static void __init wait_for_root(char *root_device_name)
 {
@@ -415,7 +420,7 @@ static void __init wait_for_root(char *root_device_name)
 	end = ktime_add_ms(ktime_get_raw(), root_wait);
 
 	while (!driver_probe_done() ||
-	       early_lookup_bdev(root_device_name, &ROOT_DEV) < 0) {
+	       lookup_root_device(root_device_name) < 0) {
 		msleep(5);
 		if (root_wait > 0 && ktime_after(ktime_get_raw(), end))
 			break;

-- 
2.53.0
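
The motivation for this refactoring (one retry loop, with the lookup
itself swappable) can be sketched in userspace C. Everything here is
hypothetical and for illustration only: lookup_fn, wait_for_device() and
flaky_lookup() are invented names, and a bounded attempt count stands in
for the kernel's msleep(5) polling with an optional root_wait deadline.

```c
#include <assert.h>
#include <stddef.h>

/* A lookup returns 0 and fills *dev on success, negative on failure. */
typedef int (*lookup_fn)(const char *name, unsigned int *dev);

/*
 * Retry a lookup up to max_attempts times. The loop does not care how
 * the lookup works, so the lookup policy can change without touching
 * the retry logic.
 */
static int wait_for_device(lookup_fn lookup, const char *name,
			   unsigned int *dev, int max_attempts)
{
	assert(lookup != NULL && dev != NULL);
	for (int attempt = 0; attempt < max_attempts; attempt++) {
		if (lookup(name, dev) == 0)
			return 0;
		/* the kernel loop sleeps between attempts; omitted here */
	}
	return -1;
}

/* Hypothetical lookup that fails twice before the device "appears". */
static int flaky_calls;

static int flaky_lookup(const char *name, unsigned int *dev)
{
	(void)name;
	if (++flaky_calls < 3)
		return -1;
	*dev = 0x0801; /* arbitrary device number for the example */
	return 0;
}
```

Passing a different lookup_fn is the userspace analogue of extending
lookup_root_device() in a later patch while wait_for_root() stays
unchanged.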



