Linux block layer


* [PATCH 05/19] arm64: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 16:09 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Catalin Marinas, Will Deacon, linux-arm-kernel
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS [1] assigns GPT partition type UUIDs to operating system partitions.
Root partitions use architecture-specific type UUIDs so the OS can
discover the intended root filesystem without relying on a root= cmdline
option.

Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for arm64 and select
ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 arch/arm64/Kconfig                | 1 +
 arch/arm64/include/asm/dps_root.h | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..190f8dde63b2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -26,6 +26,7 @@ config ARM64
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_OPS if XEN
 	select ARCH_HAS_DMA_PREP_COHERENT
+	select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
 	select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
 	select ARCH_HAS_FAST_MULTIPLIER
 	select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm64/include/asm/dps_root.h b/arch/arm64/include/asm/dps_root.h
new file mode 100644
index 000000000000..7344f9a52343
--- /dev/null
+++ b/arch/arm64/include/asm/dps_root.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_DPS_ROOT_H
+#define _ASM_ARM64_DPS_ROOT_H
+
+#define DPS_ROOT_PARTITION_TYPE_UUID "b921b045-1df0-41c3-af44-4c6f280d3fae"
+
+#endif /* _ASM_ARM64_DPS_ROOT_H */

-- 
2.53.0



* [PATCH 04/19] arm: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 16:09 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Russell King, linux-arm-kernel
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS [1] assigns GPT partition type UUIDs to operating system partitions.
Root partitions use architecture-specific type UUIDs so the OS can
discover the intended root filesystem without relying on a root= cmdline
option.

Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for ARM and select
ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 arch/arm/Kconfig                | 1 +
 arch/arm/include/asm/dps_root.h | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 73e6647bea46..deedb5d808fb 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -14,6 +14,7 @@ config ARM
 	select ARCH_HAS_DMA_ALLOC if MMU
 	select ARCH_HAS_DMA_OPS
 	select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
+	select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_KEEPINITRD
diff --git a/arch/arm/include/asm/dps_root.h b/arch/arm/include/asm/dps_root.h
new file mode 100644
index 000000000000..e9f0f24bcac2
--- /dev/null
+++ b/arch/arm/include/asm/dps_root.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM_DPS_ROOT_H
+#define _ASM_ARM_DPS_ROOT_H
+
+#define DPS_ROOT_PARTITION_TYPE_UUID "69dad710-2ce4-4e3c-b16c-21a1d49abed3"
+
+#endif /* _ASM_ARM_DPS_ROOT_H */

-- 
2.53.0



* [PATCH 03/19] arc: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 16:08 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Vineet Gupta, linux-snps-arc
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS [1] assigns GPT partition type UUIDs to operating system partitions.
Root partitions use architecture-specific type UUIDs so the OS can
discover the intended root filesystem without relying on a root= cmdline
option.

Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for ARC and select
ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

Cc: Vineet Gupta <vgupta@kernel.org>
Cc: linux-snps-arc@lists.infradead.org
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 arch/arc/Kconfig                | 1 +
 arch/arc/include/asm/dps_root.h | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 2ed7186c81c5..cc3a57a0111f 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -10,6 +10,7 @@ config ARC
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
+	select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/arc/include/asm/dps_root.h b/arch/arc/include/asm/dps_root.h
new file mode 100644
index 000000000000..c9db3ddf1a53
--- /dev/null
+++ b/arch/arc/include/asm/dps_root.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARC_DPS_ROOT_H
+#define _ASM_ARC_DPS_ROOT_H
+
+#define DPS_ROOT_PARTITION_TYPE_UUID "d27f46ed-2919-4cb8-bd25-9531f3c16534"
+
+#endif /* _ASM_ARC_DPS_ROOT_H */

-- 
2.53.0



* [PATCH 02/19] alpha: define DPS root partition type UUID
From: Vincent Mailhol @ 2026-06-15 16:08 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Richard Henderson, Matt Turner, Magnus Lindholm,
	linux-alpha
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS [1] assigns GPT partition type UUIDs to operating system partitions.
Root partitions use architecture-specific type UUIDs so the OS can
discover the intended root filesystem without relying on a root= cmdline
option.

Define DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h for alpha and
select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: linux-alpha@vger.kernel.org
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 arch/alpha/Kconfig                | 1 +
 arch/alpha/include/asm/dps_root.h | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 7b7dafe7d9df..400cbb7525c8 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,6 +5,7 @@ config ALPHA
 	select ARCH_32BIT_USTAT_F_TINODE
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DMA_OPS if PCI
+	select ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select ARCH_MODULE_NEEDS_WEAK_PER_CPU if SMP
diff --git a/arch/alpha/include/asm/dps_root.h b/arch/alpha/include/asm/dps_root.h
new file mode 100644
index 000000000000..7f70a83f72de
--- /dev/null
+++ b/arch/alpha/include/asm/dps_root.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ALPHA_DPS_ROOT_H
+#define _ASM_ALPHA_DPS_ROOT_H
+
+#define DPS_ROOT_PARTITION_TYPE_UUID "6523f8ae-3eb1-4e2a-a05a-18b695ae656f"
+
+#endif /* _ASM_ALPHA_DPS_ROOT_H */

-- 
2.53.0



* [PATCH 01/19] init: add DPS root partition type UUID capability
From: Vincent Mailhol @ 2026-06-15 16:08 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

DPS [1] assigns native root partition type UUIDs per architecture.

Add the ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID config option so that
architectures can opt in.

Architectures that support this feature should define
DPS_ROOT_PARTITION_TYPE_UUID in asm/dps_root.h as a string
representation of the architecture's DPS root partition type UUID.

Add the hidden DPS_ROOT_AUTO_DISCOVERY symbol for the combination of
BLOCK, EFI and ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID, and use it to
expose DPS_ROOT_PARTITION_TYPE_UUID from the common linux/root_dev.h
header only when all prerequisites are met.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
 include/linux/root_dev.h | 6 ++++++
 init/Kconfig             | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/root_dev.h b/include/linux/root_dev.h
index 847c9a06101b..6b52a10b0bca 100644
--- a/include/linux/root_dev.h
+++ b/include/linux/root_dev.h
@@ -5,6 +5,12 @@
 #include <linux/major.h>
 #include <linux/types.h>
 #include <linux/kdev_t.h>
+#include <linux/uuid.h>
+
+#ifdef CONFIG_DPS_ROOT_AUTO_DISCOVERY
+#include <asm/dps_root.h>
+static_assert(sizeof(DPS_ROOT_PARTITION_TYPE_UUID) == UUID_STRING_LEN + 1);
+#endif
 
 enum {
 	Root_NFS = MKDEV(UNNAMED_MAJOR, 255),
diff --git a/init/Kconfig b/init/Kconfig
index 147da6370bf0..982c6ad9da4d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2287,6 +2287,12 @@ config ARCH_HAS_PREPARE_SYNC_CORE_CMD
 config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	bool
 
+config ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
+	bool
+
+config DPS_ROOT_AUTO_DISCOVERY
+	def_bool BLOCK && EFI && ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID
+
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a

-- 
2.53.0
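
The static_assert added to linux/root_dev.h in the patch above can be
illustrated outside the kernel. The following is only a user-space sketch,
not code from the series: UUID_STRING_LEN is redefined locally with the
value the kernel uses (36), the arm64 UUID string is copied from patch
05/19, and dps_root_uuid_strlen() is a hypothetical helper added for
illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>   /* strlen, size_t */

/* Local stand-in for the kernel's UUID_STRING_LEN: 32 hex digits + 4 hyphens. */
#define UUID_STRING_LEN 36

/* arm64 value as quoted in patch 05/19 of this series. */
#define DPS_ROOT_PARTITION_TYPE_UUID "b921b045-1df0-41c3-af44-4c6f280d3fae"

/*
 * sizeof on a string literal counts the trailing NUL, hence the "+ 1".
 * A truncated or over-long UUID string fails at compile time.
 */
static_assert(sizeof(DPS_ROOT_PARTITION_TYPE_UUID) == UUID_STRING_LEN + 1,
	      "DPS root type UUID must be a 36-character string");

/* Hypothetical helper: length of the UUID text, excluding the NUL. */
static size_t dps_root_uuid_strlen(void)
{
	return strlen(DPS_ROOT_PARTITION_TYPE_UUID);
}
```

The check only validates the length of the string; it does not verify that
the characters form a well-formed UUID.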



* [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Vincent Mailhol @ 2026-06-15 16:08 UTC
  To: Jens Axboe, Davidlohr Bueso, Alexander Viro, Christian Brauner,
	Jan Kara
  Cc: linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Vincent Mailhol, Richard Henderson, Matt Turner, Magnus Lindholm,
	linux-alpha, Vineet Gupta, linux-snps-arc, Russell King,
	linux-arm-kernel, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, loongarch, Thomas Bogendoerfer, linux-mips,
	James E.J. Bottomley, Helge Deller, linux-parisc,
	Madhavan Srinivasan, Michael Ellerman, linuxppc-dev,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, linux-s390,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Jonathan Corbet, Shuah Khan, linux-doc

DPS [1] defines GPT partition type UUIDs for OS partitions and
attributes that control whether such partitions should be
automatically discovered. The specification states that:

  The OS can discover and mount the necessary file systems with a
  non-existent or incomplete /etc/fstab file and without the root=
  kernel command line option.

DPS is already implemented in systemd-gpt-auto-generator [2], which,
when embedded in an initrd, indeed allows automatic detection of the
root filesystem through its partition type UUID.

This series adds this discovery feature directly into the kernel so
that people who are not using systemd or not using an initrd can still
benefit from it. The implementation follows the same model as
systemd-gpt-auto-generator:

  - GPT partition type UUIDs are used for automatic discovery policy
    only. No root=PARTTYPEUUID=xxx cmdline option or similar syntax is
    added.

  - The root= cmdline option takes precedence. This prevents unexpected
    behaviour.

  - Only the disk with the active EFI System Partition is scanned, as
    required by DPS. The disk is identified through the Boot Loader
    Interface LoaderDevicePartUUID EFI variable.

The DPS no-auto attribute is also implemented, giving another option for
the user to disable this auto discovery. However, the DPS read-only
attribute is intentionally not enforced. The kernel already mounts the
root filesystem read-only by default unless the command line requests
rw, and user space remains responsible for deciding whether a discovered
root should later be remounted read-write based on DPS metadata and
local policy. The other partition type UUIDs (home, swap, var...) are
also out of scope for the same reason: user space remains responsible
for mounting anything other than the root partition.
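
The policy described above can be sketched in plain C. This is a simplified
user-space model under stated assumptions, not the code from this series:
struct part, its on_esp_disk field, and dps_pick_root() are hypothetical
names, the first matching partition is assumed to win, and UUID strings are
assumed to be compared as lower-case text. The no-auto flag is taken to be
bit 63 of the GPT attributes, as listed in the DPS attribute table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* DPS no-auto flag: bit 63 of the GPT partition attributes. */
#define DPS_ATTR_NO_AUTO (UINT64_C(1) << 63)

/* Hypothetical, simplified view of one GPT partition entry. */
struct part {
	const char *type_uuid;  /* partition type UUID as a lower-case string */
	uint64_t attrs;         /* raw GPT attribute bits */
	bool on_esp_disk;       /* true if on the disk holding the booted ESP */
};

/*
 * Return the index of the partition to use as root, or -1 when automatic
 * discovery does not apply. root_arg is the value of root= on the command
 * line, or NULL when the option is omitted.
 */
static int dps_pick_root(const char *root_arg, const char *arch_root_uuid,
			 const struct part *parts, size_t n)
{
	if (root_arg)
		return -1;	/* an explicit root= takes precedence */

	for (size_t i = 0; i < n; i++) {
		if (!parts[i].on_esp_disk)
			continue;	/* only the ESP's disk is scanned */
		if (parts[i].attrs & DPS_ATTR_NO_AUTO)
			continue;	/* user opted this partition out */
		if (strcmp(parts[i].type_uuid, arch_root_uuid) == 0)
			return (int)i;	/* assumption: first match wins */
	}
	return -1;
}
```

The read-only attribute is deliberately absent from the sketch, matching
the statement above that it is not enforced.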

Patch 1 adds the ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID capability and
the hidden CONFIG_DPS_ROOT_AUTO_DISCOVERY Kconfig symbol used to signal
whether the feature is available. Patches 2 to 12 declare the
ARCH_HAS_DPS_ROOT_PARTITION_TYPE_UUID capability for the supported
architectures and define their architecture-specific root partition type
UUID values in asm/dps_root.h.

Patches 13 to 16 make the GPT partition type UUID and the no-auto
attribute available during early block lookup.

Patch 17 is a small code refactor that prepares for patch 18, which
updates the root mount path so that, when root= is omitted, the kernel
reads LoaderDevicePartUUID and uses the early block lookup
infrastructure to discover the DPS root partition on that disk.

Finally, patch 19 documents this automatic root discovery feature.

Tested with GRUB, which implements the LoaderDevicePartUUID EFI variable
in its bli module [3]. With this, I was able to boot a kernel with a
completely empty cmdline and no initrd.

[1] The Discoverable Partitions Specification (DPS)
Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

[2] systemd-gpt-auto-generator
Link: https://www.freedesktop.org/software/systemd/man/latest/systemd-gpt-auto-generator.html

[3] GRUB -- §16.2 bli
Link: https://www.gnu.org/software/grub/manual/grub/html_node/bli_005fmodule.html

Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
---
Vincent Mailhol (19):
      init: add DPS root partition type UUID capability
      alpha: define DPS root partition type UUID
      arc: define DPS root partition type UUID
      arm: define DPS root partition type UUID
      arm64: define DPS root partition type UUID
      loongarch: define DPS root partition type UUID
      mips: define DPS root partition type UUIDs
      parisc: define DPS root partition type UUID
      powerpc: define DPS root partition type UUIDs
      riscv: define DPS root partition type UUIDs
      s390: define DPS root partition type UUIDs
      x86: define DPS root partition type UUIDs
      block: store GPT partition type UUID
      block: add early_lookup_bdev_by_type_uuid()
      block: store GPT attributes as a raw value
      block: don't discover partition with DPS no-auto GPT attribute
      init: factor out root device lookup into lookup_root_device()
      init: discover root by DPS partition type UUID
      docs: document discoverable root partitions

 Documentation/admin-guide/discoverable-root.rst | 33 +++++++++
 Documentation/admin-guide/index.rst             |  1 +
 Documentation/admin-guide/kernel-parameters.txt |  5 ++
 arch/alpha/Kconfig                              |  1 +
 arch/alpha/include/asm/dps_root.h               |  8 +++
 arch/arc/Kconfig                                |  1 +
 arch/arc/include/asm/dps_root.h                 |  8 +++
 arch/arm/Kconfig                                |  1 +
 arch/arm/include/asm/dps_root.h                 |  8 +++
 arch/arm64/Kconfig                              |  1 +
 arch/arm64/include/asm/dps_root.h               |  8 +++
 arch/loongarch/Kconfig                          |  1 +
 arch/loongarch/include/asm/dps_root.h           |  8 +++
 arch/mips/Kconfig                               |  1 +
 arch/mips/include/asm/dps_root.h                | 20 ++++++
 arch/parisc/Kconfig                             |  1 +
 arch/parisc/include/asm/dps_root.h              |  8 +++
 arch/powerpc/Kconfig                            |  1 +
 arch/powerpc/include/asm/dps_root.h             | 16 +++++
 arch/riscv/Kconfig                              |  1 +
 arch/riscv/include/asm/dps_root.h               | 12 ++++
 arch/s390/Kconfig                               |  1 +
 arch/s390/include/asm/dps_root.h                | 12 ++++
 arch/x86/Kconfig                                |  1 +
 arch/x86/include/asm/dps_root.h                 | 12 ++++
 block/blk.h                                     |  1 +
 block/early-lookup.c                            | 68 +++++++++++++++++-
 block/partitions/core.c                         |  2 +
 block/partitions/efi.c                          |  3 +
 block/partitions/efi.h                          | 11 ++-
 include/linux/blk_types.h                       |  1 +
 include/linux/blkdev.h                          |  5 ++
 include/linux/root_dev.h                        |  6 ++
 init/Kconfig                                    |  6 ++
 init/do_mounts.c                                | 94 ++++++++++++++++++++++++-
 35 files changed, 355 insertions(+), 12 deletions(-)
---
base-commit: 36808d5e983985bbda87e01059cccc071fe3ec8d
change-id: 20260611-discoverable-root_partitions-bdacbada570d

Best regards,
-- 
Vincent Mailhol <mailhol@kernel.org>


* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 15:35 UTC
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ajAYN_mmjzYBAimV@kbusch-mbp>

On Mon, Jun 15, 2026 at 09:20:23AM -0600, Keith Busch wrote:
> On Sun, Jun 14, 2026 at 05:57:48PM +0000, Dr. David Alan Gilbert wrote:
> > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> > Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync.
> > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> > Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not tainted 7.1.0-rc7+ #786 PREEMPT(lazy) 
> > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> > Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250
> > Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c
> > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246
> > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 RCX: 0000000000000000
> > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d1a8eb28b00
> > Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 R09: ffffd1fb8176fc40
> > Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 R12: 0000000000000000
> > Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 R15: ffff8d1a8eb28b00
> > Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> > Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> > Jun 14 18:08:32 dalek kernel: Call Trace:
> > Jun 14 18:08:32 dalek kernel:  <TASK>
> > Jun 14 18:08:32 dalek kernel:  do_region+0x227/0x2a0
> 
> I think the problem is that do_region is tracking the "remaining" in
> sector granularity, but devices can have dma alignment such that it's
> valid to have sub-sector vectors. Rounding the appended length with
> to_sector() yields a 0 length subtraction, so the loop thinks no
> progress is made and loops forever. If we track it in bytes instead of
> sectors, then that should fix this observation.

I reproduced your observation, and the patch below appears to fix the
hang.

---
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 1db565b376200..d72b9331c2fd1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -362,19 +362,26 @@ static void do_region(const blk_opf_t opf, unsigned int region,
                        bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
                        remaining -= num_sectors;
                } else {
-                       while (remaining) {
+                       unsigned long byte_remaining = to_bytes(remaining);
+
+                       while (byte_remaining) {
                                /*
                                 * Try and add as many pages as possible.
                                 */
                                dp->get_page(dp, &page, &len, &offset);
-                               len = min(len, to_bytes(remaining));
+                               len = min(len, byte_remaining);
                                if (!bio_add_page(bio, page, len, offset))
                                        break;

                                offset = 0;
-                               remaining -= to_sector(len);
+                               byte_remaining -= len;
                                dp->next_page(dp);
                        }
+                       remaining = to_sector(byte_remaining);
                }

                atomic_inc(&io->count);
--
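
To make the arithmetic concrete, here is a small user-space model of the
two loop variants. It is only a sketch: drain_by_sectors() and
drain_by_bytes() are hypothetical names, each "page" is reduced to a fixed
chunk_len in bytes, bio_add_page() and the page iterator are left out, and
max_iter exists only so the stuck case can be observed instead of spinning
forever. to_sector() and to_bytes() are reimplemented with the same shift
the dm helpers use.

```c
#include <stddef.h>

#define SECTOR_SHIFT 9

/* Same arithmetic as dm's to_sector()/to_bytes() helpers. */
static unsigned long to_sector(unsigned long bytes)
{
	return bytes >> SECTOR_SHIFT;
}

static unsigned long to_bytes(unsigned long sectors)
{
	return sectors << SECTOR_SHIFT;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Old scheme: progress is tracked in sectors. Returns the number of
 * iterations needed, or -1 if max_iter passes without finishing.
 */
static long drain_by_sectors(unsigned long remaining, unsigned long chunk_len,
			     long max_iter)
{
	long iter = 0;

	while (remaining) {
		unsigned long len = min_ul(chunk_len, to_bytes(remaining));

		if (iter++ >= max_iter)
			return -1;
		remaining -= to_sector(len);	/* subtracts 0 when len < 512 */
	}
	return iter;
}

/* New scheme: progress is tracked in bytes, so any non-zero len counts. */
static long drain_by_bytes(unsigned long remaining, unsigned long chunk_len,
			   long max_iter)
{
	unsigned long byte_remaining = to_bytes(remaining);
	long iter = 0;

	while (byte_remaining) {
		unsigned long len = min_ul(chunk_len, byte_remaining);

		if (iter++ >= max_iter)
			return -1;
		byte_remaining -= len;
	}
	return iter;
}
```

With one sector remaining and 256-byte chunks, the sector-based model never
finishes, while the byte-based one completes in two iterations. With
sector-aligned chunks both behave the same.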


* [GIT PULL] Block updates for 7.2
From: Jens Axboe @ 2026-06-15 15:24 UTC
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org

Hi Linus,

Here are the block updates queued up for the 7.2 merge window. This
contains:

- NVMe pull request via Keith:
       - Per-controller admin and IO timeout sysfs attributes, and
         letting the block layer set request timeouts (Maurizio,
         Maximilian)
       - Multipath passthrough iostats, and PCI P2PDMA enablement for
         multipath devices (Keith, Kiran)
       - A new diag sysfs attribute group exporting per-controller
         counters (retries, multipath failover, error counters, requeue
         and failure counts, reset and reconnect events) (Nilay)
       - FDP configuration validation and bounds check fixes (liuxixin)
       - Various nvmet fixes, including a pre-auth out-of-bounds read in
         the Discovery Get Log Page handler, auth payload bounds
         validation, and tcp error-path leak fixes (Bryam, Tianchu,
         Geliang)
       - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki,
         Eric)
       - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz,
         Achkinazi, Wentao)

- MD pull request via Yu Kuai:
       - raid1/raid10 fixes for a deadlock in the read error recovery
         path, error-path detection and bio accounting with cloned bios,
         and an nr_pending leak in the REQ_ATOMIC bad-block error path
         (Abd-Alrhman)
       - PCI P2PDMA propagation from member devices to the RAID device
         (Kiran)
       - dm-raid bio requeue fix, and various smaller fixes and cleanups
         (Benjamin, Chen, Li, Thorsten)

- Enable Clang lock context analysis for the block layer, with the
  accompanying annotations across queue limits, the blk_holder_ops
  callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart)

- Block status code infrastructure work: a tagged status table, a
  str_to_blk_op() helper, a bio_endio_status() helper, and on top of
  that a new configurable block-layer error injection facility (Christoph)

- DRBD netlink rework, replacing the genl_magic machinery with explicit
  netlink serialization and moving the DRBD UAPI headers to
  include/uapi/linux/ (Christoph Böhmwalder)

- bvec improvements: a bvec_folio() helper and making the bvec_iter
  helpers proper inline functions (Willy, Christoph)

- ublk cleanups and a canceling-flag fix for the disk-not-allocated
  case (Caleb, Ming)

- Partition handling fixes: bound the AIX pp_count scan, fix an of_node
  refcount leak, and replace __get_free_page() with kmalloc() (Bryam,
  Wentao, Mike)

- Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add
  WQ_PERCPU to the block workqueue users (Mateusz, Marco)

- Block statistics and tracing: propagate in-flight to the whole disk
  on partition IO, export passthrough stats, and a new
  block_rq_tag_wait tracepoint (Tang, Keith, Aaron)

- A round of removals, unexports and cleanups across bio, direct-io and
  the bvec helpers (Christoph)

- Various driver fixes (mtip32xx use-after-free, rbd snap_count
  validation and strscpy conversion, nbd socket lockdep reclassify,
  virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS
  email/list updates (Coly, Li, Yu, Christoph Böhmwalder)

- Other little fixes and cleanups all over

Please pull!


The following changes since commit 7fd2df204f342fc17d1a0bfcd474b24232fb0f32:

  Linux 7.1-rc2 (2026-05-03 14:21:25 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/for-7.2/block-20260615

for you to fetch changes up to c7c76f9232bd34835d821f14abdc5fafc17bc938:

  MAINTAINERS: Update Coly Li's email address (2026-06-13 09:29:02 -0600)

----------------------------------------------------------------
for-7.2/block-20260615

----------------------------------------------------------------
Aaron Tomlin (1):
      blk-mq: add tracepoint block_rq_tag_wait

Abd-Alrhman Masalkhi (5):
      md: skip redundant raid_disks update when value is unchanged
      md/raid1,raid10: fix deadlock in read error recovery path
      md/raid1,raid10: fix error-path detection with md_cloned_bio()
      md/raid1,raid10: fix bio accounting for split md cloned bios
      raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path

Achkinazi, Igor (1):
      nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks

Andreas Hindborg (1):
      rust: block: mq: align init_request numa_node arg with C signature

Bart Van Assche (14):
      block: Annotate the queue limits functions
      block/bdev: Annotate the blk_holder_ops callback functions
      block/cgroup: Split blkg_conf_prep()
      block/cgroup: Split blkg_conf_exit()
      block/cgroup: Improve lock context annotations
      block/blk-iocost: Combine two error paths in ioc_qos_write()
      block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen()
      block/crypto: Annotate the crypto functions
      block/blk-iocost: Split ioc_rqos_throttle()
      block/blk-iocost: Inline iocg_lock() and iocg_unlock()
      block/blk-mq-debugfs: Improve lock context annotations
      block/Kyber: Make the lock context annotations compatible with Clang
      block/mq-deadline: Make the lock context annotations compatible with Clang
      block: Enable lock context analysis

Benjamin Marzinski (1):
      dm-raid: only requeue bios when dm is suspending

Bryam Vargas (2):
      nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
      partitions: aix: bound the pp_count scan to the ppe array

Caleb Sander Mateos (4):
      blk-mq: introduce blk_rq_has_data()
      ublk: optimize ublk_rq_has_data()
      ublk: move ublk_req_build_flags() earlier
      ublk: factor out ublk_init_iod() helper

Chaitanya Kulkarni (1):
      block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices

Chao Shi (2):
      nvme: core: reject invalid LBA data size from Identify Namespace
      block: skip sync_blockdev() on surprise removal in bdev_mark_dead()

Chen Cheng (1):
      md/raid10: reset read_slot when reusing r10bio for discard

Christoph Böhmwalder (4):
      drbd: move UAPI headers to include/uapi/linux/
      drbd: replace genl_magic with explicit netlink serialization
      drbd: clean up UAPI headers
      MAINTAINERS: use new drbd-dev mailing list

Christoph Hellwig (18):
      block: remove zero_fill_bio_iter
      block: remove bio_copy_data_iter
      block: unexport blk_io_schedule
      block: unexport blk_status_to_str
      block: unexport bio_{set,check}_pages_dirty
      direct-io: remove IOCB_NOWAIT support
      block: don't set BIO_QUIET for BLK_STS_AGAIN
      block: mark biovec_init_pool static
      loop: cleanup lo_rw_aio
      nvme-tcp: cleanup nvme_tcp_init_iter
      bvec: make the bvec_iter helpers inline functions
      block: add a bio_endio_status helper
      md/raid1: cleanup handle_read_error
      md/raid1: move the exceed_read_errors condition out of fix_read_error
      block: add a macro to initialize the status table
      block: add a "tag" for block status codes
      block: add a str_to_blk_op helper
      block: add configurable error injection

Coly Li (1):
      MAINTAINERS: Update Coly Li's email address

David Laight (1):
      drivers/block/rbd: Use strscpy() to copy strings into arrays

Denis Arefev (1):
      block: Avoid mounting the bdev pseudo-filesystem in userspace

Eric Dumazet (1):
      nbd: Reclassify sockets to avoid lockdep circular dependency

Geliang Tang (2):
      nvmet-tcp: fix page fragment cache leak in error path
      nvmet-tcp: check return value of nvmet_tcp_set_queue_sock

Haoze Xie (1):
      rust: block: fix GenDisk cleanup paths

Jens Axboe (2):
      Merge tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.2/block
      Merge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into for-7.2/block

John Garry (2):
      nvme: use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE for multipath_sysfs
      nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()

Keith Busch (3):
      block: export passthrough stats enabled
      nvme: add support multipath passthrough iostats
      block: check bio split for unaligned bvec

Kiran Kumar Modukuri (2):
      md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
      nvme-multipath: enable PCI P2PDMA for multipath devices

Kuniyuki Iwashima (1):
      nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.

Li Nan (1):
      MAINTAINERS: Update Li Nan's E-mail address

Marco Crivellari (1):
      block: Add WQ_PERCPU to alloc_workqueue users

Mateusz Nowicki (2):
      block: switch numa_node to int in blk_mq_hw_ctx and init_request
      nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools

Matthew Wilcox (Oracle) (2):
      block: Add bvec_folio()
      block: Include bvec.h kernel-doc in the htmldocs

Maurizio Lombardi (7):
      nvme: remove redundant timeout argument from nvme_wait_freeze_timeout
      nvme: add sysfs attribute to change admin timeout per nvme controller
      nvme: add sysfs attribute to change IO timeout per controller
      nvme-core: align fabrics_q teardown with admin_q in nvme_free_ctrl
      nvmet-loop: do not alloc admin tag set during reset
      nvme-core: warn on allocating admin tag set with existing queue
      nvme-core: fix unsigned comparison warning in nvme_wait_freeze_timeout

Maximilian Heyne (1):
      nvme: Let the blocklayer set timeouts for requests

Michael Bommarito (1):
      virtio-blk: clamp zone report to the report buffer capacity

Mike Rapoport (Microsoft) (1):
      block: partitions: replace __get_free_page() with kmalloc()

Ming Lei (1):
      ublk: set canceling flag even when disk is not allocated

Nilay Shroff (9):
      nvme-multipath: fix flex array size in struct nvme_ns_head
      nvme: add diag attribute group under sysfs
      nvme: export command retry count via sysfs
      nvme: export multipath failover count via sysfs
      nvme: export command error counters via sysfs
      nvme: export I/O requeue count when no path is usable via sysfs
      nvme: export I/O failure count when no path is available via sysfs
      nvme: export controller reset event count via sysfs
      nvme: export controller reconnect event count via sysfs

Rosen Penev (1):
      rbd: check snap_count against RBD_MAX_SNAP_COUNT

Shin'ichiro Kawasaki (2):
      nvme-tcp: move nvme_tcp_reclassify_socket()
      nvme-tcp: lockdep: use dynamic lockdep keys per socket instance

Steven Feng (1):
      block: optimize I/O merge hot path with unlikely() hints

Tal Zussman (1):
      block: remove blkdev_write_begin() and blkdev_write_end()

Tang Yizhou (1):
      block: propagate in_flight to whole disk on partition I/O

Tao Cui (1):
      blk-throttle: schedule parent dispatch in tg_flush_bios()

Thorsten Blum (3):
      md/raid0: use str_plural helper in dump_zones
      block/partitions/acorn: use min in {riscix,linux}_partition
      n64cart: use strscpy in n64cart_probe

Tianchu Chen (1):
      nvmet-auth: validate reply message payload bounds against transfer length

Uwe Kleine-König (The Capable Hub) (1):
      floppy: Drop unused pnp driver data

Wentao Liang (2):
      block: partitions: fix of_node refcount leak in of_partition()
      nvme: target: rdma: fix ndev refcount leak on queue connect

Yao Sang (1):
      nvme: refresh multipath head zoned limits from path limits

Yu Kuai (2):
      MAINTAINERS: update Yu Kuai's email address
      block, bfq: release cgroup stats with bfq_group

Yuho Choi (1):
      mtip32xx: fix use-after-free on service thread failure

liuxixin (2):
      nvme: fix FDP fdpcidx bounds check
      nvme: validate FDP configuration descriptor sizes

liyouhong (1):
      nvme-multipath: require exact iopolicy names for module parameter

 Documentation/block/error-injection.rst            |   59 +
 Documentation/block/index.rst                      |    1 +
 Documentation/core-api/kernel-api.rst              |    1 +
 MAINTAINERS                                        |   10 +-
 block/Kconfig                                      |    8 +
 block/Makefile                                     |    3 +
 block/bdev.c                                       |   13 +-
 block/bfq-cgroup.c                                 |   54 +-
 block/bio.c                                        |   52 +-
 block/blk-cgroup.c                                 |   98 +-
 block/blk-cgroup.h                                 |   13 +-
 block/blk-core.c                                   |  104 +-
 block/blk-crypto-fallback.c                        |    9 +-
 block/blk-crypto-profile.c                         |    2 +
 block/blk-crypto.c                                 |    3 +-
 block/blk-iocost.c                                 |  306 ++-
 block/blk-iolatency.c                              |   19 +-
 block/blk-merge.c                                  |   17 +-
 block/blk-mq-debugfs.c                             |   24 +-
 block/blk-mq-tag.c                                 |    6 +
 block/blk-mq.c                                     |   43 +-
 block/blk-settings.c                               |    2 +
 block/blk-sysfs.c                                  |    5 +
 block/blk-throttle.c                               |   85 +-
 block/blk-zoned.c                                  |    2 +-
 block/blk.h                                        |   32 +
 block/bsg-lib.c                                    |    2 +-
 block/error-injection.c                            |  315 +++
 block/error-injection.h                            |   21 +
 block/fops.c                                       |   27 +-
 block/genhd.c                                      |    4 +
 block/kyber-iosched.c                              |    7 +-
 block/mq-deadline.c                                |   12 +-
 block/partitions/acorn.c                           |    5 +-
 block/partitions/aix.c                             |    9 +
 block/partitions/core.c                            |    6 +-
 block/partitions/of.c                              |    5 +-
 drivers/block/drbd/Makefile                        |    1 +
 drivers/block/drbd/drbd_buildtag.c                 |    2 +-
 .../linux => drivers/block/drbd}/drbd_config.h     |    0
 drivers/block/drbd/drbd_debugfs.c                  |    2 +-
 drivers/block/drbd/drbd_int.h                      |    6 +-
 drivers/block/drbd/drbd_main.c                     |    6 +-
 drivers/block/drbd/drbd_nl.c                       |  416 ++--
 drivers/block/drbd/drbd_nl_gen.c                   | 2606 ++++++++++++++++++++
 drivers/block/drbd/drbd_nl_gen.h                   |  395 +++
 drivers/block/drbd/drbd_proc.c                     |    2 +-
 drivers/block/floppy.c                             |    4 +-
 drivers/block/loop.c                               |   24 +-
 drivers/block/mtip32xx/mtip32xx.c                  |   19 +-
 drivers/block/n64cart.c                            |    3 +-
 drivers/block/nbd.c                                |   39 +-
 drivers/block/rbd.c                                |    9 +-
 drivers/block/ublk_drv.c                           |  174 +-
 drivers/block/virtio_blk.c                         |    2 +
 drivers/md/dm-raid.c                               |    6 +
 drivers/md/dm-rq.c                                 |    2 +-
 drivers/md/md.c                                    |   32 +-
 drivers/md/md.h                                    |    7 +
 drivers/md/raid0.c                                 |    4 +-
 drivers/md/raid1.c                                 |   64 +-
 drivers/md/raid10.c                                |   30 +-
 drivers/md/raid5.c                                 |    7 +-
 drivers/mmc/core/queue.c                           |    2 +-
 drivers/mtd/ubi/block.c                            |    2 +-
 drivers/nvme/host/apple.c                          |    4 +-
 drivers/nvme/host/core.c                           |   74 +-
 drivers/nvme/host/fc.c                             |    5 +-
 drivers/nvme/host/ioctl.c                          |    9 +
 drivers/nvme/host/multipath.c                      |  144 +-
 drivers/nvme/host/nvme.h                           |   21 +-
 drivers/nvme/host/pci.c                            |   16 +-
 drivers/nvme/host/rdma.c                           |    6 +-
 drivers/nvme/host/sysfs.c                          |  311 ++-
 drivers/nvme/host/tcp.c                            |  127 +-
 drivers/nvme/target/discovery.c                    |   23 +-
 drivers/nvme/target/fabrics-cmd-auth.c             |   15 +-
 drivers/nvme/target/loop.c                         |   33 +-
 drivers/nvme/target/rdma.c                         |    6 +-
 drivers/nvme/target/tcp.c                          |   11 +-
 drivers/scsi/scsi_lib.c                            |    2 +-
 fs/direct-io.c                                     |   15 +-
 include/linux/bio.h                                |   32 +-
 include/linux/blk-mq.h                             |   53 +-
 include/linux/blkdev.h                             |   28 +-
 include/linux/bvec.h                               |  112 +-
 include/linux/drbd_genl.h                          |  536 ----
 include/linux/drbd_genl_api.h                      |   56 -
 include/linux/genl_magic_func.h                    |  413 ----
 include/linux/genl_magic_struct.h                  |  272 --
 include/trace/events/block.h                       |   59 +
 include/{ => uapi}/linux/drbd.h                    |   73 +-
 include/uapi/linux/drbd_genl.h                     |  359 +++
 include/{ => uapi}/linux/drbd_limits.h             |   10 +-
 io_uring/rsrc.c                                    |    2 +-
 mm/page_io.c                                       |    4 +-
 rust/kernel/block/mq/gen_disk.rs                   |   20 +-
 rust/kernel/block/mq/operations.rs                 |    2 +-
 98 files changed, 5781 insertions(+), 2322 deletions(-)
 create mode 100644 Documentation/block/error-injection.rst
 create mode 100644 block/error-injection.c
 create mode 100644 block/error-injection.h
 rename {include/linux => drivers/block/drbd}/drbd_config.h (100%)
 create mode 100644 drivers/block/drbd/drbd_nl_gen.c
 create mode 100644 drivers/block/drbd/drbd_nl_gen.h
 delete mode 100644 include/linux/drbd_genl.h
 delete mode 100644 include/linux/drbd_genl_api.h
 delete mode 100644 include/linux/genl_magic_func.h
 delete mode 100644 include/linux/genl_magic_struct.h
 rename include/{ => uapi}/linux/drbd.h (85%)
 create mode 100644 include/uapi/linux/drbd_genl.h
 rename include/{ => uapi}/linux/drbd_limits.h (97%)

-- 
Jens Axboe


^ permalink raw reply

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-15 15:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-block, dm-devel
In-Reply-To: <ai7rnH20IYeSmY8s@gallifrey>

On Sun, Jun 14, 2026 at 05:57:48PM +0000, Dr. David Alan Gilbert wrote:
> Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from 252:24. Trying alternative device.
> Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------
> Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read failed.
> Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369
> Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync.
> Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reje>
> Jun 14 18:08:32 dalek kernel:  drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac scsi_dh_emc
> Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not tainted 7.1.0-rc7+ #786 PREEMPT(lazy) 
> Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020
> Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror
> Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250
> Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c
> Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246
> Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 RCX: 0000000000000000
> Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d1a8eb28b00
> Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 R09: ffffd1fb8176fc40
> Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 R12: 0000000000000000
> Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 R15: ffff8d1a8eb28b00
> Jun 14 18:08:32 dalek kernel: FS:  0000000000000000(0000) GS:ffff8d29d161f000(0000) knlGS:0000000000000000
> Jun 14 18:08:32 dalek kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 CR4: 0000000000350ef0
> Jun 14 18:08:32 dalek kernel: Call Trace:
> Jun 14 18:08:32 dalek kernel:  <TASK>
> Jun 14 18:08:32 dalek kernel:  do_region+0x227/0x2a0

I think the problem is that do_region is tracking the "remaining" in
sector granularity, but devices can have dma alignment such that it's
valid to have sub-sector vectors. Rounding the appended length with
to_sectors() then subtracts 0, so the loop thinks no progress was made
and loops forever. Tracking it in bytes instead of sectors should fix
what you're observing.
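
To illustrate the accounting issue described above, here is a minimal
userspace sketch. It is not the dm-io code: sketch_to_sector(),
drain_in_sectors() and drain_in_bytes() are hypothetical names, and the
max_iters cap only stands in for "loops forever" so the example terminates.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors */

/* Hypothetical bytes-to-sectors helper that rounds down. */
static uint64_t sketch_to_sector(uint64_t bytes)
{
	return bytes >> SKETCH_SECTOR_SHIFT;
}

/*
 * Account progress in sectors. Returns false if the loop has not finished
 * after max_iters passes, which stands in for "loops forever".
 */
static bool drain_in_sectors(uint64_t total_bytes, uint32_t vec_len,
			     unsigned int max_iters)
{
	uint64_t remaining = sketch_to_sector(total_bytes);

	while (remaining) {
		uint64_t done = sketch_to_sector(vec_len);

		if (!max_iters--)
			return false;
		/* A sub-sector vec_len makes done == 0: no progress. */
		remaining -= done < remaining ? done : remaining;
	}
	return true;
}

/* Same loop, accounting in bytes: any non-zero vec_len makes progress. */
static bool drain_in_bytes(uint64_t total_bytes, uint32_t vec_len,
			   unsigned int max_iters)
{
	uint64_t remaining = total_bytes;

	while (remaining) {
		if (!max_iters--)
			return false;
		remaining -= vec_len < remaining ? vec_len : remaining;
	}
	return true;
}
```

With a 256-byte vector, the sector-based loop never reduces "remaining",
while the byte-based loop finishes a 4096-byte region in 16 passes.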

^ permalink raw reply

* Re: [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Christoph Hellwig @ 2026-06-15 15:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <3d739b6dc37e34ca2a2a3780d12d0288a4060d57.1781253428.git.wqu@suse.com>

On Fri, Jun 12, 2026 at 07:21:14PM +0930, Qu Wenruo wrote:
> If a direct IO requires bounced pages for stable buffer, it will always
> allocate memory, and both bio_iov_iter_bounce_write() and
> bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
> can sleep and break NOWAIT requirement.
> 
> So we need to reject such NOWAIT and BOUNCE direct IO in
> iomap_dio_bio_iter().

That's a bit heavy handed. Just do a noretry allocation.


^ permalink raw reply

* Re: [PATCH v3 2/4] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
From: Christoph Hellwig @ 2026-06-15 15:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <35a61301ea76c71d101533bf7d4aeffb7752fb85.1781253428.git.wqu@suse.com>

On Fri, Jun 12, 2026 at 07:21:13PM +0930, Qu Wenruo wrote:
> diff --git a/block/bio.c b/block/bio.c
> index b33ff69bb722..01bb76d9717c 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1335,7 +1335,10 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
>  			break;
>  		bio_add_folio_nofail(bio, folio, this_len, 0);
>  
> -		copied = copy_from_iter(folio_address(folio), this_len, iter);
> +		if (iter->nofault)
> +			copied = copy_folio_from_iter_atomic(folio, 0, this_len, iter);

Same here, please keep a sane line length.


^ permalink raw reply

* Re: [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
From: Christoph Hellwig @ 2026-06-15 15:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <ed47077f3d4336b64a177486e74b4f6a460b2392.1781253428.git.wqu@suse.com>

On Fri, Jun 12, 2026 at 07:21:12PM +0930, Qu Wenruo wrote:
> +		copied = copy_from_iter(folio_address(folio), this_len, iter);
> +		if (copied < this_len) {
> +			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);

Please keep the line below 80 characters.  And maybe add the explanation
for the amount reverted here based on what you wrote above in the commit
log.
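
One possible reading of the reverted amount, written out as a userspace
sketch. It assumes (this is not confirmed by the quoted hunk alone) that the
folio is added to the bio before the copy, so bi_size already includes
this_len, and that everything in the bio came from this iterator.
revert_amount() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

/*
 * Bytes consumed from the iterator so far after a short copy:
 *   (bi_size - this_len)  bytes from the folios copied in full earlier
 *   + copied              bytes that made it into the current folio
 */
static size_t revert_amount(size_t bi_size, size_t this_len, size_t copied)
{
	return bi_size - this_len + copied;
}
```

For example, after two full 4096-byte folios and a third 4096-byte folio
that only received 100 bytes, bi_size is 12288 and the amount is
12288 - 4096 + 100 = 8292.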


^ permalink raw reply

* Re: [PATCH v1 0/2] virtio: PCI ERS permanent failure teardown for virtio-blk
From: Stefan Hajnoczi @ 2026-06-15 14:52 UTC (permalink / raw)
  To: Xixin Liu
  Cc: linux-block, virtualization, mst, jasowang, xuanzhuo, eperezma,
	pbonzini, axboe, linux-kernel, Parav Pandit
In-Reply-To: <cover.virtio-blk-ers-v1.1780449274.git.liuxixin@kylinos.cn>

[-- Attachment #1: Type: text/plain, Size: 1472 bytes --]

On Mon, Jun 15, 2026 at 10:00:00AM +0800, Xixin Liu wrote:
> Hi,
> 
> This series adds proper PCI AER error recovery handling for virtio-pci and
> completes virtio-blk teardown when ERS reports pci_channel_io_perm_failure.

CCing Parav because he previously looked at surprise removal:
https://lore.kernel.org/virtualization/20250822091706.21170-1-parav@nvidia.com/

> 
> virtio-pci only registered reset_prepare/reset_done.  The recovery core
> treats devices without error_detected as NO_AER_DRIVER and does not
> deliver perm_failure to the driver after a failed recovery.  When bus
> reset fails (reproduced on QEMU with DLLLA not set within 100 ms after
> secondary bus reset), virtio-blk disks stay live even though virtqueues
> may already have been torn down during the frozen phase.
> 
> Patch 1 registers error_detected (frozen quiesce + perm_failure notify).
> Patch 2 calls the virtio driver shutdown hook from virtio-pci on
> perm_failure, implements virtio-blk shutdown with blk_mark_disk_dead(),
> and adds fail-fast guards in virtio_queue_rq.
> 
> Thanks,
> Xixin Liu
> 
> ---
> 
> Xixin Liu (2):
>   virtio-pci: add error_detected for PCI AER recovery
>   virtio-blk: mark disk dead on ERS permanent failure
> 
>  drivers/block/virtio_blk.c         | 39 +++++++++++++++++++++++++++++++
>  drivers/virtio/virtio_pci_common.c | 47 ++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+)
> 
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH RFC 3/6] block: split bdev_yield_claim() out of bdev_fput()
From: Johannes Thumshirn @ 2026-06-15 14:09 UTC (permalink / raw)
  To: Christian Brauner, Chris Mason, Jens Axboe, David Sterba,
	Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-3-a6c72b840e7d@kernel.org>

On 6/15/26 3:18 PM, Christian Brauner wrote:
> -void bdev_fput(struct file *bdev_file)
> +void bdev_yield_claim(struct file *bdev_file)
>   {
> -	if (WARN_ON_ONCE(bdev_file->f_op != &def_blk_fops))
> -		return;
> -
>   	if (bdev_file->private_data) {
>   		struct block_device *bdev = file_bdev(bdev_file);
>   		struct gendisk *disk = bdev->bd_disk;
> @@ -1226,7 +1224,23 @@ void bdev_fput(struct file *bdev_file)
>   		bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
>   		mutex_unlock(&disk->open_mutex);
>   	}
> +}
> +EXPORT_SYMBOL_GPL(bdev_yield_claim);

This now operates on 'bdev_file->private_data' without verifying it's a
block device (unlike before), and the whole function body now sits inside
the 'if (bdev_file->private_data) {' block.

I'd negate the check and return early to save a level of indentation.


^ permalink raw reply

* Re: [PATCH RFC 2/6] block: allow making a block device unfreezable
From: Jan Kara @ 2026-06-15 14:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Chris Mason, Jens Axboe, David Sterba, Jan Kara, Naohiro Aota,
	Josef Bacik, linux-btrfs, linux-block, linux-fsdevel
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-2-a6c72b840e7d@kernel.org>

On Mon 15-06-26 15:18:05, Christian Brauner wrote:
> Add bdev_deny_freeze() and bdev_allow_freeze(), modeled on
> deny_write_access()/allow_write_access().  bd_fsfreeze_count becomes a
> signed counter: > 0 counts active freezes, < 0 counts deniers, and the
> two regimes are mutually exclusive.  bdev_freeze() refuses with -EBUSY
> while a deny is held, and bdev_deny_freeze() refuses while the device is
> frozen.
> 
> A filesystem that mutates a device's membership (a btrfs device add,
> remove or replace) denies freezing on the device for the duration, so a
> claim a freeze walk might act on is never added or torn down behind the
> freezer's back.
> 
> The deny/allow helpers are a single atomic on bd_fsfreeze_count and take
> no lock, so they can be called while holding s_umount without inverting
> against bdev_freeze()'s bd_fsfreeze_mutex -> s_umount order.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Just one nit below but regardless feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> +/**
> + * bdev_allow_freeze - allow freezing a block device again
> + * @bdev: block device
> + *
> + * Undo one bdev_deny_freeze().
> + */
> +void bdev_allow_freeze(struct block_device *bdev)
> +{
> +	atomic_inc(&bdev->bd_fsfreeze_count);
> +}
> +EXPORT_SYMBOL_GPL(bdev_allow_freeze);

Perhaps we can add WARN_ON_ONCE(atomic_read(&bdev->bd_fsfreeze_count) >= 0)
to catch unbalanced bdev_allow_freeze()?
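
A userspace sketch of the signed-counter scheme from the commit message,
with the suggested balance check folded into the allow side. This is only
an illustration using C11 atomics: struct bdev_sketch and the sketch_*()
helpers are hypothetical, the compare-exchange loops are an assumption
about how the two regimes could be kept exclusive, and the real
bdev_freeze() also involves bd_fsfreeze_mutex, which is left out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for bd_fsfreeze_count: > 0 freezes, < 0 deniers. */
struct bdev_sketch {
	atomic_int fsfreeze_count;
};

/* Take a freeze reference unless a deny is held (-EBUSY in the patch). */
static bool sketch_freeze(struct bdev_sketch *b)
{
	int old = atomic_load(&b->fsfreeze_count);

	do {
		if (old < 0)
			return false;
	} while (!atomic_compare_exchange_weak(&b->fsfreeze_count,
					       &old, old + 1));
	return true;
}

/* Take a deny reference unless the device is currently frozen. */
static bool sketch_deny_freeze(struct bdev_sketch *b)
{
	int old = atomic_load(&b->fsfreeze_count);

	do {
		if (old > 0)
			return false;
	} while (!atomic_compare_exchange_weak(&b->fsfreeze_count,
					       &old, old - 1));
	return true;
}

/*
 * Undo one deny. Returns false when no deny was held, which is the
 * unbalanced case a WARN_ON_ONCE() would flag. The counter is still
 * incremented in that case, as in the quoted helper.
 */
static bool sketch_allow_freeze(struct bdev_sketch *b)
{
	int old = atomic_fetch_add(&b->fsfreeze_count, 1);

	return old < 0;
}
```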

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH RFC 1/6] btrfs: destroy the target device when mark_block_group_to_copy() fails
From: Johannes Thumshirn @ 2026-06-15 14:02 UTC (permalink / raw)
  To: Christian Brauner, Chris Mason, Jens Axboe, David Sterba,
	Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-1-a6c72b840e7d@kernel.org>

Looks good,

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


^ permalink raw reply

* Re: [PATCH RFC 0/6] block,btrfs: fix frozen-superblock strand on device add/remove/replace
From: Christian Brauner @ 2026-06-15 13:41 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org>

> FIFREEZE on the bare device - resolves that holder to freeze the filesystem,

That's nonsense ofc. FIFREEZE only freezes the superblock.

^ permalink raw reply

* Re: [PATCH] block: check bio split for unaligned bvec
From: Christoph Hellwig @ 2026-06-15 13:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, axboe, Keith Busch, Carlos Maiolino
In-Reply-To: <20260612223205.465913-1-kbusch@meta.com>

On Fri, Jun 12, 2026 at 03:32:04PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Offsets and lengths need to be validated against the dma alignment. This
> check was skipped for a sufficiently small bio with a single bvec, which
> may allow an invalid request to be dispatched to the driver. Force the
> validation for an unaligned bvec by forcing the bio split path that
> handles this condition.

This fix itself looks good, but we'll also need something similar
for bio-based drivers that never call into the splitting helper.
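
For reference, the kind of per-vector check being discussed could look
roughly like the userspace sketch below. It is an assumption-laden
illustration, not the block layer code: struct bvec_sketch and
bvec_is_dma_aligned() are hypothetical, and the alignment is taken to be
expressed as a mask (alignment minus one).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one bio_vec segment. */
struct bvec_sketch {
	uint32_t offset;	/* offset into the page */
	uint32_t len;		/* length in bytes */
};

/*
 * dma_alignment_mask is alignment - 1, e.g. 511 for 512-byte alignment.
 * A segment passes only if both its offset and its length are aligned.
 */
static bool bvec_is_dma_aligned(const struct bvec_sketch *bv,
				uint32_t dma_alignment_mask)
{
	return ((bv->offset | bv->len) & dma_alignment_mask) == 0;
}
```

A single-bvec bio would need this check on its one segment regardless of
how small the bio is, which is the case the patch forces through the split
path.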


^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Jianyue Wu @ 2026-06-15 13:34 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Christoph Hellwig, Chris Li, Baoquan He, Nhat Pham,
	Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <CAGsJ_4yqy3ZnXirSZD0X_1b5qDnyoMLuuSS4mHRcip+CPpvQdg@mail.gmail.com>

On Mon, Jun 15, 2026 at 5:14 PM Barry Song <baohua@kernel.org> wrote:
>
> On Sun, Jun 14, 2026 at 11:35 PM Jianyue Wu <wujianyue000@gmail.com> wrote:
> >
> > This series builds on Christoph Hellwig's swap batching rework that
> > moves block swap onto struct swap_iocb and per-backend struct
> > swap_ops handlers [1].  Christoph's patches unify batching for
> > ordinary block devices and swap files.  zram still needs a custom
> > path because swap slots map to compressed pages, not disk sectors.
> >
> > The first patch adds swap_register_block_ops() so a block driver can
> > install custom submit_read/submit_write handlers when swapon targets
> > its block device.  The default swap_bdev_ops path is unchanged for
> > devices that do not register.
> >
> > The second patch registers zram_swap_ops at module init.  On write,
> > the swap core still batches folios into a swap_iocb.  zram maps each
> > folio to a slot index and stores it through zram_write_page() instead
> > of building one bio per page.  Read handling keeps slot_lock and
> > mark_slot_accessed() in one critical section.  Writeback-enabled zram
> > falls back to swap_bdev_submit_read() for ZRAM_WB slots.
> >
> > The third patch moves slot_free_notify into swap_ops next to the
> > other zram swap callbacks, and documents the locking contract for
> > that hook.
> >
> > Applied on top of Christoph Hellwig's "better block swap batching and
> > a different take on swap_ops" series [1].
>
> Nice. I think it's better to mark it as RFC at this stage.
>
> By the way, besides the architectural refinements, have
> you also observed any noticeable performance improvements?
>
> >
> > [1] https://lore.kernel.org/linux-mm/?q=better+block+swap+batching
>
> Best Regards
> Barry

Hello Barry,

Thanks for the feedback :) I will mark the next revision as RFC.

I ran some local measurements on a zram swap workload.
Without a backing device (zspool-only swap read path), the swap_ops
path looks slightly better on average and median latency, while p99 is
roughly flat:
avg 1,750 ns vs 1,812 ns
p50 1,273 ns vs 1,504 ns
p99 6,318 ns vs 6,198 ns

With writeback/backing device enabled, the numbers are much noisier
(bd_reads per sample and cold-fault ratio varied a lot between runs),
so I would not read too much into them. Directionally, the swap_ops
path looked faster on avg/p50/p99 in the runs I captured, but I need
more controlled repeats before claiming a real win:
avg 39 µs vs 77 µs
p50 4.5 µs vs 90 µs
p99 116 µs vs 210 µs

bd_reads/sample 0.37 vs 0.75
cold-fault samples 62.5% vs 100%

So far I would describe the gain as modest for the common zspool case,
maybe because it doesn't get the merge benefit that the bio side has.
The main motivation is still architectural fit (putting zram swap
semantics under swap_ops) rather than a large performance jump.

Thanks,
Jianyue

^ permalink raw reply

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Dr. David Alan Gilbert @ 2026-06-15 13:20 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: linux-block, dm-devel
In-Reply-To: <31e178fc-ed41-45d9-9e85-2d114f4222a4@redhat.com>

* Zdenek Kabelac (zkabelac@redhat.com) wrote:
> Dne 14. 06. 26 v 19:57 Dr. David Alan Gilbert napsal(a):
> > Hi,
> >    I've got a repeatable raid hang/warn and would appreciate some pointers
> > as where to debug.
> >    (I've been logging stuff on  https://bugzilla.kernel.org/show_bug.cgi?id=221535 )
> > 
> >    This started off as debugging a case where I'd get my RAID1 (on the host)
> > getting a reliable 'rescheduling sector'/disk failure while running the qemu block test suite
> > during a qemu build, but then I tried to build a smaller discrete
> > test, and now I've got a simply triggerable warn and test hang.
> > There's no errors from the underlying SATA layer on the storage,
> > everything resyncs just fine.
> > 
> > I've got an existing LVM vg ('main') with two mirrors on sda2, and sdb2
> > which are SATA disks.
> > 
> > # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > 
> 
> Hi
> 
> It's probably worth saying here that '--type mirror' is the OLD
> (historical) DM mirror target implementation. This target is no longer
> under very active development, as users are supposed to be using the newer
> (and faster) md-wrapped '--type raid1'.

Ah, that might have been when I split off to using a separate test
LV rather than risking my LV containing actual useful data and had to
try and find the lvcreate command.

> So if you use   'lvcreate -m1 ....'   you get 'auto-magically' this
> mirroring target.

Thanks,

Dave

> But this obviously doesn't fix the problem in the old mirror target...
> 
> Regards
> 
> Zdenek
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply

* Re: [PATCH 2/3] mm/zram: handle swap read/write via swap_ops
From: Jianyue Wu @ 2026-06-15 13:19 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Andrew Morton, Christoph Hellwig, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <ai+eG9sFYTn/vTq5@yjaykim-PowerEdge-T330>

On Mon, Jun 15, 2026 at 2:39 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Sun, Jun 14, 2026 at 11:35:30PM +0800, Jianyue Wu wrote:
>
> Hello!
>
> > +static void zram_swap_submit_read(struct swap_io_ctx *ctx)
> > +{
> > +     struct zram *zram = ctx->sis->bdev->bd_disk->private_data;
>
> A passing thought: accessing `zram` here is too indirect. We might
> need a `private_data` in the swap device struct someday?
>
> (That is, if some swap-side-only private data is really needed and has real value.)
>
> > +     struct swap_iocb *sio = ctx->sio;
> > +     int nr = swap_iocb_nr_folios(sio);
> > +     bool failed = false;
> > +     int i, j;
> > +                     /*
> > +                      * read_from_zspool() and mark_slot_accessed() must run
> > +                      * under the same slot_lock.  zram_read_page() unlocks
> > +                      * before returning, which leaves a window where
> > +                      * writeback can pick an idle slot we just read.
> > +                      */
>
> Regarding the comment about the "window" where writeback can pick an
> idle slot. I think this reasoning is a bit of a gray area. Writeback
> could just as easily pick the slot right before entering this routine,
> so the race condition seems fundamentally the same.
>
> Isn't the actual justification here to separate the non-backend logic
> and ensure mark_slot_accessed() is called under the lock, given that
> zram_read_page() can call the backend device?
>
> If the "window" mentioned in the comment is indeed a valid issue, then
> zram_read_page() has the exact same problem and needs to be fixed as
> well?
>
> If not, IMHO I suggest revising or removing this comment to clarify
> the true(?) intention. :)
>
> > +                     slot_lock(zram, idx);
> > +                     ret = read_from_zspool(zram, page, idx);
> > +                     if (!ret)
> > +                             mark_slot_accessed(zram, idx);
> > +                     slot_unlock(zram, idx);
>

Hello Youngjun,

Agree. Walking ctx->sis->bdev->bd_disk->private_data
from every swap_ops callback is too indirect. I will add an opaque
private_data field to struct swap_info_struct, set it from
->swap_activate() when the swap area is set up, and clear it on
swapoff. The zram callbacks will then use ctx->sis->private_data directly.

You are right. The writeback "window" reasoning was overstated.
Writeback could already have picked the slot before we enter the swap
read path; we have ZRAM_PP_SLOT to handle that.
0. // condition 1: writeback picks the slot before the lock.
1. lock → read_from_zspool → unlock
2. // condition 2: writeback picks the slot inside the lock.
3. lock → mark_slot_accessed() → unlock // clears the ZRAM_IDLE and ZRAM_PP_SLOT flags

I think simply removing this comment is good.

Thanks,
Jianyue

^ permalink raw reply

* [PATCH RFC 6/6] btrfs: deny freezing devices undergoing a replace
From: Christian Brauner @ 2026-06-15 13:18 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org>

A device replace opens a target and, on success, frees the source on a live
filesystem from btrfs_dev_replace_finishing() - which cannot fail and also
runs from a kthread on mount resume.  A bdev_freeze() racing the source free
or the target swap-in would freeze the filesystem through a claim that is
being torn down or replaced, leaving nothing for bdev_thaw() to rebalance.

Make both devices unfreezable for the whole replace, with the invariant that
a STARTED replace holds one deny on each device and any other state holds
none.  The target is denied at open (btrfs_open_device_deny_freeze(), undone
on btrfs_init_dev_replace_tgtdev()'s error unwind); the source is denied at
the start of btrfs_dev_replace_start(), before mark_block_group_to_copy() so
every 'leave' unwind sees both denied.

The deny tracks the STARTED state and is dropped whenever the replace leaves
it: btrfs_dev_replace_finishing() re-allows the target it makes a member and
frees the source through btrfs_close_bdev(allow_freeze=true), and its
scrub-error path re-allows both as it cancels.  Its early failures (before
the device swap) keep the replace STARTED and resumable, so both stay denied.
Suspending for unmount re-allows both, so they are reopened freezable at the
next mount where btrfs_resume_dev_replace_async() re-denies them (staying
suspended if a device is frozen right then); a replace cancelled from the
suspended state therefore destroys the target without allowing.
btrfs_close_bdev() and btrfs_destroy_dev_replace_tgtdev() take an allow_freeze
argument to carry this distinction; the unmount path
(btrfs_close_one_device()) passes false.
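
The invariant can be pictured with a toy model. This is an illustration
only, not part of the patch: the toy_ names are hypothetical, the deny is
modelled as a plain per-device counter, and the case of a source device
without a bdev is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states; the real ones are BTRFS_IOCTL_DEV_REPLACE_STATE_*. */
enum toy_state {
	TOY_NEVER_STARTED,
	TOY_STARTED,
	TOY_SUSPENDED,
	TOY_FINISHED,
	TOY_CANCELED,
};

struct toy_replace {
	enum toy_state state;
	int src_denies;		/* freeze denies held on the source */
	int tgt_denies;		/* freeze denies held on the target */
};

/* STARTED holds exactly one deny per device; every other state holds none. */
static bool toy_invariant_holds(const struct toy_replace *r)
{
	int want = (r->state == TOY_STARTED) ? 1 : 0;

	return r->src_denies == want && r->tgt_denies == want;
}

/* Start or resume: take both denies, then enter STARTED. */
static void toy_enter_started(struct toy_replace *r)
{
	assert(r->state != TOY_STARTED);
	r->src_denies++;
	r->tgt_denies++;
	r->state = TOY_STARTED;
}

/* Finish, scrub-error cancel or suspend: drop both denies on the way out. */
static void toy_leave_started(struct toy_replace *r, enum toy_state next)
{
	assert(r->state == TOY_STARTED && next != TOY_STARTED);
	r->src_denies--;
	r->tgt_denies--;
	r->state = next;
}

/* Cancel from SUSPENDED: no deny is held, so there is nothing to allow. */
static void toy_cancel_suspended(struct toy_replace *r)
{
	assert(r->state == TOY_SUSPENDED);
	r->state = TOY_CANCELED;
}
```

In this model the allow_freeze argument corresponds to whether the caller
leaves STARTED (true) or was never in it (false).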

On resume, a failed kthread_run() re-allows both devices and goes through the
suspend path, resetting the replace to SUSPENDED and finishing the exclusive
operation instead of returning straight away.  The (re)mount still aborts on
that error; routing it through suspend keeps the deny balanced against the
unmount teardown and additionally drops BTRFS_EXCLOP_DEV_REPLACE, closing a
pre-existing leak that was harmless on the failed mount that frees the fs but
would have wedged future exclusive operations after a failed remount-rw.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/dev-replace.c | 65 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/volumes.c     | 18 +++++++++-----
 fs/btrfs/volumes.h     |  3 ++-
 3 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 0112aa6d7ab1..4d6bd6b4b039 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -247,8 +247,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	/* Unfreezable for the whole replace; see btrfs_dev_replace_start(). */
+	bdev_file = btrfs_open_device_deny_freeze(device_path, fs_info->sb);
 	if (IS_ERR(bdev_file)) {
 		btrfs_err(fs_info, "target device %s is invalid!", device_path);
 		return PTR_ERR(bdev_file);
@@ -325,7 +325,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	return 0;
 
 error:
-	bdev_fput(bdev_file);
+	/* Undo the open-time freeze deny. */
+	btrfs_release_device_allow_freeze(bdev_file);
 	return ret;
 }
 
@@ -622,6 +623,15 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	/* Deny the source before mark, so every 'leave' unwinds both denied. */
+	if (src_device->bdev) {
+		ret = bdev_deny_freeze(src_device->bdev);
+		if (ret) {
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
+			return ret;
+		}
+	}
+
 	ret = mark_block_group_to_copy(fs_info, src_device);
 	if (ret)
 		goto leave;
@@ -706,7 +716,9 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	return ret;
 
 leave:
-	btrfs_destroy_dev_replace_tgtdev(tgt_device);
+	if (src_device->bdev)
+		bdev_allow_freeze(src_device->bdev);
+	btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
 	return ret;
 }
 
@@ -887,6 +899,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 */
 	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
 	if (ret) {
+		/* Stays started/resumable; keep both denied. */
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
@@ -900,6 +913,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	while (1) {
 		trans = btrfs_start_transaction(root, 0);
 		if (IS_ERR(trans)) {
+			/* Stays started/resumable; keep both denied. */
 			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 			return PTR_ERR(trans);
 		}
@@ -952,7 +966,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&fs_devices->device_list_mutex);
 		btrfs_rm_dev_replace_blocked(fs_info);
 		if (tgt_device)
-			btrfs_destroy_dev_replace_tgtdev(tgt_device);
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
+		/* The source stays a member; re-allow freezing it. */
+		if (src_device->bdev)
+			bdev_allow_freeze(src_device->bdev);
 		btrfs_rm_dev_replace_unblocked(fs_info);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
@@ -1018,6 +1035,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
+	/* The target is now a member; the source is freed (allow + release). */
+	bdev_allow_freeze(tgt_device->bdev);
 	btrfs_rm_dev_replace_free_srcdev(src_device);
 
 	return 0;
@@ -1146,8 +1165,9 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
 			btrfs_dev_name(src_device), src_device->devid,
 			btrfs_dev_name(tgt_device));
 
+		/* A suspended replace never re-denied freezing; do not allow. */
 		if (tgt_device)
-			btrfs_destroy_dev_replace_tgtdev(tgt_device);
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, false);
 		break;
 	default:
 		up_write(&dev_replace->rwsem);
@@ -1177,6 +1197,11 @@ void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info)
 		dev_replace->time_stopped = ktime_get_real_seconds();
 		dev_replace->item_needs_writeback = 1;
 		btrfs_info(fs_info, "suspending dev_replace for unmount");
+		/* Reopened freezable next mount; resume re-denies. */
+		if (dev_replace->srcdev && dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		if (dev_replace->tgtdev && dev_replace->tgtdev->bdev)
+			bdev_allow_freeze(dev_replace->tgtdev->bdev);
 		break;
 	}
 
@@ -1189,6 +1214,7 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
 {
 	struct task_struct *task;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	int ret = 0;
 
 	down_write(&dev_replace->rwsem);
 
@@ -1232,8 +1258,33 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
 		return 0;
 	}
 
+	/* Re-deny for the resumed replace; stay suspended if frozen now. */
+	if (dev_replace->srcdev->bdev &&
+	    bdev_deny_freeze(dev_replace->srcdev->bdev))
+		goto suspend;
+	if (bdev_deny_freeze(dev_replace->tgtdev->bdev)) {
+		if (dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		goto suspend;
+	}
+
 	task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
-	return PTR_ERR_OR_ZERO(task);
+	if (IS_ERR(task)) {
+		bdev_allow_freeze(dev_replace->tgtdev->bdev);
+		if (dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		/* Undo the deny and suspend, but still fail the mount. */
+		ret = PTR_ERR(task);
+		goto suspend;
+	}
+	return 0;
+
+suspend:
+	btrfs_exclop_finish(fs_info);
+	down_write(&dev_replace->rwsem);
+	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED;
+	up_write(&dev_replace->rwsem);
+	return ret;
 }
 
 static int btrfs_dev_replace_kthread(void *data)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4558e018b53b..d9f2cd37a365 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1128,7 +1128,7 @@ void btrfs_release_device_allow_freeze(struct file *bdev_file)
 	bdev_fput(bdev_file);
 }
 
-static void btrfs_close_bdev(struct btrfs_device *device)
+static void btrfs_close_bdev(struct btrfs_device *device, bool allow_freeze)
 {
 	if (!device->bdev)
 		return;
@@ -1138,7 +1138,11 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		invalidate_bdev(device->bdev);
 	}
 
-	bdev_fput(device->bdev_file);
+	/* @allow_freeze undoes a replace-time deny; unmount-close was never denied. */
+	if (allow_freeze)
+		btrfs_release_device_allow_freeze(device->bdev_file);
+	else
+		bdev_fput(device->bdev_file);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -1159,7 +1163,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 		fs_devices->missing_devices--;
 	}
 
-	btrfs_close_bdev(device);
+	btrfs_close_bdev(device, false);
 	if (device->bdev) {
 		fs_devices->open_devices--;
 		device->bdev = NULL;
@@ -2511,7 +2515,8 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev)
 
 	mutex_lock(&uuid_mutex);
 
-	btrfs_close_bdev(srcdev);
+	/* The source was made unfreezable for the replace; undo it. */
+	btrfs_close_bdev(srcdev, true);
 	synchronize_rcu();
 	btrfs_free_device(srcdev);
 
@@ -2532,7 +2537,8 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev)
 	mutex_unlock(&uuid_mutex);
 }
 
-void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev,
+				      bool allow_freeze)
 {
 	struct btrfs_fs_devices *fs_devices = tgtdev->fs_info->fs_devices;
 
@@ -2553,7 +2559,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
 
 	btrfs_scratch_superblocks(tgtdev->fs_info, tgtdev);
 
-	btrfs_close_bdev(tgtdev);
+	btrfs_close_bdev(tgtdev, allow_freeze);
 	synchronize_rcu();
 	btrfs_free_device(tgtdev);
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 75c7963f5d4c..65de9504d887 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -790,7 +790,8 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans);
 void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev);
 void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev);
-void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev);
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev,
+				      bool allow_freeze);
 unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 				    u64 logical);
 u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);

-- 
2.47.3


^ permalink raw reply related

* [PATCH RFC 5/6] btrfs: deny freezing a device while it is being added
From: Christian Brauner @ 2026-06-15 13:18 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org>

btrfs_init_new_device() opens and claims the new device on a live
superblock without holding the write count, so a bdev_freeze() racing the
window between the claim being published and the device becoming a member
could freeze the filesystem through a claim the add may still abort and tear
down.

Add btrfs_open_device_deny_freeze(): it opens the device once
non-exclusively to take the freeze deny, then claims it by the same dev_t,
so the holder is only ever published while the device is already
unfreezable.  Keep it denied until the add is durable: bdev_allow_freeze()
on each success return (the device is now a committed member),
btrfs_release_device_allow_freeze() on the error unwind.  The deny spans the
whole add, including the seeding tail whose late failures still release the
device.  A device already frozen when the add starts is refused with -EBUSY.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/volumes.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 36f9835f65e3..4558e018b53b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2822,6 +2822,36 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
 	return 0;
 }
 
+/*
+ * Open @path for @sb with freezing denied before the holder claim is published,
+ * so a racing bdev_freeze() can never reach a claim a device add or replace may
+ * still abort.  The deny is taken on a throwaway non-holder probe open, then the
+ * holder is opened by the probe's dev_t.  Balanced by the caller.
+ */
+struct file *btrfs_open_device_deny_freeze(const char *path,
+					   struct super_block *sb)
+{
+	struct file *probe_file, *bdev_file;
+	int ret;
+
+	probe_file = bdev_file_open_by_path(path, BLK_OPEN_READ, NULL, NULL);
+	if (IS_ERR(probe_file))
+		return probe_file;
+
+	ret = bdev_deny_freeze(file_bdev(probe_file));
+	if (ret) {
+		bdev_fput(probe_file);
+		return ERR_PTR(ret);
+	}
+
+	bdev_file = bdev_file_open_by_dev(file_bdev(probe_file)->bd_dev,
+					  BLK_OPEN_WRITE, sb, &fs_holder_ops);
+	if (IS_ERR(bdev_file))
+		bdev_allow_freeze(file_bdev(probe_file));
+	bdev_fput(probe_file);
+	return bdev_file;
+}
+
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
 {
 	struct btrfs_root *root = fs_info->dev_root;
@@ -2840,8 +2870,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (sb_rdonly(sb) && !fs_devices->seeding)
 		return -EROFS;
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	/* Forbid freezing until the device is a committed member (or unwound). */
+	bdev_file = btrfs_open_device_deny_freeze(device_path, fs_info->sb);
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 
@@ -3006,8 +3036,10 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		up_write(&sb->s_umount);
 		locked = false;
 
-		if (ret) /* transaction commit */
+		if (ret) { /* transaction commit */
+			bdev_allow_freeze(file_bdev(bdev_file));
 			return ret;
+		}
 
 		ret = btrfs_relocate_sys_chunks(fs_info);
 		if (ret < 0)
@@ -3015,8 +3047,10 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 				    "Failed to relocate sys chunks after device initialization. This can be fixed using the \"btrfs balance\" command.");
 		trans = btrfs_attach_transaction(root);
 		if (IS_ERR(trans)) {
-			if (PTR_ERR(trans) == -ENOENT)
+			if (PTR_ERR(trans) == -ENOENT) {
+				bdev_allow_freeze(file_bdev(bdev_file));
 				return 0;
+			}
 			ret = PTR_ERR(trans);
 			trans = NULL;
 			goto error_sysfs;
@@ -3036,6 +3070,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	/* Update ctime/mtime for blkid or udev */
 	update_dev_time(device_path);
 
+	bdev_allow_freeze(file_bdev(bdev_file));
 	return ret;
 
 error_sysfs:
@@ -3065,7 +3100,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 error_free_device:
 	btrfs_free_device(device);
 error:
-	bdev_fput(bdev_file);
+	btrfs_release_device_allow_freeze(bdev_file);
 	if (locked) {
 		mutex_unlock(&uuid_mutex);
 		up_write(&sb->s_umount);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 60e82c15881a..75c7963f5d4c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -769,6 +769,8 @@ struct btrfs_device *btrfs_find_device(const struct btrfs_fs_devices *fs_devices
 				       const struct btrfs_dev_lookup_args *args);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
+struct file *btrfs_open_device_deny_freeze(const char *path,
+					   struct super_block *sb);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
 		  struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);

-- 
2.47.3


^ permalink raw reply related

* [PATCH RFC 4/6] btrfs: deny freezing a device while it is being removed
From: Christian Brauner @ 2026-06-15 13:18 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org>

btrfs_rm_device() runs under mnt_want_write_file(), but the claim on the
removed device is released by the ioctl after mnt_drop_write_file(), so a
bdev_freeze() racing that window could freeze the filesystem through the
device just as its claim is torn down, leaving nothing for bdev_thaw() to
rebalance.

The window cannot be closed by reordering the teardown.  btrfs_rm_device()
hands the final bdev_fput() back to the ioctl, run only after
mnt_drop_write_file(), because bdev_release() takes the disk ->open_mutex and
its dependency chain, which must not nest under the superblock's freeze/write
protection -- freeze_super() drops s_umount before draining writers precisely
to keep sb_start_write ordered above s_umount.  Holding mnt_want_write across
bdev_fput() would reintroduce that inversion, so the holder teardown is forced
outside the write-protected section.  A freeze landing in the resulting gap
resolves the still-live holder, rides in, and strands when the claim is
released; no ordering of the close against the drop removes the gap.  The
device itself therefore has to refuse freezing for the whole removal.

Deny freezing the device for the duration of the removal: bdev_deny_freeze()
at the start of btrfs_rm_device() (it cannot be frozen yet, the ioctl holds
the write count), and release it through btrfs_release_device_allow_freeze()
in the ioctls on success, or bdev_allow_freeze() on the error paths that keep
the device a member.  A device frozen before the removal begins is refused
with -EBUSY.
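
The mutual exclusion can be pictured with a toy counter model. This is an
illustration only, not part of the patch: the toy_ names are hypothetical,
the two counters are an assumption about how deny and freeze interact, and
returning -EBUSY from a denied freeze is likewise assumed here.

```c
#include <assert.h>
#include <errno.h>

/* Toy model only; not the real struct block_device. */
struct toy_bdev {
	int freeze_count;	/* active freezes */
	int deny_count;		/* active freeze denies */
};

/* Deny freezing; refused if the device is frozen right now. */
static int toy_deny_freeze(struct toy_bdev *bdev)
{
	if (bdev->freeze_count > 0)
		return -EBUSY;
	bdev->deny_count++;
	return 0;
}

/* Drop one deny; every allow must pair with an earlier successful deny. */
static void toy_allow_freeze(struct toy_bdev *bdev)
{
	assert(bdev->deny_count > 0);
	bdev->deny_count--;
}

/* Freeze; refused (assumed -EBUSY) while a membership change holds a deny. */
static int toy_freeze(struct toy_bdev *bdev)
{
	if (bdev->deny_count > 0)
		return -EBUSY;
	bdev->freeze_count++;
	return 0;
}
```

With this model a removal that has taken its deny makes toy_freeze() fail
until the matching allow, and a device that is already frozen makes
toy_deny_freeze() fail before the removal changes anything.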

btrfs_release_device_allow_freeze() yields the holder, re-allows freezing,
then closes the device, so the re-allow neither strands the filesystem on a
racing freeze nor touches the block device after the final fput.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/ioctl.c   |  4 ++--
 fs/btrfs/volumes.c | 20 ++++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b2e447f5005c..fc3e06445211 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2579,7 +2579,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 err_drop:
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		btrfs_release_device_allow_freeze(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 	kfree(vol_args);
@@ -2630,7 +2630,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)

 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		btrfs_release_device_allow_freeze(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 out_free:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..36f9835f65e3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1119,6 +1119,15 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
 	mutex_unlock(&uuid_mutex);
 }

+/* Release a device that was made unfreezable for a membership change. */
+void btrfs_release_device_allow_freeze(struct file *bdev_file)
+{
+	/* Yield before allow (strand-safe); file still open for the allow (UAF-safe). */
+	bdev_yield_claim(bdev_file);
+	bdev_allow_freeze(file_bdev(bdev_file));
+	bdev_fput(bdev_file);
+}
+
 static void btrfs_close_bdev(struct btrfs_device *device)
 {
 	if (!device->bdev)
@@ -2336,6 +2345,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 	    fs_info->fs_devices->rw_devices == 1)
 		return BTRFS_ERROR_DEV_ONLY_WRITABLE;

+	/* Removal and freezing are mutually exclusive; refuse if frozen now. */
+	if (device->bdev) {
+		ret = bdev_deny_freeze(device->bdev);
+		if (ret)
+			return ret;
+	}
+
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 		mutex_lock(&fs_info->chunk_mutex);
 		list_del_init(&device->dev_alloc_list);
@@ -2362,6 +2378,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 			   device->devid, ret);
 		btrfs_abort_transaction(trans, ret);
 		btrfs_end_transaction(trans);
+		if (device->bdev)
+			bdev_allow_freeze(device->bdev);
 		return ret;
 	}

@@ -2447,6 +2465,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 	return btrfs_commit_transaction(trans);

 error_undo:
+	if (device->bdev)
+		bdev_allow_freeze(device->bdev);
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 		mutex_lock(&fs_info->chunk_mutex);
 		list_add(&device->dev_alloc_list,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0082c166af91..60e82c15881a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -744,6 +744,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 struct btrfs_device *btrfs_scan_one_device(const char *path, bool mount_arg_dev);
 int btrfs_forget_devices(dev_t devt);
 void btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
+void btrfs_release_device_allow_freeze(struct file *bdev_file);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
 				     struct btrfs_device *this_dev);

-- 
2.47.3

^ permalink raw reply related

* [PATCH RFC 3/6] block: split bdev_yield_claim() out of bdev_fput()
From: Christian Brauner @ 2026-06-15 13:18 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org>

bdev_fput() yields the holder claim and then closes the file, which is a
deferred operation.  Split the yield half into bdev_yield_claim() so a caller
can give up the holder while the file - and therefore the block device - is
still open, act on the device, and only then bdev_fput().

A filesystem that made a device unfreezable for a membership change with
bdev_deny_freeze() undoes the deny on release with

	bdev_yield_claim(bdev_file);
	bdev_allow_freeze(file_bdev(bdev_file));
	bdev_fput(bdev_file);

Re-allowing only after the holder is yielded avoids stranding the filesystem
on a racing freeze, and doing it while the file is still open avoids touching
the block device after bdev_fput().  bdev_fput() yields again, which is a
no-op once the claim has already been given up.
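
The ordering can be pictured with a toy model. This is an illustration
only, not part of the patch: the toy_ names and fields are hypothetical and
model a single holder with plain counters instead of the real locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; not the real struct block_device or struct file. */
struct toy_bdev {
	int deny_count;		/* active freeze denies */
	int holders;		/* published holder claims */
	int openers;		/* device may only be touched while > 0 */
};

struct toy_file {
	struct toy_bdev *bdev;
	bool holds_claim;
	bool open;
};

/* Give up the holder claim but keep the file open; idempotent. */
static void toy_yield_claim(struct toy_file *f)
{
	if (!f->holds_claim)
		return;		/* a second yield is a no-op */
	f->holds_claim = false;
	f->bdev->holders--;
}

/* Must run while the device is still open and a deny is held. */
static void toy_allow_freeze(struct toy_bdev *bdev)
{
	assert(bdev->openers > 0);
	assert(bdev->deny_count > 0);
	bdev->deny_count--;
}

/* Yield (again) and drop the open reference. */
static void toy_fput(struct toy_file *f)
{
	toy_yield_claim(f);
	f->open = false;
	f->bdev->openers--;
}

/* The three-step release sequence: yield, allow, put. */
static void toy_release_allow_freeze(struct toy_file *f)
{
	toy_yield_claim(f);
	assert(!f->holds_claim);	/* no claim left for a racing freeze */
	toy_allow_freeze(f->bdev);	/* file still open here */
	toy_fput(f);
}
```

Swapping the last two steps in this model trips the openers assertion in
toy_allow_freeze(), which mirrors touching the block device after the final
put.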

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 block/bdev.c           | 30 ++++++++++++++++++++++--------
 include/linux/blkdev.h |  1 +
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 939dec351772..e59052c2a081 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1199,18 +1199,16 @@ void bdev_release(struct file *bdev_file)
 }
 
 /**
- * bdev_fput - yield claim to the block device and put the file
+ * bdev_yield_claim - give up the holder claim on an open block device
  * @bdev_file: open block device
  *
- * Yield claim on the block device and put the file. Ensure that the
- * block device can be reclaimed before the file is closed which is a
- * deferred operation.
+ * Yield the holder and any write access for @bdev_file without closing it, so
+ * the caller can still act on the device - e.g. bdev_allow_freeze() it - before
+ * the final bdev_fput().  bdev_fput() yields too, so calling it afterwards is
+ * safe.
  */
-void bdev_fput(struct file *bdev_file)
+void bdev_yield_claim(struct file *bdev_file)
 {
-	if (WARN_ON_ONCE(bdev_file->f_op != &def_blk_fops))
-		return;
-
 	if (bdev_file->private_data) {
 		struct block_device *bdev = file_bdev(bdev_file);
 		struct gendisk *disk = bdev->bd_disk;
@@ -1226,7 +1224,23 @@ void bdev_fput(struct file *bdev_file)
 		bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
 		mutex_unlock(&disk->open_mutex);
 	}
+}
+EXPORT_SYMBOL_GPL(bdev_yield_claim);
+
+/**
+ * bdev_fput - yield claim to the block device and put the file
+ * @bdev_file: open block device
+ *
+ * Yield claim on the block device and put the file. Ensure that the
+ * block device can be reclaimed before the file is closed which is a
+ * deferred operation.
+ */
+void bdev_fput(struct file *bdev_file)
+{
+	if (WARN_ON_ONCE(bdev_file->f_op != &def_blk_fops))
+		return;
 
+	bdev_yield_claim(bdev_file);
 	fput(bdev_file);
 }
 EXPORT_SYMBOL(bdev_fput);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf1951caadb2..9fc16e3c8075 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1832,6 +1832,7 @@ int bdev_thaw(struct block_device *bdev);
 int bdev_deny_freeze(struct block_device *bdev);
 void bdev_allow_freeze(struct block_device *bdev);
 void bdev_fput(struct file *bdev_file);
+void bdev_yield_claim(struct file *bdev_file);
 
 struct io_comp_batch {
 	struct rq_list req_list;

-- 
2.47.3


^ permalink raw reply related
