* [REPORT] nvmet-rdma: integer overflow in inline-data SGL bounds check -> pre-auth kernel-memory read + remote crash (candidate patch inline)
From: hexlabsecurity @ 2026-05-29 6:52 UTC (permalink / raw)
To: security@kernel.org
Cc: hch@lst.de, sagi@grimberg.me, kbusch@kernel.org, kch@nvidia.com,
linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
linux-block@vger.kernel.org
Hello,
I would like to report an integer-overflow vulnerability in the NVMe-oF
RDMA target (drivers/nvme/target/rdma.c). The inline-data SGL bounds
check in nvmet_rdma_map_sgl_inline() is computed in u64 over two
host-controlled values and wraps, which a remote fabric peer can use
both to read kernel memory back over the fabric and to crash the target.
== Affected ==
drivers/nvme/target/rdma.c, nvmet_rdma_map_sgl_inline()
Verified present on the current mainline tree (commit 27fa82620cba,
~v7.1-rc5), at the bounds check:
static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
{
struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
u64 off = le64_to_cpu(sgl->addr); /* host-controlled, 64-bit */
u32 len = le32_to_cpu(sgl->length); /* host-controlled, 32-bit */
...
if (off + len > rsp->queue->dev->inline_data_size) { /* u64 wrap */
pr_err("invalid inline data offset!\n");
return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
}
...
nvmet_rdma_use_inline_sg(rsp, len, off);
}
"off + len" is evaluated in u64 and wraps modulo 2^64. For example
addr = 0xfffffffffffffe00, length = 0x1000 makes the sum wrap to
0xe00, which is <= inline_data_size (default PAGE_SIZE), so the check
passes. The current check form (against the per-port inline_data_size)
and the fixed-size inline_sg[NVMET_RDMA_MAX_INLINE_SGE] array with the
num_pages(len) loop were introduced together by commit 0d5ee2b2ab4f
("nvmet-rdma: support max(16KB, PAGE_SIZE) inline data"), which is the
Fixes: I used. Note: the single-page inline path that predates that
commit may have an analogous u64-overflow read in a different code
shape; I would appreciate the maintainers' judgement on whether the
stable backport scope should reach before that commit.
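The wrap arithmetic above can be checked in userspace. The following is a minimal sketch, not driver code: naive_check_passes() and wrapped_sum() are hypothetical helper names, and the fixed-width types stand in for the kernel's u64/u32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the unfixed comparison: the sum is formed in 64 bits and may wrap. */
static inline uint64_t wrapped_sum(uint64_t off, uint32_t len)
{
	return off + len;	/* unsigned addition, modulo 2^64 */
}

/* Returns true when the unfixed bounds check would accept the descriptor. */
static inline bool naive_check_passes(uint64_t off, uint32_t len,
				      uint32_t inline_data_size)
{
	return !(wrapped_sum(off, len) > inline_data_size);
}
```

With inline_data_size = 4096, the descriptor addr = 0xfffffffffffffe00, length = 0x1000 is accepted by this model because the sum wraps to 0xe00.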
== Two consequences of the bypass ==
1. Kernel-memory read (information disclosure).
nvmet_rdma_use_inline_sg() does "sg->offset = off", truncating the
64-bit offset to scatterlist::offset (unsigned int). The block
layer then accesses page_to_phys(inline_page) + (off & 0xffffffff),
so the target reads up to inline_data_size bytes of kernel memory
per write command and returns them to the host on read-back, or
faults the in-kernel copy if the offset lands on unmapped memory.
2. Kernel-memory corruption -> remote crash (denial of service).
A large length makes "sg_count = num_pages(len)" in
nvmet_rdma_use_inline_sg() exceed NVMET_RDMA_MAX_INLINE_SGE (4), so
the loop writes scatterlist entries past the fixed-size inline_sg[]
array, corrupting the surrounding command object.
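Both consequences reduce to simple arithmetic, sketched below in userspace C. This is an illustration under assumptions: 4 KiB pages, a 32-bit unsigned int, and pages_for_len() as a simplified stand-in for the driver's num_pages(). All helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1u << PAGE_SHIFT)
#define MAX_INLINE_SGE	4	/* NVMET_RDMA_MAX_INLINE_SGE, per the report */

/* Storing a 64-bit offset into an unsigned int keeps only the low bits. */
static inline unsigned int truncated_sg_offset(uint64_t off)
{
	return (unsigned int)off;
}

/* Number of pages needed to cover len bytes (simplified num_pages()). */
static inline unsigned int pages_for_len(uint32_t len)
{
	return (unsigned int)(((uint64_t)len + PAGE_SIZE - 1) >> PAGE_SHIFT);
}

/* How many scatterlist entries would land past a 4-entry array. */
static inline unsigned int entries_past_end(uint32_t len)
{
	unsigned int n = pages_for_len(len);

	return n > MAX_INLINE_SGE ? n - MAX_INLINE_SGE : 0;
}
```

For addr = 0xfffffffffffffe00 the truncated offset is 0xfffffe00, and for length = 0x10000 the model yields 16 entries, 12 of them past the end of the array.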
== Reachability ==
The path is reached by any write command carrying an inline SGL, i.e.
after a Fabrics Connect. On a subsystem configured with
attr_allow_any_host=1 it is reachable WITHOUT authentication by any
RDMA peer (RoCE/iWARP/IB) that can reach the target's listener. With
DH-CHAP configured, or attr_allow_any_host=0 with an unknown host NQN,
a valid/known host NQN is required first.
== Empirical reproduction ==
Reproduced against a stock nvmet-rdma target over a soft-iWARP (siw)
fabric on a Linux 6.12.90 build with KASAN (KASAN_INLINE):
- Read: a single write command with addr = 0xfffffffffffffe00,
length = 0x1000 produced a KASAN out-of-bounds read and returned
~4 KiB of kernel memory (including kernel .text) into the
attacker-readable namespace.
- Crash: a write command with addr = 0xffffffffffff0500,
length = 0x10000 (sum wraps to 0x500 <= inline_data_size, but
num_pages(0x10000) = 16 writes 16 scatterlist entries into the
4-entry inline_sg[], 12 past its end) deterministically corrupted
the command object and oopsed the target:
Oops: general protection fault [...] KASAN: null-ptr-deref
RIP: nvmet_rdma_post_recv+0x... [nvmet_rdma]
nvmet_rdma_post_recv <- nvmet_rdma_queue_response
<- __nvmet_req_complete <- nvmet_check_transfer_len
<- nvmet_rdma_handle_command <- ib_cq_poll_work
Every reconnect re-triggers it (persistent remote DoS). The
nvmet_rdma_cmd objects are carved from one contiguous kcalloc'd
array, so the over-long entry write stays within that allocation and
KASAN flags the downstream dereference of the corrupted command in
nvmet_rdma_post_recv rather than the store itself. The out-of-bounds
content is not attacker-controlled, so this is a crash/corruption
primitive, not a controlled write; I do not see a path to remote code
execution from this bug.
Severity estimate. The two consequences arise from different inline-SGL
capsules (small vs large length) and are scored as separate single-capsule
outcomes, not one combined vector:
OOB read (info-disclosure): CVSS 7.5 HIGH
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N
OOB write (corruption/DoS): CVSS 8.2 HIGH
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:H
Headline 8.2 HIGH (both reachable pre-auth with attr_allow_any_host=1).
With attr_allow_any_host=0 a valid host NQN is required first (PR:L),
lowering these to 6.5 and 7.1.
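The four scores above can be recomputed from the CVSS v3.1 base-score equations for Scope: Unchanged. The sketch below hard-codes AV:N/AC:L/UI:N and uses the metric weights from the CVSS v3.1 specification; the function names are hypothetical, and scores are returned in tenths to avoid floating-point comparisons.

```c
/* CVSS v3.1 metric weights for the values used in the report (S:U). */
#define AV_NETWORK	0.85
#define AC_LOW		0.77
#define PR_NONE		0.85
#define PR_LOW		0.62
#define UI_NONE		0.85
#define CIA_HIGH	0.56
#define CIA_LOW		0.22
#define CIA_NONE	0.0

/* Spec "Roundup": smallest one-decimal value >= x, returned in tenths. */
static inline int roundup_tenths(double x)
{
	long scaled = (long)(x * 100000.0 + 0.5);

	if (scaled % 10000 == 0)
		return (int)(scaled / 10000);
	return (int)(scaled / 10000) + 1;
}

/* Base score in tenths for AV:N/AC:L/UI:N/S:U with the given PR and C/I/A. */
static inline int cvss31_base_tenths(double pr, double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double impact = 6.42 * iss;
	double exploitability = 8.22 * AV_NETWORK * AC_LOW * pr * UI_NONE;
	double sum = impact + exploitability;

	if (impact <= 0.0)
		return 0;
	if (sum > 10.0)
		sum = 10.0;
	return roundup_tenths(sum);
}
```

Calling cvss31_base_tenths(PR_NONE, CIA_HIGH, CIA_NONE, CIA_NONE) corresponds to the OOB-read vector, and swapping in PR_LOW gives the attr_allow_any_host=0 variant.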
== Suggested fix ==
Validate the offset with check_add_overflow() before comparing against
inline_data_size. A passing check then guarantees
off + len <= inline_data_size <= NVMET_RDMA_MAX_INLINE_DATA_SIZE, which
bounds both the truncated scatterlist::offset and
num_pages(len) <= NVMET_RDMA_MAX_INLINE_SGE, closing the read and the
inline_sg[] overflow together. Candidate patch inline below (applies
to current mainline).
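The effect of the overflow-checked comparison can be modelled outside the kernel. This is a sketch under assumptions: it uses the GCC/Clang builtin __builtin_add_overflow() in place of the kernel's check_add_overflow(), and inline_sgl_ok() is a hypothetical name, not a driver function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true only if off + len does not wrap and fits the inline buffer. */
static inline bool inline_sgl_ok(uint64_t off, uint32_t len,
				 uint32_t inline_data_size)
{
	uint64_t bound;

	if (__builtin_add_overflow(off, (uint64_t)len, &bound))
		return false;	/* sum wrapped past 2^64: reject */
	return bound <= inline_data_size;
}
```

With inline_data_size = 4096, both capsules from the reproduction section are rejected by this model, while an in-range descriptor such as off = 0, len = 4096 is still accepted.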
== Embargo ==
I am happy to follow the standard process. Proposing a 7-day embargo;
the fix is small and I can adjust as the maintainers prefer. I have
not notified linux-distros and will hold that until a public patch
lands, per the usual guidance.
I am an independent security researcher; please credit
"Bryam Vargas <hexlabsecurity@proton.me>" (Reported-by already in the
patch). Affiliation: HEXLAB SAS (registration pending) -- Cali,
Colombia. Happy to provide the full reproduction harness on request.
Thank you,
Bryam Vargas
----- candidate patch (inline, plain text) -----
From 448c122c744430c1c2926d635855a3894370ee33 Mon Sep 17 00:00:00 2001
From: Bryam Vargas <hexlabsecurity@proton.me>
Date: Thu, 28 May 2026 21:23:52 -0500
Subject: [PATCH] nvmet-rdma: fix integer overflow in inline data SGL bounds
check
nvmet_rdma_map_sgl_inline() bounds-checks the inline data descriptor
with both operands host-controlled and the sum evaluated in u64:
u64 off = le64_to_cpu(sgl->addr);
u32 len = le32_to_cpu(sgl->length);
...
if (off + len > rsp->queue->dev->inline_data_size)
return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
"off + len" therefore wraps modulo 2^64. A descriptor with, for
example, addr = 0xfffffffffffffe00 and length = 0x1000 makes the sum
wrap to 0xe00, which passes the inline_data_size check. An inline-SGL
write command reaches this path after a Fabrics Connect; on a subsystem
with attr_allow_any_host set it is reachable without authentication by
any peer that can reach the target.
Two distinct out-of-bounds accesses follow from the bypass:
- nvmet_rdma_use_inline_sg() stores the 64-bit offset into
scatterlist::offset, which is unsigned int, committing the truncated
attacker offset to the inline page. The block layer then accesses
page_to_phys(inline_page) + (off & 0xffffffff), reading up to
inline_data_size bytes of kernel memory per command back to the host
(or faulting the target if the offset lands on unmapped memory).
- A large len makes sg_count = num_pages(len) in
nvmet_rdma_use_inline_sg() exceed NVMET_RDMA_MAX_INLINE_SGE, so the
loop writes scatterlist entries past the fixed-size inline_sg[]
array, corrupting the surrounding command object and oopsing the
target on the next use of that command.
Validate the offset with check_add_overflow() before comparing against
inline_data_size. A passing check then guarantees
off + len <= inline_data_size <= NVMET_RDMA_MAX_INLINE_DATA_SIZE, which
bounds both the truncated scatterlist::offset and
num_pages(len) <= NVMET_RDMA_MAX_INLINE_SGE, closing the out-of-bounds
read and the inline_sg[] overflow together.
Reported-by: Bryam Vargas <hexlabsecurity@proton.me>
Fixes: 0d5ee2b2ab4f ("nvmet-rdma: support max(16KB, PAGE_SIZE) inline data")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
Review context (not for the commit log):
Reproducer -- unprivileged remote RDMA peer against a target with
attr_allow_any_host=1, a single inline-SGL WRITE capsule:
* OOB read: sgl->addr=0xfffffffffffffe00, sgl->length=0x1000
(off+len wraps to 0xe00 <= inline_data_size; sg->offset
truncates to 0xfffffe00) -> ~4 KiB of kernel memory is
read back from the namespace.
* OOB write: sgl->addr=0xffffffffffff0500, sgl->length=0x10000
(num_pages(0x10000)=16 overruns the 4-entry inline_sg[])
-> target memory corruption / crash.
A/B-tested on a 6.12.90 KASAN lab kernel (same .config, only this hunk
differs): pre-fix the OOB-read capsule trips "KASAN: use-after-free in
copy_page_from_iter_atomic" via nvmet_file_execute_io; post-fix both
capsules are rejected with "invalid inline data offset!"
(NVME_SC_SGL_INVALID_OFFSET), benign inline writes still succeed, and no
KASAN/oops fires. The fix decides identically in 32- and 64-bit builds
(check_add_overflow operates on u64).
drivers/nvme/target/rdma.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index e6e2c3f9afdf..a5bbf9d41c3b 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -12,6 +12,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/nvme.h>
+#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/wait.h>
@@ -847,6 +848,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
u64 off = le64_to_cpu(sgl->addr);
u32 len = le32_to_cpu(sgl->length);
+ u64 bound;
if (!nvme_is_write(rsp->req.cmd)) {
rsp->req.error_loc =
@@ -854,7 +856,8 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
}
- if (off + len > rsp->queue->dev->inline_data_size) {
+ if (check_add_overflow(off, (u64)len, &bound) ||
+ bound > rsp->queue->dev->inline_data_size) {
pr_err("invalid inline data offset!\n");
return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
}
--
2.43.0
* Re: [PATCH v7 32/43] btrfs: implement process_bio cb for fscrypt
From: Daniel Vacek @ 2026-05-29 15:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chris Mason, Josef Bacik, Eric Biggers, Theodore Y. Ts'o,
Jaegeuk Kim, Jens Axboe, David Sterba, linux-block, linux-fscrypt,
linux-btrfs, linux-kernel
In-Reply-To: <ahAfo4DzvH_ob1hv@infradead.org>
On Fri, 22 May 2026 at 11:19, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, May 13, 2026 at 10:53:06AM +0200, Daniel Vacek wrote:
> > From: Josef Bacik <josef@toxicpanda.com>
> >
> > We are going to be checksumming the encrypted data, so we have to
> > implement the ->process_bio fscrypt callback. This will provide us with
> > the original bio and the encrypted bio to do work on. For WRITE's this
> > will happen after the encrypted bio has been encrypted. For READ's this
> > will happen after the read has completed and before the decryption step
> > is done.
> >
> > For write's this is straightforward, we can just pass in the encrypted
> > bio to btrfs_csum_one_bio and then the csums will be added to the bbio
> > as normal.
> >
> > For read's this is relatively straightforward, but requires some care.
> > We assume (because that's how it works currently) that the encrypted bio
> > match the original bio, this is important because we save the iter of
> > the bio before we submit. If this changes in the future we'll need a
> > hook to give us the bi_iter of the decryption bio before it's submitted.
> > We check the csums before decryption. If it doesn't match we simply
> > error out and we let the normal path handle the repair work.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Daniel Vacek <neelx@suse.com>
> > ---
> >
> > v7 changes:
> > * Fixed array overflow stack corruption for bios > max blocksize (>64KiB)
> > as reported by Chris' AI review.
> > v6 changes:
> > * Adapt to btrfs_data_csum_ok() changes for bs > ps. Mostly follow
> > what was done in 052fd7a5cace ("btrfs: make read verification
> > handle bs > ps cases without large folios").
> > * Rename bbio::csum_done to csum_ok due to name collision.
> > With upstream, member name csum_done was used for async csums.
> > v5: https://lore.kernel.org/linux-btrfs/ca32684b01ff8c252be515509137e0a4a0e5db7a.1706116485.git.josef@toxicpanda.com/
> > ---
> > fs/btrfs/bio.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> > fs/btrfs/bio.h | 3 +++
> > fs/btrfs/file-item.c | 14 ++++++++++++--
> > fs/btrfs/fscrypt.c | 29 +++++++++++++++++++++++++++++
> > 4 files changed, 87 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> > index 3e2ee19aab50..729c5aff5c3d 100644
> > --- a/fs/btrfs/bio.c
> > +++ b/fs/btrfs/bio.c
> > @@ -301,6 +301,40 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
> > return fbio;
> > }
> >
> > +blk_status_t btrfs_check_encrypted_read_bio(struct btrfs_bio *bbio, struct bio *enc_bio)
> > +{
> > + struct btrfs_inode *inode = bbio->inode;
> > + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > + struct bvec_iter iter = bbio->saved_iter;
> > + struct btrfs_device *dev = bbio->bio.bi_private;
> > + const u32 blocksize = fs_info->sectorsize;
> > + const u32 step = min(blocksize, PAGE_SIZE);
> > + const u32 nr_steps = iter.bi_size / step;
> > + phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
> > + phys_addr_t paddr;
> > + unsigned int slot = 0;
> > + u32 offset = 0;
> > +
> > + /*
> > + * We have to use a copy of iter in case there's an error,
> > + * btrfs_check_read_bio will handle submitting the repair bios.
> > + */
> > + btrfs_bio_for_each_block(paddr, enc_bio, &iter, step) {
> > + ASSERT(slot < nr_steps);
> > + paddrs[slot] = paddr;
> > + slot++;
> > + offset += step;
> > + if (IS_ALIGNED(offset, blocksize)) {
> > + if (!btrfs_data_csum_ok(bbio, dev, offset - blocksize, paddrs))
> > + return BLK_STS_IOERR;
> > + slot = 0;
> > + }
> > + }
> > +
> > + bbio->csum_ok = true;
> > + return BLK_STS_OK;
> > +}
> > +
> > static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *dev)
> > {
> > struct btrfs_inode *inode = bbio->inode;
> > @@ -330,6 +364,10 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
> > /* Clear the I/O error. A failed repair will reset it. */
> > bbio->bio.bi_status = BLK_STS_OK;
> >
> > + /* This was an encrypted bio and we've already done the csum check. */
> > + if (status == BLK_STS_OK && bbio->csum_ok)
> > + goto out;
> > +
> > btrfs_bio_for_each_block(paddr, &bbio->bio, iter, step) {
> > paddrs[(offset / step) % nr_steps] = paddr;
> > offset += step;
> > @@ -341,6 +379,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
> > paddrs, fbio);
> > }
> > }
> > +out:
> > if (bbio->csum != bbio->csum_inline)
> > kvfree(bbio->csum);
> >
> > @@ -859,10 +898,13 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> > /*
> > * Csum items for reloc roots have already been cloned at this
> > * point, so they are handled as part of the no-checksum case.
> > + *
> > + * Encrypted inodes are csum'ed via the ->process_bio callback.
> > */
> > if (!(inode->flags & BTRFS_INODE_NODATASUM) &&
> > !test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state) &&
> > - !btrfs_is_data_reloc_root(inode->root) && !bbio->is_remap) {
> > + !btrfs_is_data_reloc_root(inode->root) && !bbio->is_remap &&
> > + !IS_ENCRYPTED(&inode->vfs_inode)) {
> > if (should_async_write(bbio) &&
> > btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
> > goto done;
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index 43f7544029ac..456d32db9e9e 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -43,6 +43,7 @@ struct btrfs_bio {
> > struct {
> > u8 *csum;
> > u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> > + bool csum_ok;
> > struct bvec_iter saved_iter;
> > };
> >
> > @@ -130,5 +131,7 @@ void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, bool dev_
> > int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 fileoff,
> > u32 length, u64 logical, const phys_addr_t paddrs[],
> > unsigned int step, int mirror_num);
> > +blk_status_t btrfs_check_encrypted_read_bio(struct btrfs_bio *bbio,
> > + struct bio *enc_bio);
> >
> > #endif
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 986914078708..72d9d3243460 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -338,6 +338,14 @@ static int search_csum_tree(struct btrfs_fs_info *fs_info,
> > return ret;
> > }
> >
> > +static inline bool inode_skip_csum(struct btrfs_inode *inode)
> > +{
> > + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +
> > + return (inode->flags & BTRFS_INODE_NODATASUM) ||
> > + test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state);
> > +}
> > +
> > /*
> > * Lookup the checksum for the read bio in csum tree.
> > *
> > @@ -357,8 +365,7 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
> > int ret = 0;
> > u32 bio_offset = 0;
> >
> > - if ((inode->flags & BTRFS_INODE_NODATASUM) ||
> > - test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state))
> > + if (inode_skip_csum(inode))
> > return 0;
> >
> > /*
> > @@ -817,6 +824,9 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio, struct bio *bio, bool async)
> > struct btrfs_ordered_sum *sums;
> > unsigned nofs_flag;
> >
> > + if (inode_skip_csum(inode))
> > + return 0;
> > +
> > nofs_flag = memalloc_nofs_save();
> > sums = kvzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size),
> > GFP_KERNEL);
> > diff --git a/fs/btrfs/fscrypt.c b/fs/btrfs/fscrypt.c
> > index 5d34a8b94da5..924ee3df7f32 100644
> > --- a/fs/btrfs/fscrypt.c
> > +++ b/fs/btrfs/fscrypt.c
> > @@ -16,6 +16,7 @@
> > #include "transaction.h"
> > #include "volumes.h"
> > #include "xattr.h"
> > +#include "file-item.h"
> >
> > /*
> > * From a given location in a leaf, read a name into a qstr (usually a
> > @@ -212,6 +213,33 @@ static struct block_device **btrfs_fscrypt_get_devices(struct super_block *sb,
> > return devs;
> > }
> >
> > +static blk_status_t btrfs_process_encrypted_bio(struct bio *orig_bio,
> > + struct bio *enc_bio)
> > +{
> > + struct btrfs_bio *bbio;
> > +
> > + /*
> > + * If our bio is from the normal fs_bio_set then we know this is a
> > + * mirror split and we can skip it, we'll get the real bio on the last
> > + * mirror and we can process that one.
> > + */
> > + if (orig_bio->bi_pool == &fs_bio_set)
> > + return BLK_STS_OK;
> > +
> > + bbio = btrfs_bio(orig_bio);
> > +
> > + if (bio_op(orig_bio) == REQ_OP_READ) {
> > + /*
> > + * We have ->saved_iter based on the orig_bio, so if the block
> > + * layer changes we need to notice this asap so we can update
> > + * our code to handle the new world order.
> > + */
> > + ASSERT(orig_bio == enc_bio);
> > + return btrfs_check_encrypted_read_bio(bbio, enc_bio);
> > + }
> > + return btrfs_csum_one_bio(bbio, enc_bio, false);
>
> Honestly, all this shows that the architecture of the I/O path in this
> series is pretty broken. It needs all this magic detection, and the
> passing of arguments that mixes the bbio for state and the lower
> encrypted bio without the btrfs context shows something doesn't work
> well.
Well, this is all limited within the scope of the filesystem. Since
btrfs needs to compute the data checksum and the bounce bio (with the
encrypted pages) is created by the lower fscrypt layer, how else could
we accomplish this?
As the blk-crypto is inlined, without the callback the filesystem
never sees the encrypted data at all and it won't be able to get
checksums.
> So let's take a step back, if we think of the I/O pipeline, it should do
> things in this order for writes:
>
> - encrypt data
> - generate checksums
> - do mirroring/striping/parity
>
> and reverse for reads.
>
> All this suggest that the btrfs_bio needs to exist for the encrypted
> data.
My understanding was that fscrypt works differently. The bounce bio is
created inline in the lower layer, agnostic to any filesystem.
> So I think you'll need to refactor this, preferably without the
> really annoying two-level callbacks that are really hard to follow (or
> implement). Your caller is in the file system, and it should be able to
> call fscrypt as helpers instead of going two layers down using direct
> calls and then two layers back up using indirect calls. The recent
> refactoring that moves the fscrypt fallback above the block layer
> instead of calling it from the bottom should help a lot with that.
Yeah, I may look into that. What you're talking about is a pretty recent change.
This is an old patch [1] from 2023, rebased without many changes since,
as there was not much feedback before. So it still follows the
original (former) design.
If fscrypt supported checksumming the encrypted data and returned the
value back to the filesystem, no callbacks would be needed. Though to
me that sounds more invasive than this callback.
[1] https://lore.kernel.org/linux-btrfs/a26514814b4d2a54ff2317369365dc2bf1c280dc.1695750478.git.josef@toxicpanda.com/
--nX
* Re: [REPORT] nvmet-rdma: integer overflow in inline-data SGL bounds check -> pre-auth kernel-memory read + remote crash (candidate patch inline)
From: Keith Busch @ 2026-05-29 16:09 UTC (permalink / raw)
To: hexlabsecurity
Cc: security@kernel.org, hch@lst.de, sagi@grimberg.me, kch@nvidia.com,
linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
linux-block@vger.kernel.org
In-Reply-To: <LM21QIR-1-qJb7PViyJKCnGBnUzizeiNJVWQ3wb7ZwGezodjgKg3f-iobqOyequ-sT1jFCKJImfqNO_BKU3KO80xFITnaI5GTV_GxLUNDDc=@proton.me>
On Fri, May 29, 2026 at 06:52:13AM +0000, hexlabsecurity@proton.me wrote:
> @@ -847,6 +848,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
> struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
> u64 off = le64_to_cpu(sgl->addr);
> u32 len = le32_to_cpu(sgl->length);
> + u64 bound;
>
> if (!nvme_is_write(rsp->req.cmd)) {
> rsp->req.error_loc =
> @@ -854,7 +856,8 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
> return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> }
>
> - if (off + len > rsp->queue->dev->inline_data_size) {
> + if (check_add_overflow(off, (u64)len, &bound) ||
> + bound > rsp->queue->dev->inline_data_size) {
Since you don't use "bound" for anything other than the final check, I
think we can make this simpler without it:
if (off > rsp->queue->dev->inline_data_size ||
len > rsp->queue->dev->inline_data_size - off) {
Thanks for the report.
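The subtraction form suggested above avoids the addition entirely: once off <= inline_data_size is established, inline_data_size - off cannot underflow. A minimal userspace sketch, with hypothetical helper names and the GCC/Clang builtin __builtin_add_overflow() as the reference, compares the two forms:

```c
#include <stdbool.h>
#include <stdint.h>

/* Subtraction form: no sum is computed, so nothing can wrap. */
static inline bool inline_sgl_ok_sub(uint64_t off, uint32_t len, uint32_t size)
{
	return !(off > size || len > size - off);
}

/* Reference form: overflow-checked addition, then the bound comparison. */
static inline bool inline_sgl_ok_ref(uint64_t off, uint32_t len, uint32_t size)
{
	uint64_t bound;

	if (__builtin_add_overflow(off, (uint64_t)len, &bound))
		return false;
	return bound <= size;
}

/* True when both forms reach the same accept/reject decision. */
static inline bool forms_agree(uint64_t off, uint32_t len, uint32_t size)
{
	return inline_sgl_ok_sub(off, len, size) ==
	       inline_sgl_ok_ref(off, len, size);
}
```

The two forms are expected to agree for every input, including the wrapping descriptors from the report; the sketch only spot-checks this and is not a proof.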
* [GIT PULL] Block fix for 7.1-rc6
From: Jens Axboe @ 2026-05-29 17:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-block@vger.kernel.org
Hi Linus,
Just a single fix for the block side, making a slight tweak to a fix
from this cycle. Please pull!
The following changes since commit f6982769910ecddabdb5b8b9afdab0bb8b6668ac:
block: avoid use-after-free in disk_free_zone_resources() (2026-05-22 08:01:52 -0600)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git block-7.1-20260529
for you to fetch changes up to b051bb6bf0a231117036aa607cadf55be8e63910:
blk-mq: reinsert cached request to the list (2026-05-26 15:05:30 -0600)
----------------------------------------------------------------
block-7.1-20260529
----------------------------------------------------------------
Keith Busch (1):
blk-mq: reinsert cached request to the list
block/blk-mq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--
Jens Axboe
* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-29 17:13 UTC (permalink / raw)
To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
In-Reply-To: <043e357f-5b37-4e05-9433-271504fc1d30@fygo.io>
On 2026-05-25 00:28, Yu Kuai wrote:
> On 2026/5/22 5:52, Jens Axboe wrote:
> Yes, perf data will be helpful. And please show your test in detail
> and I'll check if I can reproduce it.
Hi Yu Kuai,
Have you reproduced the issue yet?
Below is some perf data we took while running random read test:
Test:
FIO random read with qdepth=1, nj=20. We saw higher CPU utilization in
this test case.
Perf record:
Start the fio run in one session and kick off the script in another
session while the test is running.
Perf report:
With blk_start_plug/blk_finish_plug before calling __submit_bio() in
blk-core.c:
Top.txt
2.41% fio [kernel.kallsyms] [k] cpupri_set
1.16% fio [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.75% fio [kernel.kallsyms] [k] sbitmap_find_bit
0.47% fio [kernel.kallsyms] [k] set_next_task_rt
0.41% fio [kernel.kallsyms] [k] pull_rt_task
0.34% fio [kernel.kallsyms] [k] enqueue_pushable_task
…
0.02% fio [kernel.kallsyms] [k] __blk_flush_plug
0.01% fio [kernel.kallsyms] [k] blk_add_rq_to_plug
0.01% fio [kernel.kallsyms] [k] blk_mq_flush_plug_list
0.00% fio [kernel.kallsyms] [k] blk_attempt_plug_merge
Callgraph.txt
2.41% fio [kernel.kallsyms] [k] cpupri_set
|
---cpupri_set
|
|--1.15%--__enqueue_rt_entity
| enqueue_task_rt
| enqueue_task
| ttwu_do_activate
Perf report
Without blk_start_plug and blk_finish_plug before calling
__submit_bio():
Top.txt
0.67% fio [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.64% fio [kernel.kallsyms] [k] sched_balance_newidle
0.47% fio [kernel.kallsyms] [k] _raw_spin_lock
0.39% fio [kernel.kallsyms] [k] sbitmap_find_bit
0.35% fio [kernel.kallsyms] [k] cpupri_set
0.28% fio [kernel.kallsyms] [k] work_grab_pending
0.24% fio [kernel.kallsyms] [k] lookup_ioctx
0.23% fio [kernel.kallsyms] [k] __schedule
…
…
0.00% fio [kernel.kallsyms] [k] blk_attempt_plug_merge
Call graph.txt:
0.35% fio [kernel.kallsyms] [k] cpupri_set
|
---cpupri_set
|
|--0.17%--arch_local_irq_restore.part.0
| |
| |--0.14%--finish_task_switch.isra.0
| | __schedule
| | |
| | |--0.13%--schedule
| | | |
| | | |--0.07%--read_events
…..
|--0.13%--__enqueue_rt_entity
| enqueue_task_rt
| enqueue_task
| ttwu_do_activate
From the above perf data, it looks like:
1. More time is spent in cpupri_set(): tasks are being enqueued/dequeued
more frequently, i.e. more IO scheduling.
2. More plug routines are called.
If you need the full perf data report, I can email/attach it.
Thanks for your help!
Wendy
* Re: [PATCH 6.12 000/272] 6.12.92-rc1 review
From: Florian Fainelli @ 2026-05-29 19:33 UTC (permalink / raw)
To: Sasha Levin, Miguel Ojeda
Cc: gregkh, achill, akpm, broonie, conor, hargar, jonathanh,
linux-kernel, linux, lkft-triage, patches, patches, pavel,
rwarsow, shuah, sr, stable, sudipm.mukherjee, torvalds,
Anuj Gupta, Kanchan Joshi, Christoph Hellwig, Keith Busch,
Jens Axboe, linux-block
In-Reply-To: <20260529122623.bio-integrity-rc-prereq@kernel.org>
On 5/29/26 05:44, Sasha Levin wrote:
> On Fri, May 29, 2026 at 10:27:21AM +0200, Pavel Machek wrote:
>>> I am seeing:
>>>
>>> ./include/linux/bio-integrity.h:101:12: error: unused function 'bio_integrity_map_user' [-Werror,-Wunused-function]
>>>
>>> This looks like it needs:
>>>
>>> 546d191427cf ("block: make bio_integrity_map_user() static inline")
>>>
>> We see that, too:
>> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/jobs/14592368004
>> We don't see the problem on 6.6, 6.18 or 7.0-stable.
>
> Thanks! I've queued up 546d191427cf ("block: make bio_integrity_map_user()
> static inline").
Thanks, also seen here, FWIW.
--
Florian
* Re: [GIT PULL] Block fix for 7.1-rc6
From: pr-tracker-bot @ 2026-05-29 20:13 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, linux-block@vger.kernel.org
In-Reply-To: <d12b120e-0628-4d5c-a36a-7618f752250d@kernel.dk>
The pull request you sent on Fri, 29 May 2026 11:03:35 -0600:
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git block-7.1-20260529
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9215e74f228f2b239f41271da9e5076ee3439d1b
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* [syzbot] [block?] possible deadlock in loop_process_work
From: syzbot @ 2026-05-29 20:24 UTC (permalink / raw)
To: axboe, linux-block, linux-kernel, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: c1ecb239fa34 Add linux-next specific files for 20260522
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12fa6336580000
kernel config: https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=78ad2c6a58c0a1faa5f5
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+78ad2c6a58c0a1faa5f5@syzkaller.appspotmail.com
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
------------------------------------------------------
kworker/u8:15/1491 is trying to acquire lock:
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: do_req_filebacked drivers/block/loop.c:433 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_handle_cmd drivers/block/loop.c:1941 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
but task is already holding lock:
ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 ((work_completion)(&worker->work)){+.+.}-{0:0}:
process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #6 ((wq_completion)loop4){+.+.}-{0:0}:
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&disk->open_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
__del_gendisk+0x127/0x980 block/genhd.c:710
del_gendisk+0xe7/0x160 block/genhd.c:823
nbd_dev_remove drivers/block/nbd.c:268 [inline]
nbd_dev_remove_work+0x47/0xe0 drivers/block/nbd.c:284
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #4 (&set->update_nr_hwq_lock){++++}-{4:4}:
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
add_disk_fwnode+0xe7/0x480 block/genhd.c:596
add_disk include/linux/blkdev.h:794 [inline]
nbd_dev_add+0x72c/0xb50 drivers/block/nbd.c:1984
nbd_genl_connect+0x965/0x1c80 drivers/block/nbd.c:2125
genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2551
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
____sys_sendmsg+0x55c/0x870 net/socket.c:2716
___sys_sendmsg+0x2a5/0x360 net/socket.c:2770
__sys_sendmsg net/socket.c:2802 [inline]
__do_sys_sendmsg net/socket.c:2807 [inline]
__se_sys_sendmsg net/socket.c:2805 [inline]
__x64_sys_sendmsg+0x1c3/0x2a0 net/socket.c:2805
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (genl_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
genl_lock net/netlink/genetlink.c:35 [inline]
genl_lock_all net/netlink/genetlink.c:48 [inline]
genl_register_family+0x7b9/0x17b0 net/netlink/genetlink.c:784
vdpa_init+0x39/0x70 drivers/vdpa/vdpa.c:1565
do_one_initcall+0x250/0x870 init/main.c:1347
do_initcall_level+0x104/0x190 init/main.c:1409
do_initcalls+0x59/0xa0 init/main.c:1425
kernel_init_freeable+0x2a6/0x3e0 init/main.c:1658
kernel_init+0x1d/0x1d0 init/main.c:1548
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (cb_lock){++++}-{4:4}:
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
genl_rcv+0x19/0x40 net/netlink/genetlink.c:1217
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
sock_sendmsg+0x1ca/0x2d0 net/socket.c:835
splice_to_socket+0xae5/0x11f0 fs/splice.c:884
do_splice_from fs/splice.c:936 [inline]
do_splice+0xef8/0x1940 fs/splice.c:1349
__do_splice fs/splice.c:1431 [inline]
__do_sys_splice fs/splice.c:1634 [inline]
__se_sys_splice+0x353/0x490 fs/splice.c:1616
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&pipe->mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
iter_file_splice_write+0x1f3/0x10f0 fs/splice.c:682
do_splice_from fs/splice.c:936 [inline]
do_splice+0xef8/0x1940 fs/splice.c:1349
__do_splice fs/splice.c:1431 [inline]
__do_sys_splice fs/splice.c:1634 [inline]
__se_sys_splice+0x353/0x490 fs/splice.c:1616
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (sb_writers#5){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
__sb_start_write include/linux/fs/super.h:19 [inline]
sb_start_write include/linux/fs/super.h:125 [inline]
kiocb_start_write include/linux/fs.h:2767 [inline]
lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
do_req_filebacked drivers/block/loop.c:433 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
other info that might help us debug this:
Chain exists of:
sb_writers#5 --> (wq_completion)loop4 --> (work_completion)(&worker->work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&worker->work));
lock((wq_completion)loop4);
lock((work_completion)(&worker->work));
rlock(sb_writers#5);
*** DEADLOCK ***
2 locks held by kworker/u8:15/1491:
#0: ffff888022729938 ((wq_completion)loop5){+.+.}-{0:0}, at: process_one_work+0x897/0x1630 kernel/workqueue.c:3293
#1: ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294
stack backtrace:
CPU: 0 UID: 0 PID: 1491 Comm: kworker/u8:15 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Workqueue: loop5 loop_workfn
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
__sb_start_write include/linux/fs/super.h:19 [inline]
sb_start_write include/linux/fs/super.h:125 [inline]
kiocb_start_write include/linux/fs.h:2767 [inline]
lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
do_req_filebacked drivers/block/loop.c:433 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-29 20:46 UTC
To: Christoph Hellwig, Jan Kara
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Dave Chinner,
Bart Van Assche, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <ahbq0RdUyLIPiItB@infradead.org>
On 5/27/26 9:00 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
>> > I ran some experiments with fio on both XFS and a raw block device. Five
>> > iterations each for 60s. Results below.
>> >
>> > TLDR: Removing the delay doesn't significantly decrease user-visible
>> > latency or otherwise improve performance, but does significantly reduce
>> > throughput and increase context switches in some workloads (e.g. C).
>> > I think it makes sense to leave the delay as-is. Thoughts?
>>
>> Thanks for the test! One question below:
>
> Thanks from me as well!
>
>>
>> > Results:
>> >
>> > Workloads (all `uncached=1`):
>> > A: rw=write bs=128k iodepth=1 ioengine=pvsync2 # XFS
>> > B: rw=write bs=128k iodepth=128 ioengine=io_uring # XFS
>> > C: rw=randwrite bs=4k iodepth=32 ioengine=io_uring # XFS
>> > D: rw=rw 50/50 bs=64k iodepth=32 ioengine=io_uring # XFS
>> > E: rw=write bs=128k iodepth=128 ioengine=io_uring # raw /dev/nvmeXn1
>> > F: rw=write bs=128k iodepth=128 numjobs=4
>> > + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
>> >
>> > Mean ± stddev across 5 iterations:
>> >
>> > metric delay=1 delay=0 delta
>> > --------------------------------------------------------------
>> >
>> > A seq 128k qd1
>> > BW (MB/s) 4333 ± 27 4374 ± 34 +0.9%
>> > p99 (us) 36.2 ± 0.8 35.8 ± 0.4 -1.1%
>> > p999 (us) 3260 ± 75 3228 ± 29 -1.0%
>> > ctx-switches 184 k ± 59 k 3.68 M ± 65 k +1903%
>> > cs / io 0.09 ± 0.03 1.86 ± 0.03 +1888%
>> > avg bios/run 80.4 ± 0.6 1.1 ± 0.0 -98.7%
>>
>> So 1 jiffie delay is (with default HZ=1000) 1ms. That means for this load
>> the completion latency should be at least 1000us but your results show p99
>> latency of 36. What am I missing?
>
> Yes, this looks a bit odd. Unless there's multiple threads submitting
> and somehow the completions get batched this should complete one
> bio at a time and be the worst case for the delay scheme.
Sorry, I should've clarified - the latency here is the userspace-visible
I/O completion latency (i.e. fio's clat value).
I reran the tests with tracing to measure the actual time from
__bio_complete_in_task() to the ->bi_end_io() call. The results now match
the 1-jiffy delay:
metric                  delay=1     delay=0
A seq 128k qd1
  fio clat p99          38us        36us
  bio cb p50            1.23ms      2.5us
  bio cb p99            4.13ms      1.44ms
  bio cb p999           5.01ms      2.63ms
B seq 128k qd128
  fio clat p99          8.74ms      8.85ms
  bio cb p50            1.27ms      3.1us
  bio cb p99            4.05ms      2.27ms
  bio cb p999           4.91ms      2.77ms
C rand 4k qd32
  fio clat p99          8.16ms      8.11ms
  bio cb p50            1.09ms      97.7us
  bio cb p99            3.73ms      2.06ms
  bio cb p999           11.87ms     3.79ms
D mixed 64k qd32
  fio clat p99          981us       1.03ms
  bio cb p50            1.14ms      39.5us
  bio cb p99            2.83ms      275us
  bio cb p999           3.06ms      595us
E raw 128k qd128
  fio clat p99          26.97ms     27.34ms
  bio cb p50            1.58ms      41.5us
  bio cb p99            2.98ms      325us
  bio cb p999           3.02ms      575us
F mem-pressure
  fio clat p99          29.75ms     30.43ms
  bio cb p50            1.32ms      2.5us
  bio cb p99            3.73ms      2.48ms
  bio cb p999           4.62ms      2.83ms
Note that in the above, the C degradation didn't reproduce as much. The
bandwidth does go down from 64.5 MB/s with delay=1 to 54.9 MB/s with delay=0,
but it's a much smaller drop. I ran it several more times and ran into the
degradation ~20% of the time. The lack of batching means the completion
kworker fires for nearly every bio, leading to heavier preemption when a
writer is placed on a CPU that receives many completion IRQs. The degradation
seems to occur when the writers are migrated less often, leading to more
preemption. But I haven't dug into why the scheduler chooses to migrate more
in some runs vs. others. However, when pinning to 16 cores, the difference
between delay=0 and delay=1 goes away.
C specifically also seems to get worse because we're doing random writes to a
sparse file, so each bio goes through the IOMAP_IOEND_UNWRITTEN path and the
completion path is heavier, leading to more CPU stealing from the writing
threads compared to the other workloads.
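As a quick cross-check of the delay=1 numbers above, one jiffy lasts
1000000 / HZ microseconds. The following is only a minimal userspace sketch
of that arithmetic, not kernel code: jiffies_to_usecs_model() and
at_least_delay() are hypothetical helpers written for illustration, and
HZ=1000 is the value assumed earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model: convert a delay in jiffies to microseconds
 * for a given tick rate. One jiffy lasts 1000000 / hz microseconds.
 */
uint64_t jiffies_to_usecs_model(uint64_t jiffies, uint32_t hz)
{
	assert(hz != 0);	/* a tick rate of zero is meaningless */
	return jiffies * UINT64_C(1000000) / hz;
}

/*
 * Hypothetical helper: is a measured latency (in microseconds) at least as
 * long as the configured delay? This is a plain comparison, not a statement
 * about timer guarantees.
 */
bool at_least_delay(uint64_t measured_us, uint64_t delay_jiffies, uint32_t hz)
{
	return measured_us >= jiffies_to_usecs_model(delay_jiffies, hz);
}
```

With HZ=1000, one jiffy is 1000us. The delay=1 "bio cb p50" values (1.09ms
to 1.58ms) are all at or above that, while the 38us fio clat p99 of workload
A is far below it, which is consistent with clat not measuring the deferred
callback.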
>> > C rand 4k qd32
>> > BW (MB/s) 66.2 ± 0.8 44.6 ± 7.4 -32.7%
>> > p99 (us) 8002 ± 174 17990 ± 6800 +124.8%
>> > p999 (us) 11390 ± 554 31890 ± 11076 +180.0%
>> > ctx-switches 3.67 M ± 45 k 3.59 M ± 106 k -2.2%
>> > cs / io 3.78 ± 0.04 5.62 ± 0.83 +48.7%
>> > avg bios/run 32.3 ± 1.0 3.1 ± 0.3 -90.5%
>>
>> I'm somewhat surprised how much larger the completion latency is here without
>> the delay. Is that due to a contention on local lock between the IO completion
>> interrupt and the worker? Or why is the completion latency so big here when
>> the case B with more IOs in flight, less bios per run, still had significantly
>> lower latency in the delay=0 case?
>
> Note that in the past we had major problems with workqueue scheduling
> latency. At some point these got mitigated a lot, but if they are back
> for this workload that might be one reason.
>
* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Hillf Danton @ 2026-05-29 22:05 UTC
To: Tetsuo Handa
Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
Ming Lei, linux-block, LKML, Andrew Morton, Linus Torvalds,
linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner,
syzbot+78ad2c6a58c0a1faa5f5
In-Reply-To: <20260529070411.1206-1-hdanton@sina.com>
On Fri, 29 May 2026 15:04:10 +0800 Hillf Danton wrote:
>On Fri, 29 May 2026 09:14:47 +0900 Tetsuo Handa wrote:
>>On 2026/05/29 8:00, Hillf Danton wrote:
>>>> Given the loop workqueue that triggered the jfs warning, can you specify
>>>> the reason why the workqueue in question is NOT flushed while closing disk?
>>>>
>>> Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
>>> ("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
>>> And the deadlock can be reproduced by flushing the loop workqueue with
>>> disk->open_mutex held [1].
>>>
>>> [1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
>>> https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/
>>
>>We can avoid the following lockdep warnings (including [1] you mentioned)
>>
>> https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
>> https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
>> https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
>> https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
>> https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4
>>
>>caused by "drain_workqueue() with disk->open_mutex held" if we assign
>>caller-specific lockdep class to disk->open_mutex
>>
>> https://sourceforge.net/p/tomoyo/tomoyo.git/ci/c2245c765ebeba9dcb924d9171d8d470a9ac41c8/
>>
>>.
>>
>>Also, we can avoid lockdep warning caused by "drain_workqueue() with disk->open_mutex held" +
>>"holding system_transition_mutex" if we forbid binding to pseudo files as backing file
>>in the loop driver
>>
>> https://lkml.kernel.org/r/d38e4600-3c32-491f-aa49-905f4fad1bfb@I-love.SAKURA.ne.jp
>>
>>which we can reproduce with
>>
>> echo 7:0 > /sys/power/resume
>> losetup /dev/loop0 /sys/power/resume
>> cat /dev/loop0 > /dev/null
>> losetup -d /dev/loop0
>>
>>.
>>
>> Therefore, I think we can address this problem by "drain_workqueue() with disk->open_mutex
>> held" in the loop driver side.
>>
> Good news.
>
Bad news: Subject: [syzbot] [block?] possible deadlock in loop_process_work
[3] https://lore.kernel.org/lkml/6a19f5f7.5099cdd9.8e407.0004.GAE@google.com/
syzbot found the following issue on:
HEAD commit: c1ecb239fa34 Add linux-next specific files for 20260522
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12fa6336580000
kernel config: https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=78ad2c6a58c0a1faa5f5
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+78ad2c6a58c0a1faa5f5@syzkaller.appspotmail.com
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
------------------------------------------------------
kworker/u8:15/1491 is trying to acquire lock:
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: do_req_filebacked drivers/block/loop.c:433 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_handle_cmd drivers/block/loop.c:1941 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
but task is already holding lock:
ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 ((work_completion)(&worker->work)){+.+.}-{0:0}:
process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #6 ((wq_completion)loop4){+.+.}-{0:0}:
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&disk->open_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
__del_gendisk+0x127/0x980 block/genhd.c:710
del_gendisk+0xe7/0x160 block/genhd.c:823
nbd_dev_remove drivers/block/nbd.c:268 [inline]
nbd_dev_remove_work+0x47/0xe0 drivers/block/nbd.c:284
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #4 (&set->update_nr_hwq_lock){++++}-{4:4}:
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
add_disk_fwnode+0xe7/0x480 block/genhd.c:596
add_disk include/linux/blkdev.h:794 [inline]
nbd_dev_add+0x72c/0xb50 drivers/block/nbd.c:1984
nbd_genl_connect+0x965/0x1c80 drivers/block/nbd.c:2125
genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2551
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
____sys_sendmsg+0x55c/0x870 net/socket.c:2716
___sys_sendmsg+0x2a5/0x360 net/socket.c:2770
__sys_sendmsg net/socket.c:2802 [inline]
__do_sys_sendmsg net/socket.c:2807 [inline]
__se_sys_sendmsg net/socket.c:2805 [inline]
__x64_sys_sendmsg+0x1c3/0x2a0 net/socket.c:2805
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (genl_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
genl_lock net/netlink/genetlink.c:35 [inline]
genl_lock_all net/netlink/genetlink.c:48 [inline]
genl_register_family+0x7b9/0x17b0 net/netlink/genetlink.c:784
vdpa_init+0x39/0x70 drivers/vdpa/vdpa.c:1565
do_one_initcall+0x250/0x870 init/main.c:1347
do_initcall_level+0x104/0x190 init/main.c:1409
do_initcalls+0x59/0xa0 init/main.c:1425
kernel_init_freeable+0x2a6/0x3e0 init/main.c:1658
kernel_init+0x1d/0x1d0 init/main.c:1548
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (cb_lock){++++}-{4:4}:
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
genl_rcv+0x19/0x40 net/netlink/genetlink.c:1217
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
sock_sendmsg+0x1ca/0x2d0 net/socket.c:835
splice_to_socket+0xae5/0x11f0 fs/splice.c:884
do_splice_from fs/splice.c:936 [inline]
do_splice+0xef8/0x1940 fs/splice.c:1349
__do_splice fs/splice.c:1431 [inline]
__do_sys_splice fs/splice.c:1634 [inline]
__se_sys_splice+0x353/0x490 fs/splice.c:1616
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&pipe->mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
iter_file_splice_write+0x1f3/0x10f0 fs/splice.c:682
do_splice_from fs/splice.c:936 [inline]
do_splice+0xef8/0x1940 fs/splice.c:1349
__do_splice fs/splice.c:1431 [inline]
__do_sys_splice fs/splice.c:1634 [inline]
__se_sys_splice+0x353/0x490 fs/splice.c:1616
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (sb_writers#5){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
__sb_start_write include/linux/fs/super.h:19 [inline]
sb_start_write include/linux/fs/super.h:125 [inline]
kiocb_start_write include/linux/fs.h:2767 [inline]
lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
do_req_filebacked drivers/block/loop.c:433 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
other info that might help us debug this:
Chain exists of:
sb_writers#5 --> (wq_completion)loop4 --> (work_completion)(&worker->work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&worker->work));
lock((wq_completion)loop4);
lock((work_completion)(&worker->work));
rlock(sb_writers#5);
*** DEADLOCK ***
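Read as a graph, the report says that the eight lock classes #0 to #7 already
form a recorded chain from sb_writers#5 to the work_completion class, and
that acquiring sb_writers#5 while holding work_completion would close it into
a cycle. The sketch below is a minimal userspace model of that check, under
loud assumptions: it is not lockdep's implementation, and the enum, the edge
matrix and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Lock classes from the report, numbered as in the dependency chain. */
enum lock_class {
	SB_WRITERS,		/* #0 sb_writers#5 */
	PIPE_MUTEX,		/* #1 &pipe->mutex */
	CB_LOCK,		/* #2 cb_lock */
	GENL_MUTEX,		/* #3 genl_mutex */
	UPDATE_NR_HWQ_LOCK,	/* #4 &set->update_nr_hwq_lock */
	OPEN_MUTEX,		/* #5 &disk->open_mutex */
	WQ_COMPLETION,		/* #6 (wq_completion)loop4 */
	WORK_COMPLETION,	/* #7 (work_completion)(&worker->work) */
	NR_CLASSES
};

/* edge[a][b] is true when b has been acquired while a was held. */
static bool edge[NR_CLASSES][NR_CLASSES];

void reset_graph(void)
{
	memset(edge, 0, sizeof(edge));
}

/* Depth-first search over recorded edges, tracking visited classes. */
static bool path_exists_from(int src, int dst, bool *visited)
{
	if (src == dst)
		return true;
	visited[src] = true;
	for (int next = 0; next < NR_CLASSES; next++) {
		if (edge[src][next] && !visited[next] &&
		    path_exists_from(next, dst, visited))
			return true;
	}
	return false;
}

/* Is there a recorded dependency path from src to dst? */
bool path_exists(int src, int dst)
{
	bool visited[NR_CLASSES] = { false };

	assert(src >= 0 && src < NR_CLASSES);
	assert(dst >= 0 && dst < NR_CLASSES);
	return path_exists_from(src, dst, visited);
}

/*
 * Record "acquired was taken while held was held". Returns false and records
 * nothing if a path acquired -> ... -> held already exists, because the new
 * edge would close a cycle.
 */
bool add_dependency(int held, int acquired)
{
	if (path_exists(acquired, held))
		return false;
	edge[held][acquired] = true;
	return true;
}

/* Load the existing chain #0 -> #1 -> ... -> #7 from the report. */
void load_reported_chain(void)
{
	reset_graph();
	for (int c = SB_WRITERS; c < WORK_COMPLETION; c++)
		edge[c][c + 1] = true;
}
```

After load_reported_chain(), add_dependency(WORK_COMPLETION, SB_WRITERS) is
rejected because a path from SB_WRITERS to WORK_COMPLETION already exists,
which corresponds to the "Chain exists of" summary. On an empty graph the
same call succeeds.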
* Re: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Keith Busch @ 2026-05-29 23:08 UTC
To: Achkinazi, Igor
Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <DS0PR19MB76965BF9FB57EA3ED8BD4586FD162@DS0PR19MB7696.namprd19.prod.outlook.com>
On Fri, May 29, 2026 at 01:32:22AM +0000, Achkinazi, Igor wrote:
> Keith Busch wrote:
> > I double checked the sequences here, and yes, I think the
> > synchronize_srcu's already in place ensure every caller sees the EOD
> > error before it could fail the bio_queue_enter(), so this looks like it
> > happens to be sufficient. I'm okay with it.
>
> Thanks Keith! May I add your Reviewed-by?
Sure, though I was considering just adding it to the nvme tree. I'm giving
it a few days to see if there are any other comments.
* Re: Observing higher CPU utilization during random IO fio testing
From: Ming Lei @ 2026-05-30 1:10 UTC
To: Wen Xiong; +Cc: linux-block, axboe, jmoyer, Gjoyce, wenxiong
In-Reply-To: <338169f719c77e4afe58f42e9760349e@linux.ibm.com>
On Thu, May 21, 2026 at 02:44:22PM -0500, Wen Xiong wrote:
> Hi All,
>
> Our performance team observed higher CPU utilization in RHEL10 compared
> to RHEL9.8, and a similar issue in the upstream kernel (v7.1-rc4) as well,
> when running FIO random IO tests.
>
> System configuration:
> 47 dedicated cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>
> Random IO tests are more CPU intensive than sequential IO tests due to
> several factors: more context switching, interrupt handling, cache
> inefficiency, etc. We found that the following patch causes the higher
> CPU utilization in RHEL10 and newer Linux kernels:
>
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date: Thu May 9 20:38:25 2024 +0800
>
> block: add plug while submitting IO
>
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link:
> https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> We reverted the above patch in the RHEL10 kernel and upstream 7.1-rc4, and
> saw lower CPU utilization when doing the same FIO test.
>
> The patch adds plugging in __submit_bio() in the block layer, which may
> cause performance degradation:
> - Random IO tests have less merging, flush overhead.
> - More IO scheduler interaction: requests are forced through the scheduler
> instead of direct dispatch to the hardware queue
> - Poor cache locality during plug operation
Yes, a regression is expected on QD=1 workloads.
Adding an inner plug only to cache the timestamp is a poor fit for what
plugging is meant to do, because only the outer code path (io_uring, libaio,
...) knows the exact IO batch size and can decide whether a plug should be
used.
Given that 060406c61c7c ("block: add plug while submitting IO") doesn't
provide any performance data, maybe it can be reverted.
Why not move the timestamp cache into 'task_struct' so that it benefits more
users?
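To make the last question concrete, here is a minimal userspace model of a
timestamp cached per task rather than per plug. Everything in it is an
assumption for illustration: struct task_model, task_time_get_ns() and
task_invalidate_time() are hypothetical names and not existing kernel
symbols, the clock is a fake counter, and the point at which the cache gets
invalidated (for example when the task is scheduled out) is left to the
caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fake clock for the model; counts how often it is actually read. */
static uint64_t fake_now_ns;
static unsigned int clock_reads;

uint64_t read_clock_ns(void)
{
	clock_reads++;
	fake_now_ns += 100;	/* pretend 100ns pass between reads */
	return fake_now_ns;
}

unsigned int clock_read_count(void)
{
	return clock_reads;
}

/* Hypothetical per-task cache; these are not real task_struct fields. */
struct task_model {
	uint64_t cached_ns;
	bool cache_valid;
};

/* Return the cached timestamp, reading the clock only on a cache miss. */
uint64_t task_time_get_ns(struct task_model *t)
{
	assert(t != NULL);
	if (!t->cache_valid) {
		t->cached_ns = read_clock_ns();
		t->cache_valid = true;
	}
	return t->cached_ns;
}

/* Drop the cached value so that the next lookup reads the clock again. */
void task_invalidate_time(struct task_model *t)
{
	assert(t != NULL);
	t->cache_valid = false;
}
```

In this model any submitter gets the cached value without needing a plug,
and repeated lookups between two invalidations cost a single clock read.
Whether that trade-off holds in the kernel would need measurement.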
Thanks,
Ming
* [PATCH] rbd: check snap_count against RBD_MAX_SNAP_COUNT
From: Rosen Penev @ 2026-05-30 1:12 UTC
To: linux-block
Cc: Ilya Dryomov, Dongsheng Yang, Jens Axboe, Nathan Chancellor,
Nick Desaulniers, Bill Wendling, Justin Stitt,
open list:RADOS BLOCK DEVICE (RBD), open list,
open list:CLANG/LLVM BUILD SUPPORT:Keyword:\b(?i:clang|llvm)\b
snap_count is u32 but the comparison is against a SIZE_MAX-derived value
(~2^61 on 64-bit), which clang flags as always false with
-Wtautological-constant-out-of-range-compare.
The proper check here should be that snap_count does not go over
RBD_MAX_SNAP_COUNT.
Assisted-by: Opencode:Big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
drivers/block/rbd.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 94709466ad19..25215c209484 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6075,12 +6075,9 @@ static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev,
/*
* Make sure the reported number of snapshot ids wouldn't go
- * beyond the end of our buffer. But before checking that,
- * make sure the computed size of the snapshot context we
- * allocate is representable in a size_t.
+ * beyond the end of our buffer.
*/
- if (snap_count > (SIZE_MAX - sizeof (struct ceph_snap_context))
- / sizeof (u64)) {
+ if (snap_count > RBD_MAX_SNAP_COUNT) {
ret = -EINVAL;
goto out;
}
--
2.54.0
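To illustrate why clang reports the old comparison as always false, here is a
minimal userspace sketch. It is not the driver code: SNAPC_HEADER_SIZE is a
made-up stand-in for sizeof(struct ceph_snap_context), a 64-bit SIZE_MAX is
modelled with UINT64_MAX, and the value 510 used for RBD_MAX_SNAP_COUNT is an
assumption that should be checked against drivers/block/rbd.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up stand-in for sizeof(struct ceph_snap_context). */
#define SNAPC_HEADER_SIZE	32u
/* Assumed value of RBD_MAX_SNAP_COUNT; verify against the driver. */
#define MAX_SNAP_COUNT		510u

/* Old check, modelled with a 64-bit size_t (SIZE_MAX == UINT64_MAX). */
bool old_check_rejects(uint32_t snap_count)
{
	uint64_t limit = (UINT64_MAX - SNAPC_HEADER_SIZE) / sizeof(uint64_t);

	/* The limit is about 2^61, so no u32 value can ever exceed it. */
	assert(limit > UINT32_MAX);
	return snap_count > limit;
}

/* New check: compare against the driver's own snapshot count limit. */
bool new_check_rejects(uint32_t snap_count)
{
	return snap_count > MAX_SNAP_COUNT;
}
```

Under these assumptions old_check_rejects() returns false even for
UINT32_MAX, whereas new_check_rejects() accepts 510 and rejects 511.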
* Re: [PATCH] rbd: check snap_count against RBD_MAX_SNAP_COUNT
From: Alex Elder @ 2026-05-30 1:44 UTC
To: Rosen Penev, linux-block
Cc: Ilya Dryomov, Dongsheng Yang, Jens Axboe, Nathan Chancellor,
Nick Desaulniers, Bill Wendling, Justin Stitt,
open list:RADOS BLOCK DEVICE (RBD), open list,
open list:CLANG/LLVM BUILD SUPPORT:Keyword:\b(?i:clang|llvm)\b
In-Reply-To: <20260530011255.52916-1-rosenp@gmail.com>
On 5/29/26 8:12 PM, Rosen Penev wrote:
> snap_count is u32 but the comparison is against a SIZE_MAX-derived value
> (~2^61 on 64-bit), which clang flags as always false with
> -Wtautological-constant-out-of-range-compare.
>
> The proper check here should be that snap_count does not go over
> RBD_MAX_SNAP_COUNT.
>
> Assisted-by: Opencode:Big-pickle
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
Looks good to me.
Reviewed-by: Alex Elder <elder@riscstar.com>
> ---
> drivers/block/rbd.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 94709466ad19..25215c209484 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -6075,12 +6075,9 @@ static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev,
>
> /*
> * Make sure the reported number of snapshot ids wouldn't go
> - * beyond the end of our buffer. But before checking that,
> - * make sure the computed size of the snapshot context we
> - * allocate is representable in a size_t.
> + * beyond the end of our buffer.
> */
> - if (snap_count > (SIZE_MAX - sizeof (struct ceph_snap_context))
> - / sizeof (u64)) {
> + if (snap_count > RBD_MAX_SNAP_COUNT) {
> ret = -EINVAL;
> goto out;
> }
* [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths
From: Ren Wei @ 2026-05-30 6:11 UTC
To: linux-block, rust-for-linux
Cc: ojeda, boqun, gary, bjorn3_gh, lossin, a.hindborg, aliceryhl,
tmgross, dakr, daniel.almeida, axboe, tamird, sunke, yuantan098,
bird, royenheart, n05ec
From: Haoze Xie <royenheart@gmail.com>
GenDiskBuilder::build() still has fallible work after
__blk_mq_alloc_disk(), but its error path only recovers the
foreign queue data. That leaks the temporary gendisk and
request_queue until later teardown. If the caller moved the last
Arc<TagSet<T>> into build(), the leaked queue can retain blk-mq
state after the tag set is dropped.
Fix the pre-registration failure path by dropping the temporary
gendisk reference with put_disk() before recovering queue_data,
so disk_release() can tear down the owned queue.
Also pair GenDisk::drop() with put_disk() after del_gendisk().
Once a Rust GenDisk has been added with device_add_disk(),
del_gendisk() only unregisters it; the final gendisk reference
still has to be dropped to complete the release path.
Fixes: 3253aba3408a ("rust: block: introduce `kernel::block::mq` module")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
Changes in v3:
- Add the requested blank lines around cleanup blocks for readability.
- v2 Link: https://lore.kernel.org/r/e14c015e2e0bde04f84a9452330b94436e2d8e68.1779901336.git.royenheart@gmail.com
Changes in v2:
- Add the missing put_disk() after del_gendisk() in GenDisk::drop(),
as suggested by Andreas Hindborg.
- Keep the GenDiskBuilder::build() failure cleanup fix and fold both
lifecycle fixes into one patch.
- v1 Link: https://lore.kernel.org/r/b6411cc055080c984a67bfad72fd683aa84b8e13.1779596478.git.royenheart@gmail.com
rust/kernel/block/mq/gen_disk.rs | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..fc97dd873974 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -150,6 +150,19 @@ pub fn build<T: Operations>(
// SAFETY: `gendisk` is a valid pointer as we initialized it above
unsafe { (*gendisk).fops = &TABLE };
+ let cleanup_failure = ScopeGuard::new_with_data((gendisk, data), |(gendisk, data)| {
+ // SAFETY: `gendisk` came from `__blk_mq_alloc_disk()` above and
+ // has not been added to the VFS on this cleanup path.
+ unsafe { bindings::put_disk(gendisk) };
+ // SAFETY: `data` came from `into_foreign()` above and has not been
+ // converted back on this cleanup path.
+ drop(unsafe { T::QueueData::from_foreign(data) });
+ });
+
+ // The failure guard now owns both pieces of cleanup; the early guard
+ // must not run on this path anymore.
+ recover_data.dismiss();
+
let mut writer = NullTerminatedFormatter::new(
// SAFETY: `gendisk` points to a valid and initialized instance. We
// have exclusive access, since the disk is not added to the VFS
@@ -172,7 +185,7 @@ pub fn build<T: Operations>(
},
)?;
- recover_data.dismiss();
+ cleanup_failure.dismiss();
// INVARIANT: `gendisk` was initialized above.
// INVARIANT: `gendisk` was added to the VFS via `device_add_disk` above.
@@ -215,6 +228,11 @@ fn drop(&mut self) {
// to the VFS.
unsafe { bindings::del_gendisk(self.gendisk) };
+ // SAFETY: By type invariant, `self.gendisk` was added to the VFS, so
+ // `put_disk()` must follow `del_gendisk()` to drop the final gendisk
+ // reference and trigger the remaining release path.
+ unsafe { bindings::put_disk(self.gendisk) };
+
// SAFETY: `queue.queuedata` was created by `GenDiskBuilder::build` with
// a call to `ForeignOwnable::into_foreign` to create `queuedata`.
// `ForeignOwnable::from_foreign` is only called here.
--
2.47.3
^ permalink raw reply related
* Re: [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths
From: Miguel Ojeda @ 2026-05-30 6:49 UTC (permalink / raw)
To: Ren Wei
Cc: linux-block, rust-for-linux, ojeda, boqun, gary, bjorn3_gh,
lossin, a.hindborg, aliceryhl, tmgross, dakr, daniel.almeida,
axboe, tamird, sunke, yuantan098, bird, royenheart
In-Reply-To: <b70aff9a920cc42110fe5cf454c3099561863519.1780063368.git.royenheart@gmail.com>
On Sat, May 30, 2026 at 8:12 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths
I think block prefers to take these, but please let me know otherwise.
Cheers,
Miguel
^ permalink raw reply
* [PATCH] block: assign caller-specific lockdep class to disk->open_mutex
From: Tetsuo Handa @ 2026-05-30 13:45 UTC (permalink / raw)
To: Jens Axboe, linux-block, LKML
Cc: Bart Van Assche, Andrew Morton, Ming Lei, Damien Le Moal,
Christoph Hellwig, Qu Wenruo, Hillf Danton
The block core currently allocates a single monolithic lockdep key for
disk->open_mutex across all callers. This single key conflates locking
hierarchies between independent block streams. For example, if a stacked
driver like loop flushes its internal workqueues inside lo_release() while
holding its own open_mutex, lockdep views this as a potential ABBA deadlock
against the underlying storage stack, leading to numerous circular
dependency splats [2][3][4][5][6].
To reduce false positives structurally, this patch splits the global
monolithic lock class into distinct per-caller classes during disk
allocation by changing "lock_class_key" into a 2-element array:
- lkclass[0]: Used for the legacy "(bio completion)" map.
- lkclass[1]: Assigned to target caller's disk->open_mutex.
This patch was tested by adding drain_workqueue() to __loop_clr_fd() while
testing a patch for [1], and it actually helped stop [2][4][6].
Even if our final solution for [1] does not call drain_workqueue() with
disk->open_mutex held, keeping locking chains simpler and shorter should
be a good change.
Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e [2]
Link: https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc [3]
Link: https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7 [4]
Link: https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97 [5]
Link: https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4 [6]
Suggested-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
block/blk-mq.c | 4 ++--
block/blk.h | 2 +-
block/genhd.c | 8 ++++----
drivers/scsi/sd.c | 4 ++--
drivers/scsi/sr.c | 4 ++--
include/linux/blk-mq.h | 8 ++++----
include/linux/blkdev.h | 6 +++---
7 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75e..01a15ac40754 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4492,7 +4492,7 @@ EXPORT_SYMBOL(blk_mq_destroy_queue);
struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
struct queue_limits *lim, void *queuedata,
- struct lock_class_key *lkclass)
+ struct lock_class_key lkclass[2])
{
struct request_queue *q;
struct gendisk *disk;
@@ -4513,7 +4513,7 @@ struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
EXPORT_SYMBOL(__blk_mq_alloc_disk);
struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
- struct lock_class_key *lkclass)
+ struct lock_class_key lkclass[2])
{
struct gendisk *disk;
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..1744748f9b68 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -614,7 +614,7 @@ void drop_partition(struct block_device *part);
void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
- struct lock_class_key *lkclass);
+ struct lock_class_key lkclass[2]);
struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..303bd5e619e7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1444,7 +1444,7 @@ dev_t part_devt(struct gendisk *disk, u8 partno)
}
struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
- struct lock_class_key *lkclass)
+ struct lock_class_key lkclass[2])
{
struct gendisk *disk;
@@ -1467,7 +1467,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
goto out_free_bdi;
disk->node_id = node_id;
- mutex_init(&disk->open_mutex);
+ mutex_init_with_key(&disk->open_mutex, &lkclass[1]);
xa_init(&disk->part_tbl);
if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
goto out_destroy_part_tbl;
@@ -1482,7 +1482,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
device_initialize(disk_to_dev(disk));
inc_diskseq(disk);
q->disk = disk;
- lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
+ lockdep_init_map(&disk->lockdep_map, "(bio completion)", &lkclass[0], 0);
#ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
INIT_LIST_HEAD(&disk->slave_bdevs);
#endif
@@ -1506,7 +1506,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
}
struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
- struct lock_class_key *lkclass)
+ struct lock_class_key lkclass[2])
{
struct queue_limits default_lim = { };
struct request_queue *q;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 599e75f33334..d8a1bbd4f19e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -112,7 +112,7 @@ static DEFINE_MUTEX(sd_mutex_lock);
static mempool_t *sd_page_pool;
static mempool_t *sd_large_page_pool;
static atomic_t sd_large_page_pool_users = ATOMIC_INIT(0);
-static struct lock_class_key sd_bio_compl_lkclass;
+static struct lock_class_key sd_bio_compl_lkclass[2];
static const char *sd_cache_types[] = {
"write through", "none", "write back",
@@ -4021,7 +4021,7 @@ static int sd_probe(struct scsi_device *sdp)
goto out;
gd = blk_mq_alloc_disk_for_queue(sdp->request_queue,
- &sd_bio_compl_lkclass);
+ sd_bio_compl_lkclass);
if (!gd)
goto out_free;
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index c36c54ecd354..421b8bd37db0 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -106,7 +106,7 @@ static struct scsi_driver sr_template = {
static unsigned long sr_index_bits[SR_DISKS / BITS_PER_LONG];
static DEFINE_SPINLOCK(sr_index_lock);
-static struct lock_class_key sr_bio_compl_lkclass;
+static struct lock_class_key sr_bio_compl_lkclass[2];
static int sr_open(struct cdrom_device_info *, int);
static void sr_release(struct cdrom_device_info *);
@@ -634,7 +634,7 @@ static int sr_probe(struct scsi_device *sdev)
goto fail;
disk = blk_mq_alloc_disk_for_queue(sdev->request_queue,
- &sr_bio_compl_lkclass);
+ sr_bio_compl_lkclass);
if (!disk)
goto fail_free;
mutex_init(&cd->lock);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..57d805c78827 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -726,15 +726,15 @@ enum {
struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
struct queue_limits *lim, void *queuedata,
- struct lock_class_key *lkclass);
+ struct lock_class_key lkclass[2]);
#define blk_mq_alloc_disk(set, lim, queuedata) \
({ \
- static struct lock_class_key __key; \
+ static struct lock_class_key __key[2]; \
\
- __blk_mq_alloc_disk(set, lim, queuedata, &__key); \
+ __blk_mq_alloc_disk(set, lim, queuedata, __key); \
})
struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
- struct lock_class_key *lkclass);
+ struct lock_class_key lkclass[2]);
struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
struct queue_limits *lim, void *queuedata);
int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..3cd2056cde28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -974,7 +974,7 @@ int bdev_disk_changed(struct gendisk *disk, bool invalidate);
void put_disk(struct gendisk *disk);
struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
- struct lock_class_key *lkclass);
+ struct lock_class_key lkclass[2]);
/**
* blk_alloc_disk - allocate a gendisk structure
@@ -990,9 +990,9 @@ struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
*/
#define blk_alloc_disk(lim, node_id) \
({ \
- static struct lock_class_key __key; \
+ static struct lock_class_key __key[2]; \
\
- __blk_alloc_disk(lim, node_id, &__key); \
+ __blk_alloc_disk(lim, node_id, __key); \
})
int __register_blkdev(unsigned int major, const char *name,
--
2.47.3
^ permalink raw reply related
* [PATCH] loop: reject binding to procfs and sysfs files
From: Tetsuo Handa @ 2026-05-30 13:48 UTC (permalink / raw)
To: Jens Axboe, linux-block, LKML
Cc: Bart Van Assche, Andrew Morton, Ming Lei, Damien Le Moal,
Christoph Hellwig, Qu Wenruo, Hillf Danton
I noticed that /dev/loopX accepts pseudo files, because loop_validate_file()
currently only checks:
if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
return -EINVAL;
and pseudo files are treated as S_ISREG().
Reading from pseudo files via /dev/loopX causes unexpected results, as it
tries to repeatedly read the entire content up to the size visible to the
"ls" command (padded with repeating data).
# ls -l /sys/power/pm_test
-rw-r--r-- 1 root root 4096 May 26 22:14 /sys/power/pm_test
# cat /sys/power/pm_test | wc
1 6 48
# cat $(losetup -f --show /sys/power/pm_test) | wc
85 513 4096
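The mismatch in the transcript above can also be checked from user space
without losetup. A minimal sketch, assuming a Linux system; both helper
names are hypothetical and not part of this patch. It compares the size
reported by stat() with the number of bytes read() actually returns
before EOF:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: size reported by stat(), or -1 on error. */
static long long reported_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return (long long)st.st_size;
}

/* Hypothetical helper: bytes read() returns until EOF, or -1 on error. */
static long long count_readable_bytes(const char *path)
{
	char buf[4096];
	long long total = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;
	close(fd);
	return n < 0 ? -1 : total;
}
```

For a file like /sys/power/pm_test, reported_size() would be expected to
return the 4096 seen by "ls" while count_readable_bytes() returns the
shorter real length; the exact numbers depend on the running kernel.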
Writing to pseudo files via /dev/loopX might also cause undesirable
results. Therefore, explicitly reject binding to pseudo files on procfs
and sysfs for now. Other filesystems can be appended as needed.
There is another intention for this change. Currently, we are evaluating
the possibility of calling drain_workqueue() from __loop_clr_fd() in order
to address a NULL pointer dereference in lo_rw_aio() [1].
However, introducing drain_workqueue() into the loop teardown path where
disk->open_mutex is held forms a circular locking dependency when a pseudo
file that takes a global lock is specified as the backing store for the
loop device.
If drain_workqueue() is called from __loop_clr_fd(), an example of a
circular locking dependency that involves system_transition_mutex and
disk->open_mutex can be triggered by the following reproduction steps:
# echo 7:0 > /sys/power/resume
# losetup /dev/loop0 /sys/power/resume
# cat /dev/loop0 > /dev/null
# losetup -d /dev/loop0
Even if our final solution for [1] does not call drain_workqueue() with
disk->open_mutex held, rejecting binding to pseudo files that confuse
userspace programs is a standalone improvement.
Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Analyzed-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
drivers/block/loop.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0000913f7efc..6aa88a7a0e2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -500,8 +500,15 @@ static int loop_validate_file(struct file *file, struct block_device *bdev)
rmb();
f = l->lo_backing_file;
}
- if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
+ if (S_ISBLK(inode->i_mode))
+ return 0;
+ if (!S_ISREG(inode->i_mode))
return -EINVAL;
+ switch (inode->i_sb->s_magic) {
+ case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
+ case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
+ return -EINVAL;
+ }
return 0;
}
--
2.47.3
^ permalink raw reply related
* RE: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Achkinazi, Igor @ 2026-05-30 14:37 UTC (permalink / raw)
To: Keith Busch
Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <ahocb8YRtqh5rHo-@kbusch-mbp>
Keith Busch wrote:
> Sure, though I was considering just adding it to the nvme tree. I'm giving
> a few days to see if there are any other comments.
Sounds good, thanks Keith.
Internal Use - Confidential
^ permalink raw reply
* RE: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Achkinazi, Igor @ 2026-05-30 14:34 UTC (permalink / raw)
To: Hannes Reinecke, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
axboe@kernel.dk
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <b8d1fda2-a2da-4b35-9bd5-941834f26c32@suse.de>
Hannes Reinecke wrote:
> ... or you could introduce __bio_set_dev():
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..5a2709adeea7 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -518,15 +518,20 @@ static inline void blkcg_punt_bio_submit(struct
> bio *bio)
> }
> #endif /* CONFIG_BLK_CGROUP */
>
> -static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
> +static inline void __bio_set_dev(struct bio *bio, struct block_device
> *bdev)
> {
> - bio_clear_flag(bio, BIO_REMAPPED);
> if (bio->bi_bdev != bdev)
> bio_clear_flag(bio, BIO_BPS_THROTTLED);
> bio->bi_bdev = bdev;
> bio_associate_blkg(bio);
> }
>
> +static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
> +{
> + bio_clear_flag(bio, BIO_REMAPPED);
> + __bio_set_dev(bio, bdev);
> +}
> +
> /*
> * BIO list management for use by remapping drivers (e.g. DM or MD)
> and loop.
> *
>
> to avoid all this clear-and-set-flag dance.
Thanks Hannes. It is a cleaner approach and avoids the clear-and-set
dance. However it touches the block layer (bio.h) and would need
wider review and testing across all bio_set_dev callers.
I'd prefer to keep this patch as a minimal nvme-multipath fix that
is easy to backport to stable kernels where this race is hitting us
today. The __bio_set_dev() approach (or Keith's patch that removes
set_capacity(0) entirely) could follow as the proper long-term
solution.
Thanks, Igor
^ permalink raw reply
* Re: [PATCH v2] scsi: bsg: read io_uring command fields once
From: Yang Xiuwei @ 2026-05-30 18:02 UTC (permalink / raw)
To: rc
Cc: James.Bottomley, martin.petersen, axboe, fujita.tomonori,
linux-scsi, linux-block, io-uring, linux-kernel, bvanassche,
csander, stable, Yang Xiuwei
In-Reply-To: <20260527191817.142769-1-rc@rexion.ai>
Hi Rahul,
Thanks for the report and for v2.
Reviewed-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
^ permalink raw reply
* Re: [PATCH] loop: reject binding to procfs and sysfs files
From: kernel test robot @ 2026-05-30 19:48 UTC (permalink / raw)
To: Tetsuo Handa, Jens Axboe, linux-block, LKML
Cc: llvm, oe-kbuild-all, Bart Van Assche, Andrew Morton,
Linux Memory Management List, Ming Lei, Damien Le Moal,
Christoph Hellwig, Qu Wenruo, Hillf Danton
In-Reply-To: <148efba2-a0b6-47d7-ac76-b19d2f4b696c@I-love.SAKURA.ne.jp>
Hi Tetsuo,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe/for-next]
[also build test ERROR on linus/master v7.1-rc5 next-20260529]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Tetsuo-Handa/loop-reject-binding-to-procfs-and-sysfs-files/20260530-214900
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link: https://lore.kernel.org/r/148efba2-a0b6-47d7-ac76-b19d2f4b696c%40I-love.SAKURA.ne.jp
patch subject: [PATCH] loop: reject binding to procfs and sysfs files
config: um-x86_64_defconfig (https://download.01.org/0day-ci/archive/20260531/202605310318.dbidMe6W-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9409c07de6378507397ecdb6f05f628f58110112)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260531/202605310318.dbidMe6W-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605310318.dbidMe6W-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/block/loop.c:504:7: error: use of undeclared identifier 'PROC_SUPER_MAGIC'
504 | case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
| ^~~~~~~~~~~~~~~~
>> drivers/block/loop.c:505:7: error: use of undeclared identifier 'SYSFS_MAGIC'
505 | case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
| ^~~~~~~~~~~
2 errors generated.
vim +/PROC_SUPER_MAGIC +504 drivers/block/loop.c
478
479 static int loop_validate_file(struct file *file, struct block_device *bdev)
480 {
481 struct inode *inode = file->f_mapping->host;
482 struct file *f = file;
483
484 /* Avoid recursion */
485 while (is_loop_device(f)) {
486 struct loop_device *l;
487
488 lockdep_assert_held(&loop_validate_mutex);
489 if (f->f_mapping->host->i_rdev == bdev->bd_dev)
490 return -EBADF;
491
492 l = I_BDEV(f->f_mapping->host)->bd_disk->private_data;
493 if (l->lo_state != Lo_bound)
494 return -EINVAL;
495 /* Order wrt setting lo->lo_backing_file in loop_configure(). */
496 rmb();
497 f = l->lo_backing_file;
498 }
499 if (S_ISBLK(inode->i_mode))
500 return 0;
501 if (!S_ISREG(inode->i_mode))
502 return -EINVAL;
503 switch (inode->i_sb->s_magic) {
> 504 case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
> 505 case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
506 return -EINVAL;
507 }
508 return 0;
509 }
510
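For reference, PROC_SUPER_MAGIC and SYSFS_MAGIC are defined in
include/uapi/linux/magic.h, so the errors above suggest that loop.c does
not pull that header in on this config; adding #include <linux/magic.h>
is the likely fix, though that is an assumption and not something tested
here. The same switch can be exercised from user space. A rough sketch,
assuming Linux with the uapi headers installed; the helper names are
hypothetical:

```c
#include <stdbool.h>
#include <sys/vfs.h>      /* statfs() */
#include <linux/magic.h>  /* PROC_SUPER_MAGIC, SYSFS_MAGIC */

/* Hypothetical helper: true if the magic belongs to procfs or sysfs. */
static bool is_pseudo_fs_magic(unsigned long magic)
{
	switch (magic) {
	case PROC_SUPER_MAGIC:
	case SYSFS_MAGIC:
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical helper: 1 if path lives on procfs or sysfs, 0 if it does
 * not, -1 if statfs() fails.
 */
static int path_is_on_pseudo_fs(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return is_pseudo_fs_magic((unsigned long)sfs.f_type) ? 1 : 0;
}
```

This mirrors the check on inode->i_sb->s_magic in the patch, with
statfs() standing in for the superblock lookup.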
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH] loop: reject binding to procfs and sysfs files
From: kernel test robot @ 2026-05-30 20:45 UTC (permalink / raw)
To: Tetsuo Handa, Jens Axboe, linux-block, LKML
Cc: oe-kbuild-all, Bart Van Assche, Andrew Morton,
Linux Memory Management List, Ming Lei, Damien Le Moal,
Christoph Hellwig, Qu Wenruo, Hillf Danton
In-Reply-To: <148efba2-a0b6-47d7-ac76-b19d2f4b696c@I-love.SAKURA.ne.jp>
Hi Tetsuo,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe/for-next]
[also build test ERROR on linus/master v7.1-rc5 next-20260529]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Tetsuo-Handa/loop-reject-binding-to-procfs-and-sysfs-files/20260530-214900
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link: https://lore.kernel.org/r/148efba2-a0b6-47d7-ac76-b19d2f4b696c%40I-love.SAKURA.ne.jp
patch subject: [PATCH] loop: reject binding to procfs and sysfs files
config: nios2-defconfig (https://download.01.org/0day-ci/archive/20260531/202605310413.Xgk6vCeB-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260531/202605310413.Xgk6vCeB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605310413.Xgk6vCeB-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/block/loop.c: In function 'loop_validate_file':
>> drivers/block/loop.c:504:14: error: 'PROC_SUPER_MAGIC' undeclared (first use in this function)
504 | case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
| ^~~~~~~~~~~~~~~~
drivers/block/loop.c:504:14: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/block/loop.c:505:14: error: 'SYSFS_MAGIC' undeclared (first use in this function)
505 | case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
| ^~~~~~~~~~~
vim +/PROC_SUPER_MAGIC +504 drivers/block/loop.c
478
479 static int loop_validate_file(struct file *file, struct block_device *bdev)
480 {
481 struct inode *inode = file->f_mapping->host;
482 struct file *f = file;
483
484 /* Avoid recursion */
485 while (is_loop_device(f)) {
486 struct loop_device *l;
487
488 lockdep_assert_held(&loop_validate_mutex);
489 if (f->f_mapping->host->i_rdev == bdev->bd_dev)
490 return -EBADF;
491
492 l = I_BDEV(f->f_mapping->host)->bd_disk->private_data;
493 if (l->lo_state != Lo_bound)
494 return -EINVAL;
495 /* Order wrt setting lo->lo_backing_file in loop_configure(). */
496 rmb();
497 f = l->lo_backing_file;
498 }
499 if (S_ISBLK(inode->i_mode))
500 return 0;
501 if (!S_ISREG(inode->i_mode))
502 return -EINVAL;
503 switch (inode->i_sb->s_magic) {
> 504 case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
> 505 case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
506 return -EINVAL;
507 }
508 return 0;
509 }
510
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH] block: assign caller-specific lockdep class to disk->open_mutex
From: Bart Van Assche @ 2026-05-30 21:15 UTC (permalink / raw)
To: Tetsuo Handa, Jens Axboe, linux-block, LKML
Cc: Andrew Morton, Ming Lei, Damien Le Moal, Christoph Hellwig,
Qu Wenruo, Hillf Danton
In-Reply-To: <147ed056-03d9-4214-b925-0f10fc00cf27@I-love.SAKURA.ne.jp>
On 5/30/26 6:45 AM, Tetsuo Handa wrote:
> - static struct lock_class_key __key; \
> + static struct lock_class_key __key[2]; \
The two elements of this array have different roles. From the point of
view of code readability and maintainability it's probably much better
to make this a struct with two named members rather than a two-element
array.
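Something along these lines, as a rough user-space sketch only. The
struct name, member names and accessors are hypothetical, and the
lock_class_key definition is a stand-in so that the snippet compiles
outside the kernel:

```c
/*
 * User-space stand-in; in the kernel, struct lock_class_key comes from
 * the lockdep headers.
 */
struct lock_class_key {
	char dummy;
};

/* Hypothetical replacement for the two-element array. */
struct disk_lock_keys {
	struct lock_class_key bio_compl;  /* "(bio completion)" map */
	struct lock_class_key open_mutex; /* disk->open_mutex */
};

/* Hypothetical accessor: key for the "(bio completion)" lockdep map. */
static struct lock_class_key *disk_bio_compl_key(struct disk_lock_keys *keys)
{
	return &keys->bio_compl;
}

/* Hypothetical accessor: key for disk->open_mutex. */
static struct lock_class_key *disk_open_mutex_key(struct disk_lock_keys *keys)
{
	return &keys->open_mutex;
}
```

With named members, call sites read as &keys->open_mutex instead of
&lkclass[1], so the role of each key is visible without a comment.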
Thanks,
Bart.
^ permalink raw reply
* Re: [PATCH] block: assign caller-specific lockdep class to disk->open_mutex
From: Hillf Danton @ 2026-05-30 22:50 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Jens Axboe, linux-block, LKML, Bart Van Assche, Boqun Feng,
Andrew Morton, Ming Lei, Damien Le Moal, Christoph Hellwig,
Qu Wenruo, Hillf Danton
In-Reply-To: <147ed056-03d9-4214-b925-0f10fc00cf27@I-love.SAKURA.ne.jp>
On Sat, 30 May 2026 22:45:55 +0900 Tetsuo Handa wrote:
> The block core currently allocates a single monolithic lockdep key for
> disk->open_mutex across all callers. This single key conflates locking
> hierarchies between independent block streams. For example, if a stacked
> driver like loop flushes its internal workqueues inside lo_release() while
> holding its own open_mutex, lockdep views this as a potential ABBA deadlock
> against the underlying storage stack, leading to numerous circular
> dependency splats [2][3][4][5][6].
>
> To reduce false-positives structurally, this patch splits the global
> monolithic lock class into distinct, per-caller during disk allocation;
> by changing "lock_class_key" into a 2-element array:
> - lkclass[0]: Used for the legacy "(bio completion)" map.
> - lkclass[1]: Assigned to target caller's disk->open_mutex.
>
I wonder how this works given e966eaeeb623 ("locking/lockdep: Remove the
cross-release locking checks").
> This patch was tested by adding drain_workqueue() to __loop_clr_fd() during
> testing of a patch for [1], and actually helped stopping [2][4][6].
> Even if our final solution for [1] does not call drain_workqueue() with
> disk->open_mutex held, keeping locking chains simpler and shorter should
> be a good change.
>
> Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
> Link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e [2]
> Link: https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc [3]
> Link: https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7 [4]
> Link: https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97 [5]
> Link: https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4 [6]
> Suggested-by: AI Mode in Google Search (no mail address)
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> block/blk-mq.c | 4 ++--
> block/blk.h | 2 +-
> block/genhd.c | 8 ++++----
> drivers/scsi/sd.c | 4 ++--
> drivers/scsi/sr.c | 4 ++--
> include/linux/blk-mq.h | 8 ++++----
> include/linux/blkdev.h | 6 +++---
> 7 files changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 28c2d931e75e..01a15ac40754 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4492,7 +4492,7 @@ EXPORT_SYMBOL(blk_mq_destroy_queue);
>
> struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
> struct queue_limits *lim, void *queuedata,
> - struct lock_class_key *lkclass)
> + struct lock_class_key lkclass[2])
> {
> struct request_queue *q;
> struct gendisk *disk;
> @@ -4513,7 +4513,7 @@ struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
> EXPORT_SYMBOL(__blk_mq_alloc_disk);
>
> struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
> - struct lock_class_key *lkclass)
> + struct lock_class_key lkclass[2])
> {
> struct gendisk *disk;
>
> diff --git a/block/blk.h b/block/blk.h
> index b998a7761faf..1744748f9b68 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -614,7 +614,7 @@ void drop_partition(struct block_device *part);
> void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
>
> struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
> - struct lock_class_key *lkclass);
> + struct lock_class_key lkclass[2]);
> struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
>
> int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
> diff --git a/block/genhd.c b/block/genhd.c
> index 7d6854fd28e9..303bd5e619e7 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1444,7 +1444,7 @@ dev_t part_devt(struct gendisk *disk, u8 partno)
> }
>
> struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
> - struct lock_class_key *lkclass)
> + struct lock_class_key lkclass[2])
> {
> struct gendisk *disk;
>
> @@ -1467,7 +1467,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
> goto out_free_bdi;
>
> disk->node_id = node_id;
> - mutex_init(&disk->open_mutex);
> + mutex_init_with_key(&disk->open_mutex, &lkclass[1]);
> xa_init(&disk->part_tbl);
> if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
> goto out_destroy_part_tbl;
> @@ -1482,7 +1482,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
> device_initialize(disk_to_dev(disk));
> inc_diskseq(disk);
> q->disk = disk;
> - lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
> + lockdep_init_map(&disk->lockdep_map, "(bio completion)", &lkclass[0], 0);
> #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
> INIT_LIST_HEAD(&disk->slave_bdevs);
> #endif
> @@ -1506,7 +1506,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
> }
>
> struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
> - struct lock_class_key *lkclass)
> + struct lock_class_key lkclass[2])
> {
> struct queue_limits default_lim = { };
> struct request_queue *q;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 599e75f33334..d8a1bbd4f19e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -112,7 +112,7 @@ static DEFINE_MUTEX(sd_mutex_lock);
> static mempool_t *sd_page_pool;
> static mempool_t *sd_large_page_pool;
> static atomic_t sd_large_page_pool_users = ATOMIC_INIT(0);
> -static struct lock_class_key sd_bio_compl_lkclass;
> +static struct lock_class_key sd_bio_compl_lkclass[2];
>
> static const char *sd_cache_types[] = {
> "write through", "none", "write back",
> @@ -4021,7 +4021,7 @@ static int sd_probe(struct scsi_device *sdp)
> goto out;
>
> gd = blk_mq_alloc_disk_for_queue(sdp->request_queue,
> - &sd_bio_compl_lkclass);
> + sd_bio_compl_lkclass);
> if (!gd)
> goto out_free;
>
> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
> index c36c54ecd354..421b8bd37db0 100644
> --- a/drivers/scsi/sr.c
> +++ b/drivers/scsi/sr.c
> @@ -106,7 +106,7 @@ static struct scsi_driver sr_template = {
> static unsigned long sr_index_bits[SR_DISKS / BITS_PER_LONG];
> static DEFINE_SPINLOCK(sr_index_lock);
>
> -static struct lock_class_key sr_bio_compl_lkclass;
> +static struct lock_class_key sr_bio_compl_lkclass[2];
>
> static int sr_open(struct cdrom_device_info *, int);
> static void sr_release(struct cdrom_device_info *);
> @@ -634,7 +634,7 @@ static int sr_probe(struct scsi_device *sdev)
> goto fail;
>
> disk = blk_mq_alloc_disk_for_queue(sdev->request_queue,
> - &sr_bio_compl_lkclass);
> + sr_bio_compl_lkclass);
> if (!disk)
> goto fail_free;
> mutex_init(&cd->lock);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 18a2388ba581..57d805c78827 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -726,15 +726,15 @@ enum {
>
> struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
> struct queue_limits *lim, void *queuedata,
> - struct lock_class_key *lkclass);
> + struct lock_class_key lkclass[2]);
> #define blk_mq_alloc_disk(set, lim, queuedata) \
> ({ \
> - static struct lock_class_key __key; \
> + static struct lock_class_key __key[2]; \
> \
> - __blk_mq_alloc_disk(set, lim, queuedata, &__key); \
> + __blk_mq_alloc_disk(set, lim, queuedata, __key); \
> })
> struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
> - struct lock_class_key *lkclass);
> + struct lock_class_key lkclass[2]);
> struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
> struct queue_limits *lim, void *queuedata);
> int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 890128cdea1c..3cd2056cde28 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -974,7 +974,7 @@ int bdev_disk_changed(struct gendisk *disk, bool invalidate);
>
> void put_disk(struct gendisk *disk);
> struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
> - struct lock_class_key *lkclass);
> + struct lock_class_key lkclass[2]);
>
> /**
> * blk_alloc_disk - allocate a gendisk structure
> @@ -990,9 +990,9 @@ struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
> */
> #define blk_alloc_disk(lim, node_id) \
> ({ \
> - static struct lock_class_key __key; \
> + static struct lock_class_key __key[2]; \
> \
> - __blk_alloc_disk(lim, node_id, &__key); \
> + __blk_alloc_disk(lim, node_id, __key); \
> })
>
> int __register_blkdev(unsigned int major, const char *name,
> --
> 2.47.3
>
>
^ permalink raw reply