Linux block layer
* Re: [PATCH v7 17/43] btrfs: add get_devices hook for fscrypt
From: Daniel Vacek @ 2026-05-29 14:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, Eric Biggers, Theodore Y. Ts'o,
	Jaegeuk Kim, Jens Axboe, David Sterba, linux-block, linux-fscrypt,
	linux-btrfs, linux-kernel, Sweet Tea Dorminy
In-Reply-To: <ahBJdSKMly8rv04F@infradead.org>

On Fri, 22 May 2026 at 14:17, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, May 22, 2026 at 02:00:28PM +0200, Daniel Vacek wrote:
> > > How does this handle adding/removing devices at runtime?
> >
> > When called, this callback returns the list of bdevs opened by the
> > given superblock. If devices are added or removed, this function
> > returns a different list.
> > In other words it always returns a valid list.
> >
> > This is called from `fscrypt_get_devices()`, which is called from
> > `fscrypt_select_encryption_impl()` or
> > `fscrypt_prepare_inline_crypt_key()` or
> > `fscrypt_destroy_inline_crypt_key()`. All these functions walk the
> > returned list and discard it immediately afterwards.
> >
> > Note that with btrfs at this point we're only using the inline crypto fallback.
> > Is there any particular reason you asked this question?
>
> Well, assume you have a single device fs, and then you add a device
> later, you will not get the blk_crypto_config_supported call for this
> device, and it will not be taken into account.

This function is called from `fscrypt_prepare_new_inode()` from
`btrfs_new_inode_prepare()` as well as from many other places.
It looks quite OK to me and I can also confirm this with tracing.
Using the following bpftrace script:

```
fr:fscrypt_get_devices {
//      $num_devs = args.num_devs[0];
        $num_devs = ((uint32 *)args.num_devs)[0];
//      if ($num_devs < 2) { return; }
        printf("%s()\t\t\t(%4d %13s[%d])\tnum_devs %d\n", func,
                cpu, curtask->comm, curtask->pid, $num_devs);
}

f:blk_crypto_config_supported {
        printf("%s()\t\t(%4d %13s[%d])\tbdev %18p\n", func,
                cpu, curtask->comm, curtask->pid, args.bdev);
}
```

... and mounting an encrypted FS, then adding an additional device, like this:

```
$ mount /dev/vdb /mnt/scratch; \
echo -ne $TEST_RAW_KEY | xfs_io -c add_enckey /mnt/scratch; \
touch /mnt/scratch/dir/foo; \
btrfs device add /dev/vdc /mnt/scratch; \
touch /mnt/scratch/dir/bar
```

I'm getting this:

```
fscrypt_get_devices()            (   5         touch[26840])    num_devs 1
blk_crypto_config_supported()    (   5         touch[26840])    bdev 0xffff88a9c33fc880
fscrypt_get_devices()            (   5         touch[26840])    num_devs 1
fscrypt_get_devices()            (   5         touch[26844])    num_devs 2
blk_crypto_config_supported()    (   5         touch[26844])    bdev 0xffff88a9c3262b80
blk_crypto_config_supported()    (   5         touch[26844])    bdev 0xffff88a9c33fc880
fscrypt_get_devices()            (   5         touch[26844])    num_devs 2
```

Here you can see the newly added device is being considered.

Moreover btrfs only supports the fallback encryption due to the need
to compute the checksums of encrypted data stored on the device.

> Now can btrfs even support hardware inline encryption?  The way the bio
> processing is special cased I somehow doubt it.  But the concept of a
> static device list just doesn't work for btrfs, so I think the fscrypt
> side of this will need refactoring not to rely on it.  If we never
> support hardware inline encryption on such dynamic file systems that
> would be relatively easy, if we need to support that case things might
> get a lot more complicated.

Yeah, this depends. If the device or fscrypt could return the checksum
to the FS, btrfs could use the inline HW encryption. Note that the
checksum must also be one that btrfs supports.
Otherwise we need to get the encrypted data to compute the checksum
ourselves. That is precisely why only fallback encryption is currently
supported. And it's where the FS callback hook is used to compute the
checksum.

--nX


* [REPORT] nvmet-rdma: integer overflow in inline-data SGL bounds check -> pre-auth kernel-memory read + remote crash (candidate patch inline)
From: hexlabsecurity @ 2026-05-29  6:52 UTC (permalink / raw)
  To: security@kernel.org
  Cc: hch@lst.de, sagi@grimberg.me, kbusch@kernel.org, kch@nvidia.com,
	linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
	linux-block@vger.kernel.org

Hello,

I would like to report an integer-overflow vulnerability in the NVMe-oF
RDMA target (drivers/nvme/target/rdma.c).  The inline-data SGL bounds
check in nvmet_rdma_map_sgl_inline() is computed in u64 over two
host-controlled values and wraps, which a remote fabric peer can use
both to read kernel memory back over the fabric and to crash the target.

== Affected ==

  drivers/nvme/target/rdma.c, nvmet_rdma_map_sgl_inline()

  Verified present on the current mainline tree (commit 27fa82620cba,
  ~v7.1-rc5), at the bounds check:

    static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
    {
        struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
        u64 off = le64_to_cpu(sgl->addr);     /* host-controlled, 64-bit */
        u32 len = le32_to_cpu(sgl->length);   /* host-controlled, 32-bit */
        ...
        if (off + len > rsp->queue->dev->inline_data_size) {   /* u64 wrap */
            pr_err("invalid inline data offset!\n");
            return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
        }
        ...
        nvmet_rdma_use_inline_sg(rsp, len, off);
    }

  "off + len" is evaluated in u64 and wraps modulo 2^64.  For example
  addr = 0xfffffffffffffe00, length = 0x1000 makes the sum wrap to
  0xe00, which is <= inline_data_size (default PAGE_SIZE), so the check
  passes.  The current check form (against the per-port inline_data_size)
  and the fixed-size inline_sg[NVMET_RDMA_MAX_INLINE_SGE] array with the
  num_pages(len) loop were introduced together by commit 0d5ee2b2ab4f
  ("nvmet-rdma: support max(16KB, PAGE_SIZE) inline data"), which is the
  Fixes: I used.  Note: the single-page inline path that predates that
  commit may have an analogous u64-overflow read in a different code
  shape; I would appreciate the maintainers' judgement on whether the
  stable backport scope should reach before that commit.
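
  The wrap described above can be reproduced in user space.  The
  following is a minimal sketch, not the kernel code: the function names
  are hypothetical, and 4096 stands in for inline_data_size on the
  assumption of 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The sum as the unpatched check forms it: u64 arithmetic, wraps mod 2^64. */
uint64_t sketch_wrapped_sum(uint64_t off, uint32_t len)
{
	return off + len;
}

/* Model of the unpatched bounds check; true means "reject the descriptor". */
bool sketch_unpatched_rejects(uint64_t off, uint32_t len,
			      uint64_t inline_data_size)
{
	return off + len > inline_data_size;
}

/* Worked example using the values quoted in the report. */
void sketch_wrap_demo(void)
{
	/* 0xfffffffffffffe00 + 0x1000 = 2^64 + 0xe00, which wraps to 0xe00. */
	assert(sketch_wrapped_sum(0xfffffffffffffe00ULL, 0x1000) == 0xe00);
	/* 0xe00 <= 4096, so the unpatched check lets the descriptor through. */
	assert(!sketch_unpatched_rejects(0xfffffffffffffe00ULL, 0x1000, 4096));
}
```

  An honest out-of-range descriptor such as off = 0x1000, len = 0x1000 is
  still rejected by this model; only sums that cross 2^64 slip through.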

== Two consequences of the bypass ==

  1. Kernel-memory read (information disclosure).
     nvmet_rdma_use_inline_sg() does "sg->offset = off", truncating the
     64-bit offset to scatterlist::offset (unsigned int).  The block
     layer then accesses page_to_phys(inline_page) + (off & 0xffffffff),
     so the target reads up to inline_data_size bytes of kernel memory
     per write command and returns them to the host on read-back, or
     faults the in-kernel copy if the offset lands on unmapped memory.

  2. Kernel-memory corruption -> remote crash (denial of service).
     A large length makes "sg_count = num_pages(len)" in
     nvmet_rdma_use_inline_sg() exceed NVMET_RDMA_MAX_INLINE_SGE (4), so
     the loop writes scatterlist entries past the fixed-size inline_sg[]
     array, corrupting the surrounding command object.
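
  The arithmetic behind both consequences can be checked with a short
  user-space sketch.  It is a model under stated assumptions, not the
  driver code: 4 KiB pages, a 32-bit scatterlist::offset, a 4-entry
  inline_sg[] as quoted above, and a page count computed by plain
  round-up division.  All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE	(1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_MAX_INLINE_SGE	4	/* array size quoted in the report */

/* Storing a u64 into a 32-bit offset field keeps only the low 32 bits. */
uint32_t sketch_sg_offset(uint64_t off)
{
	return (uint32_t)off;
}

/* Pages needed to hold len bytes, rounded up (64-bit math avoids overflow). */
uint32_t sketch_num_pages(uint32_t len)
{
	return (uint32_t)(((uint64_t)len + SKETCH_PAGE_SIZE - 1) >>
			  SKETCH_PAGE_SHIFT);
}

/* Scatterlist entries that would land past the end of the 4-entry array. */
uint32_t sketch_sg_overrun(uint32_t len)
{
	uint32_t n = sketch_num_pages(len);

	return n > SKETCH_MAX_INLINE_SGE ? n - SKETCH_MAX_INLINE_SGE : 0;
}
```

  With the report's values, sketch_sg_offset(0xfffffffffffffe00) yields
  0xfffffe00 and sketch_sg_overrun(0x10000) yields 12.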

== Reachability ==

  The path is reached by any write command carrying an inline SGL, i.e.
  after a Fabrics Connect.  On a subsystem configured with
  attr_allow_any_host=1 it is reachable WITHOUT authentication by any
  RDMA peer (RoCE/iWARP/IB) that can reach the target's listener.  With
  DH-CHAP configured, or attr_allow_any_host=0 with an unknown host NQN,
  a valid/known host NQN is required first.

== Empirical reproduction ==

  Reproduced against a stock nvmet-rdma target over a soft-iWARP (siw)
  fabric on a Linux 6.12.90 build with KASAN (KASAN_INLINE):

  - Read: a single write command with addr = 0xfffffffffffffe00,
    length = 0x1000 produced a KASAN out-of-bounds read and returned
    ~4 KiB of kernel memory (including kernel .text) into the
    attacker-readable namespace.

  - Crash: a write command with addr = 0xffffffffffff0500,
    length = 0x10000 (sum wraps to 0x500 <= inline_data_size, but
    num_pages(0x10000) = 16 writes 16 scatterlist entries into the
    4-entry inline_sg[], 12 past its end) deterministically corrupted
    the command object and oopsed the target:

      Oops: general protection fault [...] KASAN: null-ptr-deref
      RIP: nvmet_rdma_post_recv+0x... [nvmet_rdma]
        nvmet_rdma_post_recv <- nvmet_rdma_queue_response
        <- __nvmet_req_complete <- nvmet_check_transfer_len
        <- nvmet_rdma_handle_command <- ib_cq_poll_work

    Every reconnect re-triggers it (persistent remote DoS).  The
    nvmet_rdma_cmd objects are carved from one contiguous kcalloc'd
    array, so the over-long entry write stays within that allocation and
    KASAN flags the downstream dereference of the corrupted command in
    nvmet_rdma_post_recv rather than the store itself.  The out-of-bounds
    content is not attacker-controlled, so this is a crash/corruption
    primitive, not a controlled write; I do not see a path to remote code
    execution from this bug.

  Severity estimate.  The two consequences arise from different inline-SGL
  capsules (small vs large length) and are scored as separate single-capsule
  outcomes, not one combined vector:

    OOB read  (info-disclosure):  CVSS 7.5 HIGH
        CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N
    OOB write (corruption/DoS):   CVSS 8.2 HIGH
        CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:H

  Headline 8.2 HIGH (both reachable pre-auth with attr_allow_any_host=1).
  With attr_allow_any_host=0 a valid host NQN is required first (PR:L),
  lowering these to 6.5 and 7.1.
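
  The four scores can be re-derived from the CVSS v3.1 base equations for
  Scope: Unchanged.  The sketch below is only a cross-check; the function
  names are hypothetical, the metric weights (AV:N 0.85, AC:L 0.77,
  PR:N 0.85, PR:L 0.62, UI:N 0.85, H 0.56, L 0.22, N 0) are the v3.1
  values, and results are returned in tenths to avoid comparing floats.

```c
#include <assert.h>

/* CVSS v3.1 Roundup: smallest one-decimal value >= x, returned in tenths. */
int sketch_cvss31_roundup_tenths(double x)
{
	long scaled = (long)(x * 100000.0 + 0.5);	/* round to 5 decimals */

	if (scaled % 10000 == 0)
		return (int)(scaled / 10000);
	return (int)(scaled / 10000) + 1;
}

/* Base score in tenths; valid for Scope: Unchanged vectors only. */
int sketch_cvss31_base_tenths(double av, double ac, double pr, double ui,
			      double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double impact = 6.42 * iss;
	double exploitability = 8.22 * av * ac * pr * ui;
	double sum = impact + exploitability;

	if (impact <= 0.0)
		return 0;
	if (sum > 10.0)
		sum = 10.0;
	return sketch_cvss31_roundup_tenths(sum);
}
```

  For example, sketch_cvss31_base_tenths(0.85, 0.77, 0.85, 0.85, 0.56, 0, 0)
  returns 75, i.e. 7.5 for the read vector.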

== Suggested fix ==

  Validate the offset with check_add_overflow() before comparing against
  inline_data_size.  A passing check then guarantees
  off + len <= inline_data_size <= NVMET_RDMA_MAX_INLINE_DATA_SIZE, which
  bounds both the truncated scatterlist::offset and
  num_pages(len) <= NVMET_RDMA_MAX_INLINE_SGE, closing the read and the
  inline_sg[] overflow together.  Candidate patch inline below (applies
  to current mainline).
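
  A user-space model of the proposed check, for illustration only.  It
  assumes GCC or Clang and uses __builtin_add_overflow() as a stand-in for
  the kernel's check_add_overflow(); the function name and the 4096-byte
  inline_data_size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the inline SGL descriptor must be rejected. */
bool sketch_inline_sgl_invalid(uint64_t off, uint32_t len,
			       uint64_t inline_data_size)
{
	uint64_t bound;

	/* The builtin returns true if off + len does not fit in a u64. */
	if (__builtin_add_overflow(off, (uint64_t)len, &bound))
		return true;

	/* No wrap happened, so bound is the true end of the range. */
	return bound > inline_data_size;
}

/* Both capsules from the report are rejected; an in-range write passes. */
void sketch_fix_demo(void)
{
	assert(sketch_inline_sgl_invalid(0xfffffffffffffe00ULL, 0x1000, 4096));
	assert(sketch_inline_sgl_invalid(0xffffffffffff0500ULL, 0x10000, 4096));
	assert(!sketch_inline_sgl_invalid(0, 0x1000, 4096));
}
```

  Because a passing check implies len <= inline_data_size, the page count
  derived from len is bounded as well, which is the property the fix
  relies on to cover the inline_sg[] overrun.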

== Embargo ==

  I am happy to follow the standard process.  Proposing a 7-day embargo;
  the fix is small and I can adjust as the maintainers prefer.  I have
  not notified linux-distros and will hold that until a public patch
  lands, per the usual guidance.

I am an independent security researcher; please credit
"Bryam Vargas <hexlabsecurity@proton.me>" (Reported-by already in the
patch).  Affiliation: HEXLAB SAS (registration pending) -- Cali,
Colombia.  Happy to provide the full reproduction harness on request.

Thank you,
Bryam Vargas

----- candidate patch (inline, plain text) -----

From 448c122c744430c1c2926d635855a3894370ee33 Mon Sep 17 00:00:00 2001
From: Bryam Vargas <hexlabsecurity@proton.me>
Date: Thu, 28 May 2026 21:23:52 -0500
Subject: [PATCH] nvmet-rdma: fix integer overflow in inline data SGL bounds
 check

nvmet_rdma_map_sgl_inline() bounds-checks the inline data descriptor
with both operands host-controlled and the sum evaluated in u64:

	u64 off = le64_to_cpu(sgl->addr);
	u32 len = le32_to_cpu(sgl->length);
	...
	if (off + len > rsp->queue->dev->inline_data_size)
		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;

"off + len" therefore wraps modulo 2^64.  A descriptor with, for
example, addr = 0xfffffffffffffe00 and length = 0x1000 makes the sum
wrap to 0xe00, which passes the inline_data_size check.  An inline-SGL
write command reaches this path after a Fabrics Connect; on a subsystem
with attr_allow_any_host set it is reachable without authentication by
any peer that can reach the target.

Two distinct out-of-bounds accesses follow from the bypass:

 - nvmet_rdma_use_inline_sg() stores the 64-bit offset into
   scatterlist::offset, which is unsigned int, committing the truncated
   attacker offset to the inline page.  The block layer then accesses
   page_to_phys(inline_page) + (off & 0xffffffff), reading up to
   inline_data_size bytes of kernel memory per command back to the host
   (or faulting the target if the offset lands on unmapped memory).

 - A large len makes sg_count = num_pages(len) in
   nvmet_rdma_use_inline_sg() exceed NVMET_RDMA_MAX_INLINE_SGE, so the
   loop writes scatterlist entries past the fixed-size inline_sg[]
   array, corrupting the surrounding command object and oopsing the
   target on the next use of that command.

Validate the offset with check_add_overflow() before comparing against
inline_data_size.  A passing check then guarantees
off + len <= inline_data_size <= NVMET_RDMA_MAX_INLINE_DATA_SIZE, which
bounds both the truncated scatterlist::offset and
num_pages(len) <= NVMET_RDMA_MAX_INLINE_SGE, closing the out-of-bounds
read and the inline_sg[] overflow together.

Reported-by: Bryam Vargas <hexlabsecurity@proton.me>
Fixes: 0d5ee2b2ab4f ("nvmet-rdma: support max(16KB, PAGE_SIZE) inline data")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
Review context (not for the commit log):

Reproducer -- unprivileged remote RDMA peer against a target with
attr_allow_any_host=1, a single inline-SGL WRITE capsule:
  * OOB read:  sgl->addr=0xfffffffffffffe00, sgl->length=0x1000
               (off+len wraps to 0xe00 <= inline_data_size; sg->offset
               truncates to 0xfffffe00) -> ~4 KiB of kernel memory is
               read back from the namespace.
  * OOB write: sgl->addr=0xffffffffffff0500, sgl->length=0x10000
               (num_pages(0x10000)=16 overruns the 4-entry inline_sg[])
               -> target memory corruption / crash.

A/B-tested on a 6.12.90 KASAN lab kernel (same .config, only this hunk
differs): pre-fix the OOB-read capsule trips "KASAN: use-after-free in
copy_page_from_iter_atomic" via nvmet_file_execute_io; post-fix both
capsules are rejected with "invalid inline data offset!"
(NVME_SC_SGL_INVALID_OFFSET), benign inline writes still succeed, and no
KASAN/oops fires. The fix decides identically in 32- and 64-bit builds
(check_add_overflow operates on u64).

 drivers/nvme/target/rdma.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index e6e2c3f9afdf..a5bbf9d41c3b 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -12,6 +12,7 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/nvme.h>
+#include <linux/overflow.h>
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/wait.h>
@@ -847,6 +848,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
 	struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
 	u64 off = le64_to_cpu(sgl->addr);
 	u32 len = le32_to_cpu(sgl->length);
+	u64 bound;

 	if (!nvme_is_write(rsp->req.cmd)) {
 		rsp->req.error_loc =
@@ -854,7 +856,8 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
 		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}

-	if (off + len > rsp->queue->dev->inline_data_size) {
+	if (check_add_overflow(off, (u64)len, &bound) ||
+	    bound > rsp->queue->dev->inline_data_size) {
 		pr_err("invalid inline data offset!\n");
 		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
 	}
-- 
2.43.0


* Re: [PATCH v7 32/43] btrfs: implement process_bio cb for fscrypt
From: Daniel Vacek @ 2026-05-29 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, Eric Biggers, Theodore Y. Ts'o,
	Jaegeuk Kim, Jens Axboe, David Sterba, linux-block, linux-fscrypt,
	linux-btrfs, linux-kernel
In-Reply-To: <ahAfo4DzvH_ob1hv@infradead.org>

On Fri, 22 May 2026 at 11:19, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, May 13, 2026 at 10:53:06AM +0200, Daniel Vacek wrote:
> > From: Josef Bacik <josef@toxicpanda.com>
> >
> > We are going to be checksumming the encrypted data, so we have to
> > implement the ->process_bio fscrypt callback.  This will provide us with
> > the original bio and the encrypted bio to do work on.  For WRITEs this
> > will happen after the encrypted bio has been encrypted.  For READs this
> > will happen after the read has completed and before the decryption step
> > is done.
> >
> > For writes this is straightforward, we can just pass in the encrypted
> > bio to btrfs_csum_one_bio and then the csums will be added to the bbio
> > as normal.
> >
> > For reads this is relatively straightforward, but requires some care.
> > We assume (because that's how it works currently) that the encrypted bio
> > matches the original bio.  This is important because we save the iter of
> > the bio before we submit.  If this changes in the future we'll need a
> > hook to give us the bi_iter of the decryption bio before it's submitted.
> > We check the csums before decryption.  If they don't match we simply
> > error out and we let the normal path handle the repair work.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Daniel Vacek <neelx@suse.com>
> > ---
> >
> > v7 changes:
> >  * Fixed array overflow stack corruption for bios > max blocksize (>64KiB)
> >    as reported by Chris' AI review.
> > v6 changes:
> >  * Adapt to btrfs_data_csum_ok() changes for bs > ps.  Mostly follow
> >    what was done in 052fd7a5cace ("btrfs: make read verification
> >    handle bs > ps cases without large folios").
> >  * Rename bbio::csum_done to csum_ok due to name collision.
> >    With upstream, member name csum_done was used for async csums.
> > v5: https://lore.kernel.org/linux-btrfs/ca32684b01ff8c252be515509137e0a4a0e5db7a.1706116485.git.josef@toxicpanda.com/
> > ---
> >  fs/btrfs/bio.c       | 44 +++++++++++++++++++++++++++++++++++++++++++-
> >  fs/btrfs/bio.h       |  3 +++
> >  fs/btrfs/file-item.c | 14 ++++++++++++--
> >  fs/btrfs/fscrypt.c   | 29 +++++++++++++++++++++++++++++
> >  4 files changed, 87 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> > index 3e2ee19aab50..729c5aff5c3d 100644
> > --- a/fs/btrfs/bio.c
> > +++ b/fs/btrfs/bio.c
> > @@ -301,6 +301,40 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
> >       return fbio;
> >  }
> >
> > +blk_status_t btrfs_check_encrypted_read_bio(struct btrfs_bio *bbio, struct bio *enc_bio)
> > +{
> > +     struct btrfs_inode *inode = bbio->inode;
> > +     struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +     struct bvec_iter iter = bbio->saved_iter;
> > +     struct btrfs_device *dev = bbio->bio.bi_private;
> > +     const u32 blocksize = fs_info->sectorsize;
> > +     const u32 step = min(blocksize, PAGE_SIZE);
> > +     const u32 nr_steps = iter.bi_size / step;
> > +     phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
> > +     phys_addr_t paddr;
> > +     unsigned int slot = 0;
> > +     u32 offset = 0;
> > +
> > +     /*
> > +      * We have to use a copy of iter in case there's an error,
> > +      * btrfs_check_read_bio will handle submitting the repair bios.
> > +      */
> > +     btrfs_bio_for_each_block(paddr, enc_bio, &iter, step) {
> > +             ASSERT(slot < nr_steps);
> > +             paddrs[slot] = paddr;
> > +             slot++;
> > +             offset += step;
> > +             if (IS_ALIGNED(offset, blocksize)) {
> > +                     if (!btrfs_data_csum_ok(bbio, dev, offset - blocksize, paddrs))
> > +                             return BLK_STS_IOERR;
> > +                     slot = 0;
> > +             }
> > +     }
> > +
> > +     bbio->csum_ok = true;
> > +     return BLK_STS_OK;
> > +}
> > +
> >  static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *dev)
> >  {
> >       struct btrfs_inode *inode = bbio->inode;
> > @@ -330,6 +364,10 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
> >       /* Clear the I/O error. A failed repair will reset it. */
> >       bbio->bio.bi_status = BLK_STS_OK;
> >
> > +     /* This was an encrypted bio and we've already done the csum check. */
> > +     if (status == BLK_STS_OK && bbio->csum_ok)
> > +             goto out;
> > +
> >       btrfs_bio_for_each_block(paddr, &bbio->bio, iter, step) {
> >               paddrs[(offset / step) % nr_steps] = paddr;
> >               offset += step;
> > @@ -341,6 +379,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
> >                                                        paddrs, fbio);
> >               }
> >       }
> > +out:
> >       if (bbio->csum != bbio->csum_inline)
> >               kvfree(bbio->csum);
> >
> > @@ -859,10 +898,13 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> >               /*
> >                * Csum items for reloc roots have already been cloned at this
> >                * point, so they are handled as part of the no-checksum case.
> > +              *
> > +              * Encrypted inodes are csum'ed via the ->process_bio callback.
> >                */
> >               if (!(inode->flags & BTRFS_INODE_NODATASUM) &&
> >                   !test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state) &&
> > -                 !btrfs_is_data_reloc_root(inode->root) && !bbio->is_remap) {
> > +                 !btrfs_is_data_reloc_root(inode->root) && !bbio->is_remap &&
> > +                 !IS_ENCRYPTED(&inode->vfs_inode)) {
> >                       if (should_async_write(bbio) &&
> >                           btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
> >                               goto done;
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index 43f7544029ac..456d32db9e9e 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -43,6 +43,7 @@ struct btrfs_bio {
> >               struct {
> >                       u8 *csum;
> >                       u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> > +                     bool csum_ok;
> >                       struct bvec_iter saved_iter;
> >               };
> >
> > @@ -130,5 +131,7 @@ void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, bool dev_
> >  int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 fileoff,
> >                           u32 length, u64 logical, const phys_addr_t paddrs[],
> >                           unsigned int step, int mirror_num);
> > +blk_status_t btrfs_check_encrypted_read_bio(struct btrfs_bio *bbio,
> > +                                         struct bio *enc_bio);
> >
> >  #endif
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 986914078708..72d9d3243460 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -338,6 +338,14 @@ static int search_csum_tree(struct btrfs_fs_info *fs_info,
> >       return ret;
> >  }
> >
> > +static inline bool inode_skip_csum(struct btrfs_inode *inode)
> > +{
> > +     struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +
> > +     return (inode->flags & BTRFS_INODE_NODATASUM) ||
> > +             test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state);
> > +}
> > +
> >  /*
> >   * Lookup the checksum for the read bio in csum tree.
> >   *
> > @@ -357,8 +365,7 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
> >       int ret = 0;
> >       u32 bio_offset = 0;
> >
> > -     if ((inode->flags & BTRFS_INODE_NODATASUM) ||
> > -         test_bit(BTRFS_FS_STATE_NO_DATA_CSUMS, &fs_info->fs_state))
> > +     if (inode_skip_csum(inode))
> >               return 0;
> >
> >       /*
> > @@ -817,6 +824,9 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio, struct bio *bio, bool async)
> >       struct btrfs_ordered_sum *sums;
> >       unsigned nofs_flag;
> >
> > +     if (inode_skip_csum(inode))
> > +             return 0;
> > +
> >       nofs_flag = memalloc_nofs_save();
> >       sums = kvzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size),
> >                      GFP_KERNEL);
> > diff --git a/fs/btrfs/fscrypt.c b/fs/btrfs/fscrypt.c
> > index 5d34a8b94da5..924ee3df7f32 100644
> > --- a/fs/btrfs/fscrypt.c
> > +++ b/fs/btrfs/fscrypt.c
> > @@ -16,6 +16,7 @@
> >  #include "transaction.h"
> >  #include "volumes.h"
> >  #include "xattr.h"
> > +#include "file-item.h"
> >
> >  /*
> >   * From a given location in a leaf, read a name into a qstr (usually a
> > @@ -212,6 +213,33 @@ static struct block_device **btrfs_fscrypt_get_devices(struct super_block *sb,
> >       return devs;
> >  }
> >
> > +static blk_status_t btrfs_process_encrypted_bio(struct bio *orig_bio,
> > +                                             struct bio *enc_bio)
> > +{
> > +     struct btrfs_bio *bbio;
> > +
> > +     /*
> > +      * If our bio is from the normal fs_bio_set then we know this is a
> > +      * mirror split and we can skip it, we'll get the real bio on the last
> > +      * mirror and we can process that one.
> > +      */
> > +     if (orig_bio->bi_pool == &fs_bio_set)
> > +             return BLK_STS_OK;
> > +
> > +     bbio = btrfs_bio(orig_bio);
> > +
> > +     if (bio_op(orig_bio) == REQ_OP_READ) {
> > +             /*
> > +              * We have ->saved_iter based on the orig_bio, so if the block
> > +              * layer changes we need to notice this asap so we can update
> > +              * our code to handle the new world order.
> > +              */
> > +             ASSERT(orig_bio == enc_bio);
> > +             return btrfs_check_encrypted_read_bio(bbio, enc_bio);
> > +     }
> > +     return btrfs_csum_one_bio(bbio, enc_bio, false);
>
> Honestly, all this shows that the architecture of the I/O path in this
> series is pretty broken.  It needs all this magic detection, and the
> passing of arguments that mixes the bbio for state and the lower
> encrypted bio without the btrfs context shows something doesn't work
> well.

Well, this is all limited within the scope of the filesystem. Since
btrfs needs to compute the data checksum and the bounce bio (with the
encrypted pages) is created by the lower fscrypt layer, how else could
we accomplish this?

As the blk-crypto is inlined, without the callback the filesystem
never sees the encrypted data at all and it won't be able to get
checksums.
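
To illustrate why the filesystem has to see the ciphertext, here is a
toy user-space sketch of the ordering: encrypt, hand the encrypted
buffer to a filesystem hook that checksums it, and on read verify the
checksum before decrypting.  Everything in it is hypothetical: the XOR
"cipher" is a placeholder and not real encryption, the hook type only
loosely mirrors the ->process_bio idea, and the CRC-32C routine is a
generic bitwise implementation rather than the btrfs checksum code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t sketch_crc32c(const uint8_t *p, size_t n)
{
	uint32_t crc = 0xFFFFFFFFu;

	while (n--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Toy XOR transform standing in for encryption/decryption. NOT secure. */
void sketch_xor_crypt(uint8_t *dst, const uint8_t *src, size_t n, uint8_t key)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] ^ key;
}

/* Hypothetical hook: the filesystem is shown the encrypted buffer. */
typedef void (*sketch_process_fn)(const uint8_t *enc, size_t n, void *ctx);

/* Example hook that records a checksum of the ciphertext in *ctx. */
void sketch_store_csum(const uint8_t *enc, size_t n, void *ctx)
{
	*(uint32_t *)ctx = sketch_crc32c(enc, n);
}

/* Write path: encrypt into "disk", then let the hook checksum it. */
void sketch_write(uint8_t *disk, const uint8_t *plain, size_t n, uint8_t key,
		  sketch_process_fn hook, void *ctx)
{
	sketch_xor_crypt(disk, plain, n, key);
	hook(disk, n, ctx);
}

/* Read path: verify the ciphertext first, decrypt only on a match. */
bool sketch_read(uint8_t *plain, const uint8_t *disk, size_t n, uint8_t key,
		 uint32_t csum)
{
	if (sketch_crc32c(disk, n) != csum)
		return false;
	sketch_xor_crypt(plain, disk, n, key);
	return true;
}
```

Without the hook call in sketch_write(), the caller would only ever hold
the plaintext and could not produce a checksum that matches what is
stored on the device.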

> So let's take a step back, if we think of the I/O pipeline, it should do
> things in this order for writes:
>
>  - encrypt data
>  - generate checksums
>  - do mirroring/striping/parity
>
> and reverse for reads.
>
> All this suggest that the btrfs_bio needs to exist for the encrypted
> data.

My understanding was that fscrypt works differently. The bounce bio is
created inline in the lower layer, agnostic to any filesystem.

>  So I think you'll need to refactor this, preferably without the
> really annoying two-level callbacks that are really hard to follow (or
> implement).  Your caller is in the file system, and it should be able to
> call fscrypt as helpers instead of going two layers down using direct
> calls and then two layers back up using indirect calls.  The recent
> refactoring that moves the fscrypt fallback above the block layer
> instead of calling it from the bottom should help a lot with that.

Yeah, I may look into that. What you're talking about is a pretty recent change.
This is an old patch [1] from 2023, rebased with few changes since then
because it received little feedback at the time. So it still follows the
original (former) design.

If fscrypt supported checksumming the encrypted data and returned the
value back to the filesystem, no callbacks would be needed. Though to
me that sounds more invasive than this callback.

[1] https://lore.kernel.org/linux-btrfs/a26514814b4d2a54ff2317369365dc2bf1c280dc.1695750478.git.josef@toxicpanda.com/

--nX


* Re: [REPORT] nvmet-rdma: integer overflow in inline-data SGL bounds check -> pre-auth kernel-memory read + remote crash (candidate patch inline)
From: Keith Busch @ 2026-05-29 16:09 UTC (permalink / raw)
  To: hexlabsecurity
  Cc: security@kernel.org, hch@lst.de, sagi@grimberg.me, kch@nvidia.com,
	linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
	linux-block@vger.kernel.org
In-Reply-To: <LM21QIR-1-qJb7PViyJKCnGBnUzizeiNJVWQ3wb7ZwGezodjgKg3f-iobqOyequ-sT1jFCKJImfqNO_BKU3KO80xFITnaI5GTV_GxLUNDDc=@proton.me>

On Fri, May 29, 2026 at 06:52:13AM +0000, hexlabsecurity@proton.me wrote:
> @@ -847,6 +848,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
>  	struct nvme_sgl_desc *sgl = &rsp->req.cmd->common.dptr.sgl;
>  	u64 off = le64_to_cpu(sgl->addr);
>  	u32 len = le32_to_cpu(sgl->length);
> +	u64 bound;
> 
>  	if (!nvme_is_write(rsp->req.cmd)) {
>  		rsp->req.error_loc =
> @@ -854,7 +856,8 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
>  		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
>  	}
> 
> -	if (off + len > rsp->queue->dev->inline_data_size) {
> +	if (check_add_overflow(off, (u64)len, &bound) ||
> +	    bound > rsp->queue->dev->inline_data_size) {

Since you don't use "bound" for anything other than the final check, I
think we can make this simpler without it:

	if (off > rsp->queue->dev->inline_data_size ||
	    len > rsp->queue->dev->inline_data_size - off) {
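
As a quick user-space cross-check (a sketch, not kernel code), the
subtraction form can be compared against an overflow-checked addition
over a handful of boundary values.  The function names and the value
tables are hypothetical, and __builtin_add_overflow() (GCC/Clang) stands
in for check_add_overflow().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overflow-checked addition form; true means "reject". */
bool sketch_invalid_add(uint64_t off, uint32_t len, uint64_t size)
{
	uint64_t bound;

	return __builtin_add_overflow(off, (uint64_t)len, &bound) ||
	       bound > size;
}

/* Subtraction form: size - off cannot underflow once off <= size holds. */
bool sketch_invalid_sub(uint64_t off, uint32_t len, uint64_t size)
{
	return off > size || len > size - off;
}

/* Returns true if both forms decide identically on every sampled pair. */
bool sketch_forms_agree(uint64_t size)
{
	static const uint64_t offs[] = {
		0, 1, 0x800, 0xfff, 0x1000, 0x1001, 0x4000,
		0xffffffffffff0500ULL, 0xfffffffffffffe00ULL, UINT64_MAX,
	};
	static const uint32_t lens[] = {
		0, 1, 0x800, 0x1000, 0x1001, 0x10000, UINT32_MAX,
	};

	for (size_t i = 0; i < sizeof(offs) / sizeof(offs[0]); i++)
		for (size_t j = 0; j < sizeof(lens) / sizeof(lens[0]); j++)
			if (sketch_invalid_add(offs[i], lens[j], size) !=
			    sketch_invalid_sub(offs[i], lens[j], size))
				return false;
	return true;
}
```

The sampled values are not a proof, but the two forms agree in general:
the first comparison rules out off > size, so the subtraction is exact
and the second comparison is the same inequality rearranged.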

Thanks for the report.


* [GIT PULL] Block fix for 7.1-rc6
From: Jens Axboe @ 2026-05-29 17:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org

Hi Linus,

Just a single fix for the block side, making a slight tweak to a fix
from this cycle. Please pull!


The following changes since commit f6982769910ecddabdb5b8b9afdab0bb8b6668ac:

  block: avoid use-after-free in disk_free_zone_resources() (2026-05-22 08:01:52 -0600)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git block-7.1-20260529

for you to fetch changes up to b051bb6bf0a231117036aa607cadf55be8e63910:

  blk-mq: reinsert cached request to the list (2026-05-26 15:05:30 -0600)

----------------------------------------------------------------
block-7.1-20260529

----------------------------------------------------------------
Keith Busch (1):
      blk-mq: reinsert cached request to the list

 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
Jens Axboe



* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-29 17:13 UTC (permalink / raw)
  To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
In-Reply-To: <043e357f-5b37-4e05-9433-271504fc1d30@fygo.io>

On 2026-05-25 00:28, Yu Kuai wrote:

> On 2026/5/22 5:52, Jens Axboe wrote:
> Yes, perf data will be helpful. And please show your test in detail and
> I'll check if I can reproduce it.

Hi Yu Kuai,
Have you reproduced the issue yet?

Below is some perf data we took while running random read test:

Test:
FIO random read with qdepth=1 nj=20; we saw higher CPU utilization in this
testcase.

Perf record:
Start the fio run in one session and kick off the script in another session
while the test is running.

Perf report:
With blk_start_plug/blk_finish_plug before calling __submit_bio() in blk-core.c:
Top.txt
      2.41%  fio              [kernel.kallsyms]  [k] cpupri_set
      1.16%  fio              [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.75%  fio              [kernel.kallsyms]  [k] sbitmap_find_bit
      0.47%  fio              [kernel.kallsyms]  [k] set_next_task_rt
      0.41%  fio              [kernel.kallsyms]  [k] pull_rt_task
      0.34%  fio              [kernel.kallsyms]  [k] enqueue_pushable_task
       …
      0.02%  fio              [kernel.kallsyms]  [k] __blk_flush_plug
      0.01%  fio              [kernel.kallsyms]  [k] blk_add_rq_to_plug
      0.01%  fio              [kernel.kallsyms]  [k] blk_mq_flush_plug_list
      0.00%  fio              [kernel.kallsyms]  [k] blk_attempt_plug_merge

Callgraph.txt

      2.41%  fio              [kernel.kallsyms]  [k] cpupri_set
             |
             ---cpupri_set
                |
                |--1.15%--__enqueue_rt_entity
                |          enqueue_task_rt
                |          enqueue_task
                |          ttwu_do_activate


Perf report:
Without blk_start_plug/blk_finish_plug before calling __submit_bio():
Top.txt
      0.67%  fio              [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.64%  fio              [kernel.kallsyms]  [k] sched_balance_newidle
      0.47%  fio              [kernel.kallsyms]  [k] _raw_spin_lock
      0.39%  fio              [kernel.kallsyms]  [k] sbitmap_find_bit
      0.35%  fio              [kernel.kallsyms]  [k] cpupri_set
      0.28%  fio              [kernel.kallsyms]  [k] work_grab_pending
      0.24%  fio              [kernel.kallsyms]  [k] lookup_ioctx
      0.23%  fio              [kernel.kallsyms]  [k] __schedule
       …
      0.00%  fio              [kernel.kallsyms]  [k] blk_attempt_plug_merge

Callgraph.txt

      0.35%  fio              [kernel.kallsyms]  [k] cpupri_set
             |
             ---cpupri_set
                |
                |--0.17%--arch_local_irq_restore.part.0
                |          |
                |          |--0.14%--finish_task_switch.isra.0
                |          |          __schedule
                |          |          |
                |          |          |--0.13%--schedule
                |          |          |          |
                |          |          |          |--0.07%--read_events
                …
                |--0.13%--__enqueue_rt_entity
                |          enqueue_task_rt
                |          enqueue_task
                |          ttwu_do_activate

From the perf data above, it looks like:
1. With plugging, more time is spent in cpupri_set() (2.41% vs. 0.35%):
tasks are being enqueued/dequeued more frequently, i.e. more IO scheduling.
2. More plug routines are called.
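
As a rough illustration of point 1, the Top.txt figures above can be put side by side. This is only a sketch: the dictionary names and the `overhead_ratio` helper are made up for illustration, and only symbols that appear in both listings are included.

```python
# Self-time percentages copied from the two Top.txt listings above.
with_plug = {
    "cpupri_set": 2.41,
    "queued_spin_lock_slowpath": 1.16,
    "sbitmap_find_bit": 0.75,
}
without_plug = {
    "cpupri_set": 0.35,
    "queued_spin_lock_slowpath": 0.67,
    "sbitmap_find_bit": 0.39,
}


def overhead_ratio(symbol):
    """How many times larger the share is with plugging (hypothetical helper)."""
    return with_plug[symbol] / without_plug[symbol]


# Print the symbols ordered by how much their share grew with plugging.
for sym in sorted(with_plug, key=overhead_ratio, reverse=True):
    print(f"{sym:28s} {with_plug[sym]:5.2f}% vs {without_plug[sym]:5.2f}%"
          f"  x{overhead_ratio(sym):.1f}")
```

The percentages are shares of samples within each profile, so the ratios only hint at where the extra CPU time goes; they are not absolute CPU-time measurements.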

If you need the full perf report, I can email/attach it.

Thanks for your help!
Wendy


* Re: [PATCH 6.12 000/272] 6.12.92-rc1 review
From: Florian Fainelli @ 2026-05-29 19:33 UTC (permalink / raw)
  To: Sasha Levin, Miguel Ojeda
  Cc: gregkh, achill, akpm, broonie, conor, hargar, jonathanh,
	linux-kernel, linux, lkft-triage, patches, patches, pavel,
	rwarsow, shuah, sr, stable, sudipm.mukherjee, torvalds,
	Anuj Gupta, Kanchan Joshi, Christoph Hellwig, Keith Busch,
	Jens Axboe, linux-block
In-Reply-To: <20260529122623.bio-integrity-rc-prereq@kernel.org>

On 5/29/26 05:44, Sasha Levin wrote:
> On Fri, May 29, 2026 at 10:27:21AM +0200, Pavel Machek wrote:
>>> I am seeing:
>>>
>>>      ./include/linux/bio-integrity.h:101:12: error: unused function 'bio_integrity_map_user' [-Werror,-Wunused-function]
>>>
>>> This looks like it needs:
>>>
>>>    546d191427cf ("block: make bio_integrity_map_user() static inline")
>>>
>> We see that, too:
>> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/jobs/14592368004
>> We don't see the problem on 6.6, 6.18 or 7.0-stable.
> 
> Thanks! I've queued up 546d191427cf ("block: make bio_integrity_map_user()
> static inline").

Thanks, also seen here, FWIW.
-- 
Florian


* Re: [GIT PULL] Block fix for 7.1-rc6
From: pr-tracker-bot @ 2026-05-29 20:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, linux-block@vger.kernel.org
In-Reply-To: <d12b120e-0628-4d5c-a36a-7618f752250d@kernel.dk>

The pull request you sent on Fri, 29 May 2026 11:03:35 -0600:

> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git block-7.1-20260529

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9215e74f228f2b239f41271da9e5076ee3439d1b

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* [syzbot] [block?] possible deadlock in loop_process_work
From: syzbot @ 2026-05-29 20:24 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    c1ecb239fa34 Add linux-next specific files for 20260522
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12fa6336580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=78ad2c6a58c0a1faa5f5
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+78ad2c6a58c0a1faa5f5@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L     
------------------------------------------------------
kworker/u8:15/1491 is trying to acquire lock:
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: do_req_filebacked drivers/block/loop.c:433 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_handle_cmd drivers/block/loop.c:1941 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976

but task is already holding lock:
ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #7 ((work_completion)(&worker->work)){+.+.}-{0:0}:
       process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #6 ((wq_completion)loop4){+.+.}-{0:0}:
       touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
       __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
       drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
       __loop_clr_fd drivers/block/loop.c:1130 [inline]
       lo_release+0x287/0x8f0 drivers/block/loop.c:1767
       bdev_release+0x541/0x660 block/bdev.c:-1
       blkdev_release+0x15/0x20 block/fops.c:705
       __fput+0x461/0xa70 fs/file_table.c:510
       fput_close_sync+0x11f/0x240 fs/file_table.c:615
       __do_sys_close fs/open.c:1511 [inline]
       __se_sys_close fs/open.c:1496 [inline]
       __x64_sys_close+0x7e/0x110 fs/open.c:1496
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&disk->open_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       __del_gendisk+0x127/0x980 block/genhd.c:710
       del_gendisk+0xe7/0x160 block/genhd.c:823
       nbd_dev_remove drivers/block/nbd.c:268 [inline]
       nbd_dev_remove_work+0x47/0xe0 drivers/block/nbd.c:284
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #4 (&set->update_nr_hwq_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       add_disk_fwnode+0xe7/0x480 block/genhd.c:596
       add_disk include/linux/blkdev.h:794 [inline]
       nbd_dev_add+0x72c/0xb50 drivers/block/nbd.c:1984
       nbd_genl_connect+0x965/0x1c80 drivers/block/nbd.c:2125
       genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2551
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       ____sys_sendmsg+0x55c/0x870 net/socket.c:2716
       ___sys_sendmsg+0x2a5/0x360 net/socket.c:2770
       __sys_sendmsg net/socket.c:2802 [inline]
       __do_sys_sendmsg net/socket.c:2807 [inline]
       __se_sys_sendmsg net/socket.c:2805 [inline]
       __x64_sys_sendmsg+0x1c3/0x2a0 net/socket.c:2805
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (genl_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       genl_lock net/netlink/genetlink.c:35 [inline]
       genl_lock_all net/netlink/genetlink.c:48 [inline]
       genl_register_family+0x7b9/0x17b0 net/netlink/genetlink.c:784
       vdpa_init+0x39/0x70 drivers/vdpa/vdpa.c:1565
       do_one_initcall+0x250/0x870 init/main.c:1347
       do_initcall_level+0x104/0x190 init/main.c:1409
       do_initcalls+0x59/0xa0 init/main.c:1425
       kernel_init_freeable+0x2a6/0x3e0 init/main.c:1658
       kernel_init+0x1d/0x1d0 init/main.c:1548
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (cb_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       genl_rcv+0x19/0x40 net/netlink/genetlink.c:1217
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       sock_sendmsg+0x1ca/0x2d0 net/socket.c:835
       splice_to_socket+0xae5/0x11f0 fs/splice.c:884
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&pipe->mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       iter_file_splice_write+0x1f3/0x10f0 fs/splice.c:682
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (sb_writers#5){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3167 [inline]
       check_prevs_add kernel/locking/lockdep.c:3286 [inline]
       validate_chain kernel/locking/lockdep.c:3910 [inline]
       __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write include/linux/fs/super.h:125 [inline]
       kiocb_start_write include/linux/fs.h:2767 [inline]
       lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
       do_req_filebacked drivers/block/loop.c:433 [inline]
       loop_handle_cmd drivers/block/loop.c:1941 [inline]
       loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

other info that might help us debug this:

Chain exists of:
  sb_writers#5 --> (wq_completion)loop4 --> (work_completion)(&worker->work)

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&worker->work));
                               lock((wq_completion)loop4);
                               lock((work_completion)(&worker->work));
  rlock(sb_writers#5);

 *** DEADLOCK ***
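
For readers following the chain, the report boils down to a cycle check on a lock-dependency graph. The sketch below is a simplified illustration in Python, not lockdep's actual implementation; the `existing` edge table is read off the dependency chain (#0 to #7) above, and `find_path` is a hypothetical helper.

```python
# Existing dependency edges read off the chain above:
# "A": ["B"] means B was acquired while A was held.
existing = {
    "sb_writers#5": ["&pipe->mutex"],
    "&pipe->mutex": ["cb_lock"],
    "cb_lock": ["genl_mutex"],
    "genl_mutex": ["&set->update_nr_hwq_lock"],
    "&set->update_nr_hwq_lock": ["&disk->open_mutex"],
    "&disk->open_mutex": ["(wq_completion)loop4"],
    "(wq_completion)loop4": ["(work_completion)(&worker->work)"],
}


def find_path(graph, src, dst, seen=None):
    """Depth-first search; return the list of locks from src to dst, or None."""
    if seen is None:
        seen = set()
    if src == dst:
        return [src]
    seen.add(src)
    for nxt in graph.get(src, []):
        if nxt not in seen:
            rest = find_path(graph, nxt, dst, seen)
            if rest is not None:
                return [src] + rest
    return None


held = "(work_completion)(&worker->work)"
wanted = "sb_writers#5"

# Acquiring `wanted` while holding `held` adds the edge held -> wanted.
# That closes a cycle if `held` is already reachable from `wanted`.
path = find_path(existing, wanted, held)
if path is not None:
    print("cycle:", " --> ".join(path + [wanted]))
```

Lockdep tracks lock classes rather than instances, which is why a chain through loop4, nbd and genetlink can be reported against a worker running on loop5.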

2 locks held by kworker/u8:15/1491:
 #0: ffff888022729938 ((wq_completion)loop5){+.+.}-{0:0}, at: process_one_work+0x897/0x1630 kernel/workqueue.c:3293
 #1: ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294

stack backtrace:
CPU: 0 UID: 0 PID: 1491 Comm: kworker/u8:15 Tainted: G             L      syzkaller #0 PREEMPT_{RT,(full)} 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Workqueue: loop5 loop_workfn
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
 check_prev_add kernel/locking/lockdep.c:3167 [inline]
 check_prevs_add kernel/locking/lockdep.c:3286 [inline]
 validate_chain kernel/locking/lockdep.c:3910 [inline]
 __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
 lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
 percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
 percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
 __sb_start_write include/linux/fs/super.h:19 [inline]
 sb_start_write include/linux/fs/super.h:125 [inline]
 kiocb_start_write include/linux/fs.h:2767 [inline]
 lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
 do_req_filebacked drivers/block/loop.c:433 [inline]
 loop_handle_cmd drivers/block/loop.c:1941 [inline]
 loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
 process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
 process_scheduled_works kernel/workqueue.c:3401 [inline]
 worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-29 20:46 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Dave Chinner,
	Bart Van Assche, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <ahbq0RdUyLIPiItB@infradead.org>

On 5/27/26 9:00 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
>> > I ran some experiments with fio on both XFS and a raw block device. Five
>> > iterations each for 60s. Results below.
>> > 
>> > TLDR: Removing the delay doesn't significantly decrease user-visible
>> > latency or otherwise improve performance, but does significantly reduce
>> > throughput and increase context switches in some workloads (e.g. C).
>> > I think it makes sense to leave the delay as-is. Thoughts?
>> 
>> Thanks for the test! One question below:
> 
> Thanks from me as well!
> 
>> 
>> > Results:
>> > 
>> > Workloads (all `uncached=1`):
>> >   A: rw=write     bs=128k iodepth=1   ioengine=pvsync2     # XFS
>> >   B: rw=write     bs=128k iodepth=128 ioengine=io_uring    # XFS
>> >   C: rw=randwrite bs=4k   iodepth=32  ioengine=io_uring    # XFS
>> >   D: rw=rw 50/50  bs=64k  iodepth=32  ioengine=io_uring    # XFS
>> >   E: rw=write     bs=128k iodepth=128 ioengine=io_uring    # raw /dev/nvmeXn1
>> >   F: rw=write     bs=128k iodepth=128 numjobs=4
>> >      + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
>> > 
>> > Mean ± stddev across 5 iterations:
>> > 
>> >     metric                     delay=1           delay=0     delta
>> >     --------------------------------------------------------------
>> > 
>> >   A seq 128k qd1
>> >     BW (MB/s)                4333 ± 27         4374 ± 34     +0.9%
>> >     p99   (us)              36.2 ± 0.8        35.8 ± 0.4     -1.1%
>> >     p999  (us)               3260 ± 75         3228 ± 29     -1.0%
>> >     ctx-switches          184 k ± 59 k     3.68 M ± 65 k    +1903%
>> >     cs / io                0.09 ± 0.03       1.86 ± 0.03    +1888%
>> >     avg bios/run            80.4 ± 0.6         1.1 ± 0.0    -98.7%
>> 
>> So a 1 jiffy delay is (with default HZ=1000) 1ms. That means for this load
>> the completion latency should be at least 1000us but your results show p99
>> latency of 36. What am I missing?
> 
> Yes, this looks a bit odd.  Unless there are multiple threads submitting
> and the completions somehow get batched, this should complete one
> bio at a time and be the worst case for the delay scheme.

Sorry, I should've clarified - the latency here is the userspace-visible
I/O completion latency (i.e. fio's clat value).

I ran again and traced to get the actual time from __bio_complete_in_task()
to calling ->bi_end_io(). The results match the 1 jiffy delay now:

  metric                  delay=1  delay=0

  A seq 128k qd1
    fio clat p99             38us     36us
    bio cb p50             1.23ms    2.5us
    bio cb p99             4.13ms   1.44ms
    bio cb p999            5.01ms   2.63ms

  B seq 128k qd128
    fio clat p99           8.74ms   8.85ms
    bio cb p50             1.27ms    3.1us
    bio cb p99             4.05ms   2.27ms
    bio cb p999            4.91ms   2.77ms

  C rand 4k qd32
    fio clat p99           8.16ms   8.11ms
    bio cb p50             1.09ms   97.7us
    bio cb p99             3.73ms   2.06ms
    bio cb p999           11.87ms   3.79ms

  D mixed 64k qd32
    fio clat p99            981us   1.03ms
    bio cb p50             1.14ms   39.5us
    bio cb p99             2.83ms    275us
    bio cb p999            3.06ms    595us

  E raw 128k qd128
    fio clat p99          26.97ms  27.34ms
    bio cb p50             1.58ms   41.5us
    bio cb p99             2.98ms    325us
    bio cb p999            3.02ms    575us

  F mem-pressure
    fio clat p99          29.75ms  30.43ms
    bio cb p50             1.32ms    2.5us
    bio cb p99             3.73ms   2.48ms
    bio cb p999            4.62ms   2.83ms
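
As a quick sanity check of the "matches the 1 jiffy delay" reading, the delay=1 p50 values can be expressed in jiffies. This is a minimal sketch that assumes HZ=1000, as mentioned earlier in the thread; the variable names are illustrative only.

```python
HZ = 1000  # assumption: the HZ=1000 value mentioned earlier in the thread
jiffy_ms = 1000.0 / HZ  # length of one jiffy in milliseconds

# "bio cb p50" with delay=1, in ms, copied from the table above.
p50_delay1_ms = {"A": 1.23, "B": 1.27, "C": 1.09, "D": 1.14, "E": 1.58, "F": 1.32}

for name, p50 in p50_delay1_ms.items():
    print(f"{name}: p50 = {p50:.2f} ms = {p50 / jiffy_ms:.2f} jiffies")

# Every workload's median callback latency should be at least one jiffy.
all_at_least_one_jiffy = all(p50 >= jiffy_ms for p50 in p50_delay1_ms.values())
print("all p50 >= 1 jiffy:", all_at_least_one_jiffy)
```

With a different HZ the jiffy length changes, so the comparison only holds under the stated assumption.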

Note that in the above, the C degradation didn't reproduce as much. The
bandwidth does go down from 64.5 MB/s with delay=1 to 54.9 MB/s with delay=0,
but it's a much smaller drop. I ran it several more times and ran into the
degradation ~20% of the time. The lack of batching means the completion
kworker fires for nearly every bio, leading to heavier preemption when a
writer is placed on a CPU that receives many completion IRQs. The degradation
seems to occur when the writers are migrated less often, leading to more
preemption. But I haven't dug into why the scheduler chooses to migrate more
in some runs vs. others. However, when pinning to 16 cores, the difference
between delay=0 and delay=1 goes away.
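
To put the smaller drop in numbers, here is a short sketch of the relative change for workload C in this run, next to the -32.7% delta from the earlier table. The `relative_change` helper is hypothetical and only does the arithmetic.

```python
def relative_change(before, after):
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Workload C bandwidth in MB/s for this run (delay=1 -> delay=0).
this_run = relative_change(64.5, 54.9)

# Delta reported for workload C in the earlier 5-iteration table.
earlier_reported = -32.7

print(f"this run: {this_run:+.1f}%  earlier table: {earlier_reported:+.1f}%")
```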

C specifically also seems to get worse because we're doing random writes to a
sparse file, so each bio goes through the IOMAP_IOEND_UNWRITTEN path and the
completion path is heavier, leading to more CPU stealing from the writing
threads compared to the other workloads.

>> >   C rand 4k qd32
>> >     BW (MB/s)               66.2 ± 0.8        44.6 ± 7.4    -32.7%
>> >     p99   (us)              8002 ± 174      17990 ± 6800   +124.8%
>> >     p999  (us)             11390 ± 554     31890 ± 11076   +180.0%
>> >     ctx-switches         3.67 M ± 45 k    3.59 M ± 106 k     -2.2%
>> >     cs / io                3.78 ± 0.04       5.62 ± 0.83    +48.7%
>> >     avg bios/run            32.3 ± 1.0         3.1 ± 0.3    -90.5%
>> 
>> I'm somewhat surprised by how much larger the completion latency is here without
>> the delay. Is that due to contention on the local lock between the IO completion
>> interrupt and the worker? Or why is the completion latency so big here when
>> case B, with more IOs in flight and fewer bios per run, still had significantly
>> lower latency in the delay=0 case?
> 
> Note that in the past we had major problems with workqueue scheduling
> latency.  At some point these got mitigated a lot, but if they are back
> for this workload that might be one reason.
> 



* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Hillf Danton @ 2026-05-29 22:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	Ming Lei, linux-block, LKML, Andrew Morton, Linus Torvalds,
	linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner,
	syzbot+78ad2c6a58c0a1faa5f5
In-Reply-To: <20260529070411.1206-1-hdanton@sina.com>

On Fri, 29 May 2026 15:04:10 +0800 Hillf Danton wrote:
>On Fri, 29 May 2026 09:14:47 +0900 Tetsuo Handa wrote:
>>On 2026/05/29 8:00, Hillf Danton wrote:
>>>> Given the loop workqueue that triggered the jfs warning, can you specify
>>>> the reason why the workqueue in question is NOT flushed while closing disk?
>>>>
>>> Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
>>> ("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
>>> And the deadlock can be reproduced by flushing the loop workqueue with
>>> disk->open_mutex held [1].
>>> 
>>> [1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
>>> https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/
>>
>>We can avoid the following lockdep warnings (including [1] you mentioned)
>>
>>  https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
>>  https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
>>  https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
>>  https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
>>  https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4
>>
>>caused by "drain_workqueue() with disk->open_mutex held" if we assign
>>caller-specific lockdep class to disk->open_mutex
>>
>>  https://sourceforge.net/p/tomoyo/tomoyo.git/ci/c2245c765ebeba9dcb924d9171d8d470a9ac41c8/
>>
>>.
>>
>>Also, we can avoid lockdep warning caused by "drain_workqueue() with disk->open_mutex held" +
>>"holding system_transition_mutex" if we forbid binding to pseudo files as backing file
>>in the loop driver
>>
>>  https://lkml.kernel.org/r/d38e4600-3c32-491f-aa49-905f4fad1bfb@I-love.SAKURA.ne.jp
>>
>>which we can reproduce with
>>
>>  echo 7:0 > /sys/power/resume
>>  losetup /dev/loop0 /sys/power/resume
>>  cat /dev/loop0 > /dev/null
>>  losetup -d /dev/loop0
>>
>>.
>>
>> Therefore, I think we can address this problem by doing "drain_workqueue() with
>> disk->open_mutex held" on the loop driver side.
>>
> Good news.
>
Bad news: Subject: [syzbot] [block?] possible deadlock in loop_process_work
[3] https://lore.kernel.org/lkml/6a19f5f7.5099cdd9.8e407.0004.GAE@google.com/

syzbot found the following issue on:

HEAD commit:    c1ecb239fa34 Add linux-next specific files for 20260522
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12fa6336580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=78ad2c6a58c0a1faa5f5
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+78ad2c6a58c0a1faa5f5@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
------------------------------------------------------
kworker/u8:15/1491 is trying to acquire lock:
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: do_req_filebacked drivers/block/loop.c:433 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_handle_cmd drivers/block/loop.c:1941 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976

but task is already holding lock:
ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #7 ((work_completion)(&worker->work)){+.+.}-{0:0}:
       process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #6 ((wq_completion)loop4){+.+.}-{0:0}:
       touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
       __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
       drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
       __loop_clr_fd drivers/block/loop.c:1130 [inline]
       lo_release+0x287/0x8f0 drivers/block/loop.c:1767
       bdev_release+0x541/0x660 block/bdev.c:-1
       blkdev_release+0x15/0x20 block/fops.c:705
       __fput+0x461/0xa70 fs/file_table.c:510
       fput_close_sync+0x11f/0x240 fs/file_table.c:615
       __do_sys_close fs/open.c:1511 [inline]
       __se_sys_close fs/open.c:1496 [inline]
       __x64_sys_close+0x7e/0x110 fs/open.c:1496
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&disk->open_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       __del_gendisk+0x127/0x980 block/genhd.c:710
       del_gendisk+0xe7/0x160 block/genhd.c:823
       nbd_dev_remove drivers/block/nbd.c:268 [inline]
       nbd_dev_remove_work+0x47/0xe0 drivers/block/nbd.c:284
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #4 (&set->update_nr_hwq_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       add_disk_fwnode+0xe7/0x480 block/genhd.c:596
       add_disk include/linux/blkdev.h:794 [inline]
       nbd_dev_add+0x72c/0xb50 drivers/block/nbd.c:1984
       nbd_genl_connect+0x965/0x1c80 drivers/block/nbd.c:2125
       genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2551
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       ____sys_sendmsg+0x55c/0x870 net/socket.c:2716
       ___sys_sendmsg+0x2a5/0x360 net/socket.c:2770
       __sys_sendmsg net/socket.c:2802 [inline]
       __do_sys_sendmsg net/socket.c:2807 [inline]
       __se_sys_sendmsg net/socket.c:2805 [inline]
       __x64_sys_sendmsg+0x1c3/0x2a0 net/socket.c:2805
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (genl_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       genl_lock net/netlink/genetlink.c:35 [inline]
       genl_lock_all net/netlink/genetlink.c:48 [inline]
       genl_register_family+0x7b9/0x17b0 net/netlink/genetlink.c:784
       vdpa_init+0x39/0x70 drivers/vdpa/vdpa.c:1565
       do_one_initcall+0x250/0x870 init/main.c:1347
       do_initcall_level+0x104/0x190 init/main.c:1409
       do_initcalls+0x59/0xa0 init/main.c:1425
       kernel_init_freeable+0x2a6/0x3e0 init/main.c:1658
       kernel_init+0x1d/0x1d0 init/main.c:1548
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (cb_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       genl_rcv+0x19/0x40 net/netlink/genetlink.c:1217
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       sock_sendmsg+0x1ca/0x2d0 net/socket.c:835
       splice_to_socket+0xae5/0x11f0 fs/splice.c:884
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&pipe->mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       iter_file_splice_write+0x1f3/0x10f0 fs/splice.c:682
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (sb_writers#5){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3167 [inline]
       check_prevs_add kernel/locking/lockdep.c:3286 [inline]
       validate_chain kernel/locking/lockdep.c:3910 [inline]
       __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write include/linux/fs/super.h:125 [inline]
       kiocb_start_write include/linux/fs.h:2767 [inline]
       lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
       do_req_filebacked drivers/block/loop.c:433 [inline]
       loop_handle_cmd drivers/block/loop.c:1941 [inline]
       loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

other info that might help us debug this:

Chain exists of:
  sb_writers#5 --> (wq_completion)loop4 --> (work_completion)(&worker->work)

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&worker->work));
                               lock((wq_completion)loop4);
                               lock((work_completion)(&worker->work));
  rlock(sb_writers#5);

 *** DEADLOCK ***

^ permalink raw reply

* Re: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Keith Busch @ 2026-05-29 23:08 UTC (permalink / raw)
  To: Achkinazi, Igor
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <DS0PR19MB76965BF9FB57EA3ED8BD4586FD162@DS0PR19MB7696.namprd19.prod.outlook.com>

On Fri, May 29, 2026 at 01:32:22AM +0000, Achkinazi, Igor wrote:
> Keith Busch wrote:
> > I double checked the sequences here, and yes, I think the
> > synchronize_srcu's already in place ensure every caller sees the EOD
> > error before it could fail the bio_queue_enter(), so this looks like it
> > happens to be sufficient. I'm okay with it.
> 
> Thanks Keith! May I add your Reviewed-by?

Sure, though I was considering just adding it to the nvme tree. I'm giving
it a few days to see if there are any other comments.

^ permalink raw reply

* Re: Observing higher CPU utilization during random IO fio testing
From: Ming Lei @ 2026-05-30  1:10 UTC (permalink / raw)
  To: Wen Xiong; +Cc: linux-block, axboe, jmoyer, Gjoyce, wenxiong
In-Reply-To: <338169f719c77e4afe58f42e9760349e@linux.ibm.com>

On Thu, May 21, 2026 at 02:44:22PM -0500, Wen Xiong wrote:
> Hi All,
> 
> Our performance team observed higher CPU utilization in RHEL10 compared
> to RHEL9.8, and observed a similar issue in the upstream kernel (v7.1-rc4)
> as well when running FIO random IO tests.
> 
> System configuration:
> 47 dedicated cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
> 
> Random IO tests are more CPU intensive than sequential IO tests due to
> several factors: more context switching, interrupt handling, cache
> inefficiency, etc. We found that the following patch caused the higher
> CPU utilization in RHEL10 and newer Linux kernels:
> 
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu May 9 20:38:25 2024 +0800
> 
> block: add plug while submitting IO
> 
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link:
> https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> We reverted the above patch in the RHEL10 kernel and in upstream 7.1-rc4,
> and saw lower CPU utilization when running the same FIO test.
> 
> The patch adds plugging in __submit_bio() in the block layer, which may
> cause performance degradation:
> - Random IO tests see little merging, so the plug mostly adds flush overhead.
> - More IO scheduler interaction: requests are forced through the scheduler
> instead of being dispatched directly to the hardware queue.
> - Poor cache locality during plug operation

Yes, a regression is expected on QD=1 workloads.

Adding an inner plug only to cache the timestamp is not good from the plugging
point of view, because only the outer code path (io_uring, libaio, ...) knows the
exact IO batch size and can decide whether a plug should be used.

Given that 060406c61c7c ("block: add plug while submitting IO") doesn't provide
any performance data, maybe it can be reverted.

I wonder why we don't move the timestamp cache into 'task_struct' so it gets wider use?


Thanks,
Ming

^ permalink raw reply

* [PATCH] rbd: check snap_count against RBD_MAX_SNAP_COUNT
From: Rosen Penev @ 2026-05-30  1:12 UTC (permalink / raw)
  To: linux-block
  Cc: Ilya Dryomov, Dongsheng Yang, Jens Axboe, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt,
	open list:RADOS BLOCK DEVICE (RBD), open list,
	open list:CLANG/LLVM BUILD SUPPORT:Keyword:b(?i:clang|llvm)b

snap_count is u32 but the comparison is against a SIZE_MAX-derived value
(~2^61 on 64-bit), which clang flags as always false with
-Wtautological-constant-out-of-range-compare.

The proper check here should be that snap_count does not go over
RBD_MAX_SNAP_COUNT.
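
As a rough illustration of the tautology, here is a minimal userspace sketch.
It is not the rbd code: struct demo_snap_context, old_bound() and
DEMO_MAX_SNAP_COUNT are hypothetical stand-ins, and the cap value is chosen
for illustration only (the real RBD_MAX_SNAP_COUNT lives in
drivers/block/rbd.c).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed-size part of struct ceph_snap_context. */
struct demo_snap_context {
	uint64_t seq;
	uint32_t num_snaps;
};

/* Illustrative cap only; not necessarily the kernel's RBD_MAX_SNAP_COUNT. */
#define DEMO_MAX_SNAP_COUNT 510u

/* Old-style bound: largest count whose allocation size still fits in size_t. */
static size_t old_bound(void)
{
	return (SIZE_MAX - sizeof(struct demo_snap_context)) / sizeof(uint64_t);
}

/* With a 64-bit size_t the bound is roughly 2^61, far above any u32 value. */
static bool old_check_rejects(uint32_t snap_count)
{
	return snap_count > old_bound();
}

/* New-style check: compare against a small, explicit cap. */
static bool new_check_rejects(uint32_t snap_count)
{
	return snap_count > DEMO_MAX_SNAP_COUNT;
}
```

On a 64-bit build old_check_rejects() cannot return true for any u32 input,
while new_check_rejects() rejects everything above the cap.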

Assisted-by: Opencode:Big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 drivers/block/rbd.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 94709466ad19..25215c209484 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6075,12 +6075,9 @@ static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev,
 
 	/*
 	 * Make sure the reported number of snapshot ids wouldn't go
-	 * beyond the end of our buffer.  But before checking that,
-	 * make sure the computed size of the snapshot context we
-	 * allocate is representable in a size_t.
+	 * beyond the end of our buffer.
 	 */
-	if (snap_count > (SIZE_MAX - sizeof (struct ceph_snap_context))
-				 / sizeof (u64)) {
+	if (snap_count > RBD_MAX_SNAP_COUNT) {
 		ret = -EINVAL;
 		goto out;
 	}
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] rbd: check snap_count against RBD_MAX_SNAP_COUNT
From: Alex Elder @ 2026-05-30  1:44 UTC (permalink / raw)
  To: Rosen Penev, linux-block
  Cc: Ilya Dryomov, Dongsheng Yang, Jens Axboe, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt,
	open list:RADOS BLOCK DEVICE (RBD), open list,
	open list:CLANG/LLVM BUILD SUPPORT:Keyword:b(?i:clang|llvm)b
In-Reply-To: <20260530011255.52916-1-rosenp@gmail.com>

On 5/29/26 8:12 PM, Rosen Penev wrote:
> snap_count is u32 but the comparison is against a SIZE_MAX-derived value
> (~2^61 on 64-bit), which clang flags as always false with
> -Wtautological-constant-out-of-range-compare.
> 
> The proper check here should be that snap_count does not go over
> RBD_MAX_SNAP_COUNT.
> 
> Assisted-by: Opencode:Big-pickle
> Signed-off-by: Rosen Penev <rosenp@gmail.com>

Looks good to me.

Reviewed-by: Alex Elder <elder@riscstar.com>

> ---
>   drivers/block/rbd.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 94709466ad19..25215c209484 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -6075,12 +6075,9 @@ static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev,
>   
>   	/*
>   	 * Make sure the reported number of snapshot ids wouldn't go
> -	 * beyond the end of our buffer.  But before checking that,
> -	 * make sure the computed size of the snapshot context we
> -	 * allocate is representable in a size_t.
> +	 * beyond the end of our buffer.
>   	 */
> -	if (snap_count > (SIZE_MAX - sizeof (struct ceph_snap_context))
> -				 / sizeof (u64)) {
> +	if (snap_count > RBD_MAX_SNAP_COUNT) {
>   		ret = -EINVAL;
>   		goto out;
>   	}


^ permalink raw reply

* [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths
From: Ren Wei @ 2026-05-30  6:11 UTC (permalink / raw)
  To: linux-block, rust-for-linux
  Cc: ojeda, boqun, gary, bjorn3_gh, lossin, a.hindborg, aliceryhl,
	tmgross, dakr, daniel.almeida, axboe, tamird, sunke, yuantan098,
	bird, royenheart, n05ec

From: Haoze Xie <royenheart@gmail.com>

GenDiskBuilder::build() still has fallible work after
__blk_mq_alloc_disk(), but its error path only recovers the
foreign queue data. That leaks the temporary gendisk and
request_queue until later teardown. If the caller moved the last
Arc<TagSet<T>> into build(), the leaked queue can retain blk-mq
state after the tag set is dropped.

Fix the pre-registration failure path by dropping the temporary
gendisk reference with put_disk() before recovering queue_data,
so disk_release() can tear down the owned queue.

Also pair GenDisk::drop() with put_disk() after del_gendisk().
Once a Rust GenDisk has been added with device_add_disk(),
del_gendisk() only unregisters it; the final gendisk reference
still has to be dropped to complete the release path.

Fixes: 3253aba3408a ("rust: block: introduce `kernel::block::mq` module")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
Changes in v3:
  - Add the requested blank lines around cleanup blocks for readability.
  - v2 Link: https://lore.kernel.org/r/e14c015e2e0bde04f84a9452330b94436e2d8e68.1779901336.git.royenheart@gmail.com
Changes in v2:
  - Add the missing put_disk() after del_gendisk() in GenDisk::drop(),
    as suggested by Andreas Hindborg.
  - Keep the GenDiskBuilder::build() failure cleanup fix and fold both
    lifecycle fixes into one patch.
  - v1 Link: https://lore.kernel.org/r/b6411cc055080c984a67bfad72fd683aa84b8e13.1779596478.git.royenheart@gmail.com

 rust/kernel/block/mq/gen_disk.rs | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..fc97dd873974 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -150,6 +150,19 @@ pub fn build<T: Operations>(
         // SAFETY: `gendisk` is a valid pointer as we initialized it above
         unsafe { (*gendisk).fops = &TABLE };
 
+        let cleanup_failure = ScopeGuard::new_with_data((gendisk, data), |(gendisk, data)| {
+            // SAFETY: `gendisk` came from `__blk_mq_alloc_disk()` above and
+            // has not been added to the VFS on this cleanup path.
+            unsafe { bindings::put_disk(gendisk) };
+            // SAFETY: `data` came from `into_foreign()` above and has not been
+            // converted back on this cleanup path.
+            drop(unsafe { T::QueueData::from_foreign(data) });
+        });
+
+        // The failure guard now owns both pieces of cleanup; the early guard
+        // must not run on this path anymore.
+        recover_data.dismiss();
+
         let mut writer = NullTerminatedFormatter::new(
             // SAFETY: `gendisk` points to a valid and initialized instance. We
             // have exclusive access, since the disk is not added to the VFS
@@ -172,7 +185,7 @@ pub fn build<T: Operations>(
             },
         )?;
 
-        recover_data.dismiss();
+        cleanup_failure.dismiss();
 
         // INVARIANT: `gendisk` was initialized above.
         // INVARIANT: `gendisk` was added to the VFS via `device_add_disk` above.
@@ -215,6 +228,11 @@ fn drop(&mut self) {
         // to the VFS.
         unsafe { bindings::del_gendisk(self.gendisk) };
 
+        // SAFETY: By type invariant, `self.gendisk` was added to the VFS, so
+        // `put_disk()` must follow `del_gendisk()` to drop the final gendisk
+        // reference and trigger the remaining release path.
+        unsafe { bindings::put_disk(self.gendisk) };
+
         // SAFETY: `queue.queuedata` was created by `GenDiskBuilder::build` with
         // a call to `ForeignOwnable::into_foreign` to create `queuedata`.
         // `ForeignOwnable::from_foreign` is only called here.
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths
From: Miguel Ojeda @ 2026-05-30  6:49 UTC (permalink / raw)
  To: Ren Wei
  Cc: linux-block, rust-for-linux, ojeda, boqun, gary, bjorn3_gh,
	lossin, a.hindborg, aliceryhl, tmgross, dakr, daniel.almeida,
	axboe, tamird, sunke, yuantan098, bird, royenheart
In-Reply-To: <b70aff9a920cc42110fe5cf454c3099561863519.1780063368.git.royenheart@gmail.com>

On Sat, May 30, 2026 at 8:12 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> [PATCH rust-fixes v3 1/1] rust: block: fix GenDisk cleanup paths

I think block prefers to take these, but please let me know otherwise.

Cheers,
Miguel

^ permalink raw reply

* [PATCH] block: assign caller-specific lockdep class to disk->open_mutex
From: Tetsuo Handa @ 2026-05-30 13:45 UTC (permalink / raw)
  To: Jens Axboe, linux-block, LKML
  Cc: Bart Van Assche, Andrew Morton, Ming Lei, Damien Le Moal,
	Christoph Hellwig, Qu Wenruo, Hillf Danton

The block core currently allocates a single monolithic lockdep key for
disk->open_mutex across all callers. This single key conflates locking
hierarchies between independent block streams. For example, if a stacked
driver like loop flushes its internal workqueues inside lo_release() while
holding its own open_mutex, lockdep views this as a potential ABBA deadlock
against the underlying storage stack, leading to numerous circular
dependency splats [2][3][4][5][6].

To reduce false positives structurally, this patch splits the global
monolithic lock class into distinct per-caller classes during disk
allocation, by changing "lock_class_key" into a 2-element array:
  - lkclass[0]: Used for the legacy "(bio completion)" map.
  - lkclass[1]: Assigned to target caller's disk->open_mutex.
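
The per-call-site key pair can be pictured with a minimal userspace sketch.
Everything prefixed with demo_ is hypothetical and only models key addresses;
no lockdep is involved. Like the kernel macros, it relies on GNU C statement
expressions.

```c
/* Hypothetical stand-in for struct lock_class_key; only its address matters. */
struct demo_lock_class_key {
	char unused;
};

struct demo_disk {
	const struct demo_lock_class_key *bio_compl_key;	/* lkclass[0] */
	const struct demo_lock_class_key *open_mutex_key;	/* lkclass[1] */
};

/* The array parameter decays to a pointer, so callers pass the array name. */
static struct demo_disk demo_alloc_disk_keyed(struct demo_lock_class_key lkclass[2])
{
	struct demo_disk disk = {
		.bio_compl_key = &lkclass[0],
		.open_mutex_key = &lkclass[1],
	};

	return disk;
}

/* Each expansion gets its own static key pair (GNU C statement expression). */
#define demo_alloc_disk()					\
({								\
	static struct demo_lock_class_key demo_key[2];		\
								\
	demo_alloc_disk_keyed(demo_key);			\
})

/* Two hypothetical callers, standing in for two different drivers. */
static struct demo_disk demo_loop_alloc(void)
{
	return demo_alloc_disk();
}

static struct demo_disk demo_sd_alloc(void)
{
	return demo_alloc_disk();
}
```

In this sketch, disks allocated from different call sites get different
open_mutex keys, while disks allocated from the same call site still share
one.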

This patch was tested by adding drain_workqueue() to __loop_clr_fd() while
testing a patch for [1], and it actually helped stop [2][4][6].
Even if our final solution for [1] does not call drain_workqueue() with
disk->open_mutex held, keeping locking chains simpler and shorter should
be a good change.

Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e [2]
Link: https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc [3]
Link: https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7 [4]
Link: https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97 [5]
Link: https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4 [6]
Suggested-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 block/blk-mq.c         | 4 ++--
 block/blk.h            | 2 +-
 block/genhd.c          | 8 ++++----
 drivers/scsi/sd.c      | 4 ++--
 drivers/scsi/sr.c      | 4 ++--
 include/linux/blk-mq.h | 8 ++++----
 include/linux/blkdev.h | 6 +++---
 7 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75e..01a15ac40754 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4492,7 +4492,7 @@ EXPORT_SYMBOL(blk_mq_destroy_queue);
 
 struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata,
-		struct lock_class_key *lkclass)
+		struct lock_class_key lkclass[2])
 {
 	struct request_queue *q;
 	struct gendisk *disk;
@@ -4513,7 +4513,7 @@ struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 EXPORT_SYMBOL(__blk_mq_alloc_disk);
 
 struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
-		struct lock_class_key *lkclass)
+		struct lock_class_key lkclass[2])
 {
 	struct gendisk *disk;
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..1744748f9b68 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -614,7 +614,7 @@ void drop_partition(struct block_device *part);
 void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
 
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
-		struct lock_class_key *lkclass);
+		struct lock_class_key lkclass[2]);
 struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..303bd5e619e7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1444,7 +1444,7 @@ dev_t part_devt(struct gendisk *disk, u8 partno)
 }
 
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
-		struct lock_class_key *lkclass)
+		struct lock_class_key lkclass[2])
 {
 	struct gendisk *disk;
 
@@ -1467,7 +1467,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 		goto out_free_bdi;
 
 	disk->node_id = node_id;
-	mutex_init(&disk->open_mutex);
+	mutex_init_with_key(&disk->open_mutex, &lkclass[1]);
 	xa_init(&disk->part_tbl);
 	if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
 		goto out_destroy_part_tbl;
@@ -1482,7 +1482,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	device_initialize(disk_to_dev(disk));
 	inc_diskseq(disk);
 	q->disk = disk;
-	lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
+	lockdep_init_map(&disk->lockdep_map, "(bio completion)", &lkclass[0], 0);
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
 #endif
@@ -1506,7 +1506,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 }
 
 struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
-		struct lock_class_key *lkclass)
+		struct lock_class_key lkclass[2])
 {
 	struct queue_limits default_lim = { };
 	struct request_queue *q;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 599e75f33334..d8a1bbd4f19e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -112,7 +112,7 @@ static DEFINE_MUTEX(sd_mutex_lock);
 static mempool_t *sd_page_pool;
 static mempool_t *sd_large_page_pool;
 static atomic_t sd_large_page_pool_users = ATOMIC_INIT(0);
-static struct lock_class_key sd_bio_compl_lkclass;
+static struct lock_class_key sd_bio_compl_lkclass[2];
 
 static const char *sd_cache_types[] = {
 	"write through", "none", "write back",
@@ -4021,7 +4021,7 @@ static int sd_probe(struct scsi_device *sdp)
 		goto out;
 
 	gd = blk_mq_alloc_disk_for_queue(sdp->request_queue,
-					 &sd_bio_compl_lkclass);
+					 sd_bio_compl_lkclass);
 	if (!gd)
 		goto out_free;
 
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index c36c54ecd354..421b8bd37db0 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -106,7 +106,7 @@ static struct scsi_driver sr_template = {
 static unsigned long sr_index_bits[SR_DISKS / BITS_PER_LONG];
 static DEFINE_SPINLOCK(sr_index_lock);
 
-static struct lock_class_key sr_bio_compl_lkclass;
+static struct lock_class_key sr_bio_compl_lkclass[2];
 
 static int sr_open(struct cdrom_device_info *, int);
 static void sr_release(struct cdrom_device_info *);
@@ -634,7 +634,7 @@ static int sr_probe(struct scsi_device *sdev)
 		goto fail;
 
 	disk = blk_mq_alloc_disk_for_queue(sdev->request_queue,
-					   &sr_bio_compl_lkclass);
+					   sr_bio_compl_lkclass);
 	if (!disk)
 		goto fail_free;
 	mutex_init(&cd->lock);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..57d805c78827 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -726,15 +726,15 @@ enum {
 
 struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata,
-		struct lock_class_key *lkclass);
+		struct lock_class_key lkclass[2]);
 #define blk_mq_alloc_disk(set, lim, queuedata)				\
 ({									\
-	static struct lock_class_key __key;				\
+	static struct lock_class_key __key[2];				\
 									\
-	__blk_mq_alloc_disk(set, lim, queuedata, &__key);		\
+	__blk_mq_alloc_disk(set, lim, queuedata, __key);		\
 })
 struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
-		struct lock_class_key *lkclass);
+		struct lock_class_key lkclass[2]);
 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata);
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..3cd2056cde28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -974,7 +974,7 @@ int bdev_disk_changed(struct gendisk *disk, bool invalidate);
 
 void put_disk(struct gendisk *disk);
 struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
-		struct lock_class_key *lkclass);
+		struct lock_class_key lkclass[2]);
 
 /**
  * blk_alloc_disk - allocate a gendisk structure
@@ -990,9 +990,9 @@ struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
  */
 #define blk_alloc_disk(lim, node_id)					\
 ({									\
-	static struct lock_class_key __key;				\
+	static struct lock_class_key __key[2];				\
 									\
-	__blk_alloc_disk(lim, node_id, &__key);				\
+	__blk_alloc_disk(lim, node_id, __key);				\
 })
 
 int __register_blkdev(unsigned int major, const char *name,
-- 
2.47.3


^ permalink raw reply related

* [PATCH] loop: reject binding to procfs and sysfs files
From: Tetsuo Handa @ 2026-05-30 13:48 UTC (permalink / raw)
  To: Jens Axboe, linux-block, LKML
  Cc: Bart Van Assche, Andrew Morton, Ming Lei, Damien Le Moal,
	Christoph Hellwig, Qu Wenruo, Hillf Danton

I noticed that /dev/loopX accepts pseudo files, because
loop_validate_file() currently only checks:

  if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
      return -EINVAL;

and pseudo files are treated as S_ISREG().

Reading from pseudo files via /dev/loopX causes unexpected results, as it
tries to repeatedly read the entire content up to the size visible to the
"ls" command (padded with repeating data).

  # ls -l /sys/power/pm_test
  -rw-r--r-- 1 root root 4096 May 26 22:14 /sys/power/pm_test
  # cat /sys/power/pm_test | wc
        1       6      48
  # cat $(losetup -f --show /sys/power/pm_test) | wc
       85     513    4096

Writing to pseudo files via /dev/loopX might also cause undesirable
results. Therefore, explicitly reject binding to pseudo files on procfs
and sysfs for now. Other filesystems can be appended as needed.

There is another intention for this change. Currently, we are evaluating
the possibility of calling drain_workqueue() from __loop_clr_fd() in order
to address a NULL pointer dereference in lo_rw_aio() [1].
However, introducing drain_workqueue() into the loop teardown path where
disk->open_mutex is held forms a circular locking dependency when a pseudo
file that takes a global lock is specified as the backing store for the
loop device.

If drain_workqueue() is called from __loop_clr_fd(), an example of a
circular locking dependency that involves system_transition_mutex and
disk->open_mutex can be triggered by the following reproduction steps:

  # echo 7:0 > /sys/power/resume
  # losetup /dev/loop0 /sys/power/resume
  # cat /dev/loop0 > /dev/null
  # losetup -d /dev/loop0

Even if our final solution for [1] does not call drain_workqueue() with
disk->open_mutex held, rejecting binding to pseudo files that confuse
userspace programs is a standalone improvement.

Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Analyzed-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0000913f7efc..6aa88a7a0e2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -500,8 +500,15 @@ static int loop_validate_file(struct file *file, struct block_device *bdev)
 		rmb();
 		f = l->lo_backing_file;
 	}
-	if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
+	if (S_ISBLK(inode->i_mode))
+		return 0;
+	if (!S_ISREG(inode->i_mode))
 		return -EINVAL;
+	switch (inode->i_sb->s_magic) {
+	case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
+	case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
+		return -EINVAL;
+	}
 	return 0;
 }
 
-- 
2.47.3


^ permalink raw reply related

* RE: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Achkinazi, Igor @ 2026-05-30 14:37 UTC (permalink / raw)
  To: Keith Busch
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <ahocb8YRtqh5rHo-@kbusch-mbp>

Keith Busch wrote:
> Sure, though I was considering just adding it to the nvme tree. I'm giving
> it a few days to see if there are any other comments.

Sounds good, thanks Keith.

Internal Use - Confidential

^ permalink raw reply

* RE: [PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Achkinazi, Igor @ 2026-05-30 14:34 UTC (permalink / raw)
  To: Hannes Reinecke, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
	axboe@kernel.dk
  Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <b8d1fda2-a2da-4b35-9bd5-941834f26c32@suse.de>

Hannes Reinecke wrote:
> ... or you could introduce __bio_set_dev():
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..5a2709adeea7 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -518,15 +518,20 @@ static inline void blkcg_punt_bio_submit(struct bio *bio)
>   }
>   #endif /* CONFIG_BLK_CGROUP */
>
> -static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
> +static inline void __bio_set_dev(struct bio *bio, struct block_device *bdev)
>   {
> -       bio_clear_flag(bio, BIO_REMAPPED);
>          if (bio->bi_bdev != bdev)
>                  bio_clear_flag(bio, BIO_BPS_THROTTLED);
>          bio->bi_bdev = bdev;
>          bio_associate_blkg(bio);
>   }
>
> +static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
> +{
> +       bio_clear_flag(bio, BIO_REMAPPED);
> +       __bio_set_dev(bio, bdev);
> +}
> +
>   /*
>    * BIO list management for use by remapping drivers (e.g. DM or MD) and loop.
>    *
>
> to avoid all this clear-and-set-flag dance.


Thanks Hannes. It is a cleaner approach and avoids the clear-and-set
dance. However it touches the block layer (bio.h) and would need
wider review and testing across all bio_set_dev callers.

I'd prefer to keep this patch as a minimal nvme-multipath fix that
is easy to backport to stable kernels where this race is hitting us
today. The __bio_set_dev() approach (or Keith's patch that removes
set_capacity(0) entirely) could follow as the proper
long-term solution.
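
For reference, the difference between the two helpers can be modelled with a
small userspace sketch. The demo_ names and flag bit values are hypothetical
and only mirror the logic of the quoted diff; bio_associate_blkg() is left
out.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define DEMO_BIO_REMAPPED	(1u << 0)
#define DEMO_BIO_BPS_THROTTLED	(1u << 1)

struct demo_bio {
	unsigned int flags;
	const void *bdev;
};

/* Mirrors the proposed __bio_set_dev(): switch devices, keep REMAPPED as-is. */
static void demo_bio_set_dev_inner(struct demo_bio *bio, const void *bdev)
{
	if (bio->bdev != bdev)
		bio->flags &= ~DEMO_BIO_BPS_THROTTLED;
	bio->bdev = bdev;
	/* The kernel helper also calls bio_associate_blkg(); omitted here. */
}

/* Mirrors bio_set_dev(): clear REMAPPED first, then do the common work. */
static void demo_bio_set_dev(struct demo_bio *bio, const void *bdev)
{
	bio->flags &= ~DEMO_BIO_REMAPPED;
	demo_bio_set_dev_inner(bio, bdev);
}

/* Reports whether REMAPPED survives a device switch through either helper. */
static bool demo_remapped_survives(bool use_inner_helper)
{
	static int head_disk, path_disk;	/* two distinct "devices" */
	struct demo_bio bio = { .flags = DEMO_BIO_REMAPPED, .bdev = &head_disk };

	if (use_inner_helper)
		demo_bio_set_dev_inner(&bio, &path_disk);
	else
		demo_bio_set_dev(&bio, &path_disk);
	return (bio.flags & DEMO_BIO_REMAPPED) != 0;
}
```

With the outer helper the caller has to set the flag again after switching
devices; with the inner helper the flag is simply left alone.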

Thanks, Igor


^ permalink raw reply

* Re: [PATCH v2] scsi: bsg: read io_uring command fields once
From: Yang Xiuwei @ 2026-05-30 18:02 UTC (permalink / raw)
  To: rc
  Cc: James.Bottomley, martin.petersen, axboe, fujita.tomonori,
	linux-scsi, linux-block, io-uring, linux-kernel, bvanassche,
	csander, stable, Yang Xiuwei
In-Reply-To: <20260527191817.142769-1-rc@rexion.ai>

Hi Rahul,

Thanks for the report and for v2.

Reviewed-by: Yang Xiuwei <yangxiuwei@kylinos.cn>


^ permalink raw reply

* Re: [PATCH] loop: reject binding to procfs and sysfs files
From: kernel test robot @ 2026-05-30 19:48 UTC (permalink / raw)
  To: Tetsuo Handa, Jens Axboe, linux-block, LKML
  Cc: llvm, oe-kbuild-all, Bart Van Assche, Andrew Morton,
	Linux Memory Management List, Ming Lei, Damien Le Moal,
	Christoph Hellwig, Qu Wenruo, Hillf Danton
In-Reply-To: <148efba2-a0b6-47d7-ac76-b19d2f4b696c@I-love.SAKURA.ne.jp>

Hi Tetsuo,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe/for-next]
[also build test ERROR on linus/master v7.1-rc5 next-20260529]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tetsuo-Handa/loop-reject-binding-to-procfs-and-sysfs-files/20260530-214900
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link:    https://lore.kernel.org/r/148efba2-a0b6-47d7-ac76-b19d2f4b696c%40I-love.SAKURA.ne.jp
patch subject: [PATCH] loop: reject binding to procfs and sysfs files
config: um-x86_64_defconfig (https://download.01.org/0day-ci/archive/20260531/202605310318.dbidMe6W-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9409c07de6378507397ecdb6f05f628f58110112)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260531/202605310318.dbidMe6W-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605310318.dbidMe6W-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/block/loop.c:504:7: error: use of undeclared identifier 'PROC_SUPER_MAGIC'
     504 |         case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
         |              ^~~~~~~~~~~~~~~~
>> drivers/block/loop.c:505:7: error: use of undeclared identifier 'SYSFS_MAGIC'
     505 |         case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
         |              ^~~~~~~~~~~
   2 errors generated.
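
Both identifiers are defined in include/uapi/linux/magic.h, so these errors
suggest that drivers/block/loop.c does not pull in that header in this
configuration. As a hedged illustration only, here is a userspace sketch of
the same check order built on stat(2) and statfs(2); the helper names are
hypothetical and this is not the loop driver code.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/vfs.h>
#include <linux/magic.h>	/* PROC_SUPER_MAGIC, SYSFS_MAGIC */

/* Hypothetical helper: true if the filesystem magic is one the patch rejects. */
static bool is_rejected_fs_magic(long magic)
{
	switch (magic) {
	case PROC_SUPER_MAGIC:
	case SYSFS_MAGIC:
		return true;
	}
	return false;
}

/*
 * Hypothetical helper mirroring the patched check order:
 * block devices pass, non-regular files fail, and regular files
 * fail only when they live on procfs or sysfs.
 * Returns 0 if the path would be accepted, -1 otherwise.
 */
static int validate_backing_path(const char *path)
{
	struct stat st;
	struct statfs sfs;

	if (stat(path, &st) != 0 || statfs(path, &sfs) != 0)
		return -1;
	if (S_ISBLK(st.st_mode))
		return 0;
	if (!S_ISREG(st.st_mode))
		return -1;
	return is_rejected_fs_magic((long)sfs.f_type) ? -1 : 0;
}
```

Under these assumptions, a call such as
validate_backing_path("/sys/power/pm_test") would be rejected on a system
with sysfs mounted, while a regular file on a disk filesystem would pass.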


vim +/PROC_SUPER_MAGIC +504 drivers/block/loop.c

   478	
   479	static int loop_validate_file(struct file *file, struct block_device *bdev)
   480	{
   481		struct inode	*inode = file->f_mapping->host;
   482		struct file	*f = file;
   483	
   484		/* Avoid recursion */
   485		while (is_loop_device(f)) {
   486			struct loop_device *l;
   487	
   488			lockdep_assert_held(&loop_validate_mutex);
   489			if (f->f_mapping->host->i_rdev == bdev->bd_dev)
   490				return -EBADF;
   491	
   492			l = I_BDEV(f->f_mapping->host)->bd_disk->private_data;
   493			if (l->lo_state != Lo_bound)
   494				return -EINVAL;
   495			/* Order wrt setting lo->lo_backing_file in loop_configure(). */
   496			rmb();
   497			f = l->lo_backing_file;
   498		}
   499		if (S_ISBLK(inode->i_mode))
   500			return 0;
   501		if (!S_ISREG(inode->i_mode))
   502			return -EINVAL;
   503		switch (inode->i_sb->s_magic) {
 > 504		case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
 > 505		case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
   506			return -EINVAL;
   507		}
   508		return 0;
   509	}
   510	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] loop: reject binding to procfs and sysfs files
From: kernel test robot @ 2026-05-30 20:45 UTC (permalink / raw)
  To: Tetsuo Handa, Jens Axboe, linux-block, LKML
  Cc: oe-kbuild-all, Bart Van Assche, Andrew Morton,
	Linux Memory Management List, Ming Lei, Damien Le Moal,
	Christoph Hellwig, Qu Wenruo, Hillf Danton
In-Reply-To: <148efba2-a0b6-47d7-ac76-b19d2f4b696c@I-love.SAKURA.ne.jp>

Hi Tetsuo,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe/for-next]
[also build test ERROR on linus/master v7.1-rc5 next-20260529]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tetsuo-Handa/loop-reject-binding-to-procfs-and-sysfs-files/20260530-214900
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link:    https://lore.kernel.org/r/148efba2-a0b6-47d7-ac76-b19d2f4b696c%40I-love.SAKURA.ne.jp
patch subject: [PATCH] loop: reject binding to procfs and sysfs files
config: nios2-defconfig (https://download.01.org/0day-ci/archive/20260531/202605310413.Xgk6vCeB-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260531/202605310413.Xgk6vCeB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605310413.Xgk6vCeB-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/block/loop.c: In function 'loop_validate_file':
>> drivers/block/loop.c:504:14: error: 'PROC_SUPER_MAGIC' undeclared (first use in this function)
     504 |         case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
         |              ^~~~~~~~~~~~~~~~
   drivers/block/loop.c:504:14: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/block/loop.c:505:14: error: 'SYSFS_MAGIC' undeclared (first use in this function)
     505 |         case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
         |              ^~~~~~~~~~~


vim +/PROC_SUPER_MAGIC +504 drivers/block/loop.c

   478	
   479	static int loop_validate_file(struct file *file, struct block_device *bdev)
   480	{
   481		struct inode	*inode = file->f_mapping->host;
   482		struct file	*f = file;
   483	
   484		/* Avoid recursion */
   485		while (is_loop_device(f)) {
   486			struct loop_device *l;
   487	
   488			lockdep_assert_held(&loop_validate_mutex);
   489			if (f->f_mapping->host->i_rdev == bdev->bd_dev)
   490				return -EBADF;
   491	
   492			l = I_BDEV(f->f_mapping->host)->bd_disk->private_data;
   493			if (l->lo_state != Lo_bound)
   494				return -EINVAL;
   495			/* Order wrt setting lo->lo_backing_file in loop_configure(). */
   496			rmb();
   497			f = l->lo_backing_file;
   498		}
   499		if (S_ISBLK(inode->i_mode))
   500			return 0;
   501		if (!S_ISREG(inode->i_mode))
   502			return -EINVAL;
   503		switch (inode->i_sb->s_magic) {
 > 504		case PROC_SUPER_MAGIC: /* e.g. "losetup -f /proc/sys/kernel/version" */
 > 505		case SYSFS_MAGIC: /* e.g. "losetup -f /sys/power/state" */
   506			return -EINVAL;
   507		}
   508		return 0;
   509	}
   510	
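
Both reports come down to the same thing: PROC_SUPER_MAGIC and SYSFS_MAGIC are
not declared at the point where loop_validate_file() uses them. Both constants
are defined in include/uapi/linux/magic.h, so a likely fix (an assumption, not
confirmed anywhere in this thread) is to add #include <linux/magic.h> to
drivers/block/loop.c. A minimal user-space sketch of the same check, where
reject_pseudo_fs() and self_check() are hypothetical names used only for
illustration:

```c
#include <assert.h>
#include <errno.h>
#include <linux/magic.h>	/* declares PROC_SUPER_MAGIC and SYSFS_MAGIC */

/*
 * Hypothetical helper mirroring the switch in loop_validate_file():
 * refuse backing files that live on procfs or sysfs.
 */
static int reject_pseudo_fs(unsigned long s_magic)
{
	switch (s_magic) {
	case PROC_SUPER_MAGIC:
	case SYSFS_MAGIC:
		return -EINVAL;
	}
	return 0;
}

/* Sanity check: the two pseudo filesystems are rejected, ext4 is not. */
static void self_check(void)
{
	assert(reject_pseudo_fs(PROC_SUPER_MAGIC) == -EINVAL);
	assert(reject_pseudo_fs(SYSFS_MAGIC) == -EINVAL);
	assert(reject_pseudo_fs(EXT4_SUPER_MAGIC) == 0);
}
```

Without the include, this sketch fails to compile with the same "undeclared
identifier" diagnostics shown above.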

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] block: assign caller-specific lockdep class to disk->open_mutex
From: Bart Van Assche @ 2026-05-30 21:15 UTC (permalink / raw)
  To: Tetsuo Handa, Jens Axboe, linux-block, LKML
  Cc: Andrew Morton, Ming Lei, Damien Le Moal, Christoph Hellwig,
	Qu Wenruo, Hillf Danton
In-Reply-To: <147ed056-03d9-4214-b925-0f10fc00cf27@I-love.SAKURA.ne.jp>

On 5/30/26 6:45 AM, Tetsuo Handa wrote:
> -	static struct lock_class_key __key;				\
> +	static struct lock_class_key __key[2];				\

The two elements of this array have different roles. For readability and
maintainability, it would probably be better to make this a struct with
two named members rather than a two-element array.
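
A rough user-space sketch of what the suggestion could look like. The quoted
hunk does not show how the two keys are used, so the struct name, both member
names, and the helpers below are hypothetical, and struct lock_class_key is
replaced by a stand-in so that the example compiles outside the kernel:

```c
#include <assert.h>

/* User-space stand-in for the kernel's struct lock_class_key. */
struct lock_class_key {
	char dummy;
};

/*
 * Hypothetical replacement for "static struct lock_class_key __key[2]".
 * Each key is named after its role; both role names are assumptions.
 */
struct disk_lock_keys {
	struct lock_class_key disk_key;		/* assumed: the pre-existing key */
	struct lock_class_key open_mutex_key;	/* assumed: key for disk->open_mutex */
};

/* Hypothetical consumer: picks a key by name instead of by index. */
static const struct lock_class_key *
open_mutex_class(const struct disk_lock_keys *keys)
{
	return &keys->open_mutex_key;
}

/* The two members are distinct objects, so they can serve as distinct keys. */
static void check_keys_distinct(const struct disk_lock_keys *keys)
{
	assert(&keys->disk_key != &keys->open_mutex_key);
}

/* One static instance per expansion, like the original per-caller key. */
#define DEFINE_DISK_LOCK_KEYS(name) static struct disk_lock_keys name
```

With named members, a reader of the call site sees keys->open_mutex_key rather
than __key[1], and swapping the two roles by accident becomes harder.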

Thanks,

Bart.


