Linux block layer


* Landlock: LANDLOCK_ACCESS_FS_IOCTL_DEV bypass via io_uring IORING_OP_URING_CMD
From: Bryam Vargas @ 2026-06-16 20:16 UTC
  To: Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, linux-security-module, io-uring,
	linux-block, linux-nvme, linux-kernel

Hello Mickaël, and Landlock / io_uring folks,

A task confined by a Landlock ruleset that grants READ_FILE/WRITE_FILE on a block
or NVMe character device but withholds LANDLOCK_ACCESS_FS_IOCTL_DEV can still
reach the device-command surface through io_uring IORING_OP_URING_CMD with the
IOCTL_DEV check bypassed: the request enters the device-command handler (block
discard, or the NVMe char-device passthrough) where the equivalent ioctl(2) is
denied. The destructive completion and the NVMe-admin surface follow from the
code -- see Impact.

Affected
--------
Any kernel with CONFIG_SECURITY_LANDLOCK=y and Landlock enabled that supports
LANDLOCK_ACCESS_FS_IOCTL_DEV (Landlock ABI >= 5, since Linux 6.10) and io_uring
uring_cmd for the device class (block BLOCK_URING_CMD_DISCARD; NVMe passthrough).
Confirmed by source inspection on mainline (v7.1-rc7) and reproduced on Linux
7.0.11 (Landlock ABI 8). The confined task needs a writable fd to a device it is
legitimately allowed to use (e.g. a partition/loop device or an NVMe namespace
passed into a container or granted by the ruleset); no CAP is required to reach
the io_uring path. The gap is structural -- Landlock has never registered a
uring_cmd hook -- so it is present from ABI 5 (Linux 6.10) through current
mainline (v7.1-rc7) and is not a regression tied to a single Fixes: commit.

Root cause
----------
On the ioctl(2) path, the syscall handler in fs/ioctl.c calls
security_file_ioctl() (its only call site on that path) before dispatching to
do_vfs_ioctl(); that reaches Landlock's hook_file_ioctl_common(), which denies a
device ioctl unless the file's allowed_access holds LANDLOCK_ACCESS_FS_IOCTL_DEV
(BLKDISCARD/BLKSECDISCARD/BLKZEROOUT and NVMe passthrough are not in the
is_masked_device_ioctl() allow-list, so they require the right).

io_uring reaches the same device-command surface by a different producer:

  IORING_OP_URING_CMD -> io_uring_cmd()   io_uring/uring_cmd.c
   -> security_uring_cmd(ioucmd)          (the ONLY LSM gate on this path)
   -> file->f_op->uring_cmd()             e.g. blkdev_uring_cmd() / nvme_ns_chr_uring_cmd()

Landlock's LSM_HOOK_INIT list (security/landlock/fs.c, net.c, task.c) registers
file_ioctl/file_ioctl_compat but no uring_cmd hook -- only SELinux
(selinux_uring_cmd) and Smack (smack_uring_cmd) gate this surface -- so
security_uring_cmd() returns 0 for a Landlocked task and hook_file_ioctl /
IOCTL_DEV is never consulted. For block, blkdev_cmd_discard() is then gated only
by BLK_OPEN_WRITE; for NVMe, nvme_ns_chr_uring_cmd() reaches the admin/IO
passthrough with no security_file_ioctl on the path. There is no shared helper
that re-applies the IOCTL_DEV check.

The coverage asymmetry is that SELinux and Smack hook uring_cmd while Landlock
does not; the Landlock documentation describes IOCTL_DEV as gating ioctl(2) but
does not mention io_uring.

Reproducer
----------
A self-contained PoC is available on request (it needs root only to set up a loop
block device and open it; Landlock enforcement is uid-independent, so the
confined child demonstrates the gap regardless of the setup uid). The child
applies a Landlock ruleset handling READ_FILE|WRITE_FILE|IOCTL_DEV with a rule
granting only READ_FILE|WRITE_FILE on the device, then:

  (1) ioctl(fd, BLKDISCARD, range)        -> -EACCES  (Landlock enforces IOCTL_DEV)
  (2) IORING_OP_URING_CMD,
      cmd_op = BLOCK_URING_CMD_DISCARD     -> reaches the block command handler

Observed on Linux 7.0.11 (Landlock ABI 8):

  [1] ioctl(BLKDISCARD)   -> ret=-1 errno=13 (Permission denied)
  [2] uring_cmd(DISCARD)  -> cqe.res=-22 (Invalid argument)

A Landlock denial is always -EACCES; the io_uring path returned -EINVAL, which
originates in a post-authorization check inside the block command handler
(blk_validate_byte_range() in blkdev_cmd_discard()), reached only after
security_uring_cmd() returned 0. So this run demonstrates the authorization
bypass -- the request traversed the LSM gate into the block device-command
handler with no IOCTL_DEV check -- and then failed a parameter check, not an
authorization check. The destructive completion (an authorized discard with a
granularity-aligned range) is the expected behaviour but was not exercised in
this run.
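
A minimal sketch of the confinement step described above, for readers who want
to see the shape of the ruleset without the full PoC. This is not the PoC: the
helper name confine_rw_without_ioctl_dev() is hypothetical, error handling is
reduced to returning a negative errno, and the io_uring submission of step (2)
is not shown.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older UAPI headers; value as in include/uapi/linux/landlock.h. */
#ifndef LANDLOCK_ACCESS_FS_IOCTL_DEV
#define LANDLOCK_ACCESS_FS_IOCTL_DEV (1ULL << 15)
#endif

/*
 * Hypothetical helper (not the PoC): restrict the calling thread so that
 * dev_path is granted READ_FILE|WRITE_FILE while IOCTL_DEV is handled but
 * not granted. Returns 0 on success or a negative errno.
 */
int confine_rw_without_ioctl_dev(const char *dev_path)
{
	struct landlock_ruleset_attr ruleset = {
		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
				     LANDLOCK_ACCESS_FS_WRITE_FILE |
				     LANDLOCK_ACCESS_FS_IOCTL_DEV,
	};
	struct landlock_path_beneath_attr rule = {
		.allowed_access = LANDLOCK_ACCESS_FS_READ_FILE |
				  LANDLOCK_ACCESS_FS_WRITE_FILE,
	};
	int ruleset_fd, ret = 0;

	/* The rule is anchored on the device node itself. */
	rule.parent_fd = open(dev_path, O_PATH | O_CLOEXEC);
	if (rule.parent_fd < 0)
		return -errno;

	ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset,
			     sizeof(ruleset), 0);
	if (ruleset_fd < 0) {
		ret = -errno;
		goto out_close_path;
	}

	/* NO_NEW_PRIVS is required for an unprivileged landlock_restrict_self(). */
	if (syscall(SYS_landlock_add_rule, ruleset_fd,
		    LANDLOCK_RULE_PATH_BENEATH, &rule, 0) ||
	    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
	    syscall(SYS_landlock_restrict_self, ruleset_fd, 0))
		ret = -errno;

	close(ruleset_fd);
out_close_path:
	close(rule.parent_fd);
	return ret;
}
```

Calling this in the child, then opening the device O_RDWR and issuing
ioctl(fd, BLKDISCARD, range), should reproduce the -EACCES of step (1) on a
kernel with Landlock ABI >= 5.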

Impact
------
Demonstrated: the LANDLOCK_ACCESS_FS_IOCTL_DEV authorization is bypassed. The
device-command request reaches the block command handler with no Landlock check;
the only remaining gate is BLK_OPEN_WRITE (held, since the policy granted write).
Inferred from the code, not exercised here: an authorized DISCARD with a valid
range completes (DISCARD/secure-erase semantics, destroying on-device data), and
the same missing hook leaves the NVMe char-device uring_cmd surface ungated --
nvme_ns_chr_uring_cmd (namespace device /dev/nvmeXnY) -> nvme_ns_uring_cmd for
NVME_URING_CMD_IO/IO_VEC passthrough, and nvme_dev_uring_cmd (controller device
/dev/nvmeX) for NVME_URING_CMD_ADMIN (format, sanitize, firmware download,
security send) -- both reach f_op->uring_cmd with no Landlock/IOCTL_DEV gate.

So the confirmed finding is a missing authorization (the confined task escapes
its own IOCTL_DEV restriction); the destructive data effect and the NVMe-admin
high-water-mark follow from the code but are not shown in the run above. The
proven authorization bypass alone scores CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N
(6.5 Medium) -- S:C because the confined task crosses the Landlock policy
boundary it was placed under, I:H because the bypassed path reaches a handler
whose authorized completion modifies device data. With the device command
completing destructively the projected ceiling is
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:H (8.4 High), the A:H component
reasoned from the source rather than executed. No memory safety is involved.
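
The two scores can be re-derived from the CVSS v3.1 base equations. A small
sketch for the Scope:Changed case, assuming the metric weights from the v3.1
specification (AV:L = 0.55, AC:L = 0.77, PR:L with changed scope = 0.68,
UI:N = 0.85, H = 0.56, N = 0); the function names are illustrative only.

```c
/* CVSS v3.1 Roundup(): smallest one-decimal value >= x (integer method). */
static double cvss_roundup(double x)
{
	long scaled = (long)(x * 100000.0 + 0.5);

	if (scaled % 10000 == 0)
		return scaled / 100000.0;
	return (scaled / 10000 + 1) / 10.0;
}

/* Base score for a Scope:Changed vector; c, i, a are the C/I/A weights. */
double cvss31_base_scope_changed(double av, double ac, double pr, double ui,
				 double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double t = iss - 0.02, t15 = 1.0;
	double impact, exploitability, sum;
	int n;

	for (n = 0; n < 15; n++)	/* (ISS - 0.02)^15 without libm */
		t15 *= t;

	impact = 7.52 * (iss - 0.029) - 3.25 * t15;
	exploitability = 8.22 * av * ac * pr * ui;
	if (impact <= 0.0)
		return 0.0;

	sum = 1.08 * (impact + exploitability);
	return cvss_roundup(sum < 10.0 ? sum : 10.0);
}
```

With I:H alone, cvss31_base_scope_changed(0.55, 0.77, 0.68, 0.85, 0, 0.56, 0)
gives 6.5; with I:H and A:H (last two arguments 0.56, 0.56) it gives 8.4.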

Suggested direction
-------------------
Have Landlock register a uring_cmd hook that maps the device command to the same
checks the ioctl path applies (IOCTL_DEV, and truncate where relevant), so a
single chokepoint covers every f_op->uring_cmd provider (block, NVMe, ublk, and
any future one). This mirrors how SELinux/Smack already gate this surface.

I am happy to send a patch for this if you would like.

Best regards,

Bryam Vargas
Independent security researcher, HEXLAB S.A.S., Cali, Colombia
hexlabsecurity@proton.me


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-16 20:09 UTC
  To: Keith Busch
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajGacnaCZ6v6VE1B@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Tue, Jun 16, 2026 at 05:54:28PM +0000, Dr. David Alan Gilbert wrote:
> > OK, for this pair I think would be fair for a Tested-by me as well;
> > they certainly resolve the hang and the WARN/BUGs.
> > I still see the errors as EIO on my tests, and on the older mirror type
> 
> Could you share your reproducer? I'm just using the original recipe you
> sent here:
> 
> https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/
> 
> And I'm seeing EINVAL instead of EIO.

Interesting; I've got your:
  dm-raid1: don't fail the mirror for invalid I/O errors
  For DM_IO_BIO requests, do_region() built each destination bio by walking..
on top of e21ee273e6fa3879aec9a27251cfce98156e07c4 which is just before 7.1
  I've not got your https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/

root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
root@dalek:/home/dg# mkfs.ext4 /dev/mapper/main-lvol0
root@dalek:/home/dg# mount /dev/mapper/main-lvol0 /mnt/tmp/
root@dalek:/home/dg# chmod a+rwx /mnt/tmp

dg@dalek:~$ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1

my two tests are separate tests:
{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}
dg@dalek:~$ cat dbf.c
#include <errno.h>
#include <fcntl.h>             
#include <asm-generic/fcntl.h>
#include <stdio.h> 
#include <unistd.h>


const char* path="/mnt/tmp/testfile";
static char buf[8192];

int main()                                       
{
  int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);
    
  errno=0;
  int res3=pread(fd, buf, 4096, 0);
  printf("pread of 4096 said: %d (%m)\n", res3);

}
{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}
dg@dalek:~$ cat dbf-write.c
#include <errno.h>
#include <fcntl.h>             
#include <asm-generic/fcntl.h>
#include <stdio.h> 
#include <unistd.h>


const char* path="/mnt/tmp/testfile";
static char buf[8192];

int main()                                       
{
  int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);
    
  errno=0;
  int res3=pwrite(fd, buf, 4096, 0);
  printf("pwrite of 4096 said: %d (%m)\n", res3);

}
{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}{--}


> > get the stuck resync on write, and on the newer mirror I see the write
> > apparently succeed (did it really?)
> 
> There was a time when ext4 used to fall back to buffered io for writes
> but not for reads, but looks like that was fixed since 6.18, so should
> be returning error.
> 
> I tried testing it with a modification to your original read test, and
> it is still failing with EINVAL for me:
> 
>   pread of 4096 said: -1 (Invalid argument)
>   pwrite of 4096 said: -1 (Invalid argument)

Your double test gives me:
dg@dalek:~$ ./dbf-joint 
pread of 4096 said: -1 (Input/output error)
pwrite of 4096 said: 4096 (Input/output error)

> ---
> #define _GNU_SOURCE
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
> 
> const char* path="/mnt/tmp/testfile";
> static char buf[8192];
> 
> int main()
> {
>   int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);
> 
>   errno=0;
>   int res3=pread(fd, buf, 4096, 0);
>   printf("pread of 4096 said: %d (%m)\n", res3);

errno=0;

>   res3=pwrite(fd, buf, 4096, 0);
>   printf("pwrite of 4096 said: %d (%m)\n", res3);
> }

Dave

> --
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 18:48 UTC
  To: Dr. David Alan Gilbert
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajGN1OKun3qyvqMc@gallifrey>

On Tue, Jun 16, 2026 at 05:54:28PM +0000, Dr. David Alan Gilbert wrote:
> OK, for this pair I think would be fair for a Tested-by me as well;
> they certainly resolve the hang and the WARN/BUGs.
> I still see the errors as EIO on my tests, and on the older mirror type

Could you share your reproducer? I'm just using the original recipe you
sent here:

https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/

And I'm seeing EINVAL instead of EIO.

> get the stuck resync on write, and on the newer mirror I see the write
> apparently succeed (did it really?)

There was a time when ext4 used to fall back to buffered io for writes
but not for reads, but looks like that was fixed since 6.18, so should
be returning error.

I tried testing it with a modification to your original read test, and
it is still failing with EINVAL for me:

  pread of 4096 said: -1 (Invalid argument)
  pwrite of 4096 said: -1 (Invalid argument)

---
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

const char* path="/mnt/tmp/testfile";
static char buf[8192];

int main()
{
  int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC);

  errno=0;
  int res3=pread(fd, buf, 4096, 0);
  printf("pread of 4096 said: %d (%m)\n", res3);
  res3=pwrite(fd, buf, 4096, 0);
  printf("pwrite of 4096 said: %d (%m)\n", res3);
}
--


* [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 15:40 UTC
  To: dm-devel; +Cc: linux-block, mpatocka, kbusch, linux, vklimovs
In-Reply-To: <20260616150554.1686662-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
misaligned direct I/O), not that the device has failed. dm-raid1 treated
any read or write completion error as a device failure: it failed the
mirror leg, retried on the alternatives - which fail identically - and
eventually returned EIO while spuriously degrading the array.

Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
direct I/O path no longer rejects misaligned buffers up front, so an
invalid bio now reaches the lower block layers, which fail it with
BLK_STS_INVAL. dm-io collapses the block status into a per-region error
bit before invoking the completion callback, so record BLK_STS_INVAL on
the originating bio and have the dm-raid1 read, write and end_io paths
propagate it instead of failing the device.

This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
("md/raid1,raid10: don't fail devices for invalid IO errors") for the
device-mapper mirror target.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/md/dm-io.c    | 14 +++++++++++++-
 drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 28adfeb58f240..f382e9f9be059 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -37,6 +37,7 @@ struct io {
 	struct dm_io_client *client;
 	io_notify_fn callback;
 	void *context;
+	struct bio *orig_bio;
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 } __aligned(DM_IO_MAX_REGIONS);
@@ -132,8 +133,18 @@ static void complete_io(struct io *io)
 
 static void dec_count(struct io *io, unsigned int region, blk_status_t error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		/*
+		 * BLK_STS_INVAL means the bio was not valid for the underlying
+		 * device (e.g. a misaligned direct I/O), which is a caller error
+		 * rather than a device failure. Record it on the original bio so
+		 * bio-based targets can propagate it instead of treating it as a
+		 * media error and failing the device.
+		 */
+		if (error == BLK_STS_INVAL && io->orig_bio)
+			io->orig_bio->bi_status = error;
+	}
 
 	if (atomic_dec_and_test(&io->count))
 		complete_io(io);
@@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
 	io->client = client;
 	io->callback = fn;
 	io->context = context;
+	io->orig_bio = dp->orig_bio;
 
 	io->vma_invalidate_address = dp->vma_invalidate_address;
 	io->vma_invalidate_size = dp->vma_invalidate_size;
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index de5c00704e69c..022ad791c2957 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. That is a caller error, not a device
+	 * failure, so propagate it rather than failing the mirror and retrying
+	 * on the other legs, which would fail the same way.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	fail_mirror(m, DM_RAID1_READ_ERROR);
 
 	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
@@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate the error without degrading
+	 * the array.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	/*
 	 * If the bio is discard, return an error, but do not
 	 * degrade the array.
@@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 		return DM_ENDIO_DONE;
 	}
 
-	if (*error == BLK_STS_NOTSUPP)
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
+	 * mirror and retrying, which would fail the same way on every leg.
+	 */
+	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
 		goto out;
 
 	if (bio->bi_opf & REQ_RAHEAD)
-- 
2.52.0



* [PATCH v3 6/6] xfs: introduce software write streams
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

Even when the underlying block device does not advertise write streams,
XFS can choose to do so, as that enables logical spatial isolation and
dynamic AG-set based concurrency for standard storage, excluding the
rtvolume.

Use an AG-count-based heuristic to derive the AG set size and the number of
software streams. Larger filesystems (i.e., more AGs) get a wider fanout
(i.e., a larger AG set).
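
As a rough userspace model of the heuristic (a sketch only, not the kernel
code): the thresholds and the cap of 16 are taken from the diff below, while
the function name sw_write_streams() and the plain integer types are
illustrative assumptions.

```c
#include <stdint.h>

/* Mirrors XFS_MAX_USER_WRITE_STREAMS from the patch. */
#define MAX_USER_WRITE_STREAMS	16

/*
 * Illustrative model: number of software write streams for a filesystem
 * with nr_ags allocation groups when the device advertises none.
 */
uint16_t sw_write_streams(uint32_t nr_ags)
{
	uint32_t ag_set_size, nr_streams;

	/* More AGs give a larger AG set per stream. */
	if (nr_ags >= 32)
		ag_set_size = 4;
	else if (nr_ags >= 8)
		ag_set_size = 2;
	else
		ag_set_size = 1;

	nr_streams = nr_ags / ag_set_size;
	if (nr_streams > MAX_USER_WRITE_STREAMS)
		nr_streams = MAX_USER_WRITE_STREAMS;
	return (uint16_t)nr_streams;
}
```

Under this model 4 AGs give 4 streams, 16 AGs give 8, 32 AGs give 8, and
anything from 64 AGs up is capped at 16.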

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/xfs/xfs_inode.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2e7c61d71b48..10ffed130dce 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -45,6 +45,7 @@
 #include "xfs_inode_util.h"
 #include "xfs_metafile.h"
 
+#define XFS_MAX_USER_WRITE_STREAMS		(16)
 struct kmem_cache *xfs_inode_cache;
 
 int
@@ -52,12 +53,34 @@ xfs_inode_max_write_streams(
 	struct xfs_inode	*ip)
 {
 	struct block_device	*bdev;
+	struct xfs_mount	*mp = ip->i_mount;
+	int nr_streams;
+	xfs_agnumber_t nr_ags, ag_set_size;
 
 	bdev = xfs_inode_buftarg(ip)->bt_bdev;
 	if (!bdev)
 		return 0;
 
-	return bdev_max_write_streams(bdev);
+	nr_streams = bdev_max_write_streams(bdev);
+	if (nr_streams > 0)
+		return nr_streams;
+	if (XFS_IS_REALTIME_INODE(ip))
+		return 0;
+	/*
+	 * Enable software-only streams if hardware streams are not available.
+	 * This helps to
+	 * - improve isolation; reduce allocation interleaving.
+	 * - improve concurrency using AG-set based steering within and across streams.
+	 */
+	nr_ags = mp->m_sb.sb_agcount;
+	if (nr_ags >= 32)
+		ag_set_size = 4;
+	else if (nr_ags >= 8)
+		ag_set_size = 2;
+	else
+		ag_set_size = 1;
+	nr_streams = nr_ags / ag_set_size;
+	return min_t(uint16_t, nr_streams, XFS_MAX_USER_WRITE_STREAMS);
 }
 
 uint16_t
-- 
2.25.1



* [PATCH v3 5/6] xfs: write stream based AG placement
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

When a write stream is set on the file, choose the AG set based on the
write stream value.

Isolating distinct write streams into dedicated allocation groups helps
reduce the block interleaving of concurrent writers. Keeping these
streams spatially separated reduces AGF lock contention and logical file
fragmentation.

If there are fewer AGs than write streams, the write streams are distributed
across the available AGs in round-robin fashion.
Otherwise, the available AGs are partitioned among the write streams. The
write-stream value is used to derive the AG set, and the low bits of the inode
number are used to derive the AG within the AG set.

While each stream provides the isolation, the intra-stream concurrency
comes from the AG set size.

Example: 8 Allocation Groups, 4 write streams
AG set size = 2 AGs per write stream

   Stream 1 (ID: 1)         Stream 2 (ID: 2)         Streams 3 & 4
 +---------+---------+    +---------+---------+    +-------------
 |   AG0   |   AG1   |    |   AG2   |   AG3   |    |  AG4...AG7
 +---------+---------+    +---------+---------+    +-------------
      ^         ^              ^         ^
      |         |              |         |
      | File B (ino: 101)      | File D (ino: 201)
      | 101 % 2 = 1 -> AG 1    | 201 % 2 = 1 -> AG 3
      |                        |
 File A (ino: 100)        File C (ino: 200)
 100 % 2 = 0 -> AG 0      200 % 2 = 0 -> AG 2

If AGs cannot be evenly distributed among streams, the last stream will
absorb the remaining AGs.

Note that there are no hard boundaries; this only provides an explicit
routing hint to the XFS allocator. We still preserve file contiguity, and the
full space can be utilized even with a single stream.
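
The mapping above can be modelled in userspace as follows. This is a sketch
under simplifying assumptions: stream_target_ag() is a hypothetical name, it
takes the raw inode number where the kernel code uses XFS_INO_TO_AGINO(), and
it expects stream_id >= 1, nr_streams >= 1 and nr_ags >= 1.

```c
#include <stdint.h>

/*
 * Illustrative model of the stream -> AG-set -> AG mapping.
 * stream_id is 1-based, as in the patch.
 */
uint32_t stream_target_ag(uint32_t nr_ags, uint32_t nr_streams,
			  uint32_t stream_id, uint64_t ino)
{
	uint32_t idx = stream_id - 1;		/* 0-based stream index */
	uint32_t set_size = nr_ags / nr_streams;
	uint32_t base;

	if (set_size) {
		/* Partition the AGs; each stream owns set_size AGs. */
		base = idx * set_size;
		/* Uneven split: the last stream absorbs the remaining AGs. */
		if (idx == nr_streams - 1)
			set_size = nr_ags - base;
	} else {
		/* Fewer AGs than streams: round-robin, one AG per stream. */
		base = idx % nr_ags;
		set_size = 1;
	}

	/* Fan out within the AG set using the low bits of the inode number. */
	return (base + (uint32_t)(ino % set_size)) % nr_ags;
}
```

For the 8-AG, 4-stream example in the diagram this gives AG 0 for File A
(stream 1, ino 100), AG 1 for File B (stream 1, ino 101), AG 2 for File C
(stream 2, ino 200) and AG 3 for File D (stream 2, ino 201).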

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6685220ec59a..325987b5bd9e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3205,6 +3205,38 @@ xfs_default_ag_set_size(
 	return min_t(xfs_agnumber_t, GENERIC_AG_SET_SZ, mp->m_sb.sb_agcount);
 }
 
+static xfs_agnumber_t
+xfs_inode_write_stream_ag_set(
+	struct xfs_inode	*ip,
+	xfs_agnumber_t		*target_agno)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	uint32_t		nr_streams = xfs_inode_max_write_streams(ip);
+	uint32_t		stream_id = ip->i_write_stream;
+	uint32_t		nr_ags = mp->m_sb.sb_agcount;
+	xfs_agnumber_t		set_size;
+
+
+	if (!nr_streams)
+		return xfs_default_ag_set_size(ip);
+
+	stream_id -= 1; /* For 0-based math; stream-ids are 1-based */
+	set_size = nr_ags / nr_streams;
+
+	if (set_size) {
+		*target_agno = stream_id * set_size;
+		/* uneven distribution, last stream will cover extra AGs */
+		if (stream_id == nr_streams - 1)
+			set_size = nr_ags - *target_agno;
+	} else {
+		/* for the case when we have fewer AGs than streams */
+		*target_agno = stream_id % nr_ags;
+		set_size = 1;
+	}
+
+	return set_size;
+}
+
 static xfs_agnumber_t
 xfs_ag_to_ag_set(
 	struct xfs_bmalloca	*ap,
@@ -3218,7 +3250,11 @@ xfs_ag_to_ag_set(
 	if (!(ap->datatype & XFS_ALLOC_USERDATA))
 		return base_agno;
 
-	set_size = xfs_default_ag_set_size(ip);
+	if (ip->i_write_stream)
+		set_size = xfs_inode_write_stream_ag_set(ip, &base_agno);
+	else
+		set_size = xfs_default_ag_set_size(ip);
+
 	/* Fan out within the AG set using low bits of the inode */
 	return (base_agno + (XFS_INO_TO_AGINO(mp, ip->i_ino) % set_size)) %
 		mp->m_sb.sb_agcount;
-- 
2.25.1



* [PATCH v3 2/6] iomap: introduce and propagate write_stream
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

Add a new write_stream field to struct iomap. An existing hole is used to
place the new field.
Propagate write_stream from the iomap to the bio in both the direct I/O and
buffered writeback paths.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/iomap/direct-io.c  | 1 +
 fs/iomap/ioend.c      | 3 +++
 include/linux/iomap.h | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcd..455fd5d97d25 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -348,6 +348,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 	fscrypt_set_bio_crypt_ctx(bio, iter->inode, pos, GFP_KERNEL);
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_write_hint = iter->inode->i_write_hint;
+	bio->bi_write_stream = iter->iomap.write_stream;
 	bio->bi_ioprio = dio->iocb->ki_ioprio;
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index acf3cf98b23a..56ed5ba6a421 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -164,6 +164,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
 	bio->bi_write_hint = wpc->inode->i_write_hint;
+	bio->bi_write_stream = wpc->iomap.write_stream;
 	wbc_init_bio(wpc->wbc, bio);
 	wpc->nr_folios = 0;
 	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
@@ -187,6 +188,8 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
 	if (!(wpc->iomap.flags & IOMAP_F_ANON_WRITE) &&
 	    iomap_sector(&wpc->iomap, pos) != bio_end_sector(&ioend->io_bio))
 		return false;
+	if (wpc->iomap.write_stream != ioend->io_bio.bi_write_stream)
+		return false;
 	/*
 	 * Limit ioend bio chain lengths to minimise IO completion latency. This
 	 * also prevents long tight loops ending page writeback on all the
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2c5685adf3a9..44583429ffa4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -120,6 +120,8 @@ struct iomap {
 	u64			length;	/* length of mapping, bytes */
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
+	u8			write_stream; /* write stream for I/O */
+	/* 3 bytes padding hole here */
 	struct block_device	*bdev;	/* block device for I/O */
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
-- 
2.25.1



* [PATCH v3 4/6] xfs: generic AG set based steering
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi,
	Anuj Gupta
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

Improve allocator concurrency and reduce interleaving by introducing
a fixed-size AG set.
Use the low bits of the inode number as a hash to select an AG within the AG
set. Overall, a file will try to use the same AG (and contiguity is maintained),
but multiple files will be spread across all AGs in the target AG set.

Suggested-by: Dave Chinner <dgc@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a4c8f1aa76c..6685220ec59a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3194,6 +3194,36 @@ xfs_bmap_select_minlen(
 	return args->maxlen;
 }
 
+#define	GENERIC_AG_SET_SZ	(2)
+
+static inline xfs_agnumber_t
+xfs_default_ag_set_size(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return min_t(xfs_agnumber_t, GENERIC_AG_SET_SZ, mp->m_sb.sb_agcount);
+}
+
+static xfs_agnumber_t
+xfs_ag_to_ag_set(
+	struct xfs_bmalloca	*ap,
+	xfs_agnumber_t		base_agno)
+{
+	struct xfs_inode	*ip = ap->ip;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agnumber_t		set_size;
+
+	/* Apply fanning only for regular file data */
+	if (!(ap->datatype & XFS_ALLOC_USERDATA))
+		return base_agno;
+
+	set_size = xfs_default_ag_set_size(ip);
+	/* Fan out within the AG set using low bits of the inode */
+	return (base_agno + (XFS_INO_TO_AGINO(mp, ip->i_ino) % set_size)) %
+		mp->m_sb.sb_agcount;
+}
+
 static int
 xfs_bmap_btalloc_select_lengths(
 	struct xfs_bmalloca	*ap,
@@ -3589,8 +3619,16 @@ xfs_bmap_btalloc_best_length(
 {
 	xfs_extlen_t		blen = 0;
 	int			error;
+	xfs_agnumber_t		target_ag, start_ag;
 
 	ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
+
+	/* fan out initial AG across the generic AG set */
+	start_ag = XFS_FSB_TO_AGNO(args->mp, ap->blkno);
+	target_ag = xfs_ag_to_ag_set(ap, start_ag);
+	if (target_ag != start_ag)
+		ap->blkno = XFS_AGB_TO_FSB(args->mp, target_ag, 0);
+
 	if (!xfs_bmap_adjacent(ap))
 		ap->eof = false;
 
-- 
2.25.1



* [PATCH v3 1/6] fs: add generic write-stream management ioctl
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

Wire up the userspace interface for write stream management via a new
vfs ioctl 'FS_IOC_WRITE_STREAM'.
The application communicates the intended operation using the 'op_flags'
field of the passed 'struct fs_write_stream'.
Valid flags are:
FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.

The application should first query the available streams using
FS_WRITE_STREAM_OP_GET_MAX.
If the returned value is N, valid stream values for the file are 0 to N.
Stream value 0 implies that no stream is set on the file.
Setting a value larger than the number of available streams is rejected.
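
A minimal userspace sketch of the intended call sequence (query the maximum,
then set). The struct and constants are copied locally from the diff below,
since they are not in released UAPI headers; the helper name
set_write_stream() and the choice to zero 'rsvd' are assumptions of this
sketch, not something this patch mandates.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copy of the proposed UAPI from this patch. */
struct fs_write_stream {
	uint32_t	op_flags;	/* IN: operation flags */
	union {
		uint32_t	stream_id;	/* IN/OUT: stream value */
		uint32_t	max_streams;	/* OUT: max streams supported */
	};
	uint64_t	rsvd;
};

#define FS_WRITE_STREAM_OP_GET_MAX	(1 << 0)
#define FS_WRITE_STREAM_OP_GET		(1 << 1)
#define FS_WRITE_STREAM_OP_SET		(1 << 2)
#define FS_IOC_WRITE_STREAM	_IOWR('f', 135, struct fs_write_stream)

/*
 * Hypothetical helper: assign stream_id to the file behind fd after
 * checking it against the advertised maximum. Returns 0 or -errno.
 */
int set_write_stream(int fd, uint32_t stream_id)
{
	struct fs_write_stream ws;

	/* Step 1: discover how many streams are available (N). */
	memset(&ws, 0, sizeof(ws));
	ws.op_flags = FS_WRITE_STREAM_OP_GET_MAX;
	if (ioctl(fd, FS_IOC_WRITE_STREAM, &ws) < 0)
		return -errno;

	/* Valid values are 0..N; reject larger ones before asking the kernel. */
	if (stream_id > ws.max_streams)
		return -EINVAL;

	/* Step 2: assign the stream to the file. */
	memset(&ws, 0, sizeof(ws));
	ws.op_flags = FS_WRITE_STREAM_OP_SET;
	ws.stream_id = stream_id;
	if (ioctl(fd, FS_IOC_WRITE_STREAM, &ws) < 0)
		return -errno;
	return 0;
}
```

On a kernel without this series the first ioctl is expected to fail (for
example with ENOTTY on most filesystems), so callers can treat a negative
return as "no write streams available".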

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/uapi/linux/fs.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 13f71202845e..9e87271e610b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -338,6 +338,20 @@ struct file_attr {
 /* Get logical block metadata capability details */
 #define FS_IOC_GETLBMD_CAP		_IOWR(0x15, 2, struct logical_block_metadata_cap)

+struct fs_write_stream {
+	__u32		op_flags;	/* IN: operation flags */
+	union {
+		__u32		stream_id;	/* IN/OUT:  stream value to assign/query */
+		__u32		max_streams;	/* OUT: max streams values supported */
+	};
+	__u64		rsvd;
+};
+
+#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
+#define FS_WRITE_STREAM_OP_GET			(1 << 1)
+#define FS_WRITE_STREAM_OP_SET			(1 << 2)
+
+#define FS_IOC_WRITE_STREAM		_IOWR('f', 135, struct fs_write_stream)
 /*
  * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
  *
-- 
2.25.1


* [PATCH v3 0/6] xfs write streams
From: Kanchan Joshi @ 2026-06-16 18:05 UTC
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <CGME20260616181240epcas5p3f86fbb67f0d04cb0ee4b34839c9522b5@epcas5p3.samsung.com>

Hi All,

At LSFMMBPF'26, we discussed 'write streams' in different sessions as a
mechanism to reduce filesystem allocator bottlenecks and improve
buffered/direct IO scalability.

This series introduces a generic interface [1,2] for write stream management on
files. It achieves spatial isolation and concurrency improvements [3] in xfs using
- generic AG-set (patch #4)
- write-stream based AG-set (patch #5)

write streams allow the abstraction provider (fs, block device, raid etc.) to
leverage application's intent (file relationships/lifecycle).
- application: sends grouping/isolation intent with a stream id.
- xfs: maps streams to AGs; allocates without interleaving; gains higher
  concurrency due to reduced lock contention.
- hardware: maps streams to the underlying allocation unit; reduces
  device-internal write amplification, improves device life, and gives
  predictable QoS.

Also few other points:

- Since high-level write streams (in xfs) can work without the
  low-level write streams (in the block device), the series has general
  value beyond hardware with a particular capability.

- For hardware-only spatial isolation, only the first 3 patches are needed.

- Write streams differ from the existing 'filestream' allocator, which
  maintains directory-to-AG associations in a global MRU cache. That
  requires state management and memory (and its reclaim). The proposed
  AG-set based steering relies on simple, stateless/lockless arithmetic
  that aligns more with the default allocator heuristics.
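
To illustrate what "stateless/lockless arithmetic" can mean here, the
following is a minimal sketch and not the code from patches #4/#5. It
assumes the AGs are split into equal-sized contiguous sets, that a stream
id selects a set by modulo, and that a per-file value (here the inode
number) selects an AG inside the set. The names ag_set_pick, agcount and
nr_sets are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: map (stream_id, ino) to an allocation group with
 * pure arithmetic. There is no global cache, no lock and no state to
 * reclaim. Assumes agcount >= nr_sets >= 1; any leftover AGs
 * (agcount % nr_sets) are simply not used by this sketch.
 */
static uint32_t ag_set_pick(uint32_t stream_id, uint64_t ino,
                            uint32_t agcount, uint32_t nr_sets)
{
        uint32_t set_size = agcount / nr_sets;          /* AGs per set */
        uint32_t set = stream_id % nr_sets;             /* stream -> set */
        uint32_t slot = (uint32_t)(ino % set_size);     /* file -> AG in set */

        return set * set_size + slot;
}
```

With 16 AGs and 4 sets, stream 0 stays within AGs 0-3 and stream 3 within
AGs 12-15, so files on those two streams never share an AG in this model.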

[3]
### Performance
1. On regular NVMe:
fio 4k write, direct IO, 16 jobs, 16 files * 8GiB, iodepth 32, XFS with 16 AGs

Base: 41 KIOPS
With generic ag-set: 93 KIOPS (+126%)
With write-stream ag-set: 227 KIOPS (+453%)

2. On FDP-capable NVMe:
RocksDB YCSB
WAF (base vs write-stream): 35% Reduction

[1]
### Application interface

New vfs ioctl 'FS_IOC_WRITE_STREAM'.
The application communicates the intended operation using the 'op_flags'
field of the passed 'struct fs_write_stream'.
Valid flags are:
FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
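
The xfs handler in patch 3 dispatches on op_flags with a switch, so each
call must carry exactly one of these flags; combinations fall through to
-EINVAL. A userspace pre-check that mirrors this could look like the
sketch below. fs_write_stream_op_valid is a hypothetical helper name; the
flag values are copied from the UAPI hunk in patch 1.

```c
#include <stdbool.h>
#include <stdint.h>

#define FS_WRITE_STREAM_OP_GET_MAX      (1 << 0)
#define FS_WRITE_STREAM_OP_GET          (1 << 1)
#define FS_WRITE_STREAM_OP_SET          (1 << 2)

/* Hypothetical helper: true only if op_flags is exactly one known op. */
static bool fs_write_stream_op_valid(uint32_t op_flags)
{
        switch (op_flags) {
        case FS_WRITE_STREAM_OP_GET_MAX:
        case FS_WRITE_STREAM_OP_GET:
        case FS_WRITE_STREAM_OP_SET:
                return true;
        default:
                return false;   /* 0, combined flags, or unknown bits */
        }
}
```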

[2]
### Comparison with Write Hints (RWH_WRITE_LIFE_*)

- Semantics: Write Hints describe 'data temperature' (e.g.,
short/long/extreme), implying a lifetime. Write Streams describe 'data
placement' (e.g., Bin 1/Bin 2), implying only separation.

- Scalability: Write Hints are limited to a small, fixed enum (6
values). Write streams are dynamic, provider-dependent values that can
scale much higher (kernel limit: up to 255 due to u8 field).

- Discovery: The existing write-hint interface is advisory and decoupled
from underlying capabilities; the application has no way to probe support
and cannot deterministically know which hints are valid. Write streams,
on the other hand, provide explicit discovery.

Note: within the kernel, the separation between the two constructs
(write hint and write stream) began in 6.16.

### Changelog
since v2:
https://lore.kernel.org/linux-fsdevel/20260309052944.156054-1-joshi.k@samsung.com/
- xfs default allocator optimization using fixed-size generic AG set (Dave)
- reuse the above to simplify the write-stream AG set handling
- streamline the uapi; Use union for GET_MAX and GET/SET (Darrick)
- uint16_t for write-stream within xfs inode and other cleanups (Darrick)

since v1:
https://lore.kernel.org/linux-fsdevel/20260216052540.217920-1-joshi.k@samsung.com/
- switch from fcntl-based to ioctl-based interface (Christian)
- new patch (#4) that makes xfs allocator use the write streams for AG
  selection
- new patch (#5) that introduces software write streams in xfs.

### Interface example

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <string.h>
#include <errno.h>

/* Duplicate the kernel UAPI definitions */
struct fs_write_stream {
        uint32_t op_flags;
        union {
                uint32_t stream_id;
                uint32_t max_streams;
        };
        uint64_t rsvd;          /* must be zero; named as in the UAPI struct */
};

#define FS_WRITE_STREAM_OP_GET                  (1 << 1)
#define FS_WRITE_STREAM_OP_SET                  (1 << 2)
#define FS_WRITE_STREAM_OP_GET_MAX              (1 << 0)

#define FS_IOC_WRITE_STREAM             _IOWR('f', 135, struct fs_write_stream)

void print_usage(const char *progname) {
        fprintf(stderr, "Usage:\n");
        fprintf(stderr, "  %s <file> max       - Get max supported streams\n", progname);
        fprintf(stderr, "  %s <file> get       - Get current stream ID\n", progname);
        fprintf(stderr, "  %s <file> set <id>  - Set stream ID\n", progname);
        exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
        if (argc < 3)
                print_usage(argv[0]);

        const char *filepath = argv[1];
        const char *cmd = argv[2];
        int fd = open(filepath, O_RDWR);
        if (fd < 0) {
                perror("Error opening file");
                return EXIT_FAILURE;
        }

        struct fs_write_stream req;
        memset(&req, 0, sizeof(req));

        if (strcmp(cmd, "max") == 0) {
                req.op_flags = FS_WRITE_STREAM_OP_GET_MAX;
                if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
                        perror("ioctl(GET_MAX) failed");
                        close(fd);
                        return EXIT_FAILURE;
                }
                printf("Max streams supported: %u\n", req.max_streams);
        } else if (strcmp(cmd, "get") == 0) {
                req.op_flags = FS_WRITE_STREAM_OP_GET;
                if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
                        perror("ioctl(GET) failed");
                        close(fd);
                        return EXIT_FAILURE;
                }
                printf("Current stream ID: %u\n", req.stream_id);
        } else if (strcmp(cmd, "set") == 0) {
                if (argc != 4)
                        print_usage(argv[0]);

                req.op_flags = FS_WRITE_STREAM_OP_SET;
                req.stream_id = atoi(argv[3]);

                if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
                        perror("ioctl(SET) failed");
                        close(fd);
                        return EXIT_FAILURE;
                }
                printf("Set stream ID to: %u\n", req.stream_id);
        } else {
                fprintf(stderr, "Unknown command: %s\n", cmd);
                close(fd);
                print_usage(argv[0]);
        }

        close(fd);
        return EXIT_SUCCESS;
}

Kanchan Joshi (6):
  fs: add generic write-stream management ioctl
  iomap: introduce and propagate write_stream
  xfs: implement write-stream management support
  xfs: generic AG set based steering
  xfs: write stream based AG placement
  xfs: introduce software write streams

 fs/iomap/direct-io.c     |  1 +
 fs/iomap/ioend.c         |  3 ++
 fs/xfs/libxfs/xfs_bmap.c | 74 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c      |  1 +
 fs/xfs/xfs_inode.c       | 69 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h       |  6 ++++
 fs/xfs/xfs_ioctl.c       | 38 +++++++++++++++++++++
 fs/xfs/xfs_iomap.c       |  1 +
 include/linux/iomap.h    |  2 ++
 include/uapi/linux/fs.h  | 14 ++++++++
 10 files changed, 209 insertions(+)

-- 
2.25.1


* [PATCH v3 3/6] xfs: implement write-stream management support
From: Kanchan Joshi @ 2026-06-16 18:05 UTC (permalink / raw)
  To: brauner, hch, djwong, dgc, jack, cem, axboe, kbusch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-block, gost.dev, Kanchan Joshi
In-Reply-To: <20260616180555.33338-1-joshi.k@samsung.com>

Implement support for FS_IOC_WRITE_STREAM ioctl.

For FS_WRITE_STREAM_OP_GET_MAX, available write streams are reported
based on the capability of the underlying block device.
For FS_WRITE_STREAM_OP_{SET/GET}, add a new i_write_stream field in xfs
inode. This value is propagated to the iomap during block mapping.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/xfs/xfs_icache.c |  1 +
 fs/xfs/xfs_inode.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h  |  6 ++++++
 fs/xfs/xfs_ioctl.c  | 38 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.c  |  1 +
 5 files changed, 92 insertions(+)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2040a9292ee6..d5f880f5b810 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -130,6 +130,7 @@ xfs_inode_alloc(
 	spin_lock_init(&ip->i_ioend_lock);
 	ip->i_next_unlinked = NULLAGINO;
 	ip->i_prev_unlinked = 0;
+	ip->i_write_stream = 0;
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index beaa26ec62da..2e7c61d71b48 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -47,6 +47,52 @@
 
 struct kmem_cache *xfs_inode_cache;
 
+int
+xfs_inode_max_write_streams(
+	struct xfs_inode	*ip)
+{
+	struct block_device	*bdev;
+
+	bdev = xfs_inode_buftarg(ip)->bt_bdev;
+	if (!bdev)
+		return 0;
+
+	return bdev_max_write_streams(bdev);
+}
+
+uint16_t
+xfs_inode_get_write_stream(
+	struct xfs_inode	*ip)
+{
+	uint16_t	stream_id;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	stream_id = ip->i_write_stream;
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	return stream_id;
+}
+
+int
+xfs_inode_set_write_stream(
+	struct xfs_inode	*ip,
+	uint16_t		stream_id)
+{
+	int ret = 0;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	if (stream_id > xfs_inode_max_write_streams(ip)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	ip->i_write_stream =  stream_id;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return ret;
+}
+
 /*
  * These two are wrapper routines around the xfs_ilock() routine used to
  * centralize some grungy code.  They are used in places that wish to lock the
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..768c4195306c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -38,6 +38,9 @@ typedef struct xfs_inode {
 	struct xfs_ifork	i_df;		/* data fork */
 	struct xfs_ifork	i_af;		/* attribute fork */
 
+	/* Write stream information */
+	uint16_t		i_write_stream;
+
 	/* Transaction and locking information. */
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	struct rw_semaphore	i_lock;		/* inode lock */
@@ -676,4 +679,7 @@ int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
 		struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
 		struct xfs_dquot **pdqpp);
 
+int xfs_inode_max_write_streams(struct xfs_inode *ip);
+uint16_t xfs_inode_get_write_stream(struct xfs_inode *ip);
+int xfs_inode_set_write_stream(struct xfs_inode *ip, uint16_t stream_id);
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 46e234863644..3f82a4884b81 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1179,6 +1179,42 @@ xfs_ioctl_fs_counts(
 	return 0;
 }
 
+static int
+xfs_ioc_write_stream(
+	struct file		*filp,
+	void __user		*arg)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct fs_write_stream	ws = { };
+
+	if (copy_from_user(&ws, arg, sizeof(ws)))
+		return -EFAULT;
+	if (ws.rsvd != 0)
+		return -EINVAL;
+
+	switch (ws.op_flags) {
+	case FS_WRITE_STREAM_OP_GET_MAX:
+		ws.max_streams = xfs_inode_max_write_streams(ip);
+		goto copy_out;
+	case FS_WRITE_STREAM_OP_GET:
+		ws.stream_id = xfs_inode_get_write_stream(ip);
+		goto copy_out;
+	case FS_WRITE_STREAM_OP_SET:
+		if (!(filp->f_mode & FMODE_WRITE))
+			return -EBADF;
+		return xfs_inode_set_write_stream(ip, ws.stream_id);
+	default:
+		return -EINVAL;
+	}
+	return 0;
+
+copy_out:
+	if (copy_to_user(arg, &ws, sizeof(ws)))
+		return -EFAULT;
+	return 0;
+}
+
 /*
  * These long-unused ioctls were removed from the official ioctl API in 5.17,
  * but retain these definitions so that we can log warnings about them.
@@ -1444,6 +1480,8 @@ xfs_file_ioctl(
 		return xfs_ioc_health_monitor(filp, arg);
 	case XFS_IOC_VERIFY_MEDIA:
 		return xfs_ioc_verify_media(filp, arg);
+	case FS_IOC_WRITE_STREAM:
+		return xfs_ioc_write_stream(filp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f20a02f49ed9..ccbf7dcf1ad5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -144,6 +144,7 @@ xfs_bmbt_to_iomap(
 	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
 	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
 	iomap->flags = iomap_flags;
+	iomap->write_stream = ip->i_write_stream;
 	if (mapping_flags & IOMAP_DAX) {
 		iomap->dax_dev = target->bt_daxdev;
 	} else {
-- 
2.25.1


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-16 17:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: dm-devel, linux-block, mpatocka, Keith Busch, Vjaceslavs Klimovs
In-Reply-To: <20260616150554.1686662-2-kbusch@meta.com>

* Keith Busch (kbusch@meta.com) wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
> misaligned direct I/O), not that the device has failed. dm-raid1 treated
> any read or write completion error as a device failure: it failed the
> mirror leg, retried on the alternatives - which fail identically - and
> eventually returned EIO while spuriously degrading the array.
> 
> Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
> direct I/O path no longer rejects misaligned buffers up front, so an
> invalid bio now reaches the lower block layers, which fail it with
> BLK_STS_INVAL. dm-io collapses the block status into a per-region error
> bit before invoking the completion callback, so record BLK_STS_INVAL on
> the originating bio and have the dm-raid1 read, write and end_io paths
> propagate it instead of failing the device.
> 
> This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
> ("md/raid1,raid10: don't fail devices for invalid IO errors") for the
> device-mapper mirror target.
> 
> Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
> Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
> Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
> Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

OK, for this pair I think it would be fair to add a Tested-by from me as well;
they certainly resolve the hang and the WARN/BUGs.
I still see the errors as EIO in my tests; on the older mirror type I
get the stuck resync on write, and on the newer mirror I see the write
apparently succeed (did it really?).

Given the BUG/WARN, the number of people who've tripped over this, and
that it's triggerable as a normal user, I suggest it's a candidate for stable.

Thanks again,

Dave

> ---
>  drivers/md/dm-io.c    | 14 +++++++++++++-
>  drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
> index 28adfeb58f240..f382e9f9be059 100644
> --- a/drivers/md/dm-io.c
> +++ b/drivers/md/dm-io.c
> @@ -37,6 +37,7 @@ struct io {
>  	struct dm_io_client *client;
>  	io_notify_fn callback;
>  	void *context;
> +	struct bio *orig_bio;
>  	void *vma_invalidate_address;
>  	unsigned long vma_invalidate_size;
>  } __aligned(DM_IO_MAX_REGIONS);
> @@ -132,8 +133,18 @@ static void complete_io(struct io *io)
>  
>  static void dec_count(struct io *io, unsigned int region, blk_status_t error)
>  {
> -	if (error)
> +	if (error) {
>  		set_bit(region, &io->error_bits);
> +		/*
> +		 * BLK_STS_INVAL means the bio was not valid for the underlying
> +		 * device (e.g. a misaligned direct I/O), which is a caller error
> +		 * rather than a device failure. Record it on the original bio so
> +		 * bio-based targets can propagate it instead of treating it as a
> +		 * media error and failing the device.
> +		 */
> +		if (error == BLK_STS_INVAL && io->orig_bio)
> +			io->orig_bio->bi_status = error;
> +	}
>  
>  	if (atomic_dec_and_test(&io->count))
>  		complete_io(io);
> @@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
>  	io->client = client;
>  	io->callback = fn;
>  	io->context = context;
> +	io->orig_bio = dp->orig_bio;
>  
>  	io->vma_invalidate_address = dp->vma_invalidate_address;
>  	io->vma_invalidate_size = dp->vma_invalidate_size;
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index de5c00704e69c..022ad791c2957 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
>  		return;
>  	}
>  
> +	/*
> +	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
> +	 * e.g. a misaligned direct I/O. That is a caller error, not a device
> +	 * failure, so propagate it rather than failing the mirror and retrying
> +	 * on the other legs, which would fail the same way.
> +	 */
> +	if (bio->bi_status == BLK_STS_INVAL) {
> +		bio_endio(bio);
> +		return;
> +	}
> +
>  	fail_mirror(m, DM_RAID1_READ_ERROR);
>  
>  	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
> @@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
>  		return;
>  	}
>  
> +	/*
> +	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
> +	 * e.g. a misaligned direct I/O. Propagate the error without degrading
> +	 * the array.
> +	 */
> +	if (bio->bi_status == BLK_STS_INVAL) {
> +		bio_endio(bio);
> +		return;
> +	}
> +
>  	/*
>  	 * If the bio is discard, return an error, but do not
>  	 * degrade the array.
> @@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
>  		return DM_ENDIO_DONE;
>  	}
>  
> -	if (*error == BLK_STS_NOTSUPP)
> +	/*
> +	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
> +	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
> +	 * mirror and retrying, which would fail the same way on every leg.
> +	 */
> +	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
>  		goto out;
>  
>  	if (bio->bi_opf & REQ_RAHEAD)
> -- 
> 2.52.0
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

* [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 15:05 UTC (permalink / raw)
  To: dm-devel
  Cc: linux-block, mpatocka, Keith Busch, Dr. David Alan Gilbert,
	Vjaceslavs Klimovs
In-Reply-To: <20260616150554.1686662-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
misaligned direct I/O), not that the device has failed. dm-raid1 treated
any read or write completion error as a device failure: it failed the
mirror leg, retried on the alternatives - which fail identically - and
eventually returned EIO while spuriously degrading the array.

Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
direct I/O path no longer rejects misaligned buffers up front, so an
invalid bio now reaches the lower block layers, which fail it with
BLK_STS_INVAL. dm-io collapses the block status into a per-region error
bit before invoking the completion callback, so record BLK_STS_INVAL on
the originating bio and have the dm-raid1 read, write and end_io paths
propagate it instead of failing the device.

This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
("md/raid1,raid10: don't fail devices for invalid IO errors") for the
device-mapper mirror target.
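
The flow described above can be modelled in plain userspace C. This is a
sketch for illustration only, not kernel code: the model_* types and
functions are hypothetical stand-ins, and the numeric status values are
made up rather than taken from the kernel headers.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; the values are illustrative only. */
typedef unsigned char blk_status_t;
#define BLK_STS_OK      ((blk_status_t)0)
#define BLK_STS_IOERR   ((blk_status_t)1)       /* hypothetical value */
#define BLK_STS_INVAL   ((blk_status_t)2)       /* hypothetical value */

struct model_bio { blk_status_t bi_status; };

struct model_io {
        unsigned long error_bits;       /* one bit per region */
        struct model_bio *orig_bio;     /* may be NULL */
};

/* Per-region completion: the status collapses to a bit, except INVAL. */
static void model_dec_count(struct model_io *io, unsigned int region,
                            blk_status_t error)
{
        if (!error)
                return;
        io->error_bits |= 1UL << region;
        if (error == BLK_STS_INVAL && io->orig_bio)
                io->orig_bio->bi_status = error;
}

/* Callback-side decision: 1 = fail the mirror leg, 0 = just propagate. */
static int model_should_fail_mirror(const struct model_io *io)
{
        if (!io->error_bits)
                return 0;
        if (io->orig_bio && io->orig_bio->bi_status == BLK_STS_INVAL)
                return 0;
        return 1;
}
```

In this model a region that completes with BLK_STS_INVAL still sets its
error bit, but the status saved on the originating bio lets the callback
tell a caller error apart from a device failure.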

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/md/dm-io.c    | 14 +++++++++++++-
 drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 28adfeb58f240..f382e9f9be059 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -37,6 +37,7 @@ struct io {
 	struct dm_io_client *client;
 	io_notify_fn callback;
 	void *context;
+	struct bio *orig_bio;
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 } __aligned(DM_IO_MAX_REGIONS);
@@ -132,8 +133,18 @@ static void complete_io(struct io *io)
 
 static void dec_count(struct io *io, unsigned int region, blk_status_t error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		/*
+		 * BLK_STS_INVAL means the bio was not valid for the underlying
+		 * device (e.g. a misaligned direct I/O), which is a caller error
+		 * rather than a device failure. Record it on the original bio so
+		 * bio-based targets can propagate it instead of treating it as a
+		 * media error and failing the device.
+		 */
+		if (error == BLK_STS_INVAL && io->orig_bio)
+			io->orig_bio->bi_status = error;
+	}
 
 	if (atomic_dec_and_test(&io->count))
 		complete_io(io);
@@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
 	io->client = client;
 	io->callback = fn;
 	io->context = context;
+	io->orig_bio = dp->orig_bio;
 
 	io->vma_invalidate_address = dp->vma_invalidate_address;
 	io->vma_invalidate_size = dp->vma_invalidate_size;
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index de5c00704e69c..022ad791c2957 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. That is a caller error, not a device
+	 * failure, so propagate it rather than failing the mirror and retrying
+	 * on the other legs, which would fail the same way.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	fail_mirror(m, DM_RAID1_READ_ERROR);
 
 	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
@@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate the error without degrading
+	 * the array.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	/*
 	 * If the bio is discard, return an error, but do not
 	 * degrade the array.
@@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 		return DM_ENDIO_DONE;
 	}
 
-	if (*error == BLK_STS_NOTSUPP)
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
+	 * mirror and retrying, which would fail the same way on every leg.
+	 */
+	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
 		goto out;
 
 	if (bio->bi_opf & REQ_RAHEAD)
-- 
2.52.0


* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, Usama Arif, shakeel.butt, hannes, riel,
	kernel-team
In-Reply-To: <20260616165434.GG49951@noisy.programming.kicks-ass.net>

On 6/16/26 10:54 AM, Peter Zijlstra wrote:
> On Tue, Jun 16, 2026 at 10:10:31AM -0600, Jens Axboe wrote:
>> On 6/16/26 10:08 AM, Jens Axboe wrote:
>>>
>>> On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
>>>> The details for this are in patch 2. The main reason for this series
>>>> is to invalidate the cached timestamp on context switch. This was
>>>> done in sched_update_worker() only before which was resulting in
>>>> blk-iocost reading stale timestamps and throttling based on wrong
>>>> information.
>>>>
>>>> Patch 1 is a prerequisite to create the invariant that
>>>> PF_BLOCK_TS set implies current->plug != NULL.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
>>>       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
>>> [2/2] block: invalidate cached plug timestamp after task switch
>>>       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932
>>
>> Note: I tentatively queued this up as a) it looks good to me (and
>> thanks Usama for fixing this!), and b) about to head OOO for a week
>> or so. If Peter or any of the sched people disagree, let me know and
>> we can deal with it. If not, then I plan on sending this in with the
>> usual follow-up merge window fixes next week.
> 
> FWIW, looks good to me.

Great, thanks Peter!

-- 
Jens Axboe

* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Peter Zijlstra @ 2026-06-16 16:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, Usama Arif, shakeel.butt, hannes, riel,
	kernel-team
In-Reply-To: <43b0010d-1919-4986-a88a-a4ccdb3639dd@kernel.dk>

On Tue, Jun 16, 2026 at 10:10:31AM -0600, Jens Axboe wrote:
> On 6/16/26 10:08 AM, Jens Axboe wrote:
> > 
> > On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
> >> The details for this are in patch 2. The main reason for this series
> >> is to invalidate the cached timestamp on context switch. This was
> >> done in sched_update_worker() only before which was resulting in
> >> blk-iocost reading stale timestamps and throttling based on wrong
> >> information.
> >>
> >> Patch 1 is a prerequisite to create the invariant that
> >> PF_BLOCK_TS set implies current->plug != NULL.
> >>
> >> [...]
> > 
> > Applied, thanks!
> > 
> > [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
> >       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
> > [2/2] block: invalidate cached plug timestamp after task switch
> >       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932
> 
> Note: I tentatively queued this up as a) it looks good to me (and
> thanks Usama for fixing this!), and b) about to head OOO for a week
> or so. If Peter or any of the sched people disagree, let me know and
> we can deal with it. If not, then I plan on sending this in with the
> usual follow-up merge window fixes next week.

FWIW, looks good to me.


* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:50 UTC (permalink / raw)
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>

On 16/6/26 9:17 am, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
> 
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
> 
>   BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>   Call Trace:
>   ...
>   blkcg_deactivate_policy+0x244/0x4d0
>   ioc_rqos_exit+0x44/0xe0
>   rq_qos_exit+0xba/0x120
>   __del_gendisk+0x50b/0x800
>   del_gendisk+0xff/0x190
>   ...
> 
> [CAUSE]
> process1						process2
> cgroup_rmdir
> ...
>   css_killed_work_fn
>     offline_css
>     ...
>       blkcg_destroy_blkgs
>       ...
>         __blkg_release
> 	  css_put(&blkg->blkcg->css)
>           blkg_free
> 	    INIT_WORK(xxx, blkg_free_workfn)
> 	    schedule_work
>     css_put
>     ...
>       blkcg_css_free
>         kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
>               blkg_free_workfn
> 							__del_gendisk
> 							  rq_qos_exit
> 							    ioc_rqos_exit
> 							      blkcg_deactivate_policy
> 							        mutex_lock(&q->blkcg_mutex)
> 								spin_lock_irq(&q->queue_lock)
> 							        list_for_each_entry(blkg, xxx)
> 								  blkcg = blkg->blkcg
> 								  spin_lock(&blkcg->lock)-------UAF!!!
> 	        mutex_lock(&q->blkcg_mutex)
> 	        spin_lock_irq(&q->queue_lock)
> 	        /* Only then is the blkg removed from the list */
> 	        list_del_init(&blkg->q_node)
> 
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
> 
> [Fix]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
> 
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc()  <-> blkg_free()
> blkg_create() <-> blkg_destroy()
> 
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
> ---
> v3:
>  - move css_put() after mutex_unlock() in blkg_free_workfn().
> 
> v2:
>  - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>    css reference follows the blkg's own lifetime, making the put in
>    blkg_free_workfn() symmetric with the get in blkg_alloc().
> 
> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>  block/blk-cgroup.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..3ac41f766caf 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -136,6 +136,11 @@ static void blkg_free_workfn(struct work_struct *work)
>  	spin_unlock_irq(&q->queue_lock);
>  	mutex_unlock(&q->blkcg_mutex);
>  
> +	/*
> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
> +	 */
> +	css_put(&blkg->blkcg->css);
>  	blk_put_queue(q);
>  	free_percpu(blkg->iostat_cpu);
>  	percpu_ref_exit(&blkg->refcnt);
> @@ -179,8 +184,6 @@ static void __blkg_release(struct rcu_head *rcu)
>  	for_each_possible_cpu(cpu)
>  		__blkcg_rstat_flush(blkcg, cpu);
>  
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
>  	blkg_free(blkg);
>  }
>  
> @@ -313,6 +316,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  		goto out_exit_refcnt;
>  	if (!blk_get_queue(disk->queue))
>  		goto out_free_iostat;
> +	/* blkg holds a reference to blkcg */
> +	if (!css_tryget_online(&blkcg->css))
> +		goto out_put_queue;
>  
>  	blkg->q = disk->queue;
>  	INIT_LIST_HEAD(&blkg->q_node);
> @@ -353,6 +359,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  	while (--i >= 0)
>  		if (blkg->pd[i])
>  			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> +	css_put(&blkcg->css);
> +out_put_queue:
>  	blk_put_queue(disk->queue);
>  out_free_iostat:
>  	free_percpu(blkg->iostat_cpu);
> @@ -381,18 +389,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		goto err_free_blkg;
>  	}
>  
> -	/* blkg holds a reference to blkcg */
> -	if (!css_tryget_online(&blkcg->css)) {
> -		ret = -ENODEV;
> -		goto err_free_blkg;
> -	}
> -
>  	/* allocate */
>  	if (!new_blkg) {
>  		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>  		if (unlikely(!new_blkg)) {
>  			ret = -ENOMEM;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  	}
>  	blkg = new_blkg;
> @@ -402,7 +404,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>  		if (WARN_ON_ONCE(!blkg->parent)) {
>  			ret = -ENODEV;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  		blkg_get(blkg->parent);
>  	}
> @@ -442,8 +444,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  	blkg_put(blkg);
>  	return ERR_PTR(ret);
>  
> -err_put_css:
> -	css_put(&blkcg->css);
>  err_free_blkg:
>  	if (new_blkg)
>  		blkg_free(new_blkg);

LGTM.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
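
For readers following the ordering argument, here is a small userspace
model of the fixed sequence. It is only a sketch: the model_* names are
hypothetical, a "freed" flag stands in for kfree(blkcg), and locking is
left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_blkcg { int refs; bool freed; };
struct model_blkg  { struct model_blkcg *blkcg; bool on_list; };

static void model_css_put(struct model_blkcg *c)
{
        if (--c->refs == 0)
                c->freed = true;        /* stands in for kfree(blkcg) */
}

/* Fixed ordering: unlink from the queue list first, then drop the ref. */
static void model_blkg_free_work(struct model_blkg *g)
{
        g->on_list = false;             /* list_del_init(&blkg->q_node) */
        model_css_put(g->blkcg);        /* css_put(&blkg->blkcg->css) */
}

/* What an iterator walking q->blkg_list relies on. */
static bool model_iter_safe(const struct model_blkg *g)
{
        return !g->on_list || !g->blkcg->freed;
}
```

As long as the blkg's own reference is dropped only in
model_blkg_free_work(), no blkg that is still on the list can point at a
freed blkcg, whatever the cgroup side does with its own reference.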

-- 
Best Regards,
Yi


* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:44 UTC (permalink / raw)
  To: Hou Tao, yukuai, Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1
In-Reply-To: <8bdf88b3-0879-e3ec-a52d-3e7559bfddbb@huaweicloud.com>

On 16/6/26 9:23 am, Hou Tao wrote:
> Hi,
> 
> On 6/16/2026 12:16 AM, Yu Kuai wrote:
>> Hi,
>>
>> On 2026/6/15 19:55, Zizhi Wo wrote:
>>> From: Zizhi Wo <wozizhi@huawei.com>
>>>
>>> [BUG]
>>> Our fuzz testing triggered a blkcg use-after-free issue:
>>>
>>>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>>>    Call Trace:
>>>    ...
>>>    blkcg_deactivate_policy+0x244/0x4d0
>>>    ioc_rqos_exit+0x44/0xe0
>>>    rq_qos_exit+0xba/0x120
>>>    __del_gendisk+0x50b/0x800
>>>    del_gendisk+0xff/0x190
>>>    ...
>>>
>>> [CAUSE]
>>> process1						process2
>>> cgroup_rmdir
>>> ...
>>>    css_killed_work_fn
>>>      offline_css
>>>      ...
>>>        blkcg_destroy_blkgs
>>>        ...
>>>          __blkg_release
>>> 	  css_put(&blkg->blkcg->css)
>>>            blkg_free
>>> 	    INIT_WORK(xxx, blkg_free_workfn)
>>> 	    schedule_work
>>>      css_put
>>>      ...
>>>        blkcg_css_free
>>>          kfree(blkcg)--------blkcg has been freed!!!
>>> ====================================schedule_work
>>>                blkg_free_workfn
>>> 							__del_gendisk
>>> 							  rq_qos_exit
>>> 							    ioc_rqos_exit
>>> 							      blkcg_deactivate_policy
>>> 							        mutex_lock(&q->blkcg_mutex)
>>> 								spin_lock_irq(&q->queue_lock)
>>> 							        list_for_each_entry(blkg, xxx)
>>> 								  blkcg = blkg->blkcg
>>> 								  spin_lock(&blkcg->lock)-------UAF!!!
>>> 	        mutex_lock(&q->blkcg_mutex)
>>> 	        spin_lock_irq(&q->queue_lock)
>>> 	        /* Only then is the blkg removed from the list */
>>> 	        list_del_init(&blkg->q_node)
>>>
>>> As a result, a blkg can still be reachable through q->blkg_list while
>>> its ->blkcg has already been freed.
>>>
>>> [Fix]
>>> Fix this by deferring the blkcg css_put() until after the blkg has been
>>> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
>>> blkcg outlives every blkg still reachable through q->blkg_list, so any
>>> iterator holding q->queue_lock is guaranteed to observe a valid
>>> blkg->blkcg.
>>>
>>> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
>>> so that the css reference is owned by the alloc/free pair rather than
>>> straddling layers:
>>> blkg_alloc()  <-> blkg_free()
>>> blkg_create() <-> blkg_destroy()
>>>
>>> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
>>> Suggested-by: Hou Tao <houtao1@huawei.com>
>>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>>> ---
>>> v2:
>>>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>>>     css reference follows the blkg's own lifetime, making the put in
>>>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>>>
>>> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>>>
>>>   block/blk-cgroup.c | 24 ++++++++++++------------
>>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>> index bc63bd220865..27414c291e49 100644
>>> --- a/block/blk-cgroup.c
>>> +++ b/block/blk-cgroup.c
>>> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>>>   	if (blkg->parent)
>>>   		blkg_put(blkg->parent);
>>>   	spin_lock_irq(&q->queue_lock);
>>>   	list_del_init(&blkg->q_node);
>>>   	spin_unlock_irq(&q->queue_lock);
>>> +	/*
>>> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
>>> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
>>> +	 */
>>> +	css_put(&blkg->blkcg->css);
>>>   	mutex_unlock(&q->blkcg_mutex);
>> Please move css_put after mutex_unlock, unless there is a strong reason.
> 
> I think blkcg_mutex is used here to serialize access to blkg->q_node
> and blkg->blkcg. We could move the css_put after the mutex_unlock(),
> however it still depends on the mutex_lock and mutex_unlock pair on
> blkcg_mutex implicitly. Instead of such an implicit dependency, we move the
> css_put inside the lock to make it explicit.

Hi, I think I understand your point. Keeping css_put() inside blkcg_mutex makes the dependency explicit, since the same mutex serializes both the removal of blkg->q_node and the access to blkg->blkcg.

Placing css_put() after mutex_unlock(&q->blkcg_mutex) is still functionally correct. The blkg has already been removed from q->blkg_list under the mutex, so once we drop the mutex no iterator can reach this blkg anymore.

The benefit of moving it out is a smaller critical section.
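
The ordering argument above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct owner, struct node, struct queue and every function below are hypothetical stand-ins for blkcg, blkg and the q->blkg_list handling, locking is left out entirely, and "freeing" is modelled by a flag so that the check itself stays safe. It shows why dropping the owner reference before unlinking leaves a window in which a list walker can reach a node whose owner is gone, and why unlinking first closes that window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for blkcg: a refcounted owner. */
struct owner {
	int refs;
	bool freed;	/* set instead of calling free(), to keep the demo safe */
};

/* Hypothetical stand-in for blkg: a node that points at its owner. */
struct node {
	struct owner *owner;
	struct node *next;
};

/* Hypothetical stand-in for q->blkg_list. */
struct queue {
	struct node *head;
};

static void owner_put(struct owner *o)
{
	if (--o->refs == 0)
		o->freed = true;
}

static void queue_link(struct queue *q, struct node *n)
{
	n->next = q->head;
	q->head = n;
}

static void queue_unlink(struct queue *q, struct node *n)
{
	struct node **pp = &q->head;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = n->next;
		n->next = NULL;
	}
}

/* What a list walker relies on: every reachable node has a live owner. */
static bool queue_owners_alive(const struct queue *q)
{
	for (const struct node *n = q->head; n; n = n->next)
		if (n->owner->freed)
			return false;
	return true;
}

/* Ordering described by the fix: unlink first, then drop the owner ref. */
static void node_release_unlink_first(struct queue *q, struct node *n)
{
	queue_unlink(q, n);
	owner_put(n->owner);
}

/*
 * Ordering before the fix: the ref is dropped while the node is still
 * linked. Returns what a walker would observe inside that window.
 */
static bool node_release_put_first(struct queue *q, struct node *n)
{
	bool walker_ok;

	owner_put(n->owner);
	walker_ok = queue_owners_alive(q);
	queue_unlink(q, n);
	return walker_ok;
}
```

With a single node whose owner holds one reference, node_release_put_first() reports a violated invariant inside the window, while node_release_unlink_first() never exposes a node with a freed owner. Whether the put sits inside or just after the mutex does not change this model, since the node is already unreachable by then.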

-- 
Best Regards,
Yi

>>
>> With above change, feel free to add:
>>
>> Reviewed-by: Yu Kuai <yukuai@fygo.io>
>>
>>>   
>>>   	blk_put_queue(q);
>>>   	free_percpu(blkg->iostat_cpu);
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>>>   	 * blkg_stat_lock is for serializing blkg stat update
>>>   	 */
>>>   	for_each_possible_cpu(cpu)
>>>   		__blkcg_rstat_flush(blkcg, cpu);
>>>   
>>> -	/* release the blkcg and parent blkg refs this blkg has been holding */
>>> -	css_put(&blkg->blkcg->css);
>>>   	blkg_free(blkg);
>>>   }
>>>   
>>>   /*
>>>    * A group is RCU protected, but having an rcu lock does not mean that one
>>> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>>   	if (!blkg->iostat_cpu)
>>>   		goto out_exit_refcnt;
>>>   	if (!blk_get_queue(disk->queue))
>>>   		goto out_free_iostat;
>>> +	/* blkg holds a reference to blkcg */
>>> +	if (!css_tryget_online(&blkcg->css))
>>> +		goto out_put_queue;
>>>   
>>>   	blkg->q = disk->queue;
>>>   	INIT_LIST_HEAD(&blkg->q_node);
>>>   	blkg->blkcg = blkcg;
>>>   	blkg->iostat.blkg = blkg;
>>> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   out_free_pds:
>>>   	while (--i >= 0)
>>>   		if (blkg->pd[i])
>>>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>>> +	css_put(&blkcg->css);
>>> +out_put_queue:
>>>   	blk_put_queue(disk->queue);
>>>   out_free_iostat:
>>>   	free_percpu(blkg->iostat_cpu);
>>>   out_exit_refcnt:
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   	if (blk_queue_dying(disk->queue)) {
>>>   		ret = -ENODEV;
>>>   		goto err_free_blkg;
>>>   	}
>>>   
>>> -	/* blkg holds a reference to blkcg */
>>> -	if (!css_tryget_online(&blkcg->css)) {
>>> -		ret = -ENODEV;
>>> -		goto err_free_blkg;
>>> -	}
>>> -
>>>   	/* allocate */
>>>   	if (!new_blkg) {
>>>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>>>   		if (unlikely(!new_blkg)) {
>>>   			ret = -ENOMEM;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   	}
>>>   	blkg = new_blkg;
>>>   
>>>   	/* link parent */
>>>   	if (blkcg_parent(blkcg)) {
>>>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>>>   		if (WARN_ON_ONCE(!blkg->parent)) {
>>>   			ret = -ENODEV;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   		blkg_get(blkg->parent);
>>>   	}
>>>   
>>>   	/* invoke per-policy init */
>>> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   	/* @blkg failed fully initialized, use the usual release path */
>>>   	blkg_put(blkg);
>>>   	return ERR_PTR(ret);
>>>   
>>> -err_put_css:
>>> -	css_put(&blkcg->css);
>>>   err_free_blkg:
>>>   	if (new_blkg)
>>>   		blkg_free(new_blkg);
>>>   	return ERR_PTR(ret);
>>>   }
> 
> 




* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Gao Xiang @ 2026-06-16 16:35 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260616123443.GA21024@lst.de>

On 2026/6/16 20:34, Christoph Hellwig wrote:

> IMHO sharing devices between superblocks is a bad idea, but that ship
> has sailed. Please keep it contained inside of erofs.

I'm not sure why it's a bad idea. For example, the immutable layer
model is already applied to layered virtual block formats (such as
qcow2) and to layered filesystems like overlayfs.

I also think device mapper may have similar immutable approaches to
shared layers, though they work in a slightly different way.

The principle is that each instance uses shared blobs in a
read-only way, and that is about the simplest and safest way
to share data among filesystem instances.

Still, I don't want to argue about that, since it has been pretty
common for years and I've seen no practical risk in using this model.

Thanks,
Gao Xiang


* Re: [PATCH] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
From: Caleb Sander Mateos @ 2026-06-16 16:13 UTC (permalink / raw)
  To: Yitang Yang; +Cc: Jens Axboe, linux-block
In-Reply-To: <20260616155129.406057-1-yi1tang.yang@gmail.com>

On Tue, Jun 16, 2026 at 9:11 AM Yitang Yang <yi1tang.yang@gmail.com> wrote:
>
> blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
> this is the first issue. However, this flag lives in cmd->flags instead
> of issue_flags.
>
> Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
> IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
> bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
> -EINVAL.
>
> Fix it by checking cmd->flags as intended.
>
> Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd")
> Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
>  block/ioctl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/ioctl.c b/block/ioctl.c
> index ab2c9ed79946..3d4ea1537457 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -951,7 +951,7 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>         u32 cmd_op = cmd->cmd_op;
>
>         /* Read what we need from the SQE on the first issue */
> -       if (!(issue_flags & IORING_URING_CMD_REISSUE)) {
> +       if (!(cmd->flags & IORING_URING_CMD_REISSUE)) {
>                 const struct io_uring_sqe *sqe = cmd->sqe;
>
>                 if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
> --
> 2.43.0
>
>
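
The bug class described in the commit message, testing a flag against the wrong flags word when two unrelated flags share a bit position, can be shown with a minimal user-space sketch. Everything here is an assumption made for illustration: the DEMO_* macros and struct demo_cmd are hypothetical stand-ins, and the only detail taken from the message is that both flags occupy bit 31. The real kernel definitions are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for two flags from different namespaces that
 * happen to share bit 31, as the commit message describes.
 */
#define DEMO_ISSUE_F_NONBLOCK	(UINT32_C(1) << 31)	/* belongs in issue_flags */
#define DEMO_CMD_F_REISSUE	(UINT32_C(1) << 31)	/* belongs in cmd->flags */

struct demo_cmd {
	uint32_t flags;
};

/* Pre-fix shape: the reissue bit is tested against issue_flags. */
static bool demo_reads_sqe_buggy(const struct demo_cmd *cmd, uint32_t issue_flags)
{
	(void)cmd;
	return !(issue_flags & DEMO_CMD_F_REISSUE);
}

/* Post-fix shape: the reissue bit is tested against cmd->flags. */
static bool demo_reads_sqe_fixed(const struct demo_cmd *cmd, uint32_t issue_flags)
{
	(void)issue_flags;
	return !(cmd->flags & DEMO_CMD_F_REISSUE);
}
```

On a first, non-blocking issue (cmd->flags == 0, issue_flags containing the non-blocking bit) the buggy check skips the SQE read because the shared bit makes the command look like a reissue, while the fixed check performs it. The compiler cannot catch this, since both operands are plain integers.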


* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 16:10 UTC (permalink / raw)
  To: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid, Usama Arif
  Cc: shakeel.butt, hannes, riel, kernel-team
In-Reply-To: <178162611741.2191657.12211870708971600814.b4-ty@b4>

On 6/16/26 10:08 AM, Jens Axboe wrote:
> 
> On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
>> The details for this are in patch 2. The main reason for this series
>> is to invalidate the cached timestamp on context switch. Previously
>> this was done only in sched_update_worker(), which resulted in
>> blk-iocost reading stale timestamps and throttling based on wrong
>> information.
>>
>> Patch 1 is a prerequisite to create the invariant that
>> PF_BLOCK_TS set implies current->plug != NULL.
>>
>> [...]
> 
> Applied, thanks!
> 
> [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
>       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
> [2/2] block: invalidate cached plug timestamp after task switch
>       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932

Note: I tentatively queued this up as a) it looks good to me (and
thanks Usama for fixing this!), and b) I'm about to head OOO for a week
or so. If Peter or any of the sched people disagree, let me know and
we can deal with it. If not, then I plan on sending this in with the
usual follow-up merge window fixes next week.

-- 
Jens Axboe



* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 16:08 UTC (permalink / raw)
  To: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid, Usama Arif
  Cc: shakeel.butt, hannes, riel, kernel-team
In-Reply-To: <20260616141604.328820-1-usama.arif@linux.dev>


On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
> The details for this are in patch 2. The main reason for this series
> is to invalidate the cached timestamp on context switch. Previously
> this was done only in sched_update_worker(), which resulted in
> blk-iocost reading stale timestamps and throttling based on wrong
> information.
> 
> Patch 1 is a prerequisite to create the invariant that
> PF_BLOCK_TS set implies current->plug != NULL.
> 
> [...]

Applied, thanks!

[1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
      commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
[2/2] block: invalidate cached plug timestamp after task switch
      commit: fad156c2af227f42ca796cbb20ddc354a6dd9932
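
The stale-timestamp problem summarized in the cover letter can be modelled in user space. This is a sketch under loud assumptions, not the applied patches: struct demo_task, struct demo_plug, DEMO_PF_BLOCK_TS, the fake clock and both functions are hypothetical, and the only properties taken from the cover letter are that a timestamp is cached in the plug, that it must be invalidated on context switch, and that the flag being set implies a non-NULL plug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PF_BLOCK_TS 0x1u	/* hypothetical stand-in for PF_BLOCK_TS */

struct demo_plug {
	uint64_t cur_ktime;	/* 0 means "nothing cached" in this model */
};

struct demo_task {
	unsigned int flags;
	struct demo_plug *plug;
};

/* Fake monotonic clock; the caller advances it by hand. */
static uint64_t demo_clock_ns;

/* Return the cached timestamp if a plug is active, caching it on first use. */
static uint64_t demo_time_get_ns(struct demo_task *t)
{
	struct demo_plug *plug = t->plug;

	if (!plug)
		return demo_clock_ns;
	if (!plug->cur_ktime) {
		plug->cur_ktime = demo_clock_ns;
		t->flags |= DEMO_PF_BLOCK_TS;
	}
	return plug->cur_ktime;
}

/*
 * Called when the task is switched back in. Relies on the invariant
 * "flag set implies plug != NULL", so no NULL check is needed here.
 */
static void demo_invalidate_ts(struct demo_task *t)
{
	if (t->flags & DEMO_PF_BLOCK_TS) {
		t->plug->cur_ktime = 0;
		t->flags &= ~DEMO_PF_BLOCK_TS;
	}
}
```

Without the invalidation step, a task that slept for a long time would keep returning the timestamp cached before the sleep. After demo_invalidate_ts() the next read picks up the current clock value again.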

Best regards,
-- 
Jens Axboe





* Re: [PATCH V2]block: Remove redundant plug in __submit_bio()
From: Jens Axboe @ 2026-06-16 16:08 UTC (permalink / raw)
  To: linux-block, wenxiong; +Cc: tom.leiming, yukuai, stable, wenxiong
In-Reply-To: <20260616143121.878021-1-wenxiong@linux.ibm.com>


On Tue, 16 Jun 2026 10:31:21 -0400, wenxiong@linux.ibm.com wrote:
> The patch removes the automatic plug/unplug operations from __submit_bio()
> that were added to cache nsecs time when no explicit plug is used.
> 
> The plug mechanism is most effective when batching multiple I/O
> operations together. Creating a plug for every bio submission
> provides minimal benefit while adding function call overhead and
> stack usage for every I/O operation.
> 
> [...]

Applied, thanks!

[1/1] block: Remove redundant plug in __submit_bio()
      commit: 9cbbac29d752fb5d95e375fa3685a359b89caa0a

Best regards,
-- 
Jens Axboe





* Re: [PATCH] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
From: Jens Axboe @ 2026-06-16 16:08 UTC (permalink / raw)
  To: Yitang Yang; +Cc: linux-block
In-Reply-To: <20260616155129.406057-1-yi1tang.yang@gmail.com>


On Tue, 16 Jun 2026 23:51:29 +0800, Yitang Yang wrote:
> blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
> this is the first issue. However, this flag lives in cmd->flags instead
> of issue_flags.
> 
> Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
> IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
> bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
> -EINVAL.
> 
> [...]

Applied, thanks!

[1/1] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
      commit: 4f919141be38ea2b1314e3a531b7b998eb64e8bc

Best regards,
-- 
Jens Axboe





* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-16 16:05 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Vjaceslavs Klimovs, Dr. David Alan Gilbert, Thorsten Leemhuis,
	trnka, Zdenek Kabelac, linux-block, dm-devel,
	Linux kernel regressions list
In-Reply-To: <27311df3-2c46-08be-825a-157ea906bdb2@redhat.com>

On Tue, Jun 16, 2026 at 05:55:13PM +0200, Mikulas Patocka wrote:
> I thought that reverting 5ff3f74e145a and re-introducing the alignment 
> check in block/fops.c:blkdev_dio_invalid would fix it - but it wouldn't.
> 
> The same problem existed even before 5ff3f74e145a, with the pvmove 
> command.

Also before 5ff3f74e145a, you could still have devices that are
perfectly fine with dword aligned dma, so sub-sector vectors would have
passed the checks and gone through to dm-raid, which would have
miscounted the remaining.

> So, I think that the proper way to fix this is to teach dm-mirror/dm-io to 
> deal with unaligned bio vectors and handle them properly.

The block layer already handles it, so I think dispatching it and
checking bi_status is all the stacking drivers need to do.


* [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 15:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: dm-devel, linux-block, mpatocka, Dr. David Alan Gilbert,
	Vjaceslavs Klimovs
In-Reply-To: <20260616150554.1686662-1-kbusch@meta.com>

BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
misaligned direct I/O), not that the device has failed. dm-raid1 treated
any read or write completion error as a device failure: it failed the
mirror leg, retried on the alternatives - which fail identically - and
eventually returned EIO while spuriously degrading the array.

Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
direct I/O path no longer rejects misaligned buffers up front, so an
invalid bio now reaches the lower block layers, which fail it with
BLK_STS_INVAL. dm-io collapses the block status into a per-region error
bit before invoking the completion callback, so record BLK_STS_INVAL on
the originating bio and have the dm-raid1 read, write and end_io paths
propagate it instead of failing the device.

This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
("md/raid1,raid10: don't fail devices for invalid IO errors") for the
device-mapper mirror target.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
Resending patch 2/2 from a different machine. For some reason, only 1/2
is getting through with git-send-email, so manually replying to the
thread with the missing second patch.

 drivers/md/dm-io.c    | 14 +++++++++++++-
 drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 28adfeb58f240..f382e9f9be059 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -37,6 +37,7 @@ struct io {
 	struct dm_io_client *client;
 	io_notify_fn callback;
 	void *context;
+	struct bio *orig_bio;
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 } __aligned(DM_IO_MAX_REGIONS);
@@ -132,8 +133,18 @@ static void complete_io(struct io *io)
 
 static void dec_count(struct io *io, unsigned int region, blk_status_t error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		/*
+		 * BLK_STS_INVAL means the bio was not valid for the underlying
+		 * device (e.g. a misaligned direct I/O), which is a caller error
+		 * rather than a device failure. Record it on the original bio so
+		 * bio-based targets can propagate it instead of treating it as a
+		 * media error and failing the device.
+		 */
+		if (error == BLK_STS_INVAL && io->orig_bio)
+			io->orig_bio->bi_status = error;
+	}
 
 	if (atomic_dec_and_test(&io->count))
 		complete_io(io);
@@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
 	io->client = client;
 	io->callback = fn;
 	io->context = context;
+	io->orig_bio = dp->orig_bio;
 
 	io->vma_invalidate_address = dp->vma_invalidate_address;
 	io->vma_invalidate_size = dp->vma_invalidate_size;
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index de5c00704e69c..022ad791c2957 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. That is a caller error, not a device
+	 * failure, so propagate it rather than failing the mirror and retrying
+	 * on the other legs, which would fail the same way.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	fail_mirror(m, DM_RAID1_READ_ERROR);
 
 	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
@@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate the error without degrading
+	 * the array.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	/*
 	 * If the bio is discard, return an error, but do not
 	 * degrade the array.
@@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 		return DM_ENDIO_DONE;
 	}
 
-	if (*error == BLK_STS_NOTSUPP)
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
+	 * mirror and retrying, which would fail the same way on every leg.
+	 */
+	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
 		goto out;
 
 	if (bio->bi_opf & REQ_RAHEAD)
-- 
2.52.0
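
The control flow of the patch above can be condensed into a small user-space model. This is a sketch under stated assumptions: the demo_* types, the enum values and the action names are hypothetical and only mirror the shape of dec_count() and read_callback() in the diff. It shows why the status has to be recorded on the originating bio: the per-region error bit alone no longer says what kind of error occurred.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a few block status codes. */
enum demo_status {
	DEMO_STS_OK = 0,
	DEMO_STS_IOERR,		/* the device failed */
	DEMO_STS_INVAL,		/* the request itself was invalid */
};

struct demo_bio {
	enum demo_status status;
};

struct demo_io {
	unsigned long error_bits;	/* one bit per region, status is lost */
	struct demo_bio *orig_bio;
};

enum demo_action {
	DEMO_COMPLETE_OK,
	DEMO_PROPAGATE_ERROR,	/* end the bio, keep the mirror leg */
	DEMO_FAIL_MIRROR,	/* mark the leg failed and retry elsewhere */
};

/* Shape of dec_count(): collapse to a bit, but remember an invalid request. */
static void demo_dec_count(struct demo_io *io, unsigned int region,
			   enum demo_status error)
{
	if (error) {
		io->error_bits |= 1UL << region;
		if (error == DEMO_STS_INVAL && io->orig_bio)
			io->orig_bio->status = error;
	}
}

/* Shape of read_callback(): only real device errors degrade the mirror. */
static enum demo_action demo_read_callback(unsigned long error,
					   const struct demo_bio *bio)
{
	if (!error)
		return DEMO_COMPLETE_OK;
	if (bio->status == DEMO_STS_INVAL)
		return DEMO_PROPAGATE_ERROR;
	return DEMO_FAIL_MIRROR;
}
```

In this model an invalid request sets the same error bit as a device error, yet the callback chooses DEMO_PROPAGATE_ERROR for it and DEMO_FAIL_MIRROR only for the device error.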


