Linux block layer


* [PATCH v3 2/4] blk-cgroup: fix race between policy activation and blkg destruction
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d22a43c545b6..fd1eed67924b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1564,10 +1564,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
 		return -EINVAL;
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
+	mutex_lock(&q->blkcg_mutex);
 retry:
 	spin_lock_irq(&q->queue_lock);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1626,10 +1628,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
 	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
 		blkg_put(pinned_blkg);
 	if (pd_prealloc)
-- 
2.51.0
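
The locking rule in the commit message above can be sketched in userspace. This is only an illustration under stated assumptions: toy_blkg, toy_queue, toy_blkg_destroy() and toy_activate_policy() are hypothetical stand-ins rather than kernel code, a pthread mutex plays the role of q->blkcg_mutex, and the fail_at parameter simulates an allocation failure. The point is that destruction unlinks the blkg and frees its pd inside one critical section, and that activation (including its rollback) runs under the same mutex, so a pd is never touched after it is freed and never allocated for an entry that is already being torn down.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_BLKG 3

/* Toy stand-ins for illustration; these are not the kernel structures. */
struct toy_blkg {
	bool on_list;	/* still visible on the queue's blkg list? */
	int *pd;	/* policy data, NULL when absent */
};

struct toy_queue {
	pthread_mutex_t blkcg_mutex;
	struct toy_blkg blkgs[NR_BLKG];
};

/* Destruction: unlink the entry and free its pd in one critical section. */
void toy_blkg_destroy(struct toy_queue *q, int idx)
{
	pthread_mutex_lock(&q->blkcg_mutex);
	q->blkgs[idx].on_list = false;
	free(q->blkgs[idx].pd);
	q->blkgs[idx].pd = NULL;
	pthread_mutex_unlock(&q->blkcg_mutex);
}

/*
 * Activation: allocate pd for every listed entry. fail_at is a hypothetical
 * knob that makes the allocation for that index fail (-1 means never fail).
 */
int toy_activate_policy(struct toy_queue *q, int fail_at)
{
	int ret = 0;

	pthread_mutex_lock(&q->blkcg_mutex);
	for (int i = 0; i < NR_BLKG; i++) {
		struct toy_blkg *blkg = &q->blkgs[i];

		/* Skip entries that are already unlinked or already set up. */
		if (!blkg->on_list || blkg->pd)
			continue;
		blkg->pd = (i == fail_at) ? NULL : calloc(1, sizeof(*blkg->pd));
		if (!blkg->pd) {
			ret = -1;
			break;
		}
	}
	if (ret) {
		/*
		 * Rollback runs under the same mutex, so destruction cannot
		 * free a pd while it is being examined here. In this toy,
		 * every pd belongs to the policy being activated.
		 */
		for (int i = 0; i < NR_BLKG; i++) {
			free(q->blkgs[i].pd);
			q->blkgs[i].pd = NULL;
		}
	}
	pthread_mutex_unlock(&q->blkcg_mutex);
	return ret;
}
```

The real function also takes q->queue_lock and has a retry path for preallocation; neither is modelled here.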


^ permalink raw reply related

* [PATCH v3 1/4] blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with blkcg_mutex
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the list
while holding blkcg_mutex.

Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2a1f5903f24..d22a43c545b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -567,10 +567,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 	struct blkcg_gq *blkg;
 	int count = BLKG_DESTROY_BATCH_SIZE;
 	int i;
 
 restart:
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
@@ -585,10 +586,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 		 * it when a batch of blkgs are destroyed.
 		 */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
 	}
 
@@ -604,10 +606,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 			__clear_bit(pol->plid, q->blkcg_pols);
 	}
 
 	q->root_blkg = NULL;
 	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
 
 static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
-- 
2.51.0


^ permalink raw reply related

* [PATCH v3 0/4] blk-cgroup: fix blkg list and policy data races
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel

From: Yu Kuai <yukuai@fygo.io>

Hi,

This series fixes races around q->blkg_list and blkg policy data
lifetime.

Patch 1 protects blkg_destroy_all()'s q->blkg_list walk with
blkcg_mutex.

Patches 2-3 fix races between blkcg_activate_policy() and concurrent
blkg destruction.

Patch 4 factors the policy data teardown loop into a helper after the
race fixes.

Changes since v2:
- Rebase on the latest block-7.2 branch.

Changes since v1:
- Drop the BFQ q->blkg_list patch because the current block tree already
  has a stronger fix in commit 17b2d950a3c0 ("block, bfq: protect async
  queue reset with blkcg locks").
- Add Reviewed-by tags from Tang Yizhou.

Yu Kuai (1):
  blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
    blkcg_mutex

Zheng Qixing (3):
  blk-cgroup: fix race between policy activation and blkg destruction
  blk-cgroup: skip dying blkg in blkcg_activate_policy()
  blk-cgroup: factor policy pd teardown loop into helper

 block/blk-cgroup.c | 65 +++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

-- 
2.51.0

^ permalink raw reply

* Re: [PATCH 2/2] block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action
From: Martin K. Petersen @ 2026-06-25  2:29 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, linux-block
In-Reply-To: <CADUfDZrDNT5rVXE_zSn9-MT1YwZvN2ynCSy_zY4xt-jx_SyuTw@mail.gmail.com>

Caleb,

> Right, I don't mean partitions of zoned devices, but block devices in
> general. I was just trying to understand why the remapping
> infrastructure exists in the first place. Seems like we can't remove
> it entirely, but we can at least ensure the ref tag seeds are correct
> if it's skipped for a non-partitioned device.

The entity attaching the PI to the I/O decides what the seed value
should be.

If you are an application preparing PI, you don't know which LBA a write
is eventually going to end up at. You just know you are writing block 10
inside your file. So you generate PI starting with a reference tag value
of 10 because that is what makes sense to you. And thus you set the seed
to 10 to tell the remapping code which initial reference tag value to
expect in the prepared PI buffer.

Once the write hits the bottom of the stack, we know which initial
reference tag the hardware expects. So we remap the reference tags in
the PI buffer from whatever made sense to the application to whatever
the hardware requires. I.e. an initial value of the lower 32 bits of the
LBA for T10 PI Type 1, incremented by 1 for each subsequent protection
interval.

For reads, it's the same thing. The application wants to read starting
at block offset 10 inside the file so it sets the seed value to 10. At
the bottom of the stack we know how to interpret the PI returned by the
hardware. So we validate that the reference tags received from the
controller match the lower 32 bits of the LBA or whatever the correct PI
format is. And then we remap the reference tags in the received PI
buffer, setting the reference tag to the requested value of 10 for the
first block, and then incrementing by 1 for each subsequent protection
interval.

This allows the application to validate the received reference tags
without ever knowing anything about which start LBA the I/O happened to
come from.
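
To make the two directions concrete, here is a minimal sketch of the remapping described above, assuming Type 1 style reference tags. The struct pi_tuple layout and the function names pi_remap_for_write() and pi_remap_for_read() are hypothetical and chosen for illustration. Tags are kept in host byte order for simplicity, and returning an error on a mismatch in the write direction is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one protection information tuple. */
struct pi_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;	/* host byte order in this sketch */
};

/*
 * Write direction: the submitter numbered the tuples seed, seed + 1, ...
 * Rewrite them to the lower 32 bits of the LBA, incrementing by one per
 * protection interval. Returns 0 on success, -1 if a tag does not match
 * the numbering implied by the seed.
 */
int pi_remap_for_write(struct pi_tuple *pi, size_t nr,
		       uint32_t seed, uint64_t start_lba)
{
	uint32_t virt = seed;
	uint32_t phys = (uint32_t)start_lba;

	for (size_t i = 0; i < nr; i++, virt++, phys++) {
		if (pi[i].ref_tag != virt)
			return -1;
		pi[i].ref_tag = phys;
	}
	return 0;
}

/*
 * Read direction: validate that the device returned LBA-based tags, then
 * rewrite them to the seed-based numbering the submitter asked for.
 */
int pi_remap_for_read(struct pi_tuple *pi, size_t nr,
		      uint32_t seed, uint64_t start_lba)
{
	uint32_t virt = seed;
	uint32_t phys = (uint32_t)start_lba;

	for (size_t i = 0; i < nr; i++, virt++, phys++) {
		if (pi[i].ref_tag != phys)
			return -1;
		pi[i].ref_tag = virt;
	}
	return 0;
}
```

With a seed of 10 and a start LBA of 0x100000005, three tuples tagged 10, 11, 12 go to the device as 5, 6, 7 and come back to the application as 10, 11, 12.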

Both DIX and NVMe also allow the hardware to perform the remapping so
the software remapping step can be skipped altogether. Christoph and I
briefly talked about that last week. We currently don't take advantage
of that capability in the NVMe driver.

-- 
Martin K. Petersen

^ permalink raw reply

* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: yu kuai @ 2026-06-25  1:42 UTC (permalink / raw)
  To: Jens Axboe, yukuai, nilay, tom.leiming, bvanassche, tj, josef
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <34d48fb5-4952-4a48-b92a-f189bc3edd0b@kernel.dk>

Hi,

On 2026/6/24 20:43, Jens Axboe wrote:
> On 6/24/26 12:57 AM, yu kuai wrote:
>> Friendly ping ...
>>
>> This set can still be applied cleanly for block-7.2 branch.
> Not sure how you checked that, because patch 3 very much needs some
> manual attention to get applied. I have applied it now.

Thanks!

This was built on top of my other set:
blk-cgroup: fix blkg list and policy data races

I'll rebase and resend this set :)

>
-- 
Thanks,
Kuai

^ permalink raw reply

* Re: [PATCH RFC v2 17/18] fs: look up the superblock via the device table in user_get_super()
From: Gao Xiang @ 2026-06-24 22:48 UTC (permalink / raw)
  To: Darrick J. Wong, Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260624175417.GU6078@frogsfrogsfrogs>



On 2026/6/25 01:54, Darrick J. Wong wrote:
> On Tue, Jun 16, 2026 at 04:08:33PM +0200, Christian Brauner wrote:
>> user_get_super() still finds the superblock for a device number by
>> walking the global super_blocks list under sb_lock. Every superblock is
>> registered in the device table under its s_dev since sget_fc() inserts
>> it there, including superblocks on anonymous devices, so use the table
>> instead.
>>
>> The refcount-pinning cursor helpers super_dev_{get,first,next}() only
>> touch table state and do not depend on CONFIG_BLOCK, so drop the
>> CONFIG_BLOCK guard around them: their new caller serves anonymous
>> devices as well (ustat() on e.g. tmpfs) and is built without
>> CONFIG_BLOCK. The guard falls in this patch rather than separately
>> since without this caller the helpers would be unused without
>> CONFIG_BLOCK.
>>
>> The pinned entry holds a passive reference on the superblock so
>> super_lock() can be called directly; once the superblock is locked grab
>> a passive reference for the caller before dropping the pin.
>>
>> The device table contains more than the old walk could find: a
>> superblock is also registered for every additional device it claims
>> (the xfs log and realtime devices, btrfs member devices, the ext4
>> external journal, erofs blob devices). Don't filter those out:
>> specifying any device a filesystem uses now resolves to that
>> filesystem, so ustat() and quotactl() work on e.g. the xfs log device
>> or a btrfs member device (the latter used to fail outright as btrfs
>> superblocks carry an anonymous s_dev that never matches a member
>> device). When several superblocks share a device (erofs blob devices)
>> the first live superblock wins.
> 
> Does erofs have a means to find the other superblocks that share a
> device given a notification coming in on one of them?  

No, erofs currently has no way to find the other superblocks (it
doesn't maintain that relationship). My thinking so far has been that,
because erofs is a read-only filesystem, it doesn't strictly need its
own shutdown or notification mechanism: it is strictly immutable (no
local writes or dirty journals), and the block layer can return I/O
errors on dead bdevs directly, even when the block device is shared.
But I may be wrong if there are reasons why we should maintain the
relationship.

Currently it only uses sb->s_type as the holder for bdev sharing;
I think that is what Christian meant.

Thanks,
Gao Xiang

^ permalink raw reply

* [PATCH] block: bio: check offset/length sanity in {__,}bio_add_page()
From: Sergey Shtylyov @ 2026-06-24 20:33 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Sergey Shtylyov, linux-kernel, Karina Yankevich

The sum of struct bio_vec's bv_offset and bv_len fields is calculated in
some functions in block/{blk-merge.c,blk.h} (and that sum is often compared
to PAGE_SIZE). The sum may overflow (and so the comparison may yield a wrong
result) if bad arguments were previously passed to {__,}bio_add_page().
Add a check to {__,}bio_add_page() that the sum of the offset and length
parameters doesn't overflow.

Found by Linux Verification Center (linuxtesting.org) with the Svace static
analysis tool.

Signed-off-by: Sergey Shtylyov <s.shtylyov@auroraos.dev>

---
The patch is against the for-next branch of Jens Axboe's linux.git repo.

 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a967..daca63b94fae 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1000,6 +1000,7 @@ void __bio_add_page(struct bio *bio, struct page *page,
 {
 	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
 	WARN_ON_ONCE(bio_full(bio, len));
+	WARN_ON_ONCE(off + len < off);	/* does the sum overflow? */
 
 	if (is_pci_p2pdma_page(page))
 		bio->bi_opf |= REQ_NOMERGE;
@@ -1045,6 +1046,9 @@ int bio_add_page(struct bio *bio, struct page *page,
 		return 0;
 	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
 		return 0;
+	/* Are offset and len sane, i.e. their sum doesn't overflow? */
+	if (offset + len < offset)
+		return 0;
 
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-- 
2.54.0
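
The wraparound test used in the patch above can be tried in isolation. The following is a small userspace sketch, assuming both values are unsigned int; the helper names and SKETCH_PAGE_SIZE are hypothetical. It shows why `off + len < off` detects overflow, and how a wrapped sum can slip past a naive page-size comparison.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the illustration */

/*
 * Unsigned addition wraps modulo 2^N, so when off + len overflows, the
 * result is smaller than off. That is the condition the patch tests.
 */
bool off_len_overflows(unsigned int off, unsigned int len)
{
	return off + len < off;
}

/* Naive check with no overflow guard: a wrapped sum can look small. */
bool fits_in_page_naive(unsigned int off, unsigned int len)
{
	return off + len <= SKETCH_PAGE_SIZE;
}

/* Guarded check: reject a wrapped sum before comparing. */
bool fits_in_page_checked(unsigned int off, unsigned int len)
{
	return !off_len_overflows(off, len) && off + len <= SKETCH_PAGE_SIZE;
}
```

For example, off = UINT_MAX - 99 and len = 200 wrap to a sum of 100, which the naive check accepts and the guarded check rejects.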

^ permalink raw reply related

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Leonid Ravich @ 2026-06-24 19:52 UTC (permalink / raw)
  To: Eric Biggers, Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Jens Axboe, dm-devel,
	linux-block, linux-crypto
In-Reply-To: <20260622182328.GB1250822@google.com>

On Mon, Jun 22, 2026 at 06:23:28PM +0000, Eric Biggers wrote:
> I don't think there's a path forward without an in-tree user that's
> shown to be worthwhile over just using the acceleration built directly
> into the CPU.  As well as confirmation of no regression to existing
> users, including in cases where the inline sg list can't be used.

Agreed. Proposing a smaller v5 that meets the no-regression bar now and
leaves "beats the CPU" to a follow-up with a real in-tree user.

dm-crypt submits one request per contiguous bio segment (a single
bio_vec) with data_unit_size = sector_size, instead of one per sector.
E.g. default sector_size 512 with a 4 KiB bio_vec: one request of 8
data units, which the fallback splitter walks as 8 per-sector calls --
dm-crypt no longer open-codes the per-data-unit loop itself.

  - Uses only the existing inline sg_in[0]/sg_out[0] entry. No per-bio
    scatterlist, no kmalloc -- the "inline sg list can't be used" case
    doesn't exist here, so there's nothing to regress.
  - For a non-native algorithm the core auto-splits into the same
    per-sector calls dm-crypt makes today: identical output and cost.
    This is what Herbert predicted -- the per-unit indirect call just
    moves from the caller into the API; the fallback is no slower.
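
As a rough illustration of the fallback split described above (not the proposed API), the sketch below walks one request of len bytes as len / data_unit_size per-unit calls. All names here (unit_fn, split_into_data_units(), toy_xor_unit()) are hypothetical. Incrementing the IV by one per data unit is an assumption of the sketch, and the callback is a toy XOR, not a cipher.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-data-unit operation: returns 0 on success. */
typedef int (*unit_fn)(uint8_t *data, size_t len, uint64_t iv, void *ctx);

/*
 * Walk a buffer as consecutive data units and invoke fn once per unit.
 * Returns the number of units processed, or -1 if the length is not a
 * whole number of data units or a per-unit call fails.
 */
int split_into_data_units(uint8_t *buf, size_t len, size_t data_unit_size,
			  uint64_t first_iv, unit_fn fn, void *ctx)
{
	if (data_unit_size == 0 || len % data_unit_size != 0)
		return -1;

	size_t nr = len / data_unit_size;

	for (size_t i = 0; i < nr; i++) {
		/* Assumption: the IV advances by one per data unit. */
		if (fn(buf + i * data_unit_size, data_unit_size,
		       first_iv + i, ctx))
			return -1;
	}
	return (int)nr;
}

/* Toy callback: XOR each byte with the low byte of the IV, count calls. */
int toy_xor_unit(uint8_t *data, size_t len, uint64_t iv, void *ctx)
{
	for (size_t i = 0; i < len; i++)
		data[i] ^= (uint8_t)iv;
	(*(int *)ctx)++;
	return 0;
}
```

A 4096-byte buffer with a data unit size of 512 results in 8 callback invocations, matching the 4 KiB bio_vec example above.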

So it stands on no-regression alone, with no software throughput claim.
What it adds is the interface a native one-pass driver needs. I'd land
that now and bring a native offload user + numbers as the follow-up,
rather than block the interface on the driver.

Acceptable? If so I'll respin v5 as the minimal version.

Thanks,
Leonid

^ permalink raw reply

* Re: [PATCH v17 06/10] rust: rename `AlwaysRefCounted` to `RefCounted`.
From: Andreas Hindborg @ 2026-06-24 19:17 UTC (permalink / raw)
  To: Onur Özkan
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core, Oliver Mangold, Viresh Kumar
In-Reply-To: <20260623175814.87191-1-work@onurozkan.dev>

Onur Özkan <work@onurozkan.dev> writes:

> On Thu, 04 Jun 2026 22:11:18 +0200
> Andreas Hindborg <a.hindborg@kernel.org> wrote:
>
>> From: Oliver Mangold <oliver.mangold@pm.me>
>> 
>> There are types where it may both be reference counted in some cases and
>> owned in others. In such cases, obtaining `ARef<T>` from `&T` would be
>> unsound as it allows creation of `ARef<T>` copy from `&Owned<T>`.
>> 
>> Therefore, we split `AlwaysRefCounted` into `RefCounted` (which `ARef<T>`
>> would require) and a marker trait to indicate that the type is always
>> reference counted (and not `Ownable`) so the `&T` -> `ARef<T>` conversion
>> is possible.
>> 
>> - Rename `AlwaysRefCounted` to `RefCounted`.
>> - Add a new unsafe trait `AlwaysRefCounted`.
>> - Implement the new trait `AlwaysRefCounted` for the newly renamed
>>   `RefCounted` implementations. This leaves functionality of existing
>>   implementers of `AlwaysRefCounted` intact.
>> 
>> Suggested-by: Alice Ryhl <aliceryhl@google.com>
>> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
>> Signed-off-by: Oliver Mangold <oliver.mangold@pm.me>
>> [ Andreas: Updated commit message and rebase on rust-6.20-7.0 ]
>> Acked-by: Igor Korotin <igor.korotin.linux@gmail.com>
>> Acked-by: Danilo Krummrich <dakr@kernel.org>
>> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
>> Reviewed-by: Gary Guo <gary@garyguo.net>
>> Co-developed-by: Andreas Hindborg <a.hindborg@kernel.org>
>> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
>> ---
>>  rust/kernel/auxiliary.rs        |  7 +++++-
>>  rust/kernel/block/mq/request.rs | 15 ++++++++-----
>>  rust/kernel/cred.rs             | 13 +++++++++--
>>  rust/kernel/device.rs           | 12 ++++++++--
>>  rust/kernel/device/property.rs  | 11 +++++++--
>>  rust/kernel/drm/device.rs       |  9 ++++++--
>>  rust/kernel/drm/gem/mod.rs      | 16 ++++++++++----
>>  rust/kernel/fs/file.rs          | 16 ++++++++++----
>>  rust/kernel/i2c.rs              | 13 ++++++++---
>>  rust/kernel/mm.rs               | 15 +++++++++----
>>  rust/kernel/mm/mmput_async.rs   |  9 ++++++--
>>  rust/kernel/opp.rs              | 10 ++++++---
>>  rust/kernel/owned.rs            |  2 +-
>>  rust/kernel/pci.rs              | 10 ++++++++-
>>  rust/kernel/pid_namespace.rs    | 12 ++++++++--
>>  rust/kernel/platform.rs         |  7 +++++-
>>  rust/kernel/sync/aref.rs        | 49 ++++++++++++++++++++++++++---------------
>>  rust/kernel/task.rs             | 13 +++++++++--
>>  rust/kernel/types.rs            |  3 ++-
>>  rust/kernel/usb.rs              | 17 +++++++++++---
>>  20 files changed, 195 insertions(+), 64 deletions(-)
>> 
>> diff --git a/rust/kernel/auxiliary.rs b/rust/kernel/auxiliary.rs
>> index 93c0db1f6655..49f07740f657 100644
>> --- a/rust/kernel/auxiliary.rs
>> +++ b/rust/kernel/auxiliary.rs
>> @@ -19,6 +19,7 @@
>>          to_result, //
>>      },
>>      prelude::*,
>> +    sync::aref::{AlwaysRefCounted, RefCounted},
>
> This patch has multiple horizontal use statements around.

Thanks, I'll take another pass to fix that.


Best regards,
Andreas Hindborg




^ permalink raw reply

* Re: [PATCH v17 08/10] rust: aref: update formatting of use statements
From: Andreas Hindborg @ 2026-06-24 19:16 UTC (permalink / raw)
  To: Onur Özkan
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core
In-Reply-To: <20260623175531.85421-1-work@onurozkan.dev>

Onur Özkan <work@onurozkan.dev> writes:

> On Thu, 04 Jun 2026 22:11:20 +0200
> Andreas Hindborg <a.hindborg@kernel.org> wrote:
>
>> Update formatting if use statements in preparation for next commit.
>
> I guess you meant "formatting use statements"? Also, why not do this in
> the next commit directly?

Because it is an unrelated change.


Best regards,
Andreas Hindborg




^ permalink raw reply

* Re: [PATCH] tools/cgroup: iocost_monitor: parse help before importing drgn
From: Tejun Heo @ 2026-06-24 18:59 UTC (permalink / raw)
  To: Yousef Alhouseen; +Cc: josef, axboe, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624123652.8108-1-alhouseenyousef@gmail.com>

On Wed, Jun 24, 2026 at 02:36:52PM +0200, Yousef Alhouseen wrote:
> iocost_monitor.py imports drgn before argparse can handle "-h" or report
> argument errors. That makes basic usage help fail on systems where drgn is
> not installed.
> 
> Parse arguments before importing drgn so the help and argument-error paths
> work without the runtime debugging dependency. Normal execution still
> imports drgn before reading kernel state.
> 
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>

Applied to cgroup/for-7.3.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v3 3/6] xfs: implement write-stream management support
From: Darrick J. Wong @ 2026-06-24 18:11 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: brauner, hch, dgc, jack, cem, axboe, kbusch, ritesh.list,
	linux-xfs, linux-fsdevel, linux-block, gost.dev
In-Reply-To: <20260616180555.33338-4-joshi.k@samsung.com>

On Tue, Jun 16, 2026 at 11:35:52PM +0530, Kanchan Joshi wrote:
> Implement support for FS_IOC_WRITE_STREAM ioctl.
> 
> For FS_WRITE_STREAM_OP_GET_MAX, available write streams are reported
> based on the capability of the underlying block device.
> For FS_WRITE_STREAM_OP_{SET/GET}, add a new i_write_stream field in xfs
> inode. This value is propagated to the iomap during block mapping.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  fs/xfs/xfs_icache.c |  1 +
>  fs/xfs/xfs_inode.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h  |  6 ++++++
>  fs/xfs/xfs_ioctl.c  | 38 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iomap.c  |  1 +
>  5 files changed, 92 insertions(+)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 2040a9292ee6..d5f880f5b810 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -130,6 +130,7 @@ xfs_inode_alloc(
>  	spin_lock_init(&ip->i_ioend_lock);
>  	ip->i_next_unlinked = NULLAGINO;
>  	ip->i_prev_unlinked = 0;
> +	ip->i_write_stream = 0;
>  
>  	return ip;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index beaa26ec62da..2e7c61d71b48 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -47,6 +47,52 @@
>  
>  struct kmem_cache *xfs_inode_cache;
>  
> +int
> +xfs_inode_max_write_streams(
> +	struct xfs_inode	*ip)
> +{
> +	struct block_device	*bdev;
> +
> +	bdev = xfs_inode_buftarg(ip)->bt_bdev;
> +	if (!bdev)
> +		return 0;
> +
> +	return bdev_max_write_streams(bdev);
> +}
> +
> +uint16_t
> +xfs_inode_get_write_stream(
> +	struct xfs_inode	*ip)
> +{
> +	uint16_t	stream_id;
> +
> +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> +	stream_id = ip->i_write_stream;
> +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +
> +	return stream_id;
> +}
> +
> +int
> +xfs_inode_set_write_stream(
> +	struct xfs_inode	*ip,
> +	uint16_t		stream_id)
> +{
> +	int ret = 0;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +	if (stream_id > xfs_inode_max_write_streams(ip)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	ip->i_write_stream =  stream_id;
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return ret;
> +}
> +
>  /*
>   * These two are wrapper routines around the xfs_ilock() routine used to
>   * centralize some grungy code.  They are used in places that wish to lock the
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bd6d33557194..768c4195306c 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -38,6 +38,9 @@ typedef struct xfs_inode {
>  	struct xfs_ifork	i_df;		/* data fork */
>  	struct xfs_ifork	i_af;		/* attribute fork */
>  
> +	/* Write stream information */
> +	uint16_t		i_write_stream;
> +
>  	/* Transaction and locking information. */
>  	struct xfs_inode_log_item *i_itemp;	/* logging information */
>  	struct rw_semaphore	i_lock;		/* inode lock */
> @@ -676,4 +679,7 @@ int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
>  		struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
>  		struct xfs_dquot **pdqpp);
>  
> +int xfs_inode_max_write_streams(struct xfs_inode *ip);
> +uint16_t xfs_inode_get_write_stream(struct xfs_inode *ip);
> +int xfs_inode_set_write_stream(struct xfs_inode *ip, uint16_t stream_id);
>  #endif	/* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 46e234863644..3f82a4884b81 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1179,6 +1179,42 @@ xfs_ioctl_fs_counts(
>  	return 0;
>  }
>  
> +static int
> +xfs_ioc_write_stream(
> +	struct file		*filp,
> +	void __user		*arg)
> +{
> +	struct inode		*inode = file_inode(filp);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct fs_write_stream	ws = { };
> +
> +	if (copy_from_user(&ws, arg, sizeof(ws)))
> +		return -EFAULT;
> +	if (ws.rsvd != 0)
> +		return -EINVAL;
> +
> +	switch (ws.op_flags) {
> +	case FS_WRITE_STREAM_OP_GET_MAX:
> +		ws.max_streams = xfs_inode_max_write_streams(ip);

Shouldn't you hold ILOCK when you look at the REALTIME bit?

--D

> +		goto copy_out;
> +	case FS_WRITE_STREAM_OP_GET:
> +		ws.stream_id = xfs_inode_get_write_stream(ip);
> +		goto copy_out;
> +	case FS_WRITE_STREAM_OP_SET:
> +		if (!(filp->f_mode & FMODE_WRITE))
> +			return -EBADF;
> +		return xfs_inode_set_write_stream(ip, ws.stream_id);
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +
> +copy_out:
> +	if (copy_to_user(arg, &ws, sizeof(ws)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
>  /*
>   * These long-unused ioctls were removed from the official ioctl API in 5.17,
>   * but retain these definitions so that we can log warnings about them.
> @@ -1444,6 +1480,8 @@ xfs_file_ioctl(
>  		return xfs_ioc_health_monitor(filp, arg);
>  	case XFS_IOC_VERIFY_MEDIA:
>  		return xfs_ioc_verify_media(filp, arg);
> +	case FS_IOC_WRITE_STREAM:
> +		return xfs_ioc_write_stream(filp, arg);
>  
>  	default:
>  		return -ENOTTY;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f20a02f49ed9..ccbf7dcf1ad5 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -144,6 +144,7 @@ xfs_bmbt_to_iomap(
>  	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>  	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>  	iomap->flags = iomap_flags;
> +	iomap->write_stream = ip->i_write_stream;
>  	if (mapping_flags & IOMAP_DAX) {
>  		iomap->dax_dev = target->bt_daxdev;
>  	} else {
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v3 2/6] iomap: introduce and propagate write_stream
From: Darrick J. Wong @ 2026-06-24 18:10 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: brauner, hch, dgc, jack, cem, axboe, kbusch, ritesh.list,
	linux-xfs, linux-fsdevel, linux-block, gost.dev
In-Reply-To: <20260616180555.33338-3-joshi.k@samsung.com>

On Tue, Jun 16, 2026 at 11:35:51PM +0530, Kanchan Joshi wrote:
> Add a new write_stream field to struct iomap. Existing hole is used to
> place the new field.
> Propagate write_stream from iomap to bio in both direct I/O and buffered
> writeback paths.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  fs/iomap/direct-io.c  | 1 +
>  fs/iomap/ioend.c      | 3 +++
>  include/linux/iomap.h | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b36ee619cdcd..455fd5d97d25 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -348,6 +348,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
>  	fscrypt_set_bio_crypt_ctx(bio, iter->inode, pos, GFP_KERNEL);
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_write_hint = iter->inode->i_write_hint;
> +	bio->bi_write_stream = iter->iomap.write_stream;
>  	bio->bi_ioprio = dio->iocb->ki_ioprio;
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> index acf3cf98b23a..56ed5ba6a421 100644
> --- a/fs/iomap/ioend.c
> +++ b/fs/iomap/ioend.c
> @@ -164,6 +164,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  			       GFP_NOFS, &iomap_ioend_bioset);
>  	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
>  	bio->bi_write_hint = wpc->inode->i_write_hint;
> +	bio->bi_write_stream = wpc->iomap.write_stream;
>  	wbc_init_bio(wpc->wbc, bio);
>  	wpc->nr_folios = 0;
>  	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
> @@ -187,6 +188,8 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
>  	if (!(wpc->iomap.flags & IOMAP_F_ANON_WRITE) &&
>  	    iomap_sector(&wpc->iomap, pos) != bio_end_sector(&ioend->io_bio))
>  		return false;
> +	if (wpc->iomap.write_stream != ioend->io_bio.bi_write_stream)
> +		return false;
>  	/*
>  	 * Limit ioend bio chain lengths to minimise IO completion latency. This
>  	 * also prevents long tight loops ending page writeback on all the
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 2c5685adf3a9..44583429ffa4 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -120,6 +120,8 @@ struct iomap {
>  	u64			length;	/* length of mapping, bytes */
>  	u16			type;	/* type of mapping */
>  	u16			flags;	/* flags for mapping */
> +	u8			write_stream; /* write stream for I/O */

I'm mildly confused by the types here -- the ioctl exposes a u32, iomap
has a u8, and xfs seems to use u16.  I gather you want maximum
flexibility in the uapi and that's the reason for the u32, but can the
internal interfaces be made consistent?

I also wonder what happens if the write stream ever becomes persistent,
but this patchset doesn't go there, and maybe the programming model is
simply that you have to set it every time you open the file?
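
To illustrate the width mismatch raised above, here is a small sketch of what implicit narrowing does to a stream ID as it moves from a u32 (uapi) through a u16 (inode) to a u8 (iomap). The function name stream_id_as_seen_by_iomap() is hypothetical, and the explicit casts only make visible the conversions that C would otherwise perform silently.

```c
#include <stdint.h>

/*
 * Follow a stream ID through the three widths used in the series:
 * u32 in the ioctl struct, u16 in the inode, u8 in struct iomap.
 * Each narrowing keeps only the low-order bits.
 */
uint8_t stream_id_as_seen_by_iomap(uint32_t uapi_id)
{
	uint16_t inode_id = (uint16_t)uapi_id;	/* keeps the low 16 bits */
	uint8_t iomap_id = (uint8_t)inode_id;	/* keeps the low 8 bits */

	return iomap_id;
}
```

Under these conversions a stream ID of 300 would reach the bio as 44, and 256 would become 0, which the series defines as "no stream". Whether such values can occur in practice depends on how many streams the device reports and on where the range check happens.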

--D

> +	/* 3 bytes padding hole here */
>  	struct block_device	*bdev;	/* block device for I/O */
>  	struct dax_device	*dax_dev; /* dax_dev for dax operations */
>  	void			*inline_data;
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v3 1/6] fs: add generic write-stream management ioctl
From: Darrick J. Wong @ 2026-06-24 18:03 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: brauner, hch, dgc, jack, cem, axboe, kbusch, ritesh.list,
	linux-xfs, linux-fsdevel, linux-block, gost.dev
In-Reply-To: <20260616180555.33338-2-joshi.k@samsung.com>

On Tue, Jun 16, 2026 at 11:35:50PM +0530, Kanchan Joshi wrote:
> Wire up the userspace interface for write stream management via a new
> vfs ioctl 'FS_IOC_WRITE_STREAM'.
> Application communicates the intended operation using the 'op_flags'
> field of the passed 'struct fs_write_stream'.
> Valid flags are:
> FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
> FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
> FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
> 
> Application should query the available streams by using
> FS_WRITE_STREAM_OP_GET_MAX first.
> If returned value is N, valid stream values for the file are 0 to N.
> Stream value 0 implies that no stream is set on the file.

You might want to make that an explicit #define then.

> Setting a larger value than available streams is rejected.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  include/uapi/linux/fs.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 13f71202845e..9e87271e610b 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -338,6 +338,20 @@ struct file_attr {
>  /* Get logical block metadata capability details */
>  #define FS_IOC_GETLBMD_CAP		_IOWR(0x15, 2, struct logical_block_metadata_cap)
>  
> +struct fs_write_stream {
> +	__u32		op_flags;	/* IN: operation flags */
> +	union {
> +		__u32		stream_id;	/* IN/OUT:  stream value to assign/guery */

"query"?

--D

> +		__u32		max_streams;	/* OUT: max streams values supported */
> +	};
> +	__u64		rsvd;
> +};
> +
> +#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
> +#define FS_WRITE_STREAM_OP_GET			(1 << 1)
> +#define FS_WRITE_STREAM_OP_SET			(1 << 2)
> +
> +#define FS_IOC_WRITE_STREAM		_IOWR('f', 135, struct fs_write_stream)
>  /*
>   * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
>   *
> -- 
> 2.25.1
> 
> 
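
The semantics described in the quoted commit message (query the maximum, then set or get a value in the range 0 to N) can be modelled in userspace as follows. The struct layout and flag values are copied from the quoted patch; everything else is an assumption made for illustration: struct model_file, model_write_stream(), the FS_WRITE_STREAM_NONE name for the explicit #define suggested in review, and -EINVAL as the rejection code, which the patch description does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Layout and flag values as in the quoted patch. */
struct fs_write_stream {
	uint32_t	op_flags;	/* IN: operation flags */
	union {
		uint32_t	stream_id;	/* IN/OUT: stream value */
		uint32_t	max_streams;	/* OUT: max stream value */
	};
	uint64_t	rsvd;
};

static_assert(sizeof(struct fs_write_stream) == 16, "no implicit padding");

#define FS_WRITE_STREAM_OP_GET_MAX	(1 << 0)
#define FS_WRITE_STREAM_OP_GET		(1 << 1)
#define FS_WRITE_STREAM_OP_SET		(1 << 2)

/* Hypothetical name for "no stream set", as suggested in the review. */
#define FS_WRITE_STREAM_NONE		0

/* Hypothetical per-file state standing in for the filesystem. */
struct model_file {
	uint32_t max_streams;
	uint32_t stream_id;
};

/* Userspace model of the three operations; not the kernel implementation. */
static int model_write_stream(struct model_file *f, struct fs_write_stream *ws)
{
	switch (ws->op_flags) {
	case FS_WRITE_STREAM_OP_GET_MAX:
		ws->max_streams = f->max_streams;
		return 0;
	case FS_WRITE_STREAM_OP_GET:
		ws->stream_id = f->stream_id;
		return 0;
	case FS_WRITE_STREAM_OP_SET:
		/* Valid values are 0..N; anything larger is rejected. */
		if (ws->stream_id > f->max_streams)
			return -EINVAL;	/* assumed error code */
		f->stream_id = ws->stream_id;
		return 0;
	default:
		return -EINVAL;
	}
}
```

A caller following the description would issue GET_MAX first and then SET with a value no larger than the returned maximum.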

^ permalink raw reply

* Re: [PATCH RFC v2 17/18] fs: look up the superblock via the device table in user_get_super()
From: Darrick J. Wong @ 2026-06-24 17:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-17-7df6b864028e@kernel.org>

On Tue, Jun 16, 2026 at 04:08:33PM +0200, Christian Brauner wrote:
> user_get_super() still finds the superblock for a device number by
> walking the global super_blocks list under sb_lock. Every superblock is
> registered in the device table under its s_dev since sget_fc() inserts
> it there, including superblocks on anonymous devices, so use the table
> instead.
> 
> The refcount-pinning cursor helpers super_dev_{get,first,next}() only
> touch table state and do not depend on CONFIG_BLOCK, so drop the
> CONFIG_BLOCK guard around them: their new caller serves anonymous
> devices as well (ustat() on e.g. tmpfs) and is built without
> CONFIG_BLOCK. The guard falls in this patch rather than separately
> since without this caller the helpers would be unused without
> CONFIG_BLOCK.
> 
> The pinned entry holds a passive reference on the superblock so
> super_lock() can be called directly; once the superblock is locked grab
> a passive reference for the caller before dropping the pin.
> 
> The device table contains more than the old walk could find: a
> superblock is also registered for every additional device it claims
> (the xfs log and realtime devices, btrfs member devices, the ext4
> external journal, erofs blob devices). Don't filter those out:
> specifying any device a filesystem uses now resolves to that
> filesystem, so ustat() and quotactl() work on e.g. the xfs log device
> or a btrfs member device (the latter used to fail outright as btrfs
> superblocks carry an anonymous s_dev that never matches a member
> device). When several superblocks share a device (erofs blob devices)
> the first live superblock wins.

Does erofs have a means to find the other superblocks that share a
device given a notification coming in on one of them?  As hch says, it
feels weird to have a lookup mechanism when there's also an upcall
mechanism.

<shrug> I've been on vacation for a while so maybe I missed that there's
another use for the bdev->sb lookup?  There are 1600 more emails for me
to go through... :P

--D

> 
> The cursor also keeps scanning past dying superblocks where the old
> walk gave up after the first s_dev match, so a mount racing with the
> unmount of the same device (or with the reuse of a recycled anonymous
> dev_t) finds the live superblock where the old walk could spuriously
> return NULL.
> 
> This removes the last s_dev-keyed walk of the super_blocks list and
> takes ustat() and quotactl()'s block device lookup off sb_lock
> entirely.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/super.c | 28 ++++++++--------------------
>  1 file changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2d0a07861bfc..93f24aea75c4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -501,7 +501,6 @@ static int super_dev_register(struct super_block *sb)
>  	return err;
>  }
>  
> -#ifdef CONFIG_BLOCK
>  static struct super_dev *super_dev_get(struct rhlist_head *pos)
>  {
>  	struct super_dev *sb_dev;
> @@ -535,7 +534,6 @@ static struct super_dev *super_dev_next(struct super_dev *prev)
>  	super_dev_put(prev);
>  	return sb_dev;
>  }
> -#endif
>  
>  static void kill_super_notify(struct super_block *sb)
>  {
> @@ -1044,29 +1042,19 @@ EXPORT_SYMBOL(iterate_supers_type);
>  
>  struct super_block *user_get_super(dev_t dev, bool excl)
>  {
> -	struct super_block *sb;
> -
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &super_blocks, s_list) {
> -		bool locked;
> +	struct super_dev *sb_dev;
>  
> -		if (sb->s_dev != dev)
> -			continue;
> +	for (sb_dev = super_dev_first(dev); sb_dev; sb_dev = super_dev_next(sb_dev)) {
> +		struct super_block *sb = sb_dev->sd_sb;
>  
> -		if (!refcount_inc_not_zero(&sb->s_passive))
> +		if (!super_lock(sb, excl))
>  			continue;
>  
> -		spin_unlock(&sb_lock);
> -
> -		locked = super_lock(sb, excl);
> -		if (locked)
> -			return sb;
> -
> -		put_super(sb);
> -		spin_lock(&sb_lock);
> -		break;
> +		/* The pinned entry holds a passive reference, take our own. */
> +		refcount_inc(&sb->s_passive);
> +		super_dev_put(sb_dev);
> +		return sb;
>  	}
> -	spin_unlock(&sb_lock);
>  	return NULL;
>  }
>  
> 
> -- 
> 2.47.3
> 
> 
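
The behavioural difference described in the quoted commit message (the old walk gave up after the first s_dev match, the cursor keeps scanning past dying superblocks) can be sketched with a flat array. This is a simplified model and not the kernel code: struct model_sb and both lookup functions are hypothetical, and the "dying" flag stands in for super_lock() failing on an entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock registered under a device number. */
struct model_sb {
	unsigned int dev;
	bool dying;	/* models super_lock() failing on this entry */
};

/* Old behaviour: the first device match decides, even if it is dying. */
static const struct model_sb *lookup_old(const struct model_sb *sbs, size_t n,
					 unsigned int dev)
{
	for (size_t i = 0; i < n; i++) {
		if (sbs[i].dev != dev)
			continue;
		return sbs[i].dying ? NULL : &sbs[i];
	}
	return NULL;
}

/* New behaviour: skip dying matches and return the first live one. */
static const struct model_sb *lookup_new(const struct model_sb *sbs, size_t n,
					 unsigned int dev)
{
	for (size_t i = 0; i < n; i++) {
		if (sbs[i].dev == dev && !sbs[i].dying)
			return &sbs[i];
	}
	return NULL;
}

/* A mount racing with the unmount of the same device. */
static void lookup_example(void)
{
	const struct model_sb sbs[] = {
		{ .dev = 8, .dying = true },	/* being unmounted */
		{ .dev = 8, .dying = false },	/* freshly mounted */
	};

	assert(lookup_old(sbs, 2, 8) == NULL);		/* spurious miss */
	assert(lookup_new(sbs, 2, 8) == &sbs[1]);	/* finds live one */
}
```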

^ permalink raw reply

* [PATCH] iomap: Remove FGP_NOFS from iomap_get_folio()
From: Matthew Wilcox (Oracle) @ 2026-06-24 17:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Matthew Wilcox (Oracle), Darrick J. Wong, Jens Axboe, Namjae Jeon,
	Sungjong Seo, Yuezhang Mo, Miklos Szeredi, Andreas Gruenbacher,
	Hyunchul Lee, Konstantin Komarov, Carlos Maiolino, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	linux-block, fuse-devel, gfs2, ntfs3

FGP_NOFS is legacy; filesystems should be using memalloc_nofs_save/restore
instead.  We have it here in iomap because it was buried in
grab_cache_page_write_begin() and we didn't want to change this behaviour
as part of the folio transition.

I have tested this with XFS and see no issues.  Other filesystems (cc'd)
may need to make adjustments.  Please test with lockdep enabled.

Cc: "Darrick J. Wong" <djwong@kernel.org> (iomap)
Cc: Jens Axboe <axboe@kernel.dk> (block)
Cc: Namjae Jeon <linkinjeon@kernel.org> (exfat, ntfs)
Cc: Sungjong Seo <sj1557.seo@samsung.com> (exfat)
Cc: Yuezhang Mo <yuezhang.mo@sony.com> (exfat)
Cc: Miklos Szeredi <miklos@szeredi.hu> (fuse)
Cc: Andreas Gruenbacher <agruenba@redhat.com> (gfs2)
Cc: Hyunchul Lee <hyc.lee@gmail.com> (ntfs)
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> (ntfs3)
Cc: Carlos Maiolino <cem@kernel.org> (xfs)
Cc: Damien Le Moal <dlemoal@kernel.org> (zonefs)
Cc: Naohiro Aota <naohiro.aota@wdc.com> (zonefs)
Cc: Johannes Thumshirn <jth@kernel.org> (zonefs)
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: fuse-devel@lists.linux.dev
Cc: gfs2@lists.linux.dev
Cc: ntfs3@lists.linux.dev
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8d4806dc46d4..27bc2455a98d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -768,7 +768,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
-	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	fgf_t fgp = FGP_WRITEBEGIN;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
-- 
2.47.3
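
For readers unfamiliar with the scoped approach named in the commit message above, the following userspace sketch models the save/restore pattern. It is an illustration only: MODEL_NOFS, model_task_flags and the model_* functions are hypothetical stand-ins, and the real memalloc_nofs_save()/memalloc_nofs_restore() operate on the current task in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NOFS	0x1u	/* hypothetical stand-in for the task flag */

static unsigned int model_task_flags;	/* stands in for per-task state */

/* Enter a NOFS scope and return the previous state of the flag. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_NOFS;

	model_task_flags |= MODEL_NOFS;
	return old;
}

/* Leave the scope by putting back exactly the state that was saved. */
static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_NOFS) | old;
}

/* An allocation may recurse into the filesystem only outside any scope. */
static bool model_alloc_may_enter_fs(void)
{
	return !(model_task_flags & MODEL_NOFS);
}

/* Scopes nest: the inner restore must not end the outer scope. */
static void model_nofs_example(void)
{
	unsigned int outer, inner;

	assert(model_alloc_may_enter_fs());
	outer = model_nofs_save();
	inner = model_nofs_save();
	model_nofs_restore(inner);
	assert(!model_alloc_may_enter_fs());	/* still inside outer scope */
	model_nofs_restore(outer);
	assert(model_alloc_may_enter_fs());
}
```

The point of the scoped form is that the constraint follows the calling context, so a helper such as iomap_get_folio() no longer has to pass a per-call flag on behalf of every filesystem.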


^ permalink raw reply related

* [PATCH v3 3/5] loop: set dma_alignment from the backing file for direct I/O
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260624170905.3972095-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Direct I/O user pages are forwarded to the backing file unchanged, so
the backing's DMA alignment requirement applies to them. Track the
backing file's dio_mem_align and advertise it as the loop device's
dma_alignment if it is larger than the default so we advertise proper
limits and misaligned I/O is rejected early instead of being dispatched
to the backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/loop.c | 46 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 310de0463beb1..5fe61d542f8b7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -54,6 +54,7 @@ struct loop_device {
 
 	struct file	*lo_backing_file;
 	unsigned int	lo_min_dio_size;
+	unsigned int	lo_dio_mem_align;
 	struct block_device *lo_device;
 
 	gfp_t		old_gfp_mask;
@@ -447,26 +448,37 @@ static void loop_reread_partitions(struct loop_device *lo)
 			__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
-static unsigned int loop_query_min_dio_size(struct loop_device *lo)
+static void loop_update_dio_alignment(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
 	struct block_device *sb_bdev = file->f_mapping->host->i_sb->s_bdev;
 	struct kstat st;
 
 	/*
-	 * Use the minimal dio alignment of the file system if provided.
+	 * Use the dio alignment of the file system if provided.  The incoming
+	 * request's bio_vec is forwarded to the backing file unchanged, so its
+	 * required memory alignment becomes the device's dma_alignment when
+	 * used for direct-io.
 	 */
 	if (!vfs_getattr(&file->f_path, &st, STATX_DIOALIGN, 0) &&
-	    (st.result_mask & STATX_DIOALIGN))
-		return st.dio_offset_align;
+	    (st.result_mask & STATX_DIOALIGN)) {
+		lo->lo_min_dio_size = st.dio_offset_align;
+		lo->lo_dio_mem_align = st.dio_mem_align - 1;
+		return;
+	}
 
 	/*
 	 * In a perfect world this wouldn't be needed, but as of Linux 6.13 only
 	 * a handful of file systems support the STATX_DIOALIGN flag.
 	 */
-	if (sb_bdev)
-		return bdev_logical_block_size(sb_bdev);
-	return SECTOR_SIZE;
+	if (sb_bdev) {
+		lo->lo_min_dio_size = bdev_logical_block_size(sb_bdev);
+		lo->lo_dio_mem_align = bdev_dma_alignment(sb_bdev);
+		return;
+	}
+
+	lo->lo_min_dio_size = SECTOR_SIZE;
+	lo->lo_dio_mem_align = SECTOR_SIZE - 1;
 }
 
 static inline int is_loop_device(struct file *file)
@@ -509,7 +521,7 @@ static void loop_assign_backing_file(struct loop_device *lo, struct file *file)
 			lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS));
 	if (lo->lo_backing_file->f_flags & O_DIRECT)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	lo->lo_min_dio_size = loop_query_min_dio_size(lo);
+	loop_update_dio_alignment(lo);
 }
 
 static int loop_check_backing_file(struct file *file)
@@ -940,6 +952,19 @@ static unsigned int loop_default_blocksize(struct loop_device *lo)
 	return SECTOR_SIZE;
 }
 
+static void loop_set_dma_limit(struct loop_device *lo, struct queue_limits *lim)
+{
+	/*
+	 * Direct I/O forwards the user pages to the backing file unchanged, so
+	 * track the backing's DMA alignment requirement as the mode is toggled.
+	 */
+	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
+		lim->dma_alignment = max_t(unsigned int, lo->lo_dio_mem_align,
+					   SECTOR_SIZE - 1);
+	else
+		lim->dma_alignment = SECTOR_SIZE - 1;
+}
+
 static void loop_update_limits(struct loop_device *lo, struct queue_limits *lim,
 		unsigned int bsize)
 {
@@ -961,6 +986,7 @@ static void loop_update_limits(struct loop_device *lo, struct queue_limits *lim,
 	lim->logical_block_size = bsize;
 	lim->physical_block_size = bsize;
 	lim->io_min = bsize;
+	loop_set_dma_limit(lo, lim);
 	lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL);
 	if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY))
 		lim->features |= BLK_FEAT_WRITE_CACHE;
@@ -1416,6 +1442,7 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 {
 	bool use_dio = !!arg;
 	unsigned int memflags;
+	struct queue_limits lim;
 
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
@@ -1434,6 +1461,9 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+	lim = queue_limits_start_update(lo->lo_queue);
+	loop_set_dma_limit(lo, &lim);
+	queue_limits_commit_update(lo->lo_queue, &lim);
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	return 0;
 }
-- 
2.53.0-Meta
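
The limit selection in the patch above can be restated as a pure function. This is a userspace sketch under stated assumptions, not the driver code: struct model_loop, align_to_mask() and model_dma_alignment() are hypothetical names, and the stored value is a mask (alignment minus one), matching how the patch stores st.dio_mem_align - 1.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_SIZE	512u

/* Hypothetical subset of the loop device state used by the patch. */
struct model_loop {
	bool direct_io;			/* models LO_FLAGS_DIRECT_IO */
	unsigned int dio_mem_align;	/* mask, i.e. alignment - 1 */
};

/* Convert a power-of-two alignment in bytes to the mask form. */
static unsigned int align_to_mask(unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return align - 1;
}

/*
 * Direct I/O only ever tightens the limit beyond the default; buffered
 * I/O always falls back to the default mask.
 */
static unsigned int model_dma_alignment(const struct model_loop *lo)
{
	const unsigned int def = SECTOR_SIZE - 1;

	if (!lo->direct_io)
		return def;
	return lo->dio_mem_align > def ? lo->dio_mem_align : def;
}
```

Because the result depends on the direct I/O flag, it has to be recomputed whenever the mode is toggled, which is why the patch also updates the queue limits in loop_set_dio().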


^ permalink raw reply related

* [PATCH v3 1/5] block: use blkdev_iov_iter_get_pages status for errors
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260624170905.3972095-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

blkdev_iov_iter_get_pages() can return various error values, including
EIO, EFAULT, and ENOMEM. Set the actual reported status so user space
can know a little more on why an operation failed.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 15783a6180dec..0827bb884d473 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -218,7 +218,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 		ret = blkdev_iov_iter_get_pages(bio, iter, bdev);
 		if (unlikely(ret)) {
-			bio_endio_status(bio, BLK_STS_IOERR);
+			bio_endio_status(bio, errno_to_blk_status(ret));
 			break;
 		}
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 2/5] block: fix dio leak on metadata mapping error
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260624170905.3972095-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

A failed integrity mapping holds a dio reference, so we need to go
through the full bio ending in case there were previously submitted
bios in the sequence.

Fixes: 2729a60bbfb92 ("block: don't silently ignore metadata for sync read/write")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 0827bb884d473..0098a90a956e1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -238,8 +238,10 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		if (iocb->ki_flags & IOCB_HAS_METADATA) {
 			ret = bio_integrity_map_iter(bio, iocb->private);
-			if (unlikely(ret))
-				goto fail;
+			if (unlikely(ret)) {
+				bio_endio_status(bio, errno_to_blk_status(ret));
+				break;
+			}
 		}
 
 		if (is_read) {
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 5/5] block: validate user space vectors during extraction
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch, stable
In-Reply-To: <20260624170905.3972095-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

The bio-based drivers don't necessarily check the alignment split, and
stacking block drivers don't always handle a misalignment detected after
submitting the bio. Validate user vectors against the device's
dma_alignment as the bio is built from the iov_iter, rejecting
misaligned vectors early with -EINVAL.

Cc: stable@vger.kernel.org
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c          | 56 +++++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c      |  2 +-
 block/fops.c         |  2 +-
 fs/iomap/direct-io.c |  1 +
 include/linux/bio.h  |  2 +-
 include/linux/uio.h  | 10 +++++++-
 lib/iov_iter.c       |  9 ++++++-
 7 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a9672..faad41a72ac77 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1220,10 +1220,45 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_KERNEL
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	/*
+	 * Correct callers never break the alignment requirements, so this
+	 * exhaustive check is only paid for in debug builds.
+	 */
+	for_each_mp_bvec(bv, bio->bi_io_vec, iter, bio->bi_iter)
+		if ((bv.bv_offset | bv.bv_len) & mem_align_mask)
+			return false;
+	return true;
+}
+#else
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	/*
+	 * We forward the bio_vec as-is, so ITER_BVEC callers must provide
+	 * segments already aligned to the device's DMA alignment. The only
+	 * unchecked user-controllable offset that reaches here is an io_uring
+	 * registered buffer where just the first segment can be unaligned
+	 * (the rest is virtually contiguous), so checking only that one is
+	 * sufficient to know if the entire vector is valid.
+	 */
+	return !(mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
+							mem_align_mask);
+}
+#endif
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @mem_align_mask: the mask the source address and length must be aligned to,
+ *	0 for no requirement
  * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
@@ -1242,7 +1277,7 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
  * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-			   unsigned len_align_mask)
+			   unsigned mem_align_mask, unsigned len_align_mask)
 {
 	iov_iter_extraction_t flags = 0;
 
@@ -1251,6 +1286,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
+
+		if (!bio_iov_bvec_aligned(bio, mem_align_mask))
+			return -EINVAL;
+
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
 	}
@@ -1265,8 +1304,19 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
 				BIO_MAX_SIZE - bio->bi_iter.bi_size,
-				&bio->bi_vcnt, bio->bi_max_vecs, flags);
+				&bio->bi_vcnt, bio->bi_max_vecs,
+				mem_align_mask, flags);
 		if (ret <= 0) {
+			/*
+			 * A misaligned vector fails the whole I/O.  Release any
+			 * pages pinned by earlier iterations before returning
+			 * since this bio won't be submitted to release them.
+			 */
+			if (ret == -EINVAL) {
+				bio_release_pages(bio, false);
+				bio_clear_flag(bio, BIO_PAGE_PINNED);
+				bio->bi_vcnt = 0;
+			}
 			if (!bio->bi_vcnt)
 				return ret;
 			break;
@@ -1377,7 +1427,7 @@ static int bio_iov_iter_bounce_read(struct bio *bio, struct iov_iter *iter,
 		ssize_t ret;
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec + 1, len,
-				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0);
+				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0, 0);
 		if (ret <= 0) {
 			if (!bio->bi_vcnt) {
 				folio_put(folio);
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97e..c9535efe1a913 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -274,7 +274,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	 * No alignment requirements on our part to support arbitrary
 	 * passthrough commands.
 	 */
-	ret = bio_iov_iter_get_pages(bio, iter, 0);
+	ret = bio_iov_iter_get_pages(bio, iter, 0, 0);
 	if (ret)
 		goto out_put;
 	ret = blk_rq_append_bio(rq, bio);
diff --git a/block/fops.c b/block/fops.c
index 0098a90a956e1..e519d7f43b310 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -46,7 +46,7 @@ static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 static inline int blkdev_iov_iter_get_pages(struct bio *bio,
 		struct iov_iter *iter, struct block_device *bdev)
 {
-	return bio_iov_iter_get_pages(bio, iter,
+	return bio_iov_iter_get_pages(bio, iter, bdev_dma_alignment(bdev),
 			bdev_logical_block_size(bdev) - 1);
 }
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b485e3b191daf..ff458aa12ae29 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -358,6 +358,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 				iomap_max_bio_size(&iter->iomap), alignment);
 	else
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
+					     bdev_dma_alignment(bio->bi_bdev),
 					     alignment - 1);
 	if (unlikely(ret))
 		goto out_put_bio;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8f33f717b14f5..ce34ea49ef358 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -477,7 +477,7 @@ int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-		unsigned len_align_mask);
+		unsigned mem_align_mask, unsigned len_align_mask);
 
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e32..fe2e985d74d24 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -389,9 +389,17 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
 			       size_t maxsize, unsigned int maxpages,
 			       iov_iter_extraction_t extraction_flags,
 			       size_t *offset0);
+/*
+ * Block-layer consumers (e.g. bio_iov_iter_get_pages()) require that the
+ * segments of an ITER_BVEC iterator are already aligned to the target device's
+ * DMA alignment, and forward them as-is.  In-kernel users that build their own
+ * bvecs must not create sub-aligned segments; iov_iter_extract_bvecs() enforces
+ * the same for the segments it extracts via @mem_align_mask.
+ */
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags);
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags);
 
 /**
  * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 273919b161617..c343075951ded 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1886,6 +1886,8 @@ static unsigned int get_contig_folio_len(struct page **pages,
  * @max_size:	maximum size to extract from @iter
  * @nr_vecs:	number of vectors in @bv (on in and output)
  * @max_vecs:	maximum vectors in @bv, including those filled before calling
+ * @mem_align_mask:	reject with -EINVAL if the source address or
+ *		length is not aligned to this mask
  * @extraction_flags: flags to qualify request
  *
  * Like iov_iter_extract_pages(), but returns physically contiguous ranges
@@ -1897,14 +1899,19 @@ static unsigned int get_contig_folio_len(struct page **pages,
  */
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags)
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags)
 {
+	unsigned long start = (unsigned long)iter_iov_addr(iter);
 	unsigned short entries_left = max_vecs - *nr_vecs;
 	unsigned short nr_pages, i = 0;
 	size_t left, offset, len;
 	struct page **pages;
 	ssize_t size;
 
+	if ((start | iter_iov_len(iter)) & mem_align_mask)
+		return -EINVAL;
+
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
 	 * possible so that we can start filling biovecs from the beginning
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 4/5] zloop: set dma_alignment from the backing files for direct I/O
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260624170905.3972095-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Direct I/O user pages are forwarded to the backing files unchanged, so
the backing's DMA alignment requirement applies to them. Track the
backing file's dio_mem_align and advertise it as the zloop device's
dma_alignment if it is larger than the default so we advertise proper
limits and misaligned I/O is rejected early instead of being dispatched
to the backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/zloop.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
index 55eeb6aac0ea3..f97a20cfdb7ce 100644
--- a/drivers/block/zloop.c
+++ b/drivers/block/zloop.c
@@ -144,6 +144,7 @@ struct zloop_device {
 	unsigned int		nr_conv_zones;
 	unsigned int		max_open_zones;
 	unsigned int		block_size;
+	unsigned int		dio_mem_align;
 
 	spinlock_t		open_zones_lock;
 	struct list_head	open_zones_lru_list;
@@ -1037,20 +1038,30 @@ static int zloop_get_block_size(struct zloop_device *zlo,
 	struct kstat st;
 
 	/*
-	 * If the FS block size is lower than or equal to 4K, use that as the
-	 * device block size. Otherwise, fallback to the FS direct IO alignment
-	 * constraint if that is provided, and to the FS underlying device
-	 * physical block size if the direct IO alignment is unknown.
+	 * Use the dio alignment of the file system if provided.  The incoming
+	 * request's bio_vec is forwarded to the backing file unchanged, so its
+	 * required memory alignment becomes the device's dma_alignment when
+	 * used for direct-io.
 	 */
-	if (file_inode(zone->file)->i_sb->s_blocksize <= SZ_4K)
-		zlo->block_size = file_inode(zone->file)->i_sb->s_blocksize;
-	else if (!vfs_getattr(&zone->file->f_path, &st, STATX_DIOALIGN, 0) &&
-		 (st.result_mask & STATX_DIOALIGN))
+	if (!vfs_getattr(&zone->file->f_path, &st, STATX_DIOALIGN, 0) &&
+	    (st.result_mask & STATX_DIOALIGN)) {
 		zlo->block_size = st.dio_offset_align;
-	else if (sb_bdev)
+		zlo->dio_mem_align = st.dio_mem_align - 1;
+	} else if (sb_bdev) {
 		zlo->block_size = bdev_physical_block_size(sb_bdev);
-	else
+		zlo->dio_mem_align = bdev_dma_alignment(sb_bdev);
+	} else {
 		zlo->block_size = SECTOR_SIZE;
+		zlo->dio_mem_align = SECTOR_SIZE - 1;
+	}
+
+	/*
+	 * Prefer the FS block size for the device block size when it is no
+	 * larger than 4K; otherwise keep the direct I/O / physical block size
+	 * selected above.
+	 */
+	if (file_inode(zone->file)->i_sb->s_blocksize <= SZ_4K)
+		zlo->block_size = file_inode(zone->file)->i_sb->s_blocksize;
 
 	if (zlo->zone_capacity & ((zlo->block_size >> SECTOR_SHIFT) - 1)) {
 		pr_err("Zone capacity is not aligned to block size %u\n",
@@ -1279,6 +1290,10 @@ static int zloop_ctl_add(struct zloop_options *opts)
 
 	lim.physical_block_size = zlo->block_size;
 	lim.logical_block_size = zlo->block_size;
+	/* Direct I/O forwards the request pages to the backing files as-is. */
+	if (!opts->buffered_io)
+		lim.dma_alignment = max_t(unsigned int, zlo->dio_mem_align,
+					  SECTOR_SIZE - 1);
 	if (zlo->zone_append)
 		lim.max_hw_zone_append_sectors = lim.max_hw_sectors;
 	lim.max_open_zones = zlo->max_open_zones;
-- 
2.53.0-Meta
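
The reworked selection order in zloop_get_block_size() above can be sketched as a pure function over its inputs. This is a userspace model with hypothetical names (struct model_backing, struct model_limits, model_zloop_limits()); the fields stand in for the statx result, the backing superblock's block device and the filesystem block size.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_SIZE	512u
#define SZ_4K		4096u

/* Hypothetical inputs standing in for what the driver queries. */
struct model_backing {
	bool has_dioalign;		/* STATX_DIOALIGN was reported */
	unsigned int dio_offset_align;
	unsigned int dio_mem_align;	/* alignment in bytes */
	bool has_bdev;			/* backing FS sits on a block device */
	unsigned int bdev_physical_bs;
	unsigned int bdev_dma_mask;	/* already in mask form */
	unsigned int fs_blocksize;
};

struct model_limits {
	unsigned int block_size;
	unsigned int dio_mem_mask;
};

static struct model_limits model_zloop_limits(const struct model_backing *b)
{
	struct model_limits l;

	/* First pick both values from the best available source. */
	if (b->has_dioalign) {
		l.block_size = b->dio_offset_align;
		l.dio_mem_mask = b->dio_mem_align - 1;
	} else if (b->has_bdev) {
		l.block_size = b->bdev_physical_bs;
		l.dio_mem_mask = b->bdev_dma_mask;
	} else {
		l.block_size = SECTOR_SIZE;
		l.dio_mem_mask = SECTOR_SIZE - 1;
	}

	/* Then prefer the FS block size when it is no larger than 4K. */
	if (b->fs_blocksize <= SZ_4K)
		l.block_size = b->fs_blocksize;
	return l;
}

static void model_zloop_example(void)
{
	const struct model_backing b = {
		.has_dioalign = true,
		.dio_offset_align = 8192,
		.dio_mem_align = 512,
		.fs_blocksize = 4096,
	};
	const struct model_limits l = model_zloop_limits(&b);

	assert(l.block_size == 4096);	/* FS block size wins */
	assert(l.dio_mem_mask == 511);	/* memory mask is still recorded */
}
```

Moving the FS block size check last means the memory alignment is always recorded, which the old if/else chain would have skipped for filesystems with block sizes of 4K or less.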


^ permalink raw reply related

* [PATCH v3 0/5] block: validate direct I/O memory alignment
From: Keith Busch @ 2026-06-24 17:09 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This addresses the misaligned direct-io problem behind various threads:

 https://lore.kernel.org/linux-xfs/20260610145218.141369-1-cem@kernel.org/
 https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
 https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/
 https://lore.kernel.org/linux-block/20260616154009.2123183-1-kbusch@meta.com/

The previously tested fixes are correct as far as they go, but they
treat the symptom: they only matter because an invalid bio reaches those
drivers in the first place.

The reason it reaches them is an assumption I made when I removed
direct-io alignment checks in 5ff3f74e145a ("block: simplify direct io
validity check") and 7eac331869575 ("iomap: simplify direct io validity
check"): every bio is eventually split to the device limits, and the
upper layers cope with resulting errors once the bio has formed. Both
were optimistic assumptions. Drivers with their own ->submit_bio may
never pass through blk_mq_submit_bio()'s split, so the check never runs
for them, and as numerous threads showed, the consumers don't uniformly
handle this condition.

This series stops the invalid bio at the source instead. It validates
the buffer's alignment against the alignment limits when the bio is
built from the iov_iter. The check is folded into the bvec extraction
that already walks the vectors, so it adds only a comparison on a path
that is pinning direct-io pages anyway. Misalignment is now uniformly
rejected with EINVAL before submission for every direct-io path.
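
As a rough illustration of the check described above, the following userspace sketch validates an iovec array against an alignment mask using the same (address | length) & mask test that patch 5/5 adds to iov_iter_extract_bvecs(). The function names are hypothetical, and the addresses in the example are synthetic values that are never dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Return 0 if every segment's start address and length are aligned to
 * mem_align_mask (alignment - 1), or -EINVAL on the first one that is not.
 * A mask of 0 means no requirement.
 */
static int model_check_iovecs(const struct iovec *iov, size_t n,
			      unsigned int mem_align_mask)
{
	for (size_t i = 0; i < n; i++) {
		uintptr_t start = (uintptr_t)iov[i].iov_base;

		if ((start | iov[i].iov_len) & mem_align_mask)
			return -EINVAL;
	}
	return 0;
}

static void model_check_example(void)
{
	const struct iovec good[] = {
		{ .iov_base = (void *)0x1000, .iov_len = 4096 },
		{ .iov_base = (void *)0x3000, .iov_len = 512 },
	};
	const struct iovec bad[] = {
		{ .iov_base = (void *)0x1004, .iov_len = 4096 },
	};

	assert(model_check_iovecs(good, 2, 511) == 0);
	assert(model_check_iovecs(bad, 1, 511) == -EINVAL);
	assert(model_check_iovecs(bad, 1, 3) == 0);	/* 4-byte aligned */
}
```

ORing the address and length lets one mask test cover both, so the cost on the extraction path is a single comparison per segment.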

v2->v3:

- Dropped the bio_endio_errno helper and open-coded its two users.
- Documented the ITER_BVEC alignment expectation in uio.h and reworded
  the bvec check comment; the exhaustive per-segment validation stays
  behind CONFIG_DEBUG_KERNEL as a contract assertion.
- Reworked zloop_get_block_size() to mirror loop's structure.
- loop/zloop only ever tighten dma_alignment beyond the default.  I
  think these could use more relaxed alignments, but I'm just being
  extra conservative against introducing new changes here.

Previous version:

  https://lore.kernel.org/linux-block/20260622174241.2299563-1-kbusch@meta.com/

Keith Busch (5):
  block: use blkdev_iov_iter_get_pages status for errors
  block: fix dio leak on metadata mapping error
  loop: set dma_alignment from the backing file for direct I/O
  zloop: set dma_alignment from the backing files for direct I/O
  block: validate user space vectors during extraction

 block/bio.c           | 56 ++++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c       |  2 +-
 block/fops.c          | 10 ++++----
 drivers/block/loop.c  | 46 ++++++++++++++++++++++++++++-------
 drivers/block/zloop.c | 35 +++++++++++++++++++--------
 fs/iomap/direct-io.c  |  1 +
 include/linux/bio.h   |  2 +-
 include/linux/uio.h   | 10 +++++++-
 lib/iov_iter.c        |  9 ++++++-
 9 files changed, 142 insertions(+), 29 deletions(-)

base-commit: 5c7804e3279c9bdc36e5eac743b4000633b25f65
-- 
2.53.0-Meta

* Re: [PATCH] blkcg: update iocost_coef_gen.py to use io_uring
From: Tejun Heo @ 2026-06-24 16:48 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Jens Axboe, linux-kernel, linux-block
In-Reply-To: <20260624-iocost-v1-1-2d53f3c026a2@kernel.org>

On Wed, Jun 24, 2026 at 10:50:34AM -0400, Jeff Layton wrote:
> Recently I found myself having to benchmark some rather fast disks for
> iocost, but the old iocost_coef_gen.py script couldn't generate enough
> throughput to saturate it. Make it use io_uring instead.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-block@vger.kernel.org
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

* Re: [PATCH 2/2] block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action
From: Caleb Sander Mateos @ 2026-06-24 15:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Martin K. Petersen, linux-block
In-Reply-To: <20260624153856.GA13186@lst.de>

On Wed, Jun 24, 2026 at 8:38 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jun 24, 2026 at 08:29:07AM -0700, Caleb Sander Mateos wrote:
> > I take it 4-KB integrity intervals don't work due to the lack of
> > remapping for REQ_OP_ZONE_APPEND? Sounds like we should come back to
> > the discussion about cleaning up the ref tag seed and remapping, then.
> > I never got a reply from Martin on that thread. I guess remapping is
> > necessary at least for partitioned block devices, but we could skip it
> > for non-partitioned block devices if we initialized the ref tag seed
> > correctly.
>
> We don't actually need the partition remapping because there can't
> be partitions.  But I see on-the-wire reftag values that are 8 times
> what they should be, so there is some kind of unit mismatch that your
> series fixes.

Right, I don't mean partitions of zoned devices, but block devices in
general. I was just trying to understand why the remapping
infrastructure exists in the first place. Seems like we can't remove
it entirely, but we can at least ensure the ref tag seeds are correct
if it's skipped for a non-partitioned device.

* Re: [PATCH 2/2] block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action
From: Christoph Hellwig @ 2026-06-24 15:38 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, linux-block
In-Reply-To: <CADUfDZo4hysS6qj=Z=dEzVk=DQe6D7-zTFODLk8RGTJ13RY5uQ@mail.gmail.com>

On Wed, Jun 24, 2026 at 08:29:07AM -0700, Caleb Sander Mateos wrote:
> I take it 4-KB integrity intervals don't work due to the lack of
> remapping for REQ_OP_ZONE_APPEND? Sounds like we should come back to
> the discussion about cleaning up the ref tag seed and remapping, then.
> I never got a reply from Martin on that thread. I guess remapping is
> necessary at least for partitioned block devices, but we could skip it
> for non-partitioned block devices if we initialized the ref tag seed
> correctly.

We don't actually need the partition remapping because there can't
be partitions.  But I see on-the-wire reftag values that are 8 times
what they should be, so there is some kind of unit mismatch that your
series fixes.
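
For illustration only: a factor of 8 is what one would get if a value
counted in 512-byte sectors were used where a count of 4096-byte
integrity intervals is expected. That reading is an assumption, not
something confirmed in this thread, and the helper names below are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical helpers: the same byte offset counted in two units. */
static uint64_t offset_in_sectors(uint64_t byte_offset)
{
	return byte_offset / 512;		/* 512-byte sectors */
}

static uint64_t offset_in_intervals(uint64_t byte_offset,
				    unsigned int interval_bytes)
{
	return byte_offset / interval_bytes;	/* integrity intervals */
}
```

With a 1 MiB offset and 4096-byte intervals, the sector count is 2048
and the interval count is 256, a ratio of 8.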
