Linux block layer


* [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 15:05 UTC
  To: dm-devel
  Cc: linux-block, mpatocka, Keith Busch, Dr. David Alan Gilbert,
	Vjaceslavs Klimovs
In-Reply-To: <20260616150554.1686662-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
misaligned direct I/O), not that the device has failed. dm-raid1 treated
any read or write completion error as a device failure: it failed the
mirror leg, retried on the alternatives - which fail identically - and
eventually returned EIO while spuriously degrading the array.

Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
direct I/O path no longer rejects misaligned buffers up front, so an
invalid bio now reaches the lower block layers, which fail it with
BLK_STS_INVAL. dm-io collapses the block status into a per-region error
bit before invoking the completion callback, so record BLK_STS_INVAL on
the originating bio and have the dm-raid1 read, write and end_io paths
propagate it instead of failing the device.

This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
("md/raid1,raid10: don't fail devices for invalid IO errors") for the
device-mapper mirror target.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/md/dm-io.c    | 14 +++++++++++++-
 drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 28adfeb58f240..f382e9f9be059 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -37,6 +37,7 @@ struct io {
 	struct dm_io_client *client;
 	io_notify_fn callback;
 	void *context;
+	struct bio *orig_bio;
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 } __aligned(DM_IO_MAX_REGIONS);
@@ -132,8 +133,18 @@ static void complete_io(struct io *io)
 
 static void dec_count(struct io *io, unsigned int region, blk_status_t error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		/*
+		 * BLK_STS_INVAL means the bio was not valid for the underlying
+		 * device (e.g. a misaligned direct I/O), which is a caller error
+		 * rather than a device failure. Record it on the original bio so
+		 * bio-based targets can propagate it instead of treating it as a
+		 * media error and failing the device.
+		 */
+		if (error == BLK_STS_INVAL && io->orig_bio)
+			io->orig_bio->bi_status = error;
+	}
 
 	if (atomic_dec_and_test(&io->count))
 		complete_io(io);
@@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
 	io->client = client;
 	io->callback = fn;
 	io->context = context;
+	io->orig_bio = dp->orig_bio;
 
 	io->vma_invalidate_address = dp->vma_invalidate_address;
 	io->vma_invalidate_size = dp->vma_invalidate_size;
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index de5c00704e69c..022ad791c2957 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. That is a caller error, not a device
+	 * failure, so propagate it rather than failing the mirror and retrying
+	 * on the other legs, which would fail the same way.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	fail_mirror(m, DM_RAID1_READ_ERROR);
 
 	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
@@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate the error without degrading
+	 * the array.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	/*
 	 * If the bio is discard, return an error, but do not
 	 * degrade the array.
@@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 		return DM_ENDIO_DONE;
 	}
 
-	if (*error == BLK_STS_NOTSUPP)
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
+	 * mirror and retrying, which would fail the same way on every leg.
+	 */
+	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
 		goto out;
 
 	if (bio->bi_opf & REQ_RAHEAD)
-- 
2.52.0
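
A minimal userspace sketch of the idea in this patch, not kernel code:
dm-io folds every completion status into a per-region error bit, so the
"invalid request" case has to be remembered separately on the
originating bio. All sketch_* names and the enum values are hypothetical
stand-ins for struct io, struct bio and blk_status_t.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-ins for blk_status_t values; the numbers are arbitrary. */
enum sketch_status { SKETCH_OK = 0, SKETCH_IOERR, SKETCH_INVAL };

/* Stand-in for the originating bio: only the status field matters here. */
struct sketch_bio {
	enum sketch_status status;
};

/* Stand-in for dm-io's struct io. */
struct sketch_io {
	unsigned long error_bits;    /* one bit per region */
	struct sketch_bio *orig_bio; /* may be NULL */
};

/* Error branch of a dec_count()-like helper. */
static void sketch_record_error(struct sketch_io *io, unsigned int region,
				enum sketch_status error)
{
	assert(io != NULL);
	assert(region < sizeof(io->error_bits) * CHAR_BIT);

	if (error == SKETCH_OK)
		return;

	/* The concrete status is lost here: only "region failed" survives. */
	io->error_bits |= 1UL << region;

	/* Keep the one status that must not be mistaken for a device failure. */
	if (error == SKETCH_INVAL && io->orig_bio)
		io->orig_bio->status = error;
}

/* A callback that only receives error bits can consult the original bio. */
static int sketch_is_device_failure(const struct sketch_io *io)
{
	if (io->error_bits == 0)
		return 0;
	return !(io->orig_bio && io->orig_bio->status == SKETCH_INVAL);
}
```

With this split, a callback that only sees error bits can still tell an
invalid request from a failed device by looking at the original bio,
which is the check read_callback() and write_callback() gain in the diff
above.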



* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 17:09 UTC
  To: Peter Zijlstra
  Cc: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, Usama Arif, shakeel.butt, hannes, riel,
	kernel-team
In-Reply-To: <20260616165434.GG49951@noisy.programming.kicks-ass.net>

On 6/16/26 10:54 AM, Peter Zijlstra wrote:
> On Tue, Jun 16, 2026 at 10:10:31AM -0600, Jens Axboe wrote:
>> On 6/16/26 10:08 AM, Jens Axboe wrote:
>>>
>>> On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
>>>> The details for this are in patch 2. The main reason for this series
>>>> is to invalidate the cached timestamp on context switch. Previously
>>>> this was done only in sched_update_worker(), which resulted in
>>>> blk-iocost reading stale timestamps and throttling based on wrong
>>>> information.
>>>>
>>>> Patch 1 is a prerequisite to create the invariant that
>>>> PF_BLOCK_TS set implies current->plug != NULL.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
>>>       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
>>> [2/2] block: invalidate cached plug timestamp after task switch
>>>       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932
>>
>> Note: I tentatively queued this up as a) it looks good to me (and
>> thanks Usama for fixing this!), and b) I'm about to head OOO for a week
>> or so. If Peter or any of the sched people disagree, let me know and
>> we can deal with it. If not, then I plan on sending this in with the
>> usual follow-up merge window fixes next week.
> 
> FWIW, looks good to me.

Great, thanks Peter!

-- 
Jens Axboe


* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Peter Zijlstra @ 2026-06-16 16:54 UTC
  To: Jens Axboe
  Cc: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, Usama Arif, shakeel.butt, hannes, riel,
	kernel-team
In-Reply-To: <43b0010d-1919-4986-a88a-a4ccdb3639dd@kernel.dk>

On Tue, Jun 16, 2026 at 10:10:31AM -0600, Jens Axboe wrote:
> On 6/16/26 10:08 AM, Jens Axboe wrote:
> > 
> > On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
> >> The details for this are in patch 2. The main reason for this series
> >> is to invalidate the cached timestamp on context switch. Previously
> >> this was done only in sched_update_worker(), which resulted in
> >> blk-iocost reading stale timestamps and throttling based on wrong
> >> information.
> >>
> >> Patch 1 is a prerequisite to create the invariant that
> >> PF_BLOCK_TS set implies current->plug != NULL.
> >>
> >> [...]
> > 
> > Applied, thanks!
> > 
> > [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
> >       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
> > [2/2] block: invalidate cached plug timestamp after task switch
> >       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932
> 
> Note: I tentatively queued this up as a) it looks good to me (and
> thanks Usama for fixing this!), and b) I'm about to head OOO for a week
> or so. If Peter or any of the sched people disagree, let me know and
> we can deal with it. If not, then I plan on sending this in with the
> usual follow-up merge window fixes next week.

FWIW, looks good to me.



* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:50 UTC
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>

On 16/6/26 9:17 am, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
> 
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
> 
>   BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>   Call Trace:
>   ...
>   blkcg_deactivate_policy+0x244/0x4d0
>   ioc_rqos_exit+0x44/0xe0
>   rq_qos_exit+0xba/0x120
>   __del_gendisk+0x50b/0x800
>   del_gendisk+0xff/0x190
>   ...
> 
> [CAUSE]
> process1						process2
> cgroup_rmdir
> ...
>   css_killed_work_fn
>     offline_css
>     ...
>       blkcg_destroy_blkgs
>       ...
>         __blkg_release
> 	  css_put(&blkg->blkcg->css)
>           blkg_free
> 	    INIT_WORK(xxx, blkg_free_workfn)
> 	    schedule_work
>     css_put
>     ...
>       blkcg_css_free
>         kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
>               blkg_free_workfn
> 							__del_gendisk
> 							  rq_qos_exit
> 							    ioc_rqos_exit
> 							      blkcg_deactivate_policy
> 							        mutex_lock(&q->blkcg_mutex)
> 								spin_lock_irq(&q->queue_lock)
> 							        list_for_each_entry(blkg, xxx)
> 								  blkcg = blkg->blkcg
> 								  spin_lock(&blkcg->lock)-------UAF!!!
> 	        mutex_lock(&q->blkcg_mutex)
> 	        spin_lock_irq(&q->queue_lock)
> 	        /* Only then is the blkg removed from the list */
> 	        list_del_init(&blkg->q_node)
> 
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
> 
> [Fix]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
> 
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc()  <-> blkg_free()
> blkg_create() <-> blkg_destroy()
> 
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
> ---
> v3:
>  - move css_put() after mutex_unlock() in blkg_free_workfn().
> 
> v2:
>  - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>    css reference follows the blkg's own lifetime, making the put in
>    blkg_free_workfn() symmetric with the get in blkg_alloc().
> 
> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>  block/blk-cgroup.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..3ac41f766caf 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -136,6 +136,11 @@ static void blkg_free_workfn(struct work_struct *work)
>  	spin_unlock_irq(&q->queue_lock);
>  	mutex_unlock(&q->blkcg_mutex);
>  
> +	/*
> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
> +	 */
> +	css_put(&blkg->blkcg->css);
>  	blk_put_queue(q);
>  	free_percpu(blkg->iostat_cpu);
>  	percpu_ref_exit(&blkg->refcnt);
> @@ -179,8 +184,6 @@ static void __blkg_release(struct rcu_head *rcu)
>  	for_each_possible_cpu(cpu)
>  		__blkcg_rstat_flush(blkcg, cpu);
>  
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
>  	blkg_free(blkg);
>  }
>  
> @@ -313,6 +316,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  		goto out_exit_refcnt;
>  	if (!blk_get_queue(disk->queue))
>  		goto out_free_iostat;
> +	/* blkg holds a reference to blkcg */
> +	if (!css_tryget_online(&blkcg->css))
> +		goto out_put_queue;
>  
>  	blkg->q = disk->queue;
>  	INIT_LIST_HEAD(&blkg->q_node);
> @@ -353,6 +359,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>  	while (--i >= 0)
>  		if (blkg->pd[i])
>  			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> +	css_put(&blkcg->css);
> +out_put_queue:
>  	blk_put_queue(disk->queue);
>  out_free_iostat:
>  	free_percpu(blkg->iostat_cpu);
> @@ -381,18 +389,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		goto err_free_blkg;
>  	}
>  
> -	/* blkg holds a reference to blkcg */
> -	if (!css_tryget_online(&blkcg->css)) {
> -		ret = -ENODEV;
> -		goto err_free_blkg;
> -	}
> -
>  	/* allocate */
>  	if (!new_blkg) {
>  		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>  		if (unlikely(!new_blkg)) {
>  			ret = -ENOMEM;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  	}
>  	blkg = new_blkg;
> @@ -402,7 +404,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>  		if (WARN_ON_ONCE(!blkg->parent)) {
>  			ret = -ENODEV;
> -			goto err_put_css;
> +			goto err_free_blkg;
>  		}
>  		blkg_get(blkg->parent);
>  	}
> @@ -442,8 +444,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  	blkg_put(blkg);
>  	return ERR_PTR(ret);
>  
> -err_put_css:
> -	css_put(&blkcg->css);
>  err_free_blkg:
>  	if (new_blkg)
>  		blkg_free(new_blkg);

LGTM.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>

-- 
Best Regards,
Yi
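
The ordering rule behind this fix can be modelled in a few lines of
userspace C. This is a single-threaded sketch under stated assumptions,
not the kernel implementation: model_owner stands in for struct blkcg,
model_group for struct blkcg_gq, the singly linked list for
q->blkg_list, and a "freed" flag replaces the real kfree(). All model_*
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct blkcg: a refcount plus a flag instead of kfree(). */
struct model_owner {
	int refs;
	int freed;
};

/* Stand-in for struct blkcg_gq, linked on a queue-wide list. */
struct model_group {
	struct model_owner *owner;
	struct model_group *next;
};

/* Drop one reference; the last put "frees" the owner. */
static void model_owner_put(struct model_owner *o)
{
	assert(o != NULL && o->refs > 0);
	if (--o->refs == 0)
		o->freed = 1;
}

/* Remove g from the list headed by *head, if it is on it. */
static void model_unlink(struct model_group **head, struct model_group *g)
{
	struct model_group **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == g) {
			*pp = g->next;
			g->next = NULL;
			return;
		}
	}
}

/* Fixed ordering: unlink first, drop the owner reference afterwards. */
static void model_group_free(struct model_group **head, struct model_group *g)
{
	model_unlink(head, g);
	model_owner_put(g->owner);
}

/* What a list iterator relies on: every listed group has a live owner. */
static int model_list_is_safe(const struct model_group *head)
{
	for (; head; head = head->next)
		if (head->owner->freed)
			return 0;
	return 1;
}
```

Calling model_owner_put() while the group is still linked makes
model_list_is_safe() return 0, which corresponds to the window shown in
the [CAUSE] diagram; model_group_free() never opens that window.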



* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Tang Yizhou @ 2026-06-16 16:44 UTC
  To: Hou Tao, yukuai, Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1
In-Reply-To: <8bdf88b3-0879-e3ec-a52d-3e7559bfddbb@huaweicloud.com>

On 16/6/26 9:23 am, Hou Tao wrote:
> Hi,
> 
> On 6/16/2026 12:16 AM, Yu Kuai wrote:
>> Hi,
>>
>> On 2026/6/15 19:55, Zizhi Wo wrote:
>>> From: Zizhi Wo <wozizhi@huawei.com>
>>>
>>> [BUG]
>>> Our fuzz testing triggered a blkcg use-after-free issue:
>>>
>>>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>>>    Call Trace:
>>>    ...
>>>    blkcg_deactivate_policy+0x244/0x4d0
>>>    ioc_rqos_exit+0x44/0xe0
>>>    rq_qos_exit+0xba/0x120
>>>    __del_gendisk+0x50b/0x800
>>>    del_gendisk+0xff/0x190
>>>    ...
>>>
>>> [CAUSE]
>>> process1						process2
>>> cgroup_rmdir
>>> ...
>>>    css_killed_work_fn
>>>      offline_css
>>>      ...
>>>        blkcg_destroy_blkgs
>>>        ...
>>>          __blkg_release
>>> 	  css_put(&blkg->blkcg->css)
>>>            blkg_free
>>> 	    INIT_WORK(xxx, blkg_free_workfn)
>>> 	    schedule_work
>>>      css_put
>>>      ...
>>>        blkcg_css_free
>>>          kfree(blkcg)--------blkcg has been freed!!!
>>> ====================================schedule_work
>>>                blkg_free_workfn
>>> 							__del_gendisk
>>> 							  rq_qos_exit
>>> 							    ioc_rqos_exit
>>> 							      blkcg_deactivate_policy
>>> 							        mutex_lock(&q->blkcg_mutex)
>>> 								spin_lock_irq(&q->queue_lock)
>>> 							        list_for_each_entry(blkg, xxx)
>>> 								  blkcg = blkg->blkcg
>>> 								  spin_lock(&blkcg->lock)-------UAF!!!
>>> 	        mutex_lock(&q->blkcg_mutex)
>>> 	        spin_lock_irq(&q->queue_lock)
>>> 	        /* Only then is the blkg removed from the list */
>>> 	        list_del_init(&blkg->q_node)
>>>
>>> As a result, a blkg can still be reachable through q->blkg_list while
>>> its ->blkcg has already been freed.
>>>
>>> [Fix]
>>> Fix this by deferring the blkcg css_put() until after the blkg has been
>>> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
>>> blkcg outlives every blkg still reachable through q->blkg_list, so any
>>> iterator holding q->queue_lock is guaranteed to observe a valid
>>> blkg->blkcg.
>>>
>>> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
>>> so that the css reference is owned by the alloc/free pair rather than
>>> straddling layers:
>>> blkg_alloc()  <-> blkg_free()
>>> blkg_create() <-> blkg_destroy()
>>>
>>> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
>>> Suggested-by: Hou Tao <houtao1@huawei.com>
>>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>>> ---
>>> v2:
>>>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>>>     css reference follows the blkg's own lifetime, making the put in
>>>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>>>
>>> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>>>
>>>   block/blk-cgroup.c | 24 ++++++++++++------------
>>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>> index bc63bd220865..27414c291e49 100644
>>> --- a/block/blk-cgroup.c
>>> +++ b/block/blk-cgroup.c
>>> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>>>   	if (blkg->parent)
>>>   		blkg_put(blkg->parent);
>>>   	spin_lock_irq(&q->queue_lock);
>>>   	list_del_init(&blkg->q_node);
>>>   	spin_unlock_irq(&q->queue_lock);
>>> +	/*
>>> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
>>> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
>>> +	 */
>>> +	css_put(&blkg->blkcg->css);
>>>   	mutex_unlock(&q->blkcg_mutex);
>> Please move css_put after mutex_unlock, unless there is a strong reason.
> 
> I think blkcg_mutex is used here to serialize access to blkg->q_node
> and blkg->blkcg. We could move the css_put() after the mutex_unlock(),
> but it would still depend implicitly on the mutex_lock()/mutex_unlock()
> pair on blkcg_mutex. Instead of relying on that implicit dependency, we
> put the css_put() inside the lock to make it explicit.

Hi, I think I understand your point. Keeping css_put() inside blkcg_mutex makes the dependency explicit, since the same mutex serializes both the removal of blkg->q_node and the access to blkg->blkcg.

Placing css_put() after mutex_unlock(&q->blkcg_mutex) is still functionally correct. The blkg has already been removed from q->blkg_list under the mutex, so once we drop the mutex no iterator can reach this blkg anymore.

The benefit of moving it out is a smaller critical section.

-- 
Best Regards,
Yi

>>
>> With above change, feel free to add:
>>
>> Reviewed-by: Yu Kuai <yukuai@fygo.io>
>>
>>>   
>>>   	blk_put_queue(q);
>>>   	free_percpu(blkg->iostat_cpu);
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>>>   	 * blkg_stat_lock is for serializing blkg stat update
>>>   	 */
>>>   	for_each_possible_cpu(cpu)
>>>   		__blkcg_rstat_flush(blkcg, cpu);
>>>   
>>> -	/* release the blkcg and parent blkg refs this blkg has been holding */
>>> -	css_put(&blkg->blkcg->css);
>>>   	blkg_free(blkg);
>>>   }
>>>   
>>>   /*
>>>    * A group is RCU protected, but having an rcu lock does not mean that one
>>> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>>   	if (!blkg->iostat_cpu)
>>>   		goto out_exit_refcnt;
>>>   	if (!blk_get_queue(disk->queue))
>>>   		goto out_free_iostat;
>>> +	/* blkg holds a reference to blkcg */
>>> +	if (!css_tryget_online(&blkcg->css))
>>> +		goto out_put_queue;
>>>   
>>>   	blkg->q = disk->queue;
>>>   	INIT_LIST_HEAD(&blkg->q_node);
>>>   	blkg->blkcg = blkcg;
>>>   	blkg->iostat.blkg = blkg;
>>> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   out_free_pds:
>>>   	while (--i >= 0)
>>>   		if (blkg->pd[i])
>>>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>>> +	css_put(&blkcg->css);
>>> +out_put_queue:
>>>   	blk_put_queue(disk->queue);
>>>   out_free_iostat:
>>>   	free_percpu(blkg->iostat_cpu);
>>>   out_exit_refcnt:
>>>   	percpu_ref_exit(&blkg->refcnt);
>>> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   	if (blk_queue_dying(disk->queue)) {
>>>   		ret = -ENODEV;
>>>   		goto err_free_blkg;
>>>   	}
>>>   
>>> -	/* blkg holds a reference to blkcg */
>>> -	if (!css_tryget_online(&blkcg->css)) {
>>> -		ret = -ENODEV;
>>> -		goto err_free_blkg;
>>> -	}
>>> -
>>>   	/* allocate */
>>>   	if (!new_blkg) {
>>>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>>>   		if (unlikely(!new_blkg)) {
>>>   			ret = -ENOMEM;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   	}
>>>   	blkg = new_blkg;
>>>   
>>>   	/* link parent */
>>>   	if (blkcg_parent(blkcg)) {
>>>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>>>   		if (WARN_ON_ONCE(!blkg->parent)) {
>>>   			ret = -ENODEV;
>>> -			goto err_put_css;
>>> +			goto err_free_blkg;
>>>   		}
>>>   		blkg_get(blkg->parent);
>>>   	}
>>>   
>>>   	/* invoke per-policy init */
>>> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>>   
>>>   	/* @blkg failed fully initialized, use the usual release path */
>>>   	blkg_put(blkg);
>>>   	return ERR_PTR(ret);
>>>   
>>> -err_put_css:
>>> -	css_put(&blkcg->css);
>>>   err_free_blkg:
>>>   	if (new_blkg)
>>>   		blkg_free(new_blkg);
>>>   	return ERR_PTR(ret);
>>>   }
> 
> 




* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Gao Xiang @ 2026-06-16 16:35 UTC
  To: Christoph Hellwig, Christian Brauner
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260616123443.GA21024@lst.de>

On 2026/6/16 20:34, Christoph Hellwig wrote:

> IMHO sharing devices between superblocks is a bad idea. That ship has
> sailed, but please keep it contained inside erofs.

I'm not sure why it's a bad idea. For example, the immutable layer
model is already applied to layered virtual block formats (such as
qcow2) and to layered filesystems like overlayfs.

I also think device mapper may have some similar immutable
shared-layer approaches that work in a slightly different way.

The principle is that each instance uses shared blobs in a
read-only way, which is about the simplest and safest way
to share data among filesystem instances.

That said, I don't want to argue about it: this model has been
common for years and I've seen no practical risk in using it.

Thanks,
Gao Xiang


* Re: [PATCH] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
From: Caleb Sander Mateos @ 2026-06-16 16:13 UTC
  To: Yitang Yang; +Cc: Jens Axboe, linux-block
In-Reply-To: <20260616155129.406057-1-yi1tang.yang@gmail.com>

On Tue, Jun 16, 2026 at 9:11 AM Yitang Yang <yi1tang.yang@gmail.com> wrote:
>
> blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
> this is the first issue. However, this flag lives in cmd->flags instead
> of issue_flags.
>
> Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
> IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
> bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
> -EINVAL.
>
> Fix it by checking cmd->flags as intended.
>
> Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd")
> Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
>  block/ioctl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/ioctl.c b/block/ioctl.c
> index ab2c9ed79946..3d4ea1537457 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -951,7 +951,7 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>         u32 cmd_op = cmd->cmd_op;
>
>         /* Read what we need from the SQE on the first issue */
> -       if (!(issue_flags & IORING_URING_CMD_REISSUE)) {
> +       if (!(cmd->flags & IORING_URING_CMD_REISSUE)) {
>                 const struct io_uring_sqe *sqe = cmd->sqe;
>
>                 if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
> --
> 2.43.0
>
>
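
The failure mode described in the commit message can be reproduced with
a small userspace sketch. The DEMO_* constants and demo_* functions are
hypothetical; they only assume what the message states, namely that two
unrelated flag words both use bit 31. Storing the SQE length stands in
for "reading what we need from the SQE".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two separate flag namespaces that happen to share bit 31. */
#define DEMO_ISSUE_F_NONBLOCK (UINT32_C(1) << 31) /* belongs to issue_flags */
#define DEMO_CMD_F_REISSUE    (UINT32_C(1) << 31) /* belongs to cmd->flags */

struct demo_cmd {
	uint32_t flags; /* per-command flags, where REISSUE lives */
	uint32_t len;   /* value captured from the SQE on first issue */
};

/* Buggy variant: tests the REISSUE bit against the wrong word. */
static void demo_prep_buggy(struct demo_cmd *cmd, uint32_t issue_flags,
			    uint32_t sqe_len)
{
	assert(cmd != NULL);
	if (!(issue_flags & DEMO_CMD_F_REISSUE))
		cmd->len = sqe_len;
}

/* Fixed variant: tests the bit in the word it is defined for. */
static void demo_prep_fixed(struct demo_cmd *cmd, uint32_t issue_flags,
			    uint32_t sqe_len)
{
	assert(cmd != NULL);
	(void)issue_flags; /* not relevant to the first-issue decision */
	if (!(cmd->flags & DEMO_CMD_F_REISSUE))
		cmd->len = sqe_len;
}
```

On a first, non-blocking issue the buggy variant sees bit 31 set in
issue_flags, skips the read and leaves len at zero; the fixed variant
captures the length. Giving each flag namespace its own type would let
the compiler catch this kind of mix-up, at the cost of more casts.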


* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 16:10 UTC
  To: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid, Usama Arif
  Cc: shakeel.butt, hannes, riel, kernel-team
In-Reply-To: <178162611741.2191657.12211870708971600814.b4-ty@b4>

On 6/16/26 10:08 AM, Jens Axboe wrote:
> 
> On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
>> The details for this are in patch 2. The main reason for this series
>> is to invalidate the cached timestamp on context switch. Previously
>> this was done only in sched_update_worker(), which resulted in
>> blk-iocost reading stale timestamps and throttling based on wrong
>> information.
>>
>> Patch 1 is a prerequisite to create the invariant that
>> PF_BLOCK_TS set implies current->plug != NULL.
>>
>> [...]
> 
> Applied, thanks!
> 
> [1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
>       commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
> [2/2] block: invalidate cached plug timestamp after task switch
>       commit: fad156c2af227f42ca796cbb20ddc354a6dd9932

Note: I tentatively queued this up as a) it looks good to me (and
thanks Usama for fixing this!), and b) I'm about to head OOO for a week
or so. If Peter or any of the sched people disagree, let me know and
we can deal with it. If not, then I plan on sending this in with the
usual follow-up merge window fixes next week.

-- 
Jens Axboe



* Re: [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Jens Axboe @ 2026-06-16 16:08 UTC
  To: linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid, Usama Arif
  Cc: shakeel.butt, hannes, riel, kernel-team
In-Reply-To: <20260616141604.328820-1-usama.arif@linux.dev>


On Tue, 16 Jun 2026 07:15:16 -0700, Usama Arif wrote:
> The details for this are in patch 2. The main reason for this series
> is to invalidate the cached timestamp on context switch. Previously
> this was done only in sched_update_worker(), which resulted in
> blk-iocost reading stale timestamps and throttling based on wrong
> information.
> 
> Patch 1 is a prerequisite to create the invariant that
> PF_BLOCK_TS set implies current->plug != NULL.
> 
> [...]

Applied, thanks!

[1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
      commit: fd38b75c4b43295b10d69772a46d1c74dbd6fc81
[2/2] block: invalidate cached plug timestamp after task switch
      commit: fad156c2af227f42ca796cbb20ddc354a6dd9932

Best regards,
-- 
Jens Axboe
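
A rough userspace model of the behaviour the cover letter describes,
offered as a sketch rather than the kernel code. Assumptions: the ts_*
names are hypothetical, the current time is passed in by the caller
instead of being read from a clock, a cached value of 0 means "nothing
cached", and block_ts plays the role of PF_BLOCK_TS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a plug that caches one timestamp; 0 means "not cached". */
struct ts_plug {
	uint64_t cached_ns;
};

/* Stand-in for a task: block_ts models the PF_BLOCK_TS flag. */
struct ts_task {
	struct ts_plug *plug;
	int block_ts;
};

/* Return the cached time if a plug holds one, otherwise cache now_ns. */
static uint64_t ts_time_get(struct ts_task *t, uint64_t now_ns)
{
	assert(t != NULL);
	if (!t->plug)
		return now_ns;
	if (t->plug->cached_ns == 0) {
		t->plug->cached_ns = now_ns;
		t->block_ts = 1;
	}
	return t->plug->cached_ns;
}

/* Called when the task runs again after having been switched out. */
static void ts_after_task_switch(struct ts_task *t)
{
	assert(t != NULL);
	if (!t->block_ts)
		return;
	/* Invariant from patch 1: the flag implies a plug is present. */
	assert(t->plug != NULL);
	t->plug->cached_ns = 0;
	t->block_ts = 0;
}
```

Without the call to ts_after_task_switch(), a task that slept for a long
time would keep returning the timestamp cached before it was switched
out, which is the stale reading the series attributes to blk-iocost.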





* Re: [PATCH V2] block: Remove redundant plug in __submit_bio()
From: Jens Axboe @ 2026-06-16 16:08 UTC
  To: linux-block, wenxiong; +Cc: tom.leiming, yukuai, stable, wenxiong
In-Reply-To: <20260616143121.878021-1-wenxiong@linux.ibm.com>


On Tue, 16 Jun 2026 10:31:21 -0400, wenxiong@linux.ibm.com wrote:
> The patch removes the automatic plug/unplug operations from __submit_bio()
> that were added to cache nsecs time when no explicit plug is used.
> 
> The plug mechanism is most effective when batching multiple I/O
> operations together. Creating a plug for every bio submission
> provides minimal benefit while adding function call overhead and
> stack usage for every I/O operation.
> 
> [...]

Applied, thanks!

[1/1] block: Remove redundant plug in __submit_bio()
      commit: 9cbbac29d752fb5d95e375fa3685a359b89caa0a

Best regards,
-- 
Jens Axboe





* Re: [PATCH] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
From: Jens Axboe @ 2026-06-16 16:08 UTC
  To: Yitang Yang; +Cc: linux-block
In-Reply-To: <20260616155129.406057-1-yi1tang.yang@gmail.com>


On Tue, 16 Jun 2026 23:51:29 +0800, Yitang Yang wrote:
> blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
> this is the first issue. However, this flag lives in cmd->flags instead
> of issue_flags.
> 
> Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
> IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
> bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
> -EINVAL.
> 
> [...]

Applied, thanks!

[1/1] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
      commit: 4f919141be38ea2b1314e3a531b7b998eb64e8bc

Best regards,
-- 
Jens Axboe





* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-16 16:05 UTC
  To: Mikulas Patocka
  Cc: Vjaceslavs Klimovs, Dr. David Alan Gilbert, Thorsten Leemhuis,
	trnka, Zdenek Kabelac, linux-block, dm-devel,
	Linux kernel regressions list
In-Reply-To: <27311df3-2c46-08be-825a-157ea906bdb2@redhat.com>

On Tue, Jun 16, 2026 at 05:55:13PM +0200, Mikulas Patocka wrote:
> I thought that reverting 5ff3f74e145a and re-introducing the alignment 
> check in block/fops.c:blkdev_dio_invalid would fix it - but it wouldn't.
> 
> The same problem existed even before 5ff3f74e145a, with the pvmove 
> command.

Also before 5ff3f74e145a, you could still have devices that are
perfectly fine with dword-aligned DMA, so sub-sector vectors would have
passed the checks and gone through to dm-raid, which would have
miscounted the remaining.

> So, I think that the proper way to fix this is to teach dm-mirror/dm-io to 
> deal with unaligned bio vectors and handle them properly.

The block layer already handles it, so I think dispatching it and
checking bi_status is all the stacking drivers need to do.
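
To make the "dispatch and check the status" approach concrete, here is
a small userspace sketch of how a mirroring driver might sort completion
statuses. It is an illustration under assumptions, not dm-raid1 code:
the LEG_STS_* values and leg_* names are hypothetical stand-ins for
blk_status_t codes and per-leg state.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for block status codes; the numbers are arbitrary. */
enum leg_status {
	LEG_STS_OK = 0,
	LEG_STS_NOTSUPP, /* operation not supported by the device */
	LEG_STS_INVAL,   /* the request itself was invalid */
	LEG_STS_IOERR    /* the device failed the request */
};

/* Per-leg state: only the "failed" marker matters in this sketch. */
struct leg_state {
	int failed;
};

/*
 * Returns 1 when the caller should retry on another leg, 0 when the
 * status should be handed back to the submitter unchanged.
 */
static int leg_end_io(struct leg_state *leg, enum leg_status status)
{
	assert(leg != NULL);

	if (status == LEG_STS_OK)
		return 0;

	/* Request-level problems would fail identically on every leg. */
	if (status == LEG_STS_NOTSUPP || status == LEG_STS_INVAL)
		return 0;

	/* Anything else is treated as a device failure. */
	leg->failed = 1;
	return 1;
}
```

Only the last branch degrades the array; the request-level statuses
leave the leg untouched and reach the submitter as they are.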


* [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 15:58 UTC
  To: Keith Busch
  Cc: dm-devel, linux-block, mpatocka, Dr. David Alan Gilbert,
	Vjaceslavs Klimovs
In-Reply-To: <20260616150554.1686662-1-kbusch@meta.com>

BLK_STS_INVAL indicates the I/O request itself was invalid (for example a
misaligned direct I/O), not that the device has failed. dm-raid1 treated
any read or write completion error as a device failure: it failed the
mirror leg, retried on the alternatives - which fail identically - and
eventually returned EIO while spuriously degrading the array.

Since commit 5ff3f74e145a ("block: simplify direct io validity check") the
direct I/O path no longer rejects misaligned buffers up front, so an
invalid bio now reaches the lower block layers, which fail it with
BLK_STS_INVAL. dm-io collapses the block status into a per-region error
bit before invoking the completion callback, so record BLK_STS_INVAL on
the originating bio and have the dm-raid1 read, write and end_io paths
propagate it instead of failing the device.

This mirrors the raid1/raid10 fix in commit f7b24c7b41f23
("md/raid1,raid10: don't fail devices for invalid IO errors") for the
device-mapper mirror target.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
Resending patch 2/2 from a different machine. For some reason, only 1/2
is getting through with git-send-email, so manually replying to the
thread with the missing second patch.

 drivers/md/dm-io.c    | 14 +++++++++++++-
 drivers/md/dm-raid1.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 28adfeb58f240..f382e9f9be059 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -37,6 +37,7 @@ struct io {
 	struct dm_io_client *client;
 	io_notify_fn callback;
 	void *context;
+	struct bio *orig_bio;
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 } __aligned(DM_IO_MAX_REGIONS);
@@ -132,8 +133,18 @@ static void complete_io(struct io *io)
 
 static void dec_count(struct io *io, unsigned int region, blk_status_t error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		/*
+		 * BLK_STS_INVAL means the bio was not valid for the underlying
+		 * device (e.g. a misaligned direct I/O), which is a caller error
+		 * rather than a device failure. Record it on the original bio so
+		 * bio-based targets can propagate it instead of treating it as a
+		 * media error and failing the device.
+		 */
+		if (error == BLK_STS_INVAL && io->orig_bio)
+			io->orig_bio->bi_status = error;
+	}
 
 	if (atomic_dec_and_test(&io->count))
 		complete_io(io);
@@ -398,6 +409,7 @@ static void async_io(struct dm_io_client *client, unsigned int num_regions,
 	io->client = client;
 	io->callback = fn;
 	io->context = context;
+	io->orig_bio = dp->orig_bio;
 
 	io->vma_invalidate_address = dp->vma_invalidate_address;
 	io->vma_invalidate_size = dp->vma_invalidate_size;
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index de5c00704e69c..022ad791c2957 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -524,6 +524,17 @@ static void read_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. That is a caller error, not a device
+	 * failure, so propagate it rather than failing the mirror and retrying
+	 * on the other legs, which would fail the same way.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	fail_mirror(m, DM_RAID1_READ_ERROR);
 
 	if (likely(default_ok(m)) || mirror_available(m->ms, bio)) {
@@ -622,6 +633,16 @@ static void write_callback(unsigned long error, void *context)
 		return;
 	}
 
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate the error without degrading
+	 * the array.
+	 */
+	if (bio->bi_status == BLK_STS_INVAL) {
+		bio_endio(bio);
+		return;
+	}
+
 	/*
 	 * If the bio is discard, return an error, but do not
 	 * degrade the array.
@@ -1262,7 +1283,12 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 		return DM_ENDIO_DONE;
 	}
 
-	if (*error == BLK_STS_NOTSUPP)
+	/*
+	 * BLK_STS_INVAL means the bio was not valid for the underlying device,
+	 * e.g. a misaligned direct I/O. Propagate it rather than failing the
+	 * mirror and retrying, which would fail the same way on every leg.
+	 */
+	if (*error == BLK_STS_NOTSUPP || *error == BLK_STS_INVAL)
 		goto out;
 
 	if (bio->bi_opf & REQ_RAHEAD)
-- 
2.52.0



^ permalink raw reply related
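The status handling described in the dm-raid1 patch above can be sketched
in userspace C. This is a simplified model, not the kernel code: the
demo_* types, the enum values and demo_should_fail_leg() are hypothetical
stand-ins for struct io, struct bio, blk_status_t and the decision made in
the read/write callbacks. It only shows the idea of collapsing a status
into a per-region error bit while remembering "invalid request" on the
originating bio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for blk_status_t values; not the kernel's numbers. */
enum demo_status { DEMO_STS_OK = 0, DEMO_STS_IOERR, DEMO_STS_INVAL };

/* Hypothetical, trimmed-down models of struct bio and dm-io's struct io. */
struct demo_bio { enum demo_status status; };
struct demo_io {
	unsigned long error_bits;  /* one bit per region */
	struct demo_bio *orig_bio; /* NULL for non-bio sources */
};

/* Collapse a region's status into a bit, but keep "invalid" on the bio. */
static void demo_record_error(struct demo_io *io, unsigned int region,
			      enum demo_status error)
{
	if (error == DEMO_STS_OK)
		return;
	io->error_bits |= 1UL << region;
	if (error == DEMO_STS_INVAL && io->orig_bio)
		io->orig_bio->status = error;
}

/* 1: treat as device failure; 0: just complete the bio with its status. */
static int demo_should_fail_leg(const struct demo_io *io)
{
	if (!io->error_bits)
		return 0;
	if (io->orig_bio && io->orig_bio->status == DEMO_STS_INVAL)
		return 0;
	return 1;
}

/* Small self-check of the two cases the patch distinguishes. */
static void demo_self_check(void)
{
	struct demo_bio bad_req = { DEMO_STS_OK }, bad_dev = { DEMO_STS_OK };
	struct demo_io a = { 0, &bad_req }, b = { 0, &bad_dev };

	demo_record_error(&a, 1, DEMO_STS_INVAL);
	demo_record_error(&b, 0, DEMO_STS_IOERR);
	assert(demo_should_fail_leg(&a) == 0); /* caller error: propagate */
	assert(demo_should_fail_leg(&b) == 1); /* media error: fail the leg */
}
```

Calling demo_self_check() from a main() exercises both paths. The model
deliberately ignores the multi-leg retry logic in dm-raid1.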

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Mikulas Patocka @ 2026-06-16 15:55 UTC (permalink / raw)
  To: Vjaceslavs Klimovs
  Cc: Dr. David Alan Gilbert, Thorsten Leemhuis, kbusch, trnka,
	Zdenek Kabelac, linux-block, dm-devel,
	Linux kernel regressions list
In-Reply-To: <CAC_j7i0eDccVWzPeRafM50mZEOFHPz2cwd=RZqqx6TK2EVRFvw@mail.gmail.com>

Hi


On Mon, 15 Jun 2026, Vjaceslavs Klimovs wrote:

> Hi Dave, all,
> 
> I'm one of the original reporters and very much a user, not a block/dm
> developer, so please sanity-check all of this.
> 
> Your trace looks like what the two earlier reports hit: a read reaching
> a leaf device with sectors > 0 but phys_seg 0 (an empty bio). One aside
> that may help read the trace: blk_io_trace.error is a __u16, so the
> bracketed values on your C lines are errnos as u16 (65514 = -EINVAL,
> 65531 = -EIO).
> 
> The WARN itself is new, the bad bio isn't. bio_add_page() only started
> rejecting len == 0 in 643893647cac ("block: reject zero length in
> bio_add_page()", v7.1-rc1); on 7.0.8 the same empty bio tripped
> scsi_alloc_sgtables()'s !nr_segs instead, which matches what you saw.
> That fits your "not a recent regression": the condition is older, v7.1
> just made it loud.
> 
> For Tomas's and my reports (QEMU O_DIRECT to the LV block device) the
> origin looks like 5ff3f74e145a ("block: simplify direct io validity
> check", v6.18): blkdev_dio_invalid() now checks only aggregate
> ki_pos | count alignment and dropped the per-segment
> bdev_iter_is_aligned() walk, so a degenerate or misaligned O_DIRECT no
> longer gets -EINVAL at the fops boundary. But your reproducer reads a
> file, which goes through the filesystem O_DIRECT path and never calls
> blkdev_dio_invalid(), and still makes the empty bio. So it isn't only
> that one entry point.

I thought that reverting 5ff3f74e145a and re-introducing the alignment 
check in block/fops.c:blkdev_dio_invalid would fix it - but it wouldn't.

The same problem existed even before 5ff3f74e145a, with the pvmove 
command.

Suppose that the administrator needs to move a logical volume from one 
disk to another and uses pvmove. Pvmove inserts a new dm-mirror target 
underneath the logical volume and uses it to copy the data. Now, the 
dm-mirror target crashes whenever it receives a bio with unaligned vectors.

So, I think that the proper way to fix this is to teach dm-mirror/dm-io to 
deal with unaligned bio vectors and handle them properly.

Mikulas


^ permalink raw reply
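The aside quoted above, that a __u16 error field shows errnos as 65514
and 65531, is plain two's-complement arithmetic. A minimal userspace
sketch, assuming the Linux errno values (EINVAL = 22, EIO = 5); the
demo_* helpers are hypothetical names, not part of blktrace:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* How a negative errno looks once stored in a 16-bit unsigned field. */
static uint16_t demo_errno_as_u16(int neg_errno)
{
	return (uint16_t)neg_errno; /* well-defined: reduced modulo 65536 */
}

/* Recover the signed errno from the raw 16-bit value. */
static int demo_u16_as_errno(uint16_t raw)
{
	return raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;
}

/* Self-check against the two values mentioned in the thread. */
static void demo_self_check(void)
{
	assert(demo_errno_as_u16(-EINVAL) == 65514);
	assert(demo_errno_as_u16(-EIO) == 65531);
	assert(demo_u16_as_errno(65514) == -EINVAL);
	assert(demo_u16_as_errno(65531) == -EIO);
}
```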

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Dr. David Alan Gilbert @ 2026-06-16 15:55 UTC (permalink / raw)
  To: Keith Busch
  Cc: zkabelac, Vjaceslavs Klimovs, Thorsten Leemhuis, trnka,
	linux-block, dm-devel, Linux kernel regressions list
In-Reply-To: <ajFbglSvcLrFH8Z-@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Tue, Jun 16, 2026 at 01:08:52PM +0000, Dr. David Alan Gilbert wrote:
> > ( lvcreate  -m 1 -L 1G main /dev/sda2 /dev/sdb2 ) rather than
> > the old mirror with the same patch, then:
> > 
> >   a) I get no log errors with either read or write
> >   b) read still gives EIO
> 
> I've a follow up patch to handle the error properly. You want to see
> EINVAL, not EIO, and that error shouldn't be considered for determining
> the raid health. Something like what f7b24c7b41f23b5 does, but it's a
> little more complicated in this path since it doesn't see the lower
> level error status and just converts everything to EIO.

OK, thanks for your help, and I'll be happy to test that when it's done.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply

* [PATCH] block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
From: Yitang Yang @ 2026-06-16 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Yitang Yang

blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
this is the first issue. However, this flag lives in cmd->flags instead
of issue_flags.

Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
-EINVAL.

Fix it by checking cmd->flags as intended.

Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd")
Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com>
---
 block/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index ab2c9ed79946..3d4ea1537457 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -951,7 +951,7 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	u32 cmd_op = cmd->cmd_op;
 
 	/* Read what we need from the SQE on the first issue */
-	if (!(issue_flags & IORING_URING_CMD_REISSUE)) {
+	if (!(cmd->flags & IORING_URING_CMD_REISSUE)) {
 		const struct io_uring_sqe *sqe = cmd->sqe;
 
 		if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
-- 
2.43.0


^ permalink raw reply related
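The bug fixed by the blkdev_uring_cmd() patch above is a check against
the wrong flag word: two unrelated flags happen to use the same bit.
A minimal sketch of that failure mode, assuming only what the commit
message states (both flags occupy bit 31). The DEMO_* names and struct
demo_cmd are hypothetical stand-ins, not the io_uring definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Both stand-in flags occupy bit 31 but live in different flag words. */
#define DEMO_ISSUE_F_NONBLOCK (UINT32_C(1) << 31) /* in issue_flags */
#define DEMO_CMD_F_REISSUE    (UINT32_C(1) << 31) /* in cmd->flags  */

struct demo_cmd { uint32_t flags; };

/* Buggy: tests the reissue bit in the wrong word. */
static int demo_first_issue_buggy(const struct demo_cmd *cmd,
				  uint32_t issue_flags)
{
	(void)cmd;
	return !(issue_flags & DEMO_CMD_F_REISSUE);
}

/* Fixed: tests the reissue bit where it is actually stored. */
static int demo_first_issue_fixed(const struct demo_cmd *cmd,
				  uint32_t issue_flags)
{
	(void)issue_flags;
	return !(cmd->flags & DEMO_CMD_F_REISSUE);
}

/* A first, non-blocking issue: the buggy check wrongly sees a reissue. */
static void demo_self_check(void)
{
	struct demo_cmd cmd = { 0 };

	assert(demo_first_issue_buggy(&cmd, DEMO_ISSUE_F_NONBLOCK) == 0);
	assert(demo_first_issue_fixed(&cmd, DEMO_ISSUE_F_NONBLOCK) == 1);
}
```

In the buggy variant the "first issue" branch is skipped whenever the
non-blocking bit is set, which matches the described symptom of the SQE
never being read.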

* Re: [PATCH] block: check bio split for unaligned bvec
From: Keith Busch @ 2026-06-16 15:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-block, axboe, Carlos Maiolino
In-Reply-To: <ajB36Wt9lU_F-r7h@kbusch-mbp>

On Mon, Jun 15, 2026 at 04:08:41PM -0600, Keith Busch wrote:
>   3: can handle arbitrary memory but advertise default dma_alignment=511
>       (brd, pmem, zram, ps3vram, simdisk - "limits lie")

That's actually not right because they iterate with
bio_for_each_segment, which requires all the bv_len's are sector size
granularity, so their default limits are correct. Just no one's
enforcing them right now.

^ permalink raw reply
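The requirement mentioned above, that every bv_len be a multiple of the
sector size, is stricter than checking the total length. A small
userspace sketch of the difference, assuming 512-byte sectors; struct
demo_bvec and the helper names are hypothetical and only model the
length field of a bio_vec:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SECTOR_SIZE 512u

/* Hypothetical, reduced bio_vec: only the length matters here. */
struct demo_bvec { unsigned int len; };

/* Aggregate check: only the sum of all vector lengths is tested. */
static bool demo_total_aligned(const struct demo_bvec *v, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += v[i].len;
	return total % DEMO_SECTOR_SIZE == 0;
}

/* Per-vector check: every length must be sector granular. */
static bool demo_each_aligned(const struct demo_bvec *v, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (v[i].len % DEMO_SECTOR_SIZE)
			return false;
	return true;
}

/* One sector split 256/256 passes the first check but not the second. */
static void demo_self_check(void)
{
	const struct demo_bvec split[] = { { 256 }, { 256 } };

	assert(demo_total_aligned(split, 2));
	assert(!demo_each_aligned(split, 2));
}
```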

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-16 15:19 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260616-fragil-duktus-nachverfolgen-60f54584c206@brauner>

On Tue, Jun 16, 2026 at 04:59:53PM +0200, Christian Brauner wrote:
> On Tue, Jun 16, 2026 at 02:34:43PM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 02, 2026 at 12:10:08PM +0200, Christian Brauner wrote:
> > > fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> > > forces the holder to be exactly one superblock and prevents several
> > > superblocks from sharing one block device. That's what erofs is doing.
> > > 
> > > Introduce a global dev_t-keyed rhltable mapping each block device to the
> > > superblock(s) using it. The holder argument becomes purely the block
> > > layer's exclusivity token (a superblock, or a file_system_type for
> > > shared devices) and is no longer needed by the fs specific callbacks.
> > 
> > Err, no.  block devices need to have a specific owner.  If erofs wants
> > to share a device between superblock it needs to come up with an entity
> > that owns the block devices which is not a superblock.
> 
> It already did.
> 
> > IMHO sharing devices between superblocks is a bad idea, but that ship
> > has sailed, but please keep it contained inside of erofs.
> 
> We need a simple device number to superblock mapping anyway and that can
> simply be centralized in the vfs. And it can work with anon device
> numbers and block device numbers uniformly.

Plus, after we're done we then also have a central place where we can
intercept what devices can be mounted by a filesystem uniformly.

My first approach for this was of course to just add fs_file_open_by_*()
wrappers and move the relevant security hook into there. But while doing
this - ignoring the ton of bugs I found - I realized that having a
mapping so we can go from device number to superblock is very helpful.

We could of course keep the mapping just local to erofs but I see no
reason why the vfs cannot just provide this ability natively given that
it has all the required machinery. I'll let Jan chime in as well.

^ permalink raw reply

* [PATCH 1/2] dm-io: clone the source bio instead of copying its biovec
From: Keith Busch @ 2026-06-16 15:05 UTC (permalink / raw)
  To: dm-devel
  Cc: linux-block, mpatocka, Keith Busch, Dr. David Alan Gilbert,
	Vjaceslavs Klimovs

From: Keith Busch <kbusch@kernel.org>

For DM_IO_BIO requests, do_region() built each destination bio by walking
the source bio's biovec and re-adding the pages one at a time, tracking
the remaining transfer in sectors. The vector lengths are byte granular
and need not be sector aligned (e.g. a misaligned O_DIRECT buffer split
across pages), so the sector-based accounting could lose a sub-sector
fragment: to_sector() truncated the remainder and the outer loop spun
forever submitting empty bios, hanging the I/O.

There is no need to rebuild the biovec at all. The destination reads into
(or writes from) exactly the same pages as the source bio, so the bio can
simply clone the source's biovec with bio_alloc_clone() and remap it to
the target device. The clone inherits the source's iterator and alignment,
and the block layer splits it to the target's limits on submission, so the
whole region maps to a single cloned bio with no manual page copying or
sector accounting.

This removes the per-page copy path (and its open-coded bvec dpages
helpers) for bio-backed I/O and fixes the hang on misaligned direct I/O to
a dm-mirror device. Page-list, vma and kmem sources keep the existing copy
path.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Vjaceslavs Klimovs <vklimovs@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/md/dm-io.c | 67 +++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 43 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 1db565b376200..28adfeb58f240 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -170,12 +170,11 @@ struct dpages {
 			 struct page **p, unsigned long *len, unsigned int *offset);
 	void (*next_page)(struct dpages *dp);
 
-	union {
-		unsigned int context_u;
-		struct bvec_iter context_bi;
-	};
+	unsigned int context_u;
 	void *context_ptr;
 
+	struct bio *orig_bio;
+
 	void *vma_invalidate_address;
 	unsigned long vma_invalidate_size;
 };
@@ -210,44 +209,6 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned int o
 	dp->context_ptr = pl;
 }
 
-/*
- * Functions for getting the pages from a bvec.
- */
-static void bio_get_page(struct dpages *dp, struct page **p,
-			 unsigned long *len, unsigned int *offset)
-{
-	struct bio_vec bvec = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
-					     dp->context_bi);
-
-	*p = bvec.bv_page;
-	*len = bvec.bv_len;
-	*offset = bvec.bv_offset;
-
-	/* avoid figuring it out again in bio_next_page() */
-	dp->context_bi.bi_sector = (sector_t)bvec.bv_len;
-}
-
-static void bio_next_page(struct dpages *dp)
-{
-	unsigned int len = (unsigned int)dp->context_bi.bi_sector;
-
-	bvec_iter_advance((struct bio_vec *)dp->context_ptr,
-			  &dp->context_bi, len);
-}
-
-static void bio_dp_init(struct dpages *dp, struct bio *bio)
-{
-	dp->get_page = bio_get_page;
-	dp->next_page = bio_next_page;
-
-	/*
-	 * We just use bvec iterator to retrieve pages, so it is ok to
-	 * access the bvec table directly here
-	 */
-	dp->context_ptr = bio->bi_io_vec;
-	dp->context_bi = bio->bi_iter;
-}
-
 /*
  * Functions for getting the pages from a VMA.
  */
@@ -332,6 +293,21 @@ static void do_region(const blk_opf_t opf, unsigned int region,
 		return;
 	}
 
+	if (dp->orig_bio) {
+		bio = bio_alloc_clone(where->bdev, dp->orig_bio, GFP_NOIO,
+				      &io->client->bios);
+		bio->bi_iter.bi_sector = where->sector;
+		bio->bi_iter.bi_size = where->count << SECTOR_SHIFT;
+		bio->bi_opf = opf;
+		bio->bi_end_io = endio;
+		bio->bi_ioprio = ioprio;
+		store_io_and_region_in_bio(bio, io, region);
+
+		atomic_inc(&io->count);
+		submit_bio(bio);
+		return;
+	}
+
 	/*
 	 * where->count may be zero if op holds a flush and we need to
 	 * send a zero-sized flush.
@@ -468,6 +444,7 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 
 	dp->vma_invalidate_address = NULL;
 	dp->vma_invalidate_size = 0;
+	dp->orig_bio = NULL;
 
 	switch (io_req->mem.type) {
 	case DM_IO_PAGE_LIST:
@@ -475,7 +452,11 @@ static int dp_init(struct dm_io_request *io_req, struct dpages *dp,
 		break;
 
 	case DM_IO_BIO:
-		bio_dp_init(dp, io_req->mem.ptr.bio);
+		/*
+		 * The destination bios clone this bio's biovec directly, so
+		 * there are no per-page accessors to set up here.
+		 */
+		dp->orig_bio = io_req->mem.ptr.bio;
 		break;
 
 	case DM_IO_VMA:
-- 
2.52.0


^ permalink raw reply related
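The accounting problem described in the dm-io patch above comes down to
a truncating bytes-to-sectors conversion. The following userspace sketch
is a simplified model of that loop, not the removed kernel code: the
demo_* helpers are hypothetical, and a 512-byte sector is assumed. It
walks byte-granular vector lengths once and reports how much transfer
each accounting scheme still believes is outstanding.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SECTOR_SHIFT 9

/* Bytes to sectors by shifting: any sub-sector remainder is dropped. */
static unsigned long demo_to_sector(unsigned long bytes)
{
	return bytes >> DEMO_SECTOR_SHIFT;
}

/* Sector-based accounting: returns sectors still counted as remaining. */
static unsigned long demo_remaining_by_sectors(const unsigned int *lens,
					       size_t n, unsigned long sectors)
{
	unsigned long remaining = sectors;

	for (size_t i = 0; i < n && remaining; i++) {
		unsigned long max = remaining << DEMO_SECTOR_SHIFT;
		unsigned long len = lens[i] < max ? lens[i] : max;

		remaining -= demo_to_sector(len); /* 256 bytes counts as 0 */
	}
	return remaining;
}

/* Byte-based accounting: returns bytes still counted as remaining. */
static unsigned long demo_remaining_by_bytes(const unsigned int *lens,
					     size_t n, unsigned long sectors)
{
	unsigned long remaining = sectors << DEMO_SECTOR_SHIFT;

	for (size_t i = 0; i < n && remaining; i++)
		remaining -= lens[i] < remaining ? lens[i] : remaining;
	return remaining;
}

/* One sector split across two pages as 256 + 256 bytes. */
static void demo_self_check(void)
{
	const unsigned int lens[] = { 256, 256 };

	assert(demo_remaining_by_sectors(lens, 2, 1) == 1); /* never done */
	assert(demo_remaining_by_bytes(lens, 2, 1) == 0);   /* fully done */
}
```

With every byte consumed but one sector still "remaining", a loop of the
form "while (remaining)" has nothing left to add and cannot terminate,
which is the hang the patch avoids by cloning the bio instead.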

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-16 14:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260616123443.GA21024@lst.de>

On Tue, Jun 16, 2026 at 02:34:43PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 02, 2026 at 12:10:08PM +0200, Christian Brauner wrote:
> > fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> > forces the holder to be exactly one superblock and prevents several
> > superblocks from sharing one block device. That's what erofs is doing.
> > 
> > Introduce a global dev_t-keyed rhltable mapping each block device to the
> > superblock(s) using it. The holder argument becomes purely the block
> > layer's exclusivity token (a superblock, or a file_system_type for
> > shared devices) and is no longer needed by the fs specific callbacks.
> 
> Err, no.  block devices need to have a specific owner.  If erofs wants
> to share a device between superblock it needs to come up with an entity
> that owns the block devices which is not a superblock.

It already did.

> IMHO sharing devices between superblocks is a bad idea, but that ship
> has sailed, but please keep it contained inside of erofs.

We need a simple device number to superblock mapping anyway and that can
simply be centralized in the vfs. And it can work with anon device
numbers and block device numbers uniformly.

^ permalink raw reply

* [PATCH V2] block: Remove redundant plug in __submit_bio()
From: wenxiong @ 2026-06-16 14:31 UTC (permalink / raw)
  To: linux-block, axboe; +Cc: tom.leiming, yukuai, stable, wenxiong, Wen Xiong

From: Wen Xiong <wenxiong@linux.ibm.com>

The patch removes the automatic plug/unplug operations from __submit_bio()
that were added to cache nsecs time when no explicit plug is used.

The plug mechanism is most effective when batching multiple I/O
operations together. Creating a plug for every bio submission
provides minimal benefit while adding function call overhead and
stack usage for every I/O operation.

Below is a performance comparison with the latest upstream kernel.

Iotype  qd nj  rmix  mpstat busy  mpstat busy without plug
Randrw  1  20  100       53%                 24%
Randrw  1  40  100       70%                 24%
Randrw  1  20  70        40%                 24%
Randrw  1  40  70        60%                 26%
Randrw  1  20  0         14%                 6%
Randrw  1  40  0         20%                 7%

Fixes: 060406c61c7c ("block: add plug while submitting IO")
Signed-off-by: Wen Xiong <wenxiong@linux.ibm.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-core.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 73a41df98c9a..365641266c9e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -669,11 +669,6 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 
 static void __submit_bio(struct bio *bio)
 {
-	/* If plug is not used, add new plug here to cache nsecs time. */
-	struct blk_plug plug;
-
-	blk_start_plug(&plug);
-
 	if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		blk_mq_submit_bio(bio);
 	} else if (likely(bio_queue_enter(bio) == 0)) {
@@ -686,8 +681,6 @@ static void __submit_bio(struct bio *bio)
 			disk->fops->submit_bio(bio);
 		blk_queue_exit(disk->queue);
 	}
-
-	blk_finish_plug(&plug);
 }
 
 /*
-- 
2.52.0


^ permalink raw reply related

* Re: Repeatable, raid1+O_DIRECT, hang/warn
From: Keith Busch @ 2026-06-16 14:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: zkabelac, Vjaceslavs Klimovs, Thorsten Leemhuis, trnka,
	linux-block, dm-devel, Linux kernel regressions list
In-Reply-To: <ajFK5NXkxd6jU5zu@gallifrey>

On Tue, Jun 16, 2026 at 01:08:52PM +0000, Dr. David Alan Gilbert wrote:
> ( lvcreate  -m 1 -L 1G main /dev/sda2 /dev/sdb2 ) rather than
> the old mirror with the same patch, then:
> 
>   a) I get no log errors with either read or write
>   b) read still gives EIO

I've a follow up patch to handle the error properly. You want to see
EINVAL, not EIO, and that error shouldn't be considered for determining
the raid health. Something like what f7b24c7b41f23b5 does, but it's a
little more complicated in this path since it doesn't see the lower
level error status and just converts everything to EIO.

^ permalink raw reply

* [PATCH 2/2] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-16 14:15 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable
In-Reply-To: <20260616141604.328820-1-usama.arif@linux.dev>

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/blkdev.h | 16 ++++++----------
 kernel/sched/core.c    | 12 ++++++++----
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 57e84d59a642..c285a4d9837d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1216,16 +1216,12 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
 		__blk_flush_plug(plug, async);
 }

-/*
- * tsk == current here
- */
-static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+static __always_inline void blk_plug_invalidate_ts(void)
 {
-	struct blk_plug *plug = tsk->plug;
-
-	if (plug)
-		plug->cur_ktime = 0;
-	current->flags &= ~PF_BLOCK_TS;
+	if (unlikely(current->flags & PF_BLOCK_TS)) {
+		current->plug->cur_ktime = 0;
+		current->flags &= ~PF_BLOCK_TS;
+	}
 }

 int blkdev_issue_flush(struct block_device *bdev);
@@ -1251,7 +1247,7 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
 {
 }

-static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+static inline void blk_plug_invalidate_ts(void)
 {
 }

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..e97e98c33be5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5368,6 +5368,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	kmap_local_sched_in();

+	/*
+	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
+	 * invalidate it.
+	 */
+	blk_plug_invalidate_ts();
+
 	fire_sched_in_preempt_notifiers(current);
 	/*
 	 * When switching through a kernel thread, the loop in
@@ -7290,12 +7296,10 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
-		if (tsk->flags & PF_BLOCK_TS)
-			blk_plug_invalidate_ts(tsk);
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
-		else if (tsk->flags & PF_IO_WORKER)
+		else
 			io_wq_worker_running(tsk);
 	}
 }
-- 
2.53.0-Meta

^ permalink raw reply related
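The stale-timestamp problem described in the patch above can be modelled
in userspace. This is a sketch under stated assumptions, not the kernel
implementation: struct demo_task, struct demo_plug, DEMO_PF_BLOCK_TS and
the explicit clock_now parameter are hypothetical stand-ins for
task_struct, blk_plug, PF_BLOCK_TS and ktime_get_ns().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PF_BLOCK_TS 0x1u

struct demo_plug { uint64_t cur_ktime; };
struct demo_task { unsigned int flags; struct demo_plug *plug; };

/* Return the cached time if present, else sample the clock and cache it. */
static uint64_t demo_time_get_ns(struct demo_task *t, uint64_t clock_now)
{
	if (!t->plug)
		return clock_now;
	if (!t->plug->cur_ktime) {
		t->plug->cur_ktime = clock_now;
		t->flags |= DEMO_PF_BLOCK_TS;
	}
	return t->plug->cur_ktime;
}

/* Meant to run for the task being scheduled in after a context switch. */
static void demo_invalidate_ts(struct demo_task *t)
{
	if (t->flags & DEMO_PF_BLOCK_TS) {
		t->plug->cur_ktime = 0; /* flag set implies plug != NULL */
		t->flags &= ~DEMO_PF_BLOCK_TS;
	}
}

/* Time moves from 100 to 200 while the task is switched out. */
static void demo_self_check(void)
{
	struct demo_plug plug = { 0 };
	struct demo_task task = { 0, &plug };

	assert(demo_time_get_ns(&task, 100) == 100);
	assert(demo_time_get_ns(&task, 200) == 100); /* stale without reset */
	demo_invalidate_ts(&task);
	assert(demo_time_get_ns(&task, 200) == 200); /* fresh after reset */
}
```

The second assertion is the behaviour a preempted task would see if no
invalidation ran on the way back in.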

* [PATCH 1/2] kernel/fork: clear PF_BLOCK_TS in copy_process()
From: Usama Arif @ 2026-06-16 14:15 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable
In-Reply-To: <20260616141604.328820-1-usama.arif@linux.dev>

PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is
non-NULL, and blk_finish_plug() clears it via __blk_flush_plug()
before NULLing the plug pointer.  copy_process() breaks the
invariant by inheriting PF_BLOCK_TS from the parent while resetting
the child's plug to NULL.

Clear PF_BLOCK_TS alongside that assignment so callers can rely on
"PF_BLOCK_TS set implies current->plug != NULL" and dereference
current->plug unguarded.

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 892a95214c54..13e38e89a1f3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,6 +2338,7 @@ __latent_entropy struct task_struct *copy_process(

 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
+	p->flags &= ~PF_BLOCK_TS;
 #endif
 	futex_init_task(p);

-- 
2.53.0-Meta

^ permalink raw reply related
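The invariant from the fork patch above, "PF_BLOCK_TS set implies
current->plug != NULL", can be illustrated with a small model. The
demo_* names are hypothetical; demo_fork() only imitates the two
relevant assignments (copying the parent's flags and resetting the
plug), with a switch to show the behaviour with and without the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PF_BLOCK_TS 0x1u

struct demo_plug { unsigned long long cur_ktime; };
struct demo_task { unsigned int flags; struct demo_plug *plug; };

/* The invariant: the flag may only be set while a plug exists. */
static bool demo_invariant_holds(const struct demo_task *t)
{
	return !(t->flags & DEMO_PF_BLOCK_TS) || t->plug != NULL;
}

/* Child inherits the parent's flags but always starts without a plug. */
static void demo_fork(struct demo_task *child, const struct demo_task *parent,
		      bool clear_flag)
{
	*child = *parent;
	child->plug = NULL;
	if (clear_flag)
		child->flags &= ~DEMO_PF_BLOCK_TS;
}

/* Fork while the parent holds a plug with a cached timestamp. */
static void demo_self_check(void)
{
	struct demo_plug plug = { 100 };
	struct demo_task parent = { DEMO_PF_BLOCK_TS, &plug };
	struct demo_task child;

	demo_fork(&child, &parent, false);
	assert(!demo_invariant_holds(&child)); /* flag set, plug NULL */
	demo_fork(&child, &parent, true);
	assert(demo_invariant_holds(&child));  /* flag cleared with plug */
}
```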

* [PATCH 0/2] block: invalidate cached plug timestamp on context switch
From: Usama Arif @ 2026-06-16 14:15 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif

The details are in patch 2. The main purpose of this series is to
invalidate the cached timestamp on every context switch. Previously
this was done only in sched_update_worker(), which resulted in
blk-iocost reading stale timestamps and throttling based on wrong
information.

Patch 1 is a prerequisite to create the invariant that
PF_BLOCK_TS set implies current->plug != NULL.

v2 -> v3:
  https://lore.kernel.org/all/20260612094042.3350401-1-usama.arif@linux.dev/
  - Add patch 1 to clear PF_BLOCK_TS in copy_process() so the
    invariant survives fork.
  - Drop the if (plug) NULL check inside blk_plug_invalidate_ts(),
    relying on the invariant established by patch 1. (Peter Zijlstra)

v1 -> v2:
  https://lore.kernel.org/all/20260611231428.345098-1-usama.arif@linux.dev/
  - Move the PF_BLOCK_TS check into blk_plug_invalidate_ts() and
    upgrade it to __always_inline (Peter Zijlstra).
  - Drop the tsk parameter; the helper only ever operates on current.
 
Usama Arif (2):
  kernel/fork: clear PF_BLOCK_TS in copy_process()
  block: invalidate cached plug timestamp after task switch

 include/linux/blkdev.h | 16 ++++++----------
 kernel/fork.c          |  1 +
 kernel/sched/core.c    | 12 ++++++++----
 3 files changed, 15 insertions(+), 14 deletions(-)

-- 
2.53.0-Meta


^ permalink raw reply
