Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH] rnbd-clt: Use common error handling code in rnbd_get_iu()
From: Haris Iqbal @ 2026-06-12  9:39 UTC (permalink / raw)
  To: Markus Elfring; +Cc: linux-block, Jack Wang, Jens Axboe, LKML, kernel-janitors
In-Reply-To: <c9f86f0b-331d-4cb1-b8a2-00bc1e857ec7@web.de>

On Wed, Jun 10, 2026 at 9:03 PM Markus Elfring <Markus.Elfring@web.de> wrote:
>
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Wed, 10 Jun 2026 20:58:47 +0200
>
> Use an additional label so that a bit of exception handling can be better
> reused at the end of an if branch.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
>  drivers/block/rnbd/rnbd-clt.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
> index 4d6725a0035e..d8e3f145ee2f 100644
> --- a/drivers/block/rnbd/rnbd-clt.c
> +++ b/drivers/block/rnbd/rnbd-clt.c
> @@ -329,10 +329,8 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
>                 return NULL;
>
>         permit = rnbd_get_permit(sess, con_type, wait);
> -       if (!permit) {
> -               kfree(iu);
> -               return NULL;
> -       }
> +       if (!permit)
> +               goto free_iu;
>
>         iu->permit = permit;
>         /*
> @@ -349,6 +347,7 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
>
>         if (sg_alloc_table(&iu->sgt, 1, GFP_KERNEL)) {
>                 rnbd_put_permit(sess, permit);
> +free_iu:

Thanks for the patch.
It does what it mentioned in the commit description, but maybe we do
not need to do this?

If there was a kfree before the last "return iu;", it would have made
more sense. But jumping to the middle of a conditional block to reuse
the free and return seems forced.

>                 kfree(iu);
>                 return NULL;
>         }
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v4 7/8] Bluetooth: qca: Set NVMEM BD address quirks when address is invalid
From: Dmitry Baryshkov @ 2026-06-12  9:12 UTC (permalink / raw)
  To: Loic Poulain
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-7-45712e6b22c6@oss.qualcomm.com>

On Tue, Jun 09, 2026 at 09:52:32AM +0200, Loic Poulain wrote:
> When the controller BD address is invalid (zero or default),
> set the NVMEM quirks to allow retrieving the address from a
> 'local-bd-address' NVMEM cell. The BD address is often stored
> alongside the WiFi MAC address in big-endian format, so also
> set the big-endian quirk.

Okay, this answers my question to the previous patch. We need to support
BE addresses.

> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
>  drivers/bluetooth/btqca.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH v4 6/8] Bluetooth: hci_sync: Add NVMEM-backed BD address retrieval
From: Dmitry Baryshkov @ 2026-06-12  9:11 UTC (permalink / raw)
  To: Loic Poulain
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-6-45712e6b22c6@oss.qualcomm.com>

On Tue, Jun 09, 2026 at 09:52:31AM +0200, Loic Poulain wrote:
> Some devices store the Bluetooth BD address in non-volatile
> memory, which can be accessed through the NVMEM framework.
> Similar to Ethernet or WiFi MAC addresses, add support for
> reading the BD address from a 'local-bd-address' NVMEM cell.
> 
> As with the device-tree provided BD address, add a quirk to
> indicate whether a device or platform should attempt to read
> the address from NVMEM when no valid in-chip address is present.
> Also add a quirk to indicate if the address is stored in
> big-endian byte order.
> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
>  include/net/bluetooth/hci.h | 18 ++++++++++++++++++
>  net/bluetooth/hci_sync.c    | 39 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
> index 572b1c620c5d653a1fe10b26c1b0ba33e8f4968f..7686466d1109253b0d75edeb5f6a99fb98ce4cc6 100644
> --- a/include/net/bluetooth/hci.h
> +++ b/include/net/bluetooth/hci.h
> @@ -164,6 +164,24 @@ enum {
>  	 */
>  	HCI_QUIRK_BDADDR_PROPERTY_BROKEN,
>  
> +	/* When this quirk is set, the public Bluetooth address
> +	 * initially reported by HCI Read BD Address command
> +	 * is considered invalid. The public BD Address can be
> +	 * retrieved via a 'local-bd-address' NVMEM cell.

Why do we need a quirk here? Can't we always assume that if there is an
NVMEM cell, it contains a correct address, even if HCI command returned
a seemingly-sensible one?

> +	 *
> +	 * This quirk can be set before hci_register_dev is called or
> +	 * during the hdev->setup vendor callback.
> +	 */
> +	HCI_QUIRK_USE_BDADDR_NVMEM,
> +
> +	/* When this quirk is set, the Bluetooth Device Address provided by
> +	 * the 'local-bd-address' NVMEM is stored in big-endian order.
> +	 *
> +	 * This quirk can be set before hci_register_dev is called or
> +	 * during the hdev->setup vendor callback.
> +	 */
> +	HCI_QUIRK_BDADDR_NVMEM_BE,

Also, is this necessary? Are the devices which store the address in the
wrong format in the NVMEM?

> +
>  	/* When this quirk is set, the duplicate filtering during
>  	 * scanning is based on Bluetooth devices addresses. To allow
>  	 * RSSI based updates, restart scanning if needed.

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  8:27 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <aivABK8nwNR6Z__A@infradead.org>



在 2026/6/12 17:45, Christoph Hellwig 写道:
> On Fri, Jun 12, 2026 at 02:54:22PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2026/6/12 11:11, Qu Wenruo 写道:
>>> Hi,
>>>
>>> Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm
>>> experiencing weird data corruption.
>>>
>>> During test case generic/708, I'm reliably hitting garbage pages at the
>>> last 64KiB, the garbage even contains an ELF header.
>>
>> Added ftrace shows that, since btrfs has to disable page fault to avoid
>> certain deadlock, bio_iov_iter_bounced() failed with a short copy.
>>
>> Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() only
>> copied 64K then failed due to the disabled page fault.
>>
>> I don't think it's a coincident that the short 64K exactly matches where the
>> garbage is (the last 64K).
>>
>> I guess it's in the error path we didn't properly revert the iov iter?
> 
> Looks like it.  The better option would probably be not to give up on
> a short copy, and just reduce the bio size to fit the short copy
> even if that wastes a little memory.

And that will only work if IOMAP_PARTIAL is specified.

For now I'll just fix the short copy path, and continue to make btrfs 
work with IOMAP_DIO_BOUNCE first (already testing and finished one round).

The partial bio return solution will require some way to pass the dio 
flag, or a refactor to move all the error handling to the only caller.
Anyway it will be a dedicated series for the change, meanwhile the 
simple revert fix will be sent out soon along with the btrfs enablement.

Thanks,
Qu

^ permalink raw reply

* Re: [PATCH v4 2/8] dt-bindings: net: wireless: qcom,ath10k: Document NVMEM cells
From: Krzysztof Kozlowski @ 2026-06-12  8:26 UTC (permalink / raw)
  To: Loic Poulain, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Bjorn Andersson, Konrad Dybcio, Jens Axboe,
	Johannes Berg, Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-2-45712e6b22c6@oss.qualcomm.com>

On 09/06/2026 09:52, Loic Poulain wrote:
> Document the NVMEM cells supported by the ath10k driver, the
> mac-address, pre-calibration data, and calibration data.
> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---


Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>

Best regards,
Krzysztof

^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Christoph Hellwig @ 2026-06-12  8:15 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <f1dbdcd6-4e6e-4304-8fa5-59c2c60252ad@suse.com>

On Fri, Jun 12, 2026 at 02:54:22PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2026/6/12 11:11, Qu Wenruo 写道:
> > Hi,
> > 
> > Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm
> > experiencing weird data corruption.
> > 
> > During test case generic/708, I'm reliably hitting garbage pages at the
> > last 64KiB, the garbage even contains an ELF header.
> 
> Added ftrace shows that, since btrfs has to disable page fault to avoid
> certain deadlock, bio_iov_iter_bounced() failed with a short copy.
> 
> Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() only
> copied 64K then failed due to the disabled page fault.
> 
> I don't think it's a coincident that the short 64K exactly matches where the
> garbage is (the last 64K).
> 
> I guess it's in the error path we didn't properly revert the iov iter?

Looks like it.  The better option would probably be not to give up on
a short copy, and just reduce the bio size to fit the short copy
even if that wastes a little memory.


^ permalink raw reply

* Re: [PATCH v4 2/8] dt-bindings: net: wireless: qcom,ath10k: Document NVMEM cells
From: Loic Poulain @ 2026-06-12  7:57 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260610-funny-paper-warthog-25fa0a@quoll>

On Wed, Jun 10, 2026 at 9:16 AM Krzysztof Kozlowski <krzk@kernel.org> wrote:
>
> On Tue, Jun 09, 2026 at 09:52:27AM +0200, Loic Poulain wrote:
> > Document the NVMEM cells supported by the ath10k driver, the
> > mac-address, pre-calibration data, and calibration data.
> >
> > Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  .../devicetree/bindings/net/wireless/qcom,ath10k.yaml    | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > index c21d66c7cd558ab792524be9afec8b79272d1c87..7391df5e7071e626af4c64b9919d48c41ac09f1e 100644
> > --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > @@ -92,6 +92,22 @@ properties:
> >
> >    ieee80211-freq-limit: true
> >
> > +  nvmem-cells:
> > +    minItems: 1
> > +    maxItems: 3
> > +    description: |
>
> If there is going to be resend:
> Do not need '|' unless you need to preserve formatting.

Sure, thanks.

>
> > +      References to nvmem cells for MAC address and/or calibration data.
> > +      Supported cell names are mac-address, calibration, and pre-calibration.
> > +
> > +  nvmem-cell-names:
> > +    minItems: 1
> > +    maxItems: 3
> > +    items:
> > +      enum:
> > +        - mac-address
> > +        - calibration
> > +        - pre-calibration
>
> This means you expect random order with variable number of items. Is
> that intentional? If yes, please provide short explanation in the commit
> msg.

Yes we may or may have any of those cells. Will document.

Thanks,
Loic

^ permalink raw reply

* Re: [PATCH] block: invalidate cached plug timestamp after task switch
From: Peter Zijlstra @ 2026-06-12  7:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, shakeel.butt, hannes, riel,
	kernel-team, stable
In-Reply-To: <20260611231428.345098-1-usama.arif@linux.dev>

On Thu, Jun 11, 2026 at 04:14:28PM -0700, Usama Arif wrote:
> blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
> and marks the task with PF_BLOCK_TS. That cache is only valid while the
> task keeps running; if the task is switched out, wall-clock time
> advances and the cached value must not be reused when the task runs again.
> 
> The existing invalidation covers explicit plug flushes through
> __blk_flush_plug(), and the schedule() / rtmutex paths through
> sched_update_worker(). It does not cover in-kernel preemption paths such
> as preempt_schedule(), preempt_schedule_notrace(), and
> preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
> return without calling sched_update_worker().
> 
> As a result, a task preempted while holding a plug with PF_BLOCK_TS set
> can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
> then consumes that stale timestamp through ioc_now(), producing stale vnow
> values for throttle decisions, and through ioc_rqos_done(), inflating
> on-queue time and feeding false missed-QoS samples into vrate
> adjustment.
> 
> Move the schedule-side invalidation to finish_task_switch(), which runs
> for the scheduled-in task after every actual context switch regardless
> of which schedule entry point was used. Keep __blk_flush_plug() as the
> explicit flush/finish-plug invalidation path, and remove only the
> PF_BLOCK_TS handling from sched_update_worker().
> 
> Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
> Cc: stable@vger.kernel.org
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  kernel/sched/core.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8b791e9e9f67..bf024ca115ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5368,6 +5368,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	 */
>  	kmap_local_sched_in();
>  
> +	/*
> +	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
> +	 * invalidate it.
> +	 */
> +	if (unlikely(current->flags & PF_BLOCK_TS))
> +		blk_plug_invalidate_ts(current);

Can you make that just blk_plug_invalidate_ts() and move the branch into
the function itself, which is already inline anyway (but perhaps upgrade
it to __always_inline).

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-12  5:28 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Keith Busch, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <airX6BmMQ14Rvjcb@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > It's entirely possible a device supports byte aligned addresses. The
> > > block layer just doesn't let a driver report that. So either it really
> > > was successful because you found a bug that skips the alignment checks,
> > > or your device silently corrupted your payload.
> 
> I tried this on different hardware, I find it hard to say all those
> devices were corrupting the payload.

I think in the other thread we agreed that we are currently missing
the alignment check for fast-path bios not hitting the splitting code,
so maybe that is something you see.  Additionally we're missing the
checks for purely bio based drivers not calling the splitting helper
at all, but I don't think that applies here.

> > > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > > though, in not taking the optimization when it was possible. So here's
> > > an alternative suggestion that should get things working as expected:
> > 
> > The fix below looks like it is addressing a real bug.  I'm not sure if
> > Carlos is hitting it, but we were missing the alignment checks for
> > single-bvec fast path bios so far indeed.
> 
> You left context out so I'm assuming by the fix you meant Keith's patch.

Yes.


^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  5:24 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <b866b2e6-292f-4c5e-ae5a-d77e983cfefd@suse.com>



在 2026/6/12 11:11, Qu Wenruo 写道:
> Hi,
> 
> Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm 
> experiencing weird data corruption.
> 
> During test case generic/708, I'm reliably hitting garbage pages at the 
> last 64KiB, the garbage even contains an ELF header.

Added ftrace shows that, since btrfs has to disable page fault to avoid 
certain deadlock, bio_iov_iter_bounced() failed with a short copy.

Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() 
only copied 64K then failed due to the disabled page fault.

I don't think it's a coincident that the short 64K exactly matches where 
the garbage is (the last 64K).

I guess it's in the error path we didn't properly revert the iov iter?

> 
> In that test case, we mmap a 2MiB sized buffer from another file, and 
> use that 2MiB mmapped memory as buffer for direct IO, write into a 
> different file.
> 
> The source file has dirty page cache for that 2MiB range, and no 
> writeback happened during that direct IO write.
> 
> So it means as long as we fault in all the pages of that 2MiB buffer, we 
> should be able to copy them into the newly allocated folio, and submit a 
> bio using the bounced pages.
> 
> But the last 64KiB is reliably corrupted with some ELF header.
> 
> I'm wondering where the corruption is from, especially it seems btrfs 
> has very little to do, except calling fault_in_iov_readable() to fault 
> in all the pages.
> 
> Thanks,
> Qu


^ permalink raw reply

* Re: [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Fengnan @ 2026-06-12  1:53 UTC (permalink / raw)
  To: Anuj Gupta, axboe, hch, kbusch, lidiangang, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Alok Rathore
In-Reply-To: <20260611162719.910837-1-anuj20.g@samsung.com>

在 2026/6/12 00:27, Anuj Gupta 写道:
> blk_hctx_poll() can busy-poll until a completion is found or
> need_resched() becomes true. On preemptible kernels, the scheduler can
> set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
> return before the loop condition re-evaluates it. After the context
> switch, the flag is cleared, so the poller can continue spinning instead
> of returning to its caller.
>
> This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
> which holds the rcu_read_lock() while calling bio_poll(). If another
> poller on the same polled queue drains the available completions, this
> poller may repeatedly find no completions and remain inside the RCU
> read-side critical section long enough to trigger RCU stall reports:
>
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
> rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
> task:fio state:R  running task     stack:0     pid:3961
> Call Trace:
> <TASK>
> ? nvme_poll+0x36/0xa0 [nvme]
> ? blk_hctx_poll+0x39/0x90
> ? blk_mq_poll+0x30/0x60
> ? bio_poll+0x87/0x170
> ? iocb_bio_iopoll+0x32/0x50
> ? io_uring_classic_poll+0x25/0x50
> ? io_do_iopoll+0x216/0x420
> ? __do_sys_io_uring_enter+0x2c7/0x7c0
>
> Reproducible with:
>
> fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
> --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
> --registerfiles=1 --group_reporting --thread
>
> Record the starting jiffy and exit the loop once jiffies has advanced.
> This bounds each blk_hctx_poll() invocation while also covering the
> case where the reschedule flag was cleared by the context switch
> before the loop condition could observe it.
>
> Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
> ---
>   block/blk-mq.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..d85fa4a51e79 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>   			 struct io_comp_batch *iob, unsigned int flags)
>   {
>   	int ret;
> +	unsigned long start = jiffies;
how about this :

unsigned long timeout = jiffies + 1;
...
} while (!need_resched() && time_before(jiffies, timeout));

>   
>   	do {
>   		ret = q->mq_ops->poll(hctx, iob);
> @@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>   		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>   			break;
>   		cpu_relax();
> -	} while (!need_resched());
> +	} while (!need_resched() && time_before_eq(jiffies, start));
>   
>   	return 0;
>   }

^ permalink raw reply

* Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  1:41 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List

Hi,

Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm 
experiencing weird data corruption.

During test case generic/708, I'm reliably hitting garbage pages at the 
last 64KiB, the garbage even contains an ELF header.

In that test case, we mmap a 2MiB sized buffer from another file, and 
use that 2MiB mmapped memory as buffer for direct IO, write into a 
different file.

The source file has dirty page cache for that 2MiB range, and no 
writeback happened during that direct IO write.

So it means as long as we fault in all the pages of that 2MiB buffer, we 
should be able to copy them into the newly allocated folio, and submit a 
bio using the bounced pages.

But the last 64KiB is reliably corrupted with some ELF header.

I'm wondering where the corruption is from, especially it seems btrfs 
has very little to do, except calling fault_in_iov_readable() to fault 
in all the pages.

Thanks,
Qu

^ permalink raw reply

* [PATCH] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-11 23:14 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 kernel/sched/core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..bf024ca115ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5368,6 +5368,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	kmap_local_sched_in();

+	/*
+	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
+	 * invalidate it.
+	 */
+	if (unlikely(current->flags & PF_BLOCK_TS))
+		blk_plug_invalidate_ts(current);
+
 	fire_sched_in_preempt_notifiers(current);
 	/*
 	 * When switching through a kernel thread, the loop in
@@ -7290,12 +7297,10 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
-		if (tsk->flags & PF_BLOCK_TS)
-			blk_plug_invalidate_ts(tsk);
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
-		else if (tsk->flags & PF_IO_WORKER)
+		else
 			io_wq_worker_running(tsk);
 	}
 }
-- 
2.53.0-Meta

^ permalink raw reply related

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11 18:31 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <96f9390b-a547-442f-b0a9-99a5ba52c0e1@amd.com>

On Thu, Jun 11, 2026 at 10:39:14AM +0200, Christian König wrote:
> On 6/11/26 10:29, Andy Shevchenko wrote:
> > On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
> >> On 6/10/26 17:02, Andy Shevchenko wrote:
> >>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
> >>>> On 6/10/26 10:18, Kaitao Cheng wrote:
> >>>>> 在 2026/6/10 16:07, Christian König 写道:

...

> >>>>> Should we revert to v1, or keep list_for_each_entry() and
> >>>>> list_for_each_entry_safe() as they are, close this thread, and make no
> >>>>> changes?
> >>>>>
> >>>>> Link to v1:
> >>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
> >>>>>
> >>>>> Or do you have any better suggestions?
> >>>>
> >>>> v1 looks perfectly reasonable to me.
> >>>
> >>> But why not just hiding that once for all (in case they don't use the temporary
> >>> iterator)? Easy to automate, robust — everyone is happy?
> >>
> >> As far as I can see that is an extremely bad idea.
> >>
> >> The distinction between the use cases of 'iterating the list' and 'iterating
> >> the list while you modify it' is completely intentional.
> > 
> > What I meant is to keep the name, just drop the parameter (make it hidden and
> > being defined inside list_for_each_*_safe() cases).
> 
> Ah, sorry I was still thinking the suggestion is to merge
> list_for_each_entry() and list_for_each_entry_safe().
> 
> If the modification is done all at once or in steps doesn't really matter for
> me as long as the patch can be re-created reproducible.
> 
> But I'm wondering if we couldn't improve the name at the same time. The
> _safe() postfix has caused tons of confusion where especially beginners
> thought that it is a thread-safe variant, which it clearly isn't.
> 
> The _mutable() postfix sounds like a much better description to what happens here.

I see, no objections from my side, but with the new name we don't need to have
treewide change, the downside that one should undertake this to finish the job,
otherwise we will have _safe() and _mutable() for a long time.

> >> See the bool type can be implemented by int as well, but it is just a
> >> different use case.
> > 
> >>>> You should just include some patches in the same patch set to actually use
> >>>> the new macros.
> >>>>
> >>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
> >>>> the new macro I'm happy to review that.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta @ 2026-06-11 16:27 UTC (permalink / raw)
  To: axboe, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <CGME20260611163403epcas5p2bba1c085eaab24852acfb0751ef148ab@epcas5p2.samsung.com>

blk_hctx_poll() can busy-poll until a completion is found or
need_resched() becomes true. On preemptible kernels, the scheduler can
set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
return before the loop condition re-evaluates it. After the context
switch, the flag is cleared, so the poller can continue spinning instead
of returning to its caller.

This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
which holds the rcu_read_lock() while calling bio_poll(). If another
poller on the same polled queue drains the available completions, this
poller may repeatedly find no completions and remain inside the RCU
read-side critical section long enough to trigger RCU stall reports:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
task:fio state:R  running task     stack:0     pid:3961
Call Trace:
<TASK>
? nvme_poll+0x36/0xa0 [nvme]
? blk_hctx_poll+0x39/0x90
? blk_mq_poll+0x30/0x60
? bio_poll+0x87/0x170
? iocb_bio_iopoll+0x32/0x50
? io_uring_classic_poll+0x25/0x50
? io_do_iopoll+0x216/0x420
? __do_sys_io_uring_enter+0x2c7/0x7c0

Reproducible with:

fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
--numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
--registerfiles=1 --group_reporting --thread

Record the starting jiffy and exit the loop once jiffies has advanced.
This bounds each blk_hctx_poll() invocation while also covering the
case where the reschedule flag was cleared by the context switch
before the loop condition could observe it.

Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..d85fa4a51e79 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			 struct io_comp_batch *iob, unsigned int flags)
 {
 	int ret;
+	unsigned long start = jiffies;

 	do {
 		ret = q->mq_ops->poll(hctx, iob);
@@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
 		cpu_relax();
-	} while (!need_resched());
+	} while (!need_resched() && time_before_eq(jiffies, start));

 	return 0;
 }
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611133833.GA14645@lst.de>

On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > It's entirely possible a device supports byte aligned addresses. The
> > block layer just doesn't let a driver report that. So either it really
> > was successful because you found a bug that skips the alignment checks,
> > or your device silently corrupted your payload.

I tried this on different hardware, I find it hard to say all those
devices were corrupting the payload.

> > 
> > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > though, in not taking the optimization when it was possible. So here's
> > an alternative suggestion that should get things working as expected:
> 
> The fix below looks like it is addressing a real bug.  I'm not sure if
> Carlos is hitting it, but we were missing the alignment checks for
> single-bvec fast path bios so far indeed.

You left context out so I'm assuming by the fix you meant Keith's patch.
I can give it a spin and see if it fixes the behavior I'm talking
about. Give me some time as I have a bunch of stuff to do tonight so
likely I will only manage to try this tomorrow.

^ permalink raw reply

* [PATCH 4/4] block: add configurable error injection
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Add a new block error injection interface that allows to inject specific
status code for specific ranges.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 Documentation/block/error-injection.rst |  59 +++++
 Documentation/block/index.rst           |   1 +
 block/Kconfig                           |   8 +
 block/Makefile                          |   1 +
 block/blk-core.c                        |   4 +
 block/blk-sysfs.c                       |   5 +
 block/error-injection.c                 | 315 ++++++++++++++++++++++++
 block/error-injection.h                 |  21 ++
 block/genhd.c                           |   4 +
 include/linux/blkdev.h                  |   6 +
 10 files changed, 424 insertions(+)
 create mode 100644 Documentation/block/error-injection.rst
 create mode 100644 block/error-injection.c
 create mode 100644 block/error-injection.h

diff --git a/Documentation/block/error-injection.rst b/Documentation/block/error-injection.rst
new file mode 100644
index 000000000000..81f31af82e65
--- /dev/null
+++ b/Documentation/block/error-injection.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Configurable Error Injection
+============================
+
+Overview
+--------
+
+Configurable error injection allows injecting specific block layer status codes
+for sector ranges of a block device.  Errors can be injected unconditionally, or
+with a given probability.
+
+To use configurable error injection, CONFIG_BLK_ERROR_INJECTION must be enabled.
+
+The only interface is the error_injection debugfs file, which is created for
+each registered gendisk.  Writes to this file are used to create or delete rules
+and reads return a list of the current error injection sites.
+
+Options
+-------
+
+The following options specify the operations:
+
+===================	=======================================================
+add			add a new rule
+removeall		remove all existing rules
+===================	=======================================================
+
+The following options specify the details of the rule for the add operation:
+
+===================	=======================================================
+op=<string>		block layer operation this rule applies to.  This uses
+			the XYZ for each REQ_OP_XYZ operation, e.g. READ, WRITE
+			or DISCARD. Mandatory.
+status=<string>		Status to return.  This uses XYZ for each BLK_STS_XYZ
+			code, e.g. IOERR or MEDIUM. Mandatory.
+start=<number>		First block layer sector the rule applies to.
+			Optional, defaults to 0.
+nr_sectors=<number>	Number of sectors this rule applies.
+			Optional, defaults to the remainder of the device.
+chance=<number>		Only return a failure with a likelihood of 1/chance.
+			Optional, defaults to 1 (always).
+===================	=======================================================
+
+Example
+-------
+
+Return BLK_STS_IOERR for one in 10 reads of sector 0 of /dev/nvme0n1:
+
+	$ echo 'add,op=READ,start=0,status=IOERR,chance=10' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Return BLK_STS_MEDIUM for every write to /dev/nvme0n1:
+
+	$ echo 'add,op=WRITE,start=0,status=MEDIUM' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Remove all rules for /dev/nvme0n1:
+
+	$ echo 'removeall' > /sys/kernel/debug/block/nvme0n1/error_injection
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..bfa1bbd31ddf 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -22,3 +22,4 @@ Block
    switching-sched
    writeback_cache_control
    ublk
+   error-injection
diff --git a/block/Kconfig b/block/Kconfig
index 15027963472d..70e4a66d941f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -221,6 +221,14 @@ config BLOCK_HOLDER_DEPRECATED
 config BLK_MQ_STACKING
 	bool
 
+config BLK_ERROR_INJECTION
+	bool "Enable block layer error injection"
+	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
+	help
+	  Enable inserting arbitrary block errors through a debugfs interface.
+
+	  See Documentation/block/error-injection.rst for details.
+
 source "block/Kconfig.iosched"
 
 endif # BLOCK
diff --git a/block/Makefile b/block/Makefile
index 54130faacc21..e7bd320e3d69 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,7 @@ obj-y		:= bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
 			disk-events.o blk-ia-ranges.o early-lookup.o
 
+obj-$(CONFIG_BLK_ERROR_INJECTION) += error-injection.o
 obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
diff --git a/block/blk-core.c b/block/blk-core.c
index beaab7a71fba..73a41df98c9a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -50,6 +50,7 @@
 #include "blk-cgroup.h"
 #include "blk-throttle.h"
 #include "blk-ioprio.h"
+#include "error-injection.h"
 
 struct dentry *blk_debugfs_root;
 
@@ -767,6 +768,9 @@ static void __submit_bio_noacct_mq(struct bio *bio)
 
 void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 {
+	if (unlikely(blk_error_inject(bio)))
+		return;
+
 	blk_cgroup_bio_start(bio);
 
 	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f22c1f253eb3..520972676ab4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -19,6 +19,7 @@
 #include "blk-wbt.h"
 #include "blk-cgroup.h"
 #include "blk-throttle.h"
+#include "error-injection.h"
 
 struct queue_sysfs_entry {
 	struct attribute attr;
@@ -933,6 +934,8 @@ static void blk_debugfs_remove(struct gendisk *disk)
 
 	blk_debugfs_lock_nomemsave(q);
 	blk_trace_shutdown(q);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_exit(disk);
 	debugfs_remove_recursive(q->debugfs_dir);
 	q->debugfs_dir = NULL;
 	q->sched_debugfs_dir = NULL;
@@ -963,6 +966,8 @@ int blk_register_queue(struct gendisk *disk)
 
 	memflags = blk_debugfs_lock(q);
 	q->debugfs_dir = debugfs_create_dir(disk->disk_name, blk_debugfs_root);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_init(disk);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_register(q);
 	blk_debugfs_unlock(q, memflags);
diff --git a/block/error-injection.c b/block/error-injection.c
new file mode 100644
index 000000000000..d24c90e9a25f
--- /dev/null
+++ b/block/error-injection.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Christoph Hellwig.
+ */
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include "blk.h"
+#include "error-injection.h"
+
+struct blk_error_inject {
+	struct list_head		entry;
+	sector_t			start;
+	sector_t			end;
+	enum req_op			op;
+	blk_status_t			status;
+
+	/* only inject every 1 / chance times */
+	unsigned int			chance;
+};
+
+DEFINE_STATIC_KEY_FALSE(blk_error_injection_enabled);
+
+bool __blk_error_inject(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		if (bio_op(bio) != inj->op)
+			continue;
+		/*
+		 * This never matches 0-sized bios like empty WRITEs with
+		 * REQ_PREFLUSH or ZONE_RESET_ALL.  While adding a special case
+		 * for them would be trivial, that means any WRITE rule would
+		 * trigger for flushes.  So before we can make this work
+		 * properly, we'll need to start using REQ_OP_FLUSH for pure
+		 * flushes at the bio level like we already do in blk-mq.
+		 */
+		if (bio->bi_iter.bi_sector > inj->end ||
+		    bio_end_sector(bio) <= inj->start)
+			continue;
+		if (inj->chance > 1 && (get_random_u32() % inj->chance) != 0)
+			continue;
+
+		pr_info_ratelimited("%pg: injecting %s error for %s at sector %llu:%u\n",
+				disk->part0, blk_status_to_str(inj->status),
+				blk_op_str(inj->op), bio->bi_iter.bi_sector,
+				bio_sectors(bio));
+		bio->bi_status = inj->status;
+		rcu_read_unlock();
+		bio_endio(bio);
+		return true;
+	}
+	rcu_read_unlock();
+	return false;
+}
+
+static int error_inject_add(struct gendisk *disk, enum req_op op,
+		sector_t start, u64 nr_sectors, blk_status_t status,
+		unsigned int chance)
+{
+	struct blk_error_inject *inj;
+	int error = -EINVAL;
+
+	if (op == REQ_OP_LAST)
+		return -EINVAL;
+	if (status == BLK_STS_OK)
+		return -EINVAL;
+
+	inj = kzalloc_obj(*inj);
+	if (!inj)
+		return -ENOMEM;
+
+	if (nr_sectors) {
+		if (U64_MAX - nr_sectors < start)
+			goto out_free_inj;
+		inj->end = start + nr_sectors - 1;
+	} else {
+		inj->end = U64_MAX;
+	}
+
+	inj->op = op;
+	inj->start = start;
+	inj->status = status;
+	inj->chance = chance;
+
+	pr_debug_ratelimited("%pg: adding %s injection for %s at sector %llu:%llu\n",
+			disk->part0, blk_status_to_str(status),
+			blk_op_str(op),
+			start, nr_sectors);
+
+	/*
+	 * Add to the front of the list so that newer entries can partially
+	 * override other entries.  This also intentionally allows duplicate
+	 * entries as there is no real reason to reject them.
+	 */
+	mutex_lock(&disk->error_injection_lock);
+	if (!disk_live(disk)) {
+		mutex_unlock(&disk->error_injection_lock);
+		error = -ENODEV;
+		goto out_free_inj;
+	}
+	if (list_empty(&disk->error_injection_list))
+		static_branch_inc(&blk_error_injection_enabled);
+	list_add_rcu(&inj->entry, &disk->error_injection_list);
+	set_bit(GD_ERROR_INJECT, &disk->state);
+	mutex_unlock(&disk->error_injection_lock);
+	return 0;
+
+out_free_inj:
+	kfree(inj);
+	return error;
+}
+
+static void error_inject_removeall(struct gendisk *disk)
+{
+	struct blk_error_inject *inj;
+
+	mutex_lock(&disk->error_injection_lock);
+	clear_bit(GD_ERROR_INJECT, &disk->state);
+	while ((inj = list_first_entry_or_null(&disk->error_injection_list,
+			struct blk_error_inject, entry))) {
+		list_del_rcu(&inj->entry);
+		kfree_rcu_mightsleep(inj);
+	}
+	static_branch_dec(&blk_error_injection_enabled);
+	mutex_unlock(&disk->error_injection_lock);
+}
+
+enum options {
+	Opt_add			= (1u << 0),
+	Opt_removeall		= (1u << 1),
+
+	Opt_op			= (1u << 16),
+	Opt_start		= (1u << 17),
+	Opt_nr_sectors		= (1u << 18),
+	Opt_status		= (1u << 19),
+	Opt_chance		= (1u << 20),
+
+	Opt_invalid,
+};
+
+static const match_table_t opt_tokens = {
+	{ Opt_add,			"add",			},
+	{ Opt_removeall,		"removeall",		},
+	{ Opt_op,			"op=%s",		},
+	{ Opt_start,			"start=%u"		},
+	{ Opt_nr_sectors,		"nr_sectors=%u"		},
+	{ Opt_status,			"status=%s"		},
+	{ Opt_chance,			"chance=%u"		},
+	{ Opt_invalid,			NULL,			},
+};
+
+static int match_op(substring_t *args, enum req_op *op)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*op = str_to_blk_op(tag);
+	if (*op == REQ_OP_LAST)
+		pr_warn("invalid op '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static int match_status(substring_t *args, blk_status_t *status)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*status = tag_to_blk_status(tag);
+	if (!*status)
+		pr_warn("invalid status '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static ssize_t blk_error_injection_parse_options(struct gendisk *disk,
+		char *options)
+{
+	enum { Unset, Add, Removeall } action = Unset;
+	unsigned int option_mask = 0, chance = 1;
+	enum req_op op = REQ_OP_LAST;
+	u64 start = 0, nr_sectors = 0;
+	blk_status_t status = BLK_STS_OK;
+	substring_t args[MAX_OPT_ARGS];
+	char *p;
+
+	while ((p = strsep(&options, ",\n")) != NULL) {
+		int error = 0;
+		ssize_t token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, opt_tokens, args);
+		option_mask |= token;
+		switch (token) {
+		case Opt_add:
+			if (action != Unset)
+				return -EINVAL;
+			action = Add;
+			break;
+		case Opt_removeall:
+			if (action != Unset)
+				return -EINVAL;
+			action = Removeall;
+			break;
+		case Opt_op:
+			error = match_op(args, &op);
+			break;
+		case Opt_start:
+			error = match_u64(args, &start);
+			break;
+		case Opt_nr_sectors:
+			error = match_u64(args, &nr_sectors);
+			break;
+		case Opt_status:
+			error = match_status(args, &status);
+			break;
+		case Opt_chance:
+			error = match_uint(args, &chance);
+			if (!error && chance == 0)
+				error = -EINVAL;
+			break;
+		default:
+			pr_warn("unknown parameter or missing value '%s'\n", p);
+			error = -EINVAL;
+		}
+		if (error)
+			return error;
+	}
+
+	switch (action) {
+	case Add:
+		return error_inject_add(disk, op, start, nr_sectors, status,
+				chance);
+	case Removeall:
+		if (option_mask & ~Opt_removeall)
+			return -EINVAL;
+		error_inject_removeall(disk);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static ssize_t blk_error_injection_write(struct file *file,
+		const char __user *ubuf, size_t count, loff_t *pos)
+{
+	struct gendisk *disk = file_inode(file)->i_private;
+	char *options;
+	int error;
+
+	options = memdup_user_nul(ubuf, count);
+	if (IS_ERR(options))
+		return PTR_ERR(options);
+	error = blk_error_injection_parse_options(disk, options);
+	kfree(options);
+
+	if (error)
+		return error;
+	return count;
+}
+
+static int blk_error_injection_show(struct seq_file *s, void *private)
+{
+	struct gendisk *disk = s->private;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		seq_printf(s, "%llu:%llu status=%s,chance=%u",
+			inj->start, inj->end,
+			blk_status_to_tag(inj->status), inj->chance);
+		seq_putc(s, '\n');
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int blk_error_injection_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_error_injection_show, inode->i_private);
+}
+
+static int blk_error_injection_release(struct inode *inode, struct file *file)
+{
+	return single_release(inode, file);
+}
+
+static const struct file_operations blk_error_injection_fops = {
+	.owner		= THIS_MODULE,
+	.write		= blk_error_injection_write,
+	.read		= seq_read,
+	.open		= blk_error_injection_open,
+	.release	= blk_error_injection_release,
+};
+
+void blk_error_injection_init(struct gendisk *disk)
+{
+	debugfs_create_file("error_injection", 0600, disk->queue->debugfs_dir,
+			disk, &blk_error_injection_fops);
+}
+
+void blk_error_injection_exit(struct gendisk *disk)
+{
+	error_inject_removeall(disk);
+}
diff --git a/block/error-injection.h b/block/error-injection.h
new file mode 100644
index 000000000000..9821d773abab
--- /dev/null
+++ b/block/error-injection.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BLK_ERROR_INJECTION_H
+#define _BLK_ERROR_INJECTION_H 1
+
+#include <linux/jump_label.h>
+
+DECLARE_STATIC_KEY_FALSE(blk_error_injection_enabled);
+
+void blk_error_injection_init(struct gendisk *disk);
+void blk_error_injection_exit(struct gendisk *disk);
+bool __blk_error_inject(struct bio *bio);
+static inline bool blk_error_inject(struct bio *bio)
+{
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION) &&
+	    static_branch_unlikely(&blk_error_injection_enabled) &&
+	    test_bit(GD_ERROR_INJECT, &bio->bi_bdev->bd_disk->state))
+		return __blk_error_inject(bio);
+	return false;
+}
+
+#endif /* _BLK_ERROR_INJECTION_H */
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..f84b6a355b57 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1485,6 +1485,10 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
+#endif
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	mutex_init(&disk->error_injection_lock);
+	INIT_LIST_HEAD(&disk->error_injection_list);
 #endif
 	mutex_init(&disk->rqos_state_mutex);
 	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 57e84d59a642..5070851cf924 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -176,6 +176,7 @@ struct gendisk {
 #define GD_SUPPRESS_PART_SCAN		5
 #define GD_OWNS_QUEUE			6
 #define GD_ZONE_APPEND_USED		7
+#define GD_ERROR_INJECT			8
 
 	struct mutex open_mutex;	/* open/close mutex */
 	unsigned open_partitions;	/* number of open partitions */
@@ -227,6 +228,11 @@ struct gendisk {
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
 
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	struct mutex		error_injection_lock;
+	struct list_head	error_injection_list;
+#endif
+
 	struct mutex rqos_state_mutex;	/* rqos state change mutex */
 };
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH 3/4] block: add a str_to_blk_op helper
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string.
This will be used for the error injection debugfs interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 10 ++++++++++
 block/blk.h      |  1 +
 2 files changed, 11 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 842b5c6f2fb4..beaab7a71fba 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,6 +132,16 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+enum req_op str_to_blk_op(const char *op)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_op_name); i++)
+		if (blk_op_name[i] && !strcmp(blk_op_name[i], op))
+			return (enum req_op)i;
+	return REQ_OP_LAST;
+}
+
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
diff --git a/block/blk.h b/block/blk.h
index 3ab2cdd6ed12..507ab34a6e90 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -53,6 +53,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q);
 const char *blk_status_to_str(blk_status_t status);
 const char *blk_status_to_tag(blk_status_t status);
 blk_status_t tag_to_blk_status(const char *tag);
+enum req_op str_to_blk_op(const char *op);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 2/4] block: add a "tag" for block status codes
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

The full name of the status codes is not good for user interfaces as it
can contain white spaces.  Add the name of the status code without the
BLK_STS_ prefix as a tag so that it can be used for user interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 28 ++++++++++++++++++++++++++++
 block/blk.h      |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43121a9f99f0..842b5c6f2fb4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,10 +135,12 @@ EXPORT_SYMBOL_GPL(blk_op_str);
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
+	.tag		= __stringify(_tag),	\
 	.name		= _desc,		\
 }
 static const struct {
 	int		errno;
+	const char	*tag;
 	const char	*name;
 } blk_errors[] = {
 	ENT(OK,			0,		""),
@@ -203,6 +205,32 @@ const char *blk_status_to_str(blk_status_t status)
 	return blk_errors[idx].name;
 }
 
+const char *blk_status_to_tag(blk_status_t status)
+{
+	int idx = (__force int)status;
+
+	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors) || !blk_errors[idx].tag))
+		return "<null>";
+	return blk_errors[idx].tag;
+}
+
+blk_status_t tag_to_blk_status(const char *tag)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
+		if (blk_errors[i].tag &&
+		    !strcmp(blk_errors[i].tag, tag))
+			return (__force blk_status_t)i;
+	}
+
+	/*
+	 * Return BLK_STS_OK for mismatches as this function is intended to
+	 * parse error status values.
+	 */
+	return BLK_STS_OK;
+}
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
diff --git a/block/blk.h b/block/blk.h
index 7fdfb9012ce1..3ab2cdd6ed12 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 const char *blk_status_to_str(blk_status_t status);
+const char *blk_status_to_tag(blk_status_t status);
+blk_status_t tag_to_blk_status(const char *tag);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/4] block: add a macro to initialize the status table
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Bart Van Assche, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Prepare for adding a new value to the error table by adding a macro
to fill it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1c637db79e59..43121a9f99f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,39 +132,44 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+#define ENT(_tag, _errno, _desc)	\
+[BLK_STS_##_tag] = {				\
+	.errno		= _errno,		\
+	.name		= _desc,		\
+}
 static const struct {
 	int		errno;
 	const char	*name;
 } blk_errors[] = {
-	[BLK_STS_OK]		= { 0,		"" },
-	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
-	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
-	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
-	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
-	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
-	[BLK_STS_RESV_CONFLICT]	= { -EBADE,	"reservation conflict" },
-	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
-	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
-	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
-	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
-	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
-	[BLK_STS_OFFLINE]	= { -ENODEV,	"device offline" },
+	ENT(OK,			0,		""),
+	ENT(NOTSUPP,		-EOPNOTSUPP,	"operation not supported"),
+	ENT(TIMEOUT,		-ETIMEDOUT,	"timeout"),
+	ENT(NOSPC,		-ENOSPC,	"critical space allocation"),
+	ENT(TRANSPORT,		-ENOLINK,	"recoverable transport"),
+	ENT(TARGET,		-EREMOTEIO,	"critical target"),
+	ENT(RESV_CONFLICT,	-EBADE,		"reservation conflict"),
+	ENT(MEDIUM,		-ENODATA,	"critical medium"),
+	ENT(PROTECTION,		-EILSEQ,	"protection"),
+	ENT(RESOURCE,		-ENOMEM,	"kernel resource"),
+	ENT(DEV_RESOURCE,	-EBUSY,		"device resource"),
+	ENT(AGAIN,		-EAGAIN,	"nonblocking retry"),
+	ENT(OFFLINE,		-ENODEV,	"device offline"),
 
 	/* device mapper special case, should not leak out: */
-	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },
+	ENT(DM_REQUEUE,		-EREMCHG,	"dm internal retry"),
 
 	/* zone device specific errors */
-	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
-	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
+	ENT(ZONE_OPEN_RESOURCE, -ETOOMANYREFS,	"open zones exceeded"),
+	ENT(ZONE_ACTIVE_RESOURCE, -EOVERFLOW,	"active zones exceeded"),
 
 	/* Command duration limit device-side timeout */
-	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
-
-	[BLK_STS_INVAL]		= { -EINVAL,	"invalid" },
+	ENT(DURATION_LIMIT,	-ETIME,		"duration limit exceeded"),
+	ENT(INVAL,		-EINVAL,	"invalid"),
 
 	/* everything else not covered above: */
-	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
+	ENT(IOERR,		-EIO,		"I/O"),
 };
+#undef ENT
 
 blk_status_t errno_to_blk_status(int errno)
 {
-- 
2.53.0


^ permalink raw reply related

* configurable block error injection v5
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc

Hi all,

this series adds a new configurable block error injection facility.
We already have a few to inject block errors, but unfortunately most
of them are either not very useful or hard to use, or both:

 - The fail_make_request failure injection point can't distinguish
   different commands, different ranges in the file and can only injection
   plain I/O errors.
 - the should_fail_bio 'dynamic' failure injection has all the same issues
   as fail_make_request
 - dm-error can only fail all command in the table using BLK_STS_IOERR
   and requires setting up a new block device
 - dm-flakey and dm-dust allow all kinds of configurability, but still
   don't have good error selection, no good support for non-read/write
   commands and are limited to the dm table alignment requirements,
   which for zoned devices enforces setting them up for an entire zone.
   They also once again require setting up a stacked block device,
   which is really annoying in harnesses like xfstests

This series adds a new debugfs-based block layer error injection
that allows to configure what operations and ranges the injection
applied to, and what status to return.  It also allows to configure a
failure ratio similar to the xfs errortag injection.

Changes since v4:
 - don't unlock in removeall to avoid a race between removeall and setup
 - document why we can't match 0-sized bios

Changes since v3:
 - use a static branch to guard the new condition
 - split out a new header so that jump_label.h doesn't get pulled into
   blk.h
 - more checking for impossible conditions in blk_status_to_tag
 - more spelling fixes

Changes since v2:
 - improve the documentation a bit
 - fix a spelling mistake in a comment

Changes since v1:
 - drop the should_fail_bio removal and cleanup depending on it, as it's
   used by eBPF programs and thus a hidden UABI.
 - as a result split the code out to it's own Kconfig symbol
 - various error handling fixed pointed out by Keith
 - documentation spelling fixes pointed out by Randy

Diffstat:
 Documentation/block/error-injection.rst |   59 +++++
 Documentation/block/index.rst           |    1 
 block/Kconfig                           |    8 
 block/Makefile                          |    1 
 block/blk-core.c                        |   87 ++++++--
 block/blk-sysfs.c                       |    5 
 block/blk.h                             |    3 
 block/error-injection.c                 |  315 ++++++++++++++++++++++++++++++++
 block/error-injection.h                 |   21 ++
 block/genhd.c                           |    4 
 include/linux/blkdev.h                  |    6 
 11 files changed, 490 insertions(+), 20 deletions(-)

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-11 13:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Carlos Maiolino, Christoph Hellwig, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <aiqwy5DfHI79KXuZ@kbusch-mbp>

On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> It's entirely possible a device supports byte aligned addresses. The
> block layer just doesn't let a driver report that. So either it really
> was successful because you found a bug that skips the alignment checks,
> or your device silently corrupted your payload.
> 
> Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> though, in not taking the optimization when it was possible. So here's
> an alternative suggestion that should get things working as expected:

The fix below looks like it is addressing a real bug.  I'm not sure if
Carlos is hitting it, but we were missing the alignment checks for
single-bvec fast path bios so far indeed.


^ permalink raw reply

* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Stefan Hajnoczi @ 2026-06-10 15:37 UTC (permalink / raw)
  To: Sumit Saxena, Michael S. Tsirkin
  Cc: Martin K . Petersen, Jens Axboe, James E . J . Bottomley,
	linux-scsi, linux-block, Adam Radford, Khalid Aziz,
	Adaptec OEM Raid Solutions, Matthew Wilcox, Hannes Reinecke,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Eugenio Perez,
	virtualization, Vishal Bhakta, bcm-kernel-feedback-list,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	xen-devel, John Garry
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

On Tue, Jun 09, 2026 at 05:48:01PM +0530, Sumit Saxena wrote:
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 5fdaa71f0652..88375574cb18 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -929,7 +929,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
>  	num_targets = virtscsi_config_get(vdev, max_target) + 1;
>  
>  	shost = scsi_host_alloc(&virtscsi_host_template,
> -				struct_size(vscsi, req_vqs, num_queues));
> +				struct_size(vscsi, req_vqs, num_queues), NULL);

A virtio_device has a parent (this is the virtio transport, like
virtio_pci) and that may have NUMA node.

drivers/virtio/virtio.c:register_virtio_device() could call
set_dev_node(dev, dev_to_node(dev->parent)) to propagate the NUMA node
to the virtio_device if it is not already automatically propagated.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Keith Busch @ 2026-06-11 12:57 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, brauner, linux-block, linux-fsdevel,
	linux-ext4, linux-xfs, Hannes Reinecke, Martin K. Petersen,
	Jens Axboe
In-Reply-To: <aiqBvF93P4NjfaDR@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 12:05:22PM +0200, Carlos Maiolino wrote:
> The passed in address 0x1003af80001 is one byte misaligned and shouldn't
> (at least in theory) ever be accepted no? Or am I missing something
> else?

It's entirely possible a device supports byte aligned addresses. The
block layer just doesn't let a driver report that. So either it really
was successful because you found a bug that skips the alignment checks,
or your device silently corrupted your payload.

Anyway, my earlier suggestion should work. Ming thinks it may go to far,
though, in not taking the optimization when it was possible. So here's
an alternative suggestion that should get things working as expected:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..4c31762d6fb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,6 +404,9 @@ static inline bool bio_may_need_split(struct bio *bio,
        bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
        if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
                return true;
+
+       if ((bv->bv_offset | bv->bv_len) & lim->dma_alignment)
+               return true;
        return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
 }
 
-- 

^ permalink raw reply related

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11 12:27 UTC (permalink / raw)
  To: Andy Shevchenko, Christian König
  Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Davidlohr Bueso, Paul E . McKenney, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Liam Girdwood, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Eddie James, Mark Brown, Maxime Coquelin, Alexandre Torgue,
	Laxman Dewangan, Neil Armstrong, Robert Foss, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Laurent Pinchart, Jonas Karlman, Jernej Skrabec, Matthew Auld,
	Matthew Brost, Waiman Long, drbd-dev, linux-block,
	linux1394-devel, dri-devel, intel-gfx, linux-spi, linux-stm32,
	linux-arm-kernel, linux-tegra, linux-sound, linux-kernel,
	Andrew Morton, Randy Dunlap, Christian Brauner, David Howells,
	Luca Ceresoli, Kaito Cheng, Muchun Song, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Takashi Sakamoto, Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipx1goKIsk40vrF@ashevche-desk.local>



在 2026/6/11 16:29, Andy Shevchenko 写道:
> On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
>> On 6/10/26 17:02, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>>>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>>>> 在 2026/6/10 16:07, Christian König 写道:
> 
> ...
> 
>>>>> Should we revert to v1, or keep list_for_each_entry() and
>>>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>>>> changes?
>>>>>
>>>>> Link to v1:
>>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>>>
>>>>> Or do you have any better suggestions?
>>>>
>>>> v1 looks perfectly reasonable to me.
>>>
>>> But why not just hiding that once for all (in case they don't use the temporary
>>> iterator)? Easy to automate, robust — everyone is happy?
>>
>> As far as I can see that is an extremely bad idea.
>>
>> The distinction between the use cases of 'iterating the list' and 'iterating
>> the list while you modify it' is completely intentional.

I agree with this point. It is very reasonable for list_for_each_entry()
to be used only for 'iterating the list'. In practice, however, we do not
have an effective way to enforce that rule for users, whereas the distinction
between bool and int can be enforced by the compiler. The 13 patches in the
current series are all real examples where users modify the list while using
list_for_each_entry(). Is a rule that cannot actually be enforced reasonable?
This is just my humble opinion, and I am raising it here only for discussion.

> What I meant is to keep the name, just drop the parameter (make it hidden and
> being defined inside list_for_each_*_safe() cases).

I agree with this approach, but the specific details still need to be settled,
including the issue described in the link below.

https://lore.kernel.org/all/0a333eb8-fc29-4b85-993e-6b726f4c7cf0@linux.dev/

Of course, there is also the suffix-renaming issue raised by Christian.

>> See the bool type can be implemented by int as well, but it is just a
>> different use case.
> 
>>>> You should just include some patches in the same patch set to actually use
>>>> the new macros.
>>>>
>>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>>>> the new macro I'm happy to review that.
>>>
>>
> 

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox