public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
@ 2026-04-15 12:18 Tomáš Trnka
  2026-04-15 15:20 ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Tomáš Trnka @ 2026-04-15 12:18 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: regressions, linux-block, Keith Busch

Since 6.18, booting a VM that is backed by a raid1 LVM LV makes that LV 
immediately eject one of the devices, apparently due to a failing direct 
IO read issued by QEMU. I have bisected the issue to the following 
commit and confirmed that reverting it (plus its dependencies 
9eab1d4e0d15b633adc170c458c51e8be3b1c553 and 
b475272f03ca5d0c437c8f899ff229b21010ec83) on top of 6.19.11 fixes the issue.

commit 5ff3f74e145adc79b49668adb8de276446acf6be
Author: Keith Busch <kbusch@kernel.org>
Date:   Wed Aug 27 07:12:54 2025 -0700

    block: simplify direct io validity check
    
    The block layer checks all the segments for validity later, so no need
    for an early check. Just reduce it to a simple position and total length
    check, and defer the more invasive segment checks to the block layer.
    
    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

The issue looks like this:

md/raid1:mdX: dm-17: rescheduling sector 0
md/raid1:mdX: redirecting sector 0 to other mirror: dm-17
(snipped 9 repeats of the preceding two lines)
md/raid1:mdX: dm-17: Raid device exceeded read_error threshold [cur 21:max 20]
md/raid1:mdX: dm-17: Failing raid device
md/raid1:mdX: Disk failure on dm-17, disabling device.
md/raid1:mdX: Operation continuing on 1 devices.

There's absolutely nothing wrong with the HW; the issue persists even 
when I move the mirrors to a different pair of PVs (SAS HDD vs SATA SSD).

The following command is enough to trigger the issue:

/usr/bin/qemu-system-x86_64 -blockdev '{"driver":"host_device","filename":"/dev/vg_mintaka/lv_test","aio":"native","node-name":"libvirt-1-storage","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false}}'

According to the blktrace below, this seems to be an ordinary direct IO 
read of sectors 0-7, but I can't reproduce the issue by emulating such a 
read with dd.

The beginning of blktrace for dm-20 (mirror LV):

252,20   0        3     0.050436097 17815  Q  RS 0 + 8 [qemu-system-x86]
252,20   4        1     0.053590884 17179  C  RS 0 + 8 [65531]
252,20   9        1     0.071942534 17843  Q  RS 0 + 1 [worker]
252,20  10        1     0.077792770 10803  C  RS 0 + 1 [0]

for dm-17 (one of the legs of the raid1 mirror):

252,17   0        3     0.050441207 17815  Q  RS 0 + 8 [qemu-system-x86]
252,17   0        4     0.050465318 17815  C  RS 0 + 8 [65514]
252,17   4        1     0.050491948 17179  Q  RS 0 + 8 [mdX_raid1]
252,17  12        1     0.050695772 12662  C  RS 0 + 8 [0]

for sda1 that holds that leg (raid1 LV on dm-crypt on sda1; bfq messages 
snipped):

  8,0    0        7     0.050453828 17815  A  RS 902334464 + 8 <- (252,5) 902301696
  8,0    0        8     0.050453988 17815  A  RS 902336512 + 8 <- (8,1) 902334464
  8,1    0        9     0.050454158 17815  Q  RS 902336512 + 8 [qemu-system-x86]
  8,1    0       10     0.050455058 17815  C  RS 902336512 + 8 [65514]
  8,0    4        1     0.050490699 17179  A  RS 902334464 + 8 <- (252,5) 902301696
  8,0    4        2     0.050490849 17179  A  RS 902336512 + 8 <- (8,1) 902334464
  8,1    4        3     0.050491009 17179  Q  RS 902336512 + 8 [mdX_raid1]
  8,1    4        4     0.050498089 17179  G  RS 902336512 + 8 [mdX_raid1]
  8,1    4        5     0.050500129 17179  P   N [mdX_raid1]
  8,1    4        6     0.050500939 17179 UT   N [mdX_raid1] 1
  8,1    4        7     0.050507439 17179  I  RS 902336512 + 8 [mdX_raid1]
  8,1    4        8     0.050531999   387  D  RS 902336512 + 8 [kworker/4:1H]
  8,1   15        1     0.050668902     0  C  RS 902336512 + 8 [0]

for sdb1 (backing the other leg of the mirror):

  8,16   4        1     0.053558754 17179  A  RS 902334464 + 8 <- (252,4) 902301696
  8,16   4        2     0.053558884 17179  A  RS 902336512 + 8 <- (8,17) 902334464
  8,17   4        3     0.053559024 17179  Q  RS 902336512 + 8 [mdX_raid1]
  8,17   4        4     0.053559364 17179  C  RS 902336512 + 8 [65514]
  8,17   4        0     0.053570484 17179 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 1 depth 48
  8,17   4        5     0.053578104   387  D  FN [kworker/4:1H]
  8,17  15        1     0.053647696 17192  C  FN 0 [0]
  8,17  15        2     0.053815039   567  D  FN [kworker/15:1H]
  8,17  15        3     0.053872560 17192  C  FN 0 [0]

Full logs can be downloaded from:
https://is.muni.cz/de/ttrnka/qemu-dio-raid1-fail/dmesg.log
https://is.muni.cz/de/ttrnka/qemu-dio-raid1-fail/mapped-devs.lst
https://is.muni.cz/de/ttrnka/qemu-dio-raid1-fail/blktrace.tar.gz

lsblk output (from a different boot, minors might not match mapped-devs.lst):

https://is.muni.cz/de/ttrnka/qemu-dio-raid1-fail/lsblk-t.out
https://is.muni.cz/de/ttrnka/qemu-dio-raid1-fail/lsblk.out

I can share any other info or logs, test patches, or poke around with ftrace 
or systemtap as needed.

#regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be

Best regards,

Tomáš
-- 
Tomáš Trnka
Software for Chemistry & Materials B.V.
De Boelelaan 1109
1081 HV Amsterdam, The Netherlands
https://www.scm.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
  2026-04-15 12:18 [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+ Tomáš Trnka
@ 2026-04-15 15:20 ` Keith Busch
  2026-04-15 15:52   ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2026-04-15 15:20 UTC (permalink / raw)
  To: Tomáš Trnka; +Cc: Jens Axboe, linux-kernel, regressions, linux-block

On Wed, Apr 15, 2026 at 02:18:59PM +0200, Tomáš Trnka wrote:
> Since 6.18, booting a VM that is backed by a raid1 LVM LV makes that LV 
> immediately eject one of the devices. This is apparently because of a direct 
> IO read by QEMU failing. I have bisected the issue to the following commit and 
> confirmed that reverting that commit (plus dependencies 
> 9eab1d4e0d15b633adc170c458c51e8be3b1c553 and 
> b475272f03ca5d0c437c8f899ff229b21010ec83) on top of 6.19.11 fixes the issue.
> 
> commit 5ff3f74e145adc79b49668adb8de276446acf6be
> Author: Keith Busch <kbusch@kernel.org>
> Date:   Wed Aug 27 07:12:54 2025 -0700
> 
>     block: simplify direct io validity check
>     
>     The block layer checks all the segments for validity later, so no need
>     for an early check. Just reduce it to a simple position and total length
>     check, and defer the more invasive segment checks to the block layer.
>     
>     Signed-off-by: Keith Busch <kbusch@kernel.org>
>     Reviewed-by: Hannes Reinecke <hare@suse.de>
>     Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> The issue looks like this:
> 
> md/raid1:mdX: dm-17: rescheduling sector 0
> md/raid1:mdX: redirecting sector 0 to other mirror: dm-17
> (snipped 9 repeats of the preceding two lines)
> md/raid1:mdX: dm-17: Raid device exceeded read_error threshold [cur 21:max 20]
> md/raid1:mdX: dm-17: Failing raid device
> md/raid1:mdX: Disk failure on dm-17, disabling device.
> md/raid1:mdX: Operation continuing on 1 devices.
> 
> There's absolutely nothing wrong with the HW, the issue persists even when I 
> move the mirrors to a different pair of PVs (SAS HDD vs SATA SSD).

Thanks for the notice and the logs. Sounds like something is getting
sent to the block layer that can't form a viable IO. The commit you
identified should still have errored, though, just much earlier in the
call stack.

Anyway, I'll take a look at whether there's something I missed handling
with the stacking setup you're describing.


* Re: [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
  2026-04-15 15:20 ` Keith Busch
@ 2026-04-15 15:52   ` Keith Busch
  2026-04-15 22:59     ` Keith Busch
  2026-04-16 10:13     ` Tomáš Trnka
  0 siblings, 2 replies; 5+ messages in thread
From: Keith Busch @ 2026-04-15 15:52 UTC (permalink / raw)
  To: Tomáš Trnka; +Cc: Jens Axboe, linux-kernel, regressions, linux-block

On Wed, Apr 15, 2026 at 09:20:06AM -0600, Keith Busch wrote:
> On Wed, Apr 15, 2026 at 02:18:59PM +0200, Tomáš Trnka wrote:
> > Since 6.18, booting a VM that is backed by a raid1 LVM LV makes that LV 
> > immediately eject one of the devices. This is apparently because of a direct 
> > IO read by QEMU failing. I have bisected the issue to the following commit and 
> > confirmed that reverting that commit (plus dependencies 
> > 9eab1d4e0d15b633adc170c458c51e8be3b1c553 and 
> > b475272f03ca5d0c437c8f899ff229b21010ec83) on top of 6.19.11 fixes the issue.
> > 
> > commit 5ff3f74e145adc79b49668adb8de276446acf6be
> > Author: Keith Busch <kbusch@kernel.org>
> > Date:   Wed Aug 27 07:12:54 2025 -0700
> > 
> >     block: simplify direct io validity check
> >     
> >     The block layer checks all the segments for validity later, so no need
> >     for an early check. Just reduce it to a simple position and total length
> >     check, and defer the more invasive segment checks to the block layer.
> >     
> >     Signed-off-by: Keith Busch <kbusch@kernel.org>
> >     Reviewed-by: Hannes Reinecke <hare@suse.de>
> >     Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> >     Reviewed-by: Christoph Hellwig <hch@lst.de>
> >     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > The issue looks like this:
> > 
> > md/raid1:mdX: dm-17: rescheduling sector 0
> > md/raid1:mdX: redirecting sector 0 to other mirror: dm-17
> > (snipped 9 repeats of the preceding two lines)
> > md/raid1:mdX: dm-17: Raid device exceeded read_error threshold [cur 21:max 20]
> > md/raid1:mdX: dm-17: Failing raid device
> > md/raid1:mdX: Disk failure on dm-17, disabling device.
> > md/raid1:mdX: Operation continuing on 1 devices.
> > 
> > There's absolutely nothing wrong with the HW, the issue persists even when I 
> > move the mirrors to a different pair of PVs (SAS HDD vs SATA SSD).
> 
> Thanks for the notice and the logs. Sounds like something is getting
> sent to the block layer that can't form a viable IO. The commit you
> identified should still have errored, though, just much earlier in the
> call stack.
> 
> Anyway, I'll take a look at whether there's something I missed handling
> with the stacking setup you're describing.

I think the block layer is working as designed. The problem you're
seeing is likely related to how QEMU probes direct-io limits instead of
doing proper stat/statx type checks. During its probing, it tries
various sizes and offsets, and some of these are invalid. The kernel
previously returned the EINVAL error before entering the dm raid layer,
and qemu reacted by adjusting its assumptions.

The new kernel code eventually returns the same error, but on the other
side of the dm raid stack, and unfortunately dm raid treats this as a
device failure rather than a user error as intended.

Suggest the stacking layers shouldn't consider BLK_STS_INVAL to be a
device error or retryable.

---
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index c33099925f230..cf1c25f290f36 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -293,8 +293,16 @@ static inline bool raid1_should_read_first(struct mddev *mddev,
  * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is
  * submitted to the underlying disks, hence don't record badblocks or retry
  * in this case.
+ *
+ * BLK_STS_INVAL means the request itself is malformed (e.g. unaligned
+ * buffers that violate DMA constraints).  Retrying on another mirror will
+ * fail the same way, and counting it against the device is wrong.
  */
 static inline bool raid1_should_handle_error(struct bio *bio)
 {
-	return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT));
+	if (bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT))
+		return false;
+	if (bio->bi_status == BLK_STS_INVAL)
+		return false;
+	return true;
 }
--


* Re: [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
  2026-04-15 15:52   ` Keith Busch
@ 2026-04-15 22:59     ` Keith Busch
  2026-04-16 10:13     ` Tomáš Trnka
  1 sibling, 0 replies; 5+ messages in thread
From: Keith Busch @ 2026-04-15 22:59 UTC (permalink / raw)
  To: Tomáš Trnka; +Cc: Jens Axboe, linux-kernel, regressions, linux-block

On Wed, Apr 15, 2026 at 09:52:12AM -0600, Keith Busch wrote:
> Suggest the stacking layers shouldn't consider BLK_STS_INVAL to be a
> device error or retryable.

I was able to recreate the reported issue (the key is that you have to
use dm-raid, not md-raid), and the below diff tests successfully for me.
I'll send a formal patch tomorrow.

> ---
> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> index c33099925f230..cf1c25f290f36 100644
> --- a/drivers/md/raid1-10.c
> +++ b/drivers/md/raid1-10.c
> @@ -293,8 +293,16 @@ static inline bool raid1_should_read_first(struct mddev *mddev,
>   * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is
>   * submitted to the underlying disks, hence don't record badblocks or retry
>   * in this case.
> + *
> + * BLK_STS_INVAL means the request itself is malformed (e.g. unaligned
> + * buffers that violate DMA constraints).  Retrying on another mirror will
> + * fail the same way, and counting it against the device is wrong.
>   */
>  static inline bool raid1_should_handle_error(struct bio *bio)
>  {
> -	return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT));
> +	if (bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT))
> +		return false;
> +	if (bio->bi_status == BLK_STS_INVAL)
> +		return false;
> +	return true;
>  }
> --
> 


* Re: [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
  2026-04-15 15:52   ` Keith Busch
  2026-04-15 22:59     ` Keith Busch
@ 2026-04-16 10:13     ` Tomáš Trnka
  1 sibling, 0 replies; 5+ messages in thread
From: Tomáš Trnka @ 2026-04-16 10:13 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, linux-kernel, regressions, linux-block

The proposed patch fixes the issue for me and doesn't cause any visible 
problems in normal operation (I didn't test an actual device failure). 
Thanks a lot for such a speedy fix; I'm looking forward to testing the 
final version as well.

> ---
> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> index c33099925f230..cf1c25f290f36 100644
> --- a/drivers/md/raid1-10.c
> +++ b/drivers/md/raid1-10.c
> @@ -293,8 +293,16 @@ static inline bool raid1_should_read_first(struct mddev *mddev,
>   * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is
>   * submitted to the underlying disks, hence don't record badblocks or retry
>   * in this case.
> + *
> + * BLK_STS_INVAL means the request itself is malformed (e.g. unaligned
> + * buffers that violate DMA constraints).  Retrying on another mirror will
> + * fail the same way, and counting it against the device is wrong.
>   */
>  static inline bool raid1_should_handle_error(struct bio *bio)
>  {
> -	return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT));
> +	if (bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT))
> +		return false;
> +	if (bio->bi_status == BLK_STS_INVAL)
> +		return false;
> +	return true;
>  }
> --


end of thread, other threads:[~2026-04-16 10:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15 12:18 [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+ Tomáš Trnka
2026-04-15 15:20 ` Keith Busch
2026-04-15 15:52   ` Keith Busch
2026-04-15 22:59     ` Keith Busch
2026-04-16 10:13     ` Tomáš Trnka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox