public inbox for linux-kernel@vger.kernel.org
From: Keith Busch <kbusch@kernel.org>
To: "Tomás Trnka" <trnka@scm.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-kernel@vger.kernel.org, regressions@lists.linux.dev,
	linux-block@vger.kernel.org
Subject: Re: [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+
Date: Wed, 15 Apr 2026 09:52:12 -0600	[thread overview]
Message-ID: <ad-0LIubzaT8M2_O@kbusch-mbp> (raw)
In-Reply-To: <ad-sprBWPA20iola@kbusch-mbp>

On Wed, Apr 15, 2026 at 09:20:06AM -0600, Keith Busch wrote:
> On Wed, Apr 15, 2026 at 02:18:59PM +0200, Tomás Trnka wrote:
> > Since 6.18, booting a VM that is backed by a raid1 LVM LV makes that LV 
> > immediately eject one of the devices. This is apparently because of a direct 
> > IO read by QEMU failing. I have bisected the issue to the following commit and 
> > confirmed that reverting that commit (plus dependencies 
> > 9eab1d4e0d15b633adc170c458c51e8be3b1c553 and 
> > b475272f03ca5d0c437c8f899ff229b21010ec83) on top of 6.19.11 fixes the issue.
> > 
> > commit 5ff3f74e145adc79b49668adb8de276446acf6be
> > Author: Keith Busch <kbusch@kernel.org>
> > Date:   Wed Aug 27 07:12:54 2025 -0700
> > 
> >     block: simplify direct io validity check
> >     
> >     The block layer checks all the segments for validity later, so no need
> >     for an early check. Just reduce it to a simple position and total length
> >     check, and defer the more invasive segment checks to the block layer.
> >     
> >     Signed-off-by: Keith Busch <kbusch@kernel.org>
> >     Reviewed-by: Hannes Reinecke <hare@suse.de>
> >     Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> >     Reviewed-by: Christoph Hellwig <hch@lst.de>
> >     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > The issue looks like this:
> > 
> > md/raid1:mdX: dm-17: rescheduling sector 0
> > md/raid1:mdX: redirecting sector 0 to other mirror: dm-17
> > (snipped 9 repeats of the preceding two lines)
> > md/raid1:mdX: dm-17: Raid device exceeded read_error threshold [cur 21:max 20]
> > md/raid1:mdX: dm-17: Failing raid device
> > md/raid1:mdX: Disk failure on dm-17, disabling device.
> > md/raid1:mdX: Operation continuing on 1 devices.
> > 
> > There's absolutely nothing wrong with the HW, the issue persists even when I 
> > move the mirrors to a different pair of PVs (SAS HDD vs SATA SSD).
> 
> Thanks for the notice and the logs. Sounds like something is getting
> sent to the block layer that can't form a viable IO. The commit you
> identified should still have errored, though, just much earlier in the
> call stack.
> 
> Anyway, I'll take a look to see if there's something I missed in
> handling the stacking setup you're describing.

I think the block layer is working as designed. The problem you're
seeing is likely related to how QEMU probes direct-io limits instead of
doing proper stat/statx type checks. During its probing, it tries
various sizes and offsets, and some of these are invalid. The kernel
code previously still returned the EINVAL error, but it did so before
entering the dm raid layer, and qemu reacted by changing its
assumptions.

The new kernel code eventually returns the same error, but on the other
side of the dm raid stack, and unfortunately dm raid treats this as a
device failure rather than a user error as intended.

I suggest the stacking layers shouldn't consider BLK_STS_INVAL to be a
device error, or retryable.

---
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index c33099925f230..cf1c25f290f36 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -293,8 +293,16 @@ static inline bool raid1_should_read_first(struct mddev *mddev,
  * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is
  * submitted to the underlying disks, hence don't record badblocks or retry
  * in this case.
+ *
+ * BLK_STS_INVAL means the request itself is malformed (e.g. unaligned
+ * buffers that violate DMA constraints).  Retrying on another mirror will
+ * fail the same way, and counting it against the device is wrong.
  */
 static inline bool raid1_should_handle_error(struct bio *bio)
 {
-	return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT));
+	if (bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT))
+		return false;
+	if (bio->bi_status == BLK_STS_INVAL)
+		return false;
+	return true;
 }
--


Thread overview: 5+ messages
2026-04-15 12:18 [REGRESSION][BISECTED] Spurious raid1 device failure triggered by qemu direct IO on 6.18+ Tomáš Trnka
2026-04-15 15:20 ` Keith Busch
2026-04-15 15:52   ` Keith Busch [this message]
2026-04-15 22:59     ` Keith Busch
2026-04-16 10:13     ` Tomáš Trnka
