From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Xiao Ni
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure
Date: Mon, 5 Jan 2026 23:40:23 +0900
Message-ID: <20260105144025.12478-1-k@mgml.me>

Changes from V5:
- Prevent md from being marked broken when the rdev has FailFast,
  regardless of the bio's flag. Thanks to Xiao for the advice:
  https://lore.kernel.org/linux-raid/CALTww2_nJqyA99cG9YNarXEB4wimFK=pKy=qrxdkfB60PaUa1w@mail.gmail.com/#t
- Drop the preparation refactor, flag rename, and error logging
  improvement commits.

Changes from V4:
- Use device_lock to serialize md_error() instead of adding a new
  spinlock_t.
- Rename new function md_bio_failure_error() to md_cond_error().
- Add helper function pers->should_error() to determine whether to
  fail rdev in failfast bio failure, instead of using the LastDev flag.
- Avoid changing the behavior of the LastDev flag.
- Drop fix for R{1,10}BIO_Uptodate not being set despite successful
  retry; this will be sent separately after Nan's refactor.
- Drop fix for the message 'Operation continuing on 0 devices'; as it
  is outside the scope of this patch, it will be sent separately.
- Improve logging when metadata writing fails.
- Rename LastDev to RetryingSBWrite.

Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
  function, md_bio_failure_error, has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error
  instead of signaling via FailfastIOFailure.
- RAID10: Fix missing reschedule of failfast read bio failure
- Regardless of failfast, in narrow_write_error, writes that succeed
  in retry are returned to the higher layer as success

Changes from V2:
- Fix to prevent the array from being marked broken for all
  Failfast IOs, not just metadata.
- Following review feedback, update raid{1,10}_error to clear
  FailfastIOFailure so that devices are properly marked Faulty.

Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs fail:
  "Operation continuing on 0 devices"

v5: https://lore.kernel.org/linux-raid/20251027150433.18193-1-k@mgml.me/
v4: https://lore.kernel.org/linux-raid/20250915034210.8533-1-k@mgml.me/
v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/

When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:
* MD_BROKEN is set and the array halts, even though this should not
  occur under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is
  still reported as BLK_STS_IOERR
  * NOTE: a fix for this was removed in v5 and will be sent separately
    https://lore.kernel.org/linux-raid/6f0f9730-4bbe-7f3c-1b50-690bb77d5d90@huaweicloud.com/
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
  remaining rdev, and as a result, the upper layer receives an I/O
  error.
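
Whether the first issue has been hit can be checked from sysfs. The
following is only a minimal sketch: it assumes the array is /dev/md0
and that the running kernel can report "broken" in array_state while
MD_BROKEN is set. The function name md_array_state is made up for
illustration and is not part of this series.

```shell
#!/bin/sh
# Sketch: print an md array_state value and report whether it is "broken".
# The default path assumes the array is md0; pass another file to override.
# Returns 0 if the state is not "broken", 1 if it is, 2 if unreadable.
md_array_state() {
    state_file=${1:-/sys/block/md0/md/array_state}
    if [ ! -r "$state_file" ]; then
        echo "unknown"
        return 2
    fi
    state=$(cat "$state_file")
    echo "$state"
    [ "$state" != "broken" ]
}
```

For example, "md_array_state || echo 'md0 is broken or unreadable'" can
be run once the fio job has stopped. The state shown while I/O is still
in flight may differ, so a single reading is not conclusive.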

Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something
as simple as an Ethernet fault. The issue can be reproduced using the
following steps.

# prepare nvmet/nvme-tcp and md array
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
  mkdir $ports
  cd $ports
  echo 127.0.0.1 > addr_traddr
  echo tcp > addr_trtype
  echo 4420 > addr_trsvcid
  echo ipv4 > addr_adrfam
  cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF

sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
   --failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.

# run fio
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
   --bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1 &

# The issue can be reproduced by blocking nvme traffic during fio
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP

Patch 1 prevents the array from being marked broken when FailFast is
set on the rdev.
Patch 2 adds the missing retry path for Failfast read errors in RAID10.
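
The traffic-blocking step of the reproduction can be wrapped so that
the DROP rule is removed even if the sleep is interrupted. This is only
a sketch built from the same iptables commands used in the steps; the
names ipt and block_nvme_traffic, and the PORT and DRY_RUN variables,
are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: block NVMe/TCP traffic for a while, then always unblock it.
# PORT defaults to the 4420 used in the reproduction steps.
PORT=${PORT:-4420}

ipt() {
    # With DRY_RUN=1, print the iptables command instead of running it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "iptables $*"
    else
        iptables "$@"
    fi
}

block_nvme_traffic() {
    secs=${1:-10}   # 10s is twice the default KATO value, as in the steps
    ipt -A INPUT -m tcp -p tcp --dport "$PORT" -j DROP || return 1
    # Remove the rule and exit if we are interrupted while waiting.
    trap 'ipt -D INPUT -m tcp -p tcp --dport "$PORT" -j DROP; exit 130' INT TERM
    [ "${DRY_RUN:-0}" = 1 ] || sleep "$secs"
    trap - INT TERM
    ipt -D INPUT -m tcp -p tcp --dport "$PORT" -j DROP
}
```

Running "block_nvme_traffic 10" as root while fio is active corresponds
to the three manual commands in the steps; with DRY_RUN=1 it only
prints the two iptables invocations.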

Kenta Akagi (2):
  md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  md/raid10: fix failfast read error not rescheduled

 drivers/md/md.c     |  6 ++++--
 drivers/md/raid1.c  |  8 +++++++-
 drivers/md/raid10.c | 15 ++++++++++++++-
 3 files changed, 25 insertions(+), 4 deletions(-)

-- 
2.50.1