From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
 Kenta Akagi <k@mgml.me>
Subject: [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure
Date: Mon, 15 Sep 2025 12:42:01 +0900
Message-ID: <20250915034210.8533-1-k@mgml.me>

Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
  function, md_bio_failure_error(), has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error()
  instead of being signaled via FailfastIOFailure.
- RAID10: Fix the missing reschedule of failfast read bio failures.
- Regardless of failfast, writes that succeed on retry in
  narrow_write_error are now returned to the upper layer as success.

Changes from V2:
- Prevent the array from being marked broken for all Failfast I/Os,
  not just metadata.
- Following the review, update raid{1,10}_error to clear
  FailfastIOFailure so that devices are properly marked Faulty.

Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it.
- Add pr_crit() when setting MD_BROKEN.
- Fix the message that may be shown after all rdevs have failed:
  "Operation continuing on 0 devices".

v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/

When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:

* MD_BROKEN is set and the array halts, even though this should not
  happen under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is
  still reported as BLK_STS_IOERR.
* RAID10 only: if a Failfast read I/O fails, it is not retried on any
  remaining rdev, so the upper layer receives an I/O error.

Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something
as simple as an Ethernet fault.

The issue can be reproduced with the following steps. (A simple check
of the resulting array state is sketched after the patch summary
below.)

# prepare nvmet/nvme-tcp and the md array #
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
  mkdir $ports
  cd $ports
  echo 127.0.0.1 > addr_traddr
  echo tcp > addr_trtype
  echo 4420 > addr_trsvcid
  echo ipv4 > addr_adrfam
  cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
          --failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.

# run fio #
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
          --bs=4k --numjobs=9 --time_based --runtime=300s \
          --group_reporting --direct=1

# reproduce the issue by blocking nvme traffic while fio is running #
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP

Patches 1-3 are preparatory changes for patch 4.
Patch 4 prevents an MD_FAILFAST bio failure from causing the array to
fail.
Patch 5, regardless of FAILFAST, reports success to the upper layer if
a write retried in narrow_write_error succeeds without marking a
badblock.
Patch 6 fixes an issue where writes are not retried in a no-bbl
configuration when an MD_FAILFAST bio fails on the last rdev.
Patch 7 adds the missing retry path for Failfast read errors in RAID10.
Patches 8-9 modify the pr_crit handling in raid{1,10}_error.
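
As a quick way to observe what the series changes (this check is only
illustrative and not part of the reproduction itself; /dev/md0 is the
array created above, and the exact state strings depend on the mdadm
and kernel versions), the array state can be inspected once the
iptables rule has been removed:

# inspect the array state after the blackout #
sh-5.2# cat /proc/mdstat
sh-5.2# mdadm --detail /dev/md0

Without this series the array can end up reported as failed even though
both rdevs are reachable again; with the series applied it is expected
to remain usable.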

Kenta Akagi (9):
  md/raid1,raid10: Set the LastDev flag when the configuration changes
  md: serialize md_error()
  md: introduce md_bio_failure_error()
  md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
  md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
  md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
  md/raid10: fix failfast read error not rescheduled
  md/raid1,raid10: Add error message when setting MD_BROKEN
  md/raid1,raid10: Fix: Operation continuing on 0 devices.

 drivers/md/md.c     |  61 ++++++++++++++---
 drivers/md/md.h     |  12 +++-
 drivers/md/raid1.c  | 156 ++++++++++++++++++++++++++++++++------------
 drivers/md/raid10.c | 115 ++++++++++++++++++++++++++------
 4 files changed, 270 insertions(+), 74 deletions(-)

-- 
2.50.1