From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Diangang Li"
To: "Theodore Tso", "Diangang Li"
Subject: Re: [RFC v2 0/1] ext4: fail fast on repeated buffer_head reads after IO failure
Date: Thu, 16 Apr 2026 11:54:27 +0800
Message-Id: <03453fb3-0f3e-491d-ba12-e4208fe1c185@bytedance.com>
In-Reply-To: <20260413124703.GA20496@macsyma-wired.lan>
References: <20260325093349.630193-1-diangangli@gmail.com>
 <20260413062500.1380307-1-diangangli@gmail.com>
 <20260413124703.GA20496@macsyma-wired.lan>
X-Mailing-List: linux-ext4@vger.kernel.org
Content-Type: text/plain; charset=UTF-8

On 4/13/26 8:47 PM, Theodore Tso wrote:
> On Mon, Apr 13, 2026 at 02:24:59PM +0800, Diangang Li wrote:
>> From: Diangang Li
>>
>> A production system reported hung tasks blocked for 300s+ in ext4
>> buffer_head paths....
>>
>> [Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>> [Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>> [Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>
> I wonder whether the ext4 layer is the right place to be handle this
> sort of issue.  For example, it could be handled by having a subsystem
> scanning dmesg (or by wiring up notifications so block device errors
> get sent to a userspace daemon), and when certain criteria is met, the
> machine is automatically sent to hardware operations to run
> diagnostics and (most likey) replace the failing disk.
>
> It could also be handled in the driver or SCSI layer so the "fail
> fast" semantics are handled there, so that it supports all file
> systems, not just ext4.  The SCSI layer also has more information
> about the type of error; you might want to handle things like media
> errors differently from Fibre Channel or iSCSI timeouts (which might
> be something where "fast fast" is not appropriate).
>
> By the time the error gets propagated up to the buffer head, we lose a
> lot of detail about why the error took place.  Also, in the long term
> we will hopefully be moving away from using buffer cache.
>
> - Ted

Hi Ted,

What about moving the fail-fast check into the buffer-head path
(submit_bh_wbc) so that it is not ext4-specific? We can set a
BH_Read_EIO bit in end_bio_bh_io_sync and add a per-bdev/per-partition
sysfs knob for the retry window. That turns it into a generic guard for
buffer-head users, and it naturally goes away as buffer-head usage
shrinks.

We did think about doing this in the block layer (submit_bio) or in
SCSI/NVMe, but a generic solution there seems to need a per-device
table to cache the failing LBAs. With buffer-head, we can keep the
error state on the bh itself.

I also checked f2fs, which does not use buffer-head. It tracks repeated
EIOs on metadata/node pages to avoid infinite retry loops. How do you
see that approach compared with a buffer-head retry window?

Is either of these directions worth exploring further?

Thanks,
Diangang
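
P.S. To make the buffer-head idea above concrete, here is a rough
userspace model of the proposed check. It is only a sketch, not kernel
code: the struct and function names (model_bh, model_bdev,
model_end_read, model_should_fail_fast) and the timestamp field are
hypothetical. They stand in for the BH_Read_EIO bit recorded in
end_bio_bh_io_sync, the check in submit_bh_wbc, and the per-bdev retry
window knob. Where the timestamp of the last failure would actually be
stored is an open assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-buffer error state. */
struct model_bh {
	bool read_eio;        /* models the proposed BH_Read_EIO bit */
	uint64_t eio_time_ms; /* assumed: time of the last failed read */
};

/* Hypothetical model of the per-bdev sysfs knob. */
struct model_bdev {
	uint64_t retry_window_ms; /* 0 disables the fail-fast guard */
};

/* Models what the read completion handler would record. */
static void model_end_read(struct model_bh *bh, bool io_ok, uint64_t now_ms)
{
	if (io_ok) {
		/* A successful read clears any earlier error state. */
		bh->read_eio = false;
	} else {
		bh->read_eio = true;
		bh->eio_time_ms = now_ms;
	}
}

/*
 * Models the check at submission time: returns true when a read should
 * be failed immediately instead of being sent to the device again.
 */
static bool model_should_fail_fast(const struct model_bh *bh,
				   const struct model_bdev *bdev,
				   uint64_t now_ms)
{
	if (bdev->retry_window_ms == 0 || !bh->read_eio)
		return false;
	/* Still inside the retry window since the last failure. */
	return now_ms - bh->eio_time_ms < bdev->retry_window_ms;
}
```

In this model a read that failed at t=100ms with a 1000ms window is
rejected at t=500ms, and goes back to the device once the window has
expired or after a later read succeeds.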