Date: Thu, 29 May 2025 14:36:30 +1000
From: Dave Chinner <david@fromorbit.com>
To: Yafang Shao
Cc: Christian Brauner, djwong@kernel.org, cem@kernel.org,
	linux-xfs@vger.kernel.org, Linux-Fsdevel
Subject: Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
>    without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
>    I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory
>    pressure, since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
>    the bad blocks, as the original data was never successfully
>    written (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.

Really, that's terrible application design. If you aren't checking
that data has been written successfully, then you get to keep all the
broken and/or missing data bits to yourself.
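[For readers following along: the check being described here is simply
looking at the fsync() (and close()) return values after a buffered
write. A minimal illustrative sketch, not from the thread; since the
errseq_t rework in Linux 4.13, every fd that calls fsync() observes a
pending writeback error at least once.]

	/* Sketch: detect writeback errors with fsync() after a
	 * buffered write. Illustrative only. */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char buf[] = "important data\n";
		int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

		if (fd < 0) {
			perror("open");
			return EXIT_FAILURE;
		}

		/* Buffered write: success only means the data reached
		 * the page cache, not the disk. */
		if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
			perror("write");
			close(fd);
			return EXIT_FAILURE;
		}

		/* fsync() is where a writeback failure (e.g. EIO from
		 * a bad block) is reported back to the application. */
		if (fsync(fd) < 0) {
			fprintf(stderr, "fsync: %s - data may be lost\n",
				strerror(errno));
			close(fd);
			return EXIT_FAILURE;
		}

		/* close() can also report a deferred writeback error. */
		if (close(fd) < 0) {
			perror("close");
			return EXIT_FAILURE;
		}
		return EXIT_SUCCESS;
	}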
However, with that said, some history.

XFS used to keep pages that had IO errors on writeback dirty so they
would be retried at a later time, and they couldn't be reclaimed from
memory until they were written. This was historical behaviour from
Irix, designed to handle SAN environments where multipath fail-over
could take several minutes. In those situations writeback could fail
for several attempts before the storage timed out and came back
online. Then the next write retry would succeed, and everything would
be good. Linux never gave us a specific IO error for this case, so we
just had to retry on EIO and hope that the storage came back
eventually.

This is different to traditional Linux writeback behaviour, which is
what is implemented now via iomap. There are good reasons for this
model:

- a filesystem with a dirty page that can't be written and cleaned
  cannot be unmounted.
- having large chunks of memory that cannot be cleaned and reclaimed
  has an adverse impact on system performance.
- the system can potentially hang if the page cache is dirtied beyond
  write throttling thresholds and then the device is yanked. Now none
  of the dirty memory can be cleaned, and all new writes are
  throttled....

> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback
>   errors.
> - Option B: Mark the affected file as inaccessible if a writeback
>   error occurs.

Go look at /sys/fs/xfs/<dev>/error/metadata/... and the configurable
error handling behaviour implemented through that interface.

Essentially, XFS metadata behaves as "retry writes forever and hang on
unmount until the write succeeds" by default, i.e. similar to the old
data IO error behaviour. The "hang on unmount" behaviour can be turned
off via /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can configure
different failure handling policies for different types of IO error.
e.g. fail fast on -ENODEV (the device was unplugged and is never
coming back, so shut the filesystem down), retry for a while on
-ENOSPC (a dm-thinp pool has run out of space, so give it some time to
be expanded before shutting down), retry once on -EIO (to avoid random
spurious hardware failures shutting down the fs), and have everything
else use the configured default behaviour....

There's also a good reason the sysfs error hierarchy is structured the
way it is - it leaves open the option of expanding the error handling
policies to different IO types (i.e. data and metadata). It even
allows different policies for different types of data devices (e.g. RT
vs data device policies).

So, go look at how the error configuration code in XFS is handled, and
consider extending that to /sys/fs/xfs/<dev>/error/data/.... to allow
different error handling policies for different types of data
writeback IO errors. Then you'll need to implement those policies
through the XFS and iomap IO paths...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
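[For reference, a minimal sketch of driving the existing knobs
referenced above. The paths and value semantics follow
Documentation/admin-guide/xfs.rst; "sda1" stands in for the <dev>
placeholder, and the proposed error/data/ hierarchy does not exist
yet, so only the existing metadata/ side is shown.]

	/* Sketch: set an XFS metadata error policy via the existing
	 * sysfs knobs under /sys/fs/xfs/<dev>/error/.
	 * "sda1" is a placeholder device name. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int sysfs_write(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);
		ssize_t ret;

		if (fd < 0) {
			perror(path);
			return -1;
		}
		ret = write(fd, val, strlen(val));
		if (ret < 0)
			perror(path);
		close(fd);
		return ret < 0 ? -1 : 0;
	}

	int main(void)
	{
		/* Don't hang at unmount while errors are outstanding. */
		sysfs_write("/sys/fs/xfs/sda1/error/fail_at_unmount", "1");

		/* Retry metadata writes that fail with -EIO once,
		 * rather than forever (-1) or not at all (0)... */
		sysfs_write("/sys/fs/xfs/sda1/error/metadata/EIO/max_retries",
			    "1");

		/* ...and stop retrying after 30 seconds regardless. */
		sysfs_write("/sys/fs/xfs/sda1/error/metadata/EIO/retry_timeout_seconds",
			    "30");

		return 0;
	}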