From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.stoffel.org (mail.stoffel.org [172.104.24.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C83CE200120; Mon, 17 Mar 2025 21:04:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=172.104.24.175 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742245450; cv=none; b=DinyHuz2pn3eiOZEdON9mYL0muJN102STWkyuOvCEsc24N29XQZBSS5iYND4vae0bRZh6agQOQwdGIb3Dp7g/9TTnmO47Fdf6q/cirm1gdGaNyggcdKp4cuxcbTM1CtIQyzz0HIZZQqFFNucHpnm2GSODDFIcLB+gkmXQC1z4EM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742245450; c=relaxed/simple; bh=B/H5hBc8DjuP28tLDUztk9nbB9QenvgNHoqRvHBoDN8=; h=MIME-Version:Content-Type:Message-ID:Date:From:To:Cc:Subject: In-Reply-To:References; b=uBDK5p++wDnlo00gr/0bwmFSdrF1cpKCGuZlxCwgZByPpBC01vUU9vzNhbiqenSc2NWPauiFB6FGfn99AUcLUlKBmHABBk0f5d/OPYophKMJUIwIIz/jlM0hetJNKWS+/JTCSwF8ahj7e7B4cE+6UU52QOtFdP7GNfwc+wqkEQk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=stoffel.org; spf=pass smtp.mailfrom=stoffel.org; arc=none smtp.client-ip=172.104.24.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=stoffel.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=stoffel.org Received: from quad.stoffel.org (syn-097-095-183-072.res.spectrum.com [97.95.183.72]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mail.stoffel.org (Postfix) with ESMTPSA id 514B91E4AB; Mon, 17 Mar 2025 16:55:25 -0400 (EDT) Received: by quad.stoffel.org (Postfix, from userid 1000) id D6D99A0E76; Mon, 17 Mar 2025 16:55:24 -0400 (EDT) Precedence: bulk X-Mailing-List: linux-bcachefs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <26584.35900.850011.320586@quad.stoffel.home> Date: Mon, 17 Mar 2025 16:55:24 -0400 From: "John Stoffel" To: Kent Overstreet Cc: linux-bcachefs@vger.kernel.org, linux-block@vger.kernel.org, Roland Vet , linux-fsdevel@vger.kernel.org X-Clacks-Overhead: GNU Terry Pratchett Subject: Re: [PATCH 00/14] better handling of checksum errors/bitrot In-Reply-To: <20250311201518.3573009-1-kent.overstreet@linux.dev> References: <20250311201518.3573009-1-kent.overstreet@linux.dev> X-Mailer: VM 8.3.x under 28.2 (x86_64-pc-linux-gnu) >>>>> "Kent" == Kent Overstreet writes: > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > we've got an extent that can't be read, due to checksum error/bitrot. > This took some doing to fix properly, because > - We don't want to just delete such data (replace it with > KEY_TYPE_error); we never want to delete anything except when > explicitly told to by the user, and while we don't yet have an API for > "read this file even though there's errors, just give me what you > have" we definitely will in the future. So will open() just return an error? How will this look from userspace? > - Not being able to move data is a non-option: that would block copygc, > device removal, etc. > - And, moving said extent requires giving it a new checksum - strictly > required if the move has to fragment it, teaching the write/move path > about extents with bad checksums is unpalateable, and anyways we'd > like to be able to guard against more bitrot, if we can. Why does it need a new checksum if there are read errors? What happens if there are more read errors? If I have a file on a filesystem which is based in spinning rust and I get a single bit flip, I'm super happy you catch it. But now you re-checksum the file, with the read error, and return it? I'm stupid and just a user/IT guy. I want notifications, but I don't want my application to block so I can't kill it, or unmount the filesystem. Or continue to use it if I like. > So that means: > - Extents need a poison bit: "reads return errors, even though it now > has a good checksum" - this was added in a separate patch queued up > for 6.15. Sorry for being dense, but does a file have one or more extents? Or is this at a level below that? > It's an incompat feature because it's a new extent field, and old > versions can't parse extents with unknown field types, since they > won't know their sizes - meaning users will have to explicitly do an > incompat upgrade to make use of this stuff. > - The read path needs to do additional retries after checksum errors > before giving up and marking it poisoned, so that we don't > accidentally convert a transient error to permanent corruption. When doing these retries, is the filesystem locked up or will the process doing the read() be blocked from being killed? > - The read path gets a whole bunch of work to plumb precise modern error > codes around, so that e.g. the retry path, the data move path, and the > "mark extent poisoned" path all know exactly what's going on. > - Read path is responsible for marking extents poisoned after sufficient > retry attempts (controlled by a new filesystem option) > - Data move path is allowed to move extents after a read error, if it's > a checksum error (giving it a new checksum) if it's been poisoned > (i.e. the extent flags feature is enabled). So if just a single bit flips, the extent gets moved onto better storage, and the file gets re-checksummed. But what about if more bits go bad afterwards? > Code should be more or less finalized - still have more tests for corner > cases to write, but "write corrupt data and then tell rebalance to move > it to another device" works as expected. > TODO: > - NVME has a "read recovery level" attribute that controlls how hard the > erasure coding algorithms work - we want that plumbed. > Before we give up and move data that we know is bad, we need to try > _as hard as possible_ to get a successful read. What does this mean exactly in terms of end user and SysAdmin point of view? Locking up the system (no new writes to the now problematic storage) that I can't kill would be annoyingly painful. Don't get me wrong, I'm happy to see data integrity stuff get added, I'm just trying to understand how it interacts with userspace. John