Date: Mon, 8 Jul 2024 10:26:32 +0200
From: Lukas Straub
To: Zygo Blaxell
Cc: Qu Wenruo, Qu Wenruo, linux-btrfs@vger.kernel.org
Subject: Re: raid5 silent data loss in 6.2 and later, after "7a3150723061 btrfs: raid56: do data csum verification during RMW cycle"
Message-ID: <20240708100927.652b2bc7@penguin>
References: <5a8c1fbf-3065-4cea-9cf9-48e49806707d@suse.com>
X-Mailing-List: linux-btrfs@vger.kernel.org

On Mon, 8 Jul 2024 02:25:37 -0400
Zygo Blaxell wrote:

> On Sat, Jun 08, 2024 at 12:50:35PM +0930, Qu Wenruo wrote:
> > On 2024/6/8 11:25, Zygo Blaxell wrote:
> > > On Sat, Jun 01, 2024 at 05:22:46PM +0930, Qu Wenruo wrote:
> > > After this change, we now end up in an infinite loop:
> > >
> > > 1.  Allocator picks a stripe with some unrecoverable csum blocks
> > >     and some free blocks
> > >
> > > 2.  Writeback tries to put data in the stripe
> > >
> > > 3.  rmw_rbio aborts after it can't repair the existing blocks
> > >
> > > 4.  Writeback deletes the extent, often silently (the application
> > >     has to use fsync to detect it)
> > >
> > > 5.  Go to step 1, where the allocator picks the same blocks again
> > >
> > > The effect is pretty dramatic--even a single unrecoverable sector in
> > > one stripe will bring an application server to its knees, constantly
> > > discarding an application's data whenever it tries to write.  Once the
> > > allocator reaches the point where the "next" block is in a bad rmw stripe,
> > > it keeps allocating that same block over and over again.
> >
> > I'm afraid the error path (no way to inform the caller) is an existing
> > problem. Buffered write can always success (as long as no ENOMEM/ENOSPC
> > etc), but the real writeback is not ensured to success.
> > It doesn't even need RAID56 to trigger.
> >
> > But "discarding" the dirty pages doesn't sound correct.
> > If a writeback failed, the dirty pages should still stay dirty, not
> > discarded.
> >
> > It may be a new bug in the error handling path.
>
> I found the code that does this.  It's more than 11 years old:
>
>     commit 0bec9ef525e33233d7739b71be83bb78746f6e94
>     Author: Josef Bacik
>     Date:   Thu Jan 31 14:58:00 2013 -0500
>
>         Btrfs: unreserve space if our ordered extent fails to work
>
>         When a transaction aborts or there's an EIO on an ordered extent or any
>         error really we will not free up the space we reserved for this ordered
>         extent.  This results in warnings from the block group cache cleanup in the
>         case of a transaction abort, or leaking space in the case of EIO on an
>         ordered extent.  Fix this up by free'ing the reserved space if we have an
>         error at all trying to complete an ordered extent.  Thanks,
>
> [...]

Before this escalates further in what is IMHO the wrong direction:
I think the current btrfs behavior is correct. See also this paper[1],
which examines write failures of buffered I/O in different filesystems,
especially Table 2. Ext4 and xfs, for example, do not discard the page
cache on write failure, but this is worse, since you now have a mismatch
between what is in the cache and what is on disk. They also do not retry
writing back the page cache.

The source of confusion here is rather that write errors do not happen
in the real world: disks do not verify whether they wrote data
correctly, and neither does any layer (raid, etc.) above them. Thus
handling of write failure is completely untested in all applications
(see the paper again), and it seems the problems you see are due to
incorrect handling of write errors.

The data loss is not silent; it's just that many applications and
scripts do not use fsync() at all.

I think the proper way of resolving this is for btrfs to retry writing
the extent, but to another (possibly clean) stripe. Or perhaps a fresh
raid5 block group altogether.

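To illustrate the fsync() point: a minimal sketch, in C, of how an
application can notice a failed writeback instead of losing data without
knowing. The helper name write_durably() is hypothetical and not part of
btrfs or any library; it only assumes the POSIX open(), write(), fsync()
and close() calls. How reliably an error surfaces at fsync() still
depends on the kernel version and the filesystem, which is what the
paper[1] examines.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not btrfs code): write `len`
 * bytes from `buf` to `path` and report whether they could be flushed.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int ret = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -errno;

	/* write() may be short or interrupted; loop until all is queued. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			goto out;
		}
		p += n;
		len -= (size_t)n;
	}

	/*
	 * A successful buffered write() only means the data reached the
	 * page cache. A writeback failure such as EIO is reported here,
	 * so skipping this call means never hearing about it.
	 */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	/* close() can fail too; keep the first error that was seen. */
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would treat any non-zero return as "the data may not be on
disk" and keep its own copy of the buffer, for example to write it again
to a different file, rather than assume the earlier write() succeeded.
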
I very much approve of btrfs' current design of handling (and reporting)
write errors in the most correct way possible.

Regards,
Lukas Straub

[1] https://www.usenix.org/system/files/atc20-rebello.pdf