From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from james.kirk.hungrycats.org ([174.142.39.145]:46112 "EHLO james.kirk.hungrycats.org" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751432AbcKYFSd (ORCPT ); Fri, 25 Nov 2016 00:18:33 -0500
Date: Fri, 25 Nov 2016 00:07:40 -0500
From: Zygo Blaxell
To: Gareth Pye
Cc: Goffredo Baroncelli, Qu Wenruo, linux-btrfs
Subject: Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
Message-ID: <20161125050740.GH8685@hungrycats.org>
References: <20161121085016.7148-1-quwenruo@cn.fujitsu.com> <94606bda-dab0-e7c9-7fc6-1af9069b64fc@inwind.it> <20161125043119.GG8685@hungrycats.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="UthUFkbMtH2ceUK2"
In-Reply-To:
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

--UthUFkbMtH2ceUK2
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> wrote:
> >
> > This risk mitigation measure does rely on admins taking a machine in this
> > state down immediately, and also somehow knowing not to start a scrub
> > while their RAM is failing...which is kind of an annoying requirement
> > for the admin.
>
> Attempting to detect if RAM is bad when scrub starts is both time
> consuming and not very reliable right.

RAM, like all hardware, could fail at any time, and a scrub could already
be running when it happens.  This is annoying but also a fact of life that
admins have to deal with.

Testing RAM before scrub starts is not more beneficial than testing RAM
at random intervals--but if you are testing RAM at random intervals, why
not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic sanity
test to make sure they're coming from the storage layer and not somewhere
closer to the CPU.  If all errors come from one device and there are clear
log messages showing SCSI device errors and the SMART log matches the
other data, RAM is probably not the root cause of failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting,
there's smoke pouring out of the back of the machine, all the disks are
full of csum failures, and the BIOS welcome message has spelling errors
that weren't there before, I would *not* start a scrub.  More like turn
the machine off, take it apart, test all the pieces separately, and only
do a scrub after everything above the storage layer had been replaced or
recertified.  I certainly wouldn't want the filesystem to try to fix the
csum failures it finds in such situations.

--UthUFkbMtH2ceUK2
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlg3xxwACgkQgfmLGlazG5zMHQCfSTS367EOiRGyjdikg/pnqf2q
j/gAn3tViMjkAAE9pce8rIGzwPJ2fnGs
=51Az
-----END PGP SIGNATURE-----

--UthUFkbMtH2ceUK2--