Date: Mon, 20 Jun 2016 16:40:49 -0400
From: Zygo Blaxell
To: Chris Murphy
Cc: Roman Mamedov, Btrfs BTRFS
Subject: Re: Adventures in btrfs raid5 disk recovery

On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >> The normal 'device delete' operation got about 25% of the way in,
> then got stuck on some corrupted sectors and aborted with EIO.
>
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
>
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward are not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk,
> or Btrfs being confused as it's reading the data from disk?
> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption
every time there was a system crash (some of which were triggered by
disk failures, some not). It looked like any writes in progress at the
time of the failure would be damaged.

In the past I would just mop up the corrupt files (they were always
the last extents written, easy to find with find-new or scrub) and
have no further problems. In the earlier cases there were no new
instances of corruption after the initial failure event and manual
cleanup.

Now that I dig a little deeper into this, I do see one fairly
significant piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs  16774
[/dev/vde].write_io_errs    121
[/dev/vde].read_io_errs     4
[devid:8].read_io_errs      16

Prior to the failure of devid:8, vde had 121 write errors and 4 read
errors (these counter values are months old and the errors were long
since repaired by scrub). The 16774 corruption errors on vdc are all
new since the devid:8 failure, though.
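
For what it's worth, the mop-up I described above looks roughly like
the sketch below. It's only an outline: /data, the subvolume path, and
the transid 123456 are placeholder examples rather than values from
this array, and the exact kernel log wording varies between versions.

# Record the filesystem's current generation; find-new with an
# impossibly high generation just prints the transid marker.
btrfs subvolume find-new /data 9999999
#   -> "transid marker was 123456"   (placeholder value)

# After a crash, list files written since that generation; these
# last-written extents are the ones most likely to be damaged.
btrfs subvolume find-new /data 123456

# Scrub the filesystem; csum errors it cannot repair show up in the
# kernel log, usually with the affected file path.
btrfs scrub start -Bd /data
dmesg | grep -i 'checksum error'

# Once the damaged files are deleted or restored from backup, reset
# the per-device counters so any new errors stand out.
btrfs device stats -z /data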