Date: Sat, 24 Jan 2015 13:06:01 -0500
From: Zygo Blaxell
To: linux-btrfs@vger.kernel.org
Subject: spurious I/O errors from btrfs...at the caching layer?
Message-ID: <20150124180601.GA15018@hungrycats.org>

I am seeing a lot of spurious I/O errors that appear to come from the
cache-facing side of btrfs.

While running a heavy load with some extent sharing (e.g. building 20
Linux kernels at once from source trees copied with 'cp -a
--reflink=always'), some files return spurious EIO on read.  It happens
often enough to break a kernel build about 1/3 of the time.

I believe the I/O errors are spurious because:

  - there is no kernel message of any kind during the event
  - scrub detects 0 errors
  - device stats report 0 errors
  - the drive firmware reports nothing wrong through SMART
  - there seems to be no attempt to read the disk when the error is
    reported
  - "sysctl vm.drop_caches={1,2}" makes the I/O error go away.

Files become unreadable at random and stay unreadable indefinitely;
however, any time I find a file that gives EIO on read, I can poke
vm.drop_caches and the EIO goes away.  The file can then be read
normally and has correct contents.  The disk does not seem to be
involved in the I/O error return at all.

This seems to happen more often while snapshots are being deleted, but
it also occurs on systems with no snapshots (though in those cases the
system had snapshots in the past).  When a file returns EIO on read,
other snapshots of the same file also return EIO on read.  I have not
been able to test whether reflink copies (clones) are affected as well.

Observed on kernels 3.17 through 3.18.3.  All affected filesystems use
skinny-metadata; filesystems without skinny-metadata do not seem to
have this problem.
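
For concreteness, here is a rough sketch of the workload and of the
check I run when a file starts returning EIO.  The mount point, tree
names, file path, and job count below are placeholders, not the exact
setup:

    # workload sketch: 20 reflinked copies of a kernel source tree,
    # all built in parallel on the affected btrfs filesystem
    for i in $(seq 1 20); do
        cp -a --reflink=always linux-src "/mnt/btrfs/linux-copy-$i"
    done
    for i in $(seq 1 20); do
        ( cd "/mnt/btrfs/linux-copy-$i" && make -j4 ) &
    done
    wait

    # during the event: the read fails with EIO, with no kernel message
    # and no apparent disk activity
    cat /mnt/btrfs/linux-copy-7/fs/btrfs/inode.c > /dev/null

    # the disk and metadata look clean: both report 0 errors
    btrfs scrub start -Bd /mnt/btrfs
    btrfs device stats /mnt/btrfs

    # dropping the page cache (1) or dentries/inodes (2) clears it
    sysctl vm.drop_caches=1

    # the same read now succeeds and the file contents are correct
    cat /mnt/btrfs/linux-copy-7/fs/btrfs/inode.c > /dev/null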