From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtp03.citrix.com ([162.221.156.55]:39224 "EHLO
	SMTP03.CITRIX.COM" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1727719AbfCVOmk (ORCPT );
	Fri, 22 Mar 2019 10:42:40 -0400
Subject: Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem
References: <20190321103045.6441-1-edvin.torok@citrix.com> <20190321202348.GA1180@magnolia>
From: Edwin Török
Message-ID: <4d660c49-e8ba-2dfc-2300-9d9d648e213f@citrix.com>
Date: Fri, 22 Mar 2019 14:42:27 +0000
MIME-Version: 1.0
In-Reply-To: <20190321202348.GA1180@magnolia>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Sender: fstests-owner@vger.kernel.org
To: "Darrick J. Wong", Dave Chinner
Cc: fstests@vger.kernel.org, Mark Syms, Tim Smith, Ross Lagerwall

On 21/03/2019 20:23, Darrick J. Wong wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely losing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
>
> How is this different from generic/475? Is there something specific to
> thin pools here (vs. using dm-error to simulate the errors)?

When I tried generic/475 it hung in unmount and never reached the data
corruption part.
Thanks for the suggestion. dm-error would be better than dm-thin; see below.

On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely losing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
>
> Exactly what IO errors is dm-thinp generating here? If you run it
> out of space, then it triggers ENOSPC, not EIO. That's very, very
> different to iSCSI throwing random EIO errors..

I agree that dm-error would be a better starting place than dm-thin for
this test. I'll try to modify it and see if I can get it to finish running
without hanging and still reproduce the corruption issue.

On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> +# now remount the filesystem without triggering IO errors,
>> +# and check that the filesystem is not corrupt
>> +_dmthin_cycle_mount
>> +# ls --color makes ls stat each file, which finds the corruption
>
> Not sure it always does - ISTR that in the past if the dtype
> returned indicated the type of file, then it ls would omit the stat
> just for the purposes of coloring....
>
> And, realistically, the way we find /filesystem/ corruption is to
> run fsck/repair, not iterate the directory structure.

I don't disagree; however, GFS2's fsck is very noisy and complains about
inconsistencies even on a filesystem where I can otherwise list and read
each entry correctly.
I wanted to make a clear distinction between that noise and the corruption
actually observed, so that the two bugs can be fixed independently.
Perhaps the test should first do an 'ls/stat' and, if that is fine, then
unmount and run the filesystem check as usual.

> If we are
> looking for missing files, then we dump the directory structure to
> the golden output file or dump it before/after errors and compare
> that they are the same.
>
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>
> If corruption is not found on the first pass, why would the next 2
> passes find anything different?

Indeed, I'll drop them.

Thanks,
--Edwin
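
The two checks discussed in this thread (an explicit stat of every entry
rather than relying on 'ls --color', and a before/after dump of the
directory structure) could be sketched roughly as below. This is only a
sketch under assumptions: dump_tree and stat_tree are hypothetical helper
names, not existing fstests functions, and 'find -printf' assumes GNU
findutils.

```shell
#!/bin/bash
# Hypothetical helpers (not part of fstests); assumes GNU find.

# Print one "type path" line per entry under directory $1, with paths
# relative to $1 and sorted, so two dumps can be compared as strings
# or with diff.
dump_tree() {
    (cd "$1" && find . -printf '%y %p\n' | LC_ALL=C sort)
}

# Call stat on every entry under $1 explicitly. GNU find exits non-zero
# if traversal fails or if any stat invocation fails.
stat_tree() {
    find "$1" -exec stat -- {} + >/dev/null
}
```

In a test this might be used by saving "$(dump_tree "$SCRATCH_MNT")"
before the I/O errors are injected, then, after the remount, running
stat_tree and comparing a fresh dump against the saved one before
unmounting and running the filesystem check as usual.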