public inbox for linux-xfs@vger.kernel.org
* [problem] xfstests generic/311 unreliable...
@ 2013-05-07  7:11 Dave Chinner
  2013-05-07  7:37 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2013-05-07  7:11 UTC (permalink / raw)
  To: xfs

Hi Josef,

I was just looking at generic/311, and I think there's something
fundamentally wrong with the way it is checking the scratch device.

You reported it was failing for internal test 19 on XFS, but I'm
seeing it fail randomly after the first test or two. It has never
made it past test 3. So I had a little bit of a closer look at its
structure. Essentially it is doing this (with the contents seen by
each step):

scratch dev + mkfs
	+-------------------------------+
overlay dm-flakey
	D-------------------------------D
mount/write/kill/unmount dm-flakey
	Dx-x-x-x-x-x-x------------------D

All good up to here. Now, you run _check_scratch_fs, which sees:

scratch dev + check
	+-------------------------------+

i.e. it's not seeing all the changes written to dm-flakey, and so
xfs_check is seeing corruption.

After I realised this was stacking block devices and checking the
underlying block device, the cause was pretty obvious: scratch-dev
and dm-flakey have different address spaces, so changes written
through one address space will not be seen through the other address
space if there is stale cached data in the original address space.

And that's exactly what is happening. This patch:

--- a/tests/generic/311
+++ b/tests/generic/311
@@ -79,6 +79,7 @@ _mount_flakey()
 _unmount_flakey()
 {
        $UMOUNT_PROG $SCRATCH_MNT
+       echo 3 > /proc/sys/vm/drop_caches
 }
 
 _load_flakey_table()

Makes the problem go away for xfs_check. But really, I don't like
the assumption that the test is built on - that writes through one
block device are visible through another. It's just asking for weird
problems.
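For illustration, the two-views-over-one-store hazard can be modeled
with a small shell sketch (purely illustrative; the variables stand in
for per-device page caches, not real kernel structures):

```shell
#!/bin/sh
# Hypothetical model of two block-device "address spaces" over one
# backing store: a write that goes through the dm-flakey view is not
# seen through the scratch-dev view until the scratch-dev view's stale
# cached copy is dropped (the drop_caches workaround above).

store="content after mkfs"

# The scratch-dev view caches its first read privately.
scratch_cache=""

read_scratch() {
	# Cache miss: fill from the backing store; a hit may be stale.
	[ -z "$scratch_cache" ] && scratch_cache=$store
	echo "$scratch_cache"
}

read_scratch                        # primes the scratch-dev cache
store="content after log writes"    # write lands via the dm-flakey view
read_scratch                        # still prints the stale cached copy
scratch_cache=""                    # like: echo 3 > /proc/sys/vm/drop_caches
read_scratch                        # now prints the new contents
```

In the model, as in the test, nothing invalidates the first view's
cache when the second view writes, which is exactly why checking the
scratch device underneath dm-flakey is fragile.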

Is there some way that you can restructure this test so it doesn't
have this problem (e.g. do everything on dm-flakey)?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [problem] xfstests generic/311 unreliable...
  2013-05-07  7:11 [problem] xfstests generic/311 unreliable Dave Chinner
@ 2013-05-07  7:37 ` Dave Chinner
  2013-05-07 13:28   ` [BULK] " Josef Bacik
  2013-05-07 14:10   ` Josef Bacik
  0 siblings, 2 replies; 4+ messages in thread
From: Dave Chinner @ 2013-05-07  7:37 UTC (permalink / raw)
  To: xfs; +Cc: jbacik

Argh, add the cc to Josef...

On Tue, May 07, 2013 at 05:11:02PM +1000, Dave Chinner wrote:
> Hi Josef,
> 
> I was just looking at generic/311, and I think there's something
> fundamentally wrong with the way it is checking the scratch device.
> 
> You reported it was failing for internal test 19 on XFS, but I'm
> seeing it fail randomly after the first test or two. It has never
> made it past test 3. So I had a little bit of a closer look at its
> structure. Essentially it is doing this (with the contents seen by
> each step):
> 
> scratch dev + mkfs
> 	+-------------------------------+
> overlay dm-flakey
> 	D-------------------------------D
> mount/write/kill/unmount dm-flakey
> 	Dx-x-x-x-x-x-x------------------D
> 
> All good up to here. Now, you run _check_scratch_fs, which sees:
> 
> scratch dev + check
> 	+-------------------------------+
> 
> i.e. it's not seeing all the changes written to dm-flakey, and so
> xfs_check is seeing corruption.
> 
> After I realised this was stacking block devices and checking the
> underlying block device, the cause was pretty obvious: scratch-dev
> and dm-flakey have different address spaces, so changes written
> through one address space will not be seen through the other address
> space if there is stale cached data in the original address space.
> 
> And that's exactly what is happening. This patch:
> 
> --- a/tests/generic/311
> +++ b/tests/generic/311
> @@ -79,6 +79,7 @@ _mount_flakey()
>  _unmount_flakey()
>  {
>         $UMOUNT_PROG $SCRATCH_MNT
> +       echo 3 > /proc/sys/vm/drop_caches
>  }
>  
>  _load_flakey_table()
> 
> Makes the problem go away for xfs_check. But really, I don't like
> the assumption that the test is built on - that writes through one
> block device are visible through another. It's just asking for weird
> problems.
> 
> Is there some way that you can restructure this test so it doesn't
> have this problem (e.g. do everything on dm-flakey)?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com



* Re: [BULK]  Re: [problem] xfstests generic/311 unreliable...
  2013-05-07  7:37 ` Dave Chinner
@ 2013-05-07 13:28   ` Josef Bacik
  2013-05-07 14:10   ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2013-05-07 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Josef Bacik, xfs@oss.sgi.com

On Tue, May 07, 2013 at 01:37:17AM -0600, Dave Chinner wrote:
> Argh, add the cc to Josef...
> 
> On Tue, May 07, 2013 at 05:11:02PM +1000, Dave Chinner wrote:
> > Hi Josef,
> > 
> > I was just looking at generic/311, and I think there's something
> > fundamentally wrong with the way it is checking the scratch device.
> > 
> > You reported it was failing for internal test 19 on XFS, but I'm
> > seeing it fail randomly after the first test or two. It has never
> > made it past test 3. So I had a little bit of a closer look at its
> > structure. Essentially it is doing this (with the contents seen by
> > each step):
> > 
> > scratch dev + mkfs
> > 	+-------------------------------+
> > overlay dm-flakey
> > 	D-------------------------------D
> > mount/write/kill/unmount dm-flakey
> > 	Dx-x-x-x-x-x-x------------------D
> > 
> > All good up to here. Now, you run _check_scratch_fs, which sees:
> > 
> > scratch dev + check
> > 	+-------------------------------+
> > 
> > i.e. it's not seeing all the changes written to dm-flakey, and so
> > xfs_check is seeing corruption.
> > 
> > After I realised this was stacking block devices and checking the
> > underlying block device, the cause was pretty obvious: scratch-dev
> > and dm-flakey have different address spaces, so changes written
> > through one address space will not be seen through the other address
> > space if there is stale cached data in the original address space.
> > 
> > And that's exactly what is happening. This patch:
> > 
> > --- a/tests/generic/311
> > +++ b/tests/generic/311
> > @@ -79,6 +79,7 @@ _mount_flakey()
> >  _unmount_flakey()
> >  {
> >         $UMOUNT_PROG $SCRATCH_MNT
> > +       echo 3 > /proc/sys/vm/drop_caches
> >  }
> >  
> >  _load_flakey_table()
> > 
> > Makes the problem go away for xfs_check. But really, I don't like
> > the assumption that the test is built on - that writes through one
> > block device are visible through another. It's just asking for weird
> > problems.
> > 
> > Is there some way that you can restructure this test so it doesn't
> > have this problem (e.g. do everything on dm-flakey)?

Yup, I can do that. Honestly, the only reason I was doing it this way is
that my original script, which this test is based on, did all of this to
a raw disk with a real reboot in there.  I'll fix it up and send a patch.
Thanks,

Josef



* Re: [BULK]  Re: [problem] xfstests generic/311 unreliable...
  2013-05-07  7:37 ` Dave Chinner
  2013-05-07 13:28   ` [BULK] " Josef Bacik
@ 2013-05-07 14:10   ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2013-05-07 14:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Josef Bacik, xfs@oss.sgi.com

On Tue, May 07, 2013 at 01:37:17AM -0600, Dave Chinner wrote:
> Argh, add the cc to Josef...
> 
> On Tue, May 07, 2013 at 05:11:02PM +1000, Dave Chinner wrote:
> > Hi Josef,
> > 
> > I was just looking at generic/311, and I think there's something
> > fundamentally wrong with the way it is checking the scratch device.
> > 
> > You reported it was failing for internal test 19 on XFS, but I'm
> > seeing it fail randomly after the first test or two. It has never
> > made it past test 3. So I had a little bit of a closer look at its
> > structure. Essentially it is doing this (with the contents seen by
> > each step):
> > 
> > scratch dev + mkfs
> > 	+-------------------------------+
> > overlay dm-flakey
> > 	D-------------------------------D
> > mount/write/kill/unmount dm-flakey
> > 	Dx-x-x-x-x-x-x------------------D
> > 
> > All good up to here. Now, you run _check_scratch_fs, which sees:
> > 
> > scratch dev + check
> > 	+-------------------------------+
> > 
> > i.e. it's not seeing all the changes written to dm-flakey, and so
> > xfs_check is seeing corruption.
> > 
> > After I realised this was stacking block devices and checking the
> > underlying block device, the cause was pretty obvious: scratch-dev
> > and dm-flakey have different address spaces, so changes written
> > through one address space will not be seen through the other address
> > space if there is stale cached data in the original address space.
> > 
> > And that's exactly what is happening. This patch:
> > 
> > --- a/tests/generic/311
> > +++ b/tests/generic/311
> > @@ -79,6 +79,7 @@ _mount_flakey()
> >  _unmount_flakey()
> >  {
> >         $UMOUNT_PROG $SCRATCH_MNT
> > +       echo 3 > /proc/sys/vm/drop_caches
> >  }
> >  
> >  _load_flakey_table()
> > 
> > Makes the problem go away for xfs_check. But really, I don't like
> > the assumption that the test is built on - that writes through one
> > block device are visible through another. It's just asking for weird
> > problems.
> > 
> > Is there some way that you can restructure this test so it doesn't
> > have this problem (e.g. do everything on dm-flakey)?
> > 

So I've made the following patch, which I think will do what you want.
It's kind of ugly, but we have such specific things for fsck that I
don't want to re-implement it all just for this test.  The thing is,
I'm still seeing the failure with test 19 for XFS.  xfs_check always
passes fine for me; what fails is the part where we re-mount the flakey
device and then md5sum the file: we get the md5sum of an empty file,
which doesn't match the md5sum we took before we unmounted.  All of
that is done on the flakey device, so there's no stale caching going on
there.  Let me know what you think about this patch; I'm open to other
less horrible options.  Thanks,

Josef


index 2b3b569..f11119b
--- a/tests/generic/311
+++ b/tests/generic/311
@@ -125,7 +125,10 @@ _run_test()
 
 	#Unmount and fsck to make sure we got a valid fs after replay
 	_unmount_flakey
+	tmp=$SCRATCH_DEV
+	SCRATCH_DEV=$FLAKEY_DEV
 	_check_scratch_fs
 	[ $? -ne 0 ] && _fatal "fsck failed"
+	SCRATCH_DEV=$tmp
 
 	_mount_flakey

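The save/override/restore idiom in the patch can be exercised on its
own (a minimal sketch; _check_scratch_fs and the device paths here are
hypothetical stand-ins for the real xfstests helpers):

```shell
#!/bin/sh
# Minimal sketch of temporarily overriding SCRATCH_DEV so a checker
# helper operates on the dm-flakey view instead of the raw device.
# The helper body and device paths are invented for illustration.

SCRATCH_DEV=/dev/sdb1
FLAKEY_DEV=/dev/mapper/flakey-test

_check_scratch_fs() {
	# The real helper would run fsck against $SCRATCH_DEV.
	echo "checking $SCRATCH_DEV"
}

tmp=$SCRATCH_DEV            # save the real scratch device
SCRATCH_DEV=$FLAKEY_DEV     # point the checker at the dm-flakey view
_check_scratch_fs           # -> checking /dev/mapper/flakey-test
SCRATCH_DEV=$tmp            # restore for the rest of the test
echo "restored $SCRATCH_DEV"
```

Note that any `$?` inspection of the checker must happen before the
restore assignment, because the assignment itself resets the exit
status.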


end of thread, other threads:[~2013-05-07 14:11 UTC | newest]

Thread overview: 4+ messages
2013-05-07  7:11 [problem] xfstests generic/311 unreliable Dave Chinner
2013-05-07  7:37 ` Dave Chinner
2013-05-07 13:28   ` [BULK] " Josef Bacik
2013-05-07 14:10   ` Josef Bacik
