Re: status of userspace release

From: Ben Myers <bpm@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: status of userspace release
Date: Fri, 2 Nov 2012 13:59:23 -0500
Message-ID: <20121102185923.GG9783@sgi.com>
In-Reply-To: <20121102055102.GY29378@dastard>

Hi Dave,

On Fri, Nov 02, 2012 at 04:51:02PM +1100, Dave Chinner wrote:
> On Thu, Oct 25, 2012 at 10:15:01AM -0500, Ben Myers wrote:
> > Hi Folks,
> > 
> > We're working toward a userspace release this month.  There are several patches
> > that need to go in first, including backing out the xfsdump format version bump
> > from Eric, fixes for the makefiles from Mike, and the Polish language update
> > for xfsdump from Jakub.  If anyone knows of something else we need, now is the
> > time to flame about it.  I will take a look around for other important patches
> > too.
> > 
> > This time I'm going to tag an -rc1 (probably later today or tomorrow).  We'll
> > give everyone a few working days to do a final test and/or pipe up if we have
> > missed something important.  Then if all goes well we'll cut the release next
> > Tuesday.
> 
> I think that dump/restore need more work/testing.

Sounds good.  AFAIK there is no blazing hurry to release immediately.

> I've already pointed Eric to the header checksum failures (forkoff
> patch being needed), and that fixes the failures I've been seeing on
> normal xfstests runs.

I've pulled that patch in.  Interesting that it doesn't reproduce on i586 but
is so reliable on x86_64.  It's a good excuse to do some testing on a wider set
of arches before the release.

> Running some large filesystem testing, however, I see more problems.
> I'm using a 17TB filesystem and the --largefs patch series. This
> results in a futex hang in 059 like so:
> 
> [ 4770.007858] xfsrestore      S ffff88021fc52d40  5504  3926   3487 0x00000000
> [ 4770.007858]  ffff880212ea9c68 0000000000000082 ffff880207830140 ffff880212ea9fd8
> [ 4770.007858]  ffff880212ea9fd8 ffff880212ea9fd8 ffff880216cec2c0 ffff880207830140
> [ 4770.007858]  ffff880212ea9d08 ffff880212ea9d58 ffff880207830140 0000000000000000
> [ 4770.007858] Call Trace:
> [ 4770.007858]  [<ffffffff81b8a009>] schedule+0x29/0x70
> [ 4770.007858]  [<ffffffff810db089>] futex_wait_queue_me+0xc9/0x100
> [ 4770.007858]  [<ffffffff810db809>] futex_wait+0x189/0x290
> [ 4770.007858]  [<ffffffff8113acf7>] ? __free_pages+0x47/0x70
> [ 4770.007858]  [<ffffffff810dd41c>] do_futex+0x11c/0xa80
> [ 4770.007858]  [<ffffffff810abbd5>] ? hrtimer_try_to_cancel+0x55/0x110
> [ 4770.007858]  [<ffffffff810abcb2>] ? hrtimer_cancel+0x22/0x30
> [ 4770.007858]  [<ffffffff81b88f44>] ? do_nanosleep+0xa4/0xd0
> [ 4770.007858]  [<ffffffff810dde0d>] sys_futex+0x8d/0x1b0
> [ 4770.007858]  [<ffffffff810ab6e0>] ? update_rmtp+0x80/0x80
> [ 4770.007858]  [<ffffffff81b93a99>] system_call_fastpath+0x16/0x1b
> [ 4770.007858] xfsrestore      S ffff88021fc52d40  5656  3927   3487 0x00000000
> [ 4770.007858]  ffff880208f29c68 0000000000000082 ffff880208f84180 ffff880208f29fd8
> [ 4770.007858]  ffff880208f29fd8 ffff880208f29fd8 ffff880216cec2c0 ffff880208f84180
> [ 4770.007858]  ffff880208f29d08 ffff880208f29d58 ffff880208f84180 0000000000000000
> [ 4770.007858] Call Trace:
> [ 4770.007858]  [<ffffffff81b8a009>] schedule+0x29/0x70
> [ 4770.007858]  [<ffffffff810db089>] futex_wait_queue_me+0xc9/0x100
> [ 4770.007858]  [<ffffffff810db809>] futex_wait+0x189/0x290
> [ 4770.007858]  [<ffffffff810dd41c>] do_futex+0x11c/0xa80
> [ 4770.007858]  [<ffffffff810abbd5>] ? hrtimer_try_to_cancel+0x55/0x110
> [ 4770.007858]  [<ffffffff810abcb2>] ? hrtimer_cancel+0x22/0x30
> [ 4770.007858]  [<ffffffff81b88f44>] ? do_nanosleep+0xa4/0xd0
> [ 4770.007858]  [<ffffffff810dde0d>] sys_futex+0x8d/0x1b0
> [ 4770.007858]  [<ffffffff810ab6e0>] ? update_rmtp+0x80/0x80
> [ 4770.007858]  [<ffffffff81b93a99>] system_call_fastpath+0x16/0x1b
> [ 4770.007858] xfsrestore      S ffff88021fc92d40  5848  3928   3487 0x00000000
> [ 4770.007858]  ffff880212d0dc68 0000000000000082 ffff880208e76240 ffff880212d0dfd8
> [ 4770.007858]  ffff880212d0dfd8 ffff880212d0dfd8 ffff880216cf2300 ffff880208e76240
> [ 4770.007858]  ffff880212d0dd08 ffff880212d0dd58 ffff880208e76240 0000000000000000
> [ 4770.007858] Call Trace:
> [ 4770.007858]  [<ffffffff81b8a009>] schedule+0x29/0x70
> [ 4770.007858]  [<ffffffff810db089>] futex_wait_queue_me+0xc9/0x100
> [ 4770.007858]  [<ffffffff810db809>] futex_wait+0x189/0x290
> [ 4770.007858]  [<ffffffff810dd41c>] do_futex+0x11c/0xa80
> [ 4770.007858]  [<ffffffff810abbd5>] ? hrtimer_try_to_cancel+0x55/0x110
> [ 4770.007858]  [<ffffffff810abcb2>] ? hrtimer_cancel+0x22/0x30
> [ 4770.007858]  [<ffffffff81b88f44>] ? do_nanosleep+0xa4/0xd0
> [ 4770.007858]  [<ffffffff810dde0d>] sys_futex+0x8d/0x1b0
> [ 4770.007858]  [<ffffffff810ab6e0>] ? update_rmtp+0x80/0x80
> [ 4770.007858]  [<ffffffff81b93a99>] system_call_fastpath+0x16/0x1b
> 
> I can't reliably reproduce it at this point, but there does appear
> to be some kind of locking problem in the multistream support.

One of my machines hit this overnight without --largefs.  I wasn't able to get
a dump though.  Just another data point.

> Speaking of which, most large filesystems dump/restore tests are
> failing because of this output:
> 
> 026 20s ... - output mismatch (see 026.out.bad)
> --- 026.out     2012-10-05 11:37:51.000000000 +1000
> +++ 026.out.bad 2012-11-02 16:20:17.000000000 +1100
> @@ -20,6 +20,7 @@
>  xfsdump: media file size NUM bytes
>  xfsdump: dump size (non-dir files) : NUM bytes
>  xfsdump: dump complete: SECS seconds elapsed
> +xfsdump:   stream 0 DUMP_FILE OK (success)
>  xfsdump: Dump Status: SUCCESS
>  Restoring from file...
>  xfsrestore  -f DUMP_FILE  -L stress_026 RESTORE_DIR
> @@ -32,6 +33,7 @@
>  xfsrestore: directory post-processing
>  xfsrestore: restoring non-directory files
>  xfsrestore: restore complete: SECS seconds elapsed
> +xfsrestore:   stream 0 DUMP_FILE OK (success)
>  xfsrestore: Restore Status: SUCCESS
>  Comparing dump directory with restore directory
>  Files DUMP_DIR/big and RESTORE_DIR/DUMP_SUBDIR/big are identical
> 
> Which looks like output from the multistream code. Why it is
> emitting this for large filesystem testing and not for small
> filesystems, I'm not sure yet. 
> 
> In fact, with --largefs, I see this for the dump group:
> 
> Failures: 026 028 046 047 056 059 060 061 063 064 065 066 266 281
> 282 283
> Failed 16 of 19 tests
> 
> And this for the normal sized (10GB) scratch device:
> 
> Passed all 18 tests
> 
> So there's something funky going on here....

Rich also reported some golden output changes related to --largefs a while
back.  I don't think he saw this one though.

The TODO list for userspace release currently stands at:

1) fix the header checksum failures (resolved; the patch is pulled in)
2) fix a futex hang in 059
3) fix the golden output changes related to multistream support in xfsdump
   and --largefs
4) test on more platforms

Regards,
	Ben
