public inbox for linux-xfs@vger.kernel.org
* xfs_repair memory usage and stopping on "Traversing filesystem..."
@ 2010-05-18 19:28 Colin Wilson
  2010-05-19  0:19 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Colin Wilson @ 2010-05-18 19:28 UTC (permalink / raw)
  To: xfs@oss.sgi.com

Hello all,
	I seem to be having the same problem as Tomasz had in this post to the mailing list: http://oss.sgi.com/archives/xfs/2009-07/msg00082.html .  Eric ultimately suggested running xfs_repair with the '-P' and '-o bhash=1024' flags to get past this problem, and described what he thought the underlying problem was as follows:

> "This looks like some of the caching that xfs_repair does is mis-sized,
> and it gets stuck when it's unable to find a slot for a new node to
> cache.  IMHO that's still a bug that I'd like to work out.  If it gets
> stuck this way, it'd probably be better to exit, and suggest a larger
> hash size."

	Currently my file system is ~50 TB in size with ~40 TB in use, and when I do the repair, memory usage ends up between 10 and 11 GB for most of the check.  The system currently has 12 GB of RAM, not including swap.  Is this expected behavior?  My concern is setting bhash too large and causing xfs_repair to swap for long periods of time.  It already takes a few days to get to Phase 6 in the repair.

	I am currently running Debian Lenny (5.0.4) with xfsprogs 2.9.8 on Linux kernel 2.6.26.  I've briefly looked through the change logs for newer versions of xfsprogs and noticed a few updates mentioning better memory performance or management, so upgrading to a newer version may be all I need.  Has the bug Eric mentions been fixed in a later version of xfsprogs?  What is your suggestion as to my best course of action to get this xfs_repair to complete in a timely manner without using up all the RAM in my system?  Thanks

xfs_info dump:
# xfs_info /u1/
meta-data=/dev/mapper/sangroup-sandisk isize=256    agcount=821, agsize=15258784 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=12514290688, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0
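The headline numbers in that dump cross-check with a little arithmetic (a quick sketch using only the values printed above):

```python
# Cross-checking the xfs_info figures above (values copied from the dump):
bsize = 4096                   # bytes per data block
blocks = 12514290688           # data blocks
agcount = 821
agsize = 15258784              # blocks per allocation group

total_bytes = bsize * blocks
print(total_bytes / 10**12)    # ~51.3 TB -- matches the "~50 TB" figure
print(agcount * agsize >= blocks)  # True: the AGs cover every data block
```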

--Colin

Colin Wilson
Linux Systems Administrator
T +1.781.810.1331
F +1.781.891.5145
cwilson@blackducksoftware.com
http://www.blackducksoftware.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair memory usage and stopping on "Traversing filesystem..."
  2010-05-18 19:28 xfs_repair memory usage and stopping on "Traversing filesystem..." Colin Wilson
@ 2010-05-19  0:19 ` Dave Chinner
  2010-05-19  2:15   ` Eric Sandeen
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2010-05-19  0:19 UTC (permalink / raw)
  To: Colin Wilson; +Cc: xfs@oss.sgi.com

On Tue, May 18, 2010 at 03:28:20PM -0400, Colin Wilson wrote:
> Hello all, I seem to be having the same problem as Tomasz had in
> this post to the mailing list:
> http://oss.sgi.com/archives/xfs/2009-07/msg00082.html .  Eric
> ultimately suggested running xfs_repair with the '-P' and '-o
> bhash=1024' flags to get past this problem and described what he
> thought the underlying problem was as follows:
> 
> > "This looks like some of the caching that xfs_repair does is
> > mis-sized, and it gets stuck when it's unable to find a slot for
> > a new node to cache.  IMHO that's still a bug that I'd like to
> > work out.  If it gets stuck this way, it'd probably be better to
> > exit, and suggest a larger hash size."
> 
> Currently my file system is ~50 TB in size with ~40TB in use and
> when I do the repair memory usage ends up between 10 and 11 GB
> used for most of the check.  The system currently has 12GB of ram
> not including swap.  Is this expected behavior?

Given you are running v2.9.8, I'd say yes, and one of your problems
is that repair is swapping: the base memory footprint is likely to
be on the order of 40-50GB of RAM for xfs_repair.

I just ran xfs_check on an empty 51TB filesystem w/ 821 AGs to get
an idea of how much RAM an older xfs_repair will use (as I have
3.1.2 installed on my test machines). It allocated about 115GB of
virtual memory space, consuming all the RAM+swap in the machine
before being OOM-killed.
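Scaled linearly (a rough back-of-the-envelope extrapolation, not a tuned rule of thumb), that measurement shows why a ~50TB filesystem is far beyond 12GB of RAM with the old tools:

```python
# Back-of-the-envelope scaling of the measurement above: ~115 GB of
# virtual memory for a 51 TB filesystem under the 2.9.x-era tools.
gb_per_tb = 115 / 51        # ~2.25 GB of address space per TB
fs_tb = 50                  # size of the filesystem in question
estimate = gb_per_tb * fs_tb
print(estimate)             # ~113 GB -- roughly 10x the 12 GB of RAM fitted
```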

> My concern is
> setting bhash too large and causing xfs_repair to swap for long
> periods of time.  It already takes a few days to get to Phase 6 in
> the repair.

Must be swapping, then...

> I am currently running Debian Lenny(5.0.4) with xfsprogs 2.9.8
> with linux kernel 2.6.26.  I've briefly looked through the change
> logs for newer version of xfsprogs and noticed that there were a
> few updates mentioning better memory performance or management  so
> upgrading to a newer version may be all I need.

Yup, there were major memory usage reductions in xfs_repair in
3.1.0.  Looking at the same empty filesystem as above, the base
xfs_repair memory footprint is a few tens of megabytes of RAM. That
will definitely balloon to a few GB as the filesystem metadata is
read in and cached, but I doubt it will get anywhere near what 2.9.8
requires, so it should be much faster.

Hence I'd start by upgrading to 3.1.2 and running with the default
options first to see whether it is faster and whether it hangs or
not before going any further.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: xfs_repair memory usage and stopping on "Traversing filesystem..."
  2010-05-19  0:19 ` Dave Chinner
@ 2010-05-19  2:15   ` Eric Sandeen
  2010-05-24 20:30     ` Colin Wilson
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Sandeen @ 2010-05-19  2:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Colin Wilson, xfs@oss.sgi.com

Dave Chinner wrote:

> Hence I'd start by upgrading to 3.1.2 and running with the default
> options first to see whether it is faster and whether it hangs or
> not before going any further.

If it still hangs, collecting an xfs_metadump of the fs would be 
useful for investigating the problem.

But, I think I fixed that (the options you mentioned were workarounds
for the bug I eventually fixed, IIRC)

Thanks,
-Eric

> Cheers,
> 
> Dave.



* Re: xfs_repair memory usage and stopping on "Traversing filesystem..."
  2010-05-19  2:15   ` Eric Sandeen
@ 2010-05-24 20:30     ` Colin Wilson
  0 siblings, 0 replies; 4+ messages in thread
From: Colin Wilson @ 2010-05-24 20:30 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs@oss.sgi.com



I got some downtime this weekend and tried another xfs_repair with the latest (3.1.2) version of the XFS tools.  This time the check ran much slower than before and used much more swap.  My system currently has 12 GB of RAM and 16 GB of swap space; do you have any rule of thumb for how much memory the system should need?  I am thinking adding more memory is the only way to fix my problem as it stands, since it's just slowness now.  I don't remember how much swap it ended up using, but the process ran until I killed it to bring the file system back online, without running out of total memory.

This may not be the biggest problem in the world, but I tried to take a metadata dump in case that was helpful.  The process ran to a certain point and then hung, with xfs_db using 100% of one of my cores.  I've confirmed the same outcome three runs in a row.  The output of xfs_metadump was:

:~# xfs_metadump -gw /dev/mapper/sangroup-sandisk ./metadata.dump
Copied 8192 of 1732067904 inodes (0 of 821 AGs)
xfs_metadump: suspicious count 1152 in bmap extent 89 in dir2 ino 12743
xfs_metadump: suspicious count 1455 in bmap extent 135 in dir2 ino 12743
xfs_metadump: suspicious count 1074 in bmap extent 2 in dir2 ino 12743
Copied 8151232 of 1732067904 inodes (0 of 821 AGs)

/usr/sbin/xfs_metadump: line 31:  5363 Terminated              xfs_db$DBOPTS -F -i -p xfs_metadump -c "metadump$OPTS $2" $1

The process would hang at "Copied 8151232 of 1732067904 inodes (0 of 821 AGs)", and the rest of the output is from me killing the xfs_db process.  Thanks for all the help.
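For scale, the progress counters in that output put the hang very early in the run (simple arithmetic on the figures above):

```python
# How far the metadump got before hanging (figures from the output above):
copied = 8151232
total = 1732067904
pct = 100 * copied / total
print(f"{pct:.2f}% of inodes copied")   # ~0.47%, still inside AG 0 of 821
```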

--Colin

Colin Wilson
Linux Systems Administrator
T +1.781.810.1331
F +1.781.891.5145
cwilson@blackducksoftware.com
http://www.blackducksoftware.com




On May 18, 2010, at 10:15 PM, Eric Sandeen wrote:

> Dave Chinner wrote:
> 
>> Hence I'd start by upgrading to 3.1.2 and running with the default
>> options first to see whether it is faster and whether it hangs or
>> not before going any further.
> 
> If it still hangs, collecting an xfs_metadump of the fs would be
> useful for investigating the problem.
> 
> But, I think I fixed that (the options you mentioned were workarounds
> for the bug I eventually fixed, IIRC)
> 
> Thanks,
> -Eric
> 
>> Cheers,
>> 
>> Dave.






end of thread, other threads:[~2010-05-24 20:28 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-18 19:28 xfs_repair memory usage and stopping on "Traversing filesystem..." Colin Wilson
2010-05-19  0:19 ` Dave Chinner
2010-05-19  2:15   ` Eric Sandeen
2010-05-24 20:30     ` Colin Wilson
