public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* Stalled xfs_repair on 100TB filesystem
@ 2010-03-02 17:22 Jason Vagalatos
  2010-03-03  0:25 ` Dave Chinner
  2010-03-03  0:35 ` Stan Hoeppner
  0 siblings, 2 replies; 6+ messages in thread
From: Jason Vagalatos @ 2010-03-02 17:22 UTC (permalink / raw)
  To: xfs@oss.sgi.com

Hello,
On Friday 2/26 I started an xfs_repair on a 100TB filesystem:

#> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &

I've been monitoring the process with 'top' and tailing the output file from the redirect above.  I believe the repair has "stalled".  When the process was running 'top' showed almost all physical memory consumed and 12.6G of virt memory consumed by xfs_repair.  It made it all the way to Phase 6 and has been sitting at agno = 14 for almost 48 hours.  The memory consumption of xfs_repair has ceased but the process is still "running" and consuming 100% CPU:

top - 10:10:37 up 3 days, 21:06,  1 user,  load average: 1.20, 1.13, 1.09
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.5%us,  0.0%sy,  0.0%ni, 87.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8177380k total,   896668k used,  7280712k free,   247100k buffers
Swap: 56525356k total,   173852k used, 56351504k free,   304588k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32705 root      25   0  160m  95m  704 R  100  1.2 2629:53   xfs_repair

#> tail -f -n1000 xfs_repair.out.logfs1.sjc.02262010
........
        - agno = 98
        - agno = 99
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
<stopped here; the fs has 99 AGs>

Is there anything I can do at this point to salvage the repair?  I do not want to kill the repair process based on the amount of time it takes to run.  If I do kill it, is there any risk of damaging the filesystem?

Any help would be greatly appreciated.

Thank you

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Stalled xfs_repair on 100TB filesystem
  2010-03-02 17:22 Stalled xfs_repair on 100TB filesystem Jason Vagalatos
@ 2010-03-03  0:25 ` Dave Chinner
  2010-03-03  0:35 ` Stan Hoeppner
  1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2010-03-03  0:25 UTC (permalink / raw)
  To: Jason Vagalatos; +Cc: xfs@oss.sgi.com

On Tue, Mar 02, 2010 at 09:22:34AM -0800, Jason Vagalatos wrote:
> Hello, On Friday 2/26 I started an xfs_repair on a 100TB
> filesystem:
> 
> #> nohup xfs_repair -v -l /dev/logfs-sessions/logdev
> /dev/logfs-sessions/sessions >
> /root/xfs_repair.out.logfs1.sjc.02262010 &
> 
> I've been monitoring the process with 'top' and tailing the output
> file from the redirect above.  I believe the repair has
> "stalled".  When the process was running 'top' showed almost all
> physical memory consumed and 12.6G of virt memory consumed by
> xfs_repair.  It made it all the way to Phase 6 and has been
> sitting at agno = 14 for almost 48 hours.  The memory consumption
> of xfs_repair has ceased but the process is still "running" and
> consuming 100% CPU:

I wish we could reproduce hangs like this easily. I'd kill the
repair and run with the -P option. From the xfs_repair man page:

       -P     Disable prefetching of inode and directory blocks. Use
	      this option if you find xfs_repair gets stuck and stops
	      proceeding. Interrupting a stuck xfs_repair is safe.
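Dave's suggestion, written out as concrete commands (a sketch only: the device paths are the ones from the original post, and the restart command is printed rather than executed here, since it needs the real log and data devices):

```shell
# Sketch only: devices/paths are taken from the original post.
# Per the man page, interrupting a stuck xfs_repair is safe, so the stalled
# process can be killed first (kill <pid>), then the repair restarted with -P
# to disable prefetching of inode and directory blocks.
cmd='nohup xfs_repair -P -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions'
# Printed rather than run, since it needs the real devices:
echo "$cmd > /root/xfs_repair.out.logfs1.sjc.02262010 &"
```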

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Stalled xfs_repair on 100TB filesystem
  2010-03-02 17:22 Stalled xfs_repair on 100TB filesystem Jason Vagalatos
  2010-03-03  0:25 ` Dave Chinner
@ 2010-03-03  0:35 ` Stan Hoeppner
  2010-03-03  0:44   ` Eric Sandeen
  1 sibling, 1 reply; 6+ messages in thread
From: Stan Hoeppner @ 2010-03-03  0:35 UTC (permalink / raw)
  To: xfs

Jason Vagalatos put forth on 3/2/2010 11:22 AM:
> Hello,
> On Friday 2/26 I started an xfs_repair on a 100TB filesystem:
> 
> #> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &
> 
> I've been monitoring the process with 'top' and tailing the output file from the redirect above.  I believe the repair has "stalled".  When the process was running 'top' showed almost all physical memory consumed and 12.6G of virt memory consumed by xfs_repair.  It made it all the way to Phase 6 and has been sitting at agno = 14 for almost 48 hours.  The memory consumption of xfs_repair has ceased but the process is still "running" and consuming 100% CPU:

Here's how another user solved this xfs_repair "hanging" problem.  I say
"hang" because "stall" didn't return the right Google results.

http://marc.info/?l=linux-xfs&m=120600321509730&w=2

Excerpt:

"In between I created a test filesystem, 360GB with 120 million inodes on it.
xfs_repair without options is unable to complete. If I run xfs_repair -o
bhash=8192 the repair process terminates normally (the filesystem is
actually ok)."

Unfortunately it appears you'll have to start the repair over again.

-- 
Stan


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Stalled xfs_repair on 100TB filesystem
  2010-03-03  0:35 ` Stan Hoeppner
@ 2010-03-03  0:44   ` Eric Sandeen
  2010-03-03  1:15     ` Jason Vagalatos
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2010-03-03  0:44 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

Stan Hoeppner wrote:
> Jason Vagalatos put forth on 3/2/2010 11:22 AM:
>> Hello,
>> On Friday 2/26 I started an xfs_repair on a 100TB filesystem:
>>
>> #> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &
>>
>> I've been monitoring the process with 'top' and tailing the output file from the redirect above.  I believe the repair has "stalled".  When the process was running 'top' showed almost all physical memory consumed and 12.6G of virt memory consumed by xfs_repair.  It made it all the way to Phase 6 and has been sitting at agno = 14 for almost 48 hours.  The memory consumption of xfs_repair has ceased but the process is still "running" and consuming 100% CPU:
> 
> Here's how another user solved this xfs_repair "hanging" problem.  I say
> "hang" because "stall" didn't return the right Google results.
> 
> http://marc.info/?l=linux-xfs&m=120600321509730&w=2
> 
> Excerpt:
> 
> "In between I created a test filesystem, 360GB with 120 million inodes on it.
> xfs_repair without options is unable to complete. If I run xfs_repair -o
> bhash=8192 the repair process terminates normally (the filesystem is
> actually ok)."
> 
> Unfortunately it appears you'll have to start the repair over again.
> 

FWIW, Jason - which xfsprogs version are you running?  This patch went in a while back:

> [PATCH] libxfs: increase hash chain depth when we run out of slots

> A couple people reported xfs_repair hangs after
> "Traversing filesystem ..." in xfs_repair.  This happens
> when all slots in the cache are full and referenced, and the
> loop in cache_node_get() which tries to shake unused entries
> fails to find any - it just keeps upping the priority and goes
> forever.
> 
> This can be worked around by restarting xfs_repair with
> -P and/or "-o bhash=<largersize>" for older xfs_repair.
> 
> I started down the path of increasing the number of hash buckets
> on the fly, but Barry suggested simply increasing the max allowed
> depth which is much simpler (thanks!)
> 
> Resizing the hash lengths does mean that cache_report ends up with
> most things in the "greater-than" category:
> 
> ...
> Hash buckets with  23 entries      3 (  3%)
> Hash buckets with  24 entries      3 (  3%)
> Hash buckets with >24 entries     50 ( 85%)
> 
> but I think I'll save that fix for another patch unless there's
> real concern right now.
> 
> I tested this on the metadump image provided by Tomek.
> 
> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
> Reported-by: Tomek Kruszona <bloodyscarion@gmail.com>
> Reported-by: Riku Paananen <riku.paananen@helsinki.fi>
> ---
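The loop the patch description refers to can be modelled with a toy cache (illustrative Python, not libxfs code): when every node in the bucket is referenced, the shaker never frees a slot, and the old lookup loop just keeps retrying; the fix instead lets the hash chain grow past its nominal max depth.

```python
# Toy model of the cache_node_get() hang: all cached nodes are pinned
# ("referenced"), so shaking frees nothing. Old behaviour: retry forever
# (bounded here so the demo terminates). Fixed behaviour: exceed max depth.

MAX_DEPTH = 2

class Cache:
    def __init__(self, allow_overflow):
        self.bucket = []                  # one hash bucket: list of (key, pinned)
        self.allow_overflow = allow_overflow

    def shake(self):
        # Evict one unpinned node, if any; return True on success.
        for i, (key, pinned) in enumerate(self.bucket):
            if not pinned:
                del self.bucket[i]
                return True
        return False

    def get(self, key, max_retries=10):
        for _ in range(max_retries):
            if len(self.bucket) < MAX_DEPTH or self.shake():
                self.bucket.append((key, True))
                return True
            if self.allow_overflow:
                # The fix: deepen the chain instead of spinning.
                self.bucket.append((key, True))
                return True
            # Old behaviour: up the priority and try again -- forever.
        return False                      # the "hang", cut short for the demo

old = Cache(allow_overflow=False)
new = Cache(allow_overflow=True)
for k in range(3):                        # third insert finds every slot pinned
    old_ok = old.get(k)
    new_ok = new.get(k)
print(old_ok, new_ok)                     # old lookup spins; fixed one succeeds
```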


-Eric


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Stalled xfs_repair on 100TB filesystem
  2010-03-03  0:44   ` Eric Sandeen
@ 2010-03-03  1:15     ` Jason Vagalatos
  2010-03-03  2:08       ` Eric Sandeen
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Vagalatos @ 2010-03-03  1:15 UTC (permalink / raw)
  To: Eric Sandeen, Stan Hoeppner; +Cc: xfs@oss.sgi.com

We are running xfs_repair v2.9.4.  Will just the -P flag suffice for this version?  What is the best way to calculate the bhash size value needed if we need to use that option too?

Thanks,

Jason Vagalatos

-----Original Message-----
From: xfs-bounces@oss.sgi.com [mailto:xfs-bounces@oss.sgi.com] On Behalf Of Eric Sandeen
Sent: Tuesday, March 02, 2010 4:44 PM
To: Stan Hoeppner
Cc: xfs@oss.sgi.com
Subject: Re: Stalled xfs_repair on 100TB filesystem

Stan Hoeppner wrote:
> Jason Vagalatos put forth on 3/2/2010 11:22 AM:
>> Hello,
>> On Friday 2/26 I started an xfs_repair on a 100TB filesystem:
>>
>> #> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &
>>
>> I've been monitoring the process with 'top' and tailing the output file from the redirect above.  I believe the repair has "stalled".  When the process was running 'top' showed almost all physical memory consumed and 12.6G of virt memory consumed by xfs_repair.  It made it all the way to Phase 6 and has been sitting at agno = 14 for almost 48 hours.  The memory consumption of xfs_repair has ceased but the process is still "running" and consuming 100% CPU:
> 
> Here's how another user solved this xfs_repair "hanging" problem.  I say
> "hang" because "stall" didn't return the right Google results.
> 
> http://marc.info/?l=linux-xfs&m=120600321509730&w=2
> 
> Excerpt:
> 
> "In between I created a test filesystem, 360GB with 120 million inodes on it.
> xfs_repair without options is unable to complete. If I run xfs_repair -o
> bhash=8192 the repair process terminates normally (the filesystem is
> actually ok)."
> 
> Unfortunately it appears you'll have to start the repair over again.
> 

FWIW, Jason - which xfsprogs version are you running?  This patch went in a while back:

> [PATCH] libxfs: increase hash chain depth when we run out of slots

> A couple people reported xfs_repair hangs after
> "Traversing filesystem ..." in xfs_repair.  This happens
> when all slots in the cache are full and referenced, and the
> loop in cache_node_get() which tries to shake unused entries
> fails to find any - it just keeps upping the priority and goes
> forever.
> 
> This can be worked around by restarting xfs_repair with
> -P and/or "-o bhash=<largersize>" for older xfs_repair.
> 
> I started down the path of increasing the number of hash buckets
> on the fly, but Barry suggested simply increasing the max allowed
> depth which is much simpler (thanks!)
> 
> Resizing the hash lengths does mean that cache_report ends up with
> most things in the "greater-than" category:
> 
> ...
> Hash buckets with  23 entries      3 (  3%)
> Hash buckets with  24 entries      3 (  3%)
> Hash buckets with >24 entries     50 ( 85%)
> 
> but I think I'll save that fix for another patch unless there's
> real concern right now.
> 
> I tested this on the metadump image provided by Tomek.
> 
> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
> Reported-by: Tomek Kruszona <bloodyscarion@gmail.com>
> Reported-by: Riku Paananen <riku.paananen@helsinki.fi>
> ---


-Eric


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Stalled xfs_repair on 100TB filesystem
  2010-03-03  1:15     ` Jason Vagalatos
@ 2010-03-03  2:08       ` Eric Sandeen
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2010-03-03  2:08 UTC (permalink / raw)
  To: Jason Vagalatos; +Cc: Stan Hoeppner, xfs@oss.sgi.com

Jason Vagalatos wrote:
> We are running xfs_repair v2.9.4.  Will just the -P flag suffice for
> this version?  What is the best way to calculate the bhash size value
> needed if we need to use that option too?

Good, that explains it then; the fix went in after that.

I think -P will suffice.

As for the bhash size ... really not sure.  Doubling it from the default 
is probably sufficient but I haven't really investigated it.
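If -P alone doesn't get a 2.9.4 repair through, both workarounds can be combined (a sketch only: bhash=8192 is the value from the thread Stan linked, not a tuned number, and the command is printed rather than executed since it needs the real devices):

```shell
# Sketch: -P plus an enlarged buffer-cache hash. bhash=8192 comes from the
# earlier thread, not from any sizing calculation; doubling the default is
# another starting point. Printed rather than run, since it needs the real
# devices from the original post.
cmd='xfs_repair -P -o bhash=8192 -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions'
echo "nohup $cmd > /root/xfs_repair.out.logfs1.sjc.02262010 &"
```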

-Eric

> Thanks,
> 
> Jason Vagalatos
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-03-03  2:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-02 17:22 Stalled xfs_repair on 100TB filesystem Jason Vagalatos
2010-03-03  0:25 ` Dave Chinner
2010-03-03  0:35 ` Stan Hoeppner
2010-03-03  0:44   ` Eric Sandeen
2010-03-03  1:15     ` Jason Vagalatos
2010-03-03  2:08       ` Eric Sandeen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox