* xfs_repair: "fatal error -- ran out of disk space!"
@ 2011-06-22 21:32 Patrick J. LoPresti
2011-06-22 22:27 ` Eric Sandeen
0 siblings, 1 reply; 6+ messages in thread
From: Patrick J. LoPresti @ 2011-06-22 21:32 UTC (permalink / raw)
To: xfs
I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
I am trying to run "xfs_repair" on it.
The output is appended.
Question: What am I supposed to do about this? "xfs_repair -V" says
"xfs_repair version 3.1.5". (I downloaded and built the latest
version hoping it would fix the issue, but no luck.) Should I just
start deleting files at random?
Any ideas would be appreciated; I am trying to get this server back
up, and restoring 5.1T is not going to be pleasant.
Thanks!
- Pat
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
sb_icount 42688, counted 59328
sb_ifree 1, counted 36
sb_fdblocks 104582610, counted 24
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 3
- agno = 5
- agno = 4
- agno = 1
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
fatal error -- ran out of disk space!
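The sb_fdblocks line in the Phase 2 output above is worth converting to a size. A rough sketch, assuming 4096-byte filesystem blocks (the output does not state the block size, so that value is an assumption), compares the free-block counter recorded in the superblock with the count xfs_repair arrived at:

```python
# Convert the free-block counts from the xfs_repair output into sizes.
# ASSUMPTION: 4096-byte filesystem blocks; the output does not state the
# block size, so adjust BLOCK_SIZE if the filesystem was made differently.
BLOCK_SIZE = 4096
GIB = 1024 ** 3
KIB = 1024


def blocks_to_bytes(blocks, block_size=BLOCK_SIZE):
    """Return the size in bytes of `blocks` filesystem blocks."""
    return blocks * block_size


sb_fdblocks = 104582610  # free blocks recorded in the superblock
counted = 24             # free blocks xfs_repair actually counted

print(f"superblock says: {blocks_to_bytes(sb_fdblocks) / GIB:.1f} GiB free")
print(f"repair counted:  {blocks_to_bytes(counted) / KIB:.0f} KiB free")
```

Under that assumption the superblock counter works out to roughly 399 GiB, which lines up with the 399G that "df" shows, while the counted value is only 96 KiB. That would be consistent with a stale free-space counter rather than real free space, though the output alone does not prove it.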
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: xfs_repair: "fatal error -- ran out of disk space!"
2011-06-22 21:32 xfs_repair: "fatal error -- ran out of disk space!" Patrick J. LoPresti
@ 2011-06-22 22:27 ` Eric Sandeen
2011-06-22 23:24 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2011-06-22 22:27 UTC (permalink / raw)
To: Patrick J. LoPresti; +Cc: xfs

On 6/22/11 4:32 PM, Patrick J. LoPresti wrote:
> I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
>
> I am trying to run "xfs_repair" on it.
>
> The output is appended.
>
> Question: What am I supposed to do about this? "xfs_repair -V" says
> "xfs_repair version 3.1.5". (I downloaded and built the latest
> version hoping it would fix the issue, but no luck.) Should I just
> start deleting files at random?

You could start by removing a few files you know you don't need, rather than
at random. :)

TBH I've not seen this one before, and the error message is not all that
helpful. It'd be nice to know how many blocks it was trying to reserve
when it ran out of space; I guess you'd need to use gdb, or instrument
all the calls to res_failed() in phase6.c to know for sure...

You could also capture an xfs_metadump of the fs and provide it for analysis,
it would let us reproduce the issue and know for sure what's going on.
By default it obfuscates metadata.

-Eric
^ permalink raw reply [flat|nested] 6+ messages in thread
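For anyone wanting to follow the xfs_metadump suggestion, here is a minimal sketch of how the capture might be scripted. The device path and output file name are hypothetical placeholders, and the wrapper functions are illustrative only. It relies on the `xfs_metadump source target` form and on `-o` being the switch that turns obfuscation off; as far as I know the tool should be pointed at an unmounted, read-only mounted, or frozen filesystem.

```python
import subprocess


def metadump_argv(device, target, obfuscate=True):
    """Build the xfs_metadump command line.

    Obfuscation is the tool's default; "-o" disables it.
    """
    argv = ["xfs_metadump"]
    if not obfuscate:
        argv.append("-o")
    argv += [device, target]
    return argv


def capture_metadump(device, target, obfuscate=True):
    """Run xfs_metadump and raise CalledProcessError if it fails."""
    subprocess.run(metadump_argv(device, target, obfuscate), check=True)


if __name__ == "__main__":
    # HYPOTHETICAL device and file names, for illustration only.
    print(" ".join(metadump_argv("/dev/sdX1", "fs.metadump")))
```

The dump contains metadata only, so it is far smaller than the 5.1T filesystem, and keeping the default obfuscation means file names are scrambled before the image is shared.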
* Re: xfs_repair: "fatal error -- ran out of disk space!"
2011-06-22 22:27 ` Eric Sandeen
@ 2011-06-22 23:24 ` Dave Chinner
2011-06-22 23:41 ` Patrick J. LoPresti
0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2011-06-22 23:24 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Patrick J. LoPresti, xfs

On Wed, Jun 22, 2011 at 05:27:14PM -0500, Eric Sandeen wrote:
> On 6/22/11 4:32 PM, Patrick J. LoPresti wrote:
> > I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
> >
> > I am trying to run "xfs_repair" on it.
> >
> > The output is appended.
> >
> > Question: What am I supposed to do about this? "xfs_repair -V" says
> > "xfs_repair version 3.1.5". (I downloaded and built the latest
> > version hoping it would fix the issue, but no luck.) Should I just
> > start deleting files at random?
>
> You could start by removing a few files you know you don't need, rather than
> at random. :)
>
> TBH I've not seen this one before, and the error message is not all that
> helpful. It'd be nice to know how many blocks it was trying to reserve
> when it ran out of space; I guess you'd need to use gdb, or instrument
> all the calls to res_failed() in phase6.c to know for sure...

Also, the number of inodes and directories in your filesystem might
tell us whether we should expect an ENOSPC, as well. I suspect that
there's an accounting error, because 400GB of transaction
reservations is an awful lot of directory rebuilds....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: xfs_repair: "fatal error -- ran out of disk space!"
2011-06-22 23:24 ` Dave Chinner
@ 2011-06-22 23:41 ` Patrick J. LoPresti
2011-06-23 7:42 ` Stan Hoeppner
0 siblings, 1 reply; 6+ messages in thread
From: Patrick J. LoPresti @ 2011-06-22 23:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs

Hi, Dave and Eric. And thank you for the quick reply.

I blew away a couple of files (200-300 megabytes; I did not write it
down) and then xfs_repair succeeded. And now "df" shows the partition
as 100% full (265M free out of 5.1T), not 93% full (399G free).

I think the file system actually was full, but corrupted.

The reason I was trying to run xfs_repair is that the system was
acting... "funny" (but not "ha ha" funny). Specifically, a nfsd task
was consuming 100% CPU even though no NFS traffic was visible on the
network. cat /proc/task_id/stack suggested the nfsd was in an
infinite loop calling into XFS trying to allocate an extent or
something. This nfsd held a lock making it impossible to umount the
partition (among other things).

My guess is that nfsd was fooled much like df into thinking there was
space available, but when it tried to actually obtain that space, it
was told "please try again". Which it did, forever.

I guess one question is how xfs_repair should behave in this case. I
mean, what if the file system had been full, but too corrupt for me to
delete anything?

Anyway, my problem is fixed. Well, until the filesystem gets corrupted
again, anyway; I still have not identified the underlying cause of
that...

Thank you again for the prompt response.

- Pat
^ permalink raw reply [flat|nested] 6+ messages in thread
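The /proc/task_id/stack check described above can be repeated a few times to see whether a task keeps sitting in the same call path. A minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/stack (reading it normally needs root) and the usual "[<address>] function+offset/size" line layout; the helper names are made up for illustration:

```python
from pathlib import Path


def read_kernel_stack(pid):
    """Return the kernel stack text for a task (usually requires root)."""
    return Path(f"/proc/{pid}/stack").read_text()


def frame_functions(stack_text):
    """Extract bare function names from /proc/<pid>/stack style text.

    ASSUMPTION: each frame looks like "[<address>] function+0x10/0x40".
    Lines that do not match that shape are skipped.
    """
    names = []
    for line in stack_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("[<"):
            names.append(parts[1].split("+")[0])
    return names


def same_path(samples):
    """True if every sampled stack shows the same sequence of functions."""
    traces = [tuple(frame_functions(s)) for s in samples]
    return len(set(traces)) <= 1
```

Taking several samples a second or so apart and passing them to same_path() gives a crude hint: a busy task that always shows the same XFS allocation frames looks more like a retry loop than like ordinary work. It is only a hint, since a stack snapshot says nothing about why the loop does not terminate.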
* Re: xfs_repair: "fatal error -- ran out of disk space!"
2011-06-22 23:41 ` Patrick J. LoPresti
@ 2011-06-23 7:42 ` Stan Hoeppner
2011-06-23 14:16 ` Patrick J. LoPresti
0 siblings, 1 reply; 6+ messages in thread
From: Stan Hoeppner @ 2011-06-23 7:42 UTC (permalink / raw)
To: Patrick J. LoPresti; +Cc: Eric Sandeen, xfs

On 6/22/2011 6:41 PM, Patrick J. LoPresti wrote:
> I guess one question is how xfs_repair should behave in this case. I
> mean, what if the file system had been full, but too corrupt for me to
> delete anything?

Maybe you should rethink your policy on filesystem space management.
From what you stated the FS in question actually was full. You
apparently were unaware of it until a problem (misbehaving nfsd process)
brought it to your attention. You should be monitoring your FS usage.
Something as simple as logwatch daily summaries can save your bacon here.

As a general rule, when an FS begins steadily growing past the 80% mark
heading toward 90%, you need to take action, either adding more disk to
the underlying LVM device and growing the FS, mounting a new device/FS
into a new directory in the tree and manually moving files, or making
use of some HSM software.

Full filesystems have been a source of problems basically forever. It's
best to avoid such situations instead of tickling the dragon.

--
Stan
^ permalink raw reply [flat|nested] 6+ messages in thread
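The 80% rule of thumb above is easy to script. A minimal sketch using only the Python standard library; the mount point list and the threshold default are placeholders, not anything taken from the thread:

```python
import shutil


def percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some pseudo-filesystems report zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def needs_attention(pct, threshold=80.0):
    """True once usage reaches the warning threshold."""
    return pct >= threshold


if __name__ == "__main__":
    # PLACEHOLDER mount points; list the filesystems you care about.
    for mount_point in ["/"]:
        pct = percent_used(mount_point)
        flag = "WARN" if needs_attention(pct) else "ok"
        print(f"{mount_point}: {pct:.1f}% used [{flag}]")
```

A check like this only sees the numbers the filesystem itself reports, the same ones "df" shows, so it cannot notice free-space counters that are themselves wrong.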
* Re: xfs_repair: "fatal error -- ran out of disk space!"
2011-06-23 7:42 ` Stan Hoeppner
@ 2011-06-23 14:16 ` Patrick J. LoPresti
0 siblings, 0 replies; 6+ messages in thread
From: Patrick J. LoPresti @ 2011-06-23 14:16 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Eric Sandeen, xfs

Of course we monitor our file systems. But as I thought I made clear,
we were "unaware" the file system was full because df said it still
had 399 gigabytes of free space. Granted, this is "only" 7%, but it
could just as easily have been 30% or 50% or 80% because _the file
system was corrupt_.

Also, given your "80% rule", I suspect you have never worked in an
environment like mine. This file system is one of around 40 of
similar size in a single pool. Am I supposed to tell my boss that we
need more disk as soon as our free space goes below 40 terabytes?

The bottom line is that the file system was full but appeared not to
be, and thus xfs_repair bombed out. I realize this is a corner case,
but it is a nasty one, and it has nothing to do with my "policy on
filesystem space management".

But thank you for your input.

- Pat
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread
Thread overview: 6+ messages
2011-06-22 21:32 xfs_repair: "fatal error -- ran out of disk space!" Patrick J. LoPresti
2011-06-22 22:27 ` Eric Sandeen
2011-06-22 23:24 ` Dave Chinner
2011-06-22 23:41 ` Patrick J. LoPresti
2011-06-23 7:42 ` Stan Hoeppner
2011-06-23 14:16 ` Patrick J. LoPresti