* xfs_repair: "fatal error -- ran out of disk space!"
@ 2011-06-22 21:32 Patrick J. LoPresti
From: Patrick J. LoPresti @ 2011-06-22 21:32 UTC (permalink / raw)
To: xfs
I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
I am trying to run "xfs_repair" on it.
The output is appended.
Question: What am I supposed to do about this? "xfs_repair -V" says
"xfs_repair version 3.1.5". (I downloaded and built the latest
version hoping it would fix the issue, but no luck.) Should I just
start deleting files at random?
Any ideas would be appreciated; I am trying to get this server back
up, and restoring 5.1T is not going to be pleasant.
Thanks!
- Pat
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
sb_icount 42688, counted 59328
sb_ifree 1, counted 36
sb_fdblocks 104582610, counted 24
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 3
- agno = 5
- agno = 4
- agno = 1
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
fatal error -- ran out of disk space!
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: xfs_repair: "fatal error -- ran out of disk space!"
From: Eric Sandeen @ 2011-06-22 22:27 UTC (permalink / raw)
To: Patrick J. LoPresti; +Cc: xfs
On 6/22/11 4:32 PM, Patrick J. LoPresti wrote:
> I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
>
> I am trying to run "xfs_repair" on it.
>
> The output is appended.
>
> Question: What am I supposed to do about this? "xfs_repair -V" says
> "xfs_repair version 3.1.5". (I downloaded and built the latest
> version hoping it would fix the issue, but no luck.) Should I just
> start deleting files at random?
You could start by removing a few files you know you don't need, rather than
at random. :)
TBH I've not seen this one before, and the error message is not all that
helpful. It'd be nice to know how many blocks it was trying to reserve
when it ran out of space; I guess you'd need to use gdb, or instrument
all the calls to res_failed() in phase6.c to know for sure...
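If you do go the gdb route, a command file along these lines should get you most of the way (the useful variable names near res_failed() are guesses, so poke around the frames):

```gdb
# Hypothetical gdb command file: stop whenever xfs_repair's res_failed()
# fires and dump the call stack.  The block count being reserved should
# be visible a frame or two up; variable names there are guesses.
break res_failed
commands
  bt
  continue
end
run
```

Run it as something like "gdb -x cmds --args xfs_repair /dev/sdX" (device name is an example) and see what the backtraces say.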
You could also capture an xfs_metadump of the fs and provide it for
analysis; that would let us reproduce the issue and know for sure what's
going on. By default it obfuscates metadata.
-Eric
* Re: xfs_repair: "fatal error -- ran out of disk space!"
From: Dave Chinner @ 2011-06-22 23:24 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Patrick J. LoPresti, xfs
On Wed, Jun 22, 2011 at 05:27:14PM -0500, Eric Sandeen wrote:
> On 6/22/11 4:32 PM, Patrick J. LoPresti wrote:
> > I have a 5.1TB XFS file system that is 93% full (399G free according to "df").
> >
> > I am trying to run "xfs_repair" on it.
> >
> > The output is appended.
> >
> > Question: What am I supposed to do about this? "xfs_repair -V" says
> > "xfs_repair version 3.1.5". (I downloaded and built the latest
> > version hoping it would fix the issue, but no luck.) Should I just
> > start deleting files at random?
>
> You could start by removing a few files you know you don't need, rather than
> at random. :)
>
> TBH I've not seen this one before, and the error message is not all that
> helpful. It'd be nice to know how many blocks it was trying to reserve
> when it ran out of space; I guess you'd need to use gdb, or instrument
> all the calls to res_failed() in phase6.c to know for sure...
Also, the number of inodes and directories in your filesystem might
tell us whether we should expect an ENOSPC. I suspect that
there's an accounting error, because 400GB of transaction
reservations is an awful lot of directory rebuilds....
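Back-of-the-envelope, and assuming the default 4 KiB block size, the stale sb_fdblocks count in the repair output above is almost exactly the 399G of "free" space df was showing:

```shell
# sb_fdblocks claimed 104582610 free blocks, but repair counted only 24
# actually free.  At 4 KiB per block the stale count works out to:
echo "$((104582610 * 4096 / 1024 / 1024 / 1024)) GiB"   # prints "398 GiB"
```

So df's 399G figure looks like nothing more than the stale superblock counter.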
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs_repair: "fatal error -- ran out of disk space!"
From: Patrick J. LoPresti @ 2011-06-22 23:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs
Hi, Dave and Eric. And thank you for the quick reply.
I blew away a couple of files (200-300 megabytes' worth; I did not
write down the exact figure), and then xfs_repair succeeded. And now
"df" shows the partition as 100% full (265M free out of 5.1T), not 93%
full (399G free).
I think the file system actually was full, but corrupted. The reason
I was trying to run xfs_repair is that the system was acting...
"funny" (but not "ha ha" funny). Specifically, an nfsd task was
consuming 100% CPU even though no NFS traffic was visible on the
network. "cat /proc/task_id/stack" suggested the nfsd was in an
infinite loop, calling into XFS trying to allocate an extent or
something. This nfsd held a lock, making it impossible to umount the
partition (among other things).
My guess is that nfsd was fooled, much like df, into thinking there
was space available, but when it tried to actually obtain that space,
it was told "please try again". Which it did, forever.
I guess one question is how xfs_repair should behave in this case. I
mean, what if the file system had been full, but too corrupt for me to
delete anything?
Anyway, my problem is fixed. Well, until the filesystem gets
corrupted again; I still have not identified the underlying cause
of that...
Thank you again for the prompt response.
- Pat
* Re: xfs_repair: "fatal error -- ran out of disk space!"
From: Stan Hoeppner @ 2011-06-23 7:42 UTC (permalink / raw)
To: Patrick J. LoPresti; +Cc: Eric Sandeen, xfs
On 6/22/2011 6:41 PM, Patrick J. LoPresti wrote:
> I guess one question is how xfs_repair should behave in this case. I
> mean, what if the file system had been full, but too corrupt for me to
> delete anything?
Maybe you should rethink your policy on filesystem space management.
From what you stated, the FS in question actually was full. You
apparently were unaware of it until a problem (a misbehaving nfsd
process) brought it to your attention. You should be monitoring your
FS usage; something as simple as logwatch daily summaries can save
your bacon here.
As a general rule, when an FS begins steadily growing past the 80%
mark and heading toward 90%, you need to take action: either add more
disk to the underlying LVM device and grow the FS, mount a new
device/FS on a new directory in the tree and manually move files over,
or make use of some HSM software.
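Even a trivial cron job will do. A sketch (the 80% threshold, the df parsing, and the sample output are illustrative only):

```shell
#!/bin/sh
# Warn when a filesystem crosses a usage threshold.  Meant to be run
# from cron; the 80% figure is an example, not gospel.
THRESHOLD=80

# Print the Use% column (minus the '%') from `df -P`-style output.
pct_used() {
    awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

warn_if_full() {
    mnt=$1
    pct=$(df -P "$mnt" | pct_used)
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "WARNING: $mnt is ${pct}% full"
    fi
}

# Demo of the parsing against canned df output (a real job would call
# warn_if_full on each mount point):
printf 'Filesystem 1024-blocks Used Available Capacity Mounted on\n/dev/sdb1 100 93 7 93%% /export\n' | pct_used   # prints 93
```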
Full filesystems have been a source of problems basically forever. It's
best to avoid such situations instead of tickling the dragon.
--
Stan
* Re: xfs_repair: "fatal error -- ran out of disk space!"
From: Patrick J. LoPresti @ 2011-06-23 14:16 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Eric Sandeen, xfs
Of course we monitor our file systems. But as I thought I made
clear, we were "unaware" the file system was full because df said it
still had 399 gigabytes of free space. Granted, this is "only" 7%,
but it could just as easily have been 30% or 50% or 80%, because _the
file system was corrupt_.
Also, given your "80% rule", I suspect you have never worked in an
environment like mine. This file system is one of around 40 of similar
size in a single pool. Am I supposed to tell my boss that we need
more disk as soon as our free space goes below 40 terabytes?
The bottom line is that the file system was full but appeared not to
be, and thus xfs_repair bombed out. I realize this is a corner case,
but it is a nasty one, and it has nothing to do with my "policy on
filesystem space management".
But thank you for your input.
- Pat