* What to do when... xfs_repair hangs?
@ 2014-05-30 19:49 Sean Caron
2014-05-30 21:30 ` Brian Foster
2014-05-31 0:01 ` Dave Chinner
0 siblings, 2 replies; 7+ messages in thread
From: Sean Caron @ 2014-05-30 19:49 UTC (permalink / raw)
To: xfs, Sean Caron
Hi all,
Long story short, we have a big array formatted as XFS, we had a machine go
down hard maybe a month, month and a half ago... when it came back up, XFS
faulted out when we attempted to mount the filesystem; it complained the
log was bad or something... I did a dry run of xfs_repair (-L) and it
looked pretty bad, so we mounted up the filesystem read-only, ran a
backup... I think we got pretty much everything out OK except maybe files
that were open at the time of the crash.
Now with a backup in hand, we kicked off xfs_repair "for real"... it ran
for a while and did its thing, but now it appears to be stuck at the stage -
- agno = 436
rebuilding directory inode ...
rebuilding directory inode ...
rebuilding directory inode ...
...
- traversal finished ...
- moving disconected inodes to lost+found ...
disconnected inode 1109099673,
and then it just stops. I don't know how long it's been sitting like that,
but it hasn't moved in the last hour or two. I assume that's not good...
Interestingly when we ran a dry run of xfs_repair (-L) it got all the way
through; it never hung up at any point. Not sure why it would start to hang
up, once it gets run "for real".
This machine is in single-user-mode, I have exactly 24 lines of console
with no scrollback buffer, no other tty available besides that which I'm
running xfs_repair on, the system console.
Running Linux kernel 3.4.61, Ubuntu 12.04 LTS 64-bit with whatever their
current xfsprogs is.
This is a bit of an exceptional situation for me; I've never seen
xfs_repair just hang outright. I hoped I could maybe get some feedback from
the experts here... what should I do?
Try to Control-C out of the xfs_repair and ... re-run it?
Should I just quit wasting time at this point, wipe out the filesystem,
reformat, then just start the long process of restoring from the backups?
Original plan was just to run xfs_repair, see what happened and pull from
backups as required to fix damage. Perhaps we should just cut to the chase,
rebuild, and restore everything? Probably the file system would be
ultimately healthier starting from scratch, than what xfs_repair leaves
behind?
Any insight would be very much appreciated!
Thanks,
Sean
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-05-30 19:49 What to do when... xfs_repair hangs? Sean Caron
@ 2014-05-30 21:30 ` Brian Foster
2014-05-31 0:01 ` Dave Chinner
1 sibling, 0 replies; 7+ messages in thread
From: Brian Foster @ 2014-05-30 21:30 UTC (permalink / raw)
To: Sean Caron; +Cc: xfs
On Fri, May 30, 2014 at 03:49:13PM -0400, Sean Caron wrote:
> Hi all,
>
> Long story short, we have a big array formatted as XFS, we had a machine go
> down hard maybe a month, month and a half ago... when it came back up, XFS
> faulted out when we attempted to mount the filesystem; it complained the
> log was bad or something... I did a dry run of xfs_repair (-L) and it
> looked pretty bad, so we mounted up the filesystem read-only, ran a
> backup... I think we got pretty much everything out OK except maybe files
> that were open at the time of the crash.
>
I assume you've reasonably verified that the files that have been backed
up at this point have valid content.
> Now with a backup in hand, we kicked off xfs_repair "for real"... it ran
> for a while and did its thing, but now it appears to be stuck at the stage -
>
> - agno = 436
> rebuilding directory inode ...
> rebuilding directory inode ...
> rebuilding directory inode ...
> ...
> - traversal finished ...
> - moving disconected inodes to lost+found ...
> disconnected inode 1109099673,
>
> and then it just stops. I don't know how long it's been sitting like that,
> but it hasn't moved in the last hour or two. I assume that's not good...
>
You might want to include a bit more information about your storage and
filesystem geometry, if possible. See here:
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
In terms of the hang, does the process appear to be active and spinning
via top, or is it idle? If the latter, have you any hung task messages
in dmesg or the system logs? A blocked tasks dump might also be
informative here (see the sysrq-trigger bit in the link). In either
case, I suppose some information of the runtime state of xfs_repair
could be useful.
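The checks described above can be sketched roughly as follows. This is a minimal sketch, assuming a Linux /proc filesystem, a second shell on the machine, and root access for the sysrq step; the helper name proc_state is hypothetical and not part of any XFS tool.

```shell
#!/bin/sh
# proc_state PID: print the one-letter scheduler state from /proc
# (R = running/spinning, S = sleeping, D = uninterruptible I/O wait).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Usage (from a second shell while xfs_repair appears stuck):
#   proc_state "$(pidof xfs_repair)"
#   dmesg | grep -i 'blocked for more than'   # hung task warnings, if any
#   echo w > /proc/sysrq-trigger              # as root: dump blocked tasks to dmesg
```

A state of R over repeated checks suggests the process is spinning on the CPU, while D points at a task blocked on I/O, which is where the hung task messages and the blocked tasks dump become useful.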
> Interestingly when we ran a dry run of xfs_repair (-L) it got all the way
> through; it never hung up at any point. Not sure why it would start to hang
> up, once it gets run "for real".
>
Perhaps writing to storage is problematic..? Have you encountered any
other errors related to the storage?
> This machine is in single-user-mode, I have exactly 24 lines of console
> with no scrollback buffer, no other tty available besides that which I'm
> running xfs_repair on, the system console.
>
> Running Linux kernel 3.4.61, Ubuntu 12.04 LTS 64-bit with whatever their
> current xfsprogs is.
>
> This is a bit of an exceptional situation for me; I've never seen
> xfs_repair just hang outright. I hoped I could maybe get some feedback from
> the experts here... what should I do?
>
> Try to Control-C out of the xfs_repair and ... re-run it?
>
> Should I just quit wasting time at this point, wipe out the filesystem,
> reformat, then just start the long process of restoring from the backups?
>
I'm not totally sure, but I think if you include some more of this data,
others might have some suggestions. If there really is something about
the filesystem causing repair to choke/spin/fall-over, a metadump of the
fs might be useful (beforehand, if you do happen to go this route).
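For reference, taking a metadump might look something like the sketch below. The device and output paths are placeholders, the wrapper name take_metadump is hypothetical, and the filesystem is assumed to be unmounted (or mounted read-only) when the dump is taken.

```shell
#!/bin/sh
# take_metadump DEVICE TARGET: copy the filesystem metadata (no file
# contents) from DEVICE into the file TARGET using xfs_metadump.
take_metadump() {
    dev=$1
    out=$2
    [ -b "$dev" ] || { echo "not a block device: $dev" >&2; return 1; }
    # -g prints progress; file names are obfuscated by default.
    xfs_metadump -g "$dev" "$out"
}

# Usage (placeholder paths; write the dump to a different filesystem):
#   take_metadump /dev/md0 /backup/fs.metadump
#   bzip2 /backup/fs.metadump    # compress before sending it anywhere
```

Taking the dump before running a modifying repair preserves the damaged state, so the problem can still be reproduced after the filesystem itself has been changed.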
Brian
> Original plan was just to run xfs_repair, see what happened and pull from
> backups as required to fix damage. Perhaps we should just cut to the chase,
> rebuild, and restore everything? Probably the file system would be
> ultimately healthier starting from scratch, than what xfs_repair leaves
> behind?
>
> Any insight would be very much appreciated!
>
> Thanks,
>
> Sean
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-05-30 19:49 What to do when... xfs_repair hangs? Sean Caron
2014-05-30 21:30 ` Brian Foster
@ 2014-05-31 0:01 ` Dave Chinner
2014-06-01 16:21 ` Sean Caron
1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2014-05-31 0:01 UTC (permalink / raw)
To: Sean Caron; +Cc: xfs
On Fri, May 30, 2014 at 03:49:13PM -0400, Sean Caron wrote:
> Hi all,
>
> Long story short, we have a big array formatted as XFS, we had a machine go
> down hard maybe a month, month and a half ago... when it came back up, XFS
> faulted out when we attempted to mount the filesystem; it complained the
> log was bad or something... I did a dry run of xfs_repair (-L) and it
> looked pretty bad, so we mounted up the filesystem read-only, ran a
> backup... I think we got pretty much everything out OK except maybe files
> that were open at the time of the crash.
>
> Now with a backup in hand, we kicked off xfs_repair "for real"... it ran
> for a while and did its thing, but now it appears to be stuck at the stage -
>
> - agno = 436
> rebuilding directory inode ...
> rebuilding directory inode ...
> rebuilding directory inode ...
> ...
> - traversal finished ...
> - moving disconected inodes to lost+found ...
> disconnected inode 1109099673,
>
> and then it just stops. I don't know how long it's been sitting like that,
> but it hasn't moved in the last hour or two. I assume that's not good...
Is that the total of the last line of output? If so, it's likely
stuck creating the lost+found directory. It's possible there's a
corruption in the inode AVL tree (e.g. endless loop) that is causing
it to spin doing an inode record lookup, but otherwise I can't see
any reason for it getting stuck here.
The information that Brian asked for will be a good start in
tracking this down, as will the complete output of xfs_repair...
> Interestingly when we ran a dry run of xfs_repair (-L) it got all the way
> through; it never hung up at any point. Not sure why it would start to hang
> up, once it gets run "for real".
That's because a dry-run skips the "move to lost+found" phase.
> This machine is in single-user-mode, I have exactly 24 lines of console
> with no scrollback buffer, no other tty available besides that which I'm
> running xfs_repair on, the system console.
$ man script
or
$ man tee
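As an illustration of the tee approach, a small wrapper along these lines keeps a full copy of the output even on a console without scrollback. It is only a sketch: the function name run_logged and the paths in the usage comment are placeholders, and the log file should live on a filesystem other than the one being repaired.

```shell
#!/bin/sh
# run_logged LOGFILE CMD [ARGS...]: run CMD, show its output on the
# console and append a copy of stdout and stderr to LOGFILE.
run_logged() {
    log=$1
    shift
    "$@" 2>&1 | tee -a "$log"
}

# Usage (device and log paths are placeholders):
#   run_logged /root/xfs_repair.log xfs_repair /dev/md0
# Alternatively, record the whole terminal session with script(1):
#   script /root/repair-session.log
```

Note that the pipeline returns the exit status of tee rather than of the wrapped command, so the log is the place to look for how the run ended.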
> Running Linux kernel 3.4.61, Ubuntu 12.04 LTS 64-bit with whatever their
> current xfsprogs is.
Upgrading xfsprogs to 3.2.0 would be a good idea.
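A quick way to check whether the installed tools are older than 3.2.0 could look like this. It is a sketch that assumes GNU sort (for the -V version ordering) and that "xfs_repair -V" prints the version number as the last field of its output; version_lt is a hypothetical helper name.

```shell
#!/bin/sh
# version_lt A B: succeed if dotted version A is strictly older than B.
# Relies on `sort -V` (GNU coreutils version sort).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage: `xfs_repair -V` prints a line like "xfs_repair version X.Y.Z".
#   installed=$(xfs_repair -V | awk '{ print $NF }')
#   version_lt "$installed" 3.2.0 && echo "xfsprogs is older than 3.2.0"
```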
> This is a bit of an exceptional situation for me; I've never seen
> xfs_repair just hang outright. I hoped I could maybe get some feedback from
> the experts here... what should I do?
>
> Try to Control-C out of the xfs_repair and ... re-run it?
That's fine - the next time repair runs it will start again and
repair anything that wasn't repaired in the last run.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-05-31 0:01 ` Dave Chinner
@ 2014-06-01 16:21 ` Sean Caron
2014-06-01 20:40 ` Emmanuel Florac
2014-06-01 22:48 ` Dave Chinner
0 siblings, 2 replies; 7+ messages in thread
From: Sean Caron @ 2014-06-01 16:21 UTC (permalink / raw)
To: Dave Chinner, Sean Caron; +Cc: xfs
Sorry, all, I was a little out-of-it on Friday afternoon, of course I had
kicked off xfs_repair actually in the background with all output sent to a
file, and I was just doing 'tail -f' on that file.
So I kill the 'tail -f' and jump back to the command line, it appears that
xfs_repair segfaulted and died.
That line of text:
disconnected inode 1109099673,
was indeed the last thing that it printed before it crashed.
If I look in dmesg, I just see -
xfs_repair[6770]: segfault at 28 ip 000000000042307b sp 00007fffef61bad0
error 4 in xfs_repair[400000+72000]
and that's it.
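For readers unfamiliar with this kind of kernel message, the fields can be picked apart with a few lines of shell. This is only an illustrative sketch of how to read the line above (joined onto one line); the interpretation of the error code assumes an x86 machine.

```shell
#!/bin/sh
# The dmesg line from the report, joined onto one line.
line='xfs_repair[6770]: segfault at 28 ip 000000000042307b sp 00007fffef61bad0 error 4 in xfs_repair[400000+72000]'

# Extract the faulting address, instruction pointer, error code and the
# load address of the mapping the instruction pointer falls in.
addr=$(printf '%s\n' "$line" | sed -n 's/.*segfault at \([0-9a-f]*\) .*/\1/p')
ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
err=$(printf '%s\n' "$line" | sed -n 's/.* error \([0-9]*\) .*/\1/p')
base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\]$/\1/p')

# Offset of the faulting instruction inside the xfs_repair mapping.
offset=$(printf '%x' $(( 0x$ip - 0x$base )))

echo "fault address: 0x$addr"
echo "ip offset:     0x$offset"
# x86 page-fault error bits: 1 = protection fault (else page not present),
# 2 = write (else read), 4 = fault happened in user mode.
[ $(( $err & 1 )) -eq 0 ] && echo "page not present"
[ $(( $err & 2 )) -eq 0 ] && echo "read access"
[ $(( $err & 4 )) -ne 0 ] && echo "user mode"
```

A user-mode read of an unmapped address as low as 0x28 is consistent with a NULL pointer dereference, for example reading a structure member through a pointer that was never set. With a binary that carries debug symbols, something like "addr2line -e /path/to/xfs_repair 0x42307b" (path is a placeholder) may map the instruction pointer back to a source line.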
I checked with 'df' and there's plenty of space everywhere; I don't see why
it would have faulted out trying to connect something to lost+found.
Underlying storage should be good; this is basically a RAID 60 built on top
of a bunch of JBODs with LSI SAS9200 cards. MD sees all strings as started
and running OK; no problems getting the array assembled at all.
Since Dave is saying it's OK to try re-running xfs_repair; it'll just pick
up where it left off; let me give it another pass and see if it manages to
complete, or if it segfaults out again. I guess if it poops out a second
time, maybe we'll just want to consider rebuilding the filesystem and
restoring from our copies?
Thanks for the feedback,
Sean
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-06-01 16:21 ` Sean Caron
@ 2014-06-01 20:40 ` Emmanuel Florac
2014-06-01 22:48 ` Dave Chinner
1 sibling, 0 replies; 7+ messages in thread
From: Emmanuel Florac @ 2014-06-01 20:40 UTC (permalink / raw)
To: Sean Caron; +Cc: xfs
On Sun, 1 Jun 2014 12:21:55 -0400 you wrote:
> Since Dave is saying it's OK to try re-running xfs_repair; it'll just
> pick up where it left off; let me give it another pass and see if it
> manages to complete, or if it segfaults out again. I guess if it
> poops out a second time, maybe we'll just want to consider rebuilding
> the filesystem and restoring from our copies?
You should definitely try a more up-to-date version of xfs_repair
first. In case you're not afraid of running a binary from an unknown
source, please find a 3.1.11 binary here:
http://update.intellique.com/pub/xfsrepair.tar.gz
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-06-01 16:21 ` Sean Caron
2014-06-01 20:40 ` Emmanuel Florac
@ 2014-06-01 22:48 ` Dave Chinner
2014-06-02 18:32 ` Sean Caron
1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2014-06-01 22:48 UTC (permalink / raw)
To: Sean Caron; +Cc: xfs
On Sun, Jun 01, 2014 at 12:21:55PM -0400, Sean Caron wrote:
> Sorry, all, I was a little out-of-it on Friday afternoon, of course I had
> kicked off xfs_repair actually in the background with all output sent to a
> file, and I was just doing 'tail -f' on that file.
>
> So I kill the 'tail -f' and jump back to the command line, it appears that
> xfs_repair segfaulted and died.
>
> That line of text:
>
> disconnected inode 1109099673,
>
> was indeed the last thing that it printed before it crashed.
>
> If I look in dmesg, I just see -
>
> xfs_repair[6770]: segfault at 28 ip 000000000042307b sp 00007fffef61bad0
> error 4 in xfs_repair[400000+72000]
>
> and that's it.
>
> I checked with 'df' and there's plenty of space everywhere; I don't see why
> it would have faulted out trying to connect something to lost+found.
>
> Underlying storage should be good; this is basically a RAID 60 built on top
> of a bunch of JBODs with LSI SAS9200 cards. MD sees all strings as started
> and running OK; no problems getting the array assembled at all.
>
> Since Dave is saying it's OK to try re-running xfs_repair; it'll just pick
> up where it left off; let me give it another pass and see if it manages to
> complete, or if it segfaults out again. I guess if it poops out a second
> time, maybe we'll just want to consider rebuilding the filesystem and
> restoring from our copies?
You should update to the latest version of xfs_repair first (3.2.0).
If that still crashes, running xfs_repair under gdb to get a stack
trace would be a good start, or sending me a metadump image so I can
reproduce the crash myself would be even better...
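One way to capture such a stack trace non-interactively is sketched below. It assumes gdb is installed and that the xfs_repair binary was built with debug symbols; the device path in the usage comment is a placeholder and repair_under_gdb is a hypothetical wrapper name.

```shell
#!/bin/sh
# repair_under_gdb DEVICE: run xfs_repair on DEVICE under gdb and print
# a backtrace if it stops on a signal such as SIGSEGV.
repair_under_gdb() {
    dev=$1
    [ -b "$dev" ] || { echo "not a block device: $dev" >&2; return 1; }
    # -batch: exit once the commands finish; each -ex runs one gdb
    # command in order ("run" the program, then "bt" for the backtrace).
    gdb -batch -ex run -ex bt --args xfs_repair "$dev"
}

# Usage (placeholder device; capture the output for the list):
#   repair_under_gdb /dev/md0 2>&1 | tee /root/xfs_repair-gdb.log
```

If the run completes without crashing, the backtrace step simply reports that there is no stack to show.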
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: What to do when... xfs_repair hangs?
2014-06-01 22:48 ` Dave Chinner
@ 2014-06-02 18:32 ` Sean Caron
0 siblings, 0 replies; 7+ messages in thread
From: Sean Caron @ 2014-06-02 18:32 UTC (permalink / raw)
To: Dave Chinner, Sean Caron; +Cc: xfs
I tried re-running the version that came with Ubuntu 12.04 LTS and it very
consistently segfaults at that point... so I went and pulled a copy of the
most recent source from Git and I'm trying xfs_repair 3.2.0 now. I'll see
how that goes (it'll probably take a day or two to run; 450 TB volume) and
report back. Thanks everyone for the suggestions and feedback so far.
Best,
Sean
^ permalink raw reply [flat|nested] 7+ messages in thread