From: Brian Foster <bfoster@redhat.com>
To: "Carlos E. R." <carlos.e.r@opensuse.org>
Cc: XFS mailing list <xfs@oss.sgi.com>
Subject: Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs reformatting to correct issue.
Date: Sat, 12 Jul 2014 10:19:25 -0400
Message-ID: <20140712141925.GA53265@bfoster.bfoster>
In-Reply-To: <alpine.LSU.2.11.1407120150220.15907@Telcontar.valinor>

On Sat, Jul 12, 2014 at 02:30:45AM +0200, Carlos E. R. wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>
> On Saturday, 2014-07-05 at 08:28 -0400, Brian Foster wrote:
> >On Fri, Jul 04, 2014 at 11:32:26PM +0200, Carlos E. R. wrote:
>
>
> >>If I don't do that backup-format-restore, I get issues soon, and it crashes
> >>within a day - I got after booting (the first event):
> >>
> >
> >I echo Dave's previous question... within a day of doing what? Just
> >using the system or doing more hibernation cycles?
>
> It is in the long post with the logs I posted.
>
> The first time it crashed, I rebooted, got some errors I probably did not
> see, managed to mount the device, and I used the machine normally, doing
> several hibernation cycles. On one of these, it crashed, within the day.
>

That still suggests something could be going on at runtime during the
hibernation or wakeup cycle. Identifying some kind of runtime error or
metadata inconsistency without involving hibernation would be a smoking
gun for a general corruption. So far we have no evidence of reproduction
without hibernation and no evidence of a persistent corruption. That
doesn't rule out something going on on-disk, but it certainly suggests a
runtime corruption during hibernation/wake is more likely.

>
> As explained in this part of the previous post:
>
> >><0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [ 301.857523] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 350 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_all
> >>
> >>And some hours later:
> >>
> >><0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_allo
> >>
> >>
> >>It was here that I decided to backup-format-restore instead.
>
>
>
>
>
> >>>That also means it's probably not necessary to do a full backup,
> >>>reformat and restore sequence as part of your routine here. xfs_repair
> >>>should scour through all of the allocation metadata and yell if it finds
> >>>something like free blocks allocated to a file.
> >>
> >>No, if I don't backup-format-restore it happens again within a day. There is
> >>something lingering. Unless that was just chance... :-?
> >>
> >>It is true that during that day I hibernated several times more than needed
> >>to see if it happened again - and it did.
> >>
> >
> >This depends on what causes this to happen, not how frequent it happens.
> >Does it continue to happen along with hibernation, or do you start
> >seeing these kind of errors during normal use?
>
>
> Except the first time that this happened, the sequence is this:
>
> I use the machine for weeks, without event, booting once, then hibernating
> at least once per day. I finally reboot when I have to apply some system
> update, or something special.
>
> Till one day, this "thing" happens. It happens immediately after coming out
> from hibernation, and puts the affected partition, always /home, in read
> only mode. When it happens, I reboot, repair partition manually if needed,
> then I back up the files, format it, and replace all the files from the
> backup just made, with xfsdump. Well, this last time, I used rsync instead.
>
>
> It has happened "only" four times:
>
> 2014-03-15 03:35:17
> 2014-03-15 22:20:34
> 2014-04-17 22:47:08
> 2014-06-29 12:32:18
>
>
> >If the latter, that could suggest something broken on disk.
>
> That was my first thought, because it started happening after replacing the
> hard disk, but also after a kernel update. But I have tested that disk
> several times, with smartctl and with the manufacturer test tool, and
> nothing came out.
>

I was referring to a potential on-disk corruption, but that's good to
know as well.

>
> >If the
> >former, that could simply suggest the fs (perhaps on-disk) has made it
> >into some kind of state that makes this easier to reproduce, for
> >whatever reason. It could be timing, location of metadata,
> >fragmentation, or anything really for that matter, but it doesn't
> >necessarily mean corruption (even though it doesn't rule it out).
> >Perhaps the clean regeneration of everything by a from-scratch recovery
> >simply makes this more difficult to reproduce until the fs naturally
> >becomes more aged/fragmented, for example.
> >
> >This probably makes a pristine, pre-repair metadump of the reproducing
> >fs more interesting. I could try some of my previous tests against a
> >restore of that metadump.
>
>
> Well, I suggest that, unless you can find something on the metadata (I just
> sent you the link via email from google), we wait till the next event. I
> will at that time take an intact metadata photo. But this can take a month
> or two to happen again, if the pattern keeps.
>

That would be a good idea. I'll take a look at the metadump when I have
a chance. If there is nothing out of the ordinary, the next best option
is to metadump the fs that reproduces the behavior. I could retry some
of my previous vm hibernation tests against that. As mentioned
previously, once you have a more reliably reproducing state, that's also
a good opportunity to see if you can narrow down which of the things you
have running against the fs appear to trigger this.

>
>
>
> >I was somewhat thinking out loud originally discussing this topic. I was
> >suggesting to run this against a restored metadump, not the primary
> >dataset or a backup.
> >
> >The metadump creates an image of the metadata of the source fs in a file
> >(no data is copied). This metadump image can be restored at will via
> >'xfs_mdrestore.' This allows restoring to a file, mounting the file
> >loopback, and performing experiments or investigation on the fs
> >generally as it existed when the shutdown was reproducible.
>
> Ah... I see.
>
>
> >So basically:
> >
> >- xfs_mdrestore <mdimgfile> <tmpfileimg>
> >- mount <tmpfileimg> /mnt
> >- rm -rf /mnt/*
> >
> >... was what I was suggesting. <tmpfileimg> can be recreated from the
> >metadump image afterwards to get back to square one.
>
> I see.
>
> Well, I tried this on a copy of the 'dd' image days ago, and nothing
> happened. I guess the procedure above would be the same.
>

A dd of the raw block device will preserve the metadata, so yeah that's
effectively the same test. If there were an obvious free space
corruption, the fs probably would have shut down. I can retry the same
test via the metadump on a debug kernel as well.
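
For reference, the restore-and-experiment sequence quoted above could be
scripted roughly as below. This is only a sketch under assumptions: the
image paths and mount point are made-up placeholders, `run` and
`md_experiment` are hypothetical helper names, and `-o loop` is added on
the assumption that <tmpfileimg> is a regular file. It only prints the
commands unless RUN=1 is set; running it for real needs root and xfsprogs.

```shell
#!/bin/sh
# Sketch of the metadump experiment; all paths are hypothetical placeholders.
MDIMG=${MDIMG:-/tmp/home.metadump}   # image previously created by xfs_metadump
FSIMG=${FSIMG:-/tmp/home.img}        # scratch file the metadata is restored into
MNT=${MNT:-/mnt}                     # scratch mount point

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

md_experiment() {
    # Recreate the filesystem image from the metadump (metadata only, no file data).
    run xfs_mdrestore "$MDIMG" "$FSIMG"
    # Mount the restored image through a loop device.
    run mount -o loop "$FSIMG" "$MNT"
    # Free everything, exercising the free space btrees on the restored fs.
    if [ "${RUN:-0}" = 1 ]; then
        rm -rf "${MNT:?}"/*
    else
        echo "rm -rf $MNT/*"
    fi
    run umount "$MNT"
}

md_experiment
```

Re-running it starts from square one each time, since xfs_mdrestore
rewrites the scratch image from the metadump. Watch dmesg for a shutdown
or XFS_WANT_CORRUPTED messages while the rm is running.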

Brian

>
>
>
>
> >>I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, I'm
> >>logged in there now. I haven't checked if I can create a bug, not been sure
> >>what parameters to use (product, component, whom to assign to). I think that
> >>would be the most appropriate place.
> >>
> >>Meanwhile, I have uploaded the file to my google drive account, so I can
> >>share it with anybody on request - ie, it is not public, I need to add a
> >>gmail address to the list of people that can read the file.
> >>
> >>Alternatively, I could just email the file to people asking for it, offlist,
> >>but not in a single email, in chunks limited to 1.5 MB per email.
> >>
> >
> >Either of the bugzilla or google drive options works Ok for me.
>
> It's here:
>
> <https://drive.google.com/file/d/0Bx2OgfTa-XC9UDBnQzZIMTVyN0k/edit?usp=sharing>
>
> Whoever wants to read it, has to tell me the address to add to it, access is
> not public.
>
>
> - -- Cheers,
> Carlos E. R.
> (from 13.1 x86_64 "Bottle" at Telcontar)
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.22 (GNU/Linux)
>
> iEYEARECAAYFAlPAgb0ACgkQtTMYHG2NR9U/FQCgjtwuDC0HTSG3i7DrEV8+qZeT
> 6mUAn0FGf42SsU1WeRx/AAk4X2oqV4Bc
> =pASJ
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs