Re: Crash recovery/zero-byte file question

From: Eric Sandeen <sandeen@sandeen.net>
To: Josh Endries <endries@cs.cornell.edu>
Cc: xfs@oss.sgi.com
Subject: Re: Crash recovery/zero-byte file question
Date: Sun, 19 May 2013 21:22:55 -0500
Message-ID: <519988FF.3010702@sandeen.net>
In-Reply-To: <76015885.11204.1369015296087.JavaMail.root@coecis.cornell.edu>

On 5/19/13 9:01 PM, Josh Endries wrote:
> Hello,
> 
> Thanks for the reply!
> 
>>> We have a RHEL 6.3 machine with a large XFS mount that suffered a
>>> power outage.
>>
>> For starters, have you engaged your RH support folks?
> 
> Unfortunately we don't have support for these machines. We have tons of RH machines and licenses, but only a few with paid support. Generally the (grant-funded) research machines don't include RH support. (And generally we don't run into problems like this. :))

ok

>>> When it came back up, it allegedly fixed itself, but
>>> now many files are zero bytes. I found a bug report/errata fix at RH
>>> that mentions something similar, which might be what we ran into.
>>
>> Which one?  RH support can probably help you decide if that bug report
>> applies, and where/when it was fixed.
> 
> This one: https://access.redhat.com/site/solutions/272673

well, that's a "solution" ;)

> You need a login to view that, though... I think this is the same one, which I just found today:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=845233
> 
> That URL is currently broken for me, so here is a cache of it:
> 
> http://webcache.googleusercontent.com/search?q=cache:3OjuPDd8A1AJ:https://bugzilla.redhat.com/show_bug.cgi%3Fid%3D845233+&cd=2&hl=en&ct=clnk&gl=us&client=firefox-a
> 
> Reading this, I'm no longer sure we have a kernel with the fix. That machine is running:
> 
> 2.6.32-279.el6.x86_64

Right, and:  "Fixed In Version: kernel-2.6.32-328.el6"

So this is a known bug with a fix, but it seems you're not running a
kernel that includes it.
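
To make the comparison explicit, here is a rough sketch that parses a
kernel release string and checks it against the "Fixed In Version" build
quoted above. The helper names (kernel_tuple, at_or_past) are made up for
illustration. It assumes a plain numeric comparison of the part before
the ".elN" tag is good enough; it knows nothing about fixes backported to
z-stream kernels (e.g. 2.6.32-279.x.y), so treat a negative answer as
"check the errata", not as proof.

```python
import re


def kernel_tuple(release):
    """Turn '2.6.32-279.el6.x86_64' into (2, 6, 32, 279)."""
    # Keep only the part before the '.elN' distribution tag.
    base = re.split(r"\.el\d+", release, maxsplit=1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", base))


def at_or_past(release, fixed="2.6.32-328.el6"):
    """True if `release` compares >= the build named in the bug report."""
    return kernel_tuple(release) >= kernel_tuple(fixed)
```

On a live machine the release string would come from `uname -r`, or from
platform.release() in Python.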

> I'm not really sure when the files were created or how long it was
> idle before the crash... I wonder if ctime/mtime would be reliable
> for the files. I also don't know how to reproduce the situation in
> order to test if it's fixed in a later kernel. I can pull the power
> out to test if I knew how to modify files ahead of time such that
> they would zero themselves out.

I think you can be fairly certain that it's resolved in the above
kernel.
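
On the ctime/mtime question, a minimal sketch for taking inventory of
the damage: it walks a directory tree and reports every zero-length
regular file along with its mtime and ctime. The function name
zero_byte_files is hypothetical. Whether those timestamps can be trusted
after this particular bug is exactly the open question, so this only
collects them for comparison against what the researchers remember; it
recovers nothing.

```python
import os
import stat


def zero_byte_files(root):
    """Yield (path, mtime, ctime) for each empty regular file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                # lstat so symlinks are not followed.
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path, st.st_mtime, st.st_ctime
```

Running it read-only against the affected mount and saving the output
gives a list to hand to the people who have to recreate the data.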

>>> We
>>> are running a kernel that should have the fix as far as I can tell,
>>> but we definitely have zero byte files that shouldn't be.
>>
>> shouldn't be because they had all been properly synced to disk
>> before the power loss, or?  (just in general, files not fsynced
>> aren't guaranteed to be in any particular state if you lose power,
>> though of course there are certain expectations of timely flushing).
> 
> No, I mean they shouldn't be zero normally. They weren't zero a week
> ago. In other words, the files definitely changed unexpectedly, I'm
> assuming due to the power outage. The files had not been touched in
> at least a few days before the crash, according to the researcher
> working on those files. If I read the report correctly, though, that
> might not matter much.

ok
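
For reference on the fsync point quoted above, this is a sketch of the
usual pattern an application uses when it needs a file to survive power
loss: write to a temporary file, fsync it, rename it into place, then
fsync the containing directory. The helper name durable_write and the
".tmp" suffix are illustrative assumptions. This is general practice on
Linux, not a workaround for the bug discussed in this thread, and it
would not have helped files that were idle for days.

```python
import os


def durable_write(path, data):
    """Write bytes to path so the contents are on disk before returning."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()               # push Python's buffer to the kernel
        os.fsync(f.fileno())    # ask the kernel to push it to stable storage
    os.replace(tmp, path)       # atomic rename over the final name
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```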

>>> My question is: is there a way to restore this or fix it before going
>>> to backups? Is it worth it to unmount and run xfs_check or similar?
>>> Unfortunately, since the system came up and appeared to be working,
>>> some users have been using that mount point.
>>
>> If you have backups that's probably the best option.
> 
> There aren't any backups of these files. The researchers should be
> able to recreate them (I hope so); the data sets come from various
> places. It's a lot of data, so I was hoping I could recover something
> to lessen the downtime. They opted not to back up that directory
> because it's just too many TBs for normal backups.
> 
> I'm not really expecting to be able to restore everything, I just
> want to put some effort into getting back what I can before telling
> them they need to start over...

Dave is more familiar with that bug than I am, but short of some serious
forensics & luck, I don't think you'll be able to get things back.

I'd update to the kernel mentioned above soon, though, and sorry
about the hassle.  :(

-Eric

> Thanks,
> Josh
> 
