public inbox for linux-xfs@vger.kernel.org
* Debugging file truncation problem
@ 2012-06-21  1:51 Ling Ho
  2012-06-21  6:02 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Ling Ho @ 2012-06-21  1:51 UTC
  To: xfs

Hello,

I am trying to debug a problem that has bugged us for the last few months.

We have set up a large storage system using GlusterFS, and XFS 
underneath it. We have 5 RHEL6.2 servers (running 
2.6.32-220.7.1.el6.x86_64 when problem last occurred), with LSI 9285-8e 
Raid Controller with Battery Backup Unit. System memory is 48GB.

Over the few months we have had it running, we experienced two complete
power outages where everything went down for a long period of time.

After the system came back up, we found some files (between 1 and 10GB)
truncated. By truncated I mean the file sizes shrank, and we lost the
tail of the files. Since the files were copied from another storage
system, we have the originals to compare against. Furthermore, we have a
cron job that collects the file sizes once a day.

However, the troubling thing is, these files were all multiple days old, 
and were not being written to or accessed at the time of the power outage.

Last week, I sensed some problems with the OS on one of the machines,
and so shut it down cleanly. Right after that, I also upgraded the
kernels and rebooted the other 4 servers. After they all came back up,
we discovered truncated files again. I am sure the truncation occurred
within the 24 hours before or after the reboots, since the file sizes we
had collected before the reboots differ from what we collected a few
hours after them. The file truncation occurred on the problematic
machine and on another one, which I had rebooted cleanly.
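
The comparison of the daily size snapshots described above could be
sketched roughly as follows. This is only an illustration: the snapshot
format (one "<size> <path>" pair per line) and the function names are
assumptions, not the actual cron job.

```python
def parse_snapshot(text):
    """Parse lines of the form '<size-in-bytes> <path>' into a dict.

    The line format is an assumption made for this sketch.
    """
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        size, path = line.split(None, 1)
        sizes[path] = int(size)
    return sizes


def find_shrunk(before, after):
    """Return {path: (old_size, new_size)} for files that got smaller."""
    shrunk = {}
    for path, old_size in before.items():
        new_size = after.get(path)
        if new_size is not None and new_size < old_size:
            shrunk[path] = (old_size, new_size)
    return shrunk


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    before = parse_snapshot("1000 /data/a\n2000 /data/b\n")
    after = parse_snapshot("1000 /data/a\n0 /data/b\n")
    for path, (old, new) in sorted(find_shrunk(before, after).items()):
        print(path, old, "->", new)
```

Files that are missing from the later snapshot are ignored here, since a
deleted file is a different case from a truncated one.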

I tried to spend more time looking at the truncated files this time. I 
found some of the smaller files actually got truncated to zero length.

I used xfs_bmap to look at the extent allocation, and saw that all of
them were using a single extent. So, using the original file size and
the start location of the truncated file, I tried to extract the bytes
from the raw device and saved them into a different directory.
Something like this:

  dd if=/dev/hdc of=/u1/recovered bs=1 count=1231451239 skip=53242445

To my amazement, after I wrote the file out this way (assuming the
complete file was also occupying one single extent), the checksum
matched that of the original file, which resides on the server I had
copied the file from.
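
The recovery and checksum comparison described above could look roughly
like the sketch below. It is a hedged illustration, not the exact
procedure used: the function names are hypothetical, SHA-256 is an
arbitrary choice of checksum, and the 512-byte unit in blocks_to_bytes
assumes the start block was read from xfs_bmap's default output, which
reports in 512-byte blocks. It also assumes, as in the text, that the
whole file sits in one contiguous extent.

```python
import hashlib


def blocks_to_bytes(blocks, block_size=512):
    """Convert a block count to bytes (512-byte units are an assumption)."""
    return blocks * block_size


def extract_range(device_path, offset, length, out_path, chunk=1 << 20):
    """Copy `length` bytes starting at byte `offset` into out_path.

    Returns the number of bytes actually copied, which is smaller than
    `length` if the end of the device is reached first.
    """
    remaining = length
    with open(device_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(offset)
        while remaining > 0:
            data = src.read(min(chunk, remaining))
            if not data:
                break  # reached the end of the device early
            dst.write(data)
            remaining -= len(data)
    return length - remaining


def sha256_of(path, chunk=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()
```

Reading in 1MB chunks avoids the one-byte-at-a-time transfers that
bs=1 causes with dd. The recovered copy can then be compared against the
original with sha256_of() on both paths.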


These are my questions:

- Under what possible circumstances would the updated inode not be
written to disk if the contents of the file are already on disk?

- I tried to use block dump to debug while trying to reproduce the
problem on another test box. I noticed that xfssyncd and xfsbufd don't
cause data and inodes to be written to disk. It seems that after a file
is written, the data and dirtied inode are written to disk only when
flush wakes up. Are xfssyncd/xfsbufd only responsible for moving stuff
to the system cache?

- Can all the flush processes die, or cease to work on a system and 
still allow the system to function?
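
As background to the first question: an application that needs both the
data and the updated inode (including the file size) on disk before it
moves on would normally call fsync itself rather than wait for
background writeback. The sketch below shows that general pattern only.
It assumes the copying tool can be modified, the function name is
hypothetical, and it says nothing about what the kernel does when no
fsync is issued.

```python
import os


def copy_durably(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst, then fsync the file and its parent directory.

    Returns the size of the destination file in bytes.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            data = src.read(chunk)
            if not data:
                break
            dst.write(data)
        dst.flush()             # push Python's userspace buffer to the kernel
        os.fsync(dst.fileno())  # ask the kernel to write data and metadata
    # fsync the parent directory so the new directory entry is durable too
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst_path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return os.path.getsize(dst_path)
```

When the writes go through another layer, such as Gluster in this setup,
whether an fsync from the client reaches the underlying XFS file system
depends on that layer.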

I have been trying to reproduce the problem on a test box for the last
few days without success, except that I see truncation of files that
were newly written and not yet flushed to disk when I reset the test
box. It seems XFS is doing everything right. I tried writing through the
Gluster layer and writing directly to the XFS file system, and saw no
difference in behavior. I would really like to get some ideas on where
else to look.


Thanks,
...
ling

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Debugging file truncation problem
  2012-06-21  1:51 Debugging file truncation problem Ling Ho
@ 2012-06-21  6:02 ` Dave Chinner
  2012-06-21  6:30   ` Ling Ho
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2012-06-21  6:02 UTC
  To: Ling Ho; +Cc: xfs

On Wed, Jun 20, 2012 at 06:51:39PM -0700, Ling Ho wrote:
> Hello,
> 
> I am trying to debug a problem that has bugged us for the last few months.
> 
> We have set up a large storage system using GlusterFS, and XFS
> underneath it. We have 5 RHEL6.2 servers (running
> 2.6.32-220.7.1.el6.x86_64 when problem last occurred), with LSI
> 9285-8e Raid Controller with Battery Backup Unit. System memory is
> 48GB.

Perhaps you should report the problem to your Red Hat support
engineer and tell them to raise a bug - it's entirely possible that
the RH kernel needs this fix:

be4f1ac xfs: log all dirty inodes in xfs_fs_sync_fs

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: Debugging file truncation problem
  2012-06-21  6:02 ` Dave Chinner
@ 2012-06-21  6:30   ` Ling Ho
  2012-06-21  9:01     ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Ling Ho @ 2012-06-21  6:30 UTC
  To: Dave Chinner; +Cc: xfs@oss.sgi.com

On 6/20/12 11:02 PM, Dave Chinner wrote:
> On Wed, Jun 20, 2012 at 06:51:39PM -0700, Ling Ho wrote:
>> Hello,
>>
>> I am trying to debug a problem that has bugged us for the last few months.
>>
>> We have set up a large storage system using GlusterFS, and XFS
>> underneath it. We have 5 RHEL6.2 servers (running
>> 2.6.32-220.7.1.el6.x86_64 when problem last occurred), with LSI
>> 9285-8e Raid Controller with Battery Backup Unit. System memory is
>> 48GB.
> Perhaps you should report the problem to your Red Hat support
> engineer and tell them to raise a bug - it's entirely possible that
> the RH kernel needs this fix:
>
> be4f1ac xfs: log all dirty inodes in xfs_fs_sync_fs
>
> Cheers,
>
> Dave.
>
Ok. Is there a bug reference number or thread I can point to?

I will try to open a ticket with Red Hat.

Thanks,
...
ling


* Re: Debugging file truncation problem
  2012-06-21  6:30   ` Ling Ho
@ 2012-06-21  9:01     ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2012-06-21  9:01 UTC
  To: Ling Ho; +Cc: xfs@oss.sgi.com

On Wed, Jun 20, 2012 at 11:30:12PM -0700, Ling Ho wrote:
> On 6/20/12 11:02 PM, Dave Chinner wrote:
> >On Wed, Jun 20, 2012 at 06:51:39PM -0700, Ling Ho wrote:
> >>Hello,
> >>
> >>I am trying to debug a problem that has bugged us for the last few months.
> >>
> >>We have set up a large storage system using GlusterFS, and XFS
> >>underneath it. We have 5 RHEL6.2 servers (running
> >>2.6.32-220.7.1.el6.x86_64 when problem last occurred), with LSI
> >>9285-8e Raid Controller with Battery Backup Unit. System memory is
> >>48GB.
> >Perhaps you should report the problem to your Red Hat support
> >engineer and tell them to raise a bug - it's entirely possible that
> >the RH kernel needs this fix:
> >
> >be4f1ac xfs: log all dirty inodes in xfs_fs_sync_fs
> >
> >Cheers,
> >
> >Dave.
> >
> Ok. Is there a bug reference number or thread I can point to?

The above commit is all that is needed - I'll be the person who
fixes the bug in the RH kernel, anyway, so not much else is needed
;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
