public inbox for linux-xfs@vger.kernel.org
* Data corruption, md5 changes on every mount
@ 2011-12-11 13:21 Dmitry Panov
  2011-12-11 23:53 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Dmitry Panov @ 2011-12-11 13:21 UTC (permalink / raw)
  To: xfs

Hi guys,

I have a 2TiB XFS which is about 60% full. Recently I've noticed that 
the daily inc. backup reports file contents change for files that are 
not supposed to change.

I've created an LVM snapshot and ran xfs_check/xfs_repair. xfs_check did 
report a few problems (unknown node type). After that I ran a simple 
test: mount, calculate md5 of the problematic files, report if it 
changed, umount, sleep 10 sec. That script reported that md5 sum of at 
least one file was changing on every cycle.
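For reference, a minimal sketch of that mount/md5/umount cycle (the device,
mount point and file paths are hypothetical placeholders; mounting requires
root, so the loop only runs when explicitly requested):

```shell
#!/bin/sh
# Sketch of the mount / md5 / umount test cycle described above.
# DEV, MNT and FILES are placeholders -- adjust for your system.
DEV=${DEV:-/dev/mapper/vg-snap}
MNT=${MNT:-/mnt/check}
FILES=${FILES:-$MNT/suspect.wav}

# Collapse the md5sums of all files into a single digest; filenames are
# stripped first so the digest depends only on file contents.
digest_of() {
    md5sum "$@" | awk '{print $1}' | md5sum | awk '{print $1}'
}

run_cycles() {
    prev=""
    while :; do
        mount -o ro "$DEV" "$MNT" || exit 1  # read-only: don't disturb the snapshot
        sum=$(digest_of $FILES)
        [ -n "$prev" ] && [ "$sum" != "$prev" ] && echo "digest changed: $sum"
        prev=$sum
        umount "$MNT"
        sleep 10
    done
}

# Only run when explicitly asked to (requires root and a real device).
if [ "${RUN_CYCLES:-0}" = 1 ]; then run_cycles; fi
```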

Analyzing the differences, I found that a 4k block that should contain 
all zeros sometimes contains random garbage (luckily most of the files 
are PCM WAVs, so it's easy to verify). However, I did not analyze every 
occurrence, so this may not be 100% true. The files do not appear to be 
sparse according to du. Interestingly, one of them appears to occupy one 
block more than necessary.

Then I did cp -a file newfile, mv newfile file and re-ran the test. No 
problems reported since.

As there were a few unclean umounts I think most likely it is a 
filesystem corruption that went unspotted by xfs_repair. It would not 
surprise me too much because xfs_repair took just 3.5 min.

Any ideas? I could just copy the files and pretend nothing happened, but 
is there a guarantee that doing so won't corrupt other data?


Best regards,

-- 
Dmitry Panov

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Data corruption, md5 changes on every mount
  2011-12-11 13:21 Data corruption, md5 changes on every mount Dmitry Panov
@ 2011-12-11 23:53 ` Dave Chinner
  2011-12-12  0:13   ` Dmitry Panov
  2011-12-12  1:56   ` Dmitry Panov
  0 siblings, 2 replies; 5+ messages in thread
From: Dave Chinner @ 2011-12-11 23:53 UTC (permalink / raw)
  To: Dmitry Panov; +Cc: xfs

On Sun, Dec 11, 2011 at 01:21:37PM +0000, Dmitry Panov wrote:
> Hi guys,
> 
> I have a 2TiB XFS which is about 60% full. Recently I've noticed
> that the daily inc. backup reports file contents change for files
> that are not supposed to change.

What kernel/platform? What version of xfsprogs? What kind of
storage?

> I've created an LVM snapshot and ran xfs_check/xfs_repair. xfs_check
> did report a few problems (unknown node type). After that I ran a
> simple test: mount, calculate md5 of the problematic files, report
> if it changed, umount, sleep 10 sec. That script reported that md5
> sum of at least one file was changing on every cycle.

That sounds like you've got a dodgy drive.

> Analyzing the differences, I found that a 4k block that should
> contain all zeros sometimes contains random garbage (luckily most of
> the files are PCM WAVs, so it's easy to verify). However, I did not
> analyze every occurrence, so this may not be 100% true. The files do
> not appear to be sparse according to du. Interestingly, one of
> them appears to occupy one block more than necessary.

XFS can allocate blocks beyond EOF - it's completely valid to do so.

> Then I did cp -a file newfile, mv newfile file and re-ran the test.
> No problems reported since.

So the file is now in a different physical location on disk.
Definitely sounds like a dodgy disk to me.

> As there were a few unclean umounts I think most likely it is a
> filesystem corruption that went unspotted by xfs_repair. It would
> not surprise me too much because xfs_repair took just 3.5 min.

The run time of xfs_repair is determined by how much IO it needs to
do to read all the metadata. Your filesystem is not all that densely
populated with metadata, so it doesn't take very long to run. The
short runtime does not mean it hasn't checked your filesystem
properly.

Think about scale for a minute - take your filesystem and scale it
linearly in all dimensions - a repair rate of 1.5 minutes per TB means
2.5 hours for a 100TB filesystem, or a day for a PB-sized filesystem. The
speed you are seeing doesn't seem quite so fast now, does it?

> Any ideas? I could just copy the files and pretend nothing happened,
> but is there a guarantee that doing so won't corrupt other data?

I'd start by replacing hardware....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Data corruption, md5 changes on every mount
  2011-12-11 23:53 ` Dave Chinner
@ 2011-12-12  0:13   ` Dmitry Panov
  2011-12-12  4:15     ` Dave Chinner
  2011-12-12  1:56   ` Dmitry Panov
  1 sibling, 1 reply; 5+ messages in thread
From: Dmitry Panov @ 2011-12-12  0:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On 11/12/2011 23:53, Dave Chinner wrote:
> On Sun, Dec 11, 2011 at 01:21:37PM +0000, Dmitry Panov wrote:
>> Hi guys,
>>
>> I have a 2TiB XFS which is about 60% full. Recently I've noticed
>> that the daily inc. backup reports file contents change for files
>> that are not supposed to change.
> What kernel/platform? What version of xfsprogs? What kind of
> storage?
It's Linux kernel 3.0.0 at the moment; however, it used to run different 
versions and I can't tell for sure when the problem started. xfsprogs 
version is 3.1.2.

The storage is a 2 node cluster with hardware RAID1+0 and drbd.
>> I've created an LVM snapshot and ran xfs_check/xfs_repair. xfs_check
>> did report a few problems (unknown node type). After that I ran a
>> simple test: mount, calculate md5 of the problematic files, report
>> if it changed, umount, sleep 10 sec. That script reported that md5
>> sum of at least one file was changing on every cycle.
> That sounds like you've got a dodgy drive.

That would be my guess too, however the problem occurs on both nodes 
(i.e. it doesn't go away when the other node becomes active) and the 
same files are affected, which makes a hard drive, RAID controller or 
RAM failure very unlikely.


Is there any way to perform a more thorough check than xfs_check does?


Best regards,

-- 
Dmitry Panov


* Re: Data corruption, md5 changes on every mount
  2011-12-11 23:53 ` Dave Chinner
  2011-12-12  0:13   ` Dmitry Panov
@ 2011-12-12  1:56   ` Dmitry Panov
  1 sibling, 0 replies; 5+ messages in thread
From: Dmitry Panov @ 2011-12-12  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

I've found xfs_bmap and did a few experiments with dd. It looks to me as 
though it's a RAID1 sync problem -- I've got 2 different versions of the 
data during continuous reads from the device with cache drops in between.
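A sketch of the kind of raw-device experiment described above: read the same
sectors twice with a page-cache drop in between and compare checksums. The
device path, start sector and count are hypothetical placeholders; take the
real values from `xfs_bmap -v <file>` (which reports in 512-byte blocks), and
note that reading a real device and dropping caches both require root, so the
comparison only runs when explicitly requested:

```shell
#!/bin/sh
# If the device itself returns different data for the same sectors across
# a cache drop, the problem is in the storage layer, not the filesystem.
# DEV, SECTOR and COUNT are placeholders -- adjust for your system.
DEV=${DEV:-/dev/drbd0}
SECTOR=${SECTOR:-123456}
COUNT=${COUNT:-8}            # 8 x 512 bytes = one 4k filesystem block

read_extent() {  # args: device start_sector sector_count
    dd if="$1" bs=512 skip="$2" count="$3" 2>/dev/null
}

compare_reads() {
    a=$(read_extent "$DEV" "$SECTOR" "$COUNT" | md5sum)
    echo 3 > /proc/sys/vm/drop_caches    # drop page cache between reads
    b=$(read_extent "$DEV" "$SECTOR" "$COUNT" | md5sum)
    if [ "$a" != "$b" ]; then
        echo "mismatch: device returned different data for the same sectors"
    fi
}

# Only run when explicitly asked to (requires root and a real device).
if [ "${RUN_COMPARE:-0}" = 1 ]; then compare_reads; fi
```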

So, although the cause of the problem is still unclear, it's definitely 
not XFS. Thanks for the hint!


On 11/12/2011 23:53, Dave Chinner wrote:
> On Sun, Dec 11, 2011 at 01:21:37PM +0000, Dmitry Panov wrote:
>> Hi guys,
>>
>> I have a 2TiB XFS which is about 60% full. Recently I've noticed
>> that the daily inc. backup reports file contents change for files
>> that are not supposed to change.
> What kernel/platform? What version of xfsprogs? What kind of
> storage?
>
>> I've created an LVM snapshot and ran xfs_check/xfs_repair. xfs_check
>> did report a few problems (unknown node type). After that I ran a
>> simple test: mount, calculate md5 of the problematic files, report
>> if it changed, umount, sleep 10 sec. That script reported that md5
>> sum of at least one file was changing on every cycle.
> That sounds like you've got a dodgy drive.
>
>> Analyzing the differences, I found that a 4k block that should
>> contain all zeros sometimes contains random garbage (luckily most of
>> the files are PCM WAVs, so it's easy to verify). However, I did not
>> analyze every occurrence, so this may not be 100% true. The files do
>> not appear to be sparse according to du. Interestingly, one of
>> them appears to occupy one block more than necessary.
> XFS can allocate blocks beyond EOF - it's completely valid to do so.
>
>> Then I did cp -a file newfile, mv newfile file and re-ran the test.
>> No problems reported since.
> So the file is now in a different physical location on disk.
> Definitely sounds like a dodgy disk to me.
>
>> As there were a few unclean umounts I think most likely it is a
>> filesystem corruption that went unspotted by xfs_repair. It would
>> not surprise me too much because xfs_repair took just 3.5 min.
> The run time of xfs_repair is determined by how much IO it needs to
> do to read all the metadata. Your filesystem is not all that densely
> populated with metadata, so it doesn't take very long to run. The
> short runtime does not mean it hasn't checked your filesystem
> properly.
>
> Think about scale for a minute - take your filesystem and scale it
> linearly in all dimensions - a repair rate of 1.5 minutes per TB means
> 2.5 hours for a 100TB filesystem, or a day for a PB-sized filesystem. The
> speed you are seeing doesn't seem quite so fast now, does it?
>
>> Any ideas? I could just copy the files and pretend nothing happened,
>> but is there a guarantee that doing so won't corrupt other data?
> I'd start by replacing hardware....
>
> Cheers,
>
> Dave.


-- 
Dmitry Panov


* Re: Data corruption, md5 changes on every mount
  2011-12-12  0:13   ` Dmitry Panov
@ 2011-12-12  4:15     ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2011-12-12  4:15 UTC (permalink / raw)
  To: Dmitry Panov; +Cc: xfs

On Mon, Dec 12, 2011 at 12:13:40AM +0000, Dmitry Panov wrote:
> Hi Dave,
> 
> On 11/12/2011 23:53, Dave Chinner wrote:
> >On Sun, Dec 11, 2011 at 01:21:37PM +0000, Dmitry Panov wrote:
> >>Hi guys,
> >>
> >>I have a 2TiB XFS which is about 60% full. Recently I've noticed
> >>that the daily inc. backup reports file contents change for files
> >>that are not supposed to change.
> >What kernel/platform? What version of xfsprogs? What kind of
> >storage?
> It's Linux kernel 3.0.0 at the moment; however, it used to run
> different versions and I can't tell for sure when the problem
> started. xfsprogs version is 3.1.2.
> 
> The storage is a 2 node cluster with hardware RAID1+0 and drbd.

Hmmmm. HA, remote replication, network paths in the storage stack.
Not a particularly common setup, so I'd be looking at validating
your drbd setup before looking at XFS.....

> >>I've created an LVM snapshot and ran xfs_check/xfs_repair. xfs_check
> >>did report a few problems (unknown node type). After that I ran a
> >>simple test: mount, calculate md5 of the problematic files, report
> >>if it changed, umount, sleep 10 sec. That script reported that md5
> >>sum of at least one file was changing on every cycle.
> >That sounds like you've got a dodgy drive.
> 
> That would be my guess too, however the problem occurs on both nodes
> (i.e. it doesn't go away when the other node becomes active) and the
> same files are affected, which makes a hard drive, RAID controller or
> RAM failure very unlikely.

Which simply means the corruption has been replicated.

Given that drbd is in the picture and that has a history of causing
filesystem and/or data corruptions, I'd suggest you validate that
drbd is not causing problems first. If you can reproduce the data
corruption on a storage stack that doesn't have drbd in it, then
it's probably a filesystem problem.  However, you need to rule out
the lower storage layers as the cause first.  i.e. once you've
validated that your block device is good, then we can start to look
at whether the filesystem is the cause.

In general, you need a reliable reproducer to do this, so if you can't 
reproduce the problem anymore, there's little that can be done about 
it...

> Is there any way to perform a more thorough check than xfs_check does?

xfs_repair -n is more thorough than xfs_check. But remember, both 
xfs_check and xfs_repair are only checking the filesystem structure, 
not the contents of your files. The contents of your files are yours 
to check....
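As a sketch, a read-only structural check could be run against the (unmounted)
snapshot device like this. The device path is a hypothetical placeholder, and
interpreting the exit status this way assumes the usual xfs_repair convention
of 0 for a clean filesystem and non-zero when problems are detected:

```shell
#!/bin/sh
# Read-only structural check with xfs_repair -n on an unmounted snapshot.
# DEV is a placeholder -- point it at your LVM snapshot device.
DEV=${DEV:-/dev/vg0/xfs-snap}

explain_status() {  # turn an xfs_repair -n exit status into a message
    case "$1" in
        0) echo "filesystem structure looks clean" ;;
        *) echo "problems detected (or xfs_repair could not run), status $1" ;;
    esac
}

if [ -b "$DEV" ]; then
    xfs_repair -n "$DEV"    # -n: no-modify mode, report problems only
    explain_status $?
fi
```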

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 5+ messages
2011-12-11 13:21 Data corruption, md5 changes on every mount Dmitry Panov
2011-12-11 23:53 ` Dave Chinner
2011-12-12  0:13   ` Dmitry Panov
2011-12-12  4:15     ` Dave Chinner
2011-12-12  1:56   ` Dmitry Panov
