From: Doug Ledford <dledford@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Rainer Fuegenstein <rfu@kaneda.iguw.tuwien.ac.at>,
	xfs@oss.sgi.com, linux-raid@vger.kernel.org
Subject: Re: xfs and raid5 - "Structure needs cleaning for directory open"
Date: Mon, 17 May 2010 18:18:28 -0400
Message-ID: <4BF1C0B4.5090009@redhat.com>
In-Reply-To: <20100517214532.GL8120@dastard>

On 05/17/2010 05:45 PM, Dave Chinner wrote:
> On Mon, May 17, 2010 at 05:28:30PM -0400, Doug Ledford wrote:
>> On 05/09/2010 10:20 PM, Dave Chinner wrote:
>>> On Sun, May 09, 2010 at 08:48:00PM +0200, Rainer Fuegenstein wrote:
>>>>
>>>> today in the morning some daemon processes terminated because of
>>>> errors in the xfs file system on top of a software raid5, consisting
>>>> of 4*1.5TB WD caviar green SATA disks.
>>>
>>> Reminds me of a recent(-ish) md/dm readahead cancellation fix - that
>>> would fit the symptoms (btree corruption showing up under heavy IO
>>> load but no corruption on disk). However, I can't seem to find any
>>> references to it at the moment (can't remember the bug title), but
>>> perhaps your distro doesn't have the fix in it?
>>>
>>> Cheers,
>>>
>>> Dave.
>>
>> That sounds plausible, as does hardware error.  A memory bit flip under
>> heavy load would cause the in-memory data to be corrupt while the
>> on-disk data is good.
> 
> The data dumps from the bad blocks weren't wrong by a single bit -
> they were unrecognisable garbage - so it is very unlikely to be
> a memory error causing the problem.

Not true.  It can still be a single bit error, just one higher up in the
chain: for example, a single bit error in the SCSI command that reads the
sectors.  Then you read in all sorts of wrong data and everything from
there is totally whacked.
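
To make that concrete, here is a rough sketch of how one flipped bit in a
read command can land the read somewhere else entirely.  It packs a SCSI
READ(10) CDB (opcode 0x28, 32-bit big-endian LBA in bytes 2-5, 16-bit
transfer length in bytes 7-8) and inverts a single bit in the LBA field.
The helper names, the starting LBA, the 512-byte sector size and the choice
of which bit to flip are all assumptions for illustration; nothing in this
thread says where a flip actually happened.

```python
import struct

SECTOR_SIZE = 512  # bytes per logical block (assumed for illustration)


def build_read10(lba: int, nblocks: int) -> bytes:
    """Pack a 10-byte READ(10) CDB: opcode, flags, LBA, group, length, control."""
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, nblocks, 0)


def parse_lba(cdb: bytes) -> int:
    """Extract the 32-bit big-endian LBA from bytes 2..5 of the CDB."""
    return struct.unpack(">I", cdb[2:6])[0]


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 = MSB of byte 0)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 0x80 >> (bit_index % 8)
    return bytes(buf)


intended_lba = 1_000_000            # hypothetical block we meant to read
cdb = build_read10(intended_lba, 8)

# Bit 20 sits in byte 2, the most significant byte of the LBA field.
bad_cdb = flip_bit(cdb, 20)
actual_lba = parse_lba(bad_cdb)

distance = abs(actual_lba - intended_lba) * SECTOR_SIZE
print(f"intended LBA {intended_lba}, actual LBA {actual_lba}")
print(f"read lands {distance // 2**30} GiB away from the intended block")
```

The command is still perfectly well-formed, so the device happily returns
valid data from the wrong place, and to the filesystem that looks like
garbage rather than a one-bit difference.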

>> By waiting to check it until later, the bad memory
>> was flushed at some point and when the data was reloaded it came in ok
>> this time.
> 
> Yup - XFS needs to do a better job of catching this case - the
> prototype metadata checksumming patch caught most of these cases...
> 
> Cheers,
> 
> Dave.
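
As a rough illustration of why metadata checksumming helps here, the sketch
below seals a block with a CRC that also covers the block number it is
supposed to live at, then verifies it on read.  This is not the XFS
prototype patch or any on-disk format: the 4-byte CRC32 header, the use of
zlib.crc32, the function names and the block numbers are all assumptions
made up for the example.

```python
import struct
import zlib


def seal_block(blkno: int, payload: bytes) -> bytes:
    """Prepend a CRC32 computed over the expected block number plus payload."""
    crc = zlib.crc32(struct.pack(">Q", blkno) + payload)
    return struct.pack(">I", crc) + payload


def verify_block(blkno: int, block: bytes) -> bool:
    """Recompute the CRC for the block number we asked for and compare."""
    stored = struct.unpack(">I", block[:4])[0]
    return stored == zlib.crc32(struct.pack(">Q", blkno) + block[4:])


payload = b"btree node: keys and pointers " * 16   # stand-in metadata
block = seal_block(1_000_000, payload)

# Case 1: clean read of the block we asked for.
clean_ok = verify_block(1_000_000, block)

# Case 2: one bit flipped in memory after the read.
flipped = bytearray(block)
flipped[40] ^= 0x01
bitflip_ok = verify_block(1_000_000, bytes(flipped))

# Case 3: intact block, but returned for a different block number than
# the one requested (a misdirected read).
misdirected_ok = verify_block(135_217_728, block)

print(clean_ok, bitflip_ok, misdirected_ok)
```

Covering the block number in the checksum is what lets the third case fail;
a checksum over the payload alone would pass for an intact block fetched
from the wrong location.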


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


