Re: file corruptions, 2nd half of 512b block

From: Chris Dunlop <chris@onthe.net.au>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>, linux-xfs@vger.kernel.org
Subject: Re: file corruptions, 2nd half of 512b block
Date: Thu, 29 Mar 2018 12:09:09 +1100
Message-ID: <20180329010909.GA22702@onthe.net.au>
In-Reply-To: <20180328222754.GC18129@dastard>

On Thu, Mar 29, 2018 at 09:27:54AM +1100, Dave Chinner wrote:
> On Thu, Mar 29, 2018 at 02:20:00AM +1100, Chris Dunlop wrote:
>> On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
>>> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
>>>> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
>>>>> XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
>>>
>>> Are these all on the one raid controller? i.e. what's the physical
>>> layout of all these disks?
>>
>> Yep, one controller. Physical layout:
>>
>> c0 LSI 9211-8i (SAS2008)
>> |
>> + SAS expander w/ SATA HDD x 12
>> |   + SAS expander w/ SATA HDD x 24
>> |       + SAS expander w/ SATA HDD x 24
>> |
>> + SAS expander w/ SATA HDD x 24
>>     + SAS expander w/ SATA HDD x 24
>
> Ok, that's good to know. I've seen misdirected writes in a past life
> because a controller had a firmware bug when it hit its maximum CTQ
> depth of 2048 (controller max, not per-lun max) and the 2049th
> queued write got written to a random lun on the controller. That
> causes random, unpredictable data corruptions in a similar manner to
> what you are seeing.

Ouch!
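
[Editorial sketch, not part of the original exchange.] To get a rough
feel for how far the per-device queues could oversubscribe a
controller-wide limit like the one described above, the per-device
queue_depth values can be summed per SCSI host and set against that
host's can_queue. This assumes the usual Linux sysfs layout under
/sys/class/scsi_device and /sys/class/scsi_host; the helper names
read_int and queue_depth_report are made up for illustration. A sum
larger than can_queue is normal and does not by itself indicate a fault.

```python
from collections import defaultdict
from pathlib import Path


def read_int(path):
    """Return the integer held in a sysfs attribute, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def queue_depth_report(sys_root="/sys"):
    """Return {host_number: (sum of device queue_depth, host can_queue)}."""
    root = Path(sys_root)
    totals = defaultdict(int)
    # Entries are named host:channel:target:lun, e.g. "0:0:12:0".
    for entry in (root / "class" / "scsi_device").glob("*:*:*:*"):
        depth = read_int(entry / "device" / "queue_depth")
        if depth is not None:
            totals[int(entry.name.split(":")[0])] += depth
    report = {}
    for host, total in sorted(totals.items()):
        can_queue = read_int(root / "class" / "scsi_host" / f"host{host}" / "can_queue")
        report[host] = (total, can_queue)
    return report
```

Calling queue_depth_report() with no arguments reads the live /sys tree;
passing another root makes it easy to try against a copied or mocked tree.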

> So don't rule out a hardware problem yet.

OK. I'm not sure which of hardware or kernel I'd prefer it to be 
at this point!

>> Whilst that hardware side of things is interesting, and that md4
>> could bear some more investigation, as previously suggested, and now
>> with more evidence (older files checked clean), it's looking like
>> this issue really started with the upgrade from v3.18.25 to v4.9.76
>> on 2018-01-15. I.e. less likely to be hardware related - unless the
>> new kernel is stressing the hardware in new exciting ways.
>
> Right, it's entirely possible the new kernel is doing something
> the old kernel didn't, like loading it up with more concurrent IO
> across more disks. Do you have the latest firmware on the
> controller?

Not quite: it's on 19.00.00.00, looks like latest is 20.00.06.00 or 
20.00.07.00, depending on where you look.

I can't find a comprehensive set of release notes. Sigh.

We originally held off going to 20 because there were reports of 
problems, but it looks like they've since been resolved in the minor 
updates. Unfortunately we won't be able to update the BIOS in the next 
week or so.

> The next steps are to validate the data is getting through each
> layer of the OS intact. This really needs a more predictable test
> case - can you reproduce and detect this corruption using
> genstream/checkstream?
>
> If so, the first step is to move to direct IO to rule out a page
> cache related data corruption. If direct IO still shows the
> corruption, we need to rule out things like file extension and
> zeroing causing issues. e.g. preallocate the entire files, then
> write via direct IO. If that still generates corruption then we need
> to add code into the bottom of the filesystem IO path to validate
> the data being sent by the filesystem is not corrupt.
>
> If we get that far with correct write data, but still get
> corruptions on read, it's not a filesystem created data corruption.
> Let's see if we can get to that point first...
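
[Editorial sketch, not part of the original exchange.] The idea behind a
predictable test case can be illustrated with a self-describing data
stream: every 8-byte word encodes its own file offset, so a checker can
tell exactly which 512-byte sectors are wrong and whether the damage is
in the first or second half. This is not the genstream/checkstream tools
mentioned above, only a minimal stand-in; gen_block, check_stream and
the seed parameter are names invented here.

```python
import struct

SECTOR = 512


def gen_block(offset, length, seed=0):
    """Return `length` bytes where each 8-byte word is (its file offset XOR seed)."""
    if offset % 8 or length % 8:
        raise ValueError("offset and length must be multiples of 8")
    return b"".join(
        struct.pack("<Q", off ^ seed) for off in range(offset, offset + length, 8)
    )


def check_stream(data, offset=0, seed=0):
    """Return [(sector_offset, first_half_bad, second_half_bad), ...] for bad sectors."""
    expected = gen_block(offset, len(data), seed)
    half = SECTOR // 2
    bad = []
    for pos in range(0, len(data), SECTOR):
        got = data[pos:pos + SECTOR]
        want = expected[pos:pos + SECTOR]
        if got != want:
            bad.append((offset + pos, got[:half] != want[:half], got[half:] != want[half:]))
    return bad
```

Because the expected content depends only on the offset and the seed,
data read back from a file can be checked without keeping a reference
copy, and a sector that landed in the wrong place still carries the
offset it was meant for.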

I'll see what I can do - and/or I'll try v4.14.latest: even if that
makes the problem go away, that will tell us ...something, right?!
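
[Editorial sketch, not part of the original exchange.] The "preallocate
the entire file, then write via direct IO" step suggested above might
look roughly like the following on Linux. It is a sketch under stated
assumptions: 4096-byte alignment is assumed to satisfy O_DIRECT on the
target device, the offset-encoding pattern is a stand-in for a real
stream generator, and write_and_verify and pattern are hypothetical
names. With direct=False the same code goes through the page cache,
which gives a comparison run.

```python
import mmap
import os
import struct

SECTOR = 512
ALIGN = 4096  # assumption: satisfies the O_DIRECT alignment of the device


def pattern(offset, length):
    """Bytes in which every 8-byte word holds its own file offset."""
    return b"".join(struct.pack("<Q", o) for o in range(offset, offset + length, 8))


def write_and_verify(path, size, direct=True, chunk=1 << 20):
    """Preallocate `path`, write the pattern, read it back; return bad sector offsets."""
    if size % ALIGN or chunk % ALIGN:
        raise ValueError("size and chunk must be multiples of ALIGN")
    flags = os.O_RDWR | os.O_CREAT | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags, 0o600)
    buf = mmap.mmap(-1, chunk)  # anonymous mappings are page-aligned
    bad = []
    try:
        os.posix_fallocate(fd, 0, size)  # allocate every block before writing
        with memoryview(buf) as mv:
            for off in range(0, size, chunk):
                n = min(chunk, size - off)
                mv[:n] = pattern(off, n)
                if os.pwrite(fd, mv[:n], off) != n:
                    raise OSError("short write")
            os.fsync(fd)
            for off in range(0, size, chunk):
                n = min(chunk, size - off)
                if os.preadv(fd, [mv[:n]], off) != n:
                    raise OSError("short read")
                want = pattern(off, n)
                for pos in range(0, n, SECTOR):
                    if mv[pos:pos + SECTOR] != want[pos:pos + SECTOR]:
                        bad.append(off + pos)
    finally:
        buf.close()
        os.close(fd)
    return bad
```

An immediate read-back like this only exercises one write/read cycle; a
longer-running test would re-verify the same files later, since the
verification offsets stay valid for as long as the file is untouched.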

Cheers,

Chris
