linux-xfs.vger.kernel.org archive mirror
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>,
	Chris Dunlop <chris@onthe.net.au>,
	linux-xfs@vger.kernel.org
Subject: Re: file corruptions, 2nd half of 512b block
Date: Thu, 22 Mar 2018 16:26:29 -0700
Message-ID: <20180322232629.GF4818@magnolia>
In-Reply-To: <20180322230450.GT1150@dastard>

On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> > On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > > Eyeballing the corrupted blocks and matching good blocks doesn't show
> > > any obvious pattern. The files themselves contain compressed data so
> > > it's all highly random at the block level, and the corruptions
> > > themselves similarly look like random bytes.
> > > 
> > > The corrupt blocks are not a copy of other data in the file within the
> > > surrounding 256k of the corrupt block.
> > > 
> > 
> > So you obviously have a fairly large/complex storage configuration. I
> > think you have to assume that this corruption could be introduced pretty
> > much anywhere in the stack (network, mm, fs, block layer, md) until it
> > can be narrowed down.
> > 
> > > ----------------------------------------------------------------------
> > > System configuration
> > > ----------------------------------------------------------------------
> > > 
> > > linux-4.9.76
> > > xfsprogs 4.10
> > > CPU: 2 x E5620 (16 cores total)
> > > 192G RAM
> > > 
> > > # grep bigfs /etc/mtab
> > > /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> > > # xfs_info /bigfs
> > > meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
> > >         =                       sectsz=4096  attr=2, projid32bit=1
> > >         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> > >         =                       reflink=0
> > > data     =                       bsize=4096   blocks=65929101312, imaxpct=5
> > >         =                       sunit=128    swidth=1152 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal               bsize=4096   blocks=521728, version=2
> > >         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > 
> > > XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
> 
> Are these all on the one raid controller? i.e. what's the physical
> layout of all these disks?
> 
> > > The raids all check clean.
> > > 
> > > The XFS has been expanded a number of times.
> > > 
> > > ----------------------------------------------------------------------
> > > Explicit example...
> > > ----------------------------------------------------------------------
> > > 
> > > 2018-03-04 21:40:44 data + md5 files written
> > > 2018-03-04 22:43:33 checksum mismatch detected
> > 
> > Seems like the corruption is detected fairly soon after creation. How
> > often are these files explicitly checked/read? I also assume the files
> > aren't ever modified..?
> > 
> > FWIW, the patterns that you have shown so far do seem to suggest
> > something higher level than a physical storage problem. Otherwise, I'd
> > expect these instances wouldn't always necessarily land in file data.
> > Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> > problems?
> > 
> > OTOH, a 256-byte corruption seems quite unusual for a filesystem with 4k
> > blocks. I suppose that could suggest some kind of memory/cache
> > corruption as opposed to a bad page/extent state or something of that
> > nature.
> 
> Especially with the data write mechanisms being used - e.g. NFS
> won't be doing partial sector reads and writes for data transfer -
> it'll all be done in blocks much larger than the filesystem block
> size (e.g. 1MB IOs).
> 
> > Hmm, I guess the only productive thing I can think of right now is to
> > see if you can try to detect the problem as soon as possible. For
> > example, it sounds like this is a closed system. If so, could you
> > follow up every file creation with an immediate md5 verification
> > (perhaps followed by an fadvise(DONTNEED) and another md5 check to
> > try to catch an inconsistent pagecache)? Perhaps others might have
> > further ideas.
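
Something like this might work for that immediate double-check. A
minimal sketch (my own, not Brian's exact proposal; the script name
and the 1MiB read size are arbitrary):

    #!/usr/bin/env python3
    # verify.py: md5 a freshly written file, drop its pages from the
    # pagecache, then md5 it again. A mismatch between the two passes
    # implicates the pagecache; a pass-2 mismatch against the expected
    # sum implicates the storage below it.
    import hashlib, os, sys

    def md5sum(fd):
        h = hashlib.md5()
        os.lseek(fd, 0, os.SEEK_SET)
        while True:
            buf = os.read(fd, 1 << 20)          # 1MiB reads
            if not buf:
                return h.hexdigest()
            h.update(buf)

    fd = os.open(sys.argv[1], os.O_RDONLY)
    cached = md5sum(fd)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # len 0 = whole file
    uncached = md5sum(fd)
    os.close(fd)
    print(cached, "(cached)")
    print(uncached, "(after DONTNEED)")
    if cached != uncached:
        sys.exit("MISMATCH: pagecache does not match disk")
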
> 
> Basically, the only way forward now is a methodical, layer by layer
> check of the IO path to isolate where the corruption is being
> introduced. First you need a somewhat reliable reproducer that can
> be used for debugging.
> 
> Write patterned files (e.g. encode a file id, file offset and 16 bit
> cksum in every 8 byte chunk) and then verify them. When you get a
> corruption, the corrupted data will tell you where the corruption
> came from. It'll either be silent bit flips, some other files' data,
> or it will be stale data. See if the corruption pattern is
> consistent. See if the locations correlate to a single disk, a
> single raid controller, a single backplane, etc. i.e. try to find
> some pattern to the corruption.
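
With the original tools gone, here is a quick sketch of such a
patterned writer/verifier (the 8-byte layout, a 16-bit file id plus a
32-bit chunk index plus a 16-bit CRC, is my own invention, not
genstream's actual format):

    #!/usr/bin/env python3
    # pattern.py: fill a file with self-describing 8-byte chunks, then
    # verify them. A corrupt chunk decodes to show whether it is a bit
    # flip, another file's data, or stale garbage.
    import struct, sys, zlib

    CHUNK = 8

    def make_chunk(file_id, idx):
        head = struct.pack("<HI", file_id & 0xffff, idx & 0xffffffff)
        ck = zlib.crc32(head) & 0xffff            # 16-bit cksum
        return head + struct.pack("<H", ck)

    def write_pattern(path, file_id, size):
        with open(path, "wb") as f:
            batch = bytearray()
            for idx in range(size // CHUNK):
                batch += make_chunk(file_id, idx)
                if len(batch) >= (1 << 20):       # flush in 1MiB batches
                    f.write(batch)
                    batch.clear()
            f.write(batch)

    def verify_pattern(path, file_id):
        with open(path, "rb") as f:
            idx = 0
            while True:
                block = f.read(1 << 20)
                if not block:
                    break
                for off in range(0, len(block), CHUNK):
                    got = block[off:off + CHUNK]
                    if got != make_chunk(file_id, idx):
                        print("bad chunk at byte %d: %s"
                              % (idx * CHUNK, got.hex()))
                    idx += 1

    if __name__ == "__main__":
        cmd, path, fid = sys.argv[1], sys.argv[2], int(sys.argv[3])
        if cmd == "write":
            write_pattern(path, fid, int(sys.argv[4]))
        else:
            verify_pattern(path, fid)
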
> 
> Unfortunately, I can't find the repository for the data checking
> tools that were developed years ago for doing exactly this sort of
> testing (genstream+checkstream) online anymore - they seem to
> have disappeared from the internet. (*) Shouldn't be too hard to
> write a quick tool to do this, though.

https://sourceforge.net/projects/checkstream/ ?

--D

> Also worth testing is whether the same corruption occurs when you
> use direct IO to write and read the files. That would rule out a
> large chunk of the filesystem and OS code as the cause of the
> corruption.
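
For the direct IO read side, a sketch along these lines should work
(O_DIRECT wants an aligned buffer, hence the anonymous mmap; the 1MiB
buffer size is an arbitrary choice of mine):

    #!/usr/bin/env python3
    # dio_md5.py: md5 a file through O_DIRECT, bypassing the pagecache.
    # Compare the result against a plain buffered md5sum of the file.
    import hashlib, io, mmap, os, sys

    fd = os.open(sys.argv[1], os.O_RDONLY | os.O_DIRECT)
    f = io.FileIO(fd, "r", closefd=True)
    buf = mmap.mmap(-1, 1 << 20)        # page-aligned 1MiB buffer
    h = hashlib.md5()
    while True:
        n = f.readinto(buf)             # short read at EOF is fine
        if not n:
            break
        h.update(buf[:n])
    f.close()
    print(h.hexdigest(), sys.argv[1])
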
> 
> (*) Google is completely useless for searching for historic things,
> mailing lists and/or code these days. Searching google now reminds me
> of the bad old days of AltaVista - "never finds what I'm looking
> for"....
> 
> > > file size: 31232491008 bytes
> > > 
> > > The file is moved to "badfile", and the file regenerated from source
> > > data as "goodfile".
> 
> What does "regenerated from source" mean?
> 
> Does that mean a new file is created, compressed and then copied
> across? Or is it just the original file being copied again?
> 
> > > From extent 16, the actual corrupt sector offset within the lv device
> > > underneath xfs is:
> > > 
> > > 289315926016 + (53906431 - 45826040) == 289324006407
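
(In general terms, assuming those offsets come from xfs_bmap -v, which
reports 512-byte units: lv_sector = extent_start_sector +
(file_sector - extent_file_offset_sector).)
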
> > > 
> > > Then we can look at the devices underneath the lv:
> > > 
> > > # lvs --units s -o lv_name,seg_start,seg_size,devices
> > >  LV    Start         SSize         Devices
> > >  bigfs            0S 105486999552S /dev/md0(0)
> > >  bigfs 105486999552S 105487007744S /dev/md4(0)
> > >  bigfs 210974007296S 105487007744S /dev/md9(0)
> > >  bigfs 316461015040S  35160866816S /dev/md1(0)
> > >  bigfs 351621881856S 105487007744S /dev/md5(0)
> > >  bigfs 457108889600S  70323920896S /dev/md3(0)
> > > 
> > > Comparing our corrupt sector lv offset with the start sector of each md
> > > device, we can see the corrupt sector is within /dev/md9 and not at a
> > > boundary. The corrupt sector offset within the lv data on md9 is given
> > > by:
> 
> Does the problem always occur on /dev/md9?
> 
> If so, does the location correlate to a single disk in /dev/md9?
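
FWIW, mapping an lv sector to its md segment is easy enough to script.
A sketch, with the segment table transcribed from the lvs output
above; note it yields the offset within the segment's data area only,
before any pe_start adjustment on the PV:

    #!/usr/bin/env python3
    # seg_map.py: locate an LV sector in the PV segment table.
    SEGMENTS = [   # (device, start sector, size in sectors)
        ("/dev/md0",             0, 105486999552),
        ("/dev/md4",  105486999552, 105487007744),
        ("/dev/md9",  210974007296, 105487007744),
        ("/dev/md1",  316461015040,  35160866816),
        ("/dev/md5",  351621881856, 105487007744),
        ("/dev/md3",  457108889600,  70323920896),
    ]

    def locate(lv_sector):
        for dev, start, size in SEGMENTS:
            if start <= lv_sector < start + size:
                return dev, lv_sector - start
        raise ValueError("sector beyond end of LV")

    # The corrupt sector from the example above:
    print(locate(289324006407))    # ('/dev/md9', 78349999111)
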
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thread overview: 11 messages
2018-03-22 15:02 file corruptions, 2nd half of 512b block Chris Dunlop
2018-03-22 18:03 ` Brian Foster
2018-03-22 23:04   ` Dave Chinner
2018-03-22 23:26     ` Darrick J. Wong [this message]
2018-03-22 23:49       ` Dave Chinner
2018-03-28 15:20     ` Chris Dunlop
2018-03-28 22:27       ` Dave Chinner
2018-03-29  1:09         ` Chris Dunlop
2018-03-27 22:33   ` Chris Dunlop
2018-03-28 18:09     ` Brian Foster
2018-03-29  0:15       ` Chris Dunlop
