From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>,
Chris Dunlop <chris@onthe.net.au>,
linux-xfs@vger.kernel.org
Subject: Re: file corruptions, 2nd half of 512b block
Date: Thu, 22 Mar 2018 16:26:29 -0700
Message-ID: <20180322232629.GF4818@magnolia>
In-Reply-To: <20180322230450.GT1150@dastard>
On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> > On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > > Eyeballing the corrupted blocks and matching good blocks doesn't show
> > > any obvious pattern. The files themselves contain compressed data so
> > > it's all highly random at the block level, and the corruptions
> > > themselves similarly look like random bytes.
> > >
> > > The corrupt blocks are not a copy of other data in the file within the
> > > surrounding 256k of the corrupt block.
> > >
> >
> > So you obviously have a fairly large/complex storage configuration. I
> > think you have to assume that this corruption could be introduced pretty
> > much anywhere in the stack (network, mm, fs, block layer, md) until it
> > can be narrowed down.
> >
> > > ----------------------------------------------------------------------
> > > System configuration
> > > ----------------------------------------------------------------------
> > >
> > > linux-4.9.76
> > > xfsprogs 4.10
> > > CPU: 2 x E5620 (16 cores total)
> > > 192G RAM
> > >
> > > # grep bigfs /etc/mtab
> > > /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> > > # xfs_info /bigfs
> > > meta-data=/dev/mapper/vg00-bigfs isize=512 agcount=246, agsize=268435328 blks
> > > = sectsz=4096 attr=2, projid32bit=1
> > > = crc=1 finobt=1 spinodes=0 rmapbt=0
> > > = reflink=0
> > > data = bsize=4096 blocks=65929101312, imaxpct=5
> > > = sunit=128 swidth=1152 blks
> > > naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> > > log =internal bsize=4096 blocks=521728, version=2
> > > = sectsz=4096 sunit=1 blks, lazy-count=1
> > > realtime =none extsz=4096 blocks=0, rtextents=0
> > >
> > > XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
>
> Are these all on the one raid controller? i.e. what's the physical
> layout of all these disks?
>
> > > The raids all check clean.
> > >
> > > The XFS has been expanded a number of times.
> > >
> > > ----------------------------------------------------------------------
> > > Explicit example...
> > > ----------------------------------------------------------------------
> > >
> > > 2018-03-04 21:40:44 data + md5 files written
> > > 2018-03-04 22:43:33 checksum mismatch detected
> >
> > Seems like the corruption is detected fairly soon after creation. How
> > often are these files explicitly checked/read? I also assume the files
> > aren't ever modified..?
> >
> > FWIW, the patterns that you have shown so far do seem to suggest
> > something higher level than a physical storage problem. Otherwise, I'd
> > expect these instances wouldn't always necessarily land in file data.
> > Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> > problems?
> >
> > OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> > blocks. I suppose that could suggest some kind of memory/cache
> > corruption as opposed to a bad page/extent state or something of that
> > nature.
>
> Especially with the data write mechanisms being used - e.g. NFS
> won't be doing partial sector reads and writes for data transfer -
> it'll all be done in blocks much larger than the filesystem block
> size (e.g. 1MB IOs).
>
> > Hmm, I guess the only productive thing I can think of right now is to
> > see if you can try and detect the problem as soon as possible. For e.g.,
> > it sounds like this is a closed system. If so, could you follow up every
> > file creation with an immediate md5 verification (perhaps followed by an
> > fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> > pagecache)? Perhaps others might have further ideas..
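
Something like this quick Python sketch might do for that immediate verify
plus cache-drop re-verify; the path and expected digest below are just
placeholders, and the fsync/fadvise pair is one way to force the second
read back to disk:

import hashlib, os

def md5_of(path, bufsize=1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

def verify_twice(path, expected):
    first = md5_of(path)        # almost certainly served from the page cache
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)            # DONTNEED only drops clean pages, so flush first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    second = md5_of(path)       # should now be read back from disk
    return first == expected, second == expected

# cached_ok, disk_ok = verify_twice('/bigfs/path/to/file', '<expected md5 hex digest>')
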
>
> Basically, the only steps now are a methodical, layer by layer
> checking of the IO path to isolate where the corruption is being
> introduced. First you need a somewhat reliable reproducer that can
> be used for debugging.
>
> Write patterned files (e.g. encode a file id, file offset and 16 bit
> cksum in every 8 byte chunk) and then verify them. When you get a
> corruption, the corrupted data will tell you where the corruption
> came from. It'll either be silent bit flips, some other files' data,
> or it will be stale data. See if the corruption pattern is
> consistent. See if the locations correlate to a single disk, a
> single raid controller, a single backplane, etc. i.e. try to find
> some pattern to the corruption.
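
For illustration (this is not the genstream/checkstream format, and the chunk
layout of 16-bit file id + 32-bit chunk index + 16-bit CRC is just one choice),
a rough Python sketch of such a pattern writer/verifier:

import struct, zlib

CHUNK = 8

def make_chunk(file_id, byte_off):
    # 16-bit file id and 32-bit chunk index, then a 16-bit CRC over those 6 bytes
    body = struct.pack('<HI', file_id & 0xffff, (byte_off // CHUNK) & 0xffffffff)
    return body + struct.pack('<H', zlib.crc32(body) & 0xffff)

def write_pattern(path, file_id, size):
    # size is assumed to be a multiple of CHUNK; written in ~1MiB batches
    with open(path, 'wb') as f:
        off = 0
        while off < size:
            n = min(131072, (size - off) // CHUNK)
            f.write(b''.join(make_chunk(file_id, off + i * CHUNK) for i in range(n)))
            off += n * CHUNK

def verify_pattern(path, file_id):
    bad = []
    with open(path, 'rb') as f:
        off = 0
        while True:
            got = f.read(CHUNK)
            if not got:
                break
            if got != make_chunk(file_id, off):
                bad.append((off, got.hex()))   # where it went wrong, and what was found
            off += CHUNK
    return bad

A mismatching chunk that still decodes to a valid id/offset/CRC points at stale
or misplaced data from some other write; one that doesn't decode at all looks
more like bit flips or garbage.
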
>
> Unfortunately, I can't find the repository for the data checking
> tools that were developed years ago for doing exactly this sort of
> testing (genstream+checkstream) online anymore - they seem to
> have disappeared from the internet. (*) Shouldn't be too hard to
> write a quick tool to do this, though.
https://sourceforge.net/projects/checkstream/ ?
--D
> Also worth testing is whether the same corruption occurs when you
> use direct IO to write and read the files. That would rule out a
> large chunk of the filesystem and OS code as the cause of the
> corruption.
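
For the direct IO check, a sketch of the read side in Python (the write side
is analogous); O_DIRECT wants the buffer, offset and length aligned, and 4096
here matches the sectsz above:

import mmap, os

SECT = 4096

def read_direct(path, length, offset=0):
    assert length % SECT == 0 and offset % SECT == 0
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, length)       # anonymous mappings are page-aligned
        n = os.preadv(fd, [buf], offset)  # reads bypass the page cache
        return bytes(buf[:n])
    finally:
        os.close(fd)

# data = read_direct('/bigfs/path/to/file', 1024 * 1024)
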
>
> (*) Google is completely useless for searching for historic things,
> mailing lists and/or code these days. Searching google now reminds me
> of the bad old days of AltaVista - "never finds what I'm looking
> for"....
>
> > > file size: 31232491008 bytes
> > >
> > > The file is moved to "badfile", and the file regenerated from source
> > > data as "goodfile".
>
> What does "regenerated from source" mean?
>
> Does that mean a new file is created, compressed and then copied
> across? Or is it just the original file being copied again?
>
> > > From extent 16, the actual corrupt sector offset within the lv device
> > > underneath xfs is:
> > >
> > > 289315926016 + (53906431 - 45826040) == 289324006407
> > >
> > > Then we can look at the devices underneath the lv:
> > >
> > > # lvs --units s -o lv_name,seg_start,seg_size,devices
> > > LV Start SSize Devices
> > > bigfs 0S 105486999552S /dev/md0(0)
> > > bigfs 105486999552S 105487007744S /dev/md4(0)
> > > bigfs 210974007296S 105487007744S /dev/md9(0)
> > > bigfs 316461015040S 35160866816S /dev/md1(0)
> > > bigfs 351621881856S 105487007744S /dev/md5(0)
> > > bigfs 457108889600S 70323920896S /dev/md3(0)
> > >
> > > Comparing our corrupt sector lv offset with the start sector of each md
> > > device, we can see the corrupt sector is within /dev/md9 and not at a
> > > boundary. The corrupt sector offset within the lv data on md9 is given
> > > by:
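
To make that mapping mechanical, a small Python sketch driven by the lvs
segment table quoted above:

# (seg_start, seg_size, device) in 512-byte sectors, from the lvs output
SEGMENTS = [
    (0,            105486999552, '/dev/md0'),
    (105486999552, 105487007744, '/dev/md4'),
    (210974007296, 105487007744, '/dev/md9'),
    (316461015040,  35160866816, '/dev/md1'),
    (351621881856, 105487007744, '/dev/md5'),
    (457108889600,  70323920896, '/dev/md3'),
]

def locate(lv_sector):
    for start, size, dev in SEGMENTS:
        if start <= lv_sector < start + size:
            return dev, lv_sector - start
    raise ValueError('sector is beyond the end of the LV')

# locate(289324006407) -> ('/dev/md9', 78349999111): inside md9, not at a segment boundary
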
>
> Does the problem always occur on /dev/md9?
>
> If so, does the location correlate to a single disk in /dev/md9?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com