From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111])
 by oss.sgi.com (Postfix) with ESMTP id DF5F87F50;
 Thu, 13 Aug 2015 01:21:35 -0500 (CDT)
Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11])
 by relay1.corp.sgi.com (Postfix) with ESMTP id 9CF208F8035;
 Wed, 12 Aug 2015 23:21:32 -0700 (PDT)
Received: from mail01.lsn.net (mail01.lsn.net [66.90.130.120])
 by cuda.sgi.com with ESMTP id O9SGInVwAGDEpDed;
 Wed, 12 Aug 2015 23:21:29 -0700 (PDT)
Message-ID: <55CC375C.10902@mygrande.net>
Date: Thu, 13 Aug 2015 01:21:16 -0500
From: Leslie Rhorer
MIME-Version: 1.0
Subject: Re: XFS File system in trouble
References: <03864DDC681E664EBF5D47682BE7D7CF0D358740@USADCWVEMBX07.corp.global.level3.com>
 <55AAF73A.4040903@mygrande.net>
 <20150720111747.GA53450@bfoster.bfoster>
 <55B73365.1050908@mygrande.net>
 <20150728123307.GC38784@bfoster.bfoster>
 <55B79BFD.6020509@mygrande.net>
 <20150728221150.GA26604@bfoster.bfoster>
 <55BE7C75.4060604@mygrande.net>
 <55C06F41.4030502@mygrande.net>
 <20150804224240.GU16638@dastard>
 <55C8006C.8070807@mygrande.net>
In-Reply-To: <55C8006C.8070807@mygrande.net>
List-Id: XFS Filesystem from SGI
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Dave Chinner
Cc: "Rhorer, Leslie", Brian Foster, Kris Rusocki, Eric Sandeen, "xfs@oss.sgi.com"

The compressed tarball containing the dump file and the image is on my
web site:

http://fletchergeek.com/images/metadump.tar.gz

It's 22G in size.

On 8/9/2015 8:37 PM, Leslie Rhorer wrote:
> Well, nice try, but it doesn't wash for several reasons:
>
> 1. Power supply issues would be highly unlikely to be the cause of such
> a highly specific failure at always a very specific point in a process.
> Problems would crop up all over the place, not just with one, very
> specific failure. While I am thinking of it, I also ran memtest86+
> again on the new memory. It passed all tests with flying colors.
>
> 2. The system has not been under a heavy load when this happens. In
> fact, it's piddling. Rsync and tar are single threaded, eating up at
> most 1 CPU core at a time. I have processes that can regularly bang all
> 8 cores right to the wall with no errors. The I/O stream is even more
> piddling. Rsync is transferring nearly 120 MBps (it's a 1G link) during
> the process, and some portions of the tar process can bang out well over
> 2Gbps. Creating a directory is nothing.
>
> 3. All the power supply rails are nominal - I checked.
>
> 4. Most damning of all, I am able to reproduce the issue, now, on
> another machine. I'm not entirely sure why creating the image on one
> partition and then copying it to the root or across the LAN stopped it
> from failing, but I took the 1.5T drive and moved it to the backup
> machine, which as I related earlier is nearly identical in hardware and
> highly similar in software to the primary system.
> It's failing there repeatedly and consistently:
>
> RR274x/Driver/Freebsd/rr274x_3x-bsd-8.0-v1.0.10.0712.tgz
> RR274x/Driver/Linux/
> RR274x/Driver/Linux/Debian/
> tar: RR274x/Driver/Linux/Debian: Cannot mkdir: Structure needs cleaning
> RR274x/Driver/Linux/Debian/rr274x_3x-debian-5.0.1-i386/
> tar: RR274x/Driver/Linux/Debian: Cannot mkdir: Input/output error
> tar: RR274x/Driver/Linux/Debian/rr274x_3x-debian-5.0.1-i386: Cannot mkdir: No such file or directory
> RR274x/Driver/Linux/Debian/rr274x_3x-debian-5.0.1-i386/boot/
> tar: RR274x/Driver/Linux/Debian: Cannot mkdir: Input/output error
> tar: RR274x/Driver/Linux/Debian/rr274x_3x-debian-5.0.1-i386/boot: Cannot mkdir: No such file or directory
> RR274x/Driver/Linux/Debian/rr274x_3x-debian-5.0.1-i386/boot/rr274x_3x2.6.26-2-486i386.ko.gz
>
> tar: RR274x/Driver/Linux/Debian: Cannot mkdir: Input/output error
>
> gzip: stdin: Input/output error
> tar: Unexpected EOF in archive
> tar: RR274x/Driver/Linux: Cannot utime: Input/output error
> tar: RR274x/Driver/Linux: Cannot change ownership to uid 0, gid 1000: Input/output error
> tar: RR274x/Driver/Linux: Cannot change mode to rwxr-xr-x: Input/output error
> tar: RR274x/Driver: Cannot utime: Input/output error
> tar: RR274x/Driver: Cannot change ownership to uid 0, gid 1000: Input/output error
> tar: RR274x/Driver: Cannot change mode to rwxr-xr-x: Input/output error
> tar: RR274x: Cannot utime: Input/output error
> tar: RR274x: Cannot change ownership to uid 0, gid 1000: Input/output error
> tar: RR274x: Cannot change mode to rwxr-xr-x: Input/output error
> tar: Error is not recoverable: exiting now
>
>
> dmesg:
> [26743.775522] XFS (sdk): Mounting V4 Filesystem
> [26743.904281] XFS (sdk): Ending clean mount
> [26743.912614] Loading kernel module for a network device with CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias netdev- instead.
>
>
>
> [26772.528827] loop: module loaded
> [26772.601043] XFS (loop0): Mounting V4 Filesystem
> [26772.764360] XFS (loop0): Ending clean mount
> [26772.770627] Loading kernel module for a network device with CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias netdev- instead.
>
>
>
> [26899.019942] XFS (loop0): xfs_iread: validation failed for inode 124656869424 failed
> [26899.019952] ffff8800b473e000: 49 4e 00 00 03 02 00 00 00 30 00 70 00 00 03 e8  IN.......0.p....
> [26899.019957] ffff8800b473e010: 00 00 00 00 06 20 b0 6f 01 2e 00 00 00 00 00 16  ..... .o........
> [26899.019960] ffff8800b473e020: 01 57 37 fd 2b 5d 22 9e 1e 0a 61 8c 00 00 00 20  .W7.+]"...a....
> [26899.019964] ffff8800b473e030: ff ff 00 d2 1b f6 27 90 00 00 00 00 00 00 00 00  ......'.........
> [26899.019993] XFS (loop0): Internal error xfs_iread at line 392 of file /build/linux-u5KAtC/linux-3.16.7-ckt11/fs/xfs/xfs_inode_buf.c. Caller xfs_iget+0x24b/0x690 [xfs]
> [26899.020000] CPU: 6 PID: 3756 Comm: tar Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2
> [26899.020004] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 0803 08/15/2012
> [26899.020007] 0000000000000001 ffffffff8150b3d5 ffff8800065b9800 ffffffffa06bd5cb
> [26899.020014] 0000018800000010 ffffffffa06c2f6b ffff88000a680400 ffff8800065b9800
> [26899.020019] 0000000000000075 ffff88000527f140 ffffffffa0708b3a ffffffffa06c2f6b
> [26899.020024] Call Trace:
> [26899.020034] [] ? dump_stack+0x41/0x51
> [26899.020052] [] ? xfs_corruption_error+0x5b/0x80 [xfs]
> [26899.020069] [] ? xfs_iget+0x24b/0x690 [xfs]
> [26899.020090] [] ? xfs_iread+0xea/0x400 [xfs]
> [26899.020106] [] ? xfs_iget+0x24b/0x690 [xfs]
> [26899.020124] [] ? xfs_iget+0x24b/0x690 [xfs]
> [26899.020146] [] ? xfs_ialloc+0xa6/0x500 [xfs]
> [26899.020192] [] ? kmem_zone_alloc+0x6e/0xe0 [xfs]
> [26899.020215] [] ? xfs_dir_ialloc+0x62/0x2a0 [xfs]
> [26899.020237] [] ?
xfs_trans_reserve+0x1f5/0x200 [xfs]
> [26899.020261] [] ? xfs_create+0x489/0x700 [xfs]
> [26899.020267] [] ? kern_path_create+0xaa/0x190
> [26899.020286] [] ? xfs_generic_create+0xca/0x250 [xfs]
> [26899.020292] [] ? vfs_mkdir+0xb0/0x160
> [26899.020296] [] ? SyS_mkdirat+0xab/0xe0
> [26899.020303] [] ? system_call_fast_compare_end+0x10/0x15
> [26899.020307] XFS (loop0): Corruption detected. Unmount and run xfs_repair
> [26899.020337] XFS (loop0): Internal error xfs_trans_cancel at line 959 of file /build/linux-u5KAtC/linux-3.16.7-ckt11/fs/xfs/xfs_trans.c. Caller xfs_create+0x2b2/0x700 [xfs]
> [26899.020342] CPU: 6 PID: 3756 Comm: tar Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2
> [26899.020345] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 0803 08/15/2012
> [26899.020347] 000000000000000c ffffffff8150b3d5 ffff88000527f140 ffffffffa06d1e07
> [26899.020354] ffff88000a729800 ffff8800066e3ec8 ffff8800065b9800 ffffffffa07037d2
> [26899.020359] 0000000000000001 ffff8800066e3e20 ffff8800066e3e1c ffff8800066e3eb0
> [26899.020364] Call Trace:
> [26899.020370] [] ? dump_stack+0x41/0x51
> [26899.020388] [] ? xfs_trans_cancel+0xc7/0xf0 [xfs]
> [26899.020409] [] ? xfs_create+0x2b2/0x700 [xfs]
> [26899.020414] [] ? kern_path_create+0xaa/0x190
> [26899.020432] [] ? xfs_generic_create+0xca/0x250 [xfs]
> [26899.020437] [] ? vfs_mkdir+0xb0/0x160
> [26899.020442] [] ? SyS_mkdirat+0xab/0xe0
> [26899.020447] [] ? system_call_fast_compare_end+0x10/0x15
> [26899.020454] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 960 of file /build/linux-u5KAtC/linux-3.16.7-ckt11/fs/xfs/xfs_trans.c. Return address = 0xffffffffa06d1e20
> [26899.407181] XFS (loop0): Corruption of in-memory data detected. Shutting down filesystem
> [26899.407190] XFS (loop0): Please umount the filesystem and rectify the problem(s)
> [26923.319559] XFS (loop0): xfs_log_force: error 5 returned.
>
>
>
> Xfs_repair still reports no faults. I'm compressing the dump file and
> image file right now to be posted on http://fletchergeek.com/images when
> it is done, but it is taking a very long time. I'll also try
> decompressing the image to the other array to see if it still fails
> before I upload the file. No point in uploading if putting it through
> the compression process results in an image that does not fail.
>
> On 8/4/2015 5:42 PM, Dave Chinner wrote:
>> On Tue, Aug 04, 2015 at 02:52:33AM -0500, Leslie Rhorer wrote:
>>> It's failing, again. The rsync job failed and when I attempt to
>>> untar the file in the image mount, it fails there, as well. See
>>> below. I formatted a 1.5T drive as xfs and mounted it under /media.
>>> I then dumped the failing FS to a file on /media using xfs_metadump
>>> and used xfs_mdrestore to create an image of the FS. I then mounted
>>> the image, copied over the tarball to its location, and ran tar to
>>> extract the files:
>>>
>>> [131874.545344] loop: module loaded
>>> [131874.549914] XFS (loop0): Mounting V4 Filesystem
>>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>> [131874.555540] XFS (loop0): Ending clean mount
>>> [132020.964431] XFS (loop0): xfs_iread: validation failed for inode 124656869424 failed
>>> [132020.964435] ffff88028b078000: 49 4e 00 00 03 02 00 00 00 30 00 70 00 00 03 e8  IN.......0.p....
>>> [132020.964437] ffff88028b078010: 00 00 00 00 06 20 b0 6f 01 2e 00 00 00 00 00 16  ..... .o........
>>> [132020.964438] ffff88028b078020: 01 57 37 fd 2b 5d 22 9e 1e 0a 61 8c 00 00 00 20  .W7.+]"...a....
>>> [132020.964440] ffff88028b078030: ff ff 00 d2 1b f6 27 90 00 00 00 00 00 00 00 00  ......'.........
>>> [132020.964454] XFS (loop0): Internal error xfs_iread at line 392 of file /build/linux-QZaPpC/linux-3.16.7-ckt11/fs/xfs/xfs_inode_buf.c. Caller xfs_iget+0x24b/0x690 [xfs]
>>
>> That's a different error to all the ones you've previously posted.
>> This is an inode allocation that has found a bad inode on disk.
>>
>> Decoding the 64 bytes above:
>>
>> di_magic     = 0x494e
>> di_mode      = 0
>> di_version   = 3                    <<< That's *wrong*
>> di_format    = 2
>> di_onlink    = 0
>> di_uid       = 0x300070             <<< Looks unlikely
>> di_gid       = 0x3e8
>> ----
>> di_nlink     = 0
>> di_projlo    = 0x620                <<< should be zero
>> di_projhi    = 0xb06f               <<< should be zero
>> di_pad[6]    = 0x1 0x2e 0 0 0 0     <<< should be zero
>> di_flushiter = 0x16                 <<< should be zero for v3 inode
>> ---
>> di_atime
>> di_mtime
>> di_ctime
>> di_size      = 0x20ffff00d2         <<< should be zero
>> ----
>> di_nblocks   = 0x1bf6279000000000   <<< should be zero
>> di_extsize   = 0
>> ----
>>
>> You've just created and mounted a v4 filesystem, which means it is
>> using v2 inodes. This inode read back as a v3 inode, with lots of
>> crap in places where there should be zeros for either v2 or v3 inodes.
>>
>> This does not look like a filesystem problem - it's clear that what
>> has come from disk (or a cached memory buffer) is full of garbage
>> and contains invalid configuration, and the filesystem has quite
>> correctly detected the corruption and shut down. The filesystem
>> would give the same errors if it tried to *write* such a corrupt
>> block, so we know what has just been detected has not come from the
>> filesystem code...
>>
>> FWIW, I've occasionally seen this sort of thing happen when a power
>> supply had gone bad - it wasn't bad enough to make things fail, it
>> just caused transient issues under load that manifest as corruptions
>> and crashes. Given that you've already found one set of hardware
>> problems and the corruption patterns are unlike any
>> filesystem/storage problem I've ever seen, I'd suggest that you
>> still have some kind of hardware issue...
>>
>> Cheers,
>>
>> Dave.
>>
>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
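
For anyone who wants to cross-check the quoted field decode against the
hexdump, here is a minimal Python sketch. It assumes the big-endian
on-disk inode core layout (di_magic through di_flushiter) and covers only
the first 32 bytes, i.e. the first two rows of the dump; timestamps and
later fields are left out. The helper names decode_dinode_head and
flag_oddities are made up for illustration, and the checks simply mirror
the "<<<" annotations in the quoted decode; they are not the kernel's
verifier.

```python
import struct

# First 32 bytes of the inode buffer, copied from the first two rows of the
# dmesg hexdump quoted in this thread.
RAW = bytes.fromhex(
    "49 4e 00 00 03 02 00 00 00 30 00 70 00 00 03 e8 "
    "00 00 00 00 06 20 b0 6f 01 2e 00 00 00 00 00 16"
)

# Assumed layout: start of the on-disk inode core, big-endian, up to and
# including di_flushiter (32 bytes in total).
HEAD = struct.Struct(">HHBBHIIIHH6sH")
FIELDS = (
    "di_magic", "di_mode", "di_version", "di_format", "di_onlink",
    "di_uid", "di_gid", "di_nlink", "di_projid_lo", "di_projid_hi",
    "di_pad", "di_flushiter",
)


def decode_dinode_head(raw):
    """Decode the first 32 bytes of an inode buffer (hypothetical helper)."""
    return dict(zip(FIELDS, HEAD.unpack(raw[:HEAD.size])))


def flag_oddities(fields, expected_version=2):
    """List fields that look wrong, following the annotations in the thread.

    expected_version defaults to 2 because the thread states that a V4
    filesystem uses v2 inodes.
    """
    odd = []
    if fields["di_version"] != expected_version:
        odd.append("di_version")
    for name in ("di_projid_lo", "di_projid_hi", "di_flushiter"):
        if fields[name] != 0:
            odd.append(name)
    if any(fields["di_pad"]):
        odd.append("di_pad")
    return odd


if __name__ == "__main__":
    head = decode_dinode_head(RAW)
    for name in FIELDS:
        value = head[name]
        shown = value.hex() if isinstance(value, bytes) else hex(value)
        print(f"{name:13} = {shown}")
    print("suspicious:", ", ".join(flag_oddities(head)))
```

Run as a script, it prints each decoded field followed by the names of the
fields that do not fit a v2 inode on a V4 filesystem.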