From: John Valdes <valdes@anl.gov>
To: xfs@oss.sgi.com
Subject: log recovery fails at mount
Date: Mon, 23 Jan 2012 20:43:41 -0600
Message-ID: <20120124024340.GA6689@starfish.mcs.anl.gov>
All,
We have an XFS which fails to mount due to an internal error according
to the messages reported to syslog:
kernel: Filesystem md4: Disabling barriers, trial barrier write failed
kernel: XFS mounting filesystem md4
kernel: Starting XFS recovery on filesystem: md4 (logdev: internal)
kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1676 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff887fca71
kernel:
kernel:
kernel: Call Trace:
kernel: [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e
kernel: [<ffffffff887fca71>] :xfs:xfs_free_extent+0xa9/0xc9
kernel: [<ffffffff8882d874>] :xfs:xlog_recover_process_efi+0x112/0x16c
kernel: [<ffffffff888476c8>] :xfs:xfs_fs_fill_super+0x0/0x3dc
kernel: [<ffffffff8882ea53>] :xfs:xlog_recover_process_efis+0x4f/0x8d
kernel: [<ffffffff8882eaa5>] :xfs:xlog_recover_finish+0x14/0x9e
kernel: [<ffffffff888476c8>] :xfs:xfs_fs_fill_super+0x0/0x3dc
kernel: [<ffffffff888336c6>] :xfs:xfs_mountfs+0x47a/0x5ac
kernel: [<ffffffff88833daa>] :xfs:xfs_mru_cache_create+0x113/0x143
kernel: [<ffffffff888478cb>] :xfs:xfs_fs_fill_super+0x203/0x3dc
kernel: [<ffffffff800e7401>] get_sb_bdev+0x10a/0x16c
kernel: [<ffffffff800e6d9e>] vfs_kern_mount+0x93/0x11a
kernel: [<ffffffff800e6e67>] do_kern_mount+0x36/0x4d
kernel: [<ffffffff800f1865>] do_mount+0x6a9/0x719
kernel: [<ffffffff80009165>] __handle_mm_fault+0x9f6/0x103b
kernel: [<ffffffff8000c816>] _atomic_dec_and_lock+0x39/0x57
kernel: [<ffffffff8002cc44>] mntput_no_expire+0x19/0x89
kernel: [<ffffffff8000769e>] find_get_page+0x21/0x51
kernel: [<ffffffff8002239a>] __up_read+0x19/0x7f
kernel: [<ffffffff80067225>] do_page_fault+0x4cc/0x842
kernel: [<ffffffff80008d64>] __handle_mm_fault+0x5f5/0x103b
kernel: [<ffffffff800cee54>] zone_statistics+0x3e/0x6d
kernel: [<ffffffff8000f470>] __alloc_pages+0x78/0x308
kernel: [<ffffffff8004c0df>] sys_mount+0x8a/0xcd
kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
kernel:
kernel: Failed to recover EFIs on filesystem: md4
kernel: XFS: log mount finish failed
xfs_repair is unwilling to repair the fs since it sees unwritten data
in the xfs log:
prompt# xfs_repair /dev/md4
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
Of course, since I can't mount the fs, I can't replay the log. Before
zeroing out the log w/ xfs_repair -L, I was wondering if there is any
way to tell how critical the metadata in the log is? I've run
"xfs_logprint", but not being an XFS developer, I don't understand the
info it's showing me. Is there any way to glean something useful from
xfs_logprint? For reference, I've put a copy of the complete output
at http://www.mcs.anl.gov/~valdes/xfslog.txt (warning, it's over 3.7
million lines long and about 192 MB big).
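A dump of that size is easier to approach by streaming it and counting the lines that mention particular record types than by reading it. The following is a minimal sketch only: it assumes the records of interest can be recognised by a keyword on the line, and the example keywords "EFI" and "EFD" as well as the helper names are assumptions for illustration, not something taken from the linked output.

```python
from collections import Counter


def count_keywords(lines, keywords):
    """Count how many lines contain each keyword (plain substring match)."""
    counts = Counter({kw: 0 for kw in keywords})
    for line in lines:
        for kw in keywords:
            if kw in line:
                counts[kw] += 1
    return counts


def count_in_file(path, keywords):
    """Stream a large text file line by line so it never sits in memory whole."""
    with open(path, errors="replace") as fh:
        return count_keywords(fh, keywords)


# Hypothetical usage on a saved copy of the xfs_logprint output:
#   count_in_file("xfslog.txt", ["EFI", "EFD"])
```

The counts say nothing about how important each record is; they only give a rough picture of what kinds of entries the log holds before deciding whether to discard it.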
The system with this problem is running RHEL 5.7 with the bundled XFS
modules, e.g.:
prompt# modinfo xfs
filename: /lib/modules/2.6.18-274.3.1.el5/kernel/fs/xfs/xfs.ko
license: GPL
description: SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
author: Silicon Graphics, Inc.
srcversion: 4A41C05CBD42F5525F11CBD
depends:
vermagic: 2.6.18-274.3.1.el5 SMP mod_unload gcc-4.1
module_sig: 883f3504e58268794abe3920d1168f112bb7209e2721679ef3b2971313fad2364b5a43f2ab33e0a0a59bf02c12aca5e46c326a106f838129e0ab4867
although the XFS itself was built on an earlier version of RHEL 5, FWIW.
The details and history of the problem XFS are:
- It's ~20TB built on an md stripe of two 3ware RAID6 arrays.
- The problem showed up after a drive in one of the 3ware RAIDs
failed, causing the controller to hang, which took that RAID (scsi
device) offline:
kernel: sd 7:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
last message repeated 99 times
kernel: sd 7:0:0:0: rejecting I/O to offline device
last message repeated 50 times
kernel: sd 7:0:0:0: SCSI error: return code = 0x00010000
kernel: end_request: I/O error, dev sdd, sector 2292015744
kernel: sd 7:0:0:0: rejecting I/O to offline device
last message repeated 436 times
kernel: Device md4, XFS metadata write error block 0xd03f0 in md4
kernel: Buffer I/O error on device md4, logical block 723454688
kernel: lost page write due to I/O error on md4
kernel: Buffer I/O error on device md4, logical block 723454689
[...]
kernel: sd 7:0:0:0: rejecting I/O to offline device
kernel: I/O error in filesystem ("md4") meta-data dev md4 block 0x48c2598aa ("xlog_iodone") error 5 buf count 3584
kernel: xfs_force_shutdown(md4,0x2) called from line 1061 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8867404a
kernel: Filesystem md4: Log I/O Error Detected. Shutting down filesystem: md4
kernel: Please umount the filesystem, and rectify the problem(s)
kernel: Filesystem md4: xfs_log_force: error 5 returned.
I was able to fully shut down the system after this, although I did
need to power cycle it in order to get the 3ware controller back
online (the controller does have a functional battery, so in theory
data in its write cache should have been preserved, although
messages at reboot suggest otherwise). Nevertheless, upon reboot,
the XFS mounted fine:
kernel: 3w-9xxx: scsi7: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.
kernel: 3w-9xxx: scsi7: AEN: WARNING (0x04:0x0008): Unclean shutdown detected:unit=0.
[...]
kernel: SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
kernel: SGI XFS Quota Management subsystem
kernel: Filesystem md4: Disabling barriers, trial barrier write failed
kernel: XFS mounting filesystem md4
kernel: Starting XFS recovery on filesystem: md4 (logdev: internal)
kernel: Ending XFS recovery on filesystem: md4 (logdev: internal)
- The XFS continued working fine for about 2 weeks, but then it started
reporting internal errors (XFS_WANT_CORRUPTED_RETURN):
kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN at line 295 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8864a345
kernel:
kernel:
kernel: Call Trace:
kernel: [<ffffffff8864889f>] :xfs:xfs_alloc_fixup_trees+0x2ba/0x2cb
kernel: [<ffffffff8865e89b>] :xfs:xfs_btree_init_cursor+0x31/0x1a3
kernel: [<ffffffff8864a345>] :xfs:xfs_alloc_ag_vextent_near+0x773/0x8e2
kernel: [<ffffffff8864a4df>] :xfs:xfs_alloc_ag_vextent+0x2b/0xfc
kernel: [<ffffffff8864ad5f>] :xfs:xfs_alloc_vextent+0x2ce/0x3ff
kernel: [<ffffffff886595ca>] :xfs:xfs_bmap_btalloc+0x673/0x8c1
kernel: [<ffffffff88659f09>] :xfs:xfs_bmapi+0x6ec/0xe79
kernel: [<ffffffff8867b0c7>] :xfs:xlog_ticket_alloc+0xc8/0xed
kernel: [<ffffffff8867b199>] :xfs:xfs_log_reserve+0xad/0xc9
kernel: [<ffffffff886764de>] :xfs:xfs_iomap_write_allocate+0x202/0x329
kernel: [<ffffffff88676f0e>] :xfs:xfs_iomap+0x217/0x28d
kernel: [<ffffffff8868bf48>] :xfs:xfs_map_blocks+0x2d/0x63
kernel: [<ffffffff8868cb8e>] :xfs:xfs_page_state_convert+0x2b1/0x546
kernel: [<ffffffff8001c452>] generic_make_request+0x211/0x228
kernel: [<ffffffff8868cf6f>] :xfs:xfs_vm_writepage+0xa7/0xe0
kernel: [<ffffffff8001d1d1>] mpage_writepages+0x1bf/0x37d
kernel: [<ffffffff8868cec8>] :xfs:xfs_vm_writepage+0x0/0xe0
kernel: [<ffffffff8005a8a6>] do_writepages+0x20/0x2f
kernel: [<ffffffff8002fa24>] __writeback_single_inode+0x1a2/0x31c
kernel: [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271
kernel: [<ffffffff800a2be5>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8
kernel: [<ffffffff800cc304>] wb_kupdate+0xd4/0x14e
kernel: [<ffffffff800562a9>] pdflush+0x0/0x1fb
kernel: [<ffffffff800563fa>] pdflush+0x151/0x1fb
kernel: [<ffffffff800cc230>] wb_kupdate+0x0/0x14e
kernel: [<ffffffff80032722>] kthread+0xfe/0x132
kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
kernel: [<ffffffff800a2be5>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff80032624>] kthread+0x0/0x132
kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
- Once this started happening, I shut down the system again, but this
time at reboot, the XFS failed to mount, w/ the error given at the
top of this email.
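The two call traces quoted above are long, and only the frames inside the xfs module show which code path hit the corruption check. As a minimal sketch, assuming frames are formatted exactly as in the syslog excerpts in this message ("[<address>] :module:function+0xoff/0xlen", with the ":module:" part absent for core kernel symbols), the XFS frames can be pulled out like this. The function names and the "core" label are illustrative choices, not part of any XFS tool.

```python
import re

# Matches frames such as:
#   kernel: [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e
#   kernel: [<ffffffff800e7401>] get_sb_bdev+0x10a/0x16c
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(lines):
    """Return (module, function) pairs for every line that looks like a frame."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            # Frames without a ":module:" prefix belong to the core kernel.
            frames.append((m.group("module") or "core", m.group("func")))
    return frames


def module_frames(lines, module="xfs"):
    """Return only the function names that belong to the given module."""
    return [func for mod, func in parse_frames(lines) if mod == module]


SAMPLE = """\
kernel: [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e
kernel: [<ffffffff887fca71>] :xfs:xfs_free_extent+0xa9/0xc9
kernel: [<ffffffff8882d874>] :xfs:xlog_recover_process_efi+0x112/0x16c
kernel: [<ffffffff800e7401>] get_sb_bdev+0x10a/0x16c
""".splitlines()

if __name__ == "__main__":
    for func in module_frames(SAMPLE):
        print(func)
```

Applied to the mount-time trace, this leaves the xfs_free_ag_extent, xfs_free_extent and xlog_recover_process_efi frames at the top, which matches the "Failed to recover EFIs" message that follows the trace.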
Does anyone have any suggestions on how to recover from this state, or
is my only option xfs_repair -L and hope that there isn't any
corruption? This XFS is part of a scratch filesystem (we have a large
PVFS filesystem built on top of this XFS plus 7 other identical ones
on other servers), so if it ended up being corrupted, it wouldn't be
the end of the world, but it would represent a lot of lost work.
Thanks for any help.
John