From: John Valdes <valdes@anl.gov>
To: xfs@oss.sgi.com
Subject: log recovery fails at mount
Date: Mon, 23 Jan 2012 20:43:41 -0600
Message-ID: <20120124024340.GA6689@starfish.mcs.anl.gov>

All,

We have an XFS which fails to mount due to an internal error according
to the messages reported to syslog:

  kernel: Filesystem md4: Disabling barriers, trial barrier write failed
  kernel: XFS mounting filesystem md4
  kernel: Starting XFS recovery on filesystem: md4 (logdev: internal)
  kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1676 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff887fca71
  kernel: 
  kernel: 
  kernel: Call Trace:
  kernel:  [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e
  kernel:  [<ffffffff887fca71>] :xfs:xfs_free_extent+0xa9/0xc9
  kernel:  [<ffffffff8882d874>] :xfs:xlog_recover_process_efi+0x112/0x16c
  kernel:  [<ffffffff888476c8>] :xfs:xfs_fs_fill_super+0x0/0x3dc
  kernel:  [<ffffffff8882ea53>] :xfs:xlog_recover_process_efis+0x4f/0x8d
  kernel:  [<ffffffff8882eaa5>] :xfs:xlog_recover_finish+0x14/0x9e
  kernel:  [<ffffffff888476c8>] :xfs:xfs_fs_fill_super+0x0/0x3dc
  kernel:  [<ffffffff888336c6>] :xfs:xfs_mountfs+0x47a/0x5ac
  kernel:  [<ffffffff88833daa>] :xfs:xfs_mru_cache_create+0x113/0x143
  kernel:  [<ffffffff888478cb>] :xfs:xfs_fs_fill_super+0x203/0x3dc
  kernel:  [<ffffffff800e7401>] get_sb_bdev+0x10a/0x16c
  kernel:  [<ffffffff800e6d9e>] vfs_kern_mount+0x93/0x11a
  kernel:  [<ffffffff800e6e67>] do_kern_mount+0x36/0x4d
  kernel:  [<ffffffff800f1865>] do_mount+0x6a9/0x719
  kernel:  [<ffffffff80009165>] __handle_mm_fault+0x9f6/0x103b
  kernel:  [<ffffffff8000c816>] _atomic_dec_and_lock+0x39/0x57
  kernel:  [<ffffffff8002cc44>] mntput_no_expire+0x19/0x89
  kernel:  [<ffffffff8000769e>] find_get_page+0x21/0x51
  kernel:  [<ffffffff8002239a>] __up_read+0x19/0x7f
  kernel:  [<ffffffff80067225>] do_page_fault+0x4cc/0x842
  kernel:  [<ffffffff80008d64>] __handle_mm_fault+0x5f5/0x103b
  kernel:  [<ffffffff800cee54>] zone_statistics+0x3e/0x6d
  kernel:  [<ffffffff8000f470>] __alloc_pages+0x78/0x308
  kernel:  [<ffffffff8004c0df>] sys_mount+0x8a/0xcd
  kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
  kernel: 
  kernel: Failed to recover EFIs on filesystem: md4
  kernel: XFS: log mount finish failed
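
To make a trace like the one above easier to read, the XFS frames can be pulled out of the syslog text.  The following is a minimal sketch, assuming trace lines in the "[<address>] :module:function+0xoff/0xlen" layout shown above; the names parse_trace and xfs_frames are made up for illustration and are not part of any XFS tool.

```python
import re

# Matches trace lines in the layout shown above, e.g.
#   kernel:  [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e
# The ":module:" prefix is optional; core kernel symbols have none.
TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_trace(lines):
    """Return (module, function) pairs in trace order.

    module is None for symbols that carry no ":module:" prefix.
    """
    frames = []
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            frames.append((m.group("module"), m.group("func")))
    return frames


def xfs_frames(lines):
    """Return only the function names that belong to the xfs module."""
    return [func for module, func in parse_trace(lines) if module == "xfs"]


# Four lines copied from the trace above.
sample = [
    "  kernel:  [<ffffffff887fb1cc>] :xfs:xfs_free_ag_extent+0x433/0x67e",
    "  kernel:  [<ffffffff887fca71>] :xfs:xfs_free_extent+0xa9/0xc9",
    "  kernel:  [<ffffffff8882d874>] :xfs:xlog_recover_process_efi+0x112/0x16c",
    "  kernel:  [<ffffffff800e7401>] get_sb_bdev+0x10a/0x16c",
]
print(xfs_frames(sample))
```

Read top to bottom, the first three XFS frames show the failure being raised in xfs_free_ag_extent, reached from xfs_free_extent while xlog_recover_process_efi was replaying an extent-free intent from the log.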

xfs_repair is unwilling to repair the fs since it sees unwritten data
in the xfs log:

  prompt# xfs_repair /dev/md4
  Phase 1 - find and verify superblock...
  Phase 2 - using internal log
          - zero log...
  ERROR: The filesystem has valuable metadata changes in a log which needs to
  be replayed.  Mount the filesystem to replay the log, and unmount it before
  re-running xfs_repair.  If you are unable to mount the filesystem, then use
  the -L option to destroy the log and attempt a repair.
  Note that destroying the log may cause corruption -- please attempt a mount
  of the filesystem before doing this.

Of course, since I can't mount the fs, I can't replay the log.  Before
zeroing out the log w/ xfs_repair -L, I was wondering if there is any
way to tell how critical the metadata in the log is.  I've run
"xfs_logprint", but not being an XFS developer, I don't understand the
info it's showing me.  Is there any way to glean something useful from
xfs_logprint?  For reference, I've put a copy of the complete output
at http://www.mcs.anl.gov/~valdes/xfslog.txt (warning, it's over 3.7
million lines long and about 192 MB in size).
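
A file of that size is easier to approach by counting record types than by reading it.  The following is a minimal sketch of a streaming counter.  The "EFI" and "EFD" patterns are assumptions (that extent-free intent and done records are tagged that way in the xfs_logprint output) and have not been checked against this file; count_matches and summarize are hypothetical helper names.

```python
import re
from collections import Counter


def count_matches(lines, patterns):
    """Count, per label, how many lines match each regex.

    `lines` can be any iterable of strings, so an open file is
    processed one line at a time rather than loaded into memory.
    """
    compiled = {label: re.compile(rx) for label, rx in patterns.items()}
    counts = Counter({label: 0 for label in compiled})
    for line in lines:
        for label, rx in compiled.items():
            if rx.search(line):
                counts[label] += 1
    return counts


# Hypothetical patterns: adjust them after looking at the real output.
PATTERNS = {"EFI": r"\bEFI\b", "EFD": r"\bEFD\b"}


def summarize(path, patterns=PATTERNS):
    """Stream a saved xfs_logprint dump from `path` and count matches."""
    with open(path, errors="replace") as fh:
        return count_matches(fh, patterns)
```

For example, summarize("xfslog.txt") would return one count per label.  If the assumed tags are right, more EFI than EFD lines might hint at extent frees that were logged but never completed, though that reading is itself an assumption.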

The system with this problem is running RHEL 5.7 with the bundled XFS
modules, e.g.:

  prompt# modinfo xfs
  filename:       /lib/modules/2.6.18-274.3.1.el5/kernel/fs/xfs/xfs.ko
  license:        GPL
  description:    SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
  author:         Silicon Graphics, Inc.
  srcversion:     4A41C05CBD42F5525F11CBD
  depends:        
  vermagic:       2.6.18-274.3.1.el5 SMP mod_unload gcc-4.1
  module_sig:     883f3504e58268794abe3920d1168f112bb7209e2721679ef3b2971313fad2364b5a43f2ab33e0a0a59bf02c12aca5e46c326a106f838129e0ab4867

although the XFS itself was built on an earlier version of RHEL 5, FWIW.

The details and history of the problem XFS are:

- It's ~20TB built on an md stripe of two 3ware RAID6 arrays.

- The problem showed up after a drive in one of the 3ware RAIDs
  failed, causing the controller to hang, which took that RAID (scsi
  device) offline:

    kernel: sd 7:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
    kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
    kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
    kernel: 3w-9xxx: scsi7: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
    kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
    last message repeated 99 times
    kernel: sd 7:0:0:0: rejecting I/O to offline device
    last message repeated 50 times
    kernel: sd 7:0:0:0: SCSI error: return code = 0x00010000
    kernel: end_request: I/O error, dev sdd, sector 2292015744
    kernel: sd 7:0:0:0: rejecting I/O to offline device
    last message repeated 436 times
    kernel: Device md4, XFS metadata write error block 0xd03f0 in md4
    kernel: Buffer I/O error on device md4, logical block 723454688
    kernel: lost page write due to I/O error on md4
    kernel: Buffer I/O error on device md4, logical block 723454689
    [...]
    kernel: sd 7:0:0:0: rejecting I/O to offline device
    kernel: I/O error in filesystem ("md4") meta-data dev md4 block 0x48c2598aa       ("xlog_iodone") error 5 buf count 3584
    kernel: xfs_force_shutdown(md4,0x2) called from line 1061 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8867404a
    kernel: Filesystem md4: Log I/O Error Detected.  Shutting down filesystem: md4
    kernel: Please umount the filesystem, and rectify the problem(s)
    kernel: Filesystem md4: xfs_log_force: error 5 returned.

  I was able to fully shut down the system after this, although I did
  need to power cycle it in order to get the 3ware controller back
  online (the controller does have a functional battery, so in theory
  data in its write cache should have been preserved, although
  messages at reboot suggest otherwise).  Nevertheless, upon reboot,
  the XFS mounted fine:

    kernel: 3w-9xxx: scsi7: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.
    kernel: 3w-9xxx: scsi7: AEN: WARNING (0x04:0x0008): Unclean shutdown detected:unit=0.
    [...]
    kernel: SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
    kernel: SGI XFS Quota Management subsystem
    kernel: Filesystem md4: Disabling barriers, trial barrier write failed
    kernel: XFS mounting filesystem md4
    kernel: Starting XFS recovery on filesystem: md4 (logdev: internal)
    kernel: Ending XFS recovery on filesystem: md4 (logdev: internal)

- The XFS continued working fine for about 2 weeks, but then it started
  reporting internal errors (XFS_WANT_CORRUPTED_RETURN):

    kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN at line 295 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff8864a345
    kernel: 
    kernel: 
    kernel: Call Trace:
    kernel:  [<ffffffff8864889f>] :xfs:xfs_alloc_fixup_trees+0x2ba/0x2cb
    kernel:  [<ffffffff8865e89b>] :xfs:xfs_btree_init_cursor+0x31/0x1a3
    kernel:  [<ffffffff8864a345>] :xfs:xfs_alloc_ag_vextent_near+0x773/0x8e2
    kernel:  [<ffffffff8864a4df>] :xfs:xfs_alloc_ag_vextent+0x2b/0xfc
    kernel:  [<ffffffff8864ad5f>] :xfs:xfs_alloc_vextent+0x2ce/0x3ff
    kernel:  [<ffffffff886595ca>] :xfs:xfs_bmap_btalloc+0x673/0x8c1
    kernel:  [<ffffffff88659f09>] :xfs:xfs_bmapi+0x6ec/0xe79
    kernel:  [<ffffffff8867b0c7>] :xfs:xlog_ticket_alloc+0xc8/0xed
    kernel:  [<ffffffff8867b199>] :xfs:xfs_log_reserve+0xad/0xc9
    kernel:  [<ffffffff886764de>] :xfs:xfs_iomap_write_allocate+0x202/0x329
    kernel:  [<ffffffff88676f0e>] :xfs:xfs_iomap+0x217/0x28d
    kernel:  [<ffffffff8868bf48>] :xfs:xfs_map_blocks+0x2d/0x63
    kernel:  [<ffffffff8868cb8e>] :xfs:xfs_page_state_convert+0x2b1/0x546
    kernel:  [<ffffffff8001c452>] generic_make_request+0x211/0x228
    kernel:  [<ffffffff8868cf6f>] :xfs:xfs_vm_writepage+0xa7/0xe0
    kernel:  [<ffffffff8001d1d1>] mpage_writepages+0x1bf/0x37d
    kernel:  [<ffffffff8868cec8>] :xfs:xfs_vm_writepage+0x0/0xe0
    kernel:  [<ffffffff8005a8a6>] do_writepages+0x20/0x2f
    kernel:  [<ffffffff8002fa24>] __writeback_single_inode+0x1a2/0x31c
    kernel:  [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271
    kernel:  [<ffffffff800a2be5>] keventd_create_kthread+0x0/0xc4
    kernel:  [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8
    kernel:  [<ffffffff800cc304>] wb_kupdate+0xd4/0x14e
    kernel:  [<ffffffff800562a9>] pdflush+0x0/0x1fb
    kernel:  [<ffffffff800563fa>] pdflush+0x151/0x1fb
    kernel:  [<ffffffff800cc230>] wb_kupdate+0x0/0x14e
    kernel:  [<ffffffff80032722>] kthread+0xfe/0x132
    kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
    kernel:  [<ffffffff800a2be5>] keventd_create_kthread+0x0/0xc4
    kernel:  [<ffffffff80032624>] kthread+0x0/0x132
    kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

- Once this started happening, I shut down the system again, but this
  time at reboot, the XFS failed to mount, w/ the error given at the
  top of this email.

Does anyone have any suggestions on how to recover from this state, or
is my only option xfs_repair -L and hope that there isn't any
corruption?  This XFS is part of a scratch filesystem (we have a large
PVFS filesystem built on top of this XFS plus 7 other identical ones
on other servers), so if it ended up being corrupted, it wouldn't be
the end of the world, but it would represent a lot of lost work.

Thanks for any help.

John
