public inbox for linux-xfs@vger.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: David Raffelt <david.raffelt@florey.edu.au>
Cc: stefanrin@gmail.com, "xfs@oss.sgi.com" <xfs@oss.sgi.com>
Subject: Re: XFS corrupt after RAID failure and resync
Date: Tue, 6 Jan 2015 18:16:17 -0500	[thread overview]
Message-ID: <20150106231617.GA18544@bfoster.bfoster> (raw)
In-Reply-To: <CAOFq7B5HnEZFtVvoRORRwOxPhx5Txf9xJW=BG6GFbLMNk+_CEw@mail.gmail.com>

On Wed, Jan 07, 2015 at 07:34:37AM +1100, David Raffelt wrote:
> Hi Brian and Stefan,
> Thanks for your reply.  I checked the status of the array after the rebuild
> (and before the reset).
> 
> md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
>       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUUUUU_]
> 
> However given that I've never had any problems before with mdadm rebuilds I
> did not think to check the data before rebooting.  Note that the array is
> still in this state. Before the reboot I tried to run a smartctl check on
> the failed drives and it could not read them. When I rebooted I did not
> actually replace any drives, I just power cycled to see if I could
> re-access the drives that were thrown out of the array. According to
> smartctl they are completely fine.
> 
> I guess there is no way I can re-add the old drives and remove the newly
> synced drive?  Even though I immediately kicked all users off the system
> when I got the mdadm alert, it's possible a small amount of data was
> written to the array during the resync.
> 
> It looks like the filesystem was not unmounted properly before reboot:
> Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
> Jan 06 09:11:54 server systemd[1]: Shutting down.
> 
> Here are the mount errors from the log after rebooting:
> Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400 ("xfs_trans_read_buf_map") error 117 numblks 16
> Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode
> 

So it fails to read the root inode. You could also try to read said
inode via xfs_db (e.g., 'sb,' 'p rootino,' 'inode <ino#>,' 'p') and see
what it shows.
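For example, something like the following (untested sketch; /dev/md0 is
assumed from your report, and the fs must stay unmounted):

```shell
# Untested sketch: walk from the superblock to the root inode with xfs_db.
# /dev/md0 is assumed from the report above; -r opens the device read-only.
DEV=/dev/md0
if [ -b "$DEV" ]; then
    xfs_db -r "$DEV" -c 'sb 0' -c 'p rootino'
    # Then, with the inode number printed above filled in:
    #   xfs_db -r "$DEV" -c 'inode <ino#>' -c 'p'
fi
```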

Are you able to run xfs_metadump against the fs? If so and you're
willing/able to make the dump available somewhere (compressed), I'd be
interested to take a look to see what might be causing the difference in
behavior between repair and xfs_db.
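For reference, the invocation would look roughly like this (untested;
the device name and output path are assumptions, adjust to taste):

```shell
# Untested sketch: dump the filesystem metadata (no file data) and compress it.
# xfs_metadump obfuscates most names by default; -g prints progress.
DEV=/dev/md0            # assumed from the report above
OUT=/tmp/md0.metadump   # hypothetical output path
if [ -b "$DEV" ]; then
    xfs_metadump -g "$DEV" "$OUT" && gzip "$OUT"
fi
```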

Brian

> xfs_repair -n -L also complains about a bad magic number.
> 
> Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> volume. It was only ever meant to be a scratch drive for intermediate
> scientific results, however inevitably most users used it to store lots of
> data. Oh well.
> 
> Thanks again,
> Dave
> 
> On 6 January 2015 at 23:47, Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> > > Hi again,
> > > Some more information.... the kernel log show the following errors were
> > > occurring after the RAID recovery, but before I reset the server.
> > >
> >
> > By "after the raid recovery," you mean after the two drives had failed out
> > and 1 hot spare was activated and resync completed? It certainly seems
> > like something went wrong in this process. The output below looks like
> > it's failing to read in some inodes. Is there any stack trace output
> > that accompanies these error messages to confirm?
> >
> > I suppose I would try to verify that the array configuration looks sane,
> > but after the hot spare resync and then one or two other drive
> > replacements (was the hot spare ultimately replaced?), it's hard to say
> > whether it might be recoverable.
> >
> > Brian
> >
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> > > Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> > >
> > >
> > > Thanks,
> > > Dave
> >
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> >
> >
> 
> 
> -- 
> *David Raffelt (PhD)*
> Postdoctoral Fellow
> 
> The Florey Institute of Neuroscience and Mental Health
> Melbourne Brain Centre - Austin Campus
> 245 Burgundy Street
> Heidelberg Vic 3084
> Ph: +61 3 9035 7024
> www.florey.edu.au



Thread overview: 12+ messages
2015-01-06  6:12 XFS corrupt after RAID failure and resync David Raffelt
2015-01-06 12:47 ` Brian Foster
     [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
2015-01-06 20:34   ` David Raffelt
2015-01-06 23:16     ` Brian Foster [this message]
     [not found]     ` <8cc9a649ec2240faa4e38fd742437546@000S-EX-HUB-NP2.unimelb.edu.au>
2015-01-06 23:47       ` David Raffelt
2015-01-07  0:27         ` Dave Chinner
2015-01-07 16:16         ` Brian Foster
2015-01-07  2:35     ` Chris Murphy
  -- strict thread matches above, loose matches on Subject: below --
2015-01-08  8:09 Chris Murphy
2015-01-06  5:39 David Raffelt
2015-01-06 12:36 ` Stefan Ring
2015-01-06 12:41 ` Brian Foster
