From: Dave Chinner <david@fromorbit.com>
To: Mike Dacre <mike.dacre@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Sudden File System Corruption
Date: Thu, 5 Dec 2013 14:40:34 +1100
Message-ID: <20131205034034.GI8803@dastard>
In-Reply-To: <CAPd9ww_qT9J_Rt04g7+OApoBeggNOyWNwD+57DiDTuUvz-O-0g@mail.gmail.com>

On Wed, Dec 04, 2013 at 06:55:05PM -0800, Mike Dacre wrote:
> Hi Folks,
> 
> Apologies if this is the wrong place to post or if this has been answered
> already.
> 
> I have a 16 2TB drive RAID6 array powered by an LSI 9240-4i.  It has an XFS
> filesystem and has been online for over a year.  It is accessed by 23
> different machines connected via Infiniband over NFS v3.  I haven't had any
> major problems yet, one drive failed but it was easily replaced.
> 
> However, today the drive suddenly stopped responding and started returning
> IO errors when any requests were made.  This happened while it was being
> accessed by 5 different users, one was doing a very large rm operation (rm
> *sh on thousands of files in a directory).  Also, about 30 minutes before
> we had connected the globus connect endpoint to allow easy file transfers
> to SDSC.

So, you had a drive die and at roughly the same time XFS started
reporting corruption problems and shut down? Chances are that the
drive returned garbage to XFS before it died completely and that's what
XFS detected and shut down on. If you are unlucky in this situation,
the corruption can get propagated into the log by changes that are
adjacent to the corrupted region, and then you have problems with log
recovery failing because the corruption gets replayed....

> I have attached the complete log from the time it died until now.
> 
> In the end, I successfully repaired the filesystem with `xfs_repair -L
> /dev/sda1`.  However, I am nervous that some files may have been corrupted.
> 
> Do any of you have any idea what could have caused this problem?

When corruption appears at roughly the same time a drive dies, it's
almost always caused by the drive that failed. RAID doesn't prevent
disks from returning crap to the OS because nobody configures the
arrays to do read-verify cycles that would catch such a condition.
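
To illustrate the read-verify point, here is a minimal sketch in Python.
It assumes a single-parity (XOR) stripe rather than the P+Q scheme a
real RAID6 array uses, and every name in it is hypothetical. It shows
only the general idea: a plain read hands back whatever the disk
returns, while a read that re-checks parity notices the mismatch.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus their XOR parity (simplified, not RAID6)."""
    return list(data_blocks), xor_blocks(data_blocks)


def plain_read(data_blocks, index):
    """Read one block without checking parity: garbage passes through."""
    return data_blocks[index]


def verified_read(data_blocks, parity, index):
    """Read one block, but recompute parity over the stripe first."""
    if xor_blocks(data_blocks) != parity:
        raise IOError("stripe parity mismatch")
    return data_blocks[index]


data, parity = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
data[1] = b"B?BB"  # assumption: a failing disk silently returns garbage

garbage = plain_read(data, 1)  # handed to the filesystem unnoticed

try:
    verified_read(data, parity, 1)
    detected = False
except IOError:
    detected = True  # the parity check catches the bad block
```

The cost is that every verified read touches the whole stripe, which is
one plausible reason such checks are rarely enabled.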

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
