linux-ext4.vger.kernel.org archive mirror
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Evan King <f11n1@unb.ca>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Strange disk failure...could ext4 be the culprit?
Date: Mon, 13 Jul 2009 11:05:20 +0530
Message-ID: <20090713053520.GA5088@skywalker>
In-Reply-To: <loom.20090707T173457-377@post.gmane.org>

On Tue, Jul 07, 2009 at 06:16:23PM +0000, Evan King wrote:
> Hello all,
> 
> I'm administering a small computing cluster on new off-the-shelf hardware.  The
> configuration is a master-slaves setup with the master serving NFS for the data
> synchronization and performing the data re-assembly process (as well as doing
> some slave work).
> 
> The workload produces fairly steady, but not particularly heavy, I/O.
>  While I originally pushed for specialized storage hardware or configurations,
> testing and benchmarking showed that the workload appeared quite manageable for
> a single disk.  I expected it might experience a short lifespan, but on the
> order of several months at least.  To spare the disk as much thrashing as
> possible, I opted for ext4.
> 
> In the first week of active deployment (and while I was on vacation), the master
> experienced a very strange form of catastrophic failure.  A job had failed after
> only a couple hours, and serious errors blocked further work.  Several core GNU
> tools in /bin were corrupted, such as: mv, rm, uname, hostname, pwd.  A couple
> of 0-byte files existed in / with scrambled filenames, and plenty of Unicode
> characters splattered across the screen during reboot.  The reboot itself
> reached a login prompt, but wouldn't accept any input.  But this is where things
> get strange.
> 
> I used a liveCD to perform disk checks, and there were no filesystem errors of
> *any* kind.  The entire filesystem was and is in pristine condition.  While I'm
> aware of discussion and issues surrounding some of the design decisions made for
> ext4 (such as delayed allocation), it doesn't seem possible that those
> issues could be related to this kind of failure (data written without permission
> or any attempt to do so).  The corrupted binaries were in fact corrupted on
> disk, not just in memory (also unreadable by readelf), and larger than the
> originals.  The software I was using runs from a user-level account and has an
> apache-served web interface with apache dropping permissions to that same user.
>  Nothing but the kernel itself had permission to write to the files that were
> corrupted; however, the computing software does execute (I think all of) the
> commands that were corrupted.
> 
> I have saved copies of several of the corrupted files, but neglected to save any
> system logs before restoring a backup.  There are still some strange messages
> appearing during startup, but they fly by too quickly to see, and nothing seems
> amiss in the logs except that /var/log/messages seems extremely verbose during
> startup and has many references to initializing ext4 (but nothing sounds like an
> error).  I'm about to tell my users to start using it again and will be
> expecting and watching for a repeat performance.  The disk itself appears to be
> fine.
> 
> _____
> 
> So my questions are these:
> 
>  - How likely is it that some arcane bug in ext4 is responsible for the failure?

Can you check whether your kernel has this patch?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4

-aneesh
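
One possible way to run that check, assuming a git checkout of the
kernel source is available, is sketched below. The helper name
has_commit and the example path and tag are hypothetical and not part
of the original message. It asks git whether the commit is an ancestor
of a given ref. A backported copy of the fix in a stable or
distribution kernel would carry a different hash, so a negative result
only says the mainline commit itself is not in that history.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if COMMIT is an ancestor of REF in the
# git tree at REPO, non-zero otherwise (including when COMMIT is
# unknown to that repository).
has_commit() {
    repo=$1
    commit=$2
    ref=${3:-HEAD}
    git -C "$repo" merge-base --is-ancestor "$commit" "$ref"
}

# Example usage (the path and tag are assumptions; substitute the tree
# and version your running kernel was built from):
#   has_commit /usr/src/linux 2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4 v2.6.30 \
#       && echo "patch present" || echo "patch not found in this history"
```

In the same checkout, "git tag --contains <commit>" lists the tagged
releases that include the commit.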
