From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Evan King <f11n1@unb.ca>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Strange disk failure...could ext4 be the culprit?
Date: Mon, 13 Jul 2009 11:05:20 +0530
Message-ID: <20090713053520.GA5088@skywalker>
In-Reply-To: <loom.20090707T173457-377@post.gmane.org>
On Tue, Jul 07, 2009 at 06:16:23PM +0000, Evan King wrote:
> Hello all,
>
> I'm administering a small computing cluster on new off-the-shelf hardware. The
> configuration is a master-slaves setup with the master serving nfs for the data
> synchronization and performing the data re-assembly process (as well as doing
> some slave work).
>
> The workload produces a fairly steady I/O load, though not a particularly heavy one.
> While I originally pushed for specialized storage hardware or configurations,
> testing and benchmarking showed that the workload appeared quite manageable for
> a single disk. I expected it might experience a short lifespan, but on the
> order of several months at least. To spare the disk as much thrashing as
> possible, I opted for ext4.
>
> In the first week of active deployment (and while I was on vacation), the master
> experienced a very strange form of catastrophic failure. A job had failed after
> only a couple hours, and serious errors blocked further work. Several core GNU
> tools in /bin were corrupted, such as: mv, rm, uname, hostname, pwd. A couple
> 0-byte files existed in / with scrambled filenames, and plenty of Unicode
> characters splattered across the screen during reboot. The reboot itself
> reached a login prompt, but wouldn't accept any input. But this is where things
> get strange.
>
> I used a liveCD to perform disk checks, and there were no filesystem errors of
> *any* kind. The entire filesystem was and is in pristine condition. While I'm
> aware of discussion and issues surrounding some of the design decisions made for
> ext4 (such as delayed allocation), it doesn't seem possible that those
> issues could be related to this kind of failure (data written without permission
> or any attempt to do so). The corrupted binaries were in fact corrupted on
> disk, not just in memory (also unreadable by readelf), and larger than the
> originals. The software I was using runs from a user-level account and has an
> apache-served web interface with apache dropping permissions to that same user.
> Nothing but the kernel itself had permission to write to the files that were
> corrupted; however, the computing software does execute (I think all of) the
> commands that were corrupted.
>
> I have saved copies of several of the corrupted files, but neglected to save any
> system logs before restoring a backup. There are still some strange messages
> appearing during startup, but they fly by too quickly to see, and nothing seems
> amiss in the logs except that /var/log/messages seems extremely verbose with
> startup and has many references to initializing ext4 (but nothing sounds like an
> error). I'm about to tell my users to start using it again and will be
> expecting and watching for a repeat performance. The disk itself appears to be
> fine.
>
> _____
>
> So my questions are these:
>
> - How likely is it that some arcane bug in ext4 is responsible for the failure?
Can you check whether your kernel has this patch?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4
-aneesh
Thread overview: 5+ messages
2009-07-07 18:16 Strange disk failure...could ext4 be the culprit? Evan King
2009-07-07 22:28 ` Andreas Dilger
2009-07-13 5:35 ` Aneesh Kumar K.V [this message]
2009-07-13 12:12 ` Theodore Tso
2009-07-13 14:38 ` Evan King