From: Eric Sandeen <sandeen@redhat.com>
To: Nix <nix@esperi.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>,
linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
gregkh@linuxfoundation.org
Subject: Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
Date: Sat, 27 Oct 2012 16:19:33 -0500
Message-ID: <508C4FE5.1030102@redhat.com>
In-Reply-To: <87fw4zzra3.fsf@spindle.srvr.nix>

On 10/27/12 1:47 PM, Nix wrote:
> On 27 Oct 2012, Theodore Ts'o said:
>
>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>> against use of journal_async_commit, tested or not, and I'd not have
>>> turned it on if I'd noticed that.
>>>
>>> (So, the combinations I'll be trying for effect on this bug are:
>>>
>>> journal_async_commit (as now)
>>> journal_checksum
>>> none
>>
>> Can you also check and see whether the presence or absence of
>> "nobarrier" makes a difference?
>
> Done. (Also checked the effect of your patches posted earlier this week:
> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
> I was carrying out, umount -l'ing /var as the very last thing I did
> before /sbin/reboot -f.)
>
> nobarrier makes a difference that I, at least, did not expect:
>
> [no options]                     No corruption
>
> nobarrier                        No corruption
>
> journal_checksum                 Corruption
>                                  Corrupted transaction, journal aborted
>
> nobarrier,journal_checksum       Corruption
>                                  Corrupted transaction, journal aborted
>
> journal_async_commit             Corruption
>                                  Corrupted transaction, journal aborted
>
> nobarrier,journal_async_commit   Corruption
>                                  No corrupted transaction or aborted journal
That's what we needed. Woulda been great a few days ago ;)

In my testing journal_checksum is broken, and my bisection seems to
implicate

    commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
    Author: Theodore Ts'o <tytso@mit.edu>
    Date:   Mon Feb 6 20:12:03 2012 -0500

        ext4: fold ext4_claim_inode into ext4_new_inode

as the culprit. I haven't had time to look into why, yet.

-Eric
> I didn't expect the last case at all, and it adequately explains why you
> are mostly seeing corrupted journal messages in your tests but I was
> not. It also explains why when I saw this for the first time I was able
> to mount the resulting corrupted filesystem read-write and corrupt it
> further before I noticed that anything was wrong.
>
> It is also clear that journal_checksum and all that relies on it is
> worse than useless right now, as Eric reported while I was testing this.
> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
>
> It's a shame journal_async_commit depends on a broken feature: it might
> be notionally unsafe but on some of my systems (without nobarrier or
> flashy caching controllers) it was associated with a noticeable speedup
> of metadata-heavy workloads -- though that was way back in 2009...
> however, "safety first" definitely applies in this case.
>