From: David Jander <david@protonic.nl>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>,
Matteo Croce <technoboy85@gmail.com>,
"Darrick J. Wong" <darrick.wong@oracle.com>,
linux-ext4@vger.kernel.org
Subject: Re: ext4: journal has aborted
Date: Fri, 4 Jul 2014 15:45:59 +0200
Message-ID: <20140704154559.026331ec@archvile>
In-Reply-To: <20140704122022.GC10514@thunk.org>
Hi Ted, Dmitry,
On Fri, 4 Jul 2014 08:20:22 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:
> On Fri, Jul 04, 2014 at 01:28:02PM +0200, David Jander wrote:
> >
> > Here is the output I am getting... AFAICS no problems on the raw device. Is
> > this sufficient testing, Ted?
>
> I'm not sure what theory Dmitry was trying to pursue when he requested
> that you run the fio test. Dmitry?
>
>
> Please note that at this point there may be multiple causes with
> similar symptoms that are showing up. So just because one person
> reports one set of data points, such as someone claiming they've seen
> this without a power drop to the storage device, that therefore all of
> the problems were caused by flaky I/O to the device.
>
> Right now, there are multiple theories floating around --- and it may
> be that more than one of them are true (i.e., there may be multiple
> bugs here). Some of the possibilities, which again, may not be
> mutually exclusive:
>
> 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> FLUSH command not to be sent.
How can I investigate this? Going by the fio tests I ran and the
explanation Dmitry gave, I conclude that CACHE-FLUSH commands not being sent
correctly is the only thing left to rule out on the eMMC driver front,
right?
> 2) Some kind of hardware problem involving flash translation layers
> not having durable transactions of their flash metadata across power
> failures.
That would be like blaming Micron (the eMMC part manufacturer) for faulty
firmware... could be, but how can we test this?
> 3) Some kind of ext4/jbd2 bug, recently introduced, where we are
> modifying some ext4 metadata (either the block allocation bitmap or
> block group summary statistics) outside of a valid transaction handle.
I think I have some more evidence to support this case:
Until now, I had never run fsck at all! I know that this is not a good idea
in a production environment, but I am only testing right now, and in
theory it should not be necessary, right?
What I did this time was to run fsck.ext3 or fsck.ext4 (depending on the FS
format, of course) once every one or two power cycles.
So effectively, what I did amounts to this:
CASE 1: fsck on every power-cycle:
1.- Boot from clean filesystem
2.- Run the following command line:
$ cp -a /usr . & bonnie++ -r 32 -u 100:100 & bonnie++ -r 32 -u 102:102
3.- Hit CTRL+Z (to stop the second bonnie++ process)
4.- Execute "sync"
5.- While "sync" is running, cut off the power supply.
6.- Turn on power and boot from external medium
7.- Run fsck.ext3/4 on eMMC device
8.- Repeat
In this case, there was a minor difference between the fsck output of the two
filesystems:
EXT4 was always something like this:
# fsck.ext4 /dev/mmcblk1p2
e2fsck 1.42.5 (29-Jul-2012)
rootfs: recovering journal
Setting free inodes count to 37692 (was 37695)
Setting free blocks count to 136285 (was 136291)
rootfs: clean, 7140/44832 files, 42915/179200 blocks
While for EXT3 the output did not contain the "Setting free * count..."
messages:
# fsck.ext3 -p /dev/mmcblk1p2
rootfs: recovering journal
rootfs: clean, 4895/44832 files, 36473/179200 blocks
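The size of the EXT4 corrections in CASE 1 can be read straight out of the
e2fsck messages. A minimal sketch in Python, assuming the message format shown
above; the function name and the regular expression are illustrative and not
part of e2fsprogs:

```python
import re

# Assumed format, taken from the e2fsck 1.42.5 output quoted in this message.
FREE_RE = re.compile(r"Setting free (inodes|blocks) count to (\d+) \(was (\d+)\)")

def free_count_deltas(fsck_output):
    """Return {kind: new - old} for each free-count correction in the output."""
    return {kind: int(new) - int(old)
            for kind, new, old in FREE_RE.findall(fsck_output)}

sample = """\
rootfs: recovering journal
Setting free inodes count to 37692 (was 37695)
Setting free blocks count to 136285 (was 136291)
rootfs: clean, 7140/44832 files, 42915/179200 blocks
"""

print(free_count_deltas(sample))  # {'inodes': -3, 'blocks': -6}
```

For the EXT3 output, which has no such lines, the function returns an empty
dictionary.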
CASE 2: fsck on every other power-cycle:
Same as CASE 1 steps 1...5 and then:
6.- Turn on power and boot again from dirty internal eMMC without running fsck.
7.- Repeat steps 2...5 one more time
8.- Perform steps 6...8 from CASE 1.
With this test, the following difference became apparent:
With EXT3: fsck.ext3 did the same as in CASE 1
With EXT4: I get a long list of errors that are being fixed.
It starts like this:
# fsck.ext4 /dev/mmcblk1p2
e2fsck 1.42.5 (29-Jul-2012)
rootfs: recovering journal
rootfs contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4594, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4595, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4596, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4597, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4598, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4599, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4600, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4601, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4602, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4603, i_blocks is 16, should be 8. Fix<y>? yes
...
...
Eventually I pressed CTRL+C and restarted fsck with the option "-p", because
this list was getting a little long.
...
...
# fsck.ext4 -p /dev/mmcblk1p2
rootfs contains a file system with errors, check forced.
rootfs: Inode 5391, i_blocks is 32, should be 16. FIXED.
rootfs: Inode 5392, i_blocks is 16, should be 8. FIXED.
rootfs: Inode 5393, i_blocks is 48, should be 24. FIXED.
rootfs: Inode 5394, i_blocks is 32, should be 16. FIXED.
rootfs: Inode 5395, i_blocks is 16, should be 8. FIXED.
...
...
rootfs: Inode 5854, i_blocks is 240, should be 120. FIXED.
rootfs: Inode 5857, i_blocks is 576, should be 288. FIXED.
rootfs: Inode 5860, i_blocks is 512, should be 256. FIXED.
rootfs: Inode 5863, i_blocks is 656, should be 328. FIXED.
rootfs: Inode 5866, i_blocks is 480, should be 240. FIXED.
rootfs: Inode 5869, i_blocks is 176, should be 88. FIXED.
rootfs: Inode 5872, i_blocks is 336, should be 168. FIXED.
rootfs: 11379/44832 files (0.1% non-contiguous), 70010/179200 blocks
#
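In every i_blocks line quoted above, the value found is exactly twice the
value e2fsck expects. A minimal sketch in Python for checking whether that
holds across a complete fsck log; the function names and the regular
expression are illustrative assumptions, not part of e2fsprogs:

```python
import re

# Matches both the interactive form ("... Fix<y>? yes") and the "-p" form
# ("rootfs: ... FIXED.") of the message, as quoted in this email.
I_BLOCKS_RE = re.compile(r"Inode (\d+), i_blocks is (\d+), should be (\d+)")

def i_blocks_reports(fsck_output):
    """Return {inode: (found, expected)} for every i_blocks complaint."""
    reports = {}
    for match in I_BLOCKS_RE.finditer(fsck_output):
        inode, found, expected = (int(g) for g in match.groups())
        reports[inode] = (found, expected)
    return reports

def all_doubled(fsck_output):
    """True if there is at least one report and every one is found == 2 * expected."""
    pairs = list(i_blocks_reports(fsck_output).values())
    return bool(pairs) and all(found == 2 * expected for found, expected in pairs)

sample = """\
Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
rootfs: Inode 5391, i_blocks is 32, should be 16. FIXED.
rootfs: Inode 5393, i_blocks is 48, should be 24. FIXED.
rootfs: Inode 5872, i_blocks is 336, should be 168. FIXED.
"""

print(all_doubled(sample))
```

Running this over the full log would show whether any inode breaks the
factor-of-two pattern, which the truncated listing above cannot rule out.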
> 4) Some other kind of hard-to-reproduce race or wild pointer which is
> sometimes corrupting fs data structures.
I don't have such a hard time reproducing it... but it does take quite some
time (booting several times, re-installing, testing, etc...)
> If someone has a easy to reproduce failure case, the first step is to
> do a very rough bisection test. Does the easy-to-reproduce failure go
> away if you use 3.14? 3.12? Also, if you can describe in great
> detail your hardware and software configuration, and under what
> circumstances the problem reproduces, and when it doesn't, that would
> also be critical. Whether you are just doing reset or a power cycle
> if an unclean shutdown is involved, might also be important.
Until now, I have always done a power cycle, but I can check whether I am able
to reproduce the problem with just a "shutdown -f" (AFAIK, this does NOT sync
filesystems, right?)
I will try to check 3.14 and 3.12 (if 3.14 still seems buggy). It could take
quite a while until I have results... certainly not before Monday.
> And at this point, because I'm getting very suspicious that there may
> be more than one root cause, we should try to keep the debugging of
> one person's reproduction, such as David's, separate from another's,
> such as Matteo's. It may be that there ultimately have the same root
> cause, and so if one person is able to get an interesting reproduction
> result, it would be great for the other person to try running the same
> experiment on their hardware/software configuration. But what we must
> not do is assume that one person's experiment is automatically
> applicable to other circumstances.
I agree.
Best regards,
--
David Jander
Protonic Holland.