Re: ext4: journal has aborted

Linux EXT4 FS development

From: David Jander <david@protonic.nl>
To: Jaehoon Chung <jh80.chung@samsung.com>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Matteo Croce <technoboy85@gmail.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	linux-ext4@vger.kernel.org
Subject: Re: ext4: journal has aborted
Date: Fri, 4 Jul 2014 13:32:48 +0200
Message-ID: <20140704133248.0ef344e4@archvile>
In-Reply-To: <53B68A2D.9070902@samsung.com>

On Fri, 04 Jul 2014 20:04:13 +0900
Jaehoon Chung <jh80.chung@samsung.com> wrote:

> Hi, David.
> 
> On 07/04/2014 06:40 PM, David Jander wrote:
> > 
> > Hi Dmitry,
> > 
> > On Thu, 03 Jul 2014 18:58:48 +0400
> > Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> > 
> >> On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <david@protonic.nl> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> On Thu, 3 Jul 2014 09:43:38 -0400
> >>> "Theodore Ts'o" <tytso@mit.edu> wrote:
> >>>
> >>>> On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> >>>>> 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <darrick.wong@oracle.com>:
> >>>>>
> >>>>> I have a Samsung SSD 840 PRO
> >>>>
> >>>> Matteo,
> >>>>
> >>>> For you, you said you were seeing these problems on 3.15.  Was it
> >>>> *not* happening for you when you used an older kernel?  If so, that
> >>>> would help us try to provide the basis of trying to do a bisection
> >>>> search.
> >>>
> >>> I also tested with 3.15, and there too I see the same problem.
> >>>
> >>>> Using the kvm-xfstests infrastructure, I've been trying to reproduce
> >>>> the problem as follows:
> >>>>
> >>>> ./kvm-xfstests  --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k 
> >>>>
> >>>> xfstests generic/075 runs fsx, which does a fair amount of block
> >>>> allocation and deallocation, and then after the test finishes, it first
> >>>> replays the journal (e2fsck -p) and then forces a fsck run on the
> >>>> test disk that I use for the run.
> >>>>
> >>>> After I launch this, in a separate window, I do this:
> >>>>
> >>>> 	sleep 60  ; killall qemu-system-x86_64 
> >>>>
> >>>> This kills the qemu process midway through the fsx test, and then I
> >>>> see if I can find a problem.  I haven't had a chance to automate this
> >>>> yet, and it is my intention to try to set this up where I can run this
> >>>> on a ramdisk or a SSD, so I can more closely approximate what people
> >>>> are reporting on flash-based media.
> >>>>
> >>>> So far, I haven't been able to reproduce the problem.  If, after trying
> >>>> a large number of times, it can't be reproduced (especially if it
> >>>> can't be reproduced on an SSD), then it would lead us to believe that
> >>>> one of two things is the cause.  (a) The CACHE FLUSH command isn't
> >>>> properly getting sent to the device in some cases, or (b) there really
> >>>> is a hardware problem with the flash device in question.
> >>>
> >>> Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> >>> driver? Can you explain why I don't see any problems with EXT3?
> >>>
> >>> I can't discard the possibility of (b) because I cannot prove it, but I will
> >>> try to see if I can do the same test on a SSD which I happen to have on that
> >>> platform. That should be able to rule out problems with the eMMC chip and
> >>> -driver, right?
> >>>
> >>> Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
> >>>
> >>> I left the system running (it started from a dirty EXT4 partition), and I am
> >>> seeing the following error pop up after a few minutes. The system is not doing
> >>> much (some syslog activity maybe, but not much more):
> >>>
> >>> [  303.072983] EXT4-fs (mmcblk1p2): error count: 4
> >>> [  303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> >>> [  303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
> >>>
> >>> What does that mean?
> >> This means that ext4 found previous errors recorded in its internal error
> >> log, which is normal because your fs was corrupted before. It is reasonable
> >> to recreate the filesystem from scratch.
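
The two numbers in those messages look like times in seconds since the Unix epoch. That is an assumption on my part, not something stated in the log itself. Under that assumption, a minimal Python sketch to make them readable (the helper name `ext4_error_time` is made up for illustration):

```python
from datetime import datetime, timezone


def ext4_error_time(epoch_seconds: int) -> str:
    """Format a timestamp from an 'EXT4-fs ... error at N' line as UTC.

    Assumes N is seconds since 1970-01-01 00:00:00 UTC.
    """
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


# The two values quoted in the kernel log above.
for label, seconds in (("initial error", 1404216838),
                       ("last error", 1404388969)):
    print(f"{label}: {ext4_error_time(seconds)}")
# initial error: 2014-07-01 12:13:58 UTC
# last error: 2014-07-03 12:02:49 UTC
```

If the assumption holds, both recorded errors predate the current boot, which matches the explanation that they come from earlier corruption.
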
> >>
> >> In order to understand whether this is a regression in the eMMC driver, it
> >> is reasonable to run an integrity test on the device itself. You can run
> >> any integrity test you like. For example, just run a fio job,
> >>  "fio disk-verify2.fio" (see attachment). IMPORTANT: this script will
> >>  destroy data on the test partition. If it fails with errors like
> >>  "verify: bad magic header XXX", then it is definitely a driver issue.
> > 
> > I have been trying to run fio on my board with your configuration file, but I
> > am having problems, and since I am not familiar with fio at all, I can't
> > really figure out what's wrong. My eMMC device is only 916MiB in size, so I
> > edited the last part to be:
> 
> Which eMMC host controller did you use?

compatible = "fsl,imx6q-usdhc";

Basically a Freescale i.MX6 which uses the drivers/mmc/host/sdhci-esdhc-imx.c
driver which (as the name says) is SDHCI compliant.

> > offset_increment=100M
> > size=100M
> > 
> > Is that ok?
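
As a quick sanity check of that layout, here is a small Python sketch. It assumes fio places job number i at offset_increment * i (counting from 0), that "100M" means 100 MiB, and that there are 4 jobs, as the "Starting 4 processes" line in the output suggests. The helper names are made up for illustration and are not part of fio.

```python
MIB = 1024 * 1024


def job_ranges(numjobs, size, offset_increment, offset=0):
    """Return the (start, end) byte range each job would cover.

    Assumes job i starts at offset + offset_increment * i.
    """
    return [(offset + i * offset_increment,
             offset + i * offset_increment + size)
            for i in range(numjobs)]


def check_layout(device_size, ranges):
    """Return (fits_on_device, any_overlap) for a list of byte ranges."""
    fits = all(end <= device_size for _, end in ranges)
    ordered = sorted(ranges)
    overlap = any(prev_end > next_start
                  for (_, prev_end), (next_start, _)
                  in zip(ordered, ordered[1:]))
    return fits, overlap


ranges = job_ranges(numjobs=4, size=100 * MIB, offset_increment=100 * MIB)
print(check_layout(916 * MIB, ranges))
# (True, False)
```

Under these assumptions the four 100 MiB regions end at 400 MiB, fit on the 916 MiB device, and do not overlap, so the size and offset values alone do not obviously explain the errors.
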
> > 
> > I still get error messages complaining about blocksize though. Here is the
> > output I get (can't really make sense of it):
> > 
> > # ./fio ../disk-verify2.fio 
> > Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
> > /dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > ...
> > fio-2.1.10-49-gf302
> > Starting 4 processes
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
> > fio: pid=7612, got signal=6
> > 
> > /dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul  4 09:31:15 2014
> >     lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
> >     lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
> >   cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
> >   IO depths    : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=32
> > 
> > Run status group 0 (all jobs):
> > 
> > Disk stats (read/write):
> >   mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
> > fio: file hash not empty on exit
> > 
> > 
> > This assertion bugs me. Is it due to the previous errors ("blocksize too large
> > for data set") or is it because my eMMC drive/kernel is seriously screwed?
> > 
> > Help please!
> > 
> >> If my theory is true and this is a storage driver issue, then JBD complains
> >> simply because it cares about its data (it does integrity checks).
> >> Can you also create btrfs on that partition, perform some I/O
> >> activity, and run fsck after that? You will likely see similar corruption
> > 
> > Best regards,
> > 
> 

Best regards,

-- 
David Jander
Protonic Holland.
