e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed

linux-ext4.vger.kernel.org archive mirror

* e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: kaefert @ 2012-11-07  8:23 UTC (permalink / raw)
  To: linux-ext4

Hi there!

I've got a problem with an ext4 filesystem on an external USB disk of
mine. I've documented the problem (and my attempts at solving it)
here:
http://serverfault.com/questions/446074/e2fsck-extremly-slow-although-enough-memory-exists

I started gparted, and with it e2fsck, at around 2012-11-04_2200, and
it's 2012-11-07_0923 right now (CET).
It has used nearly 57 hours of CPU time since then, and I would like
to find out if and when it will finish.
Can somebody tell me if and how I could get this information?

Thanks for developing this great filesystem, and thanks for helping me
out in my hour[s ;)] of need.

Kind Regards,

Thomas K.


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: Theodore Ts'o @ 2012-11-09  0:01 UTC (permalink / raw)
  To: kaefert@gmail.com; +Cc: linux-ext4

On Wed, Nov 07, 2012 at 09:23:22AM +0100, kaefert@gmail.com wrote:
> Hi there!
> 
> I've got a problem with an ext4 filesystem on an external USB disk of
> mine. I've documented the problem (and my attempts at solving it)
> here:
> http://serverfault.com/questions/446074/e2fsck-extremly-slow-although-enough-memory-exists
> 
> I started gparted, and with it e2fsck, at around 2012-11-04_2200, and
> it's 2012-11-07_0923 right now (CET).
> It has used nearly 57 hours of CPU time since then, and I would like
> to find out if and when it will finish.
> Can somebody tell me if and how I could get this information?

Can you please run e2fsck from the command line and capture the
output (e.g., using "script")?  I really need the e2fsck output to
understand what is going on.  The strace output is really not helpful.

In general, you may be better off simply not trusting gparted to run
e2fsck and resize2fs for you.  If there are no problems I'm sure it's
fine, but it's really hard to debug things if you insist on letting
gparted swallow all of the useful debugging output....

		       	   	  	    - Ted
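
Capturing the output as requested above is normally done with script(1) or plain shell redirection. Purely as an illustration, the following Python sketch runs a command, echoes its output to the terminal, and saves the same lines to a log file. The function name `run_and_log` and the log file name are hypothetical, not part of e2fsprogs.

```python
import subprocess
import sys


def run_and_log(argv, log_path):
    """Run argv, echo its combined stdout/stderr, and save a copy to log_path.

    Returns the command's exit status.
    """
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so no message is lost
            text=True,
            errors="replace",          # tolerate odd bytes in file names
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on the console
            log.write(line)         # and preserved for later analysis
        return proc.wait()
```

A real invocation might look like `run_and_log(["e2fsck", "-f", "-n", "/dev/sdX1"], "e2fsck.log")`, where the device name is a placeholder and `-n` keeps the check read-only. It would need to be run with sufficient privileges.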


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: kaefert @ 2012-11-09  6:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

2012/11/9 Theodore Ts'o <tytso@mit.edu>
>
> Can you please run e2fsck from the command line and capture the
> output (e.g., using "script")?  I really need the e2fsck output to
> understand what is going on.  The strace output is really not helpful.
>
> In general, you may be better off simply not trusting gparted to run
> e2fsck and resize2fs for you.  If there are no problems I'm sure it's
> fine, but it's really hard to debug things if you insist on letting
> gparted swallow all of the useful debugging output....
>
>                                             - Ted

Hi there Ted!

Thanks for the answer! Of course I understand that when you want to
debug something you really have to run it from the console. It's just
that when I started this, I did not think anything would go wrong; you
never think it will hit you ;)

After e2fsck failed the first time (I don't know why, since gparted
crashed when I tried to save the details), I started running e2fsck
manually, and since the -p option made it abort the run, I started a
third run in interactive mode. I've posted the console output of the
second and third runs as an update to my question at serverfault.com
(see
http://serverfault.com/questions/446074/e2fsck-extremely-slow-although-enough-memory-exists
- start reading at "UPDATE4").

The third run is still going (for about 20 hours now) and is showing
the same pattern as the first run, which failed after 78 hours.


Thanks for looking at this,

Thomas K.


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: kaefert @ 2012-11-11 18:14 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

Hi there!

It took several days, but after I ran it manually it did finish.
However, it doesn't seem to leave the filesystem in a truly clean
state, even though it doesn't print an error (at least not at the
end): I've run e2fsck again on the same partition and it found errors
again. Here's the output of the last run that completed:


kaefert@blechmobil:~$ sudo e2fsck -f -y -v /dev/sdb1
e2fsck 1.42.4 (12-Jun-2012)
Pass 1: Checking inodes, blocks, and sizes

Multiply-claimed blocks found... starting scan for multiply-claimed blocks.
Pass 1B: Scanning for multiply-claimed/bad blocks
Multiply-claimed block(s) in inode 86114492: 4538368 4405248
<< ... removed millions of entries of the same pattern here ... >>
11648685 11648686
Pass 1C: Scanning directories for inodes with multiply-claimed blocks.
Pass 1D: Reconciling multiply-claimed blocks
(There are 6 inodes containing multiply-claimed/bad blocks.)

File /Recordings/.../MVI_8559.MOV (inode #86114492,
mod time Sat Mar 24 20:23:54 2012)
  has 413455 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../MVI_8563.MOV (inode #86114496, mod time Sat Mar 24
20:23:54 2012)
multiply claimed block map? yes

clone_file_block: internal error: can't find dup_blk for 4538368

clone_file_block: internal error: can't find dup_blk for 4538368

File /Recordings/.../MVI_8563.MOV (inode #86114496,
mod time Sat Mar 24 20:23:54 2012)
  has 413455 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../MVI_8559.MOV (inode #86114492, mod time Sat Mar 24
20:23:54 2012)
Multiply-claimed blocks already reassigned or cloned.

File /Recordings/.../MVI_8571.MOV (inode #86114504,
mod time Sat Mar 24 22:09:56 2012)
  has 244958 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../MVI_8575.MOV (inode #86114508, mod time Sat Mar 24
22:09:56 2012)
multiply claimed block map? yes

clone_file_block: internal error: can't find dup_blk for 7999488

clone_file_block: internal error: can't find dup_blk for 7999488

File /Recordings/.../MVI_8575.MOV (inode #86114508,
mod time Sat Mar 24 22:09:56 2012)
  has 244958 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../MVI_8571.MOV (inode #86114504, mod time Sat Mar 24
22:09:56 2012)
Multiply-claimed blocks already reassigned or cloned.

File /Recordings/.../MVI_3598.MOV (inode #86376840,
mod time Thu Aug 23 21:14:34 2012)
  has 45835 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../SomeFile.psd (inode #86376844, mod time Thu Aug 23
21:14:34 2012)
multiply claimed block map? yes

clone_file_block: internal error: can't find dup_blk for 345554931

clone_file_block: internal error: can't find dup_blk for 345554931

File /Recordings/.../SomeFile.psd (inode #86376844,
mod time Thu Aug 23 21:14:34 2012)
  has 45835 multiply-claimed block(s), shared with 1 file(s):
    /Recordings/.../MVI_3598.MOV (inode #86376840, mod time Thu Aug 23
21:14:34 2012)
Multiply-claimed blocks already reassigned or cloned.

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****

121950 inodes used (0.07%)
    1244 non-contiguous files (1.0%)
      30 non-contiguous directories (0.0%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 121816/126
184589222 blocks used (25.20%)
0 bad blocks
       4 large files

  119827 regular files
    2114 directories
       0 character device files
       0 block device files
       0 fifos
      11 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
  121952 files



The part that takes extremely long (days) is Pass 1D, which e2fsck
prints in German as "Durchgang 1D: Gleiche doppelte Blocks ab"
(in English: "Pass 1D: Reconciling multiply-claimed blocks").

* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: Theodore Ts'o @ 2012-11-12 16:16 UTC (permalink / raw)
  To: kaefert@gmail.com; +Cc: linux-ext4

On Sun, Nov 11, 2012 at 07:14:40PM +0100, kaefert@gmail.com wrote:
> 
> It took several days, but after I ran it manually it did finish.
> However, it doesn't seem to leave the filesystem in a truly clean
> state, even though it doesn't print an error (at least not at the
> end): I've run e2fsck again on the same partition and it found errors
> again. Here's the output of the last run that completed:

You said this is an external USB drive, right?  How big is it?  If
it's affordable, something I would suggest doing is to make an image
copy (e.g., using dd or dd_rescue) to another external USB drive, just
to rule out hardware issues.

The Pass 1B/1C/1D errors, particularly if you are seeing the exact
same pattern after a full e2fsck -fy run, make me suspect that inode
table blocks are getting written to the wrong location on disk, and
wonder whether this is caused by the storage device failing in some
strange way.

Also, can you save the output of e2fsck to a file?  Direct the output
to a log file, so I can look at it.  There may be patterns in the
"millions of entries of the same pattern" which you've elided that
could be a hint.  Also, can you disable the German translation to make
it easier for me to investigate?

Thanks,

						- Ted


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: kaefert @ 2012-11-12 16:29 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

2012/11/12 Theodore Ts'o <tytso@mit.edu>:
>
> You said this is an external USB drive, right?  How big is it?

Yep, it's an external USB 3.0 3 TB disk.

>  If it's affordable, something I would suggest doing is to make an image
> copy (e.g., using dd or dd_rescue) to another external USB drive, just
> to rule out hardware issues.

I'm sorry, I don't have a spare one of those lying around. But
wouldn't I see hardware errors in the dmesg output?

>
> The Pass 1B/1C/1D errors, particularly if you are seeing the exact
> same pattern after a full e2fsck -fy run, make me suspect that inode
> table blocks are getting written to the wrong location on disk, and
> wonder whether this is caused by the storage device failing in some
> strange way.
>
> Also, can you save the output of e2fsck to a file?  Direct the output
> to a log file, so I can look at it.  There may be patterns in the
> "millions of entries of the same pattern" which you've elided that
> could be a hint.

Yep, no problem, I'll send you the complete output in an hour or two,
when I'm home.
I only omitted those entries because the output is really huge, and it
looked to me as if it is just counting from the smallest number to the
largest, skipping some numbers in between.

> Also, can you disable the German translation to make
> it easier for me to investigate?

Could you tell me how to disable the German translation? Sorry...

>
> Thanks,
>
>                                                 - Ted


Thanks for looking into my issue!

Regards, Thomas


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: Andreas Dilger @ 2012-11-13 21:09 UTC (permalink / raw)
  To: kaefert@gmail.com; +Cc: Theodore Ts'o, linux-ext4

On 2012-11-12, at 9:29 AM, kaefert@gmail.com wrote:
> 2012/11/12 Theodore Ts'o <tytso@mit.edu>:
>> Also, can you disable the German translation to make
>> it easier for me to investigate?
> 
> Could you tell me how to disable the German translation? Sorry...

# LANG=C e2fsck -f ...
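
Setting `LANG=C` works as long as no other locale variable overrides it. POSIX gives `LC_ALL` precedence over `LC_MESSAGES`, which in turn takes precedence over `LANG`, so `LC_ALL=C` is the more robust way to force untranslated messages. The Python sketch below models only that precedence; the helper names are made up for illustration, and GNU gettext's additional `LANGUAGE` variable is deliberately left out.

```python
def effective_message_locale(env):
    """Return the locale that message translation would use for this env.

    Models the POSIX precedence LC_ALL > LC_MESSAGES > LANG. Unset or
    empty variables are skipped, and "C" is the fallback.
    """
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def force_c_locale(env):
    """Return a copy of env in which messages stay untranslated."""
    forced = dict(env)
    forced["LC_ALL"] = "C"  # overrides both LC_MESSAGES and LANG
    return forced


if __name__ == "__main__":
    german = {"LANG": "de_DE.UTF-8", "LC_MESSAGES": "de_DE.UTF-8"}
    # Changing only LANG does not help while LC_MESSAGES is still German.
    print(effective_message_locale({**german, "LANG": "C"}))
    print(effective_message_locale(force_c_locale(german)))
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when launching e2fsck from a script.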

Cheers, Andreas

* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: Theodore Ts'o @ 2012-11-13 21:24 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: kaefert@gmail.com, linux-ext4

To follow up on the list, since Thomas and I have had a number of
e-mail exchanges that were off-list: he has sent me a compressed,
raw e2image dump of his file system, which I have investigated.

The proximate cause of the fs corruption seems to be a few inode table
blocks written offset by 1024 bytes: there were 3 pairs of inodes
of the form (N, N+4) which had the exact same contents in the inode
structure (same generation number, same mtime/ctime/atimes, same
extents).  This pattern of corruption is quite odd given that the file
system has a 4k block size.  The best bet is that the corruption
happened at the USB device layer, since the mis-written inodes were
offset by two 512-byte sectors, as opposed to by an incorrect block
number.  Thomas tells me this particular device has had a flaky USB
controller and this is not the first such failure.
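
The arithmetic behind this diagnosis can be checked in a few lines. This is only a sketch: the 256-byte inode size is the ext4 default mentioned later in the thread rather than a value stated in this message, and the helper names are hypothetical.

```python
SECTOR_SIZE = 512   # bytes, the sector size given in the message above
BLOCK_SIZE = 4096   # bytes, the file system's 4k block size
INODE_SIZE = 256    # bytes, assumed ext4 default (not stated in this message)


def inodes_shifted(offset_bytes, inode_size=INODE_SIZE):
    """How many inode slots a misplaced write moves each inode by."""
    return offset_bytes // inode_size


def is_block_aligned(offset_bytes, block_size=BLOCK_SIZE):
    """True if the offset is a whole number of file system blocks."""
    return offset_bytes % block_size == 0


offset = 2 * SECTOR_SIZE  # two sectors

# The shift in inode slots; it matches the duplicated pairs (N, N+4).
print("inode shift:", inodes_shifted(offset))

# A wrong block number would produce an offset that is block aligned.
print("block aligned:", is_block_aligned(offset))
```

An offset that is sector aligned but not block aligned is what points below the file system layer, toward the device or its controller.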

There also seems to be a bug in e2fsck which caused it not to be able
to repair the corrupted file system.  I have not had a chance to track
down the bug yet.  It may have been caused by how we handle extent
tree blocks getting cached while trying to clone the data block.
That is something we should fix, but ultimately, the use of metadata
checksums is going to be the best way to deal with cases of an inode
table block getting written to the wrong place on disk, since we will
then know which inodes not to trust, and can just have e2fsck zap them.

Speaking of zapping, I've given Thomas instructions on how to clri
three of the duplicated inodes using debugfs, and that allowed e2fsck
to be able to repair his file system.  He will have suffered some data
loss due to the corrupted inode table, but at least this way he'll be
able to gain access to most of the files on the disk.

     	     	       	       	   - Ted


* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: kaefert @ 2012-11-15 11:51 UTC (permalink / raw)
  To: linux-ext4

Hi there!

I've found that in many folders on that filesystem, every 8th file
now contains, instead of its own contents, a copy of the file four
places before it, with some additional zeros after it.

Maybe it's clearer if I give an example:
A folder with 25 files:
file 5 is a copy of 1
file 13 is a copy of 9
file 21 is a copy of 17

The original contents of files 5, 13 and 21 seem to have been lost;
maybe they are in the lost+found folder. The problem doesn't always
start with the first file in a folder, and doesn't always continue to
the end of the folder.
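
One way to confirm this pattern for a given pair of files is to test whether one file is byte-for-byte the other plus trailing zero bytes. The Python sketch below does that. The function names are illustrative only, and it reads whole files into memory, which is reasonable for small files but not for large video recordings.

```python
def is_zero_padded_copy(original, suspect):
    """True if suspect equals original followed by zero or more NUL bytes."""
    if len(suspect) < len(original):
        return False
    head = suspect[: len(original)]
    tail = suspect[len(original):]
    return head == original and tail.count(0) == len(tail)


def files_look_duplicated(original_path, suspect_path):
    """Apply the check above to two files on disk (reads both fully)."""
    with open(original_path, "rb") as a, open(suspect_path, "rb") as b:
        return is_zero_padded_copy(a.read(), b.read())
```

For the example above, one would compare file 1 against file 5, file 9 against file 13, and so on.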



* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed
From: Theodore Ts'o @ 2012-11-16 18:14 UTC (permalink / raw)
  To: kaefert@gmail.com; +Cc: linux-ext4

On Thu, Nov 15, 2012 at 12:51:04PM +0100, kaefert@gmail.com wrote:
> 
> I've found that in many folders on that filesystem, every 8th file
> now contains, instead of its own contents, a copy of the file four
> places before it, with some additional zeros after it.
> 
> Maybe it's clearer if I give an example:
> A folder with 25 files:
> file 5 is a copy of 1
> file 13 is a copy of 9
> file 21 is a copy of 17
> 
> The original contents of files 5, 13 and 21 seem to have been lost;
> maybe they are in the lost+found folder. The problem doesn't always
> start with the first file in a folder, and doesn't always continue to
> the end of the folder.

Alas, that's symptomatic of an inode table block getting written to
the wrong location on disk.  Each inode structure in the inode table
is 256 bytes (by default for ext4).  So if you write a block which is
supposed to contain the inode information for inodes #100, #101, #102,
#103, #104, ... #115 off by 1 kilobyte, the inode structure for #100
will get written on top of the location where the inode information
for inode #104 should be, etc.
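
The effect described above can be reproduced with a toy model. The following Python sketch is a simulation under stated assumptions, not real ext4 code: each 256-byte inode slot is filled with a single tag byte standing in for the inode's contents, and the inode numbers are the illustrative ones from the paragraph above.

```python
INODE_SIZE = 256          # bytes per on-disk inode (ext4 default, per the text)
INODES_PER_BLOCK = 16     # a 4096-byte block holds 16 such inodes
BLOCK_SIZE = INODE_SIZE * INODES_PER_BLOCK


def make_block(first_inode):
    """Build a fake inode table block; each slot is filled with a tag byte."""
    block = bytearray()
    for i in range(INODES_PER_BLOCK):
        tag = (first_inode + i) % 256  # stand-in for real inode contents
        block += bytes([tag]) * INODE_SIZE
    return bytes(block)


def slot_owner(table, slot):
    """Return the tag stored in the given inode slot of the table."""
    return table[slot * INODE_SIZE]


# Two consecutive table blocks covering the illustrative inodes #100..#131.
table = bytearray(make_block(100) + make_block(116))

# Rewrite the first block, but land 1 KiB past where it belongs.
misplaced_offset = 1024
table[misplaced_offset:misplaced_offset + BLOCK_SIZE] = make_block(100)

# Slot 4 should hold inode #104 but now carries the contents of inode #100.
print("slot 4 holds the contents of inode", slot_owner(table, 4))
```

In this model slots 0 to 3 keep their old contents, so inodes #100 and #104 end up identical, which is the (N, N+4) duplication seen on the damaged file system. The last four inodes of the misplaced block spill into the first slots of the following block.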

This sounds very much like hardware failure to me, since this is not
the sort of mistake that is likely to be caused by a kernel bug ---
especially since 1k is not a multiple of the file system block size.

So I would cast a very skeptical eye on the hardware that this file
system was stored on....

						- Ted
