* e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed @ 2012-11-07 8:23 kaefert 2012-11-09 0:01 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: kaefert @ 2012-11-07 8:23 UTC (permalink / raw) To: linux-ext4 Hi there! I've got this problem with an ext4 filesystem on an external usb disk of mine, I've documented the problem (and my approaches for solving it) here: http://serverfault.com/questions/446074/e2fsck-extremly-slow-although-enough-memory-exists I started gparted and with it e2fsck is around 2012-11-04_2200 and its 2012-11-07_0923 right now (CET). It used up nearly 57 hours of cpu time since then, and I would like to find out if and when it will finish. Can somebody tell me if and how I could get this information? Thanks for developing this great filesystem, and thanks for helping me out in my hour[s ;)] of need. Kind Regards, Thomas K. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-07 8:23 e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed kaefert @ 2012-11-09 0:01 ` Theodore Ts'o 2012-11-09 6:05 ` kaefert 2012-11-11 18:14 ` kaefert 0 siblings, 2 replies; 10+ messages in thread From: Theodore Ts'o @ 2012-11-09 0:01 UTC (permalink / raw) To: kaefert@gmail.com; +Cc: linux-ext4 On Wed, Nov 07, 2012 at 09:23:22AM +0100, kaefert@gmail.com wrote: > Hi there! > > I've got this problem with an ext4 filesystem on an external usb disk > of mine, I've documented the problem (and my approaches for solving > it) here: > http://serverfault.com/questions/446074/e2fsck-extremly-slow-although-enough-memory-exists > > I started gparted and with it e2fsck is around 2012-11-04_2200 and its > 2012-11-07_0923 right now (CET). > It used up nearly 57 hours of cpu time since then, and I would like to > find out if and when it will finish. > Can somebody tell me if and how I could get this information? Can you please run e2fsck from the command line, and capture the output (i.e., using "script"). I really need the e2fsck output to understand what is going on. The strace output is really not helpful. In general, you may be better off simply not trusting gparted to run e2fsck and resize2fs for you. If there are no problems I'm sure it's fine, but it's really hard to debug things if you insist on letting gparted swallon all of the useful debugging output.... - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-09 0:01 ` Theodore Ts'o @ 2012-11-09 6:05 ` kaefert 2012-11-11 18:14 ` kaefert 1 sibling, 0 replies; 10+ messages in thread From: kaefert @ 2012-11-09 6:05 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 2012/11/9 Theodore Ts'o <tytso@mit.edu> > > Can you please run e2fsck from the command line, and capture the > output (i.e., using "script"). I really need the e2fsck output to > understand what is going on. The strace output is really not helpful. > > In general, you may be better off simply not trusting gparted to run > e2fsck and resize2fs for you. If there are no problems I'm sure it's > fine, but it's really hard to debug things if you insist on letting > gparted swallon all of the useful debugging output.... > > - Ted Hi there Ted! Thanks for the answer! Of course I understand that when you want to debug something you really gotta run it from the console, It's just when I started to run this, I did not think anything would go wrong, you never think it hits you ;) After e2fsck failed for the first time (where I don't know why, since gparted crashed after I tried to save the details), I started to run e2fsck manually, and since the -p option made him cancel the run I started a third run in interactive mode. I've posted the console output of this second and third run as an update to my question at serverfault.com (see http://serverfault.com/questions/446074/e2fsck-extremely-slow-although-enough-memory-exists - start reading at "UPDATE4") The 3rd run is still running (since about 20 hours now) and showing the same pattern as the first run that failed after 78 hours. Thanks for looking at this, Thomas K. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-09 0:01 ` Theodore Ts'o 2012-11-09 6:05 ` kaefert @ 2012-11-11 18:14 ` kaefert 2012-11-12 16:16 ` Theodore Ts'o 1 sibling, 1 reply; 10+ messages in thread From: kaefert @ 2012-11-11 18:14 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 2012/11/9 Theodore Ts'o <tytso@mit.edu>: > On Wed, Nov 07, 2012 at 09:23:22AM +0100, kaefert@gmail.com wrote: > > Can you please run e2fsck from the command line, and capture the > output (i.e., using "script"). I really need the e2fsck output to > understand what is going on. The strace output is really not helpful. > > In general, you may be better off simply not trusting gparted to run > e2fsck and resize2fs for you. If there are no problems I'm sure it's > fine, but it's really hard to debug things if you insist on letting > gparted swallon all of the useful debugging output.... > > - Ted Hi there! So it took several days, but after running it manually it finished to run after a few days. However, It doesn't seem to get the filesystem in a truely clean state, although it doesn't print an error (at least not at the end), because I've ran e2fsck again on the same partition and it found errors again. Here's the output of the last run that completed: kaefert@blechmobil:~$ sudo e2fsck -f -y -v /dev/sdb1 e2fsck 1.42.4 (12-Jun-2012) Durchgang 1: Prüfe Inodes, Blocks, und Größen Doppelter Blocks gefunden... starte Scan nach doppelten Block. Durchgang 1B: Suche nach doppelten/defekten Blocks Mehrfach beansprucht Block(s) in Inode 86114492: 4538368 4405248 << ... removed millions of entries of the same pattern here ... >> 11648685 11648686 Durchgang 1C: Prüfe Verzeichnisse nach Inodes mit doppelten Blocks. Durchgang 1D: Gleiche doppelte Blocks ab (es gibt 6 Inodes, die doppelte/defekte Blocks enthalten.) Datei /Recordings/.../MVI_8559.MOV (Inode #86114492, Modifikationszeitpunkt Sat Mar 24 20:23:54 2012) hat Block Nr.413455 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../MVI_8563.MOV (Inode #86114496, mod time Sat Mar 24 20:23:54 2012) multiply claimed block map? ja clone_file_block: interner Fehler; dup_blk für 4538368 wurde nicht gefunden clone_file_block: interner Fehler; dup_blk für 4538368 wurde nicht gefunden Datei /Recordings/.../MVI_8563.MOV (Inode #86114496, Modifikationszeitpunkt Sat Mar 24 20:23:54 2012) hat Block Nr.413455 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../MVI_8559.MOV (Inode #86114492, mod time Sat Mar 24 20:23:54 2012) Duplizierte Blocks bereits neu zugeordnet bzw. geklont. Datei /Recordings/.../MVI_8571.MOV (Inode #86114504, Modifikationszeitpunkt Sat Mar 24 22:09:56 2012) hat Block Nr.244958 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../MVI_8575.MOV (Inode #86114508, mod time Sat Mar 24 22:09:56 2012) multiply claimed block map? ja clone_file_block: interner Fehler; dup_blk für 7999488 wurde nicht gefunden clone_file_block: interner Fehler; dup_blk für 7999488 wurde nicht gefunden Datei /Recordings/.../MVI_8575.MOV (Inode #86114508, Modifikationszeitpunkt Sat Mar 24 22:09:56 2012) hat Block Nr.244958 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../MVI_8571.MOV (Inode #86114504, mod time Sat Mar 24 22:09:56 2012) Duplizierte Blocks bereits neu zugeordnet bzw. geklont. Datei /Recordings/.../MVI_3598.MOV (Inode #86376840, Modifikationszeitpunkt Thu Aug 23 21:14:34 2012) hat Block Nr.45835 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../SomeFile.psd (Inode #86376844, mod time Thu Aug 23 21:14:34 2012) multiply claimed block map? ja clone_file_block: interner Fehler; dup_blk für 345554931 wurde nicht gefunden clone_file_block: interner Fehler; dup_blk für 345554931 wurde nicht gefunden Datei /Recordings/.../SomeFile.psd (Inode #86376844, Modifikationszeitpunkt Thu Aug 23 21:14:34 2012) hat Block Nr.45835 doppelte Block(s), gemeinsam genutzt mit 1 Datei(en): /Recordings/.../MVI_3598.MOV (Inode #86376840, mod time Thu Aug 23 21:14:34 2012) Duplizierte Blocks bereits neu zugeordnet bzw. geklont. Durchgang 2: Prüfe Verzeichnis Struktur Durchgang 3: Prüfe Verzeichnis Verknüpfungen Durchgang 4: Überprüfe die Referenzzähler Durchgang 5: Überprüfe Gruppe Zusammenfassung /dev/sdb1: ***** DATEISYSTEM WURDE VERÄNDERT ***** 121950 Inodes sind in Benutzung (0.07%) 1244 nicht zusammenhängende Dateien (1.0%) 30 nicht zusammenhängende Verzeichnisse (0.0%) # von Inodes mit ind/dind/tind Blöcken: 0/0/0 Erweiterungstiefe Histogramm: 121816/126 184589222 Blöcke werden benutzt (25.20%) 0 ungültige Blöcke 4 große Dateien 119827 reguläre Dateien 2114 Verzeichnisse 0 zeichenorientierte Gerätedateien 0 Blockgerätedateien 0 Fifos 11 Verknüpfungen 0 symbolische Verknüpfungen (0 schnelle symbolische Verknüpfungen) 0 Sockets -------- 121952 Dateien The part that takes that extremly long (like days) is "Durchgang 1D: Gleiche doppelte Blocks ab" = "Pass 1D: Reconciling multiply-claimed blocks" -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-11 18:14 ` kaefert @ 2012-11-12 16:16 ` Theodore Ts'o 2012-11-12 16:29 ` kaefert 0 siblings, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2012-11-12 16:16 UTC (permalink / raw) To: kaefert@gmail.com; +Cc: linux-ext4 On Sun, Nov 11, 2012 at 07:14:40PM +0100, kaefert@gmail.com wrote: > > So it took several days, but after running it manually it finished to > run after a few days. However, It doesn't seem to get the filesystem > in a truely clean state, although it doesn't print an error (at least > not at the end), because I've ran e2fsck again on the same partition > and it found errors again. Here's the output of the last run that > completed: You said this is an external USB drive, right? How big is it? If it's affordable, something I would suggest doing is to make image copy (i.e., using dd or dd_rescue) to another external USB drive, just to rule out hardware issues. The Pass 1B/1C/1D errors, particularly if you are seeing the exact same pattern after running a full e2fsck -fy run, makes me suspicious that inode table blocks are getting written to the wrong location on disk --- and whether this is caused by the storage device failing in some strange way. Also, can you save the output of e2fsck to a file? Direct the output to a log file, so I can look at it. There are patterns of the "millions of entries of the same pattern" which you've elided which can be a hint. Also, can you disable the German translation to make it easier for me to investigate? Thanks, - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-12 16:16 ` Theodore Ts'o @ 2012-11-12 16:29 ` kaefert 2012-11-13 21:09 ` Andreas Dilger 0 siblings, 1 reply; 10+ messages in thread From: kaefert @ 2012-11-12 16:29 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 2012/11/12 Theodore Ts'o <tytso@mit.edu>: > > You said this is an external USB drive, right? How big is it? Yep, its an external USB3.0 3TB disk. > If it's affordable, something I would suggest doing is to make image copy > (i.e., using dd or dd_rescue) to another external USB drive, just to > rule out hardware issues. I'm sorry but I don't have a spare one of those lying around, but wouldn't I see hardware errors in the dmesg output? > > The Pass 1B/1C/1D errors, particularly if you are seeing the exact > same pattern after running a full e2fsck -fy run, makes me suspicious > that inode table blocks are getting written to the wrong location on > disk --- and whether this is caused by the storage device failing in > some strange way. > > Also, can you save the output of e2fsck to a file? Direct the output > to a log file, so I can look at it. There are patterns of the > "millions of entries of the same pattern" which you've elided which > can be a hint. Yep, no problem, I'll send you a complete output in an hour or two, when I'm home. I just omitted those since its really huge and it looked to me like if its only counting from the smallest number to the biggest, and skipping some numbers in between. > Also, can you disable the German translation to make > it easier for me to investigate? Could you tell me how to disable the German translations? sorry... > > Thanks, > > - Ted Thanks for looking into my issue! Regards, Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-12 16:29 ` kaefert @ 2012-11-13 21:09 ` Andreas Dilger 2012-11-13 21:24 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Andreas Dilger @ 2012-11-13 21:09 UTC (permalink / raw) To: kaefert@gmail.com; +Cc: Theodore Ts'o, linux-ext4 On 2012-11-12, at 9:29 AM, kaefert@gmail.com wrote: > 2012/11/12 Theodore Ts'o <tytso@mit.edu>: >> Also, can you disable the German translation to make >> it easier for me to investigate? > > Could you tell me how to disable the German translations? sorry... # LANG=C e2fsck -f ... Cheers, Andreas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-13 21:09 ` Andreas Dilger @ 2012-11-13 21:24 ` Theodore Ts'o 2012-11-15 11:51 ` kaefert 0 siblings, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2012-11-13 21:24 UTC (permalink / raw) To: Andreas Dilger; +Cc: kaefert@gmail.com, linux-ext4 To follow up on the list since Thomas and I have had a number of e-mail exchanges that were off-list, and he has sent me an compressed, raw e2image dump of his file system which I have investigated The proximate cause of the fs corruption seems to be a few inode table blocks written offset by a 1024 bytes --- there were 3 pairs of inodes of the form (N, N+4) which had the exact same contents in the inode structure (same generation number, same mtime/ctime/atimes, same extents). This pattern of corruption is quite odd given that the file system has a 4k block size. The best bet is that the corruption happened at the USB device layer, since the mis-written inodes were offset by a 2 512 byte sectors, as opposed to by an incorrect block number. Thomas tells me this particular device has had a flaky USB controller and this is the not the first such failure. There also seems to be a bug in e2fsck which caused it not to be able to repair the corrupted file system. I have not had a chance to track down the bug yet. It may have been caused by how we handle extent tree blocks getting cached while trying to clone the data block. Something which we should fix, but ultimately, the use of metadata checksums is going to be the best way to deal with cases of the inode table block getting written to the wrong place on disk, since we will then know which inode not to trust, and just have e2fsck zap it. Speaking of zapping, I've given Thomas instructions on how to clri three of the duplicated inodes using debugfs, and that allowed e2fsck to be able to repair his file system. He will have suffered some data loss due to the corrupted inode table, but at least this way he'll be able to gain access to most of the files on the disk. - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-13 21:24 ` Theodore Ts'o @ 2012-11-15 11:51 ` kaefert 2012-11-16 18:14 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: kaefert @ 2012-11-15 11:51 UTC (permalink / raw) To: linux-ext4 Hi there! I've found that on that filesystem, in many folders I now have found every 8th file has as contents instead of what it should have a copy of every other 4th file - with some aditional zeros after it. Maybe its more clear I give an example: A folder with 25 files: file 5 is a copy of 1 file 13 is a copy of 9 file 21 is a copy of 17 The original contents of file 5, 13 and 21 seem to have been lost, maybe they are in the lost+found folder. The problem doesn't always start with the first file in a folder, and doesn't always continue to the end of the folder. 2012/11/13 Theodore Ts'o <tytso@thunk.org> > > To follow up on the list since Thomas and I have had a number of > e-mail exchanges that were off-list, and he has sent me an compressed, > raw e2image dump of his file system which I have investigated > > The proximate cause of the fs corruption seems to be a few inode table > blocks written offset by a 1024 bytes --- there were 3 pairs of inodes > of the form (N, N+4) which had the exact same contents in the inode > structure (same generation number, same mtime/ctime/atimes, same > extents). This pattern of corruption is quite odd given that the file > system has a 4k block size. The best bet is that the corruption > happened at the USB device layer, since the mis-written inodes were > offset by a 2 512 byte sectors, as opposed to by an incorrect block > number. Thomas tells me this particular device has had a flaky USB > controller and this is the not the first such failure. > > There also seems to be a bug in e2fsck which caused it not to be able > to repair the corrupted file system. I have not had a chance to track > down the bug yet. It may have been caused by how we handle extent > tree blocks getting cached while trying to clone the data block. > Something which we should fix, but ultimately, the use of metadata > checksums is going to be the best way to deal with cases of the inode > table block getting written to the wrong place on disk, since we will > then know which inode not to trust, and just have e2fsck zap it. > > Speaking of zapping, I've given Thomas instructions on how to clri > three of the duplicated inodes using debugfs, and that allowed e2fsck > to be able to repair his file system. He will have suffered some data > loss due to the corrupted inode table, but at least this way he'll be > able to gain access to most of the files on the disk. > > - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed 2012-11-15 11:51 ` kaefert @ 2012-11-16 18:14 ` Theodore Ts'o 0 siblings, 0 replies; 10+ messages in thread From: Theodore Ts'o @ 2012-11-16 18:14 UTC (permalink / raw) To: kaefert@gmail.com; +Cc: linux-ext4 On Thu, Nov 15, 2012 at 12:51:04PM +0100, kaefert@gmail.com wrote: > > I've found that on that filesystem, in many folders I now have found > every 8th file has as contents instead of what it should have a copy > of every other 4th file - with some aditional zeros after it. > > Maybe its more clear I give an example: > A folder with 25 files: > file 5 is a copy of 1 > file 13 is a copy of 9 > file 21 is a copy of 17 > > The original contents of file 5, 13 and 21 seem to have been lost, > maybe they are in the lost+found folder. The problem doesn't always > start with the first file in a folder, and doesn't always continue to > the end of the folder. Alas, that's symptomatic of an inode table block getting written to the wrong location on disk. Each inode structure in the inode table is 256 bytes (by default for ext4). So if you write a block which is supposed to contain the inode information for inodes #100, #101, #102, #104, #105, #106, #107, #108, ... #115 off by 1 kilobyte, the the inode structure for #100 will get written on top of the location where the inode information for inode #104 should be, etc. This sounds very much like hardware failure to me, since this is not the sort of mistake that is likely to be caused by a kernel bug --- especially since 1k is not a multiple of the file system block size. So I would cast a very skeptical eye on the hardware that this file system was stored on.... - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2012-11-16 18:14 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-11-07 8:23 e2fsck extremly slow after: EXT4-fs.. ext4_check_descriptors: Checksum for group .. failed kaefert 2012-11-09 0:01 ` Theodore Ts'o 2012-11-09 6:05 ` kaefert 2012-11-11 18:14 ` kaefert 2012-11-12 16:16 ` Theodore Ts'o 2012-11-12 16:29 ` kaefert 2012-11-13 21:09 ` Andreas Dilger 2012-11-13 21:24 ` Theodore Ts'o 2012-11-15 11:51 ` kaefert 2012-11-16 18:14 ` Theodore Ts'o
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).