* Running fsck of huge ext4 partition takes weeks
@ 2015-08-25 15:30 Alexander Afonyashin
  2015-08-25 19:43 ` Andreas Dilger
  2015-08-28  7:56 ` Alexander Afonyashin
  0 siblings, 2 replies; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-25 15:30 UTC (permalink / raw)
  To: linux-ext4

Hi,

Recently I had to run fsck on a 47TB ext4 partition backed by hardware
RAID6 (LSI MegaRAID SAS 2108). More than 2 weeks have passed, but fsck
has not finished yet. It occupies 30GB RSS and almost 35GB VSS, and uses
100% of a single CPU. It has detected errors (and fixed them) but still
has not finished.

Rescue disc is based on Debian 7.8.
kernel: 4.1.4-5
e2fsprogs: 1.42.5-1.1+deb7u1

Any suggestions?

Regards,
Alexander Afonyashin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-25 15:30 Running fsck of huge ext4 partition takes weeks Alexander Afonyashin
@ 2015-08-25 19:43 ` Andreas Dilger
  2015-08-27  5:28   ` Alexander Afonyashin
  2015-08-28  7:56 ` Alexander Afonyashin
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2015-08-25 19:43 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: linux-ext4

On Aug 25, 2015, at 9:30 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
> 
> Hi,
> 
> Recently I had to run fsck on a 47TB ext4 partition backed by hardware
> RAID6 (LSI MegaRAID SAS 2108). More than 2 weeks have passed, but fsck
> has not finished yet. It occupies 30GB RSS and almost 35GB VSS, and uses
> 100% of a single CPU. It has detected errors (and fixed them) but still
> has not finished.
> 
> Rescue disc is based on Debian 7.8.
> kernel: 4.1.4-5
> e2fsprogs: 1.42.5-1.1+deb7u1
> 
> Any suggestions?

Usually the only reason for e2fsck to run so long is because of
duplicate block pass 1b/1c.

Having some of the actual output of e2fsck would allow us to give
some useful advice.

The only thing I can offer is for you to run "strace -p <e2fsck_pid>"
and/or "ltrace -p <e2fsck_pid>" to see what it is doing.

Cheers, Andreas







* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-25 19:43 ` Andreas Dilger
@ 2015-08-27  5:28   ` Alexander Afonyashin
  2015-08-27 14:23     ` Alexander Afonyashin
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-27  5:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

Hi,

The last output (2 days ago) from fsck:

[skipped]
Block #524296 (1235508688) causes directory to be too big.  CLEARED.
Block #524297 (4003498426) causes directory to be too big.  CLEARED.
Block #524298 (3113378389) causes directory to be too big.  CLEARED.
Block #524299 (1368545889) causes directory to be too big.  CLEARED.
Too many illegal blocks in inode 4425477.
Clear inode? yes

---------------------------
iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00   14.52    0.00   85.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0             0.00     0.00    2.00    0.00    12.00     0.00    12.00     0.09   46.00   46.00    0.00  46.00   9.20
sda               0.00     0.00   87.00    0.00   348.00     0.00     8.00     1.00   11.86   11.86    0.00  11.45  99.60

---------------------------
strace output:

root@rescue ~ # strace -f -t -p 4779
Process 4779 attached - interrupt to quit
07:26:54 lseek(4, 14154266963968, SEEK_SET) = 14154266963968
07:26:54 read(4, "\277\224\312\371\302\356\tJC{P\244#3\"2P\327*2Q5\372\206\262\20\\\373\226\262\21\316"..., 4096) = 4096
07:27:02 lseek(4, 1408506736640, SEEK_SET) = 1408506736640
07:27:02 read(4, "\352\3041\345\1\337p\263l;\354\377E[\17\350\235\260\r\344\265\337\3655\223E\216\226\376\263!\n"..., 4096) = 4096
07:27:08 lseek(4, 5948177264640, SEEK_SET) = 5948177264640
07:27:08 read(4, "\321}\226m;1\253Z\301f\205\235\25\201\334?\311AQN(\22!\23{\345\214Vi\240=y"..., 4096) = 4096
07:27:10 brk(0x8cf18e000)               = 0x8cf18e000
07:27:14 lseek(4, 6408024879104, SEEK_SET) = 6408024879104
07:27:14 read(4, "\254n\fn\r\302$\t\213\231\256\2774\326\34\364\fY\v\365`*Br\354X\7T3J\243K"..., 4096) = 4096
07:27:21 lseek(4, 8640894586880, SEEK_SET) = 8640894586880
07:27:21 read(4, "3\372\24\357\3579\254\31\214L\rYrurj\376\250\352%\2\242\255\252\22\347XU\327\235\362\337"..., 4096) = 4096
^CProcess 4779 detached
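
The numbers in the two outputs above can be cross-checked with a little arithmetic. This is only a sketch: it assumes file descriptor 4 is the filesystem device and that the block size equals the 4096-byte read size, neither of which the trace itself confirms. The helper name `offset_to_block` is made up for this illustration.

```python
BLOCK_SIZE = 4096  # bytes; assumed equal to the read() size seen in the trace

# lseek() targets copied from the strace excerpt above.
OFFSETS = [
    14154266963968,
    1408506736640,
    5948177264640,
    6408024879104,
    8640894586880,
]


def offset_to_block(offset, block_size=BLOCK_SIZE):
    """Return the block index for a byte offset; raise if it is not aligned."""
    block, remainder = divmod(offset, block_size)
    if remainder:
        raise ValueError(f"offset {offset} is not {block_size}-byte aligned")
    return block


for off in OFFSETS:
    print(off, "->", offset_to_block(off))

# iostat line for sda: rkB/s divided by r/s gives the average read size in kB.
print(348.00 / 87.00)  # 4.0
```

Every offset is a whole multiple of 4096, and the iostat figures work out to 4 kB per read, so both outputs are consistent with single-block reads scattered across the device rather than sequential scanning.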

Regards,
Alexander


* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-27  5:28   ` Alexander Afonyashin
@ 2015-08-27 14:23     ` Alexander Afonyashin
  2015-08-27 16:39       ` Andreas Dilger
  2015-08-27 19:05       ` Theodore Ts'o
  0 siblings, 2 replies; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-27 14:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

Hi,

I restarted fsck ~6 hours ago. It is again occupying ~30GB of RAM, and
strace shows the number of syscalls per second steadily decreasing.

Regards,
Alexander


* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-27 14:23     ` Alexander Afonyashin
@ 2015-08-27 16:39       ` Andreas Dilger
  2015-08-28  6:39         ` Alexander Afonyashin
  2015-08-27 19:05       ` Theodore Ts'o
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2015-08-27 16:39 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: linux-ext4

On Aug 27, 2015, at 8:23 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
> 
> Hi,
> 
> I restarted fsck ~6 hours ago. It is again occupying ~30GB of RAM, and
> strace shows the number of syscalls per second steadily decreasing.

My first suggestion would be to upgrade e2fsprogs to the latest
stable version - 1.42.13 so that you are not hitting any older bugs.

What was the original problem reported that caused the e2fsck
to be run?

Next, please include the full output from the start of e2fsck,
unless it is just a lot of the same lines repeated.  There are
a lot of Lustre users with 32TB or 48TB ext4 filesystems that can
finish a full e2fsck in a few hours, unless there is some kind
of major corruption.  It may be possible to fix some of the
corruption manually with debugfs to avoid a lengthy e2fsck run.

If you can run "ltrace -p <e2fsck_pid>" on the e2fsck then it
would tell us what code it is running.  It doesn't seem to be
IO bound (only one seek+read per 6 seconds).
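
The "one seek+read per 6 seconds" figure can be reproduced from the timestamps in the strace excerpt posted earlier in the thread. A minimal sketch, using only the five `read()` timestamps shown there; the helper name `mean_gap_seconds` is made up for this illustration.

```python
from datetime import datetime

# Timestamps of the five read(4, ...) calls in the posted strace excerpt.
READ_TIMES = ["07:26:54", "07:27:02", "07:27:08", "07:27:14", "07:27:21"]


def mean_gap_seconds(stamps):
    """Average number of seconds between consecutive HH:MM:SS timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


print(mean_gap_seconds(READ_TIMES))  # 6.75
```

Five samples are far too few for a real measurement, but they agree with the estimate of roughly one 4096-byte read every 6 to 7 seconds.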

Are there any special formatting options that were used for the
filesystem originally?  What does "debugfs -c -R stats <dev>"
report about the filesystem?

Cheers, Andreas



* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-27 14:23     ` Alexander Afonyashin
  2015-08-27 16:39       ` Andreas Dilger
@ 2015-08-27 19:05       ` Theodore Ts'o
  2015-08-28  6:27         ` Alexander Afonyashin
  1 sibling, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2015-08-27 19:05 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: Andreas Dilger, linux-ext4

On Thu, Aug 27, 2015 at 05:23:58PM +0300, Alexander Afonyashin wrote:
> Hi,
> 
> I restarted fsck ~6 hours ago. It is again occupying ~30GB of RAM, and
> strace shows the number of syscalls per second steadily decreasing.

Can you run it under "script" so we can get a transcript of the run?

It sounds like your file system has gotten very badly damaged, so the
question is figuring out what happened so we can advise you about how
to recover.

Can you also send the output of dumpe2fs?

Thanks,

					- Ted


* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-27 19:05       ` Theodore Ts'o
@ 2015-08-28  6:27         ` Alexander Afonyashin
  2015-08-28 17:53           ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-28  6:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

Hi Ted,

Since fsck was restarted yesterday (under screen), it has found only a
few errors:
-----------------------------
root@rescue ~ # fsck -y -v /dev/sda3
fsck from util-linux 2.20.1
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda3 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 4425472 is too big.  Truncate? yes

Block #593549637 (1359559318) causes file to be too big.  CLEARED.
Block #593549638 (782521882) causes file to be too big.  CLEARED.
Block #593549639 (1184464312) causes file to be too big.  CLEARED.
Block #593549640 (1283655900) causes file to be too big.  CLEARED.
Block #593549641 (2241107660) causes file to be too big.  CLEARED.
Block #593549642 (117581257) causes file to be too big.  CLEARED.
Block #593549643 (957560556) causes file to be too big.  CLEARED.
Block #593549644 (1037291178) causes file to be too big.  CLEARED.
Block #593549645 (2099220496) causes file to be too big.  CLEARED.
Block #593549646 (539062498) causes file to be too big.  CLEARED.
Block #593549647 (96880969) causes file to be too big.  CLEARED.
Too many illegal blocks in inode 4425472.
Clear inode? yes

Inode 4425549, i_size is 18119674890280087280, should be 0.  Fix? yes

Inode 4425549, i_blocks is 27717127168711, should be 0.  Fix? yes

Inode 4425554 has compression flag set on filesystem without
compression support.  Clear? yes

Inode 4425554, i_size is 12927291416082045099, should be 0.  Fix? yes

Inode 4425554, i_blocks is 207957728820169, should be 0.  Fix? yes

Inode 4425667, i_size is 3071282098034816027, should be 0.  Fix? yes

Inode 4425667, i_blocks is 225290567745721, should be 0.  Fix? yes

Inode 4425603 has INDEX_FL flag set but is not a directory.
Clear HTree index? yes

Inode 4425603, i_size is 13931552572174262662, should be 0.  Fix? yes

Inode 4425603, i_blocks is 109288676305237, should be 0.  Fix? yes

Inode 4425281 has INDEX_FL flag set but is not a directory.
Clear HTree index? yes

Inode 4425281, i_size is 16737347569809842710, should be 0.  Fix? yes

Inode 4425281, i_blocks is 231725314734850, should be 0.  Fix? yes

Inode 4425553 is too big.  Truncate? yes

Block #1 (1012277717) causes symlink to be too big.  CLEARED.
Block #2 (2730874111) causes symlink to be too big.  CLEARED.
Block #3 (2924388706) causes symlink to be too big.  CLEARED.
Block #4 (739968058) causes symlink to be too big.  CLEARED.
Block #5 (2349486480) causes symlink to be too big.  CLEARED.
Block #6 (184918148) causes symlink to be too big.  CLEARED.
Block #7 (3200511249) causes symlink to be too big.  CLEARED.
Block #8 (4199384552) causes symlink to be too big.  CLEARED.
Block #9 (2310276563) causes symlink to be too big.  CLEARED.
Block #10 (960264107) causes symlink to be too big.  CLEARED.
Block #11 (3387206892) causes symlink to be too big.  CLEARED.
Too many illegal blocks in inode 4425553.
Clear inode? yes
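
To put the i_size values in this transcript in perspective, they can be compared with the total capacity of the filesystem. This is a rough sketch: it takes "Block count" and "Block size" from the dumpe2fs output further down in this message, and the helper name `exceeds_filesystem` is made up for this illustration.

```python
BLOCK_COUNT = 12691701243  # "Block count" from the dumpe2fs output in this message
BLOCK_SIZE = 4096          # "Block size" from the same output

# i_size values e2fsck reported for inodes 4425549, 4425554, 4425667,
# 4425603 and 4425281 in the transcript above.
REPORTED_I_SIZES = [
    18119674890280087280,
    12927291416082045099,
    3071282098034816027,
    13931552572174262662,
    16737347569809842710,
]

fs_bytes = BLOCK_COUNT * BLOCK_SIZE  # total bytes the filesystem can address


def exceeds_filesystem(i_size, capacity=fs_bytes):
    """True if a file of this size could not possibly fit on the filesystem."""
    return i_size > capacity


for size in REPORTED_I_SIZES:
    # Show how many times larger than the whole filesystem each value is.
    print(size, exceeds_filesystem(size), round(size / fs_bytes))
```

Each reported size is tens to hundreds of thousands of times larger than the whole filesystem, which suggests these inodes hold arbitrary data rather than slightly damaged metadata.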

-----------------------------

dumpe2fs "hangs" after dumping superblock info (no group info is shown):
-----------------------------
root@rescue ~ # dumpe2fs /dev/sda3
dumpe2fs 1.42.5 (29-Jul-2012)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          552052d1-9e25-4b2b-bc04-21c7b4a87aa4
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         not clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              793231360
Block count:              12691701243
Reserved block count:     634585062
Free blocks:              12641158920
Free inodes:              793231369
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
Flex block group size:    16
Filesystem created:       Thu Feb 12 16:57:47 2015
Last mount time:          Thu Aug 27 10:58:45 2015
Last write time:          Thu Aug 27 10:58:58 2015
Mount count:              4
Maximum mount count:      -1
Last checked:             Thu Feb 12 16:57:47 2015
Check interval:           0 (<none>)
Lifetime writes:          279 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9ddd8b55-ff3f-4447-bf5c-732402ddd8d6
Journal backup:           inode blocks
FS Error count:           154
First error time:         Thu Aug 13 17:25:45 2015
First error function:     ext4_mb_generate_buddy
First error line #:       739
First error inode #:      0
First error block #:      0
Last error time:          Fri Aug 14 16:48:41 2015
Last error function:      ext4_mb_generate_buddy
Last error line #:        739
Last error inode #:       0
Last error block #:       0
Journal features:         journal_incompat_revoke journal_64bit
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00321045
Journal start:            0

^C

-----------------------------

The /boot partition, which sits on /dev/sda2, is dumped by dumpe2fs
without any problems.
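
The superblock fields above can be checked against each other with simple arithmetic. This is only a sketch using the values printed by dumpe2fs; the variable names are chosen for this illustration.

```python
# Values copied from the dumpe2fs superblock output above.
BLOCK_COUNT = 12691701243
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
INODES_PER_GROUP = 2048
INODE_COUNT = 793231360
FREE_INODES = 793231369

# Total size in TiB (2**40 bytes).
size_tib = BLOCK_COUNT * BLOCK_SIZE / 2**40

# Number of block groups, rounding up because the last group may be partial.
groups = (BLOCK_COUNT + BLOCKS_PER_GROUP - 1) // BLOCKS_PER_GROUP

# Blocks left over for the final, partial group.
last_group_blocks = BLOCK_COUNT - (groups - 1) * BLOCKS_PER_GROUP

print(round(size_tib, 2))                      # about 47.28
print(groups)                                  # 387320
print(groups * INODES_PER_GROUP == INODE_COUNT)  # True
print(last_group_blocks)                       # 32251
print(FREE_INODES - INODE_COUNT)               # 9
```

The size, group count, and inode count agree with one another and with the 47TB figure from the first message. The one inconsistency is that "Free inodes" is 9 higher than "Inode count", which cannot happen on a consistent filesystem, so the summary counters are not trustworthy.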

Regards,
Alexander


* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-27 16:39       ` Andreas Dilger
@ 2015-08-28  6:39         ` Alexander Afonyashin
  0 siblings, 0 replies; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-28  6:39 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

Hi Andreas,

Here's the ltrace output (it's definitely running in an infinite loop):

root@rescue ~ # ltrace -p 31435 2>&1|head -n 30
ext2fs_mark_generic_bmap(0x797480, 0xf49f8345, 0x63e000, 0x64af40, 0x7ffe7148e8d0) = 0
ext2fs_blocks_count(0x641b00, 0x7ffe7148e998, 18, 0xffffffff, 0xf49f8346) = 0x2f47bfdfb
ext2fs_test_generic_bmap(0x64ae80, 0xe4dcb10f, 0x63e000, 0, 0x7ffe7148e8d0) = 0
ext2fs_mark_generic_bmap(0x64ae80, 0xe4dcb10f, 0x63e000, 0x64af40, 0x7ffe7148e8d0) = 0
ext2fs_blocks_count(0x641b00, 0x7ffe7148e998, 18, 0xffffffff, 0xe4dcb110) = 0x2f47bfdfb
ext2fs_test_generic_bmap(0x64ae80, 0x2c4ceefd, 0x63e000, 0, 0x7ffe7148e8d0) = 0
ext2fs_mark_generic_bmap(0x64ae80, 0x2c4ceefd, 0x63e000, 0x64af40, 0x7ffe7148e8d0) = 0
ext2fs_blocks_count(0x641b00, 0x7ffe7148e998, 18, 0xffffffff, 0x2c4ceefe) = 0x2f47bfdfb
ext2fs_test_generic_bmap(0x64ae80, 0x27a62eff, 0x63e000, 0, 0x7ffe7148e8d0) = 0
ext2fs_mark_generic_bmap(0x64ae80, 0x27a62eff, 0x63e000, 0x64af40, 0x7ffe7148e8d0) = 0
ext2fs_blocks_count(0x641b00, 0x7ffe7148e998, 18, 0xffffffff, 0x27a62f00) = 0x2f47bfdfb
ext2fs_test_generic_bmap(0x64ae80, 0x7887810d, 0x63e000, 0, 0x7ffe7148e8d0) = 1
ext2fs_mark_generic_bmap(0x797480, 0x7887810d, 0x63e000, 0x64af40, 0x7ffe7148e8d0) = 0
ext2fs_blocks_count(0x641b00, 0x7ffe7148e998, 18, 0xffffffff, 0x7887810e) = 0x2f47bfdfb
[skipped]
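
The hexadecimal values in the ltrace excerpt are easier to read in decimal. This is only a sketch: ltrace prints raw argument values without knowing the function prototypes, so treating the second argument of the bitmap calls as a block number is an assumption. The block count is the one dumpe2fs reported earlier in the thread, and the helper name `in_range` is made up for this illustration.

```python
BLOCK_COUNT = 12691701243  # "Block count" from the dumpe2fs output earlier in the thread

# Value returned by every ext2fs_blocks_count() call in the excerpt.
RETURNED = 0x2f47bfdfb

# Second argument of the bitmap calls in the excerpt, in order of appearance.
# Assumption: this argument is the block number being tested or marked.
BITMAP_ARGS = [0xf49f8345, 0xe4dcb10f, 0x2c4ceefd, 0x27a62eff, 0x7887810d]


def in_range(block, block_count=BLOCK_COUNT):
    """True if the value would be a valid block number on this filesystem."""
    return 0 <= block < block_count


print(RETURNED == BLOCK_COUNT)  # True
for arg in BITMAP_ARGS:
    print(hex(arg), arg, in_range(arg))
```

The return value matches the superblock block count exactly. Under the assumption above, the values passed to the bitmap calls do not follow any ascending order and all fit in 32 bits, so they fall inside the valid block range of this filesystem and pass a simple range check.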

Right after running ltrace, fsck exited with the message:

fsck: Warning... fsck.ext4 for device /dev/sda3 exited with signal 5.

I will try to install the latest version of e2fsprogs.

root@rescue ~ # debugfs -c -R stats /dev/sda3
debugfs 1.42.5 (29-Jul-2012)
/dev/sda3: catastrophic mode - not reading inode or group bitmaps

The output starts with superblock info and continues with group info:

Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          552052d1-9e25-4b2b-bc04-21c7b4a87aa4
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         not clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              793231360
Block count:              12691701243
Reserved block count:     634585062
Free blocks:              12641158920
Free inodes:              793231369
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
Flex block group size:    16
Filesystem created:       Thu Feb 12 16:57:47 2015
Last mount time:          Thu Aug 27 10:58:45 2015
Last write time:          Thu Aug 27 10:58:58 2015
Mount count:              4
Maximum mount count:      -1
Last checked:             Thu Feb 12 16:57:47 2015
Check interval:           0 (<none>)
Lifetime writes:          279 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9ddd8b55-ff3f-4447-bf5c-732402ddd8d6
Journal backup:           inode blocks
FS Error count:           154
First error time:         Thu Aug 13 17:25:45 2015
First error function:     ext4_mb_generate_buddy
First error line #:       739
First error inode #:      0
First error block #:      0
Last error time:          Fri Aug 14 16:48:41 2015
Last error function:      ext4_mb_generate_buddy
Last error line #:        739
Last error inode #:       0
Last error block #:       0
Directories:              -5
 Group  0: block bitmap at 6053, inode bitmap at 6069, inode table at 6085
           24511 free blocks, 2037 free inodes, 2 used directories, 0 unused inodes
           [Checksum 0x473d]
 Group  1: block bitmap at 6054, inode bitmap at 6070, inode table at 6213
           26665 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x87ce]
 Group  2: block bitmap at 6055, inode bitmap at 6071, inode table at 6341
           32768 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0xa6fb]
 Group  3: block bitmap at 6056, inode bitmap at 6072, inode table at 6469
           26715 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x8707]
 Group  4: block bitmap at 6057, inode bitmap at 6073, inode table at 6597
           32768 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x1495]
 Group  5: block bitmap at 6058, inode bitmap at 6074, inode table at 6725
           26715 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x324b]
 Group  6: block bitmap at 6059, inode bitmap at 6075, inode table at 6853
           32768 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x3098]
 Group  7: block bitmap at 6060, inode bitmap at 6076, inode table at 6981
           26715 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes

[skipped]

 Group 387317: block bitmap at 12691439621, inode bitmap at 12691439637, inode table at 12691440288
           32768 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0xad15]
 Group 387318: block bitmap at 12691439622, inode bitmap at 12691439638, inode table at 12691440416
           32768 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0x95b4]
 Group 387319: block bitmap at 12691439623, inode bitmap at 12691439639, inode table at 12691440544
           32251 free blocks, 2048 free inodes, 0 used directories, 0 unused inodes
           [Checksum 0xbfba]

P.S. I can even mount it and walk the directories, but errors still appear:
- ????????? instead of directory record
- i/o error on <directory_name>
- etc.

Regards,
Alexander

On Thu, Aug 27, 2015 at 7:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Aug 27, 2015, at 8:23 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
>>
>> Hi,
>>
>> I've restarted fsck ~6 hours ago. It's again occupied ~30GB RAM and
>> strace shows that number of syscalls per second becomes fewer and
>> fewer.
>
> My first suggestion would be to upgrade e2fsprogs to the latest
> stable version - 1.42.13 so that you are not hitting any older bugs.
>
> What was the original problem reported that caused the e2fsck
> to be run?
>
> Next, please include the full output from the start of e2fsck,
> unless it is just a lot of the same lines repeated.  There are
> a lot of Lustre users with 32TB or 48TB ext4 filesystems that can
> finish a full e2fsck in a few hours, unless there is some kind
> of major corruption.  It may be possible to fix some of the
> corruption manually with debugfs to avoid a lengthy e2fsck run.
>
> If you can run "ltrace -p <e2fsck_pid>" on the e2fsck then it
> would tell us what code it is running.  It doesn't seem to be
> IO bound (only one seek+read per 6 seconds).
>
> Are there any special formatting options that were used for the
> filesystem originally?  What does "debugfs -c -R stats <dev>"
> report about the filesystem?
>
> Cheers, Andreas
>
>
>> Regards,
>> Alexander
>>
>> On Thu, Aug 27, 2015 at 8:28 AM, Alexander Afonyashin
>> <a.afonyashin@madnet-team.ru> wrote:
>>> Hi,
>>>
>>> The last output (2 days ago) from fsck:
>>>
>>> [skipped]
>>> Block #524296 (1235508688) causes directory to be too big.  CLEARED.
>>> Block #524297 (4003498426) causes directory to be too big.  CLEARED.
>>> Block #524298 (3113378389) causes directory to be too big.  CLEARED.
>>> Block #524299 (1368545889) causes directory to be too big.  CLEARED.
>>> Too many illegal blocks in inode 4425477.
>>> Clear inode? yes
>>>
>>> ---------------------------
>>> iostat output:
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>           0.00    0.00    0.00   14.52    0.00   85.48
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> loop0             0.00     0.00    2.00    0.00    12.00     0.00
>>> 12.00     0.09   46.00   46.00    0.00  46.00   9.20
>>> sda               0.00     0.00   87.00    0.00   348.00     0.00
>>> 8.00     1.00   11.86   11.86    0.00  11.45  99.60
>>>
>>> ---------------------------
>>> strace ouput:
>>>
>>> root@rescue ~ # strace -f -t -p 4779
>>> Process 4779 attached - interrupt to quit
>>> 07:26:54 lseek(4, 14154266963968, SEEK_SET) = 14154266963968
>>> 07:26:54 read(4,
>>> "\277\224\312\371\302\356\tJC{P\244#3\"2P\327*2Q5\372\206\262\20\\\373\226\262\21\316"...,
>>> 4096) = 4096
>>> 07:27:02 lseek(4, 1408506736640, SEEK_SET) = 1408506736640
>>> 07:27:02 read(4,
>>> "\352\3041\345\1\337p\263l;\354\377E[\17\350\235\260\r\344\265\337\3655\223E\216\226\376\263!\n"...,
>>> 4096) = 4096
>>> 07:27:08 lseek(4, 5948177264640, SEEK_SET) = 5948177264640
>>> 07:27:08 read(4,
>>> "\321}\226m;1\253Z\301f\205\235\25\201\334?\311AQN(\22!\23{\345\214Vi\240=y"...,
>>> 4096) = 4096
>>> 07:27:10 brk(0x8cf18e000)               = 0x8cf18e000
>>> 07:27:14 lseek(4, 6408024879104, SEEK_SET) = 6408024879104
>>> 07:27:14 read(4,
>>> "\254n\fn\r\302$\t\213\231\256\2774\326\34\364\fY\v\365`*Br\354X\7T3J\243K"...,
>>> 4096) = 4096
>>> 07:27:21 lseek(4, 8640894586880, SEEK_SET) = 8640894586880
>>> 07:27:21 read(4,
>>> "3\372\24\357\3579\254\31\214L\rYrurj\376\250\352%\2\242\255\252\22\347XU\327\235\362\337"...,
>>> 4096) = 4096
>>> ^CProcess 4779 detached
>>>
>>> Regards,
>>> Alexander
>>>
>>> On Tue, Aug 25, 2015 at 10:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>>> On Aug 25, 2015, at 9:30 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Recently I had to run fsck on 47TB ext4 partition backed by hardware
>>>>> RAID6 (LSI MegaRAID SAS 2108). Right now over 2 weeks passed but fsck
>>>>> is not finished yet. It occupies 30GB RSS, almost 35GB VSS and eats
>>>>> 100% of single CPU. It detected errors (and fixed them) but doesn't
>>>>> finish yet.
>>>>>
>>>>> Rescue disc is based on Debian 7.8.
>>>>> kernel: 4.1.4-5
>>>>> e2fsprogs: 1.42.5-1.1+deb7u1
>>>>>
>>>>> Any suggestions?
>>>>
>>>> Usually the only reason for e2fsck to run so long is because of
>>>> duplicate block pass 1b/1c.
>>>>
>>>> Having some of the actual output of e2fsck would allow us to give
>>>> some useful advice.
>>>>
>>>> The only thing I can offer is for you to run "strace -p <e2fsck_pid>"
>>>> and/or "ltrace -p <e2fsck_pid>" to see what it is doing.
>>>>
>>>> Cheers, Andreas

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-25 15:30 Running fsck of huge ext4 partition takes weeks Alexander Afonyashin
  2015-08-25 19:43 ` Andreas Dilger
@ 2015-08-28  7:56 ` Alexander Afonyashin
  2015-08-28 16:44   ` Andreas Dilger
  1 sibling, 1 reply; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-28  7:56 UTC (permalink / raw)
  To: linux-ext4

Hi,

A brief history of the problem:

- the hardware RAID6 became partially degraded because, according to
the controller, one of the disks failed (slot0, note this)
- the provider was asked to replace the failed drive (hot-swap)
- while they were performing the task, a 2nd disk (slot4) failed and
the RAID became fully degraded
- so the provider was asked to replace the 2nd disk too
- I don't know exactly what happened (or how), but they replaced the
disk in slot4 with the disk from slot0 (see below, it really looks
like this) and inserted a new disk into slot0
- the system did not boot because no partitions were found (GPT)
- I booted from the rescue disk and found a curious thing:

The 1st LBA sector (GPT master sector) of LD0 (there was only one
logical disk configured on the controller) had moved 1MB from the
start of the logical disk. Given that the strip size is 256K, this
looks logical. In fact, the controller keeps RAID metadata on the
drives, so the order in which they are inserted into slots should not
make a difference. I have experience with LSI controllers and it has
always worked that way. But this time it failed to recognize that the
disk was simply moved from one slot to another (maybe because it had
marked the disk as failed, but the disk suddenly returned to life). I
don't know if there's a bug in the firmware or something else
happened, but when the disk was placed back into its original slot0
(leaving slot4 empty), the GPT partition map returned.
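
A shifted GPT header like the one described above can be located by
looking for the 8-byte "EFI PART" signature that starts every GPT
header. What follows is only a minimal sketch: it assumes 512-byte
sectors and the 256K strip size mentioned above, and the function
name, its parameters and the 8MB scan limit are illustrative choices,
not anything taken from the controller or from an existing tool.

```python
GPT_SIGNATURE = b"EFI PART"  # 8-byte signature at the start of a GPT header


def find_gpt_headers(path, step=256 * 1024, sector=512, limit=8 * 1024 * 1024):
    """Return byte offsets below `limit` where a GPT header signature is found.

    For every multiple of `step` (assumed strip size) this checks the
    sector at that offset and the sector right after it, because a GPT
    header normally lives in LBA 1, one sector after the protective MBR.
    The device or image is opened read-only.
    """
    hits = []
    with open(path, "rb") as dev:
        for base in range(0, limit, step):
            for off in (base, base + sector):
                dev.seek(off)
                if dev.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE:
                    hits.append(off)
    return hits
```

Called as find_gpt_headers("/dev/sda") from the rescue system (as
root), a header shifted by 1MB would be expected to show up at offset
1048576 + 512 rather than at 512.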

But ... it seems that an automatic rebuild had been running since the
first disk's replacement, and it did its job.

So I have a partially broken ext4 that I wish to fix.

P.S. The hardware RAID rebuild (performed by the controller) completed
without errors.

Regards,
Alexander

On Tue, Aug 25, 2015 at 6:30 PM, Alexander Afonyashin
<a.afonyashin@madnet-team.ru> wrote:
> Hi,
>
> Recently I had to run fsck on 47TB ext4 partition backed by hardware
> RAID6 (LSI MegaRAID SAS 2108). Right now over 2 weeks passed but fsck
> is not finished yet. It occupies 30GB RSS, almost 35GB VSS and eats
> 100% of single CPU. It detected errors (and fixed them) but doesn't
> finish yet.
>
> Rescue disc is based on Debian 7.8.
> kernel: 4.1.4-5
> e2fsprogs: 1.42.5-1.1+deb7u1
>
> Any suggestions?
>
> Regards,
> Alexander Afonyashin

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-28  7:56 ` Alexander Afonyashin
@ 2015-08-28 16:44   ` Andreas Dilger
  0 siblings, 0 replies; 13+ messages in thread
From: Andreas Dilger @ 2015-08-28 16:44 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: linux-ext4

On Aug 28, 2015, at 1:56 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
> 
> Hi,
> 
> A brief history of the problem:
>
> - the hardware RAID6 became partially degraded because, according to
> the controller, one of the disks failed (slot0, note this)
> - the provider was asked to replace the failed drive (hot-swap)
> - while they were performing the task, a 2nd disk (slot4) failed and
> the RAID became fully degraded
> - so the provider was asked to replace the 2nd disk too
> - I don't know exactly what happened (or how), but they replaced the
> disk in slot4 with the disk from slot0 (see below, it really looks
> like this) and inserted a new disk into slot0

I would suspect that the slot0 disk started to be overwritten by
the RAID rebuild while it was installed in slot4?  It seems like
there is likely "confusing" corruption such as valid inode blocks
written to the wrong offset rather than just "random" corruption.

> - the system did not boot because no partitions were found (GPT)
> - I booted from the rescue disk and found a curious thing:
>
> The 1st LBA sector (GPT master sector) of LD0 (there was only one
> logical disk configured on the controller) had moved 1MB from the
> start of the logical disk. Given that the strip size is 256K, this
> looks logical. In fact, the controller keeps RAID metadata on the
> drives, so the order in which they are inserted into slots should not
> make a difference. I have experience with LSI controllers and it has
> always worked that way. But this time it failed to recognize that the
> disk was simply moved from one slot to another (maybe because it had
> marked the disk as failed, but the disk suddenly returned to life). I
> don't know if there's a bug in the firmware or something else
> happened, but when the disk was placed back into its original slot0
> (leaving slot4 empty), the GPT partition map returned.
>
> But ... it seems that an automatic rebuild had been running since the
> first disk's replacement, and it did its job.
>
> So I have a partially broken ext4 that I wish to fix.
>
> P.S. The hardware RAID rebuild (performed by the controller) completed
> without errors.

In my experience, just because the RAID rebuild doesn't report
any errors, doesn't mean that it didn't write random corruption
across the disk.  Depending on how badly the filesystem is corrupted,
you might be able to recover some data, or you might have corruption
every 8th or 10th block in your filesystem.

Running e2fsck in such a case will make the metadata "valid" but it
will not make the data correct.

Definitely, if you have a filesystem over 16TB you should be running
e2fsprogs-1.42.13 to get the latest fixes.

Cheers, Andreas
> 
> Regards,
> Alexander
> 
> On Tue, Aug 25, 2015 at 6:30 PM, Alexander Afonyashin
> <a.afonyashin@madnet-team.ru> wrote:
>> Hi,
>> 
>> Recently I had to run fsck on 47TB ext4 partition backed by hardware
>> RAID6 (LSI MegaRAID SAS 2108). Right now over 2 weeks passed but fsck
>> is not finished yet. It occupies 30GB RSS, almost 35GB VSS and eats
>> 100% of single CPU. It detected errors (and fixed them) but doesn't
>> finish yet.
>> 
>> Rescue disc is based on Debian 7.8.
>> kernel: 4.1.4-5
>> e2fsprogs: 1.42.5-1.1+deb7u1
>> 
>> Any suggestions?
>> 
>> Regards,
>> Alexander Afonyashin

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-28  6:27         ` Alexander Afonyashin
@ 2015-08-28 17:53           ` Theodore Ts'o
  2015-08-31  7:20             ` Alexander Afonyashin
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2015-08-28 17:53 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: Andreas Dilger, linux-ext4

If dumpe2fs is hanging as well, it's likely that the problem may be at
the hardware level.  You might want to check dmesg or the kernel log
to see if there are any I/O errors being reported from hard drive.
What might be happening is that when a program (such as e2fsck or
dumpe2fs) tries to read from a specific part of the hard drive, the
hard drive is retrying a large number of times because the hard drive
head or platter surface has gotten damaged in some way.

It might also be a good idea to check the S.M.A.R.T. status using the
smartctl program.
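
As a rough illustration of the kernel log check suggested above, the
sketch below counts block-layer I/O error lines per device. The
regular expression assumes the usual "I/O error, dev sda, sector N"
wording; the function name is made up for this example. Note that
drives behind a MegaRAID controller may not report errors this way,
and smartctl typically needs its "-d megaraid,N" device type to reach
the physical disks.

```python
import re
from collections import Counter

# Assumed wording of block-layer error lines in dmesg, for example:
#   blk_update_request: I/O error, dev sda, sector 123456
IO_ERROR = re.compile(r"I/O error, dev (\S+), sector (\d+)")


def io_errors_by_device(kernel_log):
    """Count I/O error lines per device name in the given kernel log text."""
    counts = Counter()
    for line in kernel_log.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

The input can be the text printed by dmesg, or the contents of a
saved kernel log file.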

Cheers,

						- Ted

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-28 17:53           ` Theodore Ts'o
@ 2015-08-31  7:20             ` Alexander Afonyashin
  2015-09-01  3:19               ` Andreas Dilger
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Afonyashin @ 2015-08-31  7:20 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

Hi,

Running fsck from e2fsprogs-1.42.13 results in SIGKILL:

Inode 4425496 is too big.  Truncate? yes

Block #524289 (103743230) causes directory to be too big.  CLEARED.
Block #524290 (3236857350) causes directory to be too big.  CLEARED.
Block #524291 (3625464338) causes directory to be too big.  CLEARED.
Block #524292 (1370882069) causes directory to be too big.  CLEARED.
Block #524293 (3868016883) causes directory to be too big.  CLEARED.
Block #524294 (3919147116) causes directory to be too big.  CLEARED.
Block #524295 (279419478) causes directory to be too big.  CLEARED.
Block #524296 (194746972) causes directory to be too big.  CLEARED.
Block #524297 (1695856868) causes directory to be too big.  CLEARED.
Block #524298 (587425254) causes directory to be too big.  CLEARED.
Block #524299 (142614537) causes directory to be too big.  CLEARED.
Too many illegal blocks in inode 4425496.
Clear inode? yes

Inode 4425357 has compression flag set on filesystem without
compression support.  Clear? yes

Warning... fsck.ext4 for device /dev/sda3 exited with signal 9.
root@rescue ~ # e2fsprogs-1.42.13/build/misc/fsck -v -y /dev/sda3

Regards,
Alexander

On Fri, Aug 28, 2015 at 8:53 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> If dumpe2fs is hanging as well, it's likely that the problem may be at
> the hardware level.  You might want to check dmesg or the kernel log
> to see if there are any I/O errors being reported from hard drive.
> What might be happening is that when a program (such as e2fsck or
> dumpe2fs) tries to read from a specific part of the hard drive, the
> hard drive is retrying a large number of times because the hard drive
> head or platter surface has gotten damaged in some way.
>
> It might also be a good idea to check the S.M.A.R.T. status using the
> smartctl program.
>
> Cheers,
>
>                                                 - Ted

* Re: Running fsck of huge ext4 partition takes weeks
  2015-08-31  7:20             ` Alexander Afonyashin
@ 2015-09-01  3:19               ` Andreas Dilger
  0 siblings, 0 replies; 13+ messages in thread
From: Andreas Dilger @ 2015-09-01  3:19 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: Theodore Ts'o, linux-ext4

On Aug 31, 2015, at 1:20 AM, Alexander Afonyashin <a.afonyashin@madnet-team.ru> wrote:
> 
> Hi,
> 
> Running fsck from e2fsprogs-1.42.13 results in SIGKILL:
> 
> Inode 4425496 is too big.  Truncate? yes
> 
> Block #524289 (103743230) causes directory to be too big.  CLEARED.
> Block #524290 (3236857350) causes directory to be too big.  CLEARED.
> Block #524291 (3625464338) causes directory to be too big.  CLEARED.
> Block #524292 (1370882069) causes directory to be too big.  CLEARED.
> Block #524293 (3868016883) causes directory to be too big.  CLEARED.
> Block #524294 (3919147116) causes directory to be too big.  CLEARED.
> Block #524295 (279419478) causes directory to be too big.  CLEARED.
> Block #524296 (194746972) causes directory to be too big.  CLEARED.
> Block #524297 (1695856868) causes directory to be too big.  CLEARED.
> Block #524298 (587425254) causes directory to be too big.  CLEARED.
> Block #524299 (142614537) causes directory to be too big.  CLEARED.
> Too many illegal blocks in inode 4425496.
> Clear inode? yes
> 
> Inode 4425357 has compression flag set on filesystem without
> compression support.  Clear? yes
> 
> Warning... fsck.ext4 for device /dev/sda3 exited with signal 9.
> root@rescue ~ # e2fsprogs-1.42.13/build/misc/fsck -v -y /dev/sda3

Hmm, the "fsck" command is just a wrapper, and it is not necessarily
calling the e2fsck command from your build tree.  You should run:

   e2fsprogs-1.42.13/build/e2fsck/e2fsck -fy /dev/sda3
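
One way to confirm which e2fsck version a given binary really is, is
to ask it with "-V" and read the banner. This is only a sketch: the
helper names are invented here, and the banner format ("e2fsck
<version> (<date>)") is assumed from what e2fsck normally prints.

```python
import re
import subprocess


def parse_e2fsck_version(banner):
    """Extract the version number from an 'e2fsck -V' banner, or None."""
    match = re.search(r"e2fsck (\d+(?:\.\d+)+)", banner)
    return match.group(1) if match else None


def e2fsck_version(binary):
    """Run `binary -V` and return the version string it reports, or None."""
    proc = subprocess.run(
        [binary, "-V"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the banner is captured either way
        universal_newlines=True,
    )
    return parse_e2fsck_version(proc.stdout)
```

For example, e2fsck_version("e2fsprogs-1.42.13/build/e2fsck/e2fsck")
should report 1.42.13 if the build tree binary is the one being run.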

That said, if you are having problems with the e2fsck, could you
run it under gdb to see where it is failing?  Signal 9 is SIGKILL
which means that the process was killed by some external signal?
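
Given the memory usage reported earlier in the thread, one external
sender worth ruling out is the kernel's OOM killer. The sketch below
looks for its messages in the kernel log and decodes a negative
subprocess return code into a signal name. The log wording in the
regular expression is an assumption about typical kernel messages,
and both function names are invented for this example.

```python
import re
import signal

# Assumed OOM killer wording in the kernel log, for example:
#   Out of memory: Kill process 4779 (e2fsck) score 901 or sacrifice child
#   Killed process 4779 (e2fsck) total-vm:...
OOM_LINE = re.compile(
    r"(?:Out of memory: Kill(?:ed)? process|Killed process) (\d+) \(([^)]+)\)"
)


def oom_victims(kernel_log):
    """Return a sorted list of (pid, command) pairs named in OOM killer lines."""
    found = set()
    for line in kernel_log.splitlines():
        match = OOM_LINE.search(line)
        if match:
            found.add((int(match.group(1)), match.group(2)))
    return sorted(found)


def describe_exit(returncode):
    """Describe a subprocess-style return code; negative means killed by a signal."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode
```

If oom_victims() applied to the dmesg output lists e2fsck, the SIGKILL
came from the kernel running out of memory rather than from a bug in
e2fsck itself.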

> Regards,
> Alexander
> 
> On Fri, Aug 28, 2015 at 8:53 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> If dumpe2fs is hanging as well, it's likely that the problem may be at
>> the hardware level.  You might want to check dmesg or the kernel log
>> to see if there are any I/O errors being reported from hard drive.
>> What might be happening is that when a program (such as e2fsck or
>> dumpe2fs) tries to read from a specific part of the hard drive, the
>> hard drive is retrying a large number of times because the hard drive
>> head or platter surface has gotten damaged in some way.
>> 
>> It might also be a good idea to check the S.M.A.R.T. status using the
>> smartctl program.
>> 
>> Cheers,
>> 
>>                                                - Ted


Cheers, Andreas

end of thread, other threads:[~2015-09-01  3:19 UTC | newest]

Thread overview: 13+ messages
2015-08-25 15:30 Running fsck of huge ext4 partition takes weeks Alexander Afonyashin
2015-08-25 19:43 ` Andreas Dilger
2015-08-27  5:28   ` Alexander Afonyashin
2015-08-27 14:23     ` Alexander Afonyashin
2015-08-27 16:39       ` Andreas Dilger
2015-08-28  6:39         ` Alexander Afonyashin
2015-08-27 19:05       ` Theodore Ts'o
2015-08-28  6:27         ` Alexander Afonyashin
2015-08-28 17:53           ` Theodore Ts'o
2015-08-31  7:20             ` Alexander Afonyashin
2015-09-01  3:19               ` Andreas Dilger
2015-08-28  7:56 ` Alexander Afonyashin
2015-08-28 16:44   ` Andreas Dilger
