From: Andreas Dilger <adilger@dilger.ca>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4 <linux-ext4@vger.kernel.org>
Subject: e2fsck dir_info corruption
Date: Fri, 31 Jan 2020 21:07:42 -0700 [thread overview]
Message-ID: <00CED351-2128-4DF8-B286-7774CCC1FC0A@dilger.ca> (raw)
[-- Attachment #1: Type: text/plain, Size: 5124 bytes --]
I've been trying to track down a failure with e2fsck on a large filesystem. Running e2fsck-1.45.2 repeatedly and consistently reports an error in pass2:
Internal error: couldn't find dir_info for <ino>
The inode number is consistent across multiple runs, but examining it with
debugfs shows that it is a normal directory with no problems. In order to
complete the e2fsck, we tried using "debugfs -w -R 'clri <ino>'" to erase
the inode, but this just shifted the error onto a slightly later directory
(which would itself now be reported consistently).
The filesystem is 6TB with 2.3B inodes, about 1.8B of which are in use. The
inode numbers reporting problems are in the range of 1.2B. e2fsck is using
about 45GB of RAM (with some swap on an NVMe SSD, so much faster than the
HDD the filesystem image is on). It takes about 3-4h to hit this problem.
I was able to run e2fsck under gdb, though this slows it down significantly
(10-100x slower), if there are conditional breakpoints since they need to
be evaluated for each inode, which makes it impossible to use this way.
What I've discovered is quite interesting. The ctx->dir_info->array is
filled in properly for the bad inode, and the array is OK at the start of
pass2 (2 is the root inode, and $66 holds the bad inode number):
(gdb) p e2fsck_get_dir_info(ctx, 2)
$171 = (struct dir_info *) 0x7ffabf2bd010
(gdb) p *$171
$172 = {ino = 2, dotdot = 0, parent = 2}
(gdb) p $66
$173 = 1222630490
(gdb) p e2fsck_get_dir_info(ctx, $66)
$174 = (struct dir_info *) 0x7ffb3d327f54
(gdb) p *$174
$175 = {ino = 1222630490, dotdot = 0, parent = 0}
The dir_info array itself is over 4GB in size (not sure if this is relevant
or not), with 380M directories, but the bad inode is only about half way
through the array ($140 is the index of the problematic entry):
(gdb) p $140
$176 = 176197275
(gdb) p p ctx->dir_info->array[$140]
$177 = {ino = 1222630490, dotdot = 0, parent = 0}
(gdb) p *ctx->dir_info
$178 = {count = 381539950, size = 381539965,
array = 0x7ffabf2bd010, last_lookup = 0x0,
tdb_fn = 0x0, tdb = 0x0}
(gdb) p $178.count * sizeof(ctx->dir_info->array[0])
$179 = 4578479400
I put a hardware watchpoint on the ".parent" and ".dotdot" fields of this
entry so that I could watch as e2fsck_dir_info_set_dotdot() updated it,
but I didn't think to put one on the ".ino" field, since I thought it would
be static.
(gdb) info break
Num Type Disp Enb Address What
15 hw watchpoint keep y -location $177->parent
16 hw watchpoint keep y -location $177->dotdot
The watchpoint triggered, and saw that the entry was changed by qsort_r(),
which at first I thought "OK, the dir_info array needs to be sorted, because
a binary search is used on it", but in fact the array is *created* in order
during the inode table scan and does not need to be sorted. As can be seen
from the stack, it is *another* array that is being sorted that overwrites
the dir_info entry:
Hardware watchpoint 16: -location $177->dotdot
Old value = 0
New value = 1581603653
0x0000003b2ea381f7 in _quicksort ()
from /lib64/libc.so.6
(gdb) p $141
$179 = {ino = 1222630490, dotdot = 0, parent = 0}
(gdb) bt
#0 0x0000003b2ea381e1 in _quicksort ()
from /lib64/libc.so.6
#1 0x0000003b2ea38f82 in qsort_r ()
from /lib64/libc.so.6
#2 0x000000000045447b in ext2fs_dblist_sort2 (
dblist=0x6c9a90,
sortfunc=0x428c3d <special_dir_block_cmp>)
at dblist.c:189
#3 0x00000000004285bc in e2fsck_pass2 (
ctx=0x6c4000) at pass2.c:180
#4 0x000000000041646b in e2fsck_run (ctx=0x6c4000)
at e2fsck.c:253
#5 0x0000000000415210 in main (argc=4,
argv=0x7fffffffe088) at unix.c:2047
(gdb) p ctx->dir_info->array[$140]
$207 = {ino = 0, dotdot = 1581603653, parent = 0}
(gdb) f 3
#3 0x00000000004285bc in e2fsck_pass2 (
ctx=0x6c4000) at pass2.c:180
180 ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
AFAIK, the ext2fs_dblist_sort2() is for the directory *blocks*, and should
not be changing the dir_info at all. Is this a bug in qsort or glibc?
What I just noticed writing this email is that the fs->dblist.list address
is right in the middle of the dir_info array address range:
(gdb) p *fs->dblist
$210 = {magic = 2133571340, fs = 0x6c4460,
size = 763079922, count = 388821313, sorted = 0,
list = 0x7ffad011e010}
(gdb) p &ctx->dir_info->array[0]
$211 = (struct dir_info *) 0x7ffabf2bd010
(gdb) p &ctx->dir_info->array[$140]
$212 = (struct dir_info *) 0x7ffb3d327f54
which might explain why sorting dblist is messing with dir_info? I don't
_think_ it is a problem with my build or swap, which is different from
the system that this was originally reproduced on.
Any thoughts on how to continue debugging this?
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]
next reply other threads:[~2020-02-01 4:07 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-01 4:07 Andreas Dilger [this message]
2020-02-01 17:02 ` e2fsck dir_info corruption Andreas Dilger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=00CED351-2128-4DF8-B286-7774CCC1FC0A@dilger.ca \
--to=adilger@dilger.ca \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox