From: "Dilger, Andreas" <andreas.dilger@intel.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: e2fsck -fD corruption of large htree/extent directory
Date: Wed, 11 Nov 2015 10:13:46 +0000
Message-ID: <D2685283.119F15%andreas.dilger@intel.com>
In-Reply-To: <D261A180.118DB8%andreas.dilger@intel.com>

[-- Attachment #1: Type: text/plain, Size: 4109 bytes --]

On 2015/11/06, 00:12, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:

>Running e2fsck -fD on a large extent+htree directory (> 300k entries,
>1600+ filesystem blocks) may result in the directory becoming corrupted.
>This is definitely caused by a bug in the code rather than hardware, as
>this corrupted multiple large directories on different systems.

Thanks to a suggestion from Darrick, I was able to reproduce this problem
with an e2fsck test script (attached) when shrinking an htree extent
directory with only 3 index blocks referenced directly by the inode.  The
problem is not present on block-mapped directories but looks to be a
danger for any user of the "-fD" option with extent-mapped directories.

It looks like the problem occurs when the inode shrinks enough that one
of the index blocks should be dropped from the end of the file (blocks
after logical block 114 were freed), but the write_dir_block() iterator
in write_directory() doesn't free index block 800:

    :
    write_dir_block 113:583 - write
    write_dir_block 114:587 - write
    write_dir_block 115:591 - free
    write_dir_block 116:595 - free
    :
    :
    write_dir_block 165:791 - free
    write_dir_block -1:800 - skip
    write_dir_block 166:795 - free
    write_dir_block 167:799 - free

    write_dir_block 168:804 - free
    write_dir_block 169:808 - free
    write_dir_block 170:812 - free
    write_dir_block 171:813 - free
    write_dir_block 172:814 - free
    write_dir_block -1:800 - skip
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
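
One way to pull the suspicious entries out of a trace like the one above
is to filter for callbacks with a negative logical block number, which is
how index block 800 shows up here.  This is only a minimal sketch: it
assumes the debug output format shown above ("write_dir_block
<lblk>:<pblk> - <action>", which comes from local debugging and is not
stock e2fsck output), and metadata_skips is a hypothetical helper name.

```shell
# metadata_skips: read a write_dir_block trace on stdin and print one line
# per physical block that was visited with a negative logical block number,
# followed by the number of times it was visited.
metadata_skips() {
	awk '$1 == "write_dir_block" {
		split($2, lp, ":")              # lp[1] = logical, lp[2] = physical
		if (lp[1] < 0) seen[lp[2]]++
	}
	END { for (p in seen) print p, seen[p] }'
}

# Example using a few lines copied from the trace above.
skips=$(printf '%s\n' \
	'    write_dir_block 165:791 - free' \
	'    write_dir_block -1:800 - skip' \
	'    write_dir_block 166:795 - free' \
	'    write_dir_block 172:814 - free' \
	'    write_dir_block -1:800 - skip' | metadata_skips)
echo "$skips"
```

For the sample lines this reports block 800 with a count of 2, matching
the two "skip" entries in the trace.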


The extent tree now has a bogus index block at the end and is somehow
also missing the valid extent block that held the rest of the file, so
logical blocks 83-114 are lost.  This is shown by debugfs, run after
"e2fsck -fD" but before the second e2fsck that detects the corruption:

    debugfs: stat subdir
    Inode: 12   Type: directory    Mode:  0755   Flags: 0x81000
    Generation: 0    Version: 0x00000000
    User:     0   Group:     0   Size: 117760
    File ACL: 0    Directory ACL: 0
    Links: 2   Blockcount: 238
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    atime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    mtime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    EXTENTS:
    (ETB0):146, (0):129, (1):133, (2):137, (3):141, (4):145, (5):150,
    (6):154, (7):158, (8):162, (9):166, (10):170, (11):174, (12):178,
    (13):182, (14):186, (15):190, (16):194, (17):198, (18):202,
    (19):206, (20):210, (21):214, (22):218, (23):222, (24):226,
    (25):230, (26):234, (27):238, (28):242, (29):246, (30):250,
    (31):254, (32):258, (33):262, (34):266, (35):270, (36):274,
    (37):278, (38):282, (39):286, (40):290, (41):294, (42):298,
    (43):302, (44):306, (45):310, (46):314, (47):318, (48):322,
    (49):326, (50):330, (51):334, (52):338, (53):342, (54):346,
    (55):350, (56):354, (57):358, (58):362, (59):366, (60):370,
    (61):374, (62):378, (63):382, (64):386, (65):390, (66):394,
    (67):398, (68):402, (69):406, (70):410, (71):414, (72):418,
    (73):422, (74):426, (75):430, (76):434, (77):438, (78):442,
    (79):446, (80):450, (81):454, (82):458, (ETB0):800, (172):814


    debugfs: extents subdir
    :
    :
    1/ 1  82/ 83    81 -    81   454 -   454      1
    1/ 1  83/ 83    82 -    82   458 -   458      1
    0/ 1   2/  2   170 - 4294967410   800         4294967241
    1/ 1   1/  1   172 -   172   814 -   814      1
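
The two very large numbers in the "0/ 1   2/  2" index entry line above
look like 32-bit wraparounds and not real block numbers.  Subtracting
2^32 is a quick way to see that.  This is only arithmetic on the values
printed by debugfs, and the interpretation is a guess:

```shell
# Values copied from the "0/ 1   2/  2" line of the debugfs output above.
start=170
end=4294967410
len=4294967241

two32=$((1 << 32))
end_wrapped=$((end - two32))        # end logical block modulo 2^32
len_signed=$((len - two32))         # length read as a signed 32-bit value

echo "end=$end_wrapped len=$len_signed check=$((end_wrapped - start + 1))"
```

Read this way, the index entry starts at logical block 170 but ends at
114, which is the last logical block written in the trace, giving a
negative length of -55.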

The i_size is correct for 115 data blocks written, and i_blocks would
be correct if the second index block had not been lost.  The bug seems
to be in the extent handling code, but I haven't yet dug into why the
last extent is kept.  I tried deleting it like the other blocks, but the
iteration then stops immediately with an error that the index block is
corrupted, and I'm not sure how to catch it the second time.
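
The i_size figure and the count of lost blocks can be checked against
the debugfs output with a little arithmetic, assuming the 1024-byte
block size used by the attached script:

```shell
BSIZE=1024               # block size set in the test script
DATA_BLOCKS=115          # logical blocks 0-114 written by write_dir_block
LOST_FIRST=83            # first logical block missing from the extent tree
LOST_LAST=114            # last logical block missing from the extent tree

expected_size=$((DATA_BLOCKS * BSIZE))
lost_blocks=$((LOST_LAST - LOST_FIRST + 1))
echo "expected i_size: $expected_size, lost logical blocks: $lost_blocks"
```

The expected size matches the "Size: 117760" reported by debugfs stat.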

Cheers, Andreas
-- 
Andreas Dilger

Lustre Principal Engineer
Intel High Performance Data Division



[-- Attachment #2: script --]
[-- Type: application/octet-stream, Size: 1978 bytes --]

#!/bin/bash
TMP=${TMP:-"/tmp"}
test_name=${test_name:-$(basename $(dirname $0))}
test_dir=${test_dir:-$test_name}
cmd_dir=${cmd_dir:-"."}
OUT=$test_name.log
MKFS=${MKFS:-../misc/mke2fs}
FSCK=${FSCK:-../e2fsck/e2fsck}
DEBUGFS=${DEBUGFS:-../debugfs/debugfs}

# parameters for run_e2fsck
SKIP_GUNZIP="true"
FSCK_OPT="-fyvD"

NAMELEN=250
SRC=$TMP/$test_name.tmp
SUB=subdir
BASE=$SRC/$SUB/$(yes | tr -d '\n' | dd bs=$NAMELEN count=1 2> /dev/null)
TMPFILE=${TMPFILE:-"$TMP/image"}
BSIZE=1024

> $OUT
mkdir -p $SRC/$SUB
# calculate the number of files needed to create the directory extent tree
# deep enough to exceed the in-inode index and spill into an index block.
#
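# dirents per block * extents per block * (index blocks > i_blocks)
# (8 = dirent header bytes, 12 = bytes per extent entry)
# With BSIZE=1024 and NAMELEN=250 this is 3 * 85 * 2 = 510 files.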
NUM=$(((BSIZE / (NAMELEN + 8)) * (BSIZE / 12) * 2))
# Create source files. Unfortunately hard links will be copied as links,
# and blocks with only NULs will be turned into holes.
if [ ! -f $BASE.1 ]; then
	for N in $(seq $NUM); do
		echo "foo" > $BASE.$N
	done >> $OUT
fi

# make filesystem with enough inodes and blocks to hold all the test files
> $TMPFILE
NUM=$((NUM * 5 / 3))
echo "mke2fs -b $BSIZE -O dir_index,extent -d$SRC -N$NUM $TMPFILE $NUM" >> $OUT
$MKFS -b $BSIZE -O dir_index,extent -d$SRC -N$NUM $TMPFILE $NUM >> $OUT 2>&1
rm -r $SRC

# Run e2fsck to convert dir to htree before deleting the files, as mke2fs
# doesn't do this.  Run second e2fsck to verify there is no corruption yet.
(
	EXP1=$test_dir/expect.pre.1
	EXP2=$test_dir/expect.pre.2
	OUT1=$test_name.pre.1.log
	OUT2=$test_name.pre.2.log
	DESCRIPTION="$(cat $test_dir/name) setup"
	. $cmd_dir/run_e2fsck
)

# generate a list of filenames for debugfs to delete, one from each leaf block
DELETE_LIST=$TMP/delete.$$
$DEBUGFS -c -R "htree subdir" $TMPFILE 2>> $OUT |
	grep -A2 "Reading directory block" |
	awk '/yyyyy/ { print "rm '$SUB'/"$4 }' > $DELETE_LIST
$DEBUGFS -w -f $DELETE_LIST $TMPFILE >> $OUT 2>&1
rm $DELETE_LIST
cp $TMPFILE $TMPFILE.sav

. $cmd_dir/run_e2fsck
