From: "Theodore Ts'o" <tytso@mit.edu>
To: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: Zhang Yi <yi.zhang@huaweicloud.com>,
	linux-ext4@vger.kernel.org, adilger.kernel@dilger.ca,
	jack@suse.cz, yi.zhang@huawei.com, yukuai3@huawei.com
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
Date: Tue, 13 Jun 2023 13:27:49 -0400
Message-ID: <20230613172749.GA18303@mit.edu>
In-Reply-To: <20002902-39c5-914b-75b0-5a21b5cee25c@huawei.com>

On Tue, Jun 13, 2023 at 04:13:06PM +0800, Zhihao Cheng wrote:
> 
> Hi Ted, I tried to run './check generic/475' many rounds(1.47.0,
> 5-Feb-2023), and I cannot reproduce the problem with this patch.

What file system configuration (e.g., mke2fs options) were you using
when you ran generic/475?  I reproduced the problem with
CONFIG_LOCKDEP enabled, with the ext4/adv configuration[1], which
means that the file system was created using "mke2fs -t ext4 -O
inline_data,fast_commit".  The size of the test file system was 5 GiB.

[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/adv

At this point, it looks like the problem is timing specific.  When I
built at patch 3/6 of your patch series, I was no longer able to
trigger the failure using the CONFIG_LOCKDEP kernel --- specifically
using a kernel config generated using install-kconfig[2] with the
--lockdep command-line-option.

[2] https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig

However, when I built a kernel config without --lockdep, I was able to
trigger the problem for both the ext4/adv and the ext4/ext3[3]
file system test scenarios.  That is, when doing a full regression test
run using "gce-xfstests ltm -c ext4/all -g auto", the VMs for
ext4/adv and ext4/ext3 both hung while running generic/475.  And
using a non-lockdep kernel, the command "gce-xfstests -c
ext4/adv,ext4/ext3 generic/475" would hang.  I ran this command twice,
to make sure there were no timing-related false negatives; once we
hung while running generic/475 for ext4/adv, and once we hung while
running it for ext4/ext3:

% gce-xfstests ls -l
  ...
xfstests-tytso-20230613115748 34.172.36.63 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:07 ext4/adv generic/475 - RUNNING
xfstests-tytso-20230613115802 34.133.66.61 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:06 ext4/ext3 generic/475 - RUNNING

[3] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/ext3

Furthermore, when I rewound the git repo to just before this patch
series (which is currently at the end of the dev branch), the full
regression test suites ("-c ext4/all -g all") and the more specific
test run ("-C 5 -c ext4/adv,ext4/ext3 generic/475") did not hang.  I
am currently doing another bisect run using a non-lockdep kernel to
see if I can get more detail.
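
Since the failure looks timing specific, repeating the run (as with
"-C 5" above) matters more than any single result.  For a test that is
run locally, the repeat-and-count-hangs idea can be sketched roughly as
below.  This is only an illustration: count_hangs() and its parameters
are hypothetical names, the timeout is an arbitrary choice, and it
assumes the test command runs in the foreground (unlike gce-xfstests,
which launches VMs).  A process wedged inside the kernel may not be
killable, so in practice the watchdog should live outside the machine
under test.

```python
import subprocess
import sys


def count_hangs(cmd, runs=5, timeout_s=3600):
    """Run cmd `runs` times and return how many runs exceeded timeout_s.

    A run that does not finish within timeout_s is treated as a hang;
    subprocess.run() kills the child before raising TimeoutExpired.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Example (assumed invocation): python3 hangs.py ./check generic/475
    command = sys.argv[1:]
    if command:
        print(f"{count_hangs(command)} of 5 runs timed out")
```

A result of zero hangs over a handful of runs is weak evidence at best
for a timing-dependent bug; it only becomes meaningful alongside a
kernel that does hang under the same loop.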

> Could you send me a compressed image which can trigger the problem
> with 'fsck -fn'?

Sure.  I'll have to send that under a separate e-mail message, since
it's 15 megabytes.  It was created using "dd if=/dev/mapper/xt-vdc |
gzip -9 > broken-image-which-causes-e2fsck-to-segv.gz".
Unfortunately, I was not able to create a metadata-only dump because
the filesystem was too corrupted.  An attempt to run "e2image -Q
/dev/mapper/xt-vdc broken.qcow2" failed with:

e2image 1.47.0 (5-Feb-2023)
e2image: Corrupt extent header while iterating over inode 6016
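
The "dd | gzip -9" pipeline above just streams the whole block device
through a compressor.  For anyone who wants to do the same from a
script, here is a rough equivalent.  It is a sketch under stated
assumptions: compress_image() is a hypothetical helper, the 1 MiB
chunk size is arbitrary, and reading a device node such as
/dev/mapper/xt-vdc would normally require root.

```python
import gzip


def compress_image(src_path, dst_path, chunk_size=1 << 20):
    """Stream src_path into a gzip level-9 file; return bytes read."""
    total = 0
    with open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=9) as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of the device or file
            dst.write(chunk)
            total += len(chunk)
    return total
```

Unlike a metadata-only e2image dump, this copies every block, so the
result includes file contents as well as metadata.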

I was able to run e2fsck compiled with clang's asan enabled, and
here's the ASAN report (this is against the master branch in
e2fsprogs's git repo, so it's a bit ahead of 1.47.0):

e2fsck 1.47.0 (5-Feb-2023)
=================================================================
==25033==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000009900 at pc 0x564bbf8ae405 bp 0x7ffdd82bf0e0 sp 0x7ffdd82be8b0
WRITE of size 4096 at 0x625000009900 thread T0
    #0 0x564bbf8ae404 in pread64 (/build/e2fsprogs-asan/e2fsck/e2fsck+0x24a404) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)
    #1 0x564bbfd47532 in raw_read_blk /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:240:12
    #2 0x564bbfd3965b in unix_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:1079:17
    #3 0x564bbfc6cecd in io_channel_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/io_manager.c:78:10
    #4 0x564bbf9a3791 in e2fsck_pass1_check_symlink /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass1.c:241:7
    #5 0x564bbfa28a9f in e2fsck_process_bad_inode /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1990:8
    #6 0x564bbfa21e9c in check_dir_block /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1525:8
    #7 0x564bbfa18a3b in check_dir_block2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1034:8
    #8 0x564bbfb9ee4a in ext2fs_dblist_iterate3 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:216:9
    #9 0x564bbfb9ef79 in ext2fs_dblist_iterate2 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:229:9
    #10 0x564bbfa14529 in e2fsck_pass2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:190:20
    #11 0x564bbf980660 in e2fsck_run /usr/projects/e2fsprogs/e2fsprogs/e2fsck/e2fsck.c:262:3
    #12 0x564bbf95c774 in main /usr/projects/e2fsprogs/e2fsprogs/e2fsck/unix.c:1931:15
    #13 0x7f65d4e4e189 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7f65d4e4e244 in __libc_start_main csu/../csu/libc-start.c:381:3
    #15 0x564bbf892840 in _start (/build/e2fsprogs-asan/e2fsck/e2fsck+0x22e840) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)
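
When comparing reports like this across builds, it can help to reduce
each one to its (function, file, line) frames.  A minimal sketch of
that, assuming the frame layout shown above; parse_frames() is a
hypothetical helper, and frames without source locations (such as #0
and #15 here) are simply skipped:

```python
import re

# Matches frames of the form:
#   #4 0x564bbf9a3791 in e2fsck_pass1_check_symlink /path/pass1.c:241:7
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)\s+0x[0-9a-f]+\s+in\s+(?P<func>\S+)\s+"
    r"(?P<file>\S+?):(?P<line>\d+)(?::\d+)?\s*$"
)


def parse_frames(report):
    """Return (index, function, file, line) for each frame with source info."""
    frames = []
    for text in report.splitlines():
        m = FRAME_RE.search(text)
        if m:
            frames.append(
                (int(m["idx"]), m["func"], m["file"], int(m["line"]))
            )
    return frames
```

Filtering the result on a path substring such as "e2fsprogs" leaves
only the frames in the project's own sources.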

I'm still digging into finding the root cause; I'll let you know if I
find more.

Cheers,

						- Ted



