From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-ext4-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E51B4C433F5
	for <linux-ext4@archiver.kernel.org>; Thu, 14 Apr 2022 02:46:40 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S239626AbiDNCtC (ORCPT <rfc822;linux-ext4@archiver.kernel.org>);
        Wed, 13 Apr 2022 22:49:02 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34094 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233483AbiDNCtB (ORCPT
        <rfc822;linux-ext4@vger.kernel.org>); Wed, 13 Apr 2022 22:49:01 -0400
Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 76C711FCD2;
        Wed, 13 Apr 2022 19:46:38 -0700 (PDT)
Received: from canpemm500010.china.huawei.com (unknown [172.30.72.57])
        by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4Kf3jz6x1tzfYny;
        Thu, 14 Apr 2022 10:45:59 +0800 (CST)
Received: from [10.174.178.185] (10.174.178.185) by
 canpemm500010.china.huawei.com (7.192.105.118) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2375.24; Thu, 14 Apr 2022 10:46:36 +0800
Subject: Re: [PATCH -next] ext4: fix warning in ext4_handle_inode_extension
To:     Jan Kara <jack@suse.cz>
References: <20220326065351.761952-1-yebin10@huawei.com>
 <20220329092810.j5ngxckygut6mxo2@quack3.lan> <6244482D.4090603@huawei.com>
 <20220330133015.yxfnnw564wgehjc3@quack3.lan>
CC:     <tytso@mit.edu>, <adilger.kernel@dilger.ca>,
        <linux-ext4@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
        <lczerner@redhat.com>
From:   yebin <yebin10@huawei.com>
Message-ID: <62578B0C.9000803@huawei.com>
Date:   Thu, 14 Apr 2022 10:46:36 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101
 Thunderbird/38.1.0
MIME-Version: 1.0
In-Reply-To: <20220330133015.yxfnnw564wgehjc3@quack3.lan>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.174.178.185]
X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To
 canpemm500010.china.huawei.com (7.192.105.118)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <linux-ext4.vger.kernel.org>
X-Mailing-List: linux-ext4@vger.kernel.org


On 2022/3/30 21:30, Jan Kara wrote:
> On Wed 30-03-22 20:08:13, yebin wrote:
>> On 2022/3/29 17:28, Jan Kara wrote:
>>> On Sat 26-03-22 14:53:51, Ye Bin wrote:
>>>> We got issue as follows:
>>>> EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory
>>>> EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error
>>>> EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory
>>>> EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220
>>>> Modules linked in:
>>>> CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1
>>>> RIP: 0010:ext4_file_write_iter+0x11c9/0x1220
>>>> RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282
>>>> RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000
>>>> RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd
>>>> RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f
>>>> R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a
>>>> R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b
>>>> FS:  00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Call Trace:
>>>>    do_iter_readv_writev+0x2e5/0x360
>>>>    do_iter_write+0x112/0x4c0
>>>>    do_pwritev+0x1e5/0x390
>>>>    __x64_sys_pwritev2+0x7e/0xa0
>>>>    do_syscall_64+0x37/0x50
>>>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>
>>>> Above issue may happen as follows:
>>>> Assume
>>>> inode.i_size=4096
>>>> EXT4_I(inode)->i_disksize=4096
>>>>
>>>> step 1: set inode->i_isize = 8192
>>>> ext4_setattr
>>>>     if (attr->ia_size != inode->i_size)
>>>>       EXT4_I(inode)->i_disksize = attr->ia_size;
>>>>       rc = ext4_mark_inode_dirty
>>>>          ext4_reserve_inode_write
>>>>             ext4_get_inode_loc
>>>>               __ext4_get_inode_loc
>>>>                 sb_getblk --> return -ENOMEM
>>>>      ...
>>>>      if (!error)  ->will not update i_size
>>>>        i_size_write(inode, attr->ia_size);
>>>> Now:
>>>> inode.i_size=4096
>>>> EXT4_I(inode)->i_disksize=8192
>>>>
>>>> step 2: Direct write 4096 bytes
>>>> ext4_file_write_iter
>>>>    ext4_dio_write_iter
>>>>      iomap_dio_rw ->return error
>>>>    if (extend)
>>>>      ext4_handle_inode_extension
>>>>        WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
>>>> ->Then trigger warning.
>>>>
>>>> To solve above issue, if mark inode dirty failed in ext4_setattr just
>>>> set 'EXT4_I(inode)->i_disksize' with old value.
>>>>
>>>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>>> Thanks for the fix! So I think this deserves a further debate. I have two
>>> points here:
>>>
>>> 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback)
>>> we must abort the journal because metadata is not guaranteed to be
>>> consistent anymore. In this particular callsite of ext4_mark_inode_dirty()
>>> you were able to undo the changes but there are many more where it is not
>>> sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs
>>> to call ext4_journal_abort_handle() (as already happens inside
>>> __ext4_journal_get_write_access()) and not just ext4_std_error().
>>>
>>> 2) The assertion in ext4_handle_inode_extension() should be conditioned on
>>> !is_journal_aborted() to avoid useless warnings for filesystems we know are
>>> inconsistent anyway.
>>>
>>> Thoughts?
>>>
>>> 								Honza
>> Do you mean call jbd2_abort in ext4_reserve_inode_write() ?
> Yes.
>
>> If we abort journal when metadata is not guaranteed to be consistent. The
>> mode of ‘errors=continue’ is unnecessary.
> Well, firstly, errors=continue was always the best effort. There are no
> guarantees which failures we are able to withstand and which not.
> Generally, I think we try to withstand on-disk filesystem inconsistency but
> not inconsistency coming from programming errors or other external factors
> like out-of-memory conditions. Secondly, we already do abort the journal
> when e.g. jbd2_journal_get_write_access() fails (although that generally
> means some internal inconsistency) or when say revoke handling fails to
> allocate memory for a revoke record. So it won't be a new thing. Thirdly,
> and perhaps most importantly, you have found and fixed just one fairly
> innocent problem happening due to in memory inode state getting
> inconsistent after we fail to record the inode in the journal. There are
> almost 80 callsites of ext4_mark_inode_dirty() and honestly I suspect that
> e.g. inconsistent states resulting from extent tree manipulations being
> aborted in the middle due to ext4_ext_dirty() failing due to ENOMEM will
> also trigger all sorts of "interesting" behavior. So that's why I'd rather
> abort the journal than try to continue when we almost certainly now we
> cannot.
>
> One other possibility I could see dealing with this class of problems is
> using __GFP_NOFAIL for inode buffer head allocation (through
> sb_getblk_gfp()) in __ext4_get_inode_loc(). BTW, how did you trigger NULL
> sb_getblk() return in the first place? AFAICS grow_buffers() already uses
> __GFP_NOFAIL for all the allocations?
>
> 								Honza
To be honest, I don't know syzkaller how to inject the NOMEM fault. If 
syzkaller
rely on the memory fault injection mode provided by the kernel, should  
report
null pointer access. Anyway,  If inject a single point of IO fault, we 
still have to
face the same situation.
>
>>>> ---
>>>>    fs/ext4/inode.c | 4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 90fd6f7b6209..8adf1f802f6c 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>>>>    	if (attr->ia_valid & ATTR_SIZE) {
>>>>    		handle_t *handle;
>>>>    		loff_t oldsize = inode->i_size;
>>>> +		loff_t old_disksize;
>>>>    		int shrink = (attr->ia_size < inode->i_size);
>>>>    		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
>>>> @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>>>>    					inode->i_sb->s_blocksize_bits);
>>>>    			down_write(&EXT4_I(inode)->i_data_sem);
>>>> +			old_disksize = EXT4_I(inode)->i_disksize;
>>>>    			EXT4_I(inode)->i_disksize = attr->ia_size;
>>>>    			rc = ext4_mark_inode_dirty(handle, inode);
>>>>    			if (!error)
>>>> @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>>>>    			 */
>>>>    			if (!error)
>>>>    				i_size_write(inode, attr->ia_size);
>>>> +			else
>>>> +				EXT4_I(inode)->i_disksize = old_disksize;
>>>>    			up_write(&EXT4_I(inode)->i_data_sem);
>>>>    			ext4_journal_stop(handle);
>>>>    			if (error)
>>>> -- 
>>>> 2.31.1
>>>>