public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Zhang Yi <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, yukuai3@huawei.com
Subject: Re: [RFC PATCH 1/4] ext4: check and update i_disksize properly
Date: Tue, 6 Jul 2021 17:26:33 +0200	[thread overview]
Message-ID: <20210706152633.GB17149@quack2.suse.cz> (raw)
In-Reply-To: <32946f62-631e-d752-9fcf-e89b568e2e7f@huawei.com>

On Tue 06-07-21 22:40:46, Zhang Yi wrote:
> On 2021/7/6 20:11, Jan Kara wrote:
> > On Tue 06-07-21 10:42:07, Zhang Yi wrote:
> >> After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
> >> isize"), i_disksize could always be updated to i_size in ext4_setattr(),
> >> and it seems that there is no other way that could appear
> >> i_disksize < i_size besides the delalloc write. In the case of delay
> > 
> > Well, there are also direct IO writes which have temporarily i_disksize <
> > i_size but when you hold i_rwsem, you're right that delalloc is the only
> > reason why you can see i_disksize < i_size AFAIK.
> > 
> >> alloc write, ext4_writepages() could update i_disksize for the new delay
> >> allocated blocks properly. So we could switch to check i_size instead
> >> of i_disksize in ext4_da_write_end() when write to the end of the file.
> > 
> > I agree that since ext4_da_should_update_i_disksize() needs to return true
> > for us to touch i_disksize, writeback has to have already allocated block
> > underlying the end of write (new_i_size position) and thus we are
> > guaranteed that writeback will also soon update i_disksize after the
> > new_i_size position. So I agree that your switch to testing i_size instead
> > of i_disksize should not have any bad effect... Thinking about this some
> > more why do we need i_disksize update in ext4_da_write_end() at all? The
> > page will be dirtied and when writeback will happen we will update
> > i_disksize to i_size. Updating i_disksize earlier brings no benefit - the user
> > will see zeros instead of valid data if we crash before the writeback
> > happened. Am I missing something guys?
> > 
> 
> Hi, Jan.
> 
> Do you remember the patch and question I asked 2 years ago[1][2]? The
> case of new_i_size > i_size && ext4_da_should_update_i_disksize() here
> means partial block append write,

Agreed.

> ext4_writepages() does not update i_disksize for this case now.

Doesn't it? Hmm, so mpage_map_and_submit_extent() certainly does make sure
we update i_size properly. But you are actually correct that
ext4_writepage() does not update i_disksize and neither does
mpage_prepare_extent_to_map() which can also writeback fully mapped pages.
Changing mpage_prepare_extent_to_map() to handle i_disksize update would be
trivial but dealing with ext4_writepage() would be difficult. So yes, let's
keep the i_disksize update in ext4_da_write_end() for now. But please add a
comment there explaining the situation. Like:

	/*
	 * Since we are holding inode lock, we are sure i_disksize <=
	 * i_size. We also know that if i_disksize < i_size, there are
	 * delalloc writes pending in the range upto i_size. If the end of
	 * the current write is <= i_size, there's no need to touch
	 * i_disksize since writeback will push i_disksize upto i_size
	 * eventually. If the end of the current write is > i_size and
	 * inside an allocated block (ext4_da_should_update_i_disksize()
	 * check), we need to update i_disksize here as neither
	 * ext4_writepage() nor certain ext4_writepages() paths not
	 * allocating blocks update i_disksize.
	 *
	 * Note that we defer inode dirtying to generic_write_end() /
	 * ext4_da_write_inline_data_end().
	 */

> And the journal data=ordered mode also
> cannot guarantee write data before metadata. So we cannot guarantee we
> cannot see zeros where data was written after crash.

Yes, but that is IMO somewhat different question.

								Honza

> 
> Thanks,
> Yi.
> 
> [1]https://lore.kernel.org/linux-ext4/20190404101823.GA22313@quack2.suse.cz/
> [2]https://lore.kernel.org/linux-ext4/20190405091258.GA1600@quack2.suse.cz/
> 
> > 
> >> we also could remove ext4_mark_inode_dirty() together because
> >> generic_write_end() will dirty the inode.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 21 ++++++++-------------
> >>  1 file changed, 8 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index d8de607849df..6f6a61f3ae5f 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3087,32 +3087,27 @@ static int ext4_da_write_end(struct file *file,
> >>  	 * generic_write_end() will run mark_inode_dirty() if i_size
> >>  	 * changes.  So let's piggyback the i_disksize mark_inode_dirty
> >>  	 * into that.
> >> +	 *
> >> +	 * Check i_size not i_disksize here because ext4_writepages() could
> >> +	 * update i_disksize from i_size for delay allocated blocks properly.
> >>  	 */
> >>  	new_i_size = pos + copied;
> >> -	if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
> >> +	if (copied && new_i_size > inode->i_size) {
> >>  		if (ext4_has_inline_data(inode) ||
> >> -		    ext4_da_should_update_i_disksize(page, end)) {
> >> +		    ext4_da_should_update_i_disksize(page, end))
> >>  			ext4_update_i_disksize(inode, new_i_size);
> >> -			/* We need to mark inode dirty even if
> >> -			 * new_i_size is less that inode->i_size
> >> -			 * bu greater than i_disksize.(hint delalloc)
> >> -			 */
> >> -			ret = ext4_mark_inode_dirty(handle, inode);
> >> -		}
> >>  	}
> >>  
> >>  	if (write_mode != CONVERT_INLINE_DATA &&
> >>  	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
> >>  	    ext4_has_inline_data(inode))
> >> -		ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
> >> +		ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
> >>  						     page);
> >>  	else
> >> -		ret2 = generic_write_end(file, mapping, pos, len, copied,
> >> +		ret = generic_write_end(file, mapping, pos, len, copied,
> >>  							page, fsdata);
> >>  
> >> -	copied = ret2;
> >> -	if (ret2 < 0)
> >> -		ret = ret2;
> >> +	copied = ret;
> >>  	ret2 = ext4_journal_stop(handle);
> >>  	if (unlikely(ret2 && !ret))
> >>  		ret = ret2;
> >> -- 
> >> 2.31.1
> >>
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2021-07-06 15:26 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-06  2:42 [RFC PATCH 0/4] ext4: improve delalloc buffer write performance Zhang Yi
2021-07-06  2:42 ` [RFC PATCH 1/4] ext4: check and update i_disksize properly Zhang Yi
2021-07-06 12:11   ` Jan Kara
2021-07-06 14:40     ` Zhang Yi
2021-07-06 15:26       ` Jan Kara [this message]
2021-07-07  6:18         ` Zhang Yi
2021-07-06  2:42 ` [RFC PATCH 2/4] ext4: correct the error path of ext4_write_inline_data_end() Zhang Yi
2021-07-06 12:28   ` Jan Kara
2021-07-06  2:42 ` [RFC PATCH 3/4] ext4: factor out write end code of inline file Zhang Yi
2021-07-07 16:49   ` Jan Kara
2021-07-10  8:13     ` Zhang Yi
2021-07-06  2:42 ` [RFC PATCH 4/4] ext4: drop unnecessary journal handle in delalloc write Zhang Yi
2021-07-07 16:59   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210706152633.GB17149@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox