Re: [PATCH v4 01/10] ext4: remove writable userspace mappings before truncating page cache

From: Ojaswin Mujoo <ojaswin@linux.ibm.com>
To: Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com,
	yangerkun@huawei.com
Subject: Re: [PATCH v4 01/10] ext4: remove writable userspace mappings before truncating page cache
Date: Wed, 18 Dec 2024 15:26:54 +0530
Message-ID: <Z2KcZt91otMCYqvi@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
In-Reply-To: <20241216013915.3392419-2-yi.zhang@huaweicloud.com>

On Mon, Dec 16, 2024 at 09:39:06AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When zeroing a range of folios on the filesystem which block size is
> less than the page size, the file's mapped blocks within one page will
> be marked as unwritten, we should remove writable userspace mappings to
> ensure that ext4_page_mkwrite() can be called during subsequent write
> access to these partial folios. Otherwise, data written by subsequent
> mmap writes may not be saved to disk.
> 
>  $mkfs.ext4 -b 1024 /dev/vdb
>  $mount /dev/vdb /mnt
>  $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>                -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>                -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
> 
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>  *
>  001000
> 
>  $umount /mnt && mount /dev/vdb /mnt
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  001000
> 
> Fix this by introducing ext4_truncate_page_cache_block_range() to remove
> writable userspace mappings when truncating a partial folio range.
> Additionally, move the journal data mode-specific handlers and
> truncate_pagecache_range() into this function, allowing it to serve as a
> common helper that correctly manages the page cache in preparation for
> block range manipulations.

Hi Zhang,

Thanks for the fix. Just to confirm my understanding, the issue arises
because of the following flow:

1. page_mkwrite() makes folio dirty when we write to the mmap'd region

2. ext4_zero_range (2kb to 4kb)
    truncate_pagecache_range
      truncate_inode_pages_range
        truncate_inode_partial_folio
          folio_zero_range (2kb to 4kb)
            folio_invalidate
              ext4_invalidate_folio
                block_invalidate_folio -> clear the bh dirty bit

3. mwrite (2kb to 4kb): we write to the page cache again, but the bh is
   not dirty, hence the data is not seen on disk after a remount

Also, we won't see this issue if we are zeroing a page-aligned range,
since we end up unmapping the pages from the process address space in
that case. Correct?
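
To make the aligned vs. unaligned distinction concrete, here is a small
userspace sketch (not kernel code). It assumes that
truncate_pagecache_range() only unmaps pages which the range covers
completely, so bytes that fall in partially covered pages stay mapped
into the process. The helper name partial_bytes() and the ROUND_* macros
are made up for illustration, and "end" is exclusive here.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel's round_down()/round_up().
 * Assumption: 'a' is a power of two. */
#define ROUND_DOWN(x, a) ((x) & ~((a) - 1))
#define ROUND_UP(x, a)   ROUND_DOWN((x) + (a) - 1, (a))

/*
 * Hypothetical helper: number of bytes of [start, end) that sit in pages
 * the range covers only partially. Under the assumption stated above,
 * these are the bytes whose pages remain mapped after the truncate.
 */
static long long partial_bytes(long long start, long long end,
			       long long page_size)
{
	long long inner_start, inner_end;

	assert(page_size > 0 && (page_size & (page_size - 1)) == 0);
	assert(start >= 0 && start <= end);

	/* Page-aligned interior of the range, if there is one. */
	inner_start = ROUND_UP(start, page_size);
	inner_end = ROUND_DOWN(end, page_size);

	/* No fully covered page: the whole range is partial. */
	if (inner_start >= inner_end)
		return end - start;

	/* Partial head plus partial tail around the aligned interior. */
	return (inner_start - start) + (end - inner_end);
}
```

With 4k pages, partial_bytes(2048, 4096, 4096) gives 2048, which is the
fzero range from the reproducer, while partial_bytes(0, 4096, 4096)
gives 0.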

I have also tested the patch on PowerPC with a 64k page size and a 4k
block size and can confirm that it fixes the data loss issue. That being
said, I have a few minor comments on the patch below:

> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h    |  2 ++
>  fs/ext4/extents.c | 19 ++++-----------
>  fs/ext4/inode.c   | 62 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 74f2071189b2..8843929b46ce 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3016,6 +3016,8 @@ extern int ext4_inode_attach_jinode(struct inode *inode);
>  extern int ext4_can_truncate(struct inode *inode);
>  extern int ext4_truncate(struct inode *);
>  extern int ext4_break_layouts(struct inode *);
> +extern int ext4_truncate_page_cache_block_range(struct inode *inode,
> +						loff_t start, loff_t end);
>  extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
>  extern void ext4_set_inode_flags(struct inode *, bool init);
>  extern int ext4_alloc_da_blocks(struct inode *inode);
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index a07a98a4b97a..8dc6b4271b15 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4667,22 +4667,13 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  			goto out_mutex;
>  		}
>  
> -		/*
> -		 * For journalled data we need to write (and checkpoint) pages
> -		 * before discarding page cache to avoid inconsitent data on
> -		 * disk in case of crash before zeroing trans is committed.
> -		 */
> -		if (ext4_should_journal_data(inode)) {
> -			ret = filemap_write_and_wait_range(mapping, start,
> -							   end - 1);
> -			if (ret) {
> -				filemap_invalidate_unlock(mapping);
> -				goto out_mutex;
> -			}
> +		/* Now release the pages and zero block aligned part of pages */
> +		ret = ext4_truncate_page_cache_block_range(inode, start, end);
> +		if (ret) {
> +			filemap_invalidate_unlock(mapping);
> +			goto out_mutex;
>  		}
>  
> -		/* Now release the pages and zero block aligned part of pages */
> -		truncate_pagecache_range(inode, start, end - 1);
>  		inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
>  
>  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 89aade6f45f6..c68a8b841148 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -31,6 +31,7 @@
>  #include <linux/writeback.h>
>  #include <linux/pagevec.h>
>  #include <linux/mpage.h>
> +#include <linux/rmap.h>
>  #include <linux/namei.h>
>  #include <linux/uio.h>
>  #include <linux/bio.h>
> @@ -3902,6 +3903,67 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
>  	return ret;
>  }
>  
> +static inline void ext4_truncate_folio(struct inode *inode,
> +				       loff_t start, loff_t end)
> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +	struct folio *folio;
> +
> +	/* Nothing to be done if no complete block needs to be truncated. */
> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
> +		return;
> +
> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		return;
> +
> +	if (folio_mkclean(folio))
> +		folio_mark_dirty(folio);
> +	folio_unlock(folio);
> +	folio_put(folio);
> +}
> +
> +int ext4_truncate_page_cache_block_range(struct inode *inode,
> +					 loff_t start, loff_t end)
> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +	int ret;
> +
> +	/*
> +	 * For journalled data we need to write (and checkpoint) pages
> +	 * before discarding page cache to avoid inconsitent data on disk
> +	 * in case of crash before freeing or unwritten converting trans
> +	 * is committed.
> +	 */
> +	if (ext4_should_journal_data(inode)) {
> +		ret = filemap_write_and_wait_range(inode->i_mapping, start,
> +						   end - 1);
> +		if (ret)
> +			return ret;
> +		goto truncate_pagecache;
> +	}
> +
> +	/*
> +	 * If the block size is less than the page size, the file's mapped
> +	 * blocks within one page could be freed or converted to unwritten.
> +	 * So it's necessary to remove writable userspace mappings, and then
> +	 * ext4_page_mkwrite() can be called during subsequent write access
> +	 * to these partial folios.
> +	 */
> +	if (blocksize < PAGE_SIZE && start < inode->i_size) {

Maybe we should only call ext4_truncate_folio() if the range is not page
aligned, rather than calling it every time for bs < ps?
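
Roughly what I have in mind, written as a userspace model rather than an
actual kernel change. MODEL_PAGE_SIZE, range_has_partial_page() and
should_clean_edge_folios() are hypothetical names, and the 4k page size
is only an assumption for the example; in the kernel the check would use
PAGE_SIZE and the inode's real values.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for this model only: 4k pages. */
#define MODEL_PAGE_SIZE 4096LL

/* Hypothetical guard: true when start or end falls inside a page. */
static bool range_has_partial_page(long long start, long long end)
{
	assert(start <= end);
	/* OR-ing the offsets checks both alignments in one mask test. */
	return ((start | end) & (MODEL_PAGE_SIZE - 1)) != 0;
}

/*
 * Model of the condition in the quoted hunk with the extra alignment
 * guard added: only touch the edge folios when bs < ps, the range
 * starts inside i_size, and the range is not page aligned.
 */
static bool should_clean_edge_folios(long long start, long long end,
				     long long blocksize, long long i_size)
{
	return blocksize < MODEL_PAGE_SIZE && start < i_size &&
	       range_has_partial_page(start, end);
}
```

For the reproducer (1k blocks, zeroing 2048 to 4096 in a 4096 byte file)
this returns true, and for a page-aligned range such as 0 to 4096 it
returns false.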

> +		loff_t start_boundary = round_up(start, PAGE_SIZE);

I think page_boundary would be a more suitable name for this variable.

Regards,
ojaswin

> +
> +		ext4_truncate_folio(inode, start, min(start_boundary, end));
> +		if (end > start_boundary)
> +			ext4_truncate_folio(inode,
> +					    round_down(end, PAGE_SIZE), end);
> +	}
> +
> +truncate_pagecache:
> +	truncate_pagecache_range(inode, start, end - 1);
> +	return 0;
> +}
> +
>  static void ext4_wait_dax_page(struct inode *inode)
>  {
>  	filemap_invalidate_unlock(inode->i_mapping);
> -- 
> 2.46.1
>
