Re: [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure
Date: Mon, 20 May 2024 19:50:30 +0930	[thread overview]
Message-ID: <e50e7918-373c-4349-a81d-e8b0dd0c0a88@gmx.com> (raw)
In-Reply-To: <3aac8d8b6e6caab81b3726ff857684c08fb2346d.1716053516.git.fdmanana@suse.com>



在 2024/5/20 19:16, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> If a write path in COW mode fails, either before submitting a bio for the
> new extents or an actual IO error happens, we can end up allowing a fast
> fsync to log file extent items that point to unwritten extents.
>
> This is because dropping the extent maps happens when completing ordered
> extents, at btrfs_finish_one_ordered(), and the completion of an ordered
> extent is executed in a work queue.
>
> This can result in a fast fsync to start logging file extent items based
> on existing extent maps before the ordered extents complete, therefore
> resulting in a log that has file extent items that point to unwritten
> extents, resulting in a corrupt file if a crash happens after and the log
> tree is replayed the next time the fs is mounted.
>
> This can happen for both direct IO writes and buffered writes.
>
> For example consider a direct IO write, in COW mode, that fails at
> btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
> error:
>
> 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
>     set to false, meaning an error happened;
>
> 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
>     flag;
>
> 3) btrfs_finish_ordered_extent() queues the completion of the ordered
>     extent - so that btrfs_finish_one_ordered() will be executed later in
>     a work queue. That function will drop extent maps in the range when
>     it's executed, since the extent maps point to unwritten locations
>     (signaled by the BTRFS_ORDERED_IOERR flag);
>
> 4) After calling btrfs_finish_ordered_extent() we keep going down the
>     write path and unlock the inode;
>
> 5) After that a fast fsync starts and locks the inode;
>
> 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
>     task sees the extent maps that point to the unwritten locations and
>     logs file extent items based on them - it does not know they are
>     unwritten, and the fast fsync path does not wait for ordered extents
>     to complete, which is an intentional behaviour in order to reduce
>     latency.
>
> For the buffered write case, here's one example:
>
> 1) A fast fsync begins, and it starts by flushing delalloc and waiting for
>     the writeback to complete by calling filemap_fdatawait_range();
>
> 2) Flushing the dellaloc created a new extent map X;
>
> 3) During the writeback some IO error happened, and at the end io callback
>     (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
>     sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
>     completion;
>
> 4) After queuing the ordered extent completion, the end io callback clears
>     the writeback flag from all pages (or folios), and from that moment the
>     fast fsync can proceed;
>
> 5) The fast fsync proceeds sees extent map X and logs a file extent item
>     based on extent map X, resulting in a log that points to an unwritten
>     data extent - because the ordered extent completion hasn't run yet, it
>     happens only after the logging.
>
> To fix this make btrfs_finish_ordered_extent() set the inode flag
> BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
> so that a fast fsync will wait for ordered extent completion.
>
> Note that this issues of using extent maps that point to unwritten
> locations can not happen for reads, because in read paths we start by
> locking the extent range and wait for any ordered extents in the range
> to complete before looking for extent maps.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

So a new inode flag without touching the spinlock, that's a solid solution.

Thanks,
Qu
> ---
>   fs/btrfs/btrfs_inode.h  | 10 ++++++++++
>   fs/btrfs/file.c         | 16 ++++++++++++++++
>   fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++
>   3 files changed, 57 insertions(+)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 3c8bc7a8ebdd..46db4027bf15 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -112,6 +112,16 @@ enum {
>   	 * done at new_simple_dir(), called from btrfs_lookup_dentry().
>   	 */
>   	BTRFS_INODE_ROOT_STUB,
> +	/*
> +	 * Set if an error happened when doing a COW write before submitting a
> +	 * bio or during writeback. Used for both buffered writes and direct IO
> +	 * writes. This is to signal a fast fsync that it has to wait for
> +	 * ordered extents to complete and therefore not log extent maps that
> +	 * point to unwritten extents (when an ordered extent completes and it
> +	 * has the BTRFS_ORDERED_IOERR flag set, it drops extent maps in its
> +	 * range).
> +	 */
> +	BTRFS_INODE_COW_WRITE_ERROR,
>   };
>
>   /* in memory btrfs inode */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0c7c1b42028e..00670596bf06 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1885,6 +1885,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   	 */
>   	if (full_sync || btrfs_is_zoned(fs_info)) {
>   		ret = btrfs_wait_ordered_range(inode, start, len);
> +		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
>   	} else {
>   		/*
>   		 * Get our ordered extents as soon as possible to avoid doing
> @@ -1894,6 +1895,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
>   						      &ctx.ordered_extents);
>   		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
> +		if (ret)
> +			goto out_release_extents;
> +
> +		/*
> +		 * Check and clear the BTRFS_INODE_COW_WRITE_ERROR now after
> +		 * starting and waiting for writeback, because for buffered IO
> +		 * it may have been set during the end IO callback
> +		 * (end_bbio_data_write() -> btrfs_finish_ordered_extent()) in
> +		 * case an error happened and we need to wait for ordered
> +		 * extents to complete so that any extent maps that point to
> +		 * unwritten locations are dropped and we don't log them.
> +		 */
> +		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
> +				       &BTRFS_I(inode)->runtime_flags))
> +			ret = btrfs_wait_ordered_range(inode, start, len);
>   	}
>
>   	if (ret)
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 44157e43fd2a..7d175d10a6d0 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -388,6 +388,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
>   	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
>
> +	/*
> +	 * If this is a COW write it means we created new extent maps for the
> +	 * range and they point to unwritten locations if we got an error either
> +	 * before submitting a bio or during IO.
> +	 *
> +	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
> +	 * are queuing its completion below. During completion, at
> +	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
> +	 * unwritten extents.
> +	 *
> +	 * However because completion runs in a work queue we can end up having
> +	 * a fast fsync running before that. In the case of direct IO, once we
> +	 * unlock the inode the fsync might start, and we queue the completion
> +	 * before unlocking the inode. In the case of buffered IO when writeback
> +	 * finishes (end_bbio_data_write()) we queue the completion, so if the
> +	 * writeback was triggered by a fast fsync, the fsync might start
> +	 * logging before ordered extent completion runs in the work queue.
> +	 *
> +	 * The fast fsync will log file extent items based on the extent maps it
> +	 * finds, so if by the time it collects extent maps the ordered extent
> +	 * completion didn't happen yet, it will log file extent items that
> +	 * point to unwritten extents, resulting in a corruption if a crash
> +	 * happens and the log tree is replayed. Note that a fast fsync does not
> +	 * wait for completion of ordered extents in order to reduce latency.
> +	 *
> +	 * Set a flag in the inode so that the next fast fsync will wait for
> +	 * ordered extents to complete before starting to log.
> +	 */
> +	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
> +		set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
> +
>   	if (ret)
>   		btrfs_queue_ordered_fn(ordered);
>   	return ret;

next prev parent reply	other threads:[~2024-05-20 10:20 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-17 16:52 [PATCH v3 0/2] btrfs: fix logging unwritten extents after failure in write paths fdmanana
2024-05-17 16:52 ` [PATCH v3 1/2] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
2024-05-18  5:28   ` Qu Wenruo
2024-05-20  9:46     ` Filipe Manana
2024-05-17 16:52 ` [PATCH v3 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
2024-05-20  9:46 ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
2024-05-20  9:46   ` [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
2024-05-20 10:20     ` Qu Wenruo [this message]
2024-05-20  9:46   ` [PATCH v4 2/6] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
2024-05-20  9:46   ` [PATCH v4 3/6] btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx) fdmanana
2024-05-20  9:46   ` [PATCH v4 4/6] btrfs: pass a btrfs_inode to btrfs_fdatawrite_range() fdmanana
2024-05-20  9:46   ` [PATCH v4 5/6] btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() fdmanana
2024-05-20  9:46   ` [PATCH v4 6/6] btrfs: use a btrfs_inode local variable at btrfs_sync_file() fdmanana
2024-05-20 10:23   ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e50e7918-373c-4349-a81d-e8b0dd0c0a88@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox