Re: [PATCH] btrfs: fix deadlock with fiemap and extent locking

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: David Sterba <dsterba@suse.cz>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: fix deadlock with fiemap and extent locking
Date: Tue, 6 Feb 2024 21:54:45 +0100	[thread overview]
Message-ID: <20240206205445.GR355@twin.jikos.cz> (raw)
In-Reply-To: <47ac92a6c8be53a5e10add9315255460c062b52d.1706817512.git.josef@toxicpanda.com>

On Thu, Feb 01, 2024 at 02:58:54PM -0500, Josef Bacik wrote:
> While working on the patchset to remove extent locking I got a lockdep
> splat with fiemap and pagefaulting with my new extent lock replacement
> lock.
> 
> This deadlock exists with our normal code, we just don't have lockdep
> annotations with the extent locking so we've never noticed it.
> 
> Since we're copying the fiemap extent to user space on every iteration
> we have the chance of pagefaulting.  Because we hold the extent lock for
> the entire range we could mkwrite into a range in the file that we have
> mmap'ed.  This would deadlock with the following stack trace
> 
> [<0>] lock_extent+0x28d/0x2f0
> [<0>] btrfs_page_mkwrite+0x273/0x8a0
> [<0>] do_page_mkwrite+0x50/0xb0
> [<0>] do_fault+0xc1/0x7b0
> [<0>] __handle_mm_fault+0x2fa/0x460
> [<0>] handle_mm_fault+0xa4/0x330
> [<0>] do_user_addr_fault+0x1f4/0x800
> [<0>] exc_page_fault+0x7c/0x1e0
> [<0>] asm_exc_page_fault+0x26/0x30
> [<0>] rep_movs_alternative+0x33/0x70
> [<0>] _copy_to_user+0x49/0x70
> [<0>] fiemap_fill_next_extent+0xc8/0x120
> [<0>] emit_fiemap_extent+0x4d/0xa0
> [<0>] extent_fiemap+0x7f8/0xad0
> [<0>] btrfs_fiemap+0x49/0x80
> [<0>] __x64_sys_ioctl+0x3e1/0xb50
> [<0>] do_syscall_64+0x94/0x1a0
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> 
> I wrote an fstest to reproduce this deadlock without my replacement lock
> and verified that the deadlock exists with our existing locking.
> 
> To fix this simply don't take the extent lock for the entire duration of
> the fiemap.  This is safe in general because we keep track of where we
> are when we're searching the tree, so if an ordered extent updates in
> the middle of our fiemap call we'll still emit the correct extents
> because we know what offset we were on before.
> 
> The only place we maintain the lock is searching delalloc.  Since the
> delalloc stuff can change during writeback we want to lock the extent
> range so we have a consistent view of delalloc at the time we're
> checking to see if we need to set the delalloc flag.
> 
> With this patch applied we no longer deadlock with my testcase.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent_io.c | 49 +++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8648ea9b5fb5..f8b68249d958 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2683,16 +2683,25 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
>  	 * it beyond i_size.
>  	 */
>  	while (cur_offset < end && cur_offset < i_size) {
> +		struct extent_state *cached_state = NULL;
>  		u64 delalloc_start;
>  		u64 delalloc_end;
>  		u64 prealloc_start;
> +		u64 lockstart, lockend;
>  		u64 prealloc_len = 0;
>  		bool delalloc;
>  
> +		lockstart = round_down(cur_offset,
> +				       inode->root->fs_info->sectorsize);
> +		lockend = round_up(end, inode->root->fs_info->sectorsize);
> +
> +		lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);

This could use a comment why the locking is ok only for delalloc. This
is quite a step from 'the whole range'.

>  		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
>  							delalloc_cached_state,
>  							&delalloc_start,
>  							&delalloc_end);


> +		unlock_extent(&inode->io_tree, lockstart, lockend,
> +			      &cached_state);
>  		if (!delalloc)
>  			break;
>  
> @@ -2860,15 +2869,14 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  		  u64 start, u64 len)
>  {
>  	const u64 ino = btrfs_ino(inode);
> -	struct extent_state *cached_state = NULL;
>  	struct extent_state *delalloc_cached_state = NULL;
>  	struct btrfs_path *path;
>  	struct fiemap_cache cache = { 0 };
>  	struct btrfs_backref_share_check_ctx *backref_ctx;
>  	u64 last_extent_end;
>  	u64 prev_extent_end;
> -	u64 lockstart;
> -	u64 lockend;
> +	u64 align_start;
> +	u64 align_end;

This could be range_start and range_end, in the context where it appears
it's IMHO more clear.

>  	bool stopped = false;
>  	int ret;
>  
> @@ -2879,12 +2887,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  		goto out;
>  	}
>  
> -	lockstart = round_down(start, inode->root->fs_info->sectorsize);
> -	lockend = round_up(start + len, inode->root->fs_info->sectorsize);
> -	prev_extent_end = lockstart;
> +	align_start = round_down(start, inode->root->fs_info->sectorsize);
> +	align_end = round_up(start + len, inode->root->fs_info->sectorsize);
> +	prev_extent_end = align_start;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> -	lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>  
>  	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
>  	if (ret < 0)
> @@ -2892,7 +2899,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  	btrfs_release_path(path);
>  
>  	path->reada = READA_FORWARD;
> -	ret = fiemap_search_slot(inode, path, lockstart);
> +	ret = fiemap_search_slot(inode, path, align_start);
>  	if (ret < 0) {
>  		goto out_unlock;
>  	} else if (ret > 0) {
> @@ -2904,7 +2911,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  		goto check_eof_delalloc;
>  	}
>  
> -	while (prev_extent_end < lockend) {
> +	while (prev_extent_end < align_end) {
>  		struct extent_buffer *leaf = path->nodes[0];
>  		struct btrfs_file_extent_item *ei;
>  		struct btrfs_key key;
> @@ -2927,14 +2934,14 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  		 * The first iteration can leave us at an extent item that ends
>  		 * before our range's start. Move to the next item.
>  		 */
> -		if (extent_end <= lockstart)
> +		if (extent_end <= align_start)
>  			goto next_item;
>  
>  		backref_ctx->curr_leaf_bytenr = leaf->start;
>  
>  		/* We have in implicit hole (NO_HOLES feature enabled). */
>  		if (prev_extent_end < key.offset) {
> -			const u64 range_end = min(key.offset, lockend) - 1;
> +			const u64 range_end = min(key.offset, align_end) - 1;
>  
>  			ret = fiemap_process_hole(inode, fieinfo, &cache,
>  						  &delalloc_cached_state,
> @@ -2949,7 +2956,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  			}
>  
>  			/* We've reached the end of the fiemap range, stop. */
> -			if (key.offset >= lockend) {
> +			if (key.offset >= align_end) {
>  				stopped = true;
>  				break;
>  			}
> @@ -3043,29 +3050,40 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  	btrfs_free_path(path);
>  	path = NULL;
>  
> -	if (!stopped && prev_extent_end < lockend) {
> +	if (!stopped && prev_extent_end < align_end) {
>  		ret = fiemap_process_hole(inode, fieinfo, &cache,
>  					  &delalloc_cached_state, backref_ctx,
> -					  0, 0, 0, prev_extent_end, lockend - 1);
> +					  0, 0, 0, prev_extent_end, align_end - 1);
>  		if (ret < 0)
>  			goto out_unlock;
> -		prev_extent_end = lockend;
> +		prev_extent_end = align_end;
>  	}
>  
>  	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
>  		const u64 i_size = i_size_read(&inode->vfs_inode);
>  
>  		if (prev_extent_end < i_size) {
> +			struct extent_state *cached_state = NULL;
>  			u64 delalloc_start;
>  			u64 delalloc_end;
> +			u64 lockstart, lockend;
>  			bool delalloc;
>  
> +			lockstart = round_down(prev_extent_end,
> +					       inode->root->fs_info->sectorsize);
> +			lockend = round_up(i_size,
> +					   inode->root->fs_info->sectorsize);
> +
> +			lock_extent(&inode->io_tree, lockstart, lockend,
> +				    &cached_state);

And same comment about locking only delalloc range.

>  			delalloc = btrfs_find_delalloc_in_range(inode,
>  								prev_extent_end,
>  								i_size - 1,
>  								&delalloc_cached_state,
>  								&delalloc_start,
>  								&delalloc_end);
> +			unlock_extent(&inode->io_tree, lockstart, lockend,
> +				      &cached_state);
>  			if (!delalloc)
>  				cache.flags |= FIEMAP_EXTENT_LAST;
>  		} else {
> @@ -3076,7 +3094,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>  	ret = emit_last_fiemap_cache(fieinfo, &cache);
>  
>  out_unlock:
> -	unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>  out:
>  	free_extent_state(delalloc_cached_state);
> -- 
> 2.43.0
>

     prev parent reply	other threads:[~2024-02-06 20:55 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-01 19:58 [PATCH] btrfs: fix deadlock with fiemap and extent locking Josef Bacik
2024-02-02 11:40 ` Filipe Manana
2024-02-06 20:54 ` David Sterba [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240206205445.GR355@twin.jikos.cz \
    --to=dsterba@suse.cz \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox