linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] Btrfs: fix invalid attempt to free reserved space on failure to cow range
Date: Tue, 7 Mar 2017 14:03:48 -0800	[thread overview]
Message-ID: <20170307220348.GG12408@lim.localdomain> (raw)
In-Reply-To: <1488860689-6008-1-git-send-email-fdmanana@kernel.org>

On Tue, Mar 07, 2017 at 04:24:49AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When attempting to COW a file range (we are starting writeback and doing
> COW), if we manage to reserve an extent for the range we will write into
> but fail after reserving it and before creating the respective ordered
> extent, we end up in an error path where we attempt to decrement the
> data space's bytes_may_use counter after we already did it while
> reserving the extent, leading to a warning/trace like the following:
> 
> [  847.621524] ------------[ cut here ]------------
> [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
> [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
> [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
> [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [  847.648601] Call Trace:
> [  847.648601]  dump_stack+0x67/0x90
> [  847.648601]  __warn+0xc2/0xdd
> [  847.648601]  warn_slowpath_null+0x1d/0x1f
> [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
> [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
> [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
> [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
> [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
> [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
> [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
> [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
> [  847.648601]  ? arch_local_irq_save+0x9/0xc
> [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
> [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
> [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
> [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
> [  847.648601]  ? arch_local_irq_save+0x9/0xc
> [  847.648601]  ? mark_lock+0x24/0x201
> [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
> [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
> [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
> [  847.648601]  do_writepages+0x23/0x2c
> [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
> [  847.648601]  filemap_fdatawrite_range+0x13/0x15
> [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
> [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
> [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
> [  847.648601]  vfs_fsync_range+0x8c/0x9e
> [  847.648601]  vfs_fsync+0x1c/0x1e
> [  847.648601]  do_fsync+0x31/0x4a
> [  847.648601]  SyS_fsync+0x10/0x14
> [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
> [  847.648601] RIP: 0033:0x7f5b05200800
> [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
> [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
> [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
> [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
> [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
> [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
> [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
> 
> So fix this by not attempting to decrement the data space info's
> bytes_may_use counter if we already reserved the extent and an error
> happened before creating the ordered extent. We are already correctly
> freeing the reserved extent if an error happens, so there's no additional
> measure needed.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Removed previous dependence on a no longer needed cleanup patch and
>     rebased it on top of Qu's recent patchset that fixes hanging ordered
>     extents when errors happen.
> 
>  fs/btrfs/extent_io.h |  4 +++-
>  fs/btrfs/inode.c     | 56 ++++++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 44 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 3e4fad4..d31313d 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -21,8 +21,10 @@
>  #define EXTENT_NORESERVE	(1U << 15)
>  #define EXTENT_QGROUP_RESERVED	(1U << 16)
>  #define EXTENT_CLEAR_DATA_RESV	(1U << 17)
> +#define EXTENT_SPACE_FREED	(1U << 18)
>  #define EXTENT_IOBITS		(EXTENT_LOCKED | EXTENT_WRITEBACK)
> -#define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
> +#define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC |\
> +				 EXTENT_SPACE_FREED)
>  
>  /*
>   * flags for bio submission. The high bits indicate the compression
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9b03eb9b..9714453 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -942,10 +942,13 @@ static noinline int cow_file_range(struct inode *inode,
>  	u64 num_bytes;
>  	unsigned long ram_size;
>  	u64 disk_num_bytes;
> -	u64 cur_alloc_size;
> +	u64 cur_alloc_size = 0;
>  	u64 blocksize = fs_info->sectorsize;
>  	struct btrfs_key ins;
>  	struct extent_map *em;
> +	unsigned clear_bits;
> +	unsigned long page_ops;
> +	bool extent_reserved = false;
>  	int ret = 0;
>  
>  	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> @@ -990,14 +993,14 @@ static noinline int cow_file_range(struct inode *inode,
>  			start + num_bytes - 1, 0);
>  
>  	while (disk_num_bytes > 0) {
> -		unsigned long op;
> -
>  		cur_alloc_size = disk_num_bytes;
>  		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
>  					   fs_info->sectorsize, 0, alloc_hint,
>  					   &ins, 1, 1);
>  		if (ret < 0)
>  			goto out_unlock;
> +		cur_alloc_size = ins.offset;
> +		extent_reserved = true;
>  
>  		ram_size = ins.offset;
>  		em = create_io_em(inode, start, ins.offset, /* len */
> @@ -1012,7 +1015,6 @@ static noinline int cow_file_range(struct inode *inode,
>  			goto out_reserve;
>  		free_extent_map(em);
>  
> -		cur_alloc_size = ins.offset;
>  		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
>  					       ram_size, cur_alloc_size, 0);
>  		if (ret)
> @@ -1047,14 +1049,14 @@ static noinline int cow_file_range(struct inode *inode,
>  		 * Do set the Private2 bit so we know this page was properly
>  		 * setup for writepage
>  		 */
> -		op = unlock ? PAGE_UNLOCK : 0;
> -		op |= PAGE_SET_PRIVATE2;
> +		page_ops = unlock ? PAGE_UNLOCK : 0;
> +		page_ops |= PAGE_SET_PRIVATE2;
>  
>  		extent_clear_unlock_delalloc(inode, start,
>  					     start + ram_size - 1,
>  					     delalloc_end, locked_page,
>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
> -					     op);
> +					     page_ops);
>  		if (disk_num_bytes > cur_alloc_size)
>  			disk_num_bytes = 0;
>  		else
> @@ -1062,6 +1064,7 @@ static noinline int cow_file_range(struct inode *inode,
>  		num_bytes -= cur_alloc_size;
>  		alloc_hint = ins.objectid + ins.offset;
>  		start += cur_alloc_size;
> +		extent_reserved = false;
>  
>  		/*
>  		 * btrfs_reloc_clone_csums() error, since start is increased
> @@ -1080,12 +1083,35 @@ static noinline int cow_file_range(struct inode *inode,
>  	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
>  out_unlock:
> +	clear_bits = EXTENT_LOCKED | EXTENT_DO_ACCOUNTING | EXTENT_DELALLOC |
> +		EXTENT_DEFRAG;
> +	page_ops = PAGE_UNLOCK | PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
> +		PAGE_END_WRITEBACK;
> +	/*
> +	 * If we reserved an extent for our delalloc range (or a subrange) and
> +	 * failed to create the respective ordered extent, then it means that
> +	 * when we reserved the extent we decremented the extent's size from
> +	 * the data space_info's bytes_may_use counter and incremented the
> +	 * space_info's bytes_reserved counter by the same amount. We must make
> +	 * sure extent_clear_unlock_delalloc() does not try to decrement again
> +	 * the data space_info's bytes_may_use counter, therefore we pass it the
> +	 * flag EXTENT_SPACE_FREED.
> +	 */

EXTENT_SPACE_FREED is to free associated metadata only while EXTENT_CLEAR_DATA_RESV is to free data only,
EXTENT_DO_ACCOUNTING is to free both.

Can we just do
#define EXTENT_DO_ACCOUNTING (EXTENT_SPACE_FREED | EXTENT_CLEAR_DATA_RESV)
?

Thanks,

-liubo
> +	if (extent_reserved) {
> +		extent_clear_unlock_delalloc(inode, start,
> +					     start + cur_alloc_size,
> +					     start + cur_alloc_size,
> +					     locked_page,
> +					     clear_bits | EXTENT_SPACE_FREED,
> +					     page_ops);
> +		start += cur_alloc_size;
> +		if (start >= end)
> +			goto out;
> +	}
>  	extent_clear_unlock_delalloc(inode, start, end, delalloc_end,
>  				     locked_page,
> -				     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> -				     EXTENT_DELALLOC | EXTENT_DEFRAG,
> -				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
> -				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
> +				     clear_bits,
> +				     page_ops);
>  	goto out;
>  }
>  
> @@ -1794,10 +1820,10 @@ static void btrfs_clear_bit_hook(struct btrfs_inode *inode,
>  		if (btrfs_is_testing(fs_info))
>  			return;
>  
> -		if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
> -		    && do_list && !(state->state & EXTENT_NORESERVE)
> -		    && (*bits & (EXTENT_DO_ACCOUNTING |
> -		    EXTENT_CLEAR_DATA_RESV)))
> +		if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID &&
> +		    do_list && !(state->state & EXTENT_NORESERVE) &&
> +		    (*bits & (EXTENT_DO_ACCOUNTING | EXTENT_CLEAR_DATA_RESV)) &&
> +		    !(*bits & EXTENT_SPACE_FREED))
>  			btrfs_free_reserved_data_space_noquota(
>  					&inode->vfs_inode,
>  					state->start, len);
> -- 
> 2.7.0.rc3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2017-03-08  9:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-07  4:24 [PATCH v2] Btrfs: fix invalid attempt to free reserved space on failure to cow range fdmanana
2017-03-07 20:34 ` [PATCH v3] " fdmanana
2017-03-07 22:03 ` Liu Bo [this message]
2017-03-08  0:34   ` [PATCH v2] " Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170307220348.GG12408@lim.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).