From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org, jbacik@fb.com,
Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH v3 2/2] Btrfs: don't do unnecessary delalloc flushes when relocating
Date: Tue, 10 May 2016 12:31:38 -0700 [thread overview]
Message-ID: <20160510193138.GC9317@localhost.localdomain> (raw)
In-Reply-To: <1461692724-3923-1-git-send-email-fdmanana@kernel.org>
On Tue, Apr 26, 2016 at 06:45:24PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Before we start the actual relocation process of a block group, we do
> calls to flush delalloc of all inodes and then wait for ordered extents
> to complete. However we do these flush calls just to make sure we don't
> race with concurrent tasks that have actually already started to run
> delalloc and have allocated an extent from the block group we want to
> relocate, right before we set it to readonly mode, but have not yet
> created the respective ordered extents. The flush calls make us wait
> for such concurrent tasks because they end up calling
> filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
> btrfs_run_delalloc_work()) which ends up serializing us with those tasks
> due to attempts to lock the same pages (and the delalloc flush procedure
> calls the allocator and creates the ordered extents before unlocking the
> pages).
>
> These flushing calls not only make us waste time (cpu, IO) but also reduce
> the chances of writing larger extents (applications might be writing to
> contiguous ranges and we flush before they finish dirtying the whole
> ranges).
>
> So make sure we don't flush delalloc and just wait for concurrent tasks
> that have already started flushing delalloc and have allocated an extent
> from the block group we are about to relocate.
>
> This change also ends up fixing a race with direct IO writes that makes
> relocation not wait for direct IO ordered extents. This race is
> illustrated by the following diagram:
>
> CPU 1 CPU 2
>
> btrfs_relocate_block_group(bg X)
>
> starts direct IO write,
> target inode currently has no
> ordered extents ongoing nor
> dirty pages (delalloc regions),
> therefore the root for our inode
> is not in the list
> fs_info->ordered_roots
>
> btrfs_direct_IO()
> __blockdev_direct_IO()
> btrfs_get_blocks_direct()
> btrfs_lock_extent_direct()
> locks range in the io tree
> btrfs_new_extent_direct()
> btrfs_reserve_extent()
> --> extent allocated
> from bg X
>
> btrfs_inc_block_group_ro(bg X)
>
> btrfs_start_delalloc_roots()
> __start_delalloc_inodes()
> --> does nothing, no dealloc ranges
> in the inode's io tree so the
> inode's root is not in the list
> fs_info->delalloc_roots
>
> btrfs_wait_ordered_roots()
> --> does not find the inode's root in the
> list fs_info->ordered_roots
>
> --> ends up not waiting for the direct IO
> write started by the task at CPU 2
>
> relocate_block_group(rc->stage ==
> MOVE_DATA_EXTENTS)
>
> prepare_to_relocate()
> btrfs_commit_transaction()
>
> iterates the extent tree, using its
> commit root and moves extents into new
> locations
>
> btrfs_add_ordered_extent_dio()
> --> now a ordered extent is
> created and added to the
> list root->ordered_extents
> and the root added to the
> list fs_info->ordered_roots
> --> this is too late and the
> task at CPU 1 already
> started the relocation
>
> btrfs_commit_transaction()
>
> btrfs_finish_ordered_io()
> btrfs_alloc_reserved_file_extent()
> --> adds delayed data reference
> for the extent allocated
> from bg X
>
> relocate_block_group(rc->stage ==
> UPDATE_DATA_PTRS)
>
> prepare_to_relocate()
> btrfs_commit_transaction()
> --> delayed refs are run, so an extent
> item for the allocated extent from
> bg X is added to extent tree
> --> commit roots are switched, so the
> next scan in the extent tree will
> see the extent item
>
> sees the extent in the extent tree
>
> When this happens the relocation produces the following warning when it
> finishes:
>
> [ 7260.832836] ------------[ cut here ]------------
> [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
> [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
> [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
> [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
> [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
> [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
> [ 7260.852998] Call Trace:
> [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90
> [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
> [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
> [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
> [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
> [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
> [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
> [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
> [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
> [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
> [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
> [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
> [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
> [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
> [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
> [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
> [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
> [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
>
> This is because at the end of the first stage, in relocate_block_group(),
> we commit the current transaction, which makes delayed refs run, the
> commit roots are switched and so the second stage will find the extent
> item that the ordered extent added to the delayed refs. But this extent
> was not moved (ordered extent completed after first stage finished), so
> at the end of the relocation our block group item still has a positive
> used bytes counter, triggering a warning at the end of
> btrfs_relocate_block_group(). Later on when trying to read the extent
> contents from disk we hit a BUG_ON() due to the inability to map a block
> with a logical address that belongs to the block group we relocated and
> is no longer valid, resulting in the following trace:
>
> [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
> [ 7344.887518] ------------[ cut here ]------------
> [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
> [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
> [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1
> [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
> [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
> [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282
> [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
> [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
> [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
> [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
> [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
> [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
> [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
> [ 7344.888431] Stack:
> [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
> [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
> [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
> [ 7344.888431] Call Trace:
> [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
> [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
> [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
> [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
> [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
> [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
> [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
> [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
> [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
> [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
> [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
> [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
> [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
> [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
> [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d
> [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2
> [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e
> [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
> [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
> [ 7344.888431] RSP <ffff8802046878f0>
> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>
> v3: Replaced a BUG_ON(!bg) with ASSERT(bg).
>
> fs/btrfs/ctree.h | 14 ++++++++++++
> fs/btrfs/extent-tree.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/inode.c | 8 +++++++
> fs/btrfs/relocation.c | 6 +-----
> 4 files changed, 79 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 84a6a5b..90e70e2 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1408,6 +1408,17 @@ struct btrfs_block_group_cache {
>
> struct btrfs_io_ctl io_ctl;
>
> + /*
> + * Incremented when doing extent allocations and holding a read lock
> + * on the space_info's groups_sem semaphore.
> + * Decremented when an ordered extent that represents an IO against this
> + * block group's range is created (after it's added to its inode's
> + * root's list of ordered extents) or immediately after the allocation
> + * if it's a metadata extent or fallocate extent (for these cases we
> + * don't create ordered extents).
> + */
> + atomic_t reservations;
> +
> /* Lock for free space tree operations. */
> struct mutex free_space_lock;
>
> @@ -3499,6 +3510,9 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
> struct btrfs_root *root);
> int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
> struct btrfs_root *root);
> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
> + const u64 start);
> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
> int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
> struct btrfs_root *root, unsigned long count);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0544f70..1e009bc 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6175,6 +6175,57 @@ int btrfs_exclude_logged_extents(struct btrfs_root *log,
> return 0;
> }
>
> +static void
> +btrfs_inc_block_group_reservations(struct btrfs_block_group_cache *bg)
> +{
> + atomic_inc(&bg->reservations);
> +}
> +
> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
> + const u64 start)
> +{
> + struct btrfs_block_group_cache *bg;
> +
> + bg = btrfs_lookup_block_group(fs_info, start);
> + ASSERT(bg);
> + if (atomic_dec_and_test(&bg->reservations))
> + wake_up_atomic_t(&bg->reservations);
> + btrfs_put_block_group(bg);
> +}
> +
> +static int btrfs_wait_bg_reservations_atomic_t(atomic_t *a)
> +{
> + schedule();
> + return 0;
> +}
> +
> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
> +{
> + struct btrfs_space_info *space_info = bg->space_info;
> +
> + ASSERT(bg->ro);
> +
> + if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA))
> + return;
> +
> + /*
> + * Our block group is read only but before we set it to read only,
> + * some task might have had allocated an extent from it already, but it
> + * has not yet created a respective ordered extent (and added it to a
> + * root's list of ordered extents).
> + * Therefore wait for any task currently allocating extents, since the
> + * block group's reservations counter is incremented while a read lock
> + * on the groups' semaphore is held and decremented after releasing
> + * the read access on that semaphore and creating the ordered extent.
> + */
> + down_write(&space_info->groups_sem);
> + up_write(&space_info->groups_sem);
> +
> + wait_on_atomic_t(&bg->reservations,
> + btrfs_wait_bg_reservations_atomic_t,
> + TASK_UNINTERRUPTIBLE);
> +}
> +
> /**
> * btrfs_update_reserved_bytes - update the block_group and space info counters
> * @cache: The cache we are manipulating
> @@ -7434,6 +7485,7 @@ checks:
> btrfs_add_free_space(block_group, offset, num_bytes);
> goto loop;
> }
> + btrfs_inc_block_group_reservations(block_group);
Metadata doesn't have ordered extent, we can check like
if (is_data)
btrfs_inc_block_group_reservations(block_group);
>
> /* we are all good, lets return */
> ins->objectid = search_start;
> @@ -7615,8 +7667,10 @@ again:
> WARN_ON(num_bytes < root->sectorsize);
> ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
> flags, delalloc);
> -
> - if (ret == -ENOSPC) {
> + if (!ret && !is_data) {
> + btrfs_dec_block_group_reservations(root->fs_info,
> + ins->objectid);
> + } else if (ret == -ENOSPC) {
Oh, you did this way, then it's fine.
Others are good.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Thanks,
-liubo
> if (!final_tried && ins->offset) {
> num_bytes = min(num_bytes >> 1, ins->offset);
> num_bytes = round_down(num_bytes, root->sectorsize);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 41a5688..0085899 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -824,6 +824,7 @@ retry:
> async_extent->ram_size - 1, 0);
> goto out_free_reserve;
> }
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
>
> /*
> * clear dirty, set writeback and unlock the pages.
> @@ -861,6 +862,7 @@ retry:
> }
> return;
> out_free_reserve:
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
> out_free:
> extent_clear_unlock_delalloc(inode, async_extent->start,
> @@ -1038,6 +1040,8 @@ static noinline int cow_file_range(struct inode *inode,
> goto out_drop_extent_cache;
> }
>
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
> +
> if (disk_num_bytes < cur_alloc_size)
> break;
>
> @@ -1066,6 +1070,7 @@ out:
> out_drop_extent_cache:
> btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> out_reserve:
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
> out_unlock:
> extent_clear_unlock_delalloc(inode, start, end, locked_page,
> @@ -7162,6 +7167,8 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
> return ERR_PTR(ret);
> }
>
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
> +
> em = create_pinned_em(inode, start, ins.offset, start, ins.objectid,
> ins.offset, ins.offset, ins.offset, 0);
> if (IS_ERR(em)) {
> @@ -9942,6 +9949,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> btrfs_end_transaction(trans, root);
> break;
> }
> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
>
> last_alloc = ins.offset;
> ret = insert_reserved_file_extent(trans, inode,
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 9c35ea2..2aed15c 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4253,11 +4253,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
> btrfs_info(extent_root->fs_info, "relocating block group %llu flags %llu",
> rc->block_group->key.objectid, rc->block_group->flags);
>
> - ret = btrfs_start_delalloc_roots(fs_info, 0, -1);
> - if (ret < 0) {
> - err = ret;
> - goto out;
> - }
> + btrfs_wait_block_group_reservations(rc->block_group);
> btrfs_wait_ordered_roots(fs_info, -1,
> rc->block_group->key.objectid,
> rc->block_group->key.offset);
> --
> 2.7.0.rc3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
prev parent reply other threads:[~2016-05-10 19:32 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-26 15:39 [PATCH v2 0/2] Fix for race in relocation and avoid start and wait for unrelated IO fdmanana
2016-04-26 15:39 ` [PATCH v2 1/2] Btrfs: don't wait for unrelated IO to finish before relocation fdmanana
2016-04-26 15:55 ` Josef Bacik
2016-05-10 19:23 ` Liu Bo
2016-04-26 15:39 ` [PATCH v2 2/2] Btrfs: don't do unnecessary delalloc flushes when relocating fdmanana
2016-04-26 16:02 ` Josef Bacik
2016-04-26 16:09 ` Filipe Manana
2016-04-26 19:48 ` Josef Bacik
2016-04-26 17:45 ` [PATCH v3 " fdmanana
2016-04-27 14:54 ` Josef Bacik
2016-05-10 19:31 ` Liu Bo [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160510193138.GC9317@localhost.localdomain \
--to=bo.li.liu@oracle.com \
--cc=fdmanana@kernel.org \
--cc=fdmanana@suse.com \
--cc=jbacik@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).