From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
To: <linux-btrfs@vger.kernel.org>
Cc: <dsterba@suse.com>, <s.priebe@profihost.ag>
Subject: Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
Date: Fri, 9 Sep 2016 16:25:15 +0800 [thread overview]
Message-ID: <57D271EB.3000209@cn.fujitsu.com> (raw)
In-Reply-To: <20160909081748.26540-1-wangxg.fnst@cn.fujitsu.com>
hello David,
This patch's v2 version in your for-next-20160906 branch is still wrong,
really sorry,
please revert it.
Stefan Priebe has reported another similar issue, thought I didn't see
it in my
test environment. Now I choose to not call down_read(bg_delete_sem) for free
space inode, which I think can resolve these issues, please check, thanks.
Regards,
Xiaoguang Wang
On 09/09/2016 04:17 PM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also result
> in false ENOSPC error. Please see below race window:
>
> CPU1 | CPU2
> |
> |-> btrfs_alloc_data_chunk_ondemand() |-> cleaner_kthread()
> |-> do_chunk_alloc() | |
> | assume it returns ENOSPC, which means | |
> | btrfs_space_info is full and have free| |
> | space to satisfy data request. | |
> | | |- > btrfs_delete_unused_bgs()
> | | | it will decrease btrfs_space_info
> | | | total_bytes and make
> | | | btrfs_space_info is not full.
> | | |
> In this case, we may get ENOSPC error, but btrfs_space_info is not full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
> Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.
>
> Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it's mutex,
> we can not use it for this purpose. Of course, we can re-define it to be struct
> rw_semaphore, then use it in btrfs_alloc_data_chunk_ondemand(). Either method will
> work.
>
> But given that delete_unused_bgs_mutex's name length is longer than bg_delete_sem,
> I choose the first method, to create a new struct rw_semaphore bg_delete_sem and
> delete delete_unused_bgs_mutex :)
>
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> V2: fix a deadlock revealed by fstests case btrfs/071, we call
> start_transaction() before in down_write(bg_delete_sem) in
> btrfs_delete_unused_bgs().
>
> v3: Stefan Priebe reported another similar deadlock, so here we choose
> to not call down_read(bg_delete_sem) for free space inode in
> btrfs_alloc_data_chunk_ondemand(). Meanwhile because we only do the
> data space reservation for free space cache in the transaction context,
> btrfs_delete_unused_bgs() will either have finished its job, or start
> a new transaction waiting current transaction to complete, there will
> be no unused block groups to be deleted, so it's safe to not call
> down_read(bg_delete_sem)
> ---
> ---
> fs/btrfs/ctree.h | 2 +-
> fs/btrfs/disk-io.c | 13 +++++------
> fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
> fs/btrfs/volumes.c | 42 +++++++++++++++++------------------
> 4 files changed, 76 insertions(+), 40 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index eff3993..fa78ef9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -788,6 +788,7 @@ struct btrfs_fs_info {
> struct mutex cleaner_mutex;
> struct mutex chunk_mutex;
> struct mutex volume_mutex;
> + struct rw_semaphore bg_delete_sem;
>
> /*
> * this is taken to make sure we don't set block groups ro after
> @@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
> spinlock_t unused_bgs_lock;
> struct list_head unused_bgs;
> struct mutex unused_bg_unpin_mutex;
> - struct mutex delete_unused_bgs_mutex;
>
> /* For btrfs to record security options */
> struct security_mnt_opts security_opts;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 54bc8c7..3cdbd05 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
> btrfs_run_defrag_inodes(root->fs_info);
>
> /*
> - * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
> - * with relocation (btrfs_relocate_chunk) and relocation
> - * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
> - * after acquiring fs_info->delete_unused_bgs_mutex. So we
> - * can't hold, nor need to, fs_info->cleaner_mutex when deleting
> - * unused block groups.
> + * Acquires fs_info->bg_delete_sem to avoid racing with
> + * relocation (btrfs_relocate_chunk) and relocation acquires
> + * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
> + * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
> + * to, fs_info->cleaner_mutex when deleting unused block groups.
> */
> btrfs_delete_unused_bgs(root->fs_info);
> sleep:
> @@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
> spin_lock_init(&fs_info->unused_bgs_lock);
> rwlock_init(&fs_info->tree_mod_log_lock);
> mutex_init(&fs_info->unused_bg_unpin_mutex);
> - mutex_init(&fs_info->delete_unused_bgs_mutex);
> mutex_init(&fs_info->reloc_mutex);
> mutex_init(&fs_info->delalloc_root_mutex);
> mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> @@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
> init_rwsem(&fs_info->commit_root_sem);
> init_rwsem(&fs_info->cleanup_work_sem);
> init_rwsem(&fs_info->subvol_sem);
> + init_rwsem(&fs_info->bg_delete_sem);
> sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>
> btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8c8a4d1..f037769 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
> int ret = 0;
> int need_commit = 2;
> int have_pinned_space;
> + int have_bg_delete_sem = 0;
> + bool free_space_inode = btrfs_is_free_space_inode(inode);
>
> /* make sure bytes are sectorsize aligned */
> bytes = ALIGN(bytes, root->sectorsize);
>
> - if (btrfs_is_free_space_inode(inode)) {
> + if (free_space_inode) {
> need_commit = 0;
> ASSERT(current->journal_info);
> }
>
> + /*
> + * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
> + * there is lock order between bg_delete_sem and "wait current trans
> + * finished". Meanwhile because we only do the data space reservation
> + * for free space cache in the transaction context,
> + * btrfs_delete_unused_bgs() will either have finished its job, or start
> + * a new transaction waiting current transaction to complete, there will
> + * be no unused block groups to be deleted, so it's safe to not call
> + * down_read(bg_delete_sem).
> + */
> data_sinfo = fs_info->data_sinfo;
> - if (!data_sinfo)
> + if (!data_sinfo) {
> + if (!free_space_inode) {
> + down_read(&root->fs_info->bg_delete_sem);
> + have_bg_delete_sem = 1;
> + }
> goto alloc;
> + }
>
> again:
> /* make sure we have enough space to handle the data first */
> @@ -4152,6 +4169,17 @@ again:
> struct btrfs_trans_handle *trans;
>
> /*
> + * We may need to allocate new chunk, so we should block
> + * btrfs_delete_unused_bgs()
> + */
> + if (!have_bg_delete_sem && !free_space_inode) {
> + spin_unlock(&data_sinfo->lock);
> + down_read(&root->fs_info->bg_delete_sem);
> + have_bg_delete_sem = 1;
> + goto again;
> + }
> +
> + /*
> * if we don't have enough free bytes in this space then we need
> * to alloc a new chunk.
> */
> @@ -4173,8 +4201,10 @@ alloc:
> * the fs.
> */
> trans = btrfs_join_transaction(root);
> - if (IS_ERR(trans))
> - return PTR_ERR(trans);
> + if (IS_ERR(trans)) {
> + ret = PTR_ERR(trans);
> + goto out;
> + }
>
> ret = do_chunk_alloc(trans, root->fs_info->extent_root,
> alloc_target,
> @@ -4182,7 +4212,7 @@ alloc:
> btrfs_end_transaction(trans, root);
> if (ret < 0) {
> if (ret != -ENOSPC)
> - return ret;
> + goto out;
> else {
> have_pinned_space = 1;
> goto commit_trans;
> @@ -4217,15 +4247,17 @@ commit_trans:
> }
>
> trans = btrfs_join_transaction(root);
> - if (IS_ERR(trans))
> - return PTR_ERR(trans);
> + if (IS_ERR(trans)) {
> + ret = PTR_ERR(trans);
> + goto out;
> + }
> if (have_pinned_space >= 0 ||
> test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
> &trans->transaction->flags) ||
> need_commit > 0) {
> ret = btrfs_commit_transaction(trans, root);
> if (ret)
> - return ret;
> + goto out;
> /*
> * The cleaner kthread might still be doing iput
> * operations. Wait for it to finish so that
> @@ -4242,13 +4274,18 @@ commit_trans:
> trace_btrfs_space_reservation(root->fs_info,
> "space_info:enospc",
> data_sinfo->flags, bytes, 1);
> - return -ENOSPC;
> + ret = -ENOSPC;
> + goto out;
> }
> data_sinfo->bytes_may_use += bytes;
> trace_btrfs_space_reservation(root->fs_info, "space_info",
> data_sinfo->flags, bytes, 1);
> spin_unlock(&data_sinfo->lock);
>
> +out:
> + if (have_bg_delete_sem && !free_space_inode)
> + up_read(&root->fs_info->bg_delete_sem);
> +
> return ret;
> }
>
> @@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> }
> spin_unlock(&fs_info->unused_bgs_lock);
>
> - mutex_lock(&fs_info->delete_unused_bgs_mutex);
> + down_write(&root->fs_info->bg_delete_sem);
>
> /* Don't want to race with allocators so take the groups_sem */
> down_write(&space_info->groups_sem);
> @@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> end_trans:
> btrfs_end_transaction(trans, root);
> next:
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_write(&root->fs_info->bg_delete_sem);
> btrfs_put_block_group(block_group);
> spin_lock(&fs_info->unused_bgs_lock);
> }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 035efce..408fb0c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
> * we release the path used to search the chunk/dev tree and before
> * the current task acquires this mutex and calls us.
> */
> - ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
> + ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
>
> ret = btrfs_can_relocate(extent_root, chunk_offset);
> if (ret)
> @@ -2979,10 +2979,10 @@ again:
> key.type = BTRFS_CHUNK_ITEM_KEY;
>
> while (1) {
> - mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> + down_read(&root->fs_info->bg_delete_sem);
> ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
> if (ret < 0) {
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> goto error;
> }
> BUG_ON(ret == 0); /* Corruption */
> @@ -2990,7 +2990,7 @@ again:
> ret = btrfs_previous_item(chunk_root, path, key.objectid,
> key.type);
> if (ret)
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> if (ret < 0)
> goto error;
> if (ret > 0)
> @@ -3012,7 +3012,7 @@ again:
> else
> BUG_ON(ret);
> }
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
>
> if (found_key.offset == 0)
> break;
> @@ -3568,10 +3568,10 @@ again:
> goto error;
> }
>
> - mutex_lock(&fs_info->delete_unused_bgs_mutex);
> + down_read(&fs_info->bg_delete_sem);
> ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
> if (ret < 0) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> goto error;
> }
>
> @@ -3585,7 +3585,7 @@ again:
> ret = btrfs_previous_item(chunk_root, path, 0,
> BTRFS_CHUNK_ITEM_KEY);
> if (ret) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> ret = 0;
> break;
> }
> @@ -3595,7 +3595,7 @@ again:
> btrfs_item_key_to_cpu(leaf, &found_key, slot);
>
> if (found_key.objectid != key.objectid) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> break;
> }
>
> @@ -3613,12 +3613,12 @@ again:
>
> btrfs_release_path(path);
> if (!ret) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> goto loop;
> }
>
> if (counting) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> spin_lock(&fs_info->balance_lock);
> bctl->stat.expected++;
> spin_unlock(&fs_info->balance_lock);
> @@ -3643,7 +3643,7 @@ again:
> count_meta < bctl->meta.limit_min)
> || ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
> count_sys < bctl->sys.limit_min)) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> goto loop;
> }
>
> @@ -3656,7 +3656,7 @@ again:
> !chunk_reserved && !bytes_used) {
> trans = btrfs_start_transaction(chunk_root, 0);
> if (IS_ERR(trans)) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> ret = PTR_ERR(trans);
> goto error;
> }
> @@ -3665,7 +3665,7 @@ again:
> BTRFS_BLOCK_GROUP_DATA);
> btrfs_end_transaction(trans, chunk_root);
> if (ret < 0) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> goto error;
> }
> chunk_reserved = 1;
> @@ -3673,7 +3673,7 @@ again:
>
> ret = btrfs_relocate_chunk(chunk_root,
> found_key.offset);
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> + up_read(&fs_info->bg_delete_sem);
> if (ret && ret != -ENOSPC)
> goto error;
> if (ret == -ENOSPC) {
> @@ -4400,16 +4400,16 @@ again:
> key.type = BTRFS_DEV_EXTENT_KEY;
>
> do {
> - mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> + down_read(&root->fs_info->bg_delete_sem);
> ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> if (ret < 0) {
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> goto done;
> }
>
> ret = btrfs_previous_item(root, path, 0, key.type);
> if (ret)
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> if (ret < 0)
> goto done;
> if (ret) {
> @@ -4423,7 +4423,7 @@ again:
> btrfs_item_key_to_cpu(l, &key, path->slots[0]);
>
> if (key.objectid != device->devid) {
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> btrfs_release_path(path);
> break;
> }
> @@ -4432,7 +4432,7 @@ again:
> length = btrfs_dev_extent_length(l, dev_extent);
>
> if (key.offset + length <= new_size) {
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> btrfs_release_path(path);
> break;
> }
> @@ -4441,7 +4441,7 @@ again:
> btrfs_release_path(path);
>
> ret = btrfs_relocate_chunk(root, chunk_offset);
> - mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> + up_read(&root->fs_info->bg_delete_sem);
> if (ret && ret != -ENOSPC)
> goto done;
> if (ret == -ENOSPC)
next prev parent reply other threads:[~2016-09-09 8:29 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-09-09 8:17 [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
2016-09-09 8:25 ` Wang Xiaoguang [this message]
2016-09-09 9:02 ` David Sterba
2016-09-09 10:18 ` Holger Hoffstätte
2016-09-09 10:56 ` Holger Hoffstätte
2016-09-12 7:38 ` Wang Xiaoguang
2016-09-12 8:19 ` Holger Hoffstätte
2016-09-10 6:04 ` Stefan Priebe - Profihost AG
-- strict thread matches above, loose matches on Subject: below --
2016-07-25 13:32 [PATCH v2 4/4] " Josef Bacik
2016-07-26 12:04 ` [PATCH v3] " Wang Xiaoguang
2016-07-26 15:51 ` Josef Bacik
2016-07-26 16:25 ` David Sterba
2016-07-27 1:26 ` Wang Xiaoguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57D271EB.3000209@cn.fujitsu.com \
--to=wangxg.fnst@cn.fujitsu.com \
--cc=dsterba@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=s.priebe@profihost.ag \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.