linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-25 13:32 [PATCH v2 4/4] " Josef Bacik
@ 2016-07-26 12:04 ` Wang Xiaoguang
  2016-07-26 15:51   ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Wang Xiaoguang @ 2016-07-26 12:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik

cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also result
in false ENOSPC error. Please see below race window:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and have free|   |
    |   space to satisfy data request.        |   |
    |                                         |   |- > btrfs_delete_unused_bgs()
    |                                         |   |    it will decrease btrfs_space_info
    |                                         |   |    total_bytes and make
    |                                         |   |    btrfs_space_info is not full.
    |                                         |   |
In this case, we may get ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
So here we introduce a new struct rw_semaphore bg_delete_sem to do this job.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v3: improve one code logic suggested by Josef, thanks.
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 44 ++++++++++++++++++++++++++++++++++++++------
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..f8609fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->bg_delete_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 
 	btrfs_init_dev_replace_locks(fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..50b440e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4135,6 +4139,19 @@ again:
 		struct btrfs_trans_handle *trans;
 
 		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (!have_bg_delete_sem) {
+			spin_unlock(&data_sinfo->lock);
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+			spin_lock(&data_sinfo->lock);
+			if (used + bytes <= data_sinfo->total_bytes)
+				goto done;
+		}
+
+		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
@@ -4156,17 +4173,20 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
 					     CHUNK_ALLOC_NO_FORCE);
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
-				if (ret != -ENOSPC)
+				if (ret != -ENOSPC) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
-				else {
+				} else {
 					have_pinned_space = 1;
 					goto commit_trans;
 				}
@@ -4200,15 +4220,19 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
-				if (ret)
+				if (ret) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
+				}
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4225,13 +4249,19 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
+		up_read(&root->fs_info->bg_delete_sem);
 		return -ENOSPC;
 	}
+
+done:
 	data_sinfo->bytes_may_use += bytes;
 	trace_btrfs_space_reservation(root->fs_info, "space_info",
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+	if (have_bg_delete_sem)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10594,6 +10624,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10721,6 +10752,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
+		up_write(&root->fs_info->bg_delete_sem);
 		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
-- 
2.9.0




^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 12:04 ` [PATCH v3] " Wang Xiaoguang
@ 2016-07-26 15:51   ` Josef Bacik
  2016-07-26 16:25     ` David Sterba
  2016-07-27  1:26     ` Wang Xiaoguang
  0 siblings, 2 replies; 12+ messages in thread
From: Josef Bacik @ 2016-07-26 15:51 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs

On 07/26/2016 08:04 AM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also result
> in false ENOSPC error. Please see below race window:
>
>                CPU1                           |             CPU2
>                                               |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>     |-> do_chunk_alloc()                      |   |
>     |   assume it returns ENOSPC, which means |   |
>     |   btrfs_space_info is full and have free|   |
>     |   space to satisfy data request.        |   |
>     |                                         |   |- > btrfs_delete_unused_bgs()
>     |                                         |   |    it will decrease btrfs_space_info
>     |                                         |   |    total_bytes and make
>     |                                         |   |    btrfs_space_info is not full.
>     |                                         |   |
> In this case, we may get ENOSPC error, but btrfs_space_info is not full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
> So here we introduce a new struct rw_semaphore bg_delete_sem to do this job.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> v3: improve one code logic suggested by Josef, thanks.
> ---
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/disk-io.c     |  1 +
>  fs/btrfs/extent-tree.c | 44 ++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7eb2913..bf0751d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -800,6 +800,7 @@ struct btrfs_fs_info {
>  	struct mutex cleaner_mutex;
>  	struct mutex chunk_mutex;
>  	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>
>  	/*
>  	 * this is taken to make sure we don't set block groups ro after
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 60ce119..f8609fd 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
>  	init_rwsem(&fs_info->commit_root_sem);
>  	init_rwsem(&fs_info->cleanup_work_sem);
>  	init_rwsem(&fs_info->subvol_sem);
> +	init_rwsem(&fs_info->bg_delete_sem);
>  	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>
>  	btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index df8d756..50b440e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	int ret = 0;
>  	int need_commit = 2;
>  	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
>
>  	/* make sure bytes are sectorsize aligned */
>  	bytes = ALIGN(bytes, root->sectorsize);
> @@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	}
>
>  	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		down_read(&root->fs_info->bg_delete_sem);
> +		have_bg_delete_sem = 1;
>  		goto alloc;
> +	}
>
>  again:
>  	/* make sure we have enough space to handle the data first */
> @@ -4135,6 +4139,19 @@ again:
>  		struct btrfs_trans_handle *trans;
>
>  		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (!have_bg_delete_sem) {
> +			spin_unlock(&data_sinfo->lock);
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +			spin_lock(&data_sinfo->lock);
> +			if (used + bytes <= data_sinfo->total_bytes)

Doh sorry I forgot to mention you need to re-calculate used here.  Also I 
noticed we already have a delete_unused_bgs_mutex, can we drop that and replace 
it with bg_delete_sem as well?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 15:51   ` Josef Bacik
@ 2016-07-26 16:25     ` David Sterba
  2016-07-27  1:26     ` Wang Xiaoguang
  1 sibling, 0 replies; 12+ messages in thread
From: David Sterba @ 2016-07-26 16:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Wang Xiaoguang, linux-btrfs

On Tue, Jul 26, 2016 at 11:51:39AM -0400, Josef Bacik wrote:
> Also I 
> noticed we already have a delete_unused_bgs_mutex, can we drop that and replace 
> it with bg_delete_sem as well?  Thanks,

Oh, yes please, making sense of all the mutexes and semaphores gets
harder with every new one.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 15:51   ` Josef Bacik
  2016-07-26 16:25     ` David Sterba
@ 2016-07-27  1:26     ` Wang Xiaoguang
  1 sibling, 0 replies; 12+ messages in thread
From: Wang Xiaoguang @ 2016-07-27  1:26 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs

hello,

On 07/26/2016 11:51 PM, Josef Bacik wrote:
> On 07/26/2016 08:04 AM, Wang Xiaoguang wrote:
>> cleaner_kthread() may run at any time, in which it'll call 
>> btrfs_delete_unused_bgs()
>> to delete unused block groups. Because this work is asynchronous, it 
>> may also result
>> in false ENOSPC error. Please see below race window:
>>
>>                CPU1                           |             CPU2
>>                                               |
>> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>>     |-> do_chunk_alloc()                      |   |
>>     |   assume it returns ENOSPC, which means |   |
>>     |   btrfs_space_info is full and have free|   |
>>     |   space to satisfy data request.        |   |
>>     |                                         |   |- > 
>> btrfs_delete_unused_bgs()
>>     |                                         |   |    it will 
>> decrease btrfs_space_info
>>     |                                         |   | total_bytes and make
>>     |                                         |   | btrfs_space_info 
>> is not full.
>>     |                                         |   |
>> In this case, we may get ENOSPC error, but btrfs_space_info is not full.
>>
>> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need 
>> to call
>> do_chunk_alloc() to allocating new chunk, we should block 
>> btrfs_delete_unused_bgs().
>> So here we introduce a new struct rw_semaphore bg_delete_sem to do 
>> this job.
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>> v3: improve one code logic suggested by Josef, thanks.
>> ---
>>  fs/btrfs/ctree.h       |  1 +
>>  fs/btrfs/disk-io.c     |  1 +
>>  fs/btrfs/extent-tree.c | 44 
>> ++++++++++++++++++++++++++++++++++++++------
>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 7eb2913..bf0751d 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -800,6 +800,7 @@ struct btrfs_fs_info {
>>      struct mutex cleaner_mutex;
>>      struct mutex chunk_mutex;
>>      struct mutex volume_mutex;
>> +    struct rw_semaphore bg_delete_sem;
>>
>>      /*
>>       * this is taken to make sure we don't set block groups ro after
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 60ce119..f8609fd 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
>>      init_rwsem(&fs_info->commit_root_sem);
>>      init_rwsem(&fs_info->cleanup_work_sem);
>>      init_rwsem(&fs_info->subvol_sem);
>> +    init_rwsem(&fs_info->bg_delete_sem);
>>      sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>>
>>      btrfs_init_dev_replace_locks(fs_info);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index df8d756..50b440e 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct 
>> inode *inode, u64 bytes)
>>      int ret = 0;
>>      int need_commit = 2;
>>      int have_pinned_space;
>> +    int have_bg_delete_sem = 0;
>>
>>      /* make sure bytes are sectorsize aligned */
>>      bytes = ALIGN(bytes, root->sectorsize);
>> @@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct 
>> inode *inode, u64 bytes)
>>      }
>>
>>      data_sinfo = fs_info->data_sinfo;
>> -    if (!data_sinfo)
>> +    if (!data_sinfo) {
>> +        down_read(&root->fs_info->bg_delete_sem);
>> +        have_bg_delete_sem = 1;
>>          goto alloc;
>> +    }
>>
>>  again:
>>      /* make sure we have enough space to handle the data first */
>> @@ -4135,6 +4139,19 @@ again:
>>          struct btrfs_trans_handle *trans;
>>
>>          /*
>> +         * We may need to allocate new chunk, so we should block
>> +         * btrfs_delete_unused_bgs()
>> +         */
>> +        if (!have_bg_delete_sem) {
>> +            spin_unlock(&data_sinfo->lock);
>> +            down_read(&root->fs_info->bg_delete_sem);
>> +            have_bg_delete_sem = 1;
>> +            spin_lock(&data_sinfo->lock);
>> +            if (used + bytes <= data_sinfo->total_bytes)
>
> Doh sorry I forgot to mention you need to re-calculate used here. Also 
> I noticed we already have a delete_unused_bgs_mutex, can we drop that 
> and replace it with bg_delete_sem as well?  Thanks,
Sorry, I also forgot to re-calculate used...
OK, I'll try to replace delete_unused_bgs_mutex with bg_delete_sem, 
please wait a while :)

Regards,
Xiaoguang Wang

>
> Josef
>
>




^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
@ 2016-09-09  8:17 Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Wang Xiaoguang @ 2016-09-09  8:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, s.priebe

cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also result
in false ENOSPC error. Please see below race window:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and have free|   |
    |   space to satisfy data request.        |   |
    |                                         |   |- > btrfs_delete_unused_bgs()
    |                                         |   |    it will decrease btrfs_space_info
    |                                         |   |    total_bytes and make
    |                                         |   |    btrfs_space_info is not full.
    |                                         |   |
In this case, we may get ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.

Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it's mutex,
we can not use it for this purpose. Of course, we can re-define it to be struct
rw_semaphore, then use it in btrfs_alloc_data_chunk_ondemand(). Either method will
work.

But given that delete_unused_bgs_mutex's name length is longer than bg_delete_sem,
I choose the first method, to create a new struct rw_semaphore bg_delete_sem and
delete delete_unused_bgs_mutex :)

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
V2: fix a deadlock revealed by fstests case btrfs/071, we call
    start_transaction() before in down_write(bg_delete_sem) in
    btrfs_delete_unused_bgs().

v3: Stefan Priebe reported another similar deadlock, so here we choose
    to not call down_read(bg_delete_sem) for free space inode in
    btrfs_alloc_data_chunk_ondemand(). Meanwhile because we only do the
    data space reservation for free space cache in the transaction context,
    btrfs_delete_unused_bgs() will either have finished its job, or start
    a new transaction waiting current transaction to complete, there will
    be no unused block groups to be deleted, so it's safe to not call
    down_read(bg_delete_sem)
---
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     | 13 +++++------
 fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.c     | 42 +++++++++++++++++------------------
 4 files changed, 76 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eff3993..fa78ef9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,6 +788,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
@@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..3cdbd05 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(root->fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
-		 * with relocation (btrfs_relocate_chunk) and relocation
-		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
-		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
-		 * unused block groups.
+		 * Acquires fs_info->bg_delete_sem to avoid racing with
+		 * relocation (btrfs_relocate_chunk) and relocation acquires
+		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
+		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
+		 * to, fs_info->cleaner_mutex when deleting unused block groups.
 		 */
 		btrfs_delete_unused_bgs(root->fs_info);
 sleep:
@@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
-	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
@@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->bg_delete_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 
 	btrfs_init_dev_replace_locks(fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8c8a4d1..f037769 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
+	bool free_space_inode = btrfs_is_free_space_inode(inode);
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
 
-	if (btrfs_is_free_space_inode(inode)) {
+	if (free_space_inode) {
 		need_commit = 0;
 		ASSERT(current->journal_info);
 	}
 
+	/*
+	 * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
+	 * there is lock order between bg_delete_sem and "wait current trans
+	 * finished". Meanwhile because we only do the data space reservation
+	 * for free space cache in the transaction context,
+	 * btrfs_delete_unused_bgs() will either have finished its job, or start
+	 * a new transaction waiting current transaction to complete, there will
+	 * be no unused block groups to be deleted, so it's safe to not call
+	 * down_read(bg_delete_sem).
+	 */
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		if (!free_space_inode) {
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+		}
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4152,6 +4169,17 @@ again:
 		struct btrfs_trans_handle *trans;
 
 		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (!have_bg_delete_sem && !free_space_inode) {
+			spin_unlock(&data_sinfo->lock);
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+			goto again;
+		}
+
+		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
@@ -4173,8 +4201,10 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
-				return PTR_ERR(trans);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				goto out;
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
@@ -4182,7 +4212,7 @@ alloc:
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
 				if (ret != -ENOSPC)
-					return ret;
+					goto out;
 				else {
 					have_pinned_space = 1;
 					goto commit_trans;
@@ -4217,15 +4247,17 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
-				return PTR_ERR(trans);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				goto out;
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
 				if (ret)
-					return ret;
+					goto out;
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4242,13 +4274,18 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
-		return -ENOSPC;
+		ret = -ENOSPC;
+		goto out;
 	}
 	data_sinfo->bytes_may_use += bytes;
 	trace_btrfs_space_reservation(root->fs_info, "space_info",
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+out:
+	if (have_bg_delete_sem && !free_space_inode)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_write(&root->fs_info->bg_delete_sem);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 035efce..408fb0c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
 	 * we release the path used to search the chunk/dev tree and before
 	 * the current task acquires this mutex and calls us.
 	 */
-	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
+	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
 
 	ret = btrfs_can_relocate(extent_root, chunk_offset);
 	if (ret)
@@ -2979,10 +2979,10 @@ again:
 	key.type = BTRFS_CHUNK_ITEM_KEY;
 
 	while (1) {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto error;
 		}
 		BUG_ON(ret == 0); /* Corruption */
@@ -2990,7 +2990,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, key.objectid,
 					  key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto error;
 		if (ret > 0)
@@ -3012,7 +3012,7 @@ again:
 			else
 				BUG_ON(ret);
 		}
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 
 		if (found_key.offset == 0)
 			break;
@@ -3568,10 +3568,10 @@ again:
 			goto error;
 		}
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_read(&fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto error;
 		}
 
@@ -3585,7 +3585,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, 0,
 					  BTRFS_CHUNK_ITEM_KEY);
 		if (ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			ret = 0;
 			break;
 		}
@@ -3595,7 +3595,7 @@ again:
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 
 		if (found_key.objectid != key.objectid) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			break;
 		}
 
@@ -3613,12 +3613,12 @@ again:
 
 		btrfs_release_path(path);
 		if (!ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
 		if (counting) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.expected++;
 			spin_unlock(&fs_info->balance_lock);
@@ -3643,7 +3643,7 @@ again:
 					count_meta < bctl->meta.limit_min)
 				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
 					count_sys < bctl->sys.limit_min)) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
@@ -3656,7 +3656,7 @@ again:
 		    !chunk_reserved && !bytes_used) {
 			trans = btrfs_start_transaction(chunk_root, 0);
 			if (IS_ERR(trans)) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				ret = PTR_ERR(trans);
 				goto error;
 			}
@@ -3665,7 +3665,7 @@ again:
 						      BTRFS_BLOCK_GROUP_DATA);
 			btrfs_end_transaction(trans, chunk_root);
 			if (ret < 0) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				goto error;
 			}
 			chunk_reserved = 1;
@@ -3673,7 +3673,7 @@ again:
 
 		ret = btrfs_relocate_chunk(chunk_root,
 					   found_key.offset);
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_read(&fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto error;
 		if (ret == -ENOSPC) {
@@ -4400,16 +4400,16 @@ again:
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	do {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto done;
 		}
 
 		ret = btrfs_previous_item(root, path, 0, key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto done;
 		if (ret) {
@@ -4423,7 +4423,7 @@ again:
 		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
 
 		if (key.objectid != device->devid) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4432,7 +4432,7 @@ again:
 		length = btrfs_dev_extent_length(l, dev_extent);
 
 		if (key.offset + length <= new_size) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4441,7 +4441,7 @@ again:
 		btrfs_release_path(path);
 
 		ret = btrfs_relocate_chunk(root, chunk_offset);
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto done;
 		if (ret == -ENOSPC)
-- 
2.9.0




^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
@ 2016-09-09  8:25 ` Wang Xiaoguang
  2016-09-09  9:02   ` David Sterba
  2016-09-09 10:18 ` Holger Hoffstätte
  2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 1 reply; 12+ messages in thread
From: Wang Xiaoguang @ 2016-09-09  8:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, s.priebe

hello David,

This patch's v2 version in your for-next-20160906 branch is still wrong, 
really sorry,
please revert it.
Stefan Priebe has reported another similar issue, thought I didn't see 
it in my
test environment. Now I choose to not call down_read(bg_delete_sem) for free
space inode, which I think can resolve these issues, please check, thanks.

Regards,
Xiaoguang Wang

On 09/09/2016 04:17 PM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also result
> in false ENOSPC error. Please see below race window:
>
>                 CPU1                           |             CPU2
>                                                |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>      |-> do_chunk_alloc()                      |   |
>      |   assume it returns ENOSPC, which means |   |
>      |   btrfs_space_info is full and have free|   |
>      |   space to satisfy data request.        |   |
>      |                                         |   |- > btrfs_delete_unused_bgs()
>      |                                         |   |    it will decrease btrfs_space_info
>      |                                         |   |    total_bytes and make
>      |                                         |   |    btrfs_space_info is not full.
>      |                                         |   |
> In this case, we may get ENOSPC error, but btrfs_space_info is not full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
> Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.
>
> Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it's mutex,
> we can not use it for this purpose. Of course, we can re-define it to be struct
> rw_semaphore, then use it in btrfs_alloc_data_chunk_ondemand(). Either method will
> work.
>
> But given that delete_unused_bgs_mutex's name length is longer than bg_delete_sem,
> I choose the first method, to create a new struct rw_semaphore bg_delete_sem and
> delete delete_unused_bgs_mutex :)
>
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> V2: fix a deadlock revealed by fstests case btrfs/071, we call
>      start_transaction() before in down_write(bg_delete_sem) in
>      btrfs_delete_unused_bgs().
>
> v3: Stefan Priebe reported another similar deadlock, so here we choose
>      to not call down_read(bg_delete_sem) for free space inode in
>      btrfs_alloc_data_chunk_ondemand(). Meanwhile because we only do the
>      data space reservation for free space cache in the transaction context,
>      btrfs_delete_unused_bgs() will either have finished its job, or start
>      a new transaction waiting current transaction to complete, there will
>      be no unused block groups to be deleted, so it's safe to not call
>      down_read(bg_delete_sem)
> ---
> ---
>   fs/btrfs/ctree.h       |  2 +-
>   fs/btrfs/disk-io.c     | 13 +++++------
>   fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
>   fs/btrfs/volumes.c     | 42 +++++++++++++++++------------------
>   4 files changed, 76 insertions(+), 40 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index eff3993..fa78ef9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -788,6 +788,7 @@ struct btrfs_fs_info {
>   	struct mutex cleaner_mutex;
>   	struct mutex chunk_mutex;
>   	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>   
>   	/*
>   	 * this is taken to make sure we don't set block groups ro after
> @@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
>   	spinlock_t unused_bgs_lock;
>   	struct list_head unused_bgs;
>   	struct mutex unused_bg_unpin_mutex;
> -	struct mutex delete_unused_bgs_mutex;
>   
>   	/* For btrfs to record security options */
>   	struct security_mnt_opts security_opts;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 54bc8c7..3cdbd05 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
>   		btrfs_run_defrag_inodes(root->fs_info);
>   
>   		/*
> -		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
> -		 * with relocation (btrfs_relocate_chunk) and relocation
> -		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
> -		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
> -		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
> -		 * unused block groups.
> +		 * Acquires fs_info->bg_delete_sem to avoid racing with
> +		 * relocation (btrfs_relocate_chunk) and relocation acquires
> +		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
> +		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
> +		 * to, fs_info->cleaner_mutex when deleting unused block groups.
>   		 */
>   		btrfs_delete_unused_bgs(root->fs_info);
>   sleep:
> @@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
>   	spin_lock_init(&fs_info->unused_bgs_lock);
>   	rwlock_init(&fs_info->tree_mod_log_lock);
>   	mutex_init(&fs_info->unused_bg_unpin_mutex);
> -	mutex_init(&fs_info->delete_unused_bgs_mutex);
>   	mutex_init(&fs_info->reloc_mutex);
>   	mutex_init(&fs_info->delalloc_root_mutex);
>   	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> @@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
>   	init_rwsem(&fs_info->commit_root_sem);
>   	init_rwsem(&fs_info->cleanup_work_sem);
>   	init_rwsem(&fs_info->subvol_sem);
> +	init_rwsem(&fs_info->bg_delete_sem);
>   	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>   
>   	btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8c8a4d1..f037769 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>   	int ret = 0;
>   	int need_commit = 2;
>   	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
> +	bool free_space_inode = btrfs_is_free_space_inode(inode);
>   
>   	/* make sure bytes are sectorsize aligned */
>   	bytes = ALIGN(bytes, root->sectorsize);
>   
> -	if (btrfs_is_free_space_inode(inode)) {
> +	if (free_space_inode) {
>   		need_commit = 0;
>   		ASSERT(current->journal_info);
>   	}
>   
> +	/*
> +	 * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
> +	 * there is lock order between bg_delete_sem and "wait current trans
> +	 * finished". Meanwhile because we only do the data space reservation
> +	 * for free space cache in the transaction context,
> +	 * btrfs_delete_unused_bgs() will either have finished its job, or start
> +	 * a new transaction waiting current transaction to complete, there will
> +	 * be no unused block groups to be deleted, so it's safe to not call
> +	 * down_read(bg_delete_sem).
> +	 */
>   	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		if (!free_space_inode) {
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +		}
>   		goto alloc;
> +	}
>   
>   again:
>   	/* make sure we have enough space to handle the data first */
> @@ -4152,6 +4169,17 @@ again:
>   		struct btrfs_trans_handle *trans;
>   
>   		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (!have_bg_delete_sem && !free_space_inode) {
> +			spin_unlock(&data_sinfo->lock);
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +			goto again;
> +		}
> +
> +		/*
>   		 * if we don't have enough free bytes in this space then we need
>   		 * to alloc a new chunk.
>   		 */
> @@ -4173,8 +4201,10 @@ alloc:
>   			 * the fs.
>   			 */
>   			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>   
>   			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
>   					     alloc_target,
> @@ -4182,7 +4212,7 @@ alloc:
>   			btrfs_end_transaction(trans, root);
>   			if (ret < 0) {
>   				if (ret != -ENOSPC)
> -					return ret;
> +					goto out;
>   				else {
>   					have_pinned_space = 1;
>   					goto commit_trans;
> @@ -4217,15 +4247,17 @@ commit_trans:
>   			}
>   
>   			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>   			if (have_pinned_space >= 0 ||
>   			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
>   				     &trans->transaction->flags) ||
>   			    need_commit > 0) {
>   				ret = btrfs_commit_transaction(trans, root);
>   				if (ret)
> -					return ret;
> +					goto out;
>   				/*
>   				 * The cleaner kthread might still be doing iput
>   				 * operations. Wait for it to finish so that
> @@ -4242,13 +4274,18 @@ commit_trans:
>   		trace_btrfs_space_reservation(root->fs_info,
>   					      "space_info:enospc",
>   					      data_sinfo->flags, bytes, 1);
> -		return -ENOSPC;
> +		ret = -ENOSPC;
> +		goto out;
>   	}
>   	data_sinfo->bytes_may_use += bytes;
>   	trace_btrfs_space_reservation(root->fs_info, "space_info",
>   				      data_sinfo->flags, bytes, 1);
>   	spin_unlock(&data_sinfo->lock);
>   
> +out:
> +	if (have_bg_delete_sem && !free_space_inode)
> +		up_read(&root->fs_info->bg_delete_sem);
> +
>   	return ret;
>   }
>   
> @@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>   		}
>   		spin_unlock(&fs_info->unused_bgs_lock);
>   
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_write(&root->fs_info->bg_delete_sem);
>   
>   		/* Don't want to race with allocators so take the groups_sem */
>   		down_write(&space_info->groups_sem);
> @@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>   end_trans:
>   		btrfs_end_transaction(trans, root);
>   next:
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_write(&root->fs_info->bg_delete_sem);
>   		btrfs_put_block_group(block_group);
>   		spin_lock(&fs_info->unused_bgs_lock);
>   	}
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 035efce..408fb0c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
>   	 * we release the path used to search the chunk/dev tree and before
>   	 * the current task acquires this mutex and calls us.
>   	 */
> -	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
> +	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
>   
>   	ret = btrfs_can_relocate(extent_root, chunk_offset);
>   	if (ret)
> @@ -2979,10 +2979,10 @@ again:
>   	key.type = BTRFS_CHUNK_ITEM_KEY;
>   
>   	while (1) {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			goto error;
>   		}
>   		BUG_ON(ret == 0); /* Corruption */
> @@ -2990,7 +2990,7 @@ again:
>   		ret = btrfs_previous_item(chunk_root, path, key.objectid,
>   					  key.type);
>   		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   		if (ret < 0)
>   			goto error;
>   		if (ret > 0)
> @@ -3012,7 +3012,7 @@ again:
>   			else
>   				BUG_ON(ret);
>   		}
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>   
>   		if (found_key.offset == 0)
>   			break;
> @@ -3568,10 +3568,10 @@ again:
>   			goto error;
>   		}
>   
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_read(&fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto error;
>   		}
>   
> @@ -3585,7 +3585,7 @@ again:
>   		ret = btrfs_previous_item(chunk_root, path, 0,
>   					  BTRFS_CHUNK_ITEM_KEY);
>   		if (ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			ret = 0;
>   			break;
>   		}
> @@ -3595,7 +3595,7 @@ again:
>   		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>   
>   		if (found_key.objectid != key.objectid) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			break;
>   		}
>   
> @@ -3613,12 +3613,12 @@ again:
>   
>   		btrfs_release_path(path);
>   		if (!ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto loop;
>   		}
>   
>   		if (counting) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			spin_lock(&fs_info->balance_lock);
>   			bctl->stat.expected++;
>   			spin_unlock(&fs_info->balance_lock);
> @@ -3643,7 +3643,7 @@ again:
>   					count_meta < bctl->meta.limit_min)
>   				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
>   					count_sys < bctl->sys.limit_min)) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto loop;
>   		}
>   
> @@ -3656,7 +3656,7 @@ again:
>   		    !chunk_reserved && !bytes_used) {
>   			trans = btrfs_start_transaction(chunk_root, 0);
>   			if (IS_ERR(trans)) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>   				ret = PTR_ERR(trans);
>   				goto error;
>   			}
> @@ -3665,7 +3665,7 @@ again:
>   						      BTRFS_BLOCK_GROUP_DATA);
>   			btrfs_end_transaction(trans, chunk_root);
>   			if (ret < 0) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>   				goto error;
>   			}
>   			chunk_reserved = 1;
> @@ -3673,7 +3673,7 @@ again:
>   
>   		ret = btrfs_relocate_chunk(chunk_root,
>   					   found_key.offset);
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_read(&fs_info->bg_delete_sem);
>   		if (ret && ret != -ENOSPC)
>   			goto error;
>   		if (ret == -ENOSPC) {
> @@ -4400,16 +4400,16 @@ again:
>   	key.type = BTRFS_DEV_EXTENT_KEY;
>   
>   	do {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			goto done;
>   		}
>   
>   		ret = btrfs_previous_item(root, path, 0, key.type);
>   		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   		if (ret < 0)
>   			goto done;
>   		if (ret) {
> @@ -4423,7 +4423,7 @@ again:
>   		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
>   
>   		if (key.objectid != device->devid) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			btrfs_release_path(path);
>   			break;
>   		}
> @@ -4432,7 +4432,7 @@ again:
>   		length = btrfs_dev_extent_length(l, dev_extent);
>   
>   		if (key.offset + length <= new_size) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			btrfs_release_path(path);
>   			break;
>   		}
> @@ -4441,7 +4441,7 @@ again:
>   		btrfs_release_path(path);
>   
>   		ret = btrfs_relocate_chunk(root, chunk_offset);
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>   		if (ret && ret != -ENOSPC)
>   			goto done;
>   		if (ret == -ENOSPC)




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:25 ` Wang Xiaoguang
@ 2016-09-09  9:02   ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2016-09-09  9:02 UTC (permalink / raw)
  To: Wang Xiaoguang; +Cc: linux-btrfs, dsterba, s.priebe

On Fri, Sep 09, 2016 at 04:25:15PM +0800, Wang Xiaoguang wrote:
> hello David,
> 
> This patch's v2 version in your for-next-20160906 branch is still wrong, 
> really sorry,
> please revert it.

Patch replaced with V3 in the upcoming for-next.

> Stefan Priebe has reported another similar issue, thought I didn't see 
> it in my
> test environment. Now I choose to not call down_read(bg_delete_sem) for free
> space inode, which I think can resolve these issues, please check, thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
@ 2016-09-09 10:18 ` Holger Hoffstätte
  2016-09-09 10:56   ` Holger Hoffstätte
  2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 1 reply; 12+ messages in thread
From: Holger Hoffstätte @ 2016-09-09 10:18 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:

> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also result
> in false ENOSPC error. 
<snip>

With this v3 I can now no longer balance (tested only with metadata).
New chunks are allocated (as balance does) but nothing ever shrinks, until
after unmount/remount, when the cleaner eventually kicks in.

This might be related to the recent patch by Naohiro Aota:
"btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

which by itself doesn't seem to do any harm (i.e. everything still seems
to work as expected).

-h


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09 10:18 ` Holger Hoffstätte
@ 2016-09-09 10:56   ` Holger Hoffstätte
  2016-09-12  7:38     ` Wang Xiaoguang
  0 siblings, 1 reply; 12+ messages in thread
From: Holger Hoffstätte @ 2016-09-09 10:56 UTC (permalink / raw)
  To: linux-btrfs, Wang Xiaoguang, Naohiro Aota

On 09/09/16 12:18, Holger Hoffstätte wrote:
> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
> 
>> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
>> to delete unused block groups. Because this work is asynchronous, it may also result
>> in false ENOSPC error. 
> <snip>
> 
> With this v3 I can now no longer balance (tested only with metadata).
> New chunks are allocated (as balance does) but nothing ever shrinks, until
> after unmount/remount, when the cleaner eventually kicks in.
> 
> This might be related to the recent patch by Naohiro Aota:
> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
> 
> which by itself doesn't seem to do any harm (i.e. everything still seems
> to work as expected).

Actually even that is not true; both patches seem to be wrong in subtle
ways. Naohiro's patch seems to prevent the deletion during balance, whereas
yours prevents the cleaner from kicking in.

As a simple reproducer you can convert from -mdup to -msingle (to create
bloat) and then balance with -musage=10. Depending on which of the two
patches are applied, you end with bloat that only grows and never shrinks,
or bloat that ends up in mixed state (dup and single).

Undoing both makes both balancing and cleaning work again.

-h


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
  2016-09-09 10:18 ` Holger Hoffstätte
@ 2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 0 replies; 12+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-09-10  6:04 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs; +Cc: dsterba, Holger Hoffstätte

Thanks,

this one works fine. No deadlocks.

Stefan

Am 09.09.2016 um 10:17 schrieb Wang Xiaoguang:
> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also result
> in false ENOSPC error. Please see below race window:
> 
>                CPU1                           |             CPU2
>                                               |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>     |-> do_chunk_alloc()                      |   |
>     |   assume it returns ENOSPC, which means |   |
>     |   btrfs_space_info is full and have free|   |
>     |   space to satisfy data request.        |   |
>     |                                         |   |- > btrfs_delete_unused_bgs()
>     |                                         |   |    it will decrease btrfs_space_info
>     |                                         |   |    total_bytes and make
>     |                                         |   |    btrfs_space_info is not full.
>     |                                         |   |
> In this case, we may get ENOSPC error, but btrfs_space_info is not full.
> 
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocating new chunk, we should block btrfs_delete_unused_bgs().
> Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.
> 
> Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it's mutex,
> we can not use it for this purpose. Of course, we can re-define it to be struct
> rw_semaphore, then use it in btrfs_alloc_data_chunk_ondemand(). Either method will
> work.
> 
> But given that delete_unused_bgs_mutex's name length is longer than bg_delete_sem,
> I choose the first method, to create a new struct rw_semaphore bg_delete_sem and
> delete delete_unused_bgs_mutex :)
> 
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> V2: fix a deadlock revealed by fstests case btrfs/071, we call
>     start_transaction() before in down_write(bg_delete_sem) in
>     btrfs_delete_unused_bgs().
> 
> v3: Stefan Priebe reported another similar deadlock, so here we choose
>     to not call down_read(bg_delete_sem) for free space inode in
>     btrfs_alloc_data_chunk_ondemand(). Meanwhile because we only do the
>     data space reservation for free space cache in the transaction context,
>     btrfs_delete_unused_bgs() will either have finished its job, or start
>     a new transaction waiting current transaction to complete, there will
>     be no unused block groups to be deleted, so it's safe to not call
>     down_read(bg_delete_sem)
> ---
> ---
>  fs/btrfs/ctree.h       |  2 +-
>  fs/btrfs/disk-io.c     | 13 +++++------
>  fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
>  fs/btrfs/volumes.c     | 42 +++++++++++++++++------------------
>  4 files changed, 76 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index eff3993..fa78ef9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -788,6 +788,7 @@ struct btrfs_fs_info {
>  	struct mutex cleaner_mutex;
>  	struct mutex chunk_mutex;
>  	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>  
>  	/*
>  	 * this is taken to make sure we don't set block groups ro after
> @@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
>  	spinlock_t unused_bgs_lock;
>  	struct list_head unused_bgs;
>  	struct mutex unused_bg_unpin_mutex;
> -	struct mutex delete_unused_bgs_mutex;
>  
>  	/* For btrfs to record security options */
>  	struct security_mnt_opts security_opts;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 54bc8c7..3cdbd05 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
>  		btrfs_run_defrag_inodes(root->fs_info);
>  
>  		/*
> -		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
> -		 * with relocation (btrfs_relocate_chunk) and relocation
> -		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
> -		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
> -		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
> -		 * unused block groups.
> +		 * Acquires fs_info->bg_delete_sem to avoid racing with
> +		 * relocation (btrfs_relocate_chunk) and relocation acquires
> +		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
> +		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
> +		 * to, fs_info->cleaner_mutex when deleting unused block groups.
>  		 */
>  		btrfs_delete_unused_bgs(root->fs_info);
>  sleep:
> @@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
>  	spin_lock_init(&fs_info->unused_bgs_lock);
>  	rwlock_init(&fs_info->tree_mod_log_lock);
>  	mutex_init(&fs_info->unused_bg_unpin_mutex);
> -	mutex_init(&fs_info->delete_unused_bgs_mutex);
>  	mutex_init(&fs_info->reloc_mutex);
>  	mutex_init(&fs_info->delalloc_root_mutex);
>  	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> @@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
>  	init_rwsem(&fs_info->commit_root_sem);
>  	init_rwsem(&fs_info->cleanup_work_sem);
>  	init_rwsem(&fs_info->subvol_sem);
> +	init_rwsem(&fs_info->bg_delete_sem);
>  	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>  
>  	btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8c8a4d1..f037769 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	int ret = 0;
>  	int need_commit = 2;
>  	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
> +	bool free_space_inode = btrfs_is_free_space_inode(inode);
>  
>  	/* make sure bytes are sectorsize aligned */
>  	bytes = ALIGN(bytes, root->sectorsize);
>  
> -	if (btrfs_is_free_space_inode(inode)) {
> +	if (free_space_inode) {
>  		need_commit = 0;
>  		ASSERT(current->journal_info);
>  	}
>  
> +	/*
> +	 * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
> +	 * there is lock order between bg_delete_sem and "wait current trans
> +	 * finished". Meanwhile because we only do the data space reservation
> +	 * for free space cache in the transaction context,
> +	 * btrfs_delete_unused_bgs() will either have finished its job, or start
> +	 * a new transaction waiting current transaction to complete, there will
> +	 * be no unused block groups to be deleted, so it's safe to not call
> +	 * down_read(bg_delete_sem).
> +	 */
>  	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		if (!free_space_inode) {
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +		}
>  		goto alloc;
> +	}
>  
>  again:
>  	/* make sure we have enough space to handle the data first */
> @@ -4152,6 +4169,17 @@ again:
>  		struct btrfs_trans_handle *trans;
>  
>  		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (!have_bg_delete_sem && !free_space_inode) {
> +			spin_unlock(&data_sinfo->lock);
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +			goto again;
> +		}
> +
> +		/*
>  		 * if we don't have enough free bytes in this space then we need
>  		 * to alloc a new chunk.
>  		 */
> @@ -4173,8 +4201,10 @@ alloc:
>  			 * the fs.
>  			 */
>  			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>  
>  			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
>  					     alloc_target,
> @@ -4182,7 +4212,7 @@ alloc:
>  			btrfs_end_transaction(trans, root);
>  			if (ret < 0) {
>  				if (ret != -ENOSPC)
> -					return ret;
> +					goto out;
>  				else {
>  					have_pinned_space = 1;
>  					goto commit_trans;
> @@ -4217,15 +4247,17 @@ commit_trans:
>  			}
>  
>  			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>  			if (have_pinned_space >= 0 ||
>  			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
>  				     &trans->transaction->flags) ||
>  			    need_commit > 0) {
>  				ret = btrfs_commit_transaction(trans, root);
>  				if (ret)
> -					return ret;
> +					goto out;
>  				/*
>  				 * The cleaner kthread might still be doing iput
>  				 * operations. Wait for it to finish so that
> @@ -4242,13 +4274,18 @@ commit_trans:
>  		trace_btrfs_space_reservation(root->fs_info,
>  					      "space_info:enospc",
>  					      data_sinfo->flags, bytes, 1);
> -		return -ENOSPC;
> +		ret = -ENOSPC;
> +		goto out;
>  	}
>  	data_sinfo->bytes_may_use += bytes;
>  	trace_btrfs_space_reservation(root->fs_info, "space_info",
>  				      data_sinfo->flags, bytes, 1);
>  	spin_unlock(&data_sinfo->lock);
>  
> +out:
> +	if (have_bg_delete_sem && !free_space_inode)
> +		up_read(&root->fs_info->bg_delete_sem);
> +
>  	return ret;
>  }
>  
> @@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		}
>  		spin_unlock(&fs_info->unused_bgs_lock);
>  
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_write(&root->fs_info->bg_delete_sem);
>  
>  		/* Don't want to race with allocators so take the groups_sem */
>  		down_write(&space_info->groups_sem);
> @@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  end_trans:
>  		btrfs_end_transaction(trans, root);
>  next:
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_write(&root->fs_info->bg_delete_sem);
>  		btrfs_put_block_group(block_group);
>  		spin_lock(&fs_info->unused_bgs_lock);
>  	}
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 035efce..408fb0c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
>  	 * we release the path used to search the chunk/dev tree and before
>  	 * the current task acquires this mutex and calls us.
>  	 */
> -	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
> +	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
>  
>  	ret = btrfs_can_relocate(extent_root, chunk_offset);
>  	if (ret)
> @@ -2979,10 +2979,10 @@ again:
>  	key.type = BTRFS_CHUNK_ITEM_KEY;
>  
>  	while (1) {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			goto error;
>  		}
>  		BUG_ON(ret == 0); /* Corruption */
> @@ -2990,7 +2990,7 @@ again:
>  		ret = btrfs_previous_item(chunk_root, path, key.objectid,
>  					  key.type);
>  		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  		if (ret < 0)
>  			goto error;
>  		if (ret > 0)
> @@ -3012,7 +3012,7 @@ again:
>  			else
>  				BUG_ON(ret);
>  		}
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>  
>  		if (found_key.offset == 0)
>  			break;
> @@ -3568,10 +3568,10 @@ again:
>  			goto error;
>  		}
>  
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_read(&fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto error;
>  		}
>  
> @@ -3585,7 +3585,7 @@ again:
>  		ret = btrfs_previous_item(chunk_root, path, 0,
>  					  BTRFS_CHUNK_ITEM_KEY);
>  		if (ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			ret = 0;
>  			break;
>  		}
> @@ -3595,7 +3595,7 @@ again:
>  		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>  
>  		if (found_key.objectid != key.objectid) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			break;
>  		}
>  
> @@ -3613,12 +3613,12 @@ again:
>  
>  		btrfs_release_path(path);
>  		if (!ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto loop;
>  		}
>  
>  		if (counting) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			spin_lock(&fs_info->balance_lock);
>  			bctl->stat.expected++;
>  			spin_unlock(&fs_info->balance_lock);
> @@ -3643,7 +3643,7 @@ again:
>  					count_meta < bctl->meta.limit_min)
>  				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
>  					count_sys < bctl->sys.limit_min)) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto loop;
>  		}
>  
> @@ -3656,7 +3656,7 @@ again:
>  		    !chunk_reserved && !bytes_used) {
>  			trans = btrfs_start_transaction(chunk_root, 0);
>  			if (IS_ERR(trans)) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>  				ret = PTR_ERR(trans);
>  				goto error;
>  			}
> @@ -3665,7 +3665,7 @@ again:
>  						      BTRFS_BLOCK_GROUP_DATA);
>  			btrfs_end_transaction(trans, chunk_root);
>  			if (ret < 0) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>  				goto error;
>  			}
>  			chunk_reserved = 1;
> @@ -3673,7 +3673,7 @@ again:
>  
>  		ret = btrfs_relocate_chunk(chunk_root,
>  					   found_key.offset);
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_read(&fs_info->bg_delete_sem);
>  		if (ret && ret != -ENOSPC)
>  			goto error;
>  		if (ret == -ENOSPC) {
> @@ -4400,16 +4400,16 @@ again:
>  	key.type = BTRFS_DEV_EXTENT_KEY;
>  
>  	do {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			goto done;
>  		}
>  
>  		ret = btrfs_previous_item(root, path, 0, key.type);
>  		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  		if (ret < 0)
>  			goto done;
>  		if (ret) {
> @@ -4423,7 +4423,7 @@ again:
>  		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
>  
>  		if (key.objectid != device->devid) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			btrfs_release_path(path);
>  			break;
>  		}
> @@ -4432,7 +4432,7 @@ again:
>  		length = btrfs_dev_extent_length(l, dev_extent);
>  
>  		if (key.offset + length <= new_size) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			btrfs_release_path(path);
>  			break;
>  		}
> @@ -4441,7 +4441,7 @@ again:
>  		btrfs_release_path(path);
>  
>  		ret = btrfs_relocate_chunk(root, chunk_offset);
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>  		if (ret && ret != -ENOSPC)
>  			goto done;
>  		if (ret == -ENOSPC)
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09 10:56   ` Holger Hoffstätte
@ 2016-09-12  7:38     ` Wang Xiaoguang
  2016-09-12  8:19       ` Holger Hoffstätte
  0 siblings, 1 reply; 12+ messages in thread
From: Wang Xiaoguang @ 2016-09-12  7:38 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs, Naohiro Aota

hello,

On 09/09/2016 06:56 PM, Holger Hoffstätte wrote:
> On 09/09/16 12:18, Holger Hoffstätte wrote:
>> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
>>
>>> cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
>>> to delete unused block groups. Because this work is asynchronous, it may also result
>>> in false ENOSPC error.
>> <snip>
>>
>> With this v3 I can now no longer balance (tested only with metadata).
>> New chunks are allocated (as balance does) but nothing ever shrinks, until
>> after unmount/remount, when the cleaner eventually kicks in.
>>
>> This might be related to the recent patch by Naohiro Aota:
>> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
>>
>> which by itself doesn't seem to do any harm (i.e. everything still seems
>> to work as expected).
> Actually even that is not true; both patches seem to be wrong in subtle
> ways. Naohiro's patch seems to prevent the deletion during balance, whereas
> yours prevents the cleaner from kicking in.
Indeed in my patch, I just change "struct mutex delete_unused_bgs_mutex"
to "struct rw_semaphore bg_delete_sem", and try to get bg_delete_sem when
we allocate data space, so this patch should work as before :)

>
> As a simple reproducer you can convert from -mdup to -msingle (to create
> bloat) and then balance with -musage=10. Depending on which of the two
> patches are applied, you end with bloat that only grows and never shrinks,
> or bloat that ends up in mixed state (dup and single).
Can you give me a simple test script to reproduce your problem.
(I can write it myself, but I'm afraid I may misunderstand you :) )

Regards,
Xiaoguang Wang
>
> Undoing both makes both balancing and cleaning work again.
>
> -h
>
>
>




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-12  7:38     ` Wang Xiaoguang
@ 2016-09-12  8:19       ` Holger Hoffstätte
  0 siblings, 0 replies; 12+ messages in thread
From: Holger Hoffstätte @ 2016-09-12  8:19 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs, Naohiro Aota

On 09/12/16 09:38, Wang Xiaoguang wrote:
<snip>
>> Actually even that is not true; both patches seem to be wrong in subtle
>> ways. Naohiro's patch seems to prevent the deletion during balance, whereas
>> yours prevents the cleaner from kicking in.
> Indeed in my patch, I just change "struct mutex delete_unused_bgs_mutex"
> to "struct rw_semaphore bg_delete_sem", and try to get bg_delete_sem when
> we allocate data space, so this patch should work as before :)
> 
>>
>> As a simple reproducer you can convert from -mdup to -msingle (to create
>> bloat) and then balance with -musage=10. Depending on which of the two
>> patches are applied, you end with bloat that only grows and never shrinks,
>> or bloat that ends up in mixed state (dup and single).
> Can you give me a simple test script to reproduce your problem.
> (I can write it myself, but I'm afraid I may misunderstand you :) )

I don't have a script and no time this week to play with this.
Just create an fs with dup metadata, balance -mconvert=single and then
back with -mconvert=dup. That's all I tried.

Holger


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-09-12  8:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-09-09  8:17 [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
2016-09-09  8:25 ` Wang Xiaoguang
2016-09-09  9:02   ` David Sterba
2016-09-09 10:18 ` Holger Hoffstätte
2016-09-09 10:56   ` Holger Hoffstätte
2016-09-12  7:38     ` Wang Xiaoguang
2016-09-12  8:19       ` Holger Hoffstätte
2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  -- strict thread matches above, loose matches on Subject: below --
2016-07-25 13:32 [PATCH v2 4/4] " Josef Bacik
2016-07-26 12:04 ` [PATCH v3] " Wang Xiaoguang
2016-07-26 15:51   ` Josef Bacik
2016-07-26 16:25     ` David Sterba
2016-07-27  1:26     ` Wang Xiaoguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).