linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
@ 2018-07-11 15:59 Ethan Lien
  2018-07-12  7:07 ` Nikolay Borisov
  2018-07-12 22:19 ` Omar Sandoval
  0 siblings, 2 replies; 5+ messages in thread
From: Ethan Lien @ 2018-07-11 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ethan Lien

In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check
it to decide whether we can satisfy the allocation by committing the
current transaction. For every data/metadata extent we are going to free,
we add to total_bytes_pinned in btrfs_free_extent() and
btrfs_free_tree_block(), and release it in unpin_extent_range() when we
finish the transaction. So this is a variable we update frequently but
read rarely - exactly the right use case for a percpu_counter. But that
commit updates total_bytes_pinned with the default batch size of 32, and
since the deltas we add are whole extent sizes, every update overflows
the batch and becomes a spin lock protected update. Every spin
lock/unlock operation involves syncing a globally shared variable and
some kind of barrier on an SMP system, so this is more expensive than
keeping total_bytes_pinned in a plain atomic64_t. Fix this by using a
customized batch size. Since we only read total_bytes_pinned when we are
close to ENOSPC and fail to allocate a new chunk, we can use a really
large batch size and pay nearly no penalty in most cases.
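
To make the batching concrete, here is a simplified sketch of the logic
behind percpu_counter_add_batch() (paraphrased from lib/percpu_counter.c,
not the verbatim source; the real function also disables interrupts
around the lock):

void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
			      s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		/* Slow path: fold into the shared counter under the lock. */
		raw_spin_lock(&fbc->lock);
		fbc->count += count;
		__this_cpu_sub(*fbc->counters, count - amount);
		raw_spin_unlock(&fbc->lock);
	} else {
		/* Fast path: the delta stays in this cpu's local counter. */
		this_cpu_add(*fbc->counters, amount);
	}
	preempt_enable();
}

Since every extent we pin or unpin is at least 4096 bytes, with batch = 32
the slow path is taken on every single call; with a batch of SZ_128M the
fast path absorbs almost all of them.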


[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB test file.
2. take a snapshot (so all following writes will be COW writes).
3. run a 180 sec, 4 jobs, 4K random write fio job on the test file.

We also add a temporary lockdep class to the percpu_counter's spin lock
used by total_bytes_pinned so we can track it with lock_stat.


[Results]
lock_stat version 0.4, class total_bytes_pinned_percpu:

                     unpatched        patched
con-bounces                 82              1
contentions                 82              1
waittime-min              0.21           0.62
waittime-max              0.61           0.62
waittime-total           29.46           0.62
waittime-avg              0.36           0.62
acq-bounces             298340          13601
acquisitions            635973          31542
holdtime-min              0.09           0.14
holdtime-max             11.01           9.61
holdtime-total       173476.25       11016.90
holdtime-avg              0.27           0.35


[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (the number of lock acquisitions that had to wait) are low in
both the unpatched and the patched version. But when we look at
acquisitions and acq-bounces, we get much lower counts in the patched
version. The most important metric here is acq-bounces: it counts how
many times the lock is transferred between different cpus, so the patch
really does reduce cacheline bouncing of the spin lock (and of the
percpu_counter's global counter) on an SMP system.
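
On the reader side a huge batch is harmless because
__percpu_counter_compare() falls back to an exact sum whenever the
approximate value is too close to the threshold to call. A simplified
sketch (paraphrased from lib/percpu_counter.c):

int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64 count;

	count = percpu_counter_read(fbc);
	/* If the possible error can't change the answer, the rough count is enough. */
	if (abs(count - rhs) > (batch * num_online_cpus()))
		return count > rhs ? 1 : -1;
	/* Otherwise walk all per-cpu counters for the exact value. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	return 0;
}

The exact sum costs a walk over all cpus, which is why the read overhead
is mostly related to the number of cpus rather than to the batch size.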

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
---

V2:
	Rewrite commit comments.
	Add lock_stat test.
	Pull dirty_metadata_bytes out to a separate patch.

 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 46 ++++++++++++++++++++++++++++--------------
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 
 	struct list_head list;
 	/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..937113534ef4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+		    space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 			flags = BTRFS_BLOCK_GROUP_METADATA;
 		space_info = __find_space_info(fs_info, flags);
 		ASSERT(space_info);
-		percpu_counter_add(&space_info->total_bytes_pinned,
-				   -head->num_bytes);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+				   -head->num_bytes,
+				   space_info->total_bytes_pinned_batch);
 
 		if (head->is_data) {
 			spin_lock(&delayed_refs->lock);
@@ -4024,6 +4026,13 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 		kfree(space_info);
 		return ret;
 	}
+	/*
+	 * Use large batch size to reduce update overhead.
+	 * For reader side, we only read it when we are close to ENOSPC and
+	 * the read overhead is mostly related to # of cpus, so it is OK to
+	 * use arbitrary large value here.
+	 */
+	space_info->total_bytes_pinned_batch = SZ_128M;
 
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
 		INIT_LIST_HEAD(&space_info->block_groups[i]);
@@ -4309,9 +4318,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 		 * allocation, and no removed chunk in current transaction,
 		 * don't bother committing the transaction.
 		 */
-		have_pinned_space = percpu_counter_compare(
+		have_pinned_space = __percpu_counter_compare(
 			&data_sinfo->total_bytes_pinned,
-			used + bytes - data_sinfo->total_bytes);
+			used + bytes - data_sinfo->total_bytes,
+			data_sinfo->total_bytes_pinned_batch);
 		spin_unlock(&data_sinfo->lock);
 
 		/* commit the current transaction and try again */
@@ -4912,8 +4922,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 		return 0;
 
 	/* See if there is enough pinned space to make this reservation */
-	if (percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes) >= 0)
+	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
+				   bytes,
+				   space_info->total_bytes_pinned_batch) >= 0)
 		goto commit;
 
 	/*
@@ -4930,8 +4941,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 		bytes -= delayed_rsv->size;
 	spin_unlock(&delayed_rsv->lock);
 
-	if (percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes) < 0) {
+	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
+				   bytes,
+				   space_info->total_bytes_pinned_batch) < 0) {
 		return -ENOSPC;
 	}
 
@@ -6268,8 +6280,9 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			trace_btrfs_space_reservation(info, "pinned",
 						      cache->space_info->flags,
 						      num_bytes, 1);
-			percpu_counter_add(&cache->space_info->total_bytes_pinned,
-					   num_bytes);
+			percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
+					   num_bytes,
+					   cache->space_info->total_bytes_pinned_batch);
 			set_extent_dirty(info->pinned_extents,
 					 bytenr, bytenr + num_bytes - 1,
 					 GFP_NOFS | __GFP_NOFAIL);
@@ -6347,7 +6360,8 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
 
 	trace_btrfs_space_reservation(fs_info, "pinned",
 				      cache->space_info->flags, num_bytes, 1);
-	percpu_counter_add(&cache->space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
+		    num_bytes, cache->space_info->total_bytes_pinned_batch);
 	set_extent_dirty(fs_info->pinned_extents, bytenr,
 			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
 	return 0;
@@ -6711,7 +6725,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		trace_btrfs_space_reservation(fs_info, "pinned",
 					      space_info->flags, len, 0);
 		space_info->max_extent_size = 0;
-		percpu_counter_add(&space_info->total_bytes_pinned, -len);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+			    -len, space_info->total_bytes_pinned_batch);
 		if (cache->ro) {
 			space_info->bytes_readonly += len;
 			readonly = true;
@@ -10764,8 +10779,9 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		space_info->bytes_pinned -= block_group->pinned;
 		space_info->bytes_readonly += block_group->pinned;
-		percpu_counter_add(&space_info->total_bytes_pinned,
-				   -block_group->pinned);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+				   -block_group->pinned,
+				   space_info->total_bytes_pinned_batch);
 		block_group->pinned = 0;
 
 		spin_unlock(&block_group->lock);
-- 
2.17.1



* Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
  2018-07-11 15:59 [PATCH v2] btrfs: use customized batch size for total_bytes_pinned Ethan Lien
@ 2018-07-12  7:07 ` Nikolay Borisov
  2018-07-12 17:13   ` ethanlien
  2018-07-12 22:19 ` Omar Sandoval
  1 sibling, 1 reply; 5+ messages in thread
From: Nikolay Borisov @ 2018-07-12  7:07 UTC (permalink / raw)
  To: Ethan Lien, linux-btrfs



On 11.07.2018 18:59, Ethan Lien wrote:
> In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
> pinned bytes") we use total_bytes_pinned to track how many bytes we are
> going to free in this transaction. When we are close to ENOSPC, we check it
> and know if we can make the allocation by commit the current transaction.
> For every data/metadata extent we are going to free, we add
> total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
> release it in unpin_extent_range() when we finish the transaction. So this
> is a variable we frequently update but rarely read - just the suitable
> use of percpu_counter. But in previous commit we update total_bytes_pinned
> by default 32 batch size, making every update essentially a spin lock
> protected update. Since every spin lock/unlock operation involves syncing
> a globally used variable and some kind of barrier in a SMP system, this is
> more expensive than using total_bytes_pinned as a simple atomic64_t. So
> fix this by using a customized batch size. Since we only read
> total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk,
> we can use a really large batch size and have nearly no penalty in most
> cases.
> 
> 
> [Test]
> We test the patch on a 4-cores x86 machine:
> 1. falloate a 16GiB size test file.
> 2. take snapshot (so all following writes will be cow write).
> 3. run a 180 sec, 4 jobs, 4K random write fio on test file.
> 
> We also add a temporary lockdep class on percpu_counter's spin lock used
> by total_bytes_pinned to track lock_stat.
> 
> 
> [Results]
> unpatched:
> lock_stat version 0.4
> -----------------------------------------------------------------------
>                               class name    con-bounces    contentions
> waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
> acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> 
>                total_bytes_pinned_percpu:            82             82
>         0.21           0.61          29.46           0.36         298340
>       635973           0.09          11.01      173476.25           0.27
> 
> 
> patched:
> lock_stat version 0.4
> -----------------------------------------------------------------------
>                               class name    con-bounces    contentions
> waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
> acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> 
>                total_bytes_pinned_percpu:             1              1
>         0.62           0.62           0.62           0.62          13601
>        31542           0.14           9.61       11016.90           0.35
> 

According to these numbers though, the best case (waittime-min) and the
average (waittime-avg) have actually regressed. So what really saves you
is the fact that the number of times we had to go to the best/average
case is reduced, due to the larger batch. I guess in that regard you are
in the clear. Another pertinent question: did you observe any significant
impact on run times of actual workloads or, say, transaction commit times
or anything like that? I think this case will only be hit when the
filesystem is struggling to satisfy a metadata allocation and has to
cycle through all the stages.

> 
> [Analysis]
> Since the spin lock only protect a single in-memory variable, the
> contentions (number of lock acquisitions that had to wait) in both
> unpatched and patched version are low. But when we see acquisitions and
> acq-bounces, we get much lower counts in patched version. Here the most
> important metric is acq-bounces. It means how many times the lock get
> transferred between different cpus, so the patch can really recude

nit: s/recude/reduce - no need to resend, I guess David will fix it on
the way into the tree

> cacheline bouncing of spin lock (also the global counter of percpu_counter)
> in a SMP system.
> 
> Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
> pinned bytes")
> 
> Signed-off-by: Ethan Lien <ethanlien@synology.com>

 Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
> 
> V2:
> 	Rewrite commit comments.
> 	Add lock_stat test.
> 	Pull dirty_metadata_bytes out to a separate patch.
> 
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/extent-tree.c | 46 ++++++++++++++++++++++++++++--------------
>  2 files changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 118346aceea9..df682a521635 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -422,6 +422,7 @@ struct btrfs_space_info {
>  	 * time the transaction commits.
>  	 */
>  	struct percpu_counter total_bytes_pinned;
> +	s32 total_bytes_pinned_batch;
>  
>  	struct list_head list;
>  	/* Protected by the spinlock 'lock'. */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d9fe58c0080..937113534ef4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
>  
>  	space_info = __find_space_info(fs_info, flags);
>  	ASSERT(space_info);
> -	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
> +	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
> +		    space_info->total_bytes_pinned_batch);
>  }
>  
>  /*
> @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  			flags = BTRFS_BLOCK_GROUP_METADATA;
>  		space_info = __find_space_info(fs_info, flags);
>  		ASSERT(space_info);
> -		percpu_counter_add(&space_info->total_bytes_pinned,
> -				   -head->num_bytes);
> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +				   -head->num_bytes,
> +				   space_info->total_bytes_pinned_batch);
>  
>  		if (head->is_data) {
>  			spin_lock(&delayed_refs->lock);
> @@ -4024,6 +4026,13 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
>  		kfree(space_info);
>  		return ret;
>  	}
> +	/*
> +	 * Use large batch size to reduce update overhead.
> +	 * For reader side, we only read it when we are close to ENOSPC and
> +	 * the read overhead is mostly related to # of cpus, so it is OK to
> +	 * use arbitrary large value here.
> +	 */
> +	space_info->total_bytes_pinned_batch = SZ_128M;
>  
>  	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>  		INIT_LIST_HEAD(&space_info->block_groups[i]);
> @@ -4309,9 +4318,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  		 * allocation, and no removed chunk in current transaction,
>  		 * don't bother committing the transaction.
>  		 */
> -		have_pinned_space = percpu_counter_compare(
> +		have_pinned_space = __percpu_counter_compare(
>  			&data_sinfo->total_bytes_pinned,
> -			used + bytes - data_sinfo->total_bytes);
> +			used + bytes - data_sinfo->total_bytes,
> +			data_sinfo->total_bytes_pinned_batch);
>  		spin_unlock(&data_sinfo->lock);
>  
>  		/* commit the current transaction and try again */
> @@ -4912,8 +4922,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  		return 0;
>  
>  	/* See if there is enough pinned space to make this reservation */
> -	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> -				   bytes) >= 0)
> +	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> +				   bytes,
> +				   space_info->total_bytes_pinned_batch) >= 0)
>  		goto commit;
>  
>  	/*
> @@ -4930,8 +4941,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  		bytes -= delayed_rsv->size;
>  	spin_unlock(&delayed_rsv->lock);
>  
> -	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> -				   bytes) < 0) {
> +	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> +				   bytes,
> +				   space_info->total_bytes_pinned_batch) < 0) {
>  		return -ENOSPC;
>  	}
>  
> @@ -6268,8 +6280,9 @@ static int update_block_group(struct btrfs_trans_handle *trans,
>  			trace_btrfs_space_reservation(info, "pinned",
>  						      cache->space_info->flags,
>  						      num_bytes, 1);
> -			percpu_counter_add(&cache->space_info->total_bytes_pinned,
> -					   num_bytes);
> +			percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
> +					   num_bytes,
> +					   cache->space_info->total_bytes_pinned_batch);
>  			set_extent_dirty(info->pinned_extents,
>  					 bytenr, bytenr + num_bytes - 1,
>  					 GFP_NOFS | __GFP_NOFAIL);
> @@ -6347,7 +6360,8 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
>  
>  	trace_btrfs_space_reservation(fs_info, "pinned",
>  				      cache->space_info->flags, num_bytes, 1);
> -	percpu_counter_add(&cache->space_info->total_bytes_pinned, num_bytes);
> +	percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
> +		    num_bytes, cache->space_info->total_bytes_pinned_batch);
>  	set_extent_dirty(fs_info->pinned_extents, bytenr,
>  			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>  	return 0;
> @@ -6711,7 +6725,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  		trace_btrfs_space_reservation(fs_info, "pinned",
>  					      space_info->flags, len, 0);
>  		space_info->max_extent_size = 0;
> -		percpu_counter_add(&space_info->total_bytes_pinned, -len);
> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +			    -len, space_info->total_bytes_pinned_batch);
>  		if (cache->ro) {
>  			space_info->bytes_readonly += len;
>  			readonly = true;
> @@ -10764,8 +10779,9 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  
>  		space_info->bytes_pinned -= block_group->pinned;
>  		space_info->bytes_readonly += block_group->pinned;
> -		percpu_counter_add(&space_info->total_bytes_pinned,
> -				   -block_group->pinned);
> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +				   -block_group->pinned,
> +				   space_info->total_bytes_pinned_batch);
>  		block_group->pinned = 0;
>  
>  		spin_unlock(&block_group->lock);
> 


* Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
  2018-07-12  7:07 ` Nikolay Borisov
@ 2018-07-12 17:13   ` ethanlien
  0 siblings, 0 replies; 5+ messages in thread
From: ethanlien @ 2018-07-12 17:13 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, linux-btrfs-owner

Nikolay Borisov wrote on 2018-07-12 15:07:
> On 11.07.2018 18:59, Ethan Lien wrote:
>> In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of 
>> possibly
>> pinned bytes") we use total_bytes_pinned to track how many bytes we 
>> are
>> going to free in this transaction. When we are close to ENOSPC, we 
>> check it
>> and know if we can make the allocation by commit the current 
>> transaction.
>> For every data/metadata extent we are going to free, we add
>> total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), 
>> and
>> release it in unpin_extent_range() when we finish the transaction. So 
>> this
>> is a variable we frequently update but rarely read - just the suitable
>> use of percpu_counter. But in previous commit we update 
>> total_bytes_pinned
>> by default 32 batch size, making every update essentially a spin lock
>> protected update. Since every spin lock/unlock operation involves 
>> syncing
>> a globally used variable and some kind of barrier in a SMP system, 
>> this is
>> more expensive than using total_bytes_pinned as a simple atomic64_t. 
>> So
>> fix this by using a customized batch size. Since we only read
>> total_bytes_pinned when we are close to ENOSPC and fail to alloc new 
>> chunk,
>> we can use a really large batch size and have nearly no penalty in 
>> most
>> cases.
>> 
>> 
>> [Test]
>> We test the patch on a 4-cores x86 machine:
>> 1. falloate a 16GiB size test file.
>> 2. take snapshot (so all following writes will be cow write).
>> 3. run a 180 sec, 4 jobs, 4K random write fio on test file.
>> 
>> We also add a temporary lockdep class on percpu_counter's spin lock 
>> used
>> by total_bytes_pinned to track lock_stat.
>> 
>> 
>> [Results]
>> unpatched:
>> lock_stat version 0.4
>> -----------------------------------------------------------------------
>>                               class name    con-bounces    contentions
>> waittime-min   waittime-max waittime-total   waittime-avg    
>> acq-bounces
>> acquisitions   holdtime-min   holdtime-max holdtime-total   
>> holdtime-avg
>> 
>>                total_bytes_pinned_percpu:            82             82
>>         0.21           0.61          29.46           0.36         
>> 298340
>>       635973           0.09          11.01      173476.25           
>> 0.27
>> 
>> 
>> patched:
>> lock_stat version 0.4
>> -----------------------------------------------------------------------
>>                               class name    con-bounces    contentions
>> waittime-min   waittime-max waittime-total   waittime-avg    
>> acq-bounces
>> acquisitions   holdtime-min   holdtime-max holdtime-total   
>> holdtime-avg
>> 
>>                total_bytes_pinned_percpu:             1              1
>>         0.62           0.62           0.62           0.62          
>> 13601
>>        31542           0.14           9.61       11016.90           
>> 0.35
>> 
> 
> According to these numbers though, the best-case (waittime-min and the
> average waittime-avg) have actually regressed. So what really saves you

In the patched version we only got one contention count, so the
min/max/avg waittime all come from that single sample. I think the
min/max/avg waittime here is inconclusive, since with only one sample it
could be biased.


> is the fact that the number of time we had to go to the best/average
> case is reduced, due to the larger batch. I guess in that regard you 
> are
> in the clear. Another pertinent question is did you observe any
> significant impact on run times of actual workloads or, say, 
> transaction
> commit times or anything like that? I think this case will only be hit
> when the filesystem is struggling to satisfy a metadata allocation  has
> to cycle through all stages .

The frequent updates of total_bytes_pinned happen in the normal COW
write path. In our test, any rewrite of an existing 4K data block
triggers COW; we release the old data block via btrfs_free_extent(), and
that is where we add to total_bytes_pinned. As the test file is written
out, eventually almost every 4K write triggers a +4K and a -4K update to
total_bytes_pinned. But so far we haven't seen a significant performance
improvement from the test.
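
To put rough numbers on it (an illustration, not code from the patch):

#include <linux/sizes.h>

#define EXTENT_DELTA	SZ_4K	/* each COW rewrite frees one 4K extent   */
#define OLD_BATCH	32	/* default percpu_counter batch           */
#define NEW_BATCH	SZ_128M	/* batch used by this patch               */

/*
 * Old behaviour: EXTENT_DELTA >= OLD_BATCH (4096 >= 32), so every single
 * +4K/-4K update overflows the batch and folds into the shared counter
 * under the spin lock.
 * New behaviour: one cpu can absorb NEW_BATCH / EXTENT_DELTA = 32768 such
 * updates locally before it ever touches the shared counter.
 */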

> 
>> 
>> [Analysis]
>> Since the spin lock only protect a single in-memory variable, the
>> contentions (number of lock acquisitions that had to wait) in both
>> unpatched and patched version are low. But when we see acquisitions 
>> and
>> acq-bounces, we get much lower counts in patched version. Here the 
>> most
>> important metric is acq-bounces. It means how many times the lock get
>> transferred between different cpus, so the patch can really recude
> 
> nit: s/recude/reduce - no need to resend i guess david will fix it on
> the way into the tree
> 
>> cacheline bouncing of spin lock (also the global counter of 
>> percpu_counter)
>> in a SMP system.
>> 
>> Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
>> pinned bytes")
>> 
>> Signed-off-by: Ethan Lien <ethanlien@synology.com>
> 
>  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> 
>> ---
>> 
>> V2:
>> 	Rewrite commit comments.
>> 	Add lock_stat test.
>> 	Pull dirty_metadata_bytes out to a separate patch.
>> 
>>  fs/btrfs/ctree.h       |  1 +
>>  fs/btrfs/extent-tree.c | 46 
>> ++++++++++++++++++++++++++++--------------
>>  2 files changed, 32 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 118346aceea9..df682a521635 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -422,6 +422,7 @@ struct btrfs_space_info {
>>  	 * time the transaction commits.
>>  	 */
>>  	struct percpu_counter total_bytes_pinned;
>> +	s32 total_bytes_pinned_batch;
>> 
>>  	struct list_head list;
>>  	/* Protected by the spinlock 'lock'. */
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 3d9fe58c0080..937113534ef4 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info 
>> *fs_info, s64 num_bytes,
>> 
>>  	space_info = __find_space_info(fs_info, flags);
>>  	ASSERT(space_info);
>> -	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
>> +	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
>> +		    space_info->total_bytes_pinned_batch);
>>  }
>> 
>>  /*
>> @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct 
>> btrfs_trans_handle *trans,
>>  			flags = BTRFS_BLOCK_GROUP_METADATA;
>>  		space_info = __find_space_info(fs_info, flags);
>>  		ASSERT(space_info);
>> -		percpu_counter_add(&space_info->total_bytes_pinned,
>> -				   -head->num_bytes);
>> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>> +				   -head->num_bytes,
>> +				   space_info->total_bytes_pinned_batch);
>> 
>>  		if (head->is_data) {
>>  			spin_lock(&delayed_refs->lock);
>> @@ -4024,6 +4026,13 @@ static int create_space_info(struct 
>> btrfs_fs_info *info, u64 flags)
>>  		kfree(space_info);
>>  		return ret;
>>  	}
>> +	/*
>> +	 * Use large batch size to reduce update overhead.
>> +	 * For reader side, we only read it when we are close to ENOSPC and
>> +	 * the read overhead is mostly related to # of cpus, so it is OK to
>> +	 * use arbitrary large value here.
>> +	 */
>> +	space_info->total_bytes_pinned_batch = SZ_128M;
>> 
>>  	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>>  		INIT_LIST_HEAD(&space_info->block_groups[i]);
>> @@ -4309,9 +4318,10 @@ int btrfs_alloc_data_chunk_ondemand(struct 
>> btrfs_inode *inode, u64 bytes)
>>  		 * allocation, and no removed chunk in current transaction,
>>  		 * don't bother committing the transaction.
>>  		 */
>> -		have_pinned_space = percpu_counter_compare(
>> +		have_pinned_space = __percpu_counter_compare(
>>  			&data_sinfo->total_bytes_pinned,
>> -			used + bytes - data_sinfo->total_bytes);
>> +			used + bytes - data_sinfo->total_bytes,
>> +			data_sinfo->total_bytes_pinned_batch);
>>  		spin_unlock(&data_sinfo->lock);
>> 
>>  		/* commit the current transaction and try again */
>> @@ -4912,8 +4922,9 @@ static int may_commit_transaction(struct 
>> btrfs_fs_info *fs_info,
>>  		return 0;
>> 
>>  	/* See if there is enough pinned space to make this reservation */
>> -	if (percpu_counter_compare(&space_info->total_bytes_pinned,
>> -				   bytes) >= 0)
>> +	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
>> +				   bytes,
>> +				   space_info->total_bytes_pinned_batch) >= 0)
>>  		goto commit;
>> 
>>  	/*
>> @@ -4930,8 +4941,9 @@ static int may_commit_transaction(struct 
>> btrfs_fs_info *fs_info,
>>  		bytes -= delayed_rsv->size;
>>  	spin_unlock(&delayed_rsv->lock);
>> 
>> -	if (percpu_counter_compare(&space_info->total_bytes_pinned,
>> -				   bytes) < 0) {
>> +	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
>> +				   bytes,
>> +				   space_info->total_bytes_pinned_batch) < 0) {
>>  		return -ENOSPC;
>>  	}
>> 
>> @@ -6268,8 +6280,9 @@ static int update_block_group(struct 
>> btrfs_trans_handle *trans,
>>  			trace_btrfs_space_reservation(info, "pinned",
>>  						      cache->space_info->flags,
>>  						      num_bytes, 1);
>> -			percpu_counter_add(&cache->space_info->total_bytes_pinned,
>> -					   num_bytes);
>> +			percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
>> +					   num_bytes,
>> +					   cache->space_info->total_bytes_pinned_batch);
>>  			set_extent_dirty(info->pinned_extents,
>>  					 bytenr, bytenr + num_bytes - 1,
>>  					 GFP_NOFS | __GFP_NOFAIL);
>> @@ -6347,7 +6360,8 @@ static int pin_down_extent(struct btrfs_fs_info 
>> *fs_info,
>> 
>>  	trace_btrfs_space_reservation(fs_info, "pinned",
>>  				      cache->space_info->flags, num_bytes, 1);
>> -	percpu_counter_add(&cache->space_info->total_bytes_pinned, 
>> num_bytes);
>> +	percpu_counter_add_batch(&cache->space_info->total_bytes_pinned,
>> +		    num_bytes, cache->space_info->total_bytes_pinned_batch);
>>  	set_extent_dirty(fs_info->pinned_extents, bytenr,
>>  			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>>  	return 0;
>> @@ -6711,7 +6725,8 @@ static int unpin_extent_range(struct 
>> btrfs_fs_info *fs_info,
>>  		trace_btrfs_space_reservation(fs_info, "pinned",
>>  					      space_info->flags, len, 0);
>>  		space_info->max_extent_size = 0;
>> -		percpu_counter_add(&space_info->total_bytes_pinned, -len);
>> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>> +			    -len, space_info->total_bytes_pinned_batch);
>>  		if (cache->ro) {
>>  			space_info->bytes_readonly += len;
>>  			readonly = true;
>> @@ -10764,8 +10779,9 @@ void btrfs_delete_unused_bgs(struct 
>> btrfs_fs_info *fs_info)
>> 
>>  		space_info->bytes_pinned -= block_group->pinned;
>>  		space_info->bytes_readonly += block_group->pinned;
>> -		percpu_counter_add(&space_info->total_bytes_pinned,
>> -				   -block_group->pinned);
>> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>> +				   -block_group->pinned,
>> +				   space_info->total_bytes_pinned_batch);
>>  		block_group->pinned = 0;
>> 
>>  		spin_unlock(&block_group->lock);
>> 



* Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
  2018-07-11 15:59 [PATCH v2] btrfs: use customized batch size for total_bytes_pinned Ethan Lien
  2018-07-12  7:07 ` Nikolay Borisov
@ 2018-07-12 22:19 ` Omar Sandoval
  2018-07-13  2:27   ` ethanlien
  1 sibling, 1 reply; 5+ messages in thread
From: Omar Sandoval @ 2018-07-12 22:19 UTC (permalink / raw)
  To: Ethan Lien; +Cc: linux-btrfs

On Wed, Jul 11, 2018 at 11:59:36PM +0800, Ethan Lien wrote:
> In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
> pinned bytes") we use total_bytes_pinned to track how many bytes we are
> going to free in this transaction. When we are close to ENOSPC, we check it
> and know if we can make the allocation by commit the current transaction.
> For every data/metadata extent we are going to free, we add
> total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
> release it in unpin_extent_range() when we finish the transaction. So this
> is a variable we frequently update but rarely read - just the suitable
> use of percpu_counter. But in previous commit we update total_bytes_pinned
> by default 32 batch size, making every update essentially a spin lock
> protected update. Since every spin lock/unlock operation involves syncing
> a globally used variable and some kind of barrier in a SMP system, this is
> more expensive than using total_bytes_pinned as a simple atomic64_t. So
> fix this by using a customized batch size. Since we only read
> total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk,
> we can use a really large batch size and have nearly no penalty in most
> cases.
> 
> 
> [Test]
> We test the patch on a 4-cores x86 machine:
> 1. falloate a 16GiB size test file.
> 2. take snapshot (so all following writes will be cow write).
> 3. run a 180 sec, 4 jobs, 4K random write fio on test file.
> 
> We also add a temporary lockdep class on percpu_counter's spin lock used
> by total_bytes_pinned to track lock_stat.
> 
> 
> [Results]
> unpatched:
> lock_stat version 0.4
> -----------------------------------------------------------------------
>                               class name    con-bounces    contentions
> waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
> acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> 
>                total_bytes_pinned_percpu:            82             82
>         0.21           0.61          29.46           0.36         298340
>       635973           0.09          11.01      173476.25           0.27
> 
> 
> patched:
> lock_stat version 0.4
> -----------------------------------------------------------------------
>                               class name    con-bounces    contentions
> waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
> acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> 
>                total_bytes_pinned_percpu:             1              1
>         0.62           0.62           0.62           0.62          13601
>        31542           0.14           9.61       11016.90           0.35
> 
> 
> [Analysis]
> Since the spin lock only protect a single in-memory variable, the
> contentions (number of lock acquisitions that had to wait) in both
> unpatched and patched version are low. But when we see acquisitions and
> acq-bounces, we get much lower counts in patched version. Here the most
> important metric is acq-bounces. It means how many times the lock get
> transferred between different cpus, so the patch can really recude
> cacheline bouncing of spin lock (also the global counter of percpu_counter)
> in a SMP system.
> 
> Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
> pinned bytes")
> 
> Signed-off-by: Ethan Lien <ethanlien@synology.com>
> ---
> 
> V2:
> 	Rewrite commit comments.
> 	Add lock_stat test.
> 	Pull dirty_metadata_bytes out to a separate patch.
> 
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/extent-tree.c | 46 ++++++++++++++++++++++++++++--------------
>  2 files changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 118346aceea9..df682a521635 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -422,6 +422,7 @@ struct btrfs_space_info {
>  	 * time the transaction commits.
>  	 */
>  	struct percpu_counter total_bytes_pinned;
> +	s32 total_bytes_pinned_batch;

Can this just be a constant instead of adding it to space_info?


* Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
  2018-07-12 22:19 ` Omar Sandoval
@ 2018-07-13  2:27   ` ethanlien
  0 siblings, 0 replies; 5+ messages in thread
From: ethanlien @ 2018-07-13  2:27 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, linux-btrfs-owner

Omar Sandoval wrote on 2018-07-13 06:19:
> On Wed, Jul 11, 2018 at 11:59:36PM +0800, Ethan Lien wrote:
>> In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of 
>> possibly
>> pinned bytes") we use total_bytes_pinned to track how many bytes we 
>> are
>> going to free in this transaction. When we are close to ENOSPC, we 
>> check it
>> and know if we can make the allocation by commit the current 
>> transaction.
>> For every data/metadata extent we are going to free, we add
>> total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), 
>> and
>> release it in unpin_extent_range() when we finish the transaction. So 
>> this
>> is a variable we frequently update but rarely read - just the suitable
>> use of percpu_counter. But in previous commit we update 
>> total_bytes_pinned
>> by default 32 batch size, making every update essentially a spin lock
>> protected update. Since every spin lock/unlock operation involves 
>> syncing
>> a globally used variable and some kind of barrier in a SMP system, 
>> this is
>> more expensive than using total_bytes_pinned as a simple atomic64_t. 
>> So
>> fix this by using a customized batch size. Since we only read
>> total_bytes_pinned when we are close to ENOSPC and fail to alloc new 
>> chunk,
>> we can use a really large batch size and have nearly no penalty in 
>> most
>> cases.
>> 
>> 
>> [Test]
>> We test the patch on a 4-cores x86 machine:
>> 1. falloate a 16GiB size test file.
>> 2. take snapshot (so all following writes will be cow write).
>> 3. run a 180 sec, 4 jobs, 4K random write fio on test file.
>> 
>> We also add a temporary lockdep class on percpu_counter's spin lock 
>> used
>> by total_bytes_pinned to track lock_stat.
>> 
>> 
>> [Results]
>> unpatched:
>> lock_stat version 0.4
>> -----------------------------------------------------------------------
>>                               class name    con-bounces    contentions
>> waittime-min   waittime-max waittime-total   waittime-avg    
>> acq-bounces
>> acquisitions   holdtime-min   holdtime-max holdtime-total   
>> holdtime-avg
>> 
>>                total_bytes_pinned_percpu:            82             82
>>         0.21           0.61          29.46           0.36         
>> 298340
>>       635973           0.09          11.01      173476.25           
>> 0.27
>> 
>> 
>> patched:
>> lock_stat version 0.4
>> -----------------------------------------------------------------------
>>                               class name    con-bounces    contentions
>> waittime-min   waittime-max waittime-total   waittime-avg    
>> acq-bounces
>> acquisitions   holdtime-min   holdtime-max holdtime-total   
>> holdtime-avg
>> 
>>                total_bytes_pinned_percpu:             1              1
>>         0.62           0.62           0.62           0.62          
>> 13601
>>        31542           0.14           9.61       11016.90           
>> 0.35
>> 
>> 
>> [Analysis]
>> Since the spin lock only protect a single in-memory variable, the
>> contentions (number of lock acquisitions that had to wait) in both
>> unpatched and patched version are low. But when we see acquisitions 
>> and
>> acq-bounces, we get much lower counts in patched version. Here the 
>> most
>> important metric is acq-bounces. It means how many times the lock get
>> transferred between different cpus, so the patch can really recude
>> cacheline bouncing of spin lock (also the global counter of 
>> percpu_counter)
>> in a SMP system.
>> 
>> Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
>> pinned bytes")
>> 
>> Signed-off-by: Ethan Lien <ethanlien@synology.com>
>> ---
>> 
>> V2:
>> 	Rewrite commit comments.
>> 	Add lock_stat test.
>> 	Pull dirty_metadata_bytes out to a separate patch.
>> 
>>  fs/btrfs/ctree.h       |  1 +
>>  fs/btrfs/extent-tree.c | 46 
>> ++++++++++++++++++++++++++++--------------
>>  2 files changed, 32 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 118346aceea9..df682a521635 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -422,6 +422,7 @@ struct btrfs_space_info {
>>  	 * time the transaction commits.
>>  	 */
>>  	struct percpu_counter total_bytes_pinned;
>> +	s32 total_bytes_pinned_batch;
> 
> Can this just be a constant instead of adding it to space_info?

Yes, a constant is better here. I'll resend it, thanks.
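
For reference, a constant-based variant could look roughly like this (the
name BTRFS_TOTAL_BYTES_PINNED_BATCH is illustrative, not something fixed
by this thread):

/* fs/btrfs/ctree.h */
#define BTRFS_TOTAL_BYTES_PINNED_BATCH	SZ_128M

/* callers then drop the per-space_info field, e.g. in add_pinned_bytes(): */
percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
			 BTRFS_TOTAL_BYTES_PINNED_BATCH);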




