linux-btrfs.vger.kernel.org archive mirror
* btrfs metadata reclaim behavior/performance characteristics
@ 2017-05-18  8:40 Nikolay Borisov
  2017-05-18 14:45 ` Liu Bo
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2017-05-18  8:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jeffm

[-- Attachment #1: Type: text/plain, Size: 5903 bytes --]

Hello list, 

As a result of the perf regression discovered in [0] I also took the time to 
investigate the behavior of the metadata reclaim algorithm and how it changed
with the introduction of the ticketed reservations. My tests were performed on
the following commits: 

Pre-ticketed => c83f8effefa4 ("Btrfs: add tracepoint for adding block groups")
Ticketed => kernel version 4.11
Patched => my patch (attached), applied on top of 4.11 


The workload used was a stripped-down version of generic/027 [1]. The filesystem
is created with the same conditions - 256mb total size, mixed mode, only 2mb of
free space, the rest taken up by a large file. The canonical workload is a single
iteration of 8 workers creating 2kb files. However, the measurements below were
taken by doing 10 iterations of that, just to reduce noise.


First, here is the latency distribution of reserve_metadata_bytes, obtained with
the following systemtap script [2]. Results are in milliseconds (ms):

pre-ticketed(ms):
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  345330
    1 |@@                                                  17221
    2 |                                                     3506
    4 |                                                      212
    8 |@@                                                  17777
   16 |@                                                    7125
   32 |                                                     1974
   64 |                                                      136
  128 |                                                        3
  256 |                                                        0
  512 |    

Ticketed:
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  311844
    1 |@                                                   11123
    2 |@                                                   10103
    4 |@                                                    8669
    8 |@                                                   10802
   16 |@@@                                                 20079
   32 |@@                                                  18015
   64 |                                                     1485
  128 |                                                      175
  256 |                                                        0
  512 |                                                        0

patched:
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  321168
    1 |@                                                   10620
    2 |@                                                    9099
    4 |@                                                    6488
    8 |@@                                                  17000
   16 |@                                                   12495
   32 |                                                     3525
   64 |                                                       54
  128 |                                                        0
  256 |                                                        0


So if anything, the ticketed rework seems to have a negative impact on the
latencies, making the results a lot more variable. My patch does improve things
a little, but not back to pre-ticketed levels.

Latency, however, is not the only important metric. Here are the sums of files
created by all 8 workers for each of the 10 iterations:

                Pre-ticketed    Ticketed        Patched
                9647            9688            9492
                9811            9805            9627
                9838            9656            9569
                9818            9809            9419
                9684            9883            9495
                9755            9861            9696
                9791            9756            9362
                9852            9714            9489
                9880            9787            9657
                9809            9863            9453

average         9788            9782            9525
                                100%            97.38%

median          9810            9787            9495
                                100%            97.02%

stdev           73              77              108
stdev/average   0.75%           0.79%           1.14%
stdev/median    0.75%           0.79%           1.14%

generic/027
runtime:        300s            800s            225s

Drilling down further into the increased run times, I traced them to the number
of transactions being generated. I've already described this in my previous
email [0], so I'm not going to repeat the numbers here.

Empirically I discovered that one way to reduce the latency is to limit the
number of commit cycles to 1 and, beyond that, only allow the flush states other
than the transaction commit to run during reclaim. This is exactly the strategy
which the attached patch implements. As evident from the numbers above, the
reduced number of commit cycles is only marginally worse than the current
version in terms of satisfied reservations, yet it is a lot faster in this
particular test and reduces latency as well. I also tested doing only a single
pass through all flush_space states, a la the pre-ticketed version, but the
results were a lot noisier and worse.
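
For reference, here is roughly what one "commit cycle" iterates over - the
flush states as defined in fs/btrfs/extent-tree.c around 4.11 (paraphrased,
so treat this as a sketch rather than verbatim code). COMMIT_TRANS being the
last state is what makes every full pass end in a transaction commit:

enum {
        FLUSH_DELAYED_ITEMS_NR  = 1,    /* run a limited number of delayed items */
        FLUSH_DELAYED_ITEMS     = 2,    /* run all delayed items */
        FLUSH_DELALLOC          = 3,    /* kick off delalloc writeback */
        FLUSH_DELALLOC_WAIT     = 4,    /* kick off delalloc writeback and wait for it */
        ALLOC_CHUNK             = 5,    /* try to allocate a new metadata chunk */
        COMMIT_TRANS            = 6,    /* last resort: commit the transaction */
};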

Admittedly, this latency problem likely manifests only when we are close to an
ENOSPC condition, which is probably why no one has observed it so far. However,
I think an almost 3x drop in performance, even in an edge case, with no
considerable change in the number of successful writes during ENOSPC, is a bit
too much.

I would like this email to spark a discussion on this, since we might not
really have the most optimal solution at the moment.


[0] https://www.spinics.net/lists/linux-btrfs/msg65355.html
[1] http://sprunge.us/HVMF
[2] http://paste.ubuntu.com/24591513/

[-- Attachment #2: 0001-btrfs-Rework-the-metadata-reclaim-algorithm.patch --]
[-- Type: text/x-patch, Size: 5391 bytes --]

>From 8b3e5b1a45c47043e4ae3a066ca494d9211cc8be Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov@suse.com>
Date: Tue, 16 May 2017 14:46:03 +0300
Subject: [PATCH] btrfs: Rework the metadata reclaim algorithm

Currently the metadata reclaim algorithm works by allowing up to 2 commit cycles
(where a commit cycle is defined as an iteration through all states in
flush_space) for every space reservation ticket on the space_info->ticket_list.
Empirical testing shows that this improves the number of reservations which
can be satisfied, however it also leads to a very large number of transaction
commits. Testing showed that for the same workload the ticketed rework generates
~4.5k transactions whereas the pre-ticketed code generated ~1.5k. This results in
the generic/027 runtime increasing almost 3x (from 300 to 800 seconds), mainly
due to the 'rm' phase causing a lot more transaction commits.

This patch fixes that by modifying the algorithm so as to allow only 1
transaction commit for every ticket, while allowing every other metadata
cleaning operation to be performed twice. That is, the code stops flushing
before the 2nd transaction commit commences.

In addition to runtime measurements I also measured the number of reservations
which are being satisfied. The test case is the following:

#!/bin/bash
create_file()
{
        local dir=$1
        local direct=$2
        local i=0

        mkdir -p $dir >/dev/null 2>&1
        local STARTTIME=$(date +%s)
        while xfs_io -f $direct -c "pwrite 0 1k" $dir/file_$i >/dev/null 2>&1; do
                let i=$i+1
        done

        local ENDTIME=$(date +%s)
        echo "Created $i files before returning error, time taken $(($ENDTIME - $STARTTIME))"
}

dir=/media/scratch/testdir/
loop=1
i=1
while [ $i -le $loop ]; do
        nr_worker=8
        while [ $nr_worker -gt 0 ]; do
                # half buffered I/O half direct I/O
                if [ `expr $nr_worker % 2` -eq 0 ]; then
                        create_file $dir/$nr_worker -d &
                else
                        create_file $dir/$nr_worker &
                fi
                let nr_worker=$nr_worker-1
        done
        wait
        STARTTIME=$(date +%s)
        rm -rf $dir
        ENDTIME=$(date +%s)
        echo "rming took $(($ENDTIME - $STARTTIME)) seconds"
        let i=$i+1
done

Running this 10 times yields the following sums of successful writes across
all 8 workers per iteration:
                Pre-ticketed    Ticketed        Patched
                9647            9688            9492
                9811            9805            9627
                9838            9656            9569
                9818            9809            9419
                9684            9883            9495
                9755            9861            9696
                9791            9756            9362
                9852            9714            9489
                9880            9787            9657
                9809            9863            9453

average         9788            9782            9525
                                100%            97.38%

median          9810            9787            9495
                                100%            97.02%

stdev           73              77              108
stdev/average   0.75%           0.79%           1.14%
stdev/median    0.75%           0.79%           1.14%

generic/027
runtime:        300s            800s            225s

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3ab1f88af038..8124a750fd6b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4973,8 +4973,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_space_info *space_info;
 	u64 to_reclaim;
-	int flush_state;
-	int commit_cycles = 0;
+	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+	bool committed = false;
 	u64 last_tickets_id;
 
 	fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
@@ -4991,8 +4991,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	last_tickets_id = space_info->tickets_id;
 	spin_unlock(&space_info->lock);
 
-	flush_state = FLUSH_DELAYED_ITEMS_NR;
-	do {
+	while (!committed || flush_state < COMMIT_TRANS) {
 		struct reserve_ticket *ticket;
 		int ret;
 
@@ -5013,21 +5012,21 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 		} else {
 			last_tickets_id = space_info->tickets_id;
 			flush_state = FLUSH_DELAYED_ITEMS_NR;
-			if (commit_cycles)
-				commit_cycles--;
+			if (committed)
+				committed = false;
 		}
+		spin_unlock(&space_info->lock);
 
 		if (flush_state > COMMIT_TRANS) {
-			commit_cycles++;
-			if (commit_cycles > 2) {
-				wake_all_tickets(&space_info->tickets);
-				space_info->flush = 0;
-			} else {
-				flush_state = FLUSH_DELAYED_ITEMS_NR;
-			}
+			committed = true;
+			flush_state = FLUSH_DELAYED_ITEMS_NR;
 		}
-		spin_unlock(&space_info->lock);
-	} while (flush_state <= COMMIT_TRANS);
+	}
+
+	spin_lock(&space_info->lock);
+	wake_all_tickets(&space_info->tickets);
+	space_info->flush = 0;
+	spin_unlock(&space_info->lock);
 }
 
 void btrfs_init_async_reclaim_work(struct work_struct *work)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-18  8:40 btrfs metadata reclaim behavior/performance characteristics Nikolay Borisov
@ 2017-05-18 14:45 ` Liu Bo
  2017-05-18 15:41   ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Liu Bo @ 2017-05-18 14:45 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, jeffm

On Thu, May 18, 2017 at 11:40:05AM +0300, Nikolay Borisov wrote:

<omitted for brevity>

Just some random thoughts here.

Hmm, not sure if this matters, but fstests now doesn't set --mixed even if the
disk size is as small as 256mb.  So are you testing a mixed btrfs or not?

Now that we've observed there are too many transaction commits happening, I
think it's because committing a transaction doesn't reclaim enough metadata
space; in particular, it looks like space_info->bytes_may_use is not being
reduced somehow.

The metadata space_info->bytes_may_use may come from:
1) a 1K file written with buffered IO ends up inline in a btree leaf, so it contributes to the number,
2) if it's a mixed btrfs, a 1K file written with direct IO ends up creating a 4K extent in the mixed block group,
3) while writing 1K files, metadata is reserved up front to let the write proceed, and when it is released depends on writeback (in the buffered IO case) or endio (in the direct IO case).

When several transaction commits run concurrently and one has already entered
the TRANS_STATE_COMMIT_START state, the others just wait there. Have you
observed whether each transaction commit actually writes the superblock in the
end?
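
What I have in mind is roughly this path in btrfs_commit_transaction()
(paraphrased and heavily simplified - names and locking are approximate,
not verbatim 4.11 code):

        /*
         * A committer that finds the running transaction already past
         * TRANS_STATE_COMMIT_START does not perform its own commit (and
         * never reaches the superblock write); it just waits for the task
         * that is already committing and then returns.
         */
        if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
                btrfs_end_transaction(trans);           /* drop our handle */
                wait_for_commit(root, cur_trans);       /* wait for the real committer */
                return ret;
        }
        /* ... only the first committer walks the full path that ends in
         * writing the superblocks ... */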

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-18 14:45 ` Liu Bo
@ 2017-05-18 15:41   ` Nikolay Borisov
  2017-05-18 21:47     ` Liu Bo
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2017-05-18 15:41 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, jeffm



On 18.05.2017 17:45, Liu Bo wrote:
> On Thu, May 18, 2017 at 11:40:05AM +0300, Nikolay Borisov wrote:

<omitted for brevity>

> 
> Just some random thoughts here.
> 
> Hmm, not sure if this matters, but fstests now doesn't set --mixed even if the
> disk size is as small as 256mb.  So are you testing a mixed btrfs or not?

You are right that fstests only sets mixed mode when the filesystem is
less than or equal to 100mb. In this case the fs is 256mb, which means
mixed mode _is not_ set.

> 
> So now we've observed there're too many 'commit transaction' happening, I think it's because via commiting transaction it doesn't reclaim enough metadata space, esp. looks like space->bytes_may_use is not reduced somehow.

You've given me the idea to compare the state of the various space_info
counters after each transaction commit, before and after the ticketed work.
Also, what makes you believe it's the transaction commit itself not freeing
enough space, rather than some of the other, "cheaper" flush states?

> 
> The metadata space_info->bytes_may_use may come from:
> 1) 1K file with buffered IO ends up living in btree leaf, so it will contribute to the number,
> 2) if it's mixed btrfs, then 1k file with direct IO ends up with creating a 4k extent in mixed block group.
> 3) while writing 1k files, metadata is reserved to make it run, and when to release depends on writeback (in the buffered IO case) or endio (in the direct IO case)
> 
> When running several commit transaction concurrently, if one has entered TRANS_STATE_COMMIT_START state, others just wait there, have you observed that if each commit transaction actually writes superblock in the end?

Haven't gone that far, will have to instrument the code to confirm this.
What exactly should writing the superblock reveal?

> 
> Thanks,
> 
> -liubo
> 
>>

<omitted for brevity>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-18 15:41   ` Nikolay Borisov
@ 2017-05-18 21:47     ` Liu Bo
  2017-05-19  9:54       ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Liu Bo @ 2017-05-18 21:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, jeffm

On Thu, May 18, 2017 at 06:41:08PM +0300, Nikolay Borisov wrote:
> 
> 
> On 18.05.2017 17:45, Liu Bo wrote:
> > On Thu, May 18, 2017 at 11:40:05AM +0300, Nikolay Borisov wrote:
> 
> <ommitted for brevity>
> 
> > 
> > Just some random thoughts here.
> > 
> > Hmm, not sure if this matters, but fstests now doesn't set --mixed even if the
> > disk size is as small as 256mb.  So are you testing a mixed btrfs or not?
> 
> You are right that fstest only sets mixed mode in case the filesystem is
> less than or equal to 100mb. IN this case the fs is 256mb which means
> mixed mode _is not_ set.
> 
> > 
> > So now we've observed there're too many 'commit transaction' happening, I think it's because via commiting transaction it doesn't reclaim enough metadata space, esp. looks like space->bytes_may_use is not reduced somehow.
> 
> You've given me the idea to basically compare the state of the various
> space_info counters after each transaction commit. Before and after the
> ticketed work. Also what makes you believe it's the commit transaction
> itself not freeing enough memory and not some of the other, "cheaper"
> flush states?
>

Firstly, a transaction commit is a checkpoint where all queued delayed refs are
supposed to be processed, and that is one of the things that updates the
space_info counters (bytes_may_use -> bytes_reserved, bytes_pinned ->
bytes_used). Secondly, it's the last resort: if any cheaper flush works, i.e.
offers enough space to the tickets, it won't commit, will it?
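
To make that ordering concrete, here is a rough sketch of what flush_space()
tries per state around 4.11 (paraphrased from memory, so the helper names are
approximate rather than verbatim):

        switch (state) {
        case FLUSH_DELAYED_ITEMS_NR:
        case FLUSH_DELAYED_ITEMS:
                /* join a transaction and run (some or all) delayed items */
                break;
        case FLUSH_DELALLOC:
        case FLUSH_DELALLOC_WAIT:
                /* shrink_delalloc(): flush dirty data so its metadata
                 * reservations can be released */
                break;
        case ALLOC_CHUNK:
                /* do_chunk_alloc(): try to carve out a new metadata chunk */
                break;
        case COMMIT_TRANS:
                /* may_commit_transaction(): the most expensive option,
                 * only worth it if enough pinned bytes would be freed */
                break;
        }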

> > 
> > The metadata space_info->bytes_may_use may come from:
> > 1) 1K file with buffered IO ends up living in btree leaf, so it will contribute to the number,
> > 2) if it's mixed btrfs, then 1k file with direct IO ends up with creating a 4k extent in mixed block group.
> > 3) while writing 1k files, metadata is reserved to make it run, and when to release depends on writeback (in the buffered IO case) or endio (in the direct IO case)
> > 
> > When running several commit transaction concurrently, if one has entered TRANS_STATE_COMMIT_START state, others just wait there, have you observed that if each commit transaction actually writes superblock in the end?
> 
> Haven't gone that far, will have to instrument the code to confirm this.
> What exactly should writing the superblock reveal?
>

On second thought, it shouldn't matter whether they reach the point of writing
the superblock.

A transaction commit is costly for sure; I think what we're trying to find out
is why there are so many transaction commits with the ticketed patch set.

In btrfs_async_reclaim_metadata_space(), flush_state++ happens when a ticket is
not satisfied, and we commit the transaction only if there are enough pinned
bytes. What I don't understand is that the link shows 'pinned' is 0, and I
spotted something that looks wrong (shown in the included patch).

Thanks,

-liubo

From: Liu Bo <bo.li.liu@oracle.com>

Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes

We commit the transaction in order to reclaim space from pinned bytes, because
committing can process delayed refs. In may_commit_transaction() we first check
whether the pinned bytes alone are enough for the required space, and then check
whether the pinned bytes plus the bytes reserved for delayed insertion are
enough for the required space.

This changes the code to match the above logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e390451c72e6..bded1ddd1bb6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 
	spin_lock(&delayed_rsv->lock);
 	if (percpu_counter_compare(&space_info->total_bytes_pinned,
-			   bytes - delayed_rsv->size) >= 0) {
+			   bytes - delayed_rsv->size) < 0) {
 		spin_unlock(&delayed_rsv->lock);
 		return -ENOSPC;
 	}
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-18 21:47     ` Liu Bo
@ 2017-05-19  9:54       ` Nikolay Borisov
  2017-05-19 18:32         ` Liu Bo
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2017-05-19  9:54 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, jeffm

> From: Liu Bo <bo.li.liu@oracle.com>
> 
> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> 
> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> 
> This changes the code to the above logic.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e390451c72e6..bded1ddd1bb6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  
> 	spin_lock(&delayed_rsv->lock);
>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> -			   bytes - delayed_rsv->size) >= 0) {
> +			   	   		         bytes - delayed_rsv->size) < 0) {
>  							       spin_unlock(&delayed_rsv->lock);
> 								return -ENOSPC;
>  								}
> 

Your patch does make a very big difference. Here are a couple of runs of
slow-rm:



root@ubuntu-virtual:~# ./slow-rm.sh
Created 837 files before returning error, time taken 3
Created 920 files before returning error, time taken 3
Created 949 files before returning error, time taken 3
Created 930 files before returning error, time taken 3
Created 1101 files before returning error, time taken 4
Created 1082 files before returning error, time taken 4
Created 1608 files before returning error, time taken 5
Created 1735 files before returning error, time taken 5
rming took 1 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 801 files before returning error, time taken 3
Created 829 files before returning error, time taken 3
Created 983 files before returning error, time taken 3
Created 978 files before returning error, time taken 3
Created 1023 files before returning error, time taken 3
Created 1126 files before returning error, time taken 3
Created 1538 files before returning error, time taken 4
Created 1737 files before returning error, time taken 5
rming took 2 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 875 files before returning error, time taken 3
Created 891 files before returning error, time taken 3
Created 969 files before returning error, time taken 4
Created 1002 files before returning error, time taken 4
Created 1039 files before returning error, time taken 4
Created 1051 files before returning error, time taken 4
Created 1191 files before returning error, time taken 4
Created 2137 files before returning error, time taken 8
rming took 2 seconds

So rming is a lot faster, but we create fewer files overall and get
ENOSPC earlier. This means that most of the time bytes_pinned is not
enough to satisfy the allocation, hence we are hitting the second
percpu_counter comparison.

Also, the reason why the previous links showed 0 for bytes_pinned is that I
had completely forgotten that bytes_pinned is a percpu counter, so my stap
script wasn't actually reading it correctly.
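
For anyone else tracing this: as of ~4.11 btrfs_space_info carries both a plain
bytes_pinned field (a u64 updated under space_info->lock) and a separate
total_bytes_pinned percpu counter, which is the one may_commit_transaction()
compares against. A sketch of how the percpu one has to be read (the helper
below is hypothetical; the percpu_counter calls are the real kernel API):

        #include <linux/percpu_counter.h>

        /* Folds the per-CPU deltas into an accurate total; reading only the
         * central count (e.g. via percpu_counter_read(), or a tracing script
         * peeking at the raw ->count field) can easily show 0 while the real
         * value still sits in the per-CPU deltas. */
        static s64 pinned_for_tracing(struct btrfs_space_info *space_info)
        {
                return percpu_counter_sum(&space_info->total_bytes_pinned);
        }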

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-19  9:54       ` Nikolay Borisov
@ 2017-05-19 18:32         ` Liu Bo
  2017-05-21 12:45           ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Liu Bo @ 2017-05-19 18:32 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, jeffm

On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
> > From: Liu Bo <bo.li.liu@oracle.com>
> > 
> > Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> > 
> > We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> > 
> > This changes the code to the above logic.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index e390451c72e6..bded1ddd1bb6 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >  
> > 	spin_lock(&delayed_rsv->lock);
> >  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> > -			   bytes - delayed_rsv->size) >= 0) {
> > +			   	   		         bytes - delayed_rsv->size) < 0) {
> >  							       spin_unlock(&delayed_rsv->lock);
> > 								return -ENOSPC;
> >  								}
> > 
> 
> Your patch does make a very big difference. Here are a couple of runs of
> slow-rm:
> 
> 
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 837 files before returning error, time taken 3
> Created 920 files before returning error, time taken 3
> Created 949 files before returning error, time taken 3
> Created 930 files before returning error, time taken 3
> Created 1101 files before returning error, time taken 4
> Created 1082 files before returning error, time taken 4
> Created 1608 files before returning error, time taken 5
> Created 1735 files before returning error, time taken 5
> rming took 1 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 801 files before returning error, time taken 3
> Created 829 files before returning error, time taken 3
> Created 983 files before returning error, time taken 3
> Created 978 files before returning error, time taken 3
> Created 1023 files before returning error, time taken 3
> Created 1126 files before returning error, time taken 3
> Created 1538 files before returning error, time taken 4
> Created 1737 files before returning error, time taken 5
> rming took 2 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 875 files before returning error, time taken 3
> Created 891 files before returning error, time taken 3
> Created 969 files before returning error, time taken 4
> Created 1002 files before returning error, time taken 4
> Created 1039 files before returning error, time taken 4
> Created 1051 files before returning error, time taken 4
> Created 1191 files before returning error, time taken 4
> Created 2137 files before returning error, time taken 8
> rming took 2 seconds
> 
> So rming is a lot faster, but we create less files all in all and get
> ENOSPC earlier. This means that most of the time bytes_pinned is not
> enough to satisfy the allocation hence we are hitting the second
> percpu_counter comparison.
>

Right, it's sort of the behavior I expected, because every 1K buffered write
ends up as an inline extent, so it's likely to run out of metadata space very
soon.

> Also, the reason why the previous links showed 0 for bytes_pinned was
> due to me having completely forgotten that bytes_pinned is a percpu
> counter, hence my stap script wasn't actually reading it correctly.

I see. bytes_pinned in space_info is different from the percpu one; they're
updated at different times, but overall the percpu one is the more precise
counter.

-liubo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-19 18:32         ` Liu Bo
@ 2017-05-21 12:45           ` Nikolay Borisov
  2017-05-22 22:57             ` Liu Bo
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2017-05-21 12:45 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, jeffm



On 19.05.2017 21:32, Liu Bo wrote:
> On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
>>> From: Liu Bo <bo.li.liu@oracle.com>
>>>
>>> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
>>>
>>> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
>>>
>>> This changes the code to the above logic.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index e390451c72e6..bded1ddd1bb6 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>>>  
>>> 	spin_lock(&delayed_rsv->lock);
>>>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
>>> -			   bytes - delayed_rsv->size) >= 0) {
>>> +			   	   		         bytes - delayed_rsv->size) < 0) {
>>>  							       spin_unlock(&delayed_rsv->lock);
>>> 								return -ENOSPC;
>>>  								}
>>>
>>
>> Your patch does make a very big difference. Here are a couple of runs of
>> slow-rm:
>>
>>
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 837 files before returning error, time taken 3
>> Created 920 files before returning error, time taken 3
>> Created 949 files before returning error, time taken 3
>> Created 930 files before returning error, time taken 3
>> Created 1101 files before returning error, time taken 4
>> Created 1082 files before returning error, time taken 4
>> Created 1608 files before returning error, time taken 5
>> Created 1735 files before returning error, time taken 5
>> rming took 1 seconds
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 801 files before returning error, time taken 3
>> Created 829 files before returning error, time taken 3
>> Created 983 files before returning error, time taken 3
>> Created 978 files before returning error, time taken 3
>> Created 1023 files before returning error, time taken 3
>> Created 1126 files before returning error, time taken 3
>> Created 1538 files before returning error, time taken 4
>> Created 1737 files before returning error, time taken 5
>> rming took 2 seconds
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 875 files before returning error, time taken 3
>> Created 891 files before returning error, time taken 3
>> Created 969 files before returning error, time taken 4
>> Created 1002 files before returning error, time taken 4
>> Created 1039 files before returning error, time taken 4
>> Created 1051 files before returning error, time taken 4
>> Created 1191 files before returning error, time taken 4
>> Created 2137 files before returning error, time taken 8
>> rming took 2 seconds
>>
>> So rming is a lot faster, but we create less files all in all and get
>> ENOSPC earlier. This means that most of the time bytes_pinned is not
>> enough to satisfy the allocation hence we are hitting the second
>> percpu_counter comparison.
>>
> 
> Right, it's sort of my expected bahavior because all 1K buffered IO ends up
> being inline extent, it's likely to run out of metadata space very soon.

Are you going to send this as an official patch to the ML?

> 
>> Also, the reason why the previous links showed 0 for bytes_pinned was
>> due to me having completely forgotten that bytes_pinned is a percpu
>> counter, hence my stap script wasn't actually reading it correctly.
> 
> I see, bytes_pinned in space_info is different from the percpu one, they're
> updated at different time, but overall the percpu one is the the preciser
> counter.
> 
> -liubo
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs metadata reclaim behavior/performance characteristics
  2017-05-21 12:45           ` Nikolay Borisov
@ 2017-05-22 22:57             ` Liu Bo
  0 siblings, 0 replies; 8+ messages in thread
From: Liu Bo @ 2017-05-22 22:57 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, jeffm

On Sun, May 21, 2017 at 03:45:02PM +0300, Nikolay Borisov wrote:
> 
> 
> On 19.05.2017 21:32, Liu Bo wrote:
> > On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
> >>> From: Liu Bo <bo.li.liu@oracle.com>
> >>>
> >>> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> >>>
> >>> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> >>>
> >>> This changes the code to the above logic.
> >>>
> >>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> >>> ---
> >>>  fs/btrfs/extent-tree.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>> index e390451c72e6..bded1ddd1bb6 100644
> >>> --- a/fs/btrfs/extent-tree.c
> >>> +++ b/fs/btrfs/extent-tree.c
> >>> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >>>  
> >>> 	spin_lock(&delayed_rsv->lock);
> >>>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> >>> -			   bytes - delayed_rsv->size) >= 0) {
> >>> +			   	   		         bytes - delayed_rsv->size) < 0) {
> >>>  							       spin_unlock(&delayed_rsv->lock);
> >>> 								return -ENOSPC;
> >>>  								}
> >>>
> >>
> >> Your patch does make a very big difference. Here are a couple of runs of
> >> slow-rm:
> >>
> >>
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 837 files before returning error, time taken 3
> >> Created 920 files before returning error, time taken 3
> >> Created 949 files before returning error, time taken 3
> >> Created 930 files before returning error, time taken 3
> >> Created 1101 files before returning error, time taken 4
> >> Created 1082 files before returning error, time taken 4
> >> Created 1608 files before returning error, time taken 5
> >> Created 1735 files before returning error, time taken 5
> >> rming took 1 seconds
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 801 files before returning error, time taken 3
> >> Created 829 files before returning error, time taken 3
> >> Created 983 files before returning error, time taken 3
> >> Created 978 files before returning error, time taken 3
> >> Created 1023 files before returning error, time taken 3
> >> Created 1126 files before returning error, time taken 3
> >> Created 1538 files before returning error, time taken 4
> >> Created 1737 files before returning error, time taken 5
> >> rming took 2 seconds
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 875 files before returning error, time taken 3
> >> Created 891 files before returning error, time taken 3
> >> Created 969 files before returning error, time taken 4
> >> Created 1002 files before returning error, time taken 4
> >> Created 1039 files before returning error, time taken 4
> >> Created 1051 files before returning error, time taken 4
> >> Created 1191 files before returning error, time taken 4
> >> Created 2137 files before returning error, time taken 8
> >> rming took 2 seconds
> >>
> >> So rming is a lot faster, but we create less files all in all and get
> >> ENOSPC earlier. This means that most of the time bytes_pinned is not
> >> enough to satisfy the allocation hence we are hitting the second
> >> percpu_counter comparison.
> >>
> > 
> > Right, it's sort of my expected bahavior because all 1K buffered IO ends up
> > being inline extent, it's likely to run out of metadata space very soon.
> 
> Are you going to send this as an official patch to the ML ?
> 

Yes, here is the link:
https://patchwork.kernel.org/patch/9737947/

Thanks,
-liubo
> > 
> >> Also, the reason why the previous links showed 0 for bytes_pinned was
> >> due to me having completely forgotten that bytes_pinned is a percpu
> >> counter, hence my stap script wasn't actually reading it correctly.
> > 
> > I see, bytes_pinned in space_info is different from the percpu one, they're
> > updated at different time, but overall the percpu one is the the preciser
> > counter.
> > 
> > -liubo
> > 

^ permalink raw reply	[flat|nested] 8+ messages in thread

Thread overview: 8+ messages
2017-05-18  8:40 btrfs metadata reclaim behavior/performance characteristics Nikolay Borisov
2017-05-18 14:45 ` Liu Bo
2017-05-18 15:41   ` Nikolay Borisov
2017-05-18 21:47     ` Liu Bo
2017-05-19  9:54       ` Nikolay Borisov
2017-05-19 18:32         ` Liu Bo
2017-05-21 12:45           ` Nikolay Borisov
2017-05-22 22:57             ` Liu Bo
