public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] btrfs: some fsync updates when logging directories
@ 2023-04-05 17:51 fdmanana
  2023-04-05 17:51 ` [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory fdmanana
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: fdmanana @ 2023-04-05 17:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Two optimizations for directory logging, more details on the changelogs.

Filipe Manana (2):
  btrfs: avoid iterating over all indexes when logging directory
  btrfs: use log root when iterating over index keys when logging directory

 fs/btrfs/btrfs_inode.h | 32 +++++++++++++++---
 fs/btrfs/tree-log.c    | 76 +++++++++++++++++++++++++++---------------
 2 files changed, 77 insertions(+), 31 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory
  2023-04-05 17:51 [PATCH 0/2] btrfs: some fsync updates when logging directories fdmanana
@ 2023-04-05 17:51 ` fdmanana
  2023-04-06 15:24   ` David Sterba
  2023-04-05 17:51 ` [PATCH 2/2] btrfs: use log root when iterating over index keys " fdmanana
  2023-04-06 18:52 ` [PATCH 0/2] btrfs: some fsync updates when logging directories David Sterba
  2 siblings, 1 reply; 5+ messages in thread
From: fdmanana @ 2023-04-05 17:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When logging a directory, after copying all directory index items from the
subvolume tree to the log tree, we iterate over the subvolume tree to find
all dir index items that are located in leaves COWed (or created) in the
current transaction. If we keep logging a directory several times during
the same transaction, we end up iterating over the same dir index items
everytime we log the directory, wasting time and adding extra lock
contention on the subvolume tree.

So just keep track of the last logged dir index offset in order to start
the search for that index (+1) the next time the directory is logged, as
dir index values (key offsets) come from a monotonically increasing
counter.

The following test measures the difference before and after this change:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  # Time values in milliseconds.
  declare -a fsync_times
  # Total number of files added to the test directory.
  num_files=1000000
  # Fsync directory after every N files are added.
  fsync_period=100

  mkdir $MNT/testdir

  fsync_total_time=0
  for ((i = 1; i <= $num_files; i++)); do
        echo -n > $MNT/testdir/file_$i

        if [ $((i % fsync_period)) -eq 0 ]; then
                start=$(date +%s%N)
                xfs_io -c "fsync" $MNT/testdir
                end=$(date +%s%N)
                fsync_total_time=$((fsync_total_time + (end - start)))
                fsync_times[i]=$(( (end - start) / 1000000 ))
                echo -n -e "Progress $i / $num_files\r"
        fi
  done

  echo -e "\nHistogram of directory fsync duration in ms:\n"

  printf '%s\n' "${fsync_times[@]}" | \
     perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'

  fsync_total_time=$((fsync_total_time / 1000000))
  echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
  echo

  umount $MNT

The test was run on a non-debug kernel (Debian's default kernel config)
against a 15G null block device.

Result before this change:

   Histogram of directory fsync duration in ms:

   Count: 10000
   Range:  3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
   Percentiles:  90th: 71.000; 95th: 77.000; 99th: 81.000
      3.000 -    5.278:  1423 #################################
      5.278 -    8.854:  1173 ###########################
      8.854 -   14.467:   591 ##############
     14.467 -   23.277:  1025 #######################
     23.277 -   37.105:  1422 #################################
     37.105 -   58.809:  2036 ###############################################
     58.809 -   92.876:  2316 #####################################################
     92.876 -  146.346:     6 |
    146.346 -  230.271:     6 |
    230.271 -  362.000:     2 |

   Total time spent in fsync: 350527 ms

Result after this change:

   Histogram of directory fsync duration in ms:

   Count: 10000
   Range:  3.000 - 1088.000; Mean:  8.704; Median:  8.000; Stddev: 12.576
   Percentiles:  90th: 12.000; 95th: 14.000; 99th: 17.000
      3.000 -    6.007:  3222 #################################
      6.007 -   11.276:  5197 #####################################################
     11.276 -   20.506:  1551 ################
     20.506 -   36.674:    24 |
     36.674 -  201.552:     1 |
    201.552 -  353.841:     4 |
    353.841 - 1088.000:     1 |

   Total time spent in fsync: 92114 ms

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 32 +++++++++++++++++++++++++++-----
 fs/btrfs/tree-log.c    | 31 +++++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9dc21622806e..f1e087994f87 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -142,11 +142,22 @@ struct btrfs_inode {
 	/* a local copy of root's last_log_commit */
 	int last_log_commit;
 
-	/*
-	 * Total number of bytes pending delalloc, used by stat to calculate the
-	 * real block usage of the file. This is used only for files.
-	 */
-	u64 delalloc_bytes;
+	union {
+		/*
+		 * Total number of bytes pending delalloc, used by stat to
+		 * calculate the real block usage of the file. This is used
+		 * only for files.
+		 */
+		u64 delalloc_bytes;
+		/*
+		 * The lowest possible index of the next dir index key which
+		 * points to an inode that needs to be logged.
+		 * This is used only for directories.
+		 * Use the helpers btrfs_get_first_dir_index_to_log() and
+		 * btrfs_set_first_dir_index_to_log() to access this field.
+		 */
+		u64 first_dir_index_to_log;
+	};
 
 	union {
 		/*
@@ -247,6 +258,17 @@ struct btrfs_inode {
 	struct inode vfs_inode;
 };
 
+static inline u64 btrfs_get_first_dir_index_to_log(const struct btrfs_inode *inode)
+{
+	return READ_ONCE(inode->first_dir_index_to_log);
+}
+
+static inline void btrfs_set_first_dir_index_to_log(struct btrfs_inode *inode,
+						    u64 index)
+{
+	WRITE_ONCE(inode->first_dir_index_to_log, index);
+}
+
 static inline struct btrfs_inode *BTRFS_I(const struct inode *inode)
 {
 	return container_of(inode, struct btrfs_inode, vfs_inode);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9ab793b638a7..c2a467276071 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3648,6 +3648,9 @@ static int flush_dir_items_batch(struct btrfs_trans_handle *trans,
 		ret = BTRFS_LOG_FORCE_COMMIT;
 	else
 		inode->last_dir_index_offset = last_index;
+
+	if (btrfs_get_first_dir_index_to_log(inode) == 0)
+		btrfs_set_first_dir_index_to_log(inode, batch.keys[0].offset);
 out:
 	kfree(ins_data);
 
@@ -5406,6 +5409,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 	LIST_HEAD(dir_list);
 	struct btrfs_dir_list *dir_elem;
 	u64 ino = btrfs_ino(start_inode);
+	struct btrfs_inode *curr_inode = start_inode;
 	int ret = 0;
 
 	/*
@@ -5420,18 +5424,22 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
+	ihold(&curr_inode->vfs_inode);
+
 	while (true) {
+		struct inode *vfs_inode;
 		struct extent_buffer *leaf;
 		struct btrfs_key min_key;
+		u64 next_index;
 		bool continue_curr_inode = true;
 		int nritems;
 		int i;
 
 		min_key.objectid = ino;
 		min_key.type = BTRFS_DIR_INDEX_KEY;
-		min_key.offset = 0;
+		min_key.offset = btrfs_get_first_dir_index_to_log(curr_inode);
+		next_index = min_key.offset;
 again:
-		btrfs_release_path(path);
 		ret = btrfs_search_forward(root, &min_key, path, trans->transid);
 		if (ret < 0) {
 			break;
@@ -5456,6 +5464,8 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 				break;
 			}
 
+			next_index = min_key.offset + 1;
+
 			di = btrfs_item_ptr(leaf, i, struct btrfs_dir_item);
 			type = btrfs_dir_ftype(leaf, di);
 			if (btrfs_dir_transid(leaf, di) < trans->transid)
@@ -5496,12 +5506,16 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 			break;
 		}
 
+		btrfs_release_path(path);
+
 		if (continue_curr_inode && min_key.offset < (u64)-1) {
 			min_key.offset++;
 			goto again;
 		}
 
 next:
+		btrfs_set_first_dir_index_to_log(curr_inode, next_index);
+
 		if (list_empty(&dir_list))
 			break;
 
@@ -5509,9 +5523,22 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 		ino = dir_elem->ino;
 		list_del(&dir_elem->list);
 		kfree(dir_elem);
+
+		btrfs_add_delayed_iput(curr_inode);
+		curr_inode = NULL;
+
+		vfs_inode = btrfs_iget(fs_info->sb, ino, root);
+		if (IS_ERR(vfs_inode)) {
+			ret = PTR_ERR(vfs_inode);
+			break;
+		}
+		curr_inode = BTRFS_I(vfs_inode);
 	}
 out:
 	btrfs_free_path(path);
+	if (curr_inode)
+		btrfs_add_delayed_iput(curr_inode);
+
 	if (ret) {
 		struct btrfs_dir_list *next;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] btrfs: use log root when iterating over index keys when logging directory
  2023-04-05 17:51 [PATCH 0/2] btrfs: some fsync updates when logging directories fdmanana
  2023-04-05 17:51 ` [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory fdmanana
@ 2023-04-05 17:51 ` fdmanana
  2023-04-06 18:52 ` [PATCH 0/2] btrfs: some fsync updates when logging directories David Sterba
  2 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2023-04-05 17:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When logging dir dentries of a directory, we iterate over the subvolume
tree to find dir index keys on leaves modified in the current transaction.
This however is heavy on locking, since btrfs_search_forward() may often
keep locks on extent buffers for quite a while when walking the tree to
find a suitable leaf modified in the current transaction and with a key
not smaller than then the provided minimum key. That means it will block
other tasks trying to access the subvolume tree, which may be common fs
operations like creating, renaming, linking, unlinking, reflinking files,
etc.

A better solution is to iterate the log tree, since it's much smaller than
a subvolume tree and just use plain btrfs_search_slot() (or the wrapper
btrfs_for_each_slot()) and only contains dir index keys added in the
current transaction.

The following bonnie++ test on a non-debug kernel (with Debian's default
kernel config) on a 20G null block device, was used to measure the impact:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nullb0
   MNT=/mnt/nullb0

   NR_DIRECTORIES=20
   NR_FILES=20480  # must be a multiple of 1024
   DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
   DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
   NR_FILES=$(( NR_FILES / 1024 ))

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   bonnie++ -u root -d $MNT \
       -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
       -r 0 -s $DATASET_SIZE -b

   umount $MNT

Before patchset:

   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   debian0          8G  376k  99  1.1g  98  939m  92 1527k  99  3.2g  99  9060 256
   Latency             24920us     207us     680ms    5594us     171us    2891us
   Version 2.00a       ------Sequential Create------ --------Random Create--------
   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 20/20 20480  96 +++++ +++ 20480  95 20480  99 +++++ +++ 20480  97
   Latency              8708us     137us    5128us    6743us      60us   19712us

After patchset:

   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   debian0          8G  384k  99  1.2g  99  971m  91 1533k  99  3.3g  99  9180 309
   Latency             24930us     125us     661ms    5587us      46us    2020us
   Version 2.00a       ------Sequential Create------ --------Random Create--------
   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 20/20 20480  90 +++++ +++ 20480  99 20480  99 +++++ +++ 20480  97
   Latency              7030us      61us    1246us    4942us      56us   16855us

The patchset consists of this patch plus a previous one that has the
following subject:

   "btrfs: avoid iterating over all indexes when logging directory"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 51 +++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c2a467276071..f298d7b03b8c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5428,45 +5428,34 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 
 	while (true) {
 		struct inode *vfs_inode;
-		struct extent_buffer *leaf;
-		struct btrfs_key min_key;
+		struct btrfs_key key;
+		struct btrfs_key found_key;
 		u64 next_index;
 		bool continue_curr_inode = true;
-		int nritems;
-		int i;
+		int iter_ret;
 
-		min_key.objectid = ino;
-		min_key.type = BTRFS_DIR_INDEX_KEY;
-		min_key.offset = btrfs_get_first_dir_index_to_log(curr_inode);
-		next_index = min_key.offset;
+		key.objectid = ino;
+		key.type = BTRFS_DIR_INDEX_KEY;
+		key.offset = btrfs_get_first_dir_index_to_log(curr_inode);
+		next_index = key.offset;
 again:
-		ret = btrfs_search_forward(root, &min_key, path, trans->transid);
-		if (ret < 0) {
-			break;
-		} else if (ret > 0) {
-			ret = 0;
-			goto next;
-		}
-
-		leaf = path->nodes[0];
-		nritems = btrfs_header_nritems(leaf);
-		for (i = path->slots[0]; i < nritems; i++) {
+		btrfs_for_each_slot(root->log_root, &key, &found_key, path, iter_ret) {
+			struct extent_buffer *leaf = path->nodes[0];
 			struct btrfs_dir_item *di;
 			struct btrfs_key di_key;
 			struct inode *di_inode;
 			int log_mode = LOG_INODE_EXISTS;
 			int type;
 
-			btrfs_item_key_to_cpu(leaf, &min_key, i);
-			if (min_key.objectid != ino ||
-			    min_key.type != BTRFS_DIR_INDEX_KEY) {
+			if (found_key.objectid != ino ||
+			    found_key.type != BTRFS_DIR_INDEX_KEY) {
 				continue_curr_inode = false;
 				break;
 			}
 
-			next_index = min_key.offset + 1;
+			next_index = found_key.offset + 1;
 
-			di = btrfs_item_ptr(leaf, i, struct btrfs_dir_item);
+			di = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dir_item);
 			type = btrfs_dir_ftype(leaf, di);
 			if (btrfs_dir_transid(leaf, di) < trans->transid)
 				continue;
@@ -5508,12 +5497,20 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 
 		btrfs_release_path(path);
 
-		if (continue_curr_inode && min_key.offset < (u64)-1) {
-			min_key.offset++;
+		if (iter_ret < 0) {
+			ret = iter_ret;
+			goto out;
+		} else if (iter_ret > 0) {
+			continue_curr_inode = false;
+		} else {
+			key = found_key;
+		}
+
+		if (continue_curr_inode && key.offset < (u64)-1) {
+			key.offset++;
 			goto again;
 		}
 
-next:
 		btrfs_set_first_dir_index_to_log(curr_inode, next_index);
 
 		if (list_empty(&dir_list))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory
  2023-04-05 17:51 ` [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory fdmanana
@ 2023-04-06 15:24   ` David Sterba
  0 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2023-04-06 15:24 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Apr 05, 2023 at 06:51:29PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 9ab793b638a7..c2a467276071 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3648,6 +3648,9 @@ static int flush_dir_items_batch(struct btrfs_trans_handle *trans,
>  		ret = BTRFS_LOG_FORCE_COMMIT;
>  	else
>  		inode->last_dir_index_offset = last_index;
> +
> +	if (btrfs_get_first_dir_index_to_log(inode) == 0)
> +		btrfs_set_first_dir_index_to_log(inode, batch.keys[0].offset);
>  out:
>  	kfree(ins_data);
>  
> @@ -5406,6 +5409,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  	LIST_HEAD(dir_list);
>  	struct btrfs_dir_list *dir_elem;
>  	u64 ino = btrfs_ino(start_inode);
> +	struct btrfs_inode *curr_inode = start_inode;
>  	int ret = 0;
>  
>  	/*
> @@ -5420,18 +5424,22 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  	if (!path)
>  		return -ENOMEM;
>  
> +	ihold(&curr_inode->vfs_inode);

I'll add comment why ther ref is needed, we don't have any other
instance of ihold/btrfs_add_delayed_iput pattern, all other ihold()
calls are commented.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] btrfs: some fsync updates when logging directories
  2023-04-05 17:51 [PATCH 0/2] btrfs: some fsync updates when logging directories fdmanana
  2023-04-05 17:51 ` [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory fdmanana
  2023-04-05 17:51 ` [PATCH 2/2] btrfs: use log root when iterating over index keys " fdmanana
@ 2023-04-06 18:52 ` David Sterba
  2 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2023-04-06 18:52 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Apr 05, 2023 at 06:51:28PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Two optimizations for directory logging, more details on the changelogs.
> 
> Filipe Manana (2):
>   btrfs: avoid iterating over all indexes when logging directory
>   btrfs: use log root when iterating over index keys when logging directory

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-04-06 18:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-05 17:51 [PATCH 0/2] btrfs: some fsync updates when logging directories fdmanana
2023-04-05 17:51 ` [PATCH 1/2] btrfs: avoid iterating over all indexes when logging directory fdmanana
2023-04-06 15:24   ` David Sterba
2023-04-05 17:51 ` [PATCH 2/2] btrfs: use log root when iterating over index keys " fdmanana
2023-04-06 18:52 ` [PATCH 0/2] btrfs: some fsync updates when logging directories David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox