All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: preallocate temporary extent buffer for inode logging when needed
Date: Wed, 31 Jan 2024 15:41:48 -0500	[thread overview]
Message-ID: <20240131204148.GA3203388@perftesting> (raw)
In-Reply-To: <1ef0997eee1fbe194ab2546f34052cd4e27c6ef4.1706612525.git.fdmanana@suse.com>

On Tue, Jan 30, 2024 at 11:05:44AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When logging an inode and we require to copy items from subvolume leaves
> to the log tree, we clone each subvolume leaf and than use that clone to
> copy items to the log tree. This is required to avoid possible deadlocks
> as stated in commit 796787c978ef ("btrfs: do not modify log tree while
> holding a leaf from fs tree locked").
> 
> The cloning requires allocating an extent buffer (struct extent_buffer)
> and then allocating pages (folios) to attach to the extent buffer. This
> may be slow in case we are under memory pressure, and since we are doing
> the cloning while holding a read lock on a subvolume leaf, it means we
> can be blocking other operations on that leaf for significant periods of
> time, which can increase latency on operations like creating other files,
> renaming files, etc. Similarly because we're under a log transaction, we
> may also cause extra delay on other tasks doing an fsync, because syncing
> the log requires waiting for tasks that joined a log transaction to exit
> the transaction.
> 
> So to improve this, for any inode logging operation that needs to copy
> items from a subvolume leaf ("full sync" or "copy everything" bit set
> in the inode), preallocate a dummy extent buffer before locking any
> extent buffer from the subvolume tree, and even before joining a log
> transaction, add it to the log context and then use it when we need to
> copy items from a subvolume leaf to the log tree. This avoids making
> other operations get extra latency when waiting to lock a subvolume
> leaf that is used during inode logging and we are under heavy memory
> pressure.
> 
> The following test script with bonnie++ was used to test this:
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdh
>   MNT=/mnt/sdh
>   MOUNT_OPTIONS="-o ssd"
> 
>   MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'`
>   NR_DIRECTORIES=20
>   NR_FILES=20480
>   DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576))
>   DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES))
>   NR_FILES=$((NR_FILES / 1024))
> 
>   echo "performance" | \
>       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
>   umount $DEV &> /dev/null
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   bonnie++ -u root -d $MNT \
>       -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
>       -r 0 -s $DATASET_SIZE -b
> 
>   umount $MNT
> 
> The results of this test on a 8G VM running a non-debug kernel (Debian's
> default kernel config), were the following.
> 
> Before this change:
> 
>   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>   debian0       7501M  376k  99  1.4g  96  117m  14 1510k  99  2.5g  95 +++++ +++
>   Latency             35068us   24976us    2944ms   30725us   71770us   26152us
>   Version 2.00a       ------Sequential Create------ --------Random Create--------
>   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>   files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>   20:384100:384100/20 20480  32 20480  58 20480  48 20480  39 20480  56 20480  61
>   Latency               411ms   11914us     119ms     617ms   10296us     110ms
> 
> After this change:
> 
>   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>   debian0       7501M  375k  99  1.4g  97  117m  14 1546k  99  2.3g  98 +++++ +++
>   Latency             35975us  20945us    2144ms   10297us    2217us    6004us
>   Version 2.00a       ------Sequential Create------ --------Random Create--------
>   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>   files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>   20:384100:384100/20 20480  35 20480  58 20480  48 20480  40 20480  57 20480  59
>   Latency               320ms   11237us   77779us     518ms    6470us   86389us
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c     | 12 ++++++
>  fs/btrfs/tree-log.c | 93 +++++++++++++++++++++++++++------------------
>  fs/btrfs/tree-log.h | 25 ++++++++++++
>  3 files changed, 94 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f8e1a7ce3d39..fd5e23035a28 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1912,6 +1912,8 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  		goto out_release_extents;
>  	}
>  
> +	btrfs_init_log_ctx_scratch_eb(&ctx);
> +
>  	/*
>  	 * We use start here because we will need to wait on the IO to complete
>  	 * in btrfs_sync_log, which could require joining a transaction (for
> @@ -1931,6 +1933,15 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	trans->in_fsync = true;
>  
>  	ret = btrfs_log_dentry_safe(trans, dentry, &ctx);
> +	/*
> +	 * Scratch eb no longer needed, release before syncing log or commit
> +	 * transaction, to avoid holding unnecessary memory during such long
> +	 * operations.
> +	 */
> +	if (ctx.scratch_eb) {
> +		free_extent_buffer(ctx.scratch_eb);
> +		ctx.scratch_eb = NULL;
> +	}
>  	btrfs_release_log_ctx_extents(&ctx);
>  	if (ret < 0) {
>  		/* Fallthrough and commit/free transaction. */
> @@ -2006,6 +2017,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  
>  	ret = btrfs_commit_transaction(trans);
>  out:
> +	free_extent_buffer(ctx.scratch_eb);
>  	ASSERT(list_empty(&ctx.list));
>  	ASSERT(list_empty(&ctx.conflict_inodes));
>  	err = file_check_and_advance_wb_err(file);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 331fc7429952..761b13b3d342 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3619,6 +3619,30 @@ static int flush_dir_items_batch(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int clone_leaf(struct btrfs_path *path, struct btrfs_log_ctx *ctx)
> +{
> +	const int slot = path->slots[0];
> +
> +	if (ctx->scratch_eb) {
> +		copy_extent_buffer_full(ctx->scratch_eb, path->nodes[0]);
> +	} else {
> +		ctx->scratch_eb = btrfs_clone_extent_buffer(path->nodes[0]);
> +		if (!ctx->scratch_eb)
> +			return -ENOMEM;
> +	}
> +
> +	btrfs_release_path(path);
> +	path->nodes[0] = ctx->scratch_eb;

Here we put the scratch_b into path->nodes[0], so if we go do the next leaf in
the copy_items loop we'll drop our reference for this scratch_eb, and then we're
just writing into free'd memory.  Am I missing something here?  Thanks,

Josef

  reply	other threads:[~2024-01-31 20:41 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-30 11:05 [PATCH] btrfs: preallocate temporary extent buffer for inode logging when needed fdmanana
2024-01-31 20:41 ` Josef Bacik [this message]
2024-01-31 20:55   ` Filipe Manana
2024-02-01 19:52     ` Josef Bacik
2024-02-02 12:31 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240131204148.GA3203388@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.