Linux Btrfs filesystem development
From: David Sterba <dsterba@suse.cz>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: unlock path before checking if extent is shared during nocow writeback
Date: Mon, 23 Nov 2020 17:13:33 +0100	[thread overview]
Message-ID: <20201123161333.GE8669@twin.jikos.cz> (raw)
In-Reply-To: <6fbdddf38bd353dba7eba2117573f3b74fb79e40.1605697030.git.fdmanana@suse.com>

On Wed, Nov 18, 2020 at 11:00:17AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we are attempting to start writeback for an existing extent in NOCOW
> mode, at run_delalloc_nocow(), we must check if the extent is shared and,
> if it is, fall back to a COW write. However, we do that check while still
> holding a read lock on the leaf that contains the file extent item, and
> the check, a call to btrfs_cross_ref_exist(), can take some time
> because:
> 
> 1) It needs to do a search on the extent tree, which obviously takes some
>    time, especially if delayed references are being run at the moment, as
>    we can block when trying to lock currently write-locked btree nodes;
> 
> 2) It needs to check the delayed references for any existing reference
>    to our data extent. This requires acquiring the delayed references'
>    spinlock and possibly blocking on the mutex of a delayed reference
>    head in case there is a delayed reference for our data extent; in the
>    worst case it makes us release the path on the extent tree and retry
>    the whole process again (going back to step 1).
> 
> There are other operations we do while holding the leaf locked that can
> take some significant time as well (especially all together):
> 
> * btrfs_extent_readonly() - to check if the block group containing the
>   extent is currently in RO mode. This requires taking a spinlock and
>   searching for the block group in a rbtree that can be big on large
>   filesystems;
> 
> * csum_exist_in_range() - to search for any checksums in the csum tree
>   for the extent. Like before, this can take some time if we are on a
>   filesystem that has both COW and NOCOW files, in which case the csum
>   tree is not empty;
> 
> * btrfs_inc_nocow_writers() - increment the number of nocow writers in the
>   block group that contains the data extent. Needs to acquire a spinlock
>   and search for the block group in a rbtree that can be big on large
>   filesystems.
> 
> So just unlock the leaf (release the path) before doing all those checks,
> since we do not need it anymore. In case we cannot do a NOCOW write for
> the extent, due to any of those checks failing, and the writeback range
> goes beyond that extent's length, we will do another btree search for the
> next file extent item.
> 
> The following script, which runs dbench, was used to measure the impact of
> this change on a VM with 8 CPUs, 16GiB of RAM, using a raw NVMe device
> directly (no intermediary filesystem on the host) and using a non-debug
> kernel (default configuration on Debian):
> 
>   $ cat test-dbench.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdk
>   MNT=/mnt/sdk
>   MOUNT_OPTIONS="-o ssd -o nodatacow"
>   MKFS_OPTIONS="-m single -d single"
> 
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   dbench -D $MNT -t 300 64
> 
>   umount $MNT
> 
> Before this change:
> 
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    9326331     0.317   399.957
>  Close        6851198     0.002     6.402
>  Rename        394894     2.621   402.819
>  Unlink       1883131     0.931   398.082
>  Deltree          256    19.160   303.580
>  Mkdir            128     0.003     0.016
>  Qpathinfo    8452314     0.068   116.133
>  Qfileinfo    1481921     0.001     5.081
>  Qfsinfo      1549963     0.002     4.444
>  Sfileinfo     759679     0.084    17.079
>  Find         3268168     0.396   118.196
>  WriteX       4653310     0.056   110.993
>  ReadX        14618818     0.005    23.314
>  LockX          30364     0.003     0.497
>  UnlockX        30364     0.002     1.720
>  Flush         653619    16.954   569.299
> 
> Throughput 966.651 MB/sec  64 clients  64 procs  max_latency=569.377 ms
> 
> After this change:
> 
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    9710433     0.302   232.449
>  Close        7132948     0.002    11.496
>  Rename        411144     2.452   131.805
>  Unlink       1960961     0.893   230.383
>  Deltree          256    14.858   198.646
>  Mkdir            128     0.002     0.005
>  Qpathinfo    8800890     0.066   111.588
>  Qfileinfo    1542556     0.001     3.852
>  Qfsinfo      1613835     0.002     5.483
>  Sfileinfo     790871     0.081    19.492
>  Find         3402743     0.386   120.185
>  WriteX       4842918     0.054   179.312
>  ReadX        15220407     0.005    32.435
>  LockX          31612     0.003     1.533
>  UnlockX        31612     0.002     1.047
>  Flush         680567    16.320   463.323
> 
> Throughput 1016.59 MB/sec  64 clients  64 procs  max_latency=463.327 ms
> 
> +5.0% throughput, -20.5% max latency
> 
> Also, the following test using fio was run:
> 
>   $ cat test-fio.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdk
>   MNT=/mnt/sdk
>   MOUNT_OPTIONS="-o ssd -o nodatacow"
>   MKFS_OPTIONS="-d single -m single"
> 
>   if [ $# -ne 4 ]; then
>       echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
>       exit 1
>   fi
> 
>   NUM_JOBS=$1
>   FILE_SIZE=$2
>   FSYNC_FREQ=$3
>   BLOCK_SIZE=$4
> 
>   cat <<EOF > /tmp/fio-job.ini
>   [writers]
>   rw=randwrite
>   fsync=$FSYNC_FREQ
>   fallocate=none
>   group_reporting=1
>   direct=0
>   bs=$BLOCK_SIZE
>   ioengine=sync
>   size=$FILE_SIZE
>   directory=$MNT
>   numjobs=$NUM_JOBS
>   EOF
> 
>   echo
>   echo "Using fio config:"
>   echo
>   cat /tmp/fio-job.ini
>   echo
>   echo "mount options: $MOUNT_OPTIONS"
>   echo
> 
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   echo "Creating nodatacow files before fio runs..."
>   for ((i = 0; i < $NUM_JOBS; i++)); do
>       xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
>   done
>   sync
> 
>   fio /tmp/fio-job.ini
>   umount $MNT
> 
> Before this change:
> 
> $ ./test-fio.sh 16 512M 2 4K
> (...)
> WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
> 
> After this change:
> 
> $ ./test-fio.sh 16 512M 2 4K
> (...)
> WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
> 
> +9.7% throughput, -9.8% runtime
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Nice, thank you. Added to misc-next.


Thread overview: 2+ messages
2020-11-18 11:00 [PATCH] btrfs: unlock path before checking if extent is shared during nocow writeback fdmanana
2020-11-23 16:13 ` David Sterba [this message]
