From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: reduce contention on log trees when logging checksums
Date: Fri, 17 Jul 2020 11:26:39 -0400 [thread overview]
Message-ID: <f82d7a79-5be8-339e-2a8a-2f19c585edae@toxicpanda.com> (raw)
In-Reply-To: <20200715113043.3192206-1-fdmanana@kernel.org>
On 7/15/20 7:30 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The possibility of extents being shared (through clone and deduplication
> operations) requires special care when logging data checksums, to avoid
> having a log tree with different checksum items that cover ranges which
> overlap (which resulted in missing checksums after replaying a log tree).
> Such problems were fixed in the past by the following commits:
>
> commit 40e046acbd2f ("Btrfs: fix missing data checksums after replaying a
> log tree")
>
> commit e289f03ea79b ("btrfs: fix corrupt log due to concurrent fsync of
> inodes with shared extents")
>
> Test case generic/588 exercises the scenario solved by the first commit
> (purely sequential and deterministic) while test case generic/457 often
> triggered the case fixed by the second commit (not deterministic, requires
> specific timings under concurrency).
>
> The problems were addressed by deleting, from the log tree, any existing
> checksums before logging the new ones. And also by doing the deletion and
> logging of the checksums while locking the checksum range in an extent io
> tree (root->log_csum_range), to deal with the case where we have concurrent
> fsyncs against files with shared extents.
>
> That however causes more contention on the leaves of a log tree where we
> store checksums (and all the nodes in the paths leading to them), even
> when we do not have shared extents, or all the shared extents were created
> by past transactions. It also adds a bit of contention on the spin lock of
> the log_csum_range extent io tree of the log root.
>
> This change adds a 'last_reflink_trans' field to the inode to keep track
> of the last transaction where a new extent was shared between inodes
> (through clone and deduplication operations). It is updated for both the
> source and destination inodes of reflink operations whenever a new extent
> (created in the current transaction) becomes shared by the inodes. This
> field is kept in memory only, not persisted in the inode item, similar
> to other existing fields (last_unlink_trans, logged_trans).
>
> When logging checksums for an extent, if the value of 'last_reflink_trans'
> is smaller than the current transaction's generation/id, we skip locking
> the extent range and deletion of checksums from the log tree, since we
> know we do not have new shared extents. This reduces contention on the
> log tree's leaves where checksums are stored.
>
> The following script, which uses fio, was used to measure the impact of
> this change:
>
> $ cat test-fsync.sh
> #!/bin/bash
>
> DEV=/dev/sdk
> MNT=/mnt/sdk
> MOUNT_OPTIONS="-o ssd"
> MKFS_OPTIONS="-d single -m single"
>
> if [ $# -ne 3 ]; then
>     echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
>     exit 1
> fi
>
> NUM_JOBS=$1
> FILE_SIZE=$2
> FSYNC_FREQ=$3
>
> cat <<EOF > /tmp/fio-job.ini
> [writers]
> rw=write
> fsync=$FSYNC_FREQ
> fallocate=none
> group_reporting=1
> direct=0
> bs=64k
> ioengine=sync
> size=$FILE_SIZE
> directory=$MNT
> numjobs=$NUM_JOBS
> EOF
>
> echo "Using config:"
> echo
> cat /tmp/fio-job.ini
> echo
>
> mkfs.btrfs -f $MKFS_OPTIONS $DEV
> mount $MOUNT_OPTIONS $DEV $MNT
> fio /tmp/fio-job.ini
> umount $MNT
>
> The tests were performed for different numbers of jobs, file sizes and
> fsync frequency. A QEMU VM using KVM was used, with 8 cores (the host has
> 12 cores, with the cpu frequency governor set to performance mode on all
> cores), 16GiB of RAM (the host has 64GiB) and an NVMe device used directly
> (without an intermediary filesystem in the host). While running the tests,
> the host was not used for anything else, to avoid disturbing the tests.
>
> The obtained results were the following (the last line of fio's output is
> pasted for each run). Starting at 16 jobs is where a significant difference
> is observable in this particular setup and hardware (differences highlighted
> below). The very small differences for tests with fewer than 16 jobs are
> likely just noise.
>
> **** 1 job, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
>
> after this change:
>
> WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
>
> **** 2 jobs, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
>
> after this change:
>
> WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
>
> **** 4 jobs, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
>
> after this change:
>
> WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
>
> **** 8 jobs, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
>
> after this change:
>
> WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
>
> **** 16 jobs, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
>
> after this change:
>
> WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
> (+40.1% throughput, -28.5% runtime)
>
> **** 32 jobs, file size 1G, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
>
> after this change:
>
> WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
> (+29.4% throughput, -22.7% runtime)
>
> **** 64 jobs, file size 512M, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
>
> after this change:
>
> WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
> (+29.3% throughput, -22.7% runtime)
>
> **** 128 jobs, file size 256M, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
>
> after this change:
>
> WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
> (+121.2% throughput, -54.6% runtime)
>
> **** 256 jobs, file size 256M, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
>
> after this change:
>
> WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
> (+74.9% throughput, -42.8% runtime)
>
> **** 512 jobs, file size 128M, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
>
> after this change:
>
> WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
> (+65.1% throughput, -39.3% runtime)
>
> **** 1024 jobs, file size 128M, fsync frequency 1 ****
>
> before this change:
>
> WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
>
> after this change:
>
> WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
> (+48.7% throughput, -33.1% runtime)
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Looks good,
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
Thread overview: 3+ messages
2020-07-15 11:30 [PATCH] btrfs: reduce contention on log trees when logging checksums fdmanana
2020-07-17 15:26 ` Josef Bacik [this message]
2020-07-20 14:15 ` David Sterba