From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v4 0/5] btrfs: qgroup: address the performance penalty for subvolume dropping
Date: Wed, 24 Aug 2022 09:14:04 +0800
Message-ID: <cover.1661302005.git.wqu@suse.com>

[CHANGELOG]
v4:
- Fix a kmalloc() usage which can lead to kobject warnings
  kobject_init_and_add() relies on certain member values being
  initialized, but kmalloc() does not zero the memory, so it can leave
  kobj->state_initialized set and cause a crash at qgroups_kobj
  initialization time.

- Add a script in the cover letter to verify the behavior
  It will be submitted later as a test case.

v3:
- Rebased to latest misc-next

v2:
- Split /sys/fs/btrfs/<uuid>/qgroups/qgroup_flags into two entries

- Update the cover letter to explain drop_subtree_threshold better

Btrfs qgroup has a long history of causing huge performance penalties,
from subvolume dropping to balance.

Although we solved the problem for balance, the subvolume dropping
problem is still unresolved, as we really need to do all the costly
backref walks for all the involved subtrees, or qgroup numbers will be
inconsistent.

But the performance penalty is sometimes too big, so big that it's
better to just disable qgroup, do the drop, then do the rescan.

This patchset addresses the problem by introducing a user
configurable sysfs interface, which allows dropping a sufficiently high
subtree to mark qgroup inconsistent and skip the whole accounting.

The following things are needed for this objective:

- New qgroups attributes
  Instead of plain qgroup kobjects, we need extra attributes like
  drop_subtree_threshold.

  This patchset will introduce two new attributes to the existing
  qgroups kobject:
  * qgroups_flags
    To indicate the qgroup status flags like ON, RESCAN, INCONSISTENT.

  * drop_subtree_threshold
    To show the subtree dropping level threshold.
    The default value is BTRFS_MAX_LEVEL (8), which means all subtree
    dropping will go through the qgroup accounting. While costly, this
    tries to keep qgroup numbers as consistent as possible.

    Users can specify values like 3, meaning dropping any subtree which
    is at level 3 or higher will mark qgroup inconsistent and skip all
    the costly accounting.

    NOTE: if a snapshot is created with tree root level 3, dropping the
    snapshot with drop_subtree_threshold 3 will not mark the qgroup
    inconsistent, since the level threshold applies to shared subtree
    nodes, not the snapshot root node.
    For a newly created snapshot, only its tree blocks at level
    (root level - 1) are shared subtrees.

    This only affects subvolume dropping.

- Skip qgroup accounting when the numbers are already inconsistent
  The qgroup relationships are still kept correct, so users can keep
  their qgroup organization and do the rescan later.
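
As a rough illustration of the threshold rule above, the decision can be
modelled in a few lines of shell. This is only a sketch of the behavior
described in this cover letter, not the kernel code; the function names
are made up for the example.

```shell
# Illustrative model only; the real check lives in the kernel.
# BTRFS_MAX_LEVEL is 8, as stated in the cover letter.
BTRFS_MAX_LEVEL=8

# subtree_skips_accounting LEVEL [THRESHOLD]
# Succeeds if dropping a shared subtree whose root is at LEVEL would mark
# qgroup inconsistent and skip accounting under THRESHOLD.
subtree_skips_accounting() {
	local level threshold
	level=$1
	threshold=${2:-$BTRFS_MAX_LEVEL}
	[ "$level" -ge "$threshold" ]
}

# snapshot_drop_skips_accounting ROOT_LEVEL [THRESHOLD]
# For a newly created snapshot the shared subtrees sit at ROOT_LEVEL - 1,
# so that level, not the root level, is compared against the threshold.
snapshot_drop_skips_accounting() {
	local root_level threshold
	root_level=$1
	threshold=${2:-$BTRFS_MAX_LEVEL}
	# A level 0 root is a single leaf, there is no subtree below it.
	[ "$root_level" -ge 1 ] || return 1
	subtree_skips_accounting "$((root_level - 1))" "$threshold"
}
```

In this model "snapshot_drop_skips_accounting 3 3" fails (accounting is
kept, matching the NOTE above), while "snapshot_drop_skips_accounting 3 2"
succeeds.
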
All the behavior can be verified using the following simple script:
btrfs dev scan -u
mkfs.btrfs -U $fsid -f -n 4k $dev
mount $dev $mnt
btrfs subv create $mnt/subv
# Bump the tree level to 2
for (( i = 0; i < 8192; i++ )); do
	xfs_io -f -c "pwrite 0 2k" $mnt/subv/inline1_$i > /dev/null
	xfs_io -f -c "pwrite 0 2k" $mnt/subv/inline2_$i > /dev/null
	xfs_io -f -c "pwrite 0 4k" $mnt/subv/regular_$i > /dev/null
done
sync
btrfs subv snapshot $mnt/subv $mnt/snap
btrfs quota enable $mnt
btrfs quota rescan -w $mnt
echo 2 > /sys/fs/btrfs/$fsid/qgroups/drop_subtree_threshold
btrfs subv delete $mnt/snap
btrfs subv sync $mnt

After the above workload, "btrfs qgroup show" should give the following
warning:

  WARNING: qgroup data inconsistent, rescan recommended
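
A user space tool could detect this state by looking for the warning
text. The helper below is a minimal sketch; its name is hypothetical, and
it assumes the warning may be printed on stderr, so the caller should
merge both streams before piping.

```shell
# qgroup_needs_rescan: read "btrfs qgroup show" output on stdin and
# succeed if it contains the inconsistency warning quoted above.
qgroup_needs_rescan() {
	grep -q 'qgroup data inconsistent'
}

# Possible usage (needs root and a btrfs mounted at $mnt):
#   if btrfs qgroup show "$mnt" 2>&1 | qgroup_needs_rescan; then
#       btrfs quota rescan -w "$mnt"
#   fi
```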

This sysfs interface needs user space tools to monitor and set the
values for each btrfs filesystem.

It's also the user space daemon's responsibility to save the
drop_subtree_threshold values, as introducing a new on-disk format just
for an optional qgroup feature seems like overkill to me.

Currently the target user space tool is snapper, which by default
utilizes qgroups for its space-aware snapshot reclaim mechanism.
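
Since the threshold is not persisted on disk, a user space helper would
have to save and restore it itself. The following is only a sketch of how
that could look; the function names and the STATE_DIR location are
assumptions for illustration, and the accepted range 0..8 is assumed from
BTRFS_MAX_LEVEL.

```shell
# Both paths can be overridden, e.g. to try the helpers on a scratch
# directory. STATE_DIR is a hypothetical location chosen for this sketch.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/fs/btrfs}
STATE_DIR=${STATE_DIR:-/var/lib/qgroup-threshold}

# save_threshold FSID: copy the current sysfs value into STATE_DIR.
save_threshold() {
	local fsid attr
	fsid=$1
	attr="$SYSFS_ROOT/$fsid/qgroups/drop_subtree_threshold"
	[ -r "$attr" ] || return 1
	mkdir -p "$STATE_DIR" && cat "$attr" > "$STATE_DIR/$fsid"
}

# restore_threshold FSID: write the saved value back, e.g. after mount.
restore_threshold() {
	local fsid attr value
	fsid=$1
	attr="$SYSFS_ROOT/$fsid/qgroups/drop_subtree_threshold"
	[ -r "$STATE_DIR/$fsid" ] || return 1
	value=$(cat "$STATE_DIR/$fsid")
	# Only accept a single decimal level in 0..8 (assumed valid range).
	case $value in
	[0-8]) ;;
	*) return 1 ;;
	esac
	[ -w "$attr" ] || return 1
	echo "$value" > "$attr"
}
```

A daemon could call save_threshold whenever it changes the value and
restore_threshold after each mount of the filesystem.
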

Qu Wenruo (5):
  btrfs: sysfs: introduce qgroup global attribute groups
  btrfs: introduce BTRFS_QGROUP_STATUS_FLAGS_MASK for later expansion
  btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN
  btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip
    qgroup accounting
  btrfs: skip subtree scan if it's too high to avoid low stall in
    btrfs_commit_transaction()

 fs/btrfs/ctree.h                |   1 +
 fs/btrfs/disk-io.c              |   1 +
 fs/btrfs/qgroup.c               |  84 ++++++++++++++++++------
 fs/btrfs/qgroup.h               |   3 +
 fs/btrfs/sysfs.c                | 112 ++++++++++++++++++++++++++++++--
 include/uapi/linux/btrfs_tree.h |   4 ++
 6 files changed, 180 insertions(+), 25 deletions(-)

--
2.37.2