* [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements
@ 2021-07-27 10:24 fdmanana
2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The first patch in the series fixes a bug where a directory fsync, following
by inode eviction, followed by renaming a file and syncing the log results
in losing a file if a power failure happens and the log is replayed.
The remaining two changes are independent, and are about avoiding unnecessary
work during operations that need to check or modify the log (renames, adding
a hard link, unlinks) and making renames hold log commits for a shorter
period.
Filipe Manana (3):
btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
btrfs: eliminate some false positives when checking if inode was logged
btrfs: do not pin logs too early during renames
fs/btrfs/inode.c | 48 +++++++++++++++++++++++++++++++++++++++------
fs/btrfs/tree-log.c | 29 ++++++++++++++++-----------
2 files changed, 60 insertions(+), 17 deletions(-)
--
2.28.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
@ 2021-07-27 10:24 ` fdmanana
2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When checking if we need to log the new name of a renamed inode, we are
checking if the inode and its parent inode have been logged before, and if
not we don't log the new name. The check however is buggy, as it directly
compares the logged_trans field of the inodes versus the ID of the current
transaction. The problem is that logged_trans is a transient field, only
stored in memory and never persisted in the inode item, so if an inode
was logged before, evicted and reloaded, its logged_trans field is set to
a value of 0, meaning the check will return false and the new name of the
renamed inode is not logged. If the old parent directory was previously
fsynced and we deleted the logged directory entries corresponding to the
old name, we end up with a log that when replayed will delete the renamed
inode.
The following example triggers the problem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ echo -n "hello world" > /mnt/A/foo
$ sync
# Add some new file to A and fsync directory A.
$ touch /mnt/A/bar
$ xfs_io -c "fsync" /mnt/A
# Now trigger inode eviction. We are only interested in triggering
# eviction for the inode of directory A.
$ echo 2 > /proc/sys/vm/drop_caches
# Move foo from directory A to directory B.
# This deletes the directory entries for foo in A from the log, and
# does not add the new name for foo in directory B to the log, because
# logged_trans of A is 0, which is less than the current transaction ID.
$ mv /mnt/A/foo /mnt/B/foo
# Now make an fsync to anything except A, B or any file inside them,
# like for example create a file at the root directory and fsync this
# new file. This syncs the log that contains all the changes done by
# previous rename operation.
$ touch /mnt/baz
$ xfs_io -c "fsync" /mnt/baz
<power fail>
# Mount the filesystem and replay the log.
$ mount /dev/sdc /mnt
# Check the filesystem content.
$ ls -1R /mnt
/mnt/:
A
B
baz
/mnt/A:
bar
/mnt/B:
$
# File foo is gone, it's neither in A/ nor in B/.
Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
safely checks if an inode was logged before in the current transaction.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c2ebe700d6e2..8dde5c08a48f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6536,8 +6536,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
* if this inode hasn't been logged and directory we're renaming it
* from hasn't been logged, we don't need to log it
*/
- if (inode->logged_trans < trans->transid &&
- (!old_dir || old_dir->logged_trans < trans->transid))
+ if (!inode_logged(trans, inode) &&
+ (!old_dir || !inode_logged(trans, old_dir)))
return;
/*
--
2.28.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
@ 2021-07-27 10:24 ` fdmanana
2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:
1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
unnecessarily join a log transaction and do deletion searches in a log
tree that will not find anything. This just adds unnecessary contention
on extent buffer locks;
2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
needed. If the inode was not logged before, we don't need to log it in
LOG_INODE_EXISTS mode.
So just make sure that any false positive only happens when ->logged_trans
has a value of 0.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8dde5c08a48f..fc98b7a7a8e6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3421,14 +3421,10 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
}
/*
- * Check if an inode was logged in the current transaction. We can't always rely
- * on an inode's logged_trans value, because it's an in-memory only field and
- * therefore not persisted. This means that its value is lost if the inode gets
- * evicted and loaded again from disk (in which case it has a value of 0, and
- * certainly it is smaller then any possible transaction ID), when that happens
- * the full_sync flag is set in the inode's runtime flags, so on that case we
- * assume eviction happened and ignore the logged_trans value, assuming the
- * worst case, that the inode was logged before in the current transaction.
+ * Check if an inode was logged in the current transaction. This may often
+ * return some false positives, because logged_trans is an in memory only field,
+ * not persisted anywhere. This is meant to be used in contexts where a false
+ * positive has no functional consequences.
*/
static bool inode_logged(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
@@ -3436,7 +3432,18 @@ static bool inode_logged(struct btrfs_trans_handle *trans,
if (inode->logged_trans == trans->transid)
return true;
- if (inode->last_trans == trans->transid &&
+ /*
+ * The inode's logged_trans is always 0 when we load it (because it is
+ * not persisted in the inode item or elsewhere). So if it is 0, the
+ * inode was last modified in the current transaction and has the
+ * full_sync flag set, then the inode may have been logged before in
+ * the current transaction, then evicted and loaded again in the current
+ * transaction - or may have never been logged in the current transaction,
+ * but since we can not be sure, we have to assume it was, otherwise our
+ * callers can leave an inconsistent log.
+ */
+ if (inode->logged_trans == 0 &&
+ inode->last_trans == trans->transid &&
test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
!test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
return true;
--
2.28.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 3/3] btrfs: do not pin logs too early during renames
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
@ 2021-07-27 10:24 ` fdmanana
2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.
In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.
So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.
The following script that uses dbench was used for testing:
$ cat dbench-test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 16
umount $MNT
The tests were run on a machine with 12 cores, 64G of RAN, a NVME device
and using a non-debug kernel config (Debian's default config).
The results compare a branch without this patch and without the previous
patch in the series, that has the subject:
"btrfs: eliminate some false positives when checking if inode was logged"
Versus the same branch with these two patches applied.
dbench with 8 clients, results before:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4391359 0.009 249.745
Close 3225882 0.001 3.243
Rename 185953 0.065 240.643
Unlink 886669 0.049 249.906
Deltree 112 2.455 217.433
Mkdir 56 0.002 0.004
Qpathinfo 3980281 0.004 3.109
Qfileinfo 697579 0.001 0.187
Qfsinfo 729780 0.002 2.424
Sfileinfo 357764 0.004 1.415
Find 1538861 0.016 4.863
WriteX 2189666 0.010 3.327
ReadX 6883443 0.002 0.729
LockX 14298 0.002 0.073
UnlockX 14298 0.001 0.042
Flush 307777 2.447 303.663
Throughput 1149.6 MB/sec 8 clients 8 procs max_latency=303.666 ms
dbench with 8 clients, results after:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4269920 0.009 213.532
Close 3136653 0.001 0.690
Rename 180805 0.082 213.858
Unlink 862189 0.050 172.893
Deltree 112 2.998 218.328
Mkdir 56 0.002 0.003
Qpathinfo 3870158 0.004 5.072
Qfileinfo 678375 0.001 0.194
Qfsinfo 709604 0.002 0.485
Sfileinfo 347850 0.004 1.304
Find 1496310 0.017 5.504
WriteX 2129613 0.010 2.882
ReadX 6693066 0.002 1.517
LockX 13902 0.002 0.075
UnlockX 13902 0.001 0.055
Flush 299276 2.511 220.189
Throughput 1187.33 MB/sec 8 clients 8 procs max_latency=220.194 ms
+3.2% throughput, -31.8% max latency
dbench with 16 clients, results before:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5978334 0.028 156.507
Close 4391598 0.001 1.345
Rename 253136 0.241 155.057
Unlink 1207220 0.182 257.344
Deltree 160 6.123 36.277
Mkdir 80 0.003 0.005
Qpathinfo 5418817 0.012 6.867
Qfileinfo 949929 0.001 0.941
Qfsinfo 993560 0.002 1.386
Sfileinfo 486904 0.004 2.829
Find 2095088 0.059 8.164
WriteX 2982319 0.017 9.029
ReadX 9371484 0.002 4.052
LockX 19470 0.002 0.461
UnlockX 19470 0.001 0.990
Flush 418936 2.740 347.902
Throughput 1495.31 MB/sec 16 clients 16 procs max_latency=347.909 ms
dbench with 16 clients, results after:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5711833 0.029 131.240
Close 4195897 0.001 1.732
Rename 241849 0.204 147.831
Unlink 1153341 0.184 231.322
Deltree 160 6.086 30.198
Mkdir 80 0.003 0.021
Qpathinfo 5177011 0.012 7.150
Qfileinfo 907768 0.001 0.793
Qfsinfo 949205 0.002 1.431
Sfileinfo 465317 0.004 2.454
Find 2001541 0.058 7.819
WriteX 2850661 0.017 9.110
ReadX 8952289 0.002 3.991
LockX 18596 0.002 0.655
UnlockX 18596 0.001 0.179
Flush 400342 2.879 293.607
Throughput 1565.73 MB/sec 16 clients 16 procs max_latency=293.611 ms
+4.6% throughput, -16.9% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6089c5e7763c..59b35e12bb81 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9271,8 +9271,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
/* force full log commit if subvolume involved. */
btrfs_set_log_full_commit(trans);
} else {
- btrfs_pin_log_trans(root);
- root_log_pinned = true;
ret = btrfs_insert_inode_ref(trans, dest,
new_dentry->d_name.name,
new_dentry->d_name.len,
@@ -9289,8 +9287,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
/* force full log commit if subvolume involved. */
btrfs_set_log_full_commit(trans);
} else {
- btrfs_pin_log_trans(dest);
- dest_log_pinned = true;
ret = btrfs_insert_inode_ref(trans, root,
old_dentry->d_name.name,
old_dentry->d_name.len,
@@ -9321,6 +9317,29 @@ static int btrfs_rename_exchange(struct inode *old_dir,
BTRFS_I(new_inode), 1);
}
+ /*
+ * Now pin the logs of the roots. We do it to ensure that no other task
+ * can sync the logs while we are in progress with the rename, because
+ * that could result in an inconsistency in case any of the inodes that
+ * are part of this rename operation were logged before.
+ *
+ * We pin the logs even if at this precise moment none of the inodes was
+ * logged before. This is because right after we checked for that, some
+ * other task fsyncing some other inode not involved with this rename
+ * operation could log that one of our inodes exists.
+ *
+ * We don't need to pin the logs before the above calls to
+ * btrfs_insert_inode_ref(), since those don't ever need to change a log.
+ */
+ if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
+ btrfs_pin_log_trans(root);
+ root_log_pinned = true;
+ }
+ if (new_ino != BTRFS_FIRST_FREE_OBJECTID) {
+ btrfs_pin_log_trans(dest);
+ dest_log_pinned = true;
+ }
+
/* src is a subvolume */
if (old_ino == BTRFS_FIRST_FREE_OBJECTID) {
ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
@@ -9573,8 +9592,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
/* force full log commit if subvolume involved. */
btrfs_set_log_full_commit(trans);
} else {
- btrfs_pin_log_trans(root);
- log_pinned = true;
ret = btrfs_insert_inode_ref(trans, dest,
new_dentry->d_name.name,
new_dentry->d_name.len,
@@ -9598,6 +9615,25 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
if (unlikely(old_ino == BTRFS_FIRST_FREE_OBJECTID)) {
ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
} else {
+ /*
+ * Now pin the log. We do it to ensure that no other task can
+ * sync the log while we are in progress with the rename, as
+ * that could result in an inconsistency in case any of the
+ * inodes that are part of this rename operation were logged
+ * before.
+ *
+ * We pin the log even if at this precise moment none of the
+ * inodes was logged before. This is because right after we
+ * checked for that, some other task fsyncing some other inode
+ * not involved with this rename operation could log that one of
+ * our inodes exists.
+ *
+ * We don't need to pin the logs before the above call to
+ * btrfs_insert_inode_ref(), since that does not need to change
+ * a log.
+ */
+ btrfs_pin_log_trans(root);
+ log_pinned = true;
ret = __btrfs_unlink_inode(trans, root, BTRFS_I(old_dir),
BTRFS_I(d_inode(old_dentry)),
old_dentry->d_name.name,
--
2.28.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
` (2 preceding siblings ...)
2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
@ 2021-07-28 12:41 ` David Sterba
3 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2021-07-28 12:41 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Tue, Jul 27, 2021 at 11:24:42AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The first patch in the series fixes a bug where a directory fsync, following
> by inode eviction, followed by renaming a file and syncing the log results
> in losing a file if a power failure happens and the log is replayed.
>
> The remaining two changes are independent, and are about avoiding unnecessary
> work during operations that need to check or modify the log (renames, adding
> a hard link, unlinks) and making renames hold log commits for a shorter
> period.
Added to misc-next, thanks.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-07-28 12:44 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox