Linux Btrfs filesystem development
 help / color / mirror / Atom feed
* [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements
@ 2021-07-27 10:24 fdmanana
  2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The first patch in the series fixes a bug where a directory fsync, following
by inode eviction, followed by renaming a file and syncing the log results
in losing a file if a power failure happens and the log is replayed.

The remaining two changes are independent, and are about avoiding unnecessary
work during operations that need to check or modify the log (renames, adding
a hard link, unlinks) and making renames hold log commits for a shorter
period.

Filipe Manana (3):
  btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  btrfs: eliminate some false positives when checking if inode was logged
  btrfs: do not pin logs too early during renames

 fs/btrfs/inode.c    | 48 +++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/tree-log.c | 29 ++++++++++++++++-----------
 2 files changed, 60 insertions(+), 17 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
@ 2021-07-27 10:24 ` fdmanana
  2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When checking if we need to log the new name of a renamed inode, we are
checking if the inode and its parent inode have been logged before, and if
not we don't log the new name. The check however is buggy, as it directly
compares the logged_trans field of the inodes versus the ID of the current
transaction. The problem is that logged_trans is a transient field, only
stored in memory and never persisted in the inode item, so if an inode
was logged before, evicted and reloaded, its logged_trans field is set to
a value of 0, meaning the check will return false and the new name of the
renamed inode is not logged. If the old parent directory was previously
fsynced and we deleted the logged directory entries corresponding to the
old name, we end up with a log that when replayed will delete the renamed
inode.

The following example triggers the problem:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ echo -n "hello world" > /mnt/A/foo

  $ sync

  # Add some new file to A and fsync directory A.
  $ touch /mnt/A/bar
  $ xfs_io -c "fsync" /mnt/A

  # Now trigger inode eviction. We are only interested in triggering
  # eviction for the inode of directory A.
  $ echo 2 > /proc/sys/vm/drop_caches

  # Move foo from directory A to directory B.
  # This deletes the directory entries for foo in A from the log, and
  # does not add the new name for foo in directory B to the log, because
  # logged_trans of A is 0, which is less than the current transaction ID.
  $ mv /mnt/A/foo /mnt/B/foo

  # Now make an fsync to anything except A, B or any file inside them,
  # like for example create a file at the root directory and fsync this
  # new file. This syncs the log that contains all the changes done by
  # previous rename operation.
  $ touch /mnt/baz
  $ xfs_io -c "fsync" /mnt/baz

  <power fail>

  # Mount the filesystem and replay the log.
  $ mount /dev/sdc /mnt

  # Check the filesystem content.
  $ ls -1R /mnt
  /mnt/:
  A
  B
  baz

  /mnt/A:
  bar

  /mnt/B:
  $

  # File foo is gone, it's neither in A/ nor in B/.

Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
safely checks if an inode was logged before in the current transaction.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c2ebe700d6e2..8dde5c08a48f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6536,8 +6536,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 	 * if this inode hasn't been logged and directory we're renaming it
 	 * from hasn't been logged, we don't need to log it
 	 */
-	if (inode->logged_trans < trans->transid &&
-	    (!old_dir || old_dir->logged_trans < trans->transid))
+	if (!inode_logged(trans, inode) &&
+	    (!old_dir || !inode_logged(trans, old_dir)))
 		return;
 
 	/*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged
  2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
  2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
@ 2021-07-27 10:24 ` fdmanana
  2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
  2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba
  3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:

1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
   unnecessarily join a log transaction and do deletion searches in a log
   tree that will not find anything. This just adds unnecessary contention
   on extent buffer locks;

2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
   needed. If the inode was not logged before, we don't need to log it in
   LOG_INODE_EXISTS mode.

So just make sure that any false positive only happens when ->logged_trans
has a value of 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8dde5c08a48f..fc98b7a7a8e6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3421,14 +3421,10 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
 }
 
 /*
- * Check if an inode was logged in the current transaction. We can't always rely
- * on an inode's logged_trans value, because it's an in-memory only field and
- * therefore not persisted. This means that its value is lost if the inode gets
- * evicted and loaded again from disk (in which case it has a value of 0, and
- * certainly it is smaller then any possible transaction ID), when that happens
- * the full_sync flag is set in the inode's runtime flags, so on that case we
- * assume eviction happened and ignore the logged_trans value, assuming the
- * worst case, that the inode was logged before in the current transaction.
+ * Check if an inode was logged in the current transaction. This may often
+ * return some false positives, because logged_trans is an in memory only field,
+ * not persisted anywhere. This is meant to be used in contexts where a false
+ * positive has no functional consequences.
  */
 static bool inode_logged(struct btrfs_trans_handle *trans,
 			 struct btrfs_inode *inode)
@@ -3436,7 +3432,18 @@ static bool inode_logged(struct btrfs_trans_handle *trans,
 	if (inode->logged_trans == trans->transid)
 		return true;
 
-	if (inode->last_trans == trans->transid &&
+	/*
+	 * The inode's logged_trans is always 0 when we load it (because it is
+	 * not persisted in the inode item or elsewhere). So if it is 0, the
+	 * inode was last modified in the current transaction and has the
+	 * full_sync flag set, then the inode may have been logged before in
+	 * the current transaction, then evicted and loaded again in the current
+	 * transaction - or may have never been logged in the current transaction,
+	 * but since we can not be sure, we have to assume it was, otherwise our
+	 * callers can leave an inconsistent log.
+	 */
+	if (inode->logged_trans == 0 &&
+	    inode->last_trans == trans->transid &&
 	    test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
 	    !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
 		return true;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 3/3] btrfs: do not pin logs too early during renames
  2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
  2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
  2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
@ 2021-07-27 10:24 ` fdmanana
  2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba
  3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2021-07-27 10:24 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.

In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.

So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.

The following script that uses dbench was used for testing:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 16

  umount $MNT

The tests were run on a machine with 12 cores, 64G of RAN, a NVME device
and using a non-debug kernel config (Debian's default config).

The results compare a branch without this patch and without the previous
patch in the series, that has the subject:

 "btrfs: eliminate some false positives when checking if inode was logged"

Versus the same branch with these two patches applied.

dbench with 8 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4391359     0.009   249.745
 Close        3225882     0.001     3.243
 Rename        185953     0.065   240.643
 Unlink        886669     0.049   249.906
 Deltree          112     2.455   217.433
 Mkdir             56     0.002     0.004
 Qpathinfo    3980281     0.004     3.109
 Qfileinfo     697579     0.001     0.187
 Qfsinfo       729780     0.002     2.424
 Sfileinfo     357764     0.004     1.415
 Find         1538861     0.016     4.863
 WriteX       2189666     0.010     3.327
 ReadX        6883443     0.002     0.729
 LockX          14298     0.002     0.073
 UnlockX        14298     0.001     0.042
 Flush         307777     2.447   303.663

Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms

dbench with 8 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4269920     0.009   213.532
 Close        3136653     0.001     0.690
 Rename        180805     0.082   213.858
 Unlink        862189     0.050   172.893
 Deltree          112     2.998   218.328
 Mkdir             56     0.002     0.003
 Qpathinfo    3870158     0.004     5.072
 Qfileinfo     678375     0.001     0.194
 Qfsinfo       709604     0.002     0.485
 Sfileinfo     347850     0.004     1.304
 Find         1496310     0.017     5.504
 WriteX       2129613     0.010     2.882
 ReadX        6693066     0.002     1.517
 LockX          13902     0.002     0.075
 UnlockX        13902     0.001     0.055
 Flush         299276     2.511   220.189

Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms

+3.2% throughput, -31.8% max latency

dbench with 16 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5978334     0.028   156.507
 Close        4391598     0.001     1.345
 Rename        253136     0.241   155.057
 Unlink       1207220     0.182   257.344
 Deltree          160     6.123    36.277
 Mkdir             80     0.003     0.005
 Qpathinfo    5418817     0.012     6.867
 Qfileinfo     949929     0.001     0.941
 Qfsinfo       993560     0.002     1.386
 Sfileinfo     486904     0.004     2.829
 Find         2095088     0.059     8.164
 WriteX       2982319     0.017     9.029
 ReadX        9371484     0.002     4.052
 LockX          19470     0.002     0.461
 UnlockX        19470     0.001     0.990
 Flush         418936     2.740   347.902

Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms

dbench with 16 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5711833     0.029   131.240
 Close        4195897     0.001     1.732
 Rename        241849     0.204   147.831
 Unlink       1153341     0.184   231.322
 Deltree          160     6.086    30.198
 Mkdir             80     0.003     0.021
 Qpathinfo    5177011     0.012     7.150
 Qfileinfo     907768     0.001     0.793
 Qfsinfo       949205     0.002     1.431
 Sfileinfo     465317     0.004     2.454
 Find         2001541     0.058     7.819
 WriteX       2850661     0.017     9.110
 ReadX        8952289     0.002     3.991
 LockX          18596     0.002     0.655
 UnlockX        18596     0.001     0.179
 Flush         400342     2.879   293.607

Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms

+4.6% throughput, -16.9% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6089c5e7763c..59b35e12bb81 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9271,8 +9271,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		/* force full log commit if subvolume involved. */
 		btrfs_set_log_full_commit(trans);
 	} else {
-		btrfs_pin_log_trans(root);
-		root_log_pinned = true;
 		ret = btrfs_insert_inode_ref(trans, dest,
 					     new_dentry->d_name.name,
 					     new_dentry->d_name.len,
@@ -9289,8 +9287,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		/* force full log commit if subvolume involved. */
 		btrfs_set_log_full_commit(trans);
 	} else {
-		btrfs_pin_log_trans(dest);
-		dest_log_pinned = true;
 		ret = btrfs_insert_inode_ref(trans, root,
 					     old_dentry->d_name.name,
 					     old_dentry->d_name.len,
@@ -9321,6 +9317,29 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 				BTRFS_I(new_inode), 1);
 	}
 
+	/*
+	 * Now pin the logs of the roots. We do it to ensure that no other task
+	 * can sync the logs while we are in progress with the rename, because
+	 * that could result in an inconsistency in case any of the inodes that
+	 * are part of this rename operation were logged before.
+	 *
+	 * We pin the logs even if at this precise moment none of the inodes was
+	 * logged before. This is because right after we checked for that, some
+	 * other task fsyncing some other inode not involved with this rename
+	 * operation could log that one of our inodes exists.
+	 *
+	 * We don't need to pin the logs before the above calls to
+	 * btrfs_insert_inode_ref(), since those don't ever need to change a log.
+	 */
+	if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
+		btrfs_pin_log_trans(root);
+		root_log_pinned = true;
+	}
+	if (new_ino != BTRFS_FIRST_FREE_OBJECTID) {
+		btrfs_pin_log_trans(dest);
+		dest_log_pinned = true;
+	}
+
 	/* src is a subvolume */
 	if (old_ino == BTRFS_FIRST_FREE_OBJECTID) {
 		ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
@@ -9573,8 +9592,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		/* force full log commit if subvolume involved. */
 		btrfs_set_log_full_commit(trans);
 	} else {
-		btrfs_pin_log_trans(root);
-		log_pinned = true;
 		ret = btrfs_insert_inode_ref(trans, dest,
 					     new_dentry->d_name.name,
 					     new_dentry->d_name.len,
@@ -9598,6 +9615,25 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	if (unlikely(old_ino == BTRFS_FIRST_FREE_OBJECTID)) {
 		ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
 	} else {
+		/*
+		 * Now pin the log. We do it to ensure that no other task can
+		 * sync the log while we are in progress with the rename, as
+		 * that could result in an inconsistency in case any of the
+		 * inodes that are part of this rename operation were logged
+		 * before.
+		 *
+		 * We pin the log even if at this precise moment none of the
+		 * inodes was logged before. This is because right after we
+		 * checked for that, some other task fsyncing some other inode
+		 * not involved with this rename operation could log that one of
+		 * our inodes exists.
+		 *
+		 * We don't need to pin the logs before the above call to
+		 * btrfs_insert_inode_ref(), since that does not need to change
+		 * a log.
+		 */
+		btrfs_pin_log_trans(root);
+		log_pinned = true;
 		ret = __btrfs_unlink_inode(trans, root, BTRFS_I(old_dir),
 					BTRFS_I(d_inode(old_dentry)),
 					old_dentry->d_name.name,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements
  2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
                   ` (2 preceding siblings ...)
  2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
@ 2021-07-28 12:41 ` David Sterba
  3 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2021-07-28 12:41 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Jul 27, 2021 at 11:24:42AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The first patch in the series fixes a bug where a directory fsync, following
> by inode eviction, followed by renaming a file and syncing the log results
> in losing a file if a power failure happens and the log is replayed.
> 
> The remaining two changes are independent, and are about avoiding unnecessary
> work during operations that need to check or modify the log (renames, adding
> a hard link, unlinks) and making renames hold log commits for a shorter
> period.

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-07-28 12:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-07-27 10:24 [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements fdmanana
2021-07-27 10:24 ` [PATCH 1/3] btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction fdmanana
2021-07-27 10:24 ` [PATCH 2/3] btrfs: eliminate some false positives when checking if inode was logged fdmanana
2021-07-27 10:24 ` [PATCH 3/3] btrfs: do not pin logs too early during renames fdmanana
2021-07-28 12:41 ` [PATCH 0/3] btrfs: fsync changes, a bug fix and a couple improvements David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox