All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/14] btrfs: directory logging bug fix and several updates
@ 2025-06-02 10:32 fdmanana
  2025-06-02 10:32 ` [PATCH 01/14] btrfs: fix a race between renames and directory logging fdmanana
                   ` (15 more replies)
  0 siblings, 16 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

These fix a race between renames and directory logging and several cleanups
and small optimizations. Details in the change logs.

Filipe Manana (14):
  btrfs: fix a race between renames and directory logging
  btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log()
  btrfs: free path sooner at __btrfs_unlink_inode()
  btrfs: use btrfs_del_item() at del_logged_dentry()
  btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log()
  btrfs: allocate path earlier at btrfs_del_dir_entries_in_log()
  btrfs: allocate path earlier at btrfs_log_new_name()
  btrfs: allocate scratch eb earlier at btrfs_log_new_name()
  btrfs: pass NULL index to btrfs_del_inode_ref() where not needed
  btrfs: switch del_all argument of replay_dir_deletes() from int to bool
  btrfs: make btrfs_delete_delayed_insertion_item() return a boolean
  btrfs: add details to error messages at btrfs_delete_delayed_dir_index()
  btrfs: make btrfs_should_delete_dir_index() return a bool instead
  btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead

 fs/btrfs/delayed-inode.c |  42 +++++++-------
 fs/btrfs/delayed-inode.h |   7 +--
 fs/btrfs/inode.c         | 119 ++++++++++++++++++++++++++-------------
 fs/btrfs/tree-log.c      |  72 ++++++++++++-----------
 4 files changed, 144 insertions(+), 96 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/14] btrfs: fix a race between renames and directory logging
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-03  1:03   ` Boris Burkov
  2025-06-02 10:32 ` [PATCH 02/14] btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log() fdmanana
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We have a race between a rename and directory inode logging that if it
happens and we crash/power fail before the rename completes, the next time
the filesystem is mounted, the log replay code will end up deleting the
file that was being renamed.

This is best explained following a step by step analysis of an interleaving
of steps that lead into this situation.

Consider the initial conditions:

1) We are at transaction N;

2) We have directories A and B created in a past transaction (< N);

3) We have inode X corresponding to a file that has 2 hardlinks, one in
   directory A and the other in directory B, so we'll name them as
   "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
   past transaction (< N);

4) We have inode Y corresponding to a file that as a single hard link and
   is located in directory A, we'll name it as "A/bar". This file was also
   persisted in a past transaction (< N).

The steps leading to a file loss are the following and for all of them we
are under transaction N:

 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
    is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();

 2) Task A starts a rename for inode Y, with the goal of renaming from
    "A/bar" to "A/baz", so we enter btrfs_rename();

 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
    btrfs_insert_inode_ref();

 4) Because the rename happens in the same directory, we don't set the
    last_unlink_trans field of directoty A's inode to the current
    transaction id, that is, we don't cal btrfs_record_unlink_dir();

 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
    and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
    (actually the dir index item is added as a delayed item, but the
    effect is the same);

 6) Now before task A adds the new entry "A/baz" to directory A by
    calling btrfs_add_link(), another task, task B is logging inode X;

 7) Task B starts a fsync of inode X and after logging inode X, at
    btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
    inode X has a last_unlink_trans value of N, set at in step 1;

 8) At btrfs_log_all_parents() we search for all parent directories of
    inode X using the commit root, so we find directories A and B and log
    them. Bu when logging direct A, we don't have a dir index item for
    inode Y anymore, neither the old name "A/bar" nor for the new name
    "A/baz" since the rename has deleted the old name but has not yet
    inserted the new name - task A hasn't called yet btrfs_add_link() to
    do that.

    Note that logging directory A doesn't fallback to a transaction
    commit because its last_unlink_trans has a lower value than the
    current transaction's id (see step 4);

 9) Task B finishes logging directories A and B and gets back to
    btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
    tree;

10) Task B successfully persisted the log tree, btrfs_sync_log() completed
    with success, and a power failure happened.

    We have a log tree without any directory entry for inode Y, so the
    log replay code deletes the entry for inode Y, name "A/bar", from the
    subvolume tree since it doesn't exist in the log tree and the log
    tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
    item that covers the index range for the dentry that corresponds to
    "A/bar").

    Since there's no other hard link for inode Y and the log replay code
    deletes the name "A/bar", the file is lost.

The issue wouldn't happen if task B synced the log only after task A
called btrfs_log_new_name(), which would update the log with the new name
for inode Y ("A/bar").

Fix this by pinning the log root during renames before removing the old
directory entry, and unpinning after btrfs_log_new_name() is called.

Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 81 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7074f975c033..7bad21ec41f8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8063,6 +8063,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	int ret;
 	int ret2;
 	bool need_abort = false;
+	bool logs_pinned = false;
 	struct fscrypt_name old_fname, new_fname;
 	struct fscrypt_str *old_name, *new_name;
 
@@ -8186,6 +8187,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	inode_inc_iversion(new_inode);
 	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
 
+	if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
+	    new_ino != BTRFS_FIRST_FREE_OBJECTID) {
+		/*
+		 * If we are renaming in the same directory (and it's not for
+		 * root entries) pin the log early to prevent any concurrent
+		 * task from logging the directory after we removed the old
+		 * entries and before we add the new entries, otherwise that
+		 * task can sync a log without any entry for the inodes we are
+		 * renaming and therefore replaying that log, if a power failure
+		 * happens after syncing the log, would result in deleting the
+		 * inodes.
+		 *
+		 * If the rename affects two different directories, we want to
+		 * make sure the that there's no log commit that contains
+		 * updates for only one of the directories but not for the
+		 * other.
+		 *
+		 * If we are renaming an entry for a root, we don't care about
+		 * log updates since we called btrfs_set_log_full_commit().
+		 */
+		btrfs_pin_log_trans(root);
+		btrfs_pin_log_trans(dest);
+		logs_pinned = true;
+	}
+
 	if (old_dentry->d_parent != new_dentry->d_parent) {
 		btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
 					BTRFS_I(old_inode), true);
@@ -8257,30 +8283,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		BTRFS_I(new_inode)->dir_index = new_idx;
 
 	/*
-	 * Now pin the logs of the roots. We do it to ensure that no other task
-	 * can sync the logs while we are in progress with the rename, because
-	 * that could result in an inconsistency in case any of the inodes that
-	 * are part of this rename operation were logged before.
+	 * Do the log updates for all inodes.
+	 *
+	 * If either entry is for a root we don't need to update the logs since
+	 * we've called btrfs_set_log_full_commit() before.
 	 */
-	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
-		btrfs_pin_log_trans(root);
-	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
-		btrfs_pin_log_trans(dest);
-
-	/* Do the log updates for all inodes. */
-	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
+	if (logs_pinned) {
 		btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
 				   old_rename_ctx.index, new_dentry->d_parent);
-	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
 		btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
 				   new_rename_ctx.index, old_dentry->d_parent);
+	}
 
-	/* Now unpin the logs. */
-	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
+out_fail:
+	if (logs_pinned) {
 		btrfs_end_log_trans(root);
-	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
 		btrfs_end_log_trans(dest);
-out_fail:
+	}
 	ret2 = btrfs_end_transaction(trans);
 	ret = ret ? ret : ret2;
 out_notrans:
@@ -8330,6 +8349,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
 	int ret2;
 	u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
 	struct fscrypt_name old_fname, new_fname;
+	bool logs_pinned = false;
 
 	if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
 		return -EPERM;
@@ -8464,6 +8484,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
 	inode_inc_iversion(old_inode);
 	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
 
+	if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
+		/*
+		 * If we are renaming in the same directory (and it's not a
+		 * root entry) pin the log to prevent any concurrent task from
+		 * logging the directory after we removed the old entry and
+		 * before we add the new entry, otherwise that task can sync
+		 * a log without any entry for the inode we are renaming and
+		 * therefore replaying that log, if a power failure happens
+		 * after syncing the log, would result in deleting the inode.
+		 *
+		 * If the rename affects two different directories, we want to
+		 * make sure the that there's no log commit that contains
+		 * updates for only one of the directories but not for the
+		 * other.
+		 *
+		 * If we are renaming an entry for a root, we don't care about
+		 * log updates since we called btrfs_set_log_full_commit().
+		 */
+		btrfs_pin_log_trans(root);
+		btrfs_pin_log_trans(dest);
+		logs_pinned = true;
+	}
+
 	if (old_dentry->d_parent != new_dentry->d_parent)
 		btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
 					BTRFS_I(old_inode), true);
@@ -8528,7 +8571,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
 	if (old_inode->i_nlink == 1)
 		BTRFS_I(old_inode)->dir_index = index;
 
-	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
+	if (logs_pinned)
 		btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
 				   rename_ctx.index, new_dentry->d_parent);
 
@@ -8544,6 +8587,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
 		}
 	}
 out_fail:
+	if (logs_pinned) {
+		btrfs_end_log_trans(root);
+		btrfs_end_log_trans(dest);
+	}
 	ret2 = btrfs_end_transaction(trans);
 	ret = ret ? ret : ret2;
 out_notrans:
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 02/14] btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
  2025-06-02 10:32 ` [PATCH 01/14] btrfs: fix a race between renames and directory logging fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-02 10:32 ` [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode() fdmanana
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 0d5b79402312..5af0e2d0634e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3513,7 +3513,8 @@ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
 	}
 
 	ret = join_running_log_trans(root);
-	if (ret)
+	ASSERT(ret == 0, "join_running_log_trans() ret=%d", ret);
+	if (WARN_ON(ret))
 		return;
 	log = root->log_root;
 	mutex_lock(&inode->log_mutex);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
  2025-06-02 10:32 ` [PATCH 01/14] btrfs: fix a race between renames and directory logging fdmanana
  2025-06-02 10:32 ` [PATCH 02/14] btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log() fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-02 13:25   ` Sun YangKai
  2025-06-02 10:32 ` [PATCH 04/14] btrfs: use btrfs_del_item() at del_logged_dentry() fdmanana
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

After calling btrfs_delete_one_dir_name() there's no need for the path
anymore so we can free it immediately after. After that point we do
some btree operations that take time and in those call chains we end up
allocating paths for these operations, so we're unnecessarily holding on
to the path we allocated early at __btrfs_unlink_inode().

So free the path as soon as we don't need it and add a comment. This
also allows to simplify the error path, removing two exit labels and
returning directly when an error happens.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7bad21ec41f8..a9a37553bc45 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4215,20 +4215,22 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	u64 dir_ino = btrfs_ino(dir);
 
 	path = btrfs_alloc_path();
-	if (!path) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!path)
+		return -ENOMEM;
 
 	di = btrfs_lookup_dir_item(trans, root, path, dir_ino, name, -1);
 	if (IS_ERR_OR_NULL(di)) {
-		ret = di ? PTR_ERR(di) : -ENOENT;
-		goto err;
+		btrfs_free_path(path);
+		return di ? PTR_ERR(di) : -ENOENT;
 	}
 	ret = btrfs_delete_one_dir_name(trans, root, path, di);
+	/*
+	 * Down the call chains below we'll also need to allocate a path, so no
+	 * need to hold on to this one for longer than necessary.
+	 */
+	btrfs_free_path(path);
 	if (ret)
-		goto err;
-	btrfs_release_path(path);
+		return ret;
 
 	/*
 	 * If we don't have dir index, we have to get it by looking up
@@ -4254,7 +4256,7 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	   "failed to delete reference to %.*s, root %llu inode %llu parent %llu",
 			   name->len, name->name, btrfs_root_id(root), ino, dir_ino);
 		btrfs_abort_transaction(trans, ret);
-		goto err;
+		return ret;
 	}
 skip_backref:
 	if (rename_ctx)
@@ -4263,7 +4265,7 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	ret = btrfs_delete_delayed_dir_index(trans, dir, index);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
-		goto err;
+		return ret;
 	}
 
 	/*
@@ -4287,19 +4289,14 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	 * holding.
 	 */
 	btrfs_run_delayed_iput(fs_info, inode);
-err:
-	btrfs_free_path(path);
-	if (ret)
-		goto out;
 
 	btrfs_i_size_write(dir, dir->vfs_inode.i_size - name->len * 2);
 	inode_inc_iversion(&inode->vfs_inode);
 	inode_set_ctime_current(&inode->vfs_inode);
 	inode_inc_iversion(&dir->vfs_inode);
  	inode_set_mtime_to_ts(&dir->vfs_inode, inode_set_ctime_current(&dir->vfs_inode));
-	ret = btrfs_update_inode(trans, dir);
-out:
-	return ret;
+
+	return btrfs_update_inode(trans, dir);
 }
 
 int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 04/14] btrfs: use btrfs_del_item() at del_logged_dentry()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (2 preceding siblings ...)
  2025-06-02 10:32 ` [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode() fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-02 10:32 ` [PATCH 05/14] btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log() fdmanana
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry()
because we are processing a dir index key which can contain only a single
name, unlike dir item keys which can encode multiple names in case of name
hash collisions. We have explicitly looked up for a dir index key by
calling btrfs_lookup_dir_index_item() and we don't log dir item keys
anymore (since commit 339d03542484 ("btrfs: only copy dir index keys when
logging a directory")). So simplify and use btrfs_del_item() directly
instead of btrfs_delete_one_dir_name().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5af0e2d0634e..4f0a86805dc2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3432,7 +3432,7 @@ static int del_logged_dentry(struct btrfs_trans_handle *trans,
 	 * inode item because on log replay we update the field to reflect
 	 * all existing entries in the directory (see overwrite_item()).
 	 */
-	return btrfs_delete_one_dir_name(trans, log, path, di);
+	return btrfs_del_item(trans, log, path);
 }
 
 /*
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 05/14] btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (3 preceding siblings ...)
  2025-06-02 10:32 ` [PATCH 04/14] btrfs: use btrfs_del_item() at del_logged_dentry() fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-02 10:32 ` [PATCH 06/14] btrfs: allocate path earlier " fdmanana
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4f0a86805dc2..14506a09f28e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3473,7 +3473,8 @@ void btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
 	}
 
 	ret = join_running_log_trans(root);
-	if (ret)
+	ASSERT(ret == 0, "join_running_log_trans() ret=%d", ret);
+	if (WARN_ON(ret))
 		return;
 
 	mutex_lock(&dir->log_mutex);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 06/14] btrfs: allocate path earlier at btrfs_del_dir_entries_in_log()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (4 preceding siblings ...)
  2025-06-02 10:32 ` [PATCH 05/14] btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log() fdmanana
@ 2025-06-02 10:32 ` fdmanana
  2025-06-02 10:33 ` [PATCH 07/14] btrfs: allocate path earlier at btrfs_log_new_name() fdmanana
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:32 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 14506a09f28e..cbdf9791ec5d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3472,27 +3472,27 @@ void btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
 		return;
 	}
 
+	path = btrfs_alloc_path();
+	if (!path) {
+		btrfs_set_log_full_commit(trans);
+		return;
+	}
+
 	ret = join_running_log_trans(root);
 	ASSERT(ret == 0, "join_running_log_trans() ret=%d", ret);
 	if (WARN_ON(ret))
-		return;
+		goto out;
 
 	mutex_lock(&dir->log_mutex);
 
-	path = btrfs_alloc_path();
-	if (!path) {
-		ret = -ENOMEM;
-		goto out_unlock;
-	}
-
 	ret = del_logged_dentry(trans, root->log_root, path, btrfs_ino(dir),
 				name, index);
-	btrfs_free_path(path);
-out_unlock:
 	mutex_unlock(&dir->log_mutex);
 	if (ret < 0)
 		btrfs_set_log_full_commit(trans);
 	btrfs_end_log_trans(root);
+out:
+	btrfs_free_path(path);
 }
 
 /* see comments for btrfs_del_dir_entries_in_log */
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 07/14] btrfs: allocate path earlier at btrfs_log_new_name()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (5 preceding siblings ...)
  2025-06-02 10:32 ` [PATCH 06/14] btrfs: allocate path earlier " fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 08/14] btrfs: allocate scratch eb " fdmanana
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cbdf9791ec5d..37da27acef95 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -7543,6 +7543,14 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 					     &old_dentry->d_name, 0, &fname);
 		if (ret)
 			goto out;
+
+		path = btrfs_alloc_path();
+		if (!path) {
+			ret = -ENOMEM;
+			fscrypt_free_filename(&fname);
+			goto out;
+		}
+
 		/*
 		 * We have two inodes to update in the log, the old directory and
 		 * the inode that got renamed, so we must pin the log to prevent
@@ -7556,19 +7564,13 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 		 * mark the log for a full commit.
 		 */
 		if (WARN_ON_ONCE(ret < 0)) {
+			btrfs_free_path(path);
 			fscrypt_free_filename(&fname);
 			goto out;
 		}
 
 		log_pinned = true;
 
-		path = btrfs_alloc_path();
-		if (!path) {
-			ret = -ENOMEM;
-			fscrypt_free_filename(&fname);
-			goto out;
-		}
-
 		/*
 		 * Other concurrent task might be logging the old directory,
 		 * as it can be triggered when logging other inode that had or
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 08/14] btrfs: allocate scratch eb earlier at btrfs_log_new_name()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (6 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 07/14] btrfs: allocate path earlier at btrfs_log_new_name() fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 09/14] btrfs: pass NULL index to btrfs_del_inode_ref() where not needed fdmanana
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of allocating the scratch eb after joining the log transaction,
allocate it before so that we're not delaying log commits for longer
than necessary, as allocating the scratch eb means allocating an
extent_buffer structure, which comes from a dedicated kmem_cache, plus
pages/folios to attach to the eb. Both of these allocations may take time
when we're under memory pressure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 37da27acef95..a40afb44702c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -7495,6 +7495,9 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 	bool log_pinned = false;
 	int ret;
 
+	btrfs_init_log_ctx(&ctx, inode);
+	ctx.logging_new_name = true;
+
 	/*
 	 * this will force the logging code to walk the dentry chain
 	 * up for the file
@@ -7525,6 +7528,13 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 	}
 	ret = 0;
 
+	/*
+	 * Now that we know we need to update the log, allocate the scratch eb
+	 * for the context before joining a log transaction below, as this can
+	 * take time and therefore we could delay log commits from other tasks.
+	 */
+	btrfs_init_log_ctx_scratch_eb(&ctx);
+
 	/*
 	 * If we are doing a rename (old_dir is not NULL) from a directory that
 	 * was previously logged, make sure that on log replay we get the old
@@ -7602,9 +7612,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 			goto out;
 	}
 
-	btrfs_init_log_ctx(&ctx, inode);
-	ctx.logging_new_name = true;
-	btrfs_init_log_ctx_scratch_eb(&ctx);
 	/*
 	 * We don't care about the return value. If we fail to log the new name
 	 * then we know the next attempt to sync the log will fallback to a full
@@ -7613,7 +7620,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 	 * inconsistent state after a rename operation.
 	 */
 	btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx);
-	free_extent_buffer(ctx.scratch_eb);
 	ASSERT(list_empty(&ctx.conflict_inodes));
 out:
 	/*
@@ -7626,5 +7632,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 		btrfs_set_log_full_commit(trans);
 	if (log_pinned)
 		btrfs_end_log_trans(root);
+	free_extent_buffer(ctx.scratch_eb);
 }
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 09/14] btrfs: pass NULL index to btrfs_del_inode_ref() where not needed
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (7 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 08/14] btrfs: allocate scratch eb " fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 10/14] btrfs: switch del_all argument of replay_dir_deletes() from int to bool fdmanana
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are two callers of btrfs_del_inode_ref() that declare a local index
variable and then pass a pointer for it to btrfs_del_inode_ref(), but then
don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL
index pointer, pass NULL instead and stop declaring those useless index
variables.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c    | 4 +---
 fs/btrfs/tree-log.c | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a9a37553bc45..890d20868250 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6711,11 +6711,9 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
 		if (err)
 			btrfs_abort_transaction(trans, err);
 	} else if (add_backref) {
-		u64 local_index;
 		int err;
 
-		err = btrfs_del_inode_ref(trans, root, name, ino, parent_ino,
-					  &local_index);
+		err = btrfs_del_inode_ref(trans, root, name, ino, parent_ino, NULL);
 		if (err)
 			btrfs_abort_transaction(trans, err);
 	}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a40afb44702c..c639fe492a25 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3502,7 +3502,6 @@ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
 				struct btrfs_inode *inode, u64 dirid)
 {
 	struct btrfs_root *log;
-	u64 index;
 	int ret;
 
 	ret = inode_logged(trans, inode, NULL);
@@ -3520,8 +3519,7 @@ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
 	log = root->log_root;
 	mutex_lock(&inode->log_mutex);
 
-	ret = btrfs_del_inode_ref(trans, log, name, btrfs_ino(inode),
-				  dirid, &index);
+	ret = btrfs_del_inode_ref(trans, log, name, btrfs_ino(inode), dirid, NULL);
 	mutex_unlock(&inode->log_mutex);
 	if (ret < 0 && ret != -ENOENT)
 		btrfs_set_log_full_commit(trans);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 10/14] btrfs: switch del_all argument of replay_dir_deletes() from int to bool
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (8 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 09/14] btrfs: pass NULL index to btrfs_del_inode_ref() where not needed fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 11/14] btrfs: make btrfs_delete_delayed_insertion_item() return a boolean fdmanana
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The argument has boolean semantics, so change its type from int to bool,
making it more clear.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c639fe492a25..4e6a054682e3 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -112,7 +112,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
-				       u64 dirid, int del_all);
+				       u64 dirid, bool del_all);
 static void wait_log_commit(struct btrfs_root *root, int transid);
 
 /*
@@ -1632,8 +1632,7 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
 
 	if (inode->vfs_inode.i_nlink == 0) {
 		if (S_ISDIR(inode->vfs_inode.i_mode)) {
-			ret = replay_dir_deletes(trans, root, NULL, path,
-						 ino, 1);
+			ret = replay_dir_deletes(trans, root, NULL, path, ino, true);
 			if (ret)
 				goto out;
 		}
@@ -2284,7 +2283,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
-				       u64 dirid, int del_all)
+				       u64 dirid, bool del_all)
 {
 	u64 range_start;
 	u64 range_end;
@@ -2443,8 +2442,8 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
 				break;
 			mode = btrfs_inode_mode(eb, inode_item);
 			if (S_ISDIR(mode)) {
-				ret = replay_dir_deletes(wc->trans,
-					 root, log, path, key.objectid, 0);
+				ret = replay_dir_deletes(wc->trans, root, log, path,
+							 key.objectid, false);
 				if (ret)
 					break;
 			}
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 11/14] btrfs: make btrfs_delete_delayed_insertion_item() return a boolean
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (9 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 10/14] btrfs: switch del_all argument of replay_dir_deletes() from int to bool fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 12/14] btrfs: add details to error messages at btrfs_delete_delayed_dir_index() fdmanana
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's no need to return an integer as all we need to do is return true
or false to tell whether we deleted a delayed item or not. Also the logic
is inverted since we return 1 (true) if we didn't delete and 0 (false) if
we did, which is somewhat counter intuitive. Change the return type to a
boolean and make it return true if we deleted and false otherwise.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-inode.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 4f9d7a436f16..6918340f4b38 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1540,8 +1540,8 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static int btrfs_delete_delayed_insertion_item(struct btrfs_delayed_node *node,
-					       u64 index)
+static bool btrfs_delete_delayed_insertion_item(struct btrfs_delayed_node *node,
+						u64 index)
 {
 	struct btrfs_delayed_item *item;
 
@@ -1549,7 +1549,7 @@ static int btrfs_delete_delayed_insertion_item(struct btrfs_delayed_node *node,
 	item = __btrfs_lookup_delayed_item(&node->ins_root.rb_root, index);
 	if (!item) {
 		mutex_unlock(&node->mutex);
-		return 1;
+		return false;
 	}
 
 	/*
@@ -1584,7 +1584,7 @@ static int btrfs_delete_delayed_insertion_item(struct btrfs_delayed_node *node,
 	}
 
 	mutex_unlock(&node->mutex);
-	return 0;
+	return true;
 }
 
 int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans,
@@ -1598,9 +1598,10 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans,
 	if (IS_ERR(node))
 		return PTR_ERR(node);
 
-	ret = btrfs_delete_delayed_insertion_item(node, index);
-	if (!ret)
+	if (btrfs_delete_delayed_insertion_item(node, index)) {
+		ret = 0;
 		goto end;
+	}
 
 	item = btrfs_alloc_delayed_item(0, node, BTRFS_DELAYED_DELETION_ITEM);
 	if (!item) {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 12/14] btrfs: add details to error messages at btrfs_delete_delayed_dir_index()
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (10 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 11/14] btrfs: make btrfs_delete_delayed_insertion_item() return a boolean fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 13/14] btrfs: make btrfs_should_delete_dir_index() return a bool instead fdmanana
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Update the error messages with:

1) Fix typo in the first one, deltiona -> deletion;

2) Remove redundant part of the first message, the part following the
   comma, and including all the useful information: root, inode, index
   and error value;

3) Update the second message to use more formal language (example 'error'
   instead of 'err'), , remove redundant part about "deletion tree of
   delayed node..." and print the relevant information in the same
   format and order as the first message, without the ugly opening
   parenthesis without a space separating from the previous word.
   This also makes the message similar in format to the one we have at
   btrfs_insert_delayed_dir_index().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 6918340f4b38..1e9bec6d24f7 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1618,7 +1618,8 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans,
 	 */
 	if (ret < 0) {
 		btrfs_err(trans->fs_info,
-"metadata reservation failed for delayed dir item deltiona, should have been reserved");
+"metadata reservation failed for delayed dir item deletion, index: %llu, root: %llu, inode: %llu, error: %d",
+			  index, btrfs_root_id(node->root), node->inode_id, ret);
 		btrfs_release_delayed_item(item);
 		goto end;
 	}
@@ -1627,9 +1628,8 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans,
 	ret = __btrfs_add_delayed_item(node, item);
 	if (unlikely(ret)) {
 		btrfs_err(trans->fs_info,
-			  "err add delayed dir index item(index: %llu) into the deletion tree of the delayed node(root id: %llu, inode id: %llu, errno: %d)",
-			  index, btrfs_root_id(node->root),
-			  node->inode_id, ret);
+"failed to add delayed dir index item, root: %llu, inode: %llu, index: %llu, error: %d",
+			  index, btrfs_root_id(node->root), node->inode_id, ret);
 		btrfs_delayed_item_release_metadata(dir->root, item);
 		btrfs_release_delayed_item(item);
 	}
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 13/14] btrfs: make btrfs_should_delete_dir_index() return a bool instead
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (11 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 12/14] btrfs: add details to error messages at btrfs_delete_delayed_dir_index() fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-02 10:33 ` [PATCH 14/14] btrfs: make btrfs_readdir_delayed_dir_index() " fdmanana
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's no need to return errors and we currently return 1 in case the
index should be deleted and 0 otherwise, so change the return type from
int to bool as this is a boolean function and it's more clear this way.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-inode.c | 7 +++----
 fs/btrfs/delayed-inode.h | 3 +--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1e9bec6d24f7..050723ade942 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1734,17 +1734,16 @@ void btrfs_readdir_put_delayed_items(struct btrfs_inode *inode,
 	downgrade_write(&inode->vfs_inode.i_rwsem);
 }
 
-int btrfs_should_delete_dir_index(const struct list_head *del_list,
-				  u64 index)
+bool btrfs_should_delete_dir_index(const struct list_head *del_list, u64 index)
 {
 	struct btrfs_delayed_item *curr;
-	int ret = 0;
+	bool ret = false;
 
 	list_for_each_entry(curr, del_list, readdir_list) {
 		if (curr->index > index)
 			break;
 		if (curr->index == index) {
-			ret = 1;
+			ret = true;
 			break;
 		}
 	}
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index c4b4ba122beb..73d13fac8917 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -150,8 +150,7 @@ bool btrfs_readdir_get_delayed_items(struct btrfs_inode *inode,
 void btrfs_readdir_put_delayed_items(struct btrfs_inode *inode,
 				     struct list_head *ins_list,
 				     struct list_head *del_list);
-int btrfs_should_delete_dir_index(const struct list_head *del_list,
-				  u64 index);
+bool btrfs_should_delete_dir_index(const struct list_head *del_list, u64 index);
 int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
 				    const struct list_head *ins_list);
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 14/14] btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (12 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 13/14] btrfs: make btrfs_should_delete_dir_index() return a bool instead fdmanana
@ 2025-06-02 10:33 ` fdmanana
  2025-06-03  7:23 ` [PATCH 00/14] btrfs: directory logging bug fix and several updates David Sterba
  2025-06-03 17:34 ` Boris Burkov
  15 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2025-06-02 10:33 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's no need to return errors, all we do is return 1 or 0 depending
on whether we should or should not stop iterating over delayed dir
indexes. So change the function to return bool instead of an int.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-inode.c | 14 +++++++-------
 fs/btrfs/delayed-inode.h |  4 ++--
 fs/btrfs/inode.c         |  3 +--
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 050723ade942..0f8d8e275143 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1753,15 +1753,14 @@ bool btrfs_should_delete_dir_index(const struct list_head *del_list, u64 index)
 /*
  * Read dir info stored in the delayed tree.
  */
-int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
-				    const struct list_head *ins_list)
+bool btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
+				     const struct list_head *ins_list)
 {
 	struct btrfs_dir_item *di;
 	struct btrfs_delayed_item *curr, *next;
 	struct btrfs_key location;
 	char *name;
 	int name_len;
-	int over = 0;
 	unsigned char d_type;
 
 	/*
@@ -1770,6 +1769,8 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
 	 * directory, nobody can delete any directory indexes now.
 	 */
 	list_for_each_entry_safe(curr, next, ins_list, readdir_list) {
+		bool over;
+
 		list_del(&curr->readdir_list);
 
 		if (curr->index < ctx->pos) {
@@ -1787,17 +1788,16 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
 		d_type = fs_ftype_to_dtype(btrfs_dir_flags_to_ftype(di->type));
 		btrfs_disk_key_to_cpu(&location, &di->location);
 
-		over = !dir_emit(ctx, name, name_len,
-			       location.objectid, d_type);
+		over = !dir_emit(ctx, name, name_len, location.objectid, d_type);
 
 		if (refcount_dec_and_test(&curr->refs))
 			kfree(curr);
 
 		if (over)
-			return 1;
+			return true;
 		ctx->pos++;
 	}
-	return 0;
+	return false;
 }
 
 static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 73d13fac8917..e6e763ad2d42 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -151,8 +151,8 @@ void btrfs_readdir_put_delayed_items(struct btrfs_inode *inode,
 				     struct list_head *ins_list,
 				     struct list_head *del_list);
 bool btrfs_should_delete_dir_index(const struct list_head *del_list, u64 index);
-int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
-				    const struct list_head *ins_list);
+bool btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
+				     const struct list_head *ins_list);
 
 /* Used during directory logging. */
 void btrfs_log_get_delayed_items(struct btrfs_inode *inode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 890d20868250..777f9507f180 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6170,8 +6170,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 	if (ret)
 		goto nopos;
 
-	ret = btrfs_readdir_delayed_dir_index(ctx, &ins_list);
-	if (ret)
+	if (btrfs_readdir_delayed_dir_index(ctx, &ins_list))
 		goto nopos;
 
 	/*
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode()
  2025-06-02 10:32 ` [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode() fdmanana
@ 2025-06-02 13:25   ` Sun YangKai
  2025-06-02 13:28     ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Sun YangKai @ 2025-06-02 13:25 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

> From: Filipe Manana <fdmanana@suse.com>
> 
> After calling btrfs_delete_one_dir_name() there's no need for the path
> anymore so we can free it immediately after. After that point we do
> some btree operations that take time and in those call chains we end up
> allocating paths for these operations, so we're unnecessarily holding on
> to the path we allocated early at __btrfs_unlink_inode().
> 
> So free the path as soon as we don't need it and add a comment. This
> also allows to simplify the error path, removing two exit labels and
> returning directly when an error happens.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
>  fs/btrfs/inode.c | 31 ++++++++++++++-----------------
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7bad21ec41f8..a9a37553bc45 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4215,20 +4215,22 @@ static int __btrfs_unlink_inode(struct
> btrfs_trans_handle *trans,> 
>  	u64 dir_ino = btrfs_ino(dir);
>  	
>  	path = btrfs_alloc_path();
> 
> -	if (!path) {
> -		ret = -ENOMEM;
> -		goto out;
> -	}
> +	if (!path)
> +		return -ENOMEM;
> 
>  	di = btrfs_lookup_dir_item(trans, root, path, dir_ino, name, -1);
>  	if (IS_ERR_OR_NULL(di)) {
> 
> -		ret = di ? PTR_ERR(di) : -ENOENT;
> -		goto err;
> +		btrfs_free_path(path);
> +		return di ? PTR_ERR(di) : -ENOENT;

Maybe we could define the path using `BTRFS_PATH_AUTO_FREE(path);` so that we 
don't need to write btrfs_free_path before every return statement like this.

Forgive me if I'm mistaken, as I'm not completely familiar with this.

> 
>  	}
>  	ret = btrfs_delete_one_dir_name(trans, root, path, di);
> 
> +	/*
> +	 * Down the call chains below we'll also need to allocate a path, so no
> +	 * need to hold on to this one for longer than necessary.
> +	 */
> +	btrfs_free_path(path);
> 
>  	if (ret)
> 
> -		goto err;
> -	btrfs_release_path(path);
> +		return ret;
> 
>  	/*
>  	
>  	 * If we don't have dir index, we have to get it by looking up
> 
> @@ -4254,7 +4256,7 @@ static int __btrfs_unlink_inode(struct
> btrfs_trans_handle *trans,> 
>  	   "failed to delete reference to %.*s, root %llu inode %llu parent 
%llu",
>  	   
>  			   name->len, name->name, btrfs_root_id(root), ino, dir_ino);
>  		
>  		btrfs_abort_transaction(trans, ret);
> 
> -		goto err;
> +		return ret;
> 
>  	}
>  
>  skip_backref:
>  	if (rename_ctx)
> 
> @@ -4263,7 +4265,7 @@ static int __btrfs_unlink_inode(struct
> btrfs_trans_handle *trans,> 
>  	ret = btrfs_delete_delayed_dir_index(trans, dir, index);
>  	if (ret) {
>  	
>  		btrfs_abort_transaction(trans, ret);
> 
> -		goto err;
> +		return ret;
> 
>  	}
>  	
>  	/*
> 
> @@ -4287,19 +4289,14 @@ static int __btrfs_unlink_inode(struct
> btrfs_trans_handle *trans,> 
>  	 * holding.
>  	 */
>  	
>  	btrfs_run_delayed_iput(fs_info, inode);
> 
> -err:
> -	btrfs_free_path(path);
> -	if (ret)
> -		goto out;
> 
>  	btrfs_i_size_write(dir, dir->vfs_inode.i_size - name->len * 2);
>  	inode_inc_iversion(&inode->vfs_inode);
>  	inode_set_ctime_current(&inode->vfs_inode);
>  	inode_inc_iversion(&dir->vfs_inode);
>  	
>   	inode_set_mtime_to_ts(&dir->vfs_inode,
>   	inode_set_ctime_current(&dir->vfs_inode));> 
> -	ret = btrfs_update_inode(trans, dir);
> -out:
> -	return ret;
> +
> +	return btrfs_update_inode(trans, dir);
> 
>  }
>  
>  int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
> 
> --
> 2.47.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode()
  2025-06-02 13:25   ` Sun YangKai
@ 2025-06-02 13:28     ` Filipe Manana
  2025-06-02 16:20       ` Daniel Vacek
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2025-06-02 13:28 UTC (permalink / raw)
  To: Sun YangKai; +Cc: linux-btrfs

On Mon, Jun 2, 2025 at 2:25 PM Sun YangKai <sunk67188@gmail.com> wrote:
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > After calling btrfs_delete_one_dir_name() there's no need for the path
> > anymore so we can free it immediately after. After that point we do
> > some btree operations that take time and in those call chains we end up
> > allocating paths for these operations, so we're unnecessarily holding on
> > to the path we allocated early at __btrfs_unlink_inode().
> >
> > So free the path as soon as we don't need it and add a comment. This
> > also allows to simplify the error path, removing two exit labels and
> > returning directly when an error happens.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >
> >  fs/btrfs/inode.c | 31 ++++++++++++++-----------------
> >  1 file changed, 14 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 7bad21ec41f8..a9a37553bc45 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -4215,20 +4215,22 @@ static int __btrfs_unlink_inode(struct
> > btrfs_trans_handle *trans,>
> >       u64 dir_ino = btrfs_ino(dir);
> >
> >       path = btrfs_alloc_path();
> >
> > -     if (!path) {
> > -             ret = -ENOMEM;
> > -             goto out;
> > -     }
> > +     if (!path)
> > +             return -ENOMEM;
> >
> >       di = btrfs_lookup_dir_item(trans, root, path, dir_ino, name, -1);
> >       if (IS_ERR_OR_NULL(di)) {
> >
> > -             ret = di ? PTR_ERR(di) : -ENOENT;
> > -             goto err;
> > +             btrfs_free_path(path);
> > +             return di ? PTR_ERR(di) : -ENOENT;
>
> Maybe we could define the path using `BTRFS_PATH_AUTO_FREE(path);` so that we
> don't need to write btrfs_free_path before every return statement like this.
>
> Forgive me if I'm mistaken, as I'm not completely familiar with this.

For the error case only, yes. But the goal here is to free the path as
soon as it's not needed for the expected, non-error case.
So no, it wouldn't do it.

>
> >
> >       }
> >       ret = btrfs_delete_one_dir_name(trans, root, path, di);
> >
> > +     /*
> > +      * Down the call chains below we'll also need to allocate a path, so no
> > +      * need to hold on to this one for longer than necessary.
> > +      */
> > +     btrfs_free_path(path);
> >
> >       if (ret)
> >
> > -             goto err;
> > -     btrfs_release_path(path);
> > +             return ret;
> >
> >       /*
> >
> >        * If we don't have dir index, we have to get it by looking up
> >
> > @@ -4254,7 +4256,7 @@ static int __btrfs_unlink_inode(struct
> > btrfs_trans_handle *trans,>
> >          "failed to delete reference to %.*s, root %llu inode %llu parent
> %llu",
> >
> >                          name->len, name->name, btrfs_root_id(root), ino, dir_ino);
> >
> >               btrfs_abort_transaction(trans, ret);
> >
> > -             goto err;
> > +             return ret;
> >
> >       }
> >
> >  skip_backref:
> >       if (rename_ctx)
> >
> > @@ -4263,7 +4265,7 @@ static int __btrfs_unlink_inode(struct
> > btrfs_trans_handle *trans,>
> >       ret = btrfs_delete_delayed_dir_index(trans, dir, index);
> >       if (ret) {
> >
> >               btrfs_abort_transaction(trans, ret);
> >
> > -             goto err;
> > +             return ret;
> >
> >       }
> >
> >       /*
> >
> > @@ -4287,19 +4289,14 @@ static int __btrfs_unlink_inode(struct
> > btrfs_trans_handle *trans,>
> >        * holding.
> >        */
> >
> >       btrfs_run_delayed_iput(fs_info, inode);
> >
> > -err:
> > -     btrfs_free_path(path);
> > -     if (ret)
> > -             goto out;
> >
> >       btrfs_i_size_write(dir, dir->vfs_inode.i_size - name->len * 2);
> >       inode_inc_iversion(&inode->vfs_inode);
> >       inode_set_ctime_current(&inode->vfs_inode);
> >       inode_inc_iversion(&dir->vfs_inode);
> >
> >       inode_set_mtime_to_ts(&dir->vfs_inode,
> >       inode_set_ctime_current(&dir->vfs_inode));>
> > -     ret = btrfs_update_inode(trans, dir);
> > -out:
> > -     return ret;
> > +
> > +     return btrfs_update_inode(trans, dir);
> >
> >  }
> >
> >  int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
> >
> > --
> > 2.47.2
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode()
  2025-06-02 13:28     ` Filipe Manana
@ 2025-06-02 16:20       ` Daniel Vacek
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Vacek @ 2025-06-02 16:20 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Sun YangKai, linux-btrfs

On Mon, 2 Jun 2025 at 15:28, Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Mon, Jun 2, 2025 at 2:25 PM Sun YangKai <sunk67188@gmail.com> wrote:
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > After calling btrfs_delete_one_dir_name() there's no need for the path
> > > anymore so we can free it immediately after. After that point we do
> > > some btree operations that take time and in those call chains we end up
> > > allocating paths for these operations, so we're unnecessarily holding on
> > > to the path we allocated early at __btrfs_unlink_inode().
> > >
> > > So free the path as soon as we don't need it and add a comment. This
> > > also allows to simplify the error path, removing two exit labels and
> > > returning directly when an error happens.
> > >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >
> > >  fs/btrfs/inode.c | 31 ++++++++++++++-----------------
> > >  1 file changed, 14 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 7bad21ec41f8..a9a37553bc45 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -4215,20 +4215,22 @@ static int __btrfs_unlink_inode(struct
> > > btrfs_trans_handle *trans,>
> > >       u64 dir_ino = btrfs_ino(dir);
> > >
> > >       path = btrfs_alloc_path();
> > >
> > > -     if (!path) {
> > > -             ret = -ENOMEM;
> > > -             goto out;
> > > -     }
> > > +     if (!path)
> > > +             return -ENOMEM;
> > >
> > >       di = btrfs_lookup_dir_item(trans, root, path, dir_ino, name, -1);
> > >       if (IS_ERR_OR_NULL(di)) {
> > >
> > > -             ret = di ? PTR_ERR(di) : -ENOENT;
> > > -             goto err;
> > > +             btrfs_free_path(path);
> > > +             return di ? PTR_ERR(di) : -ENOENT;
> >
> > Maybe we could define the path using `BTRFS_PATH_AUTO_FREE(path);` so that we
> > don't need to write btrfs_free_path before every return statement like this.
> >
> > Forgive me if I'm mistaken, as I'm not completely familiar with this.
>
> For the error case only, yes. But the goal here is to free the path as
> soon as it's not needed for the expected, non-error case.
> So no, it wouldn't do it.

You can actually combine explicit btrfs_free_path(path) with auto free
if you follow with path = NULL. In such a case the second free will
result in nop.
Of course, a comment would be proper at such a location if you really
wanted to use it.

> > >       }
> > >       ret = btrfs_delete_one_dir_name(trans, root, path, di);
> > >
> > > +     /*
> > > +      * Down the call chains below we'll also need to allocate a path, so no
> > > +      * need to hold on to this one for longer than necessary.
> > > +      */
> > > +     btrfs_free_path(path);
> > >
> > >       if (ret)
> > >
> > > -             goto err;
> > > -     btrfs_release_path(path);
> > > +             return ret;
> > >
> > >       /*
> > >
> > >        * If we don't have dir index, we have to get it by looking up
> > >
> > > @@ -4254,7 +4256,7 @@ static int __btrfs_unlink_inode(struct
> > > btrfs_trans_handle *trans,>
> > >          "failed to delete reference to %.*s, root %llu inode %llu parent
> > %llu",
> > >
> > >                          name->len, name->name, btrfs_root_id(root), ino, dir_ino);
> > >
> > >               btrfs_abort_transaction(trans, ret);
> > >
> > > -             goto err;
> > > +             return ret;
> > >
> > >       }
> > >
> > >  skip_backref:
> > >       if (rename_ctx)
> > >
> > > @@ -4263,7 +4265,7 @@ static int __btrfs_unlink_inode(struct
> > > btrfs_trans_handle *trans,>
> > >       ret = btrfs_delete_delayed_dir_index(trans, dir, index);
> > >       if (ret) {
> > >
> > >               btrfs_abort_transaction(trans, ret);
> > >
> > > -             goto err;
> > > +             return ret;
> > >
> > >       }
> > >
> > >       /*
> > >
> > > @@ -4287,19 +4289,14 @@ static int __btrfs_unlink_inode(struct
> > > btrfs_trans_handle *trans,>
> > >        * holding.
> > >        */
> > >
> > >       btrfs_run_delayed_iput(fs_info, inode);
> > >
> > > -err:
> > > -     btrfs_free_path(path);
> > > -     if (ret)
> > > -             goto out;
> > >
> > >       btrfs_i_size_write(dir, dir->vfs_inode.i_size - name->len * 2);
> > >       inode_inc_iversion(&inode->vfs_inode);
> > >       inode_set_ctime_current(&inode->vfs_inode);
> > >       inode_inc_iversion(&dir->vfs_inode);
> > >
> > >       inode_set_mtime_to_ts(&dir->vfs_inode,
> > >       inode_set_ctime_current(&dir->vfs_inode));>
> > > -     ret = btrfs_update_inode(trans, dir);
> > > -out:
> > > -     return ret;
> > > +
> > > +     return btrfs_update_inode(trans, dir);
> > >
> > >  }
> > >
> > >  int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
> > >
> > > --
> > > 2.47.2
> >
> >
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/14] btrfs: fix a race between renames and directory logging
  2025-06-02 10:32 ` [PATCH 01/14] btrfs: fix a race between renames and directory logging fdmanana
@ 2025-06-03  1:03   ` Boris Burkov
  2025-06-03  7:55     ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2025-06-03  1:03 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 02, 2025 at 11:32:54AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We have a race between a rename and directory inode logging that if it
> happens and we crash/power fail before the rename completes, the next time
> the filesystem is mounted, the log replay code will end up deleting the
> file that was being renamed.
> 
> This is best explained following a step by step analysis of an interleaving
> of steps that lead into this situation.
> 
> Consider the initial conditions:
> 
> 1) We are at transaction N;
> 
> 2) We have directories A and B created in a past transaction (< N);
> 
> 3) We have inode X corresponding to a file that has 2 hardlinks, one in
>    directory A and the other in directory B, so we'll name them as
>    "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
>    past transaction (< N);
> 
> 4) We have inode Y corresponding to a file that as a single hard link and
>    is located in directory A, we'll name it as "A/bar". This file was also
>    persisted in a past transaction (< N).
> 
> The steps leading to a file loss are the following and for all of them we
> are under transaction N:
> 
>  1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
>     is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();
> 
>  2) Task A starts a rename for inode Y, with the goal of renaming from
>     "A/bar" to "A/baz", so we enter btrfs_rename();
> 
>  3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
>     btrfs_insert_inode_ref();
> 
>  4) Because the rename happens in the same directory, we don't set the
>     last_unlink_trans field of directoty A's inode to the current
>     transaction id, that is, we don't cal btrfs_record_unlink_dir();

Presumably, an alternative fix would be to also call
btrfs_record_unlink_dir() for same directory renames? However, it seems
like pinning the log commit for the duration of the rename is going to
be faster than falling back to a full transaction commit.

Did I understand the rationale correctly?

> 
>  5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
>     and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
>     (actually the dir index item is added as a delayed item, but the
>     effect is the same);
> 
>  6) Now before task A adds the new entry "A/baz" to directory A by
>     calling btrfs_add_link(), another task, task B is logging inode X;
> 
>  7) Task B starts a fsync of inode X and after logging inode X, at
>     btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
>     inode X has a last_unlink_trans value of N, set at in step 1;
> 
>  8) At btrfs_log_all_parents() we search for all parent directories of
>     inode X using the commit root, so we find directories A and B and log
>     them. Bu when logging direct A, we don't have a dir index item for
>     inode Y anymore, neither the old name "A/bar" nor for the new name
>     "A/baz" since the rename has deleted the old name but has not yet
>     inserted the new name - task A hasn't called yet btrfs_add_link() to
>     do that.
> 
>     Note that logging directory A doesn't fallback to a transaction
>     commit because its last_unlink_trans has a lower value than the
>     current transaction's id (see step 4);
> 
>  9) Task B finishes logging directories A and B and gets back to
>     btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
>     tree;
> 
> 10) Task B successfully persisted the log tree, btrfs_sync_log() completed
>     with success, and a power failure happened.
> 
>     We have a log tree without any directory entry for inode Y, so the
>     log replay code deletes the entry for inode Y, name "A/bar", from the
>     subvolume tree since it doesn't exist in the log tree and the log
>     tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
>     item that covers the index range for the dentry that corresponds to
>     "A/bar").
> 
>     Since there's no other hard link for inode Y and the log replay code
>     deletes the name "A/bar", the file is lost.
> 
> The issue wouldn't happen if task B synced the log only after task A
> called btrfs_log_new_name(), which would update the log with the new name
> for inode Y ("A/bar").
> 
> Fix this by pinning the log root during renames before removing the old
> directory entry, and unpinning after btrfs_log_new_name() is called.
> 
> Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
> CC: stable@vger.kernel.org # 5.18+
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

The explanation and fix make sense to me, thanks.
Reviewed-by: Boris Burkov <boris@bur.io>

One qq: would it be possible / make sense to have an absolutely minimal
fix patch that only restores the log pinning order (as changed by
259c4b96d78d) without the additional refactoring of the subvol root
checking logic (i.e., getting rid of the per-root pin tracking bools)?
Then separately do that refactoring?

> ---
>  fs/btrfs/inode.c | 81 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 64 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7074f975c033..7bad21ec41f8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8063,6 +8063,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
>  	int ret;
>  	int ret2;
>  	bool need_abort = false;
> +	bool logs_pinned = false;
>  	struct fscrypt_name old_fname, new_fname;
>  	struct fscrypt_str *old_name, *new_name;
>  
> @@ -8186,6 +8187,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
>  	inode_inc_iversion(new_inode);
>  	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  
> +	if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
> +	    new_ino != BTRFS_FIRST_FREE_OBJECTID) {
> +		/*
> +		 * If we are renaming in the same directory (and it's not for
> +		 * root entries) pin the log early to prevent any concurrent
> +		 * task from logging the directory after we removed the old
> +		 * entries and before we add the new entries, otherwise that
> +		 * task can sync a log without any entry for the inodes we are
> +		 * renaming and therefore replaying that log, if a power failure
> +		 * happens after syncing the log, would result in deleting the
> +		 * inodes.
> +		 *
> +		 * If the rename affects two different directories, we want to
> +		 * make sure the that there's no log commit that contains
> +		 * updates for only one of the directories but not for the
> +		 * other.
> +		 *
> +		 * If we are renaming an entry for a root, we don't care about
> +		 * log updates since we called btrfs_set_log_full_commit().
> +		 */
> +		btrfs_pin_log_trans(root);
> +		btrfs_pin_log_trans(dest);
> +		logs_pinned = true;
> +	}
> +
>  	if (old_dentry->d_parent != new_dentry->d_parent) {
>  		btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
>  					BTRFS_I(old_inode), true);
> @@ -8257,30 +8283,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
>  		BTRFS_I(new_inode)->dir_index = new_idx;
>  
>  	/*
> -	 * Now pin the logs of the roots. We do it to ensure that no other task
> -	 * can sync the logs while we are in progress with the rename, because
> -	 * that could result in an inconsistency in case any of the inodes that
> -	 * are part of this rename operation were logged before.
> +	 * Do the log updates for all inodes.
> +	 *
> +	 * If either entry is for a root we don't need to update the logs since
> +	 * we've called btrfs_set_log_full_commit() before.
>  	 */
> -	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> -		btrfs_pin_log_trans(root);
> -	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> -		btrfs_pin_log_trans(dest);
> -
> -	/* Do the log updates for all inodes. */
> -	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> +	if (logs_pinned) {
>  		btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
>  				   old_rename_ctx.index, new_dentry->d_parent);
> -	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
>  		btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
>  				   new_rename_ctx.index, old_dentry->d_parent);
> +	}
>  
> -	/* Now unpin the logs. */
> -	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> +out_fail:
> +	if (logs_pinned) {
>  		btrfs_end_log_trans(root);
> -	if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
>  		btrfs_end_log_trans(dest);
> -out_fail:
> +	}
>  	ret2 = btrfs_end_transaction(trans);
>  	ret = ret ? ret : ret2;
>  out_notrans:
> @@ -8330,6 +8349,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
>  	int ret2;
>  	u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
>  	struct fscrypt_name old_fname, new_fname;
> +	bool logs_pinned = false;
>  
>  	if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
>  		return -EPERM;
> @@ -8464,6 +8484,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
>  	inode_inc_iversion(old_inode);
>  	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  
> +	if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
> +		/*
> +		 * If we are renaming in the same directory (and it's not a
> +		 * root entry) pin the log to prevent any concurrent task from
> +		 * logging the directory after we removed the old entry and
> +		 * before we add the new entry, otherwise that task can sync
> +		 * a log without any entry for the inode we are renaming and
> +		 * therefore replaying that log, if a power failure happens
> +		 * after syncing the log, would result in deleting the inode.
> +		 *
> +		 * If the rename affects two different directories, we want to
> +		 * make sure the that there's no log commit that contains
> +		 * updates for only one of the directories but not for the
> +		 * other.
> +		 *
> +		 * If we are renaming an entry for a root, we don't care about
> +		 * log updates since we called btrfs_set_log_full_commit().
> +		 */
> +		btrfs_pin_log_trans(root);
> +		btrfs_pin_log_trans(dest);
> +		logs_pinned = true;
> +	}
> +
>  	if (old_dentry->d_parent != new_dentry->d_parent)
>  		btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
>  					BTRFS_I(old_inode), true);
> @@ -8528,7 +8571,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
>  	if (old_inode->i_nlink == 1)
>  		BTRFS_I(old_inode)->dir_index = index;
>  
> -	if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> +	if (logs_pinned)
>  		btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
>  				   rename_ctx.index, new_dentry->d_parent);
>  
> @@ -8544,6 +8587,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
>  		}
>  	}
>  out_fail:
> +	if (logs_pinned) {
> +		btrfs_end_log_trans(root);
> +		btrfs_end_log_trans(dest);
> +	}
>  	ret2 = btrfs_end_transaction(trans);
>  	ret = ret ? ret : ret2;
>  out_notrans:
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/14] btrfs: directory logging bug fix and several updates
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (13 preceding siblings ...)
  2025-06-02 10:33 ` [PATCH 14/14] btrfs: make btrfs_readdir_delayed_dir_index() " fdmanana
@ 2025-06-03  7:23 ` David Sterba
  2025-06-03 17:34 ` Boris Burkov
  15 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2025-06-03  7:23 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 02, 2025 at 11:32:53AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> These fix a race between renames and directory logging and several cleanups
> and small optimizations. Details in the change logs.
> 
> Filipe Manana (14):
>   btrfs: fix a race between renames and directory logging
>   btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log()
>   btrfs: free path sooner at __btrfs_unlink_inode()
>   btrfs: use btrfs_del_item() at del_logged_dentry()
>   btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log()
>   btrfs: allocate path earlier at btrfs_del_dir_entries_in_log()
>   btrfs: allocate path earlier at btrfs_log_new_name()
>   btrfs: allocate scratch eb earlier at btrfs_log_new_name()
>   btrfs: pass NULL index to btrfs_del_inode_ref() where not needed
>   btrfs: switch del_all argument of replay_dir_deletes() from int to bool
>   btrfs: make btrfs_delete_delayed_insertion_item() return a boolean
>   btrfs: add details to error messages at btrfs_delete_delayed_dir_index()
>   btrfs: make btrfs_should_delete_dir_index() return a bool instead
>   btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead

The cleanup patches look ok. Feel free to add the whole series to
for-next, I'm not sure how much review from others you'll get for the
lgging magic.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/14] btrfs: fix a race between renames and directory logging
  2025-06-03  1:03   ` Boris Burkov
@ 2025-06-03  7:55     ` Filipe Manana
  2025-06-03 17:11       ` Boris Burkov
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2025-06-03  7:55 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On Tue, Jun 3, 2025 at 2:03 AM Boris Burkov <boris@bur.io> wrote:
>
> On Mon, Jun 02, 2025 at 11:32:54AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > We have a race between a rename and directory inode logging that if it
> > happens and we crash/power fail before the rename completes, the next time
> > the filesystem is mounted, the log replay code will end up deleting the
> > file that was being renamed.
> >
> > This is best explained following a step by step analysis of an interleaving
> > of steps that lead into this situation.
> >
> > Consider the initial conditions:
> >
> > 1) We are at transaction N;
> >
> > 2) We have directories A and B created in a past transaction (< N);
> >
> > 3) We have inode X corresponding to a file that has 2 hardlinks, one in
> >    directory A and the other in directory B, so we'll name them as
> >    "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
> >    past transaction (< N);
> >
> > 4) We have inode Y corresponding to a file that as a single hard link and
> >    is located in directory A, we'll name it as "A/bar". This file was also
> >    persisted in a past transaction (< N).
> >
> > The steps leading to a file loss are the following and for all of them we
> > are under transaction N:
> >
> >  1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
> >     is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();
> >
> >  2) Task A starts a rename for inode Y, with the goal of renaming from
> >     "A/bar" to "A/baz", so we enter btrfs_rename();
> >
> >  3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
> >     btrfs_insert_inode_ref();
> >
> >  4) Because the rename happens in the same directory, we don't set the
> >     last_unlink_trans field of directoty A's inode to the current
> >     transaction id, that is, we don't cal btrfs_record_unlink_dir();
>
> Presumably, an alternative fix would be to also call
> btrfs_record_unlink_dir() for same directory renames? However, it seems
> like pinning the log commit for the duration of the rename is going to
> be faster than falling back to a full transaction commit.
>
> Did I understand the rationale correctly?

Yes, you understood correctly.
We want to avoid setting last_unlink_trans for the directory, in case
of same directory renames, to avoid full transaction commits.

>
> >
> >  5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
> >     and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
> >     (actually the dir index item is added as a delayed item, but the
> >     effect is the same);
> >
> >  6) Now before task A adds the new entry "A/baz" to directory A by
> >     calling btrfs_add_link(), another task, task B is logging inode X;
> >
> >  7) Task B starts a fsync of inode X and after logging inode X, at
> >     btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
> >     inode X has a last_unlink_trans value of N, set at in step 1;
> >
> >  8) At btrfs_log_all_parents() we search for all parent directories of
> >     inode X using the commit root, so we find directories A and B and log
> >     them. Bu when logging direct A, we don't have a dir index item for
> >     inode Y anymore, neither the old name "A/bar" nor for the new name
> >     "A/baz" since the rename has deleted the old name but has not yet
> >     inserted the new name - task A hasn't called yet btrfs_add_link() to
> >     do that.
> >
> >     Note that logging directory A doesn't fallback to a transaction
> >     commit because its last_unlink_trans has a lower value than the
> >     current transaction's id (see step 4);
> >
> >  9) Task B finishes logging directories A and B and gets back to
> >     btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
> >     tree;
> >
> > 10) Task B successfully persisted the log tree, btrfs_sync_log() completed
> >     with success, and a power failure happened.
> >
> >     We have a log tree without any directory entry for inode Y, so the
> >     log replay code deletes the entry for inode Y, name "A/bar", from the
> >     subvolume tree since it doesn't exist in the log tree and the log
> >     tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
> >     item that covers the index range for the dentry that corresponds to
> >     "A/bar").
> >
> >     Since there's no other hard link for inode Y and the log replay code
> >     deletes the name "A/bar", the file is lost.
> >
> > The issue wouldn't happen if task B synced the log only after task A
> > called btrfs_log_new_name(), which would update the log with the new name
> > for inode Y ("A/bar").
> >
> > Fix this by pinning the log root during renames before removing the old
> > directory entry, and unpinning after btrfs_log_new_name() is called.
> >
> > Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
> > CC: stable@vger.kernel.org # 5.18+
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> The explanation and fix make sense to me, thanks.
> Reviewed-by: Boris Burkov <boris@bur.io>
>
> One qq: would it be possible / make sense to have an absolutely minimal
> fix patch that only restores the log pinning order (as changed by
> 259c4b96d78d) without the additional refactoring of the subvol root
> checking logic (i.e., getting rid of the per-root pin tracking bools)?
> Then separately do that refactoring?

A fix doesn't have to necessarily make code exactly equal to what it
was before, unless the only solution is a full revert, which usually
happens for fairly recent commits.

The code is slightly different, yes, but many things have changed
after that commit over the years.
I wouldn't call it a refactoring, and I don't see why you think it
makes backporting harder, in fact trying to do it exactly as before
would actually make the backport harder on more recent kernels.

The code now is equivalent, with more comments about what is being done and why
The addition is to avoid logging new names in case we have set the log
for full commit, due to renaming a root, but that doesn't make it
harder to backport at all.

Thanks.


>
> > ---
> >  fs/btrfs/inode.c | 81 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 64 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 7074f975c033..7bad21ec41f8 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -8063,6 +8063,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> >       int ret;
> >       int ret2;
> >       bool need_abort = false;
> > +     bool logs_pinned = false;
> >       struct fscrypt_name old_fname, new_fname;
> >       struct fscrypt_str *old_name, *new_name;
> >
> > @@ -8186,6 +8187,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> >       inode_inc_iversion(new_inode);
> >       simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
> >
> > +     if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
> > +         new_ino != BTRFS_FIRST_FREE_OBJECTID) {
> > +             /*
> > +              * If we are renaming in the same directory (and it's not for
> > +              * root entries) pin the log early to prevent any concurrent
> > +              * task from logging the directory after we removed the old
> > +              * entries and before we add the new entries, otherwise that
> > +              * task can sync a log without any entry for the inodes we are
> > +              * renaming and therefore replaying that log, if a power failure
> > +              * happens after syncing the log, would result in deleting the
> > +              * inodes.
> > +              *
> > +              * If the rename affects two different directories, we want to
> > +              * make sure the that there's no log commit that contains
> > +              * updates for only one of the directories but not for the
> > +              * other.
> > +              *
> > +              * If we are renaming an entry for a root, we don't care about
> > +              * log updates since we called btrfs_set_log_full_commit().
> > +              */
> > +             btrfs_pin_log_trans(root);
> > +             btrfs_pin_log_trans(dest);
> > +             logs_pinned = true;
> > +     }
> > +
> >       if (old_dentry->d_parent != new_dentry->d_parent) {
> >               btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
> >                                       BTRFS_I(old_inode), true);
> > @@ -8257,30 +8283,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> >               BTRFS_I(new_inode)->dir_index = new_idx;
> >
> >       /*
> > -      * Now pin the logs of the roots. We do it to ensure that no other task
> > -      * can sync the logs while we are in progress with the rename, because
> > -      * that could result in an inconsistency in case any of the inodes that
> > -      * are part of this rename operation were logged before.
> > +      * Do the log updates for all inodes.
> > +      *
> > +      * If either entry is for a root we don't need to update the logs since
> > +      * we've called btrfs_set_log_full_commit() before.
> >        */
> > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > -             btrfs_pin_log_trans(root);
> > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> > -             btrfs_pin_log_trans(dest);
> > -
> > -     /* Do the log updates for all inodes. */
> > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > +     if (logs_pinned) {
> >               btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
> >                                  old_rename_ctx.index, new_dentry->d_parent);
> > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> >               btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
> >                                  new_rename_ctx.index, old_dentry->d_parent);
> > +     }
> >
> > -     /* Now unpin the logs. */
> > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > +out_fail:
> > +     if (logs_pinned) {
> >               btrfs_end_log_trans(root);
> > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> >               btrfs_end_log_trans(dest);
> > -out_fail:
> > +     }
> >       ret2 = btrfs_end_transaction(trans);
> >       ret = ret ? ret : ret2;
> >  out_notrans:
> > @@ -8330,6 +8349,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> >       int ret2;
> >       u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
> >       struct fscrypt_name old_fname, new_fname;
> > +     bool logs_pinned = false;
> >
> >       if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
> >               return -EPERM;
> > @@ -8464,6 +8484,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> >       inode_inc_iversion(old_inode);
> >       simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
> >
> > +     if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
> > +             /*
> > +              * If we are renaming in the same directory (and it's not a
> > +              * root entry) pin the log to prevent any concurrent task from
> > +              * logging the directory after we removed the old entry and
> > +              * before we add the new entry, otherwise that task can sync
> > +              * a log without any entry for the inode we are renaming and
> > +              * therefore replaying that log, if a power failure happens
> > +              * after syncing the log, would result in deleting the inode.
> > +              *
> > +              * If the rename affects two different directories, we want to
> > +              * make sure the that there's no log commit that contains
> > +              * updates for only one of the directories but not for the
> > +              * other.
> > +              *
> > +              * If we are renaming an entry for a root, we don't care about
> > +              * log updates since we called btrfs_set_log_full_commit().
> > +              */
> > +             btrfs_pin_log_trans(root);
> > +             btrfs_pin_log_trans(dest);
> > +             logs_pinned = true;
> > +     }
> > +
> >       if (old_dentry->d_parent != new_dentry->d_parent)
> >               btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
> >                                       BTRFS_I(old_inode), true);
> > @@ -8528,7 +8571,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> >       if (old_inode->i_nlink == 1)
> >               BTRFS_I(old_inode)->dir_index = index;
> >
> > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > +     if (logs_pinned)
> >               btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
> >                                  rename_ctx.index, new_dentry->d_parent);
> >
> > @@ -8544,6 +8587,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> >               }
> >       }
> >  out_fail:
> > +     if (logs_pinned) {
> > +             btrfs_end_log_trans(root);
> > +             btrfs_end_log_trans(dest);
> > +     }
> >       ret2 = btrfs_end_transaction(trans);
> >       ret = ret ? ret : ret2;
> >  out_notrans:
> > --
> > 2.47.2
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/14] btrfs: fix a race between renames and directory logging
  2025-06-03  7:55     ` Filipe Manana
@ 2025-06-03 17:11       ` Boris Burkov
  0 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2025-06-03 17:11 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Tue, Jun 03, 2025 at 08:55:03AM +0100, Filipe Manana wrote:
> On Tue, Jun 3, 2025 at 2:03 AM Boris Burkov <boris@bur.io> wrote:
> >
> > On Mon, Jun 02, 2025 at 11:32:54AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > We have a race between a rename and directory inode logging that if it
> > > happens and we crash/power fail before the rename completes, the next time
> > > the filesystem is mounted, the log replay code will end up deleting the
> > > file that was being renamed.
> > >
> > > This is best explained following a step by step analysis of an interleaving
> > > of steps that lead into this situation.
> > >
> > > Consider the initial conditions:
> > >
> > > 1) We are at transaction N;
> > >
> > > 2) We have directories A and B created in a past transaction (< N);
> > >
> > > 3) We have inode X corresponding to a file that has 2 hardlinks, one in
> > >    directory A and the other in directory B, so we'll name them as
> > >    "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
> > >    past transaction (< N);
> > >
> > > 4) We have inode Y corresponding to a file that as a single hard link and
> > >    is located in directory A, we'll name it as "A/bar". This file was also
> > >    persisted in a past transaction (< N).
> > >
> > > The steps leading to a file loss are the following and for all of them we
> > > are under transaction N:
> > >
> > >  1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
> > >     is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();
> > >
> > >  2) Task A starts a rename for inode Y, with the goal of renaming from
> > >     "A/bar" to "A/baz", so we enter btrfs_rename();
> > >
> > >  3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
> > >     btrfs_insert_inode_ref();
> > >
> > >  4) Because the rename happens in the same directory, we don't set the
> > >     last_unlink_trans field of directoty A's inode to the current
> > >     transaction id, that is, we don't cal btrfs_record_unlink_dir();
> >
> > Presumably, an alternative fix would be to also call
> > btrfs_record_unlink_dir() for same directory renames? However, it seems
> > like pinning the log commit for the duration of the rename is going to
> > be faster than falling back to a full transaction commit.
> >
> > Did I understand the rationale correctly?
> 
> Yes, you understood correctly.
> We want to avoid setting last_unlink_trans for the directory, in case
> of same directory renames, to avoid full transaction commits.
> 
> >
> > >
> > >  5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
> > >     and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
> > >     (actually the dir index item is added as a delayed item, but the
> > >     effect is the same);
> > >
> > >  6) Now before task A adds the new entry "A/baz" to directory A by
> > >     calling btrfs_add_link(), another task, task B is logging inode X;
> > >
> > >  7) Task B starts a fsync of inode X and after logging inode X, at
> > >     btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
> > >     inode X has a last_unlink_trans value of N, set at in step 1;
> > >
> > >  8) At btrfs_log_all_parents() we search for all parent directories of
> > >     inode X using the commit root, so we find directories A and B and log
> > >     them. Bu when logging direct A, we don't have a dir index item for
> > >     inode Y anymore, neither the old name "A/bar" nor for the new name
> > >     "A/baz" since the rename has deleted the old name but has not yet
> > >     inserted the new name - task A hasn't called yet btrfs_add_link() to
> > >     do that.
> > >
> > >     Note that logging directory A doesn't fallback to a transaction
> > >     commit because its last_unlink_trans has a lower value than the
> > >     current transaction's id (see step 4);
> > >
> > >  9) Task B finishes logging directories A and B and gets back to
> > >     btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
> > >     tree;
> > >
> > > 10) Task B successfully persisted the log tree, btrfs_sync_log() completed
> > >     with success, and a power failure happened.
> > >
> > >     We have a log tree without any directory entry for inode Y, so the
> > >     log replay code deletes the entry for inode Y, name "A/bar", from the
> > >     subvolume tree since it doesn't exist in the log tree and the log
> > >     tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
> > >     item that covers the index range for the dentry that corresponds to
> > >     "A/bar").
> > >
> > >     Since there's no other hard link for inode Y and the log replay code
> > >     deletes the name "A/bar", the file is lost.
> > >
> > > The issue wouldn't happen if task B synced the log only after task A
> > > called btrfs_log_new_name(), which would update the log with the new name
> > > for inode Y ("A/bar").
> > >
> > > Fix this by pinning the log root during renames before removing the old
> > > directory entry, and unpinning after btrfs_log_new_name() is called.
> > >
> > > Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
> > > CC: stable@vger.kernel.org # 5.18+
> > >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > The explanation and fix make sense to me, thanks.
> > Reviewed-by: Boris Burkov <boris@bur.io>
> >
> > One qq: would it be possible / make sense to have an absolutely minimal
> > fix patch that only restores the log pinning order (as changed by
> > 259c4b96d78d) without the additional refactoring of the subvol root
> > checking logic (i.e., getting rid of the per-root pin tracking bools)?
> > Then separately do that refactoring?
> 
> A fix doesn't have to necessarily make code exactly equal to what it
> was before, unless the only solution is a full revert, which usually
> happens for fairly recent commits.
> 
> The code is slightly different, yes, but many things have changed
> after that commit over the years.
> I wouldn't call it a refactoring, and I don't see why you think it
> makes backporting harder, in fact trying to do it exactly as before
> would actually make the backport harder on more recent kernels.

I missed how old the bug was, my bad. Totally agreed that "fixing
forward" is way more appropriate here.

> 
> The code now is equivalent, with more comments about what is being done and why
> The addition is to avoid logging new names in case we have set the log
> for full commit, due to renaming a root, but that doesn't make it
> harder to backport at all.
> 
> Thanks.
> 
> 
> >
> > > ---
> > >  fs/btrfs/inode.c | 81 ++++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 64 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 7074f975c033..7bad21ec41f8 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -8063,6 +8063,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> > >       int ret;
> > >       int ret2;
> > >       bool need_abort = false;
> > > +     bool logs_pinned = false;
> > >       struct fscrypt_name old_fname, new_fname;
> > >       struct fscrypt_str *old_name, *new_name;
> > >
> > > @@ -8186,6 +8187,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> > >       inode_inc_iversion(new_inode);
> > >       simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
> > >
> > > +     if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
> > > +         new_ino != BTRFS_FIRST_FREE_OBJECTID) {
> > > +             /*
> > > +              * If we are renaming in the same directory (and it's not for
> > > +              * root entries) pin the log early to prevent any concurrent
> > > +              * task from logging the directory after we removed the old
> > > +              * entries and before we add the new entries, otherwise that
> > > +              * task can sync a log without any entry for the inodes we are
> > > +              * renaming and therefore replaying that log, if a power failure
> > > +              * happens after syncing the log, would result in deleting the
> > > +              * inodes.
> > > +              *
> > > +              * If the rename affects two different directories, we want to
> > > +              * make sure the that there's no log commit that contains
> > > +              * updates for only one of the directories but not for the
> > > +              * other.
> > > +              *
> > > +              * If we are renaming an entry for a root, we don't care about
> > > +              * log updates since we called btrfs_set_log_full_commit().
> > > +              */
> > > +             btrfs_pin_log_trans(root);
> > > +             btrfs_pin_log_trans(dest);
> > > +             logs_pinned = true;
> > > +     }
> > > +
> > >       if (old_dentry->d_parent != new_dentry->d_parent) {
> > >               btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
> > >                                       BTRFS_I(old_inode), true);
> > > @@ -8257,30 +8283,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
> > >               BTRFS_I(new_inode)->dir_index = new_idx;
> > >
> > >       /*
> > > -      * Now pin the logs of the roots. We do it to ensure that no other task
> > > -      * can sync the logs while we are in progress with the rename, because
> > > -      * that could result in an inconsistency in case any of the inodes that
> > > -      * are part of this rename operation were logged before.
> > > +      * Do the log updates for all inodes.
> > > +      *
> > > +      * If either entry is for a root we don't need to update the logs since
> > > +      * we've called btrfs_set_log_full_commit() before.
> > >        */
> > > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > > -             btrfs_pin_log_trans(root);
> > > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> > > -             btrfs_pin_log_trans(dest);
> > > -
> > > -     /* Do the log updates for all inodes. */
> > > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > > +     if (logs_pinned) {
> > >               btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
> > >                                  old_rename_ctx.index, new_dentry->d_parent);
> > > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> > >               btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
> > >                                  new_rename_ctx.index, old_dentry->d_parent);
> > > +     }
> > >
> > > -     /* Now unpin the logs. */
> > > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > > +out_fail:
> > > +     if (logs_pinned) {
> > >               btrfs_end_log_trans(root);
> > > -     if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
> > >               btrfs_end_log_trans(dest);
> > > -out_fail:
> > > +     }
> > >       ret2 = btrfs_end_transaction(trans);
> > >       ret = ret ? ret : ret2;
> > >  out_notrans:
> > > @@ -8330,6 +8349,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> > >       int ret2;
> > >       u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
> > >       struct fscrypt_name old_fname, new_fname;
> > > +     bool logs_pinned = false;
> > >
> > >       if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
> > >               return -EPERM;
> > > @@ -8464,6 +8484,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> > >       inode_inc_iversion(old_inode);
> > >       simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
> > >
> > > +     if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
> > > +             /*
> > > +              * If we are renaming in the same directory (and it's not a
> > > +              * root entry) pin the log to prevent any concurrent task from
> > > +              * logging the directory after we removed the old entry and
> > > +              * before we add the new entry, otherwise that task can sync
> > > +              * a log without any entry for the inode we are renaming and
> > > +              * therefore replaying that log, if a power failure happens
> > > +              * after syncing the log, would result in deleting the inode.
> > > +              *
> > > +              * If the rename affects two different directories, we want to
> > > +              * make sure the that there's no log commit that contains
> > > +              * updates for only one of the directories but not for the
> > > +              * other.
> > > +              *
> > > +              * If we are renaming an entry for a root, we don't care about
> > > +              * log updates since we called btrfs_set_log_full_commit().
> > > +              */
> > > +             btrfs_pin_log_trans(root);
> > > +             btrfs_pin_log_trans(dest);
> > > +             logs_pinned = true;
> > > +     }
> > > +
> > >       if (old_dentry->d_parent != new_dentry->d_parent)
> > >               btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
> > >                                       BTRFS_I(old_inode), true);
> > > @@ -8528,7 +8571,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> > >       if (old_inode->i_nlink == 1)
> > >               BTRFS_I(old_inode)->dir_index = index;
> > >
> > > -     if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
> > > +     if (logs_pinned)
> > >               btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
> > >                                  rename_ctx.index, new_dentry->d_parent);
> > >
> > > @@ -8544,6 +8587,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
> > >               }
> > >       }
> > >  out_fail:
> > > +     if (logs_pinned) {
> > > +             btrfs_end_log_trans(root);
> > > +             btrfs_end_log_trans(dest);
> > > +     }
> > >       ret2 = btrfs_end_transaction(trans);
> > >       ret = ret ? ret : ret2;
> > >  out_notrans:
> > > --
> > > 2.47.2
> > >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/14] btrfs: directory logging bug fix and several updates
  2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
                   ` (14 preceding siblings ...)
  2025-06-03  7:23 ` [PATCH 00/14] btrfs: directory logging bug fix and several updates David Sterba
@ 2025-06-03 17:34 ` Boris Burkov
  15 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2025-06-03 17:34 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 02, 2025 at 11:32:53AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> These fix a race between renames and directory logging and several cleanups
> and small optimizations. Details in the change logs.

These all LGTM, thanks!
Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Filipe Manana (14):
>   btrfs: fix a race between renames and directory logging
>   btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log()
>   btrfs: free path sooner at __btrfs_unlink_inode()
>   btrfs: use btrfs_del_item() at del_logged_dentry()
>   btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log()
>   btrfs: allocate path earlier at btrfs_del_dir_entries_in_log()
>   btrfs: allocate path earlier at btrfs_log_new_name()
>   btrfs: allocate scratch eb earlier at btrfs_log_new_name()
>   btrfs: pass NULL index to btrfs_del_inode_ref() where not needed
>   btrfs: switch del_all argument of replay_dir_deletes() from int to bool
>   btrfs: make btrfs_delete_delayed_insertion_item() return a boolean
>   btrfs: add details to error messages at btrfs_delete_delayed_dir_index()
>   btrfs: make btrfs_should_delete_dir_index() return a bool instead
>   btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead
> 
>  fs/btrfs/delayed-inode.c |  42 +++++++-------
>  fs/btrfs/delayed-inode.h |   7 +--
>  fs/btrfs/inode.c         | 119 ++++++++++++++++++++++++++-------------
>  fs/btrfs/tree-log.c      |  72 ++++++++++++-----------
>  4 files changed, 144 insertions(+), 96 deletions(-)
> 
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-06-03 17:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-02 10:32 [PATCH 00/14] btrfs: directory logging bug fix and several updates fdmanana
2025-06-02 10:32 ` [PATCH 01/14] btrfs: fix a race between renames and directory logging fdmanana
2025-06-03  1:03   ` Boris Burkov
2025-06-03  7:55     ` Filipe Manana
2025-06-03 17:11       ` Boris Burkov
2025-06-02 10:32 ` [PATCH 02/14] btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log() fdmanana
2025-06-02 10:32 ` [PATCH 03/14] btrfs: free path sooner at __btrfs_unlink_inode() fdmanana
2025-06-02 13:25   ` Sun YangKai
2025-06-02 13:28     ` Filipe Manana
2025-06-02 16:20       ` Daniel Vacek
2025-06-02 10:32 ` [PATCH 04/14] btrfs: use btrfs_del_item() at del_logged_dentry() fdmanana
2025-06-02 10:32 ` [PATCH 05/14] btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log() fdmanana
2025-06-02 10:32 ` [PATCH 06/14] btrfs: allocate path earlier " fdmanana
2025-06-02 10:33 ` [PATCH 07/14] btrfs: allocate path earlier at btrfs_log_new_name() fdmanana
2025-06-02 10:33 ` [PATCH 08/14] btrfs: allocate scratch eb " fdmanana
2025-06-02 10:33 ` [PATCH 09/14] btrfs: pass NULL index to btrfs_del_inode_ref() where not needed fdmanana
2025-06-02 10:33 ` [PATCH 10/14] btrfs: switch del_all argument of replay_dir_deletes() from int to bool fdmanana
2025-06-02 10:33 ` [PATCH 11/14] btrfs: make btrfs_delete_delayed_insertion_item() return a boolean fdmanana
2025-06-02 10:33 ` [PATCH 12/14] btrfs: add details to error messages at btrfs_delete_delayed_dir_index() fdmanana
2025-06-02 10:33 ` [PATCH 13/14] btrfs: make btrfs_should_delete_dir_index() return a bool instead fdmanana
2025-06-02 10:33 ` [PATCH 14/14] btrfs: make btrfs_readdir_delayed_dir_index() " fdmanana
2025-06-03  7:23 ` [PATCH 00/14] btrfs: directory logging bug fix and several updates David Sterba
2025-06-03 17:34 ` Boris Burkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.