* [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before
@ 2025-08-06 11:11 fdmanana
2025-08-06 11:11 ` [PATCH 1/3] btrfs: fix race between logging inode and checking if it " fdmanana
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: fdmanana @ 2025-08-06 11:11 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The first patch is actually a new version of a patch sent out before [1],
because holding the inode's log_mutex at inode_logged() can make lockdep
unhappy in situations like when logging conflicting inodes, where we are
already holding the log_mutex of some other inode and could potentially
result in a ABBA deadlock.
The second patch splits part of what the initial version of patch 1 fixed,
but with a different solution that makes the management of an inode's
last_dir_index_offset field simpler.
Patch 3 is completely new.
Details in the change logs.
[1] - https://lore.kernel.org/linux-btrfs/7585d15c0e9c163d5cdf32307014a4e792e62541.1753807163.git.fdmanana@suse.com/
Filipe Manana (3):
btrfs: fix race between logging inode and checking if it was logged before
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: avoid load/store tearing races when checking if an inode was logged
fs/btrfs/btrfs_inode.h | 2 +-
fs/btrfs/inode.c | 1 +
fs/btrfs/tree-log.c | 78 ++++++++++++++++++++++++++++--------------
3 files changed, 55 insertions(+), 26 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/3] btrfs: fix race between logging inode and checking if it was logged before
2025-08-06 11:11 [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before fdmanana
@ 2025-08-06 11:11 ` fdmanana
2025-08-06 11:11 ` [PATCH 2/3] btrfs: fix race between setting last_dir_index_offset and inode logging fdmanana
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2025-08-06 11:11 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
There's a race between checking if an inode was logged before and logging
an inode that can cause us to mark an inode as not logged just after it
was logged by a concurrent task:
1) We have inode X which was not logged before neither in the current
transaction not in past transaction since the inode was loaded into
memory, so it's ->logged_trans value is 0;
2) We are at transaction N;
3) Task A calls inode_logged() against inode X, sees that ->logged_trans
is 0 and there is a log tree and so it proceeds to search in the log
tree for an inode item for inode X. It doesn't see any, but before
it sets ->logged_trans to N - 1...
3) Task B calls btrfs_log_inode() against inode X, logs the inode and
sets ->logged_trans to N;
4) Task A now sets ->logged_trans to N - 1;
5) At this point anyone calling inode_logged() gets 0 (inode not logged)
since ->logged_trans is greater than 0 and less than N, but our inode
was really logged. As a consequence operations like rename, unlink and
link that happen afterwards in the current transaction end up not
updating the log when they should.
Fix this by ensuring inode_logged() only updates ->logged_trans in case
the inode item is not found in the log tree if after tacking the inode's
lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still
zero, since the inode lock is what protects setting ->logged_trans at
btrfs_log_inode().
Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 0f3f4de9ca9c..3d8090bc4b4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3489,6 +3489,31 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
return 0;
}
+static bool mark_inode_as_not_logged(const struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode)
+{
+ bool ret = false;
+
+ /*
+ * Do this only if ->logged_trans is still 0 to prevent races with
+ * concurrent logging as we may see the inode not logged when
+ * inode_logged() is called but it gets logged after inode_logged() did
+ * not find it in the log tree and we end up setting ->logged_trans to a
+ * value less than trans->transid after the concurrent logging task has
+ * set it to trans->transid. As a consequence, subsequent rename, unlink
+ * and link operations may end up not logging new names and removing old
+ * names from the log.
+ */
+ spin_lock(&inode->lock);
+ if (inode->logged_trans == 0)
+ inode->logged_trans = trans->transid - 1;
+ else if (inode->logged_trans == trans->transid)
+ ret = true;
+ spin_unlock(&inode->lock);
+
+ return ret;
+}
+
/*
* Check if an inode was logged in the current transaction. This correctly deals
* with the case where the inode was logged but has a logged_trans of 0, which
@@ -3523,10 +3548,8 @@ static int inode_logged(const struct btrfs_trans_handle *trans,
* transaction's ID, to avoid the search below in a future call in case
* a log tree gets created after this.
*/
- if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state)) {
- inode->logged_trans = trans->transid - 1;
- return 0;
- }
+ if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state))
+ return mark_inode_as_not_logged(trans, inode);
/*
* We have a log tree and the inode's logged_trans is 0. We can't tell
@@ -3580,8 +3603,7 @@ static int inode_logged(const struct btrfs_trans_handle *trans,
* Set logged_trans to a value greater than 0 and less then the
* current transaction to avoid doing the search in future calls.
*/
- inode->logged_trans = trans->transid - 1;
- return 0;
+ return mark_inode_as_not_logged(trans, inode);
}
/*
@@ -3589,7 +3611,9 @@ static int inode_logged(const struct btrfs_trans_handle *trans,
* the current transacion's ID, to avoid future tree searches as long as
* the inode is not evicted again.
*/
+ spin_lock(&inode->lock);
inode->logged_trans = trans->transid;
+ spin_unlock(&inode->lock);
/*
* If it's a directory, then we must set last_dir_index_offset to the
--
2.47.2
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 2/3] btrfs: fix race between setting last_dir_index_offset and inode logging
2025-08-06 11:11 [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before fdmanana
2025-08-06 11:11 ` [PATCH 1/3] btrfs: fix race between logging inode and checking if it " fdmanana
@ 2025-08-06 11:11 ` fdmanana
2025-08-06 11:11 ` [PATCH 3/3] btrfs: avoid load/store tearing races when checking if an inode was logged fdmanana
2025-08-08 21:07 ` [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before Boris Burkov
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2025-08-06 11:11 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).
This however has a possibility for a time window where a race can happen
and lead to directory logging skipping dir index keys that should be
logged. The race happens like this:
1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
that the inode item was logged before, but before it sets the inode's
->last_dir_index_offset to (u64)-1...
2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
that has set ->last_dir_index_offset to (u64)-1;
3) Task B then enters log_directory_changes() which calls
update_last_dir_index_offset(). There it sees ->last_dir_index_offset
is (u64)-1 and that the inode was logged before (ctx->logged_before is
true), and so it searches for the last logged dir index key in the log
tree and it finds that it has an offset (index) value of N, so it sets
->last_dir_index_offset to N, so that we can skip index keys that are
less than or equal to N (later at process_dir_items_leaf());
4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
that task B just did;
5) Task B will now skip every index key when it enters
process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.
Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.
Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/btrfs_inode.h | 2 +-
fs/btrfs/inode.c | 1 +
fs/btrfs/tree-log.c | 17 ++---------------
3 files changed, 4 insertions(+), 16 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index b99fb0273292..0387b9f43a52 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -248,7 +248,7 @@ struct btrfs_inode {
u64 new_delalloc_bytes;
/*
* The offset of the last dir index key that was logged.
- * This is used only for directories.
+ * This is used only for directories. Protected by 'log_mutex'.
*/
u64 last_dir_index_offset;
};
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 83e242bf42f3..096044eeb0ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7832,6 +7832,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->last_sub_trans = 0;
ei->logged_trans = 0;
ei->delalloc_bytes = 0;
+ /* new_delalloc_bytes and last_dir_index_offset are in a union. */
ei->new_delalloc_bytes = 0;
ei->defrag_bytes = 0;
ei->disk_i_size = 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 3d8090bc4b4e..9e12447f3e0e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3615,19 +3615,6 @@ static int inode_logged(const struct btrfs_trans_handle *trans,
inode->logged_trans = trans->transid;
spin_unlock(&inode->lock);
- /*
- * If it's a directory, then we must set last_dir_index_offset to the
- * maximum possible value, so that the next attempt to log the inode does
- * not skip checking if dir index keys found in modified subvolume tree
- * leaves have been logged before, otherwise it would result in attempts
- * to insert duplicate dir index keys in the log tree. This must be done
- * because last_dir_index_offset is an in-memory only field, not persisted
- * in the inode item or any other on-disk structure, so its value is lost
- * once the inode is evicted.
- */
- if (S_ISDIR(inode->vfs_inode.i_mode))
- inode->last_dir_index_offset = (u64)-1;
-
return 1;
}
@@ -4218,7 +4205,7 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans,
/*
* If the inode was logged before and it was evicted, then its
- * last_dir_index_offset is (u64)-1, so we don't the value of the last index
+ * last_dir_index_offset is 0, so we don't know the value of the last index
* key offset. If that's the case, search for it and update the inode. This
* is to avoid lookups in the log tree every time we try to insert a dir index
* key from a leaf changed in the current transaction, and to allow us to always
@@ -4234,7 +4221,7 @@ static int update_last_dir_index_offset(struct btrfs_inode *inode,
lockdep_assert_held(&inode->log_mutex);
- if (inode->last_dir_index_offset != (u64)-1)
+ if (inode->last_dir_index_offset != 0)
return 0;
if (!ctx->logged_before) {
--
2.47.2
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 3/3] btrfs: avoid load/store tearing races when checking if an inode was logged
2025-08-06 11:11 [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before fdmanana
2025-08-06 11:11 ` [PATCH 1/3] btrfs: fix race between logging inode and checking if it " fdmanana
2025-08-06 11:11 ` [PATCH 2/3] btrfs: fix race between setting last_dir_index_offset and inode logging fdmanana
@ 2025-08-06 11:11 ` fdmanana
2025-08-08 21:07 ` [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before Boris Burkov
3 siblings, 0 replies; 5+ messages in thread
From: fdmanana @ 2025-08-06 11:11 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
At inode_logged() we do a couple lockless checks for ->logged_trans, and
these are generally safe except the second one in case we get a load or
store tearing due to a concurrent call updating ->logged_trans (either at
btrfs_log_inode() or later at inode_logged()).
In the first case it's safe to compare to the current transaction ID since
once ->logged_trans is set the current transaction, we never set it to a
lower value.
In the second case, where we check if it's greater than zero, we are prone
to load/store tearing races, since we can have a concurrent task updating
to the current transaction ID with store tearing for example, instead of
updating with a single 64 bits write, to update with two 32 bits writes or
four 16 bits writes. In that case the reading side at inode_logged() could
see a positive value that does not match the current transaction and then
return a false negative.
Fix this by doing the second check while holding the inode's spinlock, add
some comments about it too. Also add the data_race() annotation to the
first check to avoid any reports from KCSAN (or similar tools) and comment
about it.
Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9e12447f3e0e..654d6912eb46 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3531,15 +3531,32 @@ static int inode_logged(const struct btrfs_trans_handle *trans,
struct btrfs_key key;
int ret;
- if (inode->logged_trans == trans->transid)
+ /*
+ * Quick lockless call, since once ->logged_trans is set to the current
+ * transaction, we never set it to a lower value anywhere else.
+ */
+ if (data_race(inode->logged_trans) == trans->transid)
return 1;
/*
- * If logged_trans is not 0, then we know the inode logged was not logged
- * in this transaction, so we can return false right away.
+ * If logged_trans is not 0 and not trans->transid, then we know the
+ * inode was not logged in this transaction, so we can return false
+ * right away. We take the lock to avoid a race caused by load/store
+ * tearing with a concurrent btrfs_log_inode() call or a concurrent task
+ * in this function further below - an update to trans->transid can be
+ * teared into two 32 bits updates for example, in which case we could
+ * see a positive value that is not trans->transid and assume the inode
+ * was not logged when it was.
*/
- if (inode->logged_trans > 0)
+ spin_lock(&inode->lock);
+ if (inode->logged_trans == trans->transid) {
+ spin_unlock(&inode->lock);
+ return 1;
+ } else if (inode->logged_trans > 0) {
+ spin_unlock(&inode->lock);
return 0;
+ }
+ spin_unlock(&inode->lock);
/*
* If no log tree was created for this root in this transaction, then
--
2.47.2
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before
2025-08-06 11:11 [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before fdmanana
` (2 preceding siblings ...)
2025-08-06 11:11 ` [PATCH 3/3] btrfs: avoid load/store tearing races when checking if an inode was logged fdmanana
@ 2025-08-08 21:07 ` Boris Burkov
3 siblings, 0 replies; 5+ messages in thread
From: Boris Burkov @ 2025-08-08 21:07 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Wed, Aug 06, 2025 at 12:11:29PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The first patch is actually a new version of a patch sent out before [1],
> because holding the inode's log_mutex at inode_logged() can make lockdep
> unhappy in situations like when logging conflicting inodes, where we are
> already holding the log_mutex of some other inode and could potentially
> result in a ABBA deadlock.
>
> The second patch splits part of what the initial version of patch 1 fixed,
> but with a different solution that makes the management of an inode's
> last_dir_index_offset field simpler.
>
> Patch 3 is completely new.
>
> Details in the change logs.
New version looks good to me. I reproduced the lockdep failure on the
old version and didn't reproduce it in 100 runs on this one.
Reviewed-by: Boris Burkov <boris@bur.io>
>
> [1] - https://lore.kernel.org/linux-btrfs/7585d15c0e9c163d5cdf32307014a4e792e62541.1753807163.git.fdmanana@suse.com/
>
> Filipe Manana (3):
> btrfs: fix race between logging inode and checking if it was logged before
> btrfs: fix race between setting last_dir_index_offset and inode logging
> btrfs: avoid load/store tearing races when checking if an inode was logged
>
> fs/btrfs/btrfs_inode.h | 2 +-
> fs/btrfs/inode.c | 1 +
> fs/btrfs/tree-log.c | 78 ++++++++++++++++++++++++++++--------------
> 3 files changed, 55 insertions(+), 26 deletions(-)
>
> --
> 2.47.2
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-08-08 21:06 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-06 11:11 [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before fdmanana
2025-08-06 11:11 ` [PATCH 1/3] btrfs: fix race between logging inode and checking if it " fdmanana
2025-08-06 11:11 ` [PATCH 2/3] btrfs: fix race between setting last_dir_index_offset and inode logging fdmanana
2025-08-06 11:11 ` [PATCH 3/3] btrfs: avoid load/store tearing races when checking if an inode was logged fdmanana
2025-08-08 21:07 ` [PATCH 0/3] btrfs: fixes for races when checking if an inode was logged before Boris Burkov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).