* [PATCH 0/2] btrfs: fix log replay not persisting a 0 i_size
@ 2026-02-17 18:27 fdmanana
2026-02-17 18:27 ` [PATCH 1/2] btrfs: fix zero size inode with non-zero size after log replay fdmanana
2026-02-17 18:27 ` [PATCH 2/2] btrfs: pass a btrfs inode to tree-log.c:fill_inode_item() fdmanana
0 siblings, 2 replies; 3+ messages in thread
From: fdmanana @ 2026-02-17 18:27 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Fix an issue where if we fsync a file that got its size truncated to 0, we
may end up not getting an i_size of 0 after log replay if new names for the
file were also logged. Details in the change log of the first patch.
Filipe Manana (2):
btrfs: fix zero size inode with non-zero size after log replay
btrfs: pass a btrfs inode to tree-log.c:fill_inode_item()
fs/btrfs/tree-log.c | 140 +++++++++++++++++++++++++++-----------------
1 file changed, 85 insertions(+), 55 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH 1/2] btrfs: fix zero size inode with non-zero size after log replay
2026-02-17 18:27 [PATCH 0/2] btrfs: fix log replay not persisting a 0 i_size fdmanana
@ 2026-02-17 18:27 ` fdmanana
2026-02-17 18:27 ` [PATCH 2/2] btrfs: pass a btrfs inode to tree-log.c:fill_inode_item() fdmanana
1 sibling, 0 replies; 3+ messages in thread
From: fdmanana @ 2026-02-17 18:27 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging that an inode exists, as part of logging a new name or
logging new dir entries for a directory, we always set the generation of
the logged inode item to 0. This is to signal during log replay (in
overwrite_item()), that we should not set the i_size since we only logged
that an inode exists, so the i_size of the inode in the subvolume tree
must be preserved (as when we log new names or that an inode exists, we
don't log extents).
This works fine except when we have already logged an inode in full mode
or it's the first time we are logging an inode created in a past
transaction, that inode has a new i_size of 0 and then we log a new name
for the inode (due to a new hardlink or a rename), in which case we log
an i_size of 0 for the inode and a generation of 0, which causes the log
replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when
we first log the inode, when we fsync file foo, we log its current i_size
of 0, and then when we create a hard link we log again the inode in exists
mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we
add to the log tree, so during log replay overwrite_item() sees that the
generation is 0 and i_size is 0 so we skip updating the inode's i_size
from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real
generation of the inode if it was logged in the current transaction with
the i_size we logged before. Also if an inode created in a previous
transaction is logged in exists mode only, make sure we log the i_size
stored in the inode item located from the commit root, so that if we log
multiple times that the inode exists we get the correct i_size.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 98 ++++++++++++++++++++++++++++++---------------
1 file changed, 65 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 603ac3794b9c..504b2cb84857 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4616,21 +4616,32 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
struct inode *inode, bool log_inode_only,
u64 logged_isize)
{
+ u64 gen = BTRFS_I(inode)->generation;
u64 flags;
if (log_inode_only) {
- /* set the generation to zero so the recover code
- * can tell the difference between an logging
- * just to say 'this inode exists' and a logging
- * to say 'update this inode with these values'
+ /*
+ * Set the generation to zero so the recover code can tell the
+ * difference between a logging just to say 'this inode exists'
+ * and a logging to say 'update this inode with these values'.
+ * But only if the inode was not already logged before.
+ * We access ->logged_trans directly since it was already set
+ * up in the call chain by btrfs_log_inode(), and data_race()
+ * to avoid false alerts from KCSAN and since it was set already
+ * and one can set it to 0 since that only happens on eviction
+ * and we are holding a ref on the inode.
*/
- btrfs_set_inode_generation(leaf, item, 0);
+ ASSERT(data_race(BTRFS_I(inode)->logged_trans) > 0);
+ if (data_race(BTRFS_I(inode)->logged_trans) < trans->transid)
+ gen = 0;
+
btrfs_set_inode_size(leaf, item, logged_isize);
} else {
- btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
btrfs_set_inode_size(leaf, item, inode->i_size);
}
+ btrfs_set_inode_generation(leaf, item, gen);
+
btrfs_set_inode_uid(leaf, item, i_uid_read(inode));
btrfs_set_inode_gid(leaf, item, i_gid_read(inode));
btrfs_set_inode_mode(leaf, item, inode->i_mode);
@@ -5448,42 +5459,63 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
return 0;
}
-static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode,
- struct btrfs_path *path, u64 *size_ret)
+static int get_inode_size_to_log(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode,
+ struct btrfs_path *path, u64 *size_ret)
{
struct btrfs_key key;
+ struct btrfs_inode_item *item;
int ret;
key.objectid = btrfs_ino(inode);
key.type = BTRFS_INODE_ITEM_KEY;
key.offset = 0;
- ret = btrfs_search_slot(NULL, log, &key, path, 0, 0);
- if (ret < 0) {
- return ret;
- } else if (ret > 0) {
- *size_ret = 0;
- } else {
- struct btrfs_inode_item *item;
+ /*
+ * Our caller called inode_logged(), so logged_trans is up to date.
+ * Use data_race() to silence any warning from KCSAN. Once logged_trans
+ * is set, it can only be reset to 0 after inode eviction.
+ */
+ if (data_race(inode->logged_trans) == trans->transid) {
+ ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0);
+ } else if (inode->generation < trans->transid) {
+ path->search_commit_root = true;
+ path->skip_locking = true;
+ ret = btrfs_search_slot(NULL, inode->root, &key, path, 0, 0);
+ path->search_commit_root = false;
+ path->skip_locking = false;
- item = btrfs_item_ptr(path->nodes[0], path->slots[0],
- struct btrfs_inode_item);
- *size_ret = btrfs_inode_size(path->nodes[0], item);
- /*
- * If the in-memory inode's i_size is smaller then the inode
- * size stored in the btree, return the inode's i_size, so
- * that we get a correct inode size after replaying the log
- * when before a power failure we had a shrinking truncate
- * followed by addition of a new name (rename / new hard link).
- * Otherwise return the inode size from the btree, to avoid
- * data loss when replaying a log due to previously doing a
- * write that expands the inode's size and logging a new name
- * immediately after.
- */
- if (*size_ret > inode->vfs_inode.i_size)
- *size_ret = inode->vfs_inode.i_size;
+ } else {
+ *size_ret = 0;
+ return 0;
}
+ /*
+ * If the inode was logged before or is from a past transaction, then
+ * its inode item must exist in the log root or in the commit root.
+ */
+ ASSERT(ret <= 0);
+ if (WARN_ON_ONCE(ret > 0))
+ ret = -ENOENT;
+
+ if (ret < 0)
+ return ret;
+
+ item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+ struct btrfs_inode_item);
+ *size_ret = btrfs_inode_size(path->nodes[0], item);
+ /*
+ * If the in-memory inode's i_size is smaller then the inode size stored
+ * in the btree, return the inode's i_size, so that we get a correct
+ * inode size after replaying the log when before a power failure we had
+ * a shrinking truncate followed by addition of a new name (rename / new
+ * hard link). Otherwise return the inode size from the btree, to avoid
+ * data loss when replaying a log due to previously doing a write that
+ * expands the inode's size and logging a new name immediately after.
+ */
+ if (*size_ret > inode->vfs_inode.i_size)
+ *size_ret = inode->vfs_inode.i_size;
+
btrfs_release_path(path);
return 0;
}
@@ -6990,7 +7022,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
ret = drop_inode_items(trans, log, path, inode,
BTRFS_XATTR_ITEM_KEY);
} else {
- if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) {
+ if (inode_only == LOG_INODE_EXISTS) {
/*
* Make sure the new inode item we write to the log has
* the same isize as the current one (if it exists).
@@ -7004,7 +7036,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
* (zeroes), as if an expanding truncate happened,
* instead of getting a file of 4Kb only.
*/
- ret = logged_inode_size(log, inode, path, &logged_isize);
+ ret = get_inode_size_to_log(trans, inode, path, &logged_isize);
if (ret)
goto out_unlock;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 3+ messages in thread
* [PATCH 2/2] btrfs: pass a btrfs inode to tree-log.c:fill_inode_item()
2026-02-17 18:27 [PATCH 0/2] btrfs: fix log replay not persisting a 0 i_size fdmanana
2026-02-17 18:27 ` [PATCH 1/2] btrfs: fix zero size inode with non-zero size after log replay fdmanana
@ 2026-02-17 18:27 ` fdmanana
1 sibling, 0 replies; 3+ messages in thread
From: fdmanana @ 2026-02-17 18:27 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
All internal functions should be given a btrfs_inode for consistency and
not a vfs inode. So pass a btrfs_inode instead of a vfs inode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 48 ++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 504b2cb84857..6f75857b0196 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4613,10 +4613,11 @@ static int truncate_inode_items(struct btrfs_trans_handle *trans,
static void fill_inode_item(struct btrfs_trans_handle *trans,
struct extent_buffer *leaf,
struct btrfs_inode_item *item,
- struct inode *inode, bool log_inode_only,
+ struct btrfs_inode *inode, bool log_inode_only,
u64 logged_isize)
{
- u64 gen = BTRFS_I(inode)->generation;
+ struct inode *vfs_inode = &inode->vfs_inode;
+ u64 gen = inode->generation;
u64 flags;
if (log_inode_only) {
@@ -4631,33 +4632,33 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
* and one can set it to 0 since that only happens on eviction
* and we are holding a ref on the inode.
*/
- ASSERT(data_race(BTRFS_I(inode)->logged_trans) > 0);
- if (data_race(BTRFS_I(inode)->logged_trans) < trans->transid)
+ ASSERT(data_race(inode->logged_trans) > 0);
+ if (data_race(inode->logged_trans) < trans->transid)
gen = 0;
btrfs_set_inode_size(leaf, item, logged_isize);
} else {
- btrfs_set_inode_size(leaf, item, inode->i_size);
+ btrfs_set_inode_size(leaf, item, vfs_inode->i_size);
}
btrfs_set_inode_generation(leaf, item, gen);
- btrfs_set_inode_uid(leaf, item, i_uid_read(inode));
- btrfs_set_inode_gid(leaf, item, i_gid_read(inode));
- btrfs_set_inode_mode(leaf, item, inode->i_mode);
- btrfs_set_inode_nlink(leaf, item, inode->i_nlink);
+ btrfs_set_inode_uid(leaf, item, i_uid_read(vfs_inode));
+ btrfs_set_inode_gid(leaf, item, i_gid_read(vfs_inode));
+ btrfs_set_inode_mode(leaf, item, vfs_inode->i_mode);
+ btrfs_set_inode_nlink(leaf, item, vfs_inode->i_nlink);
- btrfs_set_timespec_sec(leaf, &item->atime, inode_get_atime_sec(inode));
- btrfs_set_timespec_nsec(leaf, &item->atime, inode_get_atime_nsec(inode));
+ btrfs_set_timespec_sec(leaf, &item->atime, inode_get_atime_sec(vfs_inode));
+ btrfs_set_timespec_nsec(leaf, &item->atime, inode_get_atime_nsec(vfs_inode));
- btrfs_set_timespec_sec(leaf, &item->mtime, inode_get_mtime_sec(inode));
- btrfs_set_timespec_nsec(leaf, &item->mtime, inode_get_mtime_nsec(inode));
+ btrfs_set_timespec_sec(leaf, &item->mtime, inode_get_mtime_sec(vfs_inode));
+ btrfs_set_timespec_nsec(leaf, &item->mtime, inode_get_mtime_nsec(vfs_inode));
- btrfs_set_timespec_sec(leaf, &item->ctime, inode_get_ctime_sec(inode));
- btrfs_set_timespec_nsec(leaf, &item->ctime, inode_get_ctime_nsec(inode));
+ btrfs_set_timespec_sec(leaf, &item->ctime, inode_get_ctime_sec(vfs_inode));
+ btrfs_set_timespec_nsec(leaf, &item->ctime, inode_get_ctime_nsec(vfs_inode));
- btrfs_set_timespec_sec(leaf, &item->otime, BTRFS_I(inode)->i_otime_sec);
- btrfs_set_timespec_nsec(leaf, &item->otime, BTRFS_I(inode)->i_otime_nsec);
+ btrfs_set_timespec_sec(leaf, &item->otime, inode->i_otime_sec);
+ btrfs_set_timespec_nsec(leaf, &item->otime, inode->i_otime_nsec);
/*
* We do not need to set the nbytes field, in fact during a fast fsync
@@ -4668,11 +4669,10 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
* inode item in subvolume tree as needed (see overwrite_item()).
*/
- btrfs_set_inode_sequence(leaf, item, inode_peek_iversion(inode));
+ btrfs_set_inode_sequence(leaf, item, inode_peek_iversion(vfs_inode));
btrfs_set_inode_transid(leaf, item, trans->transid);
- btrfs_set_inode_rdev(leaf, item, inode->i_rdev);
- flags = btrfs_inode_combine_flags(BTRFS_I(inode)->flags,
- BTRFS_I(inode)->ro_flags);
+ btrfs_set_inode_rdev(leaf, item, vfs_inode->i_rdev);
+ flags = btrfs_inode_combine_flags(inode->flags, inode->ro_flags);
btrfs_set_inode_flags(leaf, item, flags);
btrfs_set_inode_block_group(leaf, item, 0);
}
@@ -4719,8 +4719,7 @@ static int log_inode_item(struct btrfs_trans_handle *trans,
return ret;
inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
struct btrfs_inode_item);
- fill_inode_item(trans, path->nodes[0], inode_item, &inode->vfs_inode,
- false, 0);
+ fill_inode_item(trans, path->nodes[0], inode_item, inode, false, 0);
btrfs_release_path(path);
return 0;
}
@@ -4989,8 +4988,7 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
inode_item = btrfs_item_ptr(dst_path->nodes[0], dst_slot,
struct btrfs_inode_item);
fill_inode_item(trans, dst_path->nodes[0], inode_item,
- &inode->vfs_inode,
- inode_only == LOG_INODE_EXISTS,
+ inode, inode_only == LOG_INODE_EXISTS,
logged_isize);
} else {
copy_extent_buffer(dst_path->nodes[0], src, dst_offset,
--
2.47.2
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-02-17 18:27 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-17 18:27 [PATCH 0/2] btrfs: fix log replay not persisting a 0 i_size fdmanana
2026-02-17 18:27 ` [PATCH 1/2] btrfs: fix zero size inode with non-zero size after log replay fdmanana
2026-02-17 18:27 ` [PATCH 2/2] btrfs: pass a btrfs inode to tree-log.c:fill_inode_item() fdmanana
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox