* [PATCH v2 01/10] affs: Drop support for metadata bh tracking
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 10:06 ` David Sterba
2026-05-25 8:58 ` [PATCH v2 02/10] ext4: Allocate mapping_metadata_bhs struct on demand Jan Kara
` (9 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara, David Sterba
AFFS did all the hard work of tracking metadata bhs dirtied for an inode
but it never actually used this information, as affs_file_fsync() just
calls sync_blockdev() to write back all filesystem metadata bhs. Per a
discussion with the AFFS maintainer, nobody cares about AFFS performance,
so let's keep this affs_file_fsync() behavior and just drop all the
pointless tracking from AFFS.
CC: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/affs/affs.h | 1 -
fs/affs/amigaffs.c | 12 ++++++------
fs/affs/file.c | 25 +++++++++++--------------
fs/affs/inode.c | 13 +++++--------
fs/affs/namei.c | 9 ++++-----
fs/affs/super.c | 1 -
6 files changed, 26 insertions(+), 35 deletions(-)
diff --git a/fs/affs/affs.h b/fs/affs/affs.h
index a0caf6ace860..406a0ef63e7b 100644
--- a/fs/affs/affs.h
+++ b/fs/affs/affs.h
@@ -44,7 +44,6 @@ struct affs_inode_info {
struct mutex i_link_lock; /* Protects internal inode access. */
struct mutex i_ext_lock; /* Protects internal inode access. */
#define i_hash_lock i_ext_lock
- struct mapping_metadata_bhs i_metadata_bhs;
u32 i_blkcnt; /* block count */
u32 i_extcnt; /* extended block count */
u32 *i_lc; /* linear cache of extended blocks */
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index bed4fc805e8e..6cc0fc9a4cbf 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -57,7 +57,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
AFFS_TAIL(sb, dir_bh)->hash_chain = cpu_to_be32(ino);
affs_adjust_checksum(dir_bh, ino);
- mmb_mark_buffer_dirty(dir_bh, &AFFS_I(dir)->i_metadata_bhs);
+ mark_buffer_dirty(dir_bh);
affs_brelse(dir_bh);
inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
@@ -100,7 +100,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh)
else
AFFS_TAIL(sb, bh)->hash_chain = ino;
affs_adjust_checksum(bh, be32_to_cpu(ino) - hash_ino);
- mmb_mark_buffer_dirty(bh, &AFFS_I(dir)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
AFFS_TAIL(sb, rem_bh)->parent = 0;
retval = 0;
break;
@@ -180,7 +180,7 @@ affs_remove_link(struct dentry *dentry)
affs_unlock_dir(dir);
goto done;
}
- mmb_mark_buffer_dirty(link_bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(link_bh);
memcpy(AFFS_TAIL(sb, bh)->name, AFFS_TAIL(sb, link_bh)->name, 32);
retval = affs_insert_hash(dir, bh);
@@ -188,7 +188,7 @@ affs_remove_link(struct dentry *dentry)
affs_unlock_dir(dir);
goto done;
}
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_unlock_dir(dir);
iput(dir);
@@ -203,7 +203,7 @@ affs_remove_link(struct dentry *dentry)
__be32 ino2 = AFFS_TAIL(sb, link_bh)->link_chain;
AFFS_TAIL(sb, bh)->link_chain = ino2;
affs_adjust_checksum(bh, be32_to_cpu(ino2) - link_ino);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
retval = 0;
/* Fix the link count, if bh is a normal header block without links */
switch (be32_to_cpu(AFFS_TAIL(sb, bh)->stype)) {
@@ -306,7 +306,7 @@ affs_remove_header(struct dentry *dentry)
retval = affs_remove_hash(dir, bh);
if (retval)
goto done_unlock;
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_unlock_dir(dir);
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 144b17482d12..23e088a7ed4f 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -140,14 +140,14 @@ affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext)
AFFS_TAIL(sb, new_bh)->parent = cpu_to_be32(inode->i_ino);
affs_fix_checksum(sb, new_bh);
- mmb_mark_buffer_dirty(new_bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(new_bh);
tmp = be32_to_cpu(AFFS_TAIL(sb, bh)->extension);
if (tmp)
affs_warning(sb, "alloc_ext", "previous extension set (%x)", tmp);
AFFS_TAIL(sb, bh)->extension = cpu_to_be32(blocknr);
affs_adjust_checksum(bh, blocknr - tmp);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
AFFS_I(inode)->i_extcnt++;
mark_inode_dirty(inode);
@@ -581,7 +581,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
memset(AFFS_DATA(bh) + boff, 0, tmp);
be32_add_cpu(&AFFS_DATA_HEAD(bh)->size, tmp);
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
size += tmp;
bidx++;
} else if (bidx) {
@@ -603,7 +603,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
AFFS_DATA_HEAD(bh)->size = cpu_to_be32(tmp);
affs_fix_checksum(sb, bh);
bh->b_state &= ~(1UL << BH_New);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
if (prev_bh) {
u32 tmp_next = be32_to_cpu(AFFS_DATA_HEAD(prev_bh)->next);
@@ -613,8 +613,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
bidx, tmp_next);
AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
- mmb_mark_buffer_dirty(prev_bh,
- &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(prev_bh);
affs_brelse(prev_bh);
}
size += bsize;
@@ -733,7 +732,7 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
AFFS_DATA_HEAD(bh)->size = cpu_to_be32(
max(boff + tmp, be32_to_cpu(AFFS_DATA_HEAD(bh)->size)));
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
written += tmp;
from += tmp;
bidx++;
@@ -766,13 +765,12 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
bidx, tmp_next);
AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
- mmb_mark_buffer_dirty(prev_bh,
- &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(prev_bh);
}
}
affs_brelse(prev_bh);
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
written += bsize;
from += bsize;
bidx++;
@@ -801,14 +799,13 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
bidx, tmp_next);
AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
- mmb_mark_buffer_dirty(prev_bh,
- &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(prev_bh);
}
} else if (be32_to_cpu(AFFS_DATA_HEAD(bh)->size) < tmp)
AFFS_DATA_HEAD(bh)->size = cpu_to_be32(tmp);
affs_brelse(prev_bh);
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
written += tmp;
from += tmp;
bidx++;
@@ -945,7 +942,7 @@ affs_truncate(struct inode *inode)
}
AFFS_TAIL(sb, ext_bh)->extension = 0;
affs_fix_checksum(sb, ext_bh);
- mmb_mark_buffer_dirty(ext_bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(ext_bh);
affs_brelse(ext_bh);
if (inode->i_size) {
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 5dd1b016bcb0..d4a3f381c4bc 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -206,7 +206,7 @@ affs_write_inode(struct inode *inode, struct writeback_control *wbc)
}
}
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_brelse(bh);
affs_free_prealloc(inode);
return 0;
@@ -266,11 +266,8 @@ affs_evict_inode(struct inode *inode)
if (!inode->i_nlink) {
inode->i_size = 0;
affs_truncate(inode);
- } else {
- mmb_sync(&AFFS_I(inode)->i_metadata_bhs);
}
- mmb_invalidate(&AFFS_I(inode)->i_metadata_bhs);
clear_inode(inode);
affs_free_prealloc(inode);
cache_page = (unsigned long)AFFS_I(inode)->i_lc;
@@ -305,7 +302,7 @@ affs_new_inode(struct inode *dir)
bh = affs_getzeroblk(sb, block);
if (!bh)
goto err_bh;
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_brelse(bh);
inode->i_uid = current_fsuid();
@@ -393,17 +390,17 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
AFFS_TAIL(sb, bh)->link_chain = chain;
AFFS_TAIL(sb, inode_bh)->link_chain = cpu_to_be32(block);
affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
- mmb_mark_buffer_dirty(inode_bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(inode_bh);
set_nlink(inode, 2);
ihold(inode);
}
affs_fix_checksum(sb, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
dentry->d_fsdata = (void *)(long)bh->b_blocknr;
affs_lock_dir(dir);
retval = affs_insert_hash(dir, bh);
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_unlock_dir(dir);
affs_unlock_link(inode);
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index c3c6532da4b0..57d8d755aada 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -373,7 +373,7 @@ affs_symlink(struct mnt_idmap *idmap, struct inode *dir,
}
*p = 0;
inode->i_size = i + 1;
- mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_brelse(bh);
mark_inode_dirty(inode);
@@ -443,8 +443,7 @@ affs_rename(struct inode *old_dir, struct dentry *old_dentry,
/* TODO: move it back to old_dir, if error? */
done:
- mmb_mark_buffer_dirty(bh,
- &AFFS_I(retval ? old_dir : new_dir)->i_metadata_bhs);
+ mark_buffer_dirty(bh);
affs_brelse(bh);
return retval;
}
@@ -497,8 +496,8 @@ affs_xrename(struct inode *old_dir, struct dentry *old_dentry,
retval = affs_insert_hash(old_dir, bh_new);
affs_unlock_dir(old_dir);
done:
- mmb_mark_buffer_dirty(bh_old, &AFFS_I(new_dir)->i_metadata_bhs);
- mmb_mark_buffer_dirty(bh_new, &AFFS_I(old_dir)->i_metadata_bhs);
+ mark_buffer_dirty(bh_old);
+ mark_buffer_dirty(bh_new);
affs_brelse(bh_old);
affs_brelse(bh_new);
return retval;
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 079f36e1ddec..8451647f3fea 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -108,7 +108,6 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
- mmb_init(&i->i_metadata_bhs, &i->vfs_inode.i_data);
return &i->vfs_inode;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 01/10] affs: Drop support for metadata bh tracking
2026-05-25 8:58 ` [PATCH v2 01/10] affs: Drop support for metadata bh tracking Jan Kara
@ 2026-05-25 10:06 ` David Sterba
0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2026-05-25 10:06 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel, Christian Brauner, aivazian.tigran, Ted Tso,
linux-ext4, OGAWA Hirofumi, David Sterba
On Mon, May 25, 2026 at 10:58:07AM +0200, Jan Kara wrote:
> AFFS did all the hard work of tracking metadata bhs dirtied for an inode
> but it never actually used this information, as affs_file_fsync() just
> calls sync_blockdev() to write back all filesystem metadata bhs. Per a
> discussion with the AFFS maintainer, nobody cares about AFFS performance,
> so let's keep this affs_file_fsync() behavior and just drop all the
> pointless tracking from AFFS.
>
> CC: David Sterba <dsterba@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 02/10] ext4: Allocate mapping_metadata_bhs struct on demand
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
2026-05-25 8:58 ` [PATCH v2 01/10] affs: Drop support for metadata bh tracking Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-06-03 13:41 ` Theodore Tso
2026-05-25 8:58 ` [PATCH v2 03/10] fs: Writeout inode buffer from mmb_sync() Jan Kara
` (8 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Currently every ext4 inode gets a mapping_metadata_bhs struct although it
is only needed when running without a journal, and then only for inodes
that have dirtied some metadata. Allocate the mapping_metadata_bhs struct
on demand when dirtying the first metadata buffer for the inode.
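For illustration only (this is not part of the patch and not kernel
code), the on-demand allocation described above can be sketched in
userspace roughly as follows. The sketch assumes C11 atomics in place of
the kernel's cmpxchg(); struct tracker, struct inode_info and
inode_get_tracker() are hypothetical stand-ins for mapping_metadata_bhs,
ext4_inode_info and ext4_inode_attach_mmb(). Unlike the patch, which
allocates with __GFP_NOFAIL, the sketch may return NULL when allocation
fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mapping_metadata_bhs. */
struct tracker {
	int nr_buffers;
};

/* Hypothetical stand-in for the per-inode info with a lazily set pointer. */
struct inode_info {
	_Atomic(struct tracker *) tracker;
};

/*
 * Return the inode's tracker, allocating it on first use. If two callers
 * race, only one compare-and-swap succeeds; the loser frees its own
 * allocation and returns the winner's pointer.
 */
static struct tracker *inode_get_tracker(struct inode_info *ii)
{
	struct tracker *fresh, *expected = NULL;
	struct tracker *cur;

	assert(ii != NULL);
	cur = atomic_load(&ii->tracker);
	if (cur)
		return cur;

	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return NULL;	/* assumption: the kernel side cannot fail here */

	/* Publish only if the pointer is still NULL. */
	if (atomic_compare_exchange_strong(&ii->tracker, &expected, fresh))
		return fresh;

	/* Someone else published first; 'expected' now holds their pointer. */
	free(fresh);
	return expected;
}
```

Readers that only want to look at the tracker, like the fsync paths in
the patch, load the pointer once and skip the work when it is still NULL.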
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ext4.h | 2 +-
fs/ext4/ext4_jbd2.c | 24 +++++++++++++++++++++---
fs/ext4/fsync.c | 12 ++++++++----
fs/ext4/inode.c | 9 +++++----
fs/ext4/super.c | 7 ++++---
5 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..6bb29a20420f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1117,7 +1117,7 @@ struct ext4_inode_info {
struct rw_semaphore i_data_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;
- struct mapping_metadata_bhs i_metadata_bhs;
+ struct mapping_metadata_bhs *i_metadata_bhs;
/*
* File creation time. Its function is same as that of
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 9a8c225f2753..752326f3b653 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -350,6 +350,21 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
return 0;
}
+static void ext4_inode_attach_mmb(struct inode *inode)
+{
+ struct mapping_metadata_bhs *mmb;
+
+ /*
+ * It's difficult to handle failure when marking buffer dirty without
+ * leaving filesystem corrupted
+ */
+ mmb = kmalloc_obj(*mmb, GFP_NOFS | __GFP_NOFAIL);
+ mmb_init(mmb, &inode->i_data);
+ /* Someone swapped another mmb before us? */
+ if (cmpxchg(&EXT4_I(inode)->i_metadata_bhs, NULL, mmb))
+ kfree(mmb);
+}
+
int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
struct buffer_head *bh)
@@ -389,11 +404,14 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
err);
}
} else {
- if (inode)
+ if (inode) {
+ if (!EXT4_I(inode)->i_metadata_bhs)
+ ext4_inode_attach_mmb(inode);
mmb_mark_buffer_dirty(bh,
- &EXT4_I(inode)->i_metadata_bhs);
- else
+ EXT4_I(inode)->i_metadata_bhs);
+ } else {
mark_buffer_dirty(bh);
+ }
if (inode && inode_needs_sync(inode)) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 924726dcc85f..c104f55a0242 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -46,6 +46,7 @@
static int ext4_sync_parent(struct inode *inode)
{
struct dentry *dentry, *next;
+ struct mapping_metadata_bhs *mmb;
int ret = 0;
if (!ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY))
@@ -68,9 +69,12 @@ static int ext4_sync_parent(struct inode *inode)
* through ext4_evict_inode()) and so we are safe to flush
* metadata blocks and the inode.
*/
- ret = mmb_sync(&EXT4_I(inode)->i_metadata_bhs);
- if (ret)
- break;
+ mmb = READ_ONCE(EXT4_I(inode)->i_metadata_bhs);
+ if (mmb) {
+ ret = mmb_sync(mmb);
+ if (ret)
+ break;
+ }
ret = sync_inode_metadata(inode, 1);
if (ret)
break;
@@ -89,7 +93,7 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
};
int ret;
- ret = mmb_fsync_noflush(file, &EXT4_I(inode)->i_metadata_bhs,
+ ret = mmb_fsync_noflush(file, READ_ONCE(EXT4_I(inode)->i_metadata_bhs),
start, end, datasync);
if (ret)
return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..3e66e9510909 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -195,9 +195,8 @@ void ext4_evict_inode(struct inode *inode)
ext4_warning_inode(inode, "data will be lost");
truncate_inode_pages_final(&inode->i_data);
- /* Avoid mballoc special inode which has no proper iops */
- if (!EXT4_SB(inode->i_sb)->s_journal)
- mmb_sync(&EXT4_I(inode)->i_metadata_bhs);
+ if (EXT4_I(inode)->i_metadata_bhs)
+ mmb_sync(EXT4_I(inode)->i_metadata_bhs);
goto no_delete;
}
@@ -3451,6 +3450,7 @@ static bool ext4_release_folio(struct folio *folio, gfp_t wait)
static bool ext4_inode_datasync_dirty(struct inode *inode)
{
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct mapping_metadata_bhs *mmb;
if (journal) {
if (jbd2_transaction_committed(journal,
@@ -3461,8 +3461,9 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return true;
}
+ mmb = READ_ONCE(EXT4_I(inode)->i_metadata_bhs);
/* Any metadata buffers to write? */
- if (mmb_has_buffers(&EXT4_I(inode)->i_metadata_bhs))
+ if (mmb && mmb_has_buffers(mmb))
return true;
return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..7fc2cff708cc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1430,7 +1430,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
ext4_fc_init_inode(&ei->vfs_inode);
spin_lock_init(&ei->i_fc_lock);
- mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+ ei->i_metadata_bhs = NULL;
return &ei->vfs_inode;
}
@@ -1448,6 +1448,7 @@ static int ext4_drop_inode(struct inode *inode)
static void ext4_free_in_core_inode(struct inode *inode)
{
fscrypt_free_inode(inode);
+ kfree(EXT4_I(inode)->i_metadata_bhs);
if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
pr_warn("%s: inode %llu still in fc list",
__func__, inode->i_ino);
@@ -1527,8 +1528,8 @@ static void destroy_inodecache(void)
void ext4_clear_inode(struct inode *inode)
{
ext4_fc_del(inode);
- if (!EXT4_SB(inode->i_sb)->s_journal)
- mmb_invalidate(&EXT4_I(inode)->i_metadata_bhs);
+ if (EXT4_I(inode)->i_metadata_bhs)
+ mmb_invalidate(EXT4_I(inode)->i_metadata_bhs);
clear_inode(inode);
ext4_discard_preallocations(inode);
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 03/10] fs: Writeout inode buffer from mmb_sync()
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
2026-05-25 8:58 ` [PATCH v2 01/10] affs: Drop support for metadata bh tracking Jan Kara
2026-05-25 8:58 ` [PATCH v2 02/10] ext4: Allocate mapping_metadata_bhs struct on demand Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 04/10] ext2: Fix possibly missing inode write on fsync(2) Jan Kara
` (7 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Currently metadata bh tracking does not track inode buffers because they
are usually shared by several inodes and so our linked list tracking
cannot be used. Instead, on fsync we call sync_inode_metadata() to write
the inode, and filesystems' .write_inode methods detect data integrity
writeback and take care to submit the inode buffer to disk and wait for
it in that case. This is however racy: for example, the flush worker can
submit normal (WB_SYNC_NONE) inode writeback first, which makes the
inode clean and copies the inode to the buffer but doesn't submit the
buffer for IO. The subsequent sync_inode_metadata() call then does
nothing and we fail to persist the inode buffer to disk on fsync(2).
Fix the problem by allowing the filesystem to record the number of the
block backing the inode in the mmb structure; mmb_sync() then takes care
to write out the corresponding buffer and wait for it.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/buffer.c | 64 +++++++++++++++++++++++++++++--------
include/linux/buffer_head.h | 14 ++++++++
include/linux/fs.h | 1 +
3 files changed, 66 insertions(+), 13 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496..f83fb3cdc6ac 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -477,12 +477,12 @@ EXPORT_SYMBOL(mark_buffer_async_write);
* using RCU, grab the lock, verify we didn't race with somebody detaching the
* bh / moving it to different inode and only then proceeding.
*/
-
void mmb_init(struct mapping_metadata_bhs *mmb, struct address_space *mapping)
{
spin_lock_init(&mmb->lock);
INIT_LIST_HEAD(&mmb->list);
mmb->mapping = mapping;
+ mmb->inode_blk = MMB_INVALID_BLK;
}
EXPORT_SYMBOL(mmb_init);
@@ -550,11 +550,13 @@ EXPORT_SYMBOL_GPL(mmb_has_buffers);
int mmb_sync(struct mapping_metadata_bhs *mmb)
{
struct buffer_head *bh;
+ sector_t inode_blk;
int err = 0;
struct blk_plug plug;
LIST_HEAD(tmp);
- if (!mmb_has_buffers(mmb))
+ if (!mmb_has_buffers(mmb) &&
+ data_race(mmb->inode_blk == MMB_INVALID_BLK))
return 0;
blk_start_plug(&plug);
@@ -593,8 +595,22 @@ int mmb_sync(struct mapping_metadata_bhs *mmb)
}
}
}
-
+ inode_blk = mmb->inode_blk;
+ mmb->inode_blk = MMB_INVALID_BLK;
spin_unlock(&mmb->lock);
+
+ /* Writeout inode buffer if it was set and wasn't written out yet */
+ if (inode_blk != MMB_INVALID_BLK) {
+ bh = sb_find_get_block(mmb->mapping->host->i_sb, inode_blk);
+ if (bh) {
+ write_dirty_buffer(bh, REQ_SYNC);
+ wait_on_buffer(bh);
+ if (!buffer_uptodate(bh))
+ err = -EIO;
+ brelse(bh);
+ }
+ }
+
blk_finish_plug(&plug);
spin_lock(&mmb->lock);
@@ -646,18 +662,18 @@ int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
if (err)
return err;
- if (mmb)
- ret = mmb_sync(mmb);
if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
- goto out;
+ goto sync_buffers;
if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
- goto out;
-
- err = sync_inode_metadata(inode, 1);
- if (ret == 0)
- ret = err;
-
-out:
+ goto sync_buffers;
+
+ ret = sync_inode_metadata(inode, 1);
+sync_buffers:
+ if (mmb) {
+ err = mmb_sync(mmb);
+ if (ret == 0)
+ ret = err;
+ }
/* check and advance again to catch errors after syncing out buffers */
err = file_check_and_advance_wb_err(file);
if (ret == 0)
@@ -733,6 +749,28 @@ void mmb_mark_buffer_dirty(struct buffer_head *bh,
}
EXPORT_SYMBOL(mmb_mark_buffer_dirty);
+/**
+ * mmb_mark_inode_buffer_dirty - Mark buffer containing inode as dirty and
+ * track it for fsync.
+ * @bh: The buffer containing the inode.
+ * @mmb: Mmb structure for metadata tracking.
+ *
+ * Mark the buffer containing inode as dirty and track the block number of
+ * the buffer containing the inode in mmb so that it gets written out from
+ * mmb_sync().
+ */
+void mmb_mark_inode_buffer_dirty(struct buffer_head *bh,
+ struct mapping_metadata_bhs *mmb)
+{
+ /* For simplicity we use mmb->lock to synchronize with mmb_sync() */
+ spin_lock(&mmb->lock);
+ mark_buffer_dirty(bh);
+ mmb->inode_blk = bh->b_blocknr;
+ spin_unlock(&mmb->lock);
+}
+EXPORT_SYMBOL(mmb_mark_inode_buffer_dirty);
+
+
/**
* block_dirty_folio - Mark a folio as dirty.
* @mapping: The address space containing this folio.
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index e4939e33b4b5..b77464359028 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -207,6 +207,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
/* Things to do with metadata buffers list */
void mmb_mark_buffer_dirty(struct buffer_head *bh, struct mapping_metadata_bhs *mmb);
+void mmb_mark_inode_buffer_dirty(struct buffer_head *bh,
+ struct mapping_metadata_bhs *mmb);
int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
loff_t start, loff_t end, bool datasync);
int mmb_fsync(struct file *file, struct mapping_metadata_bhs *mmb,
@@ -513,12 +515,24 @@ bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
#ifdef CONFIG_BUFFER_HEAD
+#define MMB_INVALID_BLK (~0ULL)
+
void buffer_init(void);
bool try_to_free_buffers(struct folio *folio);
void mmb_init(struct mapping_metadata_bhs *mmb, struct address_space *mapping);
bool mmb_has_buffers(struct mapping_metadata_bhs *mmb);
void mmb_invalidate(struct mapping_metadata_bhs *mmb);
int mmb_sync(struct mapping_metadata_bhs *mmb);
+static inline void mmb_clear_inode_blk(struct mapping_metadata_bhs *mmb)
+{
+ /*
+ * The lock is mostly pointless here but let's keep setting of
+ * inode_blk consistently under it.
+ */
+ spin_lock(&mmb->lock);
+ mmb->inode_blk = MMB_INVALID_BLK;
+ spin_unlock(&mmb->lock);
+}
void invalidate_bh_lrus(void);
void invalidate_bh_lrus_cpu(void);
bool has_bh_in_lru(int cpu, void *dummy);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..435a41e4c90f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -446,6 +446,7 @@ extern const struct address_space_operations empty_aops;
/* Structure for tracking metadata buffer heads associated with the mapping */
struct mapping_metadata_bhs {
struct address_space *mapping; /* Mapping bhs are associated with */
+ sector_t inode_blk; /* Number of block containing the inode */
spinlock_t lock; /* Lock protecting bh list */
struct list_head list; /* The list of bhs (b_assoc_buffers) */
};
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 04/10] ext2: Fix possibly missing inode write on fsync(2)
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (2 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 03/10] fs: Writeout inode buffer from mmb_sync() Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 05/10] udf: " Jan Kara
` (6 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use the mmb inode buffer writeout infrastructure to reliably write out
the inode's inode table block on fsync(2).
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext2/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 74aca5eb572d..d7a812b84b6b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1611,7 +1611,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
- mark_buffer_dirty(bh);
+ mmb_mark_inode_buffer_dirty(bh, &ei->i_metadata_bhs);
if (do_sync) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
@@ -1627,7 +1627,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
int ext2_write_inode(struct inode *inode, struct writeback_control *wbc)
{
- return __ext2_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+ return __ext2_write_inode(inode, 0);
}
int ext2_getattr(struct mnt_idmap *idmap, const struct path *path,
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 05/10] udf: Fix possibly missing inode write on fsync(2)
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (3 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 04/10] ext2: Fix possibly missing inode write on fsync(2) Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 06/10] fat: " Jan Kara
` (5 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use the mmb inode buffer writeout infrastructure to reliably write out
the inode's block on fsync(2).
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/udf/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 67bcf83758c8..861400152261 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1707,7 +1707,7 @@ void udf_update_extra_perms(struct inode *inode, umode_t mode)
int udf_write_inode(struct inode *inode, struct writeback_control *wbc)
{
- return udf_update_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+ return udf_update_inode(inode, 0);
}
static int udf_sync_inode(struct inode *inode)
@@ -1936,7 +1936,7 @@ static int udf_update_inode(struct inode *inode, int do_sync)
unlock_buffer(bh);
/* write the data blocks */
- mark_buffer_dirty(bh);
+ mmb_mark_inode_buffer_dirty(bh, &iinfo->i_metadata_bhs);
if (do_sync) {
sync_dirty_buffer(bh);
if (buffer_write_io_error(bh)) {
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 06/10] fat: Fix possibly missing inode write on fsync(2)
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (4 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 05/10] udf: " Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 07/10] minix: " Jan Kara
` (4 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use the mmb inode buffer writeout infrastructure to reliably write out
the inode's buffer on fsync(2).
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/fat/inode.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 28f78df086ef..1ffbfee1a9ad 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -431,6 +431,9 @@ EXPORT_SYMBOL_GPL(fat_attach);
void fat_detach(struct inode *inode)
{
struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb);
+
+ /* The block isn't associated with the inode anymore... */
+ mmb_clear_inode_blk(&MSDOS_I(inode)->i_metadata_bhs);
spin_lock(&sbi->inode_hash_lock);
MSDOS_I(inode)->i_pos = 0;
hlist_del_init(&MSDOS_I(inode)->i_fat_hash);
@@ -906,7 +909,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
&raw_entry->cdate, &raw_entry->ctime_cs);
}
spin_unlock(&sbi->inode_hash_lock);
- mark_buffer_dirty(bh);
+ mmb_mark_inode_buffer_dirty(bh, &MSDOS_I(inode)->i_metadata_bhs);
err = 0;
if (wait)
err = sync_dirty_buffer(bh);
@@ -925,7 +928,7 @@ static int fat_write_inode(struct inode *inode, struct writeback_control *wbc)
err = fat_clusters_flush(sb);
mutex_unlock(&MSDOS_SB(sb)->s_lock);
} else
- err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+ err = __fat_write_inode(inode, 0);
return err;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 07/10] minix: Fix possibly missing inode write on fsync(2)
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (5 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 06/10] fat: " Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 08/10] bfs: " Jan Kara
` (3 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use the mmb inode buffer writeout infrastructure to reliably write out
the inode's buffer on fsync(2).
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/minix/inode.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 9c6bac248907..401056fb682e 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -649,7 +649,7 @@ static struct buffer_head * V1_minix_update_inode(struct inode * inode)
raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
else for (i = 0; i < 9; i++)
raw_inode->i_zone[i] = minix_inode->u.i1_data[i];
- mark_buffer_dirty(bh);
+ mmb_mark_inode_buffer_dirty(bh, &minix_i(inode)->i_metadata_bhs);
return bh;
}
@@ -678,7 +678,7 @@ static struct buffer_head * V2_minix_update_inode(struct inode * inode)
raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
else for (i = 0; i < 10; i++)
raw_inode->i_zone[i] = minix_inode->u.i2_data[i];
- mark_buffer_dirty(bh);
+ mmb_mark_inode_buffer_dirty(bh, &minix_i(inode)->i_metadata_bhs);
return bh;
}
@@ -693,14 +693,6 @@ static int minix_write_inode(struct inode *inode, struct writeback_control *wbc)
bh = V2_minix_update_inode(inode);
if (!bh)
return -EIO;
- if (wbc->sync_mode == WB_SYNC_ALL && buffer_dirty(bh)) {
- sync_dirty_buffer(bh);
- if (buffer_req(bh) && !buffer_uptodate(bh)) {
- printk("IO error syncing minix inode [%s:%08llx]\n",
- inode->i_sb->s_id, inode->i_ino);
- err = -EIO;
- }
- }
brelse (bh);
return err;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 08/10] bfs: Fix possibly missing inode write on fsync(2)
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (6 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 07/10] minix: " Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-05-25 8:58 ` [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout Jan Kara
` (2 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use mmb inode buffer writeout infrastructure to reliably write out
inode's buffer on fsync(2).
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/bfs/inode.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 19e49c8cf750..2506795c3f2c 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -136,7 +136,6 @@ static int bfs_write_inode(struct inode *inode, struct writeback_control *wbc)
unsigned long i_sblock;
struct bfs_inode *di;
struct buffer_head *bh;
- int err = 0;
dprintf("ino=%08x\n", ino);
@@ -164,15 +163,10 @@ static int bfs_write_inode(struct inode *inode, struct writeback_control *wbc)
di->i_eblock = cpu_to_le32(BFS_I(inode)->i_eblock);
di->i_eoffset = cpu_to_le32(i_sblock * BFS_BSIZE + inode->i_size - 1);
- mark_buffer_dirty(bh);
- if (wbc->sync_mode == WB_SYNC_ALL) {
- sync_dirty_buffer(bh);
- if (buffer_req(bh) && !buffer_uptodate(bh))
- err = -EIO;
- }
+ mmb_mark_inode_buffer_dirty(bh, &BFS_I(inode)->i_metadata_bhs);
brelse(bh);
mutex_unlock(&info->bfs_lock);
- return err;
+ return 0;
}
static void bfs_evict_inode(struct inode *inode)
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (7 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 08/10] bfs: " Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-06-03 13:41 ` Theodore Tso
2026-05-25 8:58 ` [PATCH v2 10/10] fs: Fix missed inode writeback when racing with __writeback_single_inode Jan Kara
2026-06-02 7:22 ` [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
10 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Use mmb inode buffer writeout infrastructure to reliably write out
inode's inode table block on fsync(2) in nojournal mode (from
ext4_sync_parent() and ext4_fsync_nojournal()). This significantly
simplifies the code as we don't have to explicitly handle inode buffer
writeback in ext4_write_inode() and thus we can also remove
sync_inode_metadata() calls from ext4_sync_parent() and
ext4_write_inode() call from ext4_fsync_nojournal().
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ext4_jbd2.c | 2 +-
fs/ext4/ext4_jbd2.h | 2 ++
fs/ext4/fsync.c | 12 ------------
fs/ext4/inode.c | 33 ++++++++++-----------------------
4 files changed, 13 insertions(+), 36 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 752326f3b653..3d6b8f494a76 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -350,7 +350,7 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
return 0;
}
-static void ext4_inode_attach_mmb(struct inode *inode)
+void ext4_inode_attach_mmb(struct inode *inode)
{
struct mapping_metadata_bhs *mmb;
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..2a01b8279c88 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -122,6 +122,8 @@
#define EXT4_HT_EXT_CONVERT 11
#define EXT4_HT_MAX 12
+void ext4_inode_attach_mmb(struct inode *inode);
+
int
ext4_mark_iloc_dirty(handle_t *handle,
struct inode *inode,
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index c104f55a0242..1cb613478386 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,6 @@ static int ext4_sync_parent(struct inode *inode)
if (ret)
break;
}
- ret = sync_inode_metadata(inode, 1);
- if (ret)
- break;
}
dput(dentry);
return ret;
@@ -87,10 +84,6 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
int datasync, bool *needs_barrier)
{
struct inode *inode = file->f_inode;
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0,
- };
int ret;
ret = mmb_fsync_noflush(file, READ_ONCE(EXT4_I(inode)->i_metadata_bhs),
@@ -98,11 +91,6 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
if (ret)
return ret;
- /* Force writeout of inode table buffer to disk */
- ret = ext4_write_inode(inode, &wbc);
- if (ret)
- return ret;
-
ret = ext4_sync_parent(inode);
if (test_opt(inode->i_sb, BARRIER))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e66e9510909..651201849667 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5696,10 +5696,16 @@ static int ext4_do_update_inode(handle_t *handle,
ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
bh->b_data);
- BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bh);
- if (err)
- goto out_error;
+ if (!ext4_handle_valid(handle)) {
+ if (!ei->i_metadata_bhs)
+ ext4_inode_attach_mmb(inode);
+ mmb_mark_inode_buffer_dirty(iloc->bh, ei->i_metadata_bhs);
+ } else {
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, NULL, bh);
+ if (err)
+ goto out_error;
+ }
ext4_clear_inode_state(inode, EXT4_STATE_NEW);
if (set_large_file) {
BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access");
@@ -5786,24 +5792,6 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
EXT4_I(inode)->i_sync_tid);
- } else {
- struct ext4_iloc iloc;
-
- err = __ext4_get_inode_loc_noinmem(inode, &iloc);
- if (err)
- return err;
- /*
- * sync(2) will flush the whole buffer cache. No need to do
- * it here separately for each inode.
- */
- if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
- sync_dirty_buffer(iloc.bh);
- if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
- ext4_error_inode_block(inode, iloc.bh->b_blocknr, EIO,
- "IO error syncing inode");
- err = -EIO;
- }
- brelse(iloc.bh);
}
return err;
}
@@ -6348,7 +6336,6 @@ int ext4_mark_iloc_dirty(handle_t *handle,
/* the do_update_inode consumes one bh->b_count */
get_bh(iloc->bh);
-
/* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
err = ext4_do_update_inode(handle, inode, iloc);
put_bh(iloc->bh);
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout
2026-05-25 8:58 ` [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout Jan Kara
@ 2026-06-03 13:41 ` Theodore Tso
0 siblings, 0 replies; 15+ messages in thread
From: Theodore Tso @ 2026-06-03 13:41 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel, Christian Brauner, aivazian.tigran, linux-ext4,
OGAWA Hirofumi
On Mon, May 25, 2026 at 10:58:15AM +0200, Jan Kara wrote:
> Use mmb inode buffer writeout infrastructure to reliably write out
> inode's inode table block on fsync(2) in nojournal mode (from
> ext4_sync_parent() and ext4_fsync_nojournal()). This significantly
> simplifies the code as we don't have to explicitly handle inode buffer
> writeback in ext4_write_inode() and thus we can also remove
> sync_inode_metadata() calls from ext4_sync_parent() and
> ext4_write_inode() call from ext4_fsync_nojournal().
>
> Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 10/10] fs: Fix missed inode writeback when racing with __writeback_single_inode
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (8 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout Jan Kara
@ 2026-05-25 8:58 ` Jan Kara
2026-06-02 7:22 ` [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-05-25 8:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
When mmb_fsync_noflush() or simple_fsync_noflush() race with another
writeback of the same inode, they can see inode dirty bits are already
clear and skip inode writeback although the racing
__writeback_single_inode() didn't yet get to writing anything. This can
result in fsync(2) returning without properly persisting the inode.
We already have I_SYNC bit for this synchronization and
writeback_single_inode() properly uses it so just fix
mmb_fsync_noflush() and simple_fsync_noflush() to take it into account
as well.
Signed-off-by: Jan Kara <jack@suse.cz>
---
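The window described above can be sketched as a small userspace model. This
is only an illustration of the two checks in the diff below, not kernel code:
the M_* bit values and the helper names are made up for the example, and the
real inode state flags are defined differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real definitions differ. */
enum {
	M_DIRTY_SYNC     = 1u << 0,
	M_DIRTY_DATASYNC = 1u << 1,
	M_DIRTY_PAGES    = 1u << 2,
	M_SYNC           = 1u << 3,	/* writeback of this inode in progress */
	M_DIRTY_ALL      = M_DIRTY_SYNC | M_DIRTY_DATASYNC | M_DIRTY_PAGES,
	M_KNOWN          = M_DIRTY_ALL | M_SYNC,
};

/* Check before the patch: only the dirty bits decide whether to sync. */
static bool must_sync_old(unsigned int state, bool datasync)
{
	assert((state & ~M_KNOWN) == 0);
	if (!(state & M_DIRTY_ALL))
		return false;
	if (datasync && !(state & M_DIRTY_DATASYNC))
		return false;
	return true;
}

/* Check after the patch: an in-progress writeback also forces the sync. */
static bool must_sync_new(unsigned int state, bool datasync)
{
	assert((state & ~M_KNOWN) == 0);
	if (!(state & (M_DIRTY_ALL | M_SYNC)))
		return false;
	if (datasync && !(state & (M_DIRTY_DATASYNC | M_SYNC)))
		return false;
	return true;
}

/*
 * State seen by fsync in the race: the concurrent writeback has already
 * cleared the dirty bits but has not written anything yet.
 */
static unsigned int racing_writeback_state(void)
{
	return M_SYNC;
}
```

With the racing state, must_sync_old() returns false, so fsync would skip the
inode, while must_sync_new() returns true and the caller goes on to
sync_inode_metadata().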
fs/buffer.c | 5 +++--
fs/libfs.c | 5 +++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index f83fb3cdc6ac..01e17a287a42 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -662,9 +662,10 @@ int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
if (err)
return err;
- if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
+ if (!(inode_state_read_once(inode) & (I_DIRTY_ALL | I_SYNC)))
goto sync_buffers;
- if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
+ if (datasync &&
+ !(inode_state_read_once(inode) & (I_DIRTY_DATASYNC | I_SYNC)))
goto sync_buffers;
ret = sync_inode_metadata(inode, 1);
diff --git a/fs/libfs.c b/fs/libfs.c
index 1bbea5e7bae3..3a98dacd81b2 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1559,9 +1559,10 @@ int simple_fsync_noflush(struct file *file, loff_t start, loff_t end,
if (err)
return err;
- if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
+ if (!(inode_state_read_once(inode) & (I_DIRTY_ALL | I_SYNC)))
goto out;
- if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
+ if (datasync &&
+ !(inode_state_read_once(inode) & (I_DIRTY_DATASYNC | I_SYNC)))
goto out;
ret = sync_inode_metadata(inode, 1);
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 0/10] fs: Fix missed inode write during fsync
2026-05-25 8:58 [PATCH v2 0/10] fs: Fix missed inode write during fsync Jan Kara
` (9 preceding siblings ...)
2026-05-25 8:58 ` [PATCH v2 10/10] fs: Fix missed inode writeback when racing with __writeback_single_inode Jan Kara
@ 2026-06-02 7:22 ` Jan Kara
10 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2026-06-02 7:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
Hello!
On Mon 25-05-26 10:58:06, Jan Kara wrote:
> here is v2 of the patch series which fixes the possibly missing inode write
> during fsync(2) for filesystems using generic metadata bh tracking. The
> inherent problem is that .write_inode methods clear the inode dirty bit but
> only copy the inode contents into the buffer cache. Because the buffer carrying
> the inode is shared among multiple inodes, it cannot be tracked by the generic
> metadata bh tracking infrastructure and thus nothing tracks that the buffer
> needs to be written out to maintain fsync(2) guarantees. Normally, this is
> taken care of by .write_inode checking for WB_SYNC_ALL writeback and submitting
> and waiting for the buffer in that case. However, if the flush worker ends up
> writing the inode before data integrity writeback, this mechanism is broken.
>
> This patch series adds a way for filesystems to track the metadata block number
> which contains the inode metadata and then uses this information to write out
> the buffer on fsync.
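The ordering problem in the quoted cover letter can be sketched as a toy
userspace model. Everything here is hypothetical (the model_* names, the
single int standing in for inode contents, the "track" switch); it only
mirrors the sequence "flush worker writes the inode first, fsync runs second"
and is not how the kernel implements any of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_buffer {
	int data;	/* in-memory copy of the inode table block */
	int on_disk;	/* what has reached stable storage */
	bool dirty;
};

struct model_inode {
	int value;			/* in-memory inode contents */
	bool dirty;			/* stands in for the inode dirty bits */
	struct model_buffer *bh;	/* block holding the on-disk inode */
	struct model_buffer *tracked;	/* inode buffer remembered for fsync */
};

static void model_submit_and_wait(struct model_buffer *bh)
{
	assert(bh != NULL);
	if (bh->dirty) {
		bh->on_disk = bh->data;
		bh->dirty = false;
	}
}

/*
 * .write_inode stand-in: copy the inode into the buffer and clear the dirty
 * bit. Without tracking, the buffer is written only for integrity writeback.
 */
static void model_write_inode(struct model_inode *inode, bool sync_all,
			      bool track)
{
	inode->bh->data = inode->value;
	inode->bh->dirty = true;
	inode->dirty = false;
	if (track)
		inode->tracked = inode->bh;
	else if (sync_all)
		model_submit_and_wait(inode->bh);
}

static void model_fsync(struct model_inode *inode, bool track)
{
	if (inode->dirty)
		model_write_inode(inode, true, track);
	if (inode->tracked) {
		model_submit_and_wait(inode->tracked);
		inode->tracked = NULL;
	}
}

/* Flush worker writes the inode first, then fsync runs. Returns disk value. */
static int model_flush_then_fsync(bool track)
{
	struct model_buffer bh = { .data = 0, .on_disk = 0, .dirty = false };
	struct model_inode inode = {
		.value = 42, .dirty = true, .bh = &bh, .tracked = NULL,
	};

	model_write_inode(&inode, false, track);	/* background flush */
	model_fsync(&inode, track);			/* integrity writeback */
	return bh.on_disk;
}
```

Without tracking, fsync finds a clean inode and nothing remembers the dirty
buffer, so the new contents never reach the modelled disk; with tracking, the
remembered buffer is written out regardless of the inode dirty bit.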
FWIW I went through the Sashiko review comments. A lot of them are hallucinated
but there are actually three good finds:
1) FAT implementation of inode tracking is broken when fsync races with
rename.
2) ext2 & minix inode tracking makes handling of dirsync even more broken
than it already is (current handling is already broken because we don't
flush any directory indirect blocks but my changes also stop flushing the
inode buffer).
3) mmb_sync() flushing of inode buffer is racy for multiple parallel
fsyncs.
So I'll be addressing these. Please don't waste time with the series as is.
Honza
>
> Changes since v1:
> * Fixed freeing for ext4 dynamically allocated mmb struct
> * Optimized tracking of block carrying the inode so that we don't flush it
> unnecessarily on fsync
> * Add forgotten check for reclaimed bh to mmb_sync() to avoid NULL ptr deref
> * Couple other smaller fixups pointed out by Sashiko
>
> Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20260511115725.28441-1-jack@suse.cz # v1
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 15+ messages in thread