Linux EXT4 FS development
 help / color / mirror / Atom feed
* Re: [PATCH 11/17] isofs: replace __get_free_page() with kmalloc()
From: Jan Kara @ 2026-05-25 16:17 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-11-275e36a83f0e@kernel.org>

On Sat 23-05-26 20:54:23, Mike Rapoport (Microsoft) wrote:
> isofs_readdir() allocates a temporary buffer with __get_free_page().
> 
> kmalloc() is a better API for such use and it also provides better
> scalability and more debugging possibilities.
> 
> Replace use of __get_free_page() with kmalloc().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Thanks. Added to my tree.

								Honza

> ---
>  fs/isofs/dir.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
> index 2fd9948d606e..6d220eab531e 100644
> --- a/fs/isofs/dir.c
> +++ b/fs/isofs/dir.c
> @@ -13,6 +13,7 @@
>   */
>  #include <linux/gfp.h>
>  #include <linux/filelock.h>
> +#include <linux/slab.h>
>  #include "isofs.h"
>  
>  int isofs_name_translate(struct iso_directory_record *de, char *new, struct inode *inode)
> @@ -255,7 +256,7 @@ static int isofs_readdir(struct file *file, struct dir_context *ctx)
>  	struct iso_directory_record *tmpde;
>  	struct inode *inode = file_inode(file);
>  
> -	tmpname = (char *)__get_free_page(GFP_KERNEL);
> +	tmpname = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (tmpname == NULL)
>  		return -ENOMEM;
>  
> @@ -263,7 +264,7 @@ static int isofs_readdir(struct file *file, struct dir_context *ctx)
>  
>  	result = do_isofs_readdir(inode, file, ctx, tmpname, tmpde);
>  
> -	free_page((unsigned long) tmpname);
> +	kfree(tmpname);
>  	return result;
>  }
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Jan Kara @ 2026-05-25 16:17 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-10-275e36a83f0e@kernel.org>

On Sat 23-05-26 20:54:22, Mike Rapoport (Microsoft) wrote:
> jbd2_alloc() falls back from kmem_cache_alloc() to __get_free_pages() for
> allocations larger than PAGE_SIZE.
> But kmalloc() can handle such cases with essentially the same fallback.
> 
> Replace use of __get_free_pages() with kmalloc() and simplify
> jbd2_free() as both kmem_cache_alloc() and kmalloc() allocations can be
> freed with kfree().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

I'll just note that we allocate here fs block size large buffer so the same
kind of allocator as we use for folios would be even better. But that's a
different cleanup I guess.

								Honza

> ---
>  fs/jbd2/journal.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 4f397fcdb13c..1137b471e490 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2784,7 +2784,7 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>  	if (size < PAGE_SIZE)
>  		ptr = kmem_cache_alloc(get_slab(size), flags);
>  	else
> -		ptr = (void *)__get_free_pages(flags, get_order(size));
> +		ptr = kmalloc(size, flags);
>  
>  	/* Check alignment; SLUB has gotten this wrong in the past,
>  	 * and this can lead to user data corruption! */
> @@ -2795,10 +2795,7 @@ void *jbd2_alloc(size_t size, gfp_t flags)
>  
>  void jbd2_free(void *ptr, size_t size)
>  {
> -	if (size < PAGE_SIZE)
> -		kmem_cache_free(get_slab(size), ptr);
> -	else
> -		free_pages((unsigned long)ptr, get_order(size));
> +	kfree(ptr);
>  };
>  
>  /*
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 03/17] ocfs2/dlm: replace __get_free_page() with kmalloc()
From: Jan Kara @ 2026-05-25 16:13 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-3-275e36a83f0e@kernel.org>

On Sat 23-05-26 20:54:15, Mike Rapoport (Microsoft) wrote:
> A few places in ocsfs2 allocate temporary buffers with __get_free_page() or
> get_zeroed_page().
> 
> kmalloc() is a better API for such use and it also provides better
> scalability and more debugging possibilities.
> 
> Replace use of __get_free_page() and get_zeroed_page() with kmalloc() and
> kzalloc() respectively.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ocfs2/dlm/dlmdebug.c    | 24 +++++++++---------------
>  fs/ocfs2/dlm/dlmdomain.c   |  8 +++++---
>  fs/ocfs2/dlm/dlmmaster.c   |  5 ++---
>  fs/ocfs2/dlm/dlmrecovery.c |  4 ++--
>  4 files changed, 18 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmdebug.c b/fs/ocfs2/dlm/dlmdebug.c
> index fe4fdd09bae3..6ca8b3b68eef 100644
> --- a/fs/ocfs2/dlm/dlmdebug.c
> +++ b/fs/ocfs2/dlm/dlmdebug.c
> @@ -260,10 +260,10 @@ void dlm_print_one_mle(struct dlm_master_list_entry *mle)
>  {
>  	char *buf;
>  
> -	buf = (char *) get_zeroed_page(GFP_ATOMIC);
> +	buf = kzalloc(PAGE_SIZE, GFP_ATOMIC);
>  	if (buf) {
>  		dump_mle(mle, buf, PAGE_SIZE - 1);
> -		free_page((unsigned long)buf);
> +		kfree(buf);
>  	}
>  }
>  
> @@ -280,7 +280,7 @@ static struct dentry *dlm_debugfs_root;
>  /* begin - utils funcs */
>  static int debug_release(struct inode *inode, struct file *file)
>  {
> -	free_page((unsigned long)file->private_data);
> +	kfree(file->private_data);
>  	return 0;
>  }
>  
> @@ -327,17 +327,15 @@ static int debug_purgelist_open(struct inode *inode, struct file *file)
>  	struct dlm_ctxt *dlm = inode->i_private;
>  	char *buf = NULL;
>  
> -	buf = (char *) get_zeroed_page(GFP_NOFS);
> +	buf = kzalloc(PAGE_SIZE, GFP_NOFS);
>  	if (!buf)
> -		goto bail;
> +		return -ENOMEM;
>  
>  	i_size_write(inode, debug_purgelist_print(dlm, buf, PAGE_SIZE - 1));
>  
>  	file->private_data = buf;
>  
>  	return 0;
> -bail:
> -	return -ENOMEM;
>  }
>  
>  static const struct file_operations debug_purgelist_fops = {
> @@ -384,17 +382,15 @@ static int debug_mle_open(struct inode *inode, struct file *file)
>  	struct dlm_ctxt *dlm = inode->i_private;
>  	char *buf = NULL;
>  
> -	buf = (char *) get_zeroed_page(GFP_NOFS);
> +	buf = kzalloc(PAGE_SIZE, GFP_NOFS);
>  	if (!buf)
> -		goto bail;
> +		return -ENOMEM;
>  
>  	i_size_write(inode, debug_mle_print(dlm, buf, PAGE_SIZE - 1));
>  
>  	file->private_data = buf;
>  
>  	return 0;
> -bail:
> -	return -ENOMEM;
>  }
>  
>  static const struct file_operations debug_mle_fops = {
> @@ -775,17 +771,15 @@ static int debug_state_open(struct inode *inode, struct file *file)
>  	struct dlm_ctxt *dlm = inode->i_private;
>  	char *buf = NULL;
>  
> -	buf = (char *) get_zeroed_page(GFP_NOFS);
> +	buf = kzalloc(PAGE_SIZE, GFP_NOFS);
>  	if (!buf)
> -		goto bail;
> +		return -ENOMEM;
>  
>  	i_size_write(inode, debug_state_print(dlm, buf, PAGE_SIZE - 1));
>  
>  	file->private_data = buf;
>  
>  	return 0;
> -bail:
> -	return -ENOMEM;
>  }
>  
>  static const struct file_operations debug_state_fops = {
> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
> index dc9da9133c8e..97bb9400e24b 100644
> --- a/fs/ocfs2/dlm/dlmdomain.c
> +++ b/fs/ocfs2/dlm/dlmdomain.c
> @@ -63,7 +63,7 @@ static inline void byte_copymap(u8 dmap[], unsigned long smap[],
>  static void dlm_free_pagevec(void **vec, int pages)
>  {
>  	while (pages--)
> -		free_page((unsigned long)vec[pages]);
> +		kfree(vec[pages]);
>  	kfree(vec);
>  }
>  
> @@ -75,9 +75,11 @@ static void **dlm_alloc_pagevec(int pages)
>  	if (!vec)
>  		return NULL;
>  
> -	for (i = 0; i < pages; i++)
> -		if (!(vec[i] = (void *)__get_free_page(GFP_KERNEL)))
> +	for (i = 0; i < pages; i++) {
> +		vec[i] = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +		if (!vec[i])
>  			goto out_free;
> +	}
>  
>  	mlog(0, "Allocated DLM hash pagevec; %d pages (%lu expected), %lu buckets per page\n",
>  	     pages, (unsigned long)DLM_HASH_PAGES,
> diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
> index 93eff38fdadd..aee3b4c56dcc 100644
> --- a/fs/ocfs2/dlm/dlmmaster.c
> +++ b/fs/ocfs2/dlm/dlmmaster.c
> @@ -2548,7 +2548,7 @@ static int dlm_migrate_lockres(struct dlm_ctxt *dlm,
>  
>  	/* preallocate up front. if this fails, abort */
>  	ret = -ENOMEM;
> -	mres = (struct dlm_migratable_lockres *) __get_free_page(GFP_NOFS);
> +	mres = kmalloc(PAGE_SIZE, GFP_NOFS);
>  	if (!mres) {
>  		mlog_errno(ret);
>  		goto leave;
> @@ -2725,8 +2725,7 @@ static int dlm_migrate_lockres(struct dlm_ctxt *dlm,
>  	if (wake)
>  		wake_up(&res->wq);
>  
> -	if (mres)
> -		free_page((unsigned long)mres);
> +	kfree(mres);
>  
>  	dlm_put(dlm);
>  
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 128872bd945d..9b97bf73df22 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -837,7 +837,7 @@ int dlm_request_all_locks_handler(struct o2net_msg *msg, u32 len, void *data,
>  	}
>  
>  	/* this will get freed by dlm_request_all_locks_worker */
> -	buf = (char *) __get_free_page(GFP_NOFS);
> +	buf = kmalloc(PAGE_SIZE, GFP_NOFS);
>  	if (!buf) {
>  		kfree(item);
>  		dlm_put(dlm);
> @@ -933,7 +933,7 @@ static void dlm_request_all_locks_worker(struct dlm_work_item *item, void *data)
>  		}
>  	}
>  leave:
> -	free_page((unsigned long)data);
> +	kfree(data);
>  }
>  
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 02/17] proc: replace __get_free_page() with kmalloc()
From: Jan Kara @ 2026-05-25 16:11 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-2-275e36a83f0e@kernel.org>

On Sat 23-05-26 20:54:14, Mike Rapoport (Microsoft) wrote:
> A few functions in fs/proc/base.c use __get_free_page() to allocate a
> temporary buffer.
> 
> kmalloc() is a better API for such use and it also provides better
> scalability and more debugging possibilities.
> 
> Replace use of __get_free_page() with kmalloc().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/proc/base.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..e129dc509b79 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -261,7 +261,7 @@ static ssize_t get_mm_proctitle(struct mm_struct *mm, char __user *buf,
>  	if (pos >= PAGE_SIZE)
>  		return 0;
>  
> -	page = (char *)__get_free_page(GFP_KERNEL);
> +	page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!page)
>  		return -ENOMEM;
>  
> @@ -284,7 +284,7 @@ static ssize_t get_mm_proctitle(struct mm_struct *mm, char __user *buf,
>  			ret = len;
>  		}
>  	}
> -	free_page((unsigned long)page);
> +	kfree(page);
>  	return ret;
>  }
>  
> @@ -347,7 +347,7 @@ static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
>  	if (count > arg_end - pos)
>  		count = arg_end - pos;
>  
> -	page = (char *)__get_free_page(GFP_KERNEL);
> +	page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!page)
>  		return -ENOMEM;
>  
> @@ -371,7 +371,7 @@ static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
>  		count -= got;
>  	}
>  
> -	free_page((unsigned long)page);
> +	kfree(page);
>  	return len;
>  }
>  
> @@ -908,7 +908,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
>  	if (!mm)
>  		return 0;
>  
> -	page = (char *)__get_free_page(GFP_KERNEL);
> +	page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!page)
>  		return -ENOMEM;
>  
> @@ -949,7 +949,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
>  
>  	mmput(mm);
>  free:
> -	free_page((unsigned long) page);
> +	kfree(page);
>  	return copied;
>  }
>  
> @@ -1016,7 +1016,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>  	if (!mm || !mm->env_end)
>  		return 0;
>  
> -	page = (char *)__get_free_page(GFP_KERNEL);
> +	page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!page)
>  		return -ENOMEM;
>  
> @@ -1062,7 +1062,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>  	mmput(mm);
>  
>  free:
> -	free_page((unsigned long) page);
> +	kfree(page);
>  	return ret;
>  }
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 01/17] quota: allocate dquot_hash with kmalloc()
From: Jan Kara @ 2026-05-25 16:10 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-1-275e36a83f0e@kernel.org>

On Sat 23-05-26 20:54:13, Mike Rapoport (Microsoft) wrote:
> dquot_init() allocates a single page for dquot_hash with
> __get_free_pages().
> 
> kmalloc() is a better API for such use and it also provides better
> scalability and more debugging possibilities.
> 
> Replace use of __get_free_pages() with kmalloc() and get rid of the order
> variable that remained 0 for more than 20 years.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Thanks! I've added this patch to my tree.

								Honza

> ---
>  fs/quota/dquot.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 64cf42721496..9850de3955d3 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -3022,7 +3022,7 @@ static const struct ctl_table fs_dqstats_table[] = {
>  static int __init dquot_init(void)
>  {
>  	int i, ret;
> -	unsigned long nr_hash, order;
> +	unsigned long nr_hash;
>  	struct shrinker *dqcache_shrinker;
>  
>  	printk(KERN_NOTICE "VFS: Disk quotas %s\n", __DQUOT_VERSION__);
> @@ -3035,8 +3035,7 @@ static int __init dquot_init(void)
>  				SLAB_PANIC),
>  			NULL);
>  
> -	order = 0;
> -	dquot_hash = (struct hlist_head *)__get_free_pages(GFP_KERNEL, order);
> +	dquot_hash = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!dquot_hash)
>  		panic("Cannot create dquot hash table");
>  
> @@ -3046,7 +3045,7 @@ static int __init dquot_init(void)
>  		panic("Cannot create dquot stat counters");
>  
>  	/* Find power-of-two hlist_heads which can fit into allocation */
> -	nr_hash = (1UL << order) * PAGE_SIZE / sizeof(struct hlist_head);
> +	nr_hash = PAGE_SIZE / sizeof(struct hlist_head);
>  	dq_hash_bits = ilog2(nr_hash);
>  
>  	nr_hash = 1UL << dq_hash_bits;
> @@ -3054,8 +3053,8 @@ static int __init dquot_init(void)
>  	for (i = 0; i < nr_hash; i++)
>  		INIT_HLIST_HEAD(dquot_hash + i);
>  
> -	pr_info("VFS: Dquot-cache hash table entries: %ld (order %ld,"
> -		" %ld bytes)\n", nr_hash, order, (PAGE_SIZE << order));
> +	pr_info("VFS: Dquot-cache hash table entries: %ld (%ld bytes)\n",
> +		nr_hash, PAGE_SIZE);
>  
>  	dqcache_shrinker = shrinker_alloc(0, "dquota-cache");
>  	if (!dqcache_shrinker)
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] jbd2: update outdated comment for jbd2_journal_try_to_free_buffers()
From: Jan Kara @ 2026-05-25 15:42 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260522030540.3896201-1-yi.zhang@huaweicloud.com>

On Fri 22-05-26 11:05:40, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> jbd2_journal_try_to_free_buffers() currently only tries to remove
> checkpointed data buffers from the checkpoint list for data=journal
> mode, and bails out if any buffer is still attached to a transaction.
> For data=ordered and writeback modes, data buffers never have
> journal_heads, so the function degenerates to a plain
> try_to_free_buffers() call.
> 
> Besides, The release of metadata buffers has been delegated to the jbd2
> journal shrinker in commit 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to
> release checkpointed buffers"). jbd2_journal_try_to_free_buffers() is
> not used for handling metadata buffers now.
> 
> However, the comment above the function still references
> jbd2_journal_dirty_data(), __jbd2_journal_unfile_buffer(), t_datalist,
> BKL, and BUF_CLEAN, all of which were removed in commit 87c89c232c8f
> ("jbd2: Remove data=ordered mode support using jbd buffer heads").
> 
> Replace it with a description of what the function actually does now.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Thanks for the update. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/transaction.c | 39 ++++++++++++---------------------------
>  1 file changed, 12 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..239bcf88ed1c 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
>  }
>  
>  /**
> - * jbd2_journal_try_to_free_buffers() - try to free page buffers.
> + * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
>   * @journal: journal for operation
>   * @folio: Folio to detach data from.
>   *
> - * For all the buffers on this page,
> - * if they are fully written out ordered data, move them onto BUF_CLEAN
> - * so try_to_free_buffers() can reap them.
> + * For each buffer_head on @folio, if the buffer has a journal_head but
> + * is not attached to a running or committing transaction, try to remove
> + * it from the checkpoint list.  This is needed for data=journal mode
> + * where data buffers are journaled: once they are checkpointed, the
> + * journal_head can be detached and the buffer freed.  If any buffer is
> + * still attached to a transaction, the folio cannot be released and we
> + * bail out.  Otherwise we call try_to_free_buffers() to detach all
> + * buffer_heads from the folio.
>   *
> - * This function returns non-zero if we wish try_to_free_buffers()
> - * to be called. We do this if the page is releasable by try_to_free_buffers().
> - * We also do it if the page has locked or dirty buffers and the caller wants
> - * us to perform sync or async writeout.
> + * For data=ordered and writeback modes, data buffers never have
> + * journal_heads, so this degenerates to a plain try_to_free_buffers().
>   *
> - * This complicates JBD locking somewhat.  We aren't protected by the
> - * BKL here.  We wish to remove the buffer from its committing or
> - * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
> - *
> - * This may *change* the value of transaction_t->t_datalist, so anyone
> - * who looks at t_datalist needs to lock against this function.
> - *
> - * Even worse, someone may be doing a jbd2_journal_dirty_data on this
> - * buffer.  So we need to lock against that.  jbd2_journal_dirty_data()
> - * will come out of the lock with the buffer dirty, which makes it
> - * ineligible for release here.
> - *
> - * Who else is affected by this?  hmm...  Really the only contender
> - * is do_get_write_access() - it could be looking at the buffer while
> - * journal_try_to_free_buffer() is changing its state.  But that
> - * cannot happen because we never reallocate freed data as metadata
> - * while the data is part of a transaction.  Yes?
> - *
> - * Return false on failure, true on success
> + * Return: true if the folio's buffers were freed, false otherwise
>   */
>  bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
>  {
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 2/2] ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary
From: Jan Kara @ 2026-05-25 15:32 UTC (permalink / raw)
  To: Bohdan Trach
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei,
	lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260521125931.16474-3-bohdan.trach@huaweicloud.com>

On Thu 21-05-26 14:59:29, Bohdan Trach wrote:
> Getting ext4_group_desc structure can contribute to the cost of
> ext4_mb_prefetch() without any need, as most groups fail the
> !EXT4_MB_GRP_TEST_AND_SET_READ check.
> 
> Optimize ext4_mb_prefetch by getting the group description only when
> necessary.
> 
> The result is further increase in performance of fallocate() system call
> path that triggers ext4_mb_prefetch() via a linear group scan.
> 
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>

Looks good. Just one nit below:

> @@ -2872,14 +2870,17 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
>  		 * prefetch once, so we avoid getblk() call, which can
>  		 * be expensive.
>  		 */
> -		if (gdp && grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
> -		    EXT4_MB_GRP_NEED_INIT(grp) &&
> -		    ext4_free_group_clusters(sb, gdp) > 0 ) {
> -			bh = ext4_read_block_bitmap_nowait(sb, group, true);
> -			if (!IS_ERR_OR_NULL(bh)) {
> -				if (!buffer_uptodate(bh) && cnt)
> -					(*cnt)++;
> -				brelse(bh);
> +		if (group < ngroups && grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&

The group < ngroup should be always true and is also implicitly contained
in the 'grp != NULL' check. So I'd remove that bit. Otherwise feel free to
add:

Reviewed-by: Jan Kara <jack@suse.cz>

									Honza

> +		    EXT4_MB_GRP_NEED_INIT(grp)) {
> +			struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
> +
> +			if (gdp && ext4_free_group_clusters(sb, gdp) > 0) {
> +				bh = ext4_read_block_bitmap_nowait(sb, group, true);
> +				if (!IS_ERR_OR_NULL(bh)) {
> +					if (!buffer_uptodate(bh) && cnt)
> +						(*cnt)++;
> +					brelse(bh);
> +				}
>  			}
>  		}
>  		if (++group >= ngroups)
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
From: Jan Kara @ 2026-05-25 15:28 UTC (permalink / raw)
  To: Bohdan Trach
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei,
	lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260521125931.16474-2-bohdan.trach@huaweicloud.com>

On Thu 21-05-26 14:59:28, Bohdan Trach wrote:
> EXT4_MB_GRP_TEST_AND_SET_READ uses test_and_set_bit function which
> issues an atomic write. This can cause high overhead due to cache
> contention when multiple threads iterate over groups in a tight loop,
> as is the case for ext4_mb_prefetch(). We have seen this to be a
> problem for Kunpeng 920b CPUs which uses a single ARM LSE instruction
> for this purpose.
> 
> This change significantly reduces costs of fallocate() operations which
> trigger linear group scans on large multicore machines where
> test_and_set_bit issues an atomic write operation unconditionally.
> 
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
...
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 56b82d4a15d7..0713207811a6 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3551,7 +3551,17 @@ struct ext4_group_info {
>  #define EXT4_MB_GRP_CLEAR_TRIMMED(grp)	\
>  	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
>  #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
> -	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +	(ext4_mb_grp_test_and_set_read((grp)))
> +
> +static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
> +{
> +	int r = test_bit_acquire(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state);
> +
> +	if (!r)
> +		return test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state);
> +	else
> +		return r;
> +}

Good idea but do we really need the 'acquire' barrier here? I don't see
anything that would really need this so I think
EXT4_MB_GRP_TEST_AND_SET_READ() can be just:

test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) || \
  test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state)

or am I missing something?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: convert legacy ext4_debug() to standard pr_debug()
From: Jan Kara @ 2026-05-25 15:16 UTC (permalink / raw)
  To: lirongqing
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel
In-Reply-To: <20260521074634.2697-1-lirongqing@baidu.com>

On Thu 21-05-26 03:46:34, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
> 
> The ext4 file system historically implemented its own debug logging macro
> ext4_debug() via EXT4FS_DEBUG conditional compilation. This legacy
> implementation suffers from two major drawbacks:
> 
> 1. It makes two consecutive un-serialized printk() calls, which can
>    lead to severe log interleaving and corruption under multi-core
>    concurrent workloads.
> 2. It completely bypasses the standard modern kernel dynamic debug
>    (CONFIG_DYNAMIC_DEBUG) infrastructure.
> 
> Clean up the legacy implementation by leveraging pr_debug(). This squashes
> the multiple printk() calls into a single atomic execution, ensuring
> log integrity, while seamlessly hooking ext4 into the kernel's native
> dynamic debug framework.
> 
> The redundant __FILE__ and __LINE__ macros are intentionally removed from
> the string format because the dynamic debug infrastructure can already
> append them automatically at runtime (via the '+fl' flags) if desired.
> This avoids redundancy and double-logging in modern production/debugging
> environments while keeping the macro clean and robust against dangling
> comma compiler errors.
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Nice! Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  fs/ext4/ext4.h | 20 ++------------------
>  1 file changed, 2 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a9..39e86ff 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -62,24 +62,8 @@
>   */
>  #define DOUBLE_CHECK__
>  
> -/*
> - * Define EXT4FS_DEBUG to produce debug messages
> - */
> -#undef EXT4FS_DEBUG
> -
> -/*
> - * Debug code
> - */
> -#ifdef EXT4FS_DEBUG
> -#define ext4_debug(f, a...)						\
> -	do {								\
> -		printk(KERN_DEBUG "EXT4-fs DEBUG (%s, %d): %s:",	\
> -			__FILE__, __LINE__, __func__);			\
> -		printk(KERN_DEBUG f, ## a);				\
> -	} while (0)
> -#else
> -#define ext4_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
> -#endif
> +#define ext4_debug(fmt, ...)						\
> +	pr_debug("EXT4-fs DEBUG %s: " fmt, __func__,  ##__VA_ARGS__)
>  
>   /*
>    * Turn on EXT_DEBUG to enable ext4_ext_show_path/leaf/move in extents.c
> -- 
> 2.9.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 2/2] fs: jbd2: use clear_and_wake_up_bit() in journal_end_buffer_io_sync()
From: Jan Kara @ 2026-05-25 15:03 UTC (permalink / raw)
  To: Agatha Isabelle Moreira
  Cc: linux-ext4, linux-kernel, linux-fsdevel, Theodore Ts'o,
	Jan Kara, shuo chen, linux-kernel-mentees, shuah, patch-reply
In-Reply-To: <ag4SrrOl7R2DcLLi@guidai>

On Wed 20-05-26 17:05:46, Agatha Isabelle Moreira wrote:
> Use `clear_and_wake_up_bit()` in `journal_end_buffer_io_sync()`, since
> the helper was introduced in 'commit 8236b0ae31c83 ("bdi: wake up
> concurrent wb_shutdown() callers.")' as a generic way of doing the same
> sequence of operations:
> 	clear_bit_unlock();
> 	smp_mb__after_atomic();
> 	wake_up_bit();
> 
> The helper was first implemented to avoid bugs caused by forgetting to
> call `wake_up_bit()` after `clear_bit_unlock()`.
> 
> Since `journal_end_buffer_io_sync()` was first introduced by 'commit
> 470decc613ab2 ("jbd2: initial copy of files from jbd")' and last
> modified in this operation by 'commit 4e857c58efeb9 ("arch: Mass
> conversion of smp_mb__*()")', years before `clear_and_wake_up_bit()`, it
> still uses the open-coded sequence.
> 
> Replace the open-coded sequence with the helper to avoid duplicate code
> and reduce code paths to maintain.
> 
> Suggested-by: shuo chen <1289151713@qq.com>
> Link: https://lore.kernel.org/kernelnewbies/agzoqV835-co4kAN@guidai/T/#t
> Signed-off-by: Agatha Isabelle Moreira <code@agatha.dev>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/commit.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 8cf61e7185c4..b647fde76e49 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -39,9 +39,7 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
>  	else
>  		clear_buffer_uptodate(bh);
>  	if (orig_bh) {
> -		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> -		smp_mb__after_atomic();
> -		wake_up_bit(&orig_bh->b_state, BH_Shadow);
> +		clear_and_wake_up_bit(BH_Shadow, &orig_bh->b_state);
>  	}
>  	unlock_buffer(bh);
>  }
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 1/2] fs: buffer: use clear_and_wake_up_bit() in unlock_buffer()
From: Jan Kara @ 2026-05-25 15:02 UTC (permalink / raw)
  To: Agatha Isabelle Moreira
  Cc: linux-fsdevel, linux-ext4, linux-kernel, Alexander Viro,
	Christian Brauner, Jan Kara, shuo chen, linux-kernel-mentees,
	shuah, patch-reply
In-Reply-To: <ag4SD-mkmn5IbuN7@guidai>

On Wed 20-05-26 16:58:16, Agatha Isabelle Moreira wrote:
> Use `clear_and_wake_up_bit()` in `unlock_buffer()`, since the helper was
> introduced in 'commit 8236b0ae31c83 ("bdi: wake up concurrent
> wb_shutdown() callers.")' as a generic way of doing the same sequence of
> operations:
>        	clear_bit_unlock();
> 	smp_mb__after_atomic();
> 	wake_up_bit();
> 
> The helper was implemented to avoid bugs caused by forgetting to call
> `wake_up_bit()` after `clear_bit_unlock()`.
> 
> Since `unlock_buffer()` predates git and was last modified in
> 'commit 4e857c58efeb9 ("arch: Mass conversion of smp_mb__*()")', years
> before `clear_and_wake_up_bit()`, it still uses the open-coded sequence.
> 
> Replace the open-coded sequence with the helper to avoid duplicate code
> and reduce code paths to maintain.
> 
> Suggested-by: shuo chen <1289151713@qq.com>
> Link: https://lore.kernel.org/kernelnewbies/agzoqV835-co4kAN@guidai/T/#t
> Signed-off-by: Agatha Isabelle Moreira <code@agatha.dev>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496..4348b240bd97 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -74,9 +74,7 @@ EXPORT_SYMBOL(__lock_buffer);
>  
>  void unlock_buffer(struct buffer_head *bh)
>  {
> -	clear_bit_unlock(BH_Lock, &bh->b_state);
> -	smp_mb__after_atomic();
> -	wake_up_bit(&bh->b_state, BH_Lock);
> +	clear_and_wake_up_bit(BH_Lock, &bh->b_state);
>  }
>  EXPORT_SYMBOL(unlock_buffer);
>  
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 01/10] affs: Drop support for metadata bh tracking
From: David Sterba @ 2026-05-25 10:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Christian Brauner, aivazian.tigran, Ted Tso,
	linux-ext4, OGAWA Hirofumi, David Sterba
In-Reply-To: <20260525085821.769119-11-jack@suse.cz>

On Mon, May 25, 2026 at 10:58:07AM +0200, Jan Kara wrote:
> AFFS did all the hard work of tracking metadata bhs dirtied for an inode
> but it actually never used this information as affs_file_fsync() just
> calls sync_blockdev() to writeback all filesystem metadata bhs. After a
> discussion with AFFS maintainer nobody cares about AFFS performance
> so let's keep this affs_file_fsync() behavior and just drop all the
> pointless tracking from AFFS.
> 
> CC: David Sterba <dsterba@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: David Sterba <dsterba@suse.com>

^ permalink raw reply

* [PATCH v2 10/10] fs: Fix missed inode writeback when racing with __writeback_single_inode
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

When mmb_fsync_noflush() or simple_fsync_noflush() race with another
writeback of the same inode, they can see inode dirty bits are already
clear and skip inode writeback although the racing
__writeback_single_inode() didn't yet get to writing anything. This can
result in fsync(2) returning without properly persisting the inode.

We already have I_SYNC bit for this synchronization and
writeback_single_inode() properly uses it so just fix
mmb_fsync_noflush() and simple_fsync_noflush() to take it into account
as well.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 5 +++--
 fs/libfs.c  | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index f83fb3cdc6ac..01e17a287a42 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -662,9 +662,10 @@ int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
 	if (err)
 		return err;
 
-	if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
+	if (!(inode_state_read_once(inode) & (I_DIRTY_ALL | I_SYNC)))
 		goto sync_buffers;
-	if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
+	if (datasync &&
+	    !(inode_state_read_once(inode) & (I_DIRTY_DATASYNC | I_SYNC)))
 		goto sync_buffers;
 
 	ret = sync_inode_metadata(inode, 1);
diff --git a/fs/libfs.c b/fs/libfs.c
index 1bbea5e7bae3..3a98dacd81b2 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1559,9 +1559,10 @@ int simple_fsync_noflush(struct file *file, loff_t start, loff_t end,
 	if (err)
 		return err;
 
-	if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
+	if (!(inode_state_read_once(inode) & (I_DIRTY_ALL | I_SYNC)))
 		goto out;
-	if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
+	if (datasync &&
+	    !(inode_state_read_once(inode) & (I_DIRTY_DATASYNC | I_SYNC)))
 		goto out;
 
 	ret = sync_inode_metadata(inode, 1);
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 07/10] minix: Fix possibly missing inode write on fsync(2)
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's buffer on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/minix/inode.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 9c6bac248907..401056fb682e 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -649,7 +649,7 @@ static struct buffer_head * V1_minix_update_inode(struct inode * inode)
 		raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
 	else for (i = 0; i < 9; i++)
 		raw_inode->i_zone[i] = minix_inode->u.i1_data[i];
-	mark_buffer_dirty(bh);
+	mmb_mark_inode_buffer_dirty(bh, &minix_i(inode)->i_metadata_bhs);
 	return bh;
 }
 
@@ -678,7 +678,7 @@ static struct buffer_head * V2_minix_update_inode(struct inode * inode)
 		raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
 	else for (i = 0; i < 10; i++)
 		raw_inode->i_zone[i] = minix_inode->u.i2_data[i];
-	mark_buffer_dirty(bh);
+	mmb_mark_inode_buffer_dirty(bh, &minix_i(inode)->i_metadata_bhs);
 	return bh;
 }
 
@@ -693,14 +693,6 @@ static int minix_write_inode(struct inode *inode, struct writeback_control *wbc)
 		bh = V2_minix_update_inode(inode);
 	if (!bh)
 		return -EIO;
-	if (wbc->sync_mode == WB_SYNC_ALL && buffer_dirty(bh)) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh)) {
-			printk("IO error syncing minix inode [%s:%08llx]\n",
-				inode->i_sb->s_id, inode->i_ino);
-			err = -EIO;
-		}
-	}
 	brelse (bh);
 	return err;
 }
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's inode table block on fsync(2) in nojournal mode (from
ext4_sync_parent() and ext4_fsync_nojournal()). This significantly
simplifies the code as we don't have to explicitely handle inode buffer
writeback in ext4_write_inode() and thus we can also remove
sync_inode_metadata() calls from ext4_sync_parent() and
ext4_write_inode() call from ext4_fsync_nojournal().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c |  2 +-
 fs/ext4/ext4_jbd2.h |  2 ++
 fs/ext4/fsync.c     | 12 ------------
 fs/ext4/inode.c     | 33 ++++++++++-----------------------
 4 files changed, 13 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 752326f3b653..3d6b8f494a76 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -350,7 +350,7 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
 	return 0;
 }
 
-static void ext4_inode_attach_mmb(struct inode *inode)
+void ext4_inode_attach_mmb(struct inode *inode)
 {
 	struct mapping_metadata_bhs *mmb;
 
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..2a01b8279c88 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -122,6 +122,8 @@
 #define EXT4_HT_EXT_CONVERT     11
 #define EXT4_HT_MAX             12
 
+void ext4_inode_attach_mmb(struct inode *inode);
+
 int
 ext4_mark_iloc_dirty(handle_t *handle,
 		     struct inode *inode,
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index c104f55a0242..1cb613478386 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,6 @@ static int ext4_sync_parent(struct inode *inode)
 			if (ret)
 				break;
 		}
-		ret = sync_inode_metadata(inode, 1);
-		if (ret)
-			break;
 	}
 	dput(dentry);
 	return ret;
@@ -87,10 +84,6 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
 				int datasync, bool *needs_barrier)
 {
 	struct inode *inode = file->f_inode;
-	struct writeback_control wbc = {
-		.sync_mode = WB_SYNC_ALL,
-		.nr_to_write = 0,
-	};
 	int ret;
 
 	ret = mmb_fsync_noflush(file, READ_ONCE(EXT4_I(inode)->i_metadata_bhs),
@@ -98,11 +91,6 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
 	if (ret)
 		return ret;
 
-	/* Force writeout of inode table buffer to disk */
-	ret = ext4_write_inode(inode, &wbc);
-	if (ret)
-		return ret;
-
 	ret = ext4_sync_parent(inode);
 
 	if (test_opt(inode->i_sb, BARRIER))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e66e9510909..651201849667 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5696,10 +5696,16 @@ static int ext4_do_update_inode(handle_t *handle,
 		ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
 					      bh->b_data);
 
-	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
-	err = ext4_handle_dirty_metadata(handle, NULL, bh);
-	if (err)
-		goto out_error;
+	if (!ext4_handle_valid(handle)) {
+		if (!ei->i_metadata_bhs)
+			ext4_inode_attach_mmb(inode);
+		mmb_mark_inode_buffer_dirty(iloc->bh, ei->i_metadata_bhs);
+	} else {
+		BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_metadata(handle, NULL, bh);
+		if (err)
+			goto out_error;
+	}
 	ext4_clear_inode_state(inode, EXT4_STATE_NEW);
 	if (set_large_file) {
 		BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access");
@@ -5786,24 +5792,6 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 
 		err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
 						EXT4_I(inode)->i_sync_tid);
-	} else {
-		struct ext4_iloc iloc;
-
-		err = __ext4_get_inode_loc_noinmem(inode, &iloc);
-		if (err)
-			return err;
-		/*
-		 * sync(2) will flush the whole buffer cache. No need to do
-		 * it here separately for each inode.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
-			sync_dirty_buffer(iloc.bh);
-		if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
-			ext4_error_inode_block(inode, iloc.bh->b_blocknr, EIO,
-					       "IO error syncing inode");
-			err = -EIO;
-		}
-		brelse(iloc.bh);
 	}
 	return err;
 }
@@ -6348,7 +6336,6 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 
 	/* the do_update_inode consumes one bh->b_count */
 	get_bh(iloc->bh);
-
 	/* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
 	err = ext4_do_update_inode(handle, inode, iloc);
 	put_bh(iloc->bh);
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 05/10] udf: Fix possibly missing inode write on fsync(2)
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's block on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 67bcf83758c8..861400152261 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1707,7 +1707,7 @@ void udf_update_extra_perms(struct inode *inode, umode_t mode)
 
 int udf_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return udf_update_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+	return udf_update_inode(inode, 0);
 }
 
 static int udf_sync_inode(struct inode *inode)
@@ -1936,7 +1936,7 @@ static int udf_update_inode(struct inode *inode, int do_sync)
 	unlock_buffer(bh);
 
 	/* write the data blocks */
-	mark_buffer_dirty(bh);
+	mmb_mark_inode_buffer_dirty(bh, &iinfo->i_metadata_bhs);
 	if (do_sync) {
 		sync_dirty_buffer(bh);
 		if (buffer_write_io_error(bh)) {
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 08/10] bfs: Fix possibly missing inode write on fsync(2)
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's buffer on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/bfs/inode.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 19e49c8cf750..2506795c3f2c 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -136,7 +136,6 @@ static int bfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned long i_sblock;
 	struct bfs_inode *di;
 	struct buffer_head *bh;
-	int err = 0;
 
 	dprintf("ino=%08x\n", ino);
 
@@ -164,15 +163,10 @@ static int bfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	di->i_eblock = cpu_to_le32(BFS_I(inode)->i_eblock);
 	di->i_eoffset = cpu_to_le32(i_sblock * BFS_BSIZE + inode->i_size - 1);
 
-	mark_buffer_dirty(bh);
-	if (wbc->sync_mode == WB_SYNC_ALL) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh))
-			err = -EIO;
-	}
+	mmb_mark_inode_buffer_dirty(bh, &BFS_I(inode)->i_metadata_bhs);
 	brelse(bh);
 	mutex_unlock(&info->bfs_lock);
-	return err;
+	return 0;
 }
 
 static void bfs_evict_inode(struct inode *inode)
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 06/10] fat: Fix possibly missing inode write on fsync(2)
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's buffer on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fat/inode.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 28f78df086ef..1ffbfee1a9ad 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -431,6 +431,9 @@ EXPORT_SYMBOL_GPL(fat_attach);
 void fat_detach(struct inode *inode)
 {
 	struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb);
+
+	/* The block isn't associated with the inode anymore... */
+	mmb_clear_inode_blk(&MSDOS_I(inode)->i_metadata_bhs);
 	spin_lock(&sbi->inode_hash_lock);
 	MSDOS_I(inode)->i_pos = 0;
 	hlist_del_init(&MSDOS_I(inode)->i_fat_hash);
@@ -906,7 +909,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
 				  &raw_entry->cdate, &raw_entry->ctime_cs);
 	}
 	spin_unlock(&sbi->inode_hash_lock);
-	mark_buffer_dirty(bh);
+	mmb_mark_inode_buffer_dirty(bh, &MSDOS_I(inode)->i_metadata_bhs);
 	err = 0;
 	if (wait)
 		err = sync_dirty_buffer(bh);
@@ -925,7 +928,7 @@ static int fat_write_inode(struct inode *inode, struct writeback_control *wbc)
 		err = fat_clusters_flush(sb);
 		mutex_unlock(&MSDOS_SB(sb)->s_lock);
 	} else
-		err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+		err = __fat_write_inode(inode, 0);
 
 	return err;
 }
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 04/10] ext2: Fix possibly missing inode write on fsync(2)
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Use mmb inode buffer writeout infrastructure to reliably write out
inode's inode table block on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 74aca5eb572d..d7a812b84b6b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1611,7 +1611,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
 		}
 	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
 		raw_inode->i_block[n] = ei->i_data[n];
-	mark_buffer_dirty(bh);
+	mmb_mark_inode_buffer_dirty(bh, &ei->i_metadata_bhs);
 	if (do_sync) {
 		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh)) {
@@ -1627,7 +1627,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
 
 int ext2_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return __ext2_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+	return __ext2_write_inode(inode, 0);
 }
 
 int ext2_getattr(struct mnt_idmap *idmap, const struct path *path,
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 03/10] fs: Writeout inode buffer from mmb_sync()
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Currently metadata bh tracking does not track inode buffers because they
are usually shared by several inodes and so our linked list tracking
cannot be used. On fsync we call sync_inode_metadata() to write inode
instead where filesystems' .write_inode methods detect data integrity
writeback and take care to submit inode buffer to disk and wait for it
in that case. This is however racy as for example flush worker can
submit normal (WB_SYNC_NONE) inode writeback first, which makes the
inode clean and copies the inode to the buffer but doesn't submit the
buffer for IO. Thus sync_inode_metadata() call does nothing and we fail
to persist inode buffer to disk on fsync(2).

Fix the problem by allowing filesystem to set the number of block backing
the inode in mmb structure and mmb_sync() then takes care to writeout
corresponding buffer and wait for it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c                 | 64 +++++++++++++++++++++++++++++--------
 include/linux/buffer_head.h | 14 ++++++++
 include/linux/fs.h          |  1 +
 3 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496..f83fb3cdc6ac 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -477,12 +477,12 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * using RCU, grab the lock, verify we didn't race with somebody detaching the
  * bh / moving it to different inode and only then proceeding.
  */
-
 void mmb_init(struct mapping_metadata_bhs *mmb, struct address_space *mapping)
 {
 	spin_lock_init(&mmb->lock);
 	INIT_LIST_HEAD(&mmb->list);
 	mmb->mapping = mapping;
+	mmb->inode_blk = MMB_INVALID_BLK;
 }
 EXPORT_SYMBOL(mmb_init);
 
@@ -550,11 +550,13 @@ EXPORT_SYMBOL_GPL(mmb_has_buffers);
 int mmb_sync(struct mapping_metadata_bhs *mmb)
 {
 	struct buffer_head *bh;
+	sector_t inode_blk;
 	int err = 0;
 	struct blk_plug plug;
 	LIST_HEAD(tmp);
 
-	if (!mmb_has_buffers(mmb))
+	if (!mmb_has_buffers(mmb) &&
+	    data_race(mmb->inode_blk == MMB_INVALID_BLK))
 		return 0;
 
 	blk_start_plug(&plug);
@@ -593,8 +595,22 @@ int mmb_sync(struct mapping_metadata_bhs *mmb)
 			}
 		}
 	}
-
+	inode_blk = mmb->inode_blk;
+	mmb->inode_blk = MMB_INVALID_BLK;
 	spin_unlock(&mmb->lock);
+
+	/* Writeout inode buffer if it was set and wasn't written out yet */
+	if (inode_blk != MMB_INVALID_BLK) {
+		bh = sb_find_get_block(mmb->mapping->host->i_sb, inode_blk);
+		if (bh) {
+			write_dirty_buffer(bh, REQ_SYNC);
+			wait_on_buffer(bh);
+			if (!buffer_uptodate(bh))
+				err = -EIO;
+			brelse(bh);
+		}
+	}
+
 	blk_finish_plug(&plug);
 	spin_lock(&mmb->lock);
 
@@ -646,18 +662,18 @@ int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
 	if (err)
 		return err;
 
-	if (mmb)
-		ret = mmb_sync(mmb);
 	if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
-		goto out;
+		goto sync_buffers;
 	if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
-		goto out;
-
-	err = sync_inode_metadata(inode, 1);
-	if (ret == 0)
-		ret = err;
-
-out:
+		goto sync_buffers;
+
+	ret = sync_inode_metadata(inode, 1);
+sync_buffers:
+	if (mmb) {
+		err = mmb_sync(mmb);
+		if (ret == 0)
+			ret = err;
+	}
 	/* check and advance again to catch errors after syncing out buffers */
 	err = file_check_and_advance_wb_err(file);
 	if (ret == 0)
@@ -733,6 +749,28 @@ void mmb_mark_buffer_dirty(struct buffer_head *bh,
 }
 EXPORT_SYMBOL(mmb_mark_buffer_dirty);
 
+/**
+ * mmb_mark_inode_buffer_dirty - Mark buffer containing inode as dirty and
+ *				 track it for fsync.
+ * @bh: The buffer containing the inode.
+ * @mmb: Mmb structure for metadata tracking.
+ *
+ * Mark the buffer containing inode as dirty and track the block number of
+ * the buffer containing the inode in mmb so that it gets written out from
+ * mmb_sync().
+ */
+void mmb_mark_inode_buffer_dirty(struct buffer_head *bh,
+				 struct mapping_metadata_bhs *mmb)
+{
+	/* For simplicity we use mmb->lock to synchronize with mmb_sync() */
+	spin_lock(&mmb->lock);
+	mark_buffer_dirty(bh);
+	mmb->inode_blk = bh->b_blocknr;
+	spin_unlock(&mmb->lock);
+}
+EXPORT_SYMBOL(mmb_mark_inode_buffer_dirty);
+
+
 /**
  * block_dirty_folio - Mark a folio as dirty.
  * @mapping: The address space containing this folio.
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index e4939e33b4b5..b77464359028 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -207,6 +207,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
 
 /* Things to do with metadata buffers list */
 void mmb_mark_buffer_dirty(struct buffer_head *bh, struct mapping_metadata_bhs *mmb);
+void mmb_mark_inode_buffer_dirty(struct buffer_head *bh,
+				 struct mapping_metadata_bhs *mmb);
 int mmb_fsync_noflush(struct file *file, struct mapping_metadata_bhs *mmb,
 		      loff_t start, loff_t end, bool datasync);
 int mmb_fsync(struct file *file, struct mapping_metadata_bhs *mmb,
@@ -513,12 +515,24 @@ bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
 
 #ifdef CONFIG_BUFFER_HEAD
 
+#define MMB_INVALID_BLK (~0ULL)
+
 void buffer_init(void);
 bool try_to_free_buffers(struct folio *folio);
 void mmb_init(struct mapping_metadata_bhs *mmb, struct address_space *mapping);
 bool mmb_has_buffers(struct mapping_metadata_bhs *mmb);
 void mmb_invalidate(struct mapping_metadata_bhs *mmb);
 int mmb_sync(struct mapping_metadata_bhs *mmb);
+static inline void mmb_clear_inode_blk(struct mapping_metadata_bhs *mmb)
+{
+	/*
+	 * The lock is mostly pointless here but let's keep setting of
+	 * inode_blk consistently under it.
+	 */
+	spin_lock(&mmb->lock);
+	mmb->inode_blk = MMB_INVALID_BLK;
+	spin_unlock(&mmb->lock);
+}
 void invalidate_bh_lrus(void);
 void invalidate_bh_lrus_cpu(void);
 bool has_bh_in_lru(int cpu, void *dummy);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..435a41e4c90f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -446,6 +446,7 @@ extern const struct address_space_operations empty_aops;
 /* Structure for tracking metadata buffer heads associated with the mapping */
 struct mapping_metadata_bhs {
 	struct address_space *mapping;	/* Mapping bhs are associated with */
+	sector_t inode_blk;	/* Number of block containing the inode */
 	spinlock_t lock;	/* Lock protecting bh list */
 	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
 };
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 01/10] affs: Drop support for metadata bh tracking
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara, David Sterba
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

AFFS did all the hard work of tracking metadata bhs dirtied for an inode
but it actually never used this information as affs_file_fsync() just
calls sync_blockdev() to writeback all filesystem metadata bhs. After a
discussion with AFFS maintainer nobody cares about AFFS performance
so let's keep this affs_file_fsync() behavior and just drop all the
pointless tracking from AFFS.

CC: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/affs/affs.h     |  1 -
 fs/affs/amigaffs.c | 12 ++++++------
 fs/affs/file.c     | 25 +++++++++++--------------
 fs/affs/inode.c    | 13 +++++--------
 fs/affs/namei.c    |  9 ++++-----
 fs/affs/super.c    |  1 -
 6 files changed, 26 insertions(+), 35 deletions(-)

diff --git a/fs/affs/affs.h b/fs/affs/affs.h
index a0caf6ace860..406a0ef63e7b 100644
--- a/fs/affs/affs.h
+++ b/fs/affs/affs.h
@@ -44,7 +44,6 @@ struct affs_inode_info {
 	struct mutex i_link_lock;		/* Protects internal inode access. */
 	struct mutex i_ext_lock;		/* Protects internal inode access. */
 #define i_hash_lock i_ext_lock
-	struct mapping_metadata_bhs i_metadata_bhs;
 	u32	 i_blkcnt;			/* block count */
 	u32	 i_extcnt;			/* extended block count */
 	u32	*i_lc;				/* linear cache of extended blocks */
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index bed4fc805e8e..6cc0fc9a4cbf 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -57,7 +57,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
 		AFFS_TAIL(sb, dir_bh)->hash_chain = cpu_to_be32(ino);
 
 	affs_adjust_checksum(dir_bh, ino);
-	mmb_mark_buffer_dirty(dir_bh, &AFFS_I(dir)->i_metadata_bhs);
+	mark_buffer_dirty(dir_bh);
 	affs_brelse(dir_bh);
 
 	inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
@@ -100,7 +100,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh)
 			else
 				AFFS_TAIL(sb, bh)->hash_chain = ino;
 			affs_adjust_checksum(bh, be32_to_cpu(ino) - hash_ino);
-			mmb_mark_buffer_dirty(bh, &AFFS_I(dir)->i_metadata_bhs);
+			mark_buffer_dirty(bh);
 			AFFS_TAIL(sb, rem_bh)->parent = 0;
 			retval = 0;
 			break;
@@ -180,7 +180,7 @@ affs_remove_link(struct dentry *dentry)
 			affs_unlock_dir(dir);
 			goto done;
 		}
-		mmb_mark_buffer_dirty(link_bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(link_bh);
 
 		memcpy(AFFS_TAIL(sb, bh)->name, AFFS_TAIL(sb, link_bh)->name, 32);
 		retval = affs_insert_hash(dir, bh);
@@ -188,7 +188,7 @@ affs_remove_link(struct dentry *dentry)
 			affs_unlock_dir(dir);
 			goto done;
 		}
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 
 		affs_unlock_dir(dir);
 		iput(dir);
@@ -203,7 +203,7 @@ affs_remove_link(struct dentry *dentry)
 			__be32 ino2 = AFFS_TAIL(sb, link_bh)->link_chain;
 			AFFS_TAIL(sb, bh)->link_chain = ino2;
 			affs_adjust_checksum(bh, be32_to_cpu(ino2) - link_ino);
-			mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+			mark_buffer_dirty(bh);
 			retval = 0;
 			/* Fix the link count, if bh is a normal header block without links */
 			switch (be32_to_cpu(AFFS_TAIL(sb, bh)->stype)) {
@@ -306,7 +306,7 @@ affs_remove_header(struct dentry *dentry)
 	retval = affs_remove_hash(dir, bh);
 	if (retval)
 		goto done_unlock;
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 
 	affs_unlock_dir(dir);
 
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 144b17482d12..23e088a7ed4f 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -140,14 +140,14 @@ affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext)
 	AFFS_TAIL(sb, new_bh)->parent = cpu_to_be32(inode->i_ino);
 	affs_fix_checksum(sb, new_bh);
 
-	mmb_mark_buffer_dirty(new_bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(new_bh);
 
 	tmp = be32_to_cpu(AFFS_TAIL(sb, bh)->extension);
 	if (tmp)
 		affs_warning(sb, "alloc_ext", "previous extension set (%x)", tmp);
 	AFFS_TAIL(sb, bh)->extension = cpu_to_be32(blocknr);
 	affs_adjust_checksum(bh, blocknr - tmp);
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 
 	AFFS_I(inode)->i_extcnt++;
 	mark_inode_dirty(inode);
@@ -581,7 +581,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
 		memset(AFFS_DATA(bh) + boff, 0, tmp);
 		be32_add_cpu(&AFFS_DATA_HEAD(bh)->size, tmp);
 		affs_fix_checksum(sb, bh);
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 		size += tmp;
 		bidx++;
 	} else if (bidx) {
@@ -603,7 +603,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
 		AFFS_DATA_HEAD(bh)->size = cpu_to_be32(tmp);
 		affs_fix_checksum(sb, bh);
 		bh->b_state &= ~(1UL << BH_New);
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 		if (prev_bh) {
 			u32 tmp_next = be32_to_cpu(AFFS_DATA_HEAD(prev_bh)->next);
 
@@ -613,8 +613,7 @@ affs_extent_file_ofs(struct inode *inode, u32 newsize)
 					     bidx, tmp_next);
 			AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
 			affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
-			mmb_mark_buffer_dirty(prev_bh,
-					      &AFFS_I(inode)->i_metadata_bhs);
+			mark_buffer_dirty(prev_bh);
 			affs_brelse(prev_bh);
 		}
 		size += bsize;
@@ -733,7 +732,7 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
 		AFFS_DATA_HEAD(bh)->size = cpu_to_be32(
 			max(boff + tmp, be32_to_cpu(AFFS_DATA_HEAD(bh)->size)));
 		affs_fix_checksum(sb, bh);
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 		written += tmp;
 		from += tmp;
 		bidx++;
@@ -766,13 +765,12 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
 						     bidx, tmp_next);
 				AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
 				affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
-				mmb_mark_buffer_dirty(prev_bh,
-					&AFFS_I(inode)->i_metadata_bhs);
+				mark_buffer_dirty(prev_bh);
 			}
 		}
 		affs_brelse(prev_bh);
 		affs_fix_checksum(sb, bh);
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 		written += bsize;
 		from += bsize;
 		bidx++;
@@ -801,14 +799,13 @@ static int affs_write_end_ofs(const struct kiocb *iocb,
 						     bidx, tmp_next);
 				AFFS_DATA_HEAD(prev_bh)->next = cpu_to_be32(bh->b_blocknr);
 				affs_adjust_checksum(prev_bh, bh->b_blocknr - tmp_next);
-				mmb_mark_buffer_dirty(prev_bh,
-						&AFFS_I(inode)->i_metadata_bhs);
+				mark_buffer_dirty(prev_bh);
 			}
 		} else if (be32_to_cpu(AFFS_DATA_HEAD(bh)->size) < tmp)
 			AFFS_DATA_HEAD(bh)->size = cpu_to_be32(tmp);
 		affs_brelse(prev_bh);
 		affs_fix_checksum(sb, bh);
-		mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(bh);
 		written += tmp;
 		from += tmp;
 		bidx++;
@@ -945,7 +942,7 @@ affs_truncate(struct inode *inode)
 	}
 	AFFS_TAIL(sb, ext_bh)->extension = 0;
 	affs_fix_checksum(sb, ext_bh);
-	mmb_mark_buffer_dirty(ext_bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(ext_bh);
 	affs_brelse(ext_bh);
 
 	if (inode->i_size) {
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 5dd1b016bcb0..d4a3f381c4bc 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -206,7 +206,7 @@ affs_write_inode(struct inode *inode, struct writeback_control *wbc)
 		}
 	}
 	affs_fix_checksum(sb, bh);
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	affs_brelse(bh);
 	affs_free_prealloc(inode);
 	return 0;
@@ -266,11 +266,8 @@ affs_evict_inode(struct inode *inode)
 	if (!inode->i_nlink) {
 		inode->i_size = 0;
 		affs_truncate(inode);
-	} else {
-		mmb_sync(&AFFS_I(inode)->i_metadata_bhs);
 	}
 
-	mmb_invalidate(&AFFS_I(inode)->i_metadata_bhs);
 	clear_inode(inode);
 	affs_free_prealloc(inode);
 	cache_page = (unsigned long)AFFS_I(inode)->i_lc;
@@ -305,7 +302,7 @@ affs_new_inode(struct inode *dir)
 	bh = affs_getzeroblk(sb, block);
 	if (!bh)
 		goto err_bh;
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	affs_brelse(bh);
 
 	inode->i_uid     = current_fsuid();
@@ -393,17 +390,17 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
 		AFFS_TAIL(sb, bh)->link_chain = chain;
 		AFFS_TAIL(sb, inode_bh)->link_chain = cpu_to_be32(block);
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
-		mmb_mark_buffer_dirty(inode_bh, &AFFS_I(inode)->i_metadata_bhs);
+		mark_buffer_dirty(inode_bh);
 		set_nlink(inode, 2);
 		ihold(inode);
 	}
 	affs_fix_checksum(sb, bh);
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	dentry->d_fsdata = (void *)(long)bh->b_blocknr;
 
 	affs_lock_dir(dir);
 	retval = affs_insert_hash(dir, bh);
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	affs_unlock_dir(dir);
 	affs_unlock_link(inode);
 
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index c3c6532da4b0..57d8d755aada 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -373,7 +373,7 @@ affs_symlink(struct mnt_idmap *idmap, struct inode *dir,
 	}
 	*p = 0;
 	inode->i_size = i + 1;
-	mmb_mark_buffer_dirty(bh, &AFFS_I(inode)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	affs_brelse(bh);
 	mark_inode_dirty(inode);
 
@@ -443,8 +443,7 @@ affs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	/* TODO: move it back to old_dir, if error? */
 
 done:
-	mmb_mark_buffer_dirty(bh,
-			&AFFS_I(retval ? old_dir : new_dir)->i_metadata_bhs);
+	mark_buffer_dirty(bh);
 	affs_brelse(bh);
 	return retval;
 }
@@ -497,8 +496,8 @@ affs_xrename(struct inode *old_dir, struct dentry *old_dentry,
 	retval = affs_insert_hash(old_dir, bh_new);
 	affs_unlock_dir(old_dir);
 done:
-	mmb_mark_buffer_dirty(bh_old, &AFFS_I(new_dir)->i_metadata_bhs);
-	mmb_mark_buffer_dirty(bh_new, &AFFS_I(old_dir)->i_metadata_bhs);
+	mark_buffer_dirty(bh_old);
+	mark_buffer_dirty(bh_new);
 	affs_brelse(bh_old);
 	affs_brelse(bh_new);
 	return retval;
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 079f36e1ddec..8451647f3fea 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -108,7 +108,6 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
 	i->i_lc = NULL;
 	i->i_ext_bh = NULL;
 	i->i_pa_cnt = 0;
-	mmb_init(&i->i_metadata_bhs, &i->vfs_inode.i_data);
 
 	return &i->vfs_inode;
 }
-- 
2.51.0


^ permalink raw reply related

* [PATCH v2 0/10] fs: Fix missed inode write during fsync
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara

Hello,

here is v2 of the patch series which fixes the possibly missing inode write
during fsync(2) for filesystems using generic metadata bh tracking. The
inherent problem is that .write_inode methods clear inode dirty bit but they
only copy inode contents into to the buffer cache. Because buffer carrying the
inode is shared among multiple inodes, it cannot be tracked by the generic
metadata bh tracking infrastructure and thus nothing is tracking that buffer
needs to be written out to maintain fsync(2) guarantees. Normally, this gets
taken care of by .write_inode checking for WB_SYNC_ALL writeback and submitting
& waiting for the buffer in that case however if flush worker ends up writing
the inode before data integrity writeback, this mechanism is broken.

This patch series adds a way for filesystems to track metadata block number
which contains the inode metadata and then uses this information to writeout
the buffer on fsync.

Changes since v1:
* Fixed freeing for ext4 dynamically allocated mmb struct
* Optimized tracking of block carrying the inode so that we don't flush it
  unnecessarily on fsync
* Add forgotten check for reclaimed bh to mmb_sync() to avoid NULL ptr deref
* Couple other smaller fixups pointed out by Sashiko

								Honza
Previous versions:
Link: http://lore.kernel.org/r/20260511115725.28441-1-jack@suse.cz # v1

^ permalink raw reply

* [PATCH v2 02/10] ext4: Allocate mapping_metadata_bhs struct on demand
From: Jan Kara @ 2026-05-25  8:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Currently every ext4 inode gets mapping_metadata_bhs struct although it
is only needed when running without a journal and only for inodes where
any metadata was dirtied. Allocate mapping_metadata_bhs struct on demand
when dirtying the first metadata buffer for the inode.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h      |  2 +-
 fs/ext4/ext4_jbd2.c | 24 +++++++++++++++++++++---
 fs/ext4/fsync.c     | 12 ++++++++----
 fs/ext4/inode.c     |  9 +++++----
 fs/ext4/super.c     |  7 ++++---
 5 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..6bb29a20420f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1117,7 +1117,7 @@ struct ext4_inode_info {
 	struct rw_semaphore i_data_sem;
 	struct inode vfs_inode;
 	struct jbd2_inode *jinode;
-	struct mapping_metadata_bhs i_metadata_bhs;
+	struct mapping_metadata_bhs *i_metadata_bhs;
 
 	/*
 	 * File creation time. Its function is same as that of
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 9a8c225f2753..752326f3b653 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -350,6 +350,21 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
 	return 0;
 }
 
+static void ext4_inode_attach_mmb(struct inode *inode)
+{
+	struct mapping_metadata_bhs *mmb;
+
+	/*
+	 * It's difficult to handle failure when marking buffer dirty without
+	 * leaving filesystem corrupted
+	 */
+	mmb = kmalloc_obj(*mmb, GFP_NOFS | __GFP_NOFAIL);
+	mmb_init(mmb, &inode->i_data);
+	/* Someone swapped another mmb before us? */
+	if (cmpxchg(&EXT4_I(inode)->i_metadata_bhs, NULL, mmb))
+		kfree(mmb);
+}
+
 int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 				 handle_t *handle, struct inode *inode,
 				 struct buffer_head *bh)
@@ -389,11 +404,14 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 					 err);
 		}
 	} else {
-		if (inode)
+		if (inode) {
+			if (!EXT4_I(inode)->i_metadata_bhs)
+				ext4_inode_attach_mmb(inode);
 			mmb_mark_buffer_dirty(bh,
-					      &EXT4_I(inode)->i_metadata_bhs);
-		else
+					      EXT4_I(inode)->i_metadata_bhs);
+		} else {
 			mark_buffer_dirty(bh);
+		}
 		if (inode && inode_needs_sync(inode)) {
 			sync_dirty_buffer(bh);
 			if (buffer_req(bh) && !buffer_uptodate(bh)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 924726dcc85f..c104f55a0242 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -46,6 +46,7 @@
 static int ext4_sync_parent(struct inode *inode)
 {
 	struct dentry *dentry, *next;
+	struct mapping_metadata_bhs *mmb;
 	int ret = 0;
 
 	if (!ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY))
@@ -68,9 +69,12 @@ static int ext4_sync_parent(struct inode *inode)
 		 * through ext4_evict_inode()) and so we are safe to flush
 		 * metadata blocks and the inode.
 		 */
-		ret = mmb_sync(&EXT4_I(inode)->i_metadata_bhs);
-		if (ret)
-			break;
+		mmb = READ_ONCE(EXT4_I(inode)->i_metadata_bhs);
+		if (mmb) {
+			ret = mmb_sync(mmb);
+			if (ret)
+				break;
+		}
 		ret = sync_inode_metadata(inode, 1);
 		if (ret)
 			break;
@@ -89,7 +93,7 @@ static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
 	};
 	int ret;
 
-	ret = mmb_fsync_noflush(file, &EXT4_I(inode)->i_metadata_bhs,
+	ret = mmb_fsync_noflush(file, READ_ONCE(EXT4_I(inode)->i_metadata_bhs),
 				start, end, datasync);
 	if (ret)
 		return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..3e66e9510909 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -195,9 +195,8 @@ void ext4_evict_inode(struct inode *inode)
 			ext4_warning_inode(inode, "data will be lost");
 
 		truncate_inode_pages_final(&inode->i_data);
-		/* Avoid mballoc special inode which has no proper iops */
-		if (!EXT4_SB(inode->i_sb)->s_journal)
-			mmb_sync(&EXT4_I(inode)->i_metadata_bhs);
+		if (EXT4_I(inode)->i_metadata_bhs)
+			mmb_sync(EXT4_I(inode)->i_metadata_bhs);
 		goto no_delete;
 	}
 
@@ -3451,6 +3450,7 @@ static bool ext4_release_folio(struct folio *folio, gfp_t wait)
 static bool ext4_inode_datasync_dirty(struct inode *inode)
 {
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	struct mapping_metadata_bhs *mmb;
 
 	if (journal) {
 		if (jbd2_transaction_committed(journal,
@@ -3461,8 +3461,9 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 		return true;
 	}
 
+	mmb = READ_ONCE(EXT4_I(inode)->i_metadata_bhs);
 	/* Any metadata buffers to write? */
-	if (mmb_has_buffers(&EXT4_I(inode)->i_metadata_bhs))
+	if (mmb && mmb_has_buffers(mmb))
 		return true;
 	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..7fc2cff708cc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1430,7 +1430,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
-	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+	ei->i_metadata_bhs = NULL;
 	return &ei->vfs_inode;
 }
 
@@ -1448,6 +1448,7 @@ static int ext4_drop_inode(struct inode *inode)
 static void ext4_free_in_core_inode(struct inode *inode)
 {
 	fscrypt_free_inode(inode);
+	kfree(EXT4_I(inode)->i_metadata_bhs);
 	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
 		pr_warn("%s: inode %llu still in fc list",
 			__func__, inode->i_ino);
@@ -1527,8 +1528,8 @@ static void destroy_inodecache(void)
 void ext4_clear_inode(struct inode *inode)
 {
 	ext4_fc_del(inode);
-	if (!EXT4_SB(inode->i_sb)->s_journal)
-		mmb_invalidate(&EXT4_I(inode)->i_metadata_bhs);
+	if (EXT4_I(inode)->i_metadata_bhs)
+		mmb_invalidate(EXT4_I(inode)->i_metadata_bhs);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode);
 	/*
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH v5 03/10] fstests: add test for inotify isolation on cloned devices
From: Anand Jain @ 2026-05-25  8:35 UTC (permalink / raw)
  To: Christoph Hellwig, Anand Jain
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, amir73il, zlang
In-Reply-To: <ahP1tCc81ascjwxJ@infradead.org>



On 25/5/26 15:09, Christoph Hellwig wrote:
>> diff --git a/common/config b/common/config
>> index 4fd4c2c8af11..605a57947a40 100644
>> --- a/common/config
>> +++ b/common/config
>> @@ -242,6 +242,7 @@ export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
>>  export PARTED_PROG="$(type -P parted)"
>>  export XFS_PROPERTY_PROG="$(type -P xfs_property)"
>>  export FSCRYPTCTL_PROG="$(type -P fscryptctl)"
>> +export INOTIFYWAIT_PROG="$(type -P inotifywait)"
> 
> Usually we try to split infrastructure changes like this out into
> separate patches.

Right. I'll split it into a separate patch.

> Also any reason to rely on the obsolete inotify instead of fsnotify?

fsnotify is exercised in patch 4/10.
IMO, exercising inotify ensures we don't break legacy stuff.

In general, is it fine to keep obsolete command/testcase from
an LTS kernel perspective? We did that a couple of times before.



^ permalink raw reply

* Re: [PATCH v5 04/10] fstests: verify fanotify isolation on cloned filesystems
From: Anand Jain @ 2026-05-25  8:20 UTC (permalink / raw)
  To: Christoph Hellwig, Anand Jain
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	amir73il, zlang
In-Reply-To: <ahP2OOXmwsM87j9D@infradead.org>



On 25/5/26 15:11, Christoph Hellwig wrote:
>> +export FSNOTIFYWAIT_PROG="$(type -P fsnotifywait)"
> 
> Same comment about adding new common bits outside of test cases applies,

Got it.

> but why use both inotify and fsnotify?
> 

The SYSCALL and stuffs until fsnotify_..() are different,
so I decided to keep them both.

>> +[[ "$fsid1" == "$fsid2" ]] && \
>> +	_notrun "Require clone filesystem with unique f_fsid"
> 
> Please add a comment why this happens.  I also have to say I find
> the if syntax easier to follow then && with a line continuation.

Got it. If statements are better for readability.
Also, I'll add comments.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox