[PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads
@ 2026-01-30  3:12 Li Chen
  2026-01-30  3:12 ` [PATCH 1/3] jbd2: use " Li Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Li Chen @ 2026-01-30  3:12 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, Mark Fasheh, linux-ext4, ocfs2-devel
  Cc: Andreas Dilger, Joel Becker, Joseph Qi, linux-kernel

This series adds READ_ONCE() for existing lockless reads of
jbd2_inode fields in jbd2 and filesystem callbacks used by ext4 and ocfs2.

This is based on Jan's suggestion in the review of the ext4 jinode
publication race fix. [1]

[1]: https://lore.kernel.org/all/4jxwogttddiaoqbstlgou5ox6zs27ngjjz5ukrxafm2z5ijxod@so4eqnykiegj/

Thanks,
Li

Li Chen (3):
  jbd2: use READ_ONCE for lockless jinode reads
  ext4: use READ_ONCE for lockless jinode reads
  ocfs2: use READ_ONCE for lockless jinode reads

 fs/ext4/inode.c       |  6 ++++--
 fs/ext4/super.c       | 13 ++++++++-----
 fs/jbd2/commit.c      | 39 ++++++++++++++++++++++++++++++++-------
 fs/jbd2/transaction.c |  2 +-
 fs/ocfs2/journal.c    |  7 +++++--
 5 files changed, 50 insertions(+), 17 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] jbd2: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads Li Chen
@ 2026-01-30  3:12 ` Li Chen
  2026-02-02 16:40   ` Jan Kara
  2026-01-30  3:12 ` [PATCH 2/3] ext4: " Li Chen
  2026-01-30  3:12 ` [PATCH 3/3] ocfs2: " Li Chen
  2 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2026-01-30  3:12 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, linux-ext4, linux-kernel; +Cc: Li Chen

jbd2_inode fields are updated under journal->j_list_lock, but some
paths read them without holding the lock (e.g. fast commit
helpers and the ordered truncate fast path).

Use READ_ONCE() for these lockless reads to correct the
concurrency assumptions.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/jbd2/commit.c      | 39 ++++++++++++++++++++++++++++++++-------
 fs/jbd2/transaction.c |  2 +-
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7203d2d2624d..3347d75da2f8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -180,7 +180,13 @@ static int journal_wait_on_commit_record(journal_t *journal,
 /* Send all the data buffers related to an inode */
 int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
 {
-	if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
+	unsigned long flags;
+
+	if (!jinode)
+		return 0;
+
+	flags = READ_ONCE(jinode->i_flags);
+	if (!(flags & JI_WRITE_DATA))
 		return 0;
 
 	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
@@ -191,12 +197,30 @@ EXPORT_SYMBOL(jbd2_submit_inode_data);
 
 int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
 {
-	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
-		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
+	struct address_space *mapping;
+	struct inode *inode;
+	unsigned long flags;
+	loff_t start, end;
+
+	if (!jinode)
+		return 0;
+
+	flags = READ_ONCE(jinode->i_flags);
+	if (!(flags & JI_WAIT_DATA))
+		return 0;
+
+	inode = READ_ONCE(jinode->i_vfs_inode);
+	if (!inode)
+		return 0;
+
+	mapping = inode->i_mapping;
+	start = READ_ONCE(jinode->i_dirty_start);
+	end = READ_ONCE(jinode->i_dirty_end);
+
+	if (!mapping)
 		return 0;
 	return filemap_fdatawait_range_keep_errors(
-		jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
-		jinode->i_dirty_end);
+		mapping, start, end);
 }
 EXPORT_SYMBOL(jbd2_wait_inode_data);
 
@@ -240,10 +264,11 @@ static int journal_submit_data_buffers(journal_t *journal,
 int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
 {
 	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t start = READ_ONCE(jinode->i_dirty_start);
+	loff_t end = READ_ONCE(jinode->i_dirty_end);
 
 	return filemap_fdatawait_range_keep_errors(mapping,
-						   jinode->i_dirty_start,
-						   jinode->i_dirty_end);
+						   start, end);
 }
 
 /*
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index dca4b5d8aaaa..302b2090eea7 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2739,7 +2739,7 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
 	int ret = 0;
 
 	/* This is a quick check to avoid locking if not necessary */
-	if (!jinode->i_transaction)
+	if (!READ_ONCE(jinode->i_transaction))
 		goto out;
 	/* Locks are here just to force reading of recent values, it is
 	 * enough that the transaction was not committing before we started
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] ext4: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads Li Chen
  2026-01-30  3:12 ` [PATCH 1/3] jbd2: use " Li Chen
@ 2026-01-30  3:12 ` Li Chen
  2026-02-02 16:41   ` Jan Kara
  2026-01-30  3:12 ` [PATCH 3/3] ocfs2: " Li Chen
  2 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2026-01-30  3:12 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, linux-ext4, linux-kernel
  Cc: Jan Kara, Li Chen

ext4 journal commit callbacks access jbd2_inode fields such as
i_transaction and i_dirty_start/end without holding journal->j_list_lock.

Use READ_ONCE() for these reads to correct the concurrency assumptions.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/inode.c |  6 ++++--
 fs/ext4/super.c | 13 ++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d99296d7315f..2d451388e080 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3033,11 +3033,13 @@ static int ext4_writepages(struct address_space *mapping,
 
 int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
+	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
+	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
-		.range_start = jinode->i_dirty_start,
-		.range_end = jinode->i_dirty_end,
+		.range_start = dirty_start,
+		.range_end = dirty_end,
 	};
 	struct mpage_da_data mpd = {
 		.inode = jinode->i_vfs_inode,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5cf6c2b54bbb..acb2bc016fd4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -521,6 +521,7 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
 {
 	struct buffer_head *bh, *head;
 	struct journal_head *jh;
+	transaction_t *trans = READ_ONCE(jinode->i_transaction);
 
 	bh = head = folio_buffers(folio);
 	do {
@@ -539,7 +540,7 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
 		 */
 		jh = bh2jh(bh);
 		if (buffer_dirty(bh) ||
-		    (jh && (jh->b_transaction != jinode->i_transaction ||
+		    (jh && (jh->b_transaction != trans ||
 			    jh->b_next_transaction)))
 			return true;
 	} while ((bh = bh->b_this_page) != head);
@@ -550,12 +551,14 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
 static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
 	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
+	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
 	struct writeback_control wbc = {
-		.sync_mode =  WB_SYNC_ALL,
+		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
-		.range_start = jinode->i_dirty_start,
-		.range_end = jinode->i_dirty_end,
-        };
+		.range_start = dirty_start,
+		.range_end = dirty_end,
+	};
 	struct folio *folio = NULL;
 	int error;
 
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads Li Chen
  2026-01-30  3:12 ` [PATCH 1/3] jbd2: use " Li Chen
  2026-01-30  3:12 ` [PATCH 2/3] ext4: " Li Chen
@ 2026-01-30  3:12 ` Li Chen
  2026-01-30  5:27   ` Matthew Wilcox
  2 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2026-01-30  3:12 UTC (permalink / raw)
  To: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel
  Cc: Jan Kara, Li Chen

ocfs2 journal commit callback reads jbd2_inode dirty range fields without
holding journal->j_list_lock.

Use READ_ONCE() for these reads to correct the concurrency assumptions.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ocfs2/journal.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 85239807dec7..7032284cdbd6 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
 
 static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
-	return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
-			jinode->i_dirty_start, jinode->i_dirty_end);
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
+	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
+
+	return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
 }
 
 int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 ` [PATCH 3/3] ocfs2: " Li Chen
@ 2026-01-30  5:27   ` Matthew Wilcox
  2026-01-30 12:26     ` Li Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2026-01-30  5:27 UTC (permalink / raw)
  To: Li Chen
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel,
	Jan Kara

On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> holding journal->j_list_lock.
> 
> Use READ_ONCE() for these reads to correct the concurrency assumptions.

I don't think this is the right solution to the problem.  If it is,
there needs to be much better argumentation in the commit message.

As I understand it, jbd2_journal_file_inode() initialises jinode,
then adds it to the t_inode_list, then drops the j_list_lock.  So the
actual problem we need to address is that there's no memory barrier
between the store to i_dirty_start and the list_add().  Once that's
added, there's no need for a READ_ONCE here.

Or have I misunderstood the problem?

> Suggested-by: Jan Kara <jack@suse.com>
> Signed-off-by: Li Chen <me@linux.beauty>
> ---
>  fs/ocfs2/journal.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 85239807dec7..7032284cdbd6 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
>  
>  static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
> -	return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> -			jinode->i_dirty_start, jinode->i_dirty_end);
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
> +	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
> +
> +	return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
>  }
>  
>  int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30  5:27   ` Matthew Wilcox
@ 2026-01-30 12:26     ` Li Chen
  2026-01-30 16:36       ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2026-01-30 12:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel,
	Jan Kara

Hi Matthew,

 > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
 > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
 > > holding journal->j_list_lock.
 > > 
 > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
 > 
 > I don't think this is the right solution to the problem.  If it is,
 > there needs to be much better argumentation in the commit message.
 > 
 > As I understand it, jbd2_journal_file_inode() initialises jinode,
 > then adds it to the t_inode_list, then drops the j_list_lock.  So the
 > actual problem we need to address is that there's no memory barrier
 > between the store to i_dirty_start and the list_add().  Once that's
 > added, there's no need for a READ_ONCE here.
 > 
 > Or have I misunderstood the problem?

Thanks for the review.

My understanding of your point is that you're worried about a missing
"publish" ordering in jbd2_journal_file_inode(): we store
jinode->i_dirty_start/end and then list_add() the jinode to
t_inode_list, and a core which observes the list entry might miss the prior
i_dirty_* stores. Is that the issue you had in mind?

If so, for the normal commit path where the list is walked under
journal->j_list_lock (e.g. journal_submit_data_buffers() in
fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
necessary ordering, since both the i_dirty_* updates and the list_add()
happen inside the same critical section.

The ocfs2 case I was aiming at is different: the filesystem callback is
invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
j_list_lock but it still reads jinode->i_dirty_start/end while other
threads update these fields under the lock. Adding a barrier between the
stores and list_add() would not address that concurrent update window.

So the itent of READ_ONCE() in ocfs2 is to take a single snapshot of the
dirty range values from memory (avoid compiler to reuse a value kept in a
register or fold multiple reads). I'm not trying to claim any additional
memory ordering from this change.

I'll respin and adjust the commit message accordingly. The updated part will
say along the lines of:

"ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
(callback may sleep); these fields are updated under j_list_lock in jbd2.
Use READ_ONCE() so the callback takes a single snapshot via actual loads
from the variable (i.e. don't let the compiler reuse a value kept in a register
or fold multiple reads)."

Does that match your understanding? 

Regards,
Li

 > > Suggested-by: Jan Kara <jack@suse.com>
 > > Signed-off-by: Li Chen <me@linux.beauty>
 > > ---
 > >  fs/ocfs2/journal.c | 7 +++++--
 > >  1 file changed, 5 insertions(+), 2 deletions(-)
 > > 
 > > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
 > > index 85239807dec7..7032284cdbd6 100644
 > > --- a/fs/ocfs2/journal.c
 > > +++ b/fs/ocfs2/journal.c
 > > @@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
 > >  
 > >  static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 > >  {
 > > -    return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
 > > -            jinode->i_dirty_start, jinode->i_dirty_end);
 > > +    struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
 > > +    loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
 > > +    loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
 > > +
 > > +    return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
 > >  }
 > >  
 > >  int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
 > > -- 
 > > 2.52.0
 > > 
 > 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30 12:26     ` Li Chen
@ 2026-01-30 16:36       ` Matthew Wilcox
  2026-02-01  4:37         ` Li Chen
  2026-02-02 17:17         ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2026-01-30 16:36 UTC (permalink / raw)
  To: Li Chen
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel,
	Jan Kara

On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> Hi Matthew,
> 
>  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
>  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
>  > > holding journal->j_list_lock.
>  > > 
>  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
>  > 
>  > I don't think this is the right solution to the problem.  If it is,
>  > there needs to be much better argumentation in the commit message.
>  > 
>  > As I understand it, jbd2_journal_file_inode() initialises jinode,
>  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
>  > actual problem we need to address is that there's no memory barrier
>  > between the store to i_dirty_start and the list_add().  Once that's
>  > added, there's no need for a READ_ONCE here.
>  > 
>  > Or have I misunderstood the problem?
> 
> Thanks for the review.
> 
> My understanding of your point is that you're worried about a missing
> "publish" ordering in jbd2_journal_file_inode(): we store
> jinode->i_dirty_start/end and then list_add() the jinode to
> t_inode_list, and a core which observes the list entry might miss the prior
> i_dirty_* stores. Is that the issue you had in mind?

I think that's the only issue that exists ...

> If so, for the normal commit path where the list is walked under
> journal->j_list_lock (e.g. journal_submit_data_buffers() in
> fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> necessary ordering, since both the i_dirty_* updates and the list_add()
> happen inside the same critical section.

I don't think that's true.  I think what you're asserting is that:

	int *pi;
	int **ppi;

	spin_lock(&lock);
	*pi = 1;
	*ppi = pi;
	spin_unlock(&lock);

that the store to *pi must be observed before the store to *ppi, and
that's not true for a reader which doesn't read the value of lock.
The store to *ppi needs a store barrier before it.

> The ocfs2 case I was aiming at is different: the filesystem callback is
> invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> j_list_lock but it still reads jinode->i_dirty_start/end while other
> threads update these fields under the lock. Adding a barrier between the
> stores and list_add() would not address that concurrent update window.

I don't think that race exists.  If it does exist, the READ_ONCE will
not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
platforms do not, in general, have a way to do an atomic 64-bit load
(look at the implementation of i_size_read() for the gyrations we go
through to assure a non-torn read of that value).

> "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> (callback may sleep); these fields are updated under j_list_lock in jbd2.
> Use READ_ONCE() so the callback takes a single snapshot via actual loads
> from the variable (i.e. don't let the compiler reuse a value kept in a register
> or fold multiple reads)."

I think the prevention of this race occurs at a higher level than
"it's updated under a lock".  That is, jbd2_journal_file_inode()
is never called for a jinode which is currently being operated on by
j_submit_inode_data_buffers().  Now, I'm not an expert on the jbd code,
so I may be wrong here.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30 16:36       ` Matthew Wilcox
@ 2026-02-01  4:37         ` Li Chen
  2026-02-02 17:17         ` Jan Kara
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-02-01  4:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel,
	Jan Kara


Hi Matthew,

Thank you very much for the detailed explanation and for your patience.

On Sat, 31 Jan 2026 00:36:28 +0800,
Matthew Wilcox wrote:
> 
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> > 
> >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> >  > > holding journal->j_list_lock.
> >  > > 
> >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> >  > 
> >  > I don't think this is the right solution to the problem.  If it is,
> >  > there needs to be much better argumentation in the commit message.
> >  > 
> >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
> >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
> >  > actual problem we need to address is that there's no memory barrier
> >  > between the store to i_dirty_start and the list_add().  Once that's
> >  > added, there's no need for a READ_ONCE here.
> >  > 
> >  > Or have I misunderstood the problem?
> > 
> > Thanks for the review.
> > 
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
> 
> I think that's the only issue that exists ...

Understood.

> 
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
> 
> I don't think that's true.  I think what you're asserting is that:
> 
> 	int *pi;
> 	int **ppi;
> 
> 	spin_lock(&lock);
> 	*pi = 1;
> 	*ppi = pi;
> 	spin_unlock(&lock);
> 
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.

Yes, agreed ― thank you. I was implicitly assuming the reader had taken the same lock
at some point, which is not a valid assumption for a lockless reader. 

> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
> 
> I don't think that race exists.  If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).
> 
> > "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> > (callback may sleep); these fields are updated under j_list_lock in jbd2.
> > Use READ_ONCE() so the callback takes a single snapshot via actual loads
> > from the variable (i.e. don't let the compiler reuse a value kept in a register
> > or fold multiple reads)."
> 
> I think the prevention of this race occurs at a higher level than
> "it's updated under a lock".  That is, jbd2_journal_file_inode()
> is never called for a jinode which is currently being operated on by
> j_submit_inode_data_buffers().  Now, I'm not an expert on the jbd code,
> so I may be wrong here.

Thanks. I tried to sanity-check whether that “never called” invariant holds
in practice.

I added a small local-only tracepoint (not for upstream) which fires from
jbd2_journal_file_inode() when it observes JI_COMMIT_RUNNING already set
on the same jinode:

/* fs/jbd2/transaction.c */
if (unlikely(jinode->i_flags & JI_COMMIT_RUNNING))
      trace_jbd2_file_inode_commit_running(...);

The trace event prints dev, ino, current tid, jinode flags, and the
i_transaction / i_next_transaction tids.

With an ext4 test (ordered mode) I do see repeated hits. Trace output:

... jbd2_submit_inode_data: dev 7,0 ino 20
... jbd2_file_inode_commit_running: dev 7,0 ino 20 tid 3 op 0x6 i_flags 0x7
    j_tid 2 j_next 3 ... comm python3

So it looks like jbd2_journal_file_inode() can run while JI_COMMIT_RUNNING
is set for that inode, i.e. during the window where the commit thread drops
j_list_lock around ->j_submit_inode_data_buffers() / ->j_finish_inode_data_buffers().

Given this, would you prefer the series to move towards something like:

1. taking a snapshot of i_dirty_start/end under j_list_lock in the commit path and passing the snapshot
   to the filesystem callback (so callbacks never read jinode->i_dirty_* locklessly), or
2. introducing a real synchronization mechanism for the dirty range itself (seqcount/atomic64/etc)?
3. something else.

I’d be very grateful for guidance on what you consider the most appropriate direction or point out something I'm wrong.

Thanks again.

Regards,
Li

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] jbd2: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 ` [PATCH 1/3] jbd2: use " Li Chen
@ 2026-02-02 16:40   ` Jan Kara
  2026-02-02 16:52     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2026-02-02 16:40 UTC (permalink / raw)
  To: Li Chen; +Cc: Theodore Ts'o, Jan Kara, linux-ext4, linux-kernel

On Fri 30-01-26 11:12:30, Li Chen wrote:
> jbd2_inode fields are updated under journal->j_list_lock, but some
> paths read them without holding the lock (e.g. fast commit
> helpers and the ordered truncate fast path).
> 
> Use READ_ONCE() for these lockless reads to correct the
> concurrency assumptions.
> 
> Suggested-by: Jan Kara <jack@suse.com>
> Signed-off-by: Li Chen <me@linux.beauty>

Just one nit below. With that fixed feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -191,12 +197,30 @@ EXPORT_SYMBOL(jbd2_submit_inode_data);
>  
>  int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
>  {
> -	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> -		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	unsigned long flags;
> +	loff_t start, end;
> +
> +	if (!jinode)
> +		return 0;
> +
> +	flags = READ_ONCE(jinode->i_flags);
> +	if (!(flags & JI_WAIT_DATA))
> +		return 0;
> +
> +	inode = READ_ONCE(jinode->i_vfs_inode);

i_vfs_inode never changes so READ_ONCE is pointless here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] ext4: use READ_ONCE for lockless jinode reads
  2026-01-30  3:12 ` [PATCH 2/3] ext4: " Li Chen
@ 2026-02-02 16:41   ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2026-02-02 16:41 UTC (permalink / raw)
  To: Li Chen
  Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, linux-kernel,
	Jan Kara

On Fri 30-01-26 11:12:31, Li Chen wrote:
> ext4 journal commit callbacks access jbd2_inode fields such as
> i_transaction and i_dirty_start/end without holding journal->j_list_lock.
> 
> Use READ_ONCE() for these reads to correct the concurrency assumptions.
> 
> Suggested-by: Jan Kara <jack@suse.com>
> Signed-off-by: Li Chen <me@linux.beauty>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c |  6 ++++--
>  fs/ext4/super.c | 13 ++++++++-----
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d99296d7315f..2d451388e080 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3033,11 +3033,13 @@ static int ext4_writepages(struct address_space *mapping,
>  
>  int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
> +	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
> +	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_ALL,
>  		.nr_to_write = LONG_MAX,
> -		.range_start = jinode->i_dirty_start,
> -		.range_end = jinode->i_dirty_end,
> +		.range_start = dirty_start,
> +		.range_end = dirty_end,
>  	};
>  	struct mpage_da_data mpd = {
>  		.inode = jinode->i_vfs_inode,
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 5cf6c2b54bbb..acb2bc016fd4 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -521,6 +521,7 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
>  {
>  	struct buffer_head *bh, *head;
>  	struct journal_head *jh;
> +	transaction_t *trans = READ_ONCE(jinode->i_transaction);
>  
>  	bh = head = folio_buffers(folio);
>  	do {
> @@ -539,7 +540,7 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
>  		 */
>  		jh = bh2jh(bh);
>  		if (buffer_dirty(bh) ||
> -		    (jh && (jh->b_transaction != jinode->i_transaction ||
> +		    (jh && (jh->b_transaction != trans ||
>  			    jh->b_next_transaction)))
>  			return true;
>  	} while ((bh = bh->b_this_page) != head);
> @@ -550,12 +551,14 @@ static bool ext4_journalled_writepage_needs_redirty(struct jbd2_inode *jinode,
>  static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
>  	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
> +	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
>  	struct writeback_control wbc = {
> -		.sync_mode =  WB_SYNC_ALL,
> +		.sync_mode = WB_SYNC_ALL,
>  		.nr_to_write = LONG_MAX,
> -		.range_start = jinode->i_dirty_start,
> -		.range_end = jinode->i_dirty_end,
> -        };
> +		.range_start = dirty_start,
> +		.range_end = dirty_end,
> +	};
>  	struct folio *folio = NULL;
>  	int error;
>  
> -- 
> 2.52.0
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] jbd2: use READ_ONCE for lockless jinode reads
  2026-02-02 16:40   ` Jan Kara
@ 2026-02-02 16:52     ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2026-02-02 16:52 UTC (permalink / raw)
  To: Li Chen; +Cc: Theodore Ts'o, Jan Kara, linux-ext4, linux-kernel

On Mon 02-02-26 17:40:45, Jan Kara wrote:
> On Fri 30-01-26 11:12:30, Li Chen wrote:
> > jbd2_inode fields are updated under journal->j_list_lock, but some
> > paths read them without holding the lock (e.g. fast commit
> > helpers and the ordered truncate fast path).
> > 
> > Use READ_ONCE() for these lockless reads to correct the
> > concurrency assumptions.
> > 
> > Suggested-by: Jan Kara <jack@suse.com>
> > Signed-off-by: Li Chen <me@linux.beauty>
> 
> Just one nit below. With that fixed feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > @@ -191,12 +197,30 @@ EXPORT_SYMBOL(jbd2_submit_inode_data);
> >  
> >  int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> >  {
> > -	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> > -		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> > +	struct address_space *mapping;
> > +	struct inode *inode;
> > +	unsigned long flags;
> > +	loff_t start, end;
> > +
> > +	if (!jinode)
> > +		return 0;
> > +
> > +	flags = READ_ONCE(jinode->i_flags);
> > +	if (!(flags & JI_WAIT_DATA))
> > +		return 0;
> > +
> > +	inode = READ_ONCE(jinode->i_vfs_inode);
> 
> i_vfs_inode never changes so READ_ONCE is pointless here.

One more note: I've realized that for this to work you also need to make
jbd2_journal_file_inode() use WRITE_ONCE() when updating i_dirty_start,
i_dirty_end and i_flags.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-01-30 16:36       ` Matthew Wilcox
  2026-02-01  4:37         ` Li Chen
@ 2026-02-02 17:17         ` Jan Kara
  2026-02-03 12:10           ` Li Chen
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2026-02-02 17:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Li Chen, Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel,
	linux-kernel, Jan Kara

On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> > 
> >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> >  > > holding journal->j_list_lock.
> >  > > 
> >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> >  > 
> >  > I don't think this is the right solution to the problem.  If it is,
> >  > there needs to be much better argumentation in the commit message.
> >  > 
> >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
> >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
> >  > actual problem we need to address is that there's no memory barrier
> >  > between the store to i_dirty_start and the list_add().  Once that's
> >  > added, there's no need for a READ_ONCE here.
> >  > 
> >  > Or have I misunderstood the problem?
> > 
> > Thanks for the review.
> > 
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
> 
> I think that's the only issue that exists ...
> 
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
> 
> I don't think that's true.  I think what you're asserting is that:
> 
> 	int *pi;
> 	int **ppi;
> 
> 	spin_lock(&lock);
> 	*pi = 1;
> 	*ppi = pi;
> 	spin_unlock(&lock);
> 
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.

Well, the above reasonably accurately describes the code making jinode
visible. The reader code is like:

	spin_lock(&lock);
        pi = *ppi;
	spin_unlock(&lock);

	work with pi

so it is guaranteed to see pi properly initialized. The problem is that
"work with pi" can race with other code updating the content of pi which is
what this patch is trying to deal with.

> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
> 
> I don't think that race exists.  If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).

Sadly the race does exist - journal_submit_data_buffers() on the committing
transaction can run in parallel with jbd2_journal_file_inode() in the
running transaction. There's nothing preventing that. The problems arising
out of that are mostly theoretical but they do exist. In particular you're
correct that on 32-bit platforms this will be racy even with READ_ONCE /
WRITE_ONCE which I didn't realize.

Li, the best way to address this concern would be to modify jbd2_inode to
switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
the passed ranges properly...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-02-02 17:17         ` Jan Kara
@ 2026-02-03 12:10           ` Li Chen
  2026-02-03 12:37             ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2026-02-03 12:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel,
	linux-kernel, Jan Kara

Hi Jan,

 ---- On Tue, 03 Feb 2026 01:17:49 +0800  Jan Kara <jack@suse.cz> wrote --- 
 > On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
 > > On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
 > > > Hi Matthew,
 > > > 
 > > >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
 > > >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
 > > >  > > holding journal->j_list_lock.
 > > >  > > 
 > > >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
 > > >  > 
 > > >  > I don't think this is the right solution to the problem.  If it is,
 > > >  > there needs to be much better argumentation in the commit message.
 > > >  > 
 > > >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
 > > >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
 > > >  > actual problem we need to address is that there's no memory barrier
 > > >  > between the store to i_dirty_start and the list_add().  Once that's
 > > >  > added, there's no need for a READ_ONCE here.
 > > >  > 
 > > >  > Or have I misunderstood the problem?
 > > > 
 > > > Thanks for the review.
 > > > 
 > > > My understanding of your point is that you're worried about a missing
 > > > "publish" ordering in jbd2_journal_file_inode(): we store
 > > > jinode->i_dirty_start/end and then list_add() the jinode to
 > > > t_inode_list, and a core which observes the list entry might miss the prior
 > > > i_dirty_* stores. Is that the issue you had in mind?
 > > 
 > > I think that's the only issue that exists ...
 > > 
 > > > If so, for the normal commit path where the list is walked under
 > > > journal->j_list_lock (e.g. journal_submit_data_buffers() in
 > > > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
 > > > necessary ordering, since both the i_dirty_* updates and the list_add()
 > > > happen inside the same critical section.
 > > 
 > > I don't think that's true.  I think what you're asserting is that:
 > > 
 > >     int *pi;
 > >     int **ppi;
 > > 
 > >     spin_lock(&lock);
 > >     *pi = 1;
 > >     *ppi = pi;
 > >     spin_unlock(&lock);
 > > 
 > > that the store to *pi must be observed before the store to *ppi, and
 > > that's not true for a reader which doesn't read the value of lock.
 > > The store to *ppi needs a store barrier before it.
 > 
 > Well, the above reasonably accurately describes the code making jinode
 > visible. The reader code is like:
 > 
 >     spin_lock(&lock);
 >         pi = *ppi;
 >     spin_unlock(&lock);
 > 
 >     work with pi
 > 
 > so it is guaranteed to see pi properly initialized. The problem is that
 > "work with pi" can race with other code updating the content of pi which is
 > what this patch is trying to deal with.
 > 
 > > > The ocfs2 case I was aiming at is different: the filesystem callback is
 > > > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
 > > > j_list_lock but it still reads jinode->i_dirty_start/end while other
 > > > threads update these fields under the lock. Adding a barrier between the
 > > > stores and list_add() would not address that concurrent update window.
 > > 
 > > I don't think that race exists.  If it does exist, the READ_ONCE will
 > > not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
 > > platforms do not, in general, have a way to do an atomic 64-bit load
 > > (look at the implementation of i_size_read() for the gyrations we go
 > > through to assure a non-torn read of that value).
 > 
 > Sadly the race does exist - journal_submit_data_buffers() on the committing
 > transaction can run in parallel with jbd2_journal_file_inode() in the
 > running transaction. There's nothing preventing that. The problems arising
 > out of that are mostly theoretical but they do exist. In particular you're
 > correct that on 32-bit platforms this will be racy even with READ_ONCE /
 > WRITE_ONCE which I didn't realize.
 > 
 > Li, the best way to address this concern would be to modify jbd2_inode to
 > switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
 > bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
 > the passed ranges properly...

  Thank you, Jan. I will update the series accordingly.

  It seems this won’t break large files on 32-bit either: MAX_LFS_FILESIZE there is
  ULONG_MAX << PAGE_SHIFT (i.e. ULONG_MAX pages, ~16TB with 4K pages), and pgoff_t
  (unsigned long) can represent the same ULONG_MAX-page range.

Regards,
Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
  2026-02-03 12:10           ` Li Chen
@ 2026-02-03 12:37             ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2026-02-03 12:37 UTC (permalink / raw)
  To: Li Chen
  Cc: Jan Kara, Matthew Wilcox, Mark Fasheh, Joel Becker, Joseph Qi,
	ocfs2-devel, linux-kernel, Jan Kara

On Tue 03-02-26 20:10:42, Li Chen wrote:
> Hi Jan,
> 
>  ---- On Tue, 03 Feb 2026 01:17:49 +0800  Jan Kara <jack@suse.cz> wrote --- 
>  > On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
>  > > On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
>  > > > Hi Matthew,
>  > > > 
>  > > >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
>  > > >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
>  > > >  > > holding journal->j_list_lock.
>  > > >  > > 
>  > > >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
>  > > >  > 
>  > > >  > I don't think this is the right solution to the problem.  If it is,
>  > > >  > there needs to be much better argumentation in the commit message.
>  > > >  > 
>  > > >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
>  > > >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
>  > > >  > actual problem we need to address is that there's no memory barrier
>  > > >  > between the store to i_dirty_start and the list_add().  Once that's
>  > > >  > added, there's no need for a READ_ONCE here.
>  > > >  > 
>  > > >  > Or have I misunderstood the problem?
>  > > > 
>  > > > Thanks for the review.
>  > > > 
>  > > > My understanding of your point is that you're worried about a missing
>  > > > "publish" ordering in jbd2_journal_file_inode(): we store
>  > > > jinode->i_dirty_start/end and then list_add() the jinode to
>  > > > t_inode_list, and a core which observes the list entry might miss the prior
>  > > > i_dirty_* stores. Is that the issue you had in mind?
>  > > 
>  > > I think that's the only issue that exists ...
>  > > 
>  > > > If so, for the normal commit path where the list is walked under
>  > > > journal->j_list_lock (e.g. journal_submit_data_buffers() in
>  > > > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
>  > > > necessary ordering, since both the i_dirty_* updates and the list_add()
>  > > > happen inside the same critical section.
>  > > 
>  > > I don't think that's true.  I think what you're asserting is that:
>  > > 
>  > >     int *pi;
>  > >     int **ppi;
>  > > 
>  > >     spin_lock(&lock);
>  > >     *pi = 1;
>  > >     *ppi = pi;
>  > >     spin_unlock(&lock);
>  > > 
>  > > that the store to *pi must be observed before the store to *ppi, and
>  > > that's not true for a reader which doesn't read the value of lock.
>  > > The store to *ppi needs a store barrier before it.
>  > 
>  > Well, the above reasonably accurately describes the code making jinode
>  > visible. The reader code is like:
>  > 
>  >     spin_lock(&lock);
>  >         pi = *ppi;
>  >     spin_unlock(&lock);
>  > 
>  >     work with pi
>  > 
>  > so it is guaranteed to see pi properly initialized. The problem is that
>  > "work with pi" can race with other code updating the content of pi which is
>  > what this patch is trying to deal with.
>  > 
>  > > > The ocfs2 case I was aiming at is different: the filesystem callback is
>  > > > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
>  > > > j_list_lock but it still reads jinode->i_dirty_start/end while other
>  > > > threads update these fields under the lock. Adding a barrier between the
>  > > > stores and list_add() would not address that concurrent update window.
>  > > 
>  > > I don't think that race exists.  If it does exist, the READ_ONCE will
>  > > not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
>  > > platforms do not, in general, have a way to do an atomic 64-bit load
>  > > (look at the implementation of i_size_read() for the gyrations we go
>  > > through to assure a non-torn read of that value).
>  > 
>  > Sadly the race does exist - journal_submit_data_buffers() on the committing
>  > transaction can run in parallel with jbd2_journal_file_inode() in the
>  > running transaction. There's nothing preventing that. The problems arising
>  > out of that are mostly theoretical but they do exist. In particular you're
>  > correct that on 32-bit platforms this will be racy even with READ_ONCE /
>  > WRITE_ONCE which I didn't realize.
>  > 
>  > Li, the best way to address this concern would be to modify jbd2_inode to
>  > switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
>  > bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
>  > the passed ranges properly...
> 
>   Thank you, Jan. I will update the series accordingly.
> 
>   It seems this won’t break large files on 32-bit either: MAX_LFS_FILESIZE there is
>   ULONG_MAX << PAGE_SHIFT (i.e. ULONG_MAX pages, ~16TB with 4K pages), and pgoff_t
>   (unsigned long) can represent the same ULONG_MAX-page range.

Exactly. That's the reason why I've proposed this but I should have
probably stated that explicitely :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-02-03 12:37 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-30  3:12 [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads Li Chen
2026-01-30  3:12 ` [PATCH 1/3] jbd2: use " Li Chen
2026-02-02 16:40   ` Jan Kara
2026-02-02 16:52     ` Jan Kara
2026-01-30  3:12 ` [PATCH 2/3] ext4: " Li Chen
2026-02-02 16:41   ` Jan Kara
2026-01-30  3:12 ` [PATCH 3/3] ocfs2: " Li Chen
2026-01-30  5:27   ` Matthew Wilcox
2026-01-30 12:26     ` Li Chen
2026-01-30 16:36       ` Matthew Wilcox
2026-02-01  4:37         ` Li Chen
2026-02-02 17:17         ` Jan Kara
2026-02-03 12:10           ` Li Chen
2026-02-03 12:37             ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox