From: Li Chen <me@linux.beauty>
To: Matthew Wilcox <willy@infradead.org>
Cc: Mark Fasheh <mark@fasheh.com>, Joel Becker <jlbec@evilplan.org>,
Joseph Qi <joseph.qi@linux.alibaba.com>,
ocfs2-devel <ocfs2-devel@lists.linux.dev>,
linux-kernel <linux-kernel@vger.kernel.org>,
Jan Kara <jack@suse.com>
Subject: Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
Date: Sun, 01 Feb 2026 12:37:36 +0800
Message-ID: <878qddoz8v.wl-me@linux.beauty>
In-Reply-To: <aXzeDE5P_lxiXfPB@casper.infradead.org>
Hi Matthew,
Thank you very much for the detailed explanation and for your patience.
On Sat, 31 Jan 2026 00:36:28 +0800,
Matthew Wilcox wrote:
>
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> >
> > > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> > > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> > > > holding journal->j_list_lock.
> > > >
> > > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> > >
> > > I don't think this is the right solution to the problem. If it is,
> > > there needs to be much better argumentation in the commit message.
> > >
> > > As I understand it, jbd2_journal_file_inode() initialises jinode,
> > > then adds it to the t_inode_list, then drops the j_list_lock. So the
> > > actual problem we need to address is that there's no memory barrier
> > > between the store to i_dirty_start and the list_add(). Once that's
> > > added, there's no need for a READ_ONCE here.
> > >
> > > Or have I misunderstood the problem?
> >
> > Thanks for the review.
> >
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
>
> I think that's the only issue that exists ...
Understood.
>
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
>
> I don't think that's true. I think what you're asserting is that:
>
> int *pi;
> int **ppi;
>
> spin_lock(&lock);
> *pi = 1;
> *ppi = pi;
> spin_unlock(&lock);
>
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.
Yes, agreed, and thank you. I was implicitly assuming the reader had taken the
same lock at some point, which is not a valid assumption for a lockless reader.
> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
>
> I don't think that race exists. If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).
>
> > "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> > (callback may sleep); these fields are updated under j_list_lock in jbd2.
> > Use READ_ONCE() so the callback takes a single snapshot via actual loads
> > from the variable (i.e. don't let the compiler reuse a value kept in a register
> > or fold multiple reads)."
>
> I think the prevention of this race occurs at a higher level than
> "it's updated under a lock". That is, jbd2_journal_file_inode()
> is never called for a jinode which is currently being operated on by
> j_submit_inode_data_buffers(). Now, I'm not an expert on the jbd code,
> so I may be wrong here.
Thanks. I tried to sanity-check whether that “never called” invariant holds
in practice.
I added a small local-only tracepoint (not for upstream) which fires from
jbd2_journal_file_inode() when it observes JI_COMMIT_RUNNING already set
on the same jinode:
	/* fs/jbd2/transaction.c */
	if (unlikely(jinode->i_flags & JI_COMMIT_RUNNING))
		trace_jbd2_file_inode_commit_running(...);
The trace event prints dev, ino, current tid, jinode flags, and the
i_transaction / i_next_transaction tids.
With an ext4 test (ordered mode) I do see repeated hits. Trace output:
... jbd2_submit_inode_data: dev 7,0 ino 20
... jbd2_file_inode_commit_running: dev 7,0 ino 20 tid 3 op 0x6 i_flags 0x7
j_tid 2 j_next 3 ... comm python3
So it looks like jbd2_journal_file_inode() can run while JI_COMMIT_RUNNING
is set for that inode, i.e. during the window where the commit thread drops
j_list_lock around ->j_submit_inode_data_buffers() / ->j_finish_inode_data_buffers().
Given this, would you prefer the series to move towards something like:
1. taking a snapshot of i_dirty_start/end under j_list_lock in the commit path and passing the snapshot
to the filesystem callback (so callbacks never read jinode->i_dirty_* locklessly), or
2. introducing a real synchronization mechanism for the dirty range itself (seqcount/atomic64/etc)?
3. something else.
I'd be very grateful for guidance on whichever direction you consider most
appropriate, or for pointers to anything I've gotten wrong.
Thanks again.
Regards,
Li
Thread overview: 14+ messages
2026-01-30 3:12 [PATCH 0/3] jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads Li Chen
2026-01-30 3:12 ` [PATCH 1/3] jbd2: use " Li Chen
2026-02-02 16:40 ` Jan Kara
2026-02-02 16:52 ` Jan Kara
2026-01-30 3:12 ` [PATCH 2/3] ext4: " Li Chen
2026-02-02 16:41 ` Jan Kara
2026-01-30 3:12 ` [PATCH 3/3] ocfs2: " Li Chen
2026-01-30 5:27 ` Matthew Wilcox
2026-01-30 12:26 ` Li Chen
2026-01-30 16:36 ` Matthew Wilcox
2026-02-01 4:37 ` Li Chen [this message]
2026-02-02 17:17 ` Jan Kara
2026-02-03 12:10 ` Li Chen
2026-02-03 12:37 ` Jan Kara