Date: Sun, 01 Feb 2026 12:37:36 +0800
Message-ID: <878qddoz8v.wl-me@linux.beauty>
From: Li Chen
To: Matthew Wilcox
Cc: Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel, linux-kernel, Jan Kara
Subject: Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
References: <20260130031232.60780-1-me@linux.beauty> <20260130031232.60780-4-me@linux.beauty> <19c0edeadad.6adfc20c802629.8728440093470268039@linux.beauty>
Hi Matthew,

Thank you very much for the detailed explanation and for your patience.

On Sat, 31 Jan 2026 00:36:28 +0800, Matthew Wilcox wrote:
>
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> >
> > > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> > > > ocfs2 journal commit callback reads jbd2_inode dirty range fields
> > > > without holding journal->j_list_lock.
> > > >
> > > > Use READ_ONCE() for these reads to correct the concurrency
> > > > assumptions.
> > >
> > > I don't think this is the right solution to the problem.  If it is,
> > > there needs to be much better argumentation in the commit message.
> > >
> > > As I understand it, jbd2_journal_file_inode() initialises jinode,
> > > then adds it to the t_inode_list, then drops the j_list_lock.  So the
> > > actual problem we need to address is that there's no memory barrier
> > > between the store to i_dirty_start and the list_add().  Once that's
> > > added, there's no need for a READ_ONCE here.
> > >
> > > Or have I misunderstood the problem?
> >
> > Thanks for the review.
> >
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the
> > prior i_dirty_* stores.  Is that the issue you had in mind?
>
> I think that's the only issue that exists ...

Understood.

> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide
> > the necessary ordering, since both the i_dirty_* updates and the
> > list_add() happen inside the same critical section.
>
> I don't think that's true.
> I think what you're asserting is that:
>
>	int *pi;
>	int **ppi;
>
>	spin_lock(&lock);
>	*pi = 1;
>	*ppi = pi;
>	spin_unlock(&lock);
>
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.

Yes, agreed -- thank you.  I was implicitly assuming the reader had taken
the same lock at some point, which is not a valid assumption for a
lockless reader.

> > The ocfs2 case I was aiming at is different: the filesystem callback
> > is invoked after unlocking journal->j_list_lock and may sleep, so it
> > can't hold j_list_lock but it still reads jinode->i_dirty_start/end
> > while other threads update these fields under the lock.  Adding a
> > barrier between the stores and list_add() would not address that
> > concurrent update window.
>
> I don't think that race exists.  If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).
>
> > "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> > (callback may sleep); these fields are updated under j_list_lock in
> > jbd2.  Use READ_ONCE() so the callback takes a single snapshot via
> > actual loads from the variable (i.e. don't let the compiler reuse a
> > value kept in a register or fold multiple reads)."
>
> I think the prevention of this race occurs at a higher level than
> "it's updated under a lock".  That is, jbd2_journal_file_inode()
> is never called for a jinode which is currently being operated on by
> j_submit_inode_data_buffers().  Now, I'm not an expert on the jbd code,
> so I may be wrong here.

Thanks.  I tried to sanity-check whether that "never called" invariant
holds in practice.
I added a small local-only tracepoint (not for upstream) which fires from
jbd2_journal_file_inode() when it observes JI_COMMIT_RUNNING already set
on the same jinode:

	/* fs/jbd2/transaction.c */
	if (unlikely(jinode->i_flags & JI_COMMIT_RUNNING))
		trace_jbd2_file_inode_commit_running(...);

The trace event prints dev, ino, current tid, jinode flags, and the
i_transaction / i_next_transaction tids.  With an ext4 test (ordered
mode) I do see repeated hits.  Trace output:

  ... jbd2_submit_inode_data: dev 7,0 ino 20 ...
  ... jbd2_file_inode_commit_running: dev 7,0 ino 20 tid 3 op 0x6 i_flags 0x7 j_tid 2 j_next 3 ... comm python3

So it looks like jbd2_journal_file_inode() can run while
JI_COMMIT_RUNNING is set for that inode, i.e. during the window where
the commit thread drops j_list_lock around
->j_submit_inode_data_buffers() / ->j_finish_inode_data_buffers().

Given this, would you prefer the series to move towards one of:

1. taking a snapshot of i_dirty_start/end under j_list_lock in the
   commit path and passing the snapshot to the filesystem callback (so
   callbacks never read jinode->i_dirty_* locklessly), or
2. introducing a real synchronization mechanism for the dirty range
   itself (seqcount/atomic64/etc), or
3. something else?

I'd be very grateful for guidance on whichever direction you consider
most appropriate, or for a pointer to anything I've got wrong.

Thanks again.

Regards,
Li