From: Heming Zhao <heming.zhao@suse.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mark Fasheh <mark@fasheh.com>, Joel Becker <jlbec@evilplan.org>,
Joseph Qi <joseph.qi@linux.alibaba.com>,
jiangyiwen <jiangyiwen@huawei.com>,
Andrew Morton <akpm@linux-foundation.org>,
ocfs2-devel@lists.linux.dev, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
Date: Sat, 16 May 2026 07:53:52 +0800 [thread overview]
Message-ID: <ageuwl9o3nzktMrz@c73> (raw)
In-Reply-To: <agc8kA2wxC_7091p@c73>
On Fri, May 15, 2026 at 11:35:13PM +0800, Heming Zhao wrote:
> On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote:
> > Hi Heming,
> >
> > I would like to clarify why the expectation of "being called only once" is logically
> > incorrect, reply to your concern regarding the reference count leak and explain why
> > this patch is completely safe and sufficient.
> >
> > 1. get_local_system_inode() can fail under memory pressure:
> > get_local_system_inode() allocates memory internally. Under heavy memory pressure,
> > this allocation can fail and return NULL. When this happens, the caller
> > ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
> > again to read the inode from disk. Therefore, the filesystem design must inherently
> > support multiple calls to _ocfs2_get_system_file_inode().
> >
> > 2. Why cmpxchg() is sufficient and safe without the mutex:
> > The only thing the system_file_mutex is needed was to prevent a race where two
> > threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
> > pointer (since the underlying VFS iget_locked() returns the identical address for
> > the same slot), and both mistakenly invoke igrab() on it, leading to a reference
> > count leak.
> >
> > This patch perfectly solves that race condition by using cmpxchg() on the target
> > pointer array slot:
> >
> > * The thread that wins the cmpxchg() successfully initializes the slot with the
> > fetched inode and get the extra refcount because it is the first time to store
> > into the slot.
> >
> > * The thread that loses the cmpxchg() detects that another thread has already
> > initialized the slot with the exact same inode. The loser thread returns
> > without getting the extra refcount because it is not the first time to store
> > into the slot.
> >
> > Therefore, the reference counting contract is strictly and atomically maintained.
> > No references are leaked, and the array slot is never corrupted.
>
> Hi,
>
> The logic here is incorrect. The purpose of the refcount is to track how many
> consumers are using the inode.
>
> In the original code, if two threads concurrently access ocfs2_get_system_file_inode()
> while the inode is uninitialized, inode->i_count would ultimately be incremented
> by 3. However, with your patch, i_count will only be incremented by 2.
>
> To be more specific:
> Your patch explicitly triggers a race condition: when the target local_system
> inode is uninitialized and two threads enter simultaneously, Thread 1 wins the
> cmpxchg() and increments the refcount before exiting. Thread 2, however, loses
> the refcount increment simply because the atomic operation failed.
>
> btw, The issue addressed in commit 43b10a20372d was that after two concurrent
> threads returned, inode->i_count ended up being 4 when the correct value should
> have been 3. With your patch, the value will end up being 2, which is insufficient.
My above analysis contains a mistake.
With the patch, the refcount is also 3. However, I don't think the code logic is
correct.
Before commit 43b10a20372d, the refcount was 4:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: does the same job as Thread 1.
Current code logic, the refcount is 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: "inode = igrab(inode)" (gets inode from array, refcount +1)
With the patch, the refcount is also 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (sets array, refcount +1)
Thread 2: _ocfs2_get_system_file_inode (refcount +1)
In theory, _ocfs2_get_system_file_inode() should only be called once after mount.
The performance penalty in the current ocfs2_get_system_file_inode() comes from
doing "inode = igrab(inode)" while holding the mutex lock.
- Heming
>
> In my opinion, the problem with the current code is that the scope of
> mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be
> held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted
> this point in my initial review comment on the patch.
>
> Thanks,
> Heming
>
> >
> > 3. Standard filesystems do not use a global mutex for this:
> > Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
> > xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
> > iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
> > redundant global lock that heavily pollutes the lock dependency graph, triggering
> > possible deadlock warnings that block us from testing and fixing genuine deadlocks.
> >
> > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
> > mutex is completely redundant and should be removed.
> >
> > Regards.
> >
next prev parent reply other threads:[~2026-05-15 23:55 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
2025-06-24 1:33 ` Heming Zhao
2025-06-24 1:55 ` Tetsuo Handa
2025-06-24 2:51 ` Heming Zhao
2025-06-24 2:17 ` Tetsuo Handa
2025-06-24 2:40 ` Heming Zhao
2025-06-24 3:05 ` Tetsuo Handa
2026-05-14 7:09 ` Tetsuo Handa
2026-05-15 15:35 ` Heming Zhao
2026-05-15 15:51 ` Tetsuo Handa
2026-05-15 23:56 ` Heming Zhao
2026-05-15 23:53 ` Heming Zhao [this message]
2026-05-16 5:52 ` Tetsuo Handa
2026-05-16 12:27 ` Heming Zhao
2026-05-16 13:10 ` Tetsuo Handa
2026-05-18 2:52 ` Heming Zhao
2026-05-18 4:23 ` [PATCH v2] " Tetsuo Handa
2026-05-18 4:56 ` Heming Zhao
2026-05-18 6:20 ` Joseph Qi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ageuwl9o3nzktMrz@c73 \
--to=heming.zhao@suse.com \
--cc=akpm@linux-foundation.org \
--cc=jiangyiwen@huawei.com \
--cc=jlbec@evilplan.org \
--cc=joseph.qi@linux.alibaba.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mark@fasheh.com \
--cc=ocfs2-devel@lists.linux.dev \
--cc=penguin-kernel@i-love.sakura.ne.jp \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox