Re: [PATCH] ocfs2: kill osb->system_file_mutex lock

The Linux Kernel Mailing List

From: Heming Zhao <heming.zhao@suse.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mark Fasheh <mark@fasheh.com>, Joel Becker <jlbec@evilplan.org>,
	 Joseph Qi <joseph.qi@linux.alibaba.com>,
	jiangyiwen <jiangyiwen@huawei.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	ocfs2-devel@lists.linux.dev, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
Date: Sat, 16 May 2026 07:53:52 +0800
Message-ID: <ageuwl9o3nzktMrz@c73>
In-Reply-To: <agc8kA2wxC_7091p@c73>

On Fri, May 15, 2026 at 11:35:13PM +0800, Heming Zhao wrote:
> On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote:
> > Hi Heming,
> > 
> > I would like to clarify why the expectation of "being called only once" is logically
> > incorrect, reply to your concern regarding the reference count leak, and explain why
> > this patch is completely safe and sufficient.
> > 
> > 1. get_local_system_inode() can fail under memory pressure:
> >    get_local_system_inode() allocates memory internally. Under heavy memory pressure,
> >    this allocation can fail and return NULL. When this happens, the caller
> >    ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
> >    again to read the inode from disk. Therefore, the filesystem design must inherently
> >    support multiple calls to _ocfs2_get_system_file_inode().
> > 
> > 2. Why cmpxchg() is sufficient and safe without the mutex:
> >    The only reason the system_file_mutex was needed was to prevent a race where two
> >    threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
> >    pointer (since the underlying VFS iget_locked() returns the identical address for
> >    the same slot), and both mistakenly invoke igrab() on it, leading to a reference
> >    count leak.
> > 
> >    This patch perfectly solves that race condition by using cmpxchg() on the target
> >    pointer array slot:
> > 
> >      * The thread that wins the cmpxchg() successfully initializes the slot with the
> >        fetched inode and gets the extra refcount because it is the first to store
> >        into the slot.
> > 
> >      * The thread that loses the cmpxchg() detects that another thread has already
> >        initialized the slot with the exact same inode. The loser thread returns
> >        without getting the extra refcount because it is not the first to store
> >        into the slot.
> > 
> >    Therefore, the reference counting contract is strictly and atomically maintained.
> >    No references are leaked, and the array slot is never corrupted.
> 
> Hi,
> 
> The logic here is incorrect. The purpose of the refcount is to track how many
> consumers are using the inode.
> 
> In the original code, if two threads concurrently access ocfs2_get_system_file_inode()
> while the inode is uninitialized, inode->i_count would ultimately be incremented
> by 3. However, with your patch, i_count will only be incremented by 2.
> 
> To be more specific:
> Your patch explicitly triggers a race condition: when the target local_system
> inode is uninitialized and two threads enter simultaneously, Thread 1 wins the
> cmpxchg() and increments the refcount before exiting. Thread 2, however, loses
> the refcount increment simply because the atomic operation failed.
> 
> By the way, the issue addressed in commit 43b10a20372d was that after two concurrent
> threads returned, inode->i_count ended up being 4 when the correct value should
> have been 3. With your patch, the value will end up being 2, which is insufficient.

My above analysis contains a mistake.
With the patch, the refcount is also 3. However, I don't think the code logic is
correct.

Before commit 43b10a20372d, the refcount was 4:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: does the same job as Thread 1.

With the current code, the refcount is 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: "inode = igrab(inode)" (gets inode from array, refcount +1)

With the patch, the refcount is also 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (sets array, refcount +1)
Thread 2: _ocfs2_get_system_file_inode (refcount +1)
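
The three cases above can be checked with a small userspace model. This is only
a sketch, not the ocfs2 code: struct toy_inode, toy_lookup(), toy_igrab() and the
refcount_*() helpers are hypothetical names, C11 atomics stand in for the
kernel's cmpxchg(), and the two "threads" are replayed sequentially in the
worst-case order. It also assumes the patch takes the extra reference only after
winning the cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for struct inode: only the reference count matters here. */
struct toy_inode {
	atomic_int i_count;
};

/* Models _ocfs2_get_system_file_inode(): the lookup returns one reference. */
static struct toy_inode *toy_lookup(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_count, 1);
	return inode;
}

/* Models igrab(): takes one more reference on a live inode. */
static struct toy_inode *toy_igrab(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_count, 1);
	return inode;
}

/* Before commit 43b10a20372d: both threads see an empty slot and store. */
static int refcount_before_fix(void)
{
	struct toy_inode inode = { 0 };
	struct toy_inode *slot = NULL;
	struct toy_inode *t1 = toy_lookup(&inode);	/* Thread 1: +1 */
	struct toy_inode *t2 = toy_lookup(&inode);	/* Thread 2: +1 */

	slot = toy_igrab(t1);				/* Thread 1: +1 */
	slot = toy_igrab(t2);				/* Thread 2: +1 (leaked) */
	assert(slot == &inode);
	return atomic_load(&inode.i_count);
}

/* Current code: the mutex serializes the threads, so Thread 2 sees the slot. */
static int refcount_current(void)
{
	struct toy_inode inode = { 0 };
	struct toy_inode *slot = NULL;
	struct toy_inode *t1 = toy_lookup(&inode);	/* Thread 1: +1 */
	struct toy_inode *t2;

	slot = toy_igrab(t1);				/* Thread 1: +1 */
	t2 = toy_igrab(slot);				/* Thread 2: +1 */
	assert(t2 == &inode);
	return atomic_load(&inode.i_count);
}

/* With the patch (as assumed here): only the cmpxchg() winner calls igrab(). */
static int refcount_with_cmpxchg(void)
{
	struct toy_inode inode = { 0 };
	_Atomic(struct toy_inode *) slot = NULL;
	struct toy_inode *t1 = toy_lookup(&inode);	/* Thread 1: +1 */
	struct toy_inode *t2 = toy_lookup(&inode);	/* Thread 2: +1 */
	struct toy_inode *expected = NULL;

	if (atomic_compare_exchange_strong(&slot, &expected, t1))
		toy_igrab(t1);				/* winner: +1 */
	expected = NULL;
	if (atomic_compare_exchange_strong(&slot, &expected, t2))
		toy_igrab(t2);				/* loser: not reached */
	return atomic_load(&inode.i_count);
}
```

Each helper returns the final count held on the toy inode: 4, 3 and 3
respectively, matching the accounting above.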

In theory, _ocfs2_get_system_file_inode() should only be called once after mount.
The performance penalty in the current ocfs2_get_system_file_inode() comes from
doing "inode = igrab(inode)" while holding the mutex lock.

- Heming
> 
> In my opinion, the problem with the current code is that the scope of
> mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be
> held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted
> this point in my initial review comment on the patch.
> 
> Thanks,
> Heming
> 
> > 
> > 3. Standard filesystems do not use a global mutex for this:
> >    Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
> >    xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
> >    iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
> >    redundant global lock that heavily pollutes the lock dependency graph, triggering
> >    possible deadlock warnings that block us from testing and fixing genuine deadlocks.
> > 
> > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
> > mutex is completely redundant and should be removed.
> > 
> > Regards.
> >

Thread overview: 19+ messages
2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
2025-06-24  1:33 ` Heming Zhao
2025-06-24  1:55   ` Tetsuo Handa
2025-06-24  2:51     ` Heming Zhao
2025-06-24  2:17   ` Tetsuo Handa
2025-06-24  2:40     ` Heming Zhao
2025-06-24  3:05       ` Tetsuo Handa
2026-05-14  7:09 ` Tetsuo Handa
2026-05-15 15:35   ` Heming Zhao
2026-05-15 15:51     ` Tetsuo Handa
2026-05-15 23:56       ` Heming Zhao
2026-05-15 23:53     ` Heming Zhao [this message]
2026-05-16  5:52       ` Tetsuo Handa
2026-05-16 12:27         ` Heming Zhao
2026-05-16 13:10           ` Tetsuo Handa
2026-05-18  2:52             ` Heming Zhao
2026-05-18  4:23               ` [PATCH v2] " Tetsuo Handa
2026-05-18  4:56                 ` Heming Zhao
2026-05-18  6:20                   ` Joseph Qi