All of lore.kernel.org
 help / color / mirror / Atom feed
From: Heming Zhao <heming.zhao@suse.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mark Fasheh <mark@fasheh.com>, Joel Becker <jlbec@evilplan.org>,
	 Joseph Qi <joseph.qi@linux.alibaba.com>,
	jiangyiwen <jiangyiwen@huawei.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	ocfs2-devel@lists.linux.dev, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
Date: Sat, 16 May 2026 07:53:52 +0800	[thread overview]
Message-ID: <ageuwl9o3nzktMrz@c73> (raw)
In-Reply-To: <agc8kA2wxC_7091p@c73>

On Fri, May 15, 2026 at 11:35:13PM +0800, Heming Zhao wrote:
> On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote:
> > Hi Heming,
> > 
> > I would like to clarify why the expectation of "being called only once" is logically
> > incorrect, reply to your concern regarding the reference count leak and explain why
> > this patch is completely safe and sufficient.
> > 
> > 1. get_local_system_inode() can fail under memory pressure:
> >    get_local_system_inode() allocates memory internally. Under heavy memory pressure,
> >    this allocation can fail and return NULL. When this happens, the caller
> >    ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
> >    again to read the inode from disk. Therefore, the filesystem design must inherently
> >    support multiple calls to _ocfs2_get_system_file_inode().
> > 
> > 2. Why cmpxchg() is sufficient and safe without the mutex:
> >    The only thing the system_file_mutex is needed was to prevent a race where two
> >    threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
> >    pointer (since the underlying VFS iget_locked() returns the identical address for
> >    the same slot), and both mistakenly invoke igrab() on it, leading to a reference
> >    count leak.
> > 
> >    This patch perfectly solves that race condition by using cmpxchg() on the target
> >    pointer array slot:
> > 
> >      * The thread that wins the cmpxchg() successfully initializes the slot with the
> >        fetched inode and get the extra refcount because it is the first time to store
> >        into the slot.
> > 
> >      * The thread that loses the cmpxchg() detects that another thread has already
> >        initialized the slot with the exact same inode. The loser thread returns
> >        without getting the extra refcount because it is not the first time to store
> >        into the slot.
> > 
> >    Therefore, the reference counting contract is strictly and atomically maintained.
> >    No references are leaked, and the array slot is never corrupted.
> 
> Hi,
> 
> The logic here is incorrect. The purpose of the refcount is to track how many
> consumers are using the inode.
> 
> In the original code, if two threads concurrently access ocfs2_get_system_file_inode()
> while the inode is uninitialized, inode->i_count would ultimately be incremented
> by 3. However, with your patch, i_count will only be incremented by 2.
> 
> To be more specific:
> Your patch explicitly triggers a race condition: when the target local_system
> inode is uninitialized and two threads enter simultaneously, Thread 1 wins the
> cmpxchg() and increments the refcount before exiting. Thread 2, however, loses
> the refcount increment simply because the atomic operation failed.
> 
> btw, The issue addressed in commit 43b10a20372d was that after two concurrent
> threads returned, inode->i_count ended up being 4 when the correct value should
> have been 3. With your patch, the value will end up being 2, which is insufficient.

My above analysis contains a mistake.
With the patch, the refcount is also 3. However, I don't think the code logic is
correct.

Before commit 43b10a20372d, the refcount was 4:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: does the same job as Thread 1.

Current code logic, the refcount is 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: "inode = igrab(inode)" (gets inode from array, refcount +1)

With the patch, the refcount is also 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (sets array, refcount +1)
Thread 2: _ocfs2_get_system_file_inode (refcount +1)

In theory, _ocfs2_get_system_file_inode() should only be called once after mount.
The performance penalty in the current ocfs2_get_system_file_inode() comes from
doing "inode = igrab(inode)" while holding the mutex lock.

- Heming
> 
> In my opinion, the problem with the current code is that the scope of
> mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be
> held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted
> this point in my initial review comment on the patch.
> 
> Thanks,
> Heming
> 
> > 
> > 3. Standard filesystems do not use a global mutex for this:
> >    Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
> >    xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
> >    iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
> >    redundant global lock that heavily pollutes the lock dependency graph, triggering
> >    possible deadlock warnings that block us from testing and fixing genuine deadlocks.
> > 
> > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
> > mutex is completely redundant and should be removed.
> > 
> > Regards.
> > 

      parent reply	other threads:[~2026-05-15 23:55 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
2025-06-24  1:33 ` Heming Zhao
2025-06-24  1:55   ` Tetsuo Handa
2025-06-24  2:51     ` Heming Zhao
2025-06-24  2:17   ` Tetsuo Handa
2025-06-24  2:40     ` Heming Zhao
2025-06-24  3:05       ` Tetsuo Handa
2026-05-14  7:09 ` Tetsuo Handa
2026-05-15 15:35   ` Heming Zhao
2026-05-15 15:51     ` Tetsuo Handa
2026-05-15 23:56       ` Heming Zhao
2026-05-15 23:53     ` Heming Zhao [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ageuwl9o3nzktMrz@c73 \
    --to=heming.zhao@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=jiangyiwen@huawei.com \
    --cc=jlbec@evilplan.org \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark@fasheh.com \
    --cc=ocfs2-devel@lists.linux.dev \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.