[PATCH] ocfs2: kill osb->system_file

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] ocfs2: kill osb->system_file_mutex lock
@ 2025-06-21 15:56 Tetsuo Handa
  2025-06-24  1:33 ` Heming Zhao
  2026-05-14  7:09 ` Tetsuo Handa
  0 siblings, 2 replies; 12+ messages in thread
From: Tetsuo Handa @ 2025-06-21 15:56 UTC (permalink / raw)
  To: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel
  Cc: LKML

Since calling _ocfs2_get_system_file_inode() twice with the same
arguments returns the same address, there is no need to serialize
_ocfs2_get_system_file_inode() using osb->system_file_mutex lock.

Kill osb->system_file_mutex lock in order to avoid AB-BA deadlock.
cmpxchg() will be sufficient for avoiding the inode refcount leak
problem which commit 43b10a20372d ("ocfs2: avoid system inode ref
confusion by adding mutex lock") tried to address.

Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Closes: https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com
Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/ocfs2.h   | 2 --
 fs/ocfs2/super.c   | 2 --
 fs/ocfs2/sysfile.c | 9 +++------
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 6aaa94c554c1..8bdeea60742a 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -494,8 +494,6 @@ struct ocfs2_super
 	struct rb_root	osb_rf_lock_tree;
 	struct ocfs2_refcount_tree *osb_ref_tree_lru;
 
-	struct mutex system_file_mutex;
-
 	/*
 	 * OCFS2 needs to schedule several different types of work which
 	 * require cluster locking, disk I/O, recovery waits, etc. Since these
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 3d2533950bae..4461daf909cf 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1997,8 +1997,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	spin_lock_init(&osb->osb_xattr_lock);
 	ocfs2_init_steal_slots(osb);
 
-	mutex_init(&osb->system_file_mutex);
-
 	atomic_set(&osb->alloc_stats.moves, 0);
 	atomic_set(&osb->alloc_stats.local_data, 0);
 	atomic_set(&osb->alloc_stats.bitmap_data, 0);
diff --git a/fs/ocfs2/sysfile.c b/fs/ocfs2/sysfile.c
index 53a945da873b..b63af8d64904 100644
--- a/fs/ocfs2/sysfile.c
+++ b/fs/ocfs2/sysfile.c
@@ -98,11 +98,9 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
 	} else
 		arr = get_local_system_inode(osb, type, slot);
 
-	mutex_lock(&osb->system_file_mutex);
 	if (arr && ((inode = *arr) != NULL)) {
 		/* get a ref in addition to the array ref */
 		inode = igrab(inode);
-		mutex_unlock(&osb->system_file_mutex);
 		BUG_ON(!inode);
 
 		return inode;
@@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
 	inode = _ocfs2_get_system_file_inode(osb, type, slot);
 
 	/* add one more if putting into array for first time */
-	if (arr && inode) {
-		*arr = igrab(inode);
-		BUG_ON(!*arr);
+	if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) {
+		inode = igrab(inode);
+		BUG_ON(!inode);
 	}
-	mutex_unlock(&osb->system_file_mutex);
 	return inode;
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
@ 2025-06-24  1:33 ` Heming Zhao
  2025-06-24  1:55   ` Tetsuo Handa
  2025-06-24  2:17   ` Tetsuo Handa
  2026-05-14  7:09 ` Tetsuo Handa
  1 sibling, 2 replies; 12+ messages in thread
From: Heming Zhao @ 2025-06-24  1:33 UTC (permalink / raw)
  To: Tetsuo Handa, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen,
	Andrew Morton, ocfs2-devel
  Cc: LKML

Hello,

Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d
did it properly.
However, I don't see how your patch fixes the syzbot report [1]. Could you
elaborate on the root cause analysis?

My review comments are inline below.

[1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b

On 6/21/25 23:56, Tetsuo Handa wrote:
> Since calling _ocfs2_get_system_file_inode() twice with the same
> arguments returns the same address, there is no need to serialize
> _ocfs2_get_system_file_inode() using osb->system_file_mutex lock.
> 
> Kill osb->system_file_mutex lock in order to avoid AB-BA deadlock.
> cmpxchg() will be sufficient for avoiding the inode refcount leak
> problem which commit 43b10a20372d ("ocfs2: avoid system inode ref
> confusion by adding mutex lock") tried to address.
> 
> Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b

> Closes: https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com
I don't think we need 'Closes'.

> Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: jiangyiwen <jiangyiwen@huawei.com>
> Cc: Joseph Qi <joseph.qi@huawei.com>
> Cc: Joel Becker <jlbec@evilplan.org>
> Cc: Mark Fasheh <mfasheh@suse.com>
The 'CC's are also useless.

> ---
>   fs/ocfs2/ocfs2.h   | 2 --
>   fs/ocfs2/super.c   | 2 --
>   fs/ocfs2/sysfile.c | 9 +++------
>   3 files changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
> index 6aaa94c554c1..8bdeea60742a 100644
> --- a/fs/ocfs2/ocfs2.h
> +++ b/fs/ocfs2/ocfs2.h
> @@ -494,8 +494,6 @@ struct ocfs2_super
>   	struct rb_root	osb_rf_lock_tree;
>   	struct ocfs2_refcount_tree *osb_ref_tree_lru;
>   
> -	struct mutex system_file_mutex;
> -
>   	/*
>   	 * OCFS2 needs to schedule several different types of work which
>   	 * require cluster locking, disk I/O, recovery waits, etc. Since these
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 3d2533950bae..4461daf909cf 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -1997,8 +1997,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
>   	spin_lock_init(&osb->osb_xattr_lock);
>   	ocfs2_init_steal_slots(osb);
>   
> -	mutex_init(&osb->system_file_mutex);
> -
>   	atomic_set(&osb->alloc_stats.moves, 0);
>   	atomic_set(&osb->alloc_stats.local_data, 0);
>   	atomic_set(&osb->alloc_stats.bitmap_data, 0);
> diff --git a/fs/ocfs2/sysfile.c b/fs/ocfs2/sysfile.c
> index 53a945da873b..b63af8d64904 100644
> --- a/fs/ocfs2/sysfile.c
> +++ b/fs/ocfs2/sysfile.c
> @@ -98,11 +98,9 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
>   	} else
>   		arr = get_local_system_inode(osb, type, slot);
>   
> -	mutex_lock(&osb->system_file_mutex);
>   	if (arr && ((inode = *arr) != NULL)) {
>   		/* get a ref in addition to the array ref */
>   		inode = igrab(inode);
> -		mutex_unlock(&osb->system_file_mutex);
>   		BUG_ON(!inode);
>   

I agree the above mutex_lock and mutex_unlock is useless. we can remove it
without any problem.

>   		return inode;
> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
>   	inode = _ocfs2_get_system_file_inode(osb, type, slot);

In my view, the key of commit 43b10a20372d is to avoid calling
_ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to
do refcnt-1.

>   
>   	/* add one more if putting into array for first time */
> -	if (arr && inode) {
> -		*arr = igrab(inode);
> -		BUG_ON(!*arr);
> +	if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) {

Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1
before returning to the caller.

> +		inode = igrab(inode);
> +		BUG_ON(!inode);
>   	}
> -	mutex_unlock(&osb->system_file_mutex);
>   	return inode;
>   }
>   


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-24  1:33 ` Heming Zhao
@ 2025-06-24  1:55   ` Tetsuo Handa
  2025-06-24  2:51     ` Heming Zhao
  2025-06-24  2:17   ` Tetsuo Handa
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2025-06-24  1:55 UTC (permalink / raw)
  To: Heming Zhao
  Cc: LKML, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen,
	Andrew Morton, ocfs2-devel, Diogo Jahchan Koike

On 2025/06/24 10:33, Heming Zhao wrote:
> Hello,
> 
> Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d
> did it properly.
> However, I don't see how your patch fixes the syzbot report [1]. Could you
> elaborate on the root cause analysis?
> 
> My review comments are inline below.
> 
> [1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b

My patch does not fix [1]. My patch fixes a bug which syzbot reported at
https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com
when testing with Diogo's patch at
https://syzkaller.appspot.com/x/patch.diff?x=178f93d5980000 for [1].

>> Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
> 'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b

Since there is not yet a bug link for my patch, I don't choose syzbot as reporter.
Diogo will post a formal patch for fixing [1] after returning from vacation.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-24  1:55   ` Tetsuo Handa
@ 2025-06-24  2:51     ` Heming Zhao
  0 siblings, 0 replies; 12+ messages in thread
From: Heming Zhao @ 2025-06-24  2:51 UTC (permalink / raw)
  To: Joseph Qi, Tetsuo Handa
  Cc: LKML, Mark Fasheh, Joel Becker, jiangyiwen, Andrew Morton,
	ocfs2-devel, Diogo Jahchan Koike

Hello Joseph,

Do you agree that we need to add a rule for the ocfs2 read/write (IO) path?
- When an ocfs2 volume is in the process of mounting, all write
   operations must fail immediately.

- Heming

On 6/24/25 09:55, Tetsuo Handa wrote:
> On 2025/06/24 10:33, Heming Zhao wrote:
>> Hello,
>>
>> Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d
>> did it properly.
>> However, I don't see how your patch fixes the syzbot report [1]. Could you
>> elaborate on the root cause analysis?
>>
>> My review comments are inline below.
>>
>> [1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b
> 
> My patch does not fix [1]. My patch fixes a bug which syzbot reported at
> https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com
> when testing with Diogo's patch at
> https://syzkaller.appspot.com/x/patch.diff?x=178f93d5980000 for [1].
> 
>>> Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
>> 'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b
> 
> Since there is not yet a bug link for my patch, I don't choose syzbot as reporter.
> Diogo will post a formal patch for fixing [1] after returning from vacation.
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-24  1:33 ` Heming Zhao
  2025-06-24  1:55   ` Tetsuo Handa
@ 2025-06-24  2:17   ` Tetsuo Handa
  2025-06-24  2:40     ` Heming Zhao
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2025-06-24  2:17 UTC (permalink / raw)
  To: Heming Zhao, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen,
	Andrew Morton, ocfs2-devel, Diogo Jahchan Koike
  Cc: LKML

On 2025/06/24 10:33, Heming Zhao wrote:
>> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
>>       inode = _ocfs2_get_system_file_inode(osb, type, slot);
> 
> In my view, the key of commit 43b10a20372d is to avoid calling
> _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to
> do refcnt-1.

My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself
is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling
iput().

The problem commit 43b10a20372d fixed is that there was no mechanism to avoid
concurrently calling

  *arr = igrab(inode);

which will result in failing to call iput() for raced references when
ocfs2_release_system_inodes() is called.

> 
>>         /* add one more if putting into array for first time */
>> -    if (arr && inode) {
>> -        *arr = igrab(inode);
>> -        BUG_ON(!*arr);
>> +    if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) {
> 
> Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1
> before returning to the caller.
> 
>> +        inode = igrab(inode);

We do refcnt+1 immediately after cmpxchg() succeeds, for
ocfs2_release_system_inodes() which clears *arr is the place for
doing refcnt-1.

>> +        BUG_ON(!inode);
>>       }
>> -    mutex_unlock(&osb->system_file_mutex);
>>       return inode;
>>   }
>>   
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-24  2:17   ` Tetsuo Handa
@ 2025-06-24  2:40     ` Heming Zhao
  2025-06-24  3:05       ` Tetsuo Handa
  0 siblings, 1 reply; 12+ messages in thread
From: Heming Zhao @ 2025-06-24  2:40 UTC (permalink / raw)
  To: Tetsuo Handa, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen,
	Andrew Morton, ocfs2-devel, Diogo Jahchan Koike
  Cc: LKML

On 6/24/25 10:17, Tetsuo Handa wrote:
> On 2025/06/24 10:33, Heming Zhao wrote:
>>> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
>>>        inode = _ocfs2_get_system_file_inode(osb, type, slot);
>>
>> In my view, the key of commit 43b10a20372d is to avoid calling
>> _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to
>> do refcnt-1.
> 
> My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself
> is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling
> iput().

We have different perspectives on calling _ocfs2_get_system_file_inode().
In the current code logic, _ocfs2_get_system_file_inode() is expected to
be called only once. Subsequent local system inodes will be retrieved from
the cache (via get_local_system_inode()).

> 
> The problem commit 43b10a20372d fixed is that there was no mechanism to avoid
> concurrently calling
> 
>    *arr = igrab(inode);
> 
> which will result in failing to call iput() for raced references when
> ocfs2_release_system_inodes() is called.
> 
>>
>>>          /* add one more if putting into array for first time */
>>> -    if (arr && inode) {
>>> -        *arr = igrab(inode);
>>> -        BUG_ON(!*arr);
>>> +    if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) {
>>
>> Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1
>> before returning to the caller.
>>
>>> +        inode = igrab(inode);
> 
> We do refcnt+1 immediately after cmpxchg() succeeds, for
> ocfs2_release_system_inodes() which clears *arr is the place for
> doing refcnt-1.
> 
>>> +        BUG_ON(!inode);
>>>        }
>>> -    mutex_unlock(&osb->system_file_mutex);
>>>        return inode;
>>>    }
>>>    
>>
> 

In my view, your patch has logical errors - at least from my perspective,
I have to vote NAK.

In my view, for this syzbot bug, the better solution is to block/deny write
operations during the ocfs2 mounting phase.
There are many syzbot bugs related to writing data during the mounting phase.
I don't believe there is any reason a user would want to write data before the
filesystem is mounted.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-24  2:40     ` Heming Zhao
@ 2025-06-24  3:05       ` Tetsuo Handa
  0 siblings, 0 replies; 12+ messages in thread
From: Tetsuo Handa @ 2025-06-24  3:05 UTC (permalink / raw)
  To: Heming Zhao, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen,
	Andrew Morton, ocfs2-devel, Diogo Jahchan Koike
  Cc: LKML

On 2025/06/24 11:40, Heming Zhao wrote:
> On 6/24/25 10:17, Tetsuo Handa wrote:
>> On 2025/06/24 10:33, Heming Zhao wrote:
>>>> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
>>>>        inode = _ocfs2_get_system_file_inode(osb, type, slot);
>>>
>>> In my view, the key of commit 43b10a20372d is to avoid calling
>>> _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to
>>> do refcnt-1.
>>
>> My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself
>> is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling
>> iput().
> 
> We have different perspectives on calling _ocfs2_get_system_file_inode().
> In the current code logic, _ocfs2_get_system_file_inode() is expected to
> be called only once. Subsequent local system inodes will be retrieved from
> the cache (via get_local_system_inode()).

That expectation is wrong. Since get_local_system_inode() involves memory allocation,
get_local_system_inode() might return NULL. Therefore, ocfs2_get_system_file_inode(),
which is the caller of get_local_system_inode(), has to be logically prepared for
calling _ocfs2_get_system_file_inode() for multiple times.

This cache is only for speeding lookups up.
This cache does not provide "lookup only once" requirement.

> 
> In my view, your patch has logical errors - at least from my perspective,
> I have to vote NAK.

If you NAK, you have to make sure that get_local_system_inode() never fails.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
  2025-06-24  1:33 ` Heming Zhao
@ 2026-05-14  7:09 ` Tetsuo Handa
  2026-05-15 15:35   ` Heming Zhao
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2026-05-14  7:09 UTC (permalink / raw)
  To: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel, Heming Zhao
  Cc: LKML

Hi Heming,

I would like to clarify why the expectation of "being called only once" is logically
incorrect, reply to your concern regarding the reference count leak and explain why
this patch is completely safe and sufficient.

1. get_local_system_inode() can fail under memory pressure:
   get_local_system_inode() allocates memory internally. Under heavy memory pressure,
   this allocation can fail and return NULL. When this happens, the caller
   ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
   again to read the inode from disk. Therefore, the filesystem design must inherently
   support multiple calls to _ocfs2_get_system_file_inode().

2. Why cmpxchg() is sufficient and safe without the mutex:
   The only thing the system_file_mutex is needed was to prevent a race where two
   threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
   pointer (since the underlying VFS iget_locked() returns the identical address for
   the same slot), and both mistakenly invoke igrab() on it, leading to a reference
   count leak.

   This patch perfectly solves that race condition by using cmpxchg() on the target
   pointer array slot:

     * The thread that wins the cmpxchg() successfully initializes the slot with the
       fetched inode and get the extra refcount because it is the first time to store
       into the slot.

     * The thread that loses the cmpxchg() detects that another thread has already
       initialized the slot with the exact same inode. The loser thread returns
       without getting the extra refcount because it is not the first time to store
       into the slot.

   Therefore, the reference counting contract is strictly and atomically maintained.
   No references are leaked, and the array slot is never corrupted.

3. Standard filesystems do not use a global mutex for this:
   Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
   xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
   iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
   redundant global lock that heavily pollutes the lock dependency graph, triggering
   possible deadlock warnings that block us from testing and fixing genuine deadlocks.

Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
mutex is completely redundant and should be removed.

Regards.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2026-05-14  7:09 ` Tetsuo Handa
@ 2026-05-15 15:35   ` Heming Zhao
  2026-05-15 15:51     ` Tetsuo Handa
  2026-05-15 23:53     ` Heming Zhao
  0 siblings, 2 replies; 12+ messages in thread
From: Heming Zhao @ 2026-05-15 15:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel, LKML

On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote:
> Hi Heming,
> 
> I would like to clarify why the expectation of "being called only once" is logically
> incorrect, reply to your concern regarding the reference count leak and explain why
> this patch is completely safe and sufficient.
> 
> 1. get_local_system_inode() can fail under memory pressure:
>    get_local_system_inode() allocates memory internally. Under heavy memory pressure,
>    this allocation can fail and return NULL. When this happens, the caller
>    ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
>    again to read the inode from disk. Therefore, the filesystem design must inherently
>    support multiple calls to _ocfs2_get_system_file_inode().
> 
> 2. Why cmpxchg() is sufficient and safe without the mutex:
>    The only thing the system_file_mutex is needed was to prevent a race where two
>    threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
>    pointer (since the underlying VFS iget_locked() returns the identical address for
>    the same slot), and both mistakenly invoke igrab() on it, leading to a reference
>    count leak.
> 
>    This patch perfectly solves that race condition by using cmpxchg() on the target
>    pointer array slot:
> 
>      * The thread that wins the cmpxchg() successfully initializes the slot with the
>        fetched inode and get the extra refcount because it is the first time to store
>        into the slot.
> 
>      * The thread that loses the cmpxchg() detects that another thread has already
>        initialized the slot with the exact same inode. The loser thread returns
>        without getting the extra refcount because it is not the first time to store
>        into the slot.
> 
>    Therefore, the reference counting contract is strictly and atomically maintained.
>    No references are leaked, and the array slot is never corrupted.

Hi,

The logic here is incorrect. The purpose of the refcount is to track how many
consumers are using the inode.

In the original code, if two threads concurrently access ocfs2_get_system_file_inode()
while the inode is uninitialized, inode->i_count would ultimately be incremented
by 3. However, with your patch, i_count will only be incremented by 2.

To be more specific:
Your patch explicitly triggers a race condition: when the target local_system
inode is uninitialized and two threads enter simultaneously, Thread 1 wins the
cmpxchg() and increments the refcount before exiting. Thread 2, however, loses
the refcount increment simply because the atomic operation failed.

btw, The issue addressed in commit 43b10a20372d was that after two concurrent
threads returned, inode->i_count ended up being 4 when the correct value should
have been 3. With your patch, the value will end up being 2, which is insufficient.

In my opinion, the problem with the current code is that the scope of
mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be
held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted
this point in my initial review comment on the patch.

Thanks,
Heming

> 
> 3. Standard filesystems do not use a global mutex for this:
>    Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
>    xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
>    iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
>    redundant global lock that heavily pollutes the lock dependency graph, triggering
>    possible deadlock warnings that block us from testing and fixing genuine deadlocks.
> 
> Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
> mutex is completely redundant and should be removed.
> 
> Regards.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2026-05-15 15:35   ` Heming Zhao
@ 2026-05-15 15:51     ` Tetsuo Handa
  2026-05-15 23:56       ` Heming Zhao
  2026-05-15 23:53     ` Heming Zhao
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2026-05-15 15:51 UTC (permalink / raw)
  To: Heming Zhao
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel, LKML

On 2026/05/16 0:35, Heming Zhao wrote:
> Hi,
> 
> The logic here is incorrect. The purpose of the refcount is to track how many
> consumers are using the inode.

The igrab(inode) for the first time is for getting a refcount for the slot, isn't it?
The igrab(inode) for the subsequent times is for getting a refcount which is supposed to be dropped by the caller, isn't it?
_ocfs2_get_system_file_inode() always gets a refcount which is supposed to be dropped by the caller when it succeeds, doesn't it?

So why do you want one more refcount?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2026-05-15 15:51     ` Tetsuo Handa
@ 2026-05-15 23:56       ` Heming Zhao
  0 siblings, 0 replies; 12+ messages in thread
From: Heming Zhao @ 2026-05-15 23:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel, LKML

On Sat, May 16, 2026 at 12:51:32AM +0900, Tetsuo Handa wrote:
> On 2026/05/16 0:35, Heming Zhao wrote:
> > Hi,
> > 
> > The logic here is incorrect. The purpose of the refcount is to track how many
> > consumers are using the inode.
> 
> The igrab(inode) for the first time is for getting a refcount for the slot, isn't it?
> The igrab(inode) for the subsequent times is for getting a refcount which is supposed to be dropped by the caller, isn't it?
> _ocfs2_get_system_file_inode() always gets a refcount which is supposed to be dropped by the caller when it succeeds, doesn't it?
> 
> So why do you want one more refcount?
> 

Hi,

Could you please reply inline? your current reply style makes it very hard for
others to follow. I made a mistake in the previous email, and wrote the analysis
to continue my previous email.

- Heming

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock
  2026-05-15 15:35   ` Heming Zhao
  2026-05-15 15:51     ` Tetsuo Handa
@ 2026-05-15 23:53     ` Heming Zhao
  1 sibling, 0 replies; 12+ messages in thread
From: Heming Zhao @ 2026-05-15 23:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
	ocfs2-devel, LKML

On Fri, May 15, 2026 at 11:35:13PM +0800, Heming Zhao wrote:
> On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote:
> > Hi Heming,
> > 
> > I would like to clarify why the expectation of "being called only once" is logically
> > incorrect, reply to your concern regarding the reference count leak and explain why
> > this patch is completely safe and sufficient.
> > 
> > 1. get_local_system_inode() can fail under memory pressure:
> >    get_local_system_inode() allocates memory internally. Under heavy memory pressure,
> >    this allocation can fail and return NULL. When this happens, the caller
> >    ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode()
> >    again to read the inode from disk. Therefore, the filesystem design must inherently
> >    support multiple calls to _ocfs2_get_system_file_inode().
> > 
> > 2. Why cmpxchg() is sufficient and safe without the mutex:
> >    The only thing the system_file_mutex is needed was to prevent a race where two
> >    threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode
> >    pointer (since the underlying VFS iget_locked() returns the identical address for
> >    the same slot), and both mistakenly invoke igrab() on it, leading to a reference
> >    count leak.
> > 
> >    This patch perfectly solves that race condition by using cmpxchg() on the target
> >    pointer array slot:
> > 
> >      * The thread that wins the cmpxchg() successfully initializes the slot with the
> >        fetched inode and get the extra refcount because it is the first time to store
> >        into the slot.
> > 
> >      * The thread that loses the cmpxchg() detects that another thread has already
> >        initialized the slot with the exact same inode. The loser thread returns
> >        without getting the extra refcount because it is not the first time to store
> >        into the slot.
> > 
> >    Therefore, the reference counting contract is strictly and atomically maintained.
> >    No references are leaked, and the array slot is never corrupted.
> 
> Hi,
> 
> The logic here is incorrect. The purpose of the refcount is to track how many
> consumers are using the inode.
> 
> In the original code, if two threads concurrently access ocfs2_get_system_file_inode()
> while the inode is uninitialized, inode->i_count would ultimately be incremented
> by 3. However, with your patch, i_count will only be incremented by 2.
> 
> To be more specific:
> Your patch explicitly triggers a race condition: when the target local_system
> inode is uninitialized and two threads enter simultaneously, Thread 1 wins the
> cmpxchg() and increments the refcount before exiting. Thread 2, however, loses
> the refcount increment simply because the atomic operation failed.
> 
> btw, The issue addressed in commit 43b10a20372d was that after two concurrent
> threads returned, inode->i_count ended up being 4 when the correct value should
> have been 3. With your patch, the value will end up being 2, which is insufficient.

My above analysis contains a mistake.
With the patch, the refcount is also 3. However, I don't think the code logic is
correct.

Before commit 43b10a20372d, the refcount was 4:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: does the same job as Thread 1.

Current code logic, the refcount is 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1)
Thread 2: "inode = igrab(inode)" (gets inode from array, refcount +1)

With the patch, the refcount is also 3:
Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (sets array, refcount +1)
Thread 2: _ocfs2_get_system_file_inode (refcount +1)

In theory, _ocfs2_get_system_file_inode() should only be called once after mount.
The performance penalty in the current ocfs2_get_system_file_inode() comes from
doing "inode = igrab(inode)" while holding the mutex lock.

- Heming
> 
> In my opinion, the problem with the current code is that the scope of
> mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be
> held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted
> this point in my initial review comment on the patch.
> 
> Thanks,
> Heming
> 
> > 
> > 3. Standard filesystems do not use a global mutex for this:
> >    Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's
> >    xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g.,
> >    iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a
> >    redundant global lock that heavily pollutes the lock dependency graph, triggering
> >    possible deadlock warnings that block us from testing and fixing genuine deadlocks.
> > 
> > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global
> > mutex is completely redundant and should be removed.
> > 
> > Regards.
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-05-15 23:56 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa
2025-06-24  1:33 ` Heming Zhao
2025-06-24  1:55   ` Tetsuo Handa
2025-06-24  2:51     ` Heming Zhao
2025-06-24  2:17   ` Tetsuo Handa
2025-06-24  2:40     ` Heming Zhao
2025-06-24  3:05       ` Tetsuo Handa
2026-05-14  7:09 ` Tetsuo Handa
2026-05-15 15:35   ` Heming Zhao
2026-05-15 15:51     ` Tetsuo Handa
2026-05-15 23:56       ` Heming Zhao
2026-05-15 23:53     ` Heming Zhao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.