* [PATCH] ocfs2: kill osb->system_file_mutex lock
@ 2025-06-21 15:56 Tetsuo Handa
2025-06-24 1:33 ` Heming Zhao
2026-05-14 7:09 ` Tetsuo Handa
0 siblings, 2 replies; 12+ messages in thread
From: Tetsuo Handa @ 2025-06-21 15:56 UTC (permalink / raw)
To: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton,
ocfs2-devel
Cc: LKML
Since calling _ocfs2_get_system_file_inode() twice with the same
arguments returns the same address, there is no need to serialize
_ocfs2_get_system_file_inode() using osb->system_file_mutex lock.
Kill osb->system_file_mutex lock in order to avoid AB-BA deadlock.
cmpxchg() will be sufficient for avoiding the inode refcount leak
problem which commit 43b10a20372d ("ocfs2: avoid system inode ref
confusion by adding mutex lock") tried to address.
Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Closes: https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com
Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
---
fs/ocfs2/ocfs2.h | 2 --
fs/ocfs2/super.c | 2 --
fs/ocfs2/sysfile.c | 9 +++------
3 files changed, 3 insertions(+), 10 deletions(-)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 6aaa94c554c1..8bdeea60742a 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -494,8 +494,6 @@ struct ocfs2_super
struct rb_root osb_rf_lock_tree;
struct ocfs2_refcount_tree *osb_ref_tree_lru;
- struct mutex system_file_mutex;
-
/*
* OCFS2 needs to schedule several different types of work which
* require cluster locking, disk I/O, recovery waits, etc. Since these
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 3d2533950bae..4461daf909cf 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1997,8 +1997,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
spin_lock_init(&osb->osb_xattr_lock);
ocfs2_init_steal_slots(osb);
- mutex_init(&osb->system_file_mutex);
-
atomic_set(&osb->alloc_stats.moves, 0);
atomic_set(&osb->alloc_stats.local_data, 0);
atomic_set(&osb->alloc_stats.bitmap_data, 0);
diff --git a/fs/ocfs2/sysfile.c b/fs/ocfs2/sysfile.c
index 53a945da873b..b63af8d64904 100644
--- a/fs/ocfs2/sysfile.c
+++ b/fs/ocfs2/sysfile.c
@@ -98,11 +98,9 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
} else
arr = get_local_system_inode(osb, type, slot);
- mutex_lock(&osb->system_file_mutex);
if (arr && ((inode = *arr) != NULL)) {
/* get a ref in addition to the array ref */
inode = igrab(inode);
- mutex_unlock(&osb->system_file_mutex);
BUG_ON(!inode);
return inode;
@@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb,
inode = _ocfs2_get_system_file_inode(osb, type, slot);
/* add one more if putting into array for first time */
- if (arr && inode) {
- *arr = igrab(inode);
- BUG_ON(!*arr);
+ if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) {
+ inode = igrab(inode);
+ BUG_ON(!inode);
}
- mutex_unlock(&osb->system_file_mutex);
return inode;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 12+ messages in thread* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa @ 2025-06-24 1:33 ` Heming Zhao 2025-06-24 1:55 ` Tetsuo Handa 2025-06-24 2:17 ` Tetsuo Handa 2026-05-14 7:09 ` Tetsuo Handa 1 sibling, 2 replies; 12+ messages in thread From: Heming Zhao @ 2025-06-24 1:33 UTC (permalink / raw) To: Tetsuo Handa, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel Cc: LKML Hello, Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d did it properly. However, I don't see how your patch fixes the syzbot report [1]. Could you elaborate on the root cause analysis? My review comments are inline below. [1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b On 6/21/25 23:56, Tetsuo Handa wrote: > Since calling _ocfs2_get_system_file_inode() twice with the same > arguments returns the same address, there is no need to serialize > _ocfs2_get_system_file_inode() using osb->system_file_mutex lock. > > Kill osb->system_file_mutex lock in order to avoid AB-BA deadlock. > cmpxchg() will be sufficient for avoiding the inode refcount leak > problem which commit 43b10a20372d ("ocfs2: avoid system inode ref > confusion by adding mutex lock") tried to address. > > Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com> 'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b > Closes: https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com I don't think we need 'Closes'. > Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Cc: jiangyiwen <jiangyiwen@huawei.com> > Cc: Joseph Qi <joseph.qi@huawei.com> > Cc: Joel Becker <jlbec@evilplan.org> > Cc: Mark Fasheh <mfasheh@suse.com> The 'CC's are also useless. > --- > fs/ocfs2/ocfs2.h | 2 -- > fs/ocfs2/super.c | 2 -- > fs/ocfs2/sysfile.c | 9 +++------ > 3 files changed, 3 insertions(+), 10 deletions(-) > > diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h > index 6aaa94c554c1..8bdeea60742a 100644 > --- a/fs/ocfs2/ocfs2.h > +++ b/fs/ocfs2/ocfs2.h > @@ -494,8 +494,6 @@ struct ocfs2_super > struct rb_root osb_rf_lock_tree; > struct ocfs2_refcount_tree *osb_ref_tree_lru; > > - struct mutex system_file_mutex; > - > /* > * OCFS2 needs to schedule several different types of work which > * require cluster locking, disk I/O, recovery waits, etc. Since these > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c > index 3d2533950bae..4461daf909cf 100644 > --- a/fs/ocfs2/super.c > +++ b/fs/ocfs2/super.c > @@ -1997,8 +1997,6 @@ static int ocfs2_initialize_super(struct super_block *sb, > spin_lock_init(&osb->osb_xattr_lock); > ocfs2_init_steal_slots(osb); > > - mutex_init(&osb->system_file_mutex); > - > atomic_set(&osb->alloc_stats.moves, 0); > atomic_set(&osb->alloc_stats.local_data, 0); > atomic_set(&osb->alloc_stats.bitmap_data, 0); > diff --git a/fs/ocfs2/sysfile.c b/fs/ocfs2/sysfile.c > index 53a945da873b..b63af8d64904 100644 > --- a/fs/ocfs2/sysfile.c > +++ b/fs/ocfs2/sysfile.c > @@ -98,11 +98,9 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb, > } else > arr = get_local_system_inode(osb, type, slot); > > - mutex_lock(&osb->system_file_mutex); > if (arr && ((inode = *arr) != NULL)) { > /* get a ref in addition to the array ref */ > inode = igrab(inode); > - mutex_unlock(&osb->system_file_mutex); > BUG_ON(!inode); > I agree the above mutex_lock and mutex_unlock is useless. we can remove it without any problem. > return inode; > @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb, > inode = _ocfs2_get_system_file_inode(osb, type, slot); In my view, the key of commit 43b10a20372d is to avoid calling _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to do refcnt-1. > > /* add one more if putting into array for first time */ > - if (arr && inode) { > - *arr = igrab(inode); > - BUG_ON(!*arr); > + if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) { Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1 before returning to the caller. > + inode = igrab(inode); > + BUG_ON(!inode); > } > - mutex_unlock(&osb->system_file_mutex); > return inode; > } > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-24 1:33 ` Heming Zhao @ 2025-06-24 1:55 ` Tetsuo Handa 2025-06-24 2:51 ` Heming Zhao 2025-06-24 2:17 ` Tetsuo Handa 1 sibling, 1 reply; 12+ messages in thread From: Tetsuo Handa @ 2025-06-24 1:55 UTC (permalink / raw) To: Heming Zhao Cc: LKML, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, Diogo Jahchan Koike On 2025/06/24 10:33, Heming Zhao wrote: > Hello, > > Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d > did it properly. > However, I don't see how your patch fixes the syzbot report [1]. Could you > elaborate on the root cause analysis? > > My review comments are inline below. > > [1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b My patch does not fix [1]. My patch fixes a bug which syzbot reported at https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com when testing with Diogo's patch at https://syzkaller.appspot.com/x/patch.diff?x=178f93d5980000 for [1]. >> Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com> > 'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b Since there is not yet a bug link for my patch, I don't choose syzbot as reporter. Diogo will post a formal patch for fixing [1] after returning from vacation. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-24 1:55 ` Tetsuo Handa @ 2025-06-24 2:51 ` Heming Zhao 0 siblings, 0 replies; 12+ messages in thread From: Heming Zhao @ 2025-06-24 2:51 UTC (permalink / raw) To: Joseph Qi, Tetsuo Handa Cc: LKML, Mark Fasheh, Joel Becker, jiangyiwen, Andrew Morton, ocfs2-devel, Diogo Jahchan Koike Hello Joseph, Do you agree that we need to add a rule for the ocfs2 read/write (IO) path? - When an ocfs2 volume is in the process of mounting, all write operations must fail immediately. - Heming On 6/24/25 09:55, Tetsuo Handa wrote: > On 2025/06/24 10:33, Heming Zhao wrote: >> Hello, >> >> Protecting refcnt with a mutex is the right approach, and commit 43b10a20372d >> did it properly. >> However, I don't see how your patch fixes the syzbot report [1]. Could you >> elaborate on the root cause analysis? >> >> My review comments are inline below. >> >> [1]: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b > > My patch does not fix [1]. My patch fixes a bug which syzbot reported at > https://lkml.kernel.org/r/000000000000ff2d7a0620381afe@google.com > when testing with Diogo's patch at > https://syzkaller.appspot.com/x/patch.diff?x=178f93d5980000 for [1]. > >>> Reported-by: Diogo Jahchan Koike <djahchankoike@gmail.com> >> 'Reported-by' should be: https://syzkaller.appspot.com/bug?extid=1fed2de07d8e11a3ec1b > > Since there is not yet a bug link for my patch, I don't choose syzbot as reporter. > Diogo will post a formal patch for fixing [1] after returning from vacation. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-24 1:33 ` Heming Zhao 2025-06-24 1:55 ` Tetsuo Handa @ 2025-06-24 2:17 ` Tetsuo Handa 2025-06-24 2:40 ` Heming Zhao 1 sibling, 1 reply; 12+ messages in thread From: Tetsuo Handa @ 2025-06-24 2:17 UTC (permalink / raw) To: Heming Zhao, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, Diogo Jahchan Koike Cc: LKML On 2025/06/24 10:33, Heming Zhao wrote: >> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb, >> inode = _ocfs2_get_system_file_inode(osb, type, slot); > > In my view, the key of commit 43b10a20372d is to avoid calling > _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to > do refcnt-1. My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling iput(). The problem commit 43b10a20372d fixed is that there was no mechanism to avoid concurrently calling *arr = igrab(inode); which will result in failing to call iput() for raced references when ocfs2_release_system_inodes() is called. > >> /* add one more if putting into array for first time */ >> - if (arr && inode) { >> - *arr = igrab(inode); >> - BUG_ON(!*arr); >> + if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) { > > Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1 > before returning to the caller. > >> + inode = igrab(inode); We do refcnt+1 immediately after cmpxchg() succeeds, for ocfs2_release_system_inodes() which clears *arr is the place for doing refcnt-1. >> + BUG_ON(!inode); >> } >> - mutex_unlock(&osb->system_file_mutex); >> return inode; >> } >> > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-24 2:17 ` Tetsuo Handa @ 2025-06-24 2:40 ` Heming Zhao 2025-06-24 3:05 ` Tetsuo Handa 0 siblings, 1 reply; 12+ messages in thread From: Heming Zhao @ 2025-06-24 2:40 UTC (permalink / raw) To: Tetsuo Handa, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, Diogo Jahchan Koike Cc: LKML On 6/24/25 10:17, Tetsuo Handa wrote: > On 2025/06/24 10:33, Heming Zhao wrote: >>> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb, >>> inode = _ocfs2_get_system_file_inode(osb, type, slot); >> >> In my view, the key of commit 43b10a20372d is to avoid calling >> _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to >> do refcnt-1. > > My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself > is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling > iput(). We have different perspectives on calling _ocfs2_get_system_file_inode(). In the current code logic, _ocfs2_get_system_file_inode() is expected to be called only once. Subsequent local system inodes will be retrieved from the cache (via get_local_system_inode()). > > The problem commit 43b10a20372d fixed is that there was no mechanism to avoid > concurrently calling > > *arr = igrab(inode); > > which will result in failing to call iput() for raced references when > ocfs2_release_system_inodes() is called. > >> >>> /* add one more if putting into array for first time */ >>> - if (arr && inode) { >>> - *arr = igrab(inode); >>> - BUG_ON(!*arr); >>> + if (inode && arr && !*arr && !cmpxchg(&(*arr), NULL, inode)) { >> >> Bypassing the refcnt+1 here is not a good idea. We should do refcnt+1 >> before returning to the caller. >> >>> + inode = igrab(inode); > > We do refcnt+1 immediately after cmpxchg() succeeds, for > ocfs2_release_system_inodes() which clears *arr is the place for > doing refcnt-1. > >>> + BUG_ON(!inode); >>> } >>> - mutex_unlock(&osb->system_file_mutex); >>> return inode; >>> } >>> >> > In my view, your patch has logical errors - at least from my perspective, I have to vote NAK. In my view, for this syzbot bug, the better solution is to block/deny write operations during the ocfs2 mounting phase. There are many syzbot bugs related to writing data during the mounting phase. I don't believe there is any reason a user would want to write data before the filesystem is mounted. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-24 2:40 ` Heming Zhao @ 2025-06-24 3:05 ` Tetsuo Handa 0 siblings, 0 replies; 12+ messages in thread From: Tetsuo Handa @ 2025-06-24 3:05 UTC (permalink / raw) To: Heming Zhao, Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, Diogo Jahchan Koike Cc: LKML On 2025/06/24 11:40, Heming Zhao wrote: > On 6/24/25 10:17, Tetsuo Handa wrote: >> On 2025/06/24 10:33, Heming Zhao wrote: >>>> @@ -112,11 +110,10 @@ struct inode *ocfs2_get_system_file_inode(struct ocfs2_super *osb, >>>> inode = _ocfs2_get_system_file_inode(osb, type, slot); >>> >>> In my view, the key of commit 43b10a20372d is to avoid calling >>> _ocfs2_get_system_file_inode() twice, which lead refcnt+1 but no place to >>> do refcnt-1. >> >> My understanding is that concurrently calling _ocfs2_get_system_file_inode() itself >> is OK, for the caller of ocfs2_get_system_file_inode() is responsible for calling >> iput(). > > We have different perspectives on calling _ocfs2_get_system_file_inode(). > In the current code logic, _ocfs2_get_system_file_inode() is expected to > be called only once. Subsequent local system inodes will be retrieved from > the cache (via get_local_system_inode()). That expectation is wrong. Since get_local_system_inode() involves memory allocation, get_local_system_inode() might return NULL. Therefore, ocfs2_get_system_file_inode(), which is the caller of get_local_system_inode(), has to be logically prepared for calling _ocfs2_get_system_file_inode() for multiple times. This cache is only for speeding lookups up. This cache does not provide "lookup only once" requirement. > > In my view, your patch has logical errors - at least from my perspective, > I have to vote NAK. If you NAK, you have to make sure that get_local_system_inode() never fails. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa 2025-06-24 1:33 ` Heming Zhao @ 2026-05-14 7:09 ` Tetsuo Handa 2026-05-15 15:35 ` Heming Zhao 1 sibling, 1 reply; 12+ messages in thread From: Tetsuo Handa @ 2026-05-14 7:09 UTC (permalink / raw) To: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, Heming Zhao Cc: LKML Hi Heming, I would like to clarify why the expectation of "being called only once" is logically incorrect, reply to your concern regarding the reference count leak and explain why this patch is completely safe and sufficient. 1. get_local_system_inode() can fail under memory pressure: get_local_system_inode() allocates memory internally. Under heavy memory pressure, this allocation can fail and return NULL. When this happens, the caller ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode() again to read the inode from disk. Therefore, the filesystem design must inherently support multiple calls to _ocfs2_get_system_file_inode(). 2. Why cmpxchg() is sufficient and safe without the mutex: The only thing the system_file_mutex is needed was to prevent a race where two threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode pointer (since the underlying VFS iget_locked() returns the identical address for the same slot), and both mistakenly invoke igrab() on it, leading to a reference count leak. This patch perfectly solves that race condition by using cmpxchg() on the target pointer array slot: * The thread that wins the cmpxchg() successfully initializes the slot with the fetched inode and get the extra refcount because it is the first time to store into the slot. * The thread that loses the cmpxchg() detects that another thread has already initialized the slot with the exact same inode. The loser thread returns without getting the extra refcount because it is not the first time to store into the slot. Therefore, the reference counting contract is strictly and atomically maintained. No references are leaked, and the array slot is never corrupted. 3. Standard filesystems do not use a global mutex for this: Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g., iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a redundant global lock that heavily pollutes the lock dependency graph, triggering possible deadlock warnings that block us from testing and fixing genuine deadlocks. Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global mutex is completely redundant and should be removed. Regards. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2026-05-14 7:09 ` Tetsuo Handa @ 2026-05-15 15:35 ` Heming Zhao 2026-05-15 15:51 ` Tetsuo Handa 2026-05-15 23:53 ` Heming Zhao 0 siblings, 2 replies; 12+ messages in thread From: Heming Zhao @ 2026-05-15 15:35 UTC (permalink / raw) To: Tetsuo Handa Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, LKML On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote: > Hi Heming, > > I would like to clarify why the expectation of "being called only once" is logically > incorrect, reply to your concern regarding the reference count leak and explain why > this patch is completely safe and sufficient. > > 1. get_local_system_inode() can fail under memory pressure: > get_local_system_inode() allocates memory internally. Under heavy memory pressure, > this allocation can fail and return NULL. When this happens, the caller > ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode() > again to read the inode from disk. Therefore, the filesystem design must inherently > support multiple calls to _ocfs2_get_system_file_inode(). > > 2. Why cmpxchg() is sufficient and safe without the mutex: > The only thing the system_file_mutex is needed was to prevent a race where two > threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode > pointer (since the underlying VFS iget_locked() returns the identical address for > the same slot), and both mistakenly invoke igrab() on it, leading to a reference > count leak. > > This patch perfectly solves that race condition by using cmpxchg() on the target > pointer array slot: > > * The thread that wins the cmpxchg() successfully initializes the slot with the > fetched inode and get the extra refcount because it is the first time to store > into the slot. > > * The thread that loses the cmpxchg() detects that another thread has already > initialized the slot with the exact same inode. The loser thread returns > without getting the extra refcount because it is not the first time to store > into the slot. > > Therefore, the reference counting contract is strictly and atomically maintained. > No references are leaked, and the array slot is never corrupted. Hi, The logic here is incorrect. The purpose of the refcount is to track how many consumers are using the inode. In the original code, if two threads concurrently access ocfs2_get_system_file_inode() while the inode is uninitialized, inode->i_count would ultimately be incremented by 3. However, with your patch, i_count will only be incremented by 2. To be more specific: Your patch explicitly triggers a race condition: when the target local_system inode is uninitialized and two threads enter simultaneously, Thread 1 wins the cmpxchg() and increments the refcount before exiting. Thread 2, however, loses the refcount increment simply because the atomic operation failed. btw, The issue addressed in commit 43b10a20372d was that after two concurrent threads returned, inode->i_count ended up being 4 when the correct value should have been 3. With your patch, the value will end up being 2, which is insufficient. In my opinion, the problem with the current code is that the scope of mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted this point in my initial review comment on the patch. Thanks, Heming > > 3. Standard filesystems do not use a global mutex for this: > Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's > xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g., > iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a > redundant global lock that heavily pollutes the lock dependency graph, triggering > possible deadlock warnings that block us from testing and fixing genuine deadlocks. > > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global > mutex is completely redundant and should be removed. > > Regards. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2026-05-15 15:35 ` Heming Zhao @ 2026-05-15 15:51 ` Tetsuo Handa 2026-05-15 23:56 ` Heming Zhao 2026-05-15 23:53 ` Heming Zhao 1 sibling, 1 reply; 12+ messages in thread From: Tetsuo Handa @ 2026-05-15 15:51 UTC (permalink / raw) To: Heming Zhao Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, LKML On 2026/05/16 0:35, Heming Zhao wrote: > Hi, > > The logic here is incorrect. The purpose of the refcount is to track how many > consumers are using the inode. The igrab(inode) for the first time is for getting a refcount for the slot, isn't it? The igrab(inode) for the subsequent times is for getting a refcount which is supposed to be dropped by the caller, isn't it? _ocfs2_get_system_file_inode() always gets a refcount which is supposed to be dropped by the caller when it succeeds, doesn't it? So why do you want one more refcount? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2026-05-15 15:51 ` Tetsuo Handa @ 2026-05-15 23:56 ` Heming Zhao 0 siblings, 0 replies; 12+ messages in thread From: Heming Zhao @ 2026-05-15 23:56 UTC (permalink / raw) To: Tetsuo Handa Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, LKML On Sat, May 16, 2026 at 12:51:32AM +0900, Tetsuo Handa wrote: > On 2026/05/16 0:35, Heming Zhao wrote: > > Hi, > > > > The logic here is incorrect. The purpose of the refcount is to track how many > > consumers are using the inode. > > The igrab(inode) for the first time is for getting a refcount for the slot, isn't it? > The igrab(inode) for the subsequent times is for getting a refcount which is supposed to be dropped by the caller, isn't it? > _ocfs2_get_system_file_inode() always gets a refcount which is supposed to be dropped by the caller when it succeeds, doesn't it? > > So why do you want one more refcount? > Hi, Could you please reply inline? your current reply style makes it very hard for others to follow. I made a mistake in the previous email, and wrote the analysis to continue my previous email. - Heming ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: kill osb->system_file_mutex lock 2026-05-15 15:35 ` Heming Zhao 2026-05-15 15:51 ` Tetsuo Handa @ 2026-05-15 23:53 ` Heming Zhao 1 sibling, 0 replies; 12+ messages in thread From: Heming Zhao @ 2026-05-15 23:53 UTC (permalink / raw) To: Tetsuo Handa Cc: Mark Fasheh, Joel Becker, Joseph Qi, jiangyiwen, Andrew Morton, ocfs2-devel, LKML On Fri, May 15, 2026 at 11:35:13PM +0800, Heming Zhao wrote: > On Thu, May 14, 2026 at 04:09:25PM +0900, Tetsuo Handa wrote: > > Hi Heming, > > > > I would like to clarify why the expectation of "being called only once" is logically > > incorrect, reply to your concern regarding the reference count leak and explain why > > this patch is completely safe and sufficient. > > > > 1. get_local_system_inode() can fail under memory pressure: > > get_local_system_inode() allocates memory internally. Under heavy memory pressure, > > this allocation can fail and return NULL. When this happens, the caller > > ocfs2_get_system_file_inode() must fall back to calling _ocfs2_get_system_file_inode() > > again to read the inode from disk. Therefore, the filesystem design must inherently > > support multiple calls to _ocfs2_get_system_file_inode(). > > > > 2. Why cmpxchg() is sufficient and safe without the mutex: > > The only thing the system_file_mutex is needed was to prevent a race where two > > threads concurrently execute _ocfs2_get_system_file_inode(), obtain the SAME inode > > pointer (since the underlying VFS iget_locked() returns the identical address for > > the same slot), and both mistakenly invoke igrab() on it, leading to a reference > > count leak. > > > > This patch perfectly solves that race condition by using cmpxchg() on the target > > pointer array slot: > > > > * The thread that wins the cmpxchg() successfully initializes the slot with the > > fetched inode and get the extra refcount because it is the first time to store > > into the slot. > > > > * The thread that loses the cmpxchg() detects that another thread has already > > initialized the slot with the exact same inode. The loser thread returns > > without getting the extra refcount because it is not the first time to store > > into the slot. > > > > Therefore, the reference counting contract is strictly and atomically maintained. > > No references are leaked, and the array slot is never corrupted. > > Hi, > > The logic here is incorrect. The purpose of the refcount is to track how many > consumers are using the inode. > > In the original code, if two threads concurrently access ocfs2_get_system_file_inode() > while the inode is uninitialized, inode->i_count would ultimately be incremented > by 3. However, with your patch, i_count will only be incremented by 2. > > To be more specific: > Your patch explicitly triggers a race condition: when the target local_system > inode is uninitialized and two threads enter simultaneously, Thread 1 wins the > cmpxchg() and increments the refcount before exiting. Thread 2, however, loses > the refcount increment simply because the atomic operation failed. > > btw, The issue addressed in commit 43b10a20372d was that after two concurrent > threads returned, inode->i_count ended up being 4 when the correct value should > have been 3. With your patch, the value will end up being 2, which is insufficient. My above analysis contains a mistake. With the patch, the refcount is also 3. However, I don't think the code logic is correct. Before commit 43b10a20372d, the refcount was 4: Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1) Thread 2: does the same job as Thread 1. Current code logic, the refcount is 3: Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (refcount +1) Thread 2: "inode = igrab(inode)" (gets inode from array, refcount +1) With the patch, the refcount is also 3: Thread 1: _ocfs2_get_system_file_inode (refcount +1), "*arr = igrab(inode)" (sets array, refcount +1) Thread 2: _ocfs2_get_system_file_inode (refcount +1) In theory, _ocfs2_get_system_file_inode() should only be called once after mount. The performance penalty in the current ocfs2_get_system_file_inode() comes from doing "inode = igrab(inode)" while holding the mutex lock. - Heming > > In my opinion, the problem with the current code is that the scope of > mutex_lock(&osb->system_file_mutex) is too broad. This mutex only needs to be > held prior to calling _ocfs2_get_system_file_inode(). I previously highlighted > this point in my initial review comment on the patch. > > Thanks, > Heming > > > > > 3. Standard filesystems do not use a global mutex for this: > > Standard filesystems (like Ext4's ext4_get_journal_inode or XFS's > > xfs_qm_init_quotainos) rely entirely on the VFS layer's internal hashing/locking (e.g., > > iget_locked) to serialize metadata/system inode lookups. OCFS2's system_file_mutex is a > > redundant global lock that heavily pollutes the lock dependency graph, triggering > > possible deadlock warnings that block us from testing and fixing genuine deadlocks. > > > > Since the cmpxchg() approach guarantees atomic slot initialization + igrab(), the global > > mutex is completely redundant and should be removed. > > > > Regards. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-05-15 23:56 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-06-21 15:56 [PATCH] ocfs2: kill osb->system_file_mutex lock Tetsuo Handa 2025-06-24 1:33 ` Heming Zhao 2025-06-24 1:55 ` Tetsuo Handa 2025-06-24 2:51 ` Heming Zhao 2025-06-24 2:17 ` Tetsuo Handa 2025-06-24 2:40 ` Heming Zhao 2025-06-24 3:05 ` Tetsuo Handa 2026-05-14 7:09 ` Tetsuo Handa 2026-05-15 15:35 ` Heming Zhao 2026-05-15 15:51 ` Tetsuo Handa 2026-05-15 23:56 ` Heming Zhao 2026-05-15 23:53 ` Heming Zhao
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.