linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
@ 2025-05-13  9:34 Gavin Guo
  2025-05-14  0:56 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Gavin Guo @ 2025-05-13  9:34 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, muchun.song, osalvador, akpm, mike.kravetz,
	kernel-dev, Gavin Guo, stable, Hugh Dickins, Florent Revest,
	Gavin Shan

The patch fixes a deadlock which can be triggered by an internal
syzkaller reproducer [1] and captured by a bpftrace script [2], whose
log is at [3], in the following scenario:

Process 1                              Process 2
---				       ---
hugetlb_fault
  mutex_lock(B) // take B
  filemap_lock_hugetlb_folio
    filemap_lock_folio
      __filemap_get_folio
        folio_lock(A) // take A
  hugetlb_wp
    mutex_unlock(B) // release B
    ...                                hugetlb_fault
    ...                                  mutex_lock(B) // take B
                                         filemap_lock_hugetlb_folio
                                           filemap_lock_folio
                                             __filemap_get_folio
                                               folio_lock(A) // blocked
    unmap_ref_private
    ...
    mutex_lock(B) // retake and blocked

This is an ABBA deadlock involving two locks:
- Lock A: pagecache_folio lock
- Lock B: hugetlb_fault_mutex_table lock

The deadlock occurs between two processes as follows:
1. The first process (let’s call it Process 1) is handling a
copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
insufficient reserved hugetlb pages, Process 1, the owner of the
reserved hugetlb page, attempts to unmap a hugepage owned by another
process (the non-owner) to satisfy the reservation. Before unmapping,
Process 1 acquires Lock B (hugetlb_fault_mutex_table lock) and then
Lock A (pagecache_folio lock). To proceed with the unmap, it releases
Lock B but retains Lock A. After the unmap, Process 1 tries to
reacquire Lock B. However, at this point, Lock B has already been
acquired by another process.

2. The second process (Process 2) enters the hugetlb_fault handler
during the unmap operation. It successfully acquires Lock B
(hugetlb_fault_mutex_table lock) that was just released by Process 1,
but then attempts to acquire Lock A (pagecache_folio lock), which is
still held by Process 1.

As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
(held by Process 2), while Process 2 (holding Lock B) is blocked waiting
for Lock A (held by Process 1), forming an ABBA deadlock.
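
For illustration only, the same ABBA pattern can be shown in plain
userspace C, with two pthread mutexes standing in for the two kernel
locks. This is not kernel code and not the syzkaller reproducer [1];
the sleeps are only there to make the interleaving shown above
practically deterministic:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* pagecache_folio lock      */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* hugetlb_fault_mutex_table */

static void *process1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);	/* take B */
	pthread_mutex_lock(&lock_a);	/* take A */
	pthread_mutex_unlock(&lock_b);	/* release B before the "unmap" */
	sleep(1);			/* window in which process2 grabs B */
	pthread_mutex_lock(&lock_b);	/* retake B: blocks, process2 holds it */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *process2(void *arg)
{
	(void)arg;
	usleep(100 * 1000);		/* let process1 take B, take A, drop B */
	pthread_mutex_lock(&lock_b);	/* take B */
	pthread_mutex_lock(&lock_a);	/* blocks: A is still held by process1 */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, process1, NULL);
	pthread_create(&t2, NULL, process2, NULL);
	sleep(2);
	fprintf(stderr, "both threads are now stuck in the ABBA deadlock\n");
	pthread_join(t1, NULL);		/* never returns */
	pthread_join(t2, NULL);
	return 0;
}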

The solution here is to unlock the pagecache_folio in hugetlb_wp()
before dropping the fault mutex, and to report that through the new
pagecache_folio_unlocked out-parameter, so that the caller has
visibility over the pagecache_folio's lock status for its subsequent
handling.
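
As a rough userspace analogue of that calling convention (the callee
may have to drop a lock the caller took, and reports it back through a
bool out-parameter so the caller does not unlock twice), consider the
following sketch; do_work() and lock_dropped are made-up illustrative
names, not kernel APIs:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t folio_lock_analogue = PTHREAD_MUTEX_INITIALIZER;

/* Callee: may need to drop the lock the caller took; reports it back. */
static void do_work(bool need_drop, bool *lock_dropped)
{
	*lock_dropped = false;
	if (need_drop) {
		/* The lock must be released to make progress, cf. hugetlb_wp(). */
		pthread_mutex_unlock(&folio_lock_analogue);
		*lock_dropped = true;
	}
	/* ...work that must no longer rely on the state the lock protected... */
}

int main(void)
{
	bool dropped;

	pthread_mutex_lock(&folio_lock_analogue);
	do_work(true, &dropped);
	if (!dropped)		/* cf. pagecache_folio_unlocked in the patch */
		pthread_mutex_unlock(&folio_lock_analogue);
	printf("lock dropped by callee: %s\n", dropped ? "yes" : "no");
	return 0;
}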

The error message:
INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
      Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00004006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock+0x75f/0xeb0
 hugetlb_wp+0xf88/0x3440
 hugetlb_fault+0x14c8/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0x61d/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0010:__put_user_4+0xd/0x20
 copy_process+0x1f4a/0x3d60
 kernel_clone+0x210/0x8f0
 __x64_sys_clone+0x18d/0x1f0
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x41b26d
 </TASK>
INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 io_schedule+0x92/0x110
 folio_wait_bit_common+0x69a/0xba0
 __filemap_get_folio+0x154/0xb70
 hugetlb_fault+0xa50/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0xace/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
 </TASK>
INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
      Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 io_schedule+0x92/0x110
 folio_wait_bit_common+0x69a/0xba0
 __filemap_get_folio+0x154/0xb70
 hugetlb_fault+0xa50/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0xace/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
 </TASK>

Showing all locks held in the system:
1 lock held by khungtaskd/35:
 #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
2 locks held by repro_20250402_/13229:
 #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
 #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
3 locks held by repro_20250402_/13250:
 #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
 #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
 #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30

Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Cc: <stable@vger.kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Florent Revest <revest@google.com>
Cc: Gavin Shan <gshan@redhat.com>
Signed-off-by: Gavin Guo <gavinguo@igalia.com>
---
 mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e3e6ac991b9c..ad54a74aa563 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
 static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		       struct vm_fault *vmf)
+		       struct vm_fault *vmf,
+		       bool *pagecache_folio_unlocked)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 			u32 hash;
 
 			folio_put(old_folio);
+			/*
+			 * The pagecache_folio needs to be unlocked to avoid
+			 * deadlock and we won't re-lock it in hugetlb_wp(). The
+			 * pagecache_folio could be truncated after being
+			 * unlocked. So its state should not be relied
+			 * upon subsequently.
+			 *
+			 * Setting *pagecache_folio_unlocked to true allows the
+			 * caller to handle any necessary logic related to the
+			 * folio's unlocked state.
+			 */
+			if (pagecache_folio) {
+				folio_unlock(pagecache_folio);
+				if (pagecache_folio_unlocked)
+					*pagecache_folio_unlocked = true;
+			}
 			/*
 			 * Drop hugetlb_fault_mutex and vma_lock before
 			 * unmapping.  unmapping needs to hold vma_lock
@@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(folio, vmf, NULL);
 	}
 
 	spin_unlock(vmf->ptl);
@@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
+	bool pagecache_folio_unlocked = false;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = address & huge_page_mask(h),
@@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
+			ret = hugetlb_wp(pagecache_folio, &vmf,
+					&pagecache_folio_unlocked);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 out_ptl:
 	spin_unlock(vmf.ptl);
 
-	if (pagecache_folio) {
+	/*
+	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
+	 * folio_unlock() here.
+	 */
+	if (pagecache_folio && !pagecache_folio_unlocked)
 		folio_unlock(pagecache_folio);
+	if (pagecache_folio)
 		folio_put(pagecache_folio);
-	}
 out_mutex:
 	hugetlb_vma_unlock_read(vma);
 

base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-13  9:34 [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table Gavin Guo
@ 2025-05-14  0:56 ` Andrew Morton
  2025-05-14  4:33   ` Byungchul Park
  2025-05-14  6:47 ` Byungchul Park
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2025-05-14  0:56 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, kernel-dev,
	stable, Hugh Dickins, Florent Revest, Gavin Shan, Byungchul Park

On Tue, 13 May 2025 17:34:48 +0800 Gavin Guo <gavinguo@igalia.com> wrote:

> The patch fixes a deadlock which can be triggered by an internal
> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> [3] in this scenario:
> 
> Process 1                              Process 2
> ---				       ---
> hugetlb_fault
>   mutex_lock(B) // take B
>   filemap_lock_hugetlb_folio
>     filemap_lock_folio
>       __filemap_get_folio
>         folio_lock(A) // take A
>   hugetlb_wp
>     mutex_unlock(B) // release B
>     ...                                hugetlb_fault
>     ...                                  mutex_lock(B) // take B
>                                          filemap_lock_hugetlb_folio
>                                            filemap_lock_folio
>                                              __filemap_get_folio
>                                                folio_lock(A) // blocked
>     unmap_ref_private
>     ...
>     mutex_lock(B) // retake and blocked
> 
> This is a ABBA deadlock involving two locks:
> - Lock A: pagecache_folio lock
> - Lock B: hugetlb_fault_mutex_table lock

Nostalgia.  A decade or three ago many of us spent much of our lives
staring at ABBA deadlocks.  Then came lockdep and after a few more
years, it all stopped.  I've long hoped that lockdep would gain a
solution to custom locks such as folio_wait_bit_common(), but not yet.

Byungchul, please take a look.  Would DEPT
(https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com)
have warned us about this?

>
> ...
>
> The deadlock occurs between two processes as follows:
>
> ...
> 
> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
> Cc: <stable@vger.kernel.org>

It's been there for three years so I assume we aren't in a hurry.

The fix looks a bit nasty, sorry.  Perhaps designed for a minimal patch
footprint?  That's good for a backportable fixup, but a more broadly
architected solution may be needed going forward.

I'll queue it for 6.16-rc1 with a cc:stable, so this should be
presented to the -stable trees 3-4 weeks from now.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-14  0:56 ` Andrew Morton
@ 2025-05-14  4:33   ` Byungchul Park
  0 siblings, 0 replies; 14+ messages in thread
From: Byungchul Park @ 2025-05-14  4:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gavin Guo, linux-mm, linux-kernel, muchun.song, osalvador,
	kernel-dev, stable, Hugh Dickins, Florent Revest, Gavin Shan,
	kernel_team

On Tue, May 13, 2025 at 05:56:33PM -0700, Andrew Morton wrote:
> On Tue, 13 May 2025 17:34:48 +0800 Gavin Guo <gavinguo@igalia.com> wrote:
> 
> > The patch fixes a deadlock which can be triggered by an internal
> > syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> > [3] in this scenario:
> > 
> > Process 1                              Process 2
> > ---				       ---
> > hugetlb_fault
> >   mutex_lock(B) // take B
> >   filemap_lock_hugetlb_folio
> >     filemap_lock_folio
> >       __filemap_get_folio
> >         folio_lock(A) // take A
> >   hugetlb_wp
> >     mutex_unlock(B) // release B
> >     ...                                hugetlb_fault
> >     ...                                  mutex_lock(B) // take B
> >                                          filemap_lock_hugetlb_folio
> >                                            filemap_lock_folio
> >                                              __filemap_get_folio
> >                                                folio_lock(A) // blocked
> >     unmap_ref_private
> >     ...
> >     mutex_lock(B) // retake and blocked
> > 
> > This is a ABBA deadlock involving two locks:
> > - Lock A: pagecache_folio lock
> > - Lock B: hugetlb_fault_mutex_table lock
> 
> Nostalgia.  A decade or three ago many of us spent much of our lives
> staring at ABBA deadlocks.  Then came lockdep and after a few more
> years, it all stopped.  I've long hoped that lockdep would gain a
> solution to custom locks such as folio_wait_bit_common(), but not yet.
> 
> Byungchul, please take a look.  Would DEPT
> (https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com)
> have warned us about this?

Sure, I will check it.  I think this type of deadlock is what DEPT
handles best.

	Byungchul

> >
> > ...
> >
> > The deadlock occurs between two processes as follows:
> >
> > ...
> > 
> > Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
> > Cc: <stable@vger.kernel.org>
> 
> It's been there for three years so I assume we aren't in a hurry.
> 
> The fix looks a bit nasty, sorry.  Perhaps designed for a minimal patch
> footprint?  That's good for a backportable fixup, but a more broadly
> architected solution may be needed going forward.
> 
> I'll queue it for 6.16-rc1 with a cc:stable, so this should be
> presented to the -stable trees 3-4 weeks from now.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-13  9:34 [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table Gavin Guo
  2025-05-14  0:56 ` Andrew Morton
@ 2025-05-14  6:47 ` Byungchul Park
  2025-05-14  8:10   ` Gavin Guo
  2025-05-20 19:53 ` Oscar Salvador
  2025-05-26  4:41 ` Gavin Shan
  3 siblings, 1 reply; 14+ messages in thread
From: Byungchul Park @ 2025-05-14  6:47 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
> The patch fixes a deadlock which can be triggered by an internal
> syzkaller [1] reproducer and captured by bpftrace script [2] and its log

Hi,

I'm trying to reproduce it using the test program [1], but haven't
reproduced it yet.  I see a lot of segfaults while running [1], so I
guess something is going wrong.  Is there any prerequisite condition to
reproduce it?  Lemme know if any.  Or can you try DEPT15 with your
config and environment using the following steps:

   1. Apply the patchset on v6.15-rc6.
      https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
   2. Turn on CONFIG_DEPT.
   3. Run test program reproducing the deadlock.
   4. Check dmesg to see if dept reported the dependency.

	Byungchul


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-14  6:47 ` Byungchul Park
@ 2025-05-14  8:10   ` Gavin Guo
  2025-05-15  2:22     ` Byungchul Park
  2025-05-16  6:03     ` Byungchul Park
  0 siblings, 2 replies; 14+ messages in thread
From: Gavin Guo @ 2025-05-14  8:10 UTC (permalink / raw)
  To: Byungchul Park
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

Hi Byungchul,

On 5/14/25 14:47, Byungchul Park wrote:
> On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
>> The patch fixes a deadlock which can be triggered by an internal
>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> 
> Hi,
> 
> I'm trying to reproduce using the test program [1].  But not yet
> produced.  I see a lot of segfaults while running [1].  I guess
> something goes wrong.  Is there any prerequisite condition to reproduce
> it?  Lemme know if any.  Or can you try DEPT15 with your config and
> environment by the following steps:
> 
>     1. Apply the patchset on v6.15-rc6.
>        https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
>     2. Turn on CONFIG_DEPT.
>     3. Run test program reproducing the deadlock.
>     4. Check dmesg to see if dept reported the dependency.
> 
> 	Byungchul

I have applied the patchset and successfully reproduced the bug. It
seems that there is no warning or error log related to the lock. Did I
miss anything? This is the console log:
https://drive.google.com/file/d/1dxWNiO71qE-H-e5NMPqj7W-aW5CkGSSF/view?usp=sharing



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-14  8:10   ` Gavin Guo
@ 2025-05-15  2:22     ` Byungchul Park
  2025-05-16  6:03     ` Byungchul Park
  1 sibling, 0 replies; 14+ messages in thread
From: Byungchul Park @ 2025-05-15  2:22 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

On Wed, May 14, 2025 at 04:10:12PM +0800, Gavin Guo wrote:
> Hi Byungchul,
> 
> On 5/14/25 14:47, Byungchul Park wrote:
> > On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
> > > The patch fixes a deadlock which can be triggered by an internal
> > > syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> > 
> > Hi,
> > 
> > I'm trying to reproduce using the test program [1].  But not yet
> > produced.  I see a lot of segfaults while running [1].  I guess
> > something goes wrong.  Is there any prerequisite condition to reproduce
> > it?  Lemme know if any.  Or can you try DEPT15 with your config and
> > environment by the following steps:
> > 
> >     1. Apply the patchset on v6.15-rc6.
> >        https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
> >     2. Turn on CONFIG_DEPT.
> >     3. Run test program reproducing the deadlock.
> >     4. Check dmesg to see if dept reported the dependency.
> > 
> > 	Byungchul
> 
> I have enabled the patchset and successfully reproduced the bug. It
> seems that there is no warning or error log related to the lock. Did I
> miss anything? This is the console log:
> https://drive.google.com/file/d/1dxWNiO71qE-H-e5NMPqj7W-aW5CkGSSF/view?usp=sharing

Hi,

No, you didn't miss anything.

> > 
> > > [3] in this scenario:
> > > 
> > > Process 1                              Process 2
> > > ---				       ---
> > > hugetlb_fault
> > >    mutex_lock(B) // take B
> > >    filemap_lock_hugetlb_folio
> > >      filemap_lock_folio
> > >        __filemap_get_folio
> > >          folio_lock(A) // take A
> > >    hugetlb_wp
> > >      mutex_unlock(B) // release B
> > >      ...                                hugetlb_fault
> > >      ...                                  mutex_lock(B) // take B
> > >                                           filemap_lock_hugetlb_folio
> > >                                             filemap_lock_folio
> > >                                               __filemap_get_folio
> > >                                                 folio_lock(A) // blocked
							^
					dept adds B->A dependency.

> > >      unmap_ref_private
> > >      ...
> > >      mutex_lock(B) // retake and blocked

Since folio_unlock(A) might be called by someone other than Process 1
and resolve the hang, dept delays the decision to add the A->B
dependency until it actually sees folio_unlock(A).

I guess folio_unlock(A) had never been hit with B held before the
hang.  dept would've detected the deadlock in advance if folio_unlock(A)
had been hit with B held at least once while testing.

Or, to detect that in advance, we can tell dept that the pair of
folio_lock() and folio_unlock() happens in the same context, like the
following, which looks a bit ugly tho:

   folio_lock(A);
 + /*
 +  * Means the interesting event, folio_unlock(), will appear within
 +  * this context.
 +  */
 + sdt_ecxt_enter(&pg_locked_map);

   ...

 + sdt_ecxt_exit(&pg_locked_map);
   folio_unlock(A);

Feel free to lemme know if you have any idea for better work.

	Byungchul


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-14  8:10   ` Gavin Guo
  2025-05-15  2:22     ` Byungchul Park
@ 2025-05-16  6:03     ` Byungchul Park
  2025-05-16  7:32       ` Gavin Guo
  1 sibling, 1 reply; 14+ messages in thread
From: Byungchul Park @ 2025-05-16  6:03 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

On Wed, May 14, 2025 at 04:10:12PM +0800, Gavin Guo wrote:
> Hi Byungchul,
> 
> On 5/14/25 14:47, Byungchul Park wrote:
> > On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
> > > The patch fixes a deadlock which can be triggered by an internal
> > > syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> > 
> > Hi,
> > 
> > I'm trying to reproduce using the test program [1].  But not yet
> > produced.  I see a lot of segfaults while running [1].  I guess
> > something goes wrong.  Is there any prerequisite condition to reproduce
> > it?  Lemme know if any.  Or can you try DEPT15 with your config and
> > environment by the following steps:
> > 
> >     1. Apply the patchset on v6.15-rc6.
> >        https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
> >     2. Turn on CONFIG_DEPT.
> >     3. Run test program reproducing the deadlock.
> >     4. Check dmesg to see if dept reported the dependency.
> > 
> > 	Byungchul
> 
> I have enabled the patchset and successfully reproduced the bug. It
> seems that there is no warning or error log related to the lock. Did I
> miss anything? This is the console log:
> https://drive.google.com/file/d/1dxWNiO71qE-H-e5NMPqj7W-aW5CkGSSF/view?usp=sharing

My bad.  I think I found the reason dept didn't report it.  You should
see the report with the following patch applied on top, though there
might be a lot of false positives along with it that might be annoying.

Some of my efforts to suppress false positives also suppressed the real
one.

Do you mind if I ask you to run the test with the following patch
applied?  It'd be appreciated if you could do that and share the result
with me.

	Byungchul

---
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f31cd68f2935..fd7559e663c5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1138,6 +1138,7 @@ static inline bool trylock_page(struct page *page)
 static inline void folio_lock(struct folio *folio)
 {
 	might_sleep();
+	dept_page_wait_on_bit(&folio->page, PG_locked);
 	if (!folio_trylock(folio))
 		__folio_lock(folio);
 }
diff --git a/kernel/dependency/dept.c b/kernel/dependency/dept.c
index b2fa96d984bc..4e96a6a72d02 100644
--- a/kernel/dependency/dept.c
+++ b/kernel/dependency/dept.c
@@ -931,7 +931,6 @@ static void print_circle(struct dept_class *c)
 	dept_outworld_exit();
 
 	do {
-		tc->reported = true;
 		tc = fc;
 		fc = fc->bfs_parent;
 	} while (tc != c);
diff --git a/kernel/dependency/dept_unit_test.c b/kernel/dependency/dept_unit_test.c
index 88e846b9f876..496149f31fb3 100644
--- a/kernel/dependency/dept_unit_test.c
+++ b/kernel/dependency/dept_unit_test.c
@@ -125,6 +125,8 @@ static int __init dept_ut_init(void)
 {
 	int i;
 
+	return 0;
+
 	lockdep_off();
 
 	dept_ut_results.ecxt_stack_valid_cnt = 0;
--

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-16  6:03     ` Byungchul Park
@ 2025-05-16  7:32       ` Gavin Guo
  2025-05-16  7:43         ` Byungchul Park
  0 siblings, 1 reply; 14+ messages in thread
From: Gavin Guo @ 2025-05-16  7:32 UTC (permalink / raw)
  To: Byungchul Park
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

On 5/16/25 14:03, Byungchul Park wrote:
> On Wed, May 14, 2025 at 04:10:12PM +0800, Gavin Guo wrote:
>> Hi Byungchul,
>>
>> On 5/14/25 14:47, Byungchul Park wrote:
>>> On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
>>>> The patch fixes a deadlock which can be triggered by an internal
>>>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>>>
>>> Hi,
>>>
>>> I'm trying to reproduce using the test program [1].  But not yet
>>> produced.  I see a lot of segfaults while running [1].  I guess
>>> something goes wrong.  Is there any prerequisite condition to reproduce
>>> it?  Lemme know if any.  Or can you try DEPT15 with your config and
>>> environment by the following steps:
>>>
>>>      1. Apply the patchset on v6.15-rc6.
>>>         https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
>>>      2. Turn on CONFIG_DEPT.
>>>      3. Run test program reproducing the deadlock.
>>>      4. Check dmesg to see if dept reported the dependency.
>>>
>>> 	Byungchul
>>
>> I have enabled the patchset and successfully reproduced the bug. It
>> seems that there is no warning or error log related to the lock. Did I
>> miss anything? This is the console log:
>> https://drive.google.com/file/d/1dxWNiO71qE-H-e5NMPqj7W-aW5CkGSSF/view?usp=sharing
> 
> My bad.  I think I found the problem that dept didn't report it.  You
> might see the report with the following patch applied on the top, there
> might be a lot of false positives along with that might be annoying tho.
> 
> Some of my efforts to suppress false positives, suppressed the real one.
> 
> Do you mind if I ask you to run the test with the following patch
> applied?  It'd be appreciated if you do and share the result with me.
> 
> 	Byungchul
> 
> ---
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f31cd68f2935..fd7559e663c5 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -1138,6 +1138,7 @@ static inline bool trylock_page(struct page *page)
>   static inline void folio_lock(struct folio *folio)
>   {
>   	might_sleep();
> +	dept_page_wait_on_bit(&folio->page, PG_locked);
>   	if (!folio_trylock(folio))
>   		__folio_lock(folio);
>   }
> diff --git a/kernel/dependency/dept.c b/kernel/dependency/dept.c
> index b2fa96d984bc..4e96a6a72d02 100644
> --- a/kernel/dependency/dept.c
> +++ b/kernel/dependency/dept.c
> @@ -931,7 +931,6 @@ static void print_circle(struct dept_class *c)
>   	dept_outworld_exit();
>   
>   	do {
> -		tc->reported = true;
>   		tc = fc;
>   		fc = fc->bfs_parent;
>   	} while (tc != c);
> diff --git a/kernel/dependency/dept_unit_test.c b/kernel/dependency/dept_unit_test.c
> index 88e846b9f876..496149f31fb3 100644
> --- a/kernel/dependency/dept_unit_test.c
> +++ b/kernel/dependency/dept_unit_test.c
> @@ -125,6 +125,8 @@ static int __init dept_ut_init(void)
>   {
>   	int i;
>   
> +	return 0;
> +
>   	lockdep_off();
>   
>   	dept_ut_results.ecxt_stack_valid_cnt = 0;
> --

Please see the test result:
https://drive.google.com/file/d/1B20Gu3wLFbAeaXXb7aSQP5T6aeN9Mext/view?usp=sharing

It seems that after the first round, the deadlock is captured:
ubuntu@localhost:~$ ./repro_20250402_0225_154f8fb0580000
executing program
[   80.425842][ T3416] ===================================================
[   80.426707][ T3416] DEPT: Circular dependency has been detected.
[   80.427497][ T3416] 6.15.0-rc6+ #31 Not tainted
[   80.428084][ T3416] ---------------------------------------------------
[   80.428964][ T3416] summary
[   80.429330][ T3416] ---------------------------------------------------
[   80.430078][ T3416] *** DEADLOCK ***
[   80.430078][ T3416]
[   80.430736][ T3416] context A
[   80.431076][ T3416]    [S] (unknown)(pg_locked_map:0)
[   80.431637][ T3416]    [W] lock(&hugetlb_fault_mutex_table[i]:0)
[   80.432312][ T3416]    [E] dept_page_clear_bit(pg_locked_map:0)
[   80.432977][ T3416]
[   80.433246][ T3416] context B
[   80.433595][ T3416]    [S] lock(&hugetlb_fault_mutex_table[i]:0)
[   80.434245][ T3416]    [W] dept_page_wait_on_bit(pg_locked_map:0)
[   80.434880][ T3416]    [E] unlock(&hugetlb_fault_mutex_table[i]:0)
[   80.435592][ T3416]
[   80.435852][ T3416] [S]: start of the event context
[   80.436369][ T3416] [W]: the wait blocked
[   80.436789][ T3416] [E]: the event not reachable
[   80.437275][ T3416] ---------------------------------------------------
[   80.437950][ T3416] context A's detail
[   80.438367][ T3416] ---------------------------------------------------
[   80.439006][ T3416] context A
[   80.439337][ T3416]    [S] (unknown)(pg_locked_map:0)
[   80.439883][ T3416]    [W] lock(&hugetlb_fault_mutex_table[i]:0)
[   80.440489][ T3416]    [E] dept_page_clear_bit(pg_locked_map:0)
[   80.441075][ T3416]
[   80.441318][ T3416] [S] (unknown)(pg_locked_map:0):
[   80.441816][ T3416] (N/A)
[   80.442077][ T3416]
[   80.442309][ T3416] [W] lock(&hugetlb_fault_mutex_table[i]:0):
[   80.442872][ T3416] [<ffffffff82144644>] hugetlb_wp+0xfa4/0x3490
[   80.443502][ T3416] stacktrace:
[   80.443810][ T3416]       hugetlb_wp+0xfa4/0x3490
[   80.444267][ T3416]       hugetlb_fault+0x1505/0x2c70
[   80.444776][ T3416]       handle_mm_fault+0x1845/0x1ab0
[   80.445275][ T3416]       do_user_addr_fault+0x637/0x1450
[   80.445779][ T3416]       exc_page_fault+0x67/0x110
[   80.446239][ T3416]       asm_exc_page_fault+0x26/0x30
[   80.446722][ T3416]       __put_user_4+0xd/0x20
[   80.447157][ T3416]       copy_process+0x1f64/0x3d80
[   80.447621][ T3416]       kernel_clone+0x216/0x940
[   80.448068][ T3416]       __x64_sys_clone+0x18d/0x1f0
[   80.448548][ T3416]       do_syscall_64+0x6f/0x120
[   80.448999][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   80.449556][ T3416]
[   80.449765][ T3416] [E] dept_page_clear_bit(pg_locked_map:0):
[   80.450272][ T3416] [<ffffffff8214263b>] hugetlb_fault+0x1ccb/0x2c70
[   80.450861][ T3416] stacktrace:
[   80.451148][ T3416]       hugetlb_fault+0x1ccb/0x2c70
[   80.451611][ T3416]       handle_mm_fault+0x1845/0x1ab0
[   80.452080][ T3416]       do_user_addr_fault+0x637/0x1450
[   80.452566][ T3416]       exc_page_fault+0x67/0x110
[   80.453014][ T3416]       asm_exc_page_fault+0x26/0x30
[   80.453497][ T3416]       __put_user_4+0xd/0x20
[   80.453923][ T3416]       copy_process+0x1f64/0x3d80
[   80.454379][ T3416]       kernel_clone+0x216/0x940
[   80.454817][ T3416]       __x64_sys_clone+0x18d/0x1f0
[   80.455277][ T3416]       do_syscall_64+0x6f/0x120
[   80.455722][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   80.456253][ T3416] ---------------------------------------------------
[   80.456842][ T3416] context B's detail
[   80.457198][ T3416] ---------------------------------------------------
[   80.457842][ T3416] context B
[   80.458122][ T3416]    [S] lock(&hugetlb_fault_mutex_table[i]:0)
[   80.458661][ T3416]    [W] dept_page_wait_on_bit(pg_locked_map:0)
[   80.459187][ T3416]    [E] unlock(&hugetlb_fault_mutex_table[i]:0)
[   80.459763][ T3416]
[   80.459988][ T3416] [S] lock(&hugetlb_fault_mutex_table[i]:0):
[   80.460509][ T3416] [<ffffffff82140d36>] hugetlb_fault+0x3c6/0x2c70
[   80.461074][ T3416] stacktrace:
[   80.461374][ T3416]       hugetlb_fault+0x3c6/0x2c70
[   80.461812][ T3416]       handle_mm_fault+0x1845/0x1ab0
[   80.462281][ T3416]       do_user_addr_fault+0x637/0x1450
[   80.462775][ T3416]       exc_page_fault+0x67/0x110
[   80.463220][ T3416]       asm_exc_page_fault+0x26/0x30
[   80.463694][ T3416]       __put_user_4+0xd/0x20
[   80.464129][ T3416]       copy_process+0x1f64/0x3d80
[   80.464577][ T3416]       kernel_clone+0x216/0x940
[   80.464994][ T3416]       __x64_sys_clone+0x18d/0x1f0
[   80.465466][ T3416]       do_syscall_64+0x6f/0x120
[   80.465909][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   80.466457][ T3416]
[   80.466660][ T3416] [W] dept_page_wait_on_bit(pg_locked_map:0):
[   80.467177][ T3416] [<ffffffff82141187>] hugetlb_fault+0x817/0x2c70
[   80.467740][ T3416] stacktrace:
[   80.468032][ T3416]       hugetlb_fault+0x817/0x2c70
[   80.468479][ T3416]       handle_mm_fault+0x1845/0x1ab0
[   80.468947][ T3416]       do_user_addr_fault+0x637/0x1450
[   80.469428][ T3416]       exc_page_fault+0x67/0x110
[   80.469865][ T3416]       asm_exc_page_fault+0x26/0x30
[   80.470332][ T3416]       __put_user_4+0xd/0x20
[   80.470742][ T3416]       copy_process+0x1f64/0x3d80
[   80.471186][ T3416]       kernel_clone+0x216/0x940
[   80.471616][ T3416]       __x64_sys_clone+0x18d/0x1f0
[   80.472060][ T3416]       do_syscall_64+0x6f/0x120
[   80.472492][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   80.473040][ T3416]
[   80.473271][ T3416] [E] unlock(&hugetlb_fault_mutex_table[i]:0):
[   80.473863][ T3416] (N/A)
[   80.474124][ T3416] ---------------------------------------------------
[   80.474738][ T3416] information that might be helpful
[   80.475210][ T3416] ---------------------------------------------------
[   80.475820][ T3416] CPU: 1 UID: 1000 PID: 3416 Comm: repro_20250402_ Not tainted 6.15.0-rc6+ #31 NONE
[   80.475831][ T3416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   80.475837][ T3416] Call Trace:
[   80.475841][ T3416]  <TASK>
[   80.475845][ T3416]  dump_stack_lvl+0x1ad/0x280
[   80.475858][ T3416]  ? __pfx_dump_stack_lvl+0x10/0x10
[   80.475867][ T3416]  ? __pfx__printk+0x10/0x10
[   80.475883][ T3416]  cb_check_dl+0x24a8/0x2530
[   80.475897][ T3416]  ? bfs_extend_dep+0x271/0x290
[   80.475909][ T3416]  bfs+0x464/0x5e0
[   80.475921][ T3416]  ? __pfx_bfs+0x10/0x10
[   80.475931][ T3416]  ? add_dep+0x387/0x710
[   80.475943][ T3416]  add_dep+0x3d0/0x710
[   80.475953][ T3416]  ? __pfx_from_pool+0x10/0x10
[   80.475963][ T3416]  ? __pfx_bfs_init_check_dl+0x10/0x10
[   80.475972][ T3416]  ? __pfx_bfs_extend_dep+0x10/0x10
[   80.475981][ T3416]  ? __pfx_bfs_dequeue_dep+0x10/0x10
[   80.475990][ T3416]  ? __pfx_cb_check_dl+0x10/0x10
[   80.475999][ T3416]  ? __pfx_add_dep+0x10/0x10
[   80.476011][ T3416]  ? put_ecxt+0xda/0x4b0
[   80.476024][ T3416]  __dept_event+0xee8/0x1590
[   80.476038][ T3416]  dept_event+0x166/0x240
[   80.476047][ T3416]  ? hugetlb_fault+0x1ccb/0x2c70
[   80.476057][ T3416]  folio_unlock+0xb8/0x190
[   80.476071][ T3416]  hugetlb_fault+0x1ccb/0x2c70
[   80.476085][ T3416]  ? __pfx_hugetlb_fault+0x10/0x10
[   80.476100][ T3416]  ? mt_find+0x15a/0x5f0
[   80.476110][ T3416]  handle_mm_fault+0x1845/0x1ab0
[   80.476125][ T3416]  ? handle_mm_fault+0xdb/0x1ab0
[   80.476142][ T3416]  ? __pfx_handle_mm_fault+0x10/0x10
[   80.476156][ T3416]  ? find_vma+0xec/0x160
[   80.476164][ T3416]  ? __pfx_find_vma+0x10/0x10
[   80.476172][ T3416]  ? dept_on+0x1c/0x30
[   80.476179][ T3416]  ? dept_exit+0x1c5/0x2c0
[   80.476186][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
[   80.476197][ T3416]  ? lock_mm_and_find_vma+0xa1/0x300
[   80.476211][ T3416]  do_user_addr_fault+0x637/0x1450
[   80.476219][ T3416]  ? mntput_no_expire+0xc0/0x870
[   80.476235][ T3416]  ? __pfx_do_user_addr_fault+0x10/0x10
[   80.476246][ T3416]  ? trace_irq_disable+0x60/0x180
[   80.476258][ T3416]  exc_page_fault+0x67/0x110
[   80.476272][ T3416]  asm_exc_page_fault+0x26/0x30
[   80.476280][ T3416] RIP: 0010:__put_user_4+0xd/0x20
[   80.476293][ T3416] Code: 66 89 01 31 c9 0f 1f 00 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 1f 00 <89> 01 31 c9 0
[   80.476312][ T3416] RSP: 0018:ffffc90004dffa38 EFLAGS: 00010206
[   80.476322][ T3416] RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000200000000200
[   80.476329][ T3416] RDX: 0000000000000000 RSI: ffff888016abe300 RDI: ffff888017878c20
[   80.476335][ T3416] RBP: ffffc90004dffc10 R08: 0000000000000000 R09: 0000000000000000
[   80.476340][ T3416] R10: 0000000000000000 R11: ffffffff82034b65 R12: ffff888017c0a1e8
[   80.476346][ T3416] R13: ffff88800d6a8200 R14: 0000000000000000 R15: ffff888017c08a38
[   80.476354][ T3416]  ? __might_fault+0xb5/0x130
[   80.476367][ T3416]  copy_process+0x1f64/0x3d80
[   80.476375][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
[   80.476388][ T3416]  ? copy_process+0x996/0x3d80
[   80.476399][ T3416]  ? __pfx_copy_process+0x10/0x10
[   80.476406][ T3416]  ? from_pool+0x1e1/0x750
[   80.476416][ T3416]  ? handle_mm_fault+0x122e/0x1ab0
[   80.476432][ T3416]  kernel_clone+0x216/0x940
[   80.476440][ T3416]  ? __pfx_llist_del_first+0x10/0x10
[   80.476448][ T3416]  ? check_new_class+0x28a/0xe90
[   80.476458][ T3416]  ? __pfx_kernel_clone+0x10/0x10
[   80.476468][ T3416]  ? from_pool+0x1e1/0x750
[   80.476478][ T3416]  ? __pfx_from_pool+0x10/0x10
[   80.476487][ T3416]  ? __pfx_from_pool+0x10/0x10
[   80.476502][ T3416]  __x64_sys_clone+0x18d/0x1f0
[   80.476512][ T3416]  ? __pfx___x64_sys_clone+0x10/0x10
[   80.476520][ T3416]  ? llist_add_batch+0x111/0x1f0
[   80.476532][ T3416]  ? dept_task+0x5/0x20
[   80.476539][ T3416]  ? dept_on+0x1c/0x30
[   80.476545][ T3416]  ? dept_exit+0x1c5/0x2c0
[   80.476553][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
[   80.476565][ T3416]  do_syscall_64+0x6f/0x120
[   80.476573][ T3416]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   80.476580][ T3416] RIP: 0033:0x41b26d
[   80.476588][ T3416] Code: b3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 8
[   80.476595][ T3416] RSP: 002b:00007ffa1ad2d198 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
[   80.476604][ T3416] RAX: ffffffffffffffda RBX: 00007ffa1ad2dcdc RCX: 000000000041b26d
[   80.476610][ T3416] RDX: 0000200000000200 RSI: 0000000000000000 RDI: 0000000000001200
[   80.476616][ T3416] RBP: 00007ffa1ad2d1e0 R08: 0000000000000000 R09: 0000000000000000
[   80.476621][ T3416] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffa1ad2d6c0
[   80.476626][ T3416] R13: ffffffffffffffb8 R14: 0000000000000002 R15: 00007ffd95d76940
[   80.476638][ T3416]  </TASK>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-16  7:32       ` Gavin Guo
@ 2025-05-16  7:43         ` Byungchul Park
  0 siblings, 0 replies; 14+ messages in thread
From: Byungchul Park @ 2025-05-16  7:43 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, osalvador, akpm,
	mike.kravetz, kernel-dev, stable, Hugh Dickins, Florent Revest,
	Gavin Shan, kernel_team

On Fri, May 16, 2025 at 03:32:35PM +0800, Gavin Guo wrote:
> On 5/16/25 14:03, Byungchul Park wrote:
> > On Wed, May 14, 2025 at 04:10:12PM +0800, Gavin Guo wrote:
> > > Hi Byungchul,
> > > 
> > > On 5/14/25 14:47, Byungchul Park wrote:
> > > > On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
> > > > > The patch fixes a deadlock which can be triggered by an internal
> > > > > syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> > > > 
> > > > Hi,
> > > > 
> > > > I'm trying to reproduce using the test program [1].  But not yet
> > > > produced.  I see a lot of segfaults while running [1].  I guess
> > > > something goes wrong.  Is there any prerequisite condition to reproduce
> > > > it?  Lemme know if any.  Or can you try DEPT15 with your config and
> > > > environment by the following steps:
> > > > 
> > > >      1. Apply the patchset on v6.15-rc6.
> > > >         https://lkml.kernel.org/r/20250513100730.12664-1-byungchul@sk.com
> > > >      2. Turn on CONFIG_DEPT.
> > > >      3. Run test program reproducing the deadlock.
> > > >      4. Check dmesg to see if dept reported the dependency.
> > > > 
> > > > 	Byungchul
> > > 
> > > I have enabled the patchset and successfully reproduced the bug. It
> > > seems that there is no warning or error log related to the lock. Did I
> > > miss anything? This is the console log:
> > > https://drive.google.com/file/d/1dxWNiO71qE-H-e5NMPqj7W-aW5CkGSSF/view?usp=sharing
> > 
> > My bad.  I think I found the problem that dept didn't report it.  You
> > might see the report with the following patch applied on the top, there
> > might be a lot of false positives along with that might be annoying tho.
> > 
> > Some of my efforts to suppress false positives, suppressed the real one.
> > 
> > Do you mind if I ask you to run the test with the following patch
> > applied?  It'd be appreciated if you do and share the result with me.
> > 
> > 	Byungchul
> > 
> > ---
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index f31cd68f2935..fd7559e663c5 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -1138,6 +1138,7 @@ static inline bool trylock_page(struct page *page)
> >   static inline void folio_lock(struct folio *folio)
> >   {
> >   	might_sleep();
> > +	dept_page_wait_on_bit(&folio->page, PG_locked);
> >   	if (!folio_trylock(folio))
> >   		__folio_lock(folio);
> >   }
> > diff --git a/kernel/dependency/dept.c b/kernel/dependency/dept.c
> > index b2fa96d984bc..4e96a6a72d02 100644
> > --- a/kernel/dependency/dept.c
> > +++ b/kernel/dependency/dept.c
> > @@ -931,7 +931,6 @@ static void print_circle(struct dept_class *c)
> >   	dept_outworld_exit();
> >   	do {
> > -		tc->reported = true;
> >   		tc = fc;
> >   		fc = fc->bfs_parent;
> >   	} while (tc != c);
> > diff --git a/kernel/dependency/dept_unit_test.c b/kernel/dependency/dept_unit_test.c
> > index 88e846b9f876..496149f31fb3 100644
> > --- a/kernel/dependency/dept_unit_test.c
> > +++ b/kernel/dependency/dept_unit_test.c
> > @@ -125,6 +125,8 @@ static int __init dept_ut_init(void)
> >   {
> >   	int i;
> > +	return 0;
> > +
> >   	lockdep_off();
> >   	dept_ut_results.ecxt_stack_valid_cnt = 0;
> > --
> 
> Please see the test result:
> https://drive.google.com/file/d/1B20Gu3wLFbAeaXXb7aSQP5T6aeN9Mext/view?usp=sharing
> 
> It seems that after the first round, the deadlock is captured:

Thank you for testing it again!

Yeah, dept works well, as I expected.  I shouldn't have suppressed dept
reports so aggressively; I (or we) need to deal with the existing false
positives one by one using dept annotations.

Thanks again for confirming it.

	Byungchul

> ubuntu@localhost:~$ ./repro_20250402_0225_154f8fb0580000
> executing program
> [   80.425842][ T3416] ===================================================
> [   80.426707][ T3416] DEPT: Circular dependency has been detected.
> [   80.427497][ T3416] 6.15.0-rc6+ #31 Not tainted
> [   80.428084][ T3416] ---------------------------------------------------
> [   80.428964][ T3416] summary
> [   80.429330][ T3416] ---------------------------------------------------
> [   80.430078][ T3416] *** DEADLOCK ***
> [   80.430078][ T3416]
> [   80.430736][ T3416] context A
> [   80.431076][ T3416]    [S] (unknown)(pg_locked_map:0)
> [   80.431637][ T3416]    [W] lock(&hugetlb_fault_mutex_table[i]:0)
> [   80.432312][ T3416]    [E] dept_page_clear_bit(pg_locked_map:0)
> [   80.432977][ T3416]
> [   80.433246][ T3416] context B
> [   80.433595][ T3416]    [S] lock(&hugetlb_fault_mutex_table[i]:0)
> [   80.434245][ T3416]    [W] dept_page_wait_on_bit(pg_locked_map:0)
> [   80.434880][ T3416]    [E] unlock(&hugetlb_fault_mutex_table[i]:0)
> [   80.435592][ T3416]
> [   80.435852][ T3416] [S]: start of the event context
> [   80.436369][ T3416] [W]: the wait blocked
> [   80.436789][ T3416] [E]: the event not reachable
> [   80.437275][ T3416] ---------------------------------------------------
> [   80.437950][ T3416] context A's detail
> [   80.438367][ T3416] ---------------------------------------------------
> [   80.439006][ T3416] context A
> [   80.439337][ T3416]    [S] (unknown)(pg_locked_map:0)
> [   80.439883][ T3416]    [W] lock(&hugetlb_fault_mutex_table[i]:0)
> [   80.440489][ T3416]    [E] dept_page_clear_bit(pg_locked_map:0)
> [   80.441075][ T3416]
> [   80.441318][ T3416] [S] (unknown)(pg_locked_map:0):
> [   80.441816][ T3416] (N/A)
> [   80.442077][ T3416]
> [   80.442309][ T3416] [W] lock(&hugetlb_fault_mutex_table[i]:0):
> [   80.442872][ T3416] [<ffffffff82144644>] hugetlb_wp+0xfa4/0x3490
> [   80.443502][ T3416] stacktrace:
> [   80.443810][ T3416]       hugetlb_wp+0xfa4/0x3490
> [   80.444267][ T3416]       hugetlb_fault+0x1505/0x2c70
> [   80.444776][ T3416]       handle_mm_fault+0x1845/0x1ab0
> [   80.445275][ T3416]       do_user_addr_fault+0x637/0x1450
> [   80.445779][ T3416]       exc_page_fault+0x67/0x110
> [   80.446239][ T3416]       asm_exc_page_fault+0x26/0x30
> [   80.446722][ T3416]       __put_user_4+0xd/0x20
> [   80.447157][ T3416]       copy_process+0x1f64/0x3d80
> [   80.447621][ T3416]       kernel_clone+0x216/0x940
> [   80.448068][ T3416]       __x64_sys_clone+0x18d/0x1f0
> [   80.448548][ T3416]       do_syscall_64+0x6f/0x120
> [   80.448999][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   80.449556][ T3416]
> [   80.449765][ T3416] [E] dept_page_clear_bit(pg_locked_map:0):
> [   80.450272][ T3416] [<ffffffff8214263b>] hugetlb_fault+0x1ccb/0x2c70
> [   80.450861][ T3416] stacktrace:
> [   80.451148][ T3416]       hugetlb_fault+0x1ccb/0x2c70
> [   80.451611][ T3416]       handle_mm_fault+0x1845/0x1ab0
> [   80.452080][ T3416]       do_user_addr_fault+0x637/0x1450
> [   80.452566][ T3416]       exc_page_fault+0x67/0x110
> [   80.453014][ T3416]       asm_exc_page_fault+0x26/0x30
> [   80.453497][ T3416]       __put_user_4+0xd/0x20
> [   80.453923][ T3416]       copy_process+0x1f64/0x3d80
> [   80.454379][ T3416]       kernel_clone+0x216/0x940
> [   80.454817][ T3416]       __x64_sys_clone+0x18d/0x1f0
> [   80.455277][ T3416]       do_syscall_64+0x6f/0x120
> [   80.455722][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   80.456253][ T3416] ---------------------------------------------------
> [   80.456842][ T3416] context B's detail
> [   80.457198][ T3416] ---------------------------------------------------
> [   80.457842][ T3416] context B
> [   80.458122][ T3416]    [S] lock(&hugetlb_fault_mutex_table[i]:0)
> [   80.458661][ T3416]    [W] dept_page_wait_on_bit(pg_locked_map:0)
> [   80.459187][ T3416]    [E] unlock(&hugetlb_fault_mutex_table[i]:0)
> [   80.459763][ T3416]
> [   80.459988][ T3416] [S] lock(&hugetlb_fault_mutex_table[i]:0):
> [   80.460509][ T3416] [<ffffffff82140d36>] hugetlb_fault+0x3c6/0x2c70
> [   80.461074][ T3416] stacktrace:
> [   80.461374][ T3416]       hugetlb_fault+0x3c6/0x2c70
> [   80.461812][ T3416]       handle_mm_fault+0x1845/0x1ab0
> [   80.462281][ T3416]       do_user_addr_fault+0x637/0x1450
> [   80.462775][ T3416]       exc_page_fault+0x67/0x110
> [   80.463220][ T3416]       asm_exc_page_fault+0x26/0x30
> [   80.463694][ T3416]       __put_user_4+0xd/0x20
> [   80.464129][ T3416]       copy_process+0x1f64/0x3d80
> [   80.464577][ T3416]       kernel_clone+0x216/0x940
> [   80.464994][ T3416]       __x64_sys_clone+0x18d/0x1f0
> [   80.465466][ T3416]       do_syscall_64+0x6f/0x120
> [   80.465909][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   80.466457][ T3416]
> [   80.466660][ T3416] [W] dept_page_wait_on_bit(pg_locked_map:0):
> [   80.467177][ T3416] [<ffffffff82141187>] hugetlb_fault+0x817/0x2c70
> [   80.467740][ T3416] stacktrace:
> [   80.468032][ T3416]       hugetlb_fault+0x817/0x2c70
> [   80.468479][ T3416]       handle_mm_fault+0x1845/0x1ab0
> [   80.468947][ T3416]       do_user_addr_fault+0x637/0x1450
> [   80.469428][ T3416]       exc_page_fault+0x67/0x110
> [   80.469865][ T3416]       asm_exc_page_fault+0x26/0x30
> [   80.470332][ T3416]       __put_user_4+0xd/0x20
> [   80.470742][ T3416]       copy_process+0x1f64/0x3d80
> [   80.471186][ T3416]       kernel_clone+0x216/0x940
> [   80.471616][ T3416]       __x64_sys_clone+0x18d/0x1f0
> [   80.472060][ T3416]       do_syscall_64+0x6f/0x120
> [   80.472492][ T3416]       entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   80.473040][ T3416]
> [   80.473271][ T3416] [E] unlock(&hugetlb_fault_mutex_table[i]:0):
> [   80.473863][ T3416] (N/A)
> [   80.474124][ T3416] ---------------------------------------------------
> [   80.474738][ T3416] information that might be helpful
> [   80.475210][ T3416] ---------------------------------------------------
> [   80.475820][ T3416] CPU: 1 UID: 1000 PID: 3416 Comm: repro_20250402_ Not
> tainted 6.15.0-rc6+ #31 NONE
> [   80.475831][ T3416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   80.475837][ T3416] Call Trace:
> [   80.475841][ T3416]  <TASK>
> [   80.475845][ T3416]  dump_stack_lvl+0x1ad/0x280
> [   80.475858][ T3416]  ? __pfx_dump_stack_lvl+0x10/0x10
> [   80.475867][ T3416]  ? __pfx__printk+0x10/0x10
> [   80.475883][ T3416]  cb_check_dl+0x24a8/0x2530
> [   80.475897][ T3416]  ? bfs_extend_dep+0x271/0x290
> [   80.475909][ T3416]  bfs+0x464/0x5e0
> [   80.475921][ T3416]  ? __pfx_bfs+0x10/0x10
> [   80.475931][ T3416]  ? add_dep+0x387/0x710
> [   80.475943][ T3416]  add_dep+0x3d0/0x710
> [   80.475953][ T3416]  ? __pfx_from_pool+0x10/0x10
> [   80.475963][ T3416]  ? __pfx_bfs_init_check_dl+0x10/0x10
> [   80.475972][ T3416]  ? __pfx_bfs_extend_dep+0x10/0x10
> [   80.475981][ T3416]  ? __pfx_bfs_dequeue_dep+0x10/0x10
> [   80.475990][ T3416]  ? __pfx_cb_check_dl+0x10/0x10
> [   80.475999][ T3416]  ? __pfx_add_dep+0x10/0x10
> [   80.476011][ T3416]  ? put_ecxt+0xda/0x4b0
> [   80.476024][ T3416]  __dept_event+0xee8/0x1590
> [   80.476038][ T3416]  dept_event+0x166/0x240
> [   80.476047][ T3416]  ? hugetlb_fault+0x1ccb/0x2c70
> [   80.476057][ T3416]  folio_unlock+0xb8/0x190
> [   80.476071][ T3416]  hugetlb_fault+0x1ccb/0x2c70
> [   80.476085][ T3416]  ? __pfx_hugetlb_fault+0x10/0x10
> [   80.476100][ T3416]  ? mt_find+0x15a/0x5f0
> [   80.476110][ T3416]  handle_mm_fault+0x1845/0x1ab0
> [   80.476125][ T3416]  ? handle_mm_fault+0xdb/0x1ab0
> [   80.476142][ T3416]  ? __pfx_handle_mm_fault+0x10/0x10
> [   80.476156][ T3416]  ? find_vma+0xec/0x160
> [   80.476164][ T3416]  ? __pfx_find_vma+0x10/0x10
> [   80.476172][ T3416]  ? dept_on+0x1c/0x30
> [   80.476179][ T3416]  ? dept_exit+0x1c5/0x2c0
> [   80.476186][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
> [   80.476197][ T3416]  ? lock_mm_and_find_vma+0xa1/0x300
> [   80.476211][ T3416]  do_user_addr_fault+0x637/0x1450
> [   80.476219][ T3416]  ? mntput_no_expire+0xc0/0x870
> [   80.476235][ T3416]  ? __pfx_do_user_addr_fault+0x10/0x10
> [   80.476246][ T3416]  ? trace_irq_disable+0x60/0x180
> [   80.476258][ T3416]  exc_page_fault+0x67/0x110
> [   80.476272][ T3416]  asm_exc_page_fault+0x26/0x30
> [   80.476280][ T3416] RIP: 0010:__put_user_4+0xd/0x20
> [   80.476293][ T3416] Code: 66 89 01 31 c9 0f 1f 00 c3 cc cc cc cc 90 90 90
> 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 1f
> 00 <89> 01 31 c9 0
> [   80.476312][ T3416] RSP: 0018:ffffc90004dffa38 EFLAGS: 00010206
> [   80.476322][ T3416] RAX: 000000000000000c RBX: 0000000000000000 RCX:
> 0000200000000200
> [   80.476329][ T3416] RDX: 0000000000000000 RSI: ffff888016abe300 RDI:
> ffff888017878c20
> [   80.476335][ T3416] RBP: ffffc90004dffc10 R08: 0000000000000000 R09:
> 0000000000000000
> [   80.476340][ T3416] R10: 0000000000000000 R11: ffffffff82034b65 R12:
> ffff888017c0a1e8
> [   80.476346][ T3416] R13: ffff88800d6a8200 R14: 0000000000000000 R15:
> ffff888017c08a38
> [   80.476354][ T3416]  ? __might_fault+0xb5/0x130
> [   80.476367][ T3416]  copy_process+0x1f64/0x3d80
> [   80.476375][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
> [   80.476388][ T3416]  ? copy_process+0x996/0x3d80
> [   80.476399][ T3416]  ? __pfx_copy_process+0x10/0x10
> [   80.476406][ T3416]  ? from_pool+0x1e1/0x750
> [   80.476416][ T3416]  ? handle_mm_fault+0x122e/0x1ab0
> [   80.476432][ T3416]  kernel_clone+0x216/0x940
> [   80.476440][ T3416]  ? __pfx_llist_del_first+0x10/0x10
> [   80.476448][ T3416]  ? check_new_class+0x28a/0xe90
> [   80.476458][ T3416]  ? __pfx_kernel_clone+0x10/0x10
> [   80.476468][ T3416]  ? from_pool+0x1e1/0x750
> [   80.476478][ T3416]  ? __pfx_from_pool+0x10/0x10
> [   80.476487][ T3416]  ? __pfx_from_pool+0x10/0x10
> [   80.476502][ T3416]  __x64_sys_clone+0x18d/0x1f0
> [   80.476512][ T3416]  ? __pfx___x64_sys_clone+0x10/0x10
> [   80.476520][ T3416]  ? llist_add_batch+0x111/0x1f0
> [   80.476532][ T3416]  ? dept_task+0x5/0x20
> [   80.476539][ T3416]  ? dept_on+0x1c/0x30
> [   80.476545][ T3416]  ? dept_exit+0x1c5/0x2c0
> [   80.476553][ T3416]  ? lockdep_hardirqs_on_prepare+0x21/0x280
> [   80.476565][ T3416]  do_syscall_64+0x6f/0x120
> [   80.476573][ T3416]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   80.476580][ T3416] RIP: 0033:0x41b26d
> [   80.476588][ T3416] Code: b3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e
> fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
> 05 <48> 3d 01 f0 8
> [   80.476595][ T3416] RSP: 002b:00007ffa1ad2d198 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000038
> [   80.476604][ T3416] RAX: ffffffffffffffda RBX: 00007ffa1ad2dcdc RCX:
> 000000000041b26d
> [   80.476610][ T3416] RDX: 0000200000000200 RSI: 0000000000000000 RDI:
> 0000000000001200
> [   80.476616][ T3416] RBP: 00007ffa1ad2d1e0 R08: 0000000000000000 R09:
> 0000000000000000
> [   80.476621][ T3416] R10: 0000000000000000 R11: 0000000000000206 R12:
> 00007ffa1ad2d6c0
> [   80.476626][ T3416] R13: ffffffffffffffb8 R14: 0000000000000002 R15:
> 00007ffd95d76940
> [   80.476638][ T3416]  </TASK>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-13  9:34 [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table Gavin Guo
  2025-05-14  0:56 ` Andrew Morton
  2025-05-14  6:47 ` Byungchul Park
@ 2025-05-20 19:53 ` Oscar Salvador
  2025-05-21 11:12   ` Gavin Guo
  2025-05-26  4:41 ` Gavin Shan
  3 siblings, 1 reply; 14+ messages in thread
From: Oscar Salvador @ 2025-05-20 19:53 UTC (permalink / raw)
  To: Gavin Guo
  Cc: linux-mm, linux-kernel, muchun.song, akpm, mike.kravetz,
	kernel-dev, stable, Hugh Dickins, Florent Revest, Gavin Shan

On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
> The patch fixes a deadlock which can be triggered by an internal
> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> [3] in this scenario:
> 
> Process 1                              Process 2
> ---				       ---
> hugetlb_fault
>   mutex_lock(B) // take B
>   filemap_lock_hugetlb_folio
>     filemap_lock_folio
>       __filemap_get_folio
>         folio_lock(A) // take A
>   hugetlb_wp
>     mutex_unlock(B) // release B
>     ...                                hugetlb_fault
>     ...                                  mutex_lock(B) // take B
>                                          filemap_lock_hugetlb_folio
>                                            filemap_lock_folio
>                                              __filemap_get_folio
>                                                folio_lock(A) // blocked
>     unmap_ref_private
>     ...
>     mutex_lock(B) // retake and blocked
> 
...
> Signed-off-by: Gavin Guo <gavinguo@igalia.com>

I think this is more convoluted than it needs to be?

hugetlb_wp() is called from hugetlb_no_page() and hugetlb_fault().
hugetlb_no_page() locks and unlocks the folio lock itself, which leaves us
with hugetlb_fault().

hugetlb_fault() always passes the folio locked to hugetlb_wp(), and the
latter only unlocks it when we have a COW from the owner happening and we
cannot satisfy the allocation.
So, shouldn't checking whether the folio is still locked after
returning be enough?
What speaks against:

 diff --git a/mm/hugetlb.c b/mm/hugetlb.c
 index bd8971388236..23b57c5689a4 100644
 --- a/mm/hugetlb.c
 +++ b/mm/hugetlb.c
 @@ -6228,6 +6228,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
  			u32 hash;
 
  			folio_put(old_folio);
 +			/*
 +			* The pagecache_folio needs to be unlocked to avoid
 +			* deadlock when the child unmaps the folio.
 +			*/
 +			if (pagecache_folio)
 +				folio_unlock(pagecache_folio);
  			/*
  			 * Drop hugetlb_fault_mutex and vma_lock before
  			 * unmapping.  unmapping needs to hold vma_lock
 @@ -6825,7 +6831,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  	spin_unlock(vmf.ptl);
 
  	if (pagecache_folio) {
 -		folio_unlock(pagecache_folio);
 +		/*
 +		 * hugetlb_wp() might have already unlocked pagecache_folio, so
 +		 * skip it if that is the case.
 +		 */
 +		if (folio_test_locked(pagecache_folio))
 +			folio_unlock(pagecache_folio);
  		folio_put(pagecache_folio);
  	}
  out_mutex:

> ---
>  mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e3e6ac991b9c..ad54a74aa563 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
>  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> -		       struct vm_fault *vmf)
> +		       struct vm_fault *vmf,
> +		       bool *pagecache_folio_unlocked)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  			u32 hash;
>  
>  			folio_put(old_folio);
> +			/*
> +			 * The pagecache_folio needs to be unlocked to avoid
> +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
> +			 * pagecache_folio could be truncated after being
> +			 * unlocked. So its state should not be relied
> +			 * subsequently.
> +			 *
> +			 * Setting *pagecache_folio_unlocked to true allows the
> +			 * caller to handle any necessary logic related to the
> +			 * folio's unlocked state.
> +			 */
> +			if (pagecache_folio) {
> +				folio_unlock(pagecache_folio);
> +				if (pagecache_folio_unlocked)
> +					*pagecache_folio_unlocked = true;
> +			}
>  			/*
>  			 * Drop hugetlb_fault_mutex and vma_lock before
>  			 * unmapping.  unmapping needs to hold vma_lock
> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  	hugetlb_count_add(pages_per_huge_page(h), mm);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>  		/* Optimization, do the COW without a second fault */
> -		ret = hugetlb_wp(folio, vmf);
> +		ret = hugetlb_wp(folio, vmf, NULL);
>  	}
>  
>  	spin_unlock(vmf->ptl);
> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct address_space *mapping;
>  	int need_wait_lock = 0;
> +	bool pagecache_folio_unlocked = false;
>  	struct vm_fault vmf = {
>  		.vma = vma,
>  		.address = address & huge_page_mask(h),
> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>  		if (!huge_pte_write(vmf.orig_pte)) {
> -			ret = hugetlb_wp(pagecache_folio, &vmf);
> +			ret = hugetlb_wp(pagecache_folio, &vmf,
> +					&pagecache_folio_unlocked);
>  			goto out_put_page;
>  		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>  			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  out_ptl:
>  	spin_unlock(vmf.ptl);
>  
> -	if (pagecache_folio) {
> +	/*
> +	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
> +	 * folio_unlock() here.
> +	 */
> +	if (pagecache_folio && !pagecache_folio_unlocked)
>  		folio_unlock(pagecache_folio);
> +	if (pagecache_folio)
>  		folio_put(pagecache_folio);
> -	}
>  out_mutex:
>  	hugetlb_vma_unlock_read(vma);
>  
> 
> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3
> -- 
> 2.43.0
> 
> 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-20 19:53 ` Oscar Salvador
@ 2025-05-21 11:12   ` Gavin Guo
  0 siblings, 0 replies; 14+ messages in thread
From: Gavin Guo @ 2025-05-21 11:12 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, linux-kernel, muchun.song, akpm, mike.kravetz,
	kernel-dev, stable, Hugh Dickins, Florent Revest, Gavin Shan

On 5/21/25 03:53, Oscar Salvador wrote:
> On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
>> The patch fixes a deadlock which can be triggered by an internal
>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>> [3] in this scenario:
>>
>> Process 1                              Process 2
>> ---				       ---
>> hugetlb_fault
>>    mutex_lock(B) // take B
>>    filemap_lock_hugetlb_folio
>>      filemap_lock_folio
>>        __filemap_get_folio
>>          folio_lock(A) // take A
>>    hugetlb_wp
>>      mutex_unlock(B) // release B
>>      ...                                hugetlb_fault
>>      ...                                  mutex_lock(B) // take B
>>                                           filemap_lock_hugetlb_folio
>>                                             filemap_lock_folio
>>                                               __filemap_get_folio
>>                                                 folio_lock(A) // blocked
>>      unmap_ref_private
>>      ...
>>      mutex_lock(B) // retake and blocked
>>
> ...
>> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
> 
> I think this is more convoluted that it needs to be?
> 
> hugetlb_wp() is called from hugetlb_no_page() and hugetlb_fault().
> hugetlb_no_page() locks and unlocks the lock itself, which leaves us
> with hugetlb_fault().
> 
> hugetlb_fault() always passed the folio locked to hugetlb_wp(), and the
> latter only unlocks it when we have a cow from owner happening and we
> cannot satisfy the allocation.
> So, should not checking whether the folio is still locked after
> returning enough?
> What speaks against:
> 
>   diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>   index bd8971388236..23b57c5689a4 100644
>   --- a/mm/hugetlb.c
>   +++ b/mm/hugetlb.c
>   @@ -6228,6 +6228,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>    			u32 hash;
>   
>    			folio_put(old_folio);
>   +			/*
>   +			* The pagecache_folio needs to be unlocked to avoid
>   +			* deadlock when the child unmaps the folio.
>   +			*/
>   +			if (pagecache_folio)
>   +				folio_unlock(pagecache_folio);
>    			/*
>    			 * Drop hugetlb_fault_mutex and vma_lock before
>    			 * unmapping.  unmapping needs to hold vma_lock
>   @@ -6825,7 +6831,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>    	spin_unlock(vmf.ptl);
>   
>    	if (pagecache_folio) {
>   -		folio_unlock(pagecache_folio);
>   +		/*
>   +		 * hugetlb_wp() might have already unlocked pagecache_folio, so
>   +		 * skip it if that is the case.
>   +		 */
>   +		if (folio_test_locked(pagecache_folio))
>   +			folio_unlock(pagecache_folio);
>    		folio_put(pagecache_folio);
>    	}
>    out_mutex:

Really appreciate your review. Good catch! This is elegant. The patch is 
under testing and I'll send out the v2 patch soon.

> 
>> ---
>>   mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>>   1 file changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e3e6ac991b9c..ad54a74aa563 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>>    */
>>   static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>> -		       struct vm_fault *vmf)
>> +		       struct vm_fault *vmf,
>> +		       bool *pagecache_folio_unlocked)
>>   {
>>   	struct vm_area_struct *vma = vmf->vma;
>>   	struct mm_struct *mm = vma->vm_mm;
>> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>   			u32 hash;
>>   
>>   			folio_put(old_folio);
>> +			/*
>> +			 * The pagecache_folio needs to be unlocked to avoid
>> +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
>> +			 * pagecache_folio could be truncated after being
>> +			 * unlocked. So its state should not be relied
>> +			 * subsequently.
>> +			 *
>> +			 * Setting *pagecache_folio_unlocked to true allows the
>> +			 * caller to handle any necessary logic related to the
>> +			 * folio's unlocked state.
>> +			 */
>> +			if (pagecache_folio) {
>> +				folio_unlock(pagecache_folio);
>> +				if (pagecache_folio_unlocked)
>> +					*pagecache_folio_unlocked = true;
>> +			}
>>   			/*
>>   			 * Drop hugetlb_fault_mutex and vma_lock before
>>   			 * unmapping.  unmapping needs to hold vma_lock
>> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>>   	hugetlb_count_add(pages_per_huge_page(h), mm);
>>   	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>>   		/* Optimization, do the COW without a second fault */
>> -		ret = hugetlb_wp(folio, vmf);
>> +		ret = hugetlb_wp(folio, vmf, NULL);
>>   	}
>>   
>>   	spin_unlock(vmf->ptl);
>> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   	struct hstate *h = hstate_vma(vma);
>>   	struct address_space *mapping;
>>   	int need_wait_lock = 0;
>> +	bool pagecache_folio_unlocked = false;
>>   	struct vm_fault vmf = {
>>   		.vma = vma,
>>   		.address = address & huge_page_mask(h),
>> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   
>>   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>   		if (!huge_pte_write(vmf.orig_pte)) {
>> -			ret = hugetlb_wp(pagecache_folio, &vmf);
>> +			ret = hugetlb_wp(pagecache_folio, &vmf,
>> +					&pagecache_folio_unlocked);
>>   			goto out_put_page;
>>   		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>>   			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
>> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   out_ptl:
>>   	spin_unlock(vmf.ptl);
>>   
>> -	if (pagecache_folio) {
>> +	/*
>> +	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
>> +	 * folio_unlock() here.
>> +	 */
>> +	if (pagecache_folio && !pagecache_folio_unlocked)
>>   		folio_unlock(pagecache_folio);
>> +	if (pagecache_folio)
>>   		folio_put(pagecache_folio);
>> -	}
>>   out_mutex:
>>   	hugetlb_vma_unlock_read(vma);
>>   
>>
>> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3
>> -- 
>> 2.43.0
>>
>>
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-13  9:34 [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table Gavin Guo
                   ` (2 preceding siblings ...)
  2025-05-20 19:53 ` Oscar Salvador
@ 2025-05-26  4:41 ` Gavin Shan
  2025-05-27  9:59   ` Gavin Guo
  3 siblings, 1 reply; 14+ messages in thread
From: Gavin Shan @ 2025-05-26  4:41 UTC (permalink / raw)
  To: Gavin Guo, linux-mm
  Cc: linux-kernel, muchun.song, osalvador, akpm, mike.kravetz,
	kernel-dev, stable, Hugh Dickins, Florent Revest

Hi Gavin,

On 5/13/25 7:34 PM, Gavin Guo wrote:
> The patch fixes a deadlock which can be triggered by an internal
> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> [3] in this scenario:
> 
> Process 1                              Process 2
> ---				       ---
> hugetlb_fault
>    mutex_lock(B) // take B
>    filemap_lock_hugetlb_folio
>      filemap_lock_folio
>        __filemap_get_folio
>          folio_lock(A) // take A
>    hugetlb_wp
>      mutex_unlock(B) // release B
>      ...                                hugetlb_fault
>      ...                                  mutex_lock(B) // take B
>                                           filemap_lock_hugetlb_folio
>                                             filemap_lock_folio
>                                               __filemap_get_folio
>                                                 folio_lock(A) // blocked
>      unmap_ref_private
>      ...
>      mutex_lock(B) // retake and blocked
> 
> This is a ABBA deadlock involving two locks:
> - Lock A: pagecache_folio lock
> - Lock B: hugetlb_fault_mutex_table lock
> 
> The deadlock occurs between two processes as follows:
> 1. The first process (let’s call it Process 1) is handling a
> copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
> insufficient reserved hugetlb pages, Process 1, owner of the reserved
> hugetlb page, attempts to unmap a hugepage owned by another process
> (non-owner) to satisfy the reservation. Before unmapping, Process 1
> acquires lock B (hugetlb_fault_mutex_table lock) and then lock A
> (pagecache_folio lock). To proceed with the unmap, it releases Lock B
> but retains Lock A. After the unmap, Process 1 tries to reacquire Lock
> B. However, at this point, Lock B has already been acquired by another
> process.
> 
> 2. The second process (Process 2) enters the hugetlb_fault handler
> during the unmap operation. It successfully acquires Lock B
> (hugetlb_fault_mutex_table lock) that was just released by Process 1,
> but then attempts to acquire Lock A (pagecache_folio lock), which is
> still held by Process 1.
> 
> As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
> (held by Process 2), while Process 2 (holding Lock B) is blocked waiting
> for Lock A (held by Process 1), constructing a ABBA deadlock scenario.
> 
> The solution here is to unlock the pagecache_folio and provide the
> pagecache_folio_unlocked variable to the caller to have the visibility
> over the pagecache_folio status for subsequent handling.
> 
> The error message:
> INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
>        Not tainted 6.15.0-rc3+ #24
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00004006
> Call Trace:
>   <TASK>
>   __schedule+0x1755/0x4f50
>   schedule+0x158/0x330
>   schedule_preempt_disabled+0x15/0x30
>   __mutex_lock+0x75f/0xeb0
>   hugetlb_wp+0xf88/0x3440
>   hugetlb_fault+0x14c8/0x2c30
>   trace_clock_x86_tsc+0x20/0x20
>   do_user_addr_fault+0x61d/0x1490
>   exc_page_fault+0x64/0x100
>   asm_exc_page_fault+0x26/0x30
> RIP: 0010:__put_user_4+0xd/0x20
>   copy_process+0x1f4a/0x3d60
>   kernel_clone+0x210/0x8f0
>   __x64_sys_clone+0x18d/0x1f0
>   do_syscall_64+0x6a/0x120
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x41b26d
>   </TASK>
> INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
> Call Trace:
>   <TASK>
>   __schedule+0x1755/0x4f50
>   schedule+0x158/0x330
>   io_schedule+0x92/0x110
>   folio_wait_bit_common+0x69a/0xba0
>   __filemap_get_folio+0x154/0xb70
>   hugetlb_fault+0xa50/0x2c30
>   trace_clock_x86_tsc+0x20/0x20
>   do_user_addr_fault+0xace/0x1490
>   exc_page_fault+0x64/0x100
>   asm_exc_page_fault+0x26/0x30
> RIP: 0033:0x402619
>   </TASK>
> INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
>        Not tainted 6.15.0-rc3+ #24
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
> Call Trace:
>   <TASK>
>   __schedule+0x1755/0x4f50
>   schedule+0x158/0x330
>   io_schedule+0x92/0x110
>   folio_wait_bit_common+0x69a/0xba0
>   __filemap_get_folio+0x154/0xb70
>   hugetlb_fault+0xa50/0x2c30
>   trace_clock_x86_tsc+0x20/0x20
>   do_user_addr_fault+0xace/0x1490
>   exc_page_fault+0x64/0x100
>   asm_exc_page_fault+0x26/0x30
> RIP: 0033:0x402619
>   </TASK>
> 
> Showing all locks held in the system:
> 1 lock held by khungtaskd/35:
>   #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
> 2 locks held by repro_20250402_/13229:
>   #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
> 3 locks held by repro_20250402_/13250:
>   #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
>   #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
> 
> Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
> Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
> Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
> Cc: <stable@vger.kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Florent Revest <revest@google.com>
> Cc: Gavin Shan <gshan@redhat.com>
> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
> ---
>   mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>   1 file changed, 28 insertions(+), 5 deletions(-)
> 

I guess the changelog can become more concise once the kernel log is dropped. The
summarized stack trace is sufficient to show how the deadlock scenario happens. Besides,
there's no need to mention bpftrace and its output. So the changelog could be simplified
to something like below; please polish it a bit if you'd like to take it. The solution
looks good except for some nitpicks below.

---

There is an ABBA deadlock scenario between hugetlb_fault() and hugetlb_wp() on the
pagecache folio's lock and the hugetlb global mutex, which is reproducible with syzkaller
[1]. As the stack traces below reveal, process-1 tries to take the hugetlb global mutex (A3)
while still holding the pagecache folio's lock. Process-2 took the hugetlb global mutex but
tries to take the pagecache folio's lock.

[1] https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link

Process-1                                       Process-2
=========                                       =========
hugetlb_fault
   mutex_lock                  (A1)
   filemap_lock_hugetlb_folio  (B1)
   hugetlb_wp
     alloc_hugetlb_folio       #error
       mutex_unlock            (A2)
                                                hugetlb_fault
                                                  mutex_lock                  (A4)
                                                  filemap_lock_hugetlb_folio  (B4)
       unmap_ref_private
       mutex_lock              (A3)

Fix it by releasing the pagecache folio's lock at (A2) in process-1 so that the pagecache
folio's lock is available to process-2 at (B4), avoiding the deadlock. In process-1, a new
variable is added to track whether the pagecache folio's lock has been released by its
callee hugetlb_wp(), to avoid a double release of the lock in hugetlb_fault(). Similar
changes are applied to hugetlb_no_page().

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e3e6ac991b9c..ad54a74aa563 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>    */
>   static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> -		       struct vm_fault *vmf)
> +		       struct vm_fault *vmf,
> +		       bool *pagecache_folio_unlocked)

Nitpick: the variable could be renamed to 'pagecache_folio_locked' if you're happy
with that.

>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   	struct mm_struct *mm = vma->vm_mm;
> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>   			u32 hash;
>   
>   			folio_put(old_folio);
> +			/*
> +			 * The pagecache_folio needs to be unlocked to avoid
                                                ^^^^^^^^

                                                has to be (?)

> +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
> +			 * pagecache_folio could be truncated after being
> +			 * unlocked. So its state should not be relied
                                                                 ^^^^^^
                                                                 reliable (?)
> +			 * subsequently.
> +			 *
> +			 * Setting *pagecache_folio_unlocked to true allows the
> +			 * caller to handle any necessary logic related to the
> +			 * folio's unlocked state.
> +			 */
> +			if (pagecache_folio) {
> +				folio_unlock(pagecache_folio);
> +				if (pagecache_folio_unlocked)
> +					*pagecache_folio_unlocked = true;
> +			}

The second section of the comment looks a bit redundant since the code changes
are self-explanatory enough :-)

>   			/*
>   			 * Drop hugetlb_fault_mutex and vma_lock before
>   			 * unmapping.  unmapping needs to hold vma_lock
> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   	hugetlb_count_add(pages_per_huge_page(h), mm);
>   	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>   		/* Optimization, do the COW without a second fault */
> -		ret = hugetlb_wp(folio, vmf);
> +		ret = hugetlb_wp(folio, vmf, NULL);

It's not certain whether we have another deadlock between hugetlb_no_page() and hugetlb_wp(),
similar to the existing one between hugetlb_fault() and hugetlb_wp(). So I think it's
reasonable to pass '&pagecache_folio_locked' to hugetlb_wp() here as well and skip the
unlock in hugetlb_no_page() when pagecache_folio_locked == false. It's not harmful at
least; a rough sketch follows.
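
For illustration only, an untested sketch of that change in hugetlb_no_page(), written
against the patch's current semantics where hugetlb_wp() sets the flag once it has dropped
the lock (the local variable name and the exact placement on the exit path are only
indicative):

	bool folio_unlocked = false;

	...
	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
		/* Optimization, do the COW without a second fault */
		ret = hugetlb_wp(folio, vmf, &folio_unlocked);
	}
	...
	/* Only unlock here if hugetlb_wp() has not already dropped the lock. */
	if (!folio_unlocked)
		folio_unlock(folio);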

>   	}
>   
>   	spin_unlock(vmf->ptl);
> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	struct hstate *h = hstate_vma(vma);
>   	struct address_space *mapping;
>   	int need_wait_lock = 0;
> +	bool pagecache_folio_unlocked = false;
>   	struct vm_fault vmf = {
>   		.vma = vma,
>   		.address = address & huge_page_mask(h),
> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>   		if (!huge_pte_write(vmf.orig_pte)) {
> -			ret = hugetlb_wp(pagecache_folio, &vmf);
> +			ret = hugetlb_wp(pagecache_folio, &vmf,
> +					&pagecache_folio_unlocked);
>   			goto out_put_page;
>   		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>   			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   out_ptl:
>   	spin_unlock(vmf.ptl);
>   
> -	if (pagecache_folio) {
> +	/*
> +	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
> +	 * folio_unlock() here.
> +	 */
> +	if (pagecache_folio && !pagecache_folio_unlocked)
>   		folio_unlock(pagecache_folio);
> +	if (pagecache_folio)
>   		folio_put(pagecache_folio);
> -	}

The comments seem redundant since the code changes are self-explanatory.
Besides, there's no need to check 'pagecache_folio' twice:

	if (pagecache_folio) {
		if (pagecache_folio_locked)
			folio_unlock(pagecache_folio);

		folio_put(pagecache_folio);
	}

>   out_mutex:
>   	hugetlb_vma_unlock_read(vma);
>   
> 
> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-26  4:41 ` Gavin Shan
@ 2025-05-27  9:59   ` Gavin Guo
  2025-05-27 10:59     ` Gavin Shan
  0 siblings, 1 reply; 14+ messages in thread
From: Gavin Guo @ 2025-05-27  9:59 UTC (permalink / raw)
  To: Gavin Shan, linux-mm
  Cc: linux-kernel, muchun.song, osalvador, akpm, mike.kravetz,
	kernel-dev, stable, Hugh Dickins, Florent Revest

Hi Gavin,

On 5/26/25 12:41, Gavin Shan wrote:
> Hi Gavin,
> 
> On 5/13/25 7:34 PM, Gavin Guo wrote:
>> The patch fixes a deadlock which can be triggered by an internal
>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>> [3] in this scenario:
>>
>> Process 1                              Process 2
>> ---                       ---
>> hugetlb_fault
>>    mutex_lock(B) // take B
>>    filemap_lock_hugetlb_folio
>>      filemap_lock_folio
>>        __filemap_get_folio
>>          folio_lock(A) // take A
>>    hugetlb_wp
>>      mutex_unlock(B) // release B
>>      ...                                hugetlb_fault
>>      ...                                  mutex_lock(B) // take B
>>                                           filemap_lock_hugetlb_folio
>>                                             filemap_lock_folio
>>                                               __filemap_get_folio
>>                                                 folio_lock(A) // blocked
>>      unmap_ref_private
>>      ...
>>      mutex_lock(B) // retake and blocked
>>
>> This is a ABBA deadlock involving two locks:
>> - Lock A: pagecache_folio lock
>> - Lock B: hugetlb_fault_mutex_table lock
>>
>> The deadlock occurs between two processes as follows:
>> 1. The first process (let’s call it Process 1) is handling a
>> copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
>> insufficient reserved hugetlb pages, Process 1, owner of the reserved
>> hugetlb page, attempts to unmap a hugepage owned by another process
>> (non-owner) to satisfy the reservation. Before unmapping, Process 1
>> acquires lock B (hugetlb_fault_mutex_table lock) and then lock A
>> (pagecache_folio lock). To proceed with the unmap, it releases Lock B
>> but retains Lock A. After the unmap, Process 1 tries to reacquire Lock
>> B. However, at this point, Lock B has already been acquired by another
>> process.
>>
>> 2. The second process (Process 2) enters the hugetlb_fault handler
>> during the unmap operation. It successfully acquires Lock B
>> (hugetlb_fault_mutex_table lock) that was just released by Process 1,
>> but then attempts to acquire Lock A (pagecache_folio lock), which is
>> still held by Process 1.
>>
>> As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
>> (held by Process 2), while Process 2 (holding Lock B) is blocked waiting
>> for Lock A (held by Process 1), constructing a ABBA deadlock scenario.
>>
>> The solution here is to unlock the pagecache_folio and provide the
>> pagecache_folio_unlocked variable to the caller to have the visibility
>> over the pagecache_folio status for subsequent handling.
>>
>> The error message:
>> INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
>>        Not tainted 6.15.0-rc3+ #24
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 
>> ppid:3513   task_flags:0x400040 flags:0x00004006
>> Call Trace:
>>   <TASK>
>>   __schedule+0x1755/0x4f50
>>   schedule+0x158/0x330
>>   schedule_preempt_disabled+0x15/0x30
>>   __mutex_lock+0x75f/0xeb0
>>   hugetlb_wp+0xf88/0x3440
>>   hugetlb_fault+0x14c8/0x2c30
>>   trace_clock_x86_tsc+0x20/0x20
>>   do_user_addr_fault+0x61d/0x1490
>>   exc_page_fault+0x64/0x100
>>   asm_exc_page_fault+0x26/0x30
>> RIP: 0010:__put_user_4+0xd/0x20
>>   copy_process+0x1f4a/0x3d60
>>   kernel_clone+0x210/0x8f0
>>   __x64_sys_clone+0x18d/0x1f0
>>   do_syscall_64+0x6a/0x120
>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> RIP: 0033:0x41b26d
>>   </TASK>
>> INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by 
>> task repro_20250402_:13250.
>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 
>> ppid:3513   task_flags:0x400040 flags:0x00000006
>> Call Trace:
>>   <TASK>
>>   __schedule+0x1755/0x4f50
>>   schedule+0x158/0x330
>>   io_schedule+0x92/0x110
>>   folio_wait_bit_common+0x69a/0xba0
>>   __filemap_get_folio+0x154/0xb70
>>   hugetlb_fault+0xa50/0x2c30
>>   trace_clock_x86_tsc+0x20/0x20
>>   do_user_addr_fault+0xace/0x1490
>>   exc_page_fault+0x64/0x100
>>   asm_exc_page_fault+0x26/0x30
>> RIP: 0033:0x402619
>>   </TASK>
>> INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
>>        Not tainted 6.15.0-rc3+ #24
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 
>> ppid:3513   task_flags:0x400040 flags:0x00000006
>> Call Trace:
>>   <TASK>
>>   __schedule+0x1755/0x4f50
>>   schedule+0x158/0x330
>>   io_schedule+0x92/0x110
>>   folio_wait_bit_common+0x69a/0xba0
>>   __filemap_get_folio+0x154/0xb70
>>   hugetlb_fault+0xa50/0x2c30
>>   trace_clock_x86_tsc+0x20/0x20
>>   do_user_addr_fault+0xace/0x1490
>>   exc_page_fault+0x64/0x100
>>   asm_exc_page_fault+0x26/0x30
>> RIP: 0033:0x402619
>>   </TASK>
>>
>> Showing all locks held in the system:
>> 1 lock held by khungtaskd/35:
>>   #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
>> 2 locks held by repro_20250402_/13229:
>>   #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
>>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
>> 3 locks held by repro_20250402_/13250:
>>   #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
>>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
>>   #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
>>
>> Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
>> Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
>> Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
>> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
>> Cc: <stable@vger.kernel.org>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Florent Revest <revest@google.com>
>> Cc: Gavin Shan <gshan@redhat.com>
>> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
>> ---
>>   mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>>   1 file changed, 28 insertions(+), 5 deletions(-)
>>
> 
> I guess the change log can become more concise after the kernel log is
> dropped. The summarized stack trace is sufficient to indicate how the
> deadlock scenario happens. Besides, there's no need to mention bpftrace and
> its output. So the changelog could be simplified to something like below.
> Please polish it a bit if you would like to take it. The solution looks
> good except for some nitpicks as below.
> 
> ---
> 
> There is an ABBA deadlock scenario happening between hugetlb_fault() and
> hugetlb_wp() on the pagecache folio's lock and the hugetlb global mutex,
> which is reproducible with syzkaller [1]. As the stack traces below reveal,
> process-1 tries to take the hugetlb global mutex (A3) while holding the
> pagecache folio's lock. Process-2 took the hugetlb global mutex but tries
> to take the pagecache folio's lock.
> 
> [1] https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link
> 
> Process-1                                       Process-2
> =========                                       =========
> hugetlb_fault
>    mutex_lock                  (A1)
>    filemap_lock_hugetlb_folio  (B1)
>    hugetlb_wp
>      alloc_hugetlb_folio       #error
>        mutex_unlock            (A2)
>                                                 hugetlb_fault
>                                                   mutex_lock                  (A4)
>                                                   filemap_lock_hugetlb_folio  (B4)
>        unmap_ref_private
>        mutex_lock              (A3)
> 
> Fix it by releasing the pagecache folio's lock at (A2) of process-1 so that
> the pagecache folio's lock is available to process-2 at (B4), avoiding the
> deadlock. In process-1, a new variable is added to track whether the
> pagecache folio's lock has been released by its child function hugetlb_wp(),
> to avoid a double release of the lock in hugetlb_fault(). Similar changes
> are applied to hugetlb_no_page().
> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e3e6ac991b9c..ad54a74aa563 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>>    */
>>   static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>> -               struct vm_fault *vmf)
>> +               struct vm_fault *vmf,
>> +               bool *pagecache_folio_unlocked)
> 
> Nitpick: the variable may be renamed to 'pagecache_folio_locked' if you're
> happy with that.
> 
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       struct mm_struct *mm = vma->vm_mm;
>> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>               u32 hash;
>>               folio_put(old_folio);
>> +            /*
>> +             * The pagecache_folio needs to be unlocked to avoid
>                                                 ^^^^^^^^
> 
>                                                 has to be (?)
> 
>> +             * deadlock and we won't re-lock it in hugetlb_wp(). The
>> +             * pagecache_folio could be truncated after being
>> +             * unlocked. So its state should not be relied
>                                                                  ^^^^^^
>                                                                  
> reliable (?)
>> +             * subsequently.
>> +             *
>> +             * Setting *pagecache_folio_unlocked to true allows the
>> +             * caller to handle any necessary logic related to the
>> +             * folio's unlocked state.
>> +             */
>> +            if (pagecache_folio) {
>> +                folio_unlock(pagecache_folio);
>> +                if (pagecache_folio_unlocked)
>> +                    *pagecache_folio_unlocked = true;
>> +            }
> 
> The second section of the comments looks a bit redundant since the code
> changes are self-explanatory enough :-)
> 
>>               /*
>>                * Drop hugetlb_fault_mutex and vma_lock before
>>                * unmapping.  unmapping needs to hold vma_lock
>> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>>       hugetlb_count_add(pages_per_huge_page(h), mm);
>>       if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>>           /* Optimization, do the COW without a second fault */
>> -        ret = hugetlb_wp(folio, vmf);
>> +        ret = hugetlb_wp(folio, vmf, NULL);
> 
> It's not certain whether we have another deadlock between hugetlb_no_page()
> and hugetlb_wp(), similar to the existing one between hugetlb_fault() and
> hugetlb_wp(). So I think it's reasonable to pass '&pagecache_folio_locked'
> to hugetlb_wp() here as well and skip the unlock when
> pagecache_folio_locked == false in hugetlb_no_page(). It's not harmful at
> least.

Thank you very much for taking the time to review my patch! I appreciate
your feedback. :)

After carefully reviewing the hugetlb_no_page function, I've made an
observation regarding the pagecache_folio handling. Specifically, when
pagecache_folio is assigned to vmf->pte, the vmf->ptl lock is held. This
lock remains active when vmf->pte is later accessed in hugetlb_wp. The
ptl lock ensures that the pte value is the same as the pagecache_folio
assigned in hugetlb_no_page. As a result, the following comparison in
hugetlb_wp will always evaluate to false because old_folio and
pagecache_folio reference the same object:

         if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
                         old_folio != pagecache_folio)
                 cow_from_owner = true;

Based on this analysis, passing pagecache_folio_locked in
hugetlb_no_page is unnecessary. Let me know if I missed anything. Other
comments look good to me. Thanks!
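
To make the reasoning above a bit more concrete, here is a tiny standalone
sketch of the comparison in question (plain userspace C written only for
illustration; cow_from_owner() below is a stand-in helper, not the
mm/hugetlb.c code):

#include <stdbool.h>
#include <stdio.h>

struct folio { int id; };

/* Stand-in for the condition evaluated in hugetlb_wp(): COW on behalf of
 * the reservation owner is only needed when the faulting folio differs
 * from the page cache folio. */
static bool cow_from_owner(bool resv_owner, const struct folio *old_folio,
			   const struct folio *pagecache_folio)
{
	return resv_owner && old_folio != pagecache_folio;
}

int main(void)
{
	struct folio a = { .id = 1 }, b = { .id = 2 };

	/* hugetlb_no_page()-like case: under the PTL, old_folio and
	 * pagecache_folio are the same object, so the branch that drops
	 * the locks is never taken and no unlocked-flag is needed. */
	printf("no_page-like case: %d\n", cow_from_owner(true, &a, &a));

	/* hugetlb_fault()-like case: the folios can differ, so the branch
	 * can be taken and the flag matters. */
	printf("fault-like case:   %d\n", cow_from_owner(true, &a, &b));
	return 0;
}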

> 
>>       }
>>       spin_unlock(vmf->ptl);
>> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>       struct hstate *h = hstate_vma(vma);
>>       struct address_space *mapping;
>>       int need_wait_lock = 0;
>> +    bool pagecache_folio_unlocked = false;
>>       struct vm_fault vmf = {
>>           .vma = vma,
>>           .address = address & huge_page_mask(h),
>> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>       if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>           if (!huge_pte_write(vmf.orig_pte)) {
>> -            ret = hugetlb_wp(pagecache_folio, &vmf);
>> +            ret = hugetlb_wp(pagecache_folio, &vmf,
>> +                    &pagecache_folio_unlocked);
>>               goto out_put_page;
>>           } else if (likely(flags & FAULT_FLAG_WRITE)) {
>>               vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
>> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   out_ptl:
>>       spin_unlock(vmf.ptl);
>> -    if (pagecache_folio) {
>> +    /*
>> +     * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
>> +     * folio_unlock() here.
>> +     */
>> +    if (pagecache_folio && !pagecache_folio_unlocked)
>>           folio_unlock(pagecache_folio);
>> +    if (pagecache_folio)
>>           folio_put(pagecache_folio);
>> -    }
> 
> The comments seem redundant since the code changes are self-explanatory.
> Besides, there's no need to check 'pagecache_folio' twice.
> 
>      if (pagecache_folio) {
>          if (pagecache_folio_locked)
>              folio_unlock(pagecache_folio);
> 
>          folio_put(pagecache_folio);
>      }
> 
>>   out_mutex:
>>       hugetlb_vma_unlock_read(vma);
>>
>> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3
> 
> Thanks,
> Gavin
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
  2025-05-27  9:59   ` Gavin Guo
@ 2025-05-27 10:59     ` Gavin Shan
  0 siblings, 0 replies; 14+ messages in thread
From: Gavin Shan @ 2025-05-27 10:59 UTC (permalink / raw)
  To: Gavin Guo, linux-mm
  Cc: linux-kernel, muchun.song, osalvador, akpm, mike.kravetz,
	kernel-dev, stable, Hugh Dickins, Florent Revest

Hi Gavin,

On 5/27/25 7:59 PM, Gavin Guo wrote:
> On 5/26/25 12:41, Gavin Shan wrote:
>> On 5/13/25 7:34 PM, Gavin Guo wrote:
>>> The patch fixes a deadlock which can be triggered by an internal
>>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>>> [3] in this scenario:
>>>
>>> Process 1                              Process 2
>>> ---                       ---
>>> hugetlb_fault
>>>    mutex_lock(B) // take B
>>>    filemap_lock_hugetlb_folio
>>>      filemap_lock_folio
>>>        __filemap_get_folio
>>>          folio_lock(A) // take A
>>>    hugetlb_wp
>>>      mutex_unlock(B) // release B
>>>      ...                                hugetlb_fault
>>>      ...                                  mutex_lock(B) // take B
>>>                                           filemap_lock_hugetlb_folio
>>>                                             filemap_lock_folio
>>>                                               __filemap_get_folio
>>>                                                 folio_lock(A) // blocked
>>>      unmap_ref_private
>>>      ...
>>>      mutex_lock(B) // retake and blocked
>>>
>>> This is a ABBA deadlock involving two locks:
>>> - Lock A: pagecache_folio lock
>>> - Lock B: hugetlb_fault_mutex_table lock
>>>
>>> The deadlock occurs between two processes as follows:
>>> 1. The first process (let’s call it Process 1) is handling a
>>> copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
>>> insufficient reserved hugetlb pages, Process 1, owner of the reserved
>>> hugetlb page, attempts to unmap a hugepage owned by another process
>>> (non-owner) to satisfy the reservation. Before unmapping, Process 1
>>> acquires lock B (hugetlb_fault_mutex_table lock) and then lock A
>>> (pagecache_folio lock). To proceed with the unmap, it releases Lock B
>>> but retains Lock A. After the unmap, Process 1 tries to reacquire Lock
>>> B. However, at this point, Lock B has already been acquired by another
>>> process.
>>>
>>> 2. The second process (Process 2) enters the hugetlb_fault handler
>>> during the unmap operation. It successfully acquires Lock B
>>> (hugetlb_fault_mutex_table lock) that was just released by Process 1,
>>> but then attempts to acquire Lock A (pagecache_folio lock), which is
>>> still held by Process 1.
>>>
>>> As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
>>> (held by Process 2), while Process 2 (holding Lock B) is blocked waiting
>>> for Lock A (held by Process 1), constructing a ABBA deadlock scenario.
>>>
>>> The solution here is to unlock the pagecache_folio and provide the
>>> pagecache_folio_unlocked variable to the caller to have the visibility
>>> over the pagecache_folio status for subsequent handling.
>>>
>>> The error message:
>>> INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
>>>        Not tainted 6.15.0-rc3+ #24
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00004006
>>> Call Trace:
>>>   <TASK>
>>>   __schedule+0x1755/0x4f50
>>>   schedule+0x158/0x330
>>>   schedule_preempt_disabled+0x15/0x30
>>>   __mutex_lock+0x75f/0xeb0
>>>   hugetlb_wp+0xf88/0x3440
>>>   hugetlb_fault+0x14c8/0x2c30
>>>   trace_clock_x86_tsc+0x20/0x20
>>>   do_user_addr_fault+0x61d/0x1490
>>>   exc_page_fault+0x64/0x100
>>>   asm_exc_page_fault+0x26/0x30
>>> RIP: 0010:__put_user_4+0xd/0x20
>>>   copy_process+0x1f4a/0x3d60
>>>   kernel_clone+0x210/0x8f0
>>>   __x64_sys_clone+0x18d/0x1f0
>>>   do_syscall_64+0x6a/0x120
>>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> RIP: 0033:0x41b26d
>>>   </TASK>
>>> INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
>>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
>>> Call Trace:
>>>   <TASK>
>>>   __schedule+0x1755/0x4f50
>>>   schedule+0x158/0x330
>>>   io_schedule+0x92/0x110
>>>   folio_wait_bit_common+0x69a/0xba0
>>>   __filemap_get_folio+0x154/0xb70
>>>   hugetlb_fault+0xa50/0x2c30
>>>   trace_clock_x86_tsc+0x20/0x20
>>>   do_user_addr_fault+0xace/0x1490
>>>   exc_page_fault+0x64/0x100
>>>   asm_exc_page_fault+0x26/0x30
>>> RIP: 0033:0x402619
>>>   </TASK>
>>> INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
>>>        Not tainted 6.15.0-rc3+ #24
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
>>> Call Trace:
>>>   <TASK>
>>>   __schedule+0x1755/0x4f50
>>>   schedule+0x158/0x330
>>>   io_schedule+0x92/0x110
>>>   folio_wait_bit_common+0x69a/0xba0
>>>   __filemap_get_folio+0x154/0xb70
>>>   hugetlb_fault+0xa50/0x2c30
>>>   trace_clock_x86_tsc+0x20/0x20
>>>   do_user_addr_fault+0xace/0x1490
>>>   exc_page_fault+0x64/0x100
>>>   asm_exc_page_fault+0x26/0x30
>>> RIP: 0033:0x402619
>>>   </TASK>
>>>
>>> Showing all locks held in the system:
>>> 1 lock held by khungtaskd/35:
>>>   #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
>>> 2 locks held by repro_20250402_/13229:
>>>   #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
>>>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
>>> 3 locks held by repro_20250402_/13250:
>>>   #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
>>>   #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
>>>   #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
>>>
>>> Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
>>> Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
>>> Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
>>> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
>>> Cc: <stable@vger.kernel.org>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Florent Revest <revest@google.com>
>>> Cc: Gavin Shan <gshan@redhat.com>
>>> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
>>> ---
>>>   mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>>>   1 file changed, 28 insertions(+), 5 deletions(-)
>>>
>>
>> I guess the change log can become more concise after the kernel log is dropped. The
>> summarized stack trace is sufficient to indicate how the deadlock scenario happens.
>> Besides, there's no need to mention bpftrace and its output. So the changelog could be
>> simplified to something like below. Please polish it a bit if you would like to take it.
>> The solution looks good except for some nitpicks as below.
>>
>> ---
>>
>> There is an ABBA deadlock scenario happening between hugetlb_fault() and hugetlb_wp() on
>> the pagecache folio's lock and the hugetlb global mutex, which is reproducible with
>> syzkaller [1]. As the stack traces below reveal, process-1 tries to take the hugetlb
>> global mutex (A3) while holding the pagecache folio's lock. Process-2 took the hugetlb
>> global mutex but tries to take the pagecache folio's lock.
>>
>> [1] https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link
>>
>> Process-1                                       Process-2
>> =========                                       =========
>> hugetlb_fault
>>    mutex_lock                  (A1)
>>    filemap_lock_hugetlb_folio  (B1)
>>    hugetlb_wp
>>      alloc_hugetlb_folio       #error
>>        mutex_unlock            (A2)
>>                                                 hugetlb_fault
>>                                                   mutex_lock                  (A4)
>>                                                   filemap_lock_hugetlb_folio  (B4)
>>        unmap_ref_private
>>        mutex_lock              (A3)
>>
>> Fix it by releasing the pagecache folio's lock at (A2) of process-1 so that the pagecache
>> folio's lock is available to process-2 at (B4), avoiding the deadlock. In process-1, a new
>> variable is added to track whether the pagecache folio's lock has been released by its
>> child function hugetlb_wp(), to avoid a double release of the lock in hugetlb_fault().
>> Similar changes are applied to hugetlb_no_page().
>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index e3e6ac991b9c..ad54a74aa563 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>>>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>>>    */
>>>   static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>> -               struct vm_fault *vmf)
>>> +               struct vm_fault *vmf,
>>> +               bool *pagecache_folio_unlocked)
>>
>> Nitpick: the variable may be renamed to 'pagecache_folio_locked' if you're happy
>> with that.
>>
>>>   {
>>>       struct vm_area_struct *vma = vmf->vma;
>>>       struct mm_struct *mm = vma->vm_mm;
>>> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>>               u32 hash;
>>>               folio_put(old_folio);
>>> +            /*
>>> +             * The pagecache_folio needs to be unlocked to avoid
>>                                                 ^^^^^^^^
>>
>>                                                 has to be (?)
>>
>>> +             * deadlock and we won't re-lock it in hugetlb_wp(). The
>>> +             * pagecache_folio could be truncated after being
>>> +             * unlocked. So its state should not be relied
>>                                                                  ^^^^^^
>> reliable (?)
>>> +             * subsequently.
>>> +             *
>>> +             * Setting *pagecache_folio_unlocked to true allows the
>>> +             * caller to handle any necessary logic related to the
>>> +             * folio's unlocked state.
>>> +             */
>>> +            if (pagecache_folio) {
>>> +                folio_unlock(pagecache_folio);
>>> +                if (pagecache_folio_unlocked)
>>> +                    *pagecache_folio_unlocked = true;
>>> +            }
>>
>> The second section of the comments looks a bit redundant since the code changes
>> are self-explanatory enough :-)
>>
>>>               /*
>>>                * Drop hugetlb_fault_mutex and vma_lock before
>>>                * unmapping.  unmapping needs to hold vma_lock
>>> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>>>       hugetlb_count_add(pages_per_huge_page(h), mm);
>>>       if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>>>           /* Optimization, do the COW without a second fault */
>>> -        ret = hugetlb_wp(folio, vmf);
>>> +        ret = hugetlb_wp(folio, vmf, NULL);
>>
>> It's not certain whether we have another deadlock between hugetlb_no_page() and
>> hugetlb_wp(), similar to the existing one between hugetlb_fault() and hugetlb_wp(). So I
>> think it's reasonable to pass '&pagecache_folio_locked' to hugetlb_wp() here as well and
>> skip the unlock when pagecache_folio_locked == false in hugetlb_no_page(). It's not
>> harmful at least.
> 
> Thank you very much for taking the time to review my patch! I appreciate
> your feedback. :)
> 
> After carefully reviewing the hugetlb_no_page function, I've made an
> observation regarding the pagecache_folio handling. Specifically, when
> pagecache_folio is assigned to vmf->pte, the vmf->ptl lock is held. This
> lock remains active when vmf->pte is later accessed in hugetlb_wp. The
> ptl lock ensures that the pte value is the same as the pagecache_folio
> assigned in hugetlb_no_page. As a result, the following comparison in
> hugetlb_wp will always evaluate to false because old_folio and
> pagecache_folio reference the same object:
> 
>          if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
>                          old_folio != pagecache_folio)
>                  cow_from_owner = true;
> 
> Based on this analysis, passing pagecache_folio_locked in
> hugetlb_no_page is unnecessary. Let me know if I missed anything. Other
> comments look good to me. Thanks!
> 

Thanks for diving deeper into the code. Agreed, we're safe to bypass this
specific case. As you explained, @old_folio resolved from the PTE should be
equal to @pagecache_folio, and that consistency is guaranteed by the PTL.
So we don't have the locking contention between hugetlb_no_page() and hugetlb_wp().
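
The overall pattern the patch adopts (drop the folio lock before re-taking
the fault mutex, and report that back to the caller so it does not unlock
twice) can also be mimicked with a short userspace sketch. This is only an
illustrative analogy using pthread mutexes under simplified assumptions,
not the mm/hugetlb.c implementation:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "folio lock"  */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "fault mutex" */

/* Plays the role of hugetlb_wp(): drops B, drops A as well so another
 * thread holding B can make progress, tells the caller via *a_unlocked,
 * then re-takes B while no longer holding A (so no ABBA cycle forms). */
static void wp_like(bool *a_unlocked)
{
	pthread_mutex_unlock(&lock_b);

	pthread_mutex_unlock(&lock_a);
	if (a_unlocked)
		*a_unlocked = true;

	pthread_mutex_lock(&lock_b);
}

/* Plays the role of hugetlb_fault(): takes B then A, calls wp_like(), and
 * only unlocks A itself when the callee has not already done so. */
static void *fault_like(void *arg)
{
	bool a_unlocked = false;

	pthread_mutex_lock(&lock_b);
	pthread_mutex_lock(&lock_a);

	wp_like(&a_unlocked);

	if (!a_unlocked)
		pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, fault_like, NULL);
	pthread_create(&t2, NULL, fault_like, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("no deadlock");
	return 0;
}

Built with "cc -pthread", both threads finish and the program prints "no
deadlock". If wp_like() re-took lock_b without first dropping lock_a, one
thread holding lock_a while waiting for lock_b and another holding lock_b
while waiting for lock_a could hang in the ABBA pattern described above.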

>>
>>>       }
>>>       spin_unlock(vmf->ptl);
>>> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>       struct hstate *h = hstate_vma(vma);
>>>       struct address_space *mapping;
>>>       int need_wait_lock = 0;
>>> +    bool pagecache_folio_unlocked = false;
>>>       struct vm_fault vmf = {
>>>           .vma = vma,
>>>           .address = address & huge_page_mask(h),
>>> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>       if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>>           if (!huge_pte_write(vmf.orig_pte)) {
>>> -            ret = hugetlb_wp(pagecache_folio, &vmf);
>>> +            ret = hugetlb_wp(pagecache_folio, &vmf,
>>> +                    &pagecache_folio_unlocked);
>>>               goto out_put_page;
>>>           } else if (likely(flags & FAULT_FLAG_WRITE)) {
>>>               vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
>>> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>   out_ptl:
>>>       spin_unlock(vmf.ptl);
>>> -    if (pagecache_folio) {
>>> +    /*
>>> +     * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
>>> +     * folio_unlock() here.
>>> +     */
>>> +    if (pagecache_folio && !pagecache_folio_unlocked)
>>>           folio_unlock(pagecache_folio);
>>> +    if (pagecache_folio)
>>>           folio_put(pagecache_folio);
>>> -    }
>>
>> The comments seem redundant since the code changes are self-explanatory.
>> Besides, there's no need to check 'pagecache_folio' twice.
>>
>>      if (pagecache_folio) {
>>          if (pagecache_folio_locked)
>>              folio_unlock(pagecache_folio);
>>
>>          folio_put(pagecache_folio);
>>      }
>>
>>>   out_mutex:
>>>       hugetlb_vma_unlock_read(vma);
>>>
>>> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-05-27 10:59 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-13  9:34 [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table Gavin Guo
2025-05-14  0:56 ` Andrew Morton
2025-05-14  4:33   ` Byungchul Park
2025-05-14  6:47 ` Byungchul Park
2025-05-14  8:10   ` Gavin Guo
2025-05-15  2:22     ` Byungchul Park
2025-05-16  6:03     ` Byungchul Park
2025-05-16  7:32       ` Gavin Guo
2025-05-16  7:43         ` Byungchul Park
2025-05-20 19:53 ` Oscar Salvador
2025-05-21 11:12   ` Gavin Guo
2025-05-26  4:41 ` Gavin Shan
2025-05-27  9:59   ` Gavin Guo
2025-05-27 10:59     ` Gavin Shan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).