linux-mm.kvack.org archive mirror
* hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
@ 2024-01-25 20:28 Thorvald Natvig
  2024-01-26  7:50 ` Muchun Song
  0 siblings, 1 reply; 12+ messages in thread
From: Thorvald Natvig @ 2024-01-25 20:28 UTC (permalink / raw)
  To: Muchun Song; +Cc: linux-mm

We've found what appears to be a locking issue in hugetlbfs that can
leave a process blocked on shared mappings; it seems to stem from an
interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.

Based on some added pr_warn calls, we believe the following is happening:
when hugetlb_vmdelete_list is entered from the child process,
vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
not take any lock, since neither __vma_shareable_lock nor
__vma_private_lock is true.

While hugetlb_vmdelete_list is executing, the parent process calls
fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
vma_lock for the same vma.

Thus, when the hugetlb_vmdelete_list in the child reaches the end of
the function, vma->vm_private_data is now populated, and hence
hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
not hold.
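
To illustrate the pattern in isolation, here is a minimal userspace
model of the two helpers (a sketch only, not the kernel code; the
struct and helper names merely mirror the kernel's). Both helpers
re-read the lock pointer, so a pointer installed between the "trylock"
and the "unlock" produces exactly this kind of unbalanced unlock:

#include <pthread.h>
#include <stdlib.h>

struct vma {
  pthread_rwlock_t *vm_private_data; /* NULL until "vm_op_open" runs */
};

static int vma_trylock_write(struct vma *vma) {
  if (vma->vm_private_data)   /* plays the __vma_shareable_lock() check */
    return pthread_rwlock_trywrlock(vma->vm_private_data) == 0;
  return 1;                   /* no lock yet: report "success" */
}

static void vma_unlock_write(struct vma *vma) {
  if (vma->vm_private_data)   /* re-read: may be non-NULL by now */
    pthread_rwlock_unlock(vma->vm_private_data);
}

int main(void) {
  struct vma vma = { NULL };

  vma_trylock_write(&vma);    /* no-op: vm_private_data is NULL */

  /* The racing fork() -> hugetlb_vm_op_open() happens here. */
  vma.vm_private_data = malloc(sizeof(*vma.vm_private_data));
  pthread_rwlock_init(vma.vm_private_data, NULL);

  vma_unlock_write(&vma);     /* releases a rwlock this task never took */
  return 0;
}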

dmesg:
WARNING: bad unlock balance detected!
6.8.0-rc1+ #24 Not tainted
-------------------------------------
lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
[<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
but there are no more locks to release!


3 locks held by lock/2613:
 #0: ffff9b4bc6225450 (sb_writers#16){.+.+}-{0:0}, at:
madvise_vma_behavior+0x4cc/0xcf0
 #1: ffff9ba4dc34eca0 (&sb->s_type->i_mutex_key#23){+.+.}-{3:3}, at:
hugetlbfs_fallocate+0x3fe/0x620
 #2: ffff9ba4dc34ef38 (&hugetlbfs_i_mmap_rwsem_key){+.+.}-{3:3}, at:
hugetlbfs_fallocate+0x438/0x620


CPU: 17 PID: 2613 Comm: lock Not tainted 6.8.0-rc1+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 12/02/2023
Call Trace:
 <TASK>
 dump_stack_lvl+0x77/0xe0
 ? hugetlb_vma_unlock_write+0x48/0x60
 dump_stack+0x10/0x20
 print_unlock_imbalance_bug+0x127/0x150
 lock_release+0x21a/0x3f0
 ? hugetlb_vma_unlock_write+0x48/0x60
 up_write+0x1c/0x1d0
 hugetlb_vma_unlock_write+0x48/0x60
 hugetlb_vmdelete_list+0x93/0xd0
 hugetlbfs_fallocate+0x4e1/0x620
 vfs_fallocate+0x153/0x4b0
 madvise_vma_behavior+0x4cc/0xcf0
 ? mas_prev+0x68/0x70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? find_vma_prev+0x78/0xc0
 ? __pfx_madvise_vma_behavior+0x10/0x10
 madvise_walk_vmas+0xc4/0x140
 do_madvise+0x3df/0x450
 __x64_sys_madvise+0x2c/0x40
 do_syscall_64+0x8e/0x160
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f55e0b23bbb

Repro:

#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define PSIZE (2048UL * 1024UL) /* one 2 MiB huge page */

int main(int argc, char **argv) {
  char *buffer = mmap(NULL, PSIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
  if (buffer == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

  /* Child: repeatedly punch a hole over the whole mapping; on hugetlbfs,
   * MADV_REMOVE goes through hugetlbfs_fallocate() and
   * hugetlb_vmdelete_list(). */
  pid_t remover = fork();

  if (remover == 0) {
    while (1) {
      if (madvise(buffer, PSIZE, MADV_REMOVE) == -1) {
        perror("madvise");
        exit(1);
      }
    }
  }

  int wstatus;

  /* Parent: fork and reap children in a loop; each fork duplicates the
   * shared vma and runs hugetlb_vm_op_open(), racing the remover. */
  for (int l = 0; l < 10000; ++l) {
    pid_t childpid = fork();
    if (childpid == 0) {
      exit(0);
    } else {
      waitpid(childpid, &wstatus, 0);
    }
  }

  kill(remover, SIGKILL);
  waitpid(remover, &wstatus, 0);
  printf("Clean exit\n");
}
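
Note that the repro needs free huge pages; reserve some first, e.g. with
sysctl vm.nr_hugepages=16, or the initial mmap fails.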

- Thorvald



* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-25 20:28 hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE Thorvald Natvig
@ 2024-01-26  7:50 ` Muchun Song
  2024-01-27 10:13   ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Muchun Song @ 2024-01-26  7:50 UTC (permalink / raw)
  To: Thorvald Natvig, Miaohe Lin; +Cc: Linux-MM



> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
> 
> We've found what appears to be a lock issue that results in a blocked
> process somewhere in hugetlbfs for shared maps; seemingly from an
> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
> 
> Based on some added pr_warn, we believe the following is happening:
> When hugetlb_vmdelete_list is entered from the child process,
> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> are true.
> 
> While hugetlb_vmdelete_list is executing, the parent process does
> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> lock for the same vma.
> 
> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> the function, vma->vm_private_data is now populated, and hence
> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> not hold.

Thanks for your report. The ->vm_private_data handling was introduced
by the series [1], so I suspect the bug comes from there. I didn't
review that series at the time (the pmd-sharing case in particular is
a little complex), but I saw Miaohe reviewed many of those patches.

CC Miaohe, maybe he has some ideas on this.

[1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff

Thanks.

> 
> dmesg:
> WARNING: bad unlock balance detected!
> 6.8.0-rc1+ #24 Not tainted
> -------------------------------------
> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> but there are no more locks to release!
> 
> 
> 3 locks held by lock/2613:
> #0: ffff9b4bc6225450 (sb_writers#16){.+.+}-{0:0}, at:
> madvise_vma_behavior+0x4cc/0xcf0
> #1: ffff9ba4dc34eca0 (&sb->s_type->i_mutex_key#23){+.+.}-{3:3}, at:
> hugetlbfs_fallocate+0x3fe/0x620
> #2: ffff9ba4dc34ef38 (&hugetlbfs_i_mmap_rwsem_key){+.+.}-{3:3}, at:
> hugetlbfs_fallocate+0x438/0x620
> 
> 
> CPU: 17 PID: 2613 Comm: lock Not tainted 6.8.0-rc1+ #24
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 12/02/2023
> Call Trace:
> <TASK>
> dump_stack_lvl+0x77/0xe0
> ? hugetlb_vma_unlock_write+0x48/0x60
> dump_stack+0x10/0x20
> print_unlock_imbalance_bug+0x127/0x150
> lock_release+0x21a/0x3f0
> ? hugetlb_vma_unlock_write+0x48/0x60
> up_write+0x1c/0x1d0
> hugetlb_vma_unlock_write+0x48/0x60
> hugetlb_vmdelete_list+0x93/0xd0
> hugetlbfs_fallocate+0x4e1/0x620
> vfs_fallocate+0x153/0x4b0
> madvise_vma_behavior+0x4cc/0xcf0
> ? mas_prev+0x68/0x70
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? find_vma_prev+0x78/0xc0
> ? __pfx_madvise_vma_behavior+0x10/0x10
> madvise_walk_vmas+0xc4/0x140
> do_madvise+0x3df/0x450
> __x64_sys_madvise+0x2c/0x40
> do_syscall_64+0x8e/0x160
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? do_syscall_64+0x9b/0x160
> ? do_syscall_64+0x9b/0x160
> ? do_syscall_64+0x9b/0x160
> entry_SYSCALL_64_after_hwframe+0x6e/0x76
> RIP: 0033:0x7f55e0b23bbb
> 
> Repro:
> 
> #include <signal.h>
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> #define PSIZE (2048UL * 1024UL)
> 
> int main(int argc, char **argv) {
>  char *buffer = mmap(NULL, PSIZE, PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
>  if (buffer == MAP_FAILED) {
>    perror("mmap");
>    exit(1);
>  }
> 
>  pid_t remover = fork();
> 
>  if (remover == 0) {
>    while(1) {
>      if (madvise(buffer, PSIZE, MADV_REMOVE) == -1) {
>        perror("madvise");
>        exit(1);
>      }
>    }
>  }
> 
>  int wstatus;
> 
>  for(int l = 0; l < 10000; ++l) {
>    pid_t childpid = fork();
>    if (childpid == 0) {
>      exit(0);
>    } else {
>      waitpid(childpid, &wstatus, 0);
>    }
>  }
> 
>  kill(remover, SIGKILL);
>  waitpid(remover, &wstatus, 0);
>  printf("Clean exit\n");
> }
> 
> - Thorvald




* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-26  7:50 ` Muchun Song
@ 2024-01-27 10:13   ` Miaohe Lin
  2024-01-29 12:56     ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Miaohe Lin @ 2024-01-27 10:13 UTC (permalink / raw)
  To: Muchun Song, Thorvald Natvig; +Cc: Linux-MM

On 2024/1/26 15:50, Muchun Song wrote:
> 
> 
>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>
>> We've found what appears to be a lock issue that results in a blocked
>> process somewhere in hugetlbfs for shared maps; seemingly from an
>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>
>> Based on some added pr_warn, we believe the following is happening:
>> When hugetlb_vmdelete_list is entered from the child process,
>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>> are true.
>>
>> While hugetlb_vmdelete_list is executing, the parent process does
>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>> lock for the same vma.
>>
>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>> the function, vma->vm_private_data is now populated, and hence
>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>> not hold.
> 
> Thanks for your report. ->vm_private_data was introduced since the
> series [1]. So I suspect it was caused by this. But I haven't reviewed
> that at that time (actually, it is a little complex in pmd sharing
> case). I saw Miaohe had reviewed many of those.
> 
> CC Miaohe, maybe he has some ideas on this.
> 
> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
> 
> Thanks.
> 
>>
>> dmesg:
>> WARNING: bad unlock balance detected!
>> 6.8.0-rc1+ #24 Not tainted
>> -------------------------------------
>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>> but there are no more locks to release!

Thanks for your report. It seems there's a race:

 CPU 1                                                  CPU 2
 fork                                                   hugetlbfs_fallocate
  dup_mmap                                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- child vma is now visible in the i_mmap tree
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- clears vma_lock outside i_mmap_rwsem!
                                                         i_mmap_lock_write(mapping);
                                                         hugetlb_vmdelete_list
                                                          vma_interval_tree_foreach
                                                           hugetlb_vma_trylock_write -- vma_lock is cleared
   tmp->vm_ops->open -- allocates a new vma_lock outside i_mmap_rwsem!
                                                           hugetlb_vma_unlock_write -- vma_lock is assigned!!!
                                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside the
i_mmap_rwsem lock, so there may be other bugs lurking behind this one.
But I'm not really sure; I will take a closer look next week.

Thanks.



* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-27 10:13   ` Miaohe Lin
@ 2024-01-29 12:56     ` Miaohe Lin
  2024-01-29 16:17       ` Liam R. Howlett
  0 siblings, 1 reply; 12+ messages in thread
From: Miaohe Lin @ 2024-01-29 12:56 UTC (permalink / raw)
  To: Muchun Song, Thorvald Natvig; +Cc: Linux-MM

On 2024/1/27 18:13, Miaohe Lin wrote:
> On 2024/1/26 15:50, Muchun Song wrote:
>>
>>
>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>>
>>> We've found what appears to be a lock issue that results in a blocked
>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>
>>> Based on some added pr_warn, we believe the following is happening:
>>> When hugetlb_vmdelete_list is entered from the child process,
>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>> are true.
>>>
>>> While hugetlb_vmdelete_list is executing, the parent process does
>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>> lock for the same vma.
>>>
>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>> the function, vma->vm_private_data is now populated, and hence
>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>> not hold.
>>
>> Thanks for your report. ->vm_private_data was introduced since the
>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>> that at that time (actually, it is a little complex in pmd sharing
>> case). I saw Miaohe had reviewed many of those.
>>
>> CC Miaohe, maybe he has some ideas on this.
>>
>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>
>> Thanks.
>>
>>>
>>> dmesg:
>>> WARNING: bad unlock balance detected!
>>> 6.8.0-rc1+ #24 Not tainted
>>> -------------------------------------
>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>> but there are no more locks to release!
> 
> Thanks for your report. It seems there's a race:
> 
>  CPU 1											CPU 2
>  fork											hugetlbfs_fallocate
>   dup_mmap										 hugetlbfs_punch_hole
>    i_mmap_lock_write(mapping);								
>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>    i_mmap_unlock_write(mapping);
>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>    											 hugetlb_vmdelete_list
> 											  vma_interval_tree_foreach
> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> 											 i_mmap_unlock_write(mapping);
> 
> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
> But I'm not really sure. I will take a more closed look at next week.


This can be fixed by deferring vma_interval_tree_insert_after() until
the vma is fully initialized. But I'm not sure whether there are side
effects with this patch.

linux-UJMmTI:/home/linmiaohe/mm # git diff
diff --git a/kernel/fork.c b/kernel/fork.c
index 47ff3b35352e..2ef2711452e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                } else if (anon_vma_fork(tmp, mpnt))
                        goto fail_nomem_anon_vma_fork;
                vm_flags_clear(tmp, VM_LOCKED_MASK);
-               file = tmp->vm_file;
-               if (file) {
-                       struct address_space *mapping = file->f_mapping;
-
-                       get_file(file);
-                       i_mmap_lock_write(mapping);
-                       if (vma_is_shared_maywrite(tmp))
-                               mapping_allow_writable(mapping);
-                       flush_dcache_mmap_lock(mapping);
-                       /* insert tmp into the share list, just after mpnt */
-                       vma_interval_tree_insert_after(tmp, mpnt,
-                                       &mapping->i_mmap);
-                       flush_dcache_mmap_unlock(mapping);
-                       i_mmap_unlock_write(mapping);
-               }

                /*
                 * Copy/update hugetlb private vma information.
@@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                if (tmp->vm_ops && tmp->vm_ops->open)
                        tmp->vm_ops->open(tmp);

+               file = tmp->vm_file;
+               if (file) {
+                       struct address_space *mapping = file->f_mapping;
+
+                       get_file(file);
+                       i_mmap_lock_write(mapping);
+                       if (vma_is_shared_maywrite(tmp))
+                               mapping_allow_writable(mapping);
+                       flush_dcache_mmap_lock(mapping);
+                       /* insert tmp into the share list, just after mpnt. */
+                       vma_interval_tree_insert_after(tmp, mpnt,
+                                       &mapping->i_mmap);
+                       flush_dcache_mmap_unlock(mapping);
+                       i_mmap_unlock_write(mapping);
+               }
+
                if (retval) {
                        mpnt = vma_next(&vmi);
                        goto loop_out;


root@qemu:~# ./hugetlb_vma_lock
Clean exit
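
The idea is that the child vma only becomes visible in mapping->i_mmap
after tmp->vm_ops->open() has set up vm_private_data, so
hugetlb_vmdelete_list() can no longer observe a half-initialized vma.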

> 
> Thanks.
> 




* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-29 12:56     ` Miaohe Lin
@ 2024-01-29 16:17       ` Liam R. Howlett
  2024-01-30  2:14         ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2024-01-29 16:17 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Muchun Song, Thorvald Natvig, Linux-MM

* Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
> On 2024/1/27 18:13, Miaohe Lin wrote:
> > On 2024/1/26 15:50, Muchun Song wrote:
> >>
> >>
> >>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
> >>>
> >>> We've found what appears to be a lock issue that results in a blocked
> >>> process somewhere in hugetlbfs for shared maps; seemingly from an
> >>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
> >>>
> >>> Based on some added pr_warn, we believe the following is happening:
> >>> When hugetlb_vmdelete_list is entered from the child process,
> >>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> >>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> >>> are true.
> >>>
> >>> While hugetlb_vmdelete_list is executing, the parent process does
> >>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> >>> lock for the same vma.
> >>>
> >>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> >>> the function, vma->vm_private_data is now populated, and hence
> >>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> >>> not hold.
> >>
> >> Thanks for your report. ->vm_private_data was introduced since the
> >> series [1]. So I suspect it was caused by this. But I haven't reviewed
> >> that at that time (actually, it is a little complex in pmd sharing
> >> case). I saw Miaohe had reviewed many of those.
> >>
> >> CC Miaohe, maybe he has some ideas on this.
> >>
> >> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
> >>
> >> Thanks.
> >>
> >>>
> >>> dmesg:
> >>> WARNING: bad unlock balance detected!
> >>> 6.8.0-rc1+ #24 Not tainted
> >>> -------------------------------------
> >>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> >>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> >>> but there are no more locks to release!
> > 
> > Thanks for your report. It seems there's a race:
> > 
> >  CPU 1											CPU 2
> >  fork											hugetlbfs_fallocate
> >   dup_mmap										 hugetlbfs_punch_hole
> >    i_mmap_lock_write(mapping);								
> >    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
> >    i_mmap_unlock_write(mapping);
> >    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
> >    											 hugetlb_vmdelete_list
> > 											  vma_interval_tree_foreach
> > 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
> >    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> > 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> > 											 i_mmap_unlock_write(mapping);
> > 
> > hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
> > But I'm not really sure. I will take a more closed look at next week.
> 
> 
> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
> But I'm not sure whether there're side effects with this patch.
> 
> linux-UJMmTI:/home/linmiaohe/mm # git diff
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 47ff3b35352e..2ef2711452e0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>                 } else if (anon_vma_fork(tmp, mpnt))
>                         goto fail_nomem_anon_vma_fork;
>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
> -               file = tmp->vm_file;
> -               if (file) {
> -                       struct address_space *mapping = file->f_mapping;
> -
> -                       get_file(file);
> -                       i_mmap_lock_write(mapping);
> -                       if (vma_is_shared_maywrite(tmp))
> -                               mapping_allow_writable(mapping);
> -                       flush_dcache_mmap_lock(mapping);
> -                       /* insert tmp into the share list, just after mpnt */
> -                       vma_interval_tree_insert_after(tmp, mpnt,
> -                                       &mapping->i_mmap);
> -                       flush_dcache_mmap_unlock(mapping);
> -                       i_mmap_unlock_write(mapping);
> -               }
> 
>                 /*
>                  * Copy/update hugetlb private vma information.
> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>                 if (tmp->vm_ops && tmp->vm_ops->open)
>                         tmp->vm_ops->open(tmp);
> 
> +               file = tmp->vm_file;
> +               if (file) {
> +                       struct address_space *mapping = file->f_mapping;
> +
> +                       get_file(file);
> +                       i_mmap_lock_write(mapping);
> +                       if (vma_is_shared_maywrite(tmp))
> +                               mapping_allow_writable(mapping);
> +                       flush_dcache_mmap_lock(mapping);
> +                       /* insert tmp into the share list, just after mpnt. */
> +                       vma_interval_tree_insert_after(tmp, mpnt,
> +                                       &mapping->i_mmap);
> +                       flush_dcache_mmap_unlock(mapping);
> +                       i_mmap_unlock_write(mapping);
> +               }
> +
>                 if (retval) {
>                         mpnt = vma_next(&vmi);
>                         goto loop_out;
> 
> 

How is this possible?  I thought, as specified in mm/rmap.c, that the
hugetlbfs path would be holding the mmap lock (which is also held in the
fork path)?

That is, the mmap_lock must be held before the i_mmap_lock_write()

Am I missing something?  Do we need an update to mm/rmap.c?

Thanks,
Liam





* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-29 16:17       ` Liam R. Howlett
@ 2024-01-30  2:14         ` Miaohe Lin
  2024-01-30  4:08           ` Liam R. Howlett
  0 siblings, 1 reply; 12+ messages in thread
From: Miaohe Lin @ 2024-01-30  2:14 UTC (permalink / raw)
  To: Liam R. Howlett; +Cc: Muchun Song, Thorvald Natvig, Linux-MM

On 2024/1/30 0:17, Liam R. Howlett wrote:
> * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>
>>>>
>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>>>>
>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>
>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>> are true.
>>>>>
>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>> lock for the same vma.
>>>>>
>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>> not hold.
>>>>
>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>> that at that time (actually, it is a little complex in pmd sharing
>>>> case). I saw Miaohe had reviewed many of those.
>>>>
>>>> CC Miaohe, maybe he has some ideas on this.
>>>>
>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> dmesg:
>>>>> WARNING: bad unlock balance detected!
>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>> -------------------------------------
>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>> but there are no more locks to release!
>>>
>>> Thanks for your report. It seems there's a race:
>>>
>>>  CPU 1											CPU 2
>>>  fork											hugetlbfs_fallocate
>>>   dup_mmap										 hugetlbfs_punch_hole
>>>    i_mmap_lock_write(mapping);								
>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>    i_mmap_unlock_write(mapping);
>>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>>>    											 hugetlb_vmdelete_list
>>> 											  vma_interval_tree_foreach
>>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>> 											 i_mmap_unlock_write(mapping);
>>>
>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>> But I'm not really sure. I will take a more closed look at next week.
>>
>>
>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>> But I'm not sure whether there're side effects with this patch.
>>
>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 47ff3b35352e..2ef2711452e0 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>                 } else if (anon_vma_fork(tmp, mpnt))
>>                         goto fail_nomem_anon_vma_fork;
>>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
>> -               file = tmp->vm_file;
>> -               if (file) {
>> -                       struct address_space *mapping = file->f_mapping;
>> -
>> -                       get_file(file);
>> -                       i_mmap_lock_write(mapping);
>> -                       if (vma_is_shared_maywrite(tmp))
>> -                               mapping_allow_writable(mapping);
>> -                       flush_dcache_mmap_lock(mapping);
>> -                       /* insert tmp into the share list, just after mpnt */
>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>> -                                       &mapping->i_mmap);
>> -                       flush_dcache_mmap_unlock(mapping);
>> -                       i_mmap_unlock_write(mapping);
>> -               }
>>
>>                 /*
>>                  * Copy/update hugetlb private vma information.
>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>                 if (tmp->vm_ops && tmp->vm_ops->open)
>>                         tmp->vm_ops->open(tmp);
>>
>> +               file = tmp->vm_file;
>> +               if (file) {
>> +                       struct address_space *mapping = file->f_mapping;
>> +
>> +                       get_file(file);
>> +                       i_mmap_lock_write(mapping);
>> +                       if (vma_is_shared_maywrite(tmp))
>> +                               mapping_allow_writable(mapping);
>> +                       flush_dcache_mmap_lock(mapping);
>> +                       /* insert tmp into the share list, just after mpnt. */
>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>> +                                       &mapping->i_mmap);
>> +                       flush_dcache_mmap_unlock(mapping);
>> +                       i_mmap_unlock_write(mapping);
>> +               }
>> +
>>                 if (retval) {
>>                         mpnt = vma_next(&vmi);
>>                         goto loop_out;
>>
>>
> 
> How is this possible?  I thought, as specified in mm/rmap.c, that the
> hugetlbfs path would be holding the mmap lock (which is also held in the
> fork path)?

The fork path holds the mmap lock of parent A and of each newly created
child (but never of the first child B), while the hugetlbfs path holds
the mmap lock of the first child B. So the mmap lock won't help here,
because the two paths take locks from different mm's. Or am I missing
something?
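
That is, roughly:

 parent A:  fork() -> dup_mmap()                          holds A's mmap_lock
 child B:   madvise(MADV_REMOVE) -> hugetlbfs_fallocate() holds B's mmap_lock

The only lock shared between the two paths is mapping->i_mmap_rwsem,
and the racy vma_lock manipulation happens outside of it.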

Thanks.

> 
> That is, the mmap_lock must be held before the i_mmap_lock_write()
> 
> Am I missing something?  Do we need an update to mm/rmap.c?
> 
> Thanks,
> Liam
> 
> 
> .
> 




* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-30  2:14         ` Miaohe Lin
@ 2024-01-30  4:08           ` Liam R. Howlett
  2024-01-31  6:51             ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2024-01-30  4:08 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Muchun Song, Thorvald Natvig, Linux-MM

* Miaohe Lin <linmiaohe@huawei.com> [240129 21:14]:
> On 2024/1/30 0:17, Liam R. Howlett wrote:
> > * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
> >> On 2024/1/27 18:13, Miaohe Lin wrote:
> >>> On 2024/1/26 15:50, Muchun Song wrote:
> >>>>
> >>>>
> >>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
> >>>>>
> >>>>> We've found what appears to be a lock issue that results in a blocked
> >>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
> >>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
> >>>>>
> >>>>> Based on some added pr_warn, we believe the following is happening:
> >>>>> When hugetlb_vmdelete_list is entered from the child process,
> >>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> >>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> >>>>> are true.
> >>>>>
> >>>>> While hugetlb_vmdelete_list is executing, the parent process does
> >>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> >>>>> lock for the same vma.
> >>>>>
> >>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> >>>>> the function, vma->vm_private_data is now populated, and hence
> >>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> >>>>> not hold.
> >>>>
> >>>> Thanks for your report. ->vm_private_data was introduced since the
> >>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
> >>>> that at that time (actually, it is a little complex in pmd sharing
> >>>> case). I saw Miaohe had reviewed many of those.
> >>>>
> >>>> CC Miaohe, maybe he has some ideas on this.
> >>>>
> >>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
> >>>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>> dmesg:
> >>>>> WARNING: bad unlock balance detected!
> >>>>> 6.8.0-rc1+ #24 Not tainted
> >>>>> -------------------------------------
> >>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> >>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> >>>>> but there are no more locks to release!
> >>>
> >>> Thanks for your report. It seems there's a race:
> >>>
> >>>  CPU 1											CPU 2
> >>>  fork											hugetlbfs_fallocate
> >>>   dup_mmap										 hugetlbfs_punch_hole
> >>>    i_mmap_lock_write(mapping);								
> >>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
> >>>    i_mmap_unlock_write(mapping);
> >>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
> >>>    											 hugetlb_vmdelete_list
> >>> 											  vma_interval_tree_foreach
> >>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
> >>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> >>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> >>> 											 i_mmap_unlock_write(mapping);
> >>>
> >>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
> >>> But I'm not really sure. I will take a more closed look at next week.
> >>
> >>
> >> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
> >> But I'm not sure whether there're side effects with this patch.
> >>
> >> linux-UJMmTI:/home/linmiaohe/mm # git diff
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 47ff3b35352e..2ef2711452e0 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>                 } else if (anon_vma_fork(tmp, mpnt))
> >>                         goto fail_nomem_anon_vma_fork;
> >>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
> >> -               file = tmp->vm_file;
> >> -               if (file) {
> >> -                       struct address_space *mapping = file->f_mapping;
> >> -
> >> -                       get_file(file);
> >> -                       i_mmap_lock_write(mapping);
> >> -                       if (vma_is_shared_maywrite(tmp))
> >> -                               mapping_allow_writable(mapping);
> >> -                       flush_dcache_mmap_lock(mapping);
> >> -                       /* insert tmp into the share list, just after mpnt */
> >> -                       vma_interval_tree_insert_after(tmp, mpnt,
> >> -                                       &mapping->i_mmap);
> >> -                       flush_dcache_mmap_unlock(mapping);
> >> -                       i_mmap_unlock_write(mapping);
> >> -               }
> >>
> >>                 /*
> >>                  * Copy/update hugetlb private vma information.
> >> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>                 if (tmp->vm_ops && tmp->vm_ops->open)
> >>                         tmp->vm_ops->open(tmp);
> >>
> >> +               file = tmp->vm_file;
> >> +               if (file) {
> >> +                       struct address_space *mapping = file->f_mapping;
> >> +
> >> +                       get_file(file);
> >> +                       i_mmap_lock_write(mapping);
> >> +                       if (vma_is_shared_maywrite(tmp))
> >> +                               mapping_allow_writable(mapping);
> >> +                       flush_dcache_mmap_lock(mapping);
> >> +                       /* insert tmp into the share list, just after mpnt. */
> >> +                       vma_interval_tree_insert_after(tmp, mpnt,
> >> +                                       &mapping->i_mmap);
> >> +                       flush_dcache_mmap_unlock(mapping);
> >> +                       i_mmap_unlock_write(mapping);
> >> +               }
> >> +
> >>                 if (retval) {
> >>                         mpnt = vma_next(&vmi);
> >>                         goto loop_out;
> >>
> >>
> > 
> > How is this possible?  I thought, as specified in mm/rmap.c, that the
> > hugetlbfs path would be holding the mmap lock (which is also held in the
> > fork path)?
> 
> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
> Or am I miss something?

You are correct.  It is also in mm/rmap.c:
 * hugetlbfs PageHuge() take locks in this order:
 *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
 *     vma_lock (hugetlb specific lock for pmd_sharing)
 *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
 *         page->flags PG_locked (lock_page)

Does it make sense for hugetlb_dup_vma_private() to assert
mapping->i_mmap_rwsem is locked?  When is that necessary?
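
Something like this sketch (hypothetical, but i_mmap_assert_write_locked()
is the existing lockdep helper):

	/* hypothetical check at the top of hugetlb_dup_vma_private() */
	if (vma->vm_file)
		i_mmap_assert_write_locked(vma->vm_file->f_mapping);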

I also think it might be safer to move the hugetlb_dup_vma_private()
call up instead of moving the insert into the interval tree down?
See the following comment from mmap.c:

                        /*
                         * Put into interval tree now, so instantiated pages
                         * are visible to arm/parisc __flush_dcache_page
                         * throughout; but we cannot insert into address
                         * space until vma start or end is updated.
                         */

So there may be arch-dependent reasons for this order.

Thanks,
Liam




* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-30  4:08           ` Liam R. Howlett
@ 2024-01-31  6:51             ` Miaohe Lin
  2024-02-02 21:02               ` Jane Chu
  0 siblings, 1 reply; 12+ messages in thread
From: Miaohe Lin @ 2024-01-31  6:51 UTC (permalink / raw)
  To: Liam R. Howlett, Muchun Song, Thorvald Natvig, Linux-MM

On 2024/1/30 12:08, Liam R. Howlett wrote:
> * Miaohe Lin <linmiaohe@huawei.com> [240129 21:14]:
>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>>>>>>
>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>
>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>> are true.
>>>>>>>
>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>> lock for the same vma.
>>>>>>>
>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>> not hold.
>>>>>>
>>>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>>>> that at that time (actually, it is a little complex in pmd sharing
>>>>>> case). I saw Miaohe had reviewed many of those.
>>>>>>
>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>> dmesg:
>>>>>>> WARNING: bad unlock balance detected!
>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>> -------------------------------------
>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>> but there are no more locks to release!
>>>>>
>>>>> Thanks for your report. It seems there's a race:
>>>>>
>>>>>  CPU 1											CPU 2
>>>>>  fork											hugetlbfs_fallocate
>>>>>   dup_mmap										 hugetlbfs_punch_hole
>>>>>    i_mmap_lock_write(mapping);								
>>>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>    i_mmap_unlock_write(mapping);
>>>>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>>>>>    											 hugetlb_vmdelete_list
>>>>> 											  vma_interval_tree_foreach
>>>>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>> 											 i_mmap_unlock_write(mapping);
>>>>>
>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>>>> But I'm not really sure. I will take a more closed look at next week.
>>>>
>>>>
>>>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>>>> But I'm not sure whether there're side effects with this patch.
>>>>
>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>                 } else if (anon_vma_fork(tmp, mpnt))
>>>>                         goto fail_nomem_anon_vma_fork;
>>>>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>> -               file = tmp->vm_file;
>>>> -               if (file) {
>>>> -                       struct address_space *mapping = file->f_mapping;
>>>> -
>>>> -                       get_file(file);
>>>> -                       i_mmap_lock_write(mapping);
>>>> -                       if (vma_is_shared_maywrite(tmp))
>>>> -                               mapping_allow_writable(mapping);
>>>> -                       flush_dcache_mmap_lock(mapping);
>>>> -                       /* insert tmp into the share list, just after mpnt */
>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>>>> -                                       &mapping->i_mmap);
>>>> -                       flush_dcache_mmap_unlock(mapping);
>>>> -                       i_mmap_unlock_write(mapping);
>>>> -               }
>>>>
>>>>                 /*
>>>>                  * Copy/update hugetlb private vma information.
>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>                 if (tmp->vm_ops && tmp->vm_ops->open)
>>>>                         tmp->vm_ops->open(tmp);
>>>>
>>>> +               file = tmp->vm_file;
>>>> +               if (file) {
>>>> +                       struct address_space *mapping = file->f_mapping;
>>>> +
>>>> +                       get_file(file);
>>>> +                       i_mmap_lock_write(mapping);
>>>> +                       if (vma_is_shared_maywrite(tmp))
>>>> +                               mapping_allow_writable(mapping);
>>>> +                       flush_dcache_mmap_lock(mapping);
>>>> +                       /* insert tmp into the share list, just after mpnt. */
>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>>>> +                                       &mapping->i_mmap);
>>>> +                       flush_dcache_mmap_unlock(mapping);
>>>> +                       i_mmap_unlock_write(mapping);
>>>> +               }
>>>> +
>>>>                 if (retval) {
>>>>                         mpnt = vma_next(&vmi);
>>>>                         goto loop_out;
>>>>
>>>>
>>>
>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>> fork path)?
>>
>> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
>> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
>> Or am I miss something?
> 
> You are correct.  It is also in mm/rmap.c:
>  * hugetlbfs PageHuge() take locks in this order:
>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)                                                          
>  *     vma_lock (hugetlb specific lock for pmd_sharing)
>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)                                                      
>  *         page->flags PG_locked (lock_page)
> 
> Does it make sense for hugetlb_dup_vma_private()  to assert
> mapping->i_mmap_rwsem is locked?  When is that necessary?

I'm afraid not. AFAICS, the vma_lock (vma->vm_private_data) is only
modified when the vma is created or destroyed, and the vma_lock is not
supposed to be in use at those times.

> 
> I also think it might be safer to move the hugetlb_dup_vma_private()
> call up instead of the insert into the interval tree down?
> See the following comment from mmap.c:
> 
>                         /*                                                                                              
>                          * Put into interval tree now, so instantiated pages                                            
>                          * are visible to arm/parisc __flush_dcache_page
>                          * throughout; but we cannot insert into address                                                
>                          * space until vma start or end is updated.                                                     
>                          */
> 
> So there may be arch dependent reasons for this order.

Yes, it should be safer to move the hugetlb_dup_vma_private() call up.
But we would also need to move the tmp->vm_ops->open(tmp) call up, or
the race still exists:

 CPU 1                                                  CPU 2
 fork                                                   hugetlbfs_fallocate
  dup_mmap                                               hugetlbfs_punch_hole
   hugetlb_dup_vma_private -- clears vma_lock  <-- moved up
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- child vma is now visible in the i_mmap tree
   i_mmap_unlock_write(mapping);
                                                         i_mmap_lock_write(mapping);
                                                         hugetlb_vmdelete_list
                                                          vma_interval_tree_foreach
                                                           hugetlb_vma_trylock_write -- vma_lock is already cleared
   tmp->vm_ops->open -- allocates a new vma_lock outside i_mmap_rwsem!
                                                           hugetlb_vma_unlock_write -- vma_lock is assigned!!!
                                                         i_mmap_unlock_write(mapping);


My patch is probably not a complete solution; it was meant to confirm
and fix the race quickly. It would be great if you or someone else
could come up with a better and safer solution.

Thanks.

> 
> Thanks,
> Liam
> 
> .
> 




* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-01-31  6:51             ` Miaohe Lin
@ 2024-02-02 21:02               ` Jane Chu
  2024-02-04  1:54                 ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Jane Chu @ 2024-02-02 21:02 UTC (permalink / raw)
  To: Miaohe Lin, Liam R. Howlett, Muchun Song, Thorvald Natvig,
	Linux-MM


On 1/30/2024 10:51 PM, Miaohe Lin wrote:

> On 2024/1/30 12:08, Liam R. Howlett wrote:
>> * Miaohe Lin<linmiaohe@huawei.com>  [240129 21:14]:
>>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>>> * Miaohe Lin<linmiaohe@huawei.com>  [240129 07:56]:
>>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>>
>>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig<thorvald@google.com>  wrote:
>>>>>>>>
>>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>>
>>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>>> are true.
>>>>>>>>
>>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>>> lock for the same vma.
>>>>>>>>
>>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>>> not hold.
>>>>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>>>>> that at that time (actually, it is a little complex in pmd sharing
>>>>>>> case). I saw Miaohe had reviewed many of those.
>>>>>>>
>>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>>
>>>>>>> [1]https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>> dmesg:
>>>>>>>> WARNING: bad unlock balance detected!
>>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>>> -------------------------------------
>>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>>> but there are no more locks to release!
>>>>>> Thanks for your report. It seems there's a race:
>>>>>>
>>>>>>   CPU 1											CPU 2
>>>>>>   fork											hugetlbfs_fallocate
>>>>>>    dup_mmap										 hugetlbfs_punch_hole
>>>>>>     i_mmap_lock_write(mapping);								
>>>>>>     vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>>     i_mmap_unlock_write(mapping);
>>>>>>     hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>>>>>>     											 hugetlb_vmdelete_list
>>>>>> 											  vma_interval_tree_foreach
>>>>>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>>     tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>> 											 i_mmap_unlock_write(mapping);
>>>>>>
>>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>>>>> But I'm not really sure. I will take a more closed look at next week.
>>>>>
>>>>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>>>>> But I'm not sure whether there're side effects with this patch.
>>>>>
>>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>                  } else if (anon_vma_fork(tmp, mpnt))
>>>>>                          goto fail_nomem_anon_vma_fork;
>>>>>                  vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>>> -               file = tmp->vm_file;
>>>>> -               if (file) {
>>>>> -                       struct address_space *mapping = file->f_mapping;
>>>>> -
>>>>> -                       get_file(file);
>>>>> -                       i_mmap_lock_write(mapping);
>>>>> -                       if (vma_is_shared_maywrite(tmp))
>>>>> -                               mapping_allow_writable(mapping);
>>>>> -                       flush_dcache_mmap_lock(mapping);
>>>>> -                       /* insert tmp into the share list, just after mpnt */
>>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>> -                                       &mapping->i_mmap);
>>>>> -                       flush_dcache_mmap_unlock(mapping);
>>>>> -                       i_mmap_unlock_write(mapping);
>>>>> -               }
>>>>>
>>>>>                  /*
>>>>>                   * Copy/update hugetlb private vma information.
>>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>                  if (tmp->vm_ops && tmp->vm_ops->open)
>>>>>                          tmp->vm_ops->open(tmp);
>>>>>
>>>>> +               file = tmp->vm_file;
>>>>> +               if (file) {
>>>>> +                       struct address_space *mapping = file->f_mapping;
>>>>> +
>>>>> +                       get_file(file);
>>>>> +                       i_mmap_lock_write(mapping);
>>>>> +                       if (vma_is_shared_maywrite(tmp))
>>>>> +                               mapping_allow_writable(mapping);
>>>>> +                       flush_dcache_mmap_lock(mapping);
>>>>> +                       /* insert tmp into the share list, just after mpnt. */
>>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>> +                                       &mapping->i_mmap);
>>>>> +                       flush_dcache_mmap_unlock(mapping);
>>>>> +                       i_mmap_unlock_write(mapping);
>>>>> +               }
>>>>> +
>>>>>                  if (retval) {
>>>>>                          mpnt = vma_next(&vmi);
>>>>>                          goto loop_out;
>>>>>
>>>>>
>>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>>> fork path)?
>>> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
>>> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
>>> Or am I miss something?
>> You are correct.  It is also in mm/rmap.c:
>>   * hugetlbfs PageHuge() take locks in this order:
>>   *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>>   *     vma_lock (hugetlb specific lock for pmd_sharing)
>>   *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
>>   *         page->flags PG_locked (lock_page)
>>
>> Does it make sense for hugetlb_dup_vma_private()  to assert
>> mapping->i_mmap_rwsem is locked?  When is that necessary?
> I'm afraid not. AFAICS, vma_lock(vma->vm_private_data) is only modified at the time of
> vma creating or destroy. Vma_lock is not supposed to be used at that time.
>
>> I also think it might be safer to move the hugetlb_dup_vma_private()
>> call up instead of the insert into the interval tree down?
>> See the following comment from mmap.c:
>>
>>                          /*
>>                           * Put into interval tree now, so instantiated pages
>>                           * are visible to arm/parisc __flush_dcache_page
>>                           * throughout; but we cannot insert into address
>>                           * space until vma start or end is updated.
>>                           */
>>
>> So there may be arch dependent reasons for this order.
> Yes, it should be safer to move hugetlb_dup_vma_private() call up. But we also need to move tmp->vm_ops->open(tmp) call up.
> Or the race still exists:
>
>   CPU 1											CPU 2
>   fork											hugetlbfs_fallocate
>    dup_mmap										 hugetlbfs_punch_hole
>     hugetlb_dup_vma_private -- Clear vma_lock.	<-- it is moved up.
>     i_mmap_lock_write(mapping);								
>     vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>     i_mmap_unlock_write(mapping);
>     		 									 i_mmap_lock_write(mapping);
>     											 hugetlb_vmdelete_list
> 											  vma_interval_tree_foreach
> 											   hugetlb_vma_trylock_write -- Vma_lock is already cleared.
>     tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> 											 i_mmap_unlock_write(mapping);
>
>
> My patch should not be a complete solution. It's used to prove and fix the race quickly. It's very great if you or
> someone else can provide a better and safer solution.

But your patch has already moved the vma_interval_tree_insert_after()
block after the tmp->vm_ops->open(tmp) call, right?  Hence, there
should be no more race with truncation?

thanks,
-jane

>
> Thanks.
>
>> Thanks,
>> Liam
>>
>> .
>>
>

[-- Attachment #2: Type: text/html, Size: 11188 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-02-02 21:02               ` Jane Chu
@ 2024-02-04  1:54                 ` Miaohe Lin
  2024-03-29 15:54                   ` Thorvald Natvig
  0 siblings, 1 reply; 12+ messages in thread
From: Miaohe Lin @ 2024-02-04  1:54 UTC (permalink / raw)
  To: Jane Chu, Liam R. Howlett, Muchun Song, Thorvald Natvig, Linux-MM

On 2024/2/3 5:02, Jane Chu wrote:
> On 1/30/2024 10:51 PM, Miaohe Lin wrote:
> 
>> On 2024/1/30 12:08, Liam R. Howlett wrote:
>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 21:14]:
>>>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
>>>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>>>>>>>>
>>>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>>>
>>>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>>>> are true.
>>>>>>>>>
>>>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>>>> lock for the same vma.
>>>>>>>>>
>>>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>>>> not hold.
>>>>>>>> Thanks for your report. ->vm_private_data was introduced by the
>>>>>>>> series [1], so I suspect that is the cause. But I didn't review
>>>>>>>> it at the time (actually, it is a little complex in the pmd sharing
>>>>>>>> case). I saw that Miaohe had reviewed many of those patches.
>>>>>>>>
>>>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>> dmesg:
>>>>>>>>> WARNING: bad unlock balance detected!
>>>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>>>> -------------------------------------
>>>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>>>> but there are no more locks to release!
>>>>>>> Thanks for your report. It seems there's a race:
>>>>>>>
>>>>>>>  CPU 1											CPU 2
>>>>>>>  fork											hugetlbfs_fallocate
>>>>>>>   dup_mmap										 hugetlbfs_punch_hole
>>>>>>>    i_mmap_lock_write(mapping);								
>>>>>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>>>    i_mmap_unlock_write(mapping);
>>>>>>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>>>>>>>    											 hugetlb_vmdelete_list
>>>>>>> 											  vma_interval_tree_foreach
>>>>>>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>>> 											 i_mmap_unlock_write(mapping);
>>>>>>>
>>>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside the i_mmap_rwsem lock, so there may be other bugs behind this one.
>>>>>>> But I'm not really sure. I will take a closer look next week.
>>>>>> This can be fixed by deferring vma_interval_tree_insert_after() until the vma is fully initialized.
>>>>>> But I'm not sure whether there are side effects with this patch.
>>>>>>
>>>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>>                 } else if (anon_vma_fork(tmp, mpnt))
>>>>>>                         goto fail_nomem_anon_vma_fork;
>>>>>>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>>>> -               file = tmp->vm_file;
>>>>>> -               if (file) {
>>>>>> -                       struct address_space *mapping = file->f_mapping;
>>>>>> -
>>>>>> -                       get_file(file);
>>>>>> -                       i_mmap_lock_write(mapping);
>>>>>> -                       if (vma_is_shared_maywrite(tmp))
>>>>>> -                               mapping_allow_writable(mapping);
>>>>>> -                       flush_dcache_mmap_lock(mapping);
>>>>>> -                       /* insert tmp into the share list, just after mpnt */
>>>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>>> -                                       &mapping->i_mmap);
>>>>>> -                       flush_dcache_mmap_unlock(mapping);
>>>>>> -                       i_mmap_unlock_write(mapping);
>>>>>> -               }
>>>>>>
>>>>>>                 /*
>>>>>>                  * Copy/update hugetlb private vma information.
>>>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>>                 if (tmp->vm_ops && tmp->vm_ops->open)
>>>>>>                         tmp->vm_ops->open(tmp);
>>>>>>
>>>>>> +               file = tmp->vm_file;
>>>>>> +               if (file) {
>>>>>> +                       struct address_space *mapping = file->f_mapping;
>>>>>> +
>>>>>> +                       get_file(file);
>>>>>> +                       i_mmap_lock_write(mapping);
>>>>>> +                       if (vma_is_shared_maywrite(tmp))
>>>>>> +                               mapping_allow_writable(mapping);
>>>>>> +                       flush_dcache_mmap_lock(mapping);
>>>>>> +                       /* insert tmp into the share list, just after mpnt. */
>>>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>>> +                                       &mapping->i_mmap);
>>>>>> +                       flush_dcache_mmap_unlock(mapping);
>>>>>> +                       i_mmap_unlock_write(mapping);
>>>>>> +               }
>>>>>> +
>>>>>>                 if (retval) {
>>>>>>                         mpnt = vma_next(&vmi);
>>>>>>                         goto loop_out;
>>>>>>
>>>>>>
>>>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>>>> fork path)?
>>>> The fork path holds the mmap lock from parent A and the other children (except the first child B), while the hugetlbfs path
>>>> holds the mmap lock from the first child B. So the mmap lock won't help here because it comes from a different mm.
>>>> Or am I missing something?
>>> You are correct.  It is also in mm/rmap.c:
>>>  * hugetlbfs PageHuge() take locks in this order:
>>>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)                                                          
>>>  *     vma_lock (hugetlb specific lock for pmd_sharing)
>>>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)                                                      
>>>  *         page->flags PG_locked (lock_page)
>>>
>>> Does it make sense for hugetlb_dup_vma_private()  to assert
>>> mapping->i_mmap_rwsem is locked?  When is that necessary?
>> I'm afraid not. AFAICS, vma_lock (vma->vm_private_data) is only modified at the time of
>> vma creation or destruction. The vma_lock is not supposed to be used at that time.
>>
>>> I also think it might be safer to move the hugetlb_dup_vma_private()
>>> call up instead of the insert into the interval tree down?
>>> See the following comment from mmap.c:
>>>
>>>                         /*                                                                                              
>>>                          * Put into interval tree now, so instantiated pages                                            
>>>                          * are visible to arm/parisc __flush_dcache_page
>>>                          * throughout; but we cannot insert into address                                                
>>>                          * space until vma start or end is updated.                                                     
>>>                          */
>>>
>>> So there may be arch dependent reasons for this order.
>> Yes, it should be safer to move the hugetlb_dup_vma_private() call up. But we also need to move the tmp->vm_ops->open(tmp) call up,
>> or the race still exists:
>>
>>  CPU 1											CPU 2
>>  fork											hugetlbfs_fallocate
>>   dup_mmap										 hugetlbfs_punch_hole
>>    hugetlb_dup_vma_private -- Clear vma_lock.	<-- it is moved up.
>>    i_mmap_lock_write(mapping);								
>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>    i_mmap_unlock_write(mapping);
>>    		 									 i_mmap_lock_write(mapping);
>>    											 hugetlb_vmdelete_list
>> 											  vma_interval_tree_foreach
>> 											   hugetlb_vma_trylock_write -- Vma_lock is already cleared.
>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>> 											 i_mmap_unlock_write(mapping);
>>
>>
>> My patch is not meant to be a complete solution; it's meant to prove and fix the race quickly. It would be great if you or
>> someone else could provide a better and safer solution.
> 
> But your patch has already moved the vma_interval_tree_insert_after() block after the
> tmp->vm_ops->open(tmp) call, right? Hence, there should be no more race with truncation?

Sure. There will be no more race if the tmp->vm_ops->open(tmp) call is *also* moved above the vma_interval_tree_insert_after() block.
But I'm not sure it's safe to do so; there might be some obscure assumptions about when vma_interval_tree_insert_after() has to be called.
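
For clarity, the reordering we are talking about would look roughly like this in dup_mmap() -- a sketch based on the hunks quoted above, not a tested patch; the file/i_mmap block is the same one from the diff, only its position changes:

		/*
		 * Copy/update hugetlb private vma information.
		 */
		if (is_vm_hugetlb_page(tmp))
			hugetlb_dup_vma_private(tmp);	/* moved up */

		if (tmp->vm_ops && tmp->vm_ops->open)
			tmp->vm_ops->open(tmp);		/* moved up as well */

		file = tmp->vm_file;
		if (file) {
			/*
			 * get_file(), i_mmap_lock_write(),
			 * vma_interval_tree_insert_after(), ... exactly as in
			 * the hunk above. The child vma only becomes visible
			 * in the i_mmap tree here, after its vma_lock and
			 * vm_ops state are fully set up.
			 */
		}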

Thanks.

> 
> thanks,
> -jane
> 
>> Thanks.
>>
>>> Thanks,
>>> Liam
>>>
>>> .
>>>



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-02-04  1:54                 ` Miaohe Lin
@ 2024-03-29 15:54                   ` Thorvald Natvig
  2024-04-02 11:24                     ` Miaohe Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Thorvald Natvig @ 2024-03-29 15:54 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Jane Chu, Liam R. Howlett, Muchun Song, Linux-MM

Did this patch (or another fix for the same problem) make it through?
If not, is there anything we can do to help?

- Thorvald


On Sat, Feb 3, 2024 at 5:54 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2024/2/3 5:02, Jane Chu wrote:
> > On 1/30/2024 10:51 PM, Miaohe Lin wrote:
> >
> >> On 2024/1/30 12:08, Liam R. Howlett wrote:
> >>> * Miaohe Lin <linmiaohe@huawei.com> [240129 21:14]:
> >>>> On 2024/1/30 0:17, Liam R. Howlett wrote:
> >>>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
> >>>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
> >>>>>>> On 2024/1/26 15:50, Muchun Song wrote:
> >>>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
> >>>>>>>>>
> >>>>>>>>> We've found what appears to be a lock issue that results in a blocked
> >>>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
> >>>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
> >>>>>>>>>
> >>>>>>>>> Based on some added pr_warn, we believe the following is happening:
> >>>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
> >>>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> >>>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> >>>>>>>>> are true.
> >>>>>>>>>
> >>>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
> >>>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> >>>>>>>>> lock for the same vma.
> >>>>>>>>>
> >>>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> >>>>>>>>> the function, vma->vm_private_data is now populated, and hence
> >>>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> >>>>>>>>> not hold.
> >>>>>>>> Thanks for your report. ->vm_private_data was introduced by the
> >>>>>>>> series [1], so I suspect that is the cause. But I didn't review
> >>>>>>>> it at the time (actually, it is a little complex in the pmd sharing
> >>>>>>>> case). I saw that Miaohe had reviewed many of those patches.
> >>>>>>>>
> >>>>>>>> CC Miaohe, maybe he has some ideas on this.
> >>>>>>>>
> >>>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
> >>>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>>>> dmesg:
> >>>>>>>>> WARNING: bad unlock balance detected!
> >>>>>>>>> 6.8.0-rc1+ #24 Not tainted
> >>>>>>>>> -------------------------------------
> >>>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> >>>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> >>>>>>>>> but there are no more locks to release!
> >>>>>>> Thanks for your report. It seems there's a race:
> >>>>>>>
> >>>>>>>  CPU 1                                                                                  CPU 2
> >>>>>>>  fork                                                                                   hugetlbfs_fallocate
> >>>>>>>   dup_mmap                                                                               hugetlbfs_punch_hole
> >>>>>>>    i_mmap_lock_write(mapping);
> >>>>>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
> >>>>>>>    i_mmap_unlock_write(mapping);
> >>>>>>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!                       i_mmap_lock_write(mapping);
> >>>>>>>                                                                                          hugetlb_vmdelete_list
> >>>>>>>                                                                                           vma_interval_tree_foreach
> >>>>>>>                                                                                            hugetlb_vma_trylock_write -- Vma_lock is cleared.
> >>>>>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> >>>>>>>                                                                                            hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> >>>>>>>                                                                                          i_mmap_unlock_write(mapping);
> >>>>>>>
> >>>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside the i_mmap_rwsem lock, so there may be other bugs behind this one.
> >>>>>>> But I'm not really sure. I will take a closer look next week.
> >>>>>> This can be fixed by deferring vma_interval_tree_insert_after() until the vma is fully initialized.
> >>>>>> But I'm not sure whether there are side effects with this patch.
> >>>>>>
> >>>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
> >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> >>>>>> index 47ff3b35352e..2ef2711452e0 100644
> >>>>>> --- a/kernel/fork.c
> >>>>>> +++ b/kernel/fork.c
> >>>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>>>>>                 } else if (anon_vma_fork(tmp, mpnt))
> >>>>>>                         goto fail_nomem_anon_vma_fork;
> >>>>>>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
> >>>>>> -               file = tmp->vm_file;
> >>>>>> -               if (file) {
> >>>>>> -                       struct address_space *mapping = file->f_mapping;
> >>>>>> -
> >>>>>> -                       get_file(file);
> >>>>>> -                       i_mmap_lock_write(mapping);
> >>>>>> -                       if (vma_is_shared_maywrite(tmp))
> >>>>>> -                               mapping_allow_writable(mapping);
> >>>>>> -                       flush_dcache_mmap_lock(mapping);
> >>>>>> -                       /* insert tmp into the share list, just after mpnt */
> >>>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
> >>>>>> -                                       &mapping->i_mmap);
> >>>>>> -                       flush_dcache_mmap_unlock(mapping);
> >>>>>> -                       i_mmap_unlock_write(mapping);
> >>>>>> -               }
> >>>>>>
> >>>>>>                 /*
> >>>>>>                  * Copy/update hugetlb private vma information.
> >>>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>>>>>                 if (tmp->vm_ops && tmp->vm_ops->open)
> >>>>>>                         tmp->vm_ops->open(tmp);
> >>>>>>
> >>>>>> +               file = tmp->vm_file;
> >>>>>> +               if (file) {
> >>>>>> +                       struct address_space *mapping = file->f_mapping;
> >>>>>> +
> >>>>>> +                       get_file(file);
> >>>>>> +                       i_mmap_lock_write(mapping);
> >>>>>> +                       if (vma_is_shared_maywrite(tmp))
> >>>>>> +                               mapping_allow_writable(mapping);
> >>>>>> +                       flush_dcache_mmap_lock(mapping);
> >>>>>> +                       /* insert tmp into the share list, just after mpnt. */
> >>>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
> >>>>>> +                                       &mapping->i_mmap);
> >>>>>> +                       flush_dcache_mmap_unlock(mapping);
> >>>>>> +                       i_mmap_unlock_write(mapping);
> >>>>>> +               }
> >>>>>> +
> >>>>>>                 if (retval) {
> >>>>>>                         mpnt = vma_next(&vmi);
> >>>>>>                         goto loop_out;
> >>>>>>
> >>>>>>
> >>>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
> >>>>> hugetlbfs path would be holding the mmap lock (which is also held in the
> >>>>> fork path)?
> >>>> The fork path holds the mmap lock from parent A and the other children (except the first child B), while the hugetlbfs path
> >>>> holds the mmap lock from the first child B. So the mmap lock won't help here because it comes from a different mm.
> >>>> Or am I missing something?
> >>> You are correct.  It is also in mm/rmap.c:
> >>>  * hugetlbfs PageHuge() take locks in this order:
> >>>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> >>>  *     vma_lock (hugetlb specific lock for pmd_sharing)
> >>>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
> >>>  *         page->flags PG_locked (lock_page)
> >>>
> >>> Does it make sense for hugetlb_dup_vma_private()  to assert
> >>> mapping->i_mmap_rwsem is locked?  When is that necessary?
> >> I'm afraid not. AFAICS, vma_lock (vma->vm_private_data) is only modified at the time of
> >> vma creation or destruction. The vma_lock is not supposed to be used at that time.
> >>
> >>> I also think it might be safer to move the hugetlb_dup_vma_private()
> >>> call up instead of the insert into the interval tree down?
> >>> See the following comment from mmap.c:
> >>>
> >>>                         /*
> >>>                          * Put into interval tree now, so instantiated pages
> >>>                          * are visible to arm/parisc __flush_dcache_page
> >>>                          * throughout; but we cannot insert into address
> >>>                          * space until vma start or end is updated.
> >>>                          */
> >>>
> >>> So there may be arch dependent reasons for this order.
> >> Yes, it should be safer to move the hugetlb_dup_vma_private() call up. But we also need to move the tmp->vm_ops->open(tmp) call up,
> >> or the race still exists:
> >>
> >>  CPU 1                                                                                       CPU 2
> >>  fork                                                                                        hugetlbfs_fallocate
> >>   dup_mmap                                                                            hugetlbfs_punch_hole
> >>    hugetlb_dup_vma_private -- Clear vma_lock.        <-- it is moved up.
> >>    i_mmap_lock_write(mapping);
> >>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
> >>    i_mmap_unlock_write(mapping);
> >>                                                                                       i_mmap_lock_write(mapping);
> >>                                                                                       hugetlb_vmdelete_list
> >>                                                                                        vma_interval_tree_foreach
> >>                                                                                         hugetlb_vma_trylock_write -- Vma_lock is already cleared.
> >>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> >>                                                                                         hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> >>                                                                                       i_mmap_unlock_write(mapping);
> >>
> >>
> >> My patch is not meant to be a complete solution; it's meant to prove and fix the race quickly. It would be great if you or
> >> someone else could provide a better and safer solution.
> >
> > But your patch has already moved the vma_interval_tree_insert_after() block after the
> > tmp->vm_ops->open(tmp) call, right? Hence, there should be no more race with truncation?
>
>> Sure. There will be no more race if the tmp->vm_ops->open(tmp) call is *also* moved above the vma_interval_tree_insert_after() block.
>> But I'm not sure it's safe to do so; there might be some obscure assumptions about when vma_interval_tree_insert_after() has to be called.
>
> Thanks.
>
> >
> > thanks,
> > -jane
> >
> >> Thanks.
> >>
> >>> Thanks,
> >>> Liam
> >>>
> >>> .
> >>>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
  2024-03-29 15:54                   ` Thorvald Natvig
@ 2024-04-02 11:24                     ` Miaohe Lin
  0 siblings, 0 replies; 12+ messages in thread
From: Miaohe Lin @ 2024-04-02 11:24 UTC (permalink / raw)
  To: Thorvald Natvig; +Cc: Jane Chu, Liam R. Howlett, Muchun Song, Linux-MM

On 2024/3/29 23:54, Thorvald Natvig wrote:
> Did this patch (or another fix for the same problem) make it through?

I'm sorry, but I didn't have enough time to figure out a complete solution, and this slipped my mind...

> If not, is there anything we can do to help?

I will try to send a formal patch as soon as possible. Full testing and review would be really helpful.
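
In the meantime, for anyone who wants to pick this up: the asymmetry that produces the lockdep splat lives in the mm/hugetlb.c lock helpers. An abridged paraphrase follows (the private-lock branch is omitted; please check the tree for the exact code):

	int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
	{
		if (__vma_shareable_lock(vma)) {
			struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

			return down_write_trylock(&vma_lock->rw_sema);
		}
		/* No lock present: report success without taking anything. */
		return 1;
	}

	void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
	{
		if (__vma_shareable_lock(vma)) {
			struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

			/*
			 * If vm_private_data was populated by a racing
			 * hugetlb_vm_op_open() after the trylock above saw
			 * NULL, this releases a rwsem we never acquired --
			 * the "bad unlock balance" in the report.
			 */
			up_write(&vma_lock->rw_sema);
		}
	}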

Thanks.

> 
> - Thorvald
> 
> 
> On Sat, Feb 3, 2024 at 5:54 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2024/2/3 5:02, Jane Chu wrote:
>>> On 1/30/2024 10:51 PM, Miaohe Lin wrote:
>>>
>>>> On 2024/1/30 12:08, Liam R. Howlett wrote:
>>>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 21:14]:
>>>>>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>>>>>> * Miaohe Lin <linmiaohe@huawei.com> [240129 07:56]:
>>>>>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>>>>>
>>>>>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>>>>>> are true.
>>>>>>>>>>>
>>>>>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>>>>>> lock for the same vma.
>>>>>>>>>>>
>>>>>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>>>>>> not hold.
> >>>>>>>>>> Thanks for your report. ->vm_private_data was introduced by the
> >>>>>>>>>> series [1], so I suspect that is the cause. But I didn't review
> >>>>>>>>>> it at the time (actually, it is a little complex in the pmd sharing
> >>>>>>>>>> case). I saw that Miaohe had reviewed many of those patches.
>>>>>>>>>>
>>>>>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>>>>>
>>>>>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>> dmesg:
>>>>>>>>>>> WARNING: bad unlock balance detected!
>>>>>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>>>>>> -------------------------------------
>>>>>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>>>>>> but there are no more locks to release!
>>>>>>>>> Thanks for your report. It seems there's a race:
>>>>>>>>>
>>>>>>>>>  CPU 1                                                                                  CPU 2
>>>>>>>>>  fork                                                                                   hugetlbfs_fallocate
>>>>>>>>>   dup_mmap                                                                               hugetlbfs_punch_hole
>>>>>>>>>    i_mmap_lock_write(mapping);
>>>>>>>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>>>>>    i_mmap_unlock_write(mapping);
>>>>>>>>>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!                       i_mmap_lock_write(mapping);
>>>>>>>>>                                                                                          hugetlb_vmdelete_list
>>>>>>>>>                                                                                           vma_interval_tree_foreach
>>>>>>>>>                                                                                            hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>>>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>>>>>                                                                                            hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>>>>>                                                                                          i_mmap_unlock_write(mapping);
>>>>>>>>>
> >>>>>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside the i_mmap_rwsem lock, so there may be other bugs behind this one.
> >>>>>>>>> But I'm not really sure. I will take a closer look next week.
> >>>>>>>> This can be fixed by deferring vma_interval_tree_insert_after() until the vma is fully initialized.
> >>>>>>>> But I'm not sure whether there are side effects with this patch.
>>>>>>>>
>>>>>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>>>>>> --- a/kernel/fork.c
>>>>>>>> +++ b/kernel/fork.c
>>>>>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>>>>                 } else if (anon_vma_fork(tmp, mpnt))
>>>>>>>>                         goto fail_nomem_anon_vma_fork;
>>>>>>>>                 vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>>>>>> -               file = tmp->vm_file;
>>>>>>>> -               if (file) {
>>>>>>>> -                       struct address_space *mapping = file->f_mapping;
>>>>>>>> -
>>>>>>>> -                       get_file(file);
>>>>>>>> -                       i_mmap_lock_write(mapping);
>>>>>>>> -                       if (vma_is_shared_maywrite(tmp))
>>>>>>>> -                               mapping_allow_writable(mapping);
>>>>>>>> -                       flush_dcache_mmap_lock(mapping);
>>>>>>>> -                       /* insert tmp into the share list, just after mpnt */
>>>>>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>>>>> -                                       &mapping->i_mmap);
>>>>>>>> -                       flush_dcache_mmap_unlock(mapping);
>>>>>>>> -                       i_mmap_unlock_write(mapping);
>>>>>>>> -               }
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>>                  * Copy/update hugetlb private vma information.
>>>>>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>>>>                 if (tmp->vm_ops && tmp->vm_ops->open)
>>>>>>>>                         tmp->vm_ops->open(tmp);
>>>>>>>>
>>>>>>>> +               file = tmp->vm_file;
>>>>>>>> +               if (file) {
>>>>>>>> +                       struct address_space *mapping = file->f_mapping;
>>>>>>>> +
>>>>>>>> +                       get_file(file);
>>>>>>>> +                       i_mmap_lock_write(mapping);
>>>>>>>> +                       if (vma_is_shared_maywrite(tmp))
>>>>>>>> +                               mapping_allow_writable(mapping);
>>>>>>>> +                       flush_dcache_mmap_lock(mapping);
>>>>>>>> +                       /* insert tmp into the share list, just after mpnt. */
>>>>>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>>>>> +                                       &mapping->i_mmap);
>>>>>>>> +                       flush_dcache_mmap_unlock(mapping);
>>>>>>>> +                       i_mmap_unlock_write(mapping);
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>>                 if (retval) {
>>>>>>>>                         mpnt = vma_next(&vmi);
>>>>>>>>                         goto loop_out;
>>>>>>>>
>>>>>>>>
>>>>>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>>>>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>>>>>> fork path)?
> >>>>>> The fork path holds the mmap lock from parent A and the other children (except the first child B), while the hugetlbfs path
> >>>>>> holds the mmap lock from the first child B. So the mmap lock won't help here because it comes from a different mm.
> >>>>>> Or am I missing something?
>>>>> You are correct.  It is also in mm/rmap.c:
>>>>>  * hugetlbfs PageHuge() take locks in this order:
>>>>>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>>>>>  *     vma_lock (hugetlb specific lock for pmd_sharing)
>>>>>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
>>>>>  *         page->flags PG_locked (lock_page)
>>>>>
>>>>> Does it make sense for hugetlb_dup_vma_private()  to assert
>>>>> mapping->i_mmap_rwsem is locked?  When is that necessary?
> >>>> I'm afraid not. AFAICS, vma_lock (vma->vm_private_data) is only modified at the time of
> >>>> vma creation or destruction. The vma_lock is not supposed to be used at that time.
>>>>
>>>>> I also think it might be safer to move the hugetlb_dup_vma_private()
>>>>> call up instead of the insert into the interval tree down?
>>>>> See the following comment from mmap.c:
>>>>>
>>>>>                         /*
>>>>>                          * Put into interval tree now, so instantiated pages
>>>>>                          * are visible to arm/parisc __flush_dcache_page
>>>>>                          * throughout; but we cannot insert into address
>>>>>                          * space until vma start or end is updated.
>>>>>                          */
>>>>>
>>>>> So there may be arch dependent reasons for this order.
> >>>> Yes, it should be safer to move the hugetlb_dup_vma_private() call up. But we also need to move the tmp->vm_ops->open(tmp) call up,
> >>>> or the race still exists:
>>>>
>>>>  CPU 1                                                                                       CPU 2
>>>>  fork                                                                                        hugetlbfs_fallocate
>>>>   dup_mmap                                                                            hugetlbfs_punch_hole
>>>>    hugetlb_dup_vma_private -- Clear vma_lock.        <-- it is moved up.
>>>>    i_mmap_lock_write(mapping);
>>>>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>    i_mmap_unlock_write(mapping);
>>>>                                                                                       i_mmap_lock_write(mapping);
>>>>                                                                                       hugetlb_vmdelete_list
>>>>                                                                                        vma_interval_tree_foreach
>>>>                                                                                         hugetlb_vma_trylock_write -- Vma_lock is already cleared.
>>>>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>                                                                                         hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>                                                                                       i_mmap_unlock_write(mapping);
>>>>
>>>>
> >>>> My patch is not meant to be a complete solution; it's meant to prove and fix the race quickly. It would be great if you or
> >>>> someone else could provide a better and safer solution.
>>>
>>> But your patch has already moved the vma_interval_tree_insert_after() block after the
>>> tmp->vm_ops->open(tmp) call, right? Hence, there should be no more race with truncation?
>>
>> Sure. There will be no more race if the tmp->vm_ops->open(tmp) call is *also* moved above the vma_interval_tree_insert_after() block.
>> But I'm not sure it's safe to do so; there might be some obscure assumptions about when vma_interval_tree_insert_after() has to be called.
>>
>> Thanks.
>>
>>>
>>> thanks,
>>> -jane
>>>
>>>> Thanks.
>>>>
>>>>> Thanks,
>>>>> Liam
>>>>>
>>>>> .
>>>>>
>>
> .
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-04-02 11:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-25 20:28 hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE Thorvald Natvig
2024-01-26  7:50 ` Muchun Song
2024-01-27 10:13   ` Miaohe Lin
2024-01-29 12:56     ` Miaohe Lin
2024-01-29 16:17       ` Liam R. Howlett
2024-01-30  2:14         ` Miaohe Lin
2024-01-30  4:08           ` Liam R. Howlett
2024-01-31  6:51             ` Miaohe Lin
2024-02-02 21:02               ` Jane Chu
2024-02-04  1:54                 ` Miaohe Lin
2024-03-29 15:54                   ` Thorvald Natvig
2024-04-02 11:24                     ` Miaohe Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).