linux-mm.kvack.org archive mirror
* Multiple potential races on vma->vm_flags
       [not found] ` <55EC9221.4040603@oracle.com>
@ 2015-09-07 11:40   ` Kirill A. Shutemov
  2015-09-09 15:27     ` Vlastimil Babka
  2015-09-10  0:58     ` Sasha Levin
  0 siblings, 2 replies; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-07 11:40 UTC (permalink / raw)
  To: Sasha Levin, Rik van Riel, Andrew Morton
  Cc: Andrey Konovalov, Dmitry Vyukov, linux-mm, linux-kernel

On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> ==================================================================
> ThreadSanitizer: data-race in munlock_vma_pages_range
> 
> Write of size 8 by thread T378 (K2633, CPU3):
>  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186

...

> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186

Okay, the detected race is mlock/munlock vs. rmap.

On the rmap side we check vma->vm_flags in a few places without taking
vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
us.

In this particular case, the speculative check at the beginning of
try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
the function.

A false negative is fine here too, since we will mlock the page in
__mm_populate() on the mlock side after mlock_fixup().

BUT.

We *must* have all speculative vm_flags accesses wrapped in READ_ONCE() to
avoid compiler trickery, like duplicating the vm_flags access with
inconsistent results.

I have only looked at the VM_LOCKED checks, but there are a few other flags
checked in rmap. All of them must be handled carefully. At least READ_ONCE()
is required.

Another solution would be to introduce a per-vma spinlock to protect
vma->vm_flags (and probably other vma fields) and offload this duty
from mmap_sem.
But that's a much bigger project.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-07 11:40   ` Multiple potential races on vma->vm_flags Kirill A. Shutemov
@ 2015-09-09 15:27     ` Vlastimil Babka
  2015-09-09 16:01       ` Kirill A. Shutemov
  2015-09-10  0:58     ` Sasha Levin
  1 sibling, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2015-09-09 15:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sasha Levin, Rik van Riel, Andrew Morton
  Cc: Andrey Konovalov, Dmitry Vyukov, linux-mm, linux-kernel

On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==================================================================
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>   [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>   [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>   [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>>   [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>   [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>   [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>   [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>   [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>>   [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>   [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>   [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>   [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>   [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>   [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>   [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>>   [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>   [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>   [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On the rmap side we check vma->vm_flags in a few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
> us.
>
> In this particular case, the speculative check at the beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.
>
> A false negative is fine here too, since we will mlock the page in
> __mm_populate() on the mlock side after mlock_fixup().
>
> BUT.
>
> We *must* have all speculative vm_flags accesses wrapped in READ_ONCE() to
> avoid compiler trickery, like duplicating the vm_flags access with
> inconsistent results.

Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a 
compiler barrier forcing vm_flags to be re-read?

> I have only looked at the VM_LOCKED checks, but there are a few other flags
> checked in rmap. All of them must be handled carefully. At least READ_ONCE()
> is required.
>
> Another solution would be to introduce a per-vma spinlock to protect
> vma->vm_flags (and probably other vma fields) and offload this duty
> from mmap_sem.
> But that's a much bigger project.

Sounds like overkill, unless we find something more serious than this.



* Re: Multiple potential races on vma->vm_flags
  2015-09-09 15:27     ` Vlastimil Babka
@ 2015-09-09 16:01       ` Kirill A. Shutemov
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-09 16:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Rik van Riel, Andrew Morton, Andrey Konovalov,
	Dmitry Vyukov, linux-mm, linux-kernel

On Wed, Sep 09, 2015 at 05:27:16PM +0200, Vlastimil Babka wrote:
> On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> >On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> >>==================================================================
> >>ThreadSanitizer: data-race in munlock_vma_pages_range
> >>
> >>Write of size 8 by thread T378 (K2633, CPU3):
> >>  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> >>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >>arch/x86/entry/entry_64.S:186
> >
> >...
> >
> >>Previous read of size 8 by thread T398 (K2623, CPU2):
> >>  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> >>  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
> >>  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
> >>  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> >>  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> >>  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> >>  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> >>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >>arch/x86/entry/entry_64.S:186
> >
> >Okay, the detected race is mlock/munlock vs. rmap.
> >
> >On the rmap side we check vma->vm_flags in a few places without taking
> >vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> >or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
> >us.
> >
> >In this particular case, the speculative check at the beginning of
> >try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> >the function.
> >
> >A false negative is fine here too, since we will mlock the page in
> >__mm_populate() on the mlock side after mlock_fixup().
> >
> >BUT.
> >
> >We *must* have all speculative vm_flags accesses wrapped in READ_ONCE() to
> >avoid compiler trickery, like duplicating the vm_flags access with
> >inconsistent results.
> 
> Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a
> compiler barrier forcing vm_flags to be re-read?

Yes, but it doesn't prevent the compiler from generating multiple reads from
vma->vm_flags, and things may blow up if the two values don't match.

> >I have only looked at the VM_LOCKED checks, but there are a few other flags
> >checked in rmap. All of them must be handled carefully. At least READ_ONCE()
> >is required.
> >
> >Another solution would be to introduce a per-vma spinlock to protect
> >vma->vm_flags (and probably other vma fields) and offload this duty
> >from mmap_sem.
> >But that's a much bigger project.
> 
> Sounds like overkill, unless we find something more serious than this.

Maybe...

-- 
 Kirill A. Shutemov



* Re: Multiple potential races on vma->vm_flags
  2015-09-07 11:40   ` Multiple potential races on vma->vm_flags Kirill A. Shutemov
  2015-09-09 15:27     ` Vlastimil Babka
@ 2015-09-10  0:58     ` Sasha Levin
  2015-09-10  8:36       ` Kirill A. Shutemov
  1 sibling, 1 reply; 35+ messages in thread
From: Sasha Levin @ 2015-09-10  0:58 UTC (permalink / raw)
  To: Kirill A. Shutemov, Rik van Riel, Andrew Morton
  Cc: Andrey Konovalov, Dmitry Vyukov, linux-mm, linux-kernel

On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> > ==================================================================
>> > ThreadSanitizer: data-race in munlock_vma_pages_range
>> > 
>> > Write of size 8 by thread T378 (K2633, CPU3):
>> >  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> > arch/x86/entry/entry_64.S:186
> ...
> 
>> > Previous read of size 8 by thread T398 (K2623, CPU2):
>> >  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>> >  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>> >  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>> >  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>> >  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>> >  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>> >  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> > arch/x86/entry/entry_64.S:186
> Okay, the detected race is mlock/munlock vs. rmap.
> 
> On the rmap side we check vma->vm_flags in a few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
> us.
>
> In this particular case, the speculative check at the beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.

So you're suggesting that this isn't the cause of the bad page flags
error observed by Andrey and myself?


Thanks,
Sasha



* Re: Multiple potential races on vma->vm_flags
  2015-09-10  0:58     ` Sasha Levin
@ 2015-09-10  8:36       ` Kirill A. Shutemov
  2015-09-10 13:27         ` Andrey Konovalov
  0 siblings, 1 reply; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-10  8:36 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rik van Riel, Andrew Morton, Andrey Konovalov, Dmitry Vyukov,
	linux-mm, linux-kernel

On Wed, Sep 09, 2015 at 08:58:26PM -0400, Sasha Levin wrote:
> On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> >> > ==================================================================
> >> > ThreadSanitizer: data-race in munlock_vma_pages_range
> >> > 
> >> > Write of size 8 by thread T378 (K2633, CPU3):
> >> >  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >> > arch/x86/entry/entry_64.S:186
> > ...
> > 
> >> > Previous read of size 8 by thread T398 (K2623, CPU2):
> >> >  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> >> >  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
> >> >  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
> >> >  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> >> >  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> >> >  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> >> >  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >> > arch/x86/entry/entry_64.S:186
> > Okay, the detected race is mlock/munlock vs. rmap.
> > 
> > On the rmap side we check vma->vm_flags in a few places without taking
> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> > or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
> > us.
> > 
> > In this particular case, the speculative check at the beginning of
> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> > the function.
> 
> So you're suggesting that this isn't the cause of the bad page flags
> error observed by Andrey and myself?

I don't see it, but who knows.

-- 
 Kirill A. Shutemov



* Re: Multiple potential races on vma->vm_flags
  2015-09-10  8:36       ` Kirill A. Shutemov
@ 2015-09-10 13:27         ` Andrey Konovalov
  2015-09-11 10:39           ` Kirill A. Shutemov
  0 siblings, 1 reply; 35+ messages in thread
From: Andrey Konovalov @ 2015-09-10 13:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sasha Levin, Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel

Can a vma be shared among a few mm's?
If yes, then taking current->mm->mmap_sem to protect the vma is not enough.

In the first report below both T378 and T398 take
current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
different locks (the addresses are different).
In the second report T309 doesn't take any locks at all, since it
assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
has no other users, but it then writes to the vma.

==================================================================
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T378 (K2633, CPU3):
 [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
 [<     inlined    >] SYSC_munlock mm/mlock.c:651
 [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

Locks held by T378:
#0 Lock 25710428 taken here:
 [<     inlined    >] SYSC_munlock mm/mlock.c:650
 [<ffffffff8121308c>] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

Previous read of size 8 by thread T398 (K2623, CPU2):
 [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [<     inlined    >] rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
 [<     inlined    >] SYSC_munlock mm/mlock.c:651
 [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

Locks held by T398:
#0 Lock 21b00c68 taken here:
 [<     inlined    >] SYSC_munlock mm/mlock.c:650
 [<ffffffff8121308c>] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186
#1 Lock bac2d750 taken here:
 [<     inlined    >] i_mmap_lock_read include/linux/fs.h:509
 [<     inlined    >] rmap_walk_file mm/rmap.c:1533
 [<ffffffff8121e6e8>] rmap_walk+0x78/0x450 mm/rmap.c:1559
 [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
 [<     inlined    >] SYSC_munlock mm/mlock.c:651
 [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186
#2 Lock 0895f570 taken here:
 [<     inlined    >] spin_lock include/linux/spinlock.h:312
 [<ffffffff8121c959>] __page_check_address+0xd9/0x210 mm/rmap.c:681
 [<     inlined    >] page_check_address include/linux/rmap.h:204
 [<ffffffff8121d173>] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
 [<     inlined    >] rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
 [<     inlined    >] SYSC_munlock mm/mlock.c:651
 [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

DBG: addr: ffff880222610e10
DBG: first offset: 0, second offset: 0
DBG: T378 clock: {T378: 4486533, T398: 2405850}
DBG: T398 clock: {T398: 2406009}
==================================================================

==================================================================
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T309 (K2577, CPU0):
 [<ffffffff81211fc9>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inlined    >] exit_mm kernel/exit.c:437
 [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
 [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<     inlined    >] SYSC_exit_group kernel/exit.c:885
 [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

Locks held by T309:

Previous read of size 8 by thread T293 (K2573, CPU3):
 [<ffffffff8121cbe8>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [<     inlined    >] rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121e207>] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inlined    >] exit_mm kernel/exit.c:437
 [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
 [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<     inlined    >] SYSC_exit_group kernel/exit.c:885
 [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

Locks held by T293:
#0 Lock bb0dc710 taken here:
 [<     inlined    >] i_mmap_lock_read include/linux/fs.h:509
 [<     inlined    >] rmap_walk_file mm/rmap.c:1533
 [<ffffffff8121e138>] rmap_walk+0x78/0x450 mm/rmap.c:1559
 [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inlined    >] exit_mm kernel/exit.c:437
 [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
 [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<     inlined    >] SYSC_exit_group kernel/exit.c:885
 [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186
#1 Lock 02e0f1b0 taken here:
 [<     inlined    >] spin_lock include/linux/spinlock.h:312
 [<ffffffff8121c3a9>] __page_check_address+0xd9/0x210 mm/rmap.c:681
 [<     inlined    >] page_check_address include/linux/rmap.h:204
 [<ffffffff8121cbc3>] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
 [<     inlined    >] rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121e207>] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inlined    >] exit_mm kernel/exit.c:437
 [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
 [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<     inlined    >] SYSC_exit_group kernel/exit.c:885
 [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
arch/x86/entry/entry_64.S:186

DBG: addr: ffff8800bb153a78
DBG: first offset: 0, second offset: 0
DBG: T309 clock: {T309: 1297809, T293: 747168}
DBG: T293 clock: {T293: 747528}
==================================================================

On Thu, Sep 10, 2015 at 10:36 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Wed, Sep 09, 2015 at 08:58:26PM -0400, Sasha Levin wrote:
>> On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
>> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> >> > ==================================================================
>> >> > ThreadSanitizer: data-race in munlock_vma_pages_range
>> >> >
>> >> > Write of size 8 by thread T378 (K2633, CPU3):
>> >> >  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> >> > arch/x86/entry/entry_64.S:186
>> > ...
>> >
>> >> > Previous read of size 8 by thread T398 (K2623, CPU2):
>> >> >  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>> >> >  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>> >> >  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>> >> >  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>> >> >  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>> >> >  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>> >> >  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> >> > arch/x86/entry/entry_64.S:186
>> > Okay, the detected race is mlock/munlock vs. rmap.
>> >
>> > On the rmap side we check vma->vm_flags in a few places without taking
>> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
>> > or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
>> > us.
>> >
>> > In this particular case, the speculative check at the beginning of
>> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
>> > the function.
>>
>> So you're suggesting that this isn't the cause of the bad page flags
>> error observed by Andrey and myself?
>
> I don't see it, but who knows.
>
> --
>  Kirill A. Shutemov



* Re: Multiple potential races on vma->vm_flags
  2015-09-10 13:27         ` Andrey Konovalov
@ 2015-09-11 10:39           ` Kirill A. Shutemov
  2015-09-11 15:29             ` Vlastimil Babka
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-11 10:39 UTC (permalink / raw)
  To: Andrey Konovalov, Oleg Nesterov
  Cc: Sasha Levin, Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka, Hugh Dickins

On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> Can a vma be shared among a few mm's?

Define "shared".

A vma can belong to only one process (mm_struct), but it can be accessed
from another process, as in the rmap case below.

rmap uses anon_vma_lock for anon vmas and i_mmap_rwsem for file vmas to make
sure that the vma will not disappear under it.

> If yes, then taking current->mm->mmap_sem to protect vma is not enough.

Depends on what protection you are talking about.
 
> In the first report below both T378 and T398 take
> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> different locks (the addresses are different).

See i_mmap_lock_read() in T398. It guarantees that the vma is there.

> In the second report T309 doesn't take any locks at all, since it
> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> has no other users, but then it does a write to vma.

This one is tricky. I *assume* the mm is not generally accessible after
mm_users drops to zero, but I'm not entirely sure about that.
procfs? ptrace?

The VMA is still accessible via rmap at this point. And I think it can be
a problem:

		CPU0					CPU1
exit_mmap()
  // mmap_sem is *not* taken
  munlock_vma_pages_all()
    munlock_vma_pages_range()
    						try_to_unmap_one()
						  down_read_trylock(&vma->vm_mm->mmap_sem))
						  !!(vma->vm_flags & VM_LOCKED) == true
      vma->vm_flags &= ~VM_LOCKED;
      <munlock the page>
      						  mlock_vma_page(page);
						  // mlocked page is leaked.

The obvious solution is to take mmap_sem in the exit path, but it would
cause a performance regression.

Any comments?

> 
> ==================================================================
> ThreadSanitizer: data-race in munlock_vma_pages_range
> 
> Write of size 8 by thread T378 (K2633, CPU3):
>  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> Locks held by T378:
> #0 Lock 25710428 taken here:
>  [<     inlined    >] SYSC_munlock mm/mlock.c:650
>  [<ffffffff8121308c>] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1540
>  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> Locks held by T398:
> #0 Lock 21b00c68 taken here:
>  [<     inlined    >] SYSC_munlock mm/mlock.c:650
>  [<ffffffff8121308c>] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> #1 Lock bac2d750 taken here:
>  [<     inlined    >] i_mmap_lock_read include/linux/fs.h:509
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1533
>  [<ffffffff8121e6e8>] rmap_walk+0x78/0x450 mm/rmap.c:1559
>  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> #2 Lock 0895f570 taken here:
>  [<     inlined    >] spin_lock include/linux/spinlock.h:312
>  [<ffffffff8121c959>] __page_check_address+0xd9/0x210 mm/rmap.c:681
>  [<     inlined    >] page_check_address include/linux/rmap.h:204
>  [<ffffffff8121d173>] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1540
>  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [<     inlined    >] SYSC_munlock mm/mlock.c:651
>  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> DBG: addr: ffff880222610e10
> DBG: first offset: 0, second offset: 0
> DBG: T378 clock: {T378: 4486533, T398: 2405850}
> DBG: T398 clock: {T398: 2406009}
> ==================================================================
> 
> ==================================================================
> ThreadSanitizer: data-race in munlock_vma_pages_range
> 
> Write of size 8 by thread T309 (K2577, CPU0):
>  [<ffffffff81211fc9>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inlined    >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
>  [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<     inlined    >] SYSC_exit_group kernel/exit.c:885
>  [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
>  [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> Locks held by T309:
> 
> Previous read of size 8 by thread T293 (K2573, CPU3):
>  [<ffffffff8121cbe8>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1540
>  [<ffffffff8121e207>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inlined    >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
>  [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<     inlined    >] SYSC_exit_group kernel/exit.c:885
>  [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
>  [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> Locks held by T293:
> #0 Lock bb0dc710 taken here:
>  [<     inlined    >] i_mmap_lock_read include/linux/fs.h:509
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1533
>  [<ffffffff8121e138>] rmap_walk+0x78/0x450 mm/rmap.c:1559
>  [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inlined    >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
>  [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<     inlined    >] SYSC_exit_group kernel/exit.c:885
>  [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
>  [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> #1 Lock 02e0f1b0 taken here:
>  [<     inlined    >] spin_lock include/linux/spinlock.h:312
>  [<ffffffff8121c3a9>] __page_check_address+0xd9/0x210 mm/rmap.c:681
>  [<     inlined    >] page_check_address include/linux/rmap.h:204
>  [<ffffffff8121cbc3>] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
>  [<     inlined    >] rmap_walk_file mm/rmap.c:1540
>  [<ffffffff8121e207>] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [<ffffffff8121e9c2>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [<ffffffff81211600>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81211ab6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff812122f0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<     inlined    >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81216cc3>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085685>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inlined    >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c3a7>] do_exit+0x457/0x1420 kernel/exit.c:733
>  [<ffffffff8108f08f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<     inlined    >] SYSC_exit_group kernel/exit.c:885
>  [<ffffffff8108f170>] __wake_up_parent+0x0/0x50 kernel/exit.c:883
>  [<ffffffff81eadb2e>] entry_SYSCALL_64_fastpath+0x12/0x71
> arch/x86/entry/entry_64.S:186
> 
> DBG: addr: ffff8800bb153a78
> DBG: first offset: 0, second offset: 0
> DBG: T309 clock: {T309: 1297809, T293: 747168}
> DBG: T293 clock: {T293: 747528}
> ==================================================================
> 
> On Thu, Sep 10, 2015 at 10:36 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Wed, Sep 09, 2015 at 08:58:26PM -0400, Sasha Levin wrote:
> >> On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> >> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> >> >> > ==================================================================
> >> >> > ThreadSanitizer: data-race in munlock_vma_pages_range
> >> >> >
> >> >> > Write of size 8 by thread T378 (K2633, CPU3):
> >> >> >  [<ffffffff81212579>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> >> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >> >> > arch/x86/entry/entry_64.S:186
> >> > ...
> >> >
> >> >> > Previous read of size 8 by thread T398 (K2623, CPU2):
> >> >> >  [<ffffffff8121d198>] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> >> >> >  [<     inlined    >] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
> >> >> >  [<ffffffff8121e7b7>] rmap_walk+0x147/0x450 mm/rmap.c:1559
> >> >> >  [<ffffffff8121ef72>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> >> >> >  [<ffffffff81211bb0>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> >> >> >  [<ffffffff81212066>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> >> >> >  [<ffffffff812128a0>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> >> >> >  [<ffffffff81212ac9>] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >> >> >  [<ffffffff81212ccc>] do_mlock+0x14c/0x180 mm/mlock.c:589
> >> >> >  [<     inlined    >] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
> >> >> >  [<ffffffff812130b4>] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >> >> >  [<ffffffff81eb352e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >> >> > arch/x86/entry/entry_64.S:186
> >> > Okay, the detected race is mlock/munlock vs. rmap.
> >> >
> >> > On rmap side we check vma->vm_flags in few places without taking
> >> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> >> > or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> >> > us.
> >> >
> >> > In this particular case, speculative check in beginning of
> >> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> >> > the function.
> >>
> >> So you're suggesting that this isn't the cause of the bad page flags
> >> error observed by Andrey and myself?
> >
> > I don't see it, but who knows.
> >
> > --
> >  Kirill A. Shutemov

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-11 10:39           ` Kirill A. Shutemov
@ 2015-09-11 15:29             ` Vlastimil Babka
  2015-09-11 16:08               ` Vlastimil Babka
  2015-09-12  1:27             ` Hugh Dickins
  2015-09-23 15:34             ` Oleg Nesterov
  2 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2015-09-11 15:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrey Konovalov, Oleg Nesterov
  Cc: Sasha Levin, Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Hugh Dickins

On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>> Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
>> In the first report below both T378 and T398 take
>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>> different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
>> In the second report T309 doesn't take any locks at all, since it
>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>> has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?
>
> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
>
> 		CPU0					CPU1
> exit_mmap()
>    // mmap_sem is *not* taken
>    munlock_vma_pages_all()
>      munlock_vma_pages_range()
>      						try_to_unmap_one()
> 						  down_read_trylock(&vma->vm_mm->mmap_sem))
> 						  !!(vma->vm_flags & VM_LOCKED) == true
>        vma->vm_flags &= ~VM_LOCKED;
>        <munlock the page>
>        						  mlock_vma_page(page);
> 						  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

Just so others don't repeat the paths that I already looked at:

- First I thought that try_to_unmap_one() has the page locked and 
munlock_vma_pages_range() will also lock it... but it doesn't.
- Then I thought that exit_mmap() would revisit the page anyway when doing 
the actual unmap. It would, and if it's the last one that has the page 
mapped, it will clear the mlock (see page_remove_rmap()). If it's not the 
last one, the page will be left mlocked. So it won't be completely leaked, 
but still, it will be mlocked when it shouldn't be.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-11 15:29             ` Vlastimil Babka
@ 2015-09-11 16:08               ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2015-09-11 16:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrey Konovalov, Oleg Nesterov
  Cc: Sasha Levin, Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Hugh Dickins

On 09/11/2015 05:29 PM, Vlastimil Babka wrote:
> On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
>> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>>> Can a vma be shared among a few mm's?
>>
>> Define "shared".
>>
>> vma can belong only to one process (mm_struct), but it can be accessed
>> from other process like in rmap case below.
>>
>> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
>> sure that the vma will not disappear under it.
>>
>>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>>
>> Depends on what protection you are talking about.
>>
>>> In the first report below both T378 and T398 take
>>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>>> different locks (the addresses are different).
>>
>> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>>
>>> In the second report T309 doesn't take any locks at all, since it
>>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>>> has no other users, but then it does a write to vma.
>>
>> This one is tricky. I *assume* the mm cannot be generally accessible after
>> mm_users drops to zero, but I'm not entirely sure about it.
>> procfs? ptrace?
>>
>> The VMA is still accessible via rmap at this point. And I think it can be
>> a problem:
>>
>> 		CPU0					CPU1
>> exit_mmap()
>>     // mmap_sem is *not* taken
>>     munlock_vma_pages_all()
>>       munlock_vma_pages_range()
>>       						try_to_unmap_one()
>> 						  down_read_trylock(&vma->vm_mm->mmap_sem))
>> 						  !!(vma->vm_flags & VM_LOCKED) == true
>>         vma->vm_flags &= ~VM_LOCKED;
>>         <munlock the page>
>>         						  mlock_vma_page(page);
>> 						  // mlocked pages is leaked.
>>
>> The obvious solution is to take mmap_sem in exit path, but it would cause
>> performance regression.
>>
>> Any comments?
>
> Just so others don't repeat the paths that I already looked at:
>
> - First I thought that try_to_unmap_one() has the page locked and
> munlock_vma_pages_range() will also lock it... but it doesn't.

More precisely, it does (in __munlock_pagevec()), but 
TestClearPageMlocked(page) doesn't happen under that lock.
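For reference, the semantics of that step — atomically clear the bit and
report whether it was set — can be sketched with C11 atomics (a userspace
model of a test-and-clear flag op, not the kernel's actual
TestClearPageMlocked()); the point is that the operation's own atomicity
says nothing about its ordering against the page lock:

```c
/* Toy model of a test-and-clear page flag: atomically clear PG_MLOCKED
 * and return its previous state.  The op is atomic in itself, but
 * nothing here ties it to any surrounding lock, so another thread can
 * re-set the bit before or after it runs. */
#include <stdatomic.h>
#include <stdbool.h>

#define PG_MLOCKED (1UL << 0)

static bool test_clear_mlocked(atomic_ulong *flags)
{
	unsigned long old = atomic_fetch_and(flags, ~PG_MLOCKED);
	return !!(old & PG_MLOCKED);
}
```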

> - Then I thought that exit_mmap() will revisit the page anyway doing
> actual unmap. It would, if it's the one who has the page mapped, it will
> clear the mlock (see page_remove_rmap()). If it's not the last one, page
> will be left locked. So it won't be completely leaked, but still, it
> will be mlocked when it shouldn't.
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-11 10:39           ` Kirill A. Shutemov
  2015-09-11 15:29             ` Vlastimil Babka
@ 2015-09-12  1:27             ` Hugh Dickins
  2015-09-14 10:16               ` Kirill A. Shutemov
  2015-09-15 17:36               ` Sasha Levin
  2015-09-23 15:34             ` Oleg Nesterov
  2 siblings, 2 replies; 35+ messages in thread
From: Hugh Dickins @ 2015-09-12  1:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrey Konovalov, Oleg Nesterov, Sasha Levin, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka, Hugh Dickins

On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
> 
> Define "shared".
> 
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
> 
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
> 
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> 
> Depends on what protection you are talking about.
>  
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
> 
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> 
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
> 
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Most of the things (including procfs and ptrace) that need to work on
a foreign mm do take a hold on mm_users with get_task_mm().  swapoff
uses atomic_inc_not_zero(&mm->mm_users).  In KSM I managed to get away
with just a hold on the structure itself, atomic_inc(&mm->mm_count),
and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
might like to turn KSM on: it wouldn't be entirely shocking if he were
to discover an anomaly from that).
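The pinning pattern described here — take a reference only if mm_users has
not already dropped to zero — can be sketched in userspace with C11 atomics
(a model of the atomic_inc_not_zero() idiom, not the kernel implementation):

```c
/* Toy model of atomic_inc_not_zero(): increment *users unless it is
 * already zero; returns true on success.  A caller that succeeds knows
 * the exit path (which runs only after the count hits zero) cannot
 * have started. */
#include <stdatomic.h>
#include <stdbool.h>

static bool inc_not_zero(atomic_int *users)
{
	int old = atomic_load(users);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(users, &old, old + 1))
			return true;	/* pinned */
	}
	return false;		/* mm already on its way out; hands off */
}
```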

> 
> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
> 
> 		CPU0					CPU1
> exit_mmap()
>   // mmap_sem is *not* taken
>   munlock_vma_pages_all()
>     munlock_vma_pages_range()
>     						try_to_unmap_one()
> 						  down_read_trylock(&vma->vm_mm->mmap_sem))
> 						  !!(vma->vm_flags & VM_LOCKED) == true
>       vma->vm_flags &= ~VM_LOCKED;
>       <munlock the page>
>       						  mlock_vma_page(page);
> 						  // mlocked pages is leaked.
> 
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
> 
> Any comments?

I'm inclined to echo Vlastimil's comment from earlier in the thread:
sounds like overkill, unless we find something more serious than this.

I'm not sure whether we'd actually see a regression from taking mmap_sem
in the exit path; but given that it's mmap_sem, yes, history tells us
please not to take it any more than we have to.

I do remember wishing, when working out KSM's mm handling, that exit took
mmap_sem: it would have made it simpler, but that wasn't a change I dared
to make.

Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one() 
could fix it?

But if we were to make a bigger change for this VM_LOCKED issue, and
something more serious makes it worth all the effort, I'd say that
what needs to be done is to give mlock/munlock proper locking (haha).

I have not yet looked at your mlocked THP patch (sorry), but when I
was doing the same thing for huge tmpfs, what made it so surprisingly
difficult was all the spongy trylocking, which concealed the rules.

Maybe I'm completely wrong, but I thought a lot of awkwardness might
disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
throughout instead of mmap_sem.

Hugh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-12  1:27             ` Hugh Dickins
@ 2015-09-14 10:16               ` Kirill A. Shutemov
  2015-09-15 17:36               ` Sasha Levin
  1 sibling, 0 replies; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-14 10:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrey Konovalov, Oleg Nesterov, Sasha Levin, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On Fri, Sep 11, 2015 at 06:27:14PM -0700, Hugh Dickins wrote:
> On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> > On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > > Can a vma be shared among a few mm's?
> > 
> > Define "shared".
> > 
> > vma can belong only to one process (mm_struct), but it can be accessed
> > from other process like in rmap case below.
> > 
> > rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> > sure that the vma will not disappear under it.
> > 
> > > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> > 
> > Depends on what protection you are talking about.
> >  
> > > In the first report below both T378 and T398 take
> > > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > > different locks (the addresses are different).
> > 
> > See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> > 
> > > In the second report T309 doesn't take any locks at all, since it
> > > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > > has no other users, but then it does a write to vma.
> > 
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
> 
> Most of the things (including procfs and ptrace) that need to work on
> a foreign mm do take a hold on mm_users with get_task_mm().  swapoff
> uses atomic_inc_not_zero(&mm->mm_users).  In KSM I managed to get away
> with just a hold on the structure itself, atomic_inc(&mm->mm_count),
> and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
> might like to turn KSM on: it wouldn't be entirely shocking if he were
> to discover an anomaly from that).
> 
> > 
> > The VMA is still accessible via rmap at this point. And I think it can be
> > a problem:
> > 
> > 		CPU0					CPU1
> > exit_mmap()
> >   // mmap_sem is *not* taken
> >   munlock_vma_pages_all()
> >     munlock_vma_pages_range()
> >     						try_to_unmap_one()
> > 						  down_read_trylock(&vma->vm_mm->mmap_sem))
> > 						  !!(vma->vm_flags & VM_LOCKED) == true
> >       vma->vm_flags &= ~VM_LOCKED;
> >       <munlock the page>
> >       						  mlock_vma_page(page);
> > 						  // mlocked pages is leaked.
> > 
> > The obvious solution is to take mmap_sem in exit path, but it would cause
> > performance regression.
> > 
> > Any comments?
> 
> I'm inclined to echo Vlastimil's comment from earlier in the thread:
> sounds like an overkill, unless we find something more serious than this.
> 
> I'm not sure whether we'd actually see a regression from taking mmap_sem
> in exit path; but given that it's mmap_sem, yes, history tells us please
> not to take it any more than we have to.
> 
> I do remember wishing, when working out KSM's mm handling, that exit took
> mmap_sem: it would have made it simpler, but that wasn't a change I dared
> to make.
> 
> Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one() 
> could fix it?

I don't see how. It would shift the picture, but wouldn't fix it:
exit_mmap() can still happen after the down_read_trylock() and the
mm_users check. We would only hide the problem.
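A toy model of why the check only narrows the window (hypothetical names;
the key point is that exit_mmap() takes no lock, so nothing orders it
against the rmap side between the check and the use):

```c
/* Userspace sketch: an mm_users check before reading vm_flags does not
 * pin anything, so the exit path can run in between and the rmap side
 * still acts on a stale VM_LOCKED sample.  Not kernel code. */
#include <stdatomic.h>
#include <stdbool.h>

#define VM_LOCKED 0x2000UL

struct toy_mm {
	atomic_int    mm_users;
	unsigned long vm_flags;	/* stand-in for the vma's flags */
};

/* Exit side: runs with no lock held, clears the flag. */
static void toy_exit_mmap(struct toy_mm *mm)
{
	mm->vm_flags &= ~VM_LOCKED;
}

/* rmap side with the proposed check: returns true if it would still
 * mlock the page based on a sample taken before exit ran. */
static bool races_despite_check(struct toy_mm *mm)
{
	if (atomic_load(&mm->mm_users) == 0)
		return false;		/* the check catches an early exit */

	bool saw_locked = !!(mm->vm_flags & VM_LOCKED);

	/* exit_mmap() can run right here: the check pinned nothing. */
	atomic_store(&mm->mm_users, 0);
	toy_exit_mmap(mm);

	return saw_locked;		/* acts on the stale sample anyway */
}
```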

> But if we were to make a bigger change for this VM_LOCKED issue, and
> something more serious makes it worth all the effort, I'd say that
> what needs to be done is to give mlock/munlock proper locking (haha).
> 
> I have not yet looked at your mlocked THP patch (sorry), but when I
> was doing the same thing for huge tmpfs, what made it so surprisingly
> difficult was all the spongy trylocking, which concealed the rules.
> 
> Maybe I'm completely wrong, but I thought a lot of awkwardness might
> disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
> throughout instead of mmap_sem.

This could be helpful. But the risk is a scalability regression on another
front: long anon_vma chains or highly shared files.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-12  1:27             ` Hugh Dickins
  2015-09-14 10:16               ` Kirill A. Shutemov
@ 2015-09-15 17:36               ` Sasha Levin
  2015-09-15 19:01                 ` Kirill A. Shutemov
                                   ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Sasha Levin @ 2015-09-15 17:36 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrey Konovalov, Oleg Nesterov, Rik van Riel, Andrew Morton,
	Dmitry Vyukov, linux-mm, linux-kernel, Vlastimil Babka

On 09/11/2015 09:27 PM, Hugh Dickins wrote:
> I'm inclined to echo Vlastimil's comment from earlier in the thread:
> sounds like an overkill, unless we find something more serious than this.

I've modified my tests to stress the exit path of processes with many vmas,
and hit the following NULL ptr deref (not sure if it's related to the original issue):

[1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[1181047.937223] Modules linked in:
[1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
[1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
[1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
[1181047.941842] RSP: 0000:ffff880433f078a8  EFLAGS: 00010206
[1181047.942383] RAX: dffffc0000000000 RBX: ffff88041acd000a RCX: ffffffffffffffff
[1181047.943091] RDX: 0000000000000099 RSI: ffff88041acd000a RDI: 00000000000004c8
[1181047.943889] RBP: ffff880433f078d8 R08: ffff880415c59c58 R09: 0000000015c59e01
[1181047.944604] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffffffff
[1181047.944833] pps pps0: PPS event at 21837.866101174
[1181047.944838] pps pps0: capture assert seq #7188
[1181047.946261] R13: 0000000000000000 R14: ffff880433f07910 R15: 0000000000002e0d
[1181047.947005] FS:  0000000000000000(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
[1181047.947779] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1181047.948361] CR2: 000000000097df90 CR3: 000000044e08c000 CR4: 00000000000006a0
[1181047.949085] Stack:
[1181047.949350]  0000000000000000 ffff880433f07910 1ffff100867e0f1e dffffc0000000000
[1181047.950164]  ffff8801d825d000 0000000000002e0d ffff880433f079d0 ffffffff9276c4ab
[1181047.951070]  ffff88041acd000a 0000000041b58ab3 ffffffff9ecd1a43 ffffffff9276c2a0
[1181047.951906] Call Trace:
[1181047.952201] exit_mmap (mm/mmap.c:2856)
[1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826)
[1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169)
[1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109)
[1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746)
[1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169)
[1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702)
[1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733)
[1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298)
[1181047.958005] ? mm_update_next_owner (kernel/exit.c:654)
[1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249)
[1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298)
[1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
[1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641)
[1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
[1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859)
[1181047.965569] get_signal (kernel/signal.c:2353)
[1181047.966430] do_signal (arch/x86/kernel/signal.c:709)
[1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821)
[1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407)
[1181047.969126] ? vfs_write (fs/read_write.c:777)
[1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706)
[1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
[1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199)
[1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227)
[1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18)
[1181047.975480] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251)
[1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318)
[1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282)
[1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00
All code
========
   0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
   6:   85 22                   test   %esp,(%rdx)
   8:   01 00                   add    %eax,(%rax)
   a:   00 48 8b                add    %cl,-0x75(%rax)
   d:   43                      rex.XB
   e:   40                      rex
   f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
  16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
  1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  21:   fc ff df
  24:   48 89 fa                mov    %rdi,%rdx
  27:   48 c1 ea 03             shr    $0x3,%rdx
  2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
  2f:   0f 85 ee 00 00 00       jne    0x123
  35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
  39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
  40:   00

Code starting with the faulting instruction
===========================================
   0:   80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
   4:   0f 85 ee 00 00 00       jne    0xf8
   a:   48 8b 45 d0             mov    -0x30(%rbp),%rax
   e:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
  15:   00
[1181047.981417] RIP unmap_vmas (mm/memory.c:1337)
[1181047.982011]  RSP <ffff880433f078a8>


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-15 17:36               ` Sasha Levin
@ 2015-09-15 19:01                 ` Kirill A. Shutemov
  2015-09-22 16:47                   ` Andrey Konovalov
  2015-09-23 21:30                 ` Sasha Levin
  2015-09-24 13:11                 ` Oleg Nesterov
  2 siblings, 1 reply; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-15 19:01 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Andrey Konovalov, Oleg Nesterov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote:
> On 09/11/2015 09:27 PM, Hugh Dickins wrote:
> > I'm inclined to echo Vlastimil's comment from earlier in the thread:
> > sounds like an overkill, unless we find something more serious than this.
> 
> I've modified my tests to stress the exit path of processes with many vmas,

Could you share the test?

> and hit the following NULL ptr deref (not sure if it's related to the original issue):
> 
> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [1181047.937223] Modules linked in:
> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
> [1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)

Is it "struct mm_struct *mm = vma->vm_mm;"?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-15 19:01                 ` Kirill A. Shutemov
@ 2015-09-22 16:47                   ` Andrey Konovalov
  2015-09-22 18:54                     ` Hugh Dickins
  0 siblings, 1 reply; 35+ messages in thread
From: Andrey Konovalov @ 2015-09-22 16:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sasha Levin, Hugh Dickins, Oleg Nesterov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

If anybody comes up with a patch to fix the original issue I easily
can test it, since I'm hitting "BUG: Bad page state" in a second when
fuzzing with KTSAN and Trinity.

On Tue, Sep 15, 2015 at 9:01 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote:
>> On 09/11/2015 09:27 PM, Hugh Dickins wrote:
>> > I'm inclined to echo Vlastimil's comment from earlier in the thread:
>> > sounds like an overkill, unless we find something more serious than this.
>>
>> I've modified my tests to stress the exit path of processes with many vmas,
>
> Could you share the test?
>
>> and hit the following NULL ptr deref (not sure if it's related to the original issue):
>>
>> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>> [1181047.937223] Modules linked in:
>> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
>> [1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
>> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
>
> Is it "struct mm_struct *mm = vma->vm_mm;"?
>
> --
>  Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-22 16:47                   ` Andrey Konovalov
@ 2015-09-22 18:54                     ` Hugh Dickins
  2015-09-22 19:45                       ` Andrey Konovalov
  0 siblings, 1 reply; 35+ messages in thread
From: Hugh Dickins @ 2015-09-22 18:54 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Kirill A. Shutemov, Sasha Levin, Hugh Dickins, Oleg Nesterov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka

On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> If anybody comes up with a patch to fix the original issue I easily
> can test it, since I'm hitting "BUG: Bad page state" in a second when
> fuzzing with KTSAN and Trinity.

This "BUG: Bad page state" sounds more serious, but I cannot track down
your report of it: please repost - thanks - though on seeing it, I may
well end up with no ideas.

Hugh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-22 18:54                     ` Hugh Dickins
@ 2015-09-22 19:45                       ` Andrey Konovalov
  2015-09-23  1:39                         ` Hugh Dickins
  0 siblings, 1 reply; 35+ messages in thread
From: Andrey Konovalov @ 2015-09-22 19:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Sasha Levin, Oleg Nesterov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 22 Sep 2015, Andrey Konovalov wrote:
>> If anybody comes up with a patch to fix the original issue I easily
>> can test it, since I'm hitting "BUG: Bad page state" in a second when
>> fuzzing with KTSAN and Trinity.
>
> This "BUG: Bad page state" sounds more serious, but I cannot track down
> your report of it: please repost - thanks - though on seeing it, I may
> well end up with no ideas.

The report is below.

I get it after a few seconds of running Trinity on a kernel with KTSAN
and targeting mlock, munlock and madvise syscalls.
Sasha also observed a very similar crash a while ago
(https://lkml.org/lkml/2014/11/6/1055).
I didn't manage to reproduce this in a kernel build without KTSAN though.
The idea was that data races KTSAN reports might be the explanation of
these crashes.

BUG: Bad page state in process trinity-c15  pfn:281999
page:ffffea000a066640 count:0 mapcount:0 mapping:          (null) index:0xd
flags: 0x20000000028000c(referenced|uptodate|swapbacked|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x200000(mlocked)
Modules linked in:
CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffffffff821c3b70 0000000000000000 0000000100004741 ffff8800b857f948
 ffffffff81e9926c 0000000000000003 ffffea000a066640 ffff8800b857f978
 ffffffff811ce045 ffffffff821c3b70 ffffea000a066640 0000000000000001
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81e9926c>] dump_stack+0x63/0x81 lib/dump_stack.c:50
 [<ffffffff811ce045>] bad_page+0x115/0x1a0 mm/page_alloc.c:409
 [<     inline     >] free_pages_check mm/page_alloc.c:731
 [<ffffffff811cf3b8>] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922
 [<ffffffff811d2911>] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908
 [<ffffffff811d2bcf>] free_hot_cold_page_list+0x5f/0x100
mm/page_alloc.c:1956 (discriminator 3)
 [<ffffffff811dd9c1>] release_pages+0x151/0x300 mm/swap.c:967
 [<ffffffff811de723>] __pagevec_release+0x43/0x60 mm/swap.c:984
 [<     inline     >] pagevec_release include/linux/pagevec.h:69
 [<ffffffff811ef36a>] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446
 [<ffffffff811ef86f>] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540
 [<ffffffff811f15d5>] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086
 [<ffffffff812568d0>] vfs_fallocate+0x1e0/0x310 fs/open.c:303
 [<     inline     >] madvise_remove mm/madvise.c:326
 [<     inline     >] madvise_vma mm/madvise.c:378
 [<     inline     >] SYSC_madvise mm/madvise.c:528
 [<ffffffff81225548>] SyS_madvise+0x378/0x760 mm/madvise.c:459
 [<ffffffff8124ef36>] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161
 [<ffffffff81ea8691>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188
Disabling lock debugging due to kernel taint

>
> Hugh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-22 19:45                       ` Andrey Konovalov
@ 2015-09-23  1:39                         ` Hugh Dickins
  2015-09-23 11:46                           ` Kirill A. Shutemov
  2015-09-23 13:08                           ` Andrey Konovalov
  0 siblings, 2 replies; 35+ messages in thread
From: Hugh Dickins @ 2015-09-23  1:39 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Hugh Dickins, Kirill A. Shutemov, Davidlohr Bueso, Sasha Levin,
	Oleg Nesterov, Rik van Riel, Andrew Morton, Dmitry Vyukov,
	linux-mm, linux-kernel, Vlastimil Babka

On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> >> If anybody comes up with a patch to fix the original issue I easily
> >> can test it, since I'm hitting "BUG: Bad page state" in a second when
> >> fuzzing with KTSAN and Trinity.
> >
> > This "BUG: Bad page state" sounds more serious, but I cannot track down
> > your report of it: please repost - thanks - though on seeing it, I may
> > well end up with no ideas.
> 
> The report is below.

Thanks.

> 
> I get it after a few seconds of running Trinity on a kernel with KTSAN
> and targeting mlock, munlock and madvise syscalls.
> Sasha also observed a very similar crash a while ago
> (https://lkml.org/lkml/2014/11/6/1055).
> I didn't manage to reproduce this in a kernel build without KTSAN though.
> The idea was that data races KTSAN reports might be the explanation of
> these crashes.
> 
> BUG: Bad page state in process trinity-c15  pfn:281999
> page:ffffea000a066640 count:0 mapcount:0 mapping:          (null) index:0xd
> flags: 0x20000000028000c(referenced|uptodate|swapbacked|mlocked)
> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> bad because of flags:
> flags: 0x200000(mlocked)
> Modules linked in:
> CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  ffffffff821c3b70 0000000000000000 0000000100004741 ffff8800b857f948
>  ffffffff81e9926c 0000000000000003 ffffea000a066640 ffff8800b857f978
>  ffffffff811ce045 ffffffff821c3b70 ffffea000a066640 0000000000000001
> Call Trace:
>  [<     inline     >] __dump_stack lib/dump_stack.c:15
>  [<ffffffff81e9926c>] dump_stack+0x63/0x81 lib/dump_stack.c:50
>  [<ffffffff811ce045>] bad_page+0x115/0x1a0 mm/page_alloc.c:409
>  [<     inline     >] free_pages_check mm/page_alloc.c:731
>  [<ffffffff811cf3b8>] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922
>  [<ffffffff811d2911>] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908
>  [<ffffffff811d2bcf>] free_hot_cold_page_list+0x5f/0x100
> mm/page_alloc.c:1956 (discriminator 3)
>  [<ffffffff811dd9c1>] release_pages+0x151/0x300 mm/swap.c:967
>  [<ffffffff811de723>] __pagevec_release+0x43/0x60 mm/swap.c:984
>  [<     inline     >] pagevec_release include/linux/pagevec.h:69
>  [<ffffffff811ef36a>] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446
>  [<ffffffff811ef86f>] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540
>  [<ffffffff811f15d5>] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086
>  [<ffffffff812568d0>] vfs_fallocate+0x1e0/0x310 fs/open.c:303
>  [<     inline     >] madvise_remove mm/madvise.c:326
>  [<     inline     >] madvise_vma mm/madvise.c:378
>  [<     inline     >] SYSC_madvise mm/madvise.c:528
>  [<ffffffff81225548>] SyS_madvise+0x378/0x760 mm/madvise.c:459
>  [<ffffffff8124ef36>] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161
>  [<ffffffff81ea8691>] entry_SYSCALL_64_fastpath+0x31/0x95
> arch/x86/entry/entry_64.S:188
> Disabling lock debugging due to kernel taint

This is totally untested, and one of you may quickly prove me wrong;
but I went in to fix your "Bad page state (mlocked)" by holding pte
lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
then couldn't see why it would need mmap_sem at all, given how mlock
and munlock first assert intention by setting or clearing VM_LOCKED
in vm_flags, then work their way up the vma, taking pte locks.

Calling mlock_vma_page() under pte lock may look suspicious
at first: but what it does is similar to clear_page_mlock(),
which we regularly call under pte lock from page_remove_rmap().

I'd rather wait to hear whether this appears to work in practice,
and whether you agree that it should work in theory, before writing
the proper description.  I'd love to lose that down_read_trylock.

You mention how Sasha hit the "Bad page state (mlocked)" back in
November: that was one of the reasons we reverted Davidlohr's
i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(),
without understanding why it was needed.  Yes, it would lock out
a concurrent try_to_unmap(), whose setting of PageMlocked was not
sufficiently serialized by the down_read_trylock of mmap_sem.

But I don't remember the other reasons for that revert (and
haven't looked very hard as yet): anyone else remember?

Not-yet-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/rmap.c |   32 +++++++-------------------------
 1 file changed, 7 insertions(+), 25 deletions(-)

--- 4.3-rc2/mm/rmap.c	2015-09-12 18:30:20.857039763 -0700
+++ linux/mm/rmap.c	2015-09-22 17:47:43.489096676 -0700
@@ -1314,9 +1314,12 @@ static int try_to_unmap_one(struct page
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED)
-			goto out_mlock;
-
+		if (vma->vm_flags & VM_LOCKED) {
+	 		/* Holding pte lock, we do *not* need mmap_sem here */
+			mlock_vma_page(page);
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
 		if (flags & TTU_MUNLOCK)
 			goto out_unmap;
 	}
@@ -1419,31 +1422,10 @@ static int try_to_unmap_one(struct page
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
+	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
-
-out_mlock:
-	pte_unmap_unlock(pte, ptl);
-
-
-	/*
-	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
-	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
-	 * if trylock failed, the page remain in evictable lru and later
-	 * vmscan could retry to move the page to unevictable lru if the
-	 * page is actually mlocked.
-	 */
-	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
-		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
-			ret = SWAP_MLOCK;
-		}
-		up_read(&vma->vm_mm->mmap_sem);
-	}
-	return ret;
 }
 
 bool is_vma_temporary_stack(struct vm_area_struct *vma)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-23  1:39                         ` Hugh Dickins
@ 2015-09-23 11:46                           ` Kirill A. Shutemov
  2015-09-23 22:58                             ` Davidlohr Bueso
  2015-09-23 13:08                           ` Andrey Konovalov
  1 sibling, 1 reply; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-23 11:46 UTC (permalink / raw)
  To: Hugh Dickins, Davidlohr Bueso
  Cc: Andrey Konovalov, Sasha Levin, Oleg Nesterov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote:
> On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins <hughd@google.com> wrote:
> > > On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > >> If anybody comes up with a patch to fix the original issue I easily
> > >> can test it, since I'm hitting "BUG: Bad page state" in a second when
> > >> fuzzing with KTSAN and Trinity.
> > >
> > > This "BUG: Bad page state" sounds more serious, but I cannot track down
> > > your report of it: please repost - thanks - though on seeing it, I may
> > > well end up with no ideas.
> > 
> > The report is below.
> 
> Thanks.
> 
> > 
> > I get it after a few seconds of running Trinity on a kernel with KTSAN
> > and targeting mlock, munlock and madvise syscalls.
> > Sasha also observed a very similar crash a while ago
> > (https://lkml.org/lkml/2014/11/6/1055).
> > I didn't manage to reproduce this in a kernel build without KTSAN though.
> > The idea was that data races KTSAN reports might be the explanation of
> > these crashes.
> > 
> > BUG: Bad page state in process trinity-c15  pfn:281999
> > page:ffffea000a066640 count:0 mapcount:0 mapping:          (null) index:0xd
> > flags: 0x20000000028000c(referenced|uptodate|swapbacked|mlocked)
> > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > bad because of flags:
> > flags: 0x200000(mlocked)
> > Modules linked in:
> > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >  ffffffff821c3b70 0000000000000000 0000000100004741 ffff8800b857f948
> >  ffffffff81e9926c 0000000000000003 ffffea000a066640 ffff8800b857f978
> >  ffffffff811ce045 ffffffff821c3b70 ffffea000a066640 0000000000000001
> > Call Trace:
> >  [<     inline     >] __dump_stack lib/dump_stack.c:15
> >  [<ffffffff81e9926c>] dump_stack+0x63/0x81 lib/dump_stack.c:50
> >  [<ffffffff811ce045>] bad_page+0x115/0x1a0 mm/page_alloc.c:409
> >  [<     inline     >] free_pages_check mm/page_alloc.c:731
> >  [<ffffffff811cf3b8>] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922
> >  [<ffffffff811d2911>] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908
> >  [<ffffffff811d2bcf>] free_hot_cold_page_list+0x5f/0x100
> > mm/page_alloc.c:1956 (discriminator 3)
> >  [<ffffffff811dd9c1>] release_pages+0x151/0x300 mm/swap.c:967
> >  [<ffffffff811de723>] __pagevec_release+0x43/0x60 mm/swap.c:984
> >  [<     inline     >] pagevec_release include/linux/pagevec.h:69
> >  [<ffffffff811ef36a>] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446
> >  [<ffffffff811ef86f>] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540
> >  [<ffffffff811f15d5>] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086
> >  [<ffffffff812568d0>] vfs_fallocate+0x1e0/0x310 fs/open.c:303
> >  [<     inline     >] madvise_remove mm/madvise.c:326
> >  [<     inline     >] madvise_vma mm/madvise.c:378
> >  [<     inline     >] SYSC_madvise mm/madvise.c:528
> >  [<ffffffff81225548>] SyS_madvise+0x378/0x760 mm/madvise.c:459
> >  [<ffffffff8124ef36>] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161
> >  [<ffffffff81ea8691>] entry_SYSCALL_64_fastpath+0x31/0x95
> > arch/x86/entry/entry_64.S:188
> > Disabling lock debugging due to kernel taint
> 
> This is totally untested, and one of you may quickly prove me wrong;
> but I went in to fix your "Bad page state (mlocked)" by holding pte
> lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> then couldn't see why it would need mmap_sem at all, given how mlock
> and munlock first assert intention by setting or clearing VM_LOCKED
> in vm_flags, then work their way up the vma, taking pte locks.
> 
> Calling mlock_vma_page() under pte lock may look suspicious
> at first: but what it does is similar to clear_page_mlock(),
> which we regularly call under pte lock from page_remove_rmap().

Indeed. Looks fishy. But probably will work.

It was not obvious to me what makes the clearing of VM_LOCKED visible in
try_to_unmap_one() without mmap_sem.

After looking some more it seems we would rely on page_check_address()
in try_to_unmap_one() having acquire semantics and follow_page_mask() in
munlock_vma_pages_range() having release semantics.

But I would prefer to have something more explicit.

That's a mess.

> 
> I'd rather wait to hear whether this appears to work in practice,
> and whether you agree that it should work in theory, before writing
> the proper description.  I'd love to lose that down_read_trylock.
> 
> You mention how Sasha hit the "Bad page state (mlocked)" back in
> November: that was one of the reasons we reverted Davidlohr's
> i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(),
> without understanding why it was needed.  Yes, it would lock out
> a concurrent try_to_unmap(), whose setting of PageMlocked was not
> sufficiently serialized by the down_read_trylock of mmap_sem.
> 
> But I don't remember the other reasons for that revert (and
> haven't looked very hard as yet): anyone else remember?

I hoped Davidlohr would come back with something after the revert, but it
never happened. I think the reverted patch was responsible for most of the
scalability boost from the rwsem conversion of i_mmap_lock...

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-23  1:39                         ` Hugh Dickins
  2015-09-23 11:46                           ` Kirill A. Shutemov
@ 2015-09-23 13:08                           ` Andrey Konovalov
  2015-09-24  0:42                             ` Sasha Levin
  2015-10-13 22:33                             ` Hugh Dickins
  1 sibling, 2 replies; 35+ messages in thread
From: Andrey Konovalov @ 2015-09-23 13:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Davidlohr Bueso, Sasha Levin, Oleg Nesterov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka

On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins <hughd@google.com> wrote:
> This is totally untested, and one of you may quickly prove me wrong;
> but I went in to fix your "Bad page state (mlocked)" by holding pte
> lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> then couldn't see why it would need mmap_sem at all, given how mlock
> and munlock first assert intention by setting or clearing VM_LOCKED
> in vm_flags, then work their way up the vma, taking pte locks.
>
> Calling mlock_vma_page() under pte lock may look suspicious
> at first: but what it does is similar to clear_page_mlock(),
> which we regularly call under pte lock from page_remove_rmap().
>
> I'd rather wait to hear whether this appears to work in practice,
> and whether you agree that it should work in theory, before writing
> the proper description.  I'd love to lose that down_read_trylock.

No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".

It seems that your patch doesn't fix the race from the report below, since pte
lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
is being executed. (Line numbers are from kernel with your patch applied.)

===
ThreadSanitizer: data-race in munlock_vma_pages_range

Write at 0xffff880282a93290 of size 8 by thread 2546 on CPU 2:
 [<ffffffff81211009>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [<     inline     >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inline     >] exit_mm kernel/exit.c:437
 [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
 [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
 [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
 [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
 [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329

Previous read at 0xffff880282a93290 of size 8 by thread 2545 on CPU 1:
 [<ffffffff8121bc1a>] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
 [<     inline     >] rmap_walk_file mm/rmap.c:1522
 [<ffffffff8121d1a7>] rmap_walk+0x147/0x450 mm/rmap.c:1541
 [<ffffffff8121d962>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
 [<ffffffff81210640>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [<ffffffff81210af6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [<ffffffff81211330>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [<     inline     >] munlock_vma_pages_all mm/internal.h:252
 [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
 [<     inline     >] exit_mm kernel/exit.c:437
 [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
 [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
 [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
 [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
 [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
===


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-11 10:39           ` Kirill A. Shutemov
  2015-09-11 15:29             ` Vlastimil Babka
  2015-09-12  1:27             ` Hugh Dickins
@ 2015-09-23 15:34             ` Oleg Nesterov
  2015-09-23 15:38               ` Oleg Nesterov
  2 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-23 15:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrey Konovalov, Sasha Levin, Rik van Riel, Andrew Morton,
	Dmitry Vyukov, linux-mm, linux-kernel, Vlastimil Babka,
	Hugh Dickins

On 09/11, Kirill A. Shutemov wrote:
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Well, all I can say is that proc/ptrace look fine afaics...

This is off-topic, but how about the patch below? Different threads can
expand different vma's at the same time under read_lock(mmap_sem), so
vma_lock_anon_vma() can't help to serialize "locked_vm += grow".

Oleg.

--- x/mm/mmap.c
+++ x/mm/mmap.c
@@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a
 	if (security_vm_enough_memory_mm(mm, grow))
 		return -ENOMEM;
 
-	/* Ok, everything looks good - let it rip */
-	if (vma->vm_flags & VM_LOCKED)
-		mm->locked_vm += grow;
 	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
 	return 0;
 }
@@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					mm->locked_vm += grow;
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				anon_vma_interval_tree_post_update_vma(vma);
@@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					mm->locked_vm += grow;
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-23 15:34             ` Oleg Nesterov
@ 2015-09-23 15:38               ` Oleg Nesterov
  0 siblings, 0 replies; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-23 15:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrey Konovalov, Sasha Levin, Rik van Riel, Andrew Morton,
	Dmitry Vyukov, linux-mm, linux-kernel, Vlastimil Babka,
	Hugh Dickins

On 09/23, Oleg Nesterov wrote:
>
> On 09/11, Kirill A. Shutemov wrote:
> >
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
>
> Well, all I can say is that proc/ptrace look fine afaics...
>
> This is off-topic, but how about the patch below? Different threads can
> expand different vma's at the same time under read_lock(mmap_sem), so
> vma_lock_anon_vma() can't help to serialize "locked_vm += grow".

perhaps vm_stat_account() should be moved too, but total_vm/etc is less
important.

Or I missed something?

Oleg.

> --- x/mm/mmap.c
> +++ x/mm/mmap.c
> @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a
>  	if (security_vm_enough_memory_mm(mm, grow))
>  		return -ENOMEM;
>  
> -	/* Ok, everything looks good - let it rip */
> -	if (vma->vm_flags & VM_LOCKED)
> -		mm->locked_vm += grow;
>  	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
>  	return 0;
>  }
> @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct
>  				 * against concurrent vma expansions.
>  				 */
>  				spin_lock(&vma->vm_mm->page_table_lock);
> +				if (vma->vm_flags & VM_LOCKED)
> +					mm->locked_vm += grow;
>  				anon_vma_interval_tree_pre_update_vma(vma);
>  				vma->vm_end = address;
>  				anon_vma_interval_tree_post_update_vma(vma);
> @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru
>  				 * against concurrent vma expansions.
>  				 */
>  				spin_lock(&vma->vm_mm->page_table_lock);
> +				if (vma->vm_flags & VM_LOCKED)
> +					mm->locked_vm += grow;
>  				anon_vma_interval_tree_pre_update_vma(vma);
>  				vma->vm_start = address;
>  				vma->vm_pgoff -= grow;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-15 17:36               ` Sasha Levin
  2015-09-15 19:01                 ` Kirill A. Shutemov
@ 2015-09-23 21:30                 ` Sasha Levin
  2015-09-25 14:26                   ` Oleg Nesterov
  2015-09-24 13:11                 ` Oleg Nesterov
  2 siblings, 1 reply; 35+ messages in thread
From: Sasha Levin @ 2015-09-23 21:30 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrey Konovalov, Oleg Nesterov, Rik van Riel, Andrew Morton,
	Dmitry Vyukov, linux-mm, linux-kernel, Vlastimil Babka

On 09/15/2015 01:36 PM, Sasha Levin wrote:
> On 09/11/2015 09:27 PM, Hugh Dickins wrote:
>> > I'm inclined to echo Vlastimil's comment from earlier in the thread:
>> > sounds like an overkill, unless we find something more serious than this.
> I've modified my tests to stress the exit path of processes with many vmas,
> and hit the following NULL ptr deref (not sure if it's related to the original issue):
> 
> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [1181047.937223] Modules linked in:
> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
> [1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
> [1181047.941842] RSP: 0000:ffff880433f078a8  EFLAGS: 00010206
> [1181047.942383] RAX: dffffc0000000000 RBX: ffff88041acd000a RCX: ffffffffffffffff
> [1181047.943091] RDX: 0000000000000099 RSI: ffff88041acd000a RDI: 00000000000004c8
> [1181047.943889] RBP: ffff880433f078d8 R08: ffff880415c59c58 R09: 0000000015c59e01
> [1181047.944604] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffffffff
> [1181047.944833] pps pps0: PPS event at 21837.866101174
> [1181047.944838] pps pps0: capture assert seq #7188
> [1181047.946261] R13: 0000000000000000 R14: ffff880433f07910 R15: 0000000000002e0d
> [1181047.947005] FS:  0000000000000000(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
> [1181047.947779] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [1181047.948361] CR2: 000000000097df90 CR3: 000000044e08c000 CR4: 00000000000006a0
> [1181047.949085] Stack:
> [1181047.949350]  0000000000000000 ffff880433f07910 1ffff100867e0f1e dffffc0000000000
> [1181047.950164]  ffff8801d825d000 0000000000002e0d ffff880433f079d0 ffffffff9276c4ab
> [1181047.951070]  ffff88041acd000a 0000000041b58ab3 ffffffff9ecd1a43 ffffffff9276c2a0
> [1181047.951906] Call Trace:
> [1181047.952201] exit_mmap (mm/mmap.c:2856)
> [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826)
> [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169)
> [1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109)
> [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746)
> [1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169)
> [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702)
> [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733)
> [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298)
> [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654)
> [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249)
> [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298)
> [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
> [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641)
> [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
> [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859)
> [1181047.965569] get_signal (kernel/signal.c:2353)
> [1181047.966430] do_signal (arch/x86/kernel/signal.c:709)
> [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821)
> [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407)
> [1181047.969126] ? vfs_write (fs/read_write.c:777)
> [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706)
> [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
> [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199)
> [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227)
> [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18)
> [1181047.975480] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251)
> [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318)
> [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282)
> [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00
> All code
> ========
>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>    6:   85 22                   test   %esp,(%rdx)
>    8:   01 00                   add    %eax,(%rax)
>    a:   00 48 8b                add    %cl,-0x75(%rax)
>    d:   43                      rex.XB
>    e:   40                      rex
>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>   21:   fc ff df
>   24:   48 89 fa                mov    %rdi,%rdx
>   27:   48 c1 ea 03             shr    $0x3,%rdx
>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
>   2f:   0f 85 ee 00 00 00       jne    0x123
>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>   40:   00
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
>    4:   0f 85 ee 00 00 00       jne    0xf8
>    a:   48 8b 45 d0             mov    -0x30(%rbp),%rax
>    e:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>   15:   00
> [1181047.981417] RIP unmap_vmas (mm/memory.c:1337)
> [1181047.982011]  RSP <ffff880433f078a8>

Another similar trace where we see a problem during process exit:

[1922964.887922] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[1922964.890234] Modules linked in:
[1922964.890844] CPU: 1 PID: 21477 Comm: trinity-c161 Tainted: G        W       4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
[1922964.892584] task: ffff880251858000 ti: ffff88009f258000 task.ti: ffff88009f258000
[1922964.893723] RIP: acct_collect (kernel/acct.c:542)
[1922964.895105] RSP: 0000:ffff88009f25f908  EFLAGS: 00010207
[1922964.895935] RAX: dffffc0000000000 RBX: 2ce0ffffffffffff RCX: 0000000000000000
[1922964.897008] RDX: ffff2152b153ffff RSI: 059c200000000000 RDI: 2ce1000000000007
[1922964.898091] RBP: ffff88009f25f9e8 R08: 0000000000000001 R09: 00000000000003ef
[1922964.899178] R10: ffffed014d7a3a01 R11: 0000000000000001 R12: ffff880082b485c0
[1922964.901643] R13: ffff2152b153ffff R14: 1ffff10013e4bf24 R15: ffff88009f25f9c0
[1922964.904396] FS:  00007f680d3f2700(0000) GS:ffff880108000000(0000) knlGS:0000000000000000
[1922964.907484] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1922964.908489] CR2: 00007f680b7a3000 CR3: 00000000af748000 CR4: 00000000000006a0
[1922964.909626] Stack:
[1922964.910009]  dffffc0000000000 ffff880251858000 0000000000000009 0000000041b58ab3
[1922964.913945]  ffffffffa2ee6c6a ffffffff96561dc0 0000000041b58ab3 ffffffffa2ee6c6a
[1922964.915212]  0000000041b58ab3 ffffffffa2edfbb4 ffffffff9650ce50 0000000041b58ab3
[1922964.916489] Call Trace:
[1922964.916951] ? acct_exit_ns (kernel/acct.c:531)
[1922964.917859] ? hrtimer_get_remaining (kernel/time/hrtimer.c:1028)
[1922964.918895] ? get_signal (kernel/signal.c:2384)
[1922964.919821] ? creds_are_invalid (kernel/cred.c:709)
[1922964.921177] do_exit (kernel/exit.c:727)
[1922964.922022] ? lockdep_reset_lock (kernel/locking/lockdep.c:3105)
[1922964.923009] ? mm_update_next_owner (kernel/exit.c:654)
[1922964.924099] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[1922964.925237] ? lockdep_init (kernel/locking/lockdep.c:3298)
[1922964.926157] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
[1922964.927223] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859)
[1922964.928111] get_signal (kernel/signal.c:2307)
[1922964.929021] do_signal (arch/x86/kernel/signal.c:709)
[1922964.929902] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821)
[1922964.930962] ? __generic_file_write_iter (mm/filemap.c:2684)
[1922964.932119] ? vfs_write (fs/read_write.c:777)
[1922964.933052] ? setup_sigcontext (arch/x86/kernel/signal.c:706)
[1922964.934061] ? __raw_callee_save___pv_queued_spin_unlock (??:?)
[1922964.935261] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[1922964.936285] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[1922964.937575] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199)
[1922964.938563] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227)
[1922964.939463] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[1922964.940596] prepare_exit_to_usermode (arch/x86/entry/common.c:251)
[1922964.941645] syscall_return_slowpath (arch/x86/entry/common.c:318)
[1922964.942715] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282)
[1922964.943706] Code: 02 00 0f 85 9d 05 00 00 48 8b 1b 48 85 db 0f 84 7b 05 00 00 48 b8 00 00 00 00 00 fc ff df 31 d2 48 8d 7b 08 48 89 fe 48 c1 ee 03 <80> 3c 06 00 0f 85 02 06 00 00 48 89 de 4c 8b 6b 08 48 c1 ee 03
All code
========
   0:   02 00                   add    (%rax),%al
   2:   0f 85 9d 05 00 00       jne    0x5a5
   8:   48 8b 1b                mov    (%rbx),%rbx
   b:   48 85 db                test   %rbx,%rbx
   e:   0f 84 7b 05 00 00       je     0x58f
  14:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  1b:   fc ff df
  1e:   31 d2                   xor    %edx,%edx
  20:   48 8d 7b 08             lea    0x8(%rbx),%rdi
  24:   48 89 fe                mov    %rdi,%rsi
  27:   48 c1 ee 03             shr    $0x3,%rsi
  2b:*  80 3c 06 00             cmpb   $0x0,(%rsi,%rax,1)               <-- trapping instruction
  2f:   0f 85 02 06 00 00       jne    0x637
  35:   48 89 de                mov    %rbx,%rsi
  38:   4c 8b 6b 08             mov    0x8(%rbx),%r13
  3c:   48 c1 ee 03             shr    $0x3,%rsi
        ...

Code starting with the faulting instruction
===========================================
   0:   80 3c 06 00             cmpb   $0x0,(%rsi,%rax,1)
   4:   0f 85 02 06 00 00       jne    0x60c
   a:   48 89 de                mov    %rbx,%rsi
   d:   4c 8b 6b 08             mov    0x8(%rbx),%r13
  11:   48 c1 ee 03             shr    $0x3,%rsi
        ...
[1922964.948548] RIP acct_collect (kernel/acct.c:542)
[1922964.949552]  RSP <ffff88009f25f908>


Thanks,
Sasha


* Re: Multiple potential races on vma->vm_flags
  2015-09-23 11:46                           ` Kirill A. Shutemov
@ 2015-09-23 22:58                             ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2015-09-23 22:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrey Konovalov, Sasha Levin, Oleg Nesterov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka

On Wed, 23 Sep 2015, Kirill A. Shutemov wrote:

>On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote:
>>[...]
>> I'd rather wait to hear whether this appears to work in practice,
>> and whether you agree that it should work in theory, before writing
>> the proper description.  I'd love to lose that down_read_trylock.
>>
>> You mention how Sasha hit the "Bad page state (mlocked)" back in
>> November: that was one of the reasons we reverted Davidlohr's
>> i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(),
>> without understanding why it was needed.  Yes, it would lock out
>> a concurrent try_to_unmap(), whose setting of PageMlocked was not
>> sufficiently serialized by the down_read_trylock of mmap_sem.
>>
>> But I don't remember the other reasons for that revert (and
>> haven't looked very hard as yet): anyone else remember?

Yeah, I don't think this was ever resolved, but ultimately the patch
got reverted[1] because it exposed issues in the form of bad pages
(shmem, vmsplice) and corrupted vm_flags in untrack_pfn(), causing,
for example, vm_file to disappear.

>I hoped Davidlohr would come back with something after the revert, but it
>never happened. I think the reverted patch was responsible for most of the
>scalability boost from rwsem for i_mmap_lock...

Actually no, the change that got reverted was something we added at the very
last minute, just because it made sense and had the blessing of some
key people. The main winner of the series was migration (rmap), which
Hugh later addressed more specifically for unmapped pages:

https://lkml.org/lkml/2014/11/30/349

So I really didn't care about the reverted patch, and it was therefore never
on my radar.

[1] https://lkml.org/lkml/2014/12/22/375

Thanks,
Davidlohr


* Re: Multiple potential races on vma->vm_flags
  2015-09-23 13:08                           ` Andrey Konovalov
@ 2015-09-24  0:42                             ` Sasha Levin
  2015-09-25 19:33                               ` Kirill A. Shutemov
  2015-10-13 22:38                               ` Hugh Dickins
  2015-10-13 22:33                             ` Hugh Dickins
  1 sibling, 2 replies; 35+ messages in thread
From: Sasha Levin @ 2015-09-24  0:42 UTC (permalink / raw)
  To: Andrey Konovalov, Hugh Dickins
  Cc: Kirill A. Shutemov, Davidlohr Bueso, Oleg Nesterov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins <hughd@google.com> wrote:
>> > This is totally untested, and one of you may quickly prove me wrong;
>> > but I went in to fix your "Bad page state (mlocked)" by holding pte
>> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
>> > then couldn't see why it would need mmap_sem at all, given how mlock
>> > and munlock first assert intention by setting or clearing VM_LOCKED
>> > in vm_flags, then work their way up the vma, taking pte locks.
>> >
>> > Calling mlock_vma_page() under pte lock may look suspicious
>> > at first: but what it does is similar to clear_page_mlock(),
>> > which we regularly call under pte lock from page_remove_rmap().
>> >
>> > I'd rather wait to hear whether this appears to work in practice,
>> > and whether you agree that it should work in theory, before writing
>> > the proper description.  I'd love to lose that down_read_trylock.
> No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> 
> It seems that your patch doesn't fix the race from the report below, since pte
> lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> is being executed. (Line numbers are from kernel with your patch applied.)

I've fired up my HZ_10000 patch, and this seems to be a real race that is
somewhat easy to reproduce under those conditions.

Here's a fresh backtrace from my VMs:

[1935109.882343] BUG: Bad page state in process trinity-subchil  pfn:3ca200
[1935109.884000] page:ffffea000f288000 count:0 mapcount:0 mapping:          (null) index:0x1e00 compound_mapcount: 0
[1935109.885772] flags: 0x22fffff80144008(uptodate|head|swapbacked|mlocked)
[1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[1935109.888197] bad because of flags:
[1935109.888759] flags: 0x100000(mlocked)
[1935109.889525] Modules linked in:
[1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
[1935109.891876]  1ffffffff6445448 00000000e5dca494 ffff8803f7657708 ffffffffa70402da
[1935109.893504]  ffffea000f288000 ffff8803f7657738 ffffffffa56e522b 022fffff80144008
[1935109.894947]  ffffea000f288020 ffffea000f288000 00000000ffffffff ffff8803f76577a8
[1935109.896413] Call Trace:
[1935109.899102]  [<ffffffffa70402da>] dump_stack+0x4e/0x84
[1935109.899821]  [<ffffffffa56e522b>] bad_page+0x17b/0x210
[1935109.900469]  [<ffffffffa56e85a8>] free_pages_prepare+0xb48/0x1110
[1935109.902127]  [<ffffffffa56ee0d1>] __free_pages_ok+0x21/0x260
[1935109.904435]  [<ffffffffa56ee373>] free_compound_page+0x63/0x80
[1935109.905614]  [<ffffffffa581b51e>] free_transhuge_page+0x6e/0x80
[1935109.906752]  [<ffffffffa5709f76>] __put_compound_page+0x76/0xa0
[1935109.907884]  [<ffffffffa570a475>] release_pages+0x4d5/0x9f0
[1935109.913027]  [<ffffffffa5769bea>] tlb_flush_mmu_free+0x8a/0x120
[1935109.913957]  [<ffffffffa576f993>] unmap_page_range+0xe73/0x1460
[1935109.915737]  [<ffffffffa57700a6>] unmap_single_vma+0x126/0x2f0
[1935109.916646]  [<ffffffffa577270d>] unmap_vmas+0xdd/0x190
[1935109.917454]  [<ffffffffa5790361>] exit_mmap+0x221/0x430
[1935109.921176]  [<ffffffffa5366da1>] mmput+0xb1/0x240
[1935109.921919]  [<ffffffffa537b3b2>] do_exit+0x732/0x27c0
[1935109.928561]  [<ffffffffa537d599>] do_group_exit+0xf9/0x300
[1935109.929786]  [<ffffffffa537d7bd>] SyS_exit_group+0x1d/0x20
[1935109.930617]  [<ffffffffaf59fbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a


Thanks,
Sasha


* Re: Multiple potential races on vma->vm_flags
  2015-09-15 17:36               ` Sasha Levin
  2015-09-15 19:01                 ` Kirill A. Shutemov
  2015-09-23 21:30                 ` Sasha Levin
@ 2015-09-24 13:11                 ` Oleg Nesterov
  2015-09-24 16:27                   ` Sasha Levin
  2 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-24 13:11 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On 09/15, Sasha Levin wrote:
>
> I've modified my tests to stress the exit path of processes with many vmas,
> and hit the following NULL ptr deref (not sure if it's related to the original issue):

I am shy to ask. Looks like I am the only stupid one who needs
more info...

> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN

Well, I know absolutely nothing about kasan, to the point that I can't even
understand where this message comes from. grep didn't help. But this
doesn't matter...

> [1181047.937223] Modules linked in:
> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
> [1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)

I do not know which tree/branch you use. In Linus's tree, mm/memory.c:1337 is

	struct mm_struct *mm = vma->vm_mm;

but this doesn't match the asm below,

>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>    6:   85 22                   test   %esp,(%rdx)
>    8:   01 00                   add    %eax,(%rax)
>    a:   00 48 8b                add    %cl,-0x75(%rax)
>    d:   43                      rex.XB
>    e:   40                      rex
>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>   21:   fc ff df
>   24:   48 89 fa                mov    %rdi,%rdx
>   27:   48 c1 ea 03             shr    $0x3,%rdx
>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
>   2f:   0f 85 ee 00 00 00       jne    0x123
>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>   40:   00

And I do not see anything similar in "objdump -d". So could you at least
show mm/memory.c:1337 in your tree?

Hmm. movabs $0xdffffc0000000000,%rax above looks suspicious, this looks
like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
(I can't check, my gcc is very old). Or what?

Any chance you can tell us where exactly we hit NULL-deref in unmap_vmas?

Oleg.


* Re: Multiple potential races on vma->vm_flags
  2015-09-24 13:11                 ` Oleg Nesterov
@ 2015-09-24 16:27                   ` Sasha Levin
  2015-09-24 17:26                     ` Oleg Nesterov
  0 siblings, 1 reply; 35+ messages in thread
From: Sasha Levin @ 2015-09-24 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> On 09/15, Sasha Levin wrote:
>>
>> I've modified my tests to stress the exit path of processes with many vmas,
>> and hit the following NULL ptr deref (not sure if it's related to the original issue):
> 
> I am shy to ask. Looks like I am the only stupid one who needs
> more info...
> 
>> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> 
> > Well, I know absolutely nothing about kasan, to the point that I can't even
> > understand where this message comes from. grep didn't help. But this
> > doesn't matter...

The reason behind this message is that NULL ptr derefs when using kasan
manifest as GPFs (general protection faults). This is because in order to
validate an access to a given memory address kasan checks (shadow_base +
(mem_offset >> 3)), so in the case of a NULL pointer it tries to access
shadow_base + 0, which GPFs.

>> [1181047.937223] Modules linked in:
>> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
>> [1181047.939387] task: ffff8804195c8000 ti: ffff880433f00000 task.ti: ffff880433f00000
>> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
> 
> I do not know which tree/branch you use. In Linus's tree, mm/memory.c:1337 is

I'm running -next + Kirill's THP patchset.

> 	struct mm_struct *mm = vma->vm_mm;

void unmap_vmas(struct mmu_gather *tlb,
                struct vm_area_struct *vma, unsigned long start_addr,
                unsigned long end_addr)
{
        struct mm_struct *mm = vma->vm_mm;

        mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
        for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
                unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- this
        mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}

> but this doesn't match the asm below,
> 
>>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>>    6:   85 22                   test   %esp,(%rdx)
>>    8:   01 00                   add    %eax,(%rax)
>>    a:   00 48 8b                add    %cl,-0x75(%rax)
>>    d:   43                      rex.XB
>>    e:   40                      rex
>>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
>>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>>   21:   fc ff df
>>   24:   48 89 fa                mov    %rdi,%rdx
>>   27:   48 c1 ea 03             shr    $0x3,%rdx
>>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
>>   2f:   0f 85 ee 00 00 00       jne    0x123
>>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
>>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>>   40:   00
> 
> And I do not see anything similar in "objdump -d". So could you at least
> show mm/memory.c:1337 in your tree?
> 
> Hmm. movabs $0xdffffc0000000000,%rax above looks suspicious, this looks
> like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> (I can't check, my gcc is very old). Or what?

This is indeed kasan code. 0xdffffc0000000000 is the shadow base, and you see
kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

Looking at the assembly, the address we were trying to access was:

 RDI: 00000000000004c8

> Any chance you can tell us where exactly we hit NULL-deref in unmap_vmas?

I hope the information above helped, please let me know if it didn't and you
need anything else.


Thanks,
Sasha


* Re: Multiple potential races on vma->vm_flags
  2015-09-24 16:27                   ` Sasha Levin
@ 2015-09-24 17:26                     ` Oleg Nesterov
  2015-09-24 18:52                       ` Andrey Ryabinin
  0 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-24 17:26 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On 09/24, Sasha Levin wrote:
>
> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> >
> > Well, I know absolutely nothing about kasan, to the point that I can't even
> > understand where this message comes from. grep didn't help. But this
> > doesn't matter...
>
> The reason behind this message is that NULL ptr derefs when using kasan
> manifest as GPFs (general protection faults). This is because in order to
> validate an access to a given memory address kasan checks (shadow_base +
> (mem_offset >> 3)), so in the case of a NULL pointer it tries to access
> shadow_base + 0, which GPFs.

OK, so this just means the kernel dereferences a NULL pointer,

> I'm running -next + Kirill's THP patchset.
>
> > 	struct mm_struct *mm = vma->vm_mm;
>
> void unmap_vmas(struct mmu_gather *tlb,
>                 struct vm_area_struct *vma, unsigned long start_addr,
>                 unsigned long end_addr)
> {
>         struct mm_struct *mm = vma->vm_mm;
>
>         mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>         for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>                 unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- this
>         mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> }

And I do not see any dereference at this line,

> >>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >>    6:   85 22                   test   %esp,(%rdx)
> >>    8:   01 00                   add    %eax,(%rax)
> >>    a:   00 48 8b                add    %cl,-0x75(%rax)
> >>    d:   43                      rex.XB
> >>    e:   40                      rex
> >>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
> >>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> >>   21:   fc ff df
> >>   24:   48 89 fa                mov    %rdi,%rdx
> >>   27:   48 c1 ea 03             shr    $0x3,%rdx
> >>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
> >>   2f:   0f 85 ee 00 00 00       jne    0x123
> >>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
> >>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >>   40:   00
> >
> > And I do not see anything similar in "objdump -d". So could you at least
> > show mm/memory.c:1337 in your tree?
> >
> > Hmm. movabs $0xdffffc0000000000,%rax above looks suspicious, this looks
> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> > (I can't check, my gcc is very old). Or what?
>
> This is indeed kasan code. 0xdffffc0000000000 is the shadow base, and you see
> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

and thus this asm can't help, right?

So how can we figure out where exactly the kernel hits NULL? And what
exactly is it trying to dereference?

> I hope the information above helped, please let me know if it didn't and you
> need anything else.

Thanks a lot, it does help, but I am still confused.


Looks like "function+offset" would be more useful than the line numbers;
at least then we could look at mm/memory.s.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Multiple potential races on vma->vm_flags
  2015-09-24 17:26                     ` Oleg Nesterov
@ 2015-09-24 18:52                       ` Andrey Ryabinin
  2015-09-24 19:01                         ` Sasha Levin
  2015-09-25 12:41                         ` Oleg Nesterov
  0 siblings, 2 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2015-09-24 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm@kvack.org,
	LKML, Vlastimil Babka

2015-09-24 20:26 GMT+03:00 Oleg Nesterov <oleg@redhat.com>:
> On 09/24, Sasha Levin wrote:
>>
>> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
>> >
>> > Well, I know absolutely nothing about kasan, to the point that I can't even
>> > understand where this message comes from. grep didn't help. But this
>> > doesn't matter...
>>
>> The reason behind this message is that NULL ptr derefs when using kasan
>> manifest as GPFs (general protection faults). This is because in order to
>> validate an access to a given memory address kasan checks (shadow_base +
>> (mem_offset >> 3)), so in the case of a NULL pointer it tries to access
>> shadow_base + 0, which GPFs.
>
> OK, so this just means the kernel dereferences a NULL pointer,
>
>> I'm running -next + Kirill's THP patchset.
>>
>> >     struct mm_struct *mm = vma->vm_mm;
>>
>> void unmap_vmas(struct mmu_gather *tlb,
>>                 struct vm_area_struct *vma, unsigned long start_addr,
>>                 unsigned long end_addr)
>> {
>>         struct mm_struct *mm = vma->vm_mm;
>>
>>         mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>>         for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>>                 unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- this
>>         mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
>> }
>
> And I do not see any dereference at this line,
>

I've noticed that addr2line sometimes doesn't work reliably on
compiler-instrumented code; a couple of times I've seen it point to the
next line of code.


>> >>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>> >>    6:   85 22                   test   %esp,(%rdx)
>> >>    8:   01 00                   add    %eax,(%rax)
>> >>    a:   00 48 8b                add    %cl,-0x75(%rax)
>> >>    d:   43                      rex.XB
>> >>    e:   40                      rex
>> >>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>> >>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
>> >>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>> >>   21:   fc ff df
>> >>   24:   48 89 fa                mov    %rdi,%rdx
>> >>   27:   48 c1 ea 03             shr    $0x3,%rdx
>> >>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
>> >>   2f:   0f 85 ee 00 00 00       jne    0x123
>> >>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
>> >>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>> >>   40:   00
>> >
>> > And I do not see anything similar in "objdump -d". So could you at least
>> > show mm/memory.c:1337 in your tree?
>> >
>> > Hmm. movabs $0xdffffc0000000000,%rax above looks suspicious, this looks
>> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
>> > (I can't check, my gcc is very old). Or what?
>>
>> This is indeed kasan code. 0xdffffc0000000000 is the shadow base, and you see
>> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
>
> and thus this asm can't help, right?
>

I think it can.

> So how can we figure out where exactly the kernel hits NULL? And what
> exactly does it try to dereference?

So we tried to dereference 0x4c8.  That 0x4c8 is probably an offset into some
struct, and the only big struct here is mm_struct.
So I think we tried to dereference a NULL mm, and this asm:
         > cmpq   $0x0,0x4c8(%rax)

is likely from inlined mm_has_notifiers():
    static inline int mm_has_notifiers(struct mm_struct *mm)
    {
             return unlikely(mm->mmu_notifier_mm);
    }
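For reference, the shadow lookup that the instrumented code performs can be sketched in userspace; the offset constant is the one visible in the movabs in the asm above (this is an illustrative model, not the actual kernel implementation):

```c
#include <stdint.h>

/* x86-64 KASAN shadow offset, as seen in the movabs in the asm above. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL

/* Each shadow byte covers 8 bytes of memory, hence the >> 3.
 * For a near-NULL pointer like 0x4c8 the resulting shadow address lands
 * in an unmapped region, so the cmpb on it faults with a GPF. */
static inline uint64_t kasan_shadow_addr(uint64_t addr)
{
    return (addr >> 3) + KASAN_SHADOW_OFFSET;
}
```

This matches the instruction sequence above: shr $0x3 on the pointer, then cmpb against (%rdx,%rax,1) with %rax holding the shadow offset.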


Sasha, could you confirm that in your kernel mmu_notifier_mm field has
0x4c8 offset?
I would use gdb for that:
gdb vmlinux
(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
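The same check can be done without gdb by compiling a one-liner against the headers. As a stand-in illustration of the technique only (the struct below is hypothetical padding, not the real mm_struct layout):

```c
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: the 0x4c8 bytes of padding model
 * whatever fields precede mmu_notifier_mm in the real layout. */
struct demo_mm {
    char pad[0x4c8];
    void *mmu_notifier_mm;
};

/* Equivalent to gdb's  p/x &(((struct demo_mm *)0)->mmu_notifier_mm)  */
static size_t mmu_notifier_mm_offset(void)
{
    return offsetof(struct demo_mm, mmu_notifier_mm);
}
```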

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: Multiple potential races on vma->vm_flags
  2015-09-24 18:52                       ` Andrey Ryabinin
@ 2015-09-24 19:01                         ` Sasha Levin
  2015-09-25 12:41                         ` Oleg Nesterov
  1 sibling, 0 replies; 35+ messages in thread
From: Sasha Levin @ 2015-09-24 19:01 UTC (permalink / raw)
  To: Andrey Ryabinin, Oleg Nesterov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm@kvack.org, LKML,
	Vlastimil Babka

On 09/24/2015 02:52 PM, Andrey Ryabinin wrote:
> Sasha, could you confirm that in your kernel mmu_notifier_mm field has
> 0x4c8 offset?
> I would use gdb for that:
> gdb vmlinux
> (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)

(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
$1 = 0x4c8


Thanks,
Sasha


* Re: Multiple potential races on vma->vm_flags
  2015-09-24 18:52                       ` Andrey Ryabinin
  2015-09-24 19:01                         ` Sasha Levin
@ 2015-09-25 12:41                         ` Oleg Nesterov
  1 sibling, 0 replies; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-25 12:41 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Sasha Levin, Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm@kvack.org,
	LKML, Vlastimil Babka

On 09/24, Andrey Ryabinin wrote:
>
> 2015-09-24 20:26 GMT+03:00 Oleg Nesterov <oleg@redhat.com>:
> > On 09/24, Sasha Levin wrote:
> >>
> >> void unmap_vmas(struct mmu_gather *tlb,
> >>                 struct vm_area_struct *vma, unsigned long start_addr,
> >>                 unsigned long end_addr)
> >> {
> >>         struct mm_struct *mm = vma->vm_mm;
> >>
> >>         mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> >>         for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >>                 unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- this
> >>         mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> >> }
> >
> > And I do not see any dereference at this line,
> >
>
> I've noticed that addr2line sometimes doesn't work reliably on
> compiler-instrumented code; a couple of times I've seen it point to the
> next line of code.

Yes, I know that we can't trust it. That is why I think (at least in
this particular case) function+offset would be more helpful. And we
probably need more asm.

> >> >>    0:   08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >> >>    6:   85 22                   test   %esp,(%rdx)
> >> >>    8:   01 00                   add    %eax,(%rax)
> >> >>    a:   00 48 8b                add    %cl,-0x75(%rax)
> >> >>    d:   43                      rex.XB
> >> >>    e:   40                      rex
> >> >>    f:   48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >> >>   16:   48 89 45 d0             mov    %rax,-0x30(%rbp)
> >> >>   1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> >> >>   21:   fc ff df
> >> >>   24:   48 89 fa                mov    %rdi,%rdx
> >> >>   27:   48 c1 ea 03             shr    $0x3,%rdx
> >> >>   2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
> >> >>   2f:   0f 85 ee 00 00 00       jne    0x123
> >> >>   35:   48 8b 45 d0             mov    -0x30(%rbp),%rax
> >> >>   39:   48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >> >>   40:   00
> >> >
> >> > And I do not see anything similar in "objdump -d". So could you at least
> >> > show mm/memory.c:1337 in your tree?
> >> >
> >> > Hmm. movabs $0xdffffc0000000000,%rax above looks suspicious, this looks
> >> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> >> > (I can't check, my gcc is very old). Or what?
> >>
> >> This is indeed kasan code. 0xdffffc0000000000 is the shadow base, and you see
> >> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
> >
> > and thus this asm can't help, right?
> >
>
> I think it can.
>
> > So how can we figure out where exactly the kernel hits NULL? And what
> > exactly does it try to dereference?
>
> So we tried to dereference 0x4c8.  That 0x4c8 is probably an offset into some
> struct, and the only big struct here is mm_struct.
> So I think we tried to dereference a NULL mm, and this asm:
>          > cmpq   $0x0,0x4c8(%rax)
>
> is likely from inlined mm_has_notifiers():
>     static inline int mm_has_notifiers(struct mm_struct *mm)
>     {
>              return unlikely(mm->mmu_notifier_mm);
>     }

Looks reasonable... Thanks.

I was going to say that this is impossible because the caller should have
crashed if ->mm == NULL. But unmap_vmas() uses mm = vma->vm_mm, so it looks
like this vma or mm->mmap was corrupted...

Oleg.


* Re: Multiple potential races on vma->vm_flags
  2015-09-23 21:30                 ` Sasha Levin
@ 2015-09-25 14:26                   ` Oleg Nesterov
  0 siblings, 0 replies; 35+ messages in thread
From: Oleg Nesterov @ 2015-09-25 14:26 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrey Konovalov, Rik van Riel,
	Andrew Morton, Dmitry Vyukov, linux-mm, linux-kernel,
	Vlastimil Babka

On 09/23, Sasha Levin wrote:
>
> Another similar trace where we see a problem during process exit:
>
> [1922964.887922] kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [1922964.890234] Modules linked in:
> [1922964.890844] CPU: 1 PID: 21477 Comm: trinity-c161 Tainted: G        W       4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1922964.892584] task: ffff880251858000 ti: ffff88009f258000 task.ti: ffff88009f258000
> [1922964.893723] RIP: acct_collect (kernel/acct.c:542)

   530  void acct_collect(long exitcode, int group_dead)
   531  {
   532          struct pacct_struct *pacct = &current->signal->pacct;
   533          cputime_t utime, stime;
   534          unsigned long vsize = 0;
   535
   536          if (group_dead && current->mm) {
   537                  struct vm_area_struct *vma;
   538
   539                  down_read(&current->mm->mmap_sem);
   540                  vma = current->mm->mmap;
   541                  while (vma) {
   542                          vsize += vma->vm_end - vma->vm_start; // !!!!!!!!!!!!
   543                          vma = vma->vm_next;
   544                  }
   545                  up_read(&current->mm->mmap_sem);
   546          }


> [1922964.895105] RSP: 0000:ffff88009f25f908  EFLAGS: 00010207
> [1922964.895935] RAX: dffffc0000000000 RBX: 2ce0ffffffffffff RCX: 0000000000000000
> [1922964.897008] RDX: ffff2152b153ffff RSI: 059c200000000000 RDI: 2ce1000000000007
> [1922964.898091] RBP: ffff88009f25f9e8 R08: 0000000000000001 R09: 00000000000003ef
> [1922964.899178] R10: ffffed014d7a3a01 R11: 0000000000000001 R12: ffff880082b485c0
> [1922964.901643] R13: ffff2152b153ffff R14: 1ffff10013e4bf24 R15: ffff88009f25f9c0
...
>    0:   02 00                   add    (%rax),%al
>    2:   0f 85 9d 05 00 00       jne    0x5a5
>    8:   48 8b 1b                mov    (%rbx),%rbx
>    b:   48 85 db                test   %rbx,%rbx
>    e:   0f 84 7b 05 00 00       je     0x58f

Probably "mov (%rbx),%rbx" is "vma = mm->mmap",

>   14:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>   1b:   fc ff df
>   1e:   31 d2                   xor    %edx,%edx
>   20:   48 8d 7b 08             lea    0x8(%rbx),%rdi

and this loads the addr of vma->vm_end for kasan,

>   24:   48 89 fe                mov    %rdi,%rsi
>   27:   48 c1 ee 03             shr    $0x3,%rsi
>   2b:*  80 3c 06 00             cmpb   $0x0,(%rsi,%rax,1)               <-- trapping instruction

which reports the error. But in this case this is not a NULL deref:
note that $rbx = 2ce0ffffffffffff, which is below __PAGE_OFFSET
(but above TASK_SIZE_MAX). It seems it is not even canonical. In
any case this odd value can't be valid.
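For reference, an x86-64 address (with 4-level paging) is canonical iff bits 63..48 are a sign-extension of bit 47. A small userspace sketch of that check, not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical under 48-bit virtual addressing iff
 * sign-extending its low 48 bits reproduces the full 64-bit value. */
static bool is_canonical_48(uint64_t addr)
{
    return (uint64_t)(((int64_t)(addr << 16)) >> 16) == addr;
}
```

Applied to the values above: the $rbx value 2ce0ffffffffffff fails the check, while a normal kernel pointer like ffff880251858000 passes.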

Again, looks like mm->mmap pointer was corrupted. Perhaps you can
re-test with the stupid patch below. But unlikely it will help. If
mm was freed we would probably see something else.

Oleg.
---

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -672,6 +672,7 @@ struct mm_struct *mm_alloc(void)
 void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
+	BUG_ON(atomic_read(&mm->mm_users));
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);


* Re: Multiple potential races on vma->vm_flags
  2015-09-24  0:42                             ` Sasha Levin
@ 2015-09-25 19:33                               ` Kirill A. Shutemov
  2015-10-13 22:38                               ` Hugh Dickins
  1 sibling, 0 replies; 35+ messages in thread
From: Kirill A. Shutemov @ 2015-09-25 19:33 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrey Konovalov, Hugh Dickins, Davidlohr Bueso, Oleg Nesterov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka

On Wed, Sep 23, 2015 at 08:42:26PM -0400, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins <hughd@google.com> wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> > 
> > It seems that your patch doesn't fix the race from the report below, since pte
> > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> > is being executed. (Line numbers are from kernel with your patch applied.)
> 
> I've fired up my HZ_10000 patch,

Can we get the HZ_10000 thing into upstream? Under KERNEL_DEBUG, or
something?

> and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
> 
> Here's a fresh backtrace from my VMs:
> 
> [1935109.882343] BUG: Bad page state in process trinity-subchil  pfn:3ca200
> [1935109.884000] page:ffffea000f288000 count:0 mapcount:0 mapping:          (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22fffff80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x100000(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  1ffffffff6445448 00000000e5dca494 ffff8803f7657708 ffffffffa70402da
> [1935109.893504]  ffffea000f288000 ffff8803f7657738 ffffffffa56e522b 022fffff80144008
> [1935109.894947]  ffffea000f288020 ffffea000f288000 00000000ffffffff ffff8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102]  [<ffffffffa70402da>] dump_stack+0x4e/0x84
> [1935109.899821]  [<ffffffffa56e522b>] bad_page+0x17b/0x210
> [1935109.900469]  [<ffffffffa56e85a8>] free_pages_prepare+0xb48/0x1110
> [1935109.902127]  [<ffffffffa56ee0d1>] __free_pages_ok+0x21/0x260
> [1935109.904435]  [<ffffffffa56ee373>] free_compound_page+0x63/0x80
> [1935109.905614]  [<ffffffffa581b51e>] free_transhuge_page+0x6e/0x80
> [1935109.906752]  [<ffffffffa5709f76>] __put_compound_page+0x76/0xa0
> [1935109.907884]  [<ffffffffa570a475>] release_pages+0x4d5/0x9f0
> [1935109.913027]  [<ffffffffa5769bea>] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957]  [<ffffffffa576f993>] unmap_page_range+0xe73/0x1460
> [1935109.915737]  [<ffffffffa57700a6>] unmap_single_vma+0x126/0x2f0
> [1935109.916646]  [<ffffffffa577270d>] unmap_vmas+0xdd/0x190
> [1935109.917454]  [<ffffffffa5790361>] exit_mmap+0x221/0x430
> [1935109.921176]  [<ffffffffa5366da1>] mmput+0xb1/0x240
> [1935109.921919]  [<ffffffffa537b3b2>] do_exit+0x732/0x27c0
> [1935109.928561]  [<ffffffffa537d599>] do_group_exit+0xf9/0x300
> [1935109.929786]  [<ffffffffa537d7bd>] SyS_exit_group+0x1d/0x20
> [1935109.930617]  [<ffffffffaf59fbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a

Would it make any difference if you add mmap_sem protection in
exit_mmap?

-- 
 Kirill A. Shutemov


* Re: Multiple potential races on vma->vm_flags
  2015-09-23 13:08                           ` Andrey Konovalov
  2015-09-24  0:42                             ` Sasha Levin
@ 2015-10-13 22:33                             ` Hugh Dickins
  2015-10-15 16:58                               ` Andrey Konovalov
  1 sibling, 1 reply; 35+ messages in thread
From: Hugh Dickins @ 2015-10-13 22:33 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Hugh Dickins, Kirill A. Shutemov, Davidlohr Bueso, Sasha Levin,
	Oleg Nesterov, Rik van Riel, Andrew Morton, Dmitry Vyukov,
	linux-mm, linux-kernel, Vlastimil Babka

On Wed, 23 Sep 2015, Andrey Konovalov wrote:
> On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins <hughd@google.com> wrote:
> > This is totally untested, and one of you may quickly prove me wrong;
> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> > then couldn't see why it would need mmap_sem at all, given how mlock
> > and munlock first assert intention by setting or clearing VM_LOCKED
> > in vm_flags, then work their way up the vma, taking pte locks.
> >
> > Calling mlock_vma_page() under pte lock may look suspicious
> > at first: but what it does is similar to clear_page_mlock(),
> > which we regularly call under pte lock from page_remove_rmap().
> >
> > I'd rather wait to hear whether this appears to work in practice,
> > and whether you agree that it should work in theory, before writing
> > the proper description.  I'd love to lose that down_read_trylock.
> 
> No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".

I think I've found the answer to that at last: we were indeed
all looking in the wrong direction.  Your ktsan tree shows

static __always_inline int atomic_add_negative(int i, atomic_t *v)
{
#ifndef CONFIG_KTSAN
	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
#else
	return (ktsan_atomic32_fetch_add((void *)v, i,
			ktsan_memory_order_acq_rel) + i) < 0;
#endif
}

but ktsan_atomic32_fetch_add() returns u32: so it looks like
your implementation of atomic_add_negative() always returns 0,
and page_remove_file_rmap() never calls clear_page_mlock(), as
it ought when an Mlocked page has been truncated or punched out.

/proc/meminfo gives you crazy AnonPages and Mapped too, yes?
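The signedness problem can be demonstrated in isolation. This is a userspace sketch (the buggy_/fixed_ names are invented here, not the actual ktsan code):

```c
#include <stdint.h>

/* Models the ktsan variant above: the fetch_add helper returns u32, so
 * the addition is done in unsigned arithmetic and "< 0" is never true. */
static int buggy_add_negative(int i, uint32_t fetched)
{
    return (fetched + i) < 0;   /* unsigned comparison: always 0 */
}

/* Casting the sum back to a signed 32-bit value restores the intended
 * semantics of atomic_add_negative(). */
static int fixed_add_negative(int i, uint32_t fetched)
{
    return (int32_t)(fetched + i) < 0;
}
```

With fetched == 0 and i == -1 the sum is -1, yet the buggy version reports 0, which is exactly why clear_page_mlock() never gets called.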

> 
> It seems that your patch doesn't fix the race from the report below, since pte
> lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> is being executed. (Line numbers are from kernel with your patch applied.)

I was not trying to "fix" that with my patch, because I couldn't find
any problem with the way it reads vm_flags there; I can't even see any
need for READ_ONCE or more barriers, we have sufficient locking already.

Sure, try_to_unmap_one() may read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
but the syscalls (or exit) then work their way up the address space to
establish the final state, no problem.

But I am glad you drew attention to the inadequacy of the
down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
that patch (doing the mlock_vma_page under pt lock instead), I have
identified one case that it would fix - though it clearly wasn't
involved in your stacktrace (it's a race with truncating COWed pages,
but your trace was holepunching, which leaves the COWs alone).

I'll go forward with that patch, but it rather falls into a series
I was preparing, must finish up all their comments before posting.

Hugh

> 
> ===
> ThreadSanitizer: data-race in munlock_vma_pages_range
> 
> Write at 0xffff880282a93290 of size 8 by thread 2546 on CPU 2:
>  [<ffffffff81211009>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [<     inline     >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inline     >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
>  [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
> 
> Previous read at 0xffff880282a93290 of size 8 by thread 2545 on CPU 1:
>  [<ffffffff8121bc1a>] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>  [<     inline     >] rmap_walk_file mm/rmap.c:1522
>  [<ffffffff8121d1a7>] rmap_walk+0x147/0x450 mm/rmap.c:1541
>  [<ffffffff8121d962>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>  [<ffffffff81210640>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [<ffffffff81210af6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [<ffffffff81211330>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [<     inline     >] munlock_vma_pages_all mm/internal.h:252
>  [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
>  [<     inline     >] exit_mm kernel/exit.c:437
>  [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
>  [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
> ===
> 


* Re: Multiple potential races on vma->vm_flags
  2015-09-24  0:42                             ` Sasha Levin
  2015-09-25 19:33                               ` Kirill A. Shutemov
@ 2015-10-13 22:38                               ` Hugh Dickins
  1 sibling, 0 replies; 35+ messages in thread
From: Hugh Dickins @ 2015-10-13 22:38 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrey Konovalov, Hugh Dickins, Kirill A. Shutemov,
	Davidlohr Bueso, Oleg Nesterov, Rik van Riel, Andrew Morton,
	Dmitry Vyukov, linux-mm, linux-kernel, Vlastimil Babka

On Wed, 23 Sep 2015, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins <hughd@google.com> wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> > 
> > It seems that your patch doesn't fix the race from the report below, since pte
> > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> > is being executed. (Line numbers are from kernel with your patch applied.)
> 
> I've fired up my HZ_10000 patch, and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
> 
> Here's a fresh backtrace from my VMs:
> 
> [1935109.882343] BUG: Bad page state in process trinity-subchil  pfn:3ca200
> [1935109.884000] page:ffffea000f288000 count:0 mapcount:0 mapping:          (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22fffff80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x100000(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  1ffffffff6445448 00000000e5dca494 ffff8803f7657708 ffffffffa70402da
> [1935109.893504]  ffffea000f288000 ffff8803f7657738 ffffffffa56e522b 022fffff80144008
> [1935109.894947]  ffffea000f288020 ffffea000f288000 00000000ffffffff ffff8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102]  [<ffffffffa70402da>] dump_stack+0x4e/0x84
> [1935109.899821]  [<ffffffffa56e522b>] bad_page+0x17b/0x210
> [1935109.900469]  [<ffffffffa56e85a8>] free_pages_prepare+0xb48/0x1110
> [1935109.902127]  [<ffffffffa56ee0d1>] __free_pages_ok+0x21/0x260
> [1935109.904435]  [<ffffffffa56ee373>] free_compound_page+0x63/0x80
> [1935109.905614]  [<ffffffffa581b51e>] free_transhuge_page+0x6e/0x80

free_transhuge_page belongs to Kirill's THP refcounting patchset;
it's not in 4.3-rc, 4.3.0-rc2-next-20150923 or mmotm.
Well worth testing, thank you, but please make it clear what you
are testing: I won't spend longer on this one at this time.

Hugh

> [1935109.906752]  [<ffffffffa5709f76>] __put_compound_page+0x76/0xa0
> [1935109.907884]  [<ffffffffa570a475>] release_pages+0x4d5/0x9f0
> [1935109.913027]  [<ffffffffa5769bea>] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957]  [<ffffffffa576f993>] unmap_page_range+0xe73/0x1460
> [1935109.915737]  [<ffffffffa57700a6>] unmap_single_vma+0x126/0x2f0
> [1935109.916646]  [<ffffffffa577270d>] unmap_vmas+0xdd/0x190
> [1935109.917454]  [<ffffffffa5790361>] exit_mmap+0x221/0x430
> [1935109.921176]  [<ffffffffa5366da1>] mmput+0xb1/0x240
> [1935109.921919]  [<ffffffffa537b3b2>] do_exit+0x732/0x27c0
> [1935109.928561]  [<ffffffffa537d599>] do_group_exit+0xf9/0x300
> [1935109.929786]  [<ffffffffa537d7bd>] SyS_exit_group+0x1d/0x20
> [1935109.930617]  [<ffffffffaf59fbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
> 
> 
> Thanks,
> Sasha


* Re: Multiple potential races on vma->vm_flags
  2015-10-13 22:33                             ` Hugh Dickins
@ 2015-10-15 16:58                               ` Andrey Konovalov
  0 siblings, 0 replies; 35+ messages in thread
From: Andrey Konovalov @ 2015-10-15 16:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Davidlohr Bueso, Sasha Levin, Oleg Nesterov,
	Rik van Riel, Andrew Morton, Dmitry Vyukov, linux-mm,
	linux-kernel, Vlastimil Babka

On Wed, Oct 14, 2015 at 12:33 AM, Hugh Dickins <hughd@google.com> wrote:
> I think I've found the answer to that at last: we were indeed
> all looking in the wrong direction.  Your ktsan tree shows
>
> static __always_inline int atomic_add_negative(int i, atomic_t *v)
> {
> #ifndef CONFIG_KTSAN
>         GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
> #else
>         return (ktsan_atomic32_fetch_add((void *)v, i,
>                         ktsan_memory_order_acq_rel) + i) < 0;
> #endif
> }
>
> but ktsan_atomic32_fetch_add() returns u32: so it looks like
> your implementation of atomic_add_negative() always returns 0,
> and page_remove_file_rmap() never calls clear_page_mlock(), as
> it ought when an Mlocked page has been truncated or punched out.
>
> /proc/meminfo gives you crazy AnonPages and Mapped too, yes?

Yes, you're correct, there was a bug in KTSAN annotations.
Thank you for finding this.

Fixing that bug fixes the bad page reports.
Sorry for troubling you with this.

The race reports are still there though.

>
>>
>> It seems that your patch doesn't fix the race from the report below, since pte
>> lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
>> is being executed. (Line numbers are from kernel with your patch applied.)
>
> I was not trying to "fix" that with my patch, because I couldn't find
> any problem with the way it reads vm_flags there; I can't even see any
> need for READ_ONCE or more barriers, we have sufficient locking already.
>
> Sure, try_to_unmap_one() may read vm_flags an instant before or after
> a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
> but the syscalls (or exit) then work their way up the address space to
> establish the final state, no problem.
>
> But I am glad you drew attention to the inadequacy of the
> down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
> that patch (doing the mlock_vma_page under pt lock instead), I have
> identified one case that it would fix - though it clearly wasn't
> involved in your stacktrace (it's a race with truncating COWed pages,
> but your trace was holepunching, which leaves the COWs alone).
>
> I'll go forward with that patch, but it rather falls into a series
> I was preparing, must finish up all their comments before posting.
>
> Hugh
>
>>
>> ===
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write at 0xffff880282a93290 of size 8 by thread 2546 on CPU 2:
>>  [<ffffffff81211009>] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [<     inline     >] munlock_vma_pages_all mm/internal.h:252
>>  [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
>>  [<     inline     >] exit_mm kernel/exit.c:437
>>  [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>>
>> Previous read at 0xffff880282a93290 of size 8 by thread 2545 on CPU 1:
>>  [<ffffffff8121bc1a>] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>>  [<     inline     >] rmap_walk_file mm/rmap.c:1522
>>  [<ffffffff8121d1a7>] rmap_walk+0x147/0x450 mm/rmap.c:1541
>>  [<ffffffff8121d962>] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>>  [<ffffffff81210640>] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [<ffffffff81210af6>] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [<ffffffff81211330>] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [<     inline     >] munlock_vma_pages_all mm/internal.h:252
>>  [<ffffffff81215d03>] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [<ffffffff81085635>] mmput+0x65/0x190 kernel/fork.c:708
>>  [<     inline     >] exit_mm kernel/exit.c:437
>>  [<ffffffff8108c2a7>] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [<ffffffff8108ef3f>] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [<ffffffff810a03a5>] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [<ffffffff8100619c>] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [<ffffffff81006cbd>] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [<ffffffff81ea87a4>] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>> ===
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org



Thread overview: 35+ messages
     [not found] <CAAeHK+z8o96YeRF-fQXmoApOKXa0b9pWsQHDeP=5GC_hMTuoDg@mail.gmail.com>
     [not found] ` <55EC9221.4040603@oracle.com>
2015-09-07 11:40   ` Multiple potential races on vma->vm_flags Kirill A. Shutemov
2015-09-09 15:27     ` Vlastimil Babka
2015-09-09 16:01       ` Kirill A. Shutemov
2015-09-10  0:58     ` Sasha Levin
2015-09-10  8:36       ` Kirill A. Shutemov
2015-09-10 13:27         ` Andrey Konovalov
2015-09-11 10:39           ` Kirill A. Shutemov
2015-09-11 15:29             ` Vlastimil Babka
2015-09-11 16:08               ` Vlastimil Babka
2015-09-12  1:27             ` Hugh Dickins
2015-09-14 10:16               ` Kirill A. Shutemov
2015-09-15 17:36               ` Sasha Levin
2015-09-15 19:01                 ` Kirill A. Shutemov
2015-09-22 16:47                   ` Andrey Konovalov
2015-09-22 18:54                     ` Hugh Dickins
2015-09-22 19:45                       ` Andrey Konovalov
2015-09-23  1:39                         ` Hugh Dickins
2015-09-23 11:46                           ` Kirill A. Shutemov
2015-09-23 22:58                             ` Davidlohr Bueso
2015-09-23 13:08                           ` Andrey Konovalov
2015-09-24  0:42                             ` Sasha Levin
2015-09-25 19:33                               ` Kirill A. Shutemov
2015-10-13 22:38                               ` Hugh Dickins
2015-10-13 22:33                             ` Hugh Dickins
2015-10-15 16:58                               ` Andrey Konovalov
2015-09-23 21:30                 ` Sasha Levin
2015-09-25 14:26                   ` Oleg Nesterov
2015-09-24 13:11                 ` Oleg Nesterov
2015-09-24 16:27                   ` Sasha Levin
2015-09-24 17:26                     ` Oleg Nesterov
2015-09-24 18:52                       ` Andrey Ryabinin
2015-09-24 19:01                         ` Sasha Levin
2015-09-25 12:41                         ` Oleg Nesterov
2015-09-23 15:34             ` Oleg Nesterov
2015-09-23 15:38               ` Oleg Nesterov
