Re: Multiple potential races on vma->vm_flags

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Hugh Dickins <hughd@google.com>, Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrey Konovalov <andreyknvl@google.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	Oleg Nesterov <oleg@redhat.com>, Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: Multiple potential races on vma->vm_flags
Date: Wed, 23 Sep 2015 14:46:37 +0300	[thread overview]
Message-ID: <20150923114636.GB25020@node.dhcp.inet.fi> (raw)
In-Reply-To: <alpine.LSU.2.11.1509221831570.19790@eggly.anvils>

On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote:
> On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins <hughd@google.com> wrote:
> > > On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > >> If anybody comes up with a patch to fix the original issue I easily
> > >> can test it, since I'm hitting "BUG: Bad page state" in a second when
> > >> fuzzing with KTSAN and Trinity.
> > >
> > > This "BUG: Bad page state" sounds more serious, but I cannot track down
> > > your report of it: please repost - thanks - though on seeing it, I may
> > > well end up with no ideas.
> > 
> > The report is below.
> 
> Thanks.
> 
> > 
> > I get it after a few seconds of running Trinity on a kernel with KTSAN
> > and targeting mlock, munlock and madvise syscalls.
> > Sasha also observed a very similar crash a while ago
> > (https://lkml.org/lkml/2014/11/6/1055).
> > I didn't manage to reproduce this in a kernel build without KTSAN though.
> > The idea was that data races KTSAN reports might be the explanation of
> > these crashes.
> > 
> > BUG: Bad page state in process trinity-c15  pfn:281999
> > page:ffffea000a066640 count:0 mapcount:0 mapping:          (null) index:0xd
> > flags: 0x20000000028000c(referenced|uptodate|swapbacked|mlocked)
> > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > bad because of flags:
> > flags: 0x200000(mlocked)
> > Modules linked in:
> > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >  ffffffff821c3b70 0000000000000000 0000000100004741 ffff8800b857f948
> >  ffffffff81e9926c 0000000000000003 ffffea000a066640 ffff8800b857f978
> >  ffffffff811ce045 ffffffff821c3b70 ffffea000a066640 0000000000000001
> > Call Trace:
> >  [<     inline     >] __dump_stack lib/dump_stack.c:15
> >  [<ffffffff81e9926c>] dump_stack+0x63/0x81 lib/dump_stack.c:50
> >  [<ffffffff811ce045>] bad_page+0x115/0x1a0 mm/page_alloc.c:409
> >  [<     inline     >] free_pages_check mm/page_alloc.c:731
> >  [<ffffffff811cf3b8>] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922
> >  [<ffffffff811d2911>] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908
> >  [<ffffffff811d2bcf>] free_hot_cold_page_list+0x5f/0x100
> > mm/page_alloc.c:1956 (discriminator 3)
> >  [<ffffffff811dd9c1>] release_pages+0x151/0x300 mm/swap.c:967
> >  [<ffffffff811de723>] __pagevec_release+0x43/0x60 mm/swap.c:984
> >  [<     inline     >] pagevec_release include/linux/pagevec.h:69
> >  [<ffffffff811ef36a>] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446
> >  [<ffffffff811ef86f>] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540
> >  [<ffffffff811f15d5>] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086
> >  [<ffffffff812568d0>] vfs_fallocate+0x1e0/0x310 fs/open.c:303
> >  [<     inline     >] madvise_remove mm/madvise.c:326
> >  [<     inline     >] madvise_vma mm/madvise.c:378
> >  [<     inline     >] SYSC_madvise mm/madvise.c:528
> >  [<ffffffff81225548>] SyS_madvise+0x378/0x760 mm/madvise.c:459
> >  [<ffffffff8124ef36>] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161
> >  [<ffffffff81ea8691>] entry_SYSCALL_64_fastpath+0x31/0x95
> > arch/x86/entry/entry_64.S:188
> > Disabling lock debugging due to kernel taint
> 
> This is totally untested, and one of you may quickly prove me wrong;
> but I went in to fix your "Bad page state (mlocked)" by holding pte
> lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> then couldn't see why it would need mmap_sem at all, given how mlock
> and munlock first assert intention by setting or clearing VM_LOCKED
> in vm_flags, then work their way up the vma, taking pte locks.
> 
> Calling mlock_vma_page() under pte lock may look suspicious
> at first: but what it does is similar to clear_page_mlock(),
> which we regularly call under pte lock from page_remove_rmap().

Indeed. Looks fishy. But probably will work.

That was not obvious for me what makes clearing VM_LOCKED visible in
try_to_unmap_one() without mmap_sem.

After looking some more it seems we would rely on page_check_address()
in try_to_unmap_one() having acquire semantics and follow_page_mask() in
munlock_vma_pages_range() having release semantics.

But I would prefer to have something more explicit.

That's mess.

> 
> I'd rather wait to hear whether this appears to work in practice,
> and whether you agree that it should work in theory, before writing
> the proper description.  I'd love to lose that down_read_trylock.
> 
> You mention how Sasha hit the "Bad page state (mlocked)" back in
> November: that was one of the reasons we reverted Davidlohr's
> i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(),
> without understanding why it was needed.  Yes, it would lock out
> a concurrent try_to_unmap(), whose setting of PageMlocked was not
> sufficiently serialized by the down_read_trylock of mmap_sem.
> 
> But I don't remember the other reasons for that revert (and
> haven't looked very hard as yet): anyone else remember?

I hoped Davidlohr will come back with something after the revert, but it
never happend. I think the reverted patch was responsible for most of
scalability boost from rwsem for i_mmap_lock...

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Hugh Dickins <hughd@google.com>, Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrey Konovalov <andreyknvl@google.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	Oleg Nesterov <oleg@redhat.com>, Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: Multiple potential races on vma->vm_flags
Date: Wed, 23 Sep 2015 14:46:37 +0300	[thread overview]
Message-ID: <20150923114636.GB25020@node.dhcp.inet.fi> (raw)
In-Reply-To: <alpine.LSU.2.11.1509221831570.19790@eggly.anvils>

On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote:
> On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins <hughd@google.com> wrote:
> > > On Tue, 22 Sep 2015, Andrey Konovalov wrote:
> > >> If anybody comes up with a patch to fix the original issue I easily
> > >> can test it, since I'm hitting "BUG: Bad page state" in a second when
> > >> fuzzing with KTSAN and Trinity.
> > >
> > > This "BUG: Bad page state" sounds more serious, but I cannot track down
> > > your report of it: please repost - thanks - though on seeing it, I may
> > > well end up with no ideas.
> > 
> > The report is below.
> 
> Thanks.
> 
> > 
> > I get it after a few seconds of running Trinity on a kernel with KTSAN
> > and targeting mlock, munlock and madvise syscalls.
> > Sasha also observed a very similar crash a while ago
> > (https://lkml.org/lkml/2014/11/6/1055).
> > I didn't manage to reproduce this in a kernel build without KTSAN though.
> > The idea was that data races KTSAN reports might be the explanation of
> > these crashes.
> > 
> > BUG: Bad page state in process trinity-c15  pfn:281999
> > page:ffffea000a066640 count:0 mapcount:0 mapping:          (null) index:0xd
> > flags: 0x20000000028000c(referenced|uptodate|swapbacked|mlocked)
> > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > bad because of flags:
> > flags: 0x200000(mlocked)
> > Modules linked in:
> > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >  ffffffff821c3b70 0000000000000000 0000000100004741 ffff8800b857f948
> >  ffffffff81e9926c 0000000000000003 ffffea000a066640 ffff8800b857f978
> >  ffffffff811ce045 ffffffff821c3b70 ffffea000a066640 0000000000000001
> > Call Trace:
> >  [<     inline     >] __dump_stack lib/dump_stack.c:15
> >  [<ffffffff81e9926c>] dump_stack+0x63/0x81 lib/dump_stack.c:50
> >  [<ffffffff811ce045>] bad_page+0x115/0x1a0 mm/page_alloc.c:409
> >  [<     inline     >] free_pages_check mm/page_alloc.c:731
> >  [<ffffffff811cf3b8>] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922
> >  [<ffffffff811d2911>] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908
> >  [<ffffffff811d2bcf>] free_hot_cold_page_list+0x5f/0x100
> > mm/page_alloc.c:1956 (discriminator 3)
> >  [<ffffffff811dd9c1>] release_pages+0x151/0x300 mm/swap.c:967
> >  [<ffffffff811de723>] __pagevec_release+0x43/0x60 mm/swap.c:984
> >  [<     inline     >] pagevec_release include/linux/pagevec.h:69
> >  [<ffffffff811ef36a>] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446
> >  [<ffffffff811ef86f>] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540
> >  [<ffffffff811f15d5>] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086
> >  [<ffffffff812568d0>] vfs_fallocate+0x1e0/0x310 fs/open.c:303
> >  [<     inline     >] madvise_remove mm/madvise.c:326
> >  [<     inline     >] madvise_vma mm/madvise.c:378
> >  [<     inline     >] SYSC_madvise mm/madvise.c:528
> >  [<ffffffff81225548>] SyS_madvise+0x378/0x760 mm/madvise.c:459
> >  [<ffffffff8124ef36>] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161
> >  [<ffffffff81ea8691>] entry_SYSCALL_64_fastpath+0x31/0x95
> > arch/x86/entry/entry_64.S:188
> > Disabling lock debugging due to kernel taint
> 
> This is totally untested, and one of you may quickly prove me wrong;
> but I went in to fix your "Bad page state (mlocked)" by holding pte
> lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> then couldn't see why it would need mmap_sem at all, given how mlock
> and munlock first assert intention by setting or clearing VM_LOCKED
> in vm_flags, then work their way up the vma, taking pte locks.
> 
> Calling mlock_vma_page() under pte lock may look suspicious
> at first: but what it does is similar to clear_page_mlock(),
> which we regularly call under pte lock from page_remove_rmap().

Indeed. Looks fishy. But probably will work.

That was not obvious for me what makes clearing VM_LOCKED visible in
try_to_unmap_one() without mmap_sem.

After looking some more it seems we would rely on page_check_address()
in try_to_unmap_one() having acquire semantics and follow_page_mask() in
munlock_vma_pages_range() having release semantics.

But I would prefer to have something more explicit.

That's mess.

> 
> I'd rather wait to hear whether this appears to work in practice,
> and whether you agree that it should work in theory, before writing
> the proper description.  I'd love to lose that down_read_trylock.
> 
> You mention how Sasha hit the "Bad page state (mlocked)" back in
> November: that was one of the reasons we reverted Davidlohr's
> i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(),
> without understanding why it was needed.  Yes, it would lock out
> a concurrent try_to_unmap(), whose setting of PageMlocked was not
> sufficiently serialized by the down_read_trylock of mmap_sem.
> 
> But I don't remember the other reasons for that revert (and
> haven't looked very hard as yet): anyone else remember?

I hoped Davidlohr will come back with something after the revert, but it
never happend. I think the reverted patch was responsible for most of
scalability boost from rwsem for i_mmap_lock...

-- 
 Kirill A. Shutemov

next prev parent reply	other threads:[~2015-09-23 11:46 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAAeHK+z8o96YeRF-fQXmoApOKXa0b9pWsQHDeP=5GC_hMTuoDg@mail.gmail.com>
     [not found] ` <55EC9221.4040603@oracle.com>
2015-09-07 11:40   ` Multiple potential races on vma->vm_flags Kirill A. Shutemov
2015-09-07 11:40     ` Kirill A. Shutemov
2015-09-09 15:27     ` Vlastimil Babka
2015-09-09 15:27       ` Vlastimil Babka
2015-09-09 16:01       ` Kirill A. Shutemov
2015-09-09 16:01         ` Kirill A. Shutemov
2015-09-10  0:58     ` Sasha Levin
2015-09-10  0:58       ` Sasha Levin
2015-09-10  8:36       ` Kirill A. Shutemov
2015-09-10  8:36         ` Kirill A. Shutemov
2015-09-10 13:27         ` Andrey Konovalov
2015-09-10 13:27           ` Andrey Konovalov
2015-09-11 10:39           ` Kirill A. Shutemov
2015-09-11 10:39             ` Kirill A. Shutemov
2015-09-11 15:29             ` Vlastimil Babka
2015-09-11 15:29               ` Vlastimil Babka
2015-09-11 16:08               ` Vlastimil Babka
2015-09-11 16:08                 ` Vlastimil Babka
2015-09-12  1:27             ` Hugh Dickins
2015-09-12  1:27               ` Hugh Dickins
2015-09-14 10:16               ` Kirill A. Shutemov
2015-09-14 10:16                 ` Kirill A. Shutemov
2015-09-15 17:36               ` Sasha Levin
2015-09-15 17:36                 ` Sasha Levin
2015-09-15 19:01                 ` Kirill A. Shutemov
2015-09-15 19:01                   ` Kirill A. Shutemov
2015-09-22 16:47                   ` Andrey Konovalov
2015-09-22 16:47                     ` Andrey Konovalov
2015-09-22 18:54                     ` Hugh Dickins
2015-09-22 18:54                       ` Hugh Dickins
2015-09-22 19:45                       ` Andrey Konovalov
2015-09-22 19:45                         ` Andrey Konovalov
2015-09-23  1:39                         ` Hugh Dickins
2015-09-23  1:39                           ` Hugh Dickins
2015-09-23 11:46                           ` Kirill A. Shutemov [this message]
2015-09-23 11:46                             ` Kirill A. Shutemov
2015-09-23 22:58                             ` Davidlohr Bueso
2015-09-23 22:58                               ` Davidlohr Bueso
2015-09-23 13:08                           ` Andrey Konovalov
2015-09-23 13:08                             ` Andrey Konovalov
2015-09-24  0:42                             ` Sasha Levin
2015-09-24  0:42                               ` Sasha Levin
2015-09-25 19:33                               ` Kirill A. Shutemov
2015-09-25 19:33                                 ` Kirill A. Shutemov
2015-10-13 22:38                               ` Hugh Dickins
2015-10-13 22:38                                 ` Hugh Dickins
2015-10-13 22:33                             ` Hugh Dickins
2015-10-13 22:33                               ` Hugh Dickins
2015-10-15 16:58                               ` Andrey Konovalov
2015-10-15 16:58                                 ` Andrey Konovalov
2015-09-23 21:30                 ` Sasha Levin
2015-09-23 21:30                   ` Sasha Levin
2015-09-25 14:26                   ` Oleg Nesterov
2015-09-25 14:26                     ` Oleg Nesterov
2015-09-24 13:11                 ` Oleg Nesterov
2015-09-24 13:11                   ` Oleg Nesterov
2015-09-24 16:27                   ` Sasha Levin
2015-09-24 16:27                     ` Sasha Levin
2015-09-24 17:26                     ` Oleg Nesterov
2015-09-24 17:26                       ` Oleg Nesterov
2015-09-24 18:52                       ` Andrey Ryabinin
2015-09-24 18:52                         ` Andrey Ryabinin
2015-09-24 19:01                         ` Sasha Levin
2015-09-24 19:01                           ` Sasha Levin
2015-09-25 12:41                         ` Oleg Nesterov
2015-09-25 12:41                           ` Oleg Nesterov
2015-09-23 15:34             ` Oleg Nesterov
2015-09-23 15:34               ` Oleg Nesterov
2015-09-23 15:38               ` Oleg Nesterov
2015-09-23 15:38                 ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150923114636.GB25020@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=akpm@linux-foundation.org \
    --cc=andreyknvl@google.com \
    --cc=dave@stgolabs.net \
    --cc=dvyukov@google.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=riel@redhat.com \
    --cc=sasha.levin@oracle.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.