Re: [PATCH 2/12] mm: rmap use pte lock not mmap_sem to set PageMlocked

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Vlastimil Babka <vbabka@suse.cz>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Rik van Riel <riel@redhat.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Oleg Nesterov <oleg@redhat.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	Andrey Konovalov <andreyknvl@google.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 2/12] mm: rmap use pte lock not mmap_sem to set PageMlocked
Date: Mon, 19 Oct 2015 14:33:30 +0200	[thread overview]
Message-ID: <5624E31A.9010202@suse.cz> (raw)
In-Reply-To: <alpine.LSU.2.11.1510190341490.3809@eggly.anvils>

On 10/19/2015 01:20 PM, Hugh Dickins wrote:
> On Mon, 19 Oct 2015, Vlastimil Babka wrote:
>> On 10/19/2015 06:50 AM, Hugh Dickins wrote:
>>> KernelThreadSanitizer (ktsan) has shown that the down_read_trylock()
>>> of mmap_sem in try_to_unmap_one() (when going to set PageMlocked on
>>> a page found mapped in a VM_LOCKED vma) is ineffective against races
>>> with exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not
>>> held when tearing down an mm.
>>>
>>> But that's okay, those races are benign; and although we've believed
>>
>> But didn't Kirill show that it's not so benign, and can leak memory?
>> - http://marc.info/?l=linux-mm&m=144196800325498&w=2
> 
> Kirill's race was this:
> 
> 		CPU0				CPU1
> exit_mmap()
>    // mmap_sem is *not* taken
>    munlock_vma_pages_all()
>      munlock_vma_pages_range()
>      					try_to_unmap_one()
> 					  down_read_trylock(&vma->vm_mm->mmap_sem))
> 					  !!(vma->vm_flags & VM_LOCKED) == true
>        vma->vm_flags &= ~VM_LOCKED;
>        <munlock the page>
>        					  mlock_vma_page(page);
> 					  // mlocked pages is leaked.
> 
> Hmm, I pulled that in to say that it looked benign to me, that he was
> missing all the subsequent "<munlock the page>" which would correct the
> situation.  But now I look at it again, I agree with you both: lacking
> any relevant locking on CPU1 at that point (it has already given up the
> pte lock there), the whole of "<munlock the page>" could take place on
> CPU0, before CPU1 reaches its mlock_vma_page(page), yes.
> 
> Oh, hold on, no: doesn't page lock prevent that one?  CPU1 has the page
> lock throughout, so CPU0's <munlock the page> cannot complete before
> CPU1's mlock_vma_page(page).  So now I disagree with you again!


I think the page lock doesn't help with munlock_vma_pages_range(). If I
expand the race above:

	CPU0				CPU1
					
exit_mmap()
  // mmap_sem is *not* taken
  munlock_vma_pages_all()
    munlock_vma_pages_range()		
					lock_page()
					...
    					try_to_unmap_one()
					  down_read_trylock(&vma->vm_mm->mmap_sem))
					  !!(vma->vm_flags & VM_LOCKED) == true
      vma->vm_flags &= ~VM_LOCKED;
      __munlock_pagevec_fill
        // this briefly takes pte lock
      __munlock_pagevec()
	// Phase 1
	TestClearPageMlocked(page)

      					  mlock_vma_page(page);
					    TestSetPageMlocked(page)
					    // page is still mlocked
					...
					unlock_page()
        // Phase 2
        lock_page()
        if (!__putback_lru_fast_prepare())
          // true, because page_evictable(page) is false due to PageMlocked
          __munlock_isolated_page
          if (page_mapcount(page) > 1)
             try_to_munlock(page);
             // this will not help AFAICS

Now if CPU0 is the last mapper, it will unmap the page anyway
further in exit_mmap(). If not, it stays mlocked.

The key problem is that page lock doesn't cover the TestClearPageMlocked(page)
part on CPU0.
Your patch should help AFAICS. If CPU1 does the mlock under pte lock, the
TestClear... on CPU0 can happen only after that.
If CPU0 takes pte lock first, then CPU1 must see the VM_LOCKED flag cleared,
right?

>> Although as I noted, it probably doesn't leak completely. But a page will
>> remain unevictable, until its last user unmaps it, which is again not
>> completely benign?
>> - http://marc.info/?l=linux-mm&m=144198536831589&w=2
> 
> I agree that we'd be wrong to leave a page on the unevictable lru
> indefinitely once it's actually evictable.  But I think my change is
> only making the above case easier to think about: trylock on mmap_sem
> is a confusing distraction from where the proper locking is done,
> whether it be page lock or pte lock.
> 
>>
>>
>>> for years in that ugly down_read_trylock(), it's unsuitable for the job,
>>> and frustrates the good intention of setting PageMlocked when it fails.
>>>
>>> It just doesn't matter if here we read vm_flags an instant before or
>>> after a racing mlock() or munlock() or exit_mmap() sets or clears
>>> VM_LOCKED: the syscalls (or exit) work their way up the address space
>>> (taking pt locks after updating vm_flags) to establish the final state.
>>>
>>> We do still need to be careful never to mark a page Mlocked (hence
>>> unevictable) by any race that will not be corrected shortly after.
>>
>> And waiting for the last user to unmap the page is not necessarily shortly
>> after :)
>>
>> Anyway pte lock looks like it could work, but I'll need to think about it
>> some more, because...
>>
>>> The page lock protects from many of the races, but not all (a page
>>> is not necessarily locked when it's unmapped).  But the pte lock we
>>> just dropped is good to cover the rest (and serializes even with
>>> munlock_vma_pages_all(),
>>
>> Note how munlock_vma_pages_range() via __munlock_pagevec() does
>> TestClearPageMlocked() without (or "between") pte or page lock. But the pte
>> lock is being taken after clearing VM_LOCKED, so perhaps it's safe against
>> try_to_unmap_one...
> 
> A mind-trick I found helpful for understanding the barriers here, is
> to imagine that the munlocker repeats its "vma->vm_flags &= ~VM_LOCKED"
> every time it takes the pte lock: it does not actually do that, it
> doesn't need to of course; but that does help show that ~VM_LOCKED
> must be visible to anyone getting that pte lock afterwards.
> 
> Hugh
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2015-10-19 12:33 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-19  4:44 [PATCH 0/12] mm: page migration cleanups, and a little mlock Hugh Dickins
2015-10-19  4:45 ` [PATCH 1/12] mm Documentation: undoc non-linear vmas Hugh Dickins
2015-10-19  9:16   ` Kirill A. Shutemov
2015-11-05 17:29   ` Vlastimil Babka
2015-10-19  4:50 ` [PATCH 2/12] mm: rmap use pte lock not mmap_sem to set PageMlocked Hugh Dickins
2015-10-19  6:23   ` Vlastimil Babka
2015-10-19 11:20     ` Hugh Dickins
2015-10-19 12:33       ` Vlastimil Babka [this message]
2015-10-19 19:17         ` Hugh Dickins
2015-10-19 20:52           ` Vlastimil Babka
2015-10-19 13:13       ` Kirill A. Shutemov
2015-10-19 19:53         ` Hugh Dickins
2015-10-19 20:10           ` Kirill A. Shutemov
2015-10-19 21:25             ` Vlastimil Babka
2015-10-19 21:53               ` Kirill A. Shutemov
2015-10-21 23:26               ` Hugh Dickins
2015-10-29 18:49                 ` [PATCH v2 " Hugh Dickins
2015-11-05 17:50                   ` Vlastimil Babka
2015-10-19 23:30         ` [PATCH " Davidlohr Bueso
2015-10-19  4:52 ` [PATCH 3/12] mm: page migration fix PageMlocked on migrated pages Hugh Dickins
2015-11-05 18:18   ` Vlastimil Babka
2015-10-19  4:54 ` [PATCH 4/12] mm: rename mem_cgroup_migrate to mem_cgroup_replace_page Hugh Dickins
2015-10-19 12:35   ` Johannes Weiner
2015-12-02  9:33   ` [PATCH] mm: fix kerneldoc on mem_cgroup_replace_page Hugh Dickins
2015-12-02 10:17     ` Michal Hocko
2015-12-02 16:57     ` Johannes Weiner
2015-10-19  4:55 ` [PATCH 5/12] mm: correct a couple of page migration comments Hugh Dickins
2015-10-21 17:53   ` Rafael Aquini
2015-10-19  4:57 ` [PATCH 6/12] mm: page migration use the put_new_page whenever necessary Hugh Dickins
2015-11-05 18:31   ` Vlastimil Babka
2015-11-08 21:17     ` Hugh Dickins
2015-10-19  4:59 ` [PATCH 7/12] mm: page migration trylock newpage at same level as oldpage Hugh Dickins
2015-10-21 17:54   ` Rafael Aquini
2015-10-19  5:01 ` [PATCH 8/12] mm: page migration remove_migration_ptes at lock+unlock level Hugh Dickins
2015-10-19  5:03 ` [PATCH 9/12] mm: simplify page migration's anon_vma comment and flow Hugh Dickins
2015-10-19  5:05 ` [PATCH 10/12] mm: page migration use migration entry for swapcache too Hugh Dickins
2015-10-22 22:35   ` Cyrill Gorcunov
2015-10-19  5:07 ` [PATCH 11/12] mm: page migration avoid touching newpage until no going back Hugh Dickins
2015-10-19  5:11 ` [PATCH 12/12] mm: migrate dirty page without clear_page_dirty_for_io etc Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5624E31A.9010202@suse.cz \
    --to=vbabka@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=andreyknvl@google.com \
    --cc=cl@linux.com \
    --cc=dave@stgolabs.net \
    --cc=dvyukov@google.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=riel@redhat.com \
    --cc=sasha.levin@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).