linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Bob Liu <bob.liu@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>,
	Michel Lespinasse <walken@google.com>,
	Nick Piggin <npiggin@suse.de>,
	Motohiro Kosaki JP <kosaki.motohiro@jp.fujitsu.com>,
	Rik van Riel <riel@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>
Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
Date: Wed, 08 Jan 2014 09:06:46 +0800	[thread overview]
Message-ID: <52CCA4A6.7030106@oracle.com> (raw)
In-Reply-To: <52CC16DC.9070308@suse.cz>


On 01/07/2014 11:01 PM, Vlastimil Babka wrote:
> On 01/06/2014 05:47 PM, Motohiro Kosaki wrote:
>>
>>
>>> -----Original Message-----
>>> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus
>>> Torvalds
>>> Sent: Friday, January 03, 2014 7:18 PM
>>> To: Vlastimil Babka
>>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
>>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
>>> List
>>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
>>> VMAs
>>>
>>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>
>>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>>>> should provide enough race protection.
>>>
>>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>>
>>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>>> able to run CONCURRENTLY on the same page. In particular, you could have a
>>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>>> immediately clearing it on another, and then the rest of those functions
>>> could run with a totally arbitrary interleaving when working with the exact
>>> same page.
>>>
>>> They both do basically
>>>
>>>      if (!isolate_lru_page(page))
>>>          putback_lru_page(page);
>>>
>>> but one or the other would randomly win the race (it's internally protected
>>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>>>
>>>      try_to_munlock(page);
>>>
>>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
>>> broken - you end up with the PageMlocked bit clear, but
>>> try_to_munlock() was never called on that page, because
>>> mlock_vma_page() got to the page isolation before the "subsequent"
>>> munlock_vma_page().
>>>
>>> And this is very much what the page lock serialization would prevent.
>>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
>>> yes, but that only "serializes" in one direction, not when you can have a mix
>>> of bit setting and clearing.
>>>
>>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
>>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
>>> in calling that without the page being locked. That's my opinion. There may
>>> be some *other* reason why it all happens to work, but no,
>>> "TestSetPageMlocked should provide enough race protection" is simply not
>>> true, and even if it were, it's way too subtle and odd to be a good rule.
>>>
>>> So I really object to just removing the BUG_ON(). Not with a *lot* more
>>> explanation as to why these kinds of issues wouldn't matter.
>>
>> I don't have a perfect answer. But I can explain a bit history. Let's me try.
>>
>> First off, 5 years ago, Lee's original putback_lru_page() implementation required
>> page-lock, but I removed the restriction months later. That's why we can see
>> strange BUG_ON here.
>>
>> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected  by
>> mmap_sem (write mdoe). Then, mlock and munlock had no race.
>> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to
>> protect against munlock.
>>
>> Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under
>> mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. If reclaim race against step (1),
>> reclaim must lose because it uses trylock. On the other hand, if reclaim race against step (2), reclaim must detect
>> VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem.
>>
>> By the way, page isolation is still necessary because we need to protect another page modification like page migration.
> 
> I guess you meant page locking, not (lru) isolation. Indeed, the documentation
> at Documentation/vm/unevictable-lru.txt also discusses races with page migration.
> 
> I've checked and it seems really the case that mlock_migrate_page()
> could race with mlock_vma_page() so that PG_mlocked is set on the
> old page again after deciding that the new page will be without the flag.
> (or after the flag was transferred to the new page)
> That would not be fatal, but distort accounting anyway.
> 
> So here's a patch that if accepted should replace the removal of BUG_ON patch in
> -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
> 

Make sense to me.

> The idea is that try_to_unmap_cluster() will try locking the page
> for mlock, and just leave it alone if lock cannot be obtained. Again
> that's not fatal, as eventually something will encounter and mlock the page.
> 
> -----8<-----
> 
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Tue, 7 Jan 2014 14:59:58 +0100
> Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking
> 
> A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing
> with trinity. The call site try_to_unmap_cluster() does not lock the pages
> other than its check_page parameter (which is already locked).
> 
> The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat
> unclear, but apparently it serializes against page migration, which could
> otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the
> page would be eventually encountered again, but NR_MLOCK accounting would
> become distorted nevertheless. This patch adds a comment to the BUG_ON in
> mlock_vma_page() and munlock_vma_page() to that effect.
> 
> The call site try_to_unmap_cluster() is fixed so that for page != check_page,
> trylock_page() is attempted (to avoid possible deadlocks as we already have
> check_page locked) and mlock_vma_page() is performed only upon success. If the
> page lock cannot be obtained, the page is left without PG_mlocked, which is
> again not a problem in the whole unevictable memory design.
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Bob Liu <bob.liu@oracle.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Bob Liu <bob.liu@oracle.com>

> ---
>  mm/mlock.c |  2 ++
>  mm/rmap.c  | 14 ++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 192e6ee..1b12dfa 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page)
>   */
>  void mlock_vma_page(struct page *page)
>  {
> +	/* Serialize with page migration */
>  	BUG_ON(!PageLocked(page));
>  
>  	if (!TestSetPageMlocked(page)) {
> @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>  	unsigned int nr_pages;
>  
> +	/* For try_to_munlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
>  
>  	if (TestClearPageMlocked(page)) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 068522d..b99c742 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		BUG_ON(!page || PageAnon(page));
>  
>  		if (locked_vma) {
> -			mlock_vma_page(page);   /* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page == check_page) {
> +				/* we know we have check_page locked */
> +				mlock_vma_page(page);
>  				ret = SWAP_MLOCK;
> +			} else if (trylock_page(page)) {
> +				/*
> +				 * If we can lock the page, perform mlock.
> +				 * Otherwise leave the page alone, it will be
> +				 * eventually encountered again later.
> +				 */
> +				mlock_vma_page(page);
> +				unlock_page(page);
> +			}
>  			continue;	/* don't unmap */
>  		}
>  
> 

-- 
Regards,
-Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-01-08  1:07 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-17  8:05 [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs Wanpeng Li
2013-12-18  3:14 ` Wanpeng Li
2013-12-18  3:16 ` Wanpeng Li
2013-12-18  3:23   ` Wanpeng Li
     [not found]   ` <20131218032329.GA6044@hacker.(null)>
2013-12-18  3:32     ` Sasha Levin
2013-12-18  4:12       ` Wanpeng Li
2013-12-18  9:11         ` Vlastimil Babka
2013-12-18  9:23           ` Wanpeng Li
     [not found]           ` <52b1699f.87293c0a.75d1.34d3SMTPIN_ADDED_BROKEN@mx.google.com>
2013-12-18 21:43             ` Andrew Morton
2014-01-03 20:17               ` Sasha Levin
2014-01-03 20:52                 ` Linus Torvalds
2014-01-03 23:36                   ` Vlastimil Babka
2014-01-03 23:56                     ` Andrew Morton
2014-01-04  3:03                       ` Sasha Levin
2014-01-04  0:18                     ` Linus Torvalds
2014-01-04  8:09                       ` Vlastimil Babka
2014-01-05  0:27                       ` Wanpeng Li
2014-01-06 16:47                       ` Motohiro Kosaki
2014-01-06 22:01                         ` KOSAKI Motohiro
2014-01-07  5:27                         ` Wanpeng Li
2014-01-07 15:01                         ` Vlastimil Babka
2014-01-08  1:06                           ` Bob Liu [this message]
2014-01-08  2:44                           ` Linus Torvalds
2014-01-10 17:48                           ` Motohiro Kosaki
2014-01-13 14:03                             ` Vlastimil Babka
2014-01-14 11:05                               ` Vlastimil Babka
2014-01-04  3:31                   ` Bob Liu
2013-12-18  8:54 ` Vlastimil Babka
2013-12-18  9:01   ` Wanpeng Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52CCA4A6.7030106@oracle.com \
    --to=bob.liu@oracle.com \
    --cc=Motohiro.Kosaki@us.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=liwanp@linux.vnet.ibm.com \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=npiggin@suse.de \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=sasha.levin@oracle.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).