Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages

From: Andrea Arcangeli <aarcange@redhat.com>
To: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
Date: Fri, 8 Feb 2013 21:25:51 +0100
Message-ID: <20130208202550.GB9817@redhat.com>
In-Reply-To: <1359962232-20811-4-git-send-email-walken@google.com>

Hi Michel,

On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote:
> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
> at a time. When munlocking THP pages (or the huge zero page), this resulted
> in taking the mm->page_table_lock 512 times in a row.
> 
> We can do better by making use of the page_mask returned by follow_page_mask
> (for the huge zero page case), or the size of the page munlock_vma_page()
> operated on (for the true THP page case).
> 
> Note - I am sending this as RFC only for now as I can't currently put
> my finger on what if anything prevents split_huge_page() from operating
> concurrently on the same page as munlock_vma_page(), which would mess
> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
> point I missed here ?
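
To make the stepping arithmetic in the quoted description concrete, here is a minimal user-space sketch. It is not the patch itself: PAGE_SHIFT = 12, HPAGE_PMD_NR = 512, and the helper names page_increment() and count_steps() are assumptions made for illustration. It treats page_mask as 0 for a base page and 511 for a huge page, and counts how many loop iterations (one lock acquisition each, in the scenario described) are needed to cover one huge page.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT   12UL                 /* assumed: 4 KiB base pages */
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define HPAGE_PMD_NR 512UL                /* assumed: 512 base pages per huge page */

/*
 * Number of base pages from 'addr' to the end of the (possibly huge)
 * page containing it. page_mask is 0 for a base page and
 * HPAGE_PMD_NR - 1 for a huge page. Hypothetical helper.
 */
static unsigned long page_increment(unsigned long addr, unsigned long page_mask)
{
	return 1 + (~(addr >> PAGE_SHIFT) & page_mask);
}

/*
 * Count the loop iterations needed to walk [start, end) when each
 * iteration advances by page_increment() base pages. Hypothetical helper.
 */
static unsigned long count_steps(unsigned long start, unsigned long end,
				 unsigned long page_mask)
{
	unsigned long steps = 0;

	assert(start <= end);
	while (start < end) {
		start += page_increment(start, page_mask) * PAGE_SIZE;
		steps++;
	}
	return steps;
}
```

With page_mask = 0 the walk over one huge page takes 512 iterations; with page_mask = HPAGE_PMD_NR - 1 it takes one, even when the start address is not aligned to the huge page.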

I agree something looks fishy: neither mmap_sem held for writing nor
the page lock can stop split_huge_page_refcount.

Now the mlock side was intended to be safe because mlock_vma_page is
called within follow_page while holding the PT lock or the
page_table_lock (so split_huge_page_refcount will have to wait for it
to be released before it can run). See the assert_spin_locked in
follow_trans_huge_pmd and the pte_unmap_unlock after mlock_vma_page
returns.

The problem is that the mlock side depends on the TestSetPageMlocked
below always being repeated on the head page (follow_trans_huge_pmd
will always pass the head page to mlock_vma_page).

void mlock_vma_page(struct page *page)
{
	BUG_ON(!PageLocked(page));

	if (!TestSetPageMlocked(page)) {

But what if the head page was split in between two different
follow_page calls? The second call wouldn't take the pmd_trans_huge
path anymore and the stats would be increased too much.
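
The scenario above can be sketched as a toy user-space model. This is not kernel code: struct toy_page, nr_mlock and every toy_* function are hypothetical stand-ins, and 512 subpages per huge page is an assumption. Whether the tail pages carry the Mlocked flag after a split is left as a parameter, since the model makes no claim about what split_huge_page_refcount really does with it. It only shows how the counter depends on that.

```c
#include <stdbool.h>

#define NR_SUBPAGES 512            /* assumed: base pages per huge page */

struct toy_page {
	bool mlocked;              /* stands in for the PageMlocked flag */
	int nr_pages;              /* 512 for an unsplit head page, else 1 */
};

static long nr_mlock;              /* stands in for the NR_MLOCK counter */

/* Toy test-and-set: returns the old flag value and sets the flag. */
static bool toy_test_set_mlocked(struct toy_page *page)
{
	bool old = page->mlocked;

	page->mlocked = true;
	return old;
}

/* Toy mlock: account the page only the first time its flag is set. */
static void toy_mlock_page(struct toy_page *page)
{
	if (!toy_test_set_mlocked(page))
		nr_mlock += page->nr_pages;
}

/* Toy split: head becomes a base page; tails optionally inherit the flag. */
static void toy_split(struct toy_page pages[NR_SUBPAGES], bool tails_inherit_flag)
{
	pages[0].nr_pages = 1;
	for (int i = 1; i < NR_SUBPAGES; i++) {
		pages[i].nr_pages = 1;
		pages[i].mlocked = tails_inherit_flag && pages[0].mlocked;
	}
}

/* First call sees the huge page; after a split, later calls see base pages. */
static long toy_scenario(bool tails_inherit_flag)
{
	static struct toy_page pages[NR_SUBPAGES];

	nr_mlock = 0;
	for (int i = 0; i < NR_SUBPAGES; i++)
		pages[i] = (struct toy_page){ .mlocked = false, .nr_pages = 1 };
	pages[0].nr_pages = NR_SUBPAGES;

	toy_mlock_page(&pages[0]);             /* huge path: +512 */
	toy_split(pages, tails_inherit_flag);  /* split between the two calls */
	for (int i = 1; i < NR_SUBPAGES; i++)  /* per-page path resumes at page 1 */
		toy_mlock_page(&pages[i]);
	return nr_mlock;
}
```

In this model toy_scenario(false) ends at 1023 for a 512-page range, while toy_scenario(true) ends at 512. Skipping the whole huge page after the first call, as the acceleration does, never reaches the per-page loop at all.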

The problem on the munlock side is even more apparent, as you pointed
out above, but now I think the mlock side was problematic too.

The good thing is that your acceleration code for the mlock side
should have fixed the mlock race already: never risking a second call
to mlock_vma_page on the head page is not only an "acceleration", it
should also be a natural fix for the race.

To fix the munlock side, where the race is still present, I think one
way would be to do mlock and munlock within get_user_pages, so they
run in the same place protected by the PT lock or page_table_lock.

There are a few things that stop split_huge_page_refcount:
page_table_lock, lru_lock, compound_lock, anon_vma lock. So if we keep
calling munlock_vma_page outside of get_user_pages (so outside of the
page_table_lock), the other way would be to use the compound_lock.

NOTE: this is a purely cosmetic issue in /proc/meminfo; there's
nothing functional (at least in the kernel) connected to it, so no
panic :).

Thanks,
Andrea
