From: Andrea Arcangeli <aarcange@redhat.com>
To: Christoph Lameter <cl@linux-foundation.org>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
Izik Eidus <ieidus@redhat.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mel@csn.ul.ie>, Dave Hansen <dave@linux.vnet.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Chris Wright <chrisw@sous-sol.org>,
Andrew Morton <akpm@linux-foundation.org>,
bpicco@redhat.com,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Arnd Bergmann <arnd@arndb.de>
Subject: Re: [PATCH 32 of 32] khugepaged
Date: Mon, 1 Feb 2010 23:56:24 +0100 [thread overview]
Message-ID: <20100201225624.GB4135@random.random> (raw)
In-Reply-To: <alpine.DEB.2.00.1002011551560.2384@router.home>
On Mon, Feb 01, 2010 at 04:18:19PM -0600, Christoph Lameter wrote:
> On Sun, 31 Jan 2010, Andrea Arcangeli wrote:
>
> > +static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > + unsigned long address,
> > + pte_t *pte)
> > +{
> > + struct page *page;
> > + pte_t *_pte;
> > + int referenced = 0, isolated = 0;
> > + for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
> > + _pte++, address += PAGE_SIZE) {
> > + pte_t pteval = *_pte;
> > + if (!pte_present(pteval) || !pte_write(pteval)) {
> > + release_pte_pages(pte, _pte);
> > + goto out;
> > + }
> > + /* If there is no mapped pte young don't collapse the page */
> > + if (pte_young(pteval))
> > + referenced = 1;
> > + page = vm_normal_page(vma, address, pteval);
> > + if (unlikely(!page)) {
> > + release_pte_pages(pte, _pte);
> > + goto out;
> > + }
> > + VM_BUG_ON(PageCompound(page));
> > + BUG_ON(!PageAnon(page));
> > + VM_BUG_ON(!PageSwapBacked(page));
> > +
> > + /* cannot use mapcount: can't collapse if there's a gup pin */
> > + if (page_count(page) != 1) {
> > + release_pte_pages(pte, _pte);
> > + goto out;
> > + }
>
> Ok you hold mmap_sem locked the pte_lock and we know that there is/was
> only one refcount. However, this does not mean that nothing can reach the
> page. The hardware tlb lookup / tlb walker can still reach the page and
> potentially modify bits. follow_page() can still reach the page. LRU is
> also possible?
Check this:
spin_lock(&mm->page_table_lock); /* probably unnecessary */
/* after this gup_fast can't run anymore */
_pmd = pmdp_clear_flush_notify(vma, address, pmd);
spin_unlock(&mm->page_table_lock);
This already stopped gup_fast and tlb/mmu before the above code could
run. Otherwise the page_count check is meaningless if gup_fast can
run.
> > + /*
> > + * We can do it before isolate_lru_page because the
> > + * page can't be freed from under us. NOTE: PG_lock
> > + * seems entirely unnecessary but in doubt this is
> > + * safer. If proven unnecessary it can be removed.
> > + */
> > + if (!trylock_page(page)) {
> > + release_pte_pages(pte, _pte);
> > + goto out;
> > + }
>
> Who could be locking the page if the only ref is the page table and we
> hold the pte lock?
split_huge_page through rmap. Maybe the comment should be removed as
it's misleading and probably not accurate. I'll remove the "if proven
unnecessary" because comment is misleading.
> Is this for the case that someone does a follow_page() and lock?
follow_page can't run, it's only to serialize against split_huge_page
running through rmap code, and the VM takes the lock on the page
before it can split_huge_page.
> The page migration code puts migration ptes in there to stop access to the
> page.
No need of that when pmdp_clear_flush_notify is enough and it has to
invoke mmu notifier to be safe so sptes are dropped too.
> > + /*
> > + * Isolate the page to avoid collapsing an hugepage
> > + * currently in use by the VM.
> > + */
> > + if (isolate_lru_page(page)) {
> > + unlock_page(page);
> > + release_pte_pages(pte, _pte);
> > + goto out;
> > + }
>
> The page can also be reached via the LRU until here? Reclaim code?
Yes. If it's already taken by it, we just abort and try later.
> Note the requirement in the comments for isolate_lru_page() for an
> elevated refcount.
The refcount is elevated by the fact the pte mapping can't go away
from under us (we only invalidated the pmd the pte still there).
> Seems to rely on the teardown of a pte before release of a page in
> reclaim.
Not sure I get this, we simply take the lock on the page to serialize
against split_huge_page, if we arrived first we try to isolate the
page by relaying on the refcount of the page_mapcount == 1, and if we
again arrived first again we collapsed the pages.
> Would it not be better and safer to use and extend the existing
> page migration code for this purpose?
>
> You are performing almost the same activities.
>
> 1. Isolation
>
> 2. Copying
>
> 3. Reestablish reachability.
KSM also works exactly the same as khugepaged and migration but we
solved it without migration pte and apparently nobody wants to deal
with that special migration pte logic. So before worrying about
khugepaged out of the tree, you should actively go fix ksm that works
exactly the same and it's in mainline. Until you don't fix ksm I think
I should be allowed to keep khugepaged simple and lightweight without
being forced to migration pte.
Thanks for the review!
Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-02-01 22:57 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-01-31 20:27 [PATCH 00 of 32] Transparent Hugepage support #9 Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 01 of 32] define MADV_HUGEPAGE Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 02 of 32] compound_lock Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 03 of 32] alter compound get_page/put_page Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 04 of 32] update futex compound knowledge Andrea Arcangeli
2010-02-16 11:33 ` Peter Zijlstra
2010-03-01 17:58 ` Andrea Arcangeli
2010-03-01 18:07 ` Peter Zijlstra
2010-03-01 18:23 ` Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 05 of 32] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 06 of 32] clear compound mapping Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 07 of 32] add native_set_pmd_at Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 08 of 32] add pmd paravirt ops Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 09 of 32] no paravirt version of pmd ops Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 10 of 32] export maybe_mkwrite Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 11 of 32] comment reminder in destroy_compound_page Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 12 of 32] config_transparent_hugepage Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 13 of 32] special pmd_trans_* functions Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 14 of 32] add pmd mangling generic functions Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 15 of 32] add pmd mangling functions to x86 Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 16 of 32] bail out gup_fast on splitting pmd Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 17 of 32] pte alloc trans splitting Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 18 of 32] add pmd mmu_notifier helpers Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 19 of 32] clear page compound Andrea Arcangeli
2010-02-01 21:37 ` Christoph Lameter
2010-01-31 20:27 ` [PATCH 20 of 32] add pmd_huge_pte to mm_struct Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 21 of 32] split_huge_page_mm/vma Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 22 of 32] split_huge_page paging Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 23 of 32] clear_copy_huge_page Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 24 of 32] kvm mmu transparent hugepage support Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 25 of 32] transparent hugepage core Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 26 of 32] verify pmd_trans_huge isn't leaking Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 27 of 32] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 28 of 32] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2010-02-01 21:46 ` Christoph Lameter
2010-02-03 15:49 ` Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 29 of 32] memcg compound Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 30 of 32] memcg huge memory Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 31 of 32] transparent hugepage vmstat Andrea Arcangeli
2010-01-31 20:27 ` [PATCH 32 of 32] khugepaged Andrea Arcangeli
2010-02-01 17:03 ` Rik van Riel
2010-02-02 13:56 ` Andrea Arcangeli
2010-02-01 22:18 ` Christoph Lameter
2010-02-01 22:56 ` Andrea Arcangeli [this message]
2010-02-02 19:52 ` Christoph Lameter
2010-02-02 20:24 ` Andrea Arcangeli
2010-02-03 16:13 ` Christoph Lameter
2010-02-03 16:30 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100201225624.GB4135@random.random \
--to=aarcange@redhat.com \
--cc=agl@us.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=avi@redhat.com \
--cc=balbir@linux.vnet.ibm.com \
--cc=benh@kernel.crashing.org \
--cc=bpicco@redhat.com \
--cc=chrisw@sous-sol.org \
--cc=cl@linux-foundation.org \
--cc=dave@linux.vnet.ibm.com \
--cc=hugh.dickins@tiscali.co.uk \
--cc=ieidus@redhat.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
--cc=mingo@elte.hu \
--cc=mtosatti@redhat.com \
--cc=npiggin@suse.de \
--cc=riel@redhat.com \
--cc=travis@sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).