From: Andrea Arcangeli <aarcange@redhat.com>
To: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
Izik Eidus <ieidus@redhat.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mel@csn.ul.ie>, Andi Kleen <andi@firstfloor.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Christoph Lameter <cl@linux-foundation.org>,
Chris Wright <chrisw@sous-sol.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 03 of 30] alter compound get_page/put_page
Date: Sat, 23 Jan 2010 18:39:58 +0100
Message-ID: <20100123173958.GA6494@random.random>
In-Reply-To: <1264095346.32717.34452.camel@nimitz>

On Thu, Jan 21, 2010 at 09:35:46AM -0800, Dave Hansen wrote:
> Christoph kinda has a point here. The gup code is going to be a pretty
> hot path for some people, and this does add a bunch of atomics that some
> people will have no need for.
>
> It's also a decent place to put a helper function anyway.
>
> void pin_huge_page_tail(struct page *page)
> {
> 	/*
> 	 * This ensures that a __split_huge_page_refcount()
> 	 * running underneath us cannot
> 	 */
> 	VM_BUG_ON(atomic_read(&page->_count) < 0);
> 	atomic_inc(&page->_count);
> }
>
> It'll keep us from putting the same comment in too many arches, I guess

We can replace the compound_lock with a branch by setting a
PG_trans_huge flag on all compound pages allocated by huge_memory.c.
That would only benefit gup on hugetlbfs, and it would add the cost of
one branch to gup on transparent hugepages, which is why I didn't do
it. But I can add it. Note that the compound_lock is granular and sits
on a cacheline that is already hot and held exclusive read-write in
the L1 cache, unlike the mmap_sem (which gup_fast avoids), but surely
an atomic op is more costly than just a branch...

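A minimal userspace sketch of that trade-off, not the kernel code. It
assumes the flag is stored per page, and every name in it (toy_page,
trans_huge, toy_get_page and the init helper) is hypothetical. The
return value only counts the atomic ops each path needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; every name here is hypothetical. */
struct toy_page {
	atomic_int count;            /* models page->_count */
	bool tail;                   /* models PageTail(page) */
	bool trans_huge;             /* models the proposed PG_trans_huge */
	struct toy_page *first_page; /* head page, only used for tail pages */
};

void toy_page_init(struct toy_page *page, int count, bool tail,
		   bool trans_huge, struct toy_page *first_page)
{
	atomic_init(&page->count, count);
	page->tail = tail;
	page->trans_huge = trans_huge;
	page->first_page = first_page;
}

/* Pins the page and returns how many atomic ops the pin needed. */
int toy_get_page(struct toy_page *page)
{
	if (page->tail && !page->trans_huge) {
		/*
		 * Assumed hugetlbfs-like case: the compound page is never
		 * split, so one pin on the head is enough.
		 */
		atomic_fetch_add(&page->first_page->count, 1);
		return 1;
	}
	atomic_fetch_add(&page->count, 1);
	if (!page->tail)
		return 1;
	/* Splittable hugepage: the tail pin is also counted on the head. */
	atomic_fetch_add(&page->first_page->count, 1);
	return 2;
}
```

In this model the extra branch is paid on every tail pin, and only
pages without the flag save an atomic op.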
> >  static inline void get_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> > +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
>
> Hmm.

This means that if the page is not a tail page, the count must be
>= 1, which is stricter and more correct than the existing == 0 check
(which should really have been <= 0). If the page is a tail page, the
bugcheck only fires for < 0, because tail pages are only pinned by
gup: if there is no gup going on, there is no pin on the tail pages
either.

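The condition can be checked in isolation with a small userspace
sketch. The helper name below is hypothetical; it only evaluates the
expression inside the VM_BUG_ON() for a given count and page type.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the condition inside
 *   VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page))
 * would fire. !is_tail is 1 for head and non-compound pages and 0 for
 * tail pages, so the lower bound on the count is 1 or 0 respectively.
 */
bool get_page_check_fires(int count, bool is_tail)
{
	return count < !is_tail;
}
```

A non-tail page with a count of 0 trips the check, while a tail page
with a count of 0 (no gup pin yet) does not.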
>
> if
>
> >  	atomic_inc(&page->_count);
> > +	if (unlikely(PageTail(page))) {
> > +		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> > +		atomic_inc(&page->first_page->_count);
> > +		/* __split_huge_page_refcount can't run under get_page */
> > +		VM_BUG_ON(!PageTail(page));
> > +	}
> >  }
>
> Are you hoping to catch a race in progress with the second VM_BUG_ON()
> here? Maybe the comment should say, "detect race with
> __split_huge_page_refcount".

Exactly. I think the current comment was explicit enough. But frankly
this is pure paranoia, and I think gcc can eliminate the bugcheck
entirely because atomic_inc doesn't clobber "memory". So I'll remove
the bugcheck instead, while keeping the comment.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -322,10 +322,13 @@ static inline void get_page(struct page
 	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
 	if (unlikely(PageTail(page))) {
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount can't run under
+		 * get_page().
+		 */
 		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
 		atomic_inc(&page->first_page->_count);
-		/* __split_huge_page_refcount can't run under get_page */
-		VM_BUG_ON(!PageTail(page));
 	}
 }

> > static inline struct page *virt_to_head_page(const void *x)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -409,7 +409,8 @@ static inline void __ClearPageTail(struc
> > 1 << PG_private | 1 << PG_private_2 | \
> > 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
> > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> > - 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
> > + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> > + 1 << PG_compound_lock)
>
> Nit: should probably go in the last patch.

Why? If you apply this single patch, we already want to detect
immediately if somebody takes the compound_lock but forgets to
compound_unlock before freeing the page, just like with PG_locked.
There may be other nits in how I split the original monolith without
having to rewrite lots of intermediate code, but this looks ok, or at
least I don't get why it should move elsewhere ;).

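A rough userspace sketch of what adding the bit to the free-time mask
buys. The flag numbering and all TOY_/toy_ names are made up for
illustration; only the idea of masking the flags word at free time
comes from the hunk quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real numbering lives in page-flags.h. */
enum toy_pageflags {
	TOY_PG_locked,
	TOY_PG_referenced,
	TOY_PG_compound_lock,
};

/* Flags that must never be set on a page that is being freed. */
#define TOY_FLAGS_CHECK_AT_FREE \
	(1UL << TOY_PG_locked | 1UL << TOY_PG_compound_lock)

/* Returns true if freeing a page with these flags should be reported. */
bool toy_bad_flags_at_free(unsigned long flags)
{
	return (flags & TOY_FLAGS_CHECK_AT_FREE) != 0;
}
```

With the bit in the mask, a page freed while still holding the lock is
reported on the spot instead of surfacing later as a hang.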
> That looks functional to me, although the code is pretty darn dense. :)
> But, I'm not sure there's a better way to do it.

I'm not sure either.

If you or Christoph or anybody else asks me to add a PG_trans_huge
flag, set by huge_memory.c immediately after allocating the hugepage,
and to make the above put_page/get_page tail pinning and compound_lock
entirely conditional on PG_trans_huge being set, I'll do it
immediately. As said, it will replace around 2 atomic ops with a
branch on each gup/put_page run on a tail page allocated in hugetlbfs
(not through the transparent hugepage framework), so it will
practically eliminate the overhead added to O_DIRECT over hugetlbfs.
I'm not doing it unless explicitly asked because:

1) it will make the code even a little more dense

2) it will slightly slow down transparent hugepage gup (which means
   O_DIRECT over transparent hugepages and the kvm minor fault will
   have to pay one more branch than necessary)

It might be a worthwhile tradeoff, but I'm not a big believer in
hugetlbfs optimizations (unless they're entirely self contained), so
I'm not inclined to do it unless explicitly asked. I think we should
rather think about how to speed up gup on transparent hugepages, and
secondly we should add transparent hugepage support, probably
starting with tmpfs.

As you guessed, I also couldn't think of a more efficient way than
using this compound_lock on tail pages to allow the proper atomic
adjustment of the tail page refcounts in __split_huge_page_refcount.
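
A toy, single-threaded model of why the tail counts need adjusting at
split time. It is not the kernel function: it leaves out the
compound_lock itself and any base references a real split has to hand
to each new page, and every name in it is hypothetical. It only
assumes, as in the get_page() hunk above, that a pin taken through a
tail page is counted on both the tail and the head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; every name here is hypothetical. */
struct toy_page {
	int count;                   /* models page->_count */
	int is_tail;                 /* models PageTail(page) */
	struct toy_page *first_page; /* head page, only used for tail pages */
};

/* Pin as in the patched get_page(): a tail pin also lands on the head. */
void toy_get_page(struct toy_page *page)
{
	page->count++;
	if (page->is_tail)
		page->first_page->count++;
}

/*
 * Toy split: every tail becomes an independent page. Pins taken through
 * a tail stay on that page and are subtracted from the old head, so no
 * pin is counted twice once the pages are separate.
 */
void toy_split_refcount(struct toy_page *head, struct toy_page *tails,
			size_t ntails)
{
	for (size_t i = 0; i < ntails; i++) {
		head->count -= tails[i].count;
		tails[i].is_tail = 0;
		tails[i].first_page = NULL;
	}
}
```

In the kernel a pin can arrive concurrently with the split, which is
the window the compound_lock is there to close.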