linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Minchan Kim <minchan.kim@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Shaohua Li <shaohua.li@intel.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH] thp: tail page refcounting fix #6
Date: Fri, 30 Sep 2011 15:58:58 +0200	[thread overview]
Message-ID: <20110930135858.GN7768@redhat.com> (raw)
In-Reply-To: <1316793432.9084.47.camel@twins>

Hi everyone,

On Fri, Sep 23, 2011 at 05:57:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-08 at 18:51 +0200, Andrea Arcangeli wrote:
> 
> > +++ b/arch/powerpc/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> 
> lacking a diffstat a quick look seems to suggest you missed a few:
> 
> $ ls arch/*/mm/gup.c
> arch/powerpc/mm/gup.c  
> arch/s390/mm/gup.c  
> arch/sh/mm/gup.c  
> arch/sparc/mm/gup.c  
> arch/x86/mm/gup.c

sh should be already ok, sparc too as they don't seem to support
hugetlbfs or THP. s390 32bit probably too. sh actually has some
HUGETLB optional thing but I don't see it handling it in gup...

But I think I missed s39064bit! But after a closer look surprisingly
they weren't ok before this change too! Even ppc.

In fact they will work better with this change applied because
succeeding a page_cache_get_speculative on a TailPage (like it could
have happened before) would have resulted in a VM_BUG_ON before
returning from page_cache_get_speculative. I guess not many are
testing O_DIRECT on hugetlbfs on ppc64,s39064bit. The PageTail check
in powerpc/mm/gup.c suddenly looks like a noop. Weird.

Also while doing this I found longstanding bugs in powerpc. nr
includes also pages found before calling gup_hugepte. So if that race
triggers (mremap,munmap changing the huge pagetable under gup_fast,
which can definitely happen) it'd lead to put_page being called too
many times on the hugetlbfs page.

Also in the below code probably they intended to use "head", not
"page".... "page" is out of bounds... so the rollback in the race case
was going to free a random page even in older kernels.

       if (unlikely(pte_val(pte) != pte_val(*ptep))) {
               /* Could be optimized better */
               while (*nr) {
                       put_page(page);

But I'm not sure why ppc always checks if the pte changed and tries to
rollback. There's no need of that, if gup_fast could run it means
we're not within a mmu_notifier_invalidate_range_start/end critical
section, and mmu_notifier_invalidate_page won't run until after the
tlb flush returned (so after gup_fast returned on the other cpu), and
so the pages are guaranteed to always have page_count >= 0
(munmap/mremap will stop and wait in
mmu_notifier_invalidate_range_start or in mmu_notifier_invalidate_page
where the IPI delivery will wait before releasing the page count, to
flush any secondary mapping before the primary mapping is allowed to
free the page).

In short there are two fishy things in ppc code (in addition to the
race-rollback corrupting memory which is just a minor implementation
bug trivial to correct, besides after THP we've to put_page the right
subpage not just the head so it must be refactored anyway):

1) page_cache_get/add_speculative is not needed anywhere in gup_fast
path of ppc as long as irqs are disabled (it simulates the cpu doing a tlb
miss). Special care has to be taken so you read the ptep to a local
variable (stored in cpu register or kernel stack), and then you
evaluate the local pte_t (stopping reading from the pointer). Then you
do pte_page on the _local_ pte_t variable, and you know you can
get_page without doing any get_page_unless_zero (plus there's a
VM_BUG_ON in get_page to verify we're not running get_page on a page
with a zero count).

Hmmm explanation of point 1) above self-remind that maybe even x86
should always use ACCESS_ONCE, even for the pmds (it already does for
the pte, either that or it has a smp_rmb() after finish reading it),
(probably not required for upper layers as they can't be tear down
until the IPI runs), now there's no way gcc is stupid enough to
re-read from pointer but in theory it could without barrier(). This
isn't just for THP but for hugetlbfs too, it's only theoretical though.

2) the above pte_val(local_pte_t) != pte(*ptep) checks in theory as
useless for ppc, as long as you verify the pte is ok (so you arrived
reading before mremap/munmap changed the pte), the page can't go away
under you because the tlb flush will wait.

I didn't start yet checking s390 in detail but it looks close to ppc
code. Let's sort out ppc first.

Now I'd like to know if the soft tlb miss handler of powerpc changes
something and really requires the code commented point 1) and 2) (even
if that code and race checks are not required on x86 for the reason I
just mentioned...).

I can make a patch to try to keep these page_cache_get/add_speculative
and the pte_val(local_pte_t) != pte(*ptep) checks intact (they're
superfluous but they can't hurt obviously). So I could make a patch
that works with current code, but if I could safely delete those I'd
prefer as they're quite confusing on why they're needed. But before I
do that I need ack from ppc people to be sure it's ok.. I don't know
the assembly of the tlb miss hashtable handler to be sure... I don't
exclude ppc has different ipi/irq semantics for the gup_fast case and
really requires those checks.

CC'ed Benjamin :)

Comments welcome!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-09-30 13:59 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-19  7:48 [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-19  7:48 ` [PATCH 1/9] mm: rcu read lock for getting reference on pages in migration_entry_wait() Michel Lespinasse
2011-08-19  7:48 ` [PATCH 2/9] mm: avoid calling get_page_unless_zero() when charging cgroups Michel Lespinasse
2011-08-19  7:48 ` [PATCH 3/9] mm: rcu read lock when getting from tail to head page Michel Lespinasse
2011-08-19  7:48 ` [PATCH 4/9] mm: use get_page in deactivate_page() Michel Lespinasse
2011-08-19  7:48 ` [PATCH 5/9] kvm: use get_page instead of get_page_unless_zero Michel Lespinasse
2011-08-19  7:48 ` [PATCH 6/9] mm: assert that get_page_unless_zero() callers hold the rcu lock Michel Lespinasse
2011-08-19 23:28   ` Andi Kleen
2011-08-19  7:48 ` [PATCH 7/9] rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins Michel Lespinasse
2011-08-19  7:48 ` [PATCH 8/9] mm: add API for setting a grace period cookie on compound pages Michel Lespinasse
2011-08-19  7:48 ` [PATCH 9/9] mm: make sure tail page counts are stable before splitting THP pages Michel Lespinasse
2011-08-19  7:53 ` [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-22 21:33 ` [PATCH] thp: tail page refcounting fix Andrea Arcangeli
2011-08-23 14:55   ` Andrea Arcangeli
2011-08-23 16:45   ` Minchan Kim
2011-08-23 16:54     ` Andrea Arcangeli
2011-08-23 19:52   ` Michel Lespinasse
2011-08-24  0:09     ` Andrea Arcangeli
2011-08-24  0:27       ` Andrea Arcangeli
2011-08-24 13:34         ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli
2011-08-26  6:24           ` Michel Lespinasse
2011-08-26 16:10             ` Andrea Arcangeli
2011-08-26 18:54               ` [PATCH] thp: tail page refcounting fix #3 Andrea Arcangeli
2011-08-27  9:41                 ` Michel Lespinasse
2011-08-27 17:34                   ` [PATCH] thp: tail page refcounting fix #4 Andrea Arcangeli
2011-08-29  4:20                     ` Minchan Kim
2011-09-01 15:24                       ` [PATCH] thp: tail page refcounting fix #5 Andrea Arcangeli
2011-09-01 22:27                         ` Michel Lespinasse
2011-09-01 23:28                         ` Andrew Morton
2011-09-01 23:45                           ` Andi Kleen
2011-09-02  0:20                             ` Andrea Arcangeli
2011-09-02  1:17                               ` Andi Kleen
2011-09-02  0:03                         ` Andrew Morton
2011-09-08 16:51                           ` [PATCH] thp: tail page refcounting fix #6 Andrea Arcangeli
2011-09-23 15:57                             ` Peter Zijlstra
2011-09-30 13:58                               ` Andrea Arcangeli [this message]
2011-10-16 20:37                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-16 20:40                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-17 14:41                               ` thp: gup_fast s390/sparc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 1/3] s390: gup_huge_pmd() support THP tail recounting Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 2/3] sparc: gup_pte_range() support THP based " Andrea Arcangeli
2011-10-17 22:44                                   ` David Miller
2011-10-17 14:41                                 ` [PATCH 3/3] thp: share get_huge_page_tail() Andrea Arcangeli
2011-10-17 21:32                               ` fix two more s390/sparc gup_fast bugs Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 1/2] s390: gup_huge_pmd() return 0 if pte changes Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 2/2] powerpc: " Andrea Arcangeli
2011-08-29 22:40                     ` [PATCH] thp: tail page refcounting fix #4 Michel Lespinasse
2011-08-29 23:30                       ` Andrea Arcangeli
2011-08-26 19:28             ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110930135858.GN7768@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=benh@kernel.crashing.org \
    --cc=hughd@google.com \
    --cc=jweiner@redhat.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=shaohua.li@intel.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).