From: Andrea Arcangeli <aarcange@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Yong Sun <yosun@suse.com>,
linux390@de.ibm.com, linux-s390@vger.kernel.org,
Zhang Yi <wetpzy@gmail.com>, Mel Gorman <mgorman@suse.de>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Hugh Dickins <hughd@google.com>,
Dominik Dingel <dingel@linux.vnet.ibm.com>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Christian Borntraeger <borntraeger@de.ibm.com>
Subject: Re: [RFC] futex: prevent endless loop on s390x with emulated hugepages
Date: Thu, 24 Sep 2015 18:51:22 +0200 [thread overview]
Message-ID: <20150924165122.GU25412@redhat.com> (raw)
In-Reply-To: <1443107148-28625-1-git-send-email-vbabka@suse.cz>
On Thu, Sep 24, 2015 at 05:05:48PM +0200, Vlastimil Babka wrote:
> The problem is an endless loop in get_futex_key() when
> CONFIG_TRANSPARENT_HUGEPAGE is enabled and the s390x machine has emulated
> hugepages. The code tries to serialize against __split_huge_page_splitting(),
> but __get_user_pages_fast() fails on the hugetlbfs tail page. This happens
> because pmd_large() is false for emulated hugepages, so the code will proceed
> into gup_pte_range() and fail page_cache_get_speculative() through failing
> get_page_unless_zero() as the tail page count is zero. Failing __gup_fast is
> supposed to be temporary due to a race, so get_futex_key() will try again
> endlessly.
>
> This attempt for a fix is a bandaid solution and probably incomplete.
> Hopefully something better will emerge from the discussion. Fully fixing
> emulated hugepages just for stable backports is unlikely due to them being
> removed. Also THP refcounting redesign should soon remove the trickery from
> get_futex_key().
THP refcounting redesign will simplify things a lot here because the
head page cannot be freed from under us if we hold a reference on the
tail.
With the current split_huge_page that cannot fail, it should be
possible to stop using __get_user_pages_fast to reach the head page
and pin it before it can be freed from under us by using the
compound_lock_irqsave too.
The old code could have done get_page on a already freed head page (if
the THP was splitted after compound_head returned) and this is why it
needed adjustement. Here we just need to safely get a refcount on the
head page.
If we do get_page_unless_zero() on the head page returned by
compound_head, take compound_lock_irqsave and check if the tail page
is still a tail (which means split_huge_page hasn't run yet and it
cannot run anymore by holding the compound_lock), then we can take a
reference on the head page. After we take a reference on the head we
just put_page the tail page and we continue using the page_head.
It should be the very same logic of __get_page_tail, except we don't
want the refcount taken on the tail too (i.e. we must not increase the
mapcount and we should skip the get_huge_page_tail or the head will be
freed again if split_huge_page runs as result of MADV_DONTNEED and it
literally frees the head). We want only one more recount on the head
because the code then only works with page_head and we don't care
about the tail anymore. A new function get_head_page() may work for
that and avoid the pagetable walking.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Yong Sun <yosun@suse.com>,
linux390@de.ibm.com, linux-s390@vger.kernel.org,
Zhang Yi <wetpzy@gmail.com>, Mel Gorman <mgorman@suse.de>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Hugh Dickins <hughd@google.com>,
Dominik Dingel <dingel@linux.vnet.ibm.com>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Christian Borntraeger <borntraeger@de.ibm.com>
Subject: Re: [RFC] futex: prevent endless loop on s390x with emulated hugepages
Date: Thu, 24 Sep 2015 18:51:22 +0200 [thread overview]
Message-ID: <20150924165122.GU25412@redhat.com> (raw)
In-Reply-To: <1443107148-28625-1-git-send-email-vbabka@suse.cz>
On Thu, Sep 24, 2015 at 05:05:48PM +0200, Vlastimil Babka wrote:
> The problem is an endless loop in get_futex_key() when
> CONFIG_TRANSPARENT_HUGEPAGE is enabled and the s390x machine has emulated
> hugepages. The code tries to serialize against __split_huge_page_splitting(),
> but __get_user_pages_fast() fails on the hugetlbfs tail page. This happens
> because pmd_large() is false for emulated hugepages, so the code will proceed
> into gup_pte_range() and fail page_cache_get_speculative() through failing
> get_page_unless_zero() as the tail page count is zero. Failing __gup_fast is
> supposed to be temporary due to a race, so get_futex_key() will try again
> endlessly.
>
> This attempt for a fix is a bandaid solution and probably incomplete.
> Hopefully something better will emerge from the discussion. Fully fixing
> emulated hugepages just for stable backports is unlikely due to them being
> removed. Also THP refcounting redesign should soon remove the trickery from
> get_futex_key().
THP refcounting redesign will simplify things a lot here because the
head page cannot be freed from under us if we hold a reference on the
tail.
With the current split_huge_page that cannot fail, it should be
possible to stop using __get_user_pages_fast to reach the head page
and pin it before it can be freed from under us by using the
compound_lock_irqsave too.
The old code could have done get_page on a already freed head page (if
the THP was splitted after compound_head returned) and this is why it
needed adjustement. Here we just need to safely get a refcount on the
head page.
If we do get_page_unless_zero() on the head page returned by
compound_head, take compound_lock_irqsave and check if the tail page
is still a tail (which means split_huge_page hasn't run yet and it
cannot run anymore by holding the compound_lock), then we can take a
reference on the head page. After we take a reference on the head we
just put_page the tail page and we continue using the page_head.
It should be the very same logic of __get_page_tail, except we don't
want the refcount taken on the tail too (i.e. we must not increase the
mapcount and we should skip the get_huge_page_tail or the head will be
freed again if split_huge_page runs as result of MADV_DONTNEED and it
literally frees the head). We want only one more recount on the head
because the code then only works with page_head and we don't care
about the tail anymore. A new function get_head_page() may work for
that and avoid the pagetable walking.
next prev parent reply other threads:[~2015-09-24 16:51 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-24 15:05 [RFC] futex: prevent endless loop on s390x with emulated hugepages Vlastimil Babka
2015-09-24 15:05 ` Vlastimil Babka
2015-09-24 16:51 ` Andrea Arcangeli [this message]
2015-09-24 16:51 ` Andrea Arcangeli
2015-09-28 11:49 ` Martin Schwidefsky
2015-09-28 11:49 ` Martin Schwidefsky
2015-10-13 11:48 ` Vlastimil Babka
2015-10-13 11:48 ` Vlastimil Babka
2015-10-13 13:51 ` Vlastimil Babka
2015-10-13 13:51 ` Vlastimil Babka
2015-10-13 15:12 ` Martin Schwidefsky
2015-10-13 15:12 ` Martin Schwidefsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150924165122.GU25412@redhat.com \
--to=aarcange@redhat.com \
--cc=borntraeger@de.ibm.com \
--cc=dingel@linux.vnet.ibm.com \
--cc=heiko.carstens@de.ibm.com \
--cc=hughd@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux390@de.ibm.com \
--cc=mgorman@suse.de \
--cc=schwidefsky@de.ibm.com \
--cc=vbabka@suse.cz \
--cc=wetpzy@gmail.com \
--cc=yosun@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.