From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Khalid Aziz <khalid.aziz@oracle.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Pravin Shelar <pshelar@nicira.com>,
Ben Hutchings <bhutchings@solarflare.com>,
Christoph Lameter <cl@linux.com>,
Johannes Weiner <jweiner@redhat.com>,
Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
Andi Kleen <andi@firstfloor.org>,
Minchan Kim <minchan@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Guillaume Morin <guillaume@morinfr.org>
Subject: [PATCH 3.4 02/37] mm: hugetlbfs: fix hugetlbfs optimization
Date: Tue, 4 Feb 2014 13:00:38 -0800 [thread overview]
Message-ID: <20140204210056.070697121@linuxfoundation.org> (raw)
In-Reply-To: <20140204210055.992134150@linuxfoundation.org>
3.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andrea Arcangeli <aarcange@redhat.com>
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream.
Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
caused by THP") can cause dereference of a dangling pointer if
split_huge_page runs during PageHuge() if there are updates to the
tail_page->private field.
Also it is repeating compound_head twice for hugetlbfs and it is running
compound_head+compound_trans_head for THP when a single one is needed in
both cases.
The new code within the PageSlab() check doesn't need to verify that the
THP page size is never bigger than the smallest hugetlbfs page size, to
avoid memory corruption.
A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_unlock label, that is
relevant for the compound_lock path too).
By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the below problem:
echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
BUG: Bad page state in process bash pfn:59a01
page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
page flags: 0x1c00000000008000(tail)
Modules linked in:
CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x55/0x76
bad_page+0xd5/0x130
free_pages_prepare+0x213/0x280
__free_pages+0x36/0x80
update_and_free_page+0xc1/0xd0
free_pool_huge_page+0xc2/0xe0
set_max_huge_pages.part.58+0x14c/0x220
nr_hugepages_store_common.isra.60+0xd0/0xf0
nr_hugepages_store+0x13/0x20
kobj_attr_store+0xf/0x20
sysfs_write_file+0x189/0x1e0
vfs_write+0xc5/0x1f0
SyS_write+0x55/0xb0
system_call_fastpath+0x16/0x1b
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/hugetlb.h | 6 ++
mm/hugetlb.c | 17 +++++++
mm/swap.c | 107 ++++++++++++++++++++++++++++++++++--------------
3 files changed, 101 insertions(+), 29 deletions(-)
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -24,6 +24,7 @@ struct hugepage_subpool *hugepage_new_su
void hugepage_put_subpool(struct hugepage_subpool *spool);
int PageHuge(struct page *page);
+int PageHeadHuge(struct page *page_head);
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -84,6 +85,11 @@ static inline int PageHuge(struct page *
{
return 0;
}
+
+static inline int PageHeadHuge(struct page *page_head)
+{
+ return 0;
+}
static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
{
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -679,6 +679,23 @@ int PageHuge(struct page *page)
}
EXPORT_SYMBOL_GPL(PageHuge);
+/*
+ * PageHeadHuge() only returns true for hugetlbfs head page, but not for
+ * normal or transparent huge pages.
+ */
+int PageHeadHuge(struct page *page_head)
+{
+ compound_page_dtor *dtor;
+
+ if (!PageHead(page_head))
+ return 0;
+
+ dtor = get_compound_page_dtor(page_head);
+
+ return dtor == free_huge_page;
+}
+EXPORT_SYMBOL_GPL(PageHeadHuge);
+
pgoff_t __basepage_index(struct page *page)
{
struct page *page_head = compound_head(page);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -77,18 +77,6 @@ static void __put_compound_page(struct p
static void put_compound_page(struct page *page)
{
- /*
- * hugetlbfs pages cannot be split from under us. If this is a
- * hugetlbfs page, check refcount on head page and release the page if
- * the refcount becomes zero.
- */
- if (PageHuge(page)) {
- page = compound_head(page);
- if (put_page_testzero(page))
- __put_compound_page(page);
- return;
- }
-
if (unlikely(PageTail(page))) {
/* __split_huge_page_refcount can run under us */
struct page *page_head = compound_trans_head(page);
@@ -96,6 +84,35 @@ static void put_compound_page(struct pag
if (likely(page != page_head &&
get_page_unless_zero(page_head))) {
unsigned long flags;
+
+ if (PageHeadHuge(page_head)) {
+ if (likely(PageTail(page))) {
+ /*
+ * __split_huge_page_refcount
+ * cannot race here.
+ */
+ VM_BUG_ON(!PageHead(page_head));
+ atomic_dec(&page->_mapcount);
+ if (put_page_testzero(page_head))
+ VM_BUG_ON(1);
+ if (put_page_testzero(page_head))
+ __put_compound_page(page_head);
+ return;
+ } else {
+ /*
+ * __split_huge_page_refcount
+ * run before us, "page" was a
+ * THP tail. The split
+ * page_head has been freed
+ * and reallocated as slab or
+ * hugetlbfs page of smaller
+ * order (only possible if
+ * reallocated as slab on
+ * x86).
+ */
+ goto skip_lock;
+ }
+ }
/*
* page_head wasn't a dangling pointer but it
* may not be a head page anymore by the time
@@ -107,9 +124,29 @@ static void put_compound_page(struct pag
/* __split_huge_page_refcount run before us */
compound_unlock_irqrestore(page_head, flags);
VM_BUG_ON(PageHead(page_head));
- if (put_page_testzero(page_head))
- __put_single_page(page_head);
- out_put_single:
+skip_lock:
+ if (put_page_testzero(page_head)) {
+ /*
+ * The head page may have been
+ * freed and reallocated as a
+ * compound page of smaller
+ * order and then freed again.
+ * All we know is that it
+ * cannot have become: a THP
+ * page, a compound page of
+ * higher order, a tail page.
+ * That is because we still
+ * hold the refcount of the
+ * split THP tail and
+ * page_head was the THP head
+ * before the split.
+ */
+ if (PageHead(page_head))
+ __put_compound_page(page_head);
+ else
+ __put_single_page(page_head);
+ }
+out_put_single:
if (put_page_testzero(page))
__put_single_page(page);
return;
@@ -173,21 +210,34 @@ bool __get_page_tail(struct page *page)
*/
unsigned long flags;
bool got = false;
- struct page *page_head;
+ struct page *page_head = compound_trans_head(page);
- /*
- * If this is a hugetlbfs page it cannot be split under us. Simply
- * increment refcount for the head page.
- */
- if (PageHuge(page)) {
- page_head = compound_head(page);
- atomic_inc(&page_head->_count);
- got = true;
- goto out;
- }
-
- page_head = compound_trans_head(page);
if (likely(page != page_head && get_page_unless_zero(page_head))) {
+ /* Ref to put_compound_page() comment. */
+ if (PageHeadHuge(page_head)) {
+ if (likely(PageTail(page))) {
+ /*
+ * This is a hugetlbfs
+ * page. __split_huge_page_refcount
+ * cannot race here.
+ */
+ VM_BUG_ON(!PageHead(page_head));
+ __get_page_tail_foll(page, false);
+ return true;
+ } else {
+ /*
+ * __split_huge_page_refcount run
+ * before us, "page" was a THP
+ * tail. The split page_head has been
+ * freed and reallocated as slab or
+ * hugetlbfs page of smaller order
+ * (only possible if reallocated as
+ * slab on x86).
+ */
+ put_page(page_head);
+ return false;
+ }
+ }
/*
* page_head wasn't a dangling pointer but it
* may not be a head page anymore by the time
@@ -204,7 +254,6 @@ bool __get_page_tail(struct page *page)
if (unlikely(!got))
put_page(page_head);
}
-out:
return got;
}
EXPORT_SYMBOL(__get_page_tail);
next prev parent reply other threads:[~2014-02-04 21:00 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-04 21:00 [PATCH 3.4 00/37] 3.4.79-stable review Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 01/37] md/raid5: fix long-standing problem with bitmap handling on write failure Greg Kroah-Hartman
2014-02-04 21:00 ` Greg Kroah-Hartman [this message]
2014-02-04 21:00 ` [PATCH 3.4 03/37] e752x_edac: Fix pci_dev usage count Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 04/37] x86, x32: Correct invalid use of user timespec in the kernel Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 05/37] usb: option: add new zte 3g modem pids to option driver Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 07/37] USB: cypress_m8: fix ring-indicator detection and reporting Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 08/37] USB: Nokia 502 is an unusual device Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 09/37] usb: xhci: Check for XHCI_PLAT in xhci_cleanup_msix() Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 10/37] rtlwifi: rtl8192cu: Add new device ID Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 11/37] rtlwifi: Set the link state Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 12/37] rtlwifi: rtl8192cu: Fix some code in RF handling Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 13/37] b43: Fix lockdep splat Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 14/37] b43: Fix unload oops if firmware is not available Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 15/37] b43legacy: " Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 16/37] b43: fix the wrong assignment of status.freq in b43_rx() Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 17/37] staging: r8712u: Set device type to wlan Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 18/37] staging: vt6656: [BUG] BBvUpdatePreEDThreshold Always set sensitivity on bScanning Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 19/37] tty/serial: at91: Handle shutdown more safely Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 20/37] ARM: at91: smc: bug fix in sam9_smc_cs_read() Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 21/37] serial: add support for 200 v3 series Titan card Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 22/37] parport: parport_pc: remove double PCI ID for NetMos Greg Kroah-Hartman
2014-02-04 21:00 ` [PATCH 3.4 23/37] rtc-cmos: Add an alarm disable quirk Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 24/37] ASoC: adau1701: Fix ADAU1701_SEROCTL_WORD_LEN_16 constant Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 25/37] ALSA: rme9652: fix a missing comma in channel_map_9636_ds[] Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 26/37] ALSA: Enable CONFIG_ZONE_DMA for smaller PCI DMA masks Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 28/37] bnx2x: fix DMA unmapping of TSO split BDs Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 29/37] inet_diag: fix inet_diag_dump_icsk() timewait socket state logic Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 30/37] net: avoid reference counter overflows on fib_rules in multicast forwarding Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 31/37] net,via-rhine: Fix tx_timeout handling Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 32/37] KVM: x86: Fix potential divide by 0 in lapic (CVE-2013-6367) Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 33/37] usb: core: get config and string descriptors for unauthorized devices Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 34/37] SCSI: bfa: Chinook quad port 16G FC HBA claim issue Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 35/37] target/iscsi: Fix network portal creation race Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 36/37] Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() Greg Kroah-Hartman
2014-02-04 21:01 ` [PATCH 3.4 37/37] powerpc: Make sure "cache" directory is removed when offlining cpu Greg Kroah-Hartman
2014-02-04 21:52 ` [PATCH 3.4 00/37] 3.4.79-stable review Guillaume Morin
2014-02-04 22:11 ` Greg Kroah-Hartman
2014-02-04 22:22 ` Guillaume Morin
2014-02-04 22:31 ` Greg Kroah-Hartman
2014-02-04 22:48 ` John Stultz
2014-02-04 23:35 ` Greg Kroah-Hartman
2014-02-05 6:36 ` Guenter Roeck
2014-02-05 20:38 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140204210056.070697121@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=bhutchings@solarflare.com \
--cc=cl@linux.com \
--cc=guillaume@morinfr.org \
--cc=jweiner@redhat.com \
--cc=khalid.aziz@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=minchan@kernel.org \
--cc=pshelar@nicira.com \
--cc=riel@redhat.com \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).