stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Khalid Aziz <khalid.aziz@oracle.com>,
	Pravin B Shelar <pshelar@nicira.com>,
	Christoph Lameter <cl@linux.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Mel Gorman <mel@csn.ul.ie>,
	Rik van Riel <riel@redhat.com>, Minchan Kim <minchan@kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [ 62/71] mm: fix aio performance regression for database caused by THP
Date: Sun, 29 Sep 2013 12:28:14 -0700	[thread overview]
Message-ID: <20130929192647.773952861@linuxfoundation.org> (raw)
In-Reply-To: <20130929192643.539596256@linuxfoundation.org>

3.11-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Khalid Aziz <khalid.aziz@oracle.com>

commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.

I am working with a tool that simulates oracle database I/O workload.
This tool (orion to be specific -
<http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
does aio into these pages from flash disks using various common block
sizes used by database.  I am looking at performance with two of the most
common block sizes - 1M and 64K.  aio performance with these two block
sizes plunged after Transparent HugePages was introduced in the kernel.
Here are performance numbers:

		pre-THP		2.6.39		3.11-rc5
1M read		8384 MB/s	5629 MB/s	6501 MB/s
64K read	7867 MB/s	4576 MB/s	4251 MB/s

I have narrowed the performance impact down to the overheads introduced by
THP in __get_page_tail() and put_compound_page() routines.  perf top shows
>40% of cycles being spent in these two routines.  Every time direct I/O
to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
the pages and calls put_page() when I/O completes to put the reference
away.  THP introduced significant amount of locking overhead to get_page()
and put_page() when dealing with compound pages because hugepages can be
split underneath get_page() and put_page().  It added this overhead
irrespective of whether it is dealing with hugetlbfs pages or transparent
hugepages.  This resulted in 20%-45% drop in aio performance when using
hugetlbfs pages.

Since hugetlbfs pages can not be split, there is no reason to go through
all the locking overhead for these pages from what I can see.  I added
code to __get_page_tail() and put_compound_page() to bypass all the
locking code when working with hugetlbfs pages.  This improved performance
significantly.  Performance numbers with this patch:

		pre-THP		3.11-rc5	3.11-rc5 + Patch
1M read		8384 MB/s	6501 MB/s	8371 MB/s
64K read	7867 MB/s	4251 MB/s	6510 MB/s

Performance with 64K read is still lower than what it was before THP, but
still a 53% improvement.  It does mean there is more work to be done but I
will take a 53% improvement for now.

Please take a look at the following patch and let me know if it looks
reasonable.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/swap.c |   77 +++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 52 insertions(+), 25 deletions(-)

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -31,6 +31,7 @@
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
+#include <linux/hugetlb.h>
 
 #include "internal.h"
 
@@ -81,6 +82,19 @@ static void __put_compound_page(struct p
 
 static void put_compound_page(struct page *page)
 {
+	/*
+	 * hugetlbfs pages cannot be split from under us.  If this is a
+	 * hugetlbfs page, check refcount on head page and release the page if
+	 * the refcount becomes zero.
+	 */
+	if (PageHuge(page)) {
+		page = compound_head(page);
+		if (put_page_testzero(page))
+			__put_compound_page(page);
+
+		return;
+	}
+
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = compound_trans_head(page);
@@ -184,38 +198,51 @@ bool __get_page_tail(struct page *page)
 	 * proper PT lock that already serializes against
 	 * split_huge_page().
 	 */
-	unsigned long flags;
 	bool got = false;
-	struct page *page_head = compound_trans_head(page);
+	struct page *page_head;
 
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+	/*
+	 * If this is a hugetlbfs page it cannot be split under us.  Simply
+	 * increment refcount for the head page.
+	 */
+	if (PageHuge(page)) {
+		page_head = compound_head(page);
+		atomic_inc(&page_head->_count);
+		got = true;
+	} else {
+		unsigned long flags;
+
+		page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+					get_page_unless_zero(page_head))) {
+
+			/* Ref to put_compound_page() comment. */
+			if (PageSlab(page_head)) {
+				if (likely(PageTail(page))) {
+					__get_page_tail_foll(page, false);
+					return true;
+				} else {
+					put_page(page_head);
+					return false;
+				}
+			}
 
-		/* Ref to put_compound_page() comment. */
-		if (PageSlab(page_head)) {
+			/*
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
+			 */
+			flags = compound_lock_irqsave(page_head);
+			/* here __split_huge_page_refcount won't run anymore */
 			if (likely(PageTail(page))) {
 				__get_page_tail_foll(page, false);
-				return true;
-			} else {
-				put_page(page_head);
-				return false;
+				got = true;
 			}
+			compound_unlock_irqrestore(page_head, flags);
+			if (unlikely(!got))
+				put_page(page_head);
 		}
-
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
 	}
 	return got;
 }



  parent reply	other threads:[~2013-09-29 19:28 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-29 19:27 [ 00/71] 3.11.3-stable review Greg Kroah-Hartman
2013-09-29 19:27 ` [ 01/71] PCI / ACPI / PM: Clear pme_poll for devices in D3cold on wakeup Greg Kroah-Hartman
2013-09-29 19:27 ` [ 02/71] ARM: OMAP4: Fix clock_get error for GPMC during boot Greg Kroah-Hartman
2013-09-29 19:27 ` [ 03/71] net: usb: cdc_ether: Use wwan interface for Telit modules Greg Kroah-Hartman
2013-09-29 19:27 ` [ 04/71] cifs: fix filp leak in cifs_atomic_open() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 05/71] bgmac: fix internal switch initialization Greg Kroah-Hartman
2013-09-29 19:27 ` [ 06/71] rt2800: change initialization sequence to fix system freeze Greg Kroah-Hartman
2013-09-29 19:27 ` [ 07/71] rt2800: fix wrong TX power compensation Greg Kroah-Hartman
2013-09-29 19:27 ` [ 08/71] timekeeping: Fix HRTICK related deadlock from ntp lock changes Greg Kroah-Hartman
2013-09-29 19:27 ` [ 09/71] sched/cputime: Do not scale when utime == 0 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 10/71] sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones Greg Kroah-Hartman
2013-09-29 19:27 ` [ 11/71] HID: provide a helper for validating hid reports Greg Kroah-Hartman
2013-09-29 19:27 ` [ 12/71] HID: validate feature and input report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 13/71] HID: multitouch: validate indexes details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 14/71] HID: LG: validate HID output report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 15/71] HID: zeroplus: validate " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 16/71] HID: lenovo-tpkbd: fix leak if tpkbd_probe_tp fails Greg Kroah-Hartman
2013-09-29 19:27 ` [ 17/71] HID: steelseries: validate output report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 18/71] HID: sony: validate HID " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 19/71] HID: lenovo-tpkbd: validate " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 20/71] HID: logitech-dj: " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 21/71] usb: gadget: fix a bug and a WARN_ON in dummy-hcd Greg Kroah-Hartman
2013-09-29 19:27 ` [ 22/71] drm/i915: try not to lose backlight CBLV precision Greg Kroah-Hartman
2013-09-29 19:27 ` [ 23/71] drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock Greg Kroah-Hartman
2013-09-29 19:27 ` [ 24/71] drm/i915: fix gpu hang vs. flip stall deadlocks Greg Kroah-Hartman
2013-09-29 19:27 ` [ 25/71] drm/i915: fix wait_for_pending_flips vs gpu hang deadlock Greg Kroah-Hartman
2013-09-29 19:27 ` [ 26/71] drm/i915: do not update cursor in crtc mode set Greg Kroah-Hartman
2013-09-29 19:27 ` [ 27/71] drm/i915: Dont enable the cursor on a disable pipe Greg Kroah-Hartman
2013-09-29 19:27 ` [ 28/71] drm: fix DRM_IOCTL_MODE_GETFB handle-leak Greg Kroah-Hartman
2013-09-29 19:27 ` [ 29/71] drm/ast: fix the ast open key function Greg Kroah-Hartman
2013-09-29 19:27 ` [ 30/71] drm/ttm: fix the tt_populated check in ttm_tt_destroy() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 31/71] radeon kms: fix uninitialised hotplug work usage in r100_irq_process() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 32/71] drm/nv50/disp: prevent false output detection on the original nv50 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 33/71] drm/radeon: fix LCD record parsing Greg Kroah-Hartman
2013-09-29 19:27 ` [ 34/71] drm/radeon/dpm: add reclocking quirk for ASUS K70AF Greg Kroah-Hartman
2013-09-29 19:27 ` [ 35/71] drm/radeon: fix endian bugs in hw i2c atom routines Greg Kroah-Hartman
2013-09-29 19:27 ` [ 36/71] drm/radeon: enable UVD interrupts on CIK Greg Kroah-Hartman
2013-09-29 19:27 ` [ 37/71] drm/radeon: fill in gpu_init for berlin GPU cores Greg Kroah-Hartman
2013-09-29 19:27 ` [ 38/71] drm/radeon: update line buffer allocation for dce8 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 39/71] drm/radeon: fix init ordering for r600+ Greg Kroah-Hartman
2013-09-29 19:27 ` [ 40/71] drm/radeon/cik: update gpu_init for an additional berlin gpu Greg Kroah-Hartman
2013-09-29 19:27 ` [ 41/71] drm/radeon: add berlin pci ids Greg Kroah-Hartman
2013-09-29 19:27 ` [ 42/71] drm/radeon/si: Add support for CP DMA to CS checker for compute v2 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 43/71] drm/radeon: update line buffer allocation for dce4.1/5 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 44/71] drm/radeon: update line buffer allocation for dce6 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 45/71] drm/radeon: fix resume on some rs4xx boards (v2) Greg Kroah-Hartman
2013-09-29 19:27 ` [ 46/71] drm/radeon: fix handling of variable sized arrays for router objects Greg Kroah-Hartman
2013-09-29 19:27 ` [ 47/71] drm/radeon/dpm: make sure dc performance level limits are valid (BTC-SI) (v2) Greg Kroah-Hartman
2013-09-29 19:28 ` [ 48/71] tg3: Dont turn off led on 5719 serdes port 0 Greg Kroah-Hartman
2013-09-29 19:28 ` [ 49/71] tg3: Expand led off fix to include 5720 Greg Kroah-Hartman
2013-09-29 19:28 ` [ 50/71] drm/radeon: add some additional berlin pci ids Greg Kroah-Hartman
2013-09-29 19:28 ` [ 51/71] drm/radeon/r6xx: add a stubbed out set_uvd_clocks callback Greg Kroah-Hartman
2013-09-29 19:28 ` [ 52/71] drm/radeon/atom: workaround vbios bug in transmitter table on rs880 (v2) Greg Kroah-Hartman
2013-09-29 19:28 ` [ 53/71] drm/radeon/dpm: handle bapm on trinity Greg Kroah-Hartman
2013-09-29 19:28 ` [ 54/71] drm/radeon/dpm: fix fallback for empty UVD clocks Greg Kroah-Hartman
2013-09-29 19:28 ` [ 55/71] drm/radeon/dpm/rs780: dont enable sclk scaling if not required Greg Kroah-Hartman
2013-09-29 19:28 ` [ 56/71] drm/radeon: fix panel scaling with eDP and LVDS bridges Greg Kroah-Hartman
2013-09-29 19:28 ` [ 57/71] drm/radeon: avoid UVD corruptions on AGP cards Greg Kroah-Hartman
2013-09-29 19:28 ` [ 58/71] skge: fix broken driver Greg Kroah-Hartman
2013-09-29 19:28 ` [ 59/71] udf: Standardize return values in mount sequence Greg Kroah-Hartman
2013-09-29 19:28 ` [ 60/71] udf: Refuse RW mount of the filesystem instead of making it RO Greg Kroah-Hartman
2013-09-29 19:28 ` [ 61/71] audit: fix endless wait in audit_log_start() Greg Kroah-Hartman
2013-09-29 19:28 ` Greg Kroah-Hartman [this message]
2013-09-29 19:28 ` [ 63/71] bio-integrity: Fix use of bs->bio_integrity_pool after free Greg Kroah-Hartman
2013-09-29 19:28 ` [ 64/71] cfq: explicitly use 64bit divide operation for 64bit arguments Greg Kroah-Hartman
2013-09-29 19:28 ` [ 65/71] rpc: clean up decoding of gssproxy linux creds Greg Kroah-Hartman
2013-09-29 19:28 ` [ 66/71] rpc: comment on linux_cred encoding, treat all as unsigned Greg Kroah-Hartman
2013-09-29 19:28 ` [ 67/71] rpc: fix huge kmallocs in gss-proxy Greg Kroah-Hartman
2013-09-29 19:28 ` [ 68/71] rpc: let xdr layer allocate gssproxy receieve pages Greg Kroah-Hartman
2013-09-29 19:28 ` [ 69/71] cw1200: Prevent a lock-related hang in the cw1200_spi driver Greg Kroah-Hartman
2013-09-29 19:28 ` [ 70/71] cw1200: Dont perform SPI transfers in interrupt context Greg Kroah-Hartman
2013-10-02  2:23   ` Solomon Peachy
2013-10-02 21:26     ` Greg Kroah-Hartman
2013-10-03 13:22       ` Solomon Peachy
2013-09-29 19:28 ` [ 71/71] netfilter: ipset: Fix serious failure in CIDR tracking Greg Kroah-Hartman
2013-09-30  1:28 ` [ 00/71] 3.11.3-stable review Guenter Roeck
2013-09-30  1:51   ` Greg Kroah-Hartman
2013-09-30  2:22     ` Guenter Roeck
2013-10-01 19:23 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130929192647.773952861@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=cl@linux.com \
    --cc=hannes@cmpxchg.org \
    --cc=khalid.aziz@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@csn.ul.ie \
    --cc=minchan@kernel.org \
    --cc=pshelar@nicira.com \
    --cc=riel@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).