public inbox for linux-kernel@vger.kernel.org
From: Jack Wang <jinpu.wang@profitbricks.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Luis Henriques <luis.henriques@canonical.com>,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	kernel-team@lists.ubuntu.com,
	Khalid Aziz <khalid.aziz@oracle.com>,
	Pravin B Shelar <pshelar@nicira.com>,
	Christoph Lameter <cl@linux.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Mel Gorman <mel@csn.ul.ie>,
	Rik van Riel <riel@redhat.com>, Minchan Kim <minchan@kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 092/104] mm: fix aio performance regression for database caused by THP
Date: Mon, 30 Sep 2013 15:14:52 +0200
Message-ID: <5249794C.5050204@profitbricks.com>
In-Reply-To: <1380535881-9239-93-git-send-email-luis.henriques@canonical.com>

On 09/30/2013 12:11 PM, Luis Henriques wrote:
> 3.5.7.22 -stable review patch.  If anyone has any objections, please let me know.
> 
> ------------------
> 
> From: Khalid Aziz <khalid.aziz@oracle.com>
> 
> commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.
> 
> I am working with a tool that simulates oracle database I/O workload.
> This tool (orion to be specific -
> <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
> allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
> does aio into these pages from flash disks using various common block
> sizes used by database.  I am looking at performance with two of the most
> common block sizes - 1M and 64K.  aio performance with these two block
> sizes plunged after Transparent HugePages was introduced in the kernel.
> Here are performance numbers:
> 
> 		pre-THP		2.6.39		3.11-rc5
> 1M read		8384 MB/s	5629 MB/s	6501 MB/s
> 64K read	7867 MB/s	4576 MB/s	4251 MB/s
> 
> I have narrowed the performance impact down to the overheads introduced by
> THP in __get_page_tail() and put_compound_page() routines.  perf top
> shows >40% of cycles being spent in these two routines.  Every time direct I/O
> to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
> the pages and calls put_page() when I/O completes to put the reference
> away.  THP introduced significant amount of locking overhead to get_page()
> and put_page() when dealing with compound pages because hugepages can be
> split underneath get_page() and put_page().  It added this overhead
> irrespective of whether it is dealing with hugetlbfs pages or transparent
> hugepages.  This resulted in 20%-45% drop in aio performance when using
> hugetlbfs pages.
> 
> Since hugetlbfs pages can not be split, there is no reason to go through
> all the locking overhead for these pages from what I can see.  I added
> code to __get_page_tail() and put_compound_page() to bypass all the
> locking code when working with hugetlbfs pages.  This improved performance
> significantly.  Performance numbers with this patch:
> 
> 		pre-THP		3.11-rc5	3.11-rc5 + Patch
> 1M read		8384 MB/s	6501 MB/s	8371 MB/s
> 64K read	7867 MB/s	4251 MB/s	6510 MB/s
> 
> Performance with 64K read is still lower than what it was before THP, but
> still a 53% improvement.  It does mean there is more work to be done but I
> will take a 53% improvement for now.
> 
> Please take a look at the following patch and let me know if it looks
> reasonable.
> 
> [akpm@linux-foundation.org: tweak comments]
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Cc: Pravin B Shelar <pshelar@nicira.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Andi Kleen <andi@firstfloor.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [ luis: backported to 3.5: adjusted context ]
> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Hi Greg,

I suppose this patch is also needed for 3.4, right?

Regards,
Jack


> ---
>  mm/swap.c | 77 ++++++++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 52 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 4e7e2ec..0c833e8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -30,6 +30,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/memcontrol.h>
>  #include <linux/gfp.h>
> +#include <linux/hugetlb.h>
>  
>  #include "internal.h"
>  
> @@ -77,6 +78,19 @@ static void __put_compound_page(struct page *page)
>  
>  static void put_compound_page(struct page *page)
>  {
> +	/*
> +	 * hugetlbfs pages cannot be split from under us.  If this is a
> +	 * hugetlbfs page, check refcount on head page and release the page if
> +	 * the refcount becomes zero.
> +	 */
> +	if (PageHuge(page)) {
> +		page = compound_head(page);
> +		if (put_page_testzero(page))
> +			__put_compound_page(page);
> +
> +		return;
> +	}
> +
>  	if (unlikely(PageTail(page))) {
>  		/* __split_huge_page_refcount can run under us */
>  		struct page *page_head = compound_trans_head(page);
> @@ -180,38 +194,51 @@ bool __get_page_tail(struct page *page)
>  	 * proper PT lock that already serializes against
>  	 * split_huge_page().
>  	 */
> -	unsigned long flags;
>  	bool got = false;
> -	struct page *page_head = compound_trans_head(page);
> +	struct page *page_head;
>  
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> +	/*
> +	 * If this is a hugetlbfs page it cannot be split under us.  Simply
> +	 * increment refcount for the head page.
> +	 */
> +	if (PageHuge(page)) {
> +		page_head = compound_head(page);
> +		atomic_inc(&page_head->_count);
> +		got = true;
> +	} else {
> +		unsigned long flags;
> +
> +		page_head = compound_trans_head(page);
> +		if (likely(page != page_head &&
> +					get_page_unless_zero(page_head))) {
> +
> +			/* Ref to put_compound_page() comment. */
> +			if (PageSlab(page_head)) {
> +				if (likely(PageTail(page))) {
> +					__get_page_tail_foll(page, false);
> +					return true;
> +				} else {
> +					put_page(page_head);
> +					return false;
> +				}
> +			}
>  
> -		/* Ref to put_compound_page() comment. */
> -		if (PageSlab(page_head)) {
> +			/*
> +			 * page_head wasn't a dangling pointer but it
> +			 * may not be a head page anymore by the time
> +			 * we obtain the lock. That is ok as long as it
> +			 * can't be freed from under us.
> +			 */
> +			flags = compound_lock_irqsave(page_head);
> +			/* here __split_huge_page_refcount won't run anymore */
>  			if (likely(PageTail(page))) {
>  				__get_page_tail_foll(page, false);
> -				return true;
> -			} else {
> -				put_page(page_head);
> -				return false;
> +				got = true;
>  			}
> +			compound_unlock_irqrestore(page_head, flags);
> +			if (unlikely(!got))
> +				put_page(page_head);
>  		}
> -
> -		/*
> -		 * page_head wasn't a dangling pointer but it
> -		 * may not be a head page anymore by the time
> -		 * we obtain the lock. That is ok as long as it
> -		 * can't be freed from under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		/* here __split_huge_page_refcount won't run anymore */
> -		if (likely(PageTail(page))) {
> -			__get_page_tail_foll(page, false);
> -			got = true;
> -		}
> -		compound_unlock_irqrestore(page_head, flags);
> -		if (unlikely(!got))
> -			put_page(page_head);
>  	}
>  	return got;
>  }
> 


