From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Vlastimil Babka <vbabka@suse.cz>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	David Rientjes <rientjes@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Michal Hocko <mhocko@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.1 41/65] mm, thp: respect MPOL_PREFERRED policy with non-local node
Date: Sun, 19 Jul 2015 12:08:00 -0700
Message-ID: <20150719190810.809975560@linuxfoundation.org>
In-Reply-To: <20150719190809.469715936@linuxfoundation.org>

4.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Vlastimil Babka <vbabka@suse.cz>

commit 0867a57c4f80a566dda1bac975b42fcd857cb489 upstream.

Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
local node"), we handle THP allocations on page fault in a special way -
for non-interleave memory policies, the allocation is only attempted on
the node local to the current CPU, if the policy's nodemask allows the
node.

This is motivated by the assumption that THP benefits cannot offset the
cost of remote accesses, so it's better to fall back to base pages on the
local node (which might still be available, while huge pages are not due
to fragmentation) than to allocate huge pages on a remote node.

The nodemask check prevents us from violating e.g. MPOL_BIND policies
where the local node is not among the allowed nodes.  However, the
current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different from the
current CPU's local node.

In such a case we should honor the preferred node and not use the local
node, which is what this patch does.  If hugepage allocation on the
preferred node fails, we fall back to base pages and don't try other
nodes, with the same motivation as for the local node hugepage
allocations.  The patch also moves the MPOL_INTERLEAVE check around to
simplify the hugepage specific test.
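
To illustrate the node selection described above, here is a small
user-space model in C.  It is only a sketch, not kernel code: the
model_* names, the bitmask representation of the nodemask and the -1
return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's memory policy modes. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
};

/* Hypothetical, simplified stand-in for struct mempolicy. */
struct model_policy {
	enum model_mode mode;
	bool local;             /* models the MPOL_F_LOCAL flag */
	int preferred_node;     /* models pol->v.preferred_node */
	unsigned long nodemask; /* bit n set = node n allowed, 0 = no mask */
};

/*
 * Return the single node on which a THP allocation would be attempted
 * after the patch, or -1 when the allocation takes the standard path
 * (interleave, or a nodemask that excludes the chosen node).
 */
int model_thp_node(const struct model_policy *pol, int local_node)
{
	int hpage_node = local_node;

	assert(local_node >= 0);

	/* Interleave is now handled first and never restricted to one node. */
	if (pol->mode == MODEL_INTERLEAVE)
		return -1;

	/* An explicitly preferred node overrides the CPU-local node. */
	if (pol->mode == MODEL_PREFERRED && !pol->local)
		hpage_node = pol->preferred_node;

	/* Only restrict the allocation if the policy allows that node. */
	if (pol->nodemask == 0 || (pol->nodemask & (1UL << hpage_node)))
		return hpage_node;

	return -1;
}
```

In this model, a MODEL_PREFERRED policy with preferred_node 0 evaluated
with local_node 1 (roughly "numactl -p0 -C12") yields node 0, whereas
the pre-patch logic would have used the CPU-local node 1.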

The difference can be demonstrated using the in-tree transhuge-stress
test on the following 2-node machine, where about half of the memory on
one node (node 0) was occupied.

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 7878 MB
node 0 free: 3623 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 8045 MB
node 1 free: 7818 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Before the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages

The number of successful THP allocations corresponds to free memory on node 0
in the first case and node 1 in the second case, i.e. the -p parameter is
ignored and CPU binding "wins".

After the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -p1 -C0 ./transhuge-stress
transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages

The -p parameter is respected regardless of CPU binding.

> numactl -C0 ./transhuge-stress
transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -C12 ./transhuge-stress
transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages

Without the -p parameter, hugepage restriction to the CPU-local node works as
before.

Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/mempolicy.c |   38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1972,35 +1972,41 @@ retry_cpuset:
 	pol = get_vma_policy(vma, addr);
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage &&
-					pol->mode != MPOL_INTERLEAVE)) {
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned nid;
+
+		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+		mpol_cond_put(pol);
+		page = alloc_page_interleave(gfp, order, nid);
+		goto out;
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
 		/*
 		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node, we only try to allocate from the
-		 * current node and don't fall back to other nodes, as the
-		 * cost of remote accesses would likely offset THP benefits.
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
 		 *
 		 * If the policy is interleave, or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
+		if (pol->mode == MPOL_PREFERRED &&
+						!(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
 		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(node, *nmask)) {
+		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = alloc_pages_exact_node(node,
+			page = alloc_pages_exact_node(hpage_node,
 						gfp | __GFP_THISNODE, order);
 			goto out;
 		}
 	}
 
-	if (pol->mode == MPOL_INTERLEAVE) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
-		goto out;
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	zl = policy_zonelist(gfp, pol, node);
 	mpol_cond_put(pol);



