From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 3.14 75/77] mm: page_alloc: reduce cost of the fair zone allocation policy
Date: Tue, 27 Jan 2015 17:27:53 -0800
Message-ID: <20150128012748.207832838@linuxfoundation.org>
In-Reply-To: <20150128012745.971137091@linuxfoundation.org>

3.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mel Gorman <mgorman@suse.de>

commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream.

The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim.  If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.

One assumption made with this scheme is that batches expire at roughly
the same time and the resets each time are justified.  This assumption
does not hold when zones reach their low watermark as the batches will
be consumed at uneven rates.  Allocation failure due to watermark
depletion results in additional zonelist scans for the reset and another
watermark check before hitting the slowpath.
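
As a rough illustration of the approach taken in the diff below, here is a
minimal user-space sketch.  It is not kernel code: struct toy_zone,
toy_reset_batches() and toy_alloc_page() are hypothetical names, a single
node is assumed, and watermarks are reduced to a plain free-page count.
The idea it models is that a zone is flagged once its batch runs out, the
fair pass counts how many zones it skipped for that reason, and the
batches are reset and the list rescanned only when at least one zone was
actually skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct zone; not the kernel type. */
struct toy_zone {
	long free_pages;	/* pages still available in the zone */
	long batch;		/* remaining fair-allocation batch */
	long batch_size;	/* value the batch is reset to (assumed >= 1) */
	bool fair_depleted;	/* plays the role of ZONE_FAIR_DEPLETED */
};

/* Refill every batch and clear the depleted flag in the same pass. */
static void toy_reset_batches(struct toy_zone *zones, int n)
{
	for (int i = 0; i < n; i++) {
		zones[i].batch = zones[i].batch_size;
		zones[i].fair_depleted = false;
	}
}

/* Take one page; return the index of the zone used, or -1 on failure. */
static int toy_alloc_page(struct toy_zone *zones, int n)
{
	bool fair = true;
	bool rescan;
	int nr_fair_skipped = 0;

	assert(n > 0);
	do {
		rescan = false;
		for (int i = 0; i < n; i++) {
			struct toy_zone *z = &zones[i];

			if (fair && z->fair_depleted) {
				nr_fair_skipped++;	/* fairness cost us this zone */
				continue;
			}
			if (z->free_pages == 0)
				continue;
			z->free_pages--;
			if (--z->batch <= 0)
				z->fair_depleted = true;
			return i;
		}
		if (fair) {
			/* Drop fairness; rescan only if it skipped a zone. */
			fair = false;
			if (nr_fair_skipped) {
				toy_reset_batches(zones, n);
				rescan = true;
			}
		}
	} while (rescan);

	return -1;
}
```

In this toy model, when no zone was skipped for fairness a failed scan
returns immediately instead of resetting the batches and walking the list
a second time.  The real patch additionally rescans when more than one
node is online or when the zlc cache was active.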

On UMA, the benefit is negligible -- around 0.25%.  On a 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes.  The system CPU overhead comparison looks like:

          3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
             vanilla   vmstat-v5 lowercost-v5
User          746.94      774.56      802.00
System      65336.22    32847.27    40852.33
Elapsed     27553.52    27415.04    27368.46

However, it is worth noting that the overall benchmark still completed
faster, and intuitively it makes sense to take as few passes as possible
through the zonelists.
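
For readers who want to put the table into relative terms, the following
small helper is one way to do the arithmetic.  pct_change() and
print_comparison() are hypothetical names introduced for this sketch; the
figures are copied from the table above.  In that table, System time for
lowercost-v5 is below vanilla but above vmstat-v5, while its Elapsed time
is the lowest of the three.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage change from base to value; negative means a reduction. */
static double pct_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}

/* Print lowercost-v5 relative to the other two kernels in the table. */
static void print_comparison(void)
{
	const double sys_vanilla = 65336.22, sys_vmstat = 32847.27;
	const double sys_lowercost = 40852.33;
	const double el_vanilla = 27553.52, el_vmstat = 27415.04;
	const double el_lowercost = 27368.46;

	printf("System  vs vanilla:   %+.2f%%\n",
	       pct_change(sys_vanilla, sys_lowercost));
	printf("System  vs vmstat-v5: %+.2f%%\n",
	       pct_change(sys_vmstat, sys_lowercost));
	printf("Elapsed vs vanilla:   %+.2f%%\n",
	       pct_change(el_vanilla, el_lowercost));
	printf("Elapsed vs vmstat-v5: %+.2f%%\n",
	       pct_change(el_vmstat, el_lowercost));
}
```

Calling print_comparison() from a main() prints the four relative
differences with an explicit sign.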

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/mmzone.h |    6 ++
 mm/page_alloc.c        |  101 +++++++++++++++++++++++++------------------------
 2 files changed, 59 insertions(+), 48 deletions(-)

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -529,6 +529,7 @@ typedef enum {
 	ZONE_WRITEBACK,			/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -566,6 +567,11 @@ static inline int zone_is_reclaim_locked
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
 }
 
+static inline int zone_is_fair_depleted(const struct zone *zone)
+{
+	return test_bit(ZONE_FAIR_DEPLETED, &zone->flags);
+}
+
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1614,6 +1614,9 @@ again:
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
+	    !zone_is_fair_depleted(zone))
+		zone_set_flag(zone, ZONE_FAIR_DEPLETED);
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1938,6 +1941,18 @@ static inline void init_zone_allows_recl
 }
 #endif	/* CONFIG_NUMA */
 
+static void reset_alloc_batches(struct zone *preferred_zone)
+{
+	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
+
+	do {
+		mod_zone_page_state(zone, NR_ALLOC_BATCH,
+			high_wmark_pages(zone) - low_wmark_pages(zone) -
+			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
+		zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
+	} while (zone++ != preferred_zone);
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -1955,8 +1970,12 @@ get_page_from_freelist(gfp_t gfp_mask, n
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
+	int nr_fair_skipped = 0;
+	bool zonelist_rescan;
 
 zonelist_scan:
+	zonelist_rescan = false;
+
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
@@ -1981,8 +2000,10 @@ zonelist_scan:
 		if (alloc_flags & ALLOC_FAIR) {
 			if (!zone_local(preferred_zone, zone))
 				break;
-			if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0)
+			if (zone_is_fair_depleted(zone)) {
+				nr_fair_skipped++;
 				continue;
+			}
 		}
 		/*
 		 * When allocating a page cache page for writing, we
@@ -2088,13 +2109,7 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		goto zonelist_scan;
-	}
-
-	if (page)
+	if (page) {
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -2103,8 +2118,37 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		return page;
+	}
 
-	return page;
+	/*
+	 * The first pass makes sure allocations are spread fairly within the
+	 * local node.  However, the local node might have free pages left
+	 * after the fairness batches are exhausted, and remote zones haven't
+	 * even been considered yet.  Try once more without fairness, and
+	 * include remote zones now, before entering the slowpath and waking
+	 * kswapd: prefer spilling to a remote zone over swapping locally.
+	 */
+	if (alloc_flags & ALLOC_FAIR) {
+		alloc_flags &= ~ALLOC_FAIR;
+		if (nr_fair_skipped) {
+			zonelist_rescan = true;
+			reset_alloc_batches(preferred_zone);
+		}
+		if (nr_online_nodes > 1)
+			zonelist_rescan = true;
+	}
+
+	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		zonelist_rescan = true;
+	}
+
+	if (zonelist_rescan)
+		goto zonelist_scan;
+
+	return NULL;
 }
 
 /*
@@ -2433,28 +2477,6 @@ __alloc_pages_high_priority(gfp_t gfp_ma
 	return page;
 }
 
-static void reset_alloc_batches(struct zonelist *zonelist,
-				enum zone_type high_zoneidx,
-				struct zone *preferred_zone)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		/*
-		 * Only reset the batches of zones that were actually
-		 * considered in the fairness pass, we don't want to
-		 * trash fairness information for zones that are not
-		 * actually part of this zonelist's round-robin cycle.
-		 */
-		if (!zone_local(preferred_zone, zone))
-			continue;
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-	}
-}
-
 static void wake_all_kswapds(unsigned int order,
 			     struct zonelist *zonelist,
 			     enum zone_type high_zoneidx,
@@ -2792,29 +2814,12 @@ retry_cpuset:
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
-retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
-		 * The first pass makes sure allocations are spread
-		 * fairly within the local node.  However, the local
-		 * node might have free pages left after the fairness
-		 * batches are exhausted, and remote zones haven't
-		 * even been considered yet.  Try once more without
-		 * fairness, and include remote zones now, before
-		 * entering the slowpath and waking kswapd: prefer
-		 * spilling to a remote zone over swapping locally.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			reset_alloc_batches(zonelist, high_zoneidx,
-					    preferred_zone);
-			alloc_flags &= ~ALLOC_FAIR;
-			goto retry;
-		}
-		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
 		 * complete.


