* [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support
@ 2021-11-03 17:05 Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit() Nicolas Saenz Julienne
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-03 17:05 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, vbabka, cl, ppandit,
Nicolas Saenz Julienne
This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
page lists which will allow remote CPUs to drain them. Currently, only the
local CPU is permitted to change its per-cpu lists, and it's expected to do so
on demand whenever another process requires it (by means of queueing a drain
task on the local CPU). Most systems will handle this promptly, but it'll cause
problems for NOHZ_FULL CPUs that can't take any sort of interruption without
breaking their functional guarantees (latency, bandwidth, etc.).
This new locking scheme, based on per-cpu spinlocks, is the simplest and most
maintainable approach so far[1], although it also has a drawback: it comes with
a small performance cost. Depending on the page allocation code path
micro-benchmark, we can expect 0% to 0.6% degradation on x86_64, and 0% to 2%
on arm64[2].
Assuming there is nothing too horrible in the patches themselves I believe it
all comes down to whether we prefer to take the small performance hit vs the
maintenance burden of a more complex solution[1]. I don't have enough
experience with performance tuning, nor with maintenance to have an
authoritative opinion here, so I'll defer to whatever is hopefully discussed
here. Also, I'll be happy to run any extra tests that I might have missed.
Patch #1 could be taken regardless of the rest of the series as it removes dead
code.
The series is based on today's linux-next.
Changes since v1:
- Provide performance numbers
- Unconditionally use per-cpu spinlocks
[1] Other approaches can be found here:
- Static branch conditional on nohz_full, no performance loss, but the extra
config option makes it painful to maintain (v1):
https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
- RCU based approach, complex, yet a bit less taxing performance wise
(RFC):
https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
[2] See individual patches for in-depth results
---
Nicolas Saenz Julienne (3):
mm/page_alloc: Don't pass pfn to free_unref_page_commit()
mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
locks
mm/page_alloc: Remotely drain per-cpu lists
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 151 ++++++++++++++---------------------------
2 files changed, 52 insertions(+), 100 deletions(-)
--
2.33.1
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit()
2021-11-03 17:05 [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Nicolas Saenz Julienne
@ 2021-11-03 17:05 ` Nicolas Saenz Julienne
2021-11-23 14:41 ` Vlastimil Babka
2021-11-03 17:05 ` [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks Nicolas Saenz Julienne
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-03 17:05 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, vbabka, cl, ppandit,
Nicolas Saenz Julienne
free_unref_page_commit() doesn't make use of its pfn argument, so get
rid of it.
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
mm/page_alloc.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..9ef03dfb8f95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3355,8 +3355,8 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
return min(READ_ONCE(pcp->batch) << 2, high);
}
-static void free_unref_page_commit(struct page *page, unsigned long pfn,
- int migratetype, unsigned int order)
+static void free_unref_page_commit(struct page *page, int migratetype,
+ unsigned int order)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
@@ -3405,7 +3405,7 @@ void free_unref_page(struct page *page, unsigned int order)
}
local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, pfn, migratetype, order);
+ free_unref_page_commit(page, migratetype, order);
local_unlock_irqrestore(&pagesets.lock, flags);
}
@@ -3415,13 +3415,13 @@ void free_unref_page(struct page *page, unsigned int order)
void free_unref_page_list(struct list_head *list)
{
struct page *page, *next;
- unsigned long flags, pfn;
+ unsigned long flags;
int batch_count = 0;
int migratetype;
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
- pfn = page_to_pfn(page);
+ unsigned long pfn = page_to_pfn(page);
if (!free_unref_page_prepare(page, pfn, 0)) {
list_del(&page->lru);
continue;
@@ -3437,15 +3437,10 @@ void free_unref_page_list(struct list_head *list)
free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
continue;
}
-
- set_page_private(page, pfn);
}
local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
- pfn = page_private(page);
- set_page_private(page, 0);
-
/*
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
@@ -3455,7 +3450,7 @@ void free_unref_page_list(struct list_head *list)
migratetype = MIGRATE_MOVABLE;
trace_mm_page_free_batched(page);
- free_unref_page_commit(page, pfn, migratetype, 0);
+ free_unref_page_commit(page, migratetype, 0);
/*
* Guard against excessive IRQ disabled times when we get
--
2.33.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks
2021-11-03 17:05 [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit() Nicolas Saenz Julienne
@ 2021-11-03 17:05 ` Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists Nicolas Saenz Julienne
2021-11-23 14:58 ` [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Vlastimil Babka
3 siblings, 0 replies; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-03 17:05 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, vbabka, cl, ppandit,
Nicolas Saenz Julienne
page_alloc's per-cpu page lists are currently protected by local locks.
This forces any remote operation dependent on draining them to schedule
drain work on all CPUs. This doesn't play well with NOHZ_FULL CPUs,
which can't be bothered to run housekeeping tasks.
As a first step to mitigate this, convert the current locking scheme to
per-cpu spinlocks. The conversion also moves the actual lock into
'struct per_cpu_pages' which is nicer code, but also essential in order
to couple access to the lock and lists. One side effect of this is a
more complex free_unref_page_list(), as the patch tries to maintain
previous function optimizations[1]. Other than that the conversion
itself is mostly trivial.
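The batching that free_unref_page_list() ends up with -- keep holding one
zone's pcp lock while consecutive pages belong to the same zone, and drop it on
a zone change or after a batch limit -- can be sketched in userspace C. This is
an illustrative model with made-up names, not the kernel code:

```c
/* Sketch of the lock-batching pattern: one lock acquisition covers a
 * run of same-zone pages, bounded by BATCH_MAX to cap IRQ-off time. */
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 32    /* stands in for SWAP_CLUSTER_MAX */

struct zone { int freed; };

/* Returns how many lock acquisitions a page sequence costs. */
static int free_batch(struct zone **pages, int n)
{
	struct zone *locked = NULL;
	int batch = 0, acquisitions = 0;

	for (int i = 0; i < n; i++) {
		struct zone *z = pages[i];

		/* Drop the held lock on a zone change or an oversized batch. */
		if (locked && (locked != z || ++batch == BATCH_MAX)) {
			locked = NULL;  /* spin_unlock_irqrestore() in the kernel */
			batch = 0;
		}
		if (!locked) {
			locked = z;     /* spin_lock_irqsave() in the kernel */
			acquisitions++;
		}
		z->freed++;
	}
	return acquisitions;
}
```

A run of pages from the same zone costs a single acquisition, which is what
preserves the optimizations referenced in [1] below.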
The performance difference between local locks and uncontended per-cpu
spinlocks (which they happen to be during normal operation) is pretty
small.
On an Intel Xeon E5-2640 (x86_64) with 32GB of memory (mean
variation vs. vanilla runs, higher is worse):
- netperf: -0.5% to 0.5% (no difference)
- hackbench: -0.3% to 0.7% (almost no difference)
- mmtests/sparsetruncate-tiny: -0.1% to 0.6%
On a Cavium ThunderX2 (arm64) with 64GB of memory:
- netperf 1.0% to 1.7%
- hackbench 0.8% to 1.5%
- mmtests/sparsetruncate-tiny 1.6% to 2.1%
arm64 is a bit more sensitive to the change, probably due to the effect
of the spinlock's memory barriers.
Note that the aim9 test suite was also run (through
mmtests/pagealloc-performance) but the test's own variance distorts the
results too much.
[1] See:
- 9cca35d42eb61 ("mm, page_alloc: enable/disable IRQs once when
freeing a list of pages")
- c24ad77d962c3 ("mm/page_alloc.c: avoid excessive IRQ disabled
times in free_unref_page_list()")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 87 ++++++++++++++++++++++--------------------
2 files changed, 47 insertions(+), 41 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..83c51036c756 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -376,6 +376,7 @@ struct per_cpu_pages {
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
+ spinlock_t lock;
};
struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ef03dfb8f95..b332d5cc40f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,13 +122,6 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
-struct pagesets {
- local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) = {
- .lock = INIT_LOCAL_LOCK(lock),
-};
-
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1505,8 +1498,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
pcp->count -= nr_freed;
/*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ * spin_lock_irqsave(&pcp->lock) held so equivalent to
+ * spin_lock_irqsave().
*/
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
@@ -3011,8 +3004,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
int i, allocated = 0;
/*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ * spin_lock_irqsave(&pcp->lock) held so equivalent to
+ * spin_lock_irqsave().
*/
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
@@ -3066,12 +3059,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
unsigned long flags;
int to_drain, batch;
- local_lock_irqsave(&pagesets.lock, flags);
+ spin_lock_irqsave(&pcp->lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}
#endif
@@ -3087,13 +3080,11 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pages *pcp;
- local_lock_irqsave(&pagesets.lock, flags);
-
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ spin_lock_irqsave(&pcp->lock, flags);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
-
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}
/*
@@ -3355,16 +3346,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
return min(READ_ONCE(pcp->batch) << 2, high);
}
-static void free_unref_page_commit(struct page *page, int migratetype,
- unsigned int order)
+static void free_unref_page_commit(struct page *page, struct per_cpu_pages *pcp,
+ int migratetype, unsigned int order)
{
struct zone *zone = page_zone(page);
- struct per_cpu_pages *pcp;
int high;
int pindex;
__count_vm_event(PGFREE);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);
list_add(&page->lru, &pcp->lists[pindex]);
pcp->count += 1 << order;
@@ -3383,6 +3372,7 @@ void free_unref_page(struct page *page, unsigned int order)
{
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
+ struct per_cpu_pages *pcp;
int migratetype;
if (!free_unref_page_prepare(page, pfn, order))
@@ -3404,9 +3394,10 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = MIGRATE_MOVABLE;
}
- local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, migratetype, order);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp = this_cpu_ptr(page_zone(page)->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
+ free_unref_page_commit(page, pcp, migratetype, order);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}
/*
@@ -3415,6 +3406,7 @@ void free_unref_page(struct page *page, unsigned int order)
void free_unref_page_list(struct list_head *list)
{
struct page *page, *next;
+ spinlock_t *lock = NULL;
unsigned long flags;
int batch_count = 0;
int migratetype;
@@ -3422,6 +3414,7 @@ void free_unref_page_list(struct list_head *list)
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_to_pfn(page);
+
if (!free_unref_page_prepare(page, pfn, 0)) {
list_del(&page->lru);
continue;
@@ -3439,8 +3432,22 @@ void free_unref_page_list(struct list_head *list)
}
}
- local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
+ struct per_cpu_pages *pcp = this_cpu_ptr(page_zone(page)->per_cpu_pageset);
+
+ /*
+ * As an optimization, release the previously held lock only if
+ * the page belongs to a different zone. But also, guard
+ * against excessive IRQ disabled times when we get a large
+ * list of pages to free.
+ */
+ if (++batch_count == SWAP_CLUSTER_MAX ||
+ (lock != &pcp->lock && lock)) {
+ spin_unlock_irqrestore(lock, flags);
+ batch_count = 0;
+ lock = NULL;
+ }
+
/*
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
@@ -3450,19 +3457,17 @@ void free_unref_page_list(struct list_head *list)
migratetype = MIGRATE_MOVABLE;
trace_mm_page_free_batched(page);
- free_unref_page_commit(page, migratetype, 0);
- /*
- * Guard against excessive IRQ disabled times when we get
- * a large list of pages to free.
- */
- if (++batch_count == SWAP_CLUSTER_MAX) {
- local_unlock_irqrestore(&pagesets.lock, flags);
- batch_count = 0;
- local_lock_irqsave(&pagesets.lock, flags);
+ if (!lock) {
+ spin_lock_irqsave(&pcp->lock, flags);
+ lock = &pcp->lock;
}
+
+ free_unref_page_commit(page, pcp, migratetype, 0);
}
- local_unlock_irqrestore(&pagesets.lock, flags);
+
+ if (lock)
+ spin_unlock_irqrestore(lock, flags);
}
/*
@@ -3636,18 +3641,17 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long flags;
- local_lock_irqsave(&pagesets.lock, flags);
-
/*
* On allocation, reduce the number of pages that are batch freed.
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
@@ -5265,8 +5269,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;
/* Attempt the batch allocation */
- local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
while (nr_populated < nr_pages) {
@@ -5295,7 +5299,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5304,7 +5308,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
return nr_populated;
failed_irq:
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
@@ -6947,6 +6951,7 @@ void __meminit setup_zone_pageset(struct zone *zone)
struct per_cpu_zonestat *pzstats;
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ spin_lock_init(&pcp->lock);
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
per_cpu_pages_init(pcp, pzstats);
}
--
2.33.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-11-03 17:05 [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit() Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks Nicolas Saenz Julienne
@ 2021-11-03 17:05 ` Nicolas Saenz Julienne
2021-12-03 14:13 ` Mel Gorman
2021-11-23 14:58 ` [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Vlastimil Babka
3 siblings, 1 reply; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-03 17:05 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, vbabka, cl, ppandit,
Nicolas Saenz Julienne
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by remotely
locking the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit of
this new scheme is that drain operations are now migration safe.
There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel with triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is that there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
mm/page_alloc.c | 59 +++++--------------------------------------------
1 file changed, 5 insertions(+), 54 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b332d5cc40f1..7dbdab100461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -140,13 +140,7 @@ DEFINE_PER_CPU(int, _numa_mem_); /* Kernel "local memory" node */
EXPORT_PER_CPU_SYMBOL(_numa_mem_);
#endif
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
- struct zone *zone;
- struct work_struct work;
-};
static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
#ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
volatile unsigned long latent_entropy __latent_entropy;
@@ -3050,9 +3044,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* Called from the vmstat counter updater to drain pagesets of this
* currently executing processor on remote nodes after they have
* expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
@@ -3070,10 +3061,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
/*
* Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
*/
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
@@ -3089,10 +3076,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
/*
* Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
*/
static void drain_pages(unsigned int cpu)
{
@@ -3105,9 +3088,6 @@ static void drain_pages(unsigned int cpu)
/*
* Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
*/
void drain_local_pages(struct zone *zone)
{
@@ -3119,24 +3099,6 @@ void drain_local_pages(struct zone *zone)
drain_pages(cpu);
}
-static void drain_local_pages_wq(struct work_struct *work)
-{
- struct pcpu_drain *drain;
-
- drain = container_of(work, struct pcpu_drain, work);
-
- /*
- * drain_all_pages doesn't use proper cpu hotplug protection so
- * we can race with cpu offline when the WQ can move this from
- * a cpu pinned worker to an unbound one. We can operate on a different
- * cpu which is alright but we also have to make sure to not move to
- * a different one.
- */
- migrate_disable();
- drain_local_pages(drain->zone);
- migrate_enable();
-}
-
/*
* The implementation of drain_all_pages(), exposing an extra parameter to
* drain on all cpus.
@@ -3157,13 +3119,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*/
static cpumask_t cpus_with_pcps;
- /*
- * Make sure nobody triggers this path before mm_percpu_wq is fully
- * initialized.
- */
- if (WARN_ON_ONCE(!mm_percpu_wq))
- return;
-
/*
* Do not drain if one is already in progress unless it's specific to
* a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3213,14 +3168,12 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
}
for_each_cpu(cpu, &cpus_with_pcps) {
- struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
- drain->zone = zone;
- INIT_WORK(&drain->work, drain_local_pages_wq);
- queue_work_on(cpu, mm_percpu_wq, &drain->work);
+ if (zone) {
+ drain_pages_zone(cpu, zone);
+ } else {
+ drain_pages(cpu);
+ }
}
- for_each_cpu(cpu, &cpus_with_pcps)
- flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
mutex_unlock(&pcpu_drain_mutex);
}
@@ -3229,8 +3182,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
* Spill all the per-cpu pages from all CPUs back into the buddy allocator.
*
* When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
*/
void drain_all_pages(struct zone *zone)
{
--
2.33.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit()
2021-11-03 17:05 ` [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit() Nicolas Saenz Julienne
@ 2021-11-23 14:41 ` Vlastimil Babka
0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2021-11-23 14:41 UTC (permalink / raw)
To: Nicolas Saenz Julienne, akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, cl, ppandit
On 11/3/21 18:05, Nicolas Saenz Julienne wrote:
> free_unref_page_commit() doesn't make use of its pfn argument, so get
> rid of it.
Yeah, looks like since df1acc856923 ("mm/page_alloc: avoid conflating IRQs
disabled with zone->lock")
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/page_alloc.c | 17 ++++++-----------
> 1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..9ef03dfb8f95 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3355,8 +3355,8 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> return min(READ_ONCE(pcp->batch) << 2, high);
> }
>
> -static void free_unref_page_commit(struct page *page, unsigned long pfn,
> - int migratetype, unsigned int order)
> +static void free_unref_page_commit(struct page *page, int migratetype,
> + unsigned int order)
> {
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> @@ -3405,7 +3405,7 @@ void free_unref_page(struct page *page, unsigned int order)
> }
>
> local_lock_irqsave(&pagesets.lock, flags);
> - free_unref_page_commit(page, pfn, migratetype, order);
> + free_unref_page_commit(page, migratetype, order);
> local_unlock_irqrestore(&pagesets.lock, flags);
> }
>
> @@ -3415,13 +3415,13 @@ void free_unref_page(struct page *page, unsigned int order)
> void free_unref_page_list(struct list_head *list)
> {
> struct page *page, *next;
> - unsigned long flags, pfn;
> + unsigned long flags;
> int batch_count = 0;
> int migratetype;
>
> /* Prepare pages for freeing */
> list_for_each_entry_safe(page, next, list, lru) {
> - pfn = page_to_pfn(page);
> + unsigned long pfn = page_to_pfn(page);
> if (!free_unref_page_prepare(page, pfn, 0)) {
> list_del(&page->lru);
> continue;
> @@ -3437,15 +3437,10 @@ void free_unref_page_list(struct list_head *list)
> free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> continue;
> }
> -
> - set_page_private(page, pfn);
> }
>
> local_lock_irqsave(&pagesets.lock, flags);
> list_for_each_entry_safe(page, next, list, lru) {
> - pfn = page_private(page);
> - set_page_private(page, 0);
> -
> /*
> * Non-isolated types over MIGRATE_PCPTYPES get added
> * to the MIGRATE_MOVABLE pcp list.
> @@ -3455,7 +3450,7 @@ void free_unref_page_list(struct list_head *list)
> migratetype = MIGRATE_MOVABLE;
>
> trace_mm_page_free_batched(page);
> - free_unref_page_commit(page, pfn, migratetype, 0);
> + free_unref_page_commit(page, migratetype, 0);
>
> /*
> * Guard against excessive IRQ disabled times when we get
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support
2021-11-03 17:05 [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Nicolas Saenz Julienne
` (2 preceding siblings ...)
2021-11-03 17:05 ` [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists Nicolas Saenz Julienne
@ 2021-11-23 14:58 ` Vlastimil Babka
2021-11-30 18:09 ` Nicolas Saenz Julienne
3 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2021-11-23 14:58 UTC (permalink / raw)
To: Nicolas Saenz Julienne, akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, cl, ppandit
On 11/3/21 18:05, Nicolas Saenz Julienne wrote:
> This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
> page lists which will allow for remote CPUs to drain them. Currently, only a
> local CPU is permitted to change its per-cpu lists, and it's expected to do so,
> on-demand, whenever a process demands it (by means of queueing an drain task on
> the local CPU). Most systems will handle this promptly, but it'll cause
> problems for NOHZ_FULL CPUs that can't take any sort of interruption without
> breaking their functional guarantees (latency, bandwidth, etc...).
>
> This new locking scheme, based on per-cpu spinlocks, is the simpler and more
> maintainable approach so far[1], although also has some drawbacks: it comes
> with a small performance. Depending on the page allocation code path
> micro-benchmark we can expect 0% to 0.6% degradation on x86_64, and 0% to 2% on
> arm64[2].
>
> Assuming there is nothing too horrible in the patches themselves I believe it
> all comes down to whether we prefer to take the small performance hit vs the
> maintenance burden of a more complex solution[1]. I don't have enough
I'd be for the small performance hit over more complex solution, if possible.
> experience with performance tuning, nor with maintenance to have an
> authoritative opinion here, so I'll defer to whatever is hopefully discussed
> here. Also, I'll be happy to run any extra tests that I might have missed.
I think Mel has done most page allocator optimizations recently so he would
be most authoritative to say what is or isn't acceptable.
> Patch #1 could be taken regardless of the rest of the series as it removes dead
> code.
>
> The series is based on today's linux-next.
>
> Changes since v2:
> - Provide performance numbers
> - Unanimously use per-cpu spinlocks
>
> [1] Other approaches can be found here:
>
> - Static branch conditional on nohz_full, no performance loss, the extra
> config option makes is painful to maintain (v1):
> https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
>
> - RCU based approach, complex, yet a bit less taxing performance wise
> (RFC):
> https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
Hm I wonder if there might still be another alternative possible. IIRC I did
propose at some point a local drain on the NOHZ cpu before returning to
userspace, and then avoiding that cpu in remote drains, but tglx didn't like
the idea of making entering the NOHZ full mode more expensive [1].
But what if we instead set pcp->high = 0 for these cpus so they would avoid
populating the pcplists in the first place? Then there wouldn't have to be a
drain at all. On the other hand page allocator operations would not benefit
from zone lock batching on those cpus. But perhaps that would be an acceptable
tradeoff, as a nohz cpu is expected to run in userspace most of the time,
and page allocator operations are rare except maybe some initial page
faults? (I assume those kind of workloads pre-populate and/or mlock their
address space anyway).
[1] https://lore.kernel.org/all/878rznh93e.ffs@tglx/
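For concreteness, the pcp->high = 0 suggestion above can be modeled in
userspace C. This is a rough sketch with made-up names, not the kernel's
free path:

```c
/* Model of the alternative: with pcp->high forced to 0 on nohz_full
 * CPUs, freed pages never sit on the per-cpu list, so no remote drain
 * is ever needed for them. Hypothetical names throughout. */
#include <assert.h>

struct pcp  { int high; int count; };
struct zone { int free_pages; };

/* Free one page: cache it on the pcp list unless 'high' forbids it. */
static void free_page_model(struct zone *z, struct pcp *p)
{
	p->count++;
	if (p->count > p->high) {       /* high == 0: always spill */
		z->free_pages += p->count;  /* back to the buddy allocator */
		p->count = 0;
	}
}
```

A nohz_full CPU (high == 0) spills to the buddy allocator on every free and so
never holds pages that would need draining, at the cost of losing zone-lock
batching; a housekeeping CPU with high > 0 keeps batching as before.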
> [2] See individual patches for in-depth results
>
> ---
>
> Nicolas Saenz Julienne (3):
> mm/page_alloc: Don't pass pfn to free_unref_page_commit()
> mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
> locks
> mm/page_alloc: Remotely drain per-cpu lists
>
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 151 ++++++++++++++---------------------------
> 2 files changed, 52 insertions(+), 100 deletions(-)
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support
2021-11-23 14:58 ` [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Vlastimil Babka
@ 2021-11-30 18:09 ` Nicolas Saenz Julienne
2021-12-01 14:01 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-30 18:09 UTC (permalink / raw)
To: Vlastimil Babka, akpm
Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
mgorman, linux-rt-users, cl, ppandit
Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)
On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > [1] Other approaches can be found here:
> >
> > - Static branch conditional on nohz_full, no performance loss, the extra
> > config option makes is painful to maintain (v1):
> > https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
> >
> > - RCU based approach, complex, yet a bit less taxing performance wise
> > (RFC):
> > https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
>
> Hm I wonder if there might still be another alternative possible. IIRC I did
> propose at some point a local drain on the NOHZ cpu before returning to
> userspace, and then avoiding that cpu in remote drains, but tglx didn't like
> the idea of making entering the NOHZ full mode more expensive [1].
>
> But what if we instead set pcp->high = 0 for these cpus so they would avoid
> populating the pcplists in the first place? Then there wouldn't have to be a
> drain at all. On the other hand page allocator operations would not benefit
> from zone lock batching on those cpus. But perhaps that would be acceptable
> tradeoff, as a nohz cpu is expected to run in userspace most of the time,
> and page allocator operations are rare except maybe some initial page
> faults? (I assume those kind of workloads pre-populate and/or mlock their
> address space anyway).
I've looked a bit into this and it seems straightforward. Our workloads
pre-populate everything, and a slight startup performance hit is not that tragic
(I'll measure it nonetheless). The per-cpu nohz_full state at some point will
be dynamic, but the feature seems simple to disable/enable. I'll have to teach
__drain_all_pages(zone, force_all_cpus=true) to bypass this special case
but that's all. I might have a go at this.
Thanks!
--
Nicolás Sáenz
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support
2021-11-30 18:09 ` Nicolas Saenz Julienne
@ 2021-12-01 14:01 ` Marcelo Tosatti
0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2021-12-01 14:01 UTC (permalink / raw)
To: Nicolas Saenz Julienne
Cc: Vlastimil Babka, akpm, linux-kernel, linux-mm, frederic, tglx,
peterz, nilal, mgorman, linux-rt-users, cl, ppandit
On Tue, Nov 30, 2021 at 07:09:23PM +0100, Nicolas Saenz Julienne wrote:
> Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)
>
> On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > > [1] Other approaches can be found here:
> > >
> > > - Static branch conditional on nohz_full, no performance loss, the extra
> > > config option makes is painful to maintain (v1):
> > > https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
> > >
> > > - RCU based approach, complex, yet a bit less taxing performance wise
> > > (RFC):
> > > https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
> >
> > Hm I wonder if there might still be another alternative possible. IIRC I did
> > propose at some point a local drain on the NOHZ cpu before returning to
> > userspace, and then avoiding that cpu in remote drains, but tglx didn't like
> > the idea of making entering the NOHZ full mode more expensive [1].
> >
> > But what if we instead set pcp->high = 0 for these cpus so they would avoid
> > populating the pcplists in the first place? Then there wouldn't have to be a
> > drain at all. On the other hand page allocator operations would not benefit
> > from zone lock batching on those cpus. But perhaps that would be acceptable
> > tradeoff, as a nohz cpu is expected to run in userspace most of the time,
> > and page allocator operations are rare except maybe some initial page
> > faults? (I assume those kind of workloads pre-populate and/or mlock their
> > address space anyway).
>
> I've looked a bit into this and it seems straightforward. Our workloads
> pre-populate everything, and a slight statup performance hit is not that tragic
> (I'll measure it nonetheless). The per-cpu nohz_full state at some point will
> be dynamic, but the feature seems simple to disable/enable. I'll have to teach
> __drain_all_pages(zone, force_all_cpus=true) to bypass this special case
> but that's all. I might have a go at this.
>
> Thanks!
>
> --
> Nicolás Sáenz
True, but a nohz cpu does not necessarily have to run in userspace most
of the time. For example, an application can enter nohz_full mode,
go back to userspace, idle, and return from idle, all without leaving
nohz_full mode.
So it's not clear that nohz_full is an appropriate trigger for setting
pcp->high = 0. Perhaps a task isolation feature would be an appropriate
location.
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-11-03 17:05 ` [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists Nicolas Saenz Julienne
@ 2021-12-03 14:13 ` Mel Gorman
2021-12-09 10:50 ` Nicolas Saenz Julienne
2021-12-09 17:45 ` Marcelo Tosatti
0 siblings, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2021-12-03 14:13 UTC (permalink / raw)
To: Nicolas Saenz Julienne
Cc: akpm, linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti,
nilal, linux-rt-users, vbabka, cl, ppandit
On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote:
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages(). So introduce a new mechanism
> to remotely drain the per-cpu lists. It is made possible by remotely
> locking the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit
> of this new scheme is that drain operations are now migration safe.
>
> There was no observed performance degradation vs. the previous scheme.
> Both netperf and hackbench were run in parallel with triggering the
> __drain_all_pages(NULL, true) code path around 100 times per second.
> The new scheme performs a bit better (~5%), although the important point
> here is that there are no performance regressions vs. the previous
> mechanism. Per-cpu list draining happens only in slow paths.
>
netperf and hackbench are not great indicators of page allocator
performance as IIRC they are more slab-intensive than page allocator
intensive. I ran the series through a few benchmarks and can confirm
that there was negligible difference to netperf and hackbench.
However, on Page Fault Test (pft in mmtests), it is noticeable. On a
2-socket Cascade Lake machine I get
pft timings
5.16.0-rc1 5.16.0-rc1
vanilla mm-remotedrain-v2r1
Amean system-1 27.48 ( 0.00%) 27.85 * -1.35%*
Amean system-4 28.65 ( 0.00%) 30.84 * -7.65%*
Amean system-7 28.70 ( 0.00%) 32.43 * -13.00%*
Amean system-12 30.33 ( 0.00%) 34.21 * -12.80%*
Amean system-21 37.14 ( 0.00%) 41.51 * -11.76%*
Amean system-30 36.79 ( 0.00%) 46.15 * -25.43%*
Amean system-48 58.95 ( 0.00%) 65.28 * -10.73%*
Amean system-79 111.61 ( 0.00%) 114.78 * -2.84%*
Amean system-80 113.59 ( 0.00%) 116.73 * -2.77%*
Amean elapsed-1 32.83 ( 0.00%) 33.12 * -0.88%*
Amean elapsed-4 8.60 ( 0.00%) 9.17 * -6.66%*
Amean elapsed-7 4.97 ( 0.00%) 5.53 * -11.30%*
Amean elapsed-12 3.08 ( 0.00%) 3.43 * -11.41%*
Amean elapsed-21 2.19 ( 0.00%) 2.41 * -10.06%*
Amean elapsed-30 1.73 ( 0.00%) 2.04 * -17.87%*
Amean elapsed-48 1.73 ( 0.00%) 2.03 * -17.77%*
Amean elapsed-79 1.61 ( 0.00%) 1.64 * -1.90%*
Amean elapsed-80 1.60 ( 0.00%) 1.64 * -2.50%*
It's not specific to Cascade Lake; I see regressions of varying size on
different Intel and AMD chips, some better and some worse than this result.
The smallest regression was on a single-CPU Skylake machine with a 2-6%
hit. The worst was Zen1 with a 3-107% hit.
I didn't profile it to establish why, but in all cases the system CPU
usage was much higher. It *might* be because the spinlock in
per_cpu_pages pushes the structure across a new cache line, and that
line might be cold, although the penalty seems a bit high for that to be
the only factor.
Code-wise, the patches look fine but the apparent penalty for PFT is
too severe.
--
Mel Gorman
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-03 14:13 ` Mel Gorman
@ 2021-12-09 10:50 ` Nicolas Saenz Julienne
2021-12-09 17:45 ` Marcelo Tosatti
1 sibling, 0 replies; 15+ messages in thread
From: Nicolas Saenz Julienne @ 2021-12-09 10:50 UTC (permalink / raw)
To: Mel Gorman
Cc: akpm, linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti,
nilal, linux-rt-users, vbabka, cl, ppandit
Hi Mel,
On Fri, 2021-12-03 at 14:13 +0000, Mel Gorman wrote:
> [...]
> Code-wise, the patches look fine but the apparent penalty for PFT is
> too severe.
Thanks for taking the time to look at this. I agree the performance penalty is
way too big. I'll move to an alternative approach.
--
Nicolás Sáenz
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-03 14:13 ` Mel Gorman
2021-12-09 10:50 ` Nicolas Saenz Julienne
@ 2021-12-09 17:45 ` Marcelo Tosatti
2021-12-10 10:55 ` Mel Gorman
1 sibling, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2021-12-09 17:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Nicolas Saenz Julienne, akpm, linux-kernel, linux-mm, frederic,
tglx, peterz, nilal, linux-rt-users, vbabka, cl, ppandit
On Fri, Dec 03, 2021 at 02:13:06PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote:
> > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> > drain work queued by __drain_all_pages(). So introduce new a mechanism
> > to remotely drain the per-cpu lists. It is made possible by remotely
> > locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this
> > new scheme is that drain operations are now migration safe.
> >
> > There was no observed performance degradation vs. the previous scheme.
> > Both netperf and hackbench were run in parallel to triggering the
> > __drain_all_pages(NULL, true) code path around ~100 times per second.
> > The new scheme performs a bit better (~5%), although the important point
> > here is there are no performance regressions vs. the previous mechanism.
> > Per-cpu lists draining happens only in slow paths.
> >
>
> netperf and hackbench are not great indicators of page allocator
> performance as IIRC they are more slab-intensive than page allocator
> intensive. I ran the series through a few benchmarks and can confirm
> that there was negligible difference to netperf and hackbench.
>
> However, on Page Fault Test (pft in mmtests), it is noticable. On a
> 2-socket cascadelake machine I get
>
> pft timings
> 5.16.0-rc1 5.16.0-rc1
> vanilla mm-remotedrain-v2r1
> Amean system-1 27.48 ( 0.00%) 27.85 * -1.35%*
> Amean system-4 28.65 ( 0.00%) 30.84 * -7.65%*
> Amean system-7 28.70 ( 0.00%) 32.43 * -13.00%*
> Amean system-12 30.33 ( 0.00%) 34.21 * -12.80%*
> Amean system-21 37.14 ( 0.00%) 41.51 * -11.76%*
> Amean system-30 36.79 ( 0.00%) 46.15 * -25.43%*
> Amean system-48 58.95 ( 0.00%) 65.28 * -10.73%*
> Amean system-79 111.61 ( 0.00%) 114.78 * -2.84%*
> Amean system-80 113.59 ( 0.00%) 116.73 * -2.77%*
> Amean elapsed-1 32.83 ( 0.00%) 33.12 * -0.88%*
> Amean elapsed-4 8.60 ( 0.00%) 9.17 * -6.66%*
> Amean elapsed-7 4.97 ( 0.00%) 5.53 * -11.30%*
> Amean elapsed-12 3.08 ( 0.00%) 3.43 * -11.41%*
> Amean elapsed-21 2.19 ( 0.00%) 2.41 * -10.06%*
> Amean elapsed-30 1.73 ( 0.00%) 2.04 * -17.87%*
> Amean elapsed-48 1.73 ( 0.00%) 2.03 * -17.77%*
> Amean elapsed-79 1.61 ( 0.00%) 1.64 * -1.90%*
> Amean elapsed-80 1.60 ( 0.00%) 1.64 * -2.50%*
>
> It's not specific to cascade lake, I see varying size regressions on
> different Intel and AMD chips, some better and worse than this result.
> The smallest regression was on a single CPU skylake machine with a 2-6%
> hit. Worst was Zen1 with a 3-107% hit.
>
> I didn't profile it to establish why but in all cases the system CPU
> usage was much higher. It *might* be because the spinlock in
> per_cpu_pages crosses a new cache line and it might be cold although the
> penalty seems a bit high for that to be the only factor.
>
> Code-wise, the patches look fine but the apparent penalty for PFT is
> too severe.
Mel,
Have you read Nicolas RCU patches?
Date: Fri, 8 Oct 2021 18:19:19 +0200
From: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Subject: [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support
RCU seems like a natural fit, we were wondering whether people see any
fundamental problem with this approach.
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-09 17:45 ` Marcelo Tosatti
@ 2021-12-10 10:55 ` Mel Gorman
2021-12-14 10:58 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2021-12-10 10:55 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Nicolas Saenz Julienne, akpm, linux-kernel, linux-mm, frederic,
tglx, peterz, nilal, linux-rt-users, vbabka, cl, ppandit
On Thu, Dec 09, 2021 at 02:45:35PM -0300, Marcelo Tosatti wrote:
> On Fri, Dec 03, 2021 at 02:13:06PM +0000, Mel Gorman wrote:
> > [...]
>
> Mel,
>
> Have you read Nicolas RCU patches?
>
I agree with Vlastimil's review on overhead.
I think it would be more straightforward to disable the pcp allocator for
NOHZ_FULL CPUs, similar to what zone_pcp_disable() does but for individual
CPUs, with care taken not to accidentally re-enable nohz CPUs in
zone_pcp_enable(). The
downside is that there will be a performance penalty if an application
running on a NOHZ_FULL CPU is page allocator intensive for whatever
reason. However, I guess this is unlikely because if there was a lot
of kernel activity for a NOHZ_FULL CPU, the vmstat shepherd would also
cause interference.
--
Mel Gorman
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-10 10:55 ` Mel Gorman
@ 2021-12-14 10:58 ` Marcelo Tosatti
2021-12-14 11:42 ` Christoph Lameter
0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2021-12-14 10:58 UTC (permalink / raw)
To: Mel Gorman, Thomas Gleixner, Frederic Weisbecker,
Christoph Lameter
Cc: Nicolas Saenz Julienne, akpm, linux-kernel, linux-mm, frederic,
tglx, peterz, nilal, linux-rt-users, vbabka, cl, ppandit
On Fri, Dec 10, 2021 at 10:55:49AM +0000, Mel Gorman wrote:
> On Thu, Dec 09, 2021 at 02:45:35PM -0300, Marcelo Tosatti wrote:
> > [...]
> >
> > Mel,
> >
> > Have you read Nicolas RCU patches?
> >
>
> I agree with Vlastimil's review on overhead.
I'm not sure those points indicate any fundamental performance problem
with RCU:
https://paulmck.livejournal.com/31058.html
> I think it would be more straight-forward to disable the pcp allocator for
> NOHZ_FULL CPUs like what zone_pcp_disable except for individual CPUs with
> care taken to not accidentally re-enable nohz CPus in zone_pcp_enable. The
> downside is that there will be a performance penalty if an application
> running on a NOHZ_FULL CPU is page allocator intensive for whatever
> reason. However, I guess this is unlikely because if there was a lot
> of kernel activity for a NOHZ_FULL CPU, the vmstat shepherd would also
> cause interference.
Yes, it does, and it's being fixed:
https://lkml.org/lkml/2021/12/8/663
Honestly I am not sure whether the association between a nohz_full CPU
and "should be mostly in userspace" is desired. The RCU solution
would be more generic. As Nicolas mentioned, for the use cases in
question, either solution is OK.
Thomas, Frederic, Christoph, do you have any opinion on this ?
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-14 10:58 ` Marcelo Tosatti
@ 2021-12-14 11:42 ` Christoph Lameter
2021-12-14 12:25 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2021-12-14 11:42 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Mel Gorman, Thomas Gleixner, Frederic Weisbecker,
Christoph Lameter, Nicolas Saenz Julienne, akpm, linux-kernel,
linux-mm, peterz, nilal, linux-rt-users, vbabka, ppandit
On Tue, 14 Dec 2021, Marcelo Tosatti wrote:
> > downside is that there will be a performance penalty if an application
> > running on a NOHZ_FULL CPU is page allocator intensive for whatever
> > reason. However, I guess this is unlikely because if there was a lot
> > of kernel activity for a NOHZ_FULL CPU, the vmstat shepherd would also
> > cause interference.
>
> Yes, it does, and its being fixed:
>
> https://lkml.org/lkml/2021/12/8/663
>
> Honestly i am not sure whether the association between a nohz_full CPU
> and "should be mostly in userspace" is desired. The RCU solution
> would be more generic. As Nicolas mentioned, for the usecases in
> questions, either solution is OK.
>
> Thomas, Frederic, Christoph, do you have any opinion on this ?
Applications would ideally see no performance penalty, and there
is no issue with kernel activity unless the application is in its special
low-latency loop. NOHZ is currently only activated after spinning in that
loop for 2 seconds or so. It would be best to be able to trigger that
manually somehow.
And I would prefer to be able to run the whole system as
NOHZ and have the ability to selectively enable the quiet mode if a
process requires it for its processing.
* Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
2021-12-14 11:42 ` Christoph Lameter
@ 2021-12-14 12:25 ` Marcelo Tosatti
0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2021-12-14 12:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Thomas Gleixner, Frederic Weisbecker,
Christoph Lameter, Nicolas Saenz Julienne, akpm, linux-kernel,
linux-mm, peterz, nilal, linux-rt-users, vbabka, ppandit
On Tue, Dec 14, 2021 at 12:42:58PM +0100, Christoph Lameter wrote:
> On Tue, 14 Dec 2021, Marcelo Tosatti wrote:
>
> > [...]
> >
> > Thomas, Frederic, Christoph, do you have any opinion on this ?
>
> Applications running would ideally have no performance penalty and there
> is no issue with kernel activity unless the application is in its special
> low latency loop. NOHZ is currently only activated after spinning in that
> loop for 2 seconds or so. Would be best to be able to trigger that
> manually somehow.
We could add a task isolation feature to do that.
> And I would prefer to be able to run the whole system as
> NOHZ and have the ability to selectively enable the quiet mode if a
> process requires it for its processing.
IIRC Frederic has been working on that.
Thanks.
Thread overview: 15+ messages
2021-11-03 17:05 [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit() Nicolas Saenz Julienne
2021-11-23 14:41 ` Vlastimil Babka
2021-11-03 17:05 ` [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks Nicolas Saenz Julienne
2021-11-03 17:05 ` [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists Nicolas Saenz Julienne
2021-12-03 14:13 ` Mel Gorman
2021-12-09 10:50 ` Nicolas Saenz Julienne
2021-12-09 17:45 ` Marcelo Tosatti
2021-12-10 10:55 ` Mel Gorman
2021-12-14 10:58 ` Marcelo Tosatti
2021-12-14 11:42 ` Christoph Lameter
2021-12-14 12:25 ` Marcelo Tosatti
2021-11-23 14:58 ` [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support Vlastimil Babka
2021-11-30 18:09 ` Nicolas Saenz Julienne
2021-12-01 14:01 ` Marcelo Tosatti