* [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
@ 2024-06-22  3:58 Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work Leonardo Bras
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-22  3:58 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

The problem:
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, with the rare remote
operations being scheduled on the target cpu. This keeps cache bouncing low,
since cachelines tend to stay mostly local, and avoids the cost of locks in
non-RT kernels, even though the very few remote operations will be expensive
due to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. The major cost, which is un/locking in every local function,
is already paid in PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This avoids schedule_work_on(), and thus avoids scheduling out an
RT workload.

For patches 2, 3 & 4, I noticed that just grabbing the lock and executing
the function locally is much faster than scheduling it on a
remote cpu.

Proposed solution:
A new interface called Queue PerCPU Work (QPW), which should replace
Work Queue in the above mentioned use case. 

If PREEMPT_RT=n, this interface just wraps the current
local_locks + WorkQueue behavior, so no change in runtime is expected.

If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
per-cpu structure and perform the work on it locally. This is possible
because, in the functions that may perform remote work on remote per-cpu
structures, the local_lock (which on RT is already a this_cpu spinlock)
is replaced by qpw_lock(), which is able to take the per-cpu spinlock
of the cpu passed as a parameter.
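
To make the intended usage concrete, here is a minimal caller-side sketch
(illustrative only, not taken from any of the patches: the my_pcp_cache /
my_drain / my_drain_all names are made up, and system_wq is used just for
simplicity):

#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/qpw.h>

struct my_pcp_cache {
	local_lock_t lock;
	int count;
	struct qpw_struct qpw;
};

static DEFINE_PER_CPU(struct my_pcp_cache, my_pcp_cache) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* On !RT this runs on @cpu via the workqueue; on RT it runs on the
 * requesting cpu while holding @cpu's per-cpu spinlock. */
static void my_drain(struct work_struct *w)
{
	int cpu = qpw_get_cpu(w);
	struct my_pcp_cache *c = per_cpu_ptr(&my_pcp_cache, cpu);

	qpw_lock(&my_pcp_cache.lock, cpu);
	c->count = 0;
	qpw_unlock(&my_pcp_cache.lock, cpu);
}

static void my_drain_all(void)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct my_pcp_cache *c = per_cpu_ptr(&my_pcp_cache, cpu);

		INIT_QPW(&c->qpw, my_drain, cpu);
		queue_percpu_work_on(cpu, system_wq, &c->qpw);
	}

	for_each_online_cpu(cpu)
		flush_percpu_work(&per_cpu_ptr(&my_pcp_cache, cpu)->qpw);
	cpus_read_unlock();
}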

Patch 1 implements the QPW interface, and patches 2, 3 & 4 replace the
current local_lock + WorkQueue usage with the QPW interface in
swap, memcontrol & slub.

Please let me know what you think of this, and please suggest
improvements.

Thanks a lot!
Leo

Leonardo Bras (4):
  Introducing qpw_lock() and per-cpu queue & flush work
  swap: apply new queue_percpu_work_on() interface
  memcontrol: apply new queue_percpu_work_on() interface
  slub: apply new queue_percpu_work_on() interface

 include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c     | 20 ++++++-----
 mm/slub.c           | 26 ++++++++------
 mm/swap.c           | 26 +++++++-------
 4 files changed, 127 insertions(+), 33 deletions(-)
 create mode 100644 include/linux/qpw.h


base-commit: 50736169ecc8387247fe6a00932852ce7b057083
-- 
2.45.2




* [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
@ 2024-06-22  3:58 ` Leonardo Bras
  2024-09-04 21:39   ` Waiman Long
  2024-06-22  3:58 ` [RFC PATCH v1 2/4] swap: apply new queue_percpu_work_on() interface Leonardo Bras
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Leonardo Bras @ 2024-06-22  3:58 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, with the rare remote
operations being scheduled on the target cpu. This keeps cache bouncing low,
since cachelines tend to stay mostly local, and avoids the cost of locks in
non-RT kernels, even though the very few remote operations will be expensive
due to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with some unrelated task is
sure to introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlocks. We can make use of those to avoid scheduling work on a remote
cpu by directly updating another cpu's per_cpu structure, while holding
its spinlock.

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
and also the corresponding queue_percpu_work_on() and flush_percpu_work()
helpers to run the remote work.

On non-RT kernels, no changes are expected, as each of the introduced
helpers works exactly the same as the current implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.

queue_percpu_work_on() will just call the requested function on the current
cpu, which will operate on another cpu's per-cpu object. Since the
local_locks() become spinlocks in PREEMPT_RT, we are safe doing that.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
the local_{un,}lock*() calls in the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock*(), so in
RT kernels they can reference a different cpu. It's also necessary to use a
qpw_struct instead of a work_struct, but the former just contains a work
struct and, in PREEMPT_RT, the target cpu.
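
As a hedged illustration of this rework (the "foo" names below are made up
and do not belong to any converted subsystem), a per-cpu drain handler
would change roughly as follows:

#include <linux/percpu.h>
#include <linux/qpw.h>

struct foo_pcp {
	local_lock_t lock;
	int pending;
	struct qpw_struct qpw;	/* was: struct work_struct work; */
};
static DEFINE_PER_CPU(struct foo_pcp, foo_pcp) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Before: the handler could only drain the cpu it was queued on. */
static void foo_drain_old(struct work_struct *w)
{
	local_lock(&foo_pcp.lock);
	this_cpu_write(foo_pcp.pending, 0);
	local_unlock(&foo_pcp.lock);
}

/* After: in PREEMPT_RT the handler runs locally on behalf of @cpu,
 * holding that cpu's spinlock; in !RT it still runs on @cpu itself. */
static void foo_drain(struct work_struct *w)
{
	int cpu = qpw_get_cpu(w);

	qpw_lock(&foo_pcp.lock, cpu);
	per_cpu(foo_pcp.pending, cpu) = 0;
	qpw_unlock(&foo_pcp.lock, cpu);
}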

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls become per_cpu_ptr(..., smp_processor_id()).

On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 include/linux/qpw.h

diff --git a/include/linux/qpw.h b/include/linux/qpw.h
new file mode 100644
index 000000000000..ea2686a01e5e
--- /dev/null
+++ b/include/linux/qpw.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include <linux/local_lock.h>
+#include <linux/workqueue.h>
+
+#ifndef CONFIG_PREEMPT_RT
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock(lock, cpu)					\
+	local_lock(lock)
+
+#define qpw_unlock(lock, cpu)					\
+	local_unlock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)			\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)			\
+	local_unlock_irqrestore(lock, flags)
+
+#define queue_percpu_work_on(c, wq, qpw)			\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)					\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)					\
+	smp_processor_id()
+
+#define INIT_QPW(qpw, func, c)					\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* !CONFIG_PREEMPT_RT */
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock(__lock, cpu)					\
+	do {							\
+		migrate_disable();				\
+		spin_lock(per_cpu_ptr((__lock), cpu));		\
+	} while (0)
+
+#define qpw_unlock(__lock, cpu)					\
+	do {							\
+		spin_unlock(per_cpu_ptr((__lock), cpu));	\
+		migrate_enable();				\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)			\
+	do {							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		qpw_lock(lock, cpu);				\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)			\
+	qpw_unlock(lock, cpu)
+
+#define queue_percpu_work_on(c, wq, qpw)			\
+	do {							\
+		struct qpw_struct *__qpw = (qpw);		\
+		WARN_ON((c) != __qpw->cpu);			\
+		__qpw->work.func(&__qpw->work);			\
+	} while (0)
+
+#define flush_percpu_work(qpw)					\
+	do {} while (0)
+
+#define qpw_get_cpu(w)						\
+	container_of((w), struct qpw_struct, work)->cpu
+
+#define INIT_QPW(qpw, func, c)					\
+	do {							\
+		struct qpw_struct *__qpw = (qpw);		\
+		INIT_WORK(&__qpw->work, (func));		\
+		__qpw->cpu = (c);				\
+	} while (0)
+
+#endif /* CONFIG_PREEMPT_RT */
+#endif /* _LINUX_QPW_H */
-- 
2.45.2




* [RFC PATCH v1 2/4] swap: apply new queue_percpu_work_on() interface
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work Leonardo Bras
@ 2024-06-22  3:58 ` Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 3/4] memcontrol: " Leonardo Bras
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-22  3:58 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace queue_work_on() by
queue_percpu_work_on(). Likewise, flush_work() is replaced by
flush_percpu_work().

The change requires allocating qpw_structs instead of work_structs,
and changing the parameters of a few functions to include the target cpu.

This should bring no relevant performance impact on non-RT kernels:
For functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/swap.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 67786cb77130..c1a61b7cd71a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -28,21 +28,21 @@
 #include <linux/memremap.h>
 #include <linux/percpu.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
-#include <linux/local_lock.h>
+#include <linux/qpw.h>
 #include <linux/buffer_head.h>
 
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/pagemap.h>
 
 /* How many pages do we try to swap or page in/out together? As a power of 2 */
 int page_cluster;
 const int page_cluster_max = 31;
@@ -758,45 +758,45 @@ void lru_add_drain(void)
 	local_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 /*
  * It's called from per-cpu workqueue context in SMP case so
  * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_and_bh_lrus_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	qpw_unlock(&cpu_fbatches.lock, cpu);
 	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	drain_local_pages(zone);
 	local_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 #ifdef CONFIG_SMP
 
-static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
+static DEFINE_PER_CPU(struct qpw_struct, lru_add_drain_qpw);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(qpw_get_cpu(w));
 }
 
 static bool cpu_needs_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
 	/* Check these in order of likelihood that they're not zero */
 	return folio_batch_count(&fbatches->lru_add) ||
 		data_race(folio_batch_count(&per_cpu(lru_rotate.fbatch, cpu))) ||
 		folio_batch_count(&fbatches->lru_deactivate_file) ||
@@ -882,31 +882,31 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	 *
 	 * If the paired barrier is done at any later step, e.g. after the
 	 * loop, CPU #x will just exit at (C) and miss flushing out all of its
 	 * added pages.
 	 */
 	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
 	smp_mb();
 
 	cpumask_clear(&has_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+		struct qpw_struct *qpw = &per_cpu(lru_add_drain_qpw, cpu);
 
 		if (cpu_needs_drain(cpu)) {
-			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
+			INIT_QPW(qpw, lru_add_drain_per_cpu, cpu);
+			queue_percpu_work_on(cpu, mm_percpu_wq, qpw);
 			__cpumask_set_cpu(cpu, &has_work);
 		}
 	}
 
 	for_each_cpu(cpu, &has_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		flush_percpu_work(&per_cpu(lru_add_drain_qpw, cpu));
 
 done:
 	mutex_unlock(&lock);
 }
 
 void lru_add_drain_all(void)
 {
 	__lru_add_drain_all(false);
 }
 #else
@@ -939,21 +939,21 @@ void lru_cache_disable(void)
 	 *
 	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
 	 * preempt_disable() regions of code. So any CPU which sees
 	 * lru_disable_count = 0 will have exited the critical
 	 * section when synchronize_rcu() returns.
 	 */
 	synchronize_rcu_expedited();
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(smp_processor_id());
 #endif
 }
 
 /**
  * folios_put_refs - Reduce the reference count on a batch of folios.
  * @folios: The folios.
  * @refs: The number of refs to subtract from each folio.
  *
  * Like folio_put(), but for a batch of folios.  This is more efficient
  * than writing the loop yourself as it will optimise the locks which need
-- 
2.45.2




* [RFC PATCH v1 3/4] memcontrol: apply new queue_percpu_work_on() interface
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 2/4] swap: apply new queue_percpu_work_on() interface Leonardo Bras
@ 2024-06-22  3:58 ` Leonardo Bras
  2024-06-22  3:58 ` [RFC PATCH v1 4/4] slub: " Leonardo Bras
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-22  3:58 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
queue_percpu_work_on().

This change requires allocating qpw_structs instead of work_structs.

This should bring no relevant performance impact on non-RT kernels:
For functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/memcontrol.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71fe2a95b8bd..18a987f8c998 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,20 +59,21 @@
 #include <linux/swap_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
 #include <linux/sched/isolation.h>
 #include <linux/kmemleak.h>
+#include <linux/qpw.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
 #include "slab.h"
 #include "swap.h"
 
 #include <linux/uaccess.h>
 
 #include <trace/events/vmscan.h>
 
@@ -2415,21 +2416,21 @@ struct memcg_stock_pcp {
 	unsigned int nr_pages;
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int nr_slab_reclaimable_b;
 	int nr_slab_unreclaimable_b;
 #endif
 
-	struct work_struct work;
+	struct qpw_struct qpw;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
 	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
 };
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
@@ -2503,39 +2504,40 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock_pages);
 
 		WRITE_ONCE(stock->nr_pages, 0);
 	}
 
 	css_put(&old->css);
 	WRITE_ONCE(stock->cached, NULL);
 }
 
-static void drain_local_stock(struct work_struct *dummy)
+static void drain_local_stock(struct work_struct *w)
 {
 	struct memcg_stock_pcp *stock;
 	struct obj_cgroup *old = NULL;
 	unsigned long flags;
+	int cpu = qpw_get_cpu(w);
 
 	/*
 	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	qpw_lock_irqsave(&memcg_stock.stock_lock, flags, cpu);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = per_cpu_ptr(&memcg_stock, cpu);
 	old = drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	qpw_unlock_irqrestore(&memcg_stock.stock_lock, flags, cpu);
 	obj_cgroup_put(old);
 }
 
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
 static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
@@ -2592,23 +2594,23 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && READ_ONCE(stock->nr_pages) &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
 		else if (obj_stock_flush_required(stock, root_memcg))
 			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
+				drain_local_stock(&stock->qpw.work);
 			else if (!cpu_is_isolated(cpu))
-				schedule_work_on(cpu, &stock->work);
+				queue_percpu_work_on(cpu, system_wq, &stock->qpw);
 		}
 	}
 	migrate_enable();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
 
@@ -7956,22 +7958,22 @@ static int __init mem_cgroup_init(void)
 	 * used for per-memcg-per-cpu caching of per-node statistics. In order
 	 * to work fine, we should make sure that the overfill threshold can't
 	 * exceed S32_MAX / PAGE_SIZE.
 	 */
 	BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S32_MAX / PAGE_SIZE);
 
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
 	for_each_possible_cpu(cpu)
-		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-			  drain_local_stock);
+		INIT_QPW(&per_cpu_ptr(&memcg_stock, cpu)->qpw,
+			 drain_local_stock, cpu);
 
 	for_each_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
 
 		rtpn->rb_root = RB_ROOT;
 		rtpn->rb_rightmost = NULL;
 		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-- 
2.45.2




* [RFC PATCH v1 4/4] slub: apply new queue_percpu_work_on() interface
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
                   ` (2 preceding siblings ...)
  2024-06-22  3:58 ` [RFC PATCH v1 3/4] memcontrol: " Leonardo Bras
@ 2024-06-22  3:58 ` Leonardo Bras
  2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
  2024-07-23 17:14 ` Marcelo Tosatti
  5 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-22  3:58 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace queue_work_on() by
queue_percpu_work_on(). Likewise, flush_work() is replaced by
flush_percpu_work().

This change requires allocating qpw_structs instead of work_structs,
and changing the parameters of a few functions to include the target cpu.

This should bring no relevant performance impact on non-RT kernels:
For functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/slub.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1373ac365a46..5cd91541906e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -35,20 +35,21 @@
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
 #include <linux/kmemleak.h>
 #include <linux/stacktrace.h>
 #include <linux/prefetch.h>
 #include <linux/memcontrol.h>
 #include <linux/random.h>
 #include <kunit/test.h>
 #include <kunit/test-bug.h>
 #include <linux/sort.h>
+#include <linux/qpw.h>
 
 #include <linux/debugfs.h>
 #include <trace/events/kmem.h>
 
 #include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
  *   2. node->list_lock (Spinlock)
@@ -3073,36 +3074,37 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 }
 
 #else	/* CONFIG_SLUB_CPU_PARTIAL */
 
 static inline void put_partials(struct kmem_cache *s) { }
 static inline void put_partials_cpu(struct kmem_cache *s,
 				    struct kmem_cache_cpu *c) { }
 
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
+			      int cpu)
 {
 	unsigned long flags;
 	struct slab *slab;
 	void *freelist;
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	qpw_lock_irqsave(&s->cpu_slab->lock, flags, cpu);
 
 	slab = c->slab;
 	freelist = c->freelist;
 
 	c->slab = NULL;
 	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
 
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	qpw_unlock_irqrestore(&s->cpu_slab->lock, flags, cpu);
 
 	if (slab) {
 		deactivate_slab(s, slab, freelist);
 		stat(s, CPUSLAB_FLUSH);
 	}
 }
 
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
@@ -3115,82 +3117,84 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 
 	if (slab) {
 		deactivate_slab(s, slab, freelist);
 		stat(s, CPUSLAB_FLUSH);
 	}
 
 	put_partials_cpu(s, c);
 }
 
 struct slub_flush_work {
-	struct work_struct work;
+	struct qpw_struct qpw;
 	struct kmem_cache *s;
 	bool skip;
 };
 
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
+
 /*
  * Flush cpu slab.
  *
  * Called from CPU work handler with migration disabled.
  */
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
 	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
+	int cpu = qpw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
+	sfw = &per_cpu(slub_flush, cpu);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
+	c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (c->slab)
-		flush_slab(s, c);
+		flush_slab(s, c, cpu);
 
 	put_partials(s);
 }
 
 static bool has_cpu_slab(int cpu, struct kmem_cache *s)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	return c->slab || slub_percpu_partial(c);
 }
 
 static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
 	unsigned int cpu;
 
 	lockdep_assert_cpus_held();
 	mutex_lock(&flush_lock);
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (!has_cpu_slab(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
-		INIT_WORK(&sfw->work, flush_cpu_slab);
+		INIT_QPW(&sfw->qpw, flush_cpu_slab, cpu);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}
 
 	mutex_unlock(&flush_lock);
 }
 
 static void flush_all(struct kmem_cache *s)
 {
 	cpus_read_lock();
 	flush_all_cpus_locked(s);
 	cpus_read_unlock();
-- 
2.45.2




* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
                   ` (3 preceding siblings ...)
  2024-06-22  3:58 ` [RFC PATCH v1 4/4] slub: " Leonardo Bras
@ 2024-06-24  7:31 ` Vlastimil Babka
  2024-06-24 22:54   ` Boqun Feng
                     ` (2 more replies)
  2024-07-23 17:14 ` Marcelo Tosatti
  5 siblings, 3 replies; 23+ messages in thread
From: Vlastimil Babka @ 2024-06-24  7:31 UTC (permalink / raw)
  To: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng
  Cc: linux-kernel, cgroups, linux-mm

Hi,

you've included tglx, which is great, but there's also LOCKING PRIMITIVES
section in MAINTAINERS so I've added folks from there in my reply.
Link to full series:
https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/

On 6/22/24 5:58 AM, Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. Tha major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.

I also noticed this a while ago (likely in the context of rewriting SLUB
to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
the idea. But I forgot the details about why, so I'll let the locking
experts reply...

> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> RT workload. 
> 
> For patches 2, 3 & 4, I noticed just grabing the lock and executing
> the function locally is much faster than just scheduling it on a
> remote cpu.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case. 
> 
> If PREEMPT_RT=n, this interfaces just wraps the current 
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> per-cpu structure and perform work on it locally. This is possible
> because on functions that can be used for performing remote work on
> remote per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> current local_lock + WorkQueue interface by the QPW interface in
> swap, memcontrol & slub interface.
> 
> Please let me know what you think on that, and please suggest
> improvements.
> 
> Thanks a lot!
> Leo
> 
> Leonardo Bras (4):
>   Introducing qpw_lock() and per-cpu queue & flush work
>   swap: apply new queue_percpu_work_on() interface
>   memcontrol: apply new queue_percpu_work_on() interface
>   slub: apply new queue_percpu_work_on() interface
> 
>  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c     | 20 ++++++-----
>  mm/slub.c           | 26 ++++++++------
>  mm/swap.c           | 26 +++++++-------
>  4 files changed, 127 insertions(+), 33 deletions(-)
>  create mode 100644 include/linux/qpw.h
> 
> 
> base-commit: 50736169ecc8387247fe6a00932852ce7b057083




* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
@ 2024-06-24 22:54   ` Boqun Feng
  2024-06-25  2:57     ` Leonardo Bras
  2024-06-25  2:36   ` Leonardo Bras
  2024-07-15 18:38   ` Marcelo Tosatti
  2 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-06-24 22:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, linux-kernel, cgroups, linux-mm

On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> Hi,
> 
> you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> section in MAINTAINERS so I've added folks from there in my reply.

Thanks!

> Link to full series:
> https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> 

And apologies to Leonardo... I think this is a follow-up of:

	https://lpc.events/event/17/contributions/1484/

and I do remember we had a quick chat after that in which I suggested it's
better to change to a different name; sorry that I never found time to
write a proper reply to your previous series [1] as promised.

[1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/

> On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. Tha major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> 
> I've also noticed this a while ago (likely in the context of rewriting SLUB
> to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> the idea. But I forgot the details about why, so I'll let the the locking
> experts reply...
> 

I think it's a good idea, especially since the new name is less confusing ;-)
So I wonder about Thomas' thoughts as well.

And I think a few (micro-)benchmark numbers will help.

Regards,
Boqun

> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> > RT workload. 
> > 
> > For patches 2, 3 & 4, I noticed just grabing the lock and executing
> > the function locally is much faster than just scheduling it on a
> > remote cpu.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case. 
> > 
> > If PREEMPT_RT=n, this interfaces just wraps the current 
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> > per-cpu structure and perform work on it locally. This is possible
> > because on functions that can be used for performing remote work on
> > remote per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > 
> > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> > current local_lock + WorkQueue interface by the QPW interface in
> > swap, memcontrol & slub interface.
> > 
> > Please let me know what you think on that, and please suggest
> > improvements.
> > 
> > Thanks a lot!
> > Leo
> > 
> > Leonardo Bras (4):
> >   Introducing qpw_lock() and per-cpu queue & flush work
> >   swap: apply new queue_percpu_work_on() interface
> >   memcontrol: apply new queue_percpu_work_on() interface
> >   slub: apply new queue_percpu_work_on() interface
> > 
> >  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/memcontrol.c     | 20 ++++++-----
> >  mm/slub.c           | 26 ++++++++------
> >  mm/swap.c           | 26 +++++++-------
> >  4 files changed, 127 insertions(+), 33 deletions(-)
> >  create mode 100644 include/linux/qpw.h
> > 
> > 
> > base-commit: 50736169ecc8387247fe6a00932852ce7b057083
> 



* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
  2024-06-24 22:54   ` Boqun Feng
@ 2024-06-25  2:36   ` Leonardo Bras
  2024-07-15 18:38   ` Marcelo Tosatti
  2 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-25  2:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng, linux-kernel, cgroups,
	linux-mm

On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> Hi,
> 
> you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> section in MAINTAINERS so I've added folks from there in my reply.
> Link to full series:
> https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/

Thanks Vlastimil!

> 
> On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. Tha major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> 
> I've also noticed this a while ago (likely in the context of rewriting SLUB
> to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> the idea. But I forgot the details about why, so I'll let the the locking
> experts reply...
> 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> > RT workload. 
> > 
> > For patches 2, 3 & 4, I noticed just grabing the lock and executing
> > the function locally is much faster than just scheduling it on a
> > remote cpu.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case. 
> > 
> > If PREEMPT_RT=n, this interfaces just wraps the current 
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> > per-cpu structure and perform work on it locally. This is possible
> > because on functions that can be used for performing remote work on
> > remote per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > 
> > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> > current local_lock + WorkQueue interface by the QPW interface in
> > swap, memcontrol & slub interface.
> > 
> > Please let me know what you think on that, and please suggest
> > improvements.
> > 
> > Thanks a lot!
> > Leo
> > 
> > Leonardo Bras (4):
> >   Introducing qpw_lock() and per-cpu queue & flush work
> >   swap: apply new queue_percpu_work_on() interface
> >   memcontrol: apply new queue_percpu_work_on() interface
> >   slub: apply new queue_percpu_work_on() interface
> > 
> >  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/memcontrol.c     | 20 ++++++-----
> >  mm/slub.c           | 26 ++++++++------
> >  mm/swap.c           | 26 +++++++-------
> >  4 files changed, 127 insertions(+), 33 deletions(-)
> >  create mode 100644 include/linux/qpw.h
> > 
> > 
> > base-commit: 50736169ecc8387247fe6a00932852ce7b057083
> 




* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-24 22:54   ` Boqun Feng
@ 2024-06-25  2:57     ` Leonardo Bras
  2024-06-25 17:51       ` Boqun Feng
  2024-06-28 18:47       ` Marcelo Tosatti
  0 siblings, 2 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-25  2:57 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Leonardo Bras, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Waiman Long, linux-kernel, cgroups,
	linux-mm

On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > Hi,
> > 
> > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > section in MAINTAINERS so I've added folks from there in my reply.
> 
> Thanks!
> 
> > Link to full series:
> > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> > 
> 
> And apologies to Leonardo... I think this is a follow-up of:
> 
> 	https://lpc.events/event/17/contributions/1484/
> 
> and I did remember we had a quick chat after that which I suggested it's
> better to change to a different name, sorry that I never found time to
> write a proper rely to your previous seriese [1] as promised.
> 
> [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/

That's correct, I commented on this at the end of the above presentation.
Don't worry, and thanks for suggesting the per-cpu naming; it was very
helpful in designing this solution.

> 
> > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > The idea:
> > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > In this case, instead of scheduling work on a remote cpu, it should
> > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > work locally. Tha major cost, which is un/locking in every local function,
> > > already happens in PREEMPT_RT.
> > 
> > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > the idea. But I forgot the details about why, so I'll let the the locking
> > experts reply...
> > 
> 
> I think it's a good idea, especially the new name is less confusing ;-)
> So I wonder Thomas' thoughts as well.

Thanks!

> 
> And I think a few (micro-)benchmark numbers will help.

Last year I got some numbers on how replacing local_locks with 
spinlocks would impact memcontrol.c cache operations:

https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/

tl;dr: It increased clocks spent in the most common this_cpu operations, 
while reducing clocks spent in remote operations (drain_all_stock).

In the RT case, since local locks are already spinlocks, this cost is
already paid, so we can get results like these:

drain_all_stock
cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
1	44331.10831	38978.03581	-5353.072507	-12.07520567
8	43992.96512	39026.76654	-4966.198572	-11.2886198
128	156274.6634	58053.87421	-98220.78915	-62.85138425

Upstream: Clocks to schedule work on the remote CPU (performing the work
	  is not accounted).
Patched:  Clocks to grab the remote cpu's spinlock and perform the needed
	  work locally.
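
For reference, a minimal sketch of how such numbers can be collected
(assumptions: this would be compiled into mm/memcontrol.c, since
drain_all_stock() is static there, and qpw_bench_drain() is a made-up
debug hook, not part of this series):

#include <linux/timex.h>	/* get_cycles() */

static void qpw_bench_drain(struct mem_cgroup *memcg)
{
	cycles_t t0, t1;

	t0 = get_cycles();
	drain_all_stock(memcg);
	t1 = get_cycles();

	pr_info("drain_all_stock: %llu cycles\n",
		(unsigned long long)(t1 - t0));
}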

Do you have other suggestions to use as (micro-) benchmarking?

Thanks!
Leo


> 
> Regards,
> Boqun
> 
> > > Also, there is no need to worry about extra cache bouncing:
> > > The cacheline invalidation already happens due to schedule_work_on().
> > > 
> > > This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> > > RT workload. 
> > > 
> > > For patches 2, 3 & 4, I noticed just grabing the lock and executing
> > > the function locally is much faster than just scheduling it on a
> > > remote cpu.
> > > 
> > > Proposed solution:
> > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > Work Queue in the above mentioned use case. 
> > > 
> > > If PREEMPT_RT=n, this interfaces just wraps the current 
> > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > 
> > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> > > per-cpu structure and perform work on it locally. This is possible
> > > because on functions that can be used for performing remote work on
> > > remote per-cpu structures, the local_lock (which is already
> > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > > 
> > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> > > current local_lock + WorkQueue interface by the QPW interface in
> > > swap, memcontrol & slub interface.
> > > 
> > > Please let me know what you think on that, and please suggest
> > > improvements.
> > > 
> > > Thanks a lot!
> > > Leo
> > > 
> > > Leonardo Bras (4):
> > >   Introducing qpw_lock() and per-cpu queue & flush work
> > >   swap: apply new queue_percpu_work_on() interface
> > >   memcontrol: apply new queue_percpu_work_on() interface
> > >   slub: apply new queue_percpu_work_on() interface
> > > 
> > >  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > >  mm/memcontrol.c     | 20 ++++++-----
> > >  mm/slub.c           | 26 ++++++++------
> > >  mm/swap.c           | 26 +++++++-------
> > >  4 files changed, 127 insertions(+), 33 deletions(-)
> > >  create mode 100644 include/linux/qpw.h
> > > 
> > > 
> > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083
> > 
> 




* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-25  2:57     ` Leonardo Bras
@ 2024-06-25 17:51       ` Boqun Feng
  2024-06-26 16:40         ` Leonardo Bras
  2024-06-28 18:47       ` Marcelo Tosatti
  1 sibling, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2024-06-25 17:51 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Vlastimil Babka, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, linux-kernel, cgroups, linux-mm

On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > Hi,
> > > 
> > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > > section in MAINTAINERS so I've added folks from there in my reply.
> > 
> > Thanks!
> > 
> > > Link to full series:
> > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> > > 
> > 
> > And apologies to Leonardo... I think this is a follow-up of:
> > 
> > 	https://lpc.events/event/17/contributions/1484/
> > 
> > and I did remember we had a quick chat after that which I suggested it's
> > better to change to a different name, sorry that I never found time to
> > write a proper rely to your previous seriese [1] as promised.
> > 
> > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/
> 
> That's correct, I commented about this in the end of above presentation.
> Don't worry, and thanks for suggesting the per-cpu naming, it was very 
> helpful on designing this solution.
> 
> > 
> > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting on local_locks() for most of the work, and some rare remote
> > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > kernels, even though the very few remote operations will be expensive due
> > > > to scheduling overhead.
> > > > 
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > > 
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > work locally. Tha major cost, which is un/locking in every local function,
> > > > already happens in PREEMPT_RT.
> > > 
> > > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > > the idea. But I forgot the details about why, so I'll let the the locking
> > > experts reply...
> > > 
> > 
> > I think it's a good idea, especially the new name is less confusing ;-)
> > So I wonder Thomas' thoughts as well.
> 
> Thanks!
> 
> > 
> > And I think a few (micro-)benchmark numbers will help.
> 
> Last year I got some numbers on how replacing local_locks with 
> spinlocks would impact memcontrol.c cache operations:
> 
> https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/
> 
> tl;dr: It increased clocks spent in the most common this_cpu operations, 
> while reducing clocks spent in remote operations (drain_all_stock).
> 
> In RT case, since local locks are already spinlocks, this cost is 
> already paid, so we can get results like these:
> 
> drain_all_stock
> cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
> 1	44331.10831	38978.03581	-5353.072507	-12.07520567
> 8	43992.96512	39026.76654	-4966.198572	-11.2886198
> 128	156274.6634	58053.87421	-98220.78915	-62.85138425
> 
> Upstream: Clocks to schedule work on remote CPU (performing not accounted)
> Patched:  Clocks to grab remote cpu's spinlock and perform the needed work 
> 	  locally.

This looks good as a micro-benchmark. And it answers why we need patch
#3 in this series. It would be better if we had something similar for
patches #2 and #4.

Besides, micro-benchmarks are usually a bit artificial IMO; it's better
if we have data to prove that your changes improve the performance
from a more global view. For example, could you find or create a use
case where flush_slab() becomes somewhat of a hot path? We could then
see the performance gain from your changes in that use case. Maybe
Vlastimil has something in mind already? ;-)

Also keep in mind that your changes apply to RT, so a natural follow-up
question would be: will it hurt the system latency? I know little about
this area, so I must defer this to the experts.

The above concern brings another opportunity: would it make sense to use
real locks instead of queuing work on a remote CPU in the case when RT
is not needed, but CPU isolation is important? I.e. nohz_full
situations?

> 
> Do you have other suggestions to use as (micro-) benchmarking?
> 

My overall suggestion is that you have indeed found a valuable pattern where
queuing remote work may not be the best option, but a real-world use case
would usually better justify the extra complexity that we will pay.

Does this make sense?

Regards,
Boqun

> Thanks!
> Leo
> 
> 
> > 
> > Regards,
> > Boqun
> > 
> > > > Also, there is no need to worry about extra cache bouncing:
> > > > The cacheline invalidation already happens due to schedule_work_on().
> > > > 
> > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> > > > RT workload. 
> > > > 
> > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing
> > > > the function locally is much faster than just scheduling it on a
> > > > remote cpu.
> > > > 
> > > > Proposed solution:
> > > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > > Work Queue in the above mentioned use case. 
> > > > 
> > > > If PREEMPT_RT=n, this interfaces just wraps the current 
> > > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > > 
> > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> > > > per-cpu structure and perform work on it locally. This is possible
> > > > because on functions that can be used for performing remote work on
> > > > remote per-cpu structures, the local_lock (which is already
> > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > > > 
> > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> > > > current local_lock + WorkQueue interface by the QPW interface in
> > > > swap, memcontrol & slub interface.
> > > > 
> > > > Please let me know what you think on that, and please suggest
> > > > improvements.
> > > > 
> > > > Thanks a lot!
> > > > Leo
> > > > 
> > > > Leonardo Bras (4):
> > > >   Introducing qpw_lock() and per-cpu queue & flush work
> > > >   swap: apply new queue_percpu_work_on() interface
> > > >   memcontrol: apply new queue_percpu_work_on() interface
> > > >   slub: apply new queue_percpu_work_on() interface
> > > > 
> > > >  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  mm/memcontrol.c     | 20 ++++++-----
> > > >  mm/slub.c           | 26 ++++++++------
> > > >  mm/swap.c           | 26 +++++++-------
> > > >  4 files changed, 127 insertions(+), 33 deletions(-)
> > > >  create mode 100644 include/linux/qpw.h
> > > > 
> > > > 
> > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083
> > > 
> > 
> 



* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-25 17:51       ` Boqun Feng
@ 2024-06-26 16:40         ` Leonardo Bras
  0 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-06-26 16:40 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Leonardo Bras, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Waiman Long, linux-kernel, cgroups,
	linux-mm

On Tue, Jun 25, 2024 at 10:51:13AM -0700, Boqun Feng wrote:
> On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> > On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > > Hi,
> > > > 
> > > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > > > section in MAINTAINERS so I've added folks from there in my reply.
> > > 
> > > Thanks!
> > > 
> > > > Link to full series:
> > > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> > > > 
> > > 
> > > And apologies to Leonardo... I think this is a follow-up of:
> > > 
> > > 	https://lpc.events/event/17/contributions/1484/
> > > 
> > > and I did remember we had a quick chat after that which I suggested it's
> > > better to change to a different name, sorry that I never found time to
> > > write a proper rely to your previous seriese [1] as promised.
> > > 
> > > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/
> > 
> > That's correct, I commented about this in the end of above presentation.
> > Don't worry, and thanks for suggesting the per-cpu naming, it was very 
> > helpful on designing this solution.
> > 
> > > 
> > > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > > The problem:
> > > > > Some places in the kernel implement a parallel programming strategy
> > > > > consisting on local_locks() for most of the work, and some rare remote
> > > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > > kernels, even though the very few remote operations will be expensive due
> > > > > to scheduling overhead.
> > > > > 
> > > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > > an important workload scheduled out to deal with remote requests is
> > > > > sure to introduce unexpected deadline misses.
> > > > > 
> > > > > The idea:
> > > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > > work locally. Tha major cost, which is un/locking in every local function,
> > > > > already happens in PREEMPT_RT.
> > > > 
> > > > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > > > the idea. But I forgot the details about why, so I'll let the the locking
> > > > experts reply...
> > > > 
> > > 
> > > I think it's a good idea, especially the new name is less confusing ;-)
> > > So I wonder Thomas' thoughts as well.
> > 
> > Thanks!
> > 
> > > 
> > > And I think a few (micro-)benchmark numbers will help.
> > 
> > Last year I got some numbers on how replacing local_locks with 
> > spinlocks would impact memcontrol.c cache operations:
> > 
> > https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/
> > 
> > tl;dr: It increased clocks spent in the most common this_cpu operations, 
> > while reducing clocks spent in remote operations (drain_all_stock).
> > 
> > In RT case, since local locks are already spinlocks, this cost is 
> > already paid, so we can get results like these:
> > 
> > drain_all_stock
> > cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
> > 1	44331.10831	38978.03581	-5353.072507	-12.07520567
> > 8	43992.96512	39026.76654	-4966.198572	-11.2886198
> > 128	156274.6634	58053.87421	-98220.78915	-62.85138425
> > 
> > Upstream: Clocks to schedule work on remote CPU (performing not accounted)
> > Patched:  Clocks to grab remote cpu's spinlock and perform the needed work 
> > 	  locally.
> 
> This looks good as a micro-benchmark. And it answers why we need patch
> #3 in this series. It'll be better if we have something similar for
> patch #2 and #4.

I suppose that, given the parallel programming scheme is the same, the 
results will tend to be similar; but sure, I can provide such tests.

> 
> Besides, micro-benchmarks are usually a bit artifical IMO, it's better
> if we have the data to prove that your changes improve the performance
> from a more global view. For example, could you find or create a use
> case where flush_slab() becomes somewhat a hot path? And we can then
> know the performance gain from your changes in that use case. Maybe
> Vlastimil has something in his mind already? ;-)
> 
> Also keep in mind that your changes apply to RT, so a natural follow-up
> question would be: will it hurt the system latency? I know litte about
> this area, so I must defer this to experts.

While we do notice some performance improvements, the whole point of this 
patchset is not to gain performance, but to reduce latency:

When we call schedule_work_on() or queue_work_on(), we end up having a 
processor interrupted (IPI) to deal with the required work. If this 
processor is running an RT task, that introduces latency.

So by removing some of those IPIs we get a noticeable reduction in max 
latency, in tests such as cyclictest and oslat. Maybe it's a good idea to 
include those results in the cover letter.
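
To make the contrast concrete, here is a minimal sketch (illustrative 
only; the drain_remote_*() names are made up, error handling and the 
actual drain work are omitted, and it reuses the interfaces from patch 1):

/* Today: interrupt the remote CPU and wait for its kworker. */
static void drain_remote_today(int cpu, struct work_struct *work)
{
	queue_work_on(cpu, system_wq, work);	/* IPI + kworker wakeup on 'cpu' */
	flush_work(work);			/* wait for the remote kworker   */
}

/*
 * With QPW on PREEMPT_RT: the work runs right here, under the remote
 * cpu's per-cpu spinlock, so the RT task on 'cpu' keeps running.
 */
static void drain_remote_qpw(int cpu, struct qpw_struct *qpw)
{
	queue_percpu_work_on(cpu, system_wq, qpw);	/* calls qpw->work.func() locally */
	flush_percpu_work(qpw);				/* no-op on RT */
}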
	
> 
> The above concern brings another opportunity: would it make sense to use
> real locks instead of queuing work on a remote CPU in the case when RT
> is not needed, but CPU isolation is important? I.e. nohz_full
> situations?

With the qpw interface in place, that is easily achievable: 
we can add a kernel parameter that makes the qpw_*lock() calls use 
spinlocks when isolation is enabled. Even if this were a static branch, it 
would still add some overhead in the non-isolated + non-RT case.
But in any case, I am open to implementing this if there is a use case.
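
Something along these lines, just as an illustration (the parameter name 
and the qpw_remote_lock_enabled key are made up, and on !RT the backing 
lock would also have to become a real spinlock for the remote path to 
work; those details are omitted):

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(qpw_remote_lock_enabled);

/* e.g. booted with "qpw_remote_lock" on isolated/nohz_full setups */
static int __init qpw_remote_lock_setup(char *str)
{
	static_branch_enable(&qpw_remote_lock_enabled);
	return 1;
}
__setup("qpw_remote_lock", qpw_remote_lock_setup);

static inline void qpw_queue(int cpu, struct workqueue_struct *wq,
			     struct qpw_struct *qpw)
{
	if (static_branch_unlikely(&qpw_remote_lock_enabled))
		/* lock cpu's per-cpu spinlock and run the work here */
		qpw->work.func(&qpw->work);
	else
		queue_work_on(cpu, wq, &qpw->work);
}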

> 
> > 
> > Do you have other suggestions to use as (micro-) benchmarking?
> > 
> 
> My overall suggestion is that you do find a valuable pattern where
> queuing remote work may not be the best option, but usually a real world
> usage would make more sense for the extra complexity that we will pay.
> 
> Does this make sense?

Yes, it does. There are scenarios which cause a lot of queue_work_on() 
calls, and this patchset would increase performance there on RT. I think 
Marcelo showed me an example in mm/ a while ago.

But my goal is mainly to show that this change does not increase 
overhead, can actually bring some improvements on RT, and achieves the 
latency reduction which is the desired feature.


Thanks!
Leo

> 
> Regards,
> Boqun
> 
> > Thanks!
> > Leo
> > 
> > 
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > > > Also, there is no need to worry about extra cache bouncing:
> > > > > The cacheline invalidation already happens due to schedule_work_on().
> > > > > 
> > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an 
> > > > > RT workload. 
> > > > > 
> > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing
> > > > > the function locally is much faster than just scheduling it on a
> > > > > remote cpu.
> > > > > 
> > > > > Proposed solution:
> > > > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > > > Work Queue in the above mentioned use case. 
> > > > > 
> > > > > If PREEMPT_RT=n, this interfaces just wraps the current 
> > > > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > > > 
> > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> > > > > per-cpu structure and perform work on it locally. This is possible
> > > > > because on functions that can be used for performing remote work on
> > > > > remote per-cpu structures, the local_lock (which is already
> > > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > > > > 
> > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the
> > > > > current local_lock + WorkQueue interface by the QPW interface in
> > > > > swap, memcontrol & slub interface.
> > > > > 
> > > > > Please let me know what you think on that, and please suggest
> > > > > improvements.
> > > > > 
> > > > > Thanks a lot!
> > > > > Leo
> > > > > 
> > > > > Leonardo Bras (4):
> > > > >   Introducing qpw_lock() and per-cpu queue & flush work
> > > > >   swap: apply new queue_percpu_work_on() interface
> > > > >   memcontrol: apply new queue_percpu_work_on() interface
> > > > >   slub: apply new queue_percpu_work_on() interface
> > > > > 
> > > > >  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  mm/memcontrol.c     | 20 ++++++-----
> > > > >  mm/slub.c           | 26 ++++++++------
> > > > >  mm/swap.c           | 26 +++++++-------
> > > > >  4 files changed, 127 insertions(+), 33 deletions(-)
> > > > >  create mode 100644 include/linux/qpw.h
> > > > > 
> > > > > 
> > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083
> > > > 
> > > 
> > 
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-25  2:57     ` Leonardo Bras
  2024-06-25 17:51       ` Boqun Feng
@ 2024-06-28 18:47       ` Marcelo Tosatti
  1 sibling, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2024-06-28 18:47 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Boqun Feng, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, linux-kernel, cgroups, linux-mm

On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > Hi,
> > > 
> > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > > section in MAINTAINERS so I've added folks from there in my reply.
> > 
> > Thanks!
> > 
> > > Link to full series:
> > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> > > 
> > 
> > And apologies to Leonardo... I think this is a follow-up of:
> > 
> > 	https://lpc.events/event/17/contributions/1484/
> > 
> > and I did remember we had a quick chat after that which I suggested it's
> > better to change to a different name, sorry that I never found time to
> > write a proper rely to your previous seriese [1] as promised.
> > 
> > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/
> 
> That's correct, I commented about this in the end of above presentation.
> Don't worry, and thanks for suggesting the per-cpu naming, it was very 
> helpful on designing this solution.
> 
> > 
> > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting on local_locks() for most of the work, and some rare remote
> > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > kernels, even though the very few remote operations will be expensive due
> > > > to scheduling overhead.
> > > > 
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > > 
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > work locally. Tha major cost, which is un/locking in every local function,
> > > > already happens in PREEMPT_RT.
> > > 
> > > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > > the idea. But I forgot the details about why, so I'll let the the locking
> > > experts reply...
> > > 
> > 
> > I think it's a good idea, especially the new name is less confusing ;-)
> > So I wonder Thomas' thoughts as well.
> 
> Thanks!
> 
> > 
> > And I think a few (micro-)benchmark numbers will help.
> 
> Last year I got some numbers on how replacing local_locks with 
> spinlocks would impact memcontrol.c cache operations:
> 
> https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/
> 
> tl;dr: It increased clocks spent in the most common this_cpu operations, 
> while reducing clocks spent in remote operations (drain_all_stock).
> 
> In RT case, since local locks are already spinlocks, this cost is 
> already paid, so we can get results like these:
> 
> drain_all_stock
> cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
> 1	44331.10831	38978.03581	-5353.072507	-12.07520567
> 8	43992.96512	39026.76654	-4966.198572	-11.2886198
> 128	156274.6634	58053.87421	-98220.78915	-62.85138425
> 
> Upstream: Clocks to schedule work on remote CPU (performing not accounted)
> Patched:  Clocks to grab remote cpu's spinlock and perform the needed work 
> 	  locally.
> 
> Do you have other suggestions to use as (micro-) benchmarking?
> 
> Thanks!
> Leo

One improvement which was noted when mm/page_alloc.c was converted to 
spinlock + remote drain was that it can bypass waiting for the kworker 
to be scheduled (on heavily loaded CPUs):

commit 443c2accd1b6679a1320167f8f56eed6536b806e
Author: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Date:   Fri Jun 24 13:54:22 2022 +0100

    mm/page_alloc: remotely drain per-cpu lists
    
    Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
    drain work queued by __drain_all_pages().  So introduce a new mechanism to
    remotely drain the per-cpu lists.  It is made possible by remotely locking
    'struct per_cpu_pages' new per-cpu spinlocks.  A benefit of this new
    scheme is that drain operations are now migration safe.
    
    There was no observed performance degradation vs.  the previous scheme.
    Both netperf and hackbench were run in parallel to triggering the
    __drain_all_pages(NULL, true) code path around ~100 times per second.  The
    new scheme performs a bit better (~5%), although the important point here
    is there are no performance regressions vs.  the previous mechanism.
    Per-cpu lists draining happens only in slow paths.
    
    Minchan Kim tested an earlier version and reported;
    
            My workload is not NOHZ CPUs but run apps under heavy memory
            pressure so they goes to direct reclaim and be stuck on
            drain_all_pages until work on workqueue run.
    
            unit: nanosecond
            max(dur)        avg(dur)                count(dur)
            166713013       487511.77786438033      1283
    
            From traces, system encountered the drain_all_pages 1283 times and
            worst case was 166ms and avg was 487us.
    
            The other problem was alloc_contig_range in CMA. The PCP draining
            takes several hundred millisecond sometimes though there is no
            memory pressure or a few of pages to be migrated out but CPU were
            fully booked.
    
            Your patch perfectly removed those wasted time.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
  2024-06-24 22:54   ` Boqun Feng
  2024-06-25  2:36   ` Leonardo Bras
@ 2024-07-15 18:38   ` Marcelo Tosatti
  2 siblings, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2024-07-15 18:38 UTC (permalink / raw)
  To: Vlastimil Babka, Thomas Gleixner
  Cc: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Waiman Long, Boqun Feng, linux-kernel, cgroups, linux-mm

On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> Hi,
> 
> you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> section in MAINTAINERS so I've added folks from there in my reply.
> Link to full series:
> https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> 
> On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. Tha major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> 
> I've also noticed this a while ago (likely in the context of rewriting SLUB
> to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> the idea. But I forgot the details about why, so I'll let the the locking
> experts reply...

Thomas?




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
                   ` (4 preceding siblings ...)
  2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
@ 2024-07-23 17:14 ` Marcelo Tosatti
  2024-09-05 22:19   ` Hillf Danton
  5 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2024-07-23 17:14 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, linux-kernel, cgroups, linux-mm

On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.

Another hang with a busy polling workload (kernel update hangs on
grub2-probe):

[342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
[342431.665458]       Tainted: G        W      X  -------  ---  5.14.0-438.el9s.x86_64+rt #1
[342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[342431.665515] task:grub2-probe     state:D stack:0     pid:24484 ppid:24455  flags:0x00004002
[342431.665523] Call Trace:
[342431.665525]  <TASK>
[342431.665527]  __schedule+0x22a/0x580
[342431.665537]  schedule+0x30/0x80
[342431.665539]  schedule_timeout+0x153/0x190
[342431.665543]  ? preempt_schedule_thunk+0x16/0x30
[342431.665548]  ? preempt_count_add+0x70/0xa0
[342431.665554]  __wait_for_common+0x8b/0x1c0
[342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
[342431.665560]  __flush_work.isra.0+0x15b/0x220
[342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
[342431.665570]  __lru_add_drain_all+0x17d/0x220
[342431.665576]  invalidate_bdev+0x28/0x40
[342431.665583]  blkdev_common_ioctl+0x714/0xa30
[342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
[342431.665593]  ? cp_new_stat+0xbb/0x180
[342431.665599]  blkdev_ioctl+0x112/0x270
[342431.665603]  ? security_file_ioctl+0x2f/0x50
[342431.665609]  __x64_sys_ioctl+0x87/0xc0
[342431.665614]  do_syscall_64+0x5c/0xf0
[342431.665619]  ? __ct_user_enter+0x89/0x130
[342431.665623]  ? syscall_exit_to_user_mode+0x22/0x40
[342431.665625]  ? do_syscall_64+0x6b/0xf0
[342431.665627]  ? __ct_user_enter+0x89/0x130
[342431.665629]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[342431.665635] RIP: 0033:0x7f39856c757b
[342431.665666] RSP: 002b:00007ffd9541c488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[342431.665670] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f39856c757b
[342431.665673] RDX: 0000000000000000 RSI: 0000000000001261 RDI: 0000000000000005
[342431.665674] RBP: 00007ffd9541c540 R08: 0000000000000003 R09: 006164732f766564
[342431.665676] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd9543ca68
[342431.665678] R13: 000055ea758a0708 R14: 000055ea759de338 R15: 00007f398586f000



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-06-22  3:58 ` [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work Leonardo Bras
@ 2024-09-04 21:39   ` Waiman Long
  2024-09-05  0:08     ` Waiman Long
  2024-09-11  7:17     ` Leonardo Bras
  0 siblings, 2 replies; 23+ messages in thread
From: Waiman Long @ 2024-09-04 21:39 UTC (permalink / raw)
  To: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

On 6/21/24 23:58, Leonardo Bras wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with some unrelated task is
> sure to introduce unexpected deadline misses.
>
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> it's spinlock().
>
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> helpers to run the remote work.
>
> On non-RT kernels, no changes are expected, as every one of the introduced
> helpers work the exactly same as the current implementation:
> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> queue_percpu_work_on()  ->  queue_work_on()
> flush_percpu_work()     ->  flush_work()
>
> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
> to select the correct per-cpu structure to work on, and acquire the
> spinlock for that cpu.
>
> queue_percpu_work_on() will just call the requested function in the current
> cpu, which will operate in another cpu's per-cpu object. Since the
> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>
> flush_percpu_work() then becomes a no-op since no work is actually
> scheduled on a remote cpu.
>
> Some minimal code rework is needed in order to make this mechanism work:
> The calls for local_{un,}lock*() on the functions that are currently
> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
> RT kernels they can reference a different cpu. It's also necessary to use a
> qpw_struct instead of a work_struct, but it just contains a work struct
> and, in PREEMPT_RT, the target cpu.
>
> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> will become per_cpu_ptr(,smp_processor_id()).
>
> On RT kernels, this should improve performance and reduce latency by
> removing scheduling noise.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 88 insertions(+)
>   create mode 100644 include/linux/qpw.h
>
> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> new file mode 100644
> index 000000000000..ea2686a01e5e
> --- /dev/null
> +++ b/include/linux/qpw.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_QPW_H
> +#define _LINUX_QPW_H
> +
> +#include "linux/local_lock.h"
> +#include "linux/workqueue.h"
> +
> +#ifndef CONFIG_PREEMPT_RT
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +};
> +
> +#define qpw_lock(lock, cpu)					\
> +	local_lock(lock)
> +
> +#define qpw_unlock(lock, cpu)					\
> +	local_unlock(lock)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)			\
> +	local_lock_irqsave(lock, flags)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define queue_percpu_work_on(c, wq, qpw)			\
> +	queue_work_on(c, wq, &(qpw)->work)
> +
> +#define flush_percpu_work(qpw)					\
> +	flush_work(&(qpw)->work)
> +
> +#define qpw_get_cpu(qpw)					\
> +	smp_processor_id()
> +
> +#define INIT_QPW(qpw, func, c)					\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> +#else /* !CONFIG_PREEMPT_RT */
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +	int cpu;
> +};
> +
> +#define qpw_lock(__lock, cpu)					\
> +	do {							\
> +		migrate_disable();				\
> +		spin_lock(per_cpu_ptr((__lock), cpu));		\
> +	} while (0)
> +
> +#define qpw_unlock(__lock, cpu)					\
> +	do {							\
> +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
> +		migrate_enable();				\
> +	} while (0)

Why is there a migrate_disable/enable() call in qpw_lock/unlock()? The 
rt_spin_lock/unlock() calls already include a 
migrate_disable/enable() pair.

> +
> +#define qpw_lock_irqsave(lock, flags, cpu)			\
> +	do {							\
> +		typecheck(unsigned long, flags);		\
> +		flags = 0;					\
> +		qpw_lock(lock, cpu);				\
> +	} while (0)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> +	qpw_unlock(lock, cpu)
> +
> +#define queue_percpu_work_on(c, wq, qpw)			\
> +	do {							\
> +		struct qpw_struct *__qpw = (qpw);		\
> +		WARN_ON((c) != __qpw->cpu);			\
> +		__qpw->work.func(&__qpw->work);			\
> +	} while (0)
> +
> +#define flush_percpu_work(qpw)					\
> +	do {} while (0)
> +
> +#define qpw_get_cpu(w)						\
> +	container_of((w), struct qpw_struct, work)->cpu
> +
> +#define INIT_QPW(qpw, func, c)					\
> +	do {							\
> +		struct qpw_struct *__qpw = (qpw);		\
> +		INIT_WORK(&__qpw->work, (func));		\
> +		__qpw->cpu = (c);				\
> +	} while (0)
> +
> +#endif /* CONFIG_PREEMPT_RT */
> +#endif /* LINUX_QPW_H */

You may also consider adding a documentation file about the 
qpw_lock/unlock() calls.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-09-04 21:39   ` Waiman Long
@ 2024-09-05  0:08     ` Waiman Long
  2024-09-11  7:18       ` Leonardo Bras
  2024-09-11  7:17     ` Leonardo Bras
  1 sibling, 1 reply; 23+ messages in thread
From: Waiman Long @ 2024-09-05  0:08 UTC (permalink / raw)
  To: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

On 9/4/24 17:39, Waiman Long wrote:
> On 6/21/24 23:58, Leonardo Bras wrote:
>> Some places in the kernel implement a parallel programming strategy
>> consisting on local_locks() for most of the work, and some rare remote
>> operations are scheduled on target cpu. This keeps cache bouncing low 
>> since
>> cacheline tends to be mostly local, and avoids the cost of locks in 
>> non-RT
>> kernels, even though the very few remote operations will be expensive 
>> due
>> to scheduling overhead.
>>
>> On the other hand, for RT workloads this can represent a problem: 
>> getting
>> an important workload scheduled out to deal with some unrelated task is
>> sure to introduce unexpected deadline misses.
>>
>> It's interesting, though, that local_lock()s in RT kernels become
>> spinlock(). We can make use of those to avoid scheduling work on a 
>> remote
>> cpu by directly updating another cpu's per_cpu structure, while holding
>> it's spinlock().
>>
>> In order to do that, it's necessary to introduce a new set of 
>> functions to
>> make it possible to get another cpu's per-cpu "local" lock 
>> (qpw_{un,}lock*)
>> and also the corresponding queue_percpu_work_on() and 
>> flush_percpu_work()
>> helpers to run the remote work.
>>
>> On non-RT kernels, no changes are expected, as every one of the 
>> introduced
>> helpers work the exactly same as the current implementation:
>> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
>> queue_percpu_work_on()  ->  queue_work_on()
>> flush_percpu_work()     ->  flush_work()
>>
>> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu 
>> parameter
>> to select the correct per-cpu structure to work on, and acquire the
>> spinlock for that cpu.
>>
>> queue_percpu_work_on() will just call the requested function in the 
>> current
>> cpu, which will operate in another cpu's per-cpu object. Since the
>> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>>
>> flush_percpu_work() then becomes a no-op since no work is actually
>> scheduled on a remote cpu.
>>
>> Some minimal code rework is needed in order to make this mechanism work:
>> The calls for local_{un,}lock*() on the functions that are currently
>> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), 
>> so in
>> RT kernels they can reference a different cpu. It's also necessary to 
>> use a
>> qpw_struct instead of a work_struct, but it just contains a work struct
>> and, in PREEMPT_RT, the target cpu.
>>
>> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
>> will become per_cpu_ptr(,smp_processor_id()).
>>
>> On RT kernels, this should improve performance and reduce latency by
>> removing scheduling noise.
>>
>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>> ---
>>   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 88 insertions(+)
>>   create mode 100644 include/linux/qpw.h
>>
>> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
>> new file mode 100644
>> index 000000000000..ea2686a01e5e
>> --- /dev/null
>> +++ b/include/linux/qpw.h
>> @@ -0,0 +1,88 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_QPW_H
>> +#define _LINUX_QPW_H

I would suggest adding a comment with a brief description of what 
qpw_lock/unlock() are for and their use cases. The "qpw" prefix itself 
isn't intuitive enough for a casual reader to understand what they are for.
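
For example, something like this at the top of qpw.h could do (wording is 
only illustrative, based on the commit message above):

/*
 * QPW: Queue PerCPU Work.
 *
 * Wrapper around the "local_lock for local work + queue_work_on() for
 * rare remote work" pattern.  On !PREEMPT_RT these helpers map directly
 * onto local_lock() and queue_work_on()/flush_work().  On PREEMPT_RT,
 * where local_locks are per-cpu spinlocks, qpw_lock(lock, cpu) acquires
 * the target CPU's spinlock so the "remote" work can run on the current
 * CPU instead, avoiding waking a kworker on (and scheduling out a task
 * on) the remote CPU.
 */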

Cheers,
Longman



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-07-23 17:14 ` Marcelo Tosatti
@ 2024-09-05 22:19   ` Hillf Danton
  2024-09-11  3:04     ` Marcelo Tosatti
  2024-09-11  6:42     ` Leonardo Bras
  0 siblings, 2 replies; 23+ messages in thread
From: Hillf Danton @ 2024-09-05 22:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, Michal Hocko, Roman Gushchin, linux-kernel,
	linux-mm

On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> 
> Another hang with a busy polling workload (kernel update hangs on
> grub2-probe):
> 
> [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
> [342431.665458]       Tainted: G        W      X  -------  ---  5.14.0-438.el9s.x86_64+rt #1
> [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [342431.665515] task:grub2-probe     state:D stack:0     pid:24484 ppid:24455  flags:0x00004002
> [342431.665523] Call Trace:
> [342431.665525]  <TASK>
> [342431.665527]  __schedule+0x22a/0x580
> [342431.665537]  schedule+0x30/0x80
> [342431.665539]  schedule_timeout+0x153/0x190
> [342431.665543]  ? preempt_schedule_thunk+0x16/0x30
> [342431.665548]  ? preempt_count_add+0x70/0xa0
> [342431.665554]  __wait_for_common+0x8b/0x1c0
> [342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
> [342431.665560]  __flush_work.isra.0+0x15b/0x220

The fresh new flush_percpu_work() is a nop with CONFIG_PREEMPT_RT enabled, 
so why are you testing it with 5.14.0-438.el9s.x86_64+rt instead of 
mainline? Or what are you testing?

BTW the hang fails to show the unexpected deadline misses.

> [342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
> [342431.665570]  __lru_add_drain_all+0x17d/0x220
> [342431.665576]  invalidate_bdev+0x28/0x40
> [342431.665583]  blkdev_common_ioctl+0x714/0xa30
> [342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
> [342431.665593]  ? cp_new_stat+0xbb/0x180
> [342431.665599]  blkdev_ioctl+0x112/0x270
> [342431.665603]  ? security_file_ioctl+0x2f/0x50
> [342431.665609]  __x64_sys_ioctl+0x87/0xc0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-09-05 22:19   ` Hillf Danton
@ 2024-09-11  3:04     ` Marcelo Tosatti
  2024-09-15  0:30       ` Hillf Danton
  2024-09-11  6:42     ` Leonardo Bras
  1 sibling, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2024-09-11  3:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Leonardo Bras, Michal Hocko, Roman Gushchin, linux-kernel,
	linux-mm

On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote:
> On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > 
> > Another hang with a busy polling workload (kernel update hangs on
> > grub2-probe):
> > 
> > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
> > [342431.665458]       Tainted: G        W      X  -------  ---  5.14.0-438.el9s.x86_64+rt #1
> > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [342431.665515] task:grub2-probe     state:D stack:0     pid:24484 ppid:24455  flags:0x00004002
> > [342431.665523] Call Trace:
> > [342431.665525]  <TASK>
> > [342431.665527]  __schedule+0x22a/0x580
> > [342431.665537]  schedule+0x30/0x80
> > [342431.665539]  schedule_timeout+0x153/0x190
> > [342431.665543]  ? preempt_schedule_thunk+0x16/0x30
> > [342431.665548]  ? preempt_count_add+0x70/0xa0
> > [342431.665554]  __wait_for_common+0x8b/0x1c0
> > [342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
> > [342431.665560]  __flush_work.isra.0+0x15b/0x220
> 
> The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why
> are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what
> are you testing?

I am demonstrating a type of bug that can happen without Leo's patch.

> BTW the hang fails to show the unexpected deadline misses.

Yes, because in this case the realtime app with FIFO priority never
stops running, so grub2-probe hangs and is unable to execute:

> > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds
> 
> > [342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
> > [342431.665570]  __lru_add_drain_all+0x17d/0x220
> > [342431.665576]  invalidate_bdev+0x28/0x40
> > [342431.665583]  blkdev_common_ioctl+0x714/0xa30
> > [342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
> > [342431.665593]  ? cp_new_stat+0xbb/0x180
> > [342431.665599]  blkdev_ioctl+0x112/0x270
> > [342431.665603]  ? security_file_ioctl+0x2f/0x50
> > [342431.665609]  __x64_sys_ioctl+0x87/0xc0

Does that make sense now?

Thanks!



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-09-05 22:19   ` Hillf Danton
  2024-09-11  3:04     ` Marcelo Tosatti
@ 2024-09-11  6:42     ` Leonardo Bras
  1 sibling, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-09-11  6:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Leonardo Bras, Marcelo Tosatti, Michal Hocko, Roman Gushchin,
	linux-kernel, linux-mm

On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote:
> On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > 
> > Another hang with a busy polling workload (kernel update hangs on
> > grub2-probe):
> > 
> > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
> > [342431.665458]       Tainted: G        W      X  -------  ---  5.14.0-438.el9s.x86_64+rt #1
> > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [342431.665515] task:grub2-probe     state:D stack:0     pid:24484 ppid:24455  flags:0x00004002
> > [342431.665523] Call Trace:
> > [342431.665525]  <TASK>
> > [342431.665527]  __schedule+0x22a/0x580
> > [342431.665537]  schedule+0x30/0x80
> > [342431.665539]  schedule_timeout+0x153/0x190
> > [342431.665543]  ? preempt_schedule_thunk+0x16/0x30
> > [342431.665548]  ? preempt_count_add+0x70/0xa0
> > [342431.665554]  __wait_for_common+0x8b/0x1c0
> > [342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
> > [342431.665560]  __flush_work.isra.0+0x15b/0x220
> 
> The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why
> are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what
> are you testing?
> 
> BTW the hang fails to show the unexpected deadline misses.

I think he is showing a client case in which my patchset would be helpful 
and would avoid those stalls with PREEMPT_RT=y.

> 
> > [342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
> > [342431.665570]  __lru_add_drain_all+0x17d/0x220
> > [342431.665576]  invalidate_bdev+0x28/0x40
> > [342431.665583]  blkdev_common_ioctl+0x714/0xa30
> > [342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
> > [342431.665593]  ? cp_new_stat+0xbb/0x180
> > [342431.665599]  blkdev_ioctl+0x112/0x270
> > [342431.665603]  ? security_file_ioctl+0x2f/0x50
> > [342431.665609]  __x64_sys_ioctl+0x87/0xc0
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-09-04 21:39   ` Waiman Long
  2024-09-05  0:08     ` Waiman Long
@ 2024-09-11  7:17     ` Leonardo Bras
  2024-09-11 13:39       ` Waiman Long
  1 sibling, 1 reply; 23+ messages in thread
From: Leonardo Bras @ 2024-09-11  7:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Wed, Sep 04, 2024 at 05:39:01PM -0400, Waiman Long wrote:
> On 6/21/24 23:58, Leonardo Bras wrote:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with some unrelated task is
> > sure to introduce unexpected deadline misses.
> > 
> > It's interesting, though, that local_lock()s in RT kernels become
> > spinlock(). We can make use of those to avoid scheduling work on a remote
> > cpu by directly updating another cpu's per_cpu structure, while holding
> > it's spinlock().
> > 
> > In order to do that, it's necessary to introduce a new set of functions to
> > make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> > and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> > helpers to run the remote work.
> > 
> > On non-RT kernels, no changes are expected, as every one of the introduced
> > helpers work the exactly same as the current implementation:
> > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> > queue_percpu_work_on()  ->  queue_work_on()
> > flush_percpu_work()     ->  flush_work()
> > 
> > For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
> > to select the correct per-cpu structure to work on, and acquire the
> > spinlock for that cpu.
> > 
> > queue_percpu_work_on() will just call the requested function in the current
> > cpu, which will operate in another cpu's per-cpu object. Since the
> > local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
> > 
> > flush_percpu_work() then becomes a no-op since no work is actually
> > scheduled on a remote cpu.
> > 
> > Some minimal code rework is needed in order to make this mechanism work:
> > The calls for local_{un,}lock*() on the functions that are currently
> > scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
> > RT kernels they can reference a different cpu. It's also necessary to use a
> > qpw_struct instead of a work_struct, but it just contains a work struct
> > and, in PREEMPT_RT, the target cpu.
> > 
> > This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> > will become per_cpu_ptr(,smp_processor_id()).
> > 
> > On RT kernels, this should improve performance and reduce latency by
> > removing scheduling noise.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 88 insertions(+)
> >   create mode 100644 include/linux/qpw.h
> > 
> > diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> > new file mode 100644
> > index 000000000000..ea2686a01e5e
> > --- /dev/null
> > +++ b/include/linux/qpw.h
> > @@ -0,0 +1,88 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_QPW_H
> > +#define _LINUX_QPW_H
> > +
> > +#include "linux/local_lock.h"
> > +#include "linux/workqueue.h"
> > +
> > +#ifndef CONFIG_PREEMPT_RT
> > +
> > +struct qpw_struct {
> > +	struct work_struct work;
> > +};
> > +
> > +#define qpw_lock(lock, cpu)					\
> > +	local_lock(lock)
> > +
> > +#define qpw_unlock(lock, cpu)					\
> > +	local_unlock(lock)
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)			\
> > +	local_lock_irqsave(lock, flags)
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> > +	local_unlock_irqrestore(lock, flags)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)			\
> > +	queue_work_on(c, wq, &(qpw)->work)
> > +
> > +#define flush_percpu_work(qpw)					\
> > +	flush_work(&(qpw)->work)
> > +
> > +#define qpw_get_cpu(qpw)					\
> > +	smp_processor_id()
> > +
> > +#define INIT_QPW(qpw, func, c)					\
> > +	INIT_WORK(&(qpw)->work, (func))
> > +
> > +#else /* !CONFIG_PREEMPT_RT */
> > +
> > +struct qpw_struct {
> > +	struct work_struct work;
> > +	int cpu;
> > +};
> > +
> > +#define qpw_lock(__lock, cpu)					\
> > +	do {							\
> > +		migrate_disable();				\
> > +		spin_lock(per_cpu_ptr((__lock), cpu));		\
> > +	} while (0)
> > +
> > +#define qpw_unlock(__lock, cpu)					\
> > +	do {							\
> > +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
> > +		migrate_enable();				\
> > +	} while (0)
> 
> Why there is a migrate_disable/enable() call in qpw_lock/unlock()? The
> rt_spin_lock/unlock() calls have already include a migrate_disable/enable()
> pair.

This was copied from PREEMPT_RT=y local_locks.

In my tree, I see:

#define __local_unlock(__lock)					\
	do {							\
		spin_unlock(this_cpu_ptr((__lock)));		\
		migrate_enable();				\
	} while (0)

But you are right:
For PREEMPT_RT=y, spin_{un,}lock() will be defined in spinlock_rt.h
as rt_spin{un,}lock(), which already runs migrate_{en,dis}able().

On the other hand, spin_lock() will run migrate_disable() just before the 
end of the function, while local_lock() runs it before calling 
spin_lock(), and thus before spin_acquire().

(local_unlock looks like it has an unnecessary extra migrate_enable(), 
though.)

I am not sure it's actually necessary to run this extra 
migrate_disable() in the local_lock() case; maybe Thomas could help us 
understand this.

But sure, if we can remove this from local_{un,}lock(), I am sure we can 
also remove this from qpw.
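
For reference, the PREEMPT_RT lock side in local_lock_internal.h should 
look roughly like this (quoting from memory, so please double check 
against the tree), which is where that ordering comes from:

#define __local_lock(__lock)					\
	do {							\
		migrate_disable();				\
		spin_lock(this_cpu_ptr((__lock)));		\
	} while (0)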


> 
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)			\
> > +	do {							\
> > +		typecheck(unsigned long, flags);		\
> > +		flags = 0;					\
> > +		qpw_lock(lock, cpu);				\
> > +	} while (0)
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> > +	qpw_unlock(lock, cpu)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)			\
> > +	do {							\
> > +		struct qpw_struct *__qpw = (qpw);		\
> > +		WARN_ON((c) != __qpw->cpu);			\
> > +		__qpw->work.func(&__qpw->work);			\
> > +	} while (0)
> > +
> > +#define flush_percpu_work(qpw)					\
> > +	do {} while (0)
> > +
> > +#define qpw_get_cpu(w)						\
> > +	container_of((w), struct qpw_struct, work)->cpu
> > +
> > +#define INIT_QPW(qpw, func, c)					\
> > +	do {							\
> > +		struct qpw_struct *__qpw = (qpw);		\
> > +		INIT_WORK(&__qpw->work, (func));		\
> > +		__qpw->cpu = (c);				\
> > +	} while (0)
> > +
> > +#endif /* CONFIG_PREEMPT_RT */
> > +#endif /* LINUX_QPW_H */
> 
> You may also consider adding a documentation file about the
> qpw_lock/unlock() calls.

Sure, will do when I send the non-RFC version. Thanks for pointing that 
out!

> 
> Cheers,
> Longman
> 

Thanks!
Leo



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-09-05  0:08     ` Waiman Long
@ 2024-09-11  7:18       ` Leonardo Bras
  0 siblings, 0 replies; 23+ messages in thread
From: Leonardo Bras @ 2024-09-11  7:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Leonardo Bras, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Thomas Gleixner, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Wed, Sep 04, 2024 at 08:08:12PM -0400, Waiman Long wrote:
> On 9/4/24 17:39, Waiman Long wrote:
> > On 6/21/24 23:58, Leonardo Bras wrote:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing
> > > low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in
> > > non-RT
> > > kernels, even though the very few remote operations will be
> > > expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem:
> > > getting
> > > an important workload scheduled out to deal with some unrelated task is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > It's interesting, though, that local_lock()s in RT kernels become
> > > spinlock(). We can make use of those to avoid scheduling work on a
> > > remote
> > > cpu by directly updating another cpu's per_cpu structure, while holding
> > > it's spinlock().
> > > 
> > > In order to do that, it's necessary to introduce a new set of
> > > functions to
> > > make it possible to get another cpu's per-cpu "local" lock
> > > (qpw_{un,}lock*)
> > > and also the corresponding queue_percpu_work_on() and
> > > flush_percpu_work()
> > > helpers to run the remote work.
> > > 
> > > On non-RT kernels, no changes are expected, as every one of the
> > > introduced
> > > helpers work the exactly same as the current implementation:
> > > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> > > queue_percpu_work_on()  ->  queue_work_on()
> > > flush_percpu_work()     ->  flush_work()
> > > 
> > > For RT kernels, though, qpw_{un,}lock*() will use the extra cpu
> > > parameter
> > > to select the correct per-cpu structure to work on, and acquire the
> > > spinlock for that cpu.
> > > 
> > > queue_percpu_work_on() will just call the requested function in the
> > > current
> > > cpu, which will operate in another cpu's per-cpu object. Since the
> > > local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
> > > 
> > > flush_percpu_work() then becomes a no-op since no work is actually
> > > scheduled on a remote cpu.
> > > 
> > > Some minimal code rework is needed in order to make this mechanism work:
> > > The calls for local_{un,}lock*() on the functions that are currently
> > > scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(),
> > > so in
> > > RT kernels they can reference a different cpu. It's also necessary
> > > to use a
> > > qpw_struct instead of a work_struct, but it just contains a work struct
> > > and, in PREEMPT_RT, the target cpu.
> > > 
> > > This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> > > will become per_cpu_ptr(,smp_processor_id()).
> > > 
> > > On RT kernels, this should improve performance and reduce latency by
> > > removing scheduling noise.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 88 insertions(+)
> > >   create mode 100644 include/linux/qpw.h
> > > 
> > > diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> > > new file mode 100644
> > > index 000000000000..ea2686a01e5e
> > > --- /dev/null
> > > +++ b/include/linux/qpw.h
> > > @@ -0,0 +1,88 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_QPW_H
> > > +#define _LINUX_QPW_H
> 
> I would suggest adding a comment with a brief description of what
> qpw_lock/unlock() are for and their use cases. The "qpw" prefix itself isn't
> intuitive enough for a casual reader to understand what they are for.

Agreed, and I am also open to discussing a more intuitive name for these.

> 
> Cheers,
> Longman
> 

Thanks!
Leo



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2024-09-11  7:17     ` Leonardo Bras
@ 2024-09-11 13:39       ` Waiman Long
  0 siblings, 0 replies; 23+ messages in thread
From: Waiman Long @ 2024-09-11 13:39 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, Marcelo Tosatti, linux-kernel, cgroups, linux-mm

On 9/11/24 03:17, Leonardo Bras wrote:
> On Wed, Sep 04, 2024 at 05:39:01PM -0400, Waiman Long wrote:
>> On 6/21/24 23:58, Leonardo Bras wrote:
>>> Some places in the kernel implement a parallel programming strategy
>>> consisting on local_locks() for most of the work, and some rare remote
>>> operations are scheduled on target cpu. This keeps cache bouncing low since
>>> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
>>> kernels, even though the very few remote operations will be expensive due
>>> to scheduling overhead.
>>>
>>> On the other hand, for RT workloads this can represent a problem: getting
>>> an important workload scheduled out to deal with some unrelated task is
>>> sure to introduce unexpected deadline misses.
>>>
>>> It's interesting, though, that local_lock()s in RT kernels become
>>> spinlock(). We can make use of those to avoid scheduling work on a remote
>>> cpu by directly updating another cpu's per_cpu structure, while holding
>>> it's spinlock().
>>>
>>> In order to do that, it's necessary to introduce a new set of functions to
>>> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
>>> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
>>> helpers to run the remote work.
>>>
>>> On non-RT kernels, no changes are expected, as every one of the introduced
>>> helpers work the exactly same as the current implementation:
>>> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
>>> queue_percpu_work_on()  ->  queue_work_on()
>>> flush_percpu_work()     ->  flush_work()
>>>
>>> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
>>> to select the correct per-cpu structure to work on, and acquire the
>>> spinlock for that cpu.
>>>
>>> queue_percpu_work_on() will just call the requested function in the current
>>> cpu, which will operate in another cpu's per-cpu object. Since the
>>> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>>>
>>> flush_percpu_work() then becomes a no-op since no work is actually
>>> scheduled on a remote cpu.
>>>
>>> Some minimal code rework is needed in order to make this mechanism work:
>>> The calls for local_{un,}lock*() on the functions that are currently
>>> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
>>> RT kernels they can reference a different cpu. It's also necessary to use a
>>> qpw_struct instead of a work_struct, but it just contains a work struct
>>> and, in PREEMPT_RT, the target cpu.
>>>
>>> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
>>> will become per_cpu_ptr(,smp_processor_id()).
>>>
>>> On RT kernels, this should improve performance and reduce latency by
>>> removing scheduling noise.
>>>
>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>>> ---
>>>    include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 88 insertions(+)
>>>    create mode 100644 include/linux/qpw.h
>>>
>>> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
>>> new file mode 100644
>>> index 000000000000..ea2686a01e5e
>>> --- /dev/null
>>> +++ b/include/linux/qpw.h
>>> @@ -0,0 +1,88 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_QPW_H
>>> +#define _LINUX_QPW_H
>>> +
>>> +#include "linux/local_lock.h"
>>> +#include "linux/workqueue.h"
>>> +
>>> +#ifndef CONFIG_PREEMPT_RT
>>> +
>>> +struct qpw_struct {
>>> +	struct work_struct work;
>>> +};
>>> +
>>> +#define qpw_lock(lock, cpu)					\
>>> +	local_lock(lock)
>>> +
>>> +#define qpw_unlock(lock, cpu)					\
>>> +	local_unlock(lock)
>>> +
>>> +#define qpw_lock_irqsave(lock, flags, cpu)			\
>>> +	local_lock_irqsave(lock, flags)
>>> +
>>> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
>>> +	local_unlock_irqrestore(lock, flags)
>>> +
>>> +#define queue_percpu_work_on(c, wq, qpw)			\
>>> +	queue_work_on(c, wq, &(qpw)->work)
>>> +
>>> +#define flush_percpu_work(qpw)					\
>>> +	flush_work(&(qpw)->work)
>>> +
>>> +#define qpw_get_cpu(qpw)					\
>>> +	smp_processor_id()
>>> +
>>> +#define INIT_QPW(qpw, func, c)					\
>>> +	INIT_WORK(&(qpw)->work, (func))
>>> +
>>> +#else /* !CONFIG_PREEMPT_RT */
>>> +
>>> +struct qpw_struct {
>>> +	struct work_struct work;
>>> +	int cpu;
>>> +};
>>> +
>>> +#define qpw_lock(__lock, cpu)					\
>>> +	do {							\
>>> +		migrate_disable();				\
>>> +		spin_lock(per_cpu_ptr((__lock), cpu));		\
>>> +	} while (0)
>>> +
>>> +#define qpw_unlock(__lock, cpu)					\
>>> +	do {							\
>>> +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
>>> +		migrate_enable();				\
>>> +	} while (0)
>> Why there is a migrate_disable/enable() call in qpw_lock/unlock()? The
>> rt_spin_lock/unlock() calls have already include a migrate_disable/enable()
>> pair.
> This was copied from PREEMPT_RT=y local_locks.
>
> In my tree, I see:
>
> #define __local_unlock(__lock)					\
> 	do {							\
> 		spin_unlock(this_cpu_ptr((__lock)));		\
> 		migrate_enable();				\
> 	} while (0)
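>
> and, for comparison, the matching RT __local_lock() (standard definition
> in local_lock_internal.h):
>
> #define __local_lock(__lock)					\
> 	do {							\
> 		migrate_disable();				\
> 		spin_lock(this_cpu_ptr((__lock)));		\
> 	} while (0)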
>
> But you are right:
> For PREEMPT_RT=y, spin_{un,}lock() will be defined in spinlock_rt.h
> as rt_spin_{un,}lock(), which already run migrate_{en,dis}able().
>
> On the other hand, spin_lock() will run migrate_disable() just before
> finishing the function, while local_lock() will run it before calling
> spin_lock(), and thus before spin_acquire().
>
> (local_unlock() looks like it has an unnecessary extra migrate_enable(),
> though).
>
> I am not sure if it's actually necessary to run this extra
> migrate_disable() in the local_lock() case; maybe Thomas could help us
> understand this.
>
> But sure, if we can remove this from local_{un,}lock(), I am sure we can
> also remove this from qpw.

I see. I believe the reason for this extra migrate_disable/enable() is
to protect the this_cpu_ptr() call: it prevents the task from switching to
another CPU right after this_cpu_ptr() but before the migrate_disable()
inside rt_spin_lock(). So keep the migrate_disable/enable() as is.
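
For illustration, a rough sketch of the window that this extra
migrate_disable() closes, assuming it were dropped from the RT
local_lock() (this is not real kernel code):

	/* Hypothetical RT local_lock() without the leading migrate_disable() */
	#define __local_lock_no_md(__lock)			\
		spin_lock(this_cpu_ptr((__lock)))

	/*
	 * Expansion on a task currently running on CPU0:
	 *   lock = this_cpu_ptr(__lock);  // resolves to CPU0's lock
	 *   <preempted, migrated to CPU1>
	 *   rt_spin_lock(lock);           // migrate_disable() only happens in
	 *                                 // here, too late: the task now holds
	 *                                 // CPU0's "local" lock while running
	 *                                 // on CPU1
	 */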

Cheers,
Longman



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
  2024-09-11  3:04     ` Marcelo Tosatti
@ 2024-09-15  0:30       ` Hillf Danton
  0 siblings, 0 replies; 23+ messages in thread
From: Hillf Danton @ 2024-09-15  0:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Leonardo Bras, Michal Hocko, linux-kernel, linux-mm

On Wed, 11 Sep 2024 00:04:46 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote:
> > On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting on local_locks() for most of the work, and some rare remote
> > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > kernels, even though the very few remote operations will be expensive due
> > > > to scheduling overhead.
> > > > 
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > 
> > > Another hang with a busy polling workload (kernel update hangs on
> > > grub2-probe):
> > > 
> > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
> > > [342431.665458]       Tainted: G        W      X  -------  ---  5.14.0-438.el9s.x86_64+rt #1
> > > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [342431.665515] task:grub2-probe     state:D stack:0     pid:24484 ppid:24455  flags:0x00004002
> > > [342431.665523] Call Trace:
> > > [342431.665525]  <TASK>
> > > [342431.665527]  __schedule+0x22a/0x580
> > > [342431.665537]  schedule+0x30/0x80
> > > [342431.665539]  schedule_timeout+0x153/0x190
> > > [342431.665543]  ? preempt_schedule_thunk+0x16/0x30
> > > [342431.665548]  ? preempt_count_add+0x70/0xa0
> > > [342431.665554]  __wait_for_common+0x8b/0x1c0
> > > [342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
> > > [342431.665560]  __flush_work.isra.0+0x15b/0x220
> > 
> > The brand-new flush_percpu_work() is a no-op with CONFIG_PREEMPT_RT enabled, so
> > why are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or
> > what are you testing?
> 
> I am demonstrating a type of bug that can happen without Leo's patch.
> 
> > BTW the hang fails to show the unexpected deadline misses.
> 
> Yes, because in this case the realtime app with FIFO priority never
> stops running, therefore grub2-probe hangs and is unable to execute:
> 
Thanks, I see why it is a type of bug that can happen without Leo's patch.
Because the Linux kernel is never a pill that kills every pain in the field,
I prefer to think it instead reflects no real understanding of 5.14-xxx-rt
at the product design stage - what is the kernel's reaction to a 600s cpu
hog, for instance? More interestingly, what would you say if the task hang
were replaced with an oom?

Given the cut in locality made by this patchset, lock contention follows and
opens the window for priority inversion, right?

> > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds
> > 
> > > [342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
> > > [342431.665570]  __lru_add_drain_all+0x17d/0x220
> > > [342431.665576]  invalidate_bdev+0x28/0x40
> > > [342431.665583]  blkdev_common_ioctl+0x714/0xa30
> > > [342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
> > > [342431.665593]  ? cp_new_stat+0xbb/0x180
> > > [342431.665599]  blkdev_ioctl+0x112/0x270
> > > [342431.665603]  ? security_file_ioctl+0x2f/0x50
> > > [342431.665609]  __x64_sys_ioctl+0x87/0xc0


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-09-15  0:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-22  3:58 [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Leonardo Bras
2024-06-22  3:58 ` [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work Leonardo Bras
2024-09-04 21:39   ` Waiman Long
2024-09-05  0:08     ` Waiman Long
2024-09-11  7:18       ` Leonardo Bras
2024-09-11  7:17     ` Leonardo Bras
2024-09-11 13:39       ` Waiman Long
2024-06-22  3:58 ` [RFC PATCH v1 2/4] swap: apply new queue_percpu_work_on() interface Leonardo Bras
2024-06-22  3:58 ` [RFC PATCH v1 3/4] memcontrol: " Leonardo Bras
2024-06-22  3:58 ` [RFC PATCH v1 4/4] slub: " Leonardo Bras
2024-06-24  7:31 ` [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations Vlastimil Babka
2024-06-24 22:54   ` Boqun Feng
2024-06-25  2:57     ` Leonardo Bras
2024-06-25 17:51       ` Boqun Feng
2024-06-26 16:40         ` Leonardo Bras
2024-06-28 18:47       ` Marcelo Tosatti
2024-06-25  2:36   ` Leonardo Bras
2024-07-15 18:38   ` Marcelo Tosatti
2024-07-23 17:14 ` Marcelo Tosatti
2024-09-05 22:19   ` Hillf Danton
2024-09-11  3:04     ` Marcelo Tosatti
2024-09-15  0:30       ` Hillf Danton
2024-09-11  6:42     ` Leonardo Bras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).