public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t()
@ 2023-07-29  8:37 Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush Leonardo Bras
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Leonardo Bras @ 2023-07-29  8:37 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Leonardo Bras

The problem:
We have a few scenarios in mm that make use of local_locks() + 
schedule_work_on(), due to the reduced overhead on the most common, local 
references. This pattern is not ideal for RT workloads though: even on 
isolated cpus, the scheduled work items will preempt the latency-sensitive 
RT workloads to run, and that usually causes missed deadlines.

The idea:
In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should 
be no harm in simply taking another cpu's spinlock to perform the work 
on its per-cpu structure: the cacheline invalidation would already happen 
with schedule_work_on(), and the local functions pay the locking cost 
anyway.

This avoids schedule_work_on(), and thus avoids scheduling-out an 
RT workload. Given the usually brief nature of those scheduled tasks, 
executing them remotely ends up being faster than scheduling them.
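
To make the trade-off concrete, here is a minimal sketch of both patterns 
(drain_work, pcp and drain_cpu() are hypothetical placeholders, not code 
from the patches):

	/* Today: drain a remote cpu's per-cpu data by scheduling a work
	 * item on that cpu, preempting whatever is running there.
	 */
	schedule_work_on(cpu, &per_cpu(drain_work, cpu));
	flush_work(&per_cpu(drain_work, cpu));

	/* Proposed, on PREEMPT_RT (where local_lock_t is a spinlock_t):
	 * take the remote cpu's lock and do the work from the current cpu.
	 */
	migrate_disable();
	spin_lock(per_cpu_ptr(&pcp.lock, cpu));
	drain_cpu(cpu);		/* hypothetical drain helper */
	spin_unlock(per_cpu_ptr(&pcp.lock, cpu));
	migrate_enable();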

======

While I really believe in this solution, there are problems with this 
patchset for which I need your suggestions:

1) Naming is horrible: I could not think of a good name for the new lock 
functions, so I lazily named them local_lock_n(). 
The name should make clear that we are in fact dealing with a local_lock, 
but that in some scenarios it can address another cpu's local_lock, and 
thus needs the extra cpu parameter. 

Dealing with this local & remote duality, all I could come up with was:
mostly_local_lock(), (or local_mostly_lock())
local_maybe_remote_lock()  <- LOL
remote_local_lock()
per_cpu_local_lock()
local_lock_cpu()

Please help!


2) Maybe naming is this hard because the interface is not the best.
My idea was to create a simple interface that easily replaces the functions 
already in use, but maybe there is something better I could not see.

Please contribute!

3) I am lazily setting work->data without atomic operations, which may 
be bad in some scenarios. If so, even though it can be costly, adding the 
atomic operation for non-RT kernels should be no problem, since it happens 
outside of the hotpath (the local per-cpu areas).

For RT kernels, since the whole operation happens locally, there should be 
no need for the atomic set.
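
For reference, a sketch of what the atomic variant could look like in the 
non-RT __local_queue_work_on() (work->data is an atomic_long_t, so 
atomic_long_set() is the natural setter; like the plain assignment, this 
still overwrites whatever flag bits are encoded in work->data):

	/* instead of: __work->data.counter = __cpu; */
	atomic_long_set(&__work->data, __cpu);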

Please let me know of any idea or suggestion that can improve this RFC.

Thanks a lot!
Leo

Leonardo Bras (4):
  Introducing local_lock_n() and local queue & flush
  swap: apply new local_schedule_work_on() interface
  memcontrol: apply new local_schedule_work_on() interface
  slub: apply new local_schedule_work_on() interface

 include/linux/local_lock.h          | 18 ++++++++++
 include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
 mm/memcontrol.c                     | 17 ++++++----
 mm/slub.c                           | 17 ++++++----
 mm/swap.c                           | 18 +++++-----
 5 files changed, 100 insertions(+), 22 deletions(-)

-- 
2.41.0



* [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush
  2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
@ 2023-07-29  8:37 ` Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface Leonardo Bras
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2023-07-29  8:37 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Leonardo Bras

Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while some rare remote
operations are scheduled on the target cpu. This keeps the overhead low,
since the cacheline tends to stay mostly local (and there are no locks in
non-RT kernels), and only the few remote operations become more costly due
to scheduling.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with some unrelated task is
sure to introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become a
spinlock(), so we can use this locking cost (which is already being paid)
to avoid scheduling work on a remote cpu: instead, we update another cpu's
per_cpu structure from the current cpu while holding its spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's local lock (local_*lock_n*()), and
also the corresponding local_queue_work_on() and local_flush_work()
helpers.

On non-RT kernels, every *_n() variant works exactly the same as its non-n
counterpart (the extra parameter is ignored), and both
local_queue_work_on() and local_flush_work() call their non-local versions.

For RT kernels, though, local_*lock_n*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.

local_queue_work_on() will just call the requested function on the current
cpu: since the local_locks() are spinlocks(), this is safe.

local_flush_work() then becomes a no-op since no work is actually scheduled
on a remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
the local_*lock*() calls in the functions that are currently scheduled
on remote cpus need to be replaced by local_*lock_n*(), so in RT kernels
they can reference a different cpu.

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls become per_cpu_ptr(..., smp_processor_id()).

On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.
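
As a usage sketch (struct my_pcp, my_work and my_drain() are hypothetical;
the real conversions follow in the next patches):

	struct my_pcp {
		local_lock_t lock;
		/* per-cpu data protected by lock */
	};
	static DEFINE_PER_CPU(struct my_pcp, my_pcp) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};
	static DEFINE_PER_CPU(struct work_struct, my_work);

	static void my_drain(struct work_struct *w)
	{
		/* cpu stashed in work->data by local_queue_work_on() */
		int cpu = w->data.counter;

		local_lock_n(&my_pcp.lock, cpu);
		/* operate on per_cpu_ptr(&my_pcp, cpu) */
		local_unlock_n(&my_pcp.lock, cpu);
	}

	/* non-RT: queues my_drain() on @cpu as before.
	 * RT: runs my_drain() right here, under @cpu's spinlock.
	 */
	INIT_WORK(&per_cpu(my_work, cpu), my_drain);
	local_queue_work_on(cpu, system_wq, &per_cpu(my_work, cpu));
	local_flush_work(&per_cpu(my_work, cpu));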

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/local_lock.h          | 18 ++++++++++
 include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa7329..f1fa1e8e3fbc 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,22 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+#define local_lock_n(lock, cpu)					\
+	__local_lock_n(lock, cpu)
+
+#define local_unlock_n(lock, cpu)				\
+	__local_unlock_n(lock, cpu)
+
+#define local_lock_irqsave_n(lock, flags, cpu)			\
+	__local_lock_irqsave_n(lock, flags, cpu)
+
+#define local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_irqrestore_n(lock, flags, cpu)
+
+#define local_queue_work_on(cpu, wq, work)			\
+	__local_queue_work_on(cpu, wq, work)
+
+#define local_flush_work(work)					\
+	__local_flush_work(work)
+
 #endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a7..df064149fff8 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -98,6 +98,25 @@ do {								\
 		local_irq_restore(flags);			\
 	} while (0)
 
+#define __local_lock_n(lock, cpu)	__local_lock(lock)
+#define __local_unlock_n(lock, cpu)	__local_unlock(lock)
+
+#define __local_lock_irqsave_n(lock, flags, cpu)		\
+	__local_lock_irqsave(lock, flags)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_irqrestore(lock, flags)
+
+#define __local_queue_work_on(cpu, wq, work)			\
+	do {							\
+		typeof(cpu) __cpu = cpu;			\
+		typeof(work) __work = work;			\
+		__work->data.counter = __cpu;			\
+		queue_work_on(__cpu, wq, __work);		\
+	} while (0)
+
+#define __local_flush_work(work)	flush_work(work)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -138,4 +157,37 @@ typedef spinlock_t local_lock_t;
 
 #define __local_unlock_irqrestore(lock, flags)	__local_unlock(lock)
 
+#define __local_lock_n(__lock, cpu)				\
+	do {							\
+		migrate_disable();				\
+		spin_lock(per_cpu_ptr((__lock), cpu));		\
+	} while (0)
+
+#define __local_unlock_n(__lock, cpu)				\
+	do {							\
+		spin_unlock(per_cpu_ptr((__lock), cpu));	\
+		migrate_enable();				\
+	} while (0)
+
+#define __local_lock_irqsave_n(lock, flags, cpu)		\
+	do {							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		__local_lock_n(lock, cpu);			\
+	} while (0)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_n(lock, cpu)
+
+#define __local_queue_work_on(cpu, wq, work)			\
+	do {							\
+		typeof(cpu) __cpu = cpu;			\
+		typeof(work) __work = work;			\
+		__work->data.counter = __cpu;			\
+		__work->func(__work);				\
+	} while (0)
+
+#define __local_flush_work(work)				\
+	do {} while (0)
+
 #endif /* CONFIG_PREEMPT_RT */
-- 
2.41.0



* [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
  2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush Leonardo Bras
@ 2023-07-29  8:37 ` Leonardo Bras
  2023-08-08 19:39   ` Marcelo Tosatti
  2023-07-29  8:37 ` [RFC PATCH 3/4] memcontrol: " Leonardo Bras
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Leonardo Bras @ 2023-07-29  8:37 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Leonardo Bras

Make use of the new local_*lock_n*() and local_queue_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
local_queue_work_on(). Likewise, replace flush_work() with
local_flush_work().

This should bring no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/swap.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..a79f2091eae5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -760,11 +760,11 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_and_bh_lrus_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
+	local_lock_n(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	local_unlock_n(&cpu_fbatches.lock, cpu);
 	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
@@ -782,9 +782,9 @@ void lru_add_drain_cpu_zone(struct zone *zone)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(w->data.counter);
 }
 
 static bool cpu_needs_drain(unsigned int cpu)
@@ -888,13 +888,13 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 
 		if (cpu_needs_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
+			local_queue_work_on(cpu, mm_percpu_wq, work);
 			__cpumask_set_cpu(cpu, &has_work);
 		}
 	}
 
 	for_each_cpu(cpu, &has_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		local_flush_work(&per_cpu(lru_add_drain_work, cpu));
 
 done:
 	mutex_unlock(&lock);
@@ -941,7 +941,7 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(smp_processor_id());
 #endif
 }
 
-- 
2.41.0



* [RFC PATCH 3/4] memcontrol: apply new local_schedule_work_on() interface
  2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface Leonardo Bras
@ 2023-07-29  8:37 ` Leonardo Bras
  2023-07-29  8:37 ` [RFC PATCH 4/4] slub: " Leonardo Bras
  2023-08-01 19:05 ` [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras Soares Passos
  4 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2023-07-29  8:37 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Leonardo Bras

Make use of the new local_*lock_n*() and local_queue_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
local_queue_work_on().

This should bring no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/memcontrol.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..6d4fa48d75e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2277,7 +2277,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 	WRITE_ONCE(stock->cached, NULL);
 }
 
-static void drain_local_stock(struct work_struct *dummy)
+static void _drain_local_stock(int cpu)
 {
 	struct memcg_stock_pcp *stock;
 	struct obj_cgroup *old = NULL;
@@ -2288,18 +2288,23 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	local_lock_irqsave_n(&memcg_stock.stock_lock, flags, cpu);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = per_cpu_ptr(&memcg_stock, cpu);
 	old = drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore_n(&memcg_stock.stock_lock, flags, cpu);
 	if (old)
 		obj_cgroup_put(old);
 }
 
+static void drain_local_stock(struct work_struct *w)
+{
+	_drain_local_stock((int)w->data.counter);
+}
+
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
@@ -2365,9 +2370,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (flush &&
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
+				_drain_local_stock(cpu);
 			else if (!cpu_is_isolated(cpu))
-				schedule_work_on(cpu, &stock->work);
+				local_queue_work_on(cpu, system_wq, &stock->work);
 		}
 	}
 	migrate_enable();
-- 
2.41.0



* [RFC PATCH 4/4] slub: apply new local_schedule_work_on() interface
  2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
                   ` (2 preceding siblings ...)
  2023-07-29  8:37 ` [RFC PATCH 3/4] memcontrol: " Leonardo Bras
@ 2023-07-29  8:37 ` Leonardo Bras
  2023-08-01 19:05 ` [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras Soares Passos
  4 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2023-07-29  8:37 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Leonardo Bras

Make use of the new local_*lock_n*() and local_queue_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
local_queue_work_on(). Likewise, replace flush_work() with
local_flush_work().

This should bring no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/slub.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e3b5d5c0eb3a..feb4a502d9a8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2733,13 +2733,14 @@ static inline void unfreeze_partials_cpu(struct kmem_cache *s,
 
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
+			      int cpu)
 {
 	unsigned long flags;
 	struct slab *slab;
 	void *freelist;
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_irqsave_n(&s->cpu_slab->lock, flags, cpu);
 
 	slab = c->slab;
 	freelist = c->freelist;
@@ -2748,7 +2749,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
 
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	local_unlock_irqrestore_n(&s->cpu_slab->lock, flags, cpu);
 
 	if (slab) {
 		deactivate_slab(s, slab, freelist);
@@ -2790,14 +2791,16 @@ static void flush_cpu_slab(struct work_struct *w)
 	struct kmem_cache *s;
 	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
+	int cpu;
 
+	cpu = w->data.counter;
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
+	c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (c->slab)
-		flush_slab(s, c);
+		flush_slab(s, c, cpu);
 
 	unfreeze_partials(s);
 }
@@ -2829,14 +2832,14 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 		INIT_WORK(&sfw->work, flush_cpu_slab);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		local_queue_work_on(cpu, flushwq, &sfw->work);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		local_flush_work(&sfw->work);
 	}
 
 	mutex_unlock(&flush_lock);
-- 
2.41.0



* Re: [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t()
  2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
                   ` (3 preceding siblings ...)
  2023-07-29  8:37 ` [RFC PATCH 4/4] slub: " Leonardo Bras
@ 2023-08-01 19:05 ` Leonardo Bras Soares Passos
  4 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras Soares Passos @ 2023-08-01 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, linux-kernel; +Cc: Peter Xu

CC: Peter Xu

On Sat, Jul 29, 2023 at 5:38 AM Leonardo Bras <leobras@redhat.com> wrote:
>
> The problem:
> We have a few scenarios in mm that make use of local_locks() +
> schedule_work_on(), due to the reduced overhead on the most common, local
> references. This pattern is not ideal for RT workloads though: even on
> isolated cpus, the scheduled work items will preempt the latency-sensitive
> RT workloads to run, and that usually causes missed deadlines.
>
> The idea:
> In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should
> be no harm in simply taking another cpu's spinlock to perform the work
> on its per-cpu structure: the cacheline invalidation would already happen
> with schedule_work_on(), and the local functions pay the locking cost
> anyway.
>
> This avoids schedule_work_on(), and thus avoids scheduling-out an
> RT workload. Given the usually brief nature of those scheduled tasks,
> executing them remotely ends up being faster than scheduling them.
>
> ======
>
> While I really believe in this solution, there are problems with this
> patchset for which I need your suggestions:
>
> 1) Naming is horrible: I could not think of a good name for the new lock
> functions, so I lazily named them local_lock_n().
> The name should make clear that we are in fact dealing with a local_lock,
> but that in some scenarios it can address another cpu's local_lock, and
> thus needs the extra cpu parameter.
>
> Dealing with this local & remote duality, all I could come up with was:
> mostly_local_lock(), (or local_mostly_lock())
> local_maybe_remote_lock()  <- LOL
> remote_local_lock()
> per_cpu_local_lock()
> local_lock_cpu()
>
> Please help!
>
>
> 2) Maybe naming is this hard because the interface is not the best.
> My idea was to create a simple interface that easily replaces the functions
> already in use, but maybe there is something better I could not see.
>
> Please contribute!
>
> 3) I am lazily setting work->data without atomic operations, which may
> be bad in some scenarios. If so, even though it can be costly, adding the
> atomic operation for non-RT kernels should be no problem, since it happens
> outside of the hotpath (the local per-cpu areas).
>
> For RT kernels, since the whole operation happens locally, there should be
> no need for the atomic set.
>
> Please let me know of any idea or suggestion that can improve this RFC.
>
> Thanks a lot!
> Leo
>
> Leonardo Bras (4):
>   Introducing local_lock_n() and local queue & flush
>   swap: apply new local_schedule_work_on() interface
>   memcontrol: apply new local_schedule_work_on() interface
>   slub: apply new local_schedule_work_on() interface
>
>  include/linux/local_lock.h          | 18 ++++++++++
>  include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
>  mm/memcontrol.c                     | 17 ++++++----
>  mm/slub.c                           | 17 ++++++----
>  mm/swap.c                           | 18 +++++-----
>  5 files changed, 100 insertions(+), 22 deletions(-)
>
> --
> 2.41.0
>



* Re: [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
  2023-07-29  8:37 ` [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface Leonardo Bras
@ 2023-08-08 19:39   ` Marcelo Tosatti
  2023-08-29  0:07     ` Leonardo Brás
  0 siblings, 1 reply; 8+ messages in thread
From: Marcelo Tosatti @ 2023-08-08 19:39 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Thomas Gleixner, linux-kernel

On Sat, Jul 29, 2023 at 05:37:33AM -0300, Leonardo Bras wrote:
> Make use of the new local_*lock_n*() and local_queue_work_on() interface
> to improve performance & latency on PREEMPT_RT kernels.
> 
> For functions that may be scheduled on a different cpu, replace
> local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
> local_queue_work_on(). Likewise, replace flush_work() with
> local_flush_work().
> 
> This should bring no relevant performance impact on non-RT kernels:
> for functions that may be scheduled on a different cpu, the local_*lock's
> this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  mm/swap.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

Leo,

I think the interruptions should rather be removed for both
CONFIG_PREEMPT_RT AND !CONFIG_PREEMPT_RT.

The impact of grabbing locks must be properly analyzed and not
"rejected blindly".

Example:

commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:23 2022 +0100

    mm/page_alloc: replace local_lock with normal spinlock
    
    struct per_cpu_pages is no longer strictly local as PCP lists can be
    drained remotely using a lock for protection.  While the use of local_lock
    works, it goes against the intent of local_lock which is for "pure CPU
    local concurrency control mechanisms and not suited for inter-CPU
    concurrency control" (Documentation/locking/locktypes.rst)
    
    local_lock protects against migration between when the percpu pointer is
    accessed and the pcp->lock acquired.  The lock acquisition is a preemption
    point so in the worst case, a task could migrate to another NUMA node and
    accidentally allocate remote memory.  The main requirement is to pin the
    task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.



* Re: [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
  2023-08-08 19:39   ` Marcelo Tosatti
@ 2023-08-29  0:07     ` Leonardo Brás
  0 siblings, 0 replies; 8+ messages in thread
From: Leonardo Brás @ 2023-08-29  0:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Thomas Gleixner, linux-kernel

On Tue, 2023-08-08 at 16:39 -0300, Marcelo Tosatti wrote:
> On Sat, Jul 29, 2023 at 05:37:33AM -0300, Leonardo Bras wrote:
> > Make use of the new local_*lock_n*() and local_queue_work_on() interface
> > to improve performance & latency on PREEMPT_RT kernels.
> > 
> > For functions that may be scheduled on a different cpu, replace
> > local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
> > local_queue_work_on(). Likewise, replace flush_work() with
> > local_flush_work().
> > 
> > This should bring no relevant performance impact on non-RT kernels:
> > for functions that may be scheduled on a different cpu, the local_*lock's
> > this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  mm/swap.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> Leo,
> 
> I think the interruptions should rather be removed for both
> CONFIG_PREEMPT_RT AND !CONFIG_PREEMPT_RT.
> 
> The impact of grabbing locks must be properly analyzed and not
> "rejected blindly".

Yes, I agree with the idea, but we have been seeing a lot of rejection of
these ideas lately, as the maintainers perceive them as a big change.

My idea here is to provide a general way to improve CPU isolation in
PREEMPT_RT scenarios, even though there is resistance for the !PREEMPT_RT
case.

As I commented, spinlocks are already taken in PREEMPT_RT's local_lock
hotpaths, and yet we use schedule_work_on() to have remote work done. This
patchset tries to solve that by using spin_lock(remote_cpu_lock) instead,
saving a lot of cycles while reducing IPIs to the remote cpu.
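
(To make that point concrete: on PREEMPT_RT, local_lock_internal.h already
has, paraphrasing:

	typedef spinlock_t local_lock_t;

	#define __local_lock(__lock)				\
		do {						\
			migrate_disable();			\
			spin_lock(this_cpu_ptr((__lock)));	\
		} while (0)

so the hotpath already pays for a spinlock, and the remote drain in this
patchset merely swaps this_cpu_ptr() for per_cpu_ptr(lock, cpu), with no
IPI and no context switch on the isolated cpu.)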

It looks like a simple solution, improving isolation and performance on
PREEMPT_RT with no visible drawbacks. I agree the interface is not ideal,
and for that I really need your help.

I understand that once this change is merged, we will have more precedent
for discussing performance in the !PREEMPT_RT case.

What do you think on that?

Thanks!
Leo

> 
> Example:
> 
> commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
> Author: Mel Gorman <mgorman@techsingularity.net>
> Date:   Fri Jun 24 13:54:23 2022 +0100
> 
>     mm/page_alloc: replace local_lock with normal spinlock
>     
>     struct per_cpu_pages is no longer strictly local as PCP lists can be
>     drained remotely using a lock for protection.  While the use of local_lock
>     works, it goes against the intent of local_lock which is for "pure CPU
>     local concurrency control mechanisms and not suited for inter-CPU
>     concurrency control" (Documentation/locking/locktypes.rst)
>     
>     local_lock protects against migration between when the percpu pointer is
>     accessed and the pcp->lock acquired.  The lock acquisition is a preemption
>     point so in the worst case, a task could migrate to another NUMA node and
>     accidentally allocate remote memory.  The main requirement is to pin the
>     task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
> 



end of thread [~2023-08-29  0:08 UTC | newest]

Thread overview: 8+ messages
2023-07-29  8:37 [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras
2023-07-29  8:37 ` [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush Leonardo Bras
2023-07-29  8:37 ` [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface Leonardo Bras
2023-08-08 19:39   ` Marcelo Tosatti
2023-08-29  0:07     ` Leonardo Brás
2023-07-29  8:37 ` [RFC PATCH 3/4] memcontrol: " Leonardo Bras
2023-07-29  8:37 ` [RFC PATCH 4/4] slub: " Leonardo Bras
2023-08-01 19:05 ` [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t() Leonardo Bras Soares Passos
