[RFC PATCH 0/6] mm: LRU drain flush on nohz

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/6] mm: LRU drain flush on nohz_full
@ 2024-06-25 13:52 Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

When LRUs are pending, the drain can be triggered remotely, whether the
remote CPU is running in userspace in nohz_full mode or not. This kind
of noise is expected to be caused by preparatory work before a task
runs isolated in userspace. This patchset is a proposal to flush that
before the task starts its critical work in userspace.

Frederic Weisbecker (6):
  task_work: Provide means to check if a work is queued
  sched/fair: Use task_work_queued() on numa_work
  sched: Use task_work_queued() on cid_work
  tick/nohz: Move nohz_full related fields out of hot task struct's
    places
  sched/isolation: Introduce isolated task work
  mm: Drain LRUs upon resume to userspace on nohz_full CPUs

 include/linux/sched.h           | 15 +++++++++------
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 include/linux/swap.h            |  1 +
 include/linux/task_work.h       | 12 ++++++++++++
 kernel/sched/core.c             |  6 ++----
 kernel/sched/fair.c             |  4 +---
 kernel/sched/isolation.c        | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 kernel/task_work.c              |  1 +
 mm/swap.c                       |  5 ++++-
 10 files changed, 80 insertions(+), 14 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-06-25 14:15   ` Oleg Nesterov
  2024-07-16 13:00   ` Valentin Schneider
  2024-06-25 13:52 ` [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

Some task work users implement their own ways to know if a callback is
already queued on the current task while fiddling with the callback
head internals.

Provide instead a consolidated API to serve this very purpose.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/task_work.h | 12 ++++++++++++
 kernel/task_work.c        |  1 +
 2 files changed, 13 insertions(+)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..f2eae971b73a 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -5,12 +5,15 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 
+#define TASK_WORK_DEQUEUED	((void *) -1UL)
+
 typedef void (*task_work_func_t)(struct callback_head *);
 
 static inline void
 init_task_work(struct callback_head *twork, task_work_func_t func)
 {
 	twork->func = func;
+	twork->next = TASK_WORK_DEQUEUED;
 }
 
 enum task_work_notify_mode {
@@ -25,6 +28,15 @@ static inline bool task_work_pending(struct task_struct *task)
 	return READ_ONCE(task->task_works);
 }
 
+/*
+ * Check if a work is queued. Beware: this is inherently racy if the work can
+ * be queued elsewhere than the current task.
+ */
+static inline bool task_work_queued(struct callback_head *twork)
+{
+	return twork->next != TASK_WORK_DEQUEUED;
+}
+
 int task_work_add(struct task_struct *task, struct callback_head *twork,
 			enum task_work_notify_mode mode);
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..6e3bee0b7011 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -177,6 +177,7 @@ void task_work_run(void)
 
 		do {
 			next = work->next;
+			work->next = TASK_WORK_DEQUEUED;
 			work->func(work);
 			work = next;
 			cond_resched();
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-07-16 13:00   ` Valentin Schneider
  2024-06-25 13:52 ` [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

Remove the ad-hoc implementation of task_work_queued().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..619ef8bd1486 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3213,7 +3213,6 @@ static void task_numa_work(struct callback_head *work)
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -3456,7 +3455,6 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
 	p->numa_migrate_retry		= 0;
 	/* Protect against double add, see task_tick_numa and task_numa_work */
-	p->numa_work.next		= &p->numa_work;
 	p->numa_faults			= NULL;
 	p->numa_pages_migrated		= 0;
 	p->total_numa_faults		= 0;
@@ -3497,7 +3495,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 
 	/*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-07-16 13:00   ` Valentin Schneider
  2024-06-25 13:52 ` [RFC PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

Remove the ad-hoc implementation of task_work_queued()

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcf2c4cc0522..f01979b600e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11930,7 +11930,6 @@ static void task_mm_cid_work(struct callback_head *work)
 
 	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
 
-	work->next = work;	/* Prevent double-add */
 	if (t->flags & PF_EXITING)
 		return;
 	mm = t->mm;
@@ -11974,7 +11973,6 @@ void init_sched_mm_cid(struct task_struct *t)
 		if (mm_users == 1)
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
 	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
@@ -11983,8 +11981,7 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 	struct callback_head *work = &curr->cid_work;
 	unsigned long now = jiffies;
 
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
 		return;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2024-06-25 13:52 ` [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
  2024-06-25 13:52 ` [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

nohz_full is a feature that only fits into rare and very corner cases.
Yet distros enable it by default and therefore the related fields are
always reserved in the task struct.

Those task fields are stored in the middle of cacheline hot places such
as cputime accounting and context switch counting, which doesn't make
any sense for a feature that is disabled most of the time.

Move the nohz_full storage to colder places.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..d531b610c410 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1039,13 +1039,7 @@ struct task_struct {
 #endif
 	u64				gtime;
 	struct prev_cputime		prev_cputime;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	struct vtime			vtime;
-#endif
 
-#ifdef CONFIG_NO_HZ_FULL
-	atomic_t			tick_dep_mask;
-#endif
 	/* Context switch counts: */
 	unsigned long			nvcsw;
 	unsigned long			nivcsw;
@@ -1350,6 +1344,14 @@ struct task_struct {
 	struct task_delay_info		*delays;
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	struct vtime			vtime;
+#endif
+
+#ifdef CONFIG_NO_HZ_FULL
+	atomic_t			tick_dep_mask;
+#endif
+
 #ifdef CONFIG_FAULT_INJECTION
 	int				make_it_fail;
 	unsigned int			fail_nth;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 5/6] sched/isolation: Introduce isolated task work
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2024-06-25 13:52 ` [RFC PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-06-26 13:27   ` Vlastimil Babka
  2024-06-25 13:52 ` [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

Some asynchronous kernel work may be pending upon resume to userspace
and execute later on. On isolated workload this becomes problematic once
the process is done with preparatory work involving syscalls and wants
to run in userspace without being interrupted.

Provide an infrastructure to queue a work to be executed from the current
isolated task context right before resuming to userspace. This goes with
the assumption that isolated tasks are pinned to a single nohz_full CPU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h           |  1 +
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 kernel/sched/core.c             |  1 +
 kernel/sched/isolation.c        | 31 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 5 files changed, 51 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d531b610c410..f6df21866055 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1349,6 +1349,7 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
+	struct callback_head		nohz_full_work;
 	atomic_t			tick_dep_mask;
 #endif
 
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 2b461129d1fa..e69ec5ed1d70 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -72,4 +72,21 @@ static inline bool cpu_is_isolated(int cpu)
 	       cpuset_cpu_is_isolated(cpu);
 }
 
+#if defined(CONFIG_NO_HZ_FULL)
+extern int __isolated_task_work_queue(void);
+
+static inline int isolated_task_work_queue(void)
+{
+	if (!housekeeping_test_cpu(raw_smp_processor_id(), HK_TYPE_TICK))
+		return -ENOTSUPP;
+
+	return __isolated_task_work_queue();
+}
+
+extern void isolated_task_work_init(struct task_struct *tsk);
+#else
+static inline int isolated_task_work_queue(void) { }
+static inline void isolated_task_work_init(struct task_struct *tsk) { }
+#endif /* CONFIG_NO_HZ_FULL */
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f01979b600e8..01960434dbfd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4566,6 +4566,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	isolated_task_work_init(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5891e715f00d..410df1fedc9d 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -253,3 +253,34 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	return housekeeping_setup(str, flags);
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);
+
+#if defined(CONFIG_NO_HZ_FULL)
+static void isolated_task_work(struct callback_head *head)
+{
+}
+
+int __isolated_task_work_queue(void)
+{
+	unsigned long flags;
+	int ret;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	local_irq_save(flags);
+	if (task_work_queued(&current->nohz_full_work)) {
+		ret = 0;
+		goto out;
+	}
+
+	ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
+out:
+	local_irq_restore(flags);
+	return ret;
+}
+
+void isolated_task_work_init(struct task_struct *tsk)
+{
+	init_task_work(&tsk->nohz_full_work, isolated_task_work);
+}
+#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a831af102070..24653f5879cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -60,6 +60,7 @@
 #include <linux/stop_machine.h>
 #include <linux/syscalls_api.h>
 #include <linux/syscalls.h>
+#include <linux/task_work.h>
 #include <linux/tick.h>
 #include <linux/topology.h>
 #include <linux/types.h>
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2024-06-25 13:52 ` [RFC PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2024-06-25 13:52 ` Frederic Weisbecker
  2024-06-25 14:20   ` Michal Hocko
  5 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-06-25 13:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

LRUs can be drained through several ways. One of them may add disturbances
to isolated workloads while queuing a work at any time to any target,
whether running in nohz_full mode or not.

Prevent from that on isolated tasks with draining LRUs upon resuming to
userspace using the isolated task work framework.

It's worth noting that this is inherently racy against
lru_add_drain_all() remotely queueing the per CPU drain work and
therefore it prevents from the undesired disturbance only
*most of the time*.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/swap.h     | 1 +
 kernel/sched/isolation.c | 1 +
 mm/swap.c                | 5 ++++-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bd450023b9a4..bd6169c9cc14 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 410df1fedc9d..68c70bea99e7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -257,6 +257,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
 #if defined(CONFIG_NO_HZ_FULL)
 static void isolated_task_work(struct callback_head *head)
 {
+	lru_add_and_bh_lrus_drain();
 }
 
 int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index 67786cb77130..a4d7e3dc2a66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -521,6 +522,8 @@ void folio_add_lru(struct folio *folio)
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
 	folio_batch_add_and_move(fbatch, folio, lru_add_fn);
 	local_unlock(&cpu_fbatches.lock);
+
+	isolated_task_work_queue();
 }
 EXPORT_SYMBOL(folio_add_lru);
 
@@ -765,7 +768,7 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2024-06-25 14:15   ` Oleg Nesterov
  2024-06-25 15:16     ` Oleg Nesterov
  2024-07-03 12:41     ` Frederic Weisbecker
  2024-07-16 13:00   ` Valentin Schneider
  1 sibling, 2 replies; 22+ messages in thread
From: Oleg Nesterov @ 2024-06-25 14:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner

On 06/25, Frederic Weisbecker wrote:
>
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -177,6 +177,7 @@ void task_work_run(void)
>
>  		do {
>  			next = work->next;
> +			work->next = TASK_WORK_DEQUEUED;

OK, but then the additional change below makes sense too?

Oleg.
---

--- x/kernel/task_work.c
+++ x/kernel/task_work.c
@@ -106,8 +106,10 @@ task_work_cancel_match(struct task_struc
 		if (!match(work, data)) {
 			pprev = &work->next;
 			work = READ_ONCE(*pprev);
-		} else if (try_cmpxchg(pprev, &work, work->next))
+		} else if (try_cmpxchg(pprev, &work, work->next)) {
+			work->next = TASK_WORK_DEQUEUED;
 			break;
+		}
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-06-25 13:52 ` [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
@ 2024-06-25 14:20   ` Michal Hocko
  2024-06-26 13:16     ` Vlastimil Babka
  2024-07-03 12:52     ` Frederic Weisbecker
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2024-06-25 14:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov

On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> LRUs can be drained through several ways. One of them may add disturbances
> to isolated workloads while queuing a work at any time to any target,
> whether running in nohz_full mode or not.
> 
> Prevent from that on isolated tasks with draining LRUs upon resuming to
> userspace using the isolated task work framework.
> 
> It's worth noting that this is inherently racy against
> lru_add_drain_all() remotely queueing the per CPU drain work and
> therefore it prevents from the undesired disturbance only
> *most of the time*.

Can we simply not schedule flushing on remote CPUs and leave that to the
"return to the userspace" path?

I do not think we rely on LRU cache flushing for correctness purposes anywhere.

Please also CC linux MM ML once the core infrastructure is agreed on.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 14:15   ` Oleg Nesterov
@ 2024-06-25 15:16     ` Oleg Nesterov
  2024-07-03 12:42       ` Frederic Weisbecker
  2024-07-03 12:41     ` Frederic Weisbecker
  1 sibling, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2024-06-25 15:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner

And probably task_work_add() should do the same when it returns -ESRCH.

On 06/25, Oleg Nesterov wrote:
>
> On 06/25, Frederic Weisbecker wrote:
> >
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -177,6 +177,7 @@ void task_work_run(void)
> >
> >  		do {
> >  			next = work->next;
> > +			work->next = TASK_WORK_DEQUEUED;
>
> OK, but then the additional change below makes sense too?
>
> Oleg.
> ---
>
> --- x/kernel/task_work.c
> +++ x/kernel/task_work.c
> @@ -106,8 +106,10 @@ task_work_cancel_match(struct task_struc
>  		if (!match(work, data)) {
>  			pprev = &work->next;
>  			work = READ_ONCE(*pprev);
> -		} else if (try_cmpxchg(pprev, &work, work->next))
> +		} else if (try_cmpxchg(pprev, &work, work->next)) {
> +			work->next = TASK_WORK_DEQUEUED;
>  			break;
> +		}
>  	}
>  	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-06-25 14:20   ` Michal Hocko
@ 2024-06-26 13:16     ` Vlastimil Babka
  2024-06-27  6:54       ` Michal Hocko
  2024-07-03 12:52     ` Frederic Weisbecker
  1 sibling, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2024-06-26 13:16 UTC (permalink / raw)
  To: Michal Hocko, Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Andrew Morton, Thomas Gleixner, Oleg Nesterov

On 6/25/24 4:20 PM, Michal Hocko wrote:
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
>> LRUs can be drained through several ways. One of them may add disturbances
>> to isolated workloads while queuing a work at any time to any target,
>> whether running in nohz_full mode or not.
>> 
>> Prevent from that on isolated tasks with draining LRUs upon resuming to
>> userspace using the isolated task work framework.
>> 
>> It's worth noting that this is inherently racy against
>> lru_add_drain_all() remotely queueing the per CPU drain work and
>> therefore it prevents from the undesired disturbance only
>> *most of the time*.
> 
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?
> 
> I do not think we rely on LRU cache flushing for correctness purposes anywhere.

I guess drain via lru_cache_disable() should be honored, but also rare.

> Please also CC linux MM ML once the core infrastructure is agreed on.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] sched/isolation: Introduce isolated task work
  2024-06-25 13:52 ` [RFC PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2024-06-26 13:27   ` Vlastimil Babka
  2024-07-03 12:47     ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2024-06-26 13:27 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Marcelo Tosatti,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov

On 6/25/24 3:52 PM, Frederic Weisbecker wrote:
> Some asynchronous kernel work may be pending upon resume to userspace
> and execute later on. On isolated workload this becomes problematic once
> the process is done with preparatory work involving syscalls and wants
> to run in userspace without being interrupted.
> 
> Provide an infrastructure to queue a work to be executed from the current
> isolated task context right before resuming to userspace. This goes with
> the assumption that isolated tasks are pinned to a single nohz_full CPU.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  include/linux/sched.h           |  1 +
>  include/linux/sched/isolation.h | 17 +++++++++++++++++
>  kernel/sched/core.c             |  1 +
>  kernel/sched/isolation.c        | 31 +++++++++++++++++++++++++++++++
>  kernel/sched/sched.h            |  1 +
>  5 files changed, 51 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d531b610c410..f6df21866055 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1349,6 +1349,7 @@ struct task_struct {
>  #endif
>  
>  #ifdef CONFIG_NO_HZ_FULL
> +	struct callback_head		nohz_full_work;
>  	atomic_t			tick_dep_mask;
>  #endif
>  
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 2b461129d1fa..e69ec5ed1d70 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -72,4 +72,21 @@ static inline bool cpu_is_isolated(int cpu)
>  	       cpuset_cpu_is_isolated(cpu);
>  }
>  
> +#if defined(CONFIG_NO_HZ_FULL)
> +extern int __isolated_task_work_queue(void);
> +
> +static inline int isolated_task_work_queue(void)
> +{
> +	if (!housekeeping_test_cpu(raw_smp_processor_id(), HK_TYPE_TICK))

This is an unconditional call to a function defined in
kernel/sched/isolation.c, and only there a static_branch_unlikely() test
happens, but the call overhead is always paid, and the next patch adds that
to folio_add_lru().

I notice a housekeeping_cpu() function above that does the static branch
inline, which is great, except it defaults to return true so not directly
applicable, but this function could be done the same way to keep the static
branch inline.

> +		return -ENOTSUPP;
> +
> +	return __isolated_task_work_queue();
> +}
> +
> +extern void isolated_task_work_init(struct task_struct *tsk);
> +#else
> +static inline int isolated_task_work_queue(void) { }
> +static inline void isolated_task_work_init(struct task_struct *tsk) { }
> +#endif /* CONFIG_NO_HZ_FULL */
> +
>  #endif /* _LINUX_SCHED_ISOLATION_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f01979b600e8..01960434dbfd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4566,6 +4566,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	p->migration_pending = NULL;
>  #endif
>  	init_sched_mm_cid(p);
> +	isolated_task_work_init(p);
>  }
>  
>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5891e715f00d..410df1fedc9d 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -253,3 +253,34 @@ static int __init housekeeping_isolcpus_setup(char *str)
>  	return housekeeping_setup(str, flags);
>  }
>  __setup("isolcpus=", housekeeping_isolcpus_setup);
> +
> +#if defined(CONFIG_NO_HZ_FULL)
> +static void isolated_task_work(struct callback_head *head)
> +{
> +}
> +
> +int __isolated_task_work_queue(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	if (current->flags & PF_KTHREAD)
> +		return 0;
> +
> +	local_irq_save(flags);
> +	if (task_work_queued(&current->nohz_full_work)) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
> +out:
> +	local_irq_restore(flags);
> +	return ret;
> +}
> +
> +void isolated_task_work_init(struct task_struct *tsk)
> +{
> +	init_task_work(&tsk->nohz_full_work, isolated_task_work);
> +}
> +#endif /* CONFIG_NO_HZ_FULL */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a831af102070..24653f5879cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -60,6 +60,7 @@
>  #include <linux/stop_machine.h>
>  #include <linux/syscalls_api.h>
>  #include <linux/syscalls.h>
> +#include <linux/task_work.h>
>  #include <linux/tick.h>
>  #include <linux/topology.h>
>  #include <linux/types.h>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-06-26 13:16     ` Vlastimil Babka
@ 2024-06-27  6:54       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2024-06-27  6:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Andrew Morton,
	Thomas Gleixner, Oleg Nesterov

On Wed 26-06-24 15:16:04, Vlastimil Babka wrote:
> On 6/25/24 4:20 PM, Michal Hocko wrote:
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> >> LRUs can be drained through several ways. One of them may add disturbances
> >> to isolated workloads while queuing a work at any time to any target,
> >> whether running in nohz_full mode or not.
> >> 
> >> Prevent from that on isolated tasks with draining LRUs upon resuming to
> >> userspace using the isolated task work framework.
> >> 
> >> It's worth noting that this is inherently racy against
> >> lru_add_drain_all() remotely queueing the per CPU drain work and
> >> therefore it prevents from the undesired disturbance only
> >> *most of the time*.
> > 
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
> > 
> > I do not think we rely on LRU cache flushing for correctness purposes anywhere.
> 
> I guess drain via lru_cache_disable() should be honored, but also rare.

I do not think we can call it rare because it can be triggered by the
userspace by NUMA syscalls for example. I think we should just either
make it fail and let caller decide what to do or just make it best
effort and eventually fail the operation if there is no other way. The
latter has an advantage that the failure is lazy as well. In an ideal
world, memory offlining will be a complete no-no in isolated workloads
and mbind calls will not try to migrate memory that has been just
added on the LRU cache. In any case this would require to document this
limitation at least.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 14:15   ` Oleg Nesterov
  2024-06-25 15:16     ` Oleg Nesterov
@ 2024-07-03 12:41     ` Frederic Weisbecker
  1 sibling, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 12:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner

Le Tue, Jun 25, 2024 at 04:15:39PM +0200, Oleg Nesterov a écrit :
> On 06/25, Frederic Weisbecker wrote:
> >
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -177,6 +177,7 @@ void task_work_run(void)
> >
> >  		do {
> >  			next = work->next;
> > +			work->next = TASK_WORK_DEQUEUED;
> 
> OK, but then the additional change below makes sense too?
> 
> Oleg.
> ---
> 
> --- x/kernel/task_work.c
> +++ x/kernel/task_work.c
> @@ -106,8 +106,10 @@ task_work_cancel_match(struct task_struc
>  		if (!match(work, data)) {
>  			pprev = &work->next;
>  			work = READ_ONCE(*pprev);
> -		} else if (try_cmpxchg(pprev, &work, work->next))
> +		} else if (try_cmpxchg(pprev, &work, work->next)) {
> +			work->next = TASK_WORK_DEQUEUED;
>  			break;
> +		}
>  	}
>  	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>  
> 

Yes it does!

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 15:16     ` Oleg Nesterov
@ 2024-07-03 12:42       ` Frederic Weisbecker
  0 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 12:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner

Le Tue, Jun 25, 2024 at 05:16:25PM +0200, Oleg Nesterov a écrit :
> And probably task_work_add() should do the same when it returns -ESRCH.

Good point!

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] sched/isolation: Introduce isolated task work
  2024-06-26 13:27   ` Vlastimil Babka
@ 2024-07-03 12:47     ` Frederic Weisbecker
  0 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 12:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov

Le Wed, Jun 26, 2024 at 03:27:59PM +0200, Vlastimil Babka a écrit :
> On 6/25/24 3:52 PM, Frederic Weisbecker wrote:
> > Some asynchronous kernel work may be pending upon resume to userspace
> > and execute later on. On isolated workload this becomes problematic once
> > the process is done with preparatory work involving syscalls and wants
> > to run in userspace without being interrupted.
> > 
> > Provide an infrastructure to queue a work to be executed from the current
> > isolated task context right before resuming to userspace. This goes with
> > the assumption that isolated tasks are pinned to a single nohz_full CPU.
> > 
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> >  include/linux/sched.h           |  1 +
> >  include/linux/sched/isolation.h | 17 +++++++++++++++++
> >  kernel/sched/core.c             |  1 +
> >  kernel/sched/isolation.c        | 31 +++++++++++++++++++++++++++++++
> >  kernel/sched/sched.h            |  1 +
> >  5 files changed, 51 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d531b610c410..f6df21866055 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1349,6 +1349,7 @@ struct task_struct {
> >  #endif
> >  
> >  #ifdef CONFIG_NO_HZ_FULL
> > +	struct callback_head		nohz_full_work;
> >  	atomic_t			tick_dep_mask;
> >  #endif
> >  
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 2b461129d1fa..e69ec5ed1d70 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -72,4 +72,21 @@ static inline bool cpu_is_isolated(int cpu)
> >  	       cpuset_cpu_is_isolated(cpu);
> >  }
> >  
> > +#if defined(CONFIG_NO_HZ_FULL)
> > +extern int __isolated_task_work_queue(void);
> > +
> > +static inline int isolated_task_work_queue(void)
> > +{
> > +	if (!housekeeping_test_cpu(raw_smp_processor_id(), HK_TYPE_TICK))
> 
> This is an unconditional call to a function defined in
> kernel/sched/isolation.c, and only there a static_branch_unlikely() test
> happens, but the call overhead is always paid, and the next patch adds that
> to folio_add_lru().
> 
> I notice a housekeeping_cpu() function above that does the static branch
> inline, which is great, except it defaults to return true so not directly
> applicable, but this function could be done the same way to keep the static
> branch inline.

Right, there definetly needs some inlining.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-06-25 14:20   ` Michal Hocko
  2024-06-26 13:16     ` Vlastimil Babka
@ 2024-07-03 12:52     ` Frederic Weisbecker
  2024-07-04 13:11       ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 12:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov

Le Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko a écrit :
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > LRUs can be drained through several ways. One of them may add disturbances
> > to isolated workloads while queuing a work at any time to any target,
> > whether running in nohz_full mode or not.
> > 
> > Prevent from that on isolated tasks with draining LRUs upon resuming to
> > userspace using the isolated task work framework.
> > 
> > It's worth noting that this is inherently racy against
> > lru_add_drain_all() remotely queueing the per CPU drain work and
> > therefore it prevents from the undesired disturbance only
> > *most of the time*.
> 
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?

Do you mean I should add a call on return to the userspace path or can
I expect it to be drained at some point already?

The other limitation with that task work thing is that if the task
queueing the work actually goes to sleep and another task go on the CPU
and does isolated work in userspace, the drain doesn't happen. Now whether
that is a real problem or not, I have no idea.

> 
> I do not think we rely on LRU cache flushing for correctness purposes anywhere.
> 
> Please also CC linux MM ML once the core infrastructure is agreed on.

Ok, thanks.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-07-03 12:52     ` Frederic Weisbecker
@ 2024-07-04 13:11       ` Michal Hocko
  2024-07-17 13:21         ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2024-07-04 13:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov

On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> Le Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko a écrit :
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > LRUs can be drained through several ways. One of them may add disturbances
> > > to isolated workloads while queuing a work at any time to any target,
> > > whether running in nohz_full mode or not.
> > > 
> > > Prevent from that on isolated tasks with draining LRUs upon resuming to
> > > userspace using the isolated task work framework.
> > > 
> > > It's worth noting that this is inherently racy against
> > > lru_add_drain_all() remotely queueing the per CPU drain work and
> > > therefore it prevents from the undesired disturbance only
> > > *most of the time*.
> > 
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
> 
> Do you mean I should add a call on return to the userspace path or can
> I expect it to be drained at some point already?

I would make the particular per cpu cache to be drained on return to the
userspace.

> The other limitation with that task work thing is that if the task
> queueing the work actually goes to sleep and another task go on the CPU
> and does isolated work in userspace, the drain doesn't happen. Now whether
> that is a real problem or not, I have no idea.

Theoretically there is a problem because pages sitting on pcp LRU caches
cannot be migrated and some other operations will fail as well. But
practically speaking those pages should be mostly of interest to the
process allocating them most of the time. Page sharing between isolated
workloads sounds like a terrible idea to me. Maybe reality hits us in
this regards but we can deal with that when we learn about those
workloads.

So I wouldn't lose too much sleep over that. We are dealing with those
isolated workloads being broken by simple things like fork now because
that apparently adds pages on the pcp LRU cache and draining will happen
sooner or later (very often when the task is already running in the
userspace).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] task_work: Provide means to check if a work is queued
  2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
  2024-06-25 14:15   ` Oleg Nesterov
@ 2024-07-16 13:00   ` Valentin Schneider
  1 sibling, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2024-07-16 13:00 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar, Marcelo Tosatti,
	Vlastimil Babka, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov

On 25/06/24 15:52, Frederic Weisbecker wrote:
> Some task work users implement their own ways to know if a callback is
> already queued on the current task while fiddling with the callback
> head internals.
>
> Provide instead a consolidated API to serve this very purpose.
>

I'm potentially going to add yet another one of these to sched/fair.c, so
yes please!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work
  2024-06-25 13:52 ` [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2024-07-16 13:00   ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2024-07-16 13:00 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar, Marcelo Tosatti,
	Vlastimil Babka, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov

On 25/06/24 15:52, Frederic Weisbecker wrote:

> @@ -3456,7 +3455,6 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>       p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
>       p->numa_migrate_retry		= 0;

>       /* Protect against double add, see task_tick_numa and task_numa_work */
> -	p->numa_work.next		= &p->numa_work;

You'll want to get rid of the comment there as well.

>       p->numa_faults			= NULL;
>       p->numa_pages_migrated		= 0;
>       p->total_numa_faults		= 0;


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work
  2024-06-25 13:52 ` [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2024-07-16 13:00   ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2024-07-16 13:00 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar, Marcelo Tosatti,
	Vlastimil Babka, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov

On 25/06/24 15:52, Frederic Weisbecker wrote:
> Remove the ad-hoc implementation of task_work_queued()
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2024-07-04 13:11       ` Michal Hocko
@ 2024-07-17 13:21         ` Frederic Weisbecker
  0 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-07-17 13:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov

Le Thu, Jul 04, 2024 at 03:11:24PM +0200, Michal Hocko a écrit :
> On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> > Le Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko a écrit :
> > > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > > LRUs can be drained through several ways. One of them may add disturbances
> > > > to isolated workloads while queuing a work at any time to any target,
> > > > whether running in nohz_full mode or not.
> > > > 
> > > > Prevent from that on isolated tasks with draining LRUs upon resuming to
> > > > userspace using the isolated task work framework.
> > > > 
> > > > It's worth noting that this is inherently racy against
> > > > lru_add_drain_all() remotely queueing the per CPU drain work and
> > > > therefore it prevents from the undesired disturbance only
> > > > *most of the time*.
> > > 
> > > Can we simply not schedule flushing on remote CPUs and leave that to the
> > > "return to the userspace" path?
> > 
> > Do you mean I should add a call on return to the userspace path or can
> > I expect it to be drained at some point already?
> 
> I would make the particular per cpu cache to be drained on return to the
> userspace.

And then we need the patchset from Valentin that defers work to kernel entry?

> 
> > The other limitation with that task work thing is that if the task
> > queueing the work actually goes to sleep and another task go on the CPU
> > and does isolated work in userspace, the drain doesn't happen. Now whether
> > that is a real problem or not, I have no idea.
> 
> Theoretically there is a problem because pages sitting on pcp LRU caches
> cannot be migrated and some other operations will fail as well. But
> practically speaking those pages should be mostly of interest to the
> process allocating them most of the time. Page sharing between isolated
> workloads sounds like a terrible idea to me. Maybe reality hits us in
> this regards but we can deal with that when we learn about those
> workloads.
> 
> So I wouldn't lose too much sleep over that. We are dealing with those
> isolated workloads being broken by simple things like fork now because
> that apparently adds pages on the pcp LRU cache and draining will happen
> sooner or later (very often when the task is already running in the
> userspace).

That sounds good!

Thanks.

> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-07-17 13:21 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-25 13:52 [RFC PATCH 0/6] mm: LRU drain flush on nohz_full Frederic Weisbecker
2024-06-25 13:52 ` [RFC PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
2024-06-25 14:15   ` Oleg Nesterov
2024-06-25 15:16     ` Oleg Nesterov
2024-07-03 12:42       ` Frederic Weisbecker
2024-07-03 12:41     ` Frederic Weisbecker
2024-07-16 13:00   ` Valentin Schneider
2024-06-25 13:52 ` [RFC PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
2024-07-16 13:00   ` Valentin Schneider
2024-06-25 13:52 ` [RFC PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
2024-07-16 13:00   ` Valentin Schneider
2024-06-25 13:52 ` [RFC PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
2024-06-25 13:52 ` [RFC PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
2024-06-26 13:27   ` Vlastimil Babka
2024-07-03 12:47     ` Frederic Weisbecker
2024-06-25 13:52 ` [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
2024-06-25 14:20   ` Michal Hocko
2024-06-26 13:16     ` Vlastimil Babka
2024-06-27  6:54       ` Michal Hocko
2024-07-03 12:52     ` Frederic Weisbecker
2024-07-04 13:11       ` Michal Hocko
2024-07-17 13:21         ` Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox