* [PATCH 0/6 v2] mm: LRU drain flush on nohz_full
@ 2025-02-09 22:29 Frederic Weisbecker
  2025-02-09 22:29 ` [PATCH 1/6 v2] task_work: Provide means to check if a work is queued Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:29 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

When LRU folio batches are pending, the drain can be triggered remotely,
whether the remote CPU is running in userspace in nohz_full mode or not.
This kind of noise is expected to be caused by preparatory work performed
before a task runs isolated in userspace. This patchset proposes to flush
those pending batches before the task starts its critical work in
userspace.
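
The intended flow, as a minimal sketch (illustrative only: the first
function below is a hypothetical call site, the real hooks are
introduced in patches 5 and 6):

  /*
   * On a nohz_full CPU, wherever a folio has just been put on a per-CPU
   * batch (hypothetical call site, see patch 6 for the real one):
   */
  static void after_folio_batched(void)
  {
          isolated_task_work_queue();     /* defer the drain to resume */
  }

  /* Task work callback, run right before returning to userspace: */
  static void isolated_task_work(struct callback_head *head)
  {
          lru_add_and_bh_lrus_drain();    /* flush this CPU's pending batches */
  }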

Changes since v1:

_ Various task_work improvements (Oleg)

_ Remove leftover comment on numa_work (Valentin)

_ Add Reviewed-by tag from Valentin

_ Use housekeeping_cpu() instead of housekeeping_test_cpu() on
  fastpath to benefit from static branch (Vlastimil)

_ Spare the actual remote LRU drain when the target CPU is nohz_full. It
  is assumed the LRU drain will happen locally eventually. (Michal)

Frederic Weisbecker (6):
  task_work: Provide means to check if a work is queued
  sched/fair: Use task_work_queued() on numa_work
  sched: Use task_work_queued() on cid_work
  tick/nohz: Move nohz_full related fields out of hot task struct's
    places
  sched/isolation: Introduce isolated task work
  mm: Drain LRUs upon resume to userspace on nohz_full CPUs

 include/linux/sched.h           | 15 +++++++++------
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 include/linux/swap.h            |  1 +
 include/linux/task_work.h       | 12 ++++++++++++
 kernel/sched/core.c             |  6 ++----
 kernel/sched/fair.c             |  5 +----
 kernel/sched/isolation.c        | 34 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 kernel/task_work.c              | 11 +++++++++--
 mm/swap.c                       |  8 +++++++-
 10 files changed, 93 insertions(+), 17 deletions(-)

-- 
2.46.0




* [PATCH 1/6 v2] task_work: Provide means to check if a work is queued
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
@ 2025-02-09 22:29 ` Frederic Weisbecker
  2025-02-10 12:43   ` Oleg Nesterov
  2025-02-27 16:25   ` Valentin Schneider
  2025-02-09 22:30 ` [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:29 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

Some task work users implement their own ways to know if a callback is
already queued on the current task while fiddling with the callback
head internals.

Provide instead a consolidated API to serve this very purpose.
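
For illustration, a user then does something like the below instead of
peeking at the callback head internals (sketch only: the work field and
callback are made up, the real conversions follow in the next patches):

  /* Setup, typically when the task or the work is initialized: */
  init_task_work(&p->foo_work, foo_work_fn);

  /* (Re-)queueing point, e.g. from the tick: */
  if (!task_work_queued(&p->foo_work))
          task_work_add(p, &p->foo_work, TWA_RESUME);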

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/task_work.h | 12 ++++++++++++
 kernel/task_work.c        | 11 +++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 0646804860ff..31caf12c1313 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -5,12 +5,15 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 
+#define TASK_WORK_DEQUEUED	((void *) -1UL)
+
 typedef void (*task_work_func_t)(struct callback_head *);
 
 static inline void
 init_task_work(struct callback_head *twork, task_work_func_t func)
 {
 	twork->func = func;
+	twork->next = TASK_WORK_DEQUEUED;
 }
 
 enum task_work_notify_mode {
@@ -26,6 +29,15 @@ static inline bool task_work_pending(struct task_struct *task)
 	return READ_ONCE(task->task_works);
 }
 
+/*
+ * Check if a work is queued. Beware: this is inherently racy if the work can
+ * be queued elsewhere than the current task.
+ */
+static inline bool task_work_queued(struct callback_head *twork)
+{
+	return twork->next != TASK_WORK_DEQUEUED;
+}
+
 int task_work_add(struct task_struct *task, struct callback_head *twork,
 			enum task_work_notify_mode mode);
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index d1efec571a4a..0d7b04095753 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -56,6 +56,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 {
 	struct callback_head *head;
 
+	work->next = TASK_WORK_DEQUEUED;
+
 	if (notify == TWA_NMI_CURRENT) {
 		if (WARN_ON_ONCE(task != current))
 			return -EINVAL;
@@ -67,8 +69,10 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	head = READ_ONCE(task->task_works);
 	do {
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &work_exited)) {
+			work->next = TASK_WORK_DEQUEUED;
 			return -ESRCH;
+		}
 		work->next = head;
 	} while (!try_cmpxchg(&task->task_works, &head, work));
 
@@ -129,8 +133,10 @@ task_work_cancel_match(struct task_struct *task,
 		if (!match(work, data)) {
 			pprev = &work->next;
 			work = READ_ONCE(*pprev);
-		} else if (try_cmpxchg(pprev, &work, work->next))
+		} else if (try_cmpxchg(pprev, &work, work->next)) {
+			work->next = TASK_WORK_DEQUEUED;
 			break;
+		}
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 
@@ -224,6 +230,7 @@ void task_work_run(void)
 
 		do {
 			next = work->next;
+			work->next = TASK_WORK_DEQUEUED;
 			work->func(work);
 			work = next;
 			cond_resched();
-- 
2.46.0




* [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
  2025-02-09 22:29 ` [PATCH 1/6 v2] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2025-02-09 22:30 ` Frederic Weisbecker
  2025-02-10 12:47   ` Oleg Nesterov
  2025-02-27 16:25   ` Valentin Schneider
  2025-02-09 22:30 ` [PATCH 3/6 v2] sched: Use task_work_queued() on cid_work Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:30 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

Remove the ad-hoc implementation of task_work_queued().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce2e94ccad0c..6a2f45821dc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,7 +3303,6 @@ static void task_numa_work(struct callback_head *work)
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -3551,8 +3550,6 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
 	p->numa_migrate_retry		= 0;
-	/* Protect against double add, see task_tick_numa and task_numa_work */
-	p->numa_work.next		= &p->numa_work;
 	p->numa_faults			= NULL;
 	p->numa_pages_migrated		= 0;
 	p->total_numa_faults		= 0;
@@ -3593,7 +3590,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 
 	/*
-- 
2.46.0




* [PATCH 3/6 v2] sched: Use task_work_queued() on cid_work
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
  2025-02-09 22:29 ` [PATCH 1/6 v2] task_work: Provide means to check if a work is queued Frederic Weisbecker
  2025-02-09 22:30 ` [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2025-02-09 22:30 ` Frederic Weisbecker
  2025-02-10 12:49   ` Oleg Nesterov
  2025-02-09 22:30 ` [PATCH 4/6 v2] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:30 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

Remove the ad-hoc implementation of task_work_queued().

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..606f596a6e0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10538,7 +10538,6 @@ static void task_mm_cid_work(struct callback_head *work)
 
 	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
 
-	work->next = work;	/* Prevent double-add */
 	if (t->flags & PF_EXITING)
 		return;
 	mm = t->mm;
@@ -10582,7 +10581,6 @@ void init_sched_mm_cid(struct task_struct *t)
 		if (mm_users == 1)
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
 	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
@@ -10591,8 +10589,7 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 	struct callback_head *work = &curr->cid_work;
 	unsigned long now = jiffies;
 
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
 		return;
-- 
2.46.0




* [PATCH 4/6 v2] tick/nohz: Move nohz_full related fields out of hot task struct's places
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2025-02-09 22:30 ` [PATCH 3/6 v2] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2025-02-09 22:30 ` Frederic Weisbecker
  2025-02-09 22:30 ` [PATCH 5/6 v2] sched/isolation: Introduce isolated task work Frederic Weisbecker
  2025-02-09 22:30 ` [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:30 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

nohz_full is a feature that only fits rare and very specific corner
cases. Yet distros enable it by default and therefore the related fields
are always reserved in the task struct.

Those task fields currently sit in the middle of cacheline-hot areas
such as cputime accounting and context switch counting, which doesn't
make any sense for a feature that is disabled most of the time.

Move the nohz_full storage to colder places.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..10a9aa41b43a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1088,13 +1088,7 @@ struct task_struct {
 #endif
 	u64				gtime;
 	struct prev_cputime		prev_cputime;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	struct vtime			vtime;
-#endif
 
-#ifdef CONFIG_NO_HZ_FULL
-	atomic_t			tick_dep_mask;
-#endif
 	/* Context switch counts: */
 	unsigned long			nvcsw;
 	unsigned long			nivcsw;
@@ -1411,6 +1405,14 @@ struct task_struct {
 	struct task_delay_info		*delays;
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	struct vtime			vtime;
+#endif
+
+#ifdef CONFIG_NO_HZ_FULL
+	atomic_t			tick_dep_mask;
+#endif
+
 #ifdef CONFIG_FAULT_INJECTION
 	int				make_it_fail;
 	unsigned int			fail_nth;
-- 
2.46.0




* [PATCH 5/6 v2] sched/isolation: Introduce isolated task work
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2025-02-09 22:30 ` [PATCH 4/6 v2] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2025-02-09 22:30 ` Frederic Weisbecker
  2025-02-09 22:30 ` [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:30 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

Some asynchronous kernel work may be pending upon resume to userspace
and execute later on. For isolated workloads this becomes problematic
once the process is done with its preparatory work involving syscalls
and wants to run in userspace without being interrupted.

Provide an infrastructure to queue a work item to be executed from the
current isolated task context right before resuming to userspace. This
relies on the assumption that isolated tasks are pinned to a single
nohz_full CPU.
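
A minimal usage sketch (illustrative only, the caller below is made up;
patch 6 adds the real user):

  /* From a kernel path that just left deferrable work behind: */
  static void defer_cleanup_to_resume(void)
  {
          /*
           * Bails out (-ENOTSUPP) on housekeeping CPUs, where no
           * deferral is needed. Otherwise queues the per-task callback,
           * unless it is already queued.
           */
          if (isolated_task_work_queue())
                  return;

          /* isolated_task_work() will run right before userspace */
  }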

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h           |  1 +
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 kernel/sched/core.c             |  1 +
 kernel/sched/isolation.c        | 31 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 5 files changed, 51 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10a9aa41b43a..82827f962745 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1410,6 +1410,7 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
+	struct callback_head		nohz_full_work;
 	atomic_t			tick_dep_mask;
 #endif
 
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..74da4324b984 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
 	       cpuset_cpu_is_isolated(cpu);
 }
 
+#if defined(CONFIG_NO_HZ_FULL)
+extern int __isolated_task_work_queue(void);
+
+static inline int isolated_task_work_queue(void)
+{
+	if (housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+		return -ENOTSUPP;
+
+	return __isolated_task_work_queue();
+}
+
+extern void isolated_task_work_init(struct task_struct *tsk);
+#else
+static inline int isolated_task_work_queue(void) { return -ENOTSUPP; }
+static inline void isolated_task_work_init(struct task_struct *tsk) { }
+#endif /* CONFIG_NO_HZ_FULL */
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 606f596a6e0d..78b4b996f85d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4525,6 +4525,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	isolated_task_work_init(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 81bc8b329ef1..f25a5cb33c0d 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,3 +249,34 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	return housekeeping_setup(str, flags);
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);
+
+#if defined(CONFIG_NO_HZ_FULL)
+static void isolated_task_work(struct callback_head *head)
+{
+}
+
+int __isolated_task_work_queue(void)
+{
+	unsigned long flags;
+	int ret;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	local_irq_save(flags);
+	if (task_work_queued(&current->nohz_full_work)) {
+		ret = 0;
+		goto out;
+	}
+
+	ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
+out:
+	local_irq_restore(flags);
+	return ret;
+}
+
+void isolated_task_work_init(struct task_struct *tsk)
+{
+	init_task_work(&tsk->nohz_full_work, isolated_task_work);
+}
+#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38e0e323dda2..f80dc1cad219 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -60,6 +60,7 @@
 #include <linux/stop_machine.h>
 #include <linux/syscalls_api.h>
 #include <linux/syscalls.h>
+#include <linux/task_work.h>
 #include <linux/tick.h>
 #include <linux/topology.h>
 #include <linux/types.h>
-- 
2.46.0




* [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-09 22:29 [PATCH 0/6 v2] mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2025-02-09 22:30 ` [PATCH 5/6 v2] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-02-09 22:30 ` Frederic Weisbecker
  2025-02-10 10:50   ` Hillf Danton
  2025-02-11 11:42   ` Michal Hocko
  5 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-09 22:30 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Marcelo Tosatti, Vlastimil Babka,
	Andrew Morton, Michal Hocko, Thomas Gleixner, Oleg Nesterov,
	linux-mm

LRUs can be drained in several ways. One of them may disturb isolated
workloads by queuing work at any time to any target CPU, whether that
CPU runs in nohz_full mode or not.

Prevent that for isolated tasks by deferring LRU drains until resume to
userspace, using the isolated task work framework.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/swap.h     | 1 +
 kernel/sched/isolation.c | 3 +++
 mm/swap.c                | 8 +++++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b13b72645db3..a6fdcc04403e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,6 +406,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index f25a5cb33c0d..1f9ec201864c 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,8 @@
  *
  */
 
+#include <linux/swap.h>
+
 enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
@@ -253,6 +255,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
 #if defined(CONFIG_NO_HZ_FULL)
 static void isolated_task_work(struct callback_head *head)
 {
+	lru_add_and_bh_lrus_drain();
 }
 
 int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index fc8281ef4241..da1e569ee3ce 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -376,6 +377,8 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	}
 
 	local_unlock(&cpu_fbatches.lock);
+
+	isolated_task_work_queue();
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -738,7 +741,7 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
@@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
+	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+		return false;
+
 	/* Check these in order of likelihood that they're not zero */
 	return folio_batch_count(&fbatches->lru_add) ||
 		folio_batch_count(&fbatches->lru_move_tail) ||
-- 
2.46.0




* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-09 22:30 ` [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
@ 2025-02-10 10:50   ` Hillf Danton
  2025-02-10 11:19     ` Michal Hocko
  2025-02-10 11:46     ` Frederic Weisbecker
  2025-02-11 11:42   ` Michal Hocko
  1 sibling, 2 replies; 19+ messages in thread
From: Hillf Danton @ 2025-02-10 10:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marcelo Tosatti, Andrew Morton, Michal Hocko, linux-mm, LKML

On Sun,  9 Feb 2025 23:30:04 +0100 Frederic Weisbecker <frederic@kernel.org>
> @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>  
> +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +		return false;
> +
>  	/* Check these in order of likelihood that they're not zero */
>  	return folio_batch_count(&fbatches->lru_add) ||
>  		folio_batch_count(&fbatches->lru_move_tail) ||
> -- 
> 2.46.0

Nit, I'd like to add a debug line to test your assumption that
isolated tasks are pinned to a single nohz_full CPU.

--- x/mm/swap.c
+++ y/mm/swap.c
@@ -767,9 +767,10 @@ static void lru_add_drain_per_cpu(struct
 static bool cpu_needs_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
+	bool yes;
 
 	/* Check these in order of likelihood that they're not zero */
-	return folio_batch_count(&fbatches->lru_add) ||
+	yes = folio_batch_count(&fbatches->lru_add) ||
 		folio_batch_count(&fbatches->lru_move_tail) ||
 		folio_batch_count(&fbatches->lru_deactivate_file) ||
 		folio_batch_count(&fbatches->lru_deactivate) ||
@@ -777,6 +778,12 @@ static bool cpu_needs_drain(unsigned int
 		folio_batch_count(&fbatches->lru_activate) ||
 		need_mlock_drain(cpu) ||
 		has_bh_in_lru(cpu, NULL);
+
+	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
+		VM_BUG_ON(yes);
+		return false;
+	}
+	return yes;
 }
 
 /*



* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-10 10:50   ` Hillf Danton
@ 2025-02-10 11:19     ` Michal Hocko
  2025-02-10 11:46     ` Frederic Weisbecker
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2025-02-10 11:19 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Frederic Weisbecker, Marcelo Tosatti, Andrew Morton, linux-mm,
	LKML

On Mon 10-02-25 18:50:26, Hillf Danton wrote:
> On Sun,  9 Feb 2025 23:30:04 +0100 Frederic Weisbecker <frederic@kernel.org>
> > @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
> >  {
> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> >  
> > +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> > +		return false;
> > +
> >  	/* Check these in order of likelihood that they're not zero */
> >  	return folio_batch_count(&fbatches->lru_add) ||
> >  		folio_batch_count(&fbatches->lru_move_tail) ||
> > -- 
> > 2.46.0
> 
> Nit, I'd like to add a debug line to test your assumption that
> isolated tasks are pinned to a single nohz_full CPU.

Why? And why would you like to add a trivial VM_BUG_ON vector to the
kernel?

> --- x/mm/swap.c
> +++ y/mm/swap.c
> @@ -767,9 +767,10 @@ static void lru_add_drain_per_cpu(struct
>  static bool cpu_needs_drain(unsigned int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> +	bool yes;
>  
>  	/* Check these in order of likelihood that they're not zero */
> -	return folio_batch_count(&fbatches->lru_add) ||
> +	yes = folio_batch_count(&fbatches->lru_add) ||
>  		folio_batch_count(&fbatches->lru_move_tail) ||
>  		folio_batch_count(&fbatches->lru_deactivate_file) ||
>  		folio_batch_count(&fbatches->lru_deactivate) ||
> @@ -777,6 +778,12 @@ static bool cpu_needs_drain(unsigned int
>  		folio_batch_count(&fbatches->lru_activate) ||
>  		need_mlock_drain(cpu) ||
>  		has_bh_in_lru(cpu, NULL);
> +
> +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
> +		VM_BUG_ON(yes);
> +		return false;
> +	}
> +	return yes;
>  }
>  
>  /*

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-10 10:50   ` Hillf Danton
  2025-02-10 11:19     ` Michal Hocko
@ 2025-02-10 11:46     ` Frederic Weisbecker
  2025-02-11 11:31       ` Hillf Danton
  1 sibling, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2025-02-10 11:46 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Marcelo Tosatti, Andrew Morton, Michal Hocko, linux-mm, LKML

On Mon, Feb 10, 2025 at 06:50:26PM +0800, Hillf Danton wrote:
> On Sun,  9 Feb 2025 23:30:04 +0100 Frederic Weisbecker <frederic@kernel.org>
> > @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
> >  {
> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> >  
> > +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> > +		return false;
> > +
> >  	/* Check these in order of likelihood that they're not zero */
> >  	return folio_batch_count(&fbatches->lru_add) ||
> >  		folio_batch_count(&fbatches->lru_move_tail) ||
> > -- 
> > 2.46.0
> 
> Nit, I'd like to add a debug line to test your assumption that
> isolated tasks are pinned to a single nohz_full CPU.
> 
> --- x/mm/swap.c
> +++ y/mm/swap.c
> @@ -767,9 +767,10 @@ static void lru_add_drain_per_cpu(struct
>  static bool cpu_needs_drain(unsigned int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> +	bool yes;
>  
>  	/* Check these in order of likelihood that they're not zero */
> -	return folio_batch_count(&fbatches->lru_add) ||
> +	yes = folio_batch_count(&fbatches->lru_add) ||
>  		folio_batch_count(&fbatches->lru_move_tail) ||
>  		folio_batch_count(&fbatches->lru_deactivate_file) ||
>  		folio_batch_count(&fbatches->lru_deactivate) ||
> @@ -777,6 +778,12 @@ static bool cpu_needs_drain(unsigned int
>  		folio_batch_count(&fbatches->lru_activate) ||
>  		need_mlock_drain(cpu) ||
>  		has_bh_in_lru(cpu, NULL);
> +
> +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
> +		VM_BUG_ON(yes);
> +		return false;
> +	}
> +	return yes;

If the task isn't pinned then the guarantees of nohz_full are broken anyway.
Also if the task migrates it will simply execute the work elsewhere.

My only worry is kernel threads. Those are simply ignored in this patchset but
this is not right as they can do allocations. Yet they can't execute anything
on return to userspace...

Thoughts?

Thanks.

>  }
>  
>  /*
> 



* Re: [PATCH 1/6 v2] task_work: Provide means to check if a work is queued
  2025-02-09 22:29 ` [PATCH 1/6 v2] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2025-02-10 12:43   ` Oleg Nesterov
  2025-03-25 14:25     ` Frederic Weisbecker
  2025-02-27 16:25   ` Valentin Schneider
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-02-10 12:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner, linux-mm

On 02/09, Frederic Weisbecker wrote:
>
> @@ -56,6 +56,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  {
>  	struct callback_head *head;
>
> +	work->next = TASK_WORK_DEQUEUED;

Do we really need to do this at the start of task_work_add() ?

If the caller didn't do init_task_work() before and task_work_add()
returns -EINVAL we probably do not care?

Reviewed-by: Oleg Nesterov <oleg@redhat.com>




* Re: [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work
  2025-02-09 22:30 ` [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2025-02-10 12:47   ` Oleg Nesterov
  2025-02-27 16:25   ` Valentin Schneider
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2025-02-10 12:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner, linux-mm

On 02/09, Frederic Weisbecker wrote:
>
> Remove the ad-hoc implementation of task_work_queued().
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Oleg Nesterov <oleg@redhat.com>




* Re: [PATCH 3/6 v2] sched: Use task_work_queued() on cid_work
  2025-02-09 22:30 ` [PATCH 3/6 v2] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2025-02-10 12:49   ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2025-02-10 12:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner, linux-mm

On 02/09, Frederic Weisbecker wrote:
>
> Remove the ad-hoc implementation of task_work_queued().
>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Oleg Nesterov <oleg@redhat.com>




* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-10 11:46     ` Frederic Weisbecker
@ 2025-02-11 11:31       ` Hillf Danton
  0 siblings, 0 replies; 19+ messages in thread
From: Hillf Danton @ 2025-02-11 11:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marcelo Tosatti, Andrew Morton, Michal Hocko, linux-mm, LKML

On Mon, 10 Feb 2025 12:46:44 +0100 Frederic Weisbecker
> Le Mon, Feb 10, 2025 at 06:50:26PM +0800, Hillf Danton
> > On Sun,  9 Feb 2025 23:30:04 +0100 Frederic Weisbecker <frederic@kernel.org>
> > > @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
> > >  {
> > >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> > >  
> > > +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> > > +		return false;
> > > +
> > >  	/* Check these in order of likelihood that they're not zero */
> > >  	return folio_batch_count(&fbatches->lru_add) ||
> > >  		folio_batch_count(&fbatches->lru_move_tail) ||
> > > -- 
> > > 2.46.0
> > 
> > Nit, I'd like to add a debug line to test your assumption that
> > isolated tasks are pinned to a single nohz_full CPU.
> > 
> > --- x/mm/swap.c
> > +++ y/mm/swap.c
> > @@ -767,9 +767,10 @@ static void lru_add_drain_per_cpu(struct
> >  static bool cpu_needs_drain(unsigned int cpu)
> >  {
> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> > +	bool yes;
> >  
> >  	/* Check these in order of likelihood that they're not zero */
> > -	return folio_batch_count(&fbatches->lru_add) ||
> > +	yes = folio_batch_count(&fbatches->lru_add) ||
> >  		folio_batch_count(&fbatches->lru_move_tail) ||
> >  		folio_batch_count(&fbatches->lru_deactivate_file) ||
> >  		folio_batch_count(&fbatches->lru_deactivate) ||
> > @@ -777,6 +778,12 @@ static bool cpu_needs_drain(unsigned int
> >  		folio_batch_count(&fbatches->lru_activate) ||
> >  		need_mlock_drain(cpu) ||
> >  		has_bh_in_lru(cpu, NULL);
> > +
> > +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
> > +		VM_BUG_ON(yes);
> > +		return false;
> > +	}
> > +	return yes;
> 
> If the task isn't pinned then the guarantees of nohz_full are broken anyway.
> Also if the task migrates it will simply execute the work elsewhere.
> 
Kernel code here ends up depending on how smart (or stupid) the activity
in userspace is, and that dependence does not sound good.

> My only worry is kernel threads. Those are simply ignored in this patchset but
> this is not right as they can do allocations. Yet they can't execute anything
> on return to userspace...
> 
> Thoughts?



* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-09 22:30 ` [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  2025-02-10 10:50   ` Hillf Danton
@ 2025-02-11 11:42   ` Michal Hocko
  2025-04-04 13:14     ` Frederic Weisbecker
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-02-11 11:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov, linux-mm

On Sun 09-02-25 23:30:04, Frederic Weisbecker wrote:
> LRUs can be drained in several ways. One of them may disturb isolated
> workloads by queuing work at any time to any target CPU, whether that
> CPU runs in nohz_full mode or not.
> 
> Prevent that for isolated tasks by deferring LRU drains until resume to
> userspace, using the isolated task work framework.

I have to say this is a rather cryptic description of the underlying
problem. What do you think about the following:

LRU batching can be a source of disturbances for isolated workloads
running in userspace, because it requires a kernel worker to handle the
batches and that worker would preempt the said task. The primary source
of such disruption would be __lru_add_drain_all, which can be triggered
from non-isolated CPUs.

Why would an isolated CPU have anything on the pcp cache? Many syscalls
allocate pages that might end up there. A typical and unavoidable one
would be fork/exec, leaving pages on the cache behind just waiting for
somebody to drain them.

This patch addresses the problem by noting that a page has been added
to the cache and scheduling the drain on the return path to userspace,
so the work is done while the syscall is still executing and there are
no surprises while the task runs in userspace, where it doesn't want to
be preempted.

> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  include/linux/swap.h     | 1 +
>  kernel/sched/isolation.c | 3 +++
>  mm/swap.c                | 8 +++++++-
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b13b72645db3..a6fdcc04403e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -406,6 +406,7 @@ extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
>  extern void lru_add_drain_all(void);
> +extern void lru_add_and_bh_lrus_drain(void);
>  void folio_deactivate(struct folio *folio);
>  void folio_mark_lazyfree(struct folio *folio);
>  extern void swap_setup(void);
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index f25a5cb33c0d..1f9ec201864c 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -8,6 +8,8 @@
>   *
>   */
>  
> +#include <linux/swap.h>
> +
>  enum hk_flags {
>  	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
>  	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
> @@ -253,6 +255,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
>  #if defined(CONFIG_NO_HZ_FULL)
>  static void isolated_task_work(struct callback_head *head)
>  {
> +	lru_add_and_bh_lrus_drain();
>  }
>  
>  int __isolated_task_work_queue(void)
> diff --git a/mm/swap.c b/mm/swap.c
> index fc8281ef4241..da1e569ee3ce 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/local_lock.h>
>  #include <linux/buffer_head.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -376,6 +377,8 @@ static void __lru_cache_activate_folio(struct folio *folio)
>  	}
>  
>  	local_unlock(&cpu_fbatches.lock);
> +
> +	isolated_task_work_queue();
>  }

This placement doesn't make much sense to me. I would put
isolated_task_work_queue() where we actually queue something up. That
would be folio_batch_add(), if folio_batch_space(fbatch) > 0.
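
Something along these lines, completely untested, just to illustrate
the placement (folio_batch_add() returns the remaining space after the
add):

  if (folio_batch_add(fbatch, folio) > 0)
          isolated_task_work_queue();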

>  
>  #ifdef CONFIG_LRU_GEN
> @@ -738,7 +741,7 @@ void lru_add_drain(void)
>   * the same cpu. It shouldn't be a problem in !SMP case since
>   * the core is only one and the locks will disable preemption.
>   */
> -static void lru_add_and_bh_lrus_drain(void)
> +void lru_add_and_bh_lrus_drain(void)
>  {
>  	local_lock(&cpu_fbatches.lock);
>  	lru_add_drain_cpu(smp_processor_id());
> @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>  
> +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +		return false;
> +

Would it make more sense to use cpu_is_isolated() and use it explicitly
in __lru_add_drain_all so that it is clearly visible - with a comment
that isolated workloads are dealing with cache on their return to
userspace.

>  	/* Check these in order of likelihood that they're not zero */
>  	return folio_batch_count(&fbatches->lru_add) ||
>  		folio_batch_count(&fbatches->lru_move_tail) ||
> -- 
> 2.46.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/6 v2] task_work: Provide means to check if a work is queued
  2025-02-09 22:29 ` [PATCH 1/6 v2] task_work: Provide means to check if a work is queued Frederic Weisbecker
  2025-02-10 12:43   ` Oleg Nesterov
@ 2025-02-27 16:25   ` Valentin Schneider
  1 sibling, 0 replies; 19+ messages in thread
From: Valentin Schneider @ 2025-02-27 16:25 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar, Marcelo Tosatti,
	Vlastimil Babka, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov, linux-mm

On 09/02/25 23:29, Frederic Weisbecker wrote:
> Some task work users implement their own ways to know if a callback is
> already queued on the current task while fiddling with the callback
> head internals.
>
> Provide instead a consolidated API to serve this very purpose.
>

Ditto on Oleg's comment, otherwise:

Reviewed-by: Valentin Schneider <vschneid@redhat.com>




* Re: [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work
  2025-02-09 22:30 ` [PATCH 2/6 v2] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
  2025-02-10 12:47   ` Oleg Nesterov
@ 2025-02-27 16:25   ` Valentin Schneider
  1 sibling, 0 replies; 19+ messages in thread
From: Valentin Schneider @ 2025-02-27 16:25 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Ingo Molnar, Marcelo Tosatti,
	Vlastimil Babka, Andrew Morton, Michal Hocko, Thomas Gleixner,
	Oleg Nesterov, linux-mm

On 09/02/25 23:30, Frederic Weisbecker wrote:
> Remove the ad-hoc implementation of task_work_queued().
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>




* Re: [PATCH 1/6 v2] task_work: Provide means to check if a work is queued
  2025-02-10 12:43   ` Oleg Nesterov
@ 2025-03-25 14:25     ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-03-25 14:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Thomas Gleixner, linux-mm

On Mon, Feb 10, 2025 at 01:43:41PM +0100, Oleg Nesterov wrote:
> On 02/09, Frederic Weisbecker wrote:
> >
> > @@ -56,6 +56,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> >  {
> >  	struct callback_head *head;
> >
> > +	work->next = TASK_WORK_DEQUEUED;
> 
> Do we really need to do this at the start of task_work_add() ?
> 
> If the caller didn't do init_task_work() before and task_work_add()
> returns -EINVAL we probably do not care?

Yes good point. Let me fix that...

> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 



* Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-02-11 11:42   ` Michal Hocko
@ 2025-04-04 13:14     ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2025-04-04 13:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Marcelo Tosatti, Vlastimil Babka, Andrew Morton, Thomas Gleixner,
	Oleg Nesterov, linux-mm

On Tue, Feb 11, 2025 at 12:42:51PM +0100, Michal Hocko wrote:
> On Sun 09-02-25 23:30:04, Frederic Weisbecker wrote:
> > LRUs can be drained in several ways. One of them may disturb isolated
> > workloads by queuing work at any time to any target CPU, whether that
> > CPU runs in nohz_full mode or not.
> > 
> > Prevent that for isolated tasks by deferring LRU drains until resume to
> > userspace, using the isolated task work framework.
> 
> I have to say this is a rather cryptic description of the underlying
> problem. What do you think about the following:
> 
> LRU batching can be a source of disturbances for isolated workloads
> running in userspace, because it requires a kernel worker to handle the
> batches and that worker would preempt the said task. The primary source
> of such disruption would be __lru_add_drain_all, which can be triggered
> from non-isolated CPUs.
> 
> Why would an isolated CPU have anything on the pcp cache? Many syscalls
> allocate pages that might end up there. A typical and unavoidable one
> would be fork/exec, leaving pages on the cache behind just waiting for
> somebody to drain them.
> 
> This patch addresses the problem by noting that a page has been added
> to the cache and scheduling the drain on the return path to userspace,
> so the work is done while the syscall is still executing and there are
> no surprises while the task runs in userspace, where it doesn't want to
> be preempted.

Much better indeed :-)

> > @@ -376,6 +377,8 @@ static void __lru_cache_activate_folio(struct folio *folio)
> >  	}
> >  
> >  	local_unlock(&cpu_fbatches.lock);
> > +
> > +	isolated_task_work_queue();
> >  }
> 
> This placement doesn't make much sense to me. I would put
> isolated_task_work_queue() where we actually queue something up. That
> would be folio_batch_add(), if folio_batch_space(fbatch) > 0.

Ok, I moved it there. But why doesn't it also make sense when
folio_batch_space(fbatch) == 0? Doesn't that mean that folio_batch_add()
still added the entry and there is just no further room left?

> 
> >  
> >  #ifdef CONFIG_LRU_GEN
> > @@ -738,7 +741,7 @@ void lru_add_drain(void)
> >   * the same cpu. It shouldn't be a problem in !SMP case since
> >   * the core is only one and the locks will disable preemption.
> >   */
> > -static void lru_add_and_bh_lrus_drain(void)
> > +void lru_add_and_bh_lrus_drain(void)
> >  {
> >  	local_lock(&cpu_fbatches.lock);
> >  	lru_add_drain_cpu(smp_processor_id());
> > @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
> >  {
> >  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> >  
> > +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> > +		return false;
> > +
> 
> Would it make more sense to use cpu_is_isolated() and use it explicitly
> in __lru_add_drain_all so that it is clearly visible - with a comment
> that isolated workloads are dealing with cache on their return to
> userspace.

No, because cpu_is_isolated() is also true when the CPU is on a NULL
domain (isolated cpuset partition, or isolcpus=domain). And not all
workloads want the possible overhead of scattered batching after every
syscall.

Thanks.

> 
> >  	/* Check these in order of likelihood that they're not zero */
> >  	return folio_batch_count(&fbatches->lru_add) ||
> >  		folio_batch_count(&fbatches->lru_move_tail) ||
> > -- 
> > 2.46.0
> 
> -- 
> Michal Hocko
> SUSE Labs


