* [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full
@ 2025-04-10 15:23 Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Oleg Nesterov, Peter Zijlstra,
Valentin Schneider, Thomas Gleixner, Michal Hocko, linux-mm,
Ingo Molnar, Marcelo Tosatti, Vlastimil Babka, Andrew Morton
Hi,
When LRUs are pending, the drain can be triggered remotely, whether the
remote CPU is running in userspace in nohz_full mode or not. This kind
of noise is expected to be caused by preparatory work before a task
runs isolated in userspace. This patchset is a proposal to flush that
before the task starts its critical work in userspace.
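To make the targeted scenario concrete, here is a minimal userspace
sketch of the kind of workload this is meant to help. The CPU number and
the shape of the loop are made up for illustration, and no new kernel
ABI is involved:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>

    int main(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(3, &set);       /* assume CPU 3 is in nohz_full= */
            if (sched_setaffinity(0, sizeof(set), &set))
                    abort();

            /*
             * Preparatory phase: syscalls issued here (mmap, fork/exec of
             * helpers, file I/O, ...) may leave folios in this CPU's LRU
             * batches.
             */

            /*
             * With this series, the pending LRUs are drained on the final
             * return to userspace, so the loop below isn't preempted later
             * by a remotely queued drain worker.
             */
            for (;;) {
                    /* latency-critical polling, no syscalls expected */
            }
    }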
Changes since v2:
* Add tags (thanks everyone)
* Assume TASK_WORK_DEQUEUED is set before calling task_work_add() (Oleg)
* Return -EINVAL if queued from kthread instead of silently ignoring.
* Queue from a more appropriate place (folio_batch_add()) (Michal)
* Refactor changelog on last patch (Michal)
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
task/work
HEAD: 036bfe5153de18c995653ee5074d5eec463bbde0
Thanks,
Frederic
---
Frederic Weisbecker (6):
task_work: Provide means to check if a work is queued
sched/fair: Use task_work_queued() on numa_work
sched: Use task_work_queued() on cid_work
tick/nohz: Move nohz_full related fields out of hot task struct's places
sched/isolation: Introduce isolated task work
mm: Drain LRUs upon resume to userspace on nohz_full CPUs
include/linux/pagevec.h | 18 ++----------------
include/linux/sched.h | 15 +++++++++------
include/linux/sched/isolation.h | 17 +++++++++++++++++
include/linux/swap.h | 1 +
include/linux/task_work.h | 12 ++++++++++++
kernel/sched/core.c | 6 ++----
kernel/sched/fair.c | 5 +----
kernel/sched/isolation.c | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
kernel/task_work.c | 9 +++++++--
mm/swap.c | 30 +++++++++++++++++++++++++++++-
11 files changed, 115 insertions(+), 33 deletions(-)
* [PATCH 1/6] task_work: Provide means to check if a work is queued
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
` (5 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
Some task work users implement their own ways to know if a callback is
already queued on the current task while fiddling with the callback
head internals.
Provide instead a consolidated API to serve this very purpose.
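As an illustration only (the callers converted in the next patches are
the real examples), a user of the new helper looks roughly like this;
the names below are hypothetical and assume a single queueing task:

    static struct callback_head my_work;  /* init_task_work(&my_work, my_work_func) at init time */

    static void my_work_func(struct callback_head *head)
    {
            /* Runs from task_work_run(), e.g. on return to userspace */
    }

    static int my_work_queue_on_current(void)
    {
            /* Replaces open-coded peeking at my_work.next */
            if (task_work_queued(&my_work))
                    return 0;

            return task_work_add(current, &my_work, TWA_RESUME);
    }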
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/task_work.h | 12 ++++++++++++
kernel/task_work.c | 9 +++++++--
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 0646804860ff..31caf12c1313 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -5,12 +5,15 @@
#include <linux/list.h>
#include <linux/sched.h>
+#define TASK_WORK_DEQUEUED ((void *) -1UL)
+
typedef void (*task_work_func_t)(struct callback_head *);
static inline void
init_task_work(struct callback_head *twork, task_work_func_t func)
{
twork->func = func;
+ twork->next = TASK_WORK_DEQUEUED;
}
enum task_work_notify_mode {
@@ -26,6 +29,15 @@ static inline bool task_work_pending(struct task_struct *task)
return READ_ONCE(task->task_works);
}
+/*
+ * Check if a work is queued. Beware: this is inherently racy if the work can
+ * be queued elsewhere than the current task.
+ */
+static inline bool task_work_queued(struct callback_head *twork)
+{
+ return twork->next != TASK_WORK_DEQUEUED;
+}
+
int task_work_add(struct task_struct *task, struct callback_head *twork,
enum task_work_notify_mode mode);
diff --git a/kernel/task_work.c b/kernel/task_work.c
index d1efec571a4a..56718cb824d9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -67,8 +67,10 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
head = READ_ONCE(task->task_works);
do {
- if (unlikely(head == &work_exited))
+ if (unlikely(head == &work_exited)) {
+ work->next = TASK_WORK_DEQUEUED;
return -ESRCH;
+ }
work->next = head;
} while (!try_cmpxchg(&task->task_works, &head, work));
@@ -129,8 +131,10 @@ task_work_cancel_match(struct task_struct *task,
if (!match(work, data)) {
pprev = &work->next;
work = READ_ONCE(*pprev);
- } else if (try_cmpxchg(pprev, &work, work->next))
+ } else if (try_cmpxchg(pprev, &work, work->next)) {
+ work->next = TASK_WORK_DEQUEUED;
break;
+ }
}
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -224,6 +228,7 @@ void task_work_run(void)
do {
next = work->next;
+ work->next = TASK_WORK_DEQUEUED;
work->func(work);
work = next;
cond_resched();
--
2.48.1
* [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
` (4 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
Remove the ad-hoc implementation of task_work_queued().
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/fair.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c6ffa2fdbbd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3317,7 +3317,6 @@ static void task_numa_work(struct callback_head *work)
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
- work->next = work;
/*
* Who cares about NUMA placement when they're dying.
*
@@ -3565,8 +3564,6 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
p->numa_scan_seq = mm ? mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_migrate_retry = 0;
- /* Protect against double add, see task_tick_numa and task_numa_work */
- p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_pages_migrated = 0;
p->total_numa_faults = 0;
@@ -3607,7 +3604,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
/*
* We don't care about NUMA placement if we don't have memory.
*/
- if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
return;
/*
--
2.48.1
* [PATCH 3/6] sched: Use task_work_queued() on cid_work
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
2025-04-10 15:23 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
` (3 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
Remove the ad-hoc implementation of task_work_queued().
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/core.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfaca3040b2f..add41254b6e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10576,7 +10576,6 @@ static void task_mm_cid_work(struct callback_head *work)
WARN_ON_ONCE(t != container_of(work, struct task_struct, cid_work));
- work->next = work; /* Prevent double-add */
if (t->flags & PF_EXITING)
return;
mm = t->mm;
@@ -10620,7 +10619,6 @@ void init_sched_mm_cid(struct task_struct *t)
if (mm_users == 1)
mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
}
- t->cid_work.next = &t->cid_work; /* Protect against double add */
init_task_work(&t->cid_work, task_mm_cid_work);
}
@@ -10629,8 +10627,7 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
struct callback_head *work = &curr->cid_work;
unsigned long now = jiffies;
- if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
- work->next != work)
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
return;
if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
return;
--
2.48.1
* [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
` (2 preceding siblings ...)
2025-04-10 15:23 ` [PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
2025-04-23 18:40 ` Shrikanth Hegde
2025-04-10 15:23 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
` (2 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
nohz_full is a feature that only fits into rare and very specific corner cases.
Yet distros enable it by default and therefore the related fields are
always reserved in the task struct.
Those task fields are stored in the middle of cacheline hot places such
as cputime accounting and context switch counting, which doesn't make
any sense for a feature that is disabled most of the time.
Move the nohz_full storage to colder places.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched.h | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..b5ce76db6d75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1110,13 +1110,7 @@ struct task_struct {
#endif
u64 gtime;
struct prev_cputime prev_cputime;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
- struct vtime vtime;
-#endif
-#ifdef CONFIG_NO_HZ_FULL
- atomic_t tick_dep_mask;
-#endif
/* Context switch counts: */
unsigned long nvcsw;
unsigned long nivcsw;
@@ -1438,6 +1432,14 @@ struct task_struct {
struct task_delay_info *delays;
#endif
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+ struct vtime vtime;
+#endif
+
+#ifdef CONFIG_NO_HZ_FULL
+ atomic_t tick_dep_mask;
+#endif
+
#ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
unsigned int fail_nth;
--
2.48.1
* [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
` (3 preceding siblings ...)
2025-04-10 15:23 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
2025-04-11 10:25 ` Oleg Nesterov
2025-04-10 15:23 ` [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
[not found] ` <20250412025831.4010-1-hdanton@sina.com>
6 siblings, 1 reply; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
Some asynchronous kernel work may be pending upon resume to userspace
and execute later on. On isolated workloads this becomes problematic once
the process is done with preparatory work involving syscalls and wants
to run in userspace without being interrupted.
Provide an infrastructure to queue a work to be executed from the current
isolated task context right before resuming to userspace. This goes with
the assumption that isolated tasks are pinned to a single nohz_full CPU.
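For context, a subsystem is expected to hook into this in two places:
note its per-CPU state where that state is produced, then flush it from
isolated_task_work(). A rough sketch, with placeholder names (the actual
user comes in the next patch):

    /* Wherever the subsystem populates a per-CPU cache: */
    static void my_subsys_note_percpu_state(void)
    {
            /* ... this CPU's cache gets updated here ... */

            /* Defer the flush to the current task's return to userspace */
            isolated_task_work_queue();
    }

    /* And the flush itself, called from kernel/sched/isolation.c: */
    static void isolated_task_work(struct callback_head *head)
    {
            my_subsys_flush_this_cpu();     /* placeholder flush routine */
    }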
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched.h | 1 +
include/linux/sched/isolation.h | 17 +++++++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/isolation.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
5 files changed, 51 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b5ce76db6d75..4d764eb96e3e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1437,6 +1437,7 @@ struct task_struct {
#endif
#ifdef CONFIG_NO_HZ_FULL
+ struct callback_head nohz_full_work;
atomic_t tick_dep_mask;
#endif
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..74da4324b984 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
cpuset_cpu_is_isolated(cpu);
}
+#if defined(CONFIG_NO_HZ_FULL)
+extern int __isolated_task_work_queue(void);
+
+static inline int isolated_task_work_queue(void)
+{
+ if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+ return -ENOTSUPP;
+
+ return __isolated_task_work_queue();
+}
+
+extern void isolated_task_work_init(struct task_struct *tsk);
+#else
+static inline int isolated_task_work_queue(void) { return -ENOTSUPP; }
+static inline void isolated_task_work_init(struct task_struct *tsk) { }
+#endif /* CONFIG_NO_HZ_FULL */
+
#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index add41254b6e5..c8b8b61ac3a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->migration_pending = NULL;
#endif
init_sched_mm_cid(p);
+ isolated_task_work_init(p);
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 81bc8b329ef1..e246287de9fa 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,3 +249,34 @@ static int __init housekeeping_isolcpus_setup(char *str)
return housekeeping_setup(str, flags);
}
__setup("isolcpus=", housekeeping_isolcpus_setup);
+
+#if defined(CONFIG_NO_HZ_FULL)
+static void isolated_task_work(struct callback_head *head)
+{
+}
+
+int __isolated_task_work_queue(void)
+{
+ unsigned long flags;
+ int ret;
+
+ if (current->flags & PF_KTHREAD)
+ return -EINVAL;
+
+ local_irq_save(flags);
+ if (task_work_queued(&current->nohz_full_work)) {
+ ret = 0;
+ goto out;
+ }
+
+ ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
+out:
+ local_irq_restore(flags);
+ return ret;
+}
+
+void isolated_task_work_init(struct task_struct *tsk)
+{
+ init_task_work(&tsk->nohz_full_work, isolated_task_work);
+}
+#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea70..e7dc4ae5ccc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -60,6 +60,7 @@
#include <linux/stop_machine.h>
#include <linux/syscalls_api.h>
#include <linux/syscalls.h>
+#include <linux/task_work.h>
#include <linux/tick.h>
#include <linux/topology.h>
#include <linux/types.h>
--
2.48.1
* [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
2025-04-10 15:23 [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
` (4 preceding siblings ...)
2025-04-10 15:23 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-04-10 15:23 ` Frederic Weisbecker
[not found] ` <20250412025831.4010-1-hdanton@sina.com>
6 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
LRU batching can be a source of disturbances for isolated workloads
running in userspace, because it requires a kernel worker to handle the
drain and that worker preempts the said task. The primary source of such
disruption is __lru_add_drain_all(), which can be triggered from
non-isolated CPUs.
Why would an isolated CPU have anything on the pcp cache? Many syscalls
allocate pages that might end up there. A typical and unavoidable one is
fork/exec, which leaves pages behind on the cache, just waiting for
somebody to drain them.
Address the problem by noting that a batch has been added to the cache
and scheduling a drain upon return to userspace, so that the work is
done while the syscall is still executing and there are no surprises
while the task runs in userspace, where it doesn't want to be preempted.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/pagevec.h | 18 ++----------------
include/linux/swap.h | 1 +
kernel/sched/isolation.c | 3 +++
mm/swap.c | 30 +++++++++++++++++++++++++++++-
4 files changed, 35 insertions(+), 17 deletions(-)
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 5d3a0cccc6bf..7e647b8df4c7 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -61,22 +61,8 @@ static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
return PAGEVEC_SIZE - fbatch->nr;
}
-/**
- * folio_batch_add() - Add a folio to a batch.
- * @fbatch: The folio batch.
- * @folio: The folio to add.
- *
- * The folio is added to the end of the batch.
- * The batch must have previously been initialised using folio_batch_init().
- *
- * Return: The number of slots still available.
- */
-static inline unsigned folio_batch_add(struct folio_batch *fbatch,
- struct folio *folio)
-{
- fbatch->folios[fbatch->nr++] = folio;
- return folio_batch_space(fbatch);
-}
+unsigned int folio_batch_add(struct folio_batch *fbatch,
+ struct folio *folio);
/**
* folio_batch_next - Return the next folio to process.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index db46b25a65ae..8244475c2efe 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,6 +401,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_cpu_zone(struct zone *zone);
extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
void folio_deactivate(struct folio *folio);
void folio_mark_lazyfree(struct folio *folio);
extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index e246287de9fa..553889f4e9be 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,8 @@
*
*/
+#include <linux/swap.h>
+
enum hk_flags {
HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
@@ -253,6 +255,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
#if defined(CONFIG_NO_HZ_FULL)
static void isolated_task_work(struct callback_head *head)
{
+ lru_add_and_bh_lrus_drain();
}
int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index 77b2d5997873..99a1b7b81e86 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
#include "internal.h"
@@ -155,6 +156,29 @@ static void lru_add(struct lruvec *lruvec, struct folio *folio)
trace_mm_lru_insertion(folio);
}
+/**
+ * folio_batch_add() - Add a folio to a batch.
+ * @fbatch: The folio batch.
+ * @folio: The folio to add.
+ *
+ * The folio is added to the end of the batch.
+ * The batch must have previously been initialised using folio_batch_init().
+ *
+ * Return: The number of slots still available.
+ */
+unsigned int folio_batch_add(struct folio_batch *fbatch,
+ struct folio *folio)
+{
+ unsigned int ret;
+
+ fbatch->folios[fbatch->nr++] = folio;
+ ret = folio_batch_space(fbatch);
+ isolated_task_work_queue();
+
+ return ret;
+}
+EXPORT_SYMBOL(folio_batch_add);
+
static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
int i;
@@ -738,7 +762,7 @@ void lru_add_drain(void)
* the same cpu. It shouldn't be a problem in !SMP case since
* the core is only one and the locks will disable preemption.
*/
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
{
local_lock(&cpu_fbatches.lock);
lru_add_drain_cpu(smp_processor_id());
@@ -864,6 +888,10 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+ /* Isolated CPUs handle their cache upon return to userspace */
+ if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+ continue;
+
if (cpu_needs_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
--
2.48.1
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-04-10 15:23 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-04-11 10:25 ` Oleg Nesterov
2025-04-11 22:00 ` Frederic Weisbecker
2025-04-12 5:12 ` K Prateek Nayak
0 siblings, 2 replies; 17+ messages in thread
From: Oleg Nesterov @ 2025-04-11 10:25 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Vlastimil Babka, linux-mm
I know nothing about this code so I can't review, but let me
ask anyway...
On 04/10, Frederic Weisbecker wrote:
>
> +int __isolated_task_work_queue(void)
> +{
> + unsigned long flags;
> + int ret;
> +
> + if (current->flags & PF_KTHREAD)
> + return -EINVAL;
What about PF_USER_WORKER's ? IIUC, these (in fact kernel) threads
never return to userspace and never call task_work_run().
Or PF_IO_WORKER's, they too run only in kernel mode... But iirc they
do call task_work_run().
> + local_irq_save(flags);
> + if (task_work_queued(&current->nohz_full_work)) {
> + ret = 0;
> + goto out;
> + }
> +
> + ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
> +out:
> + local_irq_restore(flags);
> + return ret;
Hmm, why not
local_irq_save(flags);
if (task_work_queued(...))
ret = 0;
else
ret = task_work_add(...);
?
Oleg.
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-04-11 10:25 ` Oleg Nesterov
@ 2025-04-11 22:00 ` Frederic Weisbecker
2025-04-12 5:12 ` K Prateek Nayak
1 sibling, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-04-11 22:00 UTC (permalink / raw)
To: Oleg Nesterov
Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Vlastimil Babka, linux-mm
On Fri, Apr 11, 2025 at 12:25:56PM +0200, Oleg Nesterov wrote:
> I know nothing about this code so I can't review, but let me
> ask anyway...
>
> On 04/10, Frederic Weisbecker wrote:
> >
> > +int __isolated_task_work_queue(void)
> > +{
> > + unsigned long flags;
> > + int ret;
> > +
> > + if (current->flags & PF_KTHREAD)
> > + return -EINVAL;
>
> What about PF_USER_WORKER's ? IIUC, these (in fact kernel) threads
> never return to userspace and never call task_work_run().
Ah good catch! (though I'm having a hard time finding out what this is
about)...
>
> Or PF_IO_WORKER's, they too run only in kernel mode... But iirc they
> do call task_work_run().
At least I see a lot of task_work usage in io_uring, and there are some
explicit calls to task_work_run() there...
>
> > + local_irq_save(flags);
> > + if (task_work_queued(&current->nohz_full_work)) {
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > + ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
> > +out:
> > + local_irq_restore(flags);
> > + return ret;
>
> Hmm, why not
>
> local_irq_save(flags);
> if (task_work_queued(...))
> ret = 0;
> else
> ret = task_work_add(...);
Hehe, yes indeed!
Thanks!
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-04-11 10:25 ` Oleg Nesterov
2025-04-11 22:00 ` Frederic Weisbecker
@ 2025-04-12 5:12 ` K Prateek Nayak
1 sibling, 0 replies; 17+ messages in thread
From: K Prateek Nayak @ 2025-04-12 5:12 UTC (permalink / raw)
To: Oleg Nesterov, Frederic Weisbecker
Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Vlastimil Babka, linux-mm
On 4/11/2025 3:55 PM, Oleg Nesterov wrote:
>
>> + local_irq_save(flags);
>> + if (task_work_queued(&current->nohz_full_work)) {
>> + ret = 0;
>> + goto out;
>> + }
>> +
>> + ret = task_work_add(current, &current->nohz_full_work, TWA_RESUME);
>> +out:
>> + local_irq_restore(flags);
>> + return ret;
>
> Hmm, why not
>
> local_irq_save(flags);
> if (task_work_queued(...))
> ret = 0;
> else
> ret = task_work_add(...);
>
> ?
Or use guard() and save on flags and ret:
guard(irqsave)();
if (task_work_queued(...))
return 0;
return task_work_add(...);
--
Thanks and Regards,
Prateek
>
> Oleg.
>
>
* Re: [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
2025-04-10 15:23 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2025-04-23 18:40 ` Shrikanth Hegde
2025-07-01 12:17 ` Frederic Weisbecker
0 siblings, 1 reply; 17+ messages in thread
From: Shrikanth Hegde @ 2025-04-23 18:40 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
On 4/10/25 20:53, Frederic Weisbecker wrote:
> nohz_full is a feature that only fits into rare and very specific corner cases.
> Yet distros enable it by default and therefore the related fields are
> always reserved in the task struct.
>
> Those task fields are stored in the middle of cacheline hot places such
> as cputime accounting and context switch counting, which doesn't make
> any sense for a feature that is disabled most of the time.
>
> Move the nohz_full storage to colder places.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/sched.h | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..b5ce76db6d75 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1110,13 +1110,7 @@ struct task_struct {
> #endif
> u64 gtime;
> struct prev_cputime prev_cputime;
> -#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> - struct vtime vtime;
> -#endif
>
> -#ifdef CONFIG_NO_HZ_FULL
> - atomic_t tick_dep_mask;
> -#endif
> /* Context switch counts: */
> unsigned long nvcsw;
> unsigned long nivcsw;
> @@ -1438,6 +1432,14 @@ struct task_struct {
> struct task_delay_info *delays;
> #endif
>
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> + struct vtime vtime;
> +#endif
> +
> +#ifdef CONFIG_NO_HZ_FULL
> + atomic_t tick_dep_mask;
> +#endif
> +
> #ifdef CONFIG_FAULT_INJECTION
> int make_it_fail;
> unsigned int fail_nth;
>
Hi Frederic.
maybe move these nohz related fields into their own cacheline instead?
On PowerPC, where we have a 128-byte cacheline, I see
these fields are crossing a cache line boundary.
without patch:
/* XXX last struct has 4 bytes of padding */
struct vtime vtime; /* 2360 48 */
atomic_t tick_dep_mask; /* 2408 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int nvcsw; /* 2416 8 */
long unsigned int nivcsw; /* 2424 8 */
/* --- cacheline 19 boundary (2432 bytes) --- */
With patch:
struct vtime vtime; /* 3272 48 */
struct callback_head nohz_full_work; /* 3320 16 */
/* --- cacheline 26 boundary (3328 bytes) was 8 bytes ago --- */
atomic_t tick_dep_mask; /* 3336 4 */
* Re: [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
2025-04-23 18:40 ` Shrikanth Hegde
@ 2025-07-01 12:17 ` Frederic Weisbecker
0 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-07-01 12:17 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
On Thu, Apr 24, 2025 at 12:10:26AM +0530, Shrikanth Hegde wrote:
>
>
> On 4/10/25 20:53, Frederic Weisbecker wrote:
> > nohz_full is a feature that only fits into rare and very specific corner cases.
> > Yet distros enable it by default and therefore the related fields are
> > always reserved in the task struct.
> >
> > Those task fields are stored in the middle of cacheline hot places such
> > as cputime accounting and context switch counting, which doesn't make
> > any sense for a feature that is disabled most of the time.
> >
> > Move the nohz_full storage to colder places.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > include/linux/sched.h | 14 ++++++++------
> > 1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index f96ac1982893..b5ce76db6d75 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1110,13 +1110,7 @@ struct task_struct {
> > #endif
> > u64 gtime;
> > struct prev_cputime prev_cputime;
> > -#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> > - struct vtime vtime;
> > -#endif
> > -#ifdef CONFIG_NO_HZ_FULL
> > - atomic_t tick_dep_mask;
> > -#endif
> > /* Context switch counts: */
> > unsigned long nvcsw;
> > unsigned long nivcsw;
> > @@ -1438,6 +1432,14 @@ struct task_struct {
> > struct task_delay_info *delays;
> > #endif
> > +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> > + struct vtime vtime;
> > +#endif
> > +
> > +#ifdef CONFIG_NO_HZ_FULL
> > + atomic_t tick_dep_mask;
> > +#endif
> > +
> > #ifdef CONFIG_FAULT_INJECTION
> > int make_it_fail;
> > unsigned int fail_nth;
> >
>
> Hi Frederic.
>
> maybe move these nohz related fields into their own cacheline instead?
>
>
> On PowerPC, where we have a 128-byte cacheline, I see
> these fields are crossing a cache line boundary.
>
> without patch:
> /* XXX last struct has 4 bytes of padding */
>
> struct vtime vtime; /* 2360 48 */
> atomic_t tick_dep_mask; /* 2408 4 */
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int nvcsw; /* 2416 8 */
> long unsigned int nivcsw; /* 2424 8 */
> /* --- cacheline 19 boundary (2432 bytes) --- */
>
>
> With patch:
> struct vtime vtime; /* 3272 48 */
> struct callback_head nohz_full_work; /* 3320 16 */
> /* --- cacheline 26 boundary (3328 bytes) was 8 bytes ago --- */
> atomic_t tick_dep_mask; /* 3336 4 */
>
It's not much of a big deal because those fields shouldn't be accessed
close together in time. Also such cache alignment is hard to maintain
everywhere when there is so much ifdeffery in that structure.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 0/6 v3] sched/mm: LRU drain flush on nohz_full
[not found] ` <20250412025831.4010-1-hdanton@sina.com>
@ 2025-07-01 12:36 ` Frederic Weisbecker
0 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-07-01 12:36 UTC (permalink / raw)
To: Hillf Danton
Cc: LKML, Oleg Nesterov, Michal Hocko, linux-mm, Marcelo Tosatti,
Vlastimil Babka, Andrew Morton
On Sat, Apr 12, 2025 at 10:58:22AM +0800, Hillf Danton wrote:
> On Thu, 10 Apr 2025 17:23:21 +0200 Frederic Weisbecker wrote
> > Hi,
> >
> > When LRUs are pending, the drain can be triggered remotely, whether the
> > remote CPU is running in userspace in nohz_full mode or not. This kind
> > of noise is expected to be caused by preparatory work before a task
> > runs isolated in userspace. This patchset is a proposal to flush that
> > before the task starts its critical work in userspace.
> >
> Alternatively add a syscall for workloads on isolated CPUs to flush
> this xxx and prepare that yyy before entering the critical work, instead
> of adding random (nice) patches today and next month. Even prctl can
> do lru_add_and_bh_lrus_drain(), and prctl looks preferable to
> adding a syscall.
In an ideal world, this is indeed what we should do now. And I still hope we
can do this in the future.
The problem is that this has been tried in the past and the work was never
finished because that syscall eventually didn't meet the needs of the people
working on it.
I would volunteer to start a simple prctl to do such a flush, something that
can be extended in the future, should the need arise, but such a new ABI must
be thought through along with the CPU isolation community.
Unfortunately there is no such stable CPU isolation community. Many developers
in this area contribute changes and then switch to other things. As for the CPU
isolation users, they are usually very quiet.
I can't anticipate all the possible use cases myself and therefore I would
easily get it wrong.
In the meantime, the patchset here is a proposal that hopefully should work
for many use cases.
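To make that concrete, the kernel side of such a prctl() flush could be
as simple as the sketch below. This is purely illustrative and not part
of this series: the option name and the exact semantics (what else to
flush, what to return on housekeeping CPUs) are precisely what would
need to be discussed.

    /* Hypothetical handler, dispatched from the prctl() switch in kernel/sys.c */
    static int hypothetical_isolation_flush(unsigned long arg)
    {
            if (arg)
                    return -EINVAL;

            /* Synchronously drain this CPU's LRU and buffer_head batches */
            lru_add_and_bh_lrus_drain();
            return 0;
    }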
Thanks.
--
Frederic Weisbecker
SUSE Labs
* [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-07-03 14:07 [PATCH 0/6 v4] " Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
2025-07-17 17:29 ` Vlastimil Babka
2025-07-18 9:52 ` Valentin Schneider
0 siblings, 2 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, linux-mm
Some asynchronous kernel work may be pending upon resume to userspace
and execute later on. On isolated workloads this becomes problematic once
the process is done with preparatory work involving syscalls and wants
to run in userspace without being interrupted.
Provide an infrastructure to queue a work to be executed from the current
isolated task context right before resuming to userspace. This goes with
the assumption that isolated tasks are pinned to a single nohz_full CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched.h | 4 ++++
include/linux/sched/isolation.h | 17 +++++++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/isolation.c | 23 +++++++++++++++++++++++
kernel/sched/sched.h | 1 +
kernel/time/Kconfig | 12 ++++++++++++
6 files changed, 58 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 117aa20b8fb6..931065b5744f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1448,6 +1448,10 @@ struct task_struct {
atomic_t tick_dep_mask;
#endif
+#ifdef CONFIG_NO_HZ_FULL_WORK
+ struct callback_head nohz_full_work;
+#endif
+
#ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
unsigned int fail_nth;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..9481b7d152c9 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
cpuset_cpu_is_isolated(cpu);
}
+#if defined(CONFIG_NO_HZ_FULL_WORK)
+extern int __isolated_task_work_queue(void);
+
+static inline int isolated_task_work_queue(void)
+{
+ if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+ return -ENOTSUPP;
+
+ return __isolated_task_work_queue();
+}
+
+extern void isolated_task_work_init(struct task_struct *tsk);
+#else
+static inline int isolated_task_work_queue(void) { return -ENOTSUPP; }
+static inline void isolated_task_work_init(struct task_struct *tsk) { }
+#endif /* CONFIG_NO_HZ_FULL_WORK */
+
#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 35783a486c28..eca8242bd81d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4538,6 +4538,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->migration_pending = NULL;
#endif
init_sched_mm_cid(p);
+ isolated_task_work_init(p);
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 93b038d48900..d74c4ef91ce2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,3 +249,26 @@ static int __init housekeeping_isolcpus_setup(char *str)
return housekeeping_setup(str, flags);
}
__setup("isolcpus=", housekeeping_isolcpus_setup);
+
+#ifdef CONFIG_NO_HZ_FULL_WORK
+static void isolated_task_work(struct callback_head *head)
+{
+}
+
+int __isolated_task_work_queue(void)
+{
+ if (current->flags & (PF_KTHREAD | PF_USER_WORKER | PF_IO_WORKER))
+ return -EINVAL;
+
+ guard(irqsave)();
+ if (task_work_queued(&current->nohz_full_work))
+ return 0;
+
+ return task_work_add(current, &current->nohz_full_work, TWA_RESUME);
+}
+
+void isolated_task_work_init(struct task_struct *tsk)
+{
+ init_task_work(&tsk->nohz_full_work, isolated_task_work);
+}
+#endif /* CONFIG_NO_HZ_FULL_WORK */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..50e0cada1e1b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -60,6 +60,7 @@
#include <linux/stop_machine.h>
#include <linux/syscalls_api.h>
#include <linux/syscalls.h>
+#include <linux/task_work.h>
#include <linux/tick.h>
#include <linux/topology.h>
#include <linux/types.h>
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6..34591fc50ab1 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -146,6 +146,18 @@ config NO_HZ_FULL
endchoice
+config NO_HZ_FULL_WORK
+ bool "Full dynticks work flush on kernel exit"
+ depends on NO_HZ_FULL
+ help
+ Selectively flush pending asynchronous kernel work upon user exit.
+ Assuming userspace is not performing any critical isolated work while
+ issuing syscalls, some per-CPU kernel works are flushed before resuming
+ to userspace so that they don't get remotely queued later when the CPU
+ doesn't want to be disturbed.
+
+ If in doubt say N.
+
config CONTEXT_TRACKING_USER
bool
depends on HAVE_CONTEXT_TRACKING_USER
--
2.48.1
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-07-17 17:29 ` Vlastimil Babka
2025-07-18 9:52 ` Valentin Schneider
1 sibling, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2025-07-17 17:29 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, linux-mm
On 7/3/25 16:07, Frederic Weisbecker wrote:
> Some asynchronous kernel work may be pending upon resume to userspace
> and execute later on. On isolated workloads this becomes problematic once
> the process is done with preparatory work involving syscalls and wants
> to run in userspace without being interrupted.
>
> Provide an infrastructure to queue a work to be executed from the current
> isolated task context right before resuming to userspace. This goes with
> the assumption that isolated tasks are pinned to a single nohz_full CPU.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
I'm wondering if this really needs a new config option, or whether we're
being unnecessarily defensive and the existing NO_HZ_FULL should be enough?
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
2025-07-17 17:29 ` Vlastimil Babka
@ 2025-07-18 9:52 ` Valentin Schneider
2025-07-18 14:23 ` Frederic Weisbecker
1 sibling, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2025-07-18 9:52 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
Vlastimil Babka, linux-mm
On 03/07/25 16:07, Frederic Weisbecker wrote:
> @@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
> cpuset_cpu_is_isolated(cpu);
> }
>
> +#if defined(CONFIG_NO_HZ_FULL_WORK)
> +extern int __isolated_task_work_queue(void);
> +
> +static inline int isolated_task_work_queue(void)
> +{
> + if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
> + return -ENOTSUPP;
> +
Am I being dense, or is this condition the opposite of what we want? That is,
AIUI we want isolated_task_work() to run on NOHZ_FULL/isolated CPUs'
resume-to-userspace path, so this should bail if the current CPU
*is* a housekeeping CPU.
> + return __isolated_task_work_queue();
> +}
> +
* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
2025-07-18 9:52 ` Valentin Schneider
@ 2025-07-18 14:23 ` Frederic Weisbecker
0 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2025-07-18 14:23 UTC (permalink / raw)
To: Valentin Schneider
Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka,
linux-mm
On Fri, Jul 18, 2025 at 11:52:01AM +0200, Valentin Schneider wrote:
> On 03/07/25 16:07, Frederic Weisbecker wrote:
> > @@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
> > cpuset_cpu_is_isolated(cpu);
> > }
> >
> > +#if defined(CONFIG_NO_HZ_FULL_WORK)
> > +extern int __isolated_task_work_queue(void);
> > +
> > +static inline int isolated_task_work_queue(void)
> > +{
> > + if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
> > + return -ENOTSUPP;
> > +
>
> Am I being dense or this condition the opposite of what we want? That is,
> AIUI we want isolated_task_work() to run on NOHZ_FULL/isolated CPUs'
> resume-to-userspace path, so this should bail if the current CPU
> *is* a housekeeping CPU.
Geeze!
>
> > + return __isolated_task_work_queue();
> > +}
> > +
>
--
Frederic Weisbecker
SUSE Labs