linux-mm.kvack.org archive mirror
* [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full
@ 2025-07-03 14:07 Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Oleg Nesterov, Peter Zijlstra,
	Valentin Schneider, Thomas Gleixner, Michal Hocko, linux-mm,
	Ingo Molnar, Marcelo Tosatti, Vlastimil Babka, Andrew Morton

When LRU batches are pending, the drain can be triggered remotely, whether
the remote CPU is running in userspace in nohz_full mode or not. This kind
of noise is expected to be caused by preparatory work before a task runs
isolated in userspace. This patchset is a proposal to flush those pending
batches before the task starts its critical work in userspace.
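
For illustration only (not part of the series): a minimal userspace sketch
of the kind of workload being targeted. The CPU number and the specific
mmap()/mlockall() preparation are assumptions; any setup syscall that
leaves folios in the per-CPU LRU batches would do.

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>

int main(void)
{
	cpu_set_t set;
	void *buf;

	/* Pin to an isolated nohz_full CPU (CPU 3 here is an assumption). */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		return 1;

	/*
	 * Preparatory work: these syscalls may leave folios sitting in this
	 * CPU's per-CPU LRU batches.
	 */
	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	mlockall(MCL_CURRENT);

	/*
	 * Critical isolated loop. With this series, the pending batches are
	 * drained on the last resume to userspace, so no remote drain worker
	 * should preempt the task here.
	 */
	for (;;)
		;
}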

Changes since v3:

* Apply review from Oleg and K Prateek Nayak (handle io_uring kthreads
  and use guard)
  
* Confine this into a new CONFIG_NO_HZ_FULL_WORK because it is still
  experimental.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	task/work-v4

HEAD: 87896fa0dc36b421533c9dc85dd32b61eaff887b

Thanks,
	Frederic
---

Frederic Weisbecker (6):
      task_work: Provide means to check if a work is queued
      sched/fair: Use task_work_queued() on numa_work
      sched: Use task_work_queued() on cid_work
      tick/nohz: Move nohz_full related fields out of hot task struct's places
      sched/isolation: Introduce isolated task work
      mm: Drain LRUs upon resume to userspace on nohz_full CPUs


 include/linux/pagevec.h         | 18 ++----------------
 include/linux/sched.h           | 18 ++++++++++++------
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 include/linux/swap.h            |  1 +
 include/linux/task_work.h       | 12 ++++++++++++
 kernel/sched/core.c             |  6 ++----
 kernel/sched/fair.c             |  5 +----
 kernel/sched/isolation.c        | 26 ++++++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 kernel/task_work.c              |  9 +++++++--
 kernel/time/Kconfig             | 12 ++++++++++++
 mm/swap.c                       | 30 +++++++++++++++++++++++++++++-
 12 files changed, 122 insertions(+), 33 deletions(-)



* [PATCH 1/6] task_work: Provide means to check if a work is queued
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

Some task work users implement their own ways to know if a callback is
already queued on the current task while fiddling with the callback
head internals.

Provide instead a consolidated API to serve this very purpose.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/task_work.h | 12 ++++++++++++
 kernel/task_work.c        |  9 +++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 0646804860ff..31caf12c1313 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -5,12 +5,15 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 
+#define TASK_WORK_DEQUEUED	((void *) -1UL)
+
 typedef void (*task_work_func_t)(struct callback_head *);
 
 static inline void
 init_task_work(struct callback_head *twork, task_work_func_t func)
 {
 	twork->func = func;
+	twork->next = TASK_WORK_DEQUEUED;
 }
 
 enum task_work_notify_mode {
@@ -26,6 +29,15 @@ static inline bool task_work_pending(struct task_struct *task)
 	return READ_ONCE(task->task_works);
 }
 
+/*
+ * Check if a work is queued. Beware: this is inherently racy if the work can
+ * be queued elsewhere than the current task.
+ */
+static inline bool task_work_queued(struct callback_head *twork)
+{
+	return twork->next != TASK_WORK_DEQUEUED;
+}
+
 int task_work_add(struct task_struct *task, struct callback_head *twork,
 			enum task_work_notify_mode mode);
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index d1efec571a4a..56718cb824d9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -67,8 +67,10 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	head = READ_ONCE(task->task_works);
 	do {
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &work_exited)) {
+			work->next = TASK_WORK_DEQUEUED;
 			return -ESRCH;
+		}
 		work->next = head;
 	} while (!try_cmpxchg(&task->task_works, &head, work));
 
@@ -129,8 +131,10 @@ task_work_cancel_match(struct task_struct *task,
 		if (!match(work, data)) {
 			pprev = &work->next;
 			work = READ_ONCE(*pprev);
-		} else if (try_cmpxchg(pprev, &work, work->next))
+		} else if (try_cmpxchg(pprev, &work, work->next)) {
+			work->next = TASK_WORK_DEQUEUED;
 			break;
+		}
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 
@@ -224,6 +228,7 @@ void task_work_run(void)
 
 		do {
 			next = work->next;
+			work->next = TASK_WORK_DEQUEUED;
 			work->func(work);
 			work = next;
 			cond_resched();
-- 
2.48.1
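
A hedged sketch of the caller pattern this helper consolidates; foo_work
and foo_work_fn() are hypothetical names, not from this series, and the
work is assumed to have been set up with init_task_work() at fork time.

#include <linux/sched.h>
#include <linux/task_work.h>

/* Hypothetical deferred callback, executed on return to userspace. */
static void foo_work_fn(struct callback_head *head)
{
	/* ... deferred per-task processing ... */
}

/*
 * Called from a hot path such as the scheduler tick: queue at most one
 * instance, instead of open-coding a "work->next != work" style check.
 */
static void foo_maybe_queue(struct task_struct *p)
{
	if (task_work_queued(&p->foo_work))
		return;

	task_work_add(p, &p->foo_work, TWA_RESUME);
}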




* [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

Remove the ad-hoc implementation of task_work_queued().

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..b350b0f4e7a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3318,7 +3318,6 @@ static void task_numa_work(struct callback_head *work)
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -3575,8 +3574,6 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
 	p->numa_migrate_retry		= 0;
-	/* Protect against double add, see task_tick_numa and task_numa_work */
-	p->numa_work.next		= &p->numa_work;
 	p->numa_faults			= NULL;
 	p->numa_pages_migrated		= 0;
 	p->total_numa_faults		= 0;
@@ -3617,7 +3614,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 
 	/*
-- 
2.48.1




* [PATCH 3/6] sched: Use task_work_queued() on cid_work
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 1/6] task_work: Provide means to check if a work is queued Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 2/6] sched/fair: Use task_work_queued() on numa_work Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-17 16:32   ` Valentin Schneider
  2025-07-03 14:07 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

Remove the ad-hoc implementation of task_work_queued()

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a3..35783a486c28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10599,7 +10599,6 @@ static void task_mm_cid_work(struct callback_head *work)
 
 	WARN_ON_ONCE(t != container_of(work, struct task_struct, cid_work));
 
-	work->next = work;	/* Prevent double-add */
 	if (t->flags & PF_EXITING)
 		return;
 	mm = t->mm;
@@ -10643,7 +10642,6 @@ void init_sched_mm_cid(struct task_struct *t)
 		if (mm_users == 1)
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
 	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
@@ -10652,8 +10650,7 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 	struct callback_head *work = &curr->cid_work;
 	unsigned long now = jiffies;
 
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || task_work_queued(work))
 		return;
 	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
 		return;
-- 
2.48.1




* [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2025-07-03 14:07 ` [PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-17 16:32   ` Valentin Schneider
  2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
  2025-07-03 14:07 ` [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

nohz_full is a feature that only fits rare and very specific corner cases.
Yet distros enable it by default and therefore the related fields are
always reserved in the task struct.

Those task fields are stored in the middle of cacheline hot places such
as cputime accounting and context switch counting, which doesn't make
any sense for a feature that is disabled most of the time.

Move the nohz_full storage to colder places.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..117aa20b8fb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1114,13 +1114,7 @@ struct task_struct {
 #endif
 	u64				gtime;
 	struct prev_cputime		prev_cputime;
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	struct vtime			vtime;
-#endif
 
-#ifdef CONFIG_NO_HZ_FULL
-	atomic_t			tick_dep_mask;
-#endif
 	/* Context switch counts: */
 	unsigned long			nvcsw;
 	unsigned long			nivcsw;
@@ -1446,6 +1440,14 @@ struct task_struct {
 	struct task_delay_info		*delays;
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+	struct vtime			vtime;
+#endif
+
+#ifdef CONFIG_NO_HZ_FULL
+	atomic_t			tick_dep_mask;
+#endif
+
 #ifdef CONFIG_FAULT_INJECTION
 	int				make_it_fail;
 	unsigned int			fail_nth;
-- 
2.48.1
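
The resulting placement can be double-checked with pahole, assuming a
vmlinux built with debug info (the exact offsets depend on the config):

	pahole -C task_struct vmlinux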




* [PATCH 5/6] sched/isolation: Introduce isolated task work
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2025-07-03 14:07 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-17 17:29   ` Vlastimil Babka
  2025-07-18  9:52   ` Valentin Schneider
  2025-07-03 14:07 ` [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  5 siblings, 2 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

Some asynchronous kernel work may be pending upon resume to userspace
and execute later on. On isolated workloads this becomes problematic once
the process is done with its preparatory work involving syscalls and wants
to run in userspace without being interrupted.

Provide an infrastructure to queue a work to be executed from the current
isolated task context right before resuming to userspace. This goes with
the assumption that isolated tasks are pinned to a single nohz_full CPU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/sched.h           |  4 ++++
 include/linux/sched/isolation.h | 17 +++++++++++++++++
 kernel/sched/core.c             |  1 +
 kernel/sched/isolation.c        | 23 +++++++++++++++++++++++
 kernel/sched/sched.h            |  1 +
 kernel/time/Kconfig             | 12 ++++++++++++
 6 files changed, 58 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 117aa20b8fb6..931065b5744f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1448,6 +1448,10 @@ struct task_struct {
 	atomic_t			tick_dep_mask;
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL_WORK
+	struct callback_head		nohz_full_work;
+#endif
+
 #ifdef CONFIG_FAULT_INJECTION
 	int				make_it_fail;
 	unsigned int			fail_nth;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..9481b7d152c9 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
 	       cpuset_cpu_is_isolated(cpu);
 }
 
+#if defined(CONFIG_NO_HZ_FULL_WORK)
+extern int __isolated_task_work_queue(void);
+
+static inline int isolated_task_work_queue(void)
+{
+	if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+		return -ENOTSUPP;
+
+	return __isolated_task_work_queue();
+}
+
+extern void isolated_task_work_init(struct task_struct *tsk);
+#else
+static inline int isolated_task_work_queue(void) { return -ENOTSUPP; }
+static inline void isolated_task_work_init(struct task_struct *tsk) { }
+#endif /* CONFIG_NO_HZ_FULL_WORK */
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 35783a486c28..eca8242bd81d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4538,6 +4538,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	isolated_task_work_init(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 93b038d48900..d74c4ef91ce2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,3 +249,26 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	return housekeeping_setup(str, flags);
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);
+
+#ifdef CONFIG_NO_HZ_FULL_WORK
+static void isolated_task_work(struct callback_head *head)
+{
+}
+
+int __isolated_task_work_queue(void)
+{
+	if (current->flags & (PF_KTHREAD | PF_USER_WORKER | PF_IO_WORKER))
+		return -EINVAL;
+
+	guard(irqsave)();
+	if (task_work_queued(&current->nohz_full_work))
+		return 0;
+
+	return task_work_add(current, &current->nohz_full_work, TWA_RESUME);
+}
+
+void isolated_task_work_init(struct task_struct *tsk)
+{
+	init_task_work(&tsk->nohz_full_work, isolated_task_work);
+}
+#endif /* CONFIG_NO_HZ_FULL_WORK */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..50e0cada1e1b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -60,6 +60,7 @@
 #include <linux/stop_machine.h>
 #include <linux/syscalls_api.h>
 #include <linux/syscalls.h>
+#include <linux/task_work.h>
 #include <linux/tick.h>
 #include <linux/topology.h>
 #include <linux/types.h>
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6..34591fc50ab1 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -146,6 +146,18 @@ config NO_HZ_FULL
 
 endchoice
 
+config NO_HZ_FULL_WORK
+	bool "Full dynticks work flush on kernel exit"
+	depends on NO_HZ_FULL
+	help
+	 Selectively flush pending asynchronous kernel work upon user exit.
+	 Assuming userspace is not performing any critical isolated work while
+	 issuing syscalls, some per-CPU kernel works are flushed before resuming
+	 to userspace so that they don't get remotely queued later when the CPU
+	 doesn't want to be disturbed.
+
+	 If in doubt say N.
+
 config CONTEXT_TRACKING_USER
 	bool
 	depends on HAVE_CONTEXT_TRACKING_USER
-- 
2.48.1
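
A hedged sketch of how a subsystem is expected to plug into this,
mirroring what the next patch does for the LRU; bar_note_deferred_work()
and bar_flush_pending() are hypothetical names.

/*
 * In the subsystem's hot path, while still inside the syscall on an
 * isolated CPU: note that something will need flushing and request it on
 * the next resume to userspace, instead of letting another CPU queue a
 * kworker later.
 */
static void bar_note_deferred_work(void)
{
	isolated_task_work_queue();
}

/* In kernel/sched/isolation.c, the single resume-to-user callback: */
static void isolated_task_work(struct callback_head *head)
{
	bar_flush_pending();	/* subsystem-provided drain */
}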




* [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-07-03 14:07 [PATCH 0/6 v4] sched/mm: LRU drain flush on nohz_full Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-07-03 14:07 ` Frederic Weisbecker
  2025-07-03 14:24   ` Michal Hocko
  2025-07-03 14:28   ` Matthew Wilcox
  5 siblings, 2 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-03 14:07 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

LRU batching can be a source of disturbance for isolated workloads
running in userspace, because it requires a kernel worker to handle the
drain and that worker would preempt the said task. The primary source of
such disruption is __lru_add_drain_all(), which can be triggered from
non-isolated CPUs.

Why would an isolated CPU have anything in the per-CPU cache? Many
syscalls allocate pages that might end up there. A typical and
unavoidable one is fork/exec, which leaves pages behind in the cache,
just waiting for somebody to drain them.

Address the problem by noting that a batch has been added to the cache
and scheduling the drain upon return to userspace, so the work is done
while the syscall is still executing and there are no surprises while
the task runs in userspace, where it doesn't want to be preempted.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/pagevec.h  | 18 ++----------------
 include/linux/swap.h     |  1 +
 kernel/sched/isolation.c |  3 +++
 mm/swap.c                | 30 +++++++++++++++++++++++++++++-
 4 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 5d3a0cccc6bf..7e647b8df4c7 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -61,22 +61,8 @@ static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
 	return PAGEVEC_SIZE - fbatch->nr;
 }
 
-/**
- * folio_batch_add() - Add a folio to a batch.
- * @fbatch: The folio batch.
- * @folio: The folio to add.
- *
- * The folio is added to the end of the batch.
- * The batch must have previously been initialised using folio_batch_init().
- *
- * Return: The number of slots still available.
- */
-static inline unsigned folio_batch_add(struct folio_batch *fbatch,
-		struct folio *folio)
-{
-	fbatch->folios[fbatch->nr++] = folio;
-	return folio_batch_space(fbatch);
-}
+unsigned int folio_batch_add(struct folio_batch *fbatch,
+			     struct folio *folio);
 
 /**
  * folio_batch_next - Return the next folio to process.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..d74ad6c893a1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,6 +401,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index d74c4ef91ce2..06882916c24f 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,8 @@
  *
  */
 
+#include <linux/swap.h>
+
 enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
@@ -253,6 +255,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
 #ifdef CONFIG_NO_HZ_FULL_WORK
 static void isolated_task_work(struct callback_head *head)
 {
+	lru_add_and_bh_lrus_drain();
 }
 
 int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index 4fc322f7111a..da08c918cef4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -155,6 +156,29 @@ static void lru_add(struct lruvec *lruvec, struct folio *folio)
 	trace_mm_lru_insertion(folio);
 }
 
+/**
+ * folio_batch_add() - Add a folio to a batch.
+ * @fbatch: The folio batch.
+ * @folio: The folio to add.
+ *
+ * The folio is added to the end of the batch.
+ * The batch must have previously been initialised using folio_batch_init().
+ *
+ * Return: The number of slots still available.
+ */
+unsigned int folio_batch_add(struct folio_batch *fbatch,
+			     struct folio *folio)
+{
+	unsigned int ret;
+
+	fbatch->folios[fbatch->nr++] = folio;
+	ret = folio_batch_space(fbatch);
+	isolated_task_work_queue();
+
+	return ret;
+}
+EXPORT_SYMBOL(folio_batch_add);
+
 static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 {
 	int i;
@@ -738,7 +762,7 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
@@ -864,6 +888,10 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
+		/* Isolated CPUs handle their cache upon return to userspace */
+		if (IS_ENABLED(CONFIG_NO_HZ_FULL_WORK) && !housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+			continue;
+
 		if (cpu_needs_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
-- 
2.48.1




* Re: [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-07-03 14:07 ` [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
@ 2025-07-03 14:24   ` Michal Hocko
  2025-07-03 14:28   ` Matthew Wilcox
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2025-07-03 14:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Oleg Nesterov,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Vlastimil Babka, linux-mm

On Thu 03-07-25 16:07:17, Frederic Weisbecker wrote:
[...]
> @@ -864,6 +888,10 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> +		/* Isolated CPUs handle their cache upon return to userspace */
> +		if (IS_ENABLED(CONFIG_NO_HZ_FULL_WORK) && !housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +			continue;
> +

Two questions. Where do you actually queue the work to be executed on
the return to userspace? And why don't you do that only if
cpu_needs_drain()?

>  		if (cpu_needs_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			queue_work_on(cpu, mm_percpu_wq, work);
> -- 
> 2.48.1
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-07-03 14:07 ` [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs Frederic Weisbecker
  2025-07-03 14:24   ` Michal Hocko
@ 2025-07-03 14:28   ` Matthew Wilcox
  2025-07-03 16:12     ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2025-07-03 14:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
	Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

On Thu, Jul 03, 2025 at 04:07:17PM +0200, Frederic Weisbecker wrote:
> +unsigned int folio_batch_add(struct folio_batch *fbatch,
> +			     struct folio *folio)
> +{
> +	unsigned int ret;
> +
> +	fbatch->folios[fbatch->nr++] = folio;
> +	ret = folio_batch_space(fbatch);
> +	isolated_task_work_queue();

Umm.  LRUs use folio_batches, but they are definitely not the only user
of folio_batches.  Maybe you want to add a new lru_batch_add()
abstraction, because this call is definitely being done at the wrong
level.




* Re: [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-07-03 14:28   ` Matthew Wilcox
@ 2025-07-03 16:12     ` Michal Hocko
  2025-07-17 19:33       ` Vlastimil Babka
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2025-07-03 16:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Ingo Molnar,
	Marcelo Tosatti, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Vlastimil Babka, linux-mm

On Thu 03-07-25 15:28:23, Matthew Wilcox wrote:
> On Thu, Jul 03, 2025 at 04:07:17PM +0200, Frederic Weisbecker wrote:
> > +unsigned int folio_batch_add(struct folio_batch *fbatch,
> > +			     struct folio *folio)
> > +{
> > +	unsigned int ret;
> > +
> > +	fbatch->folios[fbatch->nr++] = folio;
> > +	ret = folio_batch_space(fbatch);
> > +	isolated_task_work_queue();
> 
> Umm.  LRUs use folio_batches, but they are definitely not the only user
> of folio_batches.  Maybe you want to add a new lru_batch_add()
> abstraction, because this call is definitely being done at the wrong
> level.

You have answered one of my questions in the other response. My initial
thought was that __lru_add_drain_all seems to be a better fit. But then
we have a problem that draining will become an unbounded operation, which
will become a problem for lru_cache_disable, which will never converge
until the isolated workload does the draining. So it indeed seems like we
need to queue the draining when a page is added. Are there other places
where we put folios into the folio_batch other than folio_batch_add? I
cannot seem to see any...

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 3/6] sched: Use task_work_queued() on cid_work
  2025-07-03 14:07 ` [PATCH 3/6] sched: Use task_work_queued() on cid_work Frederic Weisbecker
@ 2025-07-17 16:32   ` Valentin Schneider
  0 siblings, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2025-07-17 16:32 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm

On 03/07/25 16:07, Frederic Weisbecker wrote:
> Remove the ad-hoc implementation of task_work_queued()
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>

> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>




* Re: [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places
  2025-07-03 14:07 ` [PATCH 4/6] tick/nohz: Move nohz_full related fields out of hot task struct's places Frederic Weisbecker
@ 2025-07-17 16:32   ` Valentin Schneider
  0 siblings, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2025-07-17 16:32 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm

On 03/07/25 16:07, Frederic Weisbecker wrote:
> nohz_full is a feature that only fits rare and very specific corner cases.
> Yet distros enable it by default and therefore the related fields are
> always reserved in the task struct.
>
> Those task fields are stored in the middle of cacheline hot places such
> as cputime accounting and context switch counting, which doesn't make
> any sense for a feature that is disabled most of the time.
>
> Move the nohz_full storage to colder places.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

task_struct is still a bloody maze to me, but AFAICT there's *at least one*
full non-optional cacheline - the signal handling faff - between cputime
accounting and this new location, so:

Reviewed-by: Valentin Schneider <vschneid@redhat.com>




* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
  2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
@ 2025-07-17 17:29   ` Vlastimil Babka
  2025-07-18  9:52   ` Valentin Schneider
  1 sibling, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-07-17 17:29 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
	Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, linux-mm

On 7/3/25 16:07, Frederic Weisbecker wrote:
> Some asynchronous kernel work may be pending upon resume to userspace
> and execute later on. On isolated workloads this becomes problematic once
> the process is done with its preparatory work involving syscalls and wants
> to run in userspace without being interrupted.
> 
> Provide an infrastructure to queue a work to be executed from the current
> isolated task context right before resuming to userspace. This goes with
> the assumption that isolated tasks are pinned to a single nohz_full CPU.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

I'm wondering if this really needs a new config option, or whether we're
being unnecessarily defensive and the existing NO_HZ_FULL should be enough?





* Re: [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
  2025-07-03 16:12     ` Michal Hocko
@ 2025-07-17 19:33       ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-07-17 19:33 UTC (permalink / raw)
  To: Frederic Weisbecker, Michal Hocko, Matthew Wilcox
  Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Oleg Nesterov,
	Peter Zijlstra, Thomas Gleixner, Valentin Schneider, linux-mm

On 7/3/25 18:12, Michal Hocko wrote:
> On Thu 03-07-25 15:28:23, Matthew Wilcox wrote:
>> On Thu, Jul 03, 2025 at 04:07:17PM +0200, Frederic Weisbecker wrote:
>> > +unsigned int folio_batch_add(struct folio_batch *fbatch,
>> > +			     struct folio *folio)
>> > +{
>> > +	unsigned int ret;
>> > +
>> > +	fbatch->folios[fbatch->nr++] = folio;
>> > +	ret = folio_batch_space(fbatch);
>> > +	isolated_task_work_queue();
>> 
>> Umm.  LRUs use folio_batches, but they are definitely not the only user
>> of folio_batches.  Maybe you want to add a new lru_batch_add()
>> abstraction, because this call is definitely being done at the wrong
>> level.
> 
> You have answered one of my questions in the other response. My initial
> thought was that __lru_add_drain_all seems to be a better fit. But then

__lru_add_drain_all() is the part where we queue the drain work to other
CPUs. In order not to disrupt isolated CPUs, they have to self-schedule the
work to be done on resume, which indeed means at the moment they are
filling the batches to be drained, as this patch does.

> we have a problem that draining will become an unbounded operation, which
> will become a problem for lru_cache_disable, which will never converge
> until the isolated workload does the draining. So it indeed seems like we
> need to queue the draining when a page is added. Are there other places
> where we put folios into the folio_batch other than folio_batch_add? I
> cannot seem to see any...

The problem Matthew points out isn't that there would be other places that
add folios to an fbatch, other than folio_batch_add(). The problem is that
many of the folio_batch_add() callers add something to a temporary batch on
the stack, which is not subject to lru draining, so it's wasteful to queue
the draining for those.

The below diff should address this by creating folio_batch_add_lru(), which
is only used in the right places that do fill an lru-drainable folio batch.
It also makes the function static inline again, in mm/internal.h, which
means adding some includes to it. For that I had to fix up some wrong
include placements of internal.h in various mm files - these includes
should not appear below "#define CREATE_TRACE_POINTS".

Frederic, feel free to fold this in your patch with my Co-developed-by:

I also noted that since this now handles invalidate_bh_lrus_cpu(), maybe we
can drop the cpu_is_isolated() check from bh_lru_install(). I'm not even
sure whether that check is equivalent to, or stronger than, the check used
in isolated_task_work_queue().

I have a bit of remaining worry about __lru_add_drain_all() simply skipping
the isolated CPUs, because then it doesn't wait for the flushing to be
finished via flush_work(). It can thus return before the isolated CPU does
its drain-on-resume, which might violate some expectations of
lru_cache_disable(). Is there a possibility to make it wait for the
drain-on-resume, or to reliably determine that there is none pending?
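
Roughly, the relevant part of __lru_add_drain_all() in mm/swap.c, with this
series applied (simplified; locking and the force_all_cpus handling elided):

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* Nothing is ever queued for isolated CPUs... */
		if (IS_ENABLED(CONFIG_NO_HZ_FULL_WORK) &&
		    !housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
			continue;

		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
			__cpumask_set_cpu(cpu, &has_work);
		}
	}

	/* ...so they never land in has_work and nothing waits for them here. */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));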

----8<----
From eb6fb0b60a2ede567539e4be071cb92ff5ec221a Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 17 Jul 2025 21:05:48 +0200
Subject: [PATCH] mm: introduce folio_batch_add_lru

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/pagevec.h | 18 ++++++++++++++++--
 include/linux/swap.h    |  2 +-
 mm/internal.h           | 19 +++++++++++++++++--
 mm/mlock.c              |  6 +++---
 mm/mmap.c               |  4 ++--
 mm/percpu-vm.c          |  2 --
 mm/percpu.c             |  5 +++--
 mm/rmap.c               |  4 ++--
 mm/swap.c               | 25 +------------------------
 mm/vmalloc.c            |  6 +++---
 10 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 7e647b8df4c7..5d3a0cccc6bf 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -61,8 +61,22 @@ static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
 	return PAGEVEC_SIZE - fbatch->nr;
 }
 
-unsigned int folio_batch_add(struct folio_batch *fbatch,
-			     struct folio *folio);
+/**
+ * folio_batch_add() - Add a folio to a batch.
+ * @fbatch: The folio batch.
+ * @folio: The folio to add.
+ *
+ * The folio is added to the end of the batch.
+ * The batch must have previously been initialised using folio_batch_init().
+ *
+ * Return: The number of slots still available.
+ */
+static inline unsigned folio_batch_add(struct folio_batch *fbatch,
+		struct folio *folio)
+{
+	fbatch->folios[fbatch->nr++] = folio;
+	return folio_batch_space(fbatch);
+}
 
 /**
  * folio_batch_next - Return the next folio to process.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d74ad6c893a1..0bede5cd4f61 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,7 +401,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
-extern void lru_add_and_bh_lrus_drain(void);
+void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..c2848819ce5f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -18,12 +18,12 @@
 #include <linux/swapops.h>
 #include <linux/swap_cgroup.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/pagevec.h>
+#include <linux/sched/isolation.h>
 
 /* Internal core VMA manipulation functions. */
 #include "vma.h"
 
-struct folio_batch;
-
 /*
  * Maintains state across a page table move. The operation assumes both source
  * and destination VMAs already exist and are specified by the user.
@@ -414,6 +414,21 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 	return ret;
 }
 
+/*
+ * A version of folio_batch_add() to use with batches that are drained from
+ * lru_add_drain() and checked in cpu_needs_drain()
+ */
+static inline unsigned int
+folio_batch_add_lru(struct folio_batch *fbatch, struct folio *folio)
+{
+	/*
+	 * We could perhaps not queue when returning 0 which means the caller
+	 * should flush the batch immediately, but that's rare anyway.
+	 */
+	isolated_task_work_queue();
+	return folio_batch_add(fbatch, folio);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 void folio_rotate_reclaimable(struct folio *folio);
 bool __folio_end_writeback(struct folio *folio);
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd..a95f248efdf2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -254,7 +254,7 @@ void mlock_folio(struct folio *folio)
 	}
 
 	folio_get(folio);
-	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
+	if (!folio_batch_add_lru(fbatch, mlock_lru(folio)) ||
 	    folio_test_large(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
@@ -277,7 +277,7 @@ void mlock_new_folio(struct folio *folio)
 	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 
 	folio_get(folio);
-	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
+	if (!folio_batch_add_lru(fbatch, mlock_new(folio)) ||
 	    folio_test_large(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
@@ -298,7 +298,7 @@ void munlock_folio(struct folio *folio)
 	 * which will check whether the folio is multiply mlocked.
 	 */
 	folio_get(folio);
-	if (!folio_batch_add(fbatch, folio) ||
+	if (!folio_batch_add_lru(fbatch, folio) ||
 	    folio_test_large(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
diff --git a/mm/mmap.c b/mm/mmap.c
index 09c563c95112..4d9a5d8616c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -54,11 +54,11 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#include "internal.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/mmap.h>
 
-#include "internal.h"
-
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index cd69caf6aa8d..cd3aa1610294 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -8,8 +8,6 @@
  * Chunks are mapped into vmalloc areas and populated page by page.
  * This is the default chunk allocator.
  */
-#include "internal.h"
-
 static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
 				    unsigned int cpu, int page_idx)
 {
diff --git a/mm/percpu.c b/mm/percpu.c
index b35494c8ede2..b8c34a931205 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -93,11 +93,12 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
+#include "internal.h"
+#include "percpu-internal.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/percpu.h>
 
-#include "percpu-internal.h"
-
 /*
  * The slots are sorted by the size of the biggest continuous free area.
  * 1-31 bytes share the same slot.
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..a4eab1664e25 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -78,12 +78,12 @@
 
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/tlb.h>
 #include <trace/events/migrate.h>
 
-#include "internal.h"
-
 static struct kmem_cache *anon_vma_cachep;
 static struct kmem_cache *anon_vma_chain_cachep;
 
diff --git a/mm/swap.c b/mm/swap.c
index da08c918cef4..98897171adaf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -156,29 +156,6 @@ static void lru_add(struct lruvec *lruvec, struct folio *folio)
 	trace_mm_lru_insertion(folio);
 }
 
-/**
- * folio_batch_add() - Add a folio to a batch.
- * @fbatch: The folio batch.
- * @folio: The folio to add.
- *
- * The folio is added to the end of the batch.
- * The batch must have previously been initialised using folio_batch_init().
- *
- * Return: The number of slots still available.
- */
-unsigned int folio_batch_add(struct folio_batch *fbatch,
-			     struct folio *folio)
-{
-	unsigned int ret;
-
-	fbatch->folios[fbatch->nr++] = folio;
-	ret = folio_batch_space(fbatch);
-	isolated_task_work_queue();
-
-	return ret;
-}
-EXPORT_SYMBOL(folio_batch_add);
-
 static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 {
 	int i;
@@ -215,7 +192,7 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 	else
 		local_lock(&cpu_fbatches.lock);
 
-	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
+	if (!folio_batch_add_lru(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
 	    lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ab986dd09b6a..b3a28e353b7e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -44,12 +44,12 @@
 #include <asm/shmparam.h>
 #include <linux/page_owner.h>
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/vmalloc.h>
-
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmalloc.h>
+
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 static unsigned int __ro_after_init ioremap_max_page_shift = BITS_PER_LONG - 1;
 
-- 
2.50.1


 




* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
  2025-07-03 14:07 ` [PATCH 5/6] sched/isolation: Introduce isolated task work Frederic Weisbecker
  2025-07-17 17:29   ` Vlastimil Babka
@ 2025-07-18  9:52   ` Valentin Schneider
  2025-07-18 14:23     ` Frederic Weisbecker
  1 sibling, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2025-07-18  9:52 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marcelo Tosatti,
	Michal Hocko, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm

On 03/07/25 16:07, Frederic Weisbecker wrote:
> @@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
>              cpuset_cpu_is_isolated(cpu);
>  }
>
> +#if defined(CONFIG_NO_HZ_FULL_WORK)
> +extern int __isolated_task_work_queue(void);
> +
> +static inline int isolated_task_work_queue(void)
> +{
> +	if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
> +		return -ENOTSUPP;
> +

Am I being dense, or is this condition the opposite of what we want? That is,
AIUI we want isolated_task_work() to run on NOHZ_FULL/isolated CPUs'
resume-to-userspace path, so this should bail if the current CPU
*is* a housekeeping CPU.

> +	return __isolated_task_work_queue();
> +}
> +
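
A minimal sketch of the corrected check, assuming the intent described
above (queue only when actually running on an isolated CPU):

	static inline int isolated_task_work_queue(void)
	{
		/* Bail on housekeeping CPUs; only nohz_full CPUs need this. */
		if (housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
			return -ENOTSUPP;

		return __isolated_task_work_queue();
	}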




* Re: [PATCH 5/6] sched/isolation: Introduce isolated task work
  2025-07-18  9:52   ` Valentin Schneider
@ 2025-07-18 14:23     ` Frederic Weisbecker
  0 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2025-07-18 14:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: LKML, Andrew Morton, Ingo Molnar, Marcelo Tosatti, Michal Hocko,
	Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka,
	linux-mm

Le Fri, Jul 18, 2025 at 11:52:01AM +0200, Valentin Schneider a écrit :
> On 03/07/25 16:07, Frederic Weisbecker wrote:
> > @@ -77,4 +77,21 @@ static inline bool cpu_is_isolated(int cpu)
> >              cpuset_cpu_is_isolated(cpu);
> >  }
> >
> > +#if defined(CONFIG_NO_HZ_FULL_WORK)
> > +extern int __isolated_task_work_queue(void);
> > +
> > +static inline int isolated_task_work_queue(void)
> > +{
> > +	if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_KERNEL_NOISE))
> > +		return -ENOTSUPP;
> > +
> 
> Am I being dense, or is this condition the opposite of what we want? That is,
> AIUI we want isolated_task_work() to run on NOHZ_FULL/isolated CPUs'
> resume-to-userspace path, so this should bail if the current CPU
> *is* a housekeeping CPU.

Geeze!

> 
> > +	return __isolated_task_work_queue();
> > +}
> > +
> 

-- 
Frederic Weisbecker
SUSE Labs


