linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] Workqueue: mm: replace use of system_wq and add WQ_PERCPU to alloc_workqueue users
@ 2025-08-15  9:48 Marco Crivellari
  2025-08-15  9:48 ` [PATCH 1/2] Workqueue: mm: replace use of system_wq with system_percpu_wq Marco Crivellari
  2025-08-15  9:48 ` [PATCH 2/2] Workqueue: mm: WQ_PERCPU added to alloc_workqueue users Marco Crivellari
  0 siblings, 2 replies; 3+ messages in thread
From: Marco Crivellari @ 2025-08-15  9:48 UTC
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Thomas Gleixner, Frederic Weisbecker,
	Sebastian Andrzej Siewior, Marco Crivellari, Michal Hocko,
	Andrew Morton

Hello!

Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:

        "workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."
        https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/

=== Current situation: problems ===

Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, while for !WQ_UNBOUND the local CPU is selected.

This leads to different behavior when a work item is scheduled on an isolated
CPU, depending on whether the "delay" value is 0 or greater than 0:

		schedule_delayed_work(, 0);

This will be handled by __queue_work(), which will queue the work item on the
current local (isolated) CPU, while:

		schedule_delayed_work(, 1);

will move the timer to a housekeeping CPU and schedule the work there.
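
As a rough illustration only, the two cases above can be modeled in plain
userspace C. This is not kernel code: struct toy_system and
toy_delayed_work_cpu() are hypothetical names invented for this sketch, and
the model only encodes the behavior described in this summary.

```c
/* Toy userspace model of the behavior described above; not kernel code. */
struct toy_system {
	int local_cpu;        /* CPU the caller runs on (isolated in this scenario) */
	int housekeeping_cpu; /* some CPU taken from the housekeeping set */
};

/*
 * Return the CPU on which a delayed work item ends up running in this model:
 * with delay == 0 it is queued directly on the local CPU; with delay > 0 the
 * timer is moved to a housekeeping CPU and the work is scheduled there.
 */
static int toy_delayed_work_cpu(const struct toy_system *sys,
				unsigned long delay)
{
	if (delay == 0)
		return sys->local_cpu;
	return sys->housekeeping_cpu;
}
```

With local_cpu = 3 (isolated) and housekeeping_cpu = 0, a delay of 0 lands on
CPU 3 while a delay of 1 lands on CPU 0, which is the inconsistency at issue.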

Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

=== Plan and future plans ===

This patchset is the first step in a refactoring needed to address the
points mentioned above; in the long term it will also have a positive
impact on cpu isolation, by moving away from per-cpu workqueues in favor
of an unbound model.

These are the main steps:
1)  API refactoring (which this patchset is introducing)
	-	Make the system wq names clearer and more uniform, both per-cpu and
		unbound, to avoid any possible confusion about which one should be
		used.

	-	Introduction of WQ_PERCPU: this flag is the complement of WQ_UNBOUND,
		introduced in this patchset and used on all the callers that are not
		currently using WQ_UNBOUND.

		WQ_UNBOUND will be removed in a future release cycle.

		Most users don't need to be per-cpu because they have no locality
		requirements; for that reason, a future step will be to make
		"unbound" the default behavior.

2)  Check who really needs to be per-cpu
	-	Remove the WQ_PERCPU flag when it is not strictly required.

3)  Add a new API (prefer local cpu)
	-	There are users that don't require local execution, as mentioned
		above; even so, local execution yields a performance gain.

		This new API will prefer local execution, without requiring it.

=== Changes introduced by this patchset ===

1) [P 1] replace use of system_wq with system_percpu_wq (under mm)

		system_wq is a per-CPU workqueue, but its name does not make that
		clear. system_unbound_wq is to be used when locality is not required.

		Because of that, system_wq has been renamed to system_percpu_wq in the
		mm subsystem (details in the next section).

2) [P 2] add WQ_PERCPU to alloc_workqueue() users (under mm)

		Every alloc_workqueue() caller should use exactly one of WQ_PERCPU or
		WQ_UNBOUND. This is enforced with a warning if both or neither of them
		are present.

		These patches introduce WQ_PERCPU in the mm subsystem
		(details in the next section).

		WQ_UNBOUND will be removed in a future release cycle.
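
The "exactly one of WQ_PERCPU or WQ_UNBOUND" rule from item 2 can be sketched
as a small userspace check. This is only an illustration under stated
assumptions: the TOY_* bit values are placeholders rather than the kernel's
real flag values, and toy_wq_flags_valid() is a hypothetical helper, not the
check the workqueue code actually performs.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; the kernel's values may differ. */
#define TOY_WQ_UNBOUND     (1u << 0)
#define TOY_WQ_PERCPU      (1u << 1)
#define TOY_WQ_MEM_RECLAIM (1u << 2)

/*
 * Return true when exactly one of the two placement flags is set.
 * Setting both, or neither, is the case that would trigger the warning.
 */
static bool toy_wq_flags_valid(unsigned int flags)
{
	bool unbound = !!(flags & TOY_WQ_UNBOUND);
	bool percpu = !!(flags & TOY_WQ_PERCPU);

	return unbound != percpu;
}
```

In this model, TOY_WQ_MEM_RECLAIM | TOY_WQ_PERCPU passes, while 0 or
TOY_WQ_PERCPU | TOY_WQ_UNBOUND does not. Unrelated flags such as
TOY_WQ_MEM_RECLAIM do not affect the result.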

=== For mm Maintainers ===

If you agree with these changes, one option is to pull the preparation changes
from Tejun's wq branch [1].

As an alternative, the patches can be routed through a wq branch.

The preparation changes are described in the present cover letter, under the
"main steps" section. The changes done in summary are:

- add system_percpu_wq and system_dfl_wq, for now without replacing the older
  wq(s) (system_unbound_wq and system_wq).
- add WQ_PERCPU flag, currently without removing WQ_UNBOUND; it will be removed
  in a future release cycle.

You can find the aforementioned changes in:

("Workqueue: add WQ_PERCPU, system_dfl_wq and system_percpu_wq")
https://lore.kernel.org/all/20250614133531.76742-1-marco.crivellari@suse.com/


- [1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git WQ_PERCPU

Thanks!

Marco Crivellari (2):
  Workqueue: mm: replace use of system_wq with system_percpu_wq
  Workqueue: mm: WQ_PERCPU added to alloc_workqueue users

 mm/backing-dev.c | 4 ++--
 mm/slub.c        | 3 ++-
 mm/vmstat.c      | 3 ++-
 3 files changed, 6 insertions(+), 4 deletions(-)

-- 
2.50.1



* [PATCH 1/2] Workqueue: mm: replace use of system_wq with system_percpu_wq
  2025-08-15  9:48 [PATCH 0/2] Workqueue: mm: replace use of system_wq and add WQ_PERCPU to alloc_workqueue users Marco Crivellari
@ 2025-08-15  9:48 ` Marco Crivellari
  2025-08-15  9:48 ` [PATCH 2/2] Workqueue: mm: WQ_PERCPU added to alloc_workqueue users Marco Crivellari
  1 sibling, 0 replies; 3+ messages in thread
From: Marco Crivellari @ 2025-08-15  9:48 UTC
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Thomas Gleixner, Frederic Weisbecker,
	Sebastian Andrzej Siewior, Marco Crivellari, Michal Hocko,
	Andrew Morton

Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name conveys that
CPU affinity constraint, which is very often not required by users.
Make this clear by using system_percpu_wq throughout the mm subsystem.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
---
 mm/backing-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 783904d8c5ef..784605103202 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -966,7 +966,7 @@ static int __init cgwb_init(void)
 {
 	/*
 	 * There can be many concurrent release work items overwhelming
-	 * system_wq.  Put them in a separate wq and limit concurrency.
+	 * system_percpu_wq.  Put them in a separate wq and limit concurrency.
 	 * There's no point in executing many of these in parallel.
 	 */
 	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
-- 
2.50.1



* [PATCH 2/2] Workqueue: mm: WQ_PERCPU added to alloc_workqueue users
  2025-08-15  9:48 [PATCH 0/2] Workqueue: mm: replace use of system_wq and add WQ_PERCPU to alloc_workqueue users Marco Crivellari
  2025-08-15  9:48 ` [PATCH 1/2] Workqueue: mm: replace use of system_wq with system_percpu_wq Marco Crivellari
@ 2025-08-15  9:48 ` Marco Crivellari
  1 sibling, 0 replies; 3+ messages in thread
From: Marco Crivellari @ 2025-08-15  9:48 UTC
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Thomas Gleixner, Frederic Weisbecker,
	Sebastian Andrzej Siewior, Marco Crivellari, Michal Hocko,
	Andrew Morton

Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to all the mm subsystem users to
explicitly request the use of the per-CPU behavior. Both flags coexist
for one release cycle to allow callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
---
 mm/backing-dev.c | 2 +-
 mm/slub.c        | 3 ++-
 mm/vmstat.c      | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 784605103202..98fb7beddd4c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -969,7 +969,7 @@ static int __init cgwb_init(void)
 	 * system_percpu_wq.  Put them in a separate wq and limit concurrency.
 	 * There's no point in executing many of these in parallel.
 	 */
-	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+	cgwb_release_wq = alloc_workqueue("cgwb_release", WQ_PERCPU, 1);
 	if (!cgwb_release_wq)
 		return -ENOMEM;
 
diff --git a/mm/slub.c b/mm/slub.c
index b46f87662e71..cac9d5d7c924 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6364,7 +6364,8 @@ void __init kmem_cache_init(void)
 void __init kmem_cache_init_late(void)
 {
 #ifndef CONFIG_SLUB_TINY
-	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
+	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM | WQ_PERCPU,
+				  0);
 	WARN_ON(!flushwq);
 #endif
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4c268ce39ff2..57bf76b1d9d4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2244,7 +2244,8 @@ void __init init_mm_internals(void)
 {
 	int ret __maybe_unused;
 
-	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
+	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
+				       WQ_MEM_RECLAIM | WQ_PERCPU, 0);
 
 #ifdef CONFIG_SMP
 	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",
-- 
2.50.1

