The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes
@ 2026-06-29 17:48 Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 1/4] sunrpc: route to a populated pool in svc_pool_for_cpu() Jeff Layton
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Jeff Layton @ 2026-06-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

The second patch is basically the same as v2, aside from some changes
that Neil suggested.

The other three patches address what is a shortcoming of the existing
code -- namely that the server can be configured to schedule RPCs to
pools with no threads in them. The first patch addresses this problem:
if the chosen pool has no threads, then choose another that does.

The third patch tries to prevent this situation in the thread
auto-placement case by ensuring that each node has at least one thread.

The last patch is a performance micro-optimization. The old code used a
modulus (actually two) to determine the pool (and prevent potentially
overrunning the array). This trades that for a less cpu-intensive method
of finding the pool to use.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v3:
- Add patch to ensure that we don't route requests to empty pools
- When auto-distributing threads, always create at least one thread per populated pool
- Use sysfs_match_string for the module parameter
- Reword deprecation printk to be more vague about removal
- Explicitly set m_count == 0 in svc_pool_map_get()
- Optimize svc_pool_for_cpu() by eliminating modulus ops
- Link to v2: https://lore.kernel.org/r/20260625-sunrpc-pool-mode-v2-1-4f512b6e1ee8@kernel.org

Changes in v2:
- Accept any previously-accepted setting for pool_mode
- Link to v1: https://lore.kernel.org/r/20260423-sunrpc-pool-mode-v1-1-b7f20e35749b@kernel.org

---
Jeff Layton (4):
      sunrpc: route to a populated pool in svc_pool_for_cpu()
      sunrpc: hardcode pool_mode to pernode, remove other modes
      sunrpc: guarantee a thread per CPU-bearing node when auto-distributing
      sunrpc: eliminate a modulus operation from the enqueueing codepath

 Documentation/admin-guide/kernel-parameters.txt |  20 +-
 net/sunrpc/svc.c                                | 281 +++++++-----------------
 2 files changed, 88 insertions(+), 213 deletions(-)
---
base-commit: f8eb95335cc219493427f976460cf4b7e9641e92
change-id: 20260423-sunrpc-pool-mode-3e6b56320dc4

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/4] sunrpc: route to a populated pool in svc_pool_for_cpu()
  2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
@ 2026-06-29 17:48 ` Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-06-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

svc_set_num_threads() spreads the requested threads evenly across the
service's pools (base = nrservs / sv_nrpools).  When a service runs
fewer threads than it has pools -- e.g. an nfsd configured with fewer
threads than the host has NUMA nodes while running in "pernode" or
"percpu" mode -- the trailing pools are left with no threads at all.

svc_xprt_enqueue() selects a pool from the CPU servicing the transport,
queues the transport on that pool's sp_xprts, and only wakes a thread
from the same pool.  Each thread services exclusively its own pool, so a
transport that lands on a threadless pool is enqueued on sp_xprts and
never picked up: the connection hangs indefinitely.

Have svc_pool_for_cpu() skip pools that currently have no threads,
falling back to the next populated pool.  This trades NUMA locality for
a guarantee that the work is actually serviced.  sp_nrthreads is only
updated under the service mutex; the lockless read here is a best-effort
routing hint, so annotate it with data_race().

Fixes: 0f0257eaa5d2 ("svc: Move the xprt independent code to the svc_xprt.c file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/sunrpc/svc.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index dd80a2eaaa74..82fb7faf563f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -402,6 +402,7 @@ struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 	struct svc_pool_map *m = &svc_pool_map;
 	int cpu = raw_smp_processor_id();
 	unsigned int pidx = 0;
+	unsigned int i;
 
 	if (serv->sv_nrpools <= 1)
 		return serv->sv_pools;
@@ -414,8 +415,31 @@ struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 		pidx = m->to_pool[cpu_to_node(cpu)];
 		break;
 	}
+	pidx %= serv->sv_nrpools;
+
+	/*
+	 * Threads are spread evenly across the pools, but when there are
+	 * fewer threads than pools some pools can end up with none. A
+	 * transport enqueued on a threadless pool would never be picked
+	 * up, since each thread only services its own pool. Fall back to
+	 * the next populated pool, trading NUMA locality for a guarantee
+	 * that the transport is serviced.
+	 */
+	for (i = 0; i < serv->sv_nrpools; i++) {
+		struct svc_pool *pool = &serv->sv_pools[pidx];
+
+		/* This is set under the sp_mutex and rarely ever changes. A
+		 * data race here is harmless.
+		 */
+		if (data_race(pool->sp_nrthreads))
+			return pool;
+
+		if (++pidx >= serv->sv_nrpools)
+			pidx = 0;
+	}
 
-	return &serv->sv_pools[pidx % serv->sv_nrpools];
+	/* No pool has any threads; nothing can service the transport. */
+	return &serv->sv_pools[pidx];
 }
 
 static int svc_rpcb_setup(struct svc_serv *serv, struct net *net)

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes
  2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 1/4] sunrpc: route to a populated pool in svc_pool_for_cpu() Jeff Layton
@ 2026-06-29 17:48 ` Jeff Layton
  2026-06-30 17:24   ` Chuck Lever
  2026-06-29 17:48 ` [PATCH v3 3/4] sunrpc: guarantee a thread per CPU-bearing node when auto-distributing Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-06-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

The SVC_POOL_AUTO/GLOBAL/PERCPU/PERNODE pool mode selection machinery
was added when NUMA was new and the right default was unclear.  The
default has always been "global" (a single pool for the whole service);
the other modes were only used when an admin explicitly set the
pool_mode parameter or asked for "auto", which then picked a mode from
the host topology.  Today, pernode is the right choice everywhere:

- On multi-NUMA hosts, it gives one pool per node with proper thread
  affinity and NUMA-local memory allocation.
- On single-node hosts, pernode degenerates to exactly one pool,
  identical to the old "global" mode -- svc_pool_for_cpu() short-
  circuits when sv_nrpools <= 1, no CPU affinity is set, and memory
  is allocated from the single node.

The percpu mode (one pool per CPU) created excessive pools relative to
the number of threads most deployments run, and was only auto-selected
in a narrow case (single node, >2 CPUs).

Note that this changes the default behaviour on multi-NUMA hosts: a
service that previously ran with a single global pool now gets one pool
per NUMA node by default.  This in turn means a host running fewer
threads than it has NUMA nodes can end up with pools that have no
threads.  svc_pool_for_cpu() already falls back to a populated pool in
that case, so transports are still serviced.

Remove the SVC_POOL_* enum, mode selection heuristic,
svc_pool_map_init_percpu(), and all mode-based switch statements.
Simplify pool map functions to always use the pernode path.  If pool
map allocation fails, svc_pool_map_get() now returns 0 and service
creation fails, rather than silently falling back to a single global
pool.

The module parameter and netlink interfaces are preserved for backward
compatibility:
- Writing any previously-accepted value succeeds silently
- Reading always returns "pernode"
- Writing to the module parameter emits a deprecation notice

Update Documentation/admin-guide/kernel-parameters.txt to mark the
pool_mode parameter deprecated and describe the new behaviour.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  20 +-
 net/sunrpc/svc.c                                | 244 ++++--------------------
 2 files changed, 49 insertions(+), 215 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f22..441b78867478 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7441,19 +7441,13 @@ Kernel parameters
 
 	sunrpc.pool_mode=
 			[NFS]
-			Control how the NFS server code allocates CPUs to
-			service thread pools.  Depending on how many NICs
-			you have and where their interrupts are bound, this
-			option will affect which CPUs will do NFS serving.
-			Note: this parameter cannot be changed while the
-			NFS server is running.
-
-			auto	    the server chooses an appropriate mode
-				    automatically using heuristics
-			global	    a single global pool contains all CPUs
-			percpu	    one pool for each CPU
-			pernode	    one pool for each NUMA node (equivalent
-				    to global on non-NUMA machines)
+			Deprecated.  The NFS server now always uses one
+			service thread pool per NUMA node (equivalent to a
+			single global pool on non-NUMA machines).  All of
+			the previously accepted values (auto, global,
+			percpu, pernode) are still accepted for backward
+			compatibility but are ignored: the mode is always
+			pernode, and reads always return "pernode".
 
 	sunrpc.tcp_slot_table_entries=
 	sunrpc.udp_slot_table_entries=
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 82fb7faf563f..2f6938fe28b2 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -38,19 +38,6 @@
 
 static void svc_unregister(const struct svc_serv *serv, struct net *net);
 
-#define SVC_POOL_DEFAULT	SVC_POOL_GLOBAL
-
-/*
- * Mode for mapping cpus to pools.
- */
-enum {
-	SVC_POOL_AUTO = -1,	/* choose one of the others */
-	SVC_POOL_GLOBAL,	/* no mapping, just a single global pool
-				 * (legacy & UP mode) */
-	SVC_POOL_PERCPU,	/* one pool per cpu */
-	SVC_POOL_PERNODE	/* one pool per numa node */
-};
-
 /*
  * Structure for mapping cpus to pools and vice versa.
  * Setup once during sunrpc initialisation.
@@ -58,62 +45,29 @@ enum {
 
 struct svc_pool_map {
 	int count;			/* How many svc_servs use us */
-	int mode;			/* Note: int not enum to avoid
-					 * warnings about "enumeration value
-					 * not handled in switch" */
 	unsigned int npools;
-	unsigned int *pool_to;		/* maps pool id to cpu or node */
-	unsigned int *to_pool;		/* maps cpu or node to pool id */
+	unsigned int *pool_to;		/* maps pool id to node */
+	unsigned int *to_pool;		/* maps node to pool id */
 };
 
-static struct svc_pool_map svc_pool_map = {
-	.mode = SVC_POOL_DEFAULT
-};
+static struct svc_pool_map svc_pool_map;
 
 static DEFINE_MUTEX(svc_pool_map_mutex);/* protects svc_pool_map.count only */
 
-static int
-__param_set_pool_mode(const char *val, struct svc_pool_map *m)
-{
-	int err, mode;
-
-	mutex_lock(&svc_pool_map_mutex);
-
-	err = 0;
-	if (!strncmp(val, "auto", 4))
-		mode = SVC_POOL_AUTO;
-	else if (!strncmp(val, "global", 6))
-		mode = SVC_POOL_GLOBAL;
-	else if (!strncmp(val, "percpu", 6))
-		mode = SVC_POOL_PERCPU;
-	else if (!strncmp(val, "pernode", 7))
-		mode = SVC_POOL_PERNODE;
-	else
-		err = -EINVAL;
-
-	if (err)
-		goto out;
-
-	if (m->count == 0)
-		m->mode = mode;
-	else if (mode != m->mode)
-		err = -EBUSY;
-out:
-	mutex_unlock(&svc_pool_map_mutex);
-	return err;
-}
-
-static int
-param_set_pool_mode(const char *val, const struct kernel_param *kp)
-{
-	struct svc_pool_map *m = kp->arg;
-
-	return __param_set_pool_mode(val, m);
-}
+/*
+ * Pool modes that were historically accepted. They no longer select
+ * anything: the pool mode is always pernode. The names are retained
+ * only so that writing a previously-valid value still succeeds.
+ */
+static const char * const pool_mode_names[] = {
+	"auto", "global", "percpu", "pernode",
+};
 
 int sunrpc_set_pool_mode(const char *val)
 {
-	return __param_set_pool_mode(val, &svc_pool_map);
+	int idx = sysfs_match_string(pool_mode_names, val);
+
+	return idx < 0 ? idx : 0;
 }
 EXPORT_SYMBOL(sunrpc_set_pool_mode);
 
@@ -122,84 +76,32 @@ EXPORT_SYMBOL(sunrpc_set_pool_mode);
  * @buf: where to write the current pool_mode
  * @size: size of @buf
  *
- * Grab the current pool_mode from the svc_pool_map and write
- * the resulting string to @buf. Returns the number of characters
+ * Write the pool_mode string to @buf. Returns the number of characters
  * written to @buf (a'la snprintf()).
  */
 int
 sunrpc_get_pool_mode(char *buf, size_t size)
 {
-	struct svc_pool_map *m = &svc_pool_map;
-
-	switch (m->mode)
-	{
-	case SVC_POOL_AUTO:
-		return snprintf(buf, size, "auto");
-	case SVC_POOL_GLOBAL:
-		return snprintf(buf, size, "global");
-	case SVC_POOL_PERCPU:
-		return snprintf(buf, size, "percpu");
-	case SVC_POOL_PERNODE:
-		return snprintf(buf, size, "pernode");
-	default:
-		return snprintf(buf, size, "%d", m->mode);
-	}
+	return snprintf(buf, size, "pernode");
 }
 EXPORT_SYMBOL(sunrpc_get_pool_mode);
 
 static int
-param_get_pool_mode(char *buf, const struct kernel_param *kp)
+param_set_pool_mode(const char *val, const struct kernel_param *kp)
 {
-	char str[16];
-	int len;
-
-	len = sunrpc_get_pool_mode(str, ARRAY_SIZE(str));
-
-	/* Ensure we have room for newline and NUL */
-	len = min_t(int, len, ARRAY_SIZE(str) - 2);
-
-	/* tack on the newline */
-	str[len] = '\n';
-	str[len + 1] = '\0';
-
-	return sysfs_emit(buf, "%s", str);
+	pr_notice_once("sunrpc: the pool_mode module parameter is deprecated and no longer has any effect; the pool mode is always 'pernode'\n");
+	return sunrpc_set_pool_mode(val);
 }
 
-module_param_call(pool_mode, param_set_pool_mode, param_get_pool_mode,
-		  &svc_pool_map, 0644);
-
-/*
- * Detect best pool mapping mode heuristically,
- * according to the machine's topology.
- */
 static int
-svc_pool_map_choose_mode(void)
+param_get_pool_mode(char *buf, const struct kernel_param *kp)
 {
-	unsigned int node;
-
-	if (nr_online_nodes > 1) {
-		/*
-		 * Actually have multiple NUMA nodes,
-		 * so split pools on NUMA node boundaries
-		 */
-		return SVC_POOL_PERNODE;
-	}
-
-	node = first_online_node;
-	if (nr_cpus_node(node) > 2) {
-		/*
-		 * Non-trivial SMP, or CONFIG_NUMA on
-		 * non-NUMA hardware, e.g. with a generic
-		 * x86_64 kernel on Xeons.  In this case we
-		 * want to divide the pools on cpu boundaries.
-		 */
-		return SVC_POOL_PERCPU;
-	}
-
-	/* default: one global pool */
-	return SVC_POOL_GLOBAL;
+	return sysfs_emit(buf, "pernode\n");
 }
 
+module_param_call(pool_mode, param_set_pool_mode, param_get_pool_mode,
+		  NULL, 0644);
+
 /*
  * Allocate the to_pool[] and pool_to[] arrays.
  * Returns 0 on success or an errno.
@@ -224,35 +126,7 @@ svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
 }
 
 /*
- * Initialise the pool map for SVC_POOL_PERCPU mode.
- * Returns number of pools or <0 on error.
- */
-static int
-svc_pool_map_init_percpu(struct svc_pool_map *m)
-{
-	unsigned int maxpools = nr_cpu_ids;
-	unsigned int pidx = 0;
-	unsigned int cpu;
-	int err;
-
-	err = svc_pool_map_alloc_arrays(m, maxpools);
-	if (err)
-		return err;
-
-	for_each_online_cpu(cpu) {
-		BUG_ON(pidx >= maxpools);
-		m->to_pool[cpu] = pidx;
-		m->pool_to[pidx] = cpu;
-		pidx++;
-	}
-	/* cpus brought online later all get mapped to pool0, sorry */
-
-	return pidx;
-};
-
-
-/*
- * Initialise the pool map for SVC_POOL_PERNODE mode.
+ * Initialise the pool map for one pool per NUMA node.
  * Returns number of pools or <0 on error.
  */
 static int
@@ -284,14 +158,13 @@ svc_pool_map_init_pernode(struct svc_pool_map *m)
  * Add a reference to the global map of cpus to pools (and
  * vice versa) if pools are in use.
  * Initialise the map if we're the first user.
- * Returns the number of pools. If this is '1', no reference
- * was taken.
+ * Returns the number of pools, or 0 on failure.
  */
 static unsigned int
 svc_pool_map_get(void)
 {
 	struct svc_pool_map *m = &svc_pool_map;
-	int npools = -1;
+	int npools;
 
 	mutex_lock(&svc_pool_map_mutex);
 	if (m->count++) {
@@ -299,22 +172,11 @@ svc_pool_map_get(void)
 		return m->npools;
 	}
 
-	if (m->mode == SVC_POOL_AUTO)
-		m->mode = svc_pool_map_choose_mode();
-
-	switch (m->mode) {
-	case SVC_POOL_PERCPU:
-		npools = svc_pool_map_init_percpu(m);
-		break;
-	case SVC_POOL_PERNODE:
-		npools = svc_pool_map_init_pernode(m);
-		break;
-	}
-
+	npools = svc_pool_map_init_pernode(m);
 	if (npools <= 0) {
-		/* default, or memory allocation failure */
-		npools = 1;
-		m->mode = SVC_POOL_GLOBAL;
+		m->count = 0;
+		mutex_unlock(&svc_pool_map_mutex);
+		return 0;
 	}
 	m->npools = npools;
 	mutex_unlock(&svc_pool_map_mutex);
@@ -346,14 +208,11 @@ static int svc_pool_map_get_node(unsigned int pidx)
 {
 	const struct svc_pool_map *m = &svc_pool_map;
 
-	if (m->count) {
-		if (m->mode == SVC_POOL_PERCPU)
-			return cpu_to_node(m->pool_to[pidx]);
-		if (m->mode == SVC_POOL_PERNODE)
-			return m->pool_to[pidx];
-	}
+	if (m->count)
+		return m->pool_to[pidx];
 	return numa_mem_id();
 }
+
 /*
  * Set the given thread's cpus_allowed mask so that it
  * will only run on cpus in the given pool.
@@ -372,27 +231,15 @@ svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
 	if (m->count == 0)
 		return;
 
-	switch (m->mode) {
-	case SVC_POOL_PERCPU:
-	{
-		set_cpus_allowed_ptr(task, cpumask_of(node));
-		break;
-	}
-	case SVC_POOL_PERNODE:
-	{
-		set_cpus_allowed_ptr(task, cpumask_of_node(node));
-		break;
-	}
-	}
+	set_cpus_allowed_ptr(task, cpumask_of_node(node));
 }
 
 /**
  * svc_pool_for_cpu - Select pool to run a thread on this cpu
  * @serv: An RPC service
  *
- * Use the active CPU and the svc_pool_map's mode setting to
- * select the svc thread pool to use. Once initialized, the
- * svc_pool_map does not change.
+ * Use the active CPU and the svc_pool_map to select the svc thread
+ * pool to use. Once initialized, the svc_pool_map does not change.
  *
  * Return value:
  *   A pointer to an svc_pool
@@ -400,22 +247,12 @@ svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
 struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 {
 	struct svc_pool_map *m = &svc_pool_map;
-	int cpu = raw_smp_processor_id();
-	unsigned int pidx = 0;
-	unsigned int i;
+	unsigned int pidx, i;
 
 	if (serv->sv_nrpools <= 1)
 		return serv->sv_pools;
 
-	switch (m->mode) {
-	case SVC_POOL_PERCPU:
-		pidx = m->to_pool[cpu];
-		break;
-	case SVC_POOL_PERNODE:
-		pidx = m->to_pool[cpu_to_node(cpu)];
-		break;
-	}
-	pidx %= serv->sv_nrpools;
+	pidx = m->to_pool[cpu_to_node(raw_smp_processor_id())] % serv->sv_nrpools;
 
 	/*
 	 * Threads are spread evenly across the pools, but when there are
@@ -641,6 +478,9 @@ struct svc_serv *svc_create_pooled(struct svc_program *prog,
 	struct svc_serv *serv;
 	unsigned int npools = svc_pool_map_get();
 
+	if (!npools)
+		return NULL;
+
 	serv = __svc_create(prog, nprogs, stats, bufsize, npools, threadfn);
 	if (!serv)
 		goto out_err;

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 3/4] sunrpc: guarantee a thread per CPU-bearing node when auto-distributing
  2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 1/4] sunrpc: route to a populated pool in svc_pool_for_cpu() Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
@ 2026-06-29 17:48 ` Jeff Layton
  2026-06-29 17:48 ` [PATCH v3 4/4] sunrpc: eliminate a modulus operation from the enqueueing codepath Jeff Layton
  2026-06-30 12:48 ` [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU Jeff Layton
  4 siblings, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-06-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

svc_set_num_threads() spreads the requested thread count evenly across
the service's pools. In pernode mode each pool maps to a NUMA node, and
svc_pool_for_cpu() steers an incoming transport to the pool for the node
it arrived on. When fewer threads than pools are requested, even
distribution leaves some nodes' pools empty, and a transport steered to
an empty pool has no thread to service it.

Floor each CPU-bearing node's pool at one thread when auto-distributing a
non-zero count, so no such pool is left empty. The resulting total may
exceed the requested count. This only affects the auto-distribute path
(a single-value array, i.e. svc_set_num_threads()); callers that set
per-pool counts explicitly via svc_set_pool_threads() are unchanged and
may still set a pool to zero. Nodes without CPUs (e.g. memory-only nodes)
get no thread, as nothing is steered to them.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/sunrpc/svc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 2f6938fe28b2..99a4fd62399b 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -826,6 +826,12 @@ EXPORT_SYMBOL_GPL(svc_set_pool_threads);
  * are multiple pools then the new threads or victims will be distributed
  * evenly among them.
  *
+ * When @nrservs is non-zero but smaller than the number of pools, even
+ * distribution would leave some pools empty. Since each pool maps to a
+ * NUMA node and only services transports steered to that node, every
+ * pool whose node has CPUs is instead guaranteed at least one thread.
+ * The resulting total may therefore exceed @nrservs.
+ *
  * Caller must ensure mutual exclusion between this and server startup or
  * shutdown.
  *
@@ -850,6 +856,15 @@ svc_set_num_threads(struct svc_serv *serv, unsigned int min_threads,
 			--remain;
 		}
 
+		/*
+		 * Don't let a node's pool sit empty while threads are
+		 * being auto-distributed: a transport steered there would
+		 * have nothing to service it.
+		 */
+		if (threads == 0 && nrservs &&
+		    nr_cpus_node(svc_pool_map_get_node(pool->sp_id)))
+			threads = 1;
+
 		err = svc_set_pool_threads(serv, pool, min_threads, threads);
 		if (err)
 			break;

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 4/4] sunrpc: eliminate a modulus operation from the enqueueing codepath
  2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
                   ` (2 preceding siblings ...)
  2026-06-29 17:48 ` [PATCH v3 3/4] sunrpc: guarantee a thread per CPU-bearing node when auto-distributing Jeff Layton
@ 2026-06-29 17:48 ` Jeff Layton
  2026-06-30 12:48 ` [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU Jeff Layton
  4 siblings, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-06-29 17:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

Currently we do this to determine the pool to enqueue on:

    pidx = m->to_pool[cpu_to_node(raw_smp_processor_id())] % serv->sv_nrpools;

...but a modulus is rather expensive. Replace this instead with an
explicit check for running off the end of the array.

This situation should never occur, but if it does, just fall back to
pool 0.

This trades a ~20-30 cycle operation that isn't pipelined and
monopolizes the divider for a ~1 cycle well-predicted branch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/sunrpc/svc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 99a4fd62399b..4fff0725ef8f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -252,7 +252,9 @@ struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 	if (serv->sv_nrpools <= 1)
 		return serv->sv_pools;
 
-	pidx = m->to_pool[cpu_to_node(raw_smp_processor_id())] % serv->sv_nrpools;
+	pidx = m->to_pool[cpu_to_node(raw_smp_processor_id())];
+	if (pidx >= serv->sv_nrpools)
+		pidx = 0;
 
 	/*
 	 * Threads are spread evenly across the pools, but when there are

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU
  2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
                   ` (3 preceding siblings ...)
  2026-06-29 17:48 ` [PATCH v3 4/4] sunrpc: eliminate a modulus operation from the enqueueing codepath Jeff Layton
@ 2026-06-30 12:48 ` Jeff Layton
  2026-07-01  1:50   ` NeilBrown
  4 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-06-30 12:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever
  Cc: linux-nfs, linux-kernel, Jeff Layton

svc_pool_map_get_node() reads the global svc_pool_map without holding
svc_pool_map_mutex and dereferences m->pool_to[]. The array was both
published and torn down without any synchronisation against that
lockless reader:

 - svc_pool_map_get() incremented m->count to one before
   svc_pool_map_init_pernode() allocated and filled the arrays, so a
   reader observing the map as "in use" could see a NULL (or partially
   built) pool_to[] and oops.

 - svc_pool_map_put() freed the arrays as soon as the last reference
   went away, so a reader that had already started dereferencing
   pool_to[] could use it after free.

svc_new_thread() takes this lockless path for every service, including
unpooled ones that hold no map reference, so the reader genuinely can
run concurrently with another service's startup or shutdown.

Publish pool_to[] with rcu_assign_pointer() only after it is fully
built in a private allocation, and have svc_pool_map_get_node()
dereference it under rcu_read_lock(). On teardown, clear the pointer
and defer the free past a grace period with kfree_rcu_mightsleep().

svc_pool_map_set_cpumask() also reads pool_to[], but its caller holds a
map reference (it checks sv_nrpools > 1) so the array is stable; it uses
rcu_dereference_protected() rather than taking the read lock.

to_pool[] needs no such treatment: it is only read by services that
hold a map reference, so it cannot be freed under a reader.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Assisted-by: Claude:claude-opus-4-8
---
 net/sunrpc/svc.c | 91 +++++++++++++++++++++++++-----------------------
 1 file changed, 48 insertions(+), 43 deletions(-)

Sashiko pointed out this (preexisting) problem during review of the
other 4 in the series. I figure we might as well fix it too while we're
in here.

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4fff0725ef8f..ebd953248e34 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -46,13 +46,13 @@ static void svc_unregister(const struct svc_serv *serv, struct net *net);
 struct svc_pool_map {
 	int count;			/* How many svc_servs use us */
 	unsigned int npools;
-	unsigned int *pool_to;		/* maps pool id to node */
+	unsigned int __rcu *pool_to;	/* maps pool id to node */
 	unsigned int *to_pool;		/* maps node to pool id */
 };
 
 static struct svc_pool_map svc_pool_map;
 
-static DEFINE_MUTEX(svc_pool_map_mutex);/* protects svc_pool_map.count only */
+static DEFINE_MUTEX(svc_pool_map_mutex);/* serialises svc_pool_map updates */
 
 /*
  * Pool modes that were historically accepted. They no longer select
@@ -102,32 +102,12 @@ param_get_pool_mode(char *buf, const struct kernel_param *kp)
 module_param_call(pool_mode, param_set_pool_mode, param_get_pool_mode,
 		  NULL, 0644);
 
-/*
- * Allocate the to_pool[] and pool_to[] arrays.
- * Returns 0 on success or an errno.
- */
-static int
-svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
-{
-	m->to_pool = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
-	if (!m->to_pool)
-		goto fail;
-	m->pool_to = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
-	if (!m->pool_to)
-		goto fail_free;
-
-	return 0;
-
-fail_free:
-	kfree(m->to_pool);
-	m->to_pool = NULL;
-fail:
-	return -ENOMEM;
-}
-
 /*
  * Initialise the pool map for one pool per NUMA node.
  * Returns number of pools or <0 on error.
+ *
+ * pool_to[] is published with rcu_assign_pointer() only once fully built,
+ * so a lockless reader sees either NULL or a complete map.
  */
 static int
 svc_pool_map_init_pernode(struct svc_pool_map *m)
@@ -135,21 +115,30 @@ svc_pool_map_init_pernode(struct svc_pool_map *m)
 	unsigned int maxpools = nr_node_ids;
 	unsigned int pidx = 0;
 	unsigned int node;
-	int err;
+	unsigned int *to_pool, *pool_to;
 
-	err = svc_pool_map_alloc_arrays(m, maxpools);
-	if (err)
-		return err;
+	to_pool = kcalloc(maxpools, sizeof(*to_pool), GFP_KERNEL);
+	if (!to_pool)
+		return -ENOMEM;
+	pool_to = kcalloc(maxpools, sizeof(*pool_to), GFP_KERNEL);
+	if (!pool_to) {
+		kfree(to_pool);
+		return -ENOMEM;
+	}
 
 	for_each_node_with_cpus(node) {
 		/* some architectures (e.g. SN2) have cpuless nodes */
 		BUG_ON(pidx > maxpools);
-		m->to_pool[node] = pidx;
-		m->pool_to[pidx] = node;
+		to_pool[node] = pidx;
+		pool_to[pidx] = node;
 		pidx++;
 	}
 	/* nodes brought online later all get mapped to pool0, sorry */
 
+	m->npools = pidx;
+	m->to_pool = to_pool;
+	rcu_assign_pointer(m->pool_to, pool_to);
+
 	return pidx;
 }
 
@@ -178,7 +167,6 @@ svc_pool_map_get(void)
 		mutex_unlock(&svc_pool_map_mutex);
 		return 0;
 	}
-	m->npools = npools;
 	mutex_unlock(&svc_pool_map_mutex);
 	return npools;
 }
@@ -195,11 +183,21 @@ svc_pool_map_put(void)
 
 	mutex_lock(&svc_pool_map_mutex);
 	if (!--m->count) {
+		unsigned int *pool_to;
+
+		/* Protected by svc_pool_map_mutex */
+		pool_to = rcu_dereference_protected(m->pool_to, 1);
+
+		/*
+		 * Defer the pool_to[] free past a grace period; a lockless
+		 * reader may still hold it. to_pool[] has no such readers, so
+		 * free it directly.
+		 */
+		rcu_assign_pointer(m->pool_to, NULL);
 		kfree(m->to_pool);
 		m->to_pool = NULL;
-		kfree(m->pool_to);
-		m->pool_to = NULL;
 		m->npools = 0;
+		kfree_rcu_mightsleep(pool_to);
 	}
 	mutex_unlock(&svc_pool_map_mutex);
 }
@@ -207,10 +205,15 @@ svc_pool_map_put(void)
 static int svc_pool_map_get_node(unsigned int pidx)
 {
 	const struct svc_pool_map *m = &svc_pool_map;
+	const unsigned int *pool_to;
+	int node = numa_mem_id();
 
-	if (m->count)
-		return m->pool_to[pidx];
-	return numa_mem_id();
+	rcu_read_lock();
+	pool_to = rcu_dereference(m->pool_to);
+	if (pool_to)
+		node = pool_to[pidx];
+	rcu_read_unlock();
+	return node;
 }
 
 /*
@@ -221,17 +224,19 @@ static inline void
 svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
 {
 	struct svc_pool_map *m = &svc_pool_map;
-	unsigned int node = m->pool_to[pidx];
+	const unsigned int *pool_to;
 
 	/*
-	 * The caller checks for sv_nrpools > 1, which
-	 * implies that we've been initialized.
+	 * The caller checks for sv_nrpools > 1, so its service holds a
+	 * reference to the map: pool_to[] is allocated and cannot be freed
+	 * under us. No RCU read lock is needed; the held reference keeps the
+	 * array stable.
 	 */
-	WARN_ON_ONCE(m->count == 0);
-	if (m->count == 0)
+	pool_to = rcu_dereference_protected(m->pool_to, 1);
+	if (WARN_ON_ONCE(!pool_to))
 		return;
 
-	set_cpus_allowed_ptr(task, cpumask_of_node(node));
+	set_cpus_allowed_ptr(task, cpumask_of_node(pool_to[pidx]));
 }
 
 /**
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes
  2026-06-29 17:48 ` [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
@ 2026-06-30 17:24   ` Chuck Lever
  0 siblings, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2026-06-30 17:24 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, Anna Schumaker, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, linux-kernel



On Mon, Jun 29, 2026, at 1:48 PM, Jeff Layton wrote:
> The SVC_POOL_AUTO/GLOBAL/PERCPU/PERNODE pool mode selection machinery
> was added when NUMA was new and the right default was unclear.  The
> default has always been "global" (a single pool for the whole service);
> the other modes were only used when an admin explicitly set the
> pool_mode parameter or asked for "auto", which then picked a mode from
> the host topology.

> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 82fb7faf563f..2f6938fe28b2 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c

> @@ -58,62 +45,29 @@ enum {
> 
>  struct svc_pool_map {
>  	int count;			/* How many svc_servs use us */
> -	int mode;			/* Note: int not enum to avoid
> -					 * warnings about "enumeration value
> -					 * not handled in switch" */
>  	unsigned int npools;
> -	unsigned int *pool_to;		/* maps pool id to cpu or node */
> -	unsigned int *to_pool;		/* maps cpu or node to pool id */
> +	unsigned int *pool_to;		/* maps pool id to node */
> +	unsigned int *to_pool;		/* maps node to pool id */
>  };
> 
> -static struct svc_pool_map svc_pool_map = {
> -	.mode = SVC_POOL_DEFAULT
> -};
> +static struct svc_pool_map svc_pool_map;
> 
>  static DEFINE_MUTEX(svc_pool_map_mutex);/* protects svc_pool_map.count only */
> 
> -static int
> -__param_set_pool_mode(const char *val, struct svc_pool_map *m)
> -{
> -	int err, mode;
> -
> -	mutex_lock(&svc_pool_map_mutex);
> -
> -	err = 0;
> -	if (!strncmp(val, "auto", 4))
> -		mode = SVC_POOL_AUTO;
> -	else if (!strncmp(val, "global", 6))
> -		mode = SVC_POOL_GLOBAL;
> -	else if (!strncmp(val, "percpu", 6))
> -		mode = SVC_POOL_PERCPU;
> -	else if (!strncmp(val, "pernode", 7))
> -		mode = SVC_POOL_PERNODE;
> -	else
> -		err = -EINVAL;
> -
> -	if (err)
> -		goto out;
> -
> -	if (m->count == 0)
> -		m->mode = mode;
> -	else if (mode != m->mode)
> -		err = -EBUSY;
> -out:
> -	mutex_unlock(&svc_pool_map_mutex);
> -	return err;
> -}
> -
> -static int
> -param_set_pool_mode(const char *val, const struct kernel_param *kp)
> -{
> -	struct svc_pool_map *m = kp->arg;
> -
> -	return __param_set_pool_mode(val, m);
> -}
> +/*
> + * Pool modes that were historically accepted. They no longer select
> + * anything: the pool mode is always pernode. The names are retained
> + * only so that writing a previously-valid value still succeeds.
> + */
> +static const char * const pool_mode_names[] = {
> +	"auto", "global", "percpu", "pernode",
> +};
> 
>  int sunrpc_set_pool_mode(const char *val)
>  {
> -	return __param_set_pool_mode(val, &svc_pool_map);
> +	int idx = sysfs_match_string(pool_mode_names, val);
> +
> +	return idx < 0 ? idx : 0;
>  }
>  EXPORT_SYMBOL(sunrpc_set_pool_mode);
> 

sysfs_match_string() compares the whole string, whereas the old
__param_set_pool_mode() matched prefixes (strncmp(val, "global", 6)
and friends). An input like "globalx" that the old code accepted is
now rejected with -EINVAL.

The commit message says:

  Writing any previously-accepted value succeeds silently

The four documented names still succeed, including with a trailing
newline since sysfs_streq() tolerates one, so only prefix-extended
strings change.


> @@ -284,14 +158,13 @@ svc_pool_map_init_pernode(struct svc_pool_map *m)
>   * Add a reference to the global map of cpus to pools (and
>   * vice versa) if pools are in use.
>   * Initialise the map if we're the first user.
> - * Returns the number of pools. If this is '1', no reference
> - * was taken.
> + * Returns the number of pools, or 0 on failure.
>   */
>  static unsigned int
>  svc_pool_map_get(void)
>  {
>  	struct svc_pool_map *m = &svc_pool_map;
> -	int npools = -1;
> +	int npools;
> 
>  	mutex_lock(&svc_pool_map_mutex);
>  	if (m->count++) {

Regarding the kdoc modification above: with percpu removed the
map is node-to-pool; "cpus to pools" now contradicts the struct
field comments this same patch corrected ("maps pool id to
node" / "maps node to pool id").


> @@ -299,22 +172,11 @@ svc_pool_map_get(void)
>  		return m->npools;
>  	}
> 
> -	if (m->mode == SVC_POOL_AUTO)
> -		m->mode = svc_pool_map_choose_mode();
> -
> -	switch (m->mode) {
> -	case SVC_POOL_PERCPU:
> -		npools = svc_pool_map_init_percpu(m);
> -		break;
> -	case SVC_POOL_PERNODE:
> -		npools = svc_pool_map_init_pernode(m);
> -		break;
> -	}
> -
> +	npools = svc_pool_map_init_pernode(m);
>  	if (npools <= 0) {
> -		/* default, or memory allocation failure */
> -		npools = 1;
> -		m->mode = SVC_POOL_GLOBAL;
> +		m->count = 0;
> +		mutex_unlock(&svc_pool_map_mutex);
> +		return 0;
>  	}
>  	m->npools = npools;
>  	mutex_unlock(&svc_pool_map_mutex);

svc_pool_map_put() frees pool_to[] and sets it back to NULL when the
last pooled reference drops.

Can an unpooled svc_new_thread() running concurrently with an nfsd
start or stop observe m->count != 0 while pool_to is still NULL (the
first get) or already freed (the last put), and dereference it here?
The reads of m->count and m->pool_to are not under the mutex.

The old default global mode returned numa_mem_id() in this case and
never touched pool_to[]; with pernode as the default this path runs
for every server. Would it help to consult pool_to[] only when the
service holds a reference, the way svc_pool_map_set_cpumask() is
already gated in svc_new_thread()?

        if (serv->sv_nrpools > 1)
                svc_pool_map_set_cpumask(task, pool->sp_id);

An unpooled service (sv_nrpools == 1) could then take numa_mem_id()
without reading the shared map.


> @@ -346,14 +208,11 @@ static int svc_pool_map_get_node(unsigned int pidx)
>  {
>  	const struct svc_pool_map *m = &svc_pool_map;
> 
> -	if (m->count) {
> -		if (m->mode == SVC_POOL_PERCPU)
> -			return cpu_to_node(m->pool_to[pidx]);
> -		if (m->mode == SVC_POOL_PERNODE)
> -			return m->pool_to[pidx];
> -	}
> +	if (m->count)
> +		return m->pool_to[pidx];
>  	return numa_mem_id();
>  }
> +
>  /*
>   * Set the given thread's cpus_allowed mask so that it
>   * will only run on cpus in the given pool.

This path now dereferences pool_to[] whenever m->count is non-zero.

svc_new_thread() calls it for every service, pooled or not, holding
neither svc_pool_map_mutex nor a map reference:

        node = svc_pool_map_get_node(pool->sp_id);

Unpooled services reach this too: lockd and the NFS callback are both
created with svc_create(), which passes npools = 1 and never calls
svc_pool_map_get(), so their m->count is whatever a pooled nfsd has
left set.

svc_pool_map_get() bumps the count before pool_to[] is allocated, and
the whole sequence is serialized only against other map writers:

        mutex_lock(&svc_pool_map_mutex);
        if (m->count++) {
                ...
        }
        npools = svc_pool_map_init_pernode(m);   /* allocates pool_to */


> @@ -372,27 +231,15 @@ svc_pool_map_set_cpumask(struct task_struct 
> *task, unsigned int pidx)
>  	if (m->count == 0)
>  		return;
> 
> -	switch (m->mode) {
> -	case SVC_POOL_PERCPU:
> -	{
> -		set_cpus_allowed_ptr(task, cpumask_of(node));
> -		break;
> -	}
> -	case SVC_POOL_PERNODE:
> -	{
> -		set_cpus_allowed_ptr(task, cpumask_of_node(node));
> -		break;
> -	}
> -	}
> +	set_cpus_allowed_ptr(task, cpumask_of_node(node));
>  }

pool_to[pidx] is dereferenced before the m->count == 0 check, so if
count were ever 0 the NULL deref has already happened and the
WARN/return can never fire. The sv_nrpools > 1 gate at the call site
keeps this safe, so the guard is now dead code.

For the series as a whole, let's consider moving 5/4 to the front.
It looks like a fix that might want to be backported.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU
  2026-06-30 12:48 ` [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU Jeff Layton
@ 2026-07-01  1:50   ` NeilBrown
  0 siblings, 0 replies; 8+ messages in thread
From: NeilBrown @ 2026-07-01  1:50 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever, linux-nfs, linux-kernel, Jeff Layton

On Tue, 30 Jun 2026, Jeff Layton wrote:
> svc_pool_map_get_node() reads the global svc_pool_map without holding
> svc_pool_map_mutex and dereferences m->pool_to[]. The array was both
> published and torn down without any synchronisation against that
> lockless reader:
> 
>  - svc_pool_map_get() incremented m->count to one before
>    svc_pool_map_init_pernode() allocated and filled the arrays, so a
>    reader observing the map as "in use" could see a NULL (or partially
>    built) pool_to[] and oops.
> 
>  - svc_pool_map_put() freed the arrays as soon as the last reference
>    went away, so a reader that had already started dereferencing
>    pool_to[] could use it after free.
> 
> svc_new_thread() takes this lockless path for every service, including
> unpooled ones that hold no map reference, so the reader genuinely can
> run concurrently with another service's startup or shutdown.
> 
> Publish pool_to[] with rcu_assign_pointer() only after it is fully
> built in a private allocation, and have svc_pool_map_get_node()
> dereference it under rcu_read_lock(). On teardown, clear the pointer
> and defer the free past a grace period with kfree_rcu_mightsleep().
> 
> svc_pool_map_set_cpumask() also reads pool_to[], but its caller holds a
> map reference (it checks sv_nrpools > 1) so the array is stable; it uses
> rcu_dereference_protected() rather than taking the read lock.
> 
> to_pool[] needs no such treatment: it is only read by services that
> hold a map reference, so it cannot be freed under a reader.

I don't think this is the best approach.
The problem only occurs when svc_create() is used as when
svc_create_pooled() is used, svc_pool_map_get() is called early so a
reference is held to svc_pool_map.

svc_new_thread() is the only caller of svc_pool_map_get_node(), and it
can easily check pool->sv_is_pooled and only call
svc_pool_map_get_node() is sv_is_pooled is true.  When false it should
probably use NUMA_NO_NODE.
e.g.

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ae9ec4bf34f7..063826702f46 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -800,7 +800,10 @@ int svc_new_thread(struct svc_serv *serv, struct svc_pool *pool)
 	int node;
 	int err = 0;
 
-	node = svc_pool_map_get_node(pool->sp_id);
+	if (serv->sv_is_pooled)
+		node = svc_pool_map_get_node(pool->sp_id);
+	else
+		node = NUMA_NO_NODE;
 
 	rqstp = svc_prepare_thread(serv, pool, node);
 	if (!rqstp)


NeilBrown

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-07-01  1:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-29 17:48 [PATCH v3 0/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
2026-06-29 17:48 ` [PATCH v3 1/4] sunrpc: route to a populated pool in svc_pool_for_cpu() Jeff Layton
2026-06-29 17:48 ` [PATCH v3 2/4] sunrpc: hardcode pool_mode to pernode, remove other modes Jeff Layton
2026-06-30 17:24   ` Chuck Lever
2026-06-29 17:48 ` [PATCH v3 3/4] sunrpc: guarantee a thread per CPU-bearing node when auto-distributing Jeff Layton
2026-06-29 17:48 ` [PATCH v3 4/4] sunrpc: eliminate a modulus operation from the enqueueing codepath Jeff Layton
2026-06-30 12:48 ` [PATCH 5/4] sunrpc: protect the svc_pool_map pool_to[] array with RCU Jeff Layton
2026-07-01  1:50   ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox