* [PATCH net-next v5] Add support to set napi threaded for individual napi
@ 2025-04-23 20:14 Samiullah Khawaja
2025-04-24 23:13 ` Joe Damato
2025-04-26 0:42 ` Jakub Kicinski
0 siblings, 2 replies; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-23 20:14 UTC
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
A net device has a `threaded` sysfs attribute that can be used to enable
threaded napi polling on all of the NAPI contexts under that device. Allow
enabling threaded napi polling at the individual napi level using netlink.
Extend the netlink operation `napi-set` to allow setting the threaded
attribute of a NAPI. Setting this attribute enables threaded polling on that
napi context.
Add a test in `nl_netdev.py` that verifies various cases of threaded
napi being set at the napi and at the device level.
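For example, with the YNL CLI (the napi ID below is illustrative):

  $ ./tools/net/ynl/pyynl/cli.py \
        --spec Documentation/netlink/specs/netdev.yaml \
        --do napi-set --json='{"id": 66, "threaded": 1}'

  $ ./tools/net/ynl/pyynl/cli.py \
        --spec Documentation/netlink/specs/netdev.yaml \
        --do napi-get --json='{"id": 66}'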
Tested:
./tools/testing/selftests/net/nl_netdev.py
TAP version 13
1..6
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
ok 3 nl_netdev.page_pool_check
ok 4 nl_netdev.napi_list_check
ok 5 nl_netdev.napi_set_threaded
ok 6 nl_netdev.nsim_rxq_reset_down
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
v5:
- This patch was part of:
  https://lore.kernel.org/netdev/Z92e2kCYXQ_RsrJh@LQ3V64L9R2/T/
  It is being sent separately for the first time.
- Change the threaded attribute type to uint.
- Set the napi threaded state when set at the napi level. When set at the
  device level, overwrite the state of each napi. This is the `write all`
  semantics that is also followed by other configurations.
- Add a test to verify the `write all` semantics.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
Documentation/netlink/specs/netdev.yaml | 10 ++++
Documentation/networking/napi.rst | 13 ++++-
include/linux/netdevice.h | 1 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 23 ++++++++
net/core/dev.h | 20 +++++++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 ++++
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/nl_netdev.py | 67 +++++++++++++++++++++++-
10 files changed, 147 insertions(+), 4 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index f5e0750ab71d..c9d190fe1f05 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -280,6 +280,14 @@ attribute-sets:
doc: The timeout, in nanoseconds, of how long to suspend irq
processing, if event polling finds events
type: uint
+ -
+ name: threaded
+ doc: Whether the napi is configured to operate in threaded polling
+ mode. If this is set to `1` then the NAPI context operates
+ in threaded polling mode.
+ type: uint
+ checks:
+ max: 1
-
name: xsk-info
attributes: []
@@ -691,6 +699,7 @@ operations:
- defer-hard-irqs
- gro-flush-timeout
- irq-suspend-timeout
+ - threaded
dump:
request:
attributes:
@@ -743,6 +752,7 @@ operations:
- defer-hard-irqs
- gro-flush-timeout
- irq-suspend-timeout
+ - threaded
kernel-family:
headers: [ "net/netdev_netlink.h"]
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index d0e3953cae6a..63f98c05860f 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -444,7 +444,18 @@ dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.
Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
-netdev's sysfs directory.
+netdev's sysfs directory. It can also be enabled for a specific napi using
+the netlink interface.
+
+For example, using the script:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-set \
+ --json='{"id": 66,
+ "threaded": 1}'
.. rubric:: Footnotes
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d8544f6a680c..3817720e8b24 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -369,6 +369,7 @@ struct napi_config {
u64 irq_suspend_timeout;
u32 defer_hard_irqs;
cpumask_t affinity_mask;
+ bool threaded;
unsigned int napi_id;
};
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7600bf62dbdf..fac1b8ffeb55 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -134,6 +134,7 @@ enum {
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ NETDEV_A_NAPI_THREADED,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index d0563ddff6ca..ea3c8a30bb97 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6888,6 +6888,27 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+int napi_set_threaded(struct napi_struct *napi, bool threaded)
+{
+ if (threaded) {
+ if (!napi->thread) {
+ int err = napi_kthread_create(napi);
+
+ if (err)
+ return err;
+ }
+ }
+
+ if (napi->config)
+ napi->config->threaded = threaded;
+
+ /* Make sure kthread is created before THREADED bit is set. */
+ smp_mb__before_atomic();
+ assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+
+ return 0;
+}
+
int dev_set_threaded(struct net_device *dev, bool threaded)
{
struct napi_struct *napi;
@@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
napi_hash_add(n);
n->config->napi_id = n->napi_id;
}
+
+ napi_set_threaded(n, n->config->threaded);
}
static void napi_save_config(struct napi_struct *n)
diff --git a/net/core/dev.h b/net/core/dev.h
index e93f36b7ddf3..b50d118ad014 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -315,6 +315,26 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
WRITE_ONCE(n->irq_suspend_timeout, timeout);
}
+/**
+ * napi_get_threaded - get the napi threaded state
+ * @n: napi struct to get the threaded state from
+ *
+ * Return: the per-NAPI threaded state.
+ */
+static inline bool napi_get_threaded(struct napi_struct *n)
+{
+ return test_bit(NAPI_STATE_THREADED, &n->state);
+}
+
+/**
+ * napi_set_threaded - set napi threaded state
+ * @n: napi struct to set the threaded state on
+ * @threaded: whether this napi does threaded polling
+ *
+ * Return: 0 on success and negative errno on failure.
+ */
+int napi_set_threaded(struct napi_struct *n, bool threaded);
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 739f7b6506a6..2791b6b232fa 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -92,11 +92,12 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
};
/* NETDEV_CMD_NAPI_SET - do */
-static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT + 1] = {
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_UINT, 1),
};
/* Ops table for netdev */
@@ -187,7 +188,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.cmd = NETDEV_CMD_NAPI_SET,
.doit = netdev_nl_napi_set_doit,
.policy = netdev_napi_set_nl_policy,
- .maxattr = NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ .maxattr = NETDEV_A_NAPI_THREADED,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
};
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index b64c614a00c4..f7d000a600cf 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -184,6 +184,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
goto nla_put_failure;
+ if (nla_put_uint(rsp, NETDEV_A_NAPI_THREADED,
+ napi_get_threaded(napi)))
+ goto nla_put_failure;
+
if (napi->thread) {
pid = task_pid_nr(napi->thread);
if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
@@ -322,8 +326,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
{
u64 irq_suspend_timeout = 0;
u64 gro_flush_timeout = 0;
+ uint threaded = 0;
u32 defer = 0;
+ if (info->attrs[NETDEV_A_NAPI_THREADED]) {
+ threaded = nla_get_uint(info->attrs[NETDEV_A_NAPI_THREADED]);
+ napi_set_threaded(napi, !!threaded);
+ }
+
if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
napi_set_defer_hard_irqs(napi, defer);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7600bf62dbdf..fac1b8ffeb55 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -134,6 +134,7 @@ enum {
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ NETDEV_A_NAPI_THREADED,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
index beaee5e4e2aa..505564818fa8 100755
--- a/tools/testing/selftests/net/nl_netdev.py
+++ b/tools/testing/selftests/net/nl_netdev.py
@@ -2,6 +2,7 @@
# SPDX-License-Identifier: GPL-2.0
import time
+from os import system
from lib.py import ksft_run, ksft_exit, ksft_pr
from lib.py import ksft_eq, ksft_ge, ksft_busy_wait
from lib.py import NetdevFamily, NetdevSimDev, ip
@@ -34,6 +35,70 @@ def napi_list_check(nf) -> None:
ksft_eq(len(napis), 100,
comment=f"queue count after reset queue {q} mode {i}")
+def napi_set_threaded(nf) -> None:
+ """
+ Test that verifies various cases of napi threaded
+ set and unset at napi and device level.
+ """
+ with NetdevSimDev(queue_count=2) as nsimdev:
+ nsim = nsimdev.nsims[0]
+
+ ip(f"link set dev {nsim.ifname} up")
+
+ napis = nf.napi_get({'ifindex': nsim.ifindex}, dump=True)
+ ksft_eq(len(napis), 2)
+
+ napi0_id = napis[0]['id']
+ napi1_id = napis[1]['id']
+
+ # set napi threaded and verify
+ nf.napi_set({'id': napi0_id, 'threaded': 1})
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 1)
+
+ ip(f"link set dev {nsim.ifname} down")
+ ip(f"link set dev {nsim.ifname} up")
+
+ # verify if napi threaded is still set
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 1)
+
+ # unset napi threaded and verify
+ nf.napi_set({'id': napi0_id, 'threaded': 0})
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 0)
+
+ # set napi threaded for napi0
+ nf.napi_set({'id': napi0_id, 'threaded': 1})
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 1)
+
+ # check it is not set for napi1
+ napi1 = nf.napi_get({'id': napi1_id})
+ ksft_eq(napi1['threaded'], 0)
+
+ # set threaded at device level
+ system(f"echo 1 > /sys/class/net/{nsim.ifname}/threaded")
+
+ # check napi threaded is set for both napis
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 1)
+ napi1 = nf.napi_get({'id': napi1_id})
+ ksft_eq(napi1['threaded'], 1)
+
+ # set napi threaded for napi0
+ nf.napi_set({'id': napi0_id, 'threaded': 1})
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 1)
+
+ # unset threaded at device level
+ system(f"echo 0 > /sys/class/net/{nsim.ifname}/threaded")
+
+ # check napi threaded is unset for both napis
+ napi0 = nf.napi_get({'id': napi0_id})
+ ksft_eq(napi0['threaded'], 0)
+ napi1 = nf.napi_get({'id': napi1_id})
+ ksft_eq(napi1['threaded'], 0)
def nsim_rxq_reset_down(nf) -> None:
"""
@@ -122,7 +187,7 @@ def page_pool_check(nf) -> None:
def main() -> None:
nf = NetdevFamily()
ksft_run([empty_check, lo_check, page_pool_check, napi_list_check,
- nsim_rxq_reset_down],
+ napi_set_threaded, nsim_rxq_reset_down],
args=(nf, ))
ksft_exit()
base-commit: b65999e7238e6f2a48dc77c8c2109c48318ff41b
--
2.49.0.805.g082f7c87e0-goog
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-23 20:14 [PATCH net-next v5] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-04-24 23:13 ` Joe Damato
2025-04-25 18:28 ` Samiullah Khawaja
2025-04-26 0:42 ` Jakub Kicinski
1 sibling, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-24 23:13 UTC
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Wed, Apr 23, 2025 at 08:14:13PM +0000, Samiullah Khawaja wrote:
> A net device has a threaded sysctl that can be used to enable threaded
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
>
> Extend the netlink operation `napi-set` and allow setting the threaded
> attribute of a NAPI. This will enable the threaded polling on a napi
> context.
>
> Add a test in `nl_netdev.py` that verifies various cases of threaded
> napi being set at napi and at device level.
>
> Tested
> ./tools/testing/selftests/net/nl_netdev.py
> TAP version 13
> 1..6
> ok 1 nl_netdev.empty_check
> ok 2 nl_netdev.lo_check
> ok 3 nl_netdev.page_pool_check
> ok 4 nl_netdev.napi_list_check
> ok 5 nl_netdev.napi_set_threaded
> ok 6 nl_netdev.nsim_rxq_reset_down
> # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> v5:
> - This patch was part of:
> https://lore.kernel.org/netdev/Z92e2kCYXQ_RsrJh@LQ3V64L9R2/T/
> It is being sent separately for the first time.
Thanks; I think this is a good change on its own, separate from the
rest of the series, and IMO it makes it easier to get reviewed and
merged.
Probably a matter of personal preference, which you can feel free to
ignore, but IMO splitting this into 3 patches might make it easier
to get Reviewed-bys and make changes?
- core infrastructure changes
- netlink related changes (and docs)
- selftest
Then if feedback comes in for some parts, but not others, it makes
it easier for reviewers in the future to know what was already
reviewed and what was changed?
Just my 2 cents.
The above said, my high level feedback is that I don't think this
addresses the concerns we discussed in the v4 about device-wide
setting vs per-NAPI settings.
See below.
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d0563ddff6ca..ea3c8a30bb97 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6888,6 +6888,27 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> +{
> + if (threaded) {
> + if (!napi->thread) {
> + int err = napi_kthread_create(napi);
> +
> + if (err)
> + return err;
> + }
> + }
> +
> + if (napi->config)
> + napi->config->threaded = threaded;
> +
> + /* Make sure kthread is created before THREADED bit is set. */
> + smp_mb__before_atomic();
> + assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +
> + return 0;
> +}
Hm, I wonder if dev_set_threaded can be cleaned up to use this
instead of repeating similar code in two places?
> +
> int dev_set_threaded(struct net_device *dev, bool threaded)
> {
> struct napi_struct *napi;
> @@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
> napi_hash_add(n);
> n->config->napi_id = n->napi_id;
> }
> +
> + napi_set_threaded(n, n->config->threaded);
It makes sense to me that when restoring the config, the kthread is
kicked off again (assuming config->threaded is set), but does the
napi_save_config path need to stop the thread?
Not sure if kthread_stop is hit via some other path when
napi_disable is called? Can you clarify?
From my testing, it seems like setting threaded: 0 using the CLI
leaves the thread running. Should it work that way? From a high
level that behavior seems somewhat unexpected, don't you think?
[...]
> static void napi_save_config(struct napi_struct *n)
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index b64c614a00c4..f7d000a600cf 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -184,6 +184,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
> goto nla_put_failure;
>
> + if (nla_put_uint(rsp, NETDEV_A_NAPI_THREADED,
> + napi_get_threaded(napi)))
> + goto nla_put_failure;
> +
> if (napi->thread) {
> pid = task_pid_nr(napi->thread);
> if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> @@ -322,8 +326,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
> {
> u64 irq_suspend_timeout = 0;
> u64 gro_flush_timeout = 0;
> + uint threaded = 0;
I think I'd probably make this a u8 or something? uint is not a
commonly used type, AFAICT, and seems out of place here.
> diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
> index beaee5e4e2aa..505564818fa8 100755
> --- a/tools/testing/selftests/net/nl_netdev.py
> +++ b/tools/testing/selftests/net/nl_netdev.py
> @@ -2,6 +2,7 @@
> # SPDX-License-Identifier: GPL-2.0
>
> import time
> +from os import system
> from lib.py import ksft_run, ksft_exit, ksft_pr
> from lib.py import ksft_eq, ksft_ge, ksft_busy_wait
> from lib.py import NetdevFamily, NetdevSimDev, ip
> @@ -34,6 +35,70 @@ def napi_list_check(nf) -> None:
> ksft_eq(len(napis), 100,
> comment=f"queue count after reset queue {q} mode {i}")
>
> +def napi_set_threaded(nf) -> None:
> + """
> + Test that verifies various cases of napi threaded
> + set and unset at napi and device level.
> + """
> + with NetdevSimDev(queue_count=2) as nsimdev:
> + nsim = nsimdev.nsims[0]
> +
> + ip(f"link set dev {nsim.ifname} up")
> +
> + napis = nf.napi_get({'ifindex': nsim.ifindex}, dump=True)
> + ksft_eq(len(napis), 2)
> +
> + napi0_id = napis[0]['id']
> + napi1_id = napis[1]['id']
> +
> + # set napi threaded and verify
> + nf.napi_set({'id': napi0_id, 'threaded': 1})
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 1)
> +
> + ip(f"link set dev {nsim.ifname} down")
> + ip(f"link set dev {nsim.ifname} up")
> +
> + # verify if napi threaded is still set
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 1)
I feel like the test needs to be expanded?
It's not just the attribute that matters (although that bit is
important), but I think it probably is important that the kthread is
still running and that should be checked?
> + # unset napi threaded and verify
> + nf.napi_set({'id': napi0_id, 'threaded': 0})
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 0)
> +
> + # set napi threaded for napi0
> + nf.napi_set({'id': napi0_id, 'threaded': 1})
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 1)
> +
> + # check it is not set for napi1
> + napi1 = nf.napi_get({'id': napi1_id})
> + ksft_eq(napi1['threaded'], 0)
> +
> + # set threaded at device level
> + system(f"echo 1 > /sys/class/net/{nsim.ifname}/threaded")
> +
> + # check napi threaded is set for both napis
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 1)
> + napi1 = nf.napi_get({'id': napi1_id})
> + ksft_eq(napi1['threaded'], 1)
> +
> + # set napi threaded for napi0
> + nf.napi_set({'id': napi0_id, 'threaded': 1})
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 1)
If napi0 is already set to threaded in the previous block for the
device level, why set it explicitly again ?
> +
> + # unset threaded at device level
> + system(f"echo 0 > /sys/class/net/{nsim.ifname}/threaded")
> +
> + # check napi threaded is unset for both napis
> + napi0 = nf.napi_get({'id': napi0_id})
> + ksft_eq(napi0['threaded'], 0)
> + napi1 = nf.napi_get({'id': napi1_id})
> + ksft_eq(napi1['threaded'], 0)
I ran the test and it passes for me.
That said, the test is incomplete or buggy: I've manually identified
one case that is incorrect, which we discussed in the v4, and a
second case that seems buggy from a user perspective.
Case 1 (we discussed this in the v4, but seems like it was missed
here?):
Threaded set to 1 and then to 0 at the device level
echo 1 > /sys/class/net/eni28160np1/threaded
echo 0 > /sys/class/net/eni28160np1/threaded
Check the setting:
cat /sys/class/net/eni28160np1/threaded
0
Dump the settings for netdevsim, noting that threaded is 0, but pid
is set (again, should it be?):
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 20}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 612,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15728,
'threaded': 0},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 611,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15729,
'threaded': 0}]
Now set NAPI 612 to threaded 1:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 612, "threaded": 1}'
Dump the settings:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 20}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 612,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15728,
'threaded': 1},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 611,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15729,
'threaded': 0}]
So far, so good.
Now set device-wide threaded to 0, which should set NAPI 612 to threaded 0:
echo 0 > /sys/class/net/eni28160np1/threaded
Dump settings:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 20}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 612,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15728,
'threaded': 1},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 611,
'ifindex': 20,
'irq-suspend-timeout': 0,
'pid': 15729,
'threaded': 0}]
Note that NAPI 612 is still set to threaded 1, but we set the
device-wide setting to 0.
Shouldn't NAPI 612 be threaded = 0 ?
Second case:
- netdevsim is brought up with 2 queues and 2 napis
- Threaded NAPI is enabled for napi1
- The queue count on the device is reduced from 2 to 1 (therefore
napi1 is disabled) via ethtool -L
Then:
- Is the thread still active? Should it be?
IMO: if the napi is disabled the thread should stop because there is
no NAPI for it to poll. When the napi is enabled, the thread can
start back up.
It does not seem that this is currently the case from manual
testing.
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-24 23:13 ` Joe Damato
@ 2025-04-25 18:28 ` Samiullah Khawaja
2025-04-25 22:24 ` Joe Damato
2025-04-25 23:06 ` Samiullah Khawaja
0 siblings, 2 replies; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-25 18:28 UTC
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Thu, Apr 24, 2025 at 4:13 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Apr 23, 2025 at 08:14:13PM +0000, Samiullah Khawaja wrote:
> > A net device has a threaded sysctl that can be used to enable threaded
> > napi polling on all of the NAPI contexts under that device. Allow
> > enabling threaded napi polling at individual napi level using netlink.
> >
> > Extend the netlink operation `napi-set` and allow setting the threaded
> > attribute of a NAPI. This will enable the threaded polling on a napi
> > context.
> >
> > Add a test in `nl_netdev.py` that verifies various cases of threaded
> > napi being set at napi and at device level.
> >
> > Tested
> > ./tools/testing/selftests/net/nl_netdev.py
> > TAP version 13
> > 1..6
> > ok 1 nl_netdev.empty_check
> > ok 2 nl_netdev.lo_check
> > ok 3 nl_netdev.page_pool_check
> > ok 4 nl_netdev.napi_list_check
> > ok 5 nl_netdev.napi_set_threaded
> > ok 6 nl_netdev.nsim_rxq_reset_down
> > # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
> >
> > v5:
> > - This patch was part of:
> > https://lore.kernel.org/netdev/Z92e2kCYXQ_RsrJh@LQ3V64L9R2/T/
> > It is being sent separately for the first time.
>
> Thanks; I think this is a good change on its own separate from the
> rest of the series and, IMO, I think it makes it easier to get
> reviewed and merged.
Thanks for the review and suggestion to split this out.
>
> Probably a matter of personal preference, which you can feel free to
> ignore, but IMO I think splitting this into 3 patches might make it
> easier to get Reviewed-bys and make changes ?
>
> - core infrastructure changes
> - netlink related changes (and docs)
> - selftest
>
> Then if feedback comes in for some parts, but not others, it makes
> it easier for reviewers in the future to know what was already
> reviewed and what was changed ?
>
> Just my 2 cents.
>
> The above said, my high level feedback is that I don't think this
> addresses the concerns we discussed in the v4 about device-wide
> setting vs per-NAPI settings.
>
> See below.
>
> [...]
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index d0563ddff6ca..ea3c8a30bb97 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6888,6 +6888,27 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
> > return HRTIMER_NORESTART;
> > }
> >
> > +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> > +{
> > + if (threaded) {
> > + if (!napi->thread) {
> > + int err = napi_kthread_create(napi);
> > +
> > + if (err)
> > + return err;
> > + }
> > + }
> > +
> > + if (napi->config)
> > + napi->config->threaded = threaded;
> > +
> > + /* Make sure kthread is created before THREADED bit is set. */
> > + smp_mb__before_atomic();
> > + assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> > +
> > + return 0;
> > +}
>
> Hm, I wonder if dev_set_threaded can be cleaned up to use this
> instead of repeating similar code in two places?
>
> > +
> > int dev_set_threaded(struct net_device *dev, bool threaded)
> > {
> > struct napi_struct *napi;
> > @@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
> > napi_hash_add(n);
> > n->config->napi_id = n->napi_id;
> > }
> > +
> > + napi_set_threaded(n, n->config->threaded);
>
> It makes sense to me that when restoring the config, the kthread is
> kicked off again (assuming config->thread > 0), but does the
> napi_save_config path need to stop the thread?
>
> Not sure if kthread_stop is hit via some other path when
> napi_disable is called? Can you clarify?
NAPI kthreads are not stopped when a napi is disabled. When napis are
disabled, the NAPI_STATE_DISABLED state is set on them and the
associated thread goes to sleep after unsetting the STATE_SCHED. The
kthread_stop is only called on them when the NAPI is deleted. This is
the existing behaviour; please see the napi_disable implementation for
reference. Also, napi_enable doesn't create a new kthread; it just sets
the napi STATE appropriately, and once the NAPI is scheduled again the
thread is woken up. Please see the implementation of napi_schedule for
reference.
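To illustrate, here is a simplified paraphrase (not verbatim, details
elided) of the existing threaded-poll kthread loop in net/core/dev.c:

  static int napi_thread_wait(struct napi_struct *napi)
  {
  	set_current_state(TASK_INTERRUPTIBLE);

  	while (!kthread_should_stop()) {
  		/* Run only when this napi has been scheduled onto the
  		 * thread; a disabled or non-threaded napi just leaves
  		 * the kthread sleeping here.
  		 */
  		if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state)) {
  			__set_current_state(TASK_RUNNING);
  			return 0;
  		}

  		schedule();
  		set_current_state(TASK_INTERRUPTIBLE);
  	}
  	__set_current_state(TASK_RUNNING);

  	/* kthread_stop() was called, i.e. the napi is being deleted. */
  	return -1;
  }

  static int napi_threaded_poll(void *data)
  {
  	struct napi_struct *napi = data;

  	/* The thread only exits when the napi is deleted, not when it
  	 * is disabled or when threaded mode is turned off.
  	 */
  	while (!napi_thread_wait(napi))
  		napi_threaded_poll_loop(napi);

  	return 0;
  }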
>
> From my testing, it seems like setting threaded: 0 using the CLI
> leaves the thread running. Should it work that way? From a high
> level that behavior seems somewhat unexpected, don't you think?
Please see the comment above. The thread goes to sleep and is woken up
when NAPI is enabled and scheduled again.
>
> [...]
>
> > static void napi_save_config(struct napi_struct *n)
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index b64c614a00c4..f7d000a600cf 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -184,6 +184,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> > if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
> > goto nla_put_failure;
> >
> > + if (nla_put_uint(rsp, NETDEV_A_NAPI_THREADED,
> > + napi_get_threaded(napi)))
> > + goto nla_put_failure;
> > +
> > if (napi->thread) {
> > pid = task_pid_nr(napi->thread);
> > if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> > @@ -322,8 +326,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
> > {
> > u64 irq_suspend_timeout = 0;
> > u64 gro_flush_timeout = 0;
> > + uint threaded = 0;
>
> I think I'd probably make this a u8 or something? uint is not a
> commonly used type, AFAICT, and seems out of place here.
Agreed. Will change
>
> > diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
> > index beaee5e4e2aa..505564818fa8 100755
> > --- a/tools/testing/selftests/net/nl_netdev.py
> > +++ b/tools/testing/selftests/net/nl_netdev.py
> > @@ -2,6 +2,7 @@
> > # SPDX-License-Identifier: GPL-2.0
> >
> > import time
> > +from os import system
> > from lib.py import ksft_run, ksft_exit, ksft_pr
> > from lib.py import ksft_eq, ksft_ge, ksft_busy_wait
> > from lib.py import NetdevFamily, NetdevSimDev, ip
> > @@ -34,6 +35,70 @@ def napi_list_check(nf) -> None:
> > ksft_eq(len(napis), 100,
> > comment=f"queue count after reset queue {q} mode {i}")
> >
> > +def napi_set_threaded(nf) -> None:
> > + """
> > + Test that verifies various cases of napi threaded
> > + set and unset at napi and device level.
> > + """
> > + with NetdevSimDev(queue_count=2) as nsimdev:
> > + nsim = nsimdev.nsims[0]
> > +
> > + ip(f"link set dev {nsim.ifname} up")
> > +
> > + napis = nf.napi_get({'ifindex': nsim.ifindex}, dump=True)
> > + ksft_eq(len(napis), 2)
> > +
> > + napi0_id = napis[0]['id']
> > + napi1_id = napis[1]['id']
> > +
> > + # set napi threaded and verify
> > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 1)
> > +
> > + ip(f"link set dev {nsim.ifname} down")
> > + ip(f"link set dev {nsim.ifname} up")
> > +
> > + # verify if napi threaded is still set
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 1)
>
> I feel like the test needs to be expanded?
>
> It's not just the attribute that matters (although that bit is
> important), but I think it probably is important that the kthread is
> still running and that should be checked ?
>
> > + # unset napi threaded and verify
> > + nf.napi_set({'id': napi0_id, 'threaded': 0})
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 0)
> > +
> > + # set napi threaded for napi0
> > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 1)
> > +
> > + # check it is not set for napi1
> > + napi1 = nf.napi_get({'id': napi1_id})
> > + ksft_eq(napi1['threaded'], 0)
> > +
> > + # set threaded at device level
> > + system(f"echo 1 > /sys/class/net/{nsim.ifname}/threaded")
> > +
> > + # check napi threaded is set for both napis
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 1)
> > + napi1 = nf.napi_get({'id': napi1_id})
> > + ksft_eq(napi1['threaded'], 1)
> > +
> > + # set napi threaded for napi0
> > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 1)
>
> If napi0 is already set to threaded in the previous block for the
> device level, why set it explicitly again ?
Will remove
>
> > +
> > + # unset threaded at device level
> > + system(f"echo 0 > /sys/class/net/{nsim.ifname}/threaded")
> > +
> > + # check napi threaded is unset for both napis
> > + napi0 = nf.napi_get({'id': napi0_id})
> > + ksft_eq(napi0['threaded'], 0)
> > + napi1 = nf.napi_get({'id': napi1_id})
> > + ksft_eq(napi1['threaded'], 0)
>
> I ran the test and it passes for me.
>
> That said, the test is incomplete or buggy because I've manually
> identified 1 case that is incorrect which we discussed in the v4 and
> a second case that seems buggy from a user perspective.
>
> Case 1 (we discussed this in the v4, but seems like it was missed
> here?):
>
> Threaded set to 1 and then to 0 at the device level
>
> echo 1 > /sys/class/net/eni28160np1/threaded
> echo 0 > /sys/class/net/eni28160np1/threaded
>
> Check the setting:
>
> cat /sys/class/net/eni28160np1/threaded
> 0
>
> Dump the settings for netdevsim, noting that threaded is 0, but pid
> is set (again, should it be?):
>
> ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/netdev.yaml \
> --dump napi-get --json='{"ifindex": 20}'
>
> [{'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 612,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15728,
> 'threaded': 0},
> {'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 611,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15729,
> 'threaded': 0}]
As explained in the comment earlier, since the kthread is still valid
and associated with the napi, the PID is valid. I just verified that
this is the existing behaviour. Not sure whether the pid should be
hidden if the threaded state is not enabled? Do you think we should
change this behaviour?
>
> Now set NAPI 612 to threaded 1:
>
> ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/netdev.yaml \
> --do napi-set --json='{"id": 612, "threaded": 1}'
>
> Dump the settings:
>
> ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/netdev.yaml \
> --dump napi-get --json='{"ifindex": 20}'
>
> [{'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 612,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15728,
> 'threaded': 1},
> {'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 611,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15729,
> 'threaded': 0}]
>
> So far, so good.
>
> Now set device-wide threaded to 0, which should set NAPI 612 to threaded 0:
>
> echo 0 > /sys/class/net/eni28160np1/threaded
>
> Dump settings:
>
> ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/netdev.yaml \
> --dump napi-get --json='{"ifindex": 20}'
>
> [{'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 612,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15728,
> 'threaded': 1},
> {'defer-hard-irqs': 0,
> 'gro-flush-timeout': 0,
> 'id': 611,
> 'ifindex': 20,
> 'irq-suspend-timeout': 0,
> 'pid': 15729,
> 'threaded': 0}]
>
> Note that NAPI 612 is still set to threaded 1, but we set the
> device-wide setting to 0.
>
> Shouldn't NAPI 612 be threaded = 0 ?
Yes it should be. I will add a test for this and also remove the
initial check in dev_set_threaded to fix this behaviour.
>
> Second case:
> - netdevsim is brought up with 2 queues and 2 napis
> - Threaded NAPI is enabled for napi1
> - The queue count on the device is reduced from 2 to 1 (therefore
> napi1 is disabled) via ethtool -L
>
> Then:
>
> - Is the thread still active? Should it be?
>
> IMO: if the napi is disabled the thread should stop because there is
> no NAPI for it to poll. When the napi is enabled, the thread can
> start back up.
Please see the comment above for the explanation, but to reiterate:
the thread is there but not actively being scheduled, and it sleeps
until the napi is enabled again and scheduled.
>
> It does not seem that this is currently the case from manual
> testing.
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-25 18:28 ` Samiullah Khawaja
@ 2025-04-25 22:24 ` Joe Damato
2025-04-25 22:52 ` Samiullah Khawaja
2025-04-25 23:06 ` Samiullah Khawaja
1 sibling, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-25 22:24 UTC
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 11:28:11AM -0700, Samiullah Khawaja wrote:
> On Thu, Apr 24, 2025 at 4:13 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 08:14:13PM +0000, Samiullah Khawaja wrote:
[...]
> > Thanks; I think this is a good change on its own separate from the
> > rest of the series and, IMO, I think it makes it easier to get
> > reviewed and merged.
> Thanks for the review and suggestion to split this out.
Sure. Up to you if you want to split it out or not.
[...]
> > > +
> > > int dev_set_threaded(struct net_device *dev, bool threaded)
> > > {
> > > struct napi_struct *napi;
> > > @@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
> > > napi_hash_add(n);
> > > n->config->napi_id = n->napi_id;
> > > }
> > > +
> > > + napi_set_threaded(n, n->config->threaded);
> >
> > It makes sense to me that when restoring the config, the kthread is
> > kicked off again (assuming config->thread > 0), but does the
> > napi_save_config path need to stop the thread?
>
> >
> > Not sure if kthread_stop is hit via some other path when
> > napi_disable is called? Can you clarify?
> NAPI kthread are not stopped when napi is disabled. When napis are
> disabled, the NAPI_STATE_DISABLED state is set on them and the
> associated thread goes to sleep after unsetting the STATE_SCHED. The
> kthread_stop is only called on them when NAPI is deleted. This is the
> existing behaviour. Please see napi_disable implementation for
> reference. Also napi_enable doesn't create a new kthread and just sets
> the napi STATE appropriately and once the NAPI is scheduled again the
> thread is woken up. Please seem implementation of napi_schedule for
> reference.
Yeah, but it seems:
- Weird from a user perspective, and
- Keeps the pid around as shown below even if threaded NAPI is
inactive, which seems weird, too.
> > I ran the test and it passes for me.
> >
> > That said, the test is incomplete or buggy because I've manually
> > identified 1 case that is incorrect which we discussed in the v4 and
> > a second case that seems buggy from a user perspective.
> >
> > Case 1 (we discussed this in the v4, but seems like it was missed
> > here?):
> >
> > Threaded set to 1 and then to 0 at the device level
> >
> > echo 1 > /sys/class/net/eni28160np1/threaded
> > echo 0 > /sys/class/net/eni28160np1/threaded
> >
> > Check the setting:
> >
> > cat /sys/class/net/eni28160np1/threaded
> > 0
> >
> > Dump the settings for netdevsim, noting that threaded is 0, but pid
> > is set (again, should it be?):
> >
> > ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/netdev.yaml \
> > --dump napi-get --json='{"ifindex": 20}'
> >
> > [{'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 612,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15728,
> > 'threaded': 0},
> > {'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 611,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15729,
> > 'threaded': 0}]
> As explained in the comment earlier, since the kthread is still valid
> and associated with the napi, the PID is valid. I just verified that
> this is the existing behaviour. Not sure whether the pid should be
> hidden if the threaded state is not enabled? Do you think we should
> change this behaviour?
I don't know, but I do think it's pretty weird from the user
perspective.
Probably need a maintainer to weigh-in on what the preferred
behavior is. Maybe there's a reason the thread isn't killed.
Overall, it feels weird to me that disabling threaded NAPI leaves
the pid in the netlink output and also keeps the thread running
(despite being blocked).
I think:
- If the thread is kept running then the pid probably should be
left in the output. Because if it's hidden but still running (and
thus still visible in ps or top or whatever), that's even
stranger/more confusing?
- If the thread is killed when threaded NAPI is disabled, then the
pid should be wiped.
My personal preference is for the second option because it seems
more sensible from the user's point of view, but maybe there's a
reason it works the way it does that I don't know or understand.
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-25 22:24 ` Joe Damato
@ 2025-04-25 22:52 ` Samiullah Khawaja
2025-04-26 0:37 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-25 22:52 UTC
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 3:24 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Fri, Apr 25, 2025 at 11:28:11AM -0700, Samiullah Khawaja wrote:
> > On Thu, Apr 24, 2025 at 4:13 PM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > On Wed, Apr 23, 2025 at 08:14:13PM +0000, Samiullah Khawaja wrote:
>
> [...]
>
> > > Thanks; I think this is a good change on its own separate from the
> > > rest of the series and, IMO, I think it makes it easier to get
> > > reviewed and merged.
> > Thanks for the review and suggestion to split this out.
>
> Sure. Up to you if you want to split it out or not.
I meant splitting it from the main series about busy polling.
>
> [...]
>
> > > > +
> > > > int dev_set_threaded(struct net_device *dev, bool threaded)
> > > > {
> > > > struct napi_struct *napi;
> > > > @@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
> > > > napi_hash_add(n);
> > > > n->config->napi_id = n->napi_id;
> > > > }
> > > > +
> > > > + napi_set_threaded(n, n->config->threaded);
> > >
> > > It makes sense to me that when restoring the config, the kthread is
> > > kicked off again (assuming config->thread > 0), but does the
> > > napi_save_config path need to stop the thread?
> >
> > >
> > > Not sure if kthread_stop is hit via some other path when
> > > napi_disable is called? Can you clarify?
> > NAPI kthread are not stopped when napi is disabled. When napis are
> > disabled, the NAPI_STATE_DISABLED state is set on them and the
> > associated thread goes to sleep after unsetting the STATE_SCHED. The
> > kthread_stop is only called on them when NAPI is deleted. This is the
> > existing behaviour. Please see napi_disable implementation for
> > reference. Also napi_enable doesn't create a new kthread and just sets
> > the napi STATE appropriately and once the NAPI is scheduled again the
> > thread is woken up. Please seem implementation of napi_schedule for
> > reference.
>
> Yea but seems:
> - Weird from a user perspective, and
> - Keeps the pid around as shown below even if threaded NAPI is
> inactive, which seems weird, too.
>
> > > I ran the test and it passes for me.
> > >
> > > That said, the test is incomplete or buggy because I've manually
> > > identified 1 case that is incorrect which we discussed in the v4 and
> > > a second case that seems buggy from a user perspective.
> > >
> > > Case 1 (we discussed this in the v4, but seems like it was missed
> > > here?):
> > >
> > > Threaded set to 1 and then to 0 at the device level
> > >
> > > echo 1 > /sys/class/net/eni28160np1/threaded
> > > echo 0 > /sys/class/net/eni28160np1/threaded
> > >
> > > Check the setting:
> > >
> > > cat /sys/class/net/eni28160np1/threaded
> > > 0
> > >
> > > Dump the settings for netdevsim, noting that threaded is 0, but pid
> > > is set (again, should it be?):
> > >
> > > ./tools/net/ynl/pyynl/cli.py \
> > > --spec Documentation/netlink/specs/netdev.yaml \
> > > --dump napi-get --json='{"ifindex": 20}'
> > >
> > > [{'defer-hard-irqs': 0,
> > > 'gro-flush-timeout': 0,
> > > 'id': 612,
> > > 'ifindex': 20,
> > > 'irq-suspend-timeout': 0,
> > > 'pid': 15728,
> > > 'threaded': 0},
> > > {'defer-hard-irqs': 0,
> > > 'gro-flush-timeout': 0,
> > > 'id': 611,
> > > 'ifindex': 20,
> > > 'irq-suspend-timeout': 0,
> > > 'pid': 15729,
> > > 'threaded': 0}]
> > As explained in the comment earlier, since the kthread is still valid
> > and associated with the napi, the PID is valid. I just verified that
> > this is the existing behaviour. Not sure whether the pid should be
> > hidden if the threaded state is not enabled? Do you think we should
> > change this behaviour?
>
> I don't know, but I do think it's pretty weird from the user
> perspective.
>
> Probably need a maintainer to weigh-in on what the preferred
> behavior is. Maybe there's a reason the thread isn't killed.
+1
I think the reason behind it not being killed is because the user
might have already done some configuration using the PID and if the
kthread was removed, the user would have to do that configuration
again after enable/disable. But I am just speculating. I will let the
maintainers weigh in, as you suggested.
>
> Overall, it feels weird to me that disabling threaded NAPI leaves
> the pid in the netlink output and also keeps the thread running
> (despite being blocked).
>
> I think:
> - If the thread is kept running then the pid probably should be
> left in the output. Because if its hidden but still running (and
> thus still visible in ps or top or whatever), that's even
> stranger/more confusing?
>
> - If the thread is killed when threaded NAPI is disabled, then the
> pid should be wiped.
I agree. I think in the selftest I will also add PID checks just to
explicitly verify the existing behaviour is preserved.
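Something along these lines, for instance (just a sketch, reusing the
helpers already used in nl_netdev.py):

  # enable threaded on napi0 and note the reported kthread pid
  nf.napi_set({'id': napi0_id, 'threaded': 1})
  napi0 = nf.napi_get({'id': napi0_id})
  pid = napi0['pid']

  # disable threaded; with the existing behaviour the kthread stays
  # around, so the same pid should still be reported
  nf.napi_set({'id': napi0_id, 'threaded': 0})
  napi0 = nf.napi_get({'id': napi0_id})
  ksft_eq(napi0['threaded'], 0)
  ksft_eq(napi0['pid'], pid)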
>
> My personal preference is for the second option because it seems
> more sensible from the user's point of view, but maybe there's a
> reason it works the way it does that I don't know or understand.
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-25 18:28 ` Samiullah Khawaja
2025-04-25 22:24 ` Joe Damato
@ 2025-04-25 23:06 ` Samiullah Khawaja
1 sibling, 0 replies; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-25 23:06 UTC
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 11:28 AM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Thu, Apr 24, 2025 at 4:13 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 08:14:13PM +0000, Samiullah Khawaja wrote:
> > > A net device has a threaded sysctl that can be used to enable threaded
> > > napi polling on all of the NAPI contexts under that device. Allow
> > > enabling threaded napi polling at individual napi level using netlink.
> > >
> > > Extend the netlink operation `napi-set` and allow setting the threaded
> > > attribute of a NAPI. This will enable the threaded polling on a napi
> > > context.
> > >
> > > Add a test in `nl_netdev.py` that verifies various cases of threaded
> > > napi being set at napi and at device level.
> > >
> > > Tested
> > > ./tools/testing/selftests/net/nl_netdev.py
> > > TAP version 13
> > > 1..6
> > > ok 1 nl_netdev.empty_check
> > > ok 2 nl_netdev.lo_check
> > > ok 3 nl_netdev.page_pool_check
> > > ok 4 nl_netdev.napi_list_check
> > > ok 5 nl_netdev.napi_set_threaded
> > > ok 6 nl_netdev.nsim_rxq_reset_down
> > > # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
> > >
> > > v5:
> > > - This patch was part of:
> > > https://lore.kernel.org/netdev/Z92e2kCYXQ_RsrJh@LQ3V64L9R2/T/
> > > It is being sent separately for the first time.
> >
> > Thanks; I think this is a good change on its own separate from the
> > rest of the series and, IMO, I think it makes it easier to get
> > reviewed and merged.
> Thanks for the review and suggestion to split this out.
> >
> > Probably a matter of personal preference, which you can feel free to
> > ignore, but IMO I think splitting this into 3 patches might make it
> > easier to get Reviewed-bys and make changes ?
> >
> > - core infrastructure changes
> > - netlink related changes (and docs)
> > - selftest
> >
> > Then if feedback comes in for some parts, but not others, it makes
> > it easier for reviewers in the future to know what was already
> > reviewed and what was changed ?
> >
> > Just my 2 cents.
> >
> > The above said, my high level feedback is that I don't think this
> > addresses the concerns we discussed in the v4 about device-wide
> > setting vs per-NAPI settings.
> >
> > See below.
> >
> > [...]
> >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index d0563ddff6ca..ea3c8a30bb97 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6888,6 +6888,27 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
> > > return HRTIMER_NORESTART;
> > > }
> > >
> > > +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> > > +{
> > > + if (threaded) {
> > > + if (!napi->thread) {
> > > + int err = napi_kthread_create(napi);
> > > +
> > > + if (err)
> > > + return err;
> > > + }
> > > + }
> > > +
> > > + if (napi->config)
> > > + napi->config->threaded = threaded;
> > > +
> > > + /* Make sure kthread is created before THREADED bit is set. */
> > > + smp_mb__before_atomic();
> > > + assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> > > +
> > > + return 0;
> > > +}
> >
> > Hm, I wonder if dev_set_threaded can be cleaned up to use this
> > instead of repeating similar code in two places?
I wanted to add something about this part and forgot. The reason the
dev_set_threaded handling is a little different from the individual
NAPI case is that the dev->threaded state on a device must only be set
if all underlying NAPIs are guaranteed to end up in the same state.
That is why dev_set_threaded does this in 3 steps:
1- Create a kthread for each NAPI. This can fail, and if it fails the
threaded state is set to disabled/0, since at least one NAPI kthread
failed to create.
2- Set the threaded state on dev->threaded.
3- Set the threaded state flag on each NAPI; this cannot fail.
If we used napi_set_threaded in the dev_set_threaded implementation,
we would be doing something like the following:
1- Do napi_set_threaded for each NAPI.
2- If a NAPI fails, revert to the disabled state and reset the NAPI
state to disabled for all NAPIs in the dev.
3- Set the threaded state on dev->threaded (or maybe set it at the
start and set it here again on failure).
I think this would complicate the logic and also not preserve the
current behaviour and order of operations, so I am keeping the
dev_set_threaded implementation separate.
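For reference, a rough sketch of the current dev_set_threaded
(paraphrased from net/core/dev.c, details elided) showing the three
steps:

  int dev_set_threaded(struct net_device *dev, bool threaded)
  {
  	struct napi_struct *napi;
  	int err = 0;

  	/* This early return is the "initial check" discussed earlier in
  	 * the thread; it skips the per-napi overwrite whenever
  	 * dev->threaded already matches the requested value.
  	 */
  	if (dev->threaded == threaded)
  		return 0;

  	if (threaded) {
  		/* Step 1: create the kthreads; on failure fall back to
  		 * disabled for the whole device.
  		 */
  		list_for_each_entry(napi, &dev->napi_list, dev_list) {
  			if (!napi->thread) {
  				err = napi_kthread_create(napi);
  				if (err) {
  					threaded = false;
  					break;
  				}
  			}
  		}
  	}

  	/* Step 2: record the device-wide state. */
  	WRITE_ONCE(dev->threaded, threaded);

  	/* Step 3: flip the per-napi flag; this cannot fail. Make sure
  	 * the kthread is created before the THREADED bit is set.
  	 */
  	smp_mb__before_atomic();
  	list_for_each_entry(napi, &dev->napi_list, dev_list)
  		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);

  	return err;
  }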
> >
> > > +
> > > int dev_set_threaded(struct net_device *dev, bool threaded)
> > > {
> > > struct napi_struct *napi;
> > > @@ -7144,6 +7165,8 @@ static void napi_restore_config(struct napi_struct *n)
> > > napi_hash_add(n);
> > > n->config->napi_id = n->napi_id;
> > > }
> > > +
> > > + napi_set_threaded(n, n->config->threaded);
> >
> > It makes sense to me that when restoring the config, the kthread is
> > kicked off again (assuming config->thread > 0), but does the
> > napi_save_config path need to stop the thread?
>
> >
> > Not sure if kthread_stop is hit via some other path when
> > napi_disable is called? Can you clarify?
> NAPI kthread are not stopped when napi is disabled. When napis are
> disabled, the NAPI_STATE_DISABLED state is set on them and the
> associated thread goes to sleep after unsetting the STATE_SCHED. The
> kthread_stop is only called on them when NAPI is deleted. This is the
> existing behaviour. Please see napi_disable implementation for
> reference. Also napi_enable doesn't create a new kthread and just sets
> the napi STATE appropriately and once the NAPI is scheduled again the
> thread is woken up. Please seem implementation of napi_schedule for
> reference.
> >
> > From my testing, it seems like setting threaded: 0 using the CLI
> > leaves the thread running. Should it work that way? From a high
> > level that behavior seems somewhat unexpected, don't you think?
> Please see the comment above. The thread goes to sleep and is woken up
> when NAPI is enabled and scheduled again.
> >
> > [...]
> >
> > > static void napi_save_config(struct napi_struct *n)
> > > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > > index b64c614a00c4..f7d000a600cf 100644
> > > --- a/net/core/netdev-genl.c
> > > +++ b/net/core/netdev-genl.c
> > > @@ -184,6 +184,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> > > if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
> > > goto nla_put_failure;
> > >
> > > + if (nla_put_uint(rsp, NETDEV_A_NAPI_THREADED,
> > > + napi_get_threaded(napi)))
> > > + goto nla_put_failure;
> > > +
> > > if (napi->thread) {
> > > pid = task_pid_nr(napi->thread);
> > > if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> > > @@ -322,8 +326,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
> > > {
> > > u64 irq_suspend_timeout = 0;
> > > u64 gro_flush_timeout = 0;
> > > + uint threaded = 0;
> >
> > I think I'd probably make this a u8 or something? uint is not a
> > commonly used type, AFAICT, and seems out of place here.
> Agreed. Will change
> >
> > > diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
> > > index beaee5e4e2aa..505564818fa8 100755
> > > --- a/tools/testing/selftests/net/nl_netdev.py
> > > +++ b/tools/testing/selftests/net/nl_netdev.py
> > > @@ -2,6 +2,7 @@
> > > # SPDX-License-Identifier: GPL-2.0
> > >
> > > import time
> > > +from os import system
> > > from lib.py import ksft_run, ksft_exit, ksft_pr
> > > from lib.py import ksft_eq, ksft_ge, ksft_busy_wait
> > > from lib.py import NetdevFamily, NetdevSimDev, ip
> > > @@ -34,6 +35,70 @@ def napi_list_check(nf) -> None:
> > > ksft_eq(len(napis), 100,
> > > comment=f"queue count after reset queue {q} mode {i}")
> > >
> > > +def napi_set_threaded(nf) -> None:
> > > + """
> > > + Test that verifies various cases of napi threaded
> > > + set and unset at napi and device level.
> > > + """
> > > + with NetdevSimDev(queue_count=2) as nsimdev:
> > > + nsim = nsimdev.nsims[0]
> > > +
> > > + ip(f"link set dev {nsim.ifname} up")
> > > +
> > > + napis = nf.napi_get({'ifindex': nsim.ifindex}, dump=True)
> > > + ksft_eq(len(napis), 2)
> > > +
> > > + napi0_id = napis[0]['id']
> > > + napi1_id = napis[1]['id']
> > > +
> > > + # set napi threaded and verify
> > > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 1)
> > > +
> > > + ip(f"link set dev {nsim.ifname} down")
> > > + ip(f"link set dev {nsim.ifname} up")
> > > +
> > > + # verify if napi threaded is still set
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 1)
> >
> > I feel like the test needs to be expanded?
> >
> > It's not just the attribute that matters (although that bit is
> > important), but I think it probably is important that the kthread is
> > still running and that should be checked ?
> >
> > > + # unset napi threaded and verify
> > > + nf.napi_set({'id': napi0_id, 'threaded': 0})
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 0)
> > > +
> > > + # set napi threaded for napi0
> > > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 1)
> > > +
> > > + # check it is not set for napi1
> > > + napi1 = nf.napi_get({'id': napi1_id})
> > > + ksft_eq(napi1['threaded'], 0)
> > > +
> > > + # set threaded at device level
> > > + system(f"echo 1 > /sys/class/net/{nsim.ifname}/threaded")
> > > +
> > > + # check napi threaded is set for both napis
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 1)
> > > + napi1 = nf.napi_get({'id': napi1_id})
> > > + ksft_eq(napi1['threaded'], 1)
> > > +
> > > + # set napi threaded for napi0
> > > + nf.napi_set({'id': napi0_id, 'threaded': 1})
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 1)
> >
> > If napi0 is already set to threaded in the previous block for the
> > device level, why set it explicitly again ?
> Will remove
> >
> > > +
> > > + # unset threaded at device level
> > > + system(f"echo 0 > /sys/class/net/{nsim.ifname}/threaded")
> > > +
> > > + # check napi threaded is unset for both napis
> > > + napi0 = nf.napi_get({'id': napi0_id})
> > > + ksft_eq(napi0['threaded'], 0)
> > > + napi1 = nf.napi_get({'id': napi1_id})
> > > + ksft_eq(napi1['threaded'], 0)
> >
> > I ran the test and it passes for me.
> >
> > That said, the test is incomplete or buggy because I've manually
> > identified 1 case that is incorrect which we discussed in the v4 and
> > a second case that seems buggy from a user perspective.
> >
> > Case 1 (we discussed this in the v4, but seems like it was missed
> > here?):
> >
> > Threaded set to 1 and then to 0 at the device level
> >
> > echo 1 > /sys/class/net/eni28160np1/threaded
> > echo 0 > /sys/class/net/eni28160np1/threaded
> >
> > Check the setting:
> >
> > cat /sys/class/net/eni28160np1/threaded
> > 0
> >
> > Dump the settings for netdevsim, noting that threaded is 0, but pid
> > is set (again, should it be?):
> >
> > ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/netdev.yaml \
> > --dump napi-get --json='{"ifindex": 20}'
> >
> > [{'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 612,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15728,
> > 'threaded': 0},
> > {'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 611,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15729,
> > 'threaded': 0}]
> As explained in the comment earlier, since the kthread is still valid
> and associated with the napi, the PID is valid. I just verified that
> this is the existing behaviour. I am not sure whether the pid should
> be hidden when the threaded state is not enabled. Do you think we
> should change this behaviour?
> >
> > Now set NAPI 612 to threaded 1:
> >
> > ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/netdev.yaml \
> > --do napi-set --json='{"id": 612, "threaded": 1}'
> >
> > Dump the settings:
> >
> > ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/netdev.yaml \
> > --dump napi-get --json='{"ifindex": 20}'
> >
> > [{'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 612,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15728,
> > 'threaded': 1},
> > {'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 611,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15729,
> > 'threaded': 0}]
> >
> > So far, so good.
> >
> > Now set device-wide threaded to 0, which should set NAPI 612 to threaded 0:
> >
> > echo 0 > /sys/class/net/eni28160np1/threaded
> >
> > Dump settings:
> >
> > ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/netdev.yaml \
> > --dump napi-get --json='{"ifindex": 20}'
> >
> > [{'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 612,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15728,
> > 'threaded': 1},
> > {'defer-hard-irqs': 0,
> > 'gro-flush-timeout': 0,
> > 'id': 611,
> > 'ifindex': 20,
> > 'irq-suspend-timeout': 0,
> > 'pid': 15729,
> > 'threaded': 0}]
> >
> > Note that NAPI 612 is still set to threaded 1, but we set the
> > device-wide setting to 0.
> >
> > Shouldn't NAPI 612 be threaded = 0 ?
> Yes, it should be. I will add a test for this and also remove the
> initial check in dev_set_threaded to fix this behaviour.
> >
> > Second case:
> > - netdevsim is brought up with 2 queues and 2 napis
> > - Threaded NAPI is enabled for napi1
> > - The queue count on the device is reduced from 2 to 1 (therefore
> > napi1 is disabled) via ethtool -L
> >
> > Then:
> >
> > - Is the thread still active? Should it be?
> >
> > IMO: if the napi is disabled the thread should stop because there is
> > no NAPI for it to poll. When the napi is enabled, the thread can
> > start back up.
> Please see the comment above for the explanation, but to reiterate:
> the thread is still there but not actively scheduled; it sleeps until
> the napi is enabled and scheduled again.
> >
> > It does not seem that this is currently the case from manual
> > testing.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-25 22:52 ` Samiullah Khawaja
@ 2025-04-26 0:37 ` Jakub Kicinski
2025-04-26 2:34 ` Joe Damato
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-26 0:37 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Joe Damato, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, 25 Apr 2025 15:52:30 -0700 Samiullah Khawaja wrote:
> > Probably need a maintainer to weigh in on what the preferred
> > behavior is. Maybe there's a reason the thread isn't killed.
> +1
>
> I think the reason behind it not being killed is because the user
> might have already done some configuration using the PID and if the
> kthread was removed, the user would have to do that configuration
> again after enable/disable. But I am just speculating. I will let the
> maintainers weigh in as you suggested.
I haven't looked at the code, but I think it may be something more
trivial, namely that napi_enable() returns void, so it can't fail.
Also it may be called under a spin lock.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-23 20:14 [PATCH net-next v5] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-04-24 23:13 ` Joe Damato
@ 2025-04-26 0:42 ` Jakub Kicinski
2025-04-26 2:31 ` Joe Damato
1 sibling, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-26 0:42 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina,
willemb, jdamato, mkarsten, netdev
On Wed, 23 Apr 2025 20:14:13 +0000 Samiullah Khawaja wrote:
> A net device has a threaded sysctl that can be used to enable threaded
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
>
> Extend the netlink operation `napi-set` and allow setting the threaded
> attribute of a NAPI. This will enable the threaded polling on a napi
> context.
I think I haven't replied to you on the config recommendation about
how global vs per-object config should behave. I implemented the
suggested scheme for rx-buf-len to make sure it's not a crazy ask:
https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/
and I do like it more.
Joe, Stanislav and Mina all read that series and are CCed here.
What do y'all think? Should we make the threaded config work like
rx-buf-len, where if the user sets it on a NAPI it takes precedence
over the global config? Or stick to the simplistic thing of last
write wins?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 0:42 ` Jakub Kicinski
@ 2025-04-26 2:31 ` Joe Damato
2025-04-26 14:41 ` Willem de Bruijn
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-26 2:31 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Samiullah Khawaja, David S . Miller , Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 05:42:51PM -0700, Jakub Kicinski wrote:
> On Wed, 23 Apr 2025 20:14:13 +0000 Samiullah Khawaja wrote:
> > A net device has a threaded sysctl that can be used to enable threaded
> > napi polling on all of the NAPI contexts under that device. Allow
> > enabling threaded napi polling at individual napi level using netlink.
> >
> > Extend the netlink operation `napi-set` and allow setting the threaded
> > attribute of a NAPI. This will enable the threaded polling on a napi
> > context.
>
> I think I haven't replied to you on the config recommendation about
> how global vs per-object config should behave. I implemented the
> suggested scheme for rx-buf-len to make sure it's not a crazy ask:
> https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/
> and I do like it more.
>
> Joe, Stanislav and Mina all read that series and are CCed here.
> What do y'all think? Should we make the threaded config work like
> rx-buf-len, where if the user sets it on a NAPI it takes precedence
> over the global config? Or stick to the simplistic thing of last
> write wins?
For the per-NAPI defer-hard-irqs (for example):
- writing to the NIC-wide sysfs path overwrites all of the
individual NAPI settings to be the global setting written
- writing to an individual NAPI, though, the setting takes
precedence over the global
So, if you wrote 100 to the global path, then 5 to a specific NAPI,
then 200 again to the global path, IIRC the NAPI would go through:
- being set to 100 (from the global path write)
- being set to 5 (for its NAPI specific write)
- being set to 200 (from the final global path write)
The individual NAPI setting takes precedence over the global
setting; but the individual setting is re-written when the global
value is adjusted.
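Maybe this is clearer in pseudocode (purely illustrative; the names
are made up and this is not the actual kernel code):
    napi_defer = {}  # per-NAPI values, keyed by napi id
    def write_global(napis, val):
        # a NIC-wide sysfs write rewrites every per-NAPI value
        for n in napis:
            napi_defer[n] = val
    def write_napi(n, val):
        # a per-NAPI netlink write only touches that one NAPI
        napi_defer[n] = val
    write_global([0, 1], 100)  # both NAPIs -> 100
    write_napi(0, 5)           # NAPI 0 -> 5, NAPI 1 stays 100
    write_global([0, 1], 200)  # both NAPIs -> 200 again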
Can't tell if that's clear or if I just made it worse ;)
Anyway: I have a preference for consistency when possible, so IMHO,
it would be nice if:
- Writing to NIC-wide threaded sets all NAPIs to the value written
to the NIC-wide setting
- Individual NAPIs can have threaded enabled/disabled, which takes
precedence over global
But IDK if that's realistic/desirable or even what everyone else
prefers :)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 0:37 ` Jakub Kicinski
@ 2025-04-26 2:34 ` Joe Damato
2025-04-26 2:47 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-26 2:34 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Samiullah Khawaja, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 05:37:43PM -0700, Jakub Kicinski wrote:
> On Fri, 25 Apr 2025 15:52:30 -0700 Samiullah Khawaja wrote:
> > > Probably need a maintainer to weigh in on what the preferred
> > > behavior is. Maybe there's a reason the thread isn't killed.
> > +1
> >
> > I think the reason behind it not being killed is because the user
> > might have already done some configuration using the PID and if the
> > kthread was removed, the user would have to do that configuration
> > again after enable/disable. But I am just speculating. I will let the
> > maintainers weigh in as you suggested.
>
> I haven't looked at the code, but I think it may be something more
> trivial, namely that napi_enable() returns void, so it can't fail.
> Also it may be called under a spin lock.
If you don't mind me asking: what do you think at a higher level
on the discussion about threaded NAPI being disabled?
It seems like the current behavior is:
- If you write 1 to the threaded NAPI sysfs path, kthreads are
kicked off and start running.
- If you write 0, the threads are not killed but don't do any
processing and their pids are still exported in netlink.
I was arguing that disabling threading should mean the thread is
killed and the pid is no longer exported (as a side effect), because
it seemed weird to me that the netlink output would say:
pid: 1234
threaded: 0
in the current implementation.
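I.e. with the selftest helpers from this patch, a check along these
lines would fail today (just a sketch; napi0_id as in the test above):
    nf.napi_set({'id': napi0_id, 'threaded': 0})
    napi0 = nf.napi_get({'id': napi0_id})
    ksft_eq(napi0['threaded'], 0)
    # 'pid' is still reported even though threaded is 0:
    ksft_eq('pid' in napi0, False)  # currently fails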
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 2:34 ` Joe Damato
@ 2025-04-26 2:47 ` Jakub Kicinski
2025-04-26 3:12 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-26 2:47 UTC (permalink / raw)
To: Joe Damato
Cc: Samiullah Khawaja, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, 25 Apr 2025 19:34:52 -0700 Joe Damato wrote:
> On Fri, Apr 25, 2025 at 05:37:43PM -0700, Jakub Kicinski wrote:
> > On Fri, 25 Apr 2025 15:52:30 -0700 Samiullah Khawaja wrote:
> > > I think the reason behind it not being killed is because the user
> > > might have already done some configuration using the PID and if the
> > > kthread was removed, the user would have to do that configuration
> > > again after enable/disable. But I am just speculating. I will let the
> > > maintainers weigh in as you suggested.
> >
> > I haven't looked at the code, but I think it may be something more
> > trivial, namely that napi_enable() returns void, so it can't fail.
> > Also it may be called under a spin lock.
>
> If you don't mind me asking: what do you think at a higher level
> on the discussion about threaded NAPI being disabled?
>
> It seems like the current behavior is:
> - If you write 1 to the threaded NAPI sysfs path, kthreads are
> kicked off and start running.
>
> - If you write 0, the threads are not killed but don't do any
> processing and their pids are still exported in netlink.
>
> I was arguing that disabling threading should mean the thread is
> killed and the pid is no longer exported (as a side effect), because
> it seemed weird to me that the netlink output would say:
>
> pid: 1234
> threaded: 0
>
> in the current implementation.
We should check the discussions we had when threaded NAPI was added.
I feel nothing was exposed in terms of observability, so leaving the
thread running didn't seem all that bad back then. Stopping the NAPI
polling safely is not entirely trivial; we'd need to somehow grab
the SCHED bit like busy polling does, and then re-schedule.
Or have the thread figure out that it's done and exit.
Probably easier to hide the attr in netlink.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 2:47 ` Jakub Kicinski
@ 2025-04-26 3:12 ` Jakub Kicinski
2025-04-26 3:53 ` Samiullah Khawaja
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-26 3:12 UTC (permalink / raw)
To: Joe Damato
Cc: Samiullah Khawaja, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, 25 Apr 2025 19:47:42 -0700 Jakub Kicinski wrote:
> > > I haven't looked at the code, but I think it may be something more
> > > trivial, namely that napi_enable() returns void, so it can't fail.
> > > Also it may be called under a spin lock.
> >
> > If you don't mind me asking: what do you think at a higher level
> > on the discussion about threaded NAPI being disabled?
> >
> > It seems like the current behavior is:
> > - If you write 1 to the threaded NAPI sysfs path, kthreads are
> > kicked off and start running.
> >
> > - If you write 0, the threads are not killed but don't do any
> > processing and their pids are still exported in netlink.
> >
> > I was arguing that disabling threading should mean the thread is
> > killed and the pid is no longer exported (as a side effect), because
> > it seemed weird to me that the netlink output would say:
> >
> > pid: 1234
> > threaded: 0
> >
> > in the current implementation.
>
> We should check the discussions we had when threaded NAPI was added.
> I feel nothing was exposed in terms of observability, so leaving the
> thread running didn't seem all that bad back then. Stopping the NAPI
> polling safely is not entirely trivial; we'd need to somehow grab
> the SCHED bit like busy polling does, and then re-schedule.
> Or have the thread figure out that it's done and exit.
Actually, we ended up adding the explicit ownership bits so it may not
be all that hard any more. Worth trying.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 3:12 ` Jakub Kicinski
@ 2025-04-26 3:53 ` Samiullah Khawaja
2025-04-28 18:23 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-26 3:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Joe Damato, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, Apr 25, 2025 at 8:12 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 25 Apr 2025 19:47:42 -0700 Jakub Kicinski wrote:
> > > > I haven't looked at the code, but I think it may be something more
> > > > trivial, namely that napi_enable() returns void, so it can't fail.
> > > > Also it may be called under a spin lock.
> > >
> > > If you don't mind me asking: what do you think at a higher level
> > > on the discussion about threaded NAPI being disabled?
> > >
> > > It seems like the current behavior is:
> > > - If you write 1 to the threaded NAPI sysfs path, kthreads are
> > > kicked off and start running.
> > >
> > > - If you write 0, the threads are not killed but don't do any
> > > processing and their pids are still exported in netlink.
> > >
> > > I was arguing that disabling threading should mean the thread is
> > > killed and the pid is no longer exported (as a side effect), because
> > > it seemed weird to me that the netlink output would say:
> > >
> > > pid: 1234
> > > threaded: 0
> > >
> > > in the current implementation.
> >
> > We should check the discussions we had when threaded NAPI was added.
> > I feel nothing was exposed in terms of observability, so leaving the
> > thread running didn't seem all that bad back then. Stopping the NAPI
> > polling safely is not entirely trivial; we'd need to somehow grab
> > the SCHED bit like busy polling does, and then re-schedule.
> > Or have the thread figure out that it's done and exit.
>
> Actually, we ended up adding the explicit ownership bits so it may not
> be all that hard any more. Worth trying.
Agreed. The NAPI kthread lets go of the ownership by unsetting the
SCHED_THREADED flag at napi_complete_done. This makes sure that the
next SCHED happens when a new IRQ arrives and no packets are
missed. We just have to make sure that it does that if it sees
kthread_should_stop. Do you think we should handle this as a
separate series/patch, orthogonal to this one?
Also, some clarification: we can remove the kthread when disabling the
napi threaded state at the device level or the napi level using
netlink. But do you think we should also stop the thread when disabling
a NAPI? That would mean the NAPI would lose any configurations done on
this kthread by the user, and those configurations won't be restored
when this NAPI is enabled again. Some drivers use enable/disable as a
mechanism to do a soft reset, so a simple soft reset to change queue
length etc. might revert these configurations.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 2:31 ` Joe Damato
@ 2025-04-26 14:41 ` Willem de Bruijn
2025-04-28 18:12 ` Joe Damato
0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2025-04-26 14:41 UTC (permalink / raw)
To: Joe Damato, Jakub Kicinski
Cc: Samiullah Khawaja, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
Joe Damato wrote:
> On Fri, Apr 25, 2025 at 05:42:51PM -0700, Jakub Kicinski wrote:
> > On Wed, 23 Apr 2025 20:14:13 +0000 Samiullah Khawaja wrote:
> > > A net device has a threaded sysctl that can be used to enable threaded
> > > napi polling on all of the NAPI contexts under that device. Allow
> > > enabling threaded napi polling at individual napi level using netlink.
> > >
> > > Extend the netlink operation `napi-set` and allow setting the threaded
> > > attribute of a NAPI. This will enable the threaded polling on a napi
> > > context.
> >
> > I think I haven't replied to you on the config recommendation about
> > how global vs per-object config should behave. I implemented the
> > suggested scheme for rx-buf-len to make sure it's not a crazy ask:
> > https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/
> > and I do like it more.
> >
> > Joe, Stanislav and Mina all read that series and are CCed here.
> > What do y'all think? Should we make the threaded config work like
> > rx-buf-len, where if the user sets it on a NAPI it takes precedence
> > over the global config? Or stick to the simplistic thing of last
> > write wins?
>
> For the per-NAPI defer-hard-irqs (for example):
> - writing to the NIC-wide sysfs path overwrites all of the
> individual NAPI settings to be the global setting written
> - writing to an individual NAPI, though, the setting takes
> precedence over the global
>
> So, if you wrote 100 to the global path, then 5 to a specific NAPI,
> then 200 again to the global path, IIRC the NAPI would go through:
> - being set to 100 (from the global path write)
> - being set to 5 (for its NAPI specific write)
> - being set to 200 (from the final global path write)
>
> The individual NAPI setting takes precedence over the global
> setting; but the individual setting is re-written when the global
> value is adjusted.
>
> Can't tell if that's clear or if I just made it worse ;)
That does not sound like precedence to me ;)
I interpret precedence as a value being sticky. The NAPI would stay
at 5 even after the global write of 200.
> Anyway: I have a preference for consistency
+1
I don't think either solution is vastly better than the other, as
long as it is the path of least surprise. Different behavior for
different options breaks that rule.
This also reminds me of /proc/sys/net/ipv4/conf/{all, default, .. }
API. Which confuses me to this day.
From the PoV of the path of least surprise, I have a minor preference
for Joe's example of having no sticky state, just last write wins.
> when possible, so IMHO,
> it would be nice if:
> - Writing to NIC-wide threaded sets all NAPIs to the value written
> to the NIC-wide setting
> - Individual NAPIs can have threaded enabled/disabled, which takes
> precedence over global
>
> But IDK if that's realistic/desirable or even what everyone else
> prefers :)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 14:41 ` Willem de Bruijn
@ 2025-04-28 18:12 ` Joe Damato
2025-04-28 18:38 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-28 18:12 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Jakub Kicinski, Samiullah Khawaja, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Sat, Apr 26, 2025 at 10:41:10AM -0400, Willem de Bruijn wrote:
> Joe Damato wrote:
> > On Fri, Apr 25, 2025 at 05:42:51PM -0700, Jakub Kicinski wrote:
> > > On Wed, 23 Apr 2025 20:14:13 +0000 Samiullah Khawaja wrote:
> > > > A net device has a threaded sysctl that can be used to enable threaded
> > > > napi polling on all of the NAPI contexts under that device. Allow
> > > > enabling threaded napi polling at individual napi level using netlink.
> > > >
> > > > Extend the netlink operation `napi-set` and allow setting the threaded
> > > > attribute of a NAPI. This will enable the threaded polling on a napi
> > > > context.
> > >
> > > I think I haven't replied to you on the config recommendation about
> > > how global vs per-object config should behave. I implemented the
> > > suggested scheme for rx-buf-len to make sure it's not a crazy ask:
> > > https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/
> > > and I do like it more.
> > >
> > > Joe, Stanislav and Mina all read that series and are CCed here.
> > > What do y'all think? Should we make the threaded config work like
> > > rx-buf-len, where if the user sets it on a NAPI it takes precedence
> > > over the global config? Or stick to the simplistic thing of last
> > > write wins?
> >
> > For the per-NAPI defer-hard-irqs (for example):
> > - writing to the NIC-wide sysfs path overwrites all of the
> > individual NAPI settings to be the global setting written
> > - writing to an individual NAPI, though, the setting takes
> > precedence over the global
> >
> > So, if you wrote 100 to the global path, then 5 to a specific NAPI,
> > then 200 again to the global path, IIRC the NAPI would go through:
> > - being set to 100 (from the global path write)
> > - being set to 5 (for its NAPI specific write)
> > - being set to 200 (from the final global path write)
> >
> > The individual NAPI setting takes precedence over the global
> > setting; but the individual setting is re-written when the global
> > value is adjusted.
> >
> > Can't tell if that's clear or if I just made it worse ;)
>
> That does not sound like precedence to me ;)
Sounds like you are focusing on the usage of a word both out of
context and without considering the behavior of the system ;)
> I interpret precedence as a value being sticky. The NAPI would stay
> at 5 even after the global write of 200.
The individual NAPI config value is always used before the global
value is consulted. One might say it precedes the global value when
used in the networking stack.
That individual NAPI value may be rewritten by writes to the
NIC-wide path, though, which does not affect the precedence with
which the values are consulted by the code.
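Roughly, in an illustrative sketch (not the kernel's actual data
structures; the names are made up):
    napi_values = {0: 5}  # per-NAPI overrides, keyed by napi id
    global_value = 200
    def effective_value(napi_id):
        # the per-NAPI value is consulted first; the global value is
        # only a fallback when no per-NAPI value exists
        return napi_values.get(napi_id, global_value)
    assert effective_value(0) == 5    # override wins
    assert effective_value(1) == 200  # falls back to global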
> > Anyway: I have a preference for consistency
>
> +1
>
> I don't think either solution is vastly better than the other, as
> long as it is the path of least surprise. Different behavior for
> different options breaks that rule.
I agree and my feedback on the previous revision was that all NAPI
config settings should work similarly. Whether that's what I already
implemented for defer-hard-irq/gro-flush-timeout or something else, I
don't really have a strong preference.
Implementing something other than what already exists for
defer-hard-irq/gro-flush-timeout, though, would probably mean you'll
need to update how both of those work, for consistency.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-26 3:53 ` Samiullah Khawaja
@ 2025-04-28 18:23 ` Jakub Kicinski
2025-04-28 19:25 ` Samiullah Khawaja
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-28 18:23 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Joe Damato, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Fri, 25 Apr 2025 20:53:14 -0700 Samiullah Khawaja wrote:
> > > We should check the discussions we had when threaded NAPI was added.
> > > I feel nothing was exposed in terms of observability, so leaving the
> > > thread running didn't seem all that bad back then. Stopping the NAPI
> > > polling safely is not entirely trivial; we'd need to somehow grab
> > > the SCHED bit like busy polling does, and then re-schedule.
> > > Or have the thread figure out that it's done and exit.
> >
> > Actually, we ended up adding the explicit ownership bits so it may not
> > > be all that hard any more. Worth trying.
> > Agreed. The NAPI kthread lets go of the ownership by unsetting the
> > SCHED_THREADED flag at napi_complete_done. This makes sure that the
> > next SCHED happens when a new IRQ arrives and no packets are
> > missed. We just have to make sure that it does that if it sees
> > kthread_should_stop. Do you think we should handle this as a
> > separate series/patch, orthogonal to this one?
We need to handle the case Joe pointed out. The new Netlink attributes
must make sense from day 1. I think it will be cleanest to work on
killing the thread first, but it can be a separate series.
> > Also, some clarification: we can remove the kthread when disabling the
> > napi threaded state at the device level or the napi level using
> > netlink. But do you think we should also stop the thread when disabling
> > a NAPI? That would mean the NAPI would lose any configurations done on
> > this kthread by the user, and those configurations won't be restored
> > when this NAPI is enabled again. Some drivers use enable/disable as a
> > mechanism to do a soft reset, so a simple soft reset to change queue
> > length etc. might revert these configurations.
That part I think needs to stay as is; the thread can be started and
stopped on napi_add / del, IMO.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-28 18:12 ` Joe Damato
@ 2025-04-28 18:38 ` Jakub Kicinski
2025-04-28 21:29 ` Joe Damato
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-28 18:38 UTC (permalink / raw)
To: Joe Damato
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Mon, 28 Apr 2025 11:12:34 -0700 Joe Damato wrote:
> On Sat, Apr 26, 2025 at 10:41:10AM -0400, Willem de Bruijn wrote:
> > > Anyway: I have a preference for consistency
> >
> > +1
> >
> > I don't think either solution is vastly better than the other, as
> > long as it is the path of least surprise. Different behavior for
> > different options breaks that rule.
>
> I agree and my feedback on the previous revision was that all NAPI
> config settings should work similarly. Whether that's what I already
> implemented for defer-hard-irq/gro-flush-timeout or something else, I
> don't really have a strong preference.
>
> Implementing something other than what already exists for
> defer-hard-irq/gro-flush-timeout, though, would probably mean you'll
> need to update how both of those work, for consistency.
Nobody will disagree with consistency being good. The question is how
broadly you define the scope :) If you say 'all settings within
napi-set' that's one level of consistency; if you say 'all netdev
netlink' then the picture is less clear.
> > This also reminds me of /proc/sys/net/ipv4/conf/{all, default, .. }
> > API. Which confuses me to this day.
Indeed. That scheme has the additional burden of not being consistently
enforced :/ So I'm trying to lay down some rules (in the doc linked
upthread).
The concern I have with the write all semantics is what happens when
we delegate the control over a queue / NAPI to some application or
container. Is the expectation that some user space component prevents
the global settings from being re-applied when applications using
dedicated queues / NAPIs are running?
A second, more minor concern is that we expose all settings on all
sub-objects, which I find slightly less clear for the admin. It's much
harder to tell at a glance which settings are overrides and which ones
are defaults.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-28 18:23 ` Jakub Kicinski
@ 2025-04-28 19:25 ` Samiullah Khawaja
0 siblings, 0 replies; 24+ messages in thread
From: Samiullah Khawaja @ 2025-04-28 19:25 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Joe Damato, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Mon, Apr 28, 2025 at 11:23 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 25 Apr 2025 20:53:14 -0700 Samiullah Khawaja wrote:
> > > > We should check the discussions we had when threaded NAPI was added.
> > > > I feel nothing was exposed in terms of observability, so leaving the
> > > > thread running didn't seem all that bad back then. Stopping the NAPI
> > > > polling safely is not entirely trivial; we'd need to somehow grab
> > > > the SCHED bit like busy polling does, and then re-schedule.
> > > > Or have the thread figure out that it's done and exit.
> > >
> > > Actually, we ended up adding the explicit ownership bits so it may not
> > > be all that hard any more. Worth trying.
> > Agreed. The NAPI kthread lets go of the ownership by unsetting the
> > SCHED_THREADED flag at napi_complete_done. This makes sure that the
> > next SCHED happens when a new IRQ arrives and no packets are
> > missed. We just have to make sure that it does that if it sees
> > kthread_should_stop. Do you think we should handle this as a
> > separate series/patch, orthogonal to this one?
>
> We need to handle the case Joe pointed out. The new Netlink attributes
> must make sense from day 1. I think it will be cleanest to work on
> killing the thread first, but it can be a separate series.
Yes, I will be sending a revision to handle the case that Joe pointed
out, where napi threaded is not disabled after setting it at the device
level. The switch between threaded NAPI and normal softirq-based
polling already happens; this is existing behaviour. The thread goes
to sleep after a polling cycle, but it is stopped via kthread_stop in
napi_del, which aligns with your comment below and also with Alexander
Duyck's suggestion on the original change. Please correct me if I read
your comment below incorrectly.
>
> > Also, some clarification: we can remove the kthread when disabling the
> > napi threaded state at the device level or the napi level using
> > netlink. But do you think we should also stop the thread when disabling
> > a NAPI? That would mean the NAPI would lose any configurations done on
> > this kthread by the user, and those configurations won't be restored
> > when this NAPI is enabled again. Some drivers use enable/disable as a
> > mechanism to do a soft reset, so a simple soft reset to change queue
> > length etc. might revert these configurations.
>
> That part I think needs to stay as is; the thread can be started and
> stopped on napi_add / del, IMO.
Agreed. This is the existing behaviour and this change doesn't
modify that. The threads are created when
napi_set_threaded/dev_set_threaded is called, if napis are already
added, or when napi_add is called. The threads are stopped/killed when
napi_del is called. This matches Alexander Duyck's suggestion when
threaded mode was added originally:
https://lore.kernel.org/netdev/CAKgT0UdjWGBrv9wOUyOxon5Sn7qSBHL5-KfByPS4uB1_TJ3WiQ@mail.gmail.com/
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-28 18:38 ` Jakub Kicinski
@ 2025-04-28 21:29 ` Joe Damato
2025-04-28 22:32 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-28 21:29 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Mon, Apr 28, 2025 at 11:38:45AM -0700, Jakub Kicinski wrote:
> On Mon, 28 Apr 2025 11:12:34 -0700 Joe Damato wrote:
> > On Sat, Apr 26, 2025 at 10:41:10AM -0400, Willem de Bruijn wrote:
> > > This also reminds me of /proc/sys/net/ipv4/conf/{all, default, .. }
> > > API. Which confuses me to this day.
>
> Indeed. That scheme has the additional burden of not being consistently
> enforced :/ So I'm trying to lay down some rules (in the doc linked
> upthread).
>
> The concern I have with the write all semantics is what happens when
> we delegate the control over a queue / NAPI to some application or
> container. Is the expectation that some user space component prevents
> the global settings from being re-applied when applications using
> dedicated queues / NAPIs are running?
I think this is a good question and one I spent a lot of time
thinking through while hacking on the per-NAPI config stuff.
One argument that came to my mind a few times was that to write to
the global path requires admin and one might assume:
- an admin knows what they are doing and why they are doing a
global write
- there could be a case where the admin does really want to reset
every NAPI's settings on the system in one swoop
I suppose you could have the above (an admin override, so to speak)
but still delegate queues/NAPIs to apps to configure as they like?
I think the admin override is kinda nice if an app starts doing
something weird, but maybe that's too much complexity.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-28 21:29 ` Joe Damato
@ 2025-04-28 22:32 ` Jakub Kicinski
2025-04-30 0:16 ` Joe Damato
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-04-28 22:32 UTC (permalink / raw)
To: Joe Damato
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Mon, 28 Apr 2025 14:29:03 -0700 Joe Damato wrote:
> On Mon, Apr 28, 2025 at 11:38:45AM -0700, Jakub Kicinski wrote:
> > On Mon, 28 Apr 2025 11:12:34 -0700 Joe Damato wrote:
> > > On Sat, Apr 26, 2025 at 10:41:10AM -0400, Willem de Bruijn wrote:
> > > > This also reminds me of /proc/sys/net/ipv4/conf/{all, default, .. }
> > > > API. Which confuses me to this day.
> >
> > Indeed. That scheme has the additional burden of not being consistently
> > enforced :/ So I'm trying to lay down some rules (in the doc linked
> > upthread).
> >
> > The concern I have with the write all semantics is what happens when
> > we delegate the control over a queue / NAPI to some application or
> > container. Is the expectation that some user space component prevents
> > the global settings from being re-applied when applications using
> > dedicated queues / NAPIs are running?
>
> I think this is a good question and one I spent a lot of time
> thinking through while hacking on the per-NAPI config stuff.
>
> One argument that came to my mind a few times was that to write to
> the global path requires admin and one might assume:
> - an admin knows what they are doing and why they are doing a
> global write
> - there could be a case where the admin does really want to reset
> every NAPI's settings on the system in one swoop
>
> I suppose you could have the above (an admin override, so to speak)
> but still delegate queues/NAPIs to apps to configure as they like?
>
> I think the admin override is kinda nice if an app starts doing
> something weird, but maybe that's too much complexity.
The way I see it - the traditional permission model doesn't extend
in any meaningful way to network settings. All the settings are done
by some sort of helper or intermediary which implements its own
privilege model.
My thinking is more that the "global" settings are basically "system"
settings. There is a high-perf app running which applied its own
settings to its own queues. The remaining queues belong to the "system".
Now if you for some reason want to modify the settings of the "system
queues" you really don't want to override the app specific settings.
The weakness in my argument is that you probably really shouldn't mess
with system level settings on a live system, according to devops.
But life happens, and visibility is not perfect, so sometimes you have
to poke to test a theory.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-28 22:32 ` Jakub Kicinski
@ 2025-04-30 0:16 ` Joe Damato
2025-05-03 2:10 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-04-30 0:16 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Mon, Apr 28, 2025 at 03:32:07PM -0700, Jakub Kicinski wrote:
> On Mon, 28 Apr 2025 14:29:03 -0700 Joe Damato wrote:
> > On Mon, Apr 28, 2025 at 11:38:45AM -0700, Jakub Kicinski wrote:
> > > On Mon, 28 Apr 2025 11:12:34 -0700 Joe Damato wrote:
> > > > On Sat, Apr 26, 2025 at 10:41:10AM -0400, Willem de Bruijn wrote:
> > > > > This also reminds me of /proc/sys/net/ipv4/conf/{all, default, .. }
> > > > > API. Which confuses me to this day.
> > >
> > > Indeed. That scheme has the additional burden of not being consistently
> > > enforced :/ So I'm trying to lay down some rules (in the doc linked
> > > upthread).
> > >
> > > The concern I have with the write all semantics is what happens when
> > > we delegate the control over a queue / NAPI to some application or
> > > container. Is the expectation that some user space component prevents
> > > the global settings from being re-applied when applications using
> > > dedicated queues / NAPIs are running?
> >
> > I think this is a good question and one I spent a lot of time
> > thinking through while hacking on the per-NAPI config stuff.
> >
> > One argument that came to my mind a few times was that to write to
> > the global path requires admin and one might assume:
> > - an admin knows what they are doing and why they are doing a
> > global write
> > - there could be a case where the admin does really want to reset
> > every NAPI's settings on the system in one swoop
> >
> > I suppose you could have the above (an admin override, so to speak)
> > but still delegate queues/NAPIs to apps to configure as they like?
> >
> > I think the admin override is kinda nice if an app starts doing
> > something weird, but maybe that's too much complexity.
>
> The way I see it - the traditional permission model doesn't extend
> in any meaningful way to network settings. All the settings are done
> by some sort of helper or intermediary which implements its own
> privilege model.
I agree that is how it is today, but maybe we are misunderstanding
each other? In my mind, today, the intermediary is something like a
script that runs a bunch of ethtool commands.
In my mind with the movement of more stuff to core and the existence
of YNL, it seems like the future is an app uses YNL and is able to
configure (for example) an RSS context and ntuple rules to steer
flows to queues it cares about... which in my mind is moving toward
eliminating the intermediary?
> My thinking is more that the "global" settings are basically "system"
> settings. There is a high-perf app running which applied its own
> settings to its own queues. The remaining queues belong to the "system".
> Now if you for some reason want to modify the settings of the "system
> queues" you really don't want to override the app specific settings.
Yea, I see what you are saying, I think.
Can I rephrase it to make sure I'm following?
An app uses YNL to set defer-hard-irqs to 100 for napis 2-4. napis 0
and 1 belong to the "system."
You are saying that writing to the NIC-wide sysfs path for
defer-hard-irqs wouldn't affect napis 0 and 1 because they don't
belong to the system anymore.
Is that right?
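(Concretely, something like this, reusing the selftest-style helpers
from earlier in the thread; the ids and values are made up:)
    for napi_id in (2, 3, 4):  # the app's NAPIs
        nf.napi_set({'id': napi_id, 'defer-hard-irqs': 100})
    # NIC-wide write; under the proposed model this would not touch
    # the app-owned NAPIs
    system(f"echo 50 > /sys/class/net/{ifname}/napi_defer_hard_irqs")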
If so... I think that's fairly interesting and maybe it implies a
tighter coupling of apps to queues than is possible with the API
that exists today? For example, say an app does a bunch of config to
a few NAPIs and then suddenly crashes. I suppose core would need to
"know" about this so it can "release" those queues?
If I'm following you, I think overall your idea is sensible but
requires more work to get there.
Do you think that the current model of defer-hard-irqs prevents the
model you describe from becoming a thing in the future? I don't
think it does, personally, but then again ... what do I know?
> The weakness in my argument is that you probably really shouldn't mess
> with system level settings on a live system, according to devops.
> But life happens, and visibility is not perfect, so sometimes you have
> to poke to test a theory.
Agreed on all points in the above.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-04-30 0:16 ` Joe Damato
@ 2025-05-03 2:10 ` Jakub Kicinski
2025-05-03 3:04 ` Joe Damato
0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-05-03 2:10 UTC (permalink / raw)
To: Joe Damato
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Tue, 29 Apr 2025 17:16:03 -0700 Joe Damato wrote:
> > The way I see it - the traditional permission model doesn't extend
> > in any meaningful way to network settings. All the settings are done
> > by some sort of helper or intermediary which implements its own
> > privilege model.
>
> I agree that is how it is today, but maybe we are misunderstanding
> each other? In my mind, today, the intermediary is something like a
> script that runs a bunch of ethtool commands.
>
> In my mind with the movement of more stuff to core and the existence
> of YNL, it seems like the future is an app uses YNL and is able to
> configure (for example) an RSS context and ntuple rules to steer
> flows to queues it cares about... which in my mind is moving toward
> eliminating the intermediary?
Yes, AFAIU.
> > My thinking is more that the "global" settings are basically "system"
> > settings. There is a high-perf app running which applied its own
> > settings to its own queues. The remaining queues belong to the "system".
> > Now if you for some reason want to modify the settings of the "system
> > queues" you really don't want to override the app specific settings.
>
> Yea, I see what you are saying, I think.
>
> Can I rephrase it to make sure I'm following?
>
> An app uses YNL to set defer-hard-irqs to 100 for napis 2-4. napis 0
> and 1 belong to the "system."
>
> You are saying that writing to the NIC-wide sysfs path for
> defer-hard-irqs wouldn't affect napis 0 and 1 because they don't
> belong to the system anymore.
>
> Is that right?
Typo - napis 2-4, right? If so - yes, exactly.
> If so... I think that's fairly interesting and maybe it implies a
> tighter coupling of apps to queues than is possible with the API
> that exists today? For example, say an app does a bunch of config to
> a few NAPIs and then suddenly crashes. I suppose core would need to
> "know" about this so it can "release" those queues?
Exactly; Stan also pointed out the config lifetime problem.
My plan was to add the ability to tie the config to a netlink socket.
App dies, socket goes away, clears the config. Same thing as we do for
cleanup of DEVMEM bindings. But I don't have full details.
The thing I did add (in the rx-buf-len series) was a hook to the queue
count changing code which wipes the configuration for queues which are
explicitly disabled.
So if you do some random reconfig (e.g. attach XDP) and the driver recreates
all NAPIs - the config should stay around. Same if you do ifdown ifup.
But if you set NAPI count from 8 to 4 - the NAPIs 4..7 should get wiped.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-05-03 2:10 ` Jakub Kicinski
@ 2025-05-03 3:04 ` Joe Damato
2025-05-05 18:56 ` Jakub Kicinski
0 siblings, 1 reply; 24+ messages in thread
From: Joe Damato @ 2025-05-03 3:04 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Fri, May 02, 2025 at 07:10:11PM -0700, Jakub Kicinski wrote:
> On Tue, 29 Apr 2025 17:16:03 -0700 Joe Damato wrote:
> > > The way I see it - the traditional permission model doesn't extend
> > > in any meaningful way to network settings. All the settings are done
> > > by some sort of helper or intermediary which implements its own
> > > privilege model.
> >
> > I agree that is how it is today, but maybe we are misunderstanding
> > each other? In my mind, today, the intermediary is something like a
> > script that runs a bunch of ethtool commands.
> >
> > In my mind with the movement of more stuff to core and the existence
> > of YNL, it seems like the future is an app uses YNL and is able to
> > configure (for example) an RSS context and ntuple rules to steer
> > flows to queues it cares about... which in my mind is moving toward
> > eliminating the intermediary?
>
> Yes, AFAIU.
>
> > > My thinking is more that the "global" settings are basically "system"
> > > settings. There is a high-perf app running which applied its own
> > > settings to its own queues. The remaining queues belong to the "system".
> > > Now if you for some reason want to modify the settings of the "system
> > > queues" you really don't want to override the app specific settings.
> >
> > Yea, I see what you are saying, I think.
> >
> > Can I rephrase it to make sure I'm following?
> >
> > An app uses YNL to set defer-hard-irqs to 100 for napis 2-4. napis 0
> > and 1 belong to the "system."
> >
> > You are saying that writing to the NIC-wide sysfs path for
> > defer-hard-irqs wouldn't affect napis 0 and 1 because they don't
> > belong to the system anymore.
> >
> > Is that right?
>
> Typo - napis 2-4, right? If so - yes, exactly.
Yes, a typo -- sorry for the added confusion.
> > If so... I think that's fairly interesting and maybe it implies a
> > tighter coupling of apps to queues than is possible with the API
> > that exists today? For example, say an app does a bunch of config to
> > a few NAPIs and then suddenly crashes. I suppose core would need to
> > "know" about this so it can "release" those queues?
>
> Exactly; Stan also pointed out the config lifetime problem.
> My plan was to add the ability to tie the config to a netlink socket.
> App dies, socket goes away, clears the config. Same thing as we do for
> cleanup of DEVMEM bindings. But I don't have full details.
Yea that makes sense. Right now, we are using YNL to configure NAPI
settings for specific queues but clearing them on graceful app
termination. Of course if the app dies unexpectedly, then the NAPIs
are in configured state with no app.
Having some sort of automatic cleanup thing (as you've described)
would be pretty handy.
> The thing I did add (in the rx-buf-len series) was a hook to the queue
> count changing code which wipes the configuration for queues which are
> explicitly disabled.
> So if you do some random reconfig (e.g. attach XDP) and the driver recreates
> all NAPIs - the config should stay around. Same if you do ifdown ifup.
> But if you set NAPI count from 8 to 4 - the NAPIs 4..7 should get wiped.
Yea I see. I will go back and re-read that series because I think I
missed that part.
IIRC, if you:
1. set defer-hard-irqs on NAPIs 2 and 3
2. resize down to 2 queues
3. then resize back up to 4, the settings for NAPIs 2 and 3 should
be restored, roughly as sketched below.
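Sketch (napi ids illustrative; this assumes the ids are stable across
the resize and reuses the nf/system helpers from the selftest):
    nf.napi_set({'id': 2, 'defer-hard-irqs': 100})
    nf.napi_set({'id': 3, 'defer-hard-irqs': 100})
    system(f"ethtool -L {ifname} combined 2")  # NAPIs 2, 3 disabled
    system(f"ethtool -L {ifname} combined 4")  # NAPIs 2, 3 re-enabled
    ksft_eq(nf.napi_get({'id': 2})['defer-hard-irqs'], 100)  # restored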
I now wonder if I should change that to be more like what you
describe for rx-buf-len so we converge?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v5] Add support to set napi threaded for individual napi
2025-05-03 3:04 ` Joe Damato
@ 2025-05-05 18:56 ` Jakub Kicinski
0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2025-05-05 18:56 UTC (permalink / raw)
To: Joe Damato
Cc: Willem de Bruijn, Samiullah Khawaja, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev
On Fri, 2 May 2025 20:04:31 -0700 Joe Damato wrote:
> > The thing I did add (in the rx-buf-len series) was a hook to the queue
> > count changing code which wipes the configuration for queues which are
> > explicitly disabled.
> > So if you do some random reconfig (e.g. attach XDP) and the driver recreates
> > all NAPIs - the config should stay around. Same if you do ifdown ifup.
> > But if you set NAPI count from 8 to 4 - the NAPIs 4..7 should get wiped.
>
> Yea I see. I will go back and re-read that series because I think I
> missed that part.
>
> IIRC, if you:
> 1. set defer-hard-irqs on NAPIs 2 and 3
> 2. resize down to 2 queues
> 3. then resize back up to 4, the settings for NAPIs 2 and 3 should
> be restored.
>
> I now wonder if I should change that to be more like what you
> describe for rx-buf-len so we converge?
IMHO yes, but others may disagree.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-05-05 18:56 UTC | newest]
Thread overview: 24+ messages
-- links below jump to the message on this page --
2025-04-23 20:14 [PATCH net-next v5] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-04-24 23:13 ` Joe Damato
2025-04-25 18:28 ` Samiullah Khawaja
2025-04-25 22:24 ` Joe Damato
2025-04-25 22:52 ` Samiullah Khawaja
2025-04-26 0:37 ` Jakub Kicinski
2025-04-26 2:34 ` Joe Damato
2025-04-26 2:47 ` Jakub Kicinski
2025-04-26 3:12 ` Jakub Kicinski
2025-04-26 3:53 ` Samiullah Khawaja
2025-04-28 18:23 ` Jakub Kicinski
2025-04-28 19:25 ` Samiullah Khawaja
2025-04-25 23:06 ` Samiullah Khawaja
2025-04-26 0:42 ` Jakub Kicinski
2025-04-26 2:31 ` Joe Damato
2025-04-26 14:41 ` Willem de Bruijn
2025-04-28 18:12 ` Joe Damato
2025-04-28 18:38 ` Jakub Kicinski
2025-04-28 21:29 ` Joe Damato
2025-04-28 22:32 ` Jakub Kicinski
2025-04-30 0:16 ` Joe Damato
2025-05-03 2:10 ` Jakub Kicinski
2025-05-03 3:04 ` Joe Damato
2025-05-05 18:56 ` Jakub Kicinski